Running LLMs Without the Cloud: Why Local Inference Matters for AI in Africa
Most AI applications are built around an assumption that's invisible until it breaks: reliable, low-latency internet connectivity. In African markets, that assumption breaks constantly. Here is what that means for how AI systems need to be built, and where local inference fits in.
The Connectivity Problem
A standard AI application makes a round trip on every inference call: the user's device sends a request to a cloud API, the API processes it on a GPU cluster, the response comes back. The whole thing depends on a connection that can sustain that round trip reliably, at acceptable latency, for every user, every time.
In African markets, that assumption fails in predictable ways. Rural agent banking networks — the last-mile infrastructure through which millions of people access financial services — operate on devices with intermittent 2G/3G connectivity. Network reliability varies significantly by region and time of day. A KYC verification assistant that requires a cloud API call isn't usable in the field. A regulatory Q&A tool that drops responses when the network stutters creates compliance risk, not compliance value.
This is where local LLM inference becomes practically important, not just technically interesting.
Cloud vs. Local Inference: Key Tradeoffs
Cloud inference:
- ✗ Requires stable connectivity
- ✗ Per-token cost compounds at scale
- ✗ Data leaves the device
- ✓ State-of-the-art model quality
- ✓ No local hardware requirement
Local inference:
- ✓ Works offline or on poor connections
- ✓ Zero per-query cost after setup
- ✓ Data stays on device
- ✗ Smaller models, bounded capability
- ✗ Slower generation on CPU
What llama.cpp Is
llama.cpp is an open-source C/C++ inference engine for running large language models locally, created by Georgi Gerganov. It loads models in the GGUF file format, which stores quantized weights compact enough to run on consumer hardware without a GPU.
The key mechanism is quantization: instead of storing weights as 16-bit or 32-bit floats, llama.cpp stores them at roughly 4 to 8 bits per weight. A 7-billion-parameter model that needs about 14GB of memory at 16-bit precision (about 28GB at 32-bit) compresses to around 4GB in Q4_K_M format and runs on CPU. For inference workloads that aren't latency-critical, this is entirely viable. On a modern laptop, a 7B model in Q4_K_M generates roughly 15–25 tokens per second, which is fast enough for document Q&A and structured output tasks.
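The arithmetic behind these sizes can be sketched in a few lines. This is a rough estimate of weight storage only; it ignores the KV cache and runtime overhead. The 4.8 bits-per-weight figure for Q4_K_M is an assumed average (the format mixes block types), and `weight_memory_gb` is a hypothetical helper, not part of llama.cpp.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # nominal "7B" model; real checkpoints differ slightly

FORMATS = [
    ("FP32", 32.0),
    ("FP16", 16.0),
    ("Q8_0", 8.5),    # 8-bit values plus one 16-bit scale per 32-weight block
    ("Q4_K_M", 4.8),  # ASSUMPTION: approximate average bits per weight
]

for label, bits in FORMATS:
    print(f"{label:>7}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

The gap between the FP16 and Q4_K_M rows is what moves a 7B model from GPU-class memory to an ordinary laptop.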
Where It Fits in a Compliance AI Stack
In the systems I've built for African financial institutions, local inference has been useful in three specific contexts:
Agent banking staff need answers to customer questions about documentation requirements and transaction limits. A locally-running RAG pipeline — compliance documents embedded on-device, retrieval and generation running locally — works without internet. A well-prompted 7B model with access to the right documents is sufficient for this bounded domain.
When an agent is onboarding a customer in a low-connectivity area, a local model can help with document classification, field extraction validation, and basic risk flagging without a cloud call. The computationally heavy work — deepfake detection, full face matching — queues for server-side processing when connectivity returns. The agent gets an immediate preliminary result.
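The "queue now, process when connectivity returns" pattern can be sketched with nothing more than SQLite from the standard library. This is a minimal illustration, not the implementation used in any production system: `DeferredJobQueue`, the job kinds, and the `upload` callback are all hypothetical names, and a real deployment would also need retries with backoff, encryption at rest, and error handling around the upload call.

```python
import json
import sqlite3
import time
from typing import Callable


class DeferredJobQueue:
    """Persist heavy jobs on-device until an upload to the server succeeds."""

    def __init__(self, db_path: str = "deferred_jobs.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "kind TEXT NOT NULL, payload TEXT NOT NULL, created_at REAL NOT NULL)"
        )
        self.conn.commit()

    def enqueue(self, kind: str, payload: dict) -> int:
        """Store a job locally and return its row id."""
        cur = self.conn.execute(
            "INSERT INTO jobs (kind, payload, created_at) VALUES (?, ?, ?)",
            (kind, json.dumps(payload), time.time()),
        )
        self.conn.commit()
        return cur.lastrowid

    def pending(self) -> int:
        """Number of jobs still waiting for upload."""
        return self.conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]

    def flush(self, upload: Callable[[str, dict], bool]) -> int:
        """Upload jobs oldest-first; stop at the first failure and keep the rest."""
        rows = self.conn.execute(
            "SELECT id, kind, payload FROM jobs ORDER BY id"
        ).fetchall()
        sent = 0
        for job_id, kind, payload in rows:
            if not upload(kind, json.loads(payload)):
                break  # likely still offline; try again on the next flush
            self.conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
            self.conn.commit()
            sent += 1
        return sent
```

In use, the agent app would call `enqueue("face_match", {...})` right after the local model returns its preliminary result, and call `flush(upload)` whenever the device detects a working connection. Jobs are deleted only after `upload` reports success, so a dropped connection mid-flush loses nothing.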
For iterating on RAG pipelines and prompt engineering during development, running locally avoids per-token API costs. A week of prompt testing on a local 7B model costs nothing. The same work against a cloud API adds up quickly when query volume is high and prompts are still being refined.
Practical Setup
Install via the Python bindings — more stable than the CLI across llama.cpp's frequent releases:
```bash
pip install llama-cpp-python
```

Download a model in GGUF format from Hugging Face. For compliance Q&A use cases, Mistral 7B Instruct Q4_K_M is a reliable starting point:

```python
from llama_cpp import Llama

# Load the quantized model from disk; the GGUF file must already be downloaded.
llm = Llama(
    model_path="./models/mistral-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,   # context window, in tokens
    n_threads=8,  # CPU threads; tune to the machine's core count
)

# Mistral Instruct models expect the [INST] ... [/INST] prompt wrapper.
response = llm(
    "[INST] Summarize the FATF risk-based approach to customer due diligence in three sentences. [/INST]",
    max_tokens=256,
    temperature=0.1,  # low temperature for consistent answers
)
print(response["choices"][0]["text"])
```

For RAG workloads, combine with a local vector store (ChromaDB or FAISS) and a lightweight embedding model that also runs locally, with no API key required at any point in the pipeline.
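The shape of such a pipeline (retrieve, build a grounded prompt, generate) can be sketched without any external dependencies. Note the loud assumption: the word-overlap cosine scoring below is a toy stand-in for a real embedding model and vector store, chosen only so the sketch runs anywhere. The function names are hypothetical, and `generate` is any callable that maps a prompt string to a completion string.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (toy scoring)."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        chunks, key=lambda c: cosine(q_vec, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Wrap retrieved context and the question in the Mistral [INST] format."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "[INST] Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question} [/INST]"
    )


def answer(
    question: str, chunks: list[str], generate: Callable[[str], str], k: int = 2
) -> str:
    """Retrieve context, build the prompt, and pass it to the generator."""
    return generate(build_prompt(question, retrieve(question, chunks, k)))
```

To connect this to the model loaded earlier, `generate` could be `lambda p: llm(p, max_tokens=256, temperature=0.1)["choices"][0]["text"]`. Swapping `retrieve` for a ChromaDB or FAISS query leaves the rest of the flow unchanged, which is the main reason to keep retrieval behind its own function.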
Model Selection
| Model | Quantization | Size | Good for |
|---|---|---|---|
| Mistral 7B Instruct | Q4_K_M | 4.1 GB | General Q&A, structured output |
| Phi-3 Mini | Q4_K_M | 2.2 GB | Tight RAM budgets, classification |
| Llama 3.2 3B | Q4_K_M | 1.9 GB | Lowest hardware requirement |
| Mistral 7B Instruct | Q8_0 | 7.7 GB | Better accuracy where RAM allows |
For compliance domain work, Mistral 7B Instruct Q4_K_M is the default choice. It handles structured output reliably, follows system prompts consistently, and runs on 8GB RAM with room to spare.
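For mixed device fleets, the table can be turned into a simple selection rule. The sketch below is illustrative only: the sizes are copied from the table, while the preference order (largest model that fits) and the 2 GB headroom for the operating system, context cache, and application are assumptions to adjust per device. `pick_model` is a hypothetical helper.

```python
from typing import Optional

# (model, quantization, file size in GB), taken from the table above and
# ordered from most to least preferred when memory allows (ASSUMED ordering).
CANDIDATES = [
    ("Mistral 7B Instruct", "Q8_0", 7.7),
    ("Mistral 7B Instruct", "Q4_K_M", 4.1),
    ("Phi-3 Mini", "Q4_K_M", 2.2),
    ("Llama 3.2 3B", "Q4_K_M", 1.9),
]


def pick_model(
    free_ram_gb: float, headroom_gb: float = 2.0
) -> Optional[tuple[str, str]]:
    """Return the first candidate whose file size plus headroom fits in RAM.

    headroom_gb is an ASSUMED allowance for the OS, KV cache, and app code.
    Returns None when nothing fits.
    """
    for name, quant, size_gb in CANDIDATES:
        if size_gb + headroom_gb <= free_ram_gb:
            return name, quant
    return None


for ram in (3.0, 4.0, 8.0, 16.0):
    print(f"{ram:>4} GB free -> {pick_model(ram)}")
```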
The Limits
Local inference is not a replacement for cloud-scale models on complex tasks. A locally-running 7B model does not match frontier models on multi-document synthesis, complex reasoning, or anything requiring broad world knowledge. The use cases above work because they're bounded: the model has access to a specific document corpus and operates on a well-defined task.
The connectivity problem in African deployments is real, but it's not the only consideration. For any task where quality matters more than offline capability, cloud APIs remain the right call. Local inference is the right tool specifically when connectivity is a hard constraint and the task can be adequately handled by a smaller model with the right context.