Building a PEP Database for 54 African Countries: Lessons from AfricaPEP
Africa had no comprehensive, open-source PEP database. Financial institutions screening customers against global watchlists had excellent coverage for European and North American politicians but almost nothing for African heads of state, ministers, or senior military officers. So we built one.
Why Africa Had a PEP Data Gap
Politically Exposed Persons screening is a cornerstone of AML compliance. Every bank, fintech, and money transfer operator is required to identify whether their customers hold (or are connected to) senior public positions. The Financial Action Task Force defines PEPs broadly: heads of state, senior politicians, judicial officials, military leaders, and executives of state-owned enterprises, along with their family members and close associates.
The commercial PEP databases that most institutions rely on (Dow Jones, Refinitiv, LexisNexis) are built primarily from structured Western government sources: parliamentary records, electoral databases, government gazettes. These sources are well maintained in Europe, North America, and parts of Asia. But African government data infrastructure varies enormously. Some countries publish comprehensive ministerial lists online. Others have no digital presence at all for their cabinet members.
The consequence was a dangerous blind spot. A compliance officer screening a customer at a Lagos bank might get a hit on a German parliamentarian but miss the fact that the customer is the nephew of a sitting West African finance minister. This is not a theoretical concern. Several high-profile money laundering cases involving African PEPs were facilitated precisely because screening databases lacked adequate African coverage.
[Figure: PEP Data Availability: The Global Imbalance. Estimated senior-official coverage in commercial PEP databases, before and after AfricaPEP.]
The Data Collection Pipeline
We started where most structured data projects start: by mapping what already existed. Wikidata turned out to be surprisingly rich for African political figures, though unevenly so. Nigeria, South Africa, Kenya, and Egypt had decent coverage. Countries like Equatorial Guinea, Comoros, and Eswatini had almost nothing.
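As an illustration of what mapping existing Wikidata coverage can look like, here is a minimal sketch that builds a SPARQL query for people who are citizens of a given country and have held at least one recorded position. The function name `build_pep_query` is hypothetical, and the article does not say how Wikidata was actually queried; the sketch only relies on the standard Wikidata properties P27 (country of citizenship) and P39 (position held).

```python
import re


def build_pep_query(country_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query string for the Wikidata Query Service.

    country_qid is a Wikidata item ID such as "Q1033" (Nigeria).
    This is an illustrative helper, not the project's actual collector.
    """
    if not re.fullmatch(r"Q[1-9]\d*", country_qid):
        raise ValueError(f"not a Wikidata item ID: {country_qid!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?person ?personLabel ?positionLabel WHERE {\n"
        # P27 = country of citizenship, P39 = position held
        f"  ?person wdt:P27 wd:{country_qid} ;\n"
        "          wdt:P39 ?position .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(build_pep_query("Q1033", limit=10))
```

Comparing the number of rows returned per country is one rough way to see the uneven coverage described above.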
The pipeline combined three data sources in order of reliability:
[Figure: Multi-Source Data Pipeline]
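One common way to combine sources "in order of reliability" is a field-level merge in which the more reliable source wins any conflict. The sketch below assumes that approach. The source labels `primary`, `secondary`, and `tertiary` are placeholders, since the three sources are not listed in this text, and `SourceRecord` and `merge_records` are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder source labels, most reliable first (an assumption for illustration).
SOURCE_PRIORITY = ["primary", "secondary", "tertiary"]


@dataclass
class SourceRecord:
    source: str      # one of SOURCE_PRIORITY
    person_id: str   # identifier shared across sources after deduplication
    fields: dict     # e.g. {"name": ..., "dob": ...}


def merge_records(records):
    """Merge records per person; the most reliable source wins each field."""
    rank = {name: i for i, name in enumerate(SOURCE_PRIORITY)}
    merged = {}
    # Apply the least reliable records first so better sources overwrite them.
    for rec in sorted(records, key=lambda r: rank[r.source], reverse=True):
        target = merged.setdefault(rec.person_id, {})
        for key, value in rec.fields.items():
            if value not in (None, ""):  # never overwrite with an empty value
                target[key] = value
    return merged


if __name__ == "__main__":
    demo = [
        SourceRecord("tertiary", "p1", {"name": "A. Example", "dob": "1970-01-01"}),
        SourceRecord("primary", "p1", {"name": "Ada Example", "dob": None}),
    ]
    print(merge_records(demo))
```

Empty values never overwrite populated ones, so a less reliable source can still fill gaps that the better source leaves open.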
FATF Tier Classification
Not all PEPs carry the same risk. FATF guidance distinguishes between domestic PEPs, foreign PEPs, and international organization PEPs. But in practice, the risk also varies by the specific role. A head of state has more opportunity and motive for corruption than a backbench parliamentarian in a country with strong institutional oversight.
We built a four-tier classification system that goes beyond the standard FATF categories. The tier determines how strictly a financial institution should apply enhanced due diligence:
[Figure: AfricaPEP Risk Tier Classification]
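The tier definitions themselves are not reproduced in this text, so the following is only a sketch of how a role-to-tier lookup could work. Every entry in `ROLE_TIERS` is a hypothetical placeholder, except that the API example later in the article does place a former president in tier 1. It assumes tier 1 is the highest-risk tier and that a person with several roles takes the tier of the riskiest one.

```python
from typing import Iterable, Optional

# Hypothetical mapping for illustration only; not the AfricaPEP tier table.
ROLE_TIERS = {
    "head_of_state": 1,
    "minister": 2,
    "member_of_parliament": 3,
    "local_official": 4,
}


def tier_for_roles(roles: Iterable[str]) -> Optional[int]:
    """Return the highest-risk (lowest-numbered) tier among known roles.

    Returns None when no role is recognised, so that unknown roles are
    sent to manual review instead of silently receiving a default tier.
    """
    known = [ROLE_TIERS[r] for r in roles if r in ROLE_TIERS]
    return min(known) if known else None


if __name__ == "__main__":
    print(tier_for_roles(["minister", "head_of_state"]))
    print(tier_for_roles(["unknown_role"]))
```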
The Name Matching Challenge
This is where the project got genuinely difficult. Name matching across 54 African countries means dealing with Arabic, French, English, Portuguese, Swahili, Amharic, Hausa, Yoruba, Zulu, and dozens of other languages and scripts. A single person might be recorded as “Mohamed” in one source, “Muhammad” in another, and “Muhammadu” in a third. All three are valid transliterations.
African naming conventions compound the difficulty. In many West African cultures, a person has a given name, a family name, and often a patronymic or village name that may or may not appear in official records. North African names frequently include “Al-”, “El-”, or “Ben” prefixes that are handled inconsistently across databases. East African names may include clan identifiers. Southern African names sometimes include anglicized versions alongside traditional spellings.
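Prefix and diacritic inconsistencies of this kind are often reduced with a normalization pass before any fuzzy matching runs. The sketch below is one such pass. The function `normalize_name`, the prefix table, and the decision to treat "El" as equivalent to "Al" are illustrative assumptions, not the project's actual rules.

```python
import re
import unicodedata

# Illustrative prefix table: maps each spelling to one canonical form.
PREFIXES = {"al": "al", "el": "al", "ben": "ben"}


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, and attach known prefixes to the next token."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Hyphens and whitespace both act as token separators.
    tokens = [t for t in re.split(r"[\s\-]+", stripped.lower()) if t]
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in PREFIXES and i + 1 < len(tokens):
            # "El-Bashir", "al Bashir" and "Al-Bashir" all become "albashir".
            out.append(PREFIXES[tok] + tokens[i + 1])
            i += 2
        else:
            out.append(tok)
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    for raw in ["Omar El-Bashir", "omar al Bashir", "Zine Ben-Ali", "José"]:
        print(raw, "->", normalize_name(raw))
```

A rule this blunt will over-merge in some cases (for example, "Al" used as a given name), which is one reason normalization is usually paired with scoring and not used as a match decision by itself.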
No single matching algorithm handles all of these cases well. That is why we built a hybrid scoring system combining four approaches:
[Figure: Matching Algorithms Compared]
Levenshtein distance: counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.
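The edit-count measure described above is usually called Levenshtein distance. A minimal dynamic-programming sketch follows; the `similarity` helper, which rescales the distance to a 0 to 1 score, is an added convenience and not something the article specifies.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Rescale edit distance to a score in [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("Mohamed", "Muhammad"))  # 3
```

"Mohamed" and "Muhammad" are three edits apart (o to u, one inserted m, e to a), which shows why raw edit distance alone treats valid transliterations as fairly distant strings.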
The hybrid approach proved critical. In our validation dataset of 5,000 manually confirmed matches, the hybrid scorer achieved a precision of 94.2% at a recall of 91.7%. The best single algorithm (Jaro-Winkler) managed only 87.3% precision at the same recall threshold. That gap of roughly 7 points in precision translates to thousands fewer false matches in a database of 27,000 profiles.
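A hybrid scorer can be as simple as a weighted average of several similarity functions. The sketch below shows that structure under stated assumptions: the two component scorers (a character-level ratio from Python's `difflib` and a token-overlap score) and the weights are stand-ins chosen for brevity, not the four algorithms or the weighting used in AfricaPEP.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using difflib's matching ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens, ignoring word order."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def hybrid_score(a: str, b: str, scorers) -> float:
    """Weighted average of (function, weight) pairs, each returning [0, 1]."""
    total = sum(weight for _, weight in scorers)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(fn(a, b) * weight for fn, weight in scorers) / total


# Hypothetical weights for illustration only.
SCORERS = [(char_similarity, 0.6), (token_overlap, 0.4)]

if __name__ == "__main__":
    print(hybrid_score("Omar Hassan al-Bashir", "Omar al-Bashir", SCORERS))
```

Because each component fails in different situations (token overlap ignores spelling variation, character ratios are thrown off by extra middle names), the average is less sensitive to any single blind spot. In practice the weights would be tuned against a labeled validation set.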
Graph-Powered Relationship Mapping
PEP screening is not just about identifying the PEP themselves. FATF guidelines require institutions to also identify family members and close associates of PEPs. A customer who is the spouse, child, or business partner of a minister carries elevated risk even if they hold no public position themselves.
Relational databases are terrible at this. Finding “all people within two relationship hops of a given PEP” requires recursive joins that are slow and awkward to express in SQL. We used Neo4j, a graph database, where this query is both natural and fast.
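[Figure: Neo4j Graph Data Model]
// MATCH clause reconstructed (missing in the source); the Person label and 1..2 hop range are assumptions
MATCH (p:Person)-[*1..2]-(connected)
WHERE p.name =~ '.*Buhari.*'
RETURN connected.name, connected.relationship_type, connected.country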
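To make the "within two relationship hops" question concrete without a running database, here is a plain-Python sketch of the same traversal as a breadth-first search over an adjacency dictionary. The function `within_hops` and the toy graph are hypothetical; a graph database performs the equivalent traversal natively.

```python
from collections import deque


def within_hops(graph, start, max_hops=2):
    """Return {node: hop_count} for every node reachable within max_hops.

    graph maps each node to an iterable of neighbours; undirected
    relationships should be listed in both directions.
    """
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:  # first visit is the shortest path
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]  # the starting PEP is not their own connection
    return distances


if __name__ == "__main__":
    toy = {
        "pep": ["spouse", "partner"],
        "spouse": ["pep", "sibling"],
        "partner": ["pep"],
        "sibling": ["spouse", "cousin"],
        "cousin": ["sibling"],
    }
    print(within_hops(toy, "pep"))
```

In the toy graph the cousin sits three hops away and is excluded, while the spouse's sibling is included at two hops.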
The graph approach revealed connections that would have been invisible in flat data. We found over 1,300 family links across the 27,000 profiles, including cross-border family networks where a PEP in one country had relatives holding senior positions in neighboring states. These cross-border connections are precisely the kind of risk that country-by-country screening misses.
[Figure: Regional Data Landscape]
The API Design
AfricaPEP exposes a REST API designed for integration with existing KYC platforms. The primary endpoint accepts a name (and optionally a country, date of birth, or position) and returns ranked matches with confidence scores, relationship data, and FATF tier classifications.
API Response Structure
{
"query": "Mohammed Al-Bashir",
"matches": [
{
"name": "Omar Hassan Ahmad al-Bashir",
"confidence": 0.72,
"country": "Sudan",
"tier": 1,
"positions": [
{
"title": "President of Sudan",
"start": "1989-06-30",
"end": "2019-04-11",
"status": "former"
}
],
"related_peps": 8,
"family_connections": [
{ "name": "Widad Babiker", "relation": "spouse" }
]
}
],
"total_matches": 3,
"search_time_ms": 42
}
Response times average 40ms for single-name queries and under 200ms for batch queries of up to 100 names. We achieved this by pre-computing phonetic indices and maintaining an in-memory blocking index that eliminates 98% of candidate pairs before the more expensive fuzzy matching runs.
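The idea behind a blocking index is to group names under a cheap key so that the expensive fuzzy comparison only runs within a group. The sketch below uses a deliberately crude consonant-skeleton key as a stand-in for a real phonetic algorithm. `blocking_key`, `build_index`, and `candidates` are hypothetical names, and the actual key used by AfricaPEP is not described in this text.

```python
from collections import defaultdict

VOWELS = set("aeiou")


def blocking_key(name: str, length: int = 4) -> str:
    """First letter plus following consonants, repeats collapsed, truncated."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in VOWELS or c == key[-1]:
            continue  # skip vowels and collapse doubled consonants
        key.append(c)
    return "".join(key)[:length]


def build_index(names):
    """Map each blocking key to the list of names that share it."""
    index = defaultdict(list)
    for name in names:
        index[blocking_key(name)].append(name)
    return index


def candidates(query: str, index):
    """Only names in the same block go on to the expensive fuzzy scorer."""
    return index.get(blocking_key(query), [])


if __name__ == "__main__":
    idx = build_index(["Mohamed", "Muhammad", "Bashir"])
    print(candidates("Muhammadu", idx))
```

The three transliterations mentioned earlier ("Mohamed", "Muhammad", "Muhammadu") all map to the same key, so they land in one block while unrelated names are never compared.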
Lessons Learned
We spent more time validating and deduplicating than collecting. 27,000 verified profiles are more valuable than 100,000 noisy ones. Every false positive in a PEP database wastes analyst time downstream.
Each algorithm has blind spots. The ensemble approach was not optional; it was the only way to achieve acceptable precision across the full range of African naming conventions.
We initially prototyped with PostgreSQL recursive CTEs. It worked for small datasets but became unmanageable as relationship depth increased. Neo4j was the right tool from the start; we should have committed earlier.
Our best technical work was meaningless without domain experts who understood naming conventions, political structures, and data source reliability in each African region. The data team included analysts from West, East, and Southern Africa.
Making AfricaPEP open source gave smaller African fintechs access to PEP data they could never have afforded from commercial providers. This was a deliberate choice: compliance should not be a luxury reserved for large institutions.