© 2026 Patrick Attankurugu. Built with Next.js.

Case Study · AfricaPEP

Building a PEP Database for 54 African Countries: Lessons from AfricaPEP

February 2025 · 12 min read

Africa had no comprehensive, open-source PEP database. Financial institutions screening customers against global watchlists had excellent coverage for European and North American politicians but almost nothing for African heads of state, ministers, or senior military officers. So we built one.

27K+ PEP Profiles
54 Countries Covered
1,300+ Family Links Mapped
4 FATF Tiers Classified

Why Africa Had a PEP Data Gap

Politically Exposed Persons screening is a cornerstone of AML compliance. Every bank, fintech, and money transfer operator is required to identify whether their customers hold (or are connected to) senior public positions. The Financial Action Task Force defines PEPs broadly: heads of state, senior politicians, judicial officials, military leaders, and executives of state-owned enterprises, along with their family members and close associates.

The commercial PEP databases that most institutions rely on (Dow Jones, Refinitiv, LexisNexis) are built primarily from structured Western government sources: parliamentary records, electoral databases, government gazettes. These sources are well maintained in Europe, North America, and parts of Asia. But African government data infrastructure varies enormously. Some countries publish comprehensive ministerial lists online. Others have no digital presence at all for their cabinet members.

The consequence was a dangerous blind spot. A compliance officer screening a customer at a Lagos bank might get a hit on a German parliamentarian but miss the fact that the customer is the nephew of a sitting West African finance minister. This is not a theoretical concern. Several high-profile money laundering cases involving African PEPs were facilitated precisely because screening databases lacked adequate African coverage.

PEP Data Availability: The Global Imbalance

Europe: 95%
North America: 92%
Asia Pacific: 73%
Latin America: 58%
Africa (before AfricaPEP): 12%
Africa (after AfricaPEP): 78%

Estimated senior-official coverage in commercial PEP databases, pre and post AfricaPEP

The Data Collection Pipeline

We started where most structured data projects start: by mapping what already existed. Wikidata turned out to be surprisingly rich for African political figures, though unevenly so. Nigeria, South Africa, Kenya, and Egypt had decent coverage. Countries like Equatorial Guinea, Comoros, and Eswatini had almost nothing.

The pipeline combined three data sources in order of reliability:

Multi-Source Data Pipeline

1. Wikidata SPARQL Extraction
Structured queries for all African nationals with political position (P39), military rank (P410), or judicial role properties. Extracted names in multiple languages, birth dates, positions held, and start/end dates.
Output: ~18,000 raw profiles

2. Government Source Scraping
Custom scrapers for 38 countries with accessible government websites. Parliamentary member lists, cabinet compositions, judiciary rosters, central bank boards, and state enterprise directors.
Output: ~6,500 additional profiles

3. Manual Research and Validation
For the 16 countries with minimal digital presence, research analysts manually compiled profiles from news sources, NGO reports, and diplomatic records. Every profile verified against at least two independent sources.
Output: ~2,500 additional profiles

4. Deduplication and Merge
Entity resolution pipeline (detailed below) to identify and merge duplicate profiles across sources. A single PEP often appeared in Wikidata, their parliament website, AND news scraping with different name spellings.
Output: 27,000+ unique, verified profiles
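
As a rough illustration of step 1, the sketch below builds a Wikidata SPARQL query for people with a "position held" statement (P39), including the statement's start and end dates. It is not the project's actual query: filtering by country of citizenship (P27), the date qualifiers P580/P582, the date-of-birth property P569, the label languages, and the function name `build_position_query` are all assumptions made for the example. The query would be sent to the public endpoint at https://query.wikidata.org/sparql.

```python
import re


def build_position_query(country_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for citizens of one country who have held a position.

    Assumption: nationality is modelled with P27 (country of citizenship).
    """
    # Reject anything that is not a plain Wikidata item ID such as "Q1033".
    if not re.fullmatch(r"Q[1-9]\d*", country_qid):
        raise ValueError(f"not a Wikidata item ID: {country_qid!r}")
    return f"""
SELECT ?person ?personLabel ?positionLabel ?start ?end ?dob WHERE {{
  ?person wdt:P27 wd:{country_qid} ;
          p:P39 ?statement .
  ?statement ps:P39 ?position .
  OPTIONAL {{ ?statement pq:P580 ?start . }}
  OPTIONAL {{ ?statement pq:P582 ?end . }}
  OPTIONAL {{ ?person wdt:P569 ?dob . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,fr,ar,pt" . }}
}}
LIMIT {limit}
""".strip()


# Q1033 is the Wikidata item for Nigeria.
query = build_position_query("Q1033")
print(query)
```

Going through the full statement node (`p:P39`) rather than the direct `wdt:P39` shortcut is what makes the start and end qualifiers available for each position.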

FATF Tier Classification

Not all PEPs carry the same risk. FATF guidance distinguishes between domestic PEPs, foreign PEPs, and international organization PEPs. But in practice, the risk also varies by the specific role. A head of state has more opportunity and motive for corruption than a backbench parliamentarian in a country with strong institutional oversight.

We built a four-tier classification system that goes beyond the standard FATF categories. The tier determines how strictly a financial institution should apply enhanced due diligence:

AfricaPEP Risk Tier Classification

Tier 1 (Critical)
Roles: Heads of state, prime ministers, central bank governors, supreme court chief justices
EDD Requirement: Mandatory enhanced due diligence. Senior management approval required.

Tier 2 (High)
Roles: Cabinet ministers, military chiefs, senior judiciary, state enterprise CEOs, ambassadors to major partners
EDD Requirement: Enhanced due diligence with source-of-wealth verification.

Tier 3 (Elevated)
Roles: Members of parliament, regional governors, mid-ranking military, regulatory agency heads
EDD Requirement: Simplified enhanced due diligence. Periodic review.

Tier 4 (Standard)
Roles: Local council members, junior officials, state enterprise board members, party officials
EDD Requirement: Standard due diligence with PEP flag for monitoring.
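
The tier table above can be expressed as a simple lookup. The sketch below is illustrative only: the role strings are a subset of the roles listed above, and the rule that a person with several roles takes the most severe tier is an assumption, not something the classification itself specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    label: str
    edd: str


TIERS = {
    1: Tier(1, "Critical", "Mandatory enhanced due diligence. Senior management approval required."),
    2: Tier(2, "High", "Enhanced due diligence with source-of-wealth verification."),
    3: Tier(3, "Elevated", "Simplified enhanced due diligence. Periodic review."),
    4: Tier(4, "Standard", "Standard due diligence with PEP flag for monitoring."),
}

# Hypothetical role vocabulary; a real system would need a much richer mapping.
ROLE_TO_TIER = {
    "head of state": 1, "prime minister": 1, "central bank governor": 1,
    "cabinet minister": 2, "military chief": 2, "state enterprise ceo": 2,
    "member of parliament": 3, "regional governor": 3, "regulatory agency head": 3,
    "local council member": 4, "junior official": 4, "party official": 4,
}


def classify(roles):
    """Return the most severe tier among the recognised roles, or None."""
    levels = []
    for role in roles:
        key = role.strip().lower()
        if key in ROLE_TO_TIER:
            levels.append(ROLE_TO_TIER[key])
    # Assumption: the lowest tier number (highest risk) wins.
    return TIERS[min(levels)] if levels else None


tier = classify(["Member of Parliament", "Cabinet Minister"])
print(tier.level, tier.label)  # 2 High
```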

The Name Matching Challenge

This is where the project got genuinely difficult. Name matching across 54 African countries means dealing with Arabic, French, English, Portuguese, Swahili, Amharic, Hausa, Yoruba, Zulu, and dozens of other languages and scripts. A single person might be recorded as “Mohamed” in one source, “Muhammad” in another, and “Muhammadu” in a third. These are all valid transliterations.

African naming conventions compound the difficulty. In many West African cultures, a person has a given name, a family name, and often a patronymic or village name that may or may not appear in official records. North African names frequently include “Al-”, “El-”, or “Ben” prefixes that are handled inconsistently across databases. East African names may include clan identifiers. Southern African names sometimes include anglicized versions alongside traditional spellings.
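
One common first step before any fuzzy comparison is to normalise names so that purely cosmetic differences disappear. The sketch below is a minimal example of that idea, not the project's normaliser: it strips diacritics, lowercases, turns punctuation such as hyphens into spaces, and maps the prefix "El" to "Al". The `PARTICLE_VARIANTS` table and the function name are hypothetical.

```python
import re
import unicodedata

# Hypothetical table of prefix spellings to unify; extend per region as needed.
PARTICLE_VARIANTS = {"el": "al"}


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, and unify known prefixes."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every run of non-letter, non-digit characters with one space.
    cleaned = re.sub(r"[\W_]+", " ", stripped.lower())
    tokens = [PARTICLE_VARIANTS.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


print(normalize_name("Mohammed Al-Bashir"))  # mohammed al bashir
print(normalize_name("Mohamed El Béchir"))   # mohamed al bechir
```

Normalisation only removes formatting noise. Genuine transliteration variants such as "Mohamed" and "Muhammad" still need a similarity measure.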

No single matching algorithm handles all of these cases well. That is why we built a hybrid scoring system combining four approaches:

Matching Algorithms Compared

Levenshtein distance
Counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

Example Match: Mohammed Al-Bashir vs Mohamed Al Bashir (score 0.89)
Strengths: Catches typos and minor spelling variations. Intuitive distance metric.
Weaknesses: Struggles with transpositions. Penalizes length differences harshly.
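
The edit-distance measure described above can be sketched in a few lines. Turning the distance into a 0 to 1 score by dividing by the longer string's length is an assumption about the normalisation; it happens to reproduce the 0.89 in the example match.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("Mohammed Al-Bashir", "Mohamed Al Bashir"))           # 2
print(round(similarity("Mohammed Al-Bashir", "Mohamed Al Bashir"), 2))  # 0.89
print(levenshtein("ab", "ba"))                                          # 2
```

The last line shows the transposition weakness: swapping two adjacent letters costs two edits, not one.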

The hybrid approach proved critical. In our validation dataset of 5,000 manually confirmed matches, the hybrid scorer achieved a precision of 94.2% at a recall of 91.7%. The best single algorithm (Jaro-Winkler) only managed 87.3% precision at the same recall threshold. That roughly 7-point gap in precision translates to thousands fewer false matches in a database of 27,000 profiles.
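
To make the idea of a hybrid scorer concrete, here is a minimal sketch that blends two similarity signals with fixed weights. It is not the scorer described above: the two components (a character-level ratio from Python's `difflib` and a token-overlap score), the weights, and the 0.8 threshold are all stand-ins chosen so the example runs on the standard library alone.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens, ignoring order."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical weights; in practice these would be tuned on labelled pairs.
WEIGHTED_SCORERS = ((char_similarity, 0.6), (token_similarity, 0.4))


def hybrid_score(a: str, b: str) -> float:
    """Weighted average of the component scores."""
    return sum(weight * scorer(a, b) for scorer, weight in WEIGHTED_SCORERS)


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    return hybrid_score(a, b) >= threshold


# Fictional names. Token order differs, which hurts the character score
# but not the token score, so the blend ranks the pair higher.
print(char_similarity("Amina Diallo", "Diallo Amina"))
print(hybrid_score("Amina Diallo", "Diallo Amina"))
```

The design point is that each component covers a blind spot of the other: the character score tolerates spelling variants, while the token score tolerates reordered name parts.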

Graph-Powered Relationship Mapping

PEP screening is not just about identifying the PEP themselves. FATF guidelines require institutions to also identify family members and close associates of PEPs. A customer who is the spouse, child, or business partner of a minister carries elevated risk even if they hold no public position themselves.

Relational databases are terrible at this. Finding “all people within two relationship hops of a given PEP” requires recursive joins that are slow and awkward to express in SQL. We used Neo4j, a graph database, where this query is both natural and fast.

Neo4j Graph Data Model

Nodes: PEP, Family Member, Child, Associate, Company

Relationships:
SPOUSE_OF: PEP ↔ Family Member
PARENT_OF: → Child
BUSINESS_PARTNER: → Associate
DIRECTS: ↔ Company

// Everyone within one or two relationship hops of any PEP whose name contains "Buhari"
MATCH (p:PEP)-[:RELATED_TO*1..2]-(connected)
WHERE p.name =~ '.*Buhari.*'
RETURN connected.name, connected.relationship_type, connected.country
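
For readers unfamiliar with graph traversal, the "within two hops" idea behind the query above amounts to a bounded breadth-first search. The sketch below shows it over a tiny in-memory graph. The names and edges are fictional, and treating every relationship as undirected is an assumption for the example.

```python
from collections import deque

# Fictional edges: (source, relationship type, target).
EDGES = [
    ("Minister A", "SPOUSE_OF", "Spouse B"),
    ("Minister A", "PARENT_OF", "Child C"),
    ("Child C", "DIRECTS", "Company D"),
    ("Associate E", "DIRECTS", "Company D"),
]


def build_graph(edges):
    """Build an undirected adjacency map, ignoring relationship types."""
    graph = {}
    for source, _relationship, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set()).add(source)
    return graph


def within_hops(graph, start, max_hops=2):
    """Return {node: hop distance} for nodes reachable in at most max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]  # the starting PEP is not its own connection
    return distances


graph = build_graph(EDGES)
connections = within_hops(graph, "Minister A")
print(sorted(connections.items()))
# [('Child C', 1), ('Company D', 2), ('Spouse B', 1)]
```

"Associate E" is three hops away and is therefore excluded, which mirrors the `*1..2` bound in the Cypher pattern.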

The graph approach revealed connections that would have been invisible in flat data. We found over 1,300 family links across the 27,000 profiles, including cross-border family networks where a PEP in one country had relatives holding senior positions in neighboring states. These cross-border connections are precisely the kind of risk that country-by-country screening misses.

Regional Data Landscape

West Africa
PEP Profiles: 8,200+
Countries: 15
Key Challenges: Multiple official languages (French, English, Portuguese). High volume of political transitions. Complex traditional governance structures alongside modern government.
Primary Sources: National assemblies, electoral commissions, Wikidata, ECOWAS records

The API Design

AfricaPEP exposes a REST API designed for integration with existing KYC platforms. The primary endpoint accepts a name (and optionally a country, date of birth, or position) and returns ranked matches with confidence scores, relationship data, and FATF tier classifications.

API Response Structure

{
  "query": "Mohammed Al-Bashir",
  "matches": [
    {
      "name": "Omar Hassan Ahmad al-Bashir",
      "confidence": 0.72,
      "country": "Sudan",
      "tier": 1,
      "positions": [
        {
          "title": "President of Sudan",
          "start": "1989-06-30",
          "end": "2019-04-11",
          "status": "former"
        }
      ],
      "related_peps": 8,
      "family_connections": [
        { "name": "Widad Babiker", "relation": "spouse" }
      ]
    }
  ],
  "total_matches": 3,
  "search_time_ms": 42
}
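
A consumer of this response might parse it along the following lines. This is a sketch based only on the fields shown above: the `Match` class, the helper names, and the escalation rule (tier 1 or 2 with confidence of at least 0.7) are hypothetical, and the sample payload is a trimmed copy of the response.

```python
import json
from dataclasses import dataclass


@dataclass
class Match:
    name: str
    confidence: float
    country: str
    tier: int


def parse_matches(payload: str, min_confidence: float = 0.0) -> list[Match]:
    """Parse a response body and return matches sorted by confidence, highest first."""
    data = json.loads(payload)
    matches = [
        Match(m["name"], m["confidence"], m["country"], m["tier"])
        for m in data.get("matches", [])
    ]
    kept = [m for m in matches if m.confidence >= min_confidence]
    return sorted(kept, key=lambda m: m.confidence, reverse=True)


def needs_escalation(match: Match, threshold: float = 0.7) -> bool:
    """Hypothetical rule: escalate confident hits on tier 1 or tier 2 PEPs."""
    return match.tier <= 2 and match.confidence >= threshold


# Trimmed copy of the response shown above.
SAMPLE = """
{
  "query": "Mohammed Al-Bashir",
  "matches": [
    {"name": "Omar Hassan Ahmad al-Bashir", "confidence": 0.72,
     "country": "Sudan", "tier": 1}
  ],
  "total_matches": 3,
  "search_time_ms": 42
}
"""

for match in parse_matches(SAMPLE):
    print(match.name, match.confidence, needs_escalation(match))
```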

Response times average 40ms for single-name queries and under 200ms for batch queries of up to 100 names. We achieved this by pre-computing phonetic indices and maintaining an in-memory blocking index that eliminates 98% of candidate pairs before the more expensive fuzzy matching runs.
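
The blocking idea can be illustrated with a small in-memory index: names are reduced to coarse keys, and only profiles that share a key with the query are passed on to fuzzy matching. The sketch below is not the production index. The consonant-skeleton key is a simplified stand-in for a real phonetic algorithm, it assumes names are already in Latin script, and the class and function names are hypothetical.

```python
import re
from collections import defaultdict

VOWELS = set("aeiouy")


def consonant_key(token: str, length: int = 4) -> str:
    """First letter plus following consonants, skipping repeats of the last kept letter."""
    letters = re.sub(r"[^a-z]", "", token.lower())
    if not letters:
        return ""
    key = letters[0]
    for ch in letters[1:]:
        if ch in VOWELS or ch == key[-1]:
            continue
        key += ch
    return key[:length]


def name_keys(name: str) -> set[str]:
    """Keys for every token of three or more letters (skips particles like 'al')."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return {consonant_key(token) for token in tokens if len(token) >= 3}


class BlockingIndex:
    def __init__(self) -> None:
        self._blocks: dict[str, set[int]] = defaultdict(set)

    def add(self, profile_id: int, name: str) -> None:
        for key in name_keys(name):
            self._blocks[key].add(profile_id)

    def candidates(self, query: str) -> set[int]:
        """Profile IDs sharing at least one key with the query."""
        found: set[int] = set()
        for key in name_keys(query):
            found |= self._blocks.get(key, set())
        return found


index = BlockingIndex()
index.add(1, "Mohammed Al-Bashir")
index.add(2, "Kwame Mensah")  # fictional profile

print(consonant_key("Mohamed"), consonant_key("Muhammadu"))  # mhmd mhmd
print(index.candidates("Mohamed Al Bashir"))                 # {1}
```

Only the surviving candidates would then go through the more expensive similarity scoring.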

Lessons Learned

Data quality trumps data quantity

We spent more time validating and deduplicating than collecting. 27,000 verified profiles are more valuable than 100,000 noisy ones. Every false positive in a PEP database wastes analyst time downstream.

No single matching algorithm is enough

Each algorithm has blind spots. The ensemble approach was not optional; it was the only way to achieve acceptable precision across the full range of African naming conventions.

Graph databases are not a luxury for relationship data

We initially prototyped with PostgreSQL recursive CTEs. It worked for small datasets but became unmanageable as relationship depth increased. Neo4j was the right tool from the start; we should have committed earlier.

Regional expertise is non-negotiable

Our best technical work was meaningless without domain experts who understood naming conventions, political structures, and data source reliability in each African region. The data team included analysts from West, East, and Southern Africa.

Open source accelerates adoption

Making AfricaPEP open source gave smaller African fintechs access to PEP data they could never have afforded from commercial providers. This was a deliberate choice: compliance should not be a luxury reserved for large institutions.

Patrick Attankurugu
Creator of AfricaPEP. Senior AI/ML Engineer at Agregar Technologies, specializing in KYC/AML systems and identity verification across Africa.