Senior AI/ML Engineer · KYC/AML · Africa

© 2026 Patrick Attankurugu. Built with Next.js.

NLP · Compliance Engineering

Entity Resolution at Scale: Matching Names Across Languages, Scripts, and Borders

October 2024 · 8 min read

Sanctions screening sounds simple until you realize the same person's name can appear in Arabic, French, English, and local transliterations, all slightly different. This is the NLP pipeline we built for high-precision name matching across 54 African countries.

Why Name Matching is Hard

Consider a single name: Muhammadu Buhari, the former President of Nigeria. In official UN documents, he appears as “Muhammadu Buhari.” In French-language press, “Mohamed Bouhari.” In Arabic script, the name uses a completely different character set. In Hausa, the dominant language of his home region, yet another spelling convention applies. All four refer to the same person. A screening system that cannot link them has a critical gap.

Africa amplifies this challenge in several ways. The continent uses several unrelated script families: Latin, Arabic, Ge'ez (Ethiopic, used for Amharic and Tigrinya), and indigenous writing systems such as Tifinagh, N'Ko, and Vai. Colonial history means that the same name might follow French, English, Portuguese, or Arabic conventions depending on which country recorded it. Cultural naming practices also vary enormously: patronymics in parts of East Africa, matronymics in some West African cultures, clan names in Somalia, praise names in Zulu tradition, and compound names that may or may not include all elements in official records.

🌐
Transliteration Ambiguity

Arabic "محمد" has at least 12 common English spellings: Muhammad, Mohammed, Mohamed, Mohamad, Muhammed, and more. Each is valid.

🔄
Name Component Order

Western convention: given name first. Arabic: may lead with family name. Yoruba: family name placement varies. Order is unreliable as a matching signal.

🔗
Prefix and Particle Handling

Al-, El-, Bin, Ben, Ibn, Ould, Di, Van, De. Sometimes part of the family name, sometimes omitted. Inconsistent across databases.

📄
Missing and Extra Components

Official records may include only two name tokens. The same person's passport has four. Matching must handle partial information gracefully.
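
The four problems above suggest comparing names as unordered sets of tokens rather than as fixed strings. The following is a minimal sketch of that idea, not the pipeline described in this post: the `PARTICLES` set, the `core_tokens` helper, and the containment score are illustrative assumptions, and the inputs are assumed to be already lowercased and stripped of punctuation.

```python
# Hypothetical particle list taken from the examples above; a real list
# would be curated per language and per source database.
PARTICLES = {"al", "el", "bin", "ben", "ibn", "ould", "di", "van", "de"}


def core_tokens(normalized_name: str) -> frozenset:
    """Return the name's tokens as a set, ignoring order and particles."""
    return frozenset(t for t in normalized_name.split() if t not in PARTICLES)


def token_containment(a: str, b: str) -> float:
    """Share of the shorter name's tokens that also appear in the longer one.

    Dividing by the smaller set means a two-token record can still fully
    match a four-token passport name.
    """
    ta, tb = core_tokens(a), core_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


# Reordered components still match fully.
print(token_containment("buhari muhammadu", "muhammadu buhari"))  # 1.0
# Omitted particles do not reduce the score.
print(token_containment("abd el fattah al sissi", "abd fattah sissi"))  # 1.0
# Only one of two tokens is shared.
print(token_containment("john smith", "john doe"))  # 0.5
```

Exact token equality is too strict for transliteration variants, so a score like this would only be one signal alongside fuzzy and phonetic comparison of the individual tokens.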

The NLP Pipeline

Our entity resolution pipeline processes names in five stages, each designed to reduce a specific source of matching failure. The pipeline runs both during PEP database ingestion (normalizing and indexing all 27,000+ profiles) and at query time (processing the name being screened).

NLP Pipeline

Lowercase, remove diacritics, normalize Unicode (NFC), strip honorifics (Dr., Hon., Sheikh), expand abbreviations, handle hyphens and apostrophes.

"Dr. Abd El-Fattah Al-Sissi" → "abd el fattah al sissi"

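The normalization steps listed above can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the production code: the `HONORIFICS` set is a small hypothetical sample, abbreviation expansion is left out, and dropping apostrophes (instead of turning them into spaces) is a guess, since the example above only shows how hyphens are treated.

```python
import re
import unicodedata

# Hypothetical sample; a production list would be much longer and multilingual.
HONORIFICS = {"dr", "hon", "sheikh", "mr", "mrs", "prof"}


def normalize_name(raw: str) -> str:
    # Decompose so that accents become separate combining marks, drop the
    # marks, then recompose to NFC.
    decomposed = unicodedata.normalize("NFD", raw)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    text = unicodedata.normalize("NFC", without_marks).lower()

    # Hyphens become spaces, as in the example above. Apostrophes and
    # periods are dropped (an assumption of this sketch).
    text = text.replace("-", " ")
    text = re.sub(r"['’`.]", "", text)

    # Strip honorifics only from the front, so a name token that happens
    # to look like a title is not removed from the middle of a name.
    tokens = text.split()
    while tokens and tokens[0] in HONORIFICS:
        tokens.pop(0)
    return " ".join(tokens)


print(normalize_name("Dr. Abd El-Fattah Al-Sissi"))  # abd el fattah al sissi
print(normalize_name("Hon. José N'Diaye"))           # jose ndiaye
```

Note that removing combining marks also strips Arabic vowel marks, which is usually desirable for matching but is worth deciding explicitly.
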
Matching Algorithm Comparison

No single string-matching algorithm handles the full range of name variation. We evaluated four approaches and found that each excels in different scenarios. The comparison below shows how the algorithms score the same name pair, revealing their individual strengths and blind spots.

Name Matching in Action

محمد الرشيد vs Mohammed Al-Rashid
Levenshtein: N/A (different script)
Jaro-Winkler: N/A
Phonetic: 0.95
Hybrid: 0.93
Arabic-to-English transliteration. Phonetic encoding bridges the script gap.
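
For reference, the two edit-based measures named above can be written in a few lines of pure Python. This is a textbook sketch, not the implementation used in the pipeline; the function names and the 0.1 prefix scale (the conventional Jaro-Winkler default) are choices made here. Both functions compare characters directly, which is why they have nothing useful to say about a pair written in two different scripts.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity with a bonus for a shared prefix of up to 4 characters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(levenshtein("mohammed", "muhammad"))             # 2
print(levenshtein_similarity("mohammed", "muhammad"))  # 0.75
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))      # 0.961
```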

Scaling: The Blocking Strategy

Comparing every incoming name against all 27,000 PEP profiles with four matching algorithms is computationally expensive. A naive implementation means 27,000 × 4 = 108,000 algorithm evaluations per query, each involving multiple string operations. At the volumes our clients process (thousands of screenings per hour), this approach does not scale.

Blocking is the solution. Instead of comparing against every profile, we first identify a small set of candidates that might plausibly match, and then run the full hybrid scoring only on those candidates. Our blocking strategy uses three complementary keys:

Three-Layer Blocking Strategy

Phonetic Block (95% pair reduction)
Double Metaphone encoding of each name token. Names that sound alike share a block regardless of spelling.
Character N-gram Block (87% pair reduction)
First 3 characters of the normalized family name. Fast, catches common prefixes.
Token Signature Block (91% pair reduction)
Sorted first characters of all name tokens. Handles name reordering.
Combined: ~50 candidate pairs per query (down from 27K possible)
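
The three blocking keys above can be sketched as an inverted index from key to profile IDs. Several things here are assumptions made for illustration: a simplified Soundex stands in for Double Metaphone so the example needs no third-party library, the family name is taken to be the last token, and candidates are the union of all blocks (the post does not say how the three layers are combined). Names are assumed to be already normalized.

```python
from collections import defaultdict

_CODES = {
    c: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for c in letters
}


def soundex(token: str) -> str:
    """Simplified Soundex, used here as a stand-in for Double Metaphone."""
    letters = [c for c in token.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def blocking_keys(normalized_name: str) -> set:
    tokens = normalized_name.split()
    if not tokens:
        return set()
    keys = {f"ph:{soundex(t)}" for t in tokens if soundex(t)}      # phonetic
    keys.add(f"pre:{tokens[-1][:3]}")                              # 3-char prefix
    keys.add("sig:" + "".join(sorted(t[0] for t in tokens)))       # signature
    return keys


def build_index(profiles: dict) -> dict:
    index = defaultdict(set)
    for profile_id, name in profiles.items():
        for key in blocking_keys(name):
            index[key].add(profile_id)
    return index


def candidates(query: str, index: dict) -> set:
    """Profiles sharing at least one blocking key with the query."""
    found = set()
    for key in blocking_keys(query):
        found |= index.get(key, set())
    return found


profiles = {
    1: "muhammadu buhari",
    2: "mohamed bouhari",
    3: "ellen johnson sirleaf",
}
index = build_index(profiles)
print(soundex("mohammed"), soundex("muhammad"))       # M530 M530
print(sorted(candidates("mohammed buhari", index)))   # [1, 2]
```

Only the profiles returned by `candidates` would then go through the full hybrid scoring. A union of blocks favors recall; requiring agreement on more than one key would shrink the candidate set further at the cost of missed matches.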

Results and Validation

Validation Results on 5,000 Manually Labeled Pairs

Precision: 94.2% (true matches among all returned matches)
Recall: 91.7% (found matches among all true matches)
F1 Score: 0.929 (harmonic mean of precision and recall)
Avg Latency: 38ms (per query, including blocking and scoring)
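
As a quick check on how these figures relate, the F1 score can be recomputed from the reported precision and recall. The helper names below are illustrative, and the counts passed to `precision_recall_f1` at the end are made-up round numbers, not the counts from the validation set.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Derive all three metrics from labeled-pair counts.

    tp: returned matches that are true matches
    fp: returned matches that are not true matches
    fn: true matches that were not returned
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Reported precision and recall reproduce the reported F1.
print(round(f1_score(0.942, 0.917), 3))  # 0.929

# Hypothetical counts, only to show the call shape.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```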

The 8.3% of true matches that we miss (recall gap) are predominantly alias cases: completely different names used by the same person (birth names vs. adopted names, traditional vs. official names). These are not solvable by string matching alone. We maintain a separate alias database, populated from intelligence sources, to catch these cases. String matching and alias lookup together bring effective recall above 96%.
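
One way to picture how alias lookup complements string matching is a screening function that consults both. Everything in this sketch is hypothetical: the `screen` function, the dictionary shapes, the 0.85 threshold, the placeholder names, and the use of `difflib.SequenceMatcher` as a stand-in for the hybrid scorer.

```python
from difflib import SequenceMatcher


def screen(query: str, profiles: dict, aliases: dict, threshold: float = 0.85) -> dict:
    """Return {profile_id: (source, score)} for string and alias hits.

    profiles maps profile_id -> normalized primary name.
    aliases maps normalized alias -> profile_id.
    """
    hits = {}
    # String matching catches spelling variants of the primary name.
    for profile_id, name in profiles.items():
        score = SequenceMatcher(None, query, name).ratio()
        if score >= threshold:
            hits[profile_id] = ("string", score)
    # Alias lookup catches names that share no characters with the primary name.
    alias_target = aliases.get(query)
    if alias_target is not None and alias_target not in hits:
        hits[alias_target] = ("alias", 1.0)
    return hits


profiles = {"p1": "john example"}        # placeholder profile
aliases = {"kwame sample": "p1"}         # placeholder alias for the same person

print(screen("john exampel", profiles, aliases))   # found via string matching
print(screen("kwame sample", profiles, aliases))   # found only via alias lookup
```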

Lessons Learned

Phonetic encoding is the single most valuable technique

For cross-transliteration matching, phonetic methods outperform edit distance by a wide margin. They capture how a name sounds rather than how a particular source happened to spell it.

Blocking strategy determines scalability

We spent more time tuning blocking keys than tuning the matching algorithms. A good blocking strategy is the difference between 40ms and 4-second queries.

Ground truth data is expensive and essential

The 5,000 manually labeled pairs took three months to create with domain experts from multiple African regions. Automated labels are not sufficient for this domain.

Patrick Attankurugu
Senior AI/ML Engineer at Agregar Technologies, building NLP systems for compliance screening across Africa.