Entity Resolution at Scale: Matching Names Across Languages, Scripts, and Borders
Sanctions screening sounds simple until you realize the same person's name can appear in Arabic, French, English, and local transliterations, all slightly different. This is the NLP pipeline we built for high-precision name matching across 54 African countries.
Why Name Matching is Hard
Consider a single name: the former President of Nigeria. In official UN documents, he appears as “Muhammadu Buhari.” In French-language press, “Mohamed Bouhari.” In Arabic script, a completely different character set. In Hausa (his native language), yet another spelling convention. All four refer to the same person. A screening system that cannot link them has a critical gap.
Africa amplifies this challenge in several ways. The continent uses several major scripts: Latin, Arabic, Ge'ez (used for Amharic and Tigrinya), and various indigenous writing systems. Colonial history means that the same name might follow French, English, Portuguese, or Arabic conventions depending on which country recorded it. Cultural naming practices vary enormously: patronymics in parts of East Africa, matronymics in some West African cultures, clan names in Somalia, praise names in Zulu tradition, and compound names that may or may not include all elements in official records.
Arabic "محمد" has at least 12 common English spellings: Muhammad, Mohammed, Mohamed, Mohamad, Muhammed, and more. Each is valid.
Western convention: given name first. Arabic: may lead with family name. Yoruba: family name placement varies. Order is unreliable as a matching signal.
Particles such as Al-, El-, Bin, Ben, Ibn, Ould, Di, Van, and De are sometimes treated as part of the family name and sometimes omitted. Their handling is inconsistent across databases.
Official records may include only two name tokens. The same person's passport has four. Matching must handle partial information gracefully.
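The last three challenges (unreliable order, optional particles, partial names) all point toward comparing names as sets of tokens rather than as ordered strings. The sketch below is one minimal way to do that, not the production logic: the function names and sample names are hypothetical, and the particle list is simply the one given above.

```python
from __future__ import annotations

# Particle list taken from the examples in the text; a real list would be longer.
PARTICLES = {"al", "el", "bin", "ben", "ibn", "ould", "di", "van", "de"}


def name_tokens(name: str) -> set[str]:
    """Split a name into lowercase tokens, ignoring order and dropping particles."""
    cleaned = name.lower().replace("-", " ")
    return {tok for tok in cleaned.split() if tok not in PARTICLES}


def token_overlap(a: str, b: str) -> float:
    """Fraction of the shorter name's tokens that also appear in the longer name."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


# A four-token passport name against a reordered two-token record (made-up names).
print(token_overlap("Ahmed Ould Mohamed Salem", "Salem Ahmed"))
```

Dividing by the shorter name's token count lets a two-token record match a four-token passport name fully, at the cost of more false positives on very short names, so a score like this would normally be one signal among several.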
The NLP Pipeline
Our entity resolution pipeline processes names in five stages, each designed to reduce a specific source of matching failure. The pipeline runs both during PEP database ingestion (normalizing and indexing all 27,000+ profiles) and at query time (processing the name being screened).
NLP Pipeline Stages
Normalization: lowercase the name, remove diacritics, normalize Unicode to NFC, strip honorifics (Dr., Hon., Sheikh), expand abbreviations, and handle hyphens and apostrophes consistently.
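The normalization steps can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: `normalize_name` and the `HONORIFICS` set are hypothetical names, the policy of splitting on hyphens and dropping apostrophes is one possible choice, and abbreviation expansion is left out because it needs a lookup table.

```python
import re
import unicodedata

# Illustrative list; the text names only Dr., Hon., and Sheikh.
HONORIFICS = {"dr", "hon", "sheikh"}


def normalize_name(raw: str) -> str:
    """Return a lowercase, diacritic-free, honorific-free form of a name."""
    # Decompose so accents become separate combining marks, then drop the marks.
    decomposed = unicodedata.normalize("NFD", raw.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose to NFC so equivalent character sequences compare equal.
    text = unicodedata.normalize("NFC", stripped)
    # Assumed policy: hyphens separate tokens, apostrophes are dropped.
    text = text.replace("-", " ")
    text = re.sub(r"['’ʼ`]", "", text)
    # Replace remaining punctuation (such as the period in "Dr.") with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in HONORIFICS]
    return " ".join(tokens)


print(normalize_name("Dr. Amadou Touré"))
print(normalize_name("Hon. N'Diaye-Sow"))
```

Applying the same function at ingestion and at query time matters more than the exact policy, since both sides must land on the same normalized form.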
Matching Algorithm Comparison
No single string matching algorithm handles the full range of name variation. We evaluated four approaches and found that each excels in different scenarios. The interactive comparison below shows how different algorithms score the same name pairs, revealing their individual strengths and blind spots.
Name Matching in Action
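To make the contrast concrete, here is a small sketch that scores name pairs with a character-level edit distance and with a crude phonetic key. Both functions are illustrative stand-ins and not the four algorithms evaluated in this section. `consonant_key` in particular is a hypothetical simplification of real phonetic encoders.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to a 0..1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def consonant_key(name: str) -> str:
    """Crude phonetic key: collapse doubled letters, then drop vowels, h, w, y."""
    key = []
    previous = ""
    for ch in name.lower():
        if ch == previous:
            continue
        previous = ch
        if ch.isalpha() and ch not in "aeiouyhw":
            key.append(ch)
    return "".join(key)


for a, b in [("muhammad", "mohamed"), ("buhari", "bouhari")]:
    same_key = consonant_key(a) == consonant_key(b)
    print(a, b, round(edit_similarity(a, b), 3), same_key)
```

For "muhammad" and "mohamed" the edit distance is 3 out of 8 characters, yet both reduce to the same consonant key. That is the kind of pair where a phonetic signal helps and a pure edit-distance threshold would likely miss.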
Scaling: The Blocking Strategy
Comparing every incoming name against all 27,000 PEP profiles using four matching algorithms is computationally expensive. A naive implementation means roughly 108,000 comparisons per query (27,000 profiles × 4 algorithms), each involving multiple string operations. At the volumes our clients process (thousands of screenings per hour), this approach does not scale.
Blocking is the solution. Instead of comparing against every profile, we first identify a small set of candidates that might plausibly match, and then run the full hybrid scoring only on those candidates. Our blocking strategy uses three complementary keys:
Three-Layer Blocking Strategy
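As a minimal sketch of how blocking works, the example below builds an inverted index from blocking keys to profile IDs and returns only the profiles that share at least one key with the query. The three key types used here (a consonant skeleton per token, a three-letter prefix per token, and sorted initials) are hypothetical choices for illustration and should not be read as the production keys. `BlockingIndex` and its methods are likewise assumed names.

```python
from __future__ import annotations

from collections import defaultdict


def consonant_key(token: str) -> str:
    """Crude phonetic key: collapse doubled letters, then drop vowels, h, w, y."""
    key, previous = [], ""
    for ch in token.lower():
        if ch == previous:
            continue
        previous = ch
        if ch.isalpha() and ch not in "aeiouyhw":
            key.append(ch)
    return "".join(key)


def blocking_keys(name: str) -> set[str]:
    """Generate three hypothetical kinds of blocking key for a normalized name."""
    tokens = name.lower().split()
    keys = set()
    for tok in tokens:
        phonetic = consonant_key(tok)
        if phonetic:                      # skip empty keys for vowel-only tokens
            keys.add("ph:" + phonetic)
        keys.add("px:" + tok[:3])         # short prefix of each token
    if tokens:
        keys.add("in:" + "".join(sorted(tok[0] for tok in tokens)))
    return keys


class BlockingIndex:
    """Inverted index from blocking key to the profile IDs that produce it."""

    def __init__(self) -> None:
        self._index: dict[str, set[int]] = defaultdict(set)

    def add(self, profile_id: int, name: str) -> None:
        for key in blocking_keys(name):
            self._index[key].add(profile_id)

    def candidates(self, name: str) -> set[int]:
        found: set[int] = set()
        for key in blocking_keys(name):
            found |= self._index.get(key, set())
        return found


index = BlockingIndex()
index.add(1, "muhammadu buhari")
index.add(2, "amina diallo")
index.add(3, "john okafor")
print(index.candidates("mohamed bouhari"))
```

Only the candidates returned by the index would go on to the full hybrid scoring, so the cost per query depends on how many profiles share a key rather than on the size of the whole database.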
Results and Validation
Validation Results on 5,000 Manually Labeled Pairs
The 8.3% of true matches that we miss (recall gap) are predominantly alias cases: completely different names used by the same person (birth names vs. adopted names, traditional vs. official names). These are not solvable by string matching alone. We maintain a separate alias database, populated from intelligence sources, to catch these cases. String matching and alias lookup together bring effective recall above 96%.
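The combination of string matching and alias lookup can be sketched as two independent lookups whose results are merged. Everything here is an assumption for illustration: the function names, the 0.85 threshold, the use of `difflib.SequenceMatcher` as a stand-in for the hybrid score, and the sample names, which are made up.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def string_matches(query: str, profiles: dict[int, str],
                   threshold: float = 0.85) -> set[int]:
    """Profile IDs whose name is similar enough to the query string."""
    return {
        pid for pid, name in profiles.items()
        if SequenceMatcher(None, query, name).ratio() >= threshold
    }


def screen(query: str, profiles: dict[int, str], aliases: dict[str, int],
           threshold: float = 0.85) -> set[int]:
    """Union of string-similarity hits and an exact alias-table hit."""
    hits = string_matches(query, profiles, threshold)
    alias_hit = aliases.get(query)
    if alias_hit is not None:
        hits.add(alias_hit)
    return hits


profiles = {1: "amadou toure", 2: "grace mwangi"}
aliases = {"baba sekou": 1}   # hypothetical alias with no string overlap

print(screen("amadou toure", profiles, aliases))
print(screen("baba sekou", profiles, aliases))
```

The alias "baba sekou" shares almost nothing with "amadou toure" at the string level, so only the alias table can link them, which is the class of miss described above.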
Lessons Learned
For cross-transliteration matching, phonetic methods outperform edit distance by a wide margin. They capture the intent of the name rather than its surface form.
We spent more time tuning blocking keys than tuning the matching algorithms. A good blocking strategy is the difference between a 40 ms query and a 4-second query.
The 5,000 manually labeled pairs took three months to create with domain experts from multiple African regions. Automated labels are not sufficient for this domain.