NLP · Compliance Engineering

Entity Resolution at Scale: Matching Names Across Languages, Scripts, and Borders

October 2024 · revised July 2026 · 16 min read

Every screening system eventually reduces to one question: is this customer the same person as this record? There is no shared key to join on. The join key is a human name, and names are the least reliable identifier ever standardized into law. This post is how I handle that in AfricaPEP, the open-source PEP screening platform I maintain, with real scores from the production matching stack rather than tidy illustrations. Every number here is reproducible from the repo or the live API.

Names are claims, not keys

Databases want identifiers to be stable, unique, and canonical. Names are none of these. They are claims written down by whoever held the pen: a passport officer, a French wire reporter, a parliamentary clerk, a Wikidata editor. Each writes what they heard, in the orthography of their own language.

Hub diagram showing one canonical record, Muhammadu Buhari with Wikidata id Q361567, surrounded by seven legitimate surface forms: Mohammed Buhari (transliteration drift), Mouhamadou Bouhari (French-press orthography), Arabic script, Buhari Muhammadu (surname-first), M. Buhari (initials), Gen. Muhammadu Buhari rtd. (honorifics), and Muhamadu Buhari (single-letter spelling drift). — Figure 1. One entity, many claims. Screen the live API for this name and the top hit is wd:Q361567 with 'M. Buhari' and 'Buhari, Muhammadu' already stored as variants; that storage decision is doing more work than any algorithm in this post.

Africa concentrates every hard case in one dataset. The continent spans Latin, Arabic, and Ge'ez scripts. Arabic names reach Latin records through French in Dakar and through English in Nairobi, producing systematically different spellings of the same name. Particles like Al-, El-, Ben, Ibn, and Ould attach, detach, and change capitalization between databases. Honorifics (Alhaji, Nana, General, rtd.) get recorded as name tokens. And the sheer frequency of shared names means the opposite failure is always nearby: records that look identical and belong to different people.

One concrete anchor before the algorithms. The Metaphone code for Mohammed, Muhammad, and Muhammadu is the same four characters: MHMT. That single fact is why phonetic encoding earns its place in every serious name matcher. The rest of this post is about why it is nowhere near sufficient.

Every algorithm has a blind spot, and I can show you each one

The standard toolbox has four families: edit-distance scorers (Jaro-Winkler), token-based scorers (sort the words, then compare), phonetic codes (Metaphone, Soundex), and embeddings, which I will get to at the end. Rather than describe them, here are the actual scores the production libraries produce on five pairs chosen to break them:

Score matrix of five name pairs against four algorithms with real computed values. Muammar Gaddafi versus Moammar Qadhafi: token-sort 0.40 fails, Jaro-Winkler 0.88 passes, Metaphone and Soundex 0.50 fail. Uhuru Kenyatta versus Kenyatta comma Uhuru: token-sort 0.97 and phonetics 1.00 pass, Jaro-Winkler 0.63 fails. Jerry John Rawlings versus J. J. Rawlings: everything fails, best 0.75. Abdoulaye versus Abdulaye Wade: everything passes. Aminata Toure versus Aminata Traore, two different real politicians: Jaro-Winkler 0.94 and token-sort 0.89 pass, phonetics 0.50 fail. — Figure 2. Computed with rapidfuzz and jellyfish, the libraries AfricaPEP actually ships. Read the diagonal of failures: each algorithm is defeated by a different, perfectly ordinary kind of name variation. Then read row five.

Walk the rows. Gaddafi/Qadhafi is the most famous transliteration cluster in compliance, and token matching collapses on it (0.40) while even the phonetic codes disagree: Metaphone hashes Gaddafi to KTF and Qadhafi to KTHF, and Soundex, which preserves the first letter by design, can never reconcile a G with a Q. Only Jaro-Winkler survives. Flip to the Kenyatta row and Jaro-Winkler is the one that dies (0.63), because edit-distance scorers read surname-first records as heavy edits. The Rawlings row fails everything, which is the quiet argument for generating name variants at ingest time: AfricaPEP stores “J. J. Rawlings”-style initial forms alongside every profile precisely so this comparison never has to be won by a string algorithm at query time.
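To make the Soundex half of that concrete, here is a minimal pure-Python sketch of classic American Soundex. It is written for illustration and is not the jellyfish implementation the project ships; the function name and the choice to return None for empty input are my assumptions.

```python
from typing import Optional

# Consonant classes of classic American Soundex; vowels, H, W and Y have no digit.
SOUNDEX_DIGITS = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> Optional[str]:
    """Return the 4-character Soundex code, or None if there is nothing to encode."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return None
    code = letters[0]  # the first letter is kept verbatim, by design
    previous = SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "HW":  # H and W do not separate repeated consonant classes
            previous = digit
    return (code + "000")[:4]


print(soundex("Gaddafi"), soundex("Qadhafi"))  # G310 Q310
```

Everything after the first character agrees (310), and the codes still differ, because the G and the Q are copied through untouched.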

Row five is the one I show compliance teams. Aminata Touré ran Senegal's government; Aminata Traoré served in Mali's. Different women, different countries, one letter apart. Jaro-Winkler scores them 0.94, comfortably above any matching threshold you would tune on the first four rows. No string score fixes this, because the strings really are that similar. What fixes it is refusing to let string similarity make destructive decisions alone, which is the second half of this post.

This is why AfricaPEP's scorer is deliberately dumb at the top: it takes the maximum of the orthographic score (the better of token-sort and Jaro-Winkler) and the phonetic score (Metaphone with a Soundex assist, token-aligned). Maximum, not average, because Figure 2's whole lesson is that the right algorithm depends on which variation you are facing, and you do not know that in advance. A recall-first screener wants the most generous defensible reading of every pair.
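A quick check of max against average, using three rows of Figure 2 as a sketch. The scores are copied from the figure and the 0.75 gate is the screening threshold quoted later in this post; the names `FIGURE_2`, `GATE`, and `combine` are mine.

```python
from statistics import mean

GATE = 0.75  # screening threshold quoted later in the post

# (token_sort, jaro_winkler, phonetic) as reported in Figure 2
FIGURE_2 = {
    "Muammar Gaddafi / Moammar Qadhafi": (0.40, 0.88, 0.50),
    "Uhuru Kenyatta / Kenyatta, Uhuru": (0.97, 0.63, 1.00),
    "Aminata Toure / Aminata Traore": (0.89, 0.94, 0.50),
}


def combine(scores, how):
    """Fold the orthographic and phonetic readings together with `how` (max or mean)."""
    token_sort, jaro_winkler, phonetic = scores
    orthographic = max(token_sort, jaro_winkler)
    return how([orthographic, phonetic])


for pair, scores in FIGURE_2.items():
    by_max = combine(scores, max)
    by_mean = combine(scores, mean)
    print(f"{pair}: max={by_max:.2f} pass={by_max >= GATE} | "
          f"mean={by_mean:.2f} pass={by_mean >= GATE}")
```

Averaging pulls the Gaddafi pair down to 0.69, under the gate, because the phonetic blind spot drags down a pair Jaro-Winkler read correctly; max keeps it at 0.88. Max also keeps the Touré/Traoré pair at 0.94, which is the precision cost the second half of this post is about.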

from rapidfuzz import fuzz
from rapidfuzz.distance import JaroWinkler

# phonetic_similarity is the project's own helper, defined elsewhere (not shown in this post)

def name_match(query: str, candidate: str) -> float:
    # orthographic: word order (token_sort) or char edits (JW),
    # whichever reads this pair more charitably
    orthographic = max(
        fuzz.token_sort_ratio(query, candidate) / 100,  # rapidfuzz returns 0-100
        JaroWinkler.similarity(query, candidate),       # already 0-1
    )
    # phonetic: Metaphone codes per token, greedy 1:1 alignment,
    # normalised by the LARGER token count (partial names score lower)
    phonetic = phonetic_similarity(query, candidate)

    return max(orthographic, phonetic)  # recall-first by construction

Two details in there took real debugging to get right. Normalising the phonetic alignment by the larger token count means “Ali Hassan” against “Ali Hassan Mwinyi Kikwete” scores 0.5, not 1.0; partial agreement should read as partial. And the alignment is one-to-one: without that, a query token can “spend” itself on two candidate tokens with the same code and inflate the score. Matching bugs are almost never in the headline algorithm; they are in the bookkeeping around it.
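Those two bookkeeping rules can be sketched in a few lines. This is not the repo's `phonetic_similarity`; `aligned_code_similarity` and its `encode` parameter are hypothetical names, and the default encoder is plain case-folding standing in for Metaphone so the example runs on the standard library alone.

```python
from collections import Counter
from typing import Callable


def aligned_code_similarity(
    query: str,
    candidate: str,
    encode: Callable[[str], str] = str.casefold,  # placeholder for a phonetic encoder
) -> float:
    """Share of tokens whose codes pair up one-to-one, over the larger token count."""
    query_codes = [encode(token) for token in query.split()]
    candidate_codes = Counter(encode(token) for token in candidate.split())
    candidate_total = sum(candidate_codes.values())
    if not query_codes or not candidate_total:
        return 0.0

    matched = 0
    for code in query_codes:
        # An empty code is "could not encode", never a match.
        if code and candidate_codes[code] > 0:
            candidate_codes[code] -= 1  # consume it: each candidate token pairs once
            matched += 1

    return matched / max(len(query_codes), candidate_total)


print(aligned_code_similarity("Ali Hassan", "Ali Hassan Mwinyi Kikwete"))  # 0.5
print(aligned_code_similarity("Ali Hassan", "Ali Ali"))                    # 0.5
```

The second call is the double-spend case: one "Ali" in the query pairs with one "Ali" in the candidate, and the leftover "Ali" stays unmatched instead of inflating the score to 1.0.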

You cannot score a billion pairs, so choose what never meets

AfricaPEP currently holds 48,377 profiles. Deduplicating that naively means scoring every record against every other: 1,170,142,876 pairs, with multiple string operations each. Screening has the mirror problem: every incoming query would have to be scored against all 48,377 rows. The universal answer is blocking, but the two workloads need different blocks, and conflating them is a common design mistake.
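The pair count is just n choose 2, which is easy to verify:

```python
from math import comb

PROFILES = 48_377

naive_pairs = comb(PROFILES, 2)  # n * (n - 1) / 2 unordered pairs
print(f"{naive_pairs:,}")        # 1,170,142,876
```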

Two funnels. Batch dedup: 48,377 profiles form 1,170,142,876 naive pairs; blocking on country plus surname initial reduces this to small per-block pair sets scoreable in minutes, with the noted cost that records blocked apart are never compared. Query-time screening: one query against 48,377 rows passes through a pg_trgm trigram index with a deliberately loosened similarity threshold, returning at most 50 candidates for full scoring. — Figure 3. Same principle, two mechanisms. Dedup is quadratic in the dataset, so it blocks on stable attributes. Screening is linear in queries, so it blocks with an index. In both, the cheap filter is tuned loose so the expensive scorer only sees plausible pairs.

For batch dedup the blocking key is country plus surname initial: two records must share both before any scorer sees them. For screening, the block is a PostgreSQL pg_trgm trigram index queried at a threshold deliberately loosened to 0.3 below the caller's matching threshold, returning at most 50 candidates for the full scorer. Note what the trap pair does here: Touré is Senegalese and Traoré is Malian, so in dedup they land in different blocks and are never even compared. Blocking is usually sold as a performance optimization; designed well, it is also your first correctness filter.
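A minimal sketch of the dedup blocking described above, on four made-up records. The record shape, the ids, and the rule that the surname is the last token are assumptions for illustration; real data with surname-first forms needs more care than this.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """(country, surname initial), or None when either part is missing."""
    tokens = record["name"].split()
    if not tokens or not record.get("country"):
        return None
    return (record["country"], tokens[-1][0].upper())  # assumes surname is last


def candidate_pairs(records):
    """Yield id pairs that share a block; everything else is never compared."""
    blocks = defaultdict(list)
    for record in records:
        key = blocking_key(record)
        if key is not None:
            blocks[key].append(record["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


RECORDS = [
    {"id": "sn-1", "name": "Aminata Touré", "country": "SN"},
    {"id": "sn-2", "name": "Aminata Toure", "country": "SN"},
    {"id": "ml-1", "name": "Aminata Traoré", "country": "ML"},
    {"id": "sn-3", "name": "Abdoulaye Wade", "country": "SN"},
]

print(list(candidate_pairs(RECORDS)))  # [('sn-1', 'sn-2')]
```

Four records would be six naive pairs; blocking leaves one. The Malian Traoré record never meets either Senegalese Touré record.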

It cuts the other way, and honesty requires saying so: anything blocking separates can never match. A profile with a misrecorded country is invisible to dedup, and the repo documents an equivalent screening limitation openly, since the trigram gate can starve the phonetic scorer of candidates whose spelling shares few trigrams. Blocking keys are recall decisions wearing a performance costume, and they belong in your design review, not your query tuner.
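The starvation effect is easy to see with a rough pure-Python approximation of trigram similarity. This assumes pg_trgm's documented convention of padding each word with two leading spaces and one trailing space, and scoring shared trigrams over distinct trigrams; the real extension may differ in details such as character handling, so treat the number as indicative only.

```python
import re


def trigrams(text: str) -> set[str]:
    """Distinct padded trigrams of every word in the text."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by distinct trigrams across both strings."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(round(trigram_similarity("Gaddafi", "Qadhafi"), 3))  # 0.143
```

The two surnames share only 2 of 14 distinct trigrams, so under this approximation a candidate stored only as "Qadhafi" would not clear a 0.3 gate for the query "Gaddafi", and no downstream scorer would ever see it.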

A score is not a decision

Here is where most matching writeups stop and where the actual engineering starts. Suppose the scorer says 0.88. So what? The answer depends entirely on what you are about to do with it, and the two things AfricaPEP does with scores have opposite failure costs.

In screening, a false positive costs an analyst a minute; a false negative is an unflagged PEP and a regulatory finding. So the screening gate is 0.75, recall-first, and on the repo's 23-pair adversarial fixture it holds recall at 1.00 while paying precision of 0.71. That price is deliberate. In merging, the asymmetry flips: a missed duplicate is cosmetic, while a false merge welds one person's political history onto another's name. Row five of Figure 2 made the case that no string threshold can be trusted with that decision, so the merge path is built to make name similarity structurally insufficient:

Merge decision flow. A shared Wikidata QID merges immediately as identity rather than similarity. Otherwise a composite score weighs name similarity 0.5, birth date 0.3, and position held 0.2 (0.7 and 0.3 when birth date is missing). At or above 0.85 the pair auto-merges, 0.70 to 0.85 goes to a human review queue, below 0.70 the records stay distinct. A footnote records that phonetic similarity of 0.90 or more merges only with an exact birth date match or strong position overlap. — Figure 4. The merge path. With name capped at half the composite weight (0.7 when birth date is missing), a perfect name score alone tops out below the 0.85 bar. The Touré/Traoré pair cannot merge here even if blocking had let them meet.

The composite weighting is the quiet safeguard: name similarity carries 0.5 of the score, birth date 0.3, position 0.2, and when birth date is missing the name weight rises only to 0.7. A perfect 1.0 name with nothing else agreeing therefore lands at 0.7 at best (0.5 when a birth date is present and disagrees): review queue at most, never an automatic merge. The system is arranged so that the only routes to an automatic merge are identity (a shared Wikidata QID) or multi-field corroboration. Sounding identical, which Metaphone happily reports for many different West African surnames, merges nothing by itself, ever.
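The weights and thresholds above fit in a short function. This is a sketch built only from the numbers in this post, not the repo's code: the function name, argument names, and return labels are mine, and the extra rule for phonetic-only similarity from Figure 4's footnote is not modelled.

```python
from typing import Optional

AUTO_MERGE = 0.85
REVIEW = 0.70


def merge_decision(
    name_sim: float,
    position_sim: float,
    birth_date_sim: Optional[float] = None,  # None means the birth date is missing
    shared_qid: bool = False,
) -> str:
    """Return 'merge', 'review', or 'distinct' for one candidate pair."""
    if shared_qid:
        return "merge"  # identity, not similarity
    if birth_date_sim is None:
        score = 0.7 * name_sim + 0.3 * position_sim
    else:
        score = 0.5 * name_sim + 0.3 * birth_date_sim + 0.2 * position_sim
    if score >= AUTO_MERGE:
        return "merge"
    if score >= REVIEW:
        return "review"
    return "distinct"


print(merge_decision(name_sim=1.0, position_sim=0.0))                       # review
print(merge_decision(name_sim=0.94, position_sim=0.0, birth_date_sim=0.0))  # distinct
```

The first call is a perfect name with no birth date on file: 0.7, so a human looks at it. The second is the Touré/Traoré shape, a 0.94 name with a conflicting birth date and no shared position: 0.47, nowhere near a merge.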

When rules run out: probabilistic linkage

Threshold rules encode intuition, but they treat all evidence as equally surprising, and it is not. Two records agreeing on nationality means almost nothing (everyone in the block agrees on nationality); two records agreeing on an exact birth date means a great deal. The classical framework that formalizes this is Fellegi-Sunter (1969), and it is worth understanding because it is more than fifty years old, unglamorous, and still what production record-linkage systems run on.

Evidence ledger diagram for one candidate pair. Each field contributes a signed weight equal to the log ratio of its agreement probability among true matches (m) to its agreement probability by chance (u). Exact birth date agreement gets a long positive bar, name agreement in the top Jaro-Winkler band a strong bar, Metaphone-only agreement a moderate bar, nationality agreement a tiny bar, and a position disagreement a negative bar. The weights sum to a match probability with gates at 0.90 for human review and 0.99 plus corroboration for auto-merge. — Figure 5. Fellegi-Sunter in one picture: every field votes with a weight of log(m/u), so evidence counts in proportion to how unlikely it is by chance. Implemented in AfricaPEP with Splink over DuckDB as an offline pass.

For each field you estimate two probabilities: m, how often true matches agree on it, and u, how often random non-matches agree by coincidence. When the field agrees, its evidence weight is log(m/u); when it disagrees, the weight is log((1-m)/(1-u)). Birth dates have a tiny u, so agreement is worth a lot, and where true matches almost always agree (m near 1) disagreement is heavily negative. Nationality inside a nationality-blocked candidate set has u near 1, so it is worth almost nothing, and the math knows that even when an eyeballed rule would not. Names enter not as one signal but as a ladder of comparison levels: exact, Jaro-Winkler above 0.92, Metaphone codes agree, Jaro-Winkler above 0.82, else. Splink, the open-source Fellegi-Sunter implementation from the UK Ministry of Justice, estimates the m and u values with expectation-maximization, meaning the dataset itself calibrates how much each agreement should count.
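A worked sketch of the arithmetic, independent of Splink. The m and u values below are invented for illustration, not estimates from the AfricaPEP data, and the function names are mine. Weights are in base 2 here; the base only rescales them. In the standard formulation a field that disagrees contributes log((1-m)/(1-u)) rather than log(m/u).

```python
from math import log2


def agreement_weight(m: float, u: float) -> float:
    """Evidence when the field agrees."""
    return log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Evidence when the field disagrees (negative when m > u)."""
    return log2((1 - m) / (1 - u))


def match_probability(prior: float, weights: list[float]) -> float:
    """Combine a prior match probability with summed log2 weights."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds * 2 ** sum(weights)
    return odds / (1 + odds)


# Hypothetical (m, u) per field, chosen only to show the contrast.
FIELDS = {
    "birth_date": (0.95, 0.001),
    "nationality": (0.99, 0.95),
}

for field, (m, u) in FIELDS.items():
    print(f"{field}: agree {agreement_weight(m, u):+.2f}, "
          f"disagree {disagreement_weight(m, u):+.2f}")
```

With these made-up inputs, agreeing on birth date is worth nearly ten bits of evidence while agreeing on nationality is worth a small fraction of one, which is the asymmetry the paragraph above describes.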

In AfricaPEP this runs strictly offline (it is a 400 MB dependency that has no business in a screening API image), with the policy an AML system demands: auto-merge only at 99 percent match probability with corroborating evidence, human review from 90 to 99, nothing silent below that.

And it produced my favorite bug in the whole project. Splink blocked candidate pairs on phonetic surname codes. For names whose script survives ASCII folding badly (Amharic, some Arabic forms), the Metaphone of the folded string is empty. Empty equals empty. Entire clusters of unrelated non-Latin names began agreeing on their blank phonetic key, and the model dutifully weighted that agreement as evidence. The fix is one idea: a derived field that could not be computed is NULL, not "", because NULL never equals anything, including itself.

-- before: blank phonetic keys "agree" with each other
metaphone(fold(name)) = ''   -- for every non-Latin name

-- after: incomputable evidence is absent, not equal
CASE WHEN metaphone_key = '' THEN NULL ELSE metaphone_key END

If you build matching over multilingual data, some derived field somewhere can be empty, and somewhere downstream two empties will be compared. Audit for it before it audits you.
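The failure is reproducible in a few lines. This sketch uses SQLite from the Python standard library as a stand-in for the DuckDB setup described above; the table and ids are made up, and rows 1 and 2 play the role of two unrelated names whose phonetic key came out blank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (id INTEGER PRIMARY KEY, metaphone_key TEXT)")
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?)",
    [(1, ""), (2, ""), (3, "MHMT"), (4, "MHMT")],
)

# Self-join that pairs profiles whose blocking keys compare equal.
PAIRS = (
    "SELECT a.id, b.id FROM profiles a JOIN profiles b "
    "ON a.id < b.id AND {a} = {b} ORDER BY a.id, b.id"
)

before = conn.execute(
    PAIRS.format(a="a.metaphone_key", b="b.metaphone_key")
).fetchall()

# NULLIF(x, '') turns an empty key into NULL, and NULL = NULL is not true.
after = conn.execute(
    PAIRS.format(a="NULLIF(a.metaphone_key, '')", b="NULLIF(b.metaphone_key, '')")
).fetchall()

print(before)  # [(1, 2), (3, 4)]
print(after)   # [(3, 4)]
```

The blank-key pair (1, 2) disappears once the empty string becomes NULL, while the genuine agreement on MHMT survives.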

Evaluate on pairs that want to hurt you

The previous version of this post cited an eval I could not reproduce, so let me say plainly how this is measured now. The repo carries a fixture of 23 adversarial pairs: 12 that must match (transliteration clusters, order swaps, initials) and 11 that must not (the Touré/Traoré genre), each pinned to Wikidata QIDs so ground truth is checkable by anyone. One script reproduces this table:

Decision rule	Precision	Recall	F1
Screening gate (0.75, recall-first)	0.71	1.00	0.83
High gate (0.90), orthographic only	0.91	0.83	0.87
High gate (0.90), orthographic + phonetic	0.92	1.00	0.96

Twenty-three pairs is a small fixture, and I prefer it small and hostile to large and flattering. Sample random pairs from any name database and the overwhelming majority are trivially different; a matcher scores 99 percent on that eval while failing every case that matters. Adversarial fixtures invert the economics: every pair earns its place by having broken something once, each addition is reviewed like code, and the fixture doubles as executable documentation of exactly what the matcher promises. It is the same philosophy as regression tests, applied to a statistical component.
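For readers who want to check the table's arithmetic, here is a small sketch. The confusion counts are not stated in this post; I back-derived them as the only whole numbers consistent with the rounded table and the 12/11 split of the fixture, so treat them as an inference.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# (true positives, false positives, false negatives): inferred, not published.
RULES = {
    "Screening gate (0.75, recall-first)": (12, 5, 0),
    "High gate (0.90), orthographic only": (10, 1, 2),
    "High gate (0.90), orthographic + phonetic": (12, 1, 0),
}

for rule, counts in RULES.items():
    p, r, f1 = precision_recall_f1(*counts)
    print(f"{rule}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On a fixture this small, one pair moves precision by several points, which is worth keeping in mind when comparing rows.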

What about embeddings and LLMs?

The fashionable answer to name matching in 2026 is to embed both strings and take a cosine similarity, or to ask an LLM. I use neither in this path, and the reasons are architectural rather than aesthetic. Character-level trigrams and phonetics already capture most of what an embedding learns about surface form, at four orders of magnitude less compute and with exact explainability: I can tell a regulator that a match fired because the Metaphone codes agreed and Jaro-Winkler read 0.94, and I can put those two numbers in the API response, which AfricaPEP does. An embedding score of 0.83 explains nothing, and an LLM verdict is not even stable across calls. Where learned models genuinely help is the layer above strings (deciding whether two profiles refer to one person given positions, dates, and networks), and there the production answer is already in this post: a probabilistic model whose weights are interpretable line by line. In compliance, explainability is not a preference. It is a filing requirement.

The checklist I actually use

Normalise aggressively, but keep the original. Strip honorifics, fold diacritics, fix Mc/O' casing, and store every raw form you saw. Normalisation loses information by design; provenance is how you get it back.
Generate variants at ingest, not query time. Surname-first forms, initials, prefix alternates (Mohammed/Muhammad, ben/ibn/bin). Figure 2's Rawlings row is unwinnable at query time and free at ingest time.
Combine scorers with max, not average, when the goal is recall; average launders one scorer's blind spot into everyone's score.
Set thresholds by the cost of the action, not by the shape of the score distribution: 0.75 to show a human, 0.85 multi-field to merge, 0.99 for an offline batch pass.
Make name similarity structurally insufficient for destructive actions. Weights, corroboration requirements, and review queues exist so that row five of Figure 2 cannot hurt you.
Treat incomputable evidence as NULL, never as empty string, and evaluate on adversarial pairs pinned to stable identifiers.
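The first two checklist items can be sketched with the standard library. This is an illustration, not AfricaPEP's ingest code: the `HONORIFICS` set is a hypothetical starter list, the function names are mine, and the variant generator assumes given-names-first Latin-script input.

```python
import unicodedata

HONORIFICS = {"alhaji", "nana", "general", "gen", "rtd"}  # hypothetical starter list


def fold(text: str) -> str:
    """Drop combining marks so that 'Touré' folds to 'Toure'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(raw: str) -> str:
    """Fold diacritics and strip honorific tokens."""
    tokens = [token.strip(".,") for token in fold(raw).split()]
    return " ".join(t for t in tokens if t and t.casefold() not in HONORIFICS)


def variants(name: str) -> set[str]:
    """Surname-first and initial forms, generated once at ingest."""
    tokens = name.split()
    if len(tokens) < 2:
        return {name}
    *given, surname = tokens
    given_names = " ".join(given)
    initials = " ".join(f"{g[0]}." for g in given)
    return {name, f"{surname}, {given_names}", f"{surname} {given_names}",
            f"{initials} {surname}"}


def ingest(raw: str) -> dict:
    """Keep the raw form next to everything derived from it."""
    normalised = normalise(raw)
    return {"raw": raw, "normalised": normalised,
            "variants": sorted(variants(normalised))}


print(ingest("Gen. Muhammadu Buhari rtd."))
```

Keeping `raw` in the stored record is the provenance half of the first item: the normalised form is what gets matched, and the raw form is what an analyst can audit.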

All of this is running and inspectable: screen a name at pep.patrickaiafrica.com and the response carries the match scores and method described here, or read the matching stack in the AfricaPEP repo. The wider system this sits in, the four-branch Wikidata pipeline, the graph of record, the production ops, is covered in the AfricaPEP case study, and the downstream use of these matches in monitoring is in the AML piece. Open issues exist for Arabic transliteration and additional African language phonetics; if that is your domain, the repo would love your help.

Sources and further reading

Fellegi & Sunter, A Theory for Record Linkage (JASA, 1969) (the framework behind every serious linkage system since)
Splink documentation, UK Ministry of Justice (open-source Fellegi-Sunter at scale)
Winkler, Overview of Record Linkage and Current Research Directions (US Census Bureau, 2006)
PostgreSQL pg_trgm documentation (trigram indexes for candidate generation)
RapidFuzz and jellyfish (the scoring libraries used throughout this post)

Patrick Attankurugu

Senior AI Engineer at Agregar Technologies, building production AML, KYC, and compliance AI systems for African financial institutions. Creator and maintainer of AfricaPEP.