Hey all! We’re working on some refinements for our yente matching API and in order to do good quality assurance on this I’m collecting some ground truth for complex name matching problems here:
If anyone has an idea for an example we should test for which they can share without revealing customer PII, please post it here
Hello. I noticed that one of the examples was that a title such as “Mr.” could be added to a name in error (…or perhaps deliberately, in the hope of evading a check or control). A similar possibility would be the addition of post-nominals at the end of a name. In the UK these can be honours under the UK honours system (such as OBE), academic qualifications (such as BA, BSc and PhD), professional qualifications (such as ACA, CEng), political offices (MP) and religious societies (SSF). In addition, men can use “Esq.” as a post-nominal in place of “Mr”.
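To make this concrete, here is a minimal sketch of stripping trailing post-nominals before comparison. The list of post-nominals is illustrative and far from exhaustive, and `strip_post_nominals` is a hypothetical helper, not part of any existing API:

```python
import re

# Illustrative, non-exhaustive set of UK post-nominals.
POST_NOMINALS = {"OBE", "MBE", "CBE", "BA", "BSc", "PhD", "ACA", "CEng", "MP", "SSF", "Esq"}

def strip_post_nominals(name: str) -> str:
    """Remove trailing post-nominal tokens, along with separating commas/periods."""
    tokens = re.split(r"[\s,]+", name.strip())
    while tokens and tokens[-1].rstrip(".") in POST_NOMINALS:
        tokens.pop()
    return " ".join(tokens)

print(strip_post_nominals("John Smith OBE"))      # John Smith
print(strip_post_nominals("Jane Doe, BSc, PhD"))  # Jane Doe
```

A real implementation would also need to handle case variants and post-nominals that collide with ordinary name tokens.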
Ooooh this is a fun resource, thank you for sharing it! As you know, a list of people who don’t exist is actually a very valuable asset for us in terms of testing the overall scoring system - we can just assume that every row in it is a negative match against every sanctions list. These sorts of “external truth” assets have gotten us a lot of mileage already.
Regarding prefix removal on names, we’re actually doing a lot of build-up on name reference data, including a prefix list here: rigour/resources/names/stopwords.yml at main · opensanctions/rigour · GitHub. We need to make this more broadly findable at some point (check the org types file in the same folder).
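As a rough sketch of how such a prefix list could be applied before matching - the prefix set below is a hypothetical stand-in for the stopwords.yml data, whose actual schema may differ:

```python
# Hypothetical stand-in for a prefix/stopword list like rigour's stopwords.yml.
NAME_PREFIXES = {"mr", "mrs", "ms", "dr", "prof", "sir"}

def strip_prefixes(name: str) -> str:
    """Drop leading title tokens, comparing case-insensitively and ignoring periods."""
    tokens = name.lower().replace(".", "").split()
    while tokens and tokens[0] in NAME_PREFIXES:
        tokens.pop(0)
    return " ".join(tokens)

print(strip_prefixes("Dr. John Smith"))  # john smith
```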
One approach: if you run entity resolution in Senzing on a collection of datasets, each resolved entity will carry its list of values for features such as names.
In other words, one byproduct of running ER is a domain-specific thesaurus, plus data quality metrics for the features.
From the results you can then see which name variants are related, and which are the most commonly used.
Might this help develop data and tests for name matching?
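The idea of deriving a name-variant thesaurus from ER output can be sketched like this - the record layout is an assumption for illustration, not Senzing’s actual export format:

```python
from collections import Counter, defaultdict

# Assumed shape: (resolved_entity_id, observed_name) pairs from an ER export.
records = [
    ("E1", "ORION LLC"), ("E1", "OOO ORION"), ("E1", "ORION LLC"),
    ("E2", "John Smith"), ("E2", "J. Smith"),
]

# Group name variants per resolved entity and rank by observed frequency.
variants = defaultdict(Counter)
for entity_id, name in records:
    variants[entity_id][name] += 1

for entity_id, counts in variants.items():
    print(entity_id, counts.most_common())
```

Each entity’s `most_common()` list is exactly the “which variants are related, and which are most popular” signal described above.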
We do have quite a bit of that sort of data, just also from merging entities across sanctions lists. (eg: One fun asset that we have and should talk more about is a pairwise match file of person and companies that’s generated off the main OS data.)
What I’m trying to chase down at the moment is a more domain-inspired typology of name-matching error types.
For example:
A screening system should consider John B. Roberts and John A. Roberts to be different people - especially if we know they’re in America…
LLC ORION and ORION OOO are the same Russian company,
Ben Netanyahu and Benjamin Netanyahu - is that a match? Does that get too broad?
…
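The LLC ORION / ORION OOO case above can be handled by normalizing legal-form tokens before comparison. A minimal sketch, with an illustrative (not authoritative) alias set:

```python
# Illustrative set of company legal-form tokens across jurisdictions.
LEGAL_FORMS = {"llc", "ooo", "ltd", "gmbh", "sa", "oao", "zao"}

def normalize_company(name: str) -> str:
    """Lowercase, drop legal-form tokens, and sort the rest so word order is ignored."""
    tokens = [t for t in name.lower().replace(".", "").split()
              if t not in LEGAL_FORMS]
    return " ".join(sorted(tokens))

print(normalize_company("LLC ORION") == normalize_company("ORION OOO"))  # True
```

Sorting the remaining tokens also covers reordered forms; a production system would use a curated org-types list rather than this hard-coded set.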
We often get screening false positives sent in by people (unfortunately, we rarely get false negatives!), and I think those can serve as a regression harness on the API, to make sure we at least don’t make the same mistake twice.
There is obviously more than one way to approach this. One approach would be to try to work out how many people in the ‘relevant population’ have the same name. The relevant population could be people living in France, or perhaps Europe, or even the world. The population need not be defined solely in terms of geography: we might be willing to limit it to people of working age. The number of people called “Peter Smith”, for example, may not be available, but it might be possible to find the number of people with first name “Peter” and the number with last name “Smith” and approximate something based on that. I am assuming that is how this site works: https://howmanyofme.com/
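The independence-based approximation described above is just a multiplication. The frequencies below are made-up illustrative values, not real census figures:

```python
# Rough estimate of full-name carriers, assuming first and last names
# are independent. All figures here are illustrative, not real data.
population = 67_000_000  # e.g. roughly the UK
p_first = 0.004          # assumed share of people named "Peter"
p_last = 0.009           # assumed share of people surnamed "Smith"

expected = population * p_first * p_last
print(round(expected))   # 2412
```

The independence assumption is exactly what the caveats in the next paragraph undermine: correlations between name elements, cohort effects and cultural avoidance all push the true count away from this estimate.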
This would be imperfect. For example, parents might purposefully avoid names like “Jack Jackson” even though both elements are individually quite common. Parents may also try to avoid the names of celebrities, well-known criminals, fictional characters and names that could lead to obvious jokes. There could also be higher or lower frequencies for names that rhyme and for alliterative names. Some names may be more or less common in higher socioeconomic groups, which are more likely to produce legislators and senior judges. First names come in and out of fashion: in the UK, for example, Alexa has gone out of fashion, probably due to Amazon’s voice assistant. Where year of birth is available, these differences could be allowed for. One further observation is that some first names are associated with months of birth, so in the UK women called April, May or June are presumably less likely to have been born in December than those called Anne, Mary and Jane. There is also the issue of nominative determinism.
Another complication is that some shortened versions of names are also full names in their own right for example Elizabeth shortens to Beth but not every Beth is an Elizabeth. This would obviously increase the potential for false positives.
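The Beth/Elizabeth asymmetry can be modelled as a one-to-many nickname map, where a short form expands to several candidate full names. The mapping below is illustrative only:

```python
# Illustrative nickname map: a short form may correspond to several
# full names, including itself (not every Beth is an Elizabeth).
NICKNAMES = {
    "beth": {"elizabeth", "bethany", "beth"},
    "liz": {"elizabeth"},
}

def candidate_full_names(name: str) -> set:
    """Return every full name a given (possibly shortened) name could stand for."""
    return NICKNAMES.get(name.lower(), {name.lower()})

print(candidate_full_names("Beth"))
```

Because “Beth” fans out to multiple candidates, any matcher that expands nicknames this way trades recall for exactly the increased false-positive potential described above.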
That’s a really thoughtful take — I especially like the idea of narrowing the “relevant population” beyond geography. It reminds me of how name frequency data can reveal surprising cultural patterns.
If you’re curious about how common certain names are in the U.S., you can try running them through howmanyofme.com. It estimates how many people share a given first and last name based on census data. It’s not perfect, but it’s a fun way to see the scale of name overlap before diving into deeper analysis.