Call for false positives: help us build out a great set of name tests

Hey all! We’re working on some refinements for our yente matching API and in order to do good quality assurance on this I’m collecting some ground truth for complex name matching problems here:

If anyone has an idea for an example we should test for, which they can share without revealing customer PII, please post it here :slight_smile:


The UK Information Commissioner’s Office publishes some dummy data here: https://ico.org.uk/media/1432969/exampledataset.csv. The way they generated it is explained here: https://ico.org.uk/media/for-organisations/documents/2021/2618998/how-to-disclose-information-safely-20201224.pdf. As these people are made up, they should all be false positives (though there could be coincidental name matches).

Hello. I noticed that one of the examples was that a title such as “Mr.” could be added to a name in error (or perhaps deliberately, in the hope of evading a check or control). A similar possibility is the addition of post-nominals at the end of a name. In the UK these can be honours from the UK honours system of orders, decorations and medals (such as OBE), academic qualifications (such as BA, BSc and PhD), professional qualifications (such as ACA and CEng), political offices (MP) and religious societies (SSF). In addition, men can use “Esq.” as a post-nominal in place of “Mr”.
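As a sketch of what handling this might look like, here is a minimal normaliser that strips leading titles and trailing post-nominals. The prefix and post-nominal lists below are small illustrative samples I made up for the example, not rigour's actual reference data:

```python
import re

# Illustrative (not exhaustive) samples of UK-style prefixes and post-nominals.
PREFIXES = {"mr", "mrs", "ms", "dr", "sir", "dame"}
POST_NOMINALS = {"obe", "mbe", "cbe", "kbe", "ba", "bsc", "phd",
                 "aca", "ceng", "mp", "ssf", "esq"}

def strip_titles(name: str) -> str:
    """Remove leading titles and trailing post-nominals from a name."""
    tokens = re.sub(r"[.,]", "", name).split()
    while tokens and tokens[0].lower() in PREFIXES:
        tokens.pop(0)
    while tokens and tokens[-1].lower() in POST_NOMINALS:
        tokens.pop()
    return " ".join(tokens)
```

One design wrinkle: stripping has to be directional (prefixes only at the front, post-nominals only at the end), otherwise a surname that happens to collide with an abbreviation gets eaten.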

Ooooh this is a fun resource, thank you for sharing it! As you know, a list of people who don’t exist is actually a very valuable asset for us in terms of testing the overall scoring system — we can just assume that every row in this is a negative match against every sanctions list. These sorts of “external truth” things have gotten us a lot of mileage already :slight_smile:
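A sketch of how a dummy dataset like this could be turned into negative test cases. The column names here are invented for illustration; the real ICO file's headers may differ:

```python
import csv
import io

def load_dummy_names(csv_text: str, first_col: str, last_col: str) -> list:
    """Read a dummy-data CSV and return full names; every name is
    assumed to be a negative (non-)match against every sanctions list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [f"{row[first_col]} {row[last_col]}".strip() for row in reader]

# Hypothetical headers and rows, standing in for the ICO example dataset.
sample = "Forename,Surname\nAlice,Example\nBob,Sample\n"
names = load_dummy_names(sample, "Forename", "Surname")
```

Each returned name can then be screened against the live list; any hit above threshold is a scored false positive for the QA harness.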

Regarding prefix removal on names, we’re actually doing a lot of build-out on name reference data, including a prefix list here: rigour/resources/names/stopwords.yml at main · opensanctions/rigour · GitHub. We need to make this more broadly findable at some point (check the org types file in the same folder).

I have raised a pull request here: Update stopwords.yml by confirmordeny · Pull Request #28 · opensanctions/rigour · GitHub - hope this is of some help.


This is fantastic! You are, according to the data, The Worshipful!

Thank you for your kind words. I have been called many things but that’s a first!

I have submitted another pull request, this time for OBE, MBE, etc.: Update symbols.yml - UK/British Orders of Knighthood by confirmordeny · Pull Request #30 · opensanctions/rigour · GitHub

Could you look at every human on Wikidata with a date of death before 1800 (say), and at any instances of ‘fictional human’: fictional human - Wikidata?
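For what it's worth, queries along those lines could look roughly like the following. These are untested sketches of Wikidata SPARQL built as plain strings, under the assumption that Q5 = human, P570 = date of death, and Q15632617 = fictional human:

```python
def fictional_humans_query(limit: int = 100) -> str:
    """SPARQL for instances of 'fictional human' (Q15632617) on Wikidata."""
    return f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P31 wd:Q15632617 .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT {limit}
    """

def deceased_before_query(year: int, limit: int = 100) -> str:
    """SPARQL for humans (Q5) with a date of death (P570) before `year`."""
    return f"""
    SELECT ?item ?itemLabel WHERE {{
      ?item wdt:P31 wd:Q5 ;
            wdt:P570 ?death .
      FILTER(YEAR(?death) < {year})
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT {limit}
    """
```

Either query string could be sent to the public endpoint at https://query.wikidata.org/sparql (e.g. via urllib with `format=json`); the pre-1800 query would likely need paging, as it matches a very large set.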

One approach: if you run entity resolution in Senzing on a collection of datasets, each resolved entity will carry its own lists of values for features such as names.

In other words, one of the byproducts of running ER is a domain-specific thesaurus, plus data quality metrics for the features.

Then, from the results, you can see which name variants are related, and which are relatively the most popularly used.

Might this help develop data and tests for name matching?
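If I understand the idea, the variant-frequency byproduct could be sketched like this. The cluster structure below is a hypothetical stand-in, not Senzing's actual output format:

```python
from collections import Counter

def name_variant_stats(resolved_entities: dict) -> dict:
    """For each resolved entity (mapping entity id -> list of name
    variants observed across source records), count how often each
    variant appears. The most frequent variant suggests a canonical form."""
    return {eid: Counter(names) for eid, names in resolved_entities.items()}

# Hypothetical ER output: one entity seen under three name spellings.
clusters = {"e1": ["ORION OOO", "LLC ORION", "ORION OOO"]}
stats = name_variant_stats(clusters)
```

`stats["e1"].most_common()` then ranks the spellings, which is exactly the "which variants are related, and which are most popular" signal described above.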


We do have quite a bit of that sort of data, just also from merging entities across sanctions lists. (E.g. one fun asset that we have, and should talk more about, is a pairwise match file of persons and companies that’s generated off the main OS data.)

What I’m trying to chase down at the moment is a more domain-inspired typology of name-matching error types.

For example:

  • A screening system should consider John B. Roberts and John A. Roberts to be different people, especially if we know they’re in America.
  • LLC ORION and ORION OOO are the same Russian company.
  • Ben Netanyahu and Benjamin Netanyahu: is that a match? Does that get too broad?
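One way to pin a typology like this down is as hand-labelled cases that any candidate matcher can be scored against. A minimal sketch, with a deliberately naive placeholder matcher standing in for a real scoring function:

```python
# Hand-labelled ground truth: (query, candidate, expected_match).
# Labels follow the typology above; the Netanyahu case is debatable.
CASES = [
    ("John B. Roberts", "John A. Roberts", False),  # conflicting middle initials
    ("LLC ORION", "ORION OOO", True),               # legal-form token reordering
    ("Ben Netanyahu", "Benjamin Netanyahu", True),  # nickname expansion
]

def evaluate(matcher, cases=CASES) -> list:
    """Return the cases a matcher gets wrong."""
    return [(q, c, want) for q, c, want in cases if matcher(q, c) != want]

# Placeholder matcher: exact match after lowercasing. It passes the
# negative case but misses both positives.
naive = lambda a, b: a.lower() == b.lower()
failures = evaluate(naive)
```

Each new error type then becomes one more labelled row, and a regression run is just `evaluate()` against the production matcher.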

We often get screening false positives sent in by people (unfortunately, we rarely get false negatives!), and I think those bits can serve as a harness on the API to make sure we at least don’t make the same mistake twice :slight_smile:


I believe there are false negatives here (as in matches that could have been made but were not): Search - OpenSanctions

Libyan Investment Authority: there appear to be three entries here: Search - OpenSanctions

Public Investment Fund (Saudi Arabia): Search - OpenSanctions - two entries

There is obviously more than one way to approach this. One approach would be to work out how many people in the ‘relevant population’ have the same name. The relevant population could be people living in France, or perhaps Europe, or even the world. The population need not be defined solely in terms of geography; we might be willing to limit it to people of working age. The number of people called “Peter Smith”, for example, may not be available, but it might be possible to find the number of people with the first name “Peter” and the number with the last name “Smith” and approximate something based on that. I am assuming that is how this site works: https://howmany-ofme.com/
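The approximation described here (treating first and last names as independent) is a one-liner. The figures below are invented for illustration, not real census data:

```python
def expected_namesakes(first_freq: float, last_freq: float,
                       population: int) -> float:
    """Estimate how many people share a full name, assuming first and
    last name frequencies are independent (an approximation: real
    names are correlated by culture, region and era)."""
    return first_freq * last_freq * population

# Made-up example: 1% named Peter, 0.6% surnamed Smith, population 60M.
estimate = expected_namesakes(0.01, 0.006, 60_000_000)
```

The next post's caveats (avoided combinations, fashion cycles, socioeconomic skew) are exactly the correlations this independence assumption ignores, so the estimate is at best an order-of-magnitude prior.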

This would be imperfect. For example, parents might purposefully avoid names like “Jack Jackson” even though both elements are individually quite common. Parents may also try to avoid the names of celebrities, well-known criminals, fictional characters, and names that could lead to obvious jokes. There could also be higher or lower frequencies for names that rhyme, and for alliterative names. Also, some names may be more or less common in the higher socioeconomic groups whose members are more likely to become legislators and senior judges.

First names come in and out of fashion. For example, in the UK, Alexa has gone out of fashion, probably due to Amazon’s voice assistant. Where year of birth is available, these differences could be allowed for. One further observation is that some first names are associated with months of birth, so in the UK women called April, May or June are presumably less likely to have been born in December than those called Anne, Mary and Jane. There is also the issue of nominative determinism.

Another complication is that some shortened versions of names are also full names in their own right: for example, Elizabeth shortens to Beth, but not every Beth is an Elizabeth. This would obviously increase the potential for false positives.
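A directional nickname table captures this asymmetry: a short form fans out to the full names it might stand for, while remaining a valid name in its own right. The map below is a tiny illustrative sample, not a real reference list:

```python
# Illustrative, directional nickname map: short form -> possible full forms.
# "Beth" may expand to "Elizabeth" or "Bethany", but a Beth is not
# necessarily either, so the short form is always kept as a candidate.
NICKNAMES = {
    "beth": {"elizabeth", "bethany"},
    "liz": {"elizabeth"},
    "ben": {"benjamin"},
}

def possible_full_names(name: str) -> set:
    """Return the full names a short form might stand for, plus the
    name itself, since a nickname can also be a given name outright."""
    key = name.lower()
    return {key} | NICKNAMES.get(key, set())
```

Keeping the mapping one-directional is the point: expanding "Beth" to "Elizabeth" for recall is reasonable, but scoring that pair as strongly as an exact match is what inflates false positives.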

That’s a really thoughtful take; I especially like the idea of narrowing the “relevant population” beyond geography. It reminds me of how name frequency data can reveal surprising cultural patterns.

If you’re curious about how common certain names are in the U.S., you can try running them through howmanyofmes.com. It estimates how many people share a given first and last name based on census data. Not perfect, but it’s a fun way to see the scale of name overlap before diving into deeper analysis.
