I wanted to share a model-in-progress for person and company name matching we’re working on with the Open Sanctions community. It is a work in progress and doesn’t work quite right yet, but I wanted to gauge interest in the idea. It uses the Open Sanctions Matcher training data to fine-tune a pre-trained paraphrase model to improve its performance for matching person and company names. This lets you compare one or more sets of names using cosine similarity on the text embedding vectors.
At this point I’m looking for feedback on the idea. When it actually works, it will automagically incorporate lessons learned from the 2.3M unique pairs of names in the Matcher training data. The library comes with a sweet CLI that lets you download the Matcher training data, run a report on the labels, train a model, and then test inference.
Some information from the README:
Deep fuzzy matching people and company names for multilingual entity resolution using representation learning… that incorporates a deep understanding of people and company names and works much better than string distance methods.
This project is a deep fuzzy matching system for entity resolution using representation learning. It is designed to match people and company names across languages and character sets, using a pre-trained text embedding model from HuggingFace that we fine-tune using contrastive learning on 2 million labeled pairs of person and company names from the Open Sanctions Matcher training data. The project includes a command-line interface (CLI) utility for training the model and comparing pairs of names using cosine similarity.
Matching people and company names is an intractable problem using traditional parsing-based methods: there is too much variation across cultures and jurisdictions for humans to solve it by programming rules. Machine learning is used for culturally dependent problems like this one, where hand-coding a solution approaches infinite complexity, to write the program automatically. Since 2008 there has been an explosion of deep learning methods that automate feature engineering via representation learning, including text embeddings. This project loads the pre-trained paraphrase-multilingual-MiniLM-L12-v2 paraphrase model from HuggingFace and fine-tunes it for the name matching task using contrastive learning on more than 2 million labeled pairs of matching and non-matching (just as important) person and company names from the Matcher training data to create a deep fuzzy matching system for entity resolution.
This model is available on HuggingFace Hub as Graphlet-AI/eridu and can be used in any Python project using the Sentence Transformers library. The model is designed to be used for entity resolution tasks, such as matching people and company names across different languages and character sets.
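A minimal sketch of the cosine-similarity comparison described above. The helper names (`cosine_similarity`, `rank_names`, `load_eridu_encoder`) are made up for illustration and are not part of the eridu library; the only real API assumed is `SentenceTransformer(...).encode` from Sentence Transformers, which is imported lazily so the scoring code can be tried with any encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_names(
    query: str,
    candidates: Sequence[str],
    encode: Callable[[List[str]], Sequence[Vector]],
) -> List[Tuple[str, float]]:
    """Score every candidate name against the query, best match first."""
    vectors = encode([query, *candidates])
    query_vec, cand_vecs = vectors[0], vectors[1:]
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in zip(candidates, cand_vecs)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def load_eridu_encoder() -> Callable[[List[str]], Sequence[Vector]]:
    """Hypothetical helper: return the encode method of the published model.

    Requires `pip install sentence-transformers` and downloads the model
    from the HuggingFace Hub on first use.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer("Graphlet-AI/eridu").encode
```

With the library installed, something like `rank_names("Jasmeet Singh", ["Jasmeet Hakimzada", "Samer Rayya"], load_eridu_encoder())` should return the candidates ordered by similarity; the scores depend on the current state of the published model.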
So there was a problem in training the first model, and I am retraining it today. I believe the bug was that the best model was not being saved…
You can see today’s work here. I am altering the model saving to be epoch-based rather than step-based… I think the model was testing well but performing poorly because it was not saving the actual best model.
These changes are the basis for today’s training run with these settings:
Once I train it, I will do a deep dive into the results of classifying name pairs in the test set to understand if and why the actual best model is ranking dissimilar names too similarly. My fallback options are to split the person and company name models in two to make it easier to learn each one, and to try OnlineContrastiveLoss, which selects the best pairs to use as training data rather than the 10% random sampling I am currently using each epoch.
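To make the difference between the two losses concrete, here is a plain-Python sketch. It is loosely modeled on how I understand `ContrastiveLoss` and `OnlineContrastiveLoss` in Sentence Transformers, not copied from the library: distances are assumed to be cosine distances (1 minus cosine similarity), the margin of 0.5 is an assumed default, and the function names are illustrative only.

```python
from typing import List, Sequence, Tuple


def contrastive_loss(
    distances: Sequence[float], labels: Sequence[int], margin: float = 0.5
) -> float:
    """Mean contrastive loss over every pair in the batch.

    Matching pairs (label 1) are penalized by their squared distance;
    non-matching pairs (label 0) are penalized only while closer than the margin.
    """
    total = 0.0
    for d, y in zip(distances, labels):
        if y == 1:
            total += d ** 2
        else:
            total += max(0.0, margin - d) ** 2
    return 0.5 * total / len(distances)


def select_hard_pairs(
    distances: Sequence[float], labels: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Keep only the pairs the model currently gets wrong-ish.

    Hard positives are matches farther apart than the closest non-match;
    hard negatives are non-matches closer together than the farthest match.
    """
    positives = [d for d, y in zip(distances, labels) if y == 1]
    negatives = [d for d, y in zip(distances, labels) if y == 0]
    if not positives or not negatives:
        return positives, negatives
    hard_positives = [d for d in positives if d > min(negatives)]
    hard_negatives = [d for d in negatives if d < max(positives)]
    return hard_positives, hard_negatives


def online_contrastive_loss(
    distances: Sequence[float], labels: Sequence[int], margin: float = 0.5
) -> float:
    """Contrastive loss summed over the hard pairs only."""
    hard_pos, hard_neg = select_hard_pairs(distances, labels)
    positive_loss = sum(d ** 2 for d in hard_pos)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return positive_loss + negative_loss
```

The design point is that easy pairs (matches already close, non-matches already far apart) contribute nothing in the online variant, so each batch spends its gradient on the confusing pairs instead of on a random sample.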
@pudo does 21M unique pairs of labeled names - including both company and person - sound like the right number to you? This comes from pairing all names on the left with all names on the right, maintaining the label.
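For readers following along, the pairing step can be sketched like this. It assumes each labeled decision compares two entities that each carry a list of name aliases; `expand_pairs` is a hypothetical name, not a function from the eridu or Open Sanctions code.

```python
from itertools import product
from typing import List, Sequence, Tuple


def expand_pairs(
    left_names: Sequence[str], right_names: Sequence[str], match: bool
) -> List[Tuple[str, str, bool]]:
    """Pair every left alias with every right alias, carrying the label along.

    Repeated (left, right) combinations within one decision are emitted once.
    """
    seen = set()
    pairs = []
    for left, right in product(left_names, right_names):
        if (left, right) not in seen:
            seen.add((left, right))
            pairs.append((left, right, match))
    return pairs
```

One decision with three aliases on the left and four on the right yields 12 rows, so entities with many aliases contribute far more pairs than entities with a single name.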
I am curious what your team thinks about which data I might use versus what data I might filter out? What constitutes the best training data among these 21M labeled pairs?
I am pretty confused by the quantified results compared with my own use of the model. You can go to Graphlet-AI/eridu · Hugging Face and click on the HF Inference API and try a list of names compared to a given name. It behaves terribly. I am saving the test data as Parquet so I can dig in and figure out why it isn’t learning the semantics of names in any corner case yet scores well on the test metrics… over-fitting? Or is uploading to HuggingFace messed up? Got to debug it.
Yeah those numbers sound pretty accurate. Some of this is synthetic, e.g. made from just juxtaposing different Wikidata entries that have different names. That, in my mind, would be the first to go.
So I can tell it’s Wikidata if the data source is wiki or the ID starts with Q: on L175? (By the way, can you upgrade me so I can post links to GitHub and other places?)
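A rough sketch of the ID check being asked about, assuming each side of a pair carries an entity ID string. The function names are hypothetical, and this is only a heuristic: it flags pairs where both IDs look like Wikidata QIDs, which may also catch decisions that were made by hand, so it is better used to inspect candidates than to drop rows blindly.

```python
import re

# A Wikidata item ID is the letter Q followed by a positive integer, e.g. Q42.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def looks_like_qid(entity_id: str) -> bool:
    """True if the whole ID has the shape of a Wikidata item ID."""
    return QID_PATTERN.fullmatch(entity_id) is not None


def is_wikidata_pair(left_id: str, right_id: str) -> bool:
    """True if both sides of a labeled pair look like Wikidata items."""
    return looks_like_qid(left_id) and looks_like_qid(right_id)
```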
Yes, everyone really likes the founder, Sonal Goyal. She’s very nice, and I hear good things about the product. Back when I read the code more than a year ago, it was using string distance, but I do remember their method of sampling was really interesting…
Okay, I am examining the data and definitely see a lot of examples that are preventing the model from learning…
Same identity, different names. I don’t think an embedding alone can understand this unless the strings themselves contain some inferable logic, rooted in the culture of the name, that links the two.
| Type | Entity 1 | Entity 2 | Type 2 | Match |
| --- | --- | --- | --- | --- |
| ORG | SKLASS DRUG COMPANY | LANDSMAN PHARMACY | ORG | true |
| PER | JASMEET SINGH | JASMEET HAKIMZADA | PER | true |
| PER | RAYYA, Samer | Maher Samer Abu Hussein | PER | true |
Abbreviations… should be possible to learn, I think!
More progress… Claude Desktop is helping me direct Claude Code to do the work. It suggested I had data leakage from duplicates, so I created a duplicate section of the report… they’re on to something. Data leakage may be causing the overfitting… it offers the model a shortcut: it can learn just the exact matches and then fail on any other data… fail to generalize…
Here’s the data converted to Markdown tables:
Duplicate Analysis Summary

| Metric | Value |
| --- | --- |
| Total records | 26,632,762 |
| Unique records (by left_name, right_name) | 26,166,197 |
| Duplicate records | 466,565 (1.8%) |
| Top duplicate patterns found | 1,000 |
Top Duplicate Patterns

| Left Name | Right Name | Count |
| --- | --- | --- |
| ABOU MUSAB ABDELOUADOUD | ABOU MOUSSAAB ABDEL WADOUD | 46 |
| ABU MUSAB ABDELOUDOUD | ABOU MOUSAAB ABDELOUADOUD | 46 |
| ABU MUS’AB ABDELOUADOUD | ABDEL MALEK DROUKDEL | 46 |
| ABDEL MALEK DEROUDEL | ABD AL-MALIK DURIKDAL | 46 |
| DROUKDEL ABDELMALEK | ABOU MOSAAB ABKELWADOUD | 46 |
| DROKDAL ABDELMALEK | ABDELOUADOUR DROUKDEL | 46 |
| ABDELMALEK DROUKBEL | ABD-AL-MALIK DROKDAL | 46 |
| ABU MUS’AB ABDELOUADOUD | ABDEL MALEK DEROUDEL | 46 |
| ABU MOSSAAB ABDEL EL-WADOUD | ABD AL-MALIK DURIKDAL | 46 |
| ABU MOSSAB ABDELOUADOUD | ABDELOUADOUD ABOU MOSSAAH | 46 |
Sample Duplicate Records (Pattern appears 46 times)

| Left Name | Left Norm | Left FP | Left Lang | Left Category | Right Name | Right Norm | Right FP | Right Lang | Right Category | Match | Dist Norm | Dist FP | Score | Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3r455 |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr4f |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr4k |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr44 |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3r457 |
Claude Sonnet 4: This duplicate analysis reveals a critical insight about your overfitting! The fact that you have 466K duplicate records (1.8% of your dataset) with the exact same name pairs appearing multiple times across different sources explains part of your inflated metrics. The model is seeing identical training examples repeatedly, which reinforces memorization patterns and contributes to the unrealistic performance scores you’re observing.
Now… how do I address data leakage for VERY similar name pairs? I am now deduplicating the data, but I don’t know if that is enough to make it work. Trying to find out now…
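A small sketch of exact-pair deduplication and the duplicate count behind a report like the one above, keyed on the `left_name` and `right_name` columns. It uses plain Python lists of dicts for illustration only; the function names are hypothetical and the real pipeline may do this very differently at scale.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Dict[str, object]
PairKey = Tuple[object, object]


def duplicate_report(records: List[Record]) -> Tuple[int, int, List[Tuple[PairKey, int]]]:
    """Return (total rows, unique pairs, repeated pairs sorted by count)."""
    counts = Counter((r["left_name"], r["right_name"]) for r in records)
    repeated = [(pair, n) for pair, n in counts.most_common() if n > 1]
    return len(records), len(counts), repeated


def deduplicate(records: List[Record]) -> List[Record]:
    """Keep the first row seen for each (left_name, right_name) pair."""
    seen = set()
    kept = []
    for record in records:
        key = (record["left_name"], record["right_name"])
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept
```

Note that this only removes exact repeats. A reversed pair (left and right swapped) or a pair that differs by one character counts as a different key, so near-duplicates can still straddle the train/test split after this step.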
I am splitting the person name model from the company name model, which means you need to know which one it is… that is not desirable. It calls for a person / company name classifier - something we can look into after this embedding works
Here is the output of the command eridu etl filter, which deduplicates the data, throwing out a lot of it - maybe 800K records? It was about 2.3M and now it is 1.6M records.
Reading data from ./data/pairs-all.parquet
Initial record count: 26,632,762
Removed 443,075 duplicate records
Records after deduplication: 3,483,749
Final filtered record count: 3,483,749
Total removed 23,149,013 records (86.9%)
Filtered people records count: 1,807,999
Filtered companies records count: 1,680,943
Writing filtered data to ./data/filtered
25/06/13 07:54:08 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Filtering completed successfully!
- All filtered data: ./data/filtered/filtered.parquet
- People data: ./data/filtered/people.parquet
- Companies data: ./data/filtered/companies.parquet
A new training run without all the dupes… if this doesn’t work I am going to… have to cluster pairs and assign very similar ones to training, eval or test data together… this is weird. Not sure
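One way the "assign similar pairs to the same split" idea could look, as a hedged sketch rather than the approach actually used here: treat every name as a node, every labeled pair as an edge, and send each connected component to a single split. The function names and the 10% eval / 10% test proportions are assumptions, and only exact string matches link names together.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str, bool]  # (left_name, right_name, match)


def _find(parent: Dict[str, str], name: str) -> str:
    """Return the representative of name's component, compressing the path."""
    root = name
    while parent[root] != root:
        root = parent[root]
    while parent[name] != root:
        parent[name], name = root, parent[name]
    return root


def assign_splits(
    pairs: Sequence[Pair], eval_pct: int = 10, test_pct: int = 10
) -> Dict[str, List[Pair]]:
    """Split pairs so that no name appears in more than one of train/eval/test."""
    parent: Dict[str, str] = {}
    for left, right, _ in pairs:
        parent.setdefault(left, left)
        parent.setdefault(right, right)
        root_l, root_r = _find(parent, left), _find(parent, right)
        if root_l != root_r:
            # Attach the larger root to the smaller so the result is order-independent.
            parent[max(root_l, root_r)] = min(root_l, root_r)

    splits: Dict[str, List[Pair]] = {"train": [], "eval": [], "test": []}
    for pair in pairs:
        root = _find(parent, pair[0])
        # A stable hash of the component root picks the split deterministically.
        bucket = int(hashlib.sha256(root.encode("utf-8")).hexdigest(), 16) % 100
        if bucket < test_pct:
            splits["test"].append(pair)
        elif bucket < test_pct + eval_pct:
            splits["eval"].append(pair)
        else:
            splits["train"].append(pair)
    return splits
```

Two caveats: non-matching pairs also create edges, so a few widely paired names can merge most of the data into one giant component, and near-identical spellings stay in separate components unless names are normalized or clustered first.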
Nice progress on this! Just to clarify one thing: not all the decisions involving two “Q” IDs are synthetic; a lot are made manually. But of course, by “a lot” I mean fewer than a million.
Have you had a chance to run this against our benchmark dataset yet? I’m super curious for the result. Our rule-based matcher is hovering around 80% right now, which is much better than what we currently have in production…
Thanks! Oh man, that’s not good - how do I identify the Wikidata entries? I read the code and thought that would work.
I haven’t run the benchmark dataset yet because the model overfits so badly in practice, despite excellent test scores, that it wouldn’t do very well. I could try to train on just that data… it should get better and there wouldn’t be the problem with similar records across the train/test split. I have actually implemented a clustering solution using both embeddings and WordPiece counts that I am tuning. Maybe I should just try training locally on the benchmark.
I split the company model out and got reasonable results:
Model Evaluation Report
========================================
Model: data/fine-tuned-sbert-sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2-original-adafactor
Test data: 10,000 pairs
Classification threshold: 0.7159
Performance Metrics:
Accuracy: 0.8532
Precision: 0.8142
Recall: 0.8546
F1 Score: 0.8339
AUC-ROC: 0.9252
Confusion Matrix:
True Positives: 3,685
False Positives: 841
True Negatives: 4,847
False Negatives: 627
Error Analysis:
False Positive Examples (showing top 5 of 841):
1. 'Siły zbrojne Federacji Rosyjskiej' vs 'Siły zbrojne Federacji Rosyjskie' (Score: 0.9973)
2. 'Акціонерне товариство "Зовнішньоторговельна компанія "КАМАЗ"' vs 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЛИЗИНГОВАЯ КОМПАНИЯ "КАМАЗ"' (Score: 0.9718)
3. 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЦЕНТР ПЕРСПЕКТИВНЫХ ТЕХНОЛОГИЙ И АППАРАТУРЫ"' vs 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЦЕНТР ПЕРСПЕКТИВНЫХ ТЕХНОЛОГИЙ"' (Score: 0.9640)
4. 'STROYTRANSGAZ OJSC' vs 'OOO STROYTRANSGAZ' (Score: 0.9615)
5. 'Persian Gulf Petrochemical Industry Company' vs 'Persian Gulf Petrochemical Industry Commercial' (Score: 0.9573)
False Negative Examples (showing top 5 of 627):
1. 'unité militaire 26165' vs '85. hlavné centrum pre špeciálne služby (GCSS) Hlavného riaditeľstva Generálneho štábu ozbrojených síl Ruskej federácie (GU/GRU)' (Score: 0.7159)
2. 'Ди Ед Органайзейшн оф ди Улема' vs 'MEYMAR TRUST' (Score: 0.7158)
3. 'Бюро 39' vs 'Third Floor' (Score: 0.7156)
4. 'ФЕДЕРАЛЛЬНАЯ СЛУЖБА ВОЙСК НАЦИОНАЛЬНОЙ ГВАРДИИ РОССИЙСКОЙ ФЕДЕРАЦИИ' vs 'Националната гвардия на Руската федерация' (Score: 0.7155)
5. 'МИНИСТЕРСТВО ИМУЩЕСТВЕННЫХ ОТНОШЕНИЙЙ РС' vs 'ГОСУДАРСТВЕННОЕ УНИТАРНОЕ ПРЕДПРИЯТИЕ " ДИРЕКЦИЯ АЭРОПОРТОВ РЕСПУБЛИКИ САХА (ЯКУТИЯ) "' (Score: 0.7154)
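As a sanity check on the report above, the headline metrics can be recomputed from the confusion matrix alone. This is a generic sketch using the standard definitions, not the project's evaluation code, and the function name is illustrative.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the report: TP=3,685, FP=841, TN=4,847, FN=627.
metrics = classification_metrics(tp=3685, fp=841, tn=4847, fn=627)
print({name: round(value, 4) for name, value in metrics.items()})
# prints {'accuracy': 0.8532, 'precision': 0.8142, 'recall': 0.8546, 'f1': 0.8339}
```

The four counts sum to the 10,000 test pairs and reproduce the reported accuracy, precision, recall and F1, so the report is at least internally consistent; AUC-ROC cannot be recovered from these counts because it depends on the full score distribution.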
@pudo can you clue me in on how to perform the filter you mentioned - to remove the low-quality Wikidata entries from the match data? I read the file you mentioned, but it just shows them as starting with Q…