Deep, fuzzy matching person and company names with Project Eridu

Dear Open Sanctions Community,

I wanted to share a model-in-progress for person and company name matching that we’re working on with the Open Sanctions community. It doesn’t work quite right yet, but I wanted to gauge interest in the idea. It uses the Open Sanctions Matcher training data to fine-tune a pre-trained paraphrase model for matching person and company names. This lets you compare one or more sets of names using cosine similarity on their text embedding vectors.

At this point I’m looking for feedback on the idea. When it actually works, it will automagically incorporate lessons learned from the 2.3M unique pairs of names in the Matcher training data :slight_smile: The library comes with a sweet CLI that lets you download the Matcher training data, run a report on the labels, train a model and then test inference.

Some information from the README:

Deep fuzzy matching people and company names for multilingual entity resolution using representation learning… that incorporates a deep understanding of people and company names and works much better than string distance methods.

This project is a deep fuzzy matching system for entity resolution using representation learning. It is designed to match people and company names across languages and character sets, using a pre-trained text embedding model from HuggingFace that we fine-tune using contrastive learning on 2 million labeled pairs of person and company names from the Open Sanctions Matcher training data. The project includes a command-line interface (CLI) utility for training the model and comparing pairs of names using cosine similarity.

Matching people and company names is an intractable problem using traditional parsing-based methods: there is too much variation across cultures and jurisdictions for humans to solve it by writing rules by hand. Machine learning is used for culturally dependent problems like this one, where programming a solution approaches infinite complexity, to write the program automatically. Since 2008 there has been an explosion of deep learning methods that automate feature engineering via representation learning, including text embeddings. This project loads the pre-trained paraphrase-multilingual-MiniLM-L12-v2 paraphrase model from HuggingFace and fine-tunes it for the name matching task using contrastive learning on more than 2 million labeled pairs of matching and non-matching (just as important) person and company names from the Matcher training data to create a deep fuzzy matching system for entity resolution.

This model is available on HuggingFace Hub as Graphlet-AI/eridu and can be used in any Python project using the Sentence Transformers library. The model is designed to be used for entity resolution tasks, such as matching people and company names across different languages and character sets.
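As a rough illustration of the comparison described above, here is a minimal sketch of ranking candidate names against a query by cosine similarity. The helper names (`cosine_similarity`, `rank_candidates`, `load_eridu`) are made up for this example and are not part of the eridu package; the only library call assumed is the standard Sentence Transformers `SentenceTransformer(...)` constructor with its `encode` method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(model, query, candidates):
    """Score each candidate name against the query, best match first.

    `model` is anything with an `encode(list_of_str)` method that returns one
    vector per input, such as a SentenceTransformer instance.
    """
    candidates = list(candidates)
    vectors = model.encode([query] + candidates)
    query_vec = [float(v) for v in vectors[0]]
    scored = [
        (name, cosine_similarity(query_vec, [float(v) for v in vec]))
        for name, vec in zip(candidates, vectors[1:])
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def load_eridu():
    """Load the published model (hypothetical helper; needs sentence-transformers)."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer("Graphlet-AI/eridu")
```

Usage would look something like `rank_candidates(load_eridu(), "Jasmeet Singh", ["JASMEET HAKIMZADA", "Samer Abu Hussein"])`, which downloads the model on first call. Scores are only as good as the current checkpoint.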

The project is on GitHub as Graphlet-AI/eridu

On HuggingFace Hub as Graphlet-AI/eridu

On PyPI as eridu

Please let me know what you think! I can only post two links per post atm, so thanks for bearing with me :slight_smile:


So there was a problem in training the first model, so I am retraining it today :slight_smile: I believe the bug was that the best model was not being saved…

You can see today’s work here: I am altering the model saving to be epoch-based rather than step-based… I think the model was testing well but performing poorly because it was not saving the actual best model.

These changes are the basis for today’s training run with these settings:

eridu train --use-gpu \
            --batch-size 768 \
            --epochs 4 \
            --patience 1 \
            --resampling \
            --weight-decay 0.01 \
            --random-seed 31337 \
            --warmup-ratio 0.1 \
            --learning-rate 3e-5 \
            --save-strategy epoch \
            --eval-strategy epoch \
            --sample-fraction 0.1 \
            --fp16

Once I train it, I will do a deep dive into the results of classifying name pairs in the test set to understand if and why the actual best model is ranking dissimilar names too similarly. My fallback options are to split the person and company name models in two to make it easier to learn each one, and to try OnlineContrastiveLoss, which selects the best pairs to use as training data rather than the 10% random sampling I am currently using each epoch.
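To make the hard-pair idea concrete, here is a simplified pure-Python sketch of a contrastive loss that only counts the "hard" pairs in a batch. It is written in the spirit of OnlineContrastiveLoss as I understand it, not copied from Sentence Transformers. The function name and the `margin=0.5` default are assumptions for illustration.

```python
def hard_pair_contrastive_loss(distances, labels, margin=0.5):
    """Contrastive loss restricted to the hard pairs in one batch.

    distances: embedding distance per pair (e.g. 1 - cosine similarity)
    labels:    1 for a matching pair, 0 for a non-matching pair
    """
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        # Nothing to compare against: fall back to using every pair.
        hard_pos, hard_neg = pos, neg
    else:
        # Hard positives: matches that sit farther apart than the closest non-match.
        hard_pos = [d for d in pos if d > min(neg)]
        # Hard negatives: non-matches that sit closer than the farthest match.
        hard_neg = [d for d in neg if d < max(pos)]
    # Pull hard positives together, push hard negatives out past the margin.
    positive_loss = sum(d ** 2 for d in hard_pos)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return positive_loss + negative_loss
```

A batch where every match is already closer than every non-match contributes zero loss, so the gradient comes only from the pairs the model currently gets wrong or nearly wrong. That is the contrast with training on a 10% random sample, where most pairs may already be easy.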

Here is the live training run: Weights & Biases

Stay tuned!

@pudo does 21M unique pairs of labeled names - including both company and person - sound like the right number to you? This comes from pairing all names on the left with all names on the right, maintaining the label.

I am curious what your team thinks about which data I might use versus what data I might filter out? What constitutes the best training data among these 21M labeled pairs?
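For anyone following along, the pairing described above can be sketched as a cross product per judgement, followed by a set to keep only unique pairs. The function names and the input shape (a list of left names, a list of right names, and one label) are assumptions for this example, not the actual layout of the Matcher data.

```python
from itertools import product


def expand_pairs(left_names, right_names, label):
    """Pair every left name with every right name, carrying the label along."""
    return [(left, right, label) for left, right in product(left_names, right_names)]


def unique_pairs(judgements):
    """Expand (left_names, right_names, label) judgements into unique labeled pairs."""
    seen = set()
    for left_names, right_names, label in judgements:
        seen.update(expand_pairs(left_names, right_names, label))
    return seen
```

An entity with 5 aliases matched against one with 8 aliases yields 40 pairs from a single decision, which is how a modest number of judgements can grow into tens of millions of pairs.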

This is the performance after a single epoch… very promising.

I am pretty confused by the quantified results compared with my own use of the model. You can go to Graphlet-AI/eridu · Hugging Face and click on the HF Inference API and try a list of names compared to a given name. It behaves terribly. I am saving the test data as Parquet so I can dig in and figure out why it scores well on the test set but isn’t learning the semantics of names in any corner case… over-fitting? Or is uploading to HuggingFace messed up? Got to debug it :slight_smile:

Another run, this time I will be saving the test data to do some data science and see what is going on.

I unset FP16 mixed precision, I dropped batch size from 768 to 64 (16 was taking 10 hours an epoch) and I’m running 5 instead of 4 epochs.

Yeah those numbers sound pretty accurate. Some of this is synthetic, e.g. made from just juxtaposing different Wikidata entries that have different names. That, in my mind, would be the first to go.

Here’s the script I used to make the file: qarin/namepairs/duck_gen.py at main · opensanctions/qarin · GitHub - if you remove the query on L173 that gets rid of the Wikidata name pairs.

Btw have you ever heard of https://www.zingg.ai/? I spoke to their CEO recently, she’s really into building open source ER.

So I can tell whether it’s Wikidata if the data source is wiki or the ID starts with Q, as on L175? (Btw, can you upgrade me so I can post links to GitHub and other places?)

Yes, everyone really likes the founder Sonal Goyal :slight_smile: She’s very nice, and I hear good things about the product. Back when I read the code more than a year ago, it was using string distance but I do remember their method of sampling was really interesting…

Wait… 22M of 26M come from Wikidata? So you would say the others are of better quality?

In [22]: pairs_df.filter(F.col("source").startswith("Q")).count()
Out[22]: 22,705,938                                                               

In [23]: pairs_df.count()
Out[23]: 26,632,762
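Those two counts work out to 22,705,938 / 26,632,762 ≈ 85.3% of pairs. A prefix test like `startswith("Q")` would also catch any value that merely begins with the letter Q, so a stricter check is sketched below. It assumes the `source` values for Wikidata rows are bare item IDs such as `Q42`, which I have not verified against the data. The function names are made up for the example.

```python
import re

# Wikidata item IDs are "Q" followed by a positive integer, e.g. "Q42".
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_wikidata_qid(value):
    """True if the string looks like a Wikidata item ID."""
    return bool(QID_PATTERN.fullmatch(value))


def wikidata_share(sources):
    """Fraction of source values that look like Wikidata item IDs."""
    sources = list(sources)
    if not sources:
        return 0.0
    return sum(is_wikidata_qid(s) for s in sources) / len(sources)
```

Note this only identifies Wikidata-style IDs; it says nothing about whether a given pair was generated synthetically or decided by hand.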

Okay, I am examining the data and definitely see a lot of examples that are preventing the model from learning…

  1. Same identity, different names. I don’t think an embedding alone can understand this unless the strings themselves contain some inferable, culture-specific logic linking them.

| Type | Entity 1 | Entity 2 | Type 2 | Match |
| --- | --- | --- | --- | --- |
| ORG | SKLASS DRUG COMPANY | LANDSMAN PHARMACY | ORG | true |
| PER | JASMEET SINGH | JASMEET HAKIMZADA | PER | true |
| PER | RAYYA, Samer Maher | Samer Abu Hussein | PER | true |

  2. Abbreviations… should be possible to learn, I think!

| Type | Entity 1 | Entity 2 | Type 2 | Match |
| --- | --- | --- | --- | --- |
| ORG | Дизайн енд Меньюфекчурінг оф Ейркрафт інжинз | DAMA | ORG | true |

More progress… Claude Desktop is helping me direct Claude Code to do the work :slight_smile: It suggested I had data leakage from duplicates, so I created a duplicate section of the report… they’re on to something. Data leakage may be causing the overfitting… it offers the model a shortcut: it can learn just the exact matches and then fail to generalize to any other data…

Here’s the data converted to Markdown tables:

Duplicate Analysis Summary

| Metric | Value |
| --- | --- |
| Total records | 26,632,762 |
| Unique records (by left_name, right_name) | 26,166,197 |
| Duplicate records | 466,565 (1.8%) |
| Top duplicate patterns found | 1,000 |

Top Duplicate Patterns

| Left Name | Right Name | Count |
| --- | --- | --- |
| ABOU MUSAB ABDELOUADOUD | ABOU MOUSSAAB ABDEL WADOUD | 46 |
| ABU MUSAB ABDELOUDOUD | ABOU MOUSAAB ABDELOUADOUD | 46 |
| ABU MUS’AB ABDELOUADOUD | ABDEL MALEK DROUKDEL | 46 |
| ABDEL MALEK DEROUDEL | ABD AL-MALIK DURIKDAL | 46 |
| DROUKDEL ABDELMALEK | ABOU MOSAAB ABKELWADOUD | 46 |
| DROKDAL ABDELMALEK | ABDELOUADOUR DROUKDEL | 46 |
| ABDELMALEK DROUKBEL | ABD-AL-MALIK DROKDAL | 46 |
| ABU MUS’AB ABDELOUADOUD | ABDEL MALEK DEROUDEL | 46 |
| ABU MOSSAAB ABDEL EL-WADOUD | ABD AL-MALIK DURIKDAL | 46 |
| ABU MOSSAB ABDELOUADOUD | ABDELOUADOUD ABOU MOSSAAH | 46 |

Sample Duplicate Records (Pattern appears 46 times)

| Left Name | Left Norm | Left FP | Left Lang | Left Category | Right Name | Right Norm | Right FP | Right Lang | Right Category | Match | Dist Norm | Dist FP | Score | Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3r455 |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr4f |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr4k |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr44 |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3r457 |

Claude Sonnet 4: This duplicate analysis reveals a critical insight about your overfitting! The fact that you have 466K duplicate records (1.8% of your dataset) with the exact same name pairs appearing multiple times across different sources explains part of your inflated metrics. The model is seeing identical training examples repeatedly, which reinforces memorization patterns and contributes to the unrealistic performance scores you’re observing.

Now… how do I address data leakage for VERY similar name pairs? I am now deduplicating the data, but I don’t know if that is enough to make it work. Trying to find out now…
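The duplicate count in the summary is just total minus unique on the `(left_name, right_name)` key: 26,632,762 − 26,166,197 = 466,565. A minimal in-memory sketch of that report and of first-seen deduplication follows. The function names and the list-of-dicts input are assumptions for illustration, and the real pipeline runs on Spark rather than Python lists.

```python
from collections import Counter


def duplicate_report(pairs):
    """Summarise repeats of (left_name, right_name) in a list of record dicts."""
    counts = Counter((p["left_name"], p["right_name"]) for p in pairs)
    total = sum(counts.values())
    unique = len(counts)
    return {
        "total": total,
        "unique": unique,
        "duplicates": total - unique,
        "top": counts.most_common(3),
    }


def deduplicate(pairs):
    """Keep the first record seen for each (left_name, right_name) key."""
    seen = set()
    kept = []
    for p in pairs:
        key = (p["left_name"], p["right_name"])
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept
```

Deduplicating before the train/eval/test split is the important part: if the split happens first, copies of the same pair can land on both sides of it. This does nothing for near-duplicates such as spelling variants of the same name.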

I am splitting the person name model from the company name model, which means you need to know which one it is… that is not desirable. It calls for a person / company name classifier - something we can look into after this embedding works :slight_smile:

Here is the output of the command eridu etl filter which deduplicates the data, throwing out a lot of data - maybe 800K records? It was about 2.3M and now it is 1.6M records.

Reading data from ./data/pairs-all.parquet
Initial record count: 26,632,762
Removed 443,075 duplicate records                                               
Records after deduplication: 3,483,749
Final filtered record count: 3,483,749                                          
Total removed 23,149,013 records (86.9%)
Filtered people records count: 1,807,999                                        
Filtered companies records count: 1,680,943                                     
Writing filtered data to ./data/filtered
25/06/13 07:54:08 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Filtering completed successfully!                                               
  - All filtered data: ./data/filtered/filtered.parquet
  - People data: ./data/filtered/people.parquet
  - Companies data: ./data/filtered/companies.parquet

A new training run without all the dupes… if this doesn’t work I am going to… have to cluster pairs and assign very similar ones to training, eval or test data together… this is weird. Not sure :slight_smile:
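One way to keep related pairs on the same side of the split is to group pairs that share a name and assign whole groups to train, eval or test. The sketch below does this with union-find over exact name strings, which is a much simpler stand-in for clustering on similarity. The function names, split percentages and hashing scheme are all assumptions for illustration.

```python
import hashlib


def find(parent, x):
    """Find the component root of x, compressing the path as we go."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def split_by_component(pairs, eval_pct=10, test_pct=10):
    """Assign (left, right, label) pairs to splits so no name crosses a split."""
    pairs = list(pairs)
    parent = {}
    # Link the two names of every pair; names reachable from each other
    # through any chain of pairs end up in one component.
    for left, right, _ in pairs:
        parent.setdefault(left, left)
        parent.setdefault(right, right)
        root_l, root_r = find(parent, left), find(parent, right)
        if root_l != root_r:
            parent[root_r] = root_l

    splits = {"train": [], "eval": [], "test": []}
    for pair in pairs:
        root = find(parent, pair[0])
        # A stable hash of the component root picks the split deterministically.
        bucket = int(hashlib.md5(root.encode("utf-8")).hexdigest(), 16) % 100
        if bucket < test_pct:
            splits["test"].append(pair)
        elif bucket < test_pct + eval_pct:
            splits["eval"].append(pair)
        else:
            splits["train"].append(pair)
    return splits
```

The catch is that heavily connected data can collapse into a few huge components, which makes the split sizes lumpy. Spelling variants that never share an exact string still leak, so a fuzzy grouping key would be needed for those.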

nohup eridu train --use-gpu \
            --batch-size 32 \
            --epochs 8 \
            --patience 1 \
            --weight-decay 0.01 \
            --random-seed 31337 \
            --warmup-ratio 0.1 \
            --learning-rate 1e-5 \
            --save-strategy epoch \
            --eval-strategy epoch \
            --sample-fraction 1.0 \
            --input data/filtered/people.parquet \
            --output data/output \
            --data-type people &
tail -f nohup.out

The current training run on Weights & Biases

Oh, here is the latest training run :slight_smile:

Nice progress on this! Just to clarify one thing: not all the decisions involving two “Q” IDs are synthetic, a lot are made manually. But of course, by “a lot” I mean less than a million :slight_smile:

Have you had a chance to run this against our benchmark dataset yet? I’m super curious for the result. Our rule-based matcher is hovering around 80% right now, which is much better than what we currently have in production…

Thanks! Oh man, that’s not good - how do I identify the synthetic Wikidata entries? I read the code and thought that would work.

I haven’t run the benchmark dataset yet because the model overfits so badly in practice, despite excellent test scores, that it wouldn’t do very well. I could try to train on just that data… it should get better and there wouldn’t be the problem with similar records across the train/test split. I have actually implemented a clustering solution using both embeddings and WordPiece counts that I am tuning. Maybe I should just try training locally on the benchmark.

I split the company model out and got reasonable results :slight_smile:

Model Evaluation Report
========================================
Model: data/fine-tuned-sbert-sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2-original-adafactor
Test data: 10,000 pairs
Classification threshold: 0.7159

Performance Metrics:
  Accuracy:  0.8532
  Precision: 0.8142
  Recall:    0.8546
  F1 Score:  0.8339
  AUC-ROC:   0.9252

Confusion Matrix:
  True Positives:  3,685
  False Positives: 841
  True Negatives:  4,847
  False Negatives: 627

Error Analysis:
  False Positive Examples (showing top 5 of 841):
    1. 'Siły zbrojne Federacji Rosyjskiej' vs 'Siły zbrojne Federacji Rosyjskie' (Score: 0.9973)
    2. 'Акціонерне товариство "Зовнішньоторговельна компанія "КАМАЗ"' vs 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЛИЗИНГОВАЯ КОМПАНИЯ "КАМАЗ"' (Score: 0.9718)
    3. 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЦЕНТР ПЕРСПЕКТИВНЫХ ТЕХНОЛОГИЙ И АППАРАТУРЫ"' vs 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЦЕНТР ПЕРСПЕКТИВНЫХ ТЕХНОЛОГИЙ"' (Score: 0.9640)
    4. 'STROYTRANSGAZ OJSC' vs 'OOO STROYTRANSGAZ' (Score: 0.9615)
    5. 'Persian Gulf Petrochemical Industry Company' vs 'Persian Gulf Petrochemical Industry Commercial' (Score: 0.9573)
  False Negative Examples (showing top 5 of 627):
    1. 'unité militaire 26165' vs '85. hlavné centrum pre špeciálne služby (GCSS) Hlavného riaditeľstva Generálneho štábu ozbrojených síl Ruskej federácie (GU/GRU)' (Score: 0.7159)
    2. 'Ди Ед Органайзейшн оф ди Улема' vs 'MEYMAR TRUST' (Score: 0.7158)
    3. 'Бюро 39' vs 'Third Floor' (Score: 0.7156)
    4. 'ФЕДЕРАЛЛЬНАЯ СЛУЖБА ВОЙСК НАЦИОНАЛЬНОЙ ГВАРДИИ РОССИЙСКОЙ ФЕДЕРАЦИИ' vs 'Националната гвардия на Руската федерация' (Score: 0.7155)
    5. 'МИНИСТЕРСТВО ИМУЩЕСТВЕННЫХ ОТНОШЕНИЙЙ РС' vs 'ГОСУДАРСТВЕННОЕ УНИТАРНОЕ ПРЕДПРИЯТИЕ " ДИРЕКЦИЯ АЭРОПОРТОВ РЕСПУБЛИКИ САХА (ЯКУТИЯ) "' (Score: 0.7154)
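As a sanity check, the performance metrics in the report can be recomputed from its confusion matrix with the standard definitions. The function name is made up for this sketch; the four counts are taken from the report above.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts from the company model's evaluation report (10,000 test pairs).
report = classification_metrics(tp=3685, fp=841, tn=4847, fn=627)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values agree with the report to four decimal places. AUC-ROC cannot be recovered this way because it needs the raw scores, not just the counts at one threshold.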

I trained the model locally on my GPU machine, got through 2 epochs, and got these results:

Here is the company-only model on a test dataset:

Confusion Matrix

True Positives: 3,685
False Positives: 841
True Negatives: 4,847
False Negatives: 627

How does that compare with what you’ve got?

Performance Metrics

Accuracy: 0.8532
Precision: 0.8142
Recall: 0.8546
F1 Score: 0.8339
AUC-ROC: 0.9252

Here is the checks.yml performance score… too many YES answers… look at how high the F1 is compared to accuracy and precision! 1.0 recall :slight_smile:

============================================================
COMBINED EVALUATION SUMMARY
============================================================
Total checks: 182
Overall Accuracy:  0.5886
Overall Precision: 0.5886
Overall Recall:    1.0000
Overall F1 Score:  0.7410
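Accuracy equal to precision together with recall of 1.0 is what you would see from a classifier that answers YES to every check, assuming the summary comes from a single confusion matrix rather than an average over subsets. In that case precision and accuracy both equal the share of true matches, and F1 reduces to 2p / (1 + p). A small sketch, with a made-up function name:

```python
def always_yes_metrics(positive_rate):
    """Metrics for a classifier that answers YES to every pair.

    positive_rate: fraction of pairs that are true matches.
    """
    precision = positive_rate  # every pair is predicted positive
    recall = 1.0 if positive_rate > 0 else 0.0  # no true match is missed
    accuracy = positive_rate  # only the true matches are answered correctly
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


baseline = always_yes_metrics(0.5886)
print(f"F1 for always-YES at p=0.5886: {baseline['f1']:.4f}")
```

So the 0.7410 F1 is no better than this trivial baseline, which is why F1 alone is misleading here. Raising the decision threshold should trade some recall for precision.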

@pudo can you clue me in on how to perform the filter you mentioned, to remove the low-quality Wikidata entries from the match data? I read the file you mentioned, but it just shows them as starting with Q.