Deep, fuzzy matching person and company names with Project Eridu

rjurney · May 15, 2025, 10:14am

Dear Open Sanctions Community,

I wanted to share a model-in-progress for person and company name matching that we’re working on with the Open Sanctions community. It doesn’t work quite right yet, but I wanted to gauge interest in the idea. It uses the Open Sanctions Matcher training data to fine-tune a pre-trained paraphrase model to improve its performance at matching person and company names. This lets you compare one or more sets of names using cosine similarity on the text embedding vectors.

At this point I’m looking for feedback on the idea. When it actually works, it will automagically incorporate lessons learned from the 2.3M unique pairs of names in the Matcher training data. The library comes with a sweet CLI that lets you download the Matcher training data, run a report on the labels, train a model and then test inference.

Some information from the README:

Deep fuzzy matching people and company names for multilingual entity resolution using representation learning… that incorporates a deep understanding of people and company names and works much better than string distance methods.

This project is a deep fuzzy matching system for entity resolution using representation learning. It is designed to match people and company names across languages and character sets, using a pre-trained text embedding model from HuggingFace that we fine-tune using contrastive learning on 2 million labeled pairs of person and company names from the Open Sanctions Matcher training data. The project includes a command-line interface (CLI) utility for training the model and comparing pairs of names using cosine similarity.

Matching people and company names is an intractable problem using traditional parsing-based methods: there is too much variation across cultures and jurisdictions for humans to solve the problem by programming rules. Machine learning is used in problems like this one, where culture matters and programming a solution approaches infinite complexity, to automatically write a program. Since 2008 there has been an explosion of deep learning methods that automate feature engineering via representation learning methods such as text embeddings. This project loads the pre-trained paraphrase-multilingual-MiniLM-L12-v2 paraphrase model from HuggingFace and fine-tunes it for the name matching task using contrastive learning on more than 2 million labeled pairs of matching and non-matching (just as important) person and company names from the Matcher training data to create a deep fuzzy matching system for entity resolution.

This model is available on HuggingFace Hub as Graphlet-AI/eridu and can be used in any Python project using the Sentence Transformers library. The model is designed to be used for entity resolution tasks, such as matching people and company names across different languages and character sets.
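A minimal sketch of what that usage might look like, assuming the Hub model loads through the standard `SentenceTransformer` constructor. The helper names `cosine_similarity` and `compare_names` are made up for illustration and are not part of the eridu library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (pure Python)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return float(dot / (norm_a * norm_b))


def compare_names(query: str, candidates: list[str],
                  model_name: str = "Graphlet-AI/eridu") -> list[tuple[str, float]]:
    """Rank candidate names by similarity to the query (hypothetical helper)."""
    # Imported lazily so the cosine helper works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    embeddings = model.encode([query] + candidates)
    scored = [
        (name, cosine_similarity(embeddings[0], emb))
        for name, emb in zip(candidates, embeddings[1:])
    ]
    # Highest similarity first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `compare_names("John Smith", ["Jon Smith", "Acme Corp"])` would download the model on first use and return the candidates sorted by score; the scores themselves depend on the model version.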

The project is on GitHub at: GitHub - Graphlet-AI/eridu: Deep fuzzy matching people and company names for multilingual entity resolution using representation learning

On HuggingFace Hub as Graphlet-AI/eridu

On PyPI as eridu

Please let me know what you think! I can only post two links per post atm, so thanks for bearing with me.

rjurney · June 3, 2025, 6:34pm

So there was a problem in training the first model, such that I am retraining it today. I believe the bug was that the best model was not being saved…

You can see today’s work here, I am altering the model saving to be epoch based rather than steps… I think the model was testing well but performing poorly because it was not saving the actual best model.

These changes are the basis for today’s training run with these settings:

eridu train --use-gpu \
            --batch-size 768 \
            --epochs 4 \
            --patience 1 \
            --resampling \
            --weight-decay 0.01 \
            --random-seed 31337 \
            --warmup-ratio 0.1 \
            --learning-rate 3e-5 \
            --save-strategy epoch \
            --eval-strategy epoch \
            --sample-fraction 0.1 \
            --fp16

Once I train it, I will do a deep dive into the results of classifying name pairs in the test set to understand if and why the actual best model is ranking dissimilar names too similarly. My fallback options are to split the person and company name models in two to make it easier to learn each one, and to try OnlineContrastiveLoss, which selects the best pairs to use as training data rather than the 10% random sampling I am currently using each epoch.
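To make the hard-pair idea concrete, here is a plain-Python sketch of my understanding of online contrastive loss: within a batch, keep only the positive pairs that sit farther apart than the closest negative, and the negative pairs that sit closer than the farthest positive. The function name and the `margin` default are assumptions for illustration; the Sentence Transformers implementation works on tensors and may differ in details.

```python
def online_contrastive_loss(distances, labels, margin=0.5):
    """Sum of squared losses over the hard pairs in one batch.

    distances: embedding distance for each pair (smaller = more similar)
    labels: 1 for a matching pair, 0 for a non-matching pair
    """
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]
    if not pos or not neg:
        # Hard-pair selection needs both classes; fall back to all pairs.
        hard_pos, hard_neg = pos, neg
    else:
        # Positives that are farther apart than the closest negative
        hard_pos = [d for d in pos if d > min(neg)]
        # Negatives that are closer together than the farthest positive
        hard_neg = [d for d in neg if d < max(pos)]
    positive_loss = sum(d ** 2 for d in hard_pos)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return positive_loss + negative_loss


# Two positives and two negatives; only the overlapping pairs contribute.
loss = online_contrastive_loss([0.1, 0.6, 0.2, 0.9], [1, 1, 0, 0])
```

A batch where every positive is already closer than every negative contributes zero loss, which is why this selection concentrates training on the confusable pairs.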

Here is the live training run: Weights & Biases

Stay tuned!

rjurney · June 3, 2025, 6:49pm

@pudo does 21M unique pairs of labeled names - including both company and person - sound like the right number to you? This comes from pairing all names on the left with all names on the right, maintaining the label.

I am curious what your team thinks about which data I might use versus what data I might filter out? What constitutes the best training data among these 21M labeled pairs?
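For concreteness, the pairing I describe could be sketched like this. The function name is hypothetical and the names are just illustrative aliases; the point is that an entity with m names on the left and n on the right yields m × n labeled pairs, which is how the count grows so quickly.

```python
from itertools import product


def expand_name_pairs(left_names, right_names, label):
    """Pair every left name with every right name, carrying the label along."""
    return [(left, right, label) for left, right in product(left_names, right_names)]


pairs = expand_name_pairs(
    ["ABDEL MALEK DROUKDEL", "DROUKDEL ABDELMALEK"],
    ["ABD AL-MALIK DURIKDAL", "ABOU MUSAB ABDELOUADOUD", "ABDELMALEK DROUKBEL"],
    True,
)
print(len(pairs))  # 2 left names x 3 right names = 6 labeled pairs
```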

rjurney · June 3, 2025, 8:30pm

This is the performance after a single epoch… very promising.

rjurney · June 4, 2025, 5:19am

I am pretty confused by the quantified results compared with my own use of the model. You can go to Graphlet-AI/eridu · Hugging Face and click on the HF Inference API and try a list of names compared to a given name. It behaves terribly. I am saving the test data as Parquet so I can dig in and figure out why it isn’t learning the semantics of names in any corner case but performs well… over-fitting? Or is uploading to HuggingFace messed up? Got to debug it

rjurney · June 4, 2025, 6:44am

Another run, this time I will be saving the test data to do some data science and see what is going on.

I unset FP16 mixed precision, I dropped batch size from 768 to 64 (16 was taking 10 hours an epoch) and I’m running 5 instead of 4 epochs.

pudo · June 5, 2025, 10:57am

Yeah those numbers sound pretty accurate. Some of this is synthetic, e.g. made from just juxtaposing different Wikidata entries that have different names. That, in my mind, would be the first to go.

Here’s the script I used to make the file: qarin/namepairs/duck_gen.py at main · opensanctions/qarin · GitHub - if you remove the query on L173 that gets rid of the Wikidata name pairs.

Btw have you ever heard of https://www.zingg.ai/? I spoke to their CEO recently, she’s really into building open source ER.

rjurney · June 13, 2025, 2:47am

So I can tell if it’s Wikidata if the data source is wiki or the ID starts with Q: on L175? (btw, can you upgrade me to be able to post links to GitHub and other places?)

Yes, everyone really likes the founder, Sonal Goyal. She’s very nice, and I hear good things about the product. Back when I read the code more than a year ago, it was using string distance, but I do remember their method of sampling was really interesting…

rjurney · June 13, 2025, 6:46am

Wait… 22M of 26M come from Wikidata? So you would say the others are of better quality?

In [22]: pairs_df.filter(F.col("source").startswith("Q")).count()
Out[22]: 22,705,938                                                               

In [23]: pairs_df.count()
Out[23]: 26,632,762
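
As a quick sanity check on those two counts (plain arithmetic, nothing eridu-specific):

```python
q_rows = 22_705_938      # Out[22]: rows whose source starts with "Q"
total_rows = 26_632_762  # Out[23]: all rows

q_share = q_rows / total_rows
other_rows = total_rows - q_rows

print(f"Q-prefixed share: {q_share:.1%}")  # about 85.3%
print(f"Other rows: {other_rows:,}")       # 3,926,824
```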

rjurney · June 13, 2025, 7:56am

Okay, I am examining the data and definitely see a lot of examples that are preventing the model from learning…

Same identity, different names. I don’t think an embedding alone can understand this, if there isn’t some inferable logic linking them contained in the string about the culture of the name.

| Type | Entity 1 | Entity 2 | Type 2 | Match |
| --- | --- | --- | --- | --- |
| ORG | SKLASS DRUG COMPANY | LANDSMAN PHARMACY | ORG | true |
| PER | JASMEET SINGH | JASMEET HAKIMZADA | PER | true |
| PER | RAYYA, Samer | Maher Samer Abu Hussein | PER | true |

Abbreviations… should be possible to learn, I think!

| Type | Entity 1 | Entity 2 | Type 2 | Match |
| --- | --- | --- | --- | --- |
| ORG | Дизайн енд Меньюфекчурінг оф Ейркрафт інжинз | DAMA | ORG | true |

rjurney · June 13, 2025, 2:48pm

More progress… Claude Desktop is helping me direct Claude Code to do the work. It suggested I had data leakage from duplicates, so I created a duplicate section of the report… they’re on to something. Data leakage may be causing the overfitting… it offers the model a shortcut: it can learn just the exact matches and then fail to work on any other data… it does not generalize…

Here’s the data converted to Markdown tables:

Duplicate Analysis Summary

| Metric | Value |
| --- | --- |
| Total records | 26,632,762 |
| Unique records (by left_name, right_name) | 26,166,197 |
| Duplicate records | 466,565 (1.8%) |
| Top duplicate patterns found | 1,000 |

Top Duplicate Patterns

| Left Name | Right Name | Count |
| --- | --- | --- |
| ABOU MUSAB ABDELOUADOUD | ABOU MOUSSAAB ABDEL WADOUD | 46 |
| ABU MUSAB ABDELOUDOUD | ABOU MOUSAAB ABDELOUADOUD | 46 |
| ABU MUS’AB ABDELOUADOUD | ABDEL MALEK DROUKDEL | 46 |
| ABDEL MALEK DEROUDEL | ABD AL-MALIK DURIKDAL | 46 |
| DROUKDEL ABDELMALEK | ABOU MOSAAB ABKELWADOUD | 46 |
| DROKDAL ABDELMALEK | ABDELOUADOUR DROUKDEL | 46 |
| ABDELMALEK DROUKBEL | ABD-AL-MALIK DROKDAL | 46 |
| ABU MUS’AB ABDELOUADOUD | ABDEL MALEK DEROUDEL | 46 |
| ABU MOSSAAB ABDEL EL-WADOUD | ABD AL-MALIK DURIKDAL | 46 |
| ABU MOSSAB ABDELOUADOUD | ABDELOUADOUD ABOU MOSSAAH | 46 |

Sample Duplicate Records (Pattern appears 46 times)

| Left Name | Left Norm | Left FP | Left Lang | Left Category | Right Name | Right Norm | Right FP | Right Lang | Right Category | Match | Dist Norm | Dist FP | Score | Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3r455 |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr4f |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr4k |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3pr44 |
| ABOU MUSAB ABDELOUADOUD | abou musab abdelouadoud | abdelouadoud abou musab | eng | PER | ABOU MOUSSAAB ABDEL WADOUD | abou moussaab abdel wadoud | abdel abou moussaab wadoud | eng | PER | true | 5.0 | 16.0 | 0.8 | usgsa-s4mr3r457 |

Claude Sonnet 4: This duplicate analysis reveals a critical insight about your overfitting! The fact that you have 466K duplicate records (1.8% of your dataset) with the exact same name pairs appearing multiple times across different sources explains part of your inflated metrics. The model is seeing identical training examples repeatedly, which reinforces memorization patterns and contributes to the unrealistic performance scores you’re observing.

rjurney · June 13, 2025, 2:49pm

Now… how do I address data leakage for VERY similar name pairs? I am now deduplicating the data, but I don’t know if that is enough to make it work. Trying to find out now…

rjurney · June 13, 2025, 2:58pm

I am splitting the person name model from the company name model, which means you need to know which one it is… that is not desirable. It calls for a person / company name classifier, something we can look into after this embedding works.

Here is the output of the command eridu etl filter which deduplicates the data, throwing out a lot of data - maybe 800K records? It was about 2.3M and now it is 1.6M records.

Reading data from ./data/pairs-all.parquet
Initial record count: 26,632,762
Removed 443,075 duplicate records                                               
Records after deduplication: 3,483,749
Final filtered record count: 3,483,749                                          
Total removed 23,149,013 records (86.9%)
Filtered people records count: 1,807,999                                        
Filtered companies records count: 1,680,943                                     
Writing filtered data to ./data/filtered
25/06/13 07:54:08 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Filtering completed successfully!                                               
  - All filtered data: ./data/filtered/filtered.parquet
  - People data: ./data/filtered/people.parquet
  - Companies data: ./data/filtered/companies.parquet
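
For readers who want the gist of the deduplication step only, here is a minimal sketch that keeps the first record per `(left_name, right_name)` key. It is not the `eridu etl filter` implementation; the log above shows that command removing far more than the exact duplicates, so it evidently applies other filters as well. The function name is hypothetical.

```python
def deduplicate_pairs(records):
    """Keep the first record for each (left_name, right_name) key."""
    seen = set()
    unique = []
    for record in records:
        key = (record["left_name"], record["right_name"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


rows = [
    {"left_name": "A", "right_name": "B", "source": "s1"},
    {"left_name": "A", "right_name": "B", "source": "s2"},  # same pair, other source
    {"left_name": "A", "right_name": "C", "source": "s3"},
]
unique_rows = deduplicate_pairs(rows)
```

Exact-key deduplication like this does nothing about near-duplicates such as transliteration variants, which is the harder leakage problem.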

rjurney · June 13, 2025, 3:01pm

A new training run without all the dupes… if this doesn’t work I am going to… have to cluster pairs and assign very similar ones to training, eval or test data together… this is weird. Not sure

nohup eridu train --use-gpu \
            --batch-size 32 \
            --epochs 8 \
            --patience 1 \
            --weight-decay 0.01 \
            --random-seed 31337 \
            --warmup-ratio 0.1 \
            --learning-rate 1e-5 \
            --save-strategy epoch \
            --eval-strategy epoch \
            --sample-fraction 1.0 \
            --input data/filtered/people.parquet \
            --output data/output \
            --data-type people &

tail -f nohup.out

The current training run on Weights & Biases
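
The fallback I mentioned, keeping very similar pairs in the same split, could be sketched roughly as below. This is a simplified stand-in that clusters on exact shared names using union-find, rather than on embedding similarity; the function names and split fractions are assumptions for illustration.

```python
import hashlib


def find(parent, x):
    """Find the cluster root of x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_aware_split(pairs, eval_fraction=0.1, test_fraction=0.1):
    """Assign pairs to train/eval/test so pairs sharing a name stay together.

    pairs: list of (left_name, right_name, label) tuples
    """
    parent = {}
    for left, right, _ in pairs:
        parent.setdefault(left, left)
        parent.setdefault(right, right)
        # Union: names that co-occur in a pair belong to one cluster.
        parent[find(parent, left)] = find(parent, right)

    splits = {"train": [], "eval": [], "test": []}
    for pair in pairs:
        root = find(parent, pair[0])
        # Stable hash of the cluster root, mapped to a number in [0, 1)
        digest = hashlib.sha256(root.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000
        if bucket < test_fraction:
            splits["test"].append(pair)
        elif bucket < test_fraction + eval_fraction:
            splits["eval"].append(pair)
        else:
            splits["train"].append(pair)
    return splits
```

One caveat with this design: non-matching pairs also link names, so on real data most names may collapse into one giant cluster. Unioning only on matching pairs, or on a similarity threshold, would be the obvious variations to try.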

rjurney · June 13, 2025, 3:17pm

Oh, here is the latest training run

pudo · June 16, 2025, 8:22am

Nice progress on this! Just to clarify one thing: not all the decisions involving two “Q” IDs are synthetic, a lot are made manually. But of course, by “a lot” I mean less than a million

Have you had a chance to run this against our benchmark dataset yet? I’m super curious for the result. Our rule-based matcher is hovering around 80% right now, which is much better than what we currently have in production…

rjurney · June 16, 2025, 3:05pm

Thanks! Oh man, that’s not good. How do I identify the Wikidata entries? I read the code and thought that would work.

I haven’t run the benchmark dataset yet because the model overfits so badly in practice, despite excellent test scores, that it wouldn’t do very well. I could try to train on just that data… it should get better and there wouldn’t be the problem with similar records across the train/test split. I have actually implemented a clustering solution using both embeddings and WordPiece counts that I am tuning. Maybe I should just try training locally on the benchmark.

rjurney · July 9, 2025, 3:06am

I split the company model out and got reasonable results:

Model Evaluation Report
========================================
Model: data/fine-tuned-sbert-sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2-original-adafactor
Test data: 10,000 pairs
Classification threshold: 0.7159

Performance Metrics:
  Accuracy:  0.8532
  Precision: 0.8142
  Recall:    0.8546
  F1 Score:  0.8339
  AUC-ROC:   0.9252

Confusion Matrix:
  True Positives:  3,685
  False Positives: 841
  True Negatives:  4,847
  False Negatives: 627

Error Analysis:
  False Positive Examples (showing top 5 of 841):
    1. 'Siły zbrojne Federacji Rosyjskiej' vs 'Siły zbrojne Federacji Rosyjskie' (Score: 0.9973)
    2. 'Акціонерне товариство "Зовнішньоторговельна компанія "КАМАЗ"' vs 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЛИЗИНГОВАЯ КОМПАНИЯ "КАМАЗ"' (Score: 0.9718)
    3. 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЦЕНТР ПЕРСПЕКТИВНЫХ ТЕХНОЛОГИЙ И АППАРАТУРЫ"' vs 'АКЦИОНЕРНОЕ ОБЩЕСТВО "ЦЕНТР ПЕРСПЕКТИВНЫХ ТЕХНОЛОГИЙ"' (Score: 0.9640)
    4. 'STROYTRANSGAZ OJSC' vs 'OOO STROYTRANSGAZ' (Score: 0.9615)
    5. 'Persian Gulf Petrochemical Industry Company' vs 'Persian Gulf Petrochemical Industry Commercial' (Score: 0.9573)
  False Negative Examples (showing top 5 of 627):
    1. 'unité militaire 26165' vs '85. hlavné centrum pre špeciálne služby (GCSS) Hlavného riaditeľstva Generálneho štábu ozbrojených síl Ruskej federácie (GU/GRU)' (Score: 0.7159)
    2. 'Ди Ед Органайзейшн оф ди Улема' vs 'MEYMAR TRUST' (Score: 0.7158)
    3. 'Бюро 39' vs 'Third Floor' (Score: 0.7156)
    4. 'ФЕДЕРАЛЛЬНАЯ СЛУЖБА ВОЙСК НАЦИОНАЛЬНОЙ ГВАРДИИ РОССИЙСКОЙ ФЕДЕРАЦИИ' vs 'Националната гвардия на Руската федерация' (Score: 0.7155)
    5. 'МИНИСТЕРСТВО ИМУЩЕСТВЕННЫХ ОТНОШЕНИЙЙ РС' vs 'ГОСУДАРСТВЕННОЕ УНИТАРНОЕ ПРЕДПРИЯТИЕ " ДИРЕКЦИЯ АЭРОПОРТОВ РЕСПУБЛИКИ САХА (ЯКУТИЯ) "' (Score: 0.7154)

rjurney · July 12, 2025, 6:51am

I trained the model locally on my GPU machine and got through 2 epochs and got these results:

Here is the company only model on a test dataset:

Confusion Matrix

True Positives: 3,685
False Positives: 841
True Negatives: 4,847
False Negatives: 627

How does that compare with what you’ve got?

Performance Metrics

Accuracy: 0.8532
Precision: 0.8142
Recall: 0.8546
F1 Score: 0.8339
AUC-ROC: 0.9252
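
The first four metrics follow directly from the confusion matrix, which makes for an easy cross-check. A small sketch (the function name is mine; AUC-ROC cannot be recovered this way because it needs the raw scores):

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = classification_metrics(tp=3685, fp=841, tn=4847, fn=627)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.8532, precision: 0.8142, recall: 0.8546, f1: 0.8339
```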

Here is the checks.yml performance score… too many YES answers… look at how high the F1 is compared to accuracy and precision! 1.0 recall

============================================================
COMBINED EVALUATION SUMMARY
============================================================
Total checks: 182
Overall Accuracy:  0.5886
Overall Precision: 0.5886
Overall Recall:    1.0000
Overall F1 Score:  0.7410
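
These numbers are what you would expect from a model that answers YES to every check: with no negative predictions, FN and TN are both zero, so recall is 1.0 and precision equals accuracy (both are just the share of true matches). A quick sketch to confirm the F1 figure, with a helper name of my own:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# If every check is answered YES, precision = accuracy = share of true matches
# and recall = 1.0.
positive_share = 0.5886
all_yes_f1 = f1_score(positive_share, 1.0)
print(f"{all_yes_f1:.4f}")  # 0.7410
```

So the F1 of 0.7410 here reflects the class balance of the checks rather than anything the model has learned.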

rjurney · July 14, 2025, 9:53am

@pudo can you clue me in on how to perform the filter you mentioned - to remove the low quality Wikipedia entries from the match data? I read the file you mentioned but it does just show them as starting with Q…

github.com/opensanctions/qarin

namepairs/duck_gen.py

main

# import os
import duckdb
import logging
from pathlib import Path
from typing import Optional
from normality import latinize_text

# from rapidfuzz.distance import Levenshtein
from fingerprints import fingerprint
from fingerprints import clean_entity_prefix
from rigour.names.tokenize import tokenize_name
# from rigour.names.compare import align_name_parts
# from rigour.text.distance import levenshtein_similarity

STATEMENTS_PATH = Path("/Users/pudo/Data/statements.csv")
RESOLVER_PATH = Path("/Users/pudo/Code/operations/etl/data/resolve.ijson")
con = duckdb.connect("work.duckdb")
log = logging.getLogger("duck_gen")

score_count = 0

This file has been truncated.
