Deep, fuzzy matching person and company names with Project Eridu

Dear Open Sanctions Community,

I wanted to share a model for person and company name matching that we’re working on with the Open Sanctions community. It is still a work in progress and doesn’t work quite right yet, but I wanted to gauge interest in the idea. It uses the Open Sanctions Matcher training data to fine-tune a pre-trained paraphrase model, improving its performance at matching person and company names. This lets you compare one or more sets of names using cosine similarity on the text embedding vectors.

At this point I’m looking for feedback on the idea. When it actually works, it will automagically incorporate lessons learned from the 2.3M unique pairs of names in the Matcher training data :slight_smile: The library comes with a sweet CLI that lets you download the Matcher training data, run a report on the labels, train a model and then test inference.

Some information from the README:

Deep fuzzy matching people and company names for multilingual entity resolution using representation learning… that incorporates a deep understanding of people and company names and works much better than string distance methods.

This project is a deep fuzzy matching system for entity resolution using representation learning. It is designed to match people and company names across languages and character sets, using a pre-trained text embedding model from HuggingFace that we fine-tune using contrastive learning on 2 million labeled pairs of person and company names from the Open Sanctions Matcher training data. The project includes a command-line interface (CLI) utility for training the model and comparing pairs of names using cosine similarity.

Matching people and company names is intractable with traditional parsing-based methods: there is too much variation across cultures and jurisdictions to solve the problem with hand-written rules. This is exactly the kind of culturally grounded problem where programming a solution approaches infinite complexity, so machine learning is used to automatically "write the program" instead. Since 2008 there has been an explosion of deep learning methods that automate feature engineering via representation learning, including text embeddings. This project loads the pre-trained paraphrase-multilingual-MiniLM-L12-v2 paraphrase model from HuggingFace and fine-tunes it for the name matching task using contrastive learning on more than 2 million labeled pairs of matching and non-matching (just as important) person and company names from the Matcher training data, producing a deep fuzzy matching system for entity resolution.
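To make the approach concrete, here is a minimal sketch of contrastive fine-tuning with the Sentence Transformers fit API; the example pairs and hyperparameters are illustrative, not the project’s actual training script:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from the pre-trained multilingual paraphrase model
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Labeled name pairs: 1 = same entity, 0 = different entity (examples made up here)
train_examples = [
    InputExample(texts=["Vladimir Putin", "Владимир Путин"], label=1),
    InputExample(texts=["Gazprom PJSC", "ПАО Газпром"], label=1),
    InputExample(texts=["John Smith", "Acme Holdings Ltd."], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# ContrastiveLoss pulls matching pairs together and pushes non-matches
# apart by at least a margin in the embedding space
train_loss = losses.ContrastiveLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```

This is also why the non-matching pairs in the training data matter just as much as the matches: the loss needs something to push apart.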

This model is available on HuggingFace Hub as Graphlet-AI/eridu and can be used in any Python project using the Sentence Transformers library. The model is designed to be used for entity resolution tasks, such as matching people and company names across different languages and character sets.
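Here is a minimal usage sketch, assuming the model is published under Graphlet-AI/eridu as described:

```python
from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned name-matching model from HuggingFace Hub
model = SentenceTransformer("Graphlet-AI/eridu")

left = ["Vladimir Putin", "Gazprom PJSC"]
right = ["Владимир Путин", "ПАО Газпром", "John Smith"]

# Embed both sets of names and compare them with cosine similarity
left_emb = model.encode(left, convert_to_tensor=True)
right_emb = model.encode(right, convert_to_tensor=True)

scores = util.cos_sim(left_emb, right_emb)  # shape: (len(left), len(right))
print(scores)
```

The scores are cosine similarities, so higher means a closer name match; picking a decision threshold is left to the application.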

The project is on GitHub at Graphlet-AI/eridu: Deep fuzzy matching people and company names for multilingual entity resolution using representation learning.

On HuggingFace Hub as Graphlet-AI/eridu

On PyPI as eridu

Please let me know what you think! I can only post two links per post atm, so thanks for bearing with me :slight_smile:


So there was a problem in training the first model, so I am retraining it today :slight_smile: I believe the bug was that the best model was not being saved…

You can see today’s work here: I am altering model saving to be epoch-based rather than step-based… I think the model was testing well but performing poorly because it was not saving the actual best model.

These changes are the basis for today’s training run with these settings:

eridu train --use-gpu \
            --batch-size 768 \
            --epochs 4 \
            --patience 1 \
            --resampling \
            --weight-decay 0.01 \
            --random-seed 31337 \
            --warmup-ratio 0.1 \
            --learning-rate 3e-5 \
            --save-strategy epoch \
            --eval-strategy epoch \
            --sample-fraction 0.1 \
            --fp16

Once I train it, I will do a deep dive into the results of classifying name pairs in the test set to understand whether and why the actual best model is ranking dissimilar names too similarly. My fallback options are to split the person and company name models in two, making each task easier to learn, and to try OnlineContrastiveLoss, which selects the hardest pairs within each batch to train on rather than the 10% random sample I am currently using each epoch.
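If I do go the OnlineContrastiveLoss route, the change in Sentence Transformers should be small; a minimal sketch, assuming the same fit-style training setup as the fine-tuning example above:

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# OnlineContrastiveLoss computes the loss only on the hard positives and hard
# negatives within each batch, rather than on every sampled pair
train_loss = losses.OnlineContrastiveLoss(model)
```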

Here is the live training run: Weights & Biases

Stay tuned!

@pudo does 21M unique pairs of labeled names - including both company and person - sound like the right number to you? This comes from pairing all names on the left with all names on the right, maintaining the label.
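To make the pairing concrete, here is a minimal sketch of what I mean; the record structure and column names are illustrative, not the actual Matcher schema:

```python
from itertools import product

import pandas as pd

# One labeled record with several aliases on each side (illustrative data)
records = pd.DataFrame(
    {
        "left_names": [["Gazprom PJSC", "ПАО Газпром"]],
        "right_names": [["Gazprom", "Gazprom Public Joint Stock Company"]],
        "label": [1],
    }
)

# Cross every left name with every right name, keeping the record's label
pairs = [
    (left, right, row.label)
    for row in records.itertuples()
    for left, right in product(row.left_names, row.right_names)
]
pairs_df = pd.DataFrame(pairs, columns=["left", "right", "label"])
print(len(pairs_df))  # 2 x 2 = 4 pairs from this single record
```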

I am curious what your team thinks about which data I might use versus which data I might filter out. What constitutes the best training data among these 21M labeled pairs?

This is the performance after a single epoch… very promising.

I am pretty confused by the quantified results compared with my own use of the model. You can go to Graphlet-AI/eridu · Hugging Face, click on the HF Inference API widget, and try comparing a list of names against a given name. It behaves terribly. I am saving the test data as Parquet so I can dig in and figure out why the model scores well on the test set but isn’t learning the semantics of names in so many corner cases… over-fitting? Or did the upload to HuggingFace get messed up? Got to debug it :slight_smile:
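Once I have that Parquet file, the plan is roughly this; the file path and column names below are assumptions, not the actual output schema:

```python
import pandas as pd

# Hypothetical path and columns for the saved test-set predictions
df = pd.read_parquet("data/test_predictions.parquet")

# Compare the score distributions of matches vs. non-matches
print(df.groupby("label")["cosine_score"].describe())

# Pull the worst offenders: non-matching pairs the model scores as very similar
false_positives = df[(df["label"] == 0) & (df["cosine_score"] > 0.9)].sort_values(
    "cosine_score", ascending=False
)
print(false_positives[["left_name", "right_name", "cosine_score"]].head(20))
```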

Another run; this time I will be saving the test data to do some data science and see what is going on.

I disabled FP16 mixed precision, dropped the batch size from 768 to 64 (a batch size of 16 was taking 10 hours per epoch), and I’m running 5 epochs instead of 4.

Yeah those numbers sound pretty accurate. Some of this is synthetic, e.g. made from just juxtaposing different Wikidata entries that have different names. That, in my mind, would be the first to go.

Here’s the script I used to make the file: qarin/namepairs/duck_gen.py at main · opensanctions/qarin on GitHub. If you remove the query on L173, that gets rid of the Wikidata name pairs.

Btw, have you ever heard of https://www.zingg.ai/? I spoke to their CEO recently; she’s really into building open source ER.