Dear Open Sanctions Community,
I wanted to share a work-in-progress model for matching person and company names that we’re building with the Open Sanctions community. It doesn’t work quite right yet, but I wanted to gauge interest in the idea. It uses the Open Sanctions Matcher training data to fine-tune a pre-trained paraphrase model so that it performs better on person and company names. You can then compare one or more sets of names using cosine similarity on their text embedding vectors.
At this point I’m looking for feedback on the idea. When it actually works, it will automagically incorporate lessons learned from the 2.3M unique pairs of names in the Matcher training data. The library comes with a sweet CLI that lets you download the Matcher training data, run a report on the labels, train a model and then test inference.
Some information from the README:
Deep fuzzy matching people and company names for multilingual entity resolution using representation learning… that incorporates a deep understanding of people and company names and works much better than string distance methods.
This project is a deep fuzzy matching system for entity resolution using representation learning. It is designed to match people and company names across languages and character sets, using a pre-trained text embedding model from HuggingFace that we fine-tune using contrastive learning on 2 million labeled pairs of person and company names from the Open Sanctions Matcher training data. The project includes a command-line interface (CLI) utility for training the model and comparing pairs of names using cosine similarity.
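To make the comparison step concrete, here is a minimal sketch of cosine similarity between embedding vectors in plain Python. The three-dimensional vectors and the `toy_embeddings` names are made up for illustration; they are not output from the model, whose real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", for illustration only.
toy_embeddings = {
    "name_a": [0.9, 0.1, 0.3],
    "name_a_variant": [0.8, 0.2, 0.35],
    "unrelated_name": [-0.2, 0.9, -0.4],
}

similar = cosine_similarity(toy_embeddings["name_a"], toy_embeddings["name_a_variant"])
different = cosine_similarity(toy_embeddings["name_a"], toy_embeddings["unrelated_name"])
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score near or below 0. A fine-tuned model is meant to place variants of the same name close together in this sense.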
Matching people and company names is an intractable problem using traditional parsing-based methods: there is too much variation across cultures and jurisdictions for humans to solve it with hand-written programs. Machine learning is used for culturally dependent problems like this one, where programming a solution by hand approaches infinite complexity, to write the program automatically. Since 2008 there has been an explosion of deep learning methods that automate feature engineering via representation learning, including text embeddings. This project loads the pre-trained paraphrase-multilingual-MiniLM-L12-v2 paraphrase model from HuggingFace and fine-tunes it for the name matching task using contrastive learning on more than 2 million labeled pairs of matching and non-matching (just as important) person and company names from the Matcher training data to create a deep fuzzy matching system for entity resolution.
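For readers new to contrastive learning, the sketch below shows the textbook pairwise contrastive loss for a single labeled pair. It is only an illustration of why the non-matching pairs matter; the exact loss and settings the project uses may differ, and the `margin` value here is a hypothetical default.

```python
def contrastive_loss(distance, is_match, margin=0.5):
    """Textbook pairwise contrastive loss for one labeled pair.

    distance: non-negative distance between the two embeddings
              (for example, 1 - cosine similarity).
    is_match: True if the pair is labeled as the same entity.
    margin:   hypothetical value; non-matching pairs farther apart
              than this contribute no loss.
    """
    if is_match:
        # Matching pairs are penalized for being far apart.
        return 0.5 * distance ** 2
    # Non-matching pairs are penalized only when closer than the margin.
    return 0.5 * max(0.0, margin - distance) ** 2
```

Matching pairs pull their embeddings together, while non-matching pairs push them apart until they clear the margin. Without the negatives, nothing would stop the model from mapping every name to the same point.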
This model is available on HuggingFace Hub as Graphlet-AI/eridu and can be used in any Python project using the Sentence Transformers library. The model is designed to be used for entity resolution tasks, such as matching people and company names across different languages and character sets.
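A minimal sketch of that usage, assuming the `sentence-transformers` package is installed and the model can be downloaded from the Hub. The helper name `name_similarity` is hypothetical and not part of the eridu library.

```python
def name_similarity(name_a, name_b, model_name="Graphlet-AI/eridu"):
    """Return the cosine similarity of two names' embeddings as a float."""
    # Imported lazily so the snippet loads even without the library installed.
    from sentence_transformers import SentenceTransformer, util

    # Downloads the model on first use. It is reloaded on every call here
    # for simplicity; real code would load it once and reuse it.
    model = SentenceTransformer(model_name)
    embeddings = model.encode([name_a, name_b], convert_to_tensor=True)
    # cos_sim returns a 1x1 tensor for a single pair of vectors.
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```

Calling `name_similarity("John Smith", "Jon Smith")` returns a single score, with higher values meaning the model considers the two names more likely to refer to the same entity. Any cutoff for declaring a match is something you would need to tune on your own data.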
The project is available:
On GitHub as Graphlet-AI/eridu
On HuggingFace Hub as Graphlet-AI/eridu
On PyPI as eridu
Please let me know what you think! I can only post two links per post atm, so thanks for bearing with me.