Deep, fuzzy matching person and company names with Project Eridu

Thanks for posting the updates! That’s a lot of false positives, I feel. Is there perhaps a way to add a second stage/phase that is trained on a more discriminating set of criteria and that discards most of the false positives?
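To make the second-stage idea concrete, here is a minimal sketch of a two-pass filter. It is not how Eridu works: both stages use plain string similarity from Python's standard library as a stand-in for a trained model, and the function names, thresholds and example names are all made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def stage_one(pairs, threshold=0.6):
    """High-recall pass: keep anything that looks vaguely similar."""
    return [(a, b) for a, b in pairs if similarity(a, b) >= threshold]


def stage_two(pairs, token_threshold=0.85):
    """Stricter pass: every token of the shorter name must have a
    close counterpart among the tokens of the other name."""
    kept = []
    for a, b in pairs:
        ta, tb = a.lower().split(), b.lower().split()
        short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
        if short and all(
            max(similarity(t, u) for u in long_) >= token_threshold
            for t in short
        ):
            kept.append((a, b))
    return kept


# Hypothetical candidate pairs, not taken from any real dataset.
candidates = [
    ("John Smith", "Jon Smith"),
    ("John Smith", "Joan Smits"),
    ("Acme Corp", "Zebra Industries"),
]
loose = stage_one(candidates)   # permissive first pass
survivors = stage_two(loose)    # stricter second pass
print(survivors)
```

In a real setup the second stage would presumably be a model trained on hard negatives rather than a hand-written rule, but the shape is the same: a permissive first pass for recall, then a stricter pass that only sees the surviving candidates.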

Coming down to use cases, this would be super interesting for data integration/dedupe/merging scenarios, but of course for any screening use you’d want something that is really really precise.

So the thing that generates “synthetic” Wikidata matches is the code in qarin/namepairs/duck_gen.py (main branch of opensanctions/qarin on GitHub). If you comment that out and run all the steps (down in if __name__ == "__main__"), then you’d not only get a reduced version of the matching table, but I also think the DuckDB resulting from this would be fun to play with in terms of other experiments.

I’ve sent you the resolver file on Slack, very happy to also send you my version of the DuckDB file if needed - although doing it from scratch seems much cleaner.

When I read this, I just see it using the Q prefix as what makes it a wiki example… right?

Yeah, QIDs are Wikidata IDs and we never mint them other than when referencing a Wikidata thing. We also only source Person entities from Wikidata, so it’s implicit in this query that it will only build person-person pairs.
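For anyone who wants to apply the same test outside the query, a minimal sketch of a QID check in Python follows. The helper name is hypothetical and this is not code from qarin; it only encodes the Wikidata convention that an item ID is a capital Q followed by a positive integer.

```python
import re

# Wikidata item IDs: capital "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def is_qid(entity_id: str) -> bool:
    """Return True if the whole ID has the shape of a Wikidata QID."""
    return QID_PATTERN.fullmatch(entity_id) is not None


# Example IDs are illustrative only.
ids = ["Q42", "q42", "Q0", "Q42x", "some-other-id"]
print([i for i in ids if is_qid(i)])
```

Using fullmatch rather than a prefix check avoids treating any ID that merely starts with Q as a Wikidata reference.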

Sooo glad to find that you have done amazing work on matching company names. I’ve been working on it for a long time. BUT after trying to use it, I sadly found that eridu has been deleted from the Hugging Face Hub and can’t be used anymore. Could you please tell me how I can use it now? Thank you soooooo much. Best wishes.