How to deduplicate custom datasets with the default OpenSanctions dataset?

I’ve created a custom dataset (World Bank Leadership) and deployed it to yente. It
works, but I’m seeing duplicate entities in search results.

For example:

  • My dataset has: “Ajay Banga” (wb-lead-120ea…)

  • OpenSanctions default has: “Ajaypal Singh Banga” (Q4699676)

// Entity 1: Custom dataset
{
  "id": "wb-lead-120ea365ea249e476b8507d585cfdb7e24fa21dd",
  "caption": "Ajay Banga",
  "schema": "Person",
  "datasets": ["worldbank_leadership"],
  "properties": {
    "name": ["Ajay Banga"],
    "topics": ["gov.igo", "role.pep"],
    "sourceUrl": ["https://www.worldbank.org/ext/en/who-we-are/leadership/ajay-banga"]
  },
  "target": true,
  "first_seen": "2025-11-17T18:09:43",
  "last_seen": "2025-11-17T18:09:43"
}

// Entity 2: OpenSanctions Default
{
  "id": "Q4699676",
  "caption": "Ajaypal Singh Banga",
  "schema": "Person",
  "datasets": ["wikidata"],
  "properties": {
    "name": ["Ajaypal Singh Banga"],
    "wikidataId": ["Q4699676"],
    "country": ["zz"]
  },
  "target": true
}

These are the same person but appear as separate results.

Question: How do I deduplicate my custom dataset with the default OpenSanctions
catalog?

Hey! That’s a bigger project :)

We deduplicate the OpenSanctions data internally with our own open source framework, called nomenklatura. It works in three steps: run a comparison on the entities in the dataset (nomenklatura xref), choose which candidate pairs to auto-merge and which ones to verify by hand (nomenklatura dedupe), and then use the resulting SQL lookup table (the so-called resolver) to rewrite and merge the entities of both datasets into one.
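To make the resolver idea a bit more concrete, here is a toy sketch in plain Python. It does not use nomenklatura's API: ToyResolver and apply_resolver are hypothetical names made up for illustration, the lookup table lives in memory rather than in SQL, and keeping the Wikidata ID as the canonical one is an arbitrary choice for this example. The two entities are trimmed versions of the ones in the question.

```python
class ToyResolver:
    """Hypothetical stand-in for the resolver: maps entity IDs to a canonical ID."""

    def __init__(self):
        self.parent = {}

    def canonical(self, entity_id):
        # Follow parent links until we reach the representative ID.
        self.parent.setdefault(entity_id, entity_id)
        root = entity_id
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited ID straight at the root.
        while self.parent[entity_id] != root:
            next_id = self.parent[entity_id]
            self.parent[entity_id] = root
            entity_id = next_id
        return root

    def merge(self, keep_id, other_id):
        # Record a positive match; keep_id's cluster supplies the canonical ID
        # (an arbitrary rule for this sketch).
        keep_root = self.canonical(keep_id)
        other_root = self.canonical(other_id)
        self.parent[other_root] = keep_root
        return keep_root


def apply_resolver(resolver, entities):
    """Rewrite IDs to their canonical form and merge entities that share one."""
    merged = {}
    for entity in entities:
        canonical_id = resolver.canonical(entity["id"])
        target = merged.setdefault(
            canonical_id,
            {"id": canonical_id, "schema": entity["schema"],
             "datasets": [], "properties": {}},
        )
        for dataset in entity["datasets"]:
            if dataset not in target["datasets"]:
                target["datasets"].append(dataset)
        for prop, values in entity["properties"].items():
            bucket = target["properties"].setdefault(prop, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return list(merged.values())


WB_ID = "wb-lead-120ea365ea249e476b8507d585cfdb7e24fa21dd"
entities = [
    {"id": WB_ID, "schema": "Person", "datasets": ["worldbank_leadership"],
     "properties": {"name": ["Ajay Banga"], "topics": ["gov.igo", "role.pep"]}},
    {"id": "Q4699676", "schema": "Person", "datasets": ["wikidata"],
     "properties": {"name": ["Ajaypal Singh Banga"], "wikidataId": ["Q4699676"]}},
]

resolver = ToyResolver()
resolver.merge("Q4699676", WB_ID)  # the decision a reviewer would record
merged = apply_resolver(resolver, entities)
print(merged)
```

After the merge decision is recorded, both source entities collapse into a single record that carries both names and both dataset labels, which is the effect the rewrite step has on search results.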

We wrote a blog post about the process a long time ago, and there are some superficial instructions in the nomenklatura README file. We’d love any help documenting this further.

Doing this is a. lot. of. work. So I’d also make sure that you can’t just live with multiple hits on the PEP.
