Using own custom dataset in yente?

disco9283 · February 19, 2025, 9:55am

Hey OpenSanctions team!

I’m looking for advice on how to load a non-OpenSanctions dataset into our on-prem yente instance.

We’ve got a CSV file that we want to update once a month that contains a block list of customers. Each row includes a person name, date of birth, tax ID and their customer ID.

How can I screen against this list alongside the main OpenSanctions data using yente?

pudo · February 19, 2025, 12:53pm

Hi @disco9283, welcome here!

We’ve got some existing documentation for this process, but here’s a quick overview:

You’ll need to convert your data to the FollowTheMoney data model using a CSV or SQL mapping. This is essentially a small YAML file that tells the FollowTheMoney command-line tool (ftm) how to convert each row in your data into a JSON object that yente will understand.
- To do this, you need the followthemoney Python library installed somewhere. If you’re comfortable opening a shell inside a Docker container (docker run -ti --mount type=bind,src=/Users/pudo/Data,dst=/data ghcr.io/opensanctions/yente:latest bash), you can use the copy that is pre-installed inside the yente container.
- Make sure your FtM dataset uses unique keys (entity IDs) that aren’t used by the main OpenSanctions dataset. By default, FtM generates keys as a hash of some of the columns of the source data, which will work well.
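To make the target format a bit more concrete, here is a rough Python sketch of what the conversion produces: one FtM-style JSON object per CSV row, with an ID hashed from a key column. This is not the ftm tool and not a replacement for a proper mapping file; it only illustrates the output shape. The column names (name, date_of_birth, tax_id, customer_id), the acme-blocklist prefix and the exact hashing scheme are assumptions made up for the example.

```python
import csv
import hashlib
import io
import json

# Hypothetical prefix; it keeps these IDs distinct from OpenSanctions entity IDs.
DATASET_PREFIX = "acme-blocklist"

# FtM Person property -> assumed CSV column name.
COLUMNS = {"name": "name", "birthDate": "date_of_birth", "taxNumber": "tax_id"}


def make_entity_id(customer_id: str) -> str:
    """Derive a stable ID from the customer ID (similar in spirit to FtM keys)."""
    raw = f"{DATASET_PREFIX}:{customer_id}".encode("utf-8")
    return f"{DATASET_PREFIX}-{hashlib.sha1(raw).hexdigest()}"


def row_to_entity(row: dict) -> dict:
    """Turn one CSV row into an FtM-style entity dict with multi-valued properties."""
    properties = {}
    for prop, column in COLUMNS.items():
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells
            properties[prop] = [value]
    return {
        "id": make_entity_id(row["customer_id"].strip()),
        "schema": "Person",
        "properties": properties,
    }


def convert(csv_text: str) -> list[str]:
    """Return one JSON document per row (line-delimited JSON when joined)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row_to_entity(row), sort_keys=True) for row in reader]


SAMPLE = (
    "name,date_of_birth,tax_id,customer_id\n"
    "Jane Example,1980-02-29,TX-001,C-1001\n"
)

if __name__ == "__main__":
    for line in convert(SAMPLE):
        print(line)
```

Because the ID depends only on the customer ID, re-running the conversion on next month's CSV yields the same entity IDs for the same customers, which is the property you want for a list that gets refreshed regularly.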
Next, you’ll need to make a manifest file for yente. This is shown in the documentation, but the important thing here is that the manifest file needs to specify a path to your newly made FtM data.
- The data path - alongside the manifest file - will need to be accessible inside your yente container, e.g. using a bind mount or as an http URL. If you’re using a bind mount, yente will use the modification time of the file to check if the data has been updated, so automatic updates should work as expected.
- Inside the manifest YAML, you need to specify an extra dataset for your in-house data. You may also want to define an extra collection (i.e. a dataset with a children section) that includes both that new dataset and the OpenSanctions default collection. This combined collection then becomes the scope for your API queries (e.g. /search/combinedscope).
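Once the combined collection is loaded, a screening query against it could look roughly like the Python sketch below. It builds a request for yente's /match endpoint using the combinedscope name from above. The base URL (localhost on port 8000) and the helper names are assumptions for illustration, and the response fields read in run_query reflect my understanding of the matching API, so please check them against the API documentation of your yente version.

```python
import json
import urllib.request

# Assumed address of the on-prem yente instance; adjust to your deployment.
YENTE_URL = "http://localhost:8000"
# Name of the combined collection defined in the manifest.
SCOPE = "combinedscope"


def build_match_payload(name, birth_date=None, tax_id=None):
    """Build a single-query body for POST /match/<scope> (hypothetical helper)."""
    properties = {"name": [name]}
    if birth_date:
        properties["birthDate"] = [birth_date]
    if tax_id:
        properties["taxNumber"] = [tax_id]
    return {"queries": {"q1": {"schema": "Person", "properties": properties}}}


def build_request(payload):
    """Wrap the payload in an HTTP POST request without sending it."""
    return urllib.request.Request(
        f"{YENTE_URL}/match/{SCOPE}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request):
    """Send the request and return the result list for query q1."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["responses"]["q1"]["results"]


if __name__ == "__main__":
    req = build_request(build_match_payload("Jane Example", "1980-02-29"))
    # Only print what would be sent; call run_query(req) against a live instance.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Querying the combined scope means a single request checks a customer against both the in-house block list and the OpenSanctions data, so there is no need to merge results from two separate calls.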

These are just some starting points - please feel free to share your mapping or any extra questions as you move forward!
