Using our own custom dataset in yente?

Hey OpenSanctions team!

I’m looking for advice on how to load a non-OpenSanctions dataset into our on-prem yente instance.

We’ve got a CSV file containing a block list of customers, which we want to update once a month. Each row includes a person’s name, date of birth, tax ID and customer ID.

How can I screen against this list alongside the main OpenSanctions data using yente?


Hi @disco9283, welcome here!

We’ve got some existing documentation for this process, but here’s a quick overview:

  • You’ll need to convert your data to the FollowTheMoney data model using a CSV or SQL mapping. This is essentially a small YAML file that tells the FollowTheMoney command-line tool (ftm) how to convert each row in your data into a JSON object that yente will understand.
    • To do this, you need the followthemoney Python library installed somewhere. If you’re comfortable opening a shell inside a Docker container (docker run -ti --mount type=bind,src=/Users/pudo/Data,dst=/data ghcr.io/opensanctions/yente:latest bash), you can use the copy that comes pre-installed in the yente container :slight_smile:
    • Make sure your FtM dataset uses unique keys (entity IDs) that aren’t used by the main OpenSanctions dataset. By default, FtM generates keys as a hash of some of the columns of the source data, which will work well.
  • Next, you’ll need to make a manifest file for yente. This is shown in the documentation, but the important thing here is that the manifest file needs to specify a path to your newly made FtM data.
    • The data path - alongside the manifest file - will need to be accessible inside your yente container, e.g. using a bind mount or as an HTTP URL. If you’re using a bind mount, yente will use the modification time of the file to check if the data has been updated, so automatic updates should work as expected.
    • Inside the manifest YAML, you need to specify an extra dataset for your in-house data. You may also want to define an extra collection (i.e. a dataset with a children section) that includes both that new dataset and the OpenSanctions default collection. This combined collection then becomes the scope for your API queries (e.g. /search/combinedscope).
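
To give a feel for what the converted data looks like, here is a rough sketch in plain Python. It is not the ftm mapping route described above, just a standard-library stand-in that emits one FollowTheMoney-style JSON object per CSV row. The column names (name, date_of_birth, tax_id, customer_id) and the acme-blocklist ID prefix are assumptions made up for illustration, so adjust them to your file. The prefix plus hash is one way to keep your entity IDs from colliding with OpenSanctions IDs.

```python
import csv
import hashlib
import io
import json

# Hypothetical prefix that keeps these IDs apart from OpenSanctions entity IDs.
ID_PREFIX = "acme-blocklist"


def make_entity_id(customer_id: str) -> str:
    """Derive a stable entity ID by hashing the prefixed customer ID."""
    raw = f"{ID_PREFIX}:{customer_id}".encode("utf-8")
    return f"{ID_PREFIX}-{hashlib.sha1(raw).hexdigest()}"


def row_to_entity(row: dict) -> dict:
    """Turn one CSV row into a FollowTheMoney-style Person entity."""
    # Assumed CSV columns mapped to FtM Person properties.
    candidates = {
        "name": row.get("name"),
        "birthDate": row.get("date_of_birth"),
        "taxNumber": row.get("tax_id"),
    }
    # FtM property values are lists; skip blank cells entirely.
    properties = {key: [value] for key, value in candidates.items() if value}
    return {
        "id": make_entity_id(row["customer_id"]),
        "schema": "Person",
        "properties": properties,
    }


def convert(csv_text: str) -> str:
    """Convert CSV text into line-delimited JSON, one entity per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row_to_entity(row)) for row in reader)


if __name__ == "__main__":
    sample = (
        "name,date_of_birth,tax_id,customer_id\n"
        "Jane Doe,1980-02-29,DE123,C-001\n"
    )
    print(convert(sample))
```

Because the ID depends only on the customer ID, re-running the conversion each month gives the same entity the same ID, which keeps any references to it stable across updates.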

These are just some starting points - please feel free to share your mapping or any extra questions as you move forward!
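
Once the combined collection is indexed, a query against it might be put together roughly like this. This is a sketch under a few assumptions: the base URL (localhost on port 8000) and the combinedscope collection name are placeholders for your own setup, and the /match request body reflects my understanding of yente's matching endpoint, so please check it against the API documentation for your version. The code only builds the URL and payload and does not send anything.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode

YENTE_URL = "http://localhost:8000"  # assumed address of the on-prem instance
SCOPE = "combinedscope"              # assumed name of the combined collection


def search_url(text: str, scope: str = SCOPE) -> str:
    """Build a free-text search URL scoped to the given collection."""
    return f"{YENTE_URL}/search/{quote(scope)}?{urlencode({'q': text})}"


def match_request(name: str, birth_date: Optional[str] = None,
                  tax_number: Optional[str] = None, scope: str = SCOPE):
    """Build the URL and JSON body for a structured Person match query."""
    properties = {"name": [name]}
    if birth_date:
        properties["birthDate"] = [birth_date]
    if tax_number:
        properties["taxNumber"] = [tax_number]
    # "q1" is an arbitrary label used to tell queries apart in the response.
    body = {"queries": {"q1": {"schema": "Person", "properties": properties}}}
    return f"{YENTE_URL}/match/{quote(scope)}", json.dumps(body)


if __name__ == "__main__":
    print(search_url("Jane Doe"))
    url, payload = match_request("Jane Doe", birth_date="1980-02-29")
    print(url)
    print(payload)
```

The payload would go out as a POST with a JSON content type. Structured matching is usually a better fit for screening than free-text search, since it can take the date of birth and tax ID into account as well as the name.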
