I’m looking for advice on how to load a non-OpenSanctions dataset into our on-prem yente instance.
We’ve got a CSV file containing a block list of customers, which we want to update once a month. Each row includes a person’s name, date of birth, tax ID and customer ID.
How can I screen against this list alongside the main OpenSanctions data using yente?
You’ll need to convert your data to the FollowTheMoney data model using a CSV or SQL mapping. This is essentially a small YAML file that tells the FollowTheMoney command-line tool (ftm) how to convert each row in your data into a JSON object that yente will understand.
In order to do this, you need to install the followthemoney Python library somewhere so you can use it. If you’re comfortable opening a shell inside a Docker container (docker run -ti --mount type=bind,src=/Users/pudo/Data,dst=/data ghcr.io/opensanctions/yente:latest bash), you can use the copy that is pre-installed inside the yente container.
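To make the target format concrete, here is a rough Python sketch of the kind of JSON object each row should end up as. It is not a replacement for the ftm mapping, just an illustration of the shape: an id, a schema, and properties whose values are lists. The column names (name, dob, tax_id, customer_id), the sample rows and the "blocklist-" ID prefix are assumptions about your file, so adjust them to match your data.

```python
import csv
import io
import json

# Hypothetical sample data; the column names are assumptions about the CSV layout.
SAMPLE_CSV = """name,dob,tax_id,customer_id
Jane Doe,1980-02-29,123-45-678,C-1001
John Smith,,987-65-432,C-1002
"""


def row_to_entity(row: dict) -> dict:
    """Build a FollowTheMoney-shaped Person entity from one CSV row."""
    candidates = {
        "name": row["name"],
        "birthDate": row["dob"],
        "taxNumber": row["tax_id"],
    }
    # FtM property values are lists; leave out properties with no value.
    properties = {prop: [value] for prop, value in candidates.items() if value}
    return {
        # Placeholder ID scheme: a fixed prefix plus the customer ID.
        "id": f"blocklist-{row['customer_id']}",
        "schema": "Person",
        "properties": properties,
    }


entities = [row_to_entity(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for entity in entities:
    # One JSON object per line.
    print(json.dumps(entity))
```

In practice the YAML mapping does this job for you; the sketch is mainly useful for checking that the output of ftm looks the way you expect.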
Make sure your FtM dataset uses unique keys (entity IDs) that aren’t used by the main OpenSanctions dataset. By default, FtM generates keys as a hash of some of the columns of the source data, which will work well.
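As an illustration of why hashed keys are unlikely to collide with OpenSanctions IDs, here is a small Python sketch of the general idea. It is not FtM's actual key algorithm; the make_entity_id helper, the "blocklist" prefix and the sample customer IDs are all hypothetical. The point is only that the same key columns always give the same ID, and different rows give different IDs.

```python
import hashlib


def make_entity_id(prefix: str, *key_values: str) -> str:
    """Derive a stable ID from key column values (illustrative, not FtM's own scheme)."""
    digest = hashlib.sha1()
    for value in key_values:
        digest.update(value.strip().encode("utf-8"))
        # Separator byte so that ("ab", "c") and ("a", "bc") hash differently.
        digest.update(b"\x00")
    return f"{prefix}-{digest.hexdigest()}"


id_a = make_entity_id("blocklist", "C-1001")
id_b = make_entity_id("blocklist", "C-1002")
# Surrounding whitespace is stripped, so this matches id_a.
id_a_again = make_entity_id("blocklist", " C-1001 ")

print(id_a)
print(id_b)
```

Because the ID depends only on the key columns, re-running the conversion each month should keep IDs stable for customers whose key values have not changed.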
Next, you’ll need to make a manifest file for yente. This is shown in the documentation, but the important thing here is that the manifest file needs to specify a path to your newly made FtM data.
The data path - alongside the manifest file - will need to be accessible inside your yente container, e.g. using a bind mount or as an HTTP URL. If you’re using a bind mount, yente will use the modification time of the file to check if the data has been updated, so automatic updates should work as expected.
Inside the manifest YAML, you need to specify an extra dataset for your in-house data. You may also want to define an extra collection (i.e. a dataset with a children section) that includes both that new dataset and the OpenSanctions default collection. This combined collection then becomes the scope for your API queries (e.g. /search/combinedscope).
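Once the combined collection is indexed, querying it is just a matter of putting its name in the URL. A minimal Python sketch of building such a search URL follows. The base URL http://localhost:8000, the search_url helper and the sample name are assumptions for illustration, and "combinedscope" stands in for whatever name you give the collection in your manifest.

```python
from urllib.parse import quote, urlencode


def search_url(base_url: str, scope: str, query: str, limit: int = 10) -> str:
    """Build a yente /search URL for the given scope (dataset or collection name)."""
    params = urlencode({"q": query, "limit": limit})
    return f"{base_url.rstrip('/')}/search/{quote(scope, safe='')}?{params}"


# Assumed local deployment and collection name; replace with your own.
url = search_url("http://localhost:8000", "combinedscope", "Jane Doe")
print(url)
```

You can open the resulting URL in a browser or fetch it with any HTTP client to check that entities from both your block list and the OpenSanctions data come back.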
These are just some starting points - please feel free to share your mapping or any extra questions as you move forward!