Zavod and yente setup questions

Hello. First of all: thank you for all the work you have put into this project. I am very new to both yente and zavod, and I would appreciate any hints to get started.

I have both tools running locally.

Q1: I created a custom yente manifest that points to local files I was able to create with zavod.

datasets:
  - name: cz_national_sanctions
    title: "CZ National Sanctions (Zavod)"
    path: /Users/sbz/code/opensanctions/data/datasets/cz_national_sanctions/entities.ftm.json
    version: "20251016"

  - name: de_bka_wanted
    title: "DE BKA Wanted (Zavod)"
    path: /Users/sbz/code/opensanctions/data/datasets/de_bka_wanted/entities.ftm.json
    version: "20251016"

  - name: default
    title: "Default: All (CZ + DE BKA)"
    datasets:
      - cz_national_sanctions
      - de_bka_wanted

Does this look correct? It seems to be working fine so far.

Q2: Can zavod automatically create catalog files?

The default yente manifest looks like this:

catalogs:
  - url: "https://data.opensanctions.org/datasets/latest/index.json"
    scope: default
    resource_name: entities.ftm.json

Can zavod create such an index.json file and update it automatically every time I do a zavod run --latest datasets/example/example.yml?

Q3: ZAVOD_RESOLVER_PATH setting

The docs state that

ZAVOD_RESOLVER_PATH must be set to the path to a nomenklatura resolver JSON lines file. It can be an empty file. e.g. data/resolver.ijson

I did not set this, but it still seems to work. Is there a default resolver.ijson config I should be using?

Q4: post zavod run tasks

Are there any zavod commands I should execute after a zavod run for deduplication or other data improvements, or is that done automatically during the run?

Q5: Transliteration support

Am I correct in seeing that transliteration support (example: the .ftm data file contains the name "Berti" but I search for “Берти”) works ONLY on the /match endpoint and not on the /search endpoint?

That's all for now. I would greatly appreciate your guidance.

Hey @sbz, welcome! Thanks for posting this question. I want to take a tiny step back and ask: do you really want to run zavod? Doing that makes sense if you’re an open source enthusiast, but it’s not anywhere near the shortest path to getting a running yente instance with sanctions data in it :)

Some comments re your questions:

Q1: That looks correct, although it requires your yente instance to have access to the whole file system of your computer. If you’re running it in Docker, you’ll want to use relative paths and a volume mount instead.

Q2: Yes, for any collection that you zavod export, a catalog.json is generated. So you could (in your example) make a new collection, maybe datasets/_collections/custom.yml, that contains your cz_ and de_ datasets, and then zavod export that collection after running the individual datasets.

Q3: That’s actually not needed any more; it’s cruft we should clean up. The resolver now lives in the SQL database. The contents of the resolver (now a table, formerly a file) are the result of running zavod xref followed by zavod dedupe. The latter is for manual resolution of uncertain entity matches, and the specific decisions we’ve made there are our little value-add for the commercial product, so you’d need to do the same work :)

Q4: zavod run is a bundle that contains zavod crawl (fetch the data), zavod export (build the output files), and zavod publish (upload them to public storage). Any time you run zavod export, it will emit de-duplicated entities based on the contents of your resolver table. For how to build those contents, see Q3 :)
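
To make the order of operations from Q2 and Q4 concrete, here is a minimal Python sketch that plans the commands: run each source dataset, then export the collection. The two dataset YAML paths are hypothetical placeholders, the collection path is the one suggested in Q2, and it is an assumption that zavod export accepts a YAML path the same way zavod run does in the question. By default the script only prints the plan and executes nothing.

```python
import shlex
import subprocess
from typing import List, Sequence


def plan_commands(dataset_files: Sequence[str], collection_file: str) -> List[List[str]]:
    """Return zavod invocations: run every dataset, then export the collection."""
    commands = [["zavod", "run", "--latest", path] for path in dataset_files]
    # Exporting the collection is the step that, per the answer to Q2,
    # also generates the catalog file for that collection.
    commands.append(["zavod", "export", collection_file])
    return commands


def run_plan(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(list(command), check=True)


# Hypothetical paths: adjust them to where the dataset YAML files actually live.
DATASETS = [
    "datasets/cz/national_sanctions/cz_national_sanctions.yml",
    "datasets/de/bka_wanted/de_bka_wanted.yml",
]
COLLECTION = "datasets/_collections/custom.yml"

if __name__ == "__main__":
    run_plan(plan_commands(DATASETS, COLLECTION))
```

Calling run_plan(..., dry_run=False) would actually invoke zavod, stopping at the first command that fails.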

Q5: That’s correct, very nice observation! We do really limited processing on search queries at the moment, although the very simple example you included should be handled by Elastic’s built-in asciifolding. Still, in the future it might be cool to annotate query text with name symbols and add that as silent query context to bring up recall a bit…
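
For comparison, the two endpoints are called quite differently, which is easy to try side by side. The sketch below only builds the requests and sends nothing. It assumes a yente instance at http://localhost:8000 and the default scope from the manifest in Q1; both are assumptions to adjust. /search takes free text in a q query parameter, whereas /match takes a structured example entity in a POST body.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: a local yente instance
SCOPE = "default"  # dataset or collection name from the manifest


def build_search_request(text: str) -> Request:
    """GET /search/{scope}: free-text query."""
    query = urlencode({"q": text})
    return Request(f"{BASE_URL}/search/{quote(SCOPE)}?{query}", method="GET")


def build_match_request(name: str, schema: str = "Person") -> Request:
    """POST /match/{scope}: query-by-example with a structured entity."""
    payload = {"queries": {"q1": {"schema": schema, "properties": {"name": [name]}}}}
    return Request(
        f"{BASE_URL}/match/{quote(SCOPE)}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


search = build_search_request("Берти")
match = build_match_request("Берти")
print(search.get_method(), search.full_url)
print(match.get_method(), match.full_url)
# To send one for real, pass it to urllib.request.urlopen and JSON-decode the body.
```

Sending both requests against the same index and comparing the results is a quick way to confirm the behaviour described in the question.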

Cheers

– Friedrich