javier
September 29, 2025, 7:15am
1
I have noticed that when I use the default manifest, uncomment graph, and create a combined dataset, many of the indexes that are newer than the graph date are deleted once everything is indexed. Is this normal behaviour?
Can you please suggest how to set up the manifest so that default and graph coexist?
The examples show that it's possible, but I am not sure if I am doing it right. Here is my manifest:
catalogs:
  - url: https://data.opensanctions.org/datasets/latest/index.json
    scope: default
    resource_name: entities.ftm.json
  - url: https://data.opensanctions.org/datasets/latest/graph/catalog.json
    scope: graph
    resource_name: entities.ftm.json
datasets:
  - name: combined
    title: Combined
    datasets:
      - default
      - graph
javier
September 30, 2025, 8:01am
2
To correct my first post: after the indexing finishes, all the indexes that get deleted are downloaded again and indexed from the start. This process never finishes.
Do the datasets in default and graph conflict with each other?
I tried this config with exactly the same results (using the latest 5.0.1 docker image):
catalogs:
  - url: "https://data.opensanctions.org/datasets/latest/index.json"
    scope: default
    resource_name: entities.ftm.json
  - url: "https://data.opensanctions.org/datasets/latest/graph/catalog.json"
    resource_name: entities.ftm.json
datasets: []
pudo
September 30, 2025, 8:34am
3
Hi Javier, this sounds like it may be an issue with Yente 5.0. Just to be clear: graph and default do conflict with each other - they contain the same stuff. You can try and import default and kyb into the same yente - that still breaks some entities but it’s not as bad. In general, indexing the kyb datasets is an order of magnitude more complex than just doing default on its own. It’s going to require significant sysadmin resources. It may be worth asking: what is your business case for indexing both? Are you really dependent on that outcome for your adoption of OpenSanctions?
javier
September 30, 2025, 9:45am
4
At this point I am still researching the software and its capabilities. I'm experimenting with various datasets to make an agentic profiling tool. The goal is to build a screening process flow that drills down into matched entities in the datasets to establish a detailed view of their roles, ownership, connected people, relatives, addresses, etc. These days "sketchy" entities use various companies, family members or friends to set up shop and avoid deeper AML checks.
What made me look at graph was the sanction.linked data, which was not fully present in default but was in graph via ann_graph_topics. Entities who were directors of a sanctioned company didn't show up in default because the company data was not complete (maybe this was due to a bad manifest or deleted indexes).
But then I realized that graph is not as up to date as default. Hence I tried to combine them to get the best out of both. So I think I'm not setting it up right. I used graph because it includes datasets such as gb_coh_psc, whose entities are not available to download as a resource.
Thanks for clarifying that datasets and catalogues conflict; it was not clear until now. I need to better understand the use and combination of the datasets and the correct practices for the manifest, so that I know which datasets/catalogues work well with each other and which don't.
Is it advised to keep separate yente instances/indexes with separate manifests to avoid conflicts and data deletion?
How should I use default + kyb correctly in a manifest?
Is there an updated manifest example anywhere? The ones I find in the docs still say scope: all and combine default with graph.
Thank you for your interest!
jbothma
September 30, 2025, 4:44pm
5
Hi Javier
I’m trying to reproduce what you’re seeing and just want to make sure I’m reproducing the exact scenario.
Could you share some of the logs from indexing? Ideally a full yente reindex run, but specifically some lines like
No update needed
Indexing entities (all)
Create index (all)
Streaming data (all)
Index: 42000 entities… (just one or two for a couple of different datasets is fine)
Index is now aliased to: (all)
Any deletion lines (all)
jbothma
September 30, 2025, 5:17pm
6
Hi again Javier
Could you double-check that this happens repeatedly with this config? Specifically where both catalogs have these scopes
catalogs:
  - url: https://data.opensanctions.org/datasets/latest/index.json
    scope: default
    resource_name: entities.ftm.json
  - url: https://data.opensanctions.org/datasets/latest/graph/catalog.json
    scope: graph
    resource_name: entities.ftm.json
datasets:
  - name: combined
    title: Combined
    datasets:
      - default
      - graph
I can see why the load/deletion would happen when the second catalog doesn’t have a scope defined.
Configuration of multiple catalogs with datasets with the same name can conflict - I’ll make a note to make that a bit clearer.
Another issue is that elasticsearch can match candidate entities with the same ID twice and I think elasticsearch will error in that case.
As a way forward, I’d suggest you start with just default again (with scope constrained to default) and see if you get enough of the network around risky entities.
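A minimal sketch of that starting point, assuming the same catalog URL and resource name used in the manifests earlier in this thread (the comments are added for illustration only):

```yaml
# Single-catalog manifest: only the `default` collection is indexed.
catalogs:
  - url: "https://data.opensanctions.org/datasets/latest/index.json"
    # Constrain indexing to the default collection.
    scope: default
    resource_name: entities.ftm.json
# No extra datasets or combined collections are defined.
datasets: []
```

With a single catalog and a single scope, there is no second source that could supply entities with the same IDs.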
If you need all of graph or kyb, I think you’re going to need a separate yente instance. The instances can share an elasticsearch/opensearch instance, but you’ll need to configure a different YENTE_INDEX_NAME setting for each so the two yentes don’t mess with each other’s indices. I think it’s also worth constraining scope for these to their collections.
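As a rough sketch of that split, the second instance could use a graph-only manifest like the one below. The catalog URL and scope are taken from the manifests earlier in this thread; the index name value mentioned in the comments is a hypothetical example, not a recommended default.

```yaml
# Manifest for a separate, graph-only yente instance.
# Run this instance with a distinct index name, for example the
# environment variable YENTE_INDEX_NAME=yente-graph (hypothetical value),
# so its indices do not collide with those of the default-only instance.
catalogs:
  - url: "https://data.opensanctions.org/datasets/latest/graph/catalog.json"
    # Constrain indexing to the graph collection.
    scope: graph
    resource_name: entities.ftm.json
datasets: []
```

Both instances can point at the same elasticsearch/opensearch cluster; only the manifest and the index name differ between them.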
The repeated deletion is a bug, but I think the fix would really be to prevent loading the conflicting datasets. So this overlapping dataset configuration, which appears to be an option, is not supported as far as I can tell. Additional catalogs/datasets are intended to add entities not already in the index.
Clarify multi catalog/dataset manifest configuration by jbothma · Pull Request #904 · opensanctions/yente · GitHub
I’d still love logs just to be sure we’re looking at the same issue, when you have a moment.
javier
October 2, 2025, 8:10am
8
Thanks for the advice,
I dropped the original docker container setup that I had mentioned in the first post and refreshed it again with the second config I posted.
A quick grep: $ docker logs -f yente | grep -v "entities\.\.\." | grep -v healthz | grep -v Refresh
2025-09-30T08:39:53.773429Z [info ] Loading org type/symbol tagger... [rigour.names.tagging]
2025-09-30T08:39:53.797361Z [info ] Loaded organization tagger (8014 terms). [rigour.names.tagging]
2025-09-30T08:39:53.807165Z [info ] Setting up background refresh [yente] auto_reindex=True crontab='52 * * * *'
2025-09-30T08:39:57.696347Z [info ] Index update check [yente.search.indexer]
2025-09-30T08:39:57.774361Z [info ] Acquired lock 27c16eaa-e8c8-4757-9bc7-17faeeb1af2e [yente.search.lock]
2025-09-30T08:39:57.991583Z [info ] HTTP Request: GET https://data.opensanctions.org/artifacts/default/20250930065425-msl/delta.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T08:39:57.992701Z [info ] Indexing entities [yente.search.indexer] base_version=20250930005424-eju dataset=default force=False incremental=True url=https://data.opensanctions.org/datasets/20250930/default/entities.ftm.json version=20250930065425-msl
2025-09-30T08:39:58.273940Z [info ] Cloned index [yente.provider.elastic] base=yente-entities-default-016a040202-20250930005424-eju target=yente-entities-default-016a040202-20250930065425-msl
2025-09-30T08:39:58.275003Z [info ] Fetching data [yente.data.loader] path=/tmp/default-delta-20250930065425-msl url=https://data.opensanctions.org/artifacts/default/20250930065425-msl/entities.delta.json
2025-09-30T08:39:58.494263Z [info ] HTTP Request: GET https://data.opensanctions.org/artifacts/default/20250930065425-msl/entities.delta.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T08:40:00.542383Z [info ] Indexed 4037 entities [yente.search.indexer] added=508 deleted=1681 modified=1848
2025-09-30T08:40:01.185554Z [info ] Index is now aliased to: yente-entities [yente.search.indexer] index=yente-entities-default-016a040202-20250930065425-msl
2025-09-30T08:40:01.714433Z [info ] Indexing entities [yente.search.indexer] base_version=None dataset=graph force=False incremental=False url=https://data.opensanctions.org/datasets/20250815/graph/entities.ftm.json version=20250815190158-ikw
2025-09-30T08:40:01.725283Z [info ] Create index [yente.provider.elastic] index=yente-entities-graph-016a040202-20250815190158-ikw
2025-09-30T08:40:01.816246Z [info ] Fetching data [yente.data.loader] path=/tmp/graph-20250815190158-ikw url=https://data.opensanctions.org/datasets/20250815/graph/entities.ftm.json
2025-09-30T08:40:03.190868Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/20250815/graph/entities.ftm.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T08:52:00.267398Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T08:52:00.695244Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T08:52:01.423272Z [info ] Index update check [yente.search.indexer]
2025-09-30T08:52:01.448045Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T09:52:00.289691Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T09:52:00.727879Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T09:52:01.460322Z [info ] Index update check [yente.search.indexer]
2025-09-30T09:52:01.471108Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T10:52:00.267128Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T10:52:00.695887Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T10:52:01.451054Z [info ] Index update check [yente.search.indexer]
2025-09-30T10:52:01.459795Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T11:52:04.270303Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T11:52:04.707784Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T11:52:05.476391Z [info ] Index update check [yente.search.indexer]
2025-09-30T11:52:05.486768Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T12:52:00.357738Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T12:52:00.978545Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T12:52:01.760802Z [info ] Index update check [yente.search.indexer]
2025-09-30T12:52:01.772359Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T13:52:00.388839Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T13:52:00.983808Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T13:52:01.666134Z [info ] Index update check [yente.search.indexer]
2025-09-30T13:52:01.677214Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T14:52:00.495691Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T14:52:01.172188Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T14:52:01.804247Z [info ] Index update check [yente.search.indexer]
2025-09-30T14:52:01.814286Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T15:52:00.501164Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T15:52:01.108526Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T15:52:01.780478Z [info ] Index update check [yente.search.indexer]
2025-09-30T15:52:01.789675Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T16:52:00.406831Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T16:52:01.020710Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T16:52:01.661645Z [info ] Index update check [yente.search.indexer]
2025-09-30T16:52:01.669950Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T17:52:00.528789Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T17:52:01.162920Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T17:52:01.773383Z [info ] Index update check [yente.search.indexer]
2025-09-30T17:52:01.798859Z [warning ] Failed to acquire lock, skipping index update [yente.search.indexer]
2025-09-30T18:13:43.821838Z [info ] Indexed 151725420 entities [yente.search.indexer] added=151725420 deleted=0 modified=0
2025-09-30T18:13:44.127448Z [info ] Index is now aliased to: yente-entities [yente.search.indexer] index=yente-entities-graph-016a040202-20250815190158-ikw
2025-09-30T18:13:44.214796Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=icij_offshoreleaks index=yente-entities-icij_offshoreleaks-016a040202-20250813165203-eij
2025-09-30T18:13:44.290301Z [info ] Deleting orphaned index [yente.search.indexer] index=yente-entities-default-016a040202-20250930005424-eju
2025-09-30T18:13:44.342436Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=ee_ariregister index=yente-entities-ee_ariregister-016a040202-20250814061801-mzi
2025-09-30T18:13:44.395597Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=ua_edr index=yente-entities-ua_edr-016a040202-20250804115548-ims
2025-09-30T18:13:44.452318Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=lv_business_register index=yente-entities-lv_business_register-016a040202-20250809180716-hqx
2025-09-30T18:13:44.502733Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=cy_companies index=yente-entities-cy_companies-016a040202-20250811203202-gwd
2025-09-30T18:13:44.546399Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=de_offeneregister index=yente-entities-de_offeneregister-016a040202-20250806165258-cec
2025-09-30T18:13:44.602416Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=eu_esma_firds index=yente-entities-eu_esma_firds-016a040202-20250808084114-nyz
2025-09-30T18:13:44.643785Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=us_corpwatch index=yente-entities-us_corpwatch-016a040202-20250814170619-huw
2025-09-30T18:13:44.680351Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=us_irs_ffi index=yente-entities-us_irs_ffi-016a040202-20250726094501-jno
2025-09-30T18:13:44.715215Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=ror index=yente-entities-ror-016a040202-20250810215702-cuh
2025-09-30T18:13:44.748358Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=cz_business_register index=yente-entities-cz_business_register-016a040202-20250809040401-jil
2025-09-30T18:13:44.792827Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=kz_companies index=yente-entities-kz_companies-016a040202-20250520071902-mer
2025-09-30T18:13:44.835809Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=sk_rpvs index=yente-entities-sk_rpvs-016a040202-20250814194301-kbf
2025-09-30T18:13:44.870750Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=gleif index=yente-entities-gleif-016a040202-20250813080802-duz
2025-09-30T18:13:44.913071Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=tj_companies index=yente-entities-tj_companies-016a040202-20250808235901-obq
2025-09-30T18:13:44.939703Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=ru_egrul index=yente-entities-ru_egrul-016a040202-20250812190817-hje
2025-09-30T18:13:44.998535Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=us_fara_filings index=yente-entities-us_fara_filings-016a040202-20250815133701-oup
2025-09-30T18:13:45.029163Z [info ] Deleting index of non-scope dataset [yente.search.indexer] dataset=us_fincen_msb index=yente-entities-us_fincen_msb-016a040202-20250811080802-bhs
2025-09-30T18:13:45.056345Z [info ] Index update complete. [yente.search.indexer]
2025-09-30T18:13:45.062225Z [info ] Released lock 27c16eaa-e8c8-4757-9bc7-17faeeb1af2e [yente.search.lock]
2025-09-30T18:52:00.390225Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/index.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T18:52:01.036706Z [info ] HTTP Request: GET https://data.opensanctions.org/datasets/latest/graph/catalog.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T18:52:01.658208Z [info ] Index update check [yente.search.indexer]
2025-09-30T18:52:01.666376Z [info ] Acquired lock 873705d1-1967-4066-808c-8f439da3d2d4 [yente.search.lock]
2025-09-30T18:52:02.046054Z [info ] HTTP Request: GET https://data.opensanctions.org/artifacts/default/20250930125415-eey/delta.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T18:52:02.050279Z [info ] Indexing entities [yente.search.indexer] base_version=20250930065425-msl dataset=default force=False incremental=True url=https://data.opensanctions.org/datasets/20250930/default/entities.ftm.json version=20250930125415-eey
2025-09-30T18:52:02.382534Z [info ] Cloned index [yente.provider.elastic] base=yente-entities-default-016a040202-20250930065425-msl target=yente-entities-default-016a040202-20250930125415-eey
2025-09-30T18:52:02.383734Z [info ] Fetching data [yente.data.loader] path=/tmp/default-delta-20250930125415-eey url=https://data.opensanctions.org/artifacts/default/20250930125415-eey/entities.delta.json
2025-09-30T18:52:02.666467Z [info ] HTTP Request: GET https://data.opensanctions.org/artifacts/default/20250930125415-eey/entities.delta.json "HTTP/1.1 200 OK" [httpx]
2025-09-30T18:52:03.640754Z [info ] Indexed 1639 entities [yente.search.indexer] added=205 deleted=762 modified=672
2025-09-30T18:52:03.851522Z [info ] Index is now aliased to: yente-entities [yente.search.indexer] index=yente-entities-default-016a040202-20250930125415-eey
2025-09-30T18:52:04.039790Z [info ] No update needed [yente.search.indexer] dataset=graph version=20250815190158-ikw
2025-09-30T18:52:04.092267Z [info ] Deleting orphaned index [yente.search.indexer] index=yente-entities-default-016a040202-20250930065425-msl
2025-09-30T18:52:04.131583Z [info ] Index update complete. [yente.search.indexer]
2025-09-30T18:52:04.138122Z [info ] Released lock 873705d1-1967-4066-808c-8f439da3d2d4 [yente.search.lock]
I will avoid mixing anything in my next manifest and will separate my tests into different yente instances to avoid conflicting datasets.