How does production scheduling and GCS publishing work?

Hi,

I’ve been exploring the opensanctions codebase and I’m curious about how the crawlers are scheduled and published in production.

From looking at the code, I can see that:

  • Individual datasets have cron schedules defined in their .yml files (e.g., “0 */2 * * *” for OFAC; rough sketch below)
  • The zavod/zavod/archive/ module has a GoogleCloudBackend for publishing to GCS buckets
  • The GitHub Actions workflow dispatches events to an opensanctions/operations repository after builds complete
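
For reference, the scheduling-related part of a dataset .yml seems to look roughly like this, as far as I can tell. The key names and nesting below are just my reading of the metadata, so I may have the layout slightly wrong:

```yaml
# Sketch from skimming the repo; key names/nesting may not be exact.
name: us_ofac_sdn
coverage:
  frequency: daily
deploy:
  schedule: "0 */2 * * *"   # cron expression: at minute 0 of every 2nd hour
```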

However, the operations repository appears to be private (can you confirm?), so I can’t see how the actual production orchestration works.

I’m interested in understanding:

  1. What orchestration tool is used to run the per-dataset cron schedules in production? (Kubernetes CronJobs, Airflow, Cloud Scheduler, etc.)
  2. Is there any public documentation on the deployment architecture?
  3. For anyone looking to self-host or run their own instance of the crawlers on a schedule, are there recommended approaches or example configurations?

Thanks for any insights!

Hi there!

The way we do it currently: a bunch of scripts in our private operations repository convert the deployment spec from the dataset metadata into Kubernetes CronJobs, via a GitHub Action. We’ve been thinking about adopting something more sophisticated (like Airflow or Argo) for a while, but right now it’s just that. This part isn’t documented at the moment, but there really isn’t any magic or secret sauce involved. It’s a pretty standard cloud deployment that will, however, look different for every user, which is why we haven’t put much effort into documenting it.
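
To make that a bit more concrete: what comes out the other end is conceptually just a plain CronJob wrapping the zavod CLI. The manifest below is only a sketch, though; the image name, dataset path and environment variable are placeholders, not what our scripts actually emit.

```yaml
# Sketch only: image, dataset path and env var names are placeholders,
# not the actual production spec our operations scripts generate.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: crawl-us-ofac-sdn
spec:
  schedule: "0 */2 * * *"        # copied from the dataset's schedule field
  concurrencyPolicy: Forbid      # don't start a run while the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: zavod
              image: ghcr.io/opensanctions/opensanctions:latest   # placeholder image tag
              command: ["zavod", "crawl", "datasets/us/ofac/us_ofac_sdn.yml"]
              env:
                - name: ZAVOD_ARCHIVE_BUCKET   # hypothetical name for the GCS publishing bucket setting
                  value: my-archive-bucket
```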

Two things I can think of right now that might not be super obvious: schedule takes precedence over frequency, and premium: true/false controls whether a dataset gets scheduled on the cheaper, interruptible spot-market instances on GCE.
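
Off the top of my head, that looks roughly like this in the dataset metadata (don’t hold me to the exact nesting, this is from memory):

```yaml
coverage:
  frequency: daily          # the coarse default cadence
deploy:
  schedule: "0 */2 * * *"   # if set, this cron expression wins over frequency
  premium: false            # false: okay to run on cheaper, interruptible spot instances
```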

Note that you’re likely to encounter a few more hurdles: some crawlers that fetch non-public data expect API keys to be passed in as environment variables, and you’ll need to supply Zyte API credentials for the crawlers that need a geolocated IP.
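
For those, the usual pattern is just a Kubernetes Secret mapped into the job’s environment, i.e. something like the snippet below on the container spec of a CronJob like the one above. The variable and secret names here are made up for illustration; each crawler’s code shows what it actually reads.

```yaml
# Illustration only: variable and secret names are invented; check the
# individual crawler for the environment variables it actually expects.
env:
  - name: ZYTE_API_KEY            # hypothetical name for the Zyte credential
    valueFrom:
      secretKeyRef:
        name: crawler-secrets
        key: zyte-api-key
  - name: SOME_SOURCE_API_KEY     # placeholder for a crawler-specific key
    valueFrom:
      secretKeyRef:
        name: crawler-secrets
        key: some-source-api-key
```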

Of course, we welcome PRs that provide hints for those paths less trodden :slight_smile:

Hope that helps a bit!

Leon
