Hi,
I’ve been exploring the opensanctions codebase and I’m curious about how the crawlers are scheduled and published in production.
From looking at the code, I can see that:
- Individual datasets have cron schedules defined in their .yml files (e.g., “0 */2 * * *” for OFAC)
- The zavod/zavod/archive/ module has a GoogleCloudBackend for publishing to GCS buckets
- The GitHub Actions workflow dispatches events to an opensanctions/operations repository after builds complete
However, the operations repository appears to be private (can you confirm?), so I can’t see how the actual production orchestration works.
I’m interested in understanding:
- What orchestration tool is used to run the per-dataset cron schedules in production? (Kubernetes CronJobs, Airflow, Cloud Scheduler, etc.)
- Is there any public documentation on the deployment architecture?
- For anyone looking to self-host or run their own instance of the crawlers on a schedule, are there recommended approaches or example configurations?
Thanks for any insights!
Hi there!
The way we do it currently is that we have a bunch of scripts in our private operations repository that convert the deployment spec from the dataset metadata into a Kubernetes CronJob, via a GitHub Action. We’ve been thinking about adopting something more sophisticated (like Airflow or Argo) for a while, but right now it’s just that. This part is currently not documented, but there really isn’t any magic or secret sauce involved. It’s a pretty standard cloud deployment that will, however, be different for every user, which is why we haven’t put any effort into documenting it.
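To give a rough idea, the generated object ends up looking something like the sketch below. It’s simplified, and the names, image and dataset path are just illustrative placeholders, not our actual manifest:

```yaml
# Simplified sketch of a per-dataset CronJob; the names, image and command
# are illustrative placeholders, not our production manifest.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: crawl-us-ofac-sdn
spec:
  schedule: "0 */2 * * *"        # taken from the dataset's schedule/frequency metadata
  concurrencyPolicy: Forbid      # don't start a new crawl while the previous one runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: crawler
              image: ghcr.io/opensanctions/opensanctions:latest  # placeholder image reference
              command: ["zavod", "crawl", "datasets/us/ofac/us_ofac_sdn.yml"]  # placeholder path
              envFrom:
                - secretRef:
                    name: crawler-secrets  # API keys etc., see below
```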
Two things that I can think of right now that might not be super obvious: `schedule` takes precedence over `frequency`, and `premium: true/false` controls whether to schedule on cheaper, interruptible spot market instances on GCE.
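In the dataset metadata, that might look roughly like this (I’m going from memory, so the exact nesting of the keys may be off):

```yaml
# Rough illustration of the relevant metadata keys; exact nesting may differ.
coverage:
  frequency: daily             # coarse cadence used when no explicit schedule is set
deploy:
  schedule: "30 */6 * * *"     # explicit cron expression; takes precedence over frequency
  premium: false               # toggles between regular and cheaper interruptible spot instances on GCE
```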
Note that you’re likely to encounter a few more hurdles: some crawlers that fetch non-public data also expect API keys to be passed in as environment variables, and you’ll need to supply Zyte API credentials for the crawlers that need a geolocated IP.
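Concretely, that usually means wiring a Secret (or something similar) into the job’s environment. The variable names below are made up, so check each crawler’s code for what it actually reads:

```yaml
# Illustrative Secret for crawler credentials; the variable names are
# guesses, so check each crawler for the environment variables it expects.
apiVersion: v1
kind: Secret
metadata:
  name: crawler-secrets
type: Opaque
stringData:
  ZYTE_API_KEY: "..."          # for crawlers that need a geolocated IP via Zyte
  SOURCE_API_KEY: "..."        # per-crawler key for a non-public data source
```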
Of course, we welcome PRs that provide hints for those paths less trodden!
Hope that helps a bit!
Leon