How does production scheduling and GCS publishing work?

Hi,

I’ve been exploring the opensanctions codebase and I’m curious about how the crawlers are scheduled and published in production.

From looking at the code, I can see that:

  • Individual datasets have cron schedules defined in their .yml files (e.g., “0 */2 * * *” for OFAC; rough sketch below)
  • The zavod/zavod/archive/ module has a GoogleCloudBackend for publishing to GCS buckets
  • The GitHub Actions workflow dispatches events to an opensanctions/operations repository after builds complete
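
For reference, the scheduling-related part of a dataset .yml seems to look roughly like this, as far as I can tell. The key names and nesting below are just my reading of the metadata, so I may have the layout slightly wrong:

```yaml
# Sketch from skimming the repo; key names/nesting may not be exact.
name: us_ofac_sdn
coverage:
  frequency: daily
deploy:
  schedule: "0 */2 * * *"   # cron expression: at minute 0 of every 2nd hour
```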

However, the operations repository appears to be private (can you confirm?), so I can’t see how the actual production orchestration works.

I’m interested in understanding:

  1. What orchestration tool is used to run the per-dataset cron schedules in production? (Kubernetes CronJobs, Airflow, Cloud Scheduler, etc.)
  2. Is there any public documentation on the deployment architecture?
  3. For anyone looking to self-host or run their own instance of the crawlers on a schedule, are there recommended approaches or example configurations?

Thanks for any insights!

Hi there!

The way we do it currently: a bunch of scripts in our private operations repository convert the deployment spec from the dataset metadata into Kubernetes CronJobs, via a GitHub Action. We’ve been thinking about adopting something more sophisticated (like Airflow or Argo) for a while, but right now it’s just that. This part isn’t documented at the moment, but there really isn’t any magic or secret sauce involved. It’s a pretty standard cloud deployment that will, however, look different for every user, which is why we haven’t put much effort into documenting it.
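
To make that a bit more concrete: what comes out the other end is conceptually just a plain CronJob wrapping the zavod CLI. The manifest below is only a sketch, though; the image name, dataset path and environment variable are placeholders, not what our scripts actually emit.

```yaml
# Sketch only: image, dataset path and env var names are placeholders,
# not the actual production spec our operations scripts generate.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: crawl-us-ofac-sdn
spec:
  schedule: "0 */2 * * *"        # copied from the dataset's schedule field
  concurrencyPolicy: Forbid      # don't start a run while the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: zavod
              image: ghcr.io/opensanctions/opensanctions:latest   # placeholder image tag
              command: ["zavod", "crawl", "datasets/us/ofac/us_ofac_sdn.yml"]
              env:
                - name: ZAVOD_ARCHIVE_BUCKET   # hypothetical name for the GCS publishing bucket setting
                  value: my-archive-bucket
```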

Two things I can think of right now that might not be super obvious: schedule takes precedence over frequency, and premium: true/false controls whether a dataset gets scheduled on the cheaper, interruptible spot-market instances on GCE.
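
Off the top of my head, that looks roughly like this in the dataset metadata (don’t hold me to the exact nesting, this is from memory):

```yaml
coverage:
  frequency: daily          # the coarse default cadence
deploy:
  schedule: "0 */2 * * *"   # if set, this cron expression wins over frequency
  premium: false            # false: okay to run on cheaper, interruptible spot instances
```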

Note that you’re likely to encounter a few more hurdles: some crawlers that fetch non-public data expect API keys to be passed in as environment variables, and you’ll need to supply Zyte API credentials for the crawlers that need a geolocated IP.
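
For those, the usual pattern is just a Kubernetes Secret mapped into the job’s environment, i.e. something like the snippet below on the container spec of a CronJob like the one above. The variable and secret names here are made up for illustration; each crawler’s code shows what it actually reads.

```yaml
# Illustration only: variable and secret names are invented; check the
# individual crawler for the environment variables it actually expects.
env:
  - name: ZYTE_API_KEY            # hypothetical name for the Zyte credential
    valueFrom:
      secretKeyRef:
        name: crawler-secrets
        key: zyte-api-key
  - name: SOME_SOURCE_API_KEY     # placeholder for a crawler-specific key
    valueFrom:
      secretKeyRef:
        name: crawler-secrets
        key: some-source-api-key
```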

Of course, we welcome PRs that provide hints for those paths less trodden :slight_smile:

Hope that helps a bit!

Leon
