Hi everyone!
This project has been cooking for a while, and I’ve finally gotten around to writing a little blog post about it: Registry Amnesia: How Russian Data on Sanctioned Companies is disappearing – and how we put it back.
Since Russia’s full-scale invasion of Ukraine in 2022, the Russian dictatorship has increasingly hidden and falsified data in its trade registry. Over the past few weeks, I’ve worked on improving our data pipeline so that we can now present both versions side by side: the current, memory-impaired one and the pre-war data.
Behind the scenes (or rather, out in the open, as all the code is available in our source repository), we now regularly crunch through more than 100 GB of compressed XML files with PySpark to assemble this dataset.
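To give a rough idea of the shape of the problem, here is a minimal sketch in plain Python rather than PySpark. It is not the actual pipeline code: the `company`, `name` and `owner` element names, the sample records and both helper functions are made up for illustration, and the real registry files look different. It streams records out of gzip-compressed XML and then pairs the pre-war and current version of each company, noting which fields have gone missing.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_companies(raw_gzip_bytes):
    """Stream <company> records out of gzip-compressed XML, one at a time."""
    with gzip.open(io.BytesIO(raw_gzip_bytes), "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == "company":  # hypothetical element name
                yield {
                    "id": elem.get("id"),
                    "name": elem.findtext("name"),
                    "owner": elem.findtext("owner"),
                }
                elem.clear()  # drop the parsed subtree to keep memory use low


def compare_snapshots(pre_war, current):
    """Pair each company's pre-war and current record and list removed fields."""
    old = {c["id"]: c for c in pre_war}
    new = {c["id"]: c for c in current}
    merged = {}
    for company_id in sorted(old.keys() | new.keys()):
        before = old.get(company_id)
        after = new.get(company_id)
        # A field counts as removed if it had a value before and has none now.
        removed = sorted(
            field
            for field, value in (before or {}).items()
            if value and not (after or {}).get(field)
        )
        merged[company_id] = {
            "pre_war": before,
            "current": after,
            "removed_fields": removed,
        }
    return merged


def _gz(xml_text):
    """Compress a small XML string so it resembles a registry dump."""
    return gzip.compress(xml_text.encode("utf-8"))


# Invented sample data: the owner is present before, missing afterwards.
PRE_WAR = _gz(
    "<registry><company id='1'><name>Example LLC</name>"
    "<owner>A. Owner</owner></company></registry>"
)
CURRENT = _gz(
    "<registry><company id='1'><name>Example LLC</name></company></registry>"
)

result = compare_snapshots(iter_companies(PRE_WAR), iter_companies(CURRENT))
print(result["1"]["removed_fields"])  # ['owner']
```

At the real data volume this kind of per-record comparison would be expressed as a distributed join instead of in-memory dictionaries, but the idea of keeping both versions next to each other is the same.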
Feel free to ask any questions you may have, share your thoughts, or dig into the data yourself. I’m sure there are more interesting stories to be uncovered. I’m also happy to provide the Parquet files from which we generate the final data export, in case anyone is interested.