Kolkhoz & Pravda
Hey everyone! Johan here, with something new: Imagine you have a straw. Into the straw, you push the internet. Out of the straw comes a list of all political position holders.
What would this straw look like?
That’s the question I’m chasing: can one automated pipeline handle the long tail of governments and institutions out there? These are the orgs where writing a bespoke scraper doesn’t make economic sense (regional water boards, municipal councils, specialised agencies), but which collectively represent a huge gap in our structured data.
The approach so far is split into two parts:
- Pravda — the evidence layer. A shared service that snapshots web pages and stores the raw artifacts.
- Kolkhoz — the extraction experiments. It feeds Pravda snapshots into LLMs and extracts structured human/position pairs.
Pravda
Before we can ask a model “who runs this water board?”, we need the page. Not just the HTML: we need the rendered document. We learned from Poliloom, and are taking the opportunity to move the “capture webpage” part into a separate service.
That’s Pravda: a small FastAPI service that turns a URL into a set of artifacts: plaintext, rendered HTML, a full-page screenshot, and an archive of the resources the page requested. It stores all of them in content-addressed storage. It is basically a thin wrapper around Playwright.
The goal is to have a single evidence layer that other projects, like Kolkhoz, but soon also Poliloom, can make use of to store the artifacts that extracted data is sourced from.
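To make “content-addressed storage” a bit more concrete, here is a minimal sketch of the idea. It is not Pravda’s actual code: the function names, the SHA-256 choice and the two-character directory sharding are all assumptions made for illustration. The point is only that an artifact’s key is derived from its bytes, so storing the same screenshot twice costs nothing extra.

```python
import hashlib
from pathlib import Path


def store_artifact(root: Path, data: bytes) -> str:
    """Write data under a path derived from its SHA-256 digest; return the digest.

    Hypothetical helper: the hash function and on-disk layout are assumptions.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Shard by the first two hex characters to keep directories small.
    path = root / digest[:2] / digest
    if not path.exists():
        # Identical content maps to the same path, so it is written only once.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest


def load_artifact(root: Path, digest: str) -> bytes:
    """Read an artifact back by its digest."""
    return (root / digest[:2] / digest).read_bytes()
```

Anything extracted downstream can then cite the digest of the snapshot it came from, and the evidence stays retrievable for as long as the store does.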
Kolkhoz
Right, so with the storage out of the way, let’s get back to “who runs this water board?”. What can we extract automatically from these pages, without writing a scraper?
We tackled a variant of this challenge before. Poliloom extracts structured data from web pages, but only for pages that we know are about a certain politician. Can we do what Poliloom is doing, but for arbitrary pages? And can we identify which pages we should look at?
I think we can.
To get an idea of the random quirks of the internet, we started with a synthesized test dataset. Leadership pages from international organizations: 1049 URLs across 487 organizations.
LLMs are pretty good at structured data extraction these days, but what input do you give them? Because not every webpage is created equal, we’re leaning into multi-modality. We start with the plain text extracted from the page. However, some pages have barely any text. In those cases we also attach the screenshot.
There are pages out there that contain forty headshot photos with names superimposed in fancy fonts. The remedy is a couple of signals that decide whether to attach a screenshot: if the plaintext is very short, or if the page has a lot of images.
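The two signals described above could be sketched roughly like this. The threshold values, the `Snapshot` type and the function name are made up for illustration; the post does not say which cut-offs Kolkhoz actually uses.

```python
from dataclasses import dataclass

# Hypothetical thresholds, not taken from Kolkhoz.
MIN_TEXT_CHARS = 500
MAX_IMAGES = 20


@dataclass
class Snapshot:
    plaintext: str
    image_count: int


def should_attach_screenshot(
    snapshot: Snapshot,
    min_text_chars: int = MIN_TEXT_CHARS,
    max_images: int = MAX_IMAGES,
) -> bool:
    """Attach the screenshot when the text is thin or the page is image-dense."""
    thin_text = len(snapshot.plaintext.strip()) < min_text_chars
    image_dense = snapshot.image_count > max_images
    return thin_text or image_dense
```

Keeping the decision in plain code rather than delegating it to the model makes it cheap to run and easy to tune per signal.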
We also tried letting the model decide for itself whether it needed the image, through function calling. It turns out, it will not think it needs it. Ever.
Currently, extraction is a single structured response: we ask the model to return the page_type and the holders[] list in a single call. A holder is a human with a position. page_type is there for us to make sense of the quality of URLs we pass in.
Of the ~1000 pages processed, holders were extracted from 60%. Across those ~600 pages, ~10,000 holders were identified. Screenshots were attached in roughly a quarter of extractions, split between thin-text and image-dense pages.
The remaining pages, on which no holders were found, were overwhelmingly classified as other. Looking at these, it turns out that there are quite a few generic pages in our test set.
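For a feel of what such a response could look like once parsed, here is a small sketch. Only `page_type` and `holders` come from the description above; the `name` and `position` field names, the example `"leadership"` value and the `parse_extraction` helper are assumptions, not the actual Kolkhoz schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Holder:
    # Assumed field names: a human and the position they hold.
    name: str
    position: str


@dataclass
class Extraction:
    page_type: str
    holders: list[Holder] = field(default_factory=list)


def parse_extraction(raw: str) -> Extraction:
    """Parse the model's JSON response into typed objects."""
    payload = json.loads(raw)
    holders = [
        Holder(name=h["name"], position=h["position"])
        for h in payload.get("holders", [])
    ]
    return Extraction(page_type=payload["page_type"], holders=holders)
```

A response for a page without people would simply carry an empty `holders` list next to its `page_type`, for example `{"page_type": "other", "holders": []}`.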
Next steps
Actually look properly at the extracted data! So far, we’ve only evaluated this machinery by taking samples and comparing with source documents manually. The samples look promising, but to say anything meaningful about the quality of this LLM extraction, we’d have to compare with a known set.
Something else to dive deeper into is the page_type signal. For the testing I have done so far, I wanted to know what sort of pages I was passing in. However, the real question we’re trying to answer is more along the lines of “should we recurrently scrape this page?”, the answer to which is probably contained in whether the page has holders or not.
Also, what other pages does a page link to? Do any of those on the same web property maybe also contain holders? Can we “intelligently” spider?
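A first, deliberately naive step towards that could be collecting the same-site links from a snapshot’s rendered HTML. The sketch below uses only the Python standard library and treats “same web property” as “same host”, which is an assumption; the `same_site_links` name is hypothetical, and nothing here decides which of the candidates are worth visiting.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url: str, html: str) -> list[str]:
    """Return absolute, de-duplicated links that stay on the page's host."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links: list[str] = []
    for href in collector.hrefs:
        # Resolve relative links and drop the #fragment part.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        parsed = urlparse(absolute)
        if (
            parsed.scheme in ("http", "https")
            and parsed.netloc == host
            and absolute != page_url
            and absolute not in links
        ):
            links.append(absolute)
    return links
```

The “intelligent” part would sit on top of this: ranking the candidates, for instance by anchor text, before spending a snapshot and an extraction call on them.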
Then there is the Pravda side of the story. How does it handle all the blocks the modern internet has to offer? Think bot detection, geo-fencing and stuff like that.
There are many open questions, as I’m sure you are aware. As always, I’d love to hear what you think, so let me know!