Kolkhoz & Pravda

Hey everyone! Johan here, with something new: Imagine you have a straw. Into the straw, you push the internet. Out of the straw comes a list of all political position holders.

What would this straw look like?

That’s the question I’m chasing: can one automated pipeline handle the long tail of governments and institutions out there? These are the orgs where writing a bespoke scraper doesn’t make economic sense (regional water boards, municipal councils, specialised agencies), but which collectively represent a huge gap in our structured data.

The approach so far is split into two parts:

  • Pravda — the evidence layer. A shared service that snapshots web pages and stores the raw artifacts.
  • Kolkhoz — the extraction experiments. It feeds Pravda snapshots into LLMs and extracts structured human/position pairs.

Pravda

Before we can ask a model “who runs this water board?”, we need the page. Not just the HTML: we need the rendered document. We learned from Poliloom, and are taking the opportunity to move the “capture webpage” part into a separate service.

That’s Pravda: a small FastAPI service that turns a URL into a set of artifacts: plaintext, rendered HTML, a full-page screenshot, and an archive of the resources the page requested. It stores them all in content-addressed storage. It is basically a thin wrapper around Playwright.
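To make “content-addressed” concrete, here is a minimal sketch. It is not Pravda’s actual code: the function names, the SHA-256 choice and the two-character directory sharding are all assumptions for illustration. The idea is only that an artifact’s key is derived from its bytes, so storing the same snapshot twice costs nothing extra.

```python
import hashlib
from pathlib import Path


def put_artifact(root: Path, data: bytes) -> str:
    """Store bytes under their SHA-256 digest and return that digest.

    Hypothetical helper: the hash function and directory layout are assumptions.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Shard by the first two hex characters to keep directories small.
    path = root / digest[:2] / digest
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest


def get_artifact(root: Path, digest: str) -> bytes:
    """Read an artifact back by its digest."""
    return (root / digest[:2] / digest).read_bytes()
```

Because the key is the hash of the content, identical plaintext or screenshots from repeated captures end up in the same file.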

The goal is to have a single evidence layer that other projects, like Kolkhoz, but soon also Poliloom, can make use of to store the artifacts that extracted data is sourced from.

Kolkhoz

Right, so with the storage out of the way, let’s get back to “who runs this water board?”. What can we extract automatically from these pages, without writing a scraper?

We tackled a variant of this challenge before. Poliloom extracts structured data from web pages, but only for pages that we know are about a certain politician. Can we do what Poliloom is doing, but for arbitrary pages? And can we identify which pages we should look at?

I think we can.

To get an idea of the random quirks of the internet, we started with a synthesized test dataset: leadership pages from international organizations, 1049 URLs across 487 organizations.

LLMs are pretty good at structured data extraction these days, but what input do you give them? Because not every webpage is created equal, we’re leaning into multi-modality. We start with the plain text extracted from the page. However, some pages have barely any text. In those cases we also attach the screenshot.

There are pages out there that contain forty headshot photos with names superimposed in fancy fonts. The remedy is a couple of signals that decide whether to attach a screenshot: if the plaintext is very short, or if the page has a lot of images.
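Roughly, the two signals could be sketched like this. The thresholds, the `PageSnapshot` type and the field names are made up for illustration and are not the values Kolkhoz actually uses.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to make the example concrete.
MIN_TEXT_CHARS = 500
MAX_IMAGES = 20


@dataclass
class PageSnapshot:
    plaintext: str
    image_count: int


def should_attach_screenshot(
    page: PageSnapshot,
    min_text_chars: int = MIN_TEXT_CHARS,
    max_images: int = MAX_IMAGES,
) -> bool:
    """Attach the screenshot when the text is thin or the page is image-dense."""
    thin_text = len(page.plaintext.strip()) < min_text_chars
    image_dense = page.image_count >= max_images
    return thin_text or image_dense
```

A page with forty headshots would trip the image signal even if it has plenty of navigation and footer text.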

We also tried letting the model decide for itself whether it needed the image, through function calling. It turns out, it will not think it needs it. Ever.

Currently, extraction is a single structured response: we ask the model to return the page_type and the holders[] list in a single call. A holder is a human with a position. page_type is there for us to make sense of the quality of URLs we pass in.
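As a sketch of what such a response could look like on the receiving end: only `page_type` and `holders` come from the setup described above. The `name` and `position` keys, the class names and the `"leadership"` label in the usage example are assumptions.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Holder:
    name: str       # assumed field name
    position: str   # assumed field name


@dataclass
class Extraction:
    page_type: str
    holders: list[Holder]


def parse_extraction(raw: str) -> Extraction:
    """Parse the model's JSON reply into typed objects.

    Raises KeyError if a required key is missing, so malformed replies fail loudly.
    """
    data = json.loads(raw)
    holders = [
        Holder(name=h["name"].strip(), position=h["position"].strip())
        for h in data.get("holders", [])
    ]
    return Extraction(page_type=data["page_type"], holders=holders)
```

Usage, with an invented reply:

```python
raw = '{"page_type": "leadership", "holders": [{"name": "Ana Souza", "position": "Chair"}]}'
result = parse_extraction(raw)
print(result.page_type, len(result.holders))
```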

Of the ~1000 pages processed, holders were extracted from 60%. Across those ~600 pages, ~10,000 holders were identified. Screenshots were attached in roughly one fourth of extractions, split between thin-text and image-dense pages.

The remaining pages, on which no holders were found, were overwhelmingly classified as other. Looking at these, it turns out that there are quite a few generic pages in our test set.

Next steps

Actually look properly at the extracted data! So far, we’ve only evaluated this machinery by taking samples and comparing with source documents manually. The samples look promising, but to say anything meaningful about the quality of this LLM extraction, we’d have to compare with a known set.

Something else to dive deeper into is the page_type signal. For the testing I have done so far, I wanted to know what sort of pages I was passing in. However, the real question we’re trying to answer is more along the lines of “should we recurrently scrape this page?”, and the answer to that probably comes down to whether the page has holders or not.

Also, what other pages does a page link to? Do any of those on the same web property maybe also contain holders? Can we “intelligently” spider?

Then there is the Pravda side of the story. How does it handle all the blocks the modern internet has to offer? Think bot detection, geo-fencing and stuff like that.

There are many open questions, as I’m sure you are aware. As always, I’d love to hear what you think, let me know! :smiley:

Many thanks for this! One question: how well would the LLM need to perform against the known set, in your view?

Thanks @frederikrichter! In my view there are two dimensions to that question that are similar but not the same.

The first one would be “Faithfulness” or “copy fidelity”. How likely is a name to come back altered? For short proper-name strings at small list sizes, mutation is rare but not negligible. The risk grows with output length and with how “familiar” the model thinks a corrected version of the name is.

The other question I have is: how likely are names to be dropped (giving us an incomplete list)? There are two factors at play there: position bias, and length-dependent omission in the output. LLMs perform best on information that is at the start or end of the context, with information getting “lost in the middle”. They also get worse as lists get longer: completeness collapses as the list to emit grows.

To keep the model “Faithful”, we can keep the context short, and validate extracted data against the input to make sure no strings come out that were not in the original document. To perform well on longer lists, there are different strategies we can apply. One that comes to mind is to have the LLM generate CSS selectors for the elements that contain holder information.
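The validation step could be as simple as a normalized substring check. A minimal sketch, where the function names and the normalization choices (NFKC, casefolding, collapsing whitespace) are my assumptions rather than what Kolkhoz does today:

```python
import unicodedata


def normalize(text: str) -> str:
    """Unify Unicode forms, ignore case, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def split_by_faithfulness(
    names: list[str], source_text: str
) -> tuple[list[str], list[str]]:
    """Split names into those found verbatim in the source and those that are not."""
    haystack = normalize(source_text)
    kept: list[str] = []
    rejected: list[str] = []
    for name in names:
        if normalize(name) in haystack:
            kept.append(name)
        else:
            rejected.append(name)
    return kept, rejected
```

Diacritics are deliberately not stripped, so a name that comes back “corrected” to a more familiar spelling lands in the rejected list instead of slipping through.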

Now all this gets a lot harder when there’s no textual info on a page, and we’re using images as input. Here the problem of “Faithfulness” is amplified. Language priors can outweigh the visual signals, leading to the model “reading” a plausible word, rather than the actual word. I think we can still do things like use different models (think OCR) and compare their output. I’m curious to see how well LLMs perform there. We should be able to test this by comparing the outputs we get for the screenshots of pages where we also have the textual data.
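That comparison could look something like the sketch below, treating the text-based extraction as the reference. All names here are hypothetical, and the exact-match key (casefold plus whitespace collapsing) is an assumption. A real comparison might want fuzzier matching.

```python
def name_key(name: str) -> str:
    """Case-insensitive key with collapsed whitespace."""
    return " ".join(name.casefold().split())


def compare_extractions(text_names: list[str], screenshot_names: list[str]) -> dict:
    """Score screenshot-based names against text-based names for the same page."""
    reference = {name_key(n) for n in text_names}
    candidate = {name_key(n) for n in screenshot_names}
    overlap = reference & candidate
    return {
        # Share of reference names the screenshot run also found.
        "recall": len(overlap) / len(reference) if reference else 1.0,
        # Share of screenshot names that are backed by the text run.
        "precision": len(overlap) / len(candidate) if candidate else 1.0,
        "missing": sorted(reference - candidate),
        "unexpected": sorted(candidate - reference),
    }
```

The “unexpected” list is where a model “reading” a plausible word instead of the actual one would show up.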

Now to come back to your question of how well the LLM would need to perform: I think that there is no single answer here, and that the question we should try to answer is more along the lines of “What strategy do we apply when?”. I think the goal should be to find the right strategies that allow us to stay true to the source documents.

Does that answer your question?

Yes, and thank you!!

This is interesting. One very manual approach would be to take small screenshots and then have people double-key them. See https://journals.openedition.org/jtei/739

Another thought: could you build two independent AI tools and only add people where both models agree?
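A rough sketch of that agreement rule, assuming each tool returns (name, position) pairs. The function names and the exact-match normalization are made up for illustration, and two models with shared training data may not be truly independent.

```python
def holder_key(name: str, position: str) -> tuple[str, str]:
    """Case-insensitive key with collapsed whitespace for one holder."""
    return (" ".join(name.casefold().split()), " ".join(position.casefold().split()))


def agreed_holders(
    first: list[tuple[str, str]], second: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Keep only holders that both extractors returned, in the first one's order."""
    second_keys = {holder_key(n, p) for n, p in second}
    seen: set[tuple[str, str]] = set()
    agreed: list[tuple[str, str]] = []
    for name, position in first:
        key = holder_key(name, position)
        if key in second_keys and key not in seen:
            seen.add(key)
            agreed.append((name, position))
    return agreed
```

This trades completeness for precision: anything only one tool finds is dropped, so it would probably need a review queue for the disagreements.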