PoliLoom – Loom for weaving politicians’ data


Hey there OpenSanctions community! I’m Johan. Friedrich asked me to have a go at some data extraction, and share the development process here, so let’s dive into it together.

We’re about to find out how great LLMs are at extracting metadata on politicians from Wikipedia and other sources. Our goal is to enrich Wikidata. This is a research project; the goal is a usable proof of concept.

Below is something that resembles a plan. I did my best structuring my thoughts on how I think we should approach this, based on a day or so of research. I’m also left with some questions, which I’ve sprinkled throughout this document.

If you have some insight or can tell me that I’m on the wrong track, I would be grateful!

Let’s get into it.

Database

We want users to decide whether our changes are valid, so we’ll have to store them first. But to know what has changed, we need to know what properties are currently in Wikidata.

We’ll be using SQLAlchemy and Alembic. In development we’ll use SQLite; in production, PostgreSQL.

Schema

We’d reproduce a small part of the Wikidata politician data model. To keep things as simple as possible, we’d have Politician, Position, Property and a many-to-many HoldsPosition relationship entity.

  • Politician would be politicians with names and a country.
  • Source would be a web source that contains information on a politician.
  • Property would be things like date-of-birth and birthplace for politicians.
  • Position is all the positions from Wikidata with their country.
  • HoldsPosition links a politician to a position, together with start & end dates.

Source would have many-to-many relations to Politician, Property and Position.

Property and HoldsPosition would also record whether they are newly extracted, and which user (if any) has confirmed their correctness and updated Wikidata. Property would have a type, which for now would be either BirthDate or BirthPlace.
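To make that concrete, here is a minimal SQLAlchemy sketch of these entities. The table and column names are illustrative only, and Source with its link tables is left out for brevity:

from sqlalchemy import Boolean, Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Politician(Base):
    __tablename__ = "politicians"
    id = Column(String, primary_key=True)   # Wikidata QID, e.g. "Q30351657"
    name = Column(String, nullable=False)
    country = Column(String)

class Position(Base):
    __tablename__ = "positions"
    id = Column(String, primary_key=True)   # Wikidata QID of the position
    label = Column(String, nullable=False)
    country = Column(String)

class Property(Base):
    __tablename__ = "properties"
    id = Column(Integer, primary_key=True)
    politician_id = Column(String, ForeignKey("politicians.id"))
    type = Column(String)                          # "BirthDate" or "BirthPlace"
    value = Column(String)
    is_extracted = Column(Boolean, default=False)  # newly extracted vs. already in Wikidata
    confirmed_by = Column(String, nullable=True)   # user who confirmed it, if any
    politician = relationship("Politician", backref="properties")

class HoldsPosition(Base):
    __tablename__ = "holds_position"
    id = Column(Integer, primary_key=True)
    politician_id = Column(String, ForeignKey("politicians.id"))
    position_id = Column(String, ForeignKey("positions.id"))
    start_date = Column(String)                    # strings, so incomplete dates like "1962" fit
    end_date = Column(String)
    is_extracted = Column(Boolean, default=False)
    confirmed_by = Column(String, nullable=True)
    politician = relationship("Politician", backref="positions_held")
    position = relationship("Position")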

Populating our database

We can get a list of all politicians in Wikidata by querying the SPARQL endpoint, either by occupation:

SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .                    # must be human
  ?politician wdt:P106/wdt:P279* wd:Q82955 .     # occupation is politician or subclass
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100

Or by position held:

SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .                           # must be human
  ?politician wdt:P39 ?position .                       # holds a position
  ?position wdt:P31/wdt:P279* wd:Q4164871 .            # position is political
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100

We could also fetch both and merge the sets. However, these queries are slow and have to be paginated.
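As a rough idea of how the paginated fetching could look in Python; the page size, sleep and user agent are arbitrary, and in practice OFFSET-based paging on the query service is itself slow:

import time
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .
  ?politician wdt:P106/wdt:P279* wd:Q82955 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT %d OFFSET %d
"""

def fetch_politicians(page_size=5000):
    offset = 0
    while True:
        response = requests.get(
            ENDPOINT,
            params={"query": QUERY % (page_size, offset), "format": "json"},
            headers={"User-Agent": "PoliLoom/0.1 (research prototype)"},
        )
        response.raise_for_status()
        rows = response.json()["results"]["bindings"]
        if not rows:
            break
        for row in rows:
            yield row["politician"]["value"], row["politicianLabel"]["value"]
        offset += page_size
        time.sleep(1)  # be polite to the public endpoint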

Wikipedia pages should have a linked Wikidata item, and Wikidata records of politicians should almost always have a Wikipedia link, so we should be able to connect these entities.

Then there are probably politician Wikidata records that don’t have a linked Wikipedia page, and there are probably Wikipedia pages that don’t have a Wikidata entry. We could treat Wikipedia politician articles that have no linked entity like the random web pages we’ll query.

When we populate our local database, we want to have Politician entities together with their Positions and Properties. While doing so, we’ll try to get the Wikipedia link for each Politician in English and in their local language.

Questions

  • Do we filter out deceased people?
  • If we’d be importing from the Wikipedia politician category, do we filter out all deceased people with a rule/keyword based filter?

Extraction of new properties

Once we have our local list of politicians, we can start extracting properties. We basically have two types of sources: Wikipedia articles that are linked to our Wikidata entries, and random web sources that are not.

Then there are two types of information that we want to extract: specific properties like date of birth and birthplace, and political positions. Extracting properties should be relatively simple; for positions, we’d have to know which positions we’d like to extract.

For extraction we’d feed the page to OpenAI and use the structured outputs API to get back data that fits our model.
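As a sketch of what that could look like with the openai Python SDK’s structured-output helper and a Pydantic schema; the model name and the fields are placeholders, not our final schema:

from openai import OpenAI
from pydantic import BaseModel

class ExtractedPosition(BaseModel):
    name: str
    start_date: str | None
    end_date: str | None

class ExtractedPolitician(BaseModel):
    birth_date: str | None
    birth_place: str | None
    positions: list[ExtractedPosition]

client = OpenAI()

def extract(article_text):
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Extract the politician's birth date, birthplace and political positions from the article."},
            {"role": "user", "content": article_text},
        ],
        response_format=ExtractedPolitician,
    )
    return completion.choices[0].message.parsed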

We can query all political positions in Wikidata. At the time of writing that returns 37631 rows:

SELECT DISTINCT ?position ?positionLabel WHERE {
  ?position wdt:P31/wdt:P279* wd:Q294414 . # elected or appointed political position
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

It would be nice if we could tell our LLM which positions we’d like to extract, instead of having it output whatever and then having to match that against the positions we know. However, 37631 rows is a lot of tokens. As we know the country of the politician before extracting positions, we can select the subset of positions for that country to pass to our LLM, so we can be sure to extract correct references.

Questions

  • How do we handle incomplete birth dates (e.g. “1962” or “June 1982”)?
  • How do we manage names of people in different scripts?

Wikipedia

Scraping Wikipedia would be relatively simple. During our Wikidata import we’ve stored Wikipedia links. We’d have to fetch those for each Politician and extract properties / positions from the data. We’d try to get the page in English and in the local language of the politician.

Scraping random webpages

Then there are the random pages. For these we’d have to link any information we extract to a Wikipedia article / Wikidata entity ourselves.

I think we should handle two general page types here: index pages and detail pages.

  • For index pages, we can have the LLM output an XPath/CSS selector for all detail page links, and possibly for pagination.
  • For detail pages we try to extract what we can, then try to find the related entity in our database. We can do a similarity search for entities with the same properties and use a score threshold to decide whether we should try to update a found entity (see the sketch below).
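A minimal sketch of that thresholding decision, assuming a placeholder entity_similarity score between the extracted record and a candidate politician; the 0.9 cutoff is made up:

MATCH_THRESHOLD = 0.9  # arbitrary cutoff, to be tuned against real data

def find_matching_politician(extracted, candidates):
    """Pick the best-scoring candidate, but only if it clears the threshold."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = entity_similarity(extracted, candidate)  # placeholder similarity function
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= MATCH_THRESHOLD else None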

My mind is going wild on building some sort of scraper-building robot that generates XPath selectors with an LLM, then scrapes pages with those selectors, re-running the XPath generation when the scraper breaks. But I think that would only make sense if we want to run our scraper often. The simplest thing would be to feed everything into the model.

Questions

  • Do we archive web sources?
  • How often do we want to run our tool?
  • How do we handle multilingual variations in names?
  • How will you handle conflicting information between sources? (e.g., Wikipedia says one birth date, government site says another)

API

We’ll create a FastAPI API that can be queried for entities with properties and positions. There should also be a route for users to mark a politician or position as correct.

Authentication could probably be handled by the MediaWiki OAuth system.

Routes

  • /politicians/unconfirmed - lists politicians with unconfirmed Property or HoldsPosition
  • /politicians/{politician_id}/confirm - Allows the user to confirm the properties/positions of a politician. Should allow the user to “throw out” properties / positions in the POST data (sketched below).
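A rough sketch of what those routes might look like in FastAPI; the request schema and the persistence logic are placeholders:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class ConfirmRequest(BaseModel):
    discarded_property_ids: list[int] = []   # properties the user "throws out"
    discarded_position_ids: list[int] = []   # positions the user "throws out"

@app.get("/politicians/unconfirmed")
def list_unconfirmed(limit: int = 50):
    # Placeholder: would query politicians that still have unconfirmed Property / HoldsPosition rows
    return {"politicians": [], "limit": limit}

@app.post("/politicians/{politician_id}/confirm")
def confirm_politician(politician_id: str, body: ConfirmRequest):
    # Placeholder: mark the remaining extracted rows as confirmed, drop the discarded ones,
    # and queue a Wikidata update on behalf of the authenticated user.
    if not politician_id:
        raise HTTPException(status_code=404, detail="Politician not found")
    return {
        "politician_id": politician_id,
        "discarded": body.discarded_property_ids + body.discarded_position_ids,
    }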

CLI commands

The API project would also house the CLI functionality for importing and enriching data.

  • Import single ID from Wikidata
  • Enrich single ID with data from Wikipedia

Confirmation GUI

We’d like users to be able to confirm our generated data. To do this we’ll build a NextJS GUI where users will be provided with statements and their accompanying source data / URL.

When users confirm the correctness of our generated properties, we’d like to update Wikidata on their behalf through the Wikidata API.

Questions

  • What choices do we want to have our users make? Do we want to present single properties with a source link where we found the information? Or do we want to show a complete entity with all new information and (possibly multiple) source links?
  • What source data do we want to show? Do we want to show links, or archived pages?

Project planning

Wikipedia articles are linked to Wikidata entities, saving us from having to find those links ourselves. That’s why I propose starting with extraction from these.

  1. Start with populating a local database with politicians; this includes scraping Wikipedia and related Wikidata entries
  2. Extraction of properties from Wikipedia articles
  3. Confirmation GUI
  4. Random web sources extraction

I think getting to point 3 and having a system that works is the most important thing. Point 4 is probably not easy and will require more budget than we have. But regardless, we’ll see how far we can get with that.

Interesting

You can find the development repository here: https://github.com/opensanctions/poliloom (Help build the world’s largest open database of politicians).

Please let me know if anything does not make sense, or if there’s something else that makes more sense. Also, I’d love to hear your opinion on any of the questions. I’m curious to hear what you think.


Thanks for sharing this, @Monneyboi - fascinating stuff! Are there any notable hurdles or successes you’ve encountered so far at this stage?

Re: parsing Wikipedia articles in multiple languages, are there any factors that you expect will guide the decision between using just English, the local language, or a combination of both for extraction? I’d imagine that it may vary not just from language to language, but region to region and/or entity to entity? Is it “completeness” or “accuracy” that seems like the bigger thing to solve there?

And where can we follow along as the project evolves? :slight_smile:

Thank you so much for the brilliant write-up, @Monneyboi ! To me, the goal of this research spike is this weird socio-technical question: if we make a verification machine (the “loom”) on top of this data aggregator, is that machine something that is fun and encouraging to use, or is it a torture chair?

Regarding all the tech choices, I couldn’t agree more. Perhaps a pet peeve of @jbothma to mention: we’ve been discussing what HoldsPosition should look like for a while now, and the leading contender at the moment is Tenure … wdyt?

Is it “completeness”, or “accuracy” that seems like the bigger thing to solve there?

Mostly completeness. I noticed that for some languages, the local-language page contained more information than the English page.

Are there any notable hurdles or successes you’ve encountered so far at this stage?

Well, I’m quite happy that it looks like it’s possible (I haven’t tried yet) to use the Wikipedia login system for this, providing us with a simple way for users to make use of this system, while allowing us to update Wikidata on their behalf.

And where can we follow along as the project evolves? :slight_smile:

I’ve updated this post with a link to the repository. Also, I’ll keep this post up-to-date with new info / updates.

if we make a verification machine (the “loom”) on top of this data aggregator, is that machine something that is fun and encouraging to use, or is it a torture chair?

That depends on what our users would enjoy more. I’m guessing fun and encouragement. We could have a leaderboard; that should be relatively simple to add. Many gamification concepts come to mind, but I don’t know if we should invest too much time in that.

we’ve been discussing what HoldsPosition should look like for a while now, and the leading contender at the moment is Tenure … wdyt?

I think HoldsPosition is about as clear as you could be. When I read that I know what’s up. When I read Tenure I’d have to check the docs. Yes, it’s two words and it looks kinda derpy, but it is very self-documenting.

Tenure sounds like a more detailed Position to me, like a Position with a timeframe attached.

I was chatting with @pudo about sources recently and this is exactly the kind of deep dive I was hoping for when I originally asked about regional source coverage and enrichment! My inner nerd is so chuffed to get more details about how you guys approach this. Huge thanks again, it’s this kind of transparency and collaborative thinking that makes an open data ecosystem so powerful. I’ll be following along (and cheering :confetti_ball: from the sidelines) as PoliLoom continues to take shape!


@SpottyMoose Thanks for the encouraging words!

Devlog #1

Yesterday Claude and I set up most of the importing logic; there were a couple of lessons:

  • Having all countries used by Wikidata locally saves us a lot of queries.
  • Mapping country to Wikipedia languages is not trivial.
  • Maybe we can eliminate the need for countries if the English Wikipedia has all we need, plus we could pass all positions to the structured data API, as these are not tokens sent into the context but rather constrain the token vocabulary.

Learning is great :tada:

Also, somewhere else, we discussed: “Do you think we should have a user confirm every property / position separately, or should they just approve a politician in one click, with all new data?”

To which this great answer came:

ha, this is fun. Maybe like with a “throw out” option? like all the proposed changes are shown, and then you can kick stuff off (or later: correct it)

Which I think is a good way to handle confirmation.

Also, we got our Wikipedia OAuth app approved :star:

Today I plan on having Claude fix up the last importing logic, and figuring out what positions to load. Then diving into the enrichment part of the API/CLI project and exploring the simplest way to get nice structured data.


Woo amazing stuff, and fun updates! Please keep these going :blush:


Devlog #2

At the end of last week we fleshed out the last of the import logic, including:

  • Support for loading manually curated political positions.
  • Used SPARQL for politician loading.
  • Linked politicians and positions to countries. :link:
  • Basic Wikipedia extraction
  • Unit tests :love_letter:

What we’ve learned: with a long list of manually curated political positions, we need a way to filter which positions we present to the LLM for extraction. As political positions and politicians are both easily linked to countries, we’ll use those links to filter the positions we try to extract for a given politician.

Today is extraction day :fire:. Before the day ends I want to have LLM-proposed properties in my local database. I’ll probably start building the GUI application in parallel.

New week, let’s go!
:partying_face:


Many thanks for sharing all of these updates so far! Why do you need to map countries to languages/vice versa? I guess there are languages spoken in many countries (like Arabic or Spanish) but also countries with many languages, so I imagine this to be hard.

Why do you need to map countries to languages/vice versa?

Because there are many different Wikipedia languages, and local-language Wikipedia articles can have more information on a politician than the English article.

Right now I’m only using the English Wikipedia, but I would like to also use the other languages. The simplest thing to do would be to pull in the articles from all languages, which would also be the most expensive.

As we have the citizenship countries of politicians, I was thinking of having some sort of mapping between countries of citizenship and Wikipedia articles, saving us feeding loads of redundant information into model context.

@frederikrichter, @leon.handreke & others I’ve forgotten. Thank you for the kind words and support!

Devlog #3

Welcome back! Strap in, this is a long one.

So the main thing we figured out is that we have too many positions to send to the OpenAI structured data API. This led us into similarity search territory; time to have fun with vectors.

Let me explain how our tactic changed over time. First we did this:

  1. Fetch the Wikipedia article for a politician
  2. Fetch positions that the politician could have based on his citizenship
  3. Pass those positions to the structured data OpenAI API.
  4. Get back mostly correct positions for the politician :magic_wand:

To me, this is the greatest way to do it. The LLM can only output positions that we have a Wikidata ID for.

However, looking at our positions data:

SELECT countries.name, COUNT(position_country.position_id) AS pos_count
FROM countries
LEFT OUTER JOIN position_country ON position_country.country_id = countries.id
GROUP BY countries.id
ORDER BY pos_count DESC;

We see that we have quite a few positions for some countries:

FR	78871
ES	13977
BR	5890
US	4264
...

And the OpenAI structured outputs API has some pesky limitations that only allow us 500 enum values :confused:

What now?

So we’ll have to find the positions that are most likely to be in our Wikipedia article. Enter similarity search. If you’re interested in how that works, here is part of my talk that tries to explain it.

So we started off by embedding all the positions; what we did next was:

  1. Fetch the Wikipedia article for a politician
  2. Embed the whole article
  3. Fetch positions that the politician could have based on his citizenship
  4. Select the 500 positions that are most similar to the article.
  5. Pass those positions to the structured data OpenAI API
  6. Get back mostly wrong positions for the politician :mage:

It turns out that there is too much info in the articles to give us a reliable set of similar positions, leading to the model outputting incorrect positions, as we force it to use one of the positions we pass it.

For example, when searching for similar positions for Aurélien Pradié, “Mayor of Pradiers” will be very similar. Looking at the proof, the model has a good understanding of what positions Aurélien has had, but is forced to output tokens that match the structured data schema:

LLM extracted data for Aurélien Pradié:
  Positions (5):
    Deputy of the French Second Republic (2017 - present)
      Proof: Aurélien Pradié (French pronunciation: [oʁeljɛ̃ pʁadje]; born 14 March 1986) is a French politician who has represented the 1st constituency of the Lot department in the National Assembly since 2017.
    Mayor of Pradiers (2014 - 2018-01-05)
      Proof: In the 2014 municipal elections, Pradié was elected mayor of Labastide-Murat, when his party list received over 70% of the vote.
    Mayor of Pradiers (2016 - 2018-01-05)
      Proof: Following the merger of communities, he became mayor of Cœur-de-Causse (the new merged commune) in 2016.
    deputy to the National Legislative Assembly (France) (2016 - 2018)
      Proof: He has also held a seat in the Regional Council of Occitania since 2021, previously in office from 2016 to 2018.
    deputy to the National Legislative Assembly (France) (2021 - present)
      Proof: He has also held a seat in the Regional Council of Occitania since 2021, previously in office from 2016 to 2018.

The new strategy

So we finally arrive at the strategy I was trying to avoid. What I think we’ll have to do now is:

  1. Fetch the Wikipedia article for a politician
  2. Prompt the language model to extract arbitrary positions, have it return whatever.

Then for every position:

  1. Check if we have an exact match with our Wikidata positions.
  2. If not, fetch the 500 most similar positions
  3. Pass those positions to the structured data OpenAI API.
  4. Prompt the model for the correct Wikidata position, or None
  5. Get back mostly correct Wikidata positions for the Wikipedia positions :relieved_face:

That’s the theory at least. :tada:

The benefit of this tactic is that it can be generalized to other properties as well. We’ll have to do the same for the BirthPlace property, for example.
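In code, that per-position reconciliation could look roughly like this. Both exact_match and most_similar_positions are placeholder helpers backed by our database and embeddings, and the model name is a stand-in:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class WikidataMatch(BaseModel):
    qid: str | None   # chosen Wikidata ID, or None if nothing fits

def reconcile_position(free_form):
    # 1. Exact label match against our local Wikidata positions
    match = exact_match(free_form)                               # placeholder lookup
    if match:
        return match

    # 2. Otherwise, narrow down to the 500 most similar positions ...
    candidates = most_similar_positions(free_form, limit=500)    # placeholder, embedding-backed

    # 3. ... and let the model pick one of those candidates, or None
    listing = "\n".join(f"{qid}: {label}" for qid, label in candidates)
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Pick the Wikidata position that matches the extracted position, or return null if none do."},
            {"role": "user", "content": f"Extracted: {free_form}\nCandidates:\n{listing}"},
        ],
        response_format=WikidataMatch,
    )
    return completion.choices[0].message.parsed.qid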

Other news

While Claude was fleshing out this enrichment process, I had Claude start work on the NextJS GUI in a second working directory. Now we also have a basic-gui branch containing the start of the confirmation GUI.

Which doesn’t do much yet; however, it does MediaWiki auth (with help from Bryan Davis, thanks!) :grinning_face:

Thank you for reading this far, I’ll try to keep the next one shorter :frog:.

Again, if anything I say sounds weird or I’m going in a direction that you think is not fruitful, please let me know!

That was it! Have a great day!


Thanks, I was aware of the context but wasn’t sure why the mapping was required!

A couple of tips -

For extracting from unstructured / semi-structured sources such as Wikipedia, GLiNER and BAML are your friends:

There are better examples, but here’s some related code:

And in general, the team at Explosion.ai in Berlin is amazing and brilliant: https://spacy.io/

Also, there are ways to align semantics across the Wikidata and DBPedia data sources. I would recommend building a graph from Wikidata for the parts which are most interesting, then using this as a “backbone” to build on. The data in Wikidata is pretty good, although arguably quite imbalanced and sparse, so the flexibility of a graph can help. Otherwise you’ll end up with really complicated SQL which is expensive to maintain.

For bringing in news articles / random web sources – I’ll put in a plug for Newsplunker from AskNews https://newsplunker.com/ which develops graph elements (and source links) from news articles in a way that’s friendly to publishers and reporters.

There are also commercial data sources which might help, if this is an option? At least to get the politician entities populated.

I’m happy to join a discussion about this kind of work. Hope it goes well.

Devlog #4

Welcome back to another update! This one’s exciting - we’ve made some major architectural shifts that have completely transformed how the project works. :rocket:

The Big Pivot: Dump Processing Over APIs

Remember how we started with Wikidata’s SPARQL API? Well, we’ve completely moved away from that approach. Turns out, when you’re trying to process millions of political entities, API rate limits and timeouts become your worst enemy.

We’re now processing the complete Wikidata dump directly - that’s a ~100GB compressed file that expands to about 1.7TB of JSON. Yes, you read that right. Over a terabyte of juicy data! :exploding_head:

But here’s the cool part - we’ve implemented a three-pass processing strategy that actually makes this manageable:

  1. Pass 1 - Build Hierarchy Trees: Extract all the “subclass of” relationships to figure out what counts as a political position vs. a geographic location
  2. Pass 2 - Import Supporting Entities: Get all the positions, locations, and countries into our database first
  3. Pass 3 - Import Politicians: Finally import the politicians and link them to everything we’ve already stored

This multi-pass process prevents deadlock issues that would occur when multiple workers try to insert conflicting entities, saving us some complex back-off logic.
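For a feel of what a single pass looks like, here is a rough, simplified single-process sketch of streaming the compressed dump line by line; the dump is one large JSON array with one entity per line, and the field names follow the Wikidata JSON format:

import bz2
import json

def iter_entities(dump_path):
    """Stream entities from a Wikidata JSON dump (e.g. latest-all.json.bz2)."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def subclass_edges(entity):
    """Yield (parent_qid, child_qid) pairs from 'subclass of' (P279) claims."""
    for claim in entity.get("claims", {}).get("P279", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            yield value["id"], entity["id"]

# Pass 1, simplified: build a parent -> direct subclasses map
# hierarchy = {}
# for entity in iter_entities("latest-all.json.bz2"):
#     for parent, child in subclass_edges(entity):
#         hierarchy.setdefault(parent, set()).add(child)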

Vector Search Actually Works!

Remember our struggle with similarity search from devlog #3? Good news - it works beautifully now!

Our two-stage extraction strategy is humming along:

  1. Stage 1: LLM extracts arbitrary position names from Wikipedia (e.g., “Mayor of Labastide-Murat”)
  2. Stage 2: We embed that text, find the 100 most similar Wikidata positions, then ask the LLM to pick the right one

This is what the results looked like (edited for brevity):

Enriching politician with Wikidata ID: Q30351657
Mapped 7 out of 19 extracted positions for Aurélien Pradié
LLM extracted data for Aurélien Pradié:
  Properties (2):
    PropertyType.BIRTH_DATE: 1986-03-14
    PropertyType.BIRTH_PLACE: Cahors, France
  Positions (7):
    member of the French National Assembly (2017 - present)
      Proof: Aurélien Pradié (French pronunciation: [oʁeljɛ̃ pʁadje]; born 14 March 1986) is a French politician who has represented the 1st constituency of the Lot department in the National Assembly since 2017.
    Regional councillor of Occitanie (2021 - present)
      Proof: He has also held a seat in the Regional Council of Occitania since 2021, previously in office from 2016 to 2018.
    Regional councillor of Occitanie (2016 - 2018)
      Proof: He has also held a seat in the Regional Council of Occitania since 2021, previously in office from 2016 to 2018.
    conseiller communautaire du Causse de Labastide-Murat (2008 - present)
      Proof: In the 2008 cantonal elections, Pradié was elected in the first round for the canton of Labastide-Murat, becoming the second-youngest councilor in France behind Jean Sarkozy and beating Lucien-Georges Foissac, his former teacher.
    Mayor of Labastide-Murat (2014 - 2016)
      Proof: In the 2014 municipal elections, Pradié was elected mayor of Labastide-Murat, when his party list received over 70% of the vote.
    Mayor of Cœur-de-Causse (2016 - 2018)
      Proof: Following the merger of communities, he became mayor of Cœur-de-Causse (the new merged commune) in 2016.
    Regional councillor of Occitanie (2015 - 2018)
      Proof: That same year, he was elected a regional councilor in Occitanie, where he sat on the committee for labor, professional development and apprenticeships.
Successfully enriched politician Aurélien Pradié
✅ Successfully enriched politician data from Wikipedia sources

The results are surprisingly good. The model can now correctly map extracted positions like “Deputy of the French Second Republic” to the right Wikidata entity, even when the text doesn’t match exactly.

We’re using SentenceTransformers with the ‘all-MiniLM-L6-v2’ model for embeddings, stored in PostgreSQL with the pgvector extension. It’s fast, it’s accurate, and it scales! :chart_increasing:
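As a rough sketch of that stage-two lookup, assuming a positions table with wikidata_id, label, and embedding (vector) columns; the names are illustrative and raw SQL is used here just to keep the pgvector query visible:

from sentence_transformers import SentenceTransformer
import psycopg2

model = SentenceTransformer("all-MiniLM-L6-v2")

def most_similar_positions(extracted_name, conn, limit=100):
    """Return (wikidata_id, label) for the positions closest to the extracted string."""
    embedding = model.encode(extracted_name).tolist()
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT wikidata_id, label
            FROM positions
            ORDER BY embedding <=> %s::vector   -- pgvector cosine distance
            LIMIT %s
            """,
            (vector_literal, limit),
        )
        return cur.fetchall()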

Performance Breakthroughs

The dump processing is properly parallelized now. We’re talking:

  • Chunk-based parallel processing that scales linearly with CPU cores
  • Near-linear speedup up to 32+ cores (tested!)
  • Memory-efficient line-by-line processing for the 1.7TB file
  • Batch database operations for optimal performance

Running poliloom dump build-hierarchy processes the entire dump in parallel on all cores in reasonable time; on my machine one pass takes roughly 19 minutes :fire:

What We Learned About Wikidata

Processing the full dump taught us some interesting things:

  • Statement ranks matter: Wikidata has preferred, normal, and deprecated statements. We need to handle these properly.
  • Incomplete dates are everywhere: Things like “1962” or “JUN 1982” are common and need special handling (see the sketch after this list)
  • The hierarchy is deep: Political positions have complex subclass relationships that go many levels deep
  • Countries have SO many positions: France alone has 78K+ political positions in Wikidata!
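For the incomplete-dates point, here is a minimal sketch of how a Wikidata time value could be truncated based on its precision field; in the Wikidata data model, precision 9 means year, 10 means month, and 11 means day:

def simplify_wikidata_time(value):
    """Turn a Wikidata time value such as
    {"time": "+1962-00-00T00:00:00Z", "precision": 9}
    into "1962", "1962-06" or "1962-06-15", depending on precision.
    (BCE dates, which start with "-", are not handled in this sketch.)"""
    date_part = value["time"].lstrip("+").split("T")[0]   # e.g. "1962-00-00"
    year, month, day = date_part.split("-")
    precision = value.get("precision", 11)
    if precision >= 11:
        return f"{year}-{month}-{day}"
    if precision == 10:
        return f"{year}-{month}"
    if precision == 9:
        return year
    return None  # coarser than a year (decade, century, ...); skip for now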

Testing That Actually Works

We’ve got a proper testing framework now using pytest with Docker Compose for the test database. The tests focus on the core data pipeline - no more guessing if our imports work correctly.

What we test: All the business logic, database models, dump processing, LLM extraction (mocked), and API endpoints.

What we don’t test: CLI interfaces (they’re thin wrappers), database migrations, and performance (that’s beyond our scope for now).

Next Steps

The two-stage extraction is working for positions and birthplaces. Now we need to:

  1. Create an evaluation mechanism so we can improve our prompts and similarity search.
  2. Archive scraped pages so users can easily review our extractions
  3. Finish and test our confirmation GUI so that users can actually review our extractions.

The foundation is solid now. We can process the entire Wikidata dump, extract meaningful information with LLMs, and have a robust similarity search system. That’s a pretty good place to be! :tada:

Shoutouts

Big thanks to everyone who’s been following along and providing feedback. The architectural decisions we’ve made - especially moving to dump processing - have made this project actually viable at scale.

Question for the community: how do we improve the similarity-search matching of free-form, freshly extracted strings against our Wikidata entities? Are there properties we could add to the embedding (for example its class) that would improve our results? What do we add to the prompts?

That’s all for now! Next update will hopefully show how we improved our reconciliation and how that improved our enrichment pipeline.

Stay tuned! :waving_hand:

P.S. If I missed something, you have any questions, or you think we should change our strategy on any of the discussed techniques, let us know!


Damn, that’s a hell of an update! Let’s talk about getting you a VM to run this on ASAP? Or what kind of setup would be a good fit?


Devlog #5: Show Me The Receipts! :clipboard::sparkles:

Hey everyone! I’m back from a lovely 2 weeks in northern Spain, so it’s time for another PoliLoom update! Remember how in devlog #4 we cracked the puzzle of processing Wikidata dumps and got our AI extraction pipeline humming? Well, we’ve been tackling the next critical challenge: trust.

The Trust Problem :thinking:

Here’s the thing - having an AI tell you “Emmanuel Macron was born in Amiens” is one thing, but when you’re planning to feed that data back to Wikidata, you need to be absolutely certain it’s correct. The question we kept asking ourselves was: how do we give human evaluators the confidence to approve AI-extracted claims?

The answer turned out to be beautifully simple: show them exactly where the AI found each piece of information.

Enter: Source Traceability :magnifying_glass_tilted_left:

We’ve just shipped what I’m calling our “show your work” feature - a complete source traceability system that preserves and displays the original evidence for every AI extraction.

Here’s how it works:

:page_facing_up: Archived Pages: Every time our system processes a web page to extract politician data, we now save a complete copy of that page to our archives. No more broken links, no more “this page has changed” headaches - we preserve the exact content our AI analyzed.

:bullseye: Proof Lines: Each extracted property, position, and birthplace now comes with a “proof line” - the specific sentence or phrase from the web page that the AI used to make its claim.

:sparkles: Visual Evidence: Our evaluation interface now features a beautiful split-panel design. On the left, you see the extracted claims awaiting evaluation. On the right? The actual web page where each claim was found, with the relevant text automatically highlighted and scrolled into view.

Making Verification Actually Happen :magnifying_glass_tilted_left:

Here’s the reality: AI extraction is only as good as the human verification that follows. But verification is tedious work, and if we make it even slightly frustrating, people simply won’t do it thoroughly. We need evaluators to actually want to verify claims, not just click through them.

The key insight is that good verification requires context. When you can see exactly where the AI found each piece of information, the work becomes surprisingly satisfying - there’s something oddly pleasing about seeing text highlight exactly where you expect it. But this raises an interesting question: should we lean into this satisfaction?

I’m genuinely torn about gamification here. On one hand, leaderboards and verification streaks could motivate more people to contribute. Wikipedia has shown that recognition systems can work in collaborative knowledge projects. A “top verifier of the week” badge might turn tedious validation into friendly competition.

On the other hand, civic data isn’t a game. Would racing to the top of a leaderboard encourage hasty approvals? Could competitive elements actually decrease verification quality? There’s a real risk that gamification could create perverse incentives - people optimizing for quantity over quality, or worse, blindly approving claims to climb rankings.

By making verification effortless and contextual, we’ve already dramatically increased the chances that people will do careful evaluation work. The question is: should we stop there, or risk adding game elements that might boost participation but potentially compromise the trustworthiness we’re trying to build?

Technical Tidbits for the Curious :hammer_and_wrench:

For the technically inclined: we’re storing archived pages using content-addressable storage with fetch timestamps, serving them through authenticated API endpoints, and using a clever TreeWalker-based highlighting system that works reliably across different web page structures. The iframe integration required some careful CSP handling, but the result is buttery smooth.
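Content-addressable here just means the archive path is derived from a hash of the fetched content; a minimal local-disk sketch, with made-up paths and metadata fields:

import hashlib
import json
import time
from pathlib import Path

ARCHIVE_ROOT = Path("archives")  # local directory; swappable for cloud storage

def archive_page(url, html):
    """Store a fetched page under a content hash and return the archive key."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    page_dir = ARCHIVE_ROOT / digest[:2] / digest
    page_dir.mkdir(parents=True, exist_ok=True)
    (page_dir / "page.html").write_text(html, encoding="utf-8")
    (page_dir / "meta.json").write_text(json.dumps({
        "url": url,
        "fetched_at": time.time(),   # fetch timestamp, as mentioned above
        "sha256": digest,
    }), encoding="utf-8")
    return digest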

The proof line extraction happens during the LLM processing phase - we ask our AI not just what it found, but where it found it, creating this beautiful paper trail from raw HTML to structured claim.

What’s Next? :glowing_star:

This source traceability system feels like a huge step toward making PoliLoom genuinely useful for Wikidata contributions. When human evaluators can easily verify AI claims against primary sources, we’re building the foundation for trustworthy automated data enrichment.

But we’re not done yet! The evaluation interface is already revealing some interesting gaps in our entity coverage. We’re discovering holes in the Wikidata hierarchies we import - places and positions that should be in our database but aren’t, because they follow different classification paths than we expected. This means our AI sometimes can’t properly map extracted information to existing Wikidata entities.

This highlights the need for better entity coverage and also better prompt tuning. Our plan is to set up a test environment soon, where we can build a curated test dataset to systematically improve both our entity imports and our LLM extraction prompts. Having real examples of what works and what doesn’t will let us iterate much faster than just hoping for the best with each politician we process.

I’m curious - what do you think about this approach to AI transparency? Are there other verification features you’d want to see?

As always, would love to hear your thoughts! :thought_balloon:


Brilliant. The highlighting and scrolling is such a big deal for turning the drudgery of verifying extraction from a wall of text into something at least much quicker and easier.

Super pumped to see this.


This approach is needed in so many areas of AI fact-verification, as those of us prompting LLMs to link URLs with text-fragment syntax (#:~:text=) have attempted - it does work and helps a lot, but they ignore the prompt a lot. Amazing work to provide such a straightforward and useful interface. Congrats @Monneyboi !!


Devlog #6: Localhost to Cloud :cloud:

Hey everyone, we’re back! :waving_hand: Back from running a marathon. So, update time!

Remember that proof of concept from last devlog? It’s now running online! :rocket:

*checks notes*

126 files changed, 13,789 lines added, 10,863 deleted. Basically just moved a comma or two. Nothing major. So what changed? Well, nothing really… And everything. Let me explain.

The Great Migration

We’ve basically refactored the entire system to run properly in the cloud. This wasn’t just a lift-and-shift, it was almost a complete re-architecture for production scale.

Here’s what our “minor refactor” involved:

  • Refactored the storage layer to support local storage and Google Cloud Storage.
  • Added Cloud SQL connector support, same story.
  • Dockerized everything. Containers for API / GUI services and containerized data pipelines.
  • GitHub Actions workflows that build and push images
  • Production and development Docker Compose profiles (because docker-compose.override.yml is a thing of beauty)

The testing environment is live now! :tada: We’ll drop a link later this week once we’re confident it won’t immediately fall over.

The Great Refactoring (Or: How I switched “vibes”)

Here’s what happened: The proof of concept phase was all about speed, where we used Claude in plan mode to rapidly prototype and validate the core idea. It worked brilliantly for that purpose.

Now we’ve taken that validated concept, kept the good parts and improved upon the bad.

The first version prioritized validation over optimization and extensibility. Now that we proved the concept, we systematically refactored everything. During this refactor, we’ve made sure to generalize some of our logic in a way that allows for easier future changes, like the multi-lingual pages support, or providing more context to the language models in the enrichment step.

This time, I acted as the architect, directing the LLM with specific implementation details rather than letting it generate entire file structures. The core functionality remains the same: extracting politician data, enriching it, and pushing to Wikidata, but nearly every part of it changed and improved along the way.

Those 13,000 lines of changes represent the evolution from rapid prototype to production system. Every refactor had a purpose: better performance, cleaner architecture, future extensibility, or improved reliability.

What Actually Works Now?

The boring stuff that makes this actually usable:

  • Real Wikidata Integration: We’re pushing statements that actually stick. The API even accepts them!
  • Importing: It runs roughly 3 times faster and uses only the database instead of keeping part of the state in a JSON file. We also import a wider array of supporting locations and positions, using a branch-filtering strategy for the class hierarchy.
  • Proper Evaluation UI: Shows existing Wikidata data alongside extracted properties, Wikimedia authentication.
  • Enrichment: Now uses related classes to provide the language model with more context, was switched over to GPT-5, stores archived pages in cloud storage and handles overlapping time frames better.
  • Dump Tracking: We keep track of the published Wikidata dumps so we can automatically stay up to date with those.

The GPT-5 Situation

Oh right, OpenAI launched their Responses API and deprecated everything we were using. Spent a delightful weekend migrating to the new format. GPT-5 is faster and better though, so there’s that.

What we’re working on now

The real improvements you’ll notice:

  • Smarter entity matching: Have the LLM understand that “Minister of Defence” in a Myanmar article probably means Myanmar’s defense minister, not Belgium’s
  • Context-aware extraction: Provide the LLM with more context by using the class hierarchy and other properties like part of or country when mapping positions and locations to extracted statements, dramatically improving accuracy
  • Filtered position trees: We should block some religious and military positions in our dataset, basically blocking some branches of the class tree, as sketched below (turns out 21,818 bishops aren’t really politicians)
  • Multilingual support coming: Building infrastructure to allow users to set their preferred languages, so we can also evaluate politicians’ native language sources, not just English Wikipedia
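Blocking a branch boils down to collecting every descendant of a blocked root in the subclass tree and excluding those QIDs on import; a sketch, reusing a parent-to-children hierarchy map like the one from the dump-pass sketch above (the QID in the comment is a placeholder):

from collections import deque

def blocked_descendants(hierarchy, blocked_roots):
    """Collect every QID reachable from the blocked roots via 'subclass of' edges.

    `hierarchy` maps a parent QID to the set of its direct subclasses."""
    blocked = set(blocked_roots)
    queue = deque(blocked_roots)
    while queue:
        parent = queue.popleft()
        for child in hierarchy.get(parent, ()):
            if child not in blocked:
                blocked.add(child)
                queue.append(child)
    return blocked

# e.g. exclude everything under a blocked root such as "bishop"
# excluded = blocked_descendants(hierarchy, {"Q0000000"})  # placeholder QID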

The Commit Messages Hall of Fame

Special shoutout to these absolute gems from the git log:

  • “semantics” (2 AM variable renaming)
  • “lost commit of removed contextmanager” (???)
  • “steal position class tree from zavod” (borrowed with attribution!)

What’s Next?

We’ve got a working system online! :confetti_ball: The testing environment will be open for anyone who wants to help enrich politician data. We’re excited to see how it performs with real users.

The backlog is looking healthy with 14 open issues. Everything from multilingual support to building a full leaderboard system for Wikidata contributions. You know, simple stuff.

We went from local development to cloud production in three weeks. The system now reliably processes politicians, enriches their data, and pushes validated statements to Wikidata at scale.

Small wins, people. Small wins. :trophy:

P.S. - Testing environment link dropping later this week. Get ready! :ballot_box_with_ballot:

1 Like