PoliLoom – Loom for weaving politician's data


Hey there OpenSanctions community! I’m Johan. Friedrich asked me to have a go at some data extraction, and share the development process here, so let’s dive into it together.

We’re about to find out how great LLMs are at extracting metadata on politicians from Wikipedia and other sources. Our goal is to enrich Wikidata. This is a research project; the goal is a usable proof of concept.

Below is something that resembles a plan. I did my best to structure my thoughts on how I think we should approach this, based on a day or so of research. I’m also left with some questions, which I’ve sprinkled throughout this document.

If you have some insight, or can tell me that I’m on the wrong track, I would be grateful!

Let’s get into it.

Database

We want users to decide whether our changes are valid, so we’ll have to store them first. But to know what has changed, we need to know what properties are currently in Wikidata.

We’ll be using SQLAlchemy and Alembic, with SQLite in development and PostgreSQL in production.

Schema

We’d be reproducing a small part of the Wikidata politician data model. To keep things as simple as possible, we’d have Politician, Position, Property and a many-to-many HoldsPosition relationship entity.

  • Politician holds a politician’s name and country.
  • Source is a web source that contains information on a politician.
  • Property holds things like date of birth and birthplace for politicians.
  • Position holds all the positions from Wikidata, each with their country.
  • HoldsPosition links a politician to a position, together with start & end dates.

Source would have a many-to-many relation to Politician, Property and Position.

Property and HoldsPosition would also hold information on whether they are newly extracted, and which user (if any) has confirmed their correctness and updated Wikidata. Property would also have a type, which for now would be either BirthDate or BirthPlace.
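To make that concrete, here’s a minimal SQLAlchemy sketch of what these tables could look like. Table and column names are just my illustration of the plan above, not the final models (Source and its many-to-many links are omitted for brevity):

import enum
from sqlalchemy import Boolean, Column, Enum, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class PropertyType(enum.Enum):
    BIRTH_DATE = "BirthDate"
    BIRTH_PLACE = "BirthPlace"

class Politician(Base):
    __tablename__ = "politicians"
    id = Column(Integer, primary_key=True)
    wikidata_id = Column(String, unique=True)      # e.g. "Q30351657"
    name = Column(String, nullable=False)
    country = Column(String)

class Position(Base):
    __tablename__ = "positions"
    id = Column(Integer, primary_key=True)
    wikidata_id = Column(String, unique=True)
    label = Column(String, nullable=False)
    country = Column(String)

class Property(Base):
    __tablename__ = "properties"
    id = Column(Integer, primary_key=True)
    politician_id = Column(Integer, ForeignKey("politicians.id"))
    type = Column(Enum(PropertyType), nullable=False)
    value = Column(String, nullable=False)          # plain string, so incomplete dates fit too
    is_extracted = Column(Boolean, default=False)   # True for newly extracted, unconfirmed values
    confirmed_by = Column(String, nullable=True)    # user who confirmed and pushed to Wikidata

class HoldsPosition(Base):
    __tablename__ = "holds_position"
    id = Column(Integer, primary_key=True)
    politician_id = Column(Integer, ForeignKey("politicians.id"))
    position_id = Column(Integer, ForeignKey("positions.id"))
    start_date = Column(String)                     # strings, since Wikidata dates can be partial
    end_date = Column(String)
    is_extracted = Column(Boolean, default=False)
    confirmed_by = Column(String, nullable=True)

Storing dates as plain strings sidesteps the incomplete-date question below, at least for now.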

Populating our database

We can get a list of all politicians in Wikidata with a SPARQL query, either by occupation:

SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .                    # must be human
  ?politician wdt:P106/wdt:P279* wd:Q82955 .     # occupation is politician or subclass
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100

Or by position held:

SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .                           # must be human
  ?politician wdt:P39 ?position .                       # holds a position
  ?position wdt:P31/wdt:P279* wd:Q4164871 .            # position is political
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100

We could also fetch both and merge the sets. However, these queries are slow, and have to be paginated.
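For reference, a minimal sketch of that pagination, using requests against the public query service (the page size, User-Agent and LIMIT/OFFSET approach are just illustrative, and a slow query will still time out regardless of paging):

import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .
  ?politician wdt:P106/wdt:P279* wd:Q82955 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT %(limit)d OFFSET %(offset)d
"""

def fetch_politicians(page_size=500):
    offset = 0
    while True:
        response = requests.get(
            WDQS,
            params={"query": QUERY % {"limit": page_size, "offset": offset}, "format": "json"},
            headers={"User-Agent": "PoliLoom research prototype"},
            timeout=120,
        )
        response.raise_for_status()
        rows = response.json()["results"]["bindings"]
        if not rows:
            break
        for row in rows:
            yield row["politician"]["value"], row["politicianLabel"]["value"]
        offset += page_size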

Wikipedia pages should have a linked Wikidata item, and Wikidata records of politicians should almost always have a Wikipedia link, so we should be able to connect these entities.

There are probably politician Wikidata records that don’t have a linked Wikipedia page, and Wikipedia pages that don’t have a Wikidata entry. We could try to process Wikipedia politician articles that have no linked entity the same way as the random web pages we’ll query.

When we populate our local database, we want Politician entities together with their Positions and Properties. While populating, we’ll try to get each politician’s Wikipedia link in English and in their local language.

Questions

  • Do we filter out deceased people?
  • If we’d be importing from the Wikipedia politician category, do we filter out all deceased people with a rule/keyword based filter?

Extraction of new properties

Once we have our local list of politicians, we can start extracting properties. We basically have two types of sources: Wikipedia articles that are linked to our Wikidata entries, and random web sources that are not.

Then there are two types of information that we want to extract: specific properties like date of birth, birthplace etc., and political positions. Extracting properties should be relatively simple. For positions, we’d have to know which positions we’d like to extract.

For extraction we’d feed the page to OpenAI and use the structured data API to get back data that fits our model.
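As a rough sketch of what such an extraction call could look like, using the OpenAI Python SDK’s structured-output parsing with a Pydantic model (the model name and schema here are placeholders, not the final setup):

from typing import Optional
from pydantic import BaseModel
from openai import OpenAI

class ExtractedProperties(BaseModel):
    birth_date: Optional[str]   # ISO-ish string, may be incomplete ("1962", "1982-06")
    birth_place: Optional[str]

client = OpenAI()

def extract_properties(article_text: str) -> ExtractedProperties:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract the politician's birth date and birth place from the article. Return null for anything not stated."},
            {"role": "user", "content": article_text},
        ],
        response_format=ExtractedProperties,
    )
    return completion.choices[0].message.parsed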

We can query all political positions in Wikidata. At the time of writing that returns 37631 rows:

SELECT DISTINCT ?position ?positionLabel WHERE {
  ?position wdt:P31/wdt:P279* wd:Q294414 . # elected or appointed political position
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

It would be nice if we could tell our LLM which positions we’d like to extract, instead of having it output whatever and then having to match that to the positions we know. However, 37631 rows is a lot of tokens. As we know the country of the politician before extracting positions, we can pass only the subset of positions for that country to our LLM, so we can be sure to extract correct references.

Questions

  • How do we handle incomplete birth dates (e.g. 1962, June 1982)?
  • How do we manage names of people in different scripts?

Wikipedia

Scraping Wikipedia would be relatively simple. During our Wikidata import we’ve stored Wikipedia links. We’d fetch those for each Politician and extract properties/positions from the content. We’d try to get the page in English and in the local language of the politician.

Scraping random webpages

Then there are the random pages. For these we’d have to link any information we extract to a Wikipedia article / Wikidata entity ourselves.

I think we should handle two general page types here: index pages and detail pages.

  • For index pages, we can have the LLM output an XPath/CSS selector for all detail page links, and possibly for pagination.
  • For detail pages, we try to extract what we can, then try to find the related entity in our database. We can do a similarity search for entities with the same properties and use a score threshold to decide whether we should try to update a found entity.

My mind is going wild on building some sort of scraper-building robot that generates XPath selectors with an LLM, then scrapes pages with those selectors, re-running the XPath generation when the scraper breaks. But I think that would only make sense if we want to run our scraper often. The simplest thing would be to feed everything into the model.

Questions

  • Do we archive web sources?
  • How often do we want to run our tool?
  • How do we handle multilingual variations in names?
  • How will you handle conflicting information between sources? (e.g., Wikipedia says one birth date, government site says another)

API

We’ll create a FastAPI API that can be queried for entities with properties and positions. There should also be a route for users to mark a politician or position as correct.

Authentication could probably be handled by the MediaWiki OAuth system.

Routes

  • /politicians/unconfirmed - lists politicians with unconfirmed Property or HoldsPosition
  • /politicians/{politician_id}/confirm - Allows the user to confirm the properties/positions of a politician. Should allow for the user to “throw out” properties / positions in the POST data.
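A minimal sketch of what the confirm route could look like (the request body shape and the placeholder logic are assumptions, not the final API):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ConfirmRequest(BaseModel):
    discarded_properties: list[int] = []      # IDs of extracted properties to throw out
    discarded_positions: list[int] = []       # IDs of extracted positions to throw out

@app.post("/politicians/{politician_id}/confirm")
def confirm_politician(politician_id: int, body: ConfirmRequest):
    # Placeholder logic: everything not discarded gets marked confirmed,
    # the discarded rows are dropped, and a Wikidata update is queued.
    return {
        "politician_id": politician_id,
        "discarded": len(body.discarded_properties) + len(body.discarded_positions),
        "status": "confirmed",
    }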

CLI commands

The API project would also house the CLI functionality for importing and enriching data.

  • Import single ID from Wikidata
  • Enrich single ID with data from Wikipedia
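A sketch of that CLI, assuming click for the command structure (whether the project actually uses click, and these exact command names, are assumptions on my part):

import click

@click.group()
def cli():
    """PoliLoom import and enrichment commands."""

@cli.command("import")
@click.argument("wikidata_id")
def import_politician(wikidata_id):
    """Import a single politician from Wikidata, e.g. Q30351657."""
    click.echo(f"Importing {wikidata_id} from Wikidata...")
    # fetch the entity, its positions and properties, and store them locally

@cli.command("enrich")
@click.argument("wikidata_id")
def enrich_politician(wikidata_id):
    """Enrich a single politician with data extracted from Wikipedia."""
    click.echo(f"Enriching {wikidata_id} from Wikipedia...")
    # fetch the linked Wikipedia article(s) and run the LLM extraction

if __name__ == "__main__":
    cli()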

Confirmation GUI

We’d like users to be able to confirm our generated data. To do this we’ll build a NextJS GUI where users will be provided with statements and their accompanying source data / URL.

When users confirm the correctness of our generated properties, we’d like to update Wikidata on their behalf through the Wikidata API.
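For the Wikidata update itself, the Action API’s wbcreateclaim endpoint looks like the natural candidate. A rough sketch, with placeholder OAuth credentials, a hard-coded date-of-birth (P569) example, and no error handling:

import json
import requests
from requests_oauthlib import OAuth1

API = "https://www.wikidata.org/w/api.php"
auth = OAuth1("consumer_key", "consumer_secret", "user_token", "user_secret")  # via MediaWiki OAuth

def add_birth_date(entity_id: str, iso_date: str):
    # 1. Get a CSRF token for the authenticated user.
    token = requests.get(
        API, params={"action": "query", "meta": "tokens", "type": "csrf", "format": "json"}, auth=auth
    ).json()["query"]["tokens"]["csrftoken"]

    # 2. Create a date of birth (P569) claim with day precision.
    value = {
        "time": f"+{iso_date}T00:00:00Z", "timezone": 0, "before": 0, "after": 0,
        "precision": 11, "calendarmodel": "http://www.wikidata.org/entity/Q1985727",
    }
    return requests.post(
        API,
        data={"action": "wbcreateclaim", "entity": entity_id, "property": "P569",
              "snaktype": "value", "value": json.dumps(value), "token": token, "format": "json"},
        auth=auth,
    ).json()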

Questions

  • What choices do we want to have our users make? Do we want to present single properties with a source link where we found the information? Or do we want to show a complete entity with all new information and (possibly multiple) source links?
  • What source data do we want to show? Do we want to show links, or archived pages?

Project planning

Wikipedia articles are linked to Wikidata entities, saving us from having to find those links ourselves. That’s why I propose starting with extraction from these.

  1. Start with populating a local database with politicians, this includes scraping Wikipedia and related Wikidata entries
  2. Extraction of properties from Wikipedia articles
  3. Confirmation GUI
  4. Random web sources extraction

I think getting to point 3 and having a system that works is the most important thing. Point 4 is probably not easy and will require more budget than we have. But regardless, we’ll see how far we can get with that.

Interesting

You can find the development repository here: GitHub - opensanctions/poliloom: Loom for weaving politician's data

Please let me know if anything does not make sense, or if there’s something else that makes more sense. Also, I’d love to hear your opinion on any of the questions. I’m curious to hear what you think.


Thanks for sharing this, @Monneyboi - fascinating stuff! Are there any notable hurdles or successes you’ve encountered so far at this stage?

Re: parsing Wikipedia articles in multiple languages, are there any factors that you expect will guide the decision between using just English, the local language, or a combination of both for extraction? I’d imagine that it may vary not just language-to-language, but region-to-region and/or entity-to-entity? Is it “completeness” or “accuracy” that seems like the bigger thing to solve there?

And where can we follow along as the project evolves? :slight_smile:

Thank you so much for the brilliant write-up, @Monneyboi ! To me, the goal of this research spike is this weird social-technical mix question: if we make a verification machine (the “loom”) on top of this data aggregator, is that machine something that is fun and encouraging to use, or is it a torture chair?

Regarding all the tech choices I couldn’t agree more. Perhaps a pet peeve of @jbothma to mention, we’ve been discussing what HoldsPosition should look like for a while now, and the leading contender at the moment is Tenure … wdyt?

Is it “completeness”, or “accuracy” that seems like the bigger thing to solve there?

Mostly completeness. I noticed that for some languages, the local-language article contained more information than the English page.

Are there any notable hurdles or successes you’ve encountered so far at this stage?

Well, I’m quite happy that it looks like it’s possible (I haven’t tried yet) to use the Wikipedia login system for this, providing us with a simple way for users to make use of this system, while allowing us to update Wikidata on their behalf.

And where can we follow along as the project evolves? :slight_smile:

I’ve updated this post with a link to the repository. Also, I’ll keep this post up to date with new info / updates.

if we make a verification machine (the “loom”) on top of this data aggregator, is that machine something that is fun and encouraging to use, or is it a torture chair?

That depends on what our users would enjoy more. I’m guessing fun and encouragement. We could have a leaderboard, which should be relatively simple to add. Many gamification concepts come to mind, but I don’t know if we should invest too much time in that.

we’ve been discussing what HoldsPosition should look like for a while now, and the leading contender at the moment is Tenure … wdyt?

I think HoldsPosition is about as clear as you could be. When I read that, I know what’s up. When I read Tenure, I’d have to check the docs. Yes, it’s two words and it looks kinda derpy, but it is very self-documenting.

Tenure sounds like a more detailed Position to me, like a Position with a timeframe attached.

I was chatting with @pudo about sources recently and this is exactly the kind of deep dive I was hoping for when I originally asked about regional source coverage and enrichment! My inner nerd is so chuffed to get more details about how you guys approach this. Huge thanks again, it’s this kind of transparency and collaborative thinking that makes an open data ecosystem so powerful. I’ll be following along (and cheering :confetti_ball: from the sidelines) as PoliLoom continues to take shape!


@SpottyMoose Thanks for the encouraging words!

Devlog #1

Yesterday Claude and I set up most of the importing logic. There were a couple of lessons:

  • Having all countries used by Wikidata locally saves us a lot of queries.
  • Mapping country to Wikipedia languages is not trivial.
  • Maybe we can eliminate the need for countries if the English Wikipedia has all we need, plus we might be able to pass all positions to the structured data API, as these are not tokens sent into context but rather constrain the token vocabulary.

Learning is great :tada:

Also, somewhere else, we discussed: “Do you think we should have a user confirm every property / position separately, or should they just approve a politician in one click, with all new data?”

To which this great answer came:

ha, this is fun. Maybe like with a “throw out” option? like all the proposed changes are shown, and then you can kick stuff off (or later: correct it)

Which I think is a good way to handle confirmation.

Also, we got our Wikipedia OAuth app approved :star:

Today I plan on having Claude fix up the last importing logic, and figuring out what positions to load. Then diving into the enrichment part of the API/CLI project and exploring the simplest way to get nice structured data.


Woo amazing stuff, and fun updates! Please keep these going :blush:


Devlog #2

At the end of last week we fleshed out the last of the import logic, including:

  • Support for loading manually curated political positions.
  • SPARQL-based politician loading.
  • Linked politicians and positions to countries. :link:
  • Basic Wikipedia extraction.
  • Unit tests :love_letter:

What we’ve learned: with a long list of manually curated political positions, we need a way to filter which positions we present to the LLM for extraction. As political positions and politicians are both easily linked to countries, we’ll use those countries to filter the positions we try to extract for a politician.

Today is extraction day :fire:. Before the day ends I want to have LLM-proposed properties in my local database. I’ll probably start building the GUI application in parallel.

New week, let’s go!
:partying_face:


Many thanks for sharing all of these updates so far! Why do you need to map countries to languages/vice versa? I guess there are languages spoken in many countries (like Arabic or Spanish) but also countries with many languages, so I imagine this to be hard.

Why do you need to map countries to languages/vice versa?

Because there are many different Wikipedia languages, and local-language Wikipedia articles can have more information on a politician than the English article.

Right now I’m only using the English Wikipedia, but I would like to also use the other languages. The simplest thing to do would be to pull in the articles from all languages, which would also be the most expensive.

As we have the citizenship countries of politicians, I was thinking of having some sort of mapping between countries of citizenship and Wikipedia languages, saving us from feeding loads of redundant information into the model context.

@frederikrichter, @leon.handreke & others I’ve forgotten. Thank you for the kind words and support!

Devlog #3

Welcome back! Strap in, this is a long one.

So the main thing we figured out is that we have too many positions to send to the OpenAI structured data API. This led us into similarity search territory; time to have fun with vectors.

Let me explain how our tactic changed over time. First we did this:

  1. Fetch the Wikipedia article for a politician
  2. Fetch positions that the politician could have based on his citizenship
  3. Pass those positions to the structured data OpenAI API.
  4. Get back mostly correct positions for the politician :magic_wand:

To me, this is the greatest way to do it. The LLM can only output positions that we have a Wikidata ID for.

However, looking at our positions data:

select countries.name, count(position_country.position_id) as pos_count from countries left outer join position_country on position_country.country_id = countries.id group by countries.id order by pos_count desc;

We see that we have quite a few positions for some countries:

FR	78871
ES	13977
BR	5890
US	4264
...

And the OpenAI structured outputs API has some pesky limitations that only allow us 500 enum values :confused:

What now?

So we’ll have to find the positions that are most likely to be in our Wikipedia article. Enter similarity search. If you’re interested in how that works, here is part of my talk that tries to explain it.

So we started off by embedding all the positions; what we did next was:

  1. Fetch the Wikipedia article for a politician
  2. Embed the whole article
  3. Fetch positions that the politician could have based on his citizenship
  4. Select the 500 positions that are most similar to the article.
  5. Pass those positions to the structured data OpenAI API
  6. Get back mostly wrong positions for the politician :mage:

It turns out that there is too much info in the articles to give us a reliable set of similar positions, leading to the model outputting incorrect positions, as we force it to use one of the positions we pass it.

For example, when searching for similar positions for Aurélien Pradié, “Mayor of Pradiers” will be very similar. Looking at the proof, the model has a good understanding of what positions Aurélien has had, but is forced to output tokens that match the structured data schema:

LLM extracted data for Aurélien Pradié:
  Positions (5):
    Deputy of the French Second Republic (2017 - present)
      Proof: Aurélien Pradié (French pronunciation: [oʁeljɛ̃ pʁadje]; born 14 March 1986) is a French politician who has represented the 1st constituency of the Lot department in the National Assembly since 2017.
    Mayor of Pradiers (2014 - 2018-01-05)
      Proof: In the 2014 municipal elections, Pradié was elected mayor of Labastide-Murat, when his party list received over 70% of the vote.
    Mayor of Pradiers (2016 - 2018-01-05)
      Proof: Following the merger of communities, he became mayor of Cœur-de-Causse (the new merged commune) in 2016.
    deputy to the National Legislative Assembly (France) (2016 - 2018)
      Proof: He has also held a seat in the Regional Council of Occitania since 2021, previously in office from 2016 to 2018.
    deputy to the National Legislative Assembly (France) (2021 - present)
      Proof: He has also held a seat in the Regional Council of Occitania since 2021, previously in office from 2016 to 2018.

The new strategy

So we finally arrive at the strategy I was trying to avoid. What I think we’ll have to do now is:

  1. Fetch the Wikipedia article for a politician
  2. Prompt the language model to extract arbitrary positions, have it return whatever.

Then for every position:

  1. Check if we have an exact match with our Wikidata positions.
  2. If not, fetch the 500 most similar positions
  3. Pass those positions to the structured data OpenAI API.
  4. Prompt the model for the correct Wikidata position, or None
  5. Get back mostly correct Wikidata positions for the Wikipedia positions :relieved_face:

That’s the theory at least. :tada:

The benefit of this tactic is that it can be generalized to other properties as well; we’ll have to do the same for the BirthPlace property, for example.
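Sketched in code, the matching part of that second stage could look roughly like this (in-memory embeddings, illustrative QIDs, and the final LLM pick left as the next step; in practice the candidates get enforced through the structured data schema rather than just listed in a prompt):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Known Wikidata positions: label -> QID (the QIDs here are illustrative placeholders).
positions = {
    "member of the French National Assembly": "Q1234567",
    "mayor of Labastide-Murat": "Q7654321",
}
labels = list(positions)
label_embeddings = model.encode(labels, normalize_embeddings=True)

def reconcile_candidates(extracted_label: str, top_k: int = 500):
    """Return the exact QID if we have one, otherwise the top_k candidate labels to offer to the LLM."""
    if extracted_label in positions:
        return positions[extracted_label]
    query = model.encode([extracted_label], normalize_embeddings=True)[0]
    scores = label_embeddings @ query                 # cosine similarity (embeddings are normalized)
    ranked = np.argsort(-scores)[:top_k]
    # Next step: pass these candidates to the structured data API and
    # prompt the model for the matching Wikidata position, or None.
    return [labels[i] for i in ranked]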

Other news

While Claude was fleshing out this enrichment process, I had Claude start work on the NextJS GUI in a second working directory. Now we also have a basic-gui branch containing the start of the confirmation GUI.

It doesn’t do much yet; however, it does MediaWiki auth (with help from Bryan Davis, thanks!) :grinning_face:

Thank you for reading this far, I’ll try to keep the next one shorter :frog:.

Again, if anything I say sounds weird or I’m going in a direction that you think is not fruitful, please let me know!

That was it! Have a great day!


Thanks, I was aware of the context but wasn’t sure why the mapping was required!

A couple tips -

For extracting from unstructured / semi-structured sources such as Wikipedia, GLiNER and BAML are your friends:

There are better examples, but here’s some related code:

And in general, the team at Explosion.ai in Berlin is amazing and brilliant: https://spacy.io/

Also, there are ways to align semantics across the Wikidata and DBPedia data sources. I would recommend building a graph from Wikidata for the parts which are most interesting, then using this as a “backbone” to build on. The data in Wikidata is pretty good, although arguably quite imbalanced and sparse, so the flexibility of a graph can help. Otherwise you’ll end up with really complicated SQL which is expensive to maintain.

For bringing in news articles / random web sources – I’ll put in a plug for Newsplunker from AskNews https://newsplunker.com/ which develops graph elements (and source links) from news articles in a way that’s friendly to publishers and reporters.

There are also commercial data sources which might help, if this is an option? At least to get the politician entities populated.

I’m happy to join a discussion about this kind of work. Hope it goes well.

Devlog #4

Welcome back to another update! This one’s exciting - we’ve made some major architectural shifts that have completely transformed how the project works. :rocket:

The Big Pivot: Dump Processing Over APIs

Remember how we started with Wikidata’s SPARQL API? Well, we’ve completely moved away from that approach. Turns out, when you’re trying to process millions of political entities, API rate limits and timeouts become your worst enemy.

We’re now processing the complete Wikidata dump directly - that’s a ~100GB compressed file that expands to about 1.7TB of JSON. Yes, you read that right. Over a terabyte of juicy data! :exploding_head:

But here’s the cool part - we’ve implemented a three-pass processing strategy that actually makes this manageable:

  1. Pass 1 - Build Hierarchy Trees: Extract all the “subclass of” relationships to figure out what counts as a political position vs. a geographic location
  2. Pass 2 - Import Supporting Entities: Get all the positions, locations, and countries into our database first
  3. Pass 3 - Import Politicians: Finally import the politicians and link them to everything we’ve already stored

This multi-pass process prevents deadlock issues that would occur when multiple workers try to insert conflicting entities, saving us some complex back-off logic.
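For readers curious what processing the dump actually looks like: the JSON dump is one big array with one entity per line, so a single pass boils down to something like the sketch below (simplified and single-process, assuming the bz2 dump; the real pipeline splits the file into chunks and fans them out over workers):

import bz2
import json

def iter_entities(dump_path: str):
    """Yield one Wikidata entity dict per line of the compressed JSON dump."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def build_subclass_edges(dump_path: str):
    """Pass 1: collect 'subclass of' (P279) edges, used later to decide what counts as a political position."""
    edges = {}
    for entity in iter_entities(dump_path):
        parents = []
        for claim in entity.get("claims", {}).get("P279", []):
            snak = claim.get("mainsnak", {})
            if snak.get("snaktype") == "value":
                parents.append(snak["datavalue"]["value"]["id"])
        if parents:
            edges[entity["id"]] = parents
    # Passes 2 and 3 walk the same file again, importing positions/locations and then politicians.
    return edges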

Vector Search Actually Works!

Remember our struggle with similarity search from devlog #3? Good news - it works beautifully now!

Our two-stage extraction strategy is humming along:

  1. Stage 1: LLM extracts arbitrary position names from Wikipedia (e.g., “Mayor of Labastide-Murat”)
  2. Stage 2: We embed that text, find the 100 most similar Wikidata positions, then ask the LLM to pick the right one

This is what the results looked like (edited for brevity):

Enriching politician with Wikidata ID: Q30351657
Mapped 7 out of 19 extracted positions for Aurélien Pradié
LLM extracted data for Aurélien Pradié:
  Properties (2):
    PropertyType.BIRTH_DATE: 1986-03-14
    PropertyType.BIRTH_PLACE: Cahors, France
  Positions (7):
    member of the French National Assembly (2017 - present)
      Proof: Aurélien Pradié (French pronunciation: [oʁeljɛ̃ pʁadje]; born 14 March 1986) is a French politician who has represented the 1st constituency of the Lot department in the National Assembly since 2017.
    Regional councillor of Occitanie (2021 - present)
      Proof: He has also held a seat in the Regional Council of Occitania since 2021, previously in office from 2016 to 2018.
    Regional councillor of Occitanie (2016 - 2018)
      Proof: He has also held a seat in the Regional Council of Occitania since 2021, previously in office from 2016 to 2018.
    conseiller communautaire du Causse de Labastide-Murat (2008 - present)
      Proof: In the 2008 cantonal elections, Pradié was elected in the first round for the canton of Labastide-Murat, becoming the second-youngest councilor in France behind Jean Sarkozy and beating Lucien-Georges Foissac, his former teacher.
    Mayor of Labastide-Murat (2014 - 2016)
      Proof: In the 2014 municipal elections, Pradié was elected mayor of Labastide-Murat, when his party list received over 70% of the vote.
    Mayor of Cœur-de-Causse (2016 - 2018)
      Proof: Following the merger of communities, he became mayor of Cœur-de-Causse (the new merged commune) in 2016.
    Regional councillor of Occitanie (2015 - 2018)
      Proof: That same year, he was elected a regional councilor in Occitanie, where he sat on the committee for labor, professional development and apprenticeships.
Successfully enriched politician Aurélien Pradié
✅ Successfully enriched politician data from Wikipedia sources

The results are surprisingly good. The model can now correctly map extracted positions like “Deputy of the French Second Republic” to the right Wikidata entity, even when the text doesn’t match exactly.

We’re using SentenceTransformers with the ‘all-MiniLM-L6-v2’ model for embeddings, stored in PostgreSQL with the pgvector extension. It’s fast, it’s accurate, and it scales! :chart_increasing:
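For the curious, the similarity lookup against pgvector boils down to something like this (table name, column names and connection string are placeholders for whatever the actual schema uses):

from sentence_transformers import SentenceTransformer
from sqlalchemy import create_engine, text

model = SentenceTransformer("all-MiniLM-L6-v2")
engine = create_engine("postgresql+psycopg2://poliloom@localhost/poliloom")  # placeholder DSN

def most_similar_positions(extracted_label: str, limit: int = 100):
    embedding = model.encode(extracted_label, normalize_embeddings=True)
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"
    with engine.connect() as conn:
        rows = conn.execute(
            text(
                # <=> is pgvector's cosine distance operator; smaller means more similar.
                "SELECT wikidata_id, label FROM positions "
                "ORDER BY embedding <=> CAST(:query AS vector) LIMIT :limit"
            ),
            {"query": vector_literal, "limit": limit},
        )
        return [(r.wikidata_id, r.label) for r in rows]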

Performance Breakthroughs

The dump processing is properly parallelized now. We’re talking:

  • Chunk-based parallel processing that scales linearly with CPU cores
  • Near-linear speedup up to 32+ cores (tested!)
  • Memory-efficient line-by-line processing for the 1.7TB file
  • Batch database operations for optimal performance

Running poliloom dump build-hierarchy on a decent machine processes the entire dump hierarchy in parallel on all cores in reasonable time. On my machine it takes roughly 19 minutes for one pass :fire:

What We Learned About Wikidata

Processing the full dump taught us some interesting things:

  • Statement ranks matter: Wikidata has preferred, normal, and deprecated statements. We need to handle these properly.
  • Incomplete dates are everywhere: Things like “1962” or “JUN 1982” are common and need special handling (see the small sketch after this list)
  • The hierarchy is deep: Political positions have complex subclass relationships that go many levels deep
  • Countries have SO many positions: France alone has 78K+ political positions in Wikidata!
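On the incomplete dates specifically: Wikidata time values carry an explicit precision (9 = year, 10 = month, 11 = day), so a small helper can normalise them into truncated ISO strings. A sketch, assuming we keep partial dates as plain strings (BCE dates ignored for brevity):

def parse_wikidata_time(datavalue: dict) -> str:
    """Turn a Wikidata time datavalue into a possibly-truncated ISO date string."""
    # Example input: {"time": "+1962-00-00T00:00:00Z", "precision": 9, ...}
    time = datavalue["time"].lstrip("+")          # "1962-00-00T00:00:00Z"
    date_part = time.split("T")[0]                # "1962-00-00"
    year, month, day = date_part.split("-")
    precision = datavalue.get("precision", 11)
    if precision <= 9:                            # year precision (or coarser)
        return year
    if precision == 10:                           # month precision
        return f"{year}-{month}"
    return f"{year}-{month}-{day}"                # day precision

print(parse_wikidata_time({"time": "+1962-00-00T00:00:00Z", "precision": 9}))   # -> "1962"
print(parse_wikidata_time({"time": "+1982-06-00T00:00:00Z", "precision": 10}))  # -> "1982-06"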

Testing That Actually Works

We’ve got a proper testing framework now using pytest with Docker Compose for the test database. The tests focus on the core data pipeline - no more guessing if our imports work correctly.

What we test: All the business logic, database models, dump processing, LLM extraction (mocked), and API endpoints.

What we don’t test: CLI interfaces (they’re thin wrappers), database migrations, and performance (that’s beyond our scope for now).

Next Steps

The two-stage extraction is working for positions and birthplaces. Now we need to:

  1. Create an evaluation mechanism so we can improve our prompts and similarity search.
  2. Archive scraped pages so users can easily review our extractions
  3. Finish and test our confirmation GUI so that users can actually review our extractions.

The foundation is solid now. We can process the entire Wikidata dump, extract meaningful information with LLMs, and have a robust similarity search system. That’s a pretty good place to be! :tada:

Shoutouts

Big thanks to everyone who’s been following along and providing feedback. The architectural decisions we’ve made - especially moving to dump processing - have made this project actually viable at scale.

Question for the community: How do we improve the similarity search matching of “free form”, just-extracted strings with our Wikidata entities? Are there properties that we can add to the embedding (for example, its class) that would improve our results? What do we add to the prompts?

That’s all for now! Next update will hopefully show how we improved our reconciliation and how that improved our enrichment pipeline.

Stay tuned! :waving_hand:

P.S. If I missed something, if you have any questions, or if you think we should change our strategy on any of the discussed techniques, let us know!


Damn, that’s a hell of an update! Let’s talk about getting you a VM to run this on ASAP? Or what kind of setup would be a good place to run it?
