PoliLoom – Loom for weaving politician's data


Hey there OpenSanctions community! I’m Johan. Friedrich asked me to have a go at some data extraction, and share the development process here, so let’s dive into it together.

We’re about to find out how well LLMs can extract metadata on politicians from Wikipedia and other sources, with the aim of enriching Wikidata. This is a research project; the goal is a usable proof of concept.

Below is something that resembles a plan. I did my best to structure my thoughts on how I think we should approach this, based on a day or so of research. I’m also left with some questions, which I’ve sprinkled throughout this document.

If you have some insight or can tell me that I’m on the wrong track, I would be grateful!

Let’s get into it.

Database

We want users to decide whether our changes are valid, so we’ll have to store them first. And to know what has changed, we need to know which properties are currently in Wikidata.

We’ll be using SQLAlchemy and Alembic for the ORM and migrations. In development we’ll use SQLite; in production, PostgreSQL.
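
In practice the only difference between the two should be the connection URL, along these lines (the environment variable name is a placeholder):

import os

from sqlalchemy import create_engine

# SQLite locally, PostgreSQL in production; Alembic can be pointed at the same URL.
DATABASE_URL = os.environ.get("POLILOOM_DATABASE_URL", "sqlite:///poliloom.db")
engine = create_engine(DATABASE_URL)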

Schema

We’d be reproducing a small part of the Wikidata politician data model. To keep things as simple as possible, we’d have Politician, Position, Property, Source, and a many-to-many HoldsPosition relationship entity.

  • Politician holds politicians, with their name and country.
  • Source is a web source that contains information on a politician.
  • Property holds things like date of birth and birthplace for politicians.
  • Position holds all the positions from Wikidata, with their country.
  • HoldsPosition links a politician to a position, together with start & end dates.

Source would have a many-to-many relation to Politician, Property, and Position.

Property and HoldsPosition would also record whether they are newly extracted, and if so, which user (if any) has confirmed their correctness and updated Wikidata. Property would have a type, which for now would be either BirthDate or BirthPlace.
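
To make that concrete, here’s a rough SQLAlchemy sketch of the schema I have in mind (table and column names are placeholders, not final; only one of Source’s link tables is shown):

from sqlalchemy import Boolean, Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# One of the many-to-many link tables for Source (the others would look the same).
politician_source = Table(
    "politician_source",
    Base.metadata,
    Column("politician_id", ForeignKey("politician.id"), primary_key=True),
    Column("source_id", ForeignKey("source.id"), primary_key=True),
)

class Politician(Base):
    __tablename__ = "politician"
    id = Column(Integer, primary_key=True)
    wikidata_id = Column(String, unique=True)
    name = Column(String, nullable=False)
    country = Column(String)
    properties = relationship("Property", back_populates="politician")
    positions = relationship("HoldsPosition", back_populates="politician")

class Source(Base):
    __tablename__ = "source"
    id = Column(Integer, primary_key=True)
    url = Column(String, nullable=False)
    politicians = relationship("Politician", secondary=politician_source)

class Position(Base):
    __tablename__ = "position"
    id = Column(Integer, primary_key=True)
    wikidata_id = Column(String, unique=True)
    label = Column(String, nullable=False)
    country = Column(String)

class Property(Base):
    __tablename__ = "property"
    id = Column(Integer, primary_key=True)
    politician_id = Column(Integer, ForeignKey("politician.id"), nullable=False)
    type = Column(String, nullable=False)        # "BirthDate" or "BirthPlace"
    value = Column(String, nullable=False)
    is_extracted = Column(Boolean, default=False)  # newly extracted, not yet in Wikidata
    confirmed_by = Column(String)                  # user who confirmed and pushed to Wikidata
    politician = relationship("Politician", back_populates="properties")

class HoldsPosition(Base):
    __tablename__ = "holds_position"
    id = Column(Integer, primary_key=True)
    politician_id = Column(Integer, ForeignKey("politician.id"), nullable=False)
    position_id = Column(Integer, ForeignKey("position.id"), nullable=False)
    start_date = Column(String)  # strings, so incomplete dates like "1962" fit
    end_date = Column(String)
    is_extracted = Column(Boolean, default=False)
    confirmed_by = Column(String)
    politician = relationship("Politician", back_populates="positions")
    position = relationship("Position")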

Populating our database

We can get a list of all politicians in Wikidata by querying the SPARQL endpoint, either by occupation:

SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .                    # must be human
  ?politician wdt:P106/wdt:P279* wd:Q82955 .     # occupation is politician or subclass
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100

Or by position held:

SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .                           # must be human
  ?politician wdt:P39 ?position .                       # holds a position
  ?position wdt:P31/wdt:P279* wd:Q4164871 .            # position is political
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100

We could also fetch both and merge the sets. However, these queries are slow, and have to be paginated.
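
For reference, a minimal sketch of paging through the occupation query with requests and LIMIT/OFFSET (the query service tends to time out on deep offsets, so we may need a different batching strategy in practice):

import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT DISTINCT ?politician ?politicianLabel WHERE {
  ?politician wdt:P31 wd:Q5 .
  ?politician wdt:P106/wdt:P279* wd:Q82955 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT %d OFFSET %d
"""

def fetch_politicians(batch_size=500):
    offset = 0
    while True:
        response = requests.get(
            ENDPOINT,
            params={"query": QUERY % (batch_size, offset), "format": "json"},
            headers={"User-Agent": "PoliLoom/0.1 (research prototype)"},  # placeholder UA
        )
        response.raise_for_status()
        rows = response.json()["results"]["bindings"]
        if not rows:
            break
        for row in rows:
            yield row["politician"]["value"], row["politicianLabel"]["value"]
        offset += batch_size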

Wikipedia pages should have a Wikidata item linked, and Wikidata records of politicians should almost always have a Wikipedia sitelink. We should be able to connect these entities.

Then there are probably politician Wikidata records that don’t have a linked Wikipedia page, and probably Wikipedia pages that don’t have a Wikidata entry. We could treat Wikipedia politician articles that have no linked entity like the random web pages we’ll query.

When we populate our local database, we want Politician entities together with their Positions and Properties. While populating, we’ll try to get the Wikipedia link of the Politician in both English and their local language.
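
Getting those links could look roughly like this, via the wbgetentities action (the language codes are just an example; we’d derive the local language from the politician’s country):

import requests

def get_wikipedia_links(qid, languages=("en", "de")):
    """Return the Wikipedia article URLs linked to a Wikidata item, per language."""
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": qid,
            "props": "sitelinks/urls",
            "format": "json",
        },
        headers={"User-Agent": "PoliLoom/0.1 (research prototype)"},
    )
    response.raise_for_status()
    sitelinks = response.json()["entities"][qid]["sitelinks"]
    return {
        lang: sitelinks[f"{lang}wiki"]["url"]
        for lang in languages
        if f"{lang}wiki" in sitelinks
    }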

Questions

  • Do we filter out deceased people?
  • If we’d be importing from the Wikipedia politician category, do we filter out all deceased people with a rule/keyword based filter?

Extraction of new properties

Once we have our local list of politicians, we can start extracting properties. We basically have two types of sources: Wikipedia articles that are linked to our Wikidata entries, and random web sources that are not.

Then there are two types of information that we want to extract: specific properties like date of birth and birthplace, and political positions. Extracting properties should be relatively simple. For positions, we’d have to know which positions we’d like to extract.

For extraction we’d feed the page to OpenAI and use the structured output API to get back data that fits our model.
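
A rough, untested sketch of what that could look like with the OpenAI Python SDK’s structured output support (the model name and the schema are placeholders, and the exact SDK call may change between versions):

from openai import OpenAI
from pydantic import BaseModel

class ExtractedPolitician(BaseModel):
    birth_date: str | None   # keep as a string so partial dates like "1962" survive
    birth_place: str | None
    positions: list[str]

client = OpenAI()

def extract_from_page(page_text: str) -> ExtractedPolitician:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {
                "role": "system",
                "content": "Extract the politician's birth date, birth place and political positions from the article.",
            },
            {"role": "user", "content": page_text},
        ],
        response_format=ExtractedPolitician,
    )
    return completion.choices[0].message.parsed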

We can query all political positions in Wikidata. At the time of writing that returns 37631 rows:

SELECT DISTINCT ?position ?positionLabel WHERE {
  ?position wdt:P31/wdt:P279* wd:Q294414 . # elected or appointed political position
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

It would be nice if we could tell our LLM which positions we’d like to extract, instead of having it do whatever and then having to match the output to the positions we know. However, 37631 rows is a lot of tokens. Since we know the country of the politician before extracting positions, we can select the subset of positions for that country to pass to our LLM, so we can be sure to extract correct references.
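
One hedged idea for this, building on the structured output sketch above: generate the response schema per country, with the allowed position labels as an enum (untested, and it assumes the per-country subsets stay within the API’s enum limits):

from typing import Literal

from pydantic import BaseModel, create_model

def build_extraction_schema(allowed_positions: list[str]) -> type[BaseModel]:
    """Build a per-country response model whose positions are constrained to
    the labels we already imported from Wikidata (assumes a non-empty list)."""
    PositionLabel = Literal[tuple(allowed_positions)]  # e.g. ("Mayor of Utrecht", ...)
    return create_model(
        "ExtractedPolitician",
        birth_date=(str | None, ...),
        birth_place=(str | None, ...),
        positions=(list[PositionLabel], ...),
    )

Whether this beats simply listing the allowed positions in the prompt, or matching the model’s free-text output against our Position table afterwards, is something to test.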

Questions

  • How do we handle incomplete birth dates (e.g. 1962, or June 1982)?
  • How do we manage names of people in different scripts?

Wikipedia

Scraping Wikipedia would be relatively simple. During our Wikidata import we’ve stored Wikipedia links. We’d fetch those for each Politician and extract properties/positions from the page. We’d try to get the page in English and in the local language of the politician.

Scraping random webpages

Then there are the random pages. For these we’d have to link any information we extract to a Wikipedia article / Wikidata entity ourselves.

I think we should handle two general page types here: index pages and detail pages.

  • For index pages, we can have the LLM output an XPath/CSS selector for all detail page links, and possibly for pagination.
  • For detail pages we try to extract what we can, then try to find the related entity in our database. We can do a similarity search for entities with the same properties and use a score threshold to decide whether we should update a found entity (see the sketch below).
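
To make the matching idea concrete, here’s a naive sketch using name similarity plus a date-of-birth check (the scoring and threshold are placeholders; in practice we’d probably want something smarter than plain string similarity):

from difflib import SequenceMatcher

def match_score(extracted_name, extracted_birth_date, candidate_name, candidate_birth_date):
    """Very naive similarity score between an extracted record and a local candidate."""
    name_score = SequenceMatcher(None, extracted_name.lower(), candidate_name.lower()).ratio()
    date_bonus = 0.2 if extracted_birth_date and extracted_birth_date == candidate_birth_date else 0.0
    return name_score + date_bonus

def find_match(extracted, candidates, threshold=0.85):
    """`extracted` is a dict from the LLM; `candidates` yields (name, birth_date, politician_id) rows."""
    best_score, best_id = 0.0, None
    for name, birth_date, politician_id in candidates:
        score = match_score(extracted["name"], extracted.get("birth_date"), name, birth_date)
        if score > best_score:
            best_score, best_id = score, politician_id
    return best_id if best_score >= threshold else None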

My mind is going wild on building some sort of scraper-building robot that generates XPath selectors with an LLM, then scrapes pages with those selectors, re-running the XPath generation whenever the scraper breaks. But I think that would only make sense if we want to run our scraper often. The simplest thing would be to feed everything into the model.

Questions

  • Do we archive web sources?
  • How often do we want to run our tool?
  • How do we handle multilingual variations in names?
  • How will you handle conflicting information between sources? (e.g., Wikipedia says one birth date, government site says another)

API

We’ll create a FastAPI API that can be queried for entities with properties and positions. There should also be a route for users to mark a politician or position as correct.

Authentication could probably be handled by the MediaWiki OAuth system.

Routes

  • /politicians/unconfirmed - lists politicians with unconfirmed Property or HoldsPosition
  • /politicians/{politician_id}/confirm - Allows the user to confirm the properties/positions of a politician. Should allow the user to “throw out” properties/positions in the POST data (rough sketch below).
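
A very rough FastAPI sketch of those two routes; the request and response shapes are placeholders:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ConfirmRequest(BaseModel):
    # IDs of extracted properties/positions the user wants to throw out instead of confirm
    discarded_property_ids: list[int] = []
    discarded_position_ids: list[int] = []

@app.get("/politicians/unconfirmed")
def list_unconfirmed():
    # Placeholder: query our database for politicians with unconfirmed
    # Property or HoldsPosition rows.
    return []

@app.post("/politicians/{politician_id}/confirm")
def confirm_politician(politician_id: int, body: ConfirmRequest):
    # Placeholder: mark the remaining extracted data as confirmed and
    # push it to Wikidata on the user's behalf.
    return {"politician_id": politician_id, "discarded_properties": body.discarded_property_ids}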

CLI commands

The API project would also house the CLI functionality for importing and enriching data; a rough sketch of the commands follows the list below.

  • Import a single ID from Wikidata
  • Enrich a single ID with data from Wikipedia
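
A minimal sketch of what those commands could look like with Click (command names and the bodies are placeholders):

import click

@click.group()
def cli():
    """PoliLoom import and enrichment commands."""

@cli.command("import")
@click.argument("wikidata_id")
def import_politician(wikidata_id):
    """Import a single politician from Wikidata by its Q-id."""
    click.echo(f"Importing {wikidata_id} from Wikidata...")
    # placeholder: fetch the entity and store Politician/Position/Property rows

@cli.command("enrich")
@click.argument("wikidata_id")
def enrich_politician(wikidata_id):
    """Enrich a single politician with data extracted from Wikipedia."""
    click.echo(f"Enriching {wikidata_id} from Wikipedia...")
    # placeholder: fetch the linked Wikipedia pages and run the LLM extraction

if __name__ == "__main__":
    cli()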

Confirmation GUI

We’d like users to be able to confirm our generated data. To do this we’ll build a Next.js GUI where users will be provided with statements and their accompanying source data / URL.

When users confirm the correctness of our generated properties, we’d like to update Wikidata on their behalf through the Wikidata API.
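
For reference, a rough (untested) sketch of what pushing a confirmed birth date to Wikidata could look like via the wbcreateclaim action, assuming we already have a requests session authorized through the user’s OAuth grant plus a CSRF token:

import json

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def add_birth_date_claim(session, csrf_token, qid, iso_date):
    """Add a date-of-birth (P569) claim to the item `qid`.
    `session` is a requests session already authorized via the user's OAuth grant."""
    value = {
        "time": f"+{iso_date}T00:00:00Z",  # e.g. "1962-03-01"
        "timezone": 0,
        "before": 0,
        "after": 0,
        "precision": 11,  # 11 = day precision; 9 would mean "year only"
        "calendarmodel": "http://www.wikidata.org/entity/Q1985727",
    }
    response = session.post(WIKIDATA_API, data={
        "action": "wbcreateclaim",
        "entity": qid,
        "property": "P569",
        "snaktype": "value",
        "value": json.dumps(value),
        "token": csrf_token,
        "format": "json",
    })
    response.raise_for_status()
    return response.json()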

Questions

  • What choices do we want to have our users make? Do we want to present single properties with a source link where we found the information? Or do we want to show a complete entity with all new information and (possibly multiple) source links?
  • What source data do we want to show? Do we want to show links, or archived pages?

Project planning

Wikipedia articles are linked to Wikidata entities, saving us from having to find those links ourselves. That’s why I propose starting with extraction from these.

  1. Start with populating a local database with politicians; this includes scraping Wikipedia and the related Wikidata entries
  2. Extraction of properties from Wikipedia articles
  3. Confirmation GUI
  4. Random web sources extraction

I think getting to point 3 and having a system that works is the most important thing. Point 4 is probably not easy and will require more budget than we have. But regardless, we’ll see how far we can get with that.

Interesting

You can find the development repository here: GitHub - opensanctions/poliloom: Loom for weaving politician's data

Please let me know if anything does not make sense, or if there’s another approach that makes more sense. Also, I’d love to hear your opinion on any of the questions. I’m curious what you think.


Thanks for sharing this, @Monneyboi - fascinating stuff! Are there any notable hurdles or successes you’ve encountered so far at this stage?

Re: parsing Wikipedia articles in multiple languages, are there any factors that you expect will guide deciding between using just English, the local language, or a combination of both for extraction? I’d imagine that it may vary not just from language to language, but region to region and/or entity to entity? Is it “completeness” or “accuracy” that seems like the bigger thing to solve there?

And where can we follow along as the project evolves? :slight_smile:

Thank you so much for the brilliant write-up, @Monneyboi! To me, the goal of this research spike is this weird socio-technical question: if we make a verification machine (the “loom”) on top of this data aggregator, is that machine something that is fun and encouraging to use, or is it a torture chair?

Regarding all the tech choices, I couldn’t agree more. Perhaps a pet peeve of @jbothma’s to mention: we’ve been discussing what HoldsPosition should look like for a while now, and the leading contender at the moment is Tenure … wdyt?

Is it “completeness”, or “accuracy” that seems like the bigger thing to solve there?

Mostly completeness. I noticed that for some languages, the local-language page contained more information than the English page.

Are there any notable hurdles or successes you’ve encountered so far at this stage?

Well, I’m quite happy that it looks possible (I haven’t tried yet) to use the Wikipedia login system for this, which gives users a simple way to use the tool while allowing us to update Wikidata on their behalf.

And where can we follow along as the project evolves? :slight_smile:

I’ve updated this post with a link to the repository. Also, I’ll keep this post up-to-date with new info and updates.

if we make a verification machine (the “loom”) on top of this data aggregator, is that machine something that is fun and encouraging to use, or is it a torture chair?

That depends on what our users would enjoy more. I’m guessing fun and encouragement. We could have a leaderboard, which should be relatively simple to add. Many gamification concepts come to mind, but I don’t know if we should invest too much time in that.

we’ve been discussing what HoldsPosition should look like for a while now, and the leading contender at the moment is Tenure … wdyt?

I think HoldsPosition is about as clear as you could be. When I read that I know what’s up. When I read Tenure I’d have to check the docs. Yes, it’s two words and it looks kinda derpy, but it is very self-documenting.

Tenure sounds like a more detailed Position to me, like a Position with a timeframe attached.

I was chatting with @pudo about sources recently, and this is exactly the kind of deep dive I was hoping for when I originally asked about regional source coverage and enrichment! My inner nerd is so chuffed to get more details about how you guys approach this. Huge thanks again; it’s this kind of transparency and collaborative thinking that makes an open data ecosystem so powerful. I’ll be following along (and cheering :confetti_ball: from the sidelines) as PoliLoom continues to take shape!

@SpottyMoose Thanks for the encouraging words!

Devlog #1

Yesterday Claude and I set up most of the importing logic; there were a couple of lessons:

  • Having all countries used by Wikidata locally saves us a lot of queries.
  • Mapping country to Wikipedia languages is not trivial.
  • Maybe we can eliminate the need for countries if the English Wikipedia has all we need. We might also be able to pass all positions to the structured data API, as these would not be tokens sent into context, but rather constraints on the output token vocabulary.

Learning is great :tada:

Also, somewhere else, we discussed: “Do you think we should have a user confirm every property / position separately, or should he/she just approve a politician in one click, with all new data?”

To which this great answer came:

ha, this is fun. Maybe like with a “throw out” option? like all the proposed changes are shown, and then you can kick stuff off (or later: correct it)

Which I think is a good way to handle confirmation.

Also, we got our Wikipedia OAuth app approved :star:

Today I plan on having Claude fix up the last of the importing logic and figure out which positions to load, then diving into the enrichment part of the API/CLI project and exploring the simplest way to get nice structured data.


Woo amazing stuff, and fun updates! Please keep these going :blush:
