Schema changes to `Analyzable`

Hello, over at OpenAleph we are currently working on surfacing more structured data from documents to better connect text analysis with FollowTheMoney data. I’d like to open a discussion here about some schema tweaks, mainly related to the subclasses of Analyzable which may or may not concern OpenSanctions in the same way as us :slight_smile:

We are currently experimenting with these schema extensions in our analysis pipeline to see how they’d work in practice.

Analyzable

Add datesMentioned and topicsMentioned

Similar to the existing properties (peopleMentioned, …) that store mentioned data in a structured way, we would love to store dates and topics in the same way. An analysis pipeline could extract dates from the text of a document and store them in this property. These are not the dates of the document itself (created at, last modified etc.) but dates surfaced from the body text.
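To make this concrete, here is a rough sketch of what such an analyzer step could look like. It only recognizes ISO-style dates and uses a plain dict in place of a real entity; `datesMentioned` is the proposed property and does not exist in the schema yet, and `extract_dates_mentioned` is a made-up helper name.

```python
import re
from datetime import date

# Matches ISO-style dates such as 2021-03-15; other formats are ignored here.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates_mentioned(text: str) -> list[str]:
    """Return sorted, de-duplicated ISO dates found in the body text."""
    found = set()
    for year, month, day in ISO_DATE.findall(text):
        try:
            found.add(date(int(year), int(month), int(day)).isoformat())
        except ValueError:
            # Skip strings that look like dates but are not valid ones.
            continue
    return sorted(found)


body = "The contract was signed on 2019-05-02 and amended on 2021-11-30."

# Plain dict standing in for the document's properties (assumption:
# `datesMentioned` is the proposed, not yet existing, property).
properties = {"datesMentioned": extract_dates_mentioned(body)}
```

A real pipeline would of course need a proper date parser for free-text formats; the point is only that the values come from the body text, not from file metadata.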

Similar for topicsMentioned, and this may be my own data ocd, but I’d like a distinction between the global topics property and topics that are mentioned. A document that mentions a person's name that is on a PEP list is not necessarily talking about PEPs (thus, the topics property should not be set to e.g. role.pep on this document), but it mentions a name that is a possible match on a PEP list (or sanctions or whatever), so an analyzer could assign topicsMentioned: role.pep to it.
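A small sketch of the distinction, again with plain dicts rather than real entities. The `WATCHLIST` lookup table and the `topics_mentioned` helper are made up for illustration, and `topicsMentioned` is the proposed property.

```python
# Hypothetical lookup table: lower-cased name -> topics from a reference list.
WATCHLIST = {
    "jane doe": ["role.pep"],
    "acme corp": ["sanction"],
}


def topics_mentioned(names_mentioned: list[str]) -> list[str]:
    """Collect topics of reference-list entries that match mentioned names."""
    topics = set()
    for name in names_mentioned:
        topics.update(WATCHLIST.get(name.strip().lower(), []))
    return sorted(topics)


document = {
    "schema": "Document",
    "properties": {
        "peopleMentioned": ["Jane Doe"],
        # The document itself is not about PEPs, so `topics` stays empty.
        "topics": [],
    },
}

# Only the proposed `topicsMentioned` property records the possible match.
document["properties"]["topicsMentioned"] = topics_mentioned(
    document["properties"]["peopleMentioned"]
)
```

This way a search for documents that are themselves tagged role.pep stays clean, while a search for documents that mention possible PEPs is still possible.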

Mention

Add contextText and contextNames

Similar to the existing context<...> properties, it would help for further disambiguation analysis of Mentions to store 1) a text fragment around that mentioned name from the original document and 2) other mentioned names close to the name for this Mention.
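Roughly what I have in mind, as a sketch: take a character window around the first occurrence of the name and keep the other extracted names that fall inside it. `mention_context` and the `window` parameter are made up here, and `contextText` / `contextNames` are the proposed properties.

```python
def mention_context(text, name, other_names, window=40):
    """Build context values for the first occurrence of `name` in `text`.

    Returns None if the name does not occur in the text.
    """
    start = text.find(name)
    if start == -1:
        return None
    end = start + len(name)
    # Clamp the window to the bounds of the document text.
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    fragment = text[lo:hi]
    # Keep only other names that appear fully inside the fragment.
    nearby = [n for n in other_names if n != name and n in fragment]
    return {
        "name": [name],
        "contextText": [fragment],
        "contextNames": nearby,
    }


text = (
    "Jane Doe met John Smith in Berlin. Much later, far away in another "
    "chapter entirely, Acme Corporation filed papers."
)
names = ["Jane Doe", "John Smith", "Acme Corporation"]
ctx = mention_context(text, "Jane Doe", names, window=30)
```

Here "John Smith" ends up in contextNames for the "Jane Doe" mention, while "Acme Corporation" is too far away. A token or sentence based window would probably be nicer than characters, this is just the simplest version.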

normalizedName !?!?

Another thing, and here I definitely need help from smarter people: in the text analysis of OpenAleph we always run into the problem of the difference between the original surface form of a name mention in the full text (e.g. “Acme Corporation”) and a normalized representation of this name (e.g. “acme corp”, which could be compiled with the great rigour library :)). For search analysis, we like to use the normalized form (Elasticsearch terms aggregation etc.), but of course we need to keep track of the original (“surface”) form of the mention.

This could be achieved via Mention.name holding the actual surface name (as it does now), and Mention.normalizedName holding the transformed value. This is not completely thought through yet on our end, but I wanted to drop it here and hear your thoughts on it.
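For illustration, a sketch of how the two values could sit side by side. The normalization below is a standard-library stand-in, not rigour's actual API, and the `ABBREVIATIONS` table, `normalize_name` and `make_mention` are made up; `normalizedName` is the proposed property.

```python
import re
import unicodedata

# Hypothetical abbreviation table; a real pipeline would use a proper
# name normalization library instead of this stand-in.
ABBREVIATIONS = {"corporation": "corp", "limited": "ltd", "incorporated": "inc"}


def normalize_name(surface: str) -> str:
    """Lower-case, strip accents and punctuation, abbreviate legal forms."""
    text = unicodedata.normalize("NFKD", surface)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Note: this keeps only ASCII letters and digits, so names in
    # non-Latin scripts would need different handling.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def make_mention(surface: str) -> dict:
    """Plain dict standing in for a Mention entity."""
    return {
        "schema": "Mention",
        "properties": {
            "name": [surface],  # original surface form from the text
            "normalizedName": [normalize_name(surface)],  # proposed property
        },
    }


mention = make_mention("Acme Corporation")
```

With this, "Acme Corporation" and "ACME Corporation," would both aggregate under "acme corp" in a terms aggregation, while name still holds what was actually written in the document.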

Keen to discuss this! And if someone is facing similar challenges, I am more than happy to dig into it further and share experiments! :slight_smile: