What should we do if we find false negatives while using the site? By that I mean one entity listed two or more times - so a match could have been made but was not made.
The answer may be "nothing", because fixing individual entries manually is time-consuming and takes time away from automation. I just thought I should ask.
Hi! It’s a very fair question, and maybe I can respond a bit more broadly:
First off, we wouldn’t necessarily refer to these as “false negatives”, but rather as “unmerged duplicates”. I want to clarify this because it provides context for the many contributions you’re making to rigour. There are two similar but distinct processes happening here. The “false positive/negative” thing mainly applies to people using our matching tool to see if a given name they are interested in is mentioned on any watchlists. So it’s a match between some random data points we have basically no control over and the database we do control. That’s mainly what I’m working on with rigour right now: cleaning the uncontrolled inputs enough that we can then correlate them with our watchlists.
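To make the "cleaning the uncontrolled inputs" idea concrete, here is a minimal sketch using only the Python standard library. It is not rigour's API: the function names `normalize_name` and `is_watchlist_candidate` are hypothetical and made up for illustration, and real name cleaning involves far more than this.

```python
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Reduce a free-text name to a simple comparable form (illustrative only)."""
    # Decompose accented characters so the accents become separate marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    # Drop the combining marks (accents), keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-insensitive comparison.
    lowered = stripped.casefold()
    # Turn punctuation and other non-word characters into spaces.
    spaced = re.sub(r"[^\w]+", " ", lowered)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(spaced.split())


def is_watchlist_candidate(query: str, watchlist_names: list[str]) -> bool:
    """Hypothetical check: does the cleaned query equal any cleaned watchlist name?"""
    cleaned = {normalize_name(name) for name in watchlist_names}
    return normalize_name(query) in cleaned
```

A false negative in this sense would be a query that refers to a listed entity but still fails the check after cleaning, which is a different problem from two database entries that should have been merged.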
The other process is our internal data cleansing process, which feeds into the database. This also relies on some of the tooling in rigour, but we also do a lot of bespoke work per source (check e.g. here).
Internal deduplication between the lists (and sometimes inside one list) is something we do partly automatically (maybe 5-10%) and partly through human review. Given that it’s a moving target, we do prioritize the sections that receive human attention a bit: key NATO sanctions lists get more attention than e.g. the GEM or UANI lists, which have no regulatory significance.
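As a rough illustration of that split between automatic merging and human review, a triage step could be sketched like this. Everything here is an assumption for the sake of the example: the thresholds, the `triage_pair` name, and the use of `difflib` string similarity are placeholders, not how our pipeline actually scores candidates.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
AUTO_MERGE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.80


def triage_pair(name_a: str, name_b: str) -> str:
    """Route a candidate duplicate pair to auto-merge, human review, or no match."""
    # Similarity ratio in [0, 1] between the two lowercased names.
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    if score >= AUTO_MERGE_THRESHOLD:
        return "auto_merge"
    if score >= REVIEW_THRESHOLD:
        return "human_review"
    return "no_match"
```

In practice a name score alone would not be enough to decide a merge; the point of the sketch is only that a small, high-confidence slice can be handled automatically while the ambiguous middle goes to a reviewer.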
I’m a little too much of a snob to want to open up merge decisions to the public (besides that being technically hard). The legal and data source context involved in making those decisions makes this a bad idea IMO.
Of course, we’d love to know if anyone notices any unmerged duplicates in the database. So I’ve just created a public spreadsheet that will send us a notification and let us track which suggestions we’ve already reviewed and accepted or rejected.