I hope this fits the category; if not, where would be a better place to post it?
Our team at Senzing uses OpenSanctions data in several ways across customer use cases. Recently we saw a need for synthetic data, and I wanted to introduce the approach here and ask whether it might help others with their use cases.
Consider a data model with three categories: risk data (sanctions lists), link data (UBO relations), and event data. We have excellent sources for the first two categories, but event data is often highly confidential. There are some open examples, such as the OCCRP data for the “Azerbaijani Laundromat” case.
We thought it might be a good idea to simulate patterns of fraud and generate synthetic datasets to cover the need for event data. In other words, we can combine techniques from graph analytics and from queueing theory to build parameterized models of the money transfers in the OCCRP data, then generate more simulated transfers from those models.
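To make the queueing-theory part a bit more concrete, here is a minimal sketch of one way to generate transfer timestamps: treat arrivals as a Poisson process, so the gaps between transfers are exponentially distributed. The function name `simulate_transfer_times` and the `rate_per_day` parameter are made up for illustration and are not taken from the kleptosyn repo; in practice the rate would be fitted from something like the OCCRP data.

```python
import random
from datetime import datetime, timedelta

def simulate_transfer_times(start, rate_per_day, n, rng):
    """Return n timestamps whose gaps are exponential (a Poisson arrival process)."""
    times = []
    t = start
    for _ in range(n):
        # expovariate takes the rate (lambda), so the mean gap is 1 / rate_per_day days.
        gap_days = rng.expovariate(rate_per_day)
        t = t + timedelta(days=gap_days)
        times.append(t)
    return times

rng = random.Random(42)  # fixed seed so a run is repeatable
times = simulate_transfer_times(datetime(2024, 1, 1), rate_per_day=3.0, n=5, rng=rng)
for ts in times:
    print(ts.isoformat())
```

A real model would likely need something richer than a constant rate (bursts, business hours, and so on), but the same shape applies: fit parameters, then sample.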
The general architecture, so far, has been:
- Use a slice of the OpenSanctions plus Open Ownership datasets, along with entity resolution used to merge across them. This produces a kind of “structural graph” of UBO, for people and companies and the links among them.
- Analyze patterns of fraud (a.k.a. tradecraft) from data sources such as OCCRP to build parameterized simulations.
- Sample subgraphs from the “structural graph”, then simulate money laundering transactions as a “temporal graph” atop it.
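As a rough illustration of the last step, the sketch below samples a small path from a toy structural graph and then lays a chain of transfers on top of it. Everything here is an assumption for the sake of the example: the entity names, the `sample_path` and `simulate_layering` helpers, and the simple "each hop skims a fee" layering pattern. The actual project may do this quite differently.

```python
import random

# Hypothetical toy "structural graph": each entity maps to its linked entities.
STRUCTURAL_GRAPH = {
    "person_a": ["shell_1", "shell_2"],
    "shell_1": ["person_a", "shell_3"],
    "shell_2": ["person_a", "shell_3"],
    "shell_3": ["shell_1", "shell_2", "bank_x"],
    "bank_x": ["shell_3"],
}

def sample_path(graph, start, max_hops, rng):
    """Sample a subgraph as a random walk from start that never revisits a node."""
    path = [start]
    for _ in range(max_hops):
        candidates = [n for n in graph[path[-1]] if n not in path]
        if not candidates:
            break
        path.append(rng.choice(candidates))
    return path

def simulate_layering(path, amount, fee_rate):
    """Pass funds hop by hop along the path; the step index gives the time order."""
    transfers = []
    for step, (src, dst) in enumerate(zip(path, path[1:])):
        transfers.append({"step": step, "src": src, "dst": dst, "amount": round(amount, 2)})
        amount *= 1.0 - fee_rate  # each intermediary keeps a small cut
    return transfers

rng = random.Random(7)
path = sample_path(STRUCTURAL_GRAPH, "person_a", max_hops=3, rng=rng)
transfers = simulate_layering(path, amount=1000.0, fee_rate=0.02)
for tx in transfers:
    print(tx)
```

The list of transfers is effectively the edge list of a small temporal graph over the sampled entities.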
Also, given that fraud-related transactions are a small portion of overall annual B2B international money transfers, we generate a much larger number of “legit” transactions than fraudulent ones, sampling other entities from the “structural graph” based on OpenSanctions to serve as “legit” companies in the simulation.
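Here is a small sketch of how that mixing could work, assuming a fixed number of "legit" transfers per fraudulent one. The `mix_transactions` helper, the `legit_per_fraud` parameter, and the log-normal amounts are all placeholders chosen for illustration, not values or names from the project.

```python
import random

def mix_transactions(fraud, legit_entities, legit_per_fraud, rng):
    """Label the fraud transfers, add generated "legit" ones, and shuffle them together."""
    mixed = [dict(tx, label="fraud") for tx in fraud]
    for _ in range(len(fraud) * legit_per_fraud):
        # Pick two distinct entities to act as sender and receiver.
        src, dst = rng.sample(legit_entities, 2)
        amount = round(rng.lognormvariate(8, 1), 2)  # placeholder amount distribution
        mixed.append({"src": src, "dst": dst, "amount": amount, "label": "legit"})
    rng.shuffle(mixed)
    return mixed

fraud = [{"src": "shell_1", "dst": "shell_3", "amount": 1000.0}]
legit_entities = ["co_a", "co_b", "co_c", "co_d"]
mixed = mix_transactions(fraud, legit_entities, legit_per_fraud=50, rng=random.Random(1))
print(len(mixed), sum(tx["label"] == "fraud" for tx in mixed))
```

Keeping the labels in the output makes the dataset usable as ground truth when evaluating detection approaches.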
If we structure this open source project carefully, the components could be sub-classed to customize the simulation for other patterns. For example, I have friends at the CISO level for large banks, who would like to use their confidential patterns in a simulation, and would be interested to participate.
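To sketch what "sub-classed to customize" might look like, here is one possible shape: an abstract base class that fixes the interface, plus one concrete pattern. The class names `FraudPattern` and `FanOutPattern` and the `generate` method are hypothetical and only meant to show the idea; they are not the repo's current API.

```python
from abc import ABC, abstractmethod

class FraudPattern(ABC):
    """Hypothetical base class: a pattern turns entities and an amount into transfers."""

    @abstractmethod
    def generate(self, entities, amount):
        """Return a list of transfer dicts with src, dst, and amount keys."""

class FanOutPattern(FraudPattern):
    """Split funds from the first entity evenly across the remaining ones."""

    def generate(self, entities, amount):
        src, *recipients = entities
        if not recipients:
            return []  # nothing to send to
        share = round(amount / len(recipients), 2)
        return [{"src": src, "dst": dst, "amount": share} for dst in recipients]

txs = FanOutPattern().generate(["shell_1", "co_a", "co_b"], 900.0)
for tx in txs:
    print(tx)
```

A bank could then keep its own subclass private, with confidential parameters, while still running it through the shared open source simulation.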
We had a discussion earlier today at GraphGeeks.org where I learned A LOT about graph analytics which could be leveraged. Admittedly, it’s not my main field. There’s other pre-existing work we can probably use. Also, I’ve been posting some insights along the way on Bluesky to see who might be interested.
We have a GitHub repo started at https://github.com/DerwenAI/kleptosyn/tree/main/kleptosyn and the project has the rather clumsy name of kleptosyn (so far).
All criticisms and suggestions are very welcome. We recognize that synthetic data has limitations and perils, and we are trying to be careful. Perhaps by combining large parts of open datasets with small parts of simulated data, we can help illustrate a wider range of applications and use cases which might otherwise be blocked by confidential data sources, such as money transfers, SARs, log files in mission-critical systems, and so on.