WIP: Better workflow for Neo4j data conversion

Hey all!

We’ve seen a lot of interest recently in a better-documented and more flexible method (current method) for converting the OpenSanctions entity graph (or the FollowTheMoney data model more broadly) into a Neo4j/Memgraph-compatible property graph.

Here’s what we’ve identified as issues so far:

  • Users want to select which entity types will be turned into nodes.
  • Users want to select which types of values linked to entities (e.g. names, addresses, identifiers) will be turned into nodes.
  • We may also want to make the conversion of risk topics (e.g. PEP, sanctioned) into labels configurable.
  • Similarly, users want the option to turn particular edge types on or off, based either on schemata (e.g. Ownership, Directorship) or on properties (e.g. Passport:holder).

I’ve been thinking about turning this into a YAML configuration file that would define a mapping between the FtM/OS data and the property graph. Here’s a sketch:

source: "https://data.opensanctions.org/datasets/latest/default/entities.ftm.json"

config:
  join_values: ";"
  base_url: file:///where/neo4j/will/find/the/csvs

nodes:
  schemata:
    Security:
      ignore: true
    Address:
      ignore: true
    Person:
      label: Human
      properties:
        - name
        - birthDate
        - deathDate
  types:
    name: true
    address: true
    identifier:
      caption: raw
    country: false
    date: false
  topics:
    role.pep:
      label: PEP
    sanction: true
    sanction.linked: true

edges:
  schemata:
    Ownership:
      label: OWNS

  properties:
    "Sanction:target":
      label: TARGET

There are some things this doesn’t solve very well:

  • How to apply multi-valued properties to nodes/edges as string properties.
  • How to make the ID generation for nodes configurable, so that they can be made to collide with the IDs of another knowledge graph already present in the graph database (i.e. how to make this snap into an existing, broader, in-house dataset). One rough idea is sketched below.
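
For the ID question, one option might be an ID template in the spec. A minimal sketch (the id_template key and the make_node_id() helper are hypothetical, nothing is decided yet):

# Hypothetical: an "id_template" spec key that rewrites FtM entity IDs so the
# resulting nodes can line up with an existing in-house graph's ID namespace.
def make_node_id(entity_id: str, template: str = "{id}") -> str:
    return template.format(id=entity_id)

make_node_id("NK-abc123")                  # -> "NK-abc123" (default: keep the FtM ID)
make_node_id("NK-abc123", "inhouse:{id}")  # -> "inhouse:NK-abc123"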

The idea is to eventually make this executable as a Python tool:

pip install ftm-propgraph
ftm-propgraph spec.yml
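
Internally, the tool would roughly do something like the following. This is a very rough sketch using the followthemoney library and the spec above; the file names and all node-emission details are illustrative only:

import json

import yaml
from followthemoney import model

with open("spec.yml") as fh:
    spec = yaml.safe_load(fh)

join_char = spec.get("config", {}).get("join_values", ";")
node_schemata = spec.get("nodes", {}).get("schemata", {})

# entities.ftm.json is newline-delimited JSON, one entity per line
with open("entities.ftm.json") as fh:
    for line in fh:
        proxy = model.get_proxy(json.loads(line))
        conf = node_schemata.get(proxy.schema.name) or {}
        if conf.get("ignore"):
            continue
        label = conf.get("label", proxy.schema.name)
        props = {
            prop: join_char.join(proxy.get(prop))
            for prop in conf.get("properties", [])
        }
        # ... emit a CSV row (or Cypher statement) for (label, proxy.id, props)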

I’m keen for any feedback: extra requirements, suggestions for a wholly different approach, etc. :slight_smile:

3 Likes

Thanks @pudo! The listed requirements make sense to me. Just to clarify: would the script actually add the data to Neo4j (vs. generating a bunch of CSV files and an import script)?

2 Likes

That’s something I’m deeply confused by: if we generate a single Cypher file that is 6,000,000 lines long, is there a way to still wrap it in a transaction and make it import this century?

I deeply hate the weird CSV bazaar of the current solution, but the hope was that it would at least be fast, having read somewhere that Cypher imports are super slow.

1 Like

Here’s one of the fastest ways to load CSV into Neo4j, at least for a “business-friendly” audience of users.

This leverages the Graph Data Science (GDS) library from Neo4j, which requires installing a plugin in your database. See: Introduction - Neo4j Graph Data Science

You can see the full context of the application in the ERKG/examples/datasets.ipynb notebook in the DerwenAI/ERKG repo on GitHub.

First, this assumes you have a gds object configured using the Bolt protocol and the necessary credentials for your database:

from graphdatascience import GraphDataScience

# connect the GDS Python client to the database over Bolt
gds: GraphDataScience = GraphDataScience(
    bolt_uri,
    auth = ( username, password, ),
    database = database,
    aura_ds = False,  # set True when connecting to a Neo4j AuraDS instance
)

Then define the following unwind_query string and load_records() function:

unwind_query: str = """
UNWIND $rows AS row
CALL {
  WITH row
  MERGE (rec:Record {uid: row.uid})
  ON CREATE
    SET rec += row
} IN TRANSACTIONS OF 10000 ROWS
"""

import pandas as pd

def load_records (
    gds: GraphDataScience,
    df: pd.DataFrame,
    ) -> None:
    """Load one Pandas DataFrame of node data into Neo4j."""
    # convert_df() normalizes the DataFrame's dtypes; see the note below
    gds.run_cypher(
        unwind_query,
        {"rows": convert_df(df).to_dict(orient = "records")},
    )
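
Note that convert_df() is a helper defined in the linked notebook, which normalizes the DataFrame before serialization. If you just want to run the snippet above, a minimal stand-in could be:

# Minimal stand-in for the notebook's convert_df() helper: coerce every value
# to a string so Neo4j receives uniform property values. (The notebook version
# does more careful, dtype-aware conversion.)
def convert_df (df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna("").astype(str)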

Assuming you have loaded the CSV data into a Pandas DataFrame called df, e.g. by using pandas.read_csv(), you can then load it quickly via this “mini-batch” method:

load_records(gds, df)

Note: the notebooks in the ERKG repo linked above also show examples of how to handle multi-valued properties and ID generation, in case that’s helpful.

Also, Senzing would be delighted to co-author a tutorial with OpenSanctions about how to load data, if that might help?

2 Likes