I Mapreduced a Neo store
Lately I've been busy talking at conferences to tell people about our way to create large Neo4j databases. Large means some tens of millions of nodes and hundreds of millions of relationships and billions of properties.
Our use case consisted of exploring our data to find interesting patterns. The data we want to explore is about financial transactions between people, so the Neo4j graph model is a good fit for us. Because we don't know upfront what we are looking for we need to create a Neo4j database with some parts of the data and explore that. When there is nothing interesting to find we go enhance our data to contain new information and possibly new connections and create a new Neo4j database with the extra information.
This means it's not about a one time load of the current data and keep that up to date by adding some more nodes and edges. It's really about building a new database from the ground up everytime we think of some new way to look at the data.
First try without Hadoop
Before we created our Hadoop based solution, we used the batchimport framework provided with Neo4j (the batch inserter API). This allows you to insert a large amount of nodes and edges without the transactional overhead (Neo4j is ACID compliant). The batch importer API is a very good fit for the medium sized graphs, or the one time imports of large datasets, but in our case recreating multiple databases a day, the running time was too long.
To speed the process we wanted to use our Hadoop cluster. If we could make the process of creating a Neo4j database work in a distributed way, we could make use of the total amount of cluster machines instead of the single machine batchimporter.
But how do you go about that? The batch import framework was built upon the idea of having a single place to store the data. Having a server running somewhere the cluster could connect to had multiple downsides:
- How to handle downtime of the Neo4j server
- You're back to being transactional
- You need to check if nodes are already existing
So the idea became to build the database really from the ground up. Would it be possible to build the underlying filestructure without having the need of Neo4j running somewhere? Would be cool right?
So we did! We investigated the internal filestructure (see how it works in part 2 of the technical blogs) and had to overcome one more hurdle to really make it work.
We needed monotonically increasing row ids because the position in a Neo4j file means something. The id of the node or relationship is directly connected to the offset in the file. So we couldn't get a way with just adding some numbers as id's to our data because there couldn't be a gap between the numbering. In Friso's blog about the monotonically increasing row ids you can read all about how we did that.
Follow us for more of this
Testing and debugging Apache Airflow
February 22, 2019
The Zen of Python and Apache Airflow
February 18, 2019
AWS Machine Learning Competency Status for GoDataDriven
February 14, 2019
GoDataDriven Open Source Contribution for January 2019, the Apache Edition
February 13, 2019
Our social responsibility as a company
February 08, 2019
Keras: multi-label classification with ImageDataGenerator
January 31, 2019