Graph technology: From SQL to neo4j

Martin Horvath
14 min readJan 5, 2021

I was heavily using relational SQL databases for ages and I would even call myself a SQL-Ninja. Oh boy… how flat my earth has been for years. There’s nothing wrong with relational databases if one knows their limitations.

Image by Pete Linforth from Pixabay

Relational database systems like Microsoft SQL server, Oracle, PostgreSQL, MySQL etc. are excellent choices for storing structured data. For the sake of simplicity, let’s consider them a good choice if the data to be stored is some kind of Microsoft Excel workbook with multiple sheets. Like an accounting journal, an address book or something similar.

And what RDBMs can also handle just fine are simple relationships between the data entries. This is comparable with a link between two Excel Sheets — to bring up this generalization again.
A use case would be to have a list of names in one place and a list of addresses in another place — linked through some shared identifier (e.g. PersonId). The first table below contains three of my friends and the second table contains addresses. They are linked to my friends using the ID and PersonID fields. For Peter Piper, no address is known.

-----------------------------------------------------------------
| ID | Firstname | Lastname | Date of birth | Notes |
-----------------------------------------------------------------
| 1 | John | Doe | 1984-01-30 | My best buddy |
| 2 | Peter | Piper | 1975-03-13 | Super skilled guy |
| 3 | Jennifer | Jones | 1983-09-15 | |
-----------------------------------------------------------------
-----------------------------------------------------------------
| PersonId | Street | ZIP | City | Country |
-----------------------------------------------------------------
| 1 | Samplestreet 3 | 1234 | Gotham | US |
| 3 | Kings-Avenue 9 | 99123 | Whoville | US |
-----------------------------------------------------------------

For a good reason, the previous paragraph mentioned “simple” relationships. Because relationships are an area where previously mentioned systems will reach certain limitations.

Relationships on data level

What if the focus of data queries lies on the relationship, rather than on the data entries? What if the search for data (e.g. a Person) begins with a known relationship and not with a part of a name?

“What’s the name of the person with the blue pants I met in the bar last Friday”
vs.
“What’s the name of the person who was at my birthday party and who’s name ends with *onnery?”

The first case requires you to find all the friends who were in the bar with you and probably know that person.
The second case requires you to browse the guest list and check for matching names.

What if a search has multiple levels of detail like “the person who knows a person who knows a person”? This is exactly where things are getting interesting and where an uprising technology called “Graph Databases” will become a game changer.

The flight example

This becomes even clearer with a sample. I started to play around with the “Flights and delays 2015 dataset” published by the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics available here: https://www.kaggle.com/usdot/flight-delays

Airports with unique IDs (iata code) are connected by individual flights with unique IDs (airline and flight_number) using the information origin_airport and destination_airport.

Flight UA703 connects New York JFK and Los Angeles LAX on January 6, 11.36am UTC 

This data could be modeled as follows using a “traditional” relational database with two tables, flights and airports, and a foreign key relationship between them:

If you want to try this yourself, head over to the github repo for this article (https://github.com/martinhorvath/com.primebird.graphplayground) and copy the contents of flightnet_schema_sqlfiddle.sql into a new http://sqlfiddle.com/ (left hand side, then hit build-schema and then fire the queries on the right hand side)

Querying the data set for all airports or all flights is quite easy using the following sql query:

SELECT * FROM airports
Results for all-airports-query

It would also be quite easy to find all flights departing from New York. For improved readability, I have returned just a few columns from the flights table.

SELECT
f.airline,
f.flight_number,
f.tail_number,
f.origin_airport,
f.destination_airport,
f.year,
f.month,
f.day,
f.scheduled_departure
FROM flights f
LEFT JOIN airports a ON f.origin_airport = a.iata_code
WHERE a.city = 'New York'
Results for querying flights departing from New York

Things become more difficult when the question is “How can I reach San Francisco from New York?”. The following query is a first try.

SELECT
f.airline,
f.flight_number,
f.tail_number,
f.origin_airport,
f.destination_airport,
f.year,
f.month,
f.day,
f.scheduled_departure
FROM flights f
LEFT JOIN airports a ON f.origin_airport = a.iata_code
LEFT JOIN airports b ON f.destination_airport = b.iata_code
WHERE
a.city = 'New York'
and b.city = 'San Francisco'

Unfortunately this results in 0 flights because there is no direct connection between New York and San Francisco in the our flights table. So a two-hop connection needs to be found:

SELECT
f.airline,
f.flight_number,
f.origin_airport,
f.destination_airport,
f.year,
f.month,
f.day,
f.scheduled_departure,
f2.airline,
f2.flight_number,
f2.origin_airport,
f2.destination_airport,
f2.year,
f2.month,
f2.day,
f2.scheduled_departure
FROM flights f
LEFT JOIN flights f2 ON f.destination_airport = f2.origin_airport
LEFT JOIN airports a ON f.origin_airport = a.iata_code
LEFT JOIN airports b ON f2.destination_airport = b.iata_code
WHERE
a.city = 'New York'
and b.city = 'San Francisco'
Result for the 2-hop query from New York to San Francisco

This would return the desired flight-combinations DL2045 to Denver F9667 to San Francisco as well as B697 to Denver and again F9667 to San Francisco. But the query has not taken into consideration the departure and flight times to check if such a connection would make sense. And of course, a huge number of flights and airports as well as the fact that there’s a linear growth of the result set with the number of hops supported will cause unacceptable compute times and high workload.

So what if a the data model would be designed with relationships in mind, rather than flights or airports?

Hello GraphDB, hello Neo4J

The underlying concept of a graph is quite easy and graphs itself are not new. It consists of nodes which can be connected with each other by edges. Both can have properties and edges can also have a direction. For those coming from the SQL-world, the following comparison should be helpful:

****************************************
| SQL DB | Graph DB *
****************************************
| Database | Graph *
| Table | Does not exist *
| Row | Node *
| Column | Property on node *
| M:M Table | Edge *
| M:M Table Column | Property on edge *
****************************************

The illustration below is a screenshot taken from the Neo4J browser where I imported the airports and flights we worked on before (I’ll show you how to do that).

  • Airports are created as nodes labeled as Airport with their name shown and colored orange.
  • Flights are created as nodes labeled as Flight with their flight-number shown and colored in blue.
  • Flights and airports are connected with two relationships with dedicated labels for ARRIVING_AT and DEPARTING_FROM

There is much to learn when it comes to best practices, modelling efficiency and constraints (e.g. unique identifiers for nodes). Keep this in mind for later.

Airports and flights from the previous example turned into a graph

Graphification of the airports and flights

As promised, I’ll walk with you through the graphification of the sql tables we created earlier. There are a number of graph databases available these days and I’m making a choice for you: Neo4J is available as a community edition, really powerful and well documented. They also offer free sandboxes (at the time of writing) and this makes it much easier to play with. Click here to create a personal sandbox if you want to follow the steps and get your hands dirty: https://sandbox.neo4j.com/

Follow the steps until you’re taken to the Neo4j Browser and enjoy the short tour to get familiar with the user interface which should look like this:

Neo4j sandbox browser

Graph DBs are not really good at SQL… they don’t recognize it at all. What has turned out to become a standard like SQL is the query language Cypher. It has evolved a lot over the years and also comes with some importing capabilities. So the first thing we want to do is turning our sql-data into CSV files. For the non-tech-users: this is like a super simplified form of Microsoft Excel. Nearly every database or db-administration tool comes with such an export function. I did that already for you and provided the csv files for airports and flights on github,where you can find flightnet_airports.csvand flightnet_flights.csv https://github.com/martinhorvath/com.primebird.graphplayground

Let’s start loading the airports

The cypher query for the import is pasted below and ready to be used, but let’s first understand what it does.

LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/martinhorvath/com.primebird.graphplayground/main/flightnet_airports.csv' AS row

LOAD CSV WITH HEADERS FROM accepts either a local file (which doesn’t work in the sandbox as you don’t have physical access) or a fully qualified internt URL. I’ve taken the raw url of the file in my github repo. If you forked it, adjust the repo.

AS row makes every row (except the header) available for processing. It is an object and you can access each column by it’s header. It’s a good idea to avoid any special characters or spaces when working with CSV. It’s super important that this is written as one single row without line breaks.

MERGE (airport:Airport {iata_code: row.iata_code})

The line starting with MERGE is the first creation of a node. In contrast to CREATE it expects nodes to be unique. That works well for us as IATA codes are always unique.
Within the brackets, we define an alias airport for a node labeled as Airport. The property iata_code of the node is filled with the column-value iata_code of the csv-row currently in process (because the MERGE, and ON CREATE SET commands are executed for every row in the csv).

So far, we would get a node as shown below with just a label and an iata_code (and an internal ID which is auto-generated).

ON CREATE SET airport.name = row.airport, airport.city = row.city, airport.state = row.state, airport.country = row.country, airport.latitude = toFloat(row.latitude), airport.longitude = toFloat(row.longitude);

To set the rest of the properties, we use the ON CREATE SET command. I mentioned before that the MERGE expects unique nodes. If a node (e.g. airport SFO) is NOT found, then it is created. ON CREATE SET allows us to specify properties to be created during the creation. We can use the form <alias>.<desiredProperty> = <some value>. For getting the value from the csv row, we can use the row alias defined in the first command. It’s again super important that this contains no line breaks!
Below is a screenshot that visualized all properties created for the airport node:

Putting now all together results in the following cypher command that you can paste into the Neo4J browser and hit the triangle button.

LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/martinhorvath/com.primebird.graphplayground/main/flightnet_airports.csv' AS row
MERGE (airport:Airport {iata_code: row.iata_code})
ON CREATE SET airport.name = row.airport, airport.city = row.city, airport.state = row.state, airport.country = row.country, airport.latitude = toFloat(row.latitude), airport.longitude = toFloat(row.longitude);

You can now already check what you did and “query” all created airport odes with the cypher following cypher query, returning a result like seen in below image:

MATCH(a:Airport)
return a

We used the MATCH command to return all nodes labeled as Airport and make them available using the alias a. The return statement simply returns the full node.

Load all flights

The next set of nodes we want to create are the flights. The steps are similar to the ones taken above and shouldn’t require explanation. The only difference is the source file, the label used for the nodes (i.e. Flight instead of Airport) and the property-set for each node.

What is worth mentioning is the usage of built-in functions for setting the properties departureDate. Graph databases know several data types (e.g. text, number,…) and we want departure date to be of type date so that we could analyze and use it later on smartly.

The function date in below case accepts an iso-date (e.g. 2021–01–04) without additional format specification. Because year, month and day are available as separate fields per row, I’m using string-concatenation to form this per flight.

LOAD CSV WITH HEADERS FROM 'https://github.com/martinhorvath/com.primebird.graphplayground/raw/main/flightnet_flights.csv' AS row
MERGE (flight:Flight {id: row.id})
ON CREATE SET flight.flight_number = row.flight_number, flight.airline = row.airline, flight.tail_number = row.tail_number, flight.origin_airport = row.origin_airport, flight.destination_airport = row.destination_airport, flight.air_time = row.air_time,flight.departureDate = date(row.year + '-' + row.month + '-' + row.day), flight.departureTime = row.departure_time, flight.departureDate = date(row.year + '-' + row.month + '-' + row.day), flight.arrivalTime = row.arrival_time

Once done, you can check the results with an adapted cypher query:

MATCH(f:Flight)
return f

Relationships ARRIVING_AT and DEPARTING_FROM

Now that you are experienced in creating nodes and have airports and flights ready, the relationships can be modeled. As mentioned earlier, a relationship is an edge in graph-terms and connects two nodes. It is also labeled (e.g. departing_from) and can have properties.

We start with the querying of the nodes to be connected, namely Airport and Flight.

MATCH (a:Airport),(f:Flight)

MATCH is the command you know already. In contrast to the previously used statements, multiple nodes can be matched by using the comma as separator. The first letter/word in the bracket is the alias to be used for subsequent actions.

WHERE a.iata_code = f.origin_airport

Connecting all airports with all flights would be kind of silly. We only want to connect and airport and a flight if the flights origin_airport is matching the airports iata_code.

CREATE (f)<-[r:DEPARTING_FROM]-(a)

With that prepared, the CREATE command can be used to create the relationship. A relationship is modeled as (<Node>)<-[r:<RelationshipLabel>]-(<Node2>). The arrowhead can be used to specify a directed relationship. It is important to allow a continuous flow within the graph as the queries respect directions later. E.g. Airport -> Flight -> Airport requires us to put the arrowhead in a way that departing_from points to the flight and not to the airport.

RETURN a,r,f

Finally we return all three objects, airport, flight and the departing_from relationship to see what we produced.
Putting it all together leads to the following cypher query for creating all the departing_from relations:

MATCH (a:Airport),(f:Flight)
WHERE a.iata_code = f.origin_airport
CREATE (f)<-[r:DEPARTING_FROM]-(a)
RETURN a,r,f

To complete the graph, we just need to do the same for the arriving_at:

MATCH (a:Airport),(f:Flight)
WHERE a.iata_code = f.destination_airport
CREATE (f)-[r:ARRIVING_AT]->(a)
RETURN a,r,f

Wonderful, your graph network is ready!

Experience the power of graph queries

So far, we have represented the airport-flight-data-set in a “traditional” SQL database and in a fancy new graph database. The exciting part is yet to come as we will now experience the real power of graph with relationship-based queries!

Getting all airports:

MATCH(a)
WHERE a:Airport
RETURN a;

Getting all flights departing from San Francisco:

MATCH(a:Airport)-[r:DEPARTING_FROM]->(f:Flight)
WHERE a.city = "San Francisco"
RETURN a,f;
Flights departing from San Francisco

Getting all flights arriving in New York:

MATCH(f:Flight)-[r:ARRIVING_AT]->(a:Airport)
WHERE a.city = "New York"
RETURN a,f
Flights arriving at New York

How can I reach New York from San Francisco?

MATCH(a:Airport)-[r:DEPARTING_FROM]->(f:Flight)-[r2:ARRIVING_AT]->(a2:Airport)-[r3:DEPARTING_FROM]->(f2:Flight)-[r4:ARRIVING_AT]->(a3:Airport)
WHERE a.city = "New York" and a3.city = "San Francisco"
RETURN a,f,f2,a2,a3

This is really ugly and not very flexible. We can do better…

match p=(a)-[:DEPARTING_FROM|ARRIVING_AT*1..5]->(b)
where a.city='New York' and b.city = 'San Francisco'
return p

The pipe “|” can be used to match more than one relationship. The asterisk before the closing bracket can be used to limit the “number of hops” on the graph. The first number is the min-hops, the second number is the max-hops. E.g. 1..5 means at least one but at most 5 of either DEPARTING_FROM or ARRIVING_AT.

However, this will not guarantee the shortest connection between New York and San Francisco. There’s an even better way: The shortestPath function. It is used below and guarantees the shortest path (i.e. number of hops in this case).

match p=shortestPath((a)-[:DEPARTING_FROM|ARRIVING_AT*]->(b))
where a.city='New York' and b.city = 'San Francisco'
return p, length(p)

This is just the very basic functionality of a graph and there’s so much more to explore. The Dijkstra-implementation, advanced traversals and nearly unlimited capabilities for developers who are keen to develop custom logic to be used as plugins. E.g. to stop the graph-traversal if a connecting flight is leaving, before the arrival of the inbound flight, can be perfectly represented using such a custom logic.

The next interesting question is the application of such a technology in real business scenarios. While the “social media business” was implementing (inventing?) the technology early (e.g. you and person x have a shared colleague), the rest of the economic world was lagging behind.

Manufacturing industry

This industry is predestined to make use of graph technology! Everything in production is based on a “bill of materials” which defines the materials and sub-products that make up the final product. E.g. the BOM for a car consists of tires, wheels, chassis, engine,… where each “material” can have a separate BOM as well — like the engine.

With the idea and examples of graph technologies before, it’s quite easy to see the potential in this area: How many parts has a car? Which cars are built with the rear-shock model “super-shock-123”? What’s the accumulated cost of car model xy123?

When quality control comes into play, it is of interest to find out if built models are affected by a malfunction-report for the diesel-pump dp099. This immediately leads to the topic of replacement parts where an answer for “how can model K be built without the part Z which is not available anymore”? Even with the most obvious questions in mind and without being a domain expert for engineering, construction and production, the potential of graphs is obvious!

Utilities and maintenance

An industry most people are not too familiar with but rely on it everyday is about “utilities”. The network that provides us electricity, internet connection, water, gas,… is quite complex and difficult to manage. And because of the importance of the product, any malfunction or breakdown has significant impact!

Gas for example is provided using a complex network of pipes, needs to be allocated according to the need and numerous analysis tools evaluate the health of the network day by day, hour by hour,… This is a number one area for graphs as all pipes, valves and control centers are modeled using relationships. What if a part of a pipe needs to be repaired and therefore requires the valve closest to it to be closed? Where’s the valve? Which households are affected by this action? How can those households be provided gas for the time of the maintenance action? Are there any other sources they are connected to — and if so, which control-point can enable the flow of gas for the backup pipe?

Those two industries provide a good starting point to initiate the spark. And once things are getting concrete, it’s important to choose the right tool for the job. Luckily there aren’t too many products on the market and it seems like SAP and Neo4j share the business (at the time of writing in January 2021).

Neo4j has ever since been a big player on the graph market and has an excellent product with great customization possibilities — definitely a great choice.

Closing word

Congratulations for reading til the very end and thank you so much for your interest. I hope that the picture of graph databases became clearer and that you are able to evaluate if it is worth taking a deeper look for making use of them. I’s one of the most exciting technologies I’ve ever worked with and I still consider it as a “hidden gem”. Still too few businesses unleashing the full potential.

--

--

Martin Horvath

I'm a consultant working on international projects in the field of geospatial data and customer experience, with a passion for technology and mountaineering.