Graph technology: From SQL to neo4j

Image by Pete Linforth from Pixabay
-----------------------------------------------------------------
| ID | Firstname | Lastname | Date of birth | Notes |
-----------------------------------------------------------------
| 1 | John | Doe | 1984-01-30 | My best buddy |
| 2 | Peter | Piper | 1975-03-13 | Super skilled guy |
| 3 | Jennifer | Jones | 1983-09-15 | |
-----------------------------------------------------------------
-----------------------------------------------------------------
| PersonId | Street | ZIP | City | Country |
-----------------------------------------------------------------
| 1 | Samplestreet 3 | 1234 | Gotham | US |
| 3 | Kings-Avenue 9 | 99123 | Whoville | US |
-----------------------------------------------------------------

Relationships on data level

What if the focus of data queries lies on the relationship, rather than on the data entries? What if the search for data (e.g. a Person) begins with a known relationship and not with a part of a name?

The flight example

This becomes even clearer with a sample. I started to play around with the “Flights and delays 2015 dataset” published by the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics available here: https://www.kaggle.com/usdot/flight-delays

Flight UA703 connects New York JFK and Los Angeles LAX on January 6, 11.36am UTC 
SELECT * FROM airports
Results for all-airports-query
SELECT
f.airline,
f.flight_number,
f.tail_number,
f.origin_airport,
f.destination_airport,
f.year,
f.month,
f.day,
f.scheduled_departure
FROM flights f
LEFT JOIN airports a ON f.origin_airport = a.iata_code
WHERE a.city = 'New York'
Results for querying flights departing from New York
SELECT
f.airline,
f.flight_number,
f.tail_number,
f.origin_airport,
f.destination_airport,
f.year,
f.month,
f.day,
f.scheduled_departure
FROM flights f
LEFT JOIN airports a ON f.origin_airport = a.iata_code
LEFT JOIN airports b ON f.destination_airport = b.iata_code
WHERE
a.city = 'New York'
and b.city = 'San Francisco'
SELECT
f.airline,
f.flight_number,
f.origin_airport,
f.destination_airport,
f.year,
f.month,
f.day,
f.scheduled_departure,
f2.airline,
f2.flight_number,
f2.origin_airport,
f2.destination_airport,
f2.year,
f2.month,
f2.day,
f2.scheduled_departure
FROM flights f
LEFT JOIN flights f2 ON f.destination_airport = f2.origin_airport
LEFT JOIN airports a ON f.origin_airport = a.iata_code
LEFT JOIN airports b ON f2.destination_airport = b.iata_code
WHERE
a.city = 'New York'
and b.city = 'San Francisco'
Result for the 2-hop query from New York to San Francisco

So what if a the data model would be designed with relationships in mind, rather than flights or airports?

Hello GraphDB, hello Neo4J

The underlying concept of a graph is quite easy and graphs itself are not new. It consists of nodes which can be connected with each other by edges. Both can have properties and edges can also have a direction. For those coming from the SQL-world, the following comparison should be helpful:

****************************************
| SQL DB | Graph DB *
****************************************
| Database | Graph *
| Table | Does not exist *
| Row | Node *
| Column | Property on node *
| M:M Table | Edge *
| M:M Table Column | Property on edge *
****************************************
  • Airports are created as nodes labeled as Airport with their name shown and colored orange.
  • Flights are created as nodes labeled as Flight with their flight-number shown and colored in blue.
  • Flights and airports are connected with two relationships with dedicated labels for ARRIVING_AT and DEPARTING_FROM
Airports and flights from the previous example turned into a graph

Graphification of the airports and flights

As promised, I’ll walk with you through the graphification of the sql tables we created earlier. There are a number of graph databases available these days and I’m making a choice for you: Neo4J is available as a community edition, really powerful and well documented. They also offer free sandboxes (at the time of writing) and this makes it much easier to play with. Click here to create a personal sandbox if you want to follow the steps and get your hands dirty: https://sandbox.neo4j.com/

Neo4j sandbox browser

Let’s start loading the airports

The cypher query for the import is pasted below and ready to be used, but let’s first understand what it does.

LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/martinhorvath/com.primebird.graphplayground/main/flightnet_airports.csv' AS row
MERGE (airport:Airport {iata_code: row.iata_code})
ON CREATE SET airport.name = row.airport, airport.city = row.city, airport.state = row.state, airport.country = row.country, airport.latitude = toFloat(row.latitude), airport.longitude = toFloat(row.longitude);
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/martinhorvath/com.primebird.graphplayground/main/flightnet_airports.csv' AS row
MERGE (airport:Airport {iata_code: row.iata_code})
ON CREATE SET airport.name = row.airport, airport.city = row.city, airport.state = row.state, airport.country = row.country, airport.latitude = toFloat(row.latitude), airport.longitude = toFloat(row.longitude);
MATCH(a:Airport)
return a

Load all flights

The next set of nodes we want to create are the flights. The steps are similar to the ones taken above and shouldn’t require explanation. The only difference is the source file, the label used for the nodes (i.e. Flight instead of Airport) and the property-set for each node.

LOAD CSV WITH HEADERS FROM 'https://github.com/martinhorvath/com.primebird.graphplayground/raw/main/flightnet_flights.csv' AS row
MERGE (flight:Flight {id: row.id})
ON CREATE SET flight.flight_number = row.flight_number, flight.airline = row.airline, flight.tail_number = row.tail_number, flight.origin_airport = row.origin_airport, flight.destination_airport = row.destination_airport, flight.air_time = row.air_time,flight.departureDate = date(row.year + '-' + row.month + '-' + row.day), flight.departureTime = row.departure_time, flight.departureDate = date(row.year + '-' + row.month + '-' + row.day), flight.arrivalTime = row.arrival_time
MATCH(f:Flight)
return f

Relationships ARRIVING_AT and DEPARTING_FROM

Now that you are experienced in creating nodes and have airports and flights ready, the relationships can be modeled. As mentioned earlier, a relationship is an edge in graph-terms and connects two nodes. It is also labeled (e.g. departing_from) and can have properties.

MATCH (a:Airport),(f:Flight)
WHERE a.iata_code = f.origin_airport
CREATE (f)<-[r:DEPARTING_FROM]-(a)
RETURN a,r,f
MATCH (a:Airport),(f:Flight)
WHERE a.iata_code = f.origin_airport
CREATE (f)<-[r:DEPARTING_FROM]-(a)
RETURN a,r,f
MATCH (a:Airport),(f:Flight)
WHERE a.iata_code = f.destination_airport
CREATE (f)-[r:ARRIVING_AT]->(a)
RETURN a,r,f

Experience the power of graph queries

So far, we have represented the airport-flight-data-set in a “traditional” SQL database and in a fancy new graph database. The exciting part is yet to come as we will now experience the real power of graph with relationship-based queries!

MATCH(a)
WHERE a:Airport
RETURN a;
MATCH(a:Airport)-[r:DEPARTING_FROM]->(f:Flight)
WHERE a.city = "San Francisco"
RETURN a,f;
Flights departing from San Francisco
MATCH(f:Flight)-[r:ARRIVING_AT]->(a:Airport)
WHERE a.city = "New York"
RETURN a,f
Flights arriving at New York
MATCH(a:Airport)-[r:DEPARTING_FROM]->(f:Flight)-[r2:ARRIVING_AT]->(a2:Airport)-[r3:DEPARTING_FROM]->(f2:Flight)-[r4:ARRIVING_AT]->(a3:Airport)
WHERE a.city = "New York" and a3.city = "San Francisco"
RETURN a,f,f2,a2,a3
match p=(a)-[:DEPARTING_FROM|ARRIVING_AT*1..5]->(b)
where a.city='New York' and b.city = 'San Francisco'
return p
match p=shortestPath((a)-[:DEPARTING_FROM|ARRIVING_AT*]->(b))
where a.city='New York' and b.city = 'San Francisco'
return p, length(p)

Manufacturing industry

This industry is predestined to make use of graph technology! Everything in production is based on a “bill of materials” which defines the materials and sub-products that make up the final product. E.g. the BOM for a car consists of tires, wheels, chassis, engine,… where each “material” can have a separate BOM as well — like the engine.

Utilities and maintenance

An industry most people are not too familiar with but rely on it everyday is about “utilities”. The network that provides us electricity, internet connection, water, gas,… is quite complex and difficult to manage. And because of the importance of the product, any malfunction or breakdown has significant impact!

Closing word

Congratulations for reading til the very end and thank you so much for your interest. I hope that the picture of graph databases became clearer and that you are able to evaluate if it is worth taking a deeper look for making use of them. I’s one of the most exciting technologies I’ve ever worked with and I still consider it as a “hidden gem”. Still too few businesses unleashing the full potential.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Martin Horvath

Martin Horvath

I'm a consultant working on international projects in the field of geospatial data and customer experience, with a passion for technology and mountaineering.