Graph databases 1 - Modeling

Fri, Aug 2, 2013

I am making my way through the book Graph Databases 1, an introductory book on the subject from Neo Technology 2, the creators of Neo4j 3. At work, we are using Neo4j for a cool new thing we are building, and I see it as a great chance to learn something new and interesting. In the past, I have gone through a spiral 4 when it comes to learning stuff, and the constant feedback I get is to pick something I get to work with every day and dig deep into it, in order to avoid the frustration and the inevitable spiral.

This is obviously a completely new subject to me and I am enjoying it so far. I am going to write a series of blog posts about my effort to learn Graph Databases and my observations and learnings.

I am presently reading chapter 3, which deals with Data Modeling with Graphs. The key idea here is that the graph representation mirrors entities and relationships.

  • Entities are represented by nodes, and their characteristics are represented as properties of the node. E.g.: Cristiano Ronaldo is a node, while his height is a property.
  • Connections between entities are modeled as relationships. E.g.: OF_NATIONAL_TEAM is a relationship.
  • Relationships always connect two nodes, a start node and an end node, so they have a direction.
  • Relationships sometimes have properties. E.g.: debut is a property of the OF_NATIONAL_TEAM relationship.
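
As a rough sketch of how these four ideas look in Cypher, Neo4j's query language, the Ronaldo example could be written as below. The property name height_cm, the portugal node and the concrete values are assumptions added for illustration; they are not taken from the book. The snippet is independent of the Zidane graph used in the rest of this post, so it is best tried in a scratch database.

```cypher
// A node with properties, a second node, and a directed relationship
// that carries its own property. Names and values are illustrative.
CREATE (ronaldo {name: "Cristiano Ronaldo", height_cm: 187}),
       (portugal {name: "Portugal"}),
       (ronaldo)-[:OF_NATIONAL_TEAM {debut: "20030820"}]->(portugal);
```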

Example - European football players

If we were to model football players who have played for European clubs, a section of the graph would look like this:

"Graph: Zidane and teams"

Building the graph with Cypher

Cypher 5 is Neo4j’s query language. It is essentially structured ASCII art that tries to stay as close as possible to how relationships are drawn on a graph. The above graph can be created in Neo4j using the following Cypher snippet:

CREATE (zidane {name: "Zinedine Zidane", position: "Midfielder" }),
(cannes {name: "A.S. Cannes", founded: 1902}),
(bordeaux {name: "F.C.G. de Bordeaux", founded: 1881}),
(juventus {name: "Juventus F.C.", founded: 1897}),
(realmadrid {name: "Real Madrid C.F.", founded: 1902}),
(spain {name: "Spain"}),
(france {name: "France"}),
(italy {name: "Italy"}),
(zidane)-[:OF_NATIONAL_TEAM{debut: "19940817"}]->(france),
(zidane)-[:PLAYED_FOR_CLUB]->(cannes),
(zidane)-[:PLAYED_FOR_CLUB]->(bordeaux),
(zidane)-[:PLAYED_FOR_CLUB]->(juventus),
(zidane)-[:PLAYED_FOR_CLUB]->(realmadrid),
(cannes)-[:OF_FA]->(france),
(bordeaux)-[:OF_FA]->(france),
(juventus)-[:OF_FA]->(italy),
(realmadrid)-[:OF_FA]->(spain);

That is pretty straightforward and easy to grok. We define 8 nodes representing Zidane, the 4 clubs he played for and the 3 countries these clubs play in. Then we define the relationships between them. Cypher draws a relationship as an arrow, -[:TYPE]->, where the bracketed part names the relationship type (and optionally its properties) and the arrowhead gives its direction.
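
A quick way to sanity-check what was created is to count nodes and relationships. This is only a sketch, and the numbers assume a database that contains nothing but the graph above: the first query should then report 8 nodes, and the second should report 1 OF_NATIONAL_TEAM, 4 PLAYED_FOR_CLUB and 4 OF_FA relationships.

```cypher
// Count every node in the database
MATCH (n)
RETURN count(n);

// Count relationships, grouped by relationship type
MATCH ()-[r]->()
RETURN type(r), count(r);
```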

Querying the graph

Querying the graph we created for interesting things is where the fun really is.

  • To figure out what clubs Zidane played for, we will write the following Cypher snippet:

    MATCH (zidane)-[:PLAYED_FOR_CLUB]->(club)
    RETURN club;

When executed:

neo4j-sh (?)$ MATCH (zidane)-[:PLAYED_FOR_CLUB]->(club)
>             RETURN club;
+--------------------------------------------------+
| club                                             |
+--------------------------------------------------+
| Node[53]{founded:1902,name:"A.S. Cannes"}        |
| Node[54]{founded:1881,name:"F.C.G. de Bordeaux"} |
| Node[55]{founded:1897,name:"Juventus F.C."}      |
| Node[56]{founded:1902,name:"Real Madrid C.F."}   |
+--------------------------------------------------+
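
One thing worth keeping in mind: in a MATCH clause, zidane is only an identifier, so (zidane) matches any node that has an outgoing PLAYED_FOR_CLUB relationship. The query returns Zidane's clubs because he is the only player in this small graph. Once more players are added, the starting node would need to be pinned down by one of its properties. A minimal sketch, using the hypothetical identifier player and assuming a Neo4j version that accepts MATCH followed by WHERE:

```cypher
// Restrict the starting node by its name property
// instead of relying on the identifier's name
MATCH (player)-[:PLAYED_FOR_CLUB]->(club)
WHERE player.name =~ ".*Zidane"
RETURN club.name;
```

The =~ operator is a regular expression match; an exact comparison with = against the full name stored on the node would work as well.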

What if the question is: When were all the clubs Zidane played for founded? We can answer it with the following query:

MATCH (zidane)-[:PLAYED_FOR_CLUB]->(club)
RETURN club.founded;

  • Next, let’s try to answer: When did Zidane make his international debut? We can use the following query:

    MATCH (zidane)-[player:OF_NATIONAL_TEAM]->(team)
    RETURN player.debut;

When executed, this results in:

+--------------+
| player.debut |
+--------------+
| "19940817"   |
+--------------+

Here, [player:OF_NATIONAL_TEAM] specifies a relationship of type OF_NATIONAL_TEAM and binds it to the identifier player, so that we can later extract the debut property from player.

This example shows that sometimes a piece of information has to be part of the query even though we are not interested in extracting it. Here, we don’t really care which national team Zidane played for; we are just interested in an attribute of the relationship. Cypher lets you express this intent to ignore:

MATCH (zidane)-[player:OF_NATIONAL_TEAM]->()
RETURN player.debut;

The above query will produce the same result as the previous query. The () specifies that there is some node there, but we are not interested in what it represents.

  • The ability to ignore can be leveraged to answer our next question: Which countries has Zidane played league football in?

    MATCH (zidane)-[:PLAYED_FOR_CLUB]->()-[:OF_FA]->(fa)
    RETURN DISTINCT fa.name;

The result:

+----------+
| fa.name  |
+----------+
| "France" |
| "Italy"  |
| "Spain"  |
+----------+

We use DISTINCT to remove duplicates: Zidane played for both Cannes and Bordeaux, which are French clubs, so without it France would appear twice in the result.
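
To see where the duplicate comes from, the same pattern can be aggregated instead of de-duplicated. The sketch below uses the hypothetical identifier player and the alias clubs, and filters on the name property so that it keeps working if other players are added. On the graph built above it should report 2 clubs for France and 1 each for Italy and Spain.

```cypher
// Count Zidane's clubs per football association
MATCH (player)-[:PLAYED_FOR_CLUB]->(club)-[:OF_FA]->(fa)
WHERE player.name =~ ".*Zidane"
RETURN fa.name, count(club) AS clubs
ORDER BY clubs DESC, fa.name;
```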

Summary

I am still making my way through the book. So far it has looked promising, and I have been able to grasp things. I am looking forward to learning more complex querying and will write about it in subsequent blog posts.