Intro to Neo4j

Introduction to Neo4j

In the world of NoSQL, scale is not the only performance characteristic that matters. Sometimes, the discoverability of data -- finding a needle in a haystack -- is more valuable.

Neo4j is a graph database. It's purpose is to store and manage data based on graph theory. Some problem domains lend themselves particularly well to graph space: social connections, collaborative filtering, and data discovery, along with many others.

In order to show the basics of graphs, we're going to use the ever-popular idea of movie ratings. Instead of data being scraped from IMDb, we are using a tool called MovieTweetings. This tool scrapes Twitter for well-defined tweets that indicate a movie rating. The author of the tool publishes a new dataset almost every day, so the data is significantly fresher than any other source we've found. Before we get started in that, however, let's get some background.

What is a Graph?

If you have a Computer Science background, you'll likely remember your classes on graphs. Graphs have nodes and edges. Edges indicate a connection between nodes. A graphical representation of a graph might look like this:

Graph Example

Note that the edges (lines) have numbers on them. This is often referred to as the "cost" of making a transition. Not every graph needs to have costs, but it's particularly valuable when you're trying to optimize a path through the graph. There's a very famous problem called the Travelling Salesman Problem. It's considered an NP-Hard problem, or the hardest of the hard. You can also consider driving directions on maps. Imagine intersections and addresses are the nodes, and the streets are edges. If you know the distance from one intersection node to the next, you could create fast driving directions. An example of this might look like so:

Graph Example 2.

Now, just looking at it, what is the best path from Second & Oak to Third & Market? You can't go on Market Alley, because it's a one-way street! Now, imagine this graph covers every single street in the country, with precise distance values. Then, add in traffic data for a more precise cost. Then, add in some intersections not allowing certain turns, construction issues, and a visualizer, and you're well on your way to competing with Google Maps!

Graphs in SQL

It's entirely possible to represent a graph in SQL. This has been done multiple times, and there are many versions of it. The classic is the adjacency list, where you have a "parent" column:

Adjacency List Example.

Note that to find the nodes that are 1 hop from A, it takes the query at the bottom of the graphic. You have to self-join on the table, and use the parent column to relate them correctly. Now, how would we find the path from A to I? We'd have to do a 4-way join:

SELECT node1.*, node2.*, node3.*, node4.*
  FROM nodes AS node1
  LEFT JOIN nodes AS node2 ON node1.title=node2.parent
  LEFT JOIN nodes AS node3 ON node2.title=node3.parent
  LEFT JOIN nodes AS node4 ON node3.title=node4.parent
  WHERE node1.title='A' AND node4.title='I';

Let's say you have a social media site, that has 2 million members, and a bunch of friendship links. In order to find out the friends-of-friends list, it would require a 3-way join. Databases really, really don't like it when you do too many self-joins. In addition, there is very little discovery. You can use the self joins to see if there is a 3-hop path from A to I, but is there a 4-hop path? A 2-hop path? What about a path from D to I? These variable length routes from one node to the next are particularly insidious for SQL. You are forced to write your own Graph Traversal algorithm, making sure not to get caught in loops for a fully-connected graph!

Graph Database

Enter the graph database. It's job is to store data and provide interfaces that make dealing with graphs easier. Neo4j is a particularly good and well-supported option (with a free version), so we'll be using it for this article. Neo4j includes a graph language called Cypher, and we'll show you how to use that as well.

Graph databases are generally optimized for graph operations. That is, finding a path between two nodes, the lowest cost transition between nodes, or finding related groups of data, like friend circles within a larger social network.

In addition, graphs are great at something called inferred data. Let's take mapping again as an example. Let's say you have a coffee shop, and it has an edge connection to Seattle, as a "located_at" relationship. If you then link Seattle to Washington, and Washington to the USA, then you have inferred facts about the coffee shop being located in Seattle, in Washington, and the USA. You can also do this with Taxonomies. You might remember these from school: kingdom, phylum, class, order, family, genus, and species. Here's a pretty picture I found online:

Animal Taxonomies

In this case, you can infer that the Housecat is of the phylum Chordata. Chordata is apparently animals that have a spinal column and a tail of some sort. Notice that Primates (which is where humans would be) are a class under Chordata. You know how some people have something called a Vestigial Tail? That's what keeps us connected to this family tree.

Data and Labels

Before we get started on the cool parts, let's add a few more concepts. A node is a pretty useless construct without meaning. So, in Neo4j, you can give a node properties -- key/value pairs. You can also give a relationship properties. These let you add data like timestamps, cost estimations, geographic coordinates, and more to your data.

Given that the data is key/value pairs, you can see the immediate need to group related nodes together. That is, for our example, grouping all of the movie nodes together, or all of the user nodes. So when I talk about labels, you can sort of think of them like SQL tables, where the properties are more like columns.

Graph Example

Now we get to play with the graph. If you have a fresh neo4j instance, you'll have a screen that looks like this:

Neo4j First Screen

You'll notice they have a movie graph example as well. Theirs connects actors and directors to movies, while ours is focused on ratings.

Cypher and Graph Basics

To create a node, we will be using their Cypher language. There is a language tour on the screenshot above, but I think if you have some SQL background you can follow along:

Neo4j Stupid Movie

First, I created a movie with the CREATE keyword, and gave it a collection of properties. I also named the node 'movie' -- but that only applies for this one query. The part after that, ':Movie', means that neo4j should apply the movie label.

Next, I ran a MATCH query. I again named it 'movie', which comes in handier here. I then compare the name to the node I want (almost exactly like SQL), and then return the movie.

This is very simple, as there are no relationships. Let's create another movie and a relationship:

Neo4j Matrix

Read these from the bottom-up. First, I created an actor (notice it created the label), then queried using an alternate MATCH syntax. We have both of the nodes. Then, I match them both again, but this time create a relationship between them. Finally, I use the cypher MATCH again, but query for any relationships with the type "acted_in", where the movie is "The Matrix," and we get our actor back. Easy, right?

Our Dataset

In the data set I linked to above, there are 3 files we're using:

  • users.dat - This is a list of internal user ids, related to twitter ids
  • movies.dat - This is a list of movie IDs, titles, year of release, and genres.
  • ratings.dat - This is a list that contains user id and movie id tuples, along with a score, correlating a rating to a person and a movie.

In order to start using these effectively, we need to bring them into Neo4j. I'll leave out the gory details, but I basically just wrote a ruby script to convert the 3 files to a file that has Cypher commands:

After that, you can import it to your neo4j app:

bin/neo4j-shell -file /path/to/cypherscript.txt

Now we can start with some fun stuff. Let's look up "The Matrix" again, this time returning the average rating:

Neo4j Matrix 2

If you check the IMDb page, the movie has an average rating of 8.7, so the ratings seem fairly representative. This is pretty easy with SQL, so let's try something a little more complex. Let's find out, if somebody rated The Matrix 8 or higher, what other movies they're likely to like. The screen shots started clipping the query, so I've included it:

MATCH (n:Movie {title: "The Matrix"})<-[r:Rated]-(u)-[or:Rated]->(m:Movie)
WHERE r.val >= 8 AND or.val >= 8
RETURN m.title, count(r) ORDER BY count(r) DESC LIMIT 10

The Matrix Recommendations

Seems like a pretty good list of movies! However, if you're trying to find movies that are like The Matrix, you probably want to restrict the genres as well. So, what genres are applied to The Matrix?

MATCH (n:Movie {title: "The Matrix"})-[:has_genre]->(g:Genre) RETURN g.name;

The Matrix Genres

Ok, let's find other top-rated movies that have Action, Adventure, and Sci-Fi genres:

MATCH ()-[r:Rated]->(n:Movie)-[:has_genre]->(g:Genre)
WHERE g.name="Action" or g.name="Adventure" or g.name="Sci-Fi"
WITH avg(r.val) as rating, n.title as title, count(distinct g.name) as genres
WHERE genres = 3
RETURN title, rating
ORDER BY rating DESC LIMIT 10;

Other Movies With Genres

The "WITH" clause allows us another chance to transform results, so we can do additional filtering. In this case, we're counting the genres and then only returning results that have a count of three. This guarantees only results that contain all three genres -- but those movies can contain other genres. You'll notice there are some weird results in there, and if you search them on IMDb, you'll notice the ratings don't match up either. For a lot of these results, there are only 1 or 2 ratings. So, let's add an additional clause for at least 1000 ratings:

MATCH ()-[r:Rated]->(n:Movie)-[:has_genre]->(g:Genre)
WHERE g.name="Action" or g.name="Adventure" or g.name="Sci-Fi"
WITH avg(r.val) as rating, n.title as title, count(r) as numratings, count(distinct g.name) as genres
WHERE genres = 3 and numratings >= 1000
RETURN title, numratings, rating
ORDER BY rating DESC LIMIT 10;

Better Results

These results are looking much more like we'd expect. Again, we show that you can also count the relationships. Because we put no restrictions on the other end of the Rated relationship, any user can match, which means all ratings for everyone.

So that's the basics of how you'd use a graph database. They're great for exploring hierarchical data, as well as finding relationships between items.

Let's take this one step further before we wrap up. We showed how to get data for one movie, but what if we want to start at a user? Let's pick one of the users who reviewed The Matrix and has some amount of reviews:

MATCH (u:User)-[:Rated]->(m:Movie {title: "The Matrix"})
WITH u
MATCH (u)-[r:Rated]->(m2:Movie)
RETURN u.id, count(r) as numratings

Potential Users

Here, we find all users who have rated The Matrix, then for each user, we execute another query to see how many ratings they have. We're going to take a look at user 9520. Based on what they've liked, what other movies should they see? Let's just say 6 or above is "Liking" a movie. So, how should we do it? We take this user, find all movies they've rated 6 or higher, find out who else has rated the movie 6 or higher, then find all movies each of those people have rated 6 or higher that aren't also rated by our original user, and figure out the weight based on the number of paths to that movie. How's that for a gnarly query? As a homework exercise, you should figure out what the SQL query for this would be. It's not pretty.

MATCH (u:User {id: "9520"})-[r:Rated]->(m:Movie)<-[r2:Rated]-(u2:User)
WHERE r.val >= 6 AND r2.val >= 6
WITH u2, collect(m) AS m
MATCH (u2)-[r3:Rated]->(m3:Movie)
WHERE r3.val >= 6 AND NOT m3 IN m
RETURN count(*) AS weight, m3.title AS movie
ORDER BY weight DESC, movie
LIMIT 25

Recommendations For User

And there we have a recommendation engine in 9 lines of code.