Neo4j: a 20-minute introduction

By András Fehér

THE REAL VALUE IS IN THE RELATIONSHIPS

• Google: Knowledge Graph
• Facebook: Unicorn
• Twitter: FlockDB
• ...

WHAT IS THE PROBLEM WITH RDBMS? (PART 1)

The basic question of every recommendation system: “User 99 has bought products 1, 2, 3 and 765 so far. List the other products that other users bought together with products 1, 2, 3 or 765, in descending order of popularity.”
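For comparison, the same question can be sketched as a single Cypher pattern in a graph model. This is only a sketch under assumptions: the User and Product labels, the BOUGHT relationship type and the id property are hypothetical names, not taken from the slides.

```cypher
// Hypothetical schema: (:User {id})-[:BOUGHT]->(:Product {id})
// Start from user 99, walk to the products they bought, then to the
// other buyers of those products, then to what those buyers also bought.
MATCH (me:User {id: 99})-[:BOUGHT]->(:Product)<-[:BOUGHT]-(other:User),
      (other)-[:BOUGHT]->(rec:Product)
// skip products user 99 already owns
WHERE NOT (me)-[:BOUGHT]->(rec)
// popularity: how many times the product was reached
RETURN rec.id AS product, count(*) AS popularity
ORDER BY popularity DESC
```

In a relational schema the same traversal typically needs several self-joins on the purchases table, which is the difficulty this slide points at.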

WHAT IS THE PROBLEM WITH RDBMS? (PART 2)

“Who are Bob’s friends-of-friends-of-friends?”

“What is the shortest path between two specific friends?”

...

BASICS: WHAT IS A GRAPH?

• Origin: Euler, 18th century
• It contains nodes and relationships.
• Nodes contain properties (key-value pairs).
• Nodes can be labeled with one or more labels.
• Relationships are named and directed, and always have a start and end node.
• Relationships can also contain properties.
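The points above can be sketched in one Cypher statement. The labels, property names and values here are made up for illustration only.

```cypher
// Two nodes and one relationship, created in a single statement.
// joe carries two labels (Person, Customer) and two properties;
// the FRIEND relationship is named, directed (joe -> bob),
// and has a property of its own (since).
CREATE (joe:Person:Customer {name: "Joe", age: 42})
       -[:FRIEND {since: 2010}]->
       (bob:Person {name: "Bob"})
RETURN joe, bob
```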

GRAPH DATABASES ON THE MARKET

• Non-native storage: data kept in a general-purpose DB
• Native processing: index-free adjacency

SOCIAL NETWORK SPEED TEST

1 000 000 people each with approximately 50 friends:

USE CASES *

• Fraud Detection
• Graph-Based Search
• Identity and Access Management
• Master Data Management
• Network and IT Operations
• Real-Time Recommendations
• Social Network

* Detailed examples from Neo4j

http://neo4j.com/customers

DATA MODELING

Relational database:
• concept -> logical model -> physical model
• big gap between concept and DB
• structure and data volume determine query speed
• hard to change the schema

Graph database:
• concept maps directly to the DB
• no gap between concept and DB
• query speed is not influenced by structure or data volume
• easy to change connections

CYPHER – GRAPH DATABASE QUERY LANGUAGE

[Figure: two Person nodes (name: Joe and name: Bob) connected by a FRIEND relationship from Joe to Bob]

(:Person {name:"Joe"})-[:FRIEND]->(:Person {name:"Bob"})

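• Other query languages: SPARQL, Gremlin ...
• Case sensitive for labels, relationship types and property names (keywords are not case sensitive)
• The most human-friendly of these languages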

CREATING SOME TEST DATA IN CYPHER

// creating nodes
create (:Person {name:"Tom Hanks"});
....
// creating a relationship between two specific nodes
match (a:Person), (b:Movie)
where a.name = 'Ron Howard' and b.title = 'The Da Vinci Code'
create (a)-[r:DIRECTED]->(b)
return r;
....
// setting a relationship property
match (:Person {name:"Tom Hanks"})-[n:KNOWS]->(:Person {name:"Ron Howard"})
set n.since = 1987;
....
// deleting a relationship
match (a)-[r:KNOWS]->(b)
where a.name = 'Matt Damon' and b.name = 'Matt Damon'
delete r;

QUERYING DATA IN CYPHER

// whom does Tom Hanks know?
match (:Person {name:"Tom Hanks"})-[r:KNOWS]->(b) return b;

// who knows Steven Spielberg?
match (:Person {name:"Steven Spielberg"})<-[:KNOWS]-(b) return b;

// which films has Tom Hanks acted in?
match (:Person {name:"Tom Hanks"})-[:ACTED_IN]-(b) return b;

// delete by id
match (n) where ID(n)=11 delete n;

// get Steven Spielberg's acquaintances 3 levels deep
match (:Person {name:"Steven Spielberg"})-[:KNOWS]-(b)-[:KNOWS]-(c)-[:KNOWS]-(d)
return b, c, d
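The three-level query above spells out every hop. As a sketch of an alternative, Cypher's variable-length pattern syntax (*min..max) can express the same depth limit in one step; the variable name p is chosen here for illustration.

```cypher
// Everyone reachable from Steven Spielberg over 1 to 3 KNOWS hops,
// in either direction. DISTINCT removes people found on more than one path.
MATCH (:Person {name:"Steven Spielberg"})-[:KNOWS*1..3]-(p)
RETURN DISTINCT p
```

Changing the depth then only means changing the upper bound, rather than adding another hop to the pattern.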

A BIGGER EXAMPLE

MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors) RETURN tom, m, coActors

Tom Hanks’ co-actors:

FINDING THE SHORTEST PATH

MATCH p=shortestPath(
  (kevin:Person {name:"Kevin Bacon"})-[*]-(meg:Person {name:"Meg Ryan"})
)
RETURN p

The shortest path between Kevin Bacon and Meg Ryan:

RECOMMENDING CO-ACTORS TO TOM HANKS

// coActors: acted in the same movies as Tom
// cocoActors: acted in the same movies as the coActors,
// but in movies that Tom did not act in
MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors),
      (coActors)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cocoActors)
WHERE NOT (tom)-[:ACTED_IN]->(m2)
// strength: how many times the same cocoActor was found
RETURN cocoActors.name AS Recommended, count(*) AS Strength
ORDER BY Strength DESC

Find co-actors who haven't worked with Tom Hanks (co-co-actors):

[Figure: Tom -ACTED_IN-> m (movie) <-ACTED_IN- coActor -ACTED_IN-> m2 (movie) <-ACTED_IN- cocoActor]

NEO4J CLUSTER ARCHITECTURE

• Automatic master election
• Writing to slaves is possible, but writing to the master is faster
• Full replication (data redundancy); graph sharding is under development
• Single server capacity: 34 billion nodes, 34 billion relationships, 65 thousand relationship types and 68 billion properties
• Cluster requires a quorum in order to serve write load
• Reads are done on slaves: reads scale linearly
• Exceptionally high write loads: queuing and vertical scaling
• Large graph that does not fit in RAM: cache sharding by routing queries
• Online backups supported, full and incremental
• Reporting instances are slaves that will never be elected master

DEVELOPMENT

Query tuning:
• execution plan
• profiling

Indexing on properties

Accessing:
• web interface
• REST API
• shell
• embedding in Java applications
• Mazerunner extension (Using Apache Spark and Neo4j for Big Data Graph Analytics): http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

Utilities:
• neo4j-shell
• neo4j-import
• neo4j-backup
• neo4j-arbiter
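The query tuning and indexing items above might look like this in practice. This is a sketch that assumes the Person nodes from the earlier examples. The index statement uses the older CREATE INDEX ON form; newer Neo4j releases use a different form (CREATE INDEX FOR ... ON ...), so check the manual for the version in use.

```cypher
// Index the name property of Person nodes (older Neo4j syntax, see note above)
CREATE INDEX ON :Person(name);

// EXPLAIN shows the execution plan without running the query
EXPLAIN MATCH (p:Person {name:"Tom Hanks"}) RETURN p;

// PROFILE runs the query and reports the work done by each plan step
PROFILE MATCH (p:Person {name:"Tom Hanks"}) RETURN p;
```

Comparing the plan before and after creating the index is one way to see whether a lookup by name goes through the index or scans every Person node.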

RESOURCES

Good official manual

From Relational to Graph: A Developer's Guide

http://neo4j.com/docs/stable/re03.html
