Neo4j: a 20-minute introduction
TRANSCRIPT
By András Fehér
THE REAL VALUE IS IN THE RELATIONSHIPS
• Google: Knowledge Graph
• Facebook: Unicorn
• Twitter: FlockDB
• ...
WHAT IS THE PROBLEM WITH RDBMS? (PART 1)
The basic question behind every recommendation system: "User 99 has bought products 1, 2, 3 and 765 so far. List the other products that other users bought together with products 1, 2, 3 or 765, in descending order of popularity."
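To make the question concrete, here is a minimal in-memory sketch of it in Python. The purchase data, the `purchases` dictionary and the `co_purchased` helper are all hypothetical and only illustrate the logic. In a relational schema the same question typically needs several joins over an orders table.

```python
from collections import Counter

# Hypothetical purchase history: user id -> set of product ids.
purchases = {
    99: {1, 2, 3, 765},
    7: {1, 2, 10, 11},
    8: {3, 10, 12},
    9: {765, 10, 11},
    10: {500, 501},  # shares nothing with user 99
}

def co_purchased(purchases, user_id):
    """Rank products bought by other users who share a product with user_id."""
    mine = purchases[user_id]
    popularity = Counter()
    for other, basket in purchases.items():
        # Skip the user themselves and users with no product in common.
        if other == user_id or not (basket & mine):
            continue
        # Count only products the user has not bought yet.
        popularity.update(basket - mine)
    # Most popular first; ties broken by product id for a stable order.
    return sorted(popularity.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = co_purchased(purchases, 99)
print(ranking)
```

Each entry of `ranking` is a (product id, number of overlapping buyers) pair.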
WHAT IS THE PROBLEM WITH RDBMS? (PART 2)
“Who are Bob’s friends-of-friends-of-friends?”
“What is the shortest path between two specific friends?”
...
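The first question above can be sketched as a depth-limited breadth-first traversal. The friendship data and the `friends_at_depth` helper below are hypothetical. The point is only that each extra level is one more hop along existing edges, whereas a relational schema usually needs one more self-join per level.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list.
friends = {
    "Bob": ["Alice", "Carl"],
    "Alice": ["Bob", "Dana"],
    "Carl": ["Bob", "Dana"],
    "Dana": ["Alice", "Carl", "Eve"],
    "Eve": ["Dana", "Finn"],
    "Finn": ["Eve"],
}

def friends_at_depth(graph, start, depth):
    """Return people whose shortest distance from start is exactly depth hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == depth:
            continue  # do not expand beyond the requested depth
        for friend in graph[person]:
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return sorted(p for p, d in distance.items() if d == depth)

# Bob's friends-of-friends-of-friends in the sample data.
third_level = friends_at_depth(friends, "Bob", 3)
print(third_level)
```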
BASICS: WHAT IS A GRAPH?
• Origin: Euler, 18th century
• It contains nodes and relationships.
• Nodes contain properties (key-value pairs).
• Nodes can be labeled with one or more labels.
• Relationships are named and directed, and always have a start and end node.
• Relationships can also contain properties.
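The model described above can be written down as two small data structures. This is only an illustrative Python sketch, not how Neo4j stores data. The `Node` and `Relationship` classes and the `since` property are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                      # one or more labels
    properties: dict = field(default_factory=dict)   # key-value pairs

@dataclass
class Relationship:
    type: str                                        # the relationship's name
    start: Node                                      # direction: start -> end
    end: Node
    properties: dict = field(default_factory=dict)   # relationships carry properties too

joe = Node({"Person"}, {"name": "Joe"})
bob = Node({"Person"}, {"name": "Bob"})
# "since" is a made-up property, used only to show a relationship property.
friendship = Relationship("FRIEND", joe, bob, {"since": 2010})
print(friendship.start.properties["name"], "-[", friendship.type, "]->",
      friendship.end.properties["name"])
```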
GRAPH DATABASES ON THE MARKET
• Non-native storage: data in general purpose DB
• Native processing: index-free adjacency
SOCIAL NETWORK SPEED TEST
1 000 000 people each with approximately 50 friends:
USE CASES *
• Fraud Detection
• Graph-Based Search
• Identity and Access Management
• Master Data Management
• Network and IT Operations
• Real-Time Recommendations
• Social Network
* Detailed examples from Neo4j
DATA MODELING
Relational database:
• concept -> logical model -> physical model
• big gap between concept and DB
• structure and data volume determine query speed
• hard to change schema
Graph database:
• concept directly to DB
• no gap between concept and DB
• query speed not influenced by structure or data volume
• easy to change connections
CYPHER – GRAPH DATABASE QUERY LANGUAGE
[Diagram: a Person node (name: Joe) linked by a FRIEND relationship to a Person node (name: Bob)]
(:Person{name:"Joe"})-[:FRIEND]->(:Person{name:"Bob"})
• Other query languages: SPARQL, Gremlin ...
• Case sensitive (labels, relationship types and property names; keywords are not)
• Most human friendly
CREATING SOME TEST DATA IN CYPHER
// creating nodes
create (:Person{name:"Tom Hanks"});
....
// creating a relationship between two specific nodes
match (a:Person),(b:Movie)
where a.name='Ron Howard' and b.title='The Da Vinci Code'
create (a)-[r:DIRECTED]->(b) return r;
....
// setting a relationship property
match (:Person{name:"Tom Hanks"})-[n:KNOWS]->(:Person{name:"Ron Howard"}) set n.since=1987;
....
// deleting a relationship
match (a)-[r:KNOWS]->(b)
where a.name='Matt Damon' and b.name='Matt Damon'
delete r;
QUERYING DATA IN CYPHER
// whom does Tom Hanks know?
match (:Person{name:"Tom Hanks"})-[r:KNOWS]->(b) return b;
// who knows Steven Spielberg?
match (:Person{name:"Steven Spielberg"})<-[:KNOWS]-(b) return b;
// which films has Tom Hanks acted in?
match (:Person{name:"Tom Hanks"})-[:ACTED_IN]-(b) return b;
// delete by id
match (n) where ID(n)=11 delete n;
// get Steven Spielberg's acquaintances 3 levels deep
match (:Person{name:"Steven Spielberg"})-[:KNOWS]-(b)-[:KNOWS]-(c)-[:KNOWS]-(d)
return b, c, d;
A BIGGER EXAMPLE
MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors)
RETURN tom, m, coActors
Tom Hanks’ co-actors:
FINDING THE SHORTEST PATH
MATCH p=shortestPath(
  (kevin:Person {name:"Kevin Bacon"})-[*]-(meg:Person {name:"Meg Ryan"})
)
RETURN p
The shortest path between Kevin Bacon and Meg Ryan:
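For intuition about what a shortest-path search over unweighted relationships does, here is a plain breadth-first search in Python. It is a sketch of the general technique, not a description of how Neo4j implements shortestPath. The graph below is invented ("Movie X", "Person Y" and so on are placeholders, not real movie data).

```python
from collections import deque

# Hypothetical undirected graph; every edge is invented for illustration.
graph = {
    "Kevin Bacon": ["Movie X"],
    "Movie X": ["Kevin Bacon", "Person Y", "Person W"],
    "Person Y": ["Movie X", "Movie Z"],
    "Person W": ["Movie X", "Movie V"],
    "Movie V": ["Person W", "Person U"],
    "Person U": ["Movie V", "Movie Z"],
    "Movie Z": ["Person Y", "Person U", "Meg Ryan"],
    "Meg Ryan": ["Movie Z"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor chain back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # goal not reachable from start

path = shortest_path(graph, "Kevin Bacon", "Meg Ryan")
print(" - ".join(path))
```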
RECOMMENDING CO-ACTORS TO TOM HANKS
// coActors: acted in the same movies as Tom
// cocoActors: acted with the coActors in movies that Tom did not act in
MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors),
      (coActors)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cocoActors)
WHERE NOT (tom)-[:ACTED_IN]->(m2)
// strength: how many times the same cocoActor was found
RETURN cocoActors.name AS Recommended, count(*) AS Strength
ORDER BY Strength DESC
Find co-actors' co-actors who haven't worked with Tom Hanks (co-co-actors):
[Diagram: Tom -ACTED_IN-> m (movie) <-ACTED_IN- coActor -ACTED_IN-> m2 (movie) <-ACTED_IN- cocoActor; Tom has no ACTED_IN relationship to m2]
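The counting logic of the recommendation query can be followed step by step in a small Python sketch. The cast lists and the `recommend` helper are hypothetical. The loops mirror the pattern Tom -> m <- coActor -> m2 <- cocoActor, skip every m2 that Tom acted in, and count one "strength" point per matching path.

```python
from collections import Counter

# Hypothetical cast lists: movie -> set of actors (invented data).
cast = {
    "Movie A": {"Tom", "Ann", "Ben"},
    "Movie B": {"Tom", "Ann"},
    "Movie C": {"Ann", "Cid", "Dee"},
    "Movie D": {"Ben", "Cid"},
}

def recommend(cast, actor):
    """Rank co-co-actors by the number of paths that lead to them."""
    own_movies = [m for m, actors in cast.items() if actor in actors]
    strength = Counter()
    for m in own_movies:
        for co_actor in cast[m] - {actor}:
            for m2, actors in cast.items():
                # m2 must feature the co-actor but not the actor himself.
                if co_actor not in actors or actor in actors:
                    continue
                for coco_actor in actors - {co_actor}:
                    strength[coco_actor] += 1
    # Strongest recommendation first; ties broken by name.
    return sorted(strength.items(), key=lambda kv: (-kv[1], kv[0]))

recommended = recommend(cast, "Tom")
print(recommended)
```

In the sample data Cid is reached by three paths (twice through Ann, once through Ben) and Dee by two, so Cid ranks first.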
NEO4J CLUSTER ARCHITECTURE
• Automatic master election
• Writing to slaves is possible, but writing to the master is faster
• Full replication (data redundancy); graph sharding is under development
• Single server capacity: 34 billion nodes, 34 billion relationships, 65 thousand relationship types and 68 billion properties
• Cluster requires a quorum in order to serve write load
• Reads are served by slaves: reads scale linearly
• Exceptionally high write loads: queuing and vertical scaling
• Large graph that does not fit in RAM: cache sharding by routing queries
• Online backups supported (full and incremental)
• Reporting instances are slaves that will never be elected master
DEVELOPMENT
Query tuning:
• execution plan
• profiling
Indexing on properties
Accessing:
• web interface
• REST API
• shell
• embedding in Java applications
• Mazerunner extension (Using Apache Spark and Neo4j for Big Data Graph Analytics)
Utilities:
• neo4j-shell
• neo4j-import
• neo4j-backup
• neo4j-arbiter
RESOURCES
Good official manual
From Relational to Graph: A Developer's Guide