Survey of Graph Database Models
Byoung Ju Yang
2011. 04. 01.
IDS Lab., Seoul National University
Copyright 2008 by CEBT
Table of contents
Survey of Graph Database Models
Renzo Angles, Claudio Gutierrez
ACM Computing Surveys, Vol. 40, No. 1, Article 1 (2008)
Data structures, Query languages, and Integrity constraints
1. Introduction
2. Graph Data Modeling
3. Graph Database Models (~2002)
The latest Graph Database Models
Neo4j, FlockDB
Blueprints
Sharding
1. Introduction
2-1. What is a Graph Data Model?
Data Structure(Schema)
Represented by a graph, or by a data structure generalizing the notion of a graph (hypergraph)
- (un)labeled, (un)directed
Separation between schema and data in most cases.
Data Manipulation (Query languages)
Expressed by graph transformations, or by operations whose main primitives are on graph features like paths, neighborhoods, subgraphs, graph patterns, connectivity, and graph statistics.
Integrity constraints
Enforce graph data consistency
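The three components above (data structure, manipulation, integrity constraints) can be illustrated with a minimal sketch of a labeled, directed graph. The class and method names (LabeledDigraph, add_node, add_edge, neighbors) are hypothetical and chosen only for this illustration; they do not come from the survey.

```python
from collections import defaultdict


class LabeledDigraph:
    """Directed graph with labeled nodes and labeled edges (illustrative only)."""

    def __init__(self):
        self.node_labels = {}               # node id -> node label
        self.out_edges = defaultdict(list)  # node id -> [(edge label, target id)]

    def add_node(self, node_id, label):
        self.node_labels[node_id] = label

    def add_edge(self, source, edge_label, target):
        # Minimal integrity constraint: both endpoints must already exist.
        if source not in self.node_labels or target not in self.node_labels:
            raise ValueError("edge endpoints must be existing nodes")
        self.out_edges[source].append((edge_label, target))

    def neighbors(self, node_id, edge_label=None):
        # Manipulation primitive on a graph feature: the neighborhood of a node.
        return [target for (label, target) in self.out_edges[node_id]
                if edge_label is None or label == edge_label]


g = LabeledDigraph()
g.add_node("p1", "Person")
g.add_node("p2", "Person")
g.add_node("c1", "City")
g.add_edge("p1", "knows", "p2")
g.add_edge("p1", "lives_in", "c1")
```

Here the node and edge labels play the role of a very small schema, and the endpoint check stands in for an integrity constraint.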
2-2. Why a Graph Data Model?
It allows for a more natural modeling of data
All the information about an entity can be kept in a single node, with related information shown by arcs connected to it.
Queries can refer directly to this graph structure
Such as finding shortest paths, determining certain subgraphs, and so forth.
For implementation, graph databases may provide special graph storage structures and efficient graph algorithms for realizing specific operations.
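As an example of a query that refers directly to the graph structure, a shortest path in an unweighted graph can be found with breadth-first search. This is a minimal sketch, assuming the graph is held as a plain adjacency dictionary; the function name shortest_path and the sample data are made up for illustration.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was first reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor in previous:
                continue  # already reached by a path that is no longer
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor chain back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
path = shortest_path(graph, "a", "e")
```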
2-3. Comparison with other DB Models
Physical DB Models
Hierarchical(1976), network(1976) models
Lack a good abstraction level
Relational DB Models
Introduced a separation between the physical and logical levels
Landmark development (mathematical foundation)
Geared toward simple record-type data (schema is known)
Not easy to integrate different schemas
The query language cannot explore the underlying graph of relationships among the data (paths, neighborhoods, patterns)
2-3. Comparison with other DB Models
Semantic DB Models
DB designer can represent objects and their relations in a natural and clear manner by using high-level abstraction concepts (E-R)
Relevant to graph DB (graph-like structures)
Object-oriented DB Models
For data-intensive domains (knowledge bases, engineering applications)
Permit much richer structures but still require predefined schema
Related to graph DB (use graph structures in definitions)
Semi-structured DB Models
Irregular, implicit, and partial structures
2-4. Motivations and Applications
Motivations
Real-life applications where component interconnectivity is a key feature
Applications
Classical applications
Complex networks
- Social networks (people, groups)
- Information networks (citation, word thesaurus)
- Technological networks (spatial and geographical)
- Biological networks (genomics)
3-1. Brief historical overview
3-2. Data Structures
Hypernode
A simple flat graph is not good at presenting information to the user
A hypernode provides inherent support for this (nested graphs)
Hypergraph
Generalization of a graph
2-uniform hypergraph is a graph
[Figure: the same example drawn in two ways, with nodes Person1, Person2, and Person3 each linked to a value by the attribute "name"]
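The statement that a 2-uniform hypergraph is a graph can be checked mechanically: a hypergraph is k-uniform when every hyperedge contains exactly k nodes, so with k = 2 each hyperedge is an ordinary undirected edge. The sketch below assumes hyperedges are represented as frozensets of node names; the function names are hypothetical.

```python
def is_k_uniform(hyperedges, k):
    """True if every hyperedge connects exactly k nodes."""
    return all(len(edge) == k for edge in hyperedges)


def to_graph_edges(hyperedges):
    """Convert a 2-uniform hypergraph to a sorted list of undirected edges."""
    if not is_k_uniform(hyperedges, 2):
        raise ValueError("only a 2-uniform hypergraph is an ordinary graph")
    return sorted(tuple(sorted(edge)) for edge in hyperedges)


two_uniform = [frozenset({"a", "b"}), frozenset({"b", "c"})]
mixed = two_uniform + [frozenset({"a", "b", "c"})]  # one hyperedge joins 3 nodes
```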
3-3. Integrity Constraints
Schema-instance consistency
The instance should contain only concrete entities and relations of the entity types and relation types defined in the schema
Schema-instance separation
In most models there is a separation
An exception is the hypernode (dynamic DB)
Constraints concentrate on the creation of consistent instances and on the correct identification and referencing of entities.
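Schema-instance consistency can be sketched as a check that every node and edge of an instance is typed by the schema and that every edge refers to existing entities. The schema layout, the type names (Person, City, Robot), and the function name violations are assumptions made for this example only.

```python
schema = {
    "node_types": {"Person", "City"},
    # Allowed edges as (source type, relation label, target type).
    "edge_types": {("Person", "knows", "Person"), ("Person", "lives_in", "City")},
}


def violations(schema, nodes, edges):
    """nodes: id -> type; edges: list of (source id, label, target id)."""
    problems = []
    for node_id, node_type in nodes.items():
        if node_type not in schema["node_types"]:
            problems.append(f"node {node_id}: unknown type {node_type}")
    for source, label, target in edges:
        if source not in nodes or target not in nodes:
            # Correct reference of entities: no dangling endpoints.
            problems.append(f"edge {source}-{label}->{target}: dangling reference")
            continue
        signature = (nodes[source], label, nodes[target])
        if signature not in schema["edge_types"]:
            problems.append(f"edge {source}-{label}->{target}: not in schema")
    return problems


nodes = {"p1": "Person", "c1": "City", "x1": "Robot"}
edges = [("p1", "lives_in", "c1"), ("c1", "knows", "p1"), ("p1", "knows", "p9")]
found = violations(schema, nodes, edges)
```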
3-4. Query and Manipulation Languages
There is substantial work focused on query languages, the problem of querying graphs, the visual presentation of results, and graphical query languages
Some graph-oriented object models regard database transformations as graph transformations based on graph-pattern matching
GOOD, GOAL, etc.
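Graph-pattern matching can be sketched as a search for variable bindings under which every edge of a pattern is also an edge of the data graph. This is a generic backtracking illustration, not the semantics of GOOD or GOAL; the triple representation and the convention that names starting with "?" are variables are assumptions of this example.

```python
def match(pattern, edges, binding=None):
    """Yield bindings under which every pattern triple is an edge of the data."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for edge in edges:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(first, edge):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(rest, edges, candidate)


data = [
    ("ann", "knows", "bob"),
    ("bob", "lives_in", "seoul"),
    ("ann", "lives_in", "busan"),
]
# Pattern: someone who knows a person, and the city that person lives in.
pattern = [("?x", "knows", "?y"), ("?y", "lives_in", "?city")]
results = list(match(pattern, data))
```

A transformation-based language would then add, delete, or relabel nodes and edges for each binding found; that step is omitted here.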
3. Summary
NoSQL Databases
Schema-less
Shared nothing architecture
Each server uses only its own local storage (faster)
Elasticity
Able to add servers without downtime
Sharding
Asynchronous replication
BASE instead of ACID
NoSQL Database Models
Graph Database Models
Scalability
ACID vs. BASE
Complexity
Relational
- Strengths: no redundancy or information loss (normalization); powerful SQL; optimization by the RDBMS
- Weaknesses: performance problems in deep queries (many joins); no schema evolution, etc.
Graph – property graph model
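The "deep queries need many joins" point can be made concrete with a small comparison: in a relational edge table every additional hop adds another self-join, while an adjacency structure is simply followed hop by hop. The sketch uses Python's built-in sqlite3 module as a stand-in relational engine; the table name knows and the sample data are invented for this example.

```python
import sqlite3

edges = [("ann", "bob"), ("bob", "cho"), ("cho", "dan")]

# Relational style: a 3-hop query needs three copies of the edge table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO knows VALUES (?, ?)", edges)
three_hops_sql = [row[0] for row in conn.execute(
    """
    SELECT k3.dst
    FROM knows k1
    JOIN knows k2 ON k2.src = k1.dst
    JOIN knows k3 ON k3.src = k2.dst
    WHERE k1.src = ?
    """,
    ("ann",),
)]

# Graph style: follow adjacency lists, one loop iteration per hop.
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)


def hops(start, depth):
    """Nodes reached after exactly `depth` hops from start."""
    frontier = [start]
    for _ in range(depth):
        frontier = [n for node in frontier for n in adjacency.get(node, [])]
    return frontier


three_hops_graph = hops("ann", 3)
```

Both forms return the same answer; the difference the slide points at is that the SQL text and the join work grow with the depth of the query.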
The latest Graph Database Models
AllegroGraph RDFStore
HyperGraphDB
InfoGrid
Neo4j
FlockDB
Sones
Virtuoso
The latest Graph Database Models
License
Distribution
The only truly distributed solution is HyperGraphDB
Indexing
In Neo4j, indexing is not the default behavior (indexing via Lucene, Solr)
Storage system
General vs. Special
HyperGraphDB uses Berkeley DB
APIs
Most of them provide Java and web APIs
Neo4j
Fully ACID-compliant transactional graph DB written in Java
High performance
Handles several billion nodes, relationships and properties
1-2 million traversals per second
- constant time (independent of total size)
Example code
Node creation
Find friend
Neo4j
Example code
Traversal
Indexing
Neo4j
FlockDB
Goals
High rate of add/update/remove operations
Complex set arithmetic queries
Paging through query result sets containing millions of entries
Ability to ‘archive’ and later restore archived edges
Horizontal scaling including replication
Non-goals
Multi-hop queries (or graph-walking queries)
Automatic shard migrations
Characteristics
Optimized for very large adjacency lists (no traversal)
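Two of the goals above, set arithmetic over adjacency lists and paging through large result sets, can be sketched on sorted ID lists. This is not FlockDB's API; the function names, the cursor convention (the last ID already returned), and the sample IDs are assumptions for illustration.

```python
from bisect import bisect_right


def intersect(left, right):
    """Intersect two sorted ID lists with a single merge pass."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def page(sorted_ids, cursor, count):
    """Return up to `count` IDs greater than `cursor`, plus the next cursor.

    `count` must be at least 1; the next cursor is None on the last page.
    """
    start = bisect_right(sorted_ids, cursor)
    chunk = sorted_ids[start:start + count]
    next_cursor = chunk[-1] if start + count < len(sorted_ids) else None
    return chunk, next_cursor


followers_of_a = [3, 5, 8, 13, 21]
followers_of_b = [5, 8, 21, 34]
common = intersect(followers_of_a, followers_of_b)
first_page, cursor = page(common, 0, 2)
second_page, end = page(common, cursor, 2)
```

Keeping the lists ordered is what makes both operations cheap: intersection is one linear pass, and a page is found by a binary search on the cursor rather than by skipping over earlier entries.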
FlockDB - Twitter
Previous models (could not have both)
Relational tables – handling write operations
Key-value storage – paging through giant result sets
Implementation goals
Write the simplest possible thing that could work
Use off-the-shelf MySQL as the storage engine
Allow horizontal partitioning
Allow write operations to arrive out of order or be processed more than once (allow redundant work rather than lost work)
Twitter (April 2010)
More than 13 billion edges, 20k writes/second, 100k reads/second
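One way to tolerate out-of-order and repeated writes is to make every write idempotent and order-independent, for example by tagging each write with a timestamp and keeping only the newest one per edge. The sketch below illustrates that general technique; it is an assumption of this example, not a description of FlockDB's actual code, and the function names and state strings are hypothetical.

```python
def apply_write(store, source, destination, state, timestamp):
    """Keep the newest write per edge; older or repeated writes change nothing."""
    key = (source, destination)
    incoming = (timestamp, state)  # the state breaks ties between equal timestamps
    if key not in store or incoming > store[key]:
        store[key] = incoming


def replay(writes):
    store = {}
    for write in writes:
        apply_write(store, *write)
    return store


writes = [
    (1, 2, "normal", 10),   # edge 1 -> 2 added at time 10
    (1, 2, "removed", 20),  # edge 1 -> 2 removed at time 20
    (1, 3, "normal", 15),
]
in_order = replay(writes)
# Same writes, shuffled, with one of them delivered twice.
shuffled_with_duplicate = replay([writes[1], writes[2], writes[0], writes[1]])
```

Because the result depends only on the set of writes and not on their arrival order, a write can safely be retried, which is the "redundant work rather than lost work" trade-off.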
FlockDB - Twitter
Stores graphs as sets of edges
Primary key
(a compound key of the source ID, state, and position)
When an edge is deleted, the row is just marked ‘removed’ without being deleted from MySQL
Keep only a compound primary key and a secondary index for each row, and answer all queries from a single index.
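The storage layout described above can be sketched as a single edge table with a compound primary key on (source ID, state, position) and one secondary index. The example uses Python's built-in sqlite3 in place of MySQL so that it runs standalone; the column names, the numeric state encoding, and the helper functions are assumptions for illustration, not FlockDB's actual schema.

```python
import sqlite3

NORMAL, REMOVED = 0, 1  # hypothetical encoding of the edge state

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE edges (
        source_id      INTEGER NOT NULL,
        destination_id INTEGER NOT NULL,
        state          INTEGER NOT NULL,
        position       INTEGER NOT NULL,
        PRIMARY KEY (source_id, state, position)
    )
""")
# Secondary index: look up one specific edge by its two endpoints.
conn.execute(
    "CREATE UNIQUE INDEX edges_by_pair ON edges (source_id, destination_id)")


def add_edge(source_id, destination_id, position):
    conn.execute("INSERT INTO edges VALUES (?, ?, ?, ?)",
                 (source_id, destination_id, NORMAL, position))


def remove_edge(source_id, destination_id):
    # Soft delete: the row stays in the table and is only marked as removed.
    conn.execute(
        "UPDATE edges SET state = ? WHERE source_id = ? AND destination_id = ?",
        (REMOVED, source_id, destination_id))


def active_destinations(source_id):
    # Served by the primary key: fixed source and state, ordered by position.
    rows = conn.execute(
        "SELECT destination_id FROM edges "
        "WHERE source_id = ? AND state = ? ORDER BY position",
        (source_id, NORMAL))
    return [row[0] for row in rows]


add_edge(1, 10, 100)
add_edge(1, 11, 200)
add_edge(1, 12, 300)
remove_edge(1, 11)
```

Since the removed row is kept, restoring an archived or removed edge is a matter of flipping its state back.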
Sharding in Graph DB
Especially hard in graph DB due to traversal
Unless we store the entire graph on a single machine, we are forced to query across machine boundaries (expensive)
Neo4j provides a master/slave structure (still has limits)
FlockDB (Twitter) does not consider traversal (it is interested in 1-level relations)
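The cost of traversal over a sharded graph can be sketched by counting how many steps of a path cross a machine boundary. The example assumes integer node IDs placed by a simple modulo rule; the functions shard_of and cross_shard_hops are hypothetical names used only here.

```python
def shard_of(node_id, shard_count):
    """Naive placement: node IDs are spread over shards by modulo."""
    return node_id % shard_count


def cross_shard_hops(path, shard_count):
    """Count the steps of a traversal whose two endpoints live on different shards."""
    return sum(
        1
        for a, b in zip(path, path[1:])
        if shard_of(a, shard_count) != shard_of(b, shard_count)
    )


traversal = [0, 1, 2, 3, 4, 6]  # node IDs visited by one multi-hop query
on_one_machine = cross_shard_hops(traversal, 1)
on_two_shards = cross_shard_hops(traversal, 2)
```

With placement that ignores graph structure, most hops of a multi-hop query become remote calls, whereas a 1-level query for a single source node touches only the shard that holds that node.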
How to shard?
A proposal: gravity
Localizing data leads to greater performance (like cache)
Shard graph data based on gravity
Blueprints
A collection of interfaces, etc., for the property graph DB model
Analogous to JDBC, but for graph DBs
Provides a common set of interfaces that lets developers plug-and-play their graph DB backend (Pipes, Gremlin, Rexster)
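The plug-and-play idea can be sketched as one abstract interface with interchangeable backends, so that query code is written once against the interface. Blueprints itself is a Java library; the Python class and method names below (PropertyGraph, add_vertex, add_edge, out_neighbors) are hypothetical and are not its API.

```python
from abc import ABC, abstractmethod


class PropertyGraph(ABC):
    """Common interface that application code depends on."""

    @abstractmethod
    def add_vertex(self, vertex_id, **properties): ...

    @abstractmethod
    def add_edge(self, out_id, label, in_id): ...

    @abstractmethod
    def out_neighbors(self, vertex_id, label): ...


class AdjacencyBackend(PropertyGraph):
    """Backend that stores outgoing edges per vertex."""

    def __init__(self):
        self.properties, self.adjacency = {}, {}

    def add_vertex(self, vertex_id, **properties):
        self.properties[vertex_id] = properties
        self.adjacency.setdefault(vertex_id, [])

    def add_edge(self, out_id, label, in_id):
        self.adjacency[out_id].append((label, in_id))

    def out_neighbors(self, vertex_id, label):
        return [v for (l, v) in self.adjacency[vertex_id] if l == label]


class EdgeListBackend(PropertyGraph):
    """Backend that stores all edges in one flat list."""

    def __init__(self):
        self.properties, self.edges = {}, []

    def add_vertex(self, vertex_id, **properties):
        self.properties[vertex_id] = properties

    def add_edge(self, out_id, label, in_id):
        self.edges.append((out_id, label, in_id))

    def out_neighbors(self, vertex_id, label):
        return [i for (o, l, i) in self.edges if o == vertex_id and l == label]


def friends_of_friends(graph, start):
    """Written once against the interface; works with any backend."""
    result = []
    for friend in graph.out_neighbors(start, "knows"):
        for candidate in graph.out_neighbors(friend, "knows"):
            if candidate != start and candidate not in result:
                result.append(candidate)
    return result


def build(graph):
    for name in ("ann", "bob", "cho"):
        graph.add_vertex(name, name=name)
    graph.add_edge("ann", "knows", "bob")
    graph.add_edge("bob", "knows", "cho")
    graph.add_edge("bob", "knows", "ann")
    return graph
```

Swapping AdjacencyBackend for EdgeListBackend changes nothing in friends_of_friends, which is the same decoupling JDBC gives relational code.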