
Survey of Graph Database Models

Byoung Ju Yang

2011. 04. 01.

IDS Lab., Seoul National University

Copyright 2008 by CEBT

Table of contents

Survey of Graph Database Models

Renzo Angles, Claudio Gutierrez

ACM Computing Surveys, Vol. 40, No. 1, Article 1 (2008)

Data structures, Query languages, and Integrity constraints

1. Introduction

2. Graph Data Modeling

3. Graph Database Models (~2002)

The latest Graph Database Models

Neo4j, FlockDB

Blueprints

Sharding



1. Introduction



2-1. What is a Graph Data Model?

Data Structure (Schema)

Represented by a graph, or by a data structure generalizing the notion of a graph (hypergraph)

- (un)labeled, (un)directed

Separation between schema and data in most cases.

Data Manipulation (Query languages)

Expressed by graph transformations, or by operations whose main primitives are on graph features like paths, neighborhoods, subgraphs, graph patterns, connectivity, and graph statistics.

Integrity constraints

Enforce graph data consistency



2-2. Why a Graph Data Model?

It allows for a more natural modeling of data

All the information about an entity can be kept in a single node, with related information shown by arcs connected to it.

Queries can refer directly to this graph structure

Such as finding shortest paths, determining certain subgraphs, and so forth.

For implementation, graph databases may provide special graph storage structures and efficient graph algorithms for realizing specific operations.
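
A query that refers directly to the graph structure, such as a shortest path, can be sketched in a few lines. This is a minimal illustration only, assuming an unweighted directed graph held as a Python dict of adjacency lists; the names `graph` and `shortest_path` and the sample data are made up for the example and do not come from the survey.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start node.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical sample data: each key lists the nodes its arcs point to.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
}
```

Breadth-first search is used here because the edges are unweighted; a weighted graph would need a different algorithm such as Dijkstra's.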



2-3. Comparison with other DB Models

Physical DB Models

Hierarchical (1976), network (1976) models

Lack a good abstraction level

Relational DB Models

Introduced a separation between physical and logical levels

Landmark development (mathematical foundation)

Geared toward simple record-type data (schema is known)

Not easy to integrate different schemas

Query language cannot explore the underlying graph of relationships among the data (paths, neighborhoods, patterns)



2-3. Comparison with other DB Models

Semantic DB Models

DB designer can represent objects and their relations in a natural and clear manner by using high-level abstraction concepts (E-R)

Relevant to graph DB (graph-like structures)

Object-oriented DB Models

For data-intensive domains (knowledge bases, engineering applications)

Permit much richer structures but still require predefined schema

Related to graph DB (use graph structures in definitions)

Semi-structured DB Models

Irregular, implicit, and partial structures



2-4. Motivations and Applications

Motivations

Real-life applications where component interconnectivity is a key feature

Applications

Classical applications

Complex networks

- Social networks (people, groups)

- Information networks (citation, word thesaurus)

- Technological networks (spatial and geographical)

- Biological networks (genomics)



3-1. Brief historical overview



3-2. Data Structures

Hypernode

A simple flat graph is not good at presenting information to the user

Hypernode provides inherent support (nested graphs)

Hypergraph

Generalization of a graph

2-uniform hypergraph is a graph


[Figure: Person1, Person2, and Person3, each with a name value ("Young key", "Sang 1", "Yong chin")]
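
The statement that a 2-uniform hypergraph is a graph can be made concrete with a small sketch. It assumes hyperedges are stored as frozensets of node names; the helper names `is_k_uniform` and `to_graph_edges` and the sample data are hypothetical.

```python
def is_k_uniform(hyperedges, k):
    """True if every hyperedge connects exactly k distinct nodes."""
    return all(len(edge) == k for edge in hyperedges)


def to_graph_edges(hyperedges):
    """Convert a 2-uniform hypergraph into a set of ordinary undirected edges."""
    if not is_k_uniform(hyperedges, 2):
        raise ValueError("only a 2-uniform hypergraph is an ordinary graph")
    return {tuple(sorted(edge)) for edge in hyperedges}


# Hypothetical data: the first hypergraph has an edge joining three nodes,
# so it is not a plain graph; the second one is 2-uniform.
hypergraph = [frozenset({"a", "b", "c"}), frozenset({"c", "d"})]
plain = [frozenset({"a", "b"}), frozenset({"b", "c"})]
```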


3-3. Integrity Constraints

Schema-instance consistency

The instance should contain only concrete entities and relations whose entity types and relation types are defined in the schema

Schema-instance separation

In most models there is a separation

An exception is the hypernode (dynamic DB)

Constraints concentrate on the creation of consistent instances and on the correct identification and referencing of entities.
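
Schema-instance consistency can be illustrated with a small checker. This is only a sketch under assumed representations: the schema is a set of node types plus a set of allowed (source type, label, target type) triples, and the instance is a dict of typed nodes plus a list of labeled edges. None of these names come from the survey.

```python
def check_instance(schema, nodes, edges):
    """Return a list of violations of schema-instance consistency.

    schema: {"node_types": set, "edge_types": {(src_type, label, dst_type)}}
    nodes:  {node_id: node_type}
    edges:  [(src_id, label, dst_id)]
    """
    violations = []
    for node_id, node_type in nodes.items():
        if node_type not in schema["node_types"]:
            violations.append(f"node {node_id}: unknown type {node_type}")
    for src, label, dst in edges:
        # Referential check: both endpoints must exist in the instance.
        if src not in nodes or dst not in nodes:
            violations.append(f"edge {src}-{label}->{dst}: dangling reference")
            continue
        # Typing check: the edge must match a relation defined in the schema.
        if (nodes[src], label, nodes[dst]) not in schema["edge_types"]:
            violations.append(f"edge {src}-{label}->{dst}: not allowed by schema")
    return violations


# Hypothetical schema and instance data.
schema = {
    "node_types": {"Person", "Group"},
    "edge_types": {("Person", "member_of", "Group")},
}
nodes = {"p1": "Person", "g1": "Group"}
good_edges = [("p1", "member_of", "g1")]
bad_edges = [("g1", "member_of", "p1"), ("p1", "member_of", "g2")]
```

The first bad edge has the wrong direction for its types, and the second refers to a node that does not exist.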



3-4. Query and Manipulation Languages

There is substantial work focused on query languages, the problem of querying graphs, the visual presentation of results, and graphical query languages

Some graph-oriented object models regard database transformations as graph transformations based on graph-pattern matching

GOOD, GOAL, etc.



3. Summary



NoSQL Databases


Schema-less

Shared nothing architecture

Each server uses only its own local storage (faster)

Elasticity

Able to add servers without downtime

Sharding

Asynchronous replication

BASE instead of ACID


NoSQL Database Models



Graph Database Models


Scalability

ACID vs. BASE

Complexity

Relational - no redundancy or information loss (normalization)

powerful SQL, optimization by RDBMS

- performance problem in deep queries (many joins)

no schema evolution, etc

Graph – property graph model
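
The property graph model, and the contrast with join-heavy deep queries, can be sketched as follows. This is an illustrative in-memory structure, not the storage format or API of any product named in these slides; `Node`, `Relationship`, `connect`, and `reachable` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)
    outgoing: list = field(default_factory=list)  # Relationship objects


@dataclass(eq=False)
class Relationship:
    rel_type: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


def connect(start, rel_type, end, **properties):
    """Create a typed relationship with its own properties."""
    rel = Relationship(rel_type, start, end, properties)
    start.outgoing.append(rel)
    return rel


def reachable(start, rel_type, depth):
    """Nodes reachable from start in at most `depth` hops over rel_type edges."""
    seen, frontier, found = {start.node_id}, [start], []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for rel in node.outgoing:
                if rel.rel_type == rel_type and rel.end.node_id not in seen:
                    seen.add(rel.end.node_id)
                    found.append(rel.end)
                    next_frontier.append(rel.end)
        frontier = next_frontier
    return found


# Hypothetical sample data.
a, b, c = Node(1, {"name": "A"}), Node(2, {"name": "B"}), Node(3, {"name": "C"})
connect(a, "KNOWS", b, since=2009)
connect(b, "KNOWS", c)
```

Each extra hop follows references already held by the nodes, whereas the equivalent relational query would typically add one self-join per hop.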




AllegroGraph RDFStore

HyperGraphDB

InfoGrid

Neo4j

FlockDB

Sones

Virtuoso




License

Distribution

The only truly distributed solution is HyperGraphDB

Indexing

In Neo4j, indexing is not the default behavior (indexes are provided by Lucene or Solr)

Storage system

General vs. Special

HyperGraphDB uses Berkeley DB

APIs

Most of them provide Java and web APIs


Neo4j


Fully ACID-compliant transactional graph DB written in Java

High performance

Handles several billion nodes, relationships and properties

1 to 2 million traversals per second

- constant time (independent of total size)

Example code

Node creation

Find friend


Neo4j


Example code

Traversal

Indexing


Neo4j



FlockDB


Goals

High rate of add/update/remove operations

Complex set arithmetic queries

Paging through query result sets containing millions of entries

Ability to ‘archive’ and later restore archived edges

Horizontal scaling including replication

Non-goals

Multi-hop queries (or graph-walking queries)

Automatic shard migrations

Characteristics

Optimized for very large adjacency lists (no traversal)
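
Two of the goals above, set arithmetic over adjacency lists and paging through large result sets, can be sketched on sorted ID lists. This is an assumption-laden illustration and not FlockDB's implementation or API: `intersect`, `page`, the cursor convention (IDs are positive and a cursor of 0 means "start"), and the sample IDs are all hypothetical.

```python
from bisect import bisect_right


def intersect(sorted_a, sorted_b):
    """Intersection of two sorted ID lists via a linear merge."""
    result, i, j = [], 0, 0
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            result.append(sorted_a[i])
            i += 1
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            i += 1
        else:
            j += 1
    return result


def page(sorted_ids, cursor, count):
    """Return up to `count` IDs greater than `cursor`, plus the next cursor.

    The next cursor is None when there are no further entries.
    """
    start = bisect_right(sorted_ids, cursor)
    chunk = sorted_ids[start:start + count]
    next_cursor = chunk[-1] if start + count < len(sorted_ids) else None
    return chunk, next_cursor


# Hypothetical adjacency lists: followers of one user, followings of another.
followers_of_a = [1, 3, 5, 7, 9]
following_of_b = [3, 4, 5, 9, 10]
common = intersect(followers_of_a, following_of_b)
```

Neither operation walks more than one hop from the queried nodes, which fits the stated non-goal of multi-hop queries.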


FlockDB - Twitter


Previous models (could not have both)

Relational tables – handling write operations

Key-value storage – paging through giant result sets

Implementation goals

Write the simplest possible thing that could work

Use off-the-shelf MySQL as the storage engine

Allow horizontal partitioning

Allow write operations to arrive out of order or be processed more than once (allow redundant work rather than lost work)

Twitter (April 2010)

More than 13 billion edges, 20k writes/second, 100k reads/second
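
One common way to tolerate writes that arrive out of order or more than once is to make each write idempotent and commutative. The sketch below assumes every write carries a unique timestamp and that the newest timestamp wins; the slides do not say how FlockDB itself achieves this, so `apply_write`, `replay`, and the data are hypothetical.

```python
def apply_write(edges, source, destination, state, timestamp):
    """Apply an edge write so that replays and reordering give the same result.

    edges maps (source, destination) -> (state, timestamp). A write takes
    effect only if its timestamp is newer than the stored one.
    """
    key = (source, destination)
    current = edges.get(key)
    if current is None or timestamp > current[1]:
        edges[key] = (state, timestamp)
    return edges


def replay(sequence):
    """Apply a sequence of writes to an empty edge store."""
    edges = {}
    for write in sequence:
        apply_write(edges, *write)
    return edges


# Hypothetical writes: (source, destination, state, timestamp).
writes = [
    (1, 2, "normal", 100),
    (1, 2, "removed", 200),
    (1, 3, "normal", 150),
]
in_order = replay(writes)
# Same writes, reordered, with one of them delivered twice.
shuffled = replay([writes[1], writes[2], writes[0], writes[1]])
```

Both replays end in the same state, so a duplicated or late write costs only redundant work.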


FlockDB - Twitter


Stores graphs as sets of edges

Primary key

(a compound key of the source ID, state, and position)

When an edge is deleted, the row is just marked ‘removed’ without being deleted from MySQL

Keep only a compound primary key and a secondary index for each row, and answer all queries from a single index.
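
The edge-set layout described above can be sketched with Python's built-in `sqlite3` module standing in for MySQL. The table and column names are assumptions, the secondary index is omitted for brevity, and FlockDB's real schema is not shown in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE edges (
        source_id      INTEGER NOT NULL,
        destination_id INTEGER NOT NULL,
        state          TEXT    NOT NULL,   -- 'normal' or 'removed'
        position       INTEGER NOT NULL,
        PRIMARY KEY (source_id, state, position)
    )
    """
)


def add_edge(source_id, destination_id, position):
    conn.execute(
        "INSERT INTO edges VALUES (?, ?, 'normal', ?)",
        (source_id, destination_id, position),
    )


def remove_edge(source_id, destination_id):
    # Soft delete: the row stays in the table and is only marked 'removed'.
    conn.execute(
        "UPDATE edges SET state = 'removed' "
        "WHERE source_id = ? AND destination_id = ?",
        (source_id, destination_id),
    )


def destinations(source_id, state="normal"):
    # The filter and ordering match the compound key (source_id, state, position).
    rows = conn.execute(
        "SELECT destination_id FROM edges "
        "WHERE source_id = ? AND state = ? ORDER BY position",
        (source_id, state),
    )
    return [row[0] for row in rows]


# Hypothetical data: user 1 points to three other users, then drops one.
add_edge(1, 10, position=1)
add_edge(1, 20, position=2)
add_edge(1, 30, position=3)
remove_edge(1, 20)
```

Because removed rows are kept, a removed edge can later be restored by flipping its state back.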


Sharding in Graph DB


Especially hard in graph DB due to traversal

Unless we store the entire graph on a single machine, we are forced to query across machine boundaries (expensive)

Neo4j provides a master/slave structure (still has limits)

FlockDB (Twitter) does not consider traversal (it is interested only in 1-level relations)
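
Why traversal makes sharding hard can be shown by counting how many hops of a path cross machine boundaries. The sketch assumes nodes are placed by a simple modulo of their ID, which is only one possible scheme; `shard_of`, `count_cross_shard_hops`, and the path are hypothetical.

```python
def shard_of(node_id, num_shards):
    """Assign a node to a shard by modulo of its ID (an assumed scheme)."""
    return node_id % num_shards


def count_cross_shard_hops(path, num_shards):
    """Count hops in a traversal path whose endpoints live on different shards."""
    return sum(
        1
        for src, dst in zip(path, path[1:])
        if shard_of(src, num_shards) != shard_of(dst, num_shards)
    )


# Hypothetical traversal that visits these node IDs in order.
path = [1, 2, 3, 7, 11]
```

With a single machine no hop is remote; with four shards under this placement, two of the four hops cross a boundary, and each of those would cost a network round trip.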


How to shard?


A proposal: gravity

Localizing data leads to greater performance (like cache)

Shard graph data based on gravity


Blueprints


A collection of interfaces, etc for the property graph DB model

Analogous to the JDBC, but for graph DB

Provides a common set of interfaces to allow developers to plug-and-play their graph DB backend. (Pipes, Gremlin, Rexster)
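
The plug-and-play idea can be sketched as application code written only against an abstract interface, with the backend supplied separately. The interface below is hypothetical and written in Python for illustration; it is not the Blueprints API, and `Graph`, `InMemoryGraph`, and `friends_of_friends` are made-up names.

```python
from abc import ABC, abstractmethod


class Graph(ABC):
    """Hypothetical backend-neutral interface for a property graph."""

    @abstractmethod
    def add_vertex(self, vertex_id, **properties): ...

    @abstractmethod
    def add_edge(self, out_id, label, in_id): ...

    @abstractmethod
    def out_neighbors(self, vertex_id, label): ...


class InMemoryGraph(Graph):
    """One possible backend; a real database could implement the same methods."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = dict(properties)

    def add_edge(self, out_id, label, in_id):
        self.edges.append((out_id, label, in_id))

    def out_neighbors(self, vertex_id, label):
        return [i for o, l, i in self.edges if o == vertex_id and l == label]


def friends_of_friends(graph: Graph, vertex_id):
    """Application code that depends only on the interface."""
    result = []
    for friend in graph.out_neighbors(vertex_id, "knows"):
        for fof in graph.out_neighbors(friend, "knows"):
            if fof != vertex_id and fof not in result:
                result.append(fof)
    return result


g = InMemoryGraph()
for v in ("a", "b", "c"):
    g.add_vertex(v)
g.add_edge("a", "knows", "b")
g.add_edge("b", "knows", "c")
g.add_edge("b", "knows", "a")
```

Swapping the backend means passing a different `Graph` implementation to `friends_of_friends`, with no change to the application code.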
