TRANSCRIPT
Master’s Thesis Defense
Marcus Pinnecke
Efficient Single Step Traversals
in Main-Memory Graph-Shaped Data
Marcus Pinnecke
Jul 20 / 2016
Graph-Shaped Data
slide 1
Introduction
• Graph as generic data structure
• Data is modeled in terms of
‒ Nodes: the records
‒ Edges: the relationships
• Several graph data models
‒ Directed, labeled multigraphs
Image taken from: https://upload.wikimedia.org/wikipedia/commons/9/9b/Social_Network_Analysis_Visualization.png
slide 2
Introduction
Single Step Traversals
• Graph Traversal. Visiting nodes according to some algorithmic strategy
• A single step traversal is the core primitive of graph traversal
‒ given a node, get its out- (in-) going edges
‒ given an edge, get its starting (ending) node
‒ include or exclude nodes or edges by some conditions
• Graph Traversal Pattern. Set of function declarations to express single step traversals
‒ Functions considering relationships between records
‒ Functions considering properties and values
‒ The composition of these functions is the query plan
• Example (simplified). Assume a citation graph (e.g., research papers)
‒ Query q(u) : Which authors cited the author u?
‒ Plan q(u) : (v_in ∘ e^{Is Author Of} ∘ e_in ∘ v_in ∘ e^{Cited} ∘ e_in ∘ v_out ∘ e^{Is Author Of} ∘ e_out)(u)
‒ Visualization q(u) :
Graph Database Classification
slide 3
Introduction
Graph Storage
‒ Native: graph data structure; querying does not depend on total graph size (graph as local „micro-index“)
‒ Non-Native: „flat“ tables; querying depends on total graph size
Native graph storage is claimed to be more efficient
slide 4
Introduction
Disk-based Graph DBMS (simplified Neo4j)
‒ CPU, DBMS with Object Cache
‒ Bottleneck: Main Memory (medium size and fast) ↔ Persistent Storage (large and slow)
‒ Disk-Storage: Edge-Store with relationship chain to enable fast seeks
‒ Nodes, Edges, Micro-Indexes,…
Main-Memory Graph DBMS
‒ CPU, DBMS
‒ Bottleneck: CPU Cache (tiny and super fast) ↔ Main Memory (medium size and fast)
‒ Nodes, Edges, Micro-Indexes,…
Motivation
Does efficient graph navigation require native graph storage in main-memory graph DBMS?
„Index-free adjacency“
Main-Memory DBMS:
• No Buffer Manager, less indirection, direct record access + „lightning fast“,…
• Non-Native will depend on graph size, but how dramatic is this?
slide 5
Introduction
What we did
Does efficient graph navigation require native graph storage in main-memory graph DBMS?
• Current research hints at an answer, but it remains ambiguous:
‒ Different graph models under study
‒ Different graph encodings
‒ Different database systems under study
‒ Different architectures and platforms
‒ Different evaluations
Contribution
• Basic pattern match operations: common „interface“ for both native and non-native approaches
• Main-memory optimized native and non-native graph storage approaches
• Comprehensive comparison
Basic Pattern Matching
slide 6
Basic Pattern Matching (I)
• Define a graph by its set of edges
• Edge set is a logical table R with three columns
• Single step traversals can be expressed by selections $\sigma_{\text{Start}=\alpha \,\wedge\, \text{Relationship}=\beta \,\wedge\, \text{End}=\gamma}(R) = (\alpha,\beta,\gamma)_R$
‒ Example: (*,*, )R = { ( , , ), ( , , ), …}
Start   Relationship   End
…       …              …
* matches any value
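A minimal sketch of how the selection (α,β,γ)R could be evaluated over an in-memory edge table, assuming C++17. The names `Edge`, `Pattern`, and `basic_pattern_match` are hypothetical and not taken from the thesis prototype; the wildcard * is modeled here as an empty `std::optional`. The function returns the positions of matching rows rather than the rows themselves.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical names for illustration; not the thesis prototype's API.
using Id = std::uint32_t;

struct Edge {  // one row of the logical table R
    Id start;
    Id relationship;
    Id end;
};

// Each pattern component is a concrete identifier or the wildcard "*"
// (assumption: the wildcard is represented by std::nullopt).
struct Pattern {
    std::optional<Id> start, relationship, end;
};

inline bool component_matches(const std::optional<Id>& want, Id have) {
    return !want.has_value() || *want == have;
}

// Evaluates (alpha, beta, gamma)_R by testing every row and collecting the
// positions of the rows that satisfy all three conditions.
inline std::vector<std::size_t> basic_pattern_match(const std::vector<Edge>& R,
                                                    const Pattern& p) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < R.size(); ++i) {
        if (component_matches(p.start, R[i].start) &&
            component_matches(p.relationship, R[i].relationship) &&
            component_matches(p.end, R[i].end)) {
            positions.push_back(i);
        }
    }
    return positions;
}
```

With this representation, a pattern such as `Pattern{1, std::nullopt, std::nullopt}` corresponds to (1,*,*)R, i.e., all edges that start at node 1.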
Basic Pattern Matching
slide 7
Basic Pattern Matching (II)
Table 3.1: Basic pattern matching: Interpretation and names of pattern vectors. The symbol "*" indicates a wildcard, i.e., no condition to the column.
Name               Pattern     Interpretation (filter condition)
q1: Exact match    (α, β, γ)   exact match of edge (α, β, γ)
q2: Nodes out      (α, β, *)   α is start node with out-going relationship β
q3: Connected by   (α, *, γ)   α is start node, γ is end node
q4: Nodes in       (*, β, γ)   the node γ has an in-coming relationship β
q5: Edges in       (*, *, γ)   the node γ has in-coming relationships
q6: Edges out      (α, *, *)   the node α has out-going relationships
q7: Connects       (*, β, *)   right-directed relationship β connect nodes
q8: Full scan      (*, *, *)   all edges
equal to the j-th component in the pattern vector p (for $j \in \{1, 2, 3\}$). Thus, a basic pattern match is a ternary equi-selection
$(\sigma_{\text{start}=\alpha}, \sigma_{\text{rel}=\beta}, \sigma_{\text{end}=\gamma}) = (\alpha, \beta, \gamma)$
where some of the selections $\sigma$ are allowed to be always true, and which returns the positions of satisfying records rather than the records themselves. Thus, a basic pattern match is equivalent to a filter operation
$\sigma_{\text{start}=\alpha \,\wedge\, \text{rel}=\beta \,\wedge\, \text{end}=\gamma}(R)$
on a logical table R with three columns, where some expressions can be left out (e.g., $\sigma_{\text{start}=\alpha}(R)$ only).
We can express several restrictions on the graph data depending on the pattern vector definition. Each pattern vector returns the indexes of the edges that match the conditions on their values. In Table 3.1 we provide an overview of possible pattern vector definitions, including a verbal interpretation and a name.
To bridge basic pattern matches into single step traversal functions, we further provide a projection and a composition function.
Projection. Given a graph G and a set of indexes I with respect to the edge indexes in the edge-store, a projection $\pi$ is a function which takes an edge-store, a set of edge indexes I, and a number $j \in \{1, 2, 3\}$, and returns a set of node identifiers or relationship identifiers by projecting the edge-store to the j-th column, considering only the elements given by I. If j = 1, the edges are projected to their starting node identifier; if j = 2, the edges are projected to their relationship identifier; otherwise, they are projected to their ending node identifier.
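The projection described above can be sketched as follows. This is an illustrative sketch only: the names `Edge` and `project` are hypothetical, the edge-store is assumed to be a plain vector of triples, and rejecting values of j outside {1, 2, 3} with an exception is a choice made here, not something the text prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <stdexcept>
#include <vector>

// Hypothetical names for illustration.
using Id = std::uint32_t;
struct Edge { Id start, relationship, end; };

// Projects the edges at the positions in I onto column j
// (1 = start node, 2 = relationship, 3 = end node) and returns the
// distinct identifiers found there.
inline std::set<Id> project(const std::vector<Edge>& edge_store,
                            const std::vector<std::size_t>& I, int j) {
    if (j < 1 || j > 3) throw std::invalid_argument("j must be 1, 2, or 3");
    std::set<Id> result;
    for (std::size_t pos : I) {
        const Edge& e = edge_store.at(pos);  // bounds-checked access
        result.insert(j == 1 ? e.start : (j == 2 ? e.relationship : e.end));
    }
    return result;
}
```

Feeding the positions returned by a basic pattern match into such a projection yields node identifiers, which is the step that turns a pattern match into a single step traversal.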
Basic Pattern Matching
Executing Basic Pattern Matching
slide 8
Executing Basic Pattern Matching
Basic Pattern Matching (α,β,γ)R
Native Approaches / Non-Native Approaches
‒ Scanning (Relational DBMS)
‒ Tree-Based Indexing (e.g., RDF-3x)
‒ Object Cache (e.g., Neo4j)
‒ Adjacency Lists (e.g., Graph Compute Engines)
Execution
slide 9
Scanning
Figure: the triples at positions 0 to 7 are distributed to CPU 1 to CPU 4, evaluated in parallel (match / no match), and merged into the result set {0, 4, 5, 7}.
• Records are accessed sequentially to evaluate the condition of the (α,β,γ)R predicate
‒ Larger design space (storage layout and access pattern)
• Query is answered by determining the set of triples that satisfy the predicate
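The distribute, evaluate, and merge steps of a parallel scan can be sketched as below. This is a simplified illustration under stated assumptions, not the scan matcher of the thesis: a column-oriented layout (`EdgeColumns`), an edges-out predicate (α,*,*), and plain `std::thread` workers are all choices made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using Id = std::uint32_t;

// Assumed column-oriented edge table: one array per column.
struct EdgeColumns {
    std::vector<Id> start, relationship, end;
};

// Edges-out scan (alpha, *, *): positions whose start column equals alpha.
inline std::vector<std::size_t> parallel_scan_edges_out(const EdgeColumns& R,
                                                        Id alpha,
                                                        unsigned num_threads) {
    const std::size_t n = R.start.size();
    num_threads = std::max(1u, num_threads);
    std::vector<std::vector<std::size_t>> partial(num_threads);
    std::vector<std::thread> workers;

    // Distribute: every thread gets one contiguous partition of the table.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        // Evaluate (in parallel): each thread writes only to its own slot.
        workers.emplace_back([&R, &partial, alpha, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                if (R.start[i] == alpha) partial[t].push_back(i);
        });
    }
    for (auto& w : workers) w.join();  // synchronization step

    // Merge: partitions are contiguous and visited in order, so simple
    // concatenation keeps the positions sorted.
    std::vector<std::size_t> result;
    for (const auto& p : partial)
        result.insert(result.end(), p.begin(), p.end());
    return result;
}
```

Because every worker owns a private partial result, no locking is needed during evaluation; the only coordination point is the join before the merge.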
Table 5.3: Complexities of an edges-out query in a graph G = (V, E) on node u, additional storage consumption relative to working dataset size, and index build time. Best values are marked in bold font.
Query Task             Scan        Native      Red-Black      Adjacency
Find u                 O(|E|)      O(1)        O(log2(|E|))   O(1)
Determine N(u)         O(|N(u)|)   O(1)        O(1)           O(1)
Return (N(u), types)   O(1)        O(1)        O(1)           O(1)
Storage overhead       Scan        Native      Red-Black      Adjacency
EEN dataset            -           +108%       +67%           +83%
FB dataset             -           +3980%      +60%           +1740%
GOG dataset            -           +140%       +58%           +96%
Index build time       Scan        Native      Red-Black      Adjacency
EEN dataset            -           3 msec      60 msec        3.8 msec
FB dataset             -           0.07 msec   0.44 msec      0.00 msec
GOG dataset            -           100 msec    1100 msec      10 msec
5.4 Summary
In this chapter we provided a brief overview of the implementation details of our evaluation prototype in Section 5.1, and evaluated the concepts covered in this thesis in Section 5.2 and in Section 5.3.
In our micro-benchmarks, we determined a suitable scan configuration for our case study: a horizontal parallelization on a column-oriented storage layout with bulk-processing evaluation for large datasets (i.e., the scanning is memory-bound). We showed that scanning is not a competitive option compared to the indexes introduced in this thesis when query performance is the primary concern. If storage overhead is the primary concern, scanning might be considered as an option. However, we argued that efficient indexing can heavily exploit pre-computed resultsets for single step traversals, such that the task of answering a query becomes the effort of finding the record and copying a memory address.
Query Execution
In Table 5.3 we summarize the complexities for an edges-out query on a node u in a graph-shaped dataset. The evaluation of an edges-out query consists of three sub-tasks: (1) find u, (2) determine its neighborhood N(u), and (3) return N(u) along with the relationship types. Once a resultset is available, the effort of sub-task (3) is constant for all approaches, since it only returns a memory address. The differences lie in sub-tasks (1) and (2). While the native approach (object cache) and the adjacency list can access the pre-computed set for u directly, both the scanning and the search-tree-based approaches must find u first. Since the red-black tree is a balanced search …
Execution
slide 10
Tree-Based Indexing
• Records are organized in several tree indexes (i.e., seven trees)
‒ per single value
‒ per pair of values
‒ per triple of values
• Implemented as binary search trees (i.e., red-black search trees)
• Querying (α,β,γ)R means finding (α,β,γ)R in the tree and returning its value (where α, β, γ are identifiers or *)
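A minimal sketch of such a tree-based index, with several simplifications that are assumptions of this example rather than the thesis design: the seven trees are folded into a single `std::map` (which common standard library implementations realize as a red-black tree), the wildcard * is encoded by a sentinel value `kAny`, and all names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <vector>

using Id = std::uint32_t;
struct Edge { Id start, relationship, end; };

// Assumed sentinel for the wildcard "*"; real identifiers must not use it.
constexpr Id kAny = std::numeric_limits<Id>::max();
using Key = std::array<Id, 3>;  // (alpha, beta, gamma), compared lexicographically

class TreeIndex {
public:
    explicit TreeIndex(const std::vector<Edge>& R) {
        for (std::size_t pos = 0; pos < R.size(); ++pos) {
            const Edge& e = R[pos];
            // Masks 1..7 enumerate the seven combinations with at least one
            // bound component: three single values, three pairs, one triple.
            for (unsigned mask = 1; mask < 8; ++mask) {
                Key k = {(mask & 1u) ? e.start : kAny,
                         (mask & 2u) ? e.relationship : kAny,
                         (mask & 4u) ? e.end : kAny};
                index_[k].push_back(pos);  // pre-computed result set
            }
        }
    }

    // Looks up the pattern in the tree; returns nullptr on "no match".
    const std::vector<std::size_t>* find(const Key& pattern) const {
        auto it = index_.find(pattern);
        return it == index_.end() ? nullptr : &it->second;
    }

private:
    std::map<Key, std::vector<std::size_t>> index_;
};
```

In this sketch the full wildcard pattern (*,*,*) is deliberately not indexed, so a lookup for it reports no match and would have to fall back to another strategy such as scanning.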
Figure: tree lookup with match / no match branches, returning the result set {0, 4, 5, 7}.
Execution
slide 11
Object Cache
• Object cache is used to avoid costly disk access and also serves as an index
‒ Each node v is mapped to a pair (e_in(v), e_out(v))
‒ e_in contains all edges (grouped by relationship) where v is the ending node
‒ e_out contains all edges (grouped by relationship) where v is the starting node
• Querying (α,β,*)R is answered by returning the edge set of relationship β in e_out(v) with v = α
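The mapping from a node to its pair of relationship-grouped edge sets can be sketched as follows. The sketch assumes dense node identifiers in [0, num_nodes) and uses a hash map per node for the grouping; both are illustrative assumptions, and the names `CachePair` and `ObjectCache` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Id = std::uint32_t;
struct Edge { Id start, relationship, end; };
using EdgeSet = std::vector<std::size_t>;                // edge positions
using ByRelationship = std::unordered_map<Id, EdgeSet>;  // grouped by beta

struct CachePair {
    ByRelationship e_in;   // edges where the node is the ending node
    ByRelationship e_out;  // edges where the node is the starting node
};

class ObjectCache {
public:
    // Assumes node identifiers are dense in [0, num_nodes).
    ObjectCache(const std::vector<Edge>& R, std::size_t num_nodes)
        : nodes_(num_nodes) {
        for (std::size_t pos = 0; pos < R.size(); ++pos) {
            nodes_.at(R[pos].start).e_out[R[pos].relationship].push_back(pos);
            nodes_.at(R[pos].end).e_in[R[pos].relationship].push_back(pos);
        }
    }

    // Answers (alpha, beta, *): seek to node alpha, then return the edge set
    // grouped under beta in e_out(alpha). Returns nullptr if there is none.
    const EdgeSet* nodes_out(Id alpha, Id beta) const {
        if (alpha >= nodes_.size()) return nullptr;
        const auto& group = nodes_[alpha].e_out;
        auto it = group.find(beta);
        return it == group.end() ? nullptr : &it->second;
    }

private:
    std::vector<CachePair> nodes_;  // index i holds the pair for node i
};
```

The query returns a pointer to the pre-computed edge set, so answering it involves no scan over the edge table.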
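Figure: table of all node identifiers (Id 0, 1, …, i, …, n); each entry points to a pair (e_in, e_out), and the group for relationship β in e_out holds the edge set {0, 4, 5, 7}.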
Execution
slide 12
Adjacency Lists
• Contains results for Edges-Out (α,*,*)R queries (i.e., restrictions on the relationship are not considered)
• The i-th entry contains a list of identifiers of the edges that connect node i to the nodes in its neighborhood
• Typically implemented as an array of arrays
• Two adjacency lists, one for e_in and one for e_out
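A minimal sketch of the array-of-arrays layout, assuming dense node identifiers in [0, num_nodes) and edge positions as edge identifiers. The struct and member names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Id = std::uint32_t;
struct Edge { Id start, relationship, end; };

// Two arrays of arrays: entry i lists the positions of the edges that leave
// (out) or enter (in) node i. Relationships are not distinguished.
struct AdjacencyLists {
    std::vector<std::vector<std::size_t>> out, in;

    // Assumes dense node identifiers in [0, num_nodes).
    AdjacencyLists(const std::vector<Edge>& R, std::size_t num_nodes)
        : out(num_nodes), in(num_nodes) {
        for (std::size_t pos = 0; pos < R.size(); ++pos) {
            out.at(R[pos].start).push_back(pos);
            in.at(R[pos].end).push_back(pos);
        }
    }

    // Edges out (alpha, *, *): a single array access, no search.
    // Throws std::out_of_range for an unknown node.
    const std::vector<std::size_t>& edges_out(Id alpha) const {
        return out.at(alpha);
    }

    // Edges in (*, *, gamma), answered from the second list.
    const std::vector<std::size_t>& edges_in(Id gamma) const {
        return in.at(gamma);
    }
};
```

A query that also restricts the relationship would need a further selection over the returned list, since the lists here ignore relationship types.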
Figure: table of all node identifiers (Id 0, 1, …, i, …, n); the entry for a node points to its edge list {0, 4, 5, 7}.
Execution
slide 13
Applicability & Limits
Table 4.1: Support of basic pattern per matcher. The symbol X marks the direct support of a basic pattern, blank fields mark required post-processing (e.g., further selection), or fallback (i.e., execute scanning instead).
Basic Pattern Scan Native Red-Black Adjacency
q1: Exact match X X
q2: Nodes out X X X
q3: Connected by X X
q4: Nodes in X X X
q5: Edges in X X X X
q6: Edges out X X X X
q7: Connects X X
q8: Full scan X X
4.3 Summary
In the previous chapter, we showed that the concept of basic patterns is a core functionality to evaluate single step traversals. In this section we provide an overview of four state-of-the-art approaches to evaluate single step traversals.
For each, we contribute an extended or modified version to evaluate basic patterns. Thereby, we answer our third research question, which asks for possibilities to adapt existing approaches to a main-memory setting.
• Scan-Matcher: A highly efficient, parallelized, and memory-bound scan approach. It is aware of the table's storage layout (i.e., the records are aligned column- or row-wise in memory) and the access pattern (i.e., the evaluation can be either in a bulk fashion or in a branching fashion). For querying we chose a vertical parallelism strategy. Hence, we evaluate a query by delegating the evaluation of table partitions to threads, one per CPU core. The result is then merged after a synchronization step, and returned.
• Native-Matcher: A storage-layout-independent index structure that is based on native graph processing concepts (especially on the object cache). Each node is mapped to a cache pair, each containing a collection of cache entries. These cache pairs contain out-going and in-going edges, i.e., if the out-going edges of a node are required, the out-going edge cache collection is considered. Each cache collection is organized in a relationship-centric way, such that related edges are grouped by the relationship. Given a certain node and a certain relationship, querying the edges is done by seeking to the node's out-going edges cache entry. The cache collection with the given relationship as grouping key contains the desired result.
Execution
Evaluation
slide 14
Evaluation
1 Micro-Benchmarks
• Artificial datasets
• Single step traversals
2 Case-Study „Reachability Queries“
• Real world datasets
• Series of single step traversals
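To illustrate how a reachability query decomposes into a series of single step traversals, here is a small breadth-first sketch. It is not the case-study implementation: the neighbor table `Neighbors`, the function name `reachable`, and the hop limit `max_hops` are assumptions made for this example, with each lookup `out[u]` standing in for one edges-out traversal followed by a projection to the end nodes.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

using Id = std::uint32_t;

// out[u] lists the end nodes of the edges leaving u (assumed precomputed).
using Neighbors = std::vector<std::vector<Id>>;

// Breadth-first reachability from source to target within max_hops steps.
inline bool reachable(const Neighbors& out, Id source, Id target,
                      std::size_t max_hops) {
    if (source >= out.size() || target >= out.size()) return false;
    std::vector<bool> visited(out.size(), false);
    std::queue<std::pair<Id, std::size_t>> frontier;  // (node, hops so far)
    visited[source] = true;
    frontier.push({source, 0});
    while (!frontier.empty()) {
        auto [u, hops] = frontier.front();
        frontier.pop();
        if (u == target) return true;
        if (hops == max_hops) continue;  // depth limit reached: do not expand
        for (Id v : out[u]) {            // one single step traversal from u
            if (v < out.size() && !visited[v]) {
                visited[v] = true;
                frontier.push({v, hops + 1});
            }
        }
    }
    return false;
}
```

Every loop iteration issues exactly one single step traversal, so the cost of the whole query is dominated by how fast the underlying approach answers an edges-out lookup.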
Evaluation
slide 15
Micro-Benchmarking - Results
Avoid scanning: For regular access, scanning is likely not an option
• Managing an index has a cost: is scanning an option for intermediate result data structures?
Figure (plot): edges out (1 GiB dataset). x-axis: selectivity (in %); y-axis: execution time (ms); series: indexes (native/non-native), scan; markers: 100 ms threshold, scan max selectivity.
Figure 5.1: Effect of the selectivity value on execution time on a graph with ≈ 90 mio. edges (executed on server). 100 ms is marked as the threshold for interactive queries. The selectivity for which the scan exceeds this threshold is also marked.
Observation
With increasing selectivity, all approaches require more time to answer the query, as expected. This decrease in execution performance is linear for all approaches, but the slopes are different. The strongest performance decrease is for scanning. The other approaches lose only slightly in performance, even when nearly all edges must be returned. In fact, we observed that the indexes were able to answer the query in less than 100 ms independent of the selectivity value. Thus, the indexes, independent of whether they are native or non-native, are promising options for large-scale graph datasets with respect to performance.
Explanation
We explain the difference between scanning and the indexes as follows: On the one hand, scanning requires touching each tuple independent of whether it will satisfy the predicate or not. This inherently requires a transfer of 1 GiB from memory to the CPU. All positions of satisfying tuples must be written to the resultset before the memory address of this resultset can be returned. After returning the resultset address, its dereferenced content is copied to the query's final resultset. Since the amount of data to scan remains the same, the performance decrease with increasing selectivity is caused by the effort of writing the relevant positions to the resultset and to the final resultset, respectively. For selectivity values near zero, the differences between scanning and indexing can be explained by the effort of reading and testing 1 GiB of data when using the scan. On the other hand, each index pre-computes the query answer …
Evaluation
slide 16
Micro-Benchmarking - Results
Engineering is important
• consider pitfalls in C++ such as implicit copy constructor call
• standard data structures (e.g., in STL) might become a bottleneck (e.g., std::back_inserter)
• be aware of implementation artifacts
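The copy pitfall named above can be made concrete with a small sketch that contrasts passing a pre-computed result by reference and by value. The class `Index` and its member functions are hypothetical names for this illustration and do not reproduce the prototype's interface.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using ResultSet = std::vector<std::size_t>;

class Index {
public:
    explicit Index(ResultSet precomputed) : result_(std::move(precomputed)) {}

    // Result by reference: hands out the pre-computed set without a copy.
    const ResultSet& by_reference() const { return result_; }

    // Result by value: the return statement copy-constructs a new vector,
    // which costs time proportional to the size of the result.
    ResultSet by_value() const { return result_; }

private:
    ResultSet result_;
};

// Helper for the demonstration: true if both vectors use the same buffer.
inline bool shares_storage(const ResultSet& a, const ResultSet& b) {
    return a.data() == b.data();
}
```

Note that a caller can reintroduce the copy without noticing: `auto r = idx.by_reference();` deduces a plain `ResultSet` and invokes the copy constructor, whereas `const auto& r = idx.by_reference();` keeps the reference.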
Figure (plots): four panels of execution time (ms) over #edges (in mio), covering up to 0.9, 7.5, 75, and 75 mio edges; series: index (result by reference), index (result by value), scan (the third panel shows the two index series only).
Figure 5.2: Effect of an increasing number of edges on execution time (executed on server).
… value is returning a constant reference to a resultset that was written based on the pre-computed result in the index.
Observation
As expected, all approaches are affected by an increasing number of edges in the graph, such that it takes longer to answer the query. The strongest decrease in performance can be observed for the scan approach. For an interactive query, which requires an answer in less than 100 ms, the scan is not an option if there are more than 14 mio edges. In contrast, using an index (independent of whether it is a native or a non-native one), interactive queries can cover 5 times more edges compared to a scan. When the result is passed by reference, and no further writes to the resultset are required, the indexes will be interactive until approximately 200 mio edges (14 times more edges compared to scanning).
Explanation
As for the scalability micro-benchmark, all approaches are affected by the effort of writing the final result in general. Hence, there is an inherent performance decrease when the number of edges grows. In addition, there is an extra effort to write the resultset when the index passes its result by value. Consequently, we can observe a nearly constant difference between the indexes using values and the indexes using references. This constant difference is the effort it takes to write the resultset. Writing the resultset before the final resultset is created is not necessary for indexes which pass the result by reference. However, on tiny to small graphs (left figure) scanning outperforms the index passing by value, and is competitive with any index for less than 0.1 mio edges. There is a threshold at approximately 6.3 mio edges where scanning is no longer a reasonable option (middle figure). For large-scale graphs, the only reasonable option is an index that passes the result by reference. We explain the (nearly constant) performance difference of 50 ms between index result passing by reference and index result passing by value with the following two aspects: First, the index with result …
Evaluation
Marcus Pinnecke
slide 17
Case Study - Datasets
5.3. Case Study: Reachability Queries
Table 5.1: Reachability Dataset Network Statistics
Property              FB               EEN                 GOG
Network Type          Social           Communication       Web
Description           Social circles   E-Mail Network      From Google
#Nodes                2,876            36,692              875,713
#Edges                4,199            367,662             5,105,039
#Relationship Types   45               1                   1
Data set size         4.9 MiB          4.1 MiB             75.4 MiB
Diameter              8                11                  21
Effective diameter    4.7              4.8                 8.1
Origin                Mcauley [ML14]   Leskovec [LLDM09]   Leskovec [LLDM09]
Real-world datasets. One of the more advanced characteristics is the longest shortest
path in the graph, i.e., the maximal number of hops between two given nodes if they
are connected over some transitive path. If this number is unconstrained, a reachability
query must explore the entire dataset in the worst case before answering no. If this
number is constrained, on the other hand, the search can be stopped at a certain depth.
In addition, each node's degree influences the number of neighbors that might be
considered during evaluation. Generating artificial data that reflects such
characteristics is out of the scope of this work. Hence, we focus on real-world datasets.
Table 5.1 shows the datasets used for this case study:
1. The Facebook (FB) dataset is a medium graph that covers (anonymous) social
network users and their friendships. Friendship is determined by per-user friend
lists, each indicating that one user has a certain friendship to other users. In
addition to friendship information, certain properties are attached to users, e.g.,
the interest in a specific topic or the belonging to a certain group.²
2. The Enron E-Mail Network (EEN) dataset is a small graph and covers e-mail
communication. E-mail accounts are nodes. If at least one e-mail was sent
between two accounts u and v, there is an edge between u and v.³
3. The Google Web Graph (GOG) dataset is a larger graph covering hyperlinks
between web pages. Nodes are web pages, and edges are links between web pages.
The dataset was released as part of the Google Programming Contest in 2002.⁴
² https://snap.stanford.edu/data/egonets-Facebook.html
³ https://snap.stanford.edu/data/email-Enron.html
⁴ https://snap.stanford.edu/data/web-Google.html
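The "longest shortest path" characteristic discussed above (the diameter row in Table 5.1) can be illustrated with a small Python sketch. It is a naive all-sources breadth-first search on a hypothetical toy graph, meant only to make the definition concrete; the function names are assumptions, and this is not how the statistics in Table 5.1 were obtained.

```python
from collections import deque


def hop_distances(adjacency, start):
    """Breadth-first search: hop distance from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(adjacency):
    """Longest shortest path over all connected node pairs.

    Runs one BFS per node, so it is only suitable for toy graphs.
    Every node must appear as a key of the adjacency mapping.
    """
    return max(max(hop_distances(adjacency, n).values()) for n in adjacency)


# Hypothetical undirected path graph 1 - 2 - 3 - 4, stored in both directions.
toy_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(diameter(toy_graph))  # prints 3
```

A reachability query on this toy graph never needs to look further than 3 hops, which is the sense in which a constrained diameter bounds the search depth.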
Evaluation
Marcus Pinnecke
slide 18
Case Study - Results
Index performance is similar
• Optimized index structures pass a few pointers
• Major difference is finding a certain node for edges-out queries
Dataset characteristics and algorithm design affect query processing heavily
5. Implementation and Evaluation
Table 5.3: Complexities of an edges-out query in a graph G = (V, E) on node u, additional
storage consumption relative to working dataset size, and index build time. Best values are
marked in bold font.
Query Task             Scan        Native      Red-Black       Adjacency
Find u                 O(|E|)      O(1)        O(log2(|E|))    O(1)
Determine N(u)         O(|N(u)|)   O(1)        O(1)            O(1)
Return (N(u), types)   O(1)        O(1)        O(1)            O(1)

Storage overhead       Scan        Native      Red-Black       Adjacency
EEN dataset            -           +108%       +67%            +83%
FB dataset             -           +3980%      +60%            +1740%
GOG dataset            -           +140%       +58%            +96%

Index build time       Scan        Native      Red-Black       Adjacency
EEN dataset            -           3 msec      60 msec         3.8 msec
FB dataset             -           0.07 msec   0.44 msec       0.00 msec
GOG dataset            -           100 msec    1100 msec       10 msec
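The different costs of finding u in Table 5.3 can be sketched in Python. This is a simplified illustration under stated assumptions: the edge data and all function names are hypothetical, a sorted array with binary search stands in for the red-black tree (it offers the same logarithmic lookup but is not a red-black tree), a hash map stands in for the adjacency list, and the sketch copies the neighborhood where an optimized index would only pass a pointer.

```python
from bisect import bisect_left, bisect_right

# Hypothetical edge table: (source, target, relationship type).
edges = [(1, 2, "cites"), (1, 3, "cites"), (2, 3, "cites"), (4, 1, "is_author_of")]


def edges_out_scan(edge_table, u):
    # Scan: every edge is inspected to find those starting at u, O(|E|).
    return [(t, r) for s, t, r in edge_table if s == u]


def build_sorted(edge_table):
    # Stand-in for a search tree: edges ordered by source node.
    ordered = sorted(edge_table)
    sources = [s for s, _, _ in ordered]
    return sources, ordered


def edges_out_sorted(sources, ordered, u):
    # Binary search locates the run of edges starting at u, O(log |E|).
    lo = bisect_left(sources, u)
    hi = bisect_right(sources, u)
    return [(t, r) for _, t, r in ordered[lo:hi]]


def build_adjacency(edge_table):
    # Pre-compute the edges-out resultset of every node.
    adjacency = {}
    for s, t, r in edge_table:
        adjacency.setdefault(s, []).append((t, r))
    return adjacency


def edges_out_adjacency(adjacency, u):
    # Direct lookup of the pre-computed resultset, O(1) on average.
    return adjacency.get(u, [])


sources, ordered = build_sorted(edges)
adjacency = build_adjacency(edges)
print(edges_out_scan(edges, 1))
print(edges_out_sorted(sources, ordered, 1))
print(edges_out_adjacency(adjacency, 1))
```

All three variants return the same neighborhood for node 1. They differ only in how much of the dataset has to be touched before the result is available.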
5.4 Summary
In this chapter, we provided a brief overview of the implementation details of our
evaluation prototype in Section 5.1, and evaluated the concepts covered in this thesis
in Section 5.2 and Section 5.3.
In our micro-benchmarks, we determined a suitable scan configuration for our case
study: horizontal parallelization on a column-oriented storage layout with
bulk-processing evaluation for large datasets (i.e., the scanning is memory-bound). We
showed that scanning is not a competitive option compared to the indexes introduced in
this thesis when query performance is the primary concern. If storage overhead is the
primary concern, scanning might be considered as an option. However, we argued that
efficient indexing can heavily exploit pre-computed resultsets for single step
traversals, such that answering a query reduces to finding the record and copying a
memory address.
Query Execution. In Table 5.3 we summarize the complexities of an edges-out query on a node u in
a graph-shaped dataset. The evaluation of an edges-out query consists of three
sub-tasks: (1) find u, (2) determine its neighborhood N(u), and (3) return N(u) along
with the relationship types. Once a resultset is available, the effort of sub-task (3)
is constant for all approaches, since it amounts to returning a memory address. The
differences lie in sub-tasks (1) and (2). While the native approach (object cache) and
the adjacency list can access the pre-computed set for u directly, both scanning and
search-tree-based approaches must find u first. Since the red-black tree is a balanced search
[Figure: three panels, one per dataset (the panel title "Enron" survives), plotting mean execution time (ms) against #hops (1 to 8) for adjacency list, object cache, scan, and tree; the 100 ms interactivity threshold and the maximum hop count of the scan ("scan max hops") are marked.]
Slide annotation: "Up to which number of hops can the scan be used as a 'fallback' (threshold 100 ms)?" (breadth-first search)
Figure 5.4: Reachability query execution times on real-world datasets depending on hop count.
and the non-native approaches (the scan operation and the tree-based index). The
lower the execution time, the better the result. We show the mean execution times
for reachability queries looking for the requested node within at most #hops hops.
Hence, if the requested node is not found, a path of length #hops is fully explored.
If, on the other hand, the requested node is found, the path has a length of at most
#hops. We chose this setup since it reflects a typical case for reachability queries.
We consider an approach interactive if it answers a reachability query in less than 100 ms.
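A hop-bounded reachability query of this kind can be sketched as a breadth-first search that stops expanding at a given depth. The following Python sketch is a minimal illustration and not the implementation evaluated here; the function name `reachable_within` and the toy graph are assumptions.

```python
from collections import deque


def reachable_within(adjacency, source, target, max_hops):
    """Return True if target can be reached from source in at most max_hops hops."""
    if source == target:
        return True
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            # Depth limit reached: do not expand this node any further.
            continue
        # Single step traversal: follow the out-going edges of the node.
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return True
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return False


# Hypothetical directed chain 1 -> 2 -> 3 -> 4.
chain = {1: [2], 2: [3], 3: [4]}
print(reachable_within(chain, 1, 4, 3))  # prints True
print(reachable_within(chain, 1, 4, 2))  # prints False
```

The negative case is the expensive one: every node within the hop limit is expanded before the query can answer no, and each expansion is one edges-out lookup on the underlying scan or index.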
Expectation. Based on our observations of the single-step traversal performance of the individual
approaches, we expect poor performance for the scan-based operation and superior
performance for the indexes. Likewise, we expect similar results for the indexing
approaches, regardless of whether an approach is native or non-native. Since our
in-depth analysis in the previous section (Section 5.3.3) revealed significant effort to
evaluate reachability queries on the GOG dataset, we expect the worst performance for
the GOG dataset.
Observation. Independent of the dataset, the scan operation performs worst. With the exception
of the FB dataset, the scan operation is not even acceptable for an interactive query.
Apart from the scan operation, all indexes perform similarly. In fact, all indexes are
acceptable for interactive queries, even for the GOG dataset. In more detail, we observe
query times below 1 ms for EEN and FB, as well as below 25 ms up to 4 hops and below
50 ms up to 8 hops for the GOG dataset. Moreover, we observe similar growth for all
approaches on the GOG dataset, but with different factors: all approaches grow
exponentially. The scan grows the most, such that it exceeds a reasonable query time
very quickly. The indexes, however, are likely to exceed the interactive limit for hops
greater than 10. Finally, we observe that on each dataset all approaches start with a
similar query time for 1-
Evaluation
Wrap-Up
Marcus Pinnecke
slide 19
Summary of Trade-Offs
Approaches: Scan, Tree-based, Object Cache, Adjacency List
Criteria: Scalability, Storage needed, Creation time, Implementation effort, Basic Pattern Support
1) Depends on data set
2) Depends on hardware and optimization
3) Depends on skills, design space and optimization
Wrap-Up
Marcus Pinnecke
slide 20
Conclusion
• Organizing fast record access is primary
• Storage layout is secondary
• "Non-native" drawback (i.e., dependence on graph size) is not as significant as in disk-based systems
• Each approach has its own benefits, drawbacks, and trade-offs
• Dataset characteristics influence performance
• Algorithm design (i.e., improved BFS) is needed
Future Work
• Compression
• Improved algorithm design
• Graph navigation in RDBMS
Wrap-Up
Thank you for your attention.
Questions?