
Master’s Thesis Defense

Marcus Pinnecke

Efficient Single Step Traversals

in Main-Memory Graph-Shaped Data


Jul 20 / 2016

Graph-Shaped Data


slide 1

Introduction

• Graph as generic data structure

• Data is modeled in terms of

‒ Nodes: the records

‒ Edges: the relationships

• Several graph data models

‒ Directed, labeled multi graphs

Image taken from: https://upload.wikimedia.org/wikipedia/commons/9/9b/Social_Network_Analysis_Visualization.png


slide 2

Introduction

Single Step Traversals

• Graph Traversal. Visiting nodes according to some algorithmic strategy

• A single step traversal is the core primitive of graph traversal

‒ Given a node, get its out- (in-) going edges

‒ Given an edge, get its starting (ending) node

‒ Include or exclude nodes or edges by some conditions

• Graph Traversal Pattern. Set of function declarations to express single step traversals

‒ Functions considering relationships between records

‒ Functions considering properties and values

‒ The composition of these functions is the query plan

• Example (simplified). Assume a citation graph (e.g., research papers)

‒ Query q(u) : Which authors cited the author u?

‒ Plan q(u) : $(v_{in} \circ e_{\text{Is Author Of}} \circ e_{in} \circ v_{in} \circ e_{\text{Cited}} \circ e_{in} \circ v_{out} \circ e_{\text{Is Author Of}} \circ e_{out})(u)$

‒ Visualization q(u) :




Graph Database Classification


slide 3

Introduction

Graph Storage: Native vs. Non-Native

• Native: graph data structure; querying does not depend on total graph size (graph as local „micro-index“)

• Non-Native: „flat“ tables; querying depends on total graph size

Native graph storage is claimed to be more efficient




slide 4

Introduction

Disk-based Graph DBMS (simplified Neo4j)

• CPU

• Main Memory (medium size and fast): DBMS with object cache (nodes, edges, micro-indexes, …)

• Persistent Storage (large and slow): disk storage with edge store; relationship chain to enable fast seeks

• Bottleneck: access to persistent storage

Main-Memory Graph DBMS

• CPU

• CPU Cache (tiny and super fast)

• Main Memory (medium size and fast): DBMS with nodes, edges, micro-indexes, …

• Bottleneck: access to main memory

Motivation

Does efficient graph navigation require native graph storage in main-memory graph DBMS?

„Index-free adjacency“

Main-Memory DBMS:

• No Buffer Manager, less indirection, direct record access + „lightning fast“,…

• Non-native storage depends on graph size, but how dramatic is this?


slide 5

Introduction

What we did

Does efficient graph navigation require native graph storage in main-memory graph DBMS?

Contribution

• Basic pattern match operations: common „interface“ for both native and non-native approaches

• Main-memory optimized native and non-native graph storage approaches

• Comprehensive comparison

• Current research hints at an answer, but remains ambiguous:

‒ Different graph models under study

‒ Different graph encodings

‒ Different database systems under study

‒ Different architectures and platforms

‒ Different evaluations

Basic Pattern Matching


slide 6

Basic Pattern Matching (I)

• Define a graph by its set of edges

• Edge set is a logical table R with three columns

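• Single step traversals can be expressed by selections $\sigma_{\text{Start}=\alpha \wedge \text{Relationship}=\beta \wedge \text{End}=\gamma}(R) = (\alpha,\beta,\gamma)_R$

‒ Example: $(*,*,v)_R = \{(\dots,\dots,v), (\dots,\dots,v), \dots\}$ for a given end node $v$

‒ Columns of R: Start, Relationship, End

‒ * matches any value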
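The selection on the edge table can be sketched in a few lines of Python. This is only an illustration under assumptions: the table contents, the name `match`, and the use of `None` as the wildcard are hypothetical and not taken from the thesis prototype. The function returns edge positions rather than the edges themselves.

```python
from typing import List, Optional, Tuple

Edge = Tuple[str, str, str]  # (start, relationship, end)
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None plays the role of "*"

# Hypothetical edge table R of a tiny citation graph.
R: List[Edge] = [
    ("ann", "is_author_of", "p1"),  # position 0
    ("bob", "is_author_of", "p2"),  # position 1
    ("p2", "cited", "p1"),          # position 2
    ("cem", "is_author_of", "p3"),  # position 3
    ("p3", "cited", "p1"),          # position 4
]


def match(pattern: Pattern, table: List[Edge]) -> List[int]:
    """Return the positions of all edges that satisfy the pattern."""
    return [
        pos
        for pos, edge in enumerate(table)
        if all(want is None or want == have for want, have in zip(pattern, edge))
    ]


print(match((None, None, "p1"), R))   # all edges ending in p1
print(match(("ann", None, None), R))  # all edges starting at ann
```

A pattern with three wildcards degenerates to a full scan, while a pattern without wildcards is an exact match.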



slide 7

Basic Pattern Matching (II)

Table 3.1: Basic pattern matching: Interpretation and names of pattern vectors. The symbol ”*” indicates a wildcard, i.e., no condition on the column.

Name               Pattern      Interpretation (filter condition)
q1: Exact match    (α, β, γ)    exact match of edge (α, β, γ)
q2: Nodes out      (α, β, *)    α is start node with out-going relationship β
q3: Connected by   (α, *, γ)    α is start node, γ is end node
q4: Nodes in       (*, β, γ)    the node γ has an in-coming relationship β
q5: Edges in       (*, *, γ)    the node γ has in-coming relationships
q6: Edges out      (α, *, *)    the node α has out-going relationships
q7: Connects       (*, β, *)    right-directed relationship β connects nodes
q8: Full scan      (*, *, *)    all edges

… equal to the j-th component in the pattern vector p (for $j \in \{1, 2, 3\}$). Thus, a basic pattern match is a ternary equi-selection

$(\sigma_{\text{start}=\alpha}, \sigma_{\text{rel}=\beta}, \sigma_{\text{end}=\gamma}) = (\alpha, \beta, \gamma)$

where some of the selections $\sigma$ are allowed to be always true, and which returns the positions of satisfying records rather than the records themselves. Thus, a basic pattern match is equivalent to a filter operation

$\sigma_{\text{start}=\alpha \wedge \text{rel}=\beta \wedge \text{end}=\gamma}(R)$

on a logical table R with three columns, where some expressions can be left out (e.g., $\sigma_{\text{start}=\alpha}(R)$ only).

We can express several restrictions on the graph data depending on the pattern vector definition. Each pattern vector returns the indexes of the edges whose values match the conditions. In Table 3.1 we provide an overview of possible pattern vector definitions, including a verbal interpretation and a name.

To bridge basic pattern matches to single step traversal functions, we further provide a projection and a composition function.

Projection. Given a graph G and a set I of edge indexes with respect to the edge-store, a projection $\pi$ is a function which takes an edge-store, a set of edge indexes I, and a number $j \in \{1, 2, 3\}$, and returns a set of node identifiers or relationship identifiers by projecting the edge-store to the j-th column, considering only the elements given by I. If j = 1, the edges are projected to their starting node identifier; if j = 2, the edges are projected to their relationship identifier; otherwise, they are projected to their ending node identifier.
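The projection described above can be sketched as follows. This is a minimal illustration, assuming an edge-store held as a Python list of triples; the names `edge_store` and `project` and the sample data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (start, relationship, end)

# Hypothetical edge-store; the list index is the edge index.
edge_store: List[Edge] = [
    ("ann", "is_author_of", "p1"),
    ("bob", "is_author_of", "p2"),
    ("p2", "cited", "p1"),
]


def project(store: List[Edge], positions: Iterable[int], j: int) -> Set[str]:
    """Project the edges at the given positions to column j.

    j = 1 yields start node identifiers, j = 2 relationship identifiers,
    and j = 3 end node identifiers.
    """
    if j not in (1, 2, 3):
        raise ValueError("j must be 1, 2, or 3")
    return {store[pos][j - 1] for pos in positions}


print(project(edge_store, [0, 2], 3))  # end nodes of the edges 0 and 2
```

Feeding the positions returned by a basic pattern match into such a projection turns an edge-level result into a node-level result, which is the bridge to single step traversal functions.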


Executing Basic Pattern Matching


slide 8

Executing Basic Pattern Matching

Basic Pattern Matching $(\alpha,\beta,\gamma)_R$

Non-Native Approaches

• Scanning (relational DBMS)

• Tree-Based Indexing (e.g., RDF-3X)

Native Approaches

• Object Cache (e.g., Neo4j)

• Adjacency Lists (e.g., graph compute engines)

Execution


slide 9

Scanning

[Figure: the triples at positions 0 to 7 are distributed across CPU 1 to CPU 4, evaluated in parallel (match / no match), and merged into the result set {0, 4, 5, 7}.]

• Records are accessed sequentially to evaluate the predicate of $(\alpha,\beta,\gamma)_R$

‒ Larger design space (storage layout and access pattern)

• The query is answered by determining the set of triples that satisfy the predicate
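The distribute, evaluate, and merge steps of the scan can be sketched as below. This is a structural sketch only: the names `scan_partition` and `parallel_scan` and the sample table are assumptions, and Python threads illustrate the partitioning rather than the real speed-up of a native multi-core implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple

Edge = Tuple[int, int, int]  # (start, relationship, end)
Pattern = Tuple[Optional[int], Optional[int], Optional[int]]  # None is the wildcard


def scan_partition(table: List[Edge], pattern: Pattern, lo: int, hi: int) -> List[int]:
    """Evaluate the predicate sequentially on the positions lo..hi-1."""
    return [
        pos
        for pos in range(lo, hi)
        if all(w is None or w == h for w, h in zip(pattern, table[pos]))
    ]


def parallel_scan(table: List[Edge], pattern: Pattern, workers: int = 4) -> List[int]:
    """Distribute the table over workers, evaluate in parallel, merge the results."""
    n = len(table)
    chunk = max(1, -(-n // workers))  # ceiling division
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: scan_partition(table, pattern, *b), bounds)
        # Partitions come back in order, so concatenation is already sorted.
        return [pos for part in parts for pos in part]


# Hypothetical table with eight triples.
TABLE: List[Edge] = [
    (1, 0, 2), (2, 0, 3), (3, 0, 1), (2, 1, 1),
    (1, 1, 3), (1, 0, 4), (4, 0, 2), (1, 1, 2),
]

print(parallel_scan(TABLE, (1, None, None)))  # edges out of node 1
```

Every position is touched regardless of how many records match, which is why the scan cost depends on the total table size.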

Table 5.3: Complexities of an edges-out query in a graph G = (V, E) on node u, additional storage consumption relative to working dataset size, and index build time. Best values are marked in bold font.

Query Task              Scan          Native       Red-Black        Adjacency
Find u                  O(|E|)        O(1)         O(log₂(|E|))     O(1)
Determine N(u)          O(|N(u)|)     O(1)         O(1)             O(1)
Return (N(u), types)    O(1)          O(1)         O(1)             O(1)

Storage overhead        Scan          Native       Red-Black        Adjacency
EEN dataset             -             + 108%       + 67%            + 83%
FB dataset              -             + 3980%      + 60%            + 1740%
GOG dataset             -             + 140%       + 58%            + 96%

Index build time        Scan          Native       Red-Black        Adjacency
EEN dataset             -             3 msec       60 msec          3.8 msec
FB dataset              -             0.07 msec    0.44 msec        0.00 msec
GOG dataset             -             100 msec     1100 msec        10 msec

5.4 Summary

In this chapter we provided a brief overview of the implementation details of our evaluation prototype in Section 5.1, and evaluated the concepts covered in this thesis in Section 5.2 and Section 5.3.

In our micro-benchmarks, we determined a suitable scan configuration for our case study: a horizontal parallelization on a column-oriented storage layout with bulk-processing evaluation for large datasets (i.e., the scanning is memory-bound). We showed that scanning is not a competitive option compared to the indexes introduced in this thesis when query performance is the primary concern. If storage overhead is the primary concern, scanning might be considered as an option. However, we argued that efficient indexing can heavily exploit pre-computed resultsets for single step traversals, such that the task of answering a query becomes the effort of finding the record and copying a memory address.

Query Execution. In Table 5.3 we summarize the complexities of an edges-out query on a node u in a graph-shaped dataset. The evaluation of an edges-out query consists of three sub-tasks: (1) find u, (2) determine its neighborhood N(u), and (3) return N(u) along with the relationship types. Once a resultset is available, the effort of sub-task (3) is constant for all approaches, since it amounts to returning a memory address. The differences lie in sub-tasks (1) and (2). While the native approach (object cache) and the adjacency list can access the pre-computed set for u directly, both scanning and search-tree-based approaches must find u first. Since the red-black tree is a balanced search …

Execution


slide 10

Tree-Based Indexing

• Records are organized in several tree indexes (i.e., seven trees)

‒ per single value

‒ per pair of values

‒ per triple of values

• Implemented as binary search trees (i.e., red-black search trees)

• Querying $(\alpha,\beta,\gamma)_R$ means finding the key (α,β,γ) in the corresponding tree and returning its value (where α, β, γ are identifiers or *)

[Figure: search tree lookup (match / no match) leading to the result set {0, 4, 5, 7}.]
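The idea of one index per single value, per pair of values, and per triple of values can be sketched as follows. Python has no red-black tree in its standard library, so this sketch swaps in sorted key arrays with binary search (`bisect`) as a stand-in for a balanced search tree; the class and function names are hypothetical.

```python
import bisect
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None is the wildcard


class SortedIndex:
    """Sorted keys plus binary search: a stand-in for a balanced search tree."""

    def __init__(self, table: List[Edge], columns: Tuple[int, ...]):
        groups: Dict[tuple, List[int]] = {}
        for pos, edge in enumerate(table):
            groups.setdefault(tuple(edge[c] for c in columns), []).append(pos)
        self.keys = sorted(groups)
        self.values = [groups[k] for k in self.keys]

    def find(self, key: tuple) -> List[int]:
        i = bisect.bisect_left(self.keys, key)  # logarithmic search
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]  # pre-computed positions
        return []


def build_indexes(table: List[Edge]) -> Dict[Tuple[int, ...], SortedIndex]:
    # 3 single columns + 3 column pairs + 1 triple = 7 indexes
    return {
        cols: SortedIndex(table, cols)
        for r in (1, 2, 3)
        for cols in combinations(range(3), r)
    }


def query(indexes: Dict[Tuple[int, ...], SortedIndex], pattern: Pattern) -> List[int]:
    cols = tuple(c for c, v in enumerate(pattern) if v is not None)
    if not cols:
        raise ValueError("this sketch keeps no index for the pattern (*,*,*)")
    return indexes[cols].find(tuple(pattern[c] for c in cols))


R: List[Edge] = [
    ("ann", "is_author_of", "p1"),
    ("bob", "is_author_of", "p2"),
    ("p2", "cited", "p1"),
    ("cem", "is_author_of", "p3"),
    ("p3", "cited", "p1"),
]
indexes = build_indexes(R)
print(query(indexes, (None, None, "p1")))
```

The non-wildcard columns of the pattern select which index to use, so a lookup costs one search instead of a pass over the whole table.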







slide 11

Object Cache

• The object cache is used to avoid costly disk access and also serves as an index

‒ Each node v is mapped to a pair $(e_{in}(v), e_{out}(v))$

‒ $e_{in}$ contains all edges (grouped by relationship) where v is the ending node

‒ $e_{out}$ contains all edges (grouped by relationship) where v is the starting node

• Querying $(\alpha,\beta,*)_R$ is answered by returning the edge set of relationship β in $e_{out}(v)$ with v = α

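[Figure: an array over all node identifiers (0, 1, …, i, …, n); each entry holds a pair (e_in, e_out); in e_out, the group for relationship β0 holds the edges 4, 5, 7.]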
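The mapping from a node to its pair of relationship-grouped edge sets can be sketched with nested dictionaries. This is an illustrative sketch, not the Neo4j or thesis implementation; `build_object_cache`, `nodes_out`, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (start, relationship, end)


def build_object_cache(table: List[Edge]):
    """Map each node v to a pair (e_in, e_out), each grouping edge positions by relationship."""
    cache = defaultdict(lambda: (defaultdict(list), defaultdict(list)))
    for pos, (start, rel, end) in enumerate(table):
        cache[start][1][rel].append(pos)  # e_out(start)
        cache[end][0][rel].append(pos)    # e_in(end)
    return cache


def nodes_out(cache, alpha: str, beta: str) -> List[int]:
    """Answer (alpha, beta, *): the edge set of relationship beta in e_out(alpha)."""
    if alpha not in cache:
        return []
    _e_in, e_out = cache[alpha]
    return e_out.get(beta, [])


R: List[Edge] = [
    ("ann", "is_author_of", "p1"),
    ("ann", "is_author_of", "p2"),
    ("p2", "cited", "p1"),
    ("ann", "reviewed", "p3"),
]
cache = build_object_cache(R)
print(nodes_out(cache, "ann", "is_author_of"))
```

Once the node entry is found, the answer is the stored group itself, so no edge has to be tested against the predicate at query time.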







slide 12

Adjacency Lists

• Contains the results for Edges-Out queries $(\alpha,*,*)_R$ (i.e., a restriction to a relationship is not considered)

• The i-th entry contains a list of identifiers of the edges whose nodes are in the neighborhood of node i

• Typically implemented as an array of arrays

• Two adjacency lists, one for $e_{in}$ and one for $e_{out}$

[Figure: an array over all node identifiers (0, 1, …, i, …, n); entry i points to the edge list 0, 4, 5, 7.]
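An array-of-arrays layout with one list for in-coming and one for out-going edges can be sketched as below. The function name `build_adjacency`, the dense integer node identifiers, and the sample table are assumptions made for illustration.

```python
from typing import List, Tuple

Edge = Tuple[int, int, int]  # (start, relationship, end), nodes numbered 0..n-1


def build_adjacency(table: List[Edge], num_nodes: int):
    """Build two arrays of arrays: entry i lists the edge positions of node i."""
    in_adj: List[List[int]] = [[] for _ in range(num_nodes)]
    out_adj: List[List[int]] = [[] for _ in range(num_nodes)]
    for pos, (start, _rel, end) in enumerate(table):
        out_adj[start].append(pos)  # answers (start, *, *)
        in_adj[end].append(pos)     # answers (*, *, end)
    return in_adj, out_adj


# Hypothetical table with eight triples over the nodes 0..4.
TABLE: List[Edge] = [
    (1, 0, 2), (2, 0, 3), (3, 0, 1), (2, 1, 1),
    (1, 1, 3), (1, 0, 4), (4, 0, 2), (1, 1, 2),
]
in_adj, out_adj = build_adjacency(TABLE, num_nodes=5)
print(out_adj[1])  # edges out of node 1
```

Because the relationship column is ignored, a query such as $(\alpha,\beta,*)_R$ would need a further selection over the returned list.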







slide 13

Applicability & Limits

Table 4.1: Support of basic patterns per matcher. The symbol X marks the direct support of a basic pattern; blank fields mark required post-processing (e.g., further selection) or fallback (i.e., execute scanning instead).

Basic Pattern   Scan   Native   Red-Black   Adjacency

q1: Exact match X X

q2: Nodes out X X X

q3: Connected by X X

q4: Nodes in X X X

q5: Edges in X X X X

q6: Edges out X X X X

q7: Connects X X

q8: Full scan X X

4.3 Summary

In the previous chapter, we showed that the concept of basic patterns is a core functionality to evaluate single step traversals. In this section we provide an overview of four state-of-the-art approaches to evaluate single step traversals.

For each, we contribute an extended or modified version to evaluate basic patterns. Thereby, we answered our third research question, which asks for possibilities to adapt existing approaches to a main-memory setting.

• Scan-Matcher: A highly efficient, parallelized, and memory-bound scan approach. It is aware of the table's storage layout (i.e., the records are aligned column-wise or row-wise in memory) and of the access pattern (i.e., the evaluation can be either in a bulk fashion or in a branching fashion). For querying we chose a vertical parallelism strategy. Hence, we evaluate a query by delegating the evaluation of table partitions to threads, one per CPU core. The result is then merged after a synchronization step, and returned.

• Native-Matcher: A storage-layout-independent index structure that is based on native graph processing concepts (especially on the object cache). Each node is mapped to a cache pair, each containing a collection of cache entries. These cache pairs contain out-going and in-going edges, i.e., if the out-going edges of a node are required, the out-going edge cache collection is considered. Each cache collection is organized in a relationship-centric way, such that related edges are grouped by relationship. Given a certain node and a certain relationship, querying the edges is done by seeking to the node's out-going edges cache entry. The cache collection with the given relationship as grouping key contains the desired result.

Execution

Evaluation


slide 14

Evaluation

1. Micro-Benchmarks

• Artificial datasets

• Single step traversals

2. Case Study „Reachability Queries“

• Real-world datasets

• Series of single step traversals

Evaluation


slide 15

Micro-Benchmarking - Results

Avoid scanning: For regular access, scanning is likely not an option

• Managing an index has its costs: is a scan an option for intermediate result data structures?

[Figure: execution time (ms) over selectivity (in %) for an edges-out query on a 1 GiB dataset; series: indexes (native/non-native) and scan; the 100 ms threshold and the maximum selectivity for the scan are marked.]

Figure 5.1: Effect of the selectivity value on execution time on a graph with ≈ 90 mio. edges (executed on server). 100 ms is marked as threshold for interactive queries. The selectivity for which the scan exceeds this threshold is also marked.

Observation. With increasing selectivity, all approaches require more time to answer the query, as expected. This decrease in execution performance is linear for all approaches, but the slopes are different. The strongest performance decrease is for scanning. The other approaches lose only slightly in performance, even when nearly all edges must be returned. In fact, we observed that the indexes were able to answer the query in less than 100 ms independent of the selectivity value. Thus, the indexes, independent of whether they are native or non-native, are promising options for large-scale graph datasets with respect to performance.

Explanation. We explain the difference between scanning and the indexes as follows: On the one hand, scanning requires touching each tuple independent of whether it will satisfy the predicate or not. This inherently requires a transfer of 1 GiB from memory to the CPU. All positions of satisfying tuples must be written to the resultset before the memory address of this resultset can be returned. After returning the resultset address, its dereferenced content is copied to the query's final resultset. Since the amount of data to scan remains the same, the performance decrease with increasing selectivity is caused by the effort it takes to write the relevant positions to the resultset and to the final resultset, respectively. For selectivity values near zero, the differences between scanning and indexing can be explained by the effort of reading and testing 1 GiB of data when using the scan. On the other hand, each index pre-computes the query answer …

Evaluation


slide 16

Micro-Benchmarking - Results

Engineering is important

• consider pitfalls in C++ such as implicit copy constructor call

• standard data structures (e.g., in STL) might become a bottleneck (e.g., std::back_inserter)

• be aware of implementation artifacts
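The difference between handing out a pre-computed result by reference and by value can be illustrated with a small Python analogue. It is not the C++ code of the prototype: the class `EdgesOutIndex` and its method names are hypothetical, and an immutable tuple stands in for a constant reference.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, int]  # (start, relationship, end)


class EdgesOutIndex:
    """Pre-computes the edges-out result per node (hypothetical sketch)."""

    def __init__(self, table: List[Edge]):
        grouped: Dict[int, List[int]] = {}
        for pos, (start, _rel, _end) in enumerate(table):
            grouped.setdefault(start, []).append(pos)
        # Tuples are immutable, so sharing them is safe, similar to a const reference.
        self._out: Dict[int, Tuple[int, ...]] = {k: tuple(v) for k, v in grouped.items()}

    def edges_out_by_reference(self, node: int) -> Tuple[int, ...]:
        # Hands out the stored result itself: constant effort, no copy.
        return self._out.get(node, ())

    def edges_out_by_value(self, node: int) -> List[int]:
        # Copies the stored result: effort grows with the result size.
        return list(self._out.get(node, ()))


TABLE: List[Edge] = [
    (1, 0, 2), (2, 0, 3), (3, 0, 1), (2, 1, 1),
    (1, 1, 3), (1, 0, 4), (4, 0, 2), (1, 1, 2),
]
index = EdgesOutIndex(TABLE)
shared = index.edges_out_by_reference(1)
copied = index.edges_out_by_value(1)
print(shared is index.edges_out_by_reference(1), copied)
```

The copy in the by-value variant is the kind of hidden cost that an implicit copy constructor call can introduce in C++.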

[Figure: execution time (ms) over #edges (in mio) in panels for 0.1 to 0.9, 2.5 to 7.5, and 25 to 75 mio edges; series: index (result by reference), index (result by value), scan.]

Figure 5.2: Effect of an increasing number of edges on execution time (executed on server).

… value is returning a constant reference to a resultset that was written based on the pre-computed result in the index.

Observation. As expected, all approaches are affected by an increasing number of edges in the graph, such that it takes longer to answer the query. The strongest decrease in performance can be observed for the scan approach. For an interactive query, which requires less than 100 ms for an answer, the scan is no option if there are more than 14 mio edges. In contrast, using an index (independent of whether it is a native or a non-native one), interactive queries can cover 5 times more edges compared to a scan. When the result is passed by reference, and no further writes to the resultset are required, the indexes will be interactive until approximately 200 mio edges (14 times more edges compared to scanning).

Explanation. As for the scalability micro-benchmark, all approaches are affected by the effort of writing the final result in general. Hence, there is an inherent performance decrease when the number of edges grows. In addition, there is an additional effort to write the resultset when using the index with result passing by value. Consequently, we can observe a nearly constant difference between the indexes using values and the indexes using references. This constant difference is the effort it takes to write the resultset. Writing the resultset before the final resultset is created is not necessary for indexes which use result passing by reference. However, on tiny to small graphs (left figure) scanning outperforms the index passing by value, and is competitive with any index for less than 0.1 mio edges. There is a threshold at approximately 6.3 mio edges where scanning is no reasonable option anymore (middle figure). For large-scale graphs, the only reasonable option is an index passing the result by reference. We explain the (nearly constant) performance difference of 50 ms between index result passing by reference and index result passing by value with the following two aspects: First, the index with result …

Evaluation


slide 17

Case Study - Datasets

Table 5.1: Reachability Dataset Network Statistics

Property               FB                EEN                  GOG
Network Type           Social            Communication        Web
Description            Social circles    E-Mail Network       From Google
#Nodes                 2,876             36,692               875,713
#Edges                 4,199             367,662              5,105,039
#Relationship Types    45                1                    1
Data set size          4.9 MiB           4.1 MiB              75.4 MiB
Diameter               8                 11                   21
Effective diameter     4.7               4.8                  8.1
Origin                 Mcauley [ML14]    Leskovec [LLDM09]    Leskovec [LLDM09]

Real world datasets. One of the more advanced characteristics is the longest shortest path in the graph, i.e., the maximal number of hops between two given nodes if they are connected over some transitive path. If this number is unconstrained, a reachability query must explore the entire dataset in the worst case before answering no. If this number is constrained, on the other hand, the search can be stopped at a certain depth. In addition, each node's degree influences the number of neighbors that might be considered during evaluation. Generating artificial data that considers such characteristics is out of the scope of this work. Hence, we focus on real-world datasets. In Table 5.1 we show the datasets which are used for this case study:

1. The Facebook (FB) dataset is a medium graph that covers (anonymous) social network users and their friendships. Friendship is determined by per-user friend lists indicating that one user has a certain friendship to other users. In addition to friendship information, certain properties are attached to users, e.g., the interest in a specific topic or the belonging to a certain group.²

2. The Enron E-Mail Network (EEN) dataset is a small graph and covers e-mail communication. E-mail accounts are nodes. If there was at least one e-mail sent between two accounts u and v, there is an edge between u and v.³

3. The Google Web Graph (GOG) dataset is a larger graph covering hyperlinks between web pages. Nodes are web pages, and edges are links between web pages. The dataset was released as part of the Google Programming Contest in 2002.⁴

² https://snap.stanford.edu/data/egonets-Facebook.html

³ https://snap.stanford.edu/data/email-Enron.html

⁴ https://snap.stanford.edu/data/web-Google.html

Evaluation


slide 18

Case Study - Results

Index performance is similar

• Optimized index structures pass a few pointers

• The major difference is finding a certain node for edges-out queries

Dataset characteristics and algorithm design affect query processing heavily
























[Figure: mean execution time (ms) over #hops (1 to 8), with one panel each for Enron, Google, and Facebook; series: adjacency, object cache, scan, tree; the 100 ms threshold and the maximum hop count for the scan are marked.]

„Up to which number of hops can the scan be used as a „fallback“ (threshold 100 ms)?“ (breadth-first search)

Figure 5.4: Reachability query execution times on real-world datasets depending on hop count.

… and the non-native approaches (the scan operation, and the tree-based index). The lower the execution time, the better the result. We show the mean execution times for reachability queries looking for the requested node in at most #hops. Hence, if the requested node is not found, a path of #hops length is fully explored. If, on the other hand, the requested node is found, the path has at most a length of #hops. We chose this, since it reflects a typical case for reachability queries. We consider an approach interactive if it answers a reachability query in less than 100 ms.
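A reachability query with a hop limit can be sketched as a breadth-first search in which each expansion is one edges-out single step traversal. The function name `reachable`, the dictionary-based neighbor lookup, and the sample graph are assumptions for illustration, not the algorithm as implemented in the thesis.

```python
from collections import deque
from typing import Dict, Hashable, List


def reachable(out_neighbors: Dict[Hashable, List[Hashable]],
              source: Hashable, target: Hashable, max_hops: int) -> bool:
    """Breadth-first search: is target reachable from source within max_hops edges?"""
    if source == target:
        return True
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop limit reached, do not expand further
        for nxt in out_neighbors.get(node, ()):  # one single step traversal
            if nxt == target:
                return True
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, depth + 1))
    return False


# Hypothetical chain 0 -> 1 -> 2 -> 3.
GRAPH = {0: [1], 1: [2], 2: [3], 3: []}
print(reachable(GRAPH, 0, 3, 3), reachable(GRAPH, 0, 3, 2))
```

If the target is not found, every node within the hop limit is visited, which is why the cost of the underlying single step traversal dominates the query time.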

Expectation. Based on our observations of the single-step traversal performance of the individual approaches, we expect a poor performance for the scan-based operation, and superior performance for the indexes. Likewise, we expect similar results for the indexing approaches, independent of whether it is a native or a non-native approach. Since our in-depth analysis in the previous section (Section 5.3.3) revealed significant effort to evaluate reachability queries on the GOG dataset, we expect the worst performance for the GOG dataset.

Observation. Independent of the dataset, the scan operation performs worst. With the exception of the FB dataset, the scan operation is not even acceptable for an interactive query. Besides the scan operation, all indexes perform similarly. In fact, all indexes are acceptable for interactive queries, even for the GOG dataset. In more detail, we observe query times below 1 ms for EEN and FB, as well as below 25 ms up to 4 hops and below 50 ms up to 8 hops for the GOG dataset. Moreover, we observe a similar growth for all approaches on the GOG dataset but with different factors: all approaches grow exponentially. The scan grows the most, such that it exceeds a reasonable query time very quickly. The indexes, however, are likely to exceed the interactive limit for hops greater than 10. Finally, we observe that on each dataset all approaches start with a similar query time for 1-…

Evaluation

Wrap-Up


slide 19

Summary of Trade-Offs

[Table: rating of Scan, Tree-based, Object Cache, and Adjacency List with respect to scalability, storage needed, creation time, implementation effort, and basic pattern support.]

1) Depends on data set

2) Depends on hardware and optimization

3) Depends on skills, design space and optimization

Wrap-Up


slide 20

Conclusion

• Organizing fast record access is primary

• Storage layout is secondary

• „Non-native“ drawback (i.e., dependence on graph size) is not as significant as in disk-based systems

• Each approach has its own benefits, drawbacks, and trade-offs

• Dataset characteristics influence performance

• Algorithm design (i.e., improved BFS) is needed

Future Work

• Compression

• Improved algorithm design

• Graph navigation in RDBMS

Wrap-Up

Thank you for your attention.

Questions?
