
Random Walk TripleRush: Asynchronous Graph Querying and Sampling

Philip Stutz, Department of Informatics, University of Zurich, Zurich, Switzerland, stutz@ifi.uzh.ch
Bibek Paudel, Department of Informatics, University of Zurich, Zurich, Switzerland, paudel@ifi.uzh.ch
Mihaela Verman, Department of Informatics, University of Zurich, Zurich, Switzerland, verman@ifi.uzh.ch
Abraham Bernstein, Department of Informatics, University of Zurich, Zurich, Switzerland, bernstein@ifi.uzh.ch

ABSTRACT

Most Semantic Web applications rely on querying graphs, typically by using SPARQL with a triple store. Increasingly, applications also analyze properties of the graph structure to compute statistical inferences. The current Semantic Web infrastructure, however, does not efficiently support such operations. This forces developers to extract the relevant data for external statistical post-processing.

In this paper we propose to rethink query execution in a triple store as a highly parallelized asynchronous graph exploration on an active index data structure. This approach also allows SPARQL querying to be integrated with the sampling of graph properties.

To evaluate this architecture we implemented Random Walk TripleRush, which is built on a distributed graph processing system. Our evaluations show that this architecture enables both competitive graph querying and the ability to execute various types of random walks with restarts that sample interesting graph properties. Thanks to the asynchronous architecture, first results are sometimes returned in a fraction of the full execution time. We also evaluate the scalability and show that the architecture supports fast query times on a dataset with more than a billion triples.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author’s site if the Material is used in electronic media. WWW 2015, May 18–22, 2015, Florence, Italy. ACM 978-1-4503-3469-3/15/05. http://dx.doi.org/10.1145/2736277.2741687.

1. INTRODUCTION

Use cases such as social network analysis, monitoring of financial transactions, or analysis of web pages and their links all require storage, retrieval, and analysis of large-scale graphs. To address this need, many have researched the development of efficient triple stores [1, 28, 21]. These systems borrow from the database literature to investigate efficient means for storing large graphs and retrieving subgraphs, which are usually defined via a pattern matching language such as SPARQL. Even though these systems process graphs, most of them leverage decades of research results in efficient processing of partial answer-sets by mapping the graphs into set-/array-style internal data structures. They are built like a centralized database, raising the question of scalability and parallelism within query execution.

To increase the parallelism of such graph-stores, modern solutions propose the use of parallel operators [30], sideways information-passing [20], or even pipelined operations and replication [8]. Other approaches focus on building triple stores based on specialized programming models for distributed systems: MapReduce [5] has been used to aggregate results from multiple single-node RDF stores in order to support distributed query processing [9] or to process whole SPARQL query execution pipelines (e.g., [14]). Whilst these systems efficiently support storage and retrieval, they mostly fall short on the support of graph-analytics. Hence, developers have to painstakingly retrieve the relevant data for statistical post-processing in a suitable tool.

In this paper we rethink query execution within graph stores in light of changes in computer architecture. We propose to exploit the large number of CPU cores of modern servers via the parallel exploration of partial bindings. Specifically, we explore each partial binding to a query in parallel, akin to graph exploration; this means (i) forking the exploration whenever more than one binding is possible, (ii) returning the result when all variables of an exploration are bound, and (iii) aborting the exploration when it reaches a dead end (i.e., it cannot match a triple pattern).

This re-conceptualization of triple stores has the side effect that it can efficiently support numerous graph-analytic algorithms such as Random Walks with Restarts (RWR)—the basis of many approaches to information extraction and reasoning in noisy domains—or basic graph algorithms such as shortest-path computations. This approach has the advantage of supporting the integration of statistical inference with SPARQL-based querying, which can provide better results in classification / learning [10], and it simplifies the specification of restrictions on RWR via the re-use of SPARQL.


We implemented Random Walk TripleRush (RW-TR)1 to explore this architecture. RW-TR is built on the distributed graph processing system Signal/Collect [24].2 Whilst traditional stores pipe data through query processing operators, RW-TR asynchronously routes query descriptions through an active data structure. For this reason, RW-TR does not use any joins in the traditional sense, but searches the index graph in parallel.

As a consequence, the contributions of this paper are the following: First and foremost, we propose a novel active index structure that supports the parallel and distributed exploration of answers to SPARQL queries. Second, we show how this architecture can be extended with support for RWR – an important graph-analytic approach. Third, we present an extensive evaluation of the architecture that includes (i) vertical-, horizontal-, and data-scalability experiments, (ii) an evaluation of the time until the first result is returned, which is sometimes computed much faster than the whole result set, (iii) a benchmark against two other triple stores in the single-node scenario, where RW-TR is on average more than 10 times faster, and (iv) a comparison with two other distributed triple stores at the billion-triple scale, where RW-TR is very competitive. Fourth and last, we show the effectiveness of our RWR computations via a use case.

In the following, we succinctly discuss the relevant related work, describe the novel distributed architecture, as well as the functionality and interactions of its building blocks. We then compare the architecture with traditional graph-store approaches. Next, we evaluate the approach on multiple benchmarks and show that it can offer competitive performance, as well as good scalability. We close with a discussion of the limitations.

2. RELATED WORK

Studies related to RW-TR can be divided into four categories: (i) distributed in-memory triple-stores, (ii) extensions to SPARQL, (iii) graph computation frameworks, and (iv) studies into RDF index structures.

Distributed in-memory triple stores: Most closely related to RW-TR are Trinity.RDF [30], which relies on parallel operators to improve SPARQL performance, and TriAD [8], which relies on pipelined operations and replication. Both systems are competitive in terms of SPARQL performance (see also Section 5.4) but are limited to pure SPARQL processing. In addition, both use a different approach to parallelization. Trinity.RDF relies on a distributed Bulk-Synchronous approach [27] whilst TriAD uses extensive pre-processing and replication of indices. RW-TR uses an asynchronous querying approach, which allows efficient embedding of graph sampling.

SPARQL extensions: A number of projects have proposed extending SPARQL with additional functionality. Corese [3, 4] and iSPARQL [12], for example, provide support for approximate matching, and SPARQL-ML [11] extends SPARQL with statistical relational learning operators. Whilst these approaches show how SPARQL could be extended, they typically do not do so efficiently. Only recently have benchmarks been proposed to integrate SPARQL processing with more traditional graph processing tasks.3 We are unaware of any other system that combines efficient SPARQL processing with efficient graph sampling.

1 RW-TR is a significant redesign and extension of TripleRush [26], which was limited to parallelising basic-graph-pattern queries on a single machine.
2 Signal/Collect is similar to Pregel [19], GraphLab/PowerGraph [6], and Trinity [22].

Distributed graph computation frameworks: A number of distributed graph processing frameworks have been proposed in recent years [16, 24, 6]. Whilst these systems provide a basis for building distributed analytic solutions, they do not provide a high-level (querying) language such as a SPARQL extension to answer analytic queries. Trinity.RDF [30], which is layered on top of Trinity, does offer SPARQL querying but does not provide any support for sampling queries or analytics, which would have to be implemented manually.

Sedge [29] introduces different graph partition management techniques to minimize inter-machine communication during query processing. The system’s effectiveness is demonstrated by answering SPARQL queries. In contrast, the focus of our work is not to find better partitions of the graph or manage them effectively, but to propose a new way of thinking about distributed triple stores with sampling capabilities.

RDF index structures: RW-TR reflects the insights gathered about RDF indexing in the past years [21, 28] in that it builds a multi-level structure of increasingly specific nodes. It differs significantly from these investigations in that it proposes query execution as a highly parallelized, asynchronous routing of partially bound results through the index. Some aspects of RW-TR have similarities with the pointer-chasing problem [18, 17, 13], where future references are prefetched to achieve locality. RW-TR’s general query execution is, however, fundamentally different, as it routes (passive) partial solutions through an actively processing index structure rather than employing a (possibly parallelized) program that operates on an optimized but passive index structure. RW-TR does exploit locality when compressing lower-level lists with delta-encoding.

3. RW-TR ARCHITECTURE

RW-TR is built leveraging the large-scale, parallel and distributed graph processing framework Signal/Collect4 [24, 25]. It allows graph computations to be specified in terms of vertex-centric methods. In contrast to other frameworks, Signal/Collect allows for asynchronous execution, multiple vertex types, and the ability to change the graph structure during the execution. Conceptually, Signal/Collect vertices can be seen as actor-like active elements, where the framework handles messaging, parallelization and distribution.

The core idea of RW-TR is to build a triple store with three types of Signal/Collect vertices: each index vertex corresponds to a triple pattern, each triple vertex corresponds to an RDF triple, and query vertices coordinate query execution. Partially matched copies of queries are routed in parallel along different paths of this structure. The index graph is optimized for efficient routing of query descriptions to data, and its vertices are addressable by an ID, which is a unique [ subject predicate object ] tuple.

We first describe how the graph is conceptually built and then explain the details of how this structure enables efficient parallel graph exploration.

3 http://ldbcouncil.org/benchmarks/snb
4 http://uzh.github.io/signal-collect/


Figure 1: RW-TR index graph for the triple vertex [ Elvis inspired Dylan ].

3.1 Building the Index Graph

As mentioned before, RW-TR is a triple store with three types of Signal/Collect vertices:

Triple vertices (level 4, Fig. 1) represent triples in the database. Each contains subject, predicate, and object information.

Index vertices (levels 1-3, Fig. 1) represent triple patterns and are responsible for routing partially matched copies of queries (referred to as query particles) towards triple vertices that match their respective patterns. They also contain subject, predicate, and object information, but one or several of them are wildcards.

Query vertices (Fig. 2) are added to the graph for each query that is being executed. A query vertex emits the first query particle that traverses the index structure. All query particles—successfully matched or not—get routed back to their respective query vertex, and successful ones get reported as results. Once the query execution has finished, the query vertex removes itself from the graph.

The graph is built bottom-up, starting by creating a triple vertex for each RDF triple. These vertices are added to Signal/Collect, which turns them into parallel processing units. A triple vertex will add its immediate index vertices (if they do not exist yet) and an edge from each of those vertices to itself. The construction process continues recursively for the index vertices until the parent vertex has already been added or the index vertex has no parent.

The index structure illustrated in Fig. 1 ensures that there is exactly one path from an index vertex to each triple vertex below it.

Observations: The number of predicates is usually much smaller than the number of distinct subjects or objects. Hence, storing edges from the root to [ * P * ] vertices requires the least amount of memory. The index graph we just described is different from traditional index structures, because it is designed for the efficient parallel routing of messages to triples corresponding to a given triple pattern. All vertices that form the index structure are active parallel processing elements that only interact via message passing.

3.2 Query Execution

We now look into how a query is executed, and then we follow with the description of the query optimizer.

Consider the subgraph shown in Fig. 2 and the query processing for the query: (unmatched = [ ?X inspired ?Y ], [ ?Y inspired ?Z ]; bindings = {}). The query execution starts by adding the query vertex to the TripleRush graph. After the query optimizer determines the execution order of the triple patterns, the query gets processed as follows:

1. The query vertex emits a single query particle, which is routed (by Signal/Collect) to the index vertex that matches its first unmatched triple pattern. To determine when a query has finished processing, the initial query particle is endowed with a large number of tickets (Long.MaxValue). Should the tickets ever run out, new tickets could be acquired from the query vertex.5

2. When a query particle arrives at an index vertex, a copy of it is sent along each edge. The original particle evenly splits up its tickets among its copies.

3. Once a query particle reaches a triple vertex, the vertex attempts to match the next unmatched query pattern to its triple. If this succeeds, then a variable binding is created and the remaining triple patterns are updated with the new binding. The query particle gets sent to the index or triple vertex that matches its next unmatched triple pattern.

4. If all triple patterns are matched, then the query particle gets routed back to its query vertex.

5. If no vertex with a matching pattern is found, then a handler for undeliverable messages routes the failed query particle back to its query vertex.

6. Query execution finishes when the sum of tickets of all failed and successful query particles received by the query vertex equals the initial ticket endowment of the first particle that was sent out. The query vertex reports that all results have been delivered and removes itself from the graph.
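To make the ticket accounting concrete, the following Scala sketch mirrors the mechanism just described: an index vertex splits a particle's tickets among the copies it forwards, and the query vertex terminates once the returned tickets add up to the initial endowment. The types and names (TriplePattern, QueryParticle, QueryVertex, splitAmongChildren) are illustrative assumptions, not RW-TR's actual Signal/Collect implementation.

case class TriplePattern(s: String, p: String, o: String)

case class QueryParticle(
  tickets: Long,                   // remaining ticket endowment carried by this particle
  unmatched: List[TriplePattern],  // patterns still to be matched
  bindings: Map[String, String])   // variable bindings collected so far

// An index vertex copies an arriving particle along each of its edges and splits
// the tickets evenly; the remainder is spread over the first copies so that no
// tickets are lost or created.
def splitAmongChildren(p: QueryParticle, childCount: Int): Seq[QueryParticle] = {
  val base = p.tickets / childCount
  val remainder = (p.tickets % childCount).toInt
  (0 until childCount).map { i =>
    p.copy(tickets = base + (if (i < remainder) 1 else 0))
  }
}

// The query vertex adds up the tickets of all returning particles (successful or
// failed) and finishes once the initial endowment is fully accounted for.
final class QueryVertex(initialTickets: Long = Long.MaxValue) {
  private var returned = 0L
  def particleReturned(p: QueryParticle): Unit = returned += p.tickets
  def finished: Boolean = returned == initialTickets
}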

Observations: Queries are often routed along downward edges in the index structure, and placing the index vertices in a way that achieves good locality means fewer messages are sent across machines. We found that the following scheme can achieve good locality, while at the same time ensuring a high degree of parallelism: if the subject of an index vertex is defined, then it is placed on a node determined by its subject. If the subject is a wildcard, then it is placed on a node determined by the object. If only the predicate is defined, then it is placed on a node determined by the predicate. The root index vertex is hardcoded to the last node. This scheme guarantees that particles are locally routed from [ S * * ] to [ S P * ] as well as from [ * * O ] to [ * P O ].

In addition, to assign a vertex to a worker on the machine identified with the above assignment scheme, we compute the sum of its (encoded, see 3.5) IDs modulo the number of workers on the assigned node. In our tests, this scheme performed better than mixing the values with a collision-minimizing hash function. Signal/Collect uses the same mappings for vertex addressing and for routing messages to (potentially non-existent) index vertices.

5 This feature is currently not supported by our system and was not necessary for any of our evaluations.
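The placement rules above can be summarized in a small sketch. It assumes dictionary-encoded integer IDs with 0 denoting a wildcard and a simple modulo mapping from the deciding ID component to a cluster node; the actual RW-TR mapping may differ in these details.

case class PatternId(s: Int, p: Int, o: Int) // 0 encodes a wildcard position (assumed encoding)

def nodeFor(id: PatternId, numNodes: Int): Int = id match {
  case PatternId(0, 0, 0)           => numNodes - 1               // root vertex: hardcoded to the last node
  case PatternId(s, _, _) if s != 0 => Math.floorMod(s, numNodes) // subject bound: place by subject
  case PatternId(_, _, o) if o != 0 => Math.floorMod(o, numNodes) // subject wildcard: place by object
  case PatternId(_, p, _)           => Math.floorMod(p, numNodes) // only predicate bound: place by predicate
}

// Within the chosen node, the worker is derived from the sum of the encoded IDs.
def workerFor(id: PatternId, workersPerNode: Int): Int =
  Math.floorMod(id.s + id.p + id.o, workersPerNode)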


Figure 2: Query execution on the relevant part of the index that was created for the triples [ Elvis inspired Dylan ] and [ Dylan inspired Jobs ].


3.3 Query Optimization

It has been highlighted [23] that the order in which patterns are explored affects the performance of query processing. When a triple pattern is matched at an index vertex, the bindings made in that vertex are forwarded to the index vertices responsible for subsequent triple patterns (unless there is no match). Since it is expensive to send particles, we explore the graph with a triple pattern ordering that considers both the number of particles to be sent and the branching factor for each particle when matching the next pattern. We estimate these costs by gathering several statistics on index vertices and predicates.

Statistics on the number of children are incrementally aggregated in the index vertices during graph loading. These statistics are cached and, if necessary, retrieved in parallel from the index vertices before optimization. We also compute predicate selectivity statistics after the loading is complete, by dispatching all required two-pattern queries [23] for all predicate combinations. In order to make this fast, we added special support and optimizations for queries that only compute the result count. When determining the number of bindings for the second pattern, the counts can be directly accessed in the index vertices. For the evaluated datasets, the predicate selectivity gathering is usually faster than the graph loading, but the number of dispatched queries is O(|pred|^2), which could become a problem if a dataset contains many predicates. The optimizer also works, although not as well, when substituting missing selectivities with very large numbers (this was not necessary for any of the evaluated datasets).

In the remainder of this subsection we briefly introduce the cost model and discuss the optimization procedure.

3.3.1 Cost Model

We model a query q as a sequence of triple patterns p_i, indexed by i >= 1. For a triple pattern p, we denote the subject, predicate, and object by p.sub, p.pred, and p.obj, respectively.

The cost of executing a query can now be defined as the sum of the costs of matching individual triple patterns in a given order:

    Cost(q) = \sum_{p_i \in q} cost(p_i)    (1)

The cost cost(p_i) of matching the i-th triple pattern depends on two factors. First, we have to consider the number of bindings or query particles created by the previous triple pattern, which we call frontier(p_{i-1}), in accordance with graph search algorithms. frontier(p_{i-1}) can be seen as a worst-case estimate of the number of particles that might reach this stage of the exploration. Second, we need to account for the exploration cost explore(p_i) of the index vertex corresponding to the triple pattern p_i. This can be seen as a worst-case estimate of the branching factor encountered per frontier particle that matches triple pattern p_i. Consequently, for i >= 1, we estimate the cost of matching a pattern as: cost(p_i) = frontier(p_{i-1}) \times explore(p_i).

In order to define these two functions we need the statistics defined in Table 1. Given these statistics we can estimate frontier(p_{i-1}) as:

    frontier(p_{i-1}) =
        1                                                     if i = 1
        card(p_i)                                             if i = 2
        min( explore(p_i), min_{j<i} selectivity(p_j, p_i) )  otherwise (over all available selectivities)

and explore(p_i) is expressed as:

    explore(p_i) = min( card(p_i), branch(p_i) ),

where branch(p_i), the branching factor of the index element associated with p_i, is estimated as follows:

    branch(p_i) =
        card(p_i)                     if i = 1
        1                             if p_i.sub, p_i.pred, p_i.obj are bound
        maxObj(p_i)                   if p_i.sub, p_i.pred are bound
        maxSub(p_i)                   if p_i.pred, p_i.obj are bound
        |pred|                        if p_i.sub, p_i.obj are bound
        edges(p_i) \times maxObj(p_i) if p_i.pred is bound
        card(p_i)                     otherwise


card(p):               cardinality of triple pattern p, i.e., number of triples that can be reached following the vertex responsible for the triple pattern
selectivity(p_i, p_j): number of vertices that are connected by a predicate-pair (p_i.pred, p_j.pred), sharing a common subject/object (see [23])
edges(p):              number of outgoing edges from the [ * P * ] vertex corresponding to p.pred to all its [ S P * ] vertices (Figure 1)
maxObj(p):             the maximum number of objects of any [ S P * ] vertex corresponding to the predicate p.pred
maxSub(p):             the maximum number of subjects of any [ * P * ] vertex corresponding to the predicate p.pred
|pred|:                number of distinct predicates

Table 1: Statistics used in query optimization
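As an illustration, the sketch below evaluates this cost model for a given pattern ordering, mirroring the case definitions above literally. Pattern, Stats, and all function names are assumed placeholders rather than RW-TR's actual optimizer interfaces; the statistics are supplied by the caller.

case class Pattern(subBound: Boolean, predBound: Boolean, objBound: Boolean)

trait Stats {
  def card(p: Pattern): Long
  def selectivity(pi: Pattern, pj: Pattern): Option[Long] // None when not gathered
  def edges(p: Pattern): Long
  def maxObj(p: Pattern): Long
  def maxSub(p: Pattern): Long
  def numPredicates: Long
}

def branch(p: Pattern, s: Stats, isFirst: Boolean): Long =
  if (isFirst) s.card(p)
  else (p.subBound, p.predBound, p.objBound) match {
    case (true, true, true)   => 1L
    case (true, true, false)  => s.maxObj(p)
    case (false, true, true)  => s.maxSub(p)
    case (true, false, true)  => s.numPredicates
    case (false, true, false) => s.edges(p) * s.maxObj(p)
    case _                    => s.card(p)
  }

def explore(p: Pattern, s: Stats, isFirst: Boolean): Long =
  math.min(s.card(p), branch(p, s, isFirst))

// Worst-case cost estimate of executing the patterns in the given order.
def cost(plan: Vector[Pattern], s: Stats): Long =
  plan.zipWithIndex.map { case (p, i) =>
    val frontier: Long =
      if (i == 0) 1L
      else if (i == 1) s.card(p) // mirrors the i = 2 case of the frontier definition above
      else {
        val sels = plan.take(i).flatMap(prev => s.selectivity(prev, p))
        (explore(p, s, isFirst = false) +: sels).min
      }
    frontier * explore(p, s, isFirst = i == 0)
  }.sum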

3.3.2 Query Optimizer

The query optimizer uses uniform-cost search to find the plan with the best worst-case cost estimate. We employ a min-heap ordered by Cost(q), which initially gets seeded with all possible 1-pattern plans. The optimizer then repeatedly removes the cheapest plan from the heap and computes all possible plan-extensions, which it then inserts into the min-heap.

To prevent the expansion of non-optimal (partial) plans, the optimizer maintains a map that uses the set of covered triple patterns as a key and the lowest-cost ordering of the patterns as the value. Before expanding partial plans, the planner looks up their triple-pattern set in the map and only expands the partial plans that have no prior entries. Other plans are discarded as suboptimal.

If the optimizer finds a plan where frontier(p_{i-1}) = 0, then it reports that the query has no results and the query is not executed. When the optimizer finds a plan that uses all patterns at the top of the heap, then it has found the cost-optimal plan to execute according to the model.

All operations on the planning heap and reference map take O(log n) time, where n is the number of elements in the heap/map. In the worst case, the planning heap can contain almost all incomplete plans, which is exponential in the number of patterns in a query: one can create a (partial) plan by picking or not picking each of the |q| patterns, resulting in a heap size and number of insert operations of O(2^{|q|}). This means that the search-space exploration time complexity is O(log(n) * 2^{|q|}).

In order to prevent this exponential increase of the planning time for queries with many patterns, we use a greedy query optimizer when the number of patterns in the query is greater than a fixed number.6 The greedy optimizer is described in [26].

6 In our experiments, we fixed this number to 8.
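A compact sketch of the uniform-cost plan search with the pruning map might look as follows. It is generic over the pattern type and takes the cost estimator (for example, the one sketched for the cost model) as an argument; it illustrates the search strategy and is not the actual RW-TR optimizer. The early termination for plans with frontier(p_{i-1}) = 0 and the greedy fallback for queries with many patterns are omitted.

import scala.collection.mutable

def optimize[P](patterns: Set[P], cost: Vector[P] => Long): Vector[P] = {
  // Min-heap of (estimated cost, partial plan); PriorityQueue is a max-heap,
  // so the ordering on the cost component is reversed.
  val heap = mutable.PriorityQueue.empty[(Long, Vector[P])](
    Ordering.by[(Long, Vector[P]), Long](_._1).reverse)
  // Covered-pattern set -> cost of the cheapest ordering expanded so far.
  val best = mutable.Map.empty[Set[P], Long]

  patterns.foreach(p => heap.enqueue((cost(Vector(p)), Vector(p))))

  while (heap.nonEmpty) {
    val (c, plan) = heap.dequeue()
    if (plan.size == patterns.size) return plan // cheapest complete plan found
    val covered = plan.toSet
    if (!best.contains(covered)) {              // skip dominated partial plans
      best(covered) = c
      (patterns -- covered).foreach { p =>
        val extended = plan :+ p
        heap.enqueue((cost(extended), extended))
      }
    }
  }
  Vector.empty // only reached for an empty pattern set
}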

3.4 Extension to Random Walks

Random Walks with Restarts (RWR) are a popular graph sampling technique that can be used for various tasks, from computing the similarity between two nodes in a graph to retrieving novel relations [15]. Random-walk-based models have been applied to many problems such as ranking web pages and segmenting images. Conceptually, random walks can be seen as starting from a given vertex and then following a random edge to a neighboring vertex. There, the walker moves again to a randomly chosen neighbor, goes back to the vertex from which the walk started, or stops its walk based on a restart rule. Example restart rules are (i) walking for a finite number of steps from the starting node, (ii) walking for any number of steps and restarting when there is no outgoing edge, or (iii) restarting at each vertex with a given probability.

In order to add support for efficient sampling queries based on RWR, we have to modify three elements of the previously described architecture: First, we extend the query particles with the extra structures required for sampling. Second, we modify the routing of sampling query particles to adhere to the rules of random walks. And third, to sample correctly we need to store additional bookkeeping information inside the second-level indices (SIndex, PIndex, OIndex). Next we describe each of these modifications in more detail.

To allow for sampling queries, we extended the query particles with a flag that indicates if the particle is currently executing a traditional (SPARQL) part of a query or a sampling element. In addition, we extended the particles' data structure to optionally include information about the constraints of the random walk, such as the directionality (subject -> object, object -> subject, or both), any constraints on the path (e.g., if it should only follow certain properties or some specified sequence of property types), and the stopping condition. This approach allows us to combine SPARQL and sampling queries within the same execution.
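A hypothetical sketch of this extended particle state is shown below; the field and type names (WalkConstraints, SamplingParticle, and so on) are illustrative, and RW-TR's actual encoding of this information may differ.

sealed trait Direction
case object SubjectToObject extends Direction
case object ObjectToSubject extends Direction
case object BothDirections extends Direction

sealed trait StoppingCondition
final case class MaxHops(hops: Int) extends StoppingCondition
final case class RestartProbability(p: Double) extends StoppingCondition

final case class WalkConstraints(
  direction: Direction,
  allowedPredicates: Option[Set[Int]], // e.g. only follow certain (dictionary-encoded) properties
  stop: StoppingCondition)

final case class SamplingParticle(
  sampling: Boolean,           // false while matching the plain SPARQL part of the query
  tickets: Long,
  hopsTaken: Int,
  bindings: Vector[Int],       // dictionary-encoded bindings collected so far
  constraints: WalkConstraints)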

A naïve routing approach would route as many particles through the index as there are tickets. Whenever a particle arrives at a triple vertex, it would test its random walk constraints and decide whether to stop the exploration or continue. This would lead to a high overhead, as the same path would be followed multiple times. To improve on this approach, we route as follows: each query begins with a certain number of tickets provided to it. At each index level the particle is split and sent along each index path that qualifies according to the random walks' constraints. Tickets are assigned to each particle such that the sampling of the graph is not biased by the index structure. If there are not enough tickets to assign to all particles, then we randomly choose some paths to follow and abandon the others. As a result, RW-TR computes as many random walks in parallel as there are tickets.

Once the query reaches a triple vertex, the stopping condition gets evaluated. If it applies and the query constraints are met, then the variable bindings to the subject, predicate, and/or object stored in the current vertex are added to the particle and reported to the query vertex as a success. If the stopping condition applies and the query constraints fail, then it is reported as a failure. Otherwise, the exploration continues.

To assign the tickets proportionally, additional bookkeeping information is stored in the second-level indices. We need to store the total number of outgoing edges that can be traversed by following the child vertices, and the sum of both outgoing and incoming edges of all child vertices. This information is calculated during the data-loading phase. To assign the tickets proportionally to the particles sent to the child vertices, we need to know the outgoing edges per child index vertex. Precomputing these would increase the index size considerably. We therefore ask the child index vertices for their number of outgoing edges via a special signal and can dispatch the particles as soon as the information arrives, as we know the total number of outgoing edges.

As an example, consider the sampling query

    SAMPLE ?X FROM [ Elvis inspired ?X ]
    CONSTRAINTS [ maxhops = 3, tickets = 10 ]

and the subgraph of Fig. 2. This is a neighborhood sampling query, as it returns a sample of vertices reachable from Elvis by traversing edges labeled inspired for a maximum of three hops. Given that it uses 10 tickets, it performs 10 random walks along the inspired edges from the Elvis vertex. We compute these random walks by starting at the vertex [ Elvis inspired * ], which has only one outgoing edge, leading to [ Elvis inspired Dylan ]. Given that Dylan is a correct answer to the random walk query, we would need to flip a coin to decide whether to return it as a correct answer or continue. As we are doing multiple random walks in parallel, it is efficient to do both. Hence, we assign half the tickets to the current answer and continue exploring with the other half (when we have an odd number of tickets we flip a coin to determine which path gets one more ticket). Hence, 〈Dylan〉, 5 is returned to the query vertex. Again, there is only one outgoing edge, along which the remaining 5 tickets of the query are sent. At the vertex [ Dylan inspired Jobs ], the binding 〈Dylan, Jobs〉, 3 is returned to the query vertex, indicating that the path to reach Jobs went via the Dylan vertex. This is the second hop of the query, and according to the constraints set on the query, the query can make one more hop. But since there are no vertices that can be traversed from here, we will also return the two remaining tickets to the query vertex. The final result of our neighborhood sampling query will be the following distribution of bindings: [ 〈Dylan〉, 5; 〈Dylan, Jobs〉, 5 ].
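The proportional split at an index vertex could look roughly as follows. The sketch assumes the per-child outgoing-edge counts have already been obtained (in RW-TR via the special signal mentioned above) and uses a simple proportional rounding with random distribution of the leftover tickets; the exact rounding and coin-flip rules in RW-TR may differ.

import scala.util.Random

// Distribute `tickets` over the child index paths in proportion to each child's
// number of outgoing edges; leftover tickets from integer rounding are handed out
// randomly. Children that end up with 0 tickets are abandoned.
// (Assumes tickets * edge count does not overflow a Long.)
def assignTickets(tickets: Long, childOutEdges: Vector[Long], rnd: Random): Vector[Long] = {
  val total = childOutEdges.sum
  if (total == 0L) return Vector.fill(childOutEdges.length)(0L)
  val base = childOutEdges.map(e => tickets * e / total)
  val result = base.toArray
  var leftover = tickets - base.sum
  while (leftover > 0) {
    val i = rnd.nextInt(result.length)
    if (childOutEdges(i) > 0) { result(i) += 1; leftover -= 1 }
  }
  result.toVector
}

For instance, assignTickets(10, Vector(1, 0), new Random()) sends all 10 tickets along the single qualifying path, matching the [ Elvis inspired * ] step of the example above.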

3.5 RW-TR Optimizations

Just like parallel TripleRush [26], RW-TR contains some initial optimizations: a) we do dictionary encoding, b) we remove the triple vertices and fold them into the third index level, where each index vertex stores a compact representation of all the triples that match its pattern, c) we only send the tickets of the failed particles back to the query vertex, and d) we use bulk-messaging and message-combiners.

In addition to this, RW-TR contains improvements that address previous limitations with regard to insert performance and memory usage during loading, by adopting a new data structure for the index vertices. Next, we discuss the details and motivation of these changes.

Index Vertex Representation: In Fig. 1, one notices that the ID of an index vertex varies only in one position—the subject, the predicate, or the object—from the IDs of its children. To reduce the size of the edge representations, we do not store the entire ID of child vertices, but only the specification of this position, consisting of one dictionary-encoded number per child. We refer to these numbers as ID-refinements. The same reasoning applies to third-level index vertices, where the triples they store only vary in one position from the ID of the binding index vertex. Routing and binding only require a traversal of all ID-refinements. To support traversal and inserts in a memory-efficient way, we store the refinements in a specially tailored splay tree, where the key of each node is an interval and each node stores the set of refinements contained in its interval. The data structure supports average-case O(log(n)) inserts, low memory usage, and fast traversal.

Index Graph Structure: Because we fold the triple vertices into the third index level, there is no longer an obvious place where one can verify whether a fully bound pattern corresponds to a triple that exists inside the store. To deal with this, RW-TR sends the particle that has to check for the existence of a fully bound pattern to the corresponding [ S * O ] index vertex. These vertices do not store the ID-refinements in a splay tree, but in a sorted array. The existence is checked with binary search. We observe that most patterns have bound predicates, so these vertices are rarely used for anything but to check for the existence of a triple. We also observe that inserts into the array are O(n). In practice, this was not an issue, since there are usually very few predicates for a given subject/object pair.
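As a small illustration of the [ S * O ] case, the sketch below stores ID-refinements in a sorted array and answers existence checks by binary search. It is a simplified stand-in; the interval splay tree used by the other index vertices is not reproduced here, and the class name is an assumption.

import java.util.Arrays

final class SortedRefinements(initial: Array[Int] = Array.empty[Int]) {
  // Refinements (dictionary-encoded predicate IDs of an [ S * O ] vertex), kept sorted.
  private var refinements: Array[Int] = initial.sorted

  // O(n) insert: acceptable because a subject/object pair rarely has many predicates.
  def insert(refinement: Int): Unit =
    if (!contains(refinement)) refinements = (refinements :+ refinement).sorted

  // Existence check for a fully bound pattern via binary search.
  def contains(refinement: Int): Boolean =
    Arrays.binarySearch(refinements, refinement) >= 0

  // Routing and binding only need a linear traversal of the refinements.
  def foreach(f: Int => Unit): Unit = refinements.foreach(f)
}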

Figure 3: Comparison between query set processing and TripleRush parallel asynchronous partial answer exploration. Same query and data as in Fig. 2.


4. PRELIMINARY ANALYSIS

As introduced in the last section and illustrated in Fig. 3, RW-TR processes SPARQL queries by exploring each partial binding asynchronously in parallel. Whenever an exploration encounters more than one possible partial binding, it forks the exploration and pursues both potential solutions in parallel (‘green’ and ‘orange’ explorations in the figure). When all variables of an exploration are bound, it returns a result (‘green’ path). Alternatively, when the remaining unbound variables of an exploration cannot be bound, it aborts that path (‘orange’ path). Essentially, RW-TR performs a parallel-asynchronous graph search.

Traditional DBMS use operators on indices and intermediate data structures (typically arrays or sets). Originally, these operators were executed synchronously, where each operator is executed until its full result set is available before the next operator is called (see also Figure 3). Parallelism is usually introduced by (i) executing independent operators in parallel (such as the scans that create the sets in the figure) and (ii) implementing parallelized operators, resulting in a parallel but synchronous system (each operator has to find all its results before invoking the next one). Modern systems introduce additional parallelism via pipelining operators [8], which allow some operators to pass on partial results. Conceptually, these systems use parallelized approaches to process partial answer sets rather than exploring all possible partial solutions in parallel.


The central proposition of this paper is that RW-TR’s parallel-asynchronous exploration approach may be a viable alternative graph store architecture for today’s multi-core systems.

First, we believe that asynchronous-parallel query processing allows the many cores to be exploited better than synchronous-parallel execution, as cores are less likely to wait for work during synchronization. RW-TR is built to exploit this asynchronicity. As mentioned, some current systems exploit a kind of asynchronicity via pipelining. Pipelining, however, comes at the cost of more complexity in both the operators and their coordination. Given that RW-TR does not require coordination between its explorations, it does not incur such an overhead.

Second, we expect the exploitable performance improvement due to parallelism to be curbed by (i) the branching factor of the query, which is a function of the selectivity of the triple patterns and the connectivity of the involved nodes (or join selectivity), as it limits the degree of parallelism, and (ii) possible gains through locality, as forking explorations and moving them to other cores (possibly on other machines) can be costly operations.

RW-TR’s index can be conceptualized as a vertical partition of the data into S-Index, P-Index, and O-Index, for subjects, predicates and objects, respectively, as well as three additional indices for each combination of two columns – SP-Index, PO-Index and SO-Index. In addition, the latter three indices are sharded by the subject key (for SP-Index and SO-Index) or object key (for PO-Index). Each shard is assigned to a processing unit in a distributed compute cluster.

5. EVALUATION

The goal of the evaluation was to explore the propositions that RW-TR’s parallel-asynchronous exploration approach is both competitive and scalable via the efficient exploitation of parallelism where possible, as well as to illustrate its capability to gain useful results via RWR. To that end we employ two standard benchmarks—LUBM and BSBM—and evaluate RW-TR’s performance under different conditions, as well as a use case for RWR.

The experiments reported in subsection 5.3 and the distributed evaluations in subsection 5.4 were run on a cluster of 8 machines, each machine having 128 GB RAM and two E5-2680 v2 processors at 2.80 GHz, with 10 cores per processor. The machines are connected with 40 Gbps InfiniBand. We used version 1.8.0_05-b13 of the Java Runtime. All other experiments were run on single machines of the same cluster.

We used both the LUBM7 (Lehigh University Benchmark) and BSBM (Berlin SPARQL Benchmark) [2] benchmarks. For LUBM, we used the queries used in the Trinity.RDF evaluation [30]. For BSBM, we generated the datasets and explore use case queries with the standard data generator and query test driver, but stripped the queries of advanced SPARQL features unsupported by RW-TR, such as OPTIONAL or complex filters, and discarded queries 9 and 12 for relying on such features.

We executed ten runs of each LUBM query and in the diagrams report both the average and geometric mean over the fastest runs. For BSBM we executed the same ten generated queries from each category, computed the category average, and reported the average and geometric mean over all categories. The measured total time for a run includes everything from query optimization until the result set is fully traversed, but the decoding of the results is not forced.

7 http://swat.cse.lehigh.edu/projects/lubm

5.1 Vertical Scalability and First Results

The goal of this evaluation was to measure how well RW-TR scales with additional worker threads on a single machine of the cluster. Additionally, the time to the first result is reported, in order to test our hypothesis that the fully asynchronous execution allows the first result to be delivered much faster than the full result set. We ran this evaluation ten times on the LUBM 160 dataset with the Trinity.RDF queries and varied the number of worker threads between 1 and 20, because the hardware has 20 physical cores. We pre-planned the queries and ran them without the optimizer, in order to reduce overhead that is not directly associated with the execution engine.

In Figure 4 we see that adding more workers has a negative or at best neutral impact for queries L4, L5, and L6, which touch very little data and are answered in at most a millisecond. For queries L1, L3, and L7, which are more processing-intensive, the speedup for 20 workers relative to 1 worker is between 10 and 12, which is good, considering that query dispatch and result reporting are still handled by only one worker, and that all queries are answered in under 50 ms at that point. Query L2 scales a bit up to 10 worker threads, but does not improve with more processing elements. This is likely due to its structure of only 2 triple patterns, which offers RW-TR less potential for parallelization.

Figure 4(c) graphs the time until the first result was reported relative to the total query execution time (query L3 was omitted as it does not return results). For queries that profited from parallelization, the first answer was delivered in around a third of the time it took to compute the entire result. The relative benefit increased when going from 1 to 10 worker threads, but then remained approximately constant when going to 20 processing threads.

Overall, this evaluation shows that the architecture can take advantage of multi-core architectures and that, if there are enough workers available, then for some queries the asynchronous-parallel execution can deliver first results much sooner than the full results.

5.2 Data Scalability and Memory Usage

To measure the data scalability of RW-TR in the single-machine setup, we measured its performance for different sizes of the benchmark datasets. For comparison, we also supply the numbers for the in-memory backend of Sesame, as it is open source and runs in the JVM, and for Virtuoso 7.1 as a comparison to on-disk approaches.

To make the comparison with the on-disk system Virtuoso fairer, we evaluated warm-cache runs and we configured it to make use of the processors and memory of the machine.

The two diagrams in Figure 5 show how the performance changes when the LUBM and BSBM queries are executed on increasingly large datasets. On the BSBM dataset, the performance of all systems is comparable for small dataset sizes, but RW-TR scales better to large dataset sizes: for the largest BSBM dataset it is on average up to 10 times faster than Sesame and up to 25 times faster than Virtuoso. The geometric mean does not change dramatically, because most queries do not touch more data on a larger dataset.


Figure 4: 4(a) shows the execution times of the different queries on a logarithmic scale on both axes, 4(b) shows the speedup relative to 1 worker thread for all queries, and 4(c) shows the time it took until the first result as a percentage of the total time for the entire result.

Figure 5: 5(a) and 5(b) compare the single-node scalability of execution times with increasing BSBM and LUBM sizes. Both axes are logarithmic.

On the more processing-intensive LUBM queries, RW-TR shows better performance at every dataset size: for the largest evaluated size it is on average up to more than 200 times faster than Sesame and 35 times faster than Virtuoso.

We do not have any precise memory measurements, but we measured the used JVM memory, which can serve as an upper bound for the memory used by the index. We then look at the lowest such upper bound that we measured during any of the runs. For BSBM 284'826, RW-TR had a lowest upper bound of 39.7 GB, in contrast to 28.6 GB for Sesame. For LUBM 1280, RW-TR used 61.6 GB compared to the 34.7 GB used by Sesame. From this we conclude that the RW-TR index most likely uses more memory than Sesame’s, but that the index size is still reasonable.

5.3 Horizontal Scalability

The goal of this evaluation was to measure RW-TR’s scalability in the distributed setting. In particular, we wanted to explore whether RW-TR’s query evaluation approach would degrade when faced with messaging over the network rather than in memory, or whether the benefit of additional processors would dominate. For this, we measured the performance on large BSBM and LUBM datasets while varying the number of nodes used.

Figure 6: 6(a) and 6(b) compare the horizontal scalability of RW-TR with 2, 4, and 8 nodes for both BSBM 284'826 and LUBM 1280. The aggregates are over all queries for that dataset, and for each query we used the fastest of 10 runs. Error bars indicate the runtimes for the fastest and slowest queries in the benchmark.


Figure 6 shows the results of these evaluations. We aggregated over the fastest runs of ten executions for each query, in order to reduce confounding factors (e.g., garbage collections). We found that for the BSBM dataset/queries the average execution time stays approximately the same, while the geometric mean slightly increases. For the LUBM dataset/queries the geometric mean stays approximately the same, whilst the average execution time decreases. Our interpretation is that for queries that do not require a lot of processing the added overhead and network latency reduce the performance, whilst for queries that require a lot of processing the benefit of the added processing elements can overcome this drawback. This explains why adding nodes tends to slow down the execution of the fastest millisecond-range queries, whilst improving the performance for the most processing-intensive queries.

5.4 Comparison with Trinity.RDF and TriAD

Tables 2 and 3 compare the performance of RW-TR to the numbers reported in the Trinity.RDF [30] and TriAD [8] papers. We followed the evaluation procedure described to us by the Trinity.RDF authors, which includes a partitioning of rdf:type into a different type predicate for each class referred to as the object. The RW-TR distributed evaluation on LUBM 10240 with 1.36 billion triples was run on all 8 nodes of the cluster. The comparison of the numbers in these tables has many caveats, as the cited numbers were created with different hardware and cluster sizes and the approaches require different amounts of preprocessing. We believe this comparison at least shows that the RW-TR architecture is competitive in both the single-node and the distributed scenario.

5.5 A Random Walk Use Case: Path Sampling

In this section we illustrate RW-TR’s capability to run sampling queries based on random walks with restarts (RWR). Our goal is to show the simplicity with which Semantic Web developers using RW-TR can obtain RWR results and how these can be combined with SPARQL queries. A more general discussion of the usefulness of RWR is beyond the scope of our use case and can be found in [15].

Consider the need to establish the relatedness of two entities. As an example, we could have a selection of sports teams—Liverpool, Manchester United, Chicago Bulls, and The Brooklyn Dodgers—and we would like to know which championship they compete in—the UEFA Cup, the World Series, or the NBA Finals—from a dataset of relationships extracted from a large text corpus. The dataset can be extremely noisy and may contain misleading and/or conflicting relationships, such as the Manchester United team playing basketball. Indeed, an October 2011 article in The Telegraph, for example, connects a basketball player with Manchester United.8

RWR have been proposed to deal with these kinds of noisy settings. The rationale is the following: intuitively, there are more short paths between nodes that are conceptually close to each other than between nodes that are further apart. Just picking the shortest path between two vertices may be misled by a noisy connection. A sampling of random walks will unearth which vertices are closer via many connections and is, hence, less susceptible to false relations in the graph.

For our use case we use the Never Ending Language Learning (NELL) knowledge base (version 08m.845), which has about 2 million triples. NELL contains relations extracted from natural-language text, and it iteratively learns new relations based on what it learned in the previous iterations. This naturally leads to some ambiguous or false relations in the knowledge base. For example, NELL contains these two relations about Manchester United (the British soccer club), the first of which is clearly noise: 〈sportsteam:man_united team-plays-sport sport:basketball〉 and 〈sportsteam:man_united team-plays-in-league sportsleague:fa〉, where sportsleague:fa is the British Football Association cup.

8 http://www.telegraph.co.uk/sport/football/teams/liverpool/8826912/Liverpool-v-Manchester-United-basketball-star-and-Anfield-stakeholder-Lebron-James-jets-in-for-clash.html

To illustrate the capability of RW-TR to solve the task of finding the connectedness of a team with a championship, we ran queries of the following form:9

    SAMPLE ?X FROM [ sportsteam:man_united ?X sportsleague:uefa ]
    CONSTRAINTS [ maxhops = 5, tickets = 100 ]

whilst varying the team, the championship, the maximum number of hops (we employed 5, 10, 20), and the number of tickets (we used 100, 1'000, and 10'000). By simulating multiple independent random walks from teams, we can count all walks that reach the respective championships and estimate or calculate information such as the number of walks reaching the goal, the path lengths, or the conditional probabilities of reaching the championships from a given team.

Figure 7 graphs some of the results. In the first graph on the left we show the distributions of reaching the UEFA Cup from all four teams for 1'000 tickets whilst varying the path length. As we can see, the distribution is extremely stable. Both Manchester United and Liverpool are clearly associated with the UEFA Cup, whilst the two non-soccer teams do not show any relations. This illustrates the small-world phenomenon, where most entities that are related are close to each other in the graph and the longer paths actually do not lead to many additional relationships. This latter observation is supported by the actual number of arriving tickets in each of the classes.

The three graphs on the right of Figure 7 show the distributions of reaching each of the championships from the four teams. Given the stability of the results, we chose a path length of 5 and varied the number of walks (i.e., tickets initially assigned). The paths to the UEFA Cup and the World Series are very stable. Indeed, the numbers of arriving tickets (printed in the bars) are proportional to the number of initial tickets (or walks). The NBA graph on the far left tells a more subtle story. The more tickets we assign to the exploration, the more the result reflects the noisy extractions mentioned above. When using 1'000 or 10'000 tickets, we find a small number of connections between both Liverpool and Manchester United and the NBA Finals — reflecting noisy connections. Hence, a higher number of tickets is more likely to follow noisy connections. Note that the total execution time for the thirty-six path-sampling queries needed for the three graphs on the right of Figure 7 was less than a second.

6. LIMITATIONS AND CONCLUSION

In the following we discuss limitations and threats to validity, followed by the conclusion.

There are some limitations related to the RW-TR implementation being a prototype: (i) encoded IDs cannot exceed 2^31, (ii) only a subset of SPARQL is supported, (iii) dictionary encoding/decoding is not distributed, and (iv) splay integer sets do not currently support deletions. These limitations are not inherent to the approach and resolving them is primarily a matter of engineering.



Fastest of 10 runs   L1     L2     L3    L4   L5   L6    L7    Geo. mean
TripleRush           22.6   27.8   0.4   1    0.4  0.9   21.2  2.94
Trinity.RDF          281    132    110   5    4    9     630   46
TriAD                427    117    210   2    0.5  19    693   39
TriAD-SG             97     140    31    1    0.2  1.8   711   14

Table 2: Single-node, LUBM 160 (∼21 million triples), time in ms. Comparison data from [30] and [8].

Fastest of 10 runs   L1        L2        L3      L4    L5    L6    L7        Geo. mean
TripleRush           3,111.2   1,457.9   0.7     3.5   9.5   29.1  1,165.8   62.1
Trinity.RDF          12,648    6,018     8,735   5     4     9     31,214    450
TriAD                7,631     1,663     4,290   2.1   0.5   69    14,895    249
TriAD-SG             2,146     2,025     1,647   1.3   0.7   1.4   16,863    106

Table 3: Distributed, LUBM 10240 (∼1.36 billion triples), time in ms. Comparison data from [30] and [8].

8" 79" 808"

1" 8" 85"

100" 1000" 10000"Number'of'*ckets'

Paths'to'UEFA'

14# 146# 1427#

100# 1000# 10000#Number'of'*ckets'

Paths'to'World'Series'

2"35" 332"

100" 1000" 10000"Number'of'*ckets'

Paths'to'NBA'Finals'

79# 80# 80#

8# 8# 10#

0%#

10%#

20%#

30%#

40%#

50%#

60%#

70%#

80%#

90%#

100%#

5# 10# 20#

Normalized

+#+paths+per+te

am+

Maximum++length+of+paths+

Paths+to+UEFA+

bulls+

man_united+

brooklyn_dodgers+

liverpool+

Figure 7: Distributions resulting from randomly walking from all four teams to a given championship. The leftmost figure graphs walksto UEFA whilst varying the maximum path-length when employing 1’000 tickets. The other graphs vary the number of tickets whilstfixing the path-length to 5.


There are, however, limitations that are inherent to the approach: First, some operations, such as ordering the results, by definition require synchronization. Our current approach can only handle them as post-processing steps, which is straightforward but inefficient. Second, an efficient execution of filters requires that an optimizer be able to place them at any point in the query execution plan. Our current approach is limited to treating them as a post-processing step; more efficient handling would require access to literals from inside the store.
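To illustrate what treating these operations as post-processing steps means, the following Python sketch applies a filter and an ordering only after all bindings have arrived. The binding dictionaries and variable names are made up for illustration and are not RW-TR code.

    # Hypothetical bindings as they might be streamed back by the explorations.
    bindings = [
        {"?team": "liverpool",  "?founded": 1892},
        {"?team": "man_united", "?founded": 1878},
        {"?team": "bulls",      "?founded": 1966},
    ]

    # FILTER(?founded < 1900) and ORDER BY ?founded applied as post-processing:
    # nothing can be emitted before the last binding has arrived, which is the
    # synchronization point the asynchronous execution otherwise avoids.
    for b in sorted((b for b in bindings if b["?founded"] < 1900),
                    key=lambda b: b["?founded"]):
        print(b)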

Also, our evaluations have some limitations: our benchmarking of RW-TR is limited to synthetic datasets, which means that the results might not generalize to real-world datasets. Furthermore, as mentioned in Section 5.4, we could not compare RW-TR's performance with Trinity.RDF and TriAD running on the same hardware, as those two software packages are not available. Nonetheless, we believe that our evaluation shows that RW-TR is competitive in those systems' core strength, the evaluation of SPARQL queries, whilst also supporting sampling queries.

Our query optimizer can be improved to better deal with queries that have many patterns: typical SPARQL queries contain star-shaped patterns that can be optimized independently of others [7].

Finally, our RWR use case is only one example and lacks a full efficiency evaluation. Also, the current version of RW-TR only supports sampling using RWR and no other analytics such as PageRank. The rationale for this limitation was that the main goal of this paper was to illustrate the versatility of our approach in supporting both SPARQL querying and RWR-style sampling. A full efficiency evaluation of RW-TR's RWR capability, or an extension to other graph analytics, which would be supported by the underlying Signal/Collect framework, is beyond the scope of this paper.

7. CONCLUSIONS

In this paper we proposed to exploit the large number of CPU cores of modern servers via the parallel exploration of partial bindings, implemented on a distributed graph processing system. In particular, we suggested forking the exploration whenever more than one binding is possible, returning the result when all variables of an exploration are bound, and expiring the exploration when it reaches a dead end. This re-conceptualization of triple stores has the side-effect that it can efficiently support random walks with restarts by tasking each parallel exploration to simultaneously explore as many random walks as it has tickets.
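As a purely illustrative aid, the following Python sketch mimics this exploration strategy sequentially on a three-triple toy graph: it forks the partial binding whenever several triples match the next pattern, emits a result once all patterns are matched, and lets dead-end explorations expire. The data, the query, and helpers such as explore and substitute are assumptions made for the sketch; the actual system runs the exploration asynchronously and in parallel over the active index on Signal/Collect, which the sketch does not model.

    # Toy triple store, purely illustrative.
    triples = [
        ("man_united", "plays-in", "fa"),
        ("liverpool", "plays-in", "fa"),
        ("fa", "part-of", "uefa"),
    ]

    def is_var(term):
        return term.startswith("?")

    def substitute(pattern, binding):
        """Replace already-bound variables in a triple pattern by their values."""
        return tuple(binding.get(t, t) for t in pattern)

    def explore(patterns, binding):
        """Depth-first stand-in for the parallel exploration: fork one branch per
        triple matching the next pattern, emit the binding once every pattern is
        matched, and expire (yield nothing) when no triple matches."""
        if not patterns:
            yield binding                          # all variables bound: a result
            return
        s, p, o = substitute(patterns[0], binding)
        for ts, tp, to in triples:
            if (is_var(s) or ts == s) and (is_var(p) or tp == p) and (is_var(o) or to == o):
                fork = dict(binding)               # fork the partial binding
                for term, value in ((s, ts), (p, tp), (o, to)):
                    if is_var(term):
                        fork[term] = value
                yield from explore(patterns[1:], fork)

    query = [("?team", "plays-in", "?league"), ("?league", "part-of", "uefa")]
    print(list(explore(query, {})))   # two surviving forks: man_united and liverpool

In RW-TR each such fork additionally carries a ticket count, so that a single exploration can stand in for many simultaneous random walks.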

As such, RW-TR presents a new approach to building graph stores with integrated graph-analytic operators such as sampling queries. Our evaluation shows that this architecture can serve as the basis for a graph store that is competitive with other systems. We hope that RW-TR can serve as a starting point for further exploration that will help Semantic Web developers to efficiently and seamlessly analyze their graphs.

Acknowledgments. We would like to thank the Hasler Foundation for the generous support of the Signal/Collect project under grant number 11072, and Alex Averbuch, Cosmin Basca, Lorenz Fischer, Shen Gao, Tobias Grubenmann, and Katerina Papaioannou for their feedback.

8. REFERENCES

[1] D. Abadi, A. Marcus, S. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 411–422, 2007.
[2] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. International Journal on Semantic Web and Information Systems (IJSWIS), 5(2):1–24, 2009.


[3] O. Corby, R. Dieng-Kuntz, and C. Faron-Zucker. Querying the Semantic Web with the Corese search engine, pages 705–709. IOS Press, 2004.
[4] O. Corby, R. Dieng-Kuntz, C. Faron-Zucker, and F. Gandon. Ontology-based Approximate Query Processing for Searching the Semantic Web with Corese. Research Report RR-5621, 2006.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[6] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 17–30, 2012.
[7] A. Gubichev and T. Neumann. Exploiting the query structure for efficient join ordering in SPARQL queries.
[8] S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 289–300, 2014.
[9] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. Proceedings of the VLDB Endowment, 4(11):1123–1134, 2011.
[10] C. Kiefer, A. Bernstein, and A. Locher. Adding data mining support to SPARQL via statistical relational learning methods. In Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC '08, pages 478–492, Berlin, Heidelberg, 2008. Springer-Verlag.
[11] C. Kiefer, A. Bernstein, and A. Locher. Adding Data Mining Support to SPARQL via Statistical Relational Learning Methods. In Proceedings of the 5th European Semantic Web Conference (ESWC), Lecture Notes in Computer Science. Springer, 2008.
[12] C. Kiefer, A. Bernstein, and M. Stocker. The Fundamentals of iSPARQL: A Virtual Triple Approach for Similarity-Based Semantic Web Tasks. In Proceedings of the 6th International Semantic Web Conference, 2007.
[13] N. Kohout, S. Choi, D. Kim, and D. Yeung. Multi-chain prefetching: Effective exploitation of inter-chain memory parallelism for pointer-chasing codes. In Parallel Architectures and Compilation Techniques, 2001. Proceedings. 2001 International Conference on, pages 268–279. IEEE, 2001.
[14] S. Kotoulas, J. Urbani, P. A. Boncz, and P. Mika. Robust runtime optimization and skew-resistant execution of analytical SPARQL queries on Pig. In International Semantic Web Conference (1), volume LNCS 7649, pages 247–262, 2012.
[15] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1):53–67, Oct. 2010.
[16] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.
[17] C.-K. Luk. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In Computer Architecture, 2001. Proceedings. 28th Annual International Symposium on, pages 40–51. IEEE, 2001.
[18] C.-K. Luk and T. C. Mowry. Compiler-based prefetching for recursive data structures. In ACM SIGOPS Operating Systems Review, volume 30, pages 222–233. ACM, 1996.
[19] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135–146, 2010.
[20] T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640, 2009.
[21] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1):91–113, 2010.
[22] B. Shao, H. Wang, and Y. Li. The Trinity graph engine. Technical Report 161291, Microsoft Research, 2012.
[23] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th International Conference on World Wide Web, pages 595–604. ACM, 2008.
[24] P. Stutz, A. Bernstein, and W. W. Cohen. Signal/Collect: Graph Algorithms for the (Semantic) Web. In International Semantic Web Conference, volume LNCS 6496, pages 764–780. Springer, Heidelberg, 2010.
[25] P. Stutz, D. Strebel, and A. Bernstein. Signal/Collect: Processing web-scale graphs in seconds. Semantic Web Journal – Interoperability, Usability, Applicability, forthcoming.
[26] P. Stutz, M. Verman, L. Fischer, and A. Bernstein. TripleRush: A fast and scalable triple store. In 9th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), volume 50, 2013.
[27] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[28] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple indexing for Semantic Web data management. Proceedings of the VLDB Endowment, 1(1):1008–1019, 2008.
[29] S. Yang, X. Yan, B. Zong, and A. Khan. Towards effective partition management for large graphs. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 517–528. ACM, 2012.
[30] K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine for web scale RDF data. Proceedings of the VLDB Endowment, 6(4):265–276, 2013.
