Page 1: Parallel Processing of Graphs - Eindhoven University of ... edbt2015school.win.tue.nl/material/shao-et-al.pdf

Parallel Processing of Graphs

Bin Shao1, Yatao Li1, and Haixun Wang2

1 Microsoft Research, Beijing, China, 2 Google Research, Mountain View, USA,

[email protected], [email protected], [email protected]

Abstract. Graphs play an indispensable role in a wide range of application domains. Graph processing at scale, however, is facing challenges at all levels, ranging from system architectures to programming models. In this chapter, we review the challenges of parallel processing of large graph data, representative graph processing systems, general design principles of large graph processing systems, and various graph computation paradigms. We will use a real-life large scale knowledge graph serving case study to demonstrate the introduced design principles. Graph processing covers a wide range of topics and graph data can be represented in very different forms, including adjacency lists, matrices, high dimensional vectors, and matroids. Different graph representations lead to different computation paradigms and system architectures. From the perspective of graph representation, this chapter will also briefly introduce a few other forms of graph representation and their applications besides the most commonly used adjacency list representation.

Keywords: graph processing, distributed algorithms, graph databases

1 Overview

Graphs are important to many applications and the applications powered by large graphs are proliferating. However, large scale graph processing is facing challenges at all levels, ranging from system architectures to programming models. Graph applications have a variety of needs. We may roughly classify such needs into two categories: online query processing, which requires low latency, and offline graph analytics, which requires high throughput. As an example, deciding instantly whether there is a path between two given people in a social network belongs to the first category while calculating PageRank for the Web graph belongs to the second.

Let us start with a real-life knowledge graph query example. Figure 1 gives a real-life relation search example on a big knowledge graph with more than 25 billion triple facts. In the context of knowledge graphs, queries that find the paths linking a set of given graph nodes are usually used to discover the relations between the given entities. In this example, we want to find the relations between the entities Tom Cruise, Katie Holmes, Mimi Rogers, and Nicole Kidman.

Fig. 1. Relation Search over Knowledge Graph

Many sophisticated real-world applications rely heavily on the interplay between offline graph analytics and online query processing. Given two nodes in a graph, the “distance oracle” algorithm [31] that estimates the shortest distance between two given nodes is an online algorithm. However, to estimate the distances, the algorithm relies on “landmark” nodes in the graph, and an optimal set of landmark nodes is discovered using offline analytics.
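The landmark-based estimation can be sketched in a few lines. The sketch below is illustrative only (the actual algorithm in [31] involves careful landmark selection); `bfs_distances`, `build_oracle`, and `estimate_distance` are hypothetical names, and the offline step here is simply one BFS per landmark:

```python
from collections import deque

def bfs_distances(adj, source):
    """Exact BFS distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_oracle(adj, landmarks):
    """Offline analytics step: one full BFS per chosen landmark."""
    return {l: bfs_distances(adj, l) for l in landmarks}

def estimate_distance(oracle, u, v):
    """Online step: upper-bound d(u, v) by min over landmarks l of d(u, l) + d(l, v)."""
    return min((dist[u] + dist[v]
                for dist in oracle.values() if u in dist and v in dist),
               default=float("inf"))
```

The quality of the upper bound d(u, l) + d(l, v) depends entirely on how well the offline step chooses the landmark set, which is exactly the interplay between offline analytics and online querying described above.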

Fig. 2. A general graph processing system stack

Generally speaking, the system stack of a graph processing system consists of all or some of the layers shown in Fig. 2. At the bottom, the storage back-end hosts the graph data in a certain graph representation, either centralized in a single machine or distributed over multiple machines. Storage back-ends have important system design implications. The storage layer largely determines the system optimization goal, as discussed in [20]. For example, systems including SHARD [34], HadoopRDF [17], RAPID [33], and EAGRE [48] use a distributed file system as their storage back-end. The systems that directly use a file system as the storage back-end are usually optimized for throughput due to the relatively high data retrieval latency. In comparison, systems such as H2RDF [30], AMADA [2], and Trinity.RDF [45] are optimized for better response time via the fast random data access capability provided by their key-value store back-ends.

At the top, graph algorithms manipulate graph data via programming interfaces provided by a graph processing system. Between the programming interfaces and the storage back-end, there is usually a computation engine which executes the graph manipulation instructions dictated by the graph algorithms through the programming interfaces.

In this section, we first introduce the notations and discuss why large graphs are hard to process. Then, we present some general design principles after a brief survey of some representative graph processing systems.

1.1 Notations

Let us introduce the terminology and notations that will be used throughout this chapter. A graph may refer to a topology-only mathematical concept as defined in [4] or a data set. In the former sense, a graph is a pair of finite disjoint sets (V, E) such that the set of edges E is a subset of the set V × V of ordered or unordered pairs of vertices in V. If each pair of vertices is ordered, we call G a directed graph; otherwise, we call it an undirected graph.

In what follows, when we represent a real data set as a graph, especially when there is data associated with the vertices, we refer to the vertices as graph nodes or nodes. Correspondingly, we call adjacent vertices neighboring nodes. If the data set only contains graph topology, or if we only want to emphasize its graph topology, we usually still call them vertices.

There are two common means of representing and storing a graph: adjacency list and adjacency matrix. The way of representing a graph determines the way we can process the graph. As most graph query processing and analytics algorithms rely heavily on the operator that gets all adjacent vertices of a given vertex, an adjacency list is usually the preferred way of representing a graph, especially when the graph is large. With the adjacency matrix representation, we need to scan a whole matrix row to access the adjacent vertices of a vertex. For a graph with a billion vertices, the cost of scanning such large matrix rows is prohibitive. In this chapter, we assume graphs are represented and stored as adjacency lists unless otherwise stated.
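The asymptotic difference between the two representations can be made concrete with a toy sketch (a hypothetical four-vertex graph, not taken from the chapter):

```python
# A small hypothetical graph stored both ways, to compare neighbor access.
n = 4
adj_list = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
adj_matrix = [[0] * n for _ in range(n)]
for u, nbrs in adj_list.items():
    for v in nbrs:
        adj_matrix[u][v] = 1

def neighbors_via_list(u):
    # O(deg(u)): touches exactly the stored neighbors.
    return adj_list[u]

def neighbors_via_matrix(u):
    # O(n): must scan the entire row even when u has few neighbors;
    # for a billion-vertex graph each such scan is prohibitive.
    return [v for v in range(n) if adj_matrix[u][v] == 1]
```

Both functions return the same answer; only the access cost differs, which is why the adjacency list wins for large sparse graphs.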

1.2 Challenges of large graph processing

Large graphs are difficult to handle mostly because they encode a huge number of relationships. We summarize the challenges of large graph processing into four aspects: 1) the complex nature of graphs; 2) the diversity of graph data; 3) the diversity of graph computations; 4) the scale of graph data.

The complex nature of graphs Graph data is inherently complex. Contemporary computer architectures are good at processing linear and simple hierarchical data structures, such as lists, stacks, or trees. Even when the data scale goes large, the divide and conquer computation paradigm still works well for these data structures, even if the data is partitioned over many distributed machines. However, when we are handling graphs, especially large graphs, the situation changes. Andrew Lumsdaine and Douglas Gregor [24] summarize the characteristics of parallel graph processing as: data-driven computations, unstructured problems, poor locality, and a high data access to computation ratio. The implication is twofold: 1) From the perspective of data access, a graph node's neighboring nodes' content cannot be accessed without “jumping” in the storage no matter how we represent the graph. In other words, a massive amount of random accesses is required during graph processing. Many modern program optimizations rely on data reuse. Unfortunately, the random access nature of graph data breaks this premise. This usually causes poor performance in system implementations since the CPU cache is not effective most of the time. 2) From the perspective of program structure, parallelism is difficult to extract because of the unstructured nature of graphs. Partitioning large graphs is itself an NP-hard problem [10]; this makes it very hard to get an efficient divide and conquer solution for most large graph processing tasks.
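The locality point can be illustrated with a small experiment: sum the same values once in storage order and once in shuffled order. The arithmetic is identical; only the access pattern changes. Timings are machine-dependent, and CPython's interpreter overhead mutes the cache effect considerably (in C the gap is far larger), so treat this purely as an illustration:

```python
import random
import time

N = 1_000_000
data = list(range(N))
sequential = list(range(N))
shuffled = sequential[:]
random.Random(42).shuffle(shuffled)  # a random, cache-hostile access order

def scan(order):
    total = 0
    for i in order:
        total += data[i]
    return total

t0 = time.perf_counter(); s_seq = scan(sequential)
t1 = time.perf_counter(); s_rnd = scan(shuffled)
t2 = time.perf_counter()
assert s_seq == s_rnd  # same result, different locality
print(f"sequential {t1 - t0:.3f}s vs random {t2 - t1:.3f}s")
```

Graph traversals behave like the shuffled pass: the next node to visit is data-dependent, so the prefetcher and cache get little reuse.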

The diversity of graph data There are many kinds of graphs, such as scale-free graphs, graphs with community structure, and small-world graphs. A scale-free graph is a graph whose degrees follow a power-law distribution. In graphs with community structure, the graph nodes can be easily grouped into sets of nodes such that each set of nodes is densely connected. In small-world graphs, most nodes can be reached from every other node by a small number of hops. Graph algorithms' performance may vary a lot on different kinds of graphs.

The diversity of graph operations Furthermore, there is a large variety of graph operations. Generally speaking, graph operations can be classified into two large categories: online query processing and offline graph analytics. Deciding instantly whether there is a path between two given persons in a social network belongs to the first category, while calculating the PageRank scores for a web graph belongs to the second. There are also other useful graph operations, such as graph generation, graph visualization, and interactive exploration. It is challenging to design a system that can support all these operations on top of the same infrastructure.

The scale of graph data Last but not least, the scale of graphs does matter. Graphs with a billion nodes are common. The Facebook social network has more than 1.44 billion users and more than 480 billion friendship relations1. The world wide web has more than 1 trillion unique links. The De Bruijn graph for genes even has more than 1 trillion nodes and at least 8 trillion edges. The scale of these graphs makes many classic graph algorithms ineffective.

1 http://newsroom.fb.com/company-info/.


1.3 Representative graph processing systems

Recently, research on large graph processing has seen an explosive growth [1]. However, a lot of graph algorithms are ad-hoc in the sense that each of them assumes that the underlying graph data can be organized in a certain way that maximizes the performance of the algorithm. In other words, there are no standard or de facto graph systems on which graph algorithms are developed and optimized. The situation is even more urgent for extremely large graphs with billions of nodes and edges: first, converting billion-node graphs from one format to another for different algorithms is extremely costly or totally infeasible; second, many graph algorithms (e.g., subgraph matching algorithms or reachability queries that rely on super-linear graph indexes) are not applicable to billion-node graphs.

                    Native   Online   Data       In-memory   Transaction
                    graphs   query    sharding   storage     support
Neo4j2              Yes      Yes      No         No          Yes
Trinity [36]        Yes      Yes      Yes        Yes         Atomicity
Horton [35]         Yes      Yes      Yes        Yes         No
HyperGraphDB [18]   No       Yes      No         No          Yes
FlockDB3            No       Yes      Yes        No          Yes
TinkerGraph4        Yes      Yes      No         Yes         No
InfiniteGraph5      Yes      Yes      Yes        No          Yes
Cayley6             Yes      Yes      SB         SB          Yes
Titan7              Yes      Yes      SB         SB          Yes
MapReduce [8]       No       No       Yes        No          No
PEGASUS [19]        No       No       Yes        No          No
Pregel [26]         No       No       Yes        No          No
Giraph8             No       No       Yes        No          No
GraphLab [23]       No       No       Yes        No          No
GraphChi [22]       No       No       No         No          No
GraphX [12]         No       No       Yes        No          No

Table 1. Some representative graph processing systems (SB means the current feature of the system depends on its storage backend).

In response to this situation, a lot of graph systems have been proposed. In this section, we try to provide perspectives on the goals and the means of developing a graph system by briefly discussing and comparing the existing representative systems. Currently, there are two representative graph systems for online query processing and offline analytics respectively. Neo4j focuses on supporting online transaction processing (OLTP) on graph data. Neo4j is like a regular database system, only with a more expressive and powerful data model. However, Neo4j is not distributed: it does not handle graphs that are partitioned over multiple machines. For large graphs that cannot be stored in memory, disk random access becomes the performance bottleneck. Furthermore, a single machine also does not have enough computation power compared with a distributed, parallel system. Thus, it is difficult for systems without data sharding to handle web-scale graphs.

2 http://neo4j.com/.
3 https://github.com/twitter/flockdb.
4 https://github.com/tinkerpop/blueprints/wiki/TinkerGraph.
5 http://www.objectivity.com/products/infinitegraph/.
6 https://github.com/google/cayley.
7 http://thinkaurelius.github.io/titan/.
8 http://giraph.apache.org/.

From the perspective of online graph query processing, a few distributed in-memory systems are designed to conquer the challenges faced by disk-based single-machine systems. Representative systems include Trinity [36] and Horton [35]. These systems leverage in-memory storage to speed up random data accesses and use a distributed computation/execution engine to process graph queries in parallel.

On the other end of the spectrum are MapReduce [8], Pregel [26], Giraph, GraphLab [23], GraphChi [22], and GraphX [12]. Unlike Neo4j, they do not support online query processing; instead, they are optimized for high-throughput analytics on large data partitioned over hundreds of machines. MapReduce computations on graphs depend heavily on interprocessor bandwidth, as graph structures are sent over the network iteration after iteration. Pregel and its follow-up systems mitigate this problem by passing computation results instead of graph structures between processors. In Pregel, analytics on the graphs are expressed using a vertex-based computation mechanism under the Bulk Synchronous Parallel (BSP) model. Although some well-known graph algorithms, such as PageRank and shortest path discovery, can be implemented through vertex-based computing with ease, there are many sophisticated graph computations that cannot be expressed in a succinct and elegant way.
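A minimal single-process sketch of vertex-based PageRank under a BSP-style superstep structure might look as follows. This is illustrative only, not Pregel's actual API, and it assumes every vertex has at least one out-edge (no dangling-node handling):

```python
def pagerank_bsp(adj, supersteps=30, d=0.85):
    """Vertex-centric PageRank sketch under a BSP-like model.

    Each superstep: every vertex folds the messages received in the previous
    superstep into a new rank, then sends rank/out_degree to its out-neighbors.
    A global barrier separates supersteps (implicit here, since we are
    single-threaded)."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}

    def deliver(rank):
        # Communication phase: each vertex messages its out-neighbors.
        inbox = {v: [] for v in adj}
        for u, nbrs in adj.items():
            share = rank[u] / len(nbrs)
            for v in nbrs:
                inbox[v].append(share)
        return inbox

    inbox = deliver(rank)
    for _ in range(supersteps):
        # Compute phase: every vertex processes its incoming messages.
        rank = {v: (1 - d) / n + d * sum(inbox[v]) for v in adj}
        inbox = deliver(rank)
    return rank
```

The per-vertex compute step and the message-passing step are exactly the two pieces a Pregel-style framework asks the programmer to supply; everything else (partitioning, delivery, barriers) belongs to the system.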

Besides the systems mentioned earlier, dozens of graph systems have been proposed in the last few years. Table 1 compares a few of them from five important design perspectives. First, does the graph exist in its native form, or does it follow other models, such as RDBMS or key-value stores? When a graph is in its native form, graph algorithms can be expressed in standard, natural ways [6]. If not, we need a complete rethinking of the problem in order to develop analogous implementations in the new model, e.g., MapReduce. Second, does the system support low latency query processing on graphs? Third, does the system support data sharding and distributed parallel processing? Systems such as MapReduce and Pregel are capable of supporting computations such as PageRank on extremely large graphs. Due to their batch processing nature, they usually cannot support real-time queries against large graphs. Design choices from the perspective of the main storage (in-memory or not) and transaction support are also compared in the table.


2 General design principles

We have reviewed a few representative graph processing systems in the last section. In this section, let us discuss a few general principles of designing a general-purpose real-time graph processing system.

2.1 Addressing the grand random data access challenge

Online query processing systems are usually optimized for response time, while offline analytics systems are usually optimized for throughput. Because data accesses on graphs have poor locality, and random accesses on disks lead to performance bottlenecks, keeping graphs memory-resident is important for efficient graph computations, especially for real-time online query processing. In order to create a general-purpose graph processing system that supports both low-latency online query processing and high-throughput offline analytics, the grand challenge of random accesses must be well addressed at the data access layer.

Despite the great progress made in disk technology, disks still cannot provide the level of efficient random access required by graph computations. DRAM is still the only promising storage medium that can provide a satisfactory level of random access at acceptable cost. At the same time, memory-based approaches usually do not scale due to the capacity limit of single machines. We argue that distributed in-memory solutions are one of the most promising directions for large graph processing.

By addressing the random data access problem for distributed large graphs, we can actually design systems to support both online graph query processing and offline graph analytics instead of optimizing for certain types of graph computations. For online queries, in-memory graphs are particularly effective, as most of them require a certain degree of graph exploration, e.g., BFS, DFS, and sub-graph matching. On the other hand, offline graph computations are usually performed in an iterative, batch manner. For iterative computations, keeping data in the main memory can improve performance by an order of magnitude due to the reuse of intermediate results [44].

2.2 Avoiding prohibitive indexes

For offline graph analytics, finding a way to well partition the computation task is, to some extent, the silver bullet. As long as we can find an efficient way to partition the graph data, we basically have an efficient solution to scalable offline graph processing. Due to the random data access problem, general purpose graph computations usually do not have efficient, disk-based solutions. But under certain constraints, offline graph analytics problems sometimes have efficient disk-based “divide and conquer” solutions. A good example is GraphChi [22], which will be discussed in more detail in Section 4.2. Disk-based graph computation solutions are essentially cache optimization mechanisms. If a computational problem can be well partitioned, then the sub-problems can be loaded into memory and efficiently handled one by one. However, as widely acknowledged [24], a lot of graph problems are inherently irregular and hard to partition, especially for online queries.

Compared with offline analytics, online queries are much harder to handle due to the following reasons: 1) Online queries are sensitive to latency. It is harder to reduce latency than to increase throughput by adding more machines. On the one hand, adding machines can reduce each machine's workload; at the same time, having more machines incurs higher communication costs. 2) The data access patterns of a graph query are very difficult to predict, thus it is very hard to optimize the execution by leveraging I/O optimization techniques such as prefetching.

As discussed earlier, data accesses usually dominate the graph computation costs; eventually the performance of processing a graph query depends on how fast we can randomly access the graph data. The traditional way of speeding up random data accesses is to use indexes. Graph indexes are widely employed to speed up online query processing, either by precomputing and materializing the results of common query patterns or by storing redundant information to speed up data accesses. To capture the structural information of graphs, graph indexes usually require super-linear indexing time or super-linear storage space. For large graphs, e.g., graphs with billions of nodes, such super-linear complexity almost always means infeasibility. We will show in the following section that index-free graph processing is a possible and efficient approach to many real-time graph query processing tasks.

2.3 Supporting fine-grained one-sided communications

Most graph computations are data driven and the communication costs typically contribute a large portion of the overall system costs. Well overlapping the underlying communication and the ongoing computation is the key to high performance. From the perspective of program execution, a general computation engine or framework supporting fine-grained server-side computations is an indispensable part of a high performance real-time graph processing system [24].

MPI is the de facto standard for message passing programming in high-performance computing. Pairwise two-sided send/receive communication is the major paradigm provided by MPI9. The communication progress is dictated by the MPI primitive calls explicitly invoked in an application [25]. Nearly all underlying network communication infrastructures provide asynchronous network events. The MPI communication paradigm inevitably introduces unnecessary latencies since it only responds to network events during send/receive primitive invocations.

In contrast, active messages [9] is a communication architecture that can well overlap computation and communication. The active messages architecture is desirable for data-driven graph computation. It is especially suitable for online graph query processing, which is sensitive to network latencies. In this communication architecture, a user-space message handler, pointed to by a handler index encoded in the message, is invoked upon message arrival. Let us consider a simple example to illustrate the difference between the communication paradigms provided by MPI and active messages. Suppose we want to send some messages from one machine to another according to the output values of a random number generator. Under the active messages paradigm, we can check the random values on the sender side, and invoke a sending operation only if the value matches the sending condition. Using the pairwise communication of MPI, we need to invoke as many send/receive calls as the number of generated random values and perform the value checking on the receiver side.

9 Even though one-sided primitives are included starting from the MPI-2 standard, their usage is still limited.
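The random-number example can be mimicked in a toy single-process simulation. The "wire" here is just a list of messages and the handler table is a hypothetical stand-in for a handler index, but it shows why sender-side checking ships far fewer messages:

```python
import random

def run_pairwise(values, threshold):
    """MPI-like two-sided sketch: every generated value crosses the 'wire'
    and the condition is checked on the receiver side."""
    wire = [("raw", x) for x in values]            # one send per value
    matches = [x for _, x in wire if x < threshold]
    return matches, len(wire)

def run_active_messages(values, threshold):
    """Active-message sketch: the sender checks the condition locally and
    ships only matching values; each message names the handler that is
    invoked directly on arrival."""
    wire = [("on_match", x) for x in values if x < threshold]
    matches = []
    handlers = {"on_match": matches.append}        # hypothetical handler table
    for handler_name, x in wire:
        handlers[handler_name](x)                  # dispatched upon arrival
    return matches, len(wire)

rng = random.Random(7)
values = [rng.random() for _ in range(1000)]
m1, sent1 = run_pairwise(values, 0.1)
m2, sent2 = run_active_messages(values, 0.1)
assert m1 == m2 and sent2 < sent1                  # same answer, far fewer sends
```

Both paradigms compute the same set of matches; the difference is where the condition is evaluated and, consequently, how many messages traverse the network.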

3 Online Query Processing

In this section, we review two online query processing techniques specially designed for distributed large graphs: asynchronous fan-out search and index-free query processing.

3.1 Asynchronous fanout search

A lot of applications require graph exploration, breadth-first search and depth-first search being the most typical. Here, we use people search on a social network as an example to demonstrate an efficient asynchronous graph exploration technique called fan-out search. The problem is the following: on a social network, for a given user, find anyone whose first name is “David” among his friends, his friends' friends, and his friends' friends' friends.

It is unlikely that we can index the social network to solve the “David” problem. One option is to index the neighborhood of each user, so that given any user, we can use the index to check if there is any “David” within 3 hops of his/her neighborhood. However, the size and the update cost of such an index are prohibitive for a web-scale graph. The second option is to create an index to answer 3-hop reachability queries for any two nodes. This is infeasible because “David” is a popular first name, and we cannot check every David in the social network to see if he is within 3 hops of the current user.

We can tackle the “David” problem by leveraging efficient memory-based graph explorations. The algorithm simply sends asynchronous “fan-out search” requests recursively to remote machines as shown by Procedures 1 and 2. Specifically, it partitions v's neighbors into k parts (line 4 of Procedure 1), where k is the total number of machines. Then, the “fan-out” search is performed by sending message (N_i, hop) (hop = 1) initially (line 6 of Procedure 1) and (N'_i, hop + 1) recursively (line 7 of Procedure 2) to all involved machines in parallel. On receiving the search requests, machine i searches for “David” in its local data storage (line 2 of Procedure 2).

The simple fan-out search works very well for randomly partitioned large distributed graphs. As demonstrated in Trinity [36], for a Facebook-like graph, exploring the entire 3-hop neighborhood of any node in the graph takes less than 100 milliseconds on average.

Procedure 1 3-hop fan-out “David” search
Input: v (the current vertex)
Output: all of the “David”s in v's 3-hop neighborhood
1: N ← v's neighbors
2: k ← (total number of machines)
3: hop ← 1
4: Partition N into k parts: N = N_1 ∪ ... ∪ N_k
5: foreach (N_i in N) in parallel
6:     async send message (N_i, hop) to machine i

Procedure 2 machine i on receiving message (N_i, hop)
1: S_i ← Cells.load(N_i)
2: check if anyone in S_i is named “David”
3: if hop < 3 then
4:     N' ← neighbors of all the nodes in S_i
5:     Partition N' into k parts: N' = N'_1 ∪ ... ∪ N'_k
6:     foreach (N'_i in N') in parallel
7:         async send message (N'_i, hop + 1) to machine i
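A single-process simulation of the two procedures might look as follows. The names (`partition`, `fanout_search`, the `first_name` map) are hypothetical stand-ins; the real system loads cells from distributed in-memory storage, and the asynchronous sends become a FIFO queue here. A visited set is also added that the original procedures omit (the hop bound alone would guarantee termination, but re-checking nodes would be wasteful):

```python
from collections import deque

def partition(nodes, k):
    """Hash-partition node ids over k machines (cf. line 4 of Procedure 1)."""
    parts = [[] for _ in range(k)]
    for v in nodes:
        parts[v % k].append(v)
    return parts

def fanout_search(adj, first_name, start, k, target="David", max_hop=3):
    found = set()
    seen = {start}
    queue = deque()
    for i, part in enumerate(partition(adj[start], k)):
        if part:
            queue.append((i, part, 1))            # async send (N_i, hop = 1)
    while queue:
        machine, nodes, hop = queue.popleft()     # machine i receives (N_i, hop)
        nodes = [v for v in nodes if v not in seen]
        seen.update(nodes)
        # Local check in machine i's storage (cf. line 2 of Procedure 2).
        found.update(v for v in nodes if first_name.get(v) == target)
        if hop < max_hop:
            frontier = {w for v in nodes for w in adj[v]}
            for i, part in enumerate(partition(frontier, k)):
                if part:
                    queue.append((i, part, hop + 1))
    return found
```

In the real distributed setting the messages for different machines are processed concurrently, so the latency is governed by the depth of the fan-out (3 hops), not by the total number of nodes touched.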

3.2 Index-free query processing

Online query processing is usually harder to optimize because of its very limited response time budget. Here we use the subgraph matching problem as an example to introduce an efficient online query processing paradigm for distributed large graphs.

Subgraph matching is one of the most basic graph operations and one of the most fundamental operators in many applications that handle graphs, including protein-protein interaction networks [14, 54], knowledge bases [21, 43], and program analysis [49, 47].

Graph G′ = (V′, E′) is a subgraph of G = (V, E) if V′ ⊂ V and E′ ⊂ E. Graph G″ = (V″, E″) is isomorphic to G′ = (V′, E′) if there is a bijection f : V′ → V″ such that xy ∈ E′ iff f(x)f(y) ∈ E″. For a given data graph G and a query graph G′, subgraph matching is to retrieve all the subgraphs of G that are isomorphic to the query graph. Classic subgraph matching algorithms are usually conducted in the following 3 steps:

1. Break the data graph into basic units, such as edges, paths, or frequent subgraphs.


Algorithms              Index Size    Index Time    Update Cost   Graph size in experiments
Ullmann [39], VF2 [7]   -             -             -             4,484
RDF-3X [27]             O(m)          O(m)          O(d)          33M
BitMat [3]              O(m)          O(m)          O(m)          361M
Subdue [16]             -             Exponential   O(m)          10K
SpiderMine [53]         -             Exponential   O(m)          40K
R-Join [5]              O(nm^1/2)     O(n^4)        O(n)          1M
Distance-Join [54]      O(nm^1/2)     O(n^4)        O(n)          387K
GraphQL [14]            O(m + nd^r)   O(m + nd^r)   O(d^r)        320K
Zhao [49]               O(nd^r)       O(nd^r)       O(d^L)        2M
GADDI [46]              O(nd^L)       O(nd^L)       O(d^L)        10K
STwig [37]              O(n)          O(n)          O(1)          1B

Table 2. Survey on subgraph matching algorithms [37].

Algorithms              Index Size        Index Time       Query Time
Ullmann [39], VF2 [7]   -                 -                >1000
RDF-3X [27]             1T                >20 days         >48
BitMat [3]              2.4T              >20 days         >269
Subdue [16]             -                 >67 years        -
SpiderMine [53]         -                 >3 years         -
R-Join [5]              >175T             >10^15 years     >200
Distance-Join [54]      >175T             >10^15 years     >4000
GraphQL [14]            >13T (r=2)        >600 years       >2000
Zhao [49]               >12T (r=2)        >600 years       >600
GADDI [46]              >2×10^5 T (L=4)   >4×10^5 years    >400
STwig [37]              6G                33s              <20

Table 3. Index costs of subgraph matching algorithms for a Facebook-like graph [37].

2. Build indexes for every possible basic unit.
3. Decompose a query into multiple basic unit queries and join their results.
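The subgraph isomorphism definition above translates directly into a brute-force checker. The sketch is only meant to make the definition concrete: its exponential search over vertex mappings is exactly the cost that both the index-based pipeline above and the exploration-based methods discussed in this section try to avoid:

```python
from itertools import permutations

def has_isomorphic_subgraph(query, data):
    """Does `data` contain a subgraph isomorphic to `query`?

    Graphs are dicts mapping each vertex to its set of neighbors. Following
    the definition, we look for an injective map f from query vertices into
    data vertices such that xy in E(query) implies f(x)f(y) in E(data).
    Exponential in the query size; toy graphs only."""
    qv = list(query)
    for image in permutations(data, len(qv)):
        f = dict(zip(qv, image))
        if all(f[y] in data[f[x]] for x in qv for y in query[x]):
            return True
    return False
```

For a query with q vertices and a data graph with n vertices this enumerates O(n^q) candidate mappings, which is why practical systems restrict the search with indexes or guided exploration.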

However, indexing graph structures is much more costly than indexing a relational table in terms of time and space complexity. For example, 2-hop reachability indexes usually require O(n^4) construction time. Depending on the structure of the basic unit, the space costs vary; in many cases, they are super-linear as well. Moreover, multi-way joins are costly too, especially when the data is disk resident.

To demonstrate the infeasibility of index-based solutions for large graphs, let us review a survey on subgraph matching made by Sun et al. [37], as shown in Tables 2 and 3.

Table 2 shows the index complexities of a few representative subgraph matching approaches. Most of them are super-linear; only the state-of-the-art approaches RDF-3X and BitMat have linear indexing time complexity. To illustrate what these complexities imply for a large graph, Table 3 shows their estimated index construction costs and query times for a Facebook-like social graph. Although RDF-3X and BitMat have linear indexing complexities, they take more than 20 days to index a Facebook-like large graph, let alone the approaches with super-linear indexes. The evident conclusion is that costly structural graph indexes are infeasible for large graphs.

To avoid building sophisticated indexes, the STwig method proposed by Sun et al. [37] and the Trinity.RDF system [45] process subgraph matching queries without using structural graph indexes (except a "string" index that maps text labels to vertex IDs). This ensures their scalability for billion-node graphs, which are not indexable in terms of either index space or index time. To make up for the performance loss due to the lack of indexing support, both STwig and Trinity.RDF make heavy use of in-memory graph exploration to replace expensive join operations. Given a query, they split it into a set of subquery graphs that can be efficiently processed via in-memory graph exploration, and they perform join operations only when they are unavoidable (when there is a cycle in the query graph). This dramatically reduces query processing cost, which is usually dominated by joins.
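To make the exploration-based approach concrete, here is a small sketch (illustrative code, not the authors' implementation; the function names, the toy graph, and the labels are our own) of the STwig idea: decompose the query into star-shaped sub-queries and answer each one by exploring neighbors, using only a label-to-vertex lookup rather than a structural index.

```python
# Sketch: decompose a query graph into star-shaped sub-queries ("STwigs": a root
# label plus its neighbors' labels) and answer each by in-memory exploration.
from collections import defaultdict

def decompose_into_stwigs(query_edges):
    """Group query edges by their source node: each group is one star query."""
    stars = defaultdict(list)
    for u, v in query_edges:
        stars[u].append(v)
    return dict(stars)

def match_stwig(graph, labels, root_label, child_labels):
    """Find candidate roots by exploring neighbors -- no structural index,
    only a label -> vertices lookup (the 'string' index mentioned above)."""
    by_label = defaultdict(set)
    for v, lab in labels.items():
        by_label[lab].add(v)
    results = []
    for r in by_label[root_label]:            # candidate roots from the label index
        nbrs = set(graph.get(r, ()))
        if all(nbrs & by_label[c] for c in child_labels):
            results.append(r)
    return results

# Toy data: adjacency lists plus vertex labels
graph = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
labels = {1: "A", 2: "B", 3: "B", 4: "C"}
stars = decompose_into_stwigs([("a", "b1"), ("a", "b2")])
print(stars)
print(match_stwig(graph, labels, "A", ["B", "B"]))
```

In a full query processor, the bindings of the individual star queries would then be joined, but only when the query graph forces it.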

4 Offline Analytics

Analytics jobs perform a global computation against a graph. Many of them are conducted in an iterative manner. When the graph is large, the analytics jobs are usually run as offline tasks; that is why we usually call these analytics jobs offline analytics.

4.1 MapReduce computation paradigm

MapReduce [8] is a high-latency yet high-throughput data processing platform optimized for offline analytics on large partitioned data sets over hundreds of machines. However, when processing graphs using MapReduce, it suffers from the following problems:

– Due to its offline data processing nature, real-time online queries cannot be supported;

– The data model of MapReduce cannot model graphs natively, thus graph algorithms cannot be expressed intuitively;

– Its computation architecture is not efficient for graph processing for the following two major reasons:
  • Intermediate results of each iteration need to be materialized;
  • Entire graph structures need to be sent over the network iteration after iteration, thus incurring huge communication costs.
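The communication overhead in the last bullet can be made concrete with a minimal, single-process simulation of one PageRank iteration in MapReduce style (an illustrative sketch, not actual Hadoop code; all function names are ours). Note how the mapper must re-emit each adjacency list so that the next iteration still has the graph structure:

```python
# Sketch of one PageRank iteration expressed as map/shuffle/reduce. The mapper
# ships the graph structure ("STRUCT" records) alongside the rank contributions
# every single iteration -- exactly the overhead described above.
from collections import defaultdict

def map_phase(node, adj, rank):
    yield node, ("STRUCT", adj)               # adjacency list re-emitted each iteration
    for nbr in adj:
        yield nbr, ("RANK", rank / len(adj))  # rank contribution to each neighbor

def reduce_phase(node, values, damping=0.85):
    adj, incoming = [], 0.0
    for tag, payload in values:
        if tag == "STRUCT":
            adj = payload
        else:
            incoming += payload
    return node, adj, (1 - damping) + damping * incoming

def pagerank_iteration(graph, ranks):
    shuffled = defaultdict(list)              # simulates the shuffle/sort phase
    for node, adj in graph.items():
        for key, val in map_phase(node, adj, ranks[node]):
            shuffled[key].append(val)
    new_graph, new_ranks = {}, {}
    for node, values in shuffled.items():
        n, adj, r = reduce_phase(node, values)
        new_graph[n], new_ranks[n] = adj, r
    return new_graph, new_ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 for v in graph}
graph, ranks = pagerank_iteration(graph, ranks)
print(ranks)
```

In a real cluster the "STRUCT" records travel over the network on every iteration, which is why iterative graph algorithms fit MapReduce poorly.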

It is possible to run a graph processing task efficiently on a MapReduce platform if the graph data is well-partitioned. Please refer to [32] for some well-designed graph algorithms implemented in MapReduce. For parallel processing, the parallelism we can achieve eventually depends on how well the data can be partitioned. Unfortunately, most raw real-life graph data is not well partitioned, and the partitioning task itself can be very costly [41].

4.2 Vertex-centric computation paradigm

The vertex-centric graph computation, first advocated by Pregel [26], provides a vertex-centric computational abstraction over the BSP model [40]. A computational task is expressed in multiple iterative super-steps, and each vertex acts as an independent agent. During each super-step, each agent performs some computation and communication, independent of the others. It then waits for all agents to finish their computation and communication before the next super-step begins.
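The superstep loop can be sketched as follows (a toy synchronous engine of our own, not Pregel's actual API; the max-value propagation example is a standard illustration in which every vertex converges to the largest value in its connected component):

```python
# Sketch of the BSP superstep loop: every vertex runs compute() with the messages
# sent to it in the previous superstep; the barrier makes messages visible only
# in the next superstep.
def run_supersteps(graph, values, compute, max_steps=20):
    inbox = {v: [] for v in graph}
    for step in range(max_steps):
        outbox = {v: [] for v in graph}
        sent = False
        for v in graph:
            new_val, msgs = compute(step, values[v], inbox[v], graph[v])
            values[v] = new_val
            for target, m in msgs:
                outbox[target].append(m)
            sent = sent or bool(msgs)
        inbox = outbox            # barrier: deliver messages for the next superstep
        if not sent:              # no messages in flight -> computation halts
            break
    return values

def max_propagation(step, value, messages, neighbors):
    """Each vertex keeps the maximum value seen so far and forwards it on change."""
    new_val = max([value] + messages)
    if step == 0 or new_val != value:
        return new_val, [(n, new_val) for n in neighbors]
    return new_val, []

graph = {1: [2], 2: [1, 3], 3: [2]}
values = run_supersteps(graph, {1: 3, 2: 1, 3: 5}, max_propagation)
print(values)
```

The `compute` function is the only part a user writes; the engine supplies the iteration, message delivery, and the barrier between supersteps.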

Compared to MapReduce, Pregel exploits finer-grained parallelism at the vertex level. Moreover, Pregel does not move graph partitions over the network; only messages among graph vertices are passed at the end of each iteration. This greatly reduces the network traffic.

Many follow-up works, such as GraphLab [23], PowerGraph [11], Trinity [36], and GraphChi [22], provide a vertex-centric computational model for offline graph analytics. Among these systems, GraphChi is specially worth mentioning, as it addressed the "divide-and-conquer" problem for graph computation under certain constraints. GraphChi can perform efficient disk-based graph computations under the assumption that the computations have asynchronous vertex-centric solutions. An asynchronous solution is one where a vertex can perform a computational task based on the partially updated information from its incoming links. On the one hand, this assumption removes the need to pass messages from the current vertex to all its outgoing links, so that GraphChi can perform the graph computations block by block. On the other hand, it inherently cannot efficiently support traversal-based graph computations and synchronous graph computations, because it cannot efficiently access the outgoing links of a vertex.

Although quite a few graph computation tasks, such as Single Source Shortest Paths, PageRank, and their variants, can be expressed elegantly using the vertex-centric computation paradigm, there are many, such as multi-level graph partitioning, that cannot be expressed elegantly and intuitively in this paradigm.

4.3 Communication Optimization

For large scale distributed graph computation, communication optimization is very important. Although a graph is distributed over multiple machines, from the point of view of a local machine, the vertices of the graph fall into two categories: vertices on the local machine, and vertices on any of the remote machines. Fig. 3(a) shows a local machine's bipartite view of the entire graph.


Fig. 3. Bipartite View on a Local Machine (local vertices x and y on one side; remote vertices u and v on the other)

Let us take the vertex-centric computation as an example. One naive approach is to run jobs on local vertices without preparing any messages in advance. When a local vertex is scheduled to run a job, we fetch the remote messages for the vertex and run the job immediately after they arrive. Since the system does not have space to hold all messages, we discard messages after they are used. For example, in Fig. 3(a), in order to run the job on vertex x, we need messages from vertices u, v, and others. Later on, when y is scheduled to run, we need messages from u and v again. This means a single message needs to be delivered multiple times, which is unacceptable in an environment where network capacity is an extremely valuable resource.

Another naive approach is to wait until all messages arrive before we start running the job on any of the local vertices. This means the local machine needs to buffer all the messages from remote vertices, and perform random accesses to retrieve the messages when running algorithms on the local vertices. Some graph processing systems, such as the ones built using the Parallel Boost Graph Library (PBGL) [13], use ghost nodes (local replicas of remote nodes) for message passing. This mechanism works well for well-partitioned graphs [24]. However, graph partitioning itself is a very costly task, and it is very difficult to create partitions of even size while minimizing the number of edge cuts. On the other hand, a large memory overhead would be incurred for large graphs that are not well partitioned. To illustrate the memory overhead, Figure 4 shows the memory usage for graphs with 1 million to 256 million vertices [36]. It took nearly 600 GB of memory for the 256-million-node graph when the average degree was 16.

Since the total number of messages is too big to be memory resident, we need to buffer the messages on disk and perform random disk accesses, which incurs a significant cost. To address this issue, we can cache the messages in a smarter way. For example, on each local machine, we can differentiate remote vertices into two categories. The first category contains hub vertices, that is, vertices having a large degree and connecting to a great percentage of the local vertices. The second category contains all of the remaining vertices. We buffer messages from vertices in the first category for the entire duration of one computation iteration. For a scale-free graph, e.g., one generated by the degree distribution P(k) ~ ck^(-γ) with c = 1.16 and γ = 2.16, 20% of the hub vertices send messages to 80% of the vertices. Even if we buffer messages from just 10% of the hub vertices, we have addressed 72.8% of the message needs.

Fig. 4. Breadth-first Search using PBGL (memory usage in GB versus node count from 2^0 to 2^8 million, for average degrees 4, 8, 16, and 32)
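We can sanity-check the hub-buffering argument with a small numerical sketch (our own back-of-the-envelope computation, not the chapter's experiment; the truncation of the degree range and the numerical normalization are our assumptions). The result lands in the same ballpark as the 72.8% figure above:

```python
# Sketch: for a power-law degree distribution P(k) ~ k^(-gamma) with gamma = 2.16,
# what fraction of message endpoints do the highest-degree vertices account for?
GAMMA = 2.16
K_MAX = 100_000  # truncate the heavy tail for numerical summation

vertex_mass = [k ** -GAMMA for k in range(1, K_MAX + 1)]        # ~ fraction of vertices with degree k
endpoint_mass = [k * m for k, m in enumerate(vertex_mass, 1)]   # ~ message endpoints they touch
total_v, total_e = sum(vertex_mass), sum(endpoint_mass)

def hub_coverage(top_fraction):
    """Fraction of message endpoints covered by buffering the highest-degree
    `top_fraction` of vertices."""
    acc_v = acc_e = 0.0
    for k in range(K_MAX, 0, -1):        # walk from the highest degree down
        acc_v += vertex_mass[k - 1]
        acc_e += endpoint_mass[k - 1]
        if acc_v / total_v >= top_fraction:
            return acc_e / total_e
    return 1.0

print(f"top 10% of vertices cover ~{hub_coverage(0.10):.0%} of message endpoints")
```

The exact percentage depends on the cutoff and the discretization, but the qualitative point stands: a small buffer dedicated to hub vertices absorbs most of the message traffic.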

4.4 Local sampling

In distributed graph processing systems, such as Trinity, a large graph is partitioned and stored on a number of distributed machines. This leads to the following question: can we perform graph computations locally on each machine and then aggregate their answers to derive the answer for the entire graph? Furthermore, can we use probabilistic inference to derive the answer for the entire graph from the answer on a single machine? This paradigm has the potential to overcome the network communication bottleneck, as it minimizes or even abolishes network communication. The answers to these questions are positive. If a graph is partitioned over 10 machines, each machine has full information about 10% of the vertices and 10% of the edges, and those edges link to a large number of the remaining 90% of the vertices. Thus, the sample actually contains a great deal of information about the entire graph. Furthermore, when a graph is randomly partitioned, each machine holds a random sample of the graph, which enables us to perform probabilistic inference.
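The sampling idea can be illustrated with a toy computation (illustrative code; the synthetic degree data and the 10-machine random partitioning are our own assumptions, not any system's implementation): a statistic computed on one machine's random partition, such as the average degree, closely estimates the global value.

```python
# Sketch: under random partitioning, one machine's partition is a uniform vertex
# sample, so a local statistic is an (unbiased) estimate of the global one.
import random

random.seed(7)
n, machines = 100_000, 10
degrees = [random.randint(1, 50) for _ in range(n)]       # the "global" graph's degrees
owner = [random.randrange(machines) for _ in range(n)]    # random vertex partitioning

local = [degrees[v] for v in range(n) if owner[v] == 0]   # what machine 0 sees
estimate = sum(local) / len(local)                        # local average degree
truth = sum(degrees) / n                                  # global average degree
print(f"local estimate {estimate:.2f} vs global {truth:.2f}")
```

With roughly n/10 sampled vertices, the standard error of such an estimate shrinks quickly, which is what makes single-machine inference about the whole graph viable.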

The distance oracle [31] demonstrated this graph computation paradigm on top of Trinity. A distance oracle finds landmark vertices and uses them to estimate the shortest distances between any two nodes in a large graph [50]. Fig. 5 shows the effectiveness of three methods for picking the landmark vertices. Here, the X axis shows the number of landmark vertices used, and the Y axis shows the estimation accuracy. The best approach is to use vertices that have the highest global betweenness, and the worst approach is to simply use vertices that have the largest degree. The distance oracle approach, which uses vertices that have the highest locally computed betweenness, actually has accuracy very close to the best approach. However, finding landmarks that have the highest global betweenness is significantly more costly than local sampling.
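A minimal sketch of landmark-based distance estimation (our own illustrative code, not the system in [31]): precompute BFS distances from each landmark, then estimate d(u, v) as the minimum over landmarks l of d(u, l) + d(l, v). By the triangle inequality this is an upper bound, and it is exact whenever some landmark lies on a shortest path between u and v, which is why high-betweenness landmarks work well.

```python
# Sketch: landmark-based shortest-distance estimation.
from collections import deque

def bfs_distances(graph, source):
    """Unweighted single-source shortest distances via BFS."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def estimate_distance(landmark_dists, u, v):
    """Upper bound on d(u, v): min over landmarks of d(u, l) + d(l, v)."""
    return min(d[u] + d[v] for d in landmark_dists)

# Path graph 0-1-2-3-4-5 with landmark 2 (a vertex of high betweenness)
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
landmark_dists = [bfs_distances(graph, 2)]
print(estimate_distance(landmark_dists, 0, 5))
```

Here the landmark sits on every shortest path through the graph, so the estimate for the pair (0, 5) equals the true distance of 5.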

Fig. 5. Effectiveness of Local Sampling in Distance Oracle [31] (estimation accuracy in % versus number of landmarks, for landmarks chosen by largest degree, local betweenness, and global betweenness)

5 Graph Generation

There are many occasions when we need to generate large graphs for experiments, especially for researchers working on large graph systems or algorithms. Generating large graphs usually takes a long time; sometimes it even dominates the time spent in conducting the experiments.

A good graph generator generally needs to meet the following criteria:

– It can generate a graph with certain properties, e.g., with a certain average degree and a certain distribution of the node degrees.

– It can generate a large graph, say a graph with a billion nodes, in minutes or a few hours rather than days.

– It is as resource economical as possible.
– It can generate graphs in native graph formats, instead of an unordered collection of edges.

The way of representing a graph determines the way we can process the graph. Compared with the matrix representation, representing a graph as adjacency lists is usually more flexible, since it allows us to add node or edge attributes easily. Additionally, as the graph size gets large, the matrix representation itself becomes less desirable for the following two reasons:

– It does not support dynamic node insertion or deletion, which is required by some graph generation algorithms;


– The space overhead is high when we are generating a sparse graph.

The common practice for graph generation is to decompose the task into several steps, and then pipeline them. Graph generation algorithms usually focus on generating graph edges, not on storage. Generally, the whole graph generation process consists of two parts:

– Generating a set of edges (vertex pairs) according to some predefined rules;
– Merging together the edges associated with the same vertex into an adjacency list.

The process of building an adjacency list can be further decomposed. The edges are first grouped by their source vertices, and then deduplicated. Finally, the graph generator should transform the intermediate results into native graph formats, which can be easily consumed by other graph systems. Figure 6 illustrates the general idea of the pipeline.

Fig. 6. Graph Generator Components

The simple but most commonly adopted approach is to implement the edge generation algorithm in a separate program and stream the edges out to disk. Then, an external sorting program picks up the data from the disk, sorts it, and groups the edge sets by their source vertices. After the deduplication is done, which is trivial once the data is sorted, another program converts the raw adjacency lists to native graph formats.

There are several points of interest if we consider the efficiency of such a pipeline, as shown in Figure 7. First, streaming edges to the disk will not only limit the throughput of an edge generator, but also discourage parallel edge generation. Second, using an external sorting program incurs a great cost, which usually contributes the major part of the execution time.

Fig. 7. The bottlenecks of the simple graph generation pipeline

As the speed of an edge generator depends on the time complexity of the underlying graph generation algorithm and varies case by case, here we focus on optimizing the adjacency list building procedure. Adopting RAM-backed storage will greatly increase the throughput of the edge generation process. Meanwhile, increasing the throughput of the edge generation process alone does not improve the overall performance, because the sorting procedure will still incur high costs from excessive comparisons and data copying.

Note that what we essentially want is to group the edges by their source vertices, not the sorting itself; we are not interested in the order of the source vertices at all. Based on this insight, we can eliminate the sorting process entirely by hashing the target vertices with the same source vertex into the same bucket. With proper storage management techniques (see footnote 10), we can even insert edges into the adjacency lists and deduplicate them on the fly.
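A minimal sketch of this sort-free grouping step (illustrative code; a real generator would use disk- or RAM-backed buckets rather than an in-process Python dict): hashing each edge into a per-source bucket groups and deduplicates in one pass.

```python
# Sketch: build adjacency lists without sorting. A dict of sets acts as the hash
# buckets; set membership replaces both the sort and the deduplication pass.
from collections import defaultdict

def build_adjacency(edge_stream):
    adjacency = defaultdict(set)
    for src, dst in edge_stream:
        adjacency[src].add(dst)       # grouping and dedup happen on the fly
    # sorted() below is only for stable, readable output; it is not required
    return {src: sorted(dsts) for src, dsts in adjacency.items()}

edges = [(1, 2), (1, 3), (2, 3), (1, 2), (3, 1)]   # note the duplicate (1, 2)
print(build_adjacency(edges))
```

Compared with the external-sort pipeline, this needs only one pass over the edge stream and no comparison-based ordering at all.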

Fig. 8. A distributed graph generation pipeline

To further increase the size of the graphs we can generate, we can adapt the graph generation pipeline to a distributed setting, as illustrated by Figure 8, where each machine keeps a partition of the generated graph. The machines generate graph edges in parallel and synchronize with each other via the network as needed.

6 A Knowledge Graph Serving Case Study

Knowledge graphs play an important role in many real-world applications, such as Google Now and Microsoft Bing. In this section, we discuss how to serve knowledge graphs at scale for real-time query processing via a real-life case study.

6.1 Challenges of knowledge graph serving

A knowledge graph is usually a massive entity network. The data set used in the case study is a Microsoft-owned RDF knowledge graph; we refer to it as MKG (Microsoft Knowledge Graph) in what follows. MKG consists of 2.4 billion entities, 8.05 billion entity properties, and 17.44 billion relations between entities. The most valuable part of such a large knowledge graph is its rich relationships. Currently, knowledge graphs are mainly served via various entity indexes, which index the entity properties and immediate neighbors to facilitate query processing via joins. Entity indexes can help answer many entity queries, but they cannot answer queries requiring access to relations two or more hops away.

10 A real-life example can be found here: http://www.graphengine.io/docs/manual/DemoApps/GraphGenerator.html

A real-time knowledge graph serving system needs to meet at least the following three requirements:

1. The full scale knowledge graph must be served;
2. The whole graph must be accessible in real time;
3. Online graph queries must be supported.

Building a system satisfying these requirements faces three major challenges:

1. The large volume of data: for a real-life system, data size does matter. Graph query processing algorithms with O(n^2) or higher complexity are common; they are infeasible for large graph data, as elaborated earlier. For MKG, O(n^2) implies on the order of 10^18 operations, which is infeasible for any modern computer architecture.

2. Complex data schemata: compared to typical social graphs, which tend to have a small number of data types, e.g., person and post, a real-world knowledge graph usually has thousands of types of entities and relationships. For example, the MKG data set has 1610 entity types and 5987 types of relationships between entities.

3. Multi-typed entities: a real-world entity may have multiple roles to play. In MKG, the entity schemata differ from entity to entity. For example, there is an entity "Pal" who is a dog, but we cannot use a predefined "dog" entity type to represent it, because "Pal" happens to be an actor as well. This cannot be naturally modeled using the inheritance relationship of object-oriented data modeling either: we cannot let "actor" inherit the properties of "person", since an actor might not be a person; in this example, it is a dog. The root cause is the following: the relationship between entity instances and their entity types is an "AS-A" relationship, instead of an "IS-A" relationship. In the "Pal" example, as a dog, it has the property "breeds"; as an actor, it has the property "perform". Multi-typed entities make modeling knowledge graphs a tricky task.
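One way to realize the "AS-A" relationship is to attach a set of typed roles to each entity (composition) instead of deriving the entity from a single type (inheritance). The sketch below is our own illustration, not Trinity.KS's entity model; the class and property names are hypothetical.

```python
# Sketch: an entity holds typed roles rather than inheriting from one entity type.
class Entity:
    def __init__(self, name):
        self.name = name
        self.roles = {}                      # role type name -> role-specific properties

    def as_a(self, role_type, **props):
        """Attach a role (AS-A), e.g. entity.as_a('dog', breeds=...)."""
        self.roles[role_type] = props
        return self

    def get(self, role_type, prop):
        return self.roles[role_type][prop]

pal = Entity("Pal").as_a("dog", breeds="collie").as_a("actor", performs_in="Lassie")
print(sorted(pal.roles))          # Pal plays both roles
```

Each role carries its own schema, so "dog" properties and "actor" properties coexist on the same instance without forcing an artificial IS-A hierarchy.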

6.2 System design for serving large knowledge graphs in real time

In this case study, we introduce a knowledge serving platform, Trinity.KS [15], developed for serving big knowledge graphs such as MKG and Freebase (see footnote 11).

Trinity.KS has three levels of service abstractions: the Storage Layer, the Graph Layer, and the Query Processing Layer. The role of the storage layer is to manage the knowledge graph and some additional runtime data, providing fine-grained data manipulation interfaces at the level of graph nodes and edges. On top of it, the graph layer realizes a strongly typed entity model and provides data manipulation interfaces at the granularity of entities. The query processing layer, built on top of the graph layer, provides the online query processing capability.

11 https://www.freebase.com/


Storage Layer This layer manages the knowledge graph data. To manage a large knowledge graph, the graph is randomly partitioned over a cluster of machines. Each machine manages a graph partition; the knowledge graph facts for the same entity are guaranteed to be placed on the same machine to reduce the message passing costs during query processing.

Graph Layer This layer provides an entity graph view via a strongly-typed entity model, and exposes a set of index-free graph exploration APIs to the query processing layer. Advanced graph operators, such as finding shortest paths and performing random walks, can be built on top of these interfaces.

Query Processing Layer With the rich set of graph access interfaces provided by the graph layer and the storage layer, many online queries can be processed in the query processing layer, such as relation search, subgraph matching, and interactive graph browsing.

To avoid building prohibitive indexes, Trinity.KS uses fast index-free graph exploration to answer graph queries. Specifically, Trinity.KS [15] has two key ingredients for real-time knowledge graph serving:

1. Leveraging the rich schemata of the knowledge graph to prune graph traversal paths. A knowledge graph usually has very complex data schemata compared to other kinds of graphs, such as social networks or web graphs. The complex data schemata pose great challenges for graph modeling; on the other hand, the rich schema information can help us prune the graph traversal space during query processing. Suppose we want to find the paths between two given persons. From the graph schemata we may know that a Person's predicate profession will never route to another Person, so edges with this predicate can be safely pruned during traversal.

2. A highly optimized asynchronous fanout search (AFOS), as introduced in Section 3.1. AFOS is similar to a distributed breadth-first search (BFS), but with a clear difference. Like BFS, AFOS performs graph exploration hop by hop; however, the search process is fully asynchronous, meaning a sub-search task never needs to wait for the termination of the other sub-search tasks invoked at the same hop. Compared to BFS, AFOS can greatly reduce the synchronization overhead.
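The schema-based pruning in point 1 can be sketched as follows (a toy, flat schema of our own invention, not the MKG schema; a real system would compute type reachability transitively over the schema graph rather than with a single equality check):

```python
# Sketch: the schema records which entity type each predicate leads to, so a
# path search between two Person entities can skip predicates that can never
# route back to a Person.
SCHEMA = {                       # predicate -> (subject type, object type)
    "spouse":     ("Person", "Person"),
    "colleague":  ("Person", "Person"),
    "profession": ("Person", "Profession"),
    "birthplace": ("Person", "City"),
}

def useful_predicates(schema, target_type):
    """Keep only predicates whose object type can still reach the target type.
    In this flat toy schema, that means object type == target type."""
    return {p for p, (_, obj) in schema.items() if obj == target_type}

print(sorted(useful_predicates(SCHEMA, "Person")))
```

In the person-to-person example above, "profession" and "birthplace" edges are never expanded, which can shrink the fanout of every hop dramatically.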

7 Advanced Topics

In the previous sections, we assumed the graph is modeled and stored in its native adjacency list form. For certain tasks, transforming graphs to other representation forms can help tackle the problems. This section covers three of them: matrix arithmetic, graph embedding, and matroids.


7.1 Matrix arithmetic

A representative system is Pegasus [19], an open source large graph mining system. The key idea of Pegasus is to convert graph mining operations into iterative matrix-vector multiplications.

Pegasus uses an n-by-n matrix M and a vector v of size n to represent graph data, and defines a Generalized Iterated Matrix-Vector Multiplication (GIM-V):

M × v = v', where v'_i = Σ_{j=1..n} m_{i,j} × v_j

Based on this, three primitive graph mining operations are defined. Graph mining problems are solved by customizing these three operations:

– combine2(m_{i,j}, v_j): multiply m_{i,j} and v_j;
– combineAll_i(x_1, ..., x_n): combine all n multiplication results from combine2 for node i;
– assign(v_i, v_new): decide how to update v_i with v_new.

Many graph mining algorithms, such as PageRank, random walks, diameter estimation, and connected components, can be expressed elegantly using these three customized primitive operations.
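As an illustration (our own sketch, not the Pegasus implementation), here is GIM-V with the three operations customized for PageRank, where M holds the column-normalized adjacency matrix and c is the damping factor:

```python
# Sketch: GIM-V as three pluggable operations, specialized for PageRank.
def gimv(M, v, combine2, combine_all, assign):
    n = len(v)
    return [assign(v[i], combine_all(i, [combine2(M[i][j], v[j]) for j in range(n)]))
            for i in range(n)]

def pagerank(M, iters=50, c=0.85):
    n = len(M)
    v = [1.0 / n] * n
    combine2 = lambda m, vj: c * m * vj            # damped contribution
    combine_all = lambda i, xs: sum(xs)            # sum all contributions to node i
    assign = lambda vi, vnew: (1 - c) / n + vnew   # add the teleport term
    for _ in range(iters):
        v = gimv(M, v, combine2, combine_all, assign)
    return v

# 3-node example: edges 0->1, 0->2, 1->2, 2->0; columns are normalized out-edges
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
ranks = pagerank(M)
print([round(r, 3) for r in ranks])
```

Swapping the three lambdas (e.g., min for combineAll and a comparison for assign) turns the same skeleton into connected components; Pegasus runs this skeleton at scale on Hadoop.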

7.2 Graph embedding

Certain graph problems can be easily solved after we embed a graph into a high dimensional space [51, 52, 31]. This approach is particularly useful for estimating the distances between graph nodes.

Fig. 9. Calculating shortest distance using graph embedding

Let us use an example given by Zhao et al. [52] to illustrate the main idea of high dimensional graph embedding. As shown in Figure 9, to compute the distance between two given graph vertices, we can embed the graph into a geometric space so that the distances in that space preserve the shortest distances in the graph. For example, we can immediately give an approximate shortest distance between vertices A and B by calculating the Euclidean distance between their coordinates in the high dimensional geometric space.
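A simple embedding of this kind (our own sketch; Zhao et al. [52] use a more sophisticated construction) takes each vertex's BFS distances to a few reference vertices as its coordinates; per-coordinate differences then lower-bound the true shortest distance, and the Chebyshev (max) distance gives the tightest such bound.

```python
# Sketch: Lipschitz-style embedding. Coordinates of v = (d(v, r1), d(v, r2), ...)
# for reference vertices r1, r2, ...; |coord_a - coord_b| <= d(a, b) per dimension.
from collections import deque

def bfs(graph, src):
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def embed(graph, refs):
    tables = [bfs(graph, r) for r in refs]
    return {v: tuple(t[v] for t in tables) for v in graph}

def estimated_distance(coords, a, b):
    """Chebyshev distance between coordinates: a lower bound on d(a, b)."""
    return max(abs(x - y) for x, y in zip(coords[a], coords[b]))

# Path graph 0-1-2-3-4-5, embedded against reference vertices 0 and 5
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
coords = embed(graph, refs=[0, 5])
print(coords[1], estimated_distance(coords, 1, 4))
```

On this path graph the bound is exact; on general graphs, accuracy depends on how well the reference vertices cover the shortest paths, just as with the landmark choices in Section 4.4.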

7.3 Matroids

Any undirected graph may be represented by a binary matrix, which in turn can produce a graphic matroid [29]. Matroids usually use an "edge-centric" graph representation: instead of representing a graph as (V, E), we consider a graph as a set E of edges and consider graph nodes as certain subsets of E [38]. For example, a graph (V, E) = ({a, b, c}, {e1 = ab, e2 = bc, e3 = ca}) can be represented as follows: E = {e1, e2, e3} and V = {a = {e1, e3}, b = {e1, e2}, c = {e2, e3}}. Generally, a matroid consists of a non-empty finite set E, called the ground set, and a non-empty collection B of subsets of E, called bases or circuits, satisfying the following two properties [42, 28]:

– No base properly contains another;
– If B1 and B2 are distinct bases and e ∈ B1 ∩ B2, then (B1 ∪ B2) − {e} ∈ B.
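For the triangle example above, we can enumerate the bases of its graphic matroid (the spanning trees) in the edge-centric representation and verify both properties by brute force (illustrative code of our own; independence testing uses a small union-find to detect cycles):

```python
# Sketch: the graphic matroid of the triangle a-b-c. Bases = spanning trees.
from itertools import combinations

edges = {"e1": ("a", "b"), "e2": ("b", "c"), "e3": ("c", "a")}

def is_spanning_tree(subset):
    """A base of the graphic matroid = an acyclic edge set connecting all vertices."""
    parent = {v: v for v in "abc"}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    for e in subset:
        u, w = (find(x) for x in edges[e])
        if u == w:
            return False          # adding this edge would close a cycle
        parent[u] = w
    return len({find(v) for v in "abc"}) == 1

bases = [set(s) for r in range(4) for s in combinations(edges, r) if is_spanning_tree(s)]
print(sorted(sorted(b) for b in bases))

# Property 1: no base properly contains another
assert not any(b1 < b2 for b1 in bases for b2 in bases)
# Property 2: for distinct bases B1, B2 and e in B1 ∩ B2, (B1 ∪ B2) − {e} is a base
for b1 in bases:
    for b2 in bases:
        if b1 != b2:
            for e in b1 & b2:
                assert (b1 | b2) - {e} in bases
```

The triangle has exactly three spanning trees (any two of its edges), and the exchange property is easy to see: dropping the shared edge from the union of two bases leaves the remaining two edges, which again form a spanning tree.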

Matroids provide a new perspective for looking at graphs, with a powerful set of tools for solving many graph problems. Even though the interplay between graphs and matroids has proven fruitful [28], how matroids can be leveraged to help design graph processing systems is still an open problem.

8 Summary

As large graph applications proliferate, efficient parallel graph processing becomes more and more important. Parallel graph processing is an active research area. This chapter tries to shed some light on parallel large graph processing from a pragmatic point of view. We discussed the challenges and general principles of designing general purpose large scale graph processing systems. After surveying a few representative systems, we reviewed a few important graph computation paradigms for both online query processing and offline analytics. As a special graph-related computation task, the principles for designing large graph generators were also briefly covered. Different graph representations lead to different graph computation paradigms, and each of them may be suitable for solving a certain range of graph problems. At the end of this chapter, we explored a few alternative graph representation forms and their applications as advanced topics for further study.

References

1. Aggarwal, C.C., Wang, H. (eds.): Managing and Mining Graph Data, Advances in Database Systems, vol. 40. Springer (2010)


2. Aranda-Andujar, A., Bugiotti, F., Camacho-Rodrıguez, J., Colazzo, D., Goasdoue, F., Kaoudi, Z., Manolescu, I.: Amada: Web data repositories in the amazon cloud. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management. pp. 2749–2751. CIKM ’12, ACM, New York, NY, USA (2012), http://doi.acm.org/10.1145/2396761.2398749

3. Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix "bit" loaded: a scalable lightweight join query processor for rdf data. In: WWW. pp. 41–50 (2010)

4. Bollobas, B.: Modern Graph Theory. Graduate texts in mathematics, Springer (1998)

5. Cheng, J., Yu, J.X., Ding, B., Yu, P.S., Wang, H.: Fast graph pattern matching. In: ICDE. pp. 913–922 (2008)

6. Cohen, J.: Graph twiddling in a mapreduce world. Computing in Science & Engineering pp. 29–41 (2009)

7. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26(10), 1367–1372 (2004)

8. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (January 2008)

9. von Eicken, T., Culler, D.E., Goldstein, S.C., Schauser, K.E.: Active messages: a mechanism for integrated communication and computation. In: Proceedings of the 19th annual international symposium on Computer architecture. pp. 256–266. ISCA ’92, ACM, New York, NY, USA (1992)

10. Garey, M.R., Johnson, D.S., Stockmeyer, L.: Some simplified np-complete problems. In: Proceedings of the Sixth Annual ACM Symposium on Theory of Computing. pp. 47–63. STOC ’74, ACM, New York, NY, USA (1974), http://doi.acm.org/10.1145/800119.803884

11. Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: Powergraph: Distributed graph-parallel computation on natural graphs. In: OSDI. pp. 17–30 (2012)

12. Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I.: Graphx: Graph processing in a distributed dataflow framework. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. pp. 599–613. OSDI’14, USENIX Association, Berkeley, CA, USA (2014)

13. Gregor, D., Lumsdaine, A.: The Parallel BGL: A generic library for distributed graph computations. POOSC ’05

14. He, H., Singh, A.K.: Graphs-at-a-time: query language and access methods for graph databases. In: SIGMOD (2008)

15. He, L., Shao, B., Li, Y., Chen, E.: Distributed real-time knowledge graph serving. In: Proceedings of BigComp 2015. BigComp ’15 (2015)

16. Holder, L.B., Cook, D.J., Djoko, S.: Substructure discovery in the subdue system. In: KDD Workshop. pp. 169–180 (1994)

17. Husain, M., McGlothlin, J., Masud, M.M., Khan, L., Thuraisingham, B.M.: Heuristics-based query processing for large rdf graphs using cloud computing. IEEE Trans. on Knowl. and Data Eng. 23(9), 1312–1327 (Sep 2011), http://dx.doi.org/10.1109/TKDE.2011.103

18. Iordanov, B.: Hypergraphdb: a generalized graph database. pp. 25–36. WAIM ’10, Springer-Verlag (2010)

19. Kang, U., Tsourakakis, C.E., Faloutsos, C.: Pegasus: A peta-scale graph mining system implementation and observations. pp. 229–238. ICDM ’09, IEEE Computer Society (2009)

20. Kaoudi, Z., Manolescu, I.: Rdf in the clouds: A survey. The VLDB Journal 24(1), 67–91 (Feb 2015), http://dx.doi.org/10.1007/s00778-014-0364-z


21. Kasneci, G., Suchanek, F.M., Ifrim, G., Ramanath, M., Weikum, G.: Naga: Search-ing and ranking knowledge. In: ICDE. pp. 953–962 (2008)

22. Kyrola, A., Blelloch, G., Guestrin, C.: Graphchi: Large-scale graph computationon just a pc. In: OSDI. pp. 31–46 (2012)

23. Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.:Distributed graphlab: a framework for machine learning and data mining in thecloud. Proc. VLDB Endow. 5(8), 716–727 (Apr 2012)

24. Lumsdaine, A., Gregor, D., Hendrickson, B., Berry, J.W.: Challenges in parallelgraph processing. Parallel Processing Letters 17(1), 5–20 (2007)

25. Majumder, S., Rixner, S.: An event-driven architecture for mpi libraries. In: InProceedings of the 2004 Los Alamos Computer Science Institute Symposium (2004)

26. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Cza-jkowski, G.: Pregel: a system for large-scale graph processing. SIGMOD ’10, ACM

27. Neumann, T., Weikum, G.: The rdf-3x engine for scalable management of rdf data.VLDB J. 19(1), 91–113 (2010)

28. Oxley, J.: On the interplay between graphs and matroids

29. Oxley, J.: Matroid Theory. Oxford University Press (1992)

30. Papailiou, N., Konstantinou, I., Tsoumakos, D., Koziris, N.: H2RDF: Adaptive query processing on RDF data in the cloud. In: Proceedings of the 21st International Conference on World Wide Web. pp. 397–400. WWW ’12 Companion, ACM, New York, NY, USA (2012), http://doi.acm.org/10.1145/2187980.2188058

31. Qi, Z., Xiao, Y., Shao, B., Wang, H.: Distance oracle on billion node graphs. In:VLDB. VLDB Endowment (2014)

32. Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C., Lin, X.: Scalable big graph processing in MapReduce. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. pp. 827–838. SIGMOD ’14, ACM, New York, NY, USA (2014), http://doi.acm.org/10.1145/2588555.2593661

33. Ravindra, P., Kim, H., Anyanwu, K.: An intermediate algebra for optimizing RDF graph pattern matching on MapReduce. In: Proceedings of the 8th Extended Semantic Web Conference on The Semantic Web: Research and Applications - Volume Part II. pp. 46–61. ESWC’11, Springer-Verlag, Berlin, Heidelberg (2011), http://dl.acm.org/citation.cfm?id=2017936.2017941

34. Rohloff, K., Schantz, R.E.: Clause-iteration with MapReduce to scalably query datagraphs in the SHARD graph-store. In: Proceedings of the Fourth International Workshop on Data-intensive Distributed Computing. pp. 35–44. DIDC ’11, ACM, New York, NY, USA (2011), http://doi.acm.org/10.1145/1996014.1996021

35. Sarwat, M., Elnikety, S., He, Y., Mokbel, M.F.: Horton+: A distributed system for processing declarative reachability queries over partitioned graphs. Proc. VLDB Endow. 6(14), 1918–1929 (Sep 2013)

36. Shao, B., Wang, H., Li, Y.: Trinity: a distributed graph engine on a memory cloud. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. pp. 505–516. SIGMOD ’13, ACM, New York, NY, USA (2013)

37. Sun, Z., Wang, H., Wang, H., Shao, B., Li, J.: Efficient subgraph matching on billion node graphs. Proc. VLDB Endow. 5(9), 788–799 (May 2012)

38. Truemper, K.: Matroid Decomposition. Elsevier (1998)

39. Ullmann, J.R.: An algorithm for subgraph isomorphism. J. ACM 23(1), 31–42 (1976)

40. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33, 103–111 (August 1990)

41. Wang, L., Xiao, Y., Shao, B., Wang, H.: How to partition a billion-node graph. In: IEEE 30th International Conference on Data Engineering, ICDE 2014, Chicago, IL, USA, March 31 - April 4, 2014. pp. 568–579 (2014), http://dx.doi.org/10.1109/ICDE.2014.6816682

42. Wilson, R.: Introduction to graph theory. Prentice Hall (1996)

43. Wu, W., Li, H., Wang, H., Zhu, K.Q.: Probase: a probabilistic taxonomy for text understanding. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. pp. 481–492. SIGMOD ’12, ACM, New York, NY, USA (2012)

44. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. pp. 10–10. HotCloud’10, USENIX Association (2010)

45. Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale RDF data. In: VLDB. VLDB Endowment (2013)

46. Zhang, S., Li, S., Yang, J.: Gaddi: distance index based subgraph matching in biological networks. In: EDBT (2009)

47. Zhang, S., Yang, J., Jin, W.: Sapper: Subgraph indexing and approximate matching in large graphs. PVLDB 3(1), 1185–1194 (2010)

48. Zhang, X., Chen, L., Tong, Y., Wang, M.: Eagre: Towards scalable I/O efficient SPARQL query evaluation on the cloud. In: Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013). pp. 565–576. ICDE ’13, IEEE Computer Society, Washington, DC, USA (2013), http://dx.doi.org/10.1109/ICDE.2013.6544856

49. Zhao, P., Han, J.: On graph query optimization in large networks. PVLDB 3(1), 340–351 (2010)

50. Zhao, X., Sala, A., Wilson, C., Zheng, H., Zhao, B.Y.: Orion: shortest path estimation for large social graphs. pp. 1–9. WOSN’10, USENIX Association (2010)

51. Zhao, X., Sala, A., Wilson, C., Zheng, H., Zhao, B.Y.: Orion: shortest path estimation for large social graphs. In: WOSN’10 (2010)

52. Zhao, X., Sala, A., Zheng, H., Zhao, B.Y.: Fast and scalable analysis of massivesocial graphs. CoRR (2011)

53. Zhu, F., Qu, Q., Lo, D., Yan, X., Han, J., Yu, P.S.: Mining top-k large structural patterns in a massive network. In: VLDB (2011)

54. Zou, L., Chen, L., Özsu, M.T.: Distancejoin: Pattern match query in a large graph database. PVLDB 2(1), 886–897 (2009)