


Understanding Graph Computation Behavior to Enable Robust Benchmarking

Fan Yang and Andrew A. Chien*

Department of Computer Science University of Chicago Chicago, IL, U.S.A.

{fanyang, achien}@cs.uchicago.edu

*also Mathematics and Computer Science, Argonne National Laboratory

ABSTRACT
Graph processing is widely recognized as important for a growing range of applications, including social network analysis, machine learning, data mining, and web search. Recently, many performance assessments and comparative studies of graph computation have been published, all of which employ highly varied ensembles of algorithms and graphs. To explore the robustness of these studies, we characterize how behavior varies across a variety of graph algorithms on a diverse collection of graphs (size and degree distribution). Our results show that graph computation behaviors form a very broad space, and inefficient exploration of this space may lead to ad-hoc studies and a narrow understanding of graph-processing performance.

Hence, we consider constructing an ensemble of experiments, e.g. a benchmark suite, to most effectively explore the graph computation behavior space. We study different ensembles of graph computations, and define two metrics, spread and coverage, to quantify how efficiently and completely an ensemble explores the space. Our results show that: (1) an ensemble drawn from a single algorithm or a single graph may unfairly characterize a graph-processing system, (2) an ensemble exploring both algorithm diversity and graph diversity improves the quality significantly, but must be carefully chosen, (3) some algorithms are more useful than others for exploring the space, and (4) it is possible to reduce benchmarking complexity (i.e. number of algorithms, graphs, etc.) while preserving benchmarking quality.

Categories and Subject Descriptors C.4 [Computer Systems Organization]: Performance of systems – Performance attributes.

General Terms Algorithms, Measurement, Performance, Experimentation.

Keywords Graph processing, Graph-algorithms, Benchmarking.

1. INTRODUCTION Rapid growth of the World Wide Web and social networking has given rise to massive graph data sets used for modeling in domains such as social networks, recommender systems, protein interaction networks, web graphs, and more. These real-world graphs have remarkable size [24, 27], which combined with complex algorithms creates a critical need for efficient and scalable graph-processing systems.

To meet this need, the research community has created numerous graph-processing systems including Pregel [19], Giraph [1], GraphLab [7], SNAP [24], TurboGraph [11], Mizan [16], GPS [21], and GraphChi [17], designed to meet the challenges of graph processing: (i) extreme scale (hundreds of billions of edges), (ii) irregular computation structure (difficult to aggregate multiple vertex operations), (iii) poor locality, and (iv) wide variation in parallelism.

A confounding factor is the wide variation in the properties of both graphs and graph algorithms. Graph features, including graph size, vertex degree, and vertex/edge weights, depend heavily on application domain. For example, the largest graph available from the Web Data Commons [27] has 3.5 billion vertices (web pages) and 128 billion edges (hyperlinks). Another web graph, the "Wikipedia vote network" [24], contains only 7,115 vertices and 103,689 edges. Vertex degree also varies widely by domain. In social networks, popular Twitter or Facebook accounts have millions of followers, while others have only a few. But in a graph derived from a linear solver, vertices have a low, nearly uniform degree.

The dynamic properties of graph algorithms (variation in compute intensity, communication intensity, and vertex activity distribution) also exhibit great diversity. For example, computation at a vertex can vary widely based on algorithm. Communication intensity typically depends on vertex degree and algorithm. The active fraction of vertices is a strong function of algorithm – in PageRank, all vertices begin active and the fraction gradually decreases, whereas in Single-Source Shortest Path (SSSP), only the source vertex begins active, but the active fraction grows rapidly.

There have been numerous comparative studies of graph-processing systems, employing a wide variety of algorithms and graphs (see Table 1).

These comparative studies have produced incomparable or, in some cases, conflicting results. For example, Table 1 shows how three studies assess the relative merits of Giraph and GraphLab. Elser [6] finds that GraphLab outperforms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HPDC’15, June 15–19, 2015, Portland, Oregon, U.S.A. Copyright 2015 ACM 1-58113-000-0/00/0010 …$15.00.


Giraph on all graph datasets, while Han [10] finds that their relative performance varies, but that they are comparable overall. Guo [9] uses a different set of graphs, producing similar but incomparable results; he also finds that relative performance varies and draws no overall conclusion. Results such as these make it difficult for a practitioner of graph computing to gain a clear perspective on which systems are preferable and in which contexts (algorithm, graph, scale). More fundamentally, even experts cannot claim any deep understanding of graph computing system performance.

Our goal is to provide a more systematic understanding of the performance impact of graph algorithm and structure on graph processing systems, and thereby enable robust, systematic evaluation.

To understand and quantify diversity in graph computation behavior, we study the dynamic properties of a collection of graph algorithms chosen from diverse application domains. Using GraphLab as an experimental vehicle, we run the algorithms on a collection of synthetic graphs of varying size and structure. To characterize behavior, we measure key properties ranging from active vertices to compute intensity, characterizing behavior variation across graph algorithms and structures. Not only do algorithms behave differently, but some graph algorithms also exhibit radically diverse, input-graph-dependent behavior. This differs significantly from many numerical, scientific parallel computations. In fact, our behavior metrics (active fraction, compute intensity, and communication intensity) show high sensitivity to graph structure.

Benchmarking involves understanding performance over a behavior space. So, to understand graph-processing behavior systematically, we define a vector space using fundamental graph computation properties that also affect performance. To analyze properties of ensembles of runs, we define ensemble metrics, spread and coverage, that capture how well a set of graph-algorithm pairs explores the behavior space. With these metrics, we evaluate the benchmarking quality of different ensembles of runs. Our results show that neither single algorithms with a variety of graph structures, nor multiple algorithms over single graphs, explore the behavior space well, giving rise to capricious evaluations of graph-processing systems.

We also search all the runs across multiple algorithms and graphs for the best spread and coverage. Our study suggests that a small set of carefully chosen ensembles could sample the behavior space much more efficiently, which presents an opportunity to build a high-quality benchmark suite. We then design several ensembles systematically under constraints and evaluate their advantages. Specific contributions of this paper include:

(1) We conduct a series of experiments varying algorithm and graphs, which demonstrates 1000-fold variation across five dimensions of graph computation behavior.

(2) We define two ensemble metrics, spread and coverage, to measure how efficiently and thoroughly an ensemble of experiments explores the graph computation behavior space.

(3) We demonstrate that an ensemble drawn from a single algorithm or a single graph is a poor benchmark set, and an ensemble exploring both algorithm diversity and graph diversity gives 200% better spread and 30% better coverage.

(4) Our insights into the best ensembles with high spread and coverage show that some algorithms, including K-Means, Alternating Least Squares, and Triangle Counting, are more useful than other algorithms in behavior space exploration.

(5) We find that careful reduction in algorithm diversity minimizes the loss of spread and coverage, with further optimization possible by reducing runtime.

The remainder of the paper is organized as follows. In Section 2, we describe graph algorithms used throughout, drawing from varied application domains. In Section 3, we detail the methodology, experiments, and metrics. Our experiments with each algorithm and graph are documented in Section 4, showing the great diversity of behavior. In Section 5, we consider ensembles, and how to best design them for thorough graph system evaluation. Related work is discussed in Section 6, and Section 7 summarizes and suggests future directions.

2. Graph Algorithms and Structure We describe a variety of graph computation problems, and the algorithms and graph structures that arise in them. We selected algorithms from distinct application domains for their widely varying behaviors.

2.1 Graph Algorithms (1) Graph Analytics (GA) focuses on data mining, especially extracting relationships from large graphs. We choose six algorithms: Connected Components (CC), K-Core decomposition (KC), Triangle Counting (TC), Single-Source Shortest Path (SSSP), PageRank (PR), and Approximate Diameter (AD).

• Connected Components [29] of an undirected graph are the subgraphs in which any two vertices are connected to each other. To find all connected components in a graph, the CC program compares the IDs of adjacent vertices and only updates a vertex if its ID is larger than the minimum value. Vertices only receive data from the neighbors that activate them.

• A K-Core [5] of a graph is the largest subgraph whose vertices have at least k interconnections. To find all K-Cores of the input graph, the KC program recursively removes all vertices with degree d = 0, 1, 2, ... Similarly, vertices only receive data from the neighbors that activate them.

• Triangle Counting counts the number of triangles formed by three adjacent vertices in an undirected graph. For each edge in the graph, the TC program counts the size of the intersection of the neighbor sets of both endpoints.
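The per-edge neighbor-set-intersection approach described above can be sketched as follows. This is an illustrative sketch, not the paper's GraphLab implementation; the function and variable names are ours.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as an edge list."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    total = 0
    for u, v in edges:
        # Each triangle (u, v, w) is counted once per edge; every
        # triangle has 3 edges, so divide by 3 at the end.
        total += len(neighbors[u] & neighbors[v])
    return total // 3

# A 4-clique contains 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(edges))  # -> 4
```

The intersection work on each edge is proportional to the endpoints' degrees, which is why TC's compute intensity is sensitive to degree distribution.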

Table 1. Comparative Graph Processing System Evaluations

Authors      | Systems                                              | Algorithms                             | Graphs
M. Han [10]  | Giraph, GPS, Mizan, GraphLab                         | PageRank, SSSP, WCC, DMST              | soc-LiveJournal, com-Orkut, Arabic-2005, Twitter-2010, UK-2007-05
B. Elser [6] | Map-Reduce, Stratosphere, Hama, Giraph, GraphLab     | K-Core decomposition                   | ca.AstroPh, ca.CondMat, Amazon0601, web-BerkStan, com.Youtube, wiki-Talk, com.Orkut
Y. Guo [9]   | Hadoop, YARN, Stratosphere, Giraph, GraphLab, Neo4j  | Statistics algorithm, BFS, CC, CD, GE  | Amazon, WikiTalk, KGS, Citation, DotaLeague, Synth, Friendster


• Single-Source Shortest Path finds shortest paths from one special vertex to all other vertices. The source vertex is active initially. In each iteration, an active vertex computes and updates distances for adjacent vertices.

• PageRank [31] ranks web pages, treating links as votes: a page linking to another page casts a vote for it. Pages with higher rank carry more weight in the links they cast [15]. All vertices are active initially. A vertex becomes inactive when its rank remains stable within a given tolerance.

• Approximate Diameter estimates the diameter of a graph, i.e., the longest distance (the longest shortest path) between any two vertices.
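The SSSP activity pattern described above, where only the source starts active and the active set then grows, can be sketched as a simple frontier loop. This is a hedged sketch with names of our choosing, not GraphLab's API.

```python
from collections import defaultdict

def sssp(weighted_edges, source):
    """Frontier-style SSSP over a directed, weighted edge list."""
    adj = defaultdict(list)
    for u, v, w in weighted_edges:
        adj[u].append((v, w))
    dist = {source: 0}
    active = {source}              # only the source starts active
    while active:
        frontier = set()
        for u in active:
            for v, w in adj[u]:
                if dist[u] + w < dist.get(v, float("inf")):
                    dist[v] = dist[u] + w
                    frontier.add(v)    # improved vertices activate next round
        active = frontier          # active set grows rapidly, then shrinks
    return dist

print(sssp([(0, 1, 1), (0, 2, 4), (1, 2, 1)], 0))  # -> {0: 0, 1: 1, 2: 2}
```

Tracking len(active) per round gives exactly the active-fraction curve this paper measures for SSSP.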

(2) Clustering classifies objects based on their similarity. We choose the K-Means algorithm (KM), which partitions n vertices into k clusters such that each vertex belongs to the cluster with the nearest mean [32]. All vertices remain active through the whole lifecycle. In scatter, each vertex sends messages to neighbors when its cluster assignment has changed, and vertices that receive no messages become inactive in the next iteration.

(3) Collaborative Filtering (CF) is a technique used by recommender systems to predict the 'rating' or 'preference' that a user would give to an item, such as a movie or a book [28]. Each user is associated with a user-factor vector, and each item is associated with an item-factor vector. Prediction is done by taking an inner product [14]. Therefore, a general method for CF is learning all the user-factor and item-factor vectors through matrix factorization. We select four algorithms:

• Alternating Least Squares (ALS) is used for explicit-feedback datasets, where unknown values are treated as missing, leading to a sparse objective function [14].

• Non-negative Matrix Factorization (NMF) factorizes non-negative matrices.

• Stochastic Gradient Descent (SGD) is a gradient-descent optimization method for minimizing an objective function written as a sum of differentiable functions [34].

• Singular Value Decomposition (SVD) decomposes a matrix into the product of unitary matrices and a diagonal matrix using the restarted Lanczos algorithm.
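The inner-product prediction and one SGD update can be sketched as below. This is the standard regularized form, not necessarily the exact GraphLab formulation; all names and the learning-rate/regularization parameters are our illustrative choices.

```python
def predict(user_factors, item_factors):
    """Predicted rating = inner product of the two factor vectors."""
    return sum(u * i for u, i in zip(user_factors, item_factors))

def sgd_step(u, v, rating, lr=0.01, reg=0.1):
    """One SGD update on a single observed rating (standard regularized
    matrix-factorization form; illustrative, not GraphLab's code)."""
    err = rating - predict(u, v)
    u_new = [ui + lr * (err * vi - reg * ui) for ui, vi in zip(u, v)]
    v_new = [vi + lr * (err * ui - reg * vi) for ui, vi in zip(u, v)]
    return u_new, v_new

print(predict([0.5, 1.0], [4.0, 2.0]))  # -> 4.0
```

In the graph view used by the paper, each observed rating is an edge, and one SGD step is the per-edge work performed during an iteration.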

(4) Other algorithms include the Jacobi method, the Loopy Belief Propagation (LBP) algorithm, and Dual Decomposition (DD).

• The Jacobi method is an iterative method for solving a diagonally dominant system of linear equations.

• Loopy Belief Propagation is a discrete structured-prediction application that can be applied to a wide range of prediction problems.

• Dual Decomposition solves a relaxation of difficult optimization problems by decomposing them into simpler sub-problems.
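As a minimal sketch of the Jacobi iteration (dense lists for brevity; the paper's version operates over a graph where each edge is a matrix element):

```python
def jacobi(A, b, iters=50):
    """Jacobi iteration for a diagonally dominant system Ax = b."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # Each component is updated from the previous iterate only,
        # which is what makes the method naturally vertex-parallel.
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [9.0, 8.0]
print(jacobi(A, b))  # converges toward [19/11, 23/11]
```

Because every component is recomputed each sweep, all vertices stay active for all iterations, matching the behavior reported for Jacobi in Section 4.4.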

2.2 Graph Structure Graph features include size, vertex values, edge weights, and connectivity (degree distribution). All these features vary significantly across application domains. For example, GA applications use graphs as abstractions of real-world networks; clustering applications use graphs to represent lists of data points;

CF applications use graphs to represent recommendation matrices; and linear solvers use graphs to represent matrices.

Graph size is measured in vertices and edges. The number of edges is much larger and determines the scale of a graph problem. Vertex values and edge weights are defined by application domains. In Graph Analytics, vertices and edges are usually assigned simple values, such as the rank of a page in PR or the edge weights in SSSP. Vertex/edge values carry more complex meanings in Machine Learning. For example, in K-Means clustering, each vertex is a vector. In CF applications, each edge is associated with a rating, representing how much the user 'recommends' that item. In linear solvers, each edge models a matrix element, with the source vertex ID as the row index, the target vertex ID as the column index, and the edge weight as the element value.

The degree distribution P(k) of a graph is defined as the fraction of vertices in the graph with degree k. Thus if n_k out of n vertices have degree k, we have P(k) = n_k / n [30]. The degree distribution characterizes graph structure, and hence affects the behavior of topology-sensitive algorithms. Even across application domains, many real-world graphs exhibit similar degree distributions. Typically, vertices with smaller degrees are more frequent than those with larger degrees. This is called a power-law distribution, and graphs with such distributions are called scale-free networks [33]. More precisely, the fraction P(k) of vertices in the graph with degree k behaves for large values of k as

P(k) ~ k^(-α),    (1)

where α is a constant that typically satisfies 2.0 < α < 3.0 for real-world graphs, although it may also lie outside these bounds.
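A power-law degree sequence of this form can be sampled, and α recovered, with a few lines of code. This sketch uses inverse-transform sampling of the continuous approximation and the standard maximum-likelihood estimator; it is illustrative, not the paper's graph generator.

```python
import math
import random

def sample_power_law(n, alpha, kmin=1.0, kmax=1e6):
    """Draw n degree values with P(k) ~ k^(-alpha) by inverse-transform
    sampling of the continuous approximation (a common shortcut)."""
    out = []
    for _ in range(n):
        u = random.random()
        # Inverse CDF of the continuous power law on [kmin, inf).
        k = kmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))
        out.append(min(k, kmax))
    return out

def estimate_alpha(degrees, kmin=1.0):
    """Maximum-likelihood estimate: alpha = 1 + n / sum(ln(k / kmin))."""
    logs = [math.log(k / kmin) for k in degrees if k >= kmin]
    return 1.0 + len(logs) / sum(logs)

random.seed(0)
degrees = sample_power_law(20000, alpha=2.5)
print(estimate_alpha(degrees))  # close to 2.5
```

Varying alpha in this way is how a generator can produce the family of degree distributions (α = 2.0 to 3.0) used in the experiments of Section 3.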

3. METHODS We introduce the system configuration used for experiments, the datasets for each domain, and the performance metrics used in our experiments.

3.1 Experiment Setup We employ GraphLab v2.2 [7] as our graph-processing platform. We execute programs using the synchronous mode where the Gather, Apply, and Scatter phases are performed without overlap.

Each graph algorithm is executed on a variety of graphs. We change the value of graph features one at a time to isolate impacts on behavior. To limit the number of experiments, we use four different sizes, and five different degree distributions (varying α in Equation 1) for each graph.

All the experiments were performed on the Midway system in Research Computing Center of the University of Chicago. We used up to 48 nodes, connected by a fully non-blocking FDR-10 Infiniband network. Each node has two eight-core 2.6GHz Intel Xeon E5-2670 "Sandy Bridge" processors with 32GB of memory.

3.2 Datasets We use synthetic graph generators to create graphs for each application domain. Parameters include the number of edges (nedges) and α (alpha) in the power-law degree distribution. We set nedges to different orders of magnitude (10^5 to 10^9). We also set α to match real-world scale-free graphs (2.0 to 3.0), accepting slight variation in the number of vertices. The vertex data and edge weights are drawn randomly from a Gaussian distribution and are not varied.


The inputs to Graph Analytics algorithms are unweighted graphs.

In the domain of Clustering, vertices are data points (in this paper they are 2D vectors) and edges are pairwise rewards between vertices. So the inputs to Clustering algorithms include unweighted graphs, and vertex data in the format of vector lists.

Inputs for Collaborative Filtering are weighted graphs, where source vertices of edges are users, target vertices are items to be recommended, and the weight of an edge represents the rating that a user gives to an item. Hence the total number of vertices is the sum of users and items. To simplify our experiments, we assume the number of items is equal to the number of users.

Inputs of Jacobi include a matrix (also a weighted graph with uniform degree for all vertices) and a vector. Inputs of LBP include a pixel matrix and vertex data, which are prior estimates for each pixel color. To simplify our experiments, we only generate square matrices as the input to Jacobi and LBP. Inputs of DD are Markov Random Field (MRF) graphs in the standard UAI file format. For DD we use real-world MRF graphs downloaded from [12].

The range of variables is shown in Table 2.

3.3 Computation Model Our algorithms are implemented in the Gather-Apply-Scatter (GAS) model [7, 25]. Graph computation is expressed in a vertex-centric fashion. Each vertex can be active or inactive. Only active vertices can perform computation. Vertices are activated by receiving signals from a neighboring vertex. The overall GAS process is listed below:

• Gather collects data through adjacent edges. Both the data to collect and the set of edges (in-edges or out-edges) are user-defined. In this paper, the operation of collecting data through an edge is called an edge read;

• Apply applies user-defined computation to a central vertex and updates its value. In this paper, the operation of updating one vertex is called a vertex update;

• Scatter updates the values of adjacent edges and sends signals to activate neighbors. In this paper, a signal is called a message.
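As a concrete illustration of the synchronous GAS loop above, here is a minimal PageRank sketch. The structure follows the Gather/Apply/Scatter phases and the activation-by-message rule, but the names and API are ours, not GraphLab's.

```python
def pagerank_gas(out_edges, n, d=0.85, tol=1e-4, max_iters=100):
    """Synchronous GAS-style PageRank sketch (illustrative only)."""
    in_nbrs = {v: [] for v in range(n)}
    for u, targets in out_edges.items():
        for v in targets:
            in_nbrs[v].append(u)
    rank = [1.0 / n] * n
    active = set(range(n))         # in PageRank, all vertices start active
    for _ in range(max_iters):
        if not active:
            break                  # converged: no vertex received a message
        new_rank = list(rank)
        signaled = set()
        for v in active:
            # Gather: read ranks over in-edges (edge reads).
            acc = sum(rank[u] / len(out_edges[u]) for u in in_nbrs[v])
            # Apply: recompute the vertex value (a vertex update).
            new_rank[v] = (1 - d) / n + d * acc
            # Scatter: message out-neighbors while the rank is unstable.
            if abs(new_rank[v] - rank[v]) > tol:
                signaled.update(out_edges.get(v, []))
        rank, active = new_rank, signaled
    return rank
```

A vertex whose rank stabilizes within tol stops signaling, so the active fraction gradually decreases, which is exactly the PageRank activity pattern described in Section 1.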

Hence, inter-vertex communication includes both edge reads and message transfers. Only vertices that receive messages can be active in the next iteration. A vertex program converges (ends) when all vertices are inactive or the convergence condition is satisfied.

There are also other computation models used in current graph-processing systems (the edge-centric model [20] and the graph-centric model [25]), but the basic behavior of graph computation is conserved -- transferring information through edges, performing computation on an independent unit (vertex/edge), and activating neighbors.

In this paper, we call a complete GAS procedure in the synchronous model an iteration. In our experiments, all tested algorithms converge in a finite number of iterations, except NMF and SGD. So we set a maximum number of iterations (20) for these two algorithms.

3.4 Performance Metrics We describe a graph computation behavior with five metrics:

(1) Active fraction is the ratio of active vertices to all vertices in a single iteration; it directly shows the activity of vertices across the whole lifecycle.

Compute intensity measures how much work needs to be done for updating vertex values. We define two metrics to quantify compute intensity:

(2) UPDT: the average number of vertex updates per iteration

(3) WORK: the average CPU time for computing and updating vertex values per iteration, measured by the time spent in the user-defined apply function.

Communication intensity measures the information transferred between vertices. In GraphLab, inter-vertex communication includes data transfer (edge reads) and message transfer (signaling). In many cases, such as PR, these two types of communication differ. We define two metrics to quantify communication intensity:

(4) EREAD: the average number of edge reads per iteration

(5) MSG: the average number of messages per iteration

Moreover, because the values of UPDT, WORK, EREAD, and MSG can differ by orders of magnitude and depend strongly on the number of edges, we divide each of these four metrics by the number of edges to capture per-edge behavior. We also normalize these metrics to at most 1.0 to highlight relative differences rather than absolute values.
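The five metrics and their per-edge normalization can be sketched as below, assuming hypothetical per-iteration counters collected from instrumented runs; the field names are our assumptions, not the paper's instrumentation.

```python
def behavior_metrics(iter_stats, n_vertices, n_edges):
    """Compute the five behavior metrics from per-iteration counters.
    iter_stats: list of dicts, one per iteration, with raw counts."""
    n = len(iter_stats)
    avg = lambda key: sum(s[key] for s in iter_stats) / n
    return {
        # Per-iteration active fraction (the full curve, not an average).
        "active_fraction": [s["active"] / n_vertices for s in iter_stats],
        # Per-edge averages, as in the paper's normalization.
        "UPDT": avg("updates") / n_edges,
        "WORK": avg("apply_time") / n_edges,
        "EREAD": avg("edge_reads") / n_edges,
        "MSG": avg("messages") / n_edges,
    }
```

Dividing by n_edges makes runs on graphs of different sizes comparable, which is what lets the paper place all algorithm-graph pairs in one behavior space.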

4. Behavior Variation in Graph Computation In this section, we present experimental results of executing different graph algorithms on a variety of graph structures. Due to paper length constraints, we present a selection of results.

4.1 Graph Analytics Different GA algorithms exhibit different shapes of active fraction; for each GA algorithm, the active fraction exhibits similar trends over various graph structures (Figure 1). In detail, the shape of the trends is classified by degree distribution, especially for CC and SSSP: these algorithms exhibit a higher peak and converge faster with a more uniform degree distribution (i.e., a smaller α). In contrast, KC and PR are less sensitive to graph topology. Notably, AD maintains an active fraction of 1.0 for the whole lifecycle.

Table 2. Graph Feature Variables

Domain                  | Algorithms               | Variable | Range of Values
Graph Analytics         | CC, TC, KC, SSSP, PR, AD | nedges   | 10^6, 10^7, 10^8, 10^9
                        |                          | α        | 2.0, 2.25, 2.5, 2.75, 3.0
Clustering              | KM                       | nedges   | 10^6, 10^7, 10^8, 10^9
                        |                          | α        | 2.0, 2.25, 2.5, 2.75, 3.0
Collaborative Filtering | ALS, NMF, SGD, SVD       | nedges   | 10^5, 10^6, 10^7, 10^8
                        |                          | α        | 2.0, 2.25, 2.5, 2.75, 3.0
Linear Solver           | Jacobi                   | nrows    | 5000, 10000, 15000, 20000
Graphical Model         | LBP                      | nrows    | 5000, 10000, 15000, 20000
Graphical Model         | DD                       | nedges   | 1056, 1190, 1406, 1560


Figure 1. GA Active Fraction for All Graphs.

For KC and PR, all metrics heavily depend on graph size and degree distribution (see Figures 2 and 4). All metrics of KC are positively correlated to α, whereas the communication intensity of PR is negatively correlated to α.

In Figure 3, TC exhibits no significant variation in behavior across graph sizes; it has constant EREAD for all graphs. There is also less computation, fewer updates, and fewer messages transferred per iteration as the degree distribution becomes more uniform.

4.2 Clustering (K-Means) In Figure 5, KM keeps all vertices active at all times. It converges much more slowly than the GA algorithms, requiring more than 700 iterations.

In Figure 6, KM behaves differently across graph sizes and degree distributions. All metric values are positively correlated to α, except EREAD, which is constant.

4.3 Collaborative Filtering All of our Collaborative Filtering algorithms except ALS keep all vertices active for the entire lifecycle. As for ALS, the active fraction exhibits different trends across graph sizes and degree distributions (Figure 7). ALS converges much more slowly on larger graphs, showing a nearly 60-fold difference in the number of iterations.

Figure 8 explains why ALS is so interesting as a benchmark. ALS behavior strongly depends on graph size and degree distribution. We observe high variation in the average values of all four metrics.

Figures 9 and 10 indicate that neither SGD nor SVD exhibits significant changes in behavior across graph sizes, except for the outlier at nedges = 10^6; compute intensity is positively correlated to α; for SVD, MSG is also positively correlated to α. We observe that NMF exhibits results similar to SVD.

4.4 Jacobi, LBP and DD In both Jacobi and DD, all vertices are active for all iterations, whereas LBP exhibits a sharp drop in the number of active vertices over time. For LBP in Figure 11, graph size has no effect on the shape of active fraction.

In Figure 12, the behavior of Jacobi depends strongly on graph scale, except for EREAD; LBP and DD are less sensitive to graph size, with WORK the only metric that varies as graph size changes.

Figure 2. KC Metric Values.

Figure 3. TC Metric Values.

Figure 4. PR Metric Values.


Figure 5. KM Active Fraction for All Graphs

Figure 6. KM Metric Values.

Figure 7. ALS Active Fraction for All Graphs.

Figure 8. ALS Metric Values.

Figure 9. SGD Metric Values.

Figure 10. SVD Metric Values.

Figure 11. Active Fraction for LBP.

Figure 12. Metric Values for Jacobi, LBP, and DD.


4.5 Variation across Algorithms
As shown above, each algorithm has a characteristic shape of active fraction that varies significantly across algorithms; some algorithms exhibit a varying active fraction over time, while others, like AD and KM, always keep all vertices active. Moreover, the convergence rate differs greatly across domains, by up to three orders of magnitude (TC vs. DD).

In Figure 13, different algorithms exhibit quite different profiles across the 4 performance metrics. The values of all 4 metrics are much smaller for ALS, SSSP, KC, PR, and LBP than for the other algorithms. AD requires the most work for updating vertices, KM requires the most data transfer, and SGD requires the most message transfer.

Figure 13. Metric Values for All Algorithms.

In all, graph computation behavior exhibits wide diversity across algorithms and graphs, forming a very broad space. Any single experiment samples only a small part of that space, leading to ad-hoc performance studies that cannot systematically characterize the performance of graph-processing systems. Hence, we need a more efficient way to sample the behavior space of graph computation.

5. Exploring a Behavior Space
Given the wide variety of graph computation behavior, we need a systematic approach to characterize graph-processing system performance thoroughly.

5.1 Graph Computation and Ensemble Metrics
To systematically understand graph computations and graph-processing systems, we begin by defining a vector space for graph computation behavior. Vectors in this space are composed from four performance metrics that capture the fundamental behavior of graph computations: vertex value updates, compute per update, edge traversals (reads), and messages [1, 7, 19]. We define Behavior(GCi) as follows:

Behavior(GCi) = <UPDT, WORK, EREAD, MSG> (2)

Where UPDT, WORK, EREAD, and MSG are as defined in Section 3, and GCi is a graph computation represented as an <algorithm, graph size, degree distribution> tuple. While our behavior space is designed to capture fundamental graph computation behavior, doing so optimally is an open research challenge; we define only one vector performance space and evaluate its efficacy.¹ Behavior(GCi) has two more dimensions of variation: the temporal extent of the computation (iterations) and the spatial extent of the graph (vertices). As in Section 4, we use average metric values per iteration over these sample spaces to characterize typical values and variability.

¹ Ours is the first formally defined behavior space for graph computations of which we are aware.
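As a concrete illustration (ours, not the paper's code), a Behavior vector can be represented as a simple record of the four per-iteration averages; the algorithm name, graph parameters, and metric values below are invented for illustration only:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Behavior:
    """Per-iteration averages of the four metrics defined in Section 3."""
    updt: float   # vertex value updates
    work: float   # compute per update
    eread: float  # edge traversals (reads)
    msg: float    # messages transferred

    def as_tuple(self):
        return (self.updt, self.work, self.eread, self.msg)

# A graph computation GC_i is an <algorithm, graph size, alpha> tuple,
# mapped here to its measured Behavior vector (values hypothetical).
behavior = {("PR", 10**7, 2.5): Behavior(0.9, 1.2, 3.4, 2.1)}
```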

Possible uses of our graph computation behavior characterization include basic algorithm analysis, algorithm comparison, performance analysis, graph computation optimization, performance prediction, system benchmarking, and even system design. Here we consider how to use the behavior performance space to design efficient, high-quality benchmarking of graph-processing systems.

We define an ensemble as a set of graph computations, {GC1, GC2, …}. A benchmark suite can then be modeled as an ensemble; indeed, any set of performance experiments on a graph-processing system can be characterized as an ensemble. Generally, we would like an ensemble of graph computations that maximizes the characterization of a graph-processing system's performance over its entire behavior space (or the graph computation behavior space) with minimum effort. More formally, we define:

Ensemblek = {GC1,GC2,...,GCN} (3)

Given this definition, we would like to describe the quality with which an ensemble characterizes graph computation behavior. To this end, we define two ensemble metrics, spread and coverage, that quantify how thoroughly an ensemble explores graph computing system behavior by including a wide range of graph computation behaviors.

Spread of an ensemble is defined as the mean pairwise distance between the Behavior vectors in an ensemble (see below).

Spread(Ensemblek) = (1 / (N(N−1))) Σ_{i=1..N} Σ_{j=1..N, j≠i} d(Behavior(GCi), Behavior(GCj))    (4)

Where d() is the Euclidean distance between two graph computation Behavior vectors. Intuitively, spread represents a form of “dispersion” of an ensemble. If the ensemble’s elements are tightly clustered, then it will be low. If ensemble members are uniformly dispersed, it will be high. From a thorough evaluation point of view, spread should be maximized.
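As a sketch (our illustration, not the authors' code), spread can be computed directly from this definition as the mean pairwise Euclidean distance over all N(N−1) ordered pairs of behavior vectors:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two behavior vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spread(ensemble):
    """Mean pairwise distance over all N(N-1) ordered pairs of
    distinct Behavior vectors in the ensemble."""
    n = len(ensemble)
    total = sum(euclidean(ensemble[i], ensemble[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

A tightly clustered ensemble yields a low value and a dispersed one a high value, matching the intuition above.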

The coverage of an ensemble is defined via the average, over all points in the space, of the minimum distance to the nearest point in the ensemble. Formally:

Coverage(Ensemblek) = (1 / NS) Σ_{i=1..NS} min_{k=1..N} { d(Samplei, Behavior(GCk)) }    (5)

Where sample points are taken randomly and uniformly throughout the space to compute the coverage; Samplei is a sample point, and NS is the number of sample points (we use 1 million). Intuitively, good coverage means that no matter where a behavior falls in the space, it will be close to a point in the ensemble. High coverage indicates that we have sampled the behavior space thoroughly, and is desirable for an ensemble that characterizes graph-processing system performance well.
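The coverage computation can be sketched as a Monte Carlo estimate (our illustration; the unit-cube bounds and the reduced default sample count are assumptions, whereas the paper uses 1 million samples over its behavior space):

```python
import math
import random

def coverage(ensemble, n_samples=1000, dims=4, lo=0.0, hi=1.0, seed=0):
    """Average, over uniform random sample points in the space, of the
    distance from each sample to its nearest Behavior vector."""
    rng = random.Random(seed)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    total = 0.0
    for _ in range(n_samples):
        sample = tuple(rng.uniform(lo, hi) for _ in range(dims))
        total += min(dist(sample, gc) for gc in ensemble)
    return total / n_samples
```

Adding a well-placed member can only shrink the average minimum distance, since the minimum is taken over a larger set.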


5.2 Efficient Exploration: Ensembles of Single Algorithms
First, we consider the question: how well can an ensemble using a single algorithm explore the behavior space? For each algorithm, we consider our collection of 20 runs, varying graph structure (size and connectivity). Jacobi, LBP, and DD are not considered because their graph structures do not vary. This gives a total of 215 runs over eleven algorithms from three application domains (GA, Clustering, and CF); 5 runs of AD with the largest graph size (10^9) failed.

So, to understand how effective a single algorithm can be for achieving high spread, we exhaustively test, for each algorithm, and pick the best ensemble of each given size. Figure 14 presents our results and shows that, as ensemble size increases, spread decreases steadily for all 11 algorithms. This indicates that additional runs of a single algorithm tend to be clustered; even selecting the best ones does not produce a good spread.
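The exhaustive search just described can be sketched as follows (our illustration, under the assumption that the run pool is small enough to enumerate all k-member subsets; a larger pool would call for a greedy heuristic instead):

```python
import math
from itertools import combinations

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _spread(ens):
    """Mean pairwise distance over ordered pairs, as in the spread metric."""
    n = len(ens)
    return sum(_dist(a, b) for a, b in combinations(ens, 2)) * 2 / (n * (n - 1))

def best_ensemble(runs, k):
    """Enumerate all k-member subsets of behavior vectors and
    return the one maximizing spread."""
    return max(combinations(runs, k), key=_spread)
```

For example, among three runs on a line, the two farthest-apart vectors form the best two-member ensemble.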

To understand the quality of the achieved spread, we also plot an empirical upper bound on spread for each given ensemble size, computed assuming ensemble members are uniformly and maximally distributed in the behavior space. The spread achieved by single-algorithm ensembles falls well below this empirical upper bound.

Next we consider coverage. For each algorithm, we select the ensemble members that maximize coverage. Our results (see Figure 15) show that, restricted to a single algorithm, coverage increases very slowly; that is, ensemble members drawn from a single algorithm do not spread out to cover the graph computation space. While the achieved coverage varies somewhat across algorithms, the overall trend is the same.

To understand the quality of the achieved coverage, we also plot an empirical upper bound on coverage for each given ensemble size. Again, these are computed assuming ensemble members are uniformly and maximally distributed in the behavior space. The coverage achieved by single-algorithm ensembles falls well below this empirical upper bound.

5.3 Efficient Exploration: Ensembles of a Single Graph
Second, we consider: how well can ensembles using a single graph structure explore the behavior space? To answer this question, we select fifteen graphs with size = 10^6, 10^7, 10^8 and α = 2.0, 2.25, 2.5, 2.75, 3.0. For each single-graph ensemble, we consider 11 runs, one per algorithm. Overall, we have 165 runs.

Considering spread, we construct the ensemble that maximizes spread for each graph structure. These results are shown in Figure 16; while none of the graph structures enables spread anywhere close to the upper bound, the achieved spread is significantly higher than with single algorithms. Graph structure appears to be a more important factor in behavior variation than algorithm. Considering coverage, we construct the ensembles that maximize coverage for each graph structure. Our results (see Figure 17) again show a flattening trend: no single graph structure is sufficient to fully explore the behavior space, and none approaches our upper bound. Compared to ensembles drawn from single algorithms, ensembles around a single graph structure spread into the behavior space better, but cover it a little less completely. Our consideration of both single-algorithm and single-graph ensembles suggests that, beyond a few runs, adding more runs of similar origin gives little benefit.

Figure 14. Spread: Single Algorithm Ensembles.

Figure 15. Coverage: Single Algorithm Ensembles.

Figure 16. Spread: Single Graph Ensembles.

Figure 17. Coverage: Single Graph Ensembles.


5.4 Efficient Exploration: Unrestricted Ensembles
Given our broad exploration of graph computation behavior, we consider efficient and effective ensemble design retrospectively. Examining our 215 runs, we ask: what are the best ensembles we could have constructed, that is, the ensembles that would explore the behavior space most efficiently? As before, the best ensembles avoid clustering (spread decreases slowly) and maximize exploration (coverage increases quickly). Intuitively, the best ensemble corresponds to the most efficient benchmark suite at each size.

Considering spread, with unrestricted choice across multiple algorithms and graphs it is possible to sample the space much more efficiently. Figure 18 presents the achievable spread with unrestricted ensembles and compares it to our prior results for single-algorithm and single-graph ensembles. For unrestricted ensembles, spread starts high (~1.2) and declines slowly to 0.75 at 20 members. In contrast, with single-algorithm ensembles, spread starts much lower and decreases to 0.25 at 20 runs. Compared to single-graph ensembles, the story is different: their spread begins equally high but falls off much more rapidly. Overall, there is a clear benefit in drawing richly from both algorithm and graph structure diversity, with as much as a three-fold greater spread.

Considering coverage, Figure 19 presents the best achievable coverage for ensembles composed without restriction on algorithms and graphs. As with spread, ensembles that draw on both algorithm and graph diversity have a clear advantage, delivering 30% better coverage than single-algorithm ensembles. The unrestricted ensembles achieve significantly higher coverage with as few as 5 runs, and the advantage grows through 20 runs, reaching 3.9 overall.

Figure 18. Spread: Unrestricted Ensembles.

Figure 19. Coverage: Unrestricted Ensembles.

5.5 Understanding Diversity
Our unrestricted ensembles give significantly better results, so a natural question is: why? What aspects of diversity in algorithms and graphs enable them to achieve better spread and coverage? Because algorithms (and the associated implementation effort and code porting) are a major element of benchmarking complexity, we consider the representation of algorithms in the best ensembles. Table 3 lists the members of the unrestricted ensembles achieving the best spread and coverage.

Table 3. Members of Ensembles Achieving Best Spread and Coverage (for ≥10, algorithms only).

Type Size Runs (Algorithm[, Graph Size, α])

Best spread

5 <ALS, 10^5, 3.0>, <SGD, 10^8, 2.0>, <TC, 10^9, 2.0>, <SSSP, 10^9, 3.0>, <ALS, 10^5, 2.75>

10 ALS, SGD, TC, SSSP, ALS, TC, SGD, ALS, KM, SVD

15 SSSP, ALS, KM, SGD, ALS, TC, SGD, ALS, TC, SSSP, ALS, SGD, TC, SVD, ALS

20 SSSP, ALS, TC, SGD, ALS, TC, SGD, ALS, KM, SSSP, ALS, SGD, KM, SVD, ALS, TC, SGD, ALS, SGD, TC

Best coverage

5 <TC, 10^6, 2.5>, <KM, 10^6, 2.25>, <AD, 10^7, 3.0>, <ALS, 10^8, 2.0>, <KC, 10^6, 2.5>

10 AD, SVD, KM, ALS, TC, KC, KM, ALS, KM, NMF

15 KM, NMF, ALS, AD, SVD, KC, KM, ALS, KM, KM, SVD, PR, ALS, TC, NMF

20 AD, SVD, KM, ALS, TC, KC, KM, ALS, KM, SGD, NMF, KM, ALS, NMF, PR, TC, NMF, SSSP, ALS, AD

Interestingly, some algorithms are consistently represented: Alternating Least Squares for best spread and K-Means for best coverage. The best ensembles are complicated, involving many algorithms and graphs. For example, the best five-member ensemble for spread includes 4 algorithms and 5 different graphs, and the best five-member ensemble for coverage includes 5 algorithms and 4 graphs. At the level of ten-member ensembles, the complexity already exceeds that of most comparative graph performance studies (6 algorithms for spread and 7 for coverage) [6, 9, 10, 11, 22].

To reliably assess the "diversity contribution" of an algorithm, we would like to minimize shadowing effects: when considering only the single best ensembles, an algorithm useful for spread or coverage might be shadowed by others that are slightly better, yet contribute greatly to diversity once those others are removed. We therefore ask: which algorithms contribute most often to the best ensembles for spread and coverage? To minimize shadowing, we expand our consideration from the best ensemble of size n to the 100 best ensembles of size n, for both spread and coverage. Within these 100 best ensembles, we use the frequency of appearance of each algorithm as an indication of its contribution to diversity.
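The frequency analysis just described can be sketched as follows (our illustration; `score` stands in for either the spread or the coverage metric, and the ranking assumes higher is better):

```python
from collections import Counter
from itertools import combinations

def algorithm_frequency(runs, k, score, top=100):
    """runs: list of (algorithm_name, behavior_vector) pairs.
    Rank all k-member subsets by `score` (higher is better), keep the
    `top` best, and count how often each algorithm appears in them."""
    ranked = sorted(combinations(runs, k),
                    key=lambda subset: score([vec for _, vec in subset]),
                    reverse=True)
    freq = Counter()
    for subset in ranked[:top]:
        freq.update(alg for alg, _ in subset)
    return freq
```

Algorithms that appear frequently across the top-ranked subsets are the ones that contribute robustly to diversity, not merely the ones that happen to appear in the single best ensemble.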

These results, shown in Figures 20 and 21, demonstrate that not all algorithms contribute significantly to good spread or coverage. Among our suite, for example, K-Means, Alternating Least Squares, and Triangle Counting contribute to efficient and thorough behavior space exploration.


Figure 20. Frequency of Appearance of Each Algorithm in Top100 Sets for Spread.

Figure 21. Frequency of Appearance of Each Algorithm in Top100 Sets for Coverage.

5.6 Efficient Exploration: Limited Ensemble Complexity
Because the best ensembles require complex combinations of algorithms and graphs, it is worthwhile to consider simpler combinations that reduce benchmarking complexity. The next question is: how does restricting ensemble complexity affect achievable spread and coverage? Here we consider three dimensions of constraint: limited algorithms, limited graphs, and limited runtime (see Figures 22 and 23).

First, we limit ensembles to three algorithms, selecting those that contribute most to both spread and coverage; in this case it is the same set for both metrics: KM, ALS, and TC. The algorithm-limited suites maintain a high spread and a slight advantage over single algorithms.

Second, we limit ensembles to three graphs. The best such ensembles use the graphs of size 10^7, 10^8, and 10^9 with α = 2.0. The results indicate that limiting the number of graphs decreases spread rapidly and produces poor coverage, even lower than single algorithms such as KC and CC.

Third, we consider shortening graph computation runs for even more efficient benchmarking. Some of our algorithms, including AD, KM, NMF, SGD, and SVD, have constant, repetitive behavior (i.e., a constant active fraction of 1.0). Their runs can be shortened, so ensembles using these five algorithms can probe the behavior space much more efficiently. Because of the variety among these repetitive algorithms, the runtime-constrained suites still achieve high spread and coverage.

By choosing these runs to sample the behavior space efficiently, we can benchmark systems with minimum effort.

6. Discussion and Related Work
We discuss related benchmarking work and comparative graph-processing system experiments. For each, we summarize the related work and explain how it relates to our efforts.

General benchmark suites such as SPEC [23], NPB [3], TPC [26], and HPCC [13] employ a wide collection of programs, often with single inputs of varied sizes. Criteria for selection typically include [23]: (1) programs representing real problems, (2) programs with targeted properties, such as compute-bound for CPU benchmarks and IO-bound for IO benchmarks, (3) easy portability, and (4) free availability. Though these benchmarks are carefully selected and stress bottlenecks in targeted systems, none is designed to systematically characterize performance, and even with their diversity, efficient sampling of the full behavior space is beyond reach.

Closer to our work are several graph-oriented benchmarking efforts, including Graph 500 [8], LDBC [18], and LinkBench [2]. However, they suffer from the same limitations: Graph 500 uses only a single program, typically on a single graph; LDBC is only emerging; and LinkBench provides general infrastructure but not a specific benchmark.

Comparative graph-processing system studies are numerous; we summarize a few of the most complete efforts here. M. Han et al. [10] compare Giraph, GPS, Mizan, and GraphLab while applying graph- and algorithm-agnostic optimizations. They employ four simple benchmarks (random walk, sequential traversal, parallel traversal, and graph mutation) and five real-world graphs of different sizes. S. Salihoglu et al. [22] made a similar study of Pregel-like systems. However, neither Han nor Salihoglu gives a clear rationale for benchmark and algorithm selection, and neither claims thorough exploration of the behavior space. Taking Han's study as an example, all the chosen benchmarks appear to require much communication and little computation, which could overestimate the performance of platforms that perform poorly on per-vertex computation.

Figure 22. Spread: Limited Algorithms, Graphs, Runtime

Figure 23. Coverage: Limited Algorithms, Graphs, Runtime


B. Elser et al. [6] compare Map-Reduce, Stratosphere, Hama, Giraph, and GraphLab using a single algorithm, K-core decomposition, and 7 different graph inputs. They compare cluster executions with a single shared-memory multicore machine and use runtime as the only metric. W. Han et al. [11] compare TurboGraph and GraphChi using simple benchmarks (out-neighbor queries, PageRank, and Connected Components) and three graphs of different scales. Our results show that neither a single complex algorithm such as K-Core nor several simple benchmarks such as PR and CC explore the behavior space thoroughly. In W. Han's experiments, the use of a small number of graphs also narrows the study.

Y. Guo et al. [9] propose a comprehensive experimental method for benchmarking graph-processing platforms, employing three dimensions of diversity: dataset, algorithm, and platform. Their benchmarking suite has five classes of algorithms and seven diverse graphs, which they use to analyze and compare six platforms. To explore diversity, they select real-world graphs with varied numbers of vertices and edges and different structures (such as various average degrees), and graph algorithms including general statistics, graph traversal, connected components, community detection, and graph evolution. Their metrics include basic performance (edges per second, vertices per second, etc.), resource utilization, scalability, and overhead. Guo's work is perhaps closest to ours: they recognize the need to explore fully and claim to do so, but they present no clear framework or formal metric for assessing the thoroughness of exploration. In contrast, we have formulated a behavior space and clear metrics for assessing thoroughness.

Our study suggests how to construct ensembles of graph computations that thoroughly explore the behavior space of graph computations. Beyond thoroughness, we describe how to further optimize an ensemble for efficiency and simplicity by limiting graphs, algorithms, etc. These ensembles can of course be used for benchmarking; by exercising graph computing system behavior broadly, such a benchmark suite promises a comprehensive view of the performance of graph-processing systems, as well as objective comparisons between systems.

To the best of our knowledge, our study is the first to quantitatively characterize the behavior of graph computations and to propose an evaluation methodology for systematic benchmarking.

7. SUMMARY AND FUTURE WORK
We have studied the behavior of a variety of graph algorithms (graph analytics, clustering, collaborative filtering, etc.) on a diverse collection of graphs (varying size and degree distribution). Our results characterize the variation of behavior across graph structures, application domains, and graph algorithms. Given the broad range of graph computation behavior, we consider how well an ensemble of graph computations (algorithms and graphs) explores this space, in terms of spread and coverage. Since ad-hoc sets are unlikely to fairly characterize a graph-processing system, our study gives guidance for systematically choosing a small set of algorithms and graph structures that exercise broad computational behavior. We find that ensembles over single algorithms and single graphs sample the behavior space inefficiently, while carefully selected runs exploiting both algorithm diversity and graph structure diversity give significantly better results. We propose more efficient ensembles by considering constraints such as limited algorithms, graphs, and even runtime.

With understanding of how ensembles of experiments sample the behavior space, more ambitious goals are within reach. Can we design optimal ensembles? Can we model precisely a graph computation’s behavior, and predict its performance?

8. Acknowledgements
This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Award DE-SC0008603, and in part by gifts from HP and Agilent. This work was completed on resources provided by the University of Chicago Research Computing Center.

9. REFERENCES
[1] Apache, Apache Incubator Giraph, http://incubator.apache.org/giraph/.
[2] T. Armstrong, V. Ponnekanti, and D. Borthakur, LinkBench: a Database Benchmark Based on the Facebook Social Graph, in Proceedings of SIGMOD'13, June 22-27, 2013.
[3] D. Bailey, E. Barszcz, J. Barton, et al., The NAS Parallel Benchmarks, Technical Report, NASA Ames Research Center, Jan 1991.
[4] S. Che, GasCL: A Vertex-Centric Graph Model for GPUs, in HPEC 2014, 2014.
[5] S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes, K-Core Organization of Complex Networks, Feb 28, 2006.
[6] B. Elser and A. Montresor, An Evaluation Study of BigData Frameworks for Graph Processing, in IEEE Big Data 2013, pp. 60-67, 2013.
[7] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, in Proceedings of OSDI '12, 2012.
[8] Graph500, Graph500 Benchmark, http://www.graph500.org/.
[9] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke, How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis, in IPDPS 2014, 2014.
[10] M. Han, K. Daudjee, K. Ammar, M. T. Özsu, X. Wang, and T. Jin, An Experimental Comparison of Pregel-like Graph Processing Systems, Proceedings of the VLDB Endowment, Vol. 7, No. 12, 2014.
[11] W.-S. Han, S. Lee, K. Park, J. Lee, M.-S. Kim, J. Kim, and H. Yu, TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC, in KDD'13, 2013.
[12] Hebrew University, The Probabilistic Inference Challenge (PIC2011): MRF data sets, http://www.cs.huji.ac.il/project/PASCAL/showNet.php.
[13] HPCC, HPC Challenge Benchmark, http://en.wikipedia.org/wiki/HPC_Challenge_Benchmark.
[14] Y. Hu, Y. Koren, and C. Volinsky, Collaborative Filtering for Implicit Feedback Datasets, in ICDM '08, pp. 263-272, 2008.
[15] M. Karch, What is PageRank and How do I Use It?, http://google.about.com/od/searchengineoptimization/a/pagerankexplain.htm.
[16] Z. Khayyat, K. Awara, A. Alonazi, et al., Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing, in Proceedings of EuroSys'13, April 15-17, 2013.
[17] A. Kyrola, G. Blelloch, and C. Guestrin, GraphChi: Large-scale Graph Computation on Just a PC, in OSDI, 2012.
[18] LDBC Organization, LDBC, http://www.ldbc.eu/.
[19] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, Pregel: A System for Large-scale Graph Processing, in SIGMOD/PODS '10, pp. 135-146, 2010.
[20] A. Roy, I. Mihailovic, and W. Zwaenepoel, X-Stream: Edge-centric Graph Processing using Streaming Partitions, in SOSP'13, Nov 3-6, 2013.
[21] S. Salihoglu and J. Widom, GPS: A Graph Processing System, in Proceedings of SSDBM '13, 2013.
[22] S. Salihoglu and J. Widom, Optimizing Graph Algorithms on Pregel-like Systems, Technical Report, Stanford, 2013.
[23] SPEC, SPEC CPU2006 Benchmark Suite, https://www.spec.org/cpu2006/Docs/readme1st.html.
[24] Stanford University, The Stanford Network Analysis Project (SNAP), http://snap.stanford.edu/index.html.
[25] Y. Tian, A. Balmin, S. A. Corsten, S. Tatikonda, and J. McPherson, From "Think Like a Vertex" to "Think Like a Graph", Proceedings of the VLDB Endowment, Vol. 7, No. 3, 2014.
[26] TPC, TPC Benchmarks, http://www.tpc.org/information/benchmarks.asp.
[27] Web Data Commons, Web Data Commons Hyperlink Graphs, http://webdatacommons.org/hyperlinkgraph/.
[28] Wikipedia, Collaborative filtering, http://en.wikipedia.org/wiki/Collaborative_filtering.
[29] Wikipedia, Connected Components, http://en.wikipedia.org/wiki/Connected_component_(graph_theory).
[30] Wikipedia, Degree distribution, http://en.wikipedia.org/wiki/Degree_distribution.
[31] Wikipedia, PageRank, http://en.wikipedia.org/wiki/PageRank.
[32] Wikipedia, K-means clustering, http://en.wikipedia.org/wiki/K-means_clustering.
[33] Wikipedia, Scale-free network, http://en.wikipedia.org/wiki/Scale-free_network.
[34] Wikipedia, Stochastic Gradient Descent (SGD), http://en.wikipedia.org/wiki/Stochastic_gradient_descent.