shengqi yang, xifeng yan, bo zong and arijit khan computer...

22
Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer Science, UC Santa Barbara {sqyang, xyan, bzong, arijitkhan}@cs.ucsb.edu 5/25/2012 1 ICB UCSB

Upload: others

Post on 19-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan

Computer Science, UC Santa Barbara {sqyang, xyan, bzong, arijitkhan}@cs.ucsb.edu

5/25/2012 1

ICB UCSB

Page 2: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

5/25/2012j 2

Web graph Social network

Biological graph Linked data: RDF graphs

Google: > 1 trillion indexed pages

Facebook: >800 million active users

De Bruijn: billions of vertices

20 billion RDF tuples

UCSB

Page 3: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Memory-resident solution Running on single server.

Difficult/Impossible to accommodate the content of an extremely large graph.

Low concurrency.

Simple distributed solution (e.g., MapReduce) Running on commodity cluster.

High concurrency and enough memory space.

Chained MapReduce

• Communication and serialization overhead.

• Programming complexity

5/25/2012 3 UCSB

Page 4: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Distribution model: graph partitioning. Computation model: run on each partition/vertex

simultaneously. Communication model: message passing

5/25/2012 4 UCSB

Page 5: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Pros Designed for iterative jobs

▪ Graph algorithms: shortest distance, PageRank, etc.

▪ DM/ML tasks: K-Means, belief propagation, Parallel Gibbs sampling, etc.

Scalable, robust, simple.

Cons Graph partitioning: one partition can not fit all !

Query on graph • Unbalanced workload

• Inter-machine communication

5/25/2012 5 UCSB

Page 6: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Unbalanced workload

5/25/2012 6

Inter-machine communication

Graph query pattern

UCSB

Page 7: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Graph query processing

Solving the problems facing graph partitioning (Pregel)

• Workload balancing (replication)

• Communication reduction

Graph partition management strategy

• Evolving query workload.

Sedge: a Self Evolving Distributed Graph Processing

Environment

5/25/2012 7 UCSB

Page 8: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Partitioning techniques

Complementary partitioning

On demand partitioning

Two-level partition management

System architecture

Experiments

Conclusions

5/25/2012 8 UCSB

Page 9: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Complementary partitioning : repartition the graph with region constraint.

These two sets of partitions will run independently.

5/25/2012 9 UCSB

Page 10: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

… …

Iteratively repartition the graph Pros

Effective communication reduction Workload balancing

Cons Space limitation Can not adapt to dynamic workload

5/25/2012 10 UCSB

Page 11: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

P1 P2

C1 C2 C3

v1 v2

v3

v4

v5

5/25/2012 11

Query profiling Blocking: use blocks to track cross-partition queries. Advantages: •Query generalization. •Profiling with fewer features.

Blocks Goal: coarsen a graph Method: disjoint 1-hop region

UCSB

Page 12: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

P1 P2

C1 C2 C3

v1

v2

v3

v4

v5

P3

Envelope: a set of blocks that covers a cross partition query.

Envelope Collection: put the maximized number of envelopes into a new partition wrt. space constraint.

5/25/2012 12 UCSB

Page 13: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Intention: combine similar envelopes sharing many common blocks.

Algorithm:

1) Similarity search (nearest neighbor search).

Locality Sensitive Hashing (LSH): Min-Hash, in O(n)

2) Envelope combining

1. Cluster the envelopes in the same bucket produced by Min-Hash.

2. Combine the clusters with highest benefit.

5/25/2012 13 UCSB

Page 14: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Two-level partition architecture Primary partitions. e.g., A, B, C and D. They are inter-

connected in two-way

Secondary partitions. e.g., B’ and E. They are connected with primary partitions in one-way

5/25/2012 14 UCSB

Page 15: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

5/25/2012 UCSB 15

Page 16: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

SP2Bench (Schmidt et. al., ICDE ‘09) Employ the DBLP library as its simulation

basis.

100M tuples (11.24GB): 6 partitions.

Query templates:

• 5 of 12: Q2, Q4, Q6, Q7, Q8.

• E.g. Q4: Given a journal (J), select all distinct pairs of article author names for authors (?A) that have published papers (?P) in the journal.

Cluster environment 31 nodes (connected by a gigabit Ethernet).

• 4 GB RAM and quad-core 2:60GHz Xeon Processors

• 1 master, 30 workers.

5/25/2012 16

J

?P ?P

?A ?A

Q4

UCSB

Page 17: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

5/25/2012 UCSB 17

1 part. set 5 comp. part. set

Query workload: 104 queries with random start

Log scale !

Page 18: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

5/25/2012 UCSB 18

5 Static replication

5 Comp. part.

4 Comp. part. + On demand part.

Page 19: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Datasets UK web graph: 30M vertices, 956M edges.

Twitter: 15M users, 57M edges.

Bio graph: 50M vertices, 51M edges.

Synthetic graph: 0.5B vertices, 2 Billion edges (by R-MAT,

Chakrabarti et. al., SDM’04).

Query workload Mixture of general graph queries

• neighbor search (2~3 hops).

• random walk (5~10 steps).

• random walk with restart (5~10 steps, 10% restart).

5/25/2012 19 UCSB

Page 20: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

5/25/2012 UCSB 20

Bio

Web

Twitter

Syn.

Page 21: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

Conclusions Partitioning techniques

• Complementary partitioning

• On demand partitioning

Two-level partition management Project home:

http://grafia.cs.ucsb.edu/sedge Documents and sample datasets

Source code

5/25/2012 21 UCSB

Page 22: Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer ...grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf · Graph algorithms: shortest distance, PageRank, etc. DM/ML

5/25/2012 22 UCSB