Graph-Based Parallel Computing (wcohen/10-605/pregel-etc.pdf)
TRANSCRIPT
Graph-Based Parallel Computing
William Cohen
1
Announcements
• Next Tuesday 12/8:
 – Presentations for 10-805 projects
 – 15 minutes per project
 – Final written reports due Tues 12/15
2
Graph-Based Parallel Computing
William Cohen
3
Outline
• Motivation / where it fits
• Sample systems (c. 2010)
 – Pregel: and some sample programs
  • Bulk synchronous processing
 – Signal/Collect and GraphLab
  • Asynchronous processing
• GraphLab descendants
 – PowerGraph: partitioning
 – GraphChi: graphs w/o parallelism
 – GraphX: graphs over Spark
4
Problems we’ve seen so far
• Operations on sets of sparse feature vectors:
 – Classification
 – Topic modeling
 – Similarity joins
• Graph operations:
 – PageRank, personalized PageRank
 – Semi-supervised learning on graphs
5
Architectures we’ve seen so far
• Stream-and-sort: limited-memory, serial, simple workflows
• + parallelism: Map-reduce (Hadoop)
• + abstract operators like join, group: PIG, Hive, GuineaPig, …
• + caching in memory and efficient iteration: Spark, Flink, …
• + parameter servers (Petuum, …)
• + …..? one candidate: architectures for graph processing
6
Architectures we’ve seen so far
• Large immutable data structures on (distributed) disk, processing by sweeping through them and creating new data structures:
 – stream-and-sort, Hadoop, PIG, Hive, …
• Large immutable data structures in distributed memory:
 – Spark: distributed tables
• Large mutable data structures in distributed memory:
 – parameter server: structure is a hashtable
 – today: large mutable graphs
7
Outline
• Motivation / where it fits
• Sample systems (c. 2010)
 – Pregel: and some sample programs
  • Bulk synchronous processing
 – Signal/Collect and GraphLab
  • Asynchronous processing
• GraphLab descendants
 – PowerGraph: partitioning
 – GraphChi: graphs w/o parallelism
 – GraphX: graphs over Spark
  • which kind of brings things back to Map-Reduce systems
8
GRAPH ABSTRACTIONS: PREGEL (SIGMOD 2010*)
*Used internally at least 1-2 years before
9
Many ML algorithms tend to have
• Sparse data dependencies • Local computations • Iterative updates
• Typical example: Gibbs sampling
10
Example: Gibbs Sampling [Guestrin UAI 2010]
11
[Figure: grid-structured Markov random field over variables X1..X9]
1) Sparse Data Dependencies
2) Local Computations
3) Iterative Updates
For LDA: Z_{d,m} for X_{d,m} depends on other Z's in doc d, and on topic assignments to copies of word X_{d,m}
Pregel (Google, Sigmod 2010)
• Primary data structure is a graph
• Computations are a sequence of supersteps, in each of which:
 – a user-defined function is invoked (in parallel) at each vertex v, and can get/set its value
 – the UDF can also issue requests to get/set edges
 – the UDF can read messages sent to v in the last superstep and schedule messages to send in the next superstep
 – halt when every vertex votes to halt
• Output is a directed graph
• Also: aggregators (like ALLREDUCE)
• Bulk synchronous processing (BSP) model: all vertex operations happen simultaneously
[Figure annotations: vertex value changes; communication between supersteps]
12
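To make the BSP contract concrete, here is a minimal single-machine sketch in Scala of a Pregel-style vertex program for single-source shortest paths (all names are illustrative; Google's actual API is a C++ Vertex class). Messages sent in one superstep are delivered in the next, and a vertex that receives no improving message effectively votes to halt by sending nothing.

object PregelSketch {
  // Unit edge weights for brevity; dist = best-known distance from the source.
  case class Vertex(id: Int, out: List[Int], var dist: Double = Double.PositiveInfinity)

  def sssp(vs: Map[Int, Vertex], source: Int): Unit = {
    var inbox: Map[Int, Seq[Double]] = Map(source -> Seq(0.0))
    while (inbox.nonEmpty) {                      // halt: no messages left => all voted to halt
      val outbox = scala.collection.mutable.Map.empty[Int, Seq[Double]]
      for ((id, msgs) <- inbox) {                 // compute() runs on message-receiving vertices
        val v = vs(id)
        val best = msgs.min
        if (best < v.dist) {                      // value improved: update and send
          v.dist = best
          for (n <- v.out)                        // delivered in the NEXT superstep
            outbox(n) = outbox.getOrElse(n, Seq.empty) :+ (best + 1.0)
        }
      }
      inbox = outbox.toMap
    }
  }

  def main(args: Array[String]): Unit = {
    val g = Map(1 -> Vertex(1, List(2, 3)), 2 -> Vertex(2, List(3)),
                3 -> Vertex(3, List(4)), 4 -> Vertex(4, Nil))
    sssp(g, source = 1)
    g.values.toList.sortBy(_.id).foreach(v => println(s"dist(${v.id}) = ${v.dist}"))
  }
}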
Pregel (Google, Sigmod 2010)
• One master: partitions the graph among workers
• Workers keep graph “shard” in memory • Messages to other partitions are buffered
• Communication across partitions is expensive, within partitions is cheap – quality of partition makes a difference!
13
simplest rule: stop when everyone votes to halt
everyone computes in parallel
14
Streaming PageRank: with some long rows
• Repeat until converged:
 – Let v^{t+1} = c·u + (1−c)·W·v^t
• Store A as a list of edges: each line is “i d(i) j”
• Store v’ and v in memory: v’ starts out as c·u
• For each line “i d j”:
 – v’[j] += (1−c)·v[i]/d
We need to get the degree of i and store it locally
15
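A direct transcription of this loop, as a sketch in Scala: the edge-list format and update rule are exactly the slide's, everything else (names, the tiny demo graph) is illustrative.

object StreamingPageRank {
  // One sweep of v' = c*u + (1-c) W v over an edge list in the slide's
  // format: each line is "i d(i) j", with d(i) the out-degree of i.
  def sweep(lines: Iterator[String], v: Array[Double], c: Double): Array[Double] = {
    val n = v.length
    val vNew = Array.fill(n)(c / n)                       // v' starts out as c*u
    for (line <- lines) {
      val Array(i, d, j) = line.split("\\s+")
      vNew(j.toInt) += (1 - c) * v(i.toInt) / d.toInt     // v'[j] += (1-c) v[i] / d(i)
    }
    vNew
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq("0 2 1", "0 2 2", "1 1 2", "2 1 0")   // graph is scanned once per iteration
    var v = Array.fill(3)(1.0 / 3)
    for (_ <- 1 to 20) v = sweep(edges.iterator, v, c = 0.15)
    println(v.mkString(" "))
  }
}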
recap from 3/17
note we need to scan through the graph each time
16
edge weight
Another task: single source shortest path
17
a little bit of a cheat
18
Many Graph-Parallel Algorithms
• Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
• Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
• Semi-supervised ML: Graph SSL, CoEM
• Community Detection: Triangle Counting, K-core Decomposition, K-Truss
• Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
• Classification: Neural Networks
19
Low-Rank Matrix Factorization:
20
[Figure: Netflix ratings matrix (Users x Movies) ≈ User Factors (U) x Movie Factors (M); observed ratings r13, r14, r24, r25 link factors f(1)..f(5); in general a rating r_ij connects user factor f(i) and movie factor f(j)]
Iterate:
f[i] = argmin_{w ∈ R^d} Σ_{j ∈ Nbrs(i)} ( r_ij − w^T f[j] )^2 + λ ||w||_2^2
Recommending Products
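The slide's per-vertex objective is a regularized least-squares fit of f[i] against its neighbors' fixed factors. As a concrete illustration, here is a minimal single-vertex update in Scala; real ALS solves the d x d normal equations in closed form, but this sketch uses plain gradient steps to stay dependency-free (function and parameter names are illustrative).

object AlsVertexUpdate {
  // Minimizes sum_j (r_ij - w^T f(j))^2 + lambda*||w||^2 for one vertex,
  // holding the neighbour factors fixed.
  def update(nbrs: Seq[(Array[Double], Double)],  // (f(j), r_ij) pairs
             d: Int, lambda: Double, lr: Double = 0.01, steps: Int = 200): Array[Double] = {
    val w = Array.fill(d)(0.1)
    for (_ <- 1 to steps) {
      val grad = Array.fill(d)(0.0)
      for ((fj, rij) <- nbrs) {
        val err = (0 until d).map(k => w(k) * fj(k)).sum - rij  // w^T f(j) - r_ij
        for (k <- 0 until d) grad(k) += 2 * err * fj(k)
      }
      for (k <- 0 until d) w(k) -= lr * (grad(k) + 2 * lambda * w(k))
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // a user who rated two movies with (toy) orthogonal factors
    val rated = Seq((Array(1.0, 0.0), 4.0), (Array(0.0, 1.0), 2.0))
    println(update(rated, d = 2, lambda = 0.1).mkString(" "))
  }
}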
Outline
• Motivation / where it fits
• Sample systems (c. 2010)
 – Pregel: and some sample programs
  • Bulk synchronous processing
 – Signal/Collect and GraphLab
  • Asynchronous processing
• GraphLab descendants
 – PowerGraph: partitioning
 – GraphChi: graphs w/o parallelism
 – GraphX: graphs over Spark
21
GRAPH ABSTRACTIONS: SIGNAL/COLLECT (SEMANTIC WEB
CONFERENCE, 2010) Stutz, Strebel, Bernstein, Univ Zurich
22
23
Another task: single source shortest path
24
Signal/collect model vs Pregel
• Integrated with RDF/SPARQL
• Vertices can be non-uniform types
• Vertex:
 – id, mutable state, outgoing edges, most recent received signals (map: neighbor id → signal), uncollected signals
 – user-defined collect function
• Edge: id, source, dest
 – user-defined signal function
• Allows asynchronous computations… via v.scoreSignal, v.scoreCollect
For “data-flow” operations
On multicore architecture: shared memory for workers
25
Signal/collect model
signals are made available in a list and a map
next state for a vertex is output of the collect() operation
relax “num_iterations” soon
26
Signal/collect examples Single-source shortest path
27
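The example slide itself is an image; as a minimal sketch of the model just described (Scala, illustrative names), here is single-source shortest path in Signal/Collect style: an edge's signal() forwards (source state + edge weight), and a vertex's collect() takes the min of its state and its incoming signals. This version is synchronous; the scoreSignal/scoreCollect guards are what let the same program run asynchronously.

object SignalCollectSssp {
  case class Edge(src: Int, dst: Int, w: Double) {
    // signal(): propose a distance to the destination vertex
    def signal(state: Map[Int, Double]): (Int, Double) = (dst, state(src) + w)
  }

  def run(n: Int, edges: Seq[Edge], source: Int, iters: Int): Map[Int, Double] = {
    var state = (0 until n).map(i =>
      i -> (if (i == source) 0.0 else Double.PositiveInfinity)).toMap
    for (_ <- 1 to iters) {
      val signals = edges.filterNot(e => state(e.src).isInfinity).map(_.signal(state))
      state = state.map { case (v, s) =>   // collect(): min of old state and incoming signals
        v -> (s +: signals.collect { case (d, dist) if d == v => dist }).min
      }
    }
    state
  }

  def main(args: Array[String]): Unit = {
    val es = Seq(Edge(0, 1, 1), Edge(0, 2, 4), Edge(1, 2, 1), Edge(2, 3, 1))
    println(run(4, es, source = 0, iters = 4))
  }
}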
Signal/collect examples
PageRank
Life
28
PageRank + Preprocessing and Graph Building
29
Signal/collect examples Co-EM/wvRN/Harmonic fields
30
31
32
Signal/collect examples
Matching path queries:
dept(X) -[member]→ postdoc(Y) -[received]→ grant(Z)
[Figure: a small graph with depts MLD and LTI, people wcohen and partha, and grants NSF378 and InMind7]
33
Signal/collect examples: data flow
Matching path queries:
dept(X) -[member]→ postdoc(Y) -[received]→ grant(Z)
[Figure: same graph: depts MLD and LTI, people wcohen and partha, grants NSF378 and InMind7]
dept(X=MLD) -[member]→ postdoc(Y) -[received]→ grant(Z)
dept(X=LTI) -[member]→ postdoc(Y) -[received]→ grant(Z)
note: there can be multiple input signals
34
Signal/collect examples
Matching path queries:
dept(X) -[member]→ postdoc(Y) -[received]→ grant(Z)
[Figure: same graph as above]
dept(X=MLD) -[member]→ postdoc(Y=partha) -[received]→ grant(Z)
35
Signal/collect model vs Pregel
• Integrated with RDF/SPARQL
• Vertices can be non-uniform types
• Vertex:
 – id, mutable state, outgoing edges, most recent received signals (map: neighbor id → signal), uncollected signals
 – user-defined collect function
• Edge: id, source, dest
 – user-defined signal function
• Allows asynchronous computations… via v.scoreSignal, v.scoreCollect
For “data-flow” operations
36
Asynchronous Parallel Computation
• Bulk-Synchronous: all vertices update in parallel
 – need to keep a copy of “old” and “new” vertex values
• Asynchronous:
 – Reason 1: if two vertices are not connected, they can be updated in any order
  • more flexibility, less storage
 – Reason 2: not all updates are equally important
  • parts of the graph converge quickly, parts slowly
37
using: • v.scoreSignal • v.scoreCollect
38
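To make the scoreSignal/scoreCollect idea concrete, here is a minimal sketch (Scala, illustrative names) of asynchronous, residual-driven PageRank: instead of synchronous sweeps, work is queued per vertex, and a vertex is only re-processed while its pending "uncollected" mass exceeds a threshold. A real implementation would use a priority scheduler; a thresholded FIFO queue keeps the sketch short. It assumes every vertex has at least one out-edge.

object AsyncPageRank {
  def run(out: Map[Int, Seq[Int]], beta: Double, eps: Double): Array[Double] = {
    val n = out.size
    val rank = Array.fill(n)(0.0)
    val res  = Array.fill(n)(beta / n)                    // pending "signal" mass
    val work = scala.collection.mutable.Queue(0 until n: _*)
    while (work.nonEmpty) {
      val v = work.dequeue()
      if (res(v) > eps) {                                 // score check: skip converged vertices
        val push = (1 - beta) * res(v) / out(v).size
        rank(v) += res(v); res(v) = 0.0                   // "collect" the residual
        for (u <- out(v)) {                               // "signal" the out-neighbours
          if (res(u) <= eps && res(u) + push > eps) work.enqueue(u)
          res(u) += push
        }
      }
    }
    rank
  }

  def main(args: Array[String]): Unit = {
    val g = Map(0 -> Seq(1), 1 -> Seq(2), 2 -> Seq(0))    // 3-cycle: ranks converge to 1/3 each
    println(run(g, beta = 0.15, eps = 1e-10).mkString(" "))
  }
}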
39
SSSP PageRank
40
Outline
• Motivation / where it fits
• Sample systems (c. 2010)
 – Pregel: and some sample programs
  • Bulk synchronous processing
 – Signal/Collect and GraphLab
  • Asynchronous processing
• GraphLab descendants
 – PowerGraph: partitioning
 – GraphChi: graphs w/o parallelism
 – GraphX: graphs over Spark
41
GRAPH ABSTRACTIONS: GRAPHLAB (UAI, 2010)
Guestrin, Gonzalez, Bikel, etc.
Many slides below pilfered from Carlos or Joey….
42
GraphLab
• Data in graph, UDF vertex function
• Differences:
 – some control over scheduling
  • vertex function can insert new tasks in a queue
 – messages must follow graph edges: can access adjacent vertices only
 – “shared data table” for global data
 – library algorithms for matrix factorization, coEM, SVM, Gibbs, …
 – GraphLab → now Dato
43
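A minimal sketch of the GraphLab execution model (Scala, illustrative names): the update function reads and writes its vertex scope directly, with no messages, and pushes neighbors back onto the scheduler's task queue when their inputs change. The example update is a harmonic-field/label-propagation style average, in the spirit of the library algorithms listed above.

object GraphLabSketch {
  // Update function with direct access to the vertex's scope (its neighbours),
  // plus a scheduler queue it can push new tasks onto.
  def run(nbrs: Map[Int, Seq[Int]], init: Map[Int, Double],
          fixed: Set[Int], tol: Double): Map[Int, Double] = {
    val value = scala.collection.mutable.Map(init.toSeq: _*)
    val tasks = scala.collection.mutable.Queue(nbrs.keys.toSeq: _*)
    while (tasks.nonEmpty) {
      val v = tasks.dequeue()
      if (!fixed.contains(v)) {
        val old = value(v)
        value(v) = nbrs(v).map(value).sum / nbrs(v).size  // read neighbour values directly
        if (math.abs(value(v) - old) > tol)               // schedule affected neighbours
          nbrs(v).foreach(u => tasks.enqueue(u))
      }
    }
    value.toMap
  }

  def main(args: Array[String]): Unit = {
    // a 4-node path; endpoints are labeled (fixed), interior values get smoothed
    val g = Map(0 -> Seq(1), 1 -> Seq(0, 2), 2 -> Seq(1, 3), 3 -> Seq(2))
    println(run(g, Map(0 -> 1.0, 1 -> 0.0, 2 -> 0.0, 3 -> 0.0), Set(0, 3), 1e-6))
  }
}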
Graphical Model Learning
44
[Figure: speedup vs. number of CPUs (1-16) for graphical model learning, comparing an approx. priority schedule and the Splash schedule against optimal]
15.5x speedup on 16 CPUs
On multicore architecture: shared memory for workers
Gibbs Sampling
• Protein-protein interaction networks [Elidan et al. 2006]
 – pair-wise MRF
 – 14K vertices
 – 100K edges
• 10x speedup
• Scheduling reduces locking overhead
45
[Figure: speedup vs. number of CPUs (1-16) for Gibbs sampling, comparing a round-robin schedule and a colored schedule against optimal]
CoEM (Rosie Jones, 2005): Named Entity Recognition task

Graph  Vertices  Edges
Small  0.2M      20M
Large  2M        200M
[Figure: bipartite graph linking noun phrases (the dog, Australia, Catalina Island) to contexts (<X> ran quickly, travelled to <X>, <X> is pleasant)]
Hadoop: 95 cores, 7.5 hrs
Is “Dog” an animal? Is “Catalina” a place?
46
CoEM (Rosie Jones, 2005)
47
[Figure: speedup vs. number of CPUs (1-16) for CoEM on the small and large graphs, against optimal]
Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min
15x faster! 6x fewer CPUs!
GRAPH ABSTRACTIONS: GRAPHLAB CONTINUED….
48
Outline
• Motivation / where it fits
• Sample systems (c. 2010)
 – Pregel: and some sample programs
  • Bulk synchronous processing
 – Signal/Collect and GraphLab
  • Asynchronous processing
• GraphLab descendants
 – PowerGraph: partitioning
 – GraphChi: graphs w/o parallelism
 – GraphX: graphs over Spark
49
GraphLab’s descendants
• PowerGraph
• GraphChi
• GraphX
On multicore architecture: shared memory for workers
On cluster architecture (like Pregel): different memory spaces
What are the challenges moving away from shared-memory?
50
Natural Graphs → Power Law
[Figure: log-log plot of degree vs. count for the Altavista web graph]
Top 1% of vertices is adjacent to 53% of the edges!
Altavista web graph: 1.4B vertices, 6.7B edges
GraphLab group/Aapo 51
Problem: High-Degree Vertices Limit Parallelism
• Touches a large fraction of the graph (GraphLab 1)
• Produces many messages (Pregel, Signal/Collect)
• Edge information too large for a single machine
• Asynchronous consistency requires heavy locking (GraphLab 1)
• Synchronous consistency is prone to stragglers (Pregel)
GraphLab group/Aapo 52
PowerGraph
• Problem: GraphLab’s localities can be large
 – “all neighbors of a node” can be large for hubs, high-indegree nodes
• Approach:
 – new graph partitioning algorithm
  • can replicate data
 – gather-apply-scatter API: finer-grained parallelism
  • gather ~ combiner
  • apply ~ vertex UDF (for all replicas)
  • scatter ~ messages from vertex to edges
53
Factorized Vertex Updates
Split an update into 3 phases:
• Gather: parallel sum over the scope, Y + Y + … + Y → Δ (data-parallel over edges)
• Apply: Apply(Y, Δ) → Y’; locally apply the accumulated Δ to the vertex
• Scatter: update neighbors (data-parallel over edges)
GraphLab group/Aapo
54
Signal/collect examples: single-source shortest path
55
Signal/collect examples
PageRank
Life
56
PageRank + Preprocessing and Graph Building
57
Signal/collect examples: Co-EM/wvRN/Harmonic fields
58
PageRank in PowerGraph
PageRank Program(i):
  Gather(j → i): return w_ji * R[j]
  sum(a, b): return a + b
  Apply(i, Σ): R[i] = β + (1 − β) * Σ
  Scatter(i → j): if (R[i] changes) then activate(j)

R[i] = β + (1 − β) Σ_{(j,i) ∈ E} w_ji R[j]
59
GraphLab group/Aapo
j: edge source, i: vertex being updated
gather/sum is like a group-by … reduce, or collect
scatter is like a signal
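A minimal single-machine sketch (Scala, illustrative names) of running the GAS program above: Gather and sum produce Σ per vertex, Apply computes the new rank, and Scatter activates the out-neighbors of vertices whose rank changed. The edge weight w_ji is taken to be 1/outdeg(j), the usual PageRank choice. In PowerGraph proper, the gather sum is computed as partial sums on each machine holding a replica and then combined.

object GasPageRank {
  case class G(inNbrs: Map[Int, Seq[Int]], outNbrs: Map[Int, Seq[Int]], outDeg: Map[Int, Int])

  // One gather-apply-scatter round; returns new ranks plus the activated vertices.
  def round(g: G, rank: Map[Int, Double], beta: Double, eps: Double): (Map[Int, Double], Set[Int]) = {
    val next = rank.map { case (i, _) =>
      val sigma = g.inNbrs(i).map(j => rank(j) / g.outDeg(j)).sum  // Gather + sum
      i -> (beta + (1 - beta) * sigma)                             // Apply
    }
    val activated = next.keySet                                    // Scatter: wake neighbours
      .filter(i => math.abs(next(i) - rank(i)) > eps)
      .flatMap(g.outNbrs)
    (next, activated)
  }

  def main(args: Array[String]): Unit = {
    val g = G(inNbrs = Map(0 -> Seq(2), 1 -> Seq(0), 2 -> Seq(0, 1)),
              outNbrs = Map(0 -> Seq(1, 2), 1 -> Seq(2), 2 -> Seq(0)),
              outDeg = Map(0 -> 2, 1 -> 1, 2 -> 1))
    var r = Map(0 -> 1.0, 1 -> 1.0, 2 -> 1.0)
    for (_ <- 1 to 50) r = round(g, r, beta = 0.15, eps = 1e-12)._1
    println(r)
  }
}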
Distributed Execution of a PowerGraph Vertex-Program
[Figure: vertex Y replicated across machines 1-4; partial gather sums Σ1..Σ4 are combined into Σ; Apply produces Y’, which is then scattered from each replica]
60
GraphLab group/Aapo
Minimizing Communication in PowerGraph
A vertex-cut minimizes the number of machines each vertex spans.
Percolation theory suggests that power-law graphs have good vertex cuts. [Albert et al. 2000]
Communication is linear in the number of machines each vertex spans.
61
GraphLab group/Aapo
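A small sketch (Scala, hypothetical helper names) of the vertex-cut idea: edges, not vertices, are assigned to machines, and the quantity that drives communication is the replication factor: the average number of machines a vertex spans. Here edges are placed by hashing, roughly the "Random" strategy in the comparison that follows.

object VertexCut {
  // Replication factor = avg number of machines a vertex spans when edges
  // (not vertices) are assigned to machines; communication is linear in it.
  def replicationFactor(edges: Seq[(Int, Int)], machines: Int): Double = {
    val spans = scala.collection.mutable.Map.empty[Int, Set[Int]].withDefaultValue(Set.empty[Int])
    for ((u, v) <- edges) {
      val m = java.lang.Math.floorMod((u, v).hashCode, machines)  // hash-random edge placement
      spans(u) = spans(u) + m                                     // each endpoint gets a replica on m
      spans(v) = spans(v) + m
    }
    spans.values.map(_.size).sum.toDouble / spans.size
  }

  def main(args: Array[String]): Unit = {
    val edges = for (u <- 0 until 100; v <- 0 until 100 if u != v && (u + v) % 7 == 0) yield (u, v)
    println(f"avg machines spanned per vertex: ${replicationFactor(edges, 8)}%.2f")
  }
}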
Partitioning Performance
Twitter graph: 41M vertices, 1.4B edges
[Figure: avg # of machines spanned per vertex, and partitioning time (seconds), vs. number of machines (8-64), for Random, Oblivious, and Coordinated partitioning; cost vs. construction time, better toward the lower left]
Oblivious balances partition quality and partitioning time.
62
GraphLab group/Aapo
Partitioning matters…
[Figure: reduction in runtime (0-1) for PageRank, Collaborative Filtering, and Shortest Path under Random, Oblivious, and Greedy partitioning]
GraphLab group/Aapo
63
Outline
• Motivation / where it fits
• Sample systems (c. 2010)
 – Pregel: and some sample programs
  • Bulk synchronous processing
 – Signal/Collect and GraphLab
  • Asynchronous processing
• GraphLab descendants
 – PowerGraph: partitioning
 – GraphChi: graphs w/o parallelism
 – GraphX: graphs over Spark
64
GraphLab’s descendants
• PowerGraph
• GraphChi
• GraphX
65
GraphLab con’t
• PowerGraph
• GraphChi
 – Goal: use the graph abstraction on-disk, not in-memory, on a conventional workstation
[Figure: GraphLab stack: Linux cluster services (Amazon AWS); MPI/TCP-IP, PThreads, Hadoop/HDFS; a general-purpose API; applications such as graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering]
66
GraphLab con’t
• GraphChi
 – Key insight:
  • some algorithms on graphs are streamable (i.e., PageRank-Nibble)
  • in general we can’t easily stream the graph because neighbors will be scattered
  • but maybe we can limit the degree to which they’re scattered… enough to make streaming possible?
 – “almost-streaming”: keep P cursors in a file instead of one
67
PSW: Shards and Intervals
• Vertices are numbered from 1 to n
 – P intervals, each associated with a shard on disk
 – sub-graph = interval of vertices
[Figure: vertex ids 1..n split into interval(1), interval(2), …, interval(P), with shard(1), shard(2), …, shard(P)]
68
PSW: Layout  (1. Load  2. Compute  3. Write)
Shard: in-edges for an interval of vertices, sorted by source_id.
Shards are small enough to fit in memory; balance the size of shards.
[Figure: vertices 1..100, 101..700, 701..1000, 1001..10000 mapped to Shard 1, Shard 2, Shard 3, Shard 4]
69
PSW: Loading Sub-graph  (1. Load  2. Compute  3. Write)
Load the subgraph for vertices 1..100: load all in-edges (Shard 1: in-edges for vertices 1..100, sorted by source_id) into memory.
What about out-edges? They are arranged in sequence in the other shards (Shards 2-4).
70
PSW: Loading Sub-graph  (1. Load  2. Compute  3. Write)
Load the subgraph for vertices 101..700: load all in-edges (Shard 2) into memory; the out-edge blocks from the other shards are also in memory.
71
PSW: Load Phase
Only P large reads for each interval; P^2 reads on one full pass.
72
PSW: Execute Updates  (1. Load  2. Compute  3. Write)
• The update function is executed on the interval’s vertices
• Edges have pointers to the loaded data blocks
 – changes take effect immediately → asynchronous
[Figure: edge objects holding &Data pointers into loaded blocks X and Y]
73
PSW: Commit to Disk  (1. Load  2. Compute  3. Write)
• In the write phase, the blocks are written back to disk
 – the next load phase sees the preceding writes → asynchronous
74
[Figure: &Data pointers into blocks X and Y, now flushed back]
In total: P^2 reads and writes per full pass on the graph → performs well on both SSD and hard drive.
To make this work: the size of a vertex state can’t change when it’s updated (at least, as stored on disk).
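A compact sketch (Scala, illustrative names) of the parallel-sliding-windows access pattern just described: one pass processes intervals in order, reading interval p's own shard fully plus one block of out-edges from every other shard, so a full pass costs on the order of P^2 sequential reads. Disk blocks are simulated here with in-memory arrays and filters.

object PswSketch {
  case class Edge(src: Int, dst: Int)

  // One full pass; "shards on disk" are arrays, "block reads" are filters.
  def pass(shards: Array[Array[Edge]], intervals: Array[Range],
           update: (Int, Seq[Edge], Seq[Edge]) => Unit): Unit = {
    for (p <- shards.indices) {
      val inEdges = shards(p)                            // 1. Load: one full read of shard p,
      val outEdges = shards.flatMap(                     //    plus P-1 out-edge blocks; since
        _.filter(e => intervals(p).contains(e.src)))     //    shards are sorted by src, each
                                                         //    block is contiguous on disk
      for (v <- intervals(p))                            // 2. Compute
        update(v, inEdges.filter(_.dst == v), outEdges.filter(_.src == v))
      // 3. Write: modified blocks would be flushed back to their shards here
    }
  }

  def main(args: Array[String]): Unit = {
    val intervals: Array[Range] = Array(0 to 1, 2 to 3)
    val edges = Seq(Edge(0, 2), Edge(2, 1), Edge(3, 0), Edge(1, 3))
    val shards = intervals.map(iv => edges.filter(e => iv.contains(e.dst)).toArray)
    pass(shards, intervals, (v, ins, outs) => println(s"vertex $v: ${ins.size} in, ${outs.size} out"))
  }
}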
Experiment Setting
• Mac Mini (Apple Inc.)
 – 8 GB RAM
 – 256 GB SSD, 1 TB hard drive
 – Intel Core i5, 2.5 GHz
• Experiment graphs:

Graph         Vertices  Edges  P (shards)  Preprocessing
live-journal  4.8M      69M    3           0.5 min
netflix       0.5M      99M    20          1 min
twitter-2010  42M       1.5B   20          2 min
uk-2007-05    106M      3.7B   40          31 min
uk-union      133M      5.4B   50          33 min
yahoo-web     1.4B      6.6B   50          37 min
75
Comparison to Existing Systems
Notes: comparison results do not include time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.
See the paper for more comparisons.
[Charts, runtimes in minutes:
 PageRank, twitter-2010 (1.5B edges): Spark (50 machines) vs. GraphChi (Mac Mini);
 WebGraph Belief Propagation (U Kang et al.), yahoo-web (6.7B edges): Pegasus/Hadoop (100 machines) vs. GraphChi (Mac Mini);
 Matrix Factorization (Alt. Least Sqr.), Netflix (99M edges): GraphLab v1 (8 cores) vs. GraphChi (Mac Mini);
 Triangle Counting, twitter-2010 (1.5B edges): Hadoop (1636 machines) vs. GraphChi (Mac Mini)]
On a Mac Mini:
✓ GraphChi can solve as big problems as existing large-scale systems.
✓ Comparable performance.
76
Outline
• Motivation / where it fits
• Sample systems (c. 2010)
 – Pregel: and some sample programs
  • Bulk synchronous processing
 – Signal/Collect and GraphLab
  • Asynchronous processing
• GraphLab “descendants”
 – PowerGraph: partitioning
 – GraphChi: graphs w/o parallelism
 – GraphX: graphs over Spark (Gonzalez)
77
GraphLab’s descendants
• PowerGraph
• GraphChi
• GraphX
 – implementation of GraphLab’s API on top of Spark
 – Motivations:
  • avoid transfers between subsystems
  • leverage a larger community for common infrastructure
 – What’s different:
  • graphs are now immutable, and operations transform one graph into another (RDD ⇒ RDG, resilient distributed graph)
Idea 1: Graph as Tables
[Figure: property graph over vertices R (rxin), J (jegonzal), F (franklin), I (istoica)]

Vertex Property Table
Id        Property (V)
rxin      (Stu., Berk.)
jegonzal  (PstDoc, Berk.)
franklin  (Prof., Berk.)
istoica   (Prof., Berk.)

Edge Property Table
SrcId     DstId     Property (E)
rxin      jegonzal  Friend
franklin  rxin      Advisor
istoica   franklin  Coworker
franklin  jegonzal  PI

Under the hood, things can be split even more finely: e.g. a vertex map table + a vertex data table. Operators maximize structure sharing and minimize communication.
79
Operators
• Table (RDD) operators are inherited from Spark: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
80
Graph Operators

class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)], edges: Table[(Id, Id, E)])
  // Table Views -----------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]
  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
  def mapV(m: (Id, V) => T): Graph[T, E]
  def mapE(m: Edge[V, E] => T): Graph[V, T]
  // Joins ----------------------------------------
  def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation ----------------------------------
  def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                 reduceF: (T, T) => T): Graph[T, E]
}
81
Idea 2: mrTriplets, a low-level routine similar to scatter-gather-apply; it evolved into aggregateNeighbors and aggregateMessages.
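For flavor, here is a small example against the current GraphX API (this assumes a Spark/GraphX dependency on the classpath; aggregateMessages is the successor of the mrTriplets routine named above): one PageRank-style step that sums weighted ranks along in-edges and joins the sums back into the vertex attributes.

import org.apache.spark.graphx.{Graph, TripletFields}

object AggregateMessagesExample {
  // Vertex attribute = current rank, edge attribute = weight w_ji.
  def step(g: Graph[Double, Double], beta: Double): Graph[Double, Double] = {
    val sums = g.aggregateMessages[Double](
      ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr),  // like mrTriplets' map over triplets
      _ + _,                                         // like its reduce, per destination vertex
      TripletFields.Src)                             // ship only source attributes to the edges
    g.outerJoinVertices(sums) { (_, _, sum) => beta + (1 - beta) * sum.getOrElse(0.0) }
  }
}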
The GraphX Stack (Lines of Code)
• Spark
• GraphX (3575)
• Pregel (28) + GraphLab (50)
• Algorithms on top: PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40), LDA (120), K-core (51), Triangle Count (45), SVD (40)
82
Performance Comparisons
Runtime (in seconds, PageRank for 10 iterations), Live-Journal: 69 million edges:

System         Runtime (s)
GraphLab       22
GraphX         68
Giraph         207
Naïve Spark    354
Mahout/Hadoop  1340

GraphX is roughly 3x slower than GraphLab.
83
Summary
• Large immutable data structures on (distributed) disk, processing by sweeping through them and creating new data structures:
 – stream-and-sort, Hadoop, PIG, Hive, …
• Large immutable data structures in distributed memory:
 – Spark: distributed tables
• Large mutable data structures in distributed memory:
 – parameter server: structure is a hashtable
 – Pregel, GraphLab, GraphChi, GraphX: structure is a graph
84
Summary
• APIs for the various systems vary in detail but have a similar flavor
 – typical algorithms iteratively update vertex state
 – changes in state are communicated with messages which need to be aggregated from neighbors
• Biggest wins are
 – on problems where the graph is fixed in each iteration, but vertex data changes
 – on graphs small enough to fit in (distributed) memory
85
Some things to take away
• Platforms for iterative operations on graphs
 – GraphX: if you want to integrate with Spark
 – GraphChi: if you don’t have a cluster
 – GraphLab/Dato: if you don’t need free software and performance is crucial
 – Pregel: if you work at Google
 – Giraph, Signal/Collect, …
• Important differences
 – intended architecture: shared-memory and threads, distributed cluster memory, graph on disk
 – how graphs are partitioned for clusters
 – whether processing is synchronous or asynchronous
86