Carnegie Mellon
Parallel Splash Belief Propagation
Joseph E. Gonzalez, Yucheng Low,
Carlos Guestrin, David O'Hallaron
Computers that worked on this project:
BigBro1, BigBro2, BigBro3, BigBro4, BigBro5, BigBro6, BiggerBro, BigBroFS, Tashi01, Tashi02, Tashi03, Tashi04, Tashi05, Tashi06, …, Tashi30,
parallel, gs6167, koobcam (helped with writing)
2
Why talk about parallelism now?
Change in the Foundation of ML
[Figure: log(speed in GHz) vs. release date, 1988–2010, contrasting past sequential performance, future sequential performance, and future parallel performance.]
3
Why is this a Problem?
[Figure: parallelism vs. sophistication of existing parallel ML work: nearest neighbor [Google et al.], basic regression [Cheng et al.], support vector machines [Graf et al.], and graphical models [Mendiburu et al.]; we want to be high on both axes.]
4
Why is it hard?
Algorithmic Efficiency: eliminate wasted computation.
Parallel Efficiency: expose independent computation.
Implementation Efficiency: map computation to real hardware.
The Key Insight
5
Statistical Structure: graphical model structure, graphical model parameters.
Computational Structure: chains of computational dependences, decay of influence.
Parallel Structure: parallel dynamic scheduling, state partitioning for distributed computation.
6
The Result
[Figure: the same parallelism vs. sophistication chart; Splash Belief Propagation brings graphical models [Gonzalez et al.] to the goal region, beyond nearest neighbor [Google et al.], basic regression [Cheng et al.], support vector machines [Graf et al.], and earlier parallel graphical model work [Mendiburu et al.].]
7
Outline
Overview
Graphical Models: Statistical Structure
Inference: Computational Structure
τε-Approximate Messages: Statistical Structure
Parallel Splash: Dynamic Scheduling, Partitioning
Experimental Results
Conclusions
8
Graphical Models and Parallelism
Graphical models provide a common language for general purpose parallel algorithms in machine learning
A parallel inference algorithm would improve: protein structure prediction, computer vision, and movie recommendation.
Inference is a key step in learning graphical models.
Overview of Graphical Models
Graphical representation of local statistical dependencies
9
[Figure: observed random variables (the noisy picture), latent pixel variables (the "true" pixel values), and local dependencies encoding continuity assumptions.]
Inference: what is the probability that this pixel is black?
Synthetic Noisy Image Problem
Overlapping Gaussian noise; used to assess convergence and accuracy.
[Figure: noisy image and predicted image.]
11
Protein Side-Chain Prediction
Model side-chain interactions as a graphical model
[Figure: protein backbone with interacting side-chains modeled as a graphical model.]
Inference: what is the most likely orientation of the side-chains?
12
Protein Side-Chain Prediction
276 protein networks, each with approximately 700 variables, 1600 factors, and 70 discrete orientations; strong factors.
[Figure: protein backbone with side-chains; example degree distribution of the networks.]
13
Markov Logic Networks
Represent logic as a graphical model.
[Figure: factor graph over Smokes(A), Cancer(A), Smokes(B), Cancer(B), and Friends(A,B), where A = Alice, B = Bob, and every variable is True/False; one factor connects Friends(A,B), Smokes(A), and Smokes(B).]
Inference: Pr(Cancer(B) = True | Smokes(A) = True, Friends(A,B) = True) = ?
14
Markov Logic Networks
[Figure: the same Smokes/Cancer/Friends factor graph.]
UW-Systems model: 8K binary variables, 406K factors.
Irregular degree distribution: some vertices have very high degree.
15
Outline: Overview · Graphical Models: Statistical Structure · Inference: Computational Structure · τε-Approximate Messages · Parallel Splash (Dynamic Scheduling, Partitioning) · Experimental Results · Conclusions
16
The Inference Problem
Inference is NP-hard in general. Approximate inference: belief propagation.
[Figures: the Markov logic network, the protein side-chain model, and the image denoising model.]
What is the probability that Bob smokes, given that Alice smokes?
What is the best configuration of the protein side-chains?
What is the probability that each pixel is black?
17
Belief Propagation (BP): an iterative message passing algorithm.
A naturally parallel algorithm.
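For reference, the update each message pass performs, written in standard sum-product form for a pairwise model (the slides show this only as a figure, and the talk's models are factor graphs, where the analogous variable-to-factor and factor-to-variable updates apply):

m_{i \to j}(x_j) \;\propto\; \sum_{x_i} \psi_{ij}(x_i, x_j)\,\phi_i(x_i) \prod_{k \in N(i)\setminus\{j\}} m_{k \to i}(x_i),
\qquad
b_i(x_i) \;\propto\; \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i).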
18
Parallel Synchronous BP: given the old messages, all new messages can be computed in parallel.
[Figure: old messages mapped to new messages across CPU 1, CPU 2, CPU 3, …, CPU n.]
Map-Reduce Ready!
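As a sketch of why this schedule is embarrassingly parallel (and map-reduce friendly): every new message depends only on the previous round's messages, so one iteration is a pure map over directed edges. An illustrative C++/OpenMP fragment; the Message type and the update callback are assumptions, not the authors' code.

#include <functional>
#include <vector>

using Message = std::vector<double>;   // one probability table per directed edge

// One synchronous BP iteration: every new message depends only on the old messages,
// so the round is a pure map over directed edges.
// 'update' is a placeholder for the actual factor-graph message computation.
std::vector<Message> synchronous_round(
    int num_directed_edges,
    const std::vector<Message>& old_msgs,
    const std::function<Message(int, const std::vector<Message>&)>& update) {
  std::vector<Message> new_msgs(num_directed_edges);
  #pragma omp parallel for                    // edges are independent given old_msgs
  for (int e = 0; e < num_directed_edges; ++e)
    new_msgs[e] = update(e, old_msgs);        // reads only the previous round's messages
  return new_msgs;
}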
19
Sequential Computational Structure
20
Hidden Sequential Structure
21
Hidden Sequential Structure
Running Time (chain with evidence at both ends):
Naturally parallel (synchronous) schedule: (time for a single parallel iteration) × (number of iterations) = 2n²/p, for p ≤ 2n.
Optimal sequential algorithm: forward-backward.
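The optimal sequential schedule on a chain is just two sweeps, a forward pass and a backward pass of n − 1 messages each, roughly 2n message computations in total. A sketch, with send_message standing in for whatever message update the model uses (an assumption, not the slides' code):

#include <functional>

// Forward-backward schedule on a chain of n vertices (0 .. n-1).
// send_message(i, j) computes and stores the message from vertex i to adjacent vertex j.
// Total work: 2(n-1) message computations, versus 2n^2/p for the synchronous schedule.
void forward_backward(int n, const std::function<void(int, int)>& send_message) {
  for (int i = 0; i + 1 < n; ++i) send_message(i, i + 1);   // forward sweep
  for (int i = n - 1; i > 0; --i)  send_message(i, i - 1);  // backward sweep
}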
22
[Figure: running time comparison. Forward-backward with p = 1: 2n; the optimal parallel schedule with p = 2: n; the naturally parallel schedule (2n²/p) leaves a large gap.]
Key Computational Structure
Naturally Parallel: 2n²/p, for p ≤ 2n.
23
[Figure: running time comparison; the optimal parallel schedule with p = 2 achieves n.]
The gap comes from inherent sequential structure, and exploiting it requires efficient scheduling.
24
Outline: Overview · Graphical Models: Statistical Structure · Inference: Computational Structure · τε-Approximate Messages · Parallel Splash (Dynamic Scheduling, Partitioning) · Experimental Results · Conclusions
25
Parallelism by Approximation
τε represents the minimal sequential structure
[Figure: true messages vs. the τε-approximation on a ten-vertex chain.]
26
Tau-Epsilon Structure
Often τε decreases quickly:
[Figure: τε vs. message approximation error (log scale) for Markov logic networks and protein networks.]
27
Running Time Lower Bound
Theorem: Using p processors it is not possible to obtain a τε approximation in time less than:
[Formula shown as a figure: the sum of a parallel component and a sequential component; a reconstruction appears in the proof below.]
28
Proof (Running Time Lower Bound): We must make n − τε vertices τε left-aware. Consider one direction using p/2 processors (p ≥ 2); a single processor can only make k − τε + 1 vertices left-aware in k iterations.
[Figure: a chain of n vertices divided into blocks of length τε.]
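Combining the two facts above gives the omitted bound (my reconstruction from the proof sketch, not a transcription of the slide): the p/2 processors must together make n − τε vertices τε left-aware, and each can make at most k − τε + 1 of them in k iterations, so

\frac{p}{2}\left(k - \tau_\epsilon + 1\right) \;\ge\; n - \tau_\epsilon
\quad\Longrightarrow\quad
k \;\ge\; \underbrace{\frac{2\,(n - \tau_\epsilon)}{p}}_{\text{parallel component}} \;+\; \underbrace{\tau_\epsilon - 1}_{\text{sequential component}}.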
29
Optimal Parallel Scheduling
[Figure: the chain divided evenly among Processor 1, Processor 2, and Processor 3.]
Theorem: Using p processors, this algorithm achieves a τε-approximation in the time derived in the proof below.
30
Proof: Optimal Parallel Scheduling
All vertices are left-aware of the left-most vertex on their processor; after exchanging boundary messages and running another iteration, awareness extends by another block of n/p vertices, so after k parallel iterations each vertex is (k−1)(n/p) left-aware.
31
Proof: Optimal Parallel Scheduling
After k parallel iterations each vertex is (k−1)(n/p) left-aware. All vertices must be made τε left-aware, and each iteration takes O(n/p) time.
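Filling in the algebra the slide leaves to the figure (notation mine): each processor owns n/p consecutive vertices, so

(k - 1)\,\frac{n}{p} \;\ge\; \tau_\epsilon
\;\Longrightarrow\;
k \;\ge\; \frac{\tau_\epsilon\, p}{n} + 1,
\qquad
\text{total time} \;=\; k \cdot O\!\left(\frac{n}{p}\right) \;=\; O\!\left(\frac{n}{p} + \tau_\epsilon\right),

which matches the lower bound up to constant factors.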
32
Comparing with Synchronous BP
[Figure: synchronous schedule vs. optimal schedule across Processors 1–3; the synchronous schedule leaves a gap.]
33
Outline: Overview · Graphical Models: Statistical Structure · Inference: Computational Structure · τε-Approximate Messages · Parallel Splash (Dynamic Scheduling, Partitioning) · Experimental Results · Conclusions
34
The Splash Operation
Generalize the optimal chain algorithm to arbitrary cyclic graphs:
1) Grow a BFS spanning tree of fixed size
2) Forward pass computing all messages at each vertex
3) Backward pass computing all messages at each vertex
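A minimal sketch of one Splash under assumed helpers (neighbors and update_vertex are placeholders, and the splash size is measured in vertices rather than messages); it shows only the BFS-tree-then-two-sweeps structure, not the authors' implementation.

#include <functional>
#include <queue>
#include <unordered_set>
#include <vector>

// One Splash rooted at 'root':
//   1) grow a bounded-size BFS spanning tree,
//   2) forward pass: update vertices from the leaves toward the root,
//   3) backward pass: update vertices from the root back out to the leaves.
// 'neighbors' and 'update_vertex' (recompute all outbound messages of a vertex)
// stand in for the real factor-graph operations.
void splash(int root, int max_size,
            const std::function<std::vector<int>(int)>& neighbors,
            const std::function<void(int)>& update_vertex) {
  // 1) BFS spanning tree of at most max_size vertices, recorded in visit order.
  std::vector<int> order;
  std::unordered_set<int> visited{root};
  std::queue<int> frontier;
  frontier.push(root);
  while (!frontier.empty() && (int)order.size() < max_size) {
    int v = frontier.front(); frontier.pop();
    order.push_back(v);
    for (int u : neighbors(v))
      if (visited.insert(u).second) frontier.push(u);
  }
  // 2) Forward pass: leaves toward the root (reverse BFS order).
  for (auto it = order.rbegin(); it != order.rend(); ++it) update_vertex(*it);
  // 3) Backward pass: root back out to the leaves (BFS order).
  for (int v : order) update_vertex(v);
}

The reverse-BFS forward pass visits deeper vertices before their parents, so information flows from the leaves to the root; the backward pass then pushes it back out, mirroring forward-backward on a chain.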
35
Running Parallel Splashes
Partition the graph; schedule Splashes locally; transmit the messages along the boundary of the partition.
[Figure: CPU 1, CPU 2, and CPU 3, each with local state, running Splashes on their partitions.]
Key Challenges: 1) How do we schedule Splashes? 2) How do we partition the graph?
Where do we Splash?
Assign priorities and use a scheduling queue to select roots:
[Figure: CPU 1 pops roots from its local scheduling queue and runs Splashes.]
How do we assign priorities?
37
Message Scheduling
Residual Belief Propagation [Elidan et al., UAI 06]: assign priorities based on the change in inbound messages.
[Figure: two vertices receiving messages; one sees a large change in an inbound message, the other only small changes.]
Small change: an expensive no-op. Large change: an informative update.
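As a sketch of that priority, one natural choice is the L1 change between the new and old message tables (the specific norm is an assumption, not stated on the slide):

#include <cmath>
#include <vector>

// Residual-BP style priority: how much did this message change?
// Large residual -> informative update; tiny residual -> expensive no-op.
double message_residual(const std::vector<double>& new_msg,
                        const std::vector<double>& old_msg) {
  double r = 0.0;
  for (size_t i = 0; i < new_msg.size(); ++i)
    r += std::fabs(new_msg[i] - old_msg[i]);   // L1 change in the message table
  return r;
}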
38
Problem with Message Scheduling
Small changes in messages do not imply small changes in belief:
[Figure: small changes in all inbound messages, yet a large change in the belief.]
39
Problem with Message Scheduling
Large changes in a single message do not imply large changes in belief:
[Figure: a large change in a single message, yet a small change in the belief.]
40
Belief Residual Scheduling
Assign priorities based on the cumulative change in belief:
[Figure: each new message's contribution to the change in the belief is added to the vertex residual r_v.]
A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
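A sketch of the belief residual bookkeeping: each time a new inbound message arrives at vertex v, the change it induces in v's belief is added to v's residual r_v; updating v (e.g. inside a Splash) consumes the residual. The class interface and the L1 belief change are assumptions, not the paper's API.

#include <cmath>
#include <vector>

// Per-vertex belief residual: cumulative change in belief since the vertex was last updated.
struct BeliefResidualScheduler {
  std::vector<double> residual;                       // r_v for every vertex v
  explicit BeliefResidualScheduler(int num_vertices) : residual(num_vertices, 0.0) {}

  // Called whenever a new inbound message changes vertex v's belief from old_b to new_b.
  void on_message_received(int v, const std::vector<double>& old_b,
                           const std::vector<double>& new_b) {
    double change = 0.0;
    for (size_t i = 0; i < new_b.size(); ++i)
      change += std::fabs(new_b[i] - old_b[i]);       // contribution of this message
    residual[v] += change;                            // accumulate: r_v <- r_v + change
  }

  // Called when v is chosen as a Splash root (or updated inside a Splash).
  void on_vertex_updated(int v) { residual[v] = 0.0; }

  double priority(int v) const { return residual[v]; } // roots are popped by max priority
};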
41
Message vs. Belief Scheduling
Belief Scheduling improves accuracy and convergence
[Figures: percent of runs converged in 4 hours for message residuals vs. belief residuals, and L1 error in beliefs over time (seconds) for message scheduling vs. belief scheduling.]
Splash Pruning
Belief residuals can be used to dynamically reshape and resize Splashes:
[Figure: vertices with low belief residual are pruned from the Splash.]
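The pruning rule can be folded into the BFS growth step of the Splash sketch above: a neighbor joins the tree only if its accumulated belief residual says it is still worth updating. A hedged fragment (the threshold and the residual bookkeeping are assumptions):

#include <vector>

// Splash pruning: only expand the BFS tree into vertices whose belief residual
// says they still need work; low-residual vertices are skipped, which reshapes
// and shrinks the Splash dynamically.
bool include_in_splash(int v, const std::vector<double>& belief_residual,
                       double prune_threshold) {
  return belief_residual[v] > prune_threshold;
}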
43
Splash Size
With Splash Pruning, the algorithm dynamically selects the optimal splash size.
[Figure: running time (seconds) vs. splash size (messages), with and without pruning.]
44
Example
Synthetic Noisy Image
Factor Graph
[Figure: vertex update counts across the factor graph; some regions receive many updates, others few.]
The algorithm identifies and focuses on the hidden sequential structure.
45
Parallel Splash Algorithm
Partition the factor graph over processors; schedule Splashes locally using belief residuals; transmit messages on the partition boundary.
[Figure: CPU 1, CPU 2, and CPU 3, each with local state and a scheduling queue, running Splashes and exchanging boundary messages over a fast reliable network.]
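A high-level sketch of one processor's loop under the description above; the partition assignment, run_splash, exchange_boundary_messages, and converged callbacks are hypothetical stand-ins for the real partitioning, Splash, MPI communication, and termination logic.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// One processor's view of Parallel Splash BP:
//   keep a local priority queue of belief residuals over the vertices it owns,
//   repeatedly Splash from the highest-residual root, and exchange boundary
//   messages with the other processors.
void parallel_splash_worker(
    const std::vector<int>& my_vertices,                       // from the (over-)partitioning
    std::vector<double>& belief_residual,                      // r_v, updated as messages arrive
    const std::function<void(int)>& run_splash,                // Splash rooted at a vertex
    const std::function<void()>& exchange_boundary_messages,   // hypothetical MPI send/recv step
    const std::function<bool()>& converged) {
  using Entry = std::pair<double, int>;                        // (priority, vertex)
  std::priority_queue<Entry> queue;
  for (int v : my_vertices) queue.emplace(belief_residual[v], v);

  while (!converged()) {
    if (!queue.empty()) {
      auto [stale_priority, root] = queue.top(); queue.pop();
      if (stale_priority > belief_residual[root]) {            // lazily refresh stale entries
        queue.emplace(belief_residual[root], root);
        continue;
      }
      run_splash(root);                                        // updates residuals as a side effect
      queue.emplace(belief_residual[root], root);              // re-enqueue with its new residual
    }
    exchange_boundary_messages();                              // messages crossing the partition
  }
}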
Theorem: Given a uniform partitioning of the chain graphical model, Parallel Splash runs in the time given below, retaining optimality.
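The bound itself appears only as a figure in the slides; given the optimal-schedule analysis above and the claim that optimality is retained, the natural reading (a reconstruction, not a quote) is

O\!\left(\frac{n}{p} + \tau_\epsilon\right)

for a length-n chain, i.e. the same bound achieved by the optimal parallel schedule.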
46
Partitioning Objective
The partitioning of the factor graph determines storage, computation, and communication.
Goal: balance computation and minimize communication.
[Figure: a factor graph cut between CPU 1 and CPU 2, showing the balance requirement and the communication cost of the cut edges.]
47
The Partitioning Problem
Objective: minimize communication while ensuring balanced work.
The work of each partition and the communication across the cut depend on how often each vertex is updated.
The problem is NP-hard; METIS provides a fast partitioning heuristic.
But the update counts are not known!
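Spelled out, the partitioning problem the slide gestures at looks like the following (my notation; the per-vertex update counts u_v are exactly the unknown quantities flagged above): assign the vertices to blocks B_1, …, B_p so as to

\min_{B_1,\dots,B_p}\;\; \sum_{(a,b)\in E:\ a,b\ \text{in different blocks}} c_{ab}
\qquad \text{subject to} \qquad
\sum_{v \in B_i} u_v\, w_v \;\le\; \frac{1+\gamma}{p} \sum_{v} u_v\, w_v \quad \text{for all } i,

where u_v is the number of times vertex v is updated, w_v the cost of one update, c_{ab} the communication cost of an edge crossing the cut, and γ a small imbalance tolerance. METIS heuristically solves balanced cut problems of this form.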
48
Unknown Update Counts
Update counts are determined by belief scheduling and depend on graph structure, factors, …; there is little correlation between past and future update counts.
[Figure: update counts on the noisy image model.]
Simple solution: the uninformed cut.
49
Uninformed Cuts
Greater imbalance & lower communication cost
[Figures: update counts, the uninformed cut, and the optimal cut on the denoising model; bar charts of work imbalance and communication cost on Denoise and UW-Systems for the uninformed vs. optimal cuts. The uninformed cut gives some partitions too much work and others too little.]
50
Over-Partitioning
Over-cut the graph into k·p partitions and randomly assign them to CPUs.
This increases balance but also increases communication cost (more boundary).
[Figure: CPU assignments without over-partitioning vs. with over-partitioning, k = 6.]
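A sketch of the over-partitioning step: cut into k·p pieces and hand pieces to the p CPUs at random; metis_partition is a stand-in for the actual partitioner, not a specific METIS API.

#include <algorithm>
#include <functional>
#include <random>
#include <vector>

// Over-partitioning: cut the graph into k*p pieces, then randomly hand pieces to CPUs.
// Returns the CPU owning each vertex. More pieces -> better expected balance,
// but more boundary edges (more communication).
std::vector<int> over_partition(
    int num_vertices, int p, int k,
    const std::function<std::vector<int>(int /*pieces*/)>& metis_partition) { // stand-in partitioner
  std::vector<int> piece_of_vertex = metis_partition(k * p);    // piece id in [0, k*p)

  std::vector<int> cpu_of_piece(k * p);
  for (int piece = 0; piece < k * p; ++piece) cpu_of_piece[piece] = piece % p; // k pieces per CPU
  std::mt19937 rng(0);
  std::shuffle(cpu_of_piece.begin(), cpu_of_piece.end(), rng);  // random piece-to-CPU assignment

  std::vector<int> cpu_of_vertex(num_vertices);
  for (int v = 0; v < num_vertices; ++v)
    cpu_of_vertex[v] = cpu_of_piece[piece_of_vertex[v]];
  return cpu_of_vertex;
}

Randomizing which pieces land on which CPU evens out the unknown update counts in expectation, at the price of more boundary edges.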
51
Over-Partitioning Results
Over-partitioning provides a simple way to trade between work balance and communication cost.
[Figures: communication cost and work imbalance as functions of the partition factor k.]
52
CPU Utilization
Over-partitioning improves CPU utilization:
[Figures: active CPUs over time on the UW-Systems MLN and on Denoise, with no over-partitioning vs. 10x over-partitioning.]
53
Parallel Splash Algorithm
Over-partition the factor graph and randomly assign pieces to processors; schedule Splashes locally using belief residuals; transmit messages on the partition boundary.
[Figure: CPU 1, CPU 2, and CPU 3, each with local state and a scheduling queue, running Splashes and exchanging boundary messages over a fast reliable network.]
54
Outline: Overview · Graphical Models: Statistical Structure · Inference: Computational Structure · τε-Approximate Messages · Parallel Splash (Dynamic Scheduling, Partitioning) · Experimental Results · Conclusions
55
Experiments
Implemented in C++ using MPICH2 as the message passing API.
Ran on an Intel OpenCirrus cluster: 120 processors (15 nodes, each with 2 quad-core Intel Xeon processors), connected by a gigabit Ethernet switch.
Tested on Markov logic networks obtained from Alchemy [Domingos et al. SSPR 08].
We present results on the largest (UW-Systems) and smallest (UW-Languages) MLNs.
56
Parallel Performance (Large Graph)
UW-Systems: 8K variables, 406K factors. Single-processor running time: 1 hour.
Linear to super-linear speedup up to 120 CPUs (super-linear due to cache efficiency).
[Figure: speedup vs. number of CPUs (up to 120), no over-partitioning vs. 5x over-partitioning, against the linear-speedup line.]
57
Parallel Performance (Small Graph)
UW-Languages: 1K variables, 27K factors. Single-processor running time: 1.5 minutes.
Linear to super-linear speedup up to 30 CPUs; beyond that, network costs quickly dominate the short running time.
[Figure: speedup vs. number of CPUs, no over-partitioning vs. 5x over-partitioning, against the linear-speedup line.]
58
Outline: Overview · Graphical Models: Statistical Structure · Inference: Computational Structure · τε-Approximate Messages · Parallel Splash (Dynamic Scheduling, Partitioning) · Experimental Results · Conclusions
59
Summary
Algorithmic, parallel, and implementation efficiency, achieved through the Splash structure with belief residual scheduling, independent parallel Splashes, over-partitioning, distributed scheduling queues, and asynchronous communication.
Experimental results on large factor graphs:
Linear to super-linear speed-up using up to 120 processors
[Figure: parallelism vs. sophistication; Parallel Splash Belief Propagation moves graphical models from "we were here" to "we are here".]
Conclusion
61
Questions
62
Protein Results
[Figure: relative speedup vs. number of cores (1–8) for Residual BP, Residual Splash BP, and MapReduce BP, compared against linear speedup.]
63
3D Video Task
[Figure: relative speedup vs. number of cores (1–8) for Residual BP, Residual Splash BP, and MapReduce BP, compared against linear speedup.]
64
Distributed Parallel Setting
Opportunities: access to larger systems (from 8 CPUs to 1000 CPUs); linear increase in RAM, cache capacity, and memory bandwidth.
Challenges: distributed state, communication, and load balancing.
[Figure: nodes, each with CPU, bus, memory, and cache, connected by a fast reliable network.]