Dynamic Feedback: An Effective Technique for Adaptive Computing
Pedro Diniz and Martin Rinard
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/~{pedro,martin}
Basic Issue: Efficient Implementation of Atomic Operations in Object-Based Languages
Approach: Reduce Lock Overhead by Coarsening Lock Granularity
Problem: Coarsening Lock Granularity May Reduce Available Concurrency
Solution: Dynamic Feedback
• Multiple Lock Coarsening Policies
• Dynamic Feedback
  • Generate Multiple Versions of Code
  • Measure Dynamic Overhead of Each Policy
  • Dynamically Select Best Version
• Context
  • Parallelizing Compiler
• Irregular Object-Based Programs
  • Pointer-Based Data Structures
• Commutativity Analysis
Talk Outline
• Lock Coarsening
• Dynamic Feedback
• Experimental Results
• Related Work
• Conclusions
Model of Computation
• Parallel Programs
  • Serial Phases
  • Parallel Phases
• Atomic Operations on Shared Objects
  • Mutual Exclusion Locks
  • Acquire Constructs
  • Release Constructs
[Figure: a serial phase, a parallel phase of atomic operations, and a second serial phase; each atomic operation is a mutual exclusion region bracketed by L.acquire() and L.release()]
Problem: Lock Overhead
[Figure: successive L.acquire() / L.release() pairs on the same lock L]
Solution: Lock Coarsening
[Figure: "Original" vs. "After Lock Coarsening"; separate L.acquire() / L.release() pairs in the original code become a single L.acquire() / L.release() pair after coarsening]
Reference: Diniz and Rinard, “Synchronization Transformations for Parallel Computing”, POPL97
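The transformation can be illustrated with a minimal sketch. It assumes Python's `threading.Lock` as a stand-in for the per-object mutual exclusion lock; `CountingLock`, `update_original` and `update_coarsened` are hypothetical names introduced here, not code from the talk or the compiler.

```python
import threading


class CountingLock:
    """Mutual exclusion lock that counts acquires (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquires = 0

    def acquire(self):
        self._lock.acquire()
        self.acquires += 1  # updated while holding the lock

    def release(self):
        self._lock.release()


def update_original(lock, values, deltas):
    # Original: every atomic operation has its own acquire/release pair.
    for i, d in enumerate(deltas):
        lock.acquire()
        values[i] += d
        lock.release()


def update_coarsened(lock, values, deltas):
    # After lock coarsening: one acquire/release pair covers all operations.
    lock.acquire()
    for i, d in enumerate(deltas):
        values[i] += d
    lock.release()


orig_lock, coarse_lock = CountingLock(), CountingLock()
a, b = [0, 0, 0], [0, 0, 0]
update_original(orig_lock, a, [1, 2, 3])
update_coarsened(coarse_lock, b, [1, 2, 3])
print(orig_lock.acquires, coarse_lock.acquires)
```

Both versions compute the same result; the coarsened version executes fewer acquire and release constructs but holds the lock for the whole loop.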
Lock Coarsening Trade-Off
• Advantage:
  • Reduces Number of Executed Acquires and Releases
  • Reduces Acquire and Release Overhead
• Disadvantage: May Introduce False Exclusion
  • Multiple Processors Attempt to Acquire Same Lock
  • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
False Exclusion
[Figure: "Original" vs. "After Lock Coarsening"; after coarsening, one processor waits at L.acquire() while the lock holder executes code that was originally outside any mutual exclusion region (false exclusion)]
Lock Coarsening Policy
Goal: Limit Potential Severity of False Exclusion
Mechanism: Multiple Lock Coarsening Policies
• Original: Never Coarsen Granularity
• Bounded: Coarsen Granularity Only Within Cycle-Free Subgraphs of the Interprocedural Control Flow Graph (ICFG)
• Aggressive: Always Coarsen Granularity
Choosing Best Policy
• Best Lock Coarsening Policy May Depend On
  • Topology of Data Structures
  • Dynamic Schedule of Computation
• Information Required to Choose Best Policy Unavailable at Compile Time
• Complications
  • Different Phases May Have Different Best Policy
  • In Same Phase, Best Policy May Change Over Time
Solution: Dynamic Feedback
• Generated Code Executes
  • Sampling Phases: Measure Performance of Different Policies
  • Production Phases: Use Best Policy From Sampling Phase
• Periodically Resample to Discover Best Policy Changes
[Figure: overhead vs. time for the Aggressive, Original and Bounded code versions across a sampling phase, a production phase and a second sampling phase; the Code Version track shows Aggressive, then Original]
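The alternation of sampling and production phases can be sketched as a small simulation. This is an illustration under stated assumptions, not the generated code: `dynamic_feedback` and `model` are hypothetical names, time is simulated rather than measured, and the overhead values in `model` are invented for the example.

```python
def dynamic_feedback(overhead_of, policies, sampling_len, production_len, total_time):
    """Simulate alternating sampling and production phases.

    overhead_of(policy, t) is a hypothetical model that returns the overhead
    (a fraction between 0 and 1) of `policy` at simulated time `t`.
    Returns a list of (production_start_time, chosen_policy) pairs.
    """
    t = 0.0
    schedule = []
    while t < total_time:
        # Sampling phase: run each code version for one sampling interval.
        sampled = {}
        for p in policies:
            sampled[p] = overhead_of(p, t)
            t += sampling_len
        # Production phase: run the version with the lowest sampled overhead.
        best = min(policies, key=lambda p: sampled[p])
        schedule.append((t, best))
        t += production_len
    return schedule


def model(policy, t):
    # Assumed behaviour: Aggressive is cheapest early, Bounded is cheapest later.
    if policy == "aggressive":
        return 0.1 if t < 10 else 0.6
    if policy == "bounded":
        return 0.3
    return 0.4  # original


choices = dynamic_feedback(model, ["original", "bounded", "aggressive"], 1.0, 7.0, 30.0)
print(choices)
```

Because the simulation resamples before every production phase, the selected policy changes once the assumed overhead of the Aggressive version rises.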
Guaranteed Performance Bounds
• Assumptions:
  • Overhead Changes Bounded by Exponential Decay Functions
• Worst Case Scenario:
  • No Useful Work During Sampling Phase
  • Sampled Overheads Are Same For All Versions
  • Overhead of Selected Version Increases at Maximum Rate
  • Overhead of Other Versions Decreases at Maximum Rate
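[Figure: worst-case overhead vs. time over sampling (S) and production (P) intervals, with the level V0 marked on the overhead axis]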
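Guaranteed Performance Bound
Definition 1. Policy $p_i$ is at most $\epsilon$ worse than policy $p_j$ over a time interval $T$ if
$$\mathit{Work}_j^{T} - \mathit{Work}_i^{T} \le \epsilon T, \quad \text{where} \quad \mathit{Work}_i^{T} = \int_0^{T} (1 - o_i(t))\,dt$$
Definition 2. Dynamic feedback is at most $\epsilon$ worse than the optimal if
$$\mathit{Work}_{opt}^{P+SN} - \mathit{Work}_0^{P} \le \epsilon\,(P + SN), \quad \text{where} \quad \mathit{Work}_{opt}^{P+SN} = \int_0^{P+SN} (1 - o_1(t))\,dt$$
Result 1. To guarantee this bound, with $\lambda$ the decay rate of the exponential bound on overhead changes:
$$(1 - \epsilon)\,P + \frac{1}{\lambda}\,e^{-\lambda P} \le (\epsilon - 1)\,SN + \frac{1}{\lambda}$$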
Guaranteed Performance Bounds
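[Figure: constraint values vs. production interval $P$, plotting $(1 - \epsilon) P + (1/\lambda) e^{-\lambda P}$ against $(\epsilon - 1) SN + (1/\lambda)$; the feasible region is the range of $P$ where the first curve lies below the second]
• Production Interval Too Long: May Execute Suboptimal Policy for Long Time
• Production Interval Too Short: Unable to Amortize Sampling Overhead
• Basic Constraint: Decay Rate ($\lambda$) Must be Small Enough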
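The feasible region can be explored numerically with a short sketch. It assumes the constraint has the form (1 - eps) P + (1/lam) exp(-lam P) <= (eps - 1) S N + 1/lam, where `eps` (the performance bound) and `lam` (the decay rate) are names chosen here for the two Greek symbols; the numeric values are illustrative and do not come from the talk.

```python
import math


def bound_holds(P, S, N, eps, lam):
    """Check (1 - eps) * P + (1/lam) * exp(-lam * P) <= (eps - 1) * S * N + 1/lam."""
    lhs = (1 - eps) * P + math.exp(-lam * P) / lam
    rhs = (eps - 1) * S * N + 1 / lam
    return lhs <= rhs


def feasible_intervals(S, N, eps, lam, candidates):
    """Return the candidate production intervals P that satisfy the constraint."""
    return [P for P in candidates if bound_holds(P, S, N, eps, lam)]


# Illustrative values only: S in seconds, lam in 1/second.
S, N, eps = 0.01, 3, 0.5
candidates = [0.01, 0.1, 1, 5, 10, 20, 30, 40]

feasible = feasible_intervals(S, N, eps, 0.065, candidates)
infeasible_case = feasible_intervals(S, N, eps, 100.0, candidates)
print(feasible)
print(infeasible_case)
```

With the smaller decay rate, the shortest and the longest candidate intervals fall outside the feasible region; with the much larger decay rate, no candidate satisfies the constraint.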
Dynamic Feedback: Implementation
• Code Generation
• Measuring Policy Overhead
• Interval Selection
• Interval Expiration
• Policy Switch
Code Generation
• Statically Generate Different Code Versions for Each Policy
  • Alternative: Dynamic Code Generation
• Advantages of Static Code Generation:
  • Simplicity of Implementation
  • Fast Policy Switching
• Potential Drawback of Static Code Generation
  • Code Size (In Practice Not a Problem)
Measuring Policy Overhead
• Sources of Overhead
  • Locking Overhead
  • Waiting Overhead
• Compute Locking Overhead
  • Count Number of Executed Acquire/Release Constructs
• Estimate Waiting Overhead
  • Count Number of Spins on Locks Waiting to be Released
$$\text{Sampled Overhead} = \frac{(\text{Number of Acquire/Release} \times \text{Acquire/Release Execution Time}) + (\text{Number of Spins} \times \text{Spin Time})}{\text{Sampling Time}}$$
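As a worked example of the sampled overhead formula, the sketch below computes the overhead from the two counters. The function name, parameter names and the numbers are hypothetical; all times are in microseconds.

```python
def sampled_overhead(num_acquire_release, acquire_release_time,
                     num_spins, spin_time, sampling_time):
    """Fraction of the sampling interval spent on locking and waiting."""
    locking = num_acquire_release * acquire_release_time  # locking overhead
    waiting = num_spins * spin_time                       # estimated waiting overhead
    return (locking + waiting) / sampling_time


# Illustrative numbers: 1000 acquire/release pairs at 5 us each and
# 2000 spins at 1 us each, during a 10 ms (10000 us) sampling interval.
overhead = sampled_overhead(1000, 5, 2000, 1, 10000)
print(overhead)
```

Here locking accounts for 5000 of the 10000 microseconds and waiting for another 2000, so the sampled overhead is 0.7.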
Interval Selection and Expiration
• Fixed Interval Values
  • Sampling Interval: 10 milliseconds
  • Production Interval: 10 seconds
  • Good Results for Wide Range of Interval Values
• Polling Code for Expiration Detection
  • Location: Back Edges of Parallel Loop
  • Advantage: Low Overhead
  • Disadvantage: Potential Interaction with Iteration Size
[Figure: atomic operations in a parallel loop, with polling points marked]
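The interaction between polling on loop back edges and iteration size can be sketched with a simulated timer. `run_until_expired` is a hypothetical helper, and the iteration costs are invented time units rather than measurements.

```python
def run_until_expired(iteration_costs, interval):
    """Execute iterations, polling a simulated timer at each loop back edge.

    Returns (iterations_executed, elapsed_time). Expiration is noticed only
    at a back edge, so a long iteration delays detection.
    """
    elapsed = 0
    executed = 0
    for cost in iteration_costs:
        elapsed += cost          # body of one loop iteration
        executed += 1
        if elapsed >= interval:  # polling point on the back edge
            break
    return executed, elapsed


small = run_until_expired([1] * 100, 10)  # fine-grained iterations
large = run_until_expired([7] * 100, 10)  # coarse-grained iterations
print(small, large)
```

With small iterations the interval ends on time; with large iterations the expiration is detected only after the interval has been overshot.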
Policy Switch
• Synchronous
  • Processors Poll Timer to Detect Interval Expiration
  • Barrier At End of Each Interval
• Advantages:
  • Consistent Transitions
  • Clean Overhead Measurements
• Disadvantages:
  • Need to Synchronize All Processors
  • Potential Idle Time At Barrier
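A synchronous switch can be sketched with a barrier whose action changes the policy once every worker has arrived. This is a simplified illustration: it uses Python threads in place of processors, cycles through the policies round-robin instead of selecting by measured overhead, and all names other than the policy names are assumptions.

```python
import threading

NUM_WORKERS = 4
POLICIES = ["original", "bounded", "aggressive"]  # policy names from the slides
current = {"index": 0}                  # shared policy selector
log = [[] for _ in range(NUM_WORKERS)]  # policy each worker ran in each interval


def switch_policy():
    # Barrier action: called by one thread once all workers have arrived,
    # so every worker sees the same policy in the next interval.
    current["index"] = (current["index"] + 1) % len(POLICIES)


barrier = threading.Barrier(NUM_WORKERS, action=switch_policy)


def worker(wid, num_intervals):
    for _ in range(num_intervals):
        policy = POLICIES[current["index"]]
        log[wid].append(policy)  # stand-in for running one interval under `policy`
        barrier.wait()           # interval expired: synchronize, then switch


threads = [threading.Thread(target=worker, args=(w, 3)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log[0])
```

Every worker records the same policy sequence, which is the consistent transition the barrier buys; the cost is that fast workers idle at `barrier.wait()` until the slowest one arrives.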
Experimental Results
• Parallelizing Compiler Based on Commutativity Analysis [PLDI’96]
• Set of Complete Scientific Applications
  • Barnes-Hut N-Body Solver (1500 lines of C++)
  • Liquid Water Simulation Code (1850 lines of C++)
  • Seismic Modeling String Code (2050 lines of C++)
• Different Lock Coarsening Policies
• Dynamic Feedback
• Performance on Stanford DASH Multiprocessor
Code Sizes
[Figure: bar charts of text segment size (Kbytes, 0 to 60) for the Serial, Original and Dynamic versions of Barnes-Hut, Water and String]
Lock Overhead
[Figure: bar charts of percentage lock overhead (0 to 60); Barnes-Hut (16K Particles): Original, Bounded, Aggressive; Water (512 Molecules): Original, Bounded, Aggressive; String (Big Well Model): Original, Aggressive]
Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks
Contention Overhead
[Figure: contention percentage (0 to 100) vs. number of processors (0 to 16) for the Original, Bounded and Aggressive policies; panels: Barnes-Hut (16K Particles), Water (512 Molecules), String (Big Well Model)]
Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors
Performance Results: Barnes-Hut
[Figure: speedup (0 to 16) vs. number of processors (0 to 16) for Barnes-Hut on DASH (16K Particles); curves: Ideal, Aggressive, Dynamic Feedback, Bounded, Original]
Performance Results: Water
[Figure: speedup (0 to 16) vs. number of processors (0 to 16) for Water on DASH (512 Molecules); curves: Ideal, Bounded, Original, Aggressive, Dynamic Feedback]
Performance Results: String
[Figure: speedup (0 to 16) vs. number of processors (0 to 16) for String on DASH (Big Well Model); curves: Ideal, Original, Aggressive, Dynamic Feedback]
Summary
• Code Size Is Not An Issue
• Lock Coarsening Has Significant Performance Impact
• Best Lock Coarsening Policy Varies With Application
• Dynamic Feedback Delivers Code With Performance Comparable to The Best Static Lock Coarsening Policy
Related Work
• Adaptive Execution Techniques (Saavedra Park:PACT96)
• Dynamic Dispatch Optimizations (Hölzle Ungar:PLDI94)
• Dynamic Code Generation (Engler:PLDI96)
• Profiling (Brewer:PPoPP95)
• Synchronization Optimizations (Plevyak et al:POPL95)
Conclusions
• Dynamic Feedback
  • Generated Code Adapts to Different Execution Environments
• Integration with Parallelizing Compiler
  • Irregular Object-Based Programs
  • Pointer-Based Linked Data Structures
  • Commutativity Analysis
• Evaluation with Three Complete Applications
  • Performance Comparable to Best Hand-Tuned Optimization
BACKUP SLIDES
Performance Results: Barnes-Hut
[Figure: speedup (0 to 16) vs. number of processors (0 to 16) for Barnes-Hut (16K Particles); curves: Ideal, Aggressive, Bounded, Original]
Performance Results: Water
[Figure: speedup (0 to 16) vs. number of processors (0 to 16) for Water (512 Molecules); curves: Ideal, Aggressive, Bounded, Original]
Performance Results: String
[Figure: speedup (0 to 16) vs. number of processors (0 to 16) for String (Big Well Model); curves: Ideal, Original, Aggressive]
Policy Switch
[Figure: timeline of Policy 1 followed by Policy 2, with a "Timer Expires" marker at the end of each policy's interval]
Motivation
Challenges:
• Match Best Implementation to Environment
• Heterogeneous and Mobile Systems
Goal:
• Develop Mechanisms to Support Code that Adapts to Environment Characteristics
Technique:
• Dynamic Feedback
Overhead for Barnes-Hut
[Figure: sampled overhead (0 to 0.5) vs. execution time (0 to 25 seconds) for the Original, Aggressive and Bounded policies; Barnes-Hut on DASH (8 Processors), FORCES loop, data set: 16K Particles]
Overhead for Water
[Figure: sampled overhead (0 to 0.5) vs. execution time (0 to 60 seconds) for the Original and Bounded policies; Water on DASH (8 Processors), INTERF loop, data set: 512 Molecules]
Overhead for Water
[Figure: sampled overhead (0 to 1) vs. execution time (0 to 60 seconds) for the Aggressive and Original policies; Water on DASH (8 Processors), POTENG loop, data set: 512 Molecules]
Overhead for String
[Figure: sampled overhead (0 to 1) vs. execution time (0 to 500 seconds) for the Aggressive and Original policies; String on DASH (8 Processors), PROJFWD loop, data set: Big Well]
Dynamic Feedback
[Figure: overhead vs. time for the Aggressive, Original and Bounded code versions across a sampling phase, a production phase and a second sampling phase; the Code Version track shows Aggressive]