gennette gill montek singh bottleneck analysis and alleviation in pipelined systems: a fast...
Gennette Gill, Montek Singh
Bottleneck Analysis and Alleviation in Pipelined Systems: A Fast Hierarchical Approach
Univ. of North Carolina, Chapel Hill, NC, USA
Part of a Larger Design Flow
Big Picture:
  From high-level specifications to asynchronous implementations
  Design space exploration (this work) is part of overall flow
[Figure: design flow from High-level Specification to Implementation]
This work:
  Uses various optimizations together in one tool
  Exploits circuit hierarchy to accelerate analysis/optimization
Our Contribution
Identify bottlenecks in a pipelined system
  Recognize multiple components that limit throughput
  Bottlenecks represented in a Boolean expression
Classify bottlenecks
  Latency, cycle time, and occupancy dependent
Choose which transformation(s) apply
  Given a list of possible transforms
  List is open ended; allows for additions
Background: Pipelines and Canopy Graphs
Background: Asynchronous Pipelines
Each stage characterized by three delays:
  Forward latency, Lf
    time for data to propagate forward
  Reverse latency, Lr
    time for a stage to receive and process ack
    time for a 'hole' to travel backward
  Cycle time, T = Lf + Lr (typically)
Throughput, tpt = 1 / cycle time
[Figure: an abstracted view of the pipeline; each stage is annotated Lf/Lr and has a controller, an L block and logic, with req and ack signals between stages. Caption: cycle time in an asynchronous pipeline]
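The delay definitions above can be written down directly. The following is a minimal sketch; the `Stage` class and its field names are illustrative and not part of the original slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline stage (hypothetical helper type)."""
    lf: float  # forward latency, Lf
    lr: float  # reverse latency, Lr

    @property
    def cycle_time(self) -> float:
        # T = Lf + Lr, the typical case stated on the slide
        return self.lf + self.lr

    @property
    def throughput(self) -> float:
        # tpt = 1 / cycle time
        return 1.0 / self.cycle_time


stage = Stage(lf=3.0, lr=1.0)
print(stage.cycle_time, stage.throughput)  # 4.0 0.25
```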
Background: Pipeline Rings
Throughput of ring depends on occupancy (#items)
  For small #items: underutilization limits throughput
  For small #holes: congestion limits throughput
  Throughput also limited by the slowest stage
  Graph is a convex shape: "Canopy Graph"
[Figure: canopy graph of ring throughput vs. ring occupancy (0 to N), with three regions: data limited, limited by slowest stage, hole limited]
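The three limits on ring throughput can be sketched as the minimum of three terms. This uses a common piecewise-linear model, which is an assumption here rather than something the slides spell out: data-limited throughput is taken as k divided by the total forward latency, and hole-limited throughput as the number of holes divided by the total reverse latency. The function name `ring_throughput` is hypothetical.

```python
def ring_throughput(stages, k):
    """Throughput of a ring holding k items.

    stages: list of (Lf, Lr) pairs, one per stage.
    Assumed model: min of data-limited, hole-limited and
    slowest-stage-limited throughput.
    """
    n = len(stages)
    if not 0 <= k <= n:
        raise ValueError("occupancy must be between 0 and the number of stages")
    total_lf = sum(lf for lf, _ in stages)
    total_lr = sum(lr for _, lr in stages)
    data_limited = k / total_lf            # few items: underutilization
    hole_limited = (n - k) / total_lr      # few holes: congestion
    stage_limited = 1.0 / max(lf + lr for lf, lr in stages)
    return min(data_limited, hole_limited, stage_limited)


ring = [(2.0, 1.0)] * 8
for k in range(len(ring) + 1):
    print(k, ring_throughput(ring, k))
```

Plotting the printed values against k gives the rising, flat and falling segments of the canopy shape.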
Background: Composition
Sequential Composition [Lines98]
Parallel Composition [Lines98]
[Figure: for each composition of pipelines A and B, the canopy graphs of A, B and the combined pipeline AB; axes: throughput vs. pipeline occupancy]
Conditionals
Conditional branches:
  Implement if-then-else non-speculatively
  Split sends data along only one path
  Boolean decision determines path
  Merge also uses Boolean; maintains order of data
Performance depends on:
  Canopy graphs of then and else branches
  Boolean probability of choosing each branch
[Figure: conditional construct with fork, split, then and else branches, and merge; labels: data in, data out, boolean]
Conditionals
Simplifying assumption (relaxed later)
  Boolean choices evenly distributed given a probability
  e.g., p0 = 2/3 → 001001001…
Constraints on joint operation:
  Each branch's occupancy (k) ∝ its probability: $\frac{k_0}{p_0} = \frac{k_1}{p_1}$
    Why? Because items must exit in order
  Each branch's throughput ∝ its probability: $\frac{tpt_0}{p_0} = \frac{tpt_1}{p_1}$
Throughput of composition:
  Scale each branch's canopy graph by $1/p_i$
  Intersect the scaled canopy graphs
[Figure: conditional construct with fork, split, then and else branches, and merge; labels: data in, data out, boolean]
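The scale-and-intersect step can be sketched as follows. This is a sketch under stated assumptions: each branch is given as a pair of its probability and a canopy function, a branch with probability p holds a share p of the total occupancy, and the composition runs at the smallest scaled branch throughput. The names `make_canopy` and `conditional_throughput`, and the stage delays in the example, are hypothetical.

```python
def make_canopy(n, lf, lr):
    """Canopy graph of n identical stages (hypothetical helper, assumed model)."""
    def canopy(k):
        limit = min(k / (n * lf), (n - k) / (n * lr), 1.0 / (lf + lr))
        return max(0.0, limit)  # clamp occupancies outside the valid range
    return canopy


def conditional_throughput(branches, k):
    """Throughput of a conditional at total occupancy k.

    branches: list of (p_i, canopy_i) pairs with every p_i > 0.
    Branch i holds k_i = p_i * k items; its canopy graph is scaled
    by 1/p_i, and the scaled graphs are intersected with min().
    """
    if any(p <= 0 for p, _ in branches):
        raise ValueError("every branch needs a positive probability")
    return min(canopy(p * k) / p for p, canopy in branches)


branches = [
    (0.5, make_canopy(10, 1.0, 1.0)),  # fast branch
    (0.5, make_canopy(10, 3.0, 1.0)),  # slow branch
]
print(conditional_throughput(branches, 10))  # about 0.33, set by the slow branch
```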
Conditionals: A Simple Example
Example: pipelined implementation of CRC algorithm
[Figure: conditional with branch0 and branch1 between split and merge; branch labels: 10 stages, 9 stages; stage annotations include 1/1, 5/1, 3/1 and 2/1]
[Plots: throughput vs. occupancy (0 to 50) for branch0 and branch1; throughput vs. probability (0.0 to 1.0); throughput axis 0.0 to 0.8]
Throughput of composition: $\min\left(\frac{tpt_0}{p_0}, \frac{tpt_1}{p_1}\right)$
Conditionals: Example with Slack Mismatch
Slack mismatch implicitly handled by analysis method
[Figure: conditional with branch0 and branch1 between split and merge; branch labels: 10 stages, 9 stages; stage annotations include 1/1, 5/1, 3/1 and 2/1]
[Plots: throughput vs. occupancy (0 to 40) for branch0 and branch1, one panel labeled "slack matched" and one labeled "slack mismatch"; a further plot of throughput (0.00 to 0.35) vs. occupancy (0 to 20)]
Conditionals: Generalized Choice Model
Extend to more general choice model:
  Until now: assumed non-clustered decisions
  Now: consider clustering
    Allow arbitrary runs of 0's and 1's for decisions
    Long runs reduce throughput: other branch is underutilized
Our Analysis Approach:
  Introduces a "clustering factor" to quantify decision run lengths
    e.g., for random uncorrelated data: ave. run length of 0's is 1/p1
  Analysis approach can handle arbitrary amounts of clustering
[Plots: throughput (0.20 to 0.40) vs. probability of choosing branch1 (0.0 to 1.0), with curves labeled "non-clustered" and "random"; throughput vs. occupancy, annotated "acts as bottleneck"]
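The run-length claim for random uncorrelated decisions can be checked numerically. The sketch below compares the evenly distributed sequence 001001001… (p0 = 2/3) with seeded random decisions at the same probability; the helper names are hypothetical, and the slides do not define the clustering factor itself, so only mean run lengths are measured.

```python
import random
from itertools import groupby


def mean_run_length(bits, symbol=0):
    """Average length of maximal runs of `symbol` in a decision sequence."""
    runs = [len(list(group)) for value, group in groupby(bits) if value == symbol]
    return sum(runs) / len(runs) if runs else 0.0


def random_decisions(p1, n, seed=0):
    """n independent decisions, each 1 with probability p1 (seeded for repeatability)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p1 else 0 for _ in range(n)]


evenly_distributed = [0, 0, 1] * 1000          # p0 = 2/3, non-clustered
print(mean_run_length(evenly_distributed))      # 2.0
print(mean_run_length(random_decisions(1 / 3, 200_000)))  # close to 1/p1 = 3
```

For the same probabilities, the random sequence has longer runs of 0's than the evenly distributed one, which is the clustering effect the slide describes.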
Pipelined Loops
Analysis approach can handle single-token and multi-token loops
Loop's throughput depends on #iterations per item
  Assume given: #iterations/item or prob. of exiting the ring
  Note: Previous analysis looked at a different throughput
    #iterations/second, not #completions/second
[Figure: pipelined loop with stages s1 to s8, loop condition i < N, increment i++, and a loop interface]
Pipelined Loops
Analysis approach for loops:
  Construct canopy graph for loop body
  Scale down based on expected number of iterations
[Figure: loop built from a loop interface, a Boolean fork, and a fork/join with branch0 and branch1; stages annotated 5/1]
[Plot: throughput (0.00 to 0.15) vs. occupancy (0 to 8), with curves for the loop body and the overall loop]
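The scaling step for loops can be sketched as dividing the loop body's canopy graph by the expected number of iterations per item. The conversion from an exit probability to an expected iteration count assumes independent exit decisions (a geometric distribution), which the slides do not state; the function names are hypothetical.

```python
def expected_iterations(p_exit):
    """Expected iterations per item, assuming independent exit decisions."""
    if not 0 < p_exit <= 1:
        raise ValueError("exit probability must be in (0, 1]")
    return 1.0 / p_exit


def loop_throughput(body_canopy, k, iterations_per_item):
    """Completions per unit time: body throughput scaled down by iterations per item."""
    return body_canopy(k) / iterations_per_item


def body(k):
    # Hypothetical loop-body canopy graph: flat at 0.15 iterations per time unit
    return 0.15


print(loop_throughput(body, 4, expected_iterations(0.25)))  # 0.0375
```

The body's throughput counts iterations per second, while the scaled value counts completions per second, matching the distinction drawn on the previous slide.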
Bottleneck Identification
Our Definition of Bottleneck
Set of hierarchical nodes that limit canopy graph
Expressed as a Boolean combination
  e.g. n0 OR n2 OR n3 OR n5 AND n6
[Figure: hierarchy of nodes n0 to n6 (par nodes n0 and n2, seq node n4, leaves n1, n3, n5, n6) beside a canopy graph of throughput vs. occupancy]
What caused this segment?
  Usually more than one node
  Often several conspire together
Find Limiting Segments
Begin with top segment of root node
Find which child/children contribute to segment
  If more than one, is it AND or OR blame?
Continue to lower levels of hierarchy
[Figure: par node n0 with children n1 and n2, beside a canopy graph of throughput vs. occupancy annotated "What sets this limit?"]
Find Limiting Segments: example 1
Scenario: parallel operator limited by slow child
  n1 and n2 contribute to bottleneck
  bottleneck(top_n0) = top_n0 OR bottleneck(top_n1) AND bottleneck(top_n2)
  next, find which children of n1 and n2 limit throughput
[Figure: par node n0 with children n1 and n2, beside the canopy graphs of n1 and n2 (throughput vs. occupancy), annotated "What sets this limit?"]
Find Limiting Segments: example 2
Scenario: parallel operator limited by slack mismatch
  changing n3 or n4 could fix bottleneck
  bottleneck(top_n2) = top_n2 OR bottleneck(reverse_n3) OR bottleneck(forward_n4)
To fix, change n2, n3, or n4
[Figure: par node n2 with children n3 and n4, beside the canopy graphs of n3 and n4 (throughput vs. occupancy), annotated "What sets this limit?"]
Find Limiting Segments: example 3
Scenario: sequential operator limited by forward latency
  changing n5 or n6 could fix the bottleneck
  bottleneck(forward_n4) = forward_n4 OR bottleneck(forward_n5) OR bottleneck(forward_n6)
If just one child is slow, only one contributes
[Figure: seq node n4 with children n5 and n6, beside the canopy graphs of n4, n5 and n6 (throughput vs. occupancy), annotated "What sets this limit?"]
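The recursive blame rules in the three examples can be sketched as a small expression builder. Several things here are assumptions made for illustration: the three scenarios are chained into one hierarchy, a segment with no recorded children blames only itself, and an expression is read as "which sets of changes could address the bottleneck" (OR: any one suffices, AND: all are needed). The names `CHILD_BLAME`, `bottleneck`, `alleviated` and `show` are hypothetical.

```python
# How each limiting segment's children share the blame (one entry per example).
CHILD_BLAME = {
    "top_n0": ("AND", ["top_n1", "top_n2"]),             # example 1: slow children
    "top_n2": ("OR", ["reverse_n3", "forward_n4"]),      # example 2: slack mismatch
    "forward_n4": ("OR", ["forward_n5", "forward_n6"]),  # example 3: forward latency
}


def bottleneck(segment):
    """Build bottleneck(segment) = segment OR <combined blame of its children>."""
    if segment not in CHILD_BLAME:
        return segment  # nothing recorded below: the segment blames itself
    op, children = CHILD_BLAME[segment]
    return ("OR", segment, (op, *[bottleneck(child) for child in children]))


def alleviated(expr, changed):
    """True if changing the segments in `changed` satisfies the expression."""
    if isinstance(expr, str):
        return expr in changed
    op, *args = expr
    results = [alleviated(arg, changed) for arg in args]
    return all(results) if op == "AND" else any(results)


def show(expr):
    """Render an expression with explicit parentheses."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    return "(" + f" {op} ".join(show(arg) for arg in args) + ")"


expr = bottleneck("top_n0")
print(show(expr))
print(alleviated(expr, {"top_n1"}))                # False: n2's side is untouched
print(alleviated(expr, {"top_n1", "forward_n6"}))  # True: both AND operands covered
```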
Bottleneck Alleviation
Bottleneck Categorization
Categories based on which c.g. (canopy graph) segment limits tpt
  Type I: Latency Dependent
  Type II: Cycle Time Dependent
  Type III: Occupancy Dependent
[Figure: canopy graph of a component C (throughput vs. occupancy) with segments labeled forward_C, top_C, reverse_{1,C} and reverse_{0,C}]
TRansformations for Increasing the Canopy Graph
A TRIC increases throughput for some occupancies
Idea: collect a bag of TRICs
  Categorize circuit optimizations by bottleneck type
  Use different optimizations in one framework
Effects of a few example TRICs:
[Figure: three canopy graphs (throughput vs. occupancy) showing the effect of a TRIC. Panel titles, in order: Parallelization, Buffer Insertion, Stage Splitting. Panel captions, in order: "Fixes: Type I"; "Fixes: Type III, Causes: Type I"; "Fixes: Type II, Causes: Types I, III"]
Suggestions for additional TRICs needed.
Applying TRICs
Tool lists TRICs that alleviate current bottleneck
Designer chooses one option
Check for next bottleneck as needed

TRIC              Type I   Type II   Type III
Coalescing        ✔        X         X
Parallelization   ✔        -         X
Stage Splitting   X        ✔         ✔
Loop Pipelining   X        ✔         ✔
Duplication       -        ✔         ✔
Loop Unrolling    -        ✔         ✔
Buffer Insertion  X        -         ✔
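The lookup the tool performs can be sketched as a filter over the table above. The slides do not define the table symbols, so this sketch assumes that ✔ marks a TRIC that alleviates a bottleneck of that type; `TRIC_TABLE` and `candidate_trics` are hypothetical names.

```python
# Table from the slide, one row per TRIC; symbols copied as shown.
TRIC_TABLE = {
    "Coalescing":       {"I": "✔", "II": "X", "III": "X"},
    "Parallelization":  {"I": "✔", "II": "-", "III": "X"},
    "Stage Splitting":  {"I": "X", "II": "✔", "III": "✔"},
    "Loop Pipelining":  {"I": "X", "II": "✔", "III": "✔"},
    "Duplication":      {"I": "-", "II": "✔", "III": "✔"},
    "Loop Unrolling":   {"I": "-", "II": "✔", "III": "✔"},
    "Buffer Insertion": {"I": "X", "II": "-", "III": "✔"},
}


def candidate_trics(bottleneck_type):
    """List the TRICs marked ✔ for the given type ("I", "II" or "III")."""
    return [name for name, row in TRIC_TABLE.items() if row[bottleneck_type] == "✔"]


print(candidate_trics("I"))  # ['Coalescing', 'Parallelization']
```

The designer would pick one entry from the returned list, apply it, and re-run the bottleneck analysis.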
Results
Successful with 20% throughput goals on examples
Suggest examples, please.

Example       Throughput             # iter   Type             TRICs
              orig    goal    final           I    II   III
CRC           286     342     345    4        1    0    3      coalesce; add buffers
Cordic cond   90.9    109     111    2        0    0    2      add buffers
Cordic        83.3    100     101    2        0    1    2      split stages
Diffeq        182     218     267    1        3    0    0      split stages; duplicate
Mult          38.4    46.2    62.5   6        5    0    1      coalesce; add buffers
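The achieved improvement for each example follows from the orig, goal and final columns. The sketch below recomputes it from the table values; the dictionary layout and the helper name `gain_percent` are illustrative only.

```python
# (orig, goal, final) throughput per example, copied from the results table.
RESULTS = {
    "CRC":         (286, 342, 345),
    "Cordic cond": (90.9, 109, 111),
    "Cordic":      (83.3, 100, 101),
    "Diffeq":      (182, 218, 267),
    "Mult":        (38.4, 46.2, 62.5),
}


def gain_percent(orig, final):
    """Throughput improvement of final over orig, in percent."""
    return 100.0 * (final / orig - 1.0)


for name, (orig, goal, final) in RESULTS.items():
    met = final >= goal
    print(f"{name:12s} gain {gain_percent(orig, final):5.1f}%  goal met: {met}")
```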
Conclusion & Future Work
This Work:
  Employed multiple microarch. optimizations in one tool
  User-guided application to a few examples
More is needed:
  Clever ways to automate
  Additions to the bag of TRICs
  More examples and applications