Probabilistic Predicate-Aware
Modulo Scheduling
Mikhail Smelyanskiy1, Scott Mahlke, Edward Davidson
Department of EECS
University of Michigan
1 Currently with the System Technology Lab at Intel Corporation
Introduction to Deterministic Predicate-aware Scheduling (DPAS)
[Smelyanskiy03]
Predication eliminates branch instructions
• but increases resource requirements
Predicate-aware scheduling oversubscribes resources
• reduces resource requirements
• reduces schedule length
[Figure: control-flow graph in which block A ends with "br cond" and branches (F/T) to B or C, which both lead to D]
Time FU
0 A
1 p1,p2=cmpp(cond)
2 B if p1
3 C if p2
4 D
Time FU
0 A
1 p1,p2=cmpp(cond)
2 B if p1 C if p2
3 D
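The sharing test behind the second schedule can be sketched as follows. This is only an illustration, not the scheduler from [Smelyanskiy03]: the function name `can_share_deterministically` and the representation of predicates as Python callables are assumptions made for the example. Two predicated operations may occupy the same slot deterministically only if their predicates are never true together.

```python
from itertools import product

def can_share_deterministically(pred_a, pred_b, conditions):
    """Return True if pred_a and pred_b are never both true.

    pred_a and pred_b are hypothetical callables that map an assignment
    of branch conditions (a dict) to the value of a predicate.
    """
    for values in product([False, True], repeat=len(conditions)):
        env = dict(zip(conditions, values))
        if pred_a(env) and pred_b(env):
            return False  # both operations could need the resource
    return True

# p1, p2 = cmpp(cond): p1 is true when cond holds, p2 when it does not.
p1 = lambda env: env["cond"]
p2 = lambda env: not env["cond"]

print(can_share_deterministically(p1, p2, ["cond"]))  # B if p1, C if p2
print(can_share_deterministically(p1, p1, ["cond"]))  # same predicate twice
```

Under this sketch, "B if p1" and "C if p2" may share one FU slot, while two operations guarded by the same predicate may not.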
Motivation for Probabilistic Predicate-aware Scheduling (PPAS)
DPAS can only combine A5 with A2, A3 and A4
What about combining
• A2 with A3 ?
• A3 with A4 ?
• A2 with A6 ?
PPAS allows much more aggressive sharing than DPAS but can result in delay due to resource conflict
[Figure: control-flow graph of a predicated region containing operations A1, M1, A2, A3, A4, A5, A6, M2 and a branch (br)]
Characteristics of Predicated Code
52% of time is spent in cyclic regions
Cyclic PPAS might eliminate up to 38% of all dynamic operations from cyclic regions
[Figure: stacked bar chart (0% to 100%) per benchmark: cjpeg, djpeg, epic, unepic, g721encode, g721decode, ghostscript, gsmdecode, gsmencode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rasta, rawcaudio, rawdaudio, Average. Legend: %C_false, %C_true, Others (Cyclic with no predicated operations + Acyclic)]
Outline
• Motivation
• Resource Pressure Problem in Predicated Code
• Probabilistic Predicate Aware Architecture
• Probabilistic Predicate-aware Modulo Scheduling
• Performance Results
• Conclusions
Modulo Scheduling Example
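[Figure: loop dependence graph with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; edges are labeled with latencies, and +2 if p1 and +3 if p1 are annotated freq=0.3]
This control path is taken 30% of the time
Assumed machine:
• 1 ALU, 1 MEMORY and 1 BRANCH units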
Traditional Modulo Schedule (Rau 94)
Modulo Schedule
Time | Iteration i | Iteration i + 1
0 | +1 |
1 | |
2 | p1=cmpp |
3 | +2 if p1, br |
4 | | +1
5 | +3 if p1 |
6 | st | p1=cmpp
7 | | +2 if p1, br
8 | |
9 | | +3 if p1
10 | | st
Modulo Scheduled Loop Kernel
Row | ALU | MEM | BR
I0 | +1 | |
I1 | +3 if p1 | |
I2 | p1=cmpp | st |
I3 | +2 if p1 | | br
II=4
II=5
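The step from the flat schedule to the loop kernel on this slide can be sketched in a few lines: each operation scheduled at time t in one iteration lands in kernel row t mod II. The function name `kernel_rows` and the dictionary layout are assumptions made for this illustration; the issue times are the ones shown for iteration i.

```python
from collections import defaultdict

def kernel_rows(flat_schedule, ii):
    """Fold the issue times of one iteration into kernel rows (time mod II)."""
    rows = defaultdict(list)
    for op, time in flat_schedule.items():
        rows[time % ii].append(op)
    # Sort for a stable, readable result.
    return {row: sorted(ops) for row, ops in sorted(rows.items())}

# Issue times of iteration i, read off the modulo schedule on this slide.
iteration = {
    "+1": 0,
    "p1=cmpp": 2,
    "+2 if p1": 3,
    "br": 3,
    "+3 if p1": 5,
    "st": 6,
}

kernel = kernel_rows(iteration, ii=4)
for row, ops in kernel.items():
    print(f"I{row}: {', '.join(ops)}")
```

With II=4 this reproduces the four kernel rows I0 to I3 of the slide; the assignment of each operation to ALU, MEM or BR is omitted from the sketch.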
Probabilistic Predicate-Aware Modulo Scheduling
Deterministic Predicate-Aware Modulo Schedule (II = 4)
Time | A | M | B
0 | +1 | |
1 | +3 if p1 | |
2 | p1=cmpp | st |
3 | +2 if p1 | | br
Probabilistic Predicate-Aware Modulo Schedule (II = 3.18)
Time | A | M | B
0 | +1 | |
1 | +2 if p1, +3 if p1 | |
2 | p1 = cmpp | st | br
0.18 expected delay due to conflicts
[Figure: loop dependence graph with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; +2 if p1 and +3 if p1 are annotated freq=0.3]
Baseline Modulo Schedule (II = 4)
Time | A | M | B
0 | +1 | |
1 | +3 if p1 | |
2 | p1=cmpp | st |
3 | +2 if p1 | | br
Baseline Architecture Model
Predicate Register File is only accessed in the EXECUTE stage
Resources from FETCH to EXECUTE are unconditionally reserved
[Figure: baseline pipeline with stages FETCH, DISPATCH, DECODE, REGISTER READ, PRED READ & EXECUTE and WRITE BACK; the Predicate Register File is attached to the PRED READ & EXECUTE stage; stage resources are marked as must-use or may-use]
Extended Predicate-Aware Architecture
[Figure: extended pipeline with stages FETCH, PRED READ & DISPATCH, DECODE, REGISTER READ, EXECUTE and WRITE BACK; the Predicate Register File (PRF) is attached to the PRED READ & DISPATCH stage; stage resources are marked as must-use or may-use. A Resource Conflict Detection and Recovery Unit performs conflict detection and conflict recovery and drives the stall signals]
Conflict Detection and Recovery Latency (CDRL) can be 0 or 1 cycles
Expected Delay Model
• ev is execution vector
• delay_cycles(ev) = CDRL + dispatch_cycles(ev) – 1
• P(ev) is probability of occurrence of ev
$$ ED_{cfl} = \sum_{\text{each group of conflicting predicated operations}} \; \sum_{ev} P(ev) \times \text{delay\_cycles}(ev) $$
P(ev) is computed using disjointness and implication, and assuming independence otherwise
Example (assume 3 operations, one FU and CDRL=1)
ED_cfl (op1 if p1, op2 if p2, op3 if p3) =
(1 + 3 - 1) × P(p1=T, p2=T, p3=T) +
(1 + 2 - 1) × P(p1=T, p2=T, p3=F) +
(1 + 2 - 1) × P(p1=F, p2=T, p3=T) +
(1 + 2 - 1) × P(p1=T, p2=F, p3=T)
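The expected delay model above can be checked numerically with a small sketch. It assumes that all predicates in the group are independent (the slides also use disjointness and implication, which this sketch does not model), and it assumes that with several FUs the conflicting operations need ceil(active / num_fus) dispatch cycles. The function name and parameters are illustrative only.

```python
from itertools import product
from math import ceil, prod

def expected_conflict_delay(probs, num_fus=1, cdrl=1):
    """Expected delay for one group of predicated operations sharing FUs.

    probs[i] is the probability that the predicate of operation i is true.
    Assumption: the predicates are mutually independent.
    """
    total = 0.0
    for ev in product([False, True], repeat=len(probs)):
        active = sum(ev)
        if active <= num_fus:
            continue  # this execution vector causes no conflict
        p_ev = prod(p if bit else 1.0 - p for p, bit in zip(probs, ev))
        dispatch_cycles = ceil(active / num_fus)
        total += p_ev * (cdrl + dispatch_cycles - 1)
    return total

# Two operations, each executed with probability 0.3, on one FU.
print(round(expected_conflict_delay([0.3, 0.3]), 2))
# An always-executed operation sharing with a 0.3-probability operation.
print(round(expected_conflict_delay([1.0, 0.3]), 2))
```

For three operations on one FU with CDRL=1, the loop visits exactly the four conflicting execution vectors listed in the example: one with weight 3 and three with weight 2.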
Modulo Scheduling using Expected Delay Model
(scheduling operation +3 if p1)
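[Figure: loop dependence graph with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; +2 if p1 and +3 if p1 are annotated freq=0.3]
SRT (candidate times for +3 if p1 are 5, 6 and 7)
Time | A may | MEM may | BR may
0 | +1 | |
1 | | |
2 | p1=cmpp | | br
3 | | |
4 | +2 if p1 | |
5 | +3 if p1 | |
6 | +3 if p1 | |
7 | +3 if p1 | |
8 | | st |
Expected Delay due to Conflicts (CDRL = 1)
• 2 × Pconf(+2, +3) = 2 × 0.3 × 0.3 = 0.18
• 2 × Pconf(+1, +3) = 2 × 1.0 × 0.3 = 0.60
• 2 × Pconf(p1=pred, +3) = 2 × 1.0 × 0.3 = 0.60
MRT (+3 if p1 placed in the row with the lowest expected delay)
Time | A may | MEM may | BR may | Expected delay
0 | +1 | | | 0
1 | +2 if p1, +3 if p1 | | | 0.18
2 | p1=cmpp | st | br | 0
total expected delay due to conflicts: 0.18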
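The placement decision for "+3 if p1" can be sketched as a search over candidate times that picks the MRT row with the smallest expected conflict delay. This is a simplified illustration, not the scheduler described in the talk: it looks only at the ALU column, assumes at most one operation already occupies each row, and assumes that operations sharing a row come from different iterations so that their predicates are independent. All names are hypothetical.

```python
def expected_pair_delay(p_a, p_b, cdrl=1):
    """Expected delay when two operations share one FU slot.

    A conflict occurs only when both execute; it costs
    CDRL + dispatch_cycles - 1 cycles with dispatch_cycles = 2.
    """
    return (cdrl + 2 - 1) * p_a * p_b

def best_slot(candidate_times, ii, mrt_alu, new_prob, cdrl=1):
    """Return (time, expected_delay) of the cheapest candidate time."""
    best = None
    for t in candidate_times:
        occupant = mrt_alu.get(t % ii)  # (name, execution probability) or None
        if occupant is None:
            delay = 0.0
        else:
            delay = expected_pair_delay(occupant[1], new_prob, cdrl)
        if best is None or delay < best[1]:
            best = (t, delay)
    return best

# ALU column of the MRT before "+3 if p1" is placed (II = 3).
mrt_alu = {0: ("+1", 1.0), 1: ("+2 if p1", 0.3), 2: ("p1=cmpp", 1.0)}
slot = best_slot([5, 6, 7], ii=3, mrt_alu=mrt_alu, new_prob=0.3)
print(slot[0], round(slot[1], 2))
```

With the probabilities from the slide, time 7 (row 1, shared with "+2 if p1") is the cheapest candidate, which matches the 0.18 total expected delay shown.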
Modulo Scheduling using Expected Delay Model
(Finding Expected Initiation Interval, IIexp)
More than one way to achieve the same II_exp (e.g. 3.2)
Schedule with II = 3: < 0.2 total expected conflict delay
Time | ALU | MEM | BR
0 | +1 | |
1 | +2 if p1, +3 if p1 | |
2 | p1=cmpp | st | br
Schedule with II = 2: < 1.2 total expected conflict delay
Time | ALU | MEM | BR
0 | +1, +2 if p1 | st |
1 | p1=cmpp, +3 if p1 | | br
Use binary search to find II_exp
• upper bound = II_static = ((1)(+1) + (1)(+2) + (1)(+3) + (1)(cmpp)) / 1 ALU = 4
• lower bound = ((1)(+1) + (0.3)(+2) + (0.3)(+3) + (1)(cmpp)) / 1 ALU = 2.6
• for a given II_exp, start with an initial II and increase it until II reaches II_exp or a schedule is found
• II_exp of schedule found becomes new upper bound
• II_exp becomes new lower bound if no schedule found
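The bounds and the search loop can be sketched as below. This is a minimal illustration under stated assumptions: `try_schedule` is a hypothetical callback standing in for the modulo scheduler (it returns the II_exp of a schedule that meets the target, or None), only the ALU operations are considered, and the tolerance `tol` is an arbitrary choice.

```python
def ii_bounds(op_freqs, num_units):
    """Bounds on II_exp for one resource class.

    Upper bound: every operation occupies a slot (static II).
    Lower bound: each operation weighted by its execution frequency.
    """
    upper = len(op_freqs) / num_units
    lower = sum(op_freqs) / num_units
    return lower, upper

def search_ii_exp(lower, upper, try_schedule, tol=0.01):
    """Binary search for the smallest achievable II_exp."""
    best = upper
    while upper - lower > tol:
        target = (lower + upper) / 2.0
        achieved = try_schedule(target)
        if achieved is not None:
            best = min(best, achieved)
            upper = min(achieved, target)  # schedule found: new upper bound
        else:
            lower = target                 # no schedule: new lower bound
    return best

# ALU operations of the example: +1, +2 if p1, +3 if p1, cmpp.
lower, upper = ii_bounds([1.0, 0.3, 0.3, 1.0], num_units=1)

# Hypothetical scheduler stub: the best schedule it can find has II_exp 3.18.
def fake_scheduler(target):
    return 3.18 if target >= 3.18 else None

result = search_ii_exp(lower, upper, fake_scheduler)
print(round(lower, 2), round(upper, 2), round(result, 2))
```

The stub only mimics the outcome reported on the earlier slide; a real scheduler would attempt an actual placement for each target.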
Performance Results
Compare the performance of baseline (BASE), deterministic (DPAS) and probabilistic (PPAS) predicate-aware modulo scheduling
Compiler Support
• Trimaran and ELCOR [Trimaran99]
Mediabench [Lee97] benchmark suite was evaluated
Processor Models (BA – base, PA – predicate-aware)
Machine | Scheduler | Fetch Width | Int ALU | cmpp latency | Memory | CDRL
4-wide | BASE | 4 | 2 | 1 | 1 | -
4-wide | DPAS | 4 | 2 | 3 | 1 | -
4-wide | PPAS | 4 | 2 | 3 | 1 | 0 and 1
6-wide | BASE | 6 | 4 | 1 | 2 | -
6-wide | DPAS | 6 | 4 | 3 | 2 | -
6-wide | PPAS | 6 | 4 | 3 | 2 | 0 and 1
Cyclic PPAS Speedup over BASE (4-wide machine)
4-wide cyclic PPAS with CDRL=0 is 20% better than base and 10% better than cyclic DPAS
Increasing CDRL from 0 to 1 degrades performance
[Figure: bar chart of cyclic speedup over BASE (y-axis 0.7 to 1.5). Series: DPAS, PPAS with CDRL=0, PPAS with CDRL=1]
Various Scheduling Measurements (4-wide machine, CDRL = 0)
Cyclic PPAS reduces II by 32% compared with BASE and by 12% compared with cyclic DPAS
Expected delay model accurately predicts delay due to conflicts
Predicate-aware scheduling increases the epilogue size and requires more rotating registers than BASE
Scheduler | II_compile | II_runtime | Absolute Error | Epilogue Size | # Rotating Registers
BASE | 27.6 | 27.6 | 0.0% | 1.2 | 14.8
DPAS | 23.5 | 23.5 | 0.0% | 2.2 | 18.4
PPAS | 21.0 | 20.8 | 1.6% | 4.7 | 29.9
Overall Speedup over BASE with Cyclic PPAS
[Figure: bar chart of overall speedup over BASE (y-axis 0.9 to 1.45) per benchmark: cjpeg, djpeg, epic, unepic, g721encode, g721decode, ghostscript, gsmdecode, gsmencode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rasta, rawcaudio, rawdaudio, Average, Average (DPAS). Series: 4-wide PPAS with CDRL=0, 6-wide PPAS with CDRL=0]
Only 52% of regions are scheduled with cyclic PPAS
Overall 4-wide cyclic PPAS is 10% better than base and 6-wide cyclic PPAS is 4% better than base
Summary of PPAS
PPAS significantly reduces resource requirements in predicated cyclic code but can cause resource conflicts
• compiler maximizes sharing in view of expected conflict
• PPAS architecture detects and recovers from conflicts
PPAS improves performance by
Overall (cmpplat=3, CDRL=0)
Cyclic vs. Base vs. DPAS
4-wide 20% 10% 6%
6-wide 8% 4% 3%
For further discussion, see
http://www.eecs.umich.edu/~msmelyan/publications.html
Mikhail Smelyanskiy. Hardware/Software Mechanisms for Increasing Resource Utilization on VLIW/EPIC Processors. Ph.D. Dissertation, University of Michigan, 2004
Questions?
Backup Foils
Resource Conflict Detection and Recovery Unit
Dispatch group in all examples: A1 if 1, A2 if 1, A3 if 1, A4 if 0, A5 if 1 (FUs: ALU1, ALU2)
Design alternatives to dispatch conflicting operations
• one operation per assigned FU
  cycle 0: A1, A5
  cycle 1: A2
  cycle 2: A3
• one operation per any FU (not evaluated)
  cycle 0: A1, A5
  cycle 1: A2, A3
Conflict Detection and Recovery Latency (CDRL)
• CDRL = 0
  cycle 0: A1, A5
  cycle 1: A2
  cycle 2: A3
• CDRL = 1
  cycle 0: conflict detected (dispatch bubble)
  cycle 1: A1, A5
  cycle 2: A2
  cycle 3: A3
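The "one operation per assigned FU" alternative with a CDRL bubble can be sketched as a small dispatch simulation. The slide does not state which FU each operation is assigned to, so the assignment below (A1 to A3 on ALU1, A4 and A5 on ALU2) is an assumption chosen to be consistent with the dispatch order shown; the function name and data layout are likewise illustrative.

```python
def dispatch(ops, cdrl=0):
    """Simulate dispatch of one group with one operation per assigned FU.

    ops is a list of (name, predicate_value, assigned_fu) tuples.
    Returns a list of cycles, each a list of dispatched operation names.
    """
    queues = {}
    for name, pred, fu in ops:
        if pred:  # operations with a false predicate release their slot
            queues.setdefault(fu, []).append(name)
    depth = max((len(q) for q in queues.values()), default=0)
    cycles = []
    if depth > 1:
        # A conflict was detected: insert CDRL dispatch bubbles first.
        cycles.extend([] for _ in range(cdrl))
    for i in range(depth):
        cycles.append([q[i] for q in queues.values() if i < len(q)])
    return cycles

# Hypothetical FU assignment for the group shown on this slide.
group = [
    ("A1", 1, "ALU1"), ("A2", 1, "ALU1"), ("A3", 1, "ALU1"),
    ("A4", 0, "ALU2"), ("A5", 1, "ALU2"),
]

for cdrl in (0, 1):
    print(f"CDRL={cdrl}:", dispatch(group, cdrl))
```

With CDRL=1 the group takes four cycles instead of one, a delay of three cycles, which agrees with delay_cycles = CDRL + dispatch_cycles - 1 = 1 + 3 - 1.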
Cyclic PPAS Speedup for Training and Reference Input Sets (4-wide, CDRL=1)
[Figure: bar chart of cyclic PPAS speedup (y-axis 0.7 to 1.5) per benchmark: cjpeg, djpeg, epic, unepic, g721encode, g721decode, ghostscript, gsmdecode, gsmencode, mesamipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rasta, rawcaudio, rawdaudio, Average (4-wide), Average (6-wide). Series: Training Input, Reference Input. One data label reads 0.38]