Probabilistic Predicate-Aware Modulo Scheduling

Mikhail Smelyanskiy1, Scott Mahlke, Edward Davidson

Department of EECS

University of Michigan

1 Currently with the System Technology Lab at Intel Corporation


Introduction to Deterministic Predicate-aware Scheduling (DPAS)

[Smelyanskiy03]

Predication eliminates branch instructions

• but increases resource requirements

Predicate-aware scheduling oversubscribes resources

• reduces resource requirements

• reduces schedule length

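[Figure: control-flow graph in which block A ends with "br cond" and leads to blocks B and C (one edge labeled FT), which rejoin at D]

Predicated schedule (one operation per FU slot):

Time  FU
0     A
1     p1,p2=cmpp(cond)
2     B if p1
3     C if p2
4     D

Predicate-aware schedule (B and C share a slot):

Time  FU
0     A
1     p1,p2=cmpp(cond)
2     B if p1, C if p2
3     D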
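The sharing rule behind the two schedules above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single FU, operations placed in program order, and no dependence check between operations that share a slot. The function name `pack_single_fu` and the `disjoint` set are hypothetical and not taken from the slides.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Op = Tuple[str, Optional[str]]  # (operation name, guarding predicate or None)


def pack_single_fu(ops: List[Op], disjoint: Set[FrozenSet[str]]) -> List[List[str]]:
    """Place ops in order on one FU.

    An op joins the previous slot only if it is predicated and its predicate
    is known to be disjoint from the predicate of every op already there.
    Dependences between sharing ops are NOT checked (assumption of this sketch).
    """
    slots: List[List[Op]] = []
    for name, pred in ops:
        can_share = (
            bool(slots)
            and pred is not None
            and all(
                other is not None and frozenset((other, pred)) in disjoint
                for _, other in slots[-1]
            )
        )
        if can_share:
            slots[-1].append((name, pred))
        else:
            slots.append([(name, pred)])
    return [[n for n, _ in slot] for slot in slots]


ops = [("A", None), ("cmpp", None), ("B", "p1"), ("C", "p2"), ("D", None)]

# No disjointness information: every op needs its own slot (5 cycles).
baseline = pack_single_fu(ops, disjoint=set())

# p1 and p2 come from the same cmpp and are never both true: B and C share.
aware = pack_single_fu(ops, disjoint={frozenset(("p1", "p2"))})
```

With the disjointness fact supplied, the schedule shrinks from five slots to four, matching the second table.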


Motivation for Probabilistic Predicate-aware Scheduling (PPAS)

DPAS can only combine A5 with A2, A3 and A4

What about combining

• A2 with A3 ?

• A3 with A4 ?

• A2 with A6 ?

PPAS allows much more aggressive sharing than DPAS but can result in delay due to resource conflicts

[Figure: control-flow graph with operations A1 through A6, M1, M2 and a branch (br)]

Characteristics of Predicated Code

52% of time is spent in cyclic regions

Cyclic PPAS might eliminate up to 38% of all dynamic operations from cyclic regions

[Figure: stacked bar chart, y-axis 0% to 100%, one bar per benchmark (cjpeg, djpeg, epic, unepic, g721encode, g721decode, ghostscript, gsmdecode, gsmencode, mesa, mipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rasta, rawcaudio, rawdaudio, Average); legend: %C_false, %C_true, Others (Cyclic with no predicated operations + Acyclic)]

Outline

• Motivation

• Resource Pressure Problem in Predicated Code

• Probabilistic Predicate Aware Architecture

• Probabilistic Predicate-aware Modulo Scheduling

• Performance Results

• Conclusions

Modulo Scheduling Example

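[Figure: loop dependence graph with operations +1, p1=cmpp, +2 if p1, +3 if p1, st and br; edges are labeled with latencies, and the predicated operations are annotated freq=0.3]

This control path is taken 30% of the time

Assumed machine:

• 1 ALU, 1 MEMORY and 1 BRANCH unit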

Traditional Modulo Schedule (Rau 94)

Modulo Schedule

Time  Iteration i     Iteration i + 1
0     +1
1
2     p1=cmpp
3     +2 if p1, br
4                     +1
5     +3 if p1
6     st              p1=cmpp
7                     +2 if p1, br
8
9                     +3 if p1
10                    st

Modulo Scheduled Loop Kernel

      ALU        MEM  BR
I0    +1
I1    +3 if p1
I2    p1=cmpp    st
I3    +2 if p1        br

II=4

II=5
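The kernel rows follow from the flat schedule of one iteration by taking each issue time modulo II. A minimal Python sketch of that folding step is given here; the function name `kernel_rows` and the dictionary layout are illustrative assumptions, and the issue times are the ones read off the Iteration i column of the schedule.

```python
from typing import Dict, Tuple


def kernel_rows(schedule: Dict[str, Tuple[int, str]], ii: int) -> Dict[int, Dict[str, str]]:
    """Fold a flat schedule {op: (issue time, unit)} into II kernel rows.

    Raises ValueError if two operations need the same unit in the same row,
    which means the schedule is not a valid modulo schedule for this II.
    """
    rows: Dict[int, Dict[str, str]] = {r: {} for r in range(ii)}
    for op, (time, unit) in schedule.items():
        row = time % ii
        if unit in rows[row]:
            raise ValueError(f"{unit} already holds {rows[row][unit]} in row {row}")
        rows[row][unit] = op
    return rows


# Issue times of one iteration, as in the traditional modulo schedule.
flat_schedule = {
    "+1": (0, "ALU"),
    "p1=cmpp": (2, "ALU"),
    "+2 if p1": (3, "ALU"),
    "br": (3, "BR"),
    "+3 if p1": (5, "ALU"),
    "st": (6, "MEM"),
}

kernel = kernel_rows(flat_schedule, ii=4)
```

Folding the same issue times with II=3 fails, because +1 (time 0) and +2 if p1 (time 3) would both need the ALU in row 0.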


Probabilistic Predicate-Aware Modulo Scheduling

[Figure: the loop dependence graph from the Modulo Scheduling Example, with the predicated operations annotated freq=0.3]

Baseline Modulo Schedule (II = 4)

Time  A          M    B
0     +1
1     +3 if p1
2     p1=cmpp    st
3     +2 if p1        br

Deterministic Predicate-Aware Modulo Schedule (II = 4)

Time  A          M    B
0     +1
1     +3 if p1
2     p1=cmpp    st
3     +2 if p1        br

Probabilistic Predicate-Aware Modulo Schedule (II = 3.18)

Time  A                     M    B
0     +1
1     +2 if p1, +3 if p1
2     p1 = cmpp             st   br

0.18 expected delay due to conflicts


Baseline Architecture Model

Predicate Register File is only accessed in EXECUTE stage

Resources from FETCH to EXECUTE are unconditionally reserved

[Figure: pipeline with stages FETCH, DISPATCH, DECODE, REGISTER READ, PRED READ & EXECUTE and WRITE BACK; the Predicate Register File feeds the PRED READ & EXECUTE stage; stages are labeled as must-use or may-use resources]

Extended Predicate-Aware Architecture

[Figure: pipeline with stages FETCH, PRED READ & DISPATCH, DECODE, REGISTER READ, EXECUTE and WRITE BACK; the Predicate Register File (PRF) feeds the PRED READ & DISPATCH stage; stages are labeled as must-use or may-use resources; a Resource Conflict Detection and Recovery Unit performs conflict detection and conflict recovery and drives the stall signals]

Conflict Detection and Recovery Latency (CDRL) can be 0 or 1 cycles

Expected Delay Model

• ev is execution vector

• delay_cycles(ev) = CDRL + dispatch_cycles(ev) – 1

• P(ev) is probability of occurrence of ev

$$ED_{cfl}(\text{group of predicated operations}) = \sum_{\text{for each conflicting } ev} P(ev) \times \text{delay\_cycles}(ev)$$

P(ev) is computed using disjointness and implication, and assuming independence otherwise

Example (assume 3 operations, one FU and CDRL=1)

EDcfl (op1 if p1, op2 if p2, op3 if p3) = (1 + 3 - 1) × P(p1=T, p2=T, p3=T) +

(1 + 2 - 1) × P(p1=T, p2=T, p3=F) +

(1 + 2 - 1) × P(p1=F, p2=T, p3=T) +

(1 + 2 - 1) × P(p1=T, p2=F, p3=T)
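The model can be evaluated by enumerating execution vectors. The Python sketch below does this under two assumptions that go beyond the slide: all predicates are treated as independent (disjointness and implication are not modeled), and dispatch_cycles(ev) is taken to be the number of operations in the group whose predicate is true, which is what the one-FU example uses. The function names are illustrative.

```python
from itertools import product
from typing import Sequence


def delay_cycles(num_true: int, cdrl: int) -> int:
    """delay_cycles(ev) = CDRL + dispatch_cycles(ev) - 1 for one shared FU.

    Assumption: dispatch_cycles(ev) equals the number of true predicates.
    """
    return cdrl + num_true - 1


def expected_conflict_delay(probs: Sequence[float], cdrl: int) -> float:
    """Sum P(ev) * delay_cycles(ev) over every conflicting execution vector.

    probs[i] is the probability that the predicate of operation i is true.
    An execution vector conflicts when at least two predicates are true.
    """
    total = 0.0
    for ev in product((True, False), repeat=len(probs)):
        num_true = sum(ev)
        if num_true < 2:
            continue  # zero or one live operation: no conflict on the FU
        p_ev = 1.0
        for is_true, p in zip(ev, probs):
            p_ev *= p if is_true else 1.0 - p
        total += p_ev * delay_cycles(num_true, cdrl)
    return total


# Two operations, each executed 30% of the time, CDRL = 1.
two_ops = expected_conflict_delay([0.3, 0.3], cdrl=1)

# Three operations as in the example, here with every predicate at 0.5.
three_ops = expected_conflict_delay([0.5, 0.5, 0.5], cdrl=1)
```

For two operations that each execute 30% of the time, the only conflicting vector has probability 0.09 and costs 2 cycles, giving an expected delay of 0.18.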


Modulo Scheduling using Expected Delay Model

(scheduling operation +3 if p1)

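[Figure: the loop dependence graph from the Modulo Scheduling Example, with the predicated operations annotated freq=0.3]

Expected Delay due to Conflicts (CDRL = 1) for each candidate ALU slot of +3 if p1:

Slot  Occupant   Expected delay
0     +1         Pconf(+1, +3) = 2 × 1.0 × 0.3 = 0.60
1     +2 if p1   Pconf(+2, +3) = 2 × 0.3 × 0.3 = 0.18
2     p1=cmpp    Pconf(p1=pred, +3) = 2 × 1.0 × 0.3 = 0.60

SRT

Time  A may                   MEM may  BR may
0     +1
1
2     p1=cmpp                          br
3
4     +2 if p1
5     +3 if p1 (candidate)
6     +3 if p1 (candidate)
7     +3 if p1 (candidate)
8                             st

MRT

Time  A may                 MEM may  BR may  Expected delay
0     +1                                     0
1     +2 if p1, +3 if p1                     0.18
2     p1=cmpp               st       br      0

total expected delay due to conflicts: 0.18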
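The slot choice for +3 if p1 can be sketched as picking the ALU slot with the smallest added expected delay. The Python below is a simplified illustration, not the scheduler itself: it assumes every candidate slot already holds exactly one ALU operation, that predicates are independent, and that two live operations in one slot cost CDRL + 2 - 1 cycles. The names `pair_delay` and `best_slot` are hypothetical.

```python
from typing import Dict, Tuple


def pair_delay(p_a: float, p_b: float, cdrl: int) -> float:
    """Expected delay when two operations share one FU slot.

    The only conflicting case is both predicates true, which costs
    CDRL + 2 - 1 cycles and occurs with probability p_a * p_b
    (independence assumed).
    """
    return (cdrl + 2 - 1) * p_a * p_b


def best_slot(mrt: Dict[int, Tuple[str, float]], p_new: float,
              cdrl: int) -> Tuple[int, Dict[int, float]]:
    """Return the slot with the lowest added expected delay and all costs.

    mrt maps slot -> (occupant name, probability the occupant executes).
    """
    costs = {slot: pair_delay(p_occ, p_new, cdrl)
             for slot, (_, p_occ) in mrt.items()}
    return min(costs, key=costs.get), costs


# ALU column of the MRT before +3 if p1 is placed (II_static = 3).
alu_mrt = {
    0: ("+1", 1.0),        # unpredicated: always executes
    1: ("+2 if p1", 0.3),  # executes 30% of the time
    2: ("p1=cmpp", 1.0),   # unpredicated: always executes
}

slot, costs = best_slot(alu_mrt, p_new=0.3, cdrl=1)
```

Slot 1 wins with an expected delay of 0.18, while sharing with either unpredicated operation would cost 0.60.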


Modulo Scheduling using Expected Delay Model

(Finding Expected Initiation Interval, IIexp)

More than one way to achieve the same II_exp (e.g. 3.2)

II_static = 3, < 0.2 total expected conflict delay:

Time  ALU                   MEM  BR
0     +1
1     +2 if p1, +3 if p1
2     p1=cmpp               st   br

II_static = 2, < 1.2 total expected conflict delay:

Time  ALU                  MEM  BR
0     +1, +2 if p1         st
1     p1=cmpp, +3 if p1         br

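For a given II_exp:

• start with II_static = 2 and increase II_static until it exceeds II_exp or a schedule is found

• II_exp of the schedule found becomes the new upper bound

• II_exp becomes the new lower bound if no schedule is found

Use binary search to find II_exp

• upper bound:

$$II_{exp} = \frac{(1)_{+1} + (1)_{+2} + (1)_{+3} + (1)_{cmpp}}{1\ \text{ALU}} = 4$$

• lower bound:

$$II_{exp} = \frac{(1)_{+1} + (0.3)_{+2} + (0.3)_{+3} + (1)_{cmpp}}{1\ \text{ALU}} = 2.6$$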
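The search can be sketched as a bisection over the target II_exp. The Python below is one reading of the slide, under stated assumptions: a schedule at a given II_static is accepted when II_static plus its total expected conflict delay does not exceed the target, and the scheduler itself is replaced by a hypothetical callback `try_schedule`. The toy delay table reuses 0.18 from the slides for II_static = 3; the value 1.2 for II_static = 2 is an assumption derived from the second schedule (0.60 + 0.60 with CDRL = 1).

```python
import math
from typing import Callable, Optional, Sequence, Tuple


def ii_bounds(exec_probs: Sequence[float], num_units: int) -> Tuple[float, float]:
    """Lower bound weights each op by its execution probability;
    upper bound reserves a full slot for every op."""
    lower = sum(exec_probs) / num_units
    upper = len(exec_probs) / num_units
    return lower, upper


def search_ii_exp(lower: float, upper: float,
                  try_schedule: Callable[[int, float], Optional[float]],
                  tol: float = 0.01) -> float:
    """Bisect on the target II_exp.

    try_schedule(ii_static, budget) is a hypothetical scheduler hook: it
    returns the total expected conflict delay of a schedule whose delay is
    at most budget, or None if no such schedule exists.
    """
    start_ii = max(1, math.floor(lower))
    best = upper  # assumes a conflict-free schedule exists at the upper bound
    while upper - lower > tol:
        target = (lower + upper) / 2
        achieved = None
        ii_static = start_ii
        while ii_static <= target and achieved is None:
            delay = try_schedule(ii_static, target - ii_static)
            if delay is not None:
                achieved = ii_static + delay
            ii_static += 1
        if achieved is None:
            lower = target            # no schedule: raise the lower bound
        else:
            upper = best = achieved   # schedule found: tighten the upper bound
    return best


# Toy stand-in for the scheduler: best known delay per II_static (assumed).
toy_delays = {2: 1.2, 3: 0.18, 4: 0.0}


def toy_try_schedule(ii_static: int, budget: float) -> Optional[float]:
    delay = toy_delays.get(ii_static)
    return delay if delay is not None and delay <= budget else None


# ALU ops: +1, +2 if p1, +3 if p1, cmpp with their execution probabilities.
lower, upper = ii_bounds([1.0, 0.3, 0.3, 1.0], num_units=1)
ii_exp = search_ii_exp(lower, upper, toy_try_schedule)
```

With these toy values the search first accepts II_exp = 3.2 at II_static = 2 and later tightens it to 3.18 at II_static = 3.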


Performance Results

Compare the performance of baseline (BASE), deterministic (DPAS) and probabilistic (PPAS) predicate-aware modulo scheduling

Compiler Support

• Trimaran and ELCOR [Trimaran99]

Mediabench [Lee97] benchmark suite was evaluated

Processor Models (BA – base, PA – predicate-aware)

         Model  Fetch Width  Int ALU  cmpp latency  Memory  CDRL
4-wide   BASE   4            2        1             1       -
         DPAS   4            2        3             1       -
         PPAS   4            2        3             1       0 and 1
6-wide   BASE   6            4        1             2       -
         DPAS   6            4        3             2       -
         PPAS   6            4        3             2       0 and 1

Cyclic PPAS Speedup over BASE (4-wide machine)

4-wide cyclic PPAS with CDRL=0 is 20% better than base and 10% better than cyclic DPAS

Increasing CDRL degrades performance

[Figure: bar chart of speedup over BASE, y-axis 0.7 to 1.5; legend: DPAS, PPAS with CDRL=0, PPAS with CDRL=1]

Various Scheduling Measurements (4-wide machine, CDRL = 0)

Cyclic PPAS reduces II by 32% compared with BASE and by 12% compared with cyclic DPAS

Expected delay model accurately predicts delay due to conflicts

Predicate-aware scheduling increases the epilogue size and requires more rotating registers than BASE

       II compile  II runtime  Absolute Error  Epilogue Size  # Rotating Registers
BASE   27.6        27.6        0.0%            1.2            14.8
DPAS   23.5        23.5        0.0%            2.2            18.4
PPAS   21.0        20.8        1.6%            4.7            29.9

Overall Speedup over BASE with Cyclic PPAS

[Figure: bar chart of overall speedup over BASE, y-axis 0.9 to 1.45, one group per benchmark (cjpeg, djpeg, epic, unepic, g721encode, g721decode, ghostscript, gsmdecode, gsmencode, mesa, mipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rasta, rawcaudio, rawdaudio, Average, Average (DPAS)); legend: 4-wide PPAS with CDRL=0, 6-wide PPAS with CDRL=0]

Only 52% of regions are scheduled with cyclic PPAS

Overall 4-wide cyclic PPAS is 10% better than base and 6-wide cyclic PPAS is 4% better than base

Summary of PPAS

PPAS significantly reduces resource requirements in predicated cyclic code but can cause conflicts

• compiler maximizes sharing in view of expected conflicts

• PPAS architecture detects and recovers from conflicts

PPAS improves performance by

Overall (cmpplat=3, CDRL=0)

Cyclic vs. Base vs. DPAS

4-wide 20% 10% 6%

6-wide 8% 4% 3%

For further discussion, see

http://www.eecs.umich.edu/~msmelyan/publications.html

Mikhail Smelyanskiy. Hardware/Software Mechanisms for Increasing Resource Utilization on VLIW/EPIC Processors. Ph.D. Dissertation, University of Michigan, 2004

Questions?

Backup Foils


Resource Conflict Detection and Recovery Unit

Design alternatives to dispatch conflicting operations

• one operation per assigned FU

• one operation per any FU (not evaluated)

Conflict Detection and Recovery Latency (CDRL)

Dispatched group: A1 if 1, A2 if 1, A3 if 1, A4 if 0, A5 if 1

CDRL = 0

Cycle  ALU1  ALU2
0      A1    A5
1      A2
2      A3

CDRL = 1

Cycle  ALU1  ALU2
0      conflict detected (dispatch bubble)
1      A1    A5
2      A2
3      A3
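The dispatch behaviour in this example can be reproduced with a small Python simulation of the "one operation per assigned FU" alternative. It is a sketch under assumptions: A1, A2 and A3 are assigned to ALU1 and A4 and A5 to ALU2 (the placement of the nullified A4 is a guess), and a conflict costs CDRL bubble cycles before dispatch starts. The function name `dispatch` is illustrative.

```python
from typing import Dict, List, Tuple


def dispatch(group: Dict[str, List[Tuple[str, bool]]], cdrl: int) -> List[List[str]]:
    """Return the operations dispatched in each cycle.

    group maps each FU to its assigned (operation, predicate value) pairs.
    Operations with a false predicate are dropped; each FU then dispatches
    one live operation per cycle. If any FU has more than one live
    operation, a conflict is detected and cdrl bubble cycles come first.
    """
    live = {fu: [name for name, pred in ops if pred] for fu, ops in group.items()}
    dispatch_cycles = max((len(ops) for ops in live.values()), default=0)
    timeline: List[List[str]] = []
    if dispatch_cycles > 1:
        timeline.extend(["<bubble>"] for _ in range(cdrl))
    for cycle in range(dispatch_cycles):
        timeline.append([ops[cycle] for ops in live.values() if cycle < len(ops)])
    return timeline


# Assumed FU assignment for the group on the slide.
group = {
    "ALU1": [("A1", True), ("A2", True), ("A3", True)],
    "ALU2": [("A4", False), ("A5", True)],
}

cdrl0 = dispatch(group, cdrl=0)
cdrl1 = dispatch(group, cdrl=1)
```

The number of extra cycles beyond the first is len(timeline) - 1, which is 2 for CDRL = 0 and 3 for CDRL = 1, in line with delay_cycles(ev) = CDRL + dispatch_cycles(ev) – 1.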


Cyclic PPAS Speedup for Training and Reference Input Sets (4-wide, CDRL=1)

[Figure: bar chart of speedup, y-axis 0.7 to 1.5, one group per benchmark (cjpeg, djpeg, epic, unepic, g721encode, g721decode, ghostscript, gsmdecode, gsmencode, mesa, mipmap, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rasta, rawcaudio, rawdaudio, Average (4-wide), Average (6-wide)); legend: Training Input, Reference Input; one value label, 0.38, survives from the plot]
