juergen ributzka david stephenson timothy kong dee lee fred chow guang r. gao

Open64

A Software Pipelining Framework for Simple Processor Cores

Juergen Ributzka, David Stephenson, Timothy Kong, Dee Lee, Fred Chow, Guang R. Gao

Copyright © 2009 - Juergen Ributzka. All rights reserved.

Overview

• SiCortex Multiprocessor
• Software Pipelining Framework
• Results
• Future Work

3/23/2009


SiCortex Multiprocessor

• RISC
• 6 cores (MIPS 5KF)
• 500 – 700 MHz
• In-Order Execution
• Limited-Dual Issue
• Low Power (12 – 16 Watt)


Software Pipelining Framework

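Figure: compiler structure. The Front-End (FORTRAN, C/C++, Java) feeds the Middle-End (Loop Nest Optimizer, Global Optimizer), which feeds the Back-End (Code Generator). The software pipelining framework runs as a sequence of phases: DDG, MII, MS, MVE, RA, CG.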


Data Dependence Graph (DDG)

• Cyclic Data Dependence Graph
• No register anti- and output-dependencies
• Only single basic blocks (BBs)
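
The bullets above can be illustrated with a minimal sketch. The `Edge` record, the dependence kind names, and the example edges between S1 and S2 are hypothetical and do not come from the Open64 sources. The sketch assumes that register anti- and output-dependencies can be left out because registers are renamed and assigned in later phases.

```python
from collections import namedtuple

# Hypothetical edge record: latency in cycles, distance in loop iterations.
Edge = namedtuple("Edge", "src dst kind latency distance")

# Dependence kinds left out of the graph, per the slide.
DROPPED_KINDS = {"reg_anti", "reg_output"}

def build_ddg(candidate_edges):
    """Return the DDG edges, omitting register anti- and output-dependencies."""
    return [e for e in candidate_edges if e.kind not in DROPPED_KINDS]

def is_cyclic(edges):
    """True if some node can reach itself by following edges."""
    succ = {}
    for e in edges:
        succ.setdefault(e.src, set()).add(e.dst)

    def reaches(start, target, seen):
        for nxt in succ.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reaches(nxt, target, seen):
                    return True
        return False

    return any(reaches(n, n, set()) for n in succ)

# Illustrative single-basic-block loop body with two statements.
candidates = [
    Edge("S1", "S2", "reg_flow", 2, 0),  # S2 uses the value S1 defines
    Edge("S2", "S1", "reg_flow", 1, 1),  # S1 uses S2's value from the previous iteration
    Edge("S2", "S1", "reg_anti", 0, 1),  # register reuse only, so it is dropped
]
ddg = build_ddg(candidates)
```

The loop-carried edge (distance 1) is what makes the graph cyclic; without it the graph would be an ordinary acyclic scheduling DAG.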

Figure: example data dependence graphs over the statements S1 and S2.


Minimum Initiation Interval

• Limited by:
  – Resource Requirements
  – Recurrence Requirements
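
The two limits are commonly combined as MII = max(ResMII, RecMII). The sketch below shows one way to compute them and is not taken from the framework itself. The resource names, unit counts, and edge list are hypothetical. RecMII is found as the smallest II for which no dependence cycle has positive weight under the edge weight latency - II * distance, and every cycle is assumed to carry a distance of at least 1.

```python
import math

def res_mii(uses, units):
    """Resource bound: the busiest resource decides the lower bound on II."""
    return max(math.ceil(uses[r] / units[r]) for r in uses)

def has_positive_cycle(nodes, edges, ii):
    """Longest-path relaxation with edge weight latency - ii * distance."""
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for src, dst, latency, distance in edges:
            weight = latency - ii * distance
            if dist[src] + weight > dist[dst]:
                dist[dst] = dist[src] + weight
                changed = True
        if not changed:
            return False  # converged, so no positive cycle exists
    return True  # still relaxing after len(nodes) passes

def rec_mii(nodes, edges):
    """Recurrence bound: smallest II that satisfies every dependence cycle."""
    total_latency = sum(latency for _, _, latency, _ in edges)
    for ii in range(1, total_latency + 2):
        if not has_positive_cycle(nodes, edges, ii):
            return ii
    raise ValueError("a dependence cycle has zero total distance")

# Hypothetical loop: S1 -> S2 (latency 2), S2 -> S1 carried over one iteration.
nodes = ["S1", "S2"]
edges = [("S1", "S2", 2, 0), ("S2", "S1", 1, 1)]
uses = {"alu": 3, "mem": 2}    # operations per iteration on each resource
units = {"alu": 2, "mem": 1}   # available units of each resource

mii = max(res_mii(uses, units), rec_mii(nodes, edges))
```

In this example the recurrence (latency 3 over distance 1) dominates the resource bound of 2, so the modulo scheduler would start its search at II = 3.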


Modulo Scheduling (MS)

• Huff Modulo Scheduling
  – Lifetime sensitive
  – Uses backtracking
• Hyper Node Reduction Scheduling
  – Lifetime sensitive
  – No backtracking
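
To show what any modulo scheduler has to enforce, here is a deliberately simple greedy sketch. It is neither Huff's algorithm nor Hyper Node Reduction Scheduling: it is not lifetime sensitive, it places operations in the order given, and instead of backtracking it retries with a larger II. The operation names, resources, and latencies are hypothetical.

```python
def try_schedule(ops, edges, units, ii):
    """Place each op at the earliest legal cycle; return None if II is too small."""
    sched = {}  # op name -> issue cycle
    mrt = {}    # modulo reservation table: (resource, cycle % ii) -> ops placed
    for name, res in ops:
        earliest, latest = 0, None
        for src, dst, latency, distance in edges:
            if dst == name and src in sched:   # wait for scheduled producers
                earliest = max(earliest, sched[src] + latency - ii * distance)
            if src == name and dst in sched:   # respect scheduled consumers
                bound = sched[dst] - latency + ii * distance
                latest = bound if latest is None else min(latest, bound)
        last = earliest + ii - 1               # ii consecutive cycles cover every slot
        if latest is not None:
            last = min(last, latest)
        for cycle in range(earliest, last + 1):
            slot = (res, cycle % ii)
            if mrt.get(slot, 0) < units[res]:
                sched[name] = cycle
                mrt[slot] = mrt.get(slot, 0) + 1
                break
        else:
            return None                        # no free slot at this II
    return sched

def modulo_schedule(ops, edges, units, mii, max_ii):
    """Try II = mii, mii + 1, ... until a schedule is found."""
    for ii in range(mii, max_ii + 1):
        sched = try_schedule(ops, edges, units, ii)
        if sched is not None:
            return ii, sched
    return None

# Hypothetical loop body, listed in program order.
OPS = [("load", "mem"), ("mul", "alu"), ("add", "alu"), ("store", "mem")]
EDGES = [
    ("load", "mul", 2, 0),
    ("mul", "add", 3, 0),
    ("add", "add", 1, 1),   # accumulator carried into the next iteration
    ("add", "store", 1, 0),
]
UNITS = {"mem": 1, "alu": 1}

result = modulo_schedule(OPS, EDGES, UNITS, 2, 8)
```

The modulo reservation table is the key data structure: an operation issued at cycle t occupies its resource in slot t mod II of every kernel iteration, so two operations conflict whenever their cycles are congruent modulo II.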


Modulo Variable Expansion (MVE)

• Separate TN set for every iteration
• # of unrollings depends on longest lifetime
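
A minimal sketch of how the longest lifetime can determine the unroll count, assuming the common rule that a value living for L cycles at initiation interval II needs ceil(L / II) copies. The lifetimes, the II value, and the `TN10_0` style naming are hypothetical and only illustrate the idea.

```python
import math

def mve_unroll_factor(lifetimes, ii):
    """Kernel copies needed so that no value is overwritten while still live."""
    return max(1, max(math.ceil(life / ii) for life in lifetimes.values()))

def kernel_copy_names(tns, copies):
    """Give every kernel copy its own TN set (the naming scheme is illustrative)."""
    return [{tn: f"{tn}_{k}" for tn in tns} for k in range(copies)]

# Hypothetical lifetimes in cycles, from definition to last use.
lifetimes = {"TN10": 5, "TN20": 2}
ii = 2

copies = mve_unroll_factor(lifetimes, ii)
names = kernel_copy_names(lifetimes, copies)
```

Here TN10 lives for 5 cycles while a new iteration starts every 2 cycles, so three iterations hold a live TN10 at the same time and the kernel is unrolled three times.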

Figure: iteration versus cycle diagram for the example TN10 op1 TNXX … TN20 op2 TN10 TNYY.


Register Allocation (RA)

• Only kernels are fully register allocated

• No regions
  – SWP kernels are not a black box for GRA and LRA anymore

• Currently no register spilling support


Code Generation (CG)

• Prologues and epilogues are partially register allocated

• Need several epilogues due to different register sets

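Figure: generated code layout with the blocks P and P0 (prologues), K0 and K1 (kernels), and E0, E1, E2, and E (epilogues).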


Benchmark Results

Figure: speedup on the NAS Parallel Benchmark (BT, CG, EP, FT, IS, LU, MG, SP, UA) for ABI n32 and ABI 64. The speedup axis ranges from 0.9 to 1.2.


Benchmark Results

Figure: speedup on SPEC2006 for ABI n32 and ABI 64. The speedup axis ranges from 0.95 to 1.05. Benchmarks shown: 410.bwaves, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf, 482.sphinx3, 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar.


Conclusion and Future Work

• Spilling support needed to enable more loops for SWP

• Screen out loops with small trip counts at run time to reduce SWP overhead

• Hit-under-miss support to overcome current hardware limitations
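
The trip count screening item can be pictured with a small sketch, written in Python for readability rather than as generated machine code. The threshold `MIN_TRIP_COUNT`, the two-stage split of the loop body, and all function names are hypothetical.

```python
MIN_TRIP_COUNT = 4  # hypothetical threshold; a real compiler would tune this

def plain_sum_squares(xs):
    """Original loop: both stages of an iteration run back to back."""
    total = 0
    for x in xs:
        sq = x * x      # stage A
        total += sq     # stage B
    return total

def pipelined_sum_squares(xs):
    """Two-stage software pipeline; needs at least one iteration."""
    sq = xs[0] * xs[0]              # prologue: stage A of iteration 0
    total = 0
    for i in range(1, len(xs)):     # kernel
        total += sq                 # stage B of iteration i - 1
        sq = xs[i] * xs[i]          # stage A of iteration i
    total += sq                     # epilogue: stage B of the last iteration
    return total

def sum_squares(xs):
    """Run-time guard: short loops skip the pipelined version."""
    if len(xs) < MIN_TRIP_COUNT:
        return plain_sum_squares(xs)
    return pipelined_sum_squares(xs)
```

For short loops the prologue and epilogue dominate the running time, which is the overhead the guard is meant to avoid.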


Questions?
