
Realizing High IPC Through a Scalable, Multipath Microarchitecture

David Kaeli

Northeastern University

Computer Architecture Research Laboratory

Boston, MA USA

The Team

Augustus Uht

Sean Langford*

University of Rhode Island

Kingston, RI USA (*now at CMU)

David Morano

Alireza Khalafi

Marcos de Alba

Northeastern University

Boston, MA USA

The Road to High IPC

• Many studies have concluded that typical programs (e.g., SPECint) contain a significant amount of Instruction Level Parallelism (ILP)
  – Lam and Wilson reported an IPC of ~40 for SP-CD-MF (speculative execution, perfect control dependence information, multi-path execution)
  – Gonzalez and Gonzalez reported an IPC of ~37 for an infinite instruction window with no value prediction (IPC dropped to just under 10 for a 128-entry instruction window)
• So why are we still living with low, single-digit IPCs?
  – Nobody has been aggressive enough!

Machine Philosophy

• Issue a column of instructions on every cycle (not always possible)
• Spend the rest of the time executing, squashing, snarfing and re-executing as necessary to preserve true control flow and data flow dependencies
• Retire instructions at a rate of a column at a time
• Design a datapath that is scalable in terms of latency as the size of the machine grows
• ISA independent

Outline for this Talk

• Overview of the Levo microarchitecture

• Discussion of scalability within the Levo datapath

• Disjoint execution

• Simulation methodology and results

• Comments and summary

Levo Microarchitectural Features

• In-order instruction load, in-order retirement, rampantly out-of-order execution
• Active stations – a more intelligent version of Tomasulo’s reservation stations
• Instruction/operand/memory/predicate time tags – used to enforce data and control dependencies in a distributed fashion
• Hardware runtime predication – used for all BBs with targets within the execution window
• Distributed register file – reduces contention for a shared register file
• Aggressive speculation – execute instructions, independent of any data flow or control flow dependencies
• Disjoint execution to cover control hazards

Limit study with real hardware constraints

In-order Instruction Load

• Instructions are fetched in static order from the I-cache, except:
  – Unconditional jump paths are followed
  – Loops are dynamically unrolled
  – For conditional branches with far targets (the target is more than 2/3 of the execution window size away), static fetching begins from the target if the branch is strongly predicted taken
• A conventional 2-level gshare branch predictor is used
• Dynamic run-time predicates are generated so that every branch domain in the Execution Window is control independent
  – Nullify operations are broadcast to cause dependent instructions to re-execute
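The fetch-order exceptions above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, the `Strength` enum, the 4-byte instruction size, and measuring the branch distance in instructions are all hypothetical, and dynamic loop unrolling is deliberately left out (only forward targets are considered).

```python
from enum import Enum


class Strength(Enum):
    """Hypothetical 4-state prediction strength (as from a 2-bit counter)."""
    STRONG_NOT_TAKEN = 0
    WEAK_NOT_TAKEN = 1
    WEAK_TAKEN = 2
    STRONG_TAKEN = 3


def next_fetch_address(pc, kind, target, prediction, window_size, insn_bytes=4):
    """Return the address of the next instruction to load into the window.

    kind is "jump" (unconditional), "branch" (conditional) or "other".
    window_size is the execution window size in instructions.
    Assumes forward targets; loop unrolling is not modeled here.
    """
    fall_through = pc + insn_bytes
    if kind == "jump":
        return target  # unconditional jump paths are followed
    if kind == "branch":
        distance = (target - pc) // insn_bytes  # distance in instructions
        is_far = 3 * distance > 2 * window_size  # more than 2/3 of the window
        if is_far and prediction is Strength.STRONG_TAKEN:
            return target  # begin static fetching from the target
    return fall_through  # default: static (sequential) order
```

Near branches keep static fetch order in this sketch because their targets fall inside the window, where the slides say runtime predication handles them.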

Microarchitecture

[Figure: Levo block diagram. Labeled blocks: I-Cache, Instruction Fetch, Branch Predication, Predication Logic, Instruction Load Buffer, Memory Window, and an n x m time-ordered Execution Window (Instruction Window) of active stations AS(row, column), running from the temporally earliest instruction to the temporally latest instruction. Column labels include Column 0, Column m-3, Column m-2, Column m-1, and Commit. Sharing group: 4-8 ASs, a single PE, bus interfaces.]

Active Stations

• More intelligent version of Tomasulo reservation stations

• Each AS holds:
  – A single instruction
  – Instruction operands
  – A time tag denoting its logical position in the execution window

• Each AS shares a processing element with a number of other AS’s (as defined by the size of a sharing group)

Active Stations

• Communicate with other active stations in order to:
  – Snoop for the latest operand values
  – Forward the results to other active stations
  – Request a value from other active stations
• Re-execute their instructions with new operand values
• Handle control flow changes through runtime predication

Time Tags

• Enforce the nominal sequential order of the instructions executed
• Accompany all in-flight register values, memory values and predicate values
• Have two parts
  – Column tag – is decremented by 1 whenever the left-most column is loaded
  – Row tag – does not change

[Figure: time tag layout with a Row field and a Column field.]
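A minimal sketch of a two-part time tag follows. The `TimeTag` class and `shift_window` helper are hypothetical names, and the code assumes the column is the more significant part when two tags are compared, which the slides do not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class TimeTag:
    # Field order matters: with order=True, tags compare as (column, row),
    # so the column is treated as the more significant part (an assumption).
    column: int
    row: int

    def shifted(self):
        """Tag after a column shift: column drops by 1, row is unchanged."""
        return TimeTag(self.column - 1, self.row)


def shift_window(tags):
    """Apply one column shift to every in-flight tag."""
    return [tag.shifted() for tag in tags]
```

Because every in-flight tag is decremented by the same amount, the relative order of any two tags is preserved across a shift, which is what lets the tags keep enforcing nominal sequential order.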

Execution Window

[Figure: execution window of n rows by m columns (rows 0 to n-1, Column 0 to Column m-1). A sharing group of 4 mainline ASs, AS(0,m-1) through AS(3,m-1), shares a single PE.]

Active Station Operand Snooping and Snarfing

[Figure: operand snooping logic in an active station. Labels include: result operand forwarding bus (time tag, address, value, path); the AS time tag and address; load (LD) signals for the stored path, time tag and value; comparators marked =, >=, < and !=; and an "execute or re-execute" output.]

(a) Program Code – Sequential Execution

Instruction Number   Instruction
1                    R4 = 1
5                    R4 = 2
9                    R3 = R4

(at end, R3 holds ‘2’)

(b) With Renaming – Out-of-Order (OOO) Execution

Instruction Number   Instruction
1                    R4a = 1
5                    R4b = 2
9                    R3 = R4b

– I9 only snarfs I5 result (at end, R3 holds ‘2’)

(c) With Time Tags – Out-of-Order (OOO) Execution

Instruction Number   Instruction   Result Time Tag (ResTT)   Last Snarfed Time Tag in Active Station (LSTT)
1                    R4 = 1        1                         –
5                    R4 = 2        5                         –
9                    R3 = R4       9                         1, then 5

– I1 result and ResTT broadcast: R3 = 1, LSTT = 1
– I5 result and ResTT broadcast: R3 = 2, LSTT = 5 (at end, R3 holds ‘2’)
(Same result if I5 broadcasts first; LSTT is set to and stays at ‘5’; I1 result not snarfed by I9.)
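The snarfing rule in panel (c) can be sketched as follows. The `Operand` layout and the `snoop` function are hypothetical, and the path comparison shown in the snooping figure is omitted for brevity. The rule assumed here is: snarf a broadcast when the address matches and the result time tag is at least the LSTT and earlier than the station's own time tag; re-execute only if the value actually changed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    """One source operand held in an active station (hypothetical layout)."""
    address: str                 # register name, e.g. "R4"
    value: Optional[int] = None  # value currently held, if any
    lstt: int = -1               # last snarfed time tag; -1 = nothing snarfed


def snoop(as_time_tag, operand, res_tt, address, value):
    """Snoop one broadcast; return True if the station must (re-)execute."""
    if address != operand.address:
        return False  # not the operand this station is waiting on
    if not (operand.lstt <= res_tt < as_time_tag):
        return False  # older than the last snarf, or from a later instruction
    changed = value != operand.value
    operand.value = value  # snarf the value
    operand.lstt = res_tt  # remember which producer it came from
    return changed
```

Replaying the slide's example for I9 (time tag 9, source R4): broadcasting I1 then I5 leaves the operand at value 2 with LSTT 5, and broadcasting I5 first leaves the same state because the later I1 broadcast fails the `lstt <= res_tt` check.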

Scalable Microarchitecture

• Time tag size grows linearly with the total number of ASs

• No reorder buffer (typically grows O(n^2))

• No centralized architected register file

  – Register forwarding units hold the ISA-defined register state

  – Forwarding transactions maintain state

• Segmented result buses – fixed length

• Distributed L0 caching in the datapath

Observation About Register Lifetimes

• The MultiScalar Project demonstrated that register lifetimes are short (spanning 1-2 basic blocks, within 32 instructions)

• If we have instructions laid out in a time-ordered fashion, the probability we will have to forward in time very far is low

• As a result, we can segment our interconnection fabric, assuming that communications will only span either the current, or at most the next, segment

Segmented Buses (Spanning Buses)

• Use segmented buses to propagate execution results to later stations

• Adjacent segments are interconnected with Forwarding Units (one forwarding unit, per bus, per column)

• Register Forwarding/Filter Units (RFUs) hold a version of the ISA register state

• Memory Forwarding/Filter Units (MFUs) and Predicate Forwarding Units (PFUs) are also provided

• Backwarding buses are also provided

• The number of I/Os to a FU is independent of the machine size and only depends on the column height

• Segmented buses help to preserve scalability in our datapath

[Figure: segmented bus datapath across three adjacent columns. Each column is drawn as groups of four ASs (each group labeled "M D") with forwarding units (FU), with links from the previous column and to the next column. Bus labels: forwarding write bus, forwarding read buses, primary forwarding read bus, backwarding write bus, backwarding read buses, primary backwarding read bus.]

Register Forwarding/Filter Units

• Capture the persistent register state
• All buses are register transaction buses
• Consolidate update transactions on input
• Updates are forwarded to the output bus request logic immediately when possible
• Requests are “filtered” based on time-tag value
• Updates are managed in the file store in FIFO order

[Figure: RFU block diagram with bus logic on the backward-in-time and forward-in-time sides of a time-tagged ISA register file, one per path.]
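The slides do not spell out the filtering rule, so the following is only one plausible reading, sketched under assumptions: an update is dropped when the unit already holds a value for that register from a later producer, or when it repeats what is already held. The class and method names are hypothetical, and the column shift of stored time tags is not modeled.

```python
class RegisterForwardingUnit:
    """Toy model of time-tag filtering at a segment boundary (assumed rule)."""

    def __init__(self):
        self.store = {}  # register name -> (time_tag, value)

    def on_update(self, reg, time_tag, value):
        """Handle an update from the earlier segment.

        Returns the transaction to forward to the next segment,
        or None if the update is filtered.
        """
        held = self.store.get(reg)
        if held is not None and time_tag < held[0]:
            return None  # a later producer already passed through: filter
        if held == (time_tag, value):
            return None  # duplicate of what is already held: nothing new
        self.store[reg] = (time_tag, value)
        return (reg, time_tag, value)
```

Under this reading, stale updates stop at the first forwarding unit that has seen a newer value, which keeps later segments from being flooded with transactions they would ignore anyway.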

Memory Forwarding/Filter Units

• Serve as an L0 cache
• All buses are memory buses (the number of which is set according to the interleave factor)
• Consolidate update transactions on input
• Updates are "forwarded" to the output bus request logic immediately when possible
• Requests are “filtered” based on time-tag value
• Current policy is to queue outgoing requests or responses in FIFOs until the buses are granted for use

[Figure: MFU block diagram with bus logic on the backward-in-time and forward-in-time sides of a time-tagged memory cache, plus two FIFOs. Bus labels: backwarding write buses, forwarding read buses, backwarding read buses, forwarding write buses.]

Disjoint Path Execution

• Levo can only obtain high IPC if:

  – we can provide a large window of instructions to execute

  – a large percentage of the instructions on the eventual committed control-flow path are included in the window

• To address the issues with hard-to-predict conditional control flow, we utilize disjoint path spawning in Levo

and DEE

Disjoint Path Execution

• To enable path spawning we provide a disjoint path (D-path) set of AS’s that share a processing element with a mainline set of AS’s
• D-paths are spawned in the case of hammock branches
• The D-path is copied from the mainline path
• The sign of the associated predicate is inverted for the D-path
• The D-path receives lower priority for the PE than the mainline
• When a hammock branch is mispredicted, we can treat the D-path as the new mainline path, and continue execution accordingly


Label   Addr  Instruction         History
START:  A100  LW     R2,20(R4)
        A104  SUB    R2,R2,#1
        A108  BEQZ   R2,TAR1      Weakly T
        A10C  ADD    R2,R2,#4
        A110  SW     30(R4),R2
TAR1:   A114  LW     R2,30(R4)
        A118  SUB    R2,R2,#8
        A11C  BEQZ   R2,TAR2      Weakly NT
        A120  SW     20(R4),R2
TAR2:   A124  ADD    R2,R2,#10
        A128  SUB    R1,R1,#1
        A12C  BNEQZ  R1,START     Strongly T
        A130  SW     40(R4),R2
        .
        .

[Figure: the same instruction sequence (A100 through A130) as loaded into the execution window, with instructions marked as belonging to the mainline path or the disjoint path.]
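The spawn-and-switch behavior described for D-paths can be sketched in a few lines. The `Path` type and both functions are hypothetical simplifications: a path is reduced to the branch outcome it assumes plus the instructions it holds, and PE priority between the two paths is not modeled.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Path:
    branch_taken: bool             # predicate value this path assumes
    instructions: Tuple[str, ...]  # instructions in the branch domain


def spawn_d_path(mainline):
    """Copy the mainline path and invert the sign of its predicate."""
    return replace(mainline, branch_taken=not mainline.branch_taken)


def resolve(mainline, d_path, actual_taken):
    """Return the path that continues as mainline once the branch resolves."""
    return mainline if mainline.branch_taken == actual_taken else d_path
```

For the weakly-taken branch at A108 in the listing above, the mainline would assume "taken" and the D-path "not taken"; if the branch resolves as not taken, `resolve` returns the D-path, whose results for A10C and A110 are then treated as mainline.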

Modeling and Results

• Present work utilizes
  – MIPS-1/MIPS-2 machine
  – SGI compiler
  – SPECint 95 (compress, go and ijpeg) and 2000 (bzip2, crafty, gcc, gzip, mcf, parser and vortex) benchmarks
• 3 levels of modeling
  – Trace-driven model (FastLevo) – results in this presentation
  – Detailed cycle-accurate model (LevoSim) – still under development
  – Synthesizable VHDL hardware model (HDLevo) – validation
• Design space exploration
  – Impact of D-paths
  – Real vs. ideal memory
  – Range of bus latency issues

performance

Modeling parameters

Parameter                               Value
L1 I/D geometry                         64KB, 2-way set associative, 32B
L2 unified I/D geometry                 2MB, direct mapped, 32B
Main memory geometry                    infinite, 4-way interleaved
L0, L1, L2, memory hit latencies        1, 1, 10, 100 cycles (does not include bus latency)
Branch predictor                        2-level: 1024-entry BHT, 4096-entry GPHT (2-bit), 16-entry RAS; one per E-window row
Data value predictor                    4096-entry stride predictor, one per E-window row
PE latencies                            same as MIPS R4000
L0 geometry                             32-32b, fully associative, 32b line
Spanning bus delay                      1 cycle
FU/BU delay (no contention)             1 cycle
Buses per RFU and per MFU               2 buses
Buses per PFU                           1 bus
Columns per D-path, ML-D switch time    1 column, 1 cycle plus time to broadcast new D-path values as ML
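For readers unfamiliar with the predictor named in the parameters, here is a minimal gshare sketch with 2-bit saturating counters. It is a generic textbook-style illustration, not the predictor modeled in Levo: the 1024-entry BHT level and the RAS are omitted, and the class name, index hash, and initial counter state are assumptions. Only the 4096-entry table size is taken from the parameter list.

```python
class Gshare:
    """Generic gshare predictor: PC bits XOR global history index a table."""

    def __init__(self, entries=4096):
        self.entries = entries         # must be a power of two
        self.counters = [1] * entries  # 2-bit counters, start weakly not-taken
        self.history = 0               # global branch history register

    def _index(self, pc):
        # Drop the byte offset of a 4-byte instruction, then XOR with history.
        return ((pc >> 2) ^ self.history) % self.entries

    def predict(self, pc):
        """Return (taken, strong) for the branch at pc."""
        counter = self.counters[self._index(pc)]
        return counter >= 2, counter in (0, 3)

    def update(self, pc, taken):
        """Train the counter, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) % self.entries
```

The `strong` flag corresponds to the saturated counter states (0 and 3), which is one common way to obtain the "weakly" versus "strongly" taken distinction used elsewhere in the slides.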

IPC obtained with Levo

[Figure: bar chart of IPC (y-axis, 0 to 14) per application for machine geometry (32 SG/Col, 8 AS/SG, 8 Cols). Applications: bzip2, compress, crafty, gcc, go, gzip, ijpeg, mcf, parser, vortex, and the harmonic mean.]
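The "harmonic" bar summarizes the per-benchmark IPCs with a harmonic mean, the usual choice for averaging rates. A small sketch of the calculation follows; the function is illustrative, and no values are read from the chart.

```python
def harmonic_mean(values):
    """Harmonic mean of positive rates such as per-benchmark IPC."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs at least one positive value")
    return len(values) / sum(1.0 / v for v in values)
```

With purely hypothetical IPCs of 2.0, 4.0 and 4.0, the harmonic mean is 3.0, below the arithmetic mean of about 3.33; low-IPC benchmarks pull the summary down more than high-IPC ones pull it up.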

Speedup obtained using D-paths versus single path execution (harmonic means)

[Figure: bar chart of % speedup over single path (y-axis, 0 to 60) by machine geometry (SG/Col, AS/SG, Cols, D-paths): (8,4,8,8), (8,8,8,8), (16,8,8,8), (32,2,16,16), (32,2,16,16).]

IPC of Levo compared to modeling 100% L1 I/D hits (harmonic means)

[Figure: bar chart of IPC (y-axis, 0 to 25) by geometry (SG/Col, AS/SG, Cols): (8,4,8), (8,8,8), (16,8,8), (32,8,8). Series: Ideal I/D, Ideal I, Ideal D, Real I/D.]
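The geometry tuples used on these charts translate into machine sizes with simple arithmetic. The sketch below assumes that AS/SG counts mainline stations only (D-path stations excluded) and relies on the one-PE-per-sharing-group rule from the earlier slides; the `Geometry` type itself is hypothetical.

```python
from typing import NamedTuple


class Geometry(NamedTuple):
    sg_per_col: int  # sharing groups per column
    as_per_sg: int   # active stations per sharing group (assumed mainline only)
    cols: int        # number of columns

    @property
    def rows(self):
        """Rows per column of the execution window."""
        return self.sg_per_col * self.as_per_sg

    @property
    def active_stations(self):
        """Total stations in the n x m execution window."""
        return self.rows * self.cols

    @property
    def processing_elements(self):
        """One PE per sharing group."""
        return self.sg_per_col * self.cols
```

Under these assumptions, `Geometry(32, 8, 8)` describes a window of 256 rows by 8 columns, that is 2048 active stations sharing 256 PEs, while `Geometry(8, 4, 8)` gives 256 stations and 64 PEs.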

Summary of additional experiments

• Varying the L1-D/L2 hit time (versus 1 cycle)
  – Increased L1-D HT to 2/4/8 cycles = 10/22/43% IPC loss
  – Increased L2 HT to 2/4/8/16 cycles = 0.8/2.3/4.7/8.9% IPC loss
• Varying the number of buses per FU
  – Decreased to 1 bus/FU = 14% IPC loss
  – Increased to 4 buses/FU = 3% IPC gain
• Removal of stride predictor = 0.8% IPC loss
• Varying the number of columns per D-path
  – Increased to 2 cols/D-path = 8% IPC loss
• Use of D-paths = 45% IPC gain
• Varying the number of branch prediction tables
  – Decreased from 1 per row to a single table of the same total size = 0.4% IPC loss

Comments and Future Directions

• I-fetch is the main barrier to further gains in IPC
• The use of a detailed VHDL model of critical components in Levo has allowed us to design scalable resources
• A number of novel microarchitectural features are present in a single design
• Future challenges in Levo include:
  – Improved I-fetch (EV8, trace cache, dynamic D-paths)
  – Finish design of an ARB-like memory
  – Consider compiler support to aid in-order issue and D-path execution
  – Consider multithreaded extensions to support coarse-grained multithreading

To learn more about Levo, visit:

http://www.ece.neu.edu/info/architecture/research/Levo.html

Also see our paper at Euro-Par 2002.
