Post on 15-Jan-2016
Realizing High IPC Through a Scalable, Multipath Microarchitecture
David Kaeli
Northeastern University
Computer Architecture Research Laboratory
Boston, MA USA
The Team
Augustus Uht
Sean Langford*
University of Rhode Island
Kingston, RI USA (*now at CMU)
David Morano
Alireza Khalafi
Marcos de Alba
Northeastern University
Boston, MA USA
The Road to High IPC
• Many studies have concluded that typical programs (e.g., SPECint) contain a significant amount of Instruction Level Parallelism (ILP)
– Lam and Wilson reported an IPC of ~40 for SP-CD-MF (speculative execution, perfect control dependence information, multi-path execution)
– Gonzalez and Gonzalez reported an IPC of ~37 for an infinite instruction window, but no value prediction (IPC went down to just under 10 for a 128 entry instruction window)
• So why are we still living with low, single-digit IPCs?
– Nobody has been aggressive enough!
Machine Philosophy
• Issue a column of instructions on every cycle (not always possible)
• Spend the rest of the time executing, squashing, snarfing and re-executing as necessary to preserve true control flow and data flow dependencies
• Retire instructions at a rate of a column at a time
• Design a datapath that is scalable in terms of latency as the size of the machine grows
• ISA independent
Outline for this Talk
• Overview of the Levo microarchitecture
• Discussion of scalability within the Levo datapath
• Disjoint execution
• Simulation methodology and results
• Comments and summary
Levo Microarchitectural Features
• In-order instruction load, in-order retirement, rampantly out-of-order execution
• Active stations – a more intelligent version of Tomasulo’s reservation stations
• Instruction/operand/memory/predicate time tags – used to enforce data and control dependencies in a distributed fashion
• Hardware runtime predication – used for all BBs with targets within the execution window
• Distributed register file – reduces contention for a shared register file
• Aggressive speculation – execute instructions, independent of any data flow or control flow dependencies
• Disjoint execution to cover control hazards
Limit study with real hardware constraints
In-order Instruction Load
• Instructions are fetched in static order from I-cache, except:
– Unconditional jump paths are followed
– Loops are dynamically unrolled
– Conditional branches with far targets (the target is greater than 2/3rds the size of the execution window), if the branch is strongly predicted taken, begin static fetching from the target
• A conventional 2-level gshare branch predictor is used
• Dynamic run-time predicates are generated so that every branch domain in the Execution Window is control independent
– Nullify operations are broadcast to cause dependent instructions to re-execute
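The far-target rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the branch-to-target distance and the window size are both counted in instructions, the predictor is assumed to expose a "strongly taken" state, and the function and parameter names are hypothetical rather than taken from Levo.

```python
def fetch_from_target(branch_index, target_index, window_size, strongly_taken):
    """Decide whether static fetch should continue at the branch target.

    Assumption: branch_index, target_index and window_size are all
    measured in instructions (not bytes).
    """
    distance = target_index - branch_index
    # "Far" means the target lies more than 2/3 of the window away.
    # Integer arithmetic avoids floating-point rounding at the boundary.
    is_far = 3 * distance > 2 * window_size
    return is_far and strongly_taken
```

With a 512-entry window, a strongly taken branch whose target is 400 instructions ahead would redirect fetch, while one 300 instructions ahead would stay in static order and be handled by predication.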
Microarchitecture
[Figure: Levo block diagram. Labels: I-Cache; Instruction Fetch; Branch Predication; Predication Logic; Instruction Load Buffer; Instruction Window; Memory Window; n x m Time-ordered Execution Window with columns labeled COLUMN 0, COLUMN m-3, COLUMN m-2 and COLUMN m-1; COMMIT; temporally earliest instruction; temporally latest instruction; PEs; Sharing Group (4-8 AS’s, single PE, bus interfaces) containing AS(0,0), AS(1,0), AS(2,0) and AS(3,0).]
Microarchitecture
Active Stations
• More intelligent version of Tomasulo reservation stations
• Each AS holds:
– A single instruction
– Instruction operands
– A time tag denoting its logical position in the execution window
• Each AS shares a processing element with a number of other AS’s (as defined by the size of a sharing group)
Microarchitecture
Active Stations
• Communicate with other active stations in order to:
– Snoop for the latest operand values
– Forward the results to other active stations
– Request a value from other active stations
• Re-execute their instructions with new operand values
• Handle control flow changes through runtime predication
Microarchitecture
Time Tags
• Enforce the nominal sequential order of the instructions executed
• Accompany all in-flight register values, memory values and predicate values
• Have two parts
– Column tag – is decremented by 1 whenever the left-most column is loaded
– Row tag – does not change
[Figure: time tag layout with two fields, Row and Column]
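A two-part time tag can be sketched as follows. This is a minimal illustration, not the Levo hardware encoding: the class and function names are hypothetical, and the ordering key assumes that columns are time-ordered and that rows within a column are ordered top to bottom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeTag:
    row: int     # fixed for the lifetime of the instruction
    column: int  # decremented each time a column is loaded

    def order_key(self, n_rows):
        # Assumption: earlier columns come first in program order,
        # and within a column, lower row numbers come first.
        return self.column * n_rows + self.row


def shift_on_column_load(tags):
    """Decrement every column tag by 1; row tags do not change."""
    return [TimeTag(t.row, t.column - 1) for t in tags]
```

Because every tag is decremented together, the relative order of any two in-flight values is preserved across a shift, which is what lets the tags keep enforcing sequential order.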
Microarchitecture
[Figure: Execution Window of n rows by m columns (rows 0 through n-1, columns 0 through m-1), showing a sharing group of 4 mainline ASs, AS(0,m-1) through AS(3,m-1), sharing a single PE.]
Microarchitecture
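Active Station Operand Snooping and Snarfing
[Figure: active station snooping logic. The AS holds path, time tag, address and value registers (each with a load signal, LD) plus its own AS time tag. Comparators (=, !=, <, >=) check these against the time tag, address, value and path fields on the result operand forwarding bus; the outcome drives execute or re-execute.]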
Microarchitecture
(a) Program Code
    1. R4 = 1
    5. R4 = 2
    9. R3 = R4
    Sequential execution (at end, R3 holds ‘2’).
(b) With Renaming
    1. R4a = 1
    5. R4b = 2
    9. R3 = R4b
    Out-of-Order (OOO) execution:
    - I9 only snarfs I5 result (at end, R3 holds ‘2’).
(c) With Time Tags
    Instruction Number   Instruction   Result Time Tag (ResTT)   Last Snarfed Time Tag in Active Station (LSTT)
    1.                   R4 = 1        1                         –
    5.                   R4 = 2        5                         –
    9.                   R3 = R4       9                         1, then 5
    Out-of-Order (OOO) execution:
    - I1 result and ResTT broadcast: R3 = 1, LSTT = 1
    - I5 result and ResTT broadcast: R3 = 2, LSTT = 5
    (at end, R3 holds ‘2’)
    (Same result if I5 broadcasts first; LSTT is set to and stays at ‘5’; I1 result not snarfed by I9.)
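The time-tag example in (c) can be replayed with a short sketch of the snarfing rule. This is an illustration under assumptions, not the actual Levo logic: time tags are modeled as plain integers, the class and method names are hypothetical, and the rule used is "snarf a broadcast if it names my source register, comes from an earlier instruction, and is not older than the last value I snarfed".

```python
class ActiveStation:
    """Hypothetical model of one AS with a single source operand."""

    def __init__(self, time_tag, src_reg):
        self.time_tag = time_tag  # position of this instruction in program order
        self.src_reg = src_reg    # register this instruction reads
        self.value = None         # latest snarfed operand value
        self.lstt = None          # Last Snarfed Time Tag

    def snoop(self, res_tt, reg, value):
        """Return True if the broadcast is snarfed (instruction must (re-)execute)."""
        if reg != self.src_reg:
            return False          # different register: ignore
        if res_tt >= self.time_tag:
            return False          # producer is not earlier than this instruction
        if self.lstt is not None and res_tt < self.lstt:
            return False          # older than what we already hold: ignore
        self.value = value
        self.lstt = res_tt
        return True


# I9 (R3 = R4) sees I1 (R4 = 1) and then I5 (R4 = 2).
i9 = ActiveStation(time_tag=9, src_reg="R4")
i9.snoop(1, "R4", 1)   # snarfed: value = 1, LSTT = 1
i9.snoop(5, "R4", 2)   # snarfed: value = 2, LSTT = 5
```

If the two broadcasts arrive in the opposite order, the I1 broadcast is rejected by the LSTT check, so I9 ends with the same operand value either way.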
Scalable Microarchitecture
• Time tag size grows linearly with the total number of ASs
• No reorder buffer (typically grows O(n^2))
• No centralized architected register file
– Register forwarding units hold the ISA-defined register state
– Forwarding transactions maintain state
• Segmented result buses – fixed length
• Distributed L0 caching in the datapath
Observation About Register Lifetimes
• The MultiScalar Project demonstrated that register lifetimes are short (spanning 1-2 basic blocks, within 32 instructions)
• If we have instructions laid out in a time-ordered fashion, the probability we will have to forward in time very far is low
• As a result, we can segment our interconnection fabric, assuming that communications will only span either the current, or at most the next, segment
Segmented Buses (Spanning Buses)
• Use segmented buses to propagate execution results to later stations
• Adjacent segments are interconnected with Forwarding Units (one forwarding unit, per bus, per column)
• Register Forwarding/Filter Units (RFUs) hold a version of the ISA register state
• Memory Forwarding/Filter Units (MFUs) and Predicate Forwarding Units (PFUs) are also provided
• Backwarding buses are also provided
• The number of I/Os to a FU is independent of the machine size and only depends on the column height
• Segmented buses help to preserve scalability in our datapath
Microarchitecture
[Figure: three adjacent columns of ASs, drawn in groups of four with blocks labeled “M D”, linked by forwarding units (FUs); each column takes input from the previous column and drives the next column. Bus labels: forwarding write bus, forwarding read buses, primary forwarding read bus, backwarding write bus, backwarding read buses, primary backwarding read bus.]
Register Forwarding/Filter Units
• Capture the persistent register state
• All buses are register transaction buses
• Consolidate update transactions on input
• Updates are forwarded to the output bus request logic immediately when possible
• Requests are “filtered” based on time-tag value
• Updates are managed in the file store in FIFO order
[Figure: RFU block diagram. An ISA register file per path, with time-tags, sits between two logic blocks, one facing backward in time and one facing forward in time.]
Memory Forwarding/Filter Units
• Serve as an L0 cache
• All buses are memory buses (the number of which is set according to the interleave factor)
• Consolidate update transactions on input
• Updates are “forwarded” to the output bus request logic immediately when possible
• Requests are “filtered” based on time-tag value
• Current policy is to queue outgoing requests or responses in FIFOs until the buses are granted for use
[Figure: MFU block diagram. A memory cache with time-tags and two FIFOs sits between two logic blocks, one facing backward in time and one facing forward in time. Bus labels: backwarding write buses, backwarding read buses, forwarding write buses, forwarding read buses.]
Disjoint Path Execution
• Levo can only obtain high IPC if:
– we can provide a large window of instructions to execute
– a large percentage of the instructions on the eventual committed control-flow path are included in the window
• To address the issues with hard-to-predict conditional control flow, we utilize disjoint path spawning in Levo
and DEE
Disjoint Path Execution
• To enable path spawning we provide a disjoint path (D-path) set of AS’s that share a processing element with a mainline set of AS’s
• D-paths are spawned in the case of hammock branches
• The D-path is copied from the mainline path
• The sign of the associated predicate is inverted for the D-path
• The D-path receives lower priority for the PE than the mainline
• When a hammock branch is mispredicted, we can treat the D-path as the new mainline path, and continue execution accordingly
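The spawn-and-switch steps above can be sketched at a very high level. This is only an illustration under simplifying assumptions: each station is reduced to an instruction string and a single boolean predicate sign, PE priority is not modeled, and all names are hypothetical rather than Levo's.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Station:
    instruction: str
    predicate: bool  # assumed sign of the hammock branch predicate


def spawn_d_path(mainline):
    """Copy the mainline stations and invert the predicate sign on each copy."""
    return [replace(s, predicate=not s.predicate) for s in mainline]


def select_mainline(mainline, d_path, mispredicted):
    """After the branch resolves, pick the path that execution continues on."""
    return d_path if mispredicted else mainline
```

The point of the sketch is that the D-path already holds the same instructions under the opposite predicate, so a misprediction only needs a switch of which path is treated as mainline.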
and DEE
Label   Addr  Instruction        History
START:  A100  LW    R2,20(R4)
        A104  SUB   R2,R2,#1
        A108  BEQZ  R2,TAR1      Weakly T
        A10C  ADD   R2,R2,#4
        A110  SW    30(R4),R2
TAR1:   A114  LW    R2,30(R4)
        A118  SUB   R2,R2,#8
        A11C  BEQZ  R2,TAR2      Weakly NT
        A120  SW    20(R4),R2
TAR2:   A124  ADD   R2,R2,#10
        A128  SUB   R1,R1,#1
        A12C  BNEQZ R1,START     Strongly T
        A130  SW    40(R4),R2
        ...
A100 LW R2,20(R4)
A104 SUB R2,R2,#1
A108 BEQZ R2,TAR1
A10C ADD R2,R2,#4
A110 SW 30(R4),R2
A114 LW R2,28(R4)
A118 SUB R2,R2,#8
A11C BEQZ R2, TAR2
A120 SW 20(R4),R2
A124 ADD R2,R2,#10
A128 SUB R1,R1,#1
A12C BNEQZ R1,START
A130 SW 40(R4),R2
Mainline path
Disjoint path
Modeling and Results
• Present work utilizes
– MIPS-1/MIPS-2 machine
– SGI compiler
– SPECint 95 (compress, go and ijpeg) and 2000 (bzip2, crafty, gcc, gzip, mcf, parser and vortex) benchmarks
• 3 levels of modeling
– Trace-driven model (FastLevo) – results in this presentation
– Detailed cycle-accurate model (LevoSim) – still under development
– Synthesizable VHDL hardware model (HDLevo) – validation
• Design space exploration
– Impact of D-paths
– Real vs. ideal memory
– Range of bus latency issues
performance
Modeling parameters
L1 I, D geometry                   64KB, 2WSA, 32B
L2 unified I/D geometry            2MB, direct mapped, 32B
Main memory geometry               infinite, 4W interleaved
L0, L1, L2, memory hit latencies   1, 1, 10, 100 cycles (does not include bus latency)
Branch predictor                   2-level: 1024 entry BHT, 4096 entry GPHT 2-bit, 16 entry RAS; one per E-window row
Data value predictor               4096 stride predictor, one per E-window row
PE latencies                       same as MIPS R4000
performance
Modeling parameters
L0 geometry                        32-32b, fully associative, 32b line
Spanning bus delay                 1 cycle
FU/BU delay (no contention)        1 cycle
Buses per RFU and per MFU          2 buses
Buses per PFU                      1 bus
Columns per D-path                 1 column
ML-D switch time                   1 cycle plus time to broadcast new D-path values as ML
performance
IPC obtained with Levo
[Figure: bar chart of IPC (y-axis, 0 to 14) by application: bzip2, compress, crafty, gcc, go, gzip, ijpeg, mcf, parser, vortex, and the harmonic mean. Machine geometry: 32 SG/Col, 8 AS/SG, 8 Cols.]
performance
Speedup obtained using D-paths versus single path execution (harmonic means)
[Figure: bar chart of % speedup over single path (y-axis, 0 to 60) by machine geometry (SG/Col, AS/SG, Cols, D-paths): (8,4,8,8), (8,8,8,8), (16,8,8,8), (32,2,16,16), (32,2,16,16).]
performance
IPC of Levo compared to modeling 100% L1 I/D hits (harmonic means)
[Figure: bar chart of IPC (y-axis, 0 to 25) by geometry (SG/Col, AS/SG, Cols): (8,4,8), (8,8,8), (16,8,8), (32,8,8). Series: Ideal I/D, Ideal I, Ideal D, Real I/D.]
performance
Summary of additional experiments
• Varying the L1-D/L2 hit time (versus 1 cycle)
– Increased L1-D HT to 2/4/8 cycles = 10/22/43% IPC loss
– Increased L2 HT to 2/4/8/16 cycles = 0.8/2.3/4.7/8.9% IPC loss
• Varying the number of buses per FU
– Decreased to 1 bus/FU = 14% IPC loss
– Increased to 4 buses/FU = 3% IPC gain
• Removal of stride predictor = 0.8% IPC loss
• Varying the number of columns per D-path
– Increased to 2 cols/D-path = 8% IPC loss
• Use of D-paths = 45% IPC gain
• Varying the number of branch prediction tables
– Decreased from 1 per row to a single table of the same total size = 0.4% IPC loss
performance
Comments and Future Directions
• I-fetch is the main barrier to further gains in IPC
• The use of a detailed VHDL model of critical components in Levo has allowed us to design scalable resources
• A number of novel microarchitectural features are present in a single design
• Future challenges in Levo include:
– Improved I-fetch – (EV8, trace cache, dynamic D-paths)
– Finish design of an ARB-like memory
– Consider compiler support to aid in-order issue and D-path execution
– Consider multithreaded extensions to support coarse-grained multithreading
To learn more about Levo, visit:
http://www.ece.neu.edu/info/architecture/research/Levo.html
Also see our paper at Euro-Par 2002.