Out-of-Order Commit Processors
Adrián Cristal (UPC), Daniel Ortega (HP Labs),
Josep Llosa (UPC) and Mateo Valero (UPC)
HPCA-10, Madrid, February 14-17th, 2004
Motivation I
[Figure: IPC (0 to 4) versus number of in-flight instructions (128, 256, 512, 1024, 2048, 4096) for Spec FP 2000; legend: L2 Perfect, 100, 500, 1000; annotations: 0.30X, 3.5X]
Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002
Motivation II – Resources – ROB
[Figure: number of in-flight instructions (SpecFP), y-axis 0 to 2000, at the 10%, 25%, 50%, 75% and 90% points; values shown on the x-axis: 1168, 1382, 1607, 1868, 1955, 2034]
Often nearly full
Instructions in-flight (ROB=2048, Mem 500 cycles)
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003
Motivation III – Resources – FP Queue
[Figure: FP queue occupancy (y-axis 0 to 600) against the distribution of in-flight instructions (1, 10, 25, 50, 75, 90, 100), with numbers of instructions 1168, 1382, 1607, 1868, 1955; series: Blocked-Long, Blocked-Short, Ready]
Long/Short Lat. Inst.
Remove – Reinsert
Dependence Chain
State of FP Queues (ROB=2048, Mem 500 cycles)
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003
Outline
Motivation
Out-of-Order Commit
Multicheckpointing ROB
Slow Lane Instruction Queue
Performance Evaluation
Conclusion
Out-of-Order Commit
[Diagram 1: instruction flow between Oldest and Newest, containing I5, Br 3, I6, Br 2, St, I4, I3, Ld, Br 1, I2, I1, Ld; markers: Oldest Checkpoint, Checkpoint, New Checkpoint]
[Diagram 2: same instruction flow and checkpoint markers; additional labels: Store Buffer, To Memory, Gang Commit]
[Diagram 3: labels Branch Misprediction and Recover from Checkpoint; instruction flow containing St, I4, I3, I5, Br 3, Br 2, St, I7, I8; markers: Oldest Checkpoint, Checkpoint, Store Buffer]
Out-of-Order Commit II
Checkpoint Table. Each entry has:
PC of the next instruction
Instruction Counter: counts the number of instructions still alive
Map Table: allows the register file to be recovered
Pointer to the Store Buffer
Mechanism to recover free registers
• Future Free
– One bit for each physical register
• Large Virtual ROB: Tech. Rep. UPC-DAC-2002-39
• Ephemeral Registers: Tech. Rep. UPC-DAC-2003.51
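The fields of a checkpoint table entry can be pictured as a small record. The following is only a sketch in Python: the class name, the field names and the use of a set to stand in for the Future Free bit vector are assumptions made for illustration, not details taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CheckpointEntry:
    """One checkpoint-table entry; field names are illustrative assumptions."""
    next_pc: int                       # PC of the next instruction
    map_table: Dict[str, int]          # logical -> physical register mapping
    store_buffer_ptr: int              # pointer into the store buffer
    instruction_counter: int = 0       # instructions still alive
    # Future Free vector, modeled as the set of physical registers whose bit is set
    future_free: Set[int] = field(default_factory=set)


entry = CheckpointEntry(next_pc=0x400, map_table={"R1": 7, "R2": 3}, store_buffer_ptr=0)
entry.future_free.add(7)               # physical register 7 is to be freed later
```

A hardware table would hold a fixed number of such entries; the record above only names what each entry stores.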
Checkpoint Creation
Save the PC
Save the Map Table
Clear the Future Free bits
Clear the Instruction Counter
Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer.
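The creation steps above could be sketched as follows. This is a hedged illustration: `create_checkpoint`, the dictionary keys, and the split of the store buffer into a list of entries plus a set of marked positions are all hypothetical modeling choices.

```python
def create_checkpoint(pc, map_table, store_entries, store_marks):
    """Return a new checkpoint entry built with the steps listed above."""
    ptr = len(store_entries)           # first free entry of the store buffer
    store_marks.add(ptr)               # mark that entry in the store buffer
    return {
        "next_pc": pc,                 # save the PC
        "map_table": dict(map_table),  # save (copy) the map table
        "future_free": set(),          # cleared Future Free bits
        "instruction_counter": 0,      # cleared Instruction Counter
        "store_buffer_ptr": ptr,
    }


rename_map = {"R1": 7}
stores, marks = ["st0", "st1"], set()
ckpt = create_checkpoint(0x400, rename_map, stores, marks)
rename_map["R1"] = 9                   # later renaming does not alter the snapshot
```

Copying the map table is what lets a later recovery restore the register mapping that was valid when the checkpoint was taken.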
Instruction Decode
Add 1 to the Instruction Counter of the newest checkpoint
For R1 ← R2 op R3, if R1 is mapped to PhyReg_N:
• Set the PhyReg_N bit of the Future Free vector
• Map R1 to the new physical register
Associate the instruction with the last created checkpoint
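As a rough sketch of the decode-time bookkeeping, assuming a simple free list of physical registers (the free list, the function name `decode` and the dictionary layout are assumptions, not part of the slides):

```python
def decode(dest, map_table, free_regs, newest_ckpt):
    """Rename one instruction `dest <- src1 op src2` at decode time."""
    newest_ckpt["instruction_counter"] += 1
    old_phys = map_table.get(dest)
    if old_phys is not None:
        # Remember the previous mapping so it can be freed with the checkpoint.
        newest_ckpt["future_free"].add(old_phys)
    new_phys = free_regs.pop(0)        # assumed free list of physical registers
    map_table[dest] = new_phys
    # The returned record associates the instruction with its checkpoint.
    return {"dest_phys": new_phys, "checkpoint": newest_ckpt}


ckpt = {"instruction_counter": 0, "future_free": set()}
rename_map, free_regs = {"R1": 5}, [9, 10]
instr = decode("R1", rename_map, free_regs, ckpt)
```

In this example R1 was mapped to physical register 5, so bit 5 is set in the Future Free vector and R1 is remapped to register 9.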
Instruction Writeback
Decrement the Instruction Counter of the checkpoint associated with the instruction
If the instruction is a mispredicted branch, recover from the associated checkpoint:
• Fetch instructions from the saved PC
• Release all entries in the store buffer from the pointed entry onward
• Free all registers in the Future Free vector of the entry and of all newer checkpoint entries
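A minimal sketch of the writeback step, under several stated assumptions: the `core` dictionary, discarding the newer checkpoints, and resetting the counter after recovery are illustrative choices, and register reclamation on recovery is deliberately left out of the model.

```python
def writeback(instr, checkpoints, core):
    """Handle writeback of `instr`; `core` holds fetch PC, map table, store buffer."""
    ckpt = instr["checkpoint"]
    ckpt["instruction_counter"] -= 1
    if not instr.get("mispredicted_branch"):
        return
    # Recover from the associated checkpoint (located by identity, not equality).
    pos = next(i for i, c in enumerate(checkpoints) if c is ckpt)
    core["fetch_pc"] = ckpt["next_pc"]                     # refetch from saved PC
    core["map_table"] = dict(ckpt["map_table"])            # restore register mapping
    del core["store_buffer"][ckpt["store_buffer_ptr"]:]    # release younger stores
    del checkpoints[pos + 1:]          # assumption: newer checkpoints are discarded
    ckpt["instruction_counter"] = 0    # assumption: its instructions are squashed
    # Register reclamation on recovery is intentionally not modeled here.


old = {"next_pc": 0x100, "map_table": {"R1": 1}, "store_buffer_ptr": 1,
       "instruction_counter": 3}
new = {"next_pc": 0x200, "map_table": {"R1": 2}, "store_buffer_ptr": 3,
       "instruction_counter": 1}
checkpoints = [old, new]
core = {"fetch_pc": 0x250, "map_table": {"R1": 4},
        "store_buffer": ["s0", "s1", "s2", "s3"]}
writeback({"checkpoint": old, "mispredicted_branch": True}, checkpoints, core)
```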
Checkpoint Elimination
If the Instruction Counter is 0 and the checkpoint is the oldest one, the checkpoint is removed:
• Clear the corresponding mark in the store buffer
• The registers marked in the Future Free vector are freed
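The elimination rule could be sketched as below. The function name, the list of checkpoints ordered oldest first, and the free list that receives released registers are assumptions for illustration only.

```python
def eliminate_oldest(checkpoints, store_marks, free_regs):
    """Remove the oldest checkpoint if all of its instructions have finished."""
    if not checkpoints or checkpoints[0]["instruction_counter"] != 0:
        return False
    oldest = checkpoints.pop(0)
    store_marks.discard(oldest["store_buffer_ptr"])   # clear its store-buffer mark
    free_regs.extend(sorted(oldest["future_free"]))   # free the marked registers
    return True


checkpoints = [
    {"instruction_counter": 0, "store_buffer_ptr": 0, "future_free": {5, 3}},
    {"instruction_counter": 2, "store_buffer_ptr": 4, "future_free": {8}},
]
marks, free_regs = {0, 4}, []
first = eliminate_oldest(checkpoints, marks, free_regs)    # oldest has finished
second = eliminate_oldest(checkpoints, marks, free_regs)   # new oldest still busy
```

Only the head of the list is ever examined, which mirrors the requirement that a checkpoint must be the oldest one before it can be removed.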
Outline
Motivation
Out-of-Order Commit
Slow Lane Instruction Queue
Performance Evaluation
Conclusions
Slow Lane Instruction Queue
[Diagrams 1 and 2: Pseudo ROB with the instruction flow from oldest to newest (Ld, x, x, x, a, x, x, x, b, x) and a data dependence marked on it; other structures: Load/Store Queue, Instruction Queue, Slow Lane Instruction Queue; labels: LD, a, b]
[Diagram 3: same layout, with the additional labels Load End and Begin reinsert]
Slow Lane Instruction Queue II
Very simple buffer: the Slow Lane Instruction Queue (SLIQ)
Each load that misses in L2 has a pointer to an entry in the SLIQ
Pseudo ROB
Slow Lane Instruction Queue III
When an instruction is retired from the Pseudo ROB, its state is examined:
• If the instruction is a load miss, the pointer is written
• If the instruction depends on a long-latency instruction, it is moved to the SLIQ
When a load that missed in L2 finishes its execution:
The SLIQ is traversed from the instruction pointed to by the load, if this point is older than the current traversal position.
The load's dependent instructions are reinserted into the IQ
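The two events described above (retirement from the Pseudo ROB and completion of an L2-missing load) could be modeled roughly as follows. This is a sketch under assumptions: the class and method names, the tracking of long-latency destination registers in a set, the use of unique physical register names, and the tombstones left in the buffer are all hypothetical. The check against the current traversal position is not modeled.

```python
class SlowLaneQueue:
    """Toy model of the SLIQ; structure and names are illustrative assumptions."""

    def __init__(self):
        self.entries = []      # instructions waiting on long-latency loads
        self.load_ptr = {}     # load id -> SLIQ position recorded at retirement
        self.slow_regs = set() # registers produced by long-latency chains

    def retire(self, instr):
        """Called when `instr` leaves the Pseudo ROB; True if moved to the SLIQ."""
        if instr.get("l2_miss"):
            self.load_ptr[instr["id"]] = len(self.entries)   # write the pointer
            self.slow_regs.add(instr["dest"])
            return False
        if self.slow_regs & set(instr["srcs"]):
            self.entries.append(instr)                       # move to the SLIQ
            self.slow_regs.add(instr["dest"])                # chain propagates
            return True
        return False

    def load_finished(self, load, iq):
        """Traverse the SLIQ from the load's pointer and reinsert its dependents."""
        start = self.load_ptr.pop(load["id"])
        woken = {load["dest"]}
        for i in range(start, len(self.entries)):
            instr = self.entries[i]
            if instr is not None and woken & set(instr["srcs"]):
                iq.append(instr)
                woken.add(instr["dest"])
                self.entries[i] = None   # tombstone keeps other pointers valid
        self.slow_regs -= woken


ld = {"id": "ld", "l2_miss": True, "dest": "p1", "srcs": []}
x = {"id": "x", "dest": "p2", "srcs": ["p9"]}
a = {"id": "a", "dest": "p3", "srcs": ["p1"]}
b = {"id": "b", "dest": "p4", "srcs": ["p3"]}
sliq, iq = SlowLaneQueue(), []
moved = [sliq.retire(i) for i in (ld, x, a, b)]
sliq.load_finished(ld, iq)
```

The example follows the Ld, x, a, b pattern of the diagrams: x is independent and stays out of the SLIQ, while a and b form the dependence chain that is parked and later reinserted.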
Performance Evaluation
Processor Configuration (Baseline 4096):
Fetch/Commit width        4
Branch Predictor          16K entries Gshare
Instruction L1            32 KB, 4-way, 32-byte line, 2 cycles
Data L1                   32 KB, 4-way, 32-byte line, 2 cycles
L2 size                   512 KB, 4-way, 64-byte line, 10 cycles
Memory Latency            1000 cycles
Physical Registers        4096 entries
Load/Store Queue          4096 entries
Reorder Buffer            4096 entries
Integer General Units     4 (lat/rep 1/1)
Integer Mult/Div Units    2 (lat/rep 3/1 and 20/20)
FP Functional Units       4 (lat/rep 2/1)
FP Mult/Div/Sqrt Units    2 (lat/rep 4/1, 12/12, 24/24)
Performance Evaluation - Some Considerations
We mix both models.
The processor takes the checkpoints when the instructions are retired from the pseudo ROB.
• Many branches are resolved by this time, so the probability of rolling back to the checkpoint is reduced.
If a mispredicted branch is detected in the pseudo ROB, a normal rollback mechanism is used.
IPC – Different Configurations
[Figure: IPC (0 to 3.5) versus Slow Lane Instruction Queue size (512, 1024, 2048); series: COoO 32, COoO 64, COoO 128, Baseline; reference levels: Baseline 4096, Baseline 128]
Number of Checkpoints and Performance
[Figure: IPC (0 to 3.5) for the Baseline and for 4, 8, 16, 32, 64 and 128 checkpoints]
Baseline: 2048 IQ. SLIQ 2048 entries and 128 IQ. 2048 Physical Registers
In-Flight Instructions
[Figure: in-flight instructions (0 to 3000) versus Slow Lane Instruction Queue size (512, 1024, 2048); series: COoO 32, COoO 64, COoO 128, Baseline; reference levels: Baseline 4096, Baseline 128]
Delay in re-insertion from SLIQ
[Figure: y-axis 0 to 3; x-axis 32, 64, 128; legend as extracted: 1, 48, 12]
SLIQ: 1024 entries
Towards affordable Kilo-Instruction Processor
Adding Ephemeral Registers to the Out-of-Order Commit Processors
Changing the SLIQ into a list of buckets of instructions
J. Martínez et al. “Ephemeral Registers”, Technical Report CSL-TR-2003-1035 , 2003.
Putting It All Together
[Figure: IPC (0 to 3.5) for 512, 1024 and 2048 Virtual Tags (Virtual Registers) at Memory Latency 100, 500 and 1000, with IQs of 128 entries; legend (Physical Registers): 256, 512; reference levels at each memory latency: Limit 4096, Baseline 128]
Conclusion
To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained.
• The resources must be up-sized.
• The resources are underutilized.
We present two techniques to reduce the need for resources and show their effectiveness:
• Out-of-Order Commit
• Slow Lane Instruction Queue
Thank you very much
State of ST Queues (specInt, ROB=2048)
[Figure: ST queue occupancy (y-axis 0 to 250) against the distribution of in-flight instructions (1, 10, 25, 50, 75, 90, 100), with numbers of instructions 20, 108, 435, 1004, 1361 (INT); series: Ready, Address Ready, Blocked-Long, Blocked-Short]
Locality
State of Int Queues (specInt, ROB=2048)
[Figure: Int. queue occupancy (y-axis 0 to 450) against the distribution of in-flight instructions (1, 10, 25, 50, 75, 90, 100), with numbers of instructions 20, 108, 435, 1004, 1361 (INT); series: Blocked-Long, Blocked-Short, Ready]
Long/Short Lat. Inst.
Remove – Reinsert
Dependence Chain
State of Registers (Int, ROB=2048)
[Figure: Int. registers (y-axis 0 to 1000) versus number of in-flight instructions (SpecInt) at the 10%, 25%, 50%, 75% and 90% points; values shown on the x-axis: 20, 108, 435, 1004, 1361, 1756; series: Dead, Blocked-Long, Blocked-Short, Live]
Early Release
Virtual Registers