Advanced Microarchitecture
Lecture 2: Pipelining and Superscalar Review
Pipelined Design
• Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.)
• Bandwidth or throughput = performance
– BW = number of tasks / unit time
– For a system that operates on one task at a time: BW = 1 / latency
• Pipelining can increase BW if there are many repetitions of the same operation/task
• Latency per task remains the same or increases
Pipelining Illustrated
• Unpipelined: one block of combinational logic, N gate delays; BW = ~(1/n)
• 2-stage: two blocks of N/2 gate delays each; BW = ~(2/n)
• 3-stage: three blocks of N/3 gate delays each; BW = ~(3/n)
Performance Model
• Starting from an unpipelined version with propagation delay T and BW = 1/T
• Dividing the logic into k stages of delay T/k, each followed by a latch of delay S:

  Perf_pipe = BW_pipe = 1 / (T/k + S)

  where S = latch delay and k = number of stages
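A quick numeric check of the model above (a sketch; `pipelined_bw` is an illustrative helper name, and the T/S values are taken from the parameter sets used later in this lecture):

```python
def pipelined_bw(T, S, k):
    """Throughput of a k-stage pipeline: 1 / (T/k + S)."""
    return 1.0 / (T / k + S)

T, S = 400, 22   # propagation delay and latch delay, arbitrary time units
base = 1.0 / T   # unpipelined bandwidth
for k in (1, 2, 4, 8, 16):
    # Speedup over the unpipelined design; the latch overhead S keeps it below k
    print(k, round(pipelined_bw(T, S, k) / base, 2))
```

Note that as k grows, throughput saturates at 1/S: the latch delay, not the logic, eventually sets the clock.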
Hardware Cost Model
• Starting from an unpipelined version with hardware cost G
• Each of the k stages adds a latch of cost L:

  Cost_pipe = G + kL

  where L = latch cost (incl. control) and k = number of stages
Cost/Performance Tradeoff

Cost/Performance:
  C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S)
      = LT + GS + LSk + GT/k

Optimal Cost/Performance: find the minimum C/P w.r.t. the choice of k:

  d/dk [(Lk + G)(T/k + S)] = 0 + 0 + LS - GT/k² = 0

  k_opt = sqrt(GT / (LS))
“Optimal” Pipeline Depth: kopt

(plot: cost/performance ratio C/P, ×10⁴, versus pipeline depth k from 0 to 50; each curve dips to a minimum at its kopt)
– G=175, L=41, T=400, S=22
– G=175, L=21, T=400, S=11
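The minima in the plot can be reproduced directly from the kopt formula derived on the previous slide (`k_opt` is an illustrative function name; the parameters are the two sets from the plot):

```python
from math import sqrt

def k_opt(G, L, T, S):
    """Depth minimizing C/P = (Lk + G)(T/k + S), from d(C/P)/dk = LS - GT/k^2 = 0."""
    return sqrt(G * T / (L * S))

# The two parameter sets plotted on this slide
print(round(k_opt(G=175, L=41, T=400, S=22), 1))  # 8.8
print(round(k_opt(G=175, L=21, T=400, S=11), 1))  # 17.4
```

Cheaper, faster latches (smaller L and S) push the optimal depth substantially higher.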
Cost?
• “Hardware cost”
– Transistor/gate count
• Should include the additional logic to control the pipeline
– Area (related to gate count)
– Power!
• More gates → more switching
• More gates → more leakage
• Many metrics to optimize
• Very difficult to determine what really is “optimal”
Pipelining Idealism
• Uniform suboperations
– The operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of identical operations
– The same operations are to be performed repeatedly on a large number of different inputs
• Repetition of independent operations
– All the repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts

Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)
Instruction Pipeline Design
• Uniform suboperations … NOT!
– Balance pipeline stages
• Stage quantization to yield balanced stages
• Minimize internal fragmentation (some waiting stages)
• Identical operations … NOT!
– Unify instruction types
• Coalesce instruction types into one “multi-function” pipe
• Minimize external fragmentation (some idling stages)
• Independent operations … NOT!
– Resolve data and resource hazards
• Inter-instruction dependency detection and resolution
• Minimize performance loss
The Generic Instruction Cycle
• The “computation” to be pipelined:
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand(s) Fetch (OF)
4. Instruction Execution (EX)
5. Operand Store (OS), a.k.a. writeback (WB)
6. Update Program Counter (PC)
The Generic Instruction Pipeline

Based on the obvious subcomputations:
  Instruction Fetch   → IF
  Instruction Decode  → ID
  Operand Fetch       → OF/RF
  Instruction Execute → EX
  Operand Store       → OS/WB
Balancing Pipeline Stages

Stage latencies: TIF = 6 units, TID = 2 units, TOF = 9 units, TEX = 5 units, TOS = 9 units

• Without pipelining: Tcyc = TIF + TID + TOF + TEX + TOS = 31
• Pipelined: Tcyc = max{TIF, TID, TOF, TEX, TOS} = 9

Speedup = 31 / 9

Can we do better in terms of either performance or efficiency?
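The arithmetic on this slide is easy to check (a sketch; variable names are illustrative):

```python
# Per-stage latencies from the slide, in arbitrary time units
stage_times = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

t_unpipelined = sum(stage_times.values())  # 31: one long combinational path
t_cyc = max(stage_times.values())          # 9: the slowest stage sets the clock
speedup = t_unpipelined / t_cyc            # 31/9, well short of the ideal 5x

print(t_unpipelined, t_cyc, round(speedup, 2))  # 31 9 3.44
```

The gap between 3.44 and the ideal 5 is the internal fragmentation caused by unbalanced stages.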
Balancing Pipeline Stages
• Two methods for stage quantization
– Merging multiple subcomputations into one
– Subdividing a subcomputation into multiple smaller ones
• Recent/current trends
– Deeper pipelines (more and more stages)
• To a certain point: then the cost function takes over
– Multiple different pipelines/subpipelines
– Pipelining of memory accesses (tricky)
Granularity of Pipeline Stages
• Coarser-grained machine cycle: 4 machine cycles / instruction
– Merge IF and ID: TIF&ID = 8 units, TOF = 9 units, TEX = 5 units, TOS = 9 units
• Finer-grained machine cycle: 11 machine cycles / instruction
– Tcyc = 3 units
– Original stage times TIF/TID/TOF/TEX/TOS = 6/2/9/5/9 split into 2/1/3/2/3 subcycles
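The stage quantization on this slide can be verified numerically (a sketch; the 3-unit subcycle and the merge of IF with ID come from the slide):

```python
from math import ceil

stage_times = [6, 2, 9, 5, 9]   # IF, ID, OF, EX, OS in time units

# Coarser-grained: merge IF and ID into one 8-unit stage
coarse = [6 + 2, 9, 5, 9]
print(len(coarse), max(coarse))  # 4 machine cycles/instruction, clock = 9 units

# Finer-grained: carve every stage into 3-unit subcycles
t_cyc = 3
fine = [ceil(t / t_cyc) for t in stage_times]   # [2, 1, 3, 2, 3]
print(sum(fine), t_cyc)          # 11 machine cycles/instruction, clock = 3 units
```

Going finer triples the clock rate (9 units down to 3) at the cost of more stages and more latches.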
Hardware Requirements
• Logic needed for each pipeline stage
• Register file ports needed to support all (relevant) stages
• Memory accessing ports needed to support all (relevant) stages
(refers to the coarser- and finer-grained pipelines from the previous slide)
Pipeline Examples

MIPS R2000/R3000 (5 stages):
  IF → RD → ALU → MEM → WB
  (generic mapping: IF, ID/OF, EX, OS)

AMDAHL 470V/7 (12 stages):
  PC GEN → Cache Read → Cache Read → Decode → Read REG → Add GEN →
  Cache Read → Cache Read → EX 1 → EX 2 → Check Result → Write Result
  (generic mapping: IF = PC GEN and the first cache reads; ID = Decode; OF = Read REG through the second pair of cache reads; EX = EX 1, EX 2; OS = Check Result, Write Result)
Instruction Dependencies
• Data dependence
– True dependence (RAW)
• Instruction must wait for all required input operands
– Anti-dependence (WAR)
• Later write must not clobber a still-pending earlier read
– Output dependence (WAW)
• Earlier write must not clobber an already-finished later write
• Control dependence (a.k.a. procedural dependence)
– Conditional branches cause uncertainty in instruction sequencing
– Instructions following a conditional branch depend on the execution of the branch instruction
– Instructions following a computed branch depend on the execution of the branch instruction
Example: Quick Sort on MIPS

# for (; (j<high) && (array[j]<array[low]); ++j);
# $10 = j; $9 = high; $6 = array; $8 = low
      bge   $10, $9, $36
      mul   $15, $10, 4
      addu  $24, $6, $15
      lw    $25, 0($24)
      mul   $13, $8, 4
      addu  $14, $6, $13
      lw    $15, 0($14)
      bge   $25, $15, $36
$35:  addu  $10, $10, 1
      . . .
$36:  addu  $11, $11, -1
      . . .
Hardware Dependency Analysis
• Processor must handle
– Register data dependencies: RAW, WAW, WAR
– Memory data dependencies: RAW, WAW, WAR
– Control dependencies
Terminology
• Pipeline hazards:
– Potential violations of program dependencies
– Must ensure program dependencies are not violated
• Hazard resolution:
– Static method: performed at compile time in software
– Dynamic method: performed at runtime using hardware
– Either way: stall, flush, or forward
• Pipeline interlock:
– Hardware mechanism for dynamic hazard resolution
– Must detect and enforce dependencies at runtime
Pipeline: Steady State

           t0   t1   t2   t3   t4   t5
  Instj    IF   ID   RD   ALU  MEM  WB
  Instj+1       IF   ID   RD   ALU  MEM
  Instj+2            IF   ID   RD   ALU
  Instj+3                 IF   ID   RD
  Instj+4                      IF   ID

(one instruction enters the pipeline per cycle; once the pipe is full, one completes per cycle)
Pipeline: Data Hazard

(same diagram as the steady state: if a nearby younger instruction needs a result of Instj, its RD stage overlaps with or precedes Instj's WB, so reading the register file would return a stale value)
Pipeline: Stall on Data Hazard

(Instj+2 depends on an earlier instruction: it is stalled in RD, which in turn holds Instj+3 stalled in ID and Instj+4 stalled in IF, until the needed result is written back; then all three proceed down the pipeline)
Different View

        t0   t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
  IF    Ij   Ij+1  Ij+2  Ij+3  Ij+4  Ij+4  Ij+4  Ij+4
  ID         Ij    Ij+1  Ij+2  Ij+3  Ij+3  Ij+3  Ij+3  Ij+4
  RD               Ij    Ij+1  Ij+2  Ij+2  Ij+2  Ij+2  Ij+3  Ij+4
  ALU                    Ij    Ij+1  nop   nop   nop   Ij+2  Ij+3  Ij+4
  MEM                          Ij    Ij+1  nop   nop   nop   Ij+2  Ij+3
  WB                                 Ij    Ij+1  nop   nop   nop   Ij+2

(stalled instructions are held in place for three cycles while nops/bubbles flow down ALU, MEM, and WB)
Pipeline: Forwarding Paths

(same diagram; results can be forwarded from the ALU, MEM, and WB stages of older instructions directly to the inputs of younger ones)
• Many possible paths
• MEM → ALU: a load's consumer requires stalling even with forwarding paths
ALU Forwarding Paths

(datapath sketch: after IF and ID, src1/src2 are read from the register file; comparators (==) match each source register against the dest of the instructions currently in the ALU and MEM stages, muxing the in-flight result into the ALU inputs on a match)
• Deeper pipelines may require additional forwarding paths
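The comparator-and-mux logic in the sketch can be expressed as a small selection function (a sketch only; `forward` and its arguments are illustrative names, and the register-0 convention is a MIPS-style assumption):

```python
def forward(src, exmem_dest, memwb_dest, regfile, exmem_val, memwb_val):
    """Select the value for one ALU source operand.

    Prefer the youngest in-flight producer (EX/MEM), then the older one
    (MEM/WB), else fall back to the register file. Register 0 is assumed
    hardwired to zero and is never forwarded.
    """
    if src != 0 and src == exmem_dest:
        return exmem_val
    if src != 0 and src == memwb_dest:
        return memwb_val
    return regfile[src]

regfile = [0, 10, 20, 30]
print(forward(2, 2, None, regfile, 99, -1))   # 99: bypassed from EX/MEM
print(forward(2, 3, 2, regfile, -1, 77))      # 77: bypassed from MEM/WB
print(forward(3, 1, 2, regfile, -1, -1))      # 30: no match, register file read
```

The priority order (youngest producer first) is what makes back-to-back dependent instructions see the most recent value.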
Pipeline: Control Hazard

(same diagram; Insti is a branch: the instructions behind it are fetched before the branch outcome and target are known)
Pipeline: Stall on Control Hazard

(instructions after the branch are held, stalled in IF, until the branch resolves; a bubble enters the pipeline for each stalled cycle)
Pipeline: Prediction for Control Hazards

(fetch continues down the predicted path past the branch Insti; on a misprediction, the speculative Insti+2 .. Insti+4 are squashed into nops, the speculative state is cleared, and fetch is resteered to the correct-path new Insti+2 .. Insti+4)
Going Beyond Scalar
• A simple pipeline is limited to CPI ≥ 1.0
• “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
– Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
– Contrast with vector, which effectively executes multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)
Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
– Instruction/overlap parallelism = D
– Operation latency = 1
– Peak IPC = 1

(diagram: successive instructions vs. time in cycles; D different instructions overlapped in a D-deep pipeline)
Superscalar Machine
• Superscalar (pipelined) execution
– Instruction parallelism = D × N
– Operation latency = 1
– Peak IPC = N per cycle

(diagram: N instructions enter the D-deep pipeline each cycle; D × N different instructions overlapped)
Ex. Original Pentium
• Prefetch: 4 × 32-byte buffers
• Decode1: decode up to 2 instructions
• Decode2 (one per pipe): read operands, address computation
• Execute (one per pipe), Writeback (one per pipe)
• Asymmetric pipes:
– u-pipe: shift, rotate, some FP
– v-pipe: jmp, jcc, call, fxch
– Both: mov, lea, simple ALU, push/pop, test/cmp
Pentium Hazards, Stalls
• “Pairing rules” (when can/can't two insts exec at the same time?)
– read/flow dependence:
      mov eax, 8
      mov [ebp], eax
– output dependence:
      mov eax, 8
      mov eax, [ebp]
– partial register stalls:
      mov al, 1
      mov ah, 0
– function unit rules
• some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP
Limitations of In-Order Pipelines
• CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point
– i.e., when N approaches the average distance between dependent instructions
– Forwarding is no longer effective: the machine must stall more often
– The pipeline may never be full due to the frequency of dependency stalls
N Instruction Limit
• Ex. superscalar degree N = 4: any dependency among the instructions issued together will cause a stall; a dependent instruction must be at least N = 4 instructions away from its parent
• On average, the parent-child separation is only about 5 instructions! (Franklin and Sohi ’92)
• An average of 5 means there are many cases when the separation is < 4; each of these limits parallelism
• Pentium: superscalar degree N = 2 is reasonable; going much further encounters rapidly diminishing returns
In Search of Parallelism
• “Trivial” parallelism is limited
– What is trivial parallelism?
• In-order: sequential instructions that do not have dependencies
• In all previous examples, every instruction executed either at the same time as, or after, earlier instructions
– The previous slides show that superscalar execution quickly hits a ceiling
• So what is “non-trivial” parallelism? …
What is Parallelism?
• Work
– T1: time to complete the computation on a sequential system
• Critical path
– T∞: time to complete the same computation on an infinitely-parallel system
• Average parallelism
– Pavg = T1 / T∞
• For a p-wide system
– Tp ≈ max{T1/p, T∞}
– If Pavg >> p, then Tp ≈ T1/p

Example:
  x = a + b; y = b * 2
  z = (x - y) * (x + y)
(dataflow graph: the two operations producing x and y run in parallel, then x-y and x+y, then the final multiply)
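The T1 and T∞ of the example can be computed from its dataflow graph (a sketch with unit-latency operations; the node names `d`, `s`, `z` for the intermediate values are illustrative):

```python
# Dataflow graph for: x = a + b; y = b * 2; z = (x - y) * (x + y)
deps = {
    "x": [],          # a + b   (a, b already available)
    "y": [],          # b * 2
    "d": ["x", "y"],  # x - y
    "s": ["x", "y"],  # x + y
    "z": ["d", "s"],  # d * s
}

T1 = len(deps)  # work: 5 unit-latency operations

depth = {}
def critical_path(op):
    if op not in depth:
        depth[op] = 1 + max((critical_path(p) for p in deps[op]), default=0)
    return depth[op]

T_inf = max(critical_path(op) for op in deps)  # longest chain, e.g. x -> d -> z
print(T1, T_inf, round(T1 / T_inf, 2))         # 5 3 1.67 = average parallelism
```

So even an infinitely wide machine finishes in 3 steps: the critical path, not the work, is the floor.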
ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = num instructions / longest path

code1 (ILP = 1, must execute serially; T1 = 3, T∞ = 3):
  r1 ← r2 + 1
  r3 ← r1 / 17
  r4 ← r0 - r3

code2 (ILP = 3, can execute at the same time; T1 = 3, T∞ = 1):
  r1 ← r2 + 1
  r3 ← r9 / 17
  r4 ← r0 - r10
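The "num instructions / longest path" definition is mechanical enough to code up (a sketch; `avg_ilp` is an illustrative name, and unit latency is assumed as in the slide):

```python
def avg_ilp(instructions):
    """instructions: (dest, sources) tuples in program order; unit latency."""
    chain = {}       # register -> length of the dependence chain producing it
    longest = 0
    for dest, srcs in instructions:
        d = 1 + max((chain.get(s, 0) for s in srcs), default=0)
        chain[dest] = d
        longest = max(longest, d)
    return len(instructions) / longest

code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]   # serial chain
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]  # independent
print(avg_ilp(code1), avg_ilp(code2))  # 1.0 3.0
```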
ILP != IPC
• ILP usually assumes infinite resources, perfect fetch, and unit latency for all instructions
• ILP is more a property of the program's dataflow
• IPC is the “real” observed metric of exactly how many instructions are executed per machine cycle, and it includes all of the limitations of a real machine
• The ILP of a program is an upper bound on the attainable IPC
Scope of ILP Analysis

  r1 ← r2 + 1
  r3 ← r1 / 17       ILP = 1 (this block alone)
  r4 ← r0 - r3

  r11 ← r12 + 1
  r13 ← r19 / 17     ILP = 3 (this block alone)
  r14 ← r0 - r20

Analyzed together: 6 instructions / longest path of 3 → ILP = 2
DFG Analysis

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]
In-Order Issue, Out-of-Order Completion
• Issue = send an instruction to execution
• Issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard

(diagram: an in-order instruction stream issues to functional units of different latencies (INT; Fadd1-Fadd2; Fmul1-Fmul3; Ld/St); execution begins in order, but completion is out of order)
Example

A: R1 = R2 + R3      Cycle 1: A, B
B: R4 = R5 + R6      Cycle 2: C
C: R1 = R1 * R4      Cycle 3: D
D: R7 = LD 0[R1]     Cycle 4: (waiting on the load)
E: BEQZ R7, +32      Cycle 5: (waiting on the load)
F: R4 = R7 - 3       Cycle 6: E, F, G
G: R1 = R1 + 1       Cycle 7: H, J
H: R4 → ST 0[R1]     Cycle 8: K
J: R1 = R1 - 1
K: R3 → ST 0[R1]

IPC = 10/8 = 1.25
Example (2)

A variant of the previous code with some registers renamed:

A: R1 = R2 + R3      Cycle 1: A, B
B: R4 = R5 + R6      Cycle 2: C
C: R1 = R1 * R4      Cycle 3: D
D: R9 = LD 0[R1]     Cycle 4: (waiting on the load)
E: BEQZ R7, +32      Cycle 5: E, F, G
F: R4 = R7 - 3       Cycle 6: H, J
G: R1 = R1 + 1       Cycle 7: K
H: R4 → ST 0[R9]
J: R1 = R9 - 1
K: R3 → ST 0[R1]

IPC = 10/7 = 1.43
Track with Simple Scoreboarding
• Scoreboard: a bit array, 1 bit for each GPR
– If the bit is not set: the register has valid data
– If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
• Issue in order: RD ← Fn(RS, RT)
– If SB[RS] or SB[RT] is set → RAW, stall
– If SB[RD] is set → WAW, stall
– Else, dispatch to FU (Fn) and set SB[RD]
• Complete out of order
– Update GPR[RD], clear SB[RD]
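The issue rules above can be sketched as a tiny simulator (illustrative class and method names; the string return values just label which rule fired):

```python
class Scoreboard:
    """One busy bit per GPR: set means an outstanding write is pending."""

    def __init__(self, nregs=32):
        self.busy = [False] * nregs

    def try_issue(self, rd, rs, rt):
        if self.busy[rs] or self.busy[rt]:
            return "RAW stall"     # a source is still being produced
        if self.busy[rd]:
            return "WAW stall"     # destination already claimed
        self.busy[rd] = True       # dispatch to FU, claim the destination
        return "issued"

    def complete(self, rd):        # out-of-order completion / writeback
        self.busy[rd] = False

sb = Scoreboard()
print(sb.try_issue(1, 2, 3))  # issued:    r1 = f(r2, r3)
print(sb.try_issue(4, 1, 5))  # RAW stall: r1 still outstanding
print(sb.try_issue(1, 6, 7))  # WAW stall: r1 already claimed
sb.complete(1)
print(sb.try_issue(4, 1, 5))  # issued
```

Note the single bit per register is exactly why WAW must stall here: there is nowhere to record two pending writers.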
Out-of-Order Issue

(diagram: the same functional units as before (INT; Fadd1-Fadd2; Fmul1-Fmul3; Ld/St), each fronted by dependency-resolution (DR) buffers; the in-order instruction stream now executes out of program order and completes out of order)
• Need an extra stage/buffers for dependency resolution
OOO Scoreboarding
• Similar to in-order scoreboarding
– Need new tables to track the status of individual instructions and functional units
– Still enforce dependencies
• Stall dispatch on WAW
• Stall issue on RAW
• Stall completion on WAR
• Limitations of scoreboarding?
• Hints
– No structural hazards
– Can always write a RAW-free code sequence:
    Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
– Think about the x86 ISA with only 8 registers
• The finite number of registers in any ISA will force you to reuse register names at some point → WAR, WAW stalls
Lessons thus Far
• More out-of-orderness → more ILP exposed, but more hazards
• Stalling is a generic technique to ensure sequencing
• RAW stall is a fundamental requirement (?)
• Compiler analysis and scheduling can help (not covered in this course)
Ex. Tomasulo’s Algorithm [IBM 360/91, 1967]

(block diagram of the IBM 360/91 floating-point unit: the storage bus and instruction unit feed the Floating Operand Stack (FLOS), which is decoded in order; operands live in the Floating Point Buffers (FLB, entries 1-6) and the Floating Point Registers (FLR, registers 0/2/4/8, each with busy bits and tags); stores drain through the Store Data Buffers (SDB, entries 1-3); reservation stations in front of the Adder and the Multiply/Divide unit hold (sink tag, tag, source, ctrl) entries; results broadcast on the Common Data Bus (CDB) and are captured by tag match in the reservation stations, the FLR, and the SDB)
FYI: Historical Note
• Tomasulo’s algorithm (1967) was not the first
• Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in Feb 1966
– The ideas got buried due to internal politics, changing project goals, etc.
– But it’s still the first (as far as I know)
Modern Enhancements to Tomasulo’s Algorithm

                    Tomasulo                  Modern
  Machine width     Peak IPC = 1              Peak IPC = 6+
  Structural deps   2 FP FU’s, single CDB     6-10+ FU’s, many forwarding buses
  Anti-deps         Operand copying           Renamed registers
  Output-deps       RS tag                    Renamed registers
  True deps         Tag-based forwarding      Tag-based forwarding
  Exceptions        Imprecise                 Precise (requires ROB)