9/25/2006 eleg652-F06 1
Topic 3
Exploitation of Instruction Level Parallelism
The secret to creativity is knowing how to hide your sources. Albert Einstein
Reading List
• Slides: Topic3x
• Henn&Patt: Chapters 3 & 4
• Other assigned readings from homework and classes
Instruction Level Parallelism
• Parallelism that is found between instructions (or intra instruction)
• Dynamic and Static Exploitation
– Dynamic: Hardware related
– Static: Software related (compiler and system software)
• VLIW and Superscalar
• Micro-Dataflow and Tomasulo’s Algorithm
RISC Concepts: Revisited
• Reduced Instruction Set Architecture
– “Internal Computing Architecture in which processor instructions are pared down so that most of them can be executed in one clock cycle, theoretically improving computing efficiency” (Black Box Pocket Glossary of Computer Terms)
• Characteristics:
– Uniform instruction encoding
– Homogeneous Register Banks
– Simplified Addressing Modes
– Simplified data structures
– Branch delay slot
– Cache
– Pipeline
RISC Concepts: Revisited
• What prevents one instruction per cycle (CPI = 1)?
– Hazards
– Dependencies
– Long Latency ops
• Cache Thrashing
Pipeline: A Review
• Hazards
– Any situation that prevents the smooth flow of instructions along the pipeline
– Types
  • Structural: due to limited resources and contention among them
  • Control: instructions that change the PC (program counter)
  • Data: an instruction depends on a value produced by a previous instruction
– Stall
  • Hazards will “stall” the pipeline
  • Serious: a stall can hold up many instructions for many cycles
RISC Pipeline & Instruction Issue
• Instruction Issue
– The process of letting an instruction move from ID to EXEC
– Issue vs. Execution
• In DLX
– ID: check all data hazards, stall if any exists
Typical RISC Pipeline:
Instruction Fetch → Instruction Decode → Execute → Memory Op → Register Update
Hazards
• Structural Hazards
– Non-pipelined function units
– One-port register bank and one-port memory bank
• Data Hazards
– For some: Forwarding
– For others: Pipeline Interlock

    LD  R1, A
    ADD R4, R1, R7

Need Bubble / Stall
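Structural Hazard: a single memory bank for instructions and data

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Load instruction | IF | ID | EX | MEM | WB | | | | |
| Instruction i+1 | | IF | ID | EX | MEM | WB | | | |
| Instruction i+2 | | | IF | ID | EX | MEM | WB | | |
| Instruction i+3 | | | | stall | IF | ID | EX | MEM | WB |
| Instruction i+4 | | | | | | IF | ID | EX | MEM |

(Columns are clock cycle numbers.)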
Data Hazards
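| Instruction | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| ADD | IF | ID | EX | MEM | WB | |
| SUB | | IF | ID | EX | MEM | WB |

Data is written here: WB stage of ADD
Data is read here: ID stage of SUB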
The ADD instruction writes a register that is a source operand for the SUB instruction. But the ADD doesn’t finish writing the data into the register file until three clock cycles after SUB begins reading it!
The SUB instruction may read the incorrect value. Result may be non-deterministic. Solved by forwarding
Data Dependency: A Review
Flow Dependency (RAW Conflicts)

    B + C → A
    A + D → E

Anti Dependency (WAR Conflicts)

    A + C → B
    E + D → A

Output Dependency (WAW Conflicts)

    B + C → A
    E + D → A

RAR are not really a problem
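The three dependence types above can be checked mechanically by comparing the destination and sources of two instructions in program order. The following is a minimal Python sketch; the `Instr` type and `classify` function are illustrative names, not part of the course material.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    dest: str               # variable written by the instruction
    srcs: Tuple[str, ...]   # variables read by the instruction

def classify(first: Instr, second: Instr) -> Set[str]:
    """Dependences of `second` on `first`, where `first` comes earlier."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("RAW")    # flow: second reads what first writes
    if second.dest in first.srcs:
        kinds.add("WAR")    # anti: second overwrites what first reads
    if first.dest == second.dest:
        kinds.add("WAW")    # output: both write the same location
    return kinds

# The three instruction pairs from the slide
flow = classify(Instr("A", ("B", "C")), Instr("E", ("A", "D")))
anti = classify(Instr("B", ("A", "C")), Instr("A", ("E", "D")))
output = classify(Instr("A", ("B", "C")), Instr("A", ("E", "D")))
print(flow, anti, output)
```

Two instructions that only read a common variable (RAR) produce an empty set here, which matches the remark that RAR is not a problem.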
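Forwarding Example

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| ADD R1,R2,R3 | IF | ID | EX | MEM | WB | | | | |
| SUB R4,R1,R5 | | IF | ID | EX | MEM | WB | | | |
| AND R6,R1,R7 | | | IF | ID | EX | MEM | WB | | |
| OR R8,R1,R9 | | | | IF | ID | EX | MEM | WB | |
| XOR R10,R1,R11 | | | | | IF | ID | EX | MEM | WB |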
Bypassing Pitfalls
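The Code

    LW  R1, 32(R6)
    ADD R4, R1, R7
    SUB R5, R1, R8
    AND R6, R1, R7

The Pipeline

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| LW | IF | ID | EX | MEM | WB | | | | |
| ADD | | IF | ID | STALL | EX | MEM | WB | | |
| SUB | | | IF | STALL | ID | EX | MEM | WB | |
| AND | | | | STALL | IF | ID | EX | MEM | WB |

Load Delay Slot cannot be eliminated by forwarding alone
Pipeline Interlock: Stall / Bubble for hazards that cannot be solved by forwarding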
Pipelining
• Issue: passing the Instruction Decode stage
• DLX: Only issue an instruction if there is no hazard
• Detecting interlocks early in the pipeline has the advantage that the hardware never needs to suspend an instruction and undo state changes.
Instruction Level Parallelism
• Static Scheduling
– Simple Scheduling
– Loop Unrolling
– Loop Unrolling + Scheduling
– Software Pipelining
• Dynamic Scheduling
– Out of order execution
– Data Flow computers
• Speculation
Constraint Graph
• Directed-edges: data-dependence
• Undirected-edges: Resources constraint
• An edge (u,v) (directed or undirected) of length e represents an interlock between nodes u and v: they must be separated by at least e time units.
[Figure: constraint graph over nodes S1 to S6; the edge labels are operation latencies]
Code Scheduling For Single Pipeline
• Input: A constraint graph G = (V, E)
• Output: A sequence of the operations in G (v1, v2, v3, v4, v5, …, vn) plus a number of no-ops, such that:
– If the no-ops are deleted, the sequence is a topological sort of G.
– Any two nodes x, y in the sequence are separated by a distance greater than or equal to d(x,y) in graph G.
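One simple way to produce such a sequence is greedy list scheduling: fill one slot at a time with the first operation whose predecessors are all far enough back, and emit a no-op when nothing is ready. The sketch below is an illustration under stated assumptions, not the algorithm used in the course: the graph must be acyclic, an edge `(u, v, d)` is taken to mean "v is placed at least d slots after u", and the example edges are a hypothetical encoding of a load/add/store loop body (latency plus one).

```python
from typing import Dict, List, Tuple

def schedule(nodes: List[str], edges: List[Tuple[str, str, int]]) -> List[str]:
    """Greedy list scheduling for a single pipeline.

    `nodes` doubles as the priority order used to break ties.
    The constraint graph is assumed to be acyclic.
    """
    preds: Dict[str, List[Tuple[str, int]]] = {n: [] for n in nodes}
    for u, v, d in edges:
        preds[v].append((u, d))

    slot_of: Dict[str, int] = {}
    sequence: List[str] = []
    while len(slot_of) < len(nodes):
        slot = len(sequence)
        for n in nodes:
            if n in slot_of:
                continue
            # ready: every predecessor is placed and at least d slots back
            if all(u in slot_of and slot - slot_of[u] >= d for u, d in preds[n]):
                slot_of[n] = slot
                sequence.append(n)
                break
        else:
            sequence.append("NOP")   # nothing is ready in this slot
    return sequence

# Hypothetical constraint graph for a load / add / store loop body
ops = ["LD", "ADDD", "SD", "SUB", "BNEZ"]
deps = [("LD", "ADDD", 2), ("ADDD", "SD", 3), ("SD", "SUB", 1), ("SUB", "BNEZ", 1)]
result = schedule(ops, deps)
print(result)
```

Deleting the `NOP` entries from the result leaves a topological order of the graph, which is the first condition stated above.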
Advanced Pipelining
• Instruction Reordering and scheduling within loop body
• Loop Unrolling
– Code size suffers
• Superscalar
– Compact code
– Multiple issue of different instruction types
• VLIW
VLIW
• Very Long Instruction Word
• Compiler has all responsibility to schedule instructions
• Make hardware simpler
– Move complexity to software
• Concept developed by Josh Fisher at Yale University in the early 1980s
An Example: X[i] + a

    Loop: LD   F0, 0(R1)    ; load the vector element
          ADDD F4, F0, F2   ; add the scalar in F2
          SD   0(R1), F4    ; store the vector element
          SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
          BNEZ R1, Loop     ; branch when it’s not zero

| Instruction Producer | Instruction Consumer | Latency |
|---|---|---|
| FP ALU op | FP ALU op | 3 |
| FP ALU op | Store Double | 2 |
| Load Double | FP ALU op | 1 |
| Load Double | Store Double | 0 |

Load can bypass the store
Assume that latency for Integer ops is zero and latency for Integer load is 1
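An Example: X[i] + a

    Loop: LD   F0, 0(R1)    ; cycle 1
          STALL             ; cycle 2
          ADDD F4, F0, F2   ; cycle 3
          STALL             ; cycle 4
          STALL             ; cycle 5
          SD   0(R1), F4    ; cycle 6
          SUB  R1, R1, #8   ; cycle 7
          BNEZ R1, Loop     ; cycle 8
          STALL             ; cycle 9

Stall annotations on the slide, top to bottom: Load Latency, FP ALU Latency, Load Latency
This requires 9 cycles per iteration
Constraint Graph: LD, ADDD, SD, SUB, BNEZ with latencies 1, 2, 0, 0, 1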
An Example: X[i] + a (Scheduling)

    Loop: LD   F0, 0(R1)    ; cycle 1
          STALL             ; cycle 2
          ADDD F4, F0, F2   ; cycle 3
          SUB  R1, R1, #8   ; cycle 4
          BNEZ R1, Loop     ; cycle 5
          SD   8(R1), F4    ; cycle 6

This requires 6 cycles per iteration
Constraint Graph: LD, ADDD, SD, SUB, BNEZ with latencies 1, 2, 0, 0, 1
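An Example: X[i] + a (Unrolling)

    Loop: LD   F0, 0(R1)       ; cycle 1
          NOP                  ; cycle 2
          ADDD F4, F0, F2      ; cycle 3
          NOP                  ; cycle 4
          NOP                  ; cycle 5
          SD   0(R1), F4       ; cycle 6
          LD   F6, -8(R1)      ; cycle 7
          NOP                  ; cycle 8
          ADDD F8, F6, F2      ; cycle 9
          NOP                  ; cycle 10
          NOP                  ; cycle 11
          SD   -8(R1), F8      ; cycle 12
          LD   F10, -16(R1)    ; cycle 13
          NOP                  ; cycle 14
          ADDD F12, F10, F2    ; cycle 15
          NOP                  ; cycle 16
          NOP                  ; cycle 17
          SD   -16(R1), F12    ; cycle 18
          LD   F14, -24(R1)    ; cycle 19
          NOP                  ; cycle 20
          ADDD F16, F14, F2    ; cycle 21
          NOP                  ; cycle 22
          NOP                  ; cycle 23
          SD   -24(R1), F16    ; cycle 24
          SUB  R1, R1, #32     ; cycle 25
          BNEZ R1, LOOP        ; cycle 26
          NOP                  ; cycle 27

27 cycles for 4 iterations: this requires about 6.8 cycles per iteration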
An Example: X[i] + a (Unrolling + Scheduling)

    Loop: LD   F0, 0(R1)       ; cycle 1
          LD   F6, -8(R1)      ; cycle 2
          LD   F10, -16(R1)    ; cycle 3
          LD   F14, -24(R1)    ; cycle 4
          ADDD F4, F0, F2      ; cycle 5
          ADDD F8, F6, F2      ; cycle 6
          ADDD F12, F10, F2    ; cycle 7
          ADDD F16, F14, F2    ; cycle 8
          SD   0(R1), F4       ; cycle 9
          SD   -8(R1), F8      ; cycle 10
          SD   -16(R1), F12    ; cycle 11
          SUB  R1, R1, #32     ; cycle 12
          BNEZ R1, LOOP        ; cycle 13
          SD   8(R1), F16      ; cycle 14

14 cycles for 4 iterations: this requires 3.5 cycles per iteration
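The same transformation can be shown at source level. The Python sketch below is only an analogy to the DLX code above: it unrolls the loop by 4, groups the loads, adds, and stores the way the scheduled version does, and adds a clean-up loop for lengths that are not a multiple of 4 (something the assembly example does not need to show). The function names are illustrative.

```python
from typing import List

def add_scalar(x: List[float], a: float) -> None:
    """Reference loop: one element per iteration."""
    for i in range(len(x)):
        x[i] += a

def add_scalar_unrolled4(x: List[float], a: float) -> None:
    """Same loop unrolled by 4, with loads, adds, and stores grouped."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        t0, t1, t2, t3 = x[i], x[i + 1], x[i + 2], x[i + 3]    # four loads
        t0, t1, t2, t3 = t0 + a, t1 + a, t2 + a, t3 + a        # four adds
        x[i], x[i + 1], x[i + 2], x[i + 3] = t0, t1, t2, t3    # four stores
        i += 4               # one index update and one branch per 4 elements
    while i < n:             # clean-up loop for the n % 4 leftover elements
        x[i] += a
        i += 1

reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
unrolled = list(reference)
add_scalar(reference, 0.5)
add_scalar_unrolled4(unrolled, 0.5)
print(unrolled)
```

The loop overhead (index update and branch) is paid once per four elements, which is where the drop in cycles per iteration comes from; the price is a larger loop body.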
Topic 3a
Multi Issue Architectures
Beyond Simple RISC
ILP
• ILP of a program
– Average number of instructions that a superscalar processor might be able to execute at the same time
  • Data dependencies
  • Latencies and other processor difficulties
• ILP of a machine
– The ability of a processor to take advantage of the ILP
  • Number of instructions that can be fetched and executed at the same time by such a processor
Multi Issue Architectures
• Super Scalar
– Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and runtime scheduler
• Very Long Instruction Word
– A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue
Patterson & Hennessy P317 and P318
Multiple Instruction Issue
• Multiple Issue + Static Scheduling → VLIW
• Dynamic Scheduling
– Tomasulo
– Scoreboarding
• Multiple Issue + Dynamic Scheduling → Superscalar
• Decoupled Architectures
– Static Scheduling of R-R Instructions
– Dynamic Scheduling of Memory Ops
  • Buffers
Five Primary Approaches
| Common Name | Issue Structure | Hazard Detection | Scheduling | Distinguishing characteristics | Examples |
|---|---|---|---|---|---|
| Superscalar (static) | Dynamic | Hardware | Static | In order execution | Sun UltraSPARC II and III |
| Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out of order execution | IBM Power2 |
| Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Speculative out of order execution | Pentium 3 and 4 |
| VLIW / LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860 |
| EPIC | Mostly Static | Mostly Software | Mostly Static | Explicit dependences marked by compiler | Itanium |
Two-Issue Architecture: Unrolled and Scheduled Code

| Integer instruction | FP instruction | Clock cycle |
|---|---|---|
| Loop: LD F0, 0(R1) | | 1 |
| LD F6, -8(R1) | | 2 |
| LD F10, -16(R1) | ADDD F4, F0, F2 | 3 |
| LD F14, -24(R1) | ADDD F8, F6, F2 | 4 |
| LD F18, -32(R1) | ADDD F12, F10, F2 | 5 |
| SD 0(R1), F4 | ADDD F16, F14, F2 | 6 |
| SD -8(R1), F8 | ADDD F20, F18, F2 | 7 |
| SD -16(R1), F12 | | 8 |
| SD -24(R1), F16 | | 9 |
| SUB R1, R1, #40 | | 10 |
| BNEZ R1, LOOP | | 11 |
| SD 8(R1), F20 | | 12 |

The unrolled and scheduled code takes 2.4 cycles per iteration (5 iterations in 12 cycles)
A VLIW Code Sequence (Unrolling 6 times)

| Memory reference 1 | Memory reference 2 | FP operation 1 | FP operation 2 | Integer operation / branch |
|---|---|---|---|---|
| LD F0, 0(R1) | LD F6, -8(R1) | | | |
| LD F10, -16(R1) | LD F14, -24(R1) | | | |
| LD F18, -32(R1) | LD F22, -40(R1) | ADDD F4, F0, F2 | ADDD F8, F6, F2 | |
| LD F26, -48(R1) | | ADDD F12, F10, F2 | ADDD F16, F14, F2 | |
| | | ADDD F20, F18, F2 | ADDD F24, F22, F2 | |
| SD 0(R1), F4 | SD -8(R1), F8 | ADDD F28, F26, F2 | | |
| SD -16(R1), F12 | SD -24(R1), F16 | | | |
| SD -32(R1), F20 | SD -40(R1), F24 | | | SUB R1, R1, #48 |
| SD -0(R1), F28 | | | | BNEZ R1, LOOP |

[Figure: seven independent LD → + a → SD chains, one per iteration, using the register pairs F0/F4, F6/F8, F10/F12, F14/F16, F18/F20, F22/F24, F26/F28]

7 iterations in 9 cycles: 1.28 cycles per iteration
Trace Scheduling
• First used for VLIW architectures
• Trace
– A straight-line sequence of instructions executed for some data, or a sequence of ops which constitutes a possible path based on “predicted” branches.
• Trace Scheduling
– Identify a “most possible” sequence of instructions and then “compact” the instructions in that path
• Tools
– For Loops: Unrolling
– For Branches: Static Branch prediction
An Example: Traces

    A; B; C;
    if (D) {
        E; F;
    } else {
        G;
    }
    H; I;

Basic Block: An instruction sequence which has only one entry point and one exit point (no branch targets or branches in the middle)

[Figure: control-flow graph with blocks A B C, br D, E F, G, and H I; the two paths through it are labeled Trace 1 and Trace 2]
Code Motion & Compensation Code
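[Figure: three control-flow graphs, with block contents as labeled on the slide]
• Original Code: A B C; br D; E F | G; H I
• Code Move to the Succeeding Block: A B; br D; C E | C G; F H I (C is duplicated into both successors)
• Code Move to the Preceding Block: A B C E; br D; F H | Undo E, G, H; I (E is moved above the branch and undone on the other path)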
Trace Scheduling
• Similar to Basic Block Scheduling
– The scheduling unit is a trace, not a basic block
• Reduce execution time of likely traces
– Using Profiling
Software Pipeline
• Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations
• Uses less code space
– Compared to Unrolling
• Some architectures have specific support for it
– Rotating register banks
– Predicated Instructions
Software Pipelining
• Overlap instructions without unrolling the loop
• Given the vector M in memory, and ignoring the start-up and finishing code, we have:

    Loop: SD   0(R1), F4    ; stores into M[i]
          ADDD F4, F0, F2   ; adds to M[i + 1]
          LD   F0, -8(R1)   ; loads M[i + 2]
          BNEZ R1, LOOP
          SUB  R1, R1, #8   ; subtract in delay slot

This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.
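The start-up and clean-up code that the slide leaves out can be made explicit in a source-level analogy. The Python sketch below keeps the store / add / load order of the kernel above, with each kernel pass touching three different iterations; it walks the array upward for readability (the DLX code walks downward), and the function name and the variables `f0` and `f4` are illustrative.

```python
from typing import List

def add_scalar_swp(m: List[float], a: float) -> None:
    """Software-pipelined m[i] += a.

    Each kernel pass stores element i, adds for element i + 1,
    and loads element i + 2.
    """
    n = len(m)
    if n < 2:                 # too short to pipeline: plain loop
        for i in range(n):
            m[i] += a
        return

    # Prologue: fill the pipeline
    f0 = m[0]                 # load element 0
    f4 = f0 + a               # add for element 0
    f0 = m[1]                 # load element 1

    # Kernel: steady state, one result per pass
    for i in range(n - 2):
        m[i] = f4             # SD: store the result for element i
        f4 = f0 + a           # ADDD: add for element i + 1
        f0 = m[i + 2]         # LD: load element i + 2

    # Epilogue: drain the pipeline
    m[n - 2] = f4
    m[n - 1] = f0 + a

data = [1.0, 2.0, 3.0, 4.0, 5.0]
add_scalar_swp(data, 10.0)
print(data)
```

The prologue and epilogue are each executed once regardless of the trip count, which is the "two times" overhead discussed on the later slides.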
Software Pipelining
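| Time | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 |
|---|---|---|---|---|---|---|---|
| 1 | LD | | | | | | |
| 2 | | LD | | | | | |
| 3 | ADDD | | LD | | | | |
| 4 | | ADDD | | LD | | | |
| 5 | | | ADDD | | LD | | |
| 6 | SD | | | ADDD | | LD | |
| 7 | BNEZ | SD | | | ADDD | | LD |
| 8 | | BNEZ | SD | | | ADDD | |
| 9 | | | BNEZ | SD | | | ADDD |
| 10 | | | | BNEZ | SD | | |
| 11 | | | | | BNEZ | SD | |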
Software Pipeline
Overhead for Software Pipeline: paid twice, once for the prologue and once for the epilogue
Overhead for Unrolled Loop: paid M / N times, for M loop iterations and an unrolling factor of N
[Figure: two plots of the number of overlapped instructions over time, one for the software-pipelined code (with prologue and epilogue marked) and one for the unrolled code]
Loop Unrolling vs. Software Pipelining
• When not running at maximum rate
– Unrolling: pay the overhead m/n times for m iterations and an unrolling factor of n
– Software Pipelining: pay it twice
  • Once at the prologue and once at the epilogue
• Moreover
– Code compactness
– Optimal runtime
– Storage constraints
Comparison of Static Methods
| | w/o scheduling | scheduling | unrolling | Unrolling + Scheduling | 2 issue | 4 issue | SP 1-issue | SP 5-issue |
|---|---|---|---|---|---|---|---|---|
| Cycles per iteration | 9 | 6 | 6.8 | 3.5 | 2.4 | 1.28 | 5 | 1 |
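Most of the figures in this comparison are just total cycles divided by the number of loop iterations covered by one pass through the code. The short Python sketch below redoes that arithmetic from the cycle and iteration counts shown in the earlier schedules; the dictionary layout is an illustrative choice, and the two software-pipelining columns are left out because the slides state them directly.

```python
# (total cycles, iterations covered), read from the schedules in the slides
schedules = {
    "w/o scheduling": (9, 1),
    "scheduling": (6, 1),
    "unrolling": (27, 4),
    "unrolling + scheduling": (14, 4),
    "2 issue": (12, 5),
    "4 issue": (9, 7),
}

cycles_per_iter = {
    name: cycles / iters for name, (cycles, iters) in schedules.items()
}

for name, value in cycles_per_iter.items():
    print(f"{name:24s} {value:.2f}")
```

The exact quotients are 27/4 = 6.75 and 9/7 ≈ 1.286; the table shows them as 6.8 and 1.28.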
On a Final Note
Loop unrolling, trace scheduling, and software pipelining all aim at exposing fine grain parallelism.
“The effectiveness of these techniques and their suitability for various architectural approaches are among the most open research areas in pipelined processor design”
- Henn & Patt
Limitations of VLIW
• Limited parallelism in (statically scheduled) code
– Basic Blocks may be too small
– Global Code Motion is difficult
• Limited Hardware Resources
• Code Size
• Memory Port limitations
• A Stall is serious
• Cache is difficult to use effectively
– i-cache misses have the potential to multiply the miss rate by a factor of n, where n is the issue width
– Cache miss penalty is increased because of the length of the instruction word
An Open Question
“...Whether there are large classes of applications that are not suitable for vector machines, but still offer enough parallelism to justify the VLIW approach rather than a simpler one, such as a superscalar machine?”
Henn & Patt 1990
A VLIW Example
TMS32C62x/C67 Block Diagram
Source: TMS320C600 Technical Brief. February 1999
A VLIW Example
TMS32C62x/C67 Data Paths
Source: TMS320C600 Technical Brief. February 1999
Assembly Example
Introduction to SuperScalar
Topic 3b
Instruction Issue Policy
• It determines the processor's look-ahead policy
– Ability to examine instructions beyond the current PC
• Look-ahead must ensure correctness at all costs
• Issue policy
– Protocol used to issue instructions
• Note: Issue, execution and completion
Issues in Out of Order Execution & Completion
    R3 := R3 op R5   (1)
    R4 := R3 + 1     (2)
    R3 := R5 + 1     (3)
    R7 := R3 op R4   (4)

[Figure: dependence graph over instructions 1 to 4; the legend distinguishes flow, anti, and output dependencies]

(2) and (3) cannot be completed out of order; otherwise the anti-dependence may be violated, or the R3 read by (2) may be incorrectly overwritten by (3) [when (2) was stalled for some reason]
Issues in Out of Order Execution & Completion
    R3 := R3 op R5   (1)
    R4 := R3 + 1     (2)
    R3 := R5 + 1     (3)
    R7 := R3 op R4   (4)

[Figure: dependence graph over instructions 1 to 4; the legend distinguishes flow, anti, and output dependencies]

(1) and (3) cannot be completed out of order!
Output-dependence has to be checked against all preceding instructions which are already in the exec pipes before an inst is issued, to ensure that results are written in the correct order. Otherwise R3 in (4) may get a wrong value.
Issues in Out of Order Execution & Completion
    R :=       (1)
      := R     (2)
    R :=       (3)
Note that the anti-dependence between (2) and (3) is handled correctly by stalling (3)’s issue if (1) has not completed.
Achieve High Performance in Multiple Issued Instruction Machines
• Detection and resolution of storage conflicts
– Extra “Shadow” registers
– Special bit for reservation
• Organization and control of the buses between the various units in the PU
– Special controllers to detect write backs and reads
How to Detect Data Dependencies
X1 = X2 + X3
Y1 = Y2 + Y3
How many dependencies are possible between these two instructions?
Five Possible Dependencies
A total of 5 * O(n²) checks for n instructions
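The five possible dependencies between two three-operand instructions can be listed explicitly: two RAW checks, two WAR checks, and one WAW check. The Python sketch below enumerates them and counts the comparisons for a window of n instructions; the function names and the example register window are illustrative assumptions.

```python
from itertools import combinations
from typing import List, Tuple

Instr = Tuple[str, str, str]   # (destination, source 1, source 2)

def pairwise_checks(first: Instr, second: Instr) -> List[Tuple[str, str, str]]:
    """The five register comparisons between an earlier and a later instruction."""
    d1, a1, b1 = first
    d2, a2, b2 = second
    return [
        ("RAW", d1, a2), ("RAW", d1, b2),   # later instruction reads the earlier result
        ("WAR", a1, d2), ("WAR", b1, d2),   # later instruction overwrites an earlier source
        ("WAW", d1, d2),                    # both write the same register
    ]

def count_comparisons(n: int) -> int:
    """Five comparisons for each of the n(n-1)/2 ordered-by-program pairs."""
    return 5 * n * (n - 1) // 2

window = [("R1", "R2", "R3"), ("R4", "R1", "R5"), ("R6", "R1", "R7"), ("R8", "R1", "R9")]
total = sum(len(pairwise_checks(a, b)) for a, b in combinations(window, 2))
found = [kind for kind, x, y in pairwise_checks(window[0], window[1]) if x == y]
print(total, found)
```

The count grows quadratically with the window size, which is why the dependence-check logic is a concern when the issue width increases.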
A Super Scalar Architecture
[Figure: pipeline stages Inst Fetch → Inst Decode → Issue Window (Wake Up, Select) → Register File → Exec → Write Back]
New: the Issue Window holds the instructions that are ready and the ones that are waiting for dependencies
Data Dependencies & SuperScalar
• Hardware Mechanism (dynamic scheduling)
- Scoreboarding
- limited out-of-order issue/completion
- centralized control
- Renaming with a reorder buffer is another attractive approach (based on Tomasulo Alg.)
- Micro dataflow
- Advantage: exact runtime information
- Load/cache miss
- resolve storage location related dependence
Scoreboarding
• Named after the CDC 6600 scoreboard
• Effective when there are enough resources and no data dependencies
• Out-of-order execution
• Issue: checks the scoreboard; a WAW hazard causes a stall
• Read operand: checks the availability of operands and resolves RAW dynamically at this step
- WAR will not cause a stall
• EX
• Write result
- WAR is checked and causes a stall
[Figure: the basic structure of a DLX processor with a scoreboard. Registers, an integer unit, an FP add unit, an FP divide unit, and two FP mult units share the data buses; control/status lines connect them to the scoreboard]
Scoreboarding
[CDC6600, Thornton70], [WeissSmith84]
• A bit (called the “scoreboard bit”) is associated with each register; bit = 1: the register is reserved by a write
• An instruction that has a source operand with bit = 1 will be issued, but put into an instruction window, with the register identifier to denote the “to-be-written” operand
• Copies of valid operands are also read with the pending inst (solves anti-dependence)
• When the missing operand is finally written, the register id in the pending inst will be compared and the value written, so it can be issued
• An inst whose result register R is reserved will stall, so the output-dependence (WAW) is correctly handled by the stall!
Example
(1) DIVF F0, F2, F4
(2) ADDF F10, F0, F8
(3) SUBF F8, F8, F14
(3) is allowed to be issued and executed while (2) is waiting for some operand
(3) cannot write its result to F8 (it stalls!) if (2) has not read F8: a stall, since there is no renaming of the registers.
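The three hazard checks of a scoreboard (WAW at issue, RAW at read operands, WAR at write result) can be sketched as predicates over the instructions in flight. The Python sketch below replays the DIVF / ADDF / SUBF example; it is a simplified illustration that ignores functional-unit (structural) conflicts, and the `Entry` class and function names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Entry:
    name: str
    dest: str
    srcs: Tuple[str, ...]
    read_done: bool = False   # operands have been read
    written: bool = False     # result has been written back

def can_issue(new: Entry, in_flight: List[Entry]) -> bool:
    """Issue stalls on WAW: an earlier pending write to the same register."""
    return all(e.written or e.dest != new.dest for e in in_flight)

def can_read_operands(idx: int, window: List[Entry]) -> bool:
    """RAW: every earlier writer of one of our sources must have written."""
    me = window[idx]
    return all(e.written or e.dest not in me.srcs for e in window[:idx])

def can_write_result(idx: int, window: List[Entry]) -> bool:
    """WAR: every earlier reader of our destination must have read it."""
    me = window[idx]
    return all(e.read_done or me.dest not in e.srcs for e in window[:idx])

div = Entry("DIVF", "F0", ("F2", "F4"), read_done=True)   # (1) executing
add = Entry("ADDF", "F10", ("F0", "F8"))                   # (2) waiting for F0
sub = Entry("SUBF", "F8", ("F8", "F14"))                   # (3)

window = [div, add]
sub_issues = can_issue(sub, window)           # no pending write to F8
window.append(sub)
add_reads = can_read_operands(1, window)      # F0 not written yet
sub_reads = can_read_operands(2, window)      # F8 and F14 are available
sub.read_done = True
sub_writes_early = can_write_result(2, window)   # (2) has not read F8 yet

div.written = True                            # (1) completes
add.read_done = can_read_operands(1, window)  # (2) can now read F0 and F8
sub_writes_later = can_write_result(2, window)
print(sub_issues, add_reads, sub_reads, sub_writes_early, sub_writes_later)
```

The stall of (3) at write-back is exactly the WAR case described above; with register renaming (discussed later) that stall disappears.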
Motivation of Dynamic Scheduling
    S1: X =
    S2:   = X
    S3: X =

Q1
– How to issue S2 without waiting for S1 to complete? (scoreboard)
– How to issue S3 without waiting for S2 to complete?
– How to issue S3 without waiting for S1 to complete?
Features of Scoreboarding
• It permits out-of-order “issue” of instructions which are not related to each other.
• It permits out-of-order “completion” of insts which are not related to each other.
• It prevents “execution” of an inst if a flow-dependence would be violated.
• It prevents “issue” of an inst if an output-dependence would be violated.
• It prevents “completion” of an inst if an anti-dependence would be violated.
Scoreboarding
Advantage
• single bit: simple
• only one pending write per register, so there is no need to identify which is the latest
Micro Data Flow
• Fundamental Concepts
– “Data Flow”
  • Instructions can only be fired when operands are available
– Single assignment and register renaming
• Implementation
– Tomasulo’s Algorithm
– Reorder Buffer
Renaming/Single Assignment
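Original code:

    R0 = R2 / R4      (1)
    R6 = R0 + R8      (2)
    R1[0] = R6        (3)
    R8 = R10 – R14    (4)
    R6 = R10 * R8     (5)

After renaming (single assignment):

    R0 = R2 / R4      (1)
    S = R0 + R8       (2)
    R1[0] = S         (3)
    T = R10 – R14     (4)
    R6 = R10 * T      (5)

[Figure: dependence graphs over instructions 1 to 5, before and after renaming]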
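Renaming of this kind can be done with a single pass that gives every newly written value a fresh name and redirects later reads to the latest name. The Python sketch below is an illustration, not the slide's exact renaming: it renames every destination (the slide only introduces S and T), and the names `P0`, `P1`, … and the `rename` function are hypothetical.

```python
from itertools import count
from typing import Dict, List, Optional, Tuple

# (destination register or None for a store, source registers)
Instr = Tuple[Optional[str], Tuple[str, ...]]

def rename(program: List[Instr]) -> List[Instr]:
    """Give every written value a fresh name; reads use the latest name."""
    fresh = (f"P{i}" for i in count())   # hypothetical physical register names
    latest: Dict[str, str] = {}
    renamed: List[Instr] = []
    for dest, srcs in program:
        # sources are looked up before the destination is remapped
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed

program = [
    ("R0", ("R2", "R4")),    # (1) R0 = R2 / R4
    ("R6", ("R0", "R8")),    # (2) R6 = R0 + R8
    (None, ("R6", "R1")),    # (3) R1[0] = R6  (store: no register written)
    ("R8", ("R10", "R14")),  # (4) R8 = R10 - R14
    ("R6", ("R10", "R8")),   # (5) R6 = R10 * R8
]
renamed = rename(program)
for line in renamed:
    print(line)
```

After renaming, (4) no longer writes a register that (2) reads, and (5) no longer writes the register that (2) writes, so only the flow dependences remain.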
Principles of Register Renaming
• Additional registers reestablish a one-to-one correspondence between values and registers.
• Extra registers → scheduled by hardware and associated with values
• A new value → a new (hardware) register
• Anti and Output Dependencies are avoided
• Registers are reused according to program needs
Baseline Superscalar Model
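[Figure: baseline superscalar pipeline with the blocks Inst Fetch, Inst Decode, Renaming, Issue Window (Wake Up, Select), Register File, Bypass, Exec, and Data Cache; the later phases are labeled Execution Bypass, Data Cache Access, and Register Write & Instruction Commit]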
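Micro Data Flow: Conceptual Model

    A → R1
    R1 * B → R2
    R2 / C → R1
    R4 + R1 → R4

[Figure: dataflow graph of the four operations (Load, *, /, +) with inputs A, B, C; the values of R1, R2, and R4 are carried by operand registers labeled OR1, OR3, OR4, OR5, OR6]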
Register Types
• Two Kinds of registers
– Forwarding Registers
  • Program / Instruction Visible
  • Compiler and programmer scheduled
– Physical Operand Registers
  • Not Visible
  • Scheduled and assigned by the hardware
Reorder Buffer & Instruction Commit
• Instruction Commit
– When an instruction is allowed to update memory and/or registers
– Concept used a lot in speculation
• Instruction Commit vs. Instruction Execution
– When speculation is used, inst. commit may not happen immediately after inst. execution
• Reorder Buffer
– A hardware buffer holding instructions that have completed but not yet committed
– Execute out-of-order but commit in-order
– Extends the register set with extra registers
ROB Stages
• Issue
– Dispatch an instruction from the instruction queue
– Reserve a ROB entry and a reservation station
• Execute
– Stall for operands
– RAW resolved
• Write Result
– Write back to any reservation stations waiting for it and to the ROB
• Commit
– Normal Commit: Update Registers
– Store Commit: Update Memory
– False Branch: Flush the ROB and restart execution
ROB High Level Overview
[Figure: block diagram with a Fetcher, Instruction Queue, Decoder, Reorder Buffer, Instruction Window, Register File, and three Functional Units]
ROB Organization
• Content Addressable
– X = A + B: X is renamed to a ROB register (X’) and all later references to X are replaced by it
– If X is needed as an operand, then ROB[X] is looked up, and
  • the value (if available) or a tag (if not) is returned if X’ exists in the ROB, or …
  • the “Visible” register bank is searched if X’ is not found in the ROB
![Page 73: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/73.jpg)
ROB Organization
• If there is more than one ROB[X], the most recent entry is fetched from the ROB
• When a result is produced:
  – All reservation stations that hold a tag for that result are updated
• When an instruction commits:
  – Update register banks and memory
  – Flush the ROB in case of a false branch
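The operand lookup described on these two slides can be sketched as below. This is a minimal model under stated assumptions: the ROB is a plain list in issue order, and `Entry`, `read_operand` and the slot names are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Entry:
    tag: str                  # ROB slot name, e.g. "B5"
    reg: str                  # architectural register being renamed
    value: Optional[int] = None
    full: bool = False        # True once the result has been written


def read_operand(rob: list, regfile: dict, reg: str) -> Union[int, str]:
    """Return a value if one is available, otherwise the tag to wait on."""
    for entry in reversed(rob):          # newest entry first
        if entry.reg == reg:
            return entry.value if entry.full else entry.tag
    return regfile[reg]                  # not renamed: use the visible register


regfile = {"R0": 1, "R1": 0}
rob = [Entry("B5", "R1", 6, full=True), Entry("B11", "R1")]
newest = read_operand(rob, regfile, "R1")   # the tag of the most recent R1 entry
visible = read_operand(rob, regfile, "R0")  # R0 is not in the ROB
```

Searching from the newest entry backwards is one simple way to model "the most recent entry is fetched"; real hardware does this with an associative search rather than a loop.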
![Page 74: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/74.jpg)
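An Example
R1 = R0 + 5 (1)
R2 = R1 + 6 (2)
R1 = R1 + 3 (3)
R4 = R1 + 9 (4)
Assume R0 = 1 initially.
[Figure: Reg and R-buffer blocks, with arrows labeled inst results and inst operands]
After (1) and (2) are issued:
• R-buffer (slot, R-name, value, full)
  – B5, R1, (none), 0
  – B8, R2, (none), 0
• Instruction window (op, op1, op2, dest; * marks an enabled entry)
  – * + 1 5 B5 (op1 is R0 = 1)
  – + B5 6 B8
After (3) is issued, with (1) in flight:
• R-buffer (slot, R-name, value, full)
  – B5, R1, (none), 0
  – B8, R2, (none), 0
  – B11, R1, (none), 0
• Instruction window (op, op1, op2, dest)
  – + 1 5 B5
  – + B5 6 B8
  – + B5 3 B11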
![Page 75: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/75.jpg)
An Example (continued)
After (1) has completed:
• R-buffer (slot, R-name, value, full)
  – B5, R1, 6, 1
  – B8, R2, (none), 0
  – B11, R1, (none), 0
• Instruction window (op, op1, op2, dest; * marks an enabled entry)
  – * + 6 6 B8
  – * + 6 3 B11
![Page 76: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/76.jpg)
An Example (continued)
After (4) is issued:
• R-buffer (slot, R-name, value, full)
  – B5, R1, 6, 1
  – B8, R2, (none), 0
  – B11, R1, (none), 0
  – B13, R4, (none), 0
• Instruction window (op, op1, op2, dest; * marks an enabled entry)
  – * + 6 6 B8
  – * + 6 3 B11
  – + B11 9 B13
• Note: the operand of (4) comes directly from B11 (not “6”), so the flow dependence is handled!
• Also note that instructions (2) and (3) can complete out of order, but the “6” they read is not affected, so the anti-dependence is resolved properly.
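The four-instruction example can be replayed with a small script. It is a sketch rather than a model of real hardware: ROB slots are list indices 0 to 3 (standing in for B5, B8, B11 and B13), and the `rob`, `latest` and `execute` names are assumptions of this illustration. Instruction (3) is deliberately completed before (2).

```python
# Each instruction is (dest, src, immediate), meaning dest = src + immediate.
program = [("R1", "R0", 5), ("R2", "R1", 6), ("R1", "R1", 3), ("R4", "R1", 9)]
regs = {"R0": 1, "R1": 0, "R2": 0, "R4": 0}

rob = []          # entries in program order
latest = {}       # register -> index of the newest ROB entry that writes it

# Issue: rename each source to its newest in-flight producer, if there is one.
for dest, src, imm in program:
    if src in latest:
        operand = ("tag", latest[src])
    else:
        operand = ("val", regs[src])
    rob.append({"dest": dest, "operand": operand, "imm": imm, "value": None})
    latest[dest] = len(rob) - 1


def execute(i):
    entry = rob[i]
    kind, x = entry["operand"]
    if kind == "tag":                    # read the producer's result from the ROB
        x = rob[x]["value"]
        assert x is not None, "producer has not completed yet"
    entry["value"] = x + entry["imm"]


# Complete out of order: (1), then (3) before (2), then (4).
for i in (0, 2, 1, 3):
    execute(i)

# Commit in program order.
for entry in rob:
    regs[entry["dest"]] = entry["value"]
```

Instruction (4) is bound to slot 2 (the second write to R1), and (2) still reads the 6 produced by (1) even though (3) finished first, so the run ends with R1 = 9, R2 = 12 and R4 = 18.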
![Page 77: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/77.jpg)
Questions
• Memory Renaming
  – Not as attractive as R-R data flow
  – Loads and stores are less frequent
  – Memory locations are less reused (in the register allocation sense)
  – Memory ops have only one memory operand
• Store Buffer
  – Gives loads priority to access the data cache
  – In-order stores
  – Ensures that all earlier instructions have been performed before a store completes
![Page 78: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/78.jpg)
Memory Dataflow
• More difficult
  – Memory addresses are longer
  – Memory address may not be available at decode stage
• Note:
  – In-order cache state: all stores must be performed in program order
    • All previous operations should have completed
  – No cache reorder / check mechanism
![Page 79: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/79.jpg)
Load / Store Policy
• Loads may bypass stores if there are NO true dependencies among them
  – A check should be performed to ensure correctness
  – A load cannot bypass a store to the same address. If one is detected, the load is satisfied directly from the store buffer
• Loads are performed in program order at the data cache, with respect to other loads
  – Simplicity
  – Out of order doesn’t help much anyway
• At the data cache interface
  – No anti-dependence violations: no store can bypass a load
  – No output-dependence violations: stores cannot bypass each other
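A store buffer that follows this policy can be sketched as below. The `StoreBuffer` class and its method names are hypothetical, and the sketch assumes every buffered store address is already known; unresolved addresses are not modeled here.

```python
class StoreBuffer:
    """Pending stores, oldest first; each is (address, value)."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.pending = []

    def store(self, addr: int, value: int) -> None:
        self.pending.append((addr, value))      # stores stay in program order

    def load(self, addr: int) -> int:
        # Check for a true dependence on a buffered store, newest first.
        for st_addr, st_value in reversed(self.pending):
            if st_addr == addr:
                return st_value                 # forwarded from the store buffer
        return self.memory.get(addr, 0)         # no match: bypass pending stores

    def drain_one(self) -> None:
        if self.pending:
            addr, value = self.pending.pop(0)   # oldest store reaches the cache first
            self.memory[addr] = value


mem = {100: 7, 200: 9}
sb = StoreBuffer(mem)
sb.store(100, 42)
bypassed = sb.load(200)    # independent load goes ahead of the pending store
forwarded = sb.load(100)   # dependent load is satisfied from the buffer
before_drain = mem[100]    # memory still holds the old value
sb.drain_one()
```

Draining from the front of the list keeps stores in program order at the cache interface, so no store bypasses another.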
![Page 80: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/80.jpg)
Memory Dependencies & Event Ordering
• In case a store target cannot be resolved
  – All subsequent loads are withheld until the address can be resolved
• If two loads are in the instruction window
  – Do they need to wait for each other to be resolved?
![Page 81: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/81.jpg)
Out-Of-Order Architectures
[Figure: pipeline with Fetch, Decode / Rename, then INT, FP and L/S units plus the ROB, connected to Mem]
i0: R2 * R3
i1: load@[R1 + R4]
…
i2: load@[R5]
Independent loads can execute in parallel
If i1 and i2 are independent, then they can be executed at the same time
Loads do NOT need to wait for each other, even when they address the same memory location
![Page 82: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/82.jpg)
Summary
• Reorder Buffer
  – The most powerful of the complex dynamic scheduling techniques
• Simplest: Scoreboarding
• Hardware implementation is complex
  – Worth its returns?
![Page 83: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/83.jpg)
Tomasulo’s Algorithm
• Tomasulo, R.M. “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”, IBM J. of R&D 11:1 (Jan. 1967), pp. 25-33
• IBM 360/91 (three years after the CDC 6600 and just before caches)
• Features:
  – CDB: Common Data Bus
  – Reservation Units (reservation stations): hardware features that allow data to be fetched, used and reused as soon as it becomes available. They allow register renaming and are decentralized in nature (as opposed to scoreboarding)
![Page 84: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/84.jpg)
Tomasulo’s Algorithm
• Control and buffers distributed with functional units
• HW renaming of registers
• CDB broadcasting
• Load / Store buffers are treated as functional units
• Reservation Stations:
  – Hazard detection and instruction control
  – 4-bit tag field to specify which station or buffer will produce the result
• Register Renaming
  – Tag assigned at issue (IS)
  – Tag discarded after write back
![Page 85: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/85.jpg)
Comparison
• Scoreboarding
  – Centralized data structure and control
  – Register bit
    • Simple, low cost
  – Structural hazards solved by FU
  – Solves RAW by register bit
  – Solves WAR in write
  – Solves WAW: stalls on issue
• Tomasulo’s Algorithm
  – Distributed control
  – Tagged registers + register renaming
  – Structural hazard stalls on reservation station
  – Solves RAW by CDB
  – Solves WAR by copying operand to reservation station
  – Solves WAW by renaming
  – Limited: CDB
    • Broadcast
    • 1 per cycle
![Page 86: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/86.jpg)
Reservation Station Fields
• Op: operation to perform in the unit
• Vj, Vk: values of the source operands
  – Store buffers have a V field: the result to be stored
• Qj, Qk: reservation stations producing the source registers
  – Zero means ready
• Busy
• A: memory address calculation
• Register file:
  – Qi: the number of the reservation station that will write to this register
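The fields above map naturally onto a small record type. The sketch below is an assumption-laden illustration: the class name, the `ready` and `capture` helpers and the station names are hypothetical, and `None` plays the role of the slide's "zero means ready".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                      # e.g. "A1" or "M2" (naming is an assumption)
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None     # operand values, valid once the matching Q is clear
    vk: Optional[float] = None
    qj: Optional[str] = None       # producing station; None stands for "zero = ready"
    qk: Optional[str] = None
    a: Optional[int] = None        # memory address, used by load / store buffers

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag: str, value: float) -> None:
        """Snoop a CDB broadcast and fill any operand waiting on `tag`."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


reg_status = {"F8": "A1"}          # Qi: F8 will be written by station A1
rs = ReservationStation("A1", busy=True, op="SUBD", qj="L2", qk="L1")
rs.capture("L1", 40.0)
half_ready = rs.ready()            # still waiting on L2
rs.capture("L2", 32.0)
```

Once both Q fields are cleared the station holds its own copies of the operands, which is why later writes to the source registers cannot disturb it.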
![Page 87: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/87.jpg)
The Architecture
[Figure: load buffers (1 to 6) fed from memory; floating-point operations from the instruction unit and FP registers feed, over the operation bus and operand buses, the reservation stations (1 to 3 and 1 to 2) in front of the FP adders and FP multipliers; store buffers (1 to 3) go to memory; results return on the common data bus (CDB)]
• 3 Adders
• 2 Multipliers
• Load buffers (6)
• Store buffers (3)
• FP Queue
• FP registers
• CDB: Common Data Bus
![Page 88: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/88.jpg)
Tomasulo’s Algorithm’s Steps
• Issue
  – Issue if an empty reservation station is found; fetch operands if they are in registers, otherwise assign a tag
  – If no empty reservation station is found, stall and wait for one to become free
  – Renaming is performed here, and WAW and WAR hazards are resolved
• Execute
  – If operands are not ready, monitor the CDB for them
  – RAW hazards are resolved
  – When the operands are ready, execute the op in the FU
• Write Back
  – Send the result on the CDB and update the registers and the store buffers
  – Store buffers write to memory during this step
• Exception Behavior
  – During Execute: no instruction is allowed to begin execution until all branches before it have completed
![Page 89: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/89.jpg)
Tomasulo’s Algorithm
• Note that:
  – Upon entering a reservation station, source operands are either filled with values or renamed
  – The new names are in 1-to-1 correspondence with FU names
• Question:
  – How are output dependencies resolved?
    • Two pending writes to a register
    • How to determine that a read will get the most recent value if they complete out of order
![Page 90: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/90.jpg)
Features of T. Alg.
• The value of an operand (for any instruction already issued to a reservation station) will be read from the CDB; it will not be read from the register field.
• Instructions can be issued even before their operands have been produced (they know the operands will arrive on the CDB)
![Page 91: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/91.jpg)
An Example
LD F6, 34(R2) (1)
LD F2, 45(R3) (2)
MULD F0, F2, F4 (3)
SUBD F8, F2, F6 (4)
DIVD F10, F0, F6 (5)
ADDD F6, F8, F2 (6)
![Page 92: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/92.jpg)
An Example
State after (1) LD F6,34(R2) is issued:
• Memory Unit (load buffers)
  – 1: Op = L, Qj = 0, Qk = 0, Busy = Yes, Addr = 34+[R2]
  – 2, 3: free
• FP Adders (stations 1 to 3): free
• FP Mult (stations 1 to 2): free
• FP registers: F6 ← L1
• Waiting to issue: LD F2,45(R3); MULD F0,F2,F4; SUBD F8,F2,F6; DIVD F10,F0,F6; ADDD F6,F8,F2
![Page 93: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/93.jpg)
An Example
State after (2) LD F2,45(R3) is issued:
• Memory Unit (load buffers)
  – 1: Op = L, Qj = 0, Qk = 0, Busy = Yes, Addr = 34+[R2]
  – 2: Op = L, Qj = 0, Qk = 0, Busy = Yes, Addr = 45+[R3]
  – 3: free
• FP Adders (stations 1 to 3): free
• FP Mult (stations 1 to 2): free
• FP registers: F2 ← L2, F6 ← L1
• Waiting to issue: MULD F0,F2,F4; SUBD F8,F2,F6; DIVD F10,F0,F6; ADDD F6,F8,F2
![Page 94: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/94.jpg)
An Example
State after (3) MULD F0,F2,F4 is issued:
• Memory Unit (load buffers)
  – 1: Op = L, Busy = Yes, Addr = 34+[R2]
  – 2: Op = L, Busy = Yes, Addr = 45+[R3]
• FP Adders (stations 1 to 3): free
• FP Mult
  – 1: Op = *, Vj = pending (45+[R3]), Vk = [F4] = 4, Qj = L2, Qk = 0, Busy = Yes
  – 2: free
• FP registers: F0 ← M1, F2 ← L2, F6 ← L1
• Waiting to issue: SUBD F8,F2,F6; DIVD F10,F0,F6; ADDD F6,F8,F2
![Page 95: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/95.jpg)
An Example
State after (4) SUBD F8,F2,F6 is issued:
• Memory Unit (load buffers)
  – 1: Op = L, Busy = Yes, Addr = 34+[R2]
  – 2: Op = L, Busy = Yes, Addr = 45+[R3]
• FP Adders
  – 1: Op = -, Vj = pending (45+[R3]), Vk = pending (34+[R2]), Qj = L2, Qk = L1, Busy = Yes
  – 2, 3: free
• FP Mult
  – 1: Op = *, Vj = pending (45+[R3]), Vk = [F4] = 4, Qj = L2, Qk = 0, Busy = Yes
  – 2: free
• FP registers: F0 ← M1, F2 ← L2, F6 ← L1, F8 ← A1
• Waiting to issue: DIVD F10,F0,F6; ADDD F6,F8,F2
![Page 96: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/96.jpg)
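An Example
State after (5) DIVD F10,F0,F6 is issued:
• Memory Unit (load buffers)
  – 1: Op = L, Busy = Yes, Addr = 34+[R2]
  – 2: Op = L, Busy = Yes, Addr = 45+[R3]
• FP Adders
  – 1: Op = -, Vj = pending (45+[R3]), Vk = pending (34+[R2]), Qj = L2, Qk = L1, Busy = Yes
  – 2, 3: free
• FP Mult
  – 1: Op = *, Vj = pending (45+[R3]), Vk = [F4] = 4, Qj = L2, Qk = 0, Busy = Yes
  – 2: Op = /, Vj = pending, Vk = pending (34+[R2]), Qj = M1, Qk = L1, Busy = Yes
• FP registers: F0 ← M1, F2 ← L2, F6 ← L1, F8 ← A1, F10 ← M2
• Waiting to issue: ADDD F6,F8,F2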
![Page 97: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/97.jpg)
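An Example
State after (6) ADDD F6,F8,F2 is issued:
• Memory Unit (load buffers)
  – 1: Op = L, Busy = Yes, Addr = 34+[R2]
  – 2: Op = L, Busy = Yes, Addr = 45+[R3]
• FP Adders
  – 1: Op = -, Vj = pending (45+[R3]), Vk = pending (34+[R2]), Qj = L2, Qk = L1, Busy = Yes
  – 2: Op = +, Vj = pending, Vk = pending (45+[R3]), Qj = A1, Qk = L2, Busy = Yes
  – 3: free
• FP Mult
  – 1: Op = *, Vj = pending (45+[R3]), Vk = [F4] = 4, Qj = L2, Qk = 0, Busy = Yes
  – 2: Op = /, Vj = pending, Vk = pending (34+[R2]), Qj = M1, Qk = L1, Busy = Yes
• FP registers: F0 ← M1, F2 ← L2, F6 ← A2 (renamed from L1), F8 ← A1, F10 ← M2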
![Page 98: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/98.jpg)
An Example
Some time later, L1 returns 40 and commits:
• Memory Unit (load buffers)
  – 1: free
  – 2: Op = L, Busy = Yes, Addr = 45+[R3]
• FP Adders
  – 1: Op = -, Vj = pending (45+[R3]), Vk = 40, Qj = L2, Qk = 0, Busy = Yes
  – 2: Op = +, Vj = pending, Vk = pending (45+[R3]), Qj = A1, Qk = L2, Busy = Yes
• FP Mult
  – 1: Op = *, Vj = pending (45+[R3]), Vk = [F4] = 4, Qj = L2, Qk = 0, Busy = Yes
  – 2: Op = /, Vj = pending, Vk = 40, Qj = M1, Qk = 0, Busy = Yes
• FP registers: F0 ← M1, F2 ← L2, F6 ← A2, F8 ← A1, F10 ← M2
![Page 99: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/99.jpg)
An Example
Some time later, L2 returns 32 and commits:
• Memory Unit (load buffers): all free
• FP Adders
  – 1: Op = -, Vj = 32, Vk = 40, Qj = 0, Qk = 0, Busy = Yes
  – 2: Op = +, Vj = pending, Vk = 32, Qj = A1, Qk = 0, Busy = Yes
• FP Mult
  – 1: Op = *, Vj = 32, Vk = [F4] = 4, Qj = 0, Qk = 0, Busy = Yes
  – 2: Op = /, Vj = pending, Vk = 40, Qj = M1, Qk = 0, Busy = Yes
• FP registers: F0 ← M1, F2 = 32, F6 ← A2, F8 ← A1, F10 ← M2
![Page 100: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/100.jpg)
An Example
A1 and M1 complete and commit:
• Memory Unit (load buffers): all free
• FP Adders
  – 1: free
  – 2: Op = +, Vj = -8, Vk = 32, Qj = 0, Qk = 0, Busy = Yes
• FP Mult
  – 1: free
  – 2: Op = /, Vj = 128, Vk = 40, Qj = 0, Qk = 0, Busy = Yes
• FP registers: F0 = 128, F2 = 32, F6 ← A2, F8 = -8, F10 ← M2
![Page 101: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/101.jpg)
An Example
A2 and M2 complete and commit:
• Memory Unit, FP Adders and FP Mult: all stations free
• FP registers: F0 = 128, F2 = 32, F6 = 24, F8 = -8, F10 = 3.2
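The whole walkthrough can be checked with a short script. This is a simplified sketch, not a cycle-accurate model: the `issue`, `execute` and `broadcast` helpers are hypothetical names, there is no limit on the number of stations, and the load results (40 and 32) and F4 = 4 are taken from the slides. Results are broadcast in the order shown above: L1, L2, A1, M1, A2, M2.

```python
import operator

OPS = {"MULD": operator.mul, "SUBD": operator.sub,
       "DIVD": operator.truediv, "ADDD": operator.add}
LOAD_RESULT = {"L1": 40.0, "L2": 32.0}   # values the two loads return in the slides

regs = {"F4": 4.0}
qi = {}          # register status: register -> tag of the station that will write it
stations = {}    # tag -> {"op", "v": operand values, "q": tags still awaited}


def issue(tag, op, dest, srcs=()):
    v, q = [], []
    for s in srcs:
        if s in qi:                      # still being produced: record the tag
            v.append(None)
            q.append(qi[s])
        else:                            # available: copy the value into the station
            v.append(regs[s])
            q.append(None)
    stations[tag] = {"op": op, "v": v, "q": q}
    qi[dest] = tag                       # rename: later readers of dest wait on tag


def execute(tag):
    st = stations[tag]
    assert all(q is None for q in st["q"]), "operands not ready"
    return LOAD_RESULT[tag] if st["op"] == "LD" else OPS[st["op"]](*st["v"])


def broadcast(tag, value):
    """Write back on the CDB: stations and registers match on the tag."""
    for st in stations.values():
        for i, waiting_on in enumerate(st["q"]):
            if waiting_on == tag:
                st["v"][i], st["q"][i] = value, None
    for reg in [r for r, t in qi.items() if t == tag]:
        regs[reg] = value
        del qi[reg]
    del stations[tag]                    # the station is free again


issue("L1", "LD", "F6")
issue("L2", "LD", "F2")
issue("M1", "MULD", "F0", ("F2", "F4"))
issue("A1", "SUBD", "F8", ("F2", "F6"))
issue("M2", "DIVD", "F10", ("F0", "F6"))
issue("A2", "ADDD", "F6", ("F8", "F2"))
issued_qi = dict(qi)                     # snapshot of the tags after all six issue

for tag in ("L1", "L2", "A1", "M1", "A2", "M2"):
    broadcast(tag, execute(tag))
```

Note that the result of L1 is never written into F6: by the time it is broadcast, F6 carries the tag A2, so only the stations waiting on L1 capture the value 40.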
![Page 102: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/102.jpg)
ROB and Tomasulo’s Alg.
• Many elements of Tomasulo’s algorithm are already included
• Major difference? How WAW is handled
In Tomasulo, this is done by keeping a “tag” with each register X. The tag is updated each time an instruction that writes X (e.g. a “+”) is issued; i.e. X-tag “+3” means the 3rd + unit is reserved.
On write back via the CDB, the tag of the FU is compared with the tag of the register:
if tag of FU = X-tag, overwrite the register (e.g. X)
else ignore the result
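The tag comparison can be written out directly. The sketch below assumes a dictionary `reg_tag` holding the X-tag of each register with a pending write; `write_back` and the tag strings are hypothetical names for this illustration.

```python
def write_back(regs: dict, reg_tag: dict, fu_tag: str, reg: str, value: int) -> bool:
    """Apply a CDB result to `reg` only if `reg` still waits on `fu_tag`."""
    if reg_tag.get(reg) == fu_tag:
        regs[reg] = value
        del reg_tag[reg]          # the pending write is done
        return True
    return False                  # stale producer: ignore the result


regs = {"X": 0}
reg_tag = {}
reg_tag["X"] = "+1"               # first write to X issued on the 1st + unit
reg_tag["X"] = "+3"               # a later write to X issued on the 3rd + unit

newer_first = write_back(regs, reg_tag, "+3", "X", 30)   # completes out of order
older_late = write_back(regs, reg_tag, "+1", "X", 10)    # arrives afterwards
```

Whichever unit finishes first, X ends up with the value of the most recently issued write, which is how the output dependence is resolved without stalling.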
![Page 103: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/103.jpg)
Tomasulo’s Algorithm
• Advantages
  – Distribution of hazard detection logic
  – Register renaming and reservation stations take care of all data hazards
• Disadvantages
  – Hardware cost: high-speed associative memory for tags + complex control logic
  – A single CDB may be a bottleneck, while multiple CDBs may be too costly (all associative memory must be duplicated)
![Page 104: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/104.jpg)
Conclusions
• Good for pipelined architectures for which code is difficult to schedule and which are short of “visible” registers
• Future:
  – Hybrid between software and hardware techniques
  – Static scheduling of R-R operations
  – Dynamic scheduling of loads and stores
![Page 106: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/106.jpg)
Example: Dynamic Scheduling in Pentium 4
• Fetch up to 3 IA-32 instructions per cycle
• Decode them into micro-ops and send them to the out-of-order execution engine
• Commit up to 3 micro-ops per cycle
• Pipeline takes 20 cycles
• Register renaming files
  – Potentially 128 outstanding results
• Seven integer execution units
![Page 107: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/107.jpg)
An Example of an OoO Engine
[Figure: Intel Xeon Out of Order Engine Pipeline. Picture courtesy of Intel, from “Hyper-Threading Technology Architecture and Microarchitecture”]
![Page 108: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/108.jpg)
VLIW vs Superscalar
• Superscalar
  – Advantages
    • Better code density
    • Code compatible
  – Difference
    • Dynamic scheduling
  – Disadvantages
    • More IF and ID
    • More delay slots are needed
    • Different FU
• VLIW
  – Advantages
    • Fixed instruction format
    • Explicit parallelism exposed
    • Trace scheduling
  – Difference
    • Static scheduling
  – Disadvantages
    • Static scheduling
    • No dynamic decision
    • Code explosion
    • Caches are difficult to use
![Page 109: Topic 3](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814e55550346895dbbe93d/html5/thumbnails/109.jpg)
Bibliography
• Texas Instruments, “TMS320C600 Technical Brief.” February 1999. www.ti.com
• Intel Pentium 4 Northwood. www.chip-architect.com. April 2003.
• “Hyper-Threading Technology Architecture and Microarchitecture.” Intel Technology Journal, Volume 6, Issue 1, February 2002, pp. 4-15