Chapter 8
Pipelining
Pipelining
• A strategy for employing parallelism to achieve better performance
• Taking the “assembly line” approach to fetching and executing instructions
The Cycle
The control unit:
Fetch
Execute
Fetch
Execute
Etc.
The Cycle
How about separate components for fetching the instruction and executing it?
Then
fetch unit: fetch instruction
execute unit: execute instruction
So, how about fetch while execute?
Overlapping fetch with execute
Two stage pipeline
Both components busy during each clock cycle
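The two-stage overlap can be sketched in a few lines of Python (illustrative, not from the slides): each cycle, the fetch unit works on the next instruction while the execute unit works on the one just fetched.

```python
def two_stage_schedule(n_instr):
    """Per-cycle activity for a 2-stage (Fetch/Execute) pipeline."""
    cycles = []
    for c in range(n_instr + 1):              # n+1 cycles complete n instructions
        fetch = f"F{c + 1}" if c < n_instr else "-"
        execute = f"E{c}" if c >= 1 else "-"
        cycles.append((fetch, execute))
    return cycles

for i, (f, e) in enumerate(two_stage_schedule(4), start=1):
    print(f"cycle {i}: fetch={f:3} execute={e}")
```

After the first cycle fills the pipeline, both components are busy every cycle.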
The Cycle
The cycle can be divided into four parts
fetch instruction
decode instruction
execute instruction
write result back to memory
So, how about four components?
The four components operating in parallel
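The same idea extends to four stages. A minimal sketch (stage names F, D, E, W as in the slides) prints which stage each instruction occupies in each cycle:

```python
STAGES = ["F", "D", "E", "W"]

def four_stage_schedule(n_instr):
    """For each cycle, list the 'stage+instruction' pairs active that cycle."""
    total_cycles = n_instr + len(STAGES) - 1      # fill time + one finish per cycle
    table = []
    for c in range(total_cycles):
        active = []
        for i in range(n_instr):
            s = c - i                             # instruction i+1 enters F at cycle i+1
            if 0 <= s < len(STAGES):
                active.append(f"{STAGES[s]}{i + 1}")
        table.append(active)
    return table

for cycle, active in enumerate(four_stage_schedule(4), start=1):
    print(f"cycle {cycle}: {' '.join(active)}")
```

By cycle 4 all four components are busy at once (W1, E2, D3, F4).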
buffer for instruction
buffer for operands
buffer for result
[Interstage buffer contents at one instant: instruction I3 just fetched; operands, operation info, and write info for I2; result of instruction I1 awaiting write-back]
One clock cycle for each pipeline stage
Therefore cycle time must be long enough for the longest stage
A unit is idle if it requires less time than another
Best if all stages are about the same length
Cache memory helps
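The point about the longest stage can be made concrete with a small sketch (hypothetical stage times in nanoseconds): the clock period is the maximum stage time, so one slow stage slows every instruction.

```python
def pipeline_timing(stage_ns, n_instr):
    """Return (clock period, total time) for an n-stage pipeline.

    The clock period must accommodate the slowest stage; total time is
    the fill time plus one completion per cycle thereafter."""
    period = max(stage_ns)
    cycles = n_instr + len(stage_ns) - 1
    return period, period * cycles

# Hypothetical stage times (ns), 1000 instructions:
print(pipeline_timing([2, 2, 2, 2], 1000))   # (2, 2006): balanced stages
print(pipeline_timing([2, 2, 8, 2], 1000))   # (8, 8024): one slow stage hurts all
```

Balancing the stages, e.g. by making memory access via cache as fast as an ALU operation, keeps the period short.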
Fetching (instructions or data) from main memory may take 10 times as long as an operation such as ADD
Cache memory (especially if on the same chip) allows fetching as quickly as other operations
One clock cycle per component, four cycles total to complete an instruction
Completes an instruction each clock cycle
Therefore, four times as fast as without pipeline
Completes an instruction each clock cycle
Therefore, four times as fast as without pipeline
as long as nothing takes more than one cycle
But sometimes things take longer -- for example, most executes such as ADD take one clock, but suppose DIVIDE takes three
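A sketch of this effect, under the simplifying assumptions that the pipeline has stages F, D, E, W, only Execute can take multiple cycles, and instructions proceed strictly in order:

```python
def total_cycles(exec_latencies):
    """Completion cycle of the last instruction in a 4-stage pipeline
    (F, D, E, W) where Execute may take more than one cycle (in order)."""
    e_free = 0                                   # last cycle Execute was busy
    done = 0
    for i, lat in enumerate(exec_latencies, start=1):
        e_start = max(i + 2, e_free + 1)         # after Decode, and after E is free
        e_free = e_start + lat - 1
        done = e_free + 1                        # Write in the following cycle
    return done

print(total_cycles([1, 1, 1, 1, 1]))   # 8: one result per cycle after the fill
print(total_cycles([1, 1, 3, 1, 1]))   # 10: a 3-cycle DIVIDE adds 2 stall cycles
```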
and other stages idle:
Write has nothing to write
Decode can't use its "out" buffer
Fetch can't use its "out" buffer
A data “hazard” has caused the pipeline to “stall”
no data for Write
An instruction “hazard” (or “control hazard”) has caused the pipeline to “stall”
Instruction I2 not in the cache, required a main memory access
Structural Hazards
• Conflict over use of a hardware resource
• Memory
– Can't fetch an instruction while another instruction is fetching an operand, for example
– Cache: same
• Unless cache has multiple ports
• Or separate caches for instructions and data
• Register file
– One access at a time, again unless multiple ports
Structural Hazards
• Conflict over use of a hardware resource--such as the register file
Example:
LOAD X(R1), R2 (LOAD R2, X(R1) in MIPS)
Effective address: X + [R1], i.e., the offset X plus the address in R1
Load that word from memory (cache) into R2
I2 writing to register file
I3 must wait for register file
I2 takes extra cycle for cache access as part of execution
calculate the address
I5 fetch delayed
Data Hazards
• Situations that cause the pipeline to stall because data to be operated on is delayed (execute takes an extra cycle, for example)
Data Hazards
• Or, because of data dependencies
– Pipeline stalls because an instruction depends on data from another instruction
Concurrency
A ← 3 + A
B ← 4 × A
Can't be performed concurrently--result incorrect if new value of A is not used

A ← 5 × C
B ← 20 + C
Can be performed concurrently (or in either order) without affecting result
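The two pairs can be checked directly in Python (a minimal illustration of the data dependency, not of pipeline hardware):

```python
# Dependent pair: order matters.
A = 10
A = 3 + A          # must complete first
B = 4 * A          # uses the NEW value of A
assert (A, B) == (13, 52)

# If the stale A is used instead (as in an unsynchronized overlap):
A = 10
B = 4 * A          # reads the old A
A = 3 + A
assert (A, B) == (13, 40)   # B differs: the dependency was violated

# Independent pair: either order gives the same result.
C = 2
A1, B1 = 5 * C, 20 + C
B2, A2 = 20 + C, 5 * C
assert (A1, B1) == (A2, B2) == (10, 22)
```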
Concurrency
A ← 3 + A
B ← 4 × A
Second operation depends on completion of first operation

A ← 5 × C
B ← 20 + C
The two operations are independent
MUL R2, R3, R4
ADD R5, R4, R6 (dependent on result in R4 from previous instruction)
will write result in R4
can’t finish decoding until result is in R4
Data Forwarding
• Pipeline stalls in previous example waiting for I1’s result to be stored in R4
• Delay can be reduced if result is forwarded directly to I2
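A rough cycle count for the MUL/ADD pair, as a sketch of the slides' model (4-stage pipeline; Decode reads the registers, so without forwarding the ADD cannot decode until MUL's result has been written to R4):

```python
def add_completion(forwarding):
    """Completion cycle of ADD R5, R4, R6 issued right after MUL R2, R3, R4
    in a 4-stage pipeline (F, D, E, W). Cycle numbers are illustrative."""
    mul_e, mul_w = 3, 4           # MUL: F=1, D=2, E=3, W=4
    if forwarding:
        add_e = mul_e + 1         # product forwarded from Execute to Execute
    else:
        add_e = mul_w + 2         # decode after the write (cycle 5), execute in 6
    return add_e + 1              # Write follows Execute

print(add_completion(False))      # 7: two stall cycles
print(add_completion(True))       # 5: no stall
```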
pipeline stall
data forwarding
MUL R2, R3, R4
[Forwarding datapath: operands from R2 and R3 feed the multiplier; the product R2 × R3 is routed both to R4 and directly to I2]
ADD R5, R4, R6
MUL R2, R3, R4
[Timing detail: the product R2 × R3 is forwarded straight into the ADD's execute stage, standing in for the not-yet-written R4]
![Page 37: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/37.jpg)
If solved by software:
MUL R2, R3, R4
NOOP
NOOP
ADD R5, R4, R6
2 cycle stall introduced by hardware
(if no data forwarding)
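A compiler-style pass that inserts the NOOPs can be sketched as follows (a toy three-operand format `OP src, src, dst` is assumed; `gap` is the number of slots the hardware needs between producer and consumer):

```python
def _fields(instr):
    op, *regs = instr.replace(",", "").split()
    return op, regs

def uses(instr):
    """Registers an instruction reads (toy format: op src, src, dst)."""
    op, regs = _fields(instr)
    return set(regs[:-1]) if op != "NOOP" else set()

def writes(instr):
    """Register an instruction writes, or None for NOOP."""
    op, regs = _fields(instr)
    return regs[-1] if op != "NOOP" else None

def insert_noops(program, gap=2):
    """Insert NOOPs so a consumer sits at least `gap` slots behind its
    producer: the software fix when there is no data forwarding."""
    out = []
    for instr in program:
        for back in range(1, gap + 1):
            if len(out) >= back and writes(out[-back]) in uses(instr):
                out.extend(["NOOP"] * (gap - back + 1))
                break
        out.append(instr)
    return out

prog = ["MUL R2, R3, R4", "ADD R5, R4, R6"]
print(insert_noops(prog))
# ['MUL R2, R3, R4', 'NOOP', 'NOOP', 'ADD R5, R4, R6']
```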
Side Effects
• ADD (R1)+, R2, R3
– Not only changes the destination register, but also changes R1 (autoincrement)
• ADD R1, R3
• ADDWC R2, R4
– Add with carry is dependent on the condition code flag set by the previous ADD: an implicit dependency
Side Effects
• Data dependency on something other than the result destination
• Multiple dependencies
• Pipelining clearly works better if side effects are avoided in the instruction set
– Simple instructions
Instruction Hazards
• Pipeline depends on steady stream of instructions from the instruction fetch unit
pipeline stall from a cache miss
Decode, execute, and write units are all idle for the “extra” clock cycles
Branch Instructions
• Their purpose is to change the content of the PC and fetch another instruction
• Consequently, the fetch unit may be fetching an “unwanted” instruction
SW R1, A
BUN K
LW R5, B
two stage pipeline
fetch instruction 3
discard instruction 3 and fetch instruction K instead
computes new PC value
the lost cycle is the “branch penalty”
four stage pipeline
instruction 3 fetched and decoded
instruction 4 fetched
instructions 3 and 4 discarded, instruction K fetched
In a four stage pipeline, the penalty is two clock cycles
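In general, the penalty equals the number of wrong-path instructions fetched before the target is known. A one-line sketch:

```python
def branch_penalty(target_known_stage):
    """Wrong-path instructions fetched (and discarded) on a taken branch:
    one for every stage after Fetch that passes before the target address
    is known. Stages are numbered from 1 (Fetch)."""
    return target_known_stage - 1

print(branch_penalty(2))   # 1: two-stage pipeline, target known in Execute
print(branch_penalty(3))   # 2: four-stage pipeline, target known in Execute
print(branch_penalty(2))   # 1: target computed already in Decode
```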
Unconditional Branch Instructions
• Reducing the branch penalty requires computing the branch address earlier
• Hardware in the fetch and decode units
– Identify branch instructions
– Compute branch target address
(instead of doing it in the execute stage)
fetched and decoded
discarded
penalty reduced to one cycle
Instruction Queue and Prefetching
• Fetching instructions into a “queue”
• Dispatch unit (added to decode) to take instructions from queue
• Enlarging the “buffer” zone between fetch and decode
buffer for instruction
buffer for operands
buffer for result
buffer for multiple instructions
buffer for operands
buffer for result
"oldest" instruction in the queue--next to be dispatched
"newest" instruction in the queue--most recently fetched
[Queue activity, cycle by cycle: previous out, F1 in; F1 out, F2 in; F2 out, F3 in; F4 in; F5 in]
![Page 55: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/55.jpg)
[Queue contents during the stall: instructions 3 and 4, then 3, 4, and 5; the fetch unit keeps fetching despite the stall]
![Page 56: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/56.jpg)
calculates branch target concurrently
“branch folding”
discards F6 and fetches K
![Page 57: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/57.jpg)
completes an instruction each clock cycle
no branch penalty
![Page 58: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/58.jpg)
Instruction Queue
• Avoiding branch penalty requires that the queue contain other instructions for processing while branch target is calculated
• Queue can mitigate cache misses--if instruction not in cache, execution unit can continue as long as queue has instructions
![Page 59: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/59.jpg)
Instruction Queue
• So, it is important to keep queue full
• Helped by increasing rate at which instructions move from cache to queue
• Multiple word moves (in parallel)
[Cache to instruction queue: n words moved per clock cycle]
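The queue's behavior under a stall can be sketched with a toy model (illustrative; `fetch_per_cycle` plays the role of the n-word move, and `dispatch_pattern[i]` is False while the pipeline is stalled):

```python
from collections import deque

def run(fetch_per_cycle, dispatch_pattern, n_cycles):
    """Toy prefetch queue: the fetch unit appends up to fetch_per_cycle
    instructions each cycle; the dispatch unit removes one whenever the
    pipeline can accept it. Returns the queue depth after each cycle."""
    queue, depth = deque(), []
    next_instr = 1
    for cycle in range(n_cycles):
        for _ in range(fetch_per_cycle):          # cache -> queue
            queue.append(next_instr)
            next_instr += 1
        if dispatch_pattern[cycle] and queue:
            queue.popleft()                       # queue -> decode/dispatch
        depth.append(len(queue))
    return depth

# A 2-cycle stall (cycles 3-4) fills the queue; fetching continues throughout.
print(run(1, [True, True, False, False, True, True], 6))
# [0, 0, 1, 2, 2, 2]
```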
Conditional Branching
• Added hazard of dependency on previous instruction
• Can’t calculate target until a previous instruction completes execution
SUBW R1, A
BGTZ K
[The SUBW has not finished executing when the next fetch must be chosen: fetch K or fetch instruction 3? The branch condition is not yet known]
Conditional Branching
• Occur frequently, perhaps 20% of all instruction executions
• Would present serious problems for pipelining if not handled
• Several possibilities
Delayed Branching
• Location(s) following a branch instruction have been fetched and must be discarded
• These positions called “branch delay slots”
the penalty is two clock cycles in a four stage pipeline
two branch delay slots
if branch address calculated here
fetched and decoded
discarded
penalty reduced to one cycle
one branch delay slot
Delayed Branching
• Instructions in delay slots always fetched and partially executed
• So, place useful instructions in those positions and execute whether branch is taken or not
• If no such instructions, use NOOPs
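The rearrangement can be sketched as a tiny scheduling pass (mnemonics SHIFTL/DEC/BNZ are illustrative, and the independence test is deliberately crude: in this example the branch condition depends only on the counter in R2):

```python
def fill_delay_slot(loop_body, branch):
    """Delayed-branch scheduling sketch: move the last loop-body instruction
    that the branch does not depend on into the slot after the branch, so it
    executes 'for free' while the branch resolves. Falls back to a NOOP."""
    for i in range(len(loop_body) - 1, -1, -1):
        instr = loop_body[i]
        if "R2" not in instr:            # crude test: branch depends only on R2
            return loop_body[:i] + loop_body[i + 1:] + [branch, instr]
    return loop_body + [branch, "NOOP"]

body = ["SHIFTL R1", "DEC R2"]          # shift R1, decrement the counter in R2
print(fill_delay_slot(body, "BNZ LOOP"))
# ['DEC R2', 'BNZ LOOP', 'SHIFTL R1']
```

The shift now happens in the delay slot, while the counter is being tested, just as in the slides' rewritten loop.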
shift R1 N times
R2 contains N
fetched and discarded on every iteration
branch if not zero
shift R1 N times; R2 contains N
do the shifting while the counter is being decremented and tested
branch delayed for one instruction
appears to “branch” here
actually branches here
Branch instruction waits for an instruction cycle before actually branching—hence “delayed branch”
“delay slot”
If there is no useful instruction to put here, put NOOP
"delay slot": if branches are "delayed branches" then the next instruction will always be fetched and executed
next to last pass through the loop
last pass through the loop
will branch, so fetch the decrement instruction
won’t branch, so fetch the add instruction
Delayed Branching
• Compiler must recognize and rearrange instructions
• One branch delay slot can usually be filled--filling two is much more difficult
• If adding pipeline stages increases number of branch delay slots, benefit may be lost
Branch Prediction
• Hardware attempts to predict which path will be taken
• For example, assumes branch will not take place
• Does “speculative execution” of the instructions on that path
• Must not change any registers or memory until branch decision is known
fetch unit predicts branch will not be taken
results of compare known in cycle 4
if branch taken these instructions discarded
Static Branch Prediction
• Hardware attempts to predict which path will be taken (shouldn’t always assume the same result)
• Based on target address of branch—is it higher or lower than current address
• Software (compiler) can make better prediction and for example set a bit in the branch instruction
Dynamic Branch Prediction
• Processor keeps track of branch decisions
• Determines likelihood of future branches
• One bit for each branch instruction
– LT branch likely to be taken
– LNT branch likely not to be taken
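A sketch of the one-bit scheme on a loop branch (illustrative Python; `True` means the branch was taken):

```python
def one_bit_predictor(outcomes, initial="LNT"):
    """One-bit scheme from the slides: predict the branch's last outcome.
    Returns the number of mispredictions."""
    state = initial                       # "LT" = predict taken, "LNT" = not
    wrong = 0
    for taken in outcomes:
        if (state == "LT") != taken:
            wrong += 1
        state = "LT" if taken else "LNT"  # remember the latest outcome
    return wrong

# A 10-iteration loop branch: taken 9 times, then falls through.
# The first prediction (LNT) is wrong, and the final fall-through is wrong too.
print(one_bit_predictor([True] * 9 + [False]))   # 2
```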
Four states
ST: branch strongly likely to be taken
LT: branch likely to be taken
LNT: branch likely not to be taken
SNT: branch strongly likely not to be taken
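The four-state scheme can be sketched as a saturating counter (a common variant; the slides' exact state machine may differ in detail):

```python
# States ordered SNT < LNT < LT < ST; LT and ST predict "taken";
# each outcome moves the state one step toward what actually happened.
STATES = ["SNT", "LNT", "LT", "ST"]

def two_bit_predictor(outcomes, initial="LNT"):
    """Return the number of mispredictions for a 2-bit predictor."""
    idx = STATES.index(initial)
    wrong = 0
    for taken in outcomes:
        if (idx >= 2) != taken:                       # LT/ST predict taken
            wrong += 1
        idx = min(idx + 1, 3) if taken else max(idx - 1, 0)
    return wrong

# The same loop branch run twice (taken 9x, not taken once, repeated):
# only 3 mispredictions, since one fall-through no longer flips the prediction.
print(two_bit_predictor(([True] * 9 + [False]) * 2))   # 3
```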
Dynamic Branch Prediction
• If the initial prediction (LNT) is wrong on the first iteration, the state is changed (toward ST), so the prediction is correct until the last iteration
• Stays correct for subsequent executions of the loop (only changes if wrong twice in a row)
• Initial prediction can be guessed by the hardware, better if set by the compiler
Pipelining’s Effect on Instruction Sets
• Multiple addressing modes
– Facilitate use of data structures (indexing, offsets)
– Provide flexibility
– One instruction instead of many
• But can cause problems for the pipeline
![Page 82: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/82.jpg)
Structural Hazards
• Conflict over the use of a hardware resource--such as the register file
Example (index addressing with offset):
LOAD X(R1), R2 (written LOAD R2, X(R1) in MIPS)
Effective address: X + [R1], i.e., the offset X added to the address in R1
Load the word at that address from memory (cache) into R2
![Page 83: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/83.jpg)
Figure annotations:
• I2 takes an extra cycle for cache access as part of execution, after calculating the address (a register access)
• I2 is still writing to the register file, so I3 must wait for the register file
• The fetch of I5 is delayed
![Page 84: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/84.jpg)
LOAD (X(R1)), R2
Double indirect with offset: a two-clock pipeline stall while the address is calculated and the data fetched
![Page 85: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/85.jpg)
ADD #X, R1, R2
LOAD (R2), R2
LOAD (R2), R2
same seven clock cycles
three simpler instructions to do the same thing
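The "same seven clock cycles" claim can be checked with a little arithmetic. A minimal sketch, assuming for illustration a five-stage pipeline in which an n-instruction sequence with s stall cycles finishes in depth + (n - 1) + s cycles:

```python
# Completion time on a simple in-order pipeline: the pipeline fills
# (depth cycles for the first instruction), each further instruction
# adds one cycle, and every stall cycle adds one more.
def completion_cycles(depth, n_instructions, stall_cycles):
    return depth + (n_instructions - 1) + stall_cycles

# One complex LOAD (X(R1)), R2 with a two-cycle stall:
complex_load = completion_cycles(5, 1, 2)   # -> 7

# ADD #X, R1, R2 / LOAD (R2), R2 / LOAD (R2), R2 with no stalls:
three_simple = completion_cycles(5, 3, 0)   # -> 7
```

Under these assumptions the complex instruction buys nothing: its stalls cost exactly as much as the two extra simple instructions.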
![Page 86: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/86.jpg)
Condition Codes
• Conditional branch instructions depend on condition codes set by a previous instruction
• For example, COMPARE R3, R4 sets a bit in the PSW that is later tested by BRANCH if ZERO
![Page 87: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/87.jpg)
The branch decision must wait for completion of the Compare
It can't take place in the decode stage; it must wait for the execution stage
![Page 88: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/88.jpg)
The result of the Compare is in the PSW by the time the Branch instruction is decoded
This depends on the intervening Add instruction not affecting the condition codes
![Page 89: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/89.jpg)
Condition Codes and Pipelining
• The compiler must be able to reorder instructions
• Condition codes should be set by only a few instructions
• The compiler should be able to control which instructions set the condition codes
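The reordering requirement can be illustrated with a small legality check: an instruction may be moved into the slot between Compare and Branch only if it neither sets the condition codes nor writes a register the Compare reads. A hypothetical sketch (the tuple format and the `sets_cc` flag are made up for illustration):

```python
# An instruction is modeled as (name, dest, sources, sets_cc).
# It can fill the slot between COMPARE and BRANCH only if it does not
# set the condition codes (that would clobber the Compare result) and
# does not write a register that the Compare reads.
def can_fill_branch_slot(instr, compare_sources):
    name, dest, sources, sets_cc = instr
    return not sets_cc and dest not in compare_sources

add  = ("ADD",  "R7", ("R8", "R9"), False)   # flag-free add
adds = ("ADDS", "R7", ("R8", "R9"), True)    # flag-setting add

ok  = can_fill_branch_slot(add,  ("R3", "R4"))   # True
bad = can_fill_branch_slot(adds, ("R3", "R4"))   # False
```

This is why instruction sets that let the compiler choose which instructions set the condition codes (flag-free vs. flag-setting variants) make such scheduling much easier.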
![Page 90: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/90.jpg)
Datapath: registers, ALU, interconnecting bus
single internal bus
general registers
![Page 91: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/91.jpg)
with a single internal bus, one thing at a time over the bus
single internal bus
![Page 92: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/92.jpg)
1) PC out, MAR in, Read, Select 4, Add, Z in
2) Z out, PC in, Y in, wait for memory
3) MDR out, IR in
4) Offset field of IR out, Add, Z in
5) Z out, PC in, end
unconditional branch
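The five control steps above can be traced as plain register transfers. A minimal sketch of the unconditional-branch sequence (the memory contents and starting PC are made-up values):

```python
# Trace the single-bus control sequence for an unconditional branch.
# Registers: PC, MAR, MDR, IR, and the ALU temporaries Y and Z.
memory = {100: ("BRANCH", 40)}        # word at address 100: branch, offset 40
regs = {"PC": 100}

# 1) PC out, MAR in, Read, Select 4, Add, Z in
regs["MAR"] = regs["PC"]
regs["Z"] = regs["PC"] + 4            # PC + 4 computed while memory reads

# 2) Z out, PC in, Y in, wait for memory
regs["PC"] = regs["Z"]
regs["Y"] = regs["Z"]                 # updated PC saved for the offset add
regs["MDR"] = memory[regs["MAR"]]

# 3) MDR out, IR in
regs["IR"] = regs["MDR"]

# 4) Offset field of IR out, Add, Z in
regs["Z"] = regs["Y"] + regs["IR"][1]

# 5) Z out, PC in, end
regs["PC"] = regs["Z"]                # PC = 100 + 4 + 40 = 144
```

With one internal bus, each step moves only one value, which is why five steps are needed.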
![Page 93: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/93.jpg)
Three internal buses:
• three-port register file, so three transfers can happen at a time
• separate PC incrementer, also usable for address incrementing (multiple-word transfers)
![Page 94: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/94.jpg)
1) PC out, R=B, MAR in, Read, Inc PC
2) Wait for memory
3) MDR out B, R=B, IR in
4) R4 out A, R5 out B, select A, Add, R6 in, end
ADD R4, R5, R6
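The four-step sequence can be traced the same way; with three buses, step 4 reads R4 and R5 and writes R6 in a single step. A minimal sketch (register contents are made-up values):

```python
# Three-bus control sequence for ADD R4, R5, R6.
memory = {100: ("ADD", "R4", "R5", "R6")}
regs = {"PC": 100, "R4": 10, "R5": 20}

# 1) PC out, R=B, MAR in, Read, Inc PC   (separate PC incrementer)
regs["MAR"] = regs["PC"]
regs["PC"] += 4

# 2) Wait for memory
regs["MDR"] = memory[regs["MAR"]]

# 3) MDR out B, R=B, IR in
regs["IR"] = regs["MDR"]

# 4) R4 out A, R5 out B, select A, Add, R6 in, end:
#    two reads and one write in the same step, possible only
#    with a three-port register file and three buses
_, src1, src2, dst = regs["IR"]
regs[dst] = regs[src1] + regs[src2]   # R6 = 30
```

The same ADD takes more steps on the single-bus datapath, since each operand must cross the one bus separately.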
![Page 95: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/95.jpg)
![Page 96: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/96.jpg)
The three-internal-bus organization modified for pipelining:
two caches (one for instructions, one for data), with separate MARs, one for each cache
![Page 97: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/97.jpg)
PC connected directly to IMAR, can transfer concurrently with ALU operation
data address can come from register file or from ALU
![Page 98: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/98.jpg)
separate MDRs for read and write
buffer registers for ALU inputs and output
![Page 99: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/99.jpg)
buffering for control signals following Decode and Execute
instruction queue loaded directly from cache
![Page 100: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/100.jpg)
![Page 101: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/101.jpg)
Can perform simultaneously in any combination:
Reading an instruction from instruction cache
Incrementing the PC
Decoding an instruction
Reading from or writing into data cache
Reading contents of up to two registers from the register file
Writing into one register of the register file
Performing an ALU operation
![Page 102: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/102.jpg)
Superscalar Operation
• Pipelining fetches one instruction per cycle, completes one per cycle (if no hazards)
• Adding multiple processing units for each stage would allow more than one instruction to be fetched, and moved through the pipeline, during each cycle
![Page 103: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/103.jpg)
Superscalar Operation
• Starting more than one instruction in each clock cycle is called multiple issue
• Such processors are called superscalar
![Page 104: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/104.jpg)
a processor with two execution units
![Page 105: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/105.jpg)
instruction queue and multiple word moves enables fetching n instructions
![Page 106: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/106.jpg)
dispatch unit capable of decoding two instructions from queue
![Page 107: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/107.jpg)
so, if the top two instructions are:
ADDF R1, R2, R3
ADD R4, R5, R6
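Whether the two instructions at the head of the queue can be issued together depends on their needing different execution units. A minimal sketch of such a dispatch check (the opcode-to-unit assignment is an assumption for illustration):

```python
# Dual-issue dispatch: issue the top two queue entries together
# only if they go to different execution units.
UNIT = {"ADDF": "float", "MULF": "float", "ADD": "integer", "LOAD": "integer"}

def can_dual_issue(i1, i2):
    return UNIT[i1.split()[0]] != UNIT[i2.split()[0]]

pair_ok  = can_dual_issue("ADDF R1, R2, R3", "ADD R4, R5, R6")   # True
pair_bad = can_dual_issue("ADD R1, R2, R3",  "ADD R4, R5, R6")   # False
```

A real dispatch unit must also check register dependencies between the pair; this sketch covers only the structural check.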
![Page 108: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/108.jpg)
![Page 109: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/109.jpg)
Floating point execution unit is pipelined also
![Page 110: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/110.jpg)
instructions complete out of order
OK if no dependencies
problem if error (imprecise interrupt/exception)
![Page 111: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/111.jpg)
imprecise interrupt
error occurs here
later instructions have already completed!
![Page 112: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/112.jpg)
results written in program order
![Page 113: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/113.jpg)
error occurs here
later instructions discarded
results written in program order
precise interrupts
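Writing results in program order can be sketched with a small completion buffer: instructions may finish execution out of order, but results are committed oldest-first, and on a fault everything younger than the faulting instruction is discarded. A minimal illustration:

```python
# In-order commit from a completion buffer. Each entry, in program
# order, is (done, faulted, result). Commit from the head while
# entries are done; on a fault, discard the faulting entry and all
# younger ones, giving a precise interrupt.
def commit_in_order(entries):
    committed = []
    for done, faulted, result in entries:
        if not done:
            break                   # head not finished yet: stall commit
        if faulted:
            return committed, True  # later instructions are discarded
        committed.append(result)
    return committed, False

# Program order I1..I4; I3 faulted even though I4 already finished.
entries = [(True, False, "I1"), (True, False, "I2"),
           (True, True, None), (True, False, "I4")]
done, interrupted = commit_in_order(entries)   # (["I1", "I2"], True)
```

Even though I4 completed out of order, its result never reaches the register file, so the interrupt is precise.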
![Page 114: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/114.jpg)
temporary registers allow greater flexibility
![Page 115: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/115.jpg)
register renaming
ADD R4, R5, R6 would write to R6
Instead, it writes to another register "renamed" as R6, and subsequent instructions using that result read the renamed register
![Page 116: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/116.jpg)
[Figure: register renaming. The "architectural" registers R0 … Rn are mapped, through a changeable mapping, onto the physical registers 0 … n.]
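The changeable mapping can be sketched with a rename table and a free list: each write to an architectural register allocates a fresh physical register, so two writes to R6 no longer conflict. A minimal sketch:

```python
# Register renaming: architectural register names map to physical
# registers through a table; every new write allocates a fresh
# physical register from the free list.
class RenameTable:
    def __init__(self, n_arch, n_phys):
        self.table = {f"R{i}": i for i in range(n_arch)}
        self.free = list(range(n_arch, n_phys))

    def read(self, arch):
        return self.table[arch]      # current physical register for `arch`

    def write(self, arch):
        phys = self.free.pop(0)      # allocate a new physical register
        self.table[arch] = phys      # later readers of `arch` see it
        return phys

rt = RenameTable(n_arch=8, n_phys=16)
first = rt.write("R6")               # ADD R4, R5, R6 writes here
second = rt.write("R6")              # a later write to R6 gets another one
```

Instructions issued between the two writes still read the first physical register, while younger instructions read the second, which is exactly the flexibility the temporary registers provide.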
![Page 117: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/117.jpg)
Superscalar
• Statically scheduled
– Variable number of instructions issued each clock cycle
– Issued in-order (as ordered by the compiler)
– Much effort required by the compiler
• Dynamically scheduled
– Issued out-of-order (determined by hardware)
![Page 118: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/118.jpg)
Superscalar
• Two issue, dual issue--capable of issuing two instructions at a time (as in previous example)
• Four issue--four at a time
• Etc.
• Overhead grows with issue width
![Page 119: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/119.jpg)
Closing the Performance Gap
• At the "microarchitecture" level
• Interpreting the CISC x86 instruction set with hardware that translates it into "RISC-like" micro-operations
• Pipelining and superscalar execution of those micro-ops
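The translation step can be illustrated by expanding one CISC-style instruction into RISC-like micro-ops. A hypothetical sketch (the instruction form and micro-op names are made up; real x86 decoders are far more involved):

```python
# Expand a memory-operand add, ADD [mem], reg, into load/add/store
# micro-ops that a pipelined RISC-like core can execute.
def translate(instr):
    op, dst, src = instr
    if op == "ADD" and dst.startswith("["):
        addr = dst.strip("[]")
        return [("ULOAD", "tmp0", addr),        # tmp0 <- memory[addr]
                ("UADD", "tmp0", "tmp0", src),  # tmp0 <- tmp0 + src
                ("USTORE", addr, "tmp0")]       # memory[addr] <- tmp0
    return [("U" + op, dst, src)]               # register forms pass through

uops = translate(("ADD", "[count]", "EAX"))     # expands to three micro-ops
```

Each micro-op touches at most one memory location or ALU operation, so the pipeline and superscalar machinery built for RISC-like operations applies unchanged.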
![Page 120: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/120.jpg)
Early x86 processors:
C++ program → compiler → x86 machine language → microcode → hard-wired logic
![Page 121: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/121.jpg)
Current x86 processors:
C++ program → compiler → x86 machine language → micro-op translator → micro-ops → hard-wired logic
![Page 122: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/122.jpg)
Current x86 processors:
C++ program → compiler → x86 machine language → micro-op translator → micro-ops → hard-wired logic
Some instructions are expanded by microcode rather than the micro-op translator.
![Page 123: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/123.jpg)
x86: C++ program → compiler → x86 machine language → micro-op translator → micro-ops → hard-wired logic
MIPS: C++ program → compiler → MIPS machine language → hard-wired logic
![Page 124: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/124.jpg)
x86, C++: C++ program → compiler → x86 machine language → micro-op translator → micro-ops → hard-wired logic
x86, Java: Java program → compiler → Java byte code → JVM → x86 machine language → micro-op translator → micro-ops → hard-wired logic
![Page 125: Chapter 8 Pipelining. A strategy for employing parallelism to achieve better performance Taking the “assembly line” approach to fetching and executing](https://reader035.vdocuments.site/reader035/viewer/2022062720/56649f0d5503460f94c2083b/html5/thumbnails/125.jpg)
C++ program → compiler → x86 machine language → micro-op translator → micro-ops → hard-wired logic