Instruction Level Parallelism and Its Exploitation (Part 3) … · 2017-03-26 · multiple-issue...
TRANSCRIPT
INSTRUCTION LEVEL PARALLELISM AND ITS EXPLOITATION (PART 3)
Chapter 3
Appendix H
CPE731 - Dr. Iyad Jafar
OUTLINE
• Dynamic Scheduling, Multiple Issue and Speculation (3.8)
• Advanced Techniques for Instruction Delivery and Speculation (3.9)
• Multithreading (3.12)
DYNAMIC, MULTIPLE ISSUE AND SPECULATION
• Microarchitecture that is used in modern processors
• Issuing multiple instructions dynamically is complex due to dependencies!
• The key is assigning a reservation station and updating the pipeline control tables
• Approaches
  • Issue one instruction in each half of the cycle → suitable for two-issue
  • Build the logic necessary to handle two or more instructions at once
  • Hybrid!
• This “issue step” is one of the most fundamental bottlenecks
• Multiple completion/commit!
DYNAMIC, MULTIPLE ISSUE AND SPECULATION
• We will consider a simple implementation
  • Issue rate of 2 instructions per cycle
  • Extend Tomasulo to support a multiple-issue superscalar pipeline with integer, load/store and FP units that can initiate an operation every cycle
• Instructions are issued in order!
• The pipeline issues any combination of two instructions each cycle using scheduling hardware
• Issue and completion logic is enhanced to allow multiple instructions to issue and process each cycle
• All datapaths are widened to allow multiple issue
DYNAMIC, MULTIPLE ISSUE AND SPECULATION
• Example: Consider the execution of the following loop, which increments each element of an integer array, on a two-issue processor, once without speculation and once with speculation.
Loop: LD R2, 0(R1)
DADDIU R2, R2, #1
SD R2, 0(R1)
DADDIU R1, R1, #8
BNE R1, R3, Loop
Assume there are separate integer functional units for effective address calculation, for ALU operations and for branch condition evaluation.
Create a table for the first two iterations of this loop for both processors. Assume two instructions of any type can commit per cycle.
DYNAMIC, MULTIPLE ISSUE AND SPECULATION
• No Speculation
Iter.  Instruction         Issue cycle  Execute cycle  Memory cycle  Write cycle
1      LD R2, 0(R1)        1            2              3             4
1      DADDIU R2, R2, #1   1            5              -             6
1      SD R2, 0(R1)        2            3              7             -
1      DADDIU R1, R1, #8   2            3              -             4
1      BNE R1, R3, Loop    3            5              -             -
2      LD R2, 0(R1)        6            7              8             9
2      DADDIU R2, R2, #1   6            10             -             11
2      SD R2, 0(R1)        7            8              12            -
2      DADDIU R1, R1, #8   7            11             -             12
2      BNE R1, R3, Loop    8            13             -             -
DYNAMIC, MULTIPLE ISSUE AND SPECULATION
• With Speculation
Iter.  Instruction         Issue cycle  Execute cycle  Memory cycle  Write cycle  Commit cycle
1      LD R2, 0(R1)        1            2              3             4            5
1      DADDIU R2, R2, #1   1            5              -             6            7
1      SD R2, 0(R1)        2            3              -             -            7
1      DADDIU R1, R1, #8   2            3              -             4            8
1      BNE R1, R3, Loop    3            5              -             -            8
2      LD R2, 0(R1)        4            5              6             7            9
2      DADDIU R2, R2, #1   4            8              -             9            10
2      SD R2, 0(R1)        5            6              -             -            10
2      DADDIU R1, R1, #8   5            6              -             7            11
2      BNE R1, R3, Loop    6            8              -             -            11
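The effect of speculation on this loop can be read off the issue cycles in the two tables. The sketch below only re-encodes the issue cycles listed above; the dictionary and function names are illustrative and not part of the slides. It computes how many cycles separate the start of iteration 1 from the start of iteration 2 in each case.

```python
# Issue cycle of each instruction in the first two iterations, copied from the
# two tables (instruction order: LD, DADDIU, SD, DADDIU, BNE).
issue_no_spec = {1: [1, 1, 2, 2, 3], 2: [6, 6, 7, 7, 8]}
issue_spec = {1: [1, 1, 2, 2, 3], 2: [4, 4, 5, 5, 6]}


def iteration_gap(issue):
    """Cycles between the first issue of iteration 1 and of iteration 2."""
    return issue[2][0] - issue[1][0]


gap_no_spec = iteration_gap(issue_no_spec)
gap_spec = iteration_gap(issue_spec)
print(gap_no_spec, gap_spec)  # 5 3
```

With these values the second iteration starts issuing 5 cycles after the first without speculation and 3 cycles after it with speculation. This is consistent with the non-speculative processor holding the next LD until the BNE has executed in cycle 5.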
ADVANCED TECHNIQUES FOR INSTRUCTION DELIVERY AND SPECULATION
• Multiple-issue processors require a high-bandwidth instruction stream
  • Widen paths to instruction cache!
  • Branches are difficult!
• Increasing Instruction Fetch Bandwidth
  • Branch-Target Buffer
  • Return Address Predictors
  • Integrated Instruction Fetch Units
• Speculation: Implementation Issues and Extensions
  • Register Renaming versus Reorder Buffers
  • How Much to Speculate?
  • Value Prediction!
INCREASING INSTRUCTION FETCH BANDWIDTH
• Branch-Target Buffer
  • The branch penalty is reduced if we know that the as-yet-undecoded instruction is a branch and also know the branch target address
  • Zero branch penalty
  • Branch-target buffer (BTB)!
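The idea of looking up the fetch address before the instruction is decoded can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name, the direct-mapped organization, the 16 entries, the 4-byte instruction size, and the policy of keeping only taken branches in the table.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: maps a fetch PC to a predicted target PC."""

    def __init__(self, entries=16, inst_bytes=4):
        self.entries = entries
        self.inst_bytes = inst_bytes
        self.table = [None] * entries  # each slot holds (branch_pc, target_pc)

    def _index(self, pc):
        return (pc // self.inst_bytes) % self.entries

    def predict_next_pc(self, pc):
        """Return the predicted next fetch address for the instruction at pc."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]               # hit: treat as a taken branch
        return pc + self.inst_bytes      # miss: assume sequential fetch

    def update(self, pc, taken, target):
        """Called once the real branch outcome is known."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None         # branch fell through: drop the entry


btb = BranchTargetBuffer()
first = btb.predict_next_pc(0x40)        # branch not yet in the BTB
btb.update(0x40, taken=True, target=0x20)
second = btb.predict_next_pc(0x40)       # now predicted taken to 0x20
```

In this sketch the first lookup falls through to `0x44`, and after the update the same PC is redirected to `0x20` during fetch, which is where the zero-penalty case comes from when the prediction is correct.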
INCREASING INSTRUCTION FETCH BANDWIDTH
• Branch-Target Buffer
  • One possible variation
    • Allow the BTB to store the target instruction(s) instead of, or in addition to, the predicted target address
    • We skip the IF of the next instruction!
    • CPI for a branch (unconditional and sometimes conditional) is 0?
    • Branch folding!
  • Check example on p. 205
INCREASING INSTRUCTION FETCH BANDWIDTH
• Return Address Predictors
  • Indirect jumps
    • Switch/case statements, indirect procedure calls and procedure returns
    • Destination address varies at runtime
    • Hard to predict
  • For SPEC95
    • 15% of branches are procedure returns → focus on procedure returns
  • Use the BTB → low accuracy if the procedure is called from multiple sites
    • <60% accuracy in SPEC CPU95
  • Use a small buffer that stores return addresses as a stack → return address stack (RAS)
    • A small buffer that caches the most recent return addresses!
    • A call pushes the return address onto the stack
    • A return pops the return address off the stack
    • LIFO!
  • Used in Intel Core processors and AMD Phenom processors
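The push-on-call, pop-on-return behaviour can be sketched in a few lines. The class name, the depth of 8 entries, the 4-byte instruction size and the overwrite-oldest policy on overflow are assumptions for illustration, not details taken from the slides.

```python
class ReturnAddressStack:
    """Toy return address stack (RAS) with a fixed number of entries."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, inst_bytes=4):
        """A call pushes the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # full: discard the oldest entry
        self.stack.append(call_pc + inst_bytes)

    def on_return(self):
        """A return pops the most recent return address (None if empty)."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)                       # main calls f
ras.on_call(0x200)                       # f calls g
predicted = [ras.on_return(), ras.on_return()]  # g returns, then f returns
```

Because the structure is LIFO, the two returns are predicted as `0x204` and then `0x104`, regardless of how many different sites call each procedure. That is the case in which a plain BTB entry per return instruction does poorly.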
INCREASING INSTRUCTION FETCH BANDWIDTH
• Integrated Instruction Fetch Units
  • In a multiple-issue processor, IF is not as simple as in a single pipeline
  • Implement the instruction fetch unit as a separate autonomous unit that feeds instructions to the rest of the pipeline
  • The unit includes
    • Integrated branch prediction
    • Instruction prefetch
    • Instruction memory access and buffering
SPECULATION: IMPLEMENTATION ISSUES AND EXTENSIONS
• Explicit Register Renaming vs. Reorder Buffer
  • The values of architecturally visible registers are distributed between the actual registers, reservation stations and the ROB → complicates scheduling!
  • Register renaming
    • Decouple renaming from scheduling!
    • A single, large set of physical registers holds both architectural registers and temporary values
    • A physical register is allocated for every instruction that writes a register, with the aid of a HW renaming map
    • This allows data to be fetched from a single register file
    • No need to bypass values from the reorder buffer
    • Balancing pipeline
    • Still need the ROB to commit in order!
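A renaming map with a free list of physical registers might look like the sketch below. The class and method names, the initial one-to-one mapping and the register counts are illustrative assumptions, and the sketch does not model the stall that real hardware takes when the free list is empty.

```python
class RenameMap:
    """Toy register renamer: architectural names -> physical register numbers."""

    def __init__(self, arch_regs, num_physical):
        # Assumption: architectural register i starts in physical register i.
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_dest, renamed_srcs, old_dest)."""
        renamed_srcs = [self.map[s] for s in srcs]   # read sources first
        old_dest = self.map[dest]
        new_dest = self.free.pop(0)                  # fresh physical register
        self.map[dest] = new_dest
        return new_dest, renamed_srcs, old_dest

    def release(self, phys):
        """Return a physical register to the free list (e.g. at commit)."""
        self.free.append(phys)


rm = RenameMap(["R1", "R2", "R3"], num_physical=8)
ld = rm.rename("R2", ["R1"])     # LD     R2, 0(R1)
add = rm.rename("R2", ["R2"])    # DADDIU R2, R2, #1
```

Here the LD writes physical register 3 and the DADDIU reads 3 and writes 4, so both values of R2 exist in one register file at the same time. The `old_dest` value is what an in-order commit step would hand back to `release`.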
SPECULATION: IMPLEMENTATION ISSUES AND EXTENSIONS
• How much to speculate?
  • Speculation helps reduce stalls!
  • Cost? Time, area, energy and recovery from incorrect speculation!
  • Performance??
  • What if a speculative instruction results in an expensive exception (TLB or cache miss)?
    • Most speculative processors allow only low-cost exceptions to be handled in speculative mode!
    • Otherwise, wait until the instruction is no longer speculative before servicing the exception
  • This policy is efficient for programs with high exception frequencies coupled with poor branch prediction
  • It can degrade the performance of other programs
SPECULATION: IMPLEMENTATION ISSUES AND EXTENSIONS
• Speculating through Multiple Branches
  • So far, we have considered the case in which a single speculated branch is resolved before we need to speculate on another one!
  • We may need to speculate on multiple branches when there is
    • High branch frequency
    • Significant clustering of branches
    • Slow functional units
  • However, speculation through multiple branches complicates speculation recovery
    • As of 2011, no processor implemented speculation through multiple branches per cycle
SPECULATION: IMPLEMENTATION ISSUES AND EXTENSIONS
• Speculating and Energy Efficiency
  • It might be argued that speculation decreases power efficiency
    • Speculated instructions consume energy
    • Rolling back incorrect speculation consumes energy
  • However, if speculation lowers execution time by more than it increases average power, then the total energy could be less
    • i.e., speculation is capable of improving performance
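The energy argument follows from energy = average power × execution time. The short sketch below works two cases with made-up ratios. The numbers and the function name are hypothetical and chosen only to show both sides of the trade-off.

```python
def relative_energy(time_ratio, power_ratio):
    """Energy relative to the baseline, using energy = average power * time."""
    return time_ratio * power_ratio


# Hypothetical: speculation cuts execution time to 80% of the baseline
# while raising average power to 110% of the baseline.
saves = relative_energy(time_ratio=0.80, power_ratio=1.10)

# Hypothetical counter-case: only a 5% speedup for the same 10% power increase.
wastes = relative_energy(time_ratio=0.95, power_ratio=1.10)
```

In the first case the product is about 0.88, so total energy drops even though power rose. In the second it is about 1.045, so the speculative design uses more energy overall.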
SPECULATION: IMPLEMENTATION ISSUES AND EXTENSIONS
• Value Prediction
  • Attempt to predict the value that will be produced by an instruction
  • Limited success in general!
  • How about
    • A load that loads from a constant pool?
    • A load that loads a value that changes infrequently?
    • An instruction that produces a value chosen from a set of potential values?
  • Results so far are not sufficient to encourage incorporation in actual processors
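One simple way to picture value prediction for a load whose value changes infrequently is a last-value predictor, which guesses that an instruction will produce the same result as the previous time. This scheme is not named in the slides. It is used here only as an illustration, and the class name, table organization and example values are assumptions.

```python
class LastValuePredictor:
    """Toy value predictor: predicts that an instruction repeats its last result."""

    def __init__(self):
        self.last = {}       # instruction PC -> last value produced
        self.hits = 0
        self.lookups = 0

    def predict(self, pc):
        return self.last.get(pc)    # None means "no prediction available"

    def train(self, pc, actual):
        """Score the prediction for this execution, then record the real value."""
        self.lookups += 1
        if self.last.get(pc) == actual:
            self.hits += 1
        self.last[pc] = actual


lvp = LastValuePredictor()
# Hypothetical load at PC 0x80 whose value changes once in six executions.
for value in [7, 7, 7, 9, 9, 9]:
    lvp.train(0x80, value)
accuracy = lvp.hits / lvp.lookups
```

For this sequence the predictor is right on 4 of the 6 executions. It misses on the first execution and on the one where the value changes, and every miss on a real machine would need a recovery similar to a branch misprediction.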
• Address Alias Prediction
  • Predicts whether two stores, or a load and a store, refer to the same address, i.e. we don’t predict the address itself!
  • If the two references don’t refer to the same address, they can be interchanged. Otherwise, wait!
  • Simple and stable, and used in several processors
MULTITHREADING
• ILP is transparent and efficient; however,
  • It can be quite difficult to exploit for some applications
  • Off-chip cache misses are less likely to be hidden
• Can we use other levels of parallelism?
  • Online transaction systems have multiple concurrent queries
  • Scientific applications have natural parallelism
  • The OS runs multiple active applications
• Thread-level parallelism
  • Allows multiple threads to share the functional units of a single processor in an overlapping fashion
  • Most of the processor core is shared (cache, TLB, …)
  • Requires duplicating state elements (separate registers, page tables and PC for each thread)
  • HW should switch between threads quickly
  • OS should be optimized and aware!
MULTITHREADING
• Fine-grained Multithreading
  • Switch on every cycle in an interleaved round-robin fashion
  • Hides both short and long stalls
  • Improves throughput, but slows down the execution of a single thread (latency)
  • Sun Niagara processor and GPUs
• Coarse-grained Multithreading
  • Switch on costly stalls only
  • Less likely to slow down the execution of any one thread
  • Limited ability to overcome throughput losses!
  • Research community only!
• Simultaneous Multithreading (SMT)
  • Variation of fine-grained multithreading when implemented on a multiple-issue, dynamically scheduled processor
  • Issue multiple instructions from multiple threads every CPU cycle
  • Intel Hyper-Threading (HT) Technology
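The round-robin interleaving of fine-grained multithreading can be sketched as a per-cycle thread selector that skips threads that are currently stalled. The function name, the `stalled_until` representation and the example stall are assumptions for illustration. A real core selects among hardware thread contexts, not Python lists.

```python
def fine_grained_schedule(stalled_until, num_cycles):
    """Pick one thread per cycle, round-robin, skipping stalled threads.

    stalled_until[t] is the first cycle in which thread t is ready again
    (0 means ready from the start). Returns the thread chosen in each
    cycle, or None for an idle slot.
    """
    num_threads = len(stalled_until)
    schedule = []
    next_thread = 0
    for cycle in range(num_cycles):
        chosen = None
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if stalled_until[t] <= cycle:
                chosen = t
                next_thread = (t + 1) % num_threads
                break
        schedule.append(chosen)
    return schedule


# Hypothetical case: thread 1 is stalled (e.g. on a cache miss) until cycle 4.
slots = fine_grained_schedule(stalled_until=[0, 4, 0], num_cycles=6)
```

In this run threads 0 and 2 alternate while thread 1 is stalled, so no issue slot goes idle, and thread 1 rejoins the rotation in cycle 5. Each individual thread still gets at most every other cycle, which is the single-thread latency cost noted above.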
MULTITHREADING
[Figure: issue-slot usage over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]
MULTITHREADING
• Further investigation
  • Security!
  • Power!
  • Thread scheduler!
  • Super-threading!
• Read pages 226-232
  • Effectiveness of Fine-Grained Multithreading on the Sun T1
  • Effectiveness of Simultaneous Multithreading on Superscalar Processors
• Read Section 3.13 “Putting it All Together”