Instruction Level Parallelism and Its Exploitation (Part 3) … · 2017-03-26 · multiple-issue...

INSTRUCTION LEVEL PARALLELISM AND ITS EXPLOITATION (PART 3)

Chapter 3
Appendix H

CPE731 - Dr. Iyad Jafar

OUTLINE

• Dynamic Scheduling, Multiple Issue and Speculation (3.8)
• Advanced Techniques for Instruction Delivery and Speculation (3.9)
• Multithreading (3.12)

DYNAMIC, MULTIPLE ISSUE AND SPECULATION

• This combination is the microarchitecture used in modern processors
• Issuing multiple instructions dynamically is complex because of the dependences between them!
• The key is assigning a reservation station and updating the pipeline control tables
• Approaches
  - Issue one instruction in each half of the clock cycle (suitable for two-issue)
  - Build the logic necessary to handle two or more instructions at once
  - Hybrid!
• This “issue step” is one of the most fundamental bottlenecks
• Multiple completions/commits per cycle must also be supported!


• We will consider a simple implementation
  - Issue rate of 2 instructions per cycle
• Extend Tomasulo's algorithm to support a multiple-issue superscalar pipeline with integer, load/store and FP units that can each initiate an operation every cycle
• Instructions are issued in order!
• The pipeline issues any combination of two instructions each cycle using the scheduling hardware
• Issue and completion logic is enhanced to allow multiple instructions to be issued and processed each cycle
• All datapaths are widened to allow multiple issue

DYNAMIC, MULTIPLE ISSUE AND SPECULATION

[Figure: image not recovered]


• Example: Consider the execution of the following loop, which increments each element of an integer array, on a two-issue processor, once without speculation and once with speculation.

Loop: LD     R2, 0(R1)      ; load the array element
      DADDIU R2, R2, #1     ; increment it
      SD     R2, 0(R1)      ; store it back
      DADDIU R1, R1, #8     ; advance the pointer by 8 bytes
      BNE    R1, R3, Loop   ; repeat until the pointer reaches R3

Assume there are separate integer functional units for effective address calculation, for ALU operations and for branch condition evaluation.

Create a table for the first two iterations of this loop for both processors. Assume two instructions of any type can commit per cycle.


• No Speculation

Iter.  Instruction           Issue Cycle  Execute Cycle  Memory Cycle  Write Cycle
1      LD R2, 0(R1)          1            2              3             4
1      DADDIU R2, R2, #1     1            5              -             6
1      SD R2, 0(R1)          2            3              7             -
1      DADDIU R1, R1, #8     2            3              -             4
1      BNE R1, R3, Loop      3            5              -             -
2      LD R2, 0(R1)          6            7              8             9
2      DADDIU R2, R2, #1     6            10             -             11
2      SD R2, 0(R1)          7            8              12            -
2      DADDIU R1, R1, #8     7            11             -             12
2      BNE R1, R3, Loop      8            13             -             -


• With Speculation

Iter.  Instruction           Issue Cycle  Execute Cycle  Memory Cycle  Write Cycle  Commit Cycle
1      LD R2, 0(R1)          1            2              3             4            5
1      DADDIU R2, R2, #1     1            5              -             6            7
1      SD R2, 0(R1)          2            3              -             -            7
1      DADDIU R1, R1, #8     2            3              -             4            8
1      BNE R1, R3, Loop      3            5              -             -            8
2      LD R2, 0(R1)          4            5              6             7            9
2      DADDIU R2, R2, #1     4            8              -             9            10
2      SD R2, 0(R1)          5            6              -             -            10
2      DADDIU R1, R1, #8     5            6              -             7            11
2      BNE R1, R3, Loop      6            8              -             -            11
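The two tables can be compared numerically. The Python sketch below copies the issue cycle of the LD that starts each iteration from the tables above and derives the cycles per iteration and the sustained instructions per cycle (IPC). The variable and function names are illustrative, and the calculation assumes that the spacing seen between the first two iterations repeats for later iterations.

```python
# Issue cycle of the LD that starts each iteration, copied from the tables.
INSTRUCTIONS_PER_ITERATION = 5
ld_issue_no_speculation = [1, 6]
ld_issue_with_speculation = [1, 4]


def cycles_per_iteration(issue_cycles):
    """Spacing between the starts of two consecutive iterations."""
    return issue_cycles[1] - issue_cycles[0]


def ipc(issue_cycles):
    """Sustained instructions per cycle, assuming the spacing repeats."""
    return INSTRUCTIONS_PER_ITERATION / cycles_per_iteration(issue_cycles)


if __name__ == "__main__":
    for name, cycles in [("no speculation", ld_issue_no_speculation),
                         ("with speculation", ld_issue_with_speculation)]:
        print(f"{name}: {cycles_per_iteration(cycles)} cycles/iteration, "
              f"IPC = {ipc(cycles):.2f}")
```

Under that assumption, a new iteration starts every 5 cycles without speculation and every 3 cycles with speculation, so the sustained rate rises from 1.00 to about 1.67 instructions per cycle.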

ADVANCED TECHNIQUES FOR INSTRUCTION DELIVERY AND SPECULATION

• Multiple-issue processors require a high-bandwidth instruction stream
  - Widen paths to the instruction cache!
  - Branches are difficult!
• Increasing Instruction Fetch Bandwidth
  - Branch-Target Buffer
  - Return Address Predictors
  - Integrated Instruction Fetch Units
• Speculation: Implementation Issues and Extensions
  - Register Renaming versus Reorder Buffers
  - How Much to Speculate?
  - Value Prediction!

INCREASING INSTRUCTION FETCH BANDWIDTH

• Branch-Target Buffer
  - The branch penalty is reduced if we know that the not-yet-decoded instruction is a branch and also know what the next PC should be
  - With both known at fetch time, the branch penalty can be zero
  - A cache that stores the predicted next-instruction address for a branch is a branch-target buffer (BTB)!


INCREASING INSTRUCTION FETCH BANDWIDTH

[Figures: images not recovered]

• Branch-Target Buffer
  - One possible variation
    - Allow the BTB to store the target instruction(s) instead of, or in addition to, the predicted target address
    - We skip the IF of the next instruction!
    - CPI for a branch (unconditional and sometimes conditional) is 0?
    - Branch folding!
  - Check the example on p. 205
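The basic branch-target buffer described above can be sketched in a few lines of Python. This is only an illustration: the class and method names are hypothetical, an unbounded dictionary stands in for the small tagged cache used in hardware, and a 4-byte instruction size is assumed.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the PC of a branch predicted taken to its target PC."""

    def __init__(self):
        # Assumption: an unbounded dict instead of a fixed-size tagged cache.
        self.entries = {}

    def predict_next_pc(self, pc, instruction_size=4):
        # Hit: this PC is a known branch that is predicted taken.
        # Miss: fall through to the sequential instruction.
        return self.entries.get(pc, pc + instruction_size)

    def update(self, pc, taken, target=None):
        # Called once the branch outcome is known.
        if taken:
            self.entries[pc] = target      # remember (or correct) the target
        else:
            self.entries.pop(pc, None)     # not taken: drop the entry


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(hex(btb.predict_next_pc(0x100)))  # miss: sequential fetch
    btb.update(0x100, taken=True, target=0x80)
    print(hex(btb.predict_next_pc(0x100)))  # hit: predicted target
```

The lookup uses only the fetch address, which is what lets the prediction be made before the instruction is decoded.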

• Return Address Predictors
  - Indirect jumps
    - Switch/case statements, indirect procedure calls and procedure returns
    - The destination address varies at runtime
    - Hard to predict
  - For SPEC95
    - 15% of branches are procedure returns
    - Focus on procedure returns
  - Using the BTB gives low accuracy if the procedure is called from multiple sites
    - <60% accuracy in SPEC CPU95
  - Use a small buffer that stores return addresses as a stack (RAS)
    - A small buffer that caches the most recent return addresses!
    - A call pushes the return address onto the stack
    - A return pops the return address off the stack
    - LIFO!
  - Used in Intel Core processors and AMD Phenom processors
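The push-on-call, pop-on-return behavior can be sketched as follows. The names and the default depth of 8 entries are assumptions made for illustration, as is the choice to discard the oldest entry on overflow.

```python
class ReturnAddressStack:
    """Toy return address stack (RAS) with a fixed number of entries."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        # Assumption: when the buffer is full, the oldest entry is lost.
        if len(self.stack) == self.depth:
            self.stack.pop(0)
        self.stack.append(return_address)

    def on_return(self):
        # Predict the most recently pushed address (LIFO);
        # None means the stack is empty and no prediction is available.
        return self.stack.pop() if self.stack else None


if __name__ == "__main__":
    ras = ReturnAddressStack()
    ras.on_call(0x104)            # procedure called from site A
    print(hex(ras.on_return()))   # predicts the return to site A
    ras.on_call(0x204)            # same procedure called from site B
    print(hex(ras.on_return()))   # predicts the return to site B
```

Because the prediction follows the call that is actually in flight, a procedure called from several sites is still predicted correctly, which is the case where a single BTB entry per return instruction does poorly.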

• Return Address Predictors

[Figure: image not recovered]

• Integrated Instruction Fetch Units
  - In a multiple-issue processor, instruction fetch (IF) is no longer a simple single pipeline stage
  - Implement the instruction fetch unit as a separate autonomous unit that feeds instructions to the rest of the pipeline
  - The unit includes
    - Integrated branch prediction
    - Instruction prefetch
    - Instruction memory access and buffering

SPECULATION: IMPLEMENTATION ISSUES AND EXTENSIONS

• Explicit Register Renaming vs. Reorder Buffer
  - With a reorder buffer (ROB), the values of architecturally visible registers are distributed among the actual registers, the reservation stations and the ROB, which complicates scheduling!
  - Register renaming
    - Decouple renaming from scheduling!
    - A single, large set of physical registers holds both architectural registers and temporary values
    - A physical register is allocated for every instruction that writes a register, with the aid of a hardware renaming map
    - This allows data to be fetched from a single register file
    - No need to bypass values from the reorder buffer
    - Balances the pipeline
    - Still need the ROB to commit in order!
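A renaming map with a free list can be sketched as below. This is a minimal illustration, not a description of any particular processor: the class and method names are hypothetical, and it assumes the previous physical register of a destination is recycled when the renaming instruction commits.

```python
class RenameMap:
    """Toy explicit register renaming: architectural -> physical registers."""

    def __init__(self, num_arch_regs, num_phys_regs):
        # Initially architectural register i lives in physical register i.
        self.map = list(range(num_arch_regs))
        self.free = list(range(num_arch_regs, num_phys_regs))

    def rename(self, dest, sources):
        """Rename one instruction.

        Returns (phys_dest, phys_sources, old_phys_dest).
        """
        # Sources are looked up before the destination mapping changes.
        phys_sources = [self.map[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        phys_dest = self.free.pop(0)
        old_phys_dest = self.map[dest]
        self.map[dest] = phys_dest
        return phys_dest, phys_sources, old_phys_dest

    def release(self, phys_reg):
        # Assumption: called at commit to recycle the previous mapping.
        self.free.append(phys_reg)


if __name__ == "__main__":
    rm = RenameMap(num_arch_regs=4, num_phys_regs=8)
    print(rm.rename(dest=2, sources=[1]))  # like LD R2, 0(R1)
    print(rm.rename(dest=2, sources=[2]))  # like DADDIU R2, R2, #1
```

In the usage example the second instruction reads the physical register just allocated by the first, so the true dependence is preserved while each write to R2 gets its own storage.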

• How much to speculate?
  - Speculation helps reduce stalls!
  - Cost? Time, area, energy and recovery from incorrect speculation!
  - Performance??
  - What if a speculative instruction results in an expensive exception (TLB or cache miss)?
    - Most speculative processors allow only low-cost exceptions to be handled in speculative mode!
    - Otherwise, wait until the instruction is no longer speculative before serving the exception
  - This policy helps programs with a high frequency of such exceptions coupled with poor branch prediction
  - It may degrade the performance of other programs

• Speculating through Multiple Branches
  - So far, we have considered the case in which a branch is resolved before we need to speculate on another one!
  - We may need to speculate on multiple branches at once because of
    - High branch frequency
    - Significant clustering of branches
    - Slow functional units
  - However, speculation through multiple branches complicates speculation recovery
    - As of 2011, no processor had combined full speculation with resolving multiple branches per cycle

• Speculation and Energy Efficiency
  - It might be argued that speculation decreases power efficiency
    - Speculated instructions consume energy, even when their results are discarded
    - Rolling back incorrect speculation consumes additional energy
  - However, if speculation lowers execution time by more than it increases average power, then the total energy could be less
    - i.e., when speculation improves performance sufficiently
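The energy argument is simple arithmetic: energy is average power multiplied by execution time. The sketch below works it through in Python; the function names and the percentages in the usage example are made up for illustration and are not measurements.

```python
def energy(average_power_watts, execution_time_seconds):
    """Energy in joules = average power x execution time."""
    return average_power_watts * execution_time_seconds


def speculation_saves_energy(power_increase, time_reduction):
    """Both arguments are fractions, e.g. 0.10 means 10%.

    Relative energy with speculation is
    (1 + power_increase) * (1 - time_reduction); below 1 means a saving.
    """
    return (1 + power_increase) * (1 - time_reduction) < 1


if __name__ == "__main__":
    # Hypothetical numbers: +10% power, -20% time -> 1.10 * 0.80 = 0.88
    print(speculation_saves_energy(0.10, 0.20))
    # Hypothetical numbers: +30% power, -10% time -> 1.30 * 0.90 = 1.17
    print(speculation_saves_energy(0.30, 0.10))
```

In the first case total energy drops to 88% of the baseline even though average power went up; in the second it rises to 117%.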

• Value Prediction
  - Attempts to predict the value that will be produced by an instruction
  - Limited success in general!
  - How about
    - A load that loads from a constant pool?
    - A load that loads a value that changes infrequently?
    - An instruction that produces a value chosen from a small set of potential values?
  - Results so far are not sufficient to encourage actual incorporation in processors
• Address Alias Prediction
  - Predicts whether two stores, or a load and a store, refer to the same address, i.e. we don’t predict the address itself!
  - If the references don’t refer to the same address, they can be interchanged. Otherwise, wait!
  - Simple and stable, and used in several processors

MULTITHREADING

• ILP is transparent to the programmer and efficient; however
  - It can be quite difficult to exploit for some applications
  - Off-chip cache misses are unlikely to be hidden by the available ILP
• Can we use other levels of parallelism?
  - Online transaction systems have multiple concurrent queries
  - Scientific applications have natural parallelism
  - The OS runs multiple active applications
• Multithreading exploits thread-level parallelism
  - Allows multiple threads to share the functional units of a single processor in an overlapping fashion
  - Most of the processor core is shared (cache, TLB …)
  - Requires duplicating per-thread state (separate registers, page table and PC for each thread)
  - HW should switch between threads quickly
  - The OS should be optimized and aware!

MULTITHREADING

• Fine-grained Multithreading
  - Switch threads on every cycle in an interleaved, round-robin fashion
  - Hides both short and long stalls
  - Improves throughput, but slows down the execution of a single thread (latency)
  - Sun Niagara processor and GPUs
• Coarse-grained Multithreading
  - Switch on costly stalls only
  - Less likely to slow down the execution of any thread
  - Limited ability to overcome throughput losses, especially from shorter stalls!
  - Research community only!
• Simultaneous Multithreading (SMT)
  - Variation of fine-grained multithreading implemented on a multiple-issue, dynamically scheduled processor
  - Issues multiple instructions from multiple threads every CPU cycle
  - Intel Hyper-Threading (HT) Technology
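Fine-grained thread selection can be sketched as a round-robin scan that skips stalled threads. The function name and the `ready` table format below are assumptions made for illustration; the sketch models a single issue slot per cycle and is not a model of any specific processor.

```python
def fine_grained_schedule(ready, num_cycles):
    """Pick one thread per cycle in round-robin order, skipping stalled ones.

    ready[t][c] is True when thread t can issue in cycle c.
    Returns the chosen thread index per cycle, or None for an idle slot.
    """
    num_threads = len(ready)
    schedule = []
    last = -1  # thread chosen most recently
    for cycle in range(num_cycles):
        chosen = None
        # Start with the thread after the last one chosen and wrap around.
        for offset in range(1, num_threads + 1):
            candidate = (last + offset) % num_threads
            if ready[candidate][cycle]:
                chosen = candidate
                break
        if chosen is not None:
            last = chosen
        schedule.append(chosen)
    return schedule


if __name__ == "__main__":
    # Thread 1 is stalled in cycles 1 and 2 (e.g. waiting on a cache miss).
    ready = [
        [True, True, True, True],
        [True, False, False, True],
    ]
    print(fine_grained_schedule(ready, 4))
```

In the usage example thread 0 fills the slots while thread 1 is stalled, so no cycle is idle; this is how interleaving hides stalls at the cost of delaying any individual thread when all threads are ready.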

MULTITHREADING

[Figure: use of issue slots over time (processor cycles) for five approaches: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]

MULTITHREADING

• Further investigation
  - Security!
  - Power!
  - Thread Scheduler!
  - Super threading!
• Read pages 226-232
  - Effectiveness of Fine-Grained Multithreading on the Sun T1
  - Effectiveness of Simultaneous Multithreading on Superscalar Processors
• Read Section 3.13 “Putting it All Together”
