lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 four steps of speculative tomasulo algorithm...

RHK.F95 1

Lecture 6

• Software pipelining• Speculative Execution• Introduction to the memory hierarchy

design

2

Administration

• Project Assignments– 張弘義 Evaluate MMX instrution set for MPEG

– 徐紹文 Literature Survey: Power Aware Design for Reconfigurable Computing System

– 蔣宗哲 Evaluating Write-policy for Spec2000

– 黃志源 Flash memory

– 何丞世 &陳銘堂 Low power high level synthesis example by implementing DLX with FPGA

• proposal presentation on 11/1• Midterm on 11/8

3

Review: Summary• Tamasulo – out-of-order exectuion• Branch Prediction

– Branch History Table: 2 bits for loop accuracy– Correlation: Recently executed branches correlated with

next branch– Branch Target Buffer: include branch address &

prediction

• SuperScalar and VLIW– CPI < 1– Dynamic issue vs. Static issue– More instructions issue at same time, larger the penalty

of hazards

4

Software Pipelining

• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations

• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (- Tomasulo in SW)

Iteration 0 Iteration

1 Iteration 2 Iteration

3 Iteration 4

Software- pipelined iteration

5

SW Pipelining Example

Before: Unrolled 3 times1 LD F0,0(R1)2 ADDD F4,F0,F23 SD 0(R1),F44 LD F6,-8(R1)5 ADDD F8,F6,F26 SD -8(R1),F87 LD F10,-16(R1)8 ADDD F12,F10,F29 SD -16(R1),F1210 SUBI R1,R1,#2411 BNEZ R1,LOOP

After: Software PipelinedLD F0,0(R1)ADDD F4,F0,F2LD F0,-8(R1)

1 SD 0(R1),F4; Stores M[i]2 ADDD F4,F0,F2; Adds to M[i-1]3 LD F0,-16(R1); loads M[i-2]4 SUBI R1,R1,#85 BNEZ R1,LOOP

SD 0(R1),F4ADDD F4,F0,F2SD -8(R1),F4

IF ID EX Mem WBIF ID EX Mem WB

IF ID EX Mem WB

SDADDDLD

Read F4Write F4

Read F0

Write F0

6

SW Pipelining Example

Symbolic Loop Unrolling– Less code space– Overhead paid only once

vs. each iteration in loop unrolling

Software Pipelining

Loop Unrolling

100 iterations = 25 loops with 4 unrolled iterations each

7

Trace Scheduling

• Parallelism across IF branches vs. LOOP branches• Two steps:

– Trace Selection» Find likely sequence of basic blocks (trace) of (statically

predicted) long sequence of straight-line code– Trace Compaction

» Squeeze trace into few VLIW instructions» Need bookkeeping code in case prediction is wrong

8

A[i] = A[i] + B[i]

A[i]=0?

x

F

B[i]=

C[i]=

9

Fix up instructions in case we are wrong

10

HW support for More ILP

• Speculation: allow an instruction to issue that is dependent on branch predicted to be taken withoutany consequences (including exceptions) if branch is not actually taken (“HW undo”)

• Often try to combine with dynamic scheduling• Tomasulo: separate speculative bypassing of

results from real bypassing of results– When instruction no longer speculative, write results

(instruction commit)– execute out-of-order but commit in order

11

Hardware Speculation

Execution completeExecution

2 3 4 51

•Instruction 1 & 2 is allowed to change the machine state•Instruction 4 & 5 should not change the machine state

12


• Tomasulo without speculation– Issue– Execution– Write Result

» write results to the CDB & update the register file or memory

• Tomasulo with speculation– Issue– Execution– Write Result

» write results to the CDB & store results in a HW buffer (Reorder Buffer)

– Commit» Update register file or memory

13

HW support for More ILP

• Need HW buffer for results of uncommitted instructions: reorder buffer

::::::MULTD F0, F2, F4DIVD F10,F0,F6::::

Busy Op Vj Vk Qj Qk Dest

ReorderBuffer

FP Regs

FPOp

Queue

FP Adder FP Adder

Res Stations Res Stations

Figure 4.34, page 311

Mult1 Yes mult [F2] [F4] #1Mult2 Yes div [F6] #1

Reservation Station

Reorder BufferBusy Instruction State Destination Value

#1 yes multd f0,f2,f4 Write Result F0 x#2 yes divd f10,f0,f6 Execute F10

14

Four Steps of Speculative Tomasulo Algorithm

1.Issue—get instruction from FP Op QueueIf reservation station or reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination.

2.Execution—operate on operands (EX)When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute

3.Write result—finish execution (WB)Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4.Commit—update register with reorder resultWhen instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instrfrom reorder buffer.

15

Commit Stage

1 2 4 53

2 4 53

4 53

6

Instruction 1 commit

4 5


Wait for instruction 3 to complete


5Instruction 4 commit

6 7

6 7 8

6 7 8 9

16

Speculative Execution

• How to handle mispredicted branch?

• How to handle precise exception?– Handle the exception when a instruction reaches the

head of the reorder buffer

4 53 6 7 Inst 3 is a mispredicted branch

flush the reorder buffer

11 Fetch from the right path

17

How to Measure Available ILP?

Initial HW Model here; MIPS compilers1. Register renaming–infinite virtual registers and all WAW & WAR hazards are avoided2. Branch prediction–perfect; no mispredictions 3. Jump prediction–all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available4. Memory-address alias analysis–addresses are known & a store can be moved before a load provided addresses not equal

1 cycle latency for all instructions

18

Upper Limit to ILP

Programs

0

20

40

60

80

100

120

140

160

gcc espresso li fpppp doducd tomcatv

54.862.6

17.9

75.2

118.7

150.1

19

Program

0

10

20

30

40

50

60


35

41

16

6158

60

9

1210

48

15

6 7 6

46

13

45

6 6 7

45

14

45

2 2 2

29

4

19

46

Perfect Selective predictor Standard 2-bit Static None

More Realistic HW: Branch ImpactChange from Infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle

ProfileBHT (512)Pick Cor. or BHTPerfect

20

Program

0

10

20

30

40

50

60


11

15

12

29

54

10

15

12

49

16

1013

12

35

15

44

9 10 11

20

11

28

5 5 6 5 57

4 45

45 5

59

45

Infinite 256 128 64 32 None

More Realistic HW: Register Impact

Change 2000 instrwindow, 64 instrissue, 8K 2 level Prediction

64 None256Infinite 32128

21

Program

0

5

10

15

20

25

30

35

40

45

50


10

15

12

49

16

45

7 79

49

16

4 5 4 46 5

35

3 3 4 4

45

Perfect Global/stack Perfect Inspection None

More Realistic HW: Alias Impact

Change 2000 instrwindow, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

NoneGlobal/Stack perf;heap conflicts

Perfect Inspec.

22

Program

0

10

20

30

40

50

60

gcc expresso li fpppp doducd tomcatv

10

15

12

52

17

56

10

15

12

47

16

10

1311

35

15

34

910 11

22

12

8 8 9

14

9

14

6 6 68

79

4 4 4 5 46

3 2 3 3 3 3

45

22

Infinite 256 128 64 32 16 8 4

Realistic HW : Window Impact

Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

64 16256Infinite 32128 8 4

23

• 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe)vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)

Benchmark

0

100

200

300

400

500

600

700

800

900

espr

esso

li

eqnt

ott

com

pres

s sc gcc

spic

e

dodu

c

mdl

jdp2

wav

e5

tom

catv or

a

alvi

nn ear

mdl

jsp2

swm

256

su2c

or

hydr

o2d

nasa

fppp

p

24

Summary of Exploiting Instruction Level Parallelism

• Why exploiting ILP is important?• Techniques to exploit ILP

– Instruction scheduling» Static method : loop unrolling, software pipelining» Dynamic method: scoreboard and tomasulo

– Branch Prediction» How to handle mispredicted branch?

• Achieving CPI <1 – Superscalar processor– VLIW

25

Importance of Exploiting ILP• CPU utilization is low because of hazards

– Structural hazard– Data hazard– Control hazard

• CPU stalls if hazards can not be removed• Exploiting ILP to reduce the number of hazards

00.5

11.5

22.5

33.5

44.5

eqnt

ott

espr

esso gc

c li

dodu

c

nasa

7 ora

spic

e2g6

su2c

or

tom

catv

Base Load stalls Branch stalls FP result stalls FP structuralstalls

26

Static instruction scheduling: loop unrolling

1 Loop: LD F0,0(R1)2 LD F6,-8(R1)3 LD F10,-16(R1)4 LD F14,-24(R1)5 ADDD F4,F0,F26 ADDD F8,F6,F27 ADDD F12,F10,F28 ADDD F16,F14,F29 SD 0(R1),F410 SD -8(R1),F811 SD -16(R1),F1212 SUBI R1,R1,#3213 BNEZ R1,LOOP14 SD 8(R1),F16 ; 8-32 = -24

1 Loop: LD F0,0(R1)2 stall3 ADDD F4,F0,F24 SUBI R1,R1,85 BNEZ R1,Loop

;delayed branch6 SD 8(R1),F4

27

Software Pipelining

• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations

• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (- Tomasulo in SW)

Iteration 0 Iteration

1 Iteration 2 Iteration

3 Iteration 4

Software- pipelined iteration

28

HW instruction scheduling: out-of-order execution

• Key idea: Allow instructions behind stall to proceed

DIVD F0,F2,F4ADDD F10,F0,F8SUBD F8,F8,F14

– In-order-execution» SUBD is not issued until the dependence is cleared

– Dynamic scheduling enables out-of-order execution => out-of-order completion

29

Tomasulo Organization

FP addersFP adders

Add1Add2Add3

FP multipliersFP multipliers

Mult1Mult2

From Mem FP Registers

Reservation Stations

Common Data Bus (CDB)

To Mem

FP OpQueue

Load Buffers

Store Buffers

Load1Load2Load3Load4Load5Load6

30

Tamasulo

• Prevents Register as bottleneck• Avoids WAR, WAW hazards of Scoreboard• Allows loop unrolling in HW• Not limited to basic blocks

– branch prediction

• Lasting Contributions– Dynamic scheduling– Register renaming– Load/store disambiguation

31

Branch Prediction

• Static approach– Predicted taken, predicted untaken, branch delay slot,

profiled-based prediction

• Dynamic approach– Branch History Table

» 1 bit vs. 2 bits– Correlation Branch Prediction– Branch target buffer

32

• Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 4.13, p. 264)

Dynamic Branch Prediction

T

T

NT

Predict Taken

Predict Not Taken

Predict Taken

Predict Not TakenT

NT

T

NT

NT

33

Correlating Branches

Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)

– Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction

• (2,2) predictor: 2-bit global, 2-bit local

Branch address (4 bits)

2-bits per branch local predictors

PredictionPrediction

2-bit global branch history

(01 = not taken then taken)

34

Need Address at Same Time as Prediction

• Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)

– Note: must check for branch match now, since can’t use wrong branch address (Figure 3.19, p. 262)

Branch PC Predicted PC

=?

PC of instructionFETCH

Extra prediction state

bitsYes: instruction is branch and use predicted PC as next PC

No: branch not predicted, proceed normally

(Next PC = PC+4)

35


• Tomasulo without speculation– Issue– Execution– Write Result

» write results to the CDB & update the register file or memory

– Out-of-order completion» Cause problems for mispredicted branch & maintaining

precise exception

• Tomasulo with speculation– Issue– Execution– Write Result

» write results to the CDB & store results in a HW buffer (Reorder Buffer)

– Commit» Update register file or memory (in-order commit)

36

Getting CPI < 1 Multiple Instructions/Cycle

• Two variations:– Superscalar: varying no. instructions/cycle (1 to 8),

scheduled by compiler or by HW (Tomasulo)» IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100

– Very Long Instruction Words (VLIW): fixed number of instructions (16) scheduled by the compiler

» Joint HP/Intel agreement in 1998?

37

Loop Unrolling in SuperScalar

Integer instruction FP instruction Clock cycleLoop: LD F0,0(R1) 1

LD F6,-8(R1) 2LD F10,-16(R1) ADDD F4,F0,F2 3LD F14,-24(R1) ADDD F8,F6,F2 4LD F18,-32(R1) ADDD F12,F10,F2 5SD 0(R1),F4 ADDD F16,F14,F2 6SD -8(R1),F8 ADDD F20,F18,F2 7SD -16(R1),F12 8SD -24(R1),F16 9SUBI R1,R1,#40 10BNEZ R1,LOOP 11SD -32(R1),F20 12

Unrolled 5 times to avoid delays (+1 due to SS)12 clocks, or 2.4 clocks per iteration

38

Loop Unrolling in VLIW

Memory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branchLD F0,0(R1) LD F6,-8(R1) 1LD F10,-16(R1) LD F14,-24(R1) 2LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 3LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4

ADDD F20,F18,F2 ADDD F24,F22,F2 5SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6SD -16(R1),F12 SD -24(R1),F16 7SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 8SD -0(R1),F28 BNEZ R1,LOOP 9

Unrolled 7 times to avoid delays7 results in 9 clocks, or 1.3 clocks per iterationNeed more registers in VLIW

39

Limits to Multi-Issue Machines

• Inherent limitations of ILP– 1 branch in 5: How to keep a 5-way VLIW busy?– Latencies of units: many operations must be scheduled– Need about Pipeline Depth x No. Functional Units of independent

operations to keep machines busy

• Difficulties in building HW– Duplicate FUs to get parallel execution– Increase ports to Register File – Increase ports to memory– Decoding SS and impact on clock rate, pipeline depth

40

Limits to Multi-Issue Machines

• Limitations specific to either SS or VLIW implementation

– Decode issue in SS– VLIW code size: unroll loops + wasted fields in VLIW– VLIW lock step => 1 hazard & all instructions stall– VLIW & binary compatibility is practical weakness

41

Chapter 5: Memory Hierarchy

42

Recap: Who Cares About the Memory Hierarchy?

µProc60%/yr.(2X/1.5yr)

DRAM9%/yr.(2X/10 yrs)

1

10

100

1000

1980

1981

1983

1984

1985

1986

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

DRAM

CPU19

82

Processor-MemoryPerformance Gap:(grows 50% / year)

Perf

orman

ce

Time

Processor-DRAM Memory Gap (latency)

43

Levels of the Memory Hierarchy

CPU Registers100s Bytes<1s ns

Cache10s-100s K Bytes1-10 ns$10/ MByte

Main MemoryM Bytes100ns- 300ns$1/ MByte

Disk10s G Bytes, 10 ms (10,000,000 ns)$0.0031/ MByte

CapacityAccess TimeCost

Tapeinfinitesec-min$0.0014/ MByte

Registers

Cache

Memory

Disk

Tape

Instr. Operands

Blocks

Pages

Files

StagingXfer Unit

prog./compiler1-8 bytes

cache cntl8-128 bytes

OS512-4K bytes

user/operatorMbytes

Upper Level

Lower Level

faster

Larger

44

The Principle of Locality

• The Principle of Locality:– Program access a relatively small portion of the address space at any

instant of time.• Two Different Types of Locality:

– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)

– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)

• Last 15 years, HW (hardware) relied on locality for speed

45

Memory Hierarchy: Terminology• Hit: data appears in some block in the upper level (example:

Block X) – Hit Rate: the fraction of memory access found in the upper level– Hit Time: Time to access the upper level which consists of

RAM access time + Time to determine hit/miss• Miss: data needs to be retrieve from a block in the lower level

(Block Y)– Miss Rate = 1 - (Hit Rate)– Miss Penalty: Time to replace a block in the upper level +

Time to deliver the block the processor• Hit Time << Miss Penalty (500 instructions on 21264!)

Lower LevelMemoryUpper Level

MemoryTo Processor

From ProcessorBlk X

Blk Y

46

Cache Measures

• Hit rate: fraction found in that level– So high that usually talk about Miss rate– Miss rate fallacy: as MIPS to CPU performance,

miss rate to average memory access time in memory • Average memory-access time

= Hit time + Miss rate x Miss penalty (ns or clocks)

• Miss penalty: time to replace a block from lower level, including time to replace in CPU

– access time: time to lower level = f(latency to lower level)

– transfer time: time to transfer block =f(BW between upper & lower levels)

47

Simplest Cache: Direct MappedMemory

4 Byte Direct Mapped Cache

Memory Address0123456789ABCDEF

Cache Index0123

• Location 0 can be occupied by data from:

– Memory location 0, 4, 8, ... etc.– In general: any memory location

whose 2 LSBs of the address are 0s– Address<1:0> => cache index

• Which one should we place in the cache?

• How can we tell which one is in the cache?

48

1 KB Direct Mapped Cache, 32B blocks• For a 2 ** N byte cache:

– The uppermost (32 - N) bits are always the Cache Tag– The lowest M bits are the Byte Select (Block Size = 2 ** M)

Cache Index

0123

:

Cache DataByte 0

0431

:

Cache Tag Example: 0x50Ex: 0x01

0x50

Stored as partof the cache “state”

Valid Bit

:31

Byte 1Byte 31 :

Byte 32Byte 33Byte 63 :Byte 992Byte 1023 :

Cache Tag

Byte SelectEx: 0x00

9

49

Two-way Set Associative Cache• N-way set associative: N entries for each Cache Index

– N direct mapped caches operates in parallel (N typically 2 to 4)

• Example: Two-way set associative cache– Cache Index selects a “set” from the cache– The two tags in the set are compared in parallel– Data is selected based on the tag result

Cache DataCache Block 0

Cache TagValid

:: :


Cache Tag Valid

: ::

Cache Index

Mux 01Sel1 Sel0

Cache Block

CompareAdr Tag

Compare

OR

Hit

50

Disadvantage of Set Associative Cache• N-way Set Associative Cache v. Direct Mapped Cache:

– N comparators vs. 1– Extra MUX delay for the data– Data comes AFTER Hit/Miss

• In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:

– Possible to assume a hit and continue. Recover later if miss.


Cache Tag Valid

: ::


Cache TagValid

:: :

Cache Index

Mux 01Sel1 Sel0

Cache Block

CompareAdr Tag

Compare

OR

Hit

51

4 Questions for Memory Hierarchy

• Q1: Where can a block be placed in the upper level? (Block placement)

• Q2: How is a block found if it is in the upper level?(Block identification)

• Q3: Which block should be replaced on a miss? (Block replacement)

• Q4: What happens on a write? (Write strategy)

52

Q1: Where can a block be placed in the upper level?

• Block 12 placed in 8 block cache:– Fully associative, direct mapped, 2-way set associative– S.A. Mapping = Block Number Modulo Number Sets

Cache

01234567 0123456701234567

Memory

111111111122222222223301234567890123456789012345678901

Full Mapped Direct Mapped(12 mod 8) = 4

2-Way Assoc(12 mod 4) = 0

53

Q2: How is a block found if it is in the upper level?

• Tag on each block– No need to check index or block offset

• Increasing associativity shrinks index, expands tag

BlockOffset

Block Address

IndexTag

54

Q3: Which block should be replaced on a miss?

• Easy for Direct Mapped• Set Associative or Fully Associative:

– Random– LRU (Least Recently Used)

Assoc: 2-way 4-way 8-waySize LRU Ran LRU Ran LRU Ran16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0%64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5%256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%

55

Q4: What happens on a write?

• Write through—The information is written to both the block in the cache and to the block in the lower-level memory.

• Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

– is block clean or dirty?• Pros and Cons of each?

– WT: read misses cannot result in writes– WB: no repeated writes to same location

• WT always combined with write buffers so that don’t wait for lower level memory

56

Write Buffer for Write Through

• A Write Buffer is needed between the Cache and Memory

– Processor: writes data into the cache and the write buffer– Memory controller: write contents of the buffer to memory

• Write buffer is just a FIFO:– Typical number of entries: 4– Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle

• Memory system designer’s nightmare:– Store frequency (w.r.t. time) -> 1 / DRAM write cycle– Write buffer saturation

ProcessorCache

Write Buffer

DRAM

57

A Modern Memory Hierarchy• By taking advantage of the principle of locality:

– Present the user with as much memory as is available in the cheapest technology.

– Provide access at the speed offered by the fastest technology.

Control

Datapath

SecondaryStorage(Disk)

Processor

Registers

MainMemory(DRAM)

SecondLevelCache

(SRAM)

On-C

hipC

ache1s 10,000,000s

(10s ms)Speed (ns): 10s 100s

100sGs

Size (bytes):Ks Ms

TertiaryStorage

(Disk/Tape)

10,000,000,000s (10s sec)

Ts

lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 four steps of speculative tomasulo algorithm...

Documents