lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 four steps of speculative tomasulo algorithm...

57
RHK.F95 1 Lecture 6 Software pipelining Speculative Execution Introduction to the memory hierarchy design

Upload: others

Post on 23-May-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

RHK.F95 1

Lecture 6

• Software pipelining• Speculative Execution• Introduction to the memory hierarchy

design

Page 2: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

2

Administration

• Project Assignments– 張弘義 Evaluate MMX instrution set for MPEG

– 徐紹文 Literature Survey: Power Aware Design for Reconfigurable Computing System

– 蔣宗哲 Evaluating Write-policy for Spec2000

– 黃志源 Flash memory

– 何丞世 &陳銘堂 Low power high level synthesis example by implementing DLX with FPGA

• proposal presentation on 11/1• Midterm on 11/8

Page 3: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

3

Review: Summary• Tamasulo – out-of-order exectuion• Branch Prediction

– Branch History Table: 2 bits for loop accuracy– Correlation: Recently executed branches correlated with

next branch– Branch Target Buffer: include branch address &

prediction

• SuperScalar and VLIW– CPI < 1– Dynamic issue vs. Static issue– More instructions issue at same time, larger the penalty

of hazards

Page 4: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

4

Software Pipelining

• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations

• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (- Tomasulo in SW)

Iteration 0 Iteration

1 Iteration 2 Iteration

3 Iteration 4

Software- pipelined iteration

Page 5: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

5

SW Pipelining Example

Before: Unrolled 3 times1 LD F0,0(R1)2 ADDD F4,F0,F23 SD 0(R1),F44 LD F6,-8(R1)5 ADDD F8,F6,F26 SD -8(R1),F87 LD F10,-16(R1)8 ADDD F12,F10,F29 SD -16(R1),F1210 SUBI R1,R1,#2411 BNEZ R1,LOOP

After: Software PipelinedLD F0,0(R1)ADDD F4,F0,F2LD F0,-8(R1)

1 SD 0(R1),F4; Stores M[i]2 ADDD F4,F0,F2; Adds to M[i-1]3 LD F0,-16(R1); loads M[i-2]4 SUBI R1,R1,#85 BNEZ R1,LOOP

SD 0(R1),F4ADDD F4,F0,F2SD -8(R1),F4

IF ID EX Mem WBIF ID EX Mem WB

IF ID EX Mem WB

SDADDDLD

Read F4Write F4

Read F0

Write F0

Page 6: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

6

SW Pipelining Example

Symbolic Loop Unrolling– Less code space– Overhead paid only once

vs. each iteration in loop unrolling

Software Pipelining

Loop Unrolling

100 iterations = 25 loops with 4 unrolled iterations each

Page 7: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

7

Trace Scheduling

• Parallelism across IF branches vs. LOOP branches• Two steps:

– Trace Selection» Find likely sequence of basic blocks (trace) of (statically

predicted) long sequence of straight-line code– Trace Compaction

» Squeeze trace into few VLIW instructions» Need bookkeeping code in case prediction is wrong

Page 8: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

8

A[i] = A[i] + B[i]

A[i]=0?

x

F

B[i]=

C[i]=

Page 9: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

9

Fix up instructions in case we are wrong

Page 10: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

10

HW support for More ILP

• Speculation: allow an instruction to issue that is dependent on branch predicted to be taken withoutany consequences (including exceptions) if branch is not actually taken (“HW undo”)

• Often try to combine with dynamic scheduling• Tomasulo: separate speculative bypassing of

results from real bypassing of results– When instruction no longer speculative, write results

(instruction commit)– execute out-of-order but commit in order

Page 11: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

11

Hardware Speculation

Execution completeExecution

2 3 4 51

•Instruction 1 & 2 is allowed to change the machine state•Instruction 4 & 5 should not change the machine state

Page 12: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

12

Hardware Speculation

• Tomasulo without speculation– Issue– Execution– Write Result

» write results to the CDB & update the register file or memory

• Tomasulo with speculation– Issue– Execution– Write Result

» write results to the CDB & store results in a HW buffer (Reorder Buffer)

– Commit» Update register file or memory

Page 13: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

13

HW support for More ILP

• Need HW buffer for results of uncommitted instructions: reorder buffer

::::::MULTD F0, F2, F4DIVD F10,F0,F6::::

Busy Op Vj Vk Qj Qk Dest

ReorderBuffer

FP Regs

FPOp

Queue

FP Adder FP Adder

Res Stations Res Stations

Figure 4.34, page 311

Mult1 Yes mult [F2] [F4] #1Mult2 Yes div [F6] #1

Reservation Station

Reorder BufferBusy Instruction State Destination Value

#1 yes multd f0,f2,f4 Write Result F0 x#2 yes divd f10,f0,f6 Execute F10

Page 14: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

14

Four Steps of Speculative Tomasulo Algorithm

1.Issue—get instruction from FP Op QueueIf reservation station or reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination.

2.Execution—operate on operands (EX)When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute

3.Write result—finish execution (WB)Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4.Commit—update register with reorder resultWhen instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instrfrom reorder buffer.

Page 15: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

15

Commit Stage

1 2 4 53

2 4 53

4 53

6

Instruction 1 commit

4 5

Instruction 2 commit

Wait for instruction 3 to complete

Instruction 3 commit

5Instruction 4 commit

6 7

6 7 8

6 7 8 9

Page 16: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

16

Speculative Execution

• How to handle mispredicted branch?

• How to handle precise exception?– Handle the exception when a instruction reaches the

head of the reorder buffer

4 53 6 7 Inst 3 is a mispredicted branch

flush the reorder buffer

11 Fetch from the right path

Page 17: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

17

How to Measure Available ILP?

Initial HW Model here; MIPS compilers1. Register renaming–infinite virtual registers and all WAW & WAR hazards are avoided2. Branch prediction–perfect; no mispredictions 3. Jump prediction–all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available4. Memory-address alias analysis–addresses are known & a store can be moved before a load provided addresses not equal

1 cycle latency for all instructions

Page 18: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

18

Upper Limit to ILP

Programs

0

20

40

60

80

100

120

140

160

gcc espresso li fpppp doducd tomcatv

54.862.6

17.9

75.2

118.7

150.1

Page 19: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

19

Program

0

10

20

30

40

50

60

gcc espresso li fpppp doducd tomcatv

35

41

16

6158

60

9

1210

48

15

6 7 6

46

13

45

6 6 7

45

14

45

2 2 2

29

4

19

46

Perfect Selective predictor Standard 2-bit Static None

More Realistic HW: Branch ImpactChange from Infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle

ProfileBHT (512)Pick Cor. or BHTPerfect

Page 20: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

20

Program

0

10

20

30

40

50

60

gcc espresso li fpppp doducd tomcatv

11

15

12

29

54

10

15

12

49

16

1013

12

35

15

44

9 10 11

20

11

28

5 5 6 5 57

4 45

45 5

59

45

Infinite 256 128 64 32 None

More Realistic HW: Register Impact

Change 2000 instrwindow, 64 instrissue, 8K 2 level Prediction

64 None256Infinite 32128

Page 21: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

21

Program

0

5

10

15

20

25

30

35

40

45

50

gcc espresso li fpppp doducd tomcatv

10

15

12

49

16

45

7 79

49

16

4 5 4 46 5

35

3 3 4 4

45

Perfect Global/stack Perfect Inspection None

More Realistic HW: Alias Impact

Change 2000 instrwindow, 64 instr issue, 8K 2 level Prediction, 256 renaming registers

NoneGlobal/Stack perf;heap conflicts

Perfect Inspec.

Page 22: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

22

Program

0

10

20

30

40

50

60

gcc expresso li fpppp doducd tomcatv

10

15

12

52

17

56

10

15

12

47

16

10

1311

35

15

34

910 11

22

12

8 8 9

14

9

14

6 6 68

79

4 4 4 5 46

3 2 3 3 3 3

45

22

Infinite 256 128 64 32 16 8 4

Realistic HW : Window Impact

Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window

64 16256Infinite 32128 8 4

Page 23: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

23

• 8-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe)vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)

Benchmark

0

100

200

300

400

500

600

700

800

900

espr

esso

li

eqnt

ott

com

pres

s sc gcc

spic

e

dodu

c

mdl

jdp2

wav

e5

tom

catv or

a

alvi

nn ear

mdl

jsp2

swm

256

su2c

or

hydr

o2d

nasa

fppp

p

Page 24: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

24

Summary of Exploiting Instruction Level Parallelism

• Why exploiting ILP is important?• Techniques to exploit ILP

– Instruction scheduling» Static method : loop unrolling, software pipelining» Dynamic method: scoreboard and tomasulo

– Branch Prediction» How to handle mispredicted branch?

• Achieving CPI <1 – Superscalar processor– VLIW

Page 25: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

25

Importance of Exploiting ILP• CPU utilization is low because of hazards

– Structural hazard– Data hazard– Control hazard

• CPU stalls if hazards can not be removed• Exploiting ILP to reduce the number of hazards

00.5

11.5

22.5

33.5

44.5

eqnt

ott

espr

esso gc

c li

dodu

c

nasa

7 ora

spic

e2g6

su2c

or

tom

catv

Base Load stalls Branch stalls FP result stalls FP structuralstalls

Page 26: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

26

Static instruction scheduling: loop unrolling

1 Loop: LD F0,0(R1)2 LD F6,-8(R1)3 LD F10,-16(R1)4 LD F14,-24(R1)5 ADDD F4,F0,F26 ADDD F8,F6,F27 ADDD F12,F10,F28 ADDD F16,F14,F29 SD 0(R1),F410 SD -8(R1),F811 SD -16(R1),F1212 SUBI R1,R1,#3213 BNEZ R1,LOOP14 SD 8(R1),F16 ; 8-32 = -24

1 Loop: LD F0,0(R1)2 stall3 ADDD F4,F0,F24 SUBI R1,R1,85 BNEZ R1,Loop

;delayed branch6 SD 8(R1),F4

Page 27: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

27

Software Pipelining

• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations

• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (- Tomasulo in SW)

Iteration 0 Iteration

1 Iteration 2 Iteration

3 Iteration 4

Software- pipelined iteration

Page 28: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

28

HW instruction scheduling: out-of-order execution

• Key idea: Allow instructions behind stall to proceed

DIVD F0,F2,F4ADDD F10,F0,F8SUBD F8,F8,F14

– In-order-execution» SUBD is not issued until the dependence is cleared

– Dynamic scheduling enables out-of-order execution => out-of-order completion

Page 29: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

29

Tomasulo Organization

FP addersFP adders

Add1Add2Add3

FP multipliersFP multipliers

Mult1Mult2

From Mem FP Registers

Reservation Stations

Common Data Bus (CDB)

To Mem

FP OpQueue

Load Buffers

Store Buffers

Load1Load2Load3Load4Load5Load6

Page 30: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

30

Tamasulo

• Prevents Register as bottleneck• Avoids WAR, WAW hazards of Scoreboard• Allows loop unrolling in HW• Not limited to basic blocks

– branch prediction

• Lasting Contributions– Dynamic scheduling– Register renaming– Load/store disambiguation

Page 31: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

31

Branch Prediction

• Static approach– Predicted taken, predicted untaken, branch delay slot,

profiled-based prediction

• Dynamic approach– Branch History Table

» 1 bit vs. 2 bits– Correlation Branch Prediction– Branch target buffer

Page 32: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

32

• Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 4.13, p. 264)

Dynamic Branch Prediction

T

T

NT

Predict Taken

Predict Not Taken

Predict Taken

Predict Not TakenT

NT

T

NT

NT

Page 33: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

33

Correlating Branches

Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)

– Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction

• (2,2) predictor: 2-bit global, 2-bit local

Branch address (4 bits)

2-bits per branch local predictors

PredictionPrediction

2-bit global branch history

(01 = not taken then taken)

Page 34: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

34

Need Address at Same Time as Prediction

• Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)

– Note: must check for branch match now, since can’t use wrong branch address (Figure 3.19, p. 262)

Branch PC Predicted PC

=?

PC of instructionFETCH

Extra prediction state

bitsYes: instruction is branch and use predicted PC as next PC

No: branch not predicted, proceed normally

(Next PC = PC+4)

Page 35: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

35

Hardware Speculation

• Tomasulo without speculation– Issue– Execution– Write Result

» write results to the CDB & update the register file or memory

– Out-of-order completion» Cause problems for mispredicted branch & maintaining

precise exception

• Tomasulo with speculation– Issue– Execution– Write Result

» write results to the CDB & store results in a HW buffer (Reorder Buffer)

– Commit» Update register file or memory (in-order commit)

Page 36: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

36

Getting CPI < 1 Multiple Instructions/Cycle

• Two variations:– Superscalar: varying no. instructions/cycle (1 to 8),

scheduled by compiler or by HW (Tomasulo)» IBM PowerPC, Sun SuperSparc, DEC Alpha, HP 7100

– Very Long Instruction Words (VLIW): fixed number of instructions (16) scheduled by the compiler

» Joint HP/Intel agreement in 1998?

Page 37: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

37

Loop Unrolling in SuperScalar

Integer instruction FP instruction Clock cycleLoop: LD F0,0(R1) 1

LD F6,-8(R1) 2LD F10,-16(R1) ADDD F4,F0,F2 3LD F14,-24(R1) ADDD F8,F6,F2 4LD F18,-32(R1) ADDD F12,F10,F2 5SD 0(R1),F4 ADDD F16,F14,F2 6SD -8(R1),F8 ADDD F20,F18,F2 7SD -16(R1),F12 8SD -24(R1),F16 9SUBI R1,R1,#40 10BNEZ R1,LOOP 11SD -32(R1),F20 12

Unrolled 5 times to avoid delays (+1 due to SS)12 clocks, or 2.4 clocks per iteration

Page 38: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

38

Loop Unrolling in VLIW

Memory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branchLD F0,0(R1) LD F6,-8(R1) 1LD F10,-16(R1) LD F14,-24(R1) 2LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 3LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4

ADDD F20,F18,F2 ADDD F24,F22,F2 5SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6SD -16(R1),F12 SD -24(R1),F16 7SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 8SD -0(R1),F28 BNEZ R1,LOOP 9

Unrolled 7 times to avoid delays7 results in 9 clocks, or 1.3 clocks per iterationNeed more registers in VLIW

Page 39: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

39

Limits to Multi-Issue Machines

• Inherent limitations of ILP– 1 branch in 5: How to keep a 5-way VLIW busy?– Latencies of units: many operations must be scheduled– Need about Pipeline Depth x No. Functional Units of independent

operations to keep machines busy

• Difficulties in building HW– Duplicate FUs to get parallel execution– Increase ports to Register File – Increase ports to memory– Decoding SS and impact on clock rate, pipeline depth

Page 40: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

40

Limits to Multi-Issue Machines

• Limitations specific to either SS or VLIW implementation

– Decode issue in SS– VLIW code size: unroll loops + wasted fields in VLIW– VLIW lock step => 1 hazard & all instructions stall– VLIW & binary compatibility is practical weakness

Page 41: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

41

Chapter 5: Memory Hierarchy

Page 42: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

42

Recap: Who Cares About the Memory Hierarchy?

µProc60%/yr.(2X/1.5yr)

DRAM9%/yr.(2X/10 yrs)

1

10

100

1000

1980

1981

1983

1984

1985

1986

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

DRAM

CPU19

82

Processor-MemoryPerformance Gap:(grows 50% / year)

Perf

orman

ce

Time

Processor-DRAM Memory Gap (latency)

Page 43: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

43

Levels of the Memory Hierarchy

CPU Registers100s Bytes<1s ns

Cache10s-100s K Bytes1-10 ns$10/ MByte

Main MemoryM Bytes100ns- 300ns$1/ MByte

Disk10s G Bytes, 10 ms (10,000,000 ns)$0.0031/ MByte

CapacityAccess TimeCost

Tapeinfinitesec-min$0.0014/ MByte

Registers

Cache

Memory

Disk

Tape

Instr. Operands

Blocks

Pages

Files

StagingXfer Unit

prog./compiler1-8 bytes

cache cntl8-128 bytes

OS512-4K bytes

user/operatorMbytes

Upper Level

Lower Level

faster

Larger

Page 44: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

44

The Principle of Locality

• The Principle of Locality:– Program access a relatively small portion of the address space at any

instant of time.• Two Different Types of Locality:

– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)

– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)

• Last 15 years, HW (hardware) relied on locality for speed

Page 45: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

45

Memory Hierarchy: Terminology• Hit: data appears in some block in the upper level (example:

Block X) – Hit Rate: the fraction of memory access found in the upper level– Hit Time: Time to access the upper level which consists of

RAM access time + Time to determine hit/miss• Miss: data needs to be retrieve from a block in the lower level

(Block Y)– Miss Rate = 1 - (Hit Rate)– Miss Penalty: Time to replace a block in the upper level +

Time to deliver the block the processor• Hit Time << Miss Penalty (500 instructions on 21264!)

Lower LevelMemoryUpper Level

MemoryTo Processor

From ProcessorBlk X

Blk Y

Page 46: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

46

Cache Measures

• Hit rate: fraction found in that level– So high that usually talk about Miss rate– Miss rate fallacy: as MIPS to CPU performance,

miss rate to average memory access time in memory • Average memory-access time

= Hit time + Miss rate x Miss penalty (ns or clocks)

• Miss penalty: time to replace a block from lower level, including time to replace in CPU

– access time: time to lower level = f(latency to lower level)

– transfer time: time to transfer block =f(BW between upper & lower levels)

Page 47: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

47

Simplest Cache: Direct MappedMemory

4 Byte Direct Mapped Cache

Memory Address0123456789ABCDEF

Cache Index0123

• Location 0 can be occupied by data from:

– Memory location 0, 4, 8, ... etc.– In general: any memory location

whose 2 LSBs of the address are 0s– Address<1:0> => cache index

• Which one should we place in the cache?

• How can we tell which one is in the cache?

Page 48: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

48

1 KB Direct Mapped Cache, 32B blocks• For a 2 ** N byte cache:

– The uppermost (32 - N) bits are always the Cache Tag– The lowest M bits are the Byte Select (Block Size = 2 ** M)

Cache Index

0123

:

Cache DataByte 0

0431

:

Cache Tag Example: 0x50Ex: 0x01

0x50

Stored as partof the cache “state”

Valid Bit

:31

Byte 1Byte 31 :

Byte 32Byte 33Byte 63 :Byte 992Byte 1023 :

Cache Tag

Byte SelectEx: 0x00

9

Page 49: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

49

Two-way Set Associative Cache• N-way set associative: N entries for each Cache Index

– N direct mapped caches operates in parallel (N typically 2 to 4)

• Example: Two-way set associative cache– Cache Index selects a “set” from the cache– The two tags in the set are compared in parallel– Data is selected based on the tag result

Cache DataCache Block 0

Cache TagValid

:: :

Cache DataCache Block 0

Cache Tag Valid

: ::

Cache Index

Mux 01Sel1 Sel0

Cache Block

CompareAdr Tag

Compare

OR

Hit

Page 50: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

50

Disadvantage of Set Associative Cache• N-way Set Associative Cache v. Direct Mapped Cache:

– N comparators vs. 1– Extra MUX delay for the data– Data comes AFTER Hit/Miss

• In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:

– Possible to assume a hit and continue. Recover later if miss.

Cache DataCache Block 0

Cache Tag Valid

: ::

Cache DataCache Block 0

Cache TagValid

:: :

Cache Index

Mux 01Sel1 Sel0

Cache Block

CompareAdr Tag

Compare

OR

Hit

Page 51: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

51

4 Questions for Memory Hierarchy

• Q1: Where can a block be placed in the upper level? (Block placement)

• Q2: How is a block found if it is in the upper level?(Block identification)

• Q3: Which block should be replaced on a miss? (Block replacement)

• Q4: What happens on a write? (Write strategy)

Page 52: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

52

Q1: Where can a block be placed in the upper level?

• Block 12 placed in 8 block cache:– Fully associative, direct mapped, 2-way set associative– S.A. Mapping = Block Number Modulo Number Sets

Cache

01234567 0123456701234567

Memory

111111111122222222223301234567890123456789012345678901

Full Mapped Direct Mapped(12 mod 8) = 4

2-Way Assoc(12 mod 4) = 0

Page 53: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

53

Q2: How is a block found if it is in the upper level?

• Tag on each block– No need to check index or block offset

• Increasing associativity shrinks index, expands tag

BlockOffset

Block Address

IndexTag

Page 54: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

54

Q3: Which block should be replaced on a miss?

• Easy for Direct Mapped• Set Associative or Fully Associative:

– Random– LRU (Least Recently Used)

Assoc: 2-way 4-way 8-waySize LRU Ran LRU Ran LRU Ran16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0%64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5%256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%

Page 55: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

55

Q4: What happens on a write?

• Write through—The information is written to both the block in the cache and to the block in the lower-level memory.

• Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

– is block clean or dirty?• Pros and Cons of each?

– WT: read misses cannot result in writes– WB: no repeated writes to same location

• WT always combined with write buffers so that don’t wait for lower level memory

Page 56: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

56

Write Buffer for Write Through

• A Write Buffer is needed between the Cache and Memory

– Processor: writes data into the cache and the write buffer– Memory controller: write contents of the buffer to memory

• Write buffer is just a FIFO:– Typical number of entries: 4– Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle

• Memory system designer’s nightmare:– Store frequency (w.r.t. time) -> 1 / DRAM write cycle– Write buffer saturation

ProcessorCache

Write Buffer

DRAM

Page 57: Lecture 6 - 國立臺灣大學yangc/lecture6.pdf14 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station or reorder buffer

57

A Modern Memory Hierarchy• By taking advantage of the principle of locality:

– Present the user with as much memory as is available in the cheapest technology.

– Provide access at the speed offered by the fastest technology.

Control

Datapath

SecondaryStorage(Disk)

Processor

Registers

MainMemory(DRAM)

SecondLevelCache

(SRAM)

On-C

hipC

ache1s 10,000,000s

(10s ms)Speed (ns): 10s 100s

100sGs

Size (bytes):Ks Ms

TertiaryStorage

(Disk/Tape)

10,000,000,000s (10s sec)

Ts