ECE 4100/6100 Advanced Computer Architecture Lecture 6 Instruction Fetch Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology


Post on 06-Feb-2016


Page 1: ECE 4100/6100 Advanced Computer Architecture Lecture 6 Instruction Fetch

ECE 4100/6100 Advanced Computer Architecture

Lecture 6 Instruction Fetch

Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Page 2: Instruction Supply Issues

• Fetch throughput bounds the maximum performance that can be achieved in later stages

• Superscalar processors need to supply more than 1 instruction per cycle

• Instruction supply is limited by
– Misalignment of multiple instructions in a fetch group
– Change of flow (interrupting instruction supply)
– Memory latency and bandwidth

[Figure: Instruction Fetch Unit, instruction buffer, Execution Core]

Page 3: Aligned Instruction Fetching (4 instructions)

[Figure: one 64B I-cache line (words A0 to A15) stored across four banks (00, 01, 10, 11), each with its own row decoder; rows ..00 to ..11. With PC=..xx000000, inst 1 to inst 4 are read from a single row in cycle n.]

• Assume one fetch group = 16B

• Can pull out one row at a time

Page 4: Misaligned Fetch

[Figure: the same four-bank 64B I-cache line (words A0 to A15). With PC=..xx001000, inst 1 to inst 4 straddle two rows but are still read in cycle n, because each bank has its own row decoder. A rotating network puts the instructions back in order.]

• Example: IBM RS/6000
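The bank and row selection in the figure can be sketched as follows. This is a minimal illustration, not the RS/6000 design itself: it assumes 4-byte instructions, a fetch group that stays inside one cache line, and the hypothetical helper names `misaligned_fetch` and `rotate`.

```python
WORD = 4    # bytes per instruction (assumed)
BANKS = 4   # banks per I-cache line, as in the figure
LINE = 64   # bytes per I-cache line

def misaligned_fetch(pc):
    """Return (row read by each bank, rotation amount) for a 4-instruction group."""
    first = (pc % LINE) // WORD              # word index of inst 1 within the line
    rows = [None] * BANKS
    for k in range(BANKS):                   # k-th instruction of the fetch group
        word = first + k
        rows[word % BANKS] = word // BANKS   # each bank decodes its own row
    return rows, first % BANKS

def rotate(bank_outputs, rotate_by):
    """Rotating network: move the bank holding inst 1 to the front."""
    return bank_outputs[rotate_by:] + bank_outputs[:rotate_by]
```

For PC=..xx001000, `misaligned_fetch` selects row ..01 in banks 00 and 01 and row ..00 in banks 10 and 11, and the rotation by two banks yields A2, A3, A4, A5.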

Page 5: Split Cache Line Access

[Figure: cache line A (A0 to A15) followed by cache line B (B0 to B7 shown). With PC=..xx111000, inst 1 and inst 2 come from the end of line A in cycle n; inst 3 and inst 4 come from line B in cycle n+1.]

• The fetch is broken down into 2 physical accesses
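A short sketch of how one fetch group maps to physical accesses, using the 64B line and 16B fetch group from the slides. The function name `physical_accesses` and the tuple layout are illustrative choices, not part of the lecture.

```python
LINE_BYTES = 64    # I-cache line size from the slides
GROUP_BYTES = 16   # fetch group size from the slides

def physical_accesses(pc):
    """List (line_number, first_byte, last_byte) for each line the group touches."""
    end = pc + GROUP_BYTES - 1
    pieces = []
    line = pc // LINE_BYTES
    while line <= end // LINE_BYTES:
        lo = max(pc, line * LINE_BYTES)
        hi = min(end, (line + 1) * LINE_BYTES - 1)
        pieces.append((line, lo % LINE_BYTES, hi % LINE_BYTES))
        line += 1
    return pieces
```

With a line offset of 111000 (byte 56) the group needs two accesses, bytes 56 to 63 of one line and bytes 0 to 7 of the next; with offset 001000 it needs one.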

Page 6: Split Cache Line Access Miss

[Figure: cache line A (A0 to A15) and cache line C (C0 to C7 shown) are in the cache. With PC=..xx111000, inst 1 and inst 2 come from line A in cycle n. Cache line B misses, so inst 3 and inst 4 arrive in cycle n+X.]

Page 7: High Bandwidth Instruction Fetching

[Figure: control-flow graph of basic blocks BB1 to BB7]

• Wider issue → more instruction feed

• Major challenge: to fetch more than one non-contiguous basic block per cycle

• Enabling technique?
– Predication
– Branch alignment based on profiling
– Other hardware solutions (branch prediction is a given)

Page 8: Predication Example

• Convert control dependency into data dependency
• Enlarge basic block size
– More room for scheduling
– No fetch disruption

Source code

    if (a[i+1] > a[i])
        a[i+1] = 0
    else
        a[i] = 0

Typical assembly

        lw  r2, [r1+4]
        lw  r3, [r1]
        blt r3, r2, L1
        sw  r0, [r1]
        j   L2
    L1: sw  r0, [r1+4]
    L2:

Assembly w/ predication

        lw  r2, [r1+4]
        lw  r3, [r1]
        sgt p4, r2, r3
    (p4)  sw r0, [r1+4]
    (!p4) sw r0, [r1]
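To see that the two assembly versions compute the same result, here is a small Python emulation. It is only a sketch: the function names `branchy` and `predicated` are made up for illustration, and the conditional expressions stand in for predicated stores that are always fetched but commit only when their predicate is true.

```python
def branchy(a, i):
    """Mirror of the typical assembly: one of two paths is fetched."""
    a = list(a)
    if a[i] < a[i + 1]:        # blt r3, r2, L1
        a[i + 1] = 0           # L1: sw r0, [r1+4]
    else:
        a[i] = 0               # sw r0, [r1]
    return a

def predicated(a, i):
    """Mirror of the predicated assembly: straight-line code, no branch."""
    a = list(a)
    r2, r3 = a[i + 1], a[i]
    p4 = r2 > r3                          # sgt p4, r2, r3
    a[i + 1] = 0 if p4 else a[i + 1]      # (p4)  sw r0, [r1+4]
    a[i] = 0 if not p4 else a[i]          # (!p4) sw r0, [r1]
    return a
```

Both functions agree on every input; the predicated form simply has a single basic block, so fetch never has to redirect.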

Page 9: Collapsing Buffer [ISCA 95]

• To fetch multiple (often non-contiguous) instructions

• Use interleaved BTB to enable multiple branch predictions

• Align instructions in the predicted sequential order

• Use banked I-cache for multiple line access

Page 10: Collapsing Buffer

[Figure: the Fetch PC feeds an interleaved BTB and two cache banks (Cache Bank 1, Cache Bank 2); the bank outputs pass through an interchange switch and then a collapsing circuit.]

Page 11: Collapsing Buffer Mechanism

[Figure, top to bottom:
Interleaved BTB: lines A and E
Bank routing: E, A
Cache bank outputs: E F G H | A B C D
Interchange switch output: A B C D E F G H
Valid instruction bits: D, F, H are dropped
Collapsing circuit output: A B C E G]
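The two stages after the cache banks can be sketched in a few lines. This is an illustrative model only: the function names and the dictionary of bank outputs are assumptions, and the valid bits are taken from the figure (D, F, H invalid).

```python
def interchange(bank_outputs, order):
    """Interchange switch: put the fetched lines into predicted fetch order."""
    return [inst for line_id in order for inst in bank_outputs[line_id]]

def collapse(insts, valid_bits):
    """Collapsing circuit: squeeze out instructions whose valid bit is 0."""
    return [inst for inst, valid in zip(insts, valid_bits) if valid]

bank_outputs = {"E": ["E", "F", "G", "H"], "A": ["A", "B", "C", "D"]}
ordered = interchange(bank_outputs, ["A", "E"])
fetched = collapse(ordered, [1, 1, 1, 0, 1, 0, 1, 0])
```

Here `ordered` is A B C D E F G H and `fetched` is A B C E G, matching the figure.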

Page 12: High Bandwidth Instruction Fetching

[Figure: control-flow graph of basic blocks BB1 to BB7]

• To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)

• Multiple branch predictions

Page 13: Multiple Branch Predictor [YehMarrPatt ICS’93]

• Pattern History Table (PHT) design to support MBP
• Based on global history only

[Figure: the Branch History Register (BHR, bits bk … b1) indexes the Pattern History Table (PHT) to produce the primary, secondary, and tertiary predictions. Predictions p1 and p2 select among the candidate PHT entries for the later predictions and update the BHR.]
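One way to read the figure is that each later prediction is looked up with the global history shifted by the earlier predictions. The sketch below follows that reading; it assumes a PHT of 2-bit saturating counters and a k-bit BHR, and the function name `predict_three` is made up for illustration.

```python
def predict_three(pht, bhr, k):
    """Return [primary, secondary, tertiary] predictions (1 = taken).

    pht: list of 2**k two-bit counters (0..3), assumed layout
    bhr: global branch history, newest outcome in bit 0
    """
    mask = (1 << k) - 1
    history = bhr & mask
    preds = []
    for _ in range(3):
        p = 1 if pht[history] >= 2 else 0
        preds.append(p)
        # Speculatively shift the prediction into the history for the next lookup.
        history = ((history << 1) | p) & mask
    return preds
```

In hardware the three lookups happen in the same cycle: the secondary prediction reads the two adjacent entries selected by the low k-1 history bits and p1 picks one, which gives the same result as the sequential loop above.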

Page 14: Multiple Branch Prediction

• Fetch address could be retrieved from BTB

• Predicted path: BB1 → BB2 → BB5

• How to fetch BB2 and BB5? BTB?
– Can’t. Branch PCs of br1 and br2 are not available when the MBP is made
– Use a BAC design

[Figure: binary tree of basic blocks. The fetch address (br0, primary prediction) comes from the BTB entry and points to BB1, which ends in br1. BB1 leads to BB2 (T, 2nd prediction) or BB3 (F). BB2 ends in br2 and leads to BB4 (T) or BB5 (F, 3rd prediction). BB3 leads to BB6 (T) or BB7 (F).]

Page 15: Branch Address Cache

• Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions

• br: 2 bits for branch type (cond, uncond, return)
• V: single valid bit (to indicate if hits a branch in the sequence)
• To make one more level of prediction
– Need to cache another 8 addresses (i.e. total = 14 addresses)
– 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8

[Figure: BAC entry, indexed by the fetch address (from BTB). Fields: Tag (23 bits) with V and br; level 1: Taken Target Address and Not-Taken Target Address (30 bits each, each with V and br); level 2: T-T, T-N, N-T, and N-N Addresses (30 bits each). 212 bits per fetch address entry.]
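The two entry sizes on this slide follow the same pattern, which the sketch below spells out. It assumes, as the slide's formula suggests, that every stored address carries V and br bits (3 bits) except the addresses at the last level; the function name `bac_entry_bits` is illustrative.

```python
TAG_BITS = 23
ADDR_BITS = 30
FLAG_BITS = 3   # V (1 bit) + br (2 bits)

def bac_entry_bits(extra_predictions):
    """Bits per BAC entry when it supports this many predictions past the first."""
    bits = TAG_BITS + FLAG_BITS
    for level in range(1, extra_predictions + 1):
        n_addrs = 2 ** level                       # 2, 4, 8, ... addresses per level
        last_level = level == extra_predictions
        per_addr = ADDR_BITS if last_level else ADDR_BITS + FLAG_BITS
        bits += n_addrs * per_addr
    return bits
```

`bac_entry_bits(2)` gives 26 + 2*33 + 4*30 = 212 bits for 6 addresses, and `bac_entry_bits(3)` gives 26 + 6*33 + 8*30 = 464 bits for 14 addresses, so each extra prediction level roughly doubles the entry.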

Page 16: Caching Non-Consecutive Basic Blocks

• High Fetch Bandwidth + Low Latency

[Figure: basic blocks BB1 to BB5 are scattered across lines when fetched from a conventional instruction cache, but sit back to back (BB1 BB2 BB3 BB4 BB5) when fetched from a linear memory location.]

Page 17: Trace Cache

• Cache dynamic non-contiguous instructions (traces)
• Cross multiple basic blocks
• Need to predict multiple branches (MBP)

[Figure: fetching the dynamic sequence A B C D E F G H I J, which is scattered across several I$ lines.
I$ fetch (5 cycles): A B | C | D | E F G | H I J
Collapsing buffer fetch (3 cycles): A B C | D E F G | H I J
T$ fetch (1 cycle): A B C D E F G H I J]

Page 18: Trace Cache [Rotenberg Bennett Smith MICRO‘96]

• Cache at most (in original paper)
– M branches OR (M = 3 in all follow-up TC studies due to MBP)
– N instructions (N = 16 in all follow-up TC studies)

• Fall-thru address if last branch is predicted not taken

[Figure: a trace cache line holds Tag, Br flag, Br mask, Fall-thru Address, and Taken Address. The fetch address and the MBP outputs are checked against the line; on a T.C. hit it supplies N instructions covering M branches (BB1, BB2, BB3 ending in Branch 1, Branch 2, Branch 3). A line fill buffer is used on a T.C. miss.
Br flag example: 10 = 1st branch taken, 2nd branch not taken.
Br mask example: 11, 1 = "11": 3 branches; "1": the trace ends with a branch.]

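Page 19: Trace Hit Logic

[Figure: a trace cache entry with fields Tag, BF, Mask, Fall-thru, Target holds (A, 10, 11,1, X, Y). The fetch address A is compared with the tag (match 1st block). The multi-branch predictor outputs (T, N) are compared with the branch flags under the mask (cond. AND, match remaining blocks). Both matches together give a trace hit. The last prediction (N) drives a 0/1 mux that selects the next fetch address.]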
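A minimal sketch of the hit check and next-address selection, using the entry from the figure. The `TraceLine` class, its field names, and `trace_lookup` are hypothetical; the sketch also assumes that the branch flags cover every branch except the last one, and that a trace not ending in a branch continues at the fall-thru address.

```python
from dataclasses import dataclass

@dataclass
class TraceLine:
    tag: str
    br_flags: tuple        # outcomes of all branches but the last (1 = taken)
    n_branches: int        # first part of the branch mask
    ends_in_branch: bool   # second part of the branch mask
    fall_thru: str
    target: str

def trace_lookup(line, fetch_addr, preds):
    """Return (hit, next_fetch_addr) given the multi-branch predictions."""
    if line.tag != fetch_addr:                       # match 1st block
        return False, None
    if tuple(preds[:len(line.br_flags)]) != line.br_flags:
        return False, None                           # remaining blocks differ
    if line.ends_in_branch and preds[line.n_branches - 1]:
        return True, line.target                     # last branch predicted taken
    return True, line.fall_thru

entry = TraceLine("A", (1, 0), 3, True, "X", "Y")
```

With predictions T, N, N the lookup for fetch address A hits and the next fetch address is the fall-thru X; if the third prediction were T it would be the target Y.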

Page 20: Trace Cache Example

[Figure: control-flow graph of basic blocks A (5 insts), B (6 insts), C (12 insts), D (4 insts) and Exit]

BB Traversal Path: ABDABDACDABDACDABDAC

Trace Cache (5 lines, 16 instructions each)

A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4

A trace ends on:
Cond 1: 3 branches
Cond 2: Fill a trace cache line
Cond 3: Exit
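The five lines can be reproduced by cutting the dynamic instruction stream with the three end conditions. The sketch below assumes that the last instruction of every basic block is a branch and that a new trace starts right after the previous one ends; `build_traces` and `distinct` are illustrative names.

```python
def build_traces(path, sizes, max_branches=3, max_insts=16):
    """Cut the dynamic instruction stream of `path` into traces."""
    traces, current, branches = [], [], 0
    for bb in path:
        for i in range(1, sizes[bb] + 1):
            current.append(f"{bb}{i}")
            if i == sizes[bb]:              # assumed: each block ends in a branch
                branches += 1
            # Cond 1: 3 branches; Cond 2: the trace cache line is full
            if branches == max_branches or len(current) == max_insts:
                traces.append(current)
                current, branches = [], 0
    if current:                             # Cond 3: exit
        traces.append(current)
    return traces

def distinct(traces):
    """Traces in first-seen order, as they would be filled into the cache."""
    seen = []
    for t in traces:
        if t not in seen:
            seen.append(t)
    return seen

sizes = {"A": 5, "B": 6, "C": 12, "D": 4}
lines = distinct(build_traces("ABDABDACDABDACDABDAC", sizes))
```

The first five entries of `lines` are the five trace cache lines listed above, in the same order.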


Page 22: Trace Cache Example

[Figure: control-flow graph of basic blocks A (5 insts), B (6 insts), C (12 insts), D (4 insts) and Exit]

BB Traversal Path: ABDABDACDABDACDABDAC

Trace Cache (5 lines)

A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4

• Trace Cache is Full

Page 23: Trace Cache Example

[Figure: control-flow graph of basic blocks A (5 insts), B (6 insts), C (12 insts), D (4 insts) and Exit]

BB Traversal Path: ABDABDACDABDACDABDAC

Trace Cache (5 lines)

A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4

• How many hits?

• What is the utilization?

Page 24: Redundancy

• Duplication
– Note that instructions only appear once in I-Cache
– Same instruction appears many times in TC

• Fragmentation
– If 3 BBs < 16 instructions
– If multiple-target branch (e.g. return, indirect jump or trap) is encountered, stop “trace construction”.
– Empty slots → wasted resources

• Example
– A single BB is broken up to (ABC), (BCD), (CDA), (DAB)
– Duplicating each instruction 3 times

[Figure: loop of basic blocks A (6 insts), B (4 insts), C (6 insts), D (3 insts). The trace cache holds (ABC) = 16 inst, (BCD) = 13 inst, (CDA) = 15 inst, (DAB) = 13 inst.]
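The duplication and fragmentation in this example can be checked with a little arithmetic. The sketch assumes 16-slot trace cache lines, as in the earlier example, and the block sizes implied by the four trace lengths; `trace_cache_stats` is an illustrative name.

```python
def trace_cache_stats(sizes, traces, line_slots=16):
    """Instructions per trace, average copies per instruction, slot utilization."""
    used = [sum(sizes[bb] for bb in trace) for trace in traces]
    unique = sum(sizes.values())
    return {
        "insts_per_trace": used,
        "duplication": sum(used) / unique,
        "utilization": sum(used) / (line_slots * len(traces)),
    }

stats = trace_cache_stats({"A": 6, "B": 4, "C": 6, "D": 3},
                          ["ABC", "BCD", "CDA", "DAB"])
```

The four traces hold 16 + 13 + 15 + 13 = 57 instructions for only 19 distinct ones, so each instruction is stored 3 times, and 57 of the 64 slots are occupied.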

Page 25: Indexability

• TC saved traces (EAC) and (BCD)

• Path: (EAC) to (D)
– Cannot index interior block (D)

• Can cause duplication

• Need partial matching
– (BCD) is cached, if (BC) is needed

[Figure: control-flow graph of basic blocks A, B, C, D, E, G; the trace cache holds (E A C) and (B C D).]
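Both points can be illustrated with a toy lookup that indexes traces only by their first block. This is a sketch under that assumption; the `lookup` function and its `partial` flag are made up for illustration and do not describe a specific hardware design.

```python
def lookup(traces, path, partial=False):
    """Find a cached trace for `path`, the predicted blocks from the fetch address.

    Traces are indexed by their first block only, so interior blocks never hit.
    With partial=True, return the longest matching prefix of a cached trace.
    """
    for trace in traces:
        if trace[0] != path[0]:
            continue
        if trace == tuple(path):
            return trace
        if partial:
            n = 0
            while n < min(len(trace), len(path)) and trace[n] == path[n]:
                n += 1
            return trace[:n]
    return None

cached = [("E", "A", "C"), ("B", "C", "D")]
```

A fetch starting at D misses even though D sits inside (BCD), so a new trace beginning at D would duplicate it. A request for (BC) misses without partial matching and returns the (BC) prefix of (BCD) with it.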

Page 26: Pentium 4 (NetBurst) Trace Cache

[Figure: front-end BTB, iTLB and prefetcher, L2 cache, decoder, trace cache (Trace $) with its own BTB, then rename, execute, etc.]

• No I$ !!

• Decoded instructions

• Trace-based prediction (predict next-trace, not next-PC)