lec6 computer architecture by hsien-hsin sean lee georgia tech -- instruction fetch

ECE 4100/6100 Advanced Computer Architecture

Lecture 6 Instruction Fetch

Prof. Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology


Instruction Supply Issues

• Fetch throughput defines max performance that can be achieved in later stages

• Superscalar processors need to supply more than 1 instruction per cycle

• Instruction supply is limited by
  – Misalignment of multiple instructions in a fetch group
  – Change of flow (interrupting instruction supply)
  – Memory latency and bandwidth

[Figure: Instruction Fetch Unit feeding the Execution Core through an instruction buffer]

Aligned Instruction Fetching (4 instructions)

• Assume one fetch group = 16B
• One 64B I-cache line (A0 to A15) spans four rows; the cache can pull out one row at a time
• PC=..xx000000: inst 1 to inst 4 all come from one row in cycle n

[Figure: I-cache array with two row decoders, rows ..00 to ..11 and columns 00 to 11, holding cache line A0 to A15]

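Misaligned Fetch

• PC=..xx001000: the fetch group starts in the middle of a row of the 64B I-cache line
• inst 1 and inst 2 come from one row, inst 3 and inst 4 from the next row, all in cycle n
• A rotating network puts the instructions back in order
• Example: IBM RS/6000

[Figure: I-cache array with two row decoders, rows ..00 to ..11 and columns 00 to 11, holding cache line A0 to A15, followed by a rotating network]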

Split Cache Line Access

• PC=..xx111000: the fetch group starts near the end of cache line A (A0 to A15) and continues into cache line B (B0 to B7 shown)
• The fetch must be broken down into 2 physical accesses
  – Cycle n: inst 1 and inst 2 from cache line A
  – Cycle n+1: the remaining instructions from cache line B

[Figure: I-cache array with two row decoders holding cache line A and cache line B]
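The three fetch cases on the preceding slides (aligned, misaligned, split) can be told apart from the low PC bits alone. The following is a minimal sketch, assuming the 64B line and 16B fetch group stated on the slides; the function name `classify_fetch` and the constants are hypothetical names chosen for illustration.

```python
LINE_BYTES = 64    # one I-cache line (from the slides)
GROUP_BYTES = 16   # one fetch group, equal to one cache row (from the slides)


def classify_fetch(pc: int) -> str:
    """Classify a fetch group starting at byte address pc."""
    offset = pc % LINE_BYTES              # byte offset inside the cache line
    if offset % GROUP_BYTES == 0:
        return "aligned"                  # exactly one row
    if offset + GROUP_BYTES <= LINE_BYTES:
        return "misaligned"               # two rows of the same line
    return "split"                        # spills into the next cache line


# The three PC values used on the slides (low six bits only).
for pc in (0b000000, 0b001000, 0b111000):
    print(f"{pc:06b}", classify_fetch(pc))
```

Only the "split" case needs a second physical access; the "misaligned" case stays within one line and relies on the two row decoders plus the rotating network.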

Split Cache Line Access Miss

• PC=..xx111000: same split access as before, but cache line B misses (the figure shows cache line C, C0 to C7, in the array instead)
• Cycle n: access to cache line A
• Cycle n+X: access to cache line B, once the miss has been serviced

[Figure: I-cache array with two row decoders holding cache line A and cache line C]

High Bandwidth Instruction Fetching

• Wider issue → more instruction feed
• Major challenge: fetching more than one non-contiguous basic block per cycle
• Enabling techniques?
  – Predication
  – Branch alignment based on profiling
  – Other hardware solutions (branch prediction is a given)

[Figure: control-flow graph of basic blocks BB1 to BB7]

Predication Example

• Convert control dependency into data dependency
• Enlarge basic block size
  – More room for scheduling
  – No fetch disruption

Source code:

    if (a[i+1] > a[i])
        a[i+1] = 0
    else
        a[i] = 0

Typical assembly:

        lw   r2, [r1+4]      ; r2 = a[i+1]
        lw   r3, [r1]        ; r3 = a[i]
        blt  r3, r2, L1      ; if a[i] < a[i+1] goto L1
        sw   r0, [r1]        ; a[i] = 0
        j    L2
    L1: sw   r0, [r1+4]      ; a[i+1] = 0
    L2:

Assembly w/ predication:

          lw   r2, [r1+4]    ; r2 = a[i+1]
          lw   r3, [r1]      ; r3 = a[i]
          sgt  p4, r2, r3    ; p4 = (a[i+1] > a[i])
    (p4)  sw   r0, [r1+4]    ; executes only if p4 is true
    (!p4) sw   r0, [r1]      ; executes only if p4 is false
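To make the control-to-data dependency conversion concrete, here is a small sketch that models both versions of the example and checks that they agree. It assumes a predicated store can be modeled as a select that writes back the old value when the predicate is false (real hardware nullifies the store instead); the function names `branchy` and `predicated` are hypothetical.

```python
def branchy(a, i):
    """Reference version with a control dependency (if/else)."""
    a = list(a)
    if a[i + 1] > a[i]:
        a[i + 1] = 0
    else:
        a[i] = 0
    return a


def predicated(a, i):
    """If-converted version: both stores are issued, guarded by p4."""
    a = list(a)
    r2, r3 = a[i + 1], a[i]        # lw r2, [r1+4] ; lw r3, [r1]
    p4 = r2 > r3                   # sgt p4, r2, r3
    a[i + 1] = 0 if p4 else r2     # (p4)  sw r0, [r1+4]  (old value kept if !p4)
    a[i] = r3 if p4 else 0         # (!p4) sw r0, [r1]    (old value kept if p4)
    return a


for data in ([1, 5], [5, 1], [3, 3]):
    assert branchy(data, 0) == predicated(data, 0)
    print(data, "->", predicated(data, 0))
```

The predicated version has no branch, so the fetch unit sees one straight-line basic block.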

Collapsing Buffer [ISCA 95]

• To fetch multiple (often non-contiguous) instructions

• Use interleaved BTB to enable multiple branch predictions

• Align instructions in the predicted sequential order

• Use banked I-cache for multiple line access

Collapsing Buffer

[Figure: the fetch PC indexes an interleaved BTB and two cache banks (Cache Bank 1, Cache Bank 2); the bank outputs pass through an interchange switch and then a collapsing circuit]

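Collapsing Buffer Mechanism

[Figure: worked example]
• Interleaved BTB and bank routing: fetch addresses A and E select the two cache lines
• Cache banks deliver the lines (E F G H) and (A B C D)
• Interchange switch: reorders the lines into A B C D E F G H
• Collapsing circuit: the valid instruction bits keep A B C E G and drop D F H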
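The two datapath stages in the figure can be sketched in a few lines. This is only an illustration of the data movement, assuming the valid bits have already been produced by the interleaved BTB predictions; the function names `interchange` and `collapse` and the list encoding are hypothetical.

```python
def interchange(bank_lines, order):
    """Interchange switch: put the bank outputs in predicted sequential order."""
    return [inst for idx in order for inst in bank_lines[idx]]


def collapse(insts, valid_bits):
    """Collapsing circuit: squeeze out instructions whose valid bit is 0."""
    return [inst for inst, valid in zip(insts, valid_bits) if valid]


# Values taken from the figure: the banks return (E F G H) and (A B C D).
bank_lines = [list("EFGH"), list("ABCD")]
ordered = interchange(bank_lines, order=[1, 0])   # A B C D E F G H
valid_bits = [1, 1, 1, 0, 1, 0, 1, 0]             # D, F and H are not on the path
print(collapse(ordered, valid_bits))              # ['A', 'B', 'C', 'E', 'G']
```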

High Bandwidth Instruction Fetching

• To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
• Multiple branch predictions

[Figure: control-flow graph of basic blocks BB1 to BB7]

Multiple Branch Predictor [YehMarrPatt ICS’93]

• Pattern History Table (PHT) design to support MBP
• Based on global history only

[Figure: a k-bit Branch History Register (BHR), bits bk … b1, indexes the Pattern History Table (PHT) to produce the primary prediction p1, the secondary prediction p2 and the tertiary prediction; p1 and p2 feed the later predictions and the BHR update]
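A software model can make the primary/secondary/tertiary idea easier to follow. The sketch below assumes 2-bit saturating counters in the PHT and shifts each prediction speculatively into the history to form the next index; hardware would read the candidate entries in parallel and select with p1 and p2, which this sequential code only approximates. The function name `predict_three` and its parameters are hypothetical.

```python
def predict_three(pht, bhr, k):
    """Return (p1, p2, p3) from a global-history PHT of 2**k 2-bit counters."""
    mask = (1 << k) - 1

    def taken(index):
        return pht[index & mask] >= 2     # counter values 2 and 3 mean "taken"

    p1 = taken(bhr)                       # primary: indexed by the full BHR
    h2 = ((bhr << 1) | int(p1)) & mask    # history as if p1 had already resolved
    p2 = taken(h2)                        # secondary prediction
    h3 = ((h2 << 1) | int(p2)) & mask     # history as if p2 had already resolved
    p3 = taken(h3)                        # tertiary prediction
    return p1, p2, p3


# Tiny example with k = 2 (four counters).
pht = [3, 0, 3, 0]
print(predict_three(pht, bhr=0b00, k=2))  # (True, False, True)
```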

Multiple Branch Prediction

• Fetch address could be retrieved from BTB
• Predicted path: BB1 → BB2 → BB5
• How to fetch BB2 and BB5? BTB?
  – Can’t. The branch PCs of br1 and br2 are not available when the MBP is made
  – Use a BAC design

[Figure: binary tree of basic blocks. BB1 (ending in br1) leads to BB2 (ending in br2) and BB3, which in turn lead to BB4 to BB7. The fetch address (br0, primary prediction) selects the BTB entry; the 2nd prediction (T) selects BB2 and the 3rd prediction (F) selects BB5]

Branch Address Cache

• Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
• br: 2 bits for branch type (cond, uncond, return)
• V: single valid bit (to indicate if it hits a branch in the sequence)
• To make one more level of prediction
  – Need to cache another 8 addresses (i.e. total = 14 addresses)
  – 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8

[Figure: BAC entry, 212 bits per fetch address entry, indexed by the fetch address from the BTB]
  – Tag: 23 bits
  – V (1 bit) and br (2 bits): three pairs
  – Taken target address, Not-taken target address: 30 bits each
  – T-T, T-N, N-T, N-N addresses: 30 bits each
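The two entry sizes quoted on this slide (212 and 464 bits) follow from the same counting rule. The sketch below reproduces that arithmetic, assuming one (V, br) pair per branch whose outcome is predicted from the entry and one 30-bit address per reachable successor; the function name `bac_entry_bits` and its parameter names are hypothetical.

```python
def bac_entry_bits(extra_levels, tag_bits=23, addr_bits=30, flag_bits=3):
    """Bits per BAC entry when it supplies `extra_levels` additional predictions.

    flag_bits = V (1 bit) + br (2 bits), as on the slide.
    """
    # Addresses: 2 after the first branch, 4 after the second, 8 after the third...
    n_addrs = sum(2 ** level for level in range(1, extra_levels + 1))
    # One (V, br) pair per branch node that has successors stored in the entry.
    n_flags = 2 ** extra_levels - 1
    return tag_bits + flag_bits * n_flags + addr_bits * n_addrs


print(bac_entry_bits(2))   # 212 = 23 + 3*3 + 30*6
print(bac_entry_bits(3))   # 464 = 23 + 3*7 + 30*14
```

The address count doubles with every extra level, which is why the entry more than doubles in size for one more prediction.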

Caching Non-Consecutive Basic Blocks

• High fetch bandwidth + low latency

[Figure: basic blocks BB1 to BB5 scattered across lines when fetched from a conventional instruction cache, versus the same blocks stored back to back when fetched from a linear memory location]

Trace Cache

• Cache dynamic non-contiguous instructions (traces)
• Cross multiple basic blocks
• Need to predict multiple branches (MBP)

[Figure: fetching the dynamic sequence A B C D E F G H I J]
  – I$ fetch: 5 cycles
  – Collapsing buffer fetch: 3 cycles
  – T$ (trace cache) fetch: 1 cycle, since the trace cache stores A B C D E F G H I J as one line

Trace Cache [Rotenberg Bennett Smith MICRO‘96]

• Cache at most (in original paper)
  – M branches (M = 3 in all follow-up TC studies due to MBP), OR
  – N instructions (N = 16 in all follow-up TC studies)
• Fall-thru address if last branch is predicted not taken

[Figure: trace cache line and fill path]
  – Line fields: Tag, Br flag, Br mask, Fall-thru Address, Taken Address
  – The fetch address and the MBP output (Branch 1, Branch 2, Branch 3) are used for the lookup; on a T.C. hit the line supplies N instructions and M branches
  – On a T.C. miss, a line fill buffer collects basic blocks (BB1, BB2, BB3)
  – Br flag example, 10: 1st branch taken, 2nd branch not taken
  – Br mask example, 11, 1: "11" means 3 branches, "1" means the trace ends w/ a branch

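Trace Hit Logic

[Figure: trace hit logic for a fetch of address A]
  – Example line: Tag = A, BF = 10, Mask = 11,1, Fall-thru = X, Target = Y
  – Match 1st block: compare (=) the fetch address with the tag
  – Match remaining block(s): compare the multiple-branch predictor output (T/N) with the branch flags, under control of the mask
  – Trace hit: AND of the two matches
  – Next fetch address: a 0/1 mux selects between Fall-thru and Target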
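The hit logic can be sketched as a small function. This is a hedged reading of the figure, not a definitive implementation: it assumes the branch flags describe every branch except the last one, and that the prediction for the last branch picks the next fetch address. The `TraceLine` class, its field names and `trace_lookup` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TraceLine:
    tag: str                 # fetch address of the first block
    br_flags: tuple          # recorded outcomes of the interior branches
    n_branches: int          # from the branch mask ("11" = 3 branches)
    ends_in_branch: bool     # from the branch mask (trailing "1")
    fall_thru: str           # next address if the last branch is not taken
    target: str              # next address if the last branch is taken


def trace_lookup(line, fetch_addr, predictions):
    """Return the next fetch address on a trace hit, or None on a miss."""
    if line.tag != fetch_addr:                       # match 1st block
        return None
    interior = line.n_branches - 1 if line.ends_in_branch else line.n_branches
    if tuple(predictions[:interior]) != tuple(line.br_flags[:interior]):
        return None                                  # match remaining block(s)
    if line.ends_in_branch and predictions[interior]:
        return line.target
    return line.fall_thru


# Example line from the figure: Tag A, BF 10, Mask 11,1, Fall-thru X, Target Y.
line = TraceLine("A", (True, False), 3, True, "X", "Y")
print(trace_lookup(line, "A", (True, False, True)))    # Y
print(trace_lookup(line, "A", (True, False, False)))   # X
print(trace_lookup(line, "A", (False, False, True)))   # None
```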

Trace Cache Example

[Figure: control-flow graph of basic blocks A, B, C, D and Exit]
• Block sizes: A = 5 insts, B = 6 insts, C = 12 insts, D = 4 insts
• BB traversal path: ABDABDACDABDACDABDAC
• A trace ends on one of three conditions
  – Cond 1: 3 branches
  – Cond 2: Fill a trace cache line (16 instructions)
  – Cond 3: Exit

Trace Cache (5 lines):

  A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
  A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
  C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
  B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
  C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4
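The five lines of this example can be regenerated with a short simulation of trace construction. It is a minimal sketch under several assumptions: every block ends in a branch, a stored trace hits only when the upcoming instructions match it completely, and the cache never evicts. The function name `build_traces` is hypothetical. Under these assumptions the model also records a final one-instruction fragment (C12) when the path runs out, in addition to the five lines listed on the slide.

```python
def build_traces(path, sizes, max_branches=3, max_insts=16):
    """Build trace cache lines for a block path such as "ABDAC"."""
    # Dynamic instruction stream: (name, is_last_instruction_of_block).
    stream = [(f"{b}{i}", i == sizes[b])
              for b in path for i in range(1, sizes[b] + 1)]
    lines, pos = [], 0
    while pos < len(stream):
        upcoming = [name for name, _ in stream[pos:pos + max_insts]]
        hit = next((ln for ln in lines if upcoming[:len(ln)] == ln), None)
        if hit is not None:              # trace hit: consume the whole trace
            pos += len(hit)
            continue
        trace, branches = [], 0          # trace miss: construct a new line
        while (pos < len(stream) and len(trace) < max_insts
               and branches < max_branches):
            name, ends_block = stream[pos]
            trace.append(name)
            branches += ends_block       # Cond 1 counts block-ending branches
            pos += 1
        lines.append(trace)
    return lines


sizes = {"A": 5, "B": 6, "C": 12, "D": 4}
for trace in build_traces("ABDABDACDABDACDABDAC", sizes)[:5]:
    print(len(trace), " ".join(trace))
```

The second and fifth lines stop because they reach 16 instructions (Cond 2); the other three stop at the third branch (Cond 1).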

Trace Cache Example (continued)

• Trace cache is full once the five traces are stored
• How many hits?
• What is the utilization?

Redundancy

• Duplication
  – Note that instructions only appear once in I-Cache
  – Same instruction appears many times in TC
• Fragmentation
  – If 3 BBs < 16 instructions
  – If a multiple-target branch (e.g. return, indirect jump or trap) is encountered, stop “trace construction”
  – Empty slots → wasted resources
• Example
  – A single BB is broken up into (ABC), (BCD), (CDA), (DAB)
  – Duplicating each instruction 3 times
  – (ABC) = 16 inst, (BCD) = 13 inst, (CDA) = 15 inst, (DAB) = 13 inst

[Figure: loop of basic blocks A, B, C, D labeled with 6, 4, 6 and 3 instructions; the trace cache holds the four traces A B C, B C D, C D A and D A B]
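The duplication and fragmentation in this example can be quantified with a few lines of arithmetic. The block sizes below (A = 6, B = 4, C = 6, D = 3) are inferred from the four trace lengths on the slide rather than stated explicitly, and the 16-slot line size is taken from the earlier trace cache slides; all variable names are hypothetical.

```python
sizes = {"A": 6, "B": 4, "C": 6, "D": 3}      # inferred from the trace lengths
traces = ["ABC", "BCD", "CDA", "DAB"]
LINE_SLOTS = 16                               # N = 16 instructions per line

# Instructions stored in each trace.
lengths = {t: sum(sizes[b] for b in t) for t in traces}
# Number of trace cache copies of each basic block.
copies = {b: sum(t.count(b) for t in traces) for b in sizes}

used = sum(lengths.values())
total = LINE_SLOTS * len(traces)

print(lengths)                                # ABC=16, BCD=13, CDA=15, DAB=13
print(copies)                                 # every block is stored 3 times
print(f"{used} of {total} slots used, {total - used} empty")
print(f"{sum(sizes.values())} unique instructions occupy {used} slots")
```

Duplication shows up as the gap between unique instructions and used slots; fragmentation shows up as the empty slots.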

Indexability

• TC saved traces (EAC) and (BCD)
• Path: (EAC) to (D)
  – Cannot index interior block (D)
• Can cause duplication
• Need partial matching
  – (BCD) is cached, if (BC) is needed

[Figure: control-flow graph of basic blocks A, B, C, D, E, G; the trace cache holds the traces E A C and B C D]

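Pentium 4 (NetBurst) Trace Cache

• No I$ !!
• The trace cache holds decoded instructions
• Trace-based prediction (predict next-trace, not next-PC)

[Figure: front-end BTB, iTLB and prefetcher, and L2 cache feed the decoder; the decoder fills the trace cache (Trace $), which has its own BTB (Trace $ BTB) and feeds rename, execute, etc.]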
