lec6 computer architecture by hsien-hsin sean lee georgia tech -- instruction fetch

ECE 4100/6100 Advanced Computer Architecture Lecture 6 Instruction Fetch Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

Upload: hsien-hsin-lee

Post on 13-Feb-2017


Page 1: Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction Fetch

ECE 4100/6100 Advanced Computer Architecture

Lecture 6 Instruction Fetch

Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Page 2: Instruction Supply Issues

• Fetch throughput defines the maximum performance that can be achieved in later stages
• Superscalar processors need to supply more than 1 instruction per cycle
• Instruction supply is limited by
– Misalignment of multiple instructions in a fetch group
– Change of flow (interrupting instruction supply)
– Memory latency and bandwidth

[Figure: Instruction Fetch Unit, instruction buffer, Execution Core]

Page 3: Aligned Instruction Fetching (4 instructions)

• Assume one fetch group = 16B
• One 64B I-cache line, PC = ..xx000000
• Can pull out one row at a time (cycle n)

[Figure: I-cache array with row decoders; inst 1 to inst 4 are read from one row]

Row    Col 00   Col 01   Col 10   Col 11
..00   A0       A1       A2       A3
..01   A4       A5       A6       A7
..10   A8       A9       A10      A11
..11   A12      A13      A14      A15

Page 4: Misaligned Fetch

• One 64B I-cache line, PC = ..xx001000, so the fetch group starts at A2 instead of at a row boundary
• inst 1 and inst 2 (A2, A3) come from row ..00; inst 3 and inst 4 (A4, A5) come from row ..01
• A rotating network puts the four instructions back in program order, all in cycle n
• Example: IBM RS/6000

[Figure: same I-cache array as on Page 3 (rows ..00 to ..11 holding A0 to A15), with row decoders and a rotating network below the array]
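The misaligned fetch above can be sketched in a few lines. This is only an illustration under stated assumptions: 4-byte instructions (so a 16B group holds 4 of them), each column able to read its own row, and a left rotate standing in for the rotating network. The function name `misaligned_fetch` is hypothetical.

```python
INSTR_BYTES = 4    # assumption: fixed 4-byte instructions
GROUP = 4          # 16B fetch group = 4 instructions
LINE_INSTRS = 16   # 64B cache line = 16 instructions


def misaligned_fetch(line, pc):
    """Return the 4 instructions starting at pc, all from one cache line."""
    start = (pc % (LINE_INSTRS * INSTR_BYTES)) // INSTR_BYTES
    if start + GROUP > LINE_INSTRS:
        raise ValueError("fetch group crosses the cache line")
    row, col = divmod(start, GROUP)
    # Columns to the left of the start column read from the next row.
    raw = [line[(row + (1 if c < col else 0)) * GROUP + c] for c in range(GROUP)]
    # Rotate left by the start column to restore program order.
    return raw[col:] + raw[:col]


line_a = [f"A{i}" for i in range(LINE_INSTRS)]
print(misaligned_fetch(line_a, 0b001000))
```

With PC = ..xx001000 the columns deliver A4, A5, A2, A3 before rotation, and the rotate by two positions restores A2, A3, A4, A5.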

Page 5: Split Cache Line Access

• PC = ..xx111000, so the fetch group starts at A14 and runs past the end of cache line A
• inst 1 and inst 2 (A14, A15) are read from cache line A in cycle n
• inst 3 and inst 4 (B0, B1) are read from cache line B in cycle n+1
• The fetch is broken down into 2 physical accesses

[Figure: cache line A (A0 to A15) and cache line B (B0 to B7 shown) in the I-cache array, with row decoders]
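As a rough sketch of why a split access costs two physical accesses, the function below lists the cache lines touched by one fetch group. The 64B line and 16B group sizes come from the slides; the function name and the tuple layout are assumptions made for illustration.

```python
LINE_BYTES = 64    # one I-cache line
GROUP_BYTES = 16   # one fetch group


def physical_accesses(pc):
    """List (line_number, first_byte, last_byte) for each line a fetch group touches."""
    first_line = pc // LINE_BYTES
    last_line = (pc + GROUP_BYTES - 1) // LINE_BYTES
    accesses = []
    for line in range(first_line, last_line + 1):
        lo = max(pc, line * LINE_BYTES)
        hi = min(pc + GROUP_BYTES, (line + 1) * LINE_BYTES) - 1
        accesses.append((line, lo % LINE_BYTES, hi % LINE_BYTES))
    return accesses


print(physical_accesses(0b001000))  # misaligned, but still a single line
print(physical_accesses(0b111000))  # split across two lines
```

A group starting at byte offset 56 needs bytes 56 to 63 of one line and bytes 0 to 7 of the next, which is the two-access case on the slide. If the second line misses, the second access is delayed by the miss latency.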

Page 6: Split Cache Line Access Miss

• PC = ..xx111000, the same split access as on Page 5
• inst 1 and inst 2 are read from cache line A in cycle n
• Cache line B misses (the figure shows cache line C, C0 to C7, in the array instead)
• inst 3 and inst 4 arrive in cycle n+X

[Figure: cache line A (A0 to A15) and cache line C in the I-cache array, with row decoders]

Page 7: High Bandwidth Instruction Fetching

• Wider issue → more instruction feed
• Major challenge: to fetch more than one non-contiguous basic block per cycle
• Enabling technique?
– Predication
– Branch alignment based on profiling
– Other hardware solutions (branch prediction is a given)

[Figure: control flow graph of basic blocks BB1 to BB7]

Page 8: Predication Example

• Convert control dependency into data dependency
• Enlarge basic block size
– More room for scheduling
– No fetch disruption

Source code:

    if (a[i+1] > a[i])
        a[i+1] = 0;
    else
        a[i] = 0;

Typical assembly:

        lw   r2, [r1+4]      # r2 = a[i+1]
        lw   r3, [r1]        # r3 = a[i]
        blt  r3, r2, L1      # branch if a[i] < a[i+1]
        sw   r0, [r1]        # a[i] = 0
        j    L2
    L1: sw   r0, [r1+4]      # a[i+1] = 0
    L2:

Assembly w/ predication:

          lw   r2, [r1+4]    # r2 = a[i+1]
          lw   r3, [r1]      # r3 = a[i]
          sgt  p4, r2, r3    # p4 = (a[i+1] > a[i])
    (p4)  sw   r0, [r1+4]    # executes only if p4 is true
    (!p4) sw   r0, [r1]      # executes only if p4 is false
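To make the control-to-data dependency conversion concrete, here is a small Python model of the two assembly versions. It is a sketch, not a description of any real ISA: `guarded_store` is a hypothetical helper that stands in for a predicated store, which is always fetched but only takes effect when its predicate is true.

```python
def branchy(mem, i):
    """Model of the typical assembly: a branch picks one of two stores."""
    r2, r3 = mem[i + 1], mem[i]
    if r3 < r2:            # blt r3, r2, L1
        mem[i + 1] = 0     # L1: sw r0, [r1+4]
    else:
        mem[i] = 0         # sw r0, [r1]


def guarded_store(mem, addr, value, pred):
    """Hypothetical predicated store: nullified when pred is false."""
    if pred:
        mem[addr] = value


def predicated(mem, i):
    """Model of the predicated assembly: straight-line code, no branch."""
    r2, r3 = mem[i + 1], mem[i]
    p4 = r2 > r3                          # sgt p4, r2, r3
    guarded_store(mem, i + 1, 0, p4)      # (p4)  sw r0, [r1+4]
    guarded_store(mem, i, 0, not p4)      # (!p4) sw r0, [r1]


for data in ([1, 2], [2, 1], [3, 3]):
    a, b = list(data), list(data)
    branchy(a, 0)
    predicated(b, 0)
    print(data, a, b)
```

Both versions leave memory in the same state for every input; the predicated one simply has a single basic block, so the fetch unit never has to redirect.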

Page 9: Collapsing Buffer [ISCA 95]

• To fetch multiple (often non-contiguous) instructions

• Use interleaved BTB to enable multiple branch predictions

• Align instructions in the predicted sequential order

• Use banked I-cache for multiple line access

Page 10: Collapsing Buffer

[Figure: the fetch PC accesses the interleaved BTB and two I-cache banks (Cache Bank 1, Cache Bank 2); the bank outputs pass through the interchange switch and then the collapsing circuit]

Page 11: Collapsing Buffer Mechanism

[Figure: fetch of lines A and E through the collapsing buffer]

• Interleaved BTB: supplies the line addresses A and E
• Bank routing: E and A are routed to the two cache banks
• Cache banks: deliver E F G H and A B C D
• Interchange switch: reorders the lines to A B C D E F G H
• Collapsing circuit: uses the valid instruction bits to drop D, F and H, leaving A B C E G
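The interchange and collapse steps in this figure can be mimicked with two list operations. This is a minimal sketch assuming two banks and one valid bit per instruction slot; the function names and the `first_bank` parameter are made up for illustration and are not taken from the ISCA 95 design.

```python
def interchange_switch(bank_outputs, first_bank):
    """Put the line that comes first in predicted order in front."""
    return bank_outputs[first_bank] + bank_outputs[1 - first_bank]


def collapsing_circuit(instructions, valid_bits):
    """Squeeze out the slots whose valid bit is 0."""
    return [ins for ins, valid in zip(instructions, valid_bits) if valid]


# Banks deliver E F G H and A B C D, as in the figure.
bank_outputs = [list("EFGH"), list("ABCD")]
ordered = interchange_switch(bank_outputs, first_bank=1)

# Valid instruction bits for A B C D E F G H: D, F and H are not on the path.
valid = [1, 1, 1, 0, 1, 0, 1, 0]
fetched = collapsing_circuit(ordered, valid)
print(ordered, fetched)
```

The valid bits would come from the interleaved BTB predictions in the real design; here they are simply given as input.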

Page 12: High Bandwidth Instruction Fetching

• To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
• This requires multiple branch predictions per cycle

[Figure: control flow graph of basic blocks BB1 to BB7]

Page 13: Multiple Branch Predictor [Yeh, Marr, Patt ICS'93]

• Pattern History Table (PHT) design to support MBP
• Based on global history only

[Figure: the Branch History Register (BHR, bits bk … b1) indexes the Pattern History Table (PHT) to give the primary prediction p1. The secondary prediction is selected using p1, and the tertiary prediction using p1 and p2. An update path is also shown.]
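One way to read this figure is that each later prediction is looked up with the history extended by the earlier predictions. The sketch below models that behaviour sequentially; real hardware would read the candidate PHT entries in parallel and select among them. The class name, the 4-bit history length and the 2-bit saturating counters are assumptions chosen for illustration.

```python
class MultipleBranchPredictor:
    """Global-history predictor that returns several predictions per lookup."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0                            # Branch History Register
        self.pht = [2] * (1 << history_bits)    # 2-bit counters, weakly taken

    def predict(self, count=3):
        history, predictions = self.bhr, []
        for _ in range(count):
            taken = self.pht[history] >= 2
            predictions.append(taken)
            # Shift the prediction in speculatively to index the next lookup.
            history = ((history << 1) | int(taken)) & self.mask
        return predictions

    def update(self, outcomes):
        """Train the counters and the BHR with resolved branch outcomes."""
        for taken in outcomes:
            counter = self.pht[self.bhr]
            self.pht[self.bhr] = min(3, counter + 1) if taken else max(0, counter - 1)
            self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


mbp = MultipleBranchPredictor()
mbp.update([False] * 4)
print(mbp.predict())  # primary, secondary, tertiary predictions
```

Only global history is used, matching the slide: no branch address enters the index, which is what allows the second and third predictions to be made before the corresponding branches have been fetched.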

Page 14: Multiple Branch Prediction

• Fetch address could be retrieved from BTB
• Predicted path: BB1 → BB2 → BB5
• How to fetch BB2 and BB5? BTB?
– Can't. Branch PCs of br1 and br2 are not available when the MBP is made
– Use a BAC design

[Figure: binary tree of basic blocks. The fetch address (br0, primary prediction) comes from a BTB entry and leads to BB1. BB1 ends in br1: taken (2nd prediction) goes to BB2, not taken goes to BB3. BB2 ends in br2: taken goes to BB4, not taken (3rd prediction) goes to BB5. BB3 leads to BB6 (taken) or BB7 (not taken).]

Page 15: Branch Address Cache

• Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
• br: 2 bits for branch type (cond, uncond, return)
• V: single valid bit (to indicate if hits a branch in the sequence)
• To make one more level of prediction
– Need to cache another 8 more addresses (i.e. total = 14 addresses)
– 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8

BAC entry layout, indexed by the fetch address (from BTB), 212 bits per fetch address entry:

Field                                                  Width
Tag                                                    23 bits
V, br (3 copies)                                       1 + 2 bits each
Level 1: Taken Target Address                          30 bits
Level 1: Not-Taken Target Address                      30 bits
Level 2: T-T, T-N, N-T, N-N Addresses                  30 bits each
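The two entry sizes quoted on this slide (212 and 464 bits) can be checked with a short calculation. The grouping below is one reading that reproduces both numbers: level j beyond the primary prediction stores 2^j addresses, and one V/br field is kept per branch whose outcome still has to be predicted. The function name and that grouping are assumptions, not something the slide states explicitly.

```python
TAG_BITS = 23
ADDR_BITS = 30
V_BITS = 1
BR_BITS = 2


def bac_entry_bits(extra_levels):
    """Bits per BAC entry when predicting `extra_levels` branches past the primary one."""
    # Level j (1-based) has 2**j candidate fetch addresses.
    addresses = sum(2 ** j for j in range(1, extra_levels + 1))
    # One V/br field per branch that leads into level j: 2**(j-1) of them.
    vbr_fields = sum(2 ** (j - 1) for j in range(1, extra_levels + 1))
    return TAG_BITS + vbr_fields * (V_BITS + BR_BITS) + addresses * ADDR_BITS


print(bac_entry_bits(2))  # two more predictions: 6 addresses
print(bac_entry_bits(3))  # three more predictions: 14 addresses
```

The address count doubles with every extra level, so the entry size more than doubles for each additional prediction, which is the scaling problem the slide points at.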

Page 16: Caching Non-Consecutive Basic Blocks

• High Fetch Bandwidth + Low Latency

[Figure: two layouts of BB1 to BB5. "Fetch in Conventional Instruction Cache": the blocks are scattered across the cache. "Fetch in Linear Memory Location": the same blocks are stored back to back.]

Page 17: Trace Cache

• Cache dynamic non-contiguous instructions (traces)
• Cross multiple basic blocks
• Need to predict multiple branches (MBP)

[Figure: instructions A to K are spread over several I$ lines (A B C, D, E F G, H I J K). Fetching the sequence A B C D E F G H I J takes:]

• I$ fetch: 5 cycles
• Collapsing buffer fetch: 3 cycles
• T$ fetch: 1 cycle (the trace cache holds A B C D E F G H I J in one line)

Page 18: Trace Cache [Rotenberg, Bennett, Smith MICRO'96]

• Cache at most (in original paper)
– M branches (M = 3 in all follow-up TC studies due to MBP), OR
– N instructions (N = 16 in all follow-up TC studies)
• Fall-thru address is used if the last branch is predicted not taken

Trace cache line fields: Tag | Br flag | Br mask | Fall-thru Address | Taken Address

• Br flag example: 10 = 1st Br taken, 2nd Br not taken
• Br mask example: 11, 1
– 11: 3 branches
– 1: the trace ends w/ a branch

[Figure: the fetch address and the MBP access the trace cache. On a T.C. hit, N instructions covering M branches (BB1, BB2, BB3 with Branch 1 to Branch 3) are supplied. A line fill buffer is used for a T.C. miss.]

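Page 19: Trace Hit Logic

Example entry, with Fetch: A

Tag   BF   Mask   Fall-thru   Target
A     10   11,1   X           Y

• Match 1st block: the fetch address is compared (=) with the tag
• Match remaining block(s): the multiple branch predictions (T N in the example) are compared with the branch flags, conditioned on the mask
• Trace hit: AND of the two matches
• Next fetch address: a mux selects the fall-thru address (0) or the target (1) using the prediction for the last branch (N in the example)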
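The hit logic can be written out as a small function. This is a sketch under assumptions: the branch flags are taken to cover every branch except a trailing one, the mask is split into a branch count and an ends-in-branch bit, and all field and function names are hypothetical. The numeric addresses stand in for A, X and Y.

```python
from dataclasses import dataclass


@dataclass
class TraceLine:
    tag: int               # fetch address of the first block
    br_flags: tuple        # recorded directions, e.g. (True, False) for "10"
    num_branches: int      # from the branch mask ("11" is 3 on the slide)
    ends_in_branch: bool   # second part of the branch mask
    fall_thru: int
    target: int


def trace_hit(line, fetch_addr, predictions):
    """Return (hit, next_fetch_address) for one trace cache line."""
    if line.tag != fetch_addr:                       # match 1st block
        return False, None
    interior = line.num_branches - (1 if line.ends_in_branch else 0)
    if tuple(predictions[:interior]) != tuple(line.br_flags[:interior]):
        return False, None                           # remaining blocks differ
    if line.ends_in_branch and predictions[interior]:
        return True, line.target                     # last branch predicted taken
    return True, line.fall_thru


A, X, Y = 0x100, 0x200, 0x300
line = TraceLine(tag=A, br_flags=(True, False), num_branches=3,
                 ends_in_branch=True, fall_thru=X, target=Y)
print(trace_hit(line, A, (True, False, False)))
print(trace_hit(line, A, (True, True, False)))
```

In the slide's example the predictions T, N match the branch flags 10 and the last prediction N selects the fall-thru address X.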

Page 20: Trace Cache Example

[Figure: control flow graph. A branches to B or C; B and C both lead to D; D loops back to A or goes to Exit. Block sizes as they appear in the traces: A = 5 insts, B = 6 insts, C = 12 insts, D = 4 insts.]

BB Traversal Path: ABDABDACDABDACDABDAC

Trace termination conditions:
• Cond 1: 3 branches
• Cond 2: Fill a trace cache line (16 instructions)
• Cond 3: Exit

Trace Cache (5 lines):

Line 1: A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
Line 2: A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
Line 3: C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
Line 4: B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
Line 5: C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4
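The three termination conditions can be applied mechanically to the traversal path. The sketch below only segments the dynamic instruction stream into traces; it does not model trace cache lookup, hits or replacement. It assumes that the last instruction of each basic block is its branch and uses the block sizes A = 5, B = 6, C = 12, D = 4 read off the traces; the function names are hypothetical.

```python
BLOCK_SIZES = {"A": 5, "B": 6, "C": 12, "D": 4}
MAX_BRANCHES = 3    # Cond 1
MAX_INSTRS = 16     # Cond 2


def build_traces(path):
    """Split a basic-block path into traces using the three conditions."""
    stream = []
    for bb in path:
        size = BLOCK_SIZES[bb]
        for k in range(1, size + 1):
            stream.append((f"{bb}{k}", k == size))   # (name, is_branch)
    traces, current, branches = [], [], 0
    for name, is_branch in stream:
        current.append(name)
        branches += is_branch
        if branches == MAX_BRANCHES or len(current) == MAX_INSTRS:
            traces.append(current)
            current, branches = [], 0
    if current:                                      # Cond 3: exit
        traces.append(current)
    return traces


def distinct_in_order(traces):
    """Distinct traces in order of first appearance."""
    seen = []
    for trace in traces:
        if trace not in seen:
            seen.append(trace)
    return seen


traces = build_traces("ABDABDACDABDACDABDAC")
for trace in distinct_in_order(traces):
    print(len(trace), " ".join(trace))
```

Under these assumptions the first five distinct traces are the five lines listed above, and the final trace produced at Exit is the single instruction C12.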


Page 22: Trace Cache Example

[Figure: same control flow graph, block sizes, traversal path (ABDABDACDABDACDABDAC) and trace cache contents as on Page 20; the line C12 D1 D2 D3 D4 A1 A2 A3 A4 A5 is highlighted]

• Trace Cache is Full: all 5 lines are occupied

Page 23: Trace Cache Example

[Figure: same control flow graph, block sizes, traversal path (ABDABDACDABDACDABDAC) and trace cache contents as on Page 22; C12 is drawn over the line C12 D1 D2 D3 D4 A1 A2 A3 A4 A5]

• How many hits?
• What is the utilization?

Page 24: Redundancy

• Duplication
– Note that instructions only appear once in I-Cache
– Same instruction appears many times in TC
• Fragmentation
– If 3 BBs < 16 instructions
– If a multiple-target branch (e.g. return, indirect jump or trap) is encountered, stop "trace construction"
– Empty slots → wasted resources
• Example
– A single loop of basic blocks A, B, C, D is broken up into (ABC), (BCD), (CDA), (DAB)
– Duplicating each instruction 3 times

Trace    Size
(ABC)    16 inst
(BCD)    13 inst
(CDA)    15 inst
(DAB)    13 inst

[Figure: loop A → B → C → D → A with block sizes A = 6, B = 4, C = 6, D = 3, and a trace cache holding the lines A B C, B C D, C D A and D A B]
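The duplication and fragmentation in this example can be quantified directly. The block sizes A = 6, B = 4, C = 6, D = 3 are the ones consistent with the four trace sizes on the slide; the 16-slot line width is taken from the earlier trace cache slides, and the variable names are only illustrative.

```python
BLOCKS = {"A": 6, "B": 4, "C": 6, "D": 3}
TRACES = ["ABC", "BCD", "CDA", "DAB"]
LINE_SLOTS = 16   # instructions per trace cache line

trace_sizes = {t: sum(BLOCKS[b] for b in t) for t in TRACES}
unique_instrs = sum(BLOCKS.values())          # what the I-cache would hold
stored_instrs = sum(trace_sizes.values())     # what the trace cache holds
duplication = stored_instrs / unique_instrs   # copies per instruction
empty_slots = LINE_SLOTS * len(TRACES) - stored_instrs

print(trace_sizes)
print(unique_instrs, stored_instrs, duplication, empty_slots)
```

The trace cache stores 57 instruction slots for 19 unique instructions, so each instruction appears 3 times, and 7 of the 64 slots in the four lines stay empty.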

Page 25: Indexability

• TC saved traces (EAC) and (BCD)
• Path: (EAC) to (D)
– Cannot index interior block (D)
• Can cause duplication
• Need partial matching
– (BCD) is cached, if (BC) is needed

[Figure: control flow graph with blocks A, B, C, D, E and G, and a trace cache holding the lines E A C and B C D]

Page 26: Pentium 4 (NetBurst) Trace Cache

• No I$ !!
• The trace cache holds decoded instructions
• Trace-based prediction (predict next-trace, not next-PC)

[Figure: front-end BTB, iTLB and prefetcher, L2 cache, decoder, trace $ with its own trace $ BTB, followed by rename, execute, etc.]