nc state university center for efficient, scalable, and reliable computing department of electrical...

29
NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University EXACT: Explicit Dynamic-Branch Prediction with Active Updates Muawya Al-Otoom, Elliott Forbes, Eric Rotenberg

Upload: alijah-blackshaw

Post on 02-Apr-2015

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Center for Efficient, Scalable, and Reliable ComputingDepartment of Electrical & Computer Engineering

North Carolina State University

EXACT:Explicit Dynamic-Branch Prediction

with Active Updates

Muawya Al-Otoom, Elliott Forbes, Eric Rotenberg

Page 2: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 2© Eric Rotenberg

Branch Prediction and Performance

1K-entry instruction window

128-entry issue queue (similar to Nehalem)

4-wide fetch/issue/retire

16 KB gshare

Randomly remove 0% to 100% of mispredictions

0

0.5

1

1.5

2

2.5

3

3.5

4

% of remaining branch mispredicts removed

IPC

bzip2gziptwolf

89%

91%

95%

Page 3: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 3© Eric Rotenberg

while ( . . . ){ . . . for (i = 0; i < 16; i++) { if (a[i] > 10) { . . . } else { . . . } } . . .}

LOAD r5 = a[i]X: BRANCH r5 <= 10

Example

Page 4: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 4© Eric Rotenberg

Example (cont.)

LOAD

BRANCH

4 11 15 2 7 1 3 52 9 3 8 5 55 833

a[14]a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10] a[11] a[12] a[13]

1 0 0 1 1 1 1 0 1 1 1 1 0 10

18

a[15]

0

0 0 1

X

+

PC

GHR

0

PHT

Page 5: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 5© Eric Rotenberg

Good Scenario #1

LOAD

BRANCH

4 11 15 2 7 1 3 52 9 3 8 5 55 833

a[14]a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10] a[11] a[12] a[13]

1 0 0 1 1 1 1 0 1 1 1 1 0 10

18

a[15]

0

0 0 1

X

+

PC

GHR

0

PHT a[4] has unique GHR context. GHR implicitly identifies a[4]. Dedicated prediction for a[4].

dedicatedpredictionfor a[4]

Page 6: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 6© Eric Rotenberg

Good Scenario #2

LOAD

BRANCH

4 11 15 2 7 1 3 52 9 3 8 5 55 833

a[14]a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10] a[11] a[12] a[13]

1 0 0 1 1 1 1 0 1 1 1 1 0 10

18

a[15]

0

a[6], a[10] have same GHR context. GHR does not distinguish a[6], a[10]. Shared prediction for a[6], a[10]. Fortunately, they have same outcome.

sharedpredictionfor a[6], a[10]

1 0 1

X

+

PC

GHR

1

PHT

Page 7: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 7© Eric Rotenberg

Bad Scenario #1

LOAD

BRANCH

4 11 15 2 7 1 3 52 9 3 8 5 55 833

a[14]a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10] a[11] a[12] a[13]

1 0 0 1 1 1 1 0 1 1 1 1 0 10

18

a[15]

0

a[6], a[15] have same GHR context. GHR does not distinguish a[6], a[15]. Shared prediction for a[6], a[15]. Problem: they have different outcomes.

sharedpredictionfor a[6], a[15]

1 0 1

X

+

PC

GHR

?

PHT

Page 8: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 8© Eric Rotenberg

Bad Scenario #1 (cont.)

LOAD

BRANCH

4 11 15 2 7 1 3 52 9 3 8 5 55 833

a[14]a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10] a[11] a[12] a[13]

1 0 0 1 1 1 1 0 1 1 1 1 0 10

18

a[15]

0

Indirect solution Use longer history. GHR now distinguishes a[6], a[15].

0 1 0

X

+

PC

GHR

PHT

1 0

X

+

PC

GHR

1

1

11 0

dedicated prediction for a[6]

dedicated prediction for a[15]

Page 9: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 9© Eric Rotenberg

Bad Scenario #1 (cont.)

LOAD

BRANCH

4 11 15 2 7 1 3 52 9 3 8 5 55 833

a[14]a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10] a[11] a[12] a[13]

1 0 0 1 1 1 1 0 1 1 1 1 0 10

18

a[15]

0

Indirect solution Use longer history. GHR now distinguishes a[6], a[15]. Ambiguity not eradicated: a[10], a[15].

PHT

1 0

X

+

PC

GHR 11

? sharedpredictionfor a[10], a[15]

Direct solution Use address of element. Dedicated predictions for different elements.

Page 10: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 10© Eric Rotenberg

Bad Scenario #2

LOAD

BRANCH

4 11 15 2 7 1 3 52 9 3 8 5 55 833

a[14]a[0] a[1] a[2] a[3] a[4] a[5] a[6] a[7] a[8] a[9] a[10] a[11] a[12] a[13]

1 0 0 1 1 1 1 0 1 1 1 1 0 10

18

a[15]

0

0 0 1

X

+

PC

GHR

0

PHT STORE a[4] = 6

stalepredictionfor a[4]

6

1

Conventional passive updates Predictor’s entry is stale after store Retrained after suffering misprediction

Solution: Active updates

Page 11: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 11© Eric Rotenberg

Big Picture

branchpredictor

memory

Proposedbranchpredictor

memory

Conventional

Branch predictor should mirror a program’s objects

Page 12: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 12© Eric Rotenberg

Big Picture Branch predictor should mirror a program’s objects Branch predictor should mirror changes as they happen

branchpredictor

memory

Conventional

Store

branchpredictor

memory

Proposed

Store

Page 13: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 13© Eric Rotenberg

Characterize Mispredictions

Bad Scenario #1 Measure how often global branch history does

not distinguish different dynamic branches that have different outcomes

Bad Scenario #2 Measure how often stores cause stale

predictions Evaluate for two history-based predictors

Very large gselect Very large L-TAGE

Page 14: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 14© Eric Rotenberg

Characterize Mispredictions

Bad Scenario #1

Page 15: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 15© Eric Rotenberg

Characterize Mispredictions

Bad Scenario #2

Note:Predominance of Bad Scenario #1 may obscure occurrences of Bad Scenario #2.

Page 16: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 16© Eric Rotenberg

Problems and Solutions Two problems:

{PC, global branch history} does not always distinguish different dynamic branches that have different outcomes

Stores to memory change branch outcomes

Two solutions: Explicitly identify dynamic branches to provide dedicated

predictions for them (EX)o branch ID = hash(PC, load addresses)

Stores “actively update” the branch predictor (ACT)

Page 17: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 17© Eric Rotenberg

Bad Scenario #2

Bad Scenario #1

Load-dependent branches use explicit predictor Index with branch ID (EX) Index with branch ID and perform active updates (EXACT)

Other branches use the default predictor (e.g., L-TAGE)

Page 18: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 18© Eric Rotenberg

3 Implementation Challenges

1. Indexing the explicit predictor Branch ID unknown at fetch time Loads are unlikely to have computed their addresses

by the time the branch is fetched

2. Large explicit predictor Many different IDs contribute to mispredictions Predictor size normally limited by cycle time

3. Large active update unit Too large to implement with dedicated storage

Page 19: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 19© Eric Rotenberg

Implementation Challenge #1:Indexing the Explicit Predictor

Insight: sequences of IDs repeat due to traversing data structures

Use ID of a prior retired branch at a fixed distance N

Page 20: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 20© Eric Rotenberg

0%

2%

4%

6%

8%

10%

12%

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30distance

mis

pre

dic

tio

n r

ate

%

bzip2 crafty gzip twolf

0%

2%

4%

6%

8%

10%

12%

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30distance

mis

pre

dic

tio

n r

ate

%

bzip2 crafty gzip twolf

IDEAL:Assumes ID of Nth branch away is always available.

REAL:Use explicit predictor only if Nth branch away is retired.Otherwise use default predictor.

(N)

(N)

use IDof branchitself

Page 21: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 21© Eric Rotenberg

Need large explicit predictor with fast cycle time Indexing with prior retired branch IDs makes it

easily pipelinable In general, pipelining is straightforward if the

index does not depend on immediately preceding predictions

Implementation Challenge #2:Large Explicit Predictor

Page 22: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 22© Eric Rotenberg

++

++

++

GBQ

ID1+

BHR

ID2+

BHR

ID3+

BHR

CYCLE 1 CYCLE 2 CYCLE 3 CYCLE 4 CYCLE 5 CYCLE 6 CYCLE 7

Implementation Challenge #2:Large Explicit Predictor

Page 23: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 23© Eric Rotenberg

Most benchmarks are tolerant of 400+ cycles of active update latency Large distance between stores and re-encounters of the branches

they update

Implementation Challenge #3:Large Storage for Active Updates

0%

2%

4%

6%

8%

10%

12%

0 100 200 300 400 500 600 700 800 900 1000active update latency (cycles)

mis

pre

dic

tio

n r

ate

%

bzip2 crafty gzip twolf

Page 24: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 24© Eric Rotenberg

Exploit “Predictor Virtualization” [Burcea et al.] Eliminate significant amount of dedicated storage Use small L1 table backed by full table in physical

memory The full table in physical memory is transparently

cached in the general-purpose memory hierarchy (e.g., L2$)

Implementation Challenge #3:Large Storage for Active Updates

Page 25: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 25© Eric Rotenberg

Putting it all together

Processor PipelineFetch Retire

Explicit Predictor

ID Gen

Default Predictor

past ID for predicting current branch

passive updates

stores

active updates

Active Update

Unit

1

2

3

Page 26: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 26© Eric Rotenberg

gzip

4%

5%

6%

7%

8%

9%

1 10 100 1000 10000cost (K-Bytes)

mis

pre

dic

tio

n r

ate

%

gsharegshare+EXACTgshare+EXACT (VIRT. SACT)L-TAGEL-TAGE+EXACTL-TAGE+EXACT (VIRT. SACT)

Page 27: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 27© Eric Rotenberg

Active Update Unit

First proposal to use store instructions to update the branch predictor

Store instructions might: Change a branch outcome in the explicit predictor Change a trip-count in the explicit loop predictor

Two mechanisms required: Convert store address into a predictor index Convert store value into a branch-outcome or trip-count

Page 28: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 28© Eric Rotenberg

Active Update Example0x0400STORE 0x20 X

.

.

.

0x0800LOAD R5 X

0x0808BLE R5, 0x30, target

Explicit Predictor

Single-Address Conversion Table (SACT)

X

0xABCD

0xABCD0x0808

Ranges Reuse Table (RRT)

T min T max N min N max0x0808 0x10

compare

T

PC index

0x30 0x31 0x50

0x20

N

Page 29: NC STATE UNIVERSITY Center for Efficient, Scalable, and Reliable Computing Department of Electrical & Computer Engineering North Carolina State University

NC STATE UNIVERSITY

Int’l Conference on Computing Frontiers, May 16-19, 2010 29© Eric Rotenberg

Future Work

• Rich ISA support– Empower compiler or programmer to directly index

into the explicit predictor, directly manage it, and directly manage the instruction fetch unit

– Close gap between real and ideal indexing– More efficient hardware