
PUMA2: Bridging the CPU/Memory Gap through Prediction & Speculation

Babak Falsafi

Team Members: Chi Chen, Chris Gniady, Jangwoo Kim, Tom Wenisch, Se-Hyun Yang

Alumni: Cem Fide, An-Chow Lai

Impetus Group
Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/~impetus


Our Group’s Research Focus

[Figure: platforms from handheld to server, spanning memory design, processor design, and the network]

Design issues:
- Performance
- Power
- Reliability
- Programmability


Impetus Projects

Today’s talk: PUMA2: Bridging the CPU/Memory Gap

Others:

1. PowerTap: Power-Aware Computer Systems

2. JITR: Soft-error Tolerant Microarchitecture

3. GigaTrans: Beyond Superscalar & ILP

Goals: Impact products, research, and education in architecture
E.g., Reactive NUMA => Sun WildFire DSM


Outline

- Impetus Overview
- PUMA2
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering


Hitting the Memory Wall

Growing distance:
- Processors getting faster
- Memory only getting larger

Caching less effective:
- Simplistic (demand) fetch/replace
- Deeper hierarchies => higher worst-case access latencies
- Multiple hierarchies in multiprocessors!

50% processor utilization in servers [Ailamaki, VLDB’99]
(commercial databases running on a Xeon server)


Conventional Data Demand Fetch

Fetch data upon CPU request:
- zero lookahead upon miss
- crude guess for replacement

Works only when the working set:
- fits in L1/L2
- changes infrequently

Out-of-order core can at best tolerate L1-L2 latency

[Figure: memory hierarchy: CPU -> L1 (2 clk) -> L2 (10 clk) -> L3 (50 clk) -> Memory (500 clk)]
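To make these numbers concrete, they can be folded into the standard average-memory-access-time recurrence. The miss ratios below are illustrative placeholders, not measurements from the talk:

AMAT = 2 + m_L1 * (10 + m_L2 * (50 + m_L3 * 500)) clk

With, say, m_L1 = 5%, m_L2 = 20%, m_L3 = 50%, AMAT = 2 + 0.05 * (10 + 0.2 * (50 + 0.5 * 500)) = 5.5 clk: even when the 500-clk memory is rarely reached, it dominates the worst case, which is why zero-lookahead demand fetch is costly.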


PUMA2: Proactively Uniform Memory Access Architecture

Goal: Bridge the CPU/Memory Performance Gap

How?
- Prediction and speculation
- Hide/tolerate memory latency
- Hardware techniques transparent to software


This Talk

1. Last-touch memory access model
- predict the last processor reference
- evict and fetch upon the last reference
+ significantly enhances fetch lookahead

2. Speculative memory ordering
- overlap accesses to tolerate latency
- but overlapping accesses affects memory order
- we show hardware can both relax & enforce order


Outline

- Impetus Overview
- PUMA2
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering


Hiding the Memory Latency Using Prediction/Speculation

Mechanisms required (a minimal interface sketch follows the list):

1. Predict “what” memory address to fetch
   Goal: minimize traffic, avoid thrashing, etc.

2. Predict “when” to fetch
   Goal: maximize latency hiding

3. Storage: “where” fetched data is placed
   Goal: avoid lookups in auxiliary structures
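As a sketch of how these three decisions separate, here is a hypothetical C++ interface; the type and member names are mine, not from the talk:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical interface separating the three prediction problems the
// slide names: what to fetch, when to fetch it, and where to put it.
struct PrefetchDecision {
    uint64_t addr;       // "what": predicted block address
    bool     fetch_now;  // "when": issue immediately vs. wait
    bool     into_l1;    // "where": directly into L1 vs. a side buffer
};

class Prefetcher {
public:
    virtual ~Prefetcher() = default;
    // Observe every access; return a decision when one can be made.
    virtual std::optional<PrefetchDecision>
    observe(uint64_t pc, uint64_t addr, bool was_miss) = 0;
};
```

The prefetchers on the following slides (Markov, DBCP) can be read as different policies for filling in this decision triple.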


Current Proposals for Data Prefetching

Custom prefetchers:
- stride, stream, dependence-based, etc.

General-purpose prefetchers:
- precomputation/slipstream prefetchers
- address-correlating prefetchers

Key shortcomings:
- insufficient lookahead (e.g., 10~100 cycles)
- low accuracy for general access patterns
- cannot place directly in L1 (must use buffers)


Related Work: Address Correlating Prefetchers

Markov Prefetchers (Joseph & Grunwald, ISCA’97)
Predict “what”: correlate L1 miss addresses
Predict “when”: consecutive L1 misses
+ High prediction coverage
– Clustered L1 misses => insufficient lookahead
– One-to-many predictions => low accuracy
– Prefetch into a buffer => high prefetch hit time
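A minimal sketch of the Markov idea, assuming a single stored successor per miss address (the actual design keeps several predictions per entry, which is the source of its one-to-many inaccuracy):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Minimal Markov-style miss-address correlator: remember which miss
// address followed each miss address, and predict it the next time.
class MarkovPrefetcher {
    std::unordered_map<uint64_t, uint64_t> next_miss_; // miss -> next miss
    std::optional<uint64_t> last_miss_;
public:
    // Called on every L1 miss; returns a block address to prefetch
    // into a side buffer (not L1), per the original design.
    std::optional<uint64_t> on_miss(uint64_t addr) {
        if (last_miss_) next_miss_[*last_miss_] = addr; // train
        last_miss_ = addr;
        auto it = next_miss_.find(addr);
        if (it == next_miss_.end()) return std::nullopt;
        return it->second; // predict
    }
};
```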


Insufficient Lookahead in Correlating Prefetchers

Consecutive L1 misses are often clustered
Exacerbated in out-of-order cores

[Figure: access stream
  load/store A1 (miss)
  load/store A1 (hit)
  load/store C3 (miss)
  ...
  load/store A3 (miss)
With fetch-on-miss, the lookahead for A3 spans only from the C3 miss to the A3 miss]


We Propose Fetch on Last Touch

Predict & fetch on last touch
+ Evict dead block
+ Enhance fetch lookahead
+ Fetch directly into L1

[Figure: same access stream (A1 miss, A1 hit, C3 miss, ..., A3 miss); fetching on the last touch of A1 starts the fetch of A3 at the A1 hit rather than at the preceding miss, greatly lengthening lookahead versus fetch-on-miss]


Enhancing Fetch Lookahead

[Figure: cumulative distribution of lookahead in processor cycles (2 up to >2048), comparing the interval between last touch & next miss (our proposal) with the interval between two misses (Markov); the L2 and memory latencies are marked for reference]


Dead-Block Prediction [ISCA’01]

Correlate a trace of memory accesses to a block
Uniquely identify different dead times

Access stream to a block frame:
  PC0: load/store A0 (hit)
  PC1: load/store A1 (miss)   <- first touch
  PC3: load/store A1 (hit)
  PC3: load/store A1 (hit)    <- last touch
  PC5: load/store A3 (miss)

Trace = (PC1, PC3, PC3)
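The trace can be folded into a fixed-width signature as accesses arrive; the next slides name truncated addition as the encoding, while the width and the running-sum formulation here are assumptions:

```cpp
#include <cstdint>

// Fold the PCs that touch a block frame into a fixed-width signature
// by truncated addition, as the slides describe. The width is assumed.
struct TraceSignature {
    static constexpr unsigned kBits = 16;  // assumed signature width
    uint32_t value = 0;
    void reset(uint64_t first_pc) { value = 0; add(first_pc); }
    void add(uint64_t pc) {
        value = (value + static_cast<uint32_t>(pc)) & ((1u << kBits) - 1);
    }
};
// E.g., trace (PC1, PC3, PC3): reset(PC1); add(PC3); add(PC3);
```

Equal traces always collapse to equal signatures, with some aliasing introduced by the truncation.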


Miss-Address Prediction

Correlate the last 2 miss addresses within a cache block frame

Same access stream:
  PC0: load/store A0 (hit)
  PC1: load/store A1 (miss)
  PC3: load/store A1 (hit)
  PC3: load/store A1 (hit)
  PC5: load/store A3 (miss)

Trace = (PC1, PC3, PC3)
Correlation = (A0, A1) => (A3)
Full prediction entry: (A0, A1, PC1, PC3, PC3) => (A3)


Dead-Block Correlating Prefetcher

[Figure: on the current access (latest block A1), the History Table entry (A0, PC1, PC3) is extended with the encoded PC3; the resulting key (A0, A1, PC1, PC3, PC3) indexes the Correlating Prediction Table, whose entry points to A3 => prefetch A3]

Two-level prediction table:
- History Table
- Correlating Prediction Table
Encoding: truncated addition
Two-bit saturating counter
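Putting the pieces together, here is a software sketch of the two-level structure. Only the History Table / Correlating Prediction Table split, the truncated-addition encoding, and the 2-bit counter come from the slides; the key hash, table organization, and confidence threshold below are assumptions:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Sketch of DBCP: a History Table tracks the running (addresses,
// PC-trace) state per block frame; a Correlating Prediction Table
// maps that state to the next miss address, guarded by a 2-bit
// saturating counter. `addr` is a block address.
class DBCP {
    struct HistEntry { uint64_t prev_addr = 0, cur_addr = 0; uint32_t sig = 0; };
    struct PredEntry { uint64_t next_addr; uint8_t counter; }; // 2-bit
    std::unordered_map<uint64_t, HistEntry> history_; // per block frame
    std::unordered_map<uint64_t, PredEntry> table_;   // key -> next miss

    static uint64_t key(const HistEntry& h) {         // assumed hash
        return h.prev_addr ^ (h.cur_addr << 1) ^ h.sig;
    }
public:
    // Observe an access to `frame`; on a predicted last touch, return
    // the address to evict-and-prefetch directly into L1.
    std::optional<uint64_t> access(uint64_t frame, uint64_t addr, uint64_t pc) {
        HistEntry& h = history_[frame];
        if (addr != h.cur_addr) {            // miss: a new block arrives
            uint64_t k = key(h);             // full trace of dying block
            auto it = table_.find(k);
            if (it != table_.end() && it->second.next_addr == addr) {
                if (it->second.counter < 3) ++it->second.counter; // correct
            } else {
                table_[k] = {addr, 1};       // (re)train the correlation
            }
            h.prev_addr = h.cur_addr; h.cur_addr = addr; h.sig = 0;
        }
        h.sig = (h.sig + static_cast<uint32_t>(pc)) & 0xFFFF; // trunc. add
        auto it = table_.find(key(h));       // does this look like a last touch?
        if (it != table_.end() && it->second.counter >= 2)
            return it->second.next_addr;
        return std::nullopt;
    }
};
```

A caller would invoke access(frame, block_addr, pc) on every L1 access and, whenever it returns an address, evict the now-dead block and fetch the returned block straight into L1.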


Methodology

Simulated using SimpleScalar 3.0:
- 2 GHz, 8-issue, 128-entry window
- 32K, direct-mapped, 1-cycle L1D
- 1M, 4-way, 12-cycle L2
- 70-cycle memory
- 2M, 8-way, 24-cycle prediction table
- 128-entry prefetch buffer (for Markov only)

Memory-intensive integer, floating-point, and linked-data apps
14 benchmarks: 5 Olden, 4 SpecINT, 5 SpecFP


Dead-Block Coverage and Accuracy

[Figure: fraction of all misses that are predicted, still training, or mispredicted, for bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim]

DBP predicts 90% of last touches and mispredicts only 4%


Miss-Address Prediction

[Figure: fraction of all misses predicted / training / mispredicted for the same 14 benchmarks, with one Markov (M) bar and one DBCP (D) bar each; some Markov misprediction bars exceed 190%]

DBCP predicts 82% of misses, mispredicting only 3%
Markov (Joseph & Grunwald) predicts 81%, but its mispredictions total 229% (one-to-many predictions)


Memory Stall Time Reduction

[Figure: fraction of memory stall time reduced by Markov vs. DBCP across the 14 benchmarks]

DBCP reduces memory stall time by 62% on average
Markov reduces memory stall time by only 30%


DBCP vs. Larger On-Chip L2

[Figure: fraction reduced (-40% to 100%) across the 14 benchmarks for three configurations: an 18-cycle 3M L2, DBCP with its 24-cycle 2M prediction table, and a 12-cycle 3M L2]


Conclusions

Dead-Block Predictors (DBP):
- Predict when to evict a block
- Enable timely prefetching
- Can prefetch into the L1 cache
- High coverage of 90%, mispredicting only 4%

Dead-Block Correlating Prefetchers (DBCP):
- Accurate and timely prefetch
- Reduce memory stall time by 62%


Other Mechanisms in PUMA2

Self-invalidation predictors [ISCA’00]:
- Predict when to self-invalidate in multiprocessors
- Convert 3-hop latencies to 2-hop

Memory sharing predictors [ISCA’99]:
- Predict subsequent sharers of a block
- Powerful mechanism to move data

Both exhibit high coverage and accuracy


Outline

- Impetus Overview
- PUMA2
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering


Sequential Consistency (SC) [Lamport]

What programmers want (SC): memory should appear in program order & atomic

e.g., critical section:
  lock
  modify data
  unlock

+ intuitive programming
– extremely slow!

[Figure: processors P, P, ..., P connected to a single Shared Memory]


What Machines Provide (RC)

Release Consistency (RC) [Gharachorloo, et al.]:
- Overlap remote accesses
- Software enforces order (when needed)
  e.g., first lock, then data, via special “ordering” instructions
- Allows any (re-)ordering, as in e.g. IA-64, SPARC

+ high performance
– complicates programming

[Figure: processors overlap accesses to Shared Memory]
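To see what the slide’s special “ordering” instructions buy, here is the classic flag-passing idiom expressed with C++11 atomics (which postdate this 2002 talk but encode the same acquire/release ordering that RC machines such as SPARC and IA-64 expose):

```cpp
#include <atomic>

int data = 0;                    // ordinary (unordered) shared data
std::atomic<bool> ready{false};  // the "flag" guarding it

void producer() {
    data = 42;                                    // plain store
    ready.store(true, std::memory_order_release); // ordering instruction:
                                                  // prior stores visible first
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { } // pairs with release
    // Under RC, without the acquire/release pair the machine could
    // legally reorder these accesses and read stale data here.
    int observed = data;                          // guaranteed to see 42
    (void)observed;
}
```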


Can We Have SC Programming With RC Performance?

Observation: SC only requires that execution appear in program order;
order is needed only when others race to access

SC hardware can emulate RC iff it can:
- overlap accesses speculatively
- keep a log of the computation in program order
- roll back in case of a race

+ no help from software => SC programming
+ infrequent rollback => better than RC performance


Related Work: Hiding the Store Latency

A number of SC optimizations:

1. Multiple pending prefetches
   Commit to L1 in order [Gharachorloo et al.]
   E.g., MIPS R10000’s pending misses

2. Relaxing order within the ROB
   Speculative loads [Gharachorloo et al.]
   E.g., MIPS R10000’s speculative loads

3. Extensions to the ROB
   Speculative retirement [Ranganathan et al.]

Limited speculation in small associative buffers!


Execution in SC Memory System

[Figure: the reorder buffer holds WR X, RD Y, RD Z, WR A, RD A, and an ALU op; the memory queue retires in order, so WR X (miss) stalls at the head while WR A, RD A, RD Y sit idle behind it]

- WR X, RD Y, RD Z access remote memory
- X, Y, Z, A are unrelated and need not be ordered
- WR X blocks the pipeline for hundreds of cycles
- Cannot overlap RD Y & RD Z with WR X


Execution in RC Memory System

[Figure: the memory queue issues WR X (miss), RD Y (miss), RD Z (miss) out of order, while WR A, RD A, and the ALU op complete in the reorder buffer]

+ Accesses to A complete while WR X is pending
+ Overlaps remote accesses to X, Y, Z
– Software must guarantee that X, Y, Z, A are unrelated


Speculatively & Fully Relaxing Order

With Vijaykumar [ISCA’99]

H/W support for relaxing all order:

- Storage to tolerate long latencies
  - old processor state
  - old memory state

- Fast lookup to detect possible order violations
  upon cache invalidations and replacements

- Infrequent rollbacks
  - typical of well-behaved applications
  - rollbacks are due to false sharing or data races


SC++: A Design for Speculative SC

- SHiQ: back up computation in a queue
- BLT: quick lookup to detect races

[Figure: as in the SC example, WR X, RD Y, RD Z miss in the memory queue, but WR A, RD A, and the ALU op retire speculatively into the Speculative History Queue; the Block Lookup Table tracks the blocks containing A and Y & Z, detecting races from directory accesses]
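As a toy software model of the SHiQ/BLT bookkeeping (SC++ itself is a hardware design; the granularity, names, and the restore hook here are illustrative assumptions):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>

// Toy model of SC++ bookkeeping; illustrative only.
class SpeculativeCore {
    struct LogEntry {            // one speculatively retired op
        uint64_t block;          // cache block it touched
        uint64_t old_value;      // state to restore on rollback
    };
    std::deque<LogEntry> shiq_;          // Speculative History Queue
    std::unordered_set<uint64_t> blt_;   // Block Lookup Table
public:
    // Retire an op past a pending miss: log old state, note the block.
    void retire_speculatively(uint64_t block, uint64_t old_value) {
        shiq_.push_back({block, old_value});
        blt_.insert(block);
    }
    // The pending misses completed with no race: discard the history.
    void commit_all() { shiq_.clear(); blt_.clear(); }
    // An invalidation/replacement arrives: the BLT gives a fast race
    // check. Returns true if we must roll back and replay.
    bool on_invalidation(uint64_t block) {
        if (blt_.count(block) == 0) return false; // no race: common case
        while (!shiq_.empty()) {                  // undo in reverse order
            // restore(shiq_.back());  // hypothetical state-restore hook
            shiq_.pop_back();
        }
        blt_.clear();
        return true;
    }
};
```

The common case is on_invalidation() missing in the BLT, so speculation commits for free; rollback is paid only on a genuine race or false sharing, which the previous slide argues is infrequent.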


Applications Beyond Memory Order

SC++ can serve as a generic speculation mechanism when:

1. rollbacks are rare
2. the speculation to verify far exceeds what the ROB can sustain

Examples:
- Value speculation [Sorin et al.]
- Speculating beyond locks [Rajwar et al.]


Performance of SC, RC and SC++

[Figure: speedup over SC (1.0 to 1.8) for RC and SC++ across benchmarks]

- Data from the RSIM DSM simulator
- 16 1-GHz MIPS R10000 processors
- Up to 70% gap between SC & RC
- SC++ can fully emulate RC


Sensitivity to Queue Size

- Queue size needed varies across apps (& systems)
- History is highly bursty
- Can spill history to L2

[Figure: fraction of best performance (50%-100%) vs. number of SHiQ entries (16 to 8192) for appbt, barnes, em3d, fft, radix, tomcatv, unstruct., water, water-sp, and the average]


Conclusions

First to show SC + Speculation = RC!
- identified the design requirements
- current systems do not satisfy the requirements
- proposed a design: SC++

Hardware can provide simple programming with high performance!


Other Ongoing Projects

Ultra-Deep-Submicron Designs:

1. Power Management [MICRO’01, HPCA’02, HPCA’01, ISLPED’00]
   First architectural proposal to reduce leakage
   - Resizable Caches
   - Way/Bank Predicting Caches
   - Power-Aware Snoopy Coherence

2. Transient-Error Tolerant Superscalar [MICRO’01]
   - Error-tolerant instruction scheduling

Beyond ILP & Superscalar [ICS’01, PPoPP’01]
- Low-overhead mechanisms for thread-level speculation
- Selective dependence tracking

For More Information

Please visit our web site:

Impetus Group
Computer Architecture Lab
Carnegie Mellon University
http://www.ece.cmu.edu/~impetus