
PUMA2: Bridging the CPU/Memory Gap through Prediction & Speculation

Babak Falsafi

Team Members: Chi Chen, Chris Gniady, Jangwoo Kim, Tom Wenisch, Se-Hyun Yang

Alumni: Cem Fide, An-Chow Lai

Impetus Group
Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/~impetus


Our Group’s Research Focus

[Figure: platforms from handheld to server, spanning memory design, processor design, and the network]

Design issues:
- Performance
- Power
- Reliability
- Programmability


Impetus Projects

Today’s talk: PUMA2: Bridging the CPU/Memory Gap

Others:

1. PowerTap: Power-Aware Computer Systems

2. JITR: Soft-error Tolerant Microarchitecture

3. GigaTrans: Beyond Superscalar & ILP

Goals: Impact products, research, and education in architecture
E.g., Reactive NUMA => Sun WildFire DSM


Outline

- Impetus Overview
- PUMA2
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering


Hitting the Memory Wall

Growing distance:
- Processors getting faster
- Memory only getting larger

Caching less effective:
- Simplistic (demand) fetch/replace
- Deeper hierarchies => higher worst-case access latencies
- Multiple hierarchies in multiprocessors!

50% processor utilization in servers [Ailamaki, VLDB’99]
(commercial databases running on a Xeon server)


Conventional Data Demand Fetch

Fetch data upon CPU request:
- zero lookahead upon miss
- crude guess for replacement

Works only when the working set:
- fits in L1/L2
- changes infrequently

Out-of-order core can at best tolerate L1-L2 latency

[Figure: memory hierarchy: CPU -> L1 (2 clk) -> L2 (10 clk) -> L3 (50 clk) -> Memory (500 clk)]
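To make these numbers concrete, they can be folded into the standard average-memory-access-time recurrence. The miss ratios below are illustrative placeholders, not measurements from the talk:

AMAT = 2 + m_L1 * (10 + m_L2 * (50 + m_L3 * 500)) clk

With, say, m_L1 = 5%, m_L2 = 20%, m_L3 = 50%, AMAT = 2 + 0.05 * (10 + 0.2 * (50 + 0.5 * 500)) = 5.5 clk: even when the 500-clk memory is rarely reached, it dominates the worst case, which is why zero-lookahead demand fetch is costly.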


PUMA2: Proactively Uniform Memory Access Architecture

Goal: Bridge the CPU/Memory Performance Gap

How?
- Prediction and speculation
- Hide/tolerate memory latency
- Hardware techniques transparent to software


This Talk

1. Last-touch memory access model
- predict the last processor reference
- evict and fetch upon the last reference
+ significantly enhances fetch lookahead

2. Speculative memory ordering
- overlap accesses to tolerate latency
- but overlapping accesses affects memory order
- we show hardware can both relax & enforce order


Outline

- Impetus Overview
- PUMA2
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering


Hiding the Memory Latency Using Prediction/Speculation

Mechanisms required (a minimal interface sketch follows the list):

1. Predict “what” memory address to fetch
   Goal: minimize traffic, avoid thrashing, etc.

2. Predict “when” to fetch
   Goal: maximize latency hiding

3. Storage: “where” fetched data is placed
   Goal: avoid lookups in auxiliary structures
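As a sketch of how these three decisions separate, here is a hypothetical C++ interface; the type and member names are mine, not from the talk:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical interface separating the three prediction problems the
// slide names: what to fetch, when to fetch it, and where to put it.
struct PrefetchDecision {
    uint64_t addr;       // "what": predicted block address
    bool     fetch_now;  // "when": issue immediately vs. wait
    bool     into_l1;    // "where": directly into L1 vs. a side buffer
};

class Prefetcher {
public:
    virtual ~Prefetcher() = default;
    // Observe every access; return a decision when one can be made.
    virtual std::optional<PrefetchDecision>
    observe(uint64_t pc, uint64_t addr, bool was_miss) = 0;
};
```

The prefetchers on the following slides (Markov, DBCP) can be read as different policies for filling in this decision triple.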


Current Proposals for Data Prefetching

Custom prefetchers:
- stride, stream, dependence-based, etc.

General-purpose prefetchers:
- precomputation/slipstream prefetchers
- address-correlating prefetchers

Key shortcomings:
- insufficient lookahead (e.g., 10~100 cycles)
- low accuracy for general access patterns
- cannot place directly in L1 (must use buffers)


Related Work: Address Correlating Prefetchers

Markov Prefetchers (Joseph & Grunwald, ISCA’97)
Predict “what”: correlate L1 miss addresses
Predict “when”: consecutive L1 misses
+ High prediction coverage
– Clustered L1 misses => insufficient lookahead
– One-to-many predictions => low accuracy
– Prefetch into a buffer => high prefetch hit time
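A minimal sketch of the Markov idea, assuming a single stored successor per miss address (the actual design keeps several predictions per entry, which is the source of its one-to-many inaccuracy):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Minimal Markov-style miss-address correlator: remember which miss
// address followed each miss address, and predict it the next time.
class MarkovPrefetcher {
    std::unordered_map<uint64_t, uint64_t> next_miss_; // miss -> next miss
    std::optional<uint64_t> last_miss_;
public:
    // Called on every L1 miss; returns a block address to prefetch
    // into a side buffer (not L1), per the original design.
    std::optional<uint64_t> on_miss(uint64_t addr) {
        if (last_miss_) next_miss_[*last_miss_] = addr; // train
        last_miss_ = addr;
        auto it = next_miss_.find(addr);
        if (it == next_miss_.end()) return std::nullopt;
        return it->second; // predict
    }
};
```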


Insufficient Lookahead in Correlating Prefetchers

Consecutive L1 misses are often clustered
Exacerbated in out-of-order cores

[Figure: access stream
  load/store A1 (miss)
  load/store A1 (hit)
  load/store C3 (miss)
  ...
  load/store A3 (miss)
With fetch-on-miss, the lookahead for A3 spans only from the C3 miss to the A3 miss]


We Propose Fetch on Last Touch

Predict & fetch on last touch
+ Evict dead block
+ Enhance fetch lookahead
+ Fetch directly into L1

[Figure: same access stream (A1 miss, A1 hit, C3 miss, ..., A3 miss); fetching on the last touch of A1 starts the fetch of A3 at the A1 hit rather than at the preceding miss, greatly lengthening lookahead versus fetch-on-miss]


Enhancing Fetch Lookahead

[Figure: cumulative distribution of lookahead in processor cycles (2 up to >2048), comparing the interval between last touch & next miss (our proposal) with the interval between two misses (Markov); the L2 and memory latencies are marked for reference]


Dead-Block Prediction [ISCA’01]

Correlate a trace of memory accesses to a block
Uniquely identify different dead times

Access stream to a block frame:
  PC0: load/store A0 (hit)
  PC1: load/store A1 (miss)   <- first touch
  PC3: load/store A1 (hit)
  PC3: load/store A1 (hit)    <- last touch
  PC5: load/store A3 (miss)

Trace = (PC1, PC3, PC3)
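The trace can be folded into a fixed-width signature as accesses arrive; the next slides name truncated addition as the encoding, while the width and the running-sum formulation here are assumptions:

```cpp
#include <cstdint>

// Fold the PCs that touch a block frame into a fixed-width signature
// by truncated addition, as the slides describe. The width is assumed.
struct TraceSignature {
    static constexpr unsigned kBits = 16;  // assumed signature width
    uint32_t value = 0;
    void reset(uint64_t first_pc) { value = 0; add(first_pc); }
    void add(uint64_t pc) {
        value = (value + static_cast<uint32_t>(pc)) & ((1u << kBits) - 1);
    }
};
// E.g., trace (PC1, PC3, PC3): reset(PC1); add(PC3); add(PC3);
```

Equal traces always collapse to equal signatures, with some aliasing introduced by the truncation.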


Miss-Address Prediction

Correlate the last 2 miss addresses within a cache block frame

Same access stream:
  PC0: load/store A0 (hit)
  PC1: load/store A1 (miss)
  PC3: load/store A1 (hit)
  PC3: load/store A1 (hit)
  PC5: load/store A3 (miss)

Trace = (PC1, PC3, PC3)
Correlation = (A0, A1) => (A3)
Full prediction entry: (A0, A1, PC1, PC3, PC3) => (A3)


Dead-Block Correlating Prefetcher

[Figure: on the current access (latest block A1), the History Table entry (A0, PC1, PC3) is extended with the encoded PC3; the resulting key (A0, A1, PC1, PC3, PC3) indexes the Correlating Prediction Table, whose entry points to A3 => prefetch A3]

Two-level prediction table:
- History Table
- Correlating Prediction Table
Encoding: truncated addition
Two-bit saturating counter
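Putting the pieces together, here is a software sketch of the two-level structure. Only the History Table / Correlating Prediction Table split, the truncated-addition encoding, and the 2-bit counter come from the slides; the key hash, table organization, and confidence threshold below are assumptions:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Sketch of DBCP: a History Table tracks the running (addresses,
// PC-trace) state per block frame; a Correlating Prediction Table
// maps that state to the next miss address, guarded by a 2-bit
// saturating counter. `addr` is a block address.
class DBCP {
    struct HistEntry { uint64_t prev_addr = 0, cur_addr = 0; uint32_t sig = 0; };
    struct PredEntry { uint64_t next_addr; uint8_t counter; }; // 2-bit
    std::unordered_map<uint64_t, HistEntry> history_; // per block frame
    std::unordered_map<uint64_t, PredEntry> table_;   // key -> next miss

    static uint64_t key(const HistEntry& h) {         // assumed hash
        return h.prev_addr ^ (h.cur_addr << 1) ^ h.sig;
    }
public:
    // Observe an access to `frame`; on a predicted last touch, return
    // the address to evict-and-prefetch directly into L1.
    std::optional<uint64_t> access(uint64_t frame, uint64_t addr, uint64_t pc) {
        HistEntry& h = history_[frame];
        if (addr != h.cur_addr) {            // miss: a new block arrives
            uint64_t k = key(h);             // full trace of dying block
            auto it = table_.find(k);
            if (it != table_.end() && it->second.next_addr == addr) {
                if (it->second.counter < 3) ++it->second.counter; // correct
            } else {
                table_[k] = {addr, 1};       // (re)train the correlation
            }
            h.prev_addr = h.cur_addr; h.cur_addr = addr; h.sig = 0;
        }
        h.sig = (h.sig + static_cast<uint32_t>(pc)) & 0xFFFF; // trunc. add
        auto it = table_.find(key(h));       // does this look like a last touch?
        if (it != table_.end() && it->second.counter >= 2)
            return it->second.next_addr;
        return std::nullopt;
    }
};
```

A caller would invoke access(frame, block_addr, pc) on every L1 access and, whenever it returns an address, evict the now-dead block and fetch the returned block straight into L1.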


Methodology

Simulated using SimpleScalar 3.0:
- 2 GHz, 8-issue, 128-entry window
- 32K, direct-mapped, 1-cycle L1D
- 1M, 4-way, 12-cycle L2
- 70-cycle memory
- 2M, 8-way, 24-cycle prediction table
- 128-entry prefetch buffer (for Markov only)

Memory-intensive integer, floating-point, and linked-data apps
14 benchmarks: 5 Olden, 4 SpecINT, 5 SpecFP


Dead-Block Coverage and Accuracy

[Figure: fraction of all misses that are predicted, still training, or mispredicted, for bh, em3d, health, mst, treeadd, compress, perl, gcc, mcf, ammp, art, equake, mgrid, swim]

DBP predicts 90% of last touches and mispredicts only 4%


Miss-Address Prediction

[Figure: fraction of all misses predicted / training / mispredicted for the same 14 benchmarks, with one Markov (M) bar and one DBCP (D) bar each; some Markov misprediction bars exceed 190%]

DBCP predicts 82% of misses, mispredicting only 3%
Markov (Joseph & Grunwald) predicts 81%, but its mispredictions total 229% (one-to-many predictions)


Memory Stall Time Reduction

[Figure: fraction of memory stall time reduced by Markov vs. DBCP across the 14 benchmarks]

DBCP reduces memory stall time by 62% on average
Markov reduces memory stall time by only 30%


DBCP vs. Larger On-Chip L2

[Figure: fraction reduced (-40% to 100%) across the 14 benchmarks for three configurations: an 18-cycle 3M L2, DBCP with its 24-cycle 2M prediction table, and a 12-cycle 3M L2]


Conclusions

Dead-Block Predictors (DBP):
- Predict when to evict a block
- Enable timely prefetching
- Can prefetch into the L1 cache
- High coverage of 90%, mispredicting only 4%

Dead-Block Correlating Prefetchers (DBCP):
- Accurate and timely prefetch
- Reduce memory stall time by 62%


Other Mechanisms in PUMA2

Self-invalidation predictors [ISCA’00]:
- Predict when to self-invalidate in multiprocessors
- Convert 3-hop latencies to 2-hop

Memory sharing predictors [ISCA’99]:
- Predict subsequent sharers of a block
- Powerful mechanism to move data

Both exhibit high coverage and accuracy


Outline

- Impetus Overview
- PUMA2
  - Hitting the Memory Wall
  - Last-Touch Memory Access
  - Speculative Memory Ordering


Sequential Consistency (SC) [Lamport]

What programmers want (SC): memory should appear in program order & atomic

e.g., critical section:
  lock
  modify data
  unlock

+ intuitive programming
– extremely slow!

[Figure: processors P, P, ..., P connected to a single Shared Memory]


What Machines Provide (RC)

Release Consistency (RC) [Gharachorloo, et al.]:
- Overlap remote accesses
- Software enforces order (when needed)
  e.g., first lock, then data, via special “ordering” instructions
- Allows any (re-)ordering, as in e.g. IA-64, SPARC

+ high performance
– complicates programming

[Figure: processors overlap accesses to Shared Memory]
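To see what the slide’s special “ordering” instructions buy, here is the classic flag-passing idiom expressed with C++11 atomics (which postdate this 2002 talk but encode the same acquire/release ordering that RC machines such as SPARC and IA-64 expose):

```cpp
#include <atomic>

int data = 0;                    // ordinary (unordered) shared data
std::atomic<bool> ready{false};  // the "flag" guarding it

void producer() {
    data = 42;                                    // plain store
    ready.store(true, std::memory_order_release); // ordering instruction:
                                                  // prior stores visible first
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { } // pairs with release
    // Under RC, without the acquire/release pair the machine could
    // legally reorder these accesses and read stale data here.
    int observed = data;                          // guaranteed to see 42
    (void)observed;
}
```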


Can We Have SC Programming With RC Performance?

Observation: SC only requires that execution appear in program order;
order is needed only when others race to access

SC hardware can emulate RC iff it can:
- overlap accesses speculatively
- keep a log of the computation in program order
- roll back in case of a race

+ no help from software => SC programming
+ infrequent rollback => better than RC performance


Related Work: Hiding the Store Latency

A number of SC optimizations:

1. Multiple pending prefetches
   Commit to L1 in order [Gharachorloo et al.]
   E.g., MIPS R10000’s pending misses

2. Relaxing order within the ROB
   Speculative loads [Gharachorloo et al.]
   E.g., MIPS R10000’s speculative loads

3. Extensions to the ROB
   Speculative retirement [Ranganathan et al.]

Limited speculation in small associative buffers!


Execution in SC Memory System

[Figure: the reorder buffer holds WR X, RD Y, RD Z, WR A, RD A, and an ALU op; the memory queue retires in order, so WR X (miss) stalls at the head while WR A, RD A, RD Y sit idle behind it]

- WR X, RD Y, RD Z access remote memory
- X, Y, Z, A are unrelated and need not be ordered
- WR X blocks the pipeline for hundreds of cycles
- Cannot overlap RD Y & RD Z with WR X


Execution in RC Memory System

[Figure: the memory queue issues WR X (miss), RD Y (miss), RD Z (miss) out of order, while WR A, RD A, and the ALU op complete in the reorder buffer]

+ Accesses to A complete while WR X is pending
+ Overlaps remote accesses to X, Y, Z
– Software must guarantee that X, Y, Z, A are unrelated


Speculatively & Fully Relaxing Order

With Vijaykumar [ISCA’99]

H/W support for relaxing all order:

- Storage to tolerate long latencies
  - old processor state
  - old memory state

- Fast lookup to detect possible order violations
  upon cache invalidations and replacements

- Infrequent rollbacks
  - typical of well-behaved applications
  - rollbacks are due to false sharing or data races


SC++: A Design for Speculative SC

- SHiQ: back up computation in a queue
- BLT: quick lookup to detect races

[Figure: as in the SC example, WR X, RD Y, RD Z miss in the memory queue, but WR A, RD A, and the ALU op retire speculatively into the Speculative History Queue; the Block Lookup Table tracks the blocks containing A and Y & Z, detecting races from directory accesses]
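As a toy software model of the SHiQ/BLT bookkeeping (SC++ itself is a hardware design; the granularity, names, and the restore hook here are illustrative assumptions):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>

// Toy model of SC++ bookkeeping; illustrative only.
class SpeculativeCore {
    struct LogEntry {            // one speculatively retired op
        uint64_t block;          // cache block it touched
        uint64_t old_value;      // state to restore on rollback
    };
    std::deque<LogEntry> shiq_;          // Speculative History Queue
    std::unordered_set<uint64_t> blt_;   // Block Lookup Table
public:
    // Retire an op past a pending miss: log old state, note the block.
    void retire_speculatively(uint64_t block, uint64_t old_value) {
        shiq_.push_back({block, old_value});
        blt_.insert(block);
    }
    // The pending misses completed with no race: discard the history.
    void commit_all() { shiq_.clear(); blt_.clear(); }
    // An invalidation/replacement arrives: the BLT gives a fast race
    // check. Returns true if we must roll back and replay.
    bool on_invalidation(uint64_t block) {
        if (blt_.count(block) == 0) return false; // no race: common case
        while (!shiq_.empty()) {                  // undo in reverse order
            // restore(shiq_.back());  // hypothetical state-restore hook
            shiq_.pop_back();
        }
        blt_.clear();
        return true;
    }
};
```

The common case is on_invalidation() missing in the BLT, so speculation commits for free; rollback is paid only on a genuine race or false sharing, which the previous slide argues is infrequent.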


Applications Beyond Memory Order

SC++ can serve as a generic speculation mechanism when:

1. rollbacks are rare
2. the speculation to verify far exceeds what the ROB can sustain

Examples:
- Value speculation [Sorin et al.]
- Speculating beyond locks [Rajwar et al.]


Performance of SC, RC and SC++

[Figure: speedup over SC (1.0 to 1.8) for RC and SC++ across benchmarks]

- Data from the RSIM DSM simulator
- 16 1-GHz MIPS R10000 processors
- Up to 70% gap between SC & RC
- SC++ can fully emulate RC


Sensitivity to Queue Size

- Queue size needed varies across apps (& systems)
- History is highly bursty
- Can spill history to L2

[Figure: fraction of best performance (50%-100%) vs. number of SHiQ entries (16 to 8192) for appbt, barnes, em3d, fft, radix, tomcatv, unstruct., water, water-sp, and the average]


Conclusions

First to show SC + Speculation = RC!
- identified the design requirements
- current systems do not satisfy the requirements
- proposed a design: SC++

Hardware can provide simple programming with high performance!


Other Ongoing Projects

Ultra-Deep-Submicron Designs:

1. Power Management [MICRO’01, HPCA’02, HPCA’01, ISLPED’00]
   First architectural proposal to reduce leakage
   - Resizable Caches
   - Way/Bank Predicting Caches
   - Power-Aware Snoopy Coherence

2. Transient-Error Tolerant Superscalar [MICRO’01]
   - Error-tolerant instruction scheduling

Beyond ILP & Superscalar [ICS’01, PPoPP’01]
- Low-overhead mechanisms for thread-level speculation
- Selective dependence tracking

For More Information

Please visit our web site:

Impetus Group
Computer Architecture Lab
Carnegie Mellon University
http://www.ece.cmu.edu/~impetus