many-core computing era and new challengescs.iit.edu/~iraicu/teaching/eecs495-dic/lecture14.pdf ·...

40
Many-Core Computing Era and New Challenges Nikos Hardavellas, EECS

Upload: others

Post on 30-May-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Many-Core Computing Eraand New Challenges

Nikos Hardavellas, EECS

Page 2: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas2

Moore’s Law Is Alive And Well

90nm

90nm transistor

(Intel, 2005)

Swine Flu A/H1N1

(CDC)

65nm

2007

45nm

2010

32nm

2013

22nm

2016

16nm

2019

Device scaling continues for at least another 10 years

Page 3: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas3

Good Days Ended Nov. 2002

[Yelick09]

“Traditional” Moore’s Law: 2x transistors with every generation

“New” Moore’s Law: 2x cores with every generation

Page 4: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

What Happened in Nov. 2002?Chips Became Too Hot

© Hardavellas4

Watt

s/c

m2

1

10

100

1000

1.5m 1m 0.7m 0.5m 0.35m 0.25m 0.18m 0.13m 0.1m 0.07m

i386

i486

Pentium®

Pentium® Pro

Pentium® II

Pentium® IIIHot plate

RocketNozzle

Nuclear Reactor

Pentium® 4

[Pollack]

Page 5: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

The New Cooking Sensation!

© Hardavellas5

[Huang]

Page 6: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

data

data

data

© Hardavellas6

But, Chips Are Physically Constrained,and Constraints Don’t Scale

How to balance constraints and achieve peak performance?

L2

core

core

core core

core

core

core core core core core core

core core core core

core

core

core

core

Power?

Thermal?

Area?

Bandwidth?

Page 7: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

A Look at Technology Projections & Implications

• Use ITRS projections + first-order models/analysis

Given a technology node, find highest-performance design

Respect physical constraints

Account for SW trends

Amdahl’s Law

Exponentially-increasing datasets

© Hardavellas7

Page 8: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

How Many Cores/Cache Can We Power?

© Hardavellas8

1

2

4

8

16

32

64

128

256

1 2 4 8 16 32 64 128 256 512

Nu

mb

er

of

Co

res

Cache Size (MB)

Area

Power (Max Freq.)

1 GHz, 0.27V

2.7 GHz, 0.36V

4.4 GHz, 0.45V

5.7 GHz, 0.54V

6.9 GHz, 0.63V

8 GHz, 0.72V

9 GHz, 0.81V

The power wall: a power/performance trade-off

Voltage/

Freq.

Scaling

(VFS)

Page 9: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Pin Bandwidth: Fewer Cores, More Cache

© Hardavellas9

1

2

4

8

16

32

64

128

256

1 2 4 8 16 32 64 128 256 512

Nu

mb

er

of

Co

res

Cache Size (MB)

Area

Power (Max Freq.)

1 GHz, 0.27V

2.7 GHz, 0.36V

4.4 GHz, 0.45V

5.7 GHz, 0.54V

6.9 GHz, 0.63V

8 GHz, 0.72V

9 GHz, 0.81V

Bandwidth (1 GHz)

Bandwidth (2.7GHz)

Where would the best performance be?

Page 10: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Peak-Performing Designs

© Hardavellas10

0

50

100

150

200

250

300

350

400

1 2 4 8 16 32 64 128 256 512

Pe

rfo

rma

nc

e (

10

00

x M

IPS

)

Cache Size (MB)

Area (Max Freq.)

Power (Max Freq.)

Area+Power (VFS)

Bandwidth (VFS)

Peak Performance

Peak-performing design

First, bandwidth constrained, then power constrained

Page 11: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Transistor Scaling: Many Cores, Huge Caches

© Hardavellas11

Need cache architectures for >>MB

1

2

4

8

16

32

64

128

256

512

2004 2007 2010 2013 2016 2019

Ca

ch

e S

ize

(M

B)

Year of Technology Introduction

OLTP

DSS

Apache

1

2

4

8

16

32

64

128

256

512

1024

2004 2007 2010 2013 2016 2019

Nu

mb

er

of

Co

res

Year of Technology Introduction

Area (Max Freq.)

OLTP - Peak Perf.

DSS - Peak Perf.

Apache - Peak Perf.

OLTP (Max Freq.)

DSS (Max Freq.)

Apache (Max Freq.)

Page 12: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Why Are Caches Growing So Large?

• Increasing number of cores: cache grows commensurately

Fewer but faster cores have the same effect

• Increasing datasets: faster than Moore’s Law!

• Power/thermal efficiency: caches are “cool”, cores are “hot”

So, its easier to fit more cache in a power budget

• Limited bandwidth: large cache == more data on chip

Off-chip pins are used less frequently

© Hardavellas12

So, caches are getting huge. Great! What’s the catch?

Page 13: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas13

1

10

100

1,000

10,000

100,000

1990 2000 2010

Year

L2 C

ach

e S

ize (

KB

)

0

5

10

15

20

25

1990 2000 2010

Year

L2 H

it L

ate

ncy (

cycle

s)

slow access

large caches

Larger Caches Are Slower Caches

Does this affect the end performance?

Page 14: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas14

Impact of Slower Caches on Performance

0

0.5

1

1.5

2

2.5

3

0 10 20 30

L2 Cache Size (MB)

No

rm. T

hro

ug

hp

ut

DSS-const DSS-real

OLTP-const OLTP-real

We lose half the potential throughput!

Page 15: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas15

Cache Design Trends

Balance cache slice access with network latency

As caches become bigger, they get slower:

Divide the cache into smaller “slices”:

Page 16: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas16

corecorecore

Modern Caches: Distributed

Split cache into “slices”, distribute across die

L2 L2 L2 L2

L2 L2 L2 L2

core

core core core core

Page 17: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Modern Multi-Core Chips

© Hardavellas17

Page 18: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Multi-Core: Distributed System On Chip

• Similar structure

Nodes in a cluster or a multiprocessor -> cores on a chip

Physically distributed memory -> distributed cache

Interconnect -> ditto, but on chip

• Similar challenges, albeit with different constants:

Power, thermal, reliability, performance

• Also, some new challenges:

Off-chip bandwidth: limited pins

© Hardavellas18

Page 19: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Data Placement Determines Performance

© 2009 Hardavellas19

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

Goal: place data on chip close to where they are used

cache

slice

Page 20: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© 2009 Hardavellas20

Terminology: Data Types

core

L2

core

L2

core core

L2

core

Read

or

Write

ReadRead Read

Write

Private Shared

Read-Only

Shared

Read-Write

Page 21: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas21

Distributed shared L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

Maximum capacity, but slow access (40+ cycles)

Can we do better?

address

mod <#slices>

Unique location

for any block

(private or shared)

Page 22: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas22

L2

Distributed private L2: private data access

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

Fast access to core-private data

What about accesses to shared data?

Private data:

allocate at

local L2 slice

On every access

allocate data

at local L2 slice

Page 23: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas23

L2

Distributed private L2: shared-RO access

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

Wastes capacity due to replication

What about accesses to shared read-write data?

Shared read-only

data: replicate

across L2 slices

On every access

allocate data

at local L2 slice

Page 24: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas24

Distributed private L2: shared-RW access

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 dir

core core core core

L2 L2 L2 L2

Slow for shared read-write

Wastes capacity (dir overhead) and bandwidth

XShared read-write

data: maintain

coherence via

indirection (dir)

On every access

allocate data

at local L2 slice

Page 25: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© 2009 Hardavellas25

Conventional Multi-Core Caches

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

dir L2 L2 L2

We want: high capacity (shared) + fast access (priv.)

PrivateShared

Address-interleave blocks

(sliceID = addr mod #slices)

+ High effective capacity

− Slow access

Each block cached locally

+ Fast access (local)

− Low capacity (replicas)

− Coherence: via indirection(distributed directory)

Page 26: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© 2009 Hardavellas26

Where to Place the Data?• Close to where they are used!

• Accessed by single core: migrate locally

• Accessed by many cores: replicate (?)

If read-only, replication is OK

If read-write, coherence a problem

Low reuse: evenly distribute across sharers

sharers#

read-write

mig

rate

replicate

share

read-only

Page 27: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© 2009 Hardavellas27

Cache Access Classification Example

-20%

0%

20%

40%

60%

80%

100%

120%

0 2 4 6 8 10 12 14 16 18 20

Number of Sharers

% R

ea

d-W

rite

Blo

ck

s

Instructions Data-Private Data-Shared

0

% L2

accesses

• Each bubble: cache blocks shared by x cores

• Size of bubble proportional to % L2 accesses

• y axis: % blocks in bubble that are read-write

% R

W B

lock

s i

n B

ub

ble

Page 28: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© 2009 Hardavellas28

Scientific/MP Apps

-20%

0%

20%

40%

60%

80%

100%

120%

-4 -2 0 2 4 6 8 10 12 14 16 18 20

Number of Sharers%

Read

-Wri

te B

locks

Instructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-Shared

0%

% 0-20%

0%

20%

40%

60%

80%

100%

120%

0 2 4 6 8 10 12 14 16 18 20

Number of Sharers

% R

ead

-Wri

te B

locks

Instructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-SharedInstructions Data-Private Data-Shared

0

Cache Access Clustering

Accesses naturally form 3 clusters

Server Appsmigrate

locally

share (addr-interleave)

replicate

R/W

migrate

replicate

share

R/O

sharers#% R

W B

locks i

n B

ub

ble

% R

W B

locks i

n B

ub

ble

Page 29: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Instruction Replication

© 2009 Hardavellas29

L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

Distribute in cluster of neighbors, replicate across

• Instruction working set too large for one cache slice

Page 30: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas30

Reactive NUCA in a nutshell

• Classify accesses

private data: like private scheme (migrate)

shared data: like shared scheme (interleave)

instructions: controlled replication (middle ground)

To place cache blocks, we first need to classify them

Page 31: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas31

Classification Granularity

• Per-block classification

High area/power overhead (cut L2 size by half)

High latency (indirection through directory)

• Per-page classification (utilize OS page table)

Persistent structure

Core accesses the page table for every access anyway (TLB)

Utilize already existing SW/HW structures and events

Page classification is accurate (<0.5% error)

Classify entire data pages, page table/TLB for bookkeeping

Page 32: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

• Instructions classification: all accesses from L1-I (per-block)

• Data classification: private/shared per-page at TLB miss

Page classification is accurate (<0.5% error)

Classification Mechanisms

© 2009 Hardavellas32

TLB Misscore

L2

Ld ACore i

OS

A: Private to “i”

TLB MissLd A

OS

A: Private to “i”

core

L2

Core j

A: Shared

On 1st access On access by another core

Bookkeeping through OS page table and TLB

Page 33: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Page Table and TLB Extensions

© 2009 Hardavellas33

vpage ppageL2 idP/S/I

2 bits log(n)

vpage ppageP/STLB entry:

1 bit

Page granularity allows simple + practical HW

Page table entry:

• Persistent structure, ideal for “directory”

• Core accesses the page table for every access anyway (TLB)

Pass information from the “directory” to the core

• Utilize already existing SW/HW structures and events

Page 34: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas34

Data Class Bookkeeping and Lookup

offsetPhysical Addr.:

vpage ppageL2 id

vpage ppageL2 idS

cache indextag

Page table entry:

Page table entry:

vpage ppagePTLB entry:

L2 id

vpage ppageSTLB entry:

P

• private data: place in local L2 slice

• shared data: place in aggregate L2 (addr interleave)

Page 35: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

each slice caches the same blocks

on behalf of any cluster

© Hardavellas35

3 10

0 1 32 0

3 1

3 10

0 1 32 0

3 1

Instructions Lookup: Rotational Interleaving

2 2 3 10

1 32

2 0 2 3 10

0 1 32 0 1 32

+1

+log2(k)

)1(&1 nRIDAddrnDestinatio

RID

Fast access (nearest-neighbor, simple lookup)

Balance access latency with capacity constraints

Equal capacity pressure at overlapped slices

PC: 0xfa480

RID

Addr size-4 clusters:

local slice + 3 neighbors

Page 36: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© 2009 Hardavellas36

Coherence: No Need for HW Mechanisms at L2

Fast access, eliminates HW overhead, SIMPLE

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

core core core core

L2 L2 L2 L2

Private data: local sliceShared data: addr-interleave

• Reactive NUCA placement guarantee

Each R/W datum in unique & known location

Page 37: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

-20%

-10%

0%

10%

20%

30%

40%

50%

60%

A S R I A S R I A S R I A S R I A S R I A S R I A S R I A S R I

OLTP

DB2

Apache DSS

Qry6

DSS

Qry8

DSS

Qry13

em3d OLTP

Oracle

MIX

Private-averse workloads Shared-averse

workloads

Sp

eed

up

over

Pri

vate

© 2009 Hardavellas37

Evaluation

Delivers robust performance across workloads

Shared: same for Web, DSS; 17% for OLTP, MIX

Private: 17% for OLTP, Web, DSS; same for MIX

ASR (A)

Shared (S)

R-NUCA (R)

Ideal (I)

Page 38: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© 2009 Hardavellas38

Reactive NUCA: Fast >>MB Caches

• Data may exhibit arbitrarily complex behaviors

...but few that matter!

• Learn the behaviors at run time for placement

Make the common case fast

• Fast!

Near-optimal placement (within 5% of ideal)

Robust (matches best alternative, or 17% better; up to 32%)

Nearest-neighbor communication → scalable

• Transparent to the user, simple design, negligible overhead

• BONUS: simplify hardware

Eliminate HW coherence at L2

Page 39: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

Concluding Remarks

• The old multiprocessor/cluster is now within a single chip

• Still a distributed system

• Similar challenges, with new constants

Lecture focused on performance via data placement

Similar challenge: HPC systems strive to privatize data

R-NUCA strives to use “private caches”

New constants: latency, bandwidth,

Single OS image allows for many simplifications

• Research opportunity: map ideas from old domain to new

© Hardavellas39

Page 40: Many-Core Computing Era and New Challengescs.iit.edu/~iraicu/teaching/EECS495-DIC/lecture14.pdf · 0 2 4 6 8 10 12 14 16 18 20 Number of Sharers Blocks Instructions Data-Private Data-Shared

© Hardavellas40

“Multicore: This is the one which will have the biggest impact on us. We have never had a problem to solve like this. A breakthrough is needed in how applications are done on multicore devices.”

– Bill Gates

“It’s time we rethink some of the basics of computing. It’s scary and lots of fun at the same time.”

– Burton Smith

Multicore Research Brings Opportunities and Challenges