Many-Core Computing Era and New Challenges — cs.iit.edu/~iraicu/teaching/eecs495-dic/lecture14.pdf
TRANSCRIPT
Many-Core Computing Era and New Challenges
Nikos Hardavellas, EECS
Moore’s Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) next to the Swine Flu A/H1N1 virus (CDC) for scale; projected technology nodes: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019).]
Device scaling continues for at least another 10 years
Good Days Ended Nov. 2002
[Yelick09]
“Traditional” Moore’s Law: 2x transistors with every generation
“New” Moore’s Law: 2x cores with every generation
What Happened in Nov. 2002? Chips Became Too Hot
[Figure: power density (W/cm², log scale from 1 to 1000) vs. feature size (1.5µm down to 0.07µm) for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4; the trend crosses the power density of a hot plate and heads toward that of a rocket nozzle and a nuclear reactor. [Pollack]]
The New Cooking Sensation!
[Huang]
But, Chips Are Physically Constrained, and Constraints Don't Scale
How to balance constraints and achieve peak performance?
[Diagram: a multicore chip floorplan (cores + L2 cache) annotated with its physical constraints: Power? Thermal? Area? Bandwidth?]
A Look at Technology Projections & Implications
• Use ITRS projections + first-order models/analysis
Given a technology node, find highest-performance design
Respect physical constraints
Account for SW trends
Amdahl's Law (see the note below)
Exponentially-increasing datasets
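For reference (standard formula, not from the slides): Amdahl's Law bounds the speedup of a program whose parallelizable fraction is f on n cores:

    Speedup(n) = 1 / ((1 − f) + f/n)

Even with f = 0.95, 256 cores yield at most ~18.6x, and the limit as n grows is 20x, so serial fractions cap what extra cores can buy.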
How Many Cores/Cache Can We Power?
[Figure: feasible designs in the (cache size, number of cores) plane at one technology node; an area constraint and power constraints at each voltage/frequency scaling (VFS) point bound the design space. VFS points: 1 GHz @ 0.27V, 2.7 GHz @ 0.36V, 4.4 GHz @ 0.45V, 5.7 GHz @ 0.54V, 6.9 GHz @ 0.63V, 8 GHz @ 0.72V, 9 GHz @ 0.81V.]
The power wall: a power/performance trade-off
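To make the trade-off concrete, here is a first-order sketch (my illustration, not the lecture's ITRS-based model) that assumes dynamic power dominates, P ≈ Ceff · V² · f per core, and counts how many cores fit in a fixed budget at each VFS point:

    #include <stdio.h>

    /* First-order power-wall sketch. The 130 W budget and Ceff value are
       assumed for illustration; the (GHz, V) pairs come from the slide's
       VFS curve. */
    int main(void) {
        const double budget_w = 130.0;   /* chip power budget (assumed) */
        const double ceff = 1.5e-9;      /* effective capacitance, farads (assumed) */
        const double pts[][2] = {{1.0, 0.27}, {4.4, 0.45}, {9.0, 0.81}};
        for (int i = 0; i < 3; i++) {
            double f_hz = pts[i][0] * 1e9, v = pts[i][1];
            double core_w = ceff * v * v * f_hz;   /* dynamic power per core */
            printf("%.1f GHz @ %.2fV: %5.2f W/core -> %4d cores\n",
                   pts[i][0], v, core_w, (int)(budget_w / core_w));
        }
        return 0;
    }

Under these assumed numbers the budget powers ~1190 slow cores at 1 GHz but only ~14 fast ones at 9 GHz: the power wall forces a choice between many slow cores and few fast ones.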
Pin Bandwidth: Fewer Cores, More Cache
[Figure: the same (cache size, number of cores) design space with pin-bandwidth constraints added at 1 GHz and 2.7 GHz; bandwidth pushes designs toward fewer cores and more cache.]
Where would the best performance be?
Peak-Performing Designs
[Figure: performance (1000x MIPS) vs. cache size (MB) under area (max freq.), power (max freq.), area+power (VFS), and bandwidth (VFS) constraints; the curves meet at the peak-performance point.]
Peak-performing design
First, bandwidth constrained, then power constrained
Transistor Scaling: Many Cores, Huge Caches
Need cache architectures for >>MB
[Figures: cache size (MB) and number of cores vs. year of technology introduction (2004-2019) for OLTP, DSS, and Apache, at peak performance and at max frequency, against the area-constrained limit.]
Why Are Caches Growing So Large?
• Increasing number of cores: cache grows commensurately
Fewer but faster cores have the same effect
• Increasing datasets: faster than Moore’s Law!
• Power/thermal efficiency: caches are “cool”, cores are “hot”
So, it's easier to fit more cache in a power budget
• Limited bandwidth: large cache == more data on chip
Off-chip pins are used less frequently
So, caches are getting huge. Great! What’s the catch?
Larger Caches Are Slower Caches
[Figures: L2 cache size (KB, log scale from 1 to 100,000) and L2 hit latency (cycles, 0 to 25) vs. year, 1990-2010: large caches, slow access.]
Does this affect the end performance?
Impact of Slower Caches on Performance
[Figure: normalized throughput vs. L2 cache size (MB) for DSS and OLTP, comparing a constant-latency cache (DSS-const, OLTP-const) against realistic latency (DSS-real, OLTP-real).]
We lose half the potential throughput!
Cache Design Trends
As caches become bigger, they get slower. So, divide the cache into smaller "slices", balancing cache-slice access latency against network latency.
Modern Caches: Distributed
Split the cache into "slices" and distribute them across the die.
[Diagram: cores paired with distributed L2 slices across the die.]
Modern Multi-Core Chips
Multi-Core: Distributed System On Chip
• Similar structure
Nodes in a cluster or a multiprocessor -> cores on a chip
Physically distributed memory -> distributed cache
Interconnect -> ditto, but on chip
• Similar challenges, albeit with different constants:
Power, thermal, reliability, performance
• Also, some new challenges:
Off-chip bandwidth: limited pins
Data Placement Determines Performance
[Diagram: a tiled multicore; each tile pairs a core with a local L2 cache slice.]
Goal: place data on chip close to where they are used
Terminology: Data Types
[Diagram: classifying blocks by access pattern. A block read or written by a single core is private; a block read by multiple cores is shared read-only; a block read and written by multiple cores is shared read-write.]
Distributed shared L2
[Diagram: tiled multicore with a single address-interleaved shared L2.]
Blocks are address-interleaved across slices (slice = address mod #slices), giving a unique location for any block (private or shared).
Maximum capacity, but slow access (40+ cycles)
Can we do better?
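For concreteness, a minimal sketch of the address-interleaved lookup above (constants are assumed, not from the slides): the home slice is a pure function of the block address, so every core agrees on a block's unique location without any directory.

    #include <stdint.h>

    #define BLOCK_BITS 6     /* 64-byte cache blocks (assumed) */
    #define NUM_SLICES 16    /* one L2 slice per tile (assumed) */

    /* Home slice under address interleaving: block number mod #slices. */
    static inline unsigned home_slice(uint64_t paddr) {
        return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);
    }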
Distributed private L2: private data access
[Diagram: tiled multicore with per-core private L2 slices; on every access, data are allocated at the local L2 slice.]
Private data: allocated at the local L2 slice.
Fast access to core-private data
What about accesses to shared data?
Distributed private L2: shared-RO access
[Diagram: on every access, data are allocated at the local L2 slice; shared read-only data are replicated across L2 slices.]
Wastes capacity due to replication
What about accesses to shared read-write data?
Distributed private L2: shared-RW access
[Diagram: on every access, data are allocated at the local L2 slice; shared read-write data must be kept coherent via indirection through a distributed directory (dir).]
Slow for shared read-write
Wastes capacity (dir overhead) and bandwidth
Conventional Multi-Core Caches
[Diagram: a shared L2 organization vs. a private L2 organization on a tiled multicore.]
Shared: address-interleave blocks (sliceID = addr mod #slices)
+ High effective capacity
− Slow access
Private: each block cached locally
+ Fast access (local)
− Low capacity (replicas)
− Coherence: via indirection (distributed directory)
We want: high capacity (shared) + fast access (private)
Where to Place the Data?
• Close to where they are used!
• Accessed by a single core: migrate locally
• Accessed by many cores: replicate (?)
If read-only, replication is OK
If read-write, coherence is a problem
Low reuse: evenly distribute across sharers
[Chart: placement policy as a function of the number of sharers and read-write behavior: migrate (one sharer), replicate (read-only, many sharers), share (read-write, many sharers); a sketch of this decision rule follows.]
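Read as a decision rule, the chart might be sketched like this (thresholds and names are mine, not the lecture's):

    typedef enum { MIGRATE, REPLICATE, SHARE } placement_t;

    /* Pick a placement for a block from its observed access pattern. */
    placement_t place(int sharers, int is_read_write) {
        if (sharers <= 1)    return MIGRATE;    /* private: keep it at its one user */
        if (!is_read_write)  return REPLICATE;  /* shared read-only: copies are safe */
        return SHARE;        /* shared read-write: one address-interleaved copy */
    }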
Cache Access Classification Example
[Bubble chart: % read-write blocks vs. number of sharers (0-20) for instructions, private data, and shared data.]
• Each bubble: cache blocks shared by x cores
• Size of bubble proportional to % L2 accesses
• y axis: % blocks in bubble that are read-write
Cache Access Clustering
[Bubble charts: % read-write blocks vs. number of sharers for server apps and scientific/MP apps; legend: Instructions, Data-Private, Data-Shared.]
Accesses naturally form 3 clusters: private data migrate locally; shared read-write data are shared (address-interleaved); instructions and other read-only data are replicated.
Instruction Replication
• Instruction working set too large for one cache slice
[Diagram: instructions distributed within a cluster of neighboring slices and replicated across clusters.]
Distribute within a cluster of neighbors, replicate across clusters
Reactive NUCA in a nutshell
To place cache blocks, we first need to classify them.
• Classify accesses:
private data: like private scheme (migrate)
shared data: like shared scheme (interleave)
instructions: controlled replication (middle ground)
Classification Granularity
• Per-block classification
High area/power overhead (cut L2 size by half)
High latency (indirection through directory)
• Per-page classification (utilize OS page table)
Persistent structure
Core accesses the page table for every access anyway (TLB)
Utilize already-existing SW/HW structures and events
Page classification is accurate (<0.5% error)
Classify entire data pages, page table/TLB for bookkeeping
Classification Mechanisms
• Instruction classification: all accesses from L1-I (per-block)
• Data classification: private/shared per-page at TLB miss
[Diagram: on the 1st access (a TLB miss on "Ld A" by core i), the OS classifies page A as private to i; on a later access by another core j, the OS re-classifies A as shared.]
Bookkeeping through OS page table and TLB
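A minimal sketch of that OS-side handler (struct and field names are assumptions for illustration; the slides specify only the classification states):

    enum page_class { CLASS_INVALID, CLASS_PRIVATE, CLASS_SHARED };

    struct pte_class {          /* classification state kept in the page table */
        enum page_class cls;
        int owner_core;         /* meaningful only while cls == CLASS_PRIVATE */
    };

    /* Invoked by the OS on a TLB miss for the faulting page. */
    void classify_on_tlb_miss(struct pte_class *p, int core) {
        if (p->cls == CLASS_INVALID) {      /* 1st access: private to this core */
            p->cls = CLASS_PRIVATE;
            p->owner_core = core;
        } else if (p->cls == CLASS_PRIVATE && p->owner_core != core) {
            p->cls = CLASS_SHARED;          /* touched by another core: shared */
        }                                   /* shared pages stay shared */
    }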
Page Table and TLB Extensions
Page table entry: vpage, ppage, P/S/I class (2 bits), L2 id (log(n) bits)
TLB entry: vpage, ppage, P/S (1 bit)
Page granularity allows simple + practical HW
• Persistent structure, ideal for a "directory"
• Core accesses the page table for every access anyway (TLB)
Pass information from the "directory" to the core
• Utilize already-existing SW/HW structures and events
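One way to picture the extended entries in C (a sketch: the 2-bit class, log(n)-bit L2 id, and 1-bit P/S fields follow the slide; field widths and n = 16 slices are assumptions):

    #include <stdint.h>

    enum { PAGE_PRIVATE, PAGE_SHARED, PAGE_INSTR };  /* the 2-bit P/S/I class */

    struct page_table_entry {
        uint64_t vpage;
        uint64_t ppage;
        uint8_t  cls;     /* 2 bits in hardware: P/S/I */
        uint8_t  l2_id;   /* log2(16) = 4 bits: owning slice for private pages */
    };

    struct tlb_entry {
        uint64_t vpage;
        uint64_t ppage;
        uint8_t  shared;  /* the single extra P/S bit */
    };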
Data Class Bookkeeping and Lookup
Physical address: tag | cache index | offset
Page table and TLB entries carry the page class (P or S) and an L2 id.
• private data: place in local L2 slice
• shared data: place in aggregate L2 (addr interleave)
each slice caches the same blocks on behalf of any cluster
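A sketch of the resulting data lookup (reusing the home_slice() helper from the shared-L2 sketch above; names are illustrative):

    #include <stdint.h>

    unsigned home_slice(uint64_t paddr);   /* from the shared-L2 sketch above */

    /* Destination slice for a data access, given the page's TLB classification. */
    unsigned data_dest_slice(uint64_t paddr, int page_is_shared,
                             unsigned local_slice) {
        if (page_is_shared)
            return home_slice(paddr);   /* aggregate L2: address-interleaved */
        return local_slice;             /* private page: this core's own slice */
    }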
Instructions Lookup: Rotational Interleaving
[Diagram: tiles labeled with rotational IDs (RIDs) for size-4 clusters (local slice + 3 neighbors); RIDs advance by +1 along a row and by +log2(k) between rows; example lookup for PC 0xfa480.]
Destination = (Addr + RID + 1) & (n − 1)
Fast access (nearest-neighbor, simple lookup)
Balance access latency with capacity constraints
Equal capacity pressure at overlapped slices
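Under the assumptions above (size-4 clusters, power-of-two n), the destination computation is only a couple of instructions; a minimal sketch:

    #define CLUSTER_SLICES 4   /* size-4 clusters: local slice + 3 neighbors */

    /* Rotational-interleaving destination: combine log2(n) address bits with
       this tile's rotational ID (RID), per Destination = (Addr + RID + 1) & (n - 1). */
    unsigned ri_dest(unsigned addr_bits, unsigned rid) {
        return (addr_bits + rid + 1) & (CLUSTER_SLICES - 1);
    }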
Coherence: No Need for HW Mechanisms at L2
[Diagram: private data go to the local slice; shared data are address-interleaved.]
• Reactive NUCA placement guarantee:
Each R/W datum is in a unique & known location
Fast access, eliminates HW overhead, SIMPLE
[Chart: speedup over the Private scheme for ASR (A), Shared (S), R-NUCA (R), and Ideal (I) on OLTP DB2, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, OLTP Oracle, and MIX, grouped into private-averse and shared-averse workloads.]
Evaluation
Delivers robust performance across workloads:
vs. Shared: same for Web, DSS; 17% faster for OLTP, MIX
vs. Private: 17% faster for OLTP, Web, DSS; same for MIX
Reactive NUCA: Fast >>MB Caches
• Data may exhibit arbitrarily complex behaviors
...but few that matter!
• Learn the behaviors at run time for placement
Make the common case fast
• Fast!
Near-optimal placement (within 5% of ideal)
Robust (matches best alternative, or 17% better; up to 32%)
Nearest-neighbor communication → scalable
• Transparent to the user, simple design, negligible overhead
• BONUS: simplify hardware
Eliminate HW coherence at L2
Concluding Remarks
• The old multiprocessor/cluster is now within a single chip
• Still a distributed system
• Similar challenges, with new constants
Lecture focused on performance via data placement
Similar challenge: HPC systems strive to privatize data
R-NUCA strives to use “private caches”
New constants: latency, bandwidth, ...
Single OS image allows for many simplifications
• Research opportunity: map ideas from old domain to new
Multicore Research Brings Opportunities and Challenges
“Multicore: This is the one which will have the biggest impact on us. We have never had a problem to solve like this. A breakthrough is needed in how applications are done on multicore devices.”
– Bill Gates
“It’s time we rethink some of the basics of computing. It’s scary and lots of fun at the same time.”
– Burton Smith