The Why, Where and How of Multicore
TRANSCRIPT
Anant Agarwal
MIT and Tilera Corp.
What is Multicore? Whatever's "Inside"?
Seriously, multicore satisfies three properties:
- Single chip
- Multiple distinct processing engines
- Multiple, independent threads of control (or program counters – MIMD)
[Diagram: three "multicore" examples – a tiled mesh of processor (p) / memory (m) / switch cells; a bus-based multicore with per-core caches (p/c) and a shared L2 cache; and a heterogeneous SoC with RISC and DSP cores, cache, memory, and SRAM on a bus]
Outline
- The why
- The where
- The how
The "Moore's Gap"
Houston, we have a problem…
[Plot: performance (GOPS, log scale from 0.01 to 1000) vs. time (1992–2010); transistor counts keep growing, but single-CPU performance gains from pipelining, superscalar, OOO, and SMT/FGMT/CGMT mechanisms flatten out, opening the "Moore's Gap"]
1. Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
2. Wire delays
3. Power envelopes
The Moore's Gap – Example

              Pentium 3      Pentium 4
Year          2000           2000
Process       0.18 micron    0.18 micron
Clock         1 GHz          1.4 GHz
Transistors   28M            42M
SpecInt 2000  343            393

Transistor count increased by 50% (42M/28M = 1.5), but performance increased by only 15% (393/343 ≈ 1.15)
Closing Moore's Gap Today
Two things have changed:
- Applications: today's applications have ample parallelism – and they are not PowerPoint and Word!
- Technology: on-chip integration of multiple cores is now possible
Parallelism is Everywhere
- Supercomputing
- Graphics and gaming
- Video and imaging: set-top boxes, TVs (e.g., H.264)
- Communications: cell phones (e.g., FIRs)
- Security: firewalls (e.g., AES)
- Wireless (e.g., Viterbi)
- Networking (e.g., IP forwarding)
- General purpose: databases, web servers, multiple tasks
Integration is Efficient

              Discrete chips   Multicore*
Bandwidth     2 GBps           > 40 GBps
Latency       60 ns            < 3 ns
Energy        > 500 pJ         < 5 pJ

*90nm, 32 bits, 1mm on-chip interconnect

Parallelism plus interconnect efficiency enables harnessing the "power of n": n cores yield an n-fold increase in performance. This fact yields the multicore opportunity.
Why Multicore?
Let's look at the opportunity from two viewpoints:
- Performance
- Power efficiency
The Performance Opportunity
Smaller CPI (cycles per instruction) is better.

Baseline (90nm): one processor with one cache
  CPI = 1 + 0.01 × 100 = 2

Push single core (65nm): processor 1x, cache 3x
  CPI = 1 + 0.006 × 100 = 1.6

Go multicore (65nm): two cores, each with a 1x processor and a 1x cache
  Effective CPI = (1 + 0.01 × 100) / 2 = 1

Single-processor mechanisms are yielding diminishing returns; prefer to build two smaller structures than one very big one.
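A worked version of the slide's CPI arithmetic, using only numbers stated above (base CPI of 1, a 100-cycle miss penalty, a 1% miss rate for the 1x cache, 0.6% for the 3x cache):

```c
#include <stdio.h>

/* CPI under the three configurations from the slide. */
int main(void) {
    const double base_cpi = 1.0, miss_penalty = 100.0;

    /* Baseline single core at 90nm: 1% miss rate */
    double baseline  = base_cpi + 0.010 * miss_penalty;  /* = 2.0 */

    /* Push single core at 65nm: 3x cache cuts the miss rate to 0.6% */
    double big_cache = base_cpi + 0.006 * miss_penalty;  /* = 1.6 */

    /* Go multicore at 65nm: two unchanged cores, so the
       effective chip-wide CPI halves */
    double two_cores = baseline / 2.0;                   /* = 1.0 */

    printf("baseline %.1f, bigger cache %.1f, two cores %.1f\n",
           baseline, big_cache, two_cores);
    return 0;
}
```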
Multicore Example: MIT's Raw Processor
- 16 cores
- Year 2002
- 0.18 micron
- 425 MHz
- IBM SA27E std. cell
- 6.8 GOPS
Google "MIT Raw"
Raw's Multicore Performance
[Plot: speedup of Raw (425 MHz, 0.18μm) over a Pentium 3 (600 MHz, 0.18μm) across the architecture/application space; data from ISCA04]
The Power Cost of Frequency
[Plot: synthesized multiplier power vs. frequency at 90nm, normalized to a 32-bit multiplier at 250 MHz, over 250–1150 MHz; one curve for gaining frequency by increasing area, one for increasing voltage]
Frequency ∝ V, Power ∝ V³ (i.e., V²F with F ∝ V)
For a 1% increase in frequency, we suffer a 3% increase in power
Multicore's Opportunity for Power Efficiency

                    Freq    V      Perf   Power   Cores   PE (Bops/watt)
Superscalar         1       1      1      1       1       1
"New" Superscalar   1.5X    1.5X   1.5X   3.3X    1X      0.45X
Multicore           0.75X   0.75X  1.5X   0.8X    2X      1.88X

(Bigger PE number is better)
Multicore: 50% more performance with 20% less power
Preferable to use multiple slower devices than one superfast device
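A quick sketch recomputing the table's rows from the Power ∝ V³ rule above; the exact products (3.375X, 0.84X, ~1.79X) differ slightly from the slide's rounded entries:

```c
#include <stdio.h>

/* Power ~ V^3 (V^2 * F, with F ~ V); performance ~ cores * freq. */
static void row(const char *name, double freq, int cores) {
    double power = cores * freq * freq * freq; /* per-core V^3, V ~ freq */
    double perf  = cores * freq;
    printf("%-20s perf %.2fX  power %.2fX  PE %.2fX\n",
           name, perf, power, perf / power);
}

int main(void) {
    row("Superscalar", 1.0, 1);          /* baseline */
    row("\"New\" Superscalar", 1.5, 1);  /* power 3.375X, PE ~0.44X */
    row("Multicore", 0.75, 2);           /* power 0.84X,  PE ~1.79X */
    return 0;
}
```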
Outline
- The why
- The where
- The how
The Future of Multicore
Number of cores will double every 18 months

            '02   '05   '08   '11   '14
Academia    16    64    256   1024  4096
Industry    4     16    64    256   1024
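The projection is compound doubling; a minimal sketch assuming the slide's starting point of 16 cores (academia) in 2002:

```c
#include <math.h>
#include <stdio.h>

/* Core-count projection under the slide's rule: start at 16 cores
   in 2002 and double every 18 months (i.e., 4x every 3 years). */
int main(void) {
    for (int year = 2002; year <= 2014; year += 3)
        printf("%d: %.0f cores\n",
               year, 16.0 * pow(2.0, (year - 2002) / 1.5));
    return 0;
}
```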
But, wait a minute…
Need to create the "1K multicore" research program ASAP!
Outline
- The why
- The where
- The how
Multicore Challenges: The 3 P's
- Performance challenge
- Power efficiency challenge
- Programming challenge
Performance Challenge
Interconnect:
- It is the new mechanism, and it is not well understood
- Current systems rely on buses or rings
- These are not scalable and will become the performance bottleneck
- Bandwidth and latency are the two issues
[Diagram: four processor/cache (p/c) pairs sharing a bus and an L2 cache]
Interconnect Options
[Diagrams: a bus multicore (p/c cores on a shared bus); a ring multicore (p/c cores chained through switches); a mesh multicore (a 2D grid of p/c tiles, each with its own switch)]
Imagine This City…
[Image: city road layouts as an analogy for bus, ring, and mesh interconnects]
Interconnect Bandwidth
Cores:  2-4 (bus)    4-8 (ring)    > 8 (mesh)
Communication Latency
- Communication latency is not an interconnect problem! It is a "last mile" issue
- Latency comes from coherence protocols or software overhead
[Plot: rMPI end-to-end latency (cycles, log scale to 10^9) vs. message size (words, 1 to 10^7)]
rMPI
- Highly optimized MPI implementation on the Raw multicore processor
- Challenge: reduce overhead to a few cycles
- Avoid memory accesses, provide direct access to the interconnect, and eliminate protocols
rMPI vs Native Messages
[Plot: rMPI percentage overhead in cycles relative to native GDN messages for Jacobi, problem sizes N=16 through N=2048, on 2, 4, 8, and 16 tiles; overhead spans roughly -50% to 500%]
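For context, a minimal standard-MPI exchange of the kind rMPI implements (generic MPI shown here, not the rMPI API itself); the software overhead wrapped around such send/receive pairs on conventional stacks is exactly what the measurements above quantify:

```c
#include <mpi.h>
#include <stdio.h>

/* Two-rank ping: rank 0 sends one word, rank 1 receives it. */
int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```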
Power Efficiency Challenge
- Existing CPUs run at 100 watts
- 100 such CPU cores would burn 10 kilowatts!
- Need to rethink CPU architecture
The Potential Exists
Power PerfProcessor Power Efficiency
1/2W 1/8X**RISC* 25X
100W 1Itanium 2 1
Assuming 130nm
* 90’s RISC at 425MHz** e.g., Timberwolf (SpecInt)
Area Equates to Power
[Die photo: Madison Itanium 2, 0.13µm, dominated by L3 cache; photo courtesy Intel Corp.]
Less than 4% of the die goes to ALUs and FPUs
Less is More: the "KILL Rule" for Multicore
Kill If Less than Linear:
- A resource's size must not be increased unless the resulting percentage increase in performance is at least the same as the percentage increase in area (power)
- e.g., increase a resource only if for every 1% increase in area there is at least a 1% increase in performance
- Remember the power of n: going from n to 2n cores doubles performance, and 2n cores have 2X the area (power) – exactly linear
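A toy rendering of the KILL Rule as a decision procedure; the function and its measurement inputs are hypothetical, not from the talk:

```c
#include <stdbool.h>

/* KILL Rule check: grow a resource (cache, issue width, ...) only if
   the percentage gain in performance is at least the percentage
   growth in area, which the talk uses as a proxy for power. */
bool kill_rule_allows_growth(double area_before, double area_after,
                             double perf_before, double perf_after) {
    double area_growth = (area_after - area_before) / area_before;
    double perf_growth = (perf_after - perf_before) / perf_before;
    return perf_growth >= area_growth;   /* at least linear payoff */
}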
Communication Cheaper than Memory Access

Action                    Energy
ALU add                   2pJ
Network transfer (1mm)    3pJ
32KB cache read           50pJ
Off-chip memory read      500pJ

(90nm, 32b)
Migrate from memory-oriented computation models to communication-centric models
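A back-of-envelope comparison using the table's numbers; it assumes, for illustration only, that a cache write costs about as much as a read and that a send and a receive each cost one 1mm network transfer:

```c
#include <stdio.h>

/* Energy per step at 90nm, 32b, from the slide's table:
   memory-oriented (cache read + add + cache write-back) vs.
   communication-centric (network receive + add + network send). */
int main(void) {
    const double alu_add = 2, net_xfer = 3, cache_access = 50; /* pJ */

    double memory_oriented = cache_access + alu_add + cache_access;
    double comm_centric    = net_xfer + alu_add + net_xfer;

    printf("memory-oriented: %.0fpJ  communication-centric: %.0fpJ\n",
           memory_oriented, comm_centric);   /* ~102pJ vs ~8pJ */
    return 0;
}
```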
Multicore Programming Challenge
Traditional cluster computing programming methods squander the multicore opportunity
- Message passing or shared memory, e.g., MPI, OpenMP
- Both were designed assuming high-overhead communication
  - Need big chunks of work to minimize comms, and huge caches
- Multicore is different – low-overhead communication that is cheaper than memory access
  - Results in smaller per-core memories
Must allow specifying parallelism at any granularity, and favor communication over memory
Stream Programming Approach
- ASIC-like concept: read a value from the network, compute, send out a value
- Avoids memory access instructions, synchronization, and address arithmetic
[Diagram: Core A (e.g., FIR) and Core C (e.g., FIR) connected to Core B by channels with send and receive ports, carrying e.g. a pixel data stream]
- e.g., StreamIt, StreamC
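A sketch of one core in such a pipeline, assuming hypothetical blocking channel_recv/channel_send primitives standing in for register-mapped network ports (the names are illustrative, not a real API):

```c
#include <stdint.h>

/* Hypothetical channel primitives for one hardware stream port. */
extern int32_t channel_recv(void);        /* blocking receive */
extern void    channel_send(int32_t v);   /* blocking send */

#define TAPS 4

/* One core running an FIR stage, ASIC-style: receive a sample,
   compute, send the result downstream. No shared-memory loads or
   stores, no locks, no address arithmetic on shared buffers. */
void fir_stage(const int32_t coeff[TAPS]) {
    int32_t window[TAPS] = {0};
    for (;;) {
        /* shift in the newest sample from the upstream core */
        for (int i = TAPS - 1; i > 0; i--)
            window[i] = window[i - 1];
        window[0] = channel_recv();

        int32_t acc = 0;
        for (int i = 0; i < TAPS; i++)
            acc += coeff[i] * window[i];

        channel_send(acc);   /* stream the result to the next core */
    }
}
```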
Conclusion
- Multicore can close the "Moore's Gap"
- The four biggest myths of multicore:
  - Existing CPUs make good cores
  - Bigger caches are better
  - Interconnect latency comes from wire delay
  - Cluster computing programming models are just fine
- For multicore to succeed we need new research:
  - Create new architectural approaches, e.g., the "KILL Rule" for cores
  - Replace memory access with communication
  - Create new interconnects
  - Develop innovative programming APIs and standards
Vision for the Future
The 'core' is the LUT (lookup table) of the 21st century
If we solve the 3 challenges, multicore could replace all hardware in the future
[Diagram: a vast array of switch/processor/memory (s/p/m) tiles – the sea-of-cores vision]