The Why, Where and How of Multicore
TRANSCRIPT
Anant Agarwal
MIT and Tilera Corp.
What is Multicore? Whatever's "Inside"?
Seriously, multicore satisfies three properties:
- Single chip
- Multiple distinct processing engines
- Multiple, independent threads of control (or program counters – MIMD)
[Diagram: three "multicore" examples – a tiled mesh of processor (p) / memory (m) / switch cells; a bus-based multicore with per-core caches (p/c) and a shared L2 cache; and a heterogeneous SoC with RISC and DSP cores, cache, memory, and SRAM on a bus]
Outline
- The why
- The where
- The how
The "Moore's Gap"
Houston, we have a problem…
[Plot: performance (GOPS, log scale from 0.01 to 1000) vs. time (1992–2010); transistor counts keep growing, but single-CPU performance gains from pipelining, superscalar, OOO, and SMT/FGMT/CGMT mechanisms flatten out, opening the "Moore's Gap"]
1. Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
2. Wire delays
3. Power envelopes
The Moore's Gap – Example

              Pentium 3      Pentium 4
Year          2000           2000
Process       0.18 micron    0.18 micron
Clock         1 GHz          1.4 GHz
Transistors   28M            42M
SpecInt 2000  343            393

Transistor count increased by 50% (42M/28M = 1.5), but performance increased by only 15% (393/343 ≈ 1.15)
Closing Moore's Gap Today
Two things have changed:
- Applications: today's applications have ample parallelism – and they are not PowerPoint and Word!
- Technology: on-chip integration of multiple cores is now possible
Parallelism is Everywhere
- Supercomputing
- Graphics and gaming
- Video and imaging: set-top boxes, TVs (e.g., H.264)
- Communications: cell phones (e.g., FIRs)
- Security: firewalls (e.g., AES)
- Wireless (e.g., Viterbi)
- Networking (e.g., IP forwarding)
- General purpose: databases, web servers, multiple tasks
Integration is Efficient

              Discrete chips   Multicore*
Bandwidth     2 GBps           > 40 GBps
Latency       60 ns            < 3 ns
Energy        > 500 pJ         < 5 pJ

*90nm, 32 bits, 1mm on-chip interconnect

Parallelism plus interconnect efficiency enables harnessing the "power of n": n cores yield an n-fold increase in performance. This fact yields the multicore opportunity.
Why Multicore?
Let's look at the opportunity from two viewpoints:
- Performance
- Power efficiency
The Performance Opportunity
Smaller CPI (cycles per instruction) is better.

Baseline (90nm): one processor with one cache
  CPI = 1 + 0.01 × 100 = 2

Push single core (65nm): processor 1x, cache 3x
  CPI = 1 + 0.006 × 100 = 1.6

Go multicore (65nm): two cores, each with a 1x processor and a 1x cache
  Effective CPI = (1 + 0.01 × 100) / 2 = 1

Single-processor mechanisms are yielding diminishing returns; prefer to build two smaller structures than one very big one.
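A worked version of the slide's CPI arithmetic, using only numbers stated above (base CPI of 1, a 100-cycle miss penalty, a 1% miss rate for the 1x cache, 0.6% for the 3x cache):

```c
#include <stdio.h>

/* CPI under the three configurations from the slide. */
int main(void) {
    const double base_cpi = 1.0, miss_penalty = 100.0;

    /* Baseline single core at 90nm: 1% miss rate */
    double baseline  = base_cpi + 0.010 * miss_penalty;  /* = 2.0 */

    /* Push single core at 65nm: 3x cache cuts the miss rate to 0.6% */
    double big_cache = base_cpi + 0.006 * miss_penalty;  /* = 1.6 */

    /* Go multicore at 65nm: two unchanged cores, so the
       effective chip-wide CPI halves */
    double two_cores = baseline / 2.0;                   /* = 1.0 */

    printf("baseline %.1f, bigger cache %.1f, two cores %.1f\n",
           baseline, big_cache, two_cores);
    return 0;
}
```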
Multicore Example: MIT's Raw Processor
- 16 cores
- Year 2002
- 0.18 micron
- 425 MHz
- IBM SA27E std. cell
- 6.8 GOPS
Google "MIT Raw"
Raw's Multicore Performance
[Plot: speedup of Raw (425 MHz, 0.18μm) over a Pentium 3 (600 MHz, 0.18μm) across the architecture/application space; data from ISCA04]
The Power Cost of Frequency
[Plot: synthesized multiplier power vs. frequency at 90nm, normalized to a 32-bit multiplier at 250 MHz, over 250–1150 MHz; one curve for gaining frequency by increasing area, one for increasing voltage]
Frequency ∝ V, Power ∝ V³ (i.e., V²F with F ∝ V)
For a 1% increase in frequency, we suffer a 3% increase in power
Multicore's Opportunity for Power Efficiency

                    Freq    V      Perf   Power   Cores   PE (Bops/watt)
Superscalar         1       1      1      1       1       1
"New" Superscalar   1.5X    1.5X   1.5X   3.3X    1X      0.45X
Multicore           0.75X   0.75X  1.5X   0.8X    2X      1.88X

(Bigger PE number is better)
Multicore: 50% more performance with 20% less power
Preferable to use multiple slower devices than one superfast device
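A quick sketch recomputing the table's rows from the Power ∝ V³ rule above; the exact products (3.375X, 0.84X, ~1.79X) differ slightly from the slide's rounded entries:

```c
#include <stdio.h>

/* Power ~ V^3 (V^2 * F, with F ~ V); performance ~ cores * freq. */
static void row(const char *name, double freq, int cores) {
    double power = cores * freq * freq * freq; /* per-core V^3, V ~ freq */
    double perf  = cores * freq;
    printf("%-20s perf %.2fX  power %.2fX  PE %.2fX\n",
           name, perf, power, perf / power);
}

int main(void) {
    row("Superscalar", 1.0, 1);          /* baseline */
    row("\"New\" Superscalar", 1.5, 1);  /* power 3.375X, PE ~0.44X */
    row("Multicore", 0.75, 2);           /* power 0.84X,  PE ~1.79X */
    return 0;
}
```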
Outline
- The why
- The where
- The how
The Future of Multicore
Number of cores will double every 18 months

            '02   '05   '08   '11   '14
Academia    16    64    256   1024  4096
Industry    4     16    64    256   1024
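The projection is compound doubling; a minimal sketch assuming the slide's starting point of 16 cores (academia) in 2002:

```c
#include <math.h>
#include <stdio.h>

/* Core-count projection under the slide's rule: start at 16 cores
   in 2002 and double every 18 months (i.e., 4x every 3 years). */
int main(void) {
    for (int year = 2002; year <= 2014; year += 3)
        printf("%d: %.0f cores\n",
               year, 16.0 * pow(2.0, (year - 2002) / 1.5));
    return 0;
}
```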
But, wait a minute…
Need to create the "1K multicore" research program ASAP!
Outline
- The why
- The where
- The how
Multicore Challenges: The 3 P's
- Performance challenge
- Power efficiency challenge
- Programming challenge
Performance Challenge
Interconnect:
- It is the new mechanism, and it is not well understood
- Current systems rely on buses or rings
- These are not scalable and will become the performance bottleneck
- Bandwidth and latency are the two issues
[Diagram: four processor/cache (p/c) pairs sharing a bus and an L2 cache]
Interconnect Options
[Diagrams: a bus multicore (p/c cores on a shared bus); a ring multicore (p/c cores chained through switches); a mesh multicore (a 2D grid of p/c tiles, each with its own switch)]
Imagine This City…
[Image: city road layouts as an analogy for bus, ring, and mesh interconnects]
Interconnect Bandwidth
Cores:  2-4 (bus)    4-8 (ring)    > 8 (mesh)
Communication Latency
- Communication latency is not an interconnect problem! It is a "last mile" issue
- Latency comes from coherence protocols or software overhead
[Plot: rMPI end-to-end latency (cycles, log scale to 10^9) vs. message size (words, 1 to 10^7)]
rMPI
- Highly optimized MPI implementation on the Raw multicore processor
- Challenge: reduce overhead to a few cycles
- Avoid memory accesses, provide direct access to the interconnect, and eliminate protocols
rMPI vs Native Messages
[Plot: rMPI percentage overhead in cycles relative to native GDN messages for Jacobi, problem sizes N=16 through N=2048, on 2, 4, 8, and 16 tiles; overhead spans roughly -50% to 500%]
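For context, a minimal standard-MPI exchange of the kind rMPI implements (generic MPI shown here, not the rMPI API itself); the software overhead wrapped around such send/receive pairs on conventional stacks is exactly what the measurements above quantify:

```c
#include <mpi.h>
#include <stdio.h>

/* Two-rank ping: rank 0 sends one word, rank 1 receives it. */
int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```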
Power Efficiency Challenge
- Existing CPUs run at 100 watts
- 100 such CPU cores would burn 10 kilowatts!
- Need to rethink CPU architecture
The Potential Exists
Power PerfProcessor Power Efficiency
1/2W 1/8X**RISC* 25X
100W 1Itanium 2 1
Assuming 130nm
* 90’s RISC at 425MHz** e.g., Timberwolf (SpecInt)
Area Equates to Power
[Die photo: Madison Itanium 2, 0.13µm, dominated by L3 cache; photo courtesy Intel Corp.]
Less than 4% of the die goes to ALUs and FPUs
Less is More: the "KILL Rule" for Multicore
Kill If Less than Linear:
- A resource's size must not be increased unless the resulting percentage increase in performance is at least the same as the percentage increase in area (power)
- e.g., increase a resource only if for every 1% increase in area there is at least a 1% increase in performance
- Remember the power of n: going from n to 2n cores doubles performance, and 2n cores have 2X the area (power) – exactly linear
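A toy rendering of the KILL Rule as a decision procedure; the function and its measurement inputs are hypothetical, not from the talk:

```c
#include <stdbool.h>

/* KILL Rule check: grow a resource (cache, issue width, ...) only if
   the percentage gain in performance is at least the percentage
   growth in area, which the talk uses as a proxy for power. */
bool kill_rule_allows_growth(double area_before, double area_after,
                             double perf_before, double perf_after) {
    double area_growth = (area_after - area_before) / area_before;
    double perf_growth = (perf_after - perf_before) / perf_before;
    return perf_growth >= area_growth;   /* at least linear payoff */
}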
Communication Cheaper than Memory Access

Action                    Energy
ALU add                   2pJ
Network transfer (1mm)    3pJ
32KB cache read           50pJ
Off-chip memory read      500pJ

(90nm, 32b)
Migrate from memory-oriented computation models to communication-centric models
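A back-of-envelope comparison using the table's numbers; it assumes, for illustration only, that a cache write costs about as much as a read and that a send and a receive each cost one 1mm network transfer:

```c
#include <stdio.h>

/* Energy per step at 90nm, 32b, from the slide's table:
   memory-oriented (cache read + add + cache write-back) vs.
   communication-centric (network receive + add + network send). */
int main(void) {
    const double alu_add = 2, net_xfer = 3, cache_access = 50; /* pJ */

    double memory_oriented = cache_access + alu_add + cache_access;
    double comm_centric    = net_xfer + alu_add + net_xfer;

    printf("memory-oriented: %.0fpJ  communication-centric: %.0fpJ\n",
           memory_oriented, comm_centric);   /* ~102pJ vs ~8pJ */
    return 0;
}
```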
Multicore Programming Challenge
Traditional cluster computing programming methods squander the multicore opportunity
- Message passing or shared memory, e.g., MPI, OpenMP
- Both were designed assuming high-overhead communication
  - Need big chunks of work to minimize comms, and huge caches
- Multicore is different – low-overhead communication that is cheaper than memory access
  - Results in smaller per-core memories
Must allow specifying parallelism at any granularity, and favor communication over memory
Stream Programming Approach
- ASIC-like concept: read a value from the network, compute, send out a value
- Avoids memory access instructions, synchronization, and address arithmetic
[Diagram: Core A (e.g., FIR) and Core C (e.g., FIR) connected to Core B by channels with send and receive ports, carrying e.g. a pixel data stream]
- e.g., StreamIt, StreamC
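A sketch of one core in such a pipeline, assuming hypothetical blocking channel_recv/channel_send primitives standing in for register-mapped network ports (the names are illustrative, not a real API):

```c
#include <stdint.h>

/* Hypothetical channel primitives for one hardware stream port. */
extern int32_t channel_recv(void);        /* blocking receive */
extern void    channel_send(int32_t v);   /* blocking send */

#define TAPS 4

/* One core running an FIR stage, ASIC-style: receive a sample,
   compute, send the result downstream. No shared-memory loads or
   stores, no locks, no address arithmetic on shared buffers. */
void fir_stage(const int32_t coeff[TAPS]) {
    int32_t window[TAPS] = {0};
    for (;;) {
        /* shift in the newest sample from the upstream core */
        for (int i = TAPS - 1; i > 0; i--)
            window[i] = window[i - 1];
        window[0] = channel_recv();

        int32_t acc = 0;
        for (int i = 0; i < TAPS; i++)
            acc += coeff[i] * window[i];

        channel_send(acc);   /* stream the result to the next core */
    }
}
```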
Conclusion
- Multicore can close the "Moore's Gap"
- The four biggest myths of multicore:
  - Existing CPUs make good cores
  - Bigger caches are better
  - Interconnect latency comes from wire delay
  - Cluster computing programming models are just fine
- For multicore to succeed we need new research:
  - Create new architectural approaches, e.g., the "KILL Rule" for cores
  - Replace memory access with communication
  - Create new interconnects
  - Develop innovative programming APIs and standards
Vision for the Future
The 'core' is the LUT (lookup table) of the 21st century
If we solve the 3 challenges, multicore could replace all hardware in the future
[Diagram: a vast array of switch/processor/memory (s/p/m) tiles – the sea-of-cores vision]