Many-Core Computing Era and New Challenges — cs.iit.edu/~iraicu/teaching/eecs495-dic/lecture14.pdf
TRANSCRIPT
Many-Core Computing Era and New Challenges
Nikos Hardavellas, EECS
Moore’s Law Is Alive And Well
[Figure: a 90nm transistor (Intel, 2005) next to the Swine Flu A/H1N1 virus (CDC) for scale; projected technology nodes: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019).]
Device scaling continues for at least another 10 years
Good Days Ended Nov. 2002
[Yelick09]
“Traditional” Moore’s Law: 2x transistors with every generation
“New” Moore’s Law: 2x cores with every generation
What Happened in Nov. 2002? Chips Became Too Hot
[Figure: power density (W/cm², log scale from 1 to 1000) vs. feature size (1.5µm down to 0.07µm) for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4; the trend crosses the power density of a hot plate and heads toward that of a rocket nozzle and a nuclear reactor. [Pollack]]
The New Cooking Sensation!
[Huang]
But, Chips Are Physically Constrained, and Constraints Don't Scale
How to balance constraints and achieve peak performance?
[Diagram: a multicore chip floorplan (cores + L2 cache) annotated with its physical constraints: Power? Thermal? Area? Bandwidth?]
A Look at Technology Projections & Implications
• Use ITRS projections + first-order models/analysis
Given a technology node, find highest-performance design
Respect physical constraints
Account for SW trends
Amdahl's Law (see the note below)
Exponentially-increasing datasets
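For reference (standard formula, not from the slides): Amdahl's Law bounds the speedup of a program whose parallelizable fraction is f on n cores:

    Speedup(n) = 1 / ((1 − f) + f/n)

Even with f = 0.95, 256 cores yield at most ~18.6x, and the limit as n grows is 20x, so serial fractions cap what extra cores can buy.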
How Many Cores/Cache Can We Power?
[Figure: feasible designs in the (cache size, number of cores) plane at one technology node; an area constraint and power constraints at each voltage/frequency scaling (VFS) point bound the design space. VFS points: 1 GHz @ 0.27V, 2.7 GHz @ 0.36V, 4.4 GHz @ 0.45V, 5.7 GHz @ 0.54V, 6.9 GHz @ 0.63V, 8 GHz @ 0.72V, 9 GHz @ 0.81V.]
The power wall: a power/performance trade-off
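To make the trade-off concrete, here is a first-order sketch (my illustration, not the lecture's ITRS-based model) that assumes dynamic power dominates, P ≈ Ceff · V² · f per core, and counts how many cores fit in a fixed budget at each VFS point:

    #include <stdio.h>

    /* First-order power-wall sketch. The 130 W budget and Ceff value are
       assumed for illustration; the (GHz, V) pairs come from the slide's
       VFS curve. */
    int main(void) {
        const double budget_w = 130.0;   /* chip power budget (assumed) */
        const double ceff = 1.5e-9;      /* effective capacitance, farads (assumed) */
        const double pts[][2] = {{1.0, 0.27}, {4.4, 0.45}, {9.0, 0.81}};
        for (int i = 0; i < 3; i++) {
            double f_hz = pts[i][0] * 1e9, v = pts[i][1];
            double core_w = ceff * v * v * f_hz;   /* dynamic power per core */
            printf("%.1f GHz @ %.2fV: %5.2f W/core -> %4d cores\n",
                   pts[i][0], v, core_w, (int)(budget_w / core_w));
        }
        return 0;
    }

Under these assumed numbers the budget powers ~1190 slow cores at 1 GHz but only ~14 fast ones at 9 GHz: the power wall forces a choice between many slow cores and few fast ones.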
Pin Bandwidth: Fewer Cores, More Cache
[Figure: the same (cache size, number of cores) design space with pin-bandwidth constraints added at 1 GHz and 2.7 GHz; bandwidth pushes designs toward fewer cores and more cache.]
Where would the best performance be?
Peak-Performing Designs
[Figure: performance (1000x MIPS) vs. cache size (MB) under area (max freq.), power (max freq.), area+power (VFS), and bandwidth (VFS) constraints; the curves meet at the peak-performance point.]
Peak-performing design
First, bandwidth constrained, then power constrained
Transistor Scaling: Many Cores, Huge Caches
Need cache architectures for >>MB
[Figures: cache size (MB) and number of cores vs. year of technology introduction (2004-2019) for OLTP, DSS, and Apache, at peak performance and at max frequency, against the area-constrained limit.]
Why Are Caches Growing So Large?
• Increasing number of cores: cache grows commensurately
Fewer but faster cores have the same effect
• Increasing datasets: faster than Moore’s Law!
• Power/thermal efficiency: caches are “cool”, cores are “hot”
So, it's easier to fit more cache in a power budget
• Limited bandwidth: large cache == more data on chip
Off-chip pins are used less frequently
So, caches are getting huge. Great! What’s the catch?
Larger Caches Are Slower Caches
[Figures: L2 cache size (KB, log scale from 1 to 100,000) and L2 hit latency (cycles, 0 to 25) vs. year, 1990-2010: large caches, slow access.]
Does this affect the end performance?
Impact of Slower Caches on Performance
[Figure: normalized throughput vs. L2 cache size (MB) for DSS and OLTP, comparing a constant-latency cache (DSS-const, OLTP-const) against realistic latency (DSS-real, OLTP-real).]
We lose half the potential throughput!
Cache Design Trends
As caches become bigger, they get slower. So, divide the cache into smaller "slices", balancing cache-slice access latency against network latency.
Modern Caches: Distributed
Split the cache into "slices" and distribute them across the die.
[Diagram: cores paired with distributed L2 slices across the die.]
Modern Multi-Core Chips
Multi-Core: Distributed System On Chip
• Similar structure
Nodes in a cluster or a multiprocessor -> cores on a chip
Physically distributed memory -> distributed cache
Interconnect -> ditto, but on chip
• Similar challenges, albeit with different constants:
Power, thermal, reliability, performance
• Also, some new challenges:
Off-chip bandwidth: limited pins
Data Placement Determines Performance
[Diagram: a tiled multicore; each tile pairs a core with a local L2 cache slice.]
Goal: place data on chip close to where they are used
Terminology: Data Types
[Diagram: classifying blocks by access pattern. A block read or written by a single core is private; a block read by multiple cores is shared read-only; a block read and written by multiple cores is shared read-write.]
Distributed shared L2
[Diagram: tiled multicore with a single address-interleaved shared L2.]
Blocks are address-interleaved across slices (slice = address mod #slices), giving a unique location for any block (private or shared).
Maximum capacity, but slow access (40+ cycles)
Can we do better?
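For concreteness, a minimal sketch of the address-interleaved lookup above (constants are assumed, not from the slides): the home slice is a pure function of the block address, so every core agrees on a block's unique location without any directory.

    #include <stdint.h>

    #define BLOCK_BITS 6     /* 64-byte cache blocks (assumed) */
    #define NUM_SLICES 16    /* one L2 slice per tile (assumed) */

    /* Home slice under address interleaving: block number mod #slices. */
    static inline unsigned home_slice(uint64_t paddr) {
        return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);
    }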
Distributed private L2: private data access
[Diagram: tiled multicore with per-core private L2 slices; on every access, data are allocated at the local L2 slice.]
Private data: allocated at the local L2 slice.
Fast access to core-private data
What about accesses to shared data?
Distributed private L2: shared-RO access
[Diagram: on every access, data are allocated at the local L2 slice; shared read-only data are replicated across L2 slices.]
Wastes capacity due to replication
What about accesses to shared read-write data?
Distributed private L2: shared-RW access
[Diagram: on every access, data are allocated at the local L2 slice; shared read-write data must be kept coherent via indirection through a distributed directory (dir).]
Slow for shared read-write
Wastes capacity (dir overhead) and bandwidth
Conventional Multi-Core Caches
[Diagram: a shared L2 organization vs. a private L2 organization on a tiled multicore.]
Shared: address-interleave blocks (sliceID = addr mod #slices)
+ High effective capacity
− Slow access
Private: each block cached locally
+ Fast access (local)
− Low capacity (replicas)
− Coherence: via indirection (distributed directory)
We want: high capacity (shared) + fast access (private)
Where to Place the Data?
• Close to where they are used!
• Accessed by a single core: migrate locally
• Accessed by many cores: replicate (?)
If read-only, replication is OK
If read-write, coherence is a problem
Low reuse: evenly distribute across sharers
[Chart: placement policy as a function of the number of sharers and read-write behavior: migrate (one sharer), replicate (read-only, many sharers), share (read-write, many sharers); a sketch of this decision rule follows.]
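Read as a decision rule, the chart might be sketched like this (thresholds and names are mine, not the lecture's):

    typedef enum { MIGRATE, REPLICATE, SHARE } placement_t;

    /* Pick a placement for a block from its observed access pattern. */
    placement_t place(int sharers, int is_read_write) {
        if (sharers <= 1)    return MIGRATE;    /* private: keep it at its one user */
        if (!is_read_write)  return REPLICATE;  /* shared read-only: copies are safe */
        return SHARE;        /* shared read-write: one address-interleaved copy */
    }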
Cache Access Classification Example
[Bubble chart: % read-write blocks vs. number of sharers (0-20) for instructions, private data, and shared data.]
• Each bubble: cache blocks shared by x cores
• Size of bubble proportional to % L2 accesses
• y axis: % blocks in bubble that are read-write
Cache Access Clustering
[Bubble charts: % read-write blocks vs. number of sharers for server apps and scientific/MP apps; legend: Instructions, Data-Private, Data-Shared.]
Accesses naturally form 3 clusters: private data migrate locally; shared read-write data are shared (address-interleaved); instructions and other read-only data are replicated.
Instruction Replication
• Instruction working set too large for one cache slice
[Diagram: instructions distributed within a cluster of neighboring slices and replicated across clusters.]
Distribute within a cluster of neighbors, replicate across clusters
Reactive NUCA in a nutshell
To place cache blocks, we first need to classify them.
• Classify accesses:
private data: like private scheme (migrate)
shared data: like shared scheme (interleave)
instructions: controlled replication (middle ground)
Classification Granularity
• Per-block classification
High area/power overhead (cut L2 size by half)
High latency (indirection through directory)
• Per-page classification (utilize OS page table)
Persistent structure
Core accesses the page table for every access anyway (TLB)
Utilize already-existing SW/HW structures and events
Page classification is accurate (<0.5% error)
Classify entire data pages, page table/TLB for bookkeeping
Classification Mechanisms
• Instruction classification: all accesses from L1-I (per-block)
• Data classification: private/shared per-page at TLB miss
[Diagram: on the 1st access (a TLB miss on "Ld A" by core i), the OS classifies page A as private to i; on a later access by another core j, the OS re-classifies A as shared.]
Bookkeeping through OS page table and TLB
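A minimal sketch of that OS-side handler (struct and field names are assumptions for illustration; the slides specify only the classification states):

    enum page_class { CLASS_INVALID, CLASS_PRIVATE, CLASS_SHARED };

    struct pte_class {          /* classification state kept in the page table */
        enum page_class cls;
        int owner_core;         /* meaningful only while cls == CLASS_PRIVATE */
    };

    /* Invoked by the OS on a TLB miss for the faulting page. */
    void classify_on_tlb_miss(struct pte_class *p, int core) {
        if (p->cls == CLASS_INVALID) {      /* 1st access: private to this core */
            p->cls = CLASS_PRIVATE;
            p->owner_core = core;
        } else if (p->cls == CLASS_PRIVATE && p->owner_core != core) {
            p->cls = CLASS_SHARED;          /* touched by another core: shared */
        }                                   /* shared pages stay shared */
    }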
Page Table and TLB Extensions
Page table entry: vpage, ppage, P/S/I class (2 bits), L2 id (log(n) bits)
TLB entry: vpage, ppage, P/S (1 bit)
Page granularity allows simple + practical HW
• Persistent structure, ideal for a "directory"
• Core accesses the page table for every access anyway (TLB)
Pass information from the "directory" to the core
• Utilize already-existing SW/HW structures and events
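One way to picture the extended entries in C (a sketch: the 2-bit class, log(n)-bit L2 id, and 1-bit P/S fields follow the slide; field widths and n = 16 slices are assumptions):

    #include <stdint.h>

    enum { PAGE_PRIVATE, PAGE_SHARED, PAGE_INSTR };  /* the 2-bit P/S/I class */

    struct page_table_entry {
        uint64_t vpage;
        uint64_t ppage;
        uint8_t  cls;     /* 2 bits in hardware: P/S/I */
        uint8_t  l2_id;   /* log2(16) = 4 bits: owning slice for private pages */
    };

    struct tlb_entry {
        uint64_t vpage;
        uint64_t ppage;
        uint8_t  shared;  /* the single extra P/S bit */
    };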
Data Class Bookkeeping and Lookup
Physical address: tag | cache index | offset
Page table and TLB entries carry the page class (P or S) and an L2 id.
• private data: place in local L2 slice
• shared data: place in aggregate L2 (addr interleave)
each slice caches the same blocks on behalf of any cluster
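A sketch of the resulting data lookup (reusing the home_slice() helper from the shared-L2 sketch above; names are illustrative):

    #include <stdint.h>

    unsigned home_slice(uint64_t paddr);   /* from the shared-L2 sketch above */

    /* Destination slice for a data access, given the page's TLB classification. */
    unsigned data_dest_slice(uint64_t paddr, int page_is_shared,
                             unsigned local_slice) {
        if (page_is_shared)
            return home_slice(paddr);   /* aggregate L2: address-interleaved */
        return local_slice;             /* private page: this core's own slice */
    }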
Instructions Lookup: Rotational Interleaving
[Diagram: tiles labeled with rotational IDs (RIDs) for size-4 clusters (local slice + 3 neighbors); RIDs advance by +1 along a row and by +log2(k) between rows; example lookup for PC 0xfa480.]
Destination = (Addr + RID + 1) & (n − 1)
Fast access (nearest-neighbor, simple lookup)
Balance access latency with capacity constraints
Equal capacity pressure at overlapped slices
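Under the assumptions above (size-4 clusters, power-of-two n), the destination computation is only a couple of instructions; a minimal sketch:

    #define CLUSTER_SLICES 4   /* size-4 clusters: local slice + 3 neighbors */

    /* Rotational-interleaving destination: combine log2(n) address bits with
       this tile's rotational ID (RID), per Destination = (Addr + RID + 1) & (n - 1). */
    unsigned ri_dest(unsigned addr_bits, unsigned rid) {
        return (addr_bits + rid + 1) & (CLUSTER_SLICES - 1);
    }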
Coherence: No Need for HW Mechanisms at L2
[Diagram: private data go to the local slice; shared data are address-interleaved.]
• Reactive NUCA placement guarantee:
Each R/W datum is in a unique & known location
Fast access, eliminates HW overhead, SIMPLE
[Chart: speedup over the Private scheme for ASR (A), Shared (S), R-NUCA (R), and Ideal (I) on OLTP DB2, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, OLTP Oracle, and MIX, grouped into private-averse and shared-averse workloads.]
Evaluation
Delivers robust performance across workloads:
vs. Shared: same for Web, DSS; 17% faster for OLTP, MIX
vs. Private: 17% faster for OLTP, Web, DSS; same for MIX
Reactive NUCA: Fast >>MB Caches
• Data may exhibit arbitrarily complex behaviors
...but few that matter!
• Learn the behaviors at run time for placement
Make the common case fast
• Fast!
Near-optimal placement (within 5% of ideal)
Robust (matches best alternative, or 17% better; up to 32%)
Nearest-neighbor communication → scalable
• Transparent to the user, simple design, negligible overhead
• BONUS: simplify hardware
Eliminate HW coherence at L2
Concluding Remarks
• The old multiprocessor/cluster is now within a single chip
• Still a distributed system
• Similar challenges, with new constants
Lecture focused on performance via data placement
Similar challenge: HPC systems strive to privatize data
R-NUCA strives to use “private caches”
New constants: latency, bandwidth, ...
Single OS image allows for many simplifications
• Research opportunity: map ideas from old domain to new
Multicore Research Brings Opportunities and Challenges
“Multicore: This is the one which will have the biggest impact on us. We have never had a problem to solve like this. A breakthrough is needed in how applications are done on multicore devices.”
– Bill Gates
“It’s time we rethink some of the basics of computing. It’s scary and lots of fun at the same time.”
– Burton Smith