The Future of Computer Architecture
Shekhar Borkar, Intel Corp., June 9, 2012
1
This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
The Future of Computer Architecture
Shekhar Borkar Intel Corp.
June 9, 2012
2
Outline
Compute performance goals
Challenges & solutions for:
• Compute,
• Memory,
• Interconnect
Importance of resiliency
Paradigm shift
Summary
3
Performance Roadmap
[Figure: performance in GFLOPS (log scale, 1.E-04 to 1.E+08) versus year (1960 to 2020), marking the Mega, Giga, Tera, Peta, and Exa levels; intervals labeled 12 Years, 11 Years, and 10 Years; curves for Client and Hand-held.]
4
From Giga to Exa, via Tera & Peta
[Figure: four log-scale panels, 1986 to 2016, marking the Giga (G), Tera, and Peta points (and Exa in the last two panels).
• Relative Transistor Performance: 30X, 250X
• Relative Energy/Op: 5V, Vcc scaling
• Concurrency: 36X, 4,000X, 2.5M X
• Power: 80X, 4,000X, 1M X]
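A rough consistency check on the multipliers in these panels. This is an interpretation, not something the slide states: it assumes the multipliers are cumulative from the Giga baseline and that system performance is approximately transistor performance times concurrency.

```python
# Cumulative gains relative to the Giga-scale baseline, as read from the panels.
# Assumption (not stated on the slide): performance ~ transistor gain x concurrency.
gains = {
    "Tera": {"transistor": 30, "concurrency": 36},
    "Peta": {"transistor": 250, "concurrency": 4_000},
}

def system_gain(era):
    """Combined performance gain over the Giga baseline for the given era."""
    g = gains[era]
    return g["transistor"] * g["concurrency"]

for era in gains:
    print(f"Giga -> {era}: about {system_gain(era):,}X")

# Exa is 1e9 X over Giga. With 2.5M X concurrency, the remaining factor
# would have to come from transistor performance.
implied_exa_transistor_gain = 1e9 / 2.5e6
print(f"Implied transistor gain at Exa: {implied_exa_transistor_gain:.0f}X")
```

Under these assumptions the products land close to 1,000X and 1,000,000X, which suggests that most of the gain beyond Tera comes from concurrency rather than from faster transistors.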
5
Energy per Operation
[Figure: Energy (pJ, 0 to 300) by technology node (45nm, 32nm, 22nm, 14nm, 10nm, 7nm) for four series: FP Op (pJ/DP FP), Operands (DP RF Op), DRAM (pJ/bit DRAM), and Communication (pJ/bit Com). Annotations: 75 pJ/bit, 25 pJ/bit, 10 pJ/bit, 100 pJ/bit.]
6
Where is the Energy Consumed?
~3KW total:
• Compute: 100W (100pJ per FLOP)
• Memory: 150W (0.1B/FLOP @ 1.5nJ per Byte)
• Com: 100W (100pJ com per FLOP)
• Disk: 100W (10TB disk @ 1TB/disk @10W)
• 2.5KW: decode and control, address translations…, power supply losses; bloated with inefficient architectural features
Goal: ~20W (5W, 2W, ~5W, ~3W, 5W)
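The wattages on this slide follow from energy per operation multiplied by operation rate. A minimal sketch of that arithmetic, assuming a sustained rate of 1 TFLOP/s (an inference from "100pJ per FLOP" giving 100W, not a figure stated in the transcript):

```python
FLOPS = 1e12  # assumed sustained rate: 1 TFLOP/s

def watts(joules_per_flop, flops=FLOPS):
    """Power = energy per operation x operations per second."""
    return joules_per_flop * flops

compute_w = watts(100e-12)       # 100 pJ per FLOP
memory_w = watts(0.1 * 1.5e-9)   # 0.1 B/FLOP at 1.5 nJ per byte
com_w = watts(100e-12)           # 100 pJ of communication per FLOP
disk_w = (10 / 1) * 10           # 10 TB at 1 TB per disk, 10 W per disk
overhead_w = 2500                # the 2.5KW item from the slide

total_w = compute_w + memory_w + com_w + disk_w + overhead_w
print(f"compute={compute_w:.0f}W memory={memory_w:.0f}W "
      f"com={com_w:.0f}W disk={disk_w:.0f}W total={total_w:.0f}W")
```

The four itemized components sum to 450W. The 2.5KW overhead item accounts for the rest of the roughly 3KW total.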
7
Voltage Scaling
[Figure: normalized Freq, Total Power, Leakage, and Energy Efficiency versus Vdd (Normal), 0.3 to 0.9.]
When designed to voltage scale
8
Near Threshold-Voltage (NTV)
[Figure, 65nm CMOS, 50°C: Maximum Frequency (MHz) and Total Power (mW), log scale, versus Supply Voltage (0.2 to 1.4 V). Annotations: 320mV, Subthreshold Region, 4 Orders, < 3 Orders.]
[Figure, 65nm CMOS, 50°C: Energy Efficiency (GOPS/Watt, 0 to 450) and Active Leakage Power (mW, log scale) versus Supply Voltage (0.2 to 1.4 V). Annotations: 320mV, Subthreshold Region, 9.6X.]
H. Kaul et al, 16.6: ISSCC08
9
NTV Across Technology Generations
[Figure: energy efficiency versus Supply Voltage (V) for three designs.
• 45nm CMOS, 50°C: 32b Multiply, 16b SIMD Multiply, 72b Add; Normalized Energy Efficiency (0 to 9); annotations: 300mV, 8X, 1.1V
• Reconfigurable Fabric, 32nm CMOS, 50°C: Energy Efficiency (TOPS/W) and Active Leakage Power (mW); annotations: 340mV, 0.8mW, 5.7x, Sub-threshold Region
• Register File, Permute Crossbar, 22nm CMOS, 50°C: Energy Efficiency (GOPS/W) and Leakage Power (mW); annotations: 9x, 9x, Sub-threshold Region]
H. Kaul, et al., ISSCC 2009
A. Agarwal, et al., ISSCC 2010
S. K. Hsu, et al., ISSCC 2012
NTV operation improves energy efficiency across 45nm-22nm CMOS
10
Impact of Variation on NTV
[Figure: Freq (Relative) and Frequency Spread (0% to 60%) versus Vdd (Relative), for +/- 5% Variation in Vdd or Vt.]
[Figure: Circuit vulnerability to 5% noise (0 to 6) versus Vdd scaling towards threshold (1.0 down to 0.1).]
5% variation in Vt or Vdd results in 20 to 50% variation in circuit performance
$\text{frequency} \propto \frac{V_{dd} - V_t}{V_{dd}}$
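A small sketch of why the spread grows near threshold, assuming the frequency relation on this slide reads frequency ∝ (Vdd − Vt) / Vdd. The threshold value `VT = 0.3` (relative units) is a hypothetical choice for illustration and does not come from the slide.

```python
def rel_frequency(vdd, vt):
    """Frequency model assumed from the slide: f proportional to (Vdd - Vt) / Vdd."""
    return (vdd - vt) / vdd

def spread(vdd, vt, variation=0.05):
    """Peak-to-peak frequency spread for +/- `variation` in Vt, relative to nominal."""
    nominal = rel_frequency(vdd, vt)
    fast = rel_frequency(vdd, vt * (1 - variation))
    slow = rel_frequency(vdd, vt * (1 + variation))
    return (fast - slow) / nominal

VT = 0.3  # hypothetical threshold voltage, relative units
for vdd in (1.0, 0.6, 0.4):
    print(f"Vdd={vdd:.1f}: spread = {spread(vdd, VT):.0%}")
```

With this model the spread reduces to 2 × variation × Vt / (Vdd − Vt), so the same ±5% shift in Vt matters more as Vdd approaches Vt.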
11
Mitigating Impact of Variation
1. Variation control with body biasing
• Body effect is substantially reduced in advanced technologies
• Energy cost of body biasing could become substantial
• Fully-depleted transistors have no body left
2. Variation tolerance at the system level
Example: Many-core System
[Figure: two 12-core grids. In the first, every core runs at f. In the second, cores run at f, f/2, or f/4.]
• Running all cores at full frequency exceeds energy budget
• Run cores at the native frequency
• Law of large numbers: averaging
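The "law of large numbers" point can be illustrated with a toy simulation: if each core runs at its own native frequency, the chip's aggregate throughput varies less from chip to chip as the core count grows. The variation model below (each core uniformly between 0.5 f and 1.0 f) is purely hypothetical and chosen only to make the averaging effect visible.

```python
import random
import statistics

def chip_throughput(n_cores, rng):
    """Sum of per-core native frequencies (relative to nominal f) for one chip."""
    # Hypothetical variation model: each core lands uniformly in [0.5 f, 1.0 f].
    return sum(rng.uniform(0.5, 1.0) for _ in range(n_cores))

def relative_spread(n_cores, n_chips=200, seed=0):
    """Std-dev of chip throughput divided by its mean, across simulated chips."""
    rng = random.Random(seed)
    samples = [chip_throughput(n_cores, rng) for _ in range(n_chips)]
    return statistics.stdev(samples) / statistics.mean(samples)

for n in (4, 64, 1024):
    print(f"{n:5d} cores: relative spread = {relative_spread(n):.3%}")
```

In this sketch the relative spread shrinks roughly as one over the square root of the core count, which is the averaging the slide alludes to.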
12
Subthreshold Leakage at NTV
[Figure: SD Leakage Power (0% to 60%) by technology node (45nm to 5nm) at 100%, 75%, 50%, and 40% Vdd; annotation: Increasing Variations.]
NTV operation reduces total power, improves energy efficiency
Subthreshold leakage power is a substantial portion of the total
13
Experimental NTV Processor
[Figure: die photo, 1.1 mm x 1.8 mm; blocks: IA-32 Core, Logic, Scan, ROM, L1$-I, L1$-D, Level Shifters + clk spine.]
[Figure: Custom Interposer, 951 Pin FCBGA Package, Legacy Socket-7 Motherboard.]
Technology: 32nm High-K Metal Gate
Interconnect: 1 Poly, 9 Metal (Cu)
Transistors: 6 Million (Core)
Core Area: 2 mm²
S. Jain, et al, “A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS”, ISSCC 2012
14
Power and Performance
[Figure, 32nm CMOS, 25°C: Frequency (MHz, log scale 1 to 1000) and Total Power (mW, 0 to 800) versus Logic Vcc / Memory Vcc (V).
Logic Vcc: 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2
Memory Vcc: 0.55 0.55 0.55 0.55 0.6 0.7 0.8 0.9 1 1.1 1.2
Frequency labels: 915MHz, 500MHz, 100MHz, 3MHz
Power labels: 737mW, 174mW, 17mW, 2mW]
[Figure: power breakdown (Logic dynamic, Logic leakage, Memory leakage) for the Subthreshold, NTV, and Super-threshold regions; percentages as listed: 4%, 33%, 62%, 1%; 53%, 27%, 15%, 5%; 81%, 11%, 3%, 5%.]
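One way to read the labeled points is as energy per clock cycle, which is total power divided by frequency. The sketch below assumes the frequency and power labels pair up in the order listed (915MHz with 737mW, and so on). That pairing is an assumption about the chart, not something the transcript states.

```python
# (frequency in MHz, total power in mW).
# Assumption: labels pair in listing order, highest frequency with highest power.
points = [(915, 737), (500, 174), (100, 17), (3, 2)]

def energy_per_cycle_nj(freq_mhz, power_mw):
    """Energy per clock cycle in nJ (mW divided by MHz gives nJ per cycle)."""
    return power_mw / freq_mhz

for f, p in points:
    print(f"{f:4d} MHz, {p:4d} mW -> {energy_per_cycle_nj(f, p):.2f} nJ/cycle")

# Operating point with the lowest energy per cycle under this pairing.
best = min(points, key=lambda fp: energy_per_cycle_nj(*fp))
print("lowest energy per cycle at", best[0], "MHz")
```

Under this pairing, energy per cycle is lowest at the 100MHz point and rises again at 3MHz, which is consistent with leakage dominating at the lowest voltages.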
15
Observations
[Figure: power breakdown (0% to 100%) into Mem Lkg, Mem Dyn, Logic Lkg, and Logic Dyn at Sub-Vt, NTV, and Full Vdd.]
Leakage power dominates
Fine grain leakage power management is required
16
Memory & Storage Technologies
[Figure: Energy/bit (Pico Joule), Capacity (G Bit), and Cost/bit (Pico $) on a log scale (0.01 to 100000) for SRAM, DRAM, NAND/PCM, and Disk; annotation: (Endurance issue).]
17
Revise DRAM Architecture
Traditional DRAM (RAS, CAS)
• Activates many pages
• Lots of reads and writes (refresh)
• Small amount of read data is used
• Requires small number of pins
New DRAM architecture (Addr)
• Activates few pages
• Read and write (refresh) what is needed
• All read data is used
• Requires large number of IOs (3D)
Energy cost today: ~150 pJ/bit
[Figure labels: Signaling, DRAM Array, M Control]
18
3D Integration of DRAM
[Figure labels: Package, Logic Buffer]
Thin Logic and DRAM die
Through silicon vias
Power delivery through logic die
Energy efficient, high speed IO to logic buffer
Detailed interface signals to DRAMs created on the logic die
The most promising solution for energy efficient BW
19
Communication Energy
[Figure: Energy/bit (pJ, 0 to 8) versus Interconnect Distance (cm, log scale 0.1 to 1000) for four ranges: On Die, Chip to chip, Board to Board, Between cabinets.]
20
On-Die Communication Power
80 Core TFLOP Chip (2006)
• 8 X 10 Mesh, 32 bit links
• 320 GB/sec bisection BW @ 5 GHz
• Power: Dual FPMACs 36%, Router + Links 28%, IMEM + DMEM 21%, Clock dist. 11%, 10-port RF 4%
[Figure: die photo, 21.72mm x 12.64mm, with I/O Area, PLL, and TAP; single tile 1.5mm x 2.0mm.]
48 Core Single Chip Cloud (2009)
• 2 Core clusters in 6 X 4 Mesh (why not 6 x 8?), 128 bit links
• 256 GB/sec bisection BW @ 2 GHz
• Power: Cores 70%, MC & DDR3-800 19%, Routers & 2D-mesh 10%, Global Clocking 1%
[Figure: die photo, 21.4mm x 26.5mm, with System Interface + I/O, four DDR3 MC blocks, PLL, JTAG, VRC, and TILE arrays.]
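Both bisection bandwidth figures can be reproduced from mesh dimensions, link width, and clock frequency. The sketch below rests on two assumptions that the slide does not spell out: the cut runs across the longer mesh dimension, so it crosses as many links as the shorter dimension has nodes, and each link is counted in both directions.

```python
def bisection_bw_gbytes(rows, cols, link_bits, freq_ghz):
    """Bisection bandwidth of a 2D mesh in GB/s.

    Assumptions (not stated on the slide): the cut crosses min(rows, cols)
    links, and each link carries traffic in both directions.
    """
    links_cut = min(rows, cols)
    bytes_per_link = link_bits / 8
    return links_cut * 2 * bytes_per_link * freq_ghz

print(bisection_bw_gbytes(8, 10, 32, 5))   # 80-core chip: 320.0
print(bisection_bw_gbytes(6, 4, 128, 2))   # 48-core chip: 256.0
```

The 48-core chip has fewer links across the cut and a lower clock, yet it reaches a comparable figure because its links are four times wider.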
21
On-die Communication Energy
[Figure: Switch Energy/bit (pJ, log scale 0.01 to 1000) by technology (0.5u, 0.18u, 65nm, 22nm, 7nm); Measured and Extrapolated points.]
Traditional Homogeneous Network
• 0.08 pJ/bit/switch
• 0.04 pJ/mm (wire)
• Assume Byte/Flop
• 27 pJ/Flop on-die communication energy
1. Network power too high (27MW for EFLOP)
2. Worse if link width scales up each generation
3. Cache coherency mechanism is complex
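The 27MW figure is the per-FLOP communication energy scaled to an exaflop rate. A minimal sketch of that arithmetic, which also derives the energy per bit implied by the "Byte/Flop" assumption (8 bits moved per FLOP):

```python
PJ = 1e-12        # joules per picojoule
EXAFLOPS = 1e18   # floating-point operations per second

def network_power_mw(pj_per_flop, flops=EXAFLOPS):
    """Network power in megawatts for a given communication energy per FLOP."""
    return pj_per_flop * PJ * flops / 1e6

power_mw = network_power_mw(27)
pj_per_bit = 27 / 8  # implied by one byte of traffic per FLOP

print(f"{power_mw:.0f} MW at 1 EFLOP/s, {pj_per_bit:.3f} pJ per bit moved")
```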
22
Packet Switched Interconnect
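[Figure: two routers, drawn as STOP signs, 5mm apart. Labels: 3-5 Clk, 3-5 Clk, <1 Clk; 0.5mW, 0.5mW, 0.5mW; 5 Clk.]
[Figure: two routers, drawn as STOP signs, 10mm apart. Labels: 3-5 Clk, 3-5 Clk, 2-3 Clk; 0.5mW, 1mW, 0.5mW; 6-7 Clk.]
1. Routers act like STOP signs and add latency
2. Each hop consumes power (unnecessary)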
23
Mesh: Retrospective
Bus: Good at board level, does not extend well
• Transmission line issues: loss and signal integrity, limited frequency
• Width is limited by pins and board area
• Broadcast, simple to implement
Point to point busses: fast signaling over longer distance
• Board level, between boards, and racks
• High frequency, narrow links
• 1D Ring, 2D Mesh and Torus to reduce latency
• Higher complexity and latency in each node
Hence, emergence of packet switched network
But, pt-to-pt packet switched network on a chip?
24
Interconnect Delay & Energy
[Figure: Delay (ps, log scale 100 to 100000) and energy (pJ/Bit, 0 to 0.9) versus Length (mm, 0 to 25) at 0.3u pitch, 0.5V; curves: Wire Delay, Repeated wire delay, Router Delay, Wire Energy, Router Energy.]
25
Bus: The Other Extreme…
Issues:
• Slow, < 300MHz
• Shared, limited scalability?
Solutions:
• Repeaters to increase freq
• Wide busses for bandwidth
• Multiple busses for scalability
Benefits:
• Power?
• Simpler cache coherency
Move away from frequency, embrace parallelism
26
Repeated Bus (Circuit Switched)
[Figure: bus segments joined by repeaters (R).]
Arbitration:
• Each cycle for the next cycle
• Decision visible to all nodes
Repeaters:
• Align repeater direction
• No driving contention
Technology   Core (mm)   Bus Seg Delay (ps)   Max Bus Freq (GHz)
65nm         5           195                  2.2
45nm         3.5         99                   2
32nm         2.5         51                   1.8
22nm         1.8         26                   1.5
16nm         1.3         13                   1.2
Assume: 10mm die, 1.5u bus pitch, 50ps repeater delay
Anders et al, A 2.9Tb/s 8W 64-Core Circuit-switched Network-on-Chip in 45nm CMOS, ESSCIRC 2008
Anders et al, A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS, ISSCC 2008
27
A Circuit Switched Network
[Figure: 8x8 Circuit-switched NoC with routers joined by 2mm links; a route from Src to Dest through routers 0, 1, n; stages: Packet-switched Request, Circuit-switched Acknowledge, Circuit-switched Data Transmission; clocks: PClk, CClk.]
Anders et al, A 4.1Tb/s Bisection-Bandwidth 560Gb/s/W Streaming Circuit-Switched 8×8 Mesh Network-on-Chip in 45nm CMOS, ISSCC 2008
● Circuit-switched NoC eliminates intra-route data storage
● Packet-switching used only for channel requests
⇒ High bandwidth and energy efficiency (1.6 to 0.6 pJ/bit)
28
Hierarchical & Heterogeneous
[Figure: clusters of four cores (C) on a shared Bus; several clusters joined by a 2nd Level Bus; a variant with routers (R).]
Bus to connect over short distances
Hierarchy of Busses
Or hierarchical circuit and packet switched networks
29
Tapered Interconnect Fabric
[Figure: BW (a.u., 0 to 2500) and Bytes/Flop (a.u., 0 to 5) at the Cores, Unit, Chip, and System levels.]
Tapered, but over-provisioned bandwidth
Pay (energy) as you go (afford)
30
But wait, what about Optical?
Source: Hirotaka Tamura (Fujitsu), ISSCC 08 Workshop on HS Transceivers
[Figure: 65nm; regions labeled Possible today and Hopeful.]
Optical: Pre-driver, Driver, VCSEL, TIA, LA, DeMUX, CDR, Clk Buffer
31
Impact of Exploding Parallelism
[Figure: Million Cores/EFLOP (100 to 450) by technology node (65nm to 5nm) at 1x Vdd, 0.7x Vdd, and 0.5x Vdd.]
Almost flat because Vdd close to Vt
4X increase in the number of cores (Parallelism)
Increased communication and related energy
Increased HW, and unreliability
1. Strike a balance between Com & Computation
2. Resiliency (Gradual, Intermittent, Permanent faults)
32
Road to Unreliability?
From Peta to Exa, and the resulting Reliability Issues:
1,000X parallelism
• More hardware for something to go wrong
• >1,000X intermittent faults due to soft errors
Aggressive Vcc scaling to reduce power/energy
• Gradual faults due to increased variations
• More susceptible to Vcc droops (noise)
• More susceptible to dynamic temp variations
• Exacerbates intermittent faults (soft errors)
Deeply scaled technologies
• Aging related faults
• Lack of burn-in?
• Variability increases dramatically
Resiliency will be the cornerstone
33
Soft Errors and Reliability
[Figure: n-SER/cell (sea-level), 0 to 1, versus Voltage (0.5 to 2 V) for 250nm, 180nm, 130nm, 90nm, and 65nm.]
Soft error/bit reduces each generation
Nominal impact of NTV on soft error rate
[Figure: soft error rate relative to 130nm (log scale, 1 to 10) versus Technology (180 to 32 nm) for Memory and Latch, assuming 2X bit/latch count increase per generation.]
Soft error at the system level will continue to increase
Positive impact of NTV on reliability
• Low V → lower E fields, low power → lower temperature
• Device aging effects will be less of a concern
• Lower electromigration related defects
N. Seifert et al, "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006
34
Resiliency
Faults                   Example
Permanent faults         Stuck-at 0 & 1
Gradual faults           Variability, Temperature
Intermittent faults      Soft errors, Voltage droops
Aging faults             Degradation
Faults cause errors (data & control)
Datapath errors          Detected by parity/ECC
Silent data corruption   Need HW hooks
Control errors           Control lost (Blue screen)
Minimal overhead for resiliency
[Figure: resiliency across the stack (Applications, System Software, Programming system, Microcode, Platform, Microarchitecture, Circuit & Design) through Error detection, Fault isolation, Fault confinement, Reconfiguration, and Recovery & Adapt.]
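To make the distinction between a detected datapath error and silent data corruption concrete, here is a minimal sketch of a single even-parity bit. It is a generic illustration of parity checking, not the scheme used in any particular design. The helper names are hypothetical.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1

def store(word: int) -> tuple[int, int]:
    """Return the word together with its parity bit, as it would be stored."""
    return word, parity(word)

def check(word: int, stored_parity: int) -> bool:
    """True if the word still matches its stored parity bit."""
    return parity(word) == stored_parity

data, p = store(0b1011_0010)

flipped = data ^ (1 << 3)   # a soft error flips one bit
double = data ^ 0b11        # two bits flip and cancel in the parity

print(check(data, p))       # clean read passes
print(check(flipped, p))    # single-bit error is detected
print(check(double, p))     # double-bit error passes unnoticed
```

A single flipped bit is caught, but an even number of flips slips through. This is one way an error can become silent data corruption unless stronger codes or additional hardware hooks are in place.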
35
Compute vs Data Movement
[Figure: Energy (pJ, log scale 1E+00 to 1E+07, with 1pJ, 1nJ, and 1µJ marked) versus Distance (meters, log scale 0.001 to 1000). Points: Computation, Instruction Execution, R/W Operands (RF), Read a bit from internal SRAM, Read a bit from DRAM, 1 bit over link, 1 bit Xfer using Bluetooth, 1 bit Xfer over Ethernet, 1 bit Xfer over WiFi, 1 bit Xfer over 3G.]
Data movement energy will dominate
36
System Level Optimization
[Figure: relative Compute Energy and Interconnect Energy (0 to 1.2) by Technology (45 to 7 nm); annotations: 6X, 1.6X.]
[Figure: Energy versus Supply Voltage for Compute and Global Interconnect.]
Compute energy reduces faster than global interconnect energy
For constant throughput, NTV demands more parallelism
Increases data movement at the system level
System level optimization is required to determine NTV operating point
37
Architecture needs a Paradigm Shift
Architect’s past and present priorities:
Single thread performance   Frequency
Programming productivity    Legacy, compatibility; architecture features for productivity
Constraints                 (1) Cost (2) Reasonable Power/Energy
Architect’s future priorities should be:
Throughput performance      Parallelism, application specific HW
Power/Energy                Architecture features for energy; simplicity
Constraints                 (1) Programming productivity (2) Cost
Must revisit and evaluate each (even legacy) architecture feature
38
A non-architect’s biased view…
Power, energy
Variations
Interconnects
Soft-errors
Resiliency
Bandwidth
39
Appealing to the Architects
Exploit transistor integration capacity
Dark Silicon is a self-fulfilling prophecy
Compute is cheap, reduce data movement
Use futuristic work-loads for evaluation
Fearlessly break the legacy shackles
40
Summary
Power & energy challenge continues
Opportunistically employ NTV operation
3D integration for DRAM
Hierarchical, heterogeneous, tapered interconnect
Resiliency spanning the entire stack
Does computer architecture have a future?
• Yes, if you acknowledge issues & challenges, and embrace the paradigm shift
• No, if you keep “head buried in the sand”!