7/20/06
Sub-45nm Circuit Technologies for High-performance Energy-efficient Microprocessors
Sanu Mathew and Ram Krishnamurthy
Circuit Research Lab, Intel Corporation
[email protected]@intel.com
Contributors: Mark Anders, Steven Hsu, Himanshu Kaul, Amit Agarwal, Yatin Hoskote, Nitin Borkar, Sapumal Wijeratne, Nanda Siddaiah, Bart Zeydel, Vojin Oklobdzija, David Harris, Wajdi Feghali, Kirk Yap
Sub-45nm Power-Performance Challenge
[Figure: MIPS (log scale, 0.01 to 1000000) vs. year (1970 to 2010) for the 8086, 286, 386, 486, Pentium® Architecture, Pentium® Pro Architecture, and Pentium® 4 Architecture]
[Figure: Power (Watts, log scale, 0.1 to 1000) vs. year (1970 to 2020) for the 8080, 8086, 386, Pentium® proc, and Pentium® 4 proc; projection labeled "1000's of Watts?"]
Strong demand for >TIPS performance in 2010+
Power will be the limiter to reach that
[Figure: MIPS and MIPS/Watt (log scale, 0.01 to 1000000) vs. year (1970 to 2010); annotation: MIPS/W slowdown]
>2% power increase for every 1% performance ⇒ poor MIPS/Watt
Platform 2015 Vision
"Over time, important functions once relegated to software and specialized chips are typically absorbed into the microprocessor itself. By moving functions on chip, such capabilities benefit from more efficient execution, superior economies of scale, and drastically reduced power consumption. Special-purpose hardware is an important ingredient of Intel's future processor and platform architectures."
Justin Rattner, "Platforms 2015", IDF Keynote, March 3rd, 2005
Performance Through Parallelism
[Figure: performance vs. time (2000, 2004, 2008+); labels: SINGLE CORE, MULTI-CORE, 3X, 10X, FORECAST, "You Are Here"]
GOPS/Watt Distinction: General-purpose vs. Dedicated
Dedicated hardware: 100x higher energy-efficiency than GP
DSP apps: Amenable to parallelism and pipelining
Efficient power-performance optimization
[Figure: MOPS/mW (log scale, 0.01 to 10000) for individual chips grouped as Microprocessors, DSPs, and Dedicated HW; gap annotations: 10x and 100x]
Special Purpose HW in Multi-core Processors
[Figure labels: Low-power General-purpose core; SP HW Accelerators]
1000GOPS+ SP HW accelerator cores integrated into a multi-core processor
Challenges:
– Fixed function vs. limited programmable
– Multi/dual-supply voltage level converters
– Ultra low voltage operation, variation tolerance
The Leap to Parallelism: Driving Energy-Efficient Performance
[Figure: energy-efficient performance vs. time; steps labeled Multi-Processor, Hyper-Threading, Dual-Core, Quad-Core, and "The Next Leap"]
2005 – First Intel dual-core ships
2H'06 – Next Generation Core Arch.
Special-purpose Hardware Accelerators: Next Leap in Performance beyond multi-core
Special-purpose HW for Viterbi Decoding
90nm 64-state Viterbi Accelerator
Performance and power critical workload in wireless base-band, DVD codec, HDD signaling etc.
64-state radix-2 design: 40mW at 500Mbps in 90nm CMOS
New leakage-tolerant trace-back register file and bit-serial ACS circuits
10X faster than current WLAN designs
~10X higher GOPS/W than best reported
[Figure: 500Mbps Viterbi Accelerator, Intel design; ACS: 230µm x 210µm; Trace-back: 260µm x 510µm]
M. Anders et al, 2004 VLSI Circuits Symp.
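The ACS (add-compare-select) operation named on this slide is the per-state recursion of Viterbi decoding. The following is a minimal software sketch of one radix-2 ACS update over a 64-state trellis. It is not the bit-serial circuit from the cited work: the shift-register state numbering and the `branch_metric` callback are assumptions made for illustration.

```python
NUM_STATES = 64  # 64-state trellis, as on the slide


def acs_step(path_metrics, branch_metric):
    """One radix-2 add-compare-select update over all states.

    path_metrics: list of NUM_STATES accumulated metrics (lower is better).
    branch_metric(prev_state, next_state): hypothetical cost callback.
    Returns (new_metrics, decisions); decisions[n] is 0 or 1 and records
    which of the two predecessors survived, for use during trace-back.
    """
    new_metrics, decisions = [], []
    for n in range(NUM_STATES):
        # Assumed trellis: next = ((prev << 1) | bit) % 64, so state n has
        # exactly two predecessors that differ only in their top bit.
        preds = (n >> 1, (n >> 1) | (NUM_STATES >> 1))
        candidates = [path_metrics[p] + branch_metric(p, n) for p in preds]  # add
        decision = 0 if candidates[0] <= candidates[1] else 1                # compare
        new_metrics.append(candidates[decision])                             # select
        decisions.append(decision)
    return new_metrics, decisions


# Start in state 0 with every other state unreachable.
INF = float("inf")
metrics, decisions = acs_step([0] + [INF] * (NUM_STATES - 1), lambda p, n: 1)
```

The stored decision bits are what a trace-back memory would hold; one decision per state per decoded bit is the reason the trace-back storage appears as its own block on the slide.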
Special Purpose HW for DSP Filtering
90nm 110GOPS/Watt Filter Accelerator
Performance and power critical workload in video, graphics, SIMD, base-band (FIR, FFT, DCT)
16-bit integer DSP multiplier with reconfigurable PLA control engine in 90nm CMOS
1GHz single-cycle throughput at 9mW
Highest reported performance per watt
>3X benefit over previously published design
[Figure: maximum frequency (GHz, 0.0 to 1.6) and total power (mW, 0 to 35) vs. supply voltage (0.4V to 2.2V); 90nm, 50°C]
S. Hsu et al, ISSCC'05
Filter Accelerator: Comparisons
9pJ/OP or 110GOPS/Watt at nominal 1.3V
1.6pJ/OP or 630GOPS/Watt at 0.57V
Highest Filter MACC power-performance in industry
[Figure: performance (Hz/OPS, 1.E+07 to 1.E+12) vs. power (W, 1.E-04 to 1.E+01) with 0.1, 1, 10, and 100GOPS/W reference lines; Intel design at 110GOPS/W (1.3V) and 630GOPS/W (0.57V) plotted against Xemics[1], Berkeley[2], Lucent[3], Hitachi[4], NEC[5], TI[6], ADI[8], Fujitsu[9], Fujitsu[10], Toshiba[11], and Lucent[12]; inset: measurement test setup]
~10-60X higher GOPS/W than best reported
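The slide quotes each operating point in two equivalent units, energy per operation (pJ/OP) and efficiency (GOPS/Watt). A short sketch of the conversion follows; the function name is made up for illustration, and the only inputs are the two pJ/OP figures from the slide.

```python
def gops_per_watt(pj_per_op: float) -> float:
    """Convert energy per operation (pJ) to efficiency (GOPS/Watt).

    1 W = 1e12 pJ/s, so operations per second per watt = 1e12 / pj_per_op.
    Dividing by 1e9 gives GOPS/W = 1000 / pj_per_op.
    """
    return 1000.0 / pj_per_op


nominal = gops_per_watt(9.0)      # nominal 1.3V operating point
low_voltage = gops_per_watt(1.6)  # 0.57V operating point
print(round(nominal), round(low_voltage))
```

The results, about 111 and 625 GOPS/Watt, agree with the quoted 110 and 630 GOPS/Watt to within the rounding of the pJ/OP figures.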
Crypto Accelerator
Encryption apps are computationally intensive due to:
– Wide operand bit-widths (e.g., 2048-bit modular exponentiations)
– Iterative, parallelizable algorithms (e.g., 48-round bit permutations)
Energy-inefficient on GP execution cores
Hardware accelerator offers 5x higher performance/watt
Encryption workloads offloaded to a Crypto-accelerator
[Figure: 8-core processor with Graphics & Crypto Accelerators; labeled blocks include Core0 through Core5, Graphics, and Crypto (SSL/RSA, DES/3-DES, AES, SHA-1)]
Crypto Accelerator: SSL
A×B mod C is a key operation in RSA cryptography
Scalable design ⇒ reconfigurable for 256/1024bit op.
Reconfigurable for RSA, Diffie-Hellman, DSA, ECC
15K 256-bit exponentiations/s and 7.3M Mults/s
44% speedup over prior-art
[Figure: 256-bit kernel MM datapath built from 16-bit PEs; 256-bit kernel layout, 354μm x 146μm]
D. Harris, S. Mathew et al, 2005 Arith-17 Symp.
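To show why A×B mod C is the key operation, here is a minimal sketch of modular exponentiation built only from repeated modular multiplications (left-to-right square-and-multiply). It is an illustration, not the accelerator's algorithm: the slide's "MM datapath" is not modeled, Python's `%` operator stands in for the hardware reduction, and the function names are hypothetical.

```python
import random


def mod_mul(a: int, b: int, c: int) -> int:
    """A x B mod C, the kernel operation named on the slide."""
    return (a * b) % c


def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right square-and-multiply using only mod_mul."""
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:                    # exponent bits, MSB first
        result = mod_mul(result, result, modulus)    # square on every bit
        if bit == "1":
            result = mod_mul(result, base, modulus)  # multiply on 1 bits
    return result


# Cross-check against Python's built-in three-argument pow on 256-bit operands.
rng = random.Random(0)
m = rng.getrandbits(256) | (1 << 255) | 1
b = rng.getrandbits(256)
e = rng.getrandbits(256)
assert mod_exp(b, e, m) == pow(b, e, m)
```

Each exponent bit costs one squaring and at most one multiply, so exponentiation throughput follows directly from modular multiplier throughput.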
Special Purpose HW for TCP-processing
TCP Offload Engine
[Figure: TCP offload engine die photo; labeled blocks include TCB, Exec Core, PLL, ROB, ROM, CLB, Input seq, and Send buffer]
2.23 mm x 3.54 mm, 260K transistors (Y. Hoskote et al., ISSCC '03)
Opportunities for acceleration:
– Network processing engines
– MPEG Encode/Decode engines
– Speech engines
[Figure: MIPS (log scale, 1.E+02 to 1.E+06) vs. year (1995 to 2015); GP MIPS@75W compared with TOE MIPS@~2W]
Special purpose HW: Best MIPS/Watt
GP Processor Core Challenges
Energy-efficient ALU
– Sparse-tree adder core
Low-power dual-Vcc operation
– Split-output level converter
High-performance interconnects
– Full-swing transition-encoded domino busses
Leakage/Variation tolerant memories
– Full-swing register file with compensation keeper
Execution Core Power Density
[Figure: Itanium2® and Pentium4® thermal maps (Temp, °C; scale 110 to 130); labels: execution core 120°C, cache 70°C, integer & FP ALUs]
Integer & FPU ALUs/AGUs: performance & power limiters
High activity ⇒ thermal hotspots and peak-current limiters
Requires high-performance, low-power execution core circuits
Sparse-Tree Adders
Generate only every 4th carry in parallel, i.e., C3, C7 etc.
Non-critical 4-bit sum generator side path to produce sum
73% fewer carry-merge gates ⇒ fast and energy-efficient
Non-critical Conditional Sum Generator
Generates conditional sums for each cluster of 4-bits
Sparse-tree carry selects appropriate sum
Non-critical path: 4-bit ripple carry chain
Reduced area, energy consumption, and active leakage
[Figure: conditional sum generator; propagate/generate inputs Pi, Gi through Pi+2, Gi+2 feed an optimized 1st-level carry-merge (CM) stage and XOR gates that form Sumi,0 and Sumi,1; 2:1 multiplexers controlled by the sparse-tree carry select Sumi through Sumi+3]
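The two adder slides can be tied together with a small behavioral model: carries are produced only at 4-bit boundaries (C3, C7, ...), while each 4-bit cluster ripples out two conditional sums and the incoming carry picks one. This is a functional sketch under stated assumptions, not the published circuit: the group carries are merged sequentially here rather than in a parallel tree, and all names are illustrative.

```python
import random

GROUP = 4  # bits per cluster; one carry is generated per cluster


def carry_merge(hi, lo):
    """Merge (generate, propagate) pairs; hi covers the more significant bits."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def group_gp(a, b, start):
    """(generate, propagate) of bits start..start+3, built by carry-merge."""
    gp = None
    for i in range(start, start + GROUP):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        bit_gp = (ai & bi, ai ^ bi)
        gp = bit_gp if gp is None else carry_merge(bit_gp, gp)
    return gp


def conditional_sums(a, b, start):
    """4-bit ripple-carry sums for carry-in 0 and carry-in 1 (side path)."""
    sums = []
    for cin in (0, 1):
        carry, total = cin, 0
        for i in range(start, start + GROUP):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            total |= (ai ^ bi ^ carry) << (i - start)
            carry = (ai & bi) | ((ai ^ bi) & carry)
        sums.append(total)
    return sums


def sparse_tree_add(a, b, width=64):
    """Add a and b modulo 2**width using every-4th-carry selection."""
    result, carry_in = 0, 0
    for start in range(0, width, GROUP):
        sum0, sum1 = conditional_sums(a, b, start)
        result |= (sum1 if carry_in else sum0) << start  # 2:1 select
        g, p = group_gp(a, b, start)
        carry_in = g | (p & carry_in)  # carry out of this cluster (C3, C7, ...)
    return result


rng = random.Random(1)
for _ in range(1000):
    x, y = rng.getrandbits(64), rng.getrandbits(64)
    assert sparse_tree_add(x, y) == (x + y) & (2**64 - 1)
```

Because the ripple chains never wait on the carry tree, only the final 2:1 select sits on the critical path, which matches the slide's description of the sum generator as non-critical.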
90nm 7GHz 64-bit Integer ALU (ISSCC'04)
• 7GHz single-cycle 64-bit integer ALU (measured in 90nm CMOS)
• Simultaneous 9GHz single-cycle 32-bit integer ALU mode
Fastest reported single-cycle 64-bit integer ALU performance
64-bit ALU die microphotograph and measured performance summary
[Figure: die microphotograph; labeled blocks: lower-order 32-bit ALU, upper-order 32-bit ALU, I/O circuits, clock generator and drivers]
Process: 90nm Dual-Vt CMOS, 7 Metal
64-bit ALU layout area: 0.073mm²
Total transistor count: 6100
64-bit ALU maximum frequency: 7GHz at 2.1V, 25°C
32-bit ALU average switching power (α=0.3): 71mW at 7GHz, 1.3V, 25°C
32-bit ALU active leakage power: 4.4mW at 1.3V, 25°C
64-bit ALU average switching power (α=0.3): 89mW at 4GHz, 1.3V, 25°C
64-bit ALU active leakage power: 9.6mW at 1.3V, 25°C
Die area: 0.474mm²
S. Mathew et al, ISSCC 2004 & JSSC 01/05
Split-output Level Converter
• Contention in CVSL LCB degrades delay
• Split-output LCB decouples CVSL stage from output driver stage
• Fast level conversion due to low contention
• Reduced fanin load on clock grid
[Figure: conventional CVSL LCB and split-output LCB schematics, each with low-Vcc and high-Vcc domains and in/out terminals]
LCB Energy-Delay Comparisons
[Figure: LCB energy (pJ, 0.24 to 0.36) vs. LCB delay (ps, 0 to 250) for conventional CVSL and split-output LCBs, with 47% and 16% annotations; Vcch=1.2V, Vccl=0.8V, 130nm, 30°C simulation]
Fanin cap (fF): conventional CVSL 8.2; this work 7.1 (-14%)
Total area (mm2): conventional CVSL 15.5; this work 13.8 (-11%)
CVSL-stage contention energy (pJ): conventional CVSL 0.085; this work 0.039 (-54%)
Effective low-energy alternative to CVSL LCB
R. Krishnamurthy et al, 2002 Symp. VLSI Circuits
Transition-Encoded Bus
Encoder circuit
– XOR of previous and current input
Decoder circuit
– XOR of previous output and bus state
Domino delay performance
– Collinear cap reduction
Static bus energy
– Transition-dependent activity
[Figure: bus schematic with encode and decode blocks, flip-flops (FF), data input D1, and clock phases Φ1 and Φ2]
M. Anders et al, 2002 VLSI Circuits Symp.
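The encoder and decoder rules above can be checked with a small behavioral model. This sketch treats one bus wire as a sequence of per-cycle bits and assumes both ends start from the same initial value (0 here); the function names and the initial-state convention are illustrative, not taken from the cited paper.

```python
def encode(inputs, initial=0):
    """Encoder: bus bit = previous input XOR current input."""
    prev, bus = initial, []
    for x in inputs:
        bus.append(prev ^ x)
        prev = x
    return bus


def decode(bus, initial=0):
    """Decoder: output = previous output XOR bus state."""
    prev_out, outputs = initial, []
    for t in bus:
        prev_out ^= t
        outputs.append(prev_out)
    return outputs


data = [0, 0, 1, 1, 1, 0, 1, 1]
bus = encode(data)
assert decode(bus) == data
print(bus)       # [0, 0, 1, 0, 0, 1, 1, 0]: a 1 only when the input changed
print(sum(bus))  # 3: bus activity equals the number of input transitions
```

In this model the bus carries a 1 only in cycles where the data changes, which is the sense in which a precharged domino bus can take on the transition-dependent activity of a static bus.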
Transition-Encoded Bus: Results
Transition only when current input != previous input
Dynamic bus performance but energy profile of static bus
Energy scales linearly with input switching activity
79% of full-chip buses: 10%-35% delay improvement
[Figure: % delay improvement (-10 to 40) vs. bus length (1 to 16 mm); 0.18μm Pentium® 4 simulations]
M. Anders et al, 2002 VLSI Circuits Symp.
PVT Variation tolerant memories: Keeper Upsizing
PVT induced leakage variation
– Traditional noise engineering: diminishing ROI
Additional keeper enables customization
– High leakage states: 8% keeper
– Low leakage states: 4% keeper
[Figure: normalized delay (0.7 to 1.6) vs. DC noise robustness (0.25 to 1.75) under keeper upsizing for low-Vt and high-Vt devices, with 5%, 10%, and 20% keeper points]
[Figure: local bitline circuit with lower and upper LBL, 8X annotation, additional 4% keeper, "compensation enable" and "clk enable" signals, and output "out"]
A. Alvandpour et al, 2001 Symp. VLSI Circuits
PVT Compensation Motivation
[Figure: normalized leakage (0 to 1) vs. voltage (0.5 to 1.4 V) and temperature (20 to 110 °C) for slow, typical, and fast process corners]
10⁴X leakage variation across PVT
PVT Compensation Benefit
High leakage dies: 27% robustness
Low leakage dies: 10% delay
Industry's highest performance/Watt ⇒ 96GOPs/Watt
[Figure: 16x64b register file array; compensation setting (Comp. OFF, Comp. ON) mapped over voltage (0.5 to 1.4 V) and temperature (20 to 110 °C) for fast and typical dies]
65nm CMOS, die area: 0.017mm², 68K transistors
P1264 Si measurements, 50°C:
Nominal (1.2V): frequency 8.8GHz, power 198mW, leakage 25mW
Peak Performance (1.4V): frequency 10.1GHz, power 273mW, leakage 57mW
Low-Voltage Mode (0.5V): frequency 300MHz, power 1.3mW, leakage 405µW
S. Hsu et al., ISSCC '06
Summary
Performance through parallelism: Multi-core
Special-purpose hardware accelerators provide higher MIPS/Watt vs. general-purpose cores
Energy-efficient, leakage/variation tolerant circuits required for scalable GP performance
– Sparse-tree adder
– Split-output level converter
– Transition-encoded interconnects
– Leakage/variation tolerant register files