
Sub-45nm Circuit Technologies for High-performance Energy-efficient Microprocessors

Sanu Mathew and Ram Krishnamurthy
Circuit Research Lab, Intel Corporation

[email protected]

Contributors: Mark Anders, Steven Hsu, Himanshu Kaul, Amit Agarwal, Yatin Hoskote, Nitin Borkar, Sapumal Wijeratne, Nanda Siddaiah, Bart Zeydel, Vojin Oklobdzija, David Harris, Wajdi Feghali, Kirk Yap

7/20/06

Sub-45nm Power-Performance Challenge

[Figure: MIPS vs. year, 1970 to 2010, log scale 0.01 to 1,000,000; data points: 8086, 286, 386, 486, Pentium® Architecture, Pentium® Pro Architecture, Pentium® 4 Architecture]

[Figure: Power (Watts) vs. year, 1970 to 2020, log scale 0.1 to 1000; data points: 8080, 8086, 386, Pentium® proc, Pentium® 4 proc; projection: "1000's of Watts?"]

[Figure: MIPS and MIPS/Watt vs. year, 1970 to 2010, log scale 0.01 to 1,000,000; annotation: MIPS/W slowdown]

Strong demand for >TIPS performance in 2010+
Power will be the limiter to reach that
>2% power increase for every 1% performance ⇒ poor MIPS/Watt
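The "2% power for 1% performance" rule can be turned into a rough worked example. The sketch below assumes a simple power-law model, power ∝ performance^k with k = 2, which is one reading of the slide rather than a model stated by the authors; the function name and the `elasticity` parameter are illustrative.

```python
def relative_mips_per_watt(perf_gain: float, elasticity: float = 2.0) -> float:
    """Return MIPS/Watt relative to the baseline after a performance gain.

    Assumption (not from the slides): power scales as perf_gain ** elasticity,
    where elasticity = 2 corresponds to "2% more power per 1% more performance".
    """
    power_gain = perf_gain ** elasticity
    return perf_gain / power_gain


if __name__ == "__main__":
    for gain in (1.0, 2.0, 10.0):
        ratio = relative_mips_per_watt(gain)
        print(f"{gain:>5.1f}x performance -> {ratio:.3f}x MIPS/Watt")
```

Under this assumption, a 10x performance gain costs 100x the power, so MIPS/Watt drops to one tenth of the baseline. Since the slide says ">2%", k = 2 is the optimistic end of the range.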

Platform 2015 Vision

"Over time, important functions once relegated to software and specialized chips are typically absorbed into the microprocessor itself. By moving functions on chip, such capabilities benefit from more efficient execution, superior economies of scale, and drastically reduced power consumption. Special-purpose hardware is an important ingredient of Intel's future processor and platform architectures."

Justin Rattner, "Platforms 2015", IDF Keynote, March 3rd, 2005

Performance Through Parallelism

[Figure: performance vs. time, 2000 to 2008+, single-core vs. multi-core; labels: 3X, 10X, 2004, "You Are Here", FORECAST]

GOPS/Watt Distinction: General-purpose vs. Dedicated

Dedicated hardware: 100x higher energy-efficiency than GP
DSP apps: Amenable to parallelism and pipelining
Efficient power-performance optimization

[Figure: MOPS/mW, log scale 0.01 to 10000, for three groups: Microprocessors (PPC, Sparc, P4 x86, Alpha, Itanium variants), DSPs (SA-DSP, Hitachi-DSP, Fuj-DSP, Cell-SPE, KAIST-DSP, NEC-DSP, Fuj-Multi), and Dedicated HW (MPEG2 Enc, Encrypt, MUD, MPEG2, 802.11a, H.264); annotations: 10x, 100x]

Special Purpose HW in Multi-core Processors

[Figure: multi-core die with low-power general-purpose cores and SP HW accelerators]

1000GOPS+ SP HW accelerator cores integrated into a multi-core processor

Challenges:
– Fixed function vs. limited programmable
– Multi/dual-supply voltage level converters
– Ultra low voltage operation, variation tolerance

The Leap to Parallelism: Driving Energy-Efficient Performance

[Figure: energy-efficient performance vs. time; successive steps: Multi-Processor, Hyper-Threading, Dual-Core, Quad-Core, ..., The Next Leap]

2005 – First Intel dual-core ships
2H'06 – Next Generation Core Arch.

Special-purpose Hardware Accelerators: Next Leap in Performance beyond multi-core

Special-purpose HW for Viterbi Decoding
90nm 64-state Viterbi Accelerator

Performance and power critical workload in wireless base-band, DVD codec, HDD signaling etc.

64-state radix-2 design: 40mW at 500Mbps in 90nm CMOS
New leakage-tolerant trace-back register file and bit-serial ACS circuits
10X faster than current WLAN designs
~10X higher GOPS/W than best reported

[Figure: 500Mbps Viterbi accelerator layout (Intel design); ACS: 230µm x 210µm; Trace-back: 260µm x 510µm]

M. Anders et al, 2004 VLSI Circuits Symp.

Special Purpose HW for DSP Filtering
90nm 110GOPS/Watt Filter Accelerator

Performance and power critical workload in video, graphics, SIMD, base-band (FIR, FFT, DCT)

16-bit integer DSP multiplier with reconfigurable PLA control engine in 90nm CMOS
1GHz single-cycle throughput at 9mW
Highest reported performance per watt
>3X benefit over previously published design

[Figure: maximum frequency (GHz, 0.0 to 1.6) and total power (mW, 0 to 35) vs. supply voltage (V, 0.4 to 2.2); 90nm, 50°C]

S. Hsu et al, ISSCC'05

Filter Accelerator: Comparisons

9pJ/OP or 110GOPS/Watt at nominal 1.3V
1.6pJ/OP or 630GOPS/Watt at 0.57V
Highest Filter MACC power-performance in industry
~10-60X higher GOPS/W than best reported

[Figure: performance (Hz/OPS, 1.E+07 to 1.E+12) vs. power (W, 1.E-04 to 1.E+01), with constant-efficiency lines at 0.1GOPS/W, 1GOPS/W, 10GOPS/W, and 100GOPS/W; prior designs: Xemics[1], Berkeley[2], Lucent[3], Hitachi[4], NEC[5], TI[6], ADI[8], Fujitsu[9], Fujitsu[10], Toshiba[11], Lucent[12]; Intel design: 110GOPS/W at 1.3V and 630GOPS/W at 0.57V]

[Figure: measurement test setup]
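The pJ/OP and GOPS/Watt figures quoted for the filter accelerator are two views of the same quantity, which a short conversion makes explicit. The sketch below is a unit-conversion aid only; the function names are illustrative, and counting one operation per clock cycle is an assumption used to relate the "1GHz at 9mW" figure to the efficiency numbers.

```python
def gops_per_watt(pj_per_op: float) -> float:
    """Convert energy per operation (pJ/OP) to efficiency (GOPS/Watt).

    1 W = 1 J/s, so ops per joule = 1 / (pj_per_op * 1e-12);
    dividing by 1e9 gives GOPS/Watt = 1000 / pj_per_op.
    """
    return 1e3 / pj_per_op


def gops_per_watt_from_clock(freq_hz: float, power_w: float) -> float:
    """Efficiency assuming one operation per clock cycle (an assumption)."""
    return freq_hz / power_w / 1e9


if __name__ == "__main__":
    print(f"9 pJ/OP     -> {gops_per_watt(9.0):.0f} GOPS/W")
    print(f"1.6 pJ/OP   -> {gops_per_watt(1.6):.0f} GOPS/W")
    print(f"1 GHz, 9 mW -> {gops_per_watt_from_clock(1e9, 9e-3):.0f} GOPS/W")
```

The conversion gives about 111 and 625 GOPS/Watt, matching the quoted 110 and 630 to within rounding of the pJ/OP values.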

Crypto Accelerator

Encryption apps are computationally intensive due to:
– Wide operand bit-widths (e.g., 2048-bit modular exponentiations)
– Iterative, parallelizable algorithms (e.g., 48-round bit permutations)
Energy-inefficient on GP execution cores
Hardware accelerator offers 5x higher performance/watt
Encryption workloads offloaded to a Crypto-accelerator

[Figure: 8-core processor with Graphics & Crypto Accelerators; blocks: Core0, Core1, Core2, Core3, Core4, Core5, Graphics, Crypto (SSL/RSA, DES/3-DES, AES, SHA-1)]

Crypto Accelerator: SSL

A×B mod C is a key operation in RSA cryptography
Scalable design ⇒ reconfigurable for 256/1024-bit operands
Reconfigurable for RSA, Diffie-Hellman, DSA, ECC
15K 256-bit exponentiations/s and 7.3M Mults/s
44% speedup over prior-art

[Figure: 256-bit kernel layout, 354μm x 146μm; 256-bit kernel MM datapath built from 16-bit PEs]

D. Harris, S. Mathew et al, 2005 Arith-17 Symp.
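To see why A×B mod C dominates RSA-style workloads, here is a minimal software sketch of modular exponentiation built only from that one operation. It uses plain square-and-multiply with Python's built-in big integers. This is an assumption chosen for clarity and does not model the accelerator's MM datapath or its 16-bit PEs; the names `mod_mul` and `mod_exp` are illustrative.

```python
def mod_mul(a: int, b: int, c: int) -> int:
    """The kernel operation: A x B mod C."""
    return (a * b) % c


def mod_exp(base: int, exponent: int, modulus: int) -> tuple[int, int]:
    """Right-to-left square-and-multiply.

    Returns (base ** exponent mod modulus, number of mod_mul calls).
    """
    result = 1 % modulus
    base %= modulus
    mults = 0
    while exponent > 0:
        if exponent & 1:                       # multiply step for each set bit
            result = mod_mul(result, base, modulus)
            mults += 1
        base = mod_mul(base, base, modulus)    # squaring step for every bit
        mults += 1
        exponent >>= 1
    return result, mults


if __name__ == "__main__":
    value, mults = mod_exp(4, 13, 497)
    print(value, mults)
```

In this scheme a k-bit exponent needs at most 2k modular multiplications, so at most 512 for a 256-bit exponent. The quoted rates (7.3M multiplications/s against 15K exponentiations/s, about 487 per exponentiation) are of the same order, though the slide does not say how the two figures are related.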

Special Purpose HW for TCP-processing
TCP Offload Engine

[Figure: TCP offload engine die; blocks: TCB, Exec Core, PLL, ROB, ROM, CAM, CLB, Input seq, Send buffer; 2.23 mm x 3.54 mm, 260K transistors (Y. Hoskote et al., ISSCC '03)]

Opportunities for acceleration:
– Network processing engines
– MPEG Encode/Decode engines
– Speech engines

[Figure: MIPS (1.E+02 to 1.E+06) vs. year, 1995 to 2015; curves: GP MIPS @ 75W and TOE MIPS @ ~2W]

Special purpose HW: Best MIPS/Watt

GP Processor Core Challenges

Energy-efficient ALU
– Sparse-tree adder core
Low-power dual-Vcc operation
– Split-output level converter
High-performance interconnects
– Full-swing transition-encoded domino busses
Leakage/Variation tolerant memories
– Full-swing register file with compensation keeper

Execution Core Power Density

Integer & FPU ALUs/AGUs: performance & power limiters
High activity ⇒ thermal hotspots and peak-current limiters
Requires high-performance, low-power execution core circuits

[Figure: Itanium2® thermal map; execution core (integer & FP ALUs) at 120°C, cache at 70°C]

[Figure: Pentium4® thermal map; temperature scale 110 to 130°C; execution core marked]

Sparse-Tree Adders

Generate only every 4th carry in parallel, i.e., C3, C7 etc.
Non-critical 4-bit sum generator side path to produce sum
73% fewer carry-merge gates ⇒ fast and energy-efficient

Non-critical Conditional Sum Generator

Non-critical path: 4-bit ripple carry chain
Reduced area, energy consumption, and active leakage
Generates conditional sums for each cluster of 4-bits
Sparse-tree carry selects appropriate sum

[Figure: conditional sum generator schematic; inputs Pi, Gi, Pi+1, Gi+1, Pi+2, Gi+2; optimized 1st-level carry-merge (CM) gates, XOR gates, and 2:1 muxes produce conditional sums (Sumi,0 and Sumi,1); the carry selects Sumi, Sumi+1, Sumi+2, Sumi+3]
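The sparse-tree organization can be illustrated with a small functional model: carries are produced only at 4-bit boundaries (C3, C7, ...), while each 4-bit cluster ripples out two conditional sums and the incoming carry picks one. This is a behavioral sketch under stated assumptions, not the circuit. The boundary carries are chained sequentially here for readability, whereas the hardware computes them in parallel with carry-merge gates, and all function names are illustrative.

```python
GROUP = 4  # bits per cluster; the sparse tree yields one carry per cluster


def group_generate_propagate(a_grp: int, b_grp: int) -> tuple[int, int]:
    """Merge bit-level generate/propagate into one (G, P) pair per cluster."""
    gen, prop = 0, 1
    for i in range(GROUP):
        g = (a_grp >> i) & (b_grp >> i) & 1
        p = ((a_grp >> i) ^ (b_grp >> i)) & 1
        gen = g | (p & gen)
        prop &= p
    return gen, prop


def conditional_sums(a_grp: int, b_grp: int) -> tuple[int, int]:
    """Non-critical side path: ripple sums for carry-in = 0 and carry-in = 1."""
    sums = []
    for carry in (0, 1):
        total = 0
        for i in range(GROUP):
            ai, bi = (a_grp >> i) & 1, (b_grp >> i) & 1
            total |= (ai ^ bi ^ carry) << i
            carry = (ai & bi) | ((ai ^ bi) & carry)
        sums.append(total)
    return sums[0], sums[1]


def sparse_tree_add(a: int, b: int, width: int = 64) -> tuple[int, int]:
    """Return (sum mod 2**width, carry-out) using per-cluster carry selection."""
    mask = (1 << GROUP) - 1
    carry, result = 0, 0
    for lo in range(0, width, GROUP):
        a_grp, b_grp = (a >> lo) & mask, (b >> lo) & mask
        sum0, sum1 = conditional_sums(a_grp, b_grp)
        result |= (sum1 if carry else sum0) << lo    # 2:1 mux on the carry
        gen, prop = group_generate_propagate(a_grp, b_grp)
        carry = gen | (prop & carry)                 # C3, C7, C11, ...
    return result, carry


if __name__ == "__main__":
    print(sparse_tree_add(123456789, 987654321))
```

The point of the split is that the ripple chains sit off the critical path, so only the boundary carries need the fast merge logic.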

90nm 7GHz 64-bit Integer ALU (ISSCC'04)

• 7GHz single-cycle 64-bit integer ALU (measured in 90nm CMOS)
• Simultaneous 9GHz single-cycle 32-bit integer ALU mode

Fastest reported single-cycle 64-bit integer ALU performance

64-bit ALU die microphotograph and measured performance summary

[Figure: die microphotograph; lower-order 32-bit ALU, upper-order 32-bit ALU, I/O circuits, clock generator and drivers]

Process                                          90nm Dual-Vt CMOS, 7 Metal
64-bit ALU layout area                           0.073mm²
Total transistor count                           6100
64-bit ALU maximum frequency                     7GHz at 2.1V, 25°C
32-bit ALU average switching power (α=0.3)       71mW at 7GHz, 1.3V, 25°C
32-bit ALU active leakage power                  4.4mW at 1.3V, 25°C
64-bit ALU average switching power (α=0.3)       89mW at 4GHz, 1.3V, 25°C
64-bit ALU active leakage power                  9.6mW at 1.3V, 25°C
Die area                                         0.474mm²

S. Mathew et al, ISSCC 2004 & JSSC 01/05

Split-output Level Converter

• Contention in CVSL LCB degrades delay
• Split-output LCB decouples CVSL stage from output driver stage
• Fast level conversion due to low contention
• Reduced fanin load on clock grid

[Figure: schematics of the conventional CVSL LCB and the split-output LCB, each with a low-Vcc input (in) and a high-Vcc output (out)]

LCB Energy-Delay Comparisons

Effective low-energy alternative to CVSL LCB

[Figure: LCB energy (pJ, 0.24 to 0.36) vs. LCB delay (ps, 0 to 250) for conventional CVSL and split-output; annotations: 47%, 16%; Vcch=1.2V, Vccl=0.8V; 130nm, 30°C simulation]

LCB Scheme           Fanin cap (fF)   Total area (mm2)   CVSL-stage contention energy (pJ)
Conventional CVSL    8.2              15.5               0.085
This work            7.1 (-14%)       13.8 (-11%)        0.039 (-54%)

R. Krishnamurthy et al, 2002 Symp. VLSI Circuits

Transition-Encoded Bus

Encoder circuit
– XOR of previous and current input
Decoder circuit
– XOR of previous output and bus state
Domino delay performance
– Collinear cap reduction
Static bus energy
– Transition-dependent activity

[Figure: transition-encoded bus schematic; labels: D1, FF, encode, decode, clock phases Φ1 and Φ2]

M. Anders et al, 2002 VLSI Circuits Symp.
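The encoder and decoder rules above can be modeled in a few lines. The sketch below treats each bus word as an integer and assumes both ends start from an all-zero state, which is an assumption for illustration since the slide does not give the reset state; the function names are illustrative and no circuit timing is modeled.

```python
def encode(words: list[int]) -> list[int]:
    """Encoder: bus state = previous input XOR current input."""
    previous_input = 0          # assumed reset state
    bus_states = []
    for word in words:
        bus_states.append(previous_input ^ word)
        previous_input = word
    return bus_states


def decode(bus_states: list[int]) -> list[int]:
    """Decoder: output = previous output XOR bus state."""
    previous_output = 0         # assumed reset state, matching the encoder
    outputs = []
    for state in bus_states:
        previous_output ^= state
        outputs.append(previous_output)
    return outputs


def wire_activity(bus_states: list[int]) -> int:
    """Count asserted wires over all cycles: one per input bit that changed."""
    return sum(bin(state).count("1") for state in bus_states)


if __name__ == "__main__":
    data = [0b1010, 0b1010, 0b1000, 0b0111]
    bus = encode(data)
    print([format(s, "04b") for s in bus], wire_activity(bus))
    print(decode(bus) == data)
```

A repeated word puts all zeros on the bus, so a wire is asserted only in cycles where its input bit actually changed. That is the sense in which the bus activity tracks input transitions.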

Transition-Encoded Bus: Results

Transition only when current input != previous input
Dynamic bus performance but energy profile of static bus
Energy scales linearly with input switching activity
79% of full-chip buses: 10%-35% delay improvement

[Figure: % delay improvement (-10 to 40) vs. bus length (mm, 1 to 16); 0.18μm Pentium® 4 simulations]

M. Anders et al, 2002 VLSI Circuits Symp.

Low-power dual-Vcc clocking
– Split-output level converter

PVT Variation tolerant memories: Keeper Upsizing

PVT induced leakage variation
– Traditional noise engineering: diminishing ROI
Additional keeper enables customization
– High leakage states: 8% keeper
– Low leakage states: 4% keeper

[Figure: normalized delay (0.7 to 1.6) vs. DC noise robustness (0.25 to 1.75) under keeper upsizing; curves for low Vt and high Vt with keeper sizes 5%, 10%, and 20%]

[Figure: register file local bitline schematic; 8X lower LBL and upper LBL, two 4% keepers, compensation enable, clk enable, out]

A. Alvandpour et al, 2001 Symp. VLSI Circuits

PVT Compensation Motivation

[Figure: three surface plots of normalized leakage (0 to 1) vs. voltage (0.5 to 1.4V) and temperature (20 to 110°C), for slow, typical, and fast process corners]

10^4X leakage variation across PVT

PVT Compensation Benefit

High leakage dies: 27% robustness
Low leakage dies: 10% delay
Industry's highest performance/Watt ⇒ 96GOPs/Watt

[Figure: 16x64b register file array; 65nm CMOS, die area 0.017mm², 68K transistors]

[Figure: compensation setting over temperature (20 to 110°C) and voltage (0.5 to 1.4V); regions: Comp. OFF; Fast: Comp. ON; Fast, Typical: Comp. ON]

                           Frequency   Power    Leakage
Nominal (1.2V)             8.8GHz      198mW    25mW
Peak Performance (1.4V)    10.1GHz     273mW    57mW
Low-Voltage Mode (0.5V)    300MHz      1.3mW    405µW

P1264 Si measurements, 50°C

S. Hsu et al., ISSCC '06

Summary

Performance through parallelism: Multi-core
Special-purpose hardware accelerators provide higher MIPS/Watt vs. general-purpose cores
Energy-efficient, leakage/variation tolerant circuits required for scalable GP performance
– Sparse-tree adder
– Split-output level converter
– Transition-encoded interconnects
– Leakage/variation tolerant register files
