energy‐efficient vlsi...
TRANSCRIPT
Patrick Chiang, Oregon State University http://eecs.oregonstate.edu/research/vlsi
Energy-Efficient VLSI Interconnects
Patrick Chiang, Oregon State VLSI Research Group
Santa Fe, Oct. 13, 2010
Fighting words
• What means more to you: performance or energy?
• Dark silicon: more silicon than you have energy
  – Why is wider, parallel, and slow a problem?
• Multicore, lower CLK, deep parallelism
  – Low VDD at the extreme
ITRS Roadmap 2007
• “at 0.13 um approximately 51% of microprocessor power was consumed by interconnect, with a projection that without changes in design philosophy, in the next five years up to 80% of microprocessor power will be consumed by interconnect”
Interconnect Energy Wall
• “The Energy and Power Challenge is the most pervasive of the four, and has its roots in the inability of the group to project any combination of currently mature technologies that will deliver sufficiently powerful systems in any class at the desired power levels.”
• A key observation of the study is that it may be easier to solve the power problem associated with base computation than it will be to reduce the problem of transporting data from one site to another - on the same chip, between closely coupled chips in a common package, or between different racks on opposite sides of a large machine room…
Source: DARPA Exascale Study, Sep. 2008
• This is a “solved” problem if we are WILLING to deal with the implications
Overview
• Motivation: Why high-speed links?
• Tutorial on channel loss
  – Channel loss limits:
    • Data rate
    • Energy consumption
  – Optical interconnect
  – 3D integration
• Energy efficiency
  – Assume channel is benign
  – 1 mW/Gbps off-chip serial links
  – Fundamental limits to low-power on-chip links
Why serial links
• Wired communications from one chip to another
  – Memory -> CPU (DDR3)
  – CPU -> CPU (Intel QuickPath; AMD HyperTransport)
  – Router backplane (line card to backplane)
• Typically a major bottleneck for:
  – System performance
  – Power consumption
  – Chip complexity
• Why does off-chip bandwidth scale slowly?
  – CPU transistor density increasing
  – Memory capacity increasing
  – Interconnect does not improve
  – # of pads does not increase
• Use optics??
I/O Bandwidth is Limiting Factor
On-Chip BW
Off-Chip BW
• GOAL:
  – High BW (Gbps/pad)
  – Energy-efficient (mW/Gbps or pJ/bit)
  – Low area (100s to 1000s on a die)
The Energy Wall: Computation is free; communication is expensive
• Future processor: 256 cores
• Total on-chip bandwidth
  – If P_WIRE = 0.25 mW/Gbps/mm, 150 W
  – Need >10x reduction
• Total off-chip bandwidth
  – Predicted off-die bandwidth: 5.12 Tbps
  – 512 serial links @ 10 Gbps (10 mW/Gbps) = 50 W
  – Need >10x reduction (1 mW/Gbps)
Source: IEEE Micro, Dec. 2007
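The arithmetic behind this slide can be checked with a short sketch. The off-chip inputs (512 links, 10 Gbps, 10 mW/Gbps, 1 mW/Gbps goal) come from the slide. The on-chip bandwidth-distance product is not stated on the slide; it is inferred here from the 150 W total and the 0.25 mW/Gbps/mm figure, so treat it as an assumption.

```python
# Off-chip: all three inputs are taken from the slide.
NUM_LINKS = 512
GBPS_PER_LINK = 10.0
MW_PER_GBPS = 10.0  # conventional serial-link efficiency


def link_power_watts(num_links, gbps_per_link, mw_per_gbps):
    """Total power in watts for a set of identical serial links."""
    return num_links * gbps_per_link * mw_per_gbps / 1000.0


off_chip_tbps = NUM_LINKS * GBPS_PER_LINK / 1000.0
today_w = link_power_watts(NUM_LINKS, GBPS_PER_LINK, MW_PER_GBPS)
goal_w = link_power_watts(NUM_LINKS, GBPS_PER_LINK, 1.0)  # 1 mW/Gbps target

# On-chip: the slide gives only the efficiency and the 150 W total, so the
# bandwidth-distance product below is inferred, not quoted.
P_WIRE_MW_PER_GBPS_MM = 0.25
ON_CHIP_W = 150.0
implied_tbps_mm = ON_CHIP_W * 1000.0 / P_WIRE_MW_PER_GBPS_MM / 1000.0

print(f"off-chip bandwidth: {off_chip_tbps:.2f} Tbps")
print(f"off-chip power today: {today_w:.1f} W, at goal: {goal_w:.2f} W")
print(f"implied on-chip traffic: {implied_tbps_mm:.0f} Tbps*mm")
```

The 51.2 W result is what the slide rounds to 50 W; at the 1 mW/Gbps goal the same links would draw about 5 W.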
Source: Exascale Roadmap Meeting, Dec. 2009
Energy Barrier
What makes it more challenging
• Now, the bandwidth limit is in wires
• Take-home note: tell vendor to throw kitchen sink at PCB
High-speed link chip
>2 GHz signals
Backplane channel
• Loss is variable
  – Same backplane
  – Different lengths
  – Different stubs
    • Top vs. Bot
• Attenuation is large
  – >30 dB @ 3 GHz
  – But is that bad?
• Required signal amplitude set by noise
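To put the 30 dB figure in voltage terms, here is a minimal sketch using the standard 20·log10 convention for amplitude. The 1 V transmit swing is a hypothetical value chosen for illustration; it does not appear in the slides.

```python
def db_to_voltage_ratio(loss_db):
    """Fraction of the voltage amplitude left after loss_db of attenuation."""
    return 10.0 ** (-loss_db / 20.0)


def received_swing_mv(tx_swing_mv, loss_db):
    """Received swing in mV for a given transmit swing and channel loss."""
    return tx_swing_mv * db_to_voltage_ratio(loss_db)


TX_SWING_MV = 1000.0  # hypothetical transmit swing, not from the slides
rx_mv = received_swing_mv(TX_SWING_MV, 30.0)
print(f"{rx_mv:.1f} mV remains of a {TX_SWING_MV:.0f} mV swing after 30 dB")
```

About 3% of the transmitted amplitude arrives. Whether that is "bad" depends on the noise floor at the receiver, which is the point of the last bullet.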
Channel Responses
Source: Palermo, Texas A&M
10 Gbps Equalization
• NOTE: Tell vendor to throw kitchen sink at PCB
Source: Palermo, Texas A&M
Optics solves the channel
• Interconnect is now optical
  – On-chip wires are no longer RC
  – Off-chip backplane channels are no longer T-lines
• Multimode loss = 2.5 dB (@ 850 nm)
Optics doesn't solve the electronics
If transmit power = 0, 5 mW/Gbps
Palermo, JSSC 08
3D solves the channel
• Memory -> CPU interface is closer
  – Distances are shorter
• Electronics at the interfaces still required
Channel Distance
Overview
• Motivation: Why high-speed links?
• Tutorial on channel loss
  – Channel loss limits:
    • Data rate
    • Energy consumption
  – Optical interconnect
  – 3D integration
• Energy efficiency
  – Assume channel is benign
  – <1 mW/Gbps off-chip serial links
  – Fundamental limits to low-swing on-chip links
Problem 1: Off-chip I/O power is not scaling
Source: Exascale Roadmap Meeting, Dec. 2009
OFF-CHIP: 1-10 pJ/bit (1-10 mW/Gbps)
CONVENTIONAL
OUR GOAL: <0.1 pJ/bit
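The slide uses pJ/bit and mW/Gbps interchangeably. The two units are numerically identical, as the following unit check shows:

```latex
1\,\frac{\text{mW}}{\text{Gbps}}
  = \frac{10^{-3}\,\text{J/s}}{10^{9}\,\text{bit/s}}
  = 10^{-12}\,\frac{\text{J}}{\text{bit}}
  = 1\,\frac{\text{pJ}}{\text{bit}}
```

As an illustration (the 10 Gbps rate is an assumed example, not a figure from this slide), a link meeting the 0.1 pJ/bit goal at 10 Gbps would draw 0.1 pJ/bit × 10 Gbps = 1 mW.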
Parallelism: Reducing Clock Power Through Global Clock Distribution Optimization (with Intel)
• Goal: Reduce power to <1 mW/Gbps
• Assume channel loss is moderate
Clk/CDR is 54% of total power (10 Gbps)
Clk/CDR is 71% of total power (15 Gbps)
Intel (VLSI, 2007)
Rambus, 2.4 mW/Gbps
• Clock dominates power in serial links
• Is there a way to share clock power?
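A rough way to see why sharing the clock matters is an idealized amortization model. This is not the method from the talk: it assumes the clock/CDR power of one link can be spread evenly over N lanes at no extra distribution cost. The clock fractions (54%, 71%) are from the slide; the lane count of 8 is a hypothetical choice.

```python
def shared_clock_power(clock_fraction, lanes):
    """Per-lane power, normalized to a link that owns its clock.

    Idealized: the data-path share (1 - clock_fraction) is paid by every
    lane, while the clock share is divided evenly across `lanes` lanes.
    Clock distribution overhead is ignored.
    """
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    if not 0.0 <= clock_fraction <= 1.0:
        raise ValueError("clock_fraction must be between 0 and 1")
    return (1.0 - clock_fraction) + clock_fraction / lanes


LANES = 8  # hypothetical lane count, not from the slides
for rate_gbps, fraction in ((10, 0.54), (15, 0.71)):
    normalized = shared_clock_power(fraction, LANES)
    print(f"{rate_gbps} Gbps: per-lane power drops to {normalized:.2f}x")
```

Under these assumptions the saving grows with the clock fraction, so the benefit of sharing would be larger at 15 Gbps than at 10 Gbps.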
Proposed Receiver Architecture
ILRO: Extension of Adler's Equation
Further: Near-Threshold, 0.13 mW/Gbps, 8 Gbps Serial Link Receiver
• Two innovations:
  – Exploits constrained deskew (Jaussi, Casper @ Intel)
  – “Kitchen Sink”
  – Operates in near-threshold (Vdd < 0.6 V)
Measured Deskew Range and BER
• Deskew range: 0.37 UI (max)
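For scale, the deskew range can be converted from unit intervals to time. This sketch assumes the measurement was taken at the receiver's headline rate of 8 Gbps; the slide does not state the rate for this measurement.

```python
DATA_RATE_GBPS = 8.0    # assumed: the receiver's headline data rate
DESKEW_RANGE_UI = 0.37  # from the slide


def ui_ps(data_rate_gbps):
    """Unit interval (one bit time) in picoseconds."""
    return 1000.0 / data_rate_gbps


deskew_ps = DESKEW_RANGE_UI * ui_ps(DATA_RATE_GBPS)
print(f"UI = {ui_ps(DATA_RATE_GBPS):.1f} ps, deskew range = {deskew_ps:.2f} ps")
```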
Comparison Table
Problem 2: On-Chip Wires Consume Energy
• On-chip wire power does not scale
  – Dominated by interconnect capacitance (C·VDD^2)
ON-CHIP (status quo): 100-300 fJ/bit/mm
NOTE: Sub/near-threshold doesn't help this problem!
OUR GOAL: <5 fJ/bit/mm
[DOE, Exascale Workshop]
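The C·VDD^2 dependence can be made concrete with a small sketch. The wire capacitance of 200 fF/mm is a hypothetical, typical-order value chosen for illustration, not a number from the talk. With it, a full-swing 1.2 V wire lands inside the 100-300 fJ/bit/mm status-quo range quoted above.

```python
def wire_energy_fj_per_mm(cap_ff_per_mm, vdd, vswing=None, activity=1.0):
    """Supply energy per bit per mm of wire, in fJ.

    Full swing:  activity * C * Vdd^2
    Low swing:   activity * C * Vdd * Vswing, when the reduced swing is
                 still charged from the Vdd supply.
    fF times V times V gives fJ directly.
    """
    if vswing is None:
        vswing = vdd
    return activity * cap_ff_per_mm * vdd * vswing


CAP_FF_PER_MM = 200.0  # hypothetical wire capacitance, not from the slides

full_swing = wire_energy_fj_per_mm(CAP_FF_PER_MM, vdd=1.2)
low_swing = wire_energy_fj_per_mm(CAP_FF_PER_MM, vdd=1.2, vswing=0.2)
low_supply = wire_energy_fj_per_mm(CAP_FF_PER_MM, vdd=0.2)

print(f"full swing at 1.2 V:           {full_swing:.0f} fJ/bit/mm")
print(f"200 mV swing from 1.2 V:       {low_swing:.0f} fJ/bit/mm")
print(f"200 mV swing from 0.2 V rail:  {low_supply:.0f} fJ/bit/mm")
```

The activity factor defaults to one charge event per bit, which is a worst case. In this simple model the larger saving comes from driving the wire from a dedicated low-voltage rail rather than from the full supply.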
Energy-Efficient On-Chip Links
4-Core Network-on-a-Chip (with Li-Shiuan Peh, MIT)
[Figure: 8x8 grid with rows and columns indexed 0-7; labels: XBAR, LINKS]
ICCD, 2010
Low-Swing Bitcell-Based Crossbar
• Two supplies
  – Low-voltage swing = 200 mV
Low Swing Link Over Logic
• 1 mm long, 64-bit bus
  – Low-swing differential signaling (0.2 V-0.4 V)
• Routed over noisy digital logic
• Inter-pair differential shielding
  – Reduces differential-mode noise
  – Less capacitance than full plane shielding
[Figure: Post-layout simulation of a 1 mm parallel-coupled digital aggressor. Y-axis: differential-mode crosstalk (mV), 0-140; x-axis: aggressor distance from center of differential pair (um), 0-1.25; series: shielded, unshielded]
Chip Simulations
Performance Summary
Fundamental Limits to On-Chip Links
Bit Error Rate
65 nm CMOS Prototype
Measured Energy/b/mm
On-Chip Link Comparison
                  Conventional  Schinkel     Stojanovic   1st Chip   2nd Chip
                  Full Swing    (JSSCC '09)  (ISSCC '09)
Wire Length       1 mm          2 mm         10 mm        1 mm       1 mm-5 mm
Supply            1.2 V         1.2 V        -            1.2 V      0.2-1.0 V
Transceiver Area  21 um^2       TX: 20 um    2880 um^2    23 um^2    20-30 um^2
Signal Swing      1.2 V         120 mV       200 mV       250 mV     34 mV
Energy/Bit/mm     305 fJ        105 fJ       356 fJ       28-60 fJ   8 fJ/b/mm
• Approaching the fundamental limits to energy-efficient, on-chip links
• Lowest energy/b/mm of on-chip link
  – Energy scalability
  – Low area
  – Robust
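To quantify the "lowest energy/b/mm" claim, the sketch below divides each energy figure from the comparison slide by the 2nd chip's 8 fJ/b/mm. The numbers are copied from the slide. The designs differ in wire length, supply, and process, so the ratios are plain arithmetic rather than a like-for-like benchmark.

```python
# Energy/bit/mm in fJ, copied from the comparison slide; ranges as (low, high).
ENERGY_FJ_PER_BIT_MM = {
    "Conventional full swing": (305.0, 305.0),
    "Schinkel": (105.0, 105.0),
    "Stojanovic": (356.0, 356.0),
    "1st chip": (28.0, 60.0),
}
SECOND_CHIP_FJ = 8.0


def improvement(entry, reference=SECOND_CHIP_FJ):
    """Return (low, high) ratio of an entry's energy to the reference."""
    low, high = entry
    return (low / reference, high / reference)


ratios = {name: improvement(e) for name, e in ENERGY_FJ_PER_BIT_MM.items()}
for name, (low, high) in ratios.items():
    if low == high:
        print(f"{name}: {low:.1f}x the 2nd chip")
    else:
        print(f"{name}: {low:.1f}x to {high:.1f}x the 2nd chip")
```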
Co-Design
• Take home: tell vendor to throw kitchen sink at PCB
• Energy can be saved by:
  – OFF: Turning things off when not being used
  – ON: Going parallel (low Vdd)
• Software needs to predict:
  – OFF: Things are not being utilized, turn things OFF
    • Coarse, fine-grain clock/power gating
  – ON: Things are utilized, go wide parallel
    • Lower Vdd dynamically, coarse/fine-grain DVFS
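The OFF/ON split above can be sketched as a toy software policy. Everything here is hypothetical: the function names, the 5% idle threshold, and the voltages are illustrative and do not come from the talk. The only physics used is that dynamic energy per operation scales as Vdd^2.

```python
def dynamic_energy_ratio(vdd_low, vdd_nominal):
    """Dynamic energy per operation relative to nominal (C*Vdd^2 scaling)."""
    return (vdd_low / vdd_nominal) ** 2


def plan(predicted_utilization, idle_threshold=0.05):
    """Hypothetical policy: gate idle blocks, run busy ones wide at low Vdd."""
    if not 0.0 <= predicted_utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if predicted_utilization < idle_threshold:
        return "OFF: clock/power gate"
    return "ON: wide parallel, lower Vdd (DVFS)"


for utilization in (0.0, 0.5, 1.0):
    print(f"utilization {utilization:.2f} -> {plan(utilization)}")

# Illustrative voltages: halving Vdd cuts dynamic energy per op to a quarter.
print(f"energy per op at 0.5 V vs 1.0 V: {dynamic_energy_ratio(0.5, 1.0):.2f}x")
```

Lowering Vdd also lowers the clock frequency each lane can sustain, which is why the ON case pairs the voltage reduction with wider parallelism to hold throughput.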
Multicore: near-threshold parallel processor – Synctium in 45 nm CMOS
- Ten parallel lanes, running in near-threshold operation (Vdd = 0.4 V-1.0 V)!
- Incorporates Razor-like detection/recovery in every lane (variation tolerance)
- Throughput of SIMD; energy-efficiency of near-threshold operation
- GOAL: 1 GOPS / 1 mW power (eight parallel 16b multiply/adds)
E. Krimer, R. Pawlowski, M. Erez, P. Chiang, "Synctium: a Near-Threshold Stream Processor for Energy-Constrained Parallel Applications", IEEE Computer Architecture Letters, 2010.
Heterogeneous Interconnect
Micro, 09.