182 • 2005 IEEE International Solid-State Circuits Conference 0-7803-8904-2/05/$20.00 ©2005 IEEE.
ISSCC 2005 / SESSION 10 / MICROPROCESSORS AND SIGNAL PROCESSING / 10.1
10.1 The Implementation of a 2-core Multi-Threaded Itanium®-Family Processor
Samuel Naffziger, Blaine Stackhouse, Tom Grutkowski
Intel, Fort Collins, CO
The next generation in the Itanium® processor family, codenamed Montecito, is introduced. The processor has two dual-threaded cores integrated on die with 26.5MB of cache in a 90nm process with 7 layers of copper interconnect. The die is 21.5mm by 27.7mm and includes 1.72B transistors. With both cores at full frequency, it consumes 100W. The micro-architecture and circuit methodologies are leveraged from the prior Itanium2 processors [1]. Improvements include the integration of 2 cores on-die, each with a dedicated 12MB 3rd-level cache, a 1MB 2nd-level cache and dual-threading. Susceptibility to soft errors is also reduced and power efficiency improved through low-power techniques and active power management.
Multi-threading addresses the memory latency issue in large SMP systems. The Montecito style of multi-threading is referred to as temporal multi-threading (TMT), where the thread switch is triggered by long-latency events such as a 3rd-level cache miss. Duplicating all architectural state and some micro-architectural state creates the two logical processors that then share all the execution resources and the large caches. To the operating system, each Montecito die looks like 4 processors. The thread-switch logic and state are implemented without a cycle-time impact. At any given time, only one thread is active in the pipeline of each core and consumes all the physical execution resources while the caches continue to service both threads. The performance benefits of TMT on database workloads are estimated to range from 15 to 35% depending on memory latency. The integer and floating-point register files implement a duplicate set of registers in an area-efficient manner with a 15% area impact [2]. Other architectural state that must be duplicated requires dual-threaded latches (Fig. 10.1.5). The changes for TMT result in less than a 2% area growth in the core, providing a very power- and area-efficient performance increase for SMP systems.
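The dependence of the TMT benefit on memory latency can be illustrated with a toy steady-state model. This is only a sketch under stated assumptions: each thread alternates a fixed compute burst with a fixed-latency miss, the core switches threads on every miss, and the burst length, miss latency and switch penalty below are hypothetical values, not measurements from Montecito.

```python
def core_utilization(compute, miss, threads=1, switch=0):
    """Fraction of cycles doing useful work in steady state (toy model).

    compute: cycles of work between long-latency misses (hypothetical)
    miss:    cycles a thread waits on each miss (hypothetical)
    switch:  cycles lost on each thread switch (hypothetical)
    """
    if threads == 1:
        return compute / (compute + miss)
    if threads == 2:
        # One period covers one burst from each thread. It is bounded either
        # by the core (two bursts plus two switches) or by memory (one burst
        # plus its own miss latency).
        period = max(2 * (compute + switch), compute + miss)
        return 2 * compute / period
    raise ValueError("model covers 1 or 2 threads only")


def tmt_gain(compute, miss, switch=0):
    """Throughput ratio of switch-on-miss dual threading to one thread."""
    return (core_utilization(compute, miss, 2, switch)
            / core_utilization(compute, miss, 1))


# Longer miss latency leaves more idle time for the second thread to fill.
for miss_cycles in (150, 250, 350):
    print(miss_cycles, round(tmt_gain(1000, miss_cycles, switch=15), 3))
```

In this model the gain grows with miss latency until the core itself becomes the bottleneck, which is the qualitative behavior the text attributes to TMT.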
Soft errors caused by high-energy particles negatively affect system reliability, resulting in customer downtime, which is not acceptable for servers. Montecito provides several key enhancements over the previous Itanium processors. First, the parity protection of the 2nd-level data-cache tag structures is enhanced with ECC protection. This feature allows the transparent fixing of 86% of bits that would otherwise cause the processor to crash on a detected error. Second, the large L3 caches provide support for automatic disabling of individual cache lines that suffer from repeated errors. This is achieved by the combination of hardware and firmware support. Third, parity protection is added for the large register file structures (2 sets of 128 dual-threaded registers), with a low-overhead approach [2].
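The difference between parity (detect only) and ECC (detect and correct) can be shown with a small example. The paper does not state which code protects the cache tags; the sketch below uses a textbook Hamming(7,4) single-error-correcting code purely as a stand-in, and the 4-bit "tag" is a hypothetical value.

```python
def parity_bit(bits):
    """Even-parity bit over a list of 0/1 values."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits; codeword positions 1..7 are p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome names the flipped position
    if position:
        c[position - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position


tag = [1, 0, 1, 1]  # hypothetical 4-bit tag

# Parity: the flip is detected, but the bad bit cannot be located.
stored = tag + [parity_bit(tag)]
stored[2] ^= 1
parity_error = parity_bit(stored) != 0

# ECC: the same kind of flip is located by the syndrome and repaired.
word = hamming74_encode(tag)
word[4] ^= 1
recovered, position = hamming74_decode(word)
print(parity_error, recovered == tag, position)
```

With parity alone the only safe response to the error is to signal it; with a correcting code the original value is restored transparently, which is the behavior the text describes for the tag structures.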
Power reduction is targeted as a primary design priority. At issue is the fact that the last 3 generations of Itanium processors all had the same power consumption (130W) for a single-core processor despite process scaling. For Montecito to provide competitive database performance, 2 processor cores are integrated on-die and the cache is grown. Scaling indicates that this would consume nearly 300W in 90nm at the 2.0+GHz target frequency, which is clearly unacceptable. To reduce power while also providing a new level of power manageability, several capabilities are needed:
Power in should be managed in the processor, not just temperature (~Power out). Power in is a major driver of system and data center operating costs. Power out must be managed also as it affects heat-sink design and form-factor issues, but it is only half the problem.
The processor should be able to manage its power to a dynamically adjustable limit. This provides flexibility in system design and data-center management. The processor should also adapt its operating point for each power setting to extract the maximum performance/watt possible to maximize flexibility. This means maximizing frequency as a function of voltage.
In order to reduce power and provide the above capabilities, the Montecito processor applies the full range of traditional CMOS power-reduction techniques to the core design inherited from the previous generation. These include reducing leakage through long-L and high-Vt device insertion, gating-off clocks, and reducing FET width in over-designed circuits. The result is a power reduction of 28% for integer applications.
Montecito also targets the three main sources of power variability for removal: the difference between average application power and the maximum (1.4X at the chip level for Itanium2); the variability in leakage due to silicon processing, which is at least 2X of a nominal 25% power component; and voltage variability due to regulator and package variations and transients that are ~10% of nominal.
Measuring power in required the invention of several new technologies: An ammeter is integrated into the chip and package to continuously monitor the power consumed by the die [3]. This integration not only ensures good accuracy (~3%), but also dovetails with existing manufacturing test flows and eliminates the dependence on external components. Next, an autonomous microcontroller is implemented on-die to enable the processor to flexibly manage the voltage of the chip in response to changing power consumption and thermal conditions [3]. Finally, the core design adapts its frequency to continuously varying voltage, which in turn requires several capabilities: First, an asynchronous interface between cores and system bus, where one cycle of meta-stability resolution time is achieved using a high-gain resolver latch (Fig. 10.1.6). Second, a frequency synthesizer that is repeatable (manufacturing / test needs) by being as digital as possible, has a very low frequency-change latency (1 cycle) for adapting to voltage transients, enables fine-grained frequency changes (1/64th of Tmin), and has good range and low jitter [5]. The final capability is a core that robustly operates across a broad range of frequencies and voltages. This required hold-time analysis at two voltage points and checks for low-voltage circuit operation as well as an active clock deskew system [4] that continuously maintains clock alignment within 10ps across each core as voltage and temperature vary.
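The granularity of the frequency synthesizer can be made concrete with a short calculation. This is a sketch under two assumptions not stated in the text: that each step stretches the clock period by Tmin/64, and that Tmin is 500 ps (the period of the 2.0 GHz target frequency). The function names are illustrative only.

```python
def period_ps(t_min_ps, steps):
    """Clock period after stretching by `steps` increments of Tmin/64."""
    return t_min_ps * (1 + steps / 64)


def frequency_ghz(t_min_ps, steps):
    """Frequency in GHz for a period given in picoseconds."""
    return 1000.0 / period_ps(t_min_ps, steps)


T_MIN_PS = 500.0  # assumed: 2.0 GHz target, so Tmin = 500 ps

for steps in range(4):
    print(steps, period_ps(T_MIN_PS, steps), round(frequency_ghz(T_MIN_PS, steps), 4))
```

Under these assumptions one step is about 7.8 ps of period, or roughly 1.5% of frequency near the top of the range, which is fine enough to follow small supply-voltage excursions.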
Acknowledgements:
Many thanks to a talented and dedicated team from Intel in making this ambitious design a reality.
References:
[1] S. Naffziger et al., “The Implementation of the Itanium2 Microprocessor,” IEEE J. of Solid-State Circuits, vol. 37, no. 11, pp. 1448-1460, Nov. 2002.
[2] E. Fetzer et al., “The Multi-Threaded, Parity-Protected, 128-Word Register Files on a Dual-Core Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 20.5, pp. 382-383, Feb. 2005.
[3] C. Poirier et al., “Power and Temperature Control on a 90nm Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 16.7, pp. 304-305, Feb. 2005.
[4] E. Fetzer et al., “Clock Distribution on a Dual-core, Multi-threaded Itanium® Family Processor,” ISSCC Dig. Tech. Papers, Paper 16.1, pp. 292-293, Feb. 2005.
[5] T. Fischer et al., “A 90nm Variable-Frequency Clock System for a Power-Managed Itanium®-Family Processor,” ISSCC Dig. Tech. Papers, Paper 16.2, pp. 294-295, Feb. 2005.
ISSCC 2005 / February 8, 2005 / Salon 8 / 8:30 AM
Figure 10.1.1: Chip power consumption.
Figure 10.1.2: Chip statistics at nominal operating point.
Figure 10.1.3: Power reduction.
Figure 10.1.4: Power management.
Figure 10.1.5: Dual threaded latch.
Figure 10.1.6: Clock domain crossing resolver latch.
Chip statistics, block diagram labels: Core 0, frequency tracks voltage, 40W. Core 1, frequency tracks voltage, 40W. Two 12MB L3 caches, asynchronous operation, 2.5W each. Bus I/O, fixed frequency, 8W total. Bus Arbiter, fixed frequency, 2.5W. Foxton Control, 1GHz fixed frequency, 250mW.

Domain     Voltage     Frequency        Power
Cores      0.8-1.2V    Tracks voltage   80W total
Cache      0.9-1V      Self-timed       5W total
Fixed      1.15V       Fixed            6W total
IO term    1.2V        Bus              8W total

                  FET count   Low Vt
Core logic        57M         1.7%
Core caches       106.5M      0
L3 Cache          1550M       0
Bus logic & IO    6.7M        0.3%
Total             1.72G       .06%
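The FET-count statistics can be cross-checked with a few lines of arithmetic. The sketch below uses only the per-block counts and low-Vt percentages listed in the chip statistics; the variable names are illustrative.

```python
# (FET count in millions, percent of those FETs that are low-Vt),
# taken from the chip statistics table.
blocks = {
    "Core logic": (57.0, 1.7),
    "Core caches": (106.5, 0.0),
    "L3 cache": (1550.0, 0.0),
    "Bus logic & IO": (6.7, 0.3),
}

total_m = sum(count for count, _ in blocks.values())
low_vt_m = sum(count * pct / 100.0 for count, pct in blocks.values())
low_vt_pct = 100.0 * low_vt_m / total_m

print(f"total FETs: {total_m:.1f}M")
print(f"low-Vt FETs: {low_vt_m:.3f}M ({low_vt_pct:.2f}% of total)")
```

The block counts sum to about 1.72G and the weighted low-Vt share rounds to .06%, consistent with the Total row.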
Core Power Breakdown: logic switching 50%, leakage (avg.) 25%, final clock buffer 20%, clock distribution 5%.
Chip Power Breakdown: core power 81%, L3 cache 5%, bus/IO logic 6%, IO termination 6%, variation/GB 2%.
Power management, panel titles: Legacy Chip Frequency; Adaptive Chip Frequency. Plotted curves: Vcc(t), F(V(t)), Favg(t), Fmin(t). Annotation: 5-10%.
Today: Minimum Vcc(t) determines maximum (and avg.) frequency. Montecito: Average Vcc(t) determines average frequency.
F(V(t)) is produced by voltage detectors in conjunction with digital frequency synthesizers to provide frequency change response to a voltage transient in 1.5 cycles on average.
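The contrast between the legacy and adaptive schemes can be sketched numerically. Everything quantitative here is an assumption for illustration: the linear F(V) curve, its slope, and the supply-voltage trace with a droop are hypothetical and are not taken from the paper.

```python
def f_of_v(vcc, f_nom_ghz=2.0, v_nom=1.1, ghz_per_volt=2.0):
    """Hypothetical linear frequency-vs-voltage curve, in GHz."""
    return f_nom_ghz + ghz_per_volt * (vcc - v_nom)


def legacy_frequency(trace):
    """Fixed clock: must be safe at the lowest voltage ever seen."""
    return f_of_v(min(trace))


def adaptive_frequency(trace):
    """Tracking clock: runs at F(V(t)), so the average follows average Vcc."""
    return sum(f_of_v(v) for v in trace) / len(trace)


# Hypothetical Vcc(t) samples containing one droop event.
vcc_trace = [1.10, 1.08, 1.02, 1.00, 1.05, 1.09, 1.10, 1.10]

gain = adaptive_frequency(vcc_trace) / legacy_frequency(vcc_trace) - 1.0
print(f"legacy   {legacy_frequency(vcc_trace):.3f} GHz")
print(f"adaptive {adaptive_frequency(vcc_trace):.3f} GHz")
print(f"gain     {gain:.1%}")
```

The benefit comes entirely from the gap between minimum and average voltage: with a flat trace the two schemes give the same frequency.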
Power Consumption Profile: chip power consumption vs. application activity factor and frequency, assuming the max frequency possible for a given voltage. Profile varies with temperature and process. Axes: Power (Watts), 0 to 200; Activity Factor, 0.4 to 1; GHz = F(V), 1.6 to 2.4.
Applications vary in activity factor from .4 to 1.0; average integer applications are <=.60.
With power measurement and fine-grained frequency change as a function of voltage, max power can be reduced from 140W to 100W with no performance impact to most applications, or equivalently, frequency can be increased ~12%.
Power reduction, bar chart: bars for "Simple Port of Madison" and "Montecito Final" (vertical axis 0 to 350), each broken down into Cache Switch, Cache Igate, Cache Ioff, Core Switch, Core Igate, Core Ioff, IO bias, DCAP lkg.

Feature                                                                    Frequency Gain @ 100W
Manage junction temperature to the minimum possible                       3%
Adapt Vcc to optimal value for each part                                   5%
Optimize circuits for low switching and low leakage                        11%
Adapt frequency to Vcc to operate at frequency(Vavg) vs. frequency(Vmin)   7%
Manage to application power vs. max power                                  12%
Cache power optimization (low V, asynch. etc.)                             5%

With $P \sim F^3$: $P = P_{orig}/(1+.12+.05+.03+.07+.11+.05)^3 = P_{orig}/2.92$
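The power-reduction arithmetic of Figure 10.1.3 can be reproduced directly. The sketch follows the figure in summing the individual frequency gains rather than compounding them, and applies the stated P ~ F^3 relation; the 300W starting point is the approximate "nearly 300W" scaling estimate from the text, used here only as a round number.

```python
# Frequency gain at 100W attributed to each feature (Figure 10.1.3).
gains = {
    "application power vs. max power": 0.12,
    "Vcc adapted per part": 0.05,
    "junction temperature management": 0.03,
    "frequency(Vavg) vs. frequency(Vmin)": 0.07,
    "low-switching, low-leakage circuits": 0.11,
    "cache power optimization": 0.05,
}

freq_ratio = 1.0 + sum(gains.values())   # gains are summed, as in the figure
power_divisor = freq_ratio ** 3          # P ~ F^3

scaled_estimate_w = 300.0                # approximate figure from the text
print(f"frequency ratio: {freq_ratio:.2f}")
print(f"power divisor:   {power_divisor:.2f}")
print(f"resulting power: {scaled_estimate_w / power_divisor:.0f} W")
```

Dividing the roughly 300W scaling estimate by this factor lands close to the 100W limit quoted for the final design.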
Dual threaded latch, schematic labels: thread switch; switch_pck (fires on thread switch); inactive thread storage; pulse clock (pck); data input (in) and output (q); scan path (sin, sout, shift).
Clock domain crossing resolver latch, schematic labels: input clock domain (frequency f1, standard pulse clock pck_in); output clock domain (frequency f2, standard pulse clock pck_out); Din; small devices; large devices; input; output.
Resolve tau 5ps. Resolve time: $t_{resolve} = 1/f_1 - T_{hold1} - T_{su2}$
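The resolve-time expression annotated on Figure 10.1.6 can be evaluated for a sample operating point. The resolve tau of 5 ps is from the figure; the 2.0 GHz input frequency matches the target frequency in the text, while the hold and setup times are hypothetical placeholders. The exponential term is the standard first-order metastability model, in which the chance of an unresolved output falls as exp(-t/tau); it is not a formula given in the paper.

```python
import math


def resolve_time_ps(f1_ghz, t_hold1_ps, t_su2_ps):
    """Resolve time = 1/f1 - Thold1 - Tsu2, in picoseconds."""
    return 1000.0 / f1_ghz - t_hold1_ps - t_su2_ps


def settling_factor(t_resolve_ps, tau_ps=5.0):
    """exp(t/tau): how strongly an unresolved state is suppressed in t."""
    return math.exp(t_resolve_ps / tau_ps)


# Assumed operating point; hold and setup times are hypothetical.
t_res = resolve_time_ps(f1_ghz=2.0, t_hold1_ps=50.0, t_su2_ps=50.0)
print(f"resolve time: {t_res:.0f} ps = {t_res / 5.0:.0f} tau")
print(f"suppression:  {settling_factor(t_res):.2e}")
```

With a tau this small, even a single cycle of resolution time spans many time constants, which is why one cycle can suffice for the clock domain crossing.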
Figure 10.1.7: Die micrograph.
Die dimensions: 21.5 mm by 27.72 mm. Labeled blocks: Pipeline Control, Instr. Fetch, Branch Unit, Integer Datapath, Floating Point, ALAT, HPW, 16KB L0I Cache, 16KB L0D Cache, 1MB L1I Cache, 256KB L1D Cache, L2 Tag, 12MB L2 Cache, Bus Logic, Bus Arbiter, Bus I/O, Foxton Power management, CLK, Fuse.
• Technology: 90nm bulk, 7 layers Cu
• 1.72B transistors
• 596 mm²
• 2.0+GHz operation at self-selected voltage
• 100W electrical and thermal power limit
• Two 11-issue, 2-way TMT EPIC cores
• 3-level on-chip cache per core: 16K L1I, 16K L1D, 1MB L2I, 256K L2D, 12MB unified L3
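The headline cache and die-area figures follow from the per-core numbers in this summary. The sketch below adds up the cache sizes listed per core and multiplies the die dimensions marked on the micrograph; it assumes binary units (1MB = 1024KB).

```python
# Per-core cache sizes in KB, from the chip summary.
per_core_kb = {
    "L1I": 16,
    "L1D": 16,
    "L2I": 1024,
    "L2D": 256,
    "L3": 12 * 1024,
}
CORES = 2

total_cache_mb = CORES * sum(per_core_kb.values()) / 1024
die_area_mm2 = 21.5 * 27.72   # die dimensions from the micrograph

print(f"total on-die cache: {total_cache_mb:.2f} MB")
print(f"die area: {die_area_mm2:.1f} mm^2")
```

The totals agree with the 26.5MB of cache quoted in the abstract and the 596 mm² area in the summary.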