© 2006, kevin skadron power-aware and temperature-aware architecture kevin skadron lava/hotspot lab...
TRANSCRIPT
© 2006, Kevin Skadron
Power-Aware and Temperature-Aware Architecture
Kevin Skadron
LAVA/HotSpot Lab
Dept. of Computer Science
University of Virginia
Charlottesville, VA
“Cooking-Aware” Computing?
Thermal Packaging is Expensive
• Nvidia GeForce 5900
• Nvidia GeForce 7900
Sources: Tech-Report.com; http://www.ixbt.com/video2/images/g71/7900gtx-front.jpg; Gordon Bell, “A Seymour Cray perspective,” http://www.research.microsoft.com/users/gbell/craytalk/
“Moore’s Law” for Power
[Chart: max power (Watts, log scale from 1 to 100) by processor generation: i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, ??]
Source: Intel
• Reasons: higher frequencies, more “stuff”
Leakage – A Growing Problem
• The fraction of leakage power is increasing exponentially
• Also exponentially dependent on temperature
• This is bad for designs with idle logic, e.g. multi-core processors, specialized functional units, lots of storage, etc.
Source: N. S. Kim et al., “Leakage Current: Moore’s Law Meets Static Power,” IEEE Computer, Dec. 2003.
Inter-Related Design Objectives
• Performance gains increasingly require gains in
  • Cooling efficiency
  • Power efficiency
[Diagram: dependencies among Frequency, Throughput, Performance, Dynamic Power, Leakage Power, Vdd, Vth, Area, Temp, Cost, and Reliability; two of the links are labeled “exp”]
ITRS Projections
• These are targets, doubtful that they are feasible
• Growth in power density means cooling costs continue to grow
ITRS 2005
Year                           2003   2006   2010   2013   2016
Tech node (nm)                  100     70     45     32     22
Vdd (high perf) (V)             1.2    1.1    1.0    0.9    0.8
Vdd (low power) (V)             1.0    0.9    0.7    0.6    0.5
Frequency (high perf) (GHz)     3.0    6.8   15.1   23.0   39.7
Max power (W)
  High-perf w/ heatsink         149    180    198    198    198
  Cost-performance               80     98    119    137    151
  Hand-held                     2.1    3.0    3.0    3.0    3.0
Callouts: 2001 – was 0.4; 2001 – was 288
Hitting the Power Wall
• Intel canceled Pentium 4 microarchitecture in part due to power limits
  • Couldn’t keep raising clock frequency
• Non-ideal power scaling
  • Vdd scaling limited due to leakage (Vth)
• General-purpose CPU community shifting to replicating cores
  • Slow growth in frequency
  • Reduces growth in power density
    – but not total heat flux
  • Programming model an open question
• In-order or out-of-order cores?
  • Our early results suggest OO is often superior
• How many threads per core?
  • Sun, for example, puts 4 threads per core on its 8-core T2000 to hide memory latency
  • This comes at the expense of single-thread latency
Multi-Core Isn’t Enough
• High degrees of integration still max out heat removal
• Core type and core count must be selected to maximize power efficiency
• Simply replicating cores and then trying to scale Vdd and frequency will not work
Talk Outline
• Different philosophies of Power-Aware design
  • Energy efficient vs. low power vs. temperature-aware
• Power Management Techniques
  • Dynamic
  • Static
• Thermal Issues
  • Factors to consider
  • DTM techniques
  • Architectural modeling
• Summary of Important Challenges
Metrics
• Power
  • Average power, instantaneous power, peak power
• Energy
  • Energy (MIPS/W)
  • Energy-Delay product (MIPS^2/W)
  • Energy-Delay^2 product (MIPS^3/W) – voltage independent!
• Temperature
  • On-chip temperature: correlated with localized power density
  • Enclosure/rack/data-center cooling
[Side labels on the slide: Low-Power Design; Power-Aware/Energy-Efficient Design; Temperature-Aware Design; Design for power delivery]
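The "voltage independent" remark about Energy-Delay^2 can be illustrated with a small sketch. It assumes a first-order scaling model that is not stated on the slide: energy per operation grows as Vdd^2 and delay as 1/Vdd (frequency roughly proportional to Vdd). The function name and the normalization to 1 V are hypothetical.

```python
def metrics(vdd, energy_at_1v=1.0, delay_at_1v=1.0):
    """First-order scaling (assumed): energy ~ Vdd^2, delay ~ 1/Vdd."""
    energy = energy_at_1v * vdd ** 2
    delay = delay_at_1v / vdd
    return {"E": energy, "ED": energy * delay, "ED2": energy * delay ** 2}

# Energy and ED change with voltage; ED^2 stays constant under this model.
for v in (1.2, 1.0, 0.8):
    m = metrics(v)
    print(f"Vdd={v:.1f}  E={m['E']:.3f}  ED={m['ED']:.3f}  ED2={m['ED2']:.3f}")
```

Under these assumptions, lowering Vdd improves E and ED but leaves ED^2 unchanged, which is why ED^2 is useful for comparing designs independently of the chosen operating voltage.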
Circuit Techniques
• Transistor sizing
• Dynamic vs. static logic
• Signal and clock gating
• Circuit restructuring
• Low power caches, register files, queues
• These typically reduce the capacitance being switched
Clock Gating, Signal Gating
“Disabling a functional block when it is not required for an extended period”
• Implementation
  • Simple gate that replaces one buffer in the clock tree
  • Signal gating is similar, helps avoid glitches
  • Delay is generally not a concern except at fine granularities
• Choice of circuit design and clock gating style can have a dramatic effect on temperature distribution
[Diagram: signal and ctrl inputs gating a functional unit]
Circuit Restructuring
• Parallelize (can reduce frequency)
• Pipeline (tolerate smaller, longer-latency circuitry)
• Reorder inputs so that most active input is closest to output (reduces switched capacitance)
• Restructure gates (equivalent functions are not equivalent in switched capacitance)
Example: Parallelizing (maintain throughput)
  One logic block at Vdd: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Pwr Den = 1
  Two logic blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den = 0.125
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
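The numbers in the parallelizing example follow from dynamic power scaling as N · Vdd^2 · f. A minimal sketch, assuming all quantities are normalized to a single block at Vdd = 1 and Freq = 1 and that leakage is ignored (the function name is hypothetical):

```python
def relative_power(n_blocks, vdd, freq):
    """Dynamic power relative to one block at Vdd = 1, Freq = 1 (P ~ N * Vdd^2 * f)."""
    return n_blocks * vdd ** 2 * freq

n_blocks, vdd, freq = 2, 0.5, 0.5
parallel_power = relative_power(n_blocks, vdd, freq)   # two blocks at half Vdd, half frequency
parallel_area = float(n_blocks)                        # area doubles with two blocks
parallel_throughput = n_blocks * freq                  # two blocks, each at half rate

print("power:", parallel_power)
print("power density:", parallel_power / parallel_area)
print("throughput:", parallel_throughput)
```

The sketch reproduces Power = 0.25 and Pwr Den = 0.125 at unchanged throughput.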
Cache Design
• Caccess = R · C · Ccell (R rows, C cols)
• Reducing power
  • Switched capacitance
  • Voltage swing
  • Activity factor
  • Frequency
[Diagram: cache array with row decoder, column decoder, wordline, bitlines, and sense amp; R rows, C cols]
[Chart: read and write energy (pJoules, 0 to 80) by component: Decoder, Wlines, TBLSA, DBLSA, I/O buses, Other. TBLSA: tag bitlines and sense amp. DBLSA: data bitlines and sense amp. Cache parameters: 16 KB cache, 0.25 μm]
Villa et al, MICRO 2000
Cache Design
• Banked organization
  • Targets switched capacitance
  • Caccess = R · C · Ccell / B
• Dividing bit line
  • Same effect for wordlines
• Reducing voltage swings
  • Sense amplifiers used to detect Vdiff across bitlines
  • Read operation can complete as soon as Vdiff is detected
  • Limiting voltage swing saves a fraction of power
• Pulse word lines
  • Enabling the word line for the time needed to discharge bitcell voltage
  • Designer needs to estimate access time and implement a pulse generator
Architectural-Level Techniques
• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques
[Legend: Prevalent; Growing or Imminent]
Optimal Pipeline Depth
• Increased power and diminishing returns vs. increased throughput
• 5-10 stages, 15-30 FO4
[Charts: single issue and 4-wide issue, plotted against pipeline stages]
Srinivasan et al, MICRO-35; Hartstein and Puzak, ACM TACO, Dec. 2004
Multi-threading
• Do more useful work per unit time
  • Amortize overhead and leakage
• Switch-on-event MT
  • Switch on cache misses, etc. (Ex: Sun T2000 “throughput computing”)
  • Can even rotate among threads every instruction (Tera/Cray)
• Simultaneous Multithreading/HyperThreading
  • For superscalar – eliminate wasted slots
  • Intel Pentium 4, IBM POWER5, Alpha 21464
Architectural-Level Techniques
• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
  • Limits
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques
[Legend: Prevalent; Growing or Imminent]
Multi Clock Domain Architecture
• Multiple voltage/clock domains inside the processor
  • Globally-asynchronous locally synchronous (GALS) clock style
• Independent voltage/frequency scaling in each domain
• Synchronizers to ensure inter-domain communication
• Good for domains that are loosely coupled anyway
  • Integer/FP units in CPUs
  • Multiple cores
Multi Clock Domain Architecture
• Advantages
  • Local clock design is not aware of global skew
  • Each domain limited by its local critical path, allowing higher frequencies
  • Different voltage regulators allow for a finer-grain energy control
  • Frequency/voltage of each domain can be tailored to its dynamic requirements
  • Clock power is reduced
• Drawbacks
  • Complexity and penalty of synchronizers
  • Feasibility of multiple voltage regulators
Simple Example of MCD in GPUs
• T is performance
• ED^2 and E are energy efficiency metrics
• All normalized to default case with no MCD
• The higher the leakage, the more DVS pays off
Static Power Dissipation
• Static power: dissipation due to leakage current
• Exponentially dependent on T, Vdd, Vth
• Most important sources of static power: subthreshold leakage and gate leakage
  • We will focus on subthreshold
  • Gate leakage has essentially been ignored
    – New gate insulation materials may solve problem
Thermal Runaway
• The leakage-temperature feedback can lead to a positive feedback loop
  • Temperature increases → leakage increases → temperature increases → leakage increases → …
Source: www.usswisconsin.org
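The feedback loop can be sketched numerically. The model below is an illustrative assumption, not taken from the talk: steady-state temperature is T = T_amb + Rth · (P_dyn + P_leak(T)), and leakage grows exponentially with temperature as P_leak0 · exp(beta · (T − T0)). All parameter values and names are hypothetical.

```python
import math

def settle_temperature(p_dyn, r_th, t_amb=45.0, p_leak0=10.0, t0=45.0,
                       beta=0.02, t_limit=150.0, max_iter=1000):
    """Iterate T = T_amb + Rth * (P_dyn + P_leak(T)).

    Returns (temperature, converged). converged is False when the
    temperature exceeds t_limit, i.e. the loop runs away.
    """
    t = t_amb
    for _ in range(max_iter):
        p_leak = p_leak0 * math.exp(beta * (t - t0))  # leakage rises with T
        t_next = t_amb + r_th * (p_dyn + p_leak)      # heating rises with leakage
        if t_next > t_limit:
            return t_next, False
        if abs(t_next - t) < 1e-6:
            return t_next, True
        t = t_next
    return t, False

print(settle_temperature(p_dyn=60.0, r_th=0.3))  # good package: settles
print(settle_temperature(p_dyn=60.0, r_th=1.0))  # poor package: runs away
```

With the lower thermal resistance the iteration settles at a stable temperature; with the higher one no stable point exists and the temperature keeps climbing until the limit is crossed.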
A Smorgasbord
• Transistor sizing
• Multi Vth
• Dynamic threshold voltage – reverse body bias – Transmeta Efficeon
  • Transmeta uses runtime compilation and load monitoring to select thresholds
• Stack effect
• Sleep transistors
• DVS
  • Coarse or fine grained
• Low leakage caches, register files, queues
• Hurry up and wait
  • Low leakage: maintain min possible V, f
  • High leakage: use high V/f to finish work quickly, then go to sleep
Leakage Control
• Body bias (Vdd, Vbp +Ve, Vbn -Ve): 2-10X reduction
• Sleep transistor (logic block): 2-1000X reduction
• Stack effect (equal loading): 5-10X reduction
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
Sleep Transistors
• Recent work suggests that a properly sized, low-Vth footer transistor can preserve enough leakage to keep the cell active (Li et al, PACT’02; Agarwal et al, DAC’02)
  • Great care must be taken when switching back to full voltage: noise can flip bits
  • Extra latency may be necessary when re-activating
• Similar to principles in sub-threshold computing
  • Ex – sensor motes for wireless sensor networks
  • Concerns about susceptibility to SEU
[Diagram: logic block with sleep transistor]
Low-Leakage Caches
• Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
  • Uses sleep transistor on Vdd/ground for each cache line
  • Typically considered non-state-preserving, but recent work (Agarwal et al, DAC’02) suggests that gated-Vss may preserve state
  • Many algorithms for determining when to gate
    – May want to make decay interval temperature-dependent
  • Simplest (Kaxiras et al, ISCA-28): Two-bit access counter and decay interval
  • Workload-adaptive decay intervals - hard
• Drowsy cache (Flautner et al, ISCA-29)
  • Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
  • State preserving, but requires an extra cycle to wake up – two extra cycles if tags are decayed
Thermal Issues - Outline
• Arguments for dynamic thermal management
• Factors to consider, such as reliability
• Brief discussion of DTM techniques
• Architectural modeling
• Sensing
Worst-Case Leads to Over-Design
• Average case temperature lower than worst-case
  • Aggressive clock gating
  • Application variations
  • Underutilized resources, e.g. FP units during integer code, vertex units during fill-bound region
  • Currently 20-40% difference
[Figure labels: TDP; reduced target power density; reduced cooling cost]
Source: Gunther et al, ITJ 2001
Temporal, Spatial Variations
[Figure: temperature variation of SPEC applu over time]
Localized hot spots dictate cooling solution
Application Variations
• Wide variation across applications
• Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT)
[Charts: temperature (Kelvin, 370 to 420) for gzip, mcf, swim, mgrid, applu, eon, and mesa; one chart for ST and one for SMT]
Temperature-Aware Design
• Worst-case design is wasteful
• Power management is not sufficient for chip-level thermal management
• Must target blocks with high power density
• When they are hot
• Spreading heat helps
– Even if energy not affected
– Even if average temperature goes up
• This also helps reduce leakage
Dynamic Thermal Management
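[Figure: temperature vs. time with DTM disabled and with DTM/response engaged. Marked levels: designed-for cooling capacity w/out DTM, designed-for cooling capacity w/ DTM, and DTM trigger level. The gap between the two cooling capacities is labeled system cost savings.]
Source: David Brooks 2002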
DTM
• Worst case design for the external cooling solution is wasteful
  • Yet safe temperatures must be maintained when worst case happens
• Thermal monitors allow
  • Tradeoff between cost and performance
  • Cheaper package
    – More triggers, less performance
  • Expensive package
    – No triggers, full performance
Role of Architecture?
Dynamic thermal management (DTM)
• Automatic hardware response when temp. exceeds cooling
  • Cut power density at runtime, on demand
  • Trade reduced costs for occasional performance loss
• Architecture natural granularity for thermal management
  • Activity, temperature correlated within arch. units
  • DTM response can target hottest unit: permits fine-tuned response compared to OS or package
  • Modern architectures offer rich opportunities for remapping computation
    – e.g., CMPs/SoCs, graphics processors, tiled architectures
    – e.g., register file
• DTM will intermittently affect performance
Existing DTM Implementations
• Intel Pentium 4: Global clock gating with shut-down fail-safe
• Intel Pentium M: Dynamic voltage scaling
• Transmeta Crusoe: Dynamic voltage scaling
• IBM Power 5: Probably fetch gating
• ACPI: OS configurable combination of passive & active cooling
• These solutions sacrifice time (slower or stalled execution) to reduce power density
• Better: a solution in “space”
  • Tradeoff between exacerbating leakage (more idle logic) or reducing leakage (lower temperatures)
Alternative: Migrating Computation
This is only a simplistic illustrative example
Space vs. Time
• Moving the hotspot, rather than throttling it, reduces performance overhead by almost 60%
[Chart: slowdown factor (1.00 to 1.40) by technique. Time: DVS 1.270, FG 1.359, Hyb 1.231. Space: MC 1.112]
The greater the replication and spread, the greater the opportunities
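As a worked check of the "almost 60%" figure, the sketch below treats overhead as slowdown factor minus one and compares MC against DVS. Taking DVS as the baseline is an assumption; it is the reading that matches the chart values most closely.

```python
# Slowdown factors read from the chart (1.0 means no slowdown).
slowdown = {"DVS": 1.270, "FG": 1.359, "Hyb": 1.231, "MC": 1.112}

def overhead(technique):
    """Performance overhead as a fraction of baseline execution time."""
    return slowdown[technique] - 1.0

# Fraction of the DVS overhead that is removed by migrating computation (MC).
reduction = 1.0 - overhead("MC") / overhead("DVS")
print(f"MC removes {reduction:.1%} of the DVS overhead")
```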
Granularity of DTM
• Subunit (single queue entry, register, etc.)
  • Lots of replication, low migration cost, but not spread out
• Structure (queue, register file, ALU, etc.)
  • Yuck: copy stalls required, hard to avoid throttling
• Core
  • Lots of replication, good spread, but high migration cost, and local hotspots remain
    – But, if threads are short, scheduling can achieve thermal load balancing without migration
The greater the replication and spread, the greater the opportunities
The shorter the threads, the more flexibility
Thermal Consequences
Temperature affects:
• Circuit performance
• Circuit power (leakage)
• IC reliability
• IC and system packaging cost
• Environment
Performance and Leakage
Temperature affects:
• Transistor threshold and mobility
• Subthreshold leakage, gate leakage
• Ion, Ioff, Igate, delay
• ITRS: 85°C for high-performance, 110°C for embedded!
[Charts: Ion (NMOS) and Ioff]
Reliability
The Arrhenius Equation: MTF = A · exp(Ea / (K · T))
MTF: mean time to failure at T
A: empirical constant
Ea: activation energy
K: Boltzmann’s constant
T: absolute temperature
Failure mechanisms:
• Electromigration
• Dielectric breakdown
• Mechanical stress
• Negative bias temperature instability (NBTI)
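To make the temperature sensitivity of the Arrhenius equation concrete, the sketch below computes the ratio of MTF at two temperatures, in which the empirical constant A cancels. The activation energy of 0.7 eV is a hypothetical value chosen only for illustration; real values depend on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann's constant in eV/K

def mtf_ratio(t_cool_c, t_hot_c, ea_ev=0.7):
    """MTF(t_cool) / MTF(t_hot) from MTF = A * exp(Ea / (K * T)).

    Temperatures are given in degrees Celsius and converted to absolute
    temperature; ea_ev is an assumed activation energy in eV.
    """
    t_cool = t_cool_c + 273.15
    t_hot = t_hot_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_cool - 1.0 / t_hot))

# How much longer the expected lifetime is at 85 C than at 95 C.
print(f"MTF(85C) / MTF(95C) = {mtf_ratio(85.0, 95.0):.2f}")
```

Under this assumed activation energy, a 10° reduction in operating temperature nearly doubles the mean time to failure, which is why small temperature margins matter for reliability.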
Reliability as f(T)
• Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions
• But actual behavior is often not worst case
• So aging occurs more slowly
• This means the DTM design is over-engineered!
• We can exploit this, e.g. for DTM or frequency
[Figure labels: Bank; Spend]
Reliability-Aware DTM
[Chart: average slowdown (0.00 to 0.16) of DTM_controller vs. DTM_reliability for three configurations: Base_Configure, High_Convection_Res..., Thick_Spread_Material]
Heat Mechanisms
• Conduction is the main mechanism in a single chip
  • Conduction is proportional to the temperature difference and surface area
• Convection is the main mechanism in racks, data centers, etc.
Simplistic steady-state model
All thermal transfer: R = k/A
Power density matters!
Ohm’s law for thermals (steady-state)
V = I · R -> T = P · R
T_hot = P · Rth + T_amb
Ways to reduce T_hot:
- reduce P (power-aware)
- reduce Rth (packaging, spread heat)
- reduce T_amb (Alaska?)
- maybe also take advantage of transients (Cth)
[Diagram: T_hot and T_amb]
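The steady-state relation T_hot = P · Rth + T_amb can be turned into a two-line calculator. The example values (100 W, 0.4 °C/W, 45 °C ambient, 85 °C limit) are hypothetical and chosen only to show how the three knobs trade off.

```python
def hot_temperature(power_w, r_th, t_amb):
    """Steady-state hot-spot temperature: T_hot = P * Rth + T_amb."""
    return power_w * r_th + t_amb

def max_power(t_limit, r_th, t_amb):
    """Largest power that keeps T_hot at or below t_limit."""
    return (t_limit - t_amb) / r_th

print(hot_temperature(power_w=100.0, r_th=0.4, t_amb=45.0))  # baseline
print(hot_temperature(power_w=80.0, r_th=0.4, t_amb=45.0))   # reduce P
print(hot_temperature(power_w=100.0, r_th=0.3, t_amb=45.0))  # reduce Rth
print(max_power(t_limit=85.0, r_th=0.4, t_amb=45.0))         # power budget
```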
Simplistic dynamic thermal model
Electrical-thermal duality
V ↔ temp (T)
I ↔ power (P)
R ↔ thermal resistance (Rth)
C ↔ thermal capacitance (Cth)
RC ↔ time constant
KCL
differential eq.  I = C · dV/dt + V/R
difference eq.    ΔV = I/C · Δt – V/RC · Δt
thermal domain    ΔT = P/C · Δt – T/RC · Δt
(T = T_hot – T_amb)
One can compute stepwise changes in temperature for any granularity at which one can get P, T, R, C
[Diagram: T_hot and T_amb]
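The stepwise computation can be sketched as a forward-Euler loop over the difference equation, which follows from the differential equation as ΔT = (P/C) · Δt − (T/(R·C)) · Δt with T measured relative to ambient. The parameter values are hypothetical (R·C = 30 s) and the function name is illustrative.

```python
def simulate(power_w, r_th, c_th, dt, steps, delta_t0=0.0):
    """Step dT = (P/C) * dt - (T/(R*C)) * dt, where T = T_hot - T_amb.

    Returns the list of temperature rises above ambient, one per step.
    """
    delta_t = delta_t0
    trace = [delta_t]
    for _ in range(steps):
        delta_t += (power_w / c_th) * dt - (delta_t / (r_th * c_th)) * dt
        trace.append(delta_t)
    return trace

# 50 W into Rth = 0.5 C/W and Cth = 60 J/C, stepped every 0.1 s for 300 s.
trace = simulate(power_w=50.0, r_th=0.5, c_th=60.0, dt=0.1, steps=3000)
print(f"rise after one time constant (30 s): {trace[300]:.2f} C")
print(f"rise after 300 s: {trace[-1]:.2f} C (steady state is P * Rth = 25 C)")
```

The time step must stay well below R·C for this explicit scheme to remain accurate and stable.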
Thermal resistance
• Θ = r · t / A = t / (k · A) (r: thermal resistivity, k: thermal conductivity, t: thickness, A: cross-sectional area)
Thermal capacitance
• Cth = V · Cp · ρ
ρ(Aluminum) = 2,710 kg/m^3
Cp(Aluminum) = 875 J/(kg·°C)
V = t · A = 0.000025 m^3
Cbulk = V · Cp · ρ = 59.28 J/°C
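The aluminum example can be reproduced in a few lines, together with the thermal resistance formula Θ = t / (k · A). The density and specific heat come from the slide. The split of the volume into thickness 0.01 m and area 0.0025 m^2, and the conductivity of 237 W/(m·K) for aluminum, are assumptions made here for illustration.

```python
RHO_AL = 2710.0   # kg/m^3, from the slide
CP_AL = 875.0     # J/(kg*degC), from the slide
K_AL = 237.0      # W/(m*K), assumed conductivity of pure aluminum

def thermal_capacitance(volume_m3, cp, rho):
    """Cth = V * Cp * rho, in J/degC."""
    return volume_m3 * cp * rho

def thermal_resistance(thickness_m, k, area_m2):
    """Theta = t / (k * A), in degC/W."""
    return thickness_m / (k * area_m2)

# Hypothetical geometry whose volume matches the slide's 0.000025 m^3.
thickness, area = 0.01, 0.0025
c_bulk = thermal_capacitance(thickness * area, CP_AL, RHO_AL)
r_bulk = thermal_resistance(thickness, K_AL, area)

print(f"Cbulk = {c_bulk:.2f} J/C")
print(f"Rbulk = {r_bulk:.4f} C/W")
print(f"time constant R*C = {r_bulk * c_bulk:.2f} s")
```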
Thermal issues summary
• Temperature affects performance, power, and reliability
• Architecture-level: conduction only
  • Very crude approximation of convection as equivalent resistance
  • Convection: too complicated
    – Need CFD!
  • Radiation: can be ignored
• Use compact models for package
• Power density is key
• Temporal, spatial variation are key
• Hot spots drive thermal design
Thermal modeling
• Want a fine-grained, dynamic model of temperature
  • At a granularity architects can reason about
  • That accounts for adjacency and package
  • That does not require detailed designs
  • That is fast enough for practical use
• HotSpot - a compact model based on thermal R, C
  • Parameterized to automatically derive a model based on various
    – Architectures
    – Power models
    – Floorplans
    – Thermal Packages
Dynamic compact thermal model
Electrical-thermal duality
V ↔ temp (T)
I ↔ power (P)
R ↔ thermal resistance (Rth)
C ↔ thermal capacitance (Cth)
RC ↔ time constant (Rth Cth)
Kirchhoff's Current Law
differential eq.  I = C · dV/dt + V/R
thermal domain    P = Cth · dT/dt + T/Rth
where T = T_hot – T_amb
At higher granularities of P, Rth, Cth: P, T are vectors and Rth, Cth are circuit matrices
[Diagram: T_hot and T_amb]
Example System
[Diagram: cross-section of a packaged chip with labels: heat sink, heat spreader, PCB, die, IC package, pin, interface material]
Surface-to-surface contacts
• Not negligible, heat crowding
• Thermal greases/epoxy (can “pump-out”)
• Phase Change Films (undergo a transition from solid to semi-solid with the application of heat)
Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001
Our Model (lateral and vertical)
Interface material (not shown)
HotSpot
• Time evolution of temperature is driven by unit activities and power dissipations averaged over 10K cycles
  • Power dissipations can come from any power simulator, act as “current sources” in RC circuit ('P' vector in the equations)
• Simulation overhead in Wattch/SimpleScalar: < 1%
• Requires models of
  • Floorplan: important for adjacency
  • Package: important for spreading and time constants
• R and C matrices are derived from the above
Validation
• Validated and calibrated using FEM simulations, FPGA measurements, and MICRED test chips
  • 9x9 array of power dissipators and sensors
  • Compared to HotSpot configured with same grid, package
• Within 7% for both steady-state and transient step-response
• Interface material (chip/spreader) matters
Sensors
Caveat emptor:
We are not well-versed on sensor design; the following is a digest of information we have been able to collect from industry sources and the research literature.
Desirable Sensor Characteristics
• Small area
• Low Power
• High Accuracy + Linearity
• Easy access and low access time
• Fast response time (slew rate)
• Easy calibration
• Low sensitivity to process and supply noise
Types of Sensors
(In approx. order of increasing ease to build)
• Thermocouples – voltage output
  • Junction between wires of different materials; voltage at terminals is proportional to Tref – Tjunction
  • Often used for external measurements
• Thermal diodes – voltage output
  • Biased p-n junction; voltage drop for a known current is temperature-dependent
• Biased resistors (thermistors) – voltage output
  • Voltage drop for a known current is temperature dependent
    – You can also think of this as varying R
  • Example: 1 KΩ metal “snake”
• BiCMOS, CMOS – voltage or current output
  • Rely on reference voltage or current generated from a reference band-gap circuit; current-based designs often depend on temp-dependence of threshold
• 4T RAM cell – decay time is temp-dependent
  • [Kaxiras et al, ISLPED’04]
Sensors: Problem Issues
• Poor control of CMOS transistor parameters
• Noisy environment
  • Cross talk
  • Ground noise
  • Power supply noise
• These can be reduced by making the sensor larger
  • This increases power dissipation
  • But we may want many sensors
“Reasonable” Values
• Based on conversations with engineers at Sun, Intel, and HP (Alpha)
• Linearity: not a problem for range of temperatures of interest
• Slew rate: < 1 μs
  • This is the time it takes for the physical sensing process (e.g., current) to reach equilibrium
• Sensor bandwidth: << 1 MHz, probably 100-200 kHz
  • This is the sampling rate; 100 kHz = 10 μs
  • Limited by slew rate but also A/D
    – Consider digitization using a counter
“Reasonable” Values: Precision
• Mid 1980s: < 0.1° was possible
• Precision
  • ± 3° is very reasonable
  • ± 2° is reasonable
  • ± 1° is feasible but expensive
  • < ± 1° is really hard
• The limited precision of the G3 sensor seems to have been a design choice involving the digitization
• P: 10s of mW
Calibration
• Accuracy vs. Precision
  • Analogous to mean vs. stdev
• Calibration deals with accuracy
  • The main issue is to reduce inter-die variations in offset
  • Typically requires per-part testing and configuration
  • Basic idea: measure offset, store it, then subtract this from dynamic measurements
Dynamic Offset Cancellation
• Rich area of research
• Build circuit to continuously, dynamically detect offset and cancel it
• Typically uses an op-amp
• Has the advantage that it adapts to changing offsets
• Has the disadvantage of more complex circuitry
Role of Precision
• Suppose:
  • Junction temperature is J
  • Max variation in sensor is S, offset is O
  • Thermal emergency is T
  • T = J – S – O
• Spatial gradients
  • If sensors cannot be located exactly at hotspots, measured temperature may be G° lower than true hotspot
  • T = J – S – O – G
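The effect of sensor error on the trigger threshold can be sketched directly from T = J – S – O – G. The numbers used here (85° junction limit, 2° sensor variation, 1° offset, 3° gradient) are hypothetical.

```python
def trigger_threshold(junction_limit, sensor_variation, offset, gradient=0.0):
    """Thermal-emergency threshold T = J - S - O - G, in degrees.

    Every degree of sensor uncertainty lowers the temperature at which
    DTM must engage to guarantee the junction limit is never exceeded.
    """
    return junction_limit - sensor_variation - offset - gradient

ideal = trigger_threshold(85.0, 0.0, 0.0)
at_hotspot = trigger_threshold(85.0, 2.0, 1.0)
away_from_hotspot = trigger_threshold(85.0, 2.0, 1.0, gradient=3.0)
print(ideal, at_hotspot, away_from_hotspot)
```

A lower threshold means DTM engages more often, so each degree of imprecision costs performance.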
Rate of Change of Temperature
• Our FEM simulations suggest maximum 0.1° in about 25-100 μs
• This is for power density < 1 W/mm^2, die thickness between 0.2 and 0.7 mm, and contemporary packaging
• This means slew rate is not an issue
• But sampling rate is!
Sensors Summary
• Sensor precision cannot be ignored
  • Reducing operating threshold by 1-2 degrees will affect performance
• Precision of 1° is conceivable but expensive
  • Maybe reasonable for a single sensor or a few
• Power and area are probably negligible from the architecture standpoint
• Sampling period <= 10-20 μs
Massive Multi-Core Design Space
• # cores
• Pipeline depth
• Pipeline width
• In-order vs. out-of-order
• Cache per core
• Core-to-core interconnect fabric
• All dependent on temperature constraints!
Wither Core Type?
vs.
Source: Christopher Reeve Homepage, http://www.chrisreevehomepage.com/
Cores may also be heterogeneous, with a few powerful cores and very many small cores
Hot spot?
Impact of Thermal Constraints
[Chart: BIPS (0 to 12) vs. core number (2 to 20) for 2MB/18FO4/2 and 2MB/18FO4/4]
Thermal limits change the optimal pipeline width as core count increases
Impact of Thermal Constraints
[Chart, without thermal constraints: BIPS (0 to 45) vs. core number (2 to 20) for 4MB/12FO4/4, 4MB/18FO4/4, 4MB/24FO4/4, 4MB/30FO4/4]
[Chart, with thermal constraints: BIPS (0 to 12) vs. core number (2 to 20) for 2MB/12FO4/4, 2MB/18FO4/4, 2MB/24FO4/4, 2MB/30FO4/4]
Thermal limits favor shallower pipelines
Pipeline depth, which is often fixed early in the design, can impact the multi-core performance dramatically
Workload Sensitivity
[Chart, CPU bound (400 mm^2, cheap thermal package): BIPS (0 to 12) vs. core number (2 to 20) for 2MB/12FO4/4, 2MB/18FO4/4, 2MB/24FO4/4, 2MB/30FO4/4, 2MB/18FO4/2, 2MB/18FO4/4, 8MB/18FO4/2, 8MB/18FO4/4]
[Chart, memory bound (400 mm^2, cheap thermal package): BIPS (0 to 5) vs. core number (2 to 20) for 8MB/12FO4/4, 8MB/18FO4/4, 8MB/24FO4/4, 8MB/30FO4/4, 8MB/18FO4/2, 8MB/18FO4/4, 16MB/18FO4/2, 16MB/18FO4/4]
CPU- and memory-bound applications desire different resources
26-53% performance loss if the best configurations are switched!
Summary
• Reviewed current techniques for managing dynamic power, leakage power, temperature
• A major obstacle with architectural techniques is the difficulty of predicting performance impact
• Spread heat in space, not time
• Continuing integration makes power and thermal constraints even more important
• Optimal multi-core design is dependent on thermal considerations
• Security challenges
More Info
http://www.cs.virginia.edu/~skadron
LAVA Lab
Backup Slides
• These slides are an assortment that wouldn’t fit in the talk but I kept to answer questions or provide more info
Hot Chips are No Longer Cool!
[Chart: power density (Watts/cm^2, log scale from 1 to 1000) by processor generation: i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, with reference levels for hot plate, nuclear reactor, and rocket nozzle; callout labels: Today’s laptops, SIA]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference key note - 1999.
ITRS quotes – thermal challenges
• For small dies with high pad count, high power density, or high frequency, “operating temperature, etc for these devices exceed the capabilities of current assembly and packaging technology.”
• “Thermal envelopes imposed by affordable packaging discourage very deep pipelining.”
• Intel recently canceled its NetBurst microarchitecture
– Press reports suggest thermal envelopes were a factor
Dynamic Power Consumption
• Power dissipated due to switching activity
• A capacitance is charged and discharged
Charge/discharge at the frequency f
P = a · CL · V^2 · f
Ec = 1/2 · CL · V^2
Ed = 1/2 · CL · V^2
[Diagram: load capacitance CL charged from Vdd and discharged]
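A short sketch of P = a · CL · V^2 · f shows why scaling voltage and frequency together is so effective. The activity factor, capacitance, voltage, and frequency below are hypothetical values, not figures from the talk.

```python
def dynamic_power(activity, c_load_farads, vdd, freq_hz):
    """Switching power P = a * CL * V^2 * f, in watts."""
    return activity * c_load_farads * vdd ** 2 * freq_hz

def transition_energy(c_load_farads, vdd):
    """Energy of one charge (or one discharge) event: 1/2 * CL * V^2."""
    return 0.5 * c_load_farads * vdd ** 2

nominal = dynamic_power(activity=0.2, c_load_farads=50e-9, vdd=1.2, freq_hz=3.0e9)
scaled = dynamic_power(activity=0.2, c_load_farads=50e-9, vdd=1.2 * 0.8, freq_hz=3.0e9 * 0.8)

print(f"nominal: {nominal:.1f} W")
print(f"V and f both scaled to 80%: {scaled:.1f} W ({scaled / nominal:.3f} of nominal)")
```

Because power depends on V^2 · f, reducing both by 20% cuts dynamic power roughly in half while performance drops by only about 20%.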
Transistor Sizing
• Transistor sizing plays an important role to reduce power
• Delay ~ K / ln K
• Power ~ K / (K-1)
• Optimum K for both power and delay must be pursued
[Diagram: buffer chain with capacitances C0, C1, …, CN-1, CN]
K = Ci/Ci-1
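The tension between the two expressions can be seen by evaluating them over a range of stage ratios K. This sketch uses the proportionalities as written (Delay ~ K / ln K, Power ~ K / (K-1)) with constant factors dropped; the grid of candidate K values is arbitrary.

```python
import math

def relative_delay(k):
    """Delay of the buffer chain up to a constant factor: K / ln K."""
    return k / math.log(k)

def relative_power(k):
    """Power of the buffer chain up to a constant factor: K / (K - 1)."""
    return k / (k - 1.0)

# Candidate stage ratios from 1.5 to 10.0 in steps of 0.01.
candidates = [1.5 + 0.01 * i for i in range(851)]
best_k = min(candidates, key=relative_delay)

print(f"delay-optimal K is about {best_k:.2f} (e = {math.e:.3f})")
for k in (2.0, best_k, 4.0, 8.0):
    print(f"K={k:.2f}  delay={relative_delay(k):.3f}  power={relative_power(k):.3f}")
```

Delay is minimized near K = e, while power keeps falling as K grows, so a ratio somewhat above e trades a small delay penalty for lower power.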
Signal Gating
“techniques to mask unwanted switching activities from propagating forward, causing unnecessary power dissipation”
• Implementation
  • Simple gate
  • Tristate buffer
  • ...
• Control signal needed
  • Generation requires additional logic
• Especially helps to prevent power dissipation due to glitches
[Diagram: signal and ctrl inputs producing Output]
Different Implementation and Corresponding Clock Gating Choices
Latch-Mux Design
SRAM Design
DVS “Critical Power Slope”
• It may be more efficient not to use DVS, and to run at the highest possible frequency, then go into a sleep mode!
  • Depends on power dissipation in sleep mode vs. power dissipation at lowest voltage
• This has been formalized as the critical power slope (Miyoshi et al, ICS’02):
  • m_critical = (P_fmin – P_idle) / f_min
  • If the actual slope m = (P_f – P_fmin) / (f – f_min) < m_critical, then it is more energy efficient to run at the highest frequency, then go to sleep
• Switching overheads must be taken into account
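The critical power slope test can be sketched and cross-checked against a direct energy comparison. The sketch assumes a fixed number of cycles that must finish by the time the slowest setting would need, and it ignores switching overheads. All power and frequency values are hypothetical.

```python
def critical_slope(p_fmin, p_idle, f_min):
    """m_critical = (P_fmin - P_idle) / f_min."""
    return (p_fmin - p_idle) / f_min

def actual_slope(p_f, p_fmin, f, f_min):
    """m = (P_f - P_fmin) / (f - f_min)."""
    return (p_f - p_fmin) / (f - f_min)

def prefer_race_to_sleep(p_f, p_fmin, p_idle, f, f_min):
    """True when running fast and then sleeping uses less energy."""
    return actual_slope(p_f, p_fmin, f, f_min) < critical_slope(p_fmin, p_idle, f_min)

def energy_slow(p_fmin, f_min, cycles):
    """Energy when running at f_min for the whole interval."""
    return p_fmin * cycles / f_min

def energy_race(p_f, p_idle, f, f_min, cycles):
    """Energy when running at f, then idling until the same deadline."""
    busy = cycles / f
    deadline = cycles / f_min
    return p_f * busy + p_idle * (deadline - busy)

# 2e9 cycles; f_min = 1 GHz at 10 W, idle at 1 W, two candidate fast settings.
for p_f in (16.0, 25.0):
    race = prefer_race_to_sleep(p_f, 10.0, 1.0, 2.0e9, 1.0e9)
    print(p_f, race, energy_race(p_f, 1.0, 2.0e9, 1.0e9, 2.0e9),
          energy_slow(10.0, 1.0e9, 2.0e9))
```

In both cases the slope test agrees with the direct energy comparison, which is the point of the formalization.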
Application-Specific Hardware
• Specialized logic is usually much lower power
• Co-processors
  • Ex: TCP/IP offload, codecs, etc.
• Functional units
  • Ex: Intel SSE, specialized arithmetic (e.g., graphics), etc.
  • Ex: Custom instructions in configurable cores (e.g., Tensilica)
• Specific example: Zoran ER4525 – cell phone
  • ARM microcontroller, no DSP!
  • Video capture & pre/post processing
  • Video codec
  • 2D/3D rendering
  • Video display
  • Security
Gate Leakage
• Not clear if new oxide materials will arrive in time
• Any technique that reduces Vdd helps
• Otherwise it seems difficult to develop architecture techniques that directly attack gate leakage
• In fact, very little work has been done in this area
  • One example: domino gates (Hamzaoglu & Stan, ISLPED’02)
    – Replace traditional NMOS pull-down network with a PMOS pull-up network
    – Gate leakage is greater in NMOS than PMOS
    – But PMOS domino gate is slower
• Note: Gate oxide so thin - especially prone to manufacturing variations
Static Power - Modeling
• Modeling Leakage
• Butts and Sohi (MICRO-33)
– Pstatic = Vcc · N · kdesign · Îleak
– Îleak determined by circuit simulation, kdesign empirically
– Key contribution: separate technology from design
• HotLeakage (UVA TR CS-2003-05, DATE’04)
– Extension of Butts & Sohi approach: scalable with Vdd, Vth, Temp, and technology node; adds gate leakage
– Îleak determined by BSIM3 subthreshold equation and BSIM4 gate-leakage equations, giving an analytical expression that accounts for dependence on factors that may change at runtime, namely Vdd, Vth, and Temp
– kdesign replaced by separate factors for N- and P-type transistors
– kdesign also exponentially dependent on Vdd and Tox, linearly dependent on Temp
– Currently integrated with SimpleScalar/Wattch for caches
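As a rough illustration of the Butts and Sohi expression Pstatic = Vcc · N · kdesign · Îleak, the sketch below simply evaluates the product. All numeric values (transistor count, kdesign, per-device leakage current) are hypothetical placeholders, not figures from the paper, and the function name is made up for this example. HotLeakage would replace the constant Îleak with an expression in Vdd, Vth, and temperature, which is not attempted here.

```python
def static_power_watts(vcc, n_transistors, k_design, i_leak_amps):
    """Butts-Sohi style estimate: P_static = Vcc * N * k_design * I_leak.

    vcc           -- supply voltage (V)
    n_transistors -- number of transistors in the structure
    k_design      -- empirical design-dependent factor (dimensionless)
    i_leak_amps   -- normalized per-device leakage current (A)
    """
    return vcc * n_transistors * k_design * i_leak_amps


# Hypothetical inputs for a cache-like structure.
p = static_power_watts(vcc=1.2, n_transistors=1_000_000,
                       k_design=1.2, i_leak_amps=20e-9)
print(f"Estimated static power: {p * 1e3:.1f} mW")
```

The separation the slide highlights is visible in the signature: `i_leak_amps` carries the technology, while `n_transistors` and `k_design` carry the design.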
Static Power – Modeling
• Modeling Leakage (cont.)
• Su et al, IBM (ISLPED’03)
– Similar approach to HotLeakage, but they observe that modeling the change in leakage allows linearization of the equations
• Many other papers address various aspects of leakage modeling
– Most focus on subthreshold
– Few suggest how to model leakage in microarchitecture simulations
Performance Comparison
• TT-DFS is best but can’t prevent excess temperature
• Suitable for use with aggressive clock rates at low temp.
• Hybrid technique reduces DTM cost by 25% vs. DVS (DVS overhead important)
• A substantial portion of MC’s benefit comes from the altered floorplan, which separates hot units
[Figure: bar chart of Slowdown Factor (axis 1.00 to 1.40) for TTDFS, DVS, FG, Hyb, MC; bar labels 1.045, 1.270, 1.359, 1.231, 1.112]
EM Model
$$\int_0^{t_{failure}} \frac{1}{kT(t)}\, e^{-\frac{E_a}{kT(t)}}\, dt = \varphi_{th} = \text{const}$$

Life Consumption Rate:

$$R(t) = \frac{1}{kT(t)}\, e^{-\frac{E_a}{kT(t)}}$$
Apply in a “lumped” fashion at the granularity of microarchitecture units, just like RAMP [Srinivasan et al.]
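A minimal sketch of applying the life consumption rate R(t) = (1/kT(t)) · exp(−E_a / kT(t)) to one unit's temperature trace is shown below. The activation energy of 0.9 eV, the temperature traces, and the sampling interval are assumptions chosen for illustration, and the function names are hypothetical; a rectangle-rule sum stands in for the integral.

```python
import math

K_BOLTZMANN_EV = 8.617333e-5  # Boltzmann constant in eV/K


def life_consumption_rate(temp_k, e_a_ev):
    """R = (1 / kT) * exp(-E_a / kT) for one temperature sample (kelvin)."""
    kt = K_BOLTZMANN_EV * temp_k
    return math.exp(-e_a_ev / kt) / kt


def consumed_life(temps_k, dt_s, e_a_ev):
    """Rectangle-rule approximation of the integral of R(t) dt."""
    return sum(life_consumption_rate(t, e_a_ev) * dt_s for t in temps_k)


E_A = 0.9                                  # eV, assumed activation energy
cool_trace = [353.15] * 1000               # unit held at 80 C
hot_trace = [353.15] * 500 + [373.15] * 500  # half the time at 100 C

ratio = consumed_life(hot_trace, 1e-3, E_A) / consumed_life(cool_trace, 1e-3, E_A)
print(f"Hot trace consumes life {ratio:.2f}x faster than the cool trace")
```

Comparing the accumulated value against the failure threshold per unit is what makes the model usable at microarchitecture granularity.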
Carnot efficiency
• Note that in all cases, heat transfer is proportional to ΔT
• This is also one of the reasons energy “harvesting” in computers is probably not cost-effective
  • ΔT w.r.t. ambient is << 100°
  • For example, with a 25W processor, thermoelectric effect yields only ~50mW
    • Solbrekken et al, ITHERM’04
• This is also why Peltier coolers are not energy efficient
  • 10% eff., vs. 30% for a refrigerator
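To put a number on the small-ΔT argument, the sketch below evaluates the Carnot limit, 1 − T_cold / T_hot with temperatures in kelvin. The 85 °C die temperature and 45 °C ambient are assumed values for illustration; only the 25 W and ~50 mW figures come from the slide.

```python
def carnot_efficiency(t_hot_k, t_cold_k):
    """Upper bound on heat-to-work conversion between two reservoirs."""
    return 1.0 - t_cold_k / t_hot_k


T_HOT_K = 85.0 + 273.15       # assumed die temperature
T_AMBIENT_K = 45.0 + 273.15   # assumed ambient inside the case
CHIP_POWER_W = 25.0           # processor power from the slide

eta = carnot_efficiency(T_HOT_K, T_AMBIENT_K)
print(f"Carnot limit: {eta:.1%}, at most {eta * CHIP_POWER_W:.2f} W recoverable")
print(f"Reported thermoelectric yield: {0.050 / CHIP_POWER_W:.1%} of chip power")
```

Even the ideal bound is only a small fraction of the dissipated power, and the reported thermoelectric yield sits far below that bound.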
Thermal Modeling
• Want a fine-grained, dynamic model of temperature
  • At a granularity architects can reason about
  • That accounts for adjacency and package
  • That does not require detailed designs
  • That is fast enough for practical use
• HotSpot - a compact model based on thermal R, C
  • Parameterized to automatically derive a model based on various…
    – Architectures
    – Power models
    – Floorplans
    – Thermal Packages
Temperature equations
• Fundamental RC differential equation
  • P = C dT/dt + T / R
• Steady state
  • dT/dt = 0
  • P = T / R
• When R and C are network matrices
  • Steady state – T = R × P
  • Modified transient equation
    – dT/dt + (RC)^-1 × T = C^-1 × P
• HotSpot software mainly solves these two equations
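The steady-state case T = R × P can be illustrated with a toy two-block network. This is a sketch under stated assumptions, not HotSpot code: each block has a vertical resistance to ambient and the blocks share one lateral resistance, all resistance and power values are hypothetical, and temperatures are rises above ambient. The resistance matrix R is obtained by inverting the conductance matrix.

```python
def conductance_matrix(r_v1, r_v2, r_lat):
    """Nodal conductance matrix G for two blocks (W/K)."""
    g_lat = 1.0 / r_lat
    return [[1.0 / r_v1 + g_lat, -g_lat],
            [-g_lat, 1.0 / r_v2 + g_lat]]


def invert_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def steady_state(r_matrix, power):
    """T = R x P, with T measured relative to ambient."""
    return [sum(r * p for r, p in zip(row, power)) for row in r_matrix]


# Hypothetical values: 2 K/W vertical, 4 K/W lateral, 10 W and 2 W blocks.
R = invert_2x2(conductance_matrix(2.0, 2.0, 4.0))
T = steady_state(R, [10.0, 2.0])
print("Temperature rise per block (K):", T)
```

Without the lateral path the hot block would sit at 10 W × 2 K/W = 20 K above ambient; the off-diagonal terms of R capture the heat it sheds into its cooler neighbor.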
Our Model (Lateral and Vertical)
Interface material (not shown)
Derived from material and geometric properties
Transient solution
• Solves differential equations of the form dT/dt + AT = B, where A and B are constants
• In HotSpot, A is constant (set by the RC network), but B depends on the power dissipation
• Solution – assume constant average power dissipation within an interval (10K cycles) and call RK4 at the end of each interval
• In RK4, the current temperature (at t) is advanced in very small steps (t+h, t+2h, ...) until the next interval (10K cycles)
• RK “4” because the error term is 4th order, i.e., O(h^4)
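The interval-by-interval scheme can be sketched for a single thermal node obeying dT/dt = (P − T/R) / C. This is an illustrative classical RK4 loop, not HotSpot's solver: the R, C, interval length, step count, and power trace are all hypothetical, and the helper names are invented for the example.

```python
import math


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)


def advance_interval(temp, power, r, c, interval, steps):
    """Advance one interval in small steps, holding power constant."""
    h = interval / steps
    f = lambda t, y: (power - y / r) / c
    t = 0.0
    for _ in range(steps):
        temp = rk4_step(f, t, temp, h)
        t += h
    return temp


R_TH, C_TH = 2.0, 0.05      # K/W and J/K, so RC = 0.1 s (assumed)
INTERVAL, STEPS = 0.05, 50  # seconds per interval, RK4 steps per interval

temp = 0.0                  # rise above ambient
for avg_power in (10.0, 10.0, 2.0):   # average power per interval (W)
    temp = advance_interval(temp, avg_power, R_TH, C_TH, INTERVAL, STEPS)
    print(f"P={avg_power:4.1f} W -> T rise = {temp:.3f} K")
```

For constant power the exact answer is T(t) = PR + (T0 − PR) · e^(−t/RC), which gives a convenient check on the step size.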
Transient solution contd...
• 4th order error has to be within the required precision
• The step size (h) has to be small enough even for the maximum slope of the temperature evolution curve
• Transient solution of the differential equation is of the form A·e^(-Bt), where A and B depend on the RC network
• Thus, the maximum value of the slope (A × B) and the step size are computed accordingly
HotSpot
• Time evolution of temperature is driven by unit activities and power dissipations averaged over 10K cycles
  • Power dissipations can come from any power simulator, act as “current sources” in RC circuit
  • Simulation overhead in Wattch/SimpleScalar: < 1%
• Requires models of
  • Floorplan: important for adjacency
  • Package: important for spreading and time constants
Notes
• Note that HotSpot currently measures temperatures in the silicon
• But that’s also what most sensors measure
• Temperature continues to rise through each layer of the die
• Temperature in upper-level metal is considerably higher
• Interconnect model released soon!
HotSpot Summary
• HotSpot is a simple, accurate and fast architecture level thermal model for microprocessors
• Over 850 downloads since June’03
• Ongoing active development – architecture level floorplanning will be available soon
• Download site
  • http://lava.cs.virginia.edu/HotSpot
• Mailing list
  • www.cs.virginia.edu/mailman/listinfo/hotspot
Hybrid DTM
• DVS is attractive because of its cubic advantage
• P ∝ V²f
• This factor dominates when DTM must be aggressive
• But changing DVS setting can be costly
– Resynchronize PLL
– Sensitive to sensor noise → spurious changes
• Fetch gating is attractive because it can use instruction level parallelism to reduce impact of DTM
• Only effective when DTM is mild
• So use both!
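The “cubic advantage” follows from dynamic power scaling as V²f: if frequency is lowered in proportion to voltage (a simplifying assumption made here), scaling both by a factor s scales power by s³. A minimal sketch, with the 0.85 scale factor chosen arbitrarily:

```python
def relative_dynamic_power(v_scale, f_scale):
    """Dynamic power relative to nominal, from P proportional to V^2 * f."""
    return v_scale ** 2 * f_scale


s = 0.85  # hypothetical DVS setting: 85% voltage and 85% frequency
dvs_power = relative_dynamic_power(s, s)          # voltage and frequency scaled
dfs_power = relative_dynamic_power(1.0, s)        # frequency scaled only

print(f"DVS at {s:.0%}: {dvs_power:.1%} of nominal dynamic power")
print(f"Frequency scaling only: {dfs_power:.1%} of nominal dynamic power")
```

For the same frequency reduction, lowering the voltage as well buys a much larger power cut, which is why DVS dominates when DTM must be aggressive.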
Migrating Computation
• When one unit overheats, migrate its functionality to a distant, spare unit (MC)
  • Spare register file (Skadron et al. 2003)
  • Separate core (CMP) (Heo et al. 2003)
  • Microarchitectural clusters
  • etc.
• Raises many interesting issues
  • Cost-benefit tradeoff for that area
  • Use both resources (scheduling)
  • Extra power for long-distance communication
  • Floorplanning
Hybrid DTM, cont.
• Combine fetch gating with DVS
• When DVS is better, use it
• Otherwise use fetch gating
• Determined by magnitude of temperature overshoot
• Crossover at FG duty cycle of 3
• FG has low overhead: helps reduce cost of sensor noise
[Figure: two plots of Slowdown vs. Duty Cycle for FG, DVS, and Hyb]
Hybrid DTM, cont.
• DVS doesn’t need more than two settings for thermal control
• Lower voltage cools chip faster
• FG by itself does need multiple duty cycles and hence requires PI control
• But in a hybrid configuration, FG does not require PI control
• FG is only used at mild DTM settings
• Can pick one fixed duty cycle
• This is beneficial because feedback control is vulnerable to noise
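One way to picture the hybrid policy is as a three-way decision on temperature overshoot: no response below the trigger, fetch gating at a single fixed duty cycle for mild overshoot, and the low-voltage DVS setting beyond a crossover point. The sketch below is only an illustration; the trigger temperature, the crossover margin, and all names are hypothetical and are not taken from the slides' experiments.

```python
from enum import Enum


class Action(Enum):
    NONE = "no DTM"
    FETCH_GATING = "fetch gating, fixed duty cycle"
    DVS_LOW = "DVS low-voltage setting"


def hybrid_dtm(temp_c, trigger_c=85.0, crossover_c=1.0):
    """Pick a response from the overshoot above the trigger temperature.

    trigger_c and crossover_c are assumed values for illustration.
    """
    overshoot = temp_c - trigger_c
    if overshoot <= 0.0:
        return Action.NONE           # below the trigger: run at full speed
    if overshoot < crossover_c:
        return Action.FETCH_GATING   # mild stress: cheap, no PI controller
    return Action.DVS_LOW            # severe stress: binary DVS


for reading in (84.0, 85.4, 87.0):
    print(f"{reading:.1f} C -> {hybrid_dtm(reading).value}")
```

Because each branch is a fixed setting, no feedback controller is involved, which matches the slide's point about avoiding noise-sensitive PI control.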
Sensors
• Almost half of DTM overhead is due to
  • Guard banding due to offset errors and lack of co-located sensors
  • Spurious sensor readings due to noise
• Need localized, fine-grained sensing
  • Need new sensor designs that are cheap and can be used liberally – co-locate with hotspots
  • But these may be imprecise
• Many sensor designs look promising
  • Need new data fusion techniques to reduce imprecision, possibly combine heterogeneous sensors
Impact of Physical Constraints
• Thermal constraints shift optimum toward fewer and simpler cores
  • CPU-bound programs still want aggressive superscalar cores despite throttling, but not deeply pipelined
  • Mem-bound programs want narrow cores, lots of L2
• You can still have lots of cores
  • They will be severely throttled (e.g., up to 45% voltage reduction and 75% frequency reduction)
  • But you still win by adding cores until throttling outweighs the benefit of an additional core
• Preliminary results suggest that out-of-order (OO) cores are always preferable: they are more efficient in terms of BIPS/area