© 2006, kevin skadron power-aware and temperature-aware architecture kevin skadron lava/hotspot lab...

© 2006, Kevin Skadron

Power-Aware and Temperature-Aware Architecture

Kevin Skadron

LAVA/HotSpot Lab
Dept. of Computer Science
University of Virginia
Charlottesville, VA

[email protected]


“Cooking-Aware” Computing?


Thermal Packaging is Expensive

• Nvidia GeForce 5900

Source: Tech-Report.com http://www.ixbt.com/video2/images/g71/7900gtx-front.jpg

• Nvidia GeForce 7900

Source: Gordon Bell, “A Seymour Cray perspective”http://www.research.microsoft.com/users/gbell/craytalk/


“Moore’s Law” for Power

[Figure: max power (Watts, log scale, 1 to 100) by Intel processor generation: i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, then “??”]

Source: Intel

• Reasons: higher frequencies, more “stuff”


Leakage – A Growing Problem

• The fraction of leakage power is increasing exponentially
• Also exponentially dependent on temperature
• This is bad for designs with idle logic, e.g. multi-core processors, specialized functional units, lots of storage, etc.

Source: N. S. Kim et al., “Leakage Current: Moore’s Law Meets Static Power,” IEEE Computer, Dec. 2003.


Inter-Related Design Objectives

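• Performance gains increasingly require gains in
  • Cooling efficiency
  • Power efficiency

[Figure: dependency diagram linking Vdd, Vth, and Area to Frequency, Throughput, Performance, Dynamic Power, Leakage Power, Temp, Cost, and Reliability; two of the dependencies are marked “exp” (exponential)]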


ITRS Projections

• These are targets; it is doubtful that they are feasible

• Growth in power density means cooling costs continue to grow

ITRS 2005

Year                           2003   2006   2010   2013   2016
Tech node (nm)                  100     70     45     32     22
Vdd (high perf) (V)             1.2    1.1    1.0    0.9    0.8
Vdd (low power) (V)             1.0    0.9    0.7    0.6    0.5
Frequency (high perf) (GHz)     3.0    6.8   15.1   23.0   39.7
Max power (W)
  High-perf w/ heatsink         149    180    198    198    198
  Cost-performance               80     98    119    137    151
  Hand-held                     2.1    3.0    3.0    3.0    3.0

Callouts on the table: “2001 – was 0.4” and “2001 – was 288”


Hitting the Power Wall

• Intel canceled Pentium 4 microarchitecture in part due to power limits
  • Couldn’t keep raising clock frequency
• Non-ideal power scaling
  • Vdd scaling limited due to leakage (Vth)
• General-purpose CPU community shifting to replicating cores
  • Slow growth in frequency
  • Reduces growth in power density
    – but not total heat flux
  • Programming model an open question
• In-order or out-of-order cores?
  • Our early results suggest OO is often superior
• How many threads per core?
  • Sun, for example, puts 4 threads per core on its 8-core T2000 to hide memory latency
  • This comes at the expense of single-thread latency


Multi-Core Isn’t Enough

• High degrees of integration still max out heat removal

• Core type and core count must be selected to maximize power efficiency

• Simply replicating cores and then trying to scale Vdd and frequency will not work


Talk Outline

• Different philosophies of Power-Aware design
  • Energy efficient vs. low power vs. temperature-aware
• Power Management Techniques
  • Dynamic
  • Static
• Thermal Issues
  • Factors to consider
  • DTM techniques
  • Architectural modeling
• Summary of Important Challenges


Metrics

• Power
  • Average power, instantaneous power, peak power
• Energy
  • Energy (MIPS/W)
  • Energy-Delay product (MIPS²/W)
  • Energy-Delay² product (MIPS³/W) – voltage independent!
• Temperature
  • On-chip temperature: correlated with localized power density
  • Enclosure/rack/data-center cooling

[Figure labels: Low-Power Design; Power-Aware/Energy-Efficient Design; Temperature-Aware Design; Design for power delivery]
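The “voltage independent” note on the Energy-Delay² metric can be checked numerically. The following is a minimal sketch, assuming the first-order scaling rules that switching energy grows as V² and achievable frequency grows linearly with V; both rules are simplifications, and the function name and units are illustrative only.

```python
def metrics(v, work=1.0, c=1.0):
    """Return (energy, ED, ED^2) for a fixed amount of work at supply voltage v.

    Assumes energy ~ c * v^2 per unit of work and frequency ~ v,
    so delay ~ work / v. All quantities are in arbitrary units.
    """
    energy = c * v * v * work
    delay = work / v
    return energy, energy * delay, energy * delay ** 2


for v in (1.0, 0.8, 0.6):
    e, ed, ed2 = metrics(v)
    print(f"V={v:.1f}  E={e:.3f}  ED={ed:.3f}  ED^2={ed2:.3f}")
```

Under these assumptions E and ED fall as the voltage drops, while ED² stays constant, so ED² compares designs without rewarding one simply for running at a lower voltage.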


Circuit Techniques

• Transistor sizing

• Dynamic vs. static logic

• Signal and clock gating

• Circuit restructuring

• Low power caches, register files, queues

• These typically reduce the capacitance being switched


Clock Gating, Signal Gating

“Disabling a functional block when it is not required for an extended period”

• Implementation
  • Simple gate that replaces one buffer in the clock tree
  • Signal gating is similar, helps avoid glitches
  • Delay is generally not a concern except at fine granularities
• Choice of circuit design and clock gating style can have a dramatic effect on temperature distribution

[Figure: gating schematic with signal and ctrl inputs feeding a functional unit]


Circuit Restructuring

• Parallelize (can reduce frequency)
• Pipeline (tolerate smaller, longer-latency circuitry)
• Reorder inputs so that most active input is closest to output (reduces switched capacitance)
• Restructure gates (equivalent functions are not equivalent in switched capacitance)

Example: Parallelizing (maintain throughput)

  One logic block at Vdd:      Freq = 1,   Vdd = 1,   Throughput = 1, Power = 1,    Area = 1, Pwr Den = 1
  Two logic blocks at Vdd/2:   Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den = 0.125

Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004
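The numbers in the parallelizing example follow from the dynamic power relation P = a · C · V² · f. A small sketch in normalized units, assuming (as the slide does) that halving the frequency allows Vdd to be halved as well; the function and variable names are illustrative.

```python
def dynamic_power(activity, cap, vdd, freq):
    """Dynamic switching power, P = a * C * V^2 * f (normalized units)."""
    return activity * cap * vdd ** 2 * freq


# Baseline: one logic block at full voltage and frequency.
base_power = dynamic_power(1.0, 1.0, 1.0, 1.0)
base_area = 1.0

# Parallel version: two copies, each at half frequency and half voltage.
copies = 2
par_power = copies * dynamic_power(1.0, 1.0, 0.5, 0.5)
par_area = copies * base_area
par_throughput = copies * 0.5

print(par_throughput)        # throughput relative to baseline
print(par_power)             # total power relative to baseline
print(par_power / par_area)  # power density relative to baseline
```

In practice Vdd rarely scales all the way down with frequency, so the 4x power saving is best read as an upper bound.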


Cache Design

• Caccess = R · C · Ccell
• Reducing power
  • Switched capacitance
  • Voltage swing
  • Activity factor
  • Frequency

[Figure: cache array with R rows and C cols, showing row dec, wordline, bitlines, column dec, and sense amp]

[Figure: bar chart of read and write energy (pJoules, 0 to 80) by component: Decoder, Wlines, TBLSA, DBLSA, I/O buses, Other. TBLSA: tag bitlines and sense amp. DBLSA: data bitlines and sense amp. Cache parameters: 16 KB cache, 0.25 μm]

Villa et al, MICRO 2000


Cache Design

• Banked organization
  • Targets switched capacitance
  • Caccess = R · C · Ccell / B
• Dividing bit line
  • Same effect for wordlines
• Reducing voltage swings
  • Sense amplifiers used to detect Vdiff across bitlines
  • Read operation can complete as soon as Vdiff is detected
  • Limiting voltage swing saves a fraction of power
• Pulse word lines
  • Enabling the word line for the time needed to discharge bitcell voltage
  • Designer needs to estimate access time and implement a pulse generator


Architectural-Level Techniques

• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques

[Slide annotations group the techniques into “Prevalent” and “Growing or Imminent”]


Optimal Pipeline Depth

• Increased power and diminishing returns vs. increased throughput
• 5-10 stages, 15-30 FO4

[Figure: plots against number of pipeline stages for single issue and 4-wide issue; Hartstein and Puzak, ACM TACO, Dec. 2004]

Sources: Srinivasan et al, MICRO-35; Hartstein and Puzak, ACM TACO, Dec. 2004


Multi-threading

• Do more useful work per unit time
  • Amortize overhead and leakage
• Switch-on-event MT
  • Switch on cache misses, etc. (Ex: Sun T2000 “throughput computing”)
  • Can even rotate among threads every instruction (Tera/Cray)
• Simultaneous Multithreading/HyperThreading
  • For superscalar – eliminate wasted slots
  • Intel Pentium 4, IBM POWER5, Alpha 21464


Architectural-Level Techniques

• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
  • Limits
• Multi clock domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques

[Slide annotations group the techniques into “Prevalent” and “Growing or Imminent”]


Multi Clock Domain Architecture

• Multiple voltage/clock domains inside the processor
  • Globally-asynchronous locally synchronous (GALS) clock style
• Independent voltage/frequency scaling in each domain
• Synchronizers to ensure inter-domain communication
• Good for domains that are loosely coupled anyway
  • Integer/FP units in CPUs
  • Multiple cores


Multi Clock Domain Architecture

• Advantages
  • Local clock design is not aware of global skew
  • Each domain limited by its local critical path, allowing higher frequencies
  • Different voltage regulators allow for a finer-grain energy control
  • Frequency/voltage of each domain can be tailored to its dynamic requirements
  • Clock power is reduced
• Drawbacks
  • Complexity and penalty of synchronizers
  • Feasibility of multiple voltage regulators


Simple Example of MCD in GPUs

• T is performance

• ED^2 and E are energy efficiency metrics

• All normalized to default case with no MCD

• The higher the leakage, the more DVS pays off


Static Power Dissipation

• Static power: dissipation due to leakage current
• Exponentially dependent on T, Vdd, Vth
• Most important sources of static power: subthreshold leakage and gate leakage
  • We will focus on subthreshold
  • Gate leakage has essentially been ignored
    – New gate insulation materials may solve problem


Thermal Runaway

• The leakage-temperature feedback can lead to a positive feedback loop
  • Temperature increases → leakage increases → temperature increases → leakage increases → …

Source: www.usswisconsin.org
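The feedback loop can be illustrated with a toy fixed-point iteration. This is a sketch only: it assumes a single lumped thermal resistance and an exponential leakage fit, and every parameter value (p_leak0, beta, the temperature limit) is hypothetical rather than taken from the talk.

```python
import math


def settle_temperature(p_dyn, r_th, t_amb=45.0, p_leak0=5.0,
                       beta=0.02, t_ref=45.0, t_limit=150.0, max_iter=1000):
    """Iterate T = T_amb + R_th * (P_dyn + P_leak(T)) until it settles.

    Leakage is modeled as p_leak0 * exp(beta * (T - t_ref)), an assumed
    exponential fit. Returns the settled temperature in deg C, or None if
    the temperature passes t_limit (thermal runaway).
    """
    temp = t_amb
    for _ in range(max_iter):
        p_leak = p_leak0 * math.exp(beta * (temp - t_ref))
        new_temp = t_amb + r_th * (p_dyn + p_leak)
        if new_temp > t_limit:
            return None
        if abs(new_temp - temp) < 1e-6:
            return new_temp
        temp = new_temp
    return temp


print(settle_temperature(p_dyn=40.0, r_th=0.5))  # good package: settles
print(settle_temperature(p_dyn=40.0, r_th=1.5))  # poor package: None (runaway)
```

With the lower thermal resistance each round of extra leakage adds less heat than the previous one, so the loop converges; with the higher resistance the same chip never settles below the limit.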


A Smorgasbord

• Transistor sizing
• Multi Vth
• Dynamic threshold voltage – reverse body bias – Transmeta Efficeon
  • Transmeta uses runtime compilation and load monitoring to select thresholds
• Stack effect
• Sleep transistors
• DVS
  • Coarse or fine grained
• Low leakage caches, register files, queues
• Hurry up and wait
  • Low leakage: maintain min possible V, f
  • High leakage: use high V/f to finish work quickly, then go to sleep


Leakage Control

[Figure: three leakage-control techniques: Body Bias (labels Vdd, Vbp, Vbn, +Ve, -Ve), Stack Effect (equal loading), and Sleep Transistor; annotated “2-10X Reduction”]


Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004


Sleep Transistors

• Recent work suggests that a properly sized, low-Vth footer transistor can preserve enough leakage to keep the cell active (Li et al, PACT’02; Agarwal et al, DAC’02)
  • Great care must be taken when switching back to full voltage: noise can flip bits
  • Extra latency may be necessary when re-activating
• Similar to principles in sub-threshold computing
  • Ex – sensor motes for wireless sensor networks
  • Concerns about susceptibility to SEU



Low-Leakage Caches

• Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
  • Uses sleep transistor on Vdd/ground for each cache line
  • Typically considered non-state-preserving, but recent work (Agarwal et al, DAC’02) suggests that gated-Vss may preserve state
  • Many algorithms for determining when to gate
    – May want to make decay interval temperature-dependent
  • Simplest (Kaxiras et al, ISCA-28): Two-bit access counter and decay interval
  • Workload-adaptive decay intervals - hard
• Drowsy cache (Flautner et al, ISCA-29)
  • Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
  • State preserving, but requires an extra cycle to wake up – two extra cycles if tags are decayed
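The “two-bit access counter and decay interval” policy can be sketched as below. This is an illustrative model in the spirit of cache decay, not the published implementation: the class and method names are made up, and it assumes a coarse global tick that fires a few times per decay interval.

```python
class DecayLine:
    """One cache line with a two-bit saturating decay counter."""

    def __init__(self):
        self.counter = 0      # counts global ticks since the last access (0..3)
        self.powered = True   # False once the line has been gated off

    def access(self):
        """A hit or fill powers the line and restarts its decay countdown."""
        self.powered = True
        self.counter = 0

    def global_tick(self):
        """Called once per coarse global tick (a fraction of the decay interval)."""
        if not self.powered:
            return
        if self.counter == 3:
            self.powered = False   # idle for the whole interval: gate the supply
        else:
            self.counter += 1


line = DecayLine()
for _ in range(3):
    line.global_tick()
line.access()              # an access resets the countdown
for _ in range(4):
    line.global_tick()
print(line.powered)        # the line has now decayed
```

Keeping only two bits per line is what makes the scheme cheap; the length of the decay interval is set by how often the global tick fires.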


Thermal Issues - Outline

• Arguments for dynamic thermal management
  • Factors to consider, such as reliability
• Brief discussion of DTM techniques
• Sensing


Worst-Case Leads to Over-Design

• Average case temperature lower than worst-case
  • Aggressive clock gating
  • Application variations
  • Underutilized resources, e.g. FP units during integer code, vertex units during fill-bound region
  • Currently 20-40% difference

Source: Gunther et al, ITJ 2001

[Figure labels: Reduced target power density; Reduced cooling cost; TDP]


Temporal, Spatial Variations

[Figure: temperature variation of SPEC applu over time]

Localized hot spots dictate cooling solution


Application Variations

• Wide variation across applications

• Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT)

[Figure: two bar charts of temperature (Kelvin, 370 to 420) for gzip, mcf, swim, mgrid, applu, eon, and mesa, one labeled ST and one labeled SMT]


Temperature-Aware Design

• Worst-case design is wasteful

• Power management is not sufficient for chip-level thermal management

• Must target blocks with high power density

• When they are hot

• Spreading heat helps

– Even if energy not affected

– Even if average temperature goes up

• This also helps reduce leakage


Dynamic Thermal Management

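[Figure: temperature vs. time with DTM disabled and with DTM/response engaged. Marked levels: designed-for cooling capacity without DTM, DTM trigger level, and designed-for cooling capacity with DTM; the difference between the two design points is labeled “System Cost Savings”]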

Source: David Brooks 2002


DTM

• Worst case design for the external cooling solution is wasteful
  • Yet safe temperatures must be maintained when worst case happens
• Thermal monitors allow
  • Tradeoff between cost and performance
  • Cheaper package
    – More triggers, less performance
  • Expensive package
    – No triggers, full performance
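The cost/performance tradeoff can be explored with a toy simulation of a threshold-triggered throttle. This is a sketch under stated assumptions: one lumped thermal RC node, an idealized sensor, and hypothetical power and package values; it does not model any shipping DTM mechanism.

```python
def simulate_dtm(trigger, steps=20000, dt=1e-4, r_th=0.8, c_th=0.05,
                 t_amb=45.0, p_full=60.0, p_throttled=30.0):
    """Simulate one thermal RC node with a threshold-triggered throttle.

    Returns (peak temperature in deg C, fraction of steps spent throttled).
    """
    temp = t_amb
    peak = temp
    throttled_steps = 0
    for _ in range(steps):
        throttled = temp >= trigger
        power = p_throttled if throttled else p_full
        throttled_steps += throttled
        # Forward-Euler step of P = C * dT/dt + (T - T_amb) / R
        temp += dt * (power - (temp - t_amb) / r_th) / c_th
        peak = max(peak, temp)
    return peak, throttled_steps / steps


print(simulate_dtm(trigger=85.0))    # throttles part of the time, peak held near 85
print(simulate_dtm(trigger=100.0))   # never triggers, peak approaches 93
```

Lowering the trigger (a cheaper package) caps the peak temperature at the price of more time spent throttled, which is the tradeoff the slide describes.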


Role of Architecture?

Dynamic thermal management (DTM)

• Automatic hardware response when temp. exceeds cooling
• Cut power density at runtime, on demand
• Trade reduced costs for occasional performance loss
• Architecture is the natural granularity for thermal management
  • Activity, temperature correlated within arch. units
  • DTM response can target hottest unit: permits fine-tuned response compared to OS or package
  • Modern architectures offer rich opportunities for remapping computation
    – e.g., CMPs/SoCs, graphics processors, tiled architectures
    – e.g., register file
• DTM will intermittently affect performance


Existing DTM Implementations

• Intel Pentium 4: Global clock gating with shut-down fail-safe
• Intel Pentium M: Dynamic voltage scaling
• Transmeta Crusoe: Dynamic voltage scaling
• IBM Power 5: Probably fetch gating
• ACPI: OS configurable combination of passive & active cooling
• These solutions sacrifice time (slower or stalled execution) to reduce power density
• Better: a solution in “space”
  • Tradeoff between exacerbating leakage (more idle logic) or reducing leakage (lower temperatures)


Alternative: Migrating Computation

This is only a simplistic illustrative example


Space vs. Time

• Moving the hotspot, rather than throttling it, reduces performance overhead by almost 60%

[Figure: bar chart of slowdown factor (1.00 to 1.40) for DVS, FG, Hyb, and MC, with bar values 1.270, 1.359, 1.231, and 1.112 (listed in extraction order); bars are grouped as “Time” vs. “Space”]

The greater the replication and spread, the greater the opportunities


Granularity of DTM

• Subunit (single queue entry, register, etc.)
  • Lots of replication, low migration cost, but not spread out
• Structure (queue, register file, ALU, etc.)
  • Yuck: copy stalls required, hard to avoid throttling
• Core
  • Lots of replication, good spread, but high migration cost, and local hotspots remain
    – But, if threads are short, scheduling can achieve thermal load balancing without migration

The greater the replication and spread, the greater the opportunities
The shorter the threads, the more flexibility


Thermal Consequences

Temperature affects:

• Circuit performance

• Circuit power (leakage)

• IC reliability

• IC and system packaging cost

• Environment


Performance and Leakage

Temperature affects:

• Transistor threshold and mobility
• Subthreshold leakage, gate leakage
• Ion, Ioff, Igate, delay
• ITRS: 85°C for high-performance, 110°C for embedded!

[Figure labels: NMOS; Ion; Ioff]


Reliability

The Arrhenius Equation: MTF = A · exp(Ea / (K · T))

MTF: mean time to failure at T
A: empirical constant
Ea: activation energy
K: Boltzmann’s constant
T: absolute temperature

Failure mechanisms:

• Electromigration

• Dielectric breakdown

• Mechanical stress

• Negative bias temperature instability (NBTI)
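To get a feel for the exponential temperature dependence in the Arrhenius equation, the ratio of two MTF values can be computed directly, since the empirical constant A cancels. A minimal sketch: the activation energy of 0.7 eV and the two temperatures are assumed example values, not figures from the talk.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K


def mtf_ratio(t_cool_c, t_hot_c, ea_ev=0.7):
    """Return MTF(t_cool) / MTF(t_hot) from MTF = A * exp(Ea / (K * T)).

    The empirical constant A cancels in the ratio. Temperatures are given
    in deg C and converted to kelvin; ea_ev is an assumed activation energy.
    """
    t_cool = t_cool_c + 273.15
    t_hot = t_hot_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_cool - 1.0 / t_hot))


# How much longer is the expected lifetime at 85 C than at 95 C?
print(mtf_ratio(85.0, 95.0))
```

With these assumed values a 10-degree reduction roughly doubles the mean time to failure, which is why modest temperature reductions matter for reliability.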


Reliability as f(T)

• Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions
• But actual behavior is often not worst case
• So aging occurs more slowly
• This means the DTM design is over-engineered!
• We can exploit this, e.g. for DTM or frequency

[Figure labels: Bank; Spend]


Reliability-Aware DTM

[Figure: bar chart of average slowdown (0.00 to 0.16) for DTM_controller vs. DTM_reliability across three configurations: Base_Configure, High_Convection_Res..., and Thick_Spread_Material]


Heat Mechanisms

• Conduction is the main mechanism in a single chip
  • Conduction is proportional to the temperature difference and surface area
• Convection is the main mechanism in racks, data centers, etc.


Simplistic steady-state model

All thermal transfer: R = k/A

Power density matters!

Ohm’s law for thermals (steady-state)

V = I · R -> T = P · R

T_hot = P · Rth + T_amb

Ways to reduce T_hot:

- reduce P (power-aware)
- reduce Rth (packaging, spread heat)
- reduce T_amb (Alaska?)
- maybe also take advantage of transients (Cth)

[Figure labels: T_hot; T_amb]
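The steady-state relation T_hot = P · Rth + T_amb and the three ways of lowering T_hot can be tried out directly. A minimal sketch; the power, resistance, and ambient values are hypothetical.

```python
def hotspot_temperature(power_w, r_th, t_amb_c):
    """Steady-state T_hot = P * Rth + T_amb (deg C, with Rth in deg C per watt)."""
    return power_w * r_th + t_amb_c


print(hotspot_temperature(50.0, 0.8, 45.0))   # baseline
print(hotspot_temperature(40.0, 0.8, 45.0))   # reduce P
print(hotspot_temperature(50.0, 0.6, 45.0))   # reduce Rth
print(hotspot_temperature(50.0, 0.8, 35.0))   # reduce T_amb
```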


Simplistic dynamic thermal model

Electrical-thermal duality

V  ↔  temp (T)
I  ↔  power (P)
R  ↔  thermal resistance (Rth)
C  ↔  thermal capacitance (Cth)
RC ↔  time constant

KCL

differential eq.  I = C · dV/dt + V/R
difference eq.    ΔV = (I/C) · Δt − (V/(R·C)) · Δt
thermal domain    ΔT = (P/Cth) · Δt − (T/(Rth·Cth)) · Δt

(T = T_hot – T_amb)

One can compute stepwise changes in temperature for any granularity at which one can get P, T, R, C

[Figure labels: T_hot; T_amb]
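The stepwise computation can be sketched as a forward-Euler update of the thermal-domain equation P = Cth · dT/dt + T/Rth, with T measured as the rise over ambient. The Rth, Cth, power, and time-step values below are assumptions chosen only for illustration.

```python
def step_temperature(t_rise, power, r_th, c_th, dt):
    """Advance the temperature rise T = T_hot - T_amb by one time step dt.

    Forward-Euler form of P = Cth * dT/dt + T / Rth:
        dT = (P / Cth) * dt - (T / (Rth * Cth)) * dt
    """
    return t_rise + (power / c_th) * dt - (t_rise / (r_th * c_th)) * dt


r_th, c_th = 0.8, 0.05       # assumed values: deg C/W and J/deg C
dt = 1e-4                    # seconds, much shorter than Rth * Cth = 0.04 s
t_rise = 0.0
for _ in range(5000):        # 0.5 s of constant 50 W
    t_rise = step_temperature(t_rise, 50.0, r_th, c_th, dt)
print(t_rise)                # approaches the steady-state rise P * Rth
```

The time step must stay well below the RC time constant for the explicit update to remain stable and accurate.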


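Thermal resistance

• Θ = ρ · t / A = t / (k · A), where ρ is thermal resistivity, k = 1/ρ is thermal conductivity, t is thickness, and A is cross-sectional area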


Thermal capacitance

• Cth = V · Cp · ρ

ρ(Aluminum) = 2,710 kg/m³
Cp(Aluminum) = 875 J/(kg·°C)
V = t · A = 0.000025 m³
Cbulk = V · Cp · ρ = 59.28 J/°C
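The aluminum example can be reproduced in a few lines, alongside the thermal resistance formula Θ = t / (k · A). The capacitance inputs are the slide's values; the slab dimensions and the conductivity of 237 W/(m·K) used for the resistance call are assumed example values.

```python
def thermal_capacitance(volume_m3, specific_heat, density):
    """Cth = V * Cp * rho, in J/deg C."""
    return volume_m3 * specific_heat * density


def thermal_resistance(thickness_m, conductivity, area_m2):
    """Theta = t / (k * A), in deg C/W."""
    return thickness_m / (conductivity * area_m2)


c_bulk = thermal_capacitance(0.000025, 875.0, 2710.0)
print(round(c_bulk, 2))   # 59.28

# Hypothetical 1 mm thick aluminum slab with 1 cm^2 cross-section.
print(thermal_resistance(0.001, 237.0, 1e-4))
```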


Thermal issues summary

• Temperature affects performance, power, and reliability
• Architecture-level: conduction only
  • Very crude approximation of convection as equivalent resistance
  • Convection: too complicated
    – Need CFD!
  • Radiation: can be ignored
• Use compact models for package
• Power density is key
• Temporal, spatial variation are key
• Hot spots drive thermal design


Thermal modeling

• Want a fine-grained, dynamic model of temperature
  • At a granularity architects can reason about
  • That accounts for adjacency and package
  • That does not require detailed designs
  • That is fast enough for practical use
• HotSpot - a compact model based on thermal R, C
  • Parameterized to automatically derive a model based on various
    – Architectures
    – Power models
    – Floorplans
    – Thermal Packages


Dynamic compact thermal model

Electrical-thermal duality

V  ↔  temp (T)
I  ↔  power (P)
R  ↔  thermal resistance (Rth)
C  ↔  thermal capacitance (Cth)
RC ↔  time constant (Rth Cth)

Kirchhoff Current Law

differential eq.  I = C · dV/dt + V/R
thermal domain    P = Cth · dT/dt + T/Rth
where T = T_hot – T_amb

At higher granularities of P, Rth, Cth: P, T are vectors and Rth, Cth are circuit matrices

[Figure labels: T_hot; T_amb]


Example System

[Figure: cross-section of an example system with labels Heat sink, Heat spreader, Interface material, Die, IC Package, Pin, and PCB]


Surface-to-surface contacts

• Not negligible, heat crowding

• Thermal greases/epoxy (can “pump-out”)

• Phase Change Films (undergo a transition from solid to semi-solid with the application of heat)

Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001


Our Model (lateral and vertical)

Interface material (not shown)


HotSpot

• Time evolution of temperature is driven by unit activities and power dissipations averaged over 10K cycles
  • Power dissipations can come from any power simulator, act as “current sources” in RC circuit ('P' vector in the equations)
  • Simulation overhead in Wattch/SimpleScalar: < 1%
• Requires models of
  • Floorplan: important for adjacency
  • Package: important for spreading and time constants
  • R and C matrices are derived from the above


Validation

• Validated and calibrated using FEM simulations, FPGA measurements, and MICRED test chips
  • 9x9 array of power dissipators and sensors
  • Compared to HotSpot configured with same grid, package
• Within 7% for both steady-state and transient step-response
• Interface material (chip/spreader) matters


Sensors

Caveat emptor:

We are not well-versed on sensor design; the following is a digest of information we have been able to collect from industry sources and the research literature.


Desirable Sensor Characteristics

• Small area

• Low Power

• High Accuracy + Linearity

• Easy access and low access time

• Fast response time (slew rate)

• Easy calibration

• Low sensitivity to process and supply noise


Types of Sensors
(In approx. order of increasing ease to build)

• Thermocouples – voltage output
  • Junction between wires of different materials; voltage at terminals is ∝ (Tref – Tjunction)
  • Often used for external measurements
• Thermal diodes – voltage output
  • Biased p-n junction; voltage drop for a known current is temperature-dependent
• Biased resistors (thermistors) – voltage output
  • Voltage drop for a known current is temperature dependent
    – You can also think of this as varying R
  • Example: 1 kΩ metal “snake”
• BiCMOS, CMOS – voltage or current output
  • Rely on reference voltage or current generated from a reference band-gap circuit; current-based designs often depend on temp-dependence of threshold
• 4T RAM cell – decay time is temp-dependent
  • [Kaxiras et al, ISLPED’04]


Sensors: Problem Issues

• Poor control of CMOS transistor parameters

• Noisy environment
  • Cross talk
  • Ground noise
  • Power supply noise
• These can be reduced by making the sensor larger
  • This increases power dissipation
  • But we may want many sensors


“Reasonable” Values

• Based on conversations with engineers at Sun, Intel, and HP (Alpha)

• Linearity: not a problem for range of temperatures of interest

• Slew rate: < 1 μs
  • This is the time it takes for the physical sensing process (e.g., current) to reach equilibrium
• Sensor bandwidth: << 1 MHz, probably 100-200 kHz
  • This is the sampling rate; 100 kHz = 10 μs
  • Limited by slew rate but also A/D
    – Consider digitization using a counter


“Reasonable” Values: Precision

• Mid 1980s: < 0.1° was possible
• Precision
  • ± 3° is very reasonable
  • ± 2° is reasonable
  • ± 1° is feasible but expensive
  • < ± 1° is really hard
• The limited precision of the G3 sensor seems to have been a design choice involving the digitization

P: 10s of mW


Calibration

• Accuracy vs. Precision
  • Analogous to mean vs. stdev
• Calibration deals with accuracy
  • The main issue is to reduce inter-die variations in offset
  • Typically requires per-part testing and configuration
  • Basic idea: measure offset, store it, then subtract this from dynamic measurements


Dynamic Offset Cancellation

• Rich area of research

• Build circuit to continuously, dynamically detect offset and cancel it

• Typically uses an op-amp

• Has the advantage that it adapts to changing offsets

• Has the disadvantage of more complex circuitry


Role of Precision

• Suppose:
  • Junction temperature is J
  • Max variation in sensor is S, offset is O
  • Thermal emergency is T
  • T = J – S – O
• Spatial gradients
  • If sensors cannot be located exactly at hotspots, measured temperature may be G° lower than true hotspot
  • T = J – S – O – G
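The effect of sensor error on the usable trigger temperature is simple arithmetic. A minimal sketch of T = J – S – O – G, with made-up values for the junction limit, sensor variation, offset, and spatial gradient.

```python
def trigger_threshold(junction_limit, sensor_variation, offset, gradient=0.0):
    """DTM trigger temperature: T = J - S - O - G (all in deg C)."""
    return junction_limit - sensor_variation - offset - gradient


print(trigger_threshold(85.0, 2.0, 1.0))        # 82.0
print(trigger_threshold(85.0, 2.0, 1.0, 3.0))   # 79.0
```

Every degree of uncertainty lowers the trigger point by a degree, so DTM engages earlier and costs more performance.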


Rate of Change of Temperature

• Our FEM simulations suggest maximum 0.1° in about 25-100 μs

• This is for power density < 1 W/mm², die thickness between 0.2 and 0.7 mm, and contemporary packaging

• This means slew rate is not an issue

• But sampling rate is!


Sensors Summary

• Sensor precision cannot be ignored
  • Reducing operating threshold by 1-2 degrees will affect performance
• Precision of 1° is conceivable but expensive
  • Maybe reasonable for a single sensor or a few
• Precision of 2-3° is reasonable even for a moderate number of sensors

• Power and area are probably negligible from the architecture standpoint

• Sampling period <= 10-20 μs


Massive Multi-Core Design Space

• # cores

• Pipeline depth

• Pipeline width

• In-order vs. out-of-order

• Cache per core

• Core-to-core interconnect fabric

• All dependent on temperature constraints!


Whither Core Type?

vs.

Source: Christopher Reeve Homepage, http://www.chrisreevehomepage.com/

Cores may also be heterogeneous, with a few powerful cores and very many small cores

Hot spot?


Impact of Thermal Constraints

[Figure: BIPS (0 to 12) vs. core number (2 to 20) for configurations 2MB/18FO4/2 and 2MB/18FO4/4]

Thermal limits change the optimal pipeline width as core count increases


Impact of Thermal Constraints

[Figure: two plots of BIPS vs. core number (2 to 20), labeled “Without thermal constraints” and “With thermal constraints”. First plot (BIPS 0 to 45): 4MB/12FO4/4, 4MB/18FO4/4, 4MB/24FO4/4, 4MB/30FO4/4. Second plot (BIPS 0 to 12): 2MB/12FO4/4, 2MB/18FO4/4, 2MB/24FO4/4, 2MB/30FO4/4]

Thermal limits favor shallower pipelines

Pipeline depth, which is often fixed early in the design, can impact the multi-core performance dramatically


Workload Sensitivity

[Figure: two plots of BIPS vs. core number (2 to 20), each for a 400 mm² die with a cheap thermal package. CPU bound (BIPS 0 to 12): 2MB/12FO4/4, 2MB/18FO4/4, 2MB/24FO4/4, 2MB/30FO4/4, 2MB/18FO4/2, 2MB/18FO4/4, 8MB/18FO4/2, 8MB/18FO4/4. Memory bound (BIPS 0 to 5): 8MB/12FO4/4, 8MB/18FO4/4, 8MB/24FO4/4, 8MB/30FO4/4, 8MB/18FO4/2, 8MB/18FO4/4, 16MB/18FO4/2, 16MB/18FO4/4]

CPU- and memory-bound applications desire different resources

26-53% performance loss if the best configurations are switched!


Summary

• Reviewed current techniques for managing dynamic power, leakage power, temperature
• A major obstacle with architectural techniques is the difficulty of predicting performance impact
• Spread heat in space, not time
• Continuing integration makes power and thermal constraints even more important
• Optimal multi-core design is dependent on thermal considerations
• Security challenges


Backup Slides

• These slides are an assortment that wouldn’t fit in the talk but I kept to answer questions or provide more info


Hot Chips are No Longer Cool!

[Figure: power density (Watts/cm², log scale, 1 to 1000) for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle; additional labels: “Today’s laptops:” and “SIA”]

* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference keynote, 1999.


ITRS quotes – thermal challenges

• For small dies with high pad count, high power density, or high frequency, “operating temperature, etc for these devices exceed the capabilities of current assembly and packaging technology.”
• “Thermal envelopes imposed by affordable packaging discourage very deep pipelining.”
• Intel recently canceled its NetBurst microarchitecture
  – Press reports suggest thermal envelopes were a factor


Dynamic Power Consumption

• Power dissipated due to switching activity
• A capacitance is charged and discharged

Charge/discharge at the frequency f: P = a · CL · V² · f

Ec = ½ · CL · V²
Ed = ½ · CL · V²

[Figure label: Vdd]


Transistor Sizing

• Transistor sizing plays an important role in reducing power
• Delay ~ K / ln K
• Power ~ K / (K-1)
• Optimum K for both power and delay must be pursued

[Figure: chain of stages with capacitances C0, C1, …, CN-1, CN, where K = Ci/Ci-1]
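The tension between the two proportionalities can be seen by scanning the stage ratio K. A small sketch using the relations Delay ~ K / ln K and Power ~ K / (K-1) as relative measures; the scan range and step are arbitrary choices.

```python
import math


def relative_delay(k):
    """Delay of the sized chain, proportional to K / ln K."""
    return k / math.log(k)


def relative_power(k):
    """Switched capacitance of the chain, proportional to K / (K - 1)."""
    return k / (k - 1.0)


# Scan stage ratios from 1.5 to 10.0 in steps of 0.01.
candidates = [1.5 + 0.01 * i for i in range(851)]
best_k = min(candidates, key=relative_delay)
print(best_k)   # delay alone is minimized near K = e

for k in (2.0, best_k, 4.0, 8.0):
    print(f"K={k:.2f}  delay={relative_delay(k):.3f}  power={relative_power(k):.3f}")
```

Delay is flat around its minimum while power keeps falling as K grows, so a ratio somewhat larger than the delay-optimal one gives up little speed for a noticeable power saving.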


Signal Gating

“techniques to mask unwanted switching activities from propagating forward, causing unnecessary power dissipation”

• Implementation
  • Simple gate
  • Tristate buffer
  • ...
• Control signal needed
  • Generation requires additional logic
• Especially helps to prevent power dissipation due to glitches

[Figure: gate with signal and ctrl inputs producing Output]

Different Implementation and Corresponding Clock Gating Choices

Latch-Mux Design

SRAM Design


DVS “Critical Power Slope”

• It may be more efficient not to use DVS, and to run at the highest possible frequency, then go into a sleep mode!
  • Depends on power dissipation in sleep mode vs. power dissipation at lowest voltage
• This has been formalized as the critical power slope (Miyoshi et al, ICS’02):
  • m_critical = (P_fmin – P_idle) / f_min
  • If the actual slope m = (P_f – P_fmin) / (f – f_min) < m_critical, then it is more energy efficient to run at the highest frequency, then go to sleep
• Switching overheads must be taken into account
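The critical power slope test can be written out and cross-checked against a direct energy comparison. A sketch with hypothetical operating points (watts and GHz); it ignores switching overheads, which the slide notes must be accounted for in practice.

```python
def critical_slope(p_fmin, p_idle, f_min):
    """m_critical = (P_fmin - P_idle) / f_min."""
    return (p_fmin - p_idle) / f_min


def actual_slope(p_f, p_fmin, f, f_min):
    """m = (P_f - P_fmin) / (f - f_min)."""
    return (p_f - p_fmin) / (f - f_min)


def race_to_sleep_wins(p_f, p_fmin, p_idle, f, f_min):
    """True when m < m_critical: run at frequency f, then sleep."""
    return actual_slope(p_f, p_fmin, f, f_min) < critical_slope(p_fmin, p_idle, f_min)


def energy_per_period(p_f, p_idle, f, f_min, period=1.0):
    """Energy to do f_min * period cycles of work at frequency f, idling afterwards."""
    busy = period * f_min / f
    return p_f * busy + p_idle * (period - busy)


# Low idle power: finishing early and sleeping saves energy.
print(race_to_sleep_wins(p_f=9.0, p_fmin=6.0, p_idle=0.5, f=1.0, f_min=0.6))
# High idle power: staying at the lowest frequency is better.
print(race_to_sleep_wins(p_f=12.0, p_fmin=6.0, p_idle=4.5, f=1.0, f_min=0.6))
```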


Application-Specific Hardware

• Specialized logic is usually much lower power

• Co-processors

• Ex: TCP/IP offload, codecs, etc.

• Functional units

• Ex: Intel SSE, specialized arithmetic (e.g., graphics), etc.

• Ex: Custom instructions in configurable cores (e.g., Tensilica)

• Specific example: Zoran ER4525 – cell phone

• ARM microcontroller, no DSP!

• Video capture & pre/post processing

• Video codec

• 2D/3D rendering

• Video display

• Security


Gate Leakage

• Not clear if new oxide materials will arrive in time

• Any technique that reduces Vdd helps

• Otherwise it seems difficult to develop architecture techniques that directly attack gate leakage

• In fact, very little work has been done in this area

• One example: domino gates (Hamzaoglu & Stan, ISLPED’02)

• Replace traditional NMOS pull-down network with a PMOS pull-up network

• Gate leakage is greater in NMOS than PMOS

• But PMOS domino gate is slower

• Note: the gate oxide is so thin that it is especially prone to manufacturing variations


Static Power - Modeling

• Modeling Leakage

• Butts and Sohi (MICRO-33)

– $P_{static} = V_{cc} \cdot N \cdot k_{design} \cdot \hat{I}_{leak}$

– $\hat{I}_{leak}$ determined by circuit simulation, $k_{design}$ empirically

– Key contribution: separate technology from design

• HotLeakage (UVA TR CS-2003-05, DATE’04)

– Extension of Butts & Sohi approach: scalable with Vdd, Vth, Temp, and technology node; adds gate leakage

– $\hat{I}_{leak}$ determined by BSIM3 subthreshold equation and BSIM4 gate-leakage equations, giving an analytical expression that accounts for dependence on factors that may change at runtime, namely Vdd, Vth, and Temp

– $k_{design}$ replaced by separate factors for N- and P-type transistors

– $k_{design}$ also exponentially dependent on Vdd and Tox, linearly dependent on Temp

– Currently integrated with SimpleScalar/Wattch for caches
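To make the Butts and Sohi expression concrete, the sketch below evaluates the product of supply voltage, device count, design factor, and per-device leakage current for one structure. Every number is a hypothetical placeholder, not a value from the paper or from HotLeakage, and the function name is illustrative.

```python
def static_power(vcc, n_transistors, k_design, i_leak):
    """Product form of the model: Vcc * N * k_design * I_leak, in watts."""
    return vcc * n_transistors * k_design * i_leak


# Hypothetical inputs for one structure (e.g., a cache array).
VCC = 1.2          # supply voltage, volts
N = 1_000_000      # number of transistors
K_DESIGN = 0.1     # assumed empirical design factor
I_LEAK = 1e-7      # assumed per-device leakage current, amperes

p_cache = static_power(VCC, N, K_DESIGN, I_LEAK)
print(f"estimated static power: {p_cache * 1e3:.1f} mW")
```

Because technology enters only through the leakage current term and design only through the design factor and device count, either can be swapped without touching the other.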


Static Power – Modeling

• Modeling Leakage (cont.)

• Su et al., IBM (ISLPED’03)

– Similar approach to HotLeakage, but they observe that modeling the change in leakage allows linearization of the equations

• Many other papers on various aspects of modeling leakage

– Most focus on subthreshold

– Few suggest how to model leakage in microarchitecture simulations


Performance Comparison

• TT-DFS is best but can’t prevent excess temperature

• Suitable for use with aggressive clock rates at low temp.

• Hybrid technique reduces DTM cost by 25% vs. DVS (DVS overhead important)

• A substantial portion of MC’s benefit comes from the altered floorplan, which separates hot units

[Figure: slowdown factor (y-axis, 1.00 to 1.40) for TTDFS, DVS, FG, Hyb, and MC; data labels: 1.045, 1.270, 1.359, 1.231, 1.112]


EM Model

$$\int_0^{t_{failure}} \frac{1}{kT(t)}\, e^{-\frac{E_a}{kT(t)}}\, dt = \varphi_{th}, \qquad \varphi_{th} = \text{const}$$

Life Consumption Rate:

$$R(t) = \frac{1}{kT(t)}\, e^{-\frac{E_a}{kT(t)}}$$

Apply in a “lumped” fashion at the granularity of microarchitecture units, just like RAMP [Srinivasan et al.]
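A small sketch of how a life consumption rate of the form (1 / kT) · exp(-E_a / kT) could be accumulated over a temperature trace for one unit. The activation energy, the trace, and the function names are assumptions made for illustration; this is not the RAMP code or the authors' implementation.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def life_consumption_rate(temp_kelvin, activation_energy_ev):
    """R = (1 / kT) * exp(-E_a / kT) for an absolute temperature."""
    kt = BOLTZMANN_EV_PER_K * temp_kelvin
    return math.exp(-activation_energy_ev / kt) / kt


def consumed_life(temps_kelvin, dt, activation_energy_ev):
    """Rectangle-rule integral of R over a uniformly sampled trace."""
    return sum(life_consumption_rate(t, activation_energy_ev) * dt
               for t in temps_kelvin)


E_A = 0.9  # hypothetical activation energy, eV

cool = consumed_life([348.15] * 100, 1.0, E_A)  # 100 s at 75 C
hot = consumed_life([358.15] * 100, 1.0, E_A)   # 100 s at 85 C
print(f"life consumed at 85 C relative to 75 C: {hot / cool:.2f}x")
```

Failure is predicted once the accumulated integral reaches the constant threshold, so time spent at high temperature uses up lifetime faster than the same time spent cooler.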


Carnot efficiency

• Note that in all cases, heat transfer is proportional to ΔT

• This is also one of the reasons energy “harvesting” in computers is probably not cost-effective

• ΔT w.r.t. ambient is << 100°

• For example, with a 25W processor, thermoelectric effect yields only ~50mW

• Solbrekken et al., ITHERM’04

• This is also why Peltier coolers are not energy efficient

• 10% eff., vs. 30% for a refrigerator
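The Carnot limit, 1 - T_cold / T_hot with absolute temperatures, gives an upper bound on what any heat engine could recover across a small ΔT. The temperatures below are hypothetical, and real thermoelectric devices achieve far less than this bound, which is consistent with the ~50mW figure above.

```python
def carnot_efficiency(t_hot_kelvin, t_cold_kelvin):
    """Upper bound on heat-to-work conversion: 1 - T_cold / T_hot."""
    return 1.0 - t_cold_kelvin / t_hot_kelvin


# Hypothetical temperatures: 85 C at the die, 45 C inside the case.
T_HOT = 85.0 + 273.15
T_COLD = 45.0 + 273.15

eta = carnot_efficiency(T_HOT, T_COLD)
print(f"Carnot bound for a {T_HOT - T_COLD:.0f} K difference: {eta:.1%}")
print(f"ideal upper bound on recovery from 25 W: {25.0 * eta:.2f} W")
```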


Thermal Modeling

• Want a fine-grained, dynamic model of temperature

• At a granularity architects can reason about

• That accounts for adjacency and package

• That does not require detailed designs

• That is fast enough for practical use

HotSpot - a compact model based on thermal R, C

• Parameterized to automatically derive a model based on various…

– Architectures

– Power models

– Floorplans

– Thermal Packages


Temperature equations

• Fundamental RC differential equation

• $P = C\,\frac{dT}{dt} + \frac{T}{R}$

• Steady state

• $\frac{dT}{dt} = 0$

• $P = T / R$

• When R and C are network matrices

• Steady state: $T = R \times P$

• Modified transient equation

– $\frac{dT}{dt} + (RC)^{-1} \times T = C^{-1} \times P$

• HotSpot software mainly solves these two equations
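The two cases can be sketched in a few lines. The code below is not HotSpot: it multiplies a small, hypothetical thermal resistance matrix by a power vector for the steady state, and uses the closed-form solution of the single-node RC equation for the transient. Temperatures are rises above ambient, and all values and names are assumptions.

```python
import math


def steady_state(r_matrix, power):
    """T = R x P for a resistance matrix (K/W) and a power vector (W)."""
    return [sum(r * p for r, p in zip(row, power)) for row in r_matrix]


def single_node_transient(t, r, c, power, t0=0.0):
    """Closed-form solution of P = C dT/dt + T/R for constant P."""
    t_final = r * power
    return t_final + (t0 - t_final) * math.exp(-t / (r * c))


# Hypothetical two-block network: the off-diagonal terms model coupling.
R = [[0.8, 0.2],
     [0.2, 0.6]]
P = [10.0, 5.0]
print("steady-state rise (K):", steady_state(R, P))

# Hypothetical single node: R = 0.8 K/W, C = 0.05 J/K, so RC = 40 ms.
for t in (0.0, 0.04, 0.2):
    rise = single_node_transient(t, 0.8, 0.05, 10.0)
    print(f"t={t:.2f} s  rise={rise:.2f} K")
```

After one time constant the node has covered about 63% of the way to its steady-state value R · P.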


Our Model (Lateral and Vertical)

Interface material (not shown)

Derived from material and geometric properties


Transient solution

• Solves differential equations of the form $dT/dt + AT = B$ where A and B are constants

• In HotSpot, A is constant, $(RC)^{-1}$, but B depends on the power dissipation

• Solution – assume constant average power dissipation within an interval (10K cycles) and call RK4 at the end of each interval

• In RK4, the current temperature (at t) is advanced in very small steps (t+h, t+2h ...) until the next interval (10K cycles)

• RK “4” because the method is 4th order, i.e., the accumulated error is O(h^4)
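A scalar sketch of the stepping scheme described above: classical RK4 applied to dT/dt = B - A·T with B held constant across one interval. This is an illustration under assumed values, not the HotSpot solver, which works on the full RC network. The closed-form solution is included only to check the error.

```python
import math


def rk4_step(temp, h, a, b):
    """One classical RK4 step of size h for dT/dt = b - a * T."""
    def slope(x):
        return b - a * x

    k1 = slope(temp)
    k2 = slope(temp + 0.5 * h * k1)
    k3 = slope(temp + 0.5 * h * k2)
    k4 = slope(temp + h * k3)
    return temp + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def advance_interval(temp, interval, steps, a, b):
    """Advance across one interval in `steps` equal RK4 sub-steps."""
    h = interval / steps
    for _ in range(steps):
        temp = rk4_step(temp, h, a, b)
    return temp


def exact(temp0, t, a, b):
    """Closed-form solution for constant a and b, used as a reference."""
    return b / a + (temp0 - b / a) * math.exp(-a * t)


# Hypothetical constants: a = 1/(RC) = 100 1/s, b = P/C = 5000 K/s.
A, B = 100.0, 5000.0
approx = advance_interval(0.0, 1e-2, 10, A, B)
print(f"RK4: {approx:.6f} K, exact: {exact(0.0, 1e-2, A, B):.6f} K")
```

Halving the step size should cut the accumulated error by roughly a factor of 16, which is what 4th-order accuracy means in practice.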


Transient solution contd...

• 4th order error has to be within the required precision

• The step size (h) has to be small enough even for the maximum slope of the temperature evolution curve

• Transient solution for the differential equation is of the form $Ae^{-Bt}$, where A and B depend on the RC network

• Thus, the maximum value of the slope ($A \times B$) is computed and the step size is chosen accordingly


HotSpot

• Time evolution of temperature is driven by unit activities and power dissipations averaged over 10K cycles

• Power dissipations can come from any power simulator, act as “current sources” in RC circuit

• Simulation overhead in Wattch/SimpleScalar: < 1%

• Requires models of

• Floorplan: important for adjacency

• Package: important for spreading and time constants


Notes

• Note that HotSpot currently measures temperatures in the silicon

• But that’s also what most sensors measure

• Temperature continues to rise through each layer of the die

• Temperature in upper-level metal is considerably higher

• Interconnect model released soon!


HotSpot Summary

• HotSpot is a simple, accurate and fast architecture level thermal model for microprocessors

• Over 850 downloads since June’03

• Ongoing active development – architecture level floorplanning will be available soon

• Download site

• http://lava.cs.virginia.edu/HotSpot

• Mailing list

• www.cs.virginia.edu/mailman/listinfo/hotspot


Hybrid DTM

• DVS is attractive because of its cubic advantage

• $P \propto V^2 f$

• This factor dominates when DTM must be aggressive

• But changing DVS setting can be costly

– Resynchronize PLL

– Sensitive to sensor noise, which causes spurious changes

• Fetch gating is attractive because it can use instruction level parallelism to reduce impact of DTM

• Only effective when DTM is mild

• So use both!
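The "cubic advantage" follows from P being proportional to V²f when frequency is lowered in proportion to voltage. The sketch below assumes that proportional scaling, which is an idealization, and uses hypothetical scale factors.

```python
def relative_power(voltage_scale, freq_scale=None):
    """Power relative to nominal under P ~ V^2 * f.

    If freq_scale is omitted, frequency is assumed to scale with voltage.
    """
    if freq_scale is None:
        freq_scale = voltage_scale
    return voltage_scale ** 2 * freq_scale


for scale in (1.0, 0.85, 0.7, 0.5):
    print(f"V and f scaled to {scale:.2f}: power = {relative_power(scale):.3f} of nominal")

# Lowering frequency alone, as plain frequency scaling does, is only linear.
print(f"f scaled to 0.85 at nominal V: power = {relative_power(1.0, 0.85):.3f} of nominal")
```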


Migrating Computation

• When one unit overheats, migrate its functionality to a distant, spare unit (MC)

• Spare register file (Skadron et al. 2003)

• Separate core (CMP) (Heo et al. 2003)

• Microarchitectural clusters

• etc.

• Raises many interesting issues

• Cost-benefit tradeoff for that area

• Use both resources (scheduling)

• Extra power for long-distance communication

• Floorplanning


Hybrid DTM, cont.

• Combine fetch gating with DVS

• When DVS is better, use it

• Otherwise use fetch gating

• Determined by magnitude of temperature overshoot

• Crossover at FG duty cycle of 3

• FG has low overhead: helps reduce cost of sensor noise

[Figures: slowdown vs. FG duty cycle for FG, DVS, and Hyb. One panel spans duty cycles 20, 5, 2 (slowdown axis 1.0 to 1.3); the other spans duty cycles 0 to 20 (slowdown axis 1.0 to 1.4)]


Hybrid DTM, cont.

• DVS doesn’t need more than two settings for thermal control

• Lower voltage cools chip faster

• FG by itself does need multiple duty cycles and hence requires PI control

• But in a hybrid configuration, FG does not require PI control

• FG is only used at mild DTM settings

• Can pick one fixed duty cycle

• This is beneficial because feedback control is vulnerable to noise
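The hybrid policy described on these slides can be sketched as a simple threshold rule: no feedback controller, one fixed fetch-gating setting for mild overshoot, and a single low-voltage DVS setting beyond the crossover. The trigger temperature, crossover value, and action names below are hypothetical and not taken from the published scheme.

```python
def choose_dtm_action(temp, trigger_temp, crossover_overshoot):
    """Pick a DTM response from the magnitude of the temperature overshoot."""
    overshoot = temp - trigger_temp
    if overshoot <= 0.0:
        return "none"           # below the trigger: run unthrottled
    if overshoot < crossover_overshoot:
        return "fetch_gating"   # mild stress: one fixed FG duty cycle
    return "dvs_low"            # severe stress: switch to the low-voltage setting


# Hypothetical thresholds, degrees C.
TRIGGER = 81.8
CROSSOVER = 0.4

for reading in (81.0, 82.0, 83.5):
    print(f"{reading:.1f} C -> {choose_dtm_action(reading, TRIGGER, CROSSOVER)}")
```

Since fetch gating has low overhead, a spurious reading just above the trigger costs little, while the more expensive DVS switch is reserved for larger overshoots.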


Sensors

• Almost half of DTM overhead is due to

• Guard banding due to offset errors and lack of co-located sensors

• Spurious sensor readings due to noise

• Need localized, fine-grained sensing

• Need new sensor designs that are cheap and can be used liberally – co-locate with hotspots

• But these may be imprecise

• Many sensor designs look promising

• Need new data fusion techniques to reduce imprecision, possibly combine heterogeneous sensors


Impact of Physical Constraints

• Thermal constraints shift optimum toward fewer and simpler cores

• CPU-bound programs still want aggressive superscalar cores despite throttling, but not deeply pipelined

• Mem-bound programs want narrow cores, lots of L2

• You can still have lots of cores

• They will be severely throttled (e.g., up to 45% voltage reduction and 75% frequency reduction)

• But you still win by adding cores until throttling outweighs the benefit of an additional core

• Preliminary results suggest that out-of-order (OO) cores are always preferable: they are more efficient in terms of BIPS/area
