Benini & De Micheli DATE 2000
SYSTEM-LEVEL POWER OPTIMIZATION:
TECHNIQUES AND TOOLS
Giovanni De Micheli
CSL - Stanford University
Luca Benini
DEIS - Università di Bologna
Motivation
• Energy-efficient computing is required by:
  – Mobile electronic systems
    • to extend battery lifetime
  – Large-scale electronic systems
    • to reduce operating costs and meet environmental standards
• The quest for energy efficiency affects all aspects of system design
Objectives of the tutorial
• Describe design techniques and tools for system-level design
• Address system-level modeling, design, and power management issues
• Purposely neglect:
  – Chip-level design issues
    • physical and logic design
  – Distributed systems (e.g., wireless networks)
Electronic systems
• A system is a combination of:
  – Hardware:
    • Computation units
    • Storage units
    • Communication units
    • Peripherals
  – Software:
    • Application and system software
• Energy is required by all hardware units
• Software organization affects how hardware consumes energy
Electronic system design
• Conceptualization and modeling:
  – From idea to model
• Design:
  – HW: computation, storage and communication
  – SW: application and system software
• Run-time management:
  – Run-time system management and control of all units including peripherals
Examples
• Modeling:
  – Choice of algorithm
  – Application-specific hardware vs. programmable hardware (software) implementation
  – Word-width and precision
• Design:
  – Structural trade-off
    • Resource sharing and logic restructuring
  – Exploit multiple/variable supplies
• Management:
  – Operating system
  – Dynamic power management
Outline
• Introduction
• System conceptualization and modeling
  – Modeling and design
  – Modeling for energy-efficient design
• System design
• System management
• Conclusions
System models
• Modeling is an abstraction:
  – Represent important features and hide unnecessary details
• Functional models:
  – Capture functionality and requirements
  – Executable models:
    • Support hw and/or sw compilation and simulation
• Implementation models:
  – Describe target realization
Taxonomy
[Figure: taxonomy tree. Classes of systems: general-purpose systems, special-purpose systems. Modeling styles: implementation models, functional models (executable, non-executable)]
Functional models
• Abstract functional representation
  – Mathematical notation
• Abstract algorithmic notation
  – Pseudocode
• Executable functional models
  – Hw and/or sw languages:
    • Verilog, VHDL, SystemC, C, C++, Java
• Non-executable functional models
  – Complex and/or incomplete representations
    • e.g., task graph
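Functional models: examples
• Mathematical notation:
$$z_i = \begin{cases} K_1\, x_{10 i}, & i = 2j \\ K_2\, y_{10 (i-1)}, & i = 2j+1 \end{cases}$$
• Executable model:
    while (1) {
        read(x);
        read(y);
        if (c % 10 == 0) {
            z = K1 * x; write(z);
            z = K2 * y; write(z);
        }
        c++;
    }
• Block diagram:
[Figure: read x → dsample 10 → scale K1 and read y → dsample 10 → scale K2, both feeding a multiplex block]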
Conceptualization
• Refining abstract into executable models
• Many different realizations:
  – Different algorithms implement a function
  – e.g., sorting
• Many different stylistic embodiments
  – Different ways of expressing computation, storage and communication
  – e.g., different ways of writing hw/sw models
• Form and content of executable models affect energy consumption and bias design
System platform
• Micro-architecture:
  – Programmable component/core (e.g., processor, controller)
• Macro-architecture: system organization
  – Programmable and application-specific components, memory, communication units
• System design:
  – Mapping a functional model onto a platform
  – Hardware and software design
• How to achieve energy-efficient design?
System-level design
• Design macro-architecture:
  – Use off-the-shelf components or cores
    • Do not change micro-architectures
  – Design application-specific components/cores
    • Exploit component architecture
• Low-energy computation can be achieved with either approach:
  – Designing custom components requires more effort but may lower power consumption
Outline
• Introduction
• System conceptualization and modeling
  – Modeling and design
  – Energy-efficient design from
    • Executable functional models
    • Non-executable functional models
    • Implementation models
• System design
• System management
• Conclusions
Algorithm selection
• Inputs
  – A target macro-architecture
  – Abstract functional/executable spec.
  – Constraints
  – Library of algorithms
• Objective
  – Select the most energy-efficient algorithm that satisfies the constraints
Example: set data types [Wuytack96]
• Select abstract data type (data structure & algorithm)
• Minimize memory power
• Application: ATM segment protocol processor
• Library with 32 complex abstract data types
[Figure: candidate implementations: array, linked list, pointer array, binary tree]
• ×1000 ∆P between best and worst
Issues in algorithm selection
• Applicable only to general-purpose primitives with many alternative implementations
• Pre-characterization on target architecture
• Limited search space exploration
Algorithm optimization
• Algorithm ⇒ control data flow graph (CDFG)
• Computational energy
  – Explicit in CDFG
  – Cost-per-operation metrics
• Communication, storage energy
  – Implicit in CDFG
  – Locality metrics
Example: dataflow optimization [Chau92, Srivastava96, Mehra96]
[Figure: two versions of a data-flow graph built from adders (+), multipliers (∗), and delay elements (∆)]
• Computational energy ⇒ graph nodes
  – E.g., node minimization
• Communication/storage energy ⇒ graph edges
  – E.g., min-cut clustering
Algorithm optimization issues
• Analyze CDFG locality
  – Need to rigorously define locality metrics
  – Locality analysis can be expensive
• Tradeoff between spatial locality and temporal locality
  – Parallel vs. sequential
• Relationship with macro-architectural template (cost metric characterization)
• Accuracy
Algorithm transformation
• More aggressive than optimization
  – Global vs. local
  – Change the semantics of the computation
• Hard to automate
• Very effective
Approximate processing
• Introducing well-controlled errors can be advantageous for power
  – Reduced data width (coarse discretization)
  – Layered algorithms (successive approximations)
  – Lossy communication
[Figure: layered processing. Stage 1: 340×400; Stage 2: 680×800; Stage 3: 1020×1200; decisions "Min P?" and "Med P?" (Y/N) between stages]
Control flow-based transformations
• Mutual exclusion
    if (hi > 0) NewAddr(Pack)
    Route(Pack)
[Figure: data flow from hi through the ">0" test to NewAddr and Route, both operating on Pack]
• Computational kernel
    dequant(BLK)
    IDCT(BLK)
[Figure: CPU alone versus CPU with a dedicated DCT unit]
Algorithm transformation issues
• Approximate processing
  – Error control
    • Quantify average and worst-case errors
  – Scope of applicability
    • In some domains, exact results are required
• Control flow-based transformation
  – Control flow analysis is hard
  – Input dependency of control flow
Outline
• Introduction
• System conceptualization and modeling
  – Modeling and design
  – Energy-efficient design from
    • Executable functional models
    • Non-executable functional models
    • Implementation models
• System design
• System management
• Conclusions
Task graph
• Nodes are tasks
• Edges are dependencies
• Periodic execution is implicitly assumed
[Figure: task graph with tasks T1, T2, T3 between nodes B and E]
Task graph techniques
• Problem formulation
  – Input
    • Task graph
    • Set of available processing elements (PEs)
    • Power, performance, cost metrics
    • Performance & cost constraints
  – Output
    • Power-optimized implementation (constrained)
• [Dave99], [Kirovski97]
Processing elements
• Several classes of PEs
  – General-purpose processors (e.g., RISC core)
  – Digital signal processors (e.g., VLIW core)
  – Programmable logic (e.g., LUT-based FPGA)
  – Specialized processors (e.g., custom DCT core)
• Trade off flexibility vs. efficiency
  – Specialized is faster and power-efficient
  – General-purpose is flexible and inexpensive
Cost metrics
• Multi-objective problems
  – Constrained optimization
  – Explore design trade-offs
• Example: performance and power

                    #Clock cycles           Energy (µJ)
                    PE1    PE2    PE3       PE1    PE2    PE3
  Energy per cycle                          1.5    0.2    0.05
  T1                100    250    900       150    50     45
  T2                110    260    900       165    52     45
  T3                120    300    1000      180    60     50

  TCLK = 10 ns @ Vdd = 3.3 V
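The energy columns of the table follow from the cycle counts and the per-cycle energy of each PE. A minimal sketch that recomputes them, assuming the per-cycle figures are in the same unit as the energy columns; the dictionary and function names are illustrative and not part of the tutorial:

```python
# Recompute the energy columns of the cost-metrics table:
# energy(task, PE) = clock cycles(task, PE) * energy per cycle(PE).
ENERGY_PER_CYCLE = {"PE1": 1.5, "PE2": 0.2, "PE3": 0.05}  # as listed in the table

CYCLES = {
    "T1": {"PE1": 100, "PE2": 250, "PE3": 900},
    "T2": {"PE1": 110, "PE2": 260, "PE3": 900},
    "T3": {"PE1": 120, "PE2": 300, "PE3": 1000},
}


def energy_table(cycles, energy_per_cycle):
    """Return {task: {pe: energy}}, rounded to hide floating-point noise."""
    return {
        task: {pe: round(n * energy_per_cycle[pe], 6) for pe, n in per_pe.items()}
        for task, per_pe in cycles.items()
    }


ENERGY = energy_table(CYCLES, ENERGY_PER_CYCLE)

if __name__ == "__main__":
    for task, row in ENERGY.items():
        print(task, row)
```

The slowest PE is the cheapest in energy for every task, which is what makes the binding problem a genuine trade-off against the performance constraint.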
Constrained optimization
• Design space
  – Who does what and when (binding & scheduling)
  – Supply voltage of the various PEs:
    • $T_{CLK} = K\, V_{dd} / (V_{dd} - V_t)^2$
• Design target
  – Minimize power
  – Performance constraint (e.g., $T_{iteration} = 21\,\mu s$)
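For a problem as small as the three-task example, the binding part of the design space can be searched exhaustively. The sketch below is only an illustration under simplifying assumptions that the slides do not state: tasks run one after another, every PE stays at 3.3 V with TCLK = 10 ns, and communication cost is ignored. The cycle and energy figures are the ones tabulated in the cost-metrics example; `best_binding` is a hypothetical helper.

```python
from itertools import product

# Cycle counts and energies from the cost-metrics example.
CYCLES = {
    "T1": {"PE1": 100, "PE2": 250, "PE3": 900},
    "T2": {"PE1": 110, "PE2": 260, "PE3": 900},
    "T3": {"PE1": 120, "PE2": 300, "PE3": 1000},
}
ENERGY = {
    "T1": {"PE1": 150, "PE2": 50, "PE3": 45},
    "T2": {"PE1": 165, "PE2": 52, "PE3": 45},
    "T3": {"PE1": 180, "PE2": 60, "PE3": 50},
}
DEADLINE_CYCLES = 2100  # 21 us iteration time at TCLK = 10 ns


def best_binding(cycles, energy, deadline):
    """Exhaustively bind each task to a PE; assumes strictly serial execution."""
    tasks = sorted(cycles)
    pes = sorted(next(iter(cycles.values())))
    best = None
    for binding in product(pes, repeat=len(tasks)):
        total_cycles = sum(cycles[t][pe] for t, pe in zip(tasks, binding))
        if total_cycles > deadline:
            continue  # violates the performance constraint
        total_energy = sum(energy[t][pe] for t, pe in zip(tasks, binding))
        if best is None or total_energy < best[1]:
            best = (dict(zip(tasks, binding)), total_energy, total_cycles)
    return best


if __name__ == "__main__":
    print(best_binding(CYCLES, ENERGY, DEADLINE_CYCLES))
```

Under these assumptions the all-PE3 binding is the cheapest but misses the deadline, so the search settles on a mix of PEs. A real flow would also explore scheduling overlap and per-PE supply voltage.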
Basic search algorithm
[Figure: flowchart. Task graph preprocessing (clustering-splitting) → PE allocation → Scheduling & Binding → Cost Estimation → Stop? (NO: iterate again; YES: done)]
Refined Power Metrics I
• Communication power
• Memory power
[Figure: task graph (T1, T2, T3 between B and E) mapped onto PEs with local memories (M) connected by a BUS]
Refined Power Metrics II
• Multi-tasking overhead
• Power management overhead
[Figure: activity timelines for PE1]
Limitations of Task-Level Abstraction
• Task-level description does not express complete functional information
  – Tasks must be defined a priori
  – Functional transformations are impossible
• Power metric characterization
  – Exhaustive (every PE for every task)
  – Inaccurate (misses inter-task effects)
Outline
• Introduction
• System conceptualization and modeling
  – Modeling and design
  – Energy-efficient design from
    • Executable functional models
    • Non-executable functional models
    • Implementation models
• System design
• System management
• Conclusions
The spreadsheet model
• General-purpose systems
  – Backward compatibility
  – Component-based
• Spreadsheet-based analysis
  – Basic budgeting
  – Simple “what if” analyses
  – No learning curve
Example: datasheet analysis

  PDA     #Comp  Vdd   Iidle (mA)  Ion (mA)  %on    %idle  I (mA)
  Proc    1      3.3   0.5         50        0.7    0.3    35.15
  DRAM    1      3.3   0.1         12        0.7    0.3    8.43
  FLASH   5      3.3   0.0         9         0.7    0.3    31.5
  IR      1      3.3   0.0         64        0.05   0.95   3.2
  RTC     1      3.3   0.0         0.1       1      0      0.1
  DC-DC   1      -     0.1         5.5       0.99   0.01   5.44
  TOT                                                      83.82
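The spreadsheet arithmetic is easy to reproduce. The sketch below assumes that each row's average current is #Comp × (Ion·%on + Iidle·%idle), which is the relation the tabulated figures appear to follow; the names are illustrative only.

```python
# (name, count, I_idle in mA, I_on in mA, fraction of time on, fraction idle)
COMPONENTS = [
    ("Proc", 1, 0.5, 50.0, 0.70, 0.30),
    ("DRAM", 1, 0.1, 12.0, 0.70, 0.30),
    ("FLASH", 5, 0.0, 9.0, 0.70, 0.30),
    ("IR", 1, 0.0, 64.0, 0.05, 0.95),
    ("RTC", 1, 0.0, 0.1, 1.00, 0.00),
    ("DC-DC", 1, 0.1, 5.5, 0.99, 0.01),
]


def average_current(count, i_idle, i_on, frac_on, frac_idle):
    """Time-weighted average current of `count` identical components, in mA."""
    return count * (i_on * frac_on + i_idle * frac_idle)


def budget(components):
    """Return per-component average currents and their total."""
    rows = {name: average_current(*rest) for name, *rest in components}
    return rows, sum(rows.values())


if __name__ == "__main__":
    rows, total = budget(COMPONENTS)
    for name, current in rows.items():
        print(f"{name:6s} {current:6.2f} mA")
    print(f"TOT    {total:6.2f} mA")
```

A "what if" analysis then amounts to editing one tuple, for example lowering the processor's on-fraction, and rerunning `budget`.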
Outline
• Introduction
• System conceptualization and modeling
• System design
  – Computation
  – Memory
  – Communication
  – Software
• System management
• Conclusions
System design
• Input
  – The output of the conceptualization phase
    • A macro-architectural template
    • A hardware-software partition
    • Component-by-component constraints
• Output
  – Complete hardware design
[Figure: macro-architecture with a processor (Proc), processing elements (PE), and memories (Mem) attached to an interconnect]
Design process
• Specify computation, storage, template components, and software
  – Synergistic process
• Fundamental tradeoff: general-purpose vs. application-specific
  – Flexibility has a cost in terms of power
Processing element design
• Application-specific processing unit
  – Minimum flexibility, minimum power
• Application-specific processor
  – Tailored processor template
• Core processor
  – Maximum flexibility, maximum power
Application-specific computational units
• Designed at the circuit/logic/RT level
  – Outside the scope of this tutorial
• Synthesized from high-level executable specification (behavioral synthesis)
  – Supply voltage reduction
  – Switching frequency reduction
  – Load capacitance reduction
  – Minimization of switching activity
Power-driven voltage scaling
• From faster to more power-efficient by scaling down the supply voltage
  – Traditional speed-enhancing transformations can be exploited for low-power design
    • Pipelining
    • Parallelization
    • Loop unrolling
    • Re-timing
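To illustrate how a speed-enhancing transformation buys a lower supply, the sketch below uses the delay model $T = K\,V_{dd}/(V_{dd}-V_t)^2$ and asks what supply voltage makes the logic twice as slow, which is the slack a two-way parallel datapath would provide at equal throughput. The threshold voltage of 0.7 V, the 3.3 V reference, and the function names are assumptions made for this example; the extra capacitance of the duplicated hardware is ignored.

```python
def delay(vdd, vt=0.7, k=1.0):
    """Delay model T = K * Vdd / (Vdd - Vt)^2, in arbitrary units."""
    return k * vdd / (vdd - vt) ** 2


def scaled_vdd(v_ref, slowdown, vt=0.7, tol=1e-9):
    """Find the supply in (Vt, v_ref] whose delay is `slowdown` times that at v_ref."""
    if slowdown < 1:
        raise ValueError("slowdown must be at least 1")
    target = slowdown * delay(v_ref, vt)
    lo, hi = vt + 1e-6, v_ref  # delay falls monotonically as Vdd rises above Vt
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if delay(mid, vt) > target:
            lo = mid  # still too slow: raise the supply
        else:
            hi = mid
    return hi


def energy_ratio(v_ref, v_new):
    """Switching energy per operation scales with Vdd^2 for a fixed capacitance."""
    return (v_new / v_ref) ** 2


if __name__ == "__main__":
    v_new = scaled_vdd(3.3, slowdown=2.0)
    print(f"Vdd = {v_new:.2f} V, energy per operation x{energy_ratio(3.3, v_new):.2f}")
```

With these assumed values the supply drops to roughly 2.2 V, so each operation costs less than half the switching energy it did at 3.3 V.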
Issues
• Performance-enhancing transformations do not always pay off
  – Region of diminishing returns (e.g., speculation)
• Supply voltage is already being driven low for technological reasons
  – Reduced headroom
Advanced voltage scaling
• Multiple voltages
  – Slow down non-critical paths with a lower supply voltage
  – Two or more power grids
  – High-efficiency voltage converters
[Figure: data-flow graph with ∗, +, and − operations]
Clock frequency reduction
• fclk↓ does not decrease energy
  – …but it may increase battery life: $C = K / I^{\alpha}$
• Multi-frequency clocks
  – GALS [Hemani99]
  – Low-frequency distribution [Chung95]
[Figure: chip partitioned into clock domains 1, 2, and 3]
Reducing load capacitance
• Reduce wiring capacitance
  – Reduce local loads
  – Reduce global interconnect
• Global interconnect can be reduced by improving spatial locality: trade off communication for computation
Reduce switching activity
• Improve correlation between consecutive inputs to functional macros
• Reduce glitching
• All basic HLS steps have been modified
  – A synergistic approach leads to the best results
[Figure: alternative schedules and bindings of a data-flow graph with +, −, and > operations on operands A, B, C, H, and 0]
Issues
• High-level estimation
  – Accuracy is still limited
  – Dependency on input patterns
• Design flow
  – HLS is not yet a mainstream technology
  – Competes with RTL techniques
Application-specific processors
• Parameterized processors tailored to a specific application
  – Optimally exploit parallelism
  – Eliminate unneeded features
• Applied to different architectures
  – Single-issue cores ⇒ instruction subsetting
  – Superscalar cores ⇒ # and type of FUs
  – VLIW cores ⇒ FUs and compiler
Example: Application-specific VLIW optimization
[Figure: flowchart. Inputs: application, processor library. Select processor and retarget compiler → Compile and simulate (ISS) → Estimate and find best so far → Other? (Y: select another processor; N: eliminate dominated solutions → Done)]
Issues
• Exploration techniques
  – Limited search space
  – Accuracy of cost metrics
• Back-end
  – Synthesis of ASIPs
  – Competitive with highly-optimized cores?
• Preliminary research results
Low power core processors
• Details are outside the tutorial’s scope
  – [Gonzalez96, Burd96]
• Key ideas
  – Low voltage
  – Reduce wasted switching
  – Specialized modes of operation/instructions
  – Variable voltage supply
Core design space for Multimedia [Nishitani99]
[Figure: log-log scatter plot of processors; axes MOPS (1.0e+1 to 1.0e+5) and W (1.0e-2 to 1.0e+2). Classes: ASIC, MediaProc, General Purpose. Workload levels marked: MPEG1 ENC, MPEG2 DEC, MPEG2 ENC. Devices plotted: Sony Mpeg2 Enc, NEC Mpeg2 Enc, TMS320C6x, TMS320C6201, MPact, TriMedia, SH4, Scarlet, DEC21164, PentiumII, MMX Pentium, VR4400, VR4300, StrongARM110, Pallas, Lucent16210, SPX, MN1933211, VR4100]
Exploiting Variable Supply
• Supply voltage can be dynamically changed during system operation
  – Quadratic power savings
  – Circuit slowdown
• Just-in-time computation
  – Stretch execution time up to the maximum tolerable
[Figure: power versus available time, comparing fixed voltage + shutdown with variable voltage]
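The gap between "fixed voltage + shutdown" and "variable voltage" can be estimated with a back-of-the-envelope model. The sketch below is only that: it assumes clock frequency scales linearly with Vdd (a first-order simplification, not the delay model used elsewhere in the tutorial), zero energy while shut down, and free voltage transitions. All function and parameter names are hypothetical.

```python
def energy_fixed_with_shutdown(cycles, c_eff, vdd_max):
    """Run at full speed, then shut down; the sleep energy is assumed to be zero."""
    return cycles * c_eff * vdd_max ** 2


def energy_variable_voltage(cycles, c_eff, vdd_max, f_max, available_time):
    """Stretch execution over the available time, lowering Vdd along with the clock."""
    f_needed = cycles / available_time
    if f_needed > f_max:
        raise ValueError("deadline cannot be met even at full speed")
    vdd = vdd_max * f_needed / f_max  # assumption: f is proportional to Vdd
    return cycles * c_eff * vdd ** 2


if __name__ == "__main__":
    args = dict(cycles=1_000_000, c_eff=1e-9, vdd_max=3.3)
    fixed = energy_fixed_with_shutdown(**args)
    variable = energy_variable_voltage(**args, f_max=100e6, available_time=0.02)
    print(f"fixed: {fixed:.4f} J, variable: {variable:.4f} J")
```

In this model a task that needs only half of the available time runs at half the voltage and a quarter of the energy, which is the quadratic saving the slide refers to.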
Variable-supply Architectures
• High-efficiency adjustable DC-DC converter
• Adjustable synchronization
  – Variable-frequency clock generator [Chandrakasan96]
  – Self-timed circuits [Nielsen94]
• Example: Power-pro architecture [Ishihara98], Crusoe embedded processor [Transmeta00]
[Figure: block diagram with CPU, Prog ROM, Data RAM, Dec, Power Manager, and a DC-DC & VCO block supplying Vdd and CLK]
Issues
• Optimization still not proven on real-life architectures
• Overhead in supporting variable voltage
  – Adjustable DC-DC
  – Adjustable clock
  – Interfaces
• Reliability concerns
Outline
• Introduction
• System conceptualization and modeling
• System design
  – Computation
  – Memory
  – Communication
  – Software
• System management
• Conclusions
Memory Optimization
• Custom data processors
  – Computation is less critical than data storage (for data-dominated applications)
• General-purpose processors
  – A significant fraction of system power is consumed by memories
[Figure: memory-related consumption, split into storage and transfer]
Memory Power Optimization
• Key idea: exploit locality
  – Hierarchical memory
  – Partitioned memory
[Figure: a PE with a memory hierarchy (L1, L2, L3) and memories partitioned into banks (B1, B2, B3)]
Optimization approaches
• Fixed memory access patterns
  – Optimize memory architecture
• Fixed memory architecture
  – Optimize memory access patterns
• Concurrently optimize memory architecture and accesses
Optimize Memory Architecture
• Data replication to localize accesses
  – Implicit: multi-level caches [Su95], [Bahar98]
  – Explicit: buffers [Bajwa97], [Wuytack98]
• Partitioning to minimize cost per access
  – Multi-bank caches [Ko98]
  – Partitioned memories [Tellez97], [Wuytack98]
[Figure: memory levels L1, L2, L3 linked by busses B12 and B23]
$$P_{mem} = P_{L1}\,hit_{L1} + P_{B12}\,(1 - hit_{L1}) + P_{L2}\,hit_{L2} + (P_{B23} + P_{L3})\,(1 - hit_{L1} - hit_{L2})$$
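A small sketch of how such a hierarchy cost model can be evaluated. It assumes that the hit figures are fractions of all accesses served by L1 and by L2, and that the L2 term is weighted by the L2 hit fraction; the function name and the example costs are invented for illustration.

```python
def memory_power(p_l1, p_l2, p_l3, p_b12, p_b23, hit_l1, hit_l2):
    """Average cost per access for a three-level hierarchy.

    hit_l1 and hit_l2 are the fractions of all accesses served by L1 and L2.
    """
    if not (0 <= hit_l1 <= 1 and 0 <= hit_l2 <= 1 and hit_l1 + hit_l2 <= 1):
        raise ValueError("hit fractions must be in [0, 1] and sum to at most 1")
    miss_l1 = 1 - hit_l1             # accesses that cross bus B12
    miss_l2 = 1 - hit_l1 - hit_l2    # accesses that cross bus B23 and reach L3
    return (p_l1 * hit_l1
            + p_b12 * miss_l1
            + p_l2 * hit_l2
            + (p_b23 + p_l3) * miss_l2)


if __name__ == "__main__":
    # Invented relative costs: each level is more expensive than the one above it.
    print(memory_power(p_l1=1, p_l2=5, p_l3=50, p_b12=2, p_b23=10,
                       hit_l1=0.90, hit_l2=0.08))
```

Even with a 2% fraction of accesses reaching L3, the last term dominates in this invented example, which is why localizing accesses pays off.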
Optimize Memory Accesses
• Sequentialize memory accesses
  – Reduce address bus transitions [Catthoor94], [Su95]
  – Exploit multiple small memories [Panda96]
• Localize program execution
  – Fit frequently executed code into a small instruction buffer (or cache) [Panwar95], [Bellas98]
• Reduce storage requirements [Gebotys96], [Catthoor]
Optimize Memory Architecture and Access Patterns
• Two-phase process
  – Specification (program) transformations
    • Reduce memory requirements
    • Improve regularity of accesses
  – Build optimized memory architecture
• Highest potential
  – How to automate program transformations?
Memory Optimization Tools
• Atomium (IMEC)
  – Supports very high-level transformations and optimizations (e.g., data-structure selection)
• Avalanche (Princeton)
  – Focuses on cache optimization, provides cycle-accurate estimates
• UCLA system-level tool
  – Integrates voltage supply selection
Outline
• Introduction
• System conceptualization and modeling
• System design
  – Computation
  – Memory
  – Communication
  – Software
• System management
• Conclusions
Design of communication units
• Trends:
  – Faster computation blocks, larger chips
  – Communication speed is critical
  – Energy cost of communication is significant
• Multifaceted design approach:
  – On chip, networks, wireless, ...
  – Protocol stack
Protocol stack: a simplified view
• Data Link
  – Error control through coding and channel management
  – Shared busses
• Physical layer
  – Signaling
  – Modulation
[Figure: layer stack, top to bottom: Applications, OS & Middleware, Network, Data Link, Physical]
Data encoding
• Theoretical results:
  – Bounds on transition activity reduction:
    • The higher the entropy rate of the source, the lower the gain achievable by coding
• Practical applications:
  – Processor-memory (and other) busses
    • Data busses, address busses
• Transition activity reduction does not guarantee energy savings
Bus encoding
[Figure: PROCESSOR and MEMORY connected by a BUS with extra control line(s); an ENC/DEC block sits at each end]
Bus encoding
• Data busses:
  – Random white noise model
• Address busses:
  – Some spatio-temporal correlations
• Embedded software:
  – Addresses and data can be analyzed a priori to determine encoding
Bus-Invert coding for data busses [Stan, Burleson]
• Add redundant line INV to bus
• When INV = 0
  – Data is equal to remaining bus lines
• When INV = 1
  – Data is complement of remaining bus lines
• Performance:
  – Peak: at most n/2 bus lines switch
  – Average: code is optimal; no other code with 1-bit redundancy can do better
Bus-Invert coding for data busses
• Average switching reduction is bus-width dependent:
  – Ex: 3.27 transitions on average for an 8-bit bus (4 without coding)
• Average switching per line decreases as busses get wider
  – Use partitioned codes
  – No longer optimal (among redundant codes)
• Implementation issues:
  – Difference (XOR) of two data samples and majority vote
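The XOR-and-majority-vote implementation can be sketched in a few lines. The version below follows the common formulation in which the Hamming distance is taken between the current state of all lines (data plus INV) and the next word sent with INV = 0; that detail and all names are assumptions of this example rather than something stated on the slide.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def bus_invert_step(prev_lines, data, n):
    """Return the next state of the n data lines plus INV (stored as bit n)."""
    mask = (1 << n) - 1
    candidate = data & mask                       # INV = 0: data sent as is
    if popcount(prev_lines ^ candidate) > n / 2:  # majority of lines would toggle
        candidate = (~data & mask) | (1 << n)     # send complement with INV = 1
    return candidate


def bus_invert_decode(lines, n):
    """Recover the data word from the bus lines and the INV bit."""
    mask = (1 << n) - 1
    data = lines & mask
    return (~data & mask) if (lines >> n) & 1 else data


def average_transitions(n, prev_lines=0):
    """Exact average over all 2^n equiprobable data words, from a given bus state."""
    total = sum(popcount(prev_lines ^ bus_invert_step(prev_lines, d, n))
                for d in range(1 << n))
    return total / (1 << n)


if __name__ == "__main__":
    for width in (2, 8, 16):
        print(width, round(average_transitions(width), 2))
```

Enumerating all data words for 2-, 8-, and 16-bit busses gives averages that can be compared with the figures quoted for unpartitioned bus-invert coding.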
Bus-Invert code comparisons

  lines  mode   A-trans  A-trans/line  A-power
  2      -      1        0.5           100%
  2      BI     0.75     0.375         75%
  8      -      4        0.5           100%
  8      1 BI   3.27     0.409         81.8%
  8      4 BI   3        0.375         75%
  16     -      8        0.5           100%
  16     1 BI   6.83     0.427         85.4%
Extensions and generalizations
• Transition signaling:
  – Assert logic TRUE / FALSE by signal transitions
• Use several redundant lines
  – Limited-weight codes [Stan, Burleson]
  – Coding in space and time
  – Modulation techniques
Encoding instruction addresses
• Most instruction addresses are consecutive
  – Use Gray code [Su, Tsui, Despain]
• Word-oriented machines:
  – Increments by 4 (32-bit) or by 8 (64-bit)
  – Modify Gray code to switch 1 bit per increment [Mehta, Owens, Irwin]
  – Gray code adder for jumps
    • Harder to partition
    • Convert to Gray code after update
Working zone encoding (WZE) [Musoll, Lang, Cortadella]
• Conjecture:
  – Software programs favor working zones of their address space
• WZE:
  – Transmit WZ identifier and offset in WZ
  – 1-hot encoding for offsets
• Applicability:
  – No caches: data/instruction/shared address busses
  – With caches: data/instruction-only busses
T0 Code
• Add redundant line INC to bus
• When INC = 0
  – Address is equal to remaining bus lines
• When INC = 1
  – Transmitter freezes other bus lines
  – Receiver increments previously transmitted address by a parameter called stride
• Asymptotically zero transitions for sequences
  – Better than Gray code
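To make the comparison with Gray code concrete, the sketch below counts bus transitions for a run of consecutive addresses under plain binary, Gray, and T0 encoding. It is a simplified model: the INC line is stored as the bit above the address lines, address wrap-around is ignored, and the function names are illustrative.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def transitions(words):
    """Total number of line toggles between consecutive bus words."""
    return sum(popcount(a ^ b) for a, b in zip(words, words[1:]))


def gray(a):
    """Standard binary-reflected Gray code."""
    return a ^ (a >> 1)


def t0_encode(addresses, width, stride=1):
    """INC is stored as bit `width`; address lines stay frozen while INC = 1."""
    words, bus, prev = [], 0, None
    for a in addresses:
        if prev is not None and a == prev + stride:
            word = (1 << width) | bus  # in sequence: assert INC, keep lines frozen
        else:
            bus = a
            word = bus                 # out of sequence: INC = 0, send the address
        words.append(word)
        prev = a
    return words


def t0_decode(words, width, stride=1):
    """Rebuild addresses: increment when INC is set, else read the bus lines."""
    mask = (1 << width) - 1
    out, prev = [], None
    for w in words:
        prev = prev + stride if (w >> width) & 1 else (w & mask)
        out.append(prev)
    return out


if __name__ == "__main__":
    seq = list(range(16))
    print("binary:", transitions(seq))
    print("gray:  ", transitions([gray(a) for a in seq]))
    print("T0:    ", transitions(t0_encode(seq, width=8)))
```

For a purely sequential stream Gray code still toggles one line per address, while T0 toggles only the INC line once at the start of the run.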
Mixed bus encoding techniques
• T0_BI:
  – Use two redundant lines: INC and INV
  – Good for shared address/data busses
• Dual encoding:
  – Good for time-multiplexed address busses
  – Use redundant line SEL:
    • SEL = 1 denotes addresses
    • SEL is already present in the bus interface
  – Dual T0:
    • Use T0 code when SEL is asserted
  – Dual T0_BI:
    • Use T0 when SEL is asserted; otherwise use BI
Address bus encoding using statistical analysis
• Statistical analysis of bus traces
  – Spatio-temporal correlation of word K-tuples
  – Often limited to first / second order statistics (K = 1, 2)
• Encode words according to correlation
• Use transition signaling
• Spatio-temporal correlation computation:
  – On-line adaptive
  – Off-line for embedded software
Address bus encoding for embedded software
• Off-line statistical analysis of bus traces
  – Compute second-order bit correlation from a known stream:
    • Correlate bit i at time t with bit j at time t+1
• Use correlation measure to group bits into fields
  – Apply graph clustering algorithm
  – Clusters correspond to mutually high spatio-temporal correlation
• Re-encode bus lines in each cluster
  – Group bus lines into clusters (with locally high correlation)
  – Encode signals within each cluster to reduce bus switching
Example [Ramprasad98, Benini99]
[Figure: encoder (E) and decoder (D) built from XOR (⊕) stages with a correlator and a decorrelator; signals x(n), x(n−1), y(n), z(n)]
Bus encoding: summary
• Bus encoding is very useful to reduce switching of high-capacitance busses
• Some techniques require synthesis of dedicated encoder/decoder circuitry
• Power consumption of such circuits must be weighed against power savings on busses
• Techniques differ for address and data busses
Outline
• Introduction
• System conceptualization and modeling
• System design
  – Computation
  – Memory
  – Communication
  – Software
• System management
• Conclusions
Impact of software
• For a given hardware platform, the energy to realize a function depends on software
  – Operating system
  – Different algorithms to embody a function (e.g., sorting)
  – Different coding styles
  – Application software compilation
Source-code transformations
• Minimize power-consuming activity
  – Computation
      before:  A*B + A*C
      after:   A*(B+C)
  – Communication
      before:  for (c = 1..N)
                   receive(A)
                   B = c*A
      after:   receive(A)
               for (c = 1..N)
                   B = c*A
  – Storage
      before:  for (c = 1..N) B[c] = A[c]*D[c]
               for (c = 1..N) F[c] = B[c]-1
      after:   for (c = 1..N) F[c] = A[c]*D[c]-1
Coding styles
• Use processor-specific instruction style:
  – Variable types
  – Function call style
  – Conditionalized instructions (for ARM)
• Follow general guidelines for software coding
  – Use table look-up instead of conditionals
  – Make local copies of global variables so that they can be assigned to registers
  – Avoid multiple memory look-ups with pointer chains
Example: ARM Variable Types
• Default “int” variable type is 18.2% more energy efficient than “char” or “short”
• Sign or zero extension is needed for shorter variable types

    int wordinc (int a)
    {
        return a + 1;
    }

    char charinc (char a)
    {
        return a + 1;
    }
Example: ARM conditional execution
• All ARM instructions are conditional
• Conditional execution reduces the number of branches

    int g(int a, int b, int c, int d)
    {
        if (a > 0 && b > 0 && c < 0 && d < 0)
            return a + b + c + d;
        return -1;
    }
Example: Switch versus Table Lookup
• Table lookup is 52.3% more energy efficient than a dense switch statement since fewer jumps are required
• Table lookup version:

    if ((unsigned) condition >= 4) return 0;
    return "EQ\0NE\0CS\0CC\0" + 3*condition;

• Switch version:

    switch (condition) {
        case 0: return "EQ";
        case 1: return "NE";
        case 2: return "CS";
        case 3: return "CC";
        default: return 0;
    }
Example: MPEG
• Compiler optimizations account for only 0.6% energy savings
• Source code optimizations give as much as 32.3% energy savings

  Source  General    Special  Size      Time      Energy
  Opt.    Opt.       Opt.     %change   %change   %change
  none    Balance    none     0         0         0
  none    Code Size  none     0         -0.6      -0.6
  none    Time       none     0         -0.2      -0.2
  none    Balance    all      0         -0.6      -0.6
  all     Balance    none     -5.8      -35       -32.3
  all     Balance    all      -5.8      -35       -32.3
Software analysis
• Simulation-based approach
  – Perform system-level simulation of hardware platform running application software
  – Best for software program comparisons
• Experimental measurement
  – Run program with instruction streams and measure average energy cost of instructions
  – Best for instruction-level analysis
Instruction-level analysis [Tiwari, Malik, Wolfe]
• Analyze loop execution containing specific instruction(s)
  – Loop should be long enough to neglect overhead and short enough to avoid cache misses
  – About 200 instructions
• Measure instruction base cost
• Measure inter-instruction effects (can be approximated by a constant)
Base cost for the 486DX2

  #  Instruction        Current (mA)  Cycles
  1  NOP                275.7         1
  2  MOV DX, BX         302.4         1
  3  MOV DX, [BX]       428.3         1
  4  MOV DX, [BX][DI]   409.0         2
  5  MOV [BX], DX       521.7         1
  6  MOV [BX][DI], DX   451.7         2
Instruction-level analysis
• Energy cost per instruction:
  – current * cycles * voltage / frequency
• Energy cost per instruction stream
  – Basis for comparison
• Example: StrongARM @ 170 MHz
  – Average power/instruction: 200 mW
  – Load/store: 260 mW
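The per-instruction energy formula can be applied directly to a base-cost table. The sketch below uses the 486DX2 currents and cycle counts listed in the tutorial and treats the currents as milliamps; the supply voltage (3.3 V) and clock frequency (40 MHz) passed in the example are placeholder assumptions, since the slides do not give them, and inter-instruction effects are ignored.

```python
# instruction: (base current in mA, cycles), from the 486DX2 base-cost table
BASE_COST_486DX2 = {
    "NOP": (275.7, 1),
    "MOV DX, BX": (302.4, 1),
    "MOV DX, [BX]": (428.3, 1),
    "MOV DX, [BX][DI]": (409.0, 2),
    "MOV [BX], DX": (521.7, 1),
    "MOV [BX][DI], DX": (451.7, 2),
}


def instruction_energy(current_ma, cycles, vdd, f_clk):
    """Energy in joules: current * cycles * voltage / frequency."""
    return (current_ma * 1e-3) * cycles * vdd / f_clk


def program_energy(instructions, table, vdd, f_clk):
    """Sum of base costs over an instruction stream (no inter-instruction term)."""
    return sum(instruction_energy(*table[i], vdd, f_clk) for i in instructions)


if __name__ == "__main__":
    VDD, F_CLK = 3.3, 40e6  # assumed operating point, not taken from the slides
    for name, (current, cycles) in BASE_COST_486DX2.items():
        energy_nj = instruction_energy(current, cycles, VDD, F_CLK) * 1e9
        print(f"{name:18s} {energy_nj:6.2f} nJ")
```

Summing base costs over an instruction stream gives the basis for comparing two versions of a program; a constant per-pair term could be added to approximate inter-instruction effects.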
Software compilation
[Figure: compilation flow. Parse & Lex → Machine-independent optimizations → Instruction selection, operation scheduling, register assignment]
Compilation for low power: instruction selection
• Use instruction cost as guideline:
  – Example: use register-based operations
• Instruction selection by tree cover:
  – Use energy cost as guideline [Tiwari]
  – Experimental results do not show major variations from code obtained by selecting instructions to minimize code length
Compilation for low power: operation scheduling
• Reorder instructions:
  – Reduce inter-instruction effects [Tiwari]
  – Switching in control part [Tiwari, Malik, Wolfe]
• Cold scheduling:
  – Reorder instructions to reduce inter-instruction effects on instruction bus [Su, Tsui, Despain]:
    • Consider instruction op-codes
    • Inter-instruction cost is op-code Hamming distance
    • Use list scheduler where priority criterion is tied to Hamming distance
Scheduling to reduce off-chip traffic [Tomiyama, Ishihara, Inoue, Yasuura]
• Schedule instructions (within basic blocks) to minimize Hamming distance
• Scheduling algorithm:
  – Operates within each basic block
  – Searches for linear orders consistent with data-flow
  – Prunes search space by avoiding redundant solutions (hash sub-trees) and heuristically limits the number of sub-trees
Example
(Hamming distance from the previous word in the right-hand column)

  Original order               Reordered
  ( )  11111111                ( )  11111111
  (A)  10010011    4           (A)  10010011    4
  (B)  11101000    6           (C)  10111011    2
  (C)  10111011    4           (B)  11101000    4
  (D)  01110100    6           (D)  01110100    4
  ( )  11111111    4           ( )  11111111    4
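The distances in this example can be checked mechanically. The sketch below recomputes the Hamming distance between consecutive words for both orderings; `BOUNDARY` is a name chosen here for the all-ones word shown as "( )", and the helper names are illustrative.

```python
def hamming(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def bus_switching(words):
    """Hamming distance between each pair of consecutive words."""
    return [hamming(a, b) for a, b in zip(words, words[1:])]


BOUNDARY = 0b11111111
A, B, C, D = 0b10010011, 0b11101000, 0b10111011, 0b01110100

ORIGINAL = [BOUNDARY, A, B, C, D, BOUNDARY]
REORDERED = [BOUNDARY, A, C, B, D, BOUNDARY]

if __name__ == "__main__":
    for name, order in (("original", ORIGINAL), ("reordered", REORDERED)):
        steps = bus_switching(order)
        print(name, steps, "total", sum(steps))
```

Swapping B and C lowers the total from 24 to 18 bit transitions. A scheduler may only apply such a swap when it is consistent with the data-flow dependencies of the basic block, which this sketch does not model.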
Compilation for low power: register assignment
• Minimize spills to memory
• Register labeling [Mehta, Owens, Irwin]
  – Reduce switching in instruction register/bus and register file decoder by encoding
  – Reduce Hamming distance between addresses of consecutive register accesses
  – Approach is complementary to cold scheduling
Other compiler optimizations
• Loop unrolling to reduce overhead
  – Contra: increased code space
• Software pipelining
  – Decreases the number of stalls by fetching instructions in different iterations
• Eliminate tail recursion
  – Reduce overhead and use of stack
Software compilation: summary
• Several interesting ideas but:
  – Shortest code may turn out to be best for energy
• Memory organization and encoding of data, instructions and addresses play a major role:
  – High-capacitance busses draw a lot of power
• Instruction/data compression/decompression leads to significant savings due to traffic reduction
Outline
• Introduction
• System conceptualization and modeling
• System design
• System management
  – Dynamic power management techniques
  – Implementation of DPM
• Conclusions
Dynamic power management
• Systems are:
  – Designed to deliver peak performance, but …
  – Not needing peak performance most of the time
• Components are idle at times
• Dynamic power management (DPM):
  – Puts idle components in low-power non-operational states
• Power manager:
  – Observes and controls the system
  – Power consumption of power manager is negligible
Structure of power-manageable systems
• System consists of several components:
  – E.g., Laptop: processor, memory, disk, display …
  – E.g., SOC: CPU, DSP, FPU, RF unit
• Components may:
  – Self-manage state transitions
  – Be controlled externally
• Power manager:
  – Abstraction of power control unit
  – May be realized in hardware or software
Power manageable components
• Components with several internal states
  – Corresponding to power and service levels
• Abstracted as a power state machine
  – State diagram with:
    • Power and service annotation on states
    • Power and delay annotation on edges
Example: SA-1100
• RUN: operational
• IDLE: a sw routine may stop the CPU when not in use, while monitoring interrupts
• SLEEP: shutdown of on-chip activity
[Figure: power state machine. RUN: 400 mW; IDLE: 50 mW; SLEEP: 160 µW. Transition delays: 10 µs (RUN ↔ IDLE), 90 µs (RUN → SLEEP, IDLE → SLEEP), 160 ms (SLEEP → RUN)]
Power State Transitions (Fujitsu MHF 2043 AT)
[Figure: power state machine]
• Working: 2.2 W (spinning + IO)
• Idle: 0.95 W (spinning)
• Sleeping: 0.13 W (stop spinning)
• Transitions:
  – read / write
  – IO complete
  – shut down: 0.36 J, 0.67 sec
  – spin up: 4.4 J, 1.6 sec
The applicability of DPM
• State transition power (Ptr) and delay (Ttr)
• If Ttr = 0 and Ptr = 0, the policy is trivial
  – Stop a component when it is not needed
• If Ttr != 0 or Ptr != 0 (always…)
  – Shut down only when idleness is long enough to amortize the cost
  – What if T and P are not deterministic?
[Figure: ON and OFF states joined by transitions labeled Ptr, Ttr]
System break-even time: TBE
$$T_{BE} = T_{tr} + T_{tr}\,\frac{P_{tr} - P_{on}}{P_{on} - P_{off}}$$
• Parameters: transition delay (Ttr), transition power (Ptr), on power (Pon), sleep power (Poff)
• Minimum idle time for amortizing the cost of component shutdown
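As a worked example, the break-even time can be evaluated for the Fujitsu disk figures quoted earlier. This is a sketch under stated assumptions, not the authors' tool: the function name is hypothetical, Pon is taken to be the disk's Idle power (0.95 W), Ttr and Ptr lump the shut-down and spin-up transitions together, and the clamp for Ptr <= Pon is an added assumption that does not appear on the slide.

```python
def break_even_time(t_tr, p_tr, p_on, p_off):
    """Break-even time T_BE = T_tr + T_tr * (P_tr - P_on) / (P_on - P_off).

    Assumption (not on the slide): when P_tr <= P_on the extra term is
    dropped, so T_BE is never shorter than the transition time itself.
    """
    if p_on <= p_off:
        raise ValueError("sleep state must draw less power than the on state")
    extra = max(0.0, p_tr - p_on) * t_tr / (p_on - p_off)
    return t_tr + extra


# Disk figures: shut down 0.36 J / 0.67 s, spin up 4.4 J / 1.6 s.
t_tr = 0.67 + 1.6                 # total transition delay (s)
p_tr = (0.36 + 4.4) / t_tr        # average power during the transitions (W)
t_be = break_even_time(t_tr, p_tr, p_on=0.95, p_off=0.13)
print(f"T_BE = {t_be:.3f} s")
```

Under these assumptions the disk must stay idle for roughly 5.4 s before a shutdown pays off, which is well above the 2.27 s spent in the transitions themselves.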
When to use power management
• When $T_{BE} < T_{idle}^{avg}$:
  – Average idle periods are long enough
  – Transition delay is short enough
  – Transition power is low enough
  – Sleep power is low enough
• When designing a system for a known workload
  – Criteria for component specification and design
Controlling PM systems
• DPM is a control problem: a policy is the control law
  – Collect observations
  – Issue commands
• Optimal control
  – Synthesize the best controller (PM)
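[Figure: the workload and the system on a common time axis t. The workload alternates active and idle periods ($T_{active}^{n-1}$, $T_{idle}^{n-1}$, $T_{idle}^{n}$). The system has an On state with power Pon, an Off state with power Poff, and a shutdown transition with power Psd and delay Tsd.]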
Predictive techniques
• Observe time-varying workload
  – Predict idle period: Tpred ~ Tidle
  – Go to sleep state if Tpred is long enough to amortize state transition cost
• Main issue: prediction accuracy
[Figure: load versus time T, comparing Tpred with Tidle and marking wasted idle time and an incorrect prediction]
Fixed time-out
• Simple policy
  – If Tidle > TTO go to SLEEP
  – Stay in sleep until workload != 0
• Rationale
  – When Tidle > TTO it is likely that Tidle > TTO + TBE
• Choice of TTO is critical
  – Large is safe, but it could be useless
  – Too small is highly undesirable
• Limitations
  – Performance penalty for wake-up is paid after every shutdown
  – Power is wasted during TTO
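The effect of the TTO choice can be seen on a toy trace. The following is a minimal sketch, assuming a simplified cost model that is not in the slides: each shutdown/wake-up pair costs a lump energy `e_tr` and its delay is ignored. The trace and the power values are made up for illustration, and `timeout_energy` is a hypothetical helper.

```python
def timeout_energy(idle_periods, t_to, p_on, p_off, e_tr):
    """Energy (J) spent over idle periods under a fixed time-out policy.

    Simplifying assumption: the shutdown/wake-up pair costs a lump energy
    e_tr and its delay is ignored.
    """
    total = 0.0
    shutdowns = 0
    for t_idle in idle_periods:
        if t_idle > t_to:
            # Stay on for t_to, then sleep for the rest of the idle period.
            total += p_on * t_to + e_tr + p_off * (t_idle - t_to)
            shutdowns += 1
        else:
            # The time-out never expires: the component stays on.
            total += p_on * t_idle
    return total, shutdowns


idle_trace = [0.5, 2.0, 30.0, 1.0, 60.0, 4.0]     # seconds, made-up workload
always_on = sum(idle_trace) * 0.95                # energy with no shutdown at all
for t_to in (1.0, 5.0, 20.0, 100.0):
    energy, n = timeout_energy(idle_trace, t_to, p_on=0.95, p_off=0.13, e_tr=4.76)
    print(f"TTO={t_to:6.1f} s  energy={energy:7.2f} J  shutdowns={n}")
```

A very large TTO never shuts down and so matches the always-on energy ("safe, but useless"), while a very small one triggers shutdowns on short idle periods that cannot repay the transition cost.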
Component-specific time-out
• Choose: TTO = TBE
• Property:
  – Worst-case energy consumption is at most twice that of an oracle policy that knows the future
• Rationale:
  – The worst-case workload has repeated idle periods of length TBE separated by point-wise activity
• Disadvantage:
  – For some components (e.g., hard disks) a small time-out corresponds to many transitions, which affect component reliability
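The factor-of-two property can be checked numerically in a simplified setting. The sketch below assumes instantaneous transitions with a lump energy cost `e_tr`, so that the break-even time reduces to e_tr / (Pon - Poff); the function names and the numeric values are illustrative, not taken from the slides.

```python
def oracle_energy(t_idle, p_on, p_off, e_tr):
    """Best achievable energy when the idle length is known in advance."""
    return min(p_on * t_idle, e_tr + p_off * t_idle)


def timeout_energy(t_idle, t_to, p_on, p_off, e_tr):
    """Energy of a time-out policy for one idle period (lump transition cost)."""
    if t_idle <= t_to:
        return p_on * t_idle
    return p_on * t_to + e_tr + p_off * (t_idle - t_to)


p_on, p_off, e_tr = 0.95, 0.13, 4.76       # illustrative values
t_be = e_tr / (p_on - p_off)               # break-even time in this simplified model

idle_lengths = [0.01 * k for k in range(1, 5001)]    # 0.01 s .. 50 s
worst_ratio = max(
    timeout_energy(t, t_be, p_on, p_off, e_tr) / oracle_energy(t, p_on, p_off, e_tr)
    for t in idle_lengths
)
print(f"worst-case ratio over the sweep: {worst_ratio:.3f}")
```

The worst case occurs for idle periods just longer than TBE, where the policy pays for a full time-out and then for a shutdown that the oracle would have either skipped or issued immediately.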
Predictive shutdown [Srivastava and Brodersen]
• More aggressive than time-out
  – Eliminates power waste caused by TTO
  – Shut down as soon as idle if predicted idle period Tpred > TBE
• Prediction is based on past history
  – Non-idle periods Tactive are observed as well
  – $T_{idle}^{n}$, $T_{active}^{n}$: n-th idle and active periods
Previous active period
[Figure: timeline showing $T_{active}^{n-1}$ followed by $T_{idle}^{n}$, with TBE and Ta-threshold marked]
• If $T_{active}^{n-1}$ < Ta-threshold then go to SLEEP
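The threshold rule can be tried on a trace to see how often it triggers a worthwhile shutdown. This is a sketch only: `threshold_policy` is a hypothetical helper, the trace is invented so that short bursts tend to precede long idle periods, and the threshold and break-even values are arbitrary.

```python
def threshold_policy(active_periods, idle_periods, ta_threshold, t_be):
    """Apply 'sleep if the previous active period was short' to a trace.

    active_periods[n] is the active period that immediately precedes
    idle_periods[n]. Returns (hits, misses): shutdowns where the idle
    period did / did not exceed the break-even time.
    """
    hits = misses = 0
    for t_active, t_idle in zip(active_periods, idle_periods):
        if t_active < ta_threshold:          # predict a long idle period
            if t_idle > t_be:
                hits += 1
            else:
                misses += 1
    return hits, misses


# Made-up trace (seconds).
active = [0.2, 3.0, 0.1, 0.3, 5.0, 0.2]
idle = [12.0, 0.5, 9.0, 1.0, 0.8, 20.0]
hits, misses = threshold_policy(active, idle, ta_threshold=1.0, t_be=5.4)
print(f"useful shutdowns: {hits}, wasted shutdowns: {misses}")
```

Each miss is a shutdown whose idle period was too short to amortize the transition cost, which is the prediction-accuracy issue raised for predictive techniques in general.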
When to use predictive techniques
• When workload has memory
• Implementing predictive schemes
  – Predictor families must be chosen based on workload types
  – Predictor parameters must be tuned to the instance-specific workload statistics
  – When workload is non-stationary or unknown, on-line adaptation is required
Stochastic control approach
• Recognize inherent uncertainty
  – Exact prediction of future events is impossible
  – Abstraction of system model implies uncertainty
• Model components, system and workload as stochastic processes
• Expected values of cost metrics are optimized
Controlled Markov processes
• System and environment modeled as Markov chains
  – System is called service provider (SP)
  – Environment is called service requester (SR)
• SP is a controlled Markov chain
  – State transition probabilities depend on commands
• The power manager (PM) observes the state of the system and issues commands to control its evolution
[Figure: block diagram of SR, SP, and PM]
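One way to picture a controlled Markov chain is as one transition matrix per command. The sketch below is illustrative: the two-state service provider, the command names `s_on` and `s_off`, and all probabilities are assumptions chosen for the example, and time is taken to be discrete.

```python
# Service provider (SP) with two states; one transition matrix per command.
STATES = ("on", "off")
TRANSITIONS = {
    # Row = current state, column = next state; each row sums to 1.
    "s_on":  [[1.0, 0.0],    # already on: stay on
              [0.1, 0.9]],   # off -> on takes 10 time slices on average
    "s_off": [[0.8, 0.2],    # on -> off takes 5 time slices on average
              [0.0, 1.0]],   # already off: stay off
}


def step(distribution, command):
    """One discrete-time step of the chain under the given command."""
    matrix = TRANSITIONS[command]
    return [
        sum(distribution[i] * matrix[i][j] for i in range(len(STATES)))
        for j in range(len(STATES))
    ]


dist = [1.0, 0.0]                 # SP starts in "on" with certainty
for _ in range(3):
    dist = step(dist, "s_off")    # PM keeps issuing the shutdown command
print(dict(zip(STATES, dist)))
```

The command selects which matrix governs the next step, so the power manager shapes the evolution of the chain without determining it exactly.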
Example
[Figure: the power manager applies a policy to the system, which consists of a service requestor (states Active, Idle), a queue (positions 4, 3, 2, 1), and a service provider (states Active, Idle, Sleep)]
Controlled Markov processes: advantages
• The optimum policy is captured by actions that are a function of the system state only
  – Actions are typically non-deterministic, i.e., given as probabilities of issuing a command
  – Control policy is a table
  – Policy implementation is easy
• Computation can be cast as a linear program and solved exactly
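A table-based randomized policy might look like the sketch below. The state names, command names, and probabilities are invented for illustration; in practice the table entries would come from solving the optimization problem, which is not shown here.

```python
import random

# Randomized policy: for each joint system state, the probability of each command.
# States and probabilities are invented for illustration.
POLICY = {
    ("sp_on", "sr_idle"):  {"go_sleep": 0.7, "stay_on": 0.3},
    ("sp_on", "sr_busy"):  {"go_sleep": 0.0, "stay_on": 1.0},
    ("sp_off", "sr_idle"): {"stay_off": 1.0, "wake_up": 0.0},
    ("sp_off", "sr_busy"): {"stay_off": 0.0, "wake_up": 1.0},
}


def issue_command(state, rng):
    """Look up the row for `state` and draw a command with its probabilities."""
    row = POLICY[state]
    commands = list(row)
    weights = [row[c] for c in commands]
    return rng.choices(commands, weights=weights, k=1)[0]


rng = random.Random(0)            # seeded so the run is repeatable
draws = [issue_command(("sp_on", "sr_idle"), rng) for _ in range(10_000)]
sleep_fraction = draws.count("go_sleep") / len(draws)
print(f"fraction of go_sleep commands: {sleep_fraction:.3f}")
```

Running the policy is a table lookup plus one random draw per decision, which is why implementation is considered easy once the table is known.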
Markov models for DPM: outstanding issues
• Discrete-time versus continuous time
  – Avoid periodic evaluation of system state
  – Event-driven paradigm
• Stochastic distributions:
  – Geometric and exponential distribution of events may not fit component
  – Use semi-Markov models
• Non-stationary workloads
  – Use adaptive schemes
Service Requestor - Idle State
[Figure: log-log plot of the tail distribution of idle time (0.0001 to 1) versus time in seconds (1 to 1000), comparing exponential and Pareto fits with experimental data]
$$E_{SR} = 1 - a \cdot t^{-b}$$
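The difference between the two fits is easiest to see in the tails. The sketch below evaluates the tail 1 - E_SR(t) = a·t^(-b) of the Pareto form next to an exponential tail; the parameters `a`, `b`, and `rate` are hypothetical and are not fitted to the plotted data.

```python
import math


def pareto_tail(t, a, b):
    """P(idle > t) for the Pareto form E_SR(t) = 1 - a * t**(-b)."""
    return min(1.0, a * t ** (-b))


def exponential_tail(t, rate):
    """P(idle > t) for an exponential distribution with the given rate."""
    return math.exp(-rate * t)


a, b, rate = 1.0, 1.0, 0.1     # hypothetical parameters, not fitted to the plot
for t in (1, 10, 100, 1000):
    print(f"t={t:5d} s  pareto={pareto_tail(t, a, b):.2e}  "
          f"exponential={exponential_tail(t, rate):.2e}")
```

With these values the exponential tail collapses within a few hundred seconds while the Pareto tail decays only polynomially, so long idle periods remain far more likely under the Pareto model.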
When to use stochastic control
• Constrained power optimization
  – Power/performance trade-off
• Global view of the system
  – Relations between workload and resources
  – Extensions to multiple power states
• For stationary Markov models
  – Optimum policies can be efficiently computed
• For non-stationary Markov models
  – Good heuristic adaptive policies can be derived
Outline
• Introduction
• System conceptualization and modeling
• System design
• System management
  – Dynamic power management techniques
  – Implementation of DPM
• Conclusions
Operating system-based power management
• In systems with an operating system (OS)
  – The OS knows of tasks running and waiting
  – The OS should perform the DPM decisions
• Advanced Configuration and Power Interface (ACPI) [Intel, Microsoft, Toshiba]
  – Open standard to facilitate design of OS-based power management
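ACPI architecture
[Figure: layered block diagram. Applications run on top of the OS, which contains the kernel, power management, device drivers, and the ACPI driver with its AML interpreter. The ACPI table interface, BIOS interface, and register interface connect the OS to the ACPI tables, ACPI BIOS, and ACPI registers. Below them are the BIOS and the platform hardware: motherboard devices, chipset, CPU.]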
State of the art
• ACPI supported by:
– Recent PCs (1999+)
– Windows 2000
• Most PCs still use APM
• Experiments with:
– Sony VAIO PCG-F150
– Power savings: 1.7 X
– Comparable performance
Implementations of DPM
• Shut down idle components
• Gate clock of idle units
• Clock setting and voltage setting
  – Support multiple-voltage, multiple-frequency components
  – Components with multiple working power states
DPM and operating systems
• Task scheduling:
  – Schedule tasks with variable latencies (due to clock setting) while meeting timing constraints
  – Schedule tasks to increase efficiency of DPM, by grouping idle periods of resources
• Exploit interaction among application programs, OS scheduling and hardware, to improve power manageability
DPM: summary
• DPM exploits variations in workload:
  – Analysis of workload statistics and component parameters is required to determine its applicability
• DPM policy choices:
  – Predictive methods and stochastic control
• Implementation choices:
  – Shutdown, clock gating and setting
  – Strong interaction between software layers and hardware
• New requirements for operating systems
Conclusions
• System-level energy-efficient design is an area of research with many opportunities
  – Many interesting ideas have been proposed
  – Design tools will not solve the entire problem
• Global view of the system is required:
  – Technology: pushing the limit of low voltage, noise, reliability, frequency, performance, …
  – Hardware architecture: systems with heterogeneous components and multi-threaded execution
  – Software: operating systems and compilers