ECE 260C – VLSI Advanced Topics: Term Paper Presentation
May 27, 2014
Keyuan Huang
Ngoc Luong
Low Power Processor Architectures and Software Optimization Techniques
Motivation
~10 billion mobile devices expected by 2018
Moore’s law is slowing down
Power dissipation per gate remains unchanged
How to reduce power?
Circuit level optimizations (DVFS, power gating, clock gating)
Microarchitecture optimization techniques
Compiler optimization techniques
[Chart: Global Mobile Devices and Connections Growth]
Trend: More innovations on architectural and software techniques to optimize power consumption
Low Power Architectures Overview
Asynchronous Processors
Eliminate the clock and use handshake protocols instead
Saves clock power but at the cost of higher area
Ex: SNAP, ARM996HS, Sun Sproull
Application Specific Instruction Set Processors
Applications: cryptography, signal processing, vector processing, physical simulation, computer graphics
Combine basic instructions with custom instruction based on application
Ex: Tensilica’s Xtensa, Altera’s NIOS, Xilinx MicroBlaze, Sony’s Cell, IRAM, Intel’s EXOCHI
Reconfigurable Instruction Set Processors
Combine fixed core with reconfigurable logic (FPGA)
Lower NRE cost than an ASIP
Ex: Chimaera, GARP, PRISC, Warp, Tensilica’s Stenos, OptimoDE, PICO
No Instruction Set Computer
Build custom datapath based on application code
Compiler has low-level control of hardware resource
Ex: WISHBONE system
Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).
Conservation Cores (C-cores)
Combine a GP processor with ASIP-style specialized cores to reduce energy and energy-delay for a range of applications
Broader range of applications compared to a fixed-function accelerator
Reconfigurable via a patching algorithm
Automatically synthesizable by a toolchain from C source code
Energy consumption reduced by up to 16x for targeted functions and 2.1x for the whole application
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
C-core organization
Data path (FU, mux, register)
Control unit (state machine)
Cache interface (ld, st)
Scan chain (CPU interface)
C-core execution
The compiler inserts stubs into code regions that are compatible with a c-core
At run time, execution chooses between the c-core and the CPU: if a matching c-core is available it executes the region, otherwise the general-purpose processor runs the original code
The c-core raises an exception when it finishes executing and returns the result to the CPU
Patching support
Basic block mapping
Control flow mapping
Register mapping
Patch generation
Patching Example
Configurable constants
Generalized single-cycle datapath operators
Control flow changes
Results: 18 fully placed-and-routed c-cores vs. a MIPS baseline
3.3x – 16x energy efficiency improvement
System energy consumption reduced by up to 47%
Energy-delay reduced by up to 55% at the full-application level
Even higher energy savings without patching support
Software Optimization Techniques
The memory system consumes roughly 1/10 to 1/4 of total power in portable computers
System bus switching activity can be controlled by software
ALU and FPU data paths need good scheduling to avoid pipeline stalls
Control logic and clock power are reduced by using the shortest possible program for the computation
K. Roy and M. C. Johnson, "Software Design for Low Power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.
General categories of software optimization
Minimizing memory accesses
Minimize the accesses needed by the algorithm
Minimize the total memory size needed by the algorithm
Use multiple-word parallel loads, not single-word loads
Optimal selection and sequencing of machine instructions
Instruction packing
Minimizing circuit-state effects
Operand swapping
Hardware techniques: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
- Limited program information
Software techniques: software-controlled scratch-pad, data/code reorganization
+ Whole-program information
+ Proactive
- Conservative
Combining both gives global program knowledge, proactive optimizations, and efficient execution
Basic Idea: Compiler Managed, Hardware Assisted
Compiler Managed Partitioned Data Caches for Low Power (Rajiv Ravindran, Michael Chu, Scott Mahlke)
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007.
Traditional Cache Architecture
[Figure: 4-way set-associative cache; the address splits into tag/set/offset, every way's tag/data/LRU arrays are probed (=?), and a 4:1 mux selects the hit way; replacement chooses among all ways]
Lookup: activate all ways on every access
Replacement: choose among all the ways
Disadvantages:
Fixed replacement policy
Set index carries no program locality
Set-associativity has high overhead
Activates multiple data/tag arrays per access
Partitioned Cache Architecture
[Figure: the same 4-way array, now divided into partitions P0–P3; each load/store carries a k-bit partition vector]
Ld/St Reg [Addr] [k-bitvector] [R/U]
Lookup: restricted to the partitions specified in the bit-vector if 'R' (restricted); defaults to all partitions if 'U'
Replacement: restricted to the partitions specified in the bit-vector
Advantages:
Improve performance by controlling replacement
Reduce cache access power by restricting the number of partitions accessed
Partitioned Caches: Example

(a) Annotated code segment (memory operations ld1/st1, ld2/st2 on y; ld3, ld4 on x; ld5, ld6 on w1/w2):

for (i = 0; i < N1; i++) {
  …
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];
}

(b) Fused load/store instructions: ld1 [100], R; ld5 [010], R; ld3 [001], R
(c) Trace consisting of array references, cache blocks, and load/stores from the example
(d) Actual cache partition assignment for each instruction: y (ld1, st1, ld2, st2) in part-0; w1/w2 (ld5, ld6) in part-1; x (ld3, ld4) in part-3
Compiler Controlled Data Partitioning
Goal: place loads/stores into cache partitions
Analyze the application's memory characteristics:
Cache requirements (number of partitions per ld/st)
Predict conflicts
Place loads/stores into different partitions:
Satisfy each instruction's caching needs
Avoid conflicts; overlap if possible
Cache Analysis: Estimating Number of Partitions
Reference trace (j-loop, then k-loop): X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y
M has reuse distance = 1
Use the reuse distance to compute the number of partitions
Minimal partitions to avoid conflict/capacity misses
Probabilistic hit-rate estimate
Cache Analysis: Estimating Number of Partitions (cont.)
[Figure: estimated hit rate vs. number of cache blocks (8, 16, 24, 32) for reuse distances D = 0, 1, 2; D = 0 always hits, while D = 1 and D = 2 start near 0.87 and 0.76 at 8 blocks and approach 1 as the block count grows]
Avoid conflict/capacity misses for an instruction
Estimate the hit-rate based on reuse distance (D), total number of cache blocks (B), and associativity (A) (Brehob et al., '99)
In practice, compute energy matrices and pick the most energy-efficient configuration per instruction
Cache Analysis: Computing Interferences
Avoid conflicts among temporally co-located references
Model conflicts using an interference graph
Trace X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y maps to memory operations M4 M2 M1 M1 M4 M2 M1 M1 M4 M3 M1 M1 M4 M3 M1 M1, with each of M1–M4 having reuse distance D = 1
Partition Assignment
The placement phase can overlap references
Compute the combined working set using the graph-theoretic notion of a clique
For each clique, the new D = Σ of the D values of its nodes
The combined D for all overlaps = Max over all cliques
Example (M1–M4, each with D = 1):
Clique 1: M1, M2, M4, new reuse distance D = 3
Clique 2: M1, M3, M4, new reuse distance D = 3
Combined reuse distance = Max(3, 3) = 3
Actual cache partition assignment for each instruction: y (ld1, st1, ld2, st2) in part-0; w1/w2 (ld5, ld6) in part-1; x (ld3, ld4) in part-2, matching the fused instructions ld1 [100] R, ld5 [010] R, ld3 [001] R
Experimental Setup
Trimaran compiler and simulator infrastructure
ARM9 processor model
Cache configurations:
1-KB to 32-KB
32-byte block size
2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache
Mediabench suite
CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks
[Figure: average way accesses (0 to 8) vs. cache size (1-KB to 32-KB, plus the average) for 8-, 4-, and 2-partition caches]
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007.
25%, 30%, 36% reduction in way accesses on 2-, 4-, and 8-partition caches, respectively
Improvement in Fetch Energy
[Figure: percentage energy improvement (0 to 60) per Mediabench benchmark (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and the average, comparing 2-part vs 2-way, 4-part vs 4-way, and 8-part vs 8-way; 16-KB cache]
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007.
8%, 16%, 25% energy reduction on 2-, 4-, and 8-partition caches, respectively
Summary
Maintain the advantages of a hardware cache
Expose placement and lookup decisions to the compiler
Avoid conflicts, eliminate redundancies
Achieve higher performance and lower power consumption
Future Work
Hybrid scratch-pad and caches
Develop an advanced toolchain for newer technology nodes such as 28 nm
Incorporate data-cache partitioning into the compiler of the ASIP toolchain
References
1. Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).
2. Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News, Vol. 38, No. 1. ACM, 2010.
3. Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007.
4. K. Roy and M. C. Johnson, "Software Design for Low Power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.