ECE 260C – VLSI Advanced Topics: Term Paper Presentation
May 27, 2014
Keyuan Huang
Ngoc Luong
Low Power Processor Architectures and Software Optimization Techniques
Motivation
~10 billion mobile devices expected by 2018
Moore’s law is slowing down
Power dissipation per gate remains unchanged
How to reduce power?
Circuit level optimizations (DVFS, power gating, clock gating)
Microarchitecture optimization techniques
Compiler optimization techniques
[Chart: Global Mobile Devices and Connections Growth]
Trend: More innovations on architectural and software techniques to optimize power consumption
Low Power Architectures Overview
Asynchronous Processors
Eliminate the clock and use handshake protocols instead
Saves clock power but at the cost of higher area
Ex: SNAP, ARM996HS, Sun Sproull
Application Specific Instruction Set Processors
Applications: cryptography, signal processing, vector processing, physical simulation, computer graphics
Combine basic instructions with custom instruction based on application
Ex: Tensilica’s Xtensa, Altera’s NIOS, Xilinx MicroBlaze, Sony’s Cell, IRAM, Intel’s EXOCHI
Reconfigurable Instruction Set Processors
Combine fixed core with reconfigurable logic (FPGA)
Lower NRE cost than an ASIP
Ex: Chimaera, GARP, PRISC, Warp, Tensilica’s Stenos, OptimoDE, PICO
No Instruction Set Computer
Build custom datapath based on application code
Compiler has low-level control of hardware resource
Ex: WISHBONE system
Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).
Conservation Cores (C-cores)
Combine a GP processor with ASIP-style specialized cores to reduce energy and energy-delay for a range of applications
Broader range of applications compared to a fixed-function accelerator
Reconfigurable via a patching algorithm
Automatically synthesizable by a toolchain from C source code
Energy consumption reduced by up to 16x for targeted functions and 2.1x for the whole application
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
C-core organization
Data path (FU, mux, register)
Control unit (state machine)
Cache interface (ld, st)
Scan chain (CPU interface)
C-core execution
The compiler inserts stubs into code regions that are compatible with a c-core
At run time, execution chooses between the c-core and the CPU: if a matching c-core is available it executes the region, otherwise the general-purpose processor runs the original code
The c-core raises an exception when it finishes executing and returns the result to the CPU
Patching support
Basic block mapping
Control flow mapping
Register mapping
Patch generation
Patching Example
Configurable constants
Generalized single-cycle datapath operators
Control flow changes
Results: 18 fully placed-and-routed c-cores vs. a MIPS baseline
3.3x – 16x energy efficiency improvement
System energy consumption reduced by up to 47%
Energy-delay reduced by up to 55% at the full-application level
Even higher energy savings without patching support
Software Optimization Techniques
The memory system consumes roughly 1/10 to 1/4 of total power in portable computers
System bus switching activity can be controlled by software
ALU and FPU data paths need good scheduling to avoid pipeline stalls
Control logic and clock power are reduced by using the shortest possible program for the computation
K. Roy and M. C. Johnson, "Software Design for Low Power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.
General categories of software optimization
Minimizing memory accesses
Minimize the accesses needed by the algorithm
Minimize the total memory size needed by the algorithm
Use multiple-word parallel loads, not single-word loads
Optimal selection and sequencing of machine instructions
Instruction packing
Minimizing circuit-state effects
Operand swapping
Hardware techniques: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
- Limited program information
Software techniques: software-controlled scratch-pad, data/code reorganization
+ Whole-program information
+ Proactive
- Conservative
Combining both gives global program knowledge, proactive optimizations, and efficient execution
Basic Idea: Compiler Managed, Hardware Assisted
Compiler Managed Partitioned Data Caches for Low Power (Rajiv Ravindran, Michael Chu, Scott Mahlke)
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007.
Traditional Cache Architecture
[Figure: 4-way set-associative cache; the address splits into tag/set/offset, every way's tag/data/LRU arrays are probed (=?), and a 4:1 mux selects the hit way; replacement chooses among all ways]
Lookup: activate all ways on every access
Replacement: choose among all the ways
Disadvantages:
Fixed replacement policy
Set index carries no program locality
Set-associativity has high overhead
Activates multiple data/tag arrays per access
Partitioned Cache Architecture
[Figure: the same 4-way array, now divided into partitions P0–P3; each load/store carries a k-bit partition vector]
Ld/St Reg [Addr] [k-bitvector] [R/U]
Lookup: restricted to the partitions specified in the bit-vector if 'R' (restricted); defaults to all partitions if 'U'
Replacement: restricted to the partitions specified in the bit-vector
Advantages:
Improve performance by controlling replacement
Reduce cache access power by restricting the number of partitions accessed
Partitioned Caches: Example

(a) Annotated code segment (memory operations ld1/st1, ld2/st2 on y; ld3, ld4 on x; ld5, ld6 on w1/w2):

for (i = 0; i < N1; i++) {
  …
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];
}

(b) Fused load/store instructions: ld1 [100], R; ld5 [010], R; ld3 [001], R
(c) Trace consisting of array references, cache blocks, and load/stores from the example
(d) Actual cache partition assignment for each instruction: y (ld1, st1, ld2, st2) in part-0; w1/w2 (ld5, ld6) in part-1; x (ld3, ld4) in part-3
Compiler Controlled Data Partitioning
Goal: place loads/stores into cache partitions
Analyze the application's memory characteristics:
Cache requirements (number of partitions per ld/st)
Predict conflicts
Place loads/stores into different partitions:
Satisfy each instruction's caching needs
Avoid conflicts; overlap if possible
Cache Analysis: Estimating Number of Partitions
Reference trace (j-loop, then k-loop): X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y
M has reuse distance = 1
Use the reuse distance to compute the number of partitions
Minimal partitions to avoid conflict/capacity misses
Probabilistic hit-rate estimate
Cache Analysis: Estimating Number of Partitions (cont.)
[Figure: estimated hit rate vs. number of cache blocks (8, 16, 24, 32) for reuse distances D = 0, 1, 2; D = 0 always hits, while D = 1 and D = 2 start near 0.87 and 0.76 at 8 blocks and approach 1 as the block count grows]
Avoid conflict/capacity misses for an instruction
Estimate the hit-rate based on reuse distance (D), total number of cache blocks (B), and associativity (A) (Brehob et al., '99)
In practice, compute energy matrices and pick the most energy-efficient configuration per instruction
Cache Analysis: Computing Interferences
Avoid conflicts among temporally co-located references
Model conflicts using an interference graph
Trace X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y maps to memory operations M4 M2 M1 M1 M4 M2 M1 M1 M4 M3 M1 M1 M4 M3 M1 M1, with each of M1–M4 having reuse distance D = 1
Partition Assignment
The placement phase can overlap references
Compute the combined working set using the graph-theoretic notion of a clique
For each clique, the new D = Σ of the D values of its nodes
The combined D for all overlaps = Max over all cliques
Example (M1–M4, each with D = 1):
Clique 1: M1, M2, M4, new reuse distance D = 3
Clique 2: M1, M3, M4, new reuse distance D = 3
Combined reuse distance = Max(3, 3) = 3
Actual cache partition assignment for each instruction: y (ld1, st1, ld2, st2) in part-0; w1/w2 (ld5, ld6) in part-1; x (ld3, ld4) in part-2, matching the fused instructions ld1 [100] R, ld5 [010] R, ld3 [001] R
Experimental Setup
Trimaran compiler and simulator infrastructure
ARM9 processor model
Cache configurations:
1-KB to 32-KB
32-byte block size
2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache
Mediabench suite
CACTI for cache energy modeling
Reduction in Tag & Data-Array Checks
[Figure: average way accesses (0 to 8) vs. cache size (1-KB to 32-KB, plus the average) for 8-, 4-, and 2-partition caches]
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007.
25%, 30%, 36% reduction in way accesses on 2-, 4-, and 8-partition caches, respectively
Improvement in Fetch Energy
[Figure: percentage energy improvement (0 to 60) per Mediabench benchmark (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and the average, comparing 2-part vs 2-way, 4-part vs 4-way, and 8-part vs 8-way; 16-KB cache]
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007.
8%, 16%, 25% energy reduction on 2-, 4-, and 8-partition caches, respectively
Summary
Maintain the advantages of a hardware cache
Expose placement and lookup decisions to the compiler
Avoid conflicts, eliminate redundancies
Achieve higher performance and lower power consumption
Future Work
Hybrid scratch-pad and caches
Develop an advanced toolchain for newer technology nodes such as 28 nm
Incorporate data-cache partitioning into the compiler of the ASIP toolchain
References
1. Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).
2. Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News, Vol. 38, No. 1. ACM, 2010.
3. Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007.
4. K. Roy and M. C. Johnson, "Software Design for Low Power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.