SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Page 1: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Amit Pabalkar, Aviral Shrivastava, Arun Kannan and Jongeun Lee
Compiler and Micro-architecture Lab
School of Computing and Informatics
Arizona State University

High Performance Computing (HIPC), December 2008

http://www.public.asu.edu/~ashriva6

Page 2: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Agenda

• Motivation
• SPM Advantage
• SPM Challenges
• Previous Approach
• Code Mapping Technique
• Results
• Continuing Effort

Page 3: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Motivation - The Power Trend

• Cache consumes around 44% of total processor power.
• Cache architecture cannot scale on a many-core processor because of the performance degradation attributed to cache coherency.
• Within the same process technology, a new processor design with 1.5x to 1.7x performance consumes 2x to 3x the die area [1] and 2x to 2.5x the power [2].
• For a particular process technology with a fixed transistor budget, performance/power and performance/unit area scale with the number of cores.

Page 4: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Scratchpad Memory (SPM)

• High-speed SRAM internal memory for the CPU
• SPM falls at the same level as the L1 caches in the memory hierarchy
• Directly mapped to the processor's address space
• Used for temporary storage of data and code in progress, for single-cycle access by the CPU

Page 5: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

The SPM Advantage

• 40% less energy as compared to cache
  ▫ Absence of tag arrays, comparators and muxes
• 34% less area as compared to a cache of the same size
  ▫ Simple hardware design (only a memory array & address decoding circuitry)
• Faster access to SPM than to a physically indexed and tagged cache

[Figure: energy per access [nJ] versus memory size (256 to 16384) for a scratchpad and for 2-way caches with 4 GB, 16 MB, and 1 MB address spaces. Block diagrams: a cache consists of a data array, tag array, tag comparators, muxes, and an address decoder; an SPM needs only its memory array and an address decoder.]

Page 6: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Challenges in using SPMs

• Application has to explicitly manage SPM contents
  ▫ Code/data mapping is transparent in cache-based architectures
• Mapping challenges
  ▫ Partitioning the available SPM resource among different data
  ▫ Identifying data that will benefit from placement in SPM
  ▫ Minimizing data movement between SPM and external memory
  ▫ Optimal data allocation is an NP-complete problem
• Binary compatibility
  ▫ Application is compiled for a specific SPM size
• Sharing SPM in a multi-tasking environment

Need completely automated solutions (read: compiler solutions)

Page 7: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Using SPM

Original Code

int global;

FUNC2() {
    int a, b;
    global = a + b;
}

FUNC1() {
    FUNC2();
}

SPM Aware Code

int global;

FUNC2() {
    int a, b;
    DSPM.fetch.dma(global)
    global = a + b;
    DSPM.writeback.dma(global)
}

FUNC1() {
    ISPM.overlay(FUNC2)
    FUNC2();
}

Page 8: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Previous Work

• Static techniques [3, 4]: contents of SPM do not change during program execution, so there is less scope for energy reduction.
• Profiling is widely used but has some drawbacks [3, 4, 5, 6, 7, 8]
  ▫ Profile may depend heavily on the input data set
  ▫ Profiling an application as a pre-processing step may be infeasible for many large applications
  ▫ It can be a time-consuming, complicated task
• ILP solutions do not scale well with problem size [3, 5, 6, 8]
• Some techniques demand architectural changes in the system [6, 10]

Page 9: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Code Allocation on SPM

• What to map?
  ▫ Segregation of code into cache and SPM
  ▫ Eliminates code whose penalty is greater than profit
    - No benefits in an architecture with a DMA engine
  ▫ Not an option in many architectures, e.g. CELL
• Where to map?
  ▫ Address on the SPM where a function will be mapped and fetched from at runtime
  ▫ To use the SPM efficiently, it is divided into bins/regions and functions are mapped to regions
    - What are the sizes of the SPM regions?
    - What is the mapping of functions to regions?
  ▫ The two problems, if solved independently, lead to sub-optimal results

Our approach is a pure software dynamic technique based on static analysis addressing the 'where to map' issue. It simultaneously solves the region size and function-to-region mapping sub-problems.

Page 10: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Problem Formulation

• Input
  ▫ Set $V = \{v_1, v_2, \ldots, v_f\}$ of functions
  ▫ Set $S = \{s_1, s_2, \ldots, s_f\}$ of function sizes
  ▫ $E_{spm/access}$ and $E_{cache/access}$
  ▫ $E_{mbst}$: energy per burst for the main memory
  ▫ $E_{ovm}$: energy consumed by an overlay manager instruction
• Output
  ▫ Set $\{S_1, S_2, \ldots, S_r\}$ representing the sizes of regions $R = \{R_1, R_2, \ldots, R_r\}$ such that $\sum S_r \le \text{SPM-SIZE}$
  ▫ Function-to-region mapping $X[f,r] = 1$ if function $f$ is mapped to region $r$, such that $\sum s_f \times X[f,r] \le S_r$
• Objective Function
  ▫ Minimize energy consumption
    $E^{hit}_{v_i} = n^{hit}_{v_i} \times (E_{ovm} + E_{spm/access} \times s_i)$
    $E^{miss}_{v_i} = n^{miss}_{v_i} \times (E_{ovm} + E_{spm/access} \times s_i + E_{mbst} \times (s_i + s_j) / N_{mbst})$
    $E_{total} = \sum (E^{hit}_{v_i} + E^{miss}_{v_i})$
  ▫ Maximize runtime performance
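The energy objective on this slide can be evaluated directly once the counts and per-access energies are known. The following is a minimal Python sketch, not the authors' tool: the function name, parameter names, and sample numbers are hypothetical, and the closing parenthesis of the miss term is placed at the end of the expression, which is an assumption because the slide text leaves it open.

```python
def function_energy(n_hit, n_miss, s_i, s_j,
                    e_ovm, e_spm_access, e_mbst, n_mbst):
    """Energy attributed to one function v_i under the slide's objective.

    s_i is the size of v_i; s_j is the size of the function it displaces
    (an interpretation of the slide's s_j, not stated there explicitly).
    """
    # Hit: overlay-manager overhead plus SPM accesses for the function body.
    e_hit = n_hit * (e_ovm + e_spm_access * s_i)
    # Miss: same as a hit plus main-memory bursts for (s_i + s_j).
    e_miss = n_miss * (e_ovm + e_spm_access * s_i
                       + e_mbst * (s_i + s_j) / n_mbst)
    return e_hit + e_miss


def total_energy(per_function_args):
    """Sum the hit and miss energies over all functions."""
    return sum(function_energy(**args) for args in per_function_args)
```

With illustrative values (90 hits, 10 misses, sizes 4 and 3, E_ovm = 1.0, E_spm/access = 0.5, E_mbst = 8.0, N_mbst = 4) the hit part is 270 and the miss part is 170, so the function contributes 440 energy units.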

Page 11: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Overview

[Figure: tool-flow diagram with boxes labeled Application, Compiler Framework, Static Analysis, GCCFG, Weight Assignment, Interference Graph, SDRM Heuristic/ILP, Function Region Mapping, Link Phase, Instrumented Binary, Cycle Accurate Simulation, Energy Statistics, and Performance Statistics.]

Page 12: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Limitations of Call Graph

• Limitations
  ▫ No information on relative ordering among nodes (call sequence)
  ▫ No information on execution count of functions

MAIN ( )
  F1 ( )
  for
    F2 ( )
  end for
END MAIN

F5 (condition)
  if (condition)
    condition = …
    F5 ( )
  end if
END F5

F2 ( )
  for
    F6 ( )
    F3 ( )
    while
      F4 ( )
    end while
  end for
  F5 ( )
END F2

[Figure: Call Graph with nodes main, F1, F2, F3, F4, F5, and F6.]

Page 13: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Global Call Control Flow Graph

MAIN ( )
  F1 ( )
  for
    F2 ( )
  end for
END MAIN

F5 (condition)
  if (condition)
    condition = …
  else
    F5 (condition)
  end if
END F5

F2 ( )
  for
    F6 ( )
    F3 ( )
    while
      F4 ( )
    end while
  end for
  if ()
    F5 ( )
  else
    F1 ()
  end if
END F2

[Figure: GCCFG for the code above, with F-nodes main and F1 to F6, loop nodes L1 to L3, and condition nodes I1 and I2 with T/F edges. Node weights shown on the slide include 10, 20, 100, and 1000.]

• Advantages
  ▫ Strict ordering among the nodes: the left child is called before the right child
  ▫ Control information included (L-nodes and I-nodes)
  ▫ Node weights indicate execution count of functions
  ▫ Recursive functions identified

Loop Factor: 10
Recursion Factor: 2
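One way to read the weight assignment is that a function's weight is its static execution-count estimate: each enclosing loop multiplies it by the loop factor, and a recursive function is multiplied by the recursion factor. The Python sketch below follows that reading. It is an assumption about how the weights arise, not the authors' implementation; the `Node` class, `assign_weights`, and the example tree (taken from the simpler code listing on the Call Graph slide) are all hypothetical.

```python
from dataclasses import dataclass, field

LOOP_FACTOR = 10       # from the slide
RECURSION_FACTOR = 2   # from the slide


@dataclass
class Node:
    kind: str                      # "F" function, "L" loop, "I" condition
    name: str
    children: list = field(default_factory=list)
    recursive: bool = False


def assign_weights(node, incoming=1, weights=None):
    """Walk the tree top-down and record a weight for every F-node."""
    if weights is None:
        weights = {}
    w = incoming * (RECURSION_FACTOR if node.recursive else 1)
    if node.kind == "F":
        weights[node.name] = weights.get(node.name, 0) + w
    # Children of a loop node run LOOP_FACTOR times per loop entry
    # (assumed); I-nodes and F-nodes pass their weight through unchanged.
    child_w = w * LOOP_FACTOR if node.kind == "L" else w
    for child in node.children:
        assign_weights(child, child_w, weights)
    return weights


# main calls F1, then loops over F2; F2 loops over F6, F3 and an inner
# loop around F4, then calls the recursive F5.
F2 = Node("F", "F2", [
    Node("L", "L2", [Node("F", "F6"), Node("F", "F3"),
                     Node("L", "L3", [Node("F", "F4")])]),
    Node("F", "F5", recursive=True),
])
MAIN = Node("F", "main", [Node("F", "F1"), Node("L", "L1", [F2])])
```

Under these assumptions `assign_weights(MAIN)` gives F2 = 10, F3 = F6 = 100, F4 = 1000, and F5 = 20, which agrees with the weights visible in the slide's figure.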

Page 14: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

• Create Interference Graph
  ▫ Nodes of the I-Graph are functions, or F-nodes, from the GCCFG
  ▫ There is an edge between two F-nodes if they interfere with each other
• The edges are classified as
  ▫ Caller-Callee-no-loop
  ▫ Caller-Callee-in-loop
  ▫ Callee-Callee-no-loop
  ▫ Callee-Callee-in-loop
• Assign weights to edges of the I-Graph
  ▫ Caller-Callee-no-loop: cost[i,j] = (si + sj) x wj
  ▫ Caller-Callee-in-loop: cost[i,j] = (si + sj) x wj
  ▫ Callee-Callee-no-loop: cost[i,j] = (si + sj) x wk, where wk = MIN (wi, wj)
  ▫ Callee-Callee-in-loop: cost[i,j] = (si + sj) x wk, where wk = MIN (wi, wj)

[Figure: GCCFG (node weights 10, 20, 100, 1000) and the Interference Graph derived from it over F1 to F6, with edge weights 3000, 400, 700, 500, 500, 600, and 120. Edges are drawn as Caller-Callee-no-loop, Caller-Callee-in-loop, or Callee-Callee-in-loop.]

routines  Size
F2        2
F3        3
F4        1
F6        4
F1        2
F5        4
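The edge-weight rules can be checked against the numbers in the figure. In this Python sketch the sizes come from the slide's table, while the weights and the pairing of each edge with a function pair are assumptions chosen to be consistent with the GCCFG figure; `edge_cost` is a hypothetical helper name.

```python
# Function sizes from the slide's table.
SIZES = {"F1": 2, "F2": 2, "F3": 3, "F4": 1, "F5": 4, "F6": 4}

# Execution-count weights (assumed, read off the GCCFG figure).
WEIGHTS = {"F2": 10, "F3": 100, "F4": 1000, "F5": 20, "F6": 100}


def edge_cost(i, j, kind):
    """Interference cost between functions i and j.

    For caller-callee edges, i is the caller and j the callee, so the
    callee's weight w_j is used. For callee-callee edges the smaller of
    the two weights is used. The in-loop and no-loop variants share the
    same formula on the slide.
    """
    if kind.startswith("caller-callee"):
        w = WEIGHTS[j]
    elif kind.startswith("callee-callee"):
        w = min(WEIGHTS[i], WEIGHTS[j])
    else:
        raise ValueError(f"unknown edge kind: {kind}")
    return (SIZES[i] + SIZES[j]) * w
```

For example, `edge_cost("F2", "F4", "caller-callee-in-loop")` is (2 + 1) x 1000 = 3000 and `edge_cost("F6", "F3", "callee-callee-in-loop")` is (4 + 3) x 100 = 700, both of which appear as edge labels in the figure.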

Page 15: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

SDRM Heuristic

Suppose SPM size is 7KB

[Figure: Interference Graph over F2, F3, F4, and F6 with edge weights 3000, 400, 700, 500, 500, and 600, animated on the slide as regions R1, R2, and R3 are formed in an SPM bar marked 1 to 7. An intermediate frame shows a candidate region F4,F3 with size 3 and cost 400.]

routines  Size
F2        2
F3        3
F4        1
F6        4

Region  Routine  Size  Cost
R1      F2       2     0
R2      F4       1     0
R3      F6,F3    4     700
Total            7     700
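The slide does not spell out the steps of the SDRM heuristic, so the sketch below substitutes a plain brute-force search over all groupings of the four functions. It only illustrates the problem being solved: a region is as large as its largest function, the region sizes must fit in the SPM, and the cost is the sum of interference edges between functions that share a region. The edge-to-pair assignment is an assumption derived from the cost formulas, and all names are hypothetical.

```python
from itertools import combinations

SIZES = {"F2": 2, "F3": 3, "F4": 1, "F6": 4}

# Interference edges (assumed pairing of the weights shown on the slide).
EDGES = {
    frozenset(("F2", "F4")): 3000,
    frozenset(("F3", "F4")): 400,
    frozenset(("F6", "F3")): 700,
    frozenset(("F6", "F4")): 500,
    frozenset(("F2", "F3")): 500,
    frozenset(("F2", "F6")): 600,
}


def partitions(items):
    """Yield every way to split items into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # first in its own group
        for i in range(len(part)):                  # or joined to a group
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def region_cost(group):
    """Sum of interference edges between functions sharing one region."""
    return sum(EDGES.get(frozenset(p), 0) for p in combinations(group, 2))


def best_mapping(spm_size):
    """Return (cost, regions) with minimal cost that fits, or None."""
    best = None
    for part in partitions(list(SIZES)):
        total = sum(max(SIZES[f] for f in g) for g in part)
        if total > spm_size:
            continue
        cost = sum(region_cost(g) for g in part)
        if best is None or cost < best[0]:
            best = (cost, part)
    return best
```

For an SPM size of 7, `best_mapping(7)` returns cost 700 with F6 and F3 sharing one region and F2 and F4 in regions of their own, the same mapping as the slide's final table. Exhaustive search grows too quickly for real programs, which is the motivation for a heuristic.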

Page 16: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Flow Recap

[Figure: the tool-flow diagram from the Overview slide, repeated: Application, Compiler Framework, Static Analysis, GCCFG, Weight Assignment, Interference Graph, SDRM Heuristic/ILP, Function Region Mapping, Link Phase, Instrumented Binary, Cycle Accurate Simulation, Energy Statistics, and Performance Statistics.]

Page 17: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Overlay Manager

F1() {
    ISPM.overlay(F3)
    F3();
}

F3() {
    ISPM.overlay(F2)
    F2()
    …
    ISPM.return
}

main …. F1 F3 F2

Overlay Table

ID  Region  VMA      LMA       Size
F1  0       0x30000  0xA00000  0x100
F2  0       0x30000  0xA00100  0x200
F3  1       0x30200  0xA00300  0x1000
F4  1       0x30200  0xA01300  0x300
F5  2       0x31200  0xA01600  0x500

Region Table

Region  ID
0       F1 (animated on the slide: F2, then F1 again)
1       F3
2       F5
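The two tables suggest how an overlay call can be resolved: look up the function's region in the overlay table, check the region table to see whether the function is already resident, and copy it from its load address (LMA) to the region's SPM address (VMA) only if it is not. The Python sketch below models that bookkeeping. The class and method names are hypothetical, the table values are the slide's as best they can be read from the transcript, and the actual DMA transfer is reduced to a byte counter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverlayEntry:
    region: int
    vma: int    # address in the SPM where the function executes
    lma: int    # address in main memory where the function is stored
    size: int


OVERLAY_TABLE = {
    "F1": OverlayEntry(0, 0x30000, 0xA00000, 0x100),
    "F2": OverlayEntry(0, 0x30000, 0xA00100, 0x200),
    "F3": OverlayEntry(1, 0x30200, 0xA00300, 0x1000),
    "F4": OverlayEntry(1, 0x30200, 0xA01300, 0x300),
    "F5": OverlayEntry(2, 0x31200, 0xA01600, 0x500),
}


class OverlayManager:
    def __init__(self, table):
        self.table = table
        self.region_table = {}   # region -> ID of the resident function
        self.bytes_copied = 0

    def overlay(self, func):
        """Make func resident; return True if a copy was needed."""
        entry = self.table[func]
        if self.region_table.get(entry.region) == func:
            return False                      # already in the SPM
        # Stand-in for the DMA copy from entry.lma to entry.vma.
        self.region_table[entry.region] = func
        self.bytes_copied += entry.size
        return True
```

Replaying the slide's call sequence (F1, F3, F2) loads F2 into region 0 and evicts F1, so returning to F1 triggers another copy. That reload is the cost the interference graph tries to capture for functions sharing a region.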

Page 18: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Performance Degradation

• Scratchpad Overlay Manager is mapped to cache
• Branch Target Table has to be cleared between function overlays to the same region
• Transfer of code from main memory to SPM is on demand

FUNC1( ) {
    computation …
    ISPM.overlay(FUNC2)
    FUNC2();
}

FUNC1( ) {
    ISPM.overlay(FUNC2)
    computation …
    FUNC2();
}

Page 19: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

SDRM-prefetch

MAIN ( )
  F1 ( )
  for
    F2 ( )
  end for
END MAIN

F5 (condition)
  if (condition)
    F5 ()
  end if
END F5

F2 ( )
  for
    computation
    F6 ( )
    computation
    F3 ( )
    while
      F4 ( )
    end while
  end for
  computation
  F5 ( )
END F2

[Figure: GCCFG for the code above, with F-nodes main and F1 to F6, loop nodes L1 to L3, and computation nodes C1 to C3. Labels on the slide include Q = 10, C = 10, and node weights 1, 10, 100, and 1000.]

Modified Cost Function

• $cost_p[v_i, v_j] = (s_i + s_j) \times \min(w_i, w_j) \times \text{latency cycles/byte} - (C_i + C_j)$
• $cost[v_i, v_j] = cost_e[v_i, v_j] \times cost_p[v_i, v_j]$

Region  SDRM                             SDRM-prefetch
0       F1 (later frames: F2; F2,F1)     F1 (later frames: F2; F2,F1)
1       F4,F5                            F4
2       F3                               F3,F6
3       F6                               F5
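The modified cost function can be written out as two small functions. This is a sketch under stated assumptions: `cost_e` is taken to be the energy-oriented interference cost from the Interference Graph slide, and `c_i`, `c_j` are taken to be the computation cycles available to overlap with the transfer (the C-nodes in the figure). The slide itself does not define either term, and the function names are hypothetical.

```python
def cost_p(s_i, s_j, w_i, w_j, latency_cycles_per_byte, c_i, c_j):
    """Performance cost: transfer cycles minus overlappable computation."""
    transfer = (s_i + s_j) * min(w_i, w_j) * latency_cycles_per_byte
    return transfer - (c_i + c_j)


def combined_cost(cost_e, s_i, s_j, w_i, w_j,
                  latency_cycles_per_byte, c_i, c_j):
    """cost[vi, vj] = coste[vi, vj] x costp[vi, vj], as on the slide."""
    return cost_e * cost_p(s_i, s_j, w_i, w_j,
                           latency_cycles_per_byte, c_i, c_j)
```

With sizes 4 and 3, weights of 100, a latency of 0.75 cycles/byte, and 10 cycles of computation on each side, `cost_p` is 7 x 100 x 0.75 - 20 = 505. More computation available before a call lowers the performance cost of placing the two functions in the same region.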

Page 20: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Energy Model

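$E_{\text{TOTAL}} = E_{\text{SPM}} + E_{\text{I-CACHE}} + E_{\text{TOTAL-MEM}}$

$E_{\text{SPM}} = N_{\text{SPM}} \times E_{\text{SPM-ACCESS}}$

$E_{\text{I-CACHE}} = E_{\text{IC-READ-ACCESS}} \times \{N_{\text{IC-HITS}} + N_{\text{IC-MISSES}}\} + E_{\text{IC-WRITE-ACCESS}} \times 8 \times N_{\text{IC-MISSES}}$

$E_{\text{TOTAL-MEM}} = E_{\text{CACHE-MEM}} + E_{\text{DMA}}$

$E_{\text{CACHE-MEM}} = E_{\text{MBST}} \times N_{\text{IC-MISSES}}$

$E_{\text{DMA}} = N_{\text{DMA-BLOCK}} \times E_{\text{MBST}} \times 4$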
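The energy model is a direct sum of products, so it translates to a few lines of code. The Python sketch below mirrors the slide's equations term by term; the function and parameter names are hypothetical, and the sample numbers in the note are purely illustrative, not measured values.

```python
def total_energy(n_spm, n_ic_hits, n_ic_misses, n_dma_block,
                 e_spm_access, e_ic_read, e_ic_write, e_mbst):
    """Return the breakdown of E_TOTAL following the slide's model."""
    e_spm = n_spm * e_spm_access
    # Every lookup reads the I-cache; each miss also writes 8 times.
    e_icache = (e_ic_read * (n_ic_hits + n_ic_misses)
                + e_ic_write * 8 * n_ic_misses)
    e_cache_mem = e_mbst * n_ic_misses          # one burst per miss
    e_dma = n_dma_block * e_mbst * 4            # four bursts per DMA block
    e_total_mem = e_cache_mem + e_dma
    return {
        "spm": e_spm,
        "icache": e_icache,
        "total_mem": e_total_mem,
        "total": e_spm + e_icache + e_total_mem,
    }
```

For 1000 SPM accesses, 900 I-cache hits, 100 misses, and 10 DMA blocks with unit energies of 1, 2, 3, and 50 (SPM access, I-cache read, I-cache write, memory burst), the parts are 1000, 4400, and 7000, for a total of 12400.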

Page 21: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Performance Model

chunks = (block-size + (bus width - 1)) / bus width    [bus width: 64 bits]
mem lat[0] = 18    [first chunk]
mem lat[1] = 2     [inter chunk]
total-lat = mem lat[0] + mem lat[1] x (chunks - 1)

latency cycles/byte = total-lat / block-size
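The performance model amounts to a ceiling division followed by a linear latency formula. In this Python sketch the block size is assumed to be in bytes and the 64-bit bus is expressed as 8 bytes; the function name and default arguments are hypothetical.

```python
def latency_cycles_per_byte(block_size, bus_width=8,
                            first_chunk=18, inter_chunk=2):
    """Average memory latency per byte for one block transfer.

    block_size and bus_width are in bytes (assumed); first_chunk and
    inter_chunk are the slide's mem lat[0] and mem lat[1] in cycles.
    """
    # Ceiling division: number of bus-width chunks needed for the block.
    chunks = (block_size + bus_width - 1) // bus_width
    total_lat = first_chunk + inter_chunk * (chunks - 1)
    return total_lat / block_size
```

A 32-byte block needs 4 chunks and 18 + 2 x 3 = 24 cycles, or 0.75 cycles/byte. An 8-byte block needs a single chunk and costs 18 / 8 = 2.25 cycles/byte, so larger transfers amortize the first-chunk latency.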

Page 22: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Average Energy Reduction of 25.9% for SDRM

SDRM is power efficient


Page 23: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Cache Only vs Split Arch.

[Figure: ARCHITECTURE 1 has an on-chip instruction cache of X bytes and a data cache. ARCHITECTURE 2 has an on-chip instruction cache of x/2 bytes, an instruction SPM of x/2 bytes, and a data cache.]

• Avg. 35% energy reduction across all benchmarks
• Avg. 2.08% performance degradation

Page 24: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

• Average Performance Improvement 6%
• Average Energy Reduction 32% (3% less)

SDRM with prefetching is better

Page 25: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Conclusion

• By splitting an instruction cache into an equal-sized SPM and I-cache, a pure software technique like SDRM will always result in energy savings.
• There is a tradeoff between energy savings and performance improvement.
• SPMs are the way to go for many-core architectures.

Page 26: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

Continuing Effort

• Improve static analysis
• Investigate the effect of outlining on the mapping function
• Explore techniques to use and share SPM in a multi-core and multi-tasking environment

Page 27: SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories

References

1. New Microarchitecture Challenges for the Coming Generations of CMOS Process Technologies. Micro32.
2. E. Grochowski, R. Ronen, J. Shen, H. Wang: Best of Both Latency and Throughput. 2004 IEEE International Conference on Computer Design (ICCD '04), 236-243.
3. S. Steinke et al.: Assigning program and data objects to scratchpad memory for energy reduction.
4. F. Angiolini et al.: A post-compiler approach to scratchpad mapping code.
5. B. Egger, S. L. Min et al.: A dynamic code placement technique for scratchpad memory using postpass optimization.
6. B. Egger et al.: Scratchpad memory management for portable systems with a memory management unit.
7. M. Verma et al.: Dynamic overlay of scratchpad memory for energy minimization.
8. M. Verma and P. Marwedel: Overlay techniques for scratchpad memories in low power embedded processors.
9. S. Steinke et al.: Reducing energy consumption by dynamic copying of instructions onto onchip memory.
10. A. Udayakumaran and R. Barua: Dynamic Allocation for Scratch-Pad Memory using Compile-time Decisions.