Page 1: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

HiPC 2010

AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Wenjing Ma Gagan Agrawal

The Ohio State University

Page 2: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

GPGPU

General Purpose Programming on GPUs (accelerators)
  High performance/price ratio
  High language support
    CUDA
Performance vs Productivity
  Hard to program
  Memory hierarchy to manage
  ...

Page 3: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Automatic code generation
Device memory access is expensive
  Using shared memory
  Texture and constant memory
  Coalescing device memory access
  ...

Get High Performance from GPU

And Make the Programming Simple!

Page 4: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

FEATURES OF SHARED MEMORY

Small, fast, like a cache
  16KB on each multiprocessor (no more than 48KB even on the latest GPU)
  Read-write
Software controlled
  __shared__ float data[n][n];
Allocating shared memory: similar to register allocation

Page 5: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Problem Formulation for Shared Memory Arrangement

Consider variables and basic blocks in a function
  Element of array, array, section of array
Each variable can have several live ranges in the function
  Access feature of live range: read, write, read-write, temp
Determine in which basic block a variable is allocated to shared memory
  assign_point[i][k]: variable i, basic block k

Page 6: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Integer Programming Problem

Integer Linear Programming
  Objective function
    Maximize $z = C^{T} x$
  Constraints
  Solution
    Values of x
Special case of linear programming
  All the unknown variables are integers (0 or 1 in our case)
Solvable for reasonable size of problems
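
As a rough illustration of what a 0-1 integer program asks for, the sketch below enumerates every 0/1 assignment and keeps the best feasible one. It is not the solver used in this work: exhaustive search is swapped in only because it is short and easy to check, and it assumes the constraints can be written as A x <= b with integer coefficients. The name solve_01_ilp and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Maximize c^T x subject to A x <= b with x in {0,1}^n, by brute force.
 * A is row-major with m rows and n columns. Practical only for small n;
 * a production solver would use branch and bound or cutting planes.
 * Returns 1 and fills x_out / best_out if a feasible x exists, else 0. */
int solve_01_ilp(int n, int m, const int *c, const int *A, const int *b,
                 int *x_out, long *best_out)
{
    assert(n >= 1 && n <= 20);
    int found = 0;
    long best = 0;

    for (unsigned long mask = 0; mask < (1UL << n); mask++) {
        /* Check every constraint row for this candidate assignment. */
        int feasible = 1;
        for (int row = 0; row < m && feasible; row++) {
            long lhs = 0;
            for (int j = 0; j < n; j++)
                if (mask & (1UL << j))
                    lhs += A[row * n + j];
            if (lhs > b[row])
                feasible = 0;
        }
        if (!feasible)
            continue;

        /* Evaluate the objective and remember the best candidate so far. */
        long z = 0;
        for (int j = 0; j < n; j++)
            if (mask & (1UL << j))
                z += c[j];
        if (!found || z > best) {
            found = 1;
            best = z;
            for (int j = 0; j < n; j++)
                x_out[j] = (int)((mask >> j) & 1UL);
        }
    }
    if (found)
        *best_out = best;
    return found;
}
```

With c = {6, 5, 4}, a single constraint row A = {5, 4, 3}, and b = {7}, the only assignment that reaches the maximum is x = {0, 1, 1} with z = 9, since selecting the first item leaves no room for either of the others.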

Page 7: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Integer Programming for Shared Memory Arrangement

Objective Function
  Maximize shared memory usage
  Minimize data transfer between memory hierarchies

Page 8: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Integer Programming for Shared Memory Arrangement (cont’d)

Objective Function

Page 9: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

An Example to Show size_alloc

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......
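
The slide does not define size_alloc, so the sketch below is only one plausible reading of the example, not the authors' definition. It assumes size_alloc is the number of array elements that must be resident when an array is staged in shared memory at a given loop depth: a dimension subscripted by a loop that has already been entered contributes one element, and a dimension subscripted by a deeper loop contributes its full extent. ArrayRef and section_elems are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIMS 4

/* Hypothetical description of one array reference inside a loop nest. */
typedef struct {
    int ndims;
    size_t extent[MAX_DIMS];   /* number of elements along each dimension */
    int loop_depth[MAX_DIMS];  /* depth of the loop whose index subscripts it */
} ArrayRef;

/* Elements that must be resident if the array is staged in shared memory
 * after entering `alloc_depth` enclosing loops (0 = before the whole nest). */
size_t section_elems(const ArrayRef *a, int alloc_depth)
{
    assert(a->ndims >= 1 && a->ndims <= MAX_DIMS);
    size_t elems = 1;
    for (int d = 0; d < a->ndims; d++) {
        /* A loop at depth >= alloc_depth still sweeps its whole range while
         * the data is resident, so the full extent of that dimension counts. */
        if (a->loop_depth[d] >= alloc_depth)
            elems *= a->extent[d];
    }
    return elems;
}
```

For A[i][k] in the loop above, with extents {n, r} = {2048, 3} (the values used in the worked example later in the deck) and loop depths {0, 2}, staging before the i loop needs 6144 elements, while staging inside the i loop needs only 3 per row. Assuming 4-byte floats, the first choice alone would exceed a 16KB shared memory.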

Page 10: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Integer Programming for Shared Memory Arrangement (cont’d)

Constraints
  Total allocation does not exceed the limit of shared memory at any time
  At most one assign_point is 1 in each live range
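
The two constraints above can be read as a feasibility check on a candidate assign_point matrix. The sketch below is a minimal interpretation under stated assumptions, not the formulation from the paper: each live range spans a contiguous run of basic blocks, data assigned at block k stays resident until the live range ends, and the space needed may depend on the block where it is assigned. LiveRange, MAX_BLOCKS, and arrangement_is_feasible are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 8

/* Hypothetical live range: a contiguous run of basic blocks. */
typedef struct {
    int first_block;                /* first basic block of the live range */
    int last_block;                 /* last basic block of the live range  */
    size_t size_alloc[MAX_BLOCKS];  /* bytes needed if assigned at block k  */
} LiveRange;

/* assign_point is row-major: entry [i * num_blocks + k] is 1 if live range i
 * is placed in shared memory starting at basic block k. Returns 1 if both
 * constraints hold, 0 otherwise. */
int arrangement_is_feasible(const LiveRange *lr, int num_ranges, int num_blocks,
                            const int *assign_point, size_t capacity)
{
    assert(num_blocks >= 1 && num_blocks <= MAX_BLOCKS);

    /* At most one assign point per live range, and only inside the range. */
    for (int i = 0; i < num_ranges; i++) {
        int count = 0;
        for (int k = 0; k < num_blocks; k++) {
            if (!assign_point[i * num_blocks + k])
                continue;
            if (k < lr[i].first_block || k > lr[i].last_block)
                return 0;
            count++;
        }
        if (count > 1)
            return 0;
    }

    /* At every basic block, the resident data must fit in shared memory. */
    for (int t = 0; t < num_blocks; t++) {
        size_t used = 0;
        for (int i = 0; i < num_ranges; i++) {
            if (t > lr[i].last_block)
                continue;  /* live range already ended */
            for (int k = lr[i].first_block; k <= t; k++)
                if (assign_point[i * num_blocks + k])
                    used += lr[i].size_alloc[k];
        }
        if (used > capacity)
            return 0;
    }
    return 1;
}
```

As a usage example, take three live ranges for A, B, and C over two basic blocks with byte sizes {24576, 3072}, {36, 36}, and {3072, 3072} (4-byte floats, n=2048, m=3, r=3, 256 threads, all assumed for illustration) and a 16384-byte limit. Assigning A at block 1 and B and C at block 0 passes the check; assigning A at block 0 fails the capacity constraint, and giving A two assign points fails the second constraint.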

Page 11: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Integer Programming for Shared Memory Arrangement (cont’d)

Obtaining parameters
  Using LLVM compiler framework
  Pass 1: get access features
    Read, write, read-write, temp
  Pass 2: get live ranges, loop information, indices, and all other parameters

Page 12: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Code Generation

According to the shared memory arrangement obtained from the integer programming model
Under the framework in previous work
Move data to cover gap caused by data evicted from shared memory

Page 13: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

An Example

A: n*r
B: m*r
C: r
n: 2048
m: 3
r: 3
NUM_THREADS: 256

assign_point[0][1]=1;
assign_point[1][0]=1;
assign_point[2][0]=1;
/* all other elements of assign_point are 0 */

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......

Integer Programming Solver

Page 14: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

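An Example (cont’d)

Generated Code:

__shared__ float s_B[m][r];
__shared__ float s_C[r*NUM_THREADS];
__shared__ float s_A[r*NUM_THREADS];
/* Stage all of B in shared memory */
for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
        s_B[j][k] = B[j][k];
for (int i = 0; i < n; i += NUM_THREADS) {
    /* Each thread stages its own row of A */
    for (int j = 0; j < r; j++)
        s_A[tid*r+j] = A[tid+i][j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            /* Each thread accumulates into its private slice of s_C */
            s_C[tid*r+k] += s_A[tid*r+k] - s_B[j][k];
    ......
}
/* Synchronize and combine the per-thread copies of C */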

Page 15: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Suggesting Loop Transformation

Before transformation:

for (int rc = 0; rc < nRowCl; rc++) {
    tempDis = 0;
    for (int c = 0; c < numCol; c++)
        tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}

After transformation:

for (int rc = 0; rc < nRowCl; rc++)
    tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
    /* load into shared memory */
    for (int rc = 0; rc < nRowCl; rc++) {
        tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
    }
}
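
The interchange above keeps the result unchanged because each tempDis[rc] still accumulates its terms in the same order of c; only the nesting differs. The sketch below checks that on plain host C. It is a stand-in for the slide's loops, with Acomp flattened to a row-major array and the names dist_original and dist_transformed introduced purely for illustration.

```c
#include <stddef.h>

/* Original nest: rc outer, c inner, one scalar accumulator per rc.
 * data_r is row r of data; Acomp is row-major with nColCl columns. */
void dist_original(int nRowCl, int numCol, int nColCl, const float *data_r,
                   const float *Acomp, const int *colCL, float *tempDis)
{
    for (int rc = 0; rc < nRowCl; rc++) {
        float acc = 0.0f;
        for (int c = 0; c < numCol; c++)
            acc += data_r[c] * Acomp[rc * nColCl + colCL[c]];
        tempDis[rc] = acc;
    }
}

/* Interchanged nest: c outer, rc inner, accumulator expanded to an array.
 * Everything indexed only by c is now fixed across the inner loop, which is
 * the point at which a GPU version could stage it in shared memory. */
void dist_transformed(int nRowCl, int numCol, int nColCl, const float *data_r,
                      const float *Acomp, const int *colCL, float *tempDis)
{
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] = 0.0f;
    for (int c = 0; c < numCol; c++) {
        float d = data_r[c];  /* loaded once per c instead of once per (rc, c) */
        for (int rc = 0; rc < nRowCl; rc++)
            tempDis[rc] += d * Acomp[rc * nColCl + colCL[c]];
    }
}
```

Calling both functions with data_r = {1, 2, 3}, a 2x2 Acomp = {1, 2, 3, 4}, and colCL = {0, 1, 0} yields tempDis = {8, 20} in each case. The cost of the transformed form is the extra storage for tempDis[rc], which is why such a change is best treated as a suggestion to be weighed against the shared memory budget.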

Page 16: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Experiments

Effectiveness of using shared memory
  Compare with intuitive approach in previous work
  Greedy sorting: sort all the variables in increasing order of size, and allocate them on shared memory until the limit of shared memory is reached
Effectiveness of loop transformation suggested by the integer programming model
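
The greedy baseline described above can be sketched in a few lines. This is a minimal reconstruction from the one-sentence description on the slide, not the code from the previous work; the Var struct and the function names are hypothetical, and sizes are taken to be in bytes.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one candidate variable. */
typedef struct {
    const char *name;
    size_t bytes;     /* space the whole variable needs */
    int in_shared;    /* set to 1 if the greedy pass places it in shared memory */
} Var;

/* qsort comparator: increasing order of size. */
static int by_size_ascending(const void *a, const void *b)
{
    size_t x = ((const Var *)a)->bytes;
    size_t y = ((const Var *)b)->bytes;
    return (x > y) - (x < y);
}

/* Sort the variables by increasing size, then place them one by one until
 * the next variable no longer fits. Returns the number of bytes used. */
size_t greedy_allocate(Var *vars, size_t count, size_t capacity)
{
    size_t used = 0;
    qsort(vars, count, sizeof vars[0], by_size_ascending);
    for (size_t i = 0; i < count; i++)
        vars[i].in_shared = 0;
    for (size_t i = 0; i < count; i++) {
        if (vars[i].bytes > capacity - used)
            break;  /* limit reached; every later variable is at least as large */
        vars[i].in_shared = 1;
        used += vars[i].bytes;
    }
    return used;
}
```

With three variables of 24576, 36, and 3072 bytes and a 16384-byte limit, the pass places the two small ones (3108 bytes in total) and leaves the largest in device memory. In this sketch a variable is either placed whole or not at all, whereas the integer programming model can also consider a section of an array.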

Page 17: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Experiment Results

Page 18: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Experiment Results

[Figure: two plots titled K-means and EM, each showing time (seconds) against configuration (threads_per_block * blocks). One plot covers configurations 256*4 through 256*256 with series "no shared memory", "basic", and "Int-solved"; the other covers 256*4 through 256*64 with series "basic" and "Int-solved".]

Page 19: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Experiment Results (cont’d)

[Figure: two plots titled PCA and Co-clustering, each showing time (seconds) against configuration (threads_per_block * blocks). One plot covers configurations 128*1 through 128*16 with series "no shared memory" and "Int solved"; the other covers 128*1 through 128*64 with series "basic" and "Int-solved".]

Page 20: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Effect of Loop Transformation

[Figure: two plots titled PCA and Co-clustering, each showing time (seconds) against configuration (threads_per_block * blocks) with series "non-transformed" and "transformed". One plot covers configurations 128*1 through 128*64; the other covers 128*1 through 128*16.]

Page 21: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

Conclusion and Future Work

Proposed an integer programming model for shared memory arrangement on GPU
  Consider numeric variable, array, and section of array
  Suggested loop transformation for optimization
  Got better results than the intuitive method
Will automate the code generation and loop transformation selection in future

Page 22: AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS

THANK YOU!

Questions?