Computing Spherical Harmonic Transforms on CUDA-Compatible GPUs

Wangqun Lin, Fengshun Lu
College of Computer, National University of Defense Technology
CACHES 2011, Tucson, Arizona, June 4th, 2011


Outline

Motivation
Spherical Harmonic Transforms (SHT)
Methods
  Direct Method
  Efficiency of Thread Utilization
  Reshaped Method
  Concurrent Kernel Execution
Experiments


Motivation

Computing the S.H.T. with GPUs
  The S.H.T. is widely used
  But its computational complexity is O(N^3)
  GPUs are powerful

Performance metric at the SM (streaming multiprocessor) level
  Current practice emphasizes only OCCUPANCY
  Goal: find another metric that measures how efficiently the launched threads are used


Spherical Harmonic Transforms(1/2)

$$\xi(\lambda,\mu) = \sum_{m=-M}^{M}\ \sum_{n=|m|}^{N(m)} \xi_n^m\, P_n^m(\mu)\, e^{im\lambda}$$

ξ: state variable
ξ_n^m: spectral coefficients of state variable ξ
μ: Gaussian latitude
λ: longitude
M: model truncation wavenumber
N(m): highest degree of the associated Legendre function for wavenumber m
P_n^m(μ) e^{imλ}: spherical harmonic basis function, where P_n^m(μ) is the associated Legendre function


Spherical Harmonic Transforms(2/2)

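Forward FFT:

$$\xi^m(\mu_j) = \frac{1}{I}\sum_{i=1}^{I} \xi(\lambda_i,\mu_j)\, e^{-im\lambda_i}$$

Forward Legendre:

$$\xi_n^m = \sum_{j=1}^{J} \xi^m(\mu_j)\, P_n^m(\mu_j)\, w_j \qquad (3)$$

Inverse Legendre:

$$\xi^m(\mu_j) = \sum_{n=|m|}^{M} \xi_n^m\, P_n^m(\mu_j)$$

Inverse FFT:

$$\xi(\lambda_i,\mu_j) = \sum_{m=-M}^{M} \xi^m(\mu_j)\, e^{im\lambda_i} \qquad (5)$$

Here I is the number of longitudes, J the number of Gaussian latitudes, and w_j the Gaussian quadrature weight at latitude μ_j.

[Figure: State Variable ξ(λ, μ) → Forward Fourier → Fourier Coefficient ξ^m(μ) → Forward Legendre → Spectral Coefficient ξ_n^m, and back again through Inverse Legendre and Inverse Fourier.]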
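The Fourier stage of the transform pair can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses a naive O(I·M) sum instead of an FFT, and it assumes equally spaced longitudes λ_i = 2πi/I with i starting at 0. The function names `forward_fourier` and `inverse_fourier` are hypothetical.

```python
import cmath
import math

def forward_fourier(samples, M):
    """xi^m = (1/I) * sum_i xi(lambda_i) * exp(-i*m*lambda_i), for m = -M..M."""
    I = len(samples)
    # Assumption: equally spaced longitudes, lambda_i = 2*pi*i/I.
    lams = [2.0 * math.pi * i / I for i in range(I)]
    return {
        m: sum(x * cmath.exp(-1j * m * lam) for x, lam in zip(samples, lams)) / I
        for m in range(-M, M + 1)
    }

def inverse_fourier(coeffs, I):
    """xi(lambda_i) = sum_m xi^m * exp(i*m*lambda_i) on I equally spaced longitudes."""
    lams = [2.0 * math.pi * i / I for i in range(I)]
    return [sum(c * cmath.exp(1j * m * lam) for m, c in coeffs.items()) for lam in lams]

# Toy round trip: synthesize a signal truncated at M = 2, then analyze it again.
M, I = 2, 8
coeffs = {-2: 0.5j, -1: 1.0, 0: 2.0, 1: 1.0, 2: -0.5j}
samples = inverse_fourier(coeffs, I)
recovered = forward_fourier(samples, M)
max_error = max(abs(recovered[m] - coeffs[m]) for m in coeffs)
```

The round trip recovers the coefficients up to rounding error because the number of longitudes exceeds 2M, so no aliasing occurs.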


Methods – Direct (1/9)

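Forward Legendre (m ≤ n)

$$\xi_n^m = \sum_{j=1}^{J} \xi^m(\mu_j)\, P_n^m(\mu_j)\, w_j$$

[Figure: the lower-triangular array of spectral coefficients ξ_n^m, from ξ_0^0 to ξ_M^M, with columns m = 0 … M and rows n = 0 … M, overlaid with CUDA threads and thread blocks. The legend distinguishes busy threads from idle threads.]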
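A serial sketch of the direct forward Legendre transform makes the busy/idle split concrete. It assumes the mapping that the later slides indicate (one thread block per degree n, one thread per wavenumber m) and takes the Legendre table `P` and the weights `w` as precomputed inputs; the function name and the toy data are hypothetical.

```python
def direct_forward_legendre(fc, P, w, M):
    """fc[j][m] = xi^m(mu_j), P[j][n][m] = P_n^m(mu_j), w[j] = quadrature weight.

    Returns (spec, busy, idle) where spec[n][m] = xi_n^m for m <= n.
    """
    J = len(w)
    spec = [[0.0] * (M + 1) for _ in range(M + 1)]
    busy = idle = 0
    for n in range(M + 1):        # assumed: one thread block per degree n
        for m in range(M + 1):    # assumed: one thread per wavenumber m
            if m <= n:
                spec[n][m] = sum(fc[j][m] * P[j][n][m] * w[j] for j in range(J))
                busy += 1
            else:
                idle += 1         # thread launched but has nothing to compute
    return spec, busy, idle

# Toy data: M = 2, two latitudes, all Legendre values set to 1.
M, J = 2, 2
fc = [[m + 1.0 for m in range(M + 1)] for _ in range(J)]
P = [[[1.0] * (M + 1) for _ in range(M + 1)] for _ in range(J)]
w = [0.5, 0.5]
spec, busy, idle = direct_forward_legendre(fc, P, w, M)
```

With a square launch of (M+1) × (M+1) threads, only the (M+1)(M+2)/2 threads on or below the diagonal do useful work.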


Methods – Direct (2/9)

Inverse Legendre (m ≤ n)

$$\xi^m(\mu_j) = \sum_{n=m}^{M} \xi_n^m\, P_n^m(\mu_j)$$

[Figure: the triangle of index pairs (n, m), from (0,0) down to the row (M,0) … (M,M), with columns m = 0 … M and rows n = 0 … M. The CUDA threads of block j each accumulate one column and produce ξ^0(μ_j), ξ^1(μ_j), …, ξ^M(μ_j). The legend distinguishes busy threads from idle threads.]


Methods – ETU Metric (3/9)

Efficiency of Thread Utilization (ETU)
  Measures the proportion of launched threads doing useful work during the entire execution interval
  Mainly used as an algorithm design guideline
Assumption
  Algorithms consist of many micro steps
tu(t, s) function
  t: thread
  s: micro step

$$tu(t,s) = \begin{cases} 1, & \text{if thread } t \text{ is doing useful work at micro step } s \\ 0, & \text{otherwise} \end{cases}$$


Methods – ETU (4/9)

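$$ETU = \frac{\sum_{s=1}^{sm}\ \sum_{t=1}^{tm} tu(t,s)}{sm \times tm}$$

where sm is the number of micro steps and tm is the number of launched threads.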
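The ETU definition translates directly into code. The sketch below is illustrative only: `etu` is a hypothetical helper that takes the indicator tu(t, s) as a callable, and the toy workload (thread t works only while s ≤ t) is invented to exercise it.

```python
def etu(tu, sm, tm):
    """ETU = (sum over micro steps s and threads t of tu(t, s)) / (sm * tm)."""
    useful = sum(tu(t, s) for s in range(1, sm + 1) for t in range(1, tm + 1))
    return useful / (sm * tm)

# Toy workload: 4 threads, 3 micro steps; thread t is busy only while s <= t.
example = etu(lambda t, s: 1 if s <= t else 0, sm=3, tm=4)
```

In this toy case the threads perform 1 + 2 + 3 + 3 = 9 useful thread-steps out of 12, so `example` is 0.75.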

Algorithm 2: Direct Inverse Legendre Transform (DILT)

Input: ξ_n^m, P_n^m, J, M
Output: ξ^m
Execution configuration: (J, M+1)   // J thread blocks of M+1 threads each
Declaration: tid, bid, fc_sh(M+1)   // fc_sh: shared memory

fc_sh(tid) = 0;                                      // 1 micro step
for n = 0 to M do                                    // M+1 micro steps
    if tid ≤ n then
        fc_sh(tid) += ξ_n^tid × P_n^tid(μ_bid);      // thread tid handles wavenumber m = tid
    end if
end for
ξ^tid(μ_bid) = fc_sh(tid);                           // 1 micro step
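A serial emulation of Algorithm 2 can help check the indexing. This is a sketch under stated assumptions rather than GPU code: the loops over `bid` and `tid` stand in for the J thread blocks and M+1 threads per block, `fc_sh` is an ordinary list playing the role of shared memory, and the Legendre table `P` is a precomputed input. The name `dilt` and the toy data are hypothetical.

```python
def dilt(spec, P, J, M):
    """spec[n][m] = xi_n^m, P[j][n][m] = P_n^m(mu_j); returns fc[j][m] = xi^m(mu_j)."""
    fc = []
    for bid in range(J):                    # one thread block per latitude mu_bid
        fc_sh = [0.0] * (M + 1)             # per-block accumulator, initialized to zero
        for n in range(M + 1):              # M+1 micro steps
            for tid in range(M + 1):        # threads of a block, serialized here
                if tid <= n:                # threads with tid > n are idle in this step
                    fc_sh[tid] += spec[n][tid] * P[bid][n][tid]
        fc.append(fc_sh)                    # write-back step
    return fc

# Toy data: M = 2, one latitude, xi_n^m = 1 for m <= n, P_n^m = 1.
M, J = 2, 1
spec = [[1.0 if m <= n else 0.0 for m in range(M + 1)] for n in range(M + 1)]
P = [[[1.0] * (M + 1) for _ in range(M + 1)] for _ in range(J)]
fc = dilt(spec, P, J, M)
```

For the toy data each wavenumber m sums M − m + 1 ones, so the result for the single latitude is [3.0, 2.0, 1.0].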

ETU Metric Example

$$ETU = \frac{J(M+1) + J(M+1)(M+2)/2 + J(M+1)}{J(M+1)(M+3)} = \frac{1 + (M+2)/2 + 1}{M+3} = \frac{M+6}{2(M+3)}$$
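The closed form for the DILT example can be checked by counting useful thread-steps explicitly. The sketch below assumes the accounting annotated in Algorithm 2 (one initialization step, M+1 loop steps in which tid ≤ n threads are busy, one write-back step, and J(M+1) launched threads); the function names are hypothetical.

```python
from fractions import Fraction

def dilt_etu(J, M):
    """Count useful thread-steps of the direct inverse Legendre transform."""
    threads = J * (M + 1)          # tm: launched threads
    steps = M + 3                  # sm: 1 init + (M+1) loop steps + 1 write-back
    useful = threads               # initialization: every thread works
    for n in range(M + 1):
        useful += J * (n + 1)      # loop step n: threads with tid <= n work
    useful += threads              # write-back: every thread works
    return Fraction(useful, steps * threads)

def closed_form(M):
    """ETU = (M + 6) / (2 * (M + 3))."""
    return Fraction(M + 6, 2 * (M + 3))

t213 = dilt_etu(J=320, M=213)      # J = 320 is an assumed latitude count
```

The count is independent of J and approaches 1/2 as M grows, which is why roughly half of the launched threads are wasted in the direct method.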


Methods – Reshaped (5/9)

Forward Legendre

[Figure: direct layout versus reshaped layout of the coefficient triangle ξ_0^0 … ξ_M^M.
Direct: blocks blk 0 … blk M, each with threads T 0 … T M. In blk x, x+1 threads are busy and M−x threads are idle. ETU ≈ 1/2.
Reshaped: blocks blk 0 … blk (M−1)/2, each with threads T 0 … T M+1. The rows of the triangle are rearranged (the figure residue suggests that a short row n = x shares a block with the long row n = M − x) so that all threads of block x are busy. ETU ≈ 1.]
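The effect of reshaping can be estimated by counting busy threads per block. The sketch below rests on an assumption read off the figure rather than stated in the slides: row n = x is paired with row n = M − x, M is odd, and each reshaped block has M + 2 threads. It counts only the accumulation step, not the full ETU; all names are hypothetical.

```python
def direct_layout(M):
    """Block x handles degree n = x with M+1 threads, of which x+1 are busy."""
    return [(x + 1, M + 1) for x in range(M + 1)]

def reshaped_layout(M):
    """Assumed pairing: block x holds rows x and M-x in M+2 threads (M odd)."""
    if M % 2 == 0:
        raise ValueError("this sketch assumes an odd truncation M")
    return [((x + 1) + (M - x + 1), M + 2) for x in range((M + 1) // 2)]

def busy_fraction(layout):
    """Fraction of launched threads that hold a coefficient."""
    busy = sum(b for b, _ in layout)
    launched = sum(t for _, t in layout)
    return busy / launched

direct = busy_fraction(direct_layout(213))
reshaped = busy_fraction(reshaped_layout(213))
```

For M = 213 the direct layout keeps about half of the threads busy, while the paired layout fills every thread and needs only (M + 1)/2 = 107 blocks.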


Methods – Reshaped (6/9)

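Inverse Legendre, T213 model

$$\xi^m(\mu_j) = \sum_{n=0}^{213} \xi_n^m P_n^m(\mu_j) = \sum_{n=0}^{59} \xi_n^m P_n^m(\mu_j) + \sum_{n=60}^{127} \xi_n^m P_n^m(\mu_j) + \sum_{n=128}^{213} \xi_n^m P_n^m(\mu_j)$$

(only terms with n ≥ m contribute)

[Figure (reshape): on the left, the direct layout for M = 213, with the triangle of index pairs (n, m) from (0,0) to (213,213) and the threads of block j producing ξ^0(μ_j) … ξ^M(μ_j). On the right, the triangle is cut along n into horizontal bands: six bands of 10 rows each covering n = 0 … 59, with corners such as (9,9), (10,0), (19,19), (20,0), …, (59,59); two bands of 34 rows each, labeled α and β, covering n = 60 … 93 and n = 94 … 127; and one band of 86 rows covering n = 128 … 213. The horizontal axis is the block size, with thread ranges T = 0 … 9, 10 … 29, 30 … 59, 60 … 99, 100 … 149 and 150 … 209.]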


Methods – Reshaped (7/9)

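Inverse Legendre, T213 model

[Figure (reconstruct): the bands are packed side by side into three shared-memory buffers, with block size on the horizontal axis and circled labels ①, ② and ③ marking the parts.
sh1: six bands of widths 10, 20, 30, 40, 50 and 60, occupying threads T = 0 … 9, 10 … 29, 30 … 59, 60 … 99, 100 … 149 and 150 … 209.
sh2: two bands of widths 94 and 128, occupying threads 0 … 93 and 94 … 221.
sh3: one band of width 214.]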
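The thread ranges in the figure are cumulative sums of the band widths. The short sketch below reproduces them; the helper `thread_ranges` is hypothetical, and the assignment of widths to the buffers sh1, sh2 and sh3 is read off the figure labels.

```python
from itertools import accumulate

def thread_ranges(widths):
    """Pack bands side by side and return the (first, last) thread id of each."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [(s, e - 1) for s, e in zip(starts, ends)]

sh1 = thread_ranges([10, 20, 30, 40, 50, 60])   # bands for n = 0 .. 59
sh2 = thread_ranges([94, 128])                  # bands for n = 60 .. 127
sh3 = thread_ranges([214])                      # band for n = 128 .. 213
```

The resulting block sizes are 210, 222 and 214 threads, so the three parts use blocks of similar width.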


Methods – Reshaped (8/9)

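Inverse Legendre, T213 model

Computation for trapeziums α and β:

$$\sum_{n=60}^{127} \xi_n^m P_n^m(\mu_j) = \sum_{n=60}^{93} \xi_n^m P_n^m(\mu_j) + \sum_{n=94}^{127} \xi_n^m P_n^m(\mu_j)$$

[Figure: buffer sh2 holding the two partial sums, with band widths 94 and 128.]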
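Splitting the sum over n into bands does not change the result, which the following sketch verifies on synthetic integer data for a single latitude. The band boundaries are the ones shown on the slides; the helper `partial_sum` and the synthetic tables `spec` and `P` are hypothetical stand-ins for ξ_n^m and P_n^m(μ_j).

```python
def partial_sum(spec, P, m, n_lo, n_hi):
    """Sum of xi_n^m * P_n^m over n_lo <= n <= n_hi, skipping degrees n < m."""
    return sum(spec[n][m] * P[n][m] for n in range(max(n_lo, m), n_hi + 1))

M = 213
# Synthetic integer tables so that the comparison is exact.
spec = [[n - m + 1 for m in range(n + 1)] for n in range(M + 1)]
P = [[(n * m) % 7 for m in range(n + 1)] for n in range(M + 1)]

BANDS = [(0, 59), (60, 127), (128, 213)]

def split_matches_full(m):
    full = partial_sum(spec, P, m, 0, M)
    parts = sum(partial_sum(spec, P, m, lo, hi) for lo, hi in BANDS)
    return full == parts

def alpha_beta_match(m):
    whole = partial_sum(spec, P, m, 60, 127)
    return whole == partial_sum(spec, P, m, 60, 93) + partial_sum(spec, P, m, 94, 127)
```

Because each band is an independent partial sum, the bands can be computed by separate thread groups and combined afterwards.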


Methods – Concurrent Kernel (9/9)

Concurrent Kernel Execution
  Supported by Fermi and later architectures
  Programs with many small kernels can be executed efficiently on GPUs
  Takes future software scalability into consideration
T213 model

Kernel   Concurrent Forward Legendre              Concurrent Inverse Legendre
         n            Grid size   Block size      m            Grid size   Block size
1        [0, 53]      54          64              [0, 53]      320         64
2        [54, 117]    64          128             [54, 117]    320         64
3        [118, 213]   96          224             [118, 213]   320         96
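The launch sizes in the table follow a simple pattern, sketched below. This is an observation that happens to reproduce the table, not a rule stated in the slides: the forward kernels appear to use one block per degree n and enough threads for wavenumbers 0 … n_max rounded up to a multiple of the warp size (32), while the inverse kernels appear to use one block per latitude (320 here) and one thread per wavenumber in the range, again rounded up. The function names are hypothetical.

```python
WARP = 32  # CUDA warp size

def round_up(x, multiple=WARP):
    """Smallest multiple of `multiple` that is >= x."""
    return -(-x // multiple) * multiple

def forward_config(n_lo, n_hi):
    """(grid size, block size) for degrees n in [n_lo, n_hi] (assumed rule)."""
    return (n_hi - n_lo + 1, round_up(n_hi + 1))

def inverse_config(m_lo, m_hi, latitudes=320):
    """(grid size, block size) for wavenumbers m in [m_lo, m_hi] (assumed rule)."""
    return (latitudes, round_up(m_hi - m_lo + 1))

RANGES = [(0, 53), (54, 117), (118, 213)]
forward = [forward_config(lo, hi) for lo, hi in RANGES]
inverse = [inverse_config(lo, hi) for lo, hi in RANGES]
```

Under these assumptions `forward` and `inverse` match the three table rows, and the three ranges together cover every index from 0 to 213 exactly once.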


Experiments (1/4)

Validation of the ETU metric
  T341 model
  Variable block size

Observations
  In general, a larger ETU indicates better performance
  No direct relationship appears between OCCUPANCY and performance
  Equal OCCUPANCY does not imply equal performance
  At equal OCCUPANCY, the configuration with the larger ETU performs better

BS (block size)   ETU      OCCUPANCY   Time (ms)
96                0.8039   0.312       1.975
128               0.7480   0.417       2.239
160               0.7831   0.417       2.038
192               0.6519   0.625       2.198


Experiments (2/4)

Performance

[Figure: performance results in two panels, Forward Legendre and Inverse Legendre.]


Experiments (3/4)

Case Study: STSWM
  A global shallow water model based on the S.H.T.
  Exhibits many mathematical and computational properties of more complete models
  Used to investigate and compare numerical methods for simulating atmospheric models
  T213 truncation

Forward Legendre: ftrnve, ftrndi and ftrnpi
Inverse Legendre: shtrns


Experiments (4/4)

Case Study: STSWM


Review

Motivation
Spherical Harmonic Transforms
Methods
  Direct Method
  Efficiency of Thread Utilization
  Reshaped Method
  Concurrent Kernel Execution
Experiments
