TRANSCRIPT
Igor Podladtchikov, Spectraseis Inc
March 19, 2013
Memory Bound Wave Propagation at Hardware Limit
© Spectraseis Inc. 2013
Microseismic Monitoring: Time Reversed Imaging (TRI)
Geophysical method to locate subsurface events:
• Propagate and image time-reversed data acquired at the surface
Use the full wave equation:
• Acoustic or elastic
• Heterogeneous materials
Need very fast solvers:
• Thousands of events
• Big models
Performance Expectations
Results
Acoustic Solver Implementation
Elastic Solver Implementation
Summary
Roadmap
Performance Limiters
The two principal performance limiters:
• Processors: computation, ~1000 Gflops/s
• Memory: transfer, ~100 GB/s
Acoustic Equations
1 variable read & write
2 variables read only
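For reference, a sketch of the equation behind these counts (not spelled out on the slide, but consistent with the pressure kernel shown later, where the time-step factor is absorbed into the stored v_p^2): the second-order acoustic wave equation, with pressure p the single read-and-write variable and v_p^2 plus the absorbing coefficient the read-only constants:

\[ \frac{\partial^2 p}{\partial t^2} = v_p^2 \,\nabla^2 p \]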
Elastic Equations
9 variables read & write
3 variables read only
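Again for reference, a sketch of the first-order velocity-stress system these counts describe (assuming the standard isotropic formulation; the slide shows only the counts, and the staggered-grid slide later confirms the variables): 3 particle velocities and 6 stresses are read and written (9 DOF), while the material properties ρ, λ, μ are read only:

\[ \rho\,\frac{\partial v_i}{\partial t} = \frac{\partial \sigma_{ij}}{\partial x_j}, \qquad \frac{\partial \sigma_{ij}}{\partial t} = \lambda\,\delta_{ij}\,\frac{\partial v_k}{\partial x_k} + \mu\left(\frac{\partial v_i}{\partial x_j} + \frac{\partial v_j}{\partial x_i}\right) \]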
Arithmetic Intensity
flops / bytes ratio:
• Compute or memory bound?
BYTES, not numbers:
• Single precision: 4 bytes
• e.g. 1st derivative: 2 reads, 1 write, 12 bytes transferred
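As a worked example: the two-point 1st derivative above costs 2 flops (one subtraction, one scaling) against 12 bytes moved, so

\[ \mathrm{AI} = \frac{\text{flops}}{\text{bytes}} = \frac{2}{12} \approx 0.17, \]

which matches the first column of the table two slides ahead.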
Machine Balance
M2070 machine balance:
• Peak flops / bytes: 1030 / 117 ≈ 9
K10 machine balance:
• Peak flops / bytes: 4577 / 228 ≈ 20
Arithmetic intensity:
• << machine balance: memory bound
• >> machine balance: compute bound
Arithmetic Intensity
Machine balance:
− Fermi: ~9
− Kepler: ~20

         1st deriv   2nd deriv   acoustic   elastic
flops    2           5           24         99
bytes    12          16          68         444
ratio    0.17        0.31        0.35       0.22

All ratios << 9: we're memory bound.
Memory Bound – What To Do?
Option A:
− Give up (don't even try)
− Blame "memory bound" for slow code
Option B:
− Celebrate: FLOPS are for FREE
− Optimize memory access efficiency
− Count bytes, not flops
− Try to approach memcpy throughput
Our claim: real world applications can run close to memcpy!
How to optimize memory access?
Try to avoid redundant reads / writes:
− Aim for minimum reads / writes
• Touch everything once (un-improvable)
• Don't read neighbors twice
"Read me once… …don't read me again!"
Don't count neighbor reads!
Don't cheat (yourself):
− Track optimization progress
− Don't count neighbors in your performance metric!
"Remember traffic is the volume of data to a particular memory. It is not the number of loads and stores."
— Performance Tuning of Scientific Applications
Ideal Memory Throughput
No neighbors!

MTP = N_IO × Grid Size × Word Size / Time Elapsed
N_IO = 2 × DOF + Constants

DOF: degrees of freedom (read and write)
Constants: read only
Grid Size: nx * ny * nz * nt
Word Size: 4 bytes (single precision)
Ideal Memory Throughput
N_IO Acoustic: 2 × 1 + 2 = 4
N_IO Elastic: 2 × 9 + 3 = 21

Acoustic: MTP = 4 × Grid Size × 4 bytes / (Time Elapsed × 2^30) GB/s
Elastic: MTP = 21 × Grid Size × 4 bytes / (Time Elapsed × 2^30) GB/s
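A minimal host-side helper capturing this metric (a sketch; the function name is an assumption, the formula follows the definitions above):

// Ideal memory throughput in GB/s: each DOF counts once for read and once
// for write, each constant once for read; neighbor reads count for nothing.
float ideal_mtp_gbps(int n_dof, int n_const,
                     long long grid_size,    // nx * ny * nz * nt
                     double elapsed_sec)
{
    long long n_io = 2LL * n_dof + n_const;
    double bytes = (double)n_io * (double)grid_size * 4.0;  // single precision
    return (float)(bytes / (elapsed_sec * (double)(1LL << 30)));
}

// e.g. acoustic: ideal_mtp_gbps(1, 2, ...), elastic: ideal_mtp_gbps(9, 3, ...)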
Results
…without further ado:
− All solvers include
• free surface
• absorbing layer
• domain decomposition along all 3 axes
• IPC if GPUs are peer-mappable, MPI otherwise
− All solvers verified against single-CPU code
− All data from the NVIDIA PSG Cluster – Thank You!
Memory Throughput on M2070
[Figure: MTP (GB/s) vs. cube size (64 to 704) for Memcpy, Pressure, Density, and Elastic on M2070.]
Real physics at 85% and 52% of the hardware limit.
Neighbors Don't Count!
Acoustic pressure update:

po[CENTER] = 2*pcc - po[CENTER]*abs + vp2[CENTER] *
    (
        pc[LEFT ] + pc[RIGHT ]
      + pc[LEFT2 ] + pc[RIGHT2]
      + pcm + pcp
      - 6*pcc
    );

If we count the 6 neighbor reads as IO operations:
− 10 IO operations instead of 4
− 100 GB/s / 4 × 10 = 250 GB/s apparent MTP peak on M2070
− but the theoretical hardware limit is 150 GB/s
DON'T COUNT NEIGHBOR READS
Memory Throughput on K10
[Figure: MTP (GB/s) vs. cube size (64 to 704) for Memcpy, Pressure, Density, and Elastic on K10.]
Strong scaling on both K10 GPUs: same size and power consumption as M2070!
Other GPUs
[Figure: Memcpy, Pressure, Density, and Elastic MTP (GB/s) on K10, K20X, K20, M2090, M2070, and GK104.]
The green cards win.
Multi-GPU Weak Scaling on GK104
PCIe 2: 6 GB/s
PCIe 3: 12 GB/s
[Figure: per-GPU MTP as % of a single GPU vs. cube size, for Density and Elastic, on 2 nodes (IPC), 4 nodes (MPI PCIe3), and 8 nodes (IB FDR).]
Results Summary
− Defined ideal, un-improvable memory throughput
• MTP = N_IO × Grid Size × Word Size / time elapsed
• N_IO = 2 × DOF + Const
• No neighbors or temporary variables
− Came close to memcpy with real-world applications
• acoustic: 85%
• elastic: 52%
• performance proportional to memcpy on various architectures
− Solvers scale on multiple GPUs
Acoustic Solver Implementation
General Considerations
Respect the number 32:
− 32 × 8 thread-blocks
− Fast-axis sizes multiples of 32 (can be padded)
− Hit global memory segments and L1 cache lines (32 × 4 B = 128 B)
Rely on cache:
− Shared memory requires extra operations
− Shared memory needs __syncthreads()
− Registers are faster than shared memory
− If the working set fits in cache, cache is faster
First Try – Acoustic Pressure
Yes, that's it:

#define EXIT_BND(xx,yy,nx,ny) \
    int xx = blockIdx.x*blockDim.x + threadIdx.x; if(xx < 1 || xx >= nx - 1) return; \
    int yy = blockIdx.y*blockDim.y + threadIdx.y; if(yy < 1 || yy >= ny - 1) return;

#define CENTER i1   + i2*n1     + i3*n1*n2
#define RIGHT  i1+1 + i2*n1     + i3*n1*n2
#define LEFT   i1-1 + i2*n1     + i3*n1*n2
#define RIGHT2 i1   + (i2+1)*n1 + i3*n1*n2
#define LEFT2  i1   + (i2-1)*n1 + i3*n1*n2
#define TOP    i1   + i2*n1     + (i3+1)*n1*n2
#define BOT    i1   + i2*n1     + (i3-1)*n1*n2

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3){
    EXIT_BND(i1,i2,n1,n2)          // one thread per (i1,i2) column, skip boundary
    int i3;
    for(i3 = 1; i3 < n3-1; i3++){  // march along the slow axis
        po[CENTER] = 2*pc[CENTER] - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2] + pc[BOT] + pc[TOP] -
            6*pc[CENTER] );
    }
}
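The launch configuration is not shown on the slide; a minimal host-side sketch consistent with the "32 × 8 thread-blocks" rule above (the grid rounding is an assumption):

dim3 block(32, 8);                        // respect the number 32
dim3 grid((n1 + block.x - 1) / block.x,   // cover the two fast axes
          (n2 + block.y - 1) / block.y);
pressure_gpu_het_vp2<<<grid, block>>>(pc, po, vp2, n1, n2, n3);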
First Try – Acoustic Pressure
[Figure: MTP (GB/s) vs. cube size (64 to 704); strong at small cubes ("Yay!"), falling off at large cube sizes ("Boo Hoo Hoo…").]
Pretty good, but not good enough.
First Try
Suspect TLB misses:
• Translation Lookaside Buffers
• Accelerate translation from virtual to physical memory
• Act like caches on the page table
"If the kernel's working set … exceeds TLB capacity (or associativity) then one generates TLB capacity (or conflict) misses."
— Performance Tuning of Scientific Applications
Batched Execution
If the kernel's working set is too big, we'll reduce it: launch kernel batches for the slowest axis.

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3,
                                     const int offset){
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    for(i3 = offset+1; i3 < offset+n3-1; i3++){   // only n3-2 planes per launch
        po[CENTER] = 2*pc[CENTER] - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2] + pc[BOT] + pc[TOP] -
            6*pc[CENTER] );
    }
}
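The slides don't show the host loop; a sketch of how such batches might be launched (BATCH = 32 is an assumption based on the "batch 32" series in the next chart; boundary handling simplified):

// Split the slow axis into batches of interior planes and launch one
// kernel per batch, so each launch touches a smaller working set.
const int BATCH = 32;
for (int off = 0; off < n3_total - 2; off += BATCH) {
    int remaining = n3_total - 2 - off;                     // interior planes left
    int planes = (remaining < BATCH) ? remaining : BATCH;   // planes this batch
    pressure_gpu_het_vp2<<<grid, block>>>(pc, po, vp2,
                                          n1, n2, planes + 2, off);
}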
Batched Execution
Done.
[Figure, left: First Try vs. Batched: MTP (GB/s) vs. cube size, batch 32 vs. no batch. Right: batch sizes 32, 100, 300 vs. no batch.]
The Density Problem
The density equation has Vp inside the difference, which means twice the number of neighbors to fetch.
[Figure: Pressure vs. Density Naïve: MTP (GB/s) vs. cube size.]
What Problem? Add Variable!
Just replace vp*current inside the derivative with a variable. At every time-step:
− launch a ucvp = uc*vp kernel
− launch the solver, take the ucvp derivative
Why is it slower?
− we introduced an additional read+write
− the additional read+write doesn't count: same problem, same result, same performance metric formula!
[Figure: Pressure vs. Density Naïve vs. Add Var: MTP (GB/s) vs. cube size.]
THAT problem.
The Density Trick
The code looks a little "repetitive": we're multiplying by vp a whole lot of times:

unew = 2*ucc - uo[CENTER]*abs
     + uc[RIGHT] *v2[RIGHT]  + uc[LEFT] *v2[LEFT]
     + uc[RIGHT2]*v2[RIGHT2] + uc[LEFT2]*v2[LEFT2]
     + ucp*v2p + ucm*v2m - 6*ucc*v2c;

What if we do this:

unew = (2*ucc - uo[CENTER]*abs) / vp[CENTER]   // divide by vp for time-step!
     + uc[RIGHT] + uc[LEFT]
     + uc[RIGHT2] + uc[LEFT2]
     + ucp + ucm - 6*ucc;
uo[CENTER] = unew*abs*vp[CENTER];              // store wave-fields pre-multiplied with vp!

Same memory usage, same N_IO, but fewer neighbor reads!
The Density Trick
The memory access pattern is the same as pressure, so the performance is the same as pressure!
[Figure: Pressure vs. Density Naïve vs. Add Var vs. Pre-Mul: MTP (GB/s) vs. cube size.]
And BOOM goes the dynamite.
Elastic Solver Implementation
General Considerations
Very similar situation to the acoustic solver:
• Fewer neighbors because of 1st derivatives, but more variables
• Use batching, 32 × 8 thread-blocks, fast-axis sizes multiples of 32
Use a staggered grid:
• Average materials on the fly
All variables the same size:
• All coalesced, same stride for everyone
Staggered Grid – Elementary Cell
Every grid point contains 12 elements:
‒ 3 particle velocity components Vx, Vy, Vz
‒ 3 normal stress components Sxx, Syy, Szz
‒ 3 shear stress components Sxy, Sxz, Syz
‒ 3 material properties ρ, λ, μ
[Figure: 2D cell sketch: Sxx, Syy, Szz, ρ, λ, μ at the cell center; Vx and Vy on the edges; Sxy at the corner; Vz, Sxz, Syz out of screen.]
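The deck keeps all variables the same size with the same stride ("all coalesced, same stride for everyone"), which suggests one array per field rather than an array of cells; a hypothetical struct-of-arrays sketch of that layout (names are illustrative, not from the deck):

// One array per field, each of identical size n1*n2*n3, so every kernel
// walks every field with the same coalesced stride.
struct ElasticFields {
    float *vx, *vy, *vz;        // particle velocities   (DOF, read & write)
    float *sxx, *syy, *szz;     // normal stresses       (DOF, read & write)
    float *sxy, *sxz, *syz;     // shear stresses        (DOF, read & write)
    float *rho, *lambda, *mu;   // material properties   (read only)
};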
Staggered Grid
┼ Everyone is surrounded by the correct spatial-difference neighbors
‒ Materials over Velocity and Shear need to be averaged
Staggered Grid
[Figure: update regions for Vy and Vx, Sxz over Vx, Sxy over Vy; legend: area updated; ghost stress, ignored; ghost stress, updated; boundary velocity, from neighbor or boundary condition.]
Separate Stress and Velocity Update
[Figure: grid split into Velocity Kernel and Stress Kernel regions.]
Separate Stress and Velocity Update
[Figure annotations:]
‒ one field needs to be at time-step t, the other at time-step t+1/2
‒ each region is handled by a thread-block, possibly on a different SM
‒ thread-block scheduling is unknown
‒ consequence: read redundancy
Separate Stress and Shear Update
[Figure: grid split into Velocity Kernel, Stress Kernel, and Shear Kernel regions.]
Separate Stress and Shear Update
‒ one kernel loops i = 0…n-1, the other i = 1…n
‒ the resulting divergence was experimentally established to be slightly worse than read redundancy
Individual Kernel Performance
− Normal stress has no material averaging
− Velocity needs to average density from 2 values, for 3 different positions
− Shear stress needs to average the Lamé coefficient from 4 values, for 3 different positions
[Figure: Elastic Kernels: MTP (GB/s) vs. cube size for Stress, Velocity, Shear, and the full Elastic sequence.]
Run in sequence, they suffer read redundancy.
Read Redundancy
Individual kernels are close to the limit, but introduce read redundancy:
− shear stress: read 3 V, read 3 SX, read 1 M, write 3 SX (10)
− normal stress: read 3 V (AGAIN), read 3 S, read L, read M (AGAIN), write 3 S (11, 4 redundant)
− velocity: read 3 V (AGAIN), read 3 SX (AGAIN), read 3 S (AGAIN), read R, write 3 V (13, 9 redundant)
• Total 34, 13 redundant
We could totally cheat and say we're doing 34 IO, and therefore our peak performance is 60 / 21 × 34 = 97 GB/s → 83% of memcpy speed!
It's important to know whether an algorithm has room for improvement or not… this one definitely has!
Implementation Summary
Respected the number 32:
− Memory segments, warps, and L1 cache lines
Relied on cache:
− Only works if the working unit is small enough
− So, reduce your working units
Give hardware maximum possibility to parallelize:
− No __syncthreads()
− Minimum divergence
Summary
Ideal memory throughput:
− MTP = N_IO × Grid Size × Word Size / time elapsed
− N_IO = 2 × DOF + Const
− GFlops are misleading in a memory-bound situation
− Counting neighbors is a crime
Real-world applications can approach memcpy throughput:
− acoustic: 85% (100 GB/s on M2070, 180 GB/s on K10)
− elastic: 52% (60 GB/s on M2070, 100 GB/s on K10)
Physics at memcpy throughput: physics for free!
Every Algorithm's Dream…
For fixed problem size and hardware capabilities… which is faster?
[Diagram: Read | Compute | Write, shrinking to Read | C | Write, shrinking to Read | Write.]
− 3D FFT: 40 GB/s, 180 Gflops/s
− Acoustic: 100 GB/s, 70 Gflops/s
− Memcpy: 117 GB/s, 0 Gflops/s
References
− Performance Tuning of Scientific Applications, David H. Bailey and Robert F. Lucas
− GPU Performance Analysis and Optimization, Paulius Micikevicius, GTC 2012
− 3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, 2009
− Numerical Modeling in Fortran, Day 9, Paul Tackley, 2012
Questions?
Performance Peak Analysis

App Throughput (TP)   GB/s   Application speed
Hardware (HW) TP      GB/s   Hardware's transfer throughput (profile how many bytes are transferred)
HW TP Limit           GB/s   Practical throughput limit (memcpy): practical instead of theoretical
App / Limit           %      100% is ideal
App / HW              %      Less than 100% means not all bytes transferred are used
HW / Limit            %      Less than 100% means the memory bus is underutilized

Less than 100% is only critical if it's substantially less.
Performance Peak Analysis: Density

GPU                        M2070   GK104
App Throughput (TP) GB/s   100.0   88.5
Hardware (HW) TP GB/s      103.6   92.1
HW TP Limit GB/s           117.6   114.3
App / Limit %              85.0    77.4
App / HW %                 96.5    96.1
HW / Limit %               88.1    80.6

Data from a 448-cubed, 10 time-step run. Access pattern OK. Could have more concurrent memory access, especially on GK104, to increase HW utilization.
Register Queue GK104 Profiled
Cache misses cause memory replays and stalls:

Metric                Queue    No Q     Comments
APP Time [sec]        0.148    0.180
APP MTP [GB/s]        89.379   73.375
Instructions [10^9]   52.404   60.156   replays
Writes [GB]           3.320    3.320
Reads [GB]            10.780   11.335   cache miss
Reads/cube            3.004    3.158    cache miss
HW MTP [GB/s]         95.266   81.416   stalls
APP / HW MTP [%]      93.800   90.123   cache miss
Density Trick Profiled on M2070
Cache misses cause memory replays and stalls:

Metric                Trick     Naive    Comments
APP Time [sec]        0.133     0.190
APP MTP [GB/s]        99.672    69.557
Instructions [10^9]   44.370    53.706   replays
Writes [GB]           3.320     3.320
Reads [GB]            10.455    12.836   cache miss
Reads/cube            2.913     3.577    cache miss
HW MTP [GB/s]         103.566   85.030   stalls
APP / HW MTP [%]      96.196    81.802   cache miss
Profiling Notes
• nvprof from CUDA toolkit 5.0: nvprof --event <event name>
• instructions: inst_issued (Fermi), inst_issued1 + 2*inst_issued2 (K10); per warp, 32 instructions per count
• writes: fb_subp0_write_sectors + fb_subp1_write_sectors; 32 bytes per count
• reads: fb_subp0_read_sectors + fb_subp1_read_sectors; 32 bytes per count
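A hypothetical invocation, for example (the solver binary name is an assumption; the flag spelling follows the slide above):

nvprof --event fb_subp0_read_sectors ./acoustic_solver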
Higher Order Approximations?
• Reported compute bound for large stencils, so not memory bound anymore
Would you prefer to pay for the bus or ride it for free?
• Reported higher accuracy
Assumes the function is well behaved and infinitely differentiable, which is not the case for heterogeneous media.
Ironically, the free mall ride in Denver is cleaner and newer than the normal busses you actually pay for.
Smooth vs. Real World
Let's approximate some derivatives:
– Waves are smooth, for sure
– Sine and cosine are infinitely differentiable
– Taylor approximation seems like a good idea
• All differences are multiplied by material properties
• If the property has a step, difference × property will have a step
• We chose a factor of 0.9 here → not a very rough step
• The function looks smooth…
[Figure: sin(x) (smooth) vs. sin(x) × 0.9 step (real).]
Smooth vs. Heterogeneous: 1st Derivative
Leading error terms:
− 2nd order: h^2 f^(iii)(a)
− 4th order: h^4 f^(v)(a)
− 6th order: h^6 f^(vii)(a)
− 8th order: h^8 f^(ix)(a)
[Figure: error vs. step size h for 2nd/4th/6th/8th order; smooth 1st derivative vs. real 1st derivative.]
My oh my, what do we have here?
Smooth vs. Heterogeneous: 2nd Derivative
Leading error terms:
− 2nd order: h^2 f^(iv)(a)
− 4th order: h^4 f^(vi)(a)
− 6th order: h^6 f^(viii)(a)
− 8th order: h^8 f^(x)(a)
[Figure: error vs. step size h for 2nd/4th/6th/8th order; smooth 2nd derivative vs. real 2nd derivative.]
All orders fail, but the higher ones seem worse.
1D Solvers: Pressure
[Figure: sum of absolute error (1E-6) vs. points per wavelength for 2nd/4th/6th/8th order; Pressure Hom. (left) and Pressure Het. (right).]
Higher order not substantially better below 6 ppw.
1D Solvers: Stress – Velocity
[Figure: sum of absolute error (1E-9) vs. points per wavelength for 2nd/4th/6th/8th order; SV Hom. (left) and SV Het. (right).]
Higher order WORSE below 6 ppw.
Higher Order Approximations?
• Reported larger time-step possible
A smaller time-step is required for the same resolution; lower resolution is problematic in heterogeneous media.
• In conclusion:
• More expensive to develop
• No accuracy benefits in heterogeneous media
• Building a Ferrari with shopping cart wheels is silly:
» also need higher-order boundary conditions
» also need higher-order time-stepping
» etc.
Higher order complications.
Kepler GK104
Same memcpy bandwidth, so we expect the same performance.
[Figure: GK104 vs. M2070: MTP (GB/s) vs. cube size (64 to 704).]
What's "wrong" with GK104?
GK104:
• max 2048 threads, 256 threads / TB
• occupancy 1 → 8 TB / SM
• 8 SM × 8 TB / SM → 64 TB concurrently
• 512 KB L2 → 8 KB L2 per TB
M2070:
• max 1536 threads, 256 threads / TB
• occupancy 2/3 → 4 TB / SM
• 14 SM × 4 TB → 56 TB concurrently
• 768 KB L2 → about 14 KB L2 per TB
• AND: 48 KB L1 → 12 KB L1 per TB
No need to fetch center and top: use the ancient register queue technique, sketched below
(3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, NVIDIA, 2009).
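A sketch of that register queue applied to the acoustic kernel from earlier (the _regq name is an assumption; macros EXIT_BND, CENTER, LEFT, etc. are the ones defined on the First Try slide). Along the slow axis, this thread's TOP plane becomes next iteration's CENTER and then BOT, so keep those three values in registers and fetch only one new plane per iteration:

__global__ void pressure_gpu_het_vp2_regq(const float* pc, float* po, const float* vp2,
                                          const int n1, const int n2, const int n3){
    EXIT_BND(i1,i2,n1,n2)
    float q_bot = pc[i1 + i2*n1];            // plane i3-1, starts at plane 0
    float q_cen = pc[i1 + i2*n1 + n1*n2];    // plane i3,   starts at plane 1
    for(int i3 = 1; i3 < n3-1; i3++){
        float q_top = pc[i1 + i2*n1 + (i3+1)*n1*n2];  // the only new plane read
        po[CENTER] = 2*q_cen - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2]
            + q_bot + q_top - 6*q_cen );
        q_bot = q_cen;   // shift the queue: center becomes bottom,
        q_cen = q_top;   // top becomes center
    }
}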
Register Queue
That's better.
[Figure: GK104 vs. M2070, with and without register queue: MTP (GB/s) vs. cube size.]
Further improvement is more likely through a concurrent access increase (more bytes in flight). Looking at compiler numbers, occupancy reduction to increase cache per TB seems like a bad idea (HW utilization limited). Fermi doesn't care, as expected.
The Kink
• Why is the volume and pressure performance curve so jagged, and why is there a massive kink down at 384 (12 × 32)?
• Suspect: accidental locality
Reading CENTER might prefetch someone's LEFT or RIGHT, or hit in cache.
Reading LEFT or RIGHT might prefetch someone's CENTER, or hit in cache.
[Diagram: thread-blocks TB 0,0 through TB 0,5 on SM 0 and SM 1; each read either goes to memory or hits in cache depending on its neighbors.]
The Kink
• If there were no accidental locality, there would be more IO operations than necessary, and a lower performance ceiling:
− unnecessary right OR left: 5 instead of 4 IO → 4/5 = 80% throughput
− unnecessary right AND left: 6 instead of 4 IO → 4/6 = 66% throughput
• How to test? Create the 80% situation with a chess pattern.
The Kink – Chess Experiment
• prevent any possibility of accidental locality by removing all neighbors (chessboard pattern), see the index sketch below
• specifically: fast axis index = (2*blockIdx.x + blockIdx.y%2) * blockDim.x + threadIdx.x
• no direct neighbors that can help each other, and either the left or the right overfetch is an unnecessary additional read
• expect 80% of peak performance: 80 GB/s, as the benchmark shows!
[Figure: Pressure Normal vs. Pressure Chess: MTP (GB/s) vs. cube size.]
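As a drop-in sketch (the assumption being that it replaces the i1 computation inside EXIT_BND), the chess pattern shifts every other row of thread-blocks by one block width:

// Every other row of thread-blocks skips one block width on the fast axis,
// so no two concurrently resident blocks touch adjacent data.
int i1 = (2*blockIdx.x + blockIdx.y % 2) * blockDim.x + threadIdx.x;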
The Kink – Locality Effect?
• second experiment: comment out the left and right neighbor accesses
• results are relatively flat, not jagged
• 14 SM on M2070: peak at 448 = 14 × 32?
• 384 = 12 × 32: some especially bad locality situation?
[Figure: Pressure Normal vs. Pressure Chess vs. Pressure No Left+Right: MTP (GB/s) vs. cube size.]
Averaged Materials
• Our weakest link is obviously the shear stress kernel
• Most probably because of the material average
• What if we pre-average and store Mue, Mue_x, Mue_y and Mue_z?
• Less pressure on cache and a faster solver?
Interesting.
Averaged Materials
Alas.
• The current shear stress kernel peaks at 85 GB/s → even if it went up to 100 GB/s, overall performance wouldn't improve much
• The shear stress kernel currently has 7 reads and 3 writes, total 10. Adding 3 extra Mue arrays to read would increase the total to 13.
What memory throughput would MATCH the existing version?

t_1 = s_1 / mtp_1,  t_2 = s_2 / mtp_2
t_2 = t_1  →  mtp_2 = (s_2 / s_1) × mtp_1 = 1.3 × 85 GB/s = 110.5 GB/s

Maybe possible, but even if, it would still use much more memory.