TRANSCRIPT
Igor Podladtchikov, Spectraseis Inc
March 19, 2013
Memory Bound Wave Propagation at Hardware Limit
© Spectraseis Inc. 2013
Microseismic Monitoring: Time Reversed Imaging (TRI)
Geophysical method to locate subsurface events:
• Propagate and image time-reversed data acquired at the surface
Use the full wave equation:
• Acoustic or elastic
• Heterogeneous materials
Need very fast solvers:
• Thousands of events
• Big models
Performance Expectations
Results
Acoustic Solver Implementation
Elastic Solver Implementation
Summary
Roadmap
Performance Limiters
The two principal performance limiters:
• Processors: computation, ~1000 Gflops/s
• Memory: transfer, ~100 GB/s
Acoustic Equations
1 variable read & write
2 variables read only
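For reference, a sketch of the equation behind these counts (not spelled out on the slide, but consistent with the pressure kernel shown later, where the time-step factor is absorbed into the stored v_p^2): the second-order acoustic wave equation, with pressure p the single read-and-write variable and v_p^2 plus the absorbing coefficient the read-only constants:

\[ \frac{\partial^2 p}{\partial t^2} = v_p^2 \,\nabla^2 p \]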
Elastic Equations
9 variables read & write
3 variables read only
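Again for reference, a sketch of the first-order velocity-stress system these counts describe (assuming the standard isotropic formulation; the slide shows only the counts, and the staggered-grid slide later confirms the variables): 3 particle velocities and 6 stresses are read and written (9 DOF), while the material properties ρ, λ, μ are read only:

\[ \rho\,\frac{\partial v_i}{\partial t} = \frac{\partial \sigma_{ij}}{\partial x_j}, \qquad \frac{\partial \sigma_{ij}}{\partial t} = \lambda\,\delta_{ij}\,\frac{\partial v_k}{\partial x_k} + \mu\left(\frac{\partial v_i}{\partial x_j} + \frac{\partial v_j}{\partial x_i}\right) \]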
Arithmetic Intensity
flops / bytes ratio:
• Compute or memory bound?
BYTES, not numbers:
• Single precision: 4 bytes
• e.g. 1st derivative: 2 reads, 1 write, 12 bytes transferred
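As a worked example: the two-point 1st derivative above costs 2 flops (one subtraction, one scaling) against 12 bytes moved, so

\[ \mathrm{AI} = \frac{\text{flops}}{\text{bytes}} = \frac{2}{12} \approx 0.17, \]

which matches the first column of the table two slides ahead.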
Machine Balance
M2070 machine balance:
• Peak flops / bytes: 1030 / 117 ≈ 9
K10 machine balance:
• Peak flops / bytes: 4577 / 228 ≈ 20
Arithmetic intensity:
• << machine balance: memory bound
• >> machine balance: compute bound
Arithmetic Intensity
Machine balance:
− Fermi: ~9
− Kepler: ~20

         1st deriv   2nd deriv   acoustic   elastic
flops    2           5           24         99
bytes    12          16          68         444
ratio    0.17        0.31        0.35       0.22

All ratios << 9: we're memory bound.
Memory Bound – What To Do?
Option A:
− Give up (don't even try)
− Blame "memory bound" for slow code
Option B:
− Celebrate: FLOPS are for FREE
− Optimize memory access efficiency
− Count bytes, not flops
− Try to approach memcpy throughput
Our claim: real world applications can run close to memcpy!
How to optimize memory access?
Try to avoid redundant reads / writes:
− Aim for minimum reads / writes
• Touch everything once (un-improvable)
• Don't read neighbors twice
"Read me once… …don't read me again!"
Don't count neighbor reads!
Don't cheat (yourself):
− Track optimization progress
− Don't count neighbors in your performance metric!
"Remember traffic is the volume of data to a particular memory. It is not the number of loads and stores."
— Performance Tuning of Scientific Applications
Ideal Memory Throughput
No neighbors!

MTP = N_IO × Grid Size × Word Size / Time Elapsed
N_IO = 2 × DOF + Constants

DOF: degrees of freedom (read and write)
Constants: read only
Grid Size: nx * ny * nz * nt
Word Size: 4 bytes (single precision)
Ideal Memory Throughput
N_IO Acoustic: 2 × 1 + 2 = 4
N_IO Elastic: 2 × 9 + 3 = 21

Acoustic: MTP = 4 × Grid Size × 4 bytes / (Time Elapsed × 2^30) GB/s
Elastic: MTP = 21 × Grid Size × 4 bytes / (Time Elapsed × 2^30) GB/s
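A minimal host-side helper capturing this metric (a sketch; the function name is an assumption, the formula follows the definitions above):

// Ideal memory throughput in GB/s: each DOF counts once for read and once
// for write, each constant once for read; neighbor reads count for nothing.
float ideal_mtp_gbps(int n_dof, int n_const,
                     long long grid_size,    // nx * ny * nz * nt
                     double elapsed_sec)
{
    long long n_io = 2LL * n_dof + n_const;
    double bytes = (double)n_io * (double)grid_size * 4.0;  // single precision
    return (float)(bytes / (elapsed_sec * (double)(1LL << 30)));
}

// e.g. acoustic: ideal_mtp_gbps(1, 2, ...), elastic: ideal_mtp_gbps(9, 3, ...)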
Results
…without further ado:
− All solvers include
• free surface
• absorbing layer
• domain decomposition along all 3 axes
• IPC if GPUs are peer-mappable, MPI otherwise
− All solvers verified against single-CPU code
− All data from the NVIDIA PSG Cluster – Thank You!
Memory Throughput on M2070
[Figure: MTP (GB/s) vs. cube size (64 to 704) for Memcpy, Pressure, Density, and Elastic on M2070.]
Real physics at 85% and 52% of the hardware limit.
Neighbors Don't Count!
Acoustic pressure update:

po[CENTER] = 2*pcc - po[CENTER]*abs + vp2[CENTER] *
    (
        pc[LEFT ] + pc[RIGHT ]
      + pc[LEFT2 ] + pc[RIGHT2]
      + pcm + pcp
      - 6*pcc
    );

If we count the 6 neighbor reads as IO operations:
− 10 IO operations instead of 4
− 100 GB/s / 4 × 10 = 250 GB/s apparent MTP peak on M2070
− but the theoretical hardware limit is 150 GB/s
DON'T COUNT NEIGHBOR READS
Memory Throughput on K10
[Figure: MTP (GB/s) vs. cube size (64 to 704) for Memcpy, Pressure, Density, and Elastic on K10.]
Strong scaling on both K10 GPUs: same size and power consumption as M2070!
Other GPUs
[Figure: Memcpy, Pressure, Density, and Elastic MTP (GB/s) on K10, K20X, K20, M2090, M2070, and GK104.]
The green cards win.
Multi-GPU Weak Scaling on GK104
PCIe 2: 6 GB/s
PCIe 3: 12 GB/s
[Figure: per-GPU MTP as % of a single GPU vs. cube size, for Density and Elastic, on 2 nodes (IPC), 4 nodes (MPI PCIe3), and 8 nodes (IB FDR).]
Results Summary
− Defined ideal, un-improvable memory throughput
• MTP = N_IO × Grid Size × Word Size / time elapsed
• N_IO = 2 × DOF + Const
• No neighbors or temporary variables
− Came close to memcpy with real-world applications
• acoustic: 85%
• elastic: 52%
• performance proportional to memcpy on various architectures
− Solvers scale on multiple GPUs
Acoustic Solver Implementation
General Considerations
Respect the number 32:
− 32 × 8 thread-blocks
− Fast-axis sizes multiples of 32 (can be padded)
− Hit global memory segments and L1 cache lines (32 × 4 B = 128 B)
Rely on cache:
− Shared memory requires extra operations
− Shared memory needs __syncthreads()
− Registers are faster than shared memory
− If the working set fits in cache, cache is faster
First Try – Acoustic Pressure
Yes, that's it:

#define EXIT_BND(xx,yy,nx,ny) \
    int xx = blockIdx.x*blockDim.x + threadIdx.x; if(xx < 1 || xx >= nx - 1) return; \
    int yy = blockIdx.y*blockDim.y + threadIdx.y; if(yy < 1 || yy >= ny - 1) return;

#define CENTER i1   + i2*n1     + i3*n1*n2
#define RIGHT  i1+1 + i2*n1     + i3*n1*n2
#define LEFT   i1-1 + i2*n1     + i3*n1*n2
#define RIGHT2 i1   + (i2+1)*n1 + i3*n1*n2
#define LEFT2  i1   + (i2-1)*n1 + i3*n1*n2
#define TOP    i1   + i2*n1     + (i3+1)*n1*n2
#define BOT    i1   + i2*n1     + (i3-1)*n1*n2

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3){
    EXIT_BND(i1,i2,n1,n2)          // one thread per (i1,i2) column, skip boundary
    int i3;
    for(i3 = 1; i3 < n3-1; i3++){  // march along the slow axis
        po[CENTER] = 2*pc[CENTER] - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2] + pc[BOT] + pc[TOP] -
            6*pc[CENTER] );
    }
}
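The launch configuration is not shown on the slide; a minimal host-side sketch consistent with the "32 × 8 thread-blocks" rule above (the grid rounding is an assumption):

dim3 block(32, 8);                        // respect the number 32
dim3 grid((n1 + block.x - 1) / block.x,   // cover the two fast axes
          (n2 + block.y - 1) / block.y);
pressure_gpu_het_vp2<<<grid, block>>>(pc, po, vp2, n1, n2, n3);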
First Try – Acoustic Pressure
[Figure: MTP (GB/s) vs. cube size (64 to 704); strong at small cubes ("Yay!"), falling off at large cube sizes ("Boo Hoo Hoo…").]
Pretty good, but not good enough.
First Try
Suspect TLB misses:
• Translation Lookaside Buffers
• Accelerate translation from virtual to physical memory
• Act like caches on the page table
"If the kernel's working set … exceeds TLB capacity (or associativity) then one generates TLB capacity (or conflict) misses."
— Performance Tuning of Scientific Applications
Batched Execution
If the kernel's working set is too big, we'll reduce it: launch kernel batches for the slowest axis.

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3,
                                     const int offset){
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    for(i3 = offset+1; i3 < offset+n3-1; i3++){   // only n3-2 planes per launch
        po[CENTER] = 2*pc[CENTER] - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2] + pc[BOT] + pc[TOP] -
            6*pc[CENTER] );
    }
}
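The slides don't show the host loop; a sketch of how such batches might be launched (BATCH = 32 is an assumption based on the "batch 32" series in the next chart; boundary handling simplified):

// Split the slow axis into batches of interior planes and launch one
// kernel per batch, so each launch touches a smaller working set.
const int BATCH = 32;
for (int off = 0; off < n3_total - 2; off += BATCH) {
    int remaining = n3_total - 2 - off;                     // interior planes left
    int planes = (remaining < BATCH) ? remaining : BATCH;   // planes this batch
    pressure_gpu_het_vp2<<<grid, block>>>(pc, po, vp2,
                                          n1, n2, planes + 2, off);
}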
Batched Execution
Done.
[Figure, left: First Try vs. Batched: MTP (GB/s) vs. cube size, batch 32 vs. no batch. Right: batch sizes 32, 100, 300 vs. no batch.]
The Density Problem
The density equation has Vp inside the difference, which means twice the number of neighbors to fetch.
[Figure: Pressure vs. Density Naïve: MTP (GB/s) vs. cube size.]
What Problem? Add Variable!
Just replace vp*current inside the derivative with a variable. At every time-step:
− launch a ucvp = uc*vp kernel
− launch the solver, take the ucvp derivative
Why is it slower?
− we introduced an additional read+write
− the additional read+write doesn't count: same problem, same result, same performance metric formula!
[Figure: Pressure vs. Density Naïve vs. Add Var: MTP (GB/s) vs. cube size.]
THAT problem.
The Density Trick
The code looks a little "repetitive": we're multiplying by vp a whole lot of times:

unew = 2*ucc - uo[CENTER]*abs
     + uc[RIGHT] *v2[RIGHT]  + uc[LEFT] *v2[LEFT]
     + uc[RIGHT2]*v2[RIGHT2] + uc[LEFT2]*v2[LEFT2]
     + ucp*v2p + ucm*v2m - 6*ucc*v2c;

What if we do this:

unew = (2*ucc - uo[CENTER]*abs) / vp[CENTER]   // divide by vp for time-step!
     + uc[RIGHT] + uc[LEFT]
     + uc[RIGHT2] + uc[LEFT2]
     + ucp + ucm - 6*ucc;
uo[CENTER] = unew*abs*vp[CENTER];              // store wave-fields pre-multiplied with vp!

Same memory usage, same N_IO, but fewer neighbor reads!
The Density Trick
The memory access pattern is the same as pressure, so the performance is the same as pressure!
[Figure: Pressure vs. Density Naïve vs. Add Var vs. Pre-Mul: MTP (GB/s) vs. cube size.]
And BOOM goes the dynamite.
Elastic Solver Implementation
General Considerations
Very similar situation to the acoustic solver:
• Fewer neighbors because of 1st derivatives, but more variables
• Use batching, 32 × 8 thread-blocks, fast-axis sizes multiples of 32
Use a staggered grid:
• Average materials on the fly
All variables the same size:
• All coalesced, same stride for everyone
Staggered Grid – Elementary Cell
Every grid point contains 12 elements:
‒ 3 particle velocity components Vx, Vy, Vz
‒ 3 normal stress components Sxx, Syy, Szz
‒ 3 shear stress components Sxy, Sxz, Syz
‒ 3 material properties ρ, λ, μ
[Figure: 2D cell sketch: Sxx, Syy, Szz, ρ, λ, μ at the cell center; Vx and Vy on the edges; Sxy at the corner; Vz, Sxz, Syz out of screen.]
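The deck keeps all variables the same size with the same stride ("all coalesced, same stride for everyone"), which suggests one array per field rather than an array of cells; a hypothetical struct-of-arrays sketch of that layout (names are illustrative, not from the deck):

// One array per field, each of identical size n1*n2*n3, so every kernel
// walks every field with the same coalesced stride.
struct ElasticFields {
    float *vx, *vy, *vz;        // particle velocities   (DOF, read & write)
    float *sxx, *syy, *szz;     // normal stresses       (DOF, read & write)
    float *sxy, *sxz, *syz;     // shear stresses        (DOF, read & write)
    float *rho, *lambda, *mu;   // material properties   (read only)
};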
Staggered Grid
┼ Everyone is surrounded by the correct spatial-difference neighbors
‒ Materials over Velocity and Shear need to be averaged
Staggered Grid
[Figure: update regions for Vy and Vx, Sxz over Vx, Sxy over Vy; legend: area updated; ghost stress, ignored; ghost stress, updated; boundary velocity, from neighbor or boundary condition.]
Separate Stress and Velocity Update
[Figure: grid split into Velocity Kernel and Stress Kernel regions.]
Separate Stress and Velocity Update
[Figure annotations:]
‒ one field needs to be at time-step t, the other at time-step t+1/2
‒ each region is handled by a thread-block, possibly on a different SM
‒ thread-block scheduling is unknown
‒ consequence: read redundancy
Separate Stress and Shear Update
[Figure: grid split into Velocity Kernel, Stress Kernel, and Shear Kernel regions.]
Separate Stress and Shear Update
‒ one kernel loops i = 0…n-1, the other i = 1…n
‒ the resulting divergence was experimentally established to be slightly worse than read redundancy
Individual Kernel Performance
− Normal stress has no material averaging
− Velocity needs to average density from 2 values, for 3 different positions
− Shear stress needs to average the Lamé coefficient from 4 values, for 3 different positions
[Figure: Elastic Kernels: MTP (GB/s) vs. cube size for Stress, Velocity, Shear, and the full Elastic sequence.]
Run in sequence, they suffer read redundancy.
Read Redundancy
Individual kernels are close to the limit, but introduce read redundancy:
− shear stress: read 3 V, read 3 SX, read 1 M, write 3 SX (10)
− normal stress: read 3 V (AGAIN), read 3 S, read L, read M (AGAIN), write 3 S (11, 4 redundant)
− velocity: read 3 V (AGAIN), read 3 SX (AGAIN), read 3 S (AGAIN), read R, write 3 V (13, 9 redundant)
• Total 34, 13 redundant
We could totally cheat and say we're doing 34 IO, and therefore our peak performance is 60 / 21 × 34 = 97 GB/s → 83% of memcpy speed!
It's important to know whether an algorithm has room for improvement or not… this one definitely has!
Implementation Summary
Respected the number 32:
− Memory segments, warps, and L1 cache lines
Relied on cache:
− Only works if the working unit is small enough
− So, reduce your working units
Give hardware maximum possibility to parallelize:
− No __syncthreads()
− Minimum divergence
Summary
Ideal memory throughput:
− MTP = N_IO × Grid Size × Word Size / time elapsed
− N_IO = 2 × DOF + Const
− GFlops are misleading in a memory-bound situation
− Counting neighbors is a crime
Real-world applications can approach memcpy throughput:
− acoustic: 85% (100 GB/s on M2070, 180 GB/s on K10)
− elastic: 52% (60 GB/s on M2070, 100 GB/s on K10)
Physics at memcpy throughput: physics for free!
Every Algorithm's Dream…
For fixed problem size and hardware capabilities… which is faster?
[Diagram: Read | Compute | Write, shrinking to Read | C | Write, shrinking to Read | Write.]
− 3D FFT: 40 GB/s, 180 Gflops/s
− Acoustic: 100 GB/s, 70 Gflops/s
− Memcpy: 117 GB/s, 0 Gflops/s
References
− Performance Tuning of Scientific Applications, David H. Bailey and Robert F. Lucas
− GPU Performance Analysis and Optimization, Paulius Micikevicius, GTC 2012
− 3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, 2009
− Numerical Modeling in Fortran, Day 9, Paul Tackley, 2012
Questions?
Performance Peak Analysis

App Throughput (TP)   GB/s   Application speed
Hardware (HW) TP      GB/s   Hardware's transfer throughput (profile how many bytes are transferred)
HW TP Limit           GB/s   Practical throughput limit (memcpy): practical instead of theoretical
App / Limit           %      100% is ideal
App / HW              %      Less than 100% means not all bytes transferred are used
HW / Limit            %      Less than 100% means the memory bus is underutilized

Less than 100% is only critical if it's substantially less.
Performance Peak Analysis: Density

GPU                        M2070   GK104
App Throughput (TP) GB/s   100.0   88.5
Hardware (HW) TP GB/s      103.6   92.1
HW TP Limit GB/s           117.6   114.3
App / Limit %              85.0    77.4
App / HW %                 96.5    96.1
HW / Limit %               88.1    80.6

Data from a 448-cubed, 10 time-step run. Access pattern OK. Could have more concurrent memory access, especially on GK104, to increase HW utilization.
Register Queue GK104 Profiled
Cache misses cause memory replays and stalls:

Metric                Queue    No Q     Comments
APP Time [sec]        0.148    0.180
APP MTP [GB/s]        89.379   73.375
Instructions [10^9]   52.404   60.156   replays
Writes [GB]           3.320    3.320
Reads [GB]            10.780   11.335   cache miss
Reads/cube            3.004    3.158    cache miss
HW MTP [GB/s]         95.266   81.416   stalls
APP / HW MTP [%]      93.800   90.123   cache miss
Density Trick Profiled on M2070
Cache misses cause memory replays and stalls:

Metric                Trick     Naive    Comments
APP Time [sec]        0.133     0.190
APP MTP [GB/s]        99.672    69.557
Instructions [10^9]   44.370    53.706   replays
Writes [GB]           3.320     3.320
Reads [GB]            10.455    12.836   cache miss
Reads/cube            2.913     3.577    cache miss
HW MTP [GB/s]         103.566   85.030   stalls
APP / HW MTP [%]      96.196    81.802   cache miss
Profiling Notes
• nvprof from CUDA toolkit 5.0: nvprof --event <event name>
• instructions: inst_issued (Fermi), inst_issued1 + 2*inst_issued2 (K10); per warp, 32 instructions per count
• writes: fb_subp0_write_sectors + fb_subp1_write_sectors; 32 bytes per count
• reads: fb_subp0_read_sectors + fb_subp1_read_sectors; 32 bytes per count
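A hypothetical invocation, for example (the solver binary name is an assumption; the flag spelling follows the slide above):

nvprof --event fb_subp0_read_sectors ./acoustic_solver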
Higher Order Approximations?
• Reported compute bound for large stencils, so not memory bound anymore
Would you prefer to pay for the bus or ride it for free?
• Reported higher accuracy
Assumes the function is well behaved and infinitely differentiable, which is not the case for heterogeneous media.
Ironically, the free mall ride in Denver is cleaner and newer than the normal busses you actually pay for.
Smooth vs. Real World
Let's approximate some derivatives:
– Waves are smooth, for sure
– Sine and cosine are infinitely differentiable
– Taylor approximation seems like a good idea
• All differences are multiplied by material properties
• If the property has a step, difference × property will have a step
• We chose a factor of 0.9 here → not a very rough step
• The function looks smooth…
[Figure: sin(x) (smooth) vs. sin(x) × 0.9 step (real).]
Smooth vs. Heterogeneous: 1st Derivative
Leading error terms:
− 2nd order: h^2 f^(iii)(a)
− 4th order: h^4 f^(v)(a)
− 6th order: h^6 f^(vii)(a)
− 8th order: h^8 f^(ix)(a)
[Figure: error vs. step size h for 2nd/4th/6th/8th order; smooth 1st derivative vs. real 1st derivative.]
My oh my, what do we have here?
Smooth vs. Heterogeneous: 2nd Derivative
Leading error terms:
− 2nd order: h^2 f^(iv)(a)
− 4th order: h^4 f^(vi)(a)
− 6th order: h^6 f^(viii)(a)
− 8th order: h^8 f^(x)(a)
[Figure: error vs. step size h for 2nd/4th/6th/8th order; smooth 2nd derivative vs. real 2nd derivative.]
All orders fail, but the higher ones seem worse.
1D Solvers: Pressure
[Figure: sum of absolute error (1E-6) vs. points per wavelength for 2nd/4th/6th/8th order; Pressure Hom. (left) and Pressure Het. (right).]
Higher order not substantially better below 6 ppw.
1D Solvers: Stress – Velocity
[Figure: sum of absolute error (1E-9) vs. points per wavelength for 2nd/4th/6th/8th order; SV Hom. (left) and SV Het. (right).]
Higher order WORSE below 6 ppw.
Higher Order Approximations?
• Reported larger time-step possible
A smaller time-step is required for the same resolution; lower resolution is problematic in heterogeneous media.
• In conclusion:
• More expensive to develop
• No accuracy benefits in heterogeneous media
• Building a Ferrari with shopping cart wheels is silly:
» also need higher-order boundary conditions
» also need higher-order time-stepping
» etc.
Higher order complications.
Kepler GK104
Same memcpy bandwidth, so we expect the same performance.
[Figure: GK104 vs. M2070: MTP (GB/s) vs. cube size (64 to 704).]
What's "wrong" with GK104?
GK104:
• max 2048 threads, 256 threads / TB
• occupancy 1 → 8 TB / SM
• 8 SM × 8 TB / SM → 64 TB concurrently
• 512 KB L2 → 8 KB L2 per TB
M2070:
• max 1536 threads, 256 threads / TB
• occupancy 2/3 → 4 TB / SM
• 14 SM × 4 TB → 56 TB concurrently
• 768 KB L2 → about 14 KB L2 per TB
• AND: 48 KB L1 → 12 KB L1 per TB
No need to fetch center and top: use the ancient register queue technique, sketched below
(3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, NVIDIA, 2009).
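A sketch of that register queue applied to the acoustic kernel from earlier (the _regq name is an assumption; macros EXIT_BND, CENTER, LEFT, etc. are the ones defined on the First Try slide). Along the slow axis, this thread's TOP plane becomes next iteration's CENTER and then BOT, so keep those three values in registers and fetch only one new plane per iteration:

__global__ void pressure_gpu_het_vp2_regq(const float* pc, float* po, const float* vp2,
                                          const int n1, const int n2, const int n3){
    EXIT_BND(i1,i2,n1,n2)
    float q_bot = pc[i1 + i2*n1];            // plane i3-1, starts at plane 0
    float q_cen = pc[i1 + i2*n1 + n1*n2];    // plane i3,   starts at plane 1
    for(int i3 = 1; i3 < n3-1; i3++){
        float q_top = pc[i1 + i2*n1 + (i3+1)*n1*n2];  // the only new plane read
        po[CENTER] = 2*q_cen - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2]
            + q_bot + q_top - 6*q_cen );
        q_bot = q_cen;   // shift the queue: center becomes bottom,
        q_cen = q_top;   // top becomes center
    }
}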
Register Queue
That's better.
[Figure: GK104 vs. M2070, with and without register queue: MTP (GB/s) vs. cube size.]
Further improvement is more likely through a concurrent access increase (more bytes in flight). Looking at compiler numbers, occupancy reduction to increase cache per TB seems like a bad idea (HW utilization limited). Fermi doesn't care, as expected.
The Kink
• Why is the volume and pressure performance curve so jagged, and why is there a massive kink down at 384 (12 × 32)?
• Suspect: accidental locality
Reading CENTER might prefetch someone's LEFT or RIGHT, or hit in cache.
Reading LEFT or RIGHT might prefetch someone's CENTER, or hit in cache.
[Diagram: thread-blocks TB 0,0 through TB 0,5 on SM 0 and SM 1; each read either goes to memory or hits in cache depending on its neighbors.]
The Kink
• If there were no accidental locality, there would be more IO operations than necessary, and a lower performance ceiling:
− unnecessary right OR left: 5 instead of 4 IO → 4/5 = 80% throughput
− unnecessary right AND left: 6 instead of 4 IO → 4/6 = 66% throughput
• How to test? Create the 80% situation with a chess pattern.
The Kink – Chess Experiment
• prevent any possibility of accidental locality by removing all neighbors (chessboard pattern), see the index sketch below
• specifically: fast axis index = (2*blockIdx.x + blockIdx.y%2) * blockDim.x + threadIdx.x
• no direct neighbors that can help each other, and either the left or the right overfetch is an unnecessary additional read
• expect 80% of peak performance: 80 GB/s, as the benchmark shows!
[Figure: Pressure Normal vs. Pressure Chess: MTP (GB/s) vs. cube size.]
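As a drop-in sketch (the assumption being that it replaces the i1 computation inside EXIT_BND), the chess pattern shifts every other row of thread-blocks by one block width:

// Every other row of thread-blocks skips one block width on the fast axis,
// so no two concurrently resident blocks touch adjacent data.
int i1 = (2*blockIdx.x + blockIdx.y % 2) * blockDim.x + threadIdx.x;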
The Kink – Locality Effect?
• second experiment: comment out the left and right neighbor accesses
• results are relatively flat, not jagged
• 14 SM on M2070: peak at 448 = 14 × 32?
• 384 = 12 × 32: some especially bad locality situation?
[Figure: Pressure Normal vs. Pressure Chess vs. Pressure No Left+Right: MTP (GB/s) vs. cube size.]
Averaged Materials
• Our weakest link is obviously the shear stress kernel
• Most probably because of the material average
• What if we pre-average and store Mue, Mue_x, Mue_y and Mue_z?
• Less pressure on cache and a faster solver?
Interesting.
Averaged Materials
Alas.
• The current shear stress kernel peaks at 85 GB/s → even if it went up to 100 GB/s, overall performance wouldn't improve much
• The shear stress kernel currently has 7 reads and 3 writes, total 10. Adding 3 extra Mue arrays to read would increase the total to 13.
What memory throughput would MATCH the existing version?

t_1 = s_1 / mtp_1,  t_2 = s_2 / mtp_2
t_2 = t_1  →  mtp_2 = (s_2 / s_1) × mtp_1 = 1.3 × 85 GB/s = 110.5 GB/s

Maybe possible, but even if, it would still use much more memory.