SHOC: Overview and Kernel Walkthrough Kyle Spafford Keeneland Tutorial April 14, 2011


Page 1: SHOC: Overview and Kernel Walkthrough

SHOC: Overview and Kernel Walkthrough

Kyle Spafford

Keeneland Tutorial
April 14, 2011

Page 2: SHOC: Overview and Kernel Walkthrough

Managed by UT-Battelle for the U.S. Department of Energy

The Scalable Heterogeneous Computing Benchmark Suite (SHOC)

• Focus on scientific computing workloads, including common kernels like SGEMM, FFT, Stencils
• Parallelized with MPI, with support for multi-GPU and cluster scale comparisons
• Implemented in both CUDA and OpenCL for a 1:1 comparison
• Includes system and stability tests

SHOC Results Browser (beta): http://ft.ornl.gov/~kspafford/shoctown/Shoctown.html

Page 3: SHOC: Overview and Kernel Walkthrough

Download SHOC

• Source code
  – http://ft.ornl.gov/doku/shoc/downloads
• Build and Run
  – http://ft.ornl.gov/doku/shoc/gettingstarted
    • sh ./conf/config-keeneland.sh
    • make
    • cd tools
    • perl driver.pl -cuda -s 4
  – Includes example output for Keeneland
• FAQ
  – http://ft.ornl.gov/doku/shoc/faq

Page 4: SHOC: Overview and Kernel Walkthrough

SHOC Categories

• Performance:
  – Level 0
    • Speeds and feeds: raw FLOPS rates, bandwidths, latencies
  – Level 1
    • Algorithms: FFT, matrix multiply, stencil, sort, etc.
  – Level 2
    • Application kernels: S3D (chemistry), molecular dynamics
• System:
  – PCIe Contention, MPI latency vs. host-device bandwidth, NUMA
• Stability:
  – FFT-based, error detection

Page 5: SHOC: Overview and Kernel Walkthrough

(Level 0 Example): DeviceMemory

• Motivation
  – Determine sustainable device memory bandwidth
  – Benchmark local, global, and image memory
• Basic design
  – Test different memory access patterns, i.e. coalesced, uncoalesced
  – Measure both read and write bandwidth
  – Vary number of threads in a block

[Figure: Coalesced vs. Thread sequential / Uncoalesced access patterns for Threads 1 to 4]
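The two access patterns can be sketched as a pair of CUDA kernels. This is an illustrative sketch, not SHOC's DeviceMemory source: the kernel names, the `checksum` host helper, and the launch sizes are assumptions made here, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: on every loop iteration, neighboring threads read neighboring
// words, so a warp's loads fall in a few contiguous memory segments.
__global__ void readCoalesced(const float* in, float* out, int n)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (int i = tid; i < n; i += nThreads)
        sum += in[i];
    out[tid] = sum;  // store the result so the loads cannot be optimized away
}

// Thread sequential: each thread walks its own contiguous chunk, so at any
// moment neighboring threads are `chunk` words apart (uncoalesced).
__global__ void readThreadSequential(const float* in, float* out, int n)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    int chunk    = (n + nThreads - 1) / nThreads;
    int begin    = tid * chunk;
    int end      = min(begin + chunk, n);
    float sum = 0.0f;
    for (int i = begin; i < end; ++i)
        sum += in[i];
    out[tid] = sum;
}

// Hypothetical host helper: runs one pattern over n ones and returns the
// sum of all per-thread results (equal to n for either pattern).
double checksum(bool coalesced, int n, int blocks = 64, int threads = 256)
{
    std::vector<float> h_in(n, 1.0f), h_out(blocks * threads);
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    if (coalesced) readCoalesced<<<blocks, threads>>>(d_in, d_out, n);
    else           readThreadSequential<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    double total = 0.0;
    for (float v : h_out) total += v;
    return total;
}
```

Both kernels read the same data and produce the same checksum; only the order of the loads differs. Timing each launch (for example with CUDA events) and dividing the bytes read by the elapsed time gives a bandwidth figure for each pattern.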

Page 6: SHOC: Overview and Kernel Walkthrough

SHOC: Level 0 Tests

• BusSpeedDownload/Readback
  – Measures bandwidth/latency of the PCIe bus
• DeviceMemory
  – Measures global/constant/shared memory
• KernelCompilation
  – Measures OpenCL JIT kernel compilation speeds
• MaxFlops
  – Measures achievable FLOPS (synthetic, not bandwidth-bound)
• QueueDelay
  – Measures OpenCL queueing system overhead

Page 7: SHOC: Overview and Kernel Walkthrough

(Level 1 Example): Stencil2D

• Motivation
  – Supports investigation of accelerator usage within parallel application context
  – Serial and True Parallel versions
• Basic design
  – 9-point stencil operation applied to 2D data set
  – MPI uses 2D Cartesian data distribution, with periodic halo exchanges
  – Applies stencil to data in local memory
• OpenCL/CUDA observations
  – Runtime dominated by data movement
    • Between host and card
    • Between MPI processes
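A 9-point stencil update can be sketched as follows. This is a simplified, single-GPU illustration and not the SHOC source: it reads directly from global memory rather than staging a tile in local (shared) memory, it has no MPI halo exchange, and the kernel name, the weight parameters, and the `stencilOnOnes` helper are assumptions made for this sketch.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread updates one interior cell from its 3x3 neighborhood:
// one center weight, one weight for the 4 cardinal neighbors,
// and one weight for the 4 diagonal neighbors.
__global__ void stencil9(const float* in, float* out, int rows, int cols,
                         float wCenter, float wCardinal, float wDiagonal)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    // Boundary (halo) cells are left untouched.
    if (r < 1 || c < 1 || r >= rows - 1 || c >= cols - 1) return;
    int i = r * cols + c;
    out[i] = wCenter * in[i]
           + wCardinal * (in[i - cols] + in[i + cols] + in[i - 1] + in[i + 1])
           + wDiagonal * (in[i - cols - 1] + in[i - cols + 1]
                        + in[i + cols - 1] + in[i + cols + 1]);
}

// Hypothetical host helper: applies one stencil pass to a grid of ones
// and returns the value of the center cell.
float stencilOnOnes(int rows, int cols, float wC, float wK, float wD)
{
    std::vector<float> h(rows * cols, 1.0f);
    size_t bytes = h.size() * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h.data(), bytes, cudaMemcpyHostToDevice);
    // Pre-fill the output so boundary cells keep their input values.
    cudaMemcpy(d_out, d_in, bytes, cudaMemcpyDeviceToDevice);
    dim3 block(16, 16);
    dim3 grid((cols + 15) / 16, (rows + 15) / 16);
    stencil9<<<grid, block>>>(d_in, d_out, rows, cols, wC, wK, wD);
    cudaMemcpy(h.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h[(rows / 2) * cols + cols / 2];
}
```

With a grid of ones, the result at any interior cell is simply the sum of the nine weights, which makes a quick sanity check. In a multi-GPU run, the untouched boundary cells are the ones that the halo exchange would refresh between iterations.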

Page 8: SHOC: Overview and Kernel Walkthrough

SHOC: Level 1 Tests

• FFT
• Reduction
• Scan
• SGEMM
• Sort
• SpMV
• Stencil2D
• Triad

Page 9: SHOC: Overview and Kernel Walkthrough

(Level 2 Example): S3D

• Motivation
  – Measure performance of important DOE application
  – S3D solves Navier-Stokes equations for a regular 3D domain, used to simulate combustion
• Basic design
  – Assign each grid point to a device thread
  – Highly parallel, as grid points are independent
• OpenCL/CUDA observations
  – CUDA outperforms OpenCL
    • Big factor: native transcendentals (sin, cos, tan, etc.)

[Figure: 3D Regular Domain Decomposition – Each thread handles a grid point, blocks handle regions]

Page 10: SHOC: Overview and Kernel Walkthrough

SHOC: Other Tests

• Stability
  – FFTs are sensitive to small errors
  – Repeated simultaneous FFT/iFFT
  – Parallel for testing large systems
• System
  – MPI contention
    • Impact of GPU usage on MPI latency
  – Chipset contention
    • Impact of MPI communication on GPU performance
  – NUMA
    • Multi-socket, multiple PCIe slot, multiple RAM banks

Page 11: SHOC: Overview and Kernel Walkthrough

Compare OpenCL and CUDA

• OpenCL improving, but still trailing CUDA
• Tesla C2050, CUDA\OpenCL 3.2 RC2

[Chart: CUDA Performance Relative to OpenCL. Benchmarks, in order: FFT, FFT DP, MD, MD DP, SGEMM, DGEMM, S3D, Reduction, Scan, Sort, SPMV-Vector DP, SPMV-ELLR DP. Bar values, in order: 5.33, 6.39, 1.20, 1.69, 1.49, 1.74, 0.99, 1.02, 1.26, 1.03, 1.99, 1.40. Y-axis: 0 to 7]

Page 12: SHOC: Overview and Kernel Walkthrough

Example Results

[Chart: SP Sparse Mat-Vec Multiplication (GB/s). Devices, in order: ATI Radeon HD5870, NV GTX580, NV GTX480, Tesla M2070, NV Ion. Bar values, in order: 4.52, 11.99, 7.8, 6.24, 0.24. Y-axis: 0 to 14]

Page 13: SHOC: Overview and Kernel Walkthrough


Reduction Walkthrough

Page 14: SHOC: Overview and Kernel Walkthrough

Reduction Walkthrough

• Fundamental kernel in almost all programs
• Easy to implement, but hard to get right
• We’ll walk through the optimization process.
• Code for these kernels is at http://ft.ornl.gov/~kspafford/tutorial.tgz
• Graphics from a similar presentation by Mark Harris, NVIDIA

Page 15: SHOC: Overview and Kernel Walkthrough


Reduction Walkthrough

• Start with the well-known, tree-based approach:

Page 16: SHOC: Overview and Kernel Walkthrough

Algorithm Sketch

• Launch 1 thread per element
• Each thread loads an element from global memory into shared memory
• Each block reduces its shared memory into 1 value
• This value is written back out to global memory

Page 17: SHOC: Overview and Kernel Walkthrough

Algorithm Sketch with Code

• Main steps
  – Each thread loads a value from global memory into shared memory

        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];

  – Synchronize threads

        __syncthreads();

  – Reduce shared memory into a single value
  – Write value out to global memory

Page 18: SHOC: Overview and Kernel Walkthrough

Reduction of Shared Memory

    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
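Putting the load, the shared-memory loop, and the write-back together gives a complete version 0 kernel. The sketch below assumes the input length is a multiple of the block size and that the block size is a power of two; the `blockSums0` host helper is a hypothetical wrapper added here for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Version 0: interleaved addressing with a divergent branch.
__global__ void reduce0(const float* g_idata, float* g_odata)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element into shared memory.
    sdata[tid] = g_idata[i];
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's result to global memory.
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns one partial sum per block.
// Assumes h_in.size() is a multiple of `threads`.
std::vector<float> blockSums0(const std::vector<float>& h_in, int threads)
{
    int n = static_cast<int>(h_in.size());
    int blocks = n / threads;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Third launch parameter: bytes of dynamic shared memory per block.
    reduce0<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);
    std::vector<float> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

For an input of 1024 ones and 256 threads per block, this yields four partial sums of 256 each; combining those partial sums into one value is a separate step.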

Page 19: SHOC: Overview and Kernel Walkthrough

Reduction v0

• Problem 1: Divergent Warps
• Problem 2: Modulo operator is expensive on GPUs

Page 20: SHOC: Overview and Kernel Walkthrough


Recursive Invocation

• Problem: We want a single value, but blocks can’t communicate

• Solution: Recursive kernel invocation
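One way to realize the recursive invocation is a host loop that feeds each pass's per-block results back in as the next pass's input until a single block remains. The sketch below is an assumption-laden illustration: `reduceBlock` is the simple interleaved kernel with a bounds check added, `reduceAll` is a hypothetical host function, the input is assumed non-empty, the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// One reduction pass: each block writes a single partial sum.
// Threads past the end of the input contribute 0.
__global__ void reduceBlock(const float* in, float* out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host driver: relaunches the kernel on its own output,
// ping-ponging between two buffers, until one value is left.
float reduceAll(const std::vector<float>& h_in, unsigned int threads = 256)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    unsigned int blocks = (n + threads - 1) / threads;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, blocks * sizeof(float));
    cudaMemcpy(d_a, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    float* in = d_a;
    float* out = d_b;
    for (;;) {
        blocks = (n + threads - 1) / threads;
        reduceBlock<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
        if (blocks == 1) break;   // `out[0]` now holds the final sum
        n = blocks;               // next pass reduces the partial sums
        std::swap(in, out);
    }

    float result = 0.0f;
    cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

Kernel launches on the same stream execute in order, so each pass sees the completed output of the previous one; the kernel boundary acts as the global synchronization that blocks cannot perform among themselves.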

Page 21: SHOC: Overview and Kernel Walkthrough

Performance

Kernel Version                                    GB/s   Step Speedup   Total Speedup
0 – interleaved addressing, divergent warps       6.9    -              -

Page 22: SHOC: Overview and Kernel Walkthrough


Reduction v1

• Get rid of divergent branch and modulo operator

Page 23: SHOC: Overview and Kernel Walkthrough


Reduction v1
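A common way to remove the divergent branch and the modulo, following the approach in Mark Harris's reduction material, is to compute a strided index so that the active threads are contiguous. The kernel below is a sketch under the same assumptions as version 0 (input length a multiple of the block size, power-of-two block size); `blockSums1` is a hypothetical host helper.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Version 1: interleaved addressing, but contiguous threads do the work.
__global__ void reduce1(const float* g_idata, float* g_odata)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // Thread k handles element 2*s*k: no modulo, and the threads that
        // pass the test are packed at the low end of the block.
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns one partial sum per block.
std::vector<float> blockSums1(const std::vector<float>& h_in, int threads)
{
    int n = static_cast<int>(h_in.size());
    int blocks = n / threads;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    reduce1<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);
    std::vector<float> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The shared-memory accesses are now strided by `2*s` words across neighboring threads, which is the source of the bank conflicts this version suffers from.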

Page 24: SHOC: Overview and Kernel Walkthrough

Problem – Bank Conflicts

• Shared memory is composed of 32 banks.
• When multiple threads access *different* words in the *same* bank, access is serialized
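To make the rule concrete, the host-only sketch below counts how many distinct words land in the busiest bank when the threads of one warp access shared memory with a fixed stride. It assumes 4-byte words, successive words mapped to successive banks, and a 32-thread warp; `conflictDegree` is a hypothetical helper, not part of CUDA or SHOC.

```cuda
#include <algorithm>
#include <set>
#include <vector>

// Returns the worst-case number of distinct words that fall in a single
// bank when thread t of a warp accesses word t * strideWords.
// A result of 1 means conflict-free; k means a k-way conflict.
int conflictDegree(int strideWords, int banks = 32, int warpSize = 32)
{
    std::vector<std::set<int>> wordsPerBank(banks);
    for (int t = 0; t < warpSize; ++t) {
        int word = t * strideWords;
        wordsPerBank[word % banks].insert(word);
    }
    size_t worst = 0;
    for (const std::set<int>& words : wordsPerBank)
        worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}
```

A stride of 1 word is conflict-free, a stride of 2 (the first step of an interleaved reduction indexed by `2*s*tid`) gives a 2-way conflict, and the degree keeps doubling as the stride doubles, up to 32-way at a stride of 32.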

Page 25: SHOC: Overview and Kernel Walkthrough

Performance

Kernel Version                                    GB/s   Step Speedup   Total Speedup
0 – interleaved addressing, divergent warps       6.9    -              -
1 – interleaved addressing, bank conflicts        10.9   1.58x          1.58x

Page 26: SHOC: Overview and Kernel Walkthrough


Reduction v2 – Sequential Addressing
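Sequential addressing reverses the loop so that the stride starts at half the block size and halves each step; thread `tid` then touches words `tid` and `tid + s`, and neighboring threads touch neighboring banks. The kernel below is a sketch under the same assumptions as before (input length a multiple of the block size, power-of-two block size); `blockSums2` is a hypothetical host helper.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Version 2: sequential addressing, free of shared-memory bank conflicts.
__global__ void reduce2(const float* g_idata, float* g_odata)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // The lower half of the active range accumulates the upper half.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns one partial sum per block.
std::vector<float> blockSums2(const std::vector<float>& h_in, int threads)
{
    int n = static_cast<int>(h_in.size());
    int blocks = n / threads;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    reduce2<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);
    std::vector<float> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```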

Page 27: SHOC: Overview and Kernel Walkthrough

Performance

Kernel Version                                    GB/s   Step Speedup   Total Speedup
0 – interleaved addressing, divergent warps       6.9    -              -
1 – interleaved addressing, bank conflicts        10.9   1.58x          1.58x
2 – removed bank conflicts                        14.0   1.28x          2.03x

Page 28: SHOC: Overview and Kernel Walkthrough

Reduction v3 – Unrolling the Last Warp

• We know threads execute in a warp-synchronous fashion
• For the last few steps, we can get rid of extra __syncthreads() calls
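The last-warp step can be sketched as below. One caveat worth stating: the 2011-era idiom relied on implicit lockstep execution within a warp (plus a `volatile` pointer), which GPUs with independent thread scheduling (Volta and later) no longer guarantee. This sketch therefore swaps in the cheaper warp-level barrier `__syncwarp()` (CUDA 9 or later) in place of the block-wide `__syncthreads()`. It assumes a power-of-two block size of at least 64 and an input length that is a multiple of the block size; `blockSums3` is a hypothetical host helper.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Final 6 steps (s = 32 ... 1), executed by the first warp only.
// Reads and writes are separated by warp-level barriers.
__device__ void warpReduce(volatile float* sdata, unsigned int tid)
{
    #pragma unroll
    for (unsigned int s = 32; s > 0; s >>= 1) {
        float v = sdata[tid] + sdata[tid + s];
        __syncwarp();
        sdata[tid] = v;
        __syncwarp();
    }
}

// Version 3: sequential addressing with the last warp handled separately.
__global__ void reduce3(const float* g_idata, float* g_odata)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // Block-wide steps stop once a single warp's worth of work remains.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns one partial sum per block.
std::vector<float> blockSums3(const std::vector<float>& h_in, int threads)
{
    int n = static_cast<int>(h_in.size());
    int blocks = n / threads;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    reduce3<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);
    std::vector<float> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Threads 16 to 31 of the first warp still execute the later steps and write values nobody reads; that extra work is harmless and avoids further branching inside the warp.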

Page 29: SHOC: Overview and Kernel Walkthrough

Performance

Kernel Version                                    GB/s   Step Speedup   Total Speedup
0 – interleaved addressing, divergent warps       6.9    -              -
1 – interleaved addressing, bank conflicts        10.9   1.58x          1.58x
2 – removed bank conflicts                        14.0   1.28x          2.03x
3 – unrolled last warp                            23.1   1.65x          3.34x

Page 30: SHOC: Overview and Kernel Walkthrough

Reduction v4 – Multiple Elements Per Thread

• Still have some instruction overhead
  – Can use templates to totally unroll the loop
  – Can have threads handle multiple elements from global memory
• Bonus: reduces any array size to 2 kernel invocations
• This is a useful optimization for most kernels
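Both ideas can be sketched in one kernel: the block size becomes a template parameter so the shared-memory loop has a compile-time trip count, and each thread first accumulates many global elements in a grid-stride loop. Because the grid size is fixed, the first launch always produces the same number of partial sums and a second single-block launch finishes the job. This is a simplified sketch, not the tutorial's code: it omits the separate last-warp step, and `reduce4`, `reduceTwoPass`, and the 64-block / 256-thread configuration are assumptions made here.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Version 4 (sketch): templated block size plus multiple elements per thread.
template <unsigned int blockSize>
__global__ void reduce4(const float* g_idata, float* g_odata, unsigned int n)
{
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int gridSize = blockSize * gridDim.x;

    // Grid-stride loop: each thread sums many elements before any
    // shared-memory traffic or synchronization takes place.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockSize + tid; i < n; i += gridSize)
        sum += g_idata[i];
    sdata[tid] = sum;
    __syncthreads();

    // blockSize is known at compile time, so this loop can be unrolled.
    #pragma unroll
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host driver: exactly two launches for any input size.
float reduceTwoPass(const std::vector<float>& h_in)
{
    const unsigned int threads = 256, blocks = 64;
    unsigned int n = static_cast<unsigned int>(h_in.size());
    float *d_in, *d_partial, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduce4<threads><<<blocks, threads>>>(d_in, d_partial, n);   // n -> 64
    reduce4<threads><<<1, threads>>>(d_partial, d_out, blocks);  // 64 -> 1

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

The bounds check in the grid-stride loop also removes the earlier restriction that the input length be a multiple of the block size.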

Page 31: SHOC: Overview and Kernel Walkthrough

Performance

Kernel Version                                    GB/s   Step Speedup   Total Speedup
0 – interleaved addressing, divergent warps       6.9    -              -
1 – interleaved addressing, bank conflicts        10.9   1.58x          1.58x
2 – removed bank conflicts                        14.0   1.28x          2.03x
3 – unrolled last warp                            23.1   1.65x          3.34x
4 – totally unrolled, multiple elems per thread   36.5   1.58x          5.29x

Page 32: SHOC: Overview and Kernel Walkthrough


More about Reduction

• http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

• Demo

wget http://ft.ornl.gov/~kspafford/tutorial.tgz

Page 33: SHOC: Overview and Kernel Walkthrough

Programming Problem - Scan

• Now let’s think about how this extends to Scan (aka prefix sum)

Scan takes a binary associative operator ⊕, and an array of n elements:

[a_0, a_1, …, a_{n-1}],

and returns the array

[a_0, (a_0 ⊕ a_1), …, (a_0 ⊕ a_1 ⊕ … ⊕ a_{n-1})].

Example: If ⊕ is addition

[3 1 7 0 4] → [3 4 11 11 15]
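As a baseline for checking any parallel version, the definition translates directly into a sequential loop. The host-only sketch below fixes the operator to addition on floats; `inclusiveScan` is a hypothetical name chosen for this illustration.

```cuda
#include <cstddef>
#include <vector>

// Sequential reference: out[k] = in[0] + in[1] + ... + in[k].
std::vector<float> inclusiveScan(const std::vector<float>& in)
{
    std::vector<float> out(in.size());
    float running = 0.0f;
    for (std::size_t k = 0; k < in.size(); ++k) {
        running += in[k];   // fold in element k
        out[k] = running;   // inclusive: result k includes element k
    }
    return out;
}
```

An exclusive scan differs only in writing `out[k]` before adding `in[k]`, so that each output holds the sum of the elements strictly before it.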

Page 34: SHOC: Overview and Kernel Walkthrough


Reduce-then-scan Strategy

7 3 8 5 5 1 2 6

Kernel 1: Reduce

10 13 6 8

Page 35: SHOC: Overview and Kernel Walkthrough


Reduce-then-scan Strategy

7 3 8 5 5 1 2 6

Kernel 1: Reduce

10 13 6 8

Kernel 2: Exclusive Top-level scan

0 10 23 29

Page 36: SHOC: Overview and Kernel Walkthrough

Reduce-then-scan Strategy

7 3 8 5 5 1 2 6

Kernel 1: Reduce

10 13 6 8

Kernel 2: Exclusive Top-level scan

0 10 23 29

Kernel 3: Bottom-level scan

7 10 18 23 28 29 31 37
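The data flow of the three kernels can be traced with a host-only emulation, where each outer loop iteration stands in for one thread block. This is a sketch of the strategy's structure, not GPU code and not the SHOC implementation; `reduceThenScan` and `blockSize` are names assumed here.

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

// Host emulation of the three phases; each value of `b` plays the role
// of one thread block working on `blockSize` consecutive elements.
std::vector<int> reduceThenScan(const std::vector<int>& in, std::size_t blockSize)
{
    std::size_t n = in.size();
    std::size_t numBlocks = (n + blockSize - 1) / blockSize;

    // Kernel 1 (reduce): one partial sum per block.
    std::vector<int> blockSum(numBlocks, 0);
    for (std::size_t b = 0; b < numBlocks; ++b) {
        std::size_t end = std::min(n, (b + 1) * blockSize);
        for (std::size_t k = b * blockSize; k < end; ++k)
            blockSum[b] += in[k];
    }

    // Kernel 2 (exclusive top-level scan): offset[b] is the sum of all
    // blocks before block b.
    std::vector<int> offset(numBlocks, 0);
    for (std::size_t b = 1; b < numBlocks; ++b)
        offset[b] = offset[b - 1] + blockSum[b - 1];

    // Kernel 3 (bottom-level scan): inclusive scan within each block,
    // seeded with that block's offset.
    std::vector<int> out(n);
    for (std::size_t b = 0; b < numBlocks; ++b) {
        int running = offset[b];
        std::size_t end = std::min(n, (b + 1) * blockSize);
        for (std::size_t k = b * blockSize; k < end; ++k) {
            running += in[k];
            out[k] = running;
        }
    }
    return out;
}
```

With the input `7 3 8 5 5 1 2 6` and a block size of 2, the intermediate arrays are `10 13 6 8` and `0 10 23 29`, matching the slides. The blocks in kernels 1 and 3 are independent of each other, which is what makes those two phases parallel.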

Page 37: SHOC: Overview and Kernel Walkthrough

Fast Scan Kernel

• Use twice as much shared memory as there are elements; set the first half to 0 and the second half to the input.

0 0 0 0 10 13 6 8

0 0 0 0 10 23 19 14

0 0 0 0 10 23 29 37

    for (i = 0; i < log2(blockSize); i++)
        smem[idx] += smem[idx - (1 << i)];
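A single-block version of this scan might look like the sketch below. It is written for this walkthrough and is not SHOC's `scanLocalMem`; the kernel name, the one-thread-per-element launch, and the `scanOneBlock` host helper are assumptions. Each step reads the partner value, synchronizes, and only then updates, so no thread overwrites a value another thread still needs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Inclusive scan of blockDim.x elements by one block.
// Shared memory holds 2 * blockDim.x floats: zeros, then the input.
__global__ void scanBlock(const float* in, float* out)
{
    extern __shared__ float smem[];
    unsigned int idx = threadIdx.x;
    smem[idx] = 0.0f;              // first half: zeros
    idx += blockDim.x;
    smem[idx] = in[threadIdx.x];   // second half: input
    __syncthreads();

    // Offsets 1, 2, 4, ...: reads that reach into the zero half add 0,
    // so no bounds test is needed.
    for (unsigned int offset = 1; offset < blockDim.x; offset *= 2) {
        float t = smem[idx - offset];
        __syncthreads();           // all reads finish before any write
        smem[idx] += t;
        __syncthreads();           // all writes finish before the next read
    }

    out[threadIdx.x] = smem[idx];  // smem[idx - 1] would be the exclusive result
}

// Hypothetical host helper: scans a small array with a single block
// (one thread per element, so at most 1024 elements).
std::vector<float> scanOneBlock(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scanBlock<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out);
    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The zero padding trades shared memory for a branch-free inner loop: every thread executes the same instructions at every step.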

Page 38: SHOC: Overview and Kernel Walkthrough

Example Code

• Kernel found in SHOC (src/level1/scan/scan_kernel.h) in the scanLocalMem function
• You can adapt this function for the top-level exclusive scan and the bottom-level inclusive scans.
• Problems:
  – Determine how reduction should stride across global memory
  – Figure out how to make it exclusive/inclusive (hint: remember the first half of smem is 0)
  – Figure out how to use the scan kernel for the bottom level scan

Page 39: SHOC: Overview and Kernel Walkthrough

Good Luck!

Further reading on Scan:

• Examples in the CUDA SDK
• http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf
• http://back40computing.googlecode.com/svn/wiki/documents/ParallelScanForStreamArchitecturesTR.pdf

Page 40: SHOC: Overview and Kernel Walkthrough

MD Walkthrough

• Motivation
  – Classic n-body pairwise computation, important to all MD codes such as GPU-LAMMPS, AMBER, NAMD, Gromacs, Charmm
• Basic design
  – Computation of the Lennard-Jones potential force
  – 3D domain, random distribution
  – Neighbor list algorithm

Page 41: SHOC: Overview and Kernel Walkthrough

Algorithm Sketch

    for each atom, i {
        force = 0;
        for each neighbor, j {
            dist = distance(pos[i], pos[j]);
            if (dist < cutoff)
                force += interaction(i, j);
        }
        forces[i] = force;
    }
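Mapped to CUDA with one thread per atom, the loop might look like the sketch below. It is an illustration written for this walkthrough rather than the contents of SHOC's MD.cu: the kernel name, the neighbor-list layout (`neigh[j * nAtom + i]`), the parameter names, and the `forceXOnAtom0` test helper are assumptions. For the Lennard-Jones potential 4ε[(σ/r)^12 − (σ/r)^6], the constants are `lj1 = 48 ε σ^12` and `lj2 = 24 ε σ^6`.

```cuda
#include <cuda_runtime.h>

// One thread per atom i; each thread loops over that atom's neighbors.
__global__ void ljForces(const float4* pos, const int* neigh, float3* force,
                         int nAtom, int maxNeighbors,
                         float cutsq, float lj1, float lj2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtom) return;

    float4 pi = pos[i];
    float3 f = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < maxNeighbors; ++j) {
        // The list read is coalesced: neighboring threads read neighboring
        // entries. The position read is data-dependent and uncoalesced.
        int jidx = neigh[j * nAtom + i];
        float4 pj = pos[jidx];

        float dx = pi.x - pj.x;
        float dy = pi.y - pj.y;
        float dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 < cutsq) {   // compare squared distances to avoid a sqrt
            float r2inv = 1.0f / r2;
            float r6inv = r2inv * r2inv * r2inv;
            float s = r2inv * r6inv * (lj1 * r6inv - lj2);
            f.x += dx * s;
            f.y += dy * s;
            f.z += dz * s;
        }
    }
    force[i] = f;
}

// Hypothetical test helper: two atoms on the x axis with sigma = epsilon = 1;
// returns the x component of the force on the atom at the origin.
float forceXOnAtom0(float separation)
{
    const int nAtom = 2, maxNeighbors = 1;
    float4 h_pos[nAtom] = { make_float4(0.0f, 0.0f, 0.0f, 0.0f),
                            make_float4(separation, 0.0f, 0.0f, 0.0f) };
    int h_neigh[nAtom * maxNeighbors] = { 1, 0 };  // each atom sees the other
    float3 h_force[nAtom];
    float4* d_pos; int* d_neigh; float3* d_force;
    cudaMalloc(&d_pos, sizeof(h_pos));
    cudaMalloc(&d_neigh, sizeof(h_neigh));
    cudaMalloc(&d_force, sizeof(h_force));
    cudaMemcpy(d_pos, h_pos, sizeof(h_pos), cudaMemcpyHostToDevice);
    cudaMemcpy(d_neigh, h_neigh, sizeof(h_neigh), cudaMemcpyHostToDevice);
    ljForces<<<1, 32>>>(d_pos, d_neigh, d_force, nAtom, maxNeighbors,
                        100.0f, 48.0f, 24.0f);
    cudaMemcpy(h_force, d_force, sizeof(h_force), cudaMemcpyDeviceToHost);
    cudaFree(d_pos);
    cudaFree(d_neigh);
    cudaFree(d_force);
    return h_force[0].x;
}
```

At a separation of exactly σ the force is repulsive, so the atom at the origin is pushed in the negative x direction.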

Page 42: SHOC: Overview and Kernel Walkthrough

Performance Observations

• Neighbors are data-dependent
  – Results in an uncoalesced read on pos[j].
• Uncoalesced reads kill performance
  – But sometimes the texture cache can help

Page 43: SHOC: Overview and Kernel Walkthrough


(in cuda/level1/md/MD.cu)

Page 44: SHOC: Overview and Kernel Walkthrough

Performance on Keeneland

[Chart: LJ-SP Bandwidth (GB/s) vs. nAtom. nAtom values, in order: 12288, 24576, 36864, 73728. Bar values, in order: 74.47, 77.55, 65.48, 27.37. Y-axis: 0 to 90]

Page 45: SHOC: Overview and Kernel Walkthrough

For the Hands-On Session

• Scan Competition
• Goal: Best performance on 16 MiB input of floats.
  – Use this slide deck and knowledge from the other presentations
  – Test harness with timing and correctness check provided in scan.cu
  – Download it from ft.ornl.gov/~kspafford/tutorial.tgz
• Email submissions (scan.cu) to [email protected]
• I will announce the winner at the end of the hands-on session.

Page 46: SHOC: Overview and Kernel Walkthrough


[email protected]