GPU Programming: eScience or Engineering?

Henri Bal, Vrije Universiteit Amsterdam, COMMIT/


Page 1: GPU  Programming: eScience or  Engineering ?

GPU Programming: eScience or Engineering?

Henri Bal

COMMIT/

Vrije Universiteit Amsterdam

Page 2: GPU  Programming: eScience or  Engineering ?

Graphics Processing Units

● GPUs and other accelerators take top-500 by storm

● Many application success stories
● But GPUs are very difficult to program and optimize

http://www.nvidia.com/object/tesla-case-studies.html

Page 3: GPU  Programming: eScience or  Engineering ?

Example: convolution

● About half a Ph.D. thesis

[Code listings on slide: naive vs. fully optimized convolution kernel]

Page 4: GPU  Programming: eScience or  Engineering ?

Parallel Programming Lab course

● Lab course for MSc students (next to lectures)
● CUDA:
● Simple image processing application on 1 node
● MPI:
● Parallel all-pairs shortest path algorithms
● CUDA: 11 out of 21 passed (52%)
● MPI: 17 out of 21 passed (80%)

Page 5: GPU  Programming: eScience or  Engineering ?

Questions

● Why are accelerators so difficult to program?
● What are the challenges for Computer Science?
● What role do applications play?

Page 6: GPU  Programming: eScience or  Engineering ?

Background

● Netherlands eScience Center
● Bridge between ICT and applications
● Climate modeling, astronomy, water management, digital forensics, …
● COMMIT/ (100 M€) public-private ICT program
● http://www.commit-nl.nl/
● Distributed ASCI Supercomputer (DAS)
● Testbed for Computer Science (Euro-Par 2014 keynote)

COMMIT/

Page 7: GPU  Programming: eScience or  Engineering ?

My background

• Cluster computing: Zoo (1994), Orca
• Wide-area computing: DAS-1 (1997), Albatross
• Grid computing: DAS-2 (2002), Manta, Satin
• eScience & optical grids: DAS-3 (2006), Ibis
• Hybrid computing: DAS-4 (2010), Glasswing, MCL

Page 8: GPU  Programming: eScience or  Engineering ?

Background (team)

Ph.D. students
● Ben van Werkhoven

● Alessio Sclocco

● Ismail El Helw

● Pieter Hijma

Staff
● Rob van Nieuwpoort (NLeSC)
● Ana Varbanescu (UvA)

Scientific programmers
● Rutger Hofman

● Ceriel Jacobs

Page 9: GPU  Programming: eScience or  Engineering ?

Agenda

• Application case studies

• Multimedia kernel (convolution)

• Astronomy kernel (dedispersion)

• Climate modelling: optimizing multiple kernels

• Lessons learned: why is GPU programming hard?

• Programming methodologies

• “Stepwise refinement for performance” methodology

• Glasswing: MapReduce on accelerators

Page 10: GPU  Programming: eScience or  Engineering ?

Application case study 1: Convolution operations

Image I of size Iw by Ih

Filter F of size Fw by Fh

Thread block of size Bw by Bh

Naïve CUDA kernel: 1 thread per output pixel

Does 2 arithmetic operations and 2 loads (8 bytes); see the sketch below

Arithmetic Intensity (AI) = 0.25
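A minimal sketch of what such a naive kernel could look like (the kernel name, array layout, and the assumption that the input is padded to (Iw+Fw-1) x (Ih+Fh-1) are illustrative, not the code from the thesis):

    // Hypothetical naive 2-D convolution: one thread per output pixel.
    // Each inner-loop iteration does 1 multiply + 1 add (2 flops) and
    // 2 global-memory loads (8 bytes), hence AI = 2/8 = 0.25 flops/byte.
    __global__ void convolutionNaive(const float *image, const float *filter,
                                     float *output, int Iw, int Ih,
                                     int Fw, int Fh)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
        if (x >= Iw || y >= Ih) return;

        float sum = 0.0f;
        for (int j = 0; j < Fh; j++) {
            for (int i = 0; i < Fw; i++) {
                // Both operands come straight from global memory.
                // Input image assumed padded to (Iw+Fw-1) x (Ih+Fh-1).
                sum += image[(y + j) * (Iw + Fw - 1) + (x + i)] * filter[j * Fw + i];
            }
        }
        output[y * Iw + x] = sum;
    }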

Page 11: GPU  Programming: eScience or  Engineering ?

Hierarchy of concurrent threads

[Figure: a grid of thread blocks (e.g. 2 x 3 blocks), each containing a small 2-D array of threads indexed by (row, column)]
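As an aside, a tiny hypothetical kernel showing how a thread locates itself in this grid/block hierarchy (assuming the grid exactly covers a 2-D output array of the given width and height):

    // Each thread combines its block index and thread index into global coordinates.
    __global__ void whereAmI(float *out, int width)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // global column
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // global row
        out[row * width + col] = 1.0f;                    // one element per thread
    }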

Page 12: GPU  Programming: eScience or  Engineering ?

Memory optimizations for tiled convolution

[Figure: CUDA memory hierarchy: per-thread registers, per-block shared memory, and device-wide global and constant memory]

Filter (small) goes into constant memory

Threads within a block cooperatively load entire area they need into a small (e.g. 96KB) shared memory
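A hedged sketch of how the tiled kernel might stage data, assuming a fixed 16x16 thread block and an upper bound on the filter size (all names and sizes are illustrative, not the thesis implementation):

    // Hypothetical tiled convolution: the filter sits in constant memory and the
    // block cooperatively loads its input tile (output area + filter border)
    // into shared memory before computing. Input assumed padded to
    // (Iw+Fw-1) x (Ih+Fh-1).
    #define BW 16
    #define BH 16
    #define MAX_FW 17
    #define MAX_FH 17

    __constant__ float d_filter[MAX_FH * MAX_FW];        // small filter: constant memory

    __global__ void convolutionTiled(const float *image, float *output,
                                     int Iw, int Ih, int Fw, int Fh)
    {
        __shared__ float tile[(BH + MAX_FH - 1) * (BW + MAX_FW - 1)];
        int tileW = BW + Fw - 1, tileH = BH + Fh - 1;
        int baseX = blockIdx.x * BW, baseY = blockIdx.y * BH;

        // Cooperative load: all 256 threads stride over the whole tile.
        for (int t = threadIdx.y * BW + threadIdx.x; t < tileW * tileH; t += BW * BH) {
            int ty = t / tileW, tx = t % tileW;
            if (baseX + tx < Iw + Fw - 1 && baseY + ty < Ih + Fh - 1)
                tile[ty * tileW + tx] = image[(baseY + ty) * (Iw + Fw - 1) + (baseX + tx)];
        }
        __syncthreads();

        int x = baseX + threadIdx.x, y = baseY + threadIdx.y;
        if (x >= Iw || y >= Ih) return;

        float sum = 0.0f;
        for (int j = 0; j < Fh; j++)
            for (int i = 0; i < Fw; i++)     // reuse now hits shared/constant memory
                sum += tile[(threadIdx.y + j) * tileW + (threadIdx.x + i)] * d_filter[j * Fw + i];
        output[y * Iw + x] = sum;
    }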

Page 13: GPU  Programming: eScience or  Engineering ?

Tiled convolution

16x16 thread block processing an 11x7 filter

● Arithmetic Intensity:
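The formula itself is an image on the original slide and is not reproduced in this transcript. A plausible reconstruction, assuming each output pixel costs 2·Fw·Fh flops and the block loads its (Bw+Fw-1) x (Bh+Fh-1) input tile from global memory exactly once (4 bytes per pixel):

    \mathrm{AI} \approx \frac{2\, B_w B_h F_w F_h}{4\,(B_w + F_w - 1)(B_h + F_h - 1)}
                \approx \frac{2 \cdot 16 \cdot 16 \cdot 11 \cdot 7}{4 \cdot 26 \cdot 22}
                \approx 17\ \text{flops/byte}

which is far above the 0.25 of the naive kernel, so the tiled kernel is much less memory-bandwidth bound.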

Page 14: GPU  Programming: eScience or  Engineering ?

Analysis

● If filter size increases:
● Arithmetic Intensity increases
● Kernel shifts from memory-bandwidth bound to compute-bound
● Amount of shared memory needed increases → fewer thread blocks can run concurrently on each SM

Page 15: GPU  Programming: eScience or  Engineering ?

Tiling

● Each thread block computes 1xN tiles in horizontal direction
+ Increases amount of work per thread
+ Saves loading overlapping borders
+ Saves redundant instructions
- More shared memory, fewer concurrent thread blocks
No shared memory bank conflicts

Page 16: GPU  Programming: eScience or  Engineering ?

Adaptive tiling

● Tiling factor is selected at runtime, depending on the input data and the resource limitations of the device
● Highest possible tiling factor that fits within the available shared memory (depending on filter size)
● Plus loop unrolling, memory bank optimizations, and a search for the optimal configuration (see the sketch below)

Ph.D. thesis Ben van Werkhoven, 27 Oct. 2014 + FGCS journal, 2014
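A minimal host-side sketch of how such a runtime decision could be made, assuming the shared-memory footprint of a 1xN-tiled block is (N·Bw+Fw-1)·(Bh+Fh-1) floats (function and variable names are hypothetical):

    #include <cuda_runtime.h>

    // Pick the largest horizontal tiling factor whose shared-memory footprint
    // still fits within the device's per-block shared memory.
    int selectTilingFactor(int Bw, int Bh, int Fw, int Fh, int maxFactor)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        size_t available = prop.sharedMemPerBlock;     // device resource limit

        int best = 1;
        for (int n = 1; n <= maxFactor; n++) {
            size_t bytes = (size_t)(n * Bw + Fw - 1) * (Bh + Fh - 1) * sizeof(float);
            if (bytes <= available) best = n;          // highest factor that fits
            else break;
        }
        return best;
    }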

Page 17: GPU  Programming: eScience or  Engineering ?

Lessons learned

● Everything must be in balance to obtain high performance
● Subtle interactions between resource limits
● Runtime decision system (adaptive tiling), in combination with standard optimizations
● Loop unrolling, memory bank conflicts

Page 18: GPU  Programming: eScience or  Engineering ?

Application case study 2: Auto-tuning Dedispersion

● Used for searching pulsars in radio astronomy data

● Pulsar signals get dispersed: lower radio frequencies arrive progressively later
● Can be reversed by shifting the signal's lower frequencies in time (dedispersion)

Alessio Sclocco et al.: Auto-Tuning Dedispersion for Many-Core Accelerators, IPDPS 2014

Page 19: GPU  Programming: eScience or  Engineering ?

Auto-tuning

● Using auto-tuning to find the optimal configuration for:
● Different many-core platforms
● NVIDIA & AMD GPUs, Intel Xeon Phi
● Different observational scenarios
● LOFAR, Apertif
● Different numbers of Dispersion Measures (DMs)
● Represents number of free electrons between source & receiver
● Measure of distance between emitting object & receiver
● Parameters: number of threads per sample or DM, thread block size, number of registers per thread, … (see the sketch below)
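A simplified, hypothetical flavour of such an auto-tuner: exhaustively time the kernel over the cross product of parameter values and keep the fastest configuration (the Config fields and the launch callback are placeholders, not the tuner from the paper):

    #include <cfloat>
    #include <cuda_runtime.h>

    struct Config { int threadsPerDM; int threadsPerSample; int blockSize; };

    // launch: caller-supplied function that runs the dedispersion kernel
    // with the given configuration on a representative input.
    Config autoTune(void (*launch)(const Config &),
                    const int *perDM, int nDM,
                    const int *perSample, int nSample,
                    const int *blocks, int nBlocks)
    {
        Config best = {};
        float bestMs = FLT_MAX;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        for (int i = 0; i < nDM; i++)
            for (int j = 0; j < nSample; j++)
                for (int k = 0; k < nBlocks; k++) {
                    Config c = { perDM[i], perSample[j], blocks[k] };
                    cudaEventRecord(start);
                    launch(c);                          // one timed trial per configuration
                    cudaEventRecord(stop);
                    cudaEventSynchronize(stop);
                    float ms;
                    cudaEventElapsedTime(&ms, start, stop);
                    if (ms < bestMs) { bestMs = ms; best = c; }
                }

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return best;
    }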

Page 20: GPU  Programming: eScience or  Engineering ?

Auto-tuning: number of threads per thread block

[Plots: auto-tuning results for the LOFAR and Apertif scenarios]

Page 21: GPU  Programming: eScience or  Engineering ?

Histogram of achieved GFLOP/s

● 396 configurations, the winner is an outlier

Page 22: GPU  Programming: eScience or  Engineering ?

Lessons learned

● Auto-tuning allows algorithms to adapt to different platforms and scenarios

● Auto-tuning has a large impact on dedispersion
● Guessing a good configuration without auto-tuning is difficult

Page 23: GPU  Programming: eScience or  Engineering ?

Application case study 3: Global Climate Modeling

● Understand future local sea level changes
● Needs high-resolution simulations
● Combine two approaches:
● Distributed computing (multiple resources)
● GPUs

COMMIT/

Page 24: GPU  Programming: eScience or  Engineering ?

Distributed Computing

● Use Ibis to couple different simulation models
● Land, ice, ocean, atmosphere
● Wide-area optimizations similar to the Albatross project (16 years ago), like hierarchical load balancing

Page 25: GPU  Programming: eScience or  Engineering ?

Enlighten Your Research Global award

[Figure: 10G lightpaths connecting EMERALD (UK), KRAKEN (USA), STAMPEDE (USA), SUPERMUC (GER) and CARTESIUS (NLD); two of the systems are ranked #7 and #10 in the Top500]

Page 26: GPU  Programming: eScience or  Engineering ?

GPU Computing

● Offload expensive kernels of the Parallel Ocean Program (POP) from CPU to GPU
● Many different kernels, fairly easy to port to GPUs
● Execution time becomes virtually 0
● New bottleneck: moving data between CPU & GPU

[Figure: CPU host memory and GPU device memory connected by a PCI Express link]

Page 27: GPU  Programming: eScience or  Engineering ?

Different methods for CPU-GPU communication

● Memory copies (explicit)
● No overlap with GPU computation
● Device-mapped host memory (implicit)
● Allows fine-grained overlap between computation and communication in either direction
● CUDA streams or OpenCL command-queues (see the sketch below)
● Allows overlap between computation and communication in different streams

● Any combination of the above
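A hedged sketch of the streams option: a large array is processed in chunks so the copy of one chunk can overlap with the kernel of another. The kernel and chunking scheme are illustrative only, and the host buffer is assumed to be pinned (e.g. allocated with cudaHostAlloc) so the asynchronous copies can actually overlap:

    #include <cuda_runtime.h>

    __global__ void processChunk(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                   // stand-in for real work
    }

    void runOverlapped(float *hostData /* pinned */, int nChunks, int chunkSize)
    {
        const int NSTREAMS = 4;
        cudaStream_t streams[NSTREAMS];
        float *dev;
        cudaMalloc(&dev, (size_t)nChunks * chunkSize * sizeof(float));
        for (int s = 0; s < NSTREAMS; s++) cudaStreamCreate(&streams[s]);

        for (int c = 0; c < nChunks; c++) {
            cudaStream_t st = streams[c % NSTREAMS];  // round-robin over streams
            float *h = hostData + (size_t)c * chunkSize;
            float *d = dev + (size_t)c * chunkSize;
            cudaMemcpyAsync(d, h, chunkSize * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            processChunk<<<(chunkSize + 255) / 256, 256, 0, st>>>(d, chunkSize);
            cudaMemcpyAsync(h, d, chunkSize * sizeof(float),
                            cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < NSTREAMS; s++) cudaStreamDestroy(streams[s]);
        cudaFree(dev);
    }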

Page 28: GPU  Programming: eScience or  Engineering ?

Problem

● Problem:
● Which method will be most efficient for a given GPU kernel? Implementing all can be a large effort
● Solution:
● Create a performance model that identifies the best implementation:
● What implementation strategy for overlapping computation and communication is best for my program?

Ben van Werkhoven, Jason Maassen, Frank Seinstra & Henri Bal: Performance Models for CPU-GPU Data Transfers, CCGrid 2014 (nominated for best paper award)
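The flavour of such a model, in a much-simplified form (this is not the actual model from the paper): with explicit, non-overlapping copies the total time is roughly additive, whereas with full overlap (streams or device-mapped memory) the slowest stage dominates:

    T_{\mathrm{explicit}} \approx \frac{B_{h \to d} + B_{d \to h}}{BW_{\mathrm{PCIe}}} + T_{\mathrm{kernel}}
    \qquad
    T_{\mathrm{overlap}} \approx \max\!\left(\frac{B_{h \to d}}{BW_{\mathrm{PCIe}}},\; T_{\mathrm{kernel}},\; \frac{B_{d \to h}}{BW_{\mathrm{PCIe}}}\right)

Plugging in measured kernel times and transfer volumes then indicates which implementation strategy is worth the effort.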

Page 29: GPU  Programming: eScience or  Engineering ?

MOVIE

Page 30: GPU  Programming: eScience or  Engineering ?

Example result

[Plot: measured vs. modelled performance]

Page 31: GPU  Programming: eScience or  Engineering ?

Different GPUs (state kernel)

Page 32: GPU  Programming: eScience or  Engineering ?

Different GPUs (buoydiff)

Page 33: GPU  Programming: eScience or  Engineering ?

Comes with spreadsheet

Page 34: GPU  Programming: eScience or  Engineering ?

Lessons learned

● PCIe transfers can have a large performance impact for applications with many small kernels

● Several methods for transferring data and overlapping computation & communication exist

● Performance modelling helps to select the best mechanism

Page 35: GPU  Programming: eScience or  Engineering ?

Why is GPU programming hard?

● Mapping an algorithm to the architecture is difficult, especially as the architecture itself is complex:
● Many levels of parallelism
● Limited resources (registers, shared memory)
● Less of everything than a CPU (except parallelism), especially per thread, which makes problem partitioning difficult

● Everything must be in balance to obtain performance

Page 36: GPU  Programming: eScience or  Engineering ?

Why is GPU programming hard?

● Many crucial high-impact optimizations needed:
● Data reuse
● Use shared memory efficiently
● Limited by #registers per thread, shared memory per thread block

● Memory access patterns
● Shared memory bank conflicts, global memory coalescing (see the sketch below)
● Instruction stream optimization
● Control flow divergence, loop unrolling
● Moving data to/from the GPU
● PCIe transfers
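To illustrate the memory-access-pattern point above, a small hypothetical example of coalesced versus strided (uncoalesced) global-memory access:

    // Coalesced: adjacent threads in a warp touch adjacent addresses,
    // so loads combine into few memory transactions.
    __global__ void copyCoalesced(const float *in, float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // fastest-varying index
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = in[y * width + x];      // consecutive floats per warp
    }

    // Uncoalesced: each thread strides by 'height' floats, scattering the accesses.
    __global__ void copyStrided(const float *in, float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[x * height + y] = in[x * height + y];    // large stride per thread
    }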

Page 37: GPU  Programming: eScience or  Engineering ?

Why is GPU programming hard?

● Portability
● Optimizations are architecture-dependent, and the architectures change frequently
● Optimizations are often input dependent
● Finding the right parameter settings is difficult
● Need better performance models
● Like Roofline and our I/O model

Page 38: GPU  Programming: eScience or  Engineering ?

Why is GPU programming hard?

● Bottom line: tension between
● control over hardware to achieve performance
● higher abstraction level to ease programming

● Programmers need understandable performance

● Old problem in Computer Science,but now in extreme form


Page 39: GPU  Programming: eScience or  Engineering ?

Agenda

• Application case studies

• Multimedia kernel (convolution)

• Astronomy kernel (dedispersion)

• Climate modelling: optimizing multiple kernels

• Lessons learned: why is GPU programming hard?

• Programming methodologies

• “Stepwise refinement for performance” methodology

• Glasswing: MapReduce on accelerators

Page 40: GPU  Programming: eScience or  Engineering ?

Programming methodology: stepwise refinement for performance

● Methodology:
● Programmers can work on multiple levels of abstraction
● Integrate hardware descriptions into programming model
● Performance feedback from compiler, based on hardware description and kernel
● Cooperation between compiler and programmer

P. Hijma et al., “Stepwise-refinement for Performance: A Methodology for Many-core Programming,” Concurrency and Computation: Practice and Experience (accepted)

Page 41: GPU  Programming: eScience or  Engineering ?

MCL: Many-Core Levels

● MCL program is an algorithm mapped to hardware

● Start at a suitable abstraction level
● E.g. an idealized accelerator, an NVIDIA Kepler GPU, a Xeon Phi
● The MCL compiler guides the programmer on which optimizations to apply at a given abstraction level, or when to move to a deeper level

Page 42: GPU  Programming: eScience or  Engineering ?

MCL ecosystem

Page 43: GPU  Programming: eScience or  Engineering ?

Convolution example

Page 44: GPU  Programming: eScience or  Engineering ?

Compiler feedback

Page 45: GPU  Programming: eScience or  Engineering ?

Performance (GTX480, 9×9 filters)

[Chart: MCL reaches 302 GFLOPS, versus 380 GFLOPS for the version labelled “Compiler + …”]

Page 46: GPU  Programming: eScience or  Engineering ?

Performance evaluation

Compared to known, fully optimized versions (* measured on a C2050, ** using a different input).

Page 47: GPU  Programming: eScience or  Engineering ?

Current work on MCL: Heterogeneous many-core clusters

● New GPUs become available frequently, but older-generation GPUs are often still fast enough
● Clusters become heterogeneous and contain different types of accelerators

● VU DAS-4 cluster:
● NVIDIA GTX480 GPUs (22)
● NVIDIA K20 GPUs (8)
● Intel Xeon Phi (2)
● NVIDIA C2050 (2), Titan, GTX680 GPU
● AMD HD7970 GPU

Page 48: GPU  Programming: eScience or  Engineering ?

Cashmere

● Integration MCL + Satin divide-and-conquer system

● Satin [ACM TOPLAS 2010] does:
● Load balancing (cluster-aware random work-stealing)
● Latency hiding

● MCL allows kernels to be written and optimized for each type of hardware

● Cashmere does integration, application logic, mapping, and load balancing for multiple GPUs/node

Page 49: GPU  Programming: eScience or  Engineering ?

Cashmere skeleton

Page 50: GPU  Programming: eScience or  Engineering ?

Kernel performance (GFLOP/s)

Page 51: GPU  Programming: eScience or  Engineering ?

K-Means on a homogeneous GTX480 cluster

[Plots: scalability and absolute performance]

Page 52: GPU  Programming: eScience or  Engineering ?

Heterogeneous performance

Homogeneous: efficiency on 16 GTX480s. Heterogeneous: efficiency over the total combined hardware.

Page 53: GPU  Programming: eScience or  Engineering ?

Lessons learned

● MCL
● Enables us to develop many optimized many-core kernels
● Key: stepwise refinement + multiple abstraction levels
● Cashmere
● High performance and automatic load balancing even when the many-core devices differ widely
● Efficiency >90% in 3 out of 4 applications in heterogeneous executions

Page 54: GPU  Programming: eScience or  Engineering ?

Agenda

• Application case studies

• Multimedia kernel (convolution)

• Astronomy kernel (dedispersion)

• Climate modelling: optimizing multiple kernels

• Lessons learned: why is GPU programming hard?

• Programming methodologies

• “Stepwise refinement for performance” methodology

• Glasswing: MapReduce on accelerators

Page 55: GPU  Programming: eScience or  Engineering ?

Other approaches that deal with performance vs. abstraction

● Domain-specific languages
● Patterns, skeletons, frameworks
● Berkeley Dwarfs

Page 56: GPU  Programming: eScience or  Engineering ?

Glasswing: Rethinking MapReduce

● Use accelerators (OpenCL) as mainstream feature

● Massive out-of-core data sets
● Scale vertically & horizontally
● Maintain MapReduce abstraction

Ismail El Helw, Rutger Hofman, Henri Bal [HPDC’2014, SC’2014]

Page 57: GPU  Programming: eScience or  Engineering ?

Glasswing Pipeline

● Overlaps computation, communication & disk access

● Supports multiple buffering levels

Page 58: GPU  Programming: eScience or  Engineering ?

GPU optimizations

● Glasswing framework does:
● Memory management
● Some shared memory optimizations
● Data movement, data staging
● Programmer:
● Focusses on the map and reduce kernels (using OpenCL)
● Can do kernel optimizations if needed
● Coalescing, memory banks, etc.

Page 59: GPU  Programming: eScience or  Engineering ?

Glasswing vs. Hadoop: 64-node CPU InfiniBand cluster

Page 60: GPU  Programming: eScience or  Engineering ?

Glasswing vs. Hadoop: 16-node GTX480 GPU cluster

Page 61: GPU  Programming: eScience or  Engineering ?

Performance K-Means

[Chart: K-Means performance for Hadoop, Glasswing GPU, Glasswing CPU, and GPMR compute]

Page 62: GPU  Programming: eScience or  Engineering ?

Compute Device Comparison

Page 63: GPU  Programming: eScience or  Engineering ?

Lessons learned

● Scalable MapReduce framework combining coarse-grained and fine-grained parallelism

● Handles out-of-core data, sticks with MapReduce model

● Overlaps kernel executions with memory transfers, network communication and disk access

● Outperforms Hadoop by 1.2-4x on CPUs and 20-30x on GPUs

Page 64: GPU  Programming: eScience or  Engineering ?

Discussion

● eScience applications help us to
● Understand the complexity of GPU programming
● Validate our ideas and software
● Give inspiration for new CS research
● Applications do need the performance of GPUs
● Next in line: SKA, digital forensics, water management, …

● GPU programming and optimization is too time-consuming for real applications

Page 65: GPU  Programming: eScience or  Engineering ?

Discussion

● Dealing with performance
● GPU programs need many complex optimizations to obtain high performance
● Auto-tuning, performance modelling, machine learning, compiler-based reasoning
● How to deal with the tension between abstraction level and control?
● New programming methodologies that allow a choice
● Frameworks that do separation of concerns