
Page 1:

CS-206 Concurrency – Lecture 11
Data Parallel Computing
Spring 2015
Prof. Babak Falsafi
parsa.epfl.ch/courses/cs206/
Adapted from slides originally developed by Andreas Di Blas, Babak Falsafi, Simon Green, David Kirk, Andreas Moshovos, David Patterson and Waqar Saleem
EPFL Copyright 2015

[Title-slide figure: vector pipeline with IF, ID, four EXE lanes, MEM, WB stages]

Page 2:

Lecture & Lab

[Course calendar: weeks of 16-Feb through 29-May 2015, Monday through Friday]

Where are We?
u  Data Parallel Computing
w Vector
w GPU
u  GPU architecture
u  CUDA
u  Next week
w More CUDA

Page 3:

Recall: Historical View

[Figure: three ways of joining processors (P), memories (M), and I/O]
Join at: I/O (Network)  –  Program with: Message passing, Hadoop, SQL (databases)
Join at: Memory         –  Program with: Shared memory, Java threads, POSIX threads
Join at: Processor      –  Program with: Data parallel, SIMD, Vector, GPU, MapReduce

Page 4:

From now on: Data Parallel

[Figure: same three options as the previous slide, highlighting "Join at: Processor – Program with: Data parallel, SIMD, Vector, GPU, MapReduce"]

Page 5:

Recall: Forms of Parallelism
u  Throughput parallelism
w Perform many (identical) sequential tasks at the same time
w E.g., Google search, ATM (bank) transactions
u  Task parallelism
w Perform tasks that are functionally different in parallel
w E.g., iPhoto (face recognition with slide show)
u  Pipeline parallelism
w Perform tasks that are different in a particular order
w E.g., speech (signal, phonemes, words, conversation)
u  Data parallelism
w Perform the same task on different data
w E.g., Graphics, data analytics
} Reduce time for one job


Page 7:

Example: Image Processing/Graphics

int a[N]; // N is large
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

Page 8:

Example: Speech Recognition (e.g., Siri)

u Signal processing: same algorithm run on a sample
u Neural network: propagate values across neurons

Page 9:

Signal Processing: Data Parallel Transforms

Example: Discrete Fourier Transform (DFT) size 4

y = DFT_4 \cdot x, where x is the sampled signal (a vector) and DFT_4 is the transform (a matrix):

DFT_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}

Matrix operations are embarrassingly data parallel!
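As a small illustration (not from the slides), here is the same idea in plain C: every element of y = M·x depends only on one row of M and on x, so all outputs can be computed independently. The function and variable names (matvec, M, x, y, n) are ours.

    /* y = M * x for an n x n row-major matrix M; each iteration of the outer
       loop is independent, so it could run as its own thread. */
    void matvec(const float *M, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; i++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += M[i * n + k] * x[k];
            y[i] = sum;
        }
    }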

Page 10:

A network of neurons

[Figure: a network with hidden layers mapping a signal to phonemes; inputs x_1 … x_n reach neuron j through weights w_{1j} … w_{nj}]

Each neuron j computes   Y_j = \sum_{k=0}^{n} w_{kj} x_k + b_j

Page 11:

Data Parallel Computation on Neurons

float nron[N]; // for large N
for (i = 0; i < N; i++)                          // for each neuron i
    for (j = 0; j < nron[i].outputs; j++)
        nron[i].y[j] = sigmoid( \sum_{k=0}^{nron[i].inputs} nron[i].w_{kj} * nron[i].x_k + nron[i].b_j )
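A minimal C sketch (not from the slides) of the per-neuron computation above; the struct layout, field names, and the single-output simplification are assumptions made for illustration.

    #include <math.h>

    struct Neuron { int inputs; float *x; float *w; float b; float y; };

    static float sigmoid(float v) { return 1.0f / (1.0f + expf(-v)); }

    /* Independent for every neuron, hence data parallel across neurons. */
    void fire(struct Neuron *n)
    {
        float sum = n->b;
        for (int k = 0; k < n->inputs; k++)
            sum += n->w[k] * n->x[k];
        n->y = sigmoid(sum);
    }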

Page 12:

Example: Data Analytics

u Google processes 20 PB a day
u Wayback Machine has 3 PB + 100 TB/month
u Facebook has 2.5 PB of user data + 15 TB/day
u eBay has 6.5 PB of user data + 50 TB/day
u CERN's Large Hadron Collider generates 15 PB a year

How do we aggregate this data?

Page 13:

MapReduce in Data Analytics

u  It's about aggregating statistics over data
u  Divide up the data among servers
u  Compute the stats (independently)
u  Then aggregate/reduce

u  Example: CloudSuite classification benchmark
w 10's of GB of web pages
w Rank pages based on word occurrence (popularity)
w Look for celebrities
w It's an embarrassingly (data) parallel problem!

Page 14:

MapReduce from Google: Data Parallel Computing on Volume Servers

[Figure: map tasks scan web pages in parallel and emit (celebrity, count) pairs, e.g. Bieber 5004 and 104, Gaga 2513 and 969; values are aggregated by key, and reduce tasks sum them into a total Bieber count and a total Gaga count]

Page 15:

This Course: Data Parallel Processor Architecture

1.  Vector Processors
w Pipelined execution
w SIMD: Single instruction, multiple data
w Example: modern ISA extensions

2.  Graphics Processing Units (GPUs)
w Dense grid of ALUs
w SIMT: Single instruction, multiple threads
w Integrated vs. discrete

Page 16:

Recall: MIPS Processor (Instruction Cycle)

u  Instructions are fetched from instruction cache and decoded
u  Operands are fetched from register file
u  Execute is the ALU (arithmetic logic unit)
u  Memory access to data cache
u  Write results back to register file

[Pipeline diagram: IF (Instruction Fetch), ID (Instruction Decode / Operand Fetch), EXE (Execute), MEM (Memory Access), WB (Write Back Result)]

Page 17:

Recall: MIPS Pipeline (Instruction Cycle)

[Pipeline diagram: IF, ID, EXE, MEM, WB stages, as on the previous slide]

int a[N]; // N is large
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

Page 18:

Fader loop in assembly

for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

u  The loop iterates N times (once for each array element)
u  Same exact operation for each element
u  Assume 32-bit "mul"

; a[] -> $2,  fade -> $3,  &a[N] -> $4,  $5 is a temp
loop:
    lw   $5, 0($2)
    mul  $5, $3, $5
    sw   $5, 0($2)
    addi $2, $2, 4
    bne  $2, $4, loop

Page 19:

Vector Processor: One instruction, multiple data

[Pipeline diagram: IF, ID, four parallel EXE lanes, MEM, WB]

Page 20:

Vector Processing

SCALAR (1 operation):   mul r3, r1, r2     ; r3 = r1 * r2
VECTOR (N operations):  mul.v v3, v1, v2   ; v3[i] = v1[i] * v2[i] over the whole vector length

u  Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

Page 21:

Example vector instructions

Each vector register holds multiple scalar values
u  In our example, a vector register v has 4 scalars. So,
u  mul.v  v1, v2, v1   element-wise vector multiply: v1[i] = v2[i] * v1[i]
u  mul.sv v1, r1, v1   multiplies every element of v1 by the scalar r1
u  lw.v   v1, 0(r1)    loads vector v1 from address r1
u  sw.v   v1, 0(r1)    stores vector v1 at address r1

Page 22:

lw.v loads four integers like 4 parallel lw

[Pipeline diagram: lw.v v1, 0(r1) issues four parallel loads:
 v1[0] <- mem[0(r1)], v1[1] <- mem[4(r1)], v1[2] <- mem[8(r1)], v1[3] <- mem[12(r1)]]

Page 23:

mul.v: element-wise vector multiply (4 parallel multiplies)

[Pipeline diagram: mul.v v1, v2, v1 performs four parallel multiplies:
 v1[0] <- v1[0]*v2[0], v1[1] <- v1[1]*v2[1], v1[2] <- v1[2]*v2[2], v1[3] <- v1[3]*v2[3]]

Page 24:

mul.sv: multiply a vector by a scalar (4 parallel multiplies)

[Pipeline diagram: mul.sv v1, r1, v1 performs four parallel multiplies:
 v1[0] <- v1[0]*r1, v1[1] <- v1[1]*r1, v1[2] <- v1[2]*r1, v1[3] <- v1[3]*r1]

Page 25:

Fader loop in Vector MIPS assembly

for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

u  Should do it four elements at a time

; a[] -> $2,  fade -> $3,  &a[N] -> $4,  $v1 is temp
loop:

Page 26:

Fader loop in Vector MIPS assembly

for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

u  Should do it four elements at a time
u  How many fewer instructions?

; a[] -> $2,  fade -> $3,  &a[N] -> $4,  $v1 is temp
loop:
    lw.v   $v1, 0($2)
    mul.sv $v1, $3, $v1
    sw.v   $v1, 0($2)
    addi   $2, $2, 16
    bne    $2, $4, loop
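A rough count, assuming the two loops above: the scalar version executes 5 instructions per element (about 5N in total), while the vector version executes 5 instructions per 4 elements (about 5N/4 in total), i.e. roughly 4x fewer dynamic instructions for this loop, ignoring leftover elements when N is not a multiple of 4.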

Page 27:

Operation & Instruction Count (from F. Quintana, U. Barcelona)

Spec92fp    Operations (Millions)       Instructions (Millions)
Program     Scalar  Vector  S/V         Scalar  Vector  S/V
swim256      115      95    1.1x         115     0.8    142x
hydro2d       58      40    1.4x          58     0.8     71x
nasa7         69      41    1.7x          69     2.2     31x
su2cor        51      35    1.4x          51     1.8     29x
tomcatv       15      10    1.4x          15     1.3     11x
wave5         27      25    1.1x          27     7.2      4x
mdljdp2       32      52    0.6x          32    15.8      2x

Vector reduces ops by 1.2X, instructions by 20X

Page 28:

Automatic Code Vectorization

for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

Compiler can detect vector operations
u  Inspect the code
u  Vectorize automatically

But, what about

for (i = 0; i < N; i++)
    a[i] = a[b[i]] * fade;


b[i] unknown at compile time!

Page 30:

x86 architecture SIMD support

u  Both current AMD and Intel x86 processors have ISA and microarchitecture support for SIMD operations
u  ISA SIMD support
w MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX
w See the flags field in /proc/cpuinfo
w SSE (Streaming SIMD Extensions): ISA extensions to x86
w SIMD/vector operations
u  Microarchitecture support
w Many functional units
w 8 128-bit vector registers: XMM0, XMM1, …, XMM7

Page 31:

SSE programming
u Vector registers support three data types:
w Integer (16 bytes, 8 shorts, 4 ints, 2 long long ints, 1 dqword)
w Single-precision floating point (4 floats)
w Double-precision floating point (2 doubles)

Page 32:

SSE instructions

u  Arithmetic instructions
w ADD, SUB, MUL, DIV, SQRT, MAX, MIN, RCP, etc.
w PD: two doubles, PS: 4 floats, SS: scalar
w ADDPS – add four floats, ADDSS – scalar add

u  Logical instructions
w AND, OR, XOR, ANDN, etc.
w ANDPS – bitwise AND of operands
w ANDNPS – bitwise AND NOT of operands

u  Comparison instructions
w CMPPS, CMPSS – compare operands and return all 1's or 0's
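As an illustration (not from the slides), the fader loop can be written with SSE intrinsics; _mm_set1_ps, _mm_load_ps, _mm_mul_ps (which compiles to MULPS) and _mm_store_ps are the standard single-precision intrinsics, while the alignment and multiple-of-4 assumptions are ours.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Fade a float array four elements at a time.
       Assumes n is a multiple of 4 and a is 16-byte aligned. */
    void fade_sse(float *a, float fade, int n)
    {
        __m128 f = _mm_set1_ps(fade);        /* broadcast fade into all 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(&a[i]);   /* load 4 floats */
            v = _mm_mul_ps(v, f);            /* 4 multiplies in one instruction */
            _mm_store_ps(&a[i], v);          /* store 4 floats */
        }
    }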

Page 33:

SIMD extensions in ARM: NEON

u  32 x 64-bit registers (also usable as 16 x 128-bit registers)
u  Registers considered as vectors of the same data type
u  Data types: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single-precision float
u  Instructions perform the same operation in all lanes

[Figure: a NEON operation applied lane by lane to the elements of source registers Dn and Dm, writing the destination register Dd]

Page 34:

This Course: Data Parallel Processor Architecture

1.  Vector Processors
w Pipelined execution
w SIMD: Single instruction, multiple data
w Example: modern ISA extensions

2.  Graphics Processing Units (GPUs)
w Dense grid of ALUs
w SIMT: Single instruction, multiple threads
w Integrated vs. discrete

Page 35:

CPU vs. GPU

CPU:
u Tens of cores
u Mostly control logic
u Large caches
u Regular threads (e.g., Java)

GPU:
u Thousands of tiny cores
u Mostly ALU
u Little cache
u Special threads (e.g., CUDA)

[Figure: CPU die dominated by control logic and cache vs. GPU die dominated by ALUs]

Page 36:

GPUs are highly concurrent!

[Figure: GFLOPS/sec over time for GPUs vs. CPUs, showing a growing performance gap]

Page 37:

Integrated vs. Discrete GPU

Integrated (e.g., AMD)
u  Shared cache hierarchy
u  One memory

Discrete (e.g., nVidia)
u  Specialized GPU memory
u  Must move data back/forth

[Figure: integrated CPU+GPU sharing one memory vs. discrete GPU with its own memory attached to the CPU over the I/O bus]

Page 38:

This course: Discrete GPU

Integrated (e.g., AMD)
u  Shared cache hierarchy
u  One memory

Discrete (e.g., nVidia)
u  Specialized GPU memory
u  Must move data back/forth

[Figure: as on the previous slide, highlighting the discrete option]

Page 39:

Warning! CPU/GPU connection is a bottleneck

[Figure: approximate bandwidths: GPU to GPU memory ~300 GB/s, CPU to memory ~30 GB/s, CPU to GPU over the I/O bus ~3 GB/s]

Page 40:

Sequential Execution Model / SISD

int a[N]; // N is large
for (i = 0; i < N; i++)
    a[i] = a[i] * fade;

One flow of control (thread)
One instruction at a time
Optimizations possible at the machine level

Page 41:

Data Parallel Execution Model / SIMD

int a[N]; // N is large
for all elements do in parallel
    a[i] = a[i] * fade;

Page 42:

Single Program Multiple Data / SPMD

int a[N]; // N is large
for all elements do in parallel
    if (a[i] > threshold) a[i] *= fade;

Code is statically identical across all threads
Execution path may differ
The model used in today's Graphics Processors

Page 43:

Killer app? 3D Graphics

Example apps:
w Games
w Engineering/CAD

Computation:
w Start with triangles (points in 3D space)
w Transform (move, rotate, scale)
w Paint / Texture mapping
w Rasterize -> convert into pixels
w Light & hidden "surface" elimination

Bottom line:
w Tons of independent calculations
w Lots of identical calculations

Page 44:

Target Applications

int a[N]; // N is large
for all elements of an array
    a[i] = a[i] * fade

u  Lots of independent computations
w CUDA threads need not be completely independent

[Figure: the loop body becomes a kernel; each element is handled by one thread]

Page 45:

Programmer's View of the GPU

u  GPU: a compute device that:
w  Is a coprocessor to the CPU or host
w  Has its own DRAM (device memory)
w  Runs many threads in parallel

u  Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads

Page 46:

GPU vs. CPU Threads

u GPU threads are extremely lightweight
w Little creation overhead (unlike Java)
w e.g., ~microseconds
w All done in hardware

u GPU needs 1000s of threads for full efficiency
w Multi-core CPU needs only a few

Page 47:

GPU threads help in two ways!

[Figure: (1) parallelize the computation (… * fade spread across many threads); (2) overlap memory accesses (… = a[i], a[i] = …) between CPU memory and GPU memory]

Page 48:

Execution Timeline

CPU / Host:  1. Copy to GPU mem  ->  2. Launch GPU kernel  ->  2'. Synchronize with GPU  ->  3. Copy from GPU mem
GPU / Device: executes the kernel between steps 2 and 2'

Page 49:

Programmer's view
u  First create data in CPU memory

[Figure: CPU with its memory, GPU with its GPU memory]

Page 50:

Programmer's view
u  Then copy to GPU

Page 51:

Programmer's view
u  GPU starts computation -> runs a kernel
u  CPU can also continue

Page 52:

Programmer's view
u  CPU and GPU synchronize

Page 53:

Programmer's view
u  Copy results back to CPU

Page 54:

Programming Languages

u  CUDA
w nVidia
w Has market lead

u  OpenCL
w Many vendors, including nVidia
w CUDA superset
w Targets many different devices, e.g., CPUs + programmable accelerators
w Fairly new

u  Both are evolving

Page 55:

Computation partitioning

u  Think of computation as a series of loops
u  Think of data as an array

for (i = 0; i < big_number; i++)
    a[i] = some function

for (i = 0; i < big_number; i++)
    a[i] = some other function

for (i = 0; i < big_number; i++)
    a[i] = some other function

} Kernels

Page 56:

What is the kernel here?

Page 57:

My first CUDA Program

// GPU code
__global__ void fadepic(int *a, int fade, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        a[i] = a[i] * fade;
}

// CPU (host) code
int main()
{
    int h[N];
    int *d;
    cudaMalloc((void **) &d, SIZE);
    …..
    cudaMemcpy(d, h, SIZE, cudaMemcpyHostToDevice);
    fadepic<<<n_blocks, block_size>>>(d, 10, N);
    cudaDeviceSynchronize();
    cudaMemcpy(h, d, SIZE, cudaMemcpyDeviceToHost);
    CUDA_SAFE_CALL(cudaFree(d));
}

Page 58:

Per Kernel Computation Partitioning

Threads within a block can communicate/synchronize
w Run on the same core

Threads across blocks can't communicate
w Shouldn't touch each other's data (undefined behavior)

[Figure: a grid partitioned into blocks of threads]

Page 59:

Per Kernel Computation Partitioning

u  One thread can process multiple data elements
u  Other mappings are possible and often desirable
w We will talk about this later (a small sketch follows)
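A small sketch of one such mapping (not from the slides): the common CUDA "grid-stride loop", where each thread handles every stride-th element.

    __global__ void fadepic_stride(int *a, int fade, int N)
    {
        int stride = blockDim.x * gridDim.x;             // total threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += stride)
            a[i] = a[i] * fade;                          // one thread, many elements
    }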

Page 60:

Fade example
u  Each thread will process one pixel

for all elements do in parallel
    a[i] = a[i] * fade;

Page 61:

Code Skeleton

u  CPU:
w  Initialize image from file
w  Allocate buffer on GPU
w  Copy image to buffer
w  Launch GPU kernel
w  Reads and writes into buffer
w  Copy buffer back to CPU
w  Write image to a file

u  GPU:
w  Launch a thread per pixel

Page 62:

GPU Kernel pseudo-code

__global__ void fadepic(int *a,
                        int fade,
                        int N)
{
    int v = a[x][y];
    v = v * fade;
    a[x][y] = v;
}

u  This is the program for one thread
u  It processes one pixel

Page 63:

Which thread computes which pixel?

[Figure: a 2D grid of gridDim.x x gridDim.y blocks, each blockDim.x x blockDim.y threads; a thread's position inside its block is (threadIdx.x, threadIdx.y)]

Page 64:

gridDim

u  gridDim.x = 7, gridDim.y = 6
u  How many blocks per dimension?

Page 65:

blockIdx
u  blockIdx = coordinates of a block in the grid
u  blockIdx.x = 2, blockIdx.y = 3
u  blockIdx.x = 5, blockIdx.y = 1

[Figure: grid with block (0,0) at the origin and the two example blocks marked]

Page 66:

blockDim

u  blockDim.x = 7, blockDim.y = 7
u  How many threads in a block per dimension?

Page 67:

threadIdx
u  threadIdx = coordinates of a thread in the block
u  threadIdx.x = 2, threadIdx.y = 3
u  threadIdx.x = 5, threadIdx.y = 4

[Figure: block with thread (0,0) at the origin and the two example threads marked]

Page 68:

Which thread computes which pixel?

[Figure: 2D grid of blocks of threads, as on the earlier slide]

x = blockIdx.x * blockDim.x + threadIdx.x
y = blockIdx.y * blockDim.y + threadIdx.y
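For example, with blockDim = (16, 16), the thread with blockIdx = (2, 3) and threadIdx = (5, 4) computes x = 2*16 + 5 = 37 and y = 3*16 + 4 = 52.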

Page 69:

GPU Kernel pseudo-code

__global__ void fade(int *a,
                     int fade,
                     int N)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int offset = y * (blockDim.x * gridDim.x) + x;  // offset within the 1D array
    int v = a[offset];
    v = v * fade;
    a[offset] = v;
}

Page 70:

GPU Kernel pseudo-code w/ limits

__global__ void fade(int *a,
                     int fade,
                     int N)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int offset = y * (blockDim.x * gridDim.x) + x;
    if (offset >= N) return;   // threads past the end of the array do nothing
    int v = a[offset];
    v = v * fade;
    a[offset] = v;
}
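A minimal host-side sketch (not from the slides) of launching this kernel over a W x H image; W, H and d_a are illustrative names, while dim3 and the <<<...>>> syntax are standard CUDA.

    dim3 block(16, 16);                           // 256 threads per block
    dim3 grid((W + block.x - 1) / block.x,        // round up so the grid
              (H + block.y - 1) / block.y);       // covers all W x H pixels
    fade<<<grid, block>>>(d_a, 10, W * H);        // d_a: device copy of the image
    cudaDeviceSynchronize();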

Page 71:

Grids of Blocks of Threads

Cores and caches are clustered on chip for fast connectivity
The hardware is naturally partitioned into grids

Page 72:

Programmer’s view: Memory Model

Page 73:

Grids of Thread Blocks: Dimension Limits

[Figure: Device Grid 1 contains Blocks (0,0) … (2,1); Block (1,1) is expanded into Threads (0,0) … (4,2)]

u  Grid of Blocks: 1D, 2D, or 3D
w Max x, y and z: 2^32 - 1
w Machine dependent

u  Block of Threads: 1D, 2D, or 3D
w Max number of threads: 1024
w Max x: 1024
w Max y: 1024
w Max z: 64

Page 74:

Thread Batching

u Kernel executed as a grid of thread blocks

u Threads in a block cooperate
w Synchronize their execution
w Efficiently share data in block-local memory

u Threads across blocks cannot cooperate

[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2; each grid consists of blocks, and each block of threads]

Page 75:

Thread Coordination Overview

u  Race-free access to data

Only across threads within the same block
No communication across blocks
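A minimal sketch (not from the slides) of threads in one block coordinating through block-local memory; __shared__ and __syncthreads() are the standard CUDA primitives for this, and the 256-thread block size is an assumption.

    // Reverse each block-sized chunk of a[] in place.
    __global__ void reverse_block(int *a)
    {
        __shared__ int s[256];                    // block-local (shared) memory
        int t = threadIdx.x;
        s[t] = a[blockIdx.x * blockDim.x + t];    // each thread stages one element
        __syncthreads();                          // wait for every thread in the block
        a[blockIdx.x * blockDim.x + t] = s[blockDim.x - 1 - t];  // read another thread's value
    }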

Page 76:

Programmer’s view: Memory Model: Thread vs. Host

Arrows show whether read and/or write is possible

Page 77:

Memory Model Summary

Memory     Location   Access   Scope
Local      off-chip   R/W      thread
Shared     on-chip    R/W      all threads in a block
Global     off-chip   R/W      all threads + host
Constant   off-chip   RO       all threads + host
Texture    off-chip   RO       all threads + host
Surface    off-chip   R/W      all threads + host
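A small sketch (not from the slides) of where some of these spaces appear in CUDA source; the array sizes and the one-block, 256-thread launch are assumptions.

    __constant__ float coeff[16];    // constant memory: read-only in kernels, set by the host
    __device__   float table[1024];  // global memory: read/write by all threads and the host API

    __global__ void scale_table(float *out)
    {
        __shared__ float tile[256];              // shared memory: visible to one block only
        int i = threadIdx.x;                     // ordinary locals live in registers (thread scope)
        tile[i] = table[i] * coeff[i % 16];
        __syncthreads();                         // make the block's shared-memory writes visible
        out[i] = tile[i];
    }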

Page 78:

Memory Model: Global, Constant, and Texture Memories

u Global memory
–  Communicating R/W data between host and device
–  Contents visible to all threads
–  May be cached (machine dependent)

u Texture and Constant Memories
–  Constants initialized by host
–  Contents visible to all threads
–  May be cached (machine dependent)

Page 79:

Execution Model: Ordering

u  Execution order is undefined
u  Do not assume and use:
w block 0 executes before block 1
w thread 10 executes before thread 20
w or any other ordering, even if you can observe it

u  Future implementations may break this ordering
u  It's not part of the CUDA definition
u  Why? More flexible hardware options

Page 80:

Reasoning about CUDA call ordering

u Access GPU via cuda…() calls and kernel invocations
w cudaMalloc, cudaMemcpy

u Asynchronous from the CPU's perspective
w CPU places a request in a "CUDA" queue
w requests are handled in-order

Page 81:

Execution Model Summary (for your reference)

u  Grid of blocks of threads
w  1D/2D/3D grid of blocks of 1D/2D/3D threads
w  Threads and blocks have IDs

u  Block execution order is undefined

u  Same-block threads can share data fast

u  Across blocks, threads:
w  Cannot cooperate
w  Communicate (slowly) through global memory

u  Blocks do not migrate: execute on the same processor

u  Several blocks may run over the same core

Page 82:

CUDA API: Example

int a[N];
for (i = 0; i < N; i++)
    a[i] = a[i] + x;

1.  Allocate CPU Data Structure
2.  Initialize Data on CPU
3.  Allocate GPU Data Structure
4.  Copy Data from CPU to GPU
5.  Define Execution Configuration
6.  Run Kernel
7.  CPU synchronizes with GPU
8.  Copy Data from GPU to CPU
9.  De-allocate GPU and CPU memory

Page 83:

1. Allocate CPU data structure

float *ha;

int main(int argc, char *argv[])
{
    int N = atoi(argv[1]);
    ha = (float *) malloc(sizeof(float) * N);
    ...
}

Page 84:

2. Initialize CPU data (dummy)

float *ha;
int i;

for (i = 0; i < N; i++)
    ha[i] = i;

Page 85:

3. Allocate GPU data structure

float *da;
cudaMalloc((void **) &da, sizeof(float) * N);

u  Notice: no assignment side
w  NOT: da = cudaMalloc(…)
u  Assignment is done internally:
w  That's why we pass &da
u  Space is allocated in Global Memory on the GPU

Page 86:

GPU Memory Allocation

u The host manages GPU memory allocation:
w cudaMalloc(void **ptr, size_t nbytes)
w Must explicitly cast to (void **)
w cudaMalloc((void **) &da, sizeof(float) * N);
w cudaFree(void *ptr);
w cudaFree(da);
w cudaMemset(void *ptr, int value, size_t nbytes);
w cudaMemset(da, 0, N * sizeof(int));

u Check the CUDA Reference Manual

Page 87:

4. Copy Initialized CPU data to GPU

float *da;
float *ha;

cudaMemcpy((void *) da,             // DESTINATION
           (void *) ha,             // SOURCE
           sizeof(float) * N,       // #bytes
           cudaMemcpyHostToDevice); // DIRECTION

Page 88:

Host/Device Data Transfers

The host initiates all transfers:
u cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)

u Asynchronous from the CPU's perspective
w CPU thread continues

u  In-order processing with other CUDA requests

u enum cudaMemcpyKind

w cudaMemcpyHostToDevice

w cudaMemcpyDeviceToHost

w cudaMemcpyDeviceToDevice

Page 89:

5. Define Execution Configuration

u  How many blocks and threads/block

int threads_block = 64;
int blocks = N / threads_block;
if (N % threads_block != 0) blocks += 1;

u Alternatively:

blocks = (N + threads_block - 1) / threads_block;
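For example, with N = 1000 and threads_block = 64: 1000 / 64 = 15 with a remainder, so blocks becomes 16, i.e. 16 * 64 = 1024 threads are launched and the last 24 do no useful work (hence the bounds check inside the kernel).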

Page 90:

6. Launch Kernel & 7. CPU/GPU Synchronization

u GPU launches blocks x threads_block threads:

arradd<<<blocks, threads_block>>>(da, 10.0f, N);
cudaDeviceSynchronize();   // forces CPU to wait

u  arradd: kernel name
u  <<<…>>>: execution configuration
u  (da, x, N): arguments
w 256 byte limit / No variable arguments
w Not sure this is still true

Page 91:

CPU/GPU Synchronization

u CPU does not block on cuda…() calls
w Kernel/requests are queued and processed in-order
w Control returns to CPU immediately

u Good if there is other work to be done
w e.g., preparing for the next kernel invocation

u  Eventually, CPU must know when GPU is done
u  Then it can safely copy the GPU results
u  cudaDeviceSynchronize()

w Block CPU until all preceding cuda…() and kernel requests have completed

w Used to be cudaThreadSynchronize ()

Page 92:

8. Copy data from GPU to CPU & 9. Deallocate Memory

float *da;
float *ha;

cudaMemcpy((void *) ha,             // DESTINATION
           (void *) da,             // SOURCE
           sizeof(float) * N,       // #bytes
           cudaMemcpyDeviceToHost); // DIRECTION

cudaFree(da);

// display or process results here

free(ha);

Page 93:

The GPU Kernel

__global__ void darradd(float *da, float x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < N) da[i] = da[i] + x;
}

Page 94:

CUDA Function Declarations

u __global__ defines a kernel function
w Must return void
w Can only call __device__ functions

u __device__ and __host__ can be used together
w Two different versions are generated

Declaration                        Executed on the   Only callable from the
__device__ float DeviceFunc()      device            device
__global__ void KernelFunc()       device            host
__host__ float HostFunc()          host              host
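A minimal sketch (not from the slides) of __host__ and __device__ used together: the helper is compiled twice, and the kernel calls the device version. The function names are ours.

    __host__ __device__ float scale(float v, float f) { return v * f; }

    __global__ void scale_all(float *a, float f, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) a[i] = scale(a[i], f);   // uses the __device__ version
    }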

Page 95:

Can you do this one now?

(C) = (A) · (B)    (matrix-matrix multiplication)
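One possible answer, as a sketch (not from the slides): one thread per element of C, for N x N row-major matrices.

    __global__ void matmul(const float *A, const float *B, float *C, int N)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }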

Page 96:

Summary

u  Data Parallel Computing
w Much of media processing is data parallel
w All of data analytics on datacenters & beyond

u  Platforms for data parallel computing
w Within the CPU: SIMD/Vector
w Across the CPU: GPU
w Beyond a single computer: cluster of servers

u  GPUs: orders of magnitude more concurrent than CPUs
u  GPU programming
w  It's complicated
w  Take your time