prefix sums on gpus - gpu.cs.uct.ac.za

Post on 19-Oct-2021

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Prefix sums on GPUs

Bruce Merry

Department of Computer Science, University of Cape Town

GPGPU2 Workshop 2014

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Problem Statement

For every object in a set, output a list of the other objectsthat differ by less than some amount.This is deliberately vague: could be for n-body simulation,clustering, scattered data interpolation.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Problem Statement

For every object in a set, output a list of the other objectsthat differ by less than some amount.This is deliberately vague: could be for n-body simulation,clustering, scattered data interpolation.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2

Assuming one workitem per object, how do the workitemsknow where to start?

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2 B0 B1

Assuming one workitem per object, how do the workitemsknow where to start?

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2 B0 B1 D0

Assuming one workitem per object, how do the workitemsknow where to start?

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2 B0 B1 D0 E0 E1

Assuming one workitem per object, how do the workitemsknow where to start?

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Output Format

The lists should be packed together contiguously.

A0 A1 A2 B0 B1 D0 E0 E1

Assuming one workitem per object, how do the workitemsknow where to start?

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Solution

This can be solved with a multi-pass approach:1 Every workitem counts how many records to emit, and

writes this number to a buffer.2 The buffer is processed to determine the start position

for each object, and writes this position to a buffer.3 Each workitem reads this buffer, and emits its records

in the right place.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Solution

This can be solved with a multi-pass approach:1 Every workitem counts how many records to emit, and

writes this number to a buffer.2 The buffer is processed to determine the start position

for each object, and writes this position to a buffer.3 Each workitem reads this buffer, and emits its records

in the right place.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Solution

This can be solved with a multi-pass approach:1 Every workitem counts how many records to emit, and

writes this number to a buffer.2 The buffer is processed to determine the start position

for each object, and writes this position to a buffer.3 Each workitem reads this buffer, and emits its records

in the right place.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Exclusive Prefix Sum

Given an operator ⊕ and an identity element I, the exclusiveprefix sum of (a0,a1, . . . ,an−1) is

(I,a0,a0 ⊕ a1,a0 ⊕ a1 ⊕ a2, . . . ,a0 ⊕ · · · ⊕ an−2) =

i−1⊕j=0

aj

In other words, element i is the sum of all elements strictlybefore i .

4 3 7 9 2 3

0 4 7 14 23 25

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Exclusive Prefix Sum

Given an operator ⊕ and an identity element I, the exclusiveprefix sum of (a0,a1, . . . ,an−1) is

(I,a0,a0 ⊕ a1,a0 ⊕ a1 ⊕ a2, . . . ,a0 ⊕ · · · ⊕ an−2) =

i−1⊕j=0

aj

In other words, element i is the sum of all elements strictlybefore i .

4 3 7 9 2 3

0 4 7 14 23 25

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Inclusive Prefix Sum

Given an operator ⊕ and an identity element I, the inclusiveprefix sum of (a0,a1, . . . ,an−1) is

(a0,a0 ⊕ a1,a0 ⊕ a1 ⊕ a2, . . . ,a0 ⊕ · · · ⊕ an−1) =

i⊕j=0

aj

In other words, element i is the sum of all elements beforeand including i .

4 3 7 9 2 3

4 7 14 23 25 28

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Inclusive Prefix Sum

Given an operator ⊕ and an identity element I, the inclusiveprefix sum of (a0,a1, . . . ,an−1) is

(a0,a0 ⊕ a1,a0 ⊕ a1 ⊕ a2, . . . ,a0 ⊕ · · · ⊕ an−1) =

i⊕j=0

aj

In other words, element i is the sum of all elements beforeand including i .

4 3 7 9 2 3

4 7 14 23 25 28

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Other Applications

Compaction: select all objects that satisfy a predicatePartitioning: rearrange objects that satisfy a predicatebefore the othersSorting: radix sort is just repeated partitioningVisibility: an object is visible if it is not preceded by ataller one (using max operator instead of +)Meshing: each cell produces an variable number oftriangles

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Atomics

Atomics offer an alternative way to allocate unique memoryper work-item, but

Suffer heavy contention, which is slow (but gettingbetter all the time)Do not preserve the original orderingDo not give reproducible ordering

Atomics have the advantage of allowing for single-passalgorithms.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Idea

Let sti be the sum of the (up to) t inputs ending with ai . Then

s2ti = st

i−t ⊕ sti .

We start with (s1i ) = (ai), then compute (s2

i ), (s4i ), (s

8i ) and

so on, up to (sNi ), in O(log2 N) iterations, to give an

inclusive prefix sum.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

6 9 2 3 6 1 6 3 s1

6 15 11 5 9 7 7 9 s2

6 15 17 20 20 12 16 16 s3

6 15 17 20 26 27 33 36 s4

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Pseudo-code

foreach power-of-two t from 1 to N dofor i ← t to N − 1 do in parallel

ai ← ai−t ⊕ ai ;

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Work-item Pseudo-code

i ← workitem ID;foreach power-of-two t from 1 to N do

x ← ai ;if t ≤ i then

x ← x ⊕ ai−t ;barrier();ai ← x ;barrier();

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Optimizations

The working register x can be reused between loopiterations without reloading.The if statement can be eliminated by padding at thefront with zeros.Shared memory can be used to reduce global memoryaccesses.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Optimizations

The working register x can be reused between loopiterations without reloading.The if statement can be eliminated by padding at thefront with zeros.Shared memory can be used to reduce global memoryaccesses.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Optimizations

The working register x can be reused between loopiterations without reloading.The if statement can be eliminated by padding at thefront with zeros.Shared memory can be used to reduce global memoryaccesses.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

It is work-inefficient: it performs O(N log N) operationsin totalAbout 2 log2 N barriersAbout N log N reads and N log N writesMemory access pattern is good: sequential accesses

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

It is work-inefficient: it performs O(N log N) operationsin totalAbout 2 log2 N barriersAbout N log N reads and N log N writesMemory access pattern is good: sequential accesses

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

It is work-inefficient: it performs O(N log N) operationsin totalAbout 2 log2 N barriersAbout N log N reads and N log N writesMemory access pattern is good: sequential accesses

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

It is work-inefficient: it performs O(N log N) operationsin totalAbout 2 log2 N barriersAbout N log N reads and N log N writesMemory access pattern is good: sequential accesses

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Idea

For an exclusive scan:Add pairs of adjacent elements:

pi = a2i ⊕ a2i+1

Recursively scan these sums:

qi =i−1⊕j=0

pi =2i−1⊕j=0

aj

Use these sums to compute the result:

s2i = qi , s2i+1 = qi ⊕ a2i

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

6 9 2 3 6 1 6 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

15

6 9

5

2 3

7

6 1

9

6 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

20

15

6 9

5

2 3

16

7

6 1

9

6 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

36

20

15

6 9

5

2 3

16

7

6 1

9

6 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

0

20

15

6 9

5

2 3

16

7

6 1

9

6 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

0

0

15

6 9

5

2 3

20

7

6 1

9

6 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

0

0

0

6 9

15

2 3

20

20

6 1

27

6 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Example

0

0

0

0 6

15

15 17

20

20

20 26

27

27 33

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Memory Arrangement

In-place Each sum replaces the second element of thepair being summed. No extra memory, but hasbad bank conflicts.

Out-of-place Each level of the tree stored contiguously.Requires double the memory, but conflicts areonly 2-way.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Memory Arrangement

In-place Each sum replaces the second element of thepair being summed. No extra memory, but hasbad bank conflicts.

Out-of-place Each level of the tree stored contiguously.Requires double the memory, but conflicts areonly 2-way.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Pseudo-code

Out-of-place exclusive sum, for N = 2n:Copy ai to bi+N for i ∈ [0,N);for t ← n − 1 downto 1 do

for i ← 0 to 2t − 1 do in parallelb2t+i ← b2t+1+i ⊕ b2t+1+i+1;

// Exclusive prefix sum of two elementsb3 ← b2;b2 ← I;for t ← 1 to n − 1 do

for i ← 0 to 2t − 1 do in parallelb2t+1+i+1 ← b2t+i ⊕ b2t+1+i ;b2t+1+i ← b2t+i ;

Copy bi+N to si for i ∈ [0,N);

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Per-workitem Pseudo-code

Uses N2 work-items:

i ← work-item ID;bi+N ← ai ;bi+N+ N

2← ai+ N

2;

barrier();for t ← n − 1 downto 1 do

if i < 2t thenb2t+i ← b2t+1+2i ⊕ b2t+1+2i+1;

barrier();if i = 0 then

a3 ← a2;a2 ← I;

barrier();for t ← 1 to n − 1 do

if i < 2t thenb2t+1+2i+1 ← b2t+i ⊕ b2t+1+2i ;b2t+1+2i ← b2t+i ;

barrier();si ← bi+N ;si+ N

2← bi+N+ N

2;

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Properties

Work-efficient: O(N) addition operationsStill requires about 2 log2 N barriersRequires about 4N reads and 3N writesOnly N

2 work-items requiredHas branching, but it is coherent

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Motivation

Applying one of these at a larger (multi-workgroup) scalehas issues:

Synchronisation: no inter-workgroup synchronisation,so barriers must be kernel-instance boundariesMemory usage: need O(N) working spaceBandwidth: requires O(N log N) for Kogge-Stone, about7N for Brent-Kung

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

6 9 2 3 6 1 6 3 3 1 5 4 9 7 3 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

20

6 9 2 3

16

6 1 6 3

13

3 1 5 4

22

9 7 3 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

71

20

6 9 2 3

16

6 1 6 3

13

3 1 5 4

22

9 7 3 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

0

20

6 9 2 3

16

6 1 6 3

13

3 1 5 4

22

9 7 3 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

0

0

6 9 2 3

20

6 1 6 3

36

3 1 5 4

49

9 7 3 3

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Generalizing Brent-Kung

The Brent-Kung tree doesn’t have to be binary:

0

0

0 6 15 17

20

20 26 27 33

36

36 39 40 45

49

49 58 65 68

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reduce-then-Scan strategy

1 Divide elements into blocks of size M.2 Use a workgroup per block to compute sum of each

block.3 Recursively prefix-sum the block sums.4 Use a workgroup per block to prefix-sum each block,

starting from the result from the previous level.

Steps 2 and 4 can use any parallel reduction/prefix sumalgorithm.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Analysis

Assuming that M is reasonably large:About logM N kernel instancesMost memory accesses can be to local memorySlightly over 2N global readsSlightly over N global writesSlightly over O(N log M) barrier instructions

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Outline

1 Definition and ApplicationsMotivating ProblemDefinitionsOther Applications

2 Parallel AlgorithmsKogge-StoneBrent-Kung

3 GPU StrategiesReduce-then-ScanTwo-Level Prefix Sum

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reducing Parallelism

The more parallelism one uses in a prefix sum, the higherthe overheads become.Therefore, only use as much parallelism as is necessary tosaturate the hardware.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Reducing Parallelism

The more parallelism one uses in a prefix sum, the higherthe overheads become.Therefore, only use as much parallelism as is necessary tosaturate the hardware.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Fixed Block Count

Use the same reduce-then-scan strategy, butFix the number of blocks at C, set M = N

C

Fix a work-group size GC should be tuned so that C × G workitems saturatethe deviceC should be small enough that only 2 levels arerequired

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Prefix-Summing a Block

Each block has size M but workgroups only have Gworkitems. How does a workgroup prefix-sum a block?Serially. In sub-blocks of size G or 2G.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Prefix-Summing a Block

Each block has size M but workgroups only have Gworkitems. How does a workgroup prefix-sum a block?Serially. In sub-blocks of size G or 2G.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Prefix-Summing a Block

Each block has size M but workgroups only have Gworkitems. How does a workgroup prefix-sum a block?Serially. In sub-blocks of size G or 2G.

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Advantages

Only three kernel instances, two of which use the fullGPUOnly O(N log G) barrier instructionsOnly O(C) extra global memory

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Advantages

Only three kernel instances, two of which use the fullGPUOnly O(N log G) barrier instructionsOnly O(C) extra global memory

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Advantages

Only three kernel instances, two of which use the fullGPUOnly O(N log G) barrier instructionsOnly O(C) extra global memory

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Summary

Parallel prefix sum is hard workGPUs need parallelism, but algorithm works best withleast parallelismWith good implementation, can be bandwidth-limited

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Summary

Parallel prefix sum is hard workGPUs need parallelism, but algorithm works best withleast parallelismWith good implementation, can be bandwidth-limited

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

Summary

Parallel prefix sum is hard workGPUs need parallelism, but algorithm works best withleast parallelismWith good implementation, can be bandwidth-limited

Prefix sumson GPUs

Bruce Merry

Definition andApplicationsMotivating Problem

Definitions

Other Applications

ParallelAlgorithmsKogge-Stone

Brent-Kung

GPUStrategiesReduce-then-Scan

Two-Level PrefixSum

Summary

References

Guy E. Blelloch.Prefix sums and their applications.Technical Report CMU-CS-90-190, Computer ScienceDepartment, Carnegie Mellon University, November1990.

Duane Merrill and Andrew Grimshaw.Parallel scan for stream architectures.Technical Report CS2009-14, Department of ComputerScience, University of Virginia, December 2009.

top related