
General Transformations for GPU Execution of Tree Traversals

Michael Goldfarb*, Youngjoon Jo**, Milind Kulkarni

School of Electrical and Computer Engineering

* Now at Qualcomm; ** Now at Google

GPU execution of irregular programs

• GPUs offer promise of massive, energy-efficient parallelism

• Much success in mapping regular applications to GPUs

• Regular memory accesses, predictable computation

• Much less success in mapping irregular applications

• Pointer-based data structures

• Unpredictable, input-dependent computation and memory accesses


Tree traversal algorithms

• Many irregular algorithms are built around tree-traversal

• Barnes-Hut

• Nearest-neighbor

• 2-point correlation

• Numerous papers describing how to map tree traversal algorithms to GPUs


Point correlation

• Data mining algorithm

• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p

• Naïve approach: compare all N points with p

• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p

Point correlation

[Figure: building a kd-tree over the points. The root A has children B and G; B's subtree contains C, F and the leaves D, E; G's subtree contains H, K and the leaves I, J.]

Point correlation

[Figure: animation of a single point's traversal of the kd-tree, descending from the root A and pruning subtrees that are too far from the point.]

Point correlation

KDCell root = /* build kd-tree */;
Set<Point> ps;
double radius;

foreach Point p in ps {
  recurse(p, root, radius);
}
...
void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}

Basic pattern

tree structure:
TreeNode root;
Set<Point> ps;

repeated traversal:
foreach Point p in ps {
  recurse(p, root, ...);
}
...
recursive traversal:
recurse(Point p, TreeNode node, ...) {
  if (truncate?(p, node, ...)) { ... }
  recurse(p, node.child1, ...);
  recurse(p, node.child2, ...);
  ...
}

Lots of parallelism!

What’s the problem?

• GPUs add high overhead for recursion

• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses

• Status quo: ad hoc solutions

• New algorithm? New GPU techniques!

Want generally applicable techniques for mapping irregular applications to GPUs

Contributions

• Two general techniques for mapping tree-traversals to GPUs

• Autoropes: eliminates recursion overhead

• Lockstepping: promotes memory coalescing

• Compiler pass to automatically apply techniques to recursive tree-traversal code

• Significant GPU speedups on 5 tree-traversal algorithms


Naïve GPU implementation

• Warp-based SIMT (single-instruction, multiple-thread) execution

• 32 points are assigned to a single warp

• The warp traverses the tree

• All points in a warp must execute the same instruction

• If points diverge, some threads sit idle while others execute

(A sketch of such a kernel follows below.)
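To make the baseline concrete, here is a minimal CUDA sketch of such a naive one-thread-per-point kernel. It is illustrative only, not the authors' code: the KDNode and Point layouts and the tooFar/dist2 helpers are assumptions chosen for the point-correlation example.

// A minimal sketch of a naive one-thread-per-point kernel (hypothetical names):
// each thread recursively traverses the same kd-tree for its own point; threads
// in a warp that take different branches serialize under SIMT.
struct KDNode {                       // assumed device-side node layout
    float point[2];                   // point stored at a leaf
    float minB[2], maxB[2];           // bounding box of the subtree
    KDNode *left, *right;             // nullptr at leaves
};

struct Point { float coord[2]; int correlated; };

// Squared distance from p to the node's bounding box; prune if it exceeds r^2.
__device__ bool tooFar(const Point &p, const KDNode *n, float r) {
    float d2 = 0.f;
    for (int k = 0; k < 2; k++) {
        float c = fminf(fmaxf(p.coord[k], n->minB[k]), n->maxB[k]);
        d2 += (p.coord[k] - c) * (p.coord[k] - c);
    }
    return d2 > r * r;
}

__device__ float dist2(const float *a, const float *b) {
    float d2 = 0.f;
    for (int k = 0; k < 2; k++) d2 += (a[k] - b[k]) * (a[k] - b[k]);
    return d2;
}

// Device-side recursion: call overhead plus stack frames in slow local memory;
// deep trees may also require cudaDeviceSetLimit(cudaLimitStackSize, ...).
__device__ void recurse(Point &p, const KDNode *node, float r) {
    if (node == nullptr || tooFar(p, node, r)) return;
    if (node->left == nullptr && node->right == nullptr) {        // leaf
        if (dist2(node->point, p.coord) < r * r) p.correlated++;
    } else {
        recurse(p, node->left, r);
        recurse(p, node->right, r);
    }
}

__global__ void correlateNaive(Point *pts, int n, const KDNode *root, float r) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) recurse(pts[tid], root, r);  // 32 consecutive points share a warp
}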

Naïve GPU implementation

[Figure: step-by-step animation of a warp of points traversing the kd-tree. When points in the warp need different subtrees, the warp's traversal serializes: it walks down one subtree, climbs back up through already-visited nodes, then descends into the other, so threads repeatedly sit idle and the same tree nodes are accessed again and again.]

Lots of accesses to tree

• Many accesses just move up the tree in order to later move down again

• Lots of function stack manipulation

• Trees are very large and cannot be stored in the GPU's fast memory

• Want to minimize accesses to the tree

How to avoid extra accesses to tree?

• Typical technique: ropes

• Pointers in each tree node that let a traversal jump to the next part of the tree

• Effectively linearizes traversal

[Figure: the kd-tree with rope pointers linking each node to the next node the traversal should visit.]

How to avoid extra accesses to tree?

• Installing ropes into a tree requires complex, application-specific preprocessing

• Using ropes correctly during execution requires complex, application-specific logic

Autoropes

• General technique for “linearizing” tree traversals for arbitrary traversal algorithms

• Achieves generality, simplicity, and space-efficiency at the cost of some overhead

• Key insight: recursive tree algorithms are just depth-first traversals of a tree, so they can be transformed into iterative algorithms

Autoropes

[Figure: animation of an autoropes traversal. As each node is visited, the children that still need to be visited are pushed onto a rope stack; the next node to visit is popped from the stack, so the traversal never has to walk back up the tree.]

Autoropes

• Ropes stored on rope stack instead of in tree

• No application-specific code to use ropes

• Ropes instantiated dynamically

• No preprocessing required

• Same access patterns as with manual ropes

• Extra pushes and pops on rope stack add some overhead


Autoropes

void recurse(Point p, KDCell node, double r) {
  if (tooFar(p, node, r)) return;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    recurse(p, node.left, r);
    recurse(p, node.right, r);
  }
}

A simple compiler transformation converts recursive code into iterative autoropes code

Autoropes

ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r)) continue;
  if (node.isLeaf() && (dist(node.point, p) < r))
    p.correlated++;
  else {
    ropeStack.push(node.right);
    ropeStack.push(node.left);
  }
}

See the paper for details of how to transform more complex code
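A device-side rendering of this iterative traversal might look like the following sketch, with an explicit per-thread stack. The fixed stack size is an illustrative assumption, and KDNode, Point, tooFar, and dist2 are assumed to be defined as in the earlier naive-kernel sketch.

// Sketch of the traversal after the autoropes transformation (hypothetical
// names): children still to be visited are pushed onto an explicit stack, so
// the traversal never returns back up the tree. Assumes internal nodes have
// both children, as in the slide code.
#define ROPE_STACK_MAX 64                         // >= tree depth + 1

__device__ void traverseAutoropes(Point &p, const KDNode *root, float r) {
    const KDNode *ropeStack[ROPE_STACK_MAX];      // per-thread rope stack
    int top = 0;
    ropeStack[top++] = root;

    while (top > 0) {
        const KDNode *node = ropeStack[--top];    // pop the next rope
        if (tooFar(p, node, r)) continue;         // formerly: return
        if (node->left == nullptr && node->right == nullptr) {
            if (dist2(node->point, p.coord) < r * r) p.correlated++;
        } else {
            ropeStack[top++] = node->right;       // push right first so the
            ropeStack[top++] = node->left;        // left child is visited next
        }
    }
}

__global__ void correlateAutoropes(Point *pts, int n, const KDNode *root, float r) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) traverseAutoropes(pts[tid], root, r);
}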

Unintended consequence

• Recursive calls naturally lead to thread divergence

• If some threads make recursive calls, other threads wait until calls return

• Does not happen for iterative code

• All threads reconverge at beginning of loop

ropeStack.push(root);
while (!ropeStack.isEmpty()) {
  node = ropeStack.pop();
  if (tooFar(p, node, r))
    continue;                        // formerly return
  if (...)
    p.correlated++;
  else {
    ropeStack.push(node.right);      // formerly recursion
    ropeStack.push(node.left);
  }
}

Autoropes on GPU

[Figure: the same warp of points traversing with autoropes. All threads stay in the loop together, but each pops a different node from its rope stack, so their loads scatter across the tree.]

Threads no longer diverge in execution

But do diverge in tree!

Thread divergence vs. memory coalescing

• Memory accesses on the GPU are only well behaved if the accesses by all threads in a warp can be coalesced

• Same memory location or strided access

• The bad memory behavior of autoropes outweighs the lack of thread divergence

• Goal: the benefits of autoropes while maintaining memory coalescing

Lockstepping

• Essentially, force all threads in a warp to traverse the tree together (sketched below)

• If any thread in a warp wants to visit a node's children, all threads in the warp visit them

• Threads that are “dragged along” are programmatically masked out

• Warp execution takes longer (proportional to the union of the threads' traversals, rather than the longest traversal), but improved memory performance makes up for it

• Automatically implemented during the autoropes compiler pass
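One plausible way to express lockstepping in CUDA is with a warp vote, sketched below. The use of __any_sync and this particular masking scheme are illustrative assumptions, not necessarily the code the authors' compiler pass emits; KDNode, Point, tooFar, dist2, and ROPE_STACK_MAX are assumed as in the earlier sketches.

// Lockstepped traversal sketch: every lane keeps an identical stack and visits
// the same node sequence, so node loads coalesce. A warp vote decides whether
// anyone still needs a node; lanes for which it is pruned are "dragged along"
// but masked out of the useful work. Assumes the number of points is padded to
// a multiple of 32 so no lane in the warp exits early.
__device__ void traverseLockstep(Point &p, const KDNode *root, float r) {
    const unsigned FULL_MASK = 0xffffffffu;
    const KDNode *ropeStack[ROPE_STACK_MAX];
    int top = 0;
    ropeStack[top++] = root;

    while (top > 0) {
        const KDNode *node = ropeStack[--top];   // same node for the whole warp
        bool wants = !tooFar(p, node, r);        // does *this* lane need it?
        if (!__any_sync(FULL_MASK, wants))       // nobody needs it: all lanes skip
            continue;
        if (node->left == nullptr && node->right == nullptr) {
            if (wants && dist2(node->point, p.coord) < r * r)
                p.correlated++;                  // masked lanes do nothing here
        } else {
            ropeStack[top++] = node->right;      // all lanes push both children,
            ropeStack[top++] = node->left;       // keeping their stacks identical
        }
    }
}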

Dynamic lockstepping

• Some algorithms allow different traversal orders

• Some points visit the left child before the right, and others visit the right before the left

• This optimization reduces traversal size

• But different orders within a warp lead to inherently bad memory access patterns

Dynamic lockstepping

[Figure: animation of a warp whose points prefer different child orders; all points nevertheless follow a single agreed-upon order through the tree.]

Dynamic lockstepping

• Dynamic lockstepping allows all points in a warp to “vote” on which traversal order to use

• Maintains memory coalescing

• Some points do more work than in original algorithm

• Tradeoff can still be worth it!

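The vote could, for example, be a warp-wide majority implemented with ballot intrinsics. The following sketch assumes a hypothetical per-thread predicate preferLeftFirst (e.g., "my point is closer to the left child") and a majority rule; the paper does not mandate this exact scheme. KDNode is as in the earlier sketches.

// Child-order vote sketch: all lanes then push the children in the same,
// agreed order, so the lockstep traversal stays coalesced.
__device__ void pushChildrenByVote(const KDNode *node, bool preferLeftFirst,
                                   const KDNode **ropeStack, int &top) {
    const unsigned FULL_MASK = 0xffffffffu;
    unsigned votes = __ballot_sync(FULL_MASK, preferLeftFirst);
    bool leftFirst = (2 * __popc(votes) >= 32);  // majority of the warp's lanes
    if (leftFirst) {
        ropeStack[top++] = node->right;          // loser pushed first,
        ropeStack[top++] = node->left;           // winner popped (visited) next
    } else {
        ropeStack[top++] = node->left;
        ropeStack[top++] = node->right;
    }
}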

Engineering details

• Transform point data from array-of-structures format to structure-of-arrays (sketched below)

• Use the analysis from [PACT 2013] to prove safety and perform the transformation automatically

• Copy tree data to the GPU in a linearized fashion

• Lay out the fields of tree nodes and points according to use (more commonly accessed fields are placed in shared memory)

• Interleave the rope stacks of the points in a warp to allow strided access
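Two of these layouts can be sketched as follows. The names (PointsSoA, ropeSlot, pushRope, popRope) are hypothetical; the interleaving simply places the 32 lanes' stack entries for a given depth next to each other so that lockstep pushes and pops coalesce.

// Structure-of-arrays point storage: a warp reading one field of 32
// consecutive points touches contiguous memory.
struct PointsSoA {
    float *x, *y;                  // coordinates of point i are x[i], y[i]
    int   *correlated;
};

// Index of stack entry `depth` for lane `lane`, given the warp's stack base.
__device__ __forceinline__ int ropeSlot(int warpBase, int depth, int lane) {
    return warpBase + depth * 32 + lane;   // adjacent lanes -> adjacent words
}

__device__ void pushRope(int *stackMem, int warpBase, int &top, int nodeIdx) {
    int lane = threadIdx.x & 31;
    stackMem[ropeSlot(warpBase, top++, lane)] = nodeIdx;  // nodes stored by index
}

__device__ int popRope(const int *stackMem, int warpBase, int &top) {
    int lane = threadIdx.x & 31;
    return stackMem[ropeSlot(warpBase, --top, lane)];
}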

Results

• Two platforms

• GPU platform: NVIDIA Tesla C2070 (6 GB global memory, 14 SMs)

• CPU platform: 32-core, 2.3 GHz Opteron

• Five benchmarks

• Barnes-Hut, Point correlation, Nearest neighbor, k-Nearest neighbor, Vantage point trees

• Multiple inputs per benchmark

• Used both sorted and unsorted points

High-level takeaways

• Autoropes + lockstep is always faster than the simple recursive GPU implementation (up to 14x faster)

• For most benchmarks and inputs, the best GPU implementation is faster than the CPU implementation running on up to 16 threads

• Speedups are comparable to hand-written implementations

Barnes-Hut

[Figure: Barnes-Hut speedup results]

Point correlation

[Figure: point correlation speedup results]

Nearest-neighbor

[Figure: nearest-neighbor speedup results]

Conclusions

• Mapping irregular applications to GPUs is very difficult

• Developed two general techniques, autoropes and lockstepping, that can achieve significant speedup on GPU

• vs. baseline GPU code and CPU implementations

• Automatic approaches competitive with previous hand-written implementations

