general transformations for gpu execution of tree...
TRANSCRIPT
![Page 1: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/1.jpg)
General Transformations for GPU Execution of
Tree TraversalsMichael Goldfarb*, Youngjoon Jo**, Milind Kulkarni
School of Electrical and Computer Engineering
* Now at Qualcomm; ** Now at Google
Thursday, March 27, 14
![Page 2: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/2.jpg)
GPU execution of irregular programs
• GPUs offer promise of massive, energy-efficient parallelism
• Much success in mapping regular applications to GPUs
• Regular memory accesses, predictable computation
• Much less success in mapping irregular applications
• Pointer-based data structures
• Unpredictable, input-dependent computation and memory accesses
2
Thursday, March 27, 14
![Page 3: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/3.jpg)
Tree traversal algorithms
• Many irregular algorithms are built around tree-traversal
• Barnes-Hut
• Nearest-neighbor
• 2-point correlation
• Numerous papers describing how to map tree traversal algorithms to GPUs
3
Thursday, March 27, 14
![Page 4: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/4.jpg)
Point correlation
4
• Data mining algorithm
• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p
• Naïve approach: compare all N points with p
• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p
Thursday, March 27, 14
![Page 5: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/5.jpg)
Point correlation
5
• Data mining algorithm
• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p
• Naïve approach: compare all N points with p
• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p
Thursday, March 27, 14
![Page 6: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/6.jpg)
Point correlation
6
• Data mining algorithm
• Goal: given a set of N points in k dimensions and a point p, find all points within a radius r of p
• Naïve approach: compare all N points with p
• Better approach: build kd-tree over points, traverse tree for point p, prune subtrees that are far from p
Thursday, March 27, 14
![Page 7: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/7.jpg)
Point correlation
7
A
Thursday, March 27, 14
![Page 8: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/8.jpg)
Point correlation
7
A
B G
Thursday, March 27, 14
![Page 9: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/9.jpg)
Point correlation
7
A
B G
C F
Thursday, March 27, 14
![Page 10: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/10.jpg)
Point correlation
7
A
B G
C F
D E
Thursday, March 27, 14
![Page 11: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/11.jpg)
Point correlation
7
A
B G
C F H K
D E
Thursday, March 27, 14
![Page 12: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/12.jpg)
Point correlation
7
A
B G
C F H K
D E I J
Thursday, March 27, 14
![Page 13: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/13.jpg)
Point correlation
8
D E I J
C F H K
B G
A
Thursday, March 27, 14
![Page 14: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/14.jpg)
Point correlation
8
D E I J
C F H K
B G
A
Thursday, March 27, 14
![Page 15: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/15.jpg)
Point correlation
9
D E I J
C F H K
B G
A
Thursday, March 27, 14
![Page 16: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/16.jpg)
Point correlation
10
D E I J
C F H K
B G
A
Thursday, March 27, 14
![Page 17: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/17.jpg)
Point correlation
11
D E I J
C F H K
B G
A
Thursday, March 27, 14
![Page 18: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/18.jpg)
Point correlation
12
D E I J
C F H K
B G
A
Thursday, March 27, 14
![Page 19: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/19.jpg)
Point correlation
13
D E I J
C F H K
B G
A
Thursday, March 27, 14
![Page 20: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/20.jpg)
Point correlation
14
D E I J
C F H K
B G
A
Thursday, March 27, 14
![Page 21: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/21.jpg)
Point correlation
15
KDCell root = /* build kdtree */;Set<Point> ps;double radius;
foreach Point p in ps { recurse(p, root, radius);}...void recurse(Point p, KDCell node, double r) { if (tooFar(p, node, r)) return; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { recurse(p, node.left, r); recurse(p, node.right, r); }}
Thursday, March 27, 14
![Page 22: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/22.jpg)
Basic pattern
16
TreeNode root;Set<Point> ps;
foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}
Thursday, March 27, 14
![Page 23: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/23.jpg)
Basic pattern
16
recursive traversal
TreeNode root;Set<Point> ps;
foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}
Thursday, March 27, 14
![Page 24: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/24.jpg)
Basic pattern
16
recursive traversal
tree structureTreeNode root;Set<Point> ps;
foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}
Thursday, March 27, 14
![Page 25: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/25.jpg)
Basic pattern
16
recursive traversal
repeated traversal
tree structureTreeNode root;Set<Point> ps;
foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}
Thursday, March 27, 14
![Page 26: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/26.jpg)
Basic pattern
16
recursive traversal
repeated traversal
tree structureTreeNode root;Set<Point> ps;
foreach Point p in ps { recurse(p, root, ...);}...recurse(Point p, KDCell node, ...) { if (truncate?(p, node, ...)) { ... } recurse(p, node.child1, ...); recurse(p, node.child2, ...); ...}
Lots of parallelism!
Thursday, March 27, 14
![Page 27: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/27.jpg)
What’s the problem?
• GPUs add high overhead for recursion
• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses
• Status quo: ad hoc solutions
• New algorithm? New GPU techniques!
17
Thursday, March 27, 14
![Page 28: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/28.jpg)
What’s the problem?
• GPUs add high overhead for recursion
• GPUs work best when memory accesses are regular and strided, but irregular algorithms have unpredictable memory accesses
• Status quo: ad hoc solutions
• New algorithm? New GPU techniques!
17
Want generally applicable techniques for
mapping irregular applications to GPUs
Thursday, March 27, 14
![Page 29: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/29.jpg)
Contributions
• Two general techniques for mapping tree-traversals to GPUs
• Autoropes: eliminates recursion overhead
• Lockstepping: promotes memory coalescing
• Compiler pass to automatically apply techniques to recursive tree-traversal code
• Significant GPU speedups on 5 tree-traversal algorithms
18
Thursday, March 27, 14
![Page 30: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/30.jpg)
Naïve GPU implementation
• Warp-based SIMT (single-instruction, multiple-thread) execution
• 32 points put in a single warp
• Warp traverses tree
• All points in warp must execute same instruction
• If points diverge, some points sit idle while other threads execute
19
Thursday, March 27, 14
![Page 31: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/31.jpg)
Naïve GPU implementation
20
A
B G
C F H K
D E I J
Thursday, March 27, 14
![Page 32: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/32.jpg)
Naïve GPU implementation
20
A
B G
C F H K
D E I J
Thursday, March 27, 14
![Page 33: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/33.jpg)
Naïve GPU implementation
20
A
B G
C F H K
D E I J
A
B G
C F
D E
Thursday, March 27, 14
![Page 34: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/34.jpg)
Naïve GPU implementation
20
A
B G
C F H K
D E I J
A
B G
H K
I J
Thursday, March 27, 14
![Page 35: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/35.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
21
B
A
G
Thursday, March 27, 14
![Page 36: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/36.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
22
B
A
G
Thursday, March 27, 14
![Page 37: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/37.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
23
B
A
G
Thursday, March 27, 14
![Page 38: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/38.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
24
B
A
G
Thursday, March 27, 14
![Page 39: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/39.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
25
B
A
G
Thursday, March 27, 14
![Page 40: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/40.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
26
B
A
G
Thursday, March 27, 14
![Page 41: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/41.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
27
B
A
G
Thursday, March 27, 14
![Page 42: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/42.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
28
B
A
G
Thursday, March 27, 14
![Page 43: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/43.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
29
B
A
G
Thursday, March 27, 14
![Page 44: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/44.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
30
B
A
G
Thursday, March 27, 14
![Page 45: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/45.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
31
B
A
G
Thursday, March 27, 14
![Page 46: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/46.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
32
B
A
G
Thursday, March 27, 14
![Page 47: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/47.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
33
B
A
G
Thursday, March 27, 14
![Page 48: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/48.jpg)
A
B G
H K
I J
C F
D E
Naïve GPU implementation
34
B
A
G
Thursday, March 27, 14
![Page 49: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/49.jpg)
Lots of accesses to tree
• Many accesses just moving up the tree in order to later move down again
• Lots of function stack manipulation
• Trees are very large, cannot be stored in GPU’s fast memory
• Want to minimize accesses to tree
35
Thursday, March 27, 14
![Page 50: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/50.jpg)
How to avoid extra accesses to tree?
• Typical technique: ropes
• Pointers in each tree node that let a traversal jump to the next part of the tree
• Effectively linearizes traversal
36
A
B G
C F H K
D E I J
Thursday, March 27, 14
![Page 51: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/51.jpg)
How to avoid extra accesses to tree?
• Typical technique: ropes
• Pointers in each tree node that let a traversal jump to the next part of the tree
• Effectively linearizes traversal
36
A
B G
C F H K
D E I J
Thursday, March 27, 14
![Page 52: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/52.jpg)
How to avoid extra accesses to tree?
• Typical technique: ropes
• Pointers in each tree node that let a traversal jump to the next part of the tree
• Effectively linearizes traversal
36
A
B G
C F H K
D E I J
Thursday, March 27, 14
![Page 53: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/53.jpg)
How to avoid extra accesses to tree?
• Typical technique: ropes
• Pointers in each tree node that let a traversal jump to the next part of the tree
• Effectively linearizes traversal
36
A
B G
C F H K
D E I J
Thursday, March 27, 14
![Page 54: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/54.jpg)
How to avoid extra accesses to tree?
• Installing ropes into a tree requires complex, application-specific preprocessing
• Using ropes correctly during execution requires complex, application-specific logic
37
A
B G
C F H K
D E I J
Thursday, March 27, 14
![Page 55: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/55.jpg)
Autoropes
• General technique for “linearizing” tree traversal for arbitrary traversal algorithms
• Achieve generality, simplicity and space-efficiency at the cost of overhead
• Key insight: recursive tree algorithms are just depth-first traversals of a tree; can transform into iterative algorithm
38
Thursday, March 27, 14
![Page 56: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/56.jpg)
Autoropes
39
A
B G
C F H K
D E I J
A
B G
C F
D EA
Rope stack
Thursday, March 27, 14
![Page 57: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/57.jpg)
Autoropes
40
A
B G
C F H K
D E I J
G
A
B G
C F
D EB
Rope stack
Thursday, March 27, 14
![Page 58: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/58.jpg)
Autoropes
41
A
B G
C F H K
D E I J
F
A
B G
C F
D E
G
C
Rope stack
Thursday, March 27, 14
![Page 59: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/59.jpg)
Autoropes
42
A
B G
C F H K
D E I J
E
A
B G
C F
D E
FG
D
Rope stack
Thursday, March 27, 14
![Page 60: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/60.jpg)
Autoropes
43
A
B G
C F H K
D E I J
F
A
B G
C F
D E
G
E
Rope stack
Thursday, March 27, 14
![Page 61: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/61.jpg)
Autoropes
44
A
B G
C F H K
D E I J
G
A
B G
C F
D EF
Rope stack
Thursday, March 27, 14
![Page 62: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/62.jpg)
Autoropes
45
A
B G
C F H K
D E I J
A
B G
C F
D EG
Rope stack
Thursday, March 27, 14
![Page 63: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/63.jpg)
Autoropes
• Ropes stored on rope stack instead of in tree
• No application-specific code to use ropes
• Ropes instantiated dynamically
• No preprocessing required
• Same access patterns as with manual ropes
• Extra pushes and pops on rope stack add some overhead
46
Thursday, March 27, 14
![Page 64: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/64.jpg)
Autoropes
47
void recurse(Point p, KDCell node, double r) { if (tooFar(p, node, r)) return; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { recurse(p, node.left, r); recurse(p, node.right, r); }}
Simple compiler transformation convertsrecursive code into iterative autoropes code
Thursday, March 27, 14
![Page 65: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/65.jpg)
Autoropes
48
ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r)) continue; if (node.isLeaf() && (dist(node.point, p) < r)) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}
See paper for details of how to transform more complex code
Thursday, March 27, 14
![Page 66: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/66.jpg)
Unintended consequence
49
• Recursive calls naturally lead to thread divergence
• If some threads make recursive calls, other threads wait until calls return
• Does not happen for iterative code
• All threads reconverge at beginning of loop
ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r))
continue; if (...) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}
Thursday, March 27, 14
![Page 67: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/67.jpg)
formerly recursion
Unintended consequence
49
• Recursive calls naturally lead to thread divergence
• If some threads make recursive calls, other threads wait until calls return
• Does not happen for iterative code
• All threads reconverge at beginning of loop
ropeStack.push(root);while (!ropeStack.isEmpty()) { node = ropeStack.pop(); if (tooFar(p, node, r))
continue; if (...) p.correlated++; else { ropeStack.push(node.right); ropeStack.push(node.left); }}
formerly return
Thursday, March 27, 14
![Page 68: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/68.jpg)
A
B G
H K
I J
C F
D E
Autoropes on GPU
50
B
A
G
Thursday, March 27, 14
![Page 69: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/69.jpg)
A
B G
H K
I J
C F
D E
Autoropes on GPU
51
B
A
G
Thursday, March 27, 14
![Page 70: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/70.jpg)
A
B G
H K
I J
C F
D E
Autoropes on GPU
52
B
A
G
Thursday, March 27, 14
![Page 71: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/71.jpg)
A
B G
H K
I J
C F
D E
Autoropes on GPU
53
B
A
G
Thursday, March 27, 14
![Page 72: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/72.jpg)
A
B G
H K
I J
C F
D E
Autoropes on GPU
54
B
A
G
Thursday, March 27, 14
![Page 73: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/73.jpg)
A
B G
H K
I J
C F
D E
Autoropes on GPU
55
B
A
G
Thursday, March 27, 14
![Page 74: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/74.jpg)
A
B G
H K
I J
C F
D E
Autoropes on GPU
56
B
A
G
Thursday, March 27, 14
![Page 75: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/75.jpg)
A
B G
H K
I J
C F
D E
Autoropes on GPU
56
B
A
G
Threads no longer diverge in execution
But do diverge in tree!
Thursday, March 27, 14
![Page 76: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/76.jpg)
Thread divergence vs. memory coalescing• Memory accesses on GPU only well behaved
if accesses by all threads in warp can be coalesced
• Same memory or strided access
• Bad memory behavior of autoropes outweighs lack of thread divergence
• Goal: benefits of autoropes while maintaining memory coalescing
57
Thursday, March 27, 14
![Page 77: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/77.jpg)
Lockstepping• Essentially, force GPU to let threads diverge
• If any thread in a warp wants to visit a node’s children, all threads in a warp visit the child
• Threads that are “dragged along” are programmatically masked out
• Warp execution takes longer (proportional to union of threads’ traversals, rather than longest traversal), but improved memory performance makes up for it
• Automatically implemented during autoropes compiler pass
58
Thursday, March 27, 14
![Page 78: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/78.jpg)
Dynamic lockstepping
• Some algorithms allow different traversal orders
• Some points visit left child before right, and others visit right before left
• Optimization reduces traversal size
• Inherently bad memory access patterns
59
Thursday, March 27, 14
![Page 79: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/79.jpg)
A
B G
H K
I J
C F
D E
Dynamic lockstepping
60
B
A
G
Thursday, March 27, 14
![Page 80: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/80.jpg)
A
B G
H K
I J
C F
D E
Dynamic lockstepping
61
B
A
G
Thursday, March 27, 14
![Page 81: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/81.jpg)
A
B G
H K
I J
C F
D E
Dynamic lockstepping
62
B
A
G
Thursday, March 27, 14
![Page 82: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/82.jpg)
A
B G
H K
I J
C F
D E
Dynamic lockstepping
63
B
A
G
Thursday, March 27, 14
![Page 83: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/83.jpg)
A
B G
H K
I J
C F
D E
Dynamic lockstepping
64
B
A
G
Thursday, March 27, 14
![Page 84: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/84.jpg)
Dynamic lockstepping
• Dynamic lockstepping allows all points in a warp to “vote” on which traversal order to use
• Maintains memory coalescing
• Some points do more work than in original algorithm
• Tradeoff can still be worth it!
65
Thursday, March 27, 14
![Page 85: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/85.jpg)
Engineering details
• Transform point data from array of structures format to structure of arrays
• Use analysis from [PACT 2013] to prove safety and transform automatically
• Copy tree data to GPU in linearized fashion
• Lay out fields of tree and point according to use (more commonly-accessed fields placed in shared memory)
• Interleave rope stacks for points in warp to allow strided access
66
Thursday, March 27, 14
![Page 86: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/86.jpg)
Results• Two platforms
• GPU platform: NVIDIA Tesla C2070 (6GB global memory, 14 SMs)
• CPU platform: 32-core, 2.3 GHz Opteron
• Five benchmarks
• Barnes-Hut, Point correlation, Nearest neighbor, k-Nearest neighbor, Vantage point trees
• Multiple inputs per benchmark
• Used sorted and unsorted points
67
Thursday, March 27, 14
![Page 87: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/87.jpg)
High-level takeaways
• Autoropes+lockstep always faster than simple recursive GPU implementation (up to 14x faster)
• For most benchmarks/inputs, best GPU implementation faster than CPU implementation up to 16 threads
• Speedups comparable to hand-written implementations
68
Thursday, March 27, 14
![Page 88: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/88.jpg)
Barnes-Hut
69
Thursday, March 27, 14
![Page 89: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/89.jpg)
Point correlation
70
Thursday, March 27, 14
![Page 90: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/90.jpg)
Nearest-neighbor
71
Thursday, March 27, 14
![Page 91: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/91.jpg)
Conclusions
• Mapping irregular applications to GPUs is very difficult
• Developed two general techniques, autoropes and lockstepping, that can achieve significant speedup on GPU
• vs. baseline GPU code and CPU implementations
• Automatic approaches competitive with previous hand-written implementations
72
Thursday, March 27, 14
![Page 92: General Transformations for GPU Execution of Tree Traversalson-demand.gputechconf.com/gtc/2014/presentations/S4668... · 2014-04-09 · • Data mining algorithm • Goal: given a](https://reader033.vdocuments.site/reader033/viewer/2022042915/5f52619adc395358a915c45f/html5/thumbnails/92.jpg)
General Transformations for GPU Execution of
Tree TraversalsMichael Goldfarb*, Youngjoon Jo**, Milind Kulkarni
School of Electrical and Computer Engineering
* Now at Qualcomm; ** Now at Google
Thursday, March 27, 14