GPU Join Presentation
DESCRIPTION
This presentation covers algorithms for performing database joins on a GPU: interesting work that may someday lead to databases implemented on GPGPU platforms such as CUDA.
TRANSCRIPT
Relational Joins on Graphics Processors
Suman Karumuri, Jamie Jablin
Background
Introduction
• Utilizing hardware features of the GPU:
– Massive thread parallelism
– Fast inter-processor communication
– High memory bandwidth
– Coalesced memory access
Relational Joins
• Non-indexed nested-loop join (NINLJ)
• Indexed nested-loop join (INLJ)
• Sort-merge join (SMJ)
• Hash join (HJ)
Block non-indexed nested-loop join (NINLJ)
foreach r in R:
    foreach s in S:
        if condition(r, s):
            output += <r, s>
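The pseudocode above can be sketched as a runnable blocked variant. This is a sequential sketch, not the paper's GPU kernel: `condition` stands in for an arbitrary join predicate, and `block_size` models the chunk of S held in local cache (on the GPU, each (R-block, S-block) pair would be handled by a thread group).

```python
def block_ninlj(R, S, condition, block_size=4):
    """Block non-indexed nested-loop join: S is scanned in
    cache-sized blocks so each block is reused across all of R."""
    out = []
    for b in range(0, len(S), block_size):
        s_block = S[b:b + block_size]  # the block held in local cache
        for r in R:
            for s in s_block:
                if condition(r, s):
                    out.append((r, s))
    return out
```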
Block indexed nested-loop join (INLJ)
foreach r in R:
    if S.index.has(r.f1):
        foreach s in S.index.get(r.f1):
            if condition(r, s):
                output += <r, s>
Hash join (HJ)
Hr = Hashtable()
foreach r in R:
    Hr.add(r)
    if Hr.size() == MAX_MEMORY:
        foreach s in S:
            if Hr.has(s):
                output += <Hr.get(s), s>
        Hr.clear()
(The final, partially filled table is probed the same way.)
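A runnable sketch of the block hash join above, over `(key, payload)` tuples. `max_mem` is a hypothetical memory budget measured in distinct keys, standing in for `MAX_MEMORY`:

```python
def block_hash_join(R, S, max_mem=1024):
    """Block hash join: build a hash table on R until it reaches
    the memory budget, scan S for matches, clear, and continue."""
    out = []
    Hr = {}

    def probe():
        for key, s_val in S:
            for r_val in Hr.get(key, []):
                out.append((r_val, s_val))
        Hr.clear()

    for key, r_val in R:
        Hr.setdefault(key, []).append(r_val)
        if len(Hr) >= max_mem:
            probe()
    if Hr:
        probe()  # final, partially filled block
    return out
```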
Sort-merge join (SMJ)
Sort(R); Sort(S)
i = 0; j = 0
while i < |R| and j < |S|:
    if R[i] == S[j]:
        output += <R[i], S[j]>
        i++; j++
    elif R[i] < S[j]:
        i++
    else:
        j++
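The merge loop above translates directly into Python. As in the slide's pseudocode, this sketch assumes keys are unique within each relation (duplicate keys need the usual nested back-up step, omitted here):

```python
def sort_merge_join(R, S):
    """Sort-merge join over two lists of keys with unique keys
    per relation: sort both sides, then advance two cursors."""
    R, S = sorted(R), sorted(S)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] == S[j]:
            out.append((R[i], S[j]))
            i += 1
            j += 1
        elif R[i] < S[j]:
            i += 1
        else:
            j += 1
    return out
```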
Algorithms on GPU
• Tips for algorithm design:
– Use the inherent concurrency.
– Keep the SIMD nature in mind.
– Algorithms should be side-effect free.
– Memory properties:
• High memory bandwidth.
• Coalesced access (for spatial locality).
• Cache in local memory (for temporal locality).
• Access memory via indices and offsets.
GPU Primitives
Design and Implementation
• A complete set of parallel primitives:
– Map, scatter, gather, prefix scan, split, and sort
• Low synchronization overhead.
• Scalable to hundreds of processors.
• Applicable to joins as well as other relational query operators.
Map
• Applies a function to each tuple of a relation.
Scatter
• Indexed writes to a relation.
Gather
• Performs indexed reads from a relation.
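Scatter and gather are inverses of each other: one follows an index array on the write side, the other on the read side. A minimal sequential sketch:

```python
def gather(relation, idx):
    """Gather: indexed reads -- out[i] = relation[idx[i]]."""
    return [relation[i] for i in idx]

def scatter(relation, idx):
    """Scatter: indexed writes -- out[idx[i]] = relation[i].
    Assumes idx is a permutation, so every slot is written once."""
    out = [None] * len(relation)
    for i, pos in enumerate(idx):
        out[pos] = relation[i]
    return out
```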
Prefix Scan
• A prefix scan applies a binary operator to an input of size n and produces an output of size n.
• Ex: prefix sum: the cumulative sum of all elements to the left of the current element.
– Exclusive (used in the paper)
– Inclusive
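A sequential sketch of the exclusive variant used in the paper: position i receives the fold of everything strictly to its left, so the first output is the identity element.

```python
def exclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    """Exclusive prefix scan: out[i] = xs[0] op ... op xs[i-1].
    The inclusive variant would append acc after updating it."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out
```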
Split
• Divides a relation into disjoint partitions according to a partitioning function.
Each thread constructs tHist from Rin
L[(p-1) * #thread + t] = tHist[t][p]
Prefix sum: L[i] = sum(L[0…i-1])
Gives the start location of each partition.
tOffset[t][p] = L[(p-1) * #thread + t]
Scatter tuples to Rout based on offset.
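The four split steps above (per-thread histogram, linearization into L, prefix sum, scatter) can be sketched sequentially by simulating the threads with slices. `part` maps a tuple to its partition id; thread/partition indexing is 0-based here rather than the slides' 1-based `(p-1)`:

```python
def split(Rin, part, n_parts, n_threads):
    """Split primitive: partitions Rin into n_parts contiguous
    groups in Rout, with each simulated thread writing its own
    tuples at precomputed, conflict-free offsets."""
    # 1. Each thread builds a histogram tHist[t][p] over its slice.
    chunk = (len(Rin) + n_threads - 1) // n_threads
    slices = [Rin[t * chunk:(t + 1) * chunk] for t in range(n_threads)]
    tHist = [[0] * n_parts for _ in range(n_threads)]
    for t, sl in enumerate(slices):
        for x in sl:
            tHist[t][part(x)] += 1
    # 2. Linearize partition-major: L[p * n_threads + t] = tHist[t][p].
    L = [tHist[t][p] for p in range(n_parts) for t in range(n_threads)]
    # 3. Exclusive prefix sum over L gives each (thread, partition)
    #    its private write offset.
    offs, acc = [], 0
    for c in L:
        offs.append(acc)
        acc += c
    tOffset = [[offs[p * n_threads + t] for p in range(n_parts)]
               for t in range(n_threads)]
    # 4. Scatter: each thread writes its tuples at its offsets.
    Rout = [None] * len(Rin)
    for t, sl in enumerate(slices):
        cur = list(tOffset[t])
        for x in sl:
            p = part(x)
            Rout[cur[p]] = x
            cur[p] += 1
    return Rout
```

Because every (thread, partition) pair gets a disjoint output range, no two threads ever write the same slot, which is why the real kernel needs no locks.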
Sort
• Bitonic sort
– Uses sorting networks; O(N log² N).
• Quick sort
– Partition using a random pivot until each partition fits in local memory.
– Sort each partition using bitonic sort.
– Partitioning can be parallelized using split.
– Complexity is O(N log N).
– 30% faster than bitonic sort in their experiments.
– Quick sort is therefore used for sorting.
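A sequential sketch of the bitonic sorting network (standard formulation, not the paper's kernel): all compare-exchanges in the inner loop are independent of one another, which is exactly what makes the network GPU-friendly.

```python
def bitonic_sort(a):
    """In-place bitonic sort; len(a) must be a power of two.
    O(N log^2 N) compare-exchanges arranged as a sorting network."""
    n = len(a)
    k = 2
    while k <= n:                 # size of bitonic sequences being merged
        j = k // 2
        while j > 0:              # compare-exchange distance
            for i in range(n):    # every iteration here is independent
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (ascending and a[i] > a[partner]) or \
                       (not ascending and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```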
Spatial and Temporal Locality
Memory Optimizations
• Coalesced memory access improves memory bandwidth utilization (spatial locality).
Local Memory Optimization
• Quick sort
– Temporal locality
– Use bitonic sort to sort each chunk after the partitioning step.
Joins on GPGPU
NINLJ on GPU
• Block nested-loop join.
• Uses the map primitive on both relations:
– Partition R into R' blocks and S into S' blocks.
– Create R' x S' thread groups.
– A thread in a thread group processes one tuple from R' and matches all tuples from S'.
– All tuples in S' are in the local cache.
B+ Tree vs. CSS-Tree
• A B+ tree imposes:
– Memory stalls when traversed (no spatial locality).
– Can't perform multiple searches (loses temporal locality).
• CSS-tree (Cache Sensitive Search tree):
– A one-dimensional array in which nodes are indexed.
– Replaces pointer traversal with address computation.
– Can also perform parallel key lookups.
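The "traversal replaced by computation" idea can be illustrated with a simplified binary variant (an implicit Eytzinger-layout tree rather than a real multi-way CSS-tree): children of node i live at fixed computed positions, so descending the tree is pure index arithmetic with no pointers to chase.

```python
def implicit_tree_search(tree, key):
    """Membership search in an implicit binary search tree laid
    out level by level in one array: children of node i are at
    2*i+1 (left) and 2*i+2 (right). Simplified illustration of
    the CSS-tree idea, not the paper's actual node layout."""
    i = 0
    while i < len(tree):
        if tree[i] == key:
            return True
        # Child index is computed, never loaded from a pointer.
        i = 2 * i + 1 if key < tree[i] else 2 * i + 2
    return False
```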
Indexed Nested-Loop Join (INLJ)
• Uses the map primitive on the outer relation.
• Uses a CSS-tree for the index.
• For each block in the outer relation R:
– Start at the root node to find the next level.
• Binary search is shown to be better than sequential search.
– Go down until you reach a data node.
• Upper-level nodes are cached in local memory since they are frequently accessed.
Sort-Merge Join
• Sort the relations R and S using the sort primitive.
• Merge phase:
– Break S into chunks (s') of size M.
– Find the first and last key values of each chunk in s' and partition R into that many chunks.
– Merge all chunk pairs in parallel using map:
• Each thread group handles one pair.
• Each thread compares one tuple in R with s' using binary search.
• The chunk size is chosen to fit in local memory.
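The merge phase above can be sketched sequentially: for each sorted chunk of S, locate the range of R falling within the chunk's [first, last] key range, then binary-search each of those R keys inside the chunk.

```python
import bisect

def chunked_merge(R, S, M):
    """Merge phase of the GPU sort-merge join, sequentially:
    R and S are sorted lists of keys; S is processed in chunks
    of size M (the 'fits in local memory' unit)."""
    out = []
    for c in range(0, len(S), M):
        chunk = S[c:c + M]
        # Keys of R that can match this chunk's [first, last] range.
        lo = bisect.bisect_left(R, chunk[0])
        hi = bisect.bisect_right(R, chunk[-1])
        for r in R[lo:hi]:
            i = bisect.bisect_left(chunk, r)  # binary search in chunk
            while i < len(chunk) and chunk[i] == r:
                out.append((r, chunk[i]))
                i += 1
    return out
```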
Hash Join
• Uses the split primitive on both relations.
• Develops a parallel version of radix hash join:
– Partitioning:
• Split R and S into the same number of partitions, so that S's partitions fit into local memory.
– Matching:
• Choose the smaller of the R and S partitions as the inner partition, to be loaded into local memory.
• The larger partition is used as the outer relation.
• Each tuple from the outer relation searches the inner partition for matches.
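One radix-partitioning pass can be sketched as follows: tuples are assigned to 2^bits buckets by a slice of their key's bits, so corresponding partitions of R and S are guaranteed to hold the same key range (`pass_shift` selects which bit slice, for multi-pass partitioning).

```python
def radix_partition(rel, bits, pass_shift=0):
    """One pass of radix partitioning over (key, payload) tuples:
    bucket index is a bits-wide slice of the key."""
    n_parts = 1 << bits
    parts = [[] for _ in range(n_parts)]
    for key, payload in rel:
        parts[(key >> pass_shift) & (n_parts - 1)].append((key, payload))
    return parts
```

Matching then runs per partition pair: build a table on the smaller side, probe with the larger, as in the block hash join sketched earlier.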
Lock-Free Scheme for Result Output
• Problems:
– The join result size is unknown in advance, and the maximum possible result size does not fit in memory.
– Concurrent writes are not atomic.
Lock-Free Scheme for Result Output
• Solution: a three-phase scheme
– Each thread counts the number of join results it produces.
– Compute a prefix sum on the counts to get an array of write locations and the total number of results generated by the join.
– The host code allocates memory on the device.
– Run the join again, this time writing the outputs.
• The join runs twice; that's acceptable, since GPUs are fast.
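The three phases can be sketched sequentially, with one simulated thread per tuple of R; because the prefix sum gives every thread a disjoint output range, the second pass needs no locks or atomics.

```python
def lock_free_join_output(R, S, match):
    """Three-phase lock-free result output for a nested-loop join:
    (1) count matches per thread, (2) exclusive prefix sum of the
    counts yields write offsets and the total result size,
    (3) re-run the join, writing at the precomputed offsets."""
    # Phase 1: each "thread" (one per tuple of R) counts its matches.
    counts = [sum(1 for s in S if match(r, s)) for r in R]
    # Phase 2: exclusive prefix sum -> write offsets + total size.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    # Phase 3: allocate exactly `total` slots and run the join again.
    out = [None] * total
    for i, r in enumerate(R):
        pos = offsets[i]
        for s in S:
            if match(r, s):
                out[pos] = (r, s)
                pos += 1
    return out
```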
Experimental Results
Hardware Configuration
• Theoretical memory bandwidth:
– GPU: 86.4 GB/s
– CPU: 10.4 GB/s
• Practical memory bandwidth:
– GPU: 69.2 GB/s
– CPU: 5.6 GB/s
Workload
• R and S tables with two integer columns.
• SELECT R.rid, S.rid FROM R, S WHERE <predicate>
• SELECT R.rid, S.rid FROM R, S WHERE R.rid = S.rid
• SELECT R.rid, S.rid FROM R, S WHERE R.rid <= S.rid <= R.rid + k
• Tested on all combinations:
– Fix R, vary S; all values uniformly distributed; |R| = 1M.
– Performance impact of varying join selectivity; |R| = |S| = 16M.
– Non-uniform distribution of data sizes, also with varying join selectivity; |R| = |S| = 16M.
• Also tested with string columns.
Implementation Details on CPU
• Highly optimized primitives and join algorithms matched to the hardware architecture.
• Tuned for cache performance.
• Programs compiled with MSVC 8.0 with full optimizations.
• OpenMP used for threading.
• 2-6x faster than their sequential counterparts.
Implementation Details on GPU
• CUDA parameters:
– Number of thread groups: 128.
– Number of threads per thread group: 64.
– Block size: 4 MB (main memory to device memory).
Memory Optimizations Work
Works when join selectivity is varied
Better than in-memory database
CUDA vs. DirectX10
• DirectX10 is difficult to program, because the data is stored as textures.
• NINLJ and INLJ have similar performance under both.
• HJ and SMJ are slower under DirectX10 because of texture decoding.
• Summary: low-level primitives on GPGPU are better than graphics primitives on the GPU.
Criticisms
• Applications of skew handling are unclear.
• The primitives are sufficient to implement the given joins, but the authors do not prove that the set of primitives is minimal.
Limitations and Future Research Directions
• Lack of synchronization mechanisms for handling read/write conflicts on the GPU.
• More primitives.
• A more open GPGPU hardware spec, to enable optimizations.
• Power consumption on the GPU.
• Lack of support for complex data types.
• An on-GPU in-memory database.
• Automatic detection of the number of thread groups and the number of threads using program-analysis techniques.
Conclusion
• The GPU-based primitives and join algorithms achieve speedups of 2-27x over their optimized CPU-based counterparts.
• NINLJ: 7.0x; INLJ: 6.1x; SMJ: 2.4x; HJ: 1.9x.
References
• Scan Primitives for GPU Computing, Sengupta et al.
• wikipedia.org
• monetdb.cwi.nl
Thank You.
Scan Primitives for GPU Computing, Sengupta et al
Skew Handling
• Skew in the data results in imbalanced partition sizes in partition-based algorithms (SMJ and HJ).
• Solution:
– Identify partitions that do not fit into local memory.
– Decompose those partitions into multiple chunks the size of local memory.
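The decomposition step above can be sketched in a few lines: any partition larger than local memory is cut into local-memory-sized chunks, so every resulting work unit fits.

```python
def decompose_skewed(partitions, local_mem_size):
    """Skew handling sketch: oversized partitions are decomposed
    into chunks of at most `local_mem_size` tuples."""
    chunks = []
    for p in partitions:
        if len(p) <= local_mem_size:
            chunks.append(p)
        else:
            for i in range(0, len(p), local_mem_size):
                chunks.append(p[i:i + local_mem_size])
    return chunks
```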
Implementation Details on GPU
• CUDA parameters:
– Number of threads per thread group.
– Number of thread groups.
• DirectX10:
– Join algorithms implemented using the programmable pipeline:
• Vertex shader, geometry shader, and pixel shader.