
RAPIDS CUDA DataFrame Internals for C++ Developers - S91043

Jake Hemstad - NVIDIA - Developer Technology Engineer
GTC 2019 | 03/20/19

2

What is RAPIDS cuDF?

[Benchmark chart: cuDF file read and data preparation vs. Apache Spark. Time in seconds, shorter is better. 200 GB CSV dataset; data preparation includes joins and variable transformations. GPU: 5x DGX-1 on an InfiniBand network. CPU nodes: 61 GiB memory, 8 vCPUs, 64-bit platform, Apache Spark.]

GPU-accelerated DataFrames

Data science operations: filter, sort, join, groupby,…

High-level, Python productivity (Pandas-like)

Bare-metal, CUDA/C++ performance

Open-Source CUDA DataFrame

github.com/rapidsai/cudf | rapids.ai

3

Who This Talk is For

You want to learn about:

● libcudf: cuDF’s underlying C++14 library
● How to use libcudf in your applications
● CUDA-accelerated data science algorithms
● How to contribute to libcudf

[Stack diagram: cuDF (Pandas-like Python API) sits on Cython bindings over libcudf; libcudf builds on Thrust, CUB, Jitify, and the RAPIDS Memory Manager (RMM), all on top of CUDA.]

4

What is a DataFrame?

Think spreadsheet: equal-length columns of different types.

How to store in memory?

● cuDF uses Apache Arrow [1]
● Contiguous, column-major data representation

Mortgage ID | Pay Date   | Amount ($)
101         | 12/18/2018 | 1029.30
102         | 12/21/2018 | 1429.31
103         | 12/14/2018 | 1289.27
101         | 01/15/2018 | 1104.59
102         | 01/17/2018 | 1457.15
103         | NULL       | NULL

[1] https://arrow.apache.org/docs/memory_layout.html

5

Apache Arrow Memory Format: Enabling Interoperability

[Diagram: the Arrow memory format is shared across cuDF, cuML, cuGraph, cuDNN.]

6

Column Representation

struct column {
  void* data;      // contiguous buffer
  int size;        // number of elements
  DType type;      // type indicator
  uint32_t* mask;  // null bitmask
};

enum DType {
  INT,    // int value
  FLOAT,  // float value
  DATE,   // int64_t ms since epoch
  ...
};

All operations are defined in terms of operations on columns.

Type-erased data (void*) allows interoperability with other languages and type systems.

Arrow enables efficient, shallow-copy data sharing across frameworks and languages.

libcudf is column-centric

7

Null Bitmask: How To Represent Missing Data

Any element can be NULL -> undefined

Different from NaN -> defined invalid

NULL values are represented in a bitmask: 1 bit per element

● 0 == NULL
● 1 == not NULL

Least-significant-bit ordering

values  = [0, 1, null, NaN, null, 3]
bitmask = [0 0 1 0 1 0 1 1] = 0x2B
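To make the LSB ordering concrete, here is a minimal validity-check sketch (is_valid is an illustrative helper, not part of libcudf's API):

#include <cstdint>

// Returns true if element i is valid (i.e., not NULL) under
// least-significant-bit ordering: bit i lives in word i/32 at position i%32.
inline bool is_valid(uint32_t const* bitmask, int i) {
  return bitmask[i / 32] & (uint32_t{1} << (i % 32));
}

// For bitmask[0] == 0x2B: is_valid(bitmask, 2) == false (NULL), while
// is_valid(bitmask, 3) == true (NaN is present, just an invalid value).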

8

Column Example: Apache Arrow Memory Layout

Mortgage ID | Pay Date   | Amount
101         | 12/18/2018 | 1029.30
102         | 12/21/2018 | 1429.31
103         | 12/14/2018 | 1289.27
101         | 01/15/2018 | 1104.59
102         | 01/17/2018 | 1457.15
103         | NULL       | NULL

Mortgage ID column:
data    = [101, 102, 103, 101, 102, 103]
size    = 6
type    = INT
bitmask = [0x3F] = [0 0 1 1 1 1 1 1]

Pay Date column:
data    = [1545091200000, 1545350400000, 1544745600000, 1514764800000, 1516147200000, *garbage*]
size    = 6
type    = DATE
bitmask = [0x1F] = [0 0 0 1 1 1 1 1]

Amount column:
data    = [1029.30, 1429.31, 1289.27, 1104.59, 1457.15, *garbage*]
size    = 6
type    = FLOAT
bitmask = [0x1F] = [0 0 0 1 1 1 1 1]

Note: bitmasks are shown in LSB order.

9

libcudf Operations

All functions act on one or more columns:

void some_function(cudf::column const* input,
                   cudf::column* output,
                   args...) {
  // Do something with input
  // Produce output
}

Operations include:

● Sort
● Join
● Groupby
● Filtering
● Transpose
● etc.

Input columns are generally immutable.

10

Example Operation: Sort

void sort(cudf::column* in) {
  switch (in->type) {
    case INT:   typed_sort<int32_t>(in); break;
    case FLOAT: typed_sort<float>(in);   break;
    case DATE:  typed_sort<int64_t>(in); break;
    ...
  }
}

template <typename T>
void typed_sort(cudf::column* in) {
  T* typed_data{static_cast<T*>(in->data)};
  thrust::sort(thrust::device, typed_data, typed_data + in->size);
}

in->data is type-erased:

1. Deduce T from the enum in->type
2. Call a function template with T
3. Cast in->data to T*
4. Perform thrust::sort with typed_data

This is a common pattern in libcudf.

Problem: duplicated switches are difficult to maintain and error-prone.

11

Type Dispatching: libcudf’s Solution

Centralize and abstract the switch: type_dispatcher

1. Maps the type enum to T
2. Invokes the functor F<T>()

template <typename Functor>
auto type_dispatcher(DType type, Functor F) {
  switch (type) {
    case INT:   return F<int32_t>();
    case FLOAT: return F<float>();
    case DATE:  return F<int64_t>();
    ...
  }
}

Note: The syntax F<T>() is abbreviated for clarity. The correct syntax is F.template operator()<T>().

libcudf’s type_dispatcher also supports functors with arguments.
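For illustration, a minimal compilable version of the pattern might look like this (a sketch; size_of_functor and this exact signature are assumptions, not libcudf's actual implementation):

#include <cstddef>
#include <cstdint>

enum DType { INT, FLOAT, DATE };

// Maps the runtime type enum to a static type T, then invokes f<T>().
template <typename Functor>
auto type_dispatcher(DType type, Functor f) {
  switch (type) {
    case INT:   return f.template operator()<int32_t>();
    case FLOAT: return f.template operator()<float>();
    default:    return f.template operator()<int64_t>();  // DATE
  }
}

// Example functor: returns the size in bytes of the dispatched type.
struct size_of_functor {
  template <typename T>
  std::size_t operator()() const { return sizeof(T); }
};

// Usage: type_dispatcher(FLOAT, size_of_functor{}) == 4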

12

Type Dispatching: Sort Revisited

Define a functor with an operator() template.

type_dispatcher maps type to T and invokes the functor:

● sort_functor casts _col.data with T
● Perform thrust::sort on typed_data

// sort.cu
#include <type_dispatcher.cuh>

struct sort_functor {
  cudf::column _col;
  sort_functor(cudf::column col) : _col{col} {}

  template <typename T>
  void operator()() {
    T* typed_data = static_cast<T*>(_col.data);
    thrust::sort(typed_data, typed_data + _col.size);
  }
};

void sort(cudf::column* col) {
  type_dispatcher(col->type, sort_functor{*col});
}

13

Type Dispatching: Combinatorial Type Explosion

Binary operations between two columns are common (e.g., sum, minus, div, etc.):

out = lhs op rhs

The out, lhs, and rhs types are all independent: 11+ types, 14+ ops.

void binary_op(cudf::column* out,
               cudf::column* lhs,
               cudf::column* rhs,
               Op op) {
  // out, lhs, rhs types are all independent
  // Need to instantiate code for all combinations
  // Repeat for every `op`
}

Problem:

● 11³ x 14 ≈ 18,600 instantiations
● 1+ hour to compile just the binary operations
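To see where the cubic term comes from, here is a hedged sketch of nested ("double") dispatch over lhs and rhs built on the type_dispatcher above; the names lhs_dispatch and rhs_dispatch are illustrative, and the independent output type (which raises the exponent to 3) is omitted:

// Inner functor: instantiated once per (LhsT, RhsT) pair.
template <typename LhsT>
struct rhs_dispatch {
  template <typename RhsT>
  void operator()() {
    // op body for the (LhsT, RhsT) combination lives here
  }
};

// Outer functor: dispatches lhs, then dispatches rhs inside it.
struct lhs_dispatch {
  DType rhs_type;
  template <typename LhsT>
  void operator()() {
    type_dispatcher(rhs_type, rhs_dispatch<LhsT>{});
  }
};

// Usage: type_dispatcher(lhs->type, lhs_dispatch{rhs->type});
// With N dispatched types this instantiates N x N bodies per op;
// dispatching the output type as well makes it N³ per op.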

14

Solution: JIT Compilation with Jitify
Simplify CUDA Run-time Compilation

const char* program_source =
  "my_program\n"
  "template<int N, typename T>\n"
  "__global__\n"
  "void my_kernel(T* data) {\n"
  "    T data0 = data[0];\n"
  "    for( int i=0; i<N-1; ++i ) {\n"
  "        data[0] *= data0;\n"
  "    }\n"
  "}\n";

static jitify::JitCache kernel_cache;
jitify::Program program = kernel_cache.program(program_source);

dim3 grid(1);
dim3 block(1);

using jitify::reflection::type_of;

program.kernel("my_kernel")
    .instantiate(3, type_of(*data))  // Instantiates the template
    .configure(grid, block)
    .launch(data);

Compiles a specialized kernel string at run time; the compiled kernel is cached for reuse.

libcudf uses Jitify for binary operations:

● ~300 ms overhead to compile a new kernel
● ~150 ms to reuse a kernel with new types
● Trivial overhead to reuse from the cache

https://github.com/NVIDIA/jitify

15

Recap: libcudf so far...

● Apache Arrow memory layout
● Column-centric operations
● Type-erased data
● type_dispatcher to reconstruct the type
● Runtime compilation with Jitify

Many operations require temporary memory allocations, and most cuDF ops are not performed in place: many column allocations/deallocations.

struct sort_functor {
  cudf::column _col;
  sort_functor(cudf::column col) : _col{col} {}

  template <typename T>
  void operator()() {
    T* typed_data = static_cast<T*>(_col.data);

    // Allocates temporary memory!
    thrust::sort(thrust::device, typed_data, typed_data + _col.size);
  }
};

void sort(cudf::column* col) {
  type_dispatcher(col->type, sort_functor{*col});
}

16

Memory Management

17

Memory Management Overhead: Example cuDF Mortgage Workflow

Data cleanup and feature engineering:

1. Read CSV files into DataFrames
2. Joins, groupbys, unary/binary ops
3. Create DMatrix for XGBoost

cuDF ops are not in-place => frequent malloc/free

88% of cuDF time is spent in CUDA memory management!

18

CUDA Memory Allocation: Why are cudaMalloc / cudaFree expensive?

cudaMalloc(&buffer, size_in_bytes);
cudaFree(buffer);

● Synchronous: blocks the device
● cudaFree scrubs memory for security
● Peer access (GPU-to-GPU direct memory access): cudaMalloc creates peer mappings, which scales O(#GPUs²)

19

RMM Memory Pool Allocation

● Uses one large cudaMalloc allocation as a memory pool
● Custom memory management within the pool
● Streams enable asynchronous malloc/free

RMM currently uses CNMeM as its sub-allocator: https://github.com/NVIDIA/cnmem

RMM is standalone and free to use in your own projects! https://github.com/rapidsai/rmm

[Diagram: GPU memory holding a cudaMalloc'd memory pool; bufferA and bufferB are sub-allocated next to previously allocated blocks.]

20

RAPIDS Memory Manager (RMM): Drop-in Allocation Replacement

C/C++ (asynchronous):

RMM_ALLOC(&buffer, size_in_bytes, stream_id);
RMM_FREE(buffer, stream_id);

Python:

dev_ones = rmm.device_array(np.ones(count))
dev_twos = rmm.device_array_like(dev_ones)
# also rmm.to_device(), rmm.auto_device(), etc.

Thrust:

rmm::device_vector<int> dvec(size);
thrust::sort(rmm::exec_policy(stream)->on(stream), …);
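Putting the Thrust pieces together, usage might look like the following sketch (assumes an initialized RMM pool and the legacy RMM API shown above; the header name is an assumption):

// Sketch: pool-backed storage and temporaries for a Thrust sort.
#include <rmm/thrust_rmm_allocator.h>  // assumed legacy RMM header
#include <thrust/sort.h>

void sort_on_stream(cudaStream_t stream, int size) {
  rmm::device_vector<int> dvec(size);  // allocated from the RMM pool
  // ... fill dvec ...
  // Thrust's internal temporary allocations also come from the pool:
  thrust::sort(rmm::exec_policy(stream)->on(stream),
               dvec.begin(), dvec.end());
}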

21

RMM Raw Performance: 1000x faster than cudaMalloc/cudaFree (microbenchmark)

22

RMM: 10x Performance on RAPIDS

Mortgage workflow on 16x V100 GPUs of a DGX-2:

                                | Time in malloc/free | Total ETL time | % time
cudaMalloc / cudaFree (no pool) | 486 s               | 550 s          | 88.3%
rmmAlloc / rmmFree (pool)       | 0.088 s             | 55 s           | 0.16%

● 10x reduction in total ETL time
● cudaMalloc/cudaFree overhead gets worse with more GPUs
● RMM is valuable even on single-GPU runs, where the malloc/free fraction is “only” 14-15%
● RMM’s benefit is a combination of low-overhead suballocation and reduced synchronization

23

Deep Dive

24

CUDA-Accelerated GroupBy: Deep Dive

A common data science operation: group unique keys and aggregate the associated values -> reduce by key.

Answers questions like:

“What’s the avg payment for each mortgage?”
“Which mortgages are delinquent?”
“Which mortgages are paid off early?”

Mortgage ID | Pay Date   | Amount  | Avg
101         | 12/18/2018 | 1029.30 | 1066.95
101         | 01/15/2018 | 1104.59 |
102         | 12/21/2018 | 1429.31 | 1443.23
102         | 01/17/2018 | 1457.15 |
103         | 12/14/2018 | 1289.27 | 1289.27
103         | NULL       | NULL    |

25

Hash-Based GroupBy

Input rows (Mortgage ID, Amount):

101  1029.30
102  1429.31
103  1289.27
101  1104.59
102  1457.15
103  NULL

Build an 8-slot hash table mapping Key -> {count, sum} (E = empty), inserting one input row at a time with linear probing:

1. (101, 1029.30): hash(101) == 4. Slot 4 is empty: insert 101 -> {1, 1029.30}.
2. (102, 1429.31): hash(102) == 1. Slot 1 is empty: insert 102 -> {1, 1429.31}.
3. (103, 1289.27): hash(103) == 4. Slot 4 holds 101, and 103 != 101: collision! Probe to slot 5, which is empty: insert 103 -> {1, 1289.27}.
4. (101, 1104.59): hash(101) == 4. Slot 4 holds 101, and 101 == 101: update to {2, 2133.89}.
5. (102, 1457.15): hash(102) == 1. Slot 1 holds 102, and 102 == 102: update to {2, 2886.46}.
6. (103, NULL): hash(103) == 4. Slot 4 holds 101: collision. Probe to slot 5, which holds 103, but the NULL value is ignored: no update.

Final hash table state:

Idx | Key | {count, sum}
0   | E   | E
1   | 102 | {2, 2886.46}
2   | E   | E
3   | E   | E
4   | 101 | {2, 2133.89}
5   | 103 | {1, 1289.27}
6   | E   | E
7   | E   | E

Finally, extract the non-empty entries and perform (sum/count):

Mortgage ID | Avg Amount
102         | 1443.23
101         | 1066.95
103         | 1289.27
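The extraction is a compaction over the table; here is a minimal serial sketch for clarity (on the GPU this would be a parallel compaction, e.g. with thrust::copy_if):

// Emit (key, sum/count) for every non-empty bucket.
int n = 0;
for (int i = 0; i < hash_table_size; ++i) {
  if (hash_table[i].key != EMPTY) {
    out_keys[n] = hash_table[i].key;
    out_avgs[n] = hash_table[i].sum / hash_table[i].count;
    ++n;
  }
}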

39

concurrent_unordered_map: Enabling Hash-Based GroupBy

template <typename KeyT, typename PayloadT>
__device__ void insert(KeyT const& new_key, PayloadT new_value) {
  uint32_t hash_value = hash_function(new_key);
  int index = hash_value % hash_table_size;
  bool insert_success = false;
  while (not insert_success) {
    // Attempt to claim the hash bucket
    KeyT old_key = atomicCAS(&hash_table[index].key, EMPTY, new_key);

    // If the bucket was empty, or already contains `new_key`,
    // then update the associated payload
    if ((EMPTY == old_key) or (new_key == old_key)) {
      atomicAdd(&hash_table[index].count, 1);        // count++
      atomicAdd(&hash_table[index].sum, new_value);  // sum += new_value
      insert_success = true;
    } else {
      // Bucket holds a different key: advance to the next bucket (linear probing)
      index = (index + 1) % hash_table_size;
    }
  }
}

Note: Code is simplified for clarity. Actual insert code accepts any generic binary operation(s) to be performed between the new and old payload. Likewise, handling of null values is omitted.

40

Wrapping Up

41

libcudf C++

libcudf is not built for cuDF alone:

● Single-GPU primitives to enable building multi-GPU algorithms
● The libcudf C++ API is designed for reuse
● Modular, reusable components: concurrent_unordered_map, the memory manager (a reusable sub-allocator), and algorithms such as join and groupby

How to Use libcudf in Your Applications

[Benchmark chart: analyzing the VAST NetFlow 5 GB dataset. BlazingSQL: 1x NVIDIA Tesla T4 16 GB. Spark & Pandas: 4x 8-vCPU 32 GB nodes.]

42

Future Directions: What We Are Working On

● Overhaul of the legacy C interface to modern C++
● Feature completeness: push functionality from Python into C++

Coming soon:

● Improved string support, rolling window functions, statistical operations
● Generic variable-length datatypes
● Future language support: Spark Java bindings

43

Contribute to libcudf: Help Us Improve

● libcudf is open source: Apache 2 license
● Many interesting CUDA/C++ engineering and algorithmic problems to solve
● Try it out! File an issue or submit a PR! https://github.com/rapidsai/cudf

[Slide shows GitHub contributor avatars.]

44

Learn More at GTC: CUDA-Accelerated Data Analytics

● CE9113 - Connect with the Experts: Data Analytics on GPU: Algorithms and Implementations. Talk with me and others about libcudf and accelerating data analytics on GPUs. Tomorrow, 11:00 AM - 12:00 PM, SJCC Hall 3 Pod D.

● S9557 - Effective, Scalable Multi-GPU Joins. Nikolay Sakharnykh, Jiri Kraus, Tim Kaldwey. Learn about accelerating join on multiple GPUs. Today, 4:00 PM, SJCC Room 212A (Concourse Level).

● S9798 - BlazingSQL on RAPIDS: SQL for Apache Arrow in GPU Memory. William Malpica, Rodrigo Aramburu, Felipe Aramburu. Learn how BlazingDB uses libcudf to accelerate SQL queries. Today, 3:00 PM - 3:50 PM, SJCC Room 212A.

● S9726 - Unified Memory for Data Analytics and Deep Learning. Nikolay Sakharnykh, Chirayu Garg. Learn how Unified Memory can help for data analytics. Tomorrow, 3:00 PM - 3:50 PM, SJCC Room 211A.

● S9793 - cuDF: RAPIDS GPU-Accelerated Data Frame Library (Python API). Keith Kraus. (GTC on-demand)

46

RMM Pool Allocation Example

RMM_ALLOC(&bufferA, sizeA, streamA);
RMM_ALLOC(&bufferB, sizeB, streamB);
...
kernel<<<blocks, threads, 0, streamA>>>(bufferA, ...);
cudaMemcpyAsync(bufferB, hostBuf, sizeB, cudaMemcpyHostToDevice, streamB);
...
RMM_FREE(bufferA, streamA);
...
RMM_FREE(bufferB, streamB);

[Diagram sequence: a pre-allocated RMM memory pool in GPU memory. The two RMM_ALLOC calls sub-allocate bufferA and bufferB next to previously allocated blocks. The kernel on streamA and the async copy on streamB can potentially overlap. RMM_FREE(bufferA, streamA) returns bufferA to the pool while streamB's work proceeds; finally bufferB is freed as well.]
