


Out-of-core Sorting Acceleration using GPU and Flash NVM

Hitoshi Sato†‡, Ryo Mizote†‡, Satoshi Matsuoka†‡ († Tokyo Institute of Technology, ‡ CREST, JST)

Introduction

Motivation:
- How to overcome memory capacity limitations?
- How to offload bandwidth-oblivious operations onto low-throughput devices?

Proposal: xtr2sort (Extreme External Sort)

Experiment  

Summary

1. Unsorted records are located on Flash NVM.
2. Divide the input records into c chunks to fit the GPU memory capacity.
3. Then, sort the chunks on the GPU in the pipeline w/ data transfers.
4. Partition each of the chunks into c buckets using c-1 randomly sampled splitters (see the partitioning sketch after the diagram below).
5. Then, swap the buckets between chunks.
6. Sort each of the chunks on the GPU in the pipeline w/ data transfers.
7. Sorted records are placed on Flash NVM.

[Figure: Algorithm overview. Unsorted records on NVM are read as c chunks and sorted in-core on the GPU; each chunk is partitioned into c buckets using the c-1 splitters; the CPU swaps buckets between chunks; each chunk is sorted in-core on the GPU again; the sorted records are written back to NVM.]
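The partitioning in steps 4-5 relies on each chunk already being sorted, so the c-1 splitters locate the c bucket boundaries with a vectorized binary search. Below is a minimal sketch of that idea using Thrust, not the authors' implementation; the chunk size, splitter values, and bucket count are placeholder assumptions.

```cuda
// partition_sketch.cu -- hypothetical illustration of steps 4-5:
// find bucket boundaries of a sorted chunk given c-1 sorted splitters.
#include <thrust/device_vector.h>
#include <thrust/binary_search.h>
#include <thrust/sequence.h>
#include <cstdint>
#include <cstdio>

int main() {
  const int c = 4;                       // number of chunks / buckets (assumed small for the demo)
  const size_t n = 1 << 20;              // records per chunk (assumed)

  // A chunk of int64_t records, already sorted in-core on the GPU (step 3).
  thrust::device_vector<int64_t> chunk(n);
  thrust::sequence(chunk.begin(), chunk.end());

  // c-1 splitters, randomly sampled across chunks and sorted (fixed values here for brevity).
  thrust::device_vector<int64_t> splitters(c - 1);
  for (int i = 0; i < c - 1; ++i) splitters[i] = (int64_t)((i + 1) * (n / c));

  // Vectorized binary search: boundaries[j] is the index inside this sorted chunk
  // where bucket j ends (equivalently, where bucket j+1 begins).
  thrust::device_vector<size_t> boundaries(c - 1);
  thrust::lower_bound(chunk.begin(), chunk.end(),
                      splitters.begin(), splitters.end(),
                      boundaries.begin());

  // The c buckets of this chunk are [0, b0), [b0, b1), ..., [b_{c-2}, n).
  for (int j = 0; j < c - 1; ++j)
    std::printf("bucket boundary %d: %zu\n", j, (size_t)boundaries[j]);
  return 0;
}
```

In the full algorithm, bucket j of every chunk would then be exchanged so that chunk j holds only records destined for bucket j before the final in-core GPU sort (steps 5-6).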

- Sample-Sort-Based Out-of-core Sorting Approach [1][2] for Deep Memory Hierarchy Systems w/ GPU and Flash NVM
- I/O Chunking to fit the GPU Memory Capacity in order to exploit the Massive Parallelism and Memory Bandwidth of the GPU
  - Employ asynchronous data transfers between CPU and GPU using CUDA streams and cudaMemcpyAsync()
  - Page-locked memory (a.k.a. pinned memory) buffers are required (see the sketch below)
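As a rough illustration of the in-core stages (H2D, EX, D2H), the following sketch combines a pinned host buffer, cudaMemcpyAsync() on a CUDA stream, and a Thrust sort issued on that same stream (Thrust >= 1.8). The chunk size and record type are assumptions, and error checking is omitted.

```cuda
// incore_gpu_sort_sketch.cu -- hedged sketch of the H2D / EX / D2H stages:
// pinned host buffer + cudaMemcpyAsync on a CUDA stream + Thrust sort on that stream.
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdlib>

int main() {
  const size_t n = 1 << 24;                     // records per chunk (assumed)
  const size_t bytes = n * sizeof(int64_t);

  // Page-locked (pinned) host buffer: required for truly asynchronous copies.
  int64_t* h_buf = nullptr;
  cudaMallocHost((void**)&h_buf, bytes);
  for (size_t i = 0; i < n; ++i) h_buf[i] = std::rand();

  int64_t* d_buf = nullptr;
  cudaMalloc((void**)&d_buf, bytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // H2D: asynchronous copy, returns immediately so other pipeline stages can proceed.
  cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

  // EX: in-core GPU sort issued on the same stream.
  thrust::device_ptr<int64_t> d_ptr(d_buf);
  thrust::sort(thrust::cuda::par.on(stream), d_ptr, d_ptr + n);

  // D2H: copy the sorted chunk back into the pinned buffer.
  cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

  cudaStreamSynchronize(stream);   // h_buf now holds the sorted chunk for the WR stage

  cudaStreamDestroy(stream);
  cudaFree(d_buf);
  cudaFreeHost(h_buf);
  return 0;
}
```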

- Pipeline-based Latency Hiding to overlap File I/O between Flash NVM and CPU using Linux Asynchronous I/O System Calls
  - Pros: fully overlapped READ/WRITE file I/O (see the sketch below)
  - Cons: direct I/O required, e.g., the O_DIRECT flag with aligned file offset, memory buffer, and transfer size
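A minimal sketch of the RD stage under these constraints, using Linux AIO (libaio: io_setup/io_prep_pread/io_submit/io_getevents) with O_DIRECT and a 4 KiB-aligned buffer. The file name, block size, and request size are assumptions; link with -laio.

```cuda
// aio_read_sketch.cu -- hedged sketch of an overlapped READ from Flash NVM:
// Linux AIO (libaio) with O_DIRECT, which requires aligned offset, buffer, and size.
// compile: nvcc aio_read_sketch.cu -laio
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for O_DIRECT
#endif
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t align = 4096;                 // assumed logical block size
  const size_t bytes = 1 << 20;              // one I/O request (multiple of align)

  // O_DIRECT bypasses the page cache; offset, buffer, and size must all be aligned.
  int fd = open("records.bin", O_RDONLY | O_DIRECT);   // hypothetical input file
  if (fd < 0) { perror("open"); return 1; }

  void* buf = nullptr;
  if (posix_memalign(&buf, align, bytes) != 0) return 1;

  io_context_t ctx = 0;
  if (io_setup(8, &ctx) != 0) { perror("io_setup"); return 1; }

  struct iocb cb;
  struct iocb* cbs[1] = { &cb };
  io_prep_pread(&cb, fd, buf, bytes, /*offset=*/0);

  // Submit the read and continue with other pipeline stages (H2D, EX, ...) here.
  if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

  // Later, reap the completion before handing the buffer to the next stage.
  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, NULL);
  std::printf("read %ld bytes\n", (long)ev.res);

  io_destroy(ctx);
  free(buf);
  close(fd);
  return 0;
}
```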

- Sorting is a Key Building Block for Big Data Applications
  - e.g., database management systems, programming frameworks, supercomputing applications, etc.
  - Large memory capacity requirement

- Towards Future Computing Architectures
  - Available memory capacity per core is dropping in order to achieve efficient bandwidth, driven by increasing parallelism, heterogeneity, and density of processors
    - e.g., multi-core CPUs, many-core accelerators
    - Post-Moore era
  - Deepening memory/storage architectures
    - Device memory on many-core accelerators
    - Host memory on compute nodes
    - Semi-external memory connected w/ compute nodes, such as non-volatile memory (NVM) and storage class memory (SCM)

[Figure: Pipeline timing diagrams over the c chunks. Top: the 7-stage pipeline (RD, R2H, H2D, EX, D2H, H2W, WR) overlapped across chunks i, i+1, ..., i+6. Bottom: the 5-stage pipeline (RD, H2D, EX, D2H, WR) overlapped across chunks i, i+1, ..., i+4.]

5-Stage Pipeline Approach (RD, H2D, EX, D2H, WR):
- Regular chunk size for aligned file offset, memory buffer, and transfer size
- 3 CUDA streams for H2D, EX, D2H
- Asynchronous I/O for RD, WR
- 2 READ pinned buffers for RD, H2D, and 2 WRITE pinned buffers for D2H, WR

7-Stage Pipeline Approach (RD, R2H, H2D, EX, D2H, H2W, WR):
- Irregular chunk size depending on the sampling (splitting) results
- 3 CUDA streams for H2D, EX, D2H
- Asynchronous I/O for RD, WR
- 2 POSIX threads for R2H, H2W
- 2 READ aligned buffers for RD, H2D; 2 WRITE aligned buffers for D2H, WR; and 4 device pinned buffers for R2H, H2D, D2H, H2W

(A simplified double-buffered skeleton of the 5-stage pipeline follows the stage legend below.)

Pipeline stages:
- RD: READ I/O from NVM
- WR: WRITE I/O to NVM
- R2H: memcpy from Host (Aligned) to Host (Pinned)
- H2W: memcpy from Host (Pinned) to Host (Aligned)
- H2D: memcpy from Host (Pinned) to Device
- D2H: memcpy from Device to Host (Pinned)
- EX: compute on Device
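To make the latency-hiding idea concrete, here is a heavily simplified, double-buffered skeleton of the 5-stage pipeline. It is a sketch, not the authors' code: blocking pread/pwrite stand in for the Linux AIO calls and the O_DIRECT alignment handling is omitted, so only the read of the next chunk is overlapped with the GPU work on the current chunk; file names, chunk count, and chunk size are assumptions.

```cuda
// pipeline_sketch.cu -- hedged skeleton of the 5-stage pipeline (RD, H2D, EX, D2H, WR)
// with double-buffered pinned memory, producing sorted chunks as in step 3.
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

int main() {
  const size_t n = 1 << 22;                       // records per chunk (assumed, regular size)
  const size_t bytes = n * sizeof(int64_t);
  const int c = 8;                                // number of chunks (assumed)

  int in_fd  = open("unsorted.bin", O_RDONLY);               // hypothetical input on Flash NVM
  int out_fd = open("chunks.bin", O_WRONLY | O_CREAT, 0644); // hypothetical output

  // Two pinned buffers: while the GPU processes buf[cur], the CPU reads the next chunk into buf[1-cur].
  int64_t* buf[2];
  cudaMallocHost((void**)&buf[0], bytes);
  cudaMallocHost((void**)&buf[1], bytes);
  int64_t* d_buf = nullptr;
  cudaMalloc((void**)&d_buf, bytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int cur = 0;
  pread(in_fd, buf[cur], bytes, 0);               // RD of chunk 0
  for (int i = 0; i < c; ++i) {
    // H2D + EX + D2H for chunk i, issued asynchronously on the stream.
    cudaMemcpyAsync(d_buf, buf[cur], bytes, cudaMemcpyHostToDevice, stream);
    thrust::device_ptr<int64_t> p(d_buf);
    thrust::sort(thrust::cuda::par.on(stream), p, p + n);
    cudaMemcpyAsync(buf[cur], d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    // RD of chunk i+1 overlaps with the GPU work on chunk i.
    if (i + 1 < c)
      pread(in_fd, buf[1 - cur], bytes, (off_t)(i + 1) * bytes);

    cudaStreamSynchronize(stream);                       // wait for the sorted chunk i
    pwrite(out_fd, buf[cur], bytes, (off_t)i * bytes);   // WR of chunk i
    cur = 1 - cur;                                       // swap buffers
  }

  cudaStreamDestroy(stream);
  cudaFree(d_buf);
  cudaFreeHost(buf[0]);
  cudaFreeHost(buf[1]);
  close(in_fd);
  close(out_fd);
  return 0;
}
```

In the real design, the writes are also asynchronous (Linux AIO), so RD, GPU work, and WR of different chunks all overlap as in the timing diagrams above.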

Hardware:
- CPU: Intel Xeon E5-2699 v3, 2.30 GHz (18 cores) x 2 sockets, HT enabled
- MEM: DDR4-2133, 128 GB
- GPU: NVIDIA Tesla K40 w/ 12 GB memory
- NVM: Huawei ES3000 v1 PCIe SSD, 2.4 TB

Software:
- OS: Linux 3.19.8
- Compiler: gcc 4.4.7
- CUDA: v7.0
- Thrust: v1.8.1
- File system: xfs

Comparison using uniformly distributed random int64_t records:
- in-core-cpu(n): in-core CPU sorting w/ libstdc++ Parallel Mode using n threads
- in-core-gpu: in-core GPU sorting w/ Thrust
- out-of-core-cpu(n): same technique as xtr2sort, but using only the CPU (same device memory size, n threads)
- out-of-core-gpu: same technique as xtr2sort, but using only the GPU, no file I/O
- xtr2sort: proposed technique
(The two in-core baselines are sketched below.)
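For reference, the two in-core baselines can be approximated as follows, assuming GCC's libstdc++ parallel mode and Thrust; the record count and thread count are placeholders.

```cuda
// baselines_sketch.cu -- hedged sketch of the in-core comparison baselines.
// compile: nvcc -Xcompiler -fopenmp baselines_sketch.cu
#include <parallel/algorithm>   // libstdc++ parallel mode (__gnu_parallel::sort)
#include <omp.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <vector>
#include <cstdint>
#include <cstdlib>

int main() {
  const size_t n = 1 << 26;                    // number of int64_t records (assumed)
  std::vector<int64_t> h(n);
  for (size_t i = 0; i < n; ++i) h[i] = std::rand();

  // in-core-cpu(n): multi-threaded CPU sort; thread count set via OpenMP.
  omp_set_num_threads(72);
  __gnu_parallel::sort(h.begin(), h.end());

  // in-core-gpu: GPU sort with Thrust, limited by device memory capacity.
  thrust::device_vector<int64_t> d(h.begin(), h.end());
  thrust::sort(d.begin(), d.end());
  return 0;
}
```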

Sorting Throughput

[Figure: throughput (records/sec, 0-250,000,000) vs. number of records (10^9 records, 0-20) for in-core-cpu(18/36/54/72), in-core-gpu, out-of-core-cpu(18/36/54/72), out-of-core-gpu, and xtr2sort.]

Distribution of Execution Time in Each Pipeline Stage

[Figure: elapsed time (ms, 0-500) for each stage: RD, R2H, H2D, EX, D2H, H2W, WR.]

- xtr2sort: sample-sort-based out-of-core sorting for deep memory hierarchy systems w/ GPU and Flash NVM
- Experimental results show that xtr2sort achieves up to:
  - 64x more records than in-core GPU sorting
  - 4x more records than in-core CPU sorting
  - 2.16x faster sorting than out-of-core CPU sorting using 72 threads
- The I/O chunking and latency-hiding approach works well for GPU and Flash NVM
- Future work includes performance modeling, power measurement, etc.

- In-core GPU sorting: ~0.4 G records (GPU memory capacity limitation)
- In-core CPU sorting: ~6.4 G records (host (CPU) memory capacity limitation)
- xtr2sort: ~25.6 G records; 64x more records than in-core-gpu, 4x more records than in-core-cpu, 2.16x faster
- Next-gen NVM devices: NVMe, 3D XPoint, etc.
- Next-gen accelerators (GPUs), NVLink, etc.

[1] Peters et al., "Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead", IPDPSW PhD Forum, pp. 1-8, 2010.
[2] Ye et al., "GPUMemSort: A High Performance Graphics Co-processors Sorting Algorithm for Large Scale In-Memory Data", GSTF International Journal on Computing, Vol. 1, No. 2, pp. 23-28, 2011.