


Out-of-core Sorting Acceleration using GPU and Flash NVM

Hitoshi Sato†‡, Ryo Mizote†‡, Satoshi Matsuoka†‡ († Tokyo Institute of Technology, ‡ CREST, JST)

Introduction

Motivation:
- How to overcome memory capacity limitations?
- How to offload bandwidth-oblivious operations onto low-throughput devices?

Proposal: xtr2sort (Extreme External Sort)

Experiment  

Summary

1. Unsorted records are located on Flash NVM.
2. Divide the input records into c chunks to fit the GPU memory capacity.
3. Then, sort the chunks on the GPU in the pipeline w/ data transfers.
4. Partition each of the chunks into c buckets using c-1 randomly sampled splitters (see the partitioning sketch after the diagram below).
5. Then, swap the buckets between chunks.
6. Sort each of the chunks on the GPU in the pipeline w/ data transfers.
7. Sorted records are placed on Flash NVM.

[Figure: Algorithm overview. Unsorted records on NVM are read as c chunks and sorted in-core on the GPU; each chunk is partitioned into c buckets using the c-1 splitters; the CPU swaps buckets between chunks; each chunk is sorted in-core on the GPU again; the sorted records are written back to NVM.]
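The partitioning in steps 4-5 relies on each chunk already being sorted, so the c-1 splitters locate the c bucket boundaries with a vectorized binary search. Below is a minimal sketch of that idea using Thrust, not the authors' implementation; the chunk size, splitter values, and bucket count are placeholder assumptions.

```cuda
// partition_sketch.cu -- hypothetical illustration of steps 4-5:
// find bucket boundaries of a sorted chunk given c-1 sorted splitters.
#include <thrust/device_vector.h>
#include <thrust/binary_search.h>
#include <thrust/sequence.h>
#include <cstdint>
#include <cstdio>

int main() {
  const int c = 4;                       // number of chunks / buckets (assumed small for the demo)
  const size_t n = 1 << 20;              // records per chunk (assumed)

  // A chunk of int64_t records, already sorted in-core on the GPU (step 3).
  thrust::device_vector<int64_t> chunk(n);
  thrust::sequence(chunk.begin(), chunk.end());

  // c-1 splitters, randomly sampled across chunks and sorted (fixed values here for brevity).
  thrust::device_vector<int64_t> splitters(c - 1);
  for (int i = 0; i < c - 1; ++i) splitters[i] = (int64_t)((i + 1) * (n / c));

  // Vectorized binary search: boundaries[j] is the index inside this sorted chunk
  // where bucket j ends (equivalently, where bucket j+1 begins).
  thrust::device_vector<size_t> boundaries(c - 1);
  thrust::lower_bound(chunk.begin(), chunk.end(),
                      splitters.begin(), splitters.end(),
                      boundaries.begin());

  // The c buckets of this chunk are [0, b0), [b0, b1), ..., [b_{c-2}, n).
  for (int j = 0; j < c - 1; ++j)
    std::printf("bucket boundary %d: %zu\n", j, (size_t)boundaries[j]);
  return 0;
}
```

In the full algorithm, bucket j of every chunk would then be exchanged so that chunk j holds only records destined for bucket j before the final in-core GPU sort (steps 5-6).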

- Sample-Sort-Based Out-of-core Sorting Approach [1][2] for Deep Memory Hierarchy Systems w/ GPU and Flash NVM
- I/O Chunking to fit the GPU Memory Capacity in order to exploit the Massive Parallelism and Memory Bandwidth of the GPU
  - Employ asynchronous data transfers between CPU and GPU using CUDA streams and cudaMemcpyAsync()
  - Page-locked memory (a.k.a. pinned memory) buffers are required (see the sketch below)
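As a rough illustration of the in-core stages (H2D, EX, D2H), the following sketch combines a pinned host buffer, cudaMemcpyAsync() on a CUDA stream, and a Thrust sort issued on that same stream (Thrust >= 1.8). The chunk size and record type are assumptions, and error checking is omitted.

```cuda
// incore_gpu_sort_sketch.cu -- hedged sketch of the H2D / EX / D2H stages:
// pinned host buffer + cudaMemcpyAsync on a CUDA stream + Thrust sort on that stream.
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdlib>

int main() {
  const size_t n = 1 << 24;                     // records per chunk (assumed)
  const size_t bytes = n * sizeof(int64_t);

  // Page-locked (pinned) host buffer: required for truly asynchronous copies.
  int64_t* h_buf = nullptr;
  cudaMallocHost((void**)&h_buf, bytes);
  for (size_t i = 0; i < n; ++i) h_buf[i] = std::rand();

  int64_t* d_buf = nullptr;
  cudaMalloc((void**)&d_buf, bytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // H2D: asynchronous copy, returns immediately so other pipeline stages can proceed.
  cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

  // EX: in-core GPU sort issued on the same stream.
  thrust::device_ptr<int64_t> d_ptr(d_buf);
  thrust::sort(thrust::cuda::par.on(stream), d_ptr, d_ptr + n);

  // D2H: copy the sorted chunk back into the pinned buffer.
  cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

  cudaStreamSynchronize(stream);   // h_buf now holds the sorted chunk for the WR stage

  cudaStreamDestroy(stream);
  cudaFree(d_buf);
  cudaFreeHost(h_buf);
  return 0;
}
```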

- Pipeline-based Latency Hiding to overlap File I/O between Flash NVM and CPU using Linux Asynchronous I/O System Calls
  - Pros: fully overlapped READ/WRITE file I/O (see the sketch below)
  - Cons: direct I/O required, e.g., the O_DIRECT flag with aligned file offset, memory buffer, and transfer size
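A minimal sketch of the RD stage under these constraints, using Linux AIO (libaio: io_setup/io_prep_pread/io_submit/io_getevents) with O_DIRECT and a 4 KiB-aligned buffer. The file name, block size, and request size are assumptions; link with -laio.

```cuda
// aio_read_sketch.cu -- hedged sketch of an overlapped READ from Flash NVM:
// Linux AIO (libaio) with O_DIRECT, which requires aligned offset, buffer, and size.
// compile: nvcc aio_read_sketch.cu -laio
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for O_DIRECT
#endif
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t align = 4096;                 // assumed logical block size
  const size_t bytes = 1 << 20;              // one I/O request (multiple of align)

  // O_DIRECT bypasses the page cache; offset, buffer, and size must all be aligned.
  int fd = open("records.bin", O_RDONLY | O_DIRECT);   // hypothetical input file
  if (fd < 0) { perror("open"); return 1; }

  void* buf = nullptr;
  if (posix_memalign(&buf, align, bytes) != 0) return 1;

  io_context_t ctx = 0;
  if (io_setup(8, &ctx) != 0) { perror("io_setup"); return 1; }

  struct iocb cb;
  struct iocb* cbs[1] = { &cb };
  io_prep_pread(&cb, fd, buf, bytes, /*offset=*/0);

  // Submit the read and continue with other pipeline stages (H2D, EX, ...) here.
  if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

  // Later, reap the completion before handing the buffer to the next stage.
  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, NULL);
  std::printf("read %ld bytes\n", (long)ev.res);

  io_destroy(ctx);
  free(buf);
  close(fd);
  return 0;
}
```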

- Sorting is a Key Building Block for Big Data Applications
  - e.g., database management systems, programming frameworks, supercomputing applications, etc.
  - Large memory capacity requirement

- Towards Future Computing Architectures
  - Available memory capacity per core is dropping in order to achieve efficient bandwidth, driven by increasing parallelism, heterogeneity, and density of processors
    - e.g., multi-core CPUs, many-core accelerators
    - Post-Moore era
  - Deepening memory/storage architectures
    - Device memory on many-core accelerators
    - Host memory on compute nodes
    - Semi-external memory connected w/ compute nodes, such as non-volatile memory (NVM) and storage class memory (SCM)

[Figure: Pipeline timing diagrams over the c chunks. Top: the 7-stage pipeline (RD, R2H, H2D, EX, D2H, H2W, WR) overlapped across chunks i, i+1, ..., i+6. Bottom: the 5-stage pipeline (RD, H2D, EX, D2H, WR) overlapped across chunks i, i+1, ..., i+4.]

5-Stage Pipeline Approach (RD, H2D, EX, D2H, WR):
- Regular chunk size for aligned file offset, memory buffer, and transfer size
- 3 CUDA streams for H2D, EX, D2H
- Asynchronous I/O for RD, WR
- 2 READ pinned buffers for RD, H2D, and 2 WRITE pinned buffers for D2H, WR

7-Stage Pipeline Approach (RD, R2H, H2D, EX, D2H, H2W, WR):
- Irregular chunk size depending on the sampling (splitting) results
- 3 CUDA streams for H2D, EX, D2H
- Asynchronous I/O for RD, WR
- 2 POSIX threads for R2H, H2W
- 2 READ aligned buffers for RD, H2D; 2 WRITE aligned buffers for D2H, WR; and 4 device pinned buffers for R2H, H2D, D2H, H2W

(A simplified double-buffered skeleton of the 5-stage pipeline follows the stage legend below.)

Pipeline stages:
- RD: READ I/O from NVM
- WR: WRITE I/O to NVM
- R2H: memcpy from Host (Aligned) to Host (Pinned)
- H2W: memcpy from Host (Pinned) to Host (Aligned)
- H2D: memcpy from Host (Pinned) to Device
- D2H: memcpy from Device to Host (Pinned)
- EX: compute on Device
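To make the latency-hiding idea concrete, here is a heavily simplified, double-buffered skeleton of the 5-stage pipeline. It is a sketch, not the authors' code: blocking pread/pwrite stand in for the Linux AIO calls and the O_DIRECT alignment handling is omitted, so only the read of the next chunk is overlapped with the GPU work on the current chunk; file names, chunk count, and chunk size are assumptions.

```cuda
// pipeline_sketch.cu -- hedged skeleton of the 5-stage pipeline (RD, H2D, EX, D2H, WR)
// with double-buffered pinned memory, producing sorted chunks as in step 3.
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

int main() {
  const size_t n = 1 << 22;                       // records per chunk (assumed, regular size)
  const size_t bytes = n * sizeof(int64_t);
  const int c = 8;                                // number of chunks (assumed)

  int in_fd  = open("unsorted.bin", O_RDONLY);               // hypothetical input on Flash NVM
  int out_fd = open("chunks.bin", O_WRONLY | O_CREAT, 0644); // hypothetical output

  // Two pinned buffers: while the GPU processes buf[cur], the CPU reads the next chunk into buf[1-cur].
  int64_t* buf[2];
  cudaMallocHost((void**)&buf[0], bytes);
  cudaMallocHost((void**)&buf[1], bytes);
  int64_t* d_buf = nullptr;
  cudaMalloc((void**)&d_buf, bytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int cur = 0;
  pread(in_fd, buf[cur], bytes, 0);               // RD of chunk 0
  for (int i = 0; i < c; ++i) {
    // H2D + EX + D2H for chunk i, issued asynchronously on the stream.
    cudaMemcpyAsync(d_buf, buf[cur], bytes, cudaMemcpyHostToDevice, stream);
    thrust::device_ptr<int64_t> p(d_buf);
    thrust::sort(thrust::cuda::par.on(stream), p, p + n);
    cudaMemcpyAsync(buf[cur], d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    // RD of chunk i+1 overlaps with the GPU work on chunk i.
    if (i + 1 < c)
      pread(in_fd, buf[1 - cur], bytes, (off_t)(i + 1) * bytes);

    cudaStreamSynchronize(stream);                       // wait for the sorted chunk i
    pwrite(out_fd, buf[cur], bytes, (off_t)i * bytes);   // WR of chunk i
    cur = 1 - cur;                                       // swap buffers
  }

  cudaStreamDestroy(stream);
  cudaFree(d_buf);
  cudaFreeHost(buf[0]);
  cudaFreeHost(buf[1]);
  close(in_fd);
  close(out_fd);
  return 0;
}
```

In the real design, the writes are also asynchronous (Linux AIO), so RD, GPU work, and WR of different chunks all overlap as in the timing diagrams above.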

Hardware:
- CPU: Intel Xeon E5-2699 v3, 2.30 GHz (18 cores) x 2 sockets, HT enabled
- MEM: DDR4-2133, 128 GB
- GPU: NVIDIA Tesla K40 w/ 12 GB memory
- NVM: Huawei ES3000 v1 PCIe SSD, 2.4 TB

Software:
- OS: Linux 3.19.8
- Compiler: gcc 4.4.7
- CUDA: v7.0
- Thrust: v1.8.1
- File system: xfs

Comparison using uniformly distributed random int64_t records:
- in-core-cpu(n): in-core CPU sorting w/ libstdc++ Parallel Mode using n threads
- in-core-gpu: in-core GPU sorting w/ Thrust
- out-of-core-cpu(n): same technique as xtr2sort, but using only the CPU (same device memory size, n threads)
- out-of-core-gpu: same technique as xtr2sort, but using only the GPU, no file I/O
- xtr2sort: proposed technique
(The two in-core baselines are sketched below.)
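For reference, the two in-core baselines can be approximated as follows, assuming GCC's libstdc++ parallel mode and Thrust; the record count and thread count are placeholders.

```cuda
// baselines_sketch.cu -- hedged sketch of the in-core comparison baselines.
// compile: nvcc -Xcompiler -fopenmp baselines_sketch.cu
#include <parallel/algorithm>   // libstdc++ parallel mode (__gnu_parallel::sort)
#include <omp.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <vector>
#include <cstdint>
#include <cstdlib>

int main() {
  const size_t n = 1 << 26;                    // number of int64_t records (assumed)
  std::vector<int64_t> h(n);
  for (size_t i = 0; i < n; ++i) h[i] = std::rand();

  // in-core-cpu(n): multi-threaded CPU sort; thread count set via OpenMP.
  omp_set_num_threads(72);
  __gnu_parallel::sort(h.begin(), h.end());

  // in-core-gpu: GPU sort with Thrust, limited by device memory capacity.
  thrust::device_vector<int64_t> d(h.begin(), h.end());
  thrust::sort(d.begin(), d.end());
  return 0;
}
```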

Sorting Throughput

[Figure: throughput (records/sec, 0-250,000,000) vs. number of records (10^9 records, 0-20) for in-core-cpu(18/36/54/72), in-core-gpu, out-of-core-cpu(18/36/54/72), out-of-core-gpu, and xtr2sort.]

Distribution of Execution Time in Each Pipeline Stage

[Figure: elapsed time (ms, 0-500) for each stage: RD, R2H, H2D, EX, D2H, H2W, WR.]

- xtr2sort: sample-sort-based out-of-core sorting for deep memory hierarchy systems w/ GPU and Flash NVM
- Experimental results show that xtr2sort achieves up to:
  - 64x more records than in-core GPU sorting
  - 4x more records than in-core CPU sorting
  - 2.16x faster sorting than out-of-core CPU sorting using 72 threads
- The I/O chunking and latency-hiding approach works well for GPU and Flash NVM
- Future work includes performance modeling, power measurement, etc.

- In-core GPU sorting: ~0.4 G records (GPU memory capacity limitation)
- In-core CPU sorting: ~6.4 G records (host (CPU) memory capacity limitation)
- xtr2sort: ~25.6 G records; 64x more records than in-core-gpu, 4x more records than in-core-cpu, 2.16x faster
- Next-gen NVM devices: NVMe, 3D XPoint, etc.
- Next-gen accelerators (GPUs), NVLink, etc.

[1] Peters et al., "Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead", IPDPSW PhD Forum, pp. 1-8, 2010.
[2] Ye et al., "GPUMemSort: A High Performance Graphics Co-processors Sorting Algorithm for Large Scale In-Memory Data", GSTF International Journal on Computing, Vol. 1, No. 2, pp. 23-28, 2011.