
Exploring Memory Options for Data Transfer on Heterogeneous Platforms∗

Chao Liu†
Northeastern University
[email protected]

Janki Bhimani‡
Northeastern University
[email protected]

Miriam Leeser
Northeastern University
[email protected]

1 INTRODUCTION

Heterogeneous platforms that use GPUs for acceleration are increasingly popular in high performance and parallel computing. However, accelerating applications with GPUs is not straightforward, and considerable effort is required for users to develop applications that make effective use of GPUs. In our previous work, we proposed a programming framework that allows users to develop parallel applications based on a high level tasks and conduits abstraction [2]. Here we develop GPU applications based on this framework, implementing computationally intensive kernels as GPU tasks in these applications.

We target NVIDIA GPUs. Through the CUDA runtime, three different types of memory can be allocated for data transfer between CPU and GPU: pageable memory, pinned memory and unified memory. Which type of memory is a good choice for a given application depends on the characteristics of the application [1]. To facilitate developing GPU programs and exploring different types of CPU/GPU memory transfer for performance optimization, we provide a uniform and simple interface to allocate space in system main memory (host memory) and GPU memory (device memory) in GPU task implementations. We define the following basic types:

    enum class MemType_t { pageable, pinned, unified };
    template<typename T> class GpuMem;

When we need to allocate memory for a data set that is used on a GPU, we create a GpuMem object with the desired memory type (pageable, pinned or unified). The object provides methods to obtain the proper memory pointers and to complete host/device data synchronization. Through this uniform interface and its helper methods, we can create a GPU task with the appropriate memory type arguments and explore different memory schemes for host/device communication easily and effectively.
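To make the interface concrete, the following is a minimal sketch of how such a wrapper could dispatch to the corresponding CUDA allocation calls. The paper names only MemType_t, GpuMem and (in the poster) the members m_memType, m_hostPtr and m_devPtr; the constructor logic and the method names hostPtr(), devPtr(), syncToDevice() and syncToHost() are our assumptions for illustration, not the framework's actual interface.

    #include <cuda_runtime.h>
    #include <cstddef>

    template <typename T>
    class GpuMem {
    public:
        GpuMem(std::size_t n, MemType_t type) : m_size(n), m_memType(type) {
            switch (m_memType) {
            case MemType_t::pageable:
                m_hostPtr = new T[n];                                 // plain pageable host memory
                cudaMalloc((void**)&m_devPtr, n * sizeof(T));         // separate device buffer
                break;
            case MemType_t::pinned:
                cudaMallocHost((void**)&m_hostPtr, n * sizeof(T));    // page-locked host memory
                cudaMalloc((void**)&m_devPtr, n * sizeof(T));
                break;
            case MemType_t::unified:
                cudaMallocManaged((void**)&m_hostPtr, n * sizeof(T)); // one managed allocation
                m_devPtr = m_hostPtr;                                 // same pointer on both sides
                break;
            }
        }
        ~GpuMem() {
            switch (m_memType) {
            case MemType_t::pageable: delete[] m_hostPtr; cudaFree(m_devPtr); break;
            case MemType_t::pinned:   cudaFreeHost(m_hostPtr); cudaFree(m_devPtr); break;
            case MemType_t::unified:  cudaFree(m_hostPtr); break;
            }
        }
        T* hostPtr() { return m_hostPtr; }
        T* devPtr()  { return m_devPtr; }
        // Explicit copies are needed only for the non-unified cases.
        void syncToDevice() {
            if (m_memType != MemType_t::unified)
                cudaMemcpy(m_devPtr, m_hostPtr, m_size * sizeof(T), cudaMemcpyHostToDevice);
        }
        void syncToHost() {
            if (m_memType != MemType_t::unified)
                cudaMemcpy(m_hostPtr, m_devPtr, m_size * sizeof(T), cudaMemcpyDeviceToHost);
        }
    private:
        std::size_t m_size;
        MemType_t m_memType;
        T* m_hostPtr = nullptr;
        T* m_devPtr  = nullptr;
    };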

2 EXPERIMENTS AND ANALYSIS

To demonstrate the use of GPU task based applications and to analyze the performance of different memory types for CPU/GPU transfer, we developed a benchmark set that currently includes the ten applications shown in Table 1.

∗No need of demo setup.
†Graduate student of ECE at Northeastern University.
‡Graduate student of ECE at Northeastern University.

Table 1: Benchmarks

Application                      Domain
Image Rotation (Rotate)          Image Processing
Color Conversion (YUV)           Image Processing
Matrix Multiply (MM)             Linear Algebra
2D Heat Conduction (HC)          Linear Algebra
MD5 Calculation (MD5)            Cryptography
K-means Clustering (Kmeans)      Data Mining
N-body Simulation (Nbody)        Space Simulation
Ray Tracing (Raytrace)           Computer Graphics
Breadth-First Search (BFS)       Graph Algorithm
Nearest Neighbors (NN)           Data Mining

These applications span a variety of domains, and we implement the parallel version of each with GPU tasks to make use of the GPU for acceleration. For each application, we prepare three different sizes of workload in our experiments, referred to as Small, Medium and Large (S, M, L). Our test platform includes three different GPUs: NVIDIA Tesla C2070, NVIDIA Tesla K20m, and NVIDIA Tesla K40m. The Tesla C2070 is an NVIDIA Fermi architecture GPU, while the Tesla K20m and K40m belong to the NVIDIA Kepler architecture. We run sequential implementations on an Intel Xeon E5-2650 CPU, recording the runtime as the baseline. All speedup results are computed with respect to this baseline and reported as log2 values for convenience.
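We read "normalized by log2" as reporting the base-2 logarithm of the speedup ratio; that is, for an application with sequential CPU runtime T_CPU and GPU runtime T_GPU, the plotted value is

    LogSpeedup = log2(T_CPU / T_GPU)

so a value of 5 on the plots corresponds to a 32x speedup.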

We first test the effect of using different types of memory in our benchmarks. With the assistance of the single memory interface, we only need to pass different memory type arguments to use different types of memory in an application. Figure 1 shows the speedup results of the benchmark applications running on a Tesla K20m GPU.

From the results we can see that, for Rotate, YUV, MD5 and NN, pinned memory performs better than pageable or unified memory. For NN, there is no speedup benefit unless pinned memory is used. For MM and BFS, using pinned memory is also preferable. However, for large workloads, the performance improvement from using pinned memory is not as good as for the small and medium workloads.


[Figure 1: Applications speedup using different types of memory on Tesla K20m. Panels: (a) Small workload, (b) Medium workload, (c) Large workload. Bars per application (Rotate through NN) for Pageable, Pinned and Unified memory; y-axis: log speedup.]

[Figure 2: Computation/communication time cost percentage using pageable memory for (2a) Small workload and (2b) Large workload (per-application split of KernelCompute vs. Host/DeviceComm), and (2c) host/device communication improvement from pageable to pinned memory (improvement percentage for S, M and L workloads of Rotate, YUV, MM, MD5, BFS and NN).]

[Figure 3: Unified memory test on different GPUs: (a) Tesla C2070, (b) Tesla K40m. Log speedup of Pageable vs. Unified memory for S/M/L workloads of Rotate, MM and Raytrace.]

Considering the scarcity of pinned memory, pageable or unified memory may therefore be a wiser choice for applications with a large amount of data to process. For the rest of the applications, the performance of the different types of memory is very close under all three workloads.

Figures 2a and 2b show the percentage of computation and communication time within the total time when using pageable memory for each application. We can see that the host/device communication times of HC, K-means, Nbody and Raytrace are relatively small compared to the other applications. This is also why these four applications change less when adopting different types of memory. Figure 2c shows the host/device communication improvement when changing from pageable to pinned memory for some of the benchmark applications.
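For context, the pageable-to-pinned switch changes only the host-side allocation call; a minimal sketch of the two transfer paths (the buffer size is illustrative):

    #include <cuda_runtime.h>
    #include <cstdlib>
    #include <cstring>

    int main() {
        const size_t bytes = 64 << 20;                // 64 MiB, illustrative transfer size
        float* d = nullptr;
        cudaMalloc((void**)&d, bytes);

        // Pageable: plain heap memory; the driver stages the copy
        // through an internal pinned bounce buffer.
        float* h_pageable = (float*)std::malloc(bytes);
        std::memset(h_pageable, 0, bytes);
        cudaMemcpy(d, h_pageable, bytes, cudaMemcpyHostToDevice);
        std::free(h_pageable);

        // Pinned: page-locked memory; the GPU can DMA directly,
        // which is why host/device communication time drops.
        float* h_pinned = nullptr;
        cudaMallocHost((void**)&h_pinned, bytes);
        std::memset(h_pinned, 0, bytes);
        cudaMemcpy(d, h_pinned, bytes, cudaMemcpyHostToDevice);
        cudaFreeHost(h_pinned);

        cudaFree(d);
        return 0;
    }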

Next, we test the effect of using unified memory on different types of GPUs. Figure 3 shows the test results of some applications for different sizes of workload. We find that on the Tesla C2070, the performance of Rotate and MM using unified memory is much lower than with pageable memory, while on the Tesla K40m unified memory performance improves considerably. For Raytrace, because little host/device transfer occurs, using unified memory has little effect on application performance. Generally, the Tesla K40m, an NVIDIA Kepler architecture GPU, has better support for unified memory than the Tesla C2070.
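For reference, a minimal sketch of the unified memory pattern that produces this behavior: a single managed allocation is touched by both host and device, and the system migrates the data implicitly (the kernel is a hypothetical example, not one of our benchmarks):

    #include <cuda_runtime.h>

    __global__ void addOne(int* a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] += 1;
    }

    int main() {
        const int n = 1 << 20;
        int* a = nullptr;
        cudaMallocManaged((void**)&a, n * sizeof(int)); // visible to CPU and GPU
        for (int i = 0; i < n; ++i) a[i] = i;           // host writes, no explicit copy
        addOne<<<(n + 255) / 256, 256>>>(a, n);         // device uses the same pointer
        cudaDeviceSynchronize();                        // required before the host reads again
        int ok = (a[0] == 1);                           // host reads results directly
        cudaFree(a);
        return ok ? 0 : 1;
    }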

From these tests, we conclude that pinned memory can improve host/device communication performance significantly. For applications where memory transfer takes up a substantial amount of total runtime, using pinned memory improves overall performance. However, pinned memory cannot be used indiscriminately, since page-locked memory is a scarce system resource. Unified memory eases the programming procedure, but the implicit data transfer managed by the underlying system and hardware support is not always efficient. For choosing between different types of memory, our interface saves the user extra programming effort, thus making the exploration of different design choices easier.

REFERENCES

[1] Janki Bhimani, Miriam Leeser, and Ningfang Mi. 2016. Design space exploration of GPU accelerated cluster systems for optimal data transfer using PCIe bus. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE. IEEE, 1-7.

[2] Chao Liu and Miriam Leeser. 2017. A Framework for Developing Parallel Applications with High Level Tasks on Heterogeneous Platforms. In Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, 74-79.


Exploring Memory Options for Data Transfer on Heterogeneous Platforms

Chao Liu; Janki Bhimani; Miriam Leeser
Department of Electrical and Computer Engineering, Northeastern University

Abstract

Developing parallel applications for GPU platforms and optimizing GPU related applications for good performance is important. We develop a high level task design framework with a concise interface to choose and allocate the desired memory blocks used for host/device data transfer, flexibly and with little modification of the original programs. The memory options studied are pageable, pinned and unified. We developed a benchmark set containing ten different kernel applications to test and analyze the effect of memory options in GPU applications.

Introduction

In previous work, we proposed a programming framework that allows users to develop parallel applications based on high level tasks and conduits [2]. Here we develop GPU applications based on this framework, implementing computationally intensive kernels as GPU tasks in each application.

We target NVIDIA GPUs using the CUDA runtime. Three different kinds of memory can be allocated for host/device communication: pageable, pinned and unified.

Implementation and Application Development

To ease the use of different kinds of memory in a GPU task, we introduce a concise interface to select and create memory space for host/device data transfer:

Application high-level main structure:

    class GPUTask1;
    class GPUTask2;
    Task<GPUTask1> gtask1;
    Task<GPUTask2> gtask2;

[Figure 1: GPU tasks based application. The framework runtime maps each task's gpuTaskHostPart to the CPU and its gpuTaskDevicePart to the GPU.]

Three basic procedures for each GPU task: copy inputs host to device, run the GPU kernel, copy results device to host.

[Figure 2: GPU program procedure and memory options. Host memory and device memory pointers for each procedure under the pageable, pinned and unified options.]

    enum class MemType_t {
        pageable,
        pinned,
        unified
    };

    template<typename T>
    class GpuMem {
        MemType_t m_memType;
        T* m_hostPtr;
        T* m_devPtr;
        ....
    };

    Task<GPUTask1> gtask1(...., gTask, MemType_t::pinned, ....);

This makes it easy to define a GPU task that uses the desired type of memory for host/device data transfer in an application, without changing existing task implementations.
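As an illustrative sketch, a GPU task body built on the GpuMem wrapper could follow the three procedures directly; the kernel and the helper method names (syncToDevice, devPtr, syncToHost) are assumptions for illustration, as before:

    // Hypothetical example kernel; not one of the benchmark kernels.
    __global__ void scaleKernel(float* data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    void runGpuTask(GpuMem<float>& buf, int n) {
        buf.syncToDevice();                                           // 1. inputs: host -> device
        scaleKernel<<<(n + 255) / 256, 256>>>(buf.devPtr(), n, 2.0f); // 2. run GPU kernel
        cudaDeviceSynchronize();
        buf.syncToHost();                                             // 3. results: device -> host
    }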

Experiments and Results

Our test platforms include three different NVIDIA GPUs: Tesla C2070, Tesla K20m, and Tesla K40m. Each test case was conducted for several rounds and the average results recorded. We run sequential implementations on an Intel Xeon E5-2650 CPU, recording the runtime as the baseline.

To demonstrate the use of GPU task based applications and analyze the performance of different memory types for CPU/GPU transfer, we developed a benchmark set that includes ten applications (see Table 1). For each application, we prepare three different sizes of workload in our experiments, referred to as Small, Medium and Large (S, M, L).

Table 1. Benchmarks

Application                          Domain
Image Rotation (A1: Rotate)          Image Processing
Color Conversion (A2: YUV)           Image Processing
Matrix Multiply (A3: MM)             Linear Algebra
2D Heat Conduction (A4: HC)          Linear Algebra
MD5 Calculation (A5: MD5)            Cryptography
K-means Clustering (A6: Kmeans)      Data Mining
N-body Simulation (A7: Nbody)        Space Simulation
Ray Tracing (A8: Raytrace)           Computer Graphics
Breadth-First Search (A9: BFS)       Graph Algorithm
Nearest Neighbors (A10: NN)          Data Mining

[Chart 1: Applications speedup using different types of memory on Tesla K20m: (a) Small workload, (b) Medium workload, (c) Large workload. Bars per application (A1-A10) for Pageable, Pinned and Unified memory; y-axis: log speedup.]

[Chart 2: Computation/communication time cost percentage using pageable memory for (a) Small workload and (b) Large workload (KernelCompute vs. Host/DeviceComm per application A1-A10), and (c) host/device communication improvement from pageable to pinned memory for S/M/L workloads of A1, A2, A3, A5, A9 and A10.]

[Chart 3: Unified memory test on different GPUs: (a) Tesla C2070, (b) Tesla K40m. Log speedup of Pageable vs. Unified memory for S/M/L workloads of A1 (Rotate), A3 (MM) and A8 (Raytrace).]

Pinned memory provides better performance for Rotate (A1), YUV (A2), MM (A3), MD5 (A5), BFS (A9) and NN (A10). For MM and BFS, the performance improvement from pinned memory on large workloads is not as large as on small and medium workloads.

Charts 2(a) and 2(b) show that the host/device communication times of HC (A4), K-means (A6), Nbody (A7) and Raytrace (A8) are relatively small compared to the other applications. Chart 2(c) shows the data transfer improvement from using pinned memory for the applications that have substantial data transfer workloads.

Unified memory was tried for three applications: Rotate (A1), MM (A3) and Raytrace (A8). Unified memory causes performance degradation, especially on older GPUs.

Conclusions

From these tests, pinned memory can improve host/device communication performance and is preferable for applications where memory transfer takes up a substantial amount of total runtime. But pinned memory is not always the best choice. Unified memory eases the programming procedure, but the implicit data transfer managed by the underlying system and hardware support is not always efficient. Our interface saves the user extra programming effort, thus making the exploration of different design choices and types of memory easier.

Contact Information

Chao Liu, Email: [email protected]
Miriam Leeser, Email: [email protected]
Janki Bhimani, Email: [email protected]

To read further

1. Janki Bhimani, Miriam Leeser, and Ningfang Mi. 2016. Design space exploration of GPU accelerated cluster systems for optimal data transfer using PCIe bus. In High Performance Extreme Computing Conference (HPEC), 2016 IEEE. IEEE, 1-7.

2. Chao Liu and Miriam Leeser. 2017. A Framework for Developing Parallel Applications with High Level Tasks on Heterogeneous Platforms. In Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, 74-79.

RCL Lab: http://www.coe.neu.edu/Research/rcl//projects.php