
Page 1: A Dynamic Scheduling Framework for Emerging Heterogeneous Systems Vignesh Ravi and  Gagan Agrawal

A Dynamic Scheduling Framework forEmerging Heterogeneous Systems

Vignesh Ravi and Gagan AgrawalDepartment of Computer Science and Engineering

The Ohio State UniversityColumbus, Ohio - 43210


Page 2:

Motivation

• Heterogeneous architectures are very common today
– E.g., today’s desktops & notebooks

– Multi-core CPU + Graphics card on PCI-E, AMD APUs …

• 3 of the top 5 supercomputers are heterogeneous (as of Nov 2011)
– Use multi-core CPUs and a GPU (C2050) on each node

• Application development for multi-core CPUs and GPUs is still independent
– Resources may be under-utilized

• Can a multi-core CPU and a GPU be used simultaneously for a single computation?


Page 3:

Outline

• Challenges Involved
• Analyzing Architectural Tradeoffs and Communication Patterns
• Cost Model for Choosing Chunk Size
• Optimized Dynamic Work Distribution Schemes
• Experimental Results
• Conclusions


Page 4:

Challenges Involved

• CPU / GPU vary in compute power, memory sizes, and latencies

• CPU / GPU relative performance varies across
– Each application

– Every combination of CPU and GPU

– Different problem sizes

• Effective work distribution is critical for performance
– Manual or static distribution is extremely cumbersome

• Dynamic distribution schemes are essential
– Consider tradeoffs due to heterogeneity in CPU and GPU

– Adaptable to varying CPU/GPU performance across each application

– Adaptable to different problem sizes and combinations of CPU and GPU


Page 5:

Contributions

• A general dynamic scheduling framework for data-parallel loops on heterogeneous CPU/GPU systems
– Identifies critical factors from architectural tradeoffs and communication patterns

• Identify “chunk size” as a key factor for performance
– Developed a cost model for heterogeneous systems

• Derived two optimized Dynamic Scheduling Schemes– Non-Uniform-Chunk Distribution Scheme (NUCS)

– Two-Level Hybrid Distribution Scheme

• Using four applications representing two distinct communication patterns, we show:
– A 35-75% performance improvement using our dynamic schemes

– Within 7% of performance obtained from best static distribution


Page 6:

Analysis of Architectural Tradeoffs and Communication Patterns


Page 7:

CPU – GPU Architectural Tradeoffs


Important observations based on architecture

• Each CPU core is slower than a GPU

• GPUs have a smaller memory capacity than the CPU

• GPU memory latency is very high

• CPU memory latency is relatively much lower

Required Optimizations

• Minimize GPU memory transfer overheads

• Minimize number of GPU Kernel invocations

• Reduce potential resource idle time

Page 8:

Analysis of Communication Patterns


• We analyze communication patterns in data parallel loops
– Divide the input dataset into a large number of chunklets
– Chunklets can be scheduled in arbitrary order (data parallel)
– Processing each element involves local and/or global updates
– Global updates involve only associative/commutative operations
– Global updates avoid races by privatizing global elements
– Global elements may be shared by all or a subset of processing elements

• In this work, we consider two distinct communication patterns:
– Generalized Reduction Structure
– Structured Grids (Stencil Computations)

Page 9:

Generalized Reduction Computations


{* Outer sequential loop *}
While (unfinished) {
    {* Reduction loop *}
    Foreach (element e in chunklet) {
        (i, val) = compute(e)
        RObj(i) = Reduc(RObj(i), val)
    }
}

[Figure: the Reduction Object resides in shared memory and is updated via a commutative/associative operation]

• Similar to the Map-Reduce model, but with only one stage: Reduction
• The Reduction Object, RObj, is exposed to the programmer
• The Reduction Object lives in shared memory [race conditions]
• The reduction operation, Reduc, is associative or commutative
• All updates are global updates
• Global elements are shared by all processing elements
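The reduction loop above can be sketched in runnable form. The following Python sketch is illustrative, not the authors' code: `compute` is a made-up histogram-style computation (bucketing by parity), and addition stands in for the associative/commutative `Reduc` operation.

```python
def compute(e):
    # Hypothetical per-element computation: bucket elements by parity.
    return (e % 2, e)

def reduc(old, val):
    # Associative/commutative combine; the order of updates does not matter.
    return old + val

def process_chunklet(chunklet, robj):
    """The reduction loop from the slide: each element contributes to
    one slot of the shared reduction object RObj."""
    for e in chunklet:
        i, val = compute(e)
        robj[i] = reduc(robj[i], val)

# Chunklets can be processed in arbitrary order; a parallel version
# would privatize robj per worker and combine the copies with reduc().
robj = [0, 0]
for chunklet in ([1, 2, 3], [4, 5, 6]):
    process_chunklet(chunklet, robj)
# robj[0] holds the sum of even elements, robj[1] the sum of odd ones.
```

Because `reduc` is associative and commutative, the two chunklets could also be processed concurrently on private copies of `robj` and merged afterwards.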

Page 10:

Structured Grid Computations


For i = 1 to num_rows_chunklet {
    For j = 1 to y-1 {
        B[i,j] = C0 * (A[i,j] + A[i+1,j] + A[i-1,j] + A[i,j+1] + A[i,j-1])
    }
}

Example: 2-D, 5-point Stencil Kernel

• Stencil kernels are instances of structured grids
• Involve nearest-neighbor computations
• Input is partitioned along rows for parallelization

For i = 1 to num_rows_chunklet {
    For j = 1 to y {
        if (is_local_row(i)) {      /* local update */
            B[i,j]   += C0 * A[i,j];
            B[i+1,j] += C0 * A[i,j];
            B[i-1,j] += C0 * A[i,j];
            B[i,j+1] += C0 * A[i,j];
            B[i,j-1] += C0 * A[i,j];
        } else {                    /* global update */
            Reduc(offset) = Reduc(offset) op A[i,j];
        }
    }
}

Rewriting Stencil Kernel as Reduction

• Rewrite as a reduction while maintaining correctness
• Processing involves both local and global updates
• Global elements are shared by only a subset of processing elements
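A minimal sketch (not the authors' implementation) of this scatter-form stencil: updates landing in rows owned by the chunklet are local, while updates to rows owned by other chunklets are accumulated into a reduction buffer with an associative `+` and merged later. The dict-based grid and the ownership test are illustrative simplifications.

```python
def stencil_chunklet(A, B, rows, C0, reduc_buf):
    """Scatter-form 5-point stencil over the rows owned by one chunklet.
    A, B: dicts mapping (i, j) -> value. Updates to owned rows go to B
    directly (local); updates to other rows are accumulated into
    reduc_buf (global) and folded into B after all chunklets finish."""
    owned = set(rows)
    y = max(j for (_, j) in A) + 1          # number of columns
    for i in rows:
        for j in range(1, y - 1):
            contrib = C0 * A[(i, j)]
            for di, dj in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                ti, tj = i + di, j + dj
                if ti in owned:              # local update
                    B[(ti, tj)] = B.get((ti, tj), 0.0) + contrib
                else:                        # global update
                    reduc_buf[(ti, tj)] = reduc_buf.get((ti, tj), 0.0) + contrib

# Tiny 3x3 grid, one chunklet owning only row 1:
A = {(i, j): 1.0 for i in range(3) for j in range(3)}
B, reduc_buf = {}, {}
stencil_chunklet(A, B, rows=[1], C0=0.2, reduc_buf=reduc_buf)
```

Only the two neighboring chunklets ever touch this chunklet's boundary rows, which is why global elements here are shared by a subset of processing elements rather than by all of them.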

Page 11:

Basic Distribution Scheme & Optimization Goals


• Global Work Queue

• Idle processor consumes work from the queue

• FCFS policy

• A fast worker ends up processing more data than a slow worker

• A slow worker still processes a reasonable portion of the data
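The FCFS behavior above can be illustrated with a small virtual-time simulation; the worker speeds and chunk counts below are made up for illustration.

```python
import heapq

def fcfs_schedule(num_chunks, chunk_size, speeds):
    """Simulate FCFS consumption of a global work queue in virtual time.
    speeds[w] = elements per time unit processed by worker w.
    Returns the number of chunks each worker ends up processing."""
    counts = [0] * len(speeds)
    idle = [(0.0, w) for w in range(len(speeds))]   # (time idle, worker id)
    heapq.heapify(idle)
    for _ in range(num_chunks):
        t, w = heapq.heappop(idle)       # next idle worker grabs a chunk
        counts[w] += 1
        heapq.heappush(idle, (t + chunk_size / speeds[w], w))
    return counts

# A 3x-faster worker (think: GPU) grabs more chunks under FCFS, but the
# slow worker still processes a reasonable share.
counts = fcfs_schedule(num_chunks=12, chunk_size=100, speeds=[1.0, 3.0])
```

This self-balancing behavior is what the basic scheme provides for free; the optimizations on this slide address the overheads it ignores.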

[Figure: Uniform Distribution Scheme — a master/job scheduler feeds a global work queue consumed by pools of fast workers and slow workers]

Optimization Goals

1. Ensure a sufficient number of chunks

2. Minimize GPU data transfer and kernel invocation overhead

3. Minimize the number of global element allocations

4. Minimize the number of distinct processes that share a global element

Page 12:

Cost Model for Choosing Chunk Size


Page 13:

Chunk Size: A Key Factor


Chunk Size impacts two important factors that directly impact performance

1. GPU Kernel Invocation and Data Transfer cost

2. Resource Idle time due to heterogeneous processing elements

[Figure: Overhead percentage vs. number of chunks (32-512), with two series: device invocation & transfer cost, and resource idle time]

Page 14:

Cost Model for Choosing Chunk Size


Idle Time

• Occurs at the last iteration of the processing

• The slower processor takes more time while the faster processor sits idle

• The GPU, being the faster processor, will be idle at the end

GPU Kernel Call & Transfer Overheads (GKT)

• Each data transfer has 3 factors: latency, transfer cost, and kernel invocation

• The 1st and 3rd factors depend on the number of chunks (i.e., on the chunk size)

• The 2nd factor is constant for the entire dataset size

Goal: Minimize(Sum(Idle time, GKT)) for the entire processing

We show that the optimal chunk size is proportional to the square root of the total processing time.
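Under simple assumed linear forms for the two overheads (illustrative, not the paper's exact constants), the minimization works out as follows: with total data N, per-chunk overhead a (latency + kernel invocation, paid once per chunk), and idle cost c per element of the last chunk, the total overhead is a·N/S + c·S, which is minimized at S* = sqrt(a·N/c) — proportional to the square root of the total work, consistent with the claim above.

```python
import math

# Assumed forms (illustrative):
#   GKT(S)  = a * (N / S)  -- paid once per chunk; the transfer volume
#                             itself is constant in S and omitted
#   Idle(S) = c * S        -- at the end, the slower device works on at
#                             most one last chunk while the faster waits

def total_overhead(S, N, a, c):
    return a * (N / S) + c * S

def optimal_chunk_size(N, a, c):
    # d/dS [a*N/S + c*S] = -a*N/S**2 + c = 0  =>  S* = sqrt(a*N / c)
    return math.sqrt(a * N / c)

N, a, c = 1_000_000, 0.5, 0.002
s_star = optimal_chunk_size(N, a, c)
# Doubling the total work N scales S* by sqrt(2): the chunk size grows
# with the square root of the total processing time.
```

The constants a and c would be measured (or profiled) per application and hardware combination; only the square-root dependence on N carries over.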

Page 15:

Optimized Dynamic Distribution Schemes


Page 16:

Non-Uniform Chunk Size Scheme


• Start with initial chunk size as indicated by the cost model

• If a CPU requests work, a chunk of the initial size is forwarded

• If a GPU requests work, a larger chunk is formed by merging smaller chunks

• Minimizes GPU data transfer and device invocation overhead

• At the end of processing, idle time is also minimized
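A sketch of how NUCS might serve requests; the merge factor of 4 for GPU requests is an illustrative assumption, not a value from the paper.

```python
from collections import deque

def make_queue(data, chunk_size):
    """Divide the input into small chunks of the cost-model size."""
    return deque(data[i:i + chunk_size] for i in range(0, len(data), chunk_size))

def next_chunk(queue, requester, gpu_merge_factor=4):
    """CPU request: one small chunk. GPU request: merge up to
    gpu_merge_factor small chunks into one large chunk, cutting
    per-chunk transfer and kernel-invocation overheads."""
    if requester == "cpu":
        return queue.popleft() if queue else []
    merged = []
    for _ in range(min(gpu_merge_factor, len(queue))):
        merged.extend(queue.popleft())
    return merged

q = make_queue(list(range(20)), chunk_size=2)
cpu_chunk = next_chunk(q, "cpu")   # small chunk for a CPU worker
gpu_chunk = next_chunk(q, "gpu")   # merged large chunk for the GPU
```

Because the queue always holds small chunks, the tail of the computation can still be handed out in fine-grained pieces, which is what keeps end-of-processing idle time low.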

[Figure: NUCS work distribution — the initial data division produces chunks 1..K; the job scheduler forwards small chunks to CPU workers and merges chunks into large chunks for GPU workers]

Page 17:

Two-Level Hybrid Scheme

• In the first level, data is dynamically distributed between the CPU and the GPU

• Allows coarse-grained distribution of data

• Reduces the number of global updates

[Figure: Two-Level Hybrid Scheme — a dynamic first level assigns chunks to the CPU or the GPU; the CPU chunk is statically split among threads 0..p-1]

• In the second level, static and equal distribution within CPU cores and GPU cores

• Reduces the number of subsets that share global elements (P^2 P-1)

• Reduces Combination overhead
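The two levels can be sketched as follows; the chunk sizes, thread count, and FCFS first-level policy are illustrative assumptions.

```python
def first_level(queue):
    """Level 1: dynamic FCFS distribution of coarse chunks between the
    CPU group and the GPU (whichever requests next)."""
    return queue.pop(0) if queue else []

def split_static(chunk, num_threads):
    """Level 2: static, near-equal partition of a CPU chunk among the
    CPU threads; no further contention on the shared queue."""
    base, rem = divmod(len(chunk), num_threads)
    parts, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < rem else 0)
        parts.append(chunk[start:start + size])
        start += size
    return parts

# Illustrative sizes: 4 coarse chunks of 8 elements, 3 CPU threads.
queue = [list(range(i * 8, (i + 1) * 8)) for i in range(4)]
cpu_chunk = first_level(queue)             # dynamic first level
per_thread = split_static(cpu_chunk, 3)    # static second level
```

Coarse chunks at the first level mean fewer global updates, and the static second level fixes which threads can ever share a given global element.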

Page 18:

Experimental Results


Page 19:


Experimental Setup

Applications for Generalized Reduction Structure
• K-Means Clustering [6.4 GB]
• Principal Component Analysis (PCA) [8.5 GB]

Applications for Structured Grid Computations
• 2-D Jacobi Kernel [1 GB]
• Sobel Filter [1 GB]


Environment 1 (CPU-centric)
• AMD Opteron
• 8 CPU cores
• 16 GB Main Memory
• Nvidia GeForce 9800 GTX
• 512 MB Device Memory

Environment 2 (GPU-centric)
• AMD Opteron
• 4 CPU cores
• 8 GB Main Memory
• Nvidia Tesla C1060
• 4 GB Device Memory

Page 20:


Experimental Goals

• Validate the accuracy of cost model for choosing chunk size

• Evaluate the performance gain from using optimized work distribution schemes (using CPU+GPU simultaneously)

• Study the overheads of dynamic distribution compared to the best static distribution


Page 21:

Accuracy of the Cost Model for Choosing Chunk Size

[Figure: Normalized relative speedup vs. number of chunks (16-256) for K-Means, PCA, Jacobi, and Sobel, with the model-predicted chunk size marked; off-scale bars labeled 78, 140, and 1552]

• Each application achieves best performance at different chunk sizes

• Poor chunk size selection can impact the performance significantly

• “Predicted” chunk size is always close to the chunk size with best performance

Page 22:

Performance Gains from Using CPU&GPU

[Figure: Relative speedup of CPU-only, GPU-only, and CPU+GPU versions of k-means, PCA, Jacobi, and Sobel, shown for ENV 1 and ENV 2]

• For K-Means and PCA, the CPU+GPU version uses NUCS
• For Jacobi and Sobel, the CPU+GPU version uses the Two-Level Hybrid Scheme
• In both ENV 1 & ENV 2, performance improvements ranging from 37% to 75% can be achieved
• Shows that our dynamic scheduling framework can adapt to different hardware configurations of CPU and GPU


Page 23:

Scheduling Overheads of Dynamic Distribution Schemes

[Figure: Relative speedup vs. thread configuration (1+1 to 8+1 CPU+GPU) for the Naïve-Static, M-OPT Static, and Dynamic schemes, on K-Means and Sobel Filter]

• We compare Dynamic schemes with static schemes

• “Naïve” static – Distributes work equally between CPU and GPU

• “M-OPT” static – obtained from exhaustive search for every problem size, application and h/w config.

• Dynamic Schemes: At most 7% slower than the “M-OPT” static

• Significantly better than “Naïve”


Page 24:

Conclusions

• We present a dynamic scheduling framework for data parallel loops on heterogeneous systems

• Analyze architectural and communication pattern tradeoffs to infer critical constraints for dynamic scheduling

• A cost model for choosing optimal chunk size in a heterogeneous setup

• Developed two instances of optimized work distribution schemes– NUCS & 2-Level Hybrid scheme

• Our evaluation includes 4 applications representing 2 distinct communication patterns

• We show up to 75% improvement from using the CPU & GPU simultaneously


Page 25:


Thank You!

Questions?

Contacts:
Vignesh Ravi - [email protected]

Gagan Agrawal - [email protected]