
Running Multiple Workloads on a GPU: A UX-Oriented Approach

Yuval Sarna, Graphics Software Expert @ GameFly Streaming

Agenda

• Sharing the GPU

• We all like to Play

• Introduction to GPU Scheduling

• Proposed GPU Scheduler

• Summary & Q&A

What does it mean to “share the GPU”?

• Most modern applications use the GPU

• They all share the same hardware resources: CPU, RAM, GPU, etc.

• The GPU executes tasks coming from different processes, satisfying their needs, be it graphics hardware acceleration, GPGPU, etc.

The GPU Model

• Many physical cores but a single core computational model (no “SetAffinity”)

• Access model is FIFO, no fairness, no preemption

• Many processes use the GPU simultaneously, but it can process only one task at a time

Why do we need to share the GPU?

• Cost Efficiency

• Cloud Environments

• Academic Super-Computers

Difficulties in Sharing the GPU Efficiently

• Running non-demanding applications in parallel is easy

  • Not real-time, i.e., they don’t require low latency

• When it comes to running multiple demanding workloads on the GPU, sharing becomes difficult

  • Which workload should execute now?

  • How do we handle greedy workloads?

  • What do we expect from a GPU sharing scheme?

Efficient GPU Sharing

• Utilizing the GPU

• Fairness of GPU between applications

• Smooth User Experience (UX)


Case Study – GameFly Streaming

• Game is running (and rendered) on a server

• Rendered frames are streamed as video in real time to the client

• Gamepad commands are sent back to the server

The Technology

• Game is running (and rendered) on a server

• Rendered frames are streamed as video in real time to the client


Definitions & Assumptions

[Diagram: a GPU with Nodes 0 to 6, the GPU Scheduler, three processes with their contexts, and the command buffers queued for Node 2.]

Scheduling Efficiency

To measure the efficiency of a scheduling algorithm, we may look at two main factors:

• Maximum utilization of the GPU

  • The algorithm should allow it to be 100% utilized.

• Number of frames that missed their deadline

  • Measured relative to how far they exceeded their expected time.

• Ask your target audience
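The two factors above can be computed from per-frame measurements. The following is a minimal sketch, not the measurement code used in the talk; the `FrameRecord` type and its field names are hypothetical and assume that GPU time, finish time, and deadline are recorded per frame.

```python
from dataclasses import dataclass


@dataclass
class FrameRecord:
    """Hypothetical per-frame measurement (all values in milliseconds)."""
    gpu_time_ms: float   # GPU time the frame consumed
    finish_ms: float     # completion time, measured from BeginFrame
    deadline_ms: float   # deadline, measured from BeginFrame


def utilization(frames, window_ms):
    """Fraction of the measurement window during which the GPU was busy."""
    return sum(f.gpu_time_ms for f in frames) / window_ms


def missed_deadlines(frames):
    """Return (number of late frames, total overshoot in ms)."""
    late = [f.finish_ms - f.deadline_ms
            for f in frames if f.finish_ms > f.deadline_ms]
    return len(late), sum(late)
```

Reporting the overshoot alongside the count reflects the second factor: a frame that is 4ms late hurts the user less than one that is 100ms late.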

Efficient GPU Sharing

• Utilizing the GPU

• Fairness of GPU between applications

• Smooth User Experience (UX)

[Timeline: BeginFrame at 0ms, Deadline at 33ms; a sequence of command buffers (CB), with a question mark on the last one.]

Fairness

If life is unfair to everyone, isn’t life fair?

How is it done?

• Windows Display Driver Model

• Stall command buffers if they shouldn’t yet be submitted for GPU execution

[Diagram: WDDM architecture.

User mode: Application, Direct3D runtime, User-mode display driver, OpenGL runtime, OpenGL installable client driver (ICD), Win32 GDI, Kernel-mode access (gdi32.dll).

Kernel mode: Win32K.sys, DirectX graphics kernel subsystem (Dxgkrnl.sys), which includes the display port driver, video memory manager, and GPU scheduler, and the display miniport driver.]

Windows OS GPU Scheduler

• Round-Robin scheduling algorithm

• Let’s take a look at a video showing the issues

  • GPU utilization is ~105%

  • Six concurrent games: 5 × Overlord II and 1 × Alan Wake’s American Nightmare

  • Running on NVIDIA GRID K520

A Look Behind the Scenes

Grey command buffers are new frames released by the game

[Timeline annotations: ~142ms, ~48ms, ~57ms, ~35ms, ~80ms]

A Look Behind the Scenes

[Timeline annotation: ~24ms]


Why can it be done better?

• We know what kind of workloads we want to schedule

• We can set a target performance

• Our scheduler doesn’t have to be generic

GPU Resources

• For example, say we set the target performance to a rate of 30 frames per second (FPS)

• Each frame shouldn’t take more than ~33ms

• These are the GPU’s resources we have to manage and schedule

• We don’t allow running more than “33 blocks” worth of workloads concurrently

  • But is it enough?

[Figure: the GPU’s frame time drawn as 33 blocks, numbered 1 to 33.]

First Attempt – Earliest Deadline First

• Prioritize CBs with earlier deadlines using the following data:

  • The time it took the context to complete the previous frame

  • The time a context has used so far to create the current frame
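The slides do not spell out how the two inputs above are combined, so the following is only one plausible reading, sketched in Python. It orders contexts by absolute deadline (Earliest Deadline First) and, as an assumption of this sketch, breaks ties by the estimated remaining work, taken as the previous frame time minus the time used so far. The `Context` type, its field names, and `FRAME_TIME_MS` are hypothetical.

```python
from dataclasses import dataclass

FRAME_TIME_MS = 33.0  # frame time for a 30 FPS target, as in the slides


@dataclass
class Context:
    """Hypothetical bookkeeping for one GPU context (values in ms)."""
    name: str
    begin_frame_ms: float  # when the current frame began
    prev_frame_ms: float   # GPU time the previous frame took
    used_ms: float         # GPU time used so far on the current frame

    @property
    def deadline_ms(self):
        # A frame is due one frame time after it began.
        return self.begin_frame_ms + FRAME_TIME_MS

    @property
    def estimated_remaining_ms(self):
        # Assumption: the current frame costs about as much as the previous one.
        return max(self.prev_frame_ms - self.used_ms, 0.0)


def edf_order(contexts):
    """Sort contexts so the earliest deadline is served first."""
    return sorted(contexts,
                  key=lambda c: (c.deadline_ms, c.estimated_remaining_ms))
```

Under this ordering, a context whose frame began earlier has the earlier deadline and is therefore served first.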

Round-Robin Scheduling

[Animated timeline, 0 to 40ms, with the frame deadline at 33ms: the games’ command buffers scheduled round-robin.]

Tomb Raider is the only game that managed to complete its frame before the deadline.


First Attempt – Earliest Deadline First

[Animated timeline, 0 to 40ms, with the frame deadline at 33ms: the games’ command buffers scheduled by earliest deadline. Caption: “This game has already started, so its priority is higher.”]

Both Tomb Raider & MotoGP15 completed their frames before the deadline.

Results

• 10 games running concurrently

• UX is improved: frame interval variance is reduced significantly

[Figure: two histograms of frame intervals (ms), x-axis 0 to ~96ms, y-axis 0 to 1400, series “Sum”: Windows GPU Scheduler vs. EDF GPU Scheduler.]


• Drawbacks:

  • Tries to schedule more than 100% capacity worth of work

  • Greedy workloads get the highest priority

  • The innocents suffer from low FPS and stuttering

Proposed New Scheduling Algorithm

• The proposed new algorithm uses a combination of two principles:

  • Each process gets a time quantum.

    • If the time quantum is depleted before finishing the frame, the process may not submit further tasks for execution.

    • The time given to all processes combined will always be equal to the global frame time (for example, 33ms).

  • Amongst those with available time quantum, assign priorities using:

    • Deadline.

    • Other schemes.

Definitions

• $n$ – Number of running processes.

• $i$ – The index of a process (counting from 1).

• $T_i$ – The time the previous frame took for process $i$.

• $EFT_i$ – The expected time a single frame will take for process $i$.

• $D_i$ – How much process $i$ exceeded its expected frame time in the previous frame: $D_i = \max(T_i - EFT_i, 0)$, so $D_i \ge 0$.

• $Time(i)$ – The new time quantum process $i$ receives.

• $FT$ – The global Frame Time. This dictates the deadlines. For example, for a 30FPS target, the FT is ~33.33ms.

Calculating Time Quantum

1. If $\sum_{i=1}^{n} T_i = 0$ (utilization 0%):

$$Time(i) = FT$$

2. Else if $0 < \sum_{i=1}^{n} T_i \le FT$ (utilization ≤ 100%):

$$Time(i) = \frac{T_i}{\sum_{j=1}^{n} T_j} \cdot FT$$

3. Else (utilization > 100%):

$$Time(i) = T_i - \frac{\sum_{j=1}^{n} T_j - FT}{\sum_{j=1}^{n} D_j} \cdot D_i$$

• If $Time(i) \le 0$, then $Time(i) = T_i$
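The three cases above can be sketched in Python as follows. This is an illustrative transcription of the formulas, not the production scheduler; the function and parameter names are hypothetical. The slides do not say what happens when utilization exceeds 100% while no process overran its expected time (the sum of all $D_j$ is 0), so the sketch assumes each process simply keeps $T_i$ in that case.

```python
def time_quanta(prev_times, expected_times, frame_time):
    """Return Time(i) for every process.

    prev_times     -- T_i, time the previous frame took per process (ms)
    expected_times -- EFT_i, expected frame time per process (ms)
    frame_time     -- FT, the global frame time (ms)
    """
    total = sum(prev_times)

    # Case 1: no history yet, every process may use the whole frame time.
    if total == 0:
        return [frame_time for _ in prev_times]

    # Case 2: the GPU is not oversubscribed, scale the shares up to FT.
    if total <= frame_time:
        return [t / total * frame_time for t in prev_times]

    # Case 3: oversubscribed, take the excess away from the processes
    # that overran, in proportion to their overrun D_i.
    overruns = [max(t - e, 0.0) for t, e in zip(prev_times, expected_times)]
    total_overrun = sum(overruns)
    if total_overrun == 0:
        # Assumption: not covered by the slides, keep T_i unchanged.
        return list(prev_times)

    excess = total - frame_time
    quanta = []
    for t, d in zip(prev_times, overruns):
        q = t - excess / total_overrun * d
        quanta.append(q if q > 0 else t)  # fallback rule: Time(i) = T_i
    return quanta
```

For instance, `time_quanta([10, 16, 10], [10, 10, 8], 33)` takes the 3ms of excess only from the two processes that overran, splitting it 6:2 between them.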

Example 1: Utilization < 100%

n = 3, FT = 33ms

Process   T_i (ms)   EFT_i (ms)   D_i (ms)
P1        8          8            0
P2        12         13           0
P3        5          3            2
Total     25                      2

For example, $D_1 = T_1 - EFT_1 = 0$.

Utilization: $\frac{\sum_{i=1}^{3} T_i}{FT} = \frac{25}{33} \approx 76\%$

Example 1: Utilization < 100%

• Here’s the time quantum each process will get for the current frame. Since $0 < \sum_{i=1}^{n} T_i \le FT$, rule 2 applies: $Time(i) = \frac{T_i}{\sum_{j=1}^{n} T_j} \cdot FT$

$$Time(1) = \frac{T_1}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{8}{25} \cdot 33 = 10.56\,ms$$

$$Time(2) = \frac{T_2}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{12}{25} \cdot 33 = 15.84\,ms$$

$$Time(3) = \frac{T_3}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{5}{25} \cdot 33 = 6.6\,ms$$

Utilization: $\frac{\sum_{i=1}^{3} Time(i)}{FT} = \frac{33}{33} = 1 \to 100\%$

Example 2: Utilization > 100%

n = 3, FT = 33ms

Process   T_i (ms)   EFT_i (ms)   D_i (ms)
P1        10         10           0
P2        16         10           6
P3        10         8            2
Total     36                      8

For example, $D_1 = T_1 - EFT_1 = 0$.

Utilization: $\frac{\sum_{i=1}^{3} T_i}{FT} = \frac{36}{33} \approx 109\%$

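Example 2: Utilization > 100%

• Since $\sum_{i=1}^{n} T_i > FT$, rule 3 applies: $Time(i) = T_i - \frac{\sum_{j=1}^{n} T_j - FT}{\sum_{j=1}^{n} D_j} \cdot D_i$

$$Time(1) = T_1 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_1 = 10 - \frac{36 - 33}{8} \cdot 0 = 10 - \frac{3}{8} \cdot 0 = 10\,ms$$

$$Time(2) = T_2 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_2 = 16 - \frac{36 - 33}{8} \cdot 6 = 16 - \frac{3}{8} \cdot 6 = 13.75\,ms$$

$$Time(3) = T_3 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_3 = 10 - \frac{36 - 33}{8} \cdot 2 = 10 - \frac{3}{8} \cdot 2 = 9.25\,ms$$

Utilization: $\frac{\sum_{i=1}^{3} Time(i)}{FT} = \frac{33}{33} = 1 \to 100\%$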

Calculating Priorities

• To address the case where we have several processes with enough time quantum, each process also gets a priority

• Priorities are given based on the deadline by using Earliest Deadline First

• Other schemes may be used. For example, we could take into account the amount of time by which the process exceeded its expected frame time
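Combining the two principles, the choice of which context runs next might look like the sketch below. It assumes a hypothetical `SchedContext` record that tracks each context’s deadline, remaining time quantum, and whether it has a command buffer waiting; none of these names come from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchedContext:
    """Hypothetical scheduler-side state for one context (values in ms)."""
    name: str
    deadline_ms: float       # deadline of the frame currently being built
    quantum_left_ms: float   # time quantum not yet consumed in this period
    has_pending_cb: bool = True


def pick_next(contexts: List[SchedContext]) -> Optional[SchedContext]:
    """Pick the context whose command buffer should be submitted next.

    Contexts that have depleted their quantum are stalled; among the rest,
    the earliest deadline wins. Returns None when nothing is eligible.
    """
    eligible = [c for c in contexts
                if c.has_pending_cb and c.quantum_left_ms > 0]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.deadline_ms)
```

Swapping the `key` function is where another priority scheme, such as one based on how far a process exceeded its expected frame time, would plug in.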

Example 1: Context given enough QT

• Context received 12ms time quantum

• Finished frame @ 10ms

• QT left: 2ms

[Timeline: BeginFrame at 0ms, EndFrame at 10ms, Deadline and next BeginFrame at 33ms.]

Example 2: Context not given enough QT

• Context received 10ms time quantum, needs 14ms

• FPS drops to 27FPS

[Timeline: BeginFrame at 0ms. At 10ms the context is out of time quantum and all future CBs must wait. At the 33ms deadline a new time quantum is given. EndFrame and next BeginFrame at 37ms.]
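The arithmetic behind the two timelines above can be checked with a small model. It is a simplification that assumes the context runs uninterrupted while it has quantum left and receives the same quantum at every deadline; the function names are hypothetical.

```python
import math


def frame_finish_ms(needed_ms, quantum_ms, frame_time_ms):
    """Time from BeginFrame until the frame's GPU work completes.

    The context works until its quantum runs out, stalls until the next
    deadline, receives a fresh quantum, and continues.
    """
    if quantum_ms <= 0:
        raise ValueError("quantum must be positive")
    # Number of whole periods in which the quantum is fully consumed
    # before the period in which the frame finishes.
    full_periods = math.ceil(needed_ms / quantum_ms) - 1
    leftover = needed_ms - full_periods * quantum_ms
    return full_periods * frame_time_ms + leftover


def effective_fps(needed_ms, quantum_ms, frame_time_ms):
    """Frame rate, given that a new frame never begins before the deadline."""
    finish = frame_finish_ms(needed_ms, quantum_ms, frame_time_ms)
    return 1000.0 / max(finish, frame_time_ms)
```

With a 10ms quantum and 14ms of work, the frame ends at 33 + 4 = 37ms, and 1000 / 37 gives roughly 27 FPS, matching the slide.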

Results

• Let’s take a look at a video showing the scheduler’s result

  • GPU utilization is ~105%

  • Six concurrent games

Results

Purple command buffers are new frames released by the game


Thank You!

• You’re more than welcome to talk to me after the lecture or email me

Yuval [email protected]

• Please don’t forget to fill out the survey
