Running Multiple Workloads on a GPU: A UX-Oriented Approach
Yuval Sarna, Graphics Software Expert @ GameFly Streaming
Agenda
• Sharing the GPU
• We all like to Play
• Introduction to GPU Scheduling
• Proposed GPU Scheduler
• Summary & Q&A
What does it mean to “share the GPU”?
• Most modern applications use the GPU
• They all share the same hardware resources – CPU, RAM, GPU, etc.
• The GPU executes tasks coming from different processes, satisfying their needs – be it graphical hardware acceleration, GPGPU, etc.
The GPU Model
• Many physical cores but a single core computational model (no “SetAffinity”)
• Access model is FIFO, no fairness, no preemption
• Many processes use the GPU simultaneously, but it can process only one task at a time
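To make the FIFO, non-preemptive access model concrete, here is a minimal Python sketch. The task names and durations are invented for illustration and are not taken from the talk; the only point is that a short task queued behind a long one has to wait for it to finish.

```python
from collections import deque


def run_fifo(tasks):
    """Run tasks strictly in arrival order, one at a time, without preemption.

    tasks: list of (name, duration_ms) tuples in arrival order.
    Returns a dict mapping each task name to its completion time in ms.
    """
    queue = deque(tasks)
    clock_ms = 0
    completed = {}
    while queue:
        name, duration_ms = queue.popleft()
        clock_ms += duration_ms  # the task runs to completion; nothing can interrupt it
        completed[name] = clock_ms
    return completed


# Hypothetical workloads: a greedy 40 ms task arrives just before a 2 ms task.
completion = run_fifo([("greedy", 40), ("light", 2)])
```

In this sketch the light task completes at 42 ms even though it needs only 2 ms of GPU time, which is the lack of fairness the slide refers to.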
Why do we need to share the GPU?
• Cost Efficiency
• Cloud Environments
• Academic Super-Computers
Difficulties in Sharing the GPU Efficiently
• Running non-demanding applications in parallel is easy
  Not real-time – i.e., they don’t require low latency
• When it comes to running multiple demanding workloads on the GPU, sharing becomes difficult
  Which workload should execute now?
  How do we handle greedy workloads?
  What do we expect from a GPU sharing scheme?
Efficient GPU Sharing
• Utilizing the GPU
• Fairness of GPU between applications
• Smooth User Experience (UX)
Case Study – GameFly Streaming
• The game is running (and rendered) on a server
• Rendered frames are streamed as video in real time to the client
• Gamepad commands are sent back to the server
The Technology
• The game is running (and rendered) on a server
• Rendered frames are streamed as video in real time to the client
Definitions & Assumptions
[Diagram: processes, each owning one or more contexts; a GPU scheduler; a GPU with seven nodes (Node 0 to Node 6); and the command buffers queued for Node 2.]
Scheduling Efficiency
To measure the efficiency of a scheduling algorithm, we may look at two main factors:
• Maximum utilization of the GPU
  The algorithm should allow the GPU to be 100% utilized.
• Number of frames that missed their deadline
  That is, frames that took longer than their expected time.
• Ask your target audience
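The two factors above can be computed from measurements roughly as follows. This is a sketch under assumptions: the function name, its parameters, and the sample numbers are hypothetical, and the talk does not specify how the measurements are collected.

```python
def scheduling_metrics(gpu_busy_ms, window_ms, frame_times_ms, deadline_ms=33.0):
    """Return (utilization, missed_frames) for one measurement window.

    gpu_busy_ms:    total time the GPU spent executing work in the window
    window_ms:      length of the measurement window
    frame_times_ms: time each frame took from BeginFrame to completion
    deadline_ms:    per-frame deadline (33 ms corresponds to a ~30 FPS target)
    """
    utilization = gpu_busy_ms / window_ms
    missed_frames = sum(1 for t in frame_times_ms if t > deadline_ms)
    return utilization, missed_frames


# Hypothetical sample: 25 ms of GPU work in a 33 ms window, four frames observed.
utilization, missed = scheduling_metrics(25.0, 33.0, [30, 35, 33, 48])
```

A frame that takes exactly the deadline is counted as on time here; whether that is the right boundary is a design choice.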
Efficient GPU Sharing
• Utilizing the GPU
• Fairness of GPU between applications
• Smooth User Experience (UX)
[Timeline: BeginFrame at 0ms, deadline at 33ms; a sequence of command buffers (CB) in between, the last one marked with a question mark.]
Fairness
If life is unfair to everyone, isn’t life fair?
How is it done?
• Windows Display Driver Model
• Stall command buffers if they shouldn’t yet be submitted for GPU execution
[Figure: WDDM architecture.
User mode: Application; Direct3D runtime; user-mode display driver; OpenGL runtime; OpenGL installable client driver (ICD); Win32 GDI; kernel-mode access (gdi32.dll).
Kernel mode: Win32K.sys; DirectX graphics kernel subsystem (Dxgkrnl.sys), which includes the display port driver, video memory manager, and GPU scheduler; display miniport driver.]
Windows OS GPU Scheduler
• Round-Robin scheduling algorithm
• Let’s take a look at a video showing the issues
  GPU utilization is ~105%
  Six concurrent games, running on NVIDIA GRID K520 –
  • 5 × Overlord II
  • 1 × Alan Wake’s American Nightmare
A Look Behind the Scenes
Grey command buffers are new frames released by the game
[GPU trace, two views. Labeled intervals: ~142ms, ~48ms, ~57ms, ~35ms, ~80ms in the first view; ~24ms in the second.]
Why can it be done better?
• We know what kind of workloads we want to schedule
• We can set a target performance
• Our scheduler doesn’t have to be generic
GPU Resources
• For example, say we set the target performance to a rate of 30 frames per second (FPS)
• Each frame shouldn’t take more than ~33ms
• These are the GPU’s resources we have to manage and schedule
• We don’t allow running more than “33 blocks” worth of workloads concurrently
  But is it enough?
[Figure: the GPU’s per-frame budget drawn as 33 blocks, numbered 1 to 33.]
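The “33 blocks” rule can be read as a simple admission check, sketched below. The function names are hypothetical, and the per-frame costs are sample values chosen for illustration; the sketch assumes one block equals one millisecond of GPU time.

```python
FRAME_TIME_MS = 33  # ~1000 ms / 30 FPS, as on the slide


def fits_in_frame_budget(frame_costs_ms, frame_time_ms=FRAME_TIME_MS):
    """Return True if the workloads' per-frame GPU costs fit in one frame time."""
    return sum(frame_costs_ms) <= frame_time_ms


def free_blocks(frame_costs_ms, frame_time_ms=FRAME_TIME_MS):
    """Return how many 1 ms blocks are left (negative if over budget)."""
    return frame_time_ms - sum(frame_costs_ms)


# Hypothetical per-frame costs in ms for three workloads.
ok = fits_in_frame_budget([8, 12, 5])       # 25 blocks used
too_much = fits_in_frame_budget([10, 16, 10])  # 36 blocks requested
```

Passing this check says nothing about the order in which the work runs, so on its own it does not guarantee that every frame meets its deadline.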
First Attempt – Earliest Deadline First
• Prioritize CBs with earlier deadlines using the following data:
  The time it took the context to complete the previous frame
  The time a context has used so far to create the current frame
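One possible reading of this prioritization is sketched below. The talk does not spell out how the two measurements are combined, so the following is an assumption: each context's deadline is its frame start plus the frame time, the earliest deadline wins, and ties are broken in favor of the context with the most estimated work left (previous frame time minus time used so far). All names are hypothetical.

```python
from typing import NamedTuple

FRAME_TIME_MS = 33.0  # assumed 30 FPS target


class ContextState(NamedTuple):
    name: str
    frame_start_ms: float  # when the current frame began (BeginFrame)
    prev_frame_ms: float   # GPU time the previous frame took
    used_ms: float         # GPU time used so far for the current frame


def deadline_ms(ctx):
    """Deadline of the current frame: one frame time after it began."""
    return ctx.frame_start_ms + FRAME_TIME_MS


def estimated_remaining_ms(ctx):
    """Estimate the work left, assuming this frame costs what the previous one did."""
    return max(ctx.prev_frame_ms - ctx.used_ms, 0.0)


def pick_next(contexts):
    """Earliest deadline first; on a tie, prefer the context with more work left."""
    return min(contexts, key=lambda c: (deadline_ms(c), -estimated_remaining_ms(c)))


started = ContextState("started", frame_start_ms=0.0, prev_frame_ms=12.0, used_ms=4.0)
new = ContextState("new", frame_start_ms=6.0, prev_frame_ms=8.0, used_ms=0.0)
chosen = pick_next([new, started])
```

Under these assumptions a context whose frame began earlier has the earlier deadline, so it is chosen ahead of one that has only just begun.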
Round-Robin Scheduling
[Animation: a 0–40ms timeline (ticks at 0, 8, 16, 24, 32, 40ms) filled in step by step under round-robin scheduling, with the 33ms frame deadline marked.]
Tomb Raider is the only game that managed to complete its frame before the deadline.
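The effect shown in the animation can be reproduced with a small round-robin simulation. This is an illustrative sketch only: the game names and command-buffer durations are made up and are not the values from the slides, and command buffers are assumed to run to completion once started.

```python
from collections import deque


def round_robin(frames):
    """Simulate non-preemptive round-robin over command buffers (CBs).

    frames: dict mapping a game name to the list of CB durations (ms)
            that make up its current frame.
    Returns a dict mapping each game to the time (ms) its frame completed.
    """
    queue = deque((name, deque(cbs)) for name, cbs in frames.items())
    clock_ms = 0
    finished = {}
    while queue:
        name, cbs = queue.popleft()
        clock_ms += cbs.popleft()      # run one CB of this game to completion
        if cbs:
            queue.append((name, cbs))  # more CBs left: back of the line
        else:
            finished[name] = clock_ms  # last CB done: the frame is complete
    return finished


# Hypothetical frames totalling 40 ms of work against a 33 ms deadline.
finish = round_robin({"A": [4, 4], "B": [4, 4, 4, 4], "C": [4, 4, 4, 4]})
on_time = [name for name, t in finish.items() if t <= 33]
```

With these sample numbers only game A finishes before the 33 ms deadline (at 16 ms); B and C finish at 36 ms and 40 ms because their work is interleaved rather than prioritized.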
First Attempt – Earliest Deadline First
[Animation: the same 0–40ms timeline filled in step by step under EDF scheduling, with the 33ms frame deadline marked. Annotation: “This game has already started, so its priority is higher.”]
Both Tomb Raider & MotoGP15 completed their frames before the deadline.
Results
• 10 games running concurrently
• UX is improved – frame interval variance is reduced significantly
[Figure: two histograms of frame intervals, one for the Windows GPU Scheduler and one for the EDF GPU Scheduler. X-axis: frame interval (ms); y-axis: 0–1400; plotted series: Sum.]
First Attempt – Earliest Deadline First
• Drawbacks:
  Tries to schedule more than 100% capacity worth of work
  Greedy workloads get the highest priority
  The innocents suffer from low FPS and stuttering
Proposed New Scheduling Algorithm
• The proposed new algorithm uses a combination of two principles:
  Each process gets a time quantum.
  • If the time quantum is depleted before finishing the frame, the process may not submit further tasks for execution.
  • The total time given to all processes will always be equal to the global frame time (for example, 33ms).
  Amongst those with available time quantum, assign priorities using:
  • Deadline.
  • Other schemes
Definitions
• $n$ – Number of running processes.
• $i$ – The index of a process (counting from 1).
• $T_i$ – The time the previous frame took for process $i$.
• $EFT_i$ – The expected time a single frame will take for process $i$.
• $D_i$ – By how much process $i$ exceeded its expected frame time in the previous frame: $D_i = \max(T_i - EFT_i, 0)$, so $D_i \ge 0$.
• $Time(i)$ – The new time quantum process $i$ receives.
• $FT$ – The global Frame Time. This dictates the deadlines. For example, for a 30FPS target, the FT is ~33.33ms.
Calculating Time Quantum
1. Utilization = 0%. If $\sum_{i=1}^{n} T_i = 0$:
   $Time(i) = FT$
2. Utilization ≤ 100%. Else if $0 < \sum_{i=1}^{n} T_i \le FT$:
   $Time(i) = \frac{T_i}{\sum_{j=1}^{n} T_j} \cdot FT$
3. Utilization > 100%. Else:
   $Time(i) = T_i - \frac{\sum_{j=1}^{n} T_j - FT}{\sum_{j=1}^{n} D_j} \cdot D_i$
   • If $Time(i) \le 0 \rightarrow Time(i) = T_i$
Example 1 – Utilization < 100%
n = 3, FT = 33ms

Process   T_i (ms)   EFT_i (ms)   D_i (ms)
P1        8          8            0
P2        12         13           0
P3        5          3            2
Total     25                      2

For example, $D_1 = T_1 - EFT_1 = 8 - 8 = 0$.
Utilization: $\frac{\sum_{i=1}^{3} T_i}{FT} = \frac{25}{33} \approx 76\%$
Example 1 – Utilization < 100%
• Here’s the time quantum each process will get for the current frame. Since $0 < \sum_{i=1}^{n} T_i \le FT$, we use $Time(i) = \frac{T_i}{\sum_{j=1}^{n} T_j} \cdot FT$:
  $Time(1) = \frac{T_1}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{8}{25} \cdot 33 = 10.56$ ms
  $Time(2) = \frac{T_2}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{12}{25} \cdot 33 = 15.84$ ms
  $Time(3) = \frac{T_3}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{5}{25} \cdot 33 = 6.6$ ms
• Utilization: $\frac{\sum_{i=1}^{3} Time(i)}{FT} = \frac{33}{33} = 1 \rightarrow 100\%$
Example 2 – Utilization > 100%
n = 3, FT = 33ms

Process   T_i (ms)   EFT_i (ms)   D_i (ms)
P1        10         10           0
P2        16         10           6
P3        10         8            2
Total     36                      8

For example, $D_1 = T_1 - EFT_1 = 10 - 10 = 0$.
Utilization: $\frac{\sum_{i=1}^{3} T_i}{FT} = \frac{36}{33} \approx 109\%$
Example 2 – Utilization > 100%
Since $\sum_{i=1}^{n} T_i > FT$, we use $Time(i) = T_i - \frac{\sum_{j=1}^{n} T_j - FT}{\sum_{j=1}^{n} D_j} \cdot D_i$:
• $Time(1) = T_1 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_1 = 10 - \frac{36 - 33}{8} \cdot 0 = 10 - \frac{3}{8} \cdot 0 = 10$ ms
• $Time(2) = T_2 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_2 = 16 - \frac{36 - 33}{8} \cdot 6 = 16 - \frac{3}{8} \cdot 6 = 13.75$ ms
• $Time(3) = T_3 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_3 = 10 - \frac{36 - 33}{8} \cdot 2 = 10 - \frac{3}{8} \cdot 2 = 9.25$ ms
Utilization: $\frac{\sum_{i=1}^{3} Time(i)}{FT} = \frac{33}{33} = 1 \rightarrow 100\%$
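The three time-quantum rules and both worked examples can be checked with a short Python sketch. The function and parameter names are hypothetical. One case is not covered by the slides: total time above FT while no process exceeded its expected frame time (the sum of all D_j is 0). The fallback to proportional scaling in that case is an assumption of this sketch, not part of the talk.

```python
def time_quanta(prev_times, expected_times, frame_time=33.0):
    """Compute Time(i) for each process from T_i, EFT_i and FT.

    prev_times:     T_i, the time the previous frame took per process (ms)
    expected_times: EFT_i, the expected frame time per process (ms)
    frame_time:     FT, the global frame time (ms)
    """
    total = sum(prev_times)
    if total == 0:
        # Rule 1: nothing ran in the previous frame, every process gets FT.
        return [frame_time] * len(prev_times)
    if total <= frame_time:
        # Rule 2: scale each T_i up so the quanta add up to FT.
        return [t / total * frame_time for t in prev_times]

    # Rule 3: take the excess away in proportion to each overrun D_i.
    overruns = [max(t - e, 0.0) for t, e in zip(prev_times, expected_times)]
    total_overrun = sum(overruns)
    if total_overrun == 0:
        # ASSUMPTION (not in the slides): avoid dividing by zero by scaling
        # every T_i down proportionally instead.
        return [t / total * frame_time for t in prev_times]

    excess = total - frame_time
    quanta = []
    for t, d in zip(prev_times, overruns):
        q = t - excess / total_overrun * d
        quanta.append(q if q > 0 else t)  # "If Time(i) <= 0 -> Time(i) = T_i"
    return quanta


example_1 = time_quanta([8, 12, 5], [8, 13, 3], 33)     # utilization below 100%
example_2 = time_quanta([10, 16, 10], [10, 10, 8], 33)  # utilization above 100%
```

With the inputs from the two examples, the quanta come out as roughly 10.56, 15.84 and 6.6 ms, and 10, 13.75 and 9.25 ms respectively; in both cases they add up to the 33 ms frame time.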
Calculating Priorities
• To address the case where we have several processes with enough time quantum, each process also gets a priority
• Priorities are given based on the deadline by using Earliest Deadline First
• Other schemes may be used. For example, we could take into account the amount of time by which the process exceeded its expected frame time
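Combining the two principles, a selection step might look like the sketch below: only contexts with time quantum left are eligible, and among those the earliest deadline wins. The type, field names, and sample values are hypothetical and are not taken from the talk.

```python
from typing import NamedTuple, Optional, Sequence


class SchedContext(NamedTuple):
    name: str
    quantum_left_ms: float  # time quantum still available for this frame
    deadline_ms: float      # absolute deadline of the current frame


def pick_runnable(contexts: Sequence[SchedContext]) -> Optional[SchedContext]:
    """Return the eligible context with the earliest deadline, or None.

    A context whose time quantum is depleted is skipped: its command
    buffers must wait until a new quantum is handed out.
    """
    eligible = [c for c in contexts if c.quantum_left_ms > 0]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.deadline_ms)


# Hypothetical state: the context with the earliest deadline has no quantum left.
contexts = [
    SchedContext("greedy", quantum_left_ms=0.0, deadline_ms=33.0),
    SchedContext("game_a", quantum_left_ms=4.0, deadline_ms=40.0),
    SchedContext("game_b", quantum_left_ms=6.0, deadline_ms=36.0),
]
next_ctx = pick_runnable(contexts)
```

Here the greedy context is skipped despite having the earliest deadline, and game_b runs next.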
Example 1 – Context given enough time quantum (QT)
• Context received 12ms time quantum
• Finished frame @ 10ms
• QT left – 2ms
[Timeline: BeginFrame at 0ms; EndFrame at 10ms; deadline and next BeginFrame at 33ms.]
Example 2 – Context not given enough QT
• Context received 10ms time quantum, needs 14ms
• FPS drops to 27FPS
[Timeline: BeginFrame at 0ms; out of time quantum at 10ms, so all future CBs must wait; new time quantum given at the 33ms deadline; EndFrame and next BeginFrame at 37ms.]
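The 27 FPS figure follows from the timeline: the frame ends at 37 ms, and 1000 / 37 ≈ 27. A small sketch of that arithmetic is shown below. It assumes, as in the timeline, that the leftover work runs right after the new quantum is handed out at the deadline and fits inside it; the function name is hypothetical.

```python
def frame_interval_ms(quantum_ms, needed_ms, frame_time_ms=33.0):
    """Time from one BeginFrame to the next for a single context.

    If the quantum covers the frame, the next frame begins at the deadline.
    Otherwise the leftover work waits for the new quantum handed out at the
    deadline and is assumed to run immediately after it.
    """
    if needed_ms <= quantum_ms:
        return frame_time_ms
    return frame_time_ms + (needed_ms - quantum_ms)


enough = frame_interval_ms(quantum_ms=12, needed_ms=10)      # Example 1
not_enough = frame_interval_ms(quantum_ms=10, needed_ms=14)  # Example 2
fps_not_enough = 1000.0 / not_enough
```

For Example 1 the interval stays at 33 ms, while for Example 2 it grows to 37 ms, which corresponds to roughly 27 frames per second.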
Results
• Let’s take a look at a video showing the scheduler’s result
  GPU utilization is ~105%
  Six concurrent games, running on NVIDIA GRID K520 –
  • 5 × Overlord II
  • 1 × Alan Wake’s American Nightmare
Results
Purple command buffers are new frames released by the game
Thank You!
• You’re more than welcome to talk to me after the lecture or email me
Yuval [email protected]
• Please don’t forget to fill out the survey