gpu direct for video - nvidia...unified data transfer api for all graphics and compute apis’...

25
GPU Direct for Video

Upload: others

Post on 06-Mar-2021

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

GPU Direct for Video

Page 2: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Agenda

Video Processing Workflow

Video I/O with GPU

GPUDirect for Video

Why Not Peer to Peer

SDK Availability

Conclusions

Page 3: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video In GPU

Video Processing Workflow

Video In Video Out

Video In

GPU

GPU

Video Out

Video Out

Page 4: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video I/O with GPU: transfer path

CPU Memory

SYSTEM

Video 3rd Party

Input/Output

Card GPU

Page 5: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video I/O with GPU: inefficiencies

I/O <-> GPU through GPU

unshareable system

memory buffer

GPU synchronous transfers

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

F2

F2

F2

F2

F2 CPU Memcopy

CPU Memcopy F2

vsync vsync

F2

Scan Out

Scan In F3

F1

2 frames of latency

Page 6: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video I/O with GPU: improvements

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

F4

F2 CPU Memcopy

CPU Memcopy F4 Increasing latency

F3

F2

Scan Out

Scan In F5

F1

vsync vsync

F3

F3

4 frames of latency!

Page 7: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video I/O with GPU: improvements

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

F2

F2 CPU Memcopy

CPU Memcopy F2 I/O <-> GPU through GPU

shareable system memory

buffer

F2

F2

F2

F2

Scan Out

Scan In F3

F1

vsync vsync

2 frames of latency

Page 8: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video I/O with GPU: improvements

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

F2

F2

F2

Overlapping I/O and GPU

DMAs by using:

— sub-field chunks

F2

F2

vsync vsync

Scan Out

Scan In F3

F1

2 frames of latency

Page 9: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video I/O with GPU: improvements

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

F2

F2

F2

F2

Overlapping GPU DMAs and

GPU processing by using:

— sub-field chunks

— GPU asynchronous transfers

F2

vsync vsync

Scan Out

Scan In F3

F1

2 frames of latency

Page 10: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

F2

F2

F3

F3

Video I/O with GPU: improvements

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

Overlapping GPU DMAs and

GPU processing by using:

— GPU asynchronous transfers

— Increasing latency

F3

vsync vsync

Scan Out

Scan In F4

F1

3 frames of latency

Page 11: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

F2

F2

F4

F4

Video I/O with GPU: improvements

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA

Overlapping GPU DMAs and

GPU processing by using:

— GPU asynchronous transfers

— Increasing latency even

more

F3

vsync vsync

Scan Out

Scan In F5

F1

4 frames of latency = processing can occupy the entire frame!

Page 12: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

GPUDirect for Video

GPU Upload

GPU Processing

GPU Readback

Input DMA

Output DMA F2

Unified data transfer API

for all Graphics and

Compute APIs’ objects

— Video oriented

— Efficient synchronization

— GPU shareable system

memory

— Sub-field transfers

— GPU Asynchronous transfers

F2

F2

F2

vsync vsync

Scan Out

Scan In F3

F1

2 frames of latency!

F2

Page 13: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

GPUDirect for Video: Application usage

Not this

Application

3rd Party Video I/O SDK

OpenGL CUDA DX GPU

Direct

for

Video

3rd Party Video I/O

Device driver NVIDIA Driver

Application

OpenGL CUDA DX GPU

Direct

for

Video

3rd Party Video I/O

Device driver NVIDIA Driver

But this:

Page 14: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

GPUDirect for Video:Application Usage

Use the SDK Provided by Your Preferred Video I/O Vendor

Page 15: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

GPUDirect for Video: Application Usage Video Capture to OpenGL Texture main()

{

…..

GLuint glTex;

glGenTextures(1, &glTex); \\ Create OpenGL texture obect

glBindTexture(GL_TEXTURE_2D, glTex);

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, bufferWidth, bufferHeight, 0, 0, 0, 0);

glBindTexture(GL_TEXTURE_2D, 0);

EXTRegisterGPUTextureGL(glTexIn); \\ Register texture with 3rd party Video I/O SDK

while(!quit)

{

EXTBegin(glTexIn); \\ Release texture from Video I/O SDK

Render(glTexIn); \\ Use the texture

EXTEnd(glTexIn); \\ Release texture back to Video I/O SDK

}

EXTUnregisterGPUTextureGL(glTexIn); \\ Unregister texture with 3rd party Video I/O SDK

}

Page 16: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Results

Capture Latency Playout Latency

ExtDev to SysMem SysMem to GPU GPU to SysMem SysMem to ExtDev

1.4ms 1.5ms 1.5ms 1.5ms

2.9 ms 3.0ms

Optimal transfer time for 4-component 8-bit 1080p video:

* Although Gen 2 PCI Express bandwidth is specified at 8.0GB / sec, the maximum achievable is ~6.0 GB/sec

Direct to GPU video transfer time for 4-component 8-bit 1080p video:

transfer time ≅𝑓𝑟𝑎𝑚𝑒 𝑠𝑖𝑧𝑒

𝑃𝐶𝐼𝐸 𝑏𝑎𝑛𝑑𝑤𝑖𝑑𝑡ℎ ∗ 2 ≅

497664000 𝑏𝑦𝑡𝑒𝑠 𝑝𝑒𝑟 𝑠𝑒𝑐𝑜𝑛𝑑

6000000000∗ 𝑏𝑦𝑡𝑒𝑠 𝑝𝑒𝑟 𝑠𝑒𝑐𝑜𝑛𝑑

∗ 1

60 ∗ 2

𝑡𝑟𝑎𝑛𝑠𝑓𝑒𝑟 𝑡𝑖𝑚𝑒 ≅ 2.397 msec

Page 17: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Results

Without Direct to GPU Transfers

With Direct to GPU Transfers

Direct to GPU Transfers with Chunking

time

16.6 ms

vsync vsync

Capture

GPU Processing

Readback

3.8 ms 8.9 ms 3.9 ms

2.9 ms 10.7 ms 3.0 ms

1.5 ms

13.6 ms 1.6 ms 1.6 ms

1.7 ms

Page 18: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video I/O with GPU: transfer path

CPU Memory

SYSTEM

Video 3rd Party

Input/Output

Card GPU

Why not

Peer-to-Peer?

Page 19: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

NVIDIA Digital Video Pipeline: Peer-to-Peer Communication

CPU Memory

SYSTEM

SDI Video SDI Video

NVIDIA SDI capture and output cards only

Quadro®

SDI Output

Quadro® GPU

Compute + Graphics

Quadro®

SDI Capture

Page 20: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

F3 F4

NVIDIA Digital Video Pipeline

GPU blit/

Chroma expansion

GPU Processing

Input DMA

Output DMA

The Input DMA is peer-to-

peer of 4 inputs

The output is a direct

scanout from the GPU

BUT…

Fixed 3 frames of latency

Each I/O board must be

tied to a single GPU

F2

F3

F2

vsync vsync

Scan Out

Scan In

F1

3 frames of latency!

Page 21: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

NVIDIA GPUDirect™ v2.0: Peer-to-Peer Communication

Direct Transfers between GPUs with CUDA

GPU1

GPU1

Memory

GPU2

GPU2

Memory

PCI-e

CPU

Chip

set

System

Memory

Only one copy required: 1. cudaMemcpy(GPU2, GPU1)

Page 22: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

NVIDIA GPUDirect™ now supports RDMA

GPU1 GPU2

PCI-e

System

Memory GDDR5

Memory GDDR5

Memory

CPU

Network

Card

Server 1

PCI-e

GPU1 GPU2

GDDR5

Memory GDDR5

Memory

System

Memory

CPU

Network

Card

Server 2

Network

Linux only, HPC centric

Page 23: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Why Not Peer-to-Peer?

• Supportability

• Same performance

• IOH-to-IOH communication issues

• Limited PCIE slot options due to lane allocations

• Support Graphics APIs as well as CUDA

• Multi-GPU Support!

• Multi-OS support!

Page 24: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Video I/O Card Vendors

Use the NVIDIA GPU Direct for Video SDK:

http://developer.nvidia.com/nvidia-

gpudirect%E2%84%A2-video

Samples (OpenGL, D3D9, D3D11, CUDA)

Programming Guide

Windows7, Linux

Static and Shared Libraries

Page 25: GPU Direct for Video - NVIDIA...Unified data transfer API for all Graphics and Compute APIs’ objects —Video oriented —Efficient synchronization —GPU shareable system memory

Conclusions

Lowest latency video I/O in and out of the GPU

Optimal transfer times

Optimal GPU processing time

Supports OpenGL and DirectX as well as CUDA

Does not require sophisticated programming.

Scales to multiple GPUs