gdc_14_directx advancements in the many-core era getting the most out of the pc platform

Upload: jacktor1

Post on 02-Jun-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    1/38

    DIRECTX

    ADVANCEMENTS IN THE MANY CORE ERA

    Getting the most out of the PC Platform

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    2/38

    System Architecture Trends

    CPU Evolution

    From single core to multi-corePower wall limits frequency scaling

    GPU EvolutionUnified processor cores

    Multiple hardware engines

    MMU and paged memories

    More autonomous, fully

    programmable machines

    Highly Integrated SoC Designs

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    3/38

    GeForce FX 5800GeForce 6800 Ultra

    GeForce 7800 GTX

    GeForce 8800 GTX

    GeForce GTX 280

    GeForce GTX 480

    GeForce GTX 580

    GeForce GTX 680

    GeForce GTX TITAN

    GeForce 780 Ti

    Pentium 4

    Woodcrest

    Harpertown

    BloomfieldWestmere

    Sandy BridgeIvy Bridge

    0

    500

    1000

    1500

    2000

    2500

    3000

    3500

    4000

    4500

    5000

    5500

    6000

    Apr-01 Jan-04 Oct-06 Jul-09 Apr-12 Dec-14

    NVIDIA GPU Single Precision

    Intel CPU Single Precision

    Northwood

    PrescottWoodcrest

    Harpertown

    Blo

    GeForce FX 5900

    GeForce 6800 GT

    GeForce 7800 GTX

    GeForce 8800 GTX

    GeForce GTX 280

    GeForce GT

    0

    30

    60

    90

    120

    150

    180

    210

    240

    270

    300

    330

    360

    2003 2005 2007 2

    CPU

    GeForce GPU

    GPU vs CPU Perf Scaling

    GFLOPS

    GB/s

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    4/38

    Application Trends

    From Functional Parallelism to Task-

    based ParallelismGraphs of jobs, work stealing schedulers

    GPU as a General Purpose ProcessorPhysics & simulation, Programmable Graphics

    Requires more control over underlying hardware

    Content is KingShifting balance between runtime costs vs

    production costsFrom

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    5/38

    Current API Abstraction is Getting

    Designed After >20 Years Old Machine Abstraction

    Large State Machine AbstractionState transitions orchestrated by the CPU

    Opaque Memory ModelResource management hidden from the app

    Limited Multi-core ScalabilitySerial submission model

    Implicit Hazard Tracking and SynchronizationHigh work submission costs

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    6/38

    Radical Reductions in Submission

    Embrace Latest GPU ArchitecturImprovements

    Highly Scalable on Multi-core Sys

    Provide Console-like Execution E

    Designed in Close Collaboration

    Microsoft and NVIDIA

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    7/38

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    8/38

    DX12 Nutsand Bolts

    State Management Model

    Command Lists

    Resource Binding And Hazard Tra

    Residency Management And Mem

    CPU/GPU Synchronization

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    9/38

    State Management

    Further State Grouping and FactorizationProvide opportunities for aggressive pre-baking

    As much cost as possible is paid up-front

    Pipeline State Encapsulates Heavy-weight StateBits of state specified and used together

    More Frequently Changing State Still Lives

    Outside PSOsLighter weight state changes

    ID3D12Pipel

    ID3D1

    ID3D11Ra

    ID3D11

    ID3D11Dep

    ID3D11I

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    10/38

    PSO and Non-PSO State

    PSO Non-PSO

    Shader Program at Every Stage Resource Bindings

    Blend State Viewport Mappings

    Rasterizer State Scissor Rectangles

    Depth-stencil State Blend Factor

    Input Layout Depth Test

    Render Target Properties Stencil

    Input Topology Type Input Topology Bucket (List v

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    11/38

    New Binding Model

    Current GPUs Can Reference a Large Namespace of ResourcSo called bindless model

    Resource Binding State is Pulled Into the SystemRather than CPU pushing it - more scalable and efficient

    Can Reference a Very Large Working Set of ResourcesE.g. Kepler GPUs can use a palette of > 1M textures

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    12/38

    Shader Code

    Bindless BenefitsEssential for raytracing and next-gen rendering

    where resource working set is not known in advan

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    13/38

    Descriptors

    Descriptors encapsulate handles to resources

    Semantically equivalent to ID3D11*ResourceView objects

    Opaque bag of bits, of implementation-specific format and size

    Expected to be on the order of 64B-128B

    No driver-side allocations associated with descriptors

    Can be freely copied around and thrown away by the app

    API functions to create descriptors given app-provided spec

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    14/38

    Descriptor Tables and Heaps

    Descriptor Tables Encapsulate Palettes

    of Resource ReferencesContiguous arrays of descriptor entries

    Individual descriptors indexable from the GPU

    Tables Defined in Descriptor Heaps

    Driver provided memoryImplementation-specific size restrictions

    Represent Unit of Resource Binding StateAllow for bulk binding of resources

    Descriptor Heap

    SR

    SR

    SR

    Table #0

    SR

    SR

    SR

    Table #1

    SR

    SRTable #6

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    15/38

    Descriptors, Tables, and Heaps

    Tables Very Cheap to Switch

    Ideally just a pointer change to the HW

    Efficent bulk binding state change

    A Collection of Tables Can be Set at a Time

    Shaders can select a table to use for indexingRestrictions based on resource type

    Applications Responsible for Managing Tables Within Heaps

    Various strategies, balancing heap space vs performance

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    16/38

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    17/38

    Bundles

    Highly Reusable Rendering Components

    A kind of command list

    Can Only be Executed From Direct Command Lists

    Non-pipeline State Inherited from the Calling Command List

    Provides for reuse flexibility

    Represent Individual Objects in a SceneCollection of draws and state changes

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    18/38

    Residency Management And Heap

    Heaps Represent Bulk Memory Allocation

    Unit of OS/driver memory management

    Explicitly Separate From The Binding ModelIndividual resource bindings no longer tracked

    Residency Managed At Coarse Grain Per HeapApplications responsible for managing memory within heaps

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    19/38

    Hazard Tracking

    Implicit Hazard Tracking Becoming Hard and Impractical

    Tiled resources allow for memory aliasing

    Very large palette of resources can be referenced

    Applications Use Explicit Barriers to Signal HazardsE.g. resource transitioning from RTV to SRV

    Can exploit app-level knowledge and be less conservative

    Robustness vs Efficiency Trade off

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    20/38

    CPU/GPU Synchronization

    Hazards Between Multiple Concurrent Processors Managed W

    CPU sharing data with the GPU

    Applications Responsible for Setting Fences and Tracking ThCan use standard OS synchronization APIs

    Renaming and versioning optimizations are responsibility ofDynamic Resources are Effectively Permanently Mapped

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    21/38

    Diving IntoNitrous

    Nitrous = Oxides custom engi

    Specifically designed for high

    Core neutral. Main thread act

    lightweight sequencer

    All work divided up into small

    are in the microsecond range

    Can produce lots of jobs, 10,0

    per frame

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    22/38

    Multi-core CPU Basics

    Be Wary, There Is A Lot Of Very Bad Advice In The Wild

    Spawning threads to handle tasksRelying OS preemptive scheduler, heavy weight OS synchronization primitiv

    Functional threading in general

    Your Survival GuideOK: Multi-thread read of same location

    OK: Multi-thread write to different locations

    OK: Multi-thread write to same location in stamp mode

    CAUTION: Atomic instructions

    STOP: Multi-thread read/write to same location

    STOP: Multi-thread write to same CACHE line

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    23/38

    Even Frame

    Setting Up Our Frame

    Concept of Asynchronous GPU

    Now Exposed Through APIMust buffer frames ourselves

    Will create 2+ copies of everything

    CPU to GPU DataShader constants

    Texture updates

    GPU CommandsCommand group = self contained, no

    cross thread hazards

    Means WAW, RAW hazards must be

    marked in command buffers

    Data nor command should not be

    changed while GPU could be reading

    CommandBuffer

    GPU visible

    memory

    Command

    Buffer

    CommandBuffer

    ComBu

    Com

    Bu

    ComBu

    d b d

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    24/38

    But Commands May be Generated

    Inter Frame Data

    Job Scheduler

    Core 0

    Command

    Buffer

    Command

    Buffer

    Core 1 Core 2

    Command

    Buffer

    Command

    Buffer

    M d il b f

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    25/38

    More details about our frame

    In reality, diagram is over simplified

    Nitrous has its own internal command formatSmall, efficient commands

    Stateless, each command contains references to all needed state

    Inheritance unneeded

    Seperates internal graphics system from any particular API

    Being Stateless, can be generated completely out of order

    Entire Frame is queued up in internal command formatD3D12 is translated. Each internal command is translated, 1:1each command buffer to a D3D12 command list

    Gets more optimal use out of instruction cache and data cachAllows us surrender the entire system to the driver

    Sh d

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    26/38

    Shaders

    Shader Blocks

    Group of shaders with identicleresources

    Key point: No changing of individual

    shaders, shader considered onemonolithic block

    All resources bound at same places

    across all stages

    Maps well to a group of pipeline

    state objects in d3d12

    ShaderGroup SimpleShader

    {

    ResourceSetPrimitive = V

    ConstantSetDynamic[0] = D

    ResourceSetBatch[1] = UConstantSetShader[0] = Gl

    Methods

    {

    main:

    CodeBlocks = SimpleShade

    VertexShader = SimpleVSS

    PixelShader = SimplePSS

    zprime:

    CodeBlocks = SimpleShade

    VertexShader = SimpleVSS

    PixelShader = BlankSimp

    }

    }

    F F i f U d t

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    27/38

    Even FrameBatch Set

    Four Frequencies of Updates

    Batch 0 Batch 1 Batch 2 Batch 3 Batch 4

    Prim 0 Prim 1 Prim 2

    Prim 1 Prim 2

    PRIMITIVE

    BATCH SET

    IB

    Resources

    Tri info

    Pri

    Sh

    ReCo

    BatchesPrimitives

    Shaders

    RTs

    Blend State

    ReCo

    Sh

    R S t

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    28/38

    Resource Sets

    In Real World, Textures

    Are Grouped

    Nitrous Has 5 Bind Points2 for batch

    2 for shader

    1 for primitive

    VB Is Just A Resource Set

    Maps Well To D3d12 Descriptor Tables

    SPACE FIGHTER 1

    (0) Albiedo

    (1) Material Mask(2) Ambient Occlusion

    (3) Normal Map

    (4) Weathering Map

    M P l

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    29/38

    Memory Pools

    Resources used together, created

    together

    Multiple resource sets are often pooled

    Simplifies memory management, less

    then 1000 total allocations

    Maps well to D3D12 memory heap

    Orange Team Units M

    SPACE FIGHTER 1 CA

    CARRIER FORWARD CA

    (0) Albiedo

    (1) Material Mask(2) Ambient Occlusion

    (3) Normal Map

    (4) Weathering Map

    (0) Al

    (1) Ma(2) Am

    (3) No

    (4) We

    (0) Albiedo

    (1) Material Mask

    (2) Ambient Occlusion

    (3) Normal Map(4) Weathering Map

    (0) Al

    (1) Ma

    (2) Am

    (3) No(4) We

    Constants and Reso rce Update

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    30/38

    Constants and Resource Update

    Common data used per frame is dumped into a Graphics Transfer M

    Reset every frame, with a simple incrementing counterAll constants are uploaded into this memory, and referenced by com

    Per frame resource updates placed in this memory

    No point in resizing and reallocating, need the max memory planned f

    Non per frame updates, which are big, other memory is allocated, b

    free memory after receiving notification that GPU command is comp

    D3d12 this ends up just being a buffer

    Starting the frame

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    31/38

    Starting the frame

    bool BeginFrame()

    {

    bool bGPUbound = true

    int FrameData = g_uFrame % FramesQueued

    if g_Fences[FrameData] is blocking

    then g_Fences[FrameData]->WaitOnFence

    Else bGPUBound = false

    //we know the GPU is free with this nowg_FrameData[FrameData]->LockAndMap()

    g_uFrame++;

    return bGPUBound;

    }

    This function shplace on sequen

    once ready to be

    generating comm

    Can place timer

    both sides of the

    get accurate CPof our app

    Median frames q

    should be Fram

    .5, so usually on

    frames behind

    Generating a command list

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    32/38

    Generating a command list

    Once frame has begun, can begin issuing command lists

    Easy strategy: Create 2 sets of command lists, 1 set for tbeing rendered, and 1 for the frame being generated (orthen 2 frames are queued)

    What we do:Create a series of tasks which dump rendering into command b

    Dont create a new command buffer for every task, however, ea context which points to a command buffer

    So, if we have 6 CPUs and 600 objects to process in a particularend up generating 6 command buffers with 100 batches each

    The draw order will fluctuate frame to frame, so must handle adifferently

    Resource Sets

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    33/38

    Resource Sets

    B

    B

    B

    P

    S

    S

    Natural Allegory To Descriptor Tables

    Since Resource Sets are immutable, and can bind multiple, canpre create

    If hardware can support multiple descriptor tables,

    conceivable that we dont have to update them per frame ever

    Otherwise, need to dynamically create a descritor table for

    each batch

    What Resource Bindings Exist For Nitrous? Turns

    Out We Have A Small State Vector

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    34/38

    Generating a Command List

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    35/38

    Generating a Command List

    For each Batch (inside a task)

    Create State vector from Batch, Shader, and Primitiveif not enough bind points

    Lookup state vector in our cache (16 entry, un

    per thread)

    if in cache

    Bind created descriptor table

    elseCreate New descriptor table (since same size, we h

    pool of them)

    Bind descriptor table, evict last used thing in cato cache

    Tracking Resource Usage

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    36/38

    Tracking Resource Usage

    Apps responsibility to track what

    resources get used

    Simple strategy: Stamp a frame number

    on each memory pool anytime it is bound

    Traverse the complete resource list,

    anything which matches current framemust be resident

    Quick as long as we keep # of heaps

    reasonableImportant: Frame # should be paddedinto a cache line to avoid serialization

    Heap Description Las

    UI Textures intro

    UI Textures in Game

    Orange Faction Units

    Purple Faction Units

    Weapon effects

    Post Process RTs

    Terrain Heightmap

    Expected results

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    37/38

    Expected results

    Expecting large increases in performance in terms of CPU t

    driver/D3D

    Expecting vast reduction in driver complexity, hopefully mo

    drivers

    Expecting less frames to be queued, generally more respon

    Shouldnt have frame hitches caused by driver at all.

  • 8/11/2019 GDC_14_DirectX Advancements in the Many-Core Era Getting the Most Out of the PC Platform

    38/38