threading successes 03 gamebryo

Emergent Game Technologies Gamebryo Element Engine Thread for Performance


Post on 23-Jun-2015


Page 1: Threading Successes 03   Gamebryo

Emergent Game Technologies
Gamebryo Element Engine

Thread for Performance

Page 2: Threading Successes 03   Gamebryo

Goals for Cross-Platform Threading

•Play well with others
•Take advantage of platform-specific performance features
•For engines/middleware, be adaptable to the needs of customers

Page 3: Threading Successes 03   Gamebryo

Write Once, Use Everywhere

•Underlying multi-threaded primitives are replicated on all platforms
  – Define cross-platform wrappers for these
•Processing models can be applied on different architectures
  – Define cross-platform systems for these
•Typical developer writes once, yet code performs well on all platforms

Page 4: Threading Successes 03   Gamebryo

Emergent's Gamebryo Element

•A foundation for easing cross-platform and multi-core development
  – Modular, customizable
  – Suite of content pipeline tools
  – Supports PC, Xbox, PS3 and Wii
•Booth # 5716 - North Hall

Page 5: Threading Successes 03   Gamebryo

Cross-Platform Threading Requires Common Primitives

•Threads
  – Something that executes code
  – Sub issues: local storage, priorities
•Data Locks / Critical sections
  – Manage contention for a resource
•Atomic operations
  – An operation that is guaranteed to complete without interruption from another thread
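As a rough illustration of the three primitives listed above, the sketch below uses the standard C++11 library (std::thread, std::mutex, std::atomic) rather than Gamebryo's own wrappers, which this talk does not name. The Counts struct and the CountWithPrimitives function are hypothetical names chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical result type: totals reached through each primitive.
struct Counts {
    long atomicTotal;
    long lockedTotal;
};

// Hypothetical helper: every thread bumps an atomic counter and a
// mutex-protected counter; both should end at threadCount * perThread.
Counts CountWithPrimitives(int threadCount, int perThread)
{
    assert(threadCount > 0 && perThread >= 0);

    std::atomic<long> atomicTotal{0};  // atomic operation: no lock needed
    long lockedTotal = 0;              // plain data: must be guarded
    std::mutex lock;                   // critical section for lockedTotal

    std::vector<std::thread> threads;  // threads: things that execute code
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < perThread; ++i) {
                atomicTotal.fetch_add(1, std::memory_order_relaxed);
                std::lock_guard<std::mutex> guard(lock);
                ++lockedTotal;
            }
        });
    }
    for (std::thread& th : threads) {
        th.join();
    }
    return Counts{atomicTotal.load(), lockedTotal};
}
```

A cross-platform engine would typically hide these behind its own thin wrapper types so that platforms without the standard library facilities can supply their own implementation.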

Page 6: Threading Successes 03   Gamebryo

Choosing a Processing Model

•Architectural features drive choice
  – Cache coherence
  – Prefetch on Xbox
  – SPUs on PS3
  – Many processing units
  – General purpose GPU
•Stream Processing fits these properties
  – Provide infrastructure to compute this way
  – Shift engine work to this model

Page 7: Threading Successes 03   Gamebryo

Stream Processing (Formal)

Wikipedia: Given a set of input and output data (streams), the principle essentially defines a series of compute-intensive operations (kernel functions) to be applied for each element in the stream.

[Diagram: Input 1, Kernel 1, Input 2, Kernel 2, Output]

Page 8: Threading Successes 03   Gamebryo

Generalized Stream Processing

•Improve for general purpose computing
  – Partition streams into chunks
  – Kernels have access to entire chunk
  – Parameters for kernels (fixed inputs)
•Advantages
  – Reduce need for strict data locality
  – Enables loops, non-SIMD processing
  – Maps better onto hardware
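The generalized model can be sketched in plain C++ as follows. This is only an illustration of the idea (a stream cut into chunks, a kernel that sees a whole chunk, and a fixed parameter shared by every invocation); ScaleParams, ScaleKernel and RunChunked are hypothetical names and are not part of any Gamebryo API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed input: the same value is visible to every kernel invocation.
struct ScaleParams {
    float scale;
};

// Kernel: receives a whole chunk, so it is free to loop over it.
void ScaleKernel(const float* in, float* out, std::size_t count,
                 const ScaleParams& params)
{
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = in[i] * params.scale;
    }
}

// Partition the stream into chunks of at most chunkSize elements and run
// the kernel once per chunk. The chunks are independent, so each call
// could go to a different processing unit; here they run sequentially.
std::vector<float> RunChunked(const std::vector<float>& input,
                              std::size_t chunkSize,
                              const ScaleParams& params)
{
    assert(chunkSize > 0);
    std::vector<float> output(input.size());
    for (std::size_t begin = 0; begin < input.size(); begin += chunkSize) {
        const std::size_t count = std::min(chunkSize, input.size() - begin);
        ScaleKernel(input.data() + begin, output.data() + begin, count, params);
    }
    return output;
}
```

Because no chunk reads another chunk's output, the loop in RunChunked is the part a scheduler would distribute across cores.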

Page 9: Threading Successes 03   Gamebryo

Morphing+Skinning Example

[Diagram: Morph Target 1 Vertices, Morph Target 2 Vertices, Morph Weights, Morph Kernel (MK), Skin Vertices, Bone Matrices, Blend Weights, Skinning Kernel (SK), Vertex Locations]

Page 10: Threading Successes 03   Gamebryo

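Morphing+Skinning Example

[Diagram: the same example with partitioned streams. Fixed inputs: MW Fixed, Matrices Fixed, Weights Fixed. Partitioned streams: MT 1 V Part 1, MT 1 V Part 2, MT 2 V Part 1, MT 2 V Part 2, Skin V Part 1, Skin V Part 2, Verts Part 1, Verts Part 2. Kernel instances: MK Instance 1, MK Instance 2, SK Instance 1, SK Instance 2]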

Page 11: Threading Successes 03   Gamebryo

Floodgate

•Cross platform stream processing library
•Optimized per-platform implementation
•Documented API for customer use
•Engine uses the same API for built in functionality
  – Skinning, Morphing, Particles, Instance Culling, ...

Page 12: Threading Successes 03   Gamebryo

Floodgate Basics

•Stream: A buffer of varying or fixed data
  – A pointer, length, stride, locking
•Kernel: An operation to perform on streams of data
  – Code implementing “Execute” function
•Task: Wraps a kernel and its IO streams
•Workflow: A collection of Tasks processed as a unit
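To make the "pointer, length, stride" description of a stream concrete, here is a minimal sketch of a strided view over interleaved data. StreamView, Vertex and MakeXStream are hypothetical names for illustration only; this is not the Floodgate stream class, and locking is left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical strided view: base pointer, element count and byte stride.
template <typename T>
struct StreamView {
    unsigned char* base;      // address of the first element
    std::size_t count;        // number of elements in the stream
    std::size_t strideBytes;  // distance between consecutive elements

    T& operator[](std::size_t i) const
    {
        assert(i < count);
        return *reinterpret_cast<T*>(base + i * strideBytes);
    }
};

// Illustrative interleaved vertex layout.
struct Vertex {
    float x, y, z;
    float u, v;
};

// Expose only the x coordinates of an interleaved vertex array as a stream.
StreamView<float> MakeXStream(Vertex* verts, std::size_t count)
{
    return StreamView<float>{
        reinterpret_cast<unsigned char*>(&verts[0].x), count, sizeof(Vertex)};
}
```

The stride is what lets one kernel consume a single attribute out of an interleaved buffer without copying it into a packed array first.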

Page 13: Threading Successes 03   Gamebryo


Kernel Example: Times2

// Include Kernel Definition macros

#include <NiSPKernelMacros.h>

// Declare the Times2Kernel

NiSPDeclareKernel(Times2Kernel)

Page 14: Threading Successes 03   Gamebryo

Kernel Example: Times2

#include "Times2Kernel.h"

NiSPBeginKernelImpl(Times2Kernel)
{
    // Get the input stream
    float *pInput = kWorkload.GetInput<float>(0);

    // Get the output stream
    float *pOutput = kWorkload.GetOutput<float>(0);

    // Process data
    NiUInt32 uiBlockCount = kWorkload.GetBlockCount();
    for (NiUInt32 ui = 0; ui < uiBlockCount; ui++)
    {
        pOutput[ui] = pInput[ui] * 2;
    }
}
NiSPEndKernelImpl(Times2Kernel)

Page 15: Threading Successes 03   Gamebryo

Life of a Workflow

•1. Obtain Workflow from Floodgate
•2. Add Task(s) to Workflow
•3. Set Kernel
•4. Add Input Streams
•5. Add Output Streams
•6. Submit Workflow
•… Do something else …
•7. Wait or Poll when results are needed

Page 16: Threading Successes 03   Gamebryo

Example Workflow

// Setup input and output streams from existing buffers
NiTSPStream<float> inputStream(SomeInputBuffer, MAX_BLOCKS);
NiTSPStream<float> outputStream(SomeOutputBuffer, MAX_BLOCKS);

// Get a Workflow and setup a new task for it
NiSPWorkflow* pWorkflow = NiStreamProcessor::Get()->GetFreeWorkflow();
NiSPTask* pTask = pWorkflow->AddNewTask();

// Set the kernel and streams
pTask->SetKernel(&Times2Kernel);
pTask->AddInput(&inputStream);
pTask->AddOutput(&outputStream);

// Submit workflow for execution
NiStreamProcessor::Get()->Submit(pWorkflow);

// Do other operations...

// Wait for workflow to complete
NiStreamProcessor::Get()->Wait(pWorkflow);

Page 17: Threading Successes 03   Gamebryo

Floodgate Internals

•Partitioning streams for Tasks
•Task Dependency Analysis
•Platform specific Workflow preparation
•Platform specific execution
•Platform specific synchronization

Page 18: Threading Successes 03   Gamebryo

Overview of Workflow Analysis

•Task dependencies defined by streams
•Sort tasks into stages of execution
  – Tasks that use results from other tasks run in later stages
  – Stage N+1 tasks depend on output of Stage N tasks
•Tasks in a given stage can run concurrently
•Once a stage has completed, the next stage can run
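The staging rule above can be sketched as a small function: a task that consumes no other task's output lands in stage 0, and every other task lands one stage after the latest producer it reads from. This is a guess at the general shape of such an analysis, not Floodgate's actual implementation. TaskDesc and AssignStages are hypothetical names, streams are identified by strings for readability, and the sketch assumes the tasks form an acyclic graph with at most one producer per stream.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical task description: names of the streams read and written.
struct TaskDesc {
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
};

// Returns the stage index of each task. Assumes an acyclic dependency
// graph and at most one producing task per stream.
std::vector<int> AssignStages(const std::vector<TaskDesc>& tasks)
{
    // Map each stream to the task that writes it.
    std::map<std::string, std::size_t> producer;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        for (const std::string& out : tasks[i].outputs) {
            producer[out] = i;
        }
    }

    std::vector<int> stage(tasks.size(), 0);
    // Relax until stable; an acyclic graph of n tasks needs at most n passes.
    for (std::size_t pass = 0; pass < tasks.size(); ++pass) {
        bool changed = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            for (const std::string& in : tasks[i].inputs) {
                const auto it = producer.find(in);
                if (it == producer.end() || it->second == i) {
                    continue;  // external input, or in-place stream
                }
                const int needed = stage[it->second] + 1;
                if (needed > stage[i]) {
                    stage[i] = needed;
                    changed = true;
                }
            }
        }
        if (!changed) {
            break;
        }
    }
    return stage;
}
```

All tasks sharing a stage index can be dispatched together, with a synchronization point between consecutive stages.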

Page 19: Threading Successes 03   Gamebryo

Analysis: Workflow with many Tasks

[Diagram: tasks and the streams attached to them. Task 1 (Stream A, Stream B); Task 2 (Stream C, Stream D); Task 3 (Stream E, Stream F); Task 4 (Stream B, Stream D, Stream G); Task 5 (Stream G, Stream H); Task 6 (Stream G, Stream F, Stream I); Task 7 (Sync)]

Page 20: Threading Successes 03   Gamebryo

Analysis: Dependency Graph

[Diagram: dependency graph of Task 1 through Task 6 and a Sync Task, connected by Stream A through Stream I and arranged in columns labelled Stage 0, Stage 1, Stage 2, Stage 3]

Page 21: Threading Successes 03   Gamebryo

Performance Notes

•Data is broken into blocks -> Locality
  – Good cache performance
  – Optimize size for prefetch or DMA transfers
  – Fits in limited local storage (PS3)
•Easily adapt to #cores
  – Can manage interplay with other systems
•Kernels encapsulate processing
  – Good target for optimization, platform-specific
  – Clean solution without #if
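The point about sizing blocks to fit limited local storage comes down to simple arithmetic, sketched below. Each SPU on the PS3 has 256 KB of local store that code and data share, so only part of it is available for stream data. MaxBlocksPerChunk is a hypothetical helper, and the byte budget passed to it is an assumption the caller must choose; nothing here reflects how Floodgate actually sizes its transfers.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: how many blocks fit in one chunk, given the bytes
// each stream contributes per block and a byte budget for the chunk.
std::size_t MaxBlocksPerChunk(
    const std::vector<std::size_t>& bytesPerBlockPerStream,
    std::size_t budgetBytes)
{
    const std::size_t bytesPerBlock = std::accumulate(
        bytesPerBlockPerStream.begin(), bytesPerBlockPerStream.end(),
        std::size_t{0});
    assert(bytesPerBlock > 0);
    return budgetBytes / bytesPerBlock;  // integer division rounds down
}
```

For example, three streams contributing 12, 16 and 12 bytes per block use 40 bytes per block, so an illustrative 64 KB budget holds 1638 blocks per chunk.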

Page 22: Threading Successes 03   Gamebryo

Usability Notes

•Automatically manage data dependency and simplify synchronization
•Hide nasty platform-specific details
  – Prefetch, DMA transfers, processor detection, ...
•Learn one API, use it across platforms
  – Productivity gains
  – Helps us produce quality documentation and samples
  – Eases debugging

Page 23: Threading Successes 03   Gamebryo

Exploiting Floodgate in the Engine

•Find tasks that operate on a single object
  – Skinning, morphing, particle systems, ...
•Move these to Floodgate: Mesh Modifiers
  – Launch at some point during execution
    – After updating animation and bounds
    – After determining visibility
    – After physics finishes ...
  – Finish them when needed
    – Culling
    – Render
    – etc
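The launch-early, finish-when-needed pattern can be illustrated with standard C++ futures. This is only an analogy for the mesh modifier idea: std::async stands in for submitting work to Floodgate, and BlendTargets and LaunchModifier are hypothetical names, not engine functions.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Stand-in for a mesh modifier: blends two equally sized vertex arrays.
std::vector<float> BlendTargets(std::vector<float> a, std::vector<float> b,
                                float weight)
{
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = a[i] * (1.0f - weight) + b[i] * weight;
    }
    return a;
}

// Launch: start the work on another thread as soon as its inputs are ready
// (for example, right after the animation update).
std::future<std::vector<float>> LaunchModifier(std::vector<float> a,
                                               std::vector<float> b,
                                               float weight)
{
    return std::async(std::launch::async, BlendTargets,
                      std::move(a), std::move(b), weight);
}
```

The caller keeps the returned future and calls get() on it only at the point where culling or rendering actually needs the vertices, so any work done in between overlaps with the modifier.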

Page 24: Threading Successes 03   Gamebryo

Same applications, new performance ...

                    Before    After
Skinning Objects    42fps     62fps
Morphing Objects    12fps     38fps

•The big win is out-of-the-box performance
  – Same results could be achieved with much developer time
  – Hides details on different platforms (esp. PS3)

Page 25: Threading Successes 03   Gamebryo

Example CPU Utilization, Morphing

[Charts: CPU utilization, Before and After]

Page 26: Threading Successes 03   Gamebryo

Thread profiling, Morphing Before

•Some parallelization through hand-coded parallel update
  – Note high overhead and 85% or so in serial execution

Page 27: Threading Successes 03   Gamebryo

Thread profiling, Morphing After

•Automatic parallelism in engine
  – 4 threads for Floodgate (4 CPUs)
  – Roughly, 50% of old serial time replaced with 4x parallelism
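As a back-of-the-envelope reading of the last point, in the style of Amdahl's law: if half of the formerly serial time now runs perfectly across 4 threads and the other half is unchanged, the time spent in that portion shrinks as shown below. This assumes ideal 4x scaling with no scheduling overhead and covers only the formerly serial portion, so it is an upper bound and not a measured result.

```latex
\frac{T_{\text{after}}}{T_{\text{before}}} \approx 0.5 + \frac{0.5}{4} = 0.625,
\qquad
\text{speedup} \approx \frac{1}{0.625} = 1.6
```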

Page 28: Threading Successes 03   Gamebryo

New Issues

•Within the engine, resource usage peaks at certain times
  – e.g. Between visibility culling and rendering
  – Application-level work might fill in the empty spaces
    – Physics, global illumination, ...
•What about single processor machines?
•What about variable sized output?
  – Instance culling, for example

Page 29: Threading Successes 03   Gamebryo

Ongoing Improvements

•Improved workflow scheduling
  – Mechanisms to enhance application control
•Optimizing when tasks change
  – Stream lengths change
  – Inputs/outputs are changed
•More platform specific improvements
•Off-loading more engine work

Page 30: Threading Successes 03   Gamebryo

Using Floodgate in a game

•Identify stream processing opportunities
  – Places where lots of data is processed with local access patterns
  – Places where work can be prepared early but results are not needed until later
•Re-factor to use Floodgate
  – Depending on task, could be as little as a few hours.
  – Hard part is enforcing locality

Page 31: Threading Successes 03   Gamebryo

Future proofed?

•Both CPUs and GPUs can function as stream processors
•Easily extends to more processing units
•Potential snags are in application changes

Page 32: Threading Successes 03   Gamebryo

Questions?

•Ask Stephen!
•Visit Emergent's booth at the show.
  – Booth 5716, North Hall, opposite Intel on the central aisle