

A Soft Real-time Concurrent Graphics Platform

André F. B. Silva
The Cyclops Group
Federal University of Santa Catarina

Felipe B. Alves
The Cyclops Group
Federal University of Santa Catarina

Olinto J. V. Furtado
INE - CTC
Federal University of Santa Catarina

Abstract

This article describes a language and framework that aims to ease soft real-time and concurrent programming for games. The defined language extends standard C99 with some new constructions. Its standard library provides graphical utilities for game development: object file loading, rendering optimization passes, built-in physical simulation, high-level collision detection, 3D audio and shader loading. It also provides basic concurrent programming structures, and real-time support through functions that change real-time exception handling. This extension, CEx, was implemented as an extension of the Clang compiler, which generates code for the Low Level Virtual Machine (LLVM). From this implementation, it was possible to provide an environment for game development that comprises the language and an intermediary library for concurrency, real-time signals and graphics.

Keywords: Parallelism, Gaming Framework, Concurrent Languages, Soft Real-time

Authors’ Contact:

{dedefbs,fbafelipe,olinto}@inf.ufsc.br

1 Introduction

There is a growing demand for tools that provide better abstractions, debugging support and productivity for systems programming, and real-time and concurrent applications are no exception. These programming tools are commonly added to the programming language as a library. This is true for C and C++. Other languages, such as Erlang [Armstrong 2003], Concurrent C [Gehani and Roome 1989] and Concurrent C++ [Gehani and Roome 1988], have specific language instructions that provide the same or equivalent capabilities as those libraries. Hybrid solutions provide both instructions and library support. The CEx language belongs to this last class: it provides concurrent and real-time instructions while focusing on game development.

Games are real-time applications and nowadays they tend to use a lot of concurrency [Bethke 2003; Eberly 2004]. They use graphical engines, which implement the three major steps of a game: physics with collision, artificial intelligence and rendering. These are performed up to 60 times per second [Akenine-Möller et al. 2008; Eberly 2004]. The growing complexity, mainly of state-of-the-art games, implies a need for tools, techniques, libraries and languages that make software development faster and more robust. As a result, most enterprise solutions involve commercial engines and, as a consequence, whichever language the engine supports. These commercial engines provide parallel programs and abstractions, but they usually do not do so in languages that expose code parallelism through instructions.

The CEx language uses instructions only for the concurrent and real-time parts. The engine parts are all provided as libraries, written as a C interface. Several tools were used to implement the entire framework environment: LLVM and Clang as the language back-end and front-end, respectively; the Boost C++ library and the Threadpool concept for the underlying concurrency support; Bullet Physics and the Ronin Engine [Alves 2010] for the engine steps; and a modified Eclipse CDT as the programming environment.

Figure 1: UML component model of the intermediary library

Both C and C++ are used by many engines because they provide easy access to hardware, the GPU included. Graphics hardware is accessed through specific languages whose interfaces are reached via the graphics layer in use; GLSL, HLSL and NVidia Cg are some examples. Other languages provide standalone interfaces, such as NVidia CUDA and the OpenCL APIs. The C grammar is simple because the language was developed to be a portable assembly. That way, we take advantage of the simplicity of the C grammar and of explicit low-level parallelization, with a robust C interface to C++ code.

LLVM is a virtual machine and a framework for building compilers. LLVM has its own assembly language, a RISC-like virtual instruction set in Static Single Assignment (SSA) form [Lattner and Adve 2004]. The Clang tool is a front-end for C-like languages, including Objective-C and Objective-C++. It is capable of code generation for LLVM, but only its C part is complete and reasonably stable. Even though LLVM/Clang can compile itself, the C++ part of this front-end is not yet considered usable, i.e., its API is still in constant modification.

SBC - Proceedings of SBGames 2010 Computing Track - Full Papers

IX SBGames - Florianópolis - SC, November 8th-10th, 2010

The Boost library has abstractions for multithreading, filesystem access and metaprogramming. It is used in the underlying library that encapsulates the C interface for both the engine and the parallel structures. The Threadpool concept, which builds on the Boost multithreading library, is used in this library as well and supports the management of multiple threads. The units of work to run are called tasks, and these are scheduled onto different running threads.

The rendering part of the Ronin engine is complemented with the Bullet Physics rigid body simulation [Bullet 2009; Boeing and Bräunl 2007]. Both Bullet and Boost are written in C++ and are common in commercial solutions. These libraries work together under the same interface, illustrated in figure 1 as a UML component model. The Thread Mixer is a library that includes the Threadpool and control structures. The RN part provides a C interface for the Ronin engine and Bullet Physics.

1.1 Problems and solutions

During the development of the framework some problems were found. Here we discuss the major problems and their respective solutions.

C is an old language and there are many complete compilers for it. It was decided to extend the Clang front-end to implement the upcoming grammar modifications. Clang is both stable and robust for C, and it was considered a good starting point.

The extended grammar included many statements that execute asynchronously, which brought the need for an efficient and easy way to execute statements in parallel. This was solved, without effort from the programmer, by detaching the statement from the function code and placing it in another function. Remarks about this solution will be made later on.
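The detaching step can be pictured with a small sketch in plain C. This is only an illustration under stated assumptions: the names `scale_ctx`, `scale_outlined` and `run_task` are hypothetical, and `run_task` simply calls the function in place, where a real runtime would hand it to a worker thread. The code that the CEx compiler actually emits is not shown in this article.

```c
#include <stddef.h>

/* Hypothetical context: the variables used by the detached statement. */
struct scale_ctx {
    float *data;
    size_t n;
    float factor;
};

/* The statement body, moved out of its original function. */
static void scale_outlined(void *arg)
{
    struct scale_ctx *ctx = arg;
    for (size_t i = 0; i < ctx->n; i++)
        ctx->data[i] *= ctx->factor;
}

/* Stand-in for the runtime: a real scheduler would queue the task
   instead of calling it directly. */
static void run_task(void (*fn)(void *), void *arg)
{
    fn(arg);
}

/* The original function after the rewrite: it only packs the context
   and submits the outlined function. */
void scale(float *data, size_t n, float factor)
{
    struct scale_ctx ctx = { data, n, factor };
    run_task(scale_outlined, &ctx);
}
```

Because the sketch runs the task immediately, a stack-allocated context is safe here; an asynchronous runtime would need the context to outlive the caller's frame.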

For the real-time part of the solution, a POSIX-like exception system is used: an exception is thrown when the deadline is not met. This solution was divided into two different instructions. One takes both periodicity and deadline values, and the other only the deadline. These two instructions make it possible to write imprecise computation code that can be used for level of detail management.

Providing all the concurrency tools as language instructions is not wise, because the amount of work for the front-end is huge and the result is mostly the same as a library solution. But if specific kinds of code commonly used in parallel algorithms are studied, some interesting instructions can be implemented. That way, five different instructions were defined.

1.2 Techniques

This work covers a wide range of techniques, from parallelization and real-time to physics treatment. These are described in this section, as an overview of the different solutions in this work.

The Bullet Physics library was used. It implements a fast GJK collision detection algorithm [van den Bergen 1999] and parallelizable broadphase and integration steps. It also has code optimized for parallel architectures such as GPUs and Cell SPUs, performing broadphases on both multi-core and vectorized hardware. The integration with this library is done transparently by the Ronin engine. This approach makes the physics interface of the framework independent of the underlying Bullet Physics implementation.

The Boost Threadpool concept uses the mutex and condition structures of the Boost libraries to implement a simple scheduler on top of a low-level thread manager. The pool can be resized dynamically (number of threads) and supports two simple scheduling policies: FIFO and LIFO. Even though this simplicity implies some restrictions on task dependencies, it makes the code faster and heavily parallelizable, achieving high average usage for highly parallelizable code.
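The difference between the two scheduling policies can be sketched with a small fixed-capacity task queue. This is a single-threaded illustration with hypothetical names (`task_queue`, `queue_push`, `queue_pop`); it is not the Threadpool implementation, which additionally guards the queue with a mutex and a condition.

```c
#include <stddef.h>

#define QUEUE_CAP 16

enum policy { POLICY_FIFO, POLICY_LIFO };

/* Ring buffer of task ids: head is the oldest entry, count the size. */
struct task_queue {
    int items[QUEUE_CAP];
    size_t head;
    size_t count;
    enum policy policy;
};

void queue_init(struct task_queue *q, enum policy policy)
{
    q->head = 0;
    q->count = 0;
    q->policy = policy;
}

/* Appends a task; returns 0 on success, -1 if the queue is full. */
int queue_push(struct task_queue *q, int task)
{
    if (q->count == QUEUE_CAP)
        return -1;
    q->items[(q->head + q->count) % QUEUE_CAP] = task;
    q->count++;
    return 0;
}

/* Removes the next task according to the policy:
   FIFO takes the oldest entry, LIFO the newest.
   Returns 0 on success, -1 if the queue is empty. */
int queue_pop(struct task_queue *q, int *task)
{
    if (q->count == 0)
        return -1;
    if (q->policy == POLICY_FIFO) {
        *task = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
    } else {
        *task = q->items[(q->head + q->count - 1) % QUEUE_CAP];
    }
    q->count--;
    return 0;
}
```

With tasks 1, 2 and 3 pushed in that order, a FIFO queue hands out 1 first, while a LIFO queue hands out 3 first.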

The Ronin engine uses an octree-based clipping and culling algorithm with BSP collision detection, providing support for complex structures in real time and optimized rendering for real-time graphics [Akenine-Möller et al. 2008; Jiménez et al. 2001]. The octree hierarchy is optimized using its loose version [Akenine-Möller et al. 2008].

This work is divided into the following sections: Section 2 shows the language keywords and their semantics; Section 3 discusses parallel and real-time code; Section 4 describes the Ronin engine; Section 5 exemplifies some solutions with their most relevant data; Section 6 presents our conclusions.

2 Language definitions

To provide the needed support, five language instructions were defined for parallel processing and two to provide a soft real-time environment capable of imprecise computation. After each keyword definition we show the extensions we made to ISO C99 [ISO 1999], in BNF notation. In the BNF, ID stands for IDENTIFIER and TPN for TYPE_NAME. The Facade-lib was also defined, and its integration with CEx applications is illustrated in figure 1.

2.1 Concurrency keywords

The five keywords defined and their semantics are explained in this section.

2.1.1 parallel

It defines that the target statement will be executed in parallel. The parallel keyword indicates that the internal scheduler will treat the parallel call the way it finds best. That means that the caller holds no control over the execution of the parallelized block. It is designed for non-urgent or almost optional calls, which will not change the final result drastically.

    <function_definition_async> ::= PARALLEL <function_definition_sync>

    <parallel_statement> ::= PARALLEL <statement>
    <statement> ::= <parallel_statement>

2.1.2 async

Function and block modifier that executes the target in a specific job or thread. It indicates that the block may be used asynchronously, that is, the caller has no relation to the executed code, its current condition or its ending.

    <function_definition_async> ::=
        ASYNC '(' ID ')' <function_definition_sync>

    <parallel_statement> ::=
        ASYNC '(' ID ')' <statement>

    <statement> ::= <parallel_statement>

2.1.3 atomic

Guarantees atomic access to the variable¹. Binary, logical and comparison operations will be atomic. The behavior for other operators is undefined.

    <type_qualifier> ::= ATOMIC
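As a point of comparison only, the closest construct in standard C is the C11 `<stdatomic.h>` family, sketched below. This is an analogy, not the CEx qualifier itself: the spelling of the qualifier, the counter name `hits` and the helper functions are assumptions made for this illustration.

```c
#include <stdatomic.h>

/* C11 counterpart of an atomic-qualified int. Static storage means it
   starts at zero. */
static atomic_int hits;

/* Atomically adds one and returns the new value. atomic_fetch_add
   returns the value held before the addition, hence the + 1. */
int record_hit(void)
{
    return atomic_fetch_add(&hits, 1) + 1;
}

/* Atomically reads the counter. */
int hit_count(void)
{
    return atomic_load(&hits);
}
```

Several threads can call `record_hit` concurrently without a lock, and no increment is lost.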

2.1.4 parallel_for

Loop structure whose syntax is equal to that of a “for”. All iterations may be executed in parallel. There is no explicit notation for dependencies between the parallel executions. If necessary, the programmer can manually write such code using the intermediary library. This instruction is based on OpenMP [Dagum and Menon 1998]; although it resembles cilk_for [Leiserson 2009; CIL 2009], the scheduling and synchronization styles are different.

¹ Signed and unsigned {char, short, int, long long}.

    <iteration_statement> ::=
        PARALLEL_FOR '(' TPN ID ':' ID ')' <statement> |
        PARALLEL_FOR '(' TPN ID ':' '[' CONSTANT ',' CONSTANT ']' ')' <statement> |
        PARALLEL_FOR '(' expression_statement expression_statement ')' statement |
        PARALLEL_FOR '(' expression_statement expression_statement expression ')' statement
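A loop written against the last grammar alternative might look like the sketch below. So that it builds with an ordinary C compiler, `parallel_for` is defined here as a macro that falls back to a sequential `for`; that macro is a stand-in for this illustration and not part of CEx, and the function name `vector_add` is likewise hypothetical. How CEx treats the loop variable across parallel iterations is not described in this section.

```c
#include <stddef.h>

/* Stand-in so the sketch compiles without a CEx compiler:
   the loop degrades to a plain sequential for. */
#define parallel_for for

/* Every iteration writes a different element of c, so the iterations
   have no dependencies between them. */
void vector_add(const float *a, const float *b, float *c, int n)
{
    int i;
    parallel_for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The loop header has the same three parts as a C `for`, which matches the expression_statement, expression_statement, expression form in the grammar.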

2.1.5 synchronized

Supplies automatic synchronization of a statement accessed by multiple threads. A structure must be passed as its argument, and the semantics of the instruction depend on whether the variable is a mutex, semaphore, condition or barrier.

    <synchronized_statement> ::=
        SYNCHRONIZED '(' <identifier_list> ')' <statement>

    <statement> ::= <synchronized_statement>
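For the mutex case, the effect of a synchronized statement is roughly a lock taken before the statement and released after it. The sketch below shows that shape with a POSIX mutex as a stand-in; the article's own mutex comes from its intermediary library, and the names `balance_lock`, `balance` and `deposit` are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;
static long balance;

/* Rough equivalent of: synchronized (balance_lock) { balance += amount; }
   Returns the balance as seen inside the critical section. */
long deposit(long amount)
{
    long result;

    pthread_mutex_lock(&balance_lock);    /* entry to the statement */
    balance += amount;
    result = balance;
    pthread_mutex_unlock(&balance_lock);  /* exit from the statement */
    return result;
}
```

Copying the value into `result` before unlocking keeps the returned balance consistent with the update made by the same call.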

2.2 Real-time keywords

The real-time instructions described in this section support soft real-time programming on our platform. It is necessary to integrate two different types of real-time blocks: blocks that are executed once and must complete within a deadline, and periodic blocks, which must always execute within a deadline.

2.2.1 do_rt

This keyword indicates that the target statement must be executed within a specified time (deadline). That is, the block must finish within the specified maximum time. If the real-time signal is not handled, the default behavior is to abort the program, as with a POSIX kill signal. This specification is similar to POSIX [IEEE 2004] signals. Real-time exceptions must specify their handlers. This is done via the intermediary library.

    <rt_statement> ::=
        DO_RT '(' <constant_expression> ')' <statement>

    <statement> ::= <rt_statement>
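The deadline idea can be approximated in plain C as follows. This is a simplified sketch under stated assumptions: it measures CPU time with the standard `clock()` function and checks the deadline only after the block has returned, whereas the instruction described here raises a signal. The names `run_with_deadline`, `count_miss` and `small_job` are hypothetical.

```c
#include <time.h>

/* Number of deadline misses reported so far. */
int misses;

/* Example handler: just counts the miss. */
void count_miss(void)
{
    misses++;
}

/* Example block: increments the integer it is given. */
void small_job(void *arg)
{
    int *x = arg;
    *x += 1;
}

/* Runs body(arg). If it used more than deadline_s seconds of CPU time,
   calls on_miss (when not NULL) and returns 1; otherwise returns 0. */
int run_with_deadline(void (*body)(void *), void *arg,
                      double deadline_s, void (*on_miss)(void))
{
    clock_t start = clock();
    body(arg);
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;

    if (elapsed > deadline_s) {
        if (on_miss)
            on_miss();
        return 1;
    }
    return 0;
}
```

A check made after the fact cannot interrupt a block that overruns, which is the main thing a signal-based mechanism adds.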

2.2.2 periodic_rt

It is semantically identical to do_rt and syntactically similar, taking both a period and a deadline. The difference is that the block is treated as a periodic one: it must be executed before the deadline in each period. It is similar to combining a while with a do_rt, but in that case the same code could be executed more than once in the same period.

    <periodic_rt_statement> ::=
        PERIODIC_RT '(' <constant_expression> ')' '(' <constant_expression> ')' <statement>

    <statement> ::= <periodic_rt_statement>

2.3 Facade-lib

Since we implemented the language with the Clang front-end, and most of our underlying libraries are written in C++, it was necessary to provide these library services through a C interface. The Ronin engine is also written mostly in C++, so a C interface for it was necessary as well. These two interfaces together are what we call the Facade-lib. The concurrent part provides abstractions for mutex, semaphore, barrier, condition, task and job.

A task is some computation that must be done and can be done in parallel with others in the same job. Jobs are sets of tasks that contribute to a greater computation. In the case of a parallel_for, each iteration creates a task under the same job. The async and parallel keywords create tasks to be executed in the target job, passed as an identifier. For when the programmer wants to execute these jobs in new threads, this library defines a LONG_JOB identifier, which means that the parallelized statement will run in a new thread. That is, it is equivalent to a thread spawn instruction.

Figure 2: Producer-consumer with our parallel instructions. The synchronized instruction can accept multiple variables, as shown. The functions in bold are examples of library calls that integrate with built-in types such as the synchronizer structure.

3 Parallelism and real-time overview

Scheduling can be optimized for MIMD by choosing the right algorithm for distributing work among processors. In a work sharing algorithm, processors share threads with the others when new threads are allocated, but under high usage the threads tend to migrate between processors, increasing latency due to scheduling. In work stealing algorithms, underused processors take the initiative and steal threads from other processors, so under high usage threads tend to stay on the same processor. The Threadpool scheduler is compliant with the work stealing algorithm and allocates all threads to processors in a work stealing scheme [Blumofe and Leiserson 1999].
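The core of work stealing is a per-worker double-ended queue: the owner pushes and pops at one end, and an idle worker steals from the other end. The sketch below shows only that data structure in a single-threaded, non-circular form with hypothetical names; it is not the Threadpool code, and a real implementation needs synchronization between owner and thieves.

```c
#include <stddef.h>

#define DEQUE_CAP 32

/* Per-worker deque of task ids. The owner works at the bottom,
   thieves take from the top. */
struct deque {
    int tasks[DEQUE_CAP];
    size_t top;     /* index of the oldest task */
    size_t bottom;  /* one past the newest task */
};

void deque_init(struct deque *d)
{
    d->top = 0;
    d->bottom = 0;
}

/* Owner adds a task; returns -1 when the array is used up. */
int deque_push_bottom(struct deque *d, int task)
{
    if (d->bottom == DEQUE_CAP)
        return -1;
    d->tasks[d->bottom++] = task;
    return 0;
}

/* Owner takes its newest task; returns -1 if the deque is empty. */
int deque_pop_bottom(struct deque *d, int *task)
{
    if (d->bottom == d->top)
        return -1;
    *task = d->tasks[--d->bottom];
    return 0;
}

/* Thief takes the oldest task; returns -1 if the deque is empty. */
int deque_steal_top(struct deque *d, int *task)
{
    if (d->bottom == d->top)
        return -1;
    *task = d->tasks[d->top++];
    return 0;
}

/* A worker serves its own deque first and steals only when idle. */
int next_task(struct deque *own, struct deque *victim, int *task)
{
    if (deque_pop_bottom(own, task) == 0)
        return 0;
    return deque_steal_top(victim, task);
}
```

Since a worker only looks at another deque when its own is empty, busy workers keep their tasks local, which is the locality effect described above.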

Amdahl’s Law [Amdahl 1967] relates the speedup of an algorithm running on n cores to the fraction p of the algorithm that can be parallelized and to its serial counterpart s:

$$s + p = 1 \tag{1}$$

$$\text{Speedup} = \frac{1}{s + \frac{p}{n}} \tag{2}$$

For example, if s is 50% and n is taken to be infinite, the maximum speedup is simply:

$$\text{Speedup} = \frac{1}{s} \tag{3}$$

which gives 2. That means that the maximum parallelization speedup of an algorithm is determined by its serial parts. In this case, a parallel version of this algorithm would run at most twice as fast (200% of the sequential speed).
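Equations (1) to (3) translate directly into a few lines of C, which makes it easy to try other values of s and n. The function names are chosen for this sketch only.

```c
/* Amdahl's Law for a serial fraction s in [0, 1] and n >= 1 cores. */
double amdahl_speedup(double s, double n)
{
    double p = 1.0 - s;          /* equation (1): s + p = 1 */
    return 1.0 / (s + p / n);    /* equation (2) */
}

/* Limit of the speedup as n grows without bound, equation (3).
   Requires s > 0. */
double amdahl_limit(double s)
{
    return 1.0 / s;
}
```

For s = 0.5 the limit is 2, and the speedup on any finite number of cores stays below it; on two cores it is about 1.33.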

The implemented instruction set for parallelism consists of five keywords, which can be used with current C and C++ libraries. The parallel_for instruction assumes that every iteration of the loop is independent of the others. The async and parallel instructions spawn new tasks from the target statement, and synchronized marks a point where all workers synchronize on a semaphore, condition, mutex or barrier. The last instruction is atomic, which turns all accesses to the variable into atomic lock-free operations. In our LLVM implementation we prioritized the first four, leaving out the implementation of the atomic keyword. Some of these instructions are used in the producer-consumer algorithm shown in figure 2.

Consider code that is 100% parallelizable, where S is the startup time (including thread creation overhead), Tt is the task creation time, Pn is the number of processors, Lt is the worst-case execution time of one loop iteration and N is the number of iterations. Then we have the following equation:

$$t = \frac{L_t \cdot N}{P_n} + N \cdot T_t + S \tag{4}$$

This leads us to the worst-case execution time of the algorithm, t. Notice that the part of the code open to optimization is the thread creation time. This overhead can be reduced by code optimization, but in our case we also execute the iterations of a parallel for while the for condition and task creation run in parallel with the already allocated tasks. Therefore, we achieve higher throughput in concurrent loops because we hide the latency of task creation behind task execution. Our parallel loops perform far better when Lt is larger. That is, we achieve higher efficiency on massively parallel, CPU-hungry algorithms.
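Equation (4) can be evaluated directly to see why a larger Lt helps. In the sketch below, `worst_case_time` is the equation itself, while `loop_fraction` is a derived quantity introduced only for this illustration: the share of the bound that is spent in loop bodies rather than in task creation and startup.

```c
/* Equation (4): t = (Lt * N) / Pn + N * Tt + S.
   All times must use the same unit. */
double worst_case_time(double lt, double n_iter, double pn,
                       double tt, double s)
{
    return (lt * n_iter) / pn + n_iter * tt + s;
}

/* Share of the bound spent executing loop bodies (an illustrative
   measure, not a quantity defined in the article). */
double loop_fraction(double lt, double n_iter, double pn,
                     double tt, double s)
{
    return ((lt * n_iter) / pn) / worst_case_time(lt, n_iter, pn, tt, s);
}
```

With N = 1000, Pn = 4, Tt = 0.5 and S = 2, an iteration time of Lt = 1 gives t = 752 with about a third of it spent in loop bodies, while Lt = 100 raises that share to about 98%.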

Real-time algorithms, on the other hand, involve deadlines, periodicity or both. Most real-time systems are sensor-driven systems that follow the environment-sensor-control-actuator model [Buttazzo 2004]. In game programming, however, the main reason for providing real-time instructions is render flow control. That way, when desired, the current rendering can be discarded or modified to meet the rendering deadline, keeping up with the game step. These instructions can also be used for level of detail adjustment, graphics that adapt to the system, and imprecise computation support. The POSIX-like signaling of both the do_rt and periodic_rt instructions makes it easier to detect a missed deadline. Suppose that the main loop of a game is inside a periodic_rt instruction and that, when the deadline is missed, the objects of the current frame are rendered as boxes. This effect can be seen in figure 3.
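The box fallback can be sketched as a flag that the deadline handler sets and the render loop reads. All names here (`on_deadline_miss`, `begin_frame`, `current_lod`) are hypothetical, and registering the handler with the intermediary library is left out because that interface is not shown in this section.

```c
#include <signal.h>

enum lod { LOD_FULL, LOD_BOXES };

/* Set from the handler, read by the render loop. sig_atomic_t is the
   type that may safely be written from a signal handler. */
static volatile sig_atomic_t deadline_missed;

/* Hypothetical handler for the real-time exception. */
void on_deadline_miss(void)
{
    deadline_missed = 1;
}

/* Called at the start of every period: back to full detail. */
void begin_frame(void)
{
    deadline_missed = 0;
}

/* Queried per object while drawing: once the deadline has been
   missed, the remaining objects of the frame are drawn as boxes. */
enum lod current_lod(void)
{
    return deadline_missed ? LOD_BOXES : LOD_FULL;
}
```

Resetting the flag in `begin_frame` means the fallback only lasts for the frame in which the deadline was missed.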

Hard real-time scheduling can use only up to about 70% of the processing power [Liu and Layland 1973]. Thus, at least 30% of all computing power will not be used for useful computation in a hard real-time environment. This cost is prohibitive for gaming software, which does not always have to perform under a specific deadline. An average of thirty or sixty rendered frames per second is common and desirable, but not really the purpose of a game. Therefore, we only support soft real-time with our instructions and do not apply some common techniques seen in hard real-time environments, such as protection against priority inversion. Our real-time exception system cannot be considered reliable in that sense either.

These instructions and the parallel library are complemented by the Ronin engine, and they are implemented in the Thread manager library.

3.1 Thread manager

The Thread manager (TM) is part of the Facade-lib. It implements the functions that the compiler back-end calls to implement the concurrency and real-time instructions. The Thread manager consists of a C interface that implements mutex, semaphore, condition, barrier, tasks and jobs. Every Job runs on at least one thread and contains tasks. A Task runs on a dynamically scheduled thread of the Job it belongs to. We needed a dynamic thread scheduler, and the Boost Threadpool concept was chosen. There are other similar libraries, such as KDE ThreadWeaver [Boehm 2008], but they are either dependent on other libraries or overly complex. To maintain simplicity and attain high Task throughput we used Threadpool. That way, our Job concept has a built-in thread pool that schedules the Job's tasks over time. A new task is only scheduled when another one has finished its execution. Thus the Task abstraction is very different from a thread, which maps directly to a Job running only one task (LONG_JOB).

Mutex, semaphore, condition and barrier are all implemented with the Boost library's condition and mutex abstractions. These are used directly through the manager interface, with calls emitted by the compiler back-end.

(a) Render of objects with 7K+ triangles

(b) Render of box objects

Figure 3: Example usage of real-time instruction for LoD.

3.2 Instructions and OpenMP

Our instructions provide syntax-level real-time and parallel instructions. Other works also provide syntax-level parallel instructions, such as the OpenMP model. OpenMP is an API for multi-platform shared-memory parallel programming in C/C++ and Fortran; it is implemented in many compilers and is independent of the target platform.

OpenMP was developed for High Performance Computing. It has a large set of instructions and is therefore not a simple API. The CEx language targets parallel processing as OpenMP does, but with a different goal: we wanted the programmer to have an easy way of using parallelism instructions without being obliged to deal with scheduling dynamics and internal block specifications. OpenMP is thus a more complex and complete tool, but it requires a high level of formality to write the code. Figure 4 depicts that formality with a simple vector addition example. As a standard, OpenMP is tested and reliable, but for practical daily usage we believe that CEx can be more effective in terms of programming time. We also wanted the interference with the existing code syntax to be minimal, so scheduling and block size are hidden instead of exposed, which results in a major difference between the two APIs.

It should also be highlighted that OpenMP has no support for real-time instructions, because its conceptual model does not include real-time signals. Although it may be possible to use OpenMP in a real-time environment, e.g. an RTOS, that is essentially different from using a conceptual model that is target independent and covers real-time instructions.

The next section describes the Ronin engine architecture.


    #include <omp.h>
    ...

    void vectorAdd(float *a, float *b, float *c) {
    #pragma omp parallel shared(a, b, c, chunk) private(i)
        {
    #pragma omp for schedule(dynamic, chunk) nowait
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }
    }

    int main() {
        ...
        vectorAdd(a, b, c);
        ...
        return 0;
    }

Figure 4: Vector addition example performed with OpenMP instructions in C++. OpenMP exposes the scheduling policy and block size.

4 Ronin engine

In this section, we present the architecture of the Ronin 3D game engine. We explain the physics, collision detection and rendering modules and the basic architecture of the engine.

4.1 Basic architecture

The engine’s basic modules are rendering, physics and collision detection. For physics we use the Bullet Physics engine, while rendering and collision detection are implemented by the engine itself.

4.2 Bounding volume hierarchies

The engine uses octrees and Binary Space Partitioning (BSP) trees for rendering optimization and collision detection. The rendering part uses a quadtree or an octree to optimize view clipping and culling. The octree and quadtree use the loose variant, as in [Akenine-Möller et al. 2008]. Collision detection uses BSP trees to store the geometry of complex objects.

4.3 Rendering

The rendering module was implemented to be flexible, allowing single-threaded or multi-threaded rendering.

4.3.1 Single thread Rendering

The sequential version of the rendering module traverses the scene tree (octree or quadtree), clipping the nodes and objects against the frustum. The objects that intersect the frustum are then rendered.
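The traversal can be sketched as a recursive walk that skips every subtree whose bounding box falls outside the frustum. The types and names below (`vec3`, `plane`, `aabb`, `node`, `collect_visible`) are assumptions made for this illustration and are not the Ronin engine's interface; the box test is the usual conservative check of one box corner against each plane.

```c
#include <stddef.h>

struct vec3 { float x, y, z; };
/* A point p is inside the plane when dot(n, p) + d >= 0. */
struct plane { struct vec3 n; float d; };
struct aabb { struct vec3 min, max; };

struct node {
    struct aabb bounds;
    struct node *children[8];  /* NULL where there is no child */
    int object_id;             /* -1 when the node holds no object */
};

/* Returns 0 if the box is entirely outside at least one plane and 1
   otherwise. Conservative: a box near a frustum corner may be kept. */
int aabb_intersects_frustum(const struct aabb *b,
                            const struct plane *planes, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        const struct plane *pl = &planes[i];
        /* Box corner that lies farthest along the plane normal. */
        struct vec3 v = {
            pl->n.x >= 0 ? b->max.x : b->min.x,
            pl->n.y >= 0 ? b->max.y : b->min.y,
            pl->n.z >= 0 ? b->max.z : b->min.z
        };
        if (pl->n.x * v.x + pl->n.y * v.y + pl->n.z * v.z + pl->d < 0)
            return 0;
    }
    return 1;
}

/* Appends the ids of visible objects to out (capacity cap), starting
   at index n, and returns the new count. */
size_t collect_visible(const struct node *nd, const struct plane *planes,
                       size_t count, int *out, size_t n, size_t cap)
{
    if (!nd || !aabb_intersects_frustum(&nd->bounds, planes, count))
        return n;
    if (nd->object_id >= 0 && n < cap)
        out[n++] = nd->object_id;
    for (int i = 0; i < 8; i++)
        n = collect_visible(nd->children[i], planes, count, out, n, cap);
    return n;
}
```

Rejecting a node also rejects everything below it, which is where the tree saves work compared with testing every object.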

4.3.2 Multi thread Rendering

In the multi-threaded version of the rendering module, a pool of visible objects is populated by one or more clipping threads (worker threads that perform clipping on octree/quadtree nodes and objects). The rendering thread draws the objects in that pool without needing to clip them against the frustum, since that was already done by one of the worker threads.

                     Setup 1                Setup 2
Memory               4 GB                   4 GB
Graphics adapter     9600M GT 512 MB        9600 GT 512 MB
Processor            Core 2 Duo 2.13 GHz    Phenom X4 2.0 GHz

Table 1: Setups for the prototypes

Figure 5: Times, in seconds, for the execution of a parallel (parallel_for) and sequential (for) Fibonacci algorithm

4.4 Collision detection

The collision detection module has a different purpose from physics collision detection. Instead of detecting all collisions, for collision response, the collision detection module targets the collisions of a single object. It also allows an arbitrary geometric object to be used to detect collisions. The collision detection module is intended to be used directly by game logic.

This module uses a regular BVH to organize the objects in space. Each collidable object has a CollisionHandler, which holds the collision geometry (which need not be the same as the physics or rendering geometry) and performs the collision test. For complex objects, the geometry is stored in a Binary Space Partitioning tree.

4.5 Physics

The physics module uses the Bullet [Bullet 2009] physics engine. This module implements the facade pattern [Gamma et al. 1994] to ease the use of the physics engine and to integrate it better with the rest of the engine.

4.6 Other modules

Among the other important modules are the Input and Shader loading modules. The Input module supports basic events such as mouse, keyboard and joystick events and events related to window management (focus and resize). Shader loading uses the abstract factory pattern [Gamma et al. 1994], allowing factory interfaces for many different shading languages.

5 Prototypes

Some prototypes were implemented to test the language concurrency support and the graphics library. They were run on the setups listed in table 1. The first setup has a dual-core processor, while the second has a quad-core processor.

5.1 Parallelization examples

Examples of concurrent programs are a simple way to show the parallelization capabilities of the concurrency library. Three benchmarks

SBC - Proceedings of SBGames 2010 Computing Track - Full Papers

IX SBGames - Florianópolis - SC, November 8th-10th, 2010 50


                  Setup 1   Setup 2
Sequential code   9.5       7.43
Parallel code     18.12     19

(a) Frames per second of a Mandelbrot series render

                       Setup 1   Setup 2
sequential             4.36      3.26
parallel - 1 thread    4.29      2.98
parallel - 3 threads   2.62      1.54
parallel - 6 threads   2.58      1.35

(b) Execution times, in seconds, for a parallel recursive Fibonacci execution

Table 2: Sequential and parallel setups for two examples

Figure 6: Eclipse syntax check tool showing a problem in the statement where a parenthesis was forgotten

were picked: a recursive Fibonacci algorithm and a parallel Mandelbrot set renderer.

The Mandelbrot example results are listed in table 2a and illustrated in figure. On the dual-core setup the parallel code runs almost twice as fast as the serial code (18.12 versus 9.5 frames per second), but on the quad-core setup it runs only about 2.56 times as fast (19 versus 7.43 frames per second), well short of four times. This happens because the communication between the CPU and the GPU starts to bottleneck the parallelizable code of the application. This is a practical observation of equation 2, which is the expression of Amdahl's Law.
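The relation between these measurements and Amdahl's Law can be checked with a few lines of C. The sketch uses the standard form of the law, S = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of processors; these symbol names are assumptions and may differ from those used in equation 2. The function names are hypothetical, and attributing the whole shortfall to a serial fraction is a simplification, since the text identifies CPU-GPU communication as the cause.

```c
#include <assert.h>

/* Measured speedup from two frame rates (higher is faster). */
double fps_speedup(double sequential_fps, double parallel_fps)
{
    assert(sequential_fps > 0.0);
    return parallel_fps / sequential_fps;
}

/* Amdahl's Law: speedup with parallel fraction p on n processors. */
double amdahl_speedup(double p, int n)
{
    assert(n >= 1 && p >= 0.0 && p <= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Amdahl's Law solved for p: the parallel fraction that would explain a
   measured speedup s on n processors (n > 1). */
double implied_parallel_fraction(double s, int n)
{
    assert(n > 1 && s > 0.0);
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

With the values of table 2a, `fps_speedup(9.5, 18.12)` is about 1.91 and `fps_speedup(7.43, 19)` is about 2.56. Read through Amdahl's Law, these correspond to effective parallel fractions of roughly 0.95 on the dual-core setup and 0.81 on the quad-core setup, which is consistent with a non-parallel cost that weighs more heavily as cores are added.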

The same happens with the Fibonacci code, but this example also suffers from task creation latency: the work done in each parallel loop iteration is not complex enough to show greater gains.

The results for both setups are in table 2b and plotted in figure 5. The plot shows that when the number of threads rises above 1.5 ∗ processors the gain is small. For example, on Setup 1 the transition from 3 to 6 threads yields almost no gain (2.62 s to 2.58 s), while on Setup 2 it yields a significant gain (1.54 s to 1.35 s). With more than 6 threads there would not be much further gain, and the thread creation overhead could actually make the algorithm run slower.
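For readers without the CEx toolchain, the structure of this benchmark can be approximated in plain C with POSIX threads. This is an analogue under stated assumptions, not the CEx `parallel_for` itself, whose syntax and scheduling are not reproduced here; `Slice`, `run_slice`, `parallel_fib_table` and `suggested_threads` are hypothetical names, and the last one simply encodes the 1.5 ∗ processors observation from the text.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16

/* Naive recursive Fibonacci, deliberately expensive for larger n. */
unsigned long fib(unsigned n)
{
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

typedef struct {
    unsigned first, count, stride; /* iterations first, first+stride, ... */
    unsigned long *out;
} Slice;

/* Body of one worker: computes its share of the loop iterations. */
static void *run_slice(void *arg)
{
    Slice *s = arg;
    for (unsigned i = s->first; i < s->count; i += s->stride)
        s->out[i] = fib(i);
    return NULL;
}

/* Rule of thumb taken from the measurements: about 1.5 threads per processor. */
int suggested_threads(int processors)
{
    int t = (processors * 3) / 2;
    if (t < 1) t = 1;
    if (t > MAX_THREADS) t = MAX_THREADS;
    return t;
}

/* Parallel counterpart of: for (i = 0; i < count; ++i) out[i] = fib(i); */
void parallel_fib_table(unsigned long *out, unsigned count, int threads)
{
    pthread_t tid[MAX_THREADS];
    Slice slice[MAX_THREADS];

    if (threads < 1) threads = 1;
    if (threads > MAX_THREADS) threads = MAX_THREADS;
    for (int t = 0; t < threads; ++t) {
        slice[t] = (Slice){ (unsigned)t, count, (unsigned)threads, out };
        int rc = pthread_create(&tid[t], NULL, run_slice, &slice[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < threads; ++t)
        pthread_join(tid[t], NULL);
}
```

This version creates one thread per slice rather than one task per iteration, so it pays the creation cost only `threads` times; a scheme that spawns a task for every iteration would expose the task creation latency discussed above much more strongly.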

5.2 Eclipse integration

To support the CEx language, an Integrated Development Environment (IDE) was needed with support for code refactoring; method generation; code completion; lexical, syntactic and semantic analysis; syntax highlighting; semantic navigation; version control tools; and multiple project management.

After evaluating these requirements, the environment was implemented as an extension of the Eclipse CDT plugin. In this way, a programming environment was built which supports syntactic and semantic

Figure 7: Mandelbrot half-rendered scene

checks as a pre-compilation phase. Also, CDT provides semantic indexing, so the intermediary library is automatically indexed when it is initialized. An example of its usage is illustrated in figure 6: the extension keywords and library structures are highlighted, and a syntax problem is reported.

CDT's internal Abstract Syntax Tree (AST) was modified in the process of this implementation. The grammar modifications were systematically applied to its AST; consequently, the lexical and syntactic checks are performed correctly. The semantic rules were handwritten, like those in the Clang compiler. Includes from the intermediary library are automatically detected and added to the path, as happens with both the C and C++ standard libraries.

5.3 Imprecise computation

Automatically adjusting the level of detail in response to scene complexity, sluggish rendering, and distant or fast-moving objects is common in game engines. As a proof of concept, we used the Mandelbrot set example to show how the real-time signaling instruction can change the rendering of the current frame. A half-rendered scene is illustrated in figure 7. This technique could be used by an automatic scene manager, so that the game complexity adapts to the computing capacity of the machine. That is, the game could adjust its video options according to whether deadlines are met, using our signaling instructions. The object-to-box swap example also highlights another use related to imprecise computation.
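The idea of cutting a frame short when its deadline is missed can be sketched in plain C. The CEx signaling instruction itself is not shown, and this sketch replaces the real-time deadline with a deterministic work budget counted in Mandelbrot iterations, which is an assumption made only so that the example behaves the same on every machine. The names `mandel_iterations` and `render_with_budget` and the image dimensions are hypothetical.

```c
#include <assert.h>

#define WIDTH 32
#define HEIGHT 16
#define MAX_ITER 64

/* Number of iterations of z = z*z + c before |z| exceeds 2 (capped). */
static int mandel_iterations(double cr, double ci)
{
    double zr = 0.0, zi = 0.0;
    int i = 0;
    while (i < MAX_ITER && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++i;
    }
    return i;
}

/* Renders rows from top to bottom until the budget is used up, then stops
   and keeps what was finished (the imprecise result).
   Returns the number of complete rows; HEIGHT means a full render. */
int render_with_budget(int image[HEIGHT][WIDTH], long budget)
{
    long spent = 0;
    assert(budget >= 0);
    for (int y = 0; y < HEIGHT; ++y) {
        if (spent >= budget)
            return y; /* "deadline" missed: leave the remaining rows as they are */
        for (int x = 0; x < WIDTH; ++x) {
            double cr = -2.0 + 3.0 * x / WIDTH;
            double ci = -1.0 + 2.0 * y / HEIGHT;
            int it = mandel_iterations(cr, ci);
            image[y][x] = it;
            spent += it;
        }
    }
    return HEIGHT;
}
```

A scene manager built on this pattern could compare the returned row count with `HEIGHT` after each frame and lower the resolution or iteration cap when frames keep coming back incomplete.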

6 Concluding remarks

Implementing a concurrent and real-time basis is one solution for building a graphics programming platform. Although our implementation is not complete, the prototypes implemented with it show that such a solution is feasible. The parallel prototype benchmarks show that the communication between CPU and GPU is a bottleneck for some applications, as much as the non-parallelizable parts of the algorithms. Optimizing this communication should be considered in these cases. Also, the results show that processing scales close to linearly, which supports the efficiency of our scheduling.

Game development is a complex subject, but the current platform can cover most parts of it, aside from sound support and an interface for the artificial intelligence step. Both can be added in the game code, so this is not a problem. In addition, LLVM can generate and execute code dynamically, just in time. A complete game development editor with real-time effects and code testing could be built this way. In fact, an editor plugin is on the project roadmap, and a fully functional editor is a long-term objective. Finally, the engine physics support needs to be improved to support deformable objects, which are already used in many games.


The language extension was implemented almost in full, and its missing parts are more syntax simplifications than necessities. Aside from improving the current implementation with Clang, an extension similar to the proposed one must be made for C++. Also, a debugging tool for detecting deadlocks and concurrency bottlenecks is needed, because as the complexity of the problems and programs grows, so does the interest in such tools. Some third-party profiling and debugging tools, such as the Valgrind framework, may already be used.

The language also needs improvements for shader programming. Instead of loading a shader as is currently done, the language should have its own shader syntax, integrated with the rest of the language, making it easier for game developers to use such technology and granting portability.

References

AKENINE-MÖLLER, T., HAINES, E., AND HOFFMAN, N. 2008. Real-Time Rendering 3rd Edition. A. K. Peters, Ltd., Natick, MA, USA.

ALVES, F. B., 2010. Ronin, May.

AMDAHL, G. M. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS '67 (Spring): Proceedings of the April 18-20, 1967, spring joint computer conference, ACM, New York, NY, USA, 483–485.

ARMSTRONG, J., 2003. Making reliable distributed systems in the presence of software errors.

BETHKE, E. 2003. Game Development and Production. Wordware Publishing.

BLUMOFE, R. D., AND LEISERSON, C. E. 1999. Scheduling multithreaded computations by work stealing. J. ACM 46, 5, 720–748.

BOEHM, M., 2008. Threadweaver: Threadweaver, November.

BOEING, A., AND BRÄUNL, T. 2007. Evaluation of real-time physics simulation systems. In GRAPHITE '07: Proceedings of the 5th international conference on Computer graphics and interactive techniques in Australia and Southeast Asia, ACM, New York, NY, USA, 281–288.

BULLET, 2009. Physics simulation forum.

BUTTAZZO, G. C. 2004. Hard Real-time Computing Systems: Predictable Scheduling Algorithms And Applications (Real-Time Systems Series). Springer-Verlag TELOS, Santa Clara, CA, USA.

2009. Cilk++, June.

DAGUM, L., AND MENON, R. 1998. OpenMP: an industry standard API for shared-memory programming. Computational Science and Engineering, IEEE 5, 1 (Jan-Mar), 46–55.

EBERLY, D. H. 2004. 3D Game Engine Architecture: Engineering Real-Time Applications with Wild Magic (The Morgan Kaufmann Series in Interactive 3D Technology). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

GAMMA, E., HELM, R., JOHNSON, R., AND VLISSIDES, J. M. 1994. Design Patterns: Elements of Reusable Object-Oriented Software, illustrated edition ed. Addison-Wesley Professional, November.

GEHANI, N. H., AND ROOME, W. D. 1988. Concurrent C++: Concurrent programming with class(es). Software: Practice and Experience 18, 1157–1177.

GEHANI, N., AND ROOME, W. D. 1989. The Concurrent C programming language. Silicon Press, Summit, NJ, USA.

IEEE. 2004. Standard for information technology - portable operating system interface (POSIX). Shell and utilities. Tech. rep.

ISO. 1999. ISO C standard 1999. Tech. rep. ISO/IEC 9899:1999 draft.

JIMÉNEZ, P., THOMAS, F., AND TORRAS, C. 2001. 3D collision detection: a survey. Computers and Graphics 25, 2, 269–285.

LATTNER, C., AND ADVE, V. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04).

LEISERSON, C. E. 2009. The Cilk++ concurrency platform. In DAC '09: Proceedings of the 46th Annual Design Automation Conference, ACM, New York, NY, USA, 522–527.

LIU, C. L., AND LAYLAND, J. W. 1973. Scheduling algorithms for multiprogramming in a hard-real-time environment. J. ACM 20, 1, 46–61.

VAN DEN BERGEN, G. 1999. A fast and robust GJK implementation for collision detection of convex objects. Journal of Graphics, GPU, and Game Tools 4, 2, 7–25.


Figure 8: Physics simulation of a thousand boxes. In I all boxes drop from the sky. In II they are stable. In III they are subjected to a huge force toward their center, and in IV and V they repel each other and come to rest, respectively

(a) Complete render of the Mandelbrot set. The values that are almost in the set are highlighted with darker tones

(b) Test game implemented in CEx. The AI is computed in parallel using the language instructions. The enemies throw balls in the player's direction with some swarm intelligence

Figure 9: Rendering tests with CEx and Facade-lib
