
Department of Science and Technology (Institutionen för teknik och naturvetenskap), Linköping University (Linköpings universitet)

SE-601 74 Norrköping, Sweden

LiU-ITN-TEK-A--17/007--SE

Development and Research in Previsualization for Advanced Live-Action on CGI Film Recording

Aron Tornberg
Sofia Wennström

2017-02-27


LiU-ITN-TEK-A--17/007--SE

Development and Research in Previsualization for Advanced Live-Action on CGI Film Recording

Master's thesis carried out in Media Technology at the Institute of Technology, Linköping University

Aron Tornberg
Sofia Wennström

Supervisor: Joel Kronander
Examiner: Jonas Unger

Norrköping 2017-02-27


Copyright

This document is made available on the Internet – or its future replacement – for a considerable time from the date of publication, barring exceptional circumstances.

Access to the document implies permission for anyone to read, download and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Subsequent transfers of copyright cannot revoke this permission. All other use of the document requires the author's consent. Technical and administrative solutions exist to guarantee authenticity, security and accessibility.

The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used in the ways described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or distinctive character.

For additional information about Linköping University Electronic Press, see the publisher's website: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Aron Tornberg, Sofia Wennström


LINKÖPING UNIVERSITY

Development and Research in Previsualization for Advanced Live-Action on CGI Film Recording

Master of Science in Engineering

Department of Science and Technology

Master’s Thesis

February 26, 2017

Authors:
Aron TORNBERG
Sofia WENNSTRÖM

Examiner: Jonas UNGER

Supervisor: Joel KRONANDER


Abstract

This report documents the theory, work and results of a master's thesis in Media Technology at Linköping University. The aim of the thesis is to come up with solutions for improving the film studio Stiller Studios's previsualization system. This involves a review and integration of game engines for previsualization in a motion control green screen studio, a camera calibration process with blur detection and automatic selection of images, as well as research into camera tracking and depth compositing. The implementation and research are based on literature within the computer graphics and computer vision fields as well as discussions with the Stiller Studios employees. The work also includes a robust camera simulation for testing camera calibration methods on virtual images, capable of modeling the inverse of Brown's distortion model, something largely unexplored in existing literature. The visual quality of the previsualization was substantially improved, as was Stiller Studios's camera calibration process. The work concludes that the CGI filmmaking industry is developing rapidly, which leads to a discussion of alternative solutions and of the importance of modularity.


Contents

Abstract

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Limitations
  1.4 Typographic conventions
  1.5 Planning

2 Existing system
  2.1 Film studio setup
    2.1.1 Pipeline
    2.1.2 Shooting process
    2.1.3 Flair
    2.1.4 DeckLink
  2.2 Camera calibration
  2.3 Problem definition
    2.3.1 Low quality rendering
    2.3.2 Slow and imprecise camera calibration
    2.3.3 Motion control safety and errors
      Safety
      Mechanical errors
    2.3.4 Rudimentary compositing
    2.3.5 Suggested improvements

3 Theory
  3.1 Rendering
    3.1.1 Offline rendering
      Ray tracing
    3.1.2 Real-time rendering
      Rasterization
      Real-time global illumination
      Real-time shadows
      Forward and deferred rendering
  3.2 Camera fundamentals
    3.2.1 Pinhole camera
    3.2.2 Real camera
      Diffraction
      Proportions
      Lens Distortion
      Aperture
      Exposure
      Center of projection and angle of view
      Zoom and prime lenses
  3.3 Camera model
    3.3.1 Other camera models
    3.3.2 Modelling blur
  3.4 Geometric camera calibration
    3.4.1 Zhang's method
    3.4.2 Reprojection error
    3.4.3 Camera tracking
    3.4.4 Control point detection
      Corner detection
      Center of squares
      Circles
    3.4.5 Precalculation of lens distortion
    3.4.6 Iterative refinement of control points
    3.4.7 Fiducial markers
    3.4.8 Image selection
    3.4.9 Blur detection
  3.5 Depth detection and shape reconstruction

4 Method
  4.1 Previsualization using game engines
    4.1.1 CryENGINE + Film Engine
    4.1.2 Ogre3D + MotionBuilder
    4.1.3 Stingray
    4.1.4 Unity 5
    4.1.5 Unreal Engine 4
    4.1.6 Game engine compilation
      Graphics
      Development
      Usability
    4.1.7 Previsualization tool implementation and selection
      Unreal Engine implementation
      Film Engine implementation
  4.2 Improved camera calibration
    4.2.1 OpenCV
    4.2.2 Previzion
    4.2.3 A new camera calibration process
    4.2.4 Conditions
    4.2.5 Calibration process
    4.2.6 Calibration software
    4.2.7 Calibration pattern design and marker detection
      ArUco pattern
      Concentric circle pattern
    4.2.8 Image selection
    4.2.9 Iterative refinement of control points
    4.2.10 Calculating extrinsic parameters relative to the robot's coordinates
      Offline calibration
  4.3 Camera tracking solution review
    4.3.1 Ncam
    4.3.2 Trackmen
    4.3.3 Mo-Sys
  4.4 Commercial multi-camera systems
    4.4.1 Markerless motion capture systems
    4.4.2 Stereo cameras
    4.4.3 OptiTrack
  4.5 Camera calibration simulation
    4.5.1 The camera matrix
    4.5.2 Inverse lens distortion
    4.5.3 Blur and noise

5 Results
  5.1 Previsualization using graphics engine
  5.2 Camera calibration
  5.3 Image selection

6 Discussion and future work
  6.1 Graphics engine integration
  6.2 Modular architecture
  6.3 Camera tracking
  6.4 Compositing
  6.5 Object tracking
  6.6 Augmented reality
  6.7 Camera calibration
  6.8 Using calibration data

7 Conclusions

Appendix A Inverse distortion


Chapter 1

Introduction

Pure live action movie production is often limited by time, location, money and the possibilities of practical effects. Building sets, costumes and animatronics takes time. Certain locations are not accessible for shooting when needed, or at all. Large crowds of people can be hard to come by. All of these factors can also be prohibitively expensive, and some things simply cannot be done in a satisfying manner using practical effects.

Computer generated images are in many cases indistinguishable from reality to the human eye and are used more and more in films and advertising. Most of the photos in the Ikea catalog are computer generated. However, in many cases realistic computer generated images are hard and expensive to produce and often fall short in realism (this is especially the case when dealing with animation and human beings).

The best of both worlds can be achieved by mixing realities and shooting live action on CGI using green screens. This allows scenes that would be too expensive or even impossible to film in full live action, while still getting the realism of real human actors.

A major disadvantage of shooting on green screen with CGI is that the director cannot see or interact with the virtual elements in the scene and thus does not get a realistic view of what the end result will look like, nor is it possible to make changes as necessary by moving objects and actors around during the shooting session. This disadvantage can be mitigated by the use of previsualization, where the director is given a rough take of what the final cut will look like by combining the filmed material with the virtual environment in real time. To achieve this, a number of problems should be solved, listed here in descending order of importance:

• At minimum, a solution for compositing the camera feed with a rendering of thevirtual scene, placing actors and props in the virtual environment, is needed.

• To allow camera movement, the parameters of the camera in the virtual scene should match those of the real camera.

• To allow renderings with dynamic scenes and more advanced camera movements, a real-time rendering of the virtual scene is needed.

• To allow the director to make corrections between takes, the previsualization tool should allow for easy and fast changes within the virtual scene.

Photorealistic rendering has long been the domain of offline rendering alone. Advances in computer hardware and rendering algorithms have improved, and are still improving, real-time rendering massively. Several game engine developers are starting to take advantage of this fact to market themselves as tools for filmmaking, and several short films have been made to demonstrate these capabilities [1] [2].


1.1 Background

Stiller Studios at Lidingö is one of the world's most technologically advanced film studios. Instead of letting a film crew travel around the world recording different environments, every shot is set up and recorded in a single location. This is done with several digital tools which cooperate in the green screen studio to build a scene that does not exist in reality.

What particularly stands out for Stiller Studios is the film material delivered to the customer. Stiller Studios market themselves as the only green screen studio in the world that generates perfectly matched foreground and background clips without first having to process them, something that is extremely time-saving.

The company is especially specialized in previsualization, which is used to give an idea of the final result before post-processing. The existing previsualization tool could not handle light, reflections and other visual phenomena required for realistic renderings. Moreover, there was development potential in terms of including camera calibration and depth detection. Therefore two students from the M.Sc. program in Media Technology at Linköping University were given these research areas to examine and develop as part of their thesis work.

1.2 Purpose

The thesis work is meant to address the previsualization problems and wishes that Stiller Studios has, in collaboration with a supervisor and examiner at Linköping University. The following questions are designed to correspond to the assignments of the work:

• How can Stiller Studios’s previsualization be improved?

• Which methods are suitable for improving the previsualization?

• What is the intended outcome of improving the previsualization?

1.3 Limitations

Since the work corresponds to full-time studies for two students during one semester, it must be limited in terms of capabilities and performance. The hardware and software solutions therefore depend on what is considered most appropriate for the studio structure within the timeframe available.

1.4 Typographic conventions

The following typographic conventions are used in this report:

• Italic text refers to a variable in a mathematical expression.

• Italic and bold text marks the first presentation of a product or organization.


1.5 Planning

At start-up, the students had little or no experience with the various tools Stiller Studios planned to integrate with the previsualization system. Therefore, a preparatory evaluation of the tools was conducted, followed by an implementation of those that seemed most suitable for the task. The preparatory work also included a summary of the company's pipeline and workflow. In addition to this, further developments, such as camera calibration and depth detection, were included in the planning schedule.


Chapter 2

Existing system

2.1 Film studio setup

FIGURE 2.1: Virtual view of Stiller Studios's green screen studio, showing the cyclops.

Stiller Studios’s previsualization system is based on several components that interact.The first step consists of a 3D view of the camera, according to customer requirements.With the help of a plugin built for the animation software program Maya, it is possibleto digitally match that scene with the actual green screen studio. When the matchingis considered completed (all elements are positioned so that the upcoming shooting isconsidered physically feasible) is real camera data connected to the virtual one. Thecamera in the studio is attached to a motion control robot known as the cyclops 2.1and gets set to the correct position according to the scene in Maya. The software thatcontrols the cyclops is called Flair and is not communicating with Maya, but with thereal-time engine MotionBuilder. Data is therefore sent from Maya via MotionBuilderto the cyclops in order to adjust the camera position for the scene. The final imageis then given by a combination of foreground and background images in QTAKE, anadvanced video system with integrated assistance for keying (removing the green color


Foreground images are given from the camera and background images from MotionBuilder. This process is shown in figure 2.2.

FIGURE 2.2: The process of CGI combined with live action camera data.

MotionBuilder runs in two instances: one provides compressed image data to QTAKE for real-time rendering, and one renders high-resolution images for post-processing. The shooting data is saved in the database software FileMaker.

2.1.1 Pipeline

The Stiller Studios pipeline is written in Python and is based on a specific folder structure. It would be possible to rewrite the pipeline in C++ to speed it up, but it is not a priority or even necessary for the studio at the moment.

The pipeline server runs FileMaker and waits for REST commands that send and receive database information. The commands are given by the user via a Python command-line interface. Each command has its own Python file, containing the command's functionality.

For each project in FileMaker there are several film clips. Every project has a unique ID. The clips are divided into locations with different scenes. Under each scene there are different shots, and under each shot there are assets (video elements). Assets are shots of the same scene but from different angles. All shots are saved in a table where a new row is created for each shooting. The rows can be rated so that the customer can afterwards check whether a shot went well or not.

The command-line interface is based on an API package, which has the same structure as FileMaker does (film→scene→shot→asset).
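As an illustration of this structure, the following is a minimal sketch of what one such command file could look like, assuming the pipeline server exposes one REST endpoint per command; the address, endpoint and field names are hypothetical and not Stiller Studios's actual API.

    # Hypothetical sketch of a single pipeline command file. The server
    # address, endpoint and payload fields are placeholders.
    import sys
    import requests

    PIPELINE_SERVER = "http://pipeline.local:8080"   # placeholder address

    def create_shot(film_id, scene, shot):
        """Register a new shot under film -> scene -> shot in the database."""
        payload = {"film": film_id, "scene": scene, "shot": shot}
        response = requests.post(PIPELINE_SERVER + "/shots", json=payload)
        response.raise_for_status()
        return response.json()   # e.g. the created record with its unique ID

    if __name__ == "__main__":
        # Mirrors the command-line interface: one command per Python file.
        print(create_shot(sys.argv[1], sys.argv[2], sys.argv[3]))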

The current pipeline uses UDP for communication between all of the system's different applications and computers. The advantage of UDP is that it is very simple and fast. The disadvantage is that messages are not guaranteed to arrive at all, or in the correct order, which may become a problem, especially when it comes to getting precise movement data from the camera robot.


According to Stiller Studios's software developers this should not be a problem, partly because the data is sent locally and partly because new data is sent continuously, and Stiller Studios has not experienced it working poorly.
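The following sketch shows roughly how such UDP data can be received with Python's standard library; the port and the packet layout (a frame counter followed by three floats) are illustrative assumptions, and the frame counter is only there to show one way of coping with late or duplicated packets.

    # Minimal UDP receiver sketch; port number and packet layout are assumptions.
    import socket
    import struct

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 5005))              # hypothetical port
    latest_frame = -1

    while True:
        data, _ = sock.recvfrom(1024)
        frame, x, y, z = struct.unpack("<ifff", data[:16])
        if frame <= latest_frame:             # late or duplicated packet, drop it
            continue
        latest_frame = frame
        # ... hand (x, y, z) to the previsualization renderer ...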

2.1.2 Shooting process

Before the studio starts recording, it must be prepared. This is done via a prepare REST command. The command sends out information so that all hardware components involved in the shooting know what to do. If any hardware is not working, the user must be informed; this is something that is currently under development.

A Raspberry Pi (a credit card-sized computer) works as the leader of the studio setup. It is attached to a RED camera that, via a serial port, tells it when a recording starts. The Raspberry Pi receives and controls the sent commands and then forwards the message to all other hardware involved (computers running 3D software and the camera/cyclops). When a prepare command is sent, the Raspberry Pi tells the computers which visual world to load and tells the cyclops that camera data is needed. Additional information from the Raspberry Pi is sent when the camera starts recording, such as possible animations that should be triggered in MotionBuilder.

Once a scene has been recorded, the camera magazine gets copied by inserting it into a computer that scans the data and places it in the correct folder according to the project structure explained in section 2.1.1.

2.1.3 Flair

Motion control data from the camera can be streamed from Flair via UDP, TCP or serial ports. It is possible to manually set the format in which the data should be sent in Flair.

The different modes for streamed data tested in Flair are called XYZ, Axis and MotionBuilder. XYZ and MotionBuilder contain values of the data type float that represent the camera's position, target, roll, zoom and focus. Axis sends the actual axis values of the robot. The MotionBuilder mode was written by the developers of Flair at the request of Stiller Studios and differs from the XYZ mode by also sending time data with the packages [3]. This allows MotionBuilder to know not only the position of the camera, but also the point in time to which that camera position belongs, and it can therefore adjust the timeline in MotionBuilder. This is important for scenes containing animation.

From the position of the camera and the target, it is trivial to figure out a direction vector. With this direction vector together with the roll, it is possible to calculate the orientation of the camera.
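A sketch of this computation is given below, assuming a Y-up world and a camera looking from its position toward the target; the exact axis conventions of Flair and MotionBuilder may differ.

    # Sketch: camera orientation from position, target and roll.
    # Axis conventions (Y-up, orthonormal right/up/forward basis) are assumptions.
    import numpy as np

    def camera_orientation(position, target, roll_deg):
        forward = np.asarray(target, float) - np.asarray(position, float)
        forward /= np.linalg.norm(forward)
        world_up = np.array([0.0, 1.0, 0.0])
        right = np.cross(world_up, forward)
        right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        # Apply the roll as a rotation of the right/up axes around forward.
        r = np.radians(roll_deg)
        right_r = np.cos(r) * right + np.sin(r) * up
        up_r = -np.sin(r) * right + np.cos(r) * up
        # Columns are the camera's right, up and forward axes in world space.
        return np.column_stack((right_r, up_r, forward))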

2.1.4 DeckLink

On Stiller Studios’s computers there are DeckLink capture cards from Blackmagic whichcan take video signals via SDI connections. This image signal can be handled throughBlack Magic’s DeckLink SDK. An Interface Definition Language file (IDL) can be in-cluded and compiled into a Visual Studio project to generate a header with functionsthat can be called from a C++ application to communicate with the DeckLink card in-stalled on the computer.

Via the SDK, it is possible to find and iterate through all DeckLink cards installed on the computer, and also to find and iterate through all video output connections available on these cards.


The SDK provides functions to get the image data in different formats (resolution, color space and color depth) as a byte string which can be used freely by the user. Exactly which formats are supported may vary between different cards. Some cards also make it possible to perform alpha keying efficiently, combining images with transparency.

2.2 Camera calibration

An important part of filming live action on CGI is the camera calibration, which makes sure that the correct camera settings for the current lens, such as angle of view and distortion, are known. A more in-depth explanation of these parameters is found in section 3.2 on camera fundamentals.

Stiller Studios’s calibration process is done by setting up a large ”ruler” on the farend of the robot’s rails using movable walls and tape where the zero position of theruler is lined up due to the rails and robot. The robot is moved as far away from theruler as possible and the lens’s focus distance is set to infinite and looks at the zeroposition of the ruler. The length of the ruler visible to the camera is estimated visuallyand the view angle and focal length is calculated using trigonometry as seen in figure2.3.

This process is repeated for all focus distances written on the lens, noting discrepancies between the actual focus distance and the given focus distance.
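The trigonometry in figure 2.3 essentially rearranges equation 3.2 from the theory chapter: the visible ruler length and the camera-to-ruler distance give the angle of view, which together with the sensor width gives an equivalent focal length. The following sketch uses made-up example numbers.

    # Sketch of the ruler trigonometry; all numbers are example values only.
    import math

    def angle_of_view(visible_length, distance):
        return 2.0 * math.atan((visible_length / 2.0) / distance)

    def focal_length(angle, sensor_width):
        return sensor_width / (2.0 * math.tan(angle / 2.0))

    a = angle_of_view(visible_length=4.0, distance=10.0)   # metres
    print(math.degrees(a))                                 # ~22.6 degrees
    print(focal_length(a, sensor_width=0.030))             # ~0.075 m, i.e. 75 mm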

Lens distortion, the bulging or contraction of the image due to the optics, is handled in Nuke (post-processing software). An image of a grid is taken by the camera, and Nuke then calculates the distortion from the warping of the grid.

The position of the center of projection, or no-parallax point, relative to the sensor is calculated by rotating the camera and finding the rotational position where parallax does not occur.

FIGURE 2.3: Sketch of the Stiller Studios camera calibration process.

2.3 Problem definition

From observation and discussion with the Stiller Studios crew, a number of areas with room for improvement were identified.


2.3.1 Low quality rendering

Optimally, the previsualization rendering of the virtual scene should be as close to the postproduction result as possible. The graphics provided by MotionBuilder, however, do not hold a particularly high quality, lacking advanced materials and rendering methods. MotionBuilder scenes often look flat and represent a poor attempt at conveying the feel of a scene. This makes MotionBuilder mostly useful as a purely technical tool for knowing the physical limits of the scene and where actors should and should not stand, rather than as a creative tool providing the director with a feel for the scene.

2.3.2 Slow and imprecise camera calibration

To provide a good previsualization and to send correct data to the customer for production, it is important that the camera calibration works well. Optical distortion and field of view must be known in order to accurately match the virtual background to the filmed foreground. The position of the center of projection is particularly important to know, since it is the same as the no-parallax point: the point around which the camera can rotate without parallax, that is, without objects in the scene appearing to move relative to each other. Thus, when Stiller Studios is filming against a static background, the camera must rotate only around the no-parallax point to prevent the filmed material from sliding relative to the background.

The current camera calibration technique takes a massive amount of time to set up and execute. If someone disturbs the ruler, or if it is not set up correctly, the results will be ruined and the calibration has to be redone. The calibration also relies on the naked eye judging when things look right, which introduces a large degree of imprecision.

The current method also fails to account for images that are squeezed or stretched by the lens and camera, or for cases where the optical center is not in the middle of the image.

Stiller Studios’s method for camera calibration also prevents the company to useother types of lenses such as zoom lenses where the center of projection and angle ofview change by a large degree since it would simply take to much time.

2.3.3 Motion control safety and errors

Stiller Studios’s camera tracking relies on translating motor values from the camerarobot into camera parameters. This has the benefit of allowing the studio to preciselyplan and repeat exact camera movements. However, the use of motion control forcamera tracking has two major drawbacks - safety and mechanical errors.

Safety

The robot is heavy, fast and requires safety measures. Operation of the robot is limited to certified motion control supervisors. Precisely planning the movement of the camera is not only a possibility but in many cases a necessity. The safety requirements for the motion control robot can slow down the shooting, especially if changes to the camera movement are required.

Mechanical errors

Due to the robot’s weight and speed, mechanical errors exist. Rapid retardation willresult in camera shaking due to inertia. When the robot fully extends its arm it will be


Mechanical errors also occur due to backlash in the robot's gears.

2.3.4 Rudimentary compositing

Stiller Studios combines the camera feed with virtual scenes by simply keying out the green screen and adding the camera feed on top of the virtual scene. No depth information about the filmed material is known or used. This limits the kinds of scenes that can be previsualized, since the actors cannot go behind virtual objects, leading to unrealistic previsualization or requiring green screen props as stand-ins for the virtual objects.

2.3.5 Suggested improvements

From the problem definitions four areas of improvement can be defined:

• Improved rendering.

• Improved camera calibration.

• Alternative camera tracking.

• Depth compositing.

In order to limit the scope of the project, the majority of the work has focused on rendering and camera calibration. In chapter 3 the underlying theory behind these concepts is explored, while in chapter 4 methods for solving these problems are evaluated and implemented.


Chapter 3

Theory

3.1 Rendering

Many people encounter computer renderings every day: TV, the internet, video games and billboards on the streets are just some examples. Rendering is the process where a geometric description of a virtual object is converted from 3D into a 2D image that looks realistic. There are both offline and real-time rendering techniques [4], some of which are presented below.

3.1.1 Offline rendering

Offline rendering is used by systems that prioritize accuracy over frame rate. A single frame can, depending on what is being rendered, take hours to complete.

Ray tracing

Physically correct simulations of how light is transported are sought when rendering realistic 3D scenes. This can be done by casting rays from the camera, through a pixel, into the scene to calculate shadows and reflections from how the rays bounce between different scene surfaces, a technique called ray tracing. It requires heavy computation and is therefore traditionally performed offline.

3.1.2 Real-time rendering

Real-time rendering enables direct interaction between a user and a virtual environment and is especially common in the computer games industry. It consists of algorithms using rasterization for converting geometry into pixels, and of techniques for defining what, how and where pixels should be drawn.

Rasterization

Rasterization converts a vector graphics image into a raster (pixel or dot) image. The technique is extremely fast but does not, unlike ray tracing, prescribe any way of simulating reflections or shadowing. To handle these issues, rasterization is combined with the real-time global illumination and real-time shadow methods described in the following subsections. This will, however, not make the rendering result look as realistic as ray tracing.

Real-time global illumination

Two methods used for real-time global illumination are baking and voxel cone tracing. Baked global illumination has the limitation that it can only handle static objects, since the technique must store light information before it can be processed further.


Voxel cone tracing, which is based on a hierarchical voxel octree representation [5], supports dynamic indirect lighting but does not look as good as baked lighting.

Real-time shadows

There are several ways of generating real-time shadows, for instance shadow mapping and shadow volumes. Shadow mapping is image based and shadow volumes are geometry based [6], which means that shadow mapping is faster but not as precise as shadow volumes.

Forward and deferred rendering

There are rasterization techniques for determining how pixels should be rendered in real time. Two examples of these are forward rendering and deferred rendering [7] [8].

Forward rendering supplies the graphics card with geometry that is broken down into vertices, which are then split and transformed into fragments, or pixels, that are rendered before being passed on to the screen. It is a fairly linear process and is done for each geometry in the scene before the final image is produced [9].

Deferred rendering, on the other hand, performs its calculations directly on the pixels on the screen instead of relying on the total fragment count. This simplifies the use of many dynamic light sources within a scene, acting as an optimized form of forward rendering. However, it cannot handle everything that forward rendering does, for example the rendering of transparent objects. It also requires newer hardware to run.

3.2 Camera fundamentals

3.2.1 Pinhole camera

A simple model for understanding how a camera or an eye works is the pinhole camera. The idea behind a pinhole camera is essentially to take a lightproof box with a small hole (a pinhole) on one side. An upside-down projection of the view outside is projected on the opposite side of the box, as a result of the pinhole blocking all rays of light not coming from the pinhole's direction, as seen in figure 3.1.


FIGURE 3.1: The pinhole camera model [10].

The image of a perfect pinhole camera, with the pinhole as a single point in space and no bending of light, can be described using the following equations

    -x = f X / Z    (3.1a)

    -y = f Y / Z    (3.1b)

where x and y are the coordinates of the projection on the image plane, f is the focal length or distance between the image plane and the pinhole, X and Y are the horizontal and vertical displacement from the pinhole and Z is the depth of the observed object.

The angle of view of the image is given by the following formula

    a = 2 arctan(d / (2f))    (3.2)

where a is the angle of view, d is the dimension of the projected image (width or height depending on whether a horizontal or vertical angle of view is requested) and f is the focal length.
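As a worked example with illustrative numbers only: a sensor that is d = 36 mm wide placed f = 50 mm behind the pinhole gives a horizontal angle of view of

    a = 2 arctan(36 / (2 · 50)) = 2 arctan(0.36) ≈ 39.6°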

By removing the minus signs from equations 3.1, a correctly oriented image is given. This can be geometrically interpreted as putting the image plane in front of the pinhole and coloring it based on the rays that pass through. This mathematical abstraction of the pinhole camera is typically known as the projective transform, and the pinhole as the center of projection [11].

This is the camera model commonly used when rendering virtual 3D images. When modeling real cameras, there are a few other factors that need to be taken into account.


3.2.2 Real camera

One of the biggest issues with the pinhole camera is that in order to get a clear image the pinhole needs to be very small. A larger hole will allow light from a larger area to reach the same location on the image plane or sensor in the camera, creating blur. A smaller hole, however, means less light and a darker image.

Diffraction

Another problem with real pinhole cameras is diffraction. While light is commonly modeled as perfect rays, this does not quite accurately describe how light behaves in real life, due to the wave-particle duality of light. A consequence of the wave nature of light is that it bends slightly around corners. Thus, a small pinhole will have most light bend around its edges, resulting in an effective blurring of the image and also the creation of an artifact known as the Airy disk [12]. This limits how sharp an image a real camera can produce.

Proportions

Yet another problem with the pinhole camera is that the angle of view depends completely on the relation between the size of the projected image or sensor and the distance between the image plane and the pinhole, making very large or very small angles of view infeasible.

Lens Distortion

To get around the pinhole camera problems mentioned above, real cameras use lenses. The lens allows for a larger opening while retaining a crisp image by focusing light rays originating from the same point in space to the same position on the sensor. The use of lenses does, however, result in a number of new problems.

Using lenses will result in various degrees of distortion, that is, deviation from the rectilinear projection of the pinhole camera model, in which straight lines in the scene remain straight after projection, as seen in figure 3.2. The most common forms of distortion introduced by lenses are radial and tangential distortion [11].


FIGURE 3.2: A rectangular grid warped by radial distortion.

Radial distortion occurs as a result of the radial symmetry of the lenses. Tangential distortion in turn occurs as a result of misalignment of the elements in the lens and camera, such as the sensor being at an angle relative to the optical axis of a lens.

A lens only focuses light correctly at a specific distance, and objects outside the focus distance are perceived as increasingly blurry. The range in which objects are considered sharp is called the depth of field, see figure 3.3. The human eye solves blurring by controlling the shape of the lens using muscles, thus changing the focus distance to whatever the person is looking at. Cameras usually solve the same problem by having one camera lens consist of an array of lens elements whose configuration can be changed mechanically, changing the focus distance.

FIGURE 3.3: Objects within the depth of field are perceived as sharp [13].

Aperture

The problem of depth of field can be mitigated by the use of an aperture that can close down and focus the image in essentially the same way as the pinhole camera, by only letting light in through a small opening, but with the same drawback of making the image darker.


How much light reaches the sensor is usually measured using the f-stop or f-number, which is equal to

    N = f / D    (3.3)

where N is the f-number, f is the focal length and D is the diameter of the effective aperture.

The larger the f-number is, the less light there is. The larger the diameter of the aperture is, the more light enters the camera. A larger focal length means that less light reaches the sensor, due to decreased energy density over the distance traveled. This means that the f-number gives a measurement of how much light is captured that is translatable between cameras with different focal lengths [14]. However, light is not only lost due to distance traveled but also through absorption by the optical elements of the lens. Therefore t-stops (transmittance stops) are used, which also take the transmission efficiency of the lens into account.
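As a worked example with illustrative numbers only: a lens with focal length f = 50 mm and an effective aperture diameter D = 25 mm has

    N = 50 / 25 = 2

normally written f/2. Halving the diameter to 12.5 mm gives f/4 and lets in a quarter of the light, since the light-gathering area scales with the square of the diameter.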

Exposure

Another method of making an image brighter is to increase the exposure time. Rather than increasing the area through which light enters the camera, it is possible to increase the amount of light by increasing the time period the sensor is exposed to light for each image. This however creates problems if there is movement in the image during the exposure, leading to motion blur.

Center of projection and angle of view

Finding the center of projection and angle of view for a pinhole camera is somewhat trivial, since these are well-defined and easily measured properties of the camera. The center of projection is at the same position as the pinhole. By measuring the distance between the pinhole and the sensor as well as the size of the sensor, the angle of view can be determined using equation 3.2. This, however, is not the case for real cameras with complex lenses. A common misconception is that the center of projection is the nodal point or front nodal point of a lens, or that the no-parallax point is the same as the nodal point or front nodal point, either as a mistake in terminology or a misunderstanding of how lenses work.

The no-parallax point is the point on the camera around which no parallax will occur if the camera is rotated. This means that objects seen from the camera will not appear to change their locations relative to each other when the camera is rotated. That the no-parallax point and the center of projection are the same thing can easily be understood, since the center of projection is the point where all rays entering the camera converge. Drawing lines between the points in the scene and the center of projection shows that the angle between objects only changes with the position of the center of projection, not with its orientation. This also has the convenient effect that the center of projection can be found manually by rotating the camera around different points until no parallax is achieved.
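This argument can be checked numerically: the angle between two scene points, measured at the center of projection, does not involve the camera orientation at all, but changes as soon as the center of projection is translated. The points and offsets below are arbitrary example values.

    # Numerical sketch of the no-parallax argument; all values are examples.
    import numpy as np

    def angle_between(p1, p2, cop):
        v1, v2 = p1 - cop, p2 - cop
        cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

    p1 = np.array([1.0, 0.0, 5.0])
    p2 = np.array([-1.0, 0.5, 8.0])

    # Rotating the camera does not move the center of projection, so this
    # angle, and hence the relative image positions, is unaffected by rotation.
    print(angle_between(p1, p2, cop=np.zeros(3)))
    # Translating the center of projection changes the angle: parallax occurs.
    print(angle_between(p1, p2, cop=np.array([0.3, 0.0, 0.0])))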

The center of projection is the same as the entrance pupil, or the apparent position of the aperture seen from the front of the camera. The position of the entrance pupil will also affect the angle of view of the camera; however, due to the bending of light by the lens, the angle of view cannot be calculated by simple measurements as for the pinhole camera. A short explanation is that the aperture puts the same constraints on incoming rays of light as the pinhole does in the pinhole camera.


All rays will pass by the aperture, and thus the aperture will in practice work as a center of projection. While this is intuitive for a small aperture, it also holds true if the aperture size is large. The same rays entering the small aperture will also enter the large one; however, more rays will enter the large one, creating blur. The position of the aperture will determine which part of the blurred image is sharpened, with the center of projection and angle of view decided by the aperture position. A more detailed explanation and proof can be found in [15].

For a lens with the aperture at the front, the position of the center of projection is easily determined. For many lenses, however, the aperture sits behind a number of lens elements, which bend the light before it hits the aperture. This is why the center of projection is not necessarily at the aperture's physical position but rather at its apparent position, as seen in figure 3.4.

FIGURE 3.4: A ray-traced telephoto lens, showing the entrance pupil and its relation, or lack thereof, to the nodal points or any physical part of the camera [15].

Zoom and prime lenses

Complex lenses used for professional photography and filmmaking usually fall into one of two categories: prime lenses and zoom lenses. Prime lenses are lenses with a supposedly constant focal length (and thus angle of view) and with the possibility to change the focus distance. However, due to "lens breathing" the focal length and the center of projection can change slightly when changing the focus of the camera. For zoom lenses this changing of focal length is not a bug but a feature, allowing the cameraman not only to change the focus distance but also the angle of view by a large amount, zooming in or out of the image as a result. What this means is that not only are the center of projection and angle of view hard to find for complex lenses, they also change depending on the mechanical settings of the lens.


3.3 Camera model

A more complete mathematical model of the pinhole camera is given by

    x = f_x X / Z + c_x    (3.4a)

    y = f_y Y / Z + c_y    (3.4b)

or, as a matrix using homogeneous coordinates,

    [x]   [f_x   0   c_x] [X]
    [y] = [ 0   f_y  c_y] [Y]    (3.5)
    [w]   [ 0    0    1 ] [Z]

where f_x and f_y are the horizontal and vertical focal lengths and c_x and c_y give the optical center in the image. Sometimes a fifth parameter representing skew in the image is added to the model [11].

The reason why there are two focal lengths, when the focal length of the pinhole camera is simply the distance from the image plane to the center of projection, is that the focal length here is given in the coordinate system of the image rather than the coordinate system of the observed scene. Different values for f_x and f_y will squeeze or stretch the image horizontally or vertically. In real cameras this can be used to compensate for the lens squeeze of anamorphic lenses, or for the sensor elements representing each pixel being rectangular rather than square (square being the general case for pixels).

c_x and c_y represent the optical center of the image, also given in image coordinates, that is, the intersection of the optical axis and the image plane. The optical axis is the line going through the center of projection orthogonal to the image plane. Typically the optical center is expected to be the actual center of the image; for an image given in pixel coordinates that is 800 pixels wide and 600 pixels high, c_x = 400 and c_y = 300. However, since real cameras and lenses are not perfect, this might vary slightly.

There also exist applications where the optical center and the image center should differ. It is possible to think of the center of projection as an observer in itself and the image plane as a screen or window that the observer is looking at. Thus, most images will look right when viewed from a centered position, aligning the observer's view with the center of projection. If an image is meant to be viewed from an angle, the position of the optical center should be changed accordingly. A notable example of this are VR displays, which create an illusion of depth by tracking the eyes of the user and changing the position of the center of projection [16].

To represent the camera position and orientation in the world, it is also possible to add a rotation and a translation matrix to the model, transforming the coordinates of objects given in the coordinate system of the scene to the coordinate system of the camera. This completes the pinhole camera model as

    q = M T R Q    (3.6)

where q is the image coordinates, M is the camera matrix describing the mapping of the pinhole camera from 3D points in the world to 2D points in an image, T is a translation matrix, R is a rotation matrix and Q is the object space coordinates.
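A small numerical sketch of equations 3.5 and 3.6, with example values for the intrinsic and extrinsic parameters, can make the pipeline concrete; here the rotation and translation are applied in ordinary 3-vector form, which is equivalent to the homogeneous matrices T and R above.

    # Sketch of the projection q = M T R Q with example parameter values.
    import numpy as np

    f_x, f_y, c_x, c_y = 800.0, 800.0, 400.0, 300.0
    M = np.array([[f_x, 0.0, c_x],
                  [0.0, f_y, c_y],
                  [0.0, 0.0, 1.0]])

    R = np.eye(3)                      # camera orientation (world to camera)
    t = np.array([0.0, 0.0, 2.0])      # camera translation

    Q = np.array([0.25, -0.1, 3.0])    # a point in object space

    camera_space = R @ Q + t           # extrinsic part (T and R)
    x, y, w = M @ camera_space         # homogeneous image coordinates
    print(x / w, y / w)                # pixel coordinates: (440.0, 284.0)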

To model real cameras, however, lens distortion must be taken into account, as discussed above. This can be done using Brown's distortion model [17].


Brown's distortion model describes the relation between the undistorted image coordinates of a point and its distorted position; correcting a distorted image amounts to inverting this mapping. Brown's model is given by

    x_d = x_u(1 + K_1 r^2 + K_2 r^4 + ...) + (P_2(r^2 + 2x_u^2) + 2P_1 x_u y_u)(1 + P_3 r^2 + P_4 r^4 + ...)    (3.7a)

    y_d = y_u(1 + K_1 r^2 + K_2 r^4 + ...) + (P_1(r^2 + 2y_u^2) + 2P_2 x_u y_u)(1 + P_3 r^2 + P_4 r^4 + ...)    (3.7b)

    r = √((x_u − x_c)^2 + (y_u − y_c)^2)    (3.7c)

where x_d and y_d are the distorted image coordinates, x_u and y_u are the undistorted image coordinates, x_c and y_c give the distortion center, K_n are the radial distortion parameters and P_n are the tangential distortion parameters.

Since Brown’s distortion model includes radial and tangential distortion based onTaylor polynomials, it can in theory be used to model radial and tangential distortionof arbitrary complexity [18].

The camera matrix together with Brown's distortion model has been shown empirically to be capable of giving relevant estimations for the types of cameras commonly used in photography and filmmaking. The variables of the camera matrix and the distortion functions are together referred to as the intrinsic parameters of the camera, that is, the parameters describing the camera's internal structure, whereas the orientation and position of the camera, given by the rotation and translation matrices, are referred to as the extrinsic parameters.

3.3.1 Other camera models

While the pinhole camera model with added distortion works well enough in most cases, it is not entirely physically accurate and has limitations. For example, it is generally unsuitable for fisheye lenses, since the model breaks when dealing with angles of view that exceed 180°. For these cases omnidirectional camera models can be used [19]. Another example of an alternative and more physically accurate camera model is the one proposed by Kumar et al. [20], which models radial distortion as a change in the position of the center of projection.

3.3.2 Modelling blur

A real camera produces both depth and motion blur, which the virtual, perfect pinhole camera does not. Correctly matching the cameras might require either correcting the blur in the real camera image or applying blur to the virtual one. In general this is less of a problem when shooting on green screen, since the actors and real props are usually at a similar distance from the camera or can be shot in several takes.

Simulating blur is generally easier than correctly sharpening an image, since information is lost in the blurring process. However, when attempting to sharpen blurred images, it would probably be preferable to look into deconvolution algorithms, which are algorithms used to reverse the effects of convolution [21].

Nvidia’s GPU Gems 3 discusses both problems of simulating motion blur and depthof field [22]. The suggested motion blur method applies motion blur as a post-processing


First, the world position of each pixel is calculated using the depth buffer. The previous world position of the pixel is then calculated using the view-projection matrix of the previous frame. The difference between these two positions gives the pixel's velocity. For each pixel, the colors of the pixels along the velocity direction are sampled and blended, creating motion blur. This method, however, only renders motion blur that results from camera movement.
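A sketch of the velocity step of this method is shown below, written as a NumPy post-process over whole buffers rather than as a shader; the depth buffer and the (inverse) view-projection matrices are assumed to be given, and clip-space conventions vary between engines.

    # Sketch of the per-pixel velocity computation used for motion blur.
    # Inputs (depth buffer, matrices) are assumed given; conventions may vary.
    import numpy as np

    def screen_velocity(depth, inv_view_proj, prev_view_proj):
        h, w = depth.shape
        # Normalized device coordinates for every pixel of the current frame.
        xs = np.linspace(-1.0, 1.0, w)
        ys = np.linspace(-1.0, 1.0, h)
        x, y = np.meshgrid(xs, ys)
        ndc = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)

        # Unproject to world space with the current inverse view-projection.
        world = ndc @ inv_view_proj.T
        world /= world[..., 3:4]

        # Reproject the same world positions with the previous frame's matrix.
        prev = world @ prev_view_proj.T
        prev /= prev[..., 3:4]

        # Screen-space velocity; sampling and blending colours along this
        # vector produces the motion blur.
        return ndc[..., :2] - prev[..., :2]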

If enough computing power is available, both the motion blur and depth of field problems can be solved in largely physically accurate ways by multisampling. Motion blur can be simulated by taking several subsamples between each frame and blending them. The depth of field problem can be solved by rendering the scene several times, each time offsetting the center of projection slightly while still having the camera look at the point that is the center of focus, and then blending the samples. However, both of these methods can become extremely expensive, since the whole scene is rendered anew for each sample. Too few samples will create no blur, or ghosting where several copies of the blurred objects appear rather than a continuous blur.

3.4 Geometric camera calibration

"We focus on intricate motion control work, where virtual andreal camera positions and paths need to be perfectly matched andoutput in real time as usable data."

from Stiller Studios’ website [23].

Having a virtual camera model capable of describing how real cameras in general generate images from scenes is not enough to match them. It is also necessary to find the actual values of the parameters describing the specific camera used and, as explained previously, this is non-trivial for cameras with complex lenses. The purpose of geometric camera calibration is to find the intrinsic and extrinsic parameters of a camera.

The chapter on camera calibration written by Zhang in the book "Emerging Topics in Computer Vision" [24] gives a general overview of photogrammetric camera calibration, that is, camera calibration using measurements from images taken by the camera. Several different calibration methods exist, but they generally consist of the following two parts:

1. Detecting geometry in one or more images.

2. Finding the camera parameters that map the detected geometry between the images taken, or to a known model of the geometry.

Generally, points are used as geometry in step 1; however, different methods exist, such as methods using lines and vanishing points [25].

Calibration methods can be divided into two categories - calibration with apparatus and self-calibration. Calibration with apparatus is done by locating known points in the image of the calibration apparatus and matching these to the points in a predefined model of the points' real locations. Self-calibration, in contrast, is done without a known model, but instead by identifying the same points in a number of views and finding the camera transform that maps these points between each other.

Zhang [24] further divides calibration with apparatus into three categories based on the dimension of the calibration apparatus:

• 3D reference object based calibration.

• 2D plane based calibration.

• 1D line based calibration.

While self-calibration has the benefit of not requiring a well-defined and well-made calibration apparatus, it requires a larger number of parameters to be estimated and is thus also a harder problem. However, in some cases calibration using an apparatus is not possible, such as when calibration of a camera from a given video is desired.

The more dimensions the calibration apparatus has, the easier the parameters are to solve for; it is possible to calibrate a camera using a 3D apparatus with only one image of the apparatus. However, creating a 3D reference object with precise measurements is hard or expensive, while almost anyone can print a pattern on a paper or order a custom-made poster of a pattern. A 2D apparatus in turn requires that several images with different views of the apparatus are used. 1D objects, such as a rod or a string with beads attached along its length, have the advantage that it is easier to make sure that all points are visible from several different viewpoints at the same time, for several different orientations of the calibration object. This is useful when calibrating the extrinsic parameters of several cameras with the same view relative to each other.

Calibration techniques can also be classified based on their constraints. By reducing the number of parameters that need to be solved for, the calibration problem can be simplified. Often the intrinsic parameters are kept constant for all images used in calibration. Methods also exist that allow for varying intrinsic parameters with known rotation of the camera [26].

3.4.1 Zhang’s method

One popular calibration method is the one devised by Zhang in 1998 [18]. Zhang's method is a 2D plane based calibration method that uses the same pinhole camera model with Brown's distortion model, using two radial distortion parameters, as described in section 3.3. The method takes at least two lists of pairs of points, each consisting of at least eight pairs. One part of each pair is a detected point from a view of the calibration plane and the other is the corresponding point in a known model of the calibration plane. The homography transformation between the model points and the detected points is calculated, and from this the camera matrix and extrinsic parameters are estimated. With this information, the least squares method is used to estimate the lens distortion parameters. This gives an initial guess for the camera parameters, which is in turn refined using the iterative Levenberg-Marquardt optimization algorithm to minimize the reprojection error.
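As a sketch of how this looks in practice, OpenCV's calibrateCamera function, which is based on this plane-based approach, can be called as follows; the point lists and image size are assumed to come from a preceding control point detection step, and the function name is illustrative.

#include <opencv2/calib3d.hpp>
#include <vector>

// Minimal sketch of a Zhang-style calibration.  objectPoints holds the known
// model points of the calibration plane (z = 0) and imagePoints the
// corresponding detected points, one vector per view.
double calibrateFromViews(
    const std::vector<std::vector<cv::Point3f>>& objectPoints,
    const std::vector<std::vector<cv::Point2f>>& imagePoints,
    cv::Size imageSize, cv::Mat& cameraMatrix, cv::Mat& distCoeffs)
{
    std::vector<cv::Mat> rvecs, tvecs;   // extrinsic parameters, one per view
    // Estimates the camera matrix, distortion coefficients and per-view pose,
    // then refines them with Levenberg-Marquardt; the return value is the
    // final RMS reprojection error.
    return cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                               cameraMatrix, distCoeffs, rvecs, tvecs);
}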

3.4.2 Reprojection error

The reprojection error is a measure of the quality of a camera model, obtained by taking the average distance between a set of measured points projected using the camera model and the same points in an image taken by the actual camera. If the points' precise positions are detected in the image and the camera model is completely correct, the distance should be zero.
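A minimal sketch of this measure using OpenCV, assuming the model points, the detected image points and the estimated camera parameters are given:

#include <opencv2/calib3d.hpp>
#include <cmath>
#include <vector>

// Project the known 3D points with the estimated camera model and average
// the distance to the points detected in the real image.
double reprojectionError(const std::vector<cv::Point3f>& modelPoints,
                         const std::vector<cv::Point2f>& detectedPoints,
                         const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                         const cv::Mat& rvec, const cv::Mat& tvec)
{
    std::vector<cv::Point2f> projected;
    cv::projectPoints(modelPoints, rvec, tvec, cameraMatrix, distCoeffs, projected);

    double sum = 0.0;
    for (size_t i = 0; i < projected.size(); ++i)
        sum += std::hypot(projected[i].x - detectedPoints[i].x,
                          projected[i].y - detectedPoints[i].y);
    return sum / projected.size();   // average pixel distance
}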


3.4.3 Camera tracking

The traditional approach for camera tracking, or match moving, is usually performed in post-production through self-calibration, by finding points and matching their positions between frames without information about their actual 3D positions.

Good camera calibration algorithms are not known for their speed. However, once the intrinsic parameters have been calculated they will not change unless the optics of the camera are changed. For a prime lens, only the focus can change; the change in optics is thus one-dimensional and can easily be stored for later use. Therefore, to get a real-time camera tracking solution, only the camera's extrinsic parameters have to be solved for in real time.

Finding the extrinsic parameters of a camera, knowing the intrinsic parameters, using a known set of 3D points and their 2D projections in an image is known as the Perspective-n-Point problem, or PnP, and can be solved much faster than general camera calibration [27].
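A sketch of this step using OpenCV's solvePnP, assuming the intrinsic parameters are already known from a previous calibration; the function name is illustrative.

#include <opencv2/calib3d.hpp>
#include <vector>

// Real-time pose estimation: with the intrinsic parameters fixed, only the
// rotation and translation of the camera are solved for each frame.
bool cameraPose(const std::vector<cv::Point3f>& knownPoints,
                const std::vector<cv::Point2f>& detectedPoints,
                const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                cv::Mat& rvec, cv::Mat& tvec)
{
    // rvec/tvec describe the extrinsic parameters of the camera for this frame.
    return cv::solvePnP(knownPoints, detectedPoints, cameraMatrix, distCoeffs,
                        rvec, tvec);
}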

3.4.4 Control point detection

A crucial step in camera calibration is detecting the points used for calibration. For the calibration algorithm to work, the points used in calibration must be correctly detected. Precise detection of points is limited by noise, blur, imprecision in the manufacturing of the calibration pattern, image resolution and distortion.

Corner detection

One of the simplest calibration point solutions is using a chessboard pattern and a corner detection algorithm such as Harris corner detection [11]. An iterative gradient search can be applied to achieve subpixel accuracy [11]. However, these methods only work optimally for patterns orthogonal to the image plane, and due to perspective and lens distortion this will generally not be the case.
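A minimal sketch of such a detector using OpenCV's Harris-based corner detection followed by iterative subpixel refinement; the window sizes and thresholds are illustrative values, not parameters taken from the source.

#include <opencv2/imgproc.hpp>
#include <vector>

// Corner detection with subpixel refinement, assuming a grayscale input.
std::vector<cv::Point2f> detectCorners(const cv::Mat& gray)
{
    std::vector<cv::Point2f> corners;
    // Harris-based corner detection.
    cv::goodFeaturesToTrack(gray, corners, /*maxCorners=*/200,
                            /*qualityLevel=*/0.01, /*minDistance=*/10,
                            cv::noArray(), /*blockSize=*/3,
                            /*useHarrisDetector=*/true);

    // Iterative gradient search refining each corner to subpixel accuracy.
    cv::cornerSubPix(gray, corners, cv::Size(5, 5), cv::Size(-1, -1),
                     cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT,
                                      30, 0.01));
    return corners;
}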

Center of squares

Another calibration point method is to apply the corner detection scheme to detect the corners of squares and find the middle of each square from these corners. The idea is that the error in corner detection will even out to some degree by using the average of several points.

Detecting the center of the square is trivial when viewing a frontal image of the plane. In this case the middle of a square is simply the average of the corners. However, this will not be the case when viewing the plane from an angle, due to perspective distortion making parts of the square that are further away appear smaller than parts closer to the camera. The real center can be found by defining lines between the diagonal corners of the square, as seen in figure 3.5. The intersection of these lines as calculated by equation 3.8 is the center of the square. However, due to lens distortion the square center will still be imprecisely detected.

\[
x = \frac{\begin{vmatrix} \begin{vmatrix} x_1 & y_1 \\ x_2 & y_2 \end{vmatrix} & x_1 - x_2 \\ \begin{vmatrix} x_3 & y_3 \\ x_4 & y_4 \end{vmatrix} & x_3 - x_4 \end{vmatrix}}{\begin{vmatrix} x_1 - x_2 & y_1 - y_2 \\ x_3 - x_4 & y_3 - y_4 \end{vmatrix}} \tag{3.8a}
\]

\[
y = \frac{\begin{vmatrix} \begin{vmatrix} x_1 & y_1 \\ x_2 & y_2 \end{vmatrix} & y_1 - y_2 \\ \begin{vmatrix} x_3 & y_3 \\ x_4 & y_4 \end{vmatrix} & y_3 - y_4 \end{vmatrix}}{\begin{vmatrix} x_1 - x_2 & y_1 - y_2 \\ x_3 - x_4 & y_3 - y_4 \end{vmatrix}} \tag{3.8b}
\]

where (x, y) is the intersection point, (x_1, y_1) and (x_2, y_2) are points along one line, and (x_3, y_3) and (x_4, y_4) are points along the other line.
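A small sketch implementing equation 3.8 for the two diagonals of a detected square; the fallback for (nearly) parallel diagonals is an assumption added for robustness and is not part of the equation.

#include <opencv2/core.hpp>
#include <cmath>

// Line intersection as in equation 3.8: p1, p2 lie on one diagonal of the
// square and p3, p4 on the other; the returned point is the true center of
// the (perspective-distorted) square.
cv::Point2f diagonalIntersection(cv::Point2f p1, cv::Point2f p2,
                                 cv::Point2f p3, cv::Point2f p4)
{
    float d1 = p1.x * p2.y - p1.y * p2.x;   // |x1 y1; x2 y2|
    float d2 = p3.x * p4.y - p3.y * p4.x;   // |x3 y3; x4 y4|
    float denom = (p1.x - p2.x) * (p3.y - p4.y) - (p1.y - p2.y) * (p3.x - p4.x);
    if (std::abs(denom) < 1e-6f)            // diagonals (nearly) parallel
        return 0.25f * (p1 + p2 + p3 + p4); // fall back to the corner average

    float x = (d1 * (p3.x - p4.x) - (p1.x - p2.x) * d2) / denom;
    float y = (d1 * (p3.y - p4.y) - (p1.y - p2.y) * d2) / denom;
    return cv::Point2f(x, y);
}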

FIGURE 3.5: A rectangle seen from an angle with the actual center of the rectangle in green and the average of the corner points in red.

Circles

The idea of reducing errors by using the average of several points can be extended further using circles. Circles can be thought of as polygons with an infinite number of corners. A contour detection algorithm can be used to detect circles, and an ellipse fitting function [11] or an average of the contour points can be used to find the center. However, as for the center of squares, the problem with perspective distortion exists, and since circles lack corners no trivial solution exists.
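A sketch of this approach using OpenCV's contour detection and ellipse fitting, assuming a thresholded binary image as input; the function name is illustrative.

#include <opencv2/imgproc.hpp>
#include <vector>

// Circle center detection through contours and ellipse fitting.
// binary is assumed to be a thresholded image where the circles are white.
std::vector<cv::Point2f> circleCenters(const cv::Mat& binary)
{
    cv::Mat work = binary.clone();   // findContours may modify its input
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(work, contours, cv::RETR_LIST, cv::CHAIN_APPROX_NONE);

    std::vector<cv::Point2f> centers;
    for (const auto& contour : contours)
    {
        if (contour.size() < 5)            // fitEllipse needs at least 5 points
            continue;
        cv::RotatedRect ellipse = cv::fitEllipse(contour);
        centers.push_back(ellipse.center); // center of the fitted ellipse
    }
    return centers;
}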

Mateos [28] suggests a solution by finding the shared tangents of four circles neighboring each other in a square grid and calculating bounding squares to find the circle centers using the line intersection method. As with the square and corner detection solutions, an error due to lens distortion will occur.

By using concentric circles, more contour points can be used to even out errors. Another benefit of concentric circles is that shapes that can be mistaken for concentric circles are unlikely to occur in the background of an image, reducing the risk of faulty point detection.

3.4.5 Precalculation of lens distortion

One way to solve the problem of lens distortion bias in marker detection is to solve for lens distortion separately and correct for it before doing any other calibration. An advanced method for solving lens distortion separately is presented by Tang et al. [29], using what the authors call a calibration harp - a frame with very straight lines strung vertically. A picture of the harp is taken and the distortion model is optimized to minimize the curvature of the lines.

3.4.6 Iterative refinement of control points

Datta et al. [30] present a method for overcoming the problem of lens and perspective distortion errors using an iterative calibration method. The basic idea is that even if the first intrinsic and extrinsic parameters are not calculated exactly right, they can be used for reprojecting the image and still produce an image with less distortion that is viewed more from the front. Detection of control points in such an image, with less perspective and lens distortion, results in higher accuracy. Thus, by detecting the control points again in the frontal image and projecting these back using the calibrated camera and the rotation and translation matrices, an even more accurate set of control points is obtained that can be used for recalibration. This process can be repeated until convergence, or until a sufficiently low reprojection error is reached.

A more advanced method based on the same idea of using the reprojected frontal image is presented by Wang et al. [31], with the main difference being the use of iterative digital image correlation of concentric circles, rather than the same detection algorithm used in the first step, for obtaining more precise control points from the frontal image.

3.4.7 Fiducial markers

When detecting control points it is generally not necessary to detect every single control point, or the same control points, in every image used for calibration. It is however usually necessary to correctly match each control point to the equivalent point in the model or between the images. One way to allow for identification of control points is to use markers with identities built into their design. Rice [32] divides these kinds of visual markers, or tags, into two broad coding schemes - template and symbolic. Template-based schemes work by matching the detected tag against a database of predefined marker images using autocorrelation. As such, in theory, any image can be used as a tag. An example of template-based tags are the ones used in ARToolKit (figure 3.6). Using a symbolic scheme, on the other hand, means that the tag is created and read using a set of well-defined rules for how the data is encoded in the tag. One of the most well-known symbolic tags is the QR code (figure 3.7).


FIGURE 3.6: Template fiducial used in ARToolKit [33].

FIGURE 3.7: QR code, a symbolic data matrix tag [33].

Template-based tags benefit from allowing images with meaning to a human observer to be used as tags. However, they also present a number of problems. To avoid false detections the tags need to be as distinct as possible, which in turn hinders the use of arbitrary images or images with specific meanings. The images also need to be sufficiently distinct at different orientations, and only little research has been done into the effects of pixelation and perspective correction on the autocorrelation function. A large dataset of template images will also lead to a lot of potentially expensive comparison operations for each detected marker.

Symbolic tags, in contrast, usually work by dividing the tag into a number of data cells representing binary data by coloring the data cells black (0) or white (1). The marker is detected in an image, the data cells are sampled and a code representing the marker is built from the sampled data.

Symbolic tags benefit from being very clearly defined. Since the identity of the marker is built from the data contained in the tag, it is not necessary to do a linear search through a database for the marker's identity. A risk of errors in the detected data exists; this can however be modeled as bit errors and solved in the same way, by including redundant bits, parity bits, checksums or similar. Increasing the number of markers is also well defined, by simply increasing the number of data cells at the cost of more detail in the marker.

Tags can have different shapes, most commonly circular or square (figure 3.8). The benefits and drawbacks of using either circular or square markers have been discussed previously. However, in the context of symbolic data markers, the shape of the marker has some significance when it comes to how the data is structured in the marker. Square markers are usually structured as a simple data matrix and are quite easy to process. If the corners are found, a perspective transform can be used to find the frontal image, which in turn can be sampled along the x-axis and y-axis of the image with an offset defined by the data matrix's size. To optimally make use of the area of a circular tag, the data cells can instead be organized using polar coordinates, by radius and angle, since these can be sampled using trigonometry for frontal images. When seen from an angle, the problem of defining the perspective of a circle can possibly lead to errors if a lot of data is encoded along the radius of the circle.
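As an illustration of the square-tag case, the following sketch warps the tag to a frontal view using its four detected corners and samples the data cells; the cell count, the cell size and the corner ordering (top-left, top-right, bottom-right, bottom-left) are assumptions, and no error correction is applied.

#include <opencv2/imgproc.hpp>
#include <vector>

// Read an N x N data matrix tag: compute a perspective transform from the
// four detected corners to a frontal view and sample each data cell.
std::vector<bool> readSquareTag(const cv::Mat& gray,
                                const std::vector<cv::Point2f>& corners,
                                int cellCount)
{
    const int cellSize = 8;                 // pixels per cell in the warp
    const int side = cellCount * cellSize;
    std::vector<cv::Point2f> frontal = {
        {0.f, 0.f}, {float(side), 0.f}, {float(side), float(side)}, {0.f, float(side)} };

    // Homography from the detected corners to the frontal image.
    cv::Mat H = cv::getPerspectiveTransform(corners, frontal);
    cv::Mat warped;
    cv::warpPerspective(gray, warped, H, cv::Size(side, side));

    // Sample the center of each data cell and threshold it to a bit.
    std::vector<bool> bits;
    for (int row = 0; row < cellCount; ++row)
        for (int col = 0; col < cellCount; ++col)
        {
            uchar v = warped.at<uchar>(row * cellSize + cellSize / 2,
                                       col * cellSize + cellSize / 2);
            bits.push_back(v > 127);        // white cell = 1, black cell = 0
        }
    return bits;
}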

FIGURE 3.8: Circular and square symbolic tags [34].

Rotation of fiducials can be handled for both circular and square tags by reading the data cells in a rotationally invariant manner and offsetting the data. This however reduces the number of unique data representations of the tag. Another solution is to add orientation descriptors to the marker, as seen in figures 3.7 and 3.9.

FIGURE 3.9: Circular symbolic tag with orientation markers [35].


One interesting and quite different kind of fiducial marker is the chromaglyph [36]. Chromaglyphs consist of concentric circles in different colors (figure 3.10). As long as these colors are different enough, they should provide a simple yet robust system for unique markers. The use of color could however be a problem in a green screen environment, and the question of what constitutes "different enough" also exists.

FIGURE 3.10: Chromaglyph markers [36].

3.4.8 Image selection

Calibration methods that rely on several images of a calibration plane need these images to be from different perspectives in order to provide enough information for the algorithm. Similar images will give similar, and thus redundant, information to the algorithm. More images mean more calculations and thus a longer time for the calibration to complete. Similar perspectives also increase the risk of error bias leading to faulty parameters.

Ideally, the smallest number of images with the most variance, and thus information, should be used for calibration. This could possibly be done either by planning the capture of the images in detail or by selecting a number of images for calibration based on whether they look different to the human eye or not.

Byrne et al. [37] present a method for automatic selection of images using the concept of the calibration line. By using the calibration line it is possible to describe each image as a single line and use the angle of these lines as image descriptors. This means that how similar images are can be described by a one-dimensional metric, and also that it is known whether optimal coverage is achieved for the selected number of images.

The calibration line can be approximated by finding the homography between four detected control points in an image and the same four control points in the reference. The calibration line is then calculated as a function of the homography matrix as

\[
y = kx + m \tag{3.9a}
\]

\[
k = \frac{-h_{11}h_{32}^3 + h_{12}h_{31}^3 - h_{11}h_{31}^2h_{32} + h_{12}h_{31}h_{32}^2}{h_{22}h_{31}h_{32}^2 - h_{21}h_{31}^2h_{32} - h_{21}h_{32}^3 + h_{22}h_{31}^3} \tag{3.9b}
\]

\[
m = \frac{h_{21}h_{31} + h_{22}h_{32}}{h_{31}^2 + h_{32}^2} - \frac{h_{11}h_{31} + h_{12}h_{32}}{h_{31}^2 + h_{32}^2}\,k \tag{3.9c}
\]

where y = kx + m is the line's equation, with k being the slope and m the intersection of the line and the y-axis, and h_{rc} is the value at the r-th row and c-th column of the homography matrix.

\[
\theta = \arctan(k) \tag{3.10}
\]

gives the angle θ of the line. According to Byrne et al., and as shown in their results, this angle can be used as a one-dimensional difference measure between images for the purpose of calibration. A line can rotate 180° before lining up with itself again. Thus, to get optimal variance for the images, the angles of the calibration lines should be as spread out over 180° as possible. The optimal angle between the calibration lines of the images used can be calculated as

\[
\beta = \frac{180^{\circ}}{N} \tag{3.11}
\]

where β is the optimal angle and N is the number of images used.
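A sketch of equations 3.9-3.10, computing the calibration line angle directly from a 3x3 homography; the expression follows the reconstruction of equation 3.9b above and the function name is illustrative.

#include <opencv2/core.hpp>
#include <cmath>

// Calibration line angle for one image: H is the homography between four
// control points in the image and the same points in the reference.
double calibrationLineAngle(const cv::Matx33d& H)
{
    double h11 = H(0,0), h12 = H(0,1);
    double h21 = H(1,0), h22 = H(1,1);
    double h31 = H(2,0), h32 = H(2,1);

    // Slope k of the calibration line (equation 3.9b).
    double k = (-h11*h32*h32*h32 + h12*h31*h31*h31
                - h11*h31*h31*h32 + h12*h31*h32*h32)
             / ( h22*h31*h32*h32 - h21*h31*h31*h32
                - h21*h32*h32*h32 + h22*h31*h31*h31);

    // The angle of the line is the one-dimensional image descriptor.
    return std::atan(k);
}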

3.4.9 Blur detection

Determining when an image is in focus and as sharp as possible is of interest for several reasons. When detecting markers of any variety, the image should be as sharp as possible to give an accurate determination of their positions. When calibrating the camera it is preferable to accurately determine the focus distance of the optics, so as to know what focus settings to use when filming from what distance. It could possibly also be used for auto-focus.

Traditionally, the lens focus setting for a particular distance is determined by having the camera look at a pattern, such as the Siemens star seen in figure 3.11, and manually adjusting the focus.

FIGURE 3.11: The Siemens star pattern.

Research also exists into determining the focus of an image automatically. Pertuz et al. [38] did a review of a number of different focus measurement methods for use in shape-from-focus depth estimation, that is, using focus distance and the depth-of-field blur effect to estimate the shape and distance of objects. Pertuz et al. divide the operators used for focus measurement into six families:

1. Gradient-based operators. Using the gradient or first derivative of an image.

2. Laplacian-based operators. Using the second derivative or Laplacian of an image.

3. Wavelet-based operators. Using the capabilities of the wavelet transform to describe the frequency and spatial content of an image.


4. Statistics-based operators. Using image statistics as texture descriptors to compute the focus.

5. DCT-based operators. Using the discrete cosine transform to compute focus based on the frequency content of the image.

6. Miscellaneous operators. Operators not belonging to any of the other five categories.

Laplacian-based operators were shown to have the best performance overall, alongside wavelet-based operators. However, they were also shown to be the most sensitive to noise. One operator that showed generally good performance was Nayar's modified Laplacian

\[
\Phi(x, y) = \sum_{(i,j) \in \Omega(x,y)} \Delta_m I(i, j) \tag{3.12a}
\]

\[
\Delta_m I(i, j) = |I \ast L_x| + |I \ast L_y| \tag{3.12b}
\]

where

\[
L_x = \begin{bmatrix} -1 & 2 & -1 \end{bmatrix} \tag{3.12c}
\]

\[
L_y = L_x^t \tag{3.12d}
\]

and Φ(x, y) is the focus measure at pixel (x, y), Ω(x, y) is a local neighborhood of the pixel and I(i, j) is the value of pixel (i, j).
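A sketch of this focus measure over a whole image using OpenCV; the neighborhood size is an illustrative value and the unnormalized box filter plays the role of the sum over Ω(x, y).

#include <opencv2/imgproc.hpp>

// Nayar's modified Laplacian focus measure (equation 3.12) per pixel,
// assuming a grayscale input image.
cv::Mat modifiedLaplacianFocus(const cv::Mat& gray, int neighborhood = 9)
{
    cv::Mat img;
    gray.convertTo(img, CV_32F);

    // L_x = [-1 2 -1] and L_y = L_x transposed.
    cv::Mat Lx = (cv::Mat_<float>(1, 3) << -1, 2, -1);
    cv::Mat Ly = Lx.t();

    cv::Mat dx, dy;
    cv::filter2D(img, dx, CV_32F, Lx);
    cv::filter2D(img, dy, CV_32F, Ly);

    // |I * Lx| + |I * Ly|, summed over the local neighborhood Omega(x, y).
    cv::Mat ml = cv::abs(dx) + cv::abs(dy);
    cv::Mat focus;
    cv::boxFilter(ml, focus, CV_32F, cv::Size(neighborhood, neighborhood),
                  cv::Point(-1, -1), /*normalize=*/false);
    return focus;   // higher values indicate sharper regions
}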

3.5 Depth detection and shape reconstruction

One of the most common ways of detecting depth in images is through triangulation. By using the 2D projections of a point in two or more images, the 3D position of the point can be determined (figure 3.12). In theory this problem is trivial: if the cameras' extrinsic and intrinsic parameters are known exactly, it is simply a matter of finding the intersection of the lines going from each camera's center of projection through the 2D projection of the point. In practice however, the camera parameters are not known with 100% accuracy, and neither will the detection and matching of the control points be; nor will the image have infinite resolution [39].


FIGURE 3.12: Locating a point in space using epipolar geometry and two cameras [40].

This way of detecting depth is fundamentally how human depth vision works. An estimation of the depth of objects can be made by combining the images created by both eyes.

Many computer vision applications use the same principles for depth detection using stereo camera systems. One major problem however is to correctly match points between images. Methods for detecting and matching arbitrary points in images, such as SIFT (scale-invariant feature transform), exist but are prone to error. This means that in practice stereo depth cameras will produce a lot of noise and require a lot of guessing, cleaning of data and interpolation.

One interesting method for reconstructing a 3D object without having to deal with the problem of correctly matching points between images is space carving [41]. This method however is limited by requiring a large number of views and a background that is distinct and can easily be separated from the object being reconstructed, such as a green screen. Rather than projecting and finding the intersection of lines at single points, the silhouette of the object is detected and a virtual cone is projected out from the center of projection to the silhouette's contour, classifying all points outside the cone as empty. This is repeated for a large number of views, classifying more and more space as empty (essentially carving away virtual space) until an accurate model remains. A drawback of space carving, besides the elaborate setup required, is that it cannot deal with cavities, since these will not be visible from any direction. Complex objects or scenes with several objects may also give less than optimal results due to the objects obscuring each other from certain views.

The problem of finding the depth of filmed objects can be solved at several different levels, where the most basic is finding an approximate depth for the whole image and the ideal is a perfect 3D reconstruction of the filmed material.


In a green screen studio, where foreground objects can easily be distinguished from the green background, a multi-camera setup can be used either for space carving or for rough trigonometry. This would identify objects on the green screen as blobs and calculate an average distance.

A single-camera setup could possibly be used to give a rough approximation of the objects' 3D positions. This assumes that the floor of the studio is visible to the camera, that the camera parameters are known relative to the floor and that objects are placed on the floor. If the lowest points of the objects are detected, it is simply a matter of projecting those points from the camera plane to the floor plane.


Chapter 4

Method

4.1 Previsualization using game engines

A previsualization solution for improved rendering at Stiller Studios should be able to:

• Provide a high-quality real-time rendering.

• Control the virtual camera with streamed motion control data.

• Build scenes that can easily be modified during recording.

• Manage timeline-based animation.

• Be automated to facilitate use.

• Receive image data via the DeckLink SDK in real-time that can be composited with a scene.

High-end game engines are built to provide high-quality real-time rendering, tools and assets for building and modifying virtual scenes, and to be highly extendable and customizable with code. These factors make high-end game engines very interesting as potential previsualization tools.

To find a suitable game (or other graphics) engine for the given task, an evaluation was conducted in which graphics performance, usability and development possibilities were compared. To examine this, software documentation was used and, in some cases, meetings were held with the engines' developers themselves. Simple test implementations were also performed to give a more thorough understanding of the engines' different benefits and disadvantages.

4.1.1 CryENGINE + Film Engine

Film Engine, previously known as Cinebox, is a new film and previsualization tool still under development, built as an extension of the game engine CryENGINE [42]. Being built for film, Film Engine has both video-in and keying functionality. The tool handles complex graphics, has good interaction capabilities and high performance. From Film Engine there are handy exporting opportunities as well as live sync to Maya. Film Engine also has live motion capture functionality.

CryENGINE has a unique visual scripting language and supports Lua scripting. The Film Engine developers have also made it possible to script in Python. Scripts can be sent from other applications to Film Engine to control the software. The support for plugins, however, is less extensive than for both Unity and Unreal Engine (evaluated below).


A great CryENGINE advantage is its fully dynamic real-time global illumination with voxel ray tracing.

The source code for CryENGINE is hard to come by, but Amazon recently released their game engine Lumberyard [43], which is entirely based on CryENGINE and whose source code is free to download.

4.1.2 Ogre3D + MotionBuilder

The possibility of keeping MotionBuilder and, if so, implementing another utility on top of the already existing previsualization was also investigated. Ogre3D [44] is not a game engine itself, but it handles graphics and is often used as a component in games. The source code is open for modification and use.

A disadvantage of Ogre3D is that there is no global illumination (except simple shadows). However, such a solution would mean that Stiller Studios could continue to use the system they have already mastered in full, which was considered a smooth and convenient solution.

Stiller Studios is interested in using the wealth of models, tools and effects found in the different game engines. If improved rendering were to be implemented in MotionBuilder [45], the task would still be to solve how the previsualization artists should work with MotionBuilder to build good-looking scenes. Possibly it would be desirable to build additional tools for this or, for example, to import scenes with materials from Maya. Game engines are made to construct 3D scenes and produce high-quality real-time renderings of them, making game engines easier to evaluate from an artistic previsualization point of view.

4.1.3 Stingray

Stingray [46] is a new gaming and visualization engine developed by Autodesk, the world's leading design software publisher, which is also the publisher of both Maya and MotionBuilder. This gives Stingray great potential when it comes to connecting the engine with the already existing previsualization system. Stingray has scripting capabilities in Lua and source code written in C++/C#, to which access was given for this project thanks to interest shown by Autodesk themselves.

It is essential to be able to expand the editor's own functionality in a modular way. Here Stingray is currently lacking, although it can be controlled by scripting. The fact that Stingray is new on the market also makes it difficult to work with, due to the lack of available documentation.

The graphics in Stingray support deferred rendering and global illumination based on Autodesk's Beast, which requires baking. Since the lighting of the scene is precomputed when baking, the scene, or the parts of the scene affected by the lighting, must remain static.

4.1.4 Unity 5

Another proposal for a possible previsualization software was the game engine Unity 5. Unity's scripting system, built on C#, makes this tool a strong candidate among the other applications. It is undoubtedly the game engine that, in the shortest development time, gave the most results when it came to exploring the possibility of adding custom camera functionality in the editor, thanks to very detailed documentation [47].

Compared with, for example, Unreal Engine, Unity is not equally comprehensive and lacks well-developed support for cinematics. Unity also has a special license for companies that earn above a certain amount, which makes it less attractive for a company like Stiller Studios compared to an engine like Unreal Engine.

For global illumination Unity uses a product called Enlighten [48]. Besides baked lighting, Enlighten offers the option to use precomputed light paths. Instead of storing the resulting lightmaps as with baking, the light paths, or visibility between surfaces, are stored. This allows for real-time diffuse global illumination with a moving light source, by saving processing power on light bounces. This however still requires the scene to remain static.

4.1.5 Unreal Engine 4

The source code for the game engine Unreal Engine 4 is open and available on the web-based Git repository hosting service GitHub. The editor of the game engine can be extended with plugins in C++ [49].

Unreal Engine has great advantages, since it is possible in the program to create animations and save camera data in the scene. The game engine is lacking, however, in its real-time global illumination (which requires baking) and also when it comes to the opportunities for controlling and modifying the editor. Writing plugins to expand the editor is no easy task when the available documentation hardly covers anything more than game mode and not the actual editor. In addition to writing plugins, it is also possible to script with Unreal Engine's visual scripting language Blueprints, but it is limited and lacks sufficient documentation for this specific task.

Unreal Engine's graphics support deferred rendering and baked global illumination called Lightmass.

Something that could be of interest to note is that after evaluating Unreal Engine 4.10, Unreal Engine 4.11 and 4.12 were released, with 4.12 introducing a number of new features to facilitate using Unreal Engine as a movie-making tool, such as the improved cinematic and animation tool Sequencer and the CineCameraActor, allowing for greater control of the virtual camera settings.

4.1.6 Game engine compilation

The different evaluated applications for previsualization mostly consist of various game engines. The tables below show a summary of the information gathered during the evaluation. Since Ogre3D is not a game engine, it is left out of the compilation, which is meant to serve as a clear comparison between engines. Implementing Ogre3D would also mean a very different solution for the project, where the current previsualization system remains, which is another reason for keeping it separate.


Graphics

|                     | Unreal Engine 4.10.2   | Stingray                            | Unity 5.3.2                                                                                      | CryEngine                                                              |
|---------------------|------------------------|-------------------------------------|--------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|
| Rendering path      | Deferred rendering     | Deferred rendering                  | Deferred/forward rendering                                                                         | Deferred rendering                                                     |
| Anti-aliasing       | Temporal anti-aliasing | Variation of temporal anti-aliasing | No anti-aliasing apart from post-process, unless a forward rendering path is used with a gamma buffer | Combination of pattern recognition post-process and temporal anti-aliasing |
| Global illumination | Lightmass              | Beast                               | Enlighten                                                                                          | Voxel ray tracing                                                      |

Development

|                  | Unreal Engine 4.10.2    | Stingray                             | Unity 5.3.2                                | CryEngine                                |
|------------------|-------------------------|--------------------------------------|--------------------------------------------|------------------------------------------|
| Source access    | C++                     | C++/C# (given for Stiller Studios)   | Possible to get with special license (C++) | Lumberyard (C++) (not for Film Engine)   |
| Editor plugins   | C++                     | Not yet                              | C++/C#                                     | C++                                      |
| Editor scripting | Blutility (very limited)| Script editor (Lua)                  | C#                                         | Lua/macros                               |
| Documentation    | Decent                  | Insufficient                         | Detailed                                   | Insufficient                             |


Usability

|                                 | Unreal Engine 4.10.2 | Stingray   | Unity 5.3.2                                                                                  | CryEngine                                             |
|---------------------------------|----------------------|------------|-----------------------------------------------------------------------------------------------|-------------------------------------------------------|
| Cutscene editor                 | Matinee              | Lua script | Plugin                                                                                          | Track view editor                                     |
| Editor console command          | Yes                  | No         | No                                                                                              | Yes                                                   |
| Stiller Studios prefers         | Yes                  | No         | No                                                                                              | Yes                                                   |
| License                         | Free                 | Fee        | Fee for Unity Pro (required if the company income is more than 100,000 US dollars per year)    | Free as Lumberyard, fee for CryENGINE/Film Engine     |
| Exporting frames                | Matinee              | Lua script | Plugins                                                                                         | Film Engine                                           |
| Assets (models, audio, images)  | Great support        | Great support | Great support                                                                                | Great support                                         |

4.1.7 Previsualization tool implementation and selection

Two software packages stood out in the selection of the most suitable tool for Stiller Studios - Film Engine and Unreal Engine. Film Engine is, unlike the other programs, actually made for the film industry. Functionality such as video-in and keying already exists, which facilitates the process of getting the software to work in the studio. Film Engine is however still under development and therefore not a finalized software like Unreal Engine, which has really good graphics support. Unreal Engine is also open source and more fully documented.

The reason why Ogre3D, Stingray and Unity did not make it to the top mostly depends on the Stiller Studios crew's own preferences. The previsualization artists did not approve of the other engines, whether it came to graphics or usability, which led to a decision not to invest in these for further development.

Due to great trust in both Film Engine and Unreal Engine, a deeper investigation was conducted for each one, examining the possibilities of integrating them with the rest of the previsualization system components, which is essential for selecting the most suitable tool.

Unreal Engine implementation

Stiller Studios had a rudimentary solution for camera tracking in Unreal Engine even before this thesis, consisting of two parts: a simple MotionBuilder device plugin that takes camera orientation in Euler angles, position and optional extra data and sends this to a specified IP port over UDP, and a custom version of Unreal Engine with the source code modified to receive camera data UDP packets and set the editor view accordingly.

This solution had several problems. If the solution was updated or a newer version of Unreal Engine was installed, the solution had to be reimplemented in the source code and the entire source had to be recompiled, which leaves much room for errors and takes a long time. In addition, the solution locked the entire editor if no UDP packet was received. The solution also wrote to arrays out of bounds, possibly explaining some random crashes.

To facilitate continued development, the existing solution was reimplemented as a plugin. This was somewhat tricky due to limited development opportunities, but at the same time it is a more modular solution that is easier to maintain and test.

To get around the problem with the locked editor, attempts were made to read the UDP packets in a separate thread. This however created a new problem. The motion data is sent from MotionBuilder at the same frequency as the frame rate of the camera. By blocking the rendering of the editor by having the UDP request in the same thread, the frame rate of Unreal Engine would be forced to be the same as the frame rate of the camera, as long as the frame rate of the camera was lower than the frame rate of Unreal Engine. By receiving the motion data asynchronously, the frame rate of Unreal Engine would not match the camera, creating notable motion artifacts when moving it. To avoid having the editor lock up when not receiving UDP packets, the solution was remade to continue Unreal Engine's update cycle if no packet was received after a certain amount of time.

The possibilities of getting camera image data into Unreal Engine have also been examined. A solution was found using OpenCV (section 4.2.1) to bring a webcam or other video feed into the engine as an animated texture, but in Unreal Engine this only works in game mode. Another problem arose when it came to getting the camera image from the DeckLink card, as the SDK's IDL file cannot be compiled in Unreal Engine C++ projects. A possible solution to this would be to build a separate library that communicates with the DeckLink card and in turn include this library in the Unreal Engine plugin.

Film Engine implementation

The fact that Film Engine is a cinematic tool gives many advantages. Setting up an executable version in the studio with camera data in and functioning in-editor keying took only a work day, which is significantly more time efficient compared to what it would have been for the other surveyed programs.

Given the access to Film Engine's motion capture API, it was possible to develop a plugin to connect the camera data from Flair and MotionBuilder to the virtual camera in Film Engine. Via the motion capture API it is also possible to control several other camera features in addition to the position and direction, such as aperture, focus distance and focal length.

When live keying inside Film Engine there was a notable but constant delay of the camera image relative to the motion data. A naive but workable solution (since this delay seemed to be more or less constant) is to manually set up a corresponding delay in the movement of the camera data. This was done by sending a value D from MotionBuilder to Film Engine along with the camera data, setting up a queue in the Film Engine motion capture plugin, storing the motion data and sampling the motion data from the specified number of frames (D) earlier in the queue.


A rudimentary depth placement was also implemented by rendering the video-in feed from the camera on a plane in the scene, locking the plane to the view and scaling and moving it depth-wise. This depth placement allows actors to be behind virtual objects in well-planned scenes with known depth.

A dialogue took place with Film Engine's developers, who continuously received feedback to match the software to Stiller Studios' requirements. The perceived problems with Film Engine at first implementation were:

• Live delay in terms of sync between the camera and the engine. The engine should preferably synchronize with the FPS (frames per second) of the input signal.

• Lack of an FPS timeline with the opportunity to scroll back and forth in given animations.

• Unautomated system. The desired result should only be a few clicks away.

• Bugs and random application crashes.

Film Engine's development team took responsibility for developing the tool further with the above challenges in mind. The team visited the studio and was therefore well aware of the problems mentioned.

To get around the limitations of the data sent directly from Flair, as well as, for instance, restrictions on the timeline in Film Engine, data from the camera motion was first sent to MotionBuilder and then to Film Engine. This made it possible to use MotionBuilder's timeline and to control other parameters, such as depth of field and field of view, externally.

4.2 Improved camera calibration

4.2.1 OpenCV

OpenCV is an open source computer vision library for C++ that includes many of the algorithms and methods for dealing with the problems discussed so far [11]. OpenCV includes methods for importing and exporting images and video, a wide array of imaging algorithms and operators, a camera calibration function based on Zhang's method, as well as functions for detecting control points, solving the Perspective-n-Point problem and more. Several extensions also exist for OpenCV, such as the fiducial marker library ArUco [50], which provides functions for generating and detecting square data matrix markers.

4.2.2 Previzion

Lightcraft Technology's Previzion software includes a camera calibration process. While the documentation for Previzion does not include any information on how the system actually works, but only instructions on how to use it, some parts can be deduced [51].

Previzion's system is capable of solving for the camera's intrinsic parameters as well as the center of projection's distance from the calibration target.

The Previzion system works by rigging a large motorized calibration board (figure 4.1) in front of the camera. The board is aligned so that the center of the board is on the camera's optical axis. The aperture of the camera is closed to its smallest setting to avoid blur. When the calibration is activated, the motorized board swings around its vertical axis. Images are captured by Previzion using the camera and these are used for calibration.

FIGURE 4.1: The Previzion calibration board setup [51].

Previzion's calibration board consists of a number of square data matrix markers that become smaller the closer to the center of the board they are. This can probably be motivated by the fact that the markers should be small in order to allow for more markers. Too small markers, however, are untrackable, and when the board is viewed at a sharp angle the outermost markers will be further away and appear smaller, and thus need to be larger to be detectable.

The use of data matrices for markers greatly simplifies matching the control points to the predefined model, since each individual marker can be identified exactly without taking its position relative to the other markers into account. Since not all markers on the wide board will be visible to the camera at the same time, simpler markers would make correct identification of the markers a lot harder.

4.2.3 A new camera calibration process

The purpose of the camera calibration process is to find intrinsic parameters for different focus and zoom settings and also, given by the motion control system, the position of the center of projection and the camera orientation relative to the motion control robot.

The positional offset along the optical axis and the focal length are the most important factors to find when matching the virtual and real camera, since these naturally vary depending on the lens. Other intrinsic parameters, such as the optical center and lens distortion, are generally the result of flaws in the camera or lens itself or, in the case of errors in position and orientation, the result of misalignment between the camera and the robot or of faulty calibration of the motion control robot.

4.2.4 Conditions

The setup of Stiller Studios' system and studio provides several benefits as well as limitations. The studio has highly controllable lighting of high quality. The camera movement is controlled by a high precision motion control robot, and the movement as well as zoom, focus and iris settings are recorded. The camera and most lenses used are of high quality. The use of the motion control system is however limited, primarily for safety reasons. It is not allowed to use software to take control of the robot, even though it is doable in theory. The reason for this is lack of documentation, dangerous risks and huge costs if something goes wrong. In general, the motion control operator prefers that the camera itself moves as little as possible, even if the movement is predefined by a pattern, once again for safety reasons and because of the time and attention needed to keep things safe. Still, the calibration process should be as automated as possible and require little to no understanding from the operator of how the calibration process works.

4.2.5 Calibration process

The suggested calibration process has been developed through ongoing discussion with Stiller Studios' motion control operator, since that is the person responsible for camera calibration and also the one who would use the methods. As mentioned, there was a strong preference for not moving the camera during calibration. A possible and feasible setup was therefore considered to be a precisely constructed rig at the end of the camera robot's rail, with a mountable motor controlling a rotating calibration board akin to the one used with Previzion.

The proposed calibration process can be described with the following steps:

1. A calibration pattern board is mounted at the end of the robot rail.

2. The aperture is closed to its smallest size.

3. The desired focus and zoom for the robot is set.

4. The robot is positioned to look at the center of the calibration board at the correct distance along the rails for optimal focus and zoom.

5. The calibration process starts when the calibration board begins to rotate back and forth to sharp angles towards the camera, and a series of images is then collected.

6. The calibration is done using OpenCV's camera calibration function.

7. The resulting data is saved.

8. Steps 3 through 7 are repeated for the desired focus and zoom values.


4.2.6 Calibration software

An application was developed to perform the calibration using C++ and OpenCV. It is capable of taking either a video feed via webcam, DeckLink or images read from a folder, and motion data either directly from Flair or from a Flair-exported JOB file with saved motion data. There are settings for calibration such as initial parameters, parameter locking, calibration target position and orientation, image input and motion data input. The program provides a live view of the video input together with a blur measure. The interface provides a single button for starting the capture and for starting the calibration once capturing is completed. The calibration process itself is run asynchronously from the rest of the application, allowing the user to set up the camera for the next calibration while the calibration function is executed.

4.2.7 Calibration pattern design and marker detection

The calibration process is agnostic to what kind of calibration pattern is used, as long as enough control points can be clearly detected and matched to a known model. The marker detection code was structured using a strategy pattern for easy selection of the marker type and calibration pattern. Two custom pattern types were developed for testing: an ArUco-based fiducial pattern inspired by Previzion's calibration process and a concentric circle pattern.

ArUco pattern

For ArUco-based patterns, the reference model for the board was not manually configured or generated as a mathematical description at the board's creation. Instead, an image of the calibration pattern using ArUco markers was given. The markers were detected and their coordinates were used as model points. This means that an ArUco calibration pattern can be made using an image editing program of choice and a database of ArUco markers.

Detecting and identifying ArUco markers is built-in functionality. Using OpenCV's subpixel function, the detection of the marker corners can be improved. One issue that occurred in testing was the subpixel function detecting corners inside the ArUco tags' borders for small tags. This can be mitigated by making the ArUco tags' borders thicker. To make sure this problem would not occur, the detection algorithm was made to skip markers with an area small enough that the search area of the subpixel function risked being larger than the border of the marker itself.
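A sketch of this detection step using OpenCV's ArUco module, where markers below a minimum area are skipped before the subpixel refinement; the dictionary choice, area threshold and window size are illustrative assumptions, not the exact values used.

#include <opencv2/aruco.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// ArUco marker detection with subpixel corner refinement.  Markers that are
// too small for the subpixel search window are skipped, since the refinement
// could otherwise lock onto corners inside the marker's border.
void detectArucoCorners(const cv::Mat& gray,
                        std::vector<int>& ids,
                        std::vector<std::vector<cv::Point2f>>& corners)
{
    cv::Ptr<cv::aruco::Dictionary> dict =
        cv::aruco::getPredefinedDictionary(cv::aruco::DICT_5X5_250);
    cv::aruco::detectMarkers(gray, dict, corners, ids);

    const double minArea = 400.0;   // skip markers smaller than this (pixels^2)
    for (size_t i = 0; i < corners.size(); )
    {
        if (cv::contourArea(corners[i]) < minArea)
        {
            corners.erase(corners.begin() + i);
            ids.erase(ids.begin() + i);
            continue;
        }
        cv::cornerSubPix(gray, corners[i], cv::Size(5, 5), cv::Size(-1, -1),
                         cv::TermCriteria(cv::TermCriteria::EPS +
                                          cv::TermCriteria::COUNT, 30, 0.01));
        ++i;
    }
}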

Another problem occurred with markers at the edge of the image. If parts of a marker's border were outside the image, the marker would still be detected but with incorrect corners. This was solved by simply skipping markers with a corner close to the edge.

Calibration patterns that can be used from a large number of distances are needed in order to allow calibration of a wide variety of lenses with different optical centers and angles of view. A calibration board similar to the one used in Previzion solves this, and such a board can be generated using quadtrees. By defining the calibration board as a quadtree, or a grid of quadtrees, and letting the depth of the quadtree be a function of the distance from the board's center, with larger depth closer to the center, along with using the Chebyshev distance metric, a grid subdivided into progressively smaller cells closer to the center is obtained. An ArUco fiducial is then placed in each of these cells.


FIGURE 4.2: ArUco board generated using quadtrees and the Chebyshev distance metric.

Concentric circle pattern

A grid pattern of concentric circles was also developed. The circles are detected using thresholding and OpenCV's built-in function for finding contours. The innermost contours are selected, and if they have centers that are approximately the same as those of their N containing contours, where N is the number of circles making up one concentric circle marker, they are added to the list of detected markers. The contours of all markers are detected and the dot product between neighboring contour markers is calculated to correctly match the detected markers with the reference pattern. The markers with the sharpest angle relative to their neighbors are the corners of the calibration pattern. These are used to calculate an approximate homography for the calibration pattern. The reference marker positions are transformed to match the captured image pattern using the homography, and the markers are matched with the reference by pairing each detected marker with its closest match in the transformed reference.
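A sketch of the contour-hierarchy part of this detection, accepting an innermost contour as a marker center only if its centroid roughly coincides with the centroids of its enclosing contours; the tolerance, the ring count and the function name are illustrative assumptions.

#include <opencv2/imgproc.hpp>
#include <cmath>
#include <vector>

// Concentric circle marker detection from a thresholded binary image.
// ringCount is the expected number of enclosing contours for one marker.
std::vector<cv::Point2f> concentricCenters(const cv::Mat& binary, int ringCount)
{
    cv::Mat work = binary.clone();
    std::vector<std::vector<cv::Point>> contours;
    std::vector<cv::Vec4i> hierarchy;   // [next, prev, firstChild, parent]
    cv::findContours(work, contours, hierarchy, cv::RETR_TREE,
                     cv::CHAIN_APPROX_NONE);

    // Centroid of a contour via image moments (epsilon avoids division by zero).
    auto centroid = [&](int idx) {
        cv::Moments m = cv::moments(contours[idx]);
        return cv::Point2f(float(m.m10 / (m.m00 + 1e-9)),
                           float(m.m01 / (m.m00 + 1e-9)));
    };

    std::vector<cv::Point2f> markers;
    for (int i = 0; i < int(contours.size()); ++i)
    {
        if (hierarchy[i][2] != -1)      // has children: not an innermost contour
            continue;
        cv::Point2f c = centroid(i);
        int parent = hierarchy[i][3], rings = 0;
        bool concentric = true;
        while (parent != -1 && rings < ringCount)
        {
            cv::Point2f pc = centroid(parent);
            if (std::hypot(pc.x - c.x, pc.y - c.y) > 3.0f)  // centers must agree
                concentric = false;
            parent = hierarchy[parent][3];
            ++rings;
        }
        if (concentric && rings == ringCount)
            markers.push_back(c);
    }
    return markers;
}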

One benefit of concentric circles over the standard circle pattern used in OpenCV is that they are more robust to background objects that might accidentally be classified as markers, since concentric circles, or a hierarchy of contours that all share the same center, are less likely to appear in the background by accident.

FIGURE 4.3: Concentric circle pattern.


4.2.8 Image selection

A maximum number of frames used for calibration should be set before calibration. This is important since too many frames will make the calibration very slow without improving the result. However, it is desirable not to force the user to determine which images to use. If frames are taken from similar perspectives, this will create bias and possibly not give the algorithm enough information for a good calibration. Even worse, the error will likely not be visible in the reprojection error, since the faulty camera parameters work for the small subset of frames used.

A naive approach would be to select a random set of images, or to set a timed interval to capture images with the right frequency to match the maximum number of images and the relative movement of the calibration board and camera.

Attempts were made to use the calibration line concept described in section 3.4.8. This however turned out to be less than ideal - either due to programming errors, lack of understanding of how it should be used, or perhaps the calibration line concept being less than ideal as is. Camera movements with rotation around the calibration pattern center, which can be empirically shown to give very good coverage for calibration, gave a near constant angle of the calibration line, while camera movements with pure translation of the pattern in the image plane, which provide no useful information for calibration, gave different calibration line angles. Transposing the homography matrix, to take into account possible differences in matrix conventions between OpenCV and the paper, did not solve the problem with the rotational information and also produced near constant angles for the case of pure translation.

Instead, a method using OpenCV's Perspective-n-Point (PnP) function was devised. A camera matrix is constructed from the dimensions of the image, and the extrinsic parameters of each image are calculated using PnP with this camera matrix. The vertical and horizontal rotation of the plane can then be extracted from the resulting rotation matrix, and the euclidean distance between pairs of these two angles is used as a distance metric.

Finding the optimal variance can be done by taking the set of captured images and finding the subset of size N, where N is the number of images that should be used for calibration, whose smallest distance between two neighboring images is maximized. For simplicity, and to avoid having to store a huge number of images, a simpler solution was implemented, as described by the following pseudocode:

function SELECT(ImageCapture images, Integer numberOfImages)
    repeat
        Add image from images to selectedImages
    until size of selectedImages equals numberOfImages
    repeat
        Add image from images to selectedImages
        Select imagePair from selectedImages closest to each other
        Select image img1 from imagePair closest to a third image img2 in selectedImages
        Remove img1 from selectedImages
    until endCriteria
    return selectedImages
end function

where endCriteria could be a maximum number of captured images, a minimum distance between the captured images for maximum coverage, or a combination thereof.


What this algorithm does, in essence, is to check whether each newly captured image adds more information to the current set of selected images. If the new image is further away, "more different", than the image in the set that is closest to the other images, then it replaces that image.
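A Python sketch of this greedy selection is shown below. It assumes a descriptor function (for example the PnP angle pair above) and a distance function between descriptors; the names and the exact tie-breaking are illustrative, not the thesis' literal implementation.

def select_images(image_stream, number_of_images, descriptor, distance):
    """Greedy online version of the selection pseudocode (illustrative only)."""
    selected = []                                   # list of (image, descriptor)
    for image in image_stream:
        selected.append((image, descriptor(image)))
        if len(selected) <= number_of_images:
            continue                                # still filling the initial set
        n = len(selected)
        # Find the closest pair among the currently selected images.
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda p: distance(selected[p[0]][1], selected[p[1]][1]))
        # From that pair, drop the image that is also closest to a third image.
        def third_distance(k):
            return min(distance(selected[k][1], selected[m][1])
                       for m in range(n) if m not in (i, j))
        drop = i if third_distance(i) < third_distance(j) else j
        del selected[drop]
    return [image for image, _ in selected]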

4.2.9 Iterative refinement of control points

For improved calibration, the iterative control point refinement method described by Datta et al. [30] was implemented. It works by calibrating, undistorting the images, and using the resulting intrinsic and extrinsic parameters to calculate the homography transform that gives the frontal image of each captured image. The markers in the frontal images are detected and transformed back using the inverted homography and OpenCV's projectPoints function for redistortion. Then the calibration is repeated.

The homography for the frontal image is obtained by

Hr = M R^t M^(-1)    (4.1)

where Hr is the homography, M is the camera matrix and R the rotation matrix. However, since the image is also translated it will be off center. To make sure the pattern is visible, the correct translation needs to be added:

C = -(M R^t T) / (R^t T)_y    (4.2a)

Ht = [ 1  0  Cx ]
     [ 0  1  Cy ]
     [ 0  0  Cz ]    (4.2b)

H = Ht Hr    (4.2c)

T is the translation vector.
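A minimal NumPy sketch of equations 4.1-4.2c is given below, assuming R and T describe the board pose in the camera frame (for example rvec and tvec from cv2.solvePnP, with rvec converted through cv2.Rodrigues). Inside the refinement loop, the resulting H is used with cv2.warpPerspective to obtain the frontal image.

import numpy as np

def frontal_homography(M, R, T):
    # Hr warps the view towards a fronto-parallel one (eq. 4.1).
    Hr = M @ R.T @ np.linalg.inv(M)
    # Re-centring translation (eq. 4.2a); T is a length-3 translation vector.
    C = -(M @ R.T @ T) / (R.T @ T)[1]
    Ht = np.array([[1.0, 0.0, C[0]],
                   [0.0, 1.0, C[1]],
                   [0.0, 0.0, C[2]]])       # eq. 4.2b
    return Ht @ Hr                           # eq. 4.2c, H = Ht Hr

# Usage inside the refinement loop (sketch):
# R, _ = cv2.Rodrigues(rvec)
# H = frontal_homography(K, R, tvec.reshape(3))
# frontal = cv2.warpPerspective(undistorted_image, H, (width, height))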

4.2.10 Calculating extrinsic parameters relative to the robot’s coordinates

A static calibration board with known position, orientation and size and a moving camera should in theory allow calculation of the extrinsic parameters of the camera for every image used. The motion capture data stream is, however, not synced with the video feed. There is likely both a slight delay and a difference in frame rate. When filming, the motion control will lock to the camera's frame rate, but when the frame rate is not connected to the camera it is stuck at 50 Hz. This means that taking the difference between the orientation and position given by Flair and by the calibration will not necessarily give the right offset and orientation of the virtual camera. Thus, the extrinsic parameters are not guaranteed to fit the motion capture data. If the camera is still during the first few frames, however, the motion capture data will match for the first frame and can thereby be used. This also holds true when calibrating using a moving target and a stationary camera, assuming that the calibration target is at its known position and orientation in the initial frame.

Offline calibration

The calibration can also be done in offline mode, using filmed material and recorded motion control data. This has two advantages:


1. Filmed video taken from the camera's physical memory is of much higher quality, with an image width of 4800 pixels versus 1080 pixels for the video stream.

2. The recorded motion control data is synced with the image feed.

4.3 Camera tracking solution review

A number of commercial real-time camera tracking solutions were briefly looked into for integration into the Stiller Studios setup.

4.3.1 Ncam

The Ncam system (figure 4.4) is capable of camera tracking in an unstructured scene, that is, without requiring any specially constructed markers for reference. It uses a combination of stereo cameras and other sensors for detecting the camera's position and orientation [52].

FIGURE 4.4: Ncam sensor mounted on a film camera [52].

Ncam has not been tested live, and it is unclear how well it performs in scenes with a lot of movement, or whether a green screen studio has to be modified in order for Ncam to reliably find control points against the green backdrop.

4.3.2 Trackmen

Trackmen is a company that specializes in camera tracking [53] and has several optical tracking solutions using a variant of square data matrix fiducial markers, see figure 4.5. Instead of square data cells, a grid of circles is used.


FIGURE 4.5: Trackmen’s optical tracking solution with fiducial markers[53].

Trackmen's optical tracking can be done either by using only the camera image or by attaching an extra camera sensor to the film camera, removing any worries about props obscuring the markers by instead placing them on the floor or ceiling.

4.3.3 Mo-Sys

Mo-Sys has both an optical tracking system known as the StarTracker and several mechanical tracking systems using stands and cranes, as well as rotary sensors for zoom and focus on the camera lens [54].

Mo-Sys's StarTracker optical camera tracking system was briefly tested live at Stiller Studios. The StarTracker works by attaching a small computer with a gyro and an extra camera sensor with ultraviolet diodes on the film camera, shown in figure 4.6. Circular reflectors are then placed in the studio in a random pattern visible to the camera sensor, usually on the ceiling. The diodes combined with the reflectors make sure that the control points are clearly and unambiguously detected.

FIGURE 4.6: Mo-Sys’s StarTracker ceiling markers and sensor system[54].

Before filming can start, a somewhat extensive calibration process needs to be performed. This is achieved by moving the camera around the studio and then letting the system construct a model of how the control points are placed. The camera tracking is done by finding the transform that maps the internal model to the detected control points, using information from the previous frames and the gyro as an initial guess.

When testing the system, the tracking was, from the subjective perspective of Stiller Studios's employees, performing well. No exact pixel measurements or comparisons to the motion control robot were made. Some delay during fast camera movements could be observed, and in certain cases some gliding between foreground and background, which is likely caused by an imprecisely positioned center of projection.

4.4 Commercial multi-camera systems

4.4.1 Markerless motion capture systems

Stiller Studios was looking into the markerless motion capture systems produced by The Captury [55] and Organic Motion [56]. These systems work by using a large number of cameras positioned around a person. Software is then used to process these images in real-time to create a skeleton model of the person. This skeleton should also be usable for a rough estimation of the depth of the person being filmed, down to the detail of the individual body parts modeled by the system.

It is possible that the images produced by these systems could be obtained and used for 3D-reconstruction using a method like space carving. Organic Motion has a system in development for real-time 3D-reconstruction of people.

4.4.2 Stereo cameras

The ZED camera is a consumer grade stereo depth camera that was tested briefly at the studio. The major problem with using the ZED camera at Stiller Studios is the fact that it uses USB for data transfer and therefore allows only very limited cable lengths, making it hard to place on the motion control robot. Using a depth camera would otherwise open up interesting possibilities for camera tracking, using the 3D-reconstruction of the scene given by the depth camera. The Ncam works like this and could possibly also be useful for depth detection.

4.4.3 OptiTrack

Another multi-camera solution making methods such as space carving possible is the one provided by OptiTrack, a motion capture system that uses markers as indicators of 3D position and orientation. This would provide the additional benefit of allowing filming of marker-based motion capture material in the studio. The markers put on the camera itself could also possibly be used for camera tracking.


FIGURE 4.7: OptiTrack motion capture rig with cameras centered around the capture area [57].

4.5 Camera calibration simulation

A camera simulation application using Processing was developed to test the accuracy of the calibration. Processing is a programming language based on Java, designed for ease of use, electronic arts and visual design, and is thus a good fit for simple virtual image generation.

The application takes a template image of the calibration board as input. An empty virtual environment is initiated and the image is rendered as a texture on a quad in this environment. The extrinsic parameters of the camera can be defined using Processing's built-in camera method, which takes the camera position, target and up direction as arguments, with roll defined as a rotation around the camera axis while keeping the up direction vector constant. The extrinsic parameters of the camera can also be defined using vector operations such as translation and rotation, which can be applied using predefined methods.

Which method is used is a matter of preference, or of what kind of movement is to be described. Movement of the camera can be defined either mathematically as a function of time or by reading camera data from a Flair JOB file, which gives the data in the form of position, target and roll.

The aspect ratio and the field of view of the camera can be defined using Processing's perspective method. This does not, however, yield any control over the offset of the optical center. For greater control of the view cone, the Processing frustum method was used instead. The frustum method takes the left, right, bottom, top and depth coordinates of the near clipping plane and the depth of the far clipping plane, with the center of projection placed at the origin.

4.5.1 The camera matrix

The left, right, bottom, top and depth coordinates of the near clipping plane together with the center of projection form the projection cone of the virtual pinhole camera and, as such, define the camera matrix. Converting desired focal lengths and optical offsets to the right frustum can be done using basic trigonometry.

The depth of the near clipping plane is defined as nd. The width and height of the near clipping plane (nw, nh) are given as a function of the horizontal and vertical focal lengths (fx, fy) and the width and height of the image (iw, ih) as

nw = nd · iw / fx    (4.3a)

nh = nd · ih / fy    (4.3b)

Thus with the optical center at zero, the coordinates of the frustum can be defined as

left = −nw/2, right = nw/2, bottom = −nh/2, top = nh/2 (4.4)

The optical offset can be modeled by adding the scaled optical offsets ox and oy to the frustum coordinates, giving

left = −nw/2 + ox, right = nw/2 + ox, bottom = −nh/2 + oy, top = nh/2 + oy (4.5)

with ox and oy defined as

ox = (iw/2− cx)/fx, oy = (ih/2− cy)/fy (4.6)

where cx and cy are the optical center in image coordinates. This will, however, move the image plane so that the center of the image is no longer aligned with the optical axis. Generally, the calibration pattern should be aligned with the center of the image rather than the optical axis. This can be achieved by translating the image in the camera's coordinate system by

x = ox · d, y = oy · d    (4.7)

where d is the depth of the calibration pattern from the camera.
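The mapping from calibrated parameters to the frustum, equations 4.3-4.6, is summarized by the following Python sketch; the variable names follow the thesis, and the example values in the comment are arbitrary.

def frustum_from_camera_matrix(fx, fy, cx, cy, iw, ih, nd):
    """Near-plane frustum bounds following equations 4.3-4.6 literally."""
    nw = nd * iw / fx                    # eq. 4.3a
    nh = nd * ih / fy                    # eq. 4.3b
    ox = (iw / 2.0 - cx) / fx            # eq. 4.6
    oy = (ih / 2.0 - cy) / fy
    left, right = -nw / 2.0 + ox, nw / 2.0 + ox        # eq. 4.5
    bottom, top = -nh / 2.0 + oy, nh / 2.0 + oy
    return left, right, bottom, top

# Example (arbitrary values): a 1200x900 image, fx = fy = 800, cx = 600,
# cy = 450, near plane at depth 0.1:
# print(frustum_from_camera_matrix(800, 800, 600, 450, 1200, 900, 0.1))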

4.5.2 Inverse lens distortion

Correcting for lens distortion using Brown's distortion model can be done trivially using a post-processing shader: for each fragment, the corresponding coordinate is computed and sampled from the input image. The same equations can also be used to apply lens distortion to an undistorted image.
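Outside of a shader, the same gather operation can be expressed with OpenCV, which builds the per-pixel lookup from Brown's forward model. The sketch below is equivalent in spirit to the post-processing shader, with K and dist being the calibrated camera matrix and distortion coefficients.

import cv2

def undistort(image, K, dist):
    # For every undistorted output pixel, compute the distorted source
    # coordinate (Brown's forward model), then resample the input image.
    h, w = image.shape[:2]
    map1, map2 = cv2.initUndistortRectifyMap(K, dist, None, K, (w, h), cv2.CV_32FC1)
    return cv2.remap(image, map1, map2, cv2.INTER_LINEAR)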

In order to analyze how correct the distortion parameters returned by camera calibration are, it is desirable to simulate lens distortion defined by the same parameters used for undistorting the image. This could also be of interest for matching the real and virtual camera not by undistorting the real camera image, but rather by distorting the virtual camera rendering.

To achieve this it is necessary to solve the inverse of Brown's distortion model. No analytical solution exists for this problem. One possible solution considered was to calculate and store a sparse mapping of Brown's distortion model and interpolate between its values. This would, however, be expensive in terms of memory, computation time and algorithmic complexity, while still possibly providing less accurate results due to the sparse mapping and interpolation.

Another solution is to solve the problem by iterative approximation, but only a brief academic discussion of this, by Abeles [58], was found. Abeles presents an algorithm for solving the inverse of Brown's distortion model (section 3.3) using two coefficients for radial distortion and no tangential distortion, which can be expressed as

function DISTORT(point p)
    d = (p - c)/f
    u = d
    repeat
        r = ||u||
        u = d/(1 + K1·r^2 + K2·r^4)
    until u converges
    return u·f + c
end function

where p is a point in image coordinates, c is the optical center and f are the focal lengths. The basic idea is to express the undistorted coordinates u as a function of the distorted coordinates d by dividing out the coefficient part m of the model as

m = 1 + K1·r^2 + K2·r^4    (4.8a)

d = u·m    (4.8b)

u = d/m    (4.8c)

r = ||u||    (4.8d)

However, since r is the distance of the point u from the optical center in normalized coordinates, this equation cannot be solved analytically. An initial guess of u = d is used to solve for u. If m is correct, u will not change, and the correct answer has thus been found.

Abeles further suggests that a similar solution can be found by adding the terms for tangential distortion. This is done by first subtracting the tangential part of the equation and then dividing by the radial part. This, together with adding a third radial distortion parameter, gives the following algorithm

function DISTORT(point p)
    d = (p - c)/f
    u = d
    repeat
        r = ||u||
        ux = dx - p2·(r^2 + 2·ux·ux) - 2·p1·ux·uy
        uy = dy - p1·(r^2 + 2·uy·uy) - 2·p2·ux·uy
        u = u/(1 + k1·r^2 + k2·r^4 + k3·r^6)
    until u converges
    return u·f + c
end function

This algorithm was trivially implemented in Processing using a GLSL (OpenGL Shading Language) post-processing shader.
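For reference, a CPU-side Python version of the same fixed-point iteration is sketched below. A fixed number of iterations is used instead of a convergence test, and the division by the radial factor is applied to the tangentially corrected point, as described above.

import numpy as np

def inverse_distort(p, c, f, k=(0.0, 0.0, 0.0), pt=(0.0, 0.0), iterations=20):
    """Map a distorted pixel p to its undistorted position.

    c and f are the optical center and focal lengths (2-vectors),
    k = (k1, k2, k3) the radial and pt = (p1, p2) the tangential coefficients.
    """
    k1, k2, k3 = k
    p1, p2 = pt
    d = (np.asarray(p, dtype=float) - c) / f      # normalized distorted point
    u = d.copy()                                   # initial guess u = d
    for _ in range(iterations):
        x, y = u
        r2 = x * x + y * y
        # Subtract the tangential terms ...
        ux = d[0] - (2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x))
        uy = d[1] - (p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y)
        # ... then divide out the radial factor.
        u = np.array([ux, uy]) / (1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3)
    return u * f + c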


One interesting question is the convergence of the algorithm. For one parameter of radial distortion with small values, the algorithm should at least move in the right direction each iteration. However, it does not necessarily converge for large distortion, tangential distortion or complex distortions with local maxima and minima. An evaluation of the method's convergence is presented in Appendix A.

4.5.3 Blur and noise

When testing a calibration algorithm it is of interest to test its robustness to noise and blur, which naturally occur in real images.

Simulating physically accurate blur is a non-trivial problem. However, the purpose of blur in this context is not to calibrate depth of field, shape from focus or similar, but rather to see how the calibration method handles control points that cannot be perfectly detected.

A simple way to blur an image is to apply a Gaussian blur filter. The filter calculates the color of each pixel as a weighted sum of the colors of the surrounding pixels, with radially decreasing weights whose falloff depends on the kernel size.

Processing includes a Gaussian blur filter with varying kernel size, which was used for the simulation.

Noise was applied as additive white Gaussian noise: a random variable with a Gaussian distribution, zero mean and a standard deviation given by the user was added to each pixel.
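The two degradations can be reproduced outside Processing with a few lines of Python; the kernel size and noise standard deviation below are illustrative values, and an 8-bit image is assumed.

import cv2
import numpy as np

def degrade(image, kernel_size=5, noise_sigma=2.0):
    # Gaussian blur with the given kernel size (must be odd).
    blurred = cv2.GaussianBlur(image, (kernel_size, kernel_size), 0)
    # Additive white Gaussian noise: zero mean, user-given standard deviation.
    noise = np.random.normal(0.0, noise_sigma, blurred.shape)
    return np.clip(blurred.astype(np.float64) + noise, 0, 255).astype(np.uint8)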


Chapter 5

Results

5.1 Previsualization using graphics engine

The implemented result of Film Engine in use is shown in figure 5.1, a composited shot from a live recording in the studio. Corresponding results for the improved solution in Unreal Engine can be seen in figure 5.2.


FIGURE 5.1: Film Engine in live use.


FIGURE 5.2: Unreal Engine in live use.


5.2 Camera calibration

Table 5.1 shows the results of camera calibration using the camera simulation application written in Processing with three different calibration pattern detectors. Asymmetric is the standard 11x4 asymmetric circles pattern found on the OpenCV website, while both Circles and Concentric used the 7x10 concentric circles pattern in figure 4.3 but with different detection algorithms: Circles uses OpenCV's standard findCirclesGrid function and Concentric the custom-made concentric circle detection algorithm. Iter denotes the results from the iterative version of the same detection method; in this case, however, only the first iteration was shown to actually improve the estimation of the parameters. The value to the right of each estimated parameter is the difference between the estimated value and the real value given by the simulation. One interesting observation is that while the reprojection error (RE) is a lot better for Circles iter compared to Concentric iter, it is reasonable to question whether the estimated parameters are actually better.

     Simulation   Circles           Circles iter      Concentric        Concentric iter   Asymmetric
K1   0.200        0.204   (0.004)   0.201   (0.001)   0.200   (0.000)   0.200   (0.000)   0.219   (0.019)
K2   0.000        -0.035  (0.035)   -0.006  (0.006)   -0.002  (0.002)   0.001   (0.001)   -0.268  (0.268)
P1   0.000        0.000   (0.000)   0.000   (0.000)   0.000   (0.000)   0.000   (0.000)   0.000   (0.000)
P2   0.000        0.000   (0.000)   0.000   (0.000)   0.000   (0.000)   0.000   (0.000)   0.000   (0.000)
K3   0.000        0.079   (0.079)   0.014   (0.014)   0.009   (0.009)   -0.009  (0.009)   1.140   (1.140)
Cx   600.000      599.507 (0.493)   599.512 (0.488)   599.549 (0.451)   599.500 (0.500)   599.441 (0.559)
Cy   450.000      449.487 (0.513)   449.509 (0.491)   449.499 (0.501)   449.485 (0.515)   449.357 (0.643)
Fx   800.000      798.939 (1.061)   799.977 (0.023)   799.090 (0.910)   799.982 (0.018)   798.990 (1.010)
Fy   800.000      798.939 (1.061)   799.975 (0.025)   799.090 (0.910)   799.995 (0.005)   799.064 (0.936)
RE   -            0.016             0.010             0.038             0.026             0.023

TABLE 5.1: Calibration results using virtual camera images. Each detector column shows the estimated value with the absolute deviation from the simulated value in parentheses.

5.3 Image selection

To test the devised image selection process, a large set of simulated camera images using the concentric circle pattern was generated, with many images being identical or consisting of pure translation, together with images where the calibration pattern is viewed from different angles. Five different selection processes were compared, each selecting four images from the set:

1. Our method, using the correct parameters for the camera matrix and zero distortion.

2. Random selection.

3. Manual selection.

4. Sequential selection, taking the first four images captured.

5. Our method, using a bad guess for the camera matrix with Fx = Fy = 8000, Cx = 300 and Cy = 225.

According to the results in table 5.2, our image selection method provides an image selection equal to, or even better than, manual selection. It is interesting to note that all selection processes give decent reprojection errors even when the estimated parameters are bad, as with the random selection, or useless, as with the sequential selection. Thus the reprojection error is only a useful measurement of the quality of the calibration if the images used provide enough information to constrain the fit to a small fraction of the parameter search space.

     Simulation   Our method        Random             Manual            Sequential              Bad guess
K1   0.200        0.198   (0.002)   0.181   (0.019)    0.222   (0.022)   18.914    (18.714)      0.202   (0.002)
K2   0.000        -0.014  (0.014)   0.015   (0.015)    -0.336  (0.336)   219.151   (219.151)     -0.026  (0.026)
P1   0.000        0.001   (0.001)   -0.003  (0.003)    0.000   (0.000)   0.036     (0.036)       0.000   (0.000)
P2   0.000        0.000   (0.000)   0.002   (0.002)    -0.001  (0.001)   0.016     (0.016)       -0.001  (0.001)
K3   0.000        0.023   (0.023)   -0.013  (0.013)    1.504   (1.504)   -0.093    (0.093)       0.045   (0.045)
Cx   600.000      599.369 (0.631)   606.463 (6.463)    598.733 (1.267)   605.363   (5.363)       598.753 (1.247)
Cy   450.000      449.941 (0.059)   439.546 (10.454)   449.900 (0.100)   463.306   (13.306)      449.352 (0.648)
Fx   800.000      799.933 (0.067)   771.220 (28.780)   799.829 (0.171)   7,758.321 (6,958.321)   799.818 (0.182)
Fy   800.000      799.857 (0.143)   771.313 (28.687)   799.854 (0.146)   7,760.709 (6,960.709)   799.888 (0.112)
RE   -            0.118             0.119              0.114             0.108                   0.118

TABLE 5.2: Results of testing the image selection algorithm. Each column shows the estimated value with the absolute deviation from the simulated value in parentheses.


Chapter 6

Discussion and future work

6.1 Graphics engine integration

Compared to the Unreal Engine solution, which only works as a renderer in the studio, Film Engine meets more of the initial objectives of implementing a new graphics engine and is therefore considered the most suitable tool with respect to Stiller Studios's requests. The engine provides high-quality rendering, Flair communication, possibilities to modify the scene during recording, and also takes image data via the DeckLink SDK. The only functions missing are timeline-based animation and automation of the engine.

Even though Film Engine fulfills many of the desired requirements and also works together with the rest of the system, there is still work to be done before the studio can fully let go of the old software setup. In particular, the fact that no FPS timeline exists in Film Engine is a strong motive for not going through with an exchange yet. Without a timeline it is not possible to check whether a scene is physically doable, which makes the engine unusable for real film jobs since no shots can be digitally planned before the actual recording. Another crucial Film Engine issue is bugs, something that especially the new engines Stingray and Film Engine have suffered from. Both the timeline and the bug fixing are under development by the Film Engine development team.

Apart from the FPS timeline, the issue of missing automation persists, but it can be solved through scripting, which after all fulfills the requirement of automation facilities. The automation issue, however, is something that is not fully solved in the current system either. Currently, setting up a previsualization between Maya, MotionBuilder and Flair requires many clicks and settings tweaks due to a lack of logical structure. The employees at Stiller Studios compare their system with house building, where the house is built by placing one brick at a time without an architectural plan. This leads to very complex solutions, meant to work in the short run. In other words, automating Film Engine is only a small part of what the studio needs when it comes to system architecture. The discussion of rebuilding the software solutions from scratch has been a recurring topic during this project, but has never been considered doable during such a short period of time. It is worth mentioning that the web interface and FileMaker structure work fine, and a reconstruction would only include the above mentioned parts (Maya, MotionBuilder and Flair) of the previsualization system. What would be really interesting, since the game and film engine industry is evolving fast, is to build a software architecture where it is possible to exchange just the engine part without having to deal with almost every unit in the system, as is the case now with several hard-coded solutions. Just the fact that the studio does not have access to all the code, but must call their different external developers working around the world when an engine update is needed, clearly shows the deficiencies that exist.

6.2 Modular architecture

Both Unreal Engine and Film Engine have their advantages and disadvantages as previsualization rendering solutions. Neither is a perfect solution at the moment. Film Engine, however, has the advantage of being made for film and is actively developed to solve the issues that Stiller Studios runs into. The main advantage of Unreal Engine is that the license gives access to the source code, and it is overall much more customizable. Both, or rather most, graphics engines are in constant development. It is possible that the Sequencer updates in Unreal Engine 4.12 dramatically increase the viability of using Unreal Engine for film.

Just during the few months that this project has been ongoing, Autodesk has made a big effort to get the recently released Stingray out on the market, Film Engine has been released and renamed (from Cinebox), and Amazon has become a game engine developer. Even though there are existing solutions in this area, many of the questions that emerged during the engine evaluations have also been current topics of discussion in forums on the engines' websites, which indicates how new the 3D engine industry for filmmaking really is. As computer power continuously increases, the techniques and possibilities for rendering and camera tracking will grow too, for example with real-time ray tracing.

Stiller Studios showed interest in selecting a game engine and making a completely integrated solution, doing away with as many different applications and computers as possible and solving most of the previsualization pipeline with a single piece of software. However, due to the constant change and development in the field, perhaps an integrated solution is the wrong approach. One of the benefits of a studio like Stiller Studios, with a technical crew, is that having a large number of computers, each performing a different task, is not really a problem and provides the benefit of modularity. A modular system with many specialized parts means that each part should be easy to replace. The graphics engine that seems to be the best today might not be the best tomorrow. Different graphics engines might also be better for different purposes and projects. Simply writing a plugin that controls the virtual camera based on data from network packets should, if possible at all, be trivial in most rendering solutions, while making a completely integrated solution that solves all problems is likely to be hard, or at least take a lot more time.

For a futuristic integrated solution it would also be desirable to achieve functionalities such as chroma keying, depth compositing and distortion correction. Both Film Engine and Unreal Engine showed problems with getting image data from the DeckLink card, with Film Engine having a significant delay and Unreal Engine not allowing compilation of the DeckLink API without a workaround. The use of QTAKE can continue with a modular solution. Writing a separate application that takes two image signals from DeckLink cards and composites them together should not be too difficult either, and doing things like correcting lens distortion in a GLSL shader is trivial. Handling these kinds of tasks separately not only avoids possible unexpected issues when trying to work with the given tools, it also saves time by not having to reimplement solutions when changing the rendering solution.

One benefit of modular design was clearly shown when testing Mo-Sys's StarTracker. Since a solution for sending and receiving motion control data already existed in MotionBuilder, and solutions for receiving motion control data from MotionBuilder were implemented in both Unreal Engine and Film Engine, writing a MotionBuilder plugin for receiving motion tracking data from StarTracker directly allowed the use of StarTracker in three different engines while only having to implement it in one.

An ideal solution would revolve less around finding the best tools for the job and super-gluing them to the pipeline, and more around defining how the different computers and applications communicate via video cards, network packets and files, and defining translators (such as MotionBuilder in the previous example).

6.3 Camera tracking

Tracking systems such as Mo-Sys's StarTracker, which do not require a motion control robot, are versatile. A more in-depth study of different tracking solutions could definitely be of interest.

Placing tracking markers around the studio that can be used for high precision offline tracking, and filming these with a camera using different real-time tracking solutions, could provide a good reference for measuring the quality of the different tracking solutions. In addition, this would make it possible to calibrate the camera on the motion control robot through its full range of motion and possibly also map the mechanical errors to the robot's motor values.

Using a tracking solution such as the StarTracker in conjunction with the motion control robot, by attaching the StarTracker to the robot, could also be very interesting. Since there are known flaws in the motion control system, there could perhaps be benefits in using Flair for motion control when needed, but for camera tracking instead using a separate system such as the StarTracker.

6.4 Compositing

Depth compositing would probably require an upgrade of the studio by adding one or more cameras. Good depth compositing would greatly expand the kinds of scenes that are possible to shoot correctly in real-time. Depending on how Stiller Studios chooses to upgrade the studio, various possibilities open up.

Depth detection could open up possibilities for light and shadow interaction between the filmed material and the virtual scene.

The quality of the chroma keying in the current system depends highly on the lighting and filtering of the image. Real-time keying is an active area of research and improvements could definitely be made.

One area of improvement that has not been discussed previously is color correction, using some sort of automatic method for matching the colors of the virtual scene and the filmed material.

6.5 Object tracking

Besides their use for camera tracking and motion capture, optical tracking solutions could be useful for increasing the level of interactivity in the scene. By using a motion capture solution such as OptiTrack, or by placing fiducial markers on objects, objects in the scene could be tracked and placed into the rendering engine, allowing the directors and actors to move and interact with, for example, virtual furniture.


6.6 Augmented reality

Virtual and augmented reality (VR and AR) are fields that are developing fast, with several upcoming commercial solutions. VR and AR also relate closely to the purpose of previsualization, breaking the border between the virtual scene and reality and bringing filming on green screen closer to pure live action. Integrating VR or AR solutions into the studio would make it possible for the actors and director to actually be in the virtual scene, almost as if it were real.

6.7 Camera calibration

While the developed calibration method is a significant improvement over Stiller Studios's existing calibration pipeline, there is still room for further improvement. Allowing automatic, controlled or preprogrammed camera movements would open up several possibilities and simplifications of the calibration process. Focus could be set automatically. Calibration methods with controlled movement, such as the one with known rotation by Frahm et al. [26], allow for varying focal length during the calibration by keeping the movement known. If this were used it could make the calibration process significantly faster.

The calibration could most likely be further improved by using advanced calibration methods such as the digital correlation method by Wang et al. [31] or a separate distortion calibration, but there is also the question of how good the calibration actually needs to be. Camera calibration can be performed with just two images of a calibration pattern or one image of a 3D calibration object. Perhaps auto-calibration is good enough? It is possible that a separate process for the camera calibration is superfluous, and that simple camera movements in a static scene when filming, or a single image of a known reference object, would give a good enough calibration.

6.8 Using calibration data

When the parameters for different lenses with different zoom and focus settings have been found, these values need to be added to the previsualization. The motion control software Flair offers some support for this, allowing the user to store a list of lenses and some settings, such as the desired offset of the view point from the camera rig, the focus distance for different focus settings, the focal length, and the change in center of projection and field of view for different zoom settings.

The system has several limitations though. Lens settings must be set up manually. The possibility to save and load backups of the lens settings exists, but this can only be done in batch and is saved in an unknown binary format. The system is also limited in what information can be provided. No settings for orientational offset exist besides a checkbox telling Flair whether the lens is a snorkel lens or a normal lens.

Setting the values can also be quite confusing if the process given in the Flair manual [59] is not followed. It demands setting offsets based on several physical measurements. The manual in turn incorrectly refers to the nodal point (one of the offsets that must be set) as the no-parallax point.

One possible alternative would be to store the values obtained from the camera calibration in a table-based data file. Since the focus and zoom are sent together with the motion control data from Flair, both to the calibration and to the rendering program, these values could be stored in the data table together with the calculated parameters for focal length, distortion, orientation and center of projection offset from the camera. The values could then be used when rendering, interpolating between neighboring focus and zoom values.
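As a sketch of the proposed data table, the values for each parameter could be stored on a regular zoom/focus grid and interpolated bilinearly at render time; the grid, the numbers and the use of SciPy below are purely illustrative assumptions, not an existing implementation.

import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical lens table: calibration results stored per (zoom, focus) setting.
zoom_steps = np.array([0.0, 0.5, 1.0])
focus_steps = np.array([0.0, 0.5, 1.0])
focal_length = np.array([[ 800.0,  805.0,  812.0],     # values are made up
                         [ 950.0,  957.0,  966.0],
                         [1100.0, 1110.0, 1123.0]])

interpolate_f = RegularGridInterpolator((zoom_steps, focus_steps), focal_length)

def lens_focal_length(zoom, focus):
    # Bilinear interpolation between the neighboring zoom/focus entries;
    # distortion, orientation and offset tables would be handled the same way.
    return float(interpolate_f([[zoom, focus]])[0])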


Chapter 7

Conclusions

The thesis work was based on the following three questions:

• How can Stiller Studios’s previsualization be improved?

• Which methods are suitable for improving the previsualization?

• What is the intended outcome of improving the previsualization?

To answer the first question, four areas where the previsualization could be improved were identified:

• Improved rendering.

• Improved camera calibration.

• Alternative camera tracking.

• Depth compositing.

Several options for improving each of these areas were identified. While depth compositing and alternative camera tracking were explored mostly in theory, several commercial solutions exist that solve the problem of camera tracking without a motion control robot with decent results.

The visual quality of the previsualization was substantially improved by using game engines for rendering, compared to Stiller Studios's former solution using MotionBuilder.

The camera calibration process was indeed also improved, simply by using camera calibration algorithms rather than visual estimation.

While some improvements have been found, there is room for much more. A lot more testing and implementation could and should be done, and each of the identified areas of improvement, such as object tracking and modular architecture, could be a new master thesis on its own.

The intended outcome of improving the previsualization was to remove borders between the real and the virtual in filmmaking. Ideally, the previsualization should be ready for shooting advanced mixed reality scenes live, ready for TV or streaming without post-production. While not quite there yet, the work in this master thesis is a step on the way.

In order to stay among the most technically advanced studios it is not enough to be aware of new technologies on the market; it is also necessary to be able to implement them efficiently. What seems to be the best solution today may change tomorrow; that is the conclusion about how the advanced CGI filmmaking industry works.


Bibliography

[1] Unreal Engine. Real-time cinematography in Unreal Engine 4. https://www.youtube.com/watch?v=pGtb6uMUgZA. Accessed: 2016-10-24.

[2] Unity. Unity Adam demo. https://www.youtube.com/watch?v=GXI0l3yqBrA. Accessed: 2016-10-24.

[3] Mark Roberts Motion Control. Real-time data output to CGI. http://www.mocoforum.com/discus/messages/19/144.html?1255709935. Accessed: 2016-10-19.

[4] Salvator D. 3D pipeline tutorial. http://www.extremetech.com/computing/49076-extremetech-3d-pipeline-tutorial. Accessed: 2016-10-16.

[5] Crassin C., Neyret F., Sainz M., Green S., and Eisemann E. Interactive indirect illumination using voxel cone tracing. Comput. Graph. Forum, 30(7):1921–1930, 2011.

[6] Kolivand H. and Sunar M.S. Shadow mapping or shadow volume? International Journal of New Computer Architectures and their Applications (IJNCAA), 1(2):275–281, 2011.

[7] van Oosten J. Forward vs deferred vs forward+ rendering with DirectX 11. http://www.3dgep.com/forward-plus/. Accessed: 2016-10-16.

[8] Crystal Space. What is deferred rendering? http://www.crystalspace3d.org/docs/online/manual/About-Deferred.html. Accessed: 2016-10-16.

[9] Owens B. Forward rendering vs. deferred rendering. https://gamedevelopment.tutsplus.com/articles/forward-rendering-vs-deferred-rendering--gamedev-12342. Accessed: 2016-10-16.

[10] Pinhole camera model. https://en.wikipedia.org/wiki/Pinhole_camera_model. Accessed: 2016-10-24.

[11] Bradski G.R. and Kaehler A. Learning OpenCV. O'Reilly Media, Inc., first edition, 2008.

[12] Cambridge in Colour. Lens diffraction and photography. http://www.cambridgeincolour.com/tutorials/diffraction-photography.htm. Accessed: 2016-10-24.

[13] Distortion (optics). https://en.wikipedia.org/wiki/Distortion_(optics). Accessed: 2016-10-24.

[14] LA Video Filmmaker. F-stops, t-stops, focal length and lens aperture. http://www.lavideofilmmaker.com/cinematography/f-stops-focal-length-lens-aperture.html. Accessed: 2016-10-24.

[15] Littlefield R. Theory of the "no-parallax" point in panorama photography. 2006.

[16] Lee J. Head tracking for desktop VR displays using the Wii Remote. https://www.youtube.com/watch?v=Jd3-eiid-Uw. Accessed: 2016-10-24.

[17] Brown D.C. Decentering distortion of lenses. Photometric Engineering, 32(3):444–462, 1966.

[18] Zhang Z. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.

[19] Scaramuzza D., Martinelli A., and Siegwart R. A toolbox for easily calibrating omnidirectional cameras. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5695–5701, Oct 2006.

[20] Kumar A. and Ahuja N. On the equivalence of moving entrance pupil and radial distortion for camera calibration. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2345–2353, Dec 2015.

[21] Kundur D. and Hatzinakos D. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43–64, 1996.

[22] Nguyen H. GPU Gems 3. Addison-Wesley Professional, first edition, 2007.

[23] Stiller Studios. Welcome to Stiller Studios. http://stillerstudios.com/. Accessed: 2016-10-24.

[24] Medioni G. and Kang S.B. Emerging Topics in Computer Vision. Prentice Hall PTR, 2004.

[25] Caprile B. and Torre V. Using vanishing points for camera calibration. International Journal of Computer Vision, 4(2):127–139, 1990.

[26] Frahm J-M. and Koch R. Camera calibration with known rotation. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1418–1425. IEEE, 2003.

[27] Lepetit V., Moreno-Noguer F., and Fua P. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155–166, 2009.

[28] Mateos G.G. et al. A camera calibration technique using targets of circular features. In 5th Ibero-American Symposium on Pattern Recognition (SIARP). Citeseer, 2000.

[29] von Gioi R.G., Monasse P., Morel J-M., and Tang Z. Lens distortion correction with a calibration harp. In 2011 18th IEEE International Conference on Image Processing, pages 617–620, Sept 2011.

[30] Datta A., Kim J-S., and Kanade T. Accurate camera calibration using iterative refinement of control points. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 1201–1208, Sept 2009.

[31] Vo M., Wang Z., Luu L., and Ma J. Advanced geometric camera calibration for machine vision. Optical Engineering, 50(11):110503–110503-3, 2011.

[32] Rice A.C. Dependable systems for sentient computing. PhD thesis, Citeseer, 2007.

[33] ARToolKit. Creating and training traditional template square markers. https://artoolkit.org/documentation/doku.php?id=3_Marker_Training:marker_training. Accessed: 2016-10-24.

[34] Cantag. Marker-based machine vision. http://www.cl.cam.ac.uk/~acr31/cantag/. Accessed: 2016-10-24.

[35] Lightcraft Technology. Circular barcodes. http://www.lightcrafttech.com/overview/setup/. Accessed: 2016-10-24.

[36] Culbertson B. Chromaglyphs for pose determination. http://shiftleft.com/mirrors/www.hpl.hp.com/personal/Bruce_Culbertson/ibr98/chromagl.htm. Accessed: 2016-10-24.

[37] Byrne B.P., Mallon J., and Whelan P.F. Efficient planar camera calibration via automatic image selection. 2009.

[38] Pertuz S., Puig D., and Garcia M.A. Analysis of focus measure operators for shape-from-focus. Pattern Recognition, 46(5):1415–1432, 2013.

[39] Hartley R. and Zisserman A. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

[40] Epipolar geometry. https://en.wikipedia.org/wiki/Epipolar_geometry. Accessed: 2016-10-24.

[41] Computerphile. Space carving. https://www.youtube.com/watch?v=cGs90KF4oTc. Accessed: 2016-10-24.

[42] Crytek. Cinebox Beta technical manual. 2015.

[43] Amazon Lumberyard. https://aws.amazon.com/lumberyard/. Accessed: 2016-10-16.

[44] Ogre3D. http://www.ogre3d.org. Accessed: 2016-10-19.

[45] Autodesk. Custom renderer API. http://docs.autodesk.com/MB/2014/ENU/MotionBuilder-SDK-Documentation/index.html?url=files/GUID-EBB95B3D-E75B-4033-ABB4-29EE7B1F9F4A.htm,topicNumber=d30e11937. Accessed: 2016-10-16.

[46] Autodesk. Stingray. http://www.autodesk.com/products/stingray/overview. Accessed: 2016-10-19.

[47] Unity Technologies. Unity manual. https://docs.unity3d.com/Manual/index.html. Accessed: 2016-10-19.

[48] Enlighten. Demystifying the Enlighten precompute process. http://www.geomerics.com/blogs/demystifying-the-enlighten-precompute/. Accessed: 2016-10-19.

[49] Epic Games. Unreal Engine 4 documentation. https://docs.unrealengine.com/latest/INT/. Accessed: 2016-09-25.

[50] "Applications of Artificial Vision" research group of the University of Cordoba. ArUco: a minimal library for augmented reality applications based on OpenCV. http://www.uco.es/investiga/grupos/ava/node/26. Accessed: 2016-10-19.

[51] Lightcraft Technology. Previzion manual. http://www.lightcrafttech.com/support/doc/. Accessed: 2016-10-19.

[52] Ncam. AR/VR real-time camera tracking. http://www.ncam-tech.com/. Accessed: 2016-10-19.

[53] Trackmen. Camera tracking solutions. http://www.trackmen.de/. Accessed: 2016-10-19.

[54] Mo-Sys. Camera motion systems. http://www.mo-sys.com/. Accessed: 2016-01-23.

[55] The Captury. Pure performance. http://www.thecaptury.com/. Accessed: 2016-10-25.

[56] Organic Motion. Markerless motion capture. http://www.organicmotion.com/. Accessed: 2016-10-25.

[57] OptiTrack. OptiTrack motion capture. http://optitrack.com/products/flex-13/indepth.html. Accessed: 2016-10-19.

[58] Abeles P. Inverse radial distortion formula. http://peterabeles.com/blog/?p=73. Accessed: 2016-10-19.

[59] Wakley S. Flair operator's manual version 5. https://www.mrmoco.com/downloads/MANUAL.pdf. Accessed: 2016-10-24.


Appendix A

Inverse distortion

Shown from left to right: the image distorted using the inverse distortion algorithm with the given parameters and number of iterations; the same image sampled using distorted and then undistorted coordinates; and thirdly the difference between the distorted-and-undistorted image and the original.

FIGURE A.1: K1 = 0.1, iterations = 1.

FIGURE A.2: K1 = 0.1, iterations = 5.


FIGURE A.3: K1 = 1, iterations = 500.

FIGURE A.4: K1 = 25, iterations = 500.

FIGURE A.5: K1 = -0.1, iterations = 5.


FIGURE A.6: K1 = -0.1, iterations = 20.

FIGURE A.7: K1 = -0.1, iterations = 500.

FIGURE A.8: P1 = 0.01, iterations = 500.


FIGURE A.9: P1 = 0.1, iterations = 500.