Manfredas Zabarauskas
3D Display Simulation Using Head-Tracking with Microsoft Kinect
Computer Science Tripos, Part II
University of Cambridge
Wolfson College
May 14, 2012
Proforma
Name: Manfredas Zabarauskas
College: Wolfson College
Project Title: 3D Display Simulation Using Head Tracking
with Microsoft Kinect
Examination: Part II in Computer Science, June 2012
Word Count: 11,976¹
Project Originator: M. Zabarauskas
Supervisor: Prof N. Dodgson
Original Aims of the Project
The main project aim was to simulate depth perception using motion parallax on a regular
LCD screen, without requiring the user to wear glasses/other headgear or to modify the screen
in any way. Such simulated 3D displays could serve as a “stepping-stone” between full-3D
displays (providing stereopsis depth cue) and currently pervasive 2D displays. The proposed
approach for achieving this aim was to use viewer’s head tracking based on colour and depth
data provided by the Microsoft Kinect sensor.
Work Completed
In order to detect the viewer’s face, a distributed Viola-Jones face detector training framework
has been implemented, and a colour-based face detector cascade has been trained. To track
the viewer’s head, a combined colour- and depth-based approach has been proposed. The
combined head-tracker was able to predict the viewer’s head center location within less than 1/3
of the head’s size from the actual head center on average. A proof-of-concept 3D display system
(using a created head-tracking library) has also been implemented, simulating pictorial and
motion parallax depth cues. A short demonstration of the working system can be seen at
http://zabarauskas.com/3d.
Special Difficulties
None.
¹Computed using detex diss.tex | tr -cd ’0-9A-Za-z \n’ | wc -w, excluding proforma and appendices.
Declaration
I, Manfredas Zabarauskas of Wolfson College, being a candidate for Part II of the Computer
Science Tripos, hereby declare that this dissertation and the work described in it are my own
work, unaided except as may be specified below, and that the dissertation does not contain
material that has already been used to any substantial extent for a comparable purpose.
Signed
Date
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Human Depth Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Depth Cue Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Related Work on 3D Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Detailed Project Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Preparation 5
2.1 Starting Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Project Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Requirements Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Risk Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Problem Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 Data Flow and System Components . . . . . . . . . . . . . . . . . . . . . 7
2.4 Image Processing and Computer Vision Methods . . . . . . . . . . . . . . . . . 7
2.4.1 Viola-Jones Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.2 CAMShift Face Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.3 ViBe Background Subtractor . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Depth-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.1 Peters-Garstka Head Detector . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.2 Depth-Based Head Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Implementation 22
3.1 Development Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Languages and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Development Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.4 Code Versioning and Backup Policy . . . . . . . . . . . . . . . . . . . . . 24
3.3 Implementation Milestones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 High-Level Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Viola-Jones Detector Distributed Training Framework . . . . . . . . . . . . . . . 25
3.5.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.2 Class Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.3 Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 Head-Tracking Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6.1 Head-Tracker Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6.2 Colour-Based Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6.3 Colour-Based Face Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6.4 Colour- and Depth-Based Background Subtractors . . . . . . . . . . . . . 38
3.6.5 Depth-Based Head Detector and Tracker . . . . . . . . . . . . . . . . . . 39
3.6.6 Tracking Postprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 3D Display Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7.1 3D Game (Z-Tris) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Evaluation 48
4.1 Viola-Jones Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.2 Trained Cascade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.3 Face Detector Accuracy Evaluation . . . . . . . . . . . . . . . . . . . . . 52
4.1.4 Face Detector Speed Evaluation . . . . . . . . . . . . . . . . . . . . . . . 52
4.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 HT3D (Head-Tracking in 3D) Library . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Tracking Accuracy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 3D Display Simulator (Z-Tris) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Conclusions 75
5.1 Accomplishments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A Depth Cue Perception 80
A.1 Oculomotor Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.2 Monocular Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.2.1 Pictorial Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.2.2 Motion Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.3 Binocular Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B 3D Display Technologies 84
B.1 Binocular (Two-View) Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.2 Multi-View Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
B.3 Light-Field (Volumetric and Holographic) Displays . . . . . . . . . . . . . . . . 86
B.4 3D Display Comparison w.r.t. Depth Cues . . . . . . . . . . . . . . . . . . . . . 86
B.5 3D Display Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
B.5.1 Scientific and Medical Software . . . . . . . . . . . . . . . . . . . . . . . 87
B.5.2 Gaming, Movie and Advertising Applications . . . . . . . . . . . . . . . 87
C Computer Vision Methods (Additional Details) 89
C.1 Viola-Jones Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
C.1.1 Weak Classifier Boosting using AdaBoost . . . . . . . . . . . . . . . . . . 89
C.1.2 Best Weak-Classifier Selection . . . . . . . . . . . . . . . . . . . . . . . . 92
C.1.3 Cascade Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
C.2 CAMShift Face Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
C.2.1 Mean-Shift Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
C.2.2 Centroid and Search Window Size Calculation . . . . . . . . . . . . . . . 94
C.3 ViBe Background Subtractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
C.3.1 Background Model Initialization . . . . . . . . . . . . . . . . . . . . . . . 95
C.3.2 Background Model Update . . . . . . . . . . . . . . . . . . . . . . . . . . 96
D Depth-Based Methods (Additional Details) 97
D.1 Depth Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
D.1.1 Depth Shadow Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . 97
D.1.2 Real-Time Depth Image Smoothing . . . . . . . . . . . . . . . . . . . . . 97
D.2 Depth Cue Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
D.2.1 Generalized Perspective Projection . . . . . . . . . . . . . . . . . . . . . 98
D.2.2 Real-Time Shadows using Z-Pass Algorithm with Stencil Buffers . . . . . 100
E Implementation (Additional Details) 104
E.1 Viola-Jones Distributed Training Framework . . . . . . . . . . . . . . . . . . . . 104
E.2 HT3D Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
E.2.1 Head Tracker Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
E.2.2 Colour- and Depth-Based Background Subtractors . . . . . . . . . . . . . 104
E.3 3D Display Simulator Components . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.3.1 Application Entry Point . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.3.2 Head Tracker Configuration GUI . . . . . . . . . . . . . . . . . . . . . . 110
E.3.3 3D Game (Z-Tris) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
F HT3D Library Evaluation (Additional Details) 116
F.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
F.1.1 Sequence Track Detection Accuracy . . . . . . . . . . . . . . . . . . . . . 116
F.1.2 Multiple Object Tracking Accuracy/Precision . . . . . . . . . . . . . . . 117
F.1.3 Average Normalized Distance from the Head Center . . . . . . . . . . . . 117
F.2 Evaluation Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
F.2.1 Viola-Jones Face Detector Output . . . . . . . . . . . . . . . . . . . . . . 120
F.2.2 δ Metric for Individual Recordings . . . . . . . . . . . . . . . . . . . . . 121
F.2.3 MOTA/MOTP Evaluation Results . . . . . . . . . . . . . . . . . . . . . 141
G 3D Display Simulator (Z-Tris) Evaluation 142
G.1 Automated Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
G.2 Manual Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
G.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
H Sample Code Listings 145
I Project Proposal 153
Chapter 1
Introduction
This chapter describes the motivation for a three-dimensional display simulation using Microsoft
Kinect, the basic workings of the human depth perception (in order to understand how it could
be simulated), the related work that has been done on the 3D display simulation, and the main
applications for 3D displays.
1.1 Motivation
The ideas and research about three-dimensional displays can be traced back to the mid-nineteenth
century, when Wheatstone first demonstrated his findings about stereopsis to the Royal Society
of London.
In the last decade a number of usable glasses-free autostereoscopic systems have become available,
while glasses-based stereoscopic 3D display systems have been on the market for a few decades.
Nevertheless, 3D displays have struggled to break out of their niche markets because of their
relatively low quality and high price compared to conventional displays.
In November 2010, Microsoft launched the Kinect sensor, containing an IR depth-finding
camera. It became a huge commercial success, entering the Guinness World Records
Figure 1.1: Just-discriminable depth thresholds for two objects at distances D1 and D2, as a function
of the logarithm of distance from the observer, for the nine depth cues. The depth of the two objects
is represented by their average distance (D1 + D2)/2; the depth contrast is obtained by calculating
2(D1 − D2)/(D1 + D2). Reproduced from Cutting and Vishton, 1995 [13].
as the “fastest-selling consumer electronic device”, with 18 million units sold as of January
2012.
Based on this new development, an idea was conceived to explore the applicability of the
cheap and ubiquitous Kinect sensor in creating the depth perception on existing widespread
high-quality single-view displays.
The crucial first step in developing such system is to understand the main principles of the
human depth perception.
1.2 Human Depth Perception
According to Goldstein [19], all depth cues can be classified into three major groups:
1. Oculomotor cues (based on the human ability to sense the position of the eyes and the tension
in the eye muscles),
2. Monocular cues (using the input from just one eye),
3. Binocular cues (using the input from both eyes).
These major groups (together with the definitions used in the rest of this chapter) are fully
described in appendix A.
1.2.1 Depth Cue Comparison
The relative efficacy and importance of various depth cues has been summarized by Cutting
and Vishton [13]. Figure 1.1 presents the just-discriminable depth thresholds as a function of
Depth cue                  0–2 m   2–30 m          2–30 m               > 30 m
                                   (all sources)   (pictorial sources)
Occlusion                  1       1               1                    1
Relative size              4       3.5             3                    2
Relative density           7       6               4                    4.5
Relative height            —       2               2                    3
Atmospheric perspective    8       7               5                    4.5
Motion parallax            3       3.5             —                    6
Convergence                5.5     8.5             —                    8.5
Accommodation              5.5     8.5             —                    7
Stereopsis                 2       5               —                    7

Table 1.1: Ranking of depth cues in the observer’s space, obtained by integrating the area under each
depth-threshold function from figure 1.1 within each spatial region, and comparing the relative areas.
A lower rank means higher importance; a dash indicates that the data was not applicable to that
depth cue. Based on Cutting and Vishton, 1995 [13].
Figure 1.2: A sample taxonomy of 3D display technologies. Italic font indicates autostereoscopic
displays.
the logarithm of distance from the observer for each of the depth cues, and table 1.1 describes
the relative importance of these depth cues in three circular areas around the observer. In
particular, occlusions, stereopsis and motion parallax are distinguished as the most important
cues for depth perception in low to average viewing distance ranges.
1.3 Related Work on 3D Displays
Physiological knowledge about human depth cue perception has been extensively applied in 3D
display design, and multiple ways to classify such displays have been presented in the literature
[40, 4, 14, 23]. A sample taxonomy of currently dominating 3D display technologies is given in
figure 1.2.
Table 1.2 compares these display types with respect to the depth cues that they can simulate
and the special equipment that they require, while a much broader discussion is given in
appendix B.
Table 1.2: Comparison of the different display types with respect to the depth cues that they
provide and their requirements for special equipment.

                  Requirements                             Simulated depth cues
Display type      Head      Eye-   Standard LCD/   Pic-     Stereo-  Motion       Accomm. &
                  tracking  wear   CRT monitor     torial   psis     parallax     conv. match
Binocular         X         X      X               X        X        Continuous
Multi-view                                         X        X        Discrete¹
Light-field²                                       X        X        Continuous   X
Proposed          X                X               X                 Continuous

¹ Typically only in the horizontal direction.
² Light-field displays still remain largely experimental (as described by Holliman et al. in [24]).
1.4 Applications
Dodgson [14] distinguishes two main classes of applications for the autostereoscopic 3D display
systems:
• Scientific and medical software, where 3D depth perception is needed for the successful
completion of the task,
• Gaming and advertising applications, where the novelty of a stereo parallax is useful as
a commercial selling point.
Examples from these two application classes are discussed in appendix B.5.
1.5 Detailed Project Aims
To achieve the project’s main aim (“to simulate depth perception on a regular LCD screen
through the use of the ubiquitous and affordable Microsoft Kinect sensor, without requiring
the user to wear glasses or other headgear, or to modify the screen in any way”), the project
will simulate
• pictorial depth cues: lighting, shadows, occlusions, relative height/size/density and texture
gradient (by implementing an appropriate three-dimensional scene in a 3D rendering
framework),
• continuous horizontal and vertical motion parallax, through real-time head tracking using
the Microsoft Kinect sensor.
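In practice, such head-coupled motion parallax amounts to recomputing an off-axis (asymmetric) perspective frustum from the tracked head position on every frame; the generalized perspective projection used for this is described in appendix D.2.1. A minimal sketch of the idea follows (illustrative Python with hypothetical parameter names, assuming a screen centred at the origin of the x-y plane, not the project's actual rendering code):

```python
# Off-axis frustum bounds computed from a tracked head position.
# The screen is centred at the origin in the x-y plane; the head position
# (hx, hy, hz) is in the same metric coordinates, with hz > 0 being the
# viewer's distance from the screen plane. All names are illustrative.

def off_axis_frustum(hx, hy, hz, screen_w, screen_h, near, far):
    """Return (left, right, bottom, top, near, far) for a glFrustum-style
    asymmetric projection that keeps the rendered scene registered with
    the physical screen as the viewer's head moves."""
    s = near / hz  # scale screen-plane extents down to the near plane
    left = (-screen_w / 2.0 - hx) * s
    right = (screen_w / 2.0 - hx) * s
    bottom = (-screen_h / 2.0 - hy) * s
    top = (screen_h / 2.0 - hy) * s
    return (left, right, bottom, top, near, far)

# A centred head yields a symmetric frustum; moving the head to the right
# shifts the frustum to the left, producing horizontal motion parallax.
print(off_axis_frustum(0.0, 0.0, 0.6, 0.5, 0.3, 0.1, 100.0))
```

The frustum is rebuilt every frame from the latest head estimate, so tracking latency and jitter translate directly into perceived scene instability.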
The project will not aim to simulate stereopsis, because that would require modifications to the
screen (a standard LCD display inherently provides a single view that is seen binocularly). For
the same reason, simulating depth perception for multiple viewers using a single view will not
be attempted.
Since motion parallax is one of the strongest near- and middle-range depth perception cues, its
simulation through viewer’s head tracking will be one of the main focal points of the project. It
is clear from the start that achieving accurate head tracking will require a significant number
of computer vision and signal processing techniques. Even more importantly, these algorithms
will have to be extended to use the depth information provided by the Microsoft Kinect
sensor.
These tasks require a careful consideration of various tractability issues and a lot of attention
to the computer vision techniques before embarking on the project. They will be discussed in
much more detail in the following chapter.
Chapter 2
Preparation
This chapter outlines the planning and research that was undertaken before starting the imple-
mentation of the project. In particular, it discusses the starting point, the main requirements
of the overall system, project methodology, risk analysis and the problem constraints. It then
describes the most important theory and algorithms that were used in the project, paying
particular attention to the computer vision techniques.
2.1 Starting Point
Before starting the project I had
• basic knowledge of the Microsoft Visual Studio development environment and C# program-
ming language (six months of working experience),
• next-to-zero practical experience with the OpenGL rendering framework,
• no experience with the Kinect SDK,
• no experience with the relevant machine learning and computer vision techniques.
2.2 Project Methodology
Agile Software Development philosophies [3] were followed in requirement analysis, design,
implementation and testing.
More precisely,
• Requirements analysis (section 2.3) was based on usage modelling.
• System design (section 3.4) was focused on
– process modelling (through data flow diagrams), and
– architectural modelling (through component diagrams).
• Implementation was focused on
– constant pace with clear milestones and deliverables (following the project proposal),
– iterative development with weekly/bi-weekly iteration cycles,
– continuous integration, where the working software is extended weekly/bi-weekly by
adding new features, but is always kept in a working state.
• Testing was based on agile approaches (functional, sanity and usability manual testing
performed continuously throughout the iteration) and automated regression unit tests
(performed at the end of an iteration).
2.3 Requirements Analysis
The variety of use cases, scenarios and applications of depth displays is described in section
B.5.
To limit the scope of the project to something manageable within the Part II project timeframe,
and at the same time to define concrete deliverables that would achieve the main aim of the
project (as described in section 1.5), two simple user stories are given in table 2.1.
User: 3D Application Developer. As an application developer who wants to create her own
3D application on a regular display, I want to easily obtain the viewer’s head location
information, so that I can use it to render my depth-aware application accordingly.

User: Gamer. As a gamer, I want to experience a higher sense of realism when playing a 3D
game, so that I can a) more easily perform tasks that require depth estimation, and b)
experience a higher level of immersiveness in the game.

Table 2.1: User stories for the agile requirement analysis of the project.
Extrapolating from these two simple user stories, the deliverables of the project (and the main
requirements for them) can be defined more precisely:
1. A head-tracking library that can be used to easily obtain the viewer’s head location in
three-dimensions. The main requirements (in the order of their priority) are:
(a) Accuracy: the head-tracker should be able to correctly detect a single viewer’s head
in the majority of input frames (i.e. the average distance between the tracker’s
prediction and the actual head center in the image should not exceed 1/2 of the
viewer’s head size),
(b) Performance: the head-tracker should work in real-time (i.e. should process at least
30 frames per second),
(c) Ease of use: the library should be flexible enough to be used in multiple projects.
2. A simple 3D game that simulates depth perception and requires the user to accurately
estimate depth in order to achieve certain in-game goals. The main requirements are:
(a) Continuous vertical and horizontal motion parallax depth cue simulation,
(b) Pictorial depth cues simulation (lighting, shadows, occlusions, relative height/size/density
and texture gradient),
(c) In-game goal system requiring the player to estimate depth accurately.
2.3.1 Risk Analysis
Undoubtedly, the biggest challenge and the highest uncertainty associated with these
deliverables is the requirement for accurate real-time head tracking.
For this reason, the remaining sections of this chapter (and a very significant part of the overall
dissertation) are focused on successfully implementing viewer’s head tracking using colour and
depth information provided by Microsoft Kinect.
2.3.2 Problem Constraints
As described in section 1.5, depth perception simulation for multiple viewers will not be at-
tempted because it would require modifications to the screen (a standard LCD display inher-
ently provides a single view). This reduces the complexity of head-tracking, since only a single
viewer needs to be tracked.
Furthermore, observe that the reference point of the tracked head location is the Kinect sensor.
Since the location of the sensor might not necessarily coincide with the position of the display,
a constraint is imposed that the Kinect sensor must always be placed directly above the display.
This helps to avoid complicated semi-automatic calibration routines.
2.3.3 Data Flow and System Components
The head-tracking task, as the main task of the project (described in the requirements and risk
analysis), can be formalized as a sequence of data transformations, where the input data is the
depth and colour streams coming from Microsoft Kinect and the transformed data is the location
of the viewer’s head w.r.t. the display.
Based on the background research, this transformation can be broken down into individual
components as shown in figure 2.1.
Each of these components can be developed, tested and refined nearly-independently from oth-
ers. This modular approach makes testing and debugging process much easier, and maximises
the opportunity for the code reuse. It also closely adheres to the iterative prototyping style, as
one of the most important agile software engineering methodologies.
The following sections describe the relevant theory needed to successfully implement these
individual components and the actual implementation details are given in Chapter 3.
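Since each component is a nearly-independent transformation of the per-frame data, the decomposition of figure 2.1 can be pictured as a simple function pipeline. The sketch below is purely illustrative: the stage names and interfaces are hypothetical stand-ins for the project's actual components, written in Python rather than the project's C#.

```python
# Schematic of the data-flow decomposition: each pipeline stage is an
# independent, separately testable transformation of the per-frame state.
# The stage functions below are dummy stand-ins (hypothetical interfaces).

def make_pipeline(stages):
    """Compose per-frame transformations into a single callable."""
    def run(frame_state):
        for stage in stages:
            frame_state = stage(frame_state)
        return frame_state
    return run

def background_subtract(state):
    state["foreground"] = "mask"          # placeholder foreground mask
    return state

def detect_or_track_head(state):
    state["head_px"] = (320, 200, 1200)   # (x, y, depth in mm), dummy value
    return state

def to_display_coords(state):
    x, y, z = state["head_px"]
    state["head_m"] = (x / 1000.0, y / 1000.0, z / 1000.0)  # placeholder conversion
    return state

track = make_pipeline([background_subtract, detect_or_track_head, to_display_coords])
result = track({"colour": "...", "depth": "..."})
print(result["head_m"])
```

Because each stage only sees and extends the shared frame state, stages can be developed, unit-tested and swapped independently, which is exactly the modularity argued for above.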
2.4 Image Processing and Computer Vision Methods
This section introduces the first three colour stream transformation algorithms (as shown in
figure 2.1), viz.:
Figure 2.1: Project data flow as a sequence of data transformations performed by corresponding
algorithms. Transformations with dashed borders are optional.
• viewer’s face detection using the Viola-Jones object detection framework (specifically trained
for human faces),
• face tracking using the CAMShift object tracker, and
• image segmentation into foreground and background using the ViBe background subtractor,
to improve the tracking and detection tasks.
2.4.1 Viola-Jones Face Detector
Face detection in unconstrained images is a difficult task due to large intra-class variations:
• differences in facial appearance (hair, beards, glasses),
• changing lighting conditions,
• within- and out-of-image-plane head rotations,
• changing facial expressions,
• impoverished image data, and so on.
In 2001, Paul Viola and Michael Jones in their seminal work [41] proposed a machine learning-
based generic object detection framework. It became a de facto standard for face detection due
to its rapid image processing speed and high detection accuracy.
The Viola-Jones object detection framework is based on the general classification framework:
given a set of N examples (~x1, y1), ..., (~xN , yN ), where ~xi ∈ X are the feature vectors and
yi ∈ {0, 1} is the class of the training example (non-face/face respectively), the goal is to find a
classifier h : X → {0, 1} such that the misclassification error is minimized.
Figure 2.2: Three classes of features (two-rectangle, three-rectangle and four-rectangle) used in the
Viola-Jones algorithm. The value of the feature h is defined as the difference between the sums of
pixel intensities in the black region B and in the white region W , i.e.
h = ∑_{(x,y)∈B} I(x, y) − ∑_{(x,y)∈W} I(x, y).
Figure 2.3: Integral image representation used in the Viola-Jones algorithm. The value of the integral
image II at coordinates (x, y) is equal to II(x, y) = ∑_{m≤x, n≤y} I(m,n), where I is the original
image.

Figure 2.4: Method to rapidly (in 6–9 array references) calculate rectangle feature values:
D = II(x4, y4) − II(x3, y3) − II(x2, y2) + II(x1, y1).
2.4.1.1 Features
Instead of using raw pixel intensities as feature vectors in classification, higher-level features are
used. There are multiple reasons for doing so: most notably, higher-level features help to encode
ad-hoc domain knowledge, increase between-class variability (when compared to within-class
variability) and increase the processing speed.
Viola-Jones algorithm uses Haar-like features (resembling Haar wavelets used by Papageorgiou
et al. [35]), shown in figure 2.2.
The first main contribution of Viola-Jones algorithm is the integral image representation (see
figure 2.3) which allows a constant-time feature evaluation at any location or scale.
The value of the integral image II at coordinates (x, y) is equal to the sum of all pixels above
and to the left of (x, y), i.e.

    II(x, y) = ∑_{x′≤x, y′≤y} I(x′, y′),    (2.1)

where I is the original image.
Then the sum of the pixel intensities within an arbitrary rectangle in the image can be computed
with four array references (as shown in figure 2.4).
Note that II itself can be computed in one pass over the image using the recurrences

    R(x, y) = R(x, y − 1) + I(x, y),
    II(x, y) = II(x − 1, y) + R(x, y),    (2.2)

where R is the cumulative row sum, R(x,−1) = 0 and II(−1, y) = 0.
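The recurrences above can be illustrated directly. The following sketch (plain Python for illustration, whereas the project itself is implemented in C#; axes are named to match Python's row-major indexing) builds the integral image in a single pass and verifies the four-reference rectangle sum of figure 2.4:

```python
# Integral image computed in one pass with a running row sum, then an
# arbitrary rectangle sum recovered with four array references.

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0                      # cumulative sum along the current row
        for x in range(w):
            row += img[y][x]
            above = ii[y - 1][x] if y > 0 else 0
            ii[y][x] = above + row   # integral value = row sum + value above
    return ii

def rect_sum(ii, x1, y1, x2, y2):
    """Sum of the image over [x1..x2] x [y1..y2] (inclusive) in four references."""
    a = ii[y1 - 1][x1 - 1] if x1 > 0 and y1 > 0 else 0
    b = ii[y1 - 1][x2] if y1 > 0 else 0
    c = ii[y2][x1 - 1] if x1 > 0 else 0
    return ii[y2][x2] - b - c + a

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Once the integral image is built, every Haar-like feature value (a difference of two or three such rectangle sums) is evaluated in constant time, regardless of the rectangle's size.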
However, for the base resolution of the detector (24 × 24 pixels), the total count of these
rectangular features is 162,336. Evaluating this complete set would be computationally
prohibitively expensive and, unlike the Haar basis, this feature set is many times overcomplete.
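The 162,336 figure can be verified by enumerating every scale and position of the base feature shapes inside a 24 × 24 window. A quick check (illustrative Python, assuming the standard 2×1, 1×2, 3×1, 1×3 and 2×2 base shapes from figure 2.2):

```python
# Count all Haar-like feature instances in a 24x24 detection window by
# enumerating every integer scale and position of the five base shapes:
# 2x1 and 1x2 two-rectangle, 3x1 and 1x3 three-rectangle, 2x2 four-rectangle.

def count_features(window=24, shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    total = 0
    for bw, bh in shapes:
        # Every integer multiple of the base shape that fits the window...
        for w in range(bw, window + 1, bw):
            for h in range(bh, window + 1, bh):
                # ...placed at every possible top-left position.
                total += (window - w + 1) * (window - h + 1)
    return total

print(count_features())  # 162336
```

The enumeration makes the overcompleteness concrete: 162,336 features over a 576-pixel window is roughly a 280-fold overcomplete basis.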
2.4.1.2 Ensemble Learning
Viola and Jones proposed that a small number of these features could be chosen to form an
effective classifier using the boosting techniques, common in machine learning.
The actual boosting technique used by Viola and Jones is called AdaBoost (Adaptive Boosting)
and was first described by Freund and Schapire in 1995 [15]. Schapire and Singer [37] proved
that the training error of a strong classifier obtained using AdaBoost decreases exponentially
in the number of rounds.
AdaBoost attempts to minimize the overall training error, but for the face detection task it
is more important to minimize the false negative rate than the false positive rate (as discussed
in section 2.4.1.4).
Viola and Jones in 2002 [42] proposed a fix to AdaBoost, called AsymBoost (Asymmetric
AdaBoost). The AsymBoost algorithm is specifically designed for classification tasks
where the distribution of positive and negative training examples is highly skewed.
The precise details and the explanations of both AdaBoost and AsymBoost techniques are
given in the appendix C.1.1.
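As a compact illustration of the boosting loop summarized above, the following is a generic discrete AdaBoost sketch over a fixed pool of weak classifiers, run on toy one-dimensional data. It is illustrative only: the actual training procedure (including the weak-classifier search over Haar features and the AsymBoost variant) is the one described in appendix C.1.1.

```python
import math

# Discrete AdaBoost sketch: after each round, reweight the training
# examples so that the next weak classifier focuses on current mistakes.
# 'weak_learners' is a small pool of h: x -> {0, 1}; a real implementation
# would instead search the Haar-feature space for the best decision stump.

def adaboost(examples, labels, weak_learners, rounds):
    n = len(examples)
    w = [1.0 / n] * n
    strong = []                        # list of (alpha, h) pairs
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error.
        errs = [sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
                for h in weak_learners]
        t = min(range(len(weak_learners)), key=lambda i: errs[i])
        h, eps = weak_learners[t], max(errs[t], 1e-10)
        beta = eps / (1.0 - eps)
        strong.append((math.log(1.0 / beta), h))
        # Down-weight correctly classified examples, then renormalize.
        w = [wi * (beta if h(x) == y else 1.0)
             for wi, x, y in zip(w, examples, labels)]
        s = sum(w)
        w = [wi / s for wi in w]
    def classify(x):
        vote = sum(a * h(x) for a, h in strong)
        return 1 if vote >= 0.5 * sum(a for a, _ in strong) else 0
    return classify

# Toy 1-D data: "faces" are values greater than 5.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else 0 for t in range(10)]
clf = adaboost(xs, ys, stumps, rounds=3)
print([clf(x) for x in xs])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The weighted-majority final vote is what makes the ensemble strictly stronger than any single stump in the pool.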
2.4.1.3 Weak classifiers
For the purpose of face detection, decision stump weak classifiers can be used. An individual
classifier hi(~x, f, p, θ) takes a Haar-like feature f , a threshold θ and a polarity p, and returns
the class of a training example ~x:

    hi(~x, f, p, θ) = 1 if p · f(~x) < p · θ, and 0 otherwise.    (2.3)
To find a decision stump with the lowest error εt for a given training round t, algorithm C.1.2.1
can be used. It is worth noting that the asymptotic time cost to find the best weak classifier for
Figure 2.5: Decision-making process in the attentional cascade, where a series of classifiers is applied
to every sub-window. Due to the “immediate rejection” property, the number of sub-images that reach
the deep layers of the cascade is drastically smaller than the overall count of sub-images.
a given training round is O(KN logN ), where K is the number of features and N is the number
of training examples1.
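The O(KN logN ) bound comes from sorting the N examples by feature value once per feature, then sweeping a single threshold across the sorted list while maintaining cumulative class weights. The sketch below illustrates this idea for one feature; it follows the general idea of algorithm C.1.2.1 rather than reproducing it, and all names are illustrative.

```python
# Best decision stump for a single feature in O(N log N): sort by feature
# value, then sweep the threshold between consecutive examples while
# tracking the weight of positives/negatives seen so far.

def best_stump(values, labels, weights):
    """Return (threshold, polarity, error) minimizing the weighted error of
    the stump 'predict 1 iff polarity * value < polarity * threshold'."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    sv = [values[i] for i in order]
    t_pos = sum(w for w, y in zip(weights, labels) if y == 1)
    t_neg = sum(w for w, y in zip(weights, labels) if y == 0)
    s_pos = s_neg = 0.0                 # class weight seen below the threshold
    best = (sv[0] - 1.0, 1, t_pos)      # threshold below all values: all negative
    for k, i in enumerate(order):
        s_pos += weights[i] if labels[i] == 1 else 0.0
        s_neg += weights[i] if labels[i] == 0 else 0.0
        # Candidate threshold between this and the next sorted value.
        theta = 0.5 * (sv[k] + (sv[k + 1] if k + 1 < len(sv) else sv[k] + 2.0))
        e1 = s_neg + (t_pos - s_pos)    # p = +1: everything below -> face
        e2 = s_pos + (t_neg - s_neg)    # p = -1: everything above -> face
        if min(e1, e2) < best[2]:
            best = (theta, 1 if e1 < e2 else -1, min(e1, e2))
    return best

w = [1.0 / 6] * 6
vals = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]   # feature values for six examples
lbls = [1, 1, 1, 0, 0, 0]               # here faces have small feature values
print(best_stump(vals, lbls, w))
```

Because the cumulative weights are updated in O(1) per example, only the initial sort costs O(N logN ); repeating this for all K features gives the O(KN logN ) total quoted above.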
2.4.1.4 Attentional Cascade
A further observation by Viola and Jones is based on the fact that the face/non-face classes
are highly asymmetric, viz. the number of negative sub-images (not containing faces) in a given
image is typically overwhelmingly higher than the number of positive sub-images (containing
faces). With this insight in mind, it is sensible to focus the initial effort of the detector on
eliminating large areas of the image (as not containing faces) using some simple classifiers, with
progressively more accurate (and computationally more expensive) classifiers focusing on the
rare areas of the image that could possibly contain a face.
The idea given in the previous paragraph is embodied in the construction of the attentional
cascade (see figure 2.5). It is enough for a single classifier to reject a sub-image for it to be
rejected by the whole detector; conversely, a sub-image has to be accepted by every classifier
in the cascade to be accepted by the detector. Also, each of the classifiers in the attentional
cascade is designed to have a much smaller false negative rate than false positive rate; this
provides confidence that when a classifier rejects a sub-image, it is very likely not to have
contained a face in the first place.
Each of the strong classifiers in the cascade is obtained through boosting. A new classifier in
the cascade is trained on the data that all the previous classifiers misclassify; in that sense,
each successive classifier in the cascade faces a more difficult and time-consuming task than
its predecessor.
A detailed training algorithm for building a cascaded detector is given in appendix C.1.3.1.
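The “immediate rejection” property is simple to express in code. The following is a minimal illustrative sketch (in Python for brevity; the names are hypothetical and this is not the dissertation's implementation), where each stage stands for a boosted strong classifier represented as a predicate over a sub-window:

```python
# Illustrative sketch of attentional-cascade evaluation (hypothetical
# names, not the dissertation's implementation). Each stage returns
# True (face) or False (non-face) for a given sub-window.

def evaluate_cascade(stages, window):
    """Accept a sub-window only if every stage accepts it.

    A single rejection terminates evaluation immediately, so most
    non-face windows are discarded by the first, cheapest stages.
    """
    for stage in stages:
        if not stage(window):
            return False  # immediate rejection
    return True
```

Because the early stages are cheap and reject the vast majority of sub-windows, the expected per-window cost is dominated by the first few stages.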
¹ Putting this into perspective, to obtain a single strong classifier containing ≈ 100 weak classifiers for ≈ 10,000 training examples and ≈ 160,000 Haar-like features, O(10¹¹) operations are needed (assuming constant feature evaluation time).
Because of the construction, the false positive rate of the overall cascade is
F = ∏_{i=1}^{K} f_i,    (2.4)

where K is the number of classifiers in the cascade and f_i is the false positive rate of the i-th classifier;
similarly, the detection rate of the overall cascade is

D = ∏_{i=1}^{K} d_i,    (2.5)

where d_i is the detection rate of the i-th classifier².
Due to the large Haar-like feature search space, the specifics of strong-classifier boosting and
the false positive training image bootstrapping for new cascade layers, careful consideration
of the training framework implementation is required (see appendix C.1.3.1 for a “back-of-
the-envelope” training time estimation for a naïve implementation). Section 3.5 presents the
distributed Viola-Jones cascade training implementation and discusses the main methods to
tackle the training time complexity in more detail.
2.4.2 CAMShift Face Tracker
After the face in the image has been localized using the Viola-Jones face detection algorithm, it
can be tracked using the CAMShift (Continuously Adaptive Mean Shift) algorithm, first described
by Gary Bradski in 1998 [8].
CAMShift is largely based on the mean shift algorithm [16], which is a non-parametric tech-
nique to climb the gradient of a given probability distribution to find the nearest dominant
peak (mode). The mean shift algorithm is given in C.2.1.1, and a short proof of mean shift
convergence to the mode of the probability distribution can be found in [8].
CAMShift extends the mean shift algorithm by adapting the search window size to the
changing probability distribution. The distributions are recomputed for each frame, and the
zeroth and first spatial (horizontal and vertical) moments are used to iterate towards the mode of the
distribution. This makes the CAMShift algorithm robust enough to track the face when the viewer
moves in horizontal, vertical and lateral directions, when minor facial features
(e.g. expressions) change, or when the face is rotated in the camera plane (head roll).
2.4.2.1 “Face” Probability Distribution
In order to use CAMShift for face tracking, a “face” probability distribution function (that
assigns an individual pixel a probability that it belongs to a face) needs to be constructed. It
² Notice that to achieve a detection rate of 0.9 and a false positive rate of 6 × 10⁻⁶ using a 10-stage classifier, each stage has to have a detection rate of 0.99, but a false positive rate of only about 0.3 (i.e. three out of ten non-face images on average are allowed to be misclassified as faces by each strong classifier!).
Figure 2.6: Conical HSV (hue, saturation, value) colour space.
is done by converting the input video frame into the HSV (hue, saturation, value) colour space
(shown in figure 2.6) and building the hue histogram of the region in the image where the face
was detected.
The main reason for using the hue histogram is the fact that all humans (except albinos) have
basically the same skin colour hue (as observed by Bradski and verified in [7]).
The construction of the hue histogram works as follows. Assume that the hue of each pixel
is encoded using m-bits and h(x, y) is the hue of the pixel with coordinates (x, y). Then the
unweighted histogram {q_u}, u = 1 … 2^m, can be computed using

q_u = ∑_{(x,y) ∈ I_d} δ(h(x, y) − u),    (2.6)

where I_d is the detected face region in the video frame.
The rescaled histogram {q̂_u}, u = 1 … 2^m, can be obtained by calculating

q̂_u = min( q_u / max({q_u}), 1 ).    (2.7)
Then the “face” probability of a pixel at coordinates (x′, y′) can be calculated using the
histogram backprojection, i.e.

Pr(“I(x′, y′) belongs to a face”) = q̂_{h(x′, y′)}.    (2.8)
An illustration of the “face” probability calculated using histogram backprojection is shown in
figure 2.7.
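Equations 2.6–2.8 can be sketched in a few lines of code. The following is an illustrative Python fragment under simplifying assumptions (`hue` is a 2D array of m-bit hue values, `face_region` a list of pixel coordinates of the detected face; all names are hypothetical):

```python
# Illustrative sketch of hue-histogram backprojection (equations
# 2.6-2.8). Hypothetical names; not the dissertation's implementation.

def build_histogram(hue, face_region, m):
    q = [0] * (2 ** m)
    for (x, y) in face_region:
        q[hue[y][x]] += 1                        # equation 2.6
    peak = max(q) or 1
    return [min(b / peak, 1.0) for b in q]       # equation 2.7 (rescaling)

def face_probability(hue, q_scaled, x, y):
    # equation 2.8: backprojection is a simple histogram lookup
    return q_scaled[hue[y][x]]
```

Backprojection is thus extremely cheap per pixel: a single table lookup once the histogram has been built.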
Figure 2.7: “Face” probability image b) obtained from input image a) using the histogram backprojection method. Brighter areas of image b) indicate a higher probability for a pixel to be a part of the face.
2.4.2.2 Centroid Calculation and Algorithm Convergence
After the face probability distribution has been constructed, the CAMShift algorithm uses the
zeroth and first moments of the face probability distribution to compute the centroid of the
high-probability region (see the appendix C.2.2 for precise details).
The mean shift component of the CAMShift algorithm continually recomputes the centroid
until there is no significant change in its position. Typically, the maximum number of iterations
in this process is set between 10 and 20, and since sub-pixel accuracy cannot be observed,
a minimum shift of one pixel in either the horizontal or vertical direction is used as a convergence
criterion³.
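The moment-based centroid iteration at the core of the algorithm can be sketched as follows (an illustrative Python fragment with hypothetical names, not the dissertation's implementation; `prob` is the “face” probability image, the window an axis-aligned rectangle, and only the mean shift core is shown, without CAMShift's window-size adaptation):

```python
# Illustrative mean-shift core: centroid from the zeroth and first
# moments of the probability image, iterated until the window moves
# less than one pixel. Hypothetical names.

def window_centroid(prob, x, y, w, h):
    m00 = m10 = m01 = 0.0
    for j in range(y, y + h):
        for i in range(x, x + w):
            p = prob[j][i]
            m00 += p
            m10 += i * p
            m01 += j * p
    if m00 == 0:          # empty window: terminate (cf. footnote 3)
        return None
    return (m10 / m00, m01 / m00)

def mean_shift(prob, x, y, w, h, max_iter=15):
    img_h, img_w = len(prob), len(prob[0])
    for _ in range(max_iter):
        c = window_centroid(prob, x, y, w, h)
        if c is None:
            return None
        # re-centre the window on the centroid, clamped to the image
        nx = min(max(int(c[0] - w / 2), 0), img_w - w)
        ny = min(max(int(c[1] - h / 2), 0), img_h - h)
        if abs(nx - x) < 1 and abs(ny - y) < 1:
            break         # converged: shift below one pixel
        x, y = nx, ny
    return (x, y, w, h)
```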
2.4.3 ViBe Background Subtractor
In order to mitigate one of the main drawbacks of the CAMShift face tracking algorithm, viz. its
inability to distinguish an object from the background if they have similar hue, a separate
background/foreground segmentation algorithm can be used.
The ViBe (Visual Background Extractor) algorithm, as described by Barnich and Van Droogenbroeck [2], is a universal⁴, sample-based background subtraction algorithm.
³ Some care must also be taken to ensure that the algorithm terminates when the search window does not contain any pixels with non-zero face probability, i.e. when the zeroth moment is equal to zero.
⁴ In the sense that the algorithm itself makes no assumptions about the video stream frame rate, colour space, scene content, the background itself or its variability over time.
Figure 2.8: Comparison of a pixel value v(x) with a set of samples M(x) = {v1, v2, ..., v5} in a two-dimensional Euclidean colour space C1C2. Pixel value v(x) is classified as background if the number of samples in M(x) that are within the circle SR(v(x)) is greater than or equal to θ.
2.4.3.1 Background Model and Classification
In ViBe, an individual background pixel x is modelled using a collection of N observed pixel
values, i.e.
M(x) = {v1, v2, ..., vN}, (2.9)
where vi is a background sample value with index i (taken in the previous frames⁵).
Let v(x) be the value of pixel x in a given colour space; then x can be classified based on its
corresponding model M(x) by comparing it to the closest values within the set of samples in
the following way.
Define SR(v(x)) to be a hypersphere of radius R in the given colour space, centred on v(x).
The pixel value v(x) is classified as background if
|{SR(v(x)) ∩M(x)}| ≥ θ, (2.10)
where θ is the classification threshold (see figure 2.8). Barnich and Van Droogenbroeck in
[2] have empirically established the appropriate parameter values as θ = 2 and R = 20 for
monochromatic images.
The purpose of using a collection of samples is to reduce the influence of outliers. A key
insight made by Barnich and Van Droogenbroeck is that classifying a new pixel value
with respect to its immediate neighbourhood in the colour space estimates the distribution of
the background pixels more reliably than typical statistical parameter estimation techniques
applied to a much larger number of samples.
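The classification rule of equation 2.10 reduces to a few lines of code. A minimal illustrative sketch (Python, monochromatic pixel values, hypothetical names), using the θ = 2 and R = 20 parameter values quoted above:

```python
# Illustrative sketch of ViBe's per-pixel classification (equation
# 2.10) for monochromatic values. Hypothetical names; parameters
# theta = 2 and R = 20 as reported by Barnich and Van Droogenbroeck.

def is_background(value, samples, radius=20, theta=2):
    close = 0
    for s in samples:
        if abs(value - s) < radius:   # inside the sphere S_R(v(x))
            close += 1
            if close >= theta:        # early exit once threshold is met
                return True
    return False
```

The early exit mirrors the observation in [2] that the comparison can stop as soon as θ close samples have been found, which keeps the per-pixel cost low.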
5See appendix C.3.1 for precise details on the model initialization for the first frame of the video sequence.
Figure 2.9: An example ViBe background model update sequence demonstrating a fast model recovery in the presence of a ghost (“a set of connected points, detected as in motion, but not corresponding to any real moving object” [38]) and a slow incorporation of real moving objects into the background model.
2.4.3.2 Background Model Update
The background model update method used in ViBe provides three important features:
1. a memoryless update policy (to ensure an exponential monotonic decay of the remaining
lifespan for the individual samples stored in background models),
2. a random time subsampling (to ensure that the time windows covered by the background
pixel models are extended),
3. a mechanism that propagates background pixel samples spatially (to ensure spatial consistency
and to allow the adaptation of the background pixel models that are masked by
the foreground).
Precise details on the background model update (as shown in figure 2.9) are given in the
appendix C.3.2.
2.5 Depth-Based Methods
Akin to the colour-based face detection and tracking approach, a similar two-step process is used
for viewer's head tracking based on the depth data provided by Kinect. Namely, the tracking
process is split into head detection using the Peters-Garstka method [17] and head tracking
using a modified CAMShift algorithm. More details on both of these methods are given in the
subsections below.
Figure 2.10: Viola-Jones integral image based real-time depth image smoothing. The input depth image is preprocessed by a) removing depth shadows, and then is smoothed using b) r = 2, c) r = 4 and d) r = 8, where r is the side length of the averaging rectangle.
Figure 2.11: Kinect depth shadow removal. Images a) and b) show the aligned colour and depth input images from Kinect. Blue areas in the input depth image b) indicate the regions where no depth data is present; image c) is the resulting depth image after depth shadow removal.
2.5.1 Peters-Garstka Head Detector
In 2011, Peters and Garstka [17] introduced a novel approach for head detection and tracking
using depth images.
Their approach consists of three main steps:
• preprocessing of the depth data provided by Microsoft Kinect (“depth shadow” and noise
elimination), described in detail in appendix D.1 and briefly illustrated in figures 2.10
and 2.11,
• detection of the local minima in a depth image and the use of surrounding gradients in
order to identify a head (based on certain prior knowledge about the adult head size),
discussed below,
• postprocessing of the head location, discussed in section 3.6.6.
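As a concrete illustration of the integral-image smoothing used in the preprocessing step (figure 2.10), the following Python sketch builds a summed-area table and computes a border-clamped box average. The details of the actual preprocessing are in appendix D.1; the names and the clamping policy here are illustrative assumptions:

```python
# Illustrative integral-image (summed-area table) box smoothing.
# `img` is a 2D list of depth values; r is the half-side of the
# averaging rectangle. Borders are handled by clamping (assumption).

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y+1][x+1] = img[y][x] + ii[y][x+1] + ii[y+1][x] - ii[y][x]
    return ii

def box_average(ii, x, y, r, w, h):
    # clamp the averaging rectangle to the image borders
    x0, y0 = max(x - r, 0), max(y - r, 0)
    x1, y1 = min(x + r + 1, w), min(y + r + 1, h)
    # any rectangle sum needs only four lookups in the table
    total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
    return total / ((x1 - x0) * (y1 - y0))
```

Once the table is built in one pass, every averaging rectangle costs O(1) regardless of r, which is what makes this smoothing real-time.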
2.5.1.1 Head Detection
After obtaining a smoothed depth image with depth shadows eliminated (as described in
appendix D.1), the viewer's head can be detected using prior knowledge about the typical adult
human head size (20 cm × 15 cm × 25 cm, length × width × height) and its shape. Under
the assumption that the head is in an upright position, but its orientation with respect to the
camera is not known, the inner horizontal bound of the head is chosen to be 10 cm and the outer
horizontal bound is chosen to be 25 cm.
Note that for a given object with dimensions w × h at a distance d from the Kinect sensor, the
width p_w and height p_h in pixels of the area that it occupies on the screen can be calculated
using basic trigonometry, i.e.

(p_w, p_h) = ( (w × r_w) / (d × 2 tan(f_w / 2)), (h × r_h) / (d × 2 tan(f_h / 2)) ),    (2.11)
Figure 2.12: Prior assumptions about the human head shape that are used to detect head-like objects in depth images. The light blue dot is a local minimum on a horizontal scan line.
where (r_w, r_h) is the resolution of the screen and (f_w, f_h) is the horizontal/vertical field of view
of the depth camera⁶.
Using equation 2.11, the inner and outer bounds can be defined as

b_i(d) = (320 px × 10 cm) / (d cm × 2 tan(58°/2)) ≈ 2886.5/d px,
b_o(d) = (320 px × 25 cm) / (d cm × 2 tan(58°/2)) ≈ 7216.2/d px.    (2.12)
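A quick numeric sanity check of equation 2.12 can be written in a few lines. This is an illustrative Python sketch assuming the 320 px wide depth stream and 58° horizontal field of view stated above; the function names are hypothetical:

```python
# Illustrative evaluation of the pixel-size bounds of equation 2.12.
# Assumes a 320 px wide depth stream with a 58-degree horizontal FOV;
# `d_cm` is the distance to the object in centimetres.
import math

def pixel_width(object_width_cm, d_cm, res_px=320, fov_deg=58.0):
    return (object_width_cm * res_px) / (
        d_cm * 2 * math.tan(math.radians(fov_deg) / 2))

def inner_bound(d_cm):
    return pixel_width(10, d_cm)   # 10 cm inner horizontal bound

def outer_bound(d_cm):
    return pixel_width(25, d_cm)   # 25 cm outer horizontal bound
```

At d = 100 cm this gives b_i ≈ 28.9 px and b_o ≈ 72.2 px, matching the ≈ 2886.5/d and ≈ 7216.2/d approximations.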
Then for each horizontal line v′, consider a local minimum point u′ for which
• all depth values within the inner bounds have a smaller depth difference than 10 cm from
u′, and
• depth values at the outer bounds have a larger depth difference than 20 cm from u′ (see
figure 2.12 for an illustration).
More formally, find the point u′ on the line v′ such that Ir(u′, v′) is a local minimum, and the
inequalities
Ir(u′ + f, v′)− Ir(u′, v′) < 10 cm,
Ir(u′ − f, v′)− Ir(u′, v′) < 10 cm,
(2.13)
⁶ PrimeSense PS1080 SoC Reference Design 1.081 (http://www.primesense.com/en/press-room/resources/file/4-primesensor-data-sheet) states a 58° horizontal and 45° vertical field-of-view, which nearly corresponds to Peters and Garstka's empirically measured horizontal FOV of 61.66° [17].
hold ∀f ∈ {1, 2, ..., b_i(Ir(u′, v′))/2}, and

Ir(u′ + b_o(Ir(u′, v′))/2, v′) − Ir(u′, v′) > 20 cm,
Ir(u′ − b_o(Ir(u′, v′))/2, v′) − Ir(u′, v′) > 20 cm.    (2.14)
To match the sides of the head and the vertical head axis more accurately, for each local
minimum u′ satisfying the criteria above calculate the positions u1 and u2 of the lateral
gradients, where the 20 cm threshold difference to the local minimum is exceeded, i.e. find u1
and u2 such that

Ir(u1, v′) − Ir(u′, v′) ≤ 20 cm,
Ir(u1 − 1, v′) − Ir(u′, v′) > 20 cm,
Ir(u2, v′) − Ir(u′, v′) ≤ 20 cm,
Ir(u2 + 1, v′) − Ir(u′, v′) > 20 cm    (2.15)

and use the arithmetic mean u(v′) = (u1 + u2)/2 as a possible point on the vertical head axis.
Furthermore, assume that the head height should be at least 25 cm. To calculate the required
head height in pixels, let n be the number of subsequent lines on which the points
u are found. If u(v′) is found for the current line v′, increment n; otherwise set n = 0.
The average distance to the points found in the last n subsequent lines can be calculated
using

d = (1/n) ∑_{i=0}^{n−1} Ir(u(v′ − i), v′ − i),    (2.16)

then the number of lines required for this average distance is

n_max = (25 cm × 240 px) / (d cm × 2 tan(45°/2)) ≈ 7242.6/d px.    (2.17)
If n ≥ n_max, then the center of a head is treated as detected at coordinates

(x_c, y_c) = ( (1/n) ∑_{i=0}^{n−1} u(v′ − i), v′ − n/2 ),    (2.18)

where v′ is the current horizontal line.
An example result of head detection using this method is shown in figure 2.13.
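The per-scan-line criteria of inequalities 2.13–2.15 can be sketched as follows. This is an illustrative Python fragment under simplifying assumptions: `depth` is one smoothed horizontal depth line in cm, `u` a local minimum index, and `b_inner`/`b_outer` the pixel bounds of equation 2.12 evaluated at that minimum; all names are hypothetical and this is not the dissertation's implementation:

```python
# Illustrative per-scan-line head test (inequalities 2.13-2.15).
# `depth` is one horizontal line of smoothed depth values in cm.

def is_head_candidate(depth, u, b_inner, b_outer):
    d0 = depth[u]
    half_in, half_out = b_inner // 2, b_outer // 2
    if u - half_out < 0 or u + half_out >= len(depth):
        return False
    # inequalities 2.13: everything within the inner bound stays
    # within 10 cm of the local minimum's depth
    for f in range(1, half_in + 1):
        if depth[u + f] - d0 >= 10 or depth[u - f] - d0 >= 10:
            return False
    # inequalities 2.14: depth falls off by more than 20 cm at the
    # outer bounds
    return depth[u + half_out] - d0 > 20 and depth[u - half_out] - d0 > 20

def head_axis_point(depth, u):
    # equation 2.15: walk outwards to the lateral 20 cm gradients and
    # return their mean as a point on the vertical head axis
    d0 = depth[u]
    u1 = u
    while u1 - 1 >= 0 and depth[u1 - 1] - d0 <= 20:
        u1 -= 1
    u2 = u
    while u2 + 1 < len(depth) and depth[u2 + 1] - d0 <= 20:
        u2 += 1
    return (u1 + u2) / 2
```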
Figure 2.13: Head detection using the Garstka and Peters approach. Image a) shows the detected head rectangle (in yellow) overlaid on top of the colour input image; image b) shows the detected head rectangle overlaid on the smoothed (using r = 2) depth image with depth shadows removed. In both images, white pixels represent the local horizontal minima which satisfy inequalities 2.13 and 2.14.
2.5.2 Depth-Based Head Tracker
After the head is localized by the Garstka and Peters head detector, a modified CAMShift
algorithm is used to track the head. The motivation for this approach stems from the fact that
one of the main assumptions made by Garstka and Peters (viz. that there is a single head-like
object present in the depth frame) ceases to hold in an unconstrained environment.
When this assumption breaks down, the vertical head axis is localized incorrectly and the
track of the head is subsequently lost.
To mitigate this problem, the criteria that Garstka and Peters used to reject those horizontal
local minima which could not possibly lie on the vertical head axis (equations 2.13, 2.14) are now
used to obtain the “face” probability in the CAMShift tracker.
More precisely, instead of using histogram backprojection to obtain
Pr(“I(x′, y′) belongs to a face”), define a degenerate “face” probability

Pr(“I(x′, y′) belongs to a face”) = { 1, if Ir(x′, y′) is a local minimum on the line y′,
                                        and inequalities 2.13 and 2.14 hold,
                                      0, otherwise.    (2.19)
Since the only non-zero probability pixels in the search-window are likely to be positioned on the
vertical head axis, the function that is used to obtain the size of the next search window is also
updated to s = 4√M00 (where the multiplicative constant was established empirically).
This re-definition of the “face” probability ensures that even when other head-like
objects are present in the frame, the CAMShift algorithm will keep tracking the head which was
Figure 2.14: Head tracking using depth information. Images a), b) and c) show the head rectangle (in yellow) overlaid on top of the colour input image. White pixels represent non-zero “face” probabilities derived from the depth image using prior knowledge about the human head shape (section 2.5.1), which are then tracked using the CAMShift algorithm (section 2.4.2).
initially detected. An example of this method in action is shown in figure 2.14.
Since the rest of the depth-based head-tracking algorithm continues in the same manner as
the colour-based face tracking using CAMShift, the remaining details can be found in section
2.4.2.
2.6 Summary
Agile methodologies have been applied in the project’s requirement analysis (and later, in
project’s design, implementation and testing). During the planning and research phase, the
main requirements of the overall system were broken down into i) a real-time “head-tracking
in 3D” library, and ii) a 3D game simulating pictorial and motion parallax depth cues. The
former requirement was identified during risk analysis as carrying the highest uncertainty,
hence a significant amount of time was spent researching face/head detection and tracking
techniques. Ultimately, a combination of de facto standard methods in
the industry (like Viola-Jones face detection or CAMShift face tracking), novel techniques (like
ViBe background subtraction and Peters-Garstka depth-based head detection) and self-designed
methods (like depth-based head tracking) were chosen to be used in the project.
Chapter 3
Implementation
This chapter provides details on how the algorithms and theory from Chapter 2 are implemented
to achieve the project's main aims. It starts by discussing the development environment,
languages and tools used; it then introduces a high-level architectural breakdown of
the system into large components (“Viola-Jones Face Detector”, “Head-Tracking Library” and
a “3D Display Simulator”), and finally it discusses the implementation of these individual
components.
3.1 Development Strategy
Early in the project, a decision was made to implement all required algorithms and methods
from scratch.
While there are numerous open-source computer vision libraries, they are primarily developed
to deal with colour data (e.g. OpenCV), and isolating only the face detection and tracking
routines from these large libraries is a complex and time-consuming task.
It was deemed that extending these large libraries in multiple different ways to use depth
information (see [10] for Viola-Jones extensions using depth, or section 2.5.1 for depth-based
head detection and tracking) involves higher risk than implementing a single-purpose, cohesive
head tracker.
3.2 Languages and Tools
3.2.1 Libraries
To obtain the depth and colour data from Microsoft Kinect, a free (for non-commercial projects)
Kinect SDK 1.0 Beta 2 library [33] (released by Microsoft in November 2011) was chosen. While
there are some alternative open-source libraries that can extract depth and colour data from
the Kinect sensor, they are not officially supported by the manufacturer and hence were not
used to avoid various compatibility issues.
To render the depth cues, the OpenGL library was chosen as the de facto industry standard for 3D
graphics.
CHAPTER 3. IMPLEMENTATION 23
Figure 3.1: (Partial) GANTT chart showing project’s status as of 18/12/2011.
3.2.2 Development Language
C# was chosen as the main development language because it provides the typical advantages of a
third-generation programming language (machine independence, human readability) combined
with object-oriented programming benefits (cohesive and decoupled program modules, clear
separation between contracts and implementation details, code re-use through inheritance and
polymorphism, and so on). It also has a number of advanced programming constructs, like
events, delegates, extension/generator methods, SQL-like native data querying capabilities and
lambda expressions.
Furthermore, it provides features that are missing from Java (like value types, operator over-
loading or reified generics) and has a stronger tool support for GUI development. Finally, it
is supported by Kinect SDK (which is targeting .NET Framework 4.0) and OpenGL (using a
wrapper for C# called OpenTK [34]).
3.2.3 Development Environment
One of the requirements of the Kinect SDK is a Windows 7/8 OS, hence a Windows-based development
environment had to be chosen. Since the Visual Studio IDE fully supports the development
language C# and also features built-in code versioning, code testing and GUI development
environments, it was used across the whole project.
3.2.4 Code Versioning and Backup Policy
The Apache Subversion (SVN) version control system was used for source code and dissertation
version control and immediate back-up, with a centralized SVN repository set up in the PWF
file system.
Also, a weekly backup strategy was established, where the source code and the dissertation
were mirrored once a week on two 16 GB USB flash drives to protect from data loss.
3.3 Implementation Milestones
Milestones of the project proposal (see appendix I) were carefully followed (except for a single
design change outlined in the progress report, viz. replacing the 3D hemisphere-fitting depth
tracker [44] with the Garstka and Peters depth tracker). Any minor delays were covered by the
“slack” time planned in the project proposal.
A snapshot of the project status as of 18/12/11 is shown in GANTT chart in figure 3.1.
3.4 High-Level Architecture
The overall project can be split into the independent development components shown in figure
3.2. Each of these components is discussed in more detail in the following sections.
Figure 3.2: UML 2.0 component diagram of the system’s high-level architecture.
3.5 Viola-Jones Detector Distributed Training Framework
According to Viola and Jones [41], the training time for their 32-layer detector “was in the
order of weeks”. Similarly, according to [27], “the training of the cascade which is used by
the detector turned out to be very time consuming” and “the [17-layer] cascade was never
completed”.
To mitigate the time complexity of detector cascade training, a decision was made to
exploit the processing power of PWF (Public Workstation Facility) machines¹ available at the
University of Cambridge Computer Laboratory's Intel lab. A distributed training framework
targeting the Microsoft .NET 2.0 framework (available on PWF) was designed and implemented,
¹ Running Windows XP OS on Intel Core 2 Q9550 Quad CPU @ 2.83 GHz with 3.21 GB of RAM.
Figure 3.3: PWF machines at the University of Cambridge Computer Laboratory's Intel Lab distributedly training a Viola-Jones detector. Special care was taken to ensure that PWF machines would only be used for training when they were not needed by other people (i.e. most of the training was done during the weekends and term breaks), and that training would not interfere with regular PWF user log-ons.
which trained a 22-layer cascade containing 1,828 decision stumps in 20 hours, 15 minutes and
2 seconds.
While the performance of the training framework is further discussed in section 4.1.2, it is worth
mentioning that the best-performing rectangle feature selection time was reduced from
nearly 16 minutes in a naïve single-threaded, single-CPU implementation (which would require
more than three weeks to train a 1,828-feature cascade) to an average of 38.39 seconds per
feature in a distributed multi-threaded implementation using 65 CPU cores².
The two most time-consuming tasks were parallelized: best weak classifier selection (out of
162,336 rectangle features) when building a strong classifier, and the false positive training
image bootstrapping for each layer of the cascade³.
² The amount of parallel processing was limited by the number of simultaneous logins (19) allowed by the PWF security policy. Out of 19 machines, 18 were running 4 training client instances each; one additional machine was running one server instance and one training client instance.
³ The computational complexity of these tasks is best illustrated by the numbers: each of the 162,336 rectangle features has to be evaluated on each of the 9,916 training images (as described in section 4.1.1), and the best-performing decision stump has to be selected out of those. This process of adding best-performing weak classifiers has to be repeated until the individual layer false positive rate and detection rate objectives are met, and new layers have to be added until all training data is learned (in total, 1,828 decision stumps were added). Similarly, 5,000 false positive training images have to be bootstrapped for each new layer of the cascade; as the cascade grows, the effort required to find false positive images increases exponentially.
Figure 3.4: UML 2.0 deployment diagram of the Viola-Jones distributed training framework architecture.
The architecture and the implementation details of this distributed training framework are
described below.
3.5.1 Architecture
To provide a better understanding of how the tasks and the main training data are physically
distributed, a deployment diagram of the distributed training framework is shown in figure 3.4.
As shown in this diagram, two separate communication channels are used: TCP/IP and CIFS
(Common Internet File System, also known as SMB, Server Message Block).
A standard client-server architecture with a “star” topology (with the server at the center) is used
for the framework. This arrangement greatly simplifies work coordination and makes it easier
to ensure strict consistency of training results.
To avoid bottlenecking the server's Ethernet link, the following rule of thumb is applied: short
messages between the server and clients are transmitted over TCP/IP, while CIFS is used for
large data exchanges.
3.5.2 Class Structure
Due to space constraints, a detailed class structure of the framework is given in appendix
E. In particular, the class diagram⁴ is shown in figures E.1 and E.2 and, while the purpose of
⁴ Note that all class and component diagrams given in this chapter have been simplified for the convenience of the reader. The implementation follows the agile methodologies' “self-documenting” code principle, hence the author hopes that some insight into the purpose and responsibilities of individual classes/components can be obtained by examining the names and signatures of the functions that they provide.
the classes should be self-explanatory from the method signatures, the main responsibilities of
the most important individual classes are given in table E.1.
All classes were implemented using a defensive programming technique. This proved to be cru-
cially important, since machines repeatedly lost CIFS connections to DS-Filestore, experienced
TCP/IP connection time-outs under high network load, were forcefully restarted both to install
updates and by other Intel lab users, and so on.
3.5.3 Behaviour
The high-level communication sequence between the server and the clients is given in figure
3.5.
Figure 3.5: UML 2.0 sequence diagram of the high-level communications between the server and clients in the Viola-Jones distributed training framework. The bounded “while” rectangle corresponds to line 4 in the Build-Cascade algorithm given in C.1.3.1.
The two most time-consuming tasks (false positive training image bootstrapping and weak
classifier boosting using AsymBoost) were both multi-threaded and distributed between clients.
The interactions between the clients while performing these tasks are coordinated in the
following way: immediately after the connection is established between the server and the client,
the server sends the client the indices of the high-resolution negative training images⁵ which that
particular client should use to bootstrap detector-resolution false positive training images for
each layer of the cascade.
⁵ As shown in the deployment diagram 3.4, all negative training images reside on the DS-Filestore and are accessed through CIFS.
After receiving the “start false positive image bootstrapping” command, a client obtains a copy
of the current detector cascade and repeatedly executes algorithm 3.5.3.1.
Algorithm 3.5.3.1 Single false positive training image bootstrapping. It requires a high-resolution negative training image Ii, an exhaustive array of triples Ai = [(x0, y0, size0), ..., (xn, yn, sizen)] describing all possible locations and sizes of bootstrapping samples for image Ii, and a current detector cascade Ct(~x). The result of this algorithm is either a single false positive training image, or Nil if no such image could be found.
False-Positive-Training-Image-Bootstrapping(Ii, Ai, Ct(~x))
 1  while Ai.length > 0
 2      // Generate a random sample index.
 3      r ← Random-Between(0, Ai.length − 1)
 4      // Acquire and resize the selected sample.
 5      ~x_current ← Resize(Sample(Ii, Ai[r]), Base-Resolution)
 6      // If the negative sample is misclassified as a face, return it.
 7      if Ct(~x_current) = 1
 8          return ~x_current
 9      // Otherwise, put the last sample into the current sample's place and
10      // decrement the array length marker.
11      Ai[r] ← Ai[Ai.length − 1]
12      Ai.length ← Ai.length − 1
13  return Nil
When a false positive training image is bootstrapped, its standard deviation σ is calculated
and stored in the NegativeTrainingImage class. The standard deviation is then used to inversely
scale the values of rectangle features, normalizing the variance of all false positive training
images and hence minimizing the effect of different lighting conditions. It is worth mentioning
that σ can be efficiently calculated using the integral image technique (see section 2.4.1): define
I2 to be the squared integral image; then
σ = √( E[I2] − (E[I])² ),

where E[I2] = I2(h, w)/(h × w) and E[I] = I(h, w)/(h × w), with h, w being the height and
width (respectively) of the false positive training image.
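As a sanity check of the formula above, here is a minimal illustrative sketch (Python, hypothetical names). It computes the window sums directly, whereas the actual implementation reads the equivalent totals I(h, w) and I2(h, w) in O(1) from the corner entries of the integral images:

```python
# Illustrative computation of a training window's standard deviation
# via sigma = sqrt(E[I2] - E[I]^2). `img` is a 2D list of pixel
# values; the sums below correspond to the integral-image corner
# values I(h, w) and I2(h, w).

def window_sigma(img):
    h, w = len(img), len(img[0])
    n = h * w
    s = sum(sum(row) for row in img)                    # I(h, w)
    s2 = sum(sum(p * p for p in row) for row in img)    # I2(h, w)
    mean = s / n                                        # E[I]
    return (s2 / n - mean * mean) ** 0.5
```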
Each bootstrapped false positive training image is then sent back to the server, which assembles
them into a new negative training image set. Figure 3.6 explains this interaction pictorially.
Similarly, figure 3.7 shows the interactions between the server and clients in the weak classifier
boosting task (based on AsymBoost algorithm described in section C.1.1.1).
Figure 3.6: UML 2.0 interaction overview diagram of the distributed false positive training image bootstrapping.
Figure 3.7: UML 2.0 interaction overview diagram of the distributed weak classifier boosting using the AdaBoost algorithm (see C.1.1.1) with the AsymBoost extension (see C.1.1.1).
3.6 Head-Tracking Library
A good initial insight into the implementation details of the HT3D (Head-Tracking in 3D)
library can be obtained by observing the data flow between its various components, as
demonstrated in figure 3.8.
As shown in figure 3.8, the HT3D library's internal implementation follows a highly modular
design, with cohesive, single-purpose components arranged in a “star” topology and decoupled
from each other.
These design features provide a high degree of flexibility in choosing which information sources
should be used for the viewer's head tracking and how they should be arranged (this proved
crucial for the evaluation chapter, in which the performance of different trackers with different
features enabled was compared).
Furthermore, this design simplified the interchange of components (as shown by the colour-based
background subtractor example) and streamlined testability.
Figure 3.8: Data flow diagram of the implemented HT3D library. Numbers on the arrows indicate the order in which data is passed in a typical head-tracking communication sequence.
The individual components of HT3D library are further discussed below.
3.6.1 Head-Tracker Core
As shown in the data flow diagram (figure 3.8), the head-tracker core orchestrates the various individual head-tracking components and exposes the HT3D library API to the end-user. A detailed head-tracker core class diagram is given in figure E.3, and the responsibilities of the most important classes are discussed in detail in table E.2.1.
Most importantly, the head-tracker combines the outputs of the colour- and depth-based trackers (discussed below) using algorithm 3.6.1.1.
Algorithm 3.6.1.1 Combining colour- and depth-based tracker predictions. Given the inputs C and D (the colour- and depth-tracker output rectangles, respectively), this algorithm returns the combined head center coordinates (or ∅ in case of tracking failure).
Combine-Trackers(C, D)
 1  if C ≠ ∅ ∧ D ≠ ∅
 2      if C ∩ D ≠ ∅
 3          return Rectangle-Center(Average-Rectangle(C, D))
 4      else
 5          Reset colour and depth trackers to “detecting” state.
 6          return ∅
 7  else
 8      if D ≠ ∅
 9          return Rectangle-Center(D)
10      if C ≠ ∅
11          return Rectangle-Center(C)
12      return ∅
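The combination rule above can be sketched in Python (an illustrative re-implementation, not the HT3D C# code; rectangles are modelled as (x, y, w, h) tuples and ∅ as None, with the tracker reset passed in as a callback):

```python
def rect_center(r):
    """Center of an (x, y, w, h) rectangle (Rectangle-Center)."""
    x, y, w, h = r
    return (x + w / 2, y + h / 2)

def average_rect(a, b):
    """Component-wise average of two rectangles (Average-Rectangle)."""
    return tuple((p + q) / 2 for p, q in zip(a, b))

def rects_intersect(a, b):
    """True if two rectangles overlap with non-zero area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def combine_trackers(c, d, reset_trackers=lambda: None):
    """Algorithm 3.6.1.1: combine colour (c) and depth (d) tracker outputs.

    None stands for the empty set (tracking failure); on contradictory
    outputs both trackers are reset to the "detecting" state."""
    if c is not None and d is not None:
        if rects_intersect(c, d):
            return rect_center(average_rect(c, d))
        reset_trackers()
        return None
    if d is not None:
        return rect_center(d)
    if c is not None:
        return rect_center(c)
    return None
```

For instance, two overlapping tracker rectangles (0, 0, 10, 10) and (5, 5, 10, 10) combine to (7.5, 7.5), the center of their average rectangle.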
From the user's point of view, the head-tracker core API exposes the following tracking outputs (via HeadTrackFrameReadyEventArgs):
1. Tracking images, rendering one of the options shown in figure 3.9,
2. Detected face image (from Viola-Jones face detector),
3. Tracked head rectangles (from colour and depth trackers),
4. Combined head center position in pixels,
5. Combined head center position in space w.r.t. Kinect sensor.
The HeadTracker class also exposes a number of head-tracking settings (as shown in figure E.3), allowing the user to tweak the detection and tracking components.
Figure 3.9: HT3D image frame rendering options (a) c = COLOUR FRAME, b) c = DEPTH FRAME, c) c = HISTOGRAM BACKPROJECTION, d) c = BACKGROUND SUBTRACTION, e) c = DEPTH FACE PROBABILITY), enabled by executing headTracker.EnabledRenderingCapabilities[c] = true.
3.6.2 Colour-Based Face Detector
The colour-based face detector component of the HT3D library is mainly responsible for localizing the viewer's face in colour images using a trained Viola-Jones cascade. For this reason, a large part of the distributed Viola-Jones training framework code is reused (in particular, the NormalizedTrainingImage, StrongLearner and StrongLearnerCascade classes, together with the RectangleFeature class hierarchy), as shown in figure 3.10.
A new ViolaJonesFaceDetector class is added, with the main responsibilities of:
• Deserializing the strong learner cascade from XML (obtained from the distributed training framework),
• Providing means to adjust the learner cascade coefficients (pre-multiplying each layer's threshold with a given constant),
• Detecting the viewer's face given the input colour and depth images.
While the implementation of the first two responsibilities is trivial, the latter deserves some
further attention. As discussed by Burgin et al. [10], cues present in depth data can be used
to make face detection faster and more accurate. In particular, the face search space can be
reduced from exploring multiple scales at each pixel, to searching for only plausible face sizes
at a pixel, given its distance from the camera.
This optimization of the exhaustive search is implemented as follows:
Figure 3.10: UML 2.0 class diagram of the colour-based face detector component of HT3D library.
1. Given the aligned colour and depth images (provided by Microsoft Kinect SDK), iterate
through the pixels in the colour image using a step size ∆ = 3 px.
2. For each pixel assume that a potential face is centred there. Set the face height upper
and lower bounds to 40 cm and 20 cm respectively, and use equation 2.11 to estimate the
face height upper (hu) and lower (hl) bounds in pixels.
3. Run Viola-Jones face detector starting at hl resolution, using a scaling factor s = 1.075
to increment the resolution, until hu upper bound is reached or a face is detected.
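The depth-to-pixel conversion of step 2 (equation 2.11 in the dissertation) is not reproduced here; the sketch below substitutes a simple pinhole-camera model h_px = f·H/Z with a hypothetical focal length f (in pixels), and enumerates the detector window sizes of step 3:

```python
def face_height_bounds_px(depth_m, focal_px=525.0, h_min_m=0.20, h_max_m=0.40):
    """Plausible face-height bounds (in pixels) at a given depth, using a
    pinhole model h_px = focal_px * H / Z. The focal length is a hypothetical
    stand-in; the dissertation's own conversion is its equation 2.11."""
    return focal_px * h_min_m / depth_m, focal_px * h_max_m / depth_m

def detector_scales(h_l, h_u, base_px=24, s=1.075):
    """Step 3: detector window heights from h_l up to h_u, grown by factor s,
    never below the 24 px resolution the cascade was trained at."""
    sizes = []
    size = max(h_l, base_px)
    while size <= h_u:
        sizes.append(size)
        size *= s
    return sizes
```

At a depth of 2 m this yields pixel bounds of roughly 52.5 to 105 px, so only a handful of scales need to be searched at that pixel instead of the full scale pyramid.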
One of the important points in the detection algorithm outlined above is that the face detection is triggered as soon as one of the sub-windows in the image passes through the detector cascade. The reason why the search is terminated immediately at that point is the main assumption given in the problem constraints (viz. that only a single viewer is present).
Another point to note is that scaling is achieved by scaling the face detector itself, and not the input image. More precisely, given a weak classifier as described in section 2.4.1.3, its scaled and variance-normalized version h_{i,s,σ}, which takes a Haar-like feature f, a threshold θ and a polarity p, and returns the class of an input image x⃗ (where s is the scale and σ is the standard deviation of the input image), can be defined as

    h_{i,s,σ}(x⃗, f, p, θ) = 1 if p·f(x⃗) < s²·σ·p·θ, and 0 otherwise.    (3.1)
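The scaled decision stump of equation 3.1 can be sketched as follows (the Haar-like feature value f(x⃗) is taken here as a precomputed number, rather than evaluated from an integral image):

```python
def scaled_weak_classifier(f_x, p, theta, s, sigma):
    """Decision stump of eq. 3.1. f_x is the precomputed Haar-like feature
    value f(x); scaling the threshold by s**2 compensates for the feature's
    area growth, and sigma variance-normalizes against the sub-window's
    contrast. p (the polarity) is +1 or -1; theta is the learned threshold."""
    return 1 if p * f_x < s * s * sigma * p * theta else 0
```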
3.6.3 Colour-Based Face Tracker
Figure 3.11: UML 2.0 class diagram of the colour-based face tracker component of HT3D library.
The class diagram of the colour-based face tracker is shown in figure 3.11.
The main responsibilities of the CamShiftFaceTracker and CamShiftFaceTrackerUsingSaturation classes are:
• Computing the “face” probability distribution (described in detail in 2.4.2.1),
• Calculating the face centroid and the search window size (described in C.2.2),
• Generating a face probability bitmap, as shown in figure 2.7.
While the implementation of these responsibilities closely follows the theory given in the relevant subsections of section 2.4.2, two main implemented extensions are worth mentioning separately:
• Tracking using a two-dimensional histogram from the hue-saturation colour space (based on the ideas in [1]). This extension is implemented to mitigate one of the well-known deficiencies of the CAMShift algorithm, viz. the inclusion of the background region if it has a similar hue to the object being tracked.
Figure 3.12 shows the probability images obtained using one- and two-dimensional histogram backprojections in equivalent tracking conditions.
Since the CamShiftFaceTrackerUsingSaturation class inherits from CamShiftFaceTracker, most of the standard CAMShift tracker code is reused and, more importantly, the extended tracker becomes interchangeable in place of the old one because of inheritance covariance.
• As described by Bradski in [8], a large amount of hue noise in HSV space is introduced when the brightness is low (as can be seen from figure 2.6). Similarly, small changes in the colour of low-saturated pixels in RGB space can lead to large swings in hue. For this reason, brightness (value) and saturation cut-off thresholds (θv and θs respectively) are
Figure 3.12: Probability images obtained from an input image b) using c) hue and d) hue-saturation histograms (brighter colour indicates a higher probability for the pixel to be part of the face; both histograms initialized with a) the output of the Viola-Jones detector shrunk by 20%). As shown in picture d), using a two-dimensional histogram built in hue-saturation colour space would allow the tracker to maintain track of the object even when the background has a similar hue.
introduced: if the brightness or saturation of a given pixel is below these thresholds, the pixel is ignored when building the colour histograms.
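The cut-off rule can be sketched as follows (a pure-Python stand-in for the library's histogram builder; the bin counts and default thresholds are illustrative, not the values used by HT3D):

```python
def build_hs_histogram(pixels, theta_s=0.1, theta_v=0.2, h_bins=30, s_bins=32):
    """Two-dimensional hue-saturation histogram. Pixels are (h, s, v) with
    h in [0, 360) and s, v in [0, 1]; pixels whose saturation or value fall
    below the cut-off thresholds are skipped, since their hue is dominated
    by noise at low brightness or low saturation."""
    hist = [[0] * s_bins for _ in range(h_bins)]
    for h, s, v in pixels:
        if s < theta_s or v < theta_v:
            continue  # unreliable hue: ignore when building the histogram
        h_idx = min(int(h / 360.0 * h_bins), h_bins - 1)
        s_idx = min(int(s * s_bins), s_bins - 1)
        hist[h_idx][s_idx] += 1
    return hist
```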
3.6.4 Colour- and Depth-Based Background Subtractors
Figure 3.13: UML 2.0 class diagram of the colour and depth background subtractor components of HT3D library.
Figure 3.14: Depth-based background subtractor operation. If the depth-based head tracker is locked onto the viewer's head in the input image a) (yellow rectangle), then the image can be segmented into background and foreground using the pixel's distance from Kinect as a decision criterion. In particular, if a given pixel is further away than the viewer's head center, then it is classified as background (black); otherwise it is classified as part of the foreground (white), as shown in image b).
Colour- and depth-based background subtractors share a common abstract ancestor class BackgroundSubtractor, which is responsible for creating a background segmentation bitmap (using concrete background subtractor implementations) given a certain background subtraction sensitivity (again, dependent on the concrete implementation). Because of this design, all background subtractors are interchangeable, and mock background subtractors can be used to test the library.
Two concrete colour-based subtractors are implemented: a ViBe background subtractor (ViBeBackgroundSubtractor class) and a Euclidean-distance-thresholding background subtractor (EuclideanBackgroundSubtractor class). Similarly, a depth-based DepthBackgroundSubtractor class is implemented, albeit serving a slightly different purpose: to increase the speed and the accuracy of the colour-based face detector and tracker⁶.
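The interchangeable design can be sketched with an abstract base class and a Euclidean-distance subtractor (illustrative Python rather than the C# class hierarchy; the per-pixel frame/background representation is an assumption of this sketch):

```python
from abc import ABC, abstractmethod

class BackgroundSubtractor(ABC):
    """Common interface: concrete subtractors decide per-pixel foreground
    membership; the meaning of `sensitivity` is implementation-dependent."""
    def __init__(self, sensitivity):
        self.sensitivity = sensitivity

    @abstractmethod
    def is_foreground(self, pixel, background_pixel):
        ...

    def segment(self, frame, background):
        """Background segmentation bitmap (True = foreground)."""
        return [self.is_foreground(p, b) for p, b in zip(frame, background)]

class EuclideanBackgroundSubtractor(BackgroundSubtractor):
    """Foreground where the RGB Euclidean distance to the background model
    exceeds the sensitivity threshold."""
    def is_foreground(self, pixel, background_pixel):
        dist = sum((a - b) ** 2 for a, b in zip(pixel, background_pixel)) ** 0.5
        return dist > self.sensitivity
```

A mock subtractor for testing only needs to override is_foreground, mirroring the interchangeability described above.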
3.6.5 Depth-Based Head Detector and Tracker
Since depth-based head detection and tracking methods are based on the same priors about
the human head shape, both the detection and tracking functionality is provided by the
DepthHeadDetectorAndTracker class (as shown in figure 3.15).
The main responsibilities of the DepthHeadDetectorAndTracker class are:
• Preprocessing of depth information (depth shadow elimination and integral-image based real-time depth image blurring),
⁶ Due to space constraints, further background subtractor implementation details are given in appendix E.2.2.
Figure 3.15: UML 2.0 class diagram of the depth-based head detector and tracker.
• Viewer's head detection using the Peters and Garstka method (implemented closely following section 2.5.1),
• Head tracking using a modified CAMShift algorithm (implemented following section 2.5.2).
The DepthHeadDetectorAndTracker class exposes the means to set the integral-image based blur radius r (as shown in figure 2.10) and to enable/disable the depth shadow elimination (as shown in figure 2.11).
The only slight deviation from the theory described in section 2.5.1 when implementing the head detector is that the minimum head height requirement is relaxed from 25 cm to 15 cm (increasing the detection rate).
While this modification can also result in a higher number of false positives, using a modified CAMShift tracker (as described in 2.5.2) to track the detected head prevents this from happening in practice. In particular, the search window expands to the whole head area if the regions with high “face” probability are connected, or degenerates to the minimum if two different physical objects were detected as a single object (since the first moment becomes relatively small compared to the initial search window size).
3.6.6 Tracking Postprocessing
Noise in depth and colour images creates instabilities when tracking the viewer's face/head (i.e. even if the viewer is not moving between consecutive frames, the detected head/face positions might differ slightly).
The noise sources present in depth images are briefly discussed in section D.1.2.
The main noise sources present in colour images produced by Kinect’s RGB camera are:
Figure 3.16: UML 2.0 class diagram of the tracking post-processing filters used in HT3D library.
• photon shot noise (a spatially and temporally random phenomenon arising due to Poisson-like fluctuations with which photons arrive at sensor elements),
• sensor read noise (voltage fluctuations in the signal processing chain from the sensor element readout, to ISO gain and digitization) and quantization noise (analogue voltage signal rounding to the nearest integer value in the ADC),
• pixel response non-uniformity, or PRNU (differences in sensor element efficiencies in capturing and counting photons, due to the variations in their manufacturing), and so on.
To help mitigate these face-/head-tracking noise issues, two simple filter classes are implemented:
• The ImpulseFilter class serves as an exponentially weighted moving average (EWMA) implementation of an infinite impulse response (IIR) filter, attenuating low-amplitude jitter in the head movements. Given the input vector x_t, the filtered value x̂_t is obtained by calculating

    x̂_t = (1 − α)·x̂_{t−1} + α·x_t,    (3.2)

where α is the smoothing (attenuation) factor. The initial value x̂_0 is equal to the first value of x obtained, i.e. x̂_0 ≜ x_0.
• The HighPassFilter class implements a discrete-time RC high-pass filter, which is used to smooth out the transitions when one of the trackers loses the track of the head/face. Given the input vector x_t, the high-pass filtered value x̂_t is obtained by calculating

    x̂_t = β·x̂_{t−1} + β·(x_t − x_{t−1}),    (3.3)

where β is the smoothing factor. The initial value x̂_0 is equal to the first value of x obtained, i.e. x̂_0 ≜ x_0.
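Equations 3.2 and 3.3 translate directly into code; the scalar Python sketch below is illustrative (the library applies the filters to rectangle and centroid coordinates):

```python
class ImpulseFilter:
    """EWMA / IIR low-pass of eq. 3.2: x̂_t = (1 − α)·x̂_{t−1} + α·x_t."""
    def __init__(self, alpha):
        self.alpha = alpha
        self.state = None

    def filter(self, x):
        if self.state is None:
            self.state = x  # x̂_0 := x_0
        else:
            self.state = (1 - self.alpha) * self.state + self.alpha * x
        return self.state

class HighPassFilter:
    """Discrete-time RC high-pass of eq. 3.3:
    x̂_t = β·x̂_{t−1} + β·(x_t − x_{t−1})."""
    def __init__(self, beta):
        self.beta = beta
        self.state = None
        self.prev_input = None

    def filter(self, x):
        if self.state is None:
            self.state = x  # x̂_0 := x_0
        else:
            self.state = self.beta * self.state + self.beta * (x - self.prev_input)
        self.prev_input = x
        return self.state
```

With α = 0.5, a step from 10 down to 0 decays geometrically (10, 5, 2.5, …), which is exactly the jitter-attenuating behaviour described above.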
In the overall HT3D architecture, the ImpulseFilter class is used to reduce the noise in the output face-/head-tracking rectangles returned by the colour and depth trackers. If both trackers are locked onto the viewer's head, then the HeadTracker calculates the location of the head centroid as the arithmetic average of the two rectangle centers; otherwise the centroid location is equal to the center of the tracking rectangle (if there is one).
To avoid a sudden face centroid jump if one of the trackers loses the track of the face, a HighPassFilter is used. In particular, the high-pass filtered frame-by-frame change in centroid positions is subtracted from the predicted face centroid position in that particular frame to obtain the final prediction (which is returned to the user of the library).
3.7 3D Display Simulator
Figure 3.17: UML 2.0 class diagram of the 3D display simulation program.
In the final part of the project's implementation, the HT3D library is used to simulate horizontal and vertical motion parallax in a 3D game. The game is largely based on the Blockout video game (published by California Dreams in 1989), and is essentially an extension of Tetris into the third dimension (hence the name, Z-Tris). The purpose of the game is to solve a real-time packing problem by forming complete layers out of polycubes which are falling into a three-dimensional pit. For this reason, achieving in-game goals requires accurate depth perception.
The high-level class, component and deployment diagrams of the 3D display simulation program
are shown in figures 3.17, 3.18 and 3.19 respectively.
As illustrated in figure 3.17, the 3D display simulator consists of two small UI modules (“3D Simulation Entry Point” and “Head Tracker Configuration”) and a larger model-view-controller-based module (“Z-Tris”).
This break-down into self-contained modules is based on very clear individual responsibilities, as described both below (for the “Z-Tris” module) and in appendix E.3 (for the UI modules).
Figure 3.18: UML 2.0 component diagram of the 3D display simulation program.
Figure 3.19: UML 2.0 deployment diagram of the 3D display simulation program (ZTris.exe) showing the required run-time components and artifacts.
Figure 3.20: An entry-point into the 3D display simulation program (MainForm class).
Figure 3.21: Head-tracker configuration GUI (ConfigurationForm) exposing all available HT3D library options.
Figure 3.22: Screenshot of Z-Tris game: the viewer is looking “down” into the pit, i.e. the active (transparent) polycube is moving away from the player. Depth perception is simulated using occlusions, relative density/height/size, perspective convergence, lighting and shadows, and texture gradient pictorial depth cues.
3.7.1 3D Game (Z-Tris)
As shown in the class diagram in figure E.4, the implementation of the game is based on the MVC (model-view-controller) architectural pattern (incorporating the “Observer”, “Composite” and “Strategy” design patterns) [9]. MVC facilitates a clear separation of concerns and responsibilities, reduces coupling, simplifies the growth of individual architectural units, supports powerful UIs (necessary for the 3D display simulation) and streamlines testing.
Due to the space limitations, the implementations of the “Model”, “Controller” and part of the “View” architectural units are discussed in appendix E.3.3. Since it is very important for the project aims that a number of depth cues are simulated pictorially in the process of rendering the game state (shown in the in-game screenshot 3.22), the depth-cue rendering part of the “View” is discussed below.
3.7.1.1 Generalized Perspective Projection
Occlusions, relative density, height and size, perspective convergence and motion parallax depth
cues are simulated using the off-axis perspective projection, as described in section D.2.1.
Given the viewer's head location in space (obtained by the ConfigurationForm from the HT3D library), the generalized (off-axis) perspective projection matrix G = P·Mᵀ·T can be expressed using the OpenGL projection matrix stack as shown in code listing 3.1 (see section D.2.1 for a notation reminder).
Listing 3.1: Generalized projection matrix implementation in OpenGL.

GL.MatrixMode(MatrixMode.Projection);
GL.LoadIdentity();
GL.Frustum(l, r, b, t, n, f);
Matrix4 MT = new Matrix4(vr.X, vr.Y, vr.Z, 0,
                         vu.X, vu.Y, vu.Z, 0,
                         vn.X, vn.Y, vn.Z, 0,
                            0,    0,    0, 1);
GL.MultMatrix(ref MT);
GL.Translate(-pe.X, -pe.Y, -pe.Z);
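The near-plane extents l, r, b, t passed to GL.Frustum depend on the tracked head position. A common way to derive them (a sketch in the spirit of the off-axis projection of section D.2.1; the screen-corner inputs pa, pb, pc are assumptions of this sketch, not part of the listing above) is:

```python
def frustum_extents(pa, pb, pc, pe, n):
    """Off-axis near-plane extents from screen corners and eye position.

    pa, pb, pc: lower-left, lower-right, upper-left screen corners (world
    space); pe: eye (tracked head) position; n: near-plane distance.
    Returns (l, r, b, t) plus the screen basis (vr, vu, vn)."""
    def sub(a, b): return tuple(x - y for x, y in zip(a, b))
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def norm(a):
        m = dot(a, a) ** 0.5
        return tuple(x / m for x in a)
    def cross(a, b):
        return (a[1]*b[2] - a[2]*b[1],
                a[2]*b[0] - a[0]*b[2],
                a[0]*b[1] - a[1]*b[0])

    vr = norm(sub(pb, pa))   # screen "right" direction
    vu = norm(sub(pc, pa))   # screen "up" direction
    vn = norm(cross(vr, vu)) # screen normal, pointing towards the eye
    va, vb, vc = sub(pa, pe), sub(pb, pe), sub(pc, pe)
    d = -dot(vn, va)         # perpendicular eye-to-screen distance
    l = dot(vr, va) * n / d
    r = dot(vr, vb) * n / d
    b = dot(vu, va) * n / d
    t = dot(vu, vc) * n / d
    return (l, r, b, t), (vr, vu, vn)
```

The returned basis vectors are the rows of the screen-alignment matrix, and the extents become asymmetric as soon as the eye moves off the screen's perpendicular axis, which is what produces the motion parallax effect.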
3.7.1.2 Shading
To simulate the lighting depth cue, the default OpenGL Blinn-Phong shading model [6] is used. Vertex illumination is divided into emissive, ambient, diffuse (Lambertian) and specular components, which are computed independently and added together (summarized in eq. 3.4).
The colour c_v of a vertex v is defined as

    c_v = c_{v,a} + c_{v,e} + Σ_{l ∈ lights} [ attenuation(l, v) × spotlight(l, v) ×
          ( c_{l,a}
            + [ max{ ((l − v) / ‖l − v‖) · v_n, 0 } × c_{l,d} × c_{v,d} ]
            + [ ( max{ ((l + v) / ‖l + v‖) · v_n, 0 } )^{α_v} × c_{l,s} × c_{v,s} ] ) ],    (3.4)

where c_{v,e}, c_{v,a}, c_{v,d}, c_{v,s} are vertex v material's emissive, ambient, diffuse and specular normalized (i.e. between 0 and 1) colours respectively, α_v is the shininess of vertex v, v_n is the normal vector at vertex v, and c_{l,a}, c_{l,d}, c_{l,s} are light l's ambient, diffuse and specular normalized colours respectively.
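For a single light with the attenuation and spotlight factors taken as 1, equation 3.4 can be sketched per colour channel as follows (all names here are illustrative; this is not the fixed-pipeline OpenGL code itself):

```python
def vertex_colour(v, vn, l, mat, light, shininess):
    """Single-light Blinn-Phong vertex colour per eq. 3.4, with the
    attenuation and spotlight factors taken as 1. Colours are (r, g, b)
    tuples in [0, 1]; v, vn, l are the vertex position, vertex normal and
    light position as 3D vectors."""
    def sub(a, b): return tuple(x - y for x, y in zip(a, b))
    def add(a, b): return tuple(x + y for x, y in zip(a, b))
    def mul(a, b): return tuple(x * y for x, y in zip(a, b))
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def norm(a):
        m = dot(a, a) ** 0.5
        return tuple(x / m for x in a)

    diff = max(dot(norm(sub(l, v)), vn), 0.0)               # Lambertian term
    spec = max(dot(norm(add(l, v)), vn), 0.0) ** shininess  # (l + v) term of eq. 3.4
    c = add(mat["ambient"], mat["emissive"])
    c = add(c, light["ambient"])
    c = add(c, tuple(diff * x for x in mul(light["diffuse"], mat["diffuse"])))
    c = add(c, tuple(spec * x for x in mul(light["specular"], mat["specular"])))
    return tuple(min(x, 1.0) for x in c)  # clamp to the normalized range
```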
Between-vertex pixel values are interpolated using Gouraud [20] shading.
In Z-Tris implementation, a scene is lit by a single light source positioned in front of the pit,
in the top left corner of the screen.
Figure 3.23: Screenshot of the scene in Z-Tris game rendered a) without and b) with shadows, generated using the Z-Pass technique.
3.7.1.3 Shadows
Section A.2.1 briefly discusses the relative importance of depth cues. In particular, shadows play an important role in understanding the position, size and geometry of the light-occluding object, as well as the geometry of the objects onto which the shadow is cast [25, 19].
For this reason, a Z-Pass shadow rendering technique using stencil buffers (as described in detail
in section D.2.2) is implemented. The algorithm itself is slightly optimized in the following
way: instead of rendering the unlit scene, projecting shadow volumes and rendering the lit
scene outside the shadow volumes again, a fully-lit scene is rendered first, then the shadow
volumes are projected and a semi-transparent shadow mask is rendered onto the areas within
the shadow volumes (saving one full scene rendering pass).
Figure 3.23 shows the same scene rendered with/without shadows generated using the Z-Pass technique.
3.8 Summary
Based on the chosen development strategy, all required computer vision and image processing methods have been successfully implemented from scratch and integrated into the three main components of the system (“Distributed Viola-Jones Face Detector Training Framework”, “Head-Tracking Library” and “3D Display Simulator”). These components were developed using industry-standard design patterns and software engineering techniques, strictly adhering to the time frame given in the project proposal. In the final system, the output from the training framework (the face detector cascade) has been integrated into the HT3D library, which was then used by the proof-of-concept 3D display application to simulate pictorial and motion parallax depth cues (see http://zabarauskas.com/3d for a brief demonstration of the system in action).
Chapter 4
Evaluation
This chapter describes the evaluation metrics used and the results obtained for all three major architectural components from Chapter 3 (“Viola-Jones Face Detector”, “Head-Tracking Library” and “3D Display Simulator”).
In particular, the Viola-Jones face detector evaluation characterizes the performance of the classifier cascade in terms of the false positive counts for a given detection rate. In the head-tracking library evaluation, the library's performance w.r.t. the average distance and the spatio-temporal overlaps between the tracker's prediction and the tagged ground-truth is analysed. Finally, the evaluation of the 3D display simulation program (Z-Tris) examines its correctness using different types of testing, and describes its run-time performance.
4.1 Viola-Jones Face Detector
4.1.1 Training Data
The positive face training database consisted of 4,916 upright full frontal images (24 × 24 px resolution), obtained from [11] (originally assembled by Michael Jones). A sample of the first 196 images from this set is shown in figure 4.1.
Figure 4.1: First 196 faces from the Viola-Jones face detector positive training image database.
CHAPTER 4. EVALUATION 49
Figure 4.2: Viola-Jones negative training images gathered using “aerial photograph”, “foliage”, “underwater”, “Persian rug” and “cave” search queries.
A collection of 7,960 negative training images (i.e. images not containing faces) for the first layer was obtained from the same source. A further set of 2,384 larger-resolution (994 × 770 px on average) negative training images was manually assembled using the Google Image Downloader [18] tool, using search queries like “Persian rug”, “aerial photograph”, “foliage”, etc. A few examples of such images are shown in figure 4.2.
All training images have been converted from 24 bits-per-pixel (bpp) colour images to 8-bpp grayscale bitmaps using the ImageMagick mogrify command-line tool [26] (to reduce disk/RAM storage requirements), and stored on DS-Filestore.
The amount of training data collected was relatively small compared to the Viola-Jones implementation: 32.9 million non-face sub-windows contained in 2,384 images, compared to the 350 million sub-windows contained in 9,500 images collected by Viola and Jones.
A decision to stop further data mining was made based on the project's time limitations (negative training images downloaded using Google Image Downloader had to be manually verified not to contain faces, which was a laborious and time-consuming task), storage limitations (the DS-Filestore storage quota had been reached by the current negative training image set) and, most importantly, the specifics of the intended use case for the detector being trained.
In particular, the colour-based face detector is used only to initialize the face tracker in the video sequence, hence it is enough for a face to be detected within the first few seconds of use. For a 30 frames/second video rate, the viewer's face only needs to be detected in one of the first few hundred input frames for the viewer not to experience any significant discomfort. The use of the face detector in spatio-temporal viewer tracking is therefore much less stringent than the classical “face detection in still images” task.
Based on this observation, the thresholds of the strong classifiers in the cascade can be increased, sufficiently reducing the false positive rates to compensate for the lack of training data (limited, of course, by the simultaneous reduction in detection rates).
4.1.2 Trained Cascade
Using the distributed training framework, a 22-layer cascade containing 1,828 decision stump classifiers has been trained (see figure 4.4). The first three selected weak classifiers, based on Haar-like features, are shown in figure 4.3.
Figure 4.3: First three Haar-like features selected by the AsymBoost training algorithm as the weak classifiers; a) and b) were selected for the first, c) for the second layer of the cascade.
Figure 4.4: Weak-classifier count growth for different-size (in layers) cascades.
Individual layers of the cascade (strong classifiers) were trained using AsymBoost (as described in section C.1.1.1). Distributed training of the whole cascade on 65 CPU cores (Intel Core 2 Q9550 @ 2.83 GHz) took 20 hours, 15 minutes and 2 seconds.
The breakdown of training time into the main individual tasks is shown in table 4.1. Interestingly, it took more time to distribute the bootstrapped false-positive samples between all clients than to actually bootstrap them using the distributed framework.
Task                                                              Average time (s)
Distributed best weak classifier search                                      38.39
Training data distribution (per layer)                                      137.30
Distributed negative training sample bootstrapping (per layer)               86.20
Distributed negative training sample bootstrapping (per image)               0.0024

Table 4.1: Average execution times for the main distributed Viola-Jones cascade training tasks.
Hit and false alarm rates used for each layer in the cascade are shown in table 4.2.
Layer      1      2      3      4      5      6      7      ≥ 8
Hit rate   0.960  0.965  0.970  0.975  0.980  0.985  0.990  0.995
FP rate    0.625  0.600  0.575  0.550  0.525  0.500  0.500  0.500

Table 4.2: Hit (detection) and false positive rate limits used for each layer of the cascade.
Figure 4.5: Three false positive samples (24 × 24 px), misclassified by the 22-layer Viola-Jones detector cascade.
For each new layer, 5,000 negative training image samples were bootstrapped using the distributed algorithm described in section 3.5.3.
Out of the 32,988,622 negative training samples (obtained from the large-resolution negative training images), only 40 samples were misclassified as faces in the last round of training (three of these samples are shown in figure 4.5).
4.1.3 Face Detector Accuracy Evaluation
To compare the performance of the trained cascade with the Viola-Jones results, the cascade was evaluated on the CMU/MIT [36] upright frontal face evaluation set, containing 511 labelled frontal faces. The receiver operating characteristic (ROC) curves showing the trade-off between the detection and false alarm rates of both cascades are shown in figure 4.6.
As expected, the cascade obtained by Viola and Jones performs significantly better. This performance difference can be attributed to the fact that Viola and Jones used 1,063% more data and trained 16 additional cascade layers with 4,232 additional decision stump classifiers.
Nevertheless, for the purposes of face detection in the context of face tracking, the trained cascade has proven to be completely adequate. In ten minutes of colour and depth recordings for the HT3D (Head-Tracking in 3D) library evaluation (described below), the trained face detector achieved 97.9% face detection precision. This result is illustrated in figure F.3, where all face detections in the HT3D evaluation recordings are shown.
4.1.4 Face Detector Speed Evaluation
As described by Viola and Jones [43], the speed of the detector cascade directly relates to the number of rectangular features that have to be evaluated per search sub-window. Due to the cascaded structure, most of the search sub-windows are rejected very early in the cascade. In particular, for the CMU/MIT set the average number of decision stump weak classifiers evaluated per sub-window is 3.058 (out of 1,828 present in the cascade)¹,².
¹ Cf. 8 weak classifiers on average in the Viola and Jones cascade.
² With the strong classifier rescaling coefficient of 0.4, as used in all HT3D evaluation recordings (see table 4.6).
CHAPTER 4. EVALUATION 53
Figure 4.6: Receiver operating characteristic curves for the distributively trained detector and the detector cascade trained by Viola and Jones. Both ROC curves were established by running the face detector on the CMU/MIT frontal face evaluation set. The face detector search window is shifted by [s∆], where s is the scale initialized to 1.0 (progressively increased by 25%) and ∆ is the shift factor, initialized to 1.0. Duplicate detections are merged if the area of their intersection is larger than half of the area of any individual detection rectangle. To obtain the full ROC curve, the thresholds of individual classifiers are progressively increased for the distributively trained detector, decreasing both the detection rate and the false positive count. A false positive rate can be obtained by dividing the false positive count by 69,055,978.
Figure 4.7: Trained Viola-Jones face detector evaluation tool. Two sample images from the MIT/CMU set are shown; ground-truth is marked with red/blue/white dots, the detector's output (produced using default settings, as given in table 4.6) is shown in green.
A C# implementation of the face detector achieved comparable performance to the one described by Viola and Jones [43]. In particular, the trained detector was able to process a 384 × 288 px image in 0.028 seconds on average (achieving a 35.71 frames-per-second processing rate), using a starting scale s = 1.25 and a step ∆ = 1.5.
While the image processing speed achieved by the trained cascade is 239% faster than the speed described in [43] under similar detector settings, it is unclear how much of this speed-up can be directly attributed to the shorter cascade and the smaller number of weak classifiers evaluated per sub-window³.
4.1.5 Summary
A 22-layer frontal/upright face detector cascade has been successfully trained in a very short timeframe (less than a day) using the distributed Viola-Jones framework implementation. While the performance of the cascade was limited by the amount of training data available (over 32.9 million negative training samples were exhausted), the achieved performance proved to be adequate for the face-tracking tasks. The face detector was also able to process 384 × 288 px input images at 35.71 FPS, making it suitable for real-time applications.
4.2 HT3D (Head-Tracking in 3D) Library
4.2.1 Tracking Accuracy Evaluation
The performance of the 3D display simulation program (the main project aim) crucially depends on accurate localization of the viewer's head in space. To that end, the Kinect SDK is used to obtain the relative location of the point in space corresponding to a speculated head-center pixel, hence accurately finding the head-center pixel coordinates is crucial to the overall project's success.
4.2.1.1 Evaluation Data
At the time of writing this dissertation, no standardized benchmark containing both colour and
depth data for the face tracking evaluation was available.
A set of evaluation data was manually collected using the StatisticsHandler class from the HT3D library (section 3.6.1). All videos in the set were taken to reflect conditions that might naturally occur when a single viewer is observing a 3D display, including head rotations/translations, changing lighting conditions, cluttered backgrounds, occlusions and even multiple viewers present in the frame. In total, 10 minutes of depth and colour data feed from Kinect were recorded at 27.5 average FPS (totalling over 16,000 frames).
³ In particular, it is unclear what speed improvement could have been achieved only by using a faster CPU, because of different operating systems, different implementation programming languages, and so on.
All scenarios covered in this evaluation set are given in table 4.3.
4.2.1.2 External Participant Recordings
Recordings for participants #1 to #5 were taken as part of the “Measuring Head Detection and Tracking System Accuracy” experiment⁴.
Before the experiment, a possible range of head/face muscle motions that can be performed was suggested to each participant. Then each participant was asked to move his/her head in a free-form manner, and two colour and depth videos (each 30 seconds long) were recorded⁵.
4.2.1.3 Ground-Truth Establishment
In order to establish the head position ground-truth in recorded colour and depth videos, a
laborious manual-tagging process is required. To alleviate some of the difficulties associated
with this process, a video tagging tool named “Head Position Tagger” was implemented using
C# for .NET Framework 4.0 (see figure 4.8).
Using this tool, the location of the head in the aligned colour and depth image can be specified
by manually best-fitting an ellipse. The ratio of the minor and major ellipse axes is fixed at 2:3, hence
only two points are needed to fully describe an ellipse (viz. the antipodal points on the major
axis).
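For illustration, the two-point ellipse parametrization can be sketched as follows (a minimal Python sketch with hypothetical names, not the tagging tool's actual C# code):

```python
import math

def ellipse_from_major_axis(p1, p2, axis_ratio=2 / 3):
    """Recover ellipse parameters from the two antipodal points on the
    major axis, assuming a fixed minor:major axis ratio (2:3 here)."""
    (x1, y1), (x2, y2) = p1, p2
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2   # center is the midpoint
    major = math.hypot(x2 - x1, y2 - y1)    # major axis length
    minor = axis_ratio * major              # fixed by the 2:3 ratio
    angle = math.atan2(y2 - y1, x2 - x1)    # orientation, in radians
    return cx, cy, major, minor, angle
```

With the ratio fixed, a single click-and-drag (two points) fully determines position, size and orientation, which is what makes the tagging fast.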
These two points are given using the mouse (in a single click-and-drag motion). The
position/orientation/size of the ellipse can then be further adjusted using the keyboard. Furthermore,
the ground-truth locations are linearly interpolated in between frames, hence only
the start and end ground-truth locations need to be established for spatially-continuous head
motions.
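The keyframe interpolation can be sketched along these lines (a hedged illustration with hypothetical names; ellipses are modelled here as parameter tuples):

```python
def interpolate_ground_truth(start_frame, end_frame, start_ellipse, end_ellipse):
    """Linearly interpolate per-frame ellipse parameters between two
    manually tagged keyframes (assumes end_frame > start_frame; ellipses
    are (cx, cy, major, minor, angle) tuples)."""
    span = end_frame - start_frame
    frames = {}
    for f in range(start_frame, end_frame + 1):
        t = (f - start_frame) / span  # 0.0 at the start key, 1.0 at the end
        frames[f] = tuple(a + t * (b - a)
                          for a, b in zip(start_ellipse, end_ellipse))
    return frames
```

This is why only 17.3% of frames needed manual tags: one keyframe pair covers an entire smooth head motion.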
Using this tool, 2,437 out of 16,489 frames were tagged, accounting for 17.3% of the total
frames (on average, 121.85 frames out of 703 were tagged per video, with σ = 36.61), with
the rest of the frames interpolated. Around 30 minutes were spent on tagging each individual
video.
Based on the main project's assumption (viz. the presence of a single viewer in the image), a single
face was tagged in every frame⁶ (including cases where the viewer's head was partially occluded,
or was partially out of frame).
⁴The experiment consent form describing the manner of the experiment in more detail is given in appendix H.
⁵Recorded videos were kept in accordance with the Data Protection Act and will be destroyed after the submission of the dissertation.
⁶For the “Multiple viewers” scenario, the viewer that was present in the recording for the longest time was tagged.
Scenario                                     Frame count  Length (sec.)  Brief description
F.4  Head rotation (roll)                        839         29.95       Head roll of ±70°.
F.6  Head rotation (yaw)                         828         29.95       Head yaw of ±160°.
F.8  Head rotation (pitch)                       799         29.98       Head pitch of ±90°.
F.10 Head rotation (all)                         812         29.98       Combined head roll, yaw and pitch.
F.12 Head translation (horizontal/vertical)      831         29.95       Head translation in 80% of horizontal FOV, 70% of vertical FOV.
F.14 Head translation (anterior-posterior)       821         29.98       Head translation in 80% of Kinect's depth range.
F.16 Head translation (all)                      822         29.95       Combined horizontal, vertical and anterior-posterior translation.
F.18 Head rotation and translation (all)         787         30.01       Combined head roll, yaw, pitch and horizontal, vertical, anterior/posterior translation (6 degrees-of-freedom).
F.20 Participant #1                              813         29.94       Face occlusion, varying facial expressions, partial head movement out of frame.
F.22 Participant #2                              831         29.96       Varying facial expressions, fast spatial motions.
F.24 Participant #3                              846         29.98       Partial and full face occlusion by hair and hands, fast spatial motions, changing facial expressions.
F.26 Participant #4                              848         29.95       Skin-hued clothing, partial face occlusion, varying facial expressions.
F.28 Participant #5                              828         29.98       Changing facial appearance (removing glasses, releasing the hair), partial face occlusion.
F.30 Illumination (low)                          788         29.98       Difficult lighting conditions (with only the monitor glare illuminating an otherwise dark scene).
F.32 Illumination (changing)                     849         29.98       Single light source moving around the scene.
F.34 Illumination (high)                         843         29.97       Direct sunlight (with depth data only partially present).
F.36 Changing facial expressions                 819         29.88       Drastically changing facial expressions.
F.38 Cluttered similar-hue background            848         29.97       Scene with a skin-hue background and multiple skin-hue objects.
F.40 Occlusions                                  809         29.98       Full head occlusions by multiple skin-hue and head-shaped objects.
F.42 Multiple viewers                            828         29.98       Two spectators present in the scene.
Total:                                        16,497        599.3
Table 4.3: Head-tracking evaluation set. Each recording consists of uncompressed input from Kinect's depth and colour sensors (320 × 240 px / 12 bits-per-pixel and 640 × 480 px / 32 bits-per-pixel respectively), and the aligned colour and depth image (320 × 240 px / 32 bits-per-pixel). The total size of all recordings is 24.3 GB.
Figure 4.8: “Head Position Tagger” tool GUI. Frame 112 of the “Occlusions” recording is being tagged; the head position marker is shown in red.
Figure 4.9: Ground-truth objects tagged by two different annotators in frames 160, 466 and 617
of the “Participant #1” recording. Blue/red ellipses represent objects G₁⁽ᵗ⁾ and Ĝ₁⁽ᵗ⁾ respectively, for
t ∈ {160, 466, 617}.
4.2.1.4 Evaluation Metrics
Three main metrics are used when evaluating the different trackers' performance on the evaluation
set recordings⁷:
• the δ metric, which measures the average normalized distance between the predicted and
ground-truth head centers,
• the STDA metric, which measures the spatio-temporal overlap (i.e. the ratio of the spatial
intersection and union, averaged over time) between the ground-truth and the detected
objects,
• the MOTA/MOTP metrics, which evaluate i) tracking precision as the total error in estimated
positions of ground-truth/detection pairs for the whole sequence, averaged over the total
number of matches made, and ii) tracking accuracy as the cumulative ratio of misses, false
alarms and mismatches in the recording, computed over the number of objects present in
all frames.
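The core quantities behind the δ and overlap-based metrics can be sketched as follows (a simplified illustration only; the exact definitions, including the matching and penalty terms of STDA/MOTA/MOTP, are in appendix F.1):

```python
import math

def delta_metric(pred_centers, gt_centers, gt_sizes):
    """Average distance between predicted and ground-truth head centers,
    normalized by the ground-truth head size in each frame (the idea
    behind the delta metric)."""
    dists = [math.hypot(px - gx, py - gy) / size
             for (px, py), (gx, gy), size
             in zip(pred_centers, gt_centers, gt_sizes)]
    return sum(dists) / len(dists)

def spatio_temporal_overlap(pred_masks, gt_masks):
    """Per-frame intersection-over-union of detected and ground-truth
    pixel sets, averaged over time (the idea behind STDA); masks are
    sets of (x, y) pixel coordinates."""
    ratios = [len(p & g) / len(p | g) if (p | g) else 1.0
              for p, g in zip(pred_masks, gt_masks)]
    return sum(ratios) / len(ratios)
```

In these terms, a δ of 1.0 means the predicted center is, on average, a full head-size away from the true center, while an overlap of 1.0 means the predicted and ground-truth areas coincide exactly in every frame.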
4.2.1.5 Inter-Annotator Agreement
Even humans do not entirely agree about the exact location of the head in an image (especially
for partially occluded or motion-blurred head images). To establish an indication
of the upper limit of the system's performance, two recordings (“Participant #1” and “Participant
#2”) were independently tagged by two annotators (1,644 frames in total).
Inter-annotator agreement was established for the STDA, MOTA, MOTP and δ metrics, with the
tracker output object D_i⁽ᵗ⁾ in the metric definitions replaced by the object tagged by annotator
#2 (denoted Ĝ_i⁽ᵗ⁾), as illustrated in figure 4.9.
The average distance between the head centers as marked by the two annotators was approximately
9.8% of the head size (as indicated by the δ measure). Similarly, an 82.9% spatio-temporal overlap
ratio between the tagged ground-truths was achieved (STDA measure).
⁷See appendix F.1 for full metric descriptions.
Figure 4.10: Inter-annotator δ metric evolution over time for the “Participant #1” recording (annotator #1 ground-truth is arbitrarily used as the baseline).
Complete results for all metrics obtained from the “Participant #1” and “Participant #2” recordings
are listed in table 4.5, and the balance of the fault modes can be seen from the confusion
matrix in table 4.4.
                                      Annotator #1
                             Overlap ≥ 75%   Overlap < 75%
Annotator #2  Overlap ≥ 75%      1,505             46
              Overlap < 75%         93              0

Table 4.4: Inter-annotator confusion matrix (in # of frames). The overlap is measured as the proportion of the area tagged by both annotators versus the area tagged by just a single annotator (i.e. |G₁⁽ᵗ⁾ ∩ Ĝ₁⁽ᵗ⁾| / |G₁⁽ᵗ⁾| and |G₁⁽ᵗ⁾ ∩ Ĝ₁⁽ᵗ⁾| / |Ĝ₁⁽ᵗ⁾|).
4.2.1.6 Evaluation Procedure
All scenarios in table 4.3 were tested using the same set of HT3D head-tracker parameters, as
given in table 4.6.
For each recording, the raw depth and colour streams were loaded using StatisticsHandler and
fed into the HT3D library core along the same data path as used for live data from the Kinect
sensor. The individual colour/depth/combined trackers were initialized at the first frame of the
recording, and the predicted head area/head center coordinates in each frame were serialized to
an XML file.
Recording        Frame count      δ       STDA     MOTA     MOTP
Participant #1       813       0.1374   0.8122   0.8370   0.8122
Participant #2       831       0.0592   0.8447   0.8923   0.8447
Total:             1,644       0.0979   0.8286   0.8650   0.8286

Table 4.5: Inter-annotator agreement for all evaluation metrics.
The serialized tracker output and the ground-truth data were then loaded into the “Head Position
Tagger” tool, and a report containing the evaluated STDA, MOTA, MOTP and δ metrics
was generated.
4.2.1.7 Evaluation Results
Colour, depth and combined tracker⁸ performances for the evaluation recordings with respect
to the δ, STDA, MOTA and MOTP metrics are discussed below.
Average Normalized Distance from the Head Center (δ)   The results of the δ metric conclusively
show that both the depth and combined trackers perform better than the colour-only tracker
on the given input recordings. In particular, both the depth and combined trackers performed
better than the colour one in 18/20 recordings.
While the difference between the performances of the depth and combined trackers is much smaller,
the combined tracker still outperforms the depth tracker in 14/20 recordings, and achieves a
slightly better total δ result.
Nonetheless, all trackers fell short of the “gold” inter-annotator agreement standard.
For illustration purposes, figures 4.12 and 4.15 show the δ measure's evolution over time for the
“Participant #5” and “Illumination (high)” recordings respectively. Similar analyses for the remaining
recordings are given in appendix F.2.2.
A summary of the δ measure for all the recordings is shown in figure 4.16.
⁸Using default settings as given in table 4.6, unless otherwise noted.
Frame 70 Frame 174 Frame 240 Frame 298
Frame 363 Frame 458 Frame 531 Frame 537
Frame 621 Frame 705 Frame 751 Frame 822
Figure 4.11: “Participant #5” recording. The marked red area indicates the output of the combined head-tracker.
Figure 4.12: δ metric evolution over time for “Participant #5” recording.
Frame 0 Frame 110 Frame 174 Frame 326
Frame 359 Frame 392 Frame 721 Frame 767
Figure 4.13: “Illumination (high)” recording. Marked red area indicates the output of the combined
head-tracker.
Frame 14 Frame 69 Frame 121 Frame 246
Figure 4.14: “Illumination (high)” recording depth frames. Blue colour indicates the areas of the
image where no depth data is present. More depth data becomes available towards the end of the
recording due to the reduced amount of sunlight in the scene.
Figure 4.15: δ metric evolution over time for “Illumination (high)” recording.
Figure 4.16: δ (average normalized distance from the head center) metric for all evaluation recordings (default settings). Lower values indicate better performance.
Figure 4.17: δ metric for all evaluation recordings (custom settings: increased ColourTrackerSaturationThreshold and ColourTrackerValueThreshold values for the “Illumination (low)”,
“Cluttered similar-hue background” and “Occlusions” recordings).
Figure 4.18: STDA (Sequence Track Detection Accuracy) metric for all evaluation recordings (higher values indicate better performance). Average (mean) STDA metric values for individual trackers are given in table 4.7.
STDA, MOTA and MOTP   Similarly to the δ metric, the colour-based tracker is nearly always
outperformed by both the depth and combined trackers with regard to the STDA, MOTA and MOTP
metrics. In particular, the depth and combined trackers perform better in 19/20 and 20/20 recordings
respectively for the STDA metric (as shown in figure 4.18), 16/20 and 18/20 recordings respectively
for MOTA, and 20/20 and 19/20 recordings respectively for MOTP. The depth and combined
trackers also consistently achieve better STDA/MOTA/MOTP results than the colour-only
tracker but, again, do not reach the performance of the “gold standard” (inter-annotator
agreement).
Interestingly, the purely depth-based tracker (as described in section 2.5.2) performs better than the
combined one w.r.t. the measures based on the spatio-temporally averaged ground-truth/detection
overlap ratios.
In particular, the depth-based tracker outperforms the combined tracker in 12/20 recordings using
the STDA metric, 12/20 recordings using the MOTA metric and 16/20 recordings using the MOTP
metric (see figure F.44 for MOTA/MOTP metric values per individual recording). The depth
tracker also achieves slightly better total STDA/MOTA/MOTP results.

Figure 4.19: Kinect's colour and depth streams subsampled and rescaled by factors k = 1, 2, 4 and 8.
These results can be partially explained by the fact that the combined tracker uses the intersection
of the individual colour and depth tracker outputs to produce the final prediction. This
approach can potentially reduce the number of both false positives (since both trackers have
to “vote” for a pixel to be classified as part of the head) and false negatives (in cases where
one of the trackers loses the track).
While this approach increases the accuracy of the head-center localization as shown by the δ metric
(which is crucial for the project's success), these benefits are outweighed by the slight increase
in false negatives occurring in the majority of frames, due to the relatively poor performance
of the colour tracker and the consequently decreased intersection area.
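The intersection-based fusion described above can be sketched as follows (a minimal illustration with hypothetical names; the HT3D implementation differs in detail):

```python
def combined_head_estimate(colour_pixels, depth_pixels):
    """Fuse colour- and depth-tracker outputs: intersect their pixel
    sets so that both trackers must 'vote' for a pixel, then take the
    centroid of the intersection as the head-center prediction.
    Pixel sets are sets of (x, y) coordinates."""
    head = colour_pixels & depth_pixels
    if not head:
        return None, head  # both trackers disagree: track lost this frame
    cx = sum(x for x, _ in head) / len(head)
    cy = sum(y for _, y in head) / len(head)
    return (cx, cy), head
```

The trade-off discussed above falls out directly: the intersection suppresses false-positive pixels contributed by only one tracker (helping δ), but its area shrinks whenever the colour tracker under-segments the head (hurting the overlap-based STDA/MOTA/MOTP metrics).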
4.2.1.8 Robustness to Undersampling and Noise
The effects of spatio-temporal undersampling and of additive white Gaussian noise (AWGN)
were also briefly investigated.
The average normalized distance from head center (δ) metric was calculated for the colour, depth
and combined head trackers on spatially and temporally undersampled versions of the “Participant #2”
recording (see figure 4.19). The results are summarized in figure 4.20; in brief, all trackers demonstrated
good robustness to undersampling, indicating that these algorithms could potentially
be applied to sensors with a lower resolution.
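The undersampling procedure can be sketched as follows (a minimal illustration, with frames modelled as nested lists of pixel values; the actual experiment operates on recorded Kinect streams):

```python
def undersample(frames, k):
    """Subsample a recording spatio-temporally by a factor k: keep
    every k-th frame, and within each kept frame every k-th row and
    every k-th pixel within a row."""
    return [[row[::k] for row in frame[::k]]  # spatial subsampling
            for frame in frames[::k]]         # temporal subsampling
```

For k = 2 this quarters the per-frame pixel count and halves the frame rate, roughly modelling a cheaper, lower-resolution sensor.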
Similarly, varying degrees of Gaussian noise were added to both the colour and depth streams of the
“Participant #2” recording (see figure 4.21). The results for the combined tracker are shown
in figure 4.22. While all three trackers showed some degree of robustness to noise, the depth
tracker was observed to be much more error-prone under AWGN in the depth stream. This
is possibly due to the head detection approach, which requires that horizontal local minima
satisfying equations 2.13 and 2.14 be found in a number of consecutive rows.
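The noise model can be sketched as follows (a minimal illustration; as in figure 4.22, the deviation is specified as a proportion of the maximum range value, and noisy values are clamped back into the valid range):

```python
import random

def add_awgn(frame, sigma_fraction, max_value):
    """Add zero-mean white Gaussian noise to a 2D frame of pixel values.
    sigma_fraction gives the standard deviation as a proportion of the
    maximum range value (255 for colour data, 4000 for depth data)."""
    sigma = sigma_fraction * max_value
    return [[min(max_value, max(0.0, v + random.gauss(0.0, sigma)))
             for v in row]
            for row in frame]
```

Applying the same σ fraction to both streams means the depth stream's absolute noise is much larger (σ = 0.05 corresponds to ±200 depth units vs. ±12.75 intensity levels), which is consistent with the depth tracker's greater sensitivity.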
Figure 4.20: Average Normalized Distance from Head Center (δ) metric for the “Participant #2” recording (using default tracker settings) under varying degrees of spatio-temporal undersampling. Lower values indicate better performance (notice the difference in vertical axis range for each of the trackers).
σ = 0 σ = 0.05 σ = 0.1 σ = 0.15
σ = 0 σ = 0.05 σ = 0.1 σ = 0.15
Figure 4.21: White Gaussian noise N ∼ N(0, σ²) added to Kinect's colour and depth streams.
Figure 4.22: Average Normalized Distance from Head Center (δ) metric for the “Participant #2” recording
with added white Gaussian noise. The noise is distributed around zero, with deviation given as a
proportion of the maximum range value (255 for colour data, 4000 for depth data). Lower values
indicate better performance.
Setting                            Default value
BackgroundSubtractorSensitivity    20
ColourDetectorSensitivity          0.4
ColourTrackerUseSaturation         True
ColourTrackerSaturationThreshold   32
ColourTrackerValueThreshold        64
BackgroundSubtractorType           BackgroundSubtractorType.DEPTH
DepthShadowEliminationEnabled      True
DepthSensorBlurRadius              1
ColourTrackerSensitivity           0.8
DepthTrackerSensitivity            0.8
CombinedTrackerSensitivity         0.4

Table 4.6: Default HT3D library settings.
4.2.1.9 Summary
Table 4.7 shows the average (mean) metric values for all trackers. While the inter-annotator
agreement has not been reached, both the depth and combined trackers have demonstrated
good performance in recordings containing varying backgrounds and lighting conditions,
and unconstrained viewer's head movements (with ±70° roll, ±160° yaw, ±90° pitch, anterior/posterior
translations within 40-400 cm, and horizontal/vertical translations within the
FoV of the sensor).
Tracker                       δ       STDA     MOTA     MOTP
Colour                      0.8259   0.3764   0.4158   0.4438
Depth                       0.3554   0.6024   0.5651   0.6552
Combined colour and depth   0.3270   0.5926   0.5574   0.6066
Inter-annotator agreement   0.0979   0.8286   0.8650   0.8286

Table 4.7: Tracker performance averaged over all evaluation recordings (obtained using default settings). Bold font indicates the best tracker values achieved for a given metric.
In particular, the combined tracker was able to predict the viewer's head center location within
less than 1/3 of the head's size from the actual head center (most important for the main project's
aim), and the depth tracker was able to achieve over 60% spatio-temporal overlap for the
predicted head area.
Regarding the relative performance of the different trackers, the main conclusion is that using
depth data in addition to colour data significantly improves head-tracking accuracy (as indicated by
all metrics).
This is mostly due to the very good performance of the CAMShift algorithm when applied
to the head probability distribution obtained from the depth data using the Peters and Garstka
priors. The main performance losses of the combined tracking algorithm stemmed from the
inaccuracies of the colour tracker in bad lighting conditions or in the presence of other similar-hue
objects in the scene.

Table 4.8: HT3D library performance when running the configuration GUI for 60 seconds on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz, with 8 GB RAM.

                                                   Average   Average %   Minimum %   Maximum %
                                                     FPS     CPU time¹   CPU time    CPU time
Colour tracker (no background subtraction)          27.833    28.820      21.839      36.658
Colour tracker (Euclidean background subtractor)    27.795    36.230      28.079      47.038
Colour tracker (ViBe background subtractor)         27.914    48.751      42.118      56.157
Depth tracker                                       27.568    47.135      37.438      60.837
Combined tracker                                    28.243    56.806      43.678      76.436
Histogram backprojection rendering                  27.830    64.165      56.214      74.096
Background subtraction rendering                    27.460    39.578      32.726      46.018
Depth head probability rendering                    28.060    63.624      53.818      78.776
Depth image rendering                               27.966    63.759      53.818      74.802

¹ The percentage of time that a single CPU core (with hyperthreading enabled) was busy servicing the process.
4.2.2 Performance Evaluation
To successfully achieve the main project aim (3D display simulation), it is crucial that the HT3D
DLL achieves real-time performance.
In order to evaluate the run-time head-tracking costs in realistic conditions, the performance
of the HT3D configuration GUI (see figure 3.21) was measured for various tracker settings. The HT3D
configuration GUI was chosen as a good representative program since it introduces only minimal
run-time overheads for data rendering (any projects using the HT3D library would be likely to incur
similar costs).
Run-time performance was tested on the main development machine, running 64-bit Windows 7
on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz with 8 GB RAM. A 64-bit
release build containing no debug information was measured using the Windows Performance
Monitor and dotTrace Performance 5.0 [28] tools.
The evaluation results are summarized in figures 4.23, 4.24 and in table 4.8. In summary, all
trackers achieved real-time performance: more than 27.4 frames per second were processed on
a single CPU core (with raw input provided by the Kinect sensor at 30 Hz).
Figure 4.23: Performance of HT3D trackers when running the configuration GUI on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz.
Figure 4.24: HT3D background subtractor performance with a colour tracker when running the configuration GUI.
4.2.2.1 “Hot Paths”
“Hot path” analysis indicates where most of the work in the process was performed (the most
active function call tree).
Due to a rather clumsy depth and colour stream alignment implementation in Kinect SDK
Beta 2, over 40% of the total head-tracking time was spent aligning the colour and
depth images (see figure 4.25).
In order to perform this alignment, the SDK provides the function GetColorPixelCoordinatesFromDepthPixel.
This function takes the coordinates of a depth pixel in the depth image, together
with the depth pixel value, and returns the corresponding coordinates of a colour pixel in the
colour image.
This API design effectively means that in every single frame, for every single pixel (x_d, y_d) in
the depth image, i) the function GetColorPixelCoordinatesFromDepthPixel has to be called to
return the corresponding colour coordinates (x_c, y_c), ii) the colour image has to be referenced
at coordinates (x_c, y_c) to obtain the colour value (r, g, b), and only then iii) the depth pixel
(x_d, y_d) can be assigned the colour value (r, g, b)⁹,¹⁰.
⁹This flaw has been fixed in Kinect SDK v1 (released on 01/02/2012, after the “Implementation Finish” milestone of the project), where an API for full-frame conversion is provided via the MapDepthFrameToColorFrame function. Assuming that the combined API yields a 10-fold performance improvement, Amdahl's law predicts that the overall head-tracker performance could be improved by over 38%, as shown in figure 4.25 part b).
¹⁰Incidentally, the updated image alignment API also supports 640 × 480 px resolution, hence the overall tracker resolution could be quadrupled from 320 × 240 px to 640 × 480 px.
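The Amdahl's-law estimate in footnote 9 can be checked with a one-line helper (p and s are the inputs; the footnote assumes that the alignment accounts for over 40% of the runtime and that the full-frame API is roughly 10× faster):

```python
def amdahl_speedup(p, s):
    """Overall speedup from accelerating a fraction p of the total
    runtime by a factor s (Amdahl's law): 1 / ((1 - p) + p / s)."""
    return 1.0 / ((1.0 - p) + p / s)
```

For example, with p = 0.4 and s = 10 the overall speedup bound is 1 / (0.6 + 0.04) ≈ 1.56×, i.e. a substantial throughput gain even though only the alignment step is accelerated.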
a)
b)
Figure 4.25: “Hot path” analysis for the HT3D library running the configuration GUI. The highlighted Kinect SDK function GetColorPixelCoordinatesFromDepthPixel performs the colour and depth image alignment. Image a) shows the unoptimized head-tracker performance (method nui_DepthFrameReady), while image b) shows the possible performance improvement obtained from a 10-fold reduction of GetColorPixelCoordinatesFromDepthPixel function calls.
4.3 3D Display Simulator (Z-Tris)¹¹
The correctness of the Z-Tris implementation was evaluated using a combination of automated
tests (unit, “smoke”, regression) and manual tests (white-box, functional, sanity, usability,
integration). 85.25% code coverage was achieved by automated unit tests for the core Z-Tris
classes (a sample unit test run is shown in figure 4.26).
Figure 4.26: Sample Z-Tris unit test run in Visual Studio Unit Testing Framework.
Regarding performance, Z-Tris (with the combined colour and depth head-tracker enabled)
achieved an average rendering speed of 29.969 frames per second, satisfying the real-time rendering
requirement. A single CPU core experienced an average load of 64.98%, indicating that
further processing resources remained available.
¹¹Only the evaluation summary is presented in this section due to space limitations; see appendix G for more Z-Tris evaluation details.
Chapter 5
Conclusions
5.1 Accomplishments
The main project aim (“to simulate depth perception on a regular LCD screen through the use of the ubiquitous and affordable Microsoft Kinect sensor, without requiring the user to wear glasses or other headgear, or to modify the screen in any way”) has been successfully achieved. While static images cannot do justice to the level of depth perception simulated by the system, a short video demonstration can be seen at http://zabarauskas.com/3d.
To achieve the main project’s aim, the following new approaches were suggested:
• A distributed Viola-Jones face detector training framework. The framework, running on 65 CPU
cores, was able to train a 22-layer detector cascade containing 1,828 decision stump classifiers in
less than a day (vs. a three-week estimate using a naïve approach). The training process was limited
only by the amount of data available (exhausting 32.9 million negative training examples).
• A real-time depth-based head tracker, combining the CAMShift tracking algorithm with the Peters and
Garstka priors. Over 10 minutes of colour and depth recordings, the depth-based head-tracker
was able to achieve a better than 60% average spatio-temporal overlap ratio between the ground-truth
objects and their predicted locations.
• A real-time combined (colour and depth) head-tracker. Over 10 minutes of evaluation recordings
(containing unconstrained viewer's head movement in six degrees-of-freedom, in the presence of
occlusions, changing facial expressions, different backgrounds and varying lighting conditions),
the combined head-tracker was able to predict the viewer's head center location within less than
1/3 of the head's size from the actual head center (on average).
To the same end, a number of published methods were implemented:
• the Viola-Jones face detector (with depth cue extensions as suggested in [10]),
• a depth-based face detector, using the Peters and Garstka method,
• the CAMShift face tracker (extended to use both hue and saturation data), and
• the ViBe background subtractor (in itself an extension to the project).
All of these methods were combined into a robust and flexible HT3D head-tracking library.
Finally, a proof-of-concept application was developed, creating depth perception on a regular LCD
display by simulating continuous horizontal/vertical motion parallax (using HT3D DLL) and a number
of pictorial depth cues.
Such systems could serve as potential “backwards-compatibility” providers during the transition from
2D to 3D displays (being able to render convincing 3D content on ubiquitous 2D displays).
CHAPTER 5. CONCLUSIONS 76
5.2 Future Work
Despite the obvious improvement over the colour-based head-tracker, neither the depth nor the combined
tracker has reached the inter-annotator agreement (“gold standard”) results.
The obvious next steps in increasing tracker performance would be to i) train the Viola-Jones face
detector using more data, and ii) port the HT3D library from Kinect SDK Beta 2 to Kinect SDK
v1, increasing the depth resolution to 640 × 480 px (effectively quadrupling the amount of depth data
present).
A more interesting direction, however, would be to explore the applicability of well-performing colour-
based methods to depth data. Possible examples include training a Viola-Jones face detector on
depth images (involving the collection of a representative depth data training set), or exploring the
applicability of adaptive background subtraction techniques (like ViBe) to depth image sequences.
Based on the experience obtained throughout the project, it seems quite likely that these approaches
could further improve the tracking accuracy.
Furthermore, the head-tracking library could be extended to deal with multiple people (this would
involve implementing partial/full occlusion disambiguation and object identification). Running both
colour- and depth-based multiple viewer trackers in parallel could potentially provide a significant
advantage over the systems based only on a single information source.
Bibliography
[1] Allen, J. G., Xu, R. Y. D., and Jin, J. S. Object Tracking Using CAMShift Algorithm and
Multiple Quantized Feature Spaces. Reproduction 36 (2006), 3–7.
[2] Barnich, O., and Van Droogenbroeck, M. ViBe: A Universal Background Subtraction
Algorithm for Video Sequences. IEEE Transactions on Image Processing 20, 6 (2011), 1709–
1724.
[3] Beck, K., Beedle, M., van Bennekum, A., Cockburn, A., Cunningham, W., Fowler,
M., Grenning, J., Highsmith, J., Hunt, A., Jeffries, R., Kern, J., Marick, B., Martin,
R. C., Mellor, S., Schwaber, K., Sutherland, J., and Thomas, D. Manifesto for
Agile Software Development, 2001.
[4] Benzie, P., Watson, J., Surman, P., Rakkolainen, I., Hopf, K., Urey, H., Sainov, V.,
and Kopylow, C. V. A Survey of 3DTV Displays: Techniques and Technologies, 2007.
[5] Bernardin, K., and Stiefelhagen, R. Evaluating Multiple Object Tracking Performance:
The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing 2008 (2008),
1–10.
[6] Blinn, J. F. Models of Light Reflection for Computer Synthesized Pictures. ACM SIGGRAPH
Computer Graphics 11, 2 (1977), 192–198.
[7] Boyle, M. The Effects of Capture Conditions on the CAMShift Face Tracker. Alberta, Canada:
Department of Computer Science, (2001).
[8] Bradski, G. Real Time Face and Object Tracking as a Component of a Perceptual User Interface.
In Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision, 1998. WACV
’98., pp. 214 –219.
[9] Burbeck, S. Applications Programming in Smalltalk-80: How to Use Model-View-
Controller (MVC). http://st-www.cs.illinois.edu/users/smarch/st-docs/mvc.html. Last accessed
on 07/04/2012.
[10] Burgin, W., Pantofaru, C., and Smart, W. D. Using Depth Information to Improve Face
Detection. In Proceedings of the 6th International Conference on Human-Robot Interaction (New
York, NY, USA, 2011), HRI ’11, ACM, pp. 119–120.
[11] Carbonetto, P. Training Data for Robust Object Detection. http://www.cs.ubc.ca/
~pcarbo.
[12] Crow, F. C. Shadow Algorithms for Computer Graphics. In Proceedings of the 4th Annual Con-
ference on Computer graphics and Interactive Techniques (1977), vol. 11, ACM Press, pp. 242–
248.
[13] Cutting, J. E., and Vishton, P. M. Perceiving Layout and Knowing Distances: The Inte-
gration, Relative Potency, and Contextual Use of Different Information About Depth. Perception
5, 3 (1995), 1–37.
[14] Dodgson, N. A. Autostereoscopic 3D Displays. Computer 38, 8 (2005), 31–36.
BIBLIOGRAPHY 78
[15] Freund, Y., and Schapire, R. E. A Decision-Theoretic Generalization of On-Line Learning
and an Application to Boosting. Computational Learning Theory 139 (1995), 119–139.
[16] Fukunaga, K., and Hostetler, L. The Estimation of the Gradient of a Density Function,
with Applications in Pattern Recognition. IEEE Transactions on Information Theory 21, 1
(1975), 32–40.
[17] Garstka, J., and Peters, G. View-dependent 3D Projection using Depth-Image-based Head
Tracking. In 8th IEEE International Workshop on Projector Camera Systems PROCAMS (2011),
pp. 52–57.
[18] GiD. Google Image Downloader. http://googleimagedownloader.com.
[19] Goldstein, E. B. Sensation and Perception. Wadsworth Pub Co, 2009.
[20] Gouraud, H. Continuous Shading of Curved Surfaces. IEEE Transactions on Computers C-20,
6 (1971), 623–629.
[21] Heidmann, T. Real Shadows, Real Time, vol. 18. 1991.
[22] Herrera C., D., and Kannala, J. Accurate and Practical Calibration of a Depth and Color
Camera Pair. Computer Analysis of Images and (2011).
[23] Holliman, N. 3D Display Systems. Handbook of Optoelectronics. IOP Press, London (2005).
[24] Holliman, N., Dodgson, N., Favalora, G., and Pockett, L. Three-Dimensional Displays:
A Review and Applications Analysis. Broadcasting, IEEE Transactions on 57, 99 (June 2011),
1–10.
[25] Hubona, G. S., Shirah, G. W., and Jennings, D. K. The Effects of Cast Shadows and
Stereopsis on Performing Computer-Generated Spatial Tasks, 2004.
[26] ImageMagick. Mogrify Command-Line Tool. http://www.imagemagick.org/www/mogrify.
html.
[27] Jensen, O. Implementing the Viola-Jones Face Detection Algorithm. M.Sc Thesis, Informatics
and Mathematical Modelling, Technical University of Denmark (2008).
[28] JetBrains. dotTrace 5.0 Performance. http://www.jetbrains.com/profiler.
[29] Jones, A., McDowall, I., Yamada, H., Bolas, M., and Debevec, P. Rendering for an
Interactive 360 Light Field Display. ACM Transactions on Graphics (TOG) 26, 3 (2007), 40.
[30] Kooima, R. Generalized Perspective Projection. http://aoeu.snth.net/static/
gen-perspective.pdf, 2009.
[31] L. Xia C.-C. Chen, and Aggarwal, J. K. Human Detection Using Depth Information by
Kinect. In Workshop on Human Activity Understanding from 3D Data in conjunction with CVPR
(HAU3D) (Colorado Springs, USA, 2011).
[32] Manohar, V., Soundararajan, P., and Raju, H. Performance Evaluation of Object Detec-
tion and Tracking in Video. In Proceedings of the Seventh Asian Conference on Computer Vision
(2006), pp. 151–161.
[33] Microsoft. Kinect for Windows SDK. http://www.microsoft.com/en-us/
kinectforwindows.
[34] OpenTK. The Open Toolkit Library. http://www.opentk.com.
[35] Papageorgiou, C., and Oren, M. A General Framework for Object Detection. Computer
Vision, 1998. (1998), 555–562.
[36] Rowley, H. A., Baluja, S., and Kanade, T. Neural Network-Based Face Detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20, 1 (1998), 23–38.
[37] Schapire, R., and Singer, Y. Improved Boosting Algorithms Using Confidence-Rated
Predictions. Machine Learning (1999).
[38] Shoushtarian, B., and Bez, H. E. A Practical Adaptive Approach for Dynamic Background
Subtraction Using an Invariant Colour Model and Object Tracking. Pattern Recognition Letters
26, 1 (2005), 5–26.
[39] Stiefelhagen, R., Bernardin, K., Bowers, R., Rose, R. T., Michel, M., and Garo-
folo, J. The CLEAR 2007 Evaluation. Multimodal Technologies for Perception of Humans 4625
(2008), 3–34.
[40] Urey, H., Chellappan, K. V., Erden, E., and Surman, P. State of the Art in Stereoscopic
and Autostereoscopic Displays. Proceedings of the IEEE 99, 4 (Apr. 2011), 540–555.
[41] Viola, P., and Jones, M. Rapid Object Detection Using a Boosted Cascade of Simple
Features. Proceedings of the CVPR 2001 (2001).
[42] Viola, P., and Jones, M. Fast and Robust Classification using Asymmetric AdaBoost and a
Detector Cascade. Advances in Neural Information Processing Systems 14 (2002), 1311–1318.
[43] Viola, P., and Jones, M. Robust Real-Time Face Detection. Int. J. Comput. Vision 57, 2
(May 2004), 137–154.
[44] Xia, L., Chen, C.-c., and Aggarwal, J. K. Human Detection Using Depth Information by
Kinect. Pattern Recognition (2011), 15–22.
[45] Zhang, C., Yin, Z., and Florencio, D. Improving Depth Perception with Motion Parallax
and its Application in Teleconferencing. In Multimedia Signal Processing, 2009. MMSP’09. IEEE
International Workshop on (2009), IEEE, pp. 1–6.
Appendix A
Depth Cue Perception
A.1 Oculomotor Cues
Oculomotor cues are created by two phenomena: convergence and accommodation.
Convergence is the inward turning of the eyes (driven by the extraocular muscles) that occurs
when the object of focus moves closer to the viewer (see figure A.1). The kinesthetic sensations
that arise are processed in the visual cortex and serve as cues for depth perception.
Accommodation is the change in the shape of the eye lens that occurs when sight is focused on
objects at different distances. The ciliary muscles stretch the lens, making it thinner and thus changing the
eye's focal length (see figure A.2). Similarly to convergence, the kinesthetic sensations that arise from
contracting and relaxing the ciliary muscles serve as basic cues for distance interpretation.
Both of those phenomena are most effective at ranges of up to 10 metres from the observer [13]
and provide absolute distance information.
A.2 Monocular Cues
Monocular cues provide depth information when the scene is viewed with just one eye. They are
typically split into pictorial and motion cues.
Figure A.1: Eye convergence on a) near and b) far target.
Figure A.2: Right eye accommodation on a) near and b) far target.
A.2.1 Pictorial Cues
Pictorial cues are the sources of depth information that are present purely in the image formed on the
retina. They include:
• Occlusion, which occurs when one object hides another from view. The partially hidden
object is then interpreted as being farther away.
• Relative height. An object below the horizon whose base is higher in the field-of-view is
interpreted as being farther away.
• Relative size, which occurs when two objects of equal size occupy different amounts
of space in the field-of-view. The object that subtends the larger visual angle on the
retina is interpreted as being closer. If the object's size is known, this prior
knowledge can be combined with the angle that the object subtends on the retina to provide
cues about its absolute distance.
• Relative density, which occurs when a cluster of objects or texture features has a characteristic
spacing on the retina, and the observer is able to infer the distance to the cluster from the
perspective foreshortening effects on this characteristic spacing.
• Perspective convergence, which occurs when parallel lines extending away from the observer appear
to converge in the distance. The separation between the lines provides hints about the
distances from the observer to objects on these lines.
• Atmospheric perspective, which occurs when objects in the distance appear less sharp, have
lower luminance contrast and lower colour saturation, and their colours are slightly shifted towards
the blue end of the spectrum. This happens because the light from far away objects is scattered
by small particles in the air (water droplets, dust, airborne pollution).
• Lighting and shadows. The way that light reflects off the surfaces of an object and the shadows
that are cast provide cues to the visual cortex to determine both the shape and the relative
position of objects.
• Texture gradient, which manifests as a decrease in the fineness of texture details with increasing
distance from the observer. This change in texture detail as objects recede is detected in the
parietal cortex and provides further depth information.
Figure A.3: A photograph exhibiting a number of pictorial depth cues: occlusion, relative height, size and density, atmospheric perspective, lighting and shadows, and texture gradient.
Most of these pictorial depth cues are demonstrated in figure A.3.
A.2.2 Motion Cues
All the cues described above are present for the stationary observer. However, if the observer is in
motion, the following new cues emerge that further enhance human perception of depth:
• Motion parallax, which occurs when objects closer to the moving observer seem to move
faster and in the opposite direction to the movement of the observer, whereas objects farther
away move slower and in the same direction. This difference in motion speeds provides hints
about their relative distances. Given surface markings and some knowledge about the
observer's position, motion parallax can yield an absolute measure of depth at each point of
the scene.
• Deletion and accretion. Deletion occurs when an object in the background gets covered by an
object in front as the observer moves, and accretion occurs when the observer moves in the
opposite direction and the background object gets uncovered. This information can then
be used to infer depth order.
Figure A.4: The points on the left and right retinae with the same relative angle from the fovea are known as the corresponding retinal points (or cover points). Absolute disparity is the angle between two corresponding retinal points. The horopter is an imaginary surface that passes through the point of fixation; only images of objects on the horopter fall on corresponding points on the two retinae; they also have an absolute disparity equal to zero (e.g. objects A1 and B in picture a). Relative disparity is the difference between two objects' absolute disparities. Notice that the absolute disparity of the object A1 changes from 0 in picture a) to φ in picture b), but the relative disparity between objects A1 and A2 remains constant.
A.3 Binocular Cues
In the average adult human, the eyes are horizontally separated by about 6 cm, hence even when
looking at the same scene, the images formed on the two retinae differ. This difference between the
images in the left and right eyes is known as binocular disparity.
Binocular disparity gives rise to two phenomena that provide information about the distances of
objects, absolute disparity and relative disparity, both illustrated in figure A.4.
It has been shown that this depth information present in the viewing geometry (both absolute
and relative disparity) is actually translated into depth perception in the brain, creating the stereopsis
depth cue. In particular, neurons in the striate cortex respond to absolute disparity (Uka & DeAngelis,
2003), and neurons higher up in the visual system (in the temporal lobe and other areas) respond to
relative disparity (Parker, 2007).
Appendix B
3D Display Technologies
B.1 Binocular (Two-View) Displays
Binocular displays generate two separate viewing zones, one for each eye. Various multiplexing meth-
ods (and their combinations) are used to provide the binocular separation of the views:
• Wave-length division, used in anaglyph-type (wavelength-selective) displays (e.g. red/cyan
colour channel separation using anaglyph glasses, or amber/blue channel separation used in the
ColorCode 3D display system, both shown in figure B.1).
Most of the technologies based on wave-length division require eyewear (stereoscopic displays).
• Space/direction division, used in parallax-barrier type and lenticular-type displays. These are
mainly autostereoscopic displays, i.e. they do not require glasses.
Also, a number of space/direction division based displays can be combined with head tracking
to provide viewing zone movement (using shifting parallax barriers/lenticulars, or a steerable
backlight).
• Time division, used in active LCD-shutter glasses (e.g. DepthQ system, figure B.2).
• Polarization division, used in systems requiring passive polariser glasses (e.g. RealD ZScreen,
figure B.2).
Figure B.1: Wave-length division display technologies: a) red/cyan channel multiplexed glasses for anaglyph 3D image viewing, b) patented ColorCode 3D display system that uses amber/blue colour channel multiplexing to produce full colour 3D images.
Figure B.2: Time- and polarization-division based technologies: a) RealD ZScreen display system that uses a single projector equipped with an electrically controllable polarization rotator to produce orthogonally polarized frames, b) DepthQ display system that uses a single projector with time-multiplexed output (to be viewed with active liquid-crystal based shutter glasses).
B.2 Multi-View Displays
Multi-view displays create a fixed set of viewing zones across the viewing field, in which different stereo
pairs are presented. Typical implementation techniques for this type of display include:
• Combination of pixelated emissive displays with static parallax barriers or lenticular arrays
(integral imaging displays). For the latter, hemispherical (as opposed to cylindrical) lenslets
can be used to provide vertical as well as horizontal parallax.
However, constraints on pixel size and resolution in LCD or plasma displays limit horizontal
multiplexing to a small number of views [14]. Also, parallax barriers can cause a significant light
loss with the increasing number of views, whereas lenticular displays magnify the underlying
subpixel structure of the device, creating dark transitions between viewing zones.
• Multiprojector displays, where the image from each projector is projected on the entire double-
lenticular screen, but is visible only within the corresponding viewing regions at the optimal
viewing distance. These displays require very precise alignment of the projected images, and are
extremely costly, since they require a single projector per view.
• Time-sequential displays, where the different views are generated by a single display device
running at a very high frame rate. A secondary optical component (synchronized to the
image-generation device) then directs the images at different time-slots to different viewing
zones. An example implementation using a high-speed CRT monitor and liquid crystal shutters
in a lens array has been developed at Cambridge (see figure B.3). However, the optical path
length required by such displays reduces their commercial appeal in comparison to flat-panel
displays [23].
Figure B.3: Multi-view and light-field 3D display technologies: image a) shows a 25” diagonal, 28-view time-multiplexed autostereoscopic display system developed at Cambridge. A high-speed CRT display renders each view sequentially and the synchronised LCD shutters direct the view through a Fresnel field lens at the appropriate angle. Image b) shows the light-field display system as described by Jones et al. [29], consisting of a high-speed video projector and a spinning mirror covered by a holographic diffuser.
B.3 Light-Field (Volumetric and Holographic) Displays
Light-field displays simulate light travelling in every direction through every point in the image volume.
Volumetric displays generate images by rendering each point of the scene at its actual position in space
through slice-stacking, solid-state processes, open-air plasma effects and so on. Sample implementa-
tions of such displays include laser projection onto a spinning helix (Lewis et al.), varifocal mirror
displays (Traub) or swept-screen systems (Hirsch).
Holographic displays attempt to reconstruct the light field of a 3D scene in space by modulating
coherent light (e.g. with spatial light modulators, liquid crystals on silicon, etc.). Two commercial
examples of holographic displays are Holografika, which uses a sheet of holographic optical elements as its
principal screen, and the QinetiQ system, which uses optically-addressed spatial light modulators.
Another light-field display system is described by Jones et al. [29], which consists of a high-speed video
projector and a spinning mirror covered by a holographic diffuser (see figure B.3).
B.4 3D Display Comparison w.r.t. Depth Cues
All display types listed above can simulate all of the pictorial cues. Two-view displays without head
tracking add stereopsis to the pictorial depth cues, and head-tracked two-view displays can simulate
motion parallax. However, two-view displays typically require eyewear or head-tracking.
Multi-view displays create the perception of stereopsis and can simulate motion parallax without
head-tracking or eyewear. However, the motion parallax is typically segmented into discrete steps and is
only horizontal. Building multi-view displays with a large number of views to overcome these problems
remains technologically challenging.
Light-field displays can provide continuous motion parallax and accommodation depth cues (besides
stereopsis, convergence and pictorial depth cues). However, as described by Holliman et al. [24],
volumetric displays remain a niche product, and computational holography remains experimental.
In general, despite the fact that most of the stereoscopic binocular display systems have been man-
ufactured for decades and some of the autostereoscopic systems have been available for 10-15 years,
they are still mainly used in niche applications (further discussed in the following section).
B.5 3D Display Applications
B.5.1 Scientific and Medical Software
• Geospatial applications, in which 3D displays are used for terrain analysis, defence intelligence
gathering, pairing of aerial and satellite imagery by photogrammetrists, and so on.
• Oil and gas applications, in which 3D displays help exploration geophysicists to visualise
subterranean material density images, in order to make more accurate predictions of where
petroleum reservoirs might be located.
• Molecular modelling, computational chemistry and crystallography visualisations. Since the structure
of a particular molecule is determined by the spatial location of its molecular constituents, 3D
displays can help to visualise spatial relationships between thousands of atoms in a given molecule,
helping to determine its structure and function.
• Mechanical design, where 3D displays can help industrial designers, mechanical engineers and
architects to design and showcase complex 3D models.
• Medical applications, in which magnetic resonance imaging (MRI), computed tomography (CT),
ultrasound and other inherently volumetric images can be represented in 3D to help doctors make
a more accurate and quicker judgement. Three-dimensional displays can also help in minimally
invasive surgeries (MIS) to give surgeons a better understanding of depth and position when
making critical movements.
• Training of complex operations, remote robot manipulation in dangerous environments,
augmented and virtual reality applications, 3D teleconferencing and so on.
B.5.2 Gaming, Movie and Advertising Applications
In this application class, 3D displays have the advantage of novelty and increased user immersiveness
over regular 2D displays.
Figure B.4: Free2C interactive kiosk (built for use at showrooms, shops, airports, etc.), which uses head-tracking to control a vertically aligned lenticular screen to overcome the fixed viewing-zone requirement.
Over the last few decades this advantage has been exploited by a large number of different 3D display
systems, manufactured for the purpose of advertising. An example of such system (Free2C interactive
kiosk) is shown in figure B.4.
Similarly, a number of recent developments show an increasing interest in 3D display technologies for
movies and gaming.
Examples given by Zhang et al. in [45] include Nvidia’s release of a 3D Vision technology stereoscopic
gaming kit (in 2008) containing liquid-crystal shutter glasses and a GeForce Stereoscopic 3D Driver
(enabling 3D gaming on supported displays), an agreement between The Walt Disney Company and
Pixar (made in April 2008) to make eight new 3D animated films over the next four years, and
an announcement by DreamWorks Animation that it will release all its movies in 3D, starting in
2009.
Appendix C
Computer Vision Methods (Additional
Details)
C.1 Viola-Jones Face Detector
C.1.1 Weak Classifier Boosting using AdaBoost
AdaBoost combines a collection of simple classification functions into a stronger classifier through a
number of rounds, where in each round
• the best weak classifier (simple classification function) for the current training data is found,
• lower/higher weights are assigned to correctly/incorrectly classified training examples.
The final strong classifier is obtained by taking a weighted linear combination of weak classifiers,
where the weights assigned to individual weak hypotheses are inversely proportional to the number of
classification errors that they make.
These steps are illustrated in figure C.1, and formalized precisely in algorithm C.1.1.1.
A number of properties of AdaBoost have been proven. Of particular interest is a generalisation by
Schapire and Singer [37] of a theorem by Freund and Schapire, which states that the training error of
a strong classifier decreases exponentially in the number of rounds, i.e. the training error after T
rounds is bounded by

\frac{1}{N}\Big|\{\, i : C(\vec{x}_i) \neq y_i \,\}\Big| \;\leq\; \frac{1}{N}\sum_{i=1}^{N} \exp\!\big(-y_i f(\vec{x}_i)\big), \qquad \text{(C.1)}

where N is the number of training examples and f(\vec{x}) = \sum_{t=1}^{T} \alpha_t h_t(\vec{x}).
AdaBoost is designed to minimize a quantity related to the overall classification error, but in the
context of face detection this is not the optimal strategy. As discussed in section 2.4.1.4, it is
more important to minimize the false negative rate than the false positive rate.
C.1.1.1 AsymBoost Modification
AsymBoost (asymmetric AdaBoost) is a variant of AdaBoost (presented by Viola and Jones in
2002 [42]) specifically designed to be used in classification tasks where the distribution of positive
and negative training examples is highly skewed. The fix proposed in [42] is to adjust the training
Figure C.1: A simplified illustration of the AdaBoost weak classifier boosting algorithm given in C.1.1.1. In this training sequence, three weak classifiers that minimize the classification error are selected; after selecting each classifier, the remaining training examples are reweighed (increasing/decreasing the weights of incorrectly/correctly classified examples respectively). After selecting all three classifiers, a weighed linear combination of their individual thresholds is taken, yielding a final strong classifier.
Algorithm C.1.1.1 Weak classifier boosting using AdaBoost. It requires N training examples given in the array A = (~x1, y1), ..., (~xN, yN) (where yi = 0 for a negative and yi = 1 for a positive training example) and uses T weak classifiers to construct a strong classifier. The result of the boosting is the final strong classifier h(~x), which is a weighted linear combination of T hypotheses with the weights inversely proportional to the training errors.

AdaBoost(A, T)
    // Initialize training weights (where m is the count of negative, l is the count of
    // positive training examples).
    for each training example (~xi, yi) ∈ A
        if yi = 0
            w1,i ← 1/(2m)
        else
            w1,i ← 1/(2l)
    for t ← 1 to T
        // 1. Normalize the weights:
        for each weight wt,i
            wt,i ← wt,i / Σ_{j=1..N} wt,j
        // 2. Select the best weak classifier h(~x, ft, pt, θt), which minimizes the error
        //    εt = min_{f,p,θ} Σ_i wt,i |h(~xi, f, p, θ) − yi|:
        ht(~x) ← Find-Best-Weak-Classifier(~wt, A)
        // 3. Update the weights:
        for each training example (~xi, yi) ∈ A
            if ht(~xi) = yi
                wt+1,i ← wt,i · εt/(1 − εt)
            else
                wt+1,i ← wt,i
    return h(~x) = 1 if Σ_{t=1..T} log((1 − εt)/εt) ht(~x) ≥ (1/2) Σ_{t=1..T} log((1 − εt)/εt), and 0 otherwise.
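The boosting loop can be sketched in Python. This is an illustrative, deliberately brute-force version (the stump search simply tries every feature value as a threshold, rather than the sorted single-pass selection described in section C.1.2); the function name and decision-stump representation are choices made here, not part of the original framework:

```python
import numpy as np

def adaboost(X, y, T):
    """Boost T decision stumps as in algorithm C.1.1.1.
    X is an (N, K) feature matrix, y holds 0/1 labels.
    A stump h(x) = 1 iff p * x[f] < p * theta."""
    N = len(y)
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))  # initial weights
    stumps, alphas = [], []
    for _ in range(T):
        w = w / w.sum()                       # 1. normalise the weights
        best = None                           # 2. pick the stump minimising
        for f in range(X.shape[1]):           #    the weighted error (brute force)
            for theta in X[:, f]:
                for p in (1, -1):
                    h = (p * X[:, f] < p * theta).astype(int)
                    eps = np.sum(w * np.abs(h - y))
                    if best is None or eps < best[0]:
                        best = (eps, f, p, theta)
        eps, f, p, theta = best
        eps = max(eps, 1e-10)                 # guard against a perfect stump
        h = (p * X[:, f] < p * theta).astype(int)
        w = np.where(h == y, w * eps / (1 - eps), w)   # 3. reweigh examples
        stumps.append((f, p, theta))
        alphas.append(np.log((1 - eps) / eps))
    def strong(x):
        votes = sum(a * (p * x[f] < p * theta)
                    for a, (f, p, theta) in zip(alphas, stumps))
        return int(votes >= 0.5 * sum(alphas))
    return strong
```

On a trivially separable one-feature set, a single boosted stump already suffices; the sketch is meant only to make the weight-update bookkeeping concrete.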
weights in each round by a multiplicative factor of

\exp\!\left( \frac{1}{T}\, y_i \log \sqrt{k} \right), \qquad \text{(C.2)}

where T is the number of rounds1, yi is the class of the training example i and k is the penalty ratio
between false negatives and false positives.
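The compounding effect of this factor can be checked numerically; the sketch below assumes labels encoded as y ∈ {+1, −1} (positive/negative example), so that over T rounds the adjustments multiply out to √k for positives and 1/√k for negatives:

```python
import math

def asym_multiplier(y, T, k):
    """Per-round AsymBoost weight multiplier from equation C.2.
    y ∈ {+1, -1} is the example's class; k is the false-negative vs
    false-positive penalty ratio."""
    return math.exp((1.0 / T) * y * math.log(math.sqrt(k)))
```

Applied once per round, the T rounds compound to exp(y log √k) = k^(y/2), i.e. positive examples end up √k times heavier, skewing the boosted classifier towards a low false negative rate.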
C.1.2 Best Weak-Classifier Selection
The algorithm to efficiently find the best decision-stump weak classifier is given in C.1.2.1. The
asymptotic time cost of finding the best weak classifier in a given training round is O(KN logN), where
K is the number of features and N is the number of training examples.
Algorithm C.1.2.1 Selection of the best decision stump weak classifier. It requires an array of training examples A = (~x1, y1), ..., (~xN, yN), together with the training example weights ~wt. This algorithm returns the best rectangle-feature based decision stump classifier.
Find-Best-Weak-Classifier(~wt, A)
    Calculate T+, T− (total sums of positive/negative example weights).
    for each feature f
        for each training example (~xi, yi) ∈ A
            vi ← f(~xi)
        ~v.sort()
        for vi ∈ ~v
            Maintain S+_i, S−_i (total sums of positive/negative weights below the
            current example).
            // Calculate the current error:
            εf,i = min{ S+ + (T− − S−), S− + (T+ − S+) }
            If εf,i is smaller than the previously known smallest error, remember the
            current threshold θf and parity pf.
        Maintain the feature with the smallest error fb and the associated threshold θb
        and parity pb.
    return h(~x, fb, pb, θb)
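The single-feature core of this scan (sort once, then sweep while maintaining the running weight sums) can be sketched as follows; the function name and the tuple it returns are illustrative choices, not part of the original algorithm:

```python
import numpy as np

def best_stump_threshold(values, labels, weights):
    """One-feature version of algorithm C.1.2.1: scan the sorted feature
    values once, maintaining running sums of positive/negative weights
    below the current example; returns (error, threshold, parity).
    Parity +1 means 'values below the threshold are classified negative'."""
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    t_pos = w[y == 1].sum()          # T+: total positive weight
    t_neg = w[y == 0].sum()          # T-: total negative weight
    s_pos = s_neg = 0.0              # S+, S-: weight strictly below current example
    best = (np.inf, None, 0)
    for i in range(len(v)):
        e_pos = s_pos + (t_neg - s_neg)   # label everything below as negative
        e_neg = s_neg + (t_pos - s_pos)   # label everything below as positive
        e = min(e_pos, e_neg)
        if e < best[0]:
            best = (e, v[i], 1 if e_pos <= e_neg else -1)
        if y[i] == 1:
            s_pos += w[i]
        else:
            s_neg += w[i]
    return best
```

The sort dominates, giving the O(N log N) per-feature cost quoted above; the sweep itself is linear.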
1 If the strong classifier obtained using AsymBoost is to be used in the attentional cascade (see section2.4.1.4), the number of rounds required to train a particular strong classifier will be unknown in advance. Inthat case, it can be approximated using the round counts of the previous two layers: Ti+2 = Ti+1 + (Ti+1−Ti).
C.1.3 Cascade Training
The precise algorithm to build a cascaded Viola-Jones face detector is shown in listing C.1.3.1.
Algorithm C.1.3.1 Building a cascaded detector. It requires the maximum acceptable false positive rate per layer f, the minimum acceptable detection rate per layer d, the target overall false positive rate Ftarget, a set of positive training examples P and a set of negative training examples N. The algorithm returns a cascaded detector C(~x).
Build-Cascade(f, d, Ftarget, P, N)
    C(~x) ← ∅
    F0 ← 1.0, D0 ← 1.0
    i ← 0
    while Fi > Ftarget
        i ← i + 1
        ni ← 0, Fi ← Fi−1
        while Fi > f × Fi−1
            ni ← ni + 1
            hi(~x) ← AdaBoost(N ∪ P, ni)
            C(~x) ← C(~x) ∪ hi(~x)
            Evaluate the cascaded classifier on a validation set to determine Fi and Di.
            Decrease the threshold for hi(~x) until the cascaded classifier has a detection
            rate of at least d × Di−1.
        N ← ∅
        if Fi > Ftarget
            Evaluate C(~x) on the set of non-face images and put any false detections into
            N (“bootstrap” negative images).
    return C(~x)
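The control flow of the cascade construction can be sketched as follows. The expensive steps (layer training, validation-set evaluation, negative bootstrapping) are injected as hypothetical callables, and the per-layer threshold adjustment is omitted for brevity, so this shows only the loop structure, not a working trainer:

```python
def build_cascade(f_max, d_min, F_target, train_layer, evaluate, bootstrap):
    """Skeleton of algorithm C.1.3.1. train_layer(n) trains an n-stump
    strong classifier, evaluate(cascade) returns the current (F, D) rates
    on a validation set, bootstrap(cascade) collects false positives as
    new negatives. (Threshold lowering to reach d_min * D_prev omitted.)"""
    cascade = []
    F_prev = D_prev = 1.0
    while F_prev > F_target:
        n = 0
        F = F_prev
        while F > f_max * F_prev:        # grow the layer until it filters enough
            n += 1
            layer = train_layer(n)
            F, D = evaluate(cascade + [layer])
        cascade.append(layer)
        F_prev, D_prev = F, D
        if F_prev > F_target:
            bootstrap(cascade)           # refill N with current false positives
    return cascade
```

With a mock `evaluate` that halves the false positive rate per layer, the loop terminates after the expected number of layers, which is a convenient sanity check on the stopping conditions.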
C.1.3.1 Training Time Complexity
As briefly discussed in section 2.4.1.3, the asymptotic time cost of finding the best decision-stump weak
classifier is O(KN logN), where K is the number of features and N is the number of training examples.
Then, the cost of training a single strong classifier is O(MKN logN) where M is the number of weak
classifiers combined through boosting. Finally, the cost of training a detector cascade containing L
strong classifiers is O(LMKN logN).
To put these numbers into perspective, assume that it takes 10 milliseconds on average to evaluate a
rectangle feature on 10,000 training images (1 µs/image). Then training a cascade containing 25
strong classifiers, with a total of 4,000 decision stumps selected from 160,000 features and trained
on 10,000 training examples, would require over 74 days of continuous training (without considering
the time it takes to select the best feature out of the 160,000, or to bootstrap the false positive
training images for each layer).
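The 74-day figure follows directly from the stated assumptions, as a quick back-of-the-envelope check shows:

```python
# Each decision stump requires evaluating every feature on every training
# example; at 1 microsecond per feature evaluation:
stumps, features, examples = 4_000, 160_000, 10_000
seconds = stumps * features * examples * 1e-6   # 6.4 million seconds
days = seconds / 86_400
print(round(days, 1))  # → 74.1
```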
C.2 CAMShift Face Tracker
C.2.1 Mean-Shift Technique
The CAMShift face tracker is based on the mean-shift technique, a non-parametric technique for
climbing the gradient of a given probability distribution to find the nearest dominant peak (mode). The
precise details of this technique are summarized in algorithm C.2.1.1.
Algorithm C.2.1.1 Two-dimensional mean shift. It requires the probability distribution P, the initial location of the search window (x, y), the search window size s and the convergence threshold θ. It returns the location of the nearest dominant mode of the probability distribution P.
2D-Mean-Shift(P, x, y, s, θ)
    (x′c, y′c) ← (x, y)
    repeat
        (xc, yc) ← (x′c, y′c)
        // Find the zeroth moment of the search window
        M00 ← Σ_{|x|≤s/2, |y|≤s/2} P(xc + x, yc + y)
        // Find the first horizontal and vertical moments
        M10 ← Σ_{|x|≤s/2, |y|≤s/2} x · P(xc + x, yc + y)
        M01 ← Σ_{|x|≤s/2, |y|≤s/2} y · P(xc + x, yc + y)
        (x′c, y′c) ← (M10/M00, M01/M00)
    until Distance((xc, yc), (x′c, y′c)) < θ
    return (x′c, y′c)
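A minimal Python sketch of this loop over a discrete distribution (window clipping at the array borders is an implementation choice made here, not part of the algorithm):

```python
import numpy as np

def mean_shift_2d(P, x, y, s, theta):
    """Algorithm C.2.1.1: climb the discrete distribution P (a 2-D array
    indexed [x, y]) starting from (x, y), with an s-by-s search window,
    until the window centre moves less than theta."""
    h = s // 2
    xc, yc = float(x), float(y)
    while True:
        x0, y0 = int(round(xc)), int(round(yc))
        # clip the window to the array bounds
        x1, x2 = max(x0 - h, 0), min(x0 + h + 1, P.shape[0])
        y1, y2 = max(y0 - h, 0), min(y0 + h + 1, P.shape[1])
        window = P[x1:x2, y1:y2]
        xs, ys = np.mgrid[x1:x2, y1:y2]
        m00 = window.sum()                       # zeroth moment
        if m00 == 0:
            return xc, yc                        # empty window: stay put
        nx = (xs * window).sum() / m00           # first moments -> centroid
        ny = (ys * window).sum() / m00
        if np.hypot(nx - xc, ny - yc) < theta:   # convergence test
            return nx, ny
        xc, yc = nx, ny
```

Starting a few pixels away from a compact peak, the window centre walks onto the peak's centroid in a handful of iterations.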
C.2.2 Centroid and Search Window Size Calculation
Define the shorthand p(x′, y′) ≜ Pr(“I(x′, y′) belongs to a face”). Then the face centroid and the search
window size can be calculated as follows.
1. Compute the zeroth moment:

M_{00} = \sum_{x,y \in I_s} p(x, y), \qquad \text{(C.3)}

where Is is the current search window.

2. Compute the first horizontal and vertical spatial moments:

M_{10} = \sum_{x,y \in I_s} x \, p(x, y), \qquad M_{01} = \sum_{x,y \in I_s} y \, p(x, y). \qquad \text{(C.4)}

3. The centroid location (xc, yc) is then given by

(x_c, y_c) = \left( \frac{M_{10}}{M_{00}}, \frac{M_{01}}{M_{00}} \right). \qquad \text{(C.5)}

Similarly, the size s of the search window is set as

s = 2\sqrt{M_{00}}. \qquad \text{(C.6)}
This expression is based on two observations. First, the zeroth moment represents the distribution
area under the search window; hence, assuming a rectangular search window, its side length can be
approximated as √M00. Secondly, the goal of CAMShift is to track the whole object, so the search
window needs to be expansive: the factor of two ensures that the search window grows to span the
whole connected distribution area.
Bradski also suggests that in practice the search window width and height for face tracking can
be set to s and 1.2s respectively, to resemble the natural elliptical shape of the face.
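Equations C.3-C.6, together with Bradski's elliptical width/height suggestion, reduce to a few lines of array arithmetic (the function name and return convention are illustrative):

```python
import numpy as np

def camshift_window(p):
    """Centroid and search-window size from the moments in equations
    C.3-C.6; p is a 2-D array of per-pixel face probabilities over the
    current search window."""
    xs, ys = np.mgrid[0:p.shape[0], 0:p.shape[1]]
    m00 = p.sum()                                 # zeroth moment (C.3)
    xc = (xs * p).sum() / m00                     # first moments (C.4)
    yc = (ys * p).sum() / m00                     # -> centroid (C.5)
    s = 2.0 * np.sqrt(m00)                        # window side (C.6)
    return (xc, yc), (s, 1.2 * s)                 # elliptical width/height
```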
C.3 ViBe Background Subtractor
C.3.1 Background Model Initialization
Background models used in the ViBe algorithm can be instantaneously initialized using only the first
frame of the video sequence. Since no temporal information is present in a single frame, the main
assumption made is that neighbouring pixels share a similar temporal distribution.
Under this assumption, the pixel models can be populated using the values found in the spatial
neighbourhood of each pixel; based on the empirical observations of Barnich and Van Droogenbroeck,
selecting samples from the 8-connected neighbourhood of each pixel has proven satisfactory for
VGA resolution images.
This observation can be formalized in the following way. Let NG(x) be a spatial neighbourhood of
a pixel x; then

M_0(x) = \{\, v_0(y) \mid y \in N_G(x) \,\}, \qquad \text{(C.7)}

where the locations y ∈ NG(x) are chosen randomly according to the uniform law, Mt(x) is the model of
pixel x at time t, and vt(x) is the colour-space value of pixel x at time t.
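The initialization step can be sketched for a single-channel frame as follows (the function name, sample count and border clamping are choices made for this illustration):

```python
import numpy as np

def vibe_init(frame, n_samples=20, rng=None):
    """Initialize ViBe pixel models (equation C.7) from a single
    single-channel frame by sampling the 8-connected neighbourhood
    of every pixel."""
    rng = rng or np.random.default_rng(0)
    h, w = frame.shape[:2]
    model = np.empty((h, w, n_samples), dtype=frame.dtype)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(h):
        for j in range(w):
            for k in range(n_samples):
                di, dj = offsets[rng.integers(8)]   # uniform neighbour pick
                y = min(max(i + di, 0), h - 1)      # clamp at the border
                x = min(max(j + dj, 0), w - 1)
                model[i, j, k] = frame[y, x]
    return model
```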
C.3.2 Background Model Update
After a new pixel value vt(x) is observed, the memoryless update policy dictates that the old to-be-
discarded sample is chosen randomly from Mt−1(x), according to a uniform probability density
function. This way the probability that a sample which is present at time t will not be discarded at
time t + 1 is (N − 1)/N, where N = |Mt(x)|.

Assuming time continuity and the absence of memory in the selection procedure, the probability that
the sample in question will still be present after dt time units is

\left( \frac{N-1}{N} \right)^{dt} = \exp\!\left[ -dt \ln\!\left( \frac{N}{N-1} \right) \right], \qquad \text{(C.8)}

which is indeed an exponential decay.
Since in many practical situations it is not necessary to update each background pixel model for each
new frame, the time window covered by a pixel model of a fixed size can be extended using random
time subsampling.

In ViBe this is implemented by introducing a time subsampling factor φ. If a pixel x is classified as
belonging to the background, its value v(x) is used to update its model M(x) with probability 1/φ.
Finally, based on the assumption that neighbouring background pixels share a similar temporal
distribution, the neighbouring pixel models are stochastically updated when a new background sample
of a pixel is taken. More precisely, given the 8-connected spatial neighbourhood NG(x) of a pixel x,
the model M(y) of one of the neighbouring pixels y ∈ NG(x) is updated (y is chosen randomly, with
uniform probability).
This approach allows spatial diffusion of information using only samples classified as background,
i.e. the background model is able to adapt to changing illumination or scene structure while retaining
a conservative update2 policy.
2Conservative update policy “never includes a sample belonging to foreground in the background model.”[2]
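The two stochastic update steps above (memoryless in-place replacement with probability 1/φ, plus diffusion into a random 8-neighbour's model) can be sketched per pixel; the function signature and border clamping are illustrative choices:

```python
import numpy as np

def vibe_update(model, i, j, value, phi=16, rng=None):
    """ViBe model update for pixel (i, j) classified as background:
    with probability 1/phi replace a random sample of its model, and
    with the same probability propagate the value into the model of a
    randomly chosen 8-connected neighbour."""
    rng = rng or np.random.default_rng()
    h, w, n = model.shape
    if rng.integers(phi) == 0:                    # time subsampling
        model[i, j, rng.integers(n)] = value      # memoryless replacement
    if rng.integers(phi) == 0:                    # spatial diffusion
        while True:                               # pick a true neighbour
            di, dj = rng.integers(-1, 2), rng.integers(-1, 2)
            if (di, dj) != (0, 0):
                break
        y = min(max(i + di, 0), h - 1)
        x = min(max(j + dj, 0), w - 1)
        model[y, x, rng.integers(n)] = value
```

With φ = 1 every call fires both branches, which is convenient for testing; in practice Barnich and Van Droogenbroeck's suggested φ = 16 spreads the updates over time.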
Appendix D
Depth-Based Methods (Additional Details)
D.1 Depth Data Preprocessing
D.1.1 Depth Shadow Elimination
In order to obtain the depth values in a frame, Kinect uses infrared light to project a reference
dot pattern on the scene, which is then captured using an infrared camera. Since the projected and
captured images are not equivalent (due to the horizontal distance between the projector and the
camera), stereo triangulation can be used to calculate the depth once the correspondence problem is
solved.
However, this leads to depth shadowing (see the example in figure D.1). Since the infrared projector
is placed 2.5 cm to the right of the infrared camera, depth shadows of convex objects always
appear on their left side if the sensor is placed on a flat horizontal surface.
This suggests a straightforward depth shadow elimination technique for head tracking (as heads are
indeed convex):
1. Process the depth image one horizontal line at a time, from left to right.
2. If an unknown depth value is reported by Kinect, replace it with the last known depth value.
An example of the depth shadow removed using this technique is shown in figure 2.11.
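The two steps above amount to a single left-to-right sweep per scan line; a minimal sketch (function name and the convention that 0 encodes "unknown" are assumptions made here):

```python
import numpy as np

def fill_depth_shadows(depth, unknown=0):
    """Scan-line depth shadow elimination: sweep each row left to right
    and replace unknown readings with the last known depth on that row.
    A row that starts with unknown values keeps them until the first
    valid reading appears."""
    out = depth.copy()
    for row in out:
        last = unknown
        for i, d in enumerate(row):
            if d == unknown:
                row[i] = last
            else:
                last = d
    return out
```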
D.1.2 Real-Time Depth Image Smoothing
The noise in the depth calculation can arise from inaccurate disparity measurements within the
correlation algorithm, the limited resolutions of the Kinect infrared camera and projector, external
infrared radiation (e.g. sunlight), object surface properties (especially high specularity), and so on.
In the detection method below, every local minimum on a horizontal scan line is treated as a point
which potentially lies on the vertical head axis (a hypothesis which is confirmed or refuted using
prior knowledge about human head sizes). Since finding a local minimum essentially involves
discrete differentiation, such a method is very prone to noise. A solution proposed in [17] is to smooth
the depth image in real time using the “integral image” filter from the Viola-Jones face detection
algorithm.
As further described in section 2.4.1, the integral image can be calculated in linear time using a
dynamic programming approach; a smoothed depth value I_r(x, y) of the pixel at coordinates (x, y)
APPENDIX D. DEPTH-BASED METHODS (ADDITIONAL DETAILS) 98
Figure D.1: Kinect depth shadowing. The light blue polygon shows the area which is visible from the IR camera point of view, the light red polygon shows the region where the IR pattern is projected. Thicker blue lines indicate the areas on the objects that are visible by the IR camera, thicker red lines indicate the areas on the objects that have the IR pattern projected on them.
can be obtained by calculating
I_r(x, y) = [I(x - r, y - r) - I(x + r, y - r) - I(x - r, y + r) + I(x + r, y + r)] / (2r + 1)^2,    (D.1)
where I is the integral image and r is the half-width of the (2r + 1) × (2r + 1) averaging rectangle.
The result of smoothing using different averaging rectangle sizes is shown in figure 2.10.
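The smoothing step can be sketched in pure Python over 2-D lists (note that an exact inclusive box sum uses offsets of −r−1 on the low side, a detail that equation D.1 elides; the window is assumed to lie fully inside the image with a one-pixel margin):

```python
def integral_image(img):
    """I[y][x] = sum of img over the rectangle [0..x] x [0..y]."""
    h, w = len(img), len(img[0])
    I = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            I[y][x] = row_sum + (I[y - 1][x] if y > 0 else 0)
    return I

def smooth(img, I, x, y, r):
    """Box-average of img around (x, y) using the integral image I,
    i.e. the filter of equation D.1 with inclusive-sum indexing."""
    area = (I[y + r][x + r] - I[y - r - 1][x + r]
            - I[y + r][x - r - 1] + I[y - r - 1][x - r - 1])
    return area / (2 * r + 1) ** 2
```

Each smoothed pixel thus costs four lookups regardless of r, which is what makes the filter usable in real time.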
D.2 Depth Cue Rendering
This subsection describes two algorithms which are used to render various depth cues, as listed
in the main project aims (section 1.5). More precisely, a generalized perspective projection [30] is used to
simulate motion parallax when the viewer's head position is known, and the Z-Pass algorithm [21]
is used to simulate the pictorial shadow depth cue. More details on both of these algorithms are given
below.
D.2.1 Generalized Perspective Projection
A generalized perspective projection (as described by Kooima in [30]) is used to simulate the motion
parallax, occlusion, relative height, relative size, relative density and perspective convergence depth
cues.
The generalized perspective projection matrix G can be derived as follows. Let p_a, p_b, p_c be the three
corners of the screen as shown in figure D.2. Then the screen-local axes v_r, v_u and v_n that give the
Figure D.2: Screen definition for the generalized perspective projection. Viewer-space points p_a, p_b, p_c give the three corners of the screen, point p_e gives the position of the viewer's eye, screen-local axes v_r, v_u and v_n give the orthonormal basis for describing points relative to the screen, non-unit vectors v_a, v_b and v_c span from the eye position to the screen corners, and distances from the screen-space origin l, r, t, b give the left/right/top/bottom extents respectively of the perspective projection.
orthonormal basis for describing points relative to the screen can be calculated using
v_r = (p_b - p_a) / ||p_b - p_a||,
v_u = (p_c - p_a) / ||p_c - p_a||,
v_n = (v_r × v_u) / ||v_r × v_u||.    (D.2)
If the viewer's position changes so that the head is no longer in front of the center of the screen,
the frustum becomes asymmetric. The frustum extents l, r, b, t can be calculated as follows.
Let
v_a = p_a - p_e,  v_b = p_b - p_e,  v_c = p_c - p_e,    (D.3)
where p_e is the position of the viewer in world coordinates. Then the distance from the viewer to
the screen-space origin is d = -(v_n · v_a). Given this information, the frustum extents can be computed
using
l = (v_r · v_a) n / d,
r = (v_r · v_b) n / d,
b = (v_u · v_a) n / d,
t = (v_u · v_c) n / d,    (D.4)

where n is the distance to the near clipping plane (defined below).
Let n, f be the distances of the near and far clipping planes respectively. Then the 3D perspective
projection matrix P (which maps from a truncated pyramid frustum to a cube) is
P = | 2n/(r-l)   0          (r+l)/(r-l)    0          |
    | 0          2n/(t-b)   (t+b)/(t-b)    0          |
    | 0          0          -(f+n)/(f-n)   -2fn/(f-n) |
    | 0          0          -1             0          |    (D.5)
The base of the viewing frustum would always lie in the XY-plane in world coordinates. To enable
arbitrary positioning of the frustum, two additional matrices are needed:
MT = | v_r,x   v_r,y   v_r,z   0 |
     | v_u,x   v_u,y   v_u,z   0 |
     | v_n,x   v_n,y   v_n,z   0 |
     | 0       0       0       1 | ,    (D.6)
which transforms points lying in the plane of the screen to lie in the XY-plane (so that the perspective
projection could be applied), and
T = | 1   0   0   -p_e,x |
    | 0   1   0   -p_e,y |
    | 0   0   1   -p_e,z |
    | 0   0   0   1      | ,    (D.7)
which translates the tracked head location to the apex of the frustum.
Finally, note that the composition of linear transformations in homogeneous coordinates corresponds
to the product of the matrices that describe these transformations. This way, the overall generalized
perspective projection G (which produces a correct off-axis projection given constant screen corner
coordinates p_a, p_b, p_c and a varying head position p_e) can be calculated by taking the product of the
three matrices described above, i.e.
G = P MT T.    (D.8)
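The whole construction, from the screen-local axes (D.2) to the final product (D.8), can be sketched in pure Python; names are illustrative and the matrices act on homogeneous column vectors:

```python
def general_perspective(pa, pb, pc, pe, n, f):
    """Generalized perspective projection (after Kooima): screen corners
    pa, pb, pc, eye position pe, near/far plane distances n, f.
    Returns G = P * MT * T and the frustum extents (l, r, b, t)."""
    sub = lambda a, b: [x - y for x, y in zip(a, b)]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    cross = lambda a, b: [a[1] * b[2] - a[2] * b[1],
                          a[2] * b[0] - a[0] * b[2],
                          a[0] * b[1] - a[1] * b[0]]
    unit = lambda v: [x / dot(v, v) ** 0.5 for x in v]

    vr, vu = unit(sub(pb, pa)), unit(sub(pc, pa))       # screen axes (D.2)
    vn = unit(cross(vr, vu))
    va, vb, vc = sub(pa, pe), sub(pb, pe), sub(pc, pe)  # eye-to-corner (D.3)
    d = -dot(vn, va)                                    # eye-to-screen distance
    l, r = dot(vr, va) * n / d, dot(vr, vb) * n / d     # frustum extents (D.4)
    b, t = dot(vu, va) * n / d, dot(vu, vc) * n / d

    P = [[2 * n / (r - l), 0, (r + l) / (r - l), 0],    # off-axis frustum (D.5)
         [0, 2 * n / (t - b), (t + b) / (t - b), 0],
         [0, 0, -(f + n) / (f - n), -2 * f * n / (f - n)],
         [0, 0, -1, 0]]
    MT = [vr + [0], vu + [0], vn + [0], [0, 0, 0, 1]]   # screen-to-XY basis (D.6)
    T = [[1, 0, 0, -pe[0]], [0, 1, 0, -pe[1]],          # eye-to-apex shift (D.7)
         [0, 0, 1, -pe[2]], [0, 0, 0, 1]]

    matmul = lambda A, B: [[sum(A[i][k] * B[k][j] for k in range(4))
                            for j in range(4)] for i in range(4)]
    return matmul(P, matmul(MT, T)), (l, r, b, t)       # G = P MT T (D.8)
```

As a sanity check, with the eye centered in front of the screen the extents come out symmetric (l = -r, b = -t), and they become asymmetric as the tracked head moves off-axis.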
D.2.2 Real-Time Shadows using Z-Pass Algorithm with Stencil Buffers
As discussed in section A.2.1, cast shadows are very important for the human perception of the 3D
world. In particular, shadows play an important role in understanding the position, size and the
geometry of the light-occluding object, as well as the geometry of the objects on which the shadow is
being cast.
The first hardware-accelerated algorithm that uses stencil buffers and shadow volumes to render
shadows in real-time was described by Heidmann in 1991 [21].
His technique uses the following two-step process:
Figure D.3: Shadow volume of a triangle (white polygon) lit by a single point-light source. Any point inside this volume is in the shadow, everything outside is lit by the light.
1. The scene is rendered as if it were completely in shadow (e.g. using only ambient lighting);
2. Shadow volumes are calculated for each face and the stencil buffer is updated to mask the
areas within the shadow volumes; then, for each light source, the scene is rendered as if it were
completely lit, using the stencil buffer mask.
D.2.2.1 Shadow Volumes
Shadow volumes were first proposed by Crow [12] in 1977. A shadow volume is defined by the object-space
tessellations of the boundaries of the regions of space occluded from the light source [12].
To understand how a shadow volume can be constructed, without loss of generality consider a
triangle lit by a single point-light source. Projecting rays from the light source through each of the vertices
of the triangle to the points at infinity will form a shadow volume. Any point inside that volume is
hidden from the light source (i.e. it is in the shadow), everything outside is lit by the light (see figure
D.3).
D.2.2.2 Z-Pass Shadow Algorithm
After calculating shadow volumes, the locations in the scene where the shadows should be rendered
can be found in the following way:
1. For every pixel, project the ray from the viewpoint to the object visible at that pixel.
Figure D.4: Z-Pass algorithm. The blue polygon represents the viewing area from the camera point of view, grey polygons represent the shadow volumes, blue points indicate the "entries" to the shadow volumes, red points indicate the "exits". Numbers above the blue/red points indicate the operation that is being performed on the stencil buffer; if more shadow volumes have been entered than left (i.e. the value present in the stencil buffer is greater than zero) then the pixel in question is in the shadow.
2. Follow this ray and, for every pixel, subtract the number of times some shadow volume is left
from the number of times some shadow volume is entered.
3. If this count is greater than zero when the object is reached, more shadow volumes have been
entered than left, therefore that pixel of the object must be in the shadow.
See figure D.4 for an illustration.
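For a single ray, the entry/exit counting of the steps above can be sketched as follows (a simplified model where each shadow volume crossed by the ray is summarized by its entry and exit depths along the ray; names are illustrative):

```python
def in_shadow(volume_spans, object_depth):
    """Z-Pass counting along one ray: volume_spans is a list of
    (entry_depth, exit_depth) pairs, one per shadow volume crossed.
    The pixel is shadowed if more volumes were entered than left
    before the ray reaches the visible object."""
    count = 0
    for entry, exit_ in volume_spans:
        if entry < object_depth:
            count += 1      # ray entered a shadow volume
        if exit_ < object_depth:
            count -= 1      # ray left it again before the object
    return count > 0
```

The stencil buffer implementation below realizes exactly this count in hardware, one increment/decrement pass per volume face orientation.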
D.2.2.3 Stencil Buffer Implementation
A stencil buffer is an integer per-pixel buffer (additional to the colour and depth buffers) found in
modern graphics cards; it is typically used to limit the area of rendering.
An interesting application of the stencil buffer in real-time shadow rendering arises from the strong connection
between the depth and stencil buffers in the rendering pipeline. Since the values in the
stencil buffer can be incremented/decremented every time a pixel passes or fails the depth test,
the following implementation of the Z-Pass shadow algorithm (as described in [21]) becomes
feasible:
1. Initialize the stencil buffer to zero; render the scene with the lighting disabled. Amongst other
things, this will load the depth buffer with the depth values of the visible objects in the scene.
2. Enable back-face culling, set the stencil operation to increment on depth-test pass, and render
the shadow volumes without writing the rendering result into the colour and depth buffers.
This will count the number of “entries” into the shadow volume as described above.
3. Enable front-face culling and set the stencil operation to decrement on depth-test pass.
Again, render the shadow volumes without storing the render in the colour and depth buffers.
In this case, each pixel value in the stencil buffer will be decremented when the ray “leaves”
some shadow volume.
As described in section D.2.2.2, only the pixels that have a stencil buffer value of zero should be lit as
they are the ones that lie outside the shadow volume.
Using the zero values as a mask in the stencil buffer and rendering the scene with the lighting enabled
will correctly overwrite the previously shadowed pixels with the lit ones.
Appendix E
Implementation (Additional Details)
E.1 Viola-Jones Distributed Training Framework
The main classes of the Viola-Jones distributed training framework are shown in figures E.1 and E.2
below. The main responsibilities of these classes are summarized in table E.1.
E.2 HT3D Library
E.2.1 Head Tracker Core
A UML 2.0 class diagram of the HT3D library core is shown in figure E.3, and the responsibilities of
individual classes are summarized in table E.2.
E.2.2 Colour- and Depth-Based Background Subtractors
The class diagram for the colour- and depth-based background subtractors is shown in figure 3.13.
ViBeBackgroundSubtractor, EuclideanBackgroundSubtractor and DepthBackgroundSubtractor
classes have the shared responsibility to distinguish the moving objects (foreground) from the static
parts of the scene (background). Further implementation details of these classes are given below.
E.2.2.1 ViBe Background Subtractor
In ViBeBackgroundSubtractor, the background models of pixels obtained from the 8-bit grayscale
input bitmaps are internally represented as a three-dimensional byte array, where the first two
dimensions represent the pixel coordinates in the image and the third dimension serves as an index into
the model of that pixel. The background models are updated over time following the theory given in
section C.3.2.
The background sensitivity of the ViBe background subtractor is defined as the radius of the hypersphere
S_R in the colour space, as shown in figure 2.8.
E.2.2.2 Euclidean Background Subtractor
The background models in EuclideanBackgroundSubtractor are built under the assumption that at
the moment of the background subtractor initialization, only background objects are present in the
frame. Then the subsequent frames can be segmented into foreground and background by inspecting
APPENDIX E. IMPLEMENTATION (ADDITIONAL DETAILS) 105
Figure E.1: UML 2.0 class diagram of the Viola-Jones distributed training framework architecture (part 1 of 2).
Figure E.2: UML 2.0 class diagram of the Viola-Jones distributed training framework architecture (part 2 of 2).
ViolaJonesTrainer: serves as an entry point to the program (includes input parameter parsing and server/client protocol set-up); provides core shared training server and client functionality (e.g. multi-threaded best local rectangle feature search); implements training state preservation and restore.
ViolaJonesTrainerServer: manages connections with clients; provides means of data serialization to XML (e.g. the detector cascade) or to a compressed binary format (e.g. integral images, training weights); handles data transfer to clients over TCP/IP and CIFS.
ViolaJonesTrainerClient: implements the client end of the connection and data exchange, data deserialization and other client-specific functionality.
RectangleFeature: provides efficient means to generate, store and evaluate rectangle features.
DecisionStumpWeakClassifier: implements the Find-Best-Weak-Classifier algorithm (given in C.1.2.1) as part of the IWeakLearner interface.
StrongLearner: encapsulates a collection of rectangle features obtained using AsymBoost into a strong learner (representing a single layer in the cascade).
StrongLearnerCascade: encapsulates a collection of trained strong learners into a detector cascade.
NegativeTrainingImage: stores large resolution negative training images; implements the False-Positive-Training-Image-Bootstrapping algorithm given in 3.5.3.1.
NormalizedTrainingImage: stores normalized training images (both negative and positive) at the detector resolution scale.
Utilities: provides helper functions (e.g. conversion between different image formats); implements various workarounds to prevent PWF machines from logging off after a certain period of inactivity, as well as mouse and keyboard software locks, to prevent other users from accidentally shutting down training clients (see figure 3.3 for a picture of training machines in action).
SynchronizedList: implements a thread-safe, synchronized generic item list (e.g. used in storing bootstrapped false positive training images which are simultaneously sent by a number of clients).
Log: handles output logging to the hard drive in a thread-safe manner.
Table E.1: Responsibilities of individual classes in the Viola-Jones distributed training framework.
Figure E.3: UML 2.0 class diagram of the HT3D library core.
HeadTracker: sets up the tracking environment: i) deserializes the Viola-Jones face detector cascade from the training framework output XML file, ii) sets up the Kinect SDK (registers for DepthFrameReady and VideoFrameReady events, opens depth2 and colour3 byte streams), iii) initializes the face/head detection and tracking components. Orchestrates inputs and outputs from the face/head detection and tracking components: i) aligns colour and depth images using a calibration procedure provided by the Kinect SDK (which uses a proprietary camera model developed by the manufacturer of the Kinect sensor, PrimeSense)1, ii) maintains the head-tracking state of the depth and colour trackers (using the HeadTrackerState enumeration), iii) prepares input data for individual tracking components (e.g. converting colour bitmaps to grayscale, or combining input colour bitmaps with background/foreground segmentation information), iv) invokes tracking components as required and combines their outputs, and v) passes the tracking output to HeadTrackFrameReady event subscribers via instances of the event arguments class HeadTrackFrameReadyEventArgs.
StatisticsHandler: provides means to record aligned colour and depth frames (as a stream of 320 × 240 px bitmap images) together with the output from the head/face trackers (serialized into an XML file as a list of FaceCenterFramePair objects), allows recording and playback of the raw colour and depth frame data (one-dimensional byte arrays provided by the Kinect SDK), and gathers statistics about the face/head detection and tracking speeds.
Utilities: provides functionality to convert data between different formats (e.g. representing depth values as colours, or converting an input bitmap to an HSV byte array) and various methods that simplify bitmap manipulation (e.g. resizing, conversion to grayscale, etc.).
Table E.2: Responsibilities of individual classes in the HT3D library core.
1 The process of colour and depth image alignment is necessary since the IR and RGB cameras have different intrinsics and extrinsics (due to their physical separation). As proposed by Herrera et al. [22], the intrinsics can be modelled using a pinhole camera model with radial and tangential distortion corrections, and the extrinsics can be modelled using a rigid transformation consisting of a rotation and a translation. After the alignment, colour data is represented as 32-bit, 320 × 240 px bitmap images and depth data is represented as two-dimensional (320 × 240) short arrays, where each item in the array represents the distance of the depth pixel from the Kinect sensor in millimetres.
2 320 × 240 px, 30 Hz. (While the Kinect sensor supports 640 × 480 px depth output, 320 × 240 px is the highest resolution compatible with the colour and depth image alignment API.)
3 640 × 480 px, 30 Hz.
which individual pixels differ from the initial frame by more than the background subtractor sensitivity
threshold.
More precisely, if I_f is the initial 8-bit grayscale input frame, I_c is the current frame being segmented
and θ is the background subtractor sensitivity threshold, then a pixel (x, y) is classified as part of the
background by EuclideanBackgroundSubtractor if
|I_f(x, y) - I_c(x, y)| < θ.    (E.1)
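The per-pixel test of equation E.1 can be sketched per frame as follows (grayscale frames as 2-D lists; True marks background; the function name is illustrative):

```python
def euclidean_background_mask(initial, current, theta):
    """Classify each pixel as background (True) when its grayscale
    value differs from the initial frame by less than theta (eq. E.1)."""
    return [
        [abs(i0 - i1) < theta for i0, i1 in zip(row0, row1)]
        for row0, row1 in zip(initial, current)
    ]
```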
E.2.2.3 Depth-Based Background Subtractor
While the depth-based background subtractor inherits from the same base BackgroundSubtractor
class as colour-based background subtractors (see figure 3.13), it serves a slightly different purpose in
the head-tracking pipeline.
The main responsibility of the DepthBackgroundSubtractor class is to increase the speed and
the accuracy of colour-based face detector and tracker, using the information provided by the
DepthHeadDetectorAndTracker.
In particular, as long as the depth-based tracker is accurately locked onto the viewer’s head (i.e. if
the depth tracker state maintained in HeadTracker is equal to HeadTrackerState.TRACKING), all
pixels that are further away from the Kinect sensor than the detected head center are classified as
background.
An illustration of this process is shown in figure 3.14.
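When the depth tracker is locked on, this classification is a one-line depth threshold; a sketch (depth in millimetres, as in the HT3D depth arrays; the function name is illustrative):

```python
def depth_background_mask(depth, head_center_depth):
    """Mark as background (True) every pixel farther from the sensor
    than the tracked head center."""
    return [[d > head_center_depth for d in row] for row in depth]
```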
E.3 3D Display Simulator Components
As described in section 3.7 and illustrated in figure 3.17, the 3D display simulator consists of two small
UI modules ("3D Simulation Entry Point" and "Head Tracker Configuration") and a larger
model-view-controller-based module ("Z-Tris").
Both UI module implementations and the "Z-Tris" M-V-C architectural units are briefly described
below.
E.3.1 Application Entry Point
When the application is initialized, the Program class launches the MainForm (shown in figure 3.20).
The main form is responsible for launching the configuration form, showing help or launching the
game form.
E.3.2 Head Tracker Configuration GUI
The ConfigurationForm class handles the communication with the HT3D library DLL. Through the user
interface (shown in figure 3.21), all available head-tracking tweaking options are exposed.
All user preferences are saved by the PreferencesHandler when the configuration form is closed, and
restored when the form is reopened. PreferencesHandler achieves this functionality by recursively
walking through the configuration form's component tree and storing/reading the values of check-boxes,
sliders and combo-boxes to/from a special XML file.
Finally, a DoubleBufferedPanel class is implemented to remove the flicker-on-repaint artifacts when
rendering the output from the head-tracking library (it extends the WinForms Panel component to
enable the double-buffering functionality).
E.3.3 3D Game (Z-Tris)
Figure E.4 shows the model-view-controller architectural grouping of the Z-Tris game classes. Each
of the M-V-C architectural units are discussed in more detail below.
E.3.3.1 Model
The main responsibility of the LogicHandler class is to maintain and update:
• the status of the pit (represented as a three-dimensional byte array),
• the status of the active (falling) polycube,
• the scores/line count/current level.
The status of the pit/active polycube is updated either on the user’s key press (notified by the
KeyboardHandler controller), or when the time for the current move expires (notified by the internal
timer).
At the end of the move, LogicHandler updates the score s using the following formula:
s ← s + line_count × line_score × f_line_count × f_level + b_empty_pit × line_score × f_level,    (E.2)
where f_line_count and f_level are multiplicative factors which increase with the number of cleared layers and
the level number (since the time allowance for each move decreases at higher levels), and
b_empty_pit is equal to 1 if the pit is empty and 0 otherwise.
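As a sketch of this scoring rule (the concrete multiplicative factors are illustrative placeholders; the text does not specify their exact values):

```python
def update_score(score, line_count, level, pit_empty,
                 line_score=100, f_line_count=None, f_level=None):
    """Score update of equation E.2. f_line_count and f_level are
    assumed to grow with the cleared-layer count and the level;
    the defaults below are placeholders, not the game's real factors."""
    f_lc = f_line_count if f_line_count is not None else line_count  # placeholder
    f_lv = f_level if f_level is not None else 1 + level             # placeholder
    score += line_count * line_score * f_lc * f_lv
    if pit_empty:
        score += line_score * f_lv      # bonus for clearing the pit
    return score
```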
After a move is finished, a random polycube (represented as a 3 × 3 byte array in the Polycube.Shapes
dictionary) is added to the pit if the pit is not already full; otherwise, the game status is changed to
LogicHandler.Status.GAME_OVER.
Both the model and the view are highly customizable (i.e. they can correctly process and render
different pit sizes, polycube shape sets, timing constraints, scoring systems and so on).
E.3.3.2 Controller
The KeyboardHandler class (full code listing given in appendix H.1) is responsible for interfacing between
the user and the game logic. It operates using the following protocol:
Figure E.4: UML 2.0 class diagram of the Z-Tris game, grouped into the model-view-controller architectural units.
Figure E.5: KeyboardHandler class event timing diagram. When the user presses and holds a key on the keyboard, the first OnKeyPress event is triggered immediately, the second is triggered after INITIAL_KEY_HOLD_DELAY_MS milliseconds and all the following events are triggered after REPEATED_KEY_HOLD_DELAY_MS milliseconds.
1. A keyboard key code is registered through a call to KeyboardHandler.RegisterKey(...) and
an event handler (callback function) is registered with KeyboardHandler.OnKeyPress event.
2. KeyboardHandler monitors the state of the keyboard and, when one of the registered keys
is pressed, notifies the appropriate OnKeyPress event subscriber(s).
3. If a key is not released, OnKeyPress events are repeatedly triggered according to the timing diagram
shown in figure E.5.
This class is capable of handling multiple key events simultaneously (as required for the control of the
game), multiple keyboard event subscribers and customization of timing constraints.
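The repeat schedule of figure E.5 can be sketched as a pure function of how long a key has been held (the delay constants here are illustrative values; the dissertation does not state them):

```python
INITIAL_KEY_HOLD_DELAY_MS = 500     # illustrative placeholder values,
REPEATED_KEY_HOLD_DELAY_MS = 100    # not the game's actual constants

def key_press_times(hold_ms):
    """Times (ms since key-down) at which OnKeyPress fires while a key
    is held for hold_ms milliseconds: immediately, after the initial
    delay, then at the repeat interval."""
    times = [0]
    t = INITIAL_KEY_HOLD_DELAY_MS
    while t <= hold_ms:
        times.append(t)
        t += REPEATED_KEY_HOLD_DELAY_MS
    return times
```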
E.3.3.3 View
The view component (in particular the RenderHandler class) is responsible for:
• rendering the static game state (the pit and the active polycube),
• rendering the active polycube animations (rotations and translations).
The animation of simultaneous rotations and translations of the active polycube is achieved by keeping
two vectors θ_r and θ_t which indicate the amount of rotation/translation animation remaining. At
each frame, the active polycube is
1. translated to the coordinate origin,
2. rotated in all three directions simultaneously by the fraction θ_r × timeFromPreviousRender / KeyboardHandler.REPEATED_KEY_HOLD_DELAY_MS,
3. translated by θ_t × timeFromPreviousRender / KeyboardHandler.REPEATED_KEY_HOLD_DELAY_MS, and
4. translated back to its original location.
A screenshot of simultaneous translations and rotations is shown in figure E.6.
The active polycube is also rendered as semi-transparent, so as not to occlude the playing field.
This is achieved by
Figure E.6: Screenshot of the Z-Tris game showing a simultaneous translation and rotation of the active polycube around the X- and Y-axes.
1. Rendering the active polycube as the last element of the scene with blending enabled,
2. Hiding the internal faces of individual cubes that make up the active polycube,
3. Culling the front faces and blending the remaining back faces of the polycube onto the scene,
4. Culling the back faces and blending the remaining front faces of the polycube onto the scene.
The rest of the rendering details are described in section 3.7.1.
Appendix F
HT3D Library Evaluation (Additional Details)
F.1 Evaluation Metrics
F.1.1 Sequence Track Detection Accuracy
The STDA measure (introduced by Manohar et al. [32]) evaluates the performance of an object tracker in
terms of the overall detection (number of objects detected, false alarms and missed detections), the
spatio-temporal accuracy of the detection (the proportion of the ground truth detected both in individual
frames and in the whole tracking sequence) and the spatio-temporal fragmentation.
The following notation (following the original paper) is used:
• G_i^(t) denotes the ith ground truth object in the tth frame,
• D_i^(t) denotes the ith detected object in the tth frame,
• N_G^(t) and N_D^(t) denote the number of ground truth/detected objects in the tth frame respectively,
• N_frames is the total number of ground truth frames in the sequence,
• N_mapped is the number of mapped ground truth and detected objects in the whole sequence, and
• N_{G_i ∪ D_i ≠ ∅} is the total number of frames in which either the ground truth object i, or the
detected object i (or both), are present.
Then the Track Detection Accuracy (TDA) measure for the ith object can be calculated as the spatial
overlap (i.e. the ratio of the spatial intersection and union) between the ground truth and the tracking
output of object i. More precisely, TDA can be defined as
TDA_i = Σ_{t=1}^{N_frames} |G_i^(t) ∩ D_i^(t)| / |G_i^(t) ∪ D_i^(t)|.    (F.1)
Observe that the TDA measure penalizes both false negatives (undetected ground truth area) and
false positives (detections that do not overlap any ground truth area).
To obtain the STDA measure, TDA is averaged for the best mapping of all objects in the sequence,
i.e.
STDA = Σ_{i=1}^{N_mapped} TDA_i / N_{G_i ∪ D_i ≠ ∅}
     = Σ_{i=1}^{N_mapped} [ Σ_{t=1}^{N_frames} |G_i^(t) ∩ D_i^(t)| / |G_i^(t) ∪ D_i^(t)| ] / N_{G_i ∪ D_i ≠ ∅}.    (F.2)
APPENDIX F. HT3D LIBRARY EVALUATION (ADDITIONAL DETAILS) 117
F.1.2 Multiple Object Tracking Accuracy/Precision
CLEAR (Classification of Events, Activities and Relationships) was an "international effort to evaluate
systems for the perception of people, their activities and interactions." The CLEAR evaluation workshops
[39] held in 2006 and 2007 introduced the Multiple Object Tracking Precision (MOTP) and Multiple
Object Tracking Accuracy (MOTA) metrics for 2D face tracking task evaluation [5].
The MOTP metric evaluates the total error in the estimated positions of ground truth/detection
pairs for the whole sequence, averaged over the total number of matches made. More precisely, MOTP
is defined as

MOTP = Σ_{i=1}^{N_mapped} TDA_i / Σ_{j=1}^{N_frames} N_mapped^(j),    (F.3)
where N_mapped^(j) is the number of mapped objects in the jth frame.
The MOTA metric is derived from three error ratios (the ratios of misses, false alarms and mismatches in the
sequence, computed over the number of objects present in all frames) and attempts to assess the
accuracy aspect of the system's performance. MOTA is defined as

MOTA = 1 - [ Σ_{i=1}^{N_frames} (c_M(FN_i) + c_FP(FP_i) + ln S) ] / [ Σ_{i=1}^{N_frames} N_G^(i) ],    (F.4)
where c_M(x) and c_FP(x) are the cost functions for missed detection and false alarm penalties, FN_i
and FP_i are the numbers of false negatives/false positives in the ith frame respectively, and S is the total
number of object ID switches for all objects.
In turn, false negative and false positive counts are defined as
FN_i = Σ_{j=1}^{N_mapped} 1{ |G_j^(i) \ D_j^(i)| / |G_j^(i)| > θ_FN },    (F.5)

FP_i = Σ_{j=1}^{N_mapped} 1{ |D_j^(i) \ G_j^(i)| / |D_j^(i)| > θ_FP },    (F.6)
where θ_FN and θ_FP are the false negative/false positive ratio thresholds (illustrated in figure
F.1).
F.1.3 Average Normalized Distance from the Head Center
For motion parallax simulation, accurately localizing the face center is more important than achieving
a high spatio-temporal overlap between the detected and tagged objects. To measure the HT3D colour,
depth and combined head-trackers in this regard, an Average Normalized Distance from Head Center
(δ) metric is constructed.
The ground truth head ellipse in frame i is described by its center location c_i and the endpoints of the
semi-major and semi-minor axes (points a_i and b_i respectively).
Figure F.1: False positive (false alarm) and false negative (miss) definitions for the MOTA metric. The blue ellipse indicates the detected head D_i, the red ellipse indicates the tagged ground truth G_i.
Let hi be the head center location in frame i, as predicted by the head tracker. Then the normalized
distance between the detected and tagged head centres δi can be calculated by transforming the ellipse
into a unit circle centred around the origin, and measuring the length of the transformed head center
vector (as shown in figure F.2).
Let φ_j be the angle between the major axis of the ellipse and the x-axis in the jth frame. Observe that
φ_j = cos^{-1}( ((a_j - c_j) · i) / |a_j - c_j| ), where i is the unit vector along the x-axis.
Then the average normalized distance from the tagged head center can be calculated as

δ = (1/N_frames) Σ_{i=1}^{N_frames} δ_i = (1/N_frames) Σ_{i=1}^{N_frames} || M_i (h_i - c_i)^T ||,    (F.7)

where the matrix

M_i = |  cos φ_i / |a_i - c_i|    sin φ_i / |a_i - c_i| |
      | -sin φ_i / |b_i - c_i|    cos φ_i / |b_i - c_i| |

maps the ground truth ellipse to a unit circle centred at the origin.
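The transformation in figure F.2 amounts to a rotation by -φ followed by scaling with the two semi-axis lengths; a sketch for one frame (points as (x, y) tuples; the function name is illustrative):

```python
import math

def delta_i(a, b, c, h):
    """Normalized distance for one frame: map the ground truth ellipse
    (center c, semi-axis endpoints a and b) to a unit circle at the
    origin, then measure the transformed head prediction h."""
    phi = math.atan2(a[1] - c[1], a[0] - c[0])   # ellipse orientation
    la = math.dist(a, c)                          # semi-major length
    lb = math.dist(b, c)                          # semi-minor length
    dx, dy = h[0] - c[0], h[1] - c[1]
    u = (math.cos(phi) * dx + math.sin(phi) * dy) / la
    v = (-math.sin(phi) * dx + math.cos(phi) * dy) / lb
    return math.hypot(u, v)
```

A prediction on the ellipse boundary thus maps to δ_i = 1, and a perfect prediction at the center to δ_i = 0, regardless of the ellipse's size or orientation.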
Figure F.2: δ-metric computation: a) the ith input frame is transformed to image b) so that the ground truth ellipse shown in red is mapped to a unit circle centred at the origin. Then the normalized distance metric δ_i is the length of the position vector given by the transformed head center prediction (shown in blue).
F.2 Evaluation Set
F.2.1 Viola-Jones Face Detector Output
Figure F.3: Output of the Viola-Jones face detector for all HT3D library evaluation recordings. False positive face detections are marked with red crosses.
F.2.2 δ Metric for Individual Recordings
Head tracking accuracy (δ metric evolution) for all evaluation set recordings is shown below.
Figure F.4: "Head rotation (roll)" recording (frames 74, 136, 219, 228, 232, 239, 244, 252, 258, 655, 731 and 761 shown). The marked red area indicates the output of the combined (depth and colour) head tracker.
Figure F.5: Head-tracking accuracy (δ metric) for “Head rotation (roll)” recording.
Figure F.6: "Head rotation (yaw)" recording (frames 63, 144, 315, 431, 453, 466, 488, 507, 531, 553, 659 and 827 shown).
Figure F.7: δ metric for “Head rotation (yaw)” recording.
Figure F.8: "Head rotation (pitch)" recording (frames 63, 81, 144, 167, 204, 228, 259, 264, 290, 353, 492 and 657 shown).
Figure F.9: δ metric for “Head rotation (pitch)” recording.
Figure F.10: "Head rotation (all)" recording (frames 48, 77, 98, 114, 183, 282, 308, 343, 550, 564, 612 and 636 shown).
Figure F.11: δ metric for “Head rotation (all)” recording.
Figure F.12: "Head translation (horizontal and vertical)" recording (frames 54, 71, 139, 166, 248, 346, 465, 531, 549, 591, 718 and 735 shown).
Figure F.13: δ metric for “Head translation (horizontal and vertical)” recording.
Figure F.14: "Head translation (anterior/posterior)" recording (frames 38, 56, 75, 91, 140, 212, 368, 395, 551, 659, 696 and 729 shown).
Figure F.15: δ metric for “Head translation (anterior/posterior)” recording.
Figure F.16: "Head translation (all)" recording (frames 57, 153, 199, 237, 299, 334, 374, 479, 563, 619, 743 and 783 shown).
Figure F.17: δ metric for “Head translation (all)” recording.
Frame 41 Frame 80 Frame 124 Frame 144
Frame 188 Frame 327 Frame 350 Frame 407
Frame 420 Frame 482 Frame 670 Frame 727
Figure F.18: “Head rotation and translation (all)” recording.
Figure F.19: δ metric for “Head rotation and translation (all)” recording.
Figure F.20: “Participant #1” recording (frames 64, 104, 152, 237, 354, 386, 591, 642, 668, 687, 743 and 812 shown).
Figure F.21: δ metric for “Participant #1” recording.
Figure F.22: “Participant #2” recording (frames 150, 180, 185, 218, 354, 449, 484, 498, 528, 542, 580 and 767 shown).
Figure F.23: δ metric for “Participant #2” recording.
Figure F.24: “Participant #3” recording (frames 50, 136, 219, 313, 382, 409, 496, 538, 618, 623, 654 and 776 shown).
Figure F.25: δ metric for “Participant #3” recording.
Figure F.26: “Participant #4” recording (frames 0, 199, 232, 259, 272, 416, 549, 603, 651, 722, 738 and 847 shown).
Figure F.27: δ metric for “Participant #4” recording.
Figure F.28: “Participant #5” recording (frames 70, 174, 240, 298, 363, 458, 531, 537, 621, 705, 751 and 822 shown).
Figure F.29: δ metric for “Participant #5” recording.
Figure F.30: “Illumination (low)” recording (frames 62, 126, 142, 253, 268, 377, 439, 466, 511, 529, 671 and 782 shown).
Figure F.31: δ metric for “Illumination (low)” recording.
Figure F.32: “Illumination (changing)” recording (frames 11, 70, 156, 242, 275, 361, 534, 551, 653, 728, 770 and 840 shown).
Figure F.33: δ metric for “Illumination (changing)” recording.
Figure F.34: “Illumination (high)” recording (frames 0, 110, 174, 326, 359, 392, 464, 607, 721, 767, 823 and 842 shown).
Figure F.35: δ metric for “Illumination (high)” recording.
Figure F.36: “Facial expressions” recording (frames 22, 129, 175, 196, 215, 247, 266, 402, 510, 690, 700 and 818 shown).
Figure F.37: δ metric for “Facial expressions” recording.
Figure F.38: “Cluttered background” recording (frames 11, 69, 103, 142, 199, 278, 403, 480, 617, 658, 705 and 741 shown).
Figure F.39: δ metric for “Cluttered background” recording.
Figure F.40: “Occlusions” recording (frames 50, 106, 163, 367, 379, 410, 416, 433, 564, 616, 620 and 731 shown).
Figure F.41: δ metric for “Occlusions” recording.
Figure F.42: “Multiple viewers” recording (frames 95, 182, 249, 315, 390, 426, 515, 567, 594, 618, 694 and 767 shown).
Figure F.43: δ metric for “Multiple viewers” recording.
F.2.3 MOTA/MOTP Evaluation Results
MOTA/MOTP metrics for all evaluation recordings are summarized in figure F.44. As with the STDA metric, the depth-based and combined trackers outperform the colour-based head tracker, but fall short of the inter-annotator agreement.
Figure F.44: MOTA/MOTP metrics for all evaluation recordings, evaluated using the default tracker settings given in table 4.6. Higher values indicate better performance.
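For reference, the CLEAR-MOT metrics reported above can be computed as follows. This is a minimal Python sketch of the standard definitions (the evaluation itself was performed with the project's C# tooling), and the per-frame input format is an assumption for illustration only; note that this sketch uses a distance-based MOTP, where lower is better:

```python
def mota_motp(frames):
    """CLEAR-MOT metrics from per-frame matching results.

    Each frame is a dict with:
      "matches"         - list of distances between matched tracker/annotation pairs,
      "misses"          - annotated heads with no tracker hypothesis,
      "false_positives" - tracker hypotheses matching no annotation,
      "mismatches"      - identity switches,
      "objects"         - number of annotated heads in the frame.
    """
    matched, distance = 0, 0.0
    misses = false_positives = mismatches = objects = 0
    for f in frames:
        matched += len(f["matches"])
        distance += sum(f["matches"])
        misses += f["misses"]
        false_positives += f["false_positives"]
        mismatches += f["mismatches"]
        objects += f["objects"]
    # MOTA: one minus the ratio of all errors to the number of ground-truth objects
    mota = 1.0 - (misses + false_positives + mismatches) / objects
    # MOTP: average distance over all matched tracker/annotation pairs
    motp = distance / matched if matched else float("nan")
    return mota, motp
```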
Appendix G
3D Display Simulator (Z-Tris) Evaluation
The main intention behind the Z-Tris implementation was to provide a proof-of-concept 3D application that strengthens depth perception using continuous motion parallax (obtained by changing the perspective projection based on the viewer's head position). To verify the operation of this proof-of-concept, a combination of automated and manual tests was used. The performance of the 3D display simulator was also measured, to ensure that real-time rendering rates can be achieved while simulating all the depth cues mentioned earlier.
G.1 Automated Testing
The unit testing framework provided by the Microsoft Visual Studio IDE was used to author and run unit and “smoke” tests. A sample run is shown in figure 4.26.
As summarized in table G.1, most of the code (around 85.25%) in the Z-Tris core (the main classes from figure E.4) was covered by automated testing.
Automated “smoke” tests were also used as part of regression testing, to keep the application in a working state throughout the development iterations.
Class name           Code coverage (% blocks)   Covered (blocks)   Not covered (blocks)
RenderHandler        85.38%                     543                93
LogicHandler         84.25%                     385                72
SpriteHandler        67.27%                     74                 36
PreferencesHandler   93.94%                     93                 6
KeyboardHandler      100.00%                    58                 0
Polycube             100.00%                    45                 0
DisplayUtilities     83.33%                     10                 2
Total:               85.25%                     1,208              209
Table G.1: Z-Tris core unit test code coverage.
APPENDIX G. 3D DISPLAY SIMULATOR (Z-TRIS) EVALUATION 143
Figure G.1: Z-Tris off-axis projection and head-tracking manual integration testing: the same scene is rendered with the viewer's head positioned at the a) left, b) right, c) top and d) bottom bevels of the display.
G.2 Manual Testing
Manual testing included hours of functional testing, both to evaluate the requirements given in section 2.3 and to perform basic usability and sanity checks.
A significant amount of time was also spent on manual system integration testing. Figure G.1 shows one example integration test scenario, where the same scene is rendered from four different viewpoints based on the viewer's head position, to test the integration of head tracking and off-axis projection rendering.
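For reference, the asymmetric view frustum underlying such off-axis projection can be sketched as below. The renderer itself is written in C# with OpenTK; this is an illustrative Python transcription of the standard construction for a screen centred at the origin in the z = 0 plane, and the parameter names are assumptions:

```python
def off_axis_frustum(eye, screen_w, screen_h, near):
    """Frustum bounds (left, right, bottom, top) at the near plane for a
    viewer at eye = (ex, ey, ez), ez > 0, looking at a screen of size
    screen_w x screen_h centred at the origin in the z = 0 plane.
    The results are what one would pass to e.g. glFrustum, followed by a
    translation of the scene by (-ex, -ey, -ez)."""
    ex, ey, ez = eye
    scale = near / ez  # similar triangles: project the screen edges onto the near plane
    left = (-screen_w / 2.0 - ex) * scale
    right = (screen_w / 2.0 - ex) * scale
    bottom = (-screen_h / 2.0 - ey) * scale
    top = (screen_h / 2.0 - ey) * scale
    return left, right, bottom, top
```

With the head centred, the frustum is symmetric; as the head moves towards one bevel of the display, the frustum skews towards the opposite side, which is what produces the four distinct views in figure G.1.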
G.3 Performance
After integrating the 3D display rendering subsystem and the HT3D head-tracking library, the overall system's run-time performance was measured. The system was built in 64-bit “Release” mode, with no debug information and with optimizations enabled. The final setup of the integrated system is shown in figure 3.19.
The overall system's performance was measured on the main development machine, running a 64-bit Windows 7 OS on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz.
As expected, the average Z-Tris game rendering speed (with the combined colour and depth head tracker enabled) was 29.969 frames-per-second (with a standard deviation of 5.167 frames), i.e. the real-time requirements were satisfied. A single CPU core experienced an average load of 64.98% (minimum 17.13%, maximum 88.01%) over 5 minutes of game play, indicating that further processing resources were still available.
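The frame-rate statistics above can be derived from per-frame render times in the usual way; a minimal Python sketch (the actual measurements were taken inside the C# application, e.g. via a high-resolution timer sampled around each rendered frame):

```python
def fps_stats(frame_times):
    """Mean and (population) standard deviation of instantaneous FPS,
    given a list of per-frame render durations in seconds."""
    samples = [1.0 / t for t in frame_times]
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, variance ** 0.5
```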
Appendix H
Sample Code Listings
Listing H.1: KeyboardHandler code.

using System;
using System.Collections.Generic;
using System.Text;
using OpenTK.Input;
namespace ZTris
{
/// <summary>
/// <para>
/// Keyboard handler, responsible for interfacing between the user and the game
/// logic.
/// </para>
/// <para>
/// This class is capable of handling multiple key events simultaneously (as
/// required for the control of the game), multiple keyboard event subscribers
/// and customization of timing constraints.
/// </para>
/// <para>
/// It operates using the following protocol:
/// <list type="bullet">
/// <item>
/// A keyboard key code and the caller are registered through a call to
/// <see cref="RegisterKey(Key key, object sender)">, and an event
/// handler (callback function) is registered through the
/// <see cref="OnKeyPress"> event.
/// </item>
/// <item>
/// <see cref="KeyboardHandler"> monitors the state of the keyboard and,
/// when one of the registered keys is pressed, notifies all
/// <see cref="OnKeyPress"> event subscribers.
/// </item>
/// <item>
/// If a key is not released, it repeatedly triggers <see cref="OnKeyPress">
/// events according to the <see cref="INITIAL_KEY_HOLD_DELAY_MS"> and
/// <see cref="REPEATED_KEY_HOLD_DELAY_MS"> timings.
/// </item>
/// </list>
/// </para>
/// </summary>
public class KeyboardHandler
{
#region Internal classes
/// <summary>
/// Internal mutable key state representation.
/// </summary>
private class KeyState
{
    public DateTime LastPressTime;
    public bool IsRepeated;
    public bool IsFirst;
}
#endregion
#region Constants
/// <summary>Represents the initial key hold delay until the second key event is
/// triggered.</summary>
public const int INITIAL_KEY_HOLD_DELAY_MS = 400;

/// <summary>Represents the key hold delay until the third (and all subsequent)
/// key events are triggered.</summary>
public const int REPEATED_KEY_HOLD_DELAY_MS = 180;
#endregion
#region Private fields
/// <summary>Interface to the keyboard device.</summary>
private IKeyboardDevice _keyboard = null;

/// <summary>Maps keyboard keys to their states.</summary>
private Dictionary<Key, KeyState> _pressedKeys = new Dictionary<Key, KeyState>();

/// <summary>Maps registered keyboard keys to their subscribers.</summary>
private Dictionary<Key, List<object>> _registeredKeys =
    new Dictionary<Key, List<object>>();
#endregion
#region Public fields
/// <summary>Key press event handler type.</summary>
/// <param name="key">Keyboard key that triggered the event.</param>
public delegate void KeyEventHandler(Key key);

/// <summary>Key press event handler.</summary>
public event KeyEventHandler OnKeyPress;
#endregion
#region Constructors
/// <summary>Default keyboard handler constructor.</summary>
/// <param name="keyboard">Keyboard interface.</param>
public KeyboardHandler(IKeyboardDevice keyboard)
{
    _keyboard = keyboard;
}
#endregion
#region Public methods
/// <summary>
/// A method to register a subscriber's interest in a particular key press.
/// Typically this method would be called as <c>RegisterKey(..., this)</c>.
/// </summary>
/// <param name="key">Keyboard key to register.</param>
/// <param name="subscriber">Handle to the subscriber.</param>
public void RegisterKey(Key key, object subscriber)
{
    // Guard against double registration of the same key (e.g. by a second
    // subscriber), which would otherwise make Dictionary.Add throw
    if (!_pressedKeys.ContainsKey(key))
    {
        _pressedKeys.Add(key, new KeyState()
        {
            LastPressTime = DateTime.Now,
            IsRepeated = false,
            IsFirst = true
        });
    }
    if (!_registeredKeys.ContainsKey(key))
    {
        _registeredKeys.Add(key, new List<object>());
    }
    _registeredKeys[key].Add(subscriber);
}
/// <summary>
/// A method to register a subscriber's interest in particular key presses.
/// Typically this method would be called as <c>RegisterKeys(..., this)</c>.
/// </summary>
/// <param name="keys">Keyboard keys to register.</param>
/// <param name="subscriber">Handle to the subscriber.</param>
public void RegisterKeys(Key[] keys, object subscriber)
{
    foreach (Key key in keys)
    {
        this.RegisterKey(key, subscriber);
    }
}
/// <summary>
/// Main event processing loop where the subscribed key events are triggered.
/// </summary>
public void UpdateStatus()
{
    // Record the key press time before processing
    DateTime keyPressTime = DateTime.Now;

    // Check the status of each registered key
    foreach (Key key in _registeredKeys.Keys)
    {
        KeyState pressedKeyState = _pressedKeys[key];
        if (_keyboard[key])
        {
            bool triggerKeyPressEvent = false;

            // If the key is pressed for the first time, trigger the event immediately
            if (pressedKeyState.IsFirst)
            {
                triggerKeyPressEvent = true;
                pressedKeyState.IsFirst = false;
            }
            // If the key is held, trigger the events according to the timing constraints
            else
            {
                double timeSinceLastPressInMs =
                    keyPressTime.Subtract(pressedKeyState.LastPressTime).TotalMilliseconds;
                triggerKeyPressEvent = (pressedKeyState.IsRepeated ?
                    (timeSinceLastPressInMs > REPEATED_KEY_HOLD_DELAY_MS) :
                    (timeSinceLastPressInMs > INITIAL_KEY_HOLD_DELAY_MS));

                // Update the key state
                if (triggerKeyPressEvent)
                {
                    pressedKeyState.IsRepeated = true;
                }
            }

            if (triggerKeyPressEvent)
            {
                // Record the last press time
                pressedKeyState.LastPressTime = keyPressTime;
                // Trigger the subscriber event handlers
                this.CallbackSubscribers(key);
            }
        }
        else
        {
            pressedKeyState.IsRepeated = false;
            pressedKeyState.IsFirst = true;
        }
    }
}
#endregion
#region Private methods
/// <summary>
/// Calls back the event handlers of subscribers to a particular key press.
/// </summary>
/// <param name="key">Keyboard key that was pressed.</param>
private void CallbackSubscribers(Key key)
{
    if (OnKeyPress == null)
    {
        return;
    }
    foreach (Delegate eventCallback in OnKeyPress.GetInvocationList())
    {
        if (_registeredKeys[key].Contains(eventCallback.Target))
        {
            eventCallback.DynamicInvoke(key);
        }
    }
}
#endregion
}
}
Listing H.2: IKeyboardDevice interface.

using System;
using OpenTK.Input;
namespace ZTris
{
/// <summary>
/// Keyboard device interface, responsible for providing the keyboard status.
/// </summary>
public interface IKeyboardDevice
{
    /// <summary>
    /// An indexer returning the status of a particular key.
    /// </summary>
    /// <param name="key">Keyboard key of interest.</param>
    /// <returns>
    /// Status of <see cref="key"/>:
    /// <list type="table">
    /// <item>
    /// <term>true</term>
    /// <description><see cref="key"/> is pressed.</description>
    /// </item>
    /// <item>
    /// <term>false</term>
    /// <description><see cref="key"/> is released.</description>
    /// </item>
    /// </list>
    /// </returns>
    bool this[Key key] { get; }
}
}
MEASURING HEAD DETECTION AND TRACKING SYSTEM ACCURACY
EXPERIMENT CONSENT FORM
EXPERIMENT PURPOSE
This experiment is part of the Computer Science Tripos Part II project evaluation. The project in question involves using a Microsoft Kinect sensor to track the viewer's head position in space. The main purpose of the experiment is to ensure that the face detector/tracker is robust and works for different viewers.
EXPERIMENT PROCEDURE
The experiment consists of recording two colour-and-depth videos (each 30 seconds long) of the participant moving his/her head in a free-form manner.
A possible range of head/face motions that can be performed includes (but is not limited to):
- Head rotation (yaw/pitch/roll)
- Head translation (horizontal/vertical, anterior/posterior)
- Facial expressions, e.g. joy, surprise, fear, anger, disgust or sadness
CONFIDENTIALITY
The following data will be stored: two (2) colour and depth recordings (each 30 seconds long).
No other personal data will be retained. Recorded videos will be kept in accordance with the Data Protection Act and destroyed after the submission of the dissertation.
FINDING OUT ABOUT RESULT
If interested, you can find out the result of the study by contacting Manfredas Zabarauskas, after 18/05/2012. His phone number is 0754 195 8411 and his email address is [email protected].
PLEASE NOTE THAT:
- YOU HAVE THE RIGHT TO STOP PARTICIPATING IN THE EXPERIMENT AT ANY
TIME, WITHOUT GIVING A REASON.
- YOU HAVE THE RIGHT TO OBTAIN FURTHER INFORMATION ABOUT THE
PURPOSE AND THE OUTCOMES OF THE EXPERIMENT.
- NONE OF THE TASKS IS A TEST OF YOUR PERSONAL ABILITY. THE OBJECTIVE
IS TO TEST THE ACCURACY OF THE IMPLEMENTED HEAD TRACKING SYSTEM.
RECORD OF CONSENT
Your signature below indicates that you have understood the information about the “Measuring Head Detection and Tracking System Accuracy” experiment and consent to your participation. The participation is voluntary and you may refuse to answer certain questions on the questionnaire and withdraw from the study at any time with no penalty. This does not waive your legal rights. You should have received a copy of the consent form for your own record. If you have further questions related to this research, please contact the researcher.
Participant (Name, Signature): Date (dd/mm/yy): __________________________________________________________________________
__________________________________
Researcher (Name, Signature): Date (dd/mm/yy): __________________________________________________________________________
__________________________________
MEASURING HEAD DETECTION AND TRACKING SYSTEM ACCURACY
VIDEO AND DEPTH RECORDING RELEASE FORM
RELEASE STATEMENT
I HEREBY ASSIGN AND GRANT TO MANFREDAS ZABARAUSKAS THE RIGHT AND
PERMISSION TO USE AND PUBLISH (PARTIALLY OR IN FULL) THE VIDEO AND/OR
DEPTH RECORDINGS MADE DURING THE “MEASURING HEAD DETECTION AND
TRACKING SYSTEM ACCURACY” EXPERIMENT, AND I HEREBY RELEASE
MANFREDAS ZABARAUSKAS FROM ANY AND ALL LIABILITY FROM SUCH USE AND
PUBLICATION.
Participant (Name, Signature): Date (dd/mm/yy): __________________________________________________________________________
_________________________
Researcher (Name, Signature): Date (dd/mm/yy): __________________________________________________________________________
_________________________
Appendix I
Project Proposal
Computer Science Tripos Part II Project Proposal
3D Display Simulation Using Head Tracking with Microsoft Kinect
M. Zabarauskas, Wolfson College (mz297)
Originator: M. Zabarauskas
20 October 2011
Project Supervisor: Prof N. Dodgson
Signature:
Director of Studies: Dr C. Town
Signature:
Project Overseers: Dr S. Clark & Prof J. Crowcroft
Signatures:
APPENDIX I. PROJECT PROPOSAL 154
Introduction
Reliable real-time human face detection and tracking has been one of the most interesting problems in the field of computer vision for the past few decades. The emergence of the cheap and ubiquitous Microsoft Kinect sensor, containing an IR depth camera, provides new opportunities to enhance the reliability and speed of face detection and tracking. Moreover, the ability to use the depth information to track the user's head in 3D space opens up a lot of potential for new immersive user interfaces.
In my project I want to implement widely recognized, industry-standard face detection and tracking methods: the Viola-Jones object detection framework and the CAMShift (Continuously Adaptive Mean Shift) face tracker, based on the ideas presented by the authors in their original papers. Having achieved that, I want to explore the opportunities of using current state-of-the-art methods to integrate the depth information into face detection and tracking algorithms, in order to increase their speed and accuracy. As the next and final part of the project, I want to employ the depth information provided by Kinect to obtain an accurate 3D location of the viewer with respect to the display. Knowing the viewer's head coordinates in 3D will allow me to simulate the parallax motion that occurs between visually overlapping near and far objects in a 3D scene when the user's viewpoint changes, mimicking a three-dimensional display viewing experience.
Method Descriptions
The Viola-Jones face detector mentioned above is a breakthrough method for face detection, proposed by Viola and Jones [43] in 2001. They described a family of extremely simple classifiers (called “rectangle features”, reminiscent of Haar wavelets) and a representation of a grayscale image (called the “integral image”) with which these Haar-like features can be calculated in constant time. Then, using a classifier boosting algorithm based on AdaBoost, a number of the most effective features can be extracted and combined to yield an efficient “strong” classifier with an extremely low false negative rate but a high false positive rate. Finally, they proposed a method to arrange strong classifiers into a linear cascade which quickly discards non-face regions, focusing on likely face regions in the image to decrease the false positive rate.
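The “integral image” idea can be sketched in a few lines. This is an illustrative Python version (the project itself implements it in C#); after the single-pass table construction, any rectangle sum costs only four lookups:

```python
def integral_image(img):
    """img: 2D list of grayscale values. Returns the summed-area table ii,
    where ii[y][x] = sum of img over rows 0..y and columns 0..x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the image over the inclusive rectangle [x0..x1] x [y0..y1],
    computed in constant time from four table entries."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total
```

A rectangle feature is then just a signed combination of a few such `rect_sum` calls, which is what makes evaluating thousands of features per window feasible.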
After the face has been localized in the image, it can be efficiently tracked using the colour distribution of the face. CAMShift (Continuously Adaptive Mean Shift) was first proposed by Gary Bradski [8] at Intel in 1998. In this method, a hue histogram of the face being tracked is used to derive the “face probability distribution”, where the most frequently occurring colour is assigned probability 1.0 and the probabilities of other colours are computed from their frequency relative to the most frequent colour. Then, given a new search window, the “mean shift” algorithm is used (with a simple step function as the kernel) to converge to the probability centroid of the face colour probability distribution. The size of the search window is then adjusted as a function of the zeroth moment, and the repositioning/resizing is repeated until the result changes by less than a fixed threshold.
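The core of this tracker — histogram backprojection followed by iterative mean-shift window repositioning — can be sketched as follows (illustrative Python; full CAMShift additionally rescales the window as a function of the zeroth moment m00, which is omitted here):

```python
def backproject(hue_img, hist):
    """Map each pixel's hue to its relative frequency in the face histogram;
    the most frequent hue gets probability 1.0."""
    peak = max(hist.values())
    return [[hist.get(h, 0) / peak for h in row] for row in hue_img]

def mean_shift(prob, x, y, w, h, max_iter=10):
    """Repeatedly shift the w x h search window at (x, y) to the centroid
    of the probability mass it covers."""
    for _ in range(max_iter):
        m00 = m10 = m01 = 0.0
        for j in range(y, min(y + h, len(prob))):
            for i in range(x, min(x + w, len(prob[0]))):
                p = prob[j][i]
                m00 += p          # zeroth moment (total probability mass)
                m10 += i * p      # first moments give the centroid
                m01 += j * p
        if m00 == 0:
            break
        nx = max(int(round(m10 / m00)) - w // 2, 0)
        ny = max(int(round(m01 / m00)) - h // 2, 0)
        if (nx, ny) == (x, y):    # converged: window no longer moves
            break
        x, y = nx, ny
    return x, y
```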
However, these colour-based face detection and tracking methods encounter difficulties when the face orientation does not match the training orientations (e.g. when the user is facing away from the camera), when the background is visually cluttered, and so on.
Burgin et al. [10] suggested a few simple ways in which the depth information could be used to improve face detection. For example, given a certain distance from the camera, the realistic range of human head sizes in pixels can be calculated. This can then be used to reject certain window sizes, improving on the exhaustive search for faces in the entire image in the Viola-Jones algorithm. Similarly, they suggested that distance thresholding could also be used to improve face detection efficiency, since far-away points are likely to be blurry or to contain too few pixels for reliable face detection.
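The window-size rejection follows from the pinhole camera model: a head of physical width W at depth z projects to roughly f·W/z pixels. A sketch of the idea, where the focal length and the plausible head-width range are illustrative assumptions rather than values from Burgin et al.:

```python
FOCAL_LENGTH_PX = 525.0             # assumed colour-camera focal length, in pixels
HEAD_WIDTH_RANGE_M = (0.12, 0.20)   # assumed plausible human head widths, in metres

def plausible_window(window_px, depth_m):
    """Accept a detector window size only if a head at this depth could
    plausibly project to that many pixels."""
    lo = FOCAL_LENGTH_PX * HEAD_WIDTH_RANGE_M[0] / depth_m
    hi = FOCAL_LENGTH_PX * HEAD_WIDTH_RANGE_M[1] / depth_m
    return lo <= window_px <= hi
```

Running this check before the Viola-Jones cascade prunes whole scales of the exhaustive search at negligible cost.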
On a similar note, Xia et al. [31] described a 3D model fitting algorithm for head tracking. Their algorithm scales a hemisphere model to the head size estimated from the depth values at the location possibly containing a head (using an equation regressed from empirical head size/depth measurements). It then attempts to minimize the square error between the possible head region and the hemisphere template. Since this approach uses generalized head depth characteristics (front, side and back views, as well as higher and lower views of the head, all approximate a hemisphere), it is view-invariant. When combined with the CAMShift face tracker, the 3D model fitting approach should enhance the reliability of the overall tracking even when the person turns to look away for a few seconds.
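The hemisphere-fitting idea can be sketched as follows (illustrative Python; the grid sampling and the mean-squared-error measure are simplified assumptions, not Xia et al.'s exact formulation):

```python
import math

def hemisphere_template(size, radius):
    """Depth offsets of a hemisphere of the given radius, sampled on a
    size x size grid: 0 at the apex, growing towards the rim; cells outside
    the hemisphere footprint are None."""
    c = (size - 1) / 2.0
    cell = radius / c  # model units per grid cell
    tmpl = []
    for j in range(size):
        row = []
        for i in range(size):
            d2 = ((i - c) ** 2 + (j - c) ** 2) * cell * cell
            if d2 <= radius * radius:
                row.append(radius - math.sqrt(radius * radius - d2))
            else:
                row.append(None)
        tmpl.append(row)
    return tmpl

def fit_error(depth_patch, tmpl):
    """Mean squared error between a relative-depth patch and the template,
    over the cells where the template is defined."""
    err, n = 0.0, 0
    for row_d, row_t in zip(depth_patch, tmpl):
        for d, t in zip(row_d, row_t):
            if t is not None:
                err += (d - t) ** 2
                n += 1
    return err / n
```

A candidate head region with a low `fit_error` against the appropriately scaled template is accepted as a head, regardless of which way the head is facing.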
These improved ways of face detection and tracking, combined with the depth information provided
by Kinect can be employed to obtain the accurate 3D location of the viewer with respect to the
display. The location can then be used to simulate the parallax motion (near objects moving faster
in relation to far objects), evoking a visual sense of depth as perceived in real three-dimensional
environments.
Resources Required
- Hardware
  – Microsoft Kinect sensor. Acquired.
  – Development PC. Acquired: hyperthreaded dual-core Intel i5-2410M running at 2.90 GHz, 8 GB RAM, 250 GB HDD.
  – Primary back-up storage: 0.5 GB space on PWF for program and dissertation sources only. Acquired.
  – Secondary back-up storage: 16 GB USB flash drive for source code/dissertation/built snapshots. Acquired.
- Software
  – Development: Microsoft Visual Studio 2010, Kinect SDK, OpenTK (OpenGL wrapper for C#), Math.NET (open-source mathematical library for C#). All installed.
  – Back-up: Subversion version control. Installed both on the local machine and on PWF.
- Training data
  – Face/non-face training images for Viola-Jones. Acquired 4916 face images and 7960 non-face images from Robert Pless' website [11].
Starting Point
- Basic knowledge of C#,
- Minimal familiarity with OpenGL,
- Nearly no knowledge of computer vision.
Substance and Structure of the Project
As discussed in the introduction, the substance of the project can be split into the following stages:
- to implement the industry-standard colour-based face detection and tracking algorithms (viz. Viola-Jones and CAMShift),
- to extend these algorithms using the depth information provided by Microsoft Kinect's IR depth-sensing camera,
- to simulate the parallax motion effect using the calculated head movements in 3D, creating a 3D display effect.
Viola-Jones Face Detector
As described in the introduction, the main task will be to implement the AdaBoost algorithm, which will combine Haar-like weak classifiers into a strong classifier. These strong classifiers will be connected into a classifier cascade, such that early stages reject image locations that are unlikely to contain faces. It is crucial to implement this stage early, since classifier training can take days/weeks.
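The boosting stage can be sketched as standard discrete AdaBoost (illustrative Python; Viola-Jones uses a slight variant of this with threshold-on-feature weak learners, and the classifiers here are placeholder callables returning ±1):

```python
import math

def adaboost(xs, ys, weak_classifiers, rounds):
    """Combine weak classifiers (callables x -> +1/-1) into a strong one.
    ys are +1 (face) / -1 (non-face) labels."""
    n = len(xs)
    w = [1.0 / n] * n                  # uniform initial sample weights
    ensemble = []                      # chosen (alpha, classifier) pairs
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error
        def weighted_error(h):
            return sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
        best = min(weak_classifiers, key=weighted_error)
        err = max(weighted_error(best), 1e-10)       # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)    # vote weight of this round
        ensemble.append((alpha, best))
        # Re-weight: emphasize misclassified samples, then normalize
        w = [wi * math.exp(-alpha * y * best(x)) for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```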
CAMShift Face Tracker
The main tasks for the face tracker will be to implement the “histogram backprojection” method and the “mean shift” algorithm.
Depth Cue Integration for Viola-Jones Detector
Since the suggestions in the Burgin et al. [10] paper are relatively straightforward (e.g. distance thresholding), the main task will simply be to implement the elimination of unnecessary image regions before launching the Viola-Jones detector.
Tracker Extension Using Depth Cues
Based on the approach of Xia et al. [31], 3D hemisphere fitting will have to be implemented. However, additional work will be required to ensure that when the colour-based CAMShift tracker loses the face, the depth-based tracker reliably takes over, and vice versa.
3D Display Simulation Using Parallax Motion
Having obtained the head location in pixel (and depth) coordinates during the stages above, the head's location in 3D can be calculated using publicly available conversion equations (derived by measuring the focal distances, distortion coefficients and other parameters of both the depth and RGB cameras).
To simulate the effect of parallax motion, a simple OpenGL scene will be created and the scene's viewpoint will be set to follow the head's motion in 3D.
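This conversion follows the pinhole camera model. A minimal Python sketch; the intrinsic constants below are commonly quoted approximate Kinect depth-camera calibration values, used here purely as illustrative assumptions (in practice they come from per-device calibration):

```python
# Approximate Kinect depth-camera intrinsics (illustrative assumptions)
FX, FY = 594.2, 591.0   # focal lengths, in pixels
CX, CY = 339.3, 242.7   # principal point, in pixels

def pixel_to_3d(u, v, depth_m):
    """Back-project pixel (u, v) with metric depth into camera-space metres."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return x, y, depth_m
```

A pixel at the principal point maps straight down the optical axis; moving one focal length's worth of pixels off-centre at depth z yields a lateral offset of z metres.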
Success Criteria
For the project to be deemed a success, the following items have to be completed:
1. Viola-Jones face detector,
2. CAMShift tracker,
3. Viola-Jones detector extensions using depth cues,
4. 3D hemisphere-fitting tracker,
5. OpenGL program, simulating the parallax motion effect.
Furthermore, the implemented items should achieve performance comparable to that reported in the papers describing these methods.
Evaluation Criteria
The face detector and trackers can be quantitatively evaluated on their speed and ROC (receiver operating characteristic) curves, i.e. the rate of correct detections versus the false positive rate, as well as on precision (TP/(TP + FP)), recall (TP/(TP + FN)), accuracy ((TP + TN)/(TP + TN + FP + FN)) and other metrics.
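These metrics follow directly from the confusion-matrix counts; a minimal Python sketch:

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from true/false positive/negative counts."""
    precision = tp / (tp + fp)                  # fraction of detections that are faces
    recall = tp / (tp + fn)                     # fraction of faces that are detected
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of all decisions that are correct
    return precision, recall, accuracy
```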
Similarly, their robustness against different head orientations (tilt, rotation), distances to the camera, speeds of movement, global illumination conditions, etc. can be quantitatively measured.
Then the relative performance and accuracy gain/loss obtained by adding the depth cues to the face detector/tracker can be determined.
Finally, the accuracy of head location tracking in 3D (with respect to head translation along the X, Y and Z axes) can be assessed.
Possible Extensions
Given enough time, the system could be extended to deal with multiple people. This would involve only minor changes to the Viola-Jones detector, but should be more challenging for the trackers. Both the colour- and depth-based trackers would then need to deal with partial/full occlusion and object tagging (i.e. if person A passes behind person B, the tracker should not treat person A as a new person in the image, and should not confuse A and B); the depth-based tracker should have more potential for disambiguating these situations.
After implementing the extension above, the OpenGL scene could be trivially segmented so that each viewer would see her own 3D segment of the display.
Work Plan
The work will be split into 16 two-week packages, as detailed below:
07/10/11 - 20/10/11
Gain a better understanding of the face detection and tracking methods described above. Set up the development and back-up environment. Obtain the colour and depth input streams from Kinect. Write the project proposal.
Milestones: SVN set up on PWF. Written a small C# test project for Microsoft Visual Studio 2010 and the Kinect SDK that fetches the colour and depth input streams from the device and renders them on screen. Project proposal written and handed in.
21/10/11 - 03/11/11
Fully understand the Viola-Jones face detector and start implementing it. Add additional face images
to the training set if required.
Milestones: clear understanding of the Viola-Jones face detector. Pieces of working implementation.
04/11/11 - 17/11/11
Finish implementing the Viola-Jones algorithm and start the training. Start reading about the
CAMShift algorithm.
Milestones: implementation of Viola-Jones face detector.
18/11/11 - 01/12/11
Fully understand and implement the CAMShift face tracker. Integrate it with the Viola-Jones face detector as the next stage (invoked once a face is detected).
Milestones: implementation of CAMShift tracker, integrated into the system.
02/12/11 - 15/12/11
Add depth cues to the Viola-Jones face detector, start reading about the 3D hemisphere-fitting
tracker.
Milestones: depth cues added to the Viola-Jones detector. Clear understanding of 3D hemisphere-
fitting tracker.
16/12/11 - 29/12/11
Implement the 3D hemisphere-fitting tracker and integrate it into the system, so that it starts tracking in parallel with the CAMShift algorithm when Viola-Jones detects a face in the image. Start reading about the parallax motion simulation.
Milestones: implementation of 3D hemisphere-fitting tracker, integrated into the system. Clear
understanding of how the parallax motion could be simulated knowing the head’s position.
30/12/11 - 12/01/12
Prepare the presentation for the progress meeting in January. Write progress report. Slack time in case
any of the face detector/face trackers/progress report/progress presentation are not finished.
Milestones: presentation for the progress meeting and a progress report.
13/01/12 - 26/01/12
Fully understand how the head's pixel and depth coordinates can be converted into its location in 3D space. Research further how the parallax motion can be simulated from the head location in 3D. Start implementing an OpenGL scene which could be used to display the parallax motion effect.
Milestones: basic implementation of an OpenGL scene.
27/01/12 - 09/02/12
Finish implementing an OpenGL scene. Slack time for any unfinished implementation details.
Milestones: finished implementation of an OpenGL scene. At this stage the overall system should
be functional, i.e. it should combine the output from the face detector and face trackers to obtain the
head’s location in 3D and use it to simulate parallax motion on the display.
10/02/12 - 23/02/12
Start writing the dissertation. Come up with a structure, including sections, subheadings and short bullet points to be covered in each section.
Milestones: basic structure of the dissertation.
24/02/12 - 09/03/12
Write the “Introduction” and “Preparation” sections. Get feedback from the supervisor/DoS.
Milestones: complete “Introduction” and “Preparation” sections.
10/03/12 - 23/03/12
Milestones: feedback from the supervisor/DoS regarding the “Introduction” and “Preparation” sections incorporated; “Implementation” section written and sent to the supervisor/DoS for feedback.
24/03/12 - 06/04/12
Incorporate the feedback from the supervisor/DoS regarding the “Implementation” section. Gather
the numerical data for “Evaluation” section. Slack time for finishing “Introduction”, “Preparation”
and “Implementation” sections.
Milestones: finished “Introduction”, “Preparation” and “Implementation” sections. Gathered data
for “Evaluation” section.
07/04/12 - 20/04/12
Write the “Evaluation” section and send it for feedback to DoS/supervisor.
Milestones: finished “Evaluation” section.
21/04/12 - 04/05/12
Incorporate the feedback for “Evaluation” section and finish a draft dissertation. Send it for final
feedback to supervisor/DoS.
Milestones: finished draft dissertation.
05/05/12 - 18/05/12
Incorporate final feedback from supervisor/DoS and get the final version approved.
Milestones: dissertation is finished, approved, bound and handed in before 18/05/2012.