Manfredas Zabarauskas
3D Display Simulation Using Head-Tracking with Microsoft Kinect
Computer Science Tripos, Part II
University of Cambridge
Wolfson College
May 14, 2012
Proforma
Name: Manfredas Zabarauskas
College: Wolfson College
Project Title: 3D Display Simulation Using Head Tracking
with Microsoft Kinect
Examination: Part II in Computer Science, June 2012
Word Count: 11,976¹
Project Originator: M. Zabarauskas
Supervisor: Prof N. Dodgson
Original Aims of the Project
The main project aim was to simulate depth perception using motion parallax on a regular
LCD screen, without requiring the user to wear glasses/other headgear or to modify the screen
in any way. Such simulated 3D displays could serve as a “stepping-stone” between full-3D
displays (providing stereopsis depth cue) and currently pervasive 2D displays. The proposed
approach for achieving this aim was to use viewer’s head tracking based on colour and depth
data provided by the Microsoft Kinect sensor.
Work Completed
In order to detect the viewer’s face, a distributed Viola-Jones face detector training framework
has been implemented, and a colour-based face detector cascade has been trained. To track
the viewer’s head, a combined colour- and depth-based approach has been proposed. The
combined head-tracker was able to predict the viewer’s head center location within less than 1/3
of the head’s size from the actual head center on average. A proof-of-concept 3D display system
(using a created head-tracking library) has also been implemented, simulating pictorial and
motion parallax depth cues. A short demonstration of the working system can be seen at
http://zabarauskas.com/3d.
Special Difficulties
None.
¹Computed using detex diss.tex | tr -cd ’0-9A-Za-z \n’ | wc -w, excluding proforma and appendices.
Declaration
I, Manfredas Zabarauskas of Wolfson College, being a candidate for Part II of the Computer
Science Tripos, hereby declare that this dissertation and the work described in it are my own
work, unaided except as may be specified below, and that the dissertation does not contain
material that has already been used to any substantial extent for a comparable purpose.
Signed
Date
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Human Depth Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Depth Cue Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Related Work on 3D Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Detailed Project Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Preparation 5
2.1 Starting Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Project Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Requirements Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Risk Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Problem Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 Data Flow and System Components . . . . . . . . . . . . . . . . . . . . . 7
2.4 Image Processing and Computer Vision Methods . . . . . . . . . . . . . . . . . 7
2.4.1 Viola-Jones Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.2 CAMShift Face Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.3 ViBe Background Subtractor . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Depth-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.1 Peters-Garstka Head Detector . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.2 Depth-Based Head Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Implementation 22
3.1 Development Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Languages and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Development Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.4 Code Versioning and Backup Policy . . . . . . . . . . . . . . . . . . . . . 24
3.3 Implementation Milestones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 High-Level Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Viola-Jones Detector Distributed Training Framework . . . . . . . . . . . . . . . 25
3.5.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.2 Class Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.3 Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 Head-Tracking Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6.1 Head-Tracker Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6.2 Colour-Based Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6.3 Colour-Based Face Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.6.4 Colour- and Depth-Based Background Subtractors . . . . . . . . . . . . . 38
3.6.5 Depth-Based Head Detector and Tracker . . . . . . . . . . . . . . . . . . 39
3.6.6 Tracking Postprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 3D Display Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7.1 3D Game (Z-Tris) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Evaluation 48
4.1 Viola-Jones Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.2 Trained Cascade . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1.3 Face Detector Accuracy Evaluation . . . . . . . . . . . . . . . . . . . . . 52
4.1.4 Face Detector Speed Evaluation . . . . . . . . . . . . . . . . . . . . . . . 52
4.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 HT3D (Head-Tracking in 3D) Library . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Tracking Accuracy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 3D Display Simulator (Z-Tris) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Conclusions 75
5.1 Accomplishments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A Depth Cue Perception 80
A.1 Oculomotor Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.2 Monocular Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.2.1 Pictorial Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A.2.2 Motion Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.3 Binocular Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B 3D Display Technologies 84
B.1 Binocular (Two-View) Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.2 Multi-View Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
B.3 Light-Field (Volumetric and Holographic) Displays . . . . . . . . . . . . . . . . 86
B.4 3D Display Comparison w.r.t. Depth Cues . . . . . . . . . . . . . . . . . . . . . 86
B.5 3D Display Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
B.5.1 Scientific and Medical Software . . . . . . . . . . . . . . . . . . . . . . . 87
B.5.2 Gaming, Movie and Advertising Applications . . . . . . . . . . . . . . . 87
C Computer Vision Methods (Additional Details) 89
C.1 Viola-Jones Face Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
C.1.1 Weak Classifier Boosting using AdaBoost . . . . . . . . . . . . . . . . . . 89
C.1.2 Best Weak-Classifier Selection . . . . . . . . . . . . . . . . . . . . . . . . 92
C.1.3 Cascade Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
C.2 CAMShift Face Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
C.2.1 Mean-Shift Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
C.2.2 Centroid and Search Window Size Calculation . . . . . . . . . . . . . . . 94
C.3 ViBe Background Subtractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
C.3.1 Background Model Initialization . . . . . . . . . . . . . . . . . . . . . . . 95
C.3.2 Background Model Update . . . . . . . . . . . . . . . . . . . . . . . . . . 96
D Depth-Based Methods (Additional Details) 97
D.1 Depth Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
D.1.1 Depth Shadow Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . 97
D.1.2 Real-Time Depth Image Smoothing . . . . . . . . . . . . . . . . . . . . . 97
D.2 Depth Cue Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
D.2.1 Generalized Perspective Projection . . . . . . . . . . . . . . . . . . . . . 98
D.2.2 Real-Time Shadows using Z-Pass Algorithm with Stencil Buffers . . . . . 100
E Implementation (Additional Details) 104
E.1 Viola-Jones Distributed Training Framework . . . . . . . . . . . . . . . . . . . . 104
E.2 HT3D Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
E.2.1 Head Tracker Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
E.2.2 Colour- and Depth-Based Background Subtractors . . . . . . . . . . . . . 104
E.3 3D Display Simulator Components . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.3.1 Application Entry Point . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
E.3.2 Head Tracker Configuration GUI . . . . . . . . . . . . . . . . . . . . . . 110
E.3.3 3D Game (Z-Tris) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
F HT3D Library Evaluation (Additional Details) 116
F.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
F.1.1 Sequence Track Detection Accuracy . . . . . . . . . . . . . . . . . . . . . 116
F.1.2 Multiple Object Tracking Accuracy/Precision . . . . . . . . . . . . . . . 117
F.1.3 Average Normalized Distance from the Head Center . . . . . . . . . . . . 117
F.2 Evaluation Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
F.2.1 Viola-Jones Face Detector Output . . . . . . . . . . . . . . . . . . . . . . 120
F.2.2 δ Metric for Individual Recordings . . . . . . . . . . . . . . . . . . . . . 121
F.2.3 MOTA/MOTP Evaluation Results . . . . . . . . . . . . . . . . . . . . . 141
G 3D Display Simulator (Z-Tris) Evaluation 142
G.1 Automated Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
G.2 Manual Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
G.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
H Sample Code Listings 145
I Project Proposal 153
Chapter 1
Introduction
This chapter describes the motivation for a three-dimensional display simulation using Microsoft
Kinect, the basic workings of the human depth perception (in order to understand how it could
be simulated), the related work that has been done on the 3D display simulation, and the main
applications for 3D displays.
1.1 Motivation
The ideas and research about three-dimensional displays can be traced back to the mid-nineteenth
century, when Wheatstone first demonstrated his findings about stereopsis to the Royal Society
of London.
In the last decade a number of usable glasses-free autostereoscopic systems have become available,
while glasses-based stereoscopic 3D display systems have been on the market for a few decades.
Nevertheless, 3D displays have struggled to break out of their niche markets because of their
relatively low quality and high price compared to conventional displays.
In November 2010, Microsoft launched the Kinect sensor, containing an IR depth-finding
camera. It became a huge commercial success, entering the Guinness World Records
Figure 1.1: Just-discriminable depth thresholds for two objects at distances D1 and D2, as a function
of the logarithm of distance from the observer, for the nine depth cues. The depth of the two objects
is represented by their average distance (D1 + D2)/2; the depth contrast is obtained by calculating
2(D1 − D2)/(D1 + D2). Reproduced from Cutting and Vishton, 1995 [13].
as the “fastest-selling consumer electronic device”, with 18 million units sold as of January
2012.
Based on this new development, an idea was conceived to explore the applicability of the
cheap and ubiquitous Kinect sensor in creating the depth perception on existing widespread
high-quality single-view displays.
The crucial first step in developing such system is to understand the main principles of the
human depth perception.
1.2 Human Depth Perception
According to Goldstein [19], all depth cues can be classified into three major groups:
1. Oculomotor cues (based on the human ability to sense the position of the eyes and the tension
in the eye muscles),
2. Monocular cues (using the input from just one eye),
3. Binocular cues (using the input from both eyes).
These major groups (together with the definitions used in the rest of this chapter) are fully
described in appendix A.
1.2.1 Depth Cue Comparison
The relative efficacy and importance of various depth cues has been summarized by Cutting
and Vishton [13]. Figure 1.1 presents the just-discriminable depth thresholds as a function of
Depth cue                  0–2 m   2–30 m          2–30 m               > 30 m
                                   (all sources)   (pictorial sources)
Occlusion                  1       1               1                    1
Relative size              4       3.5             3                    2
Relative density           7       6               4                    4.5
Relative height            —       2               2                    3
Atmospheric perspective    8       7               5                    4.5
Motion parallax            3       3.5             —                    6
Convergence                5.5     8.5             —                    8.5
Accommodation              5.5     8.5             —                    7
Stereopsis                 2       5               —                    7

Table 1.1: Ranking of depth cues in the observer’s space, obtained by integrating the area under each
depth-threshold function from figure 1.1 within each spatial region, and comparing the relative areas.
A lower rank means higher importance; a dash indicates that the data was not applicable to that
depth cue. Based on Cutting and Vishton, 1995 [13].
Figure 1.2: A sample taxonomy of 3D display technologies. Italic font indicates autostereoscopic
displays.
the logarithm of distance from the observer for each of the depth cues, and table 1.1 describes
the relative importance of these depth cues in three circular areas around the observer. In
particular, occlusions, stereopsis and motion parallax are distinguished as the most important
cues for depth perception in low to average viewing distance ranges.
1.3 Related Work on 3D Displays
Physiological knowledge about human depth cue perception has been extensively applied in 3D
display design, and multiple ways to classify such displays have been presented in the literature
[40, 4, 14, 23]. A sample taxonomy of currently dominating 3D display technologies is given in
figure 1.2.
Table 1.2 compares these display types with respect to the depth cues that they can simulate
and the special equipment that they require, while a much broader discussion is given in
appendix B.
Table 1.2: Comparison of the different display types with respect to the depth cues that they
provide and their requirements for special equipment.

                  Requirements                             Simulated depth cues
Display type      Head      Eye-   Standard LCD/   Pic-     Stereo-  Motion       Accomm. &
                  tracking  wear   CRT monitor     torial   psis     parallax     conv. match
Binocular         X         X      X               X        X        Continuous
Multi-view                                         X        X        Discrete¹
Light-field²                                       X        X        Continuous   X
Proposed          X                X               X                 Continuous

¹ Typically only in the horizontal direction.
² Light-field displays still remain largely experimental (as described by Holliman et al. in [24]).
1.4 Applications
Dodgson [14] distinguishes two main classes of applications for the autostereoscopic 3D display
systems:
• Scientific and medical software, where 3D depth perception is needed for the successful
completion of the task,
• Gaming and advertising applications, where the novelty of a stereo parallax is useful as
a commercial selling point.
Examples from these two application classes are discussed in appendix B.5.
1.5 Detailed Project Aims
To achieve the project’s main aim (“to simulate depth perception on a regular LCD screen
through the use of the ubiquitous and affordable Microsoft Kinect sensor, without requiring
the user to wear glasses or other headgear, or to modify the screen in any way”), the project
will simulate
• pictorial depth cues: lighting, shadows, occlusions, relative height/size/density and texture
gradient (by implementing an appropriate three-dimensional scene in a 3D rendering
framework),
• continuous horizontal and vertical motion parallax, through real-time head tracking using
the Microsoft Kinect sensor.
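In practice, such head-coupled motion parallax amounts to recomputing an off-axis (asymmetric) perspective frustum from the tracked head position on every frame; the generalized perspective projection used for this is described in appendix D.2.1. A minimal sketch of the idea follows (illustrative Python with hypothetical parameter names, assuming a screen centred at the origin of the x-y plane, not the project's actual rendering code):

```python
# Off-axis frustum bounds computed from a tracked head position.
# The screen is centred at the origin in the x-y plane; the head position
# (hx, hy, hz) is in the same metric coordinates, with hz > 0 being the
# viewer's distance from the screen plane. All names are illustrative.

def off_axis_frustum(hx, hy, hz, screen_w, screen_h, near, far):
    """Return (left, right, bottom, top, near, far) for a glFrustum-style
    asymmetric projection that keeps the rendered scene registered with
    the physical screen as the viewer's head moves."""
    s = near / hz  # scale screen-plane extents down to the near plane
    left = (-screen_w / 2.0 - hx) * s
    right = (screen_w / 2.0 - hx) * s
    bottom = (-screen_h / 2.0 - hy) * s
    top = (screen_h / 2.0 - hy) * s
    return (left, right, bottom, top, near, far)

# A centred head yields a symmetric frustum; moving the head to the right
# shifts the frustum to the left, producing horizontal motion parallax.
print(off_axis_frustum(0.0, 0.0, 0.6, 0.5, 0.3, 0.1, 100.0))
```

The frustum is rebuilt every frame from the latest head estimate, so tracking latency and jitter translate directly into perceived scene instability.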
The project will not aim to simulate stereopsis, because that would require modifications to the
screen (a standard LCD display inherently provides a single view that is seen binocularly). For
the same reason, simulating depth perception for multiple viewers using a single view will not
be attempted.
Since motion parallax is one of the strongest near- and middle-range depth perception cues, its
simulation through viewer’s head tracking will be one of the main focal points of the project. It
is clear from the start that achieving accurate head tracking will require a significant number
of computer vision and signal processing techniques. Even more importantly, these algorithms
will have to be extended to use the depth information provided by the Microsoft Kinect
sensor.
These tasks require a careful consideration of various tractability issues and a lot of attention
to the computer vision techniques before embarking on the project. They will be discussed in
much more detail in the following chapter.
Chapter 2
Preparation
This chapter outlines the planning and research that was undertaken before starting the imple-
mentation of the project. In particular, it discusses the starting point, the main requirements
of the overall system, project methodology, risk analysis and the problem constraints. It then
describes the most important theory and algorithms that were used in the project, paying
particular attention to the computer vision techniques.
2.1 Starting Point
Before starting the project I had
• basic knowledge of the Microsoft Visual Studio development environment and C# program-
ming language (six months of working experience),
• next-to-zero practical experience with the OpenGL rendering framework,
• no experience with the Kinect SDK,
• no experience with the relevant machine learning and computer vision techniques.
2.2 Project Methodology
Agile Software Development philosophies [3] were followed in requirement analysis, design,
implementation and testing.
More precisely,
• Requirements analysis (section 2.3) was based on usage modelling.
• System design (section 3.4) was focused on
– process modelling (through data flow diagrams), and
– architectural modelling (through component diagrams).
• Implementation was focused on
– constant pace with clear milestones and deliverables (following the project proposal),
– iterative development with weekly/bi-weekly iteration cycles,
– continuous integration, where the working software is extended weekly/bi-weekly by
adding new features, but is always kept in a working state.
• Testing was based on agile approaches (functional, sanity and usability manual testing
performed continuously throughout the iteration) and automated regression unit tests
(performed at the end of an iteration).
2.3 Requirements Analysis
The variety of use cases, scenarios and applications of depth displays is described in section
B.5.
To limit the scope of the project to something manageable within the Part II project timeframe,
and at the same time to define concrete deliverables that would achieve the main aim of the
project (as described in section 1.5), two simple user stories are given in table 2.1.
User: 3D Application Developer. As an application developer who wants to create her own
3D application on a regular display, I want to easily obtain the viewer’s head location
information, so that I can use it to render my depth-aware application accordingly.

User: Gamer. As a gamer, I want to experience a higher sense of realism when playing a 3D
game, so that I can a) more easily perform tasks that require depth estimation, and b)
experience a higher level of immersiveness in the game.

Table 2.1: User stories for the agile requirement analysis of the project.
Extrapolating from these two simple user stories, the deliverables of the project (and the main
requirements for them) can be defined more precisely:
1. A head-tracking library that can be used to easily obtain the viewer’s head location in
three-dimensions. The main requirements (in the order of their priority) are:
(a) Accuracy: the head-tracker should be able to correctly detect a single viewer’s head
in the majority of input frames (i.e. the average distance between the tracker’s
prediction and the actual head center in the image should not exceed 1/2 of the
viewer’s head size),
(b) Performance: the head-tracker should work in real-time (i.e. should process at least
30 frames per second),
(c) Ease of use: the library should be flexible enough to be used in multiple projects.
2. A simple 3D game that simulates depth perception and requires the user to accurately
estimate depth in order to achieve certain in-game goals. The main requirements are:
(a) Continuous vertical and horizontal motion parallax depth cue simulation,
(b) Pictorial depth cues simulation (lighting, shadows, occlusions, relative height/size/density
and texture gradient),
(c) In-game goal system requiring the player to estimate depth accurately.
2.3.1 Risk Analysis
Undoubtedly, the biggest challenge and the highest uncertainty associated with these
deliverables is the requirement for accurate real-time head tracking.
For this reason, the remaining sections of this chapter (and a very significant part of the overall
dissertation) are focused on successfully implementing viewer’s head tracking using colour and
depth information provided by Microsoft Kinect.
2.3.2 Problem Constraints
As described in section 1.5, depth perception simulation for multiple viewers will not be at-
tempted because it would require modifications to the screen (a standard LCD display inher-
ently provides a single view). This reduces the complexity of head-tracking, since only a single
viewer needs to be tracked.
Furthermore, observe that the reference point of the tracked head location is the Kinect sensor.
Since the location of the sensor might not necessarily coincide with the position of the display,
a constraint is imposed that the Kinect sensor must always be placed directly above the display.
This helps to avoid complicated semi-automatic calibration routines.
2.3.3 Data Flow and System Components
The head-tracking task, as the main task of the project (described in the requirements and risk
analysis), can be formalized as a sequence of data transformations, where the input data is the
depth and colour streams coming from Microsoft Kinect and the transformed data is the location
of the viewer’s head w.r.t. the display.
Based on the background research, this transformation can be broken down into individual
components as shown in figure 2.1.
Each of these components can be developed, tested and refined nearly-independently from oth-
ers. This modular approach makes testing and debugging process much easier, and maximises
the opportunity for the code reuse. It also closely adheres to the iterative prototyping style, as
one of the most important agile software engineering methodologies.
The following sections describe the relevant theory needed to successfully implement these
individual components and the actual implementation details are given in Chapter 3.
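Since each component is a nearly-independent transformation of the per-frame data, the decomposition of figure 2.1 can be pictured as a simple function pipeline. The sketch below is purely illustrative: the stage names and interfaces are hypothetical stand-ins for the project's actual components, written in Python rather than the project's C#.

```python
# Schematic of the data-flow decomposition: each pipeline stage is an
# independent, separately testable transformation of the per-frame state.
# The stage functions below are dummy stand-ins (hypothetical interfaces).

def make_pipeline(stages):
    """Compose per-frame transformations into a single callable."""
    def run(frame_state):
        for stage in stages:
            frame_state = stage(frame_state)
        return frame_state
    return run

def background_subtract(state):
    state["foreground"] = "mask"          # placeholder foreground mask
    return state

def detect_or_track_head(state):
    state["head_px"] = (320, 200, 1200)   # (x, y, depth in mm), dummy value
    return state

def to_display_coords(state):
    x, y, z = state["head_px"]
    state["head_m"] = (x / 1000.0, y / 1000.0, z / 1000.0)  # placeholder conversion
    return state

track = make_pipeline([background_subtract, detect_or_track_head, to_display_coords])
result = track({"colour": "...", "depth": "..."})
print(result["head_m"])
```

Because each stage only sees and extends the shared frame state, stages can be developed, unit-tested and swapped independently, which is exactly the modularity argued for above.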
2.4 Image Processing and Computer Vision Methods
This section introduces the first three colour stream transformation algorithms (as shown in
figure 2.1), viz.:
Figure 2.1: Project data flow as a sequence of data transformations performed by corresponding
algorithms. Transformations with dashed borders are optional.
• viewer’s face detection using the Viola-Jones object detection framework (specifically trained
for human faces),
• face tracking using the CAMShift object tracker, and
• image segmentation into foreground and background using the ViBe background subtractor,
to improve the tracking and detection tasks.
2.4.1 Viola-Jones Face Detector
Face detection in unconstrained images is a difficult task due to large intra-class variations:
• differences in facial appearance (hair, beards, glasses),
• changing lighting conditions,
• within- and out-of-image-plane head rotations,
• changing facial expressions,
• impoverished image data, and so on.
In 2001, Paul Viola and Michael Jones in their seminal work [41] proposed a machine learning-
based generic object detection framework. It became a de facto standard for face detection due
to its rapid image processing speed and high detection accuracy.
The Viola-Jones object detection framework is based on the general classification framework:
given a set of N examples (~x1, y1), ..., (~xN , yN ), where ~xi ∈ X are the feature vectors and
yi ∈ {0, 1} is the class of the training example (non-face/face respectively), the goal is to find a
classifier h : X → {0, 1} such that the misclassification error is minimized.
Figure 2.2: Three classes of features (two-rectangle, three-rectangle and four-rectangle) used in the
Viola-Jones algorithm. The value of the feature h is defined as the difference between the sums of
pixel intensities in the black region B and in the white region W , i.e.
h = ∑_{(x,y)∈B} I(x, y) − ∑_{(x,y)∈W} I(x, y).
Figure 2.3: Integral image representation used in the Viola-Jones algorithm. The value of the integral
image II at coordinates (x, y) is equal to II(x, y) = ∑_{m≤x, n≤y} I(m,n), where I is the original
image.

Figure 2.4: Method to rapidly (in 6–9 array references) calculate rectangle feature values:
D = II(x4, y4) − II(x3, y3) − II(x2, y2) + II(x1, y1).
2.4.1.1 Features
Instead of using raw pixel intensities as feature vectors in classification, higher-level features are
used. There are multiple reasons for doing so: most notably, higher-level features help to encode
ad-hoc domain knowledge, increase between-class variability (when compared to within-class
variability) and increase the processing speed.
Viola-Jones algorithm uses Haar-like features (resembling Haar wavelets used by Papageorgiou
et al. [35]), shown in figure 2.2.
The first main contribution of Viola-Jones algorithm is the integral image representation (see
figure 2.3) which allows a constant-time feature evaluation at any location or scale.
The value of the integral image II at coordinates (x, y) is equal to the sum of all pixels above
and to the left of (x, y), i.e.

    II(x, y) = ∑_{x′≤x, y′≤y} I(x′, y′),    (2.1)

where I is the original image.
Then the sum of the pixel intensities within an arbitrary rectangle in the image can be computed
with four array references (as shown in figure 2.4).
Note that II itself can be computed in one pass over the image using the recurrences

    R(x, y) = R(x, y − 1) + I(x, y),
    II(x, y) = II(x − 1, y) + R(x, y),    (2.2)

where R is the cumulative row sum, R(x,−1) = 0 and II(−1, y) = 0.
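The recurrences above can be illustrated directly. The following sketch (plain Python for illustration, whereas the project itself is implemented in C#; axes are named to match Python's row-major indexing) builds the integral image in a single pass and verifies the four-reference rectangle sum of figure 2.4:

```python
# Integral image computed in one pass with a running row sum, then an
# arbitrary rectangle sum recovered with four array references.

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0                      # cumulative sum along the current row
        for x in range(w):
            row += img[y][x]
            above = ii[y - 1][x] if y > 0 else 0
            ii[y][x] = above + row   # integral value = row sum + value above
    return ii

def rect_sum(ii, x1, y1, x2, y2):
    """Sum of the image over [x1..x2] x [y1..y2] (inclusive) in four references."""
    a = ii[y1 - 1][x1 - 1] if x1 > 0 and y1 > 0 else 0
    b = ii[y1 - 1][x2] if y1 > 0 else 0
    c = ii[y2][x1 - 1] if x1 > 0 else 0
    return ii[y2][x2] - b - c + a

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Once the integral image is built, every Haar-like feature value (a difference of two or three such rectangle sums) is evaluated in constant time, regardless of the rectangle's size.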
However, for the base resolution of the detector (24 × 24 pixels), the total count of these
rectangular features is 162,336. Evaluating this complete set would be computationally
prohibitively expensive and, unlike the Haar basis, this feature set is many times overcomplete.
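The 162,336 figure can be verified by enumerating every scale and position of the base feature shapes inside a 24 × 24 window. A quick check (illustrative Python, assuming the standard 2×1, 1×2, 3×1, 1×3 and 2×2 base shapes from figure 2.2):

```python
# Count all Haar-like feature instances in a 24x24 detection window by
# enumerating every integer scale and position of the five base shapes:
# 2x1 and 1x2 two-rectangle, 3x1 and 1x3 three-rectangle, 2x2 four-rectangle.

def count_features(window=24, shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    total = 0
    for bw, bh in shapes:
        # Every integer multiple of the base shape that fits the window...
        for w in range(bw, window + 1, bw):
            for h in range(bh, window + 1, bh):
                # ...placed at every possible top-left position.
                total += (window - w + 1) * (window - h + 1)
    return total

print(count_features())  # 162336
```

The enumeration makes the overcompleteness concrete: 162,336 features over a 576-pixel window is roughly a 280-fold overcomplete basis.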
2.4.1.2 Ensemble Learning
Viola and Jones proposed that a small number of these features could be chosen to form an
effective classifier using the boosting techniques, common in machine learning.
The actual boosting technique used by Viola and Jones is called AdaBoost (Adaptive Boosting)
and was first described by Freund and Schapire in 1995 [15]. Schapire and Singer [37] proved
that the training error of a strong classifier obtained using AdaBoost decreases exponentially
in the number of rounds.
AdaBoost attempts to minimize the overall training error, but for the face detection task it
is more important to minimize the false negative rate than the false positive rate (as discussed
in section 2.4.1.4).
Viola and Jones in 2002 [42] proposed a fix to AdaBoost, called AsymBoost (Asymmetric
AdaBoost). The AsymBoost algorithm is specifically designed for classification tasks
where the distribution of positive and negative training examples is highly skewed.
The precise details and the explanations of both AdaBoost and AsymBoost techniques are
given in the appendix C.1.1.
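As a compact illustration of the boosting loop summarized above, the following is a generic discrete AdaBoost sketch over a fixed pool of weak classifiers, run on toy one-dimensional data. It is illustrative only: the actual training procedure (including the weak-classifier search over Haar features and the AsymBoost variant) is the one described in appendix C.1.1.

```python
import math

# Discrete AdaBoost sketch: after each round, reweight the training
# examples so that the next weak classifier focuses on current mistakes.
# 'weak_learners' is a small pool of h: x -> {0, 1}; a real implementation
# would instead search the Haar-feature space for the best decision stump.

def adaboost(examples, labels, weak_learners, rounds):
    n = len(examples)
    w = [1.0 / n] * n
    strong = []                        # list of (alpha, h) pairs
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error.
        errs = [sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
                for h in weak_learners]
        t = min(range(len(weak_learners)), key=lambda i: errs[i])
        h, eps = weak_learners[t], max(errs[t], 1e-10)
        beta = eps / (1.0 - eps)
        strong.append((math.log(1.0 / beta), h))
        # Down-weight correctly classified examples, then renormalize.
        w = [wi * (beta if h(x) == y else 1.0)
             for wi, x, y in zip(w, examples, labels)]
        s = sum(w)
        w = [wi / s for wi in w]
    def classify(x):
        vote = sum(a * h(x) for a, h in strong)
        return 1 if vote >= 0.5 * sum(a for a, _ in strong) else 0
    return classify

# Toy 1-D data: "faces" are values greater than 5.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else 0 for t in range(10)]
clf = adaboost(xs, ys, stumps, rounds=3)
print([clf(x) for x in xs])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The weighted-majority final vote is what makes the ensemble strictly stronger than any single stump in the pool.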
2.4.1.3 Weak classifiers
For the purpose of face detection, decision stump weak classifiers can be used. An individual
classifier hi(~x, f, p, θ) takes a Haar-like feature f , a threshold θ and a polarity p, and returns
the class of a training example ~x:

    hi(~x, f, p, θ) = 1 if p · f(~x) < p · θ, and 0 otherwise.    (2.3)
To find a decision stump with the lowest error εt for a given training round t, algorithm C.1.2.1
can be used. It is worth noting that the asymptotic time cost to find the best weak classifier for
Figure 2.5: Decision-making process in the attentional cascade, where a series of classifiers is applied
to every sub-window. Due to the “immediate rejection” property, the number of sub-images that reach
the deep layers of the cascade is drastically smaller than the overall count of sub-images.
a given training round is O(KN logN ), where K is the number of features and N is the number
of training examples1.
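The O(KN logN ) bound comes from sorting the N examples by feature value once per feature, then sweeping a single threshold across the sorted list while maintaining cumulative class weights. The sketch below illustrates this idea for one feature; it follows the general idea of algorithm C.1.2.1 rather than reproducing it, and all names are illustrative.

```python
# Best decision stump for a single feature in O(N log N): sort by feature
# value, then sweep the threshold between consecutive examples while
# tracking the weight of positives/negatives seen so far.

def best_stump(values, labels, weights):
    """Return (threshold, polarity, error) minimizing the weighted error of
    the stump 'predict 1 iff polarity * value < polarity * threshold'."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    sv = [values[i] for i in order]
    t_pos = sum(w for w, y in zip(weights, labels) if y == 1)
    t_neg = sum(w for w, y in zip(weights, labels) if y == 0)
    s_pos = s_neg = 0.0                 # class weight seen below the threshold
    best = (sv[0] - 1.0, 1, t_pos)      # threshold below all values: all negative
    for k, i in enumerate(order):
        s_pos += weights[i] if labels[i] == 1 else 0.0
        s_neg += weights[i] if labels[i] == 0 else 0.0
        # Candidate threshold between this and the next sorted value.
        theta = 0.5 * (sv[k] + (sv[k + 1] if k + 1 < len(sv) else sv[k] + 2.0))
        e1 = s_neg + (t_pos - s_pos)    # p = +1: everything below -> face
        e2 = s_pos + (t_neg - s_neg)    # p = -1: everything above -> face
        if min(e1, e2) < best[2]:
            best = (theta, 1 if e1 < e2 else -1, min(e1, e2))
    return best

w = [1.0 / 6] * 6
vals = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]   # feature values for six examples
lbls = [1, 1, 1, 0, 0, 0]               # here faces have small feature values
print(best_stump(vals, lbls, w))
```

Because the cumulative weights are updated in O(1) per example, only the initial sort costs O(N logN ); repeating this for all K features gives the O(KN logN ) total quoted above.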
2.4.1.4 Attentional Cascade
A further observation by Viola and Jones is based on the fact that the face/non-face classes
are highly asymmetric, viz. the number of negative sub-images (not containing faces) in a given
image is typically overwhelmingly higher than the number of positive sub-images (containing
faces). With this insight in mind, it is sensible to focus the initial effort of the detector on
eliminating large areas of the image (as not containing faces) using some simple classifiers, with
progressively more accurate (and computationally more expensive) classifiers focusing on the
rare areas of the image that could possibly contain a face.
The idea given in the previous paragraph is embodied in the construction of the attentional
cascade (see figure 2.5). It is enough for a single classifier to reject a sub-image for it to be
rejected by the whole detector; conversely, a sub-image has to be accepted by every classifier
in the cascade to be accepted by the detector. Also, each of the classifiers in the attentional
cascade is designed to have a much smaller false negative rate than false positive rate; this
provides confidence that when a classifier rejects a sub-image, it is very likely not to have
contained a face in the first place.
Each of the strong classifiers in the cascade is obtained through boosting. A new classifier in
the cascade is trained on the data that all the previous classifiers misclassify; in that sense,
each successive classifier in the cascade faces a more difficult and time-consuming task than
its predecessor.
A detailed training algorithm for building a cascaded detector is given in appendix C.1.3.1.
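The “immediate rejection” property is simple to express in code. The following is a minimal illustrative sketch (in Python for brevity; the names are hypothetical and this is not the dissertation's implementation), where each stage stands for a boosted strong classifier represented as a predicate over a sub-window:

```python
# Illustrative sketch of attentional-cascade evaluation (hypothetical
# names, not the dissertation's implementation). Each stage returns
# True (face) or False (non-face) for a given sub-window.

def evaluate_cascade(stages, window):
    """Accept a sub-window only if every stage accepts it.

    A single rejection terminates evaluation immediately, so most
    non-face windows are discarded by the first, cheapest stages.
    """
    for stage in stages:
        if not stage(window):
            return False  # immediate rejection
    return True
```

Because the early stages are cheap and reject the vast majority of sub-windows, the expected per-window cost is dominated by the first few stages.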
¹ Putting this into perspective, to obtain a single strong classifier containing ≈ 100 weak classifiers for ≈ 10,000 training examples and ≈ 160,000 Haar-like features, O(10¹¹) operations are needed (assuming constant feature evaluation time).
Because of the construction, the false positive rate of the overall cascade is
F = ∏_{i=1}^{K} f_i,    (2.4)

where K is the number of classifiers in the cascade and f_i is the false positive rate of the i-th classifier;
similarly, the detection rate of the overall cascade is

D = ∏_{i=1}^{K} d_i,    (2.5)

where d_i is the detection rate of the i-th classifier².
Due to the large Haar-like feature search space, the specifics of strong-classifier boosting and
the false positive training image bootstrapping for new cascade layers, careful consideration
of the training framework implementation is required (see appendix C.1.3.1 for a “back-of-
the-envelope” training time estimation for a naïve implementation). Section 3.5 presents the
distributed Viola-Jones cascade training implementation and discusses the main methods to
tackle the training time complexity in more detail.
2.4.2 CAMShift Face Tracker
After the face in the image has been localized using the Viola-Jones face detection algorithm, it
can be tracked using the CAMShift (Continuously Adaptive Mean Shift) algorithm, first described
by Gary Bradski in 1998 [8].
CAMShift is largely based on the mean shift algorithm [16], which is a non-parametric tech-
nique to climb the gradient of a given probability distribution to find the nearest dominant
peak (mode). The mean shift algorithm is given in C.2.1.1, and a short proof of mean shift
convergence to the mode of the probability distribution can be found in [8].
CAMShift extends the mean shift algorithm by adapting the search window size to the
changing probability distribution. The distributions are recomputed for each frame, and the
zeroth and first spatial (horizontal and vertical) moments are used to iterate towards the mode of the
distribution. This makes the CAMShift algorithm robust enough to track the face when the viewer
moves in horizontal, vertical and lateral directions, when minor facial features
(e.g. expressions) change, or when the face is rotated in the camera plane (head roll).
2.4.2.1 “Face” Probability Distribution
In order to use CAMShift for face tracking, a “face” probability distribution function (that
assigns an individual pixel a probability that it belongs to a face) needs to be constructed. It
² Notice that to achieve a detection rate of 0.9 and a false positive rate of 6 × 10⁻⁶ using a 10-stage classifier, each stage has to have a detection rate of 0.99, but a false positive rate of only about 0.3 (i.e. three out of ten non-face images on average are allowed to be misclassified as faces by each strong classifier!).
Figure 2.6: Conical HSV (hue, saturation, value) colour space.
is done by converting the input video frame into the HSV (hue, saturation, value) colour space
(shown in figure 2.6) and building the hue histogram of the region in the image where the face
was detected.
The main reason for using the hue histogram is the fact that all humans (except albinos) have
basically the same skin colour hue (as observed by Bradski and verified in [7]).
The construction of the hue histogram works as follows. Assume that the hue of each pixel
is encoded using m-bits and h(x, y) is the hue of the pixel with coordinates (x, y). Then the
unweighted histogram {q_u}, u = 1 … 2^m, can be computed using

q_u = ∑_{(x,y) ∈ I_d} δ(h(x, y) − u),    (2.6)

where I_d is the detected face region in the video frame.
The rescaled histogram {q̂_u}, u = 1 … 2^m, can be obtained by calculating

q̂_u = min( q_u / max({q_u}), 1 ).    (2.7)
Then the “face” probability of a pixel at coordinates (x′, y′) can be calculated using the
histogram backprojection, i.e.

Pr(“I(x′, y′) belongs to a face”) = q̂_{h(x′, y′)}.    (2.8)
An illustration of the “face” probability calculated using histogram backprojection is shown in
figure 2.7.
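Equations 2.6–2.8 can be sketched in a few lines of code. The following is an illustrative Python fragment under simplifying assumptions (`hue` is a 2D array of m-bit hue values, `face_region` a list of pixel coordinates of the detected face; all names are hypothetical):

```python
# Illustrative sketch of hue-histogram backprojection (equations
# 2.6-2.8). Hypothetical names; not the dissertation's implementation.

def build_histogram(hue, face_region, m):
    q = [0] * (2 ** m)
    for (x, y) in face_region:
        q[hue[y][x]] += 1                        # equation 2.6
    peak = max(q) or 1
    return [min(b / peak, 1.0) for b in q]       # equation 2.7 (rescaling)

def face_probability(hue, q_scaled, x, y):
    # equation 2.8: backprojection is a simple histogram lookup
    return q_scaled[hue[y][x]]
```

Backprojection is thus extremely cheap per pixel: a single table lookup once the histogram has been built.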
Figure 2.7: “Face” probability image b) obtained from input image a) using the histogram backprojection method. Brighter areas of image b) indicate a higher probability for a pixel to be a part of the face.
2.4.2.2 Centroid Calculation and Algorithm Convergence
After the face probability distribution has been constructed, the CAMShift algorithm uses the
zeroth and first moments of the face probability distribution to compute the centroid of the
high-probability region (see the appendix C.2.2 for precise details).
The mean shift component of the CAMShift algorithm continually recomputes the centroid
until there is no significant change in its position. Typically, the maximum number of iterations
in this process is set between 10 and 20, and since sub-pixel accuracy cannot be observed,
a minimum shift of one pixel in either the horizontal or vertical direction is used as a convergence
criterion³.
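The moment-based centroid iteration at the core of the algorithm can be sketched as follows (an illustrative Python fragment with hypothetical names, not the dissertation's implementation; `prob` is the “face” probability image, the window an axis-aligned rectangle, and only the mean shift core is shown, without CAMShift's window-size adaptation):

```python
# Illustrative mean-shift core: centroid from the zeroth and first
# moments of the probability image, iterated until the window moves
# less than one pixel. Hypothetical names.

def window_centroid(prob, x, y, w, h):
    m00 = m10 = m01 = 0.0
    for j in range(y, y + h):
        for i in range(x, x + w):
            p = prob[j][i]
            m00 += p
            m10 += i * p
            m01 += j * p
    if m00 == 0:          # empty window: terminate (cf. footnote 3)
        return None
    return (m10 / m00, m01 / m00)

def mean_shift(prob, x, y, w, h, max_iter=15):
    img_h, img_w = len(prob), len(prob[0])
    for _ in range(max_iter):
        c = window_centroid(prob, x, y, w, h)
        if c is None:
            return None
        # re-centre the window on the centroid, clamped to the image
        nx = min(max(int(c[0] - w / 2), 0), img_w - w)
        ny = min(max(int(c[1] - h / 2), 0), img_h - h)
        if abs(nx - x) < 1 and abs(ny - y) < 1:
            break         # converged: shift below one pixel
        x, y = nx, ny
    return (x, y, w, h)
```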
2.4.3 ViBe Background Subtractor
In order to mitigate one of the main drawbacks of the CAMShift face tracking algorithm, viz. its
inability to distinguish an object from the background if they have similar hue, a separate
background/foreground segmentation algorithm can be used.
The ViBe (Visual Background Extractor) algorithm, as described by Barnich and Van Droogenbroeck [2], is a universal⁴, sample-based background subtraction algorithm.
³ Some care must also be taken to ensure that the algorithm terminates when the search window does not contain any pixels with non-zero face probability, i.e. when the zeroth moment is equal to zero.
⁴ In the sense that the algorithm itself makes no assumptions about the video stream frame rate, colour space, scene content, the background itself or its variability over time.
Figure 2.8: Comparison of a pixel value v(x) with a set of samples M(x) = {v1, v2, ..., v5} in a two-dimensional Euclidean colour space C1C2. Pixel value v(x) is classified as background if the number of samples in M(x) that are within the circle SR(v(x)) is greater than or equal to θ.
2.4.3.1 Background Model and Classification
In ViBe, an individual background pixel x is modelled using a collection of N observed pixel
values, i.e.
M(x) = {v1, v2, ..., vN}, (2.9)
where vi is a background sample value with index i (taken in the previous frames⁵).
Let v(x) be the value of pixel x in a given colour space; then x can be classified based on its
corresponding model M(x) by comparing it to the closest values within the set of samples in
the following way.
Define SR(v(x)) to be a hypersphere of radius R in the given colour space, centred on v(x).
The pixel value v(x) is classified as background if
|{SR(v(x)) ∩M(x)}| ≥ θ, (2.10)
where θ is the classification threshold (see figure 2.8). Barnich and Van Droogenbroeck in
[2] have empirically established the appropriate parameter values as θ = 2 and R = 20 for
monochromatic images.
The purpose of using a collection of samples is to reduce the influence of outliers. A key
insight made by Barnich and Van Droogenbroeck is that classifying a new pixel value
with respect to its immediate neighbourhood in the colour space estimates the distribution of
the background pixels more reliably than typical statistical parameter estimation techniques
applied to a much larger number of samples.
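The classification rule of equation 2.10 reduces to a few lines of code. A minimal illustrative sketch (Python, monochromatic pixel values, hypothetical names), using the θ = 2 and R = 20 parameter values quoted above:

```python
# Illustrative sketch of ViBe's per-pixel classification (equation
# 2.10) for monochromatic values. Hypothetical names; parameters
# theta = 2 and R = 20 as reported by Barnich and Van Droogenbroeck.

def is_background(value, samples, radius=20, theta=2):
    close = 0
    for s in samples:
        if abs(value - s) < radius:   # inside the sphere S_R(v(x))
            close += 1
            if close >= theta:        # early exit once threshold is met
                return True
    return False
```

The early exit mirrors the observation in [2] that the comparison can stop as soon as θ close samples have been found, which keeps the per-pixel cost low.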
5See appendix C.3.1 for precise details on the model initialization for the first frame of the video sequence.
Figure 2.9: An example ViBe background model update sequence demonstrating a fast model recovery in the presence of a ghost (“a set of connected points, detected as in motion, but not corresponding to any real moving object” [38]) and a slow incorporation of real moving objects into the background model.
2.4.3.2 Background Model Update
The background model update method used in ViBe provides three important features:
1. a memoryless update policy (to ensure an exponential monotonic decay of the remaining
lifespan for the individual samples stored in background models),
2. a random time subsampling (to ensure that the time windows covered by the background
pixel models are extended),
3. a mechanism that propagates background pixel samples spatially (to ensure spatial consistency
and to allow the adaptation of the background pixel models that are masked by
the foreground).
Precise details on the background model update (as shown in figure 2.9) are given in the
appendix C.3.2.
2.5 Depth-Based Methods
Akin to the colour-based face detection and tracking approach, a similar two-step process is used
for viewer's head tracking based on the depth data provided by Kinect. Namely, the tracking
process is split into head detection using the Peters-Garstka method [17] and head tracking
using a modified CAMShift algorithm. More details on both of these methods are given in the
subsections below.
Figure 2.10: Viola-Jones integral image based real-time depth image smoothing. The input depth image is preprocessed by a) removing depth shadows, and then is smoothed using b) r = 2, c) r = 4 and d) r = 8, where r is the side length of the averaging rectangle.
Figure 2.11: Kinect depth shadow removal. Images a) and b) show the aligned colour and depth input images from Kinect. Blue areas in the input depth image b) indicate the regions where no depth data is present; image c) is the resulting depth image after depth shadow removal.
2.5.1 Peters-Garstka Head Detector
In 2011, Peters and Garstka [17] introduced a novel approach for head detection and tracking
using depth images.
Their approach consists of three main steps:
• preprocessing of the depth data provided by Microsoft Kinect (“depth shadow” and noise
elimination), described in detail in appendix D.1 and briefly illustrated in figures 2.10
and 2.11,
• detection of the local minima in a depth image and the use of surrounding gradients in
order to identify a head (based on certain prior knowledge about the adult head size),
discussed below,
• postprocessing of the head location, discussed in section 3.6.6.
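As a concrete illustration of the integral-image smoothing used in the preprocessing step (figure 2.10), the following Python sketch builds a summed-area table and computes a border-clamped box average. The details of the actual preprocessing are in appendix D.1; the names and the clamping policy here are illustrative assumptions:

```python
# Illustrative integral-image (summed-area table) box smoothing.
# `img` is a 2D list of depth values; r is the half-side of the
# averaging rectangle. Borders are handled by clamping (assumption).

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y+1][x+1] = img[y][x] + ii[y][x+1] + ii[y+1][x] - ii[y][x]
    return ii

def box_average(ii, x, y, r, w, h):
    # clamp the averaging rectangle to the image borders
    x0, y0 = max(x - r, 0), max(y - r, 0)
    x1, y1 = min(x + r + 1, w), min(y + r + 1, h)
    # any rectangle sum needs only four lookups in the table
    total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
    return total / ((x1 - x0) * (y1 - y0))
```

Once the table is built in one pass, every averaging rectangle costs O(1) regardless of r, which is what makes this smoothing real-time.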
2.5.1.1 Head Detection
After obtaining a smoothed depth image with depth shadows eliminated (as described in
appendix D.1), the viewer's head can be detected using prior knowledge about the typical adult
human head size (20 cm × 15 cm × 25 cm, length × width × height) and its shape. Under
the assumption that the head is in an upright position, but its orientation with respect to the
camera is not known, the inner horizontal bound of the head is chosen to be 10 cm and the outer
horizontal bound is chosen to be 25 cm.
Note that for a given object with dimensions w × h at a distance d from the Kinect sensor, the
width p_w and height p_h in pixels of the area that it occupies on the screen can be calculated
using basic trigonometry, i.e.

(p_w, p_h) = ( (w × r_w) / (d × 2 tan(f_w / 2)), (h × r_h) / (d × 2 tan(f_h / 2)) ),    (2.11)
Figure 2.12: Prior assumptions about the human head shape that are used to detect head-like objects in depth images. The light blue dot is a local minimum on a horizontal scan line.
where (r_w, r_h) is the resolution of the screen and (f_w, f_h) is the horizontal/vertical field of view
of the depth camera⁶.
Using equation 2.11, the inner and outer bounds can be defined as

b_i(d) = (320 px × 10 cm) / (d cm × 2 tan(58°/2)) ≈ 2886.5/d px,
b_o(d) = (320 px × 25 cm) / (d cm × 2 tan(58°/2)) ≈ 7216.2/d px.    (2.12)
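A quick numeric sanity check of equation 2.12 can be written in a few lines. This is an illustrative Python sketch assuming the 320 px wide depth stream and 58° horizontal field of view stated above; the function names are hypothetical:

```python
# Illustrative evaluation of the pixel-size bounds of equation 2.12.
# Assumes a 320 px wide depth stream with a 58-degree horizontal FOV;
# `d_cm` is the distance to the object in centimetres.
import math

def pixel_width(object_width_cm, d_cm, res_px=320, fov_deg=58.0):
    return (object_width_cm * res_px) / (
        d_cm * 2 * math.tan(math.radians(fov_deg) / 2))

def inner_bound(d_cm):
    return pixel_width(10, d_cm)   # 10 cm inner horizontal bound

def outer_bound(d_cm):
    return pixel_width(25, d_cm)   # 25 cm outer horizontal bound
```

At d = 100 cm this gives b_i ≈ 28.9 px and b_o ≈ 72.2 px, matching the ≈ 2886.5/d and ≈ 7216.2/d approximations.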
Then for each horizontal line v′, consider a local minimum point u′ for which
• all depth values within the inner bounds have a smaller depth difference than 10 cm from
u′, and
• depth values at the outer bounds have a larger depth difference than 20 cm from u′ (see
figure 2.12 for an illustration).
More formally, find the point u′ on the line v′ such that Ir(u′, v′) is a local minimum, and the
inequalities
Ir(u′ + f, v′)− Ir(u′, v′) < 10 cm,
Ir(u′ − f, v′)− Ir(u′, v′) < 10 cm,
(2.13)
⁶ PrimeSense PS1080 SoC Reference Design 1.081 (http://www.primesense.com/en/press-room/resources/file/4-primesensor-data-sheet) states a 58° horizontal and 45° vertical field-of-view, which nearly corresponds to Peters and Garstka's empirically measured horizontal FOV of 61.66° [17].
hold ∀f ∈ {1, 2, ..., b_i(Ir(u′, v′))/2}, and

Ir(u′ + b_o(Ir(u′, v′))/2, v′) − Ir(u′, v′) > 20 cm,
Ir(u′ − b_o(Ir(u′, v′))/2, v′) − Ir(u′, v′) > 20 cm.    (2.14)
To match the sides of the head and the vertical head axis more accurately, for each local
minimum u′ satisfying the criteria above calculate the positions u1 and u2 of the lateral
gradients, where the 20 cm threshold difference to the local minimum is exceeded, i.e. find u1
and u2 such that

Ir(u1, v′) − Ir(u′, v′) ≤ 20 cm,
Ir(u1 − 1, v′) − Ir(u′, v′) > 20 cm,
Ir(u2, v′) − Ir(u′, v′) ≤ 20 cm,
Ir(u2 + 1, v′) − Ir(u′, v′) > 20 cm    (2.15)

and use the arithmetic mean u(v′) = (u1 + u2)/2 as a possible point on the vertical head axis.
Furthermore, assume that the head height should be at least 25 cm. To calculate the required
head height in pixels, let n be the number of subsequent lines on which the points
u are found. If u(v′) is found for the current line v′, increment n; otherwise set n = 0.
The average distance to the points found in the last n subsequent lines can be calculated
using

d = (1/n) ∑_{i=0}^{n−1} Ir(u(v′ − i), v′ − i),    (2.16)

then the number of lines required for this average distance is

n_max = (25 cm × 240 px) / (d cm × 2 tan(45°/2)) ≈ 7242.6/d px.    (2.17)
If n ≥ n_max, then the center of a head is treated as detected at coordinates

(x_c, y_c) = ( (1/n) ∑_{i=0}^{n−1} u(v′ − i), v′ − n/2 ),    (2.18)

where v′ is the current horizontal line.
An example result of head detection using this method is shown in figure 2.13.
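The per-scan-line criteria of inequalities 2.13–2.15 can be sketched as follows. This is an illustrative Python fragment under simplifying assumptions: `depth` is one smoothed horizontal depth line in cm, `u` a local minimum index, and `b_inner`/`b_outer` the pixel bounds of equation 2.12 evaluated at that minimum; all names are hypothetical and this is not the dissertation's implementation:

```python
# Illustrative per-scan-line head test (inequalities 2.13-2.15).
# `depth` is one horizontal line of smoothed depth values in cm.

def is_head_candidate(depth, u, b_inner, b_outer):
    d0 = depth[u]
    half_in, half_out = b_inner // 2, b_outer // 2
    if u - half_out < 0 or u + half_out >= len(depth):
        return False
    # inequalities 2.13: everything within the inner bound stays
    # within 10 cm of the local minimum's depth
    for f in range(1, half_in + 1):
        if depth[u + f] - d0 >= 10 or depth[u - f] - d0 >= 10:
            return False
    # inequalities 2.14: depth falls off by more than 20 cm at the
    # outer bounds
    return depth[u + half_out] - d0 > 20 and depth[u - half_out] - d0 > 20

def head_axis_point(depth, u):
    # equation 2.15: walk outwards to the lateral 20 cm gradients and
    # return their mean as a point on the vertical head axis
    d0 = depth[u]
    u1 = u
    while u1 - 1 >= 0 and depth[u1 - 1] - d0 <= 20:
        u1 -= 1
    u2 = u
    while u2 + 1 < len(depth) and depth[u2 + 1] - d0 <= 20:
        u2 += 1
    return (u1 + u2) / 2
```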
Figure 2.13: Head detection using the Garstka and Peters approach. Image a) shows the detected head rectangle (in yellow) overlaid on top of the colour input image; image b) shows the detected head rectangle overlaid on the smoothed (using r = 2) depth image with depth shadows removed. In both images, white pixels represent the local horizontal minima which satisfy inequalities 2.13 and 2.14.
2.5.2 Depth-Based Head Tracker
After the head is localized by the Garstka and Peters head detector, a modified CAMShift
algorithm is used to track the head. The motivation for this approach stems from the fact that
one of the main assumptions made by Garstka and Peters (viz. that there is a single head-like
object present in the depth frame) ceases to hold in an unconstrained environment.
When this assumption breaks down, the vertical head axis is localized incorrectly and the
track of the head is subsequently lost.
To mitigate this problem, the criteria that Garstka and Peters used to reject those horizontal
local minima which could not possibly lie on the vertical head axis (equations 2.13, 2.14) are now
used to obtain the “face” probability in the CAMShift tracker.
More precisely, instead of using histogram backprojection to obtain
Pr(“I(x′, y′) belongs to a face”), define a degenerate “face” probability

Pr(“I(x′, y′) belongs to a face”) = { 1, if Ir(x′, y′) is a local minimum on the line y′,
                                        and inequalities 2.13 and 2.14 hold,
                                      0, otherwise.    (2.19)
Since the only non-zero probability pixels in the search-window are likely to be positioned on the
vertical head axis, the function that is used to obtain the size of the next search window is also
updated to s = 4√M00 (where the multiplicative constant was established empirically).
This re-definition of the “face” probability ensures that even when other head-like
objects are present in the frame, the CAMShift algorithm will keep tracking the head which was
Figure 2.14: Head tracking using depth information. Images a), b) and c) show the head rectangle (in yellow) overlaid on top of the colour input image. White pixels represent non-zero “face” probabilities derived from the depth image using prior knowledge about the human head shape (section 2.5.1), which are then tracked using the CAMShift algorithm (section 2.4.2).
initially detected. An example of this method in action is shown in figure 2.14.
Since the rest of the depth-based head-tracking algorithm continues in the same manner as
the colour-based face tracking using CAMShift, the remaining details can be found in section
2.4.2.
2.6 Summary
Agile methodologies have been applied in the project’s requirement analysis (and later, in
project’s design, implementation and testing). During the planning and research phase, the
main requirements of the overall system were broken down into i) a real-time “head-tracking
in 3D” library, and ii) a 3D game simulating pictorial and motion parallax depth cues. The
former requirement was identified during risk analysis as carrying the highest uncertainty,
hence a significant amount of time was spent researching face/head detection and tracking
techniques. Ultimately, a combination of de facto standard methods in
the industry (like Viola-Jones face detection or CAMShift face tracking), novel techniques (like
ViBe background subtraction and Peters-Garstka depth-based head detection) and self-designed
methods (like depth-based head tracking) were chosen to be used in the project.
Chapter 3
Implementation
This chapter provides details on how the algorithms and theory from Chapter 2 are implemented
to achieve the project's main aims. It starts by discussing the development environment,
languages and tools used; it then introduces a high-level architectural breakdown of
the system into large components (“Viola-Jones Face Detector”, “Head-Tracking Library” and
a “3D Display Simulator”), and finally it discusses the implementation of these individual
components.
3.1 Development Strategy
Early in the project, a decision was made to implement all required algorithms and methods
from scratch.
While there are numerous open-source computer vision libraries, they are primarily developed
to deal with colour data (e.g. OpenCV), and isolating only the face detection and tracking
routines from these large libraries is a complex and time-consuming task.
It was deemed that extending these large libraries in multiple different ways to use depth
information (see [10] for Viola-Jones extensions using depth, or section 2.5.1 for depth-based
head detection and tracking) involves higher risk than implementing a single-purpose, cohesive
head tracker.
3.2 Languages and Tools
3.2.1 Libraries
To obtain the depth and colour data from Microsoft Kinect, a free (for non-commercial projects)
Kinect SDK 1.0 Beta 2 library [33] (released by Microsoft in November 2011) was chosen. While
there are some alternative open-source libraries that can extract depth and colour data from
the Kinect sensor, they are not officially supported by the manufacturer and hence were not
used to avoid various compatibility issues.
To render the depth cues, the OpenGL library was chosen as the de facto industry standard for 3D
graphics.
CHAPTER 3. IMPLEMENTATION 23
Figure 3.1: (Partial) GANTT chart showing project’s status as of 18/12/2011.
3.2.2 Development Language
C# was chosen as the main development language because it provides the typical advantages of a
third-generation programming language (machine independence, human readability) combined
with object-oriented programming benefits (cohesive and decoupled program modules, clear
separation between contracts and implementation details, code re-use through inheritance and
polymorphism, and so on). It also has a number of advanced programming constructs, like
events, delegates, extension/generator methods, SQL-like native data querying capabilities and
lambda expressions.
Furthermore, it provides features that are missing from Java (like value types, operator over-
loading or reified generics) and has a stronger tool support for GUI development. Finally, it
is supported by Kinect SDK (which is targeting .NET Framework 4.0) and OpenGL (using a
wrapper for C# called OpenTK [34]).
3.2.3 Development Environment
One of the requirements of the Kinect SDK is a Windows 7/8 OS, hence a Windows-based development
environment had to be chosen. Since the Visual Studio IDE fully supports the development
language C# and also features built-in code versioning, code testing and GUI development
environments, it was used across the whole project.
3.2.4 Code Versioning and Backup Policy
The Apache Subversion (SVN) version control system was used for source code and dissertation
version control and immediate back-up, with a centralized SVN repository set up in the PWF
file system.
Also, a weekly backup strategy was established, where the source code and the dissertation
were mirrored once a week on two 16 GB USB flash drives to protect from data loss.
3.3 Implementation Milestones
Milestones of the project proposal (see appendix I) were carefully followed (except for a single
design change outlined in the progress report, viz. replacing the 3D hemisphere-fitting depth
tracker [44] with the Garstka and Peters depth tracker). Any minor delays were covered by the
“slack” time planned in the project proposal.
A snapshot of the project status as of 18/12/11 is shown in GANTT chart in figure 3.1.
3.4 High-Level Architecture
The overall project can be split into the independent development components shown in figure
3.2. Each of these components is discussed in more detail in the following sections.
Figure 3.2: UML 2.0 component diagram of the system’s high-level architecture.
3.5 Viola-Jones Detector Distributed Training Framework
According to Viola and Jones [41], the training time for their 32-layer detector “was in the
order of weeks”. Similarly, according to [27], “the training of the cascade which is used by
the detector turned out to be very time consuming” and “the [17-layer] cascade was never
completed”.
To mitigate the time complexity of detector cascade training, a decision was made to
exploit the processing power of PWF (Public Workstation Facility) machines¹ available at the
University of Cambridge Computer Laboratory's Intel lab. A distributed training framework
targeting the Microsoft .NET 2.0 framework (available on PWF) was designed and implemented,
¹ Running Windows XP OS on Intel Core 2 Q9550 Quad CPU @ 2.83 GHz with 3.21 GB of RAM.
Figure 3.3: PWF machines at the University of Cambridge Computer Laboratory's Intel Lab distributedly training a Viola-Jones detector. Special care was taken to ensure that PWF machines would only be used for training when they were not needed by other people (i.e. most of the training was done during the weekends and term breaks), and that training would not interfere with regular PWF user log-ons.
which trained a 22-layer cascade containing 1,828 decision stumps in 20 hours, 15 minutes and
2 seconds.
While the performance of the training framework is further discussed in section 4.1.2, it is worth
mentioning that the best-performing rectangle feature selection time was reduced from
nearly 16 minutes in a naïve single-threaded, single-CPU implementation (which would require
more than three weeks to train a 1,828-feature cascade) to an average of 38.39 seconds per
feature in a distributed multi-threaded implementation using 65 CPU cores².
The two most time-consuming tasks were parallelized: best weak classifier selection (out of
162,336 rectangle features) when building a strong classifier, and the false positive training
image bootstrapping for each layer of the cascade³.
² The amount of parallel processing was limited by the number of simultaneous logins (19) allowed by the PWF security policy. Out of 19 machines, 18 were running 4 training client instances each; one additional machine was running one server instance and one training client instance.
³ The computational complexity of these tasks is best illustrated by the numbers: each of the 162,336 rectangle features has to be evaluated on each of the 9,916 training images (as described in section 4.1.1), and the best-performing decision stump has to be selected out of those. This process of adding best-performing weak classifiers has to be repeated until the individual layer false positive rate and detection rate objectives are met, and new layers have to be added until all training data is learned (in total, 1,828 decision stumps were added). Similarly, 5,000 false positive training images have to be bootstrapped for each new layer of the cascade; as the cascade grows, the effort required to find false positive images increases exponentially.
Figure 3.4: UML 2.0 deployment diagram of the Viola-Jones distributed training framework architecture.
The architecture and the implementation details of this distributed training framework are
described below.
3.5.1 Architecture
To provide a better understanding of how the tasks and the main training data are physically
distributed, a deployment diagram of the distributed training framework is shown in figure 3.4.
As shown in this diagram, two separate communication channels are used: TCP/IP and CIFS
(Common Internet File System, also known as SMB, Server Message Block).
A standard client-server architecture with a “star” topology (with the server at the center) is used
for the framework. This arrangement greatly simplifies work coordination and makes it easier
to ensure strict consistency of training results.
To avoid bottlenecking the server's Ethernet link, the following rule of thumb is applied: short
messages between the server and clients are transmitted over TCP/IP, while CIFS is used for
large data exchanges.
3.5.2 Class Structure
Due to space constraints, a detailed class structure of the framework is given in appendix
E. In particular, the class diagram⁴ is shown in figures E.1 and E.2 and, while the purpose of
⁴ Note that all class and component diagrams given in this chapter have been simplified for the convenience of the reader. The implementation follows the agile methodologies' “self-documenting” code principle, hence the author hopes that some insight into the purpose and responsibilities of individual classes/components can be obtained by examining the names and signatures of the functions that they provide.
the classes should be self-explanatory from the method signatures, the main responsibilities of
the most important individual classes are given in table E.1.
All classes were implemented using a defensive programming technique. This proved to be cru-
cially important, since machines repeatedly lost CIFS connections to DS-Filestore, experienced
TCP/IP connection time-outs under high network load, were forcefully restarted both to install
updates and by other Intel lab users, and so on.
3.5.3 Behaviour
The high-level communication sequence between the server and the clients is given in figure
3.5.
Figure 3.5: UML 2.0 sequence diagram of the high-level communications between the server and clients in the Viola-Jones distributed training framework. The bounded “while” rectangle corresponds to line 4 in the Build-Cascade algorithm given in C.1.3.1.
The two most time-consuming tasks (false positive training image bootstrapping and weak
classifier boosting using AsymBoost) were both multi-threaded and distributed between clients.
The interactions between the clients while performing these tasks are coordinated in the
following way: immediately after the connection is established between the server and the client,
the server sends the client the indices of the high-resolution negative training images⁵ which that
particular client should use to bootstrap detector-resolution false positive training images for
each layer of the cascade.
⁵ As shown in the deployment diagram 3.4, all negative training images reside on the DS-Filestore and are accessed through CIFS.
After receiving the “start false positive image bootstrapping” command, a client obtains a copy
of the current detector cascade and repeatedly executes algorithm 3.5.3.1.
Algorithm 3.5.3.1 Single false positive training image bootstrapping. It requires a high-resolution negative training image Ii, an exhaustive array of triples Ai = [(x0, y0, size0), ..., (xn, yn, sizen)] describing all possible locations and sizes of bootstrapping samples for image Ii, and a current detector cascade Ct(~x). The result of this algorithm is either a single false positive training image, or Nil if no such image could be found.
False-Positive-Training-Image-Bootstrapping(Ii, Ai, Ct(~x))
 1  while Ai.length > 0
 2      // Generate a random sample index.
 3      r ← Random-Between(0, Ai.length − 1)
 4      // Acquire and resize the selected sample.
 5      ~x_current ← Resize(Sample(Ii, Ai[r]), Base-Resolution)
 6      // If the negative sample is misclassified as a face, return it.
 7      if Ct(~x_current) = 1
 8          return ~x_current
 9      // Otherwise, put the last sample into the current sample's place and
10      // decrement the array length marker.
11      Ai[r] ← Ai[Ai.length − 1]
12      Ai.length ← Ai.length − 1
13  return Nil
When a false positive training image is bootstrapped, its standard deviation σ is calculated
and stored in the NegativeTrainingImage class. The standard deviation is then used to inversely
scale the values of rectangle features, normalizing the variance of all false positive training
images and hence minimizing the effect of different lighting conditions. It is worth mentioning
that σ can be efficiently calculated using the integral image technique (see section 2.4.1): define
I2 to be the squared integral image; then
σ = √( E[I2] − (E[I])² ),

where E[I2] = I2(h, w)/(h × w) and E[I] = I(h, w)/(h × w), with h, w being the height and
width (respectively) of the false positive training image.
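As a sanity check of the formula above, here is a minimal illustrative sketch (Python, hypothetical names). It computes the window sums directly, whereas the actual implementation reads the equivalent totals I(h, w) and I2(h, w) in O(1) from the corner entries of the integral images:

```python
# Illustrative computation of a training window's standard deviation
# via sigma = sqrt(E[I2] - E[I]^2). `img` is a 2D list of pixel
# values; the sums below correspond to the integral-image corner
# values I(h, w) and I2(h, w).

def window_sigma(img):
    h, w = len(img), len(img[0])
    n = h * w
    s = sum(sum(row) for row in img)                    # I(h, w)
    s2 = sum(sum(p * p for p in row) for row in img)    # I2(h, w)
    mean = s / n                                        # E[I]
    return (s2 / n - mean * mean) ** 0.5
```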
Each bootstrapped false positive training image is then sent back to the server, which assembles
them into a new negative training image set. Figure 3.6 explains this interaction pictorially.
Similarly, figure 3.7 shows the interactions between the server and clients in the weak classifier
boosting task (based on AsymBoost algorithm described in section C.1.1.1).
Figure 3.6: UML 2.0 interaction overview diagram of the distributed false positive training image bootstrapping.
Figure 3.7: UML 2.0 interaction overview diagram of the distributed weak classifier boosting using the AdaBoost algorithm (see C.1.1.1) with the AsymBoost extension (see C.1.1.1).
3.6 Head-Tracking Library
A good initial insight into the implementation details of the HT3D (Head-Tracking in 3D)
library can be obtained by observing the data flow between its various components, as
demonstrated in figure 3.8.
As shown in figure 3.8, the HT3D library's internal implementation follows a highly modular
design, with cohesive, single-purpose components arranged in a “star” topology and decoupled
from each other.
These design features provide a high degree of flexibility in choosing which information sources
should be used for the viewer's head tracking and how they should be arranged (this proved
crucial for the evaluation chapter, in which the performance of different trackers with different
features enabled was compared).
Furthermore, this design simplified the interchange of components (as shown by the colour-based
background subtractor example) and streamlined testability.
Figure 3.8: Data flow diagram of the implemented HT3D library. Numbers on the arrows indicate the order in which data is passed in a typical head-tracking communication sequence.
The individual components of HT3D library are further discussed below.
3.6.1 Head-Tracker Core
As shown in the data flow diagram (figure 3.8), the head-tracker core orchestrates the various individual head-tracking components and exposes the HT3D library API to the end-user. A detailed head-tracker core class diagram is given in figure E.3, and the responsibilities of the most important classes are discussed in detail in table E.2.1.
Most importantly, the head-tracker combines the outputs of the colour- and depth-based trackers (discussed below) using algorithm 3.6.1.1.
Algorithm 3.6.1.1 Combining colour- and depth-based tracker predictions. Given the inputs C and D (the colour- and depth-tracker output rectangles, respectively), this algorithm returns the combined head center coordinates (or ∅ in case of tracking failure).
Combine-Trackers(C, D)
 1  if C ≠ ∅ ∧ D ≠ ∅
 2      if C ∩ D ≠ ∅
 3          return Rectangle-Center(Average-Rectangle(C, D))
 4      else
 5          Reset colour and depth trackers to “detecting” state.
 6          return ∅
 7  else
 8      if D ≠ ∅
 9          return Rectangle-Center(D)
10      if C ≠ ∅
11          return Rectangle-Center(C)
12      return ∅
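The combination rule above can be sketched in Python (an illustrative re-implementation, not the HT3D C# code; rectangles are modelled as (x, y, w, h) tuples and ∅ as None, with the tracker reset passed in as a callback):

```python
def rect_center(r):
    """Center of an (x, y, w, h) rectangle (Rectangle-Center)."""
    x, y, w, h = r
    return (x + w / 2, y + h / 2)

def average_rect(a, b):
    """Component-wise average of two rectangles (Average-Rectangle)."""
    return tuple((p + q) / 2 for p, q in zip(a, b))

def rects_intersect(a, b):
    """True if two rectangles overlap with non-zero area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def combine_trackers(c, d, reset_trackers=lambda: None):
    """Algorithm 3.6.1.1: combine colour (c) and depth (d) tracker outputs.

    None stands for the empty set (tracking failure); on contradictory
    outputs both trackers are reset to the "detecting" state."""
    if c is not None and d is not None:
        if rects_intersect(c, d):
            return rect_center(average_rect(c, d))
        reset_trackers()
        return None
    if d is not None:
        return rect_center(d)
    if c is not None:
        return rect_center(c)
    return None
```

For instance, two overlapping tracker rectangles (0, 0, 10, 10) and (5, 5, 10, 10) combine to (7.5, 7.5), the center of their average rectangle.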
From the user's point of view, the head-tracker core API exposes the following tracking outputs (via HeadTrackFrameReadyEventArgs):
1. Tracking images, rendering one of the options shown in figure 3.9,
2. Detected face image (from Viola-Jones face detector),
3. Tracked head rectangles (from colour and depth trackers),
4. Combined head center position in pixels,
5. Combined head center position in space w.r.t. Kinect sensor.
The HeadTracker class also exposes a number of head-tracking settings (as shown in figure E.3), allowing the user to tweak the detection and tracking components.
Figure 3.9: HT3D image frame rendering options (a) c = COLOUR FRAME, b) c = DEPTH FRAME, c) c = HISTOGRAM BACKPROJECTION, d) c = BACKGROUND SUBTRACTION, e) c = DEPTH FACE PROBABILITY), enabled by executing headTracker.EnabledRenderingCapabilities[c] = true.
3.6.2 Colour-Based Face Detector
The colour-based face detector component of the HT3D library is mainly responsible for localizing the viewer's face in colour images using a trained Viola-Jones cascade. For this reason, a large part of the distributed Viola-Jones training framework code is reused (in particular, the NormalizedTrainingImage, StrongLearner and StrongLearnerCascade classes, together with the RectangleFeature class hierarchy), as shown in figure 3.10.
A new ViolaJonesFaceDetector class is added, with the main responsibilities of:
• Deserializing the strong learner cascade from XML (obtained from the distributed training framework),
• Providing means to adjust the learner cascade coefficients (pre-multiplying each layer's threshold with a given constant),
• Detecting the viewer's face given the input colour and depth images.
While the implementation of the first two responsibilities is trivial, the latter deserves some
further attention. As discussed by Burgin et al. [10], cues present in depth data can be used
to make face detection faster and more accurate. In particular, the face search space can be
reduced from exploring multiple scales at each pixel, to searching for only plausible face sizes
at a pixel, given its distance from the camera.
This optimization of the exhaustive search is implemented as follows:
Figure 3.10: UML 2.0 class diagram of the colour-based face detector component of HT3D library.
1. Given the aligned colour and depth images (provided by Microsoft Kinect SDK), iterate
through the pixels in the colour image using a step size ∆ = 3 px.
2. For each pixel assume that a potential face is centred there. Set the face height upper
and lower bounds to 40 cm and 20 cm respectively, and use equation 2.11 to estimate the
face height upper (hu) and lower (hl) bounds in pixels.
3. Run Viola-Jones face detector starting at hl resolution, using a scaling factor s = 1.075
to increment the resolution, until hu upper bound is reached or a face is detected.
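The depth-to-pixel conversion of step 2 (equation 2.11 in the dissertation) is not reproduced here; the sketch below substitutes a simple pinhole-camera model h_px = f·H/Z with a hypothetical focal length f (in pixels), and enumerates the detector window sizes of step 3:

```python
def face_height_bounds_px(depth_m, focal_px=525.0, h_min_m=0.20, h_max_m=0.40):
    """Plausible face-height bounds (in pixels) at a given depth, using a
    pinhole model h_px = focal_px * H / Z. The focal length is a hypothetical
    stand-in; the dissertation's own conversion is its equation 2.11."""
    return focal_px * h_min_m / depth_m, focal_px * h_max_m / depth_m

def detector_scales(h_l, h_u, base_px=24, s=1.075):
    """Step 3: detector window heights from h_l up to h_u, grown by factor s,
    never below the 24 px resolution the cascade was trained at."""
    sizes = []
    size = max(h_l, base_px)
    while size <= h_u:
        sizes.append(size)
        size *= s
    return sizes
```

At a depth of 2 m this yields pixel bounds of roughly 52.5 to 105 px, so only a handful of scales need to be searched at that pixel instead of the full scale pyramid.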
One of the important points in the detection algorithm outlined above is that the face detection is triggered as soon as one of the sub-windows in the image passes through the detector cascade. The reason why the search is terminated immediately at that point is the main assumption given in the problem constraints (viz. that only a single viewer is present).
Another point to note is that scaling is achieved by scaling the face detector itself, and not the input image. More precisely, given a weak classifier as described in section 2.4.1.3, its scaled and variance-normalized version h_{i,s,σ}, which takes a Haar-like feature f, a threshold θ and a polarity p, and returns the class of an input image x⃗ (where s is the scale and σ is the standard deviation of the input image), can be defined as

    h_{i,s,σ}(x⃗, f, p, θ) = 1 if p·f(x⃗) < s²·σ·p·θ, and 0 otherwise.    (3.1)
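The scaled decision stump of equation 3.1 can be sketched as follows (the Haar-like feature value f(x⃗) is taken here as a precomputed number, rather than evaluated from an integral image):

```python
def scaled_weak_classifier(f_x, p, theta, s, sigma):
    """Decision stump of eq. 3.1. f_x is the precomputed Haar-like feature
    value f(x); scaling the threshold by s**2 compensates for the feature's
    area growth, and sigma variance-normalizes against the sub-window's
    contrast. p (the polarity) is +1 or -1; theta is the learned threshold."""
    return 1 if p * f_x < s * s * sigma * p * theta else 0
```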
3.6.3 Colour-Based Face Tracker
Figure 3.11: UML 2.0 class diagram of the colour-based face tracker component of HT3D library.
The class diagram of the colour-based face tracker is shown in figure 3.11.
The main responsibilities of the CamShiftFaceTracker and CamShiftFaceTrackerUsingSaturation classes are:
• Computing the “face” probability distribution (described in detail in 2.4.2.1),
• Calculating the face centroid and the search window size (described in C.2.2),
• Generating a face probability bitmap, as shown in figure 2.7.
While the implementation of these responsibilities closely follows the theory given in the relevant subsections of section 2.4.2, two main implemented extensions are worth mentioning separately:
• Tracking using a two-dimensional histogram from the hue-saturation colour space (based on the ideas in [1]). This extension is implemented to mitigate one of the well-known deficiencies of the CAMShift algorithm, viz. the inclusion of the background region if it has a similar hue to the object being tracked.
Figure 3.12 shows the probability images obtained using one- and two-dimensional histogram backprojections in equivalent tracking conditions.
Since the CamShiftFaceTrackerUsingSaturation class inherits from CamShiftFaceTracker, most of the standard CAMShift tracker code is reused and, more importantly, the extended tracker becomes interchangeable in place of the old one because of inheritance covariance.
• As described by Bradski in [8], a large amount of hue noise in HSV space is introduced when the brightness is low (as can be seen from figure 2.6). Similarly, small changes in the colour of low-saturated pixels in RGB space can lead to large swings in hue. For this reason, brightness (value) and saturation cut-off thresholds (θv and θs respectively) are
Figure 3.12: Probability images obtained from an input image b) using c) hue and d) hue-saturation histograms (brighter colour indicates a higher probability for the pixel to be part of the face; both histograms initialized with a) the output of the Viola-Jones detector shrunk by 20%). As shown in picture d), using a two-dimensional histogram built in hue-saturation colour space would allow the tracker to maintain track of the object even when the background has a similar hue.
introduced: if the brightness or saturation of a given pixel is below these thresholds, the pixel is ignored when building the colour histograms.
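The cut-off rule can be sketched as follows (a pure-Python stand-in for the library's histogram builder; the bin counts and default thresholds are illustrative, not the values used by HT3D):

```python
def build_hs_histogram(pixels, theta_s=0.1, theta_v=0.2, h_bins=30, s_bins=32):
    """Two-dimensional hue-saturation histogram. Pixels are (h, s, v) with
    h in [0, 360) and s, v in [0, 1]; pixels whose saturation or value fall
    below the cut-off thresholds are skipped, since their hue is dominated
    by noise at low brightness or low saturation."""
    hist = [[0] * s_bins for _ in range(h_bins)]
    for h, s, v in pixels:
        if s < theta_s or v < theta_v:
            continue  # unreliable hue: ignore when building the histogram
        h_idx = min(int(h / 360.0 * h_bins), h_bins - 1)
        s_idx = min(int(s * s_bins), s_bins - 1)
        hist[h_idx][s_idx] += 1
    return hist
```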
3.6.4 Colour- and Depth-Based Background Subtractors
Figure 3.13: UML 2.0 class diagram of the colour and depth background subtractor components of HT3D library.
Figure 3.14: Depth-based background subtractor operation. If the depth-based head tracker is locked onto the viewer's head in the input image a) (yellow rectangle), then the image can be segmented into background and foreground using the pixel's distance from Kinect as a decision criterion. In particular, if a given pixel is further away than the viewer's head center, then it is classified as background (black); otherwise it is classified as part of the foreground (white), as shown in image b).
Colour- and depth-based background subtractors share a common abstract ancestor class BackgroundSubtractor, which is responsible for creating a background segmentation bitmap (using concrete background subtractor implementations) given a certain background subtraction sensitivity (again, dependent on the concrete implementation). Because of this design, all background subtractors are interchangeable, and mock background subtractors can be used to test the library.
Two concrete colour-based subtractors are implemented: a ViBe background subtractor (ViBeBackgroundSubtractor class) and a Euclidean-distance-thresholding background subtractor (EuclideanBackgroundSubtractor class). Similarly, a depth-based DepthBackgroundSubtractor class is implemented, albeit serving a slightly different purpose: to increase the speed and the accuracy of the colour-based face detector and tracker⁶.
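The interchangeable design can be sketched with an abstract base class and a Euclidean-distance subtractor (illustrative Python rather than the C# class hierarchy; the per-pixel frame/background representation is an assumption of this sketch):

```python
from abc import ABC, abstractmethod

class BackgroundSubtractor(ABC):
    """Common interface: concrete subtractors decide per-pixel foreground
    membership; the meaning of `sensitivity` is implementation-dependent."""
    def __init__(self, sensitivity):
        self.sensitivity = sensitivity

    @abstractmethod
    def is_foreground(self, pixel, background_pixel):
        ...

    def segment(self, frame, background):
        """Background segmentation bitmap (True = foreground)."""
        return [self.is_foreground(p, b) for p, b in zip(frame, background)]

class EuclideanBackgroundSubtractor(BackgroundSubtractor):
    """Foreground where the RGB Euclidean distance to the background model
    exceeds the sensitivity threshold."""
    def is_foreground(self, pixel, background_pixel):
        dist = sum((a - b) ** 2 for a, b in zip(pixel, background_pixel)) ** 0.5
        return dist > self.sensitivity
```

A mock subtractor for testing only needs to override is_foreground, mirroring the interchangeability described above.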
3.6.5 Depth-Based Head Detector and Tracker
Since depth-based head detection and tracking methods are based on the same priors about
the human head shape, both the detection and tracking functionality is provided by the
DepthHeadDetectorAndTracker class (as shown in figure 3.15).
The main responsibilities of the DepthHeadDetectorAndTracker class are:
• Preprocessing of depth information (depth shadow elimination and integral-image based real-time depth image blurring),
⁶ Due to space constraints, further background subtractor implementation details are given in appendix E.2.2.
Figure 3.15: UML 2.0 class diagram of the depth-based head detector and tracker.
• Viewer's head detection using the Peters and Garstka method (implemented closely following section 2.5.1),
• Head tracking using a modified CAMShift algorithm (implemented following section 2.5.2).
The DepthHeadDetectorAndTracker class exposes the means to set the integral-image based blur radius r (as shown in figure 2.10) and to enable/disable the depth shadow elimination (as shown in figure 2.11).
The only slight deviation from the theory described in section 2.5.1 when implementing the head detector is that the minimum head height requirement is relaxed from 25 cm to 15 cm (increasing the detection rate).
While this modification can also result in a higher number of false positives, using a modified CAMShift tracker (as described in 2.5.2) to track the detected head prevents this from happening in practice. In particular, the search window expands to the whole head area if the regions with high “face” probability are connected, or degenerates to the minimum if two different physical objects were detected as a single object (since the first moment becomes relatively small compared to the initial search window size).
3.6.6 Tracking Postprocessing
Noise in depth and colour images creates instabilities when tracking the viewer's face/head (i.e. even if the viewer is not moving between consecutive frames, the detected head/face positions might differ slightly).
The noise sources present in depth images are briefly discussed in section D.1.2.
The main noise sources present in colour images produced by Kinect’s RGB camera are:
Figure 3.16: UML 2.0 class diagram of the tracking post-processing filters used in HT3D library.
• photon shot noise (a spatially and temporally random phenomenon arising due to Poisson-like fluctuations with which photons arrive at sensor elements),
• sensor read noise (voltage fluctuations in the signal processing chain from the sensor element readout, to ISO gain and digitization) and quantization noise (analogue voltage signal rounding to the nearest integer value in the ADC),
• pixel response non-uniformity, or PRNU (differences in sensor element efficiencies in capturing and counting photons, due to the variations in their manufacturing), and so on.
To help mitigate these face-/head-tracking noise issues, two simple filter classes are implemented:
• The ImpulseFilter class serves as an exponentially weighted moving average (EWMA) implementation of an infinite impulse response (IIR) filter, attenuating low-amplitude jitter in the head movements. Given the input vector x_t, the filtered value x̂_t is obtained by calculating

    x̂_t = (1 − α)·x̂_{t−1} + α·x_t,    (3.2)

where α is the smoothing (attenuation) factor. The initial value x̂_0 is equal to the first value of x obtained, i.e. x̂_0 ≜ x_0.
• The HighPassFilter class implements a discrete-time RC high-pass filter, which is used to smooth out the transitions when one of the trackers loses the track of the head/face. Given the input vector x_t, the high-pass filtered value x̂_t is obtained by calculating

    x̂_t = β·x̂_{t−1} + β·(x_t − x_{t−1}),    (3.3)

where β is the smoothing factor. The initial value x̂_0 is equal to the first value of x obtained, i.e. x̂_0 ≜ x_0.
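Equations 3.2 and 3.3 translate directly into code; the scalar Python sketch below is illustrative (the library applies the filters to rectangle and centroid coordinates):

```python
class ImpulseFilter:
    """EWMA / IIR low-pass of eq. 3.2: x̂_t = (1 − α)·x̂_{t−1} + α·x_t."""
    def __init__(self, alpha):
        self.alpha = alpha
        self.state = None

    def filter(self, x):
        if self.state is None:
            self.state = x  # x̂_0 := x_0
        else:
            self.state = (1 - self.alpha) * self.state + self.alpha * x
        return self.state

class HighPassFilter:
    """Discrete-time RC high-pass of eq. 3.3:
    x̂_t = β·x̂_{t−1} + β·(x_t − x_{t−1})."""
    def __init__(self, beta):
        self.beta = beta
        self.state = None
        self.prev_input = None

    def filter(self, x):
        if self.state is None:
            self.state = x  # x̂_0 := x_0
        else:
            self.state = self.beta * self.state + self.beta * (x - self.prev_input)
        self.prev_input = x
        return self.state
```

With α = 0.5, a step from 10 down to 0 decays geometrically (10, 5, 2.5, …), which is exactly the jitter-attenuating behaviour described above.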
In the overall HT3D architecture, the ImpulseFilter class is used to reduce the noise in the output face-/head-tracking rectangles returned by the colour and depth trackers. If both trackers are locked onto the viewer's head, then the HeadTracker calculates the location of the head centroid as the arithmetic average of the two rectangle centers; otherwise the centroid location is equal to the center of the tracking rectangle (if there is one).
To avoid a sudden face centroid jump if one of the trackers loses the track of the face, a HighPassFilter is used. In particular, the high-pass filtered frame-by-frame change in centroid positions is subtracted from the predicted face centroid position in that particular frame to obtain the final prediction (which is returned to the user of the library).
3.7 3D Display Simulator
Figure 3.17: UML 2.0 class diagram of the 3D display simulation program.
In the final part of the project's implementation, the HT3D library is used to simulate horizontal and vertical motion parallax in a 3D game. The game is largely based on the Blockout video game (published by California Dreams in 1989), and is essentially an extension of Tetris into the third dimension (hence the name, Z-Tris). The purpose of the game is to solve a real-time packing problem by forming complete layers out of polycubes which are falling into a three-dimensional pit. For this reason, achieving in-game goals requires accurate depth perception.
The high-level class, component and deployment diagrams of the 3D display simulation program
are shown in figures 3.17, 3.18 and 3.19 respectively.
As illustrated in figure 3.17, the 3D display simulator consists of two small UI modules (“3D Simulation Entry Point” and “Head Tracker Configuration”) and a larger model-view-controller-based module (“Z-Tris”).
This break-down into self-contained modules is based on very clear individual responsibilities, as described both below (for the “Z-Tris” module) and in appendix E.3 (for the UI modules).
Figure 3.18: UML 2.0 component diagram of the 3D display simulation program.
Figure 3.19: UML 2.0 deployment diagram of the 3D display simulation program (ZTris.exe) showing the required run-time components and artifacts.
Figure 3.20: An entry-point into the 3D display simulation program (MainForm class).
Figure 3.21: Head-tracker configuration GUI (ConfigurationForm) exposing all available HT3D library options.
Figure 3.22: Screenshot of Z-Tris game: the viewer is looking “down” into the pit, i.e. the active (transparent) polycube is moving away from the player. Depth perception is simulated using occlusions, relative density/height/size, perspective convergence, lighting and shadows, and texture gradient pictorial depth cues.
3.7.1 3D Game (Z-Tris)
As shown in the class diagram in figure E.4, the implementation of the game is based on the MVC (model-view-controller) architectural pattern (incorporating the “Observer”, “Composite” and “Strategy” design patterns) [9]. MVC facilitates a clear separation of concerns and responsibilities, reduces coupling, simplifies the growth of individual architectural units, supports powerful UIs (necessary for the 3D display simulation) and streamlines testing.
Due to the space limitations, the implementations of the “Model”, “Controller” and part of the “View” architectural units are discussed in appendix E.3.3. Since it is very important for the project aims that a number of depth cues are simulated pictorially in the process of rendering the game state (shown in the in-game screenshot 3.22), the depth-cue rendering part of the “View” is discussed below.
3.7.1.1 Generalized Perspective Projection
Occlusions, relative density, height and size, perspective convergence and motion parallax depth
cues are simulated using the off-axis perspective projection, as described in section D.2.1.
Given the viewer's head location in space (obtained by the ConfigurationForm from the HT3D library), the generalized (off-axis) perspective projection matrix G = P·Mᵀ·T can be expressed using the OpenGL projection matrix stack as shown in code listing 3.1 (see section D.2.1 for a notation reminder).
Listing 3.1: Generalized projection matrix implementation in OpenGL.

GL.MatrixMode(MatrixMode.Projection);
GL.LoadIdentity();
GL.Frustum(l, r, b, t, n, f);
Matrix4 MT = new Matrix4(vr.X, vr.Y, vr.Z, 0,
                         vu.X, vu.Y, vu.Z, 0,
                         vn.X, vn.Y, vn.Z, 0,
                            0,    0,    0, 1);
GL.MultMatrix(ref MT);
GL.Translate(-pe.X, -pe.Y, -pe.Z);
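The near-plane extents l, r, b, t passed to GL.Frustum depend on the tracked head position. A common way to derive them (a sketch in the spirit of the off-axis projection of section D.2.1; the screen-corner inputs pa, pb, pc are assumptions of this sketch, not part of the listing above) is:

```python
def frustum_extents(pa, pb, pc, pe, n):
    """Off-axis near-plane extents from screen corners and eye position.

    pa, pb, pc: lower-left, lower-right, upper-left screen corners (world
    space); pe: eye (tracked head) position; n: near-plane distance.
    Returns (l, r, b, t) plus the screen basis (vr, vu, vn)."""
    def sub(a, b): return tuple(x - y for x, y in zip(a, b))
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def norm(a):
        m = dot(a, a) ** 0.5
        return tuple(x / m for x in a)
    def cross(a, b):
        return (a[1]*b[2] - a[2]*b[1],
                a[2]*b[0] - a[0]*b[2],
                a[0]*b[1] - a[1]*b[0])

    vr = norm(sub(pb, pa))   # screen "right" direction
    vu = norm(sub(pc, pa))   # screen "up" direction
    vn = norm(cross(vr, vu)) # screen normal, pointing towards the eye
    va, vb, vc = sub(pa, pe), sub(pb, pe), sub(pc, pe)
    d = -dot(vn, va)         # perpendicular eye-to-screen distance
    l = dot(vr, va) * n / d
    r = dot(vr, vb) * n / d
    b = dot(vu, va) * n / d
    t = dot(vu, vc) * n / d
    return (l, r, b, t), (vr, vu, vn)
```

The returned basis vectors are the rows of the screen-alignment matrix, and the extents become asymmetric as soon as the eye moves off the screen's perpendicular axis, which is what produces the motion parallax effect.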
3.7.1.2 Shading
To simulate the lighting depth cue, the default OpenGL Blinn-Phong shading model [6] is used. Vertex illumination is divided into emissive, ambient, diffuse (Lambertian) and specular components, which are computed independently and added together (summarized in eq. 3.4).
The colour c_v of a vertex v is defined as

    c_v = c_{v,a} + c_{v,e} + Σ_{l ∈ lights} [ attenuation(l, v) × spotlight(l, v) ×
          ( c_{l,a}
            + [ max{ ((l − v) / ‖l − v‖) · v_n, 0 } × c_{l,d} × c_{v,d} ]
            + [ ( max{ ((l + v) / ‖l + v‖) · v_n, 0 } )^{α_v} × c_{l,s} × c_{v,s} ] ) ],    (3.4)

where c_{v,e}, c_{v,a}, c_{v,d}, c_{v,s} are vertex v material's emissive, ambient, diffuse and specular normalized (i.e. between 0 and 1) colours respectively, α_v is the shininess of vertex v, v_n is the normal vector at vertex v, and c_{l,a}, c_{l,d}, c_{l,s} are light l's ambient, diffuse and specular normalized colours respectively.
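For a single light with the attenuation and spotlight factors taken as 1, equation 3.4 can be sketched per colour channel as follows (all names here are illustrative; this is not the fixed-pipeline OpenGL code itself):

```python
def vertex_colour(v, vn, l, mat, light, shininess):
    """Single-light Blinn-Phong vertex colour per eq. 3.4, with the
    attenuation and spotlight factors taken as 1. Colours are (r, g, b)
    tuples in [0, 1]; v, vn, l are the vertex position, vertex normal and
    light position as 3D vectors."""
    def sub(a, b): return tuple(x - y for x, y in zip(a, b))
    def add(a, b): return tuple(x + y for x, y in zip(a, b))
    def mul(a, b): return tuple(x * y for x, y in zip(a, b))
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def norm(a):
        m = dot(a, a) ** 0.5
        return tuple(x / m for x in a)

    diff = max(dot(norm(sub(l, v)), vn), 0.0)               # Lambertian term
    spec = max(dot(norm(add(l, v)), vn), 0.0) ** shininess  # (l + v) term of eq. 3.4
    c = add(mat["ambient"], mat["emissive"])
    c = add(c, light["ambient"])
    c = add(c, tuple(diff * x for x in mul(light["diffuse"], mat["diffuse"])))
    c = add(c, tuple(spec * x for x in mul(light["specular"], mat["specular"])))
    return tuple(min(x, 1.0) for x in c)  # clamp to the normalized range
```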
Between-vertex pixel values are interpolated using Gouraud [20] shading.
In Z-Tris implementation, a scene is lit by a single light source positioned in front of the pit,
in the top left corner of the screen.
Figure 3.23: Screenshot of the scene in Z-Tris game rendered a) without and b) with shadows, generated using the Z-Pass technique.
3.7.1.3 Shadows
Section A.2.1 briefly discusses the relative importance of depth cues. In particular, shadows play an important role in understanding the position, size and geometry of the light-occluding object, as well as the geometry of the objects onto which the shadow is cast [25, 19].
For this reason, a Z-Pass shadow rendering technique using stencil buffers (as described in detail
in section D.2.2) is implemented. The algorithm itself is slightly optimized in the following
way: instead of rendering the unlit scene, projecting shadow volumes and rendering the lit
scene outside the shadow volumes again, a fully-lit scene is rendered first, then the shadow
volumes are projected and a semi-transparent shadow mask is rendered onto the areas within
the shadow volumes (saving one full scene rendering pass).
Figure 3.23 shows the same scene rendered with/without shadows generated using the Z-Pass technique.
3.8 Summary
Based on the chosen development strategy, all required computer vision and image processing methods have been successfully implemented from scratch and integrated into the three main components of the system (“Distributed Viola-Jones Face Detector Training Framework”, “Head-Tracking Library” and “3D Display Simulator”). These components were developed using industry-standard design patterns and software engineering techniques, strictly adhering to the time frame given in the project proposal. In the final system, the output from the training framework (the face detector cascade) has been integrated into the HT3D library, which was then used by the proof-of-concept 3D display application to simulate pictorial and motion parallax depth cues (see http://zabarauskas.com/3d for a brief demonstration of the system in action).
Chapter 4
Evaluation
This chapter describes the evaluation metrics used and the results obtained for all three major architectural components from Chapter 3 (“Viola-Jones Face Detector”, “Head-Tracking Library” and “3D Display Simulator”).
In particular, the Viola-Jones face detector evaluation characterizes the performance of the classifier cascade in terms of the false positive counts for a given detection rate. In the head-tracking library evaluation, the library's performance w.r.t. the average distance and the spatio-temporal overlaps between the tracker's prediction and the tagged ground-truth is analysed. Finally, the evaluation of the 3D display simulation program (Z-Tris) examines its correctness using different types of testing, and describes its run-time performance.
4.1 Viola-Jones Face Detector
4.1.1 Training Data
The positive face training database consisted of 4,916 upright full frontal images (24 × 24 px resolution), obtained from [11] (originally assembled by Michael Jones). A sample of the first 196 images from this set is shown in figure 4.1.
Figure 4.1: First 196 faces from the Viola-Jones face detector positive training image database.
CHAPTER 4. EVALUATION 49
Figure 4.2: Viola-Jones negative training images gathered using “aerial photograph”, “foliage”, “underwater”, “Persian rug” and “cave” search queries.
A collection of 7,960 negative training images (i.e. images not containing faces) for the first layer was obtained from the same source. A further set of 2,384 larger-resolution (994 × 770 px on average) negative training images was manually assembled using the Google Image Downloader [18] tool, using search queries like “Persian rug”, “aerial photograph”, “foliage”, etc. A few examples of such images are shown in figure 4.2.
All training images have been converted from 24 bits-per-pixel (bpp) colour images to 8-bpp grayscale bitmaps using the ImageMagick mogrify command-line tool [26] (to reduce disk/RAM storage requirements), and stored on DS-Filestore.
The amount of training data collected was relatively small compared to the Viola-Jones implementation: 32.9 million non-face sub-windows contained in 2,384 images, compared to the 350 million sub-windows contained in 9,500 images collected by Viola and Jones.
A decision to stop further data mining was made based on the project's time limitations (negative training images downloaded using Google Image Downloader had to be manually verified not to contain faces, which was a laborious and time-consuming task), storage limitations (the DS-Filestore storage quota had been reached by the current negative training image set) and, most importantly, the specifics of the intended use case for the detector being trained.
In particular, the colour-based face detector is used only to initialize the face tracker in the video sequence, hence it is enough for a face to be detected within the first few seconds of use. For a 30 frames/second video rate, the viewer's face only needs to be detected in one of the first few hundred input frames for the viewer not to experience any significant discomfort. The use of the face detector in spatio-temporal viewer tracking is therefore much less stringent than the classical “face detection in still images” task.
Based on this observation, the thresholds of the strong classifiers in the cascade can be increased, sufficiently reducing the false positive rates to compensate for the lack of training data (limited, of course, by the simultaneous reduction in detection rates).
4.1.2 Trained Cascade
Using the distributed training framework, a 22-layer cascade containing 1,828 decision stump classifiers has been trained (see figure 4.4). The first three selected weak classifiers, based on Haar-like features, are shown in figure 4.3.
Figure 4.3: First three Haar-like features selected by the AsymBoost training algorithm as the weak classifiers; a) and b) were selected for the first, c) for the second layer of the cascade.
Figure 4.4: Weak-classifier count growth for different-size (in layers) cascades.
Individual layers of the cascade (strong classifiers) were trained using AsymBoost (as described in section C.1.1.1). Distributed training of the whole cascade on 65 CPU cores (Intel Core 2 Q9550 @ 2.83 GHz) took 20 hours, 15 minutes and 2 seconds.
The breakdown of training time into the main individual tasks is shown in table 4.1. Interestingly, it took more time to distribute the bootstrapped false-positive samples between all clients than to actually bootstrap them using the distributed framework.
Task                                                              Average time (s)
Distributed best weak classifier search                                      38.39
Training data distribution (per layer)                                      137.30
Distributed negative training sample bootstrapping (per layer)               86.20
Distributed negative training sample bootstrapping (per image)               0.0024

Table 4.1: Average execution times for the main distributed Viola-Jones cascade training tasks.
Hit and false alarm rates used for each layer in the cascade are shown in table 4.2.
Layer      1      2      3      4      5      6      7      ≥ 8
Hit rate   0.960  0.965  0.970  0.975  0.980  0.985  0.990  0.995
FP rate    0.625  0.600  0.575  0.550  0.525  0.500  0.500  0.500

Table 4.2: Hit (detection) and false positive rate limits used for each layer of the cascade.
Figure 4.5: Three false positive samples (24 × 24 px), misclassified by the 22-layer Viola-Jones detector cascade.
For each new layer, 5,000 negative training image samples were bootstrapped using the distributed algorithm described in section 3.5.3.
Out of the 32,988,622 negative training samples (obtained from the large-resolution negative training images), only 40 samples were misclassified as faces in the last round of training (three of these samples are shown in figure 4.5).
4.1.3 Face Detector Accuracy Evaluation
To compare the performance of the trained cascade with the Viola-Jones results, the cascade was evaluated on the CMU/MIT [36] upright frontal face evaluation set, containing 511 labelled frontal faces. The receiver operating characteristic (ROC) curves showing the trade-off between the detection and false alarm rates of both cascades are shown in figure 4.6.
As expected, the cascade obtained by Viola and Jones performs significantly better. This performance difference can be attributed to the fact that Viola and Jones used 1,063% more data and trained 16 additional cascade layers with 4,232 additional decision stump classifiers.
Nevertheless, for the purposes of face detection in the context of face tracking, the trained cascade has proven to be completely adequate. In ten minutes of colour and depth recordings for the HT3D (Head-Tracking in 3D) library evaluation (described below), the trained face detector achieved 97.9% face detection precision. This result is illustrated in figure F.3, where all face detections in the HT3D evaluation recordings are shown.
4.1.4 Face Detector Speed Evaluation
As described by Viola and Jones [43], the speed of the detector cascade directly relates to the number of rectangular features that have to be evaluated per search sub-window. Due to the cascaded structure, most of the search sub-windows are rejected very early in the cascade. In particular, for the CMU/MIT set the average number of decision stump weak classifiers evaluated per sub-window is 3.058 (out of 1,828 present in the cascade)¹,².
¹ Cf. 8 weak classifiers on average in the Viola and Jones cascade.
² With the strong classifier rescaling coefficient of 0.4, as used in all HT3D evaluation recordings (see table 4.6).
CHAPTER 4. EVALUATION 53
Figure 4.6: Receiver operating characteristic curves for the distributively trained detector and the detector cascade trained by Viola and Jones. Both ROC curves were established by running the face detector on the CMU/MIT frontal face evaluation set. The face detector search window is shifted by [s∆], where s is the scale initialized to 1.0 (progressively increased by 25%) and ∆ is the shift factor, initialized to 1.0. Duplicate detections are merged if the area of their intersection is larger than half of the area of any individual detection rectangle. To obtain the full ROC curve, the thresholds of individual classifiers are progressively increased for the distributively trained detector, decreasing both the detection rate and the false positive count. A false positive rate can be obtained by dividing the false positive count by 69,055,978.
Figure 4.7: Trained Viola-Jones face detector evaluation tool. Two sample images from the MIT/CMU set are shown; ground-truth is marked with red/blue/white dots, the detector's output (produced using default settings, as given in table 4.6) is shown in green.
A C# implementation of the face detector achieved comparable performance to the one described by Viola and Jones [43]. In particular, the trained detector was able to process a 384 × 288 px image in 0.028 seconds on average (achieving a 35.71 frames-per-second processing rate), using a starting scale s = 1.25 and a step ∆ = 1.5.
While the image processing speed achieved by the trained cascade is 239% faster than the speed described in [43] under similar detector settings, it is unclear how much of this speed-up can be directly attributed to the shorter cascade and the smaller number of weak classifiers evaluated per sub-window³.
4.1.5 Summary
A 22-layer frontal/upright face detector cascade has been successfully trained in a very short timeframe (less than a day) using the distributed Viola-Jones framework implementation. While the performance of the cascade was limited by the amount of training data available (over 32.9 million negative training samples were exhausted), the achieved performance proved to be adequate for the face-tracking tasks. The face detector was also able to process 384 × 288 px input images at 35.71 FPS, making it suitable for real-time applications.
4.2 HT3D (Head-Tracking in 3D) Library
4.2.1 Tracking Accuracy Evaluation
The performance of the 3D display simulation program (the main project aim) crucially depends on accurate localization of the viewer's head in space. To that end, the Kinect SDK is used to obtain the relative location of the point in space corresponding to a speculated head-center pixel, hence accurately finding the head-center pixel coordinates is crucial to the overall project's success.
4.2.1.1 Evaluation Data
At the time of writing this dissertation, no standardized benchmark containing both colour and
depth data for the face tracking evaluation was available.
A set of evaluation data was manually collected using the StatisticsHandler class from the HT3D library (section 3.6.1). All videos in the set were taken to reflect conditions that might naturally occur when a single viewer is observing a 3D display, including head rotations/translations, changing lighting conditions, cluttered backgrounds, occlusions and even multiple viewers present in the frame. In total, 10 minutes of depth and colour data feed from Kinect were recorded at 27.5 average FPS (totalling over 16,000 frames).
³ In particular, it is unclear what speed improvement could have been achieved only by using a faster CPU, because of different operating systems, different implementation programming languages, and so on.
All scenarios covered in this evaluation set are given in table 4.3.
4.2.1.2 External Participant Recordings
Recordings for participants #1 to #5 were taken as part of the “Measuring Head Detection and Tracking System Accuracy” experiment⁴.
Before the experiment, a possible range of head/face muscle motions that can be performed was suggested to each participant. Then each participant was asked to move his/her head in a free-form manner, and two colour and depth videos (each 30 seconds long) were recorded⁵.
4.2.1.3 Ground-Truth Establishment
In order to establish the head position ground-truth in recorded colour and depth videos, a
laborious manual-tagging process is required. To alleviate some of the difficulties associated
with this process, a video tagging tool named “Head Position Tagger” was implemented using
C# for .NET Framework 4.0 (see figure 4.8).
Using this tool, the location of the head in the aligned colour and depth image can be specified
by manually best-fitting an ellipse. The ratio of the minor and major ellipse axes is fixed at 2:3, hence
only two points are needed to fully describe an ellipse (viz. the antipodal points on the major
axis).
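For illustration, the two-point ellipse parametrization can be sketched as follows (a minimal Python sketch with hypothetical names, not the tagging tool's actual C# code):

```python
import math

def ellipse_from_major_axis(p1, p2, axis_ratio=2 / 3):
    """Recover ellipse parameters from the two antipodal points on the
    major axis, assuming a fixed minor:major axis ratio (2:3 here)."""
    (x1, y1), (x2, y2) = p1, p2
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2   # center is the midpoint
    major = math.hypot(x2 - x1, y2 - y1)    # major axis length
    minor = axis_ratio * major              # fixed by the 2:3 ratio
    angle = math.atan2(y2 - y1, x2 - x1)    # orientation, in radians
    return cx, cy, major, minor, angle
```

With the ratio fixed, a single click-and-drag (two points) fully determines position, size and orientation, which is what makes the tagging fast.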
These two points are given using the mouse (in a single click-and-drag motion). The
position/orientation/size of the ellipse can then be further adjusted using the keyboard. Furthermore,
the ground-truth locations are linearly interpolated in between frames, hence only
the start and end ground-truth locations need to be established for spatially-continuous head
motions.
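The keyframe interpolation can be sketched along these lines (a hedged illustration with hypothetical names; ellipses are modelled here as parameter tuples):

```python
def interpolate_ground_truth(start_frame, end_frame, start_ellipse, end_ellipse):
    """Linearly interpolate per-frame ellipse parameters between two
    manually tagged keyframes (assumes end_frame > start_frame; ellipses
    are (cx, cy, major, minor, angle) tuples)."""
    span = end_frame - start_frame
    frames = {}
    for f in range(start_frame, end_frame + 1):
        t = (f - start_frame) / span  # 0.0 at the start key, 1.0 at the end
        frames[f] = tuple(a + t * (b - a)
                          for a, b in zip(start_ellipse, end_ellipse))
    return frames
```

This is why only 17.3% of frames needed manual tags: one keyframe pair covers an entire smooth head motion.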
Using this tool, 2,437 out of 16,489 frames were tagged, accounting for 17.3% of the total
frames (on average, 121.85 frames out of 703 were tagged per video, with σ = 36.61), with
the rest of the frames interpolated. Around 30 minutes were spent on tagging each individual
video.
Based on the main project's assumption (viz. the presence of a single viewer in the image), a single
face was tagged in every frame⁶ (including cases where the viewer's head was partially occluded,
or was partially out of frame).
⁴The experiment consent form describing the manner of the experiment in more detail is given in appendix H.
⁵Recorded videos were kept in accordance with the Data Protection Act and will be destroyed after the submission of the dissertation.
⁶For the “Multiple viewers” scenario, the viewer that was present in the recording for the longest time was tagged.
Scenario                                     Frame count  Length (sec.)  Brief description
F.4  Head rotation (roll)                        839         29.95       Head roll of ±70°.
F.6  Head rotation (yaw)                         828         29.95       Head yaw of ±160°.
F.8  Head rotation (pitch)                       799         29.98       Head pitch of ±90°.
F.10 Head rotation (all)                         812         29.98       Combined head roll, yaw and pitch.
F.12 Head translation (horizontal/vertical)      831         29.95       Head translation in 80% of horizontal FOV, 70% of vertical FOV.
F.14 Head translation (anterior-posterior)       821         29.98       Head translation in 80% of Kinect's depth range.
F.16 Head translation (all)                      822         29.95       Combined horizontal, vertical and anterior-posterior translation.
F.18 Head rotation and translation (all)         787         30.01       Combined head roll, yaw, pitch and horizontal, vertical, anterior/posterior translation (6 degrees-of-freedom).
F.20 Participant #1                              813         29.94       Face occlusion, varying facial expressions, partial head movement out of frame.
F.22 Participant #2                              831         29.96       Varying facial expressions, fast spatial motions.
F.24 Participant #3                              846         29.98       Partial and full face occlusion by hair and hands, fast spatial motions, changing facial expressions.
F.26 Participant #4                              848         29.95       Skin-hued clothing, partial face occlusion, varying facial expressions.
F.28 Participant #5                              828         29.98       Changing facial appearance (removing glasses, releasing the hair), partial face occlusion.
F.30 Illumination (low)                          788         29.98       Difficult lighting conditions (with only the monitor glare illuminating an otherwise dark scene).
F.32 Illumination (changing)                     849         29.98       Single light source moving around the scene.
F.34 Illumination (high)                         843         29.97       Direct sunlight (with depth data only partially present).
F.36 Changing facial expressions                 819         29.88       Drastically changing facial expressions.
F.38 Cluttered similar-hue background            848         29.97       Scene with a skin-hue background and multiple skin-hue objects.
F.40 Occlusions                                  809         29.98       Full head occlusions by multiple skin-hue and head-shaped objects.
F.42 Multiple viewers                            828         29.98       Two spectators present in the scene.
Total:                                        16,497        599.3
Table 4.3: Head-tracking evaluation set. Each recording consists of uncompressed input from Kinect's depth and colour sensors (320 × 240 px / 12 bits-per-pixel and 640 × 480 px / 32 bits-per-pixel respectively), and the aligned colour and depth image (320 × 240 px / 32 bits-per-pixel). The total size of all recordings is 24.3 GB.
Figure 4.8: “Head Position Tagger” tool GUI. Frame 112 of the “Occlusions” recording is being tagged; the head position marker is shown in red.
Figure 4.9: Ground-truth objects tagged by two different annotators in frames 160, 466 and 617
of the “Participant #1” recording. Blue/red ellipses represent objects G₁⁽ᵗ⁾ and Ĝ₁⁽ᵗ⁾ respectively, for
t ∈ {160, 466, 617}.
4.2.1.4 Evaluation Metrics
Three main metrics are used when evaluating the different trackers' performance on the evaluation
set recordings⁷:
• the δ metric, which measures the average normalized distance between the predicted and
ground-truth head centers,
• the STDA metric, which measures the spatio-temporal overlap (i.e. the ratio of the spatial
intersection and union, averaged over time) between the ground-truth and the detected
objects,
• the MOTA/MOTP metrics, which evaluate i) tracking precision as the total error in estimated
positions of ground-truth/detection pairs for the whole sequence, averaged over the total
number of matches made, and ii) tracking accuracy as the cumulative ratio of misses, false
alarms and mismatches in the recording, computed over the number of objects present in
all frames.
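The core quantities behind the δ and overlap-based metrics can be sketched as follows (a simplified illustration only; the exact definitions, including the matching and penalty terms of STDA/MOTA/MOTP, are in appendix F.1):

```python
import math

def delta_metric(pred_centers, gt_centers, gt_sizes):
    """Average distance between predicted and ground-truth head centers,
    normalized by the ground-truth head size in each frame (the idea
    behind the delta metric)."""
    dists = [math.hypot(px - gx, py - gy) / size
             for (px, py), (gx, gy), size
             in zip(pred_centers, gt_centers, gt_sizes)]
    return sum(dists) / len(dists)

def spatio_temporal_overlap(pred_masks, gt_masks):
    """Per-frame intersection-over-union of detected and ground-truth
    pixel sets, averaged over time (the idea behind STDA); masks are
    sets of (x, y) pixel coordinates."""
    ratios = [len(p & g) / len(p | g) if (p | g) else 1.0
              for p, g in zip(pred_masks, gt_masks)]
    return sum(ratios) / len(ratios)
```

In these terms, a δ of 1.0 means the predicted center is, on average, a full head-size away from the true center, while an overlap of 1.0 means the predicted and ground-truth areas coincide exactly in every frame.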
4.2.1.5 Inter-Annotator Agreement
Even humans do not entirely agree about the exact location of the head in an image (especially
for partially occluded or motion-blurred head images). To establish an indication
of the upper limit of the system's performance, two recordings (“Participant #1” and “Participant
#2”) were independently tagged by two annotators (1,644 frames in total).
Inter-annotator agreement was established for the STDA, MOTA, MOTP and δ metrics, with the
tracker output object D_i⁽ᵗ⁾ in the metric definitions replaced by the object tagged by annotator
#2 (denoted Ĝ_i⁽ᵗ⁾), as illustrated in figure 4.9.
The average distance between the head centers as marked by the two annotators was approximately
9.8% of the head size (as indicated by the δ measure). Similarly, an 82.9% spatio-temporal overlap
ratio between the tagged ground-truths was achieved (STDA measure).
⁷See appendix F.1 for full metric descriptions.
Figure 4.10: Inter-annotator δ metric evolution over time for the “Participant #1” recording (annotator #1 ground-truth is arbitrarily used as the baseline).
Complete results for all metrics obtained from the “Participant #1” and “Participant #2” recordings
are listed in table 4.5, and the balance of the fault modes can be seen from the confusion
matrix in table 4.4.
                                      Annotator #1
                             Overlap ≥ 75%   Overlap < 75%
Annotator #2  Overlap ≥ 75%      1,505             46
              Overlap < 75%         93              0

Table 4.4: Inter-annotator confusion matrix (in # of frames). The overlap is measured as the proportion of the area tagged by both annotators versus the area tagged by just a single annotator (i.e. |G₁⁽ᵗ⁾ ∩ Ĝ₁⁽ᵗ⁾| / |G₁⁽ᵗ⁾| and |G₁⁽ᵗ⁾ ∩ Ĝ₁⁽ᵗ⁾| / |Ĝ₁⁽ᵗ⁾|).
4.2.1.6 Evaluation Procedure
All scenarios in table 4.3 were tested using the same set of HT3D head-tracker parameters, as
given in table 4.6.
For each recording, the raw depth and colour streams were loaded using StatisticsHandler and
fed into the HT3D library core along the same data path as used for live data from the Kinect
sensor. The individual colour/depth/combined trackers were initialized at the first frame of the
recording, and the predicted head area/head center coordinates in each frame were serialized to
an XML file.
Recording        Frame count      δ       STDA     MOTA     MOTP
Participant #1       813       0.1374   0.8122   0.8370   0.8122
Participant #2       831       0.0592   0.8447   0.8923   0.8447
Total:             1,644       0.0979   0.8286   0.8650   0.8286

Table 4.5: Inter-annotator agreement for all evaluation metrics.
The serialized tracker output and the ground-truth data were then loaded into the “Head Position
Tagger” tool, and a report containing the evaluated STDA, MOTA, MOTP and δ metrics
was generated.
4.2.1.7 Evaluation Results
Colour, depth and combined tracker⁸ performances for the evaluation recordings with respect
to the δ, STDA, MOTA and MOTP metrics are discussed below.
Average Normalized Distance from the Head Center (δ)   The results of the δ metric conclusively
show that both the depth and combined trackers perform better than the colour-only tracker
on the given input recordings. In particular, both the depth and combined trackers performed
better than the colour one in 18/20 recordings.
While the difference between the performances of the depth and combined trackers is much smaller,
the combined tracker still outperforms the depth tracker in 14/20 recordings, and achieves a
slightly better total δ result.
Nonetheless, all trackers fell short of the “gold” inter-annotator agreement standard.
For illustration purposes, figures 4.12 and 4.15 show the δ measure's evolution over time for the
“Participant #5” and “Illumination (high)” recordings respectively. Similar analyses for the remaining
recordings are given in appendix F.2.2.
A summary of the δ measure for all the recordings is shown in figure 4.16.
⁸Using default settings as given in table 4.6, unless otherwise noted.
Frame 70 Frame 174 Frame 240 Frame 298
Frame 363 Frame 458 Frame 531 Frame 537
Frame 621 Frame 705 Frame 751 Frame 822
Figure 4.11: “Participant #5” recording. The marked red area indicates the output of the combined head-tracker.
Figure 4.12: δ metric evolution over time for “Participant #5” recording.
Frame 0 Frame 110 Frame 174 Frame 326
Frame 359 Frame 392 Frame 721 Frame 767
Figure 4.13: “Illumination (high)” recording. Marked red area indicates the output of the combined
head-tracker.
Frame 14 Frame 69 Frame 121 Frame 246
Figure 4.14: “Illumination (high)” recording depth frames. Blue colour indicates the areas of the
image where no depth data is present. More depth data becomes available towards the end of the
recording due to the reduced amount of sunlight in the scene.
Figure 4.15: δ metric evolution over time for “Illumination (high)” recording.
Figure 4.16: δ (average normalized distance from the head center) metric for all evaluation recordings (default settings). Lower values indicate better performance.
Figure 4.17: δ metric for all evaluation recordings (custom settings: increased ColourTrackerSaturationThreshold and ColourTrackerValueThreshold values for the “Illumination (low)”,
“Cluttered similar-hue background” and “Occlusions” recordings).
Figure 4.18: STDA (Sequence Track Detection Accuracy) metric for all evaluation recordings (higher values indicate better performance). Average (mean) STDA metric values for individual trackers are given in table 4.7.
STDA, MOTA and MOTP   Similarly to the δ metric, the colour-based tracker is nearly always
outperformed by both the depth and combined trackers with regard to the STDA, MOTA and MOTP
metrics. In particular, the depth and combined trackers perform better in 19/20 and 20/20 recordings
respectively for the STDA metric (as shown in figure 4.18), 16/20 and 18/20 recordings respectively
for MOTA, and 20/20 and 19/20 recordings respectively for MOTP. The depth and combined
trackers also consistently achieve better STDA/MOTA/MOTP results than the colour-only
tracker but, again, do not reach the performance of the “gold standard” (inter-annotator
agreement).
Interestingly, the purely depth-based tracker (as described in section 2.5.2) performs better than the
combined one w.r.t. the measures based on the spatio-temporally averaged ground-truth/detection
overlap ratios.
In particular, the depth-based tracker outperforms the combined tracker in 12/20 recordings using
the STDA metric, 12/20 recordings using the MOTA metric and 16/20 recordings using the MOTP
metric (see figure F.44 for MOTA/MOTP metric values per individual recording). The depth
tracker also achieves slightly better total STDA/MOTA/MOTP results.

Figure 4.19: Kinect's colour and depth streams subsampled and rescaled by factors k = 1, 2, 4 and 8.
These results can be partially explained by the fact that the combined tracker uses the intersection
of the individual colour and depth tracker outputs to produce the final prediction. This
approach can potentially reduce the number of both false positives (since both trackers have
to “vote” for a pixel to be classified as part of the head) and false negatives (in cases where
one of the trackers loses the track).
While this approach increases the accuracy of the head-center localization as shown by the δ metric
(which is crucial for the project's success), these benefits are outweighed by the slight increase
in false negatives occurring in the majority of frames, due to the relatively poor performance
of the colour tracker and the consequently decreased intersection area.
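The intersection-based fusion described above can be sketched as follows (a minimal illustration with hypothetical names; the HT3D implementation differs in detail):

```python
def combined_head_estimate(colour_pixels, depth_pixels):
    """Fuse colour- and depth-tracker outputs: intersect their pixel
    sets so that both trackers must 'vote' for a pixel, then take the
    centroid of the intersection as the head-center prediction.
    Pixel sets are sets of (x, y) coordinates."""
    head = colour_pixels & depth_pixels
    if not head:
        return None, head  # both trackers disagree: track lost this frame
    cx = sum(x for x, _ in head) / len(head)
    cy = sum(y for _, y in head) / len(head)
    return (cx, cy), head
```

The trade-off discussed above falls out directly: the intersection suppresses false-positive pixels contributed by only one tracker (helping δ), but its area shrinks whenever the colour tracker under-segments the head (hurting the overlap-based STDA/MOTA/MOTP metrics).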
4.2.1.8 Robustness to Undersampling and Noise
The effects of spatio-temporal undersampling and of additive white Gaussian noise (AWGN)
were also briefly investigated.
The average normalized distance from head center (δ) metric was calculated for the colour, depth
and combined head trackers on spatially and temporally undersampled versions of the “Participant #2”
recording (see figure 4.19). The results are summarized in figure 4.20; in brief, all trackers demonstrated
good robustness to undersampling, indicating that these algorithms could potentially
be applied to sensors with a lower resolution.
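The undersampling procedure can be sketched as follows (a minimal illustration, with frames modelled as nested lists of pixel values; the actual experiment operates on recorded Kinect streams):

```python
def undersample(frames, k):
    """Subsample a recording spatio-temporally by a factor k: keep
    every k-th frame, and within each kept frame every k-th row and
    every k-th pixel within a row."""
    return [[row[::k] for row in frame[::k]]  # spatial subsampling
            for frame in frames[::k]]         # temporal subsampling
```

For k = 2 this quarters the per-frame pixel count and halves the frame rate, roughly modelling a cheaper, lower-resolution sensor.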
Similarly, varying degrees of Gaussian noise were added to both the colour and depth streams of the
“Participant #2” recording (see figure 4.21). The results for the combined tracker are shown
in figure 4.22. While all three trackers showed some degree of robustness to noise, the depth
tracker was observed to be much more error-prone under AWGN in the depth stream. This
is possibly due to the head detection approach, which requires that horizontal local minima
satisfying equations 2.13 and 2.14 be found in a number of consecutive rows.
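The noise model can be sketched as follows (a minimal illustration; as in figure 4.22, the deviation is specified as a proportion of the maximum range value, and noisy values are clamped back into the valid range):

```python
import random

def add_awgn(frame, sigma_fraction, max_value):
    """Add zero-mean white Gaussian noise to a 2D frame of pixel values.
    sigma_fraction gives the standard deviation as a proportion of the
    maximum range value (255 for colour data, 4000 for depth data)."""
    sigma = sigma_fraction * max_value
    return [[min(max_value, max(0.0, v + random.gauss(0.0, sigma)))
             for v in row]
            for row in frame]
```

Applying the same σ fraction to both streams means the depth stream's absolute noise is much larger (σ = 0.05 corresponds to ±200 depth units vs. ±12.75 intensity levels), which is consistent with the depth tracker's greater sensitivity.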
Figure 4.20: Average Normalized Distance from Head Center (δ) metric for the “Participant #2” recording (using default tracker settings) under varying degrees of spatio-temporal undersampling. Lower values indicate better performance (notice the difference in vertical axis range for each of the trackers).
σ = 0 σ = 0.05 σ = 0.1 σ = 0.15
σ = 0 σ = 0.05 σ = 0.1 σ = 0.15
Figure 4.21: White Gaussian noise N ∼ N(0, σ²) added to Kinect's colour and depth streams.
Figure 4.22: Average Normalized Distance from Head Center (δ) metric for the “Participant #2” recording
with added white Gaussian noise. The noise is distributed around zero, with deviation given as a
proportion of the maximum range value (255 for colour data, 4000 for depth data). Lower values
indicate better performance.
Setting                            Default value
BackgroundSubtractorSensitivity    20
ColourDetectorSensitivity          0.4
ColourTrackerUseSaturation         True
ColourTrackerSaturationThreshold   32
ColourTrackerValueThreshold        64
BackgroundSubtractorType           BackgroundSubtractorType.DEPTH
DepthShadowEliminationEnabled      True
DepthSensorBlurRadius              1
ColourTrackerSensitivity           0.8
DepthTrackerSensitivity            0.8
CombinedTrackerSensitivity         0.4

Table 4.6: Default HT3D library settings.
4.2.1.9 Summary
Table 4.7 shows the average (mean) metric values for all trackers. While the inter-annotator
agreement has not been reached, both the depth and combined trackers have demonstrated
good performance in recordings containing varying backgrounds and lighting conditions,
and unconstrained viewer's head movements (with ±70° roll, ±160° yaw, ±90° pitch, anterior/posterior
translations within 40-400 cm, and horizontal/vertical translations within the
FoV of the sensor).
Tracker                       δ       STDA     MOTA     MOTP
Colour                      0.8259   0.3764   0.4158   0.4438
Depth                       0.3554   0.6024   0.5651   0.6552
Combined colour and depth   0.3270   0.5926   0.5574   0.6066
Inter-annotator agreement   0.0979   0.8286   0.8650   0.8286

Table 4.7: Tracker performance averaged over all evaluation recordings (obtained using default settings). Bold font indicates the best tracker values achieved for a given metric.
In particular, the combined tracker was able to predict the viewer's head center location within
less than 1/3 of the head's size from the actual head center (most important for the main project's
aim), and the depth tracker was able to achieve over 60% spatio-temporal overlap for the
predicted head area.
Regarding the relative performance of the different trackers, the main conclusion is that using
depth data in addition to colour data significantly improves head-tracking accuracy (as indicated by
all metrics).
This is mostly due to the very good performance of the CAMShift algorithm when applied
to the head probability distribution obtained from the depth data using the Peters and Garstka
priors. The main performance losses of the combined tracking algorithm stemmed from the
inaccuracies of the colour tracker in bad lighting conditions or in the presence of other similar-hue
objects in the scene.

Table 4.8: HT3D library performance when running the configuration GUI for 60 seconds on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz, with 8 GB RAM.

                                                   Average   Average %   Minimum %   Maximum %
                                                     FPS     CPU time¹   CPU time    CPU time
Colour tracker (no background subtraction)          27.833    28.820      21.839      36.658
Colour tracker (Euclidean background subtractor)    27.795    36.230      28.079      47.038
Colour tracker (ViBe background subtractor)         27.914    48.751      42.118      56.157
Depth tracker                                       27.568    47.135      37.438      60.837
Combined tracker                                    28.243    56.806      43.678      76.436
Histogram backprojection rendering                  27.830    64.165      56.214      74.096
Background subtraction rendering                    27.460    39.578      32.726      46.018
Depth head probability rendering                    28.060    63.624      53.818      78.776
Depth image rendering                               27.966    63.759      53.818      74.802

¹ The percentage of time that a single CPU core (with hyperthreading enabled) was busy servicing the process.
4.2.2 Performance Evaluation
To successfully achieve the main project aim (3D display simulation), it is crucial that the HT3D
DLL achieves real-time performance.
In order to evaluate the run-time head-tracking costs in realistic conditions, the performance
of the HT3D configuration GUI (see figure 3.21) was measured for various tracker settings. The HT3D
configuration GUI was chosen as a good representative program since it introduces only minimal
run-time overheads for data rendering (any projects using the HT3D library would be likely to incur
similar costs).
Run-time performance was tested on the main development machine, running 64-bit Windows 7
on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz with 8 GB RAM. A 64-bit
release build containing no debug information was measured using the Windows Performance
Monitor and dotTrace Performance 5.0 [28] tools.
The evaluation results are summarized in figures 4.23, 4.24 and in table 4.8. In summary, all
trackers achieved real-time performance: more than 27.4 frames per second were processed on
a single CPU core (with raw input provided by the Kinect sensor at 30 Hz).
Figure 4.23: Performance of HT3D trackers when running the configuration GUI on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz.
Figure 4.24: HT3D background subtractor performance with a colour tracker when running the configuration GUI.
4.2.2.1 “Hot Paths”
“Hot path” analysis indicates where most of the work in the process was performed (the most
active function call tree).
Due to a rather clumsy depth and colour stream alignment implementation in Kinect SDK
Beta 2, over 40% of the total head-tracking time was spent aligning the colour and
depth images (see figure 4.25).
In order to perform this alignment, the SDK provides the function GetColorPixelCoordinatesFromDepthPixel.
This function takes the coordinates of a depth pixel in the depth image, together
with the depth pixel value, and returns the corresponding coordinates of a colour pixel in the
colour image.
This API design effectively means that in every single frame, for every single pixel (x_d, y_d) in
the depth image, i) the function GetColorPixelCoordinatesFromDepthPixel has to be called to
return the corresponding colour coordinates (x_c, y_c), ii) the colour image has to be referenced
at coordinates (x_c, y_c) to obtain the colour value (r, g, b), and only then iii) the depth pixel
(x_d, y_d) can be assigned the colour value (r, g, b)⁹,¹⁰.
⁹This flaw has been fixed in Kinect SDK v1 (released on 01/02/2012, after the “Implementation Finish” milestone of the project), where an API for full-frame conversion is provided via the MapDepthFrameToColorFrame function. Assuming that the combined API yields a 10-fold performance improvement, Amdahl's law predicts that the overall head-tracker performance could be improved by over 38%, as shown in figure 4.25 part b).
¹⁰Incidentally, the updated image alignment API also supports 640 × 480 px resolution, hence the overall tracker resolution could be quadrupled from 320 × 240 px to 640 × 480 px.
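The Amdahl's-law estimate in footnote 9 can be checked with a one-line helper (p and s are the inputs; the footnote assumes that the alignment accounts for over 40% of the runtime and that the full-frame API is roughly 10× faster):

```python
def amdahl_speedup(p, s):
    """Overall speedup from accelerating a fraction p of the total
    runtime by a factor s (Amdahl's law): 1 / ((1 - p) + p / s)."""
    return 1.0 / ((1.0 - p) + p / s)
```

For example, with p = 0.4 and s = 10 the overall speedup bound is 1 / (0.6 + 0.04) ≈ 1.56×, i.e. a substantial throughput gain even though only the alignment step is accelerated.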
a)
b)
Figure 4.25: “Hot path” analysis for the HT3D library running the configuration GUI. The highlighted Kinect SDK function GetColorPixelCoordinatesFromDepthPixel performs the colour and depth image alignment. Image a) shows the unoptimized head-tracker performance (method nui_DepthFrameReady), while image b) shows the possible performance improvement obtained from a 10-fold reduction of GetColorPixelCoordinatesFromDepthPixel function calls.
4.3 3D Display Simulator (Z-Tris)¹¹
The correctness of the Z-Tris implementation was evaluated using a combination of automated
tests (unit, “smoke”, regression) and manual tests (white-box, functional, sanity, usability,
integration). 85.25% code coverage was achieved by automated unit tests for the core Z-Tris
classes (a sample unit test run is shown in figure 4.26).
Figure 4.26: Sample Z-Tris unit test run in Visual Studio Unit Testing Framework.
Regarding performance, Z-Tris (with the combined colour and depth head-tracker enabled)
achieved an average rendering speed of 29.969 frames per second, satisfying the real-time rendering
requirement. A single CPU core experienced an average load of 64.98%, indicating that
further processing resources remained available.
¹¹Only the evaluation summary is presented in this section due to space limitations; see appendix G for more Z-Tris evaluation details.
Chapter 5
Conclusions
5.1 Accomplishments
The main project aim (“to simulate depth perception on a regular LCD screen through the use of the ubiquitous and affordable Microsoft Kinect sensor, without requiring the user to wear glasses or other headgear, or to modify the screen in any way”) has been successfully achieved. While static images cannot do justice to the level of depth perception simulated by the system, a short video demonstration can be seen at http://zabarauskas.com/3d.
To achieve the main project’s aim, the following new approaches were suggested:
• A distributed Viola-Jones face detector training framework. The framework, running on 65 CPU
cores, was able to train a 22-layer detector cascade containing 1,828 decision stump classifiers in
less than a day (vs. a three-week estimate using a naïve approach). The training process was limited
only by the amount of data available (exhausting 32.9 million negative training examples).
• A real-time depth-based head tracker, combining the CAMShift tracking algorithm with the Peters and
Garstka priors. Over 10 minutes of colour and depth recordings, the depth-based head-tracker
was able to achieve a better than 60% average spatio-temporal overlap ratio between the ground-truth
objects and their predicted locations.
• A real-time combined (colour and depth) head-tracker. Over 10 minutes of evaluation recordings
(containing unconstrained viewer's head movement in six degrees-of-freedom, in the presence of
occlusions, changing facial expressions, different backgrounds and varying lighting conditions),
the combined head-tracker was able to predict the viewer's head center location within less than
1/3 of the head's size from the actual head center (on average).
To the same end, a number of published methods were implemented:
• the Viola-Jones face detector (with depth cue extensions as suggested in [10]),
• a depth-based face detector, using the Peters and Garstka method,
• the CAMShift face tracker (extended to use both hue and saturation data), and
• the ViBe background subtractor (in itself an extension to the project).
All of these methods were combined into a robust and flexible HT3D head-tracking library.
Finally, a proof-of-concept application was developed, creating depth perception on a regular LCD
display by simulating continuous horizontal/vertical motion parallax (using HT3D DLL) and a number
of pictorial depth cues.
Such systems could serve as potential “backwards-compatibility” providers during the transition from
2D to 3D displays (being able to render convincing 3D content on ubiquitous 2D displays).
CHAPTER 5. CONCLUSIONS 76
5.2 Future Work
Despite the obvious improvement over the colour-based head-tracker, neither the depth nor the combined
tracker has reached the inter-annotator agreement (“gold standard”) results.
The obvious next steps in increasing tracker performance would be to i) train the Viola-Jones face
detector using more data, and ii) port the HT3D library from Kinect SDK Beta 2 to Kinect SDK
v1, increasing the depth resolution to 640 × 480 px (effectively quadrupling the amount of depth data
present).
A more interesting direction, however, would be to explore the applicability of well-performing colour-
based methods to depth data. Possible examples include training a Viola-Jones face detector on
depth images (involving the collection of a representative depth data training set), or exploring the
applicability of adaptive background subtraction techniques (like ViBe) to depth image sequences.
Based on the experience obtained throughout the project, it seems quite likely that these approaches
could further improve the tracking accuracy.
Furthermore, the head-tracking library could be extended to deal with multiple people (this would
involve implementing partial/full occlusion disambiguation and object identification). Running both
colour- and depth-based multiple viewer trackers in parallel could potentially provide a significant
advantage over the systems based only on a single information source.
Bibliography
[1] Allen, J. G., Xu, R. Y. D., and Jin, J. S. Object Tracking Using CAMShift Algorithm and
Multiple Quantized Feature Spaces. Reproduction 36 (2006), 3–7.
[2] Barnich, O., and Van Droogenbroeck, M. ViBe: A Universal Background Subtraction
Algorithm for Video Sequences. IEEE Transactions on Image Processing 20, 6 (2011), 1709–
1724.
[3] Beck, K., Beedle, M., van Bennekum, A., Cockburn, A., Cunningham, W., Fowler,
M., Grenning, J., Highsmith, J., Hunt, A., Jeffries, R., Kern, J., Marick, B., Martin,
R. C., Mellor, S., Schwaber, K., Sutherland, J., and Thomas, D. Manifesto for
Agile Software Development, 2001.
[4] Benzie, P., Watson, J., Surman, P., Rakkolainen, I., Hopf, K., Urey, H., Sainov, V.,
and Kopylow, C. V. A Survey of 3DTV Displays: Techniques and Technologies, 2007.
[5] Bernardin, K., and Stiefelhagen, R. Evaluating Multiple Object Tracking Performance:
The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing 2008 (2008),
1–10.
[6] Blinn, J. F. Models of Light Reflection for Computer Synthesized Pictures. ACM SIGGRAPH
Computer Graphics 11, 2 (1977), 192–198.
[7] Boyle, M. The Effects of Capture Conditions on the CAMShift Face Tracker. Alberta, Canada:
Department of Computer Science, (2001).
[8] Bradski, G. Real Time Face and Object Tracking as a Component of a Perceptual User Interface.
In Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision, 1998. WACV
’98., pp. 214 –219.
[9] Burbeck, S. Applications Programming in Smalltalk-80: How to Use Model-View-
Controller (MVC). http://st-www.cs.illinois.edu/users/smarch/st-docs/mvc.html. Last accessed
on 07/04/2012.
[10] Burgin, W., Pantofaru, C., and Smart, W. D. Using Depth Information to Improve Face
Detection. In Proceedings of the 6th International Conference on Human-Robot Interaction (New
York, NY, USA, 2011), HRI ’11, ACM, pp. 119–120.
[11] Carbonetto, P. Training Data for Robust Object Detection. http://www.cs.ubc.ca/
~pcarbo.
[12] Crow, F. C. Shadow Algorithms for Computer Graphics. In Proceedings of the 4th Annual Con-
ference on Computer graphics and Interactive Techniques (1977), vol. 11, ACM Press, pp. 242–
248.
[13] Cutting, J. E., and Vishton, P. M. Perceiving Layout and Knowing Distances: The Inte-
gration, Relative Potency, and Contextual Use of Different Information About Depth. Perception
5, 3 (1995), 1–37.
[14] Dodgson, N. A. Autostereoscopic 3D Displays. Computer 38, 8 (2005), 31–36.
BIBLIOGRAPHY 78
[15] Freund, Y., and Schapire, R. E. A Decision-Theoretic Generalization of On-Line Learning
and an Application to Boosting. Computational Learning Theory 139 (1995), 119–139.
[16] Fukunaga, K., and Hostetler, L. The Estimation of the Gradient of a Density Function,
with Applications in Pattern Recognition. IEEE Transactions on Information Theory 21, 1
(1975), 32–40.
[17] Garstka, J., and Peters, G. View-dependent 3D Projection using Depth-Image-based Head
Tracking. In 8th IEEE International Workshop on Projector Camera Systems PROCAMS (2011),
pp. 52–57.
[18] GiD. Google Image Downloader. http://googleimagedownloader.com.
[19] Goldstein, E. B. Sensation and Perception. Wadsworth Pub Co, 2009.
[20] Gouraud, H. Continuous Shading of Curved Surfaces. IEEE Transactions on Computers C-20,
6 (1971), 623–629.
[21] Heidmann, T. Real Shadows, Real Time, vol. 18. 1991.
[22] Herrera C., D., and Kannala, J. Accurate and Practical Calibration of a Depth and Color
Camera Pair. Computer Analysis of Images and (2011).
[23] Holliman, N. 3D Display Systems. Handbook of Optoelectronics. IOP Press, London (2005).
[24] Holliman, N., Dodgson, N., Favalora, G., and Pockett, L. Three-Dimensional Displays:
A Review and Applications Analysis. Broadcasting, IEEE Transactions on 57, 99 (June 2011),
1–10.
[25] Hubona, G. S., Shirah, G. W., and Jennings, D. K. The Effects of Cast Shadows and
Stereopsis on Performing Computer-Generated Spatial Tasks, 2004.
[26] ImageMagick. Mogrify Command-Line Tool. http://www.imagemagick.org/www/mogrify.
html.
[27] Jensen, O. Implementing the Viola-Jones Face Detection Algorithm. M.Sc Thesis, Informatics
and Mathematical Modelling, Technical University of Denmark (2008).
[28] JetBrains. dotTrace 5.0 Performance. http://www.jetbrains.com/profiler.
[29] Jones, A., McDowall, I., Yamada, H., Bolas, M., and Debevec, P. Rendering for an
Interactive 360 Light Field Display. ACM Transactions on Graphics (TOG) 26, 3 (2007), 40.
[30] Kooima, R. Generalized Perspective Projection. http://aoeu.snth.net/static/
gen-perspective.pdf, 2009.
[31] L. Xia C.-C. Chen, and Aggarwal, J. K. Human Detection Using Depth Information by
Kinect. In Workshop on Human Activity Understanding from 3D Data in conjunction with CVPR
(HAU3D) (Colorado Springs, USA, 2011).
[32] Manohar, V., Soundararajan, P., and Raju, H. Performance Evaluation of Object Detec-
tion and Tracking in Video. In Proceedings of the Seventh Asian Conference on Computer Vision
(2006), pp. 151–161.
[33] Microsoft. Kinect for Windows SDK. http://www.microsoft.com/en-us/
kinectforwindows.
[34] OpenTK. The Open Toolkit Library. http://www.opentk.com.
[35] Papageorgiou, C., and Oren, M. A General Framework for Object Detection. Computer
Vision, 1998. (1998), 555–562.
[36] Rowley, H. A., Baluja, S., and Kanade, T. Neural Network-Based Face Detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20, 1 (1998), 23–38.
[37] Schapire, R., and Singer, Y. Improved Boosting Algorithms Using Confidence-Rated
Predictions. Machine Learning (1999).
[38] Shoushtarian, B., and Bez, H. E. A Practical Adaptive Approach for Dynamic Background
Subtraction Using an Invariant Colour Model and Object Tracking. Pattern Recognition Letters
26, 1 (2005), 5–26.
[39] Stiefelhagen, R., Bernardin, K., Bowers, R., Rose, R. T., Michel, M., and Garo-
folo, J. The CLEAR 2007 Evaluation. Multimodal Technologies for Perception of Humans 4625
(2008), 3–34.
[40] Urey, H., Chellappan, K. V., Erden, E., and Surman, P. State of the Art in Stereoscopic
and Autostereoscopic Displays. Proceedings of the IEEE 99, 4 (Apr. 2011), 540–555.
[41] Viola, P., and Jones, M. Rapid Object Detection Using a Boosted Cascade of Simple
Features. Proceedings of the CVPR 2001 (2001).
[42] Viola, P., and Jones, M. Fast and Robust Classification using Asymmetric AdaBoost and a
Detector Cascade. Advances in Neural Information Processing Systems 14 (2002), 1311–1318.
[43] Viola, P., and Jones, M. Robust Real-Time Face Detection. Int. J. Comput. Vision 57, 2
(May 2004), 137–154.
[44] Xia, L., Chen, C.-c., and Aggarwal, J. K. Human Detection Using Depth Information by
Kinect. Pattern Recognition (2011), 15–22.
[45] Zhang, C., Yin, Z., and Florencio, D. Improving Depth Perception with Motion Parallax
and its Application in Teleconferencing. In Multimedia Signal Processing, 2009. MMSP’09. IEEE
International Workshop on (2009), IEEE, pp. 1–6.
Appendix A
Depth Cue Perception
A.1 Oculomotor Cues
Oculomotor cues are created by two phenomena: convergence and accommodation.
Convergence is the inward turning of the eyes (driven by the extraocular muscles) that occurs
when the object of focus moves closer to the viewer (see figure A.1). The kinesthetic sensations
that arise are processed in the visual cortex and serve as cues for depth perception.
Accommodation is the change in the shape of the eye lens that occurs when sight is focused on
objects at different distances. The ciliary muscles stretch the lens, making it thinner and thus changing the
eye's focal length (see figure A.2). Similarly to convergence, the kinesthetic sensations that arise from
contracting and relaxing the ciliary muscles serve as basic cues for distance interpretation.
Both of those phenomena are most effective at ranges of up to 10 metres from the observer [13]
and provide absolute distance information.
A.2 Monocular Cues
Monocular cues provide depth information when the scene is viewed with just one eye. They are
typically split into pictorial and motion cues.
Figure A.1: Eye convergence on a) near and b) far target.
Figure A.2: Right eye accommodation on a) near and b) far target.
A.2.1 Pictorial Cues
Pictorial cues are the sources of depth information that are present purely in the image formed on the
retina. They include:
• Occlusion, which occurs when one object hides another from view. The partially hidden
object is then interpreted as being farther away.
• Relative height. An object below the horizon whose base is higher in the field-of-view is
interpreted as being farther away.
• Relative size, which occurs when two objects of equal size occupy different amounts
of space in the field-of-view. The object that subtends the larger visual angle on the
retina is interpreted as being closer. If the object's size is known, this prior
knowledge can be combined with the angle that the object subtends on the retina to provide
cues about its absolute distance.
• Relative density, which occurs when a cluster of objects or texture features has a characteristic
spacing on the retina, and the observer is able to infer the distance to the cluster from the
perspective foreshortening effects on this characteristic spacing.
• Perspective convergence, which occurs when parallel lines extending away from the observer appear
to converge in the distance. The separation between the lines provides hints about the
distances from the observer to objects on these lines.
• Atmospheric perspective, which occurs when objects in the distance appear less sharp, have
lower luminance contrast and lower colour saturation, and their colours are slightly shifted towards
the blue end of the spectrum. This happens because the light from far away objects is scattered
by small particles in the air (water droplets, dust, airborne pollution).
• Lighting and shadows. The way that light reflects off the surfaces of an object and the shadows
that are cast provide cues to the visual cortex to determine both the shape and the relative
position of objects.
• Texture gradient, which manifests as a decrease in the fineness of texture details with increasing
distance from the observer. This change in texture detail as objects recede is detected in the
parietal cortex and provides further depth information.
Figure A.3: A photograph exhibiting a number of pictorial depth cues: occlusion, relative height, size and density, atmospheric perspective, lighting and shadows, and texture gradient.
Most of these pictorial depth cues are demonstrated in figure A.3.
A.2.2 Motion Cues
All the cues described above are present for the stationary observer. However, if the observer is in
motion, the following new cues emerge that further enhance human perception of depth:
• Motion parallax, which occurs when objects closer to the moving observer seem to move
faster and in the opposite direction to the movement of the observer, whereas objects farther
away move slower and in the same direction. This difference in motion speeds provides hints
about their relative distances. Given surface markings and some knowledge about the
observer's position, motion parallax can yield an absolute measure of depth at each point of
the scene.
• Deletion and accretion. Deletion occurs when an object in the background gets covered by an
object in front as the observer moves, and accretion occurs when the observer moves in the
opposite direction and the background object gets uncovered. This information can then
be used to infer depth order.
Figure A.4: The points on the left and right retinae with the same relative angle from the fovea are known as the corresponding retinal points (or cover points). Absolute disparity is the angle between two corresponding retinal points. The horopter is an imaginary surface that passes through the point of fixation; only images of objects on the horopter fall on corresponding points on the two retinae; they also have an absolute disparity equal to zero (e.g. objects A1 and B in picture a). Relative disparity is the difference between two objects' absolute disparities. Notice that the absolute disparity of the object A1 changes from 0 in picture a) to φ in picture b), but the relative disparity between objects A1 and A2 remains constant.
A.3 Binocular Cues
In the average adult human, the eyes are horizontally separated by about 6 cm, hence even when
looking at the same scene, the images formed on the two retinae differ. This difference between the
images in the left and right eyes is known as binocular disparity.
Binocular disparity gives rise to two phenomena that provide information about the distances of
objects, absolute disparity and relative disparity, both illustrated in figure A.4.
It has been shown that this depth information present in the viewing geometry (both absolute
and relative disparity) is actually translated into depth perception in the brain, creating the stereopsis
depth cue. In particular, neurons in the striate cortex respond to absolute disparity (Uka & DeAngelis,
2003), and neurons higher up in the visual system (in the temporal lobe and other areas) respond to
relative disparity (Parker, 2007).
Appendix B
3D Display Technologies
B.1 Binocular (Two-View) Displays
Binocular displays generate two separate viewing zones, one for each eye. Various multiplexing meth-
ods (and their combinations) are used to provide the binocular separation of the views:
• Wave-length division, used in anaglyph-type (wavelength-selective) displays (e.g. red/cyan
colour channel separation using anaglyph glasses, or amber/blue channel separation used in the
ColorCode 3D display system, both shown in figure B.1).
Most of the technologies based on wave-length division require eyewear (stereoscopic displays).
• Space/direction division, used in parallax-barrier type and lenticular-type displays. These are
mainly autostereoscopic displays, i.e. they do not require glasses.
Also, a number of space/direction division based displays can be combined with head tracking
to provide viewing zone movement (using shifting parallax barriers/lenticulars, or a steerable
backlight).
• Time division, used in active LCD-shutter glasses (e.g. DepthQ system, figure B.2).
• Polarization division, used in systems requiring passive polariser glasses (e.g. RealD ZScreen,
figure B.2).
Figure B.1: Wave-length division display technologies: a) red/cyan channel multiplexed glasses for anaglyph 3D image viewing, b) patented ColorCode 3D display system that uses amber/blue colour channel multiplexing to produce full colour 3D images.
Figure B.2: Time- and polarization-division based technologies: a) RealD ZScreen display system that uses a single projector equipped with an electrically controllable polarization rotator to produce orthogonally polarized frames, b) DepthQ display system that uses a single projector with time-multiplexed output (to be viewed with active liquid-crystal based shutter glasses).
B.2 Multi-View Displays
Multi-view displays create a fixed set of viewing zones across the viewing field, in which different stereo
pairs are presented. Typical implementation techniques for this type of display include:
• Combination of pixelated emissive displays with static parallax barriers or lenticular arrays
(integral imaging displays). For the latter, hemispherical (as opposed to cylindrical) lenslets
can be used to provide vertical as well as horizontal parallax.
However, constraints on pixel size and resolution in LCD or plasma displays limit horizontal
multiplexing to a small number of views [14]. Also, parallax barriers can cause a significant light
loss with the increasing number of views, whereas lenticular displays magnify the underlying
subpixel structure of the device, creating dark transitions between viewing zones.
• Multiprojector displays, where the image from each projector is projected on the entire double-
lenticular screen, but is visible only within the corresponding viewing regions at the optimal
viewing distance. These displays require very precise alignment of the projected images, and are
extremely costly, since they require a single projector per view.
• Time-sequential displays, where the different views are generated by a single display device
running at a very high frame rate. A secondary optical component (synchronized to the
image-generation device) then directs the images at different time-slots to different viewing
zones. An example implementation using a high-speed CRT monitor and liquid crystal shutters
in a lens array has been developed at Cambridge (see figure B.3). However, the optical path
length required by such displays reduces their commercial appeal in comparison to flat-panel
displays [23].
Figure B.3: Multi-view and light-field 3D display technologies: image a) shows a 25” diagonal, 28-view time-multiplexed autostereoscopic display system developed at Cambridge. A high-speed CRT display renders each view sequentially and the synchronised LCD shutters direct the view through a Fresnel field lens at the appropriate angle. Image b) shows the light-field display system as described by Jones et al. [29], consisting of a high-speed video projector and a spinning mirror covered by a holographic diffuser.
B.3 Light-Field (Volumetric and Holographic) Displays
Light-field displays simulate light travelling in every direction through every point in the image volume.
Volumetric displays generate images by rendering each point of the scene at its actual position in space
through slice-stacking, solid-state processes, open-air plasma effects and so on. Sample implementa-
tions of such displays include laser projection onto a spinning helix (Lewis et al.), varifocal mirror
displays (Traub) or swept-screen systems (Hirsch).
Holographic displays attempt to reconstruct the light field of a 3D scene in space by modulating
coherent light (e.g. with spatial light modulators, liquid crystals on silicon, etc.). Two commercial
examples of holographic displays are Holografika, which uses a sheet of holographic optical elements as its
principal screen, and the QinetiQ system, which uses optically-addressed spatial light modulators.
Another light-field display system is described by Jones et al. [29], which consists of a high-speed video
projector and a spinning mirror covered by a holographic diffuser (see figure B.3).
B.4 3D Display Comparison w.r.t. Depth Cues
All display types listed above can simulate all of the pictorial cues. Two-view displays without head
tracking add stereopsis to the pictorial depth cues, and head-tracked two-view displays can simulate
motion parallax. However, two-view displays typically require eyewear or head-tracking.
Multi-view displays create the perception of stereopsis and can simulate motion parallax without
head-tracking or eyewear. However, the motion parallax is typically segmented into discrete steps and is
only horizontal. Building multi-view displays with a large number of views to overcome these problems
remains technologically challenging.
Light-field displays can provide continuous motion parallax and accommodation depth cues (besides
stereopsis, convergence and pictorial depth cues). However, as described by Holliman et al. [24],
volumetric displays remain a niche product, and computational holography remains experimental.
In general, despite the fact that most of the stereoscopic binocular display systems have been man-
ufactured for decades and some of the autostereoscopic systems have been available for 10-15 years,
they are still mainly used in niche applications (further discussed in the following section).
B.5 3D Display Applications
B.5.1 Scientific and Medical Software
• Geospatial applications, in which 3D displays are used for terrain analysis, defence intelligence
gathering, pairing of aerial and satellite imagery by photogrammetrists, and so on.
• Oil and gas applications, in which 3D displays help exploration geophysicists to visualise
subterranean material density images, in order to make more accurate predictions of where
petroleum reservoirs might be located.
• Molecular modelling, computational chemistry and crystallography visualisations. Since the structure
of a particular molecule is determined by the spatial location of its molecular constituents, 3D
displays can help to visualise spatial relationships between thousands of atoms in a given molecule,
helping to determine its structure and function.
• Mechanical design, where 3D displays can help industrial designers, mechanical engineers and
architects to design and showcase complex 3D models.
• Medical applications, in which magnetic resonance imaging (MRI), computed tomography (CT),
ultrasound and other inherently volumetric images can be represented in 3D to help doctors make
a more accurate and quicker judgement. Three-dimensional displays can also help in minimally
invasive surgeries (MIS) to give surgeons a better understanding of depth and position when
making critical movements.
• Training of complex operations, remote robot manipulation in dangerous environments,
augmented and virtual reality applications, 3D teleconferencing and so on.
B.5.2 Gaming, Movie and Advertising Applications
In this application class, 3D displays have the advantage of novelty and increased user immersiveness
over regular 2D displays.
Figure B.4: Free2C interactive kiosk (built for use at showrooms, shops, airports, etc.), which uses head-tracking to control a vertically aligned lenticular screen to overcome the fixed viewing-zone requirement.
Over the last few decades this advantage has been exploited by a large number of different 3D display
systems, manufactured for the purpose of advertising. An example of such system (Free2C interactive
kiosk) is shown in figure B.4.
Similarly, a number of recent developments show an increasing interest in 3D display technologies for
movies and gaming.
Examples given by Zhang et al. in [45] include Nvidia’s release of a 3D Vision technology stereoscopic
gaming kit (in 2008) containing liquid-crystal shutter glasses and a GeForce Stereoscopic 3D Driver
(enabling 3D gaming on supported displays), an agreement between The Walt Disney Company and
Pixar (made in April 2008) to make eight new 3D animated films over the next four years, and
an announcement by DreamWorks Animation that it will release all its movies in 3D, starting in
2009.
Appendix C
Computer Vision Methods (Additional
Details)
C.1 Viola-Jones Face Detector
C.1.1 Weak Classifier Boosting using AdaBoost
AdaBoost combines a collection of simple classification functions into a stronger classifier through a
number of rounds, where in each round
• the best weak classifier (simple classification function) for the current training data is found,
• lower/higher weights are assigned to correctly/incorrectly classified training examples.
The final strong classifier is obtained by taking a weighted linear combination of weak classifiers,
where the weights assigned to individual weak hypotheses are inversely proportional to the number of
classification errors that they make.
These steps are illustrated in figure C.1, and formalized precisely in algorithm C.1.1.1.
A number of properties of AdaBoost have been proven. Of particular interest is a generalisation by
Schapire and Singer [37] of a theorem by Freund and Schapire, which states that the training error of
a strong classifier decreases exponentially in the number of rounds, i.e. the training error after T
rounds is bounded by

\frac{1}{N}\Big|\{\, i : C(\vec{x}_i) \neq y_i \,\}\Big| \;\leq\; \frac{1}{N}\sum_{i=1}^{N} \exp\!\big(-y_i f(\vec{x}_i)\big), \qquad \text{(C.1)}

where N is the number of training examples and f(\vec{x}) = \sum_{t=1}^{T} \alpha_t h_t(\vec{x}).
AdaBoost is designed to minimize a quantity related to the overall classification error, but in the
context of face detection this is not the optimal strategy. As discussed in section 2.4.1.4, it is
more important to minimize the false negative rate than the false positive rate.
C.1.1.1 AsymBoost Modification
AsymBoost (asymmetric AdaBoost) is a variant of AdaBoost (presented by Viola and Jones in
2002 [42]) specifically designed to be used in classification tasks where the distribution of positive
and negative training examples is highly skewed. The fix proposed in [42] is to adjust the training
Figure C.1: A simplified illustration of the AdaBoost weak classifier boosting algorithm given in C.1.1.1. In this training sequence, three weak classifiers that minimize the classification error are selected; after selecting each classifier, the remaining training examples are reweighed (increasing/decreasing the weights of incorrectly/correctly classified examples respectively). After selecting all three classifiers, a weighed linear combination of their individual thresholds is taken, yielding a final strong classifier.
Algorithm C.1.1.1 Weak classifier boosting using AdaBoost. It requires N training examples given in the array A = (~x1, y1), ..., (~xN, yN) (where yi = 0 for a negative and yi = 1 for a positive training example) and uses T weak classifiers to construct a strong classifier. The result of the boosting is the final strong classifier h(~x), which is a weighted linear combination of T hypotheses with the weights inversely proportional to the training errors.

AdaBoost(A, T)
    // Initialize training weights (where m is the count of negative, l is the count of
    // positive training examples).
    for each training example (~xi, yi) ∈ A
        if yi = 0
            w1,i ← 1/(2m)
        else
            w1,i ← 1/(2l)
    for t ← 1 to T
        // 1. Normalize the weights:
        for each weight wt,i
            wt,i ← wt,i / Σ_{j=1..N} wt,j
        // 2. Select the best weak classifier h(~x, ft, pt, θt), which minimizes the error
        //    εt = min_{f,p,θ} Σ_i wt,i |h(~xi, f, p, θ) − yi|:
        ht(~x) ← Find-Best-Weak-Classifier(~wt, A)
        // 3. Update the weights:
        for each training example (~xi, yi) ∈ A
            if ht(~xi) = yi
                wt+1,i ← wt,i · εt/(1 − εt)
            else
                wt+1,i ← wt,i
    return h(~x) = 1 if Σ_{t=1..T} log((1 − εt)/εt) ht(~x) ≥ (1/2) Σ_{t=1..T} log((1 − εt)/εt), and 0 otherwise.
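The boosting loop can be sketched in Python. This is an illustrative, deliberately brute-force version (the stump search simply tries every feature value as a threshold, rather than the sorted single-pass selection described in section C.1.2); the function name and decision-stump representation are choices made here, not part of the original framework:

```python
import numpy as np

def adaboost(X, y, T):
    """Boost T decision stumps as in algorithm C.1.1.1.
    X is an (N, K) feature matrix, y holds 0/1 labels.
    A stump h(x) = 1 iff p * x[f] < p * theta."""
    N = len(y)
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))  # initial weights
    stumps, alphas = [], []
    for _ in range(T):
        w = w / w.sum()                       # 1. normalise the weights
        best = None                           # 2. pick the stump minimising
        for f in range(X.shape[1]):           #    the weighted error (brute force)
            for theta in X[:, f]:
                for p in (1, -1):
                    h = (p * X[:, f] < p * theta).astype(int)
                    eps = np.sum(w * np.abs(h - y))
                    if best is None or eps < best[0]:
                        best = (eps, f, p, theta)
        eps, f, p, theta = best
        eps = max(eps, 1e-10)                 # guard against a perfect stump
        h = (p * X[:, f] < p * theta).astype(int)
        w = np.where(h == y, w * eps / (1 - eps), w)   # 3. reweigh examples
        stumps.append((f, p, theta))
        alphas.append(np.log((1 - eps) / eps))
    def strong(x):
        votes = sum(a * (p * x[f] < p * theta)
                    for a, (f, p, theta) in zip(alphas, stumps))
        return int(votes >= 0.5 * sum(alphas))
    return strong
```

On a trivially separable one-feature set, a single boosted stump already suffices; the sketch is meant only to make the weight-update bookkeeping concrete.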
weights in each round by a multiplicative factor of

\exp\!\left( \frac{1}{T}\, y_i \log \sqrt{k} \right), \qquad \text{(C.2)}

where T is the number of rounds1, yi is the class of the training example i and k is the penalty ratio
between false negatives and false positives.
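The compounding effect of this factor can be checked numerically; the sketch below assumes labels encoded as y ∈ {+1, −1} (positive/negative example), so that over T rounds the adjustments multiply out to √k for positives and 1/√k for negatives:

```python
import math

def asym_multiplier(y, T, k):
    """Per-round AsymBoost weight multiplier from equation C.2.
    y ∈ {+1, -1} is the example's class; k is the false-negative vs
    false-positive penalty ratio."""
    return math.exp((1.0 / T) * y * math.log(math.sqrt(k)))
```

Applied once per round, the T rounds compound to exp(y log √k) = k^(y/2), i.e. positive examples end up √k times heavier, skewing the boosted classifier towards a low false negative rate.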
C.1.2 Best Weak-Classifier Selection
The algorithm to efficiently find the best decision-stump weak classifier is given in C.1.2.1. The
asymptotic time cost of finding the best weak classifier in a given training round is O(KN logN), where
K is the number of features and N is the number of training examples.
Algorithm C.1.2.1 Selection of the best decision stump weak classifier. It requires an array of training examples A = (~x1, y1), ..., (~xN, yN), together with the training example weights ~wt. This algorithm returns the best rectangle-feature based decision stump classifier.
Find-Best-Weak-Classifier(~wt, A)
    Calculate T+, T− (total sums of positive/negative example weights).
    for each feature f
        for each training example (~xi, yi) ∈ A
            vi ← f(~xi)
        ~v.sort()
        for vi ∈ ~v
            Maintain S+_i, S−_i (total sums of positive/negative weights below the
            current example).
            // Calculate the current error:
            εf,i = min{ S+ + (T− − S−), S− + (T+ − S+) }
            If εf,i is smaller than the previously known smallest error, remember the
            current threshold θf and parity pf.
        Maintain the feature with the smallest error fb and the associated threshold θb
        and parity pb.
    return h(~x, fb, pb, θb)
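The single-feature core of this scan (sort once, then sweep while maintaining the running weight sums) can be sketched as follows; the function name and the tuple it returns are illustrative choices, not part of the original algorithm:

```python
import numpy as np

def best_stump_threshold(values, labels, weights):
    """One-feature version of algorithm C.1.2.1: scan the sorted feature
    values once, maintaining running sums of positive/negative weights
    below the current example; returns (error, threshold, parity).
    Parity +1 means 'values below the threshold are classified negative'."""
    order = np.argsort(values)
    v, y, w = values[order], labels[order], weights[order]
    t_pos = w[y == 1].sum()          # T+: total positive weight
    t_neg = w[y == 0].sum()          # T-: total negative weight
    s_pos = s_neg = 0.0              # S+, S-: weight strictly below current example
    best = (np.inf, None, 0)
    for i in range(len(v)):
        e_pos = s_pos + (t_neg - s_neg)   # label everything below as negative
        e_neg = s_neg + (t_pos - s_pos)   # label everything below as positive
        e = min(e_pos, e_neg)
        if e < best[0]:
            best = (e, v[i], 1 if e_pos <= e_neg else -1)
        if y[i] == 1:
            s_pos += w[i]
        else:
            s_neg += w[i]
    return best
```

The sort dominates, giving the O(N log N) per-feature cost quoted above; the sweep itself is linear.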
1 If the strong classifier obtained using AsymBoost is to be used in the attentional cascade (see section2.4.1.4), the number of rounds required to train a particular strong classifier will be unknown in advance. Inthat case, it can be approximated using the round counts of the previous two layers: Ti+2 = Ti+1 + (Ti+1−Ti).
C.1.3 Cascade Training
The precise algorithm to build a cascaded Viola-Jones face detector is shown in listing C.1.3.1.
Algorithm C.1.3.1 Building a cascaded detector. It requires the maximum acceptable false positive rate per layer f, the minimum acceptable detection rate per layer d, the target overall false positive rate Ftarget, a set of positive training examples P and a set of negative training examples N. The algorithm returns a cascaded detector C(~x).
Build-Cascade(f, d, Ftarget, P, N)
    C(~x) ← ∅
    F0 ← 1.0, D0 ← 1.0
    i ← 0
    while Fi > Ftarget
        i ← i + 1
        ni ← 0, Fi ← Fi−1
        while Fi > f × Fi−1
            ni ← ni + 1
            hi(~x) ← AdaBoost(N ∪ P, ni)
            C(~x) ← C(~x) ∪ hi(~x)
            Evaluate the cascaded classifier on a validation set to determine Fi and Di.
            Decrease the threshold for hi(~x) until the cascaded classifier has a detection
            rate of at least d × Di−1.
        N ← ∅
        if Fi > Ftarget
            Evaluate C(~x) on the set of non-face images and put any false detections into
            N (“bootstrap” negative images).
    return C(~x)
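The control flow of the cascade construction can be sketched as follows. The expensive steps (layer training, validation-set evaluation, negative bootstrapping) are injected as hypothetical callables, and the per-layer threshold adjustment is omitted for brevity, so this shows only the loop structure, not a working trainer:

```python
def build_cascade(f_max, d_min, F_target, train_layer, evaluate, bootstrap):
    """Skeleton of algorithm C.1.3.1. train_layer(n) trains an n-stump
    strong classifier, evaluate(cascade) returns the current (F, D) rates
    on a validation set, bootstrap(cascade) collects false positives as
    new negatives. (Threshold lowering to reach d_min * D_prev omitted.)"""
    cascade = []
    F_prev = D_prev = 1.0
    while F_prev > F_target:
        n = 0
        F = F_prev
        while F > f_max * F_prev:        # grow the layer until it filters enough
            n += 1
            layer = train_layer(n)
            F, D = evaluate(cascade + [layer])
        cascade.append(layer)
        F_prev, D_prev = F, D
        if F_prev > F_target:
            bootstrap(cascade)           # refill N with current false positives
    return cascade
```

With a mock `evaluate` that halves the false positive rate per layer, the loop terminates after the expected number of layers, which is a convenient sanity check on the stopping conditions.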
C.1.3.1 Training Time Complexity
As briefly discussed in section 2.4.1.3, the asymptotic time cost of finding the best decision-stump weak
classifier is O(KN logN), where K is the number of features and N is the number of training examples.
Then, the cost of training a single strong classifier is O(MKN logN) where M is the number of weak
classifiers combined through boosting. Finally, the cost of training a detector cascade containing L
strong classifiers is O(LMKN logN).
To put these numbers into perspective, assume that it takes 10 milliseconds on average to evaluate a
rectangle feature on 10,000 training images (1 µs/image). Then training a cascade containing 25
strong classifiers, with a total of 4,000 decision stumps selected from 160,000 features and trained
on 10,000 training examples, would require over 74 days of continuous training (without considering
the time it takes to select the best feature out of the 160,000, or to bootstrap the false positive
training images for each layer).
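The 74-day figure follows directly from the stated assumptions, as a quick back-of-the-envelope check shows:

```python
# Each decision stump requires evaluating every feature on every training
# example; at 1 microsecond per feature evaluation:
stumps, features, examples = 4_000, 160_000, 10_000
seconds = stumps * features * examples * 1e-6   # 6.4 million seconds
days = seconds / 86_400
print(round(days, 1))  # → 74.1
```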
C.2 CAMShift Face Tracker
C.2.1 Mean-Shift Technique
The CAMShift face tracker is based on the mean-shift technique, a non-parametric technique for
climbing the gradient of a given probability distribution to find the nearest dominant peak (mode). The
precise details of this technique are summarized in algorithm C.2.1.1.
Algorithm C.2.1.1 Two-dimensional mean shift. It requires the probability distribution P, the initial location of the search window (x, y), the search window size s and the convergence threshold θ. It returns the location of the nearest dominant mode of the probability distribution P.
2D-Mean-Shift(P, x, y, s, θ)
    (x′c, y′c) ← (x, y)
    repeat
        (xc, yc) ← (x′c, y′c)
        // Find the zeroth moment of the search window
        M00 ← Σ_{|x|≤s/2, |y|≤s/2} P(xc + x, yc + y)
        // Find the first horizontal and vertical moments
        M10 ← Σ_{|x|≤s/2, |y|≤s/2} x · P(xc + x, yc + y)
        M01 ← Σ_{|x|≤s/2, |y|≤s/2} y · P(xc + x, yc + y)
        (x′c, y′c) ← (M10/M00, M01/M00)
    until Distance((xc, yc), (x′c, y′c)) < θ
    return (x′c, y′c)
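A minimal Python sketch of this loop over a discrete distribution (window clipping at the array borders is an implementation choice made here, not part of the algorithm):

```python
import numpy as np

def mean_shift_2d(P, x, y, s, theta):
    """Algorithm C.2.1.1: climb the discrete distribution P (a 2-D array
    indexed [x, y]) starting from (x, y), with an s-by-s search window,
    until the window centre moves less than theta."""
    h = s // 2
    xc, yc = float(x), float(y)
    while True:
        x0, y0 = int(round(xc)), int(round(yc))
        # clip the window to the array bounds
        x1, x2 = max(x0 - h, 0), min(x0 + h + 1, P.shape[0])
        y1, y2 = max(y0 - h, 0), min(y0 + h + 1, P.shape[1])
        window = P[x1:x2, y1:y2]
        xs, ys = np.mgrid[x1:x2, y1:y2]
        m00 = window.sum()                       # zeroth moment
        if m00 == 0:
            return xc, yc                        # empty window: stay put
        nx = (xs * window).sum() / m00           # first moments -> centroid
        ny = (ys * window).sum() / m00
        if np.hypot(nx - xc, ny - yc) < theta:   # convergence test
            return nx, ny
        xc, yc = nx, ny
```

Starting a few pixels away from a compact peak, the window centre walks onto the peak's centroid in a handful of iterations.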
C.2.2 Centroid and Search Window Size Calculation
Define the shorthand p(x′, y′) ≜ Pr(“I(x′, y′) belongs to a face”). Then the face centroid and the search
window size can be calculated as follows.
1. Compute the zeroth moment:

M_{00} = \sum_{x,y \in I_s} p(x, y), \qquad \text{(C.3)}

where Is is the current search window.

2. Compute the first horizontal and vertical spatial moments:

M_{10} = \sum_{x,y \in I_s} x \, p(x, y), \qquad M_{01} = \sum_{x,y \in I_s} y \, p(x, y). \qquad \text{(C.4)}

3. The centroid location (xc, yc) is then given by

(x_c, y_c) = \left( \frac{M_{10}}{M_{00}}, \frac{M_{01}}{M_{00}} \right). \qquad \text{(C.5)}

Similarly, the size s of the search window is set as

s = 2\sqrt{M_{00}}. \qquad \text{(C.6)}
This expression is based on two observations. First, the zeroth moment represents the distribution
area under the search window; hence, assuming a rectangular search window, its side length can be
approximated as √M00. Secondly, the goal of CAMShift is to track the whole object, so the search
window needs to be expansive: the factor of two ensures that the search window grows to span the
whole connected distribution area.
Bradski also suggests that in practice the search window width and height for face tracking can
be set to s and 1.2s respectively, to resemble the natural elliptical shape of the face.
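Equations C.3-C.6, together with Bradski's elliptical width/height suggestion, reduce to a few lines of array arithmetic (the function name and return convention are illustrative):

```python
import numpy as np

def camshift_window(p):
    """Centroid and search-window size from the moments in equations
    C.3-C.6; p is a 2-D array of per-pixel face probabilities over the
    current search window."""
    xs, ys = np.mgrid[0:p.shape[0], 0:p.shape[1]]
    m00 = p.sum()                                 # zeroth moment (C.3)
    xc = (xs * p).sum() / m00                     # first moments (C.4)
    yc = (ys * p).sum() / m00                     # -> centroid (C.5)
    s = 2.0 * np.sqrt(m00)                        # window side (C.6)
    return (xc, yc), (s, 1.2 * s)                 # elliptical width/height
```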
C.3 ViBe Background Subtractor
C.3.1 Background Model Initialization
Background models used in the ViBe algorithm can be instantaneously initialized using only the first
frame of the video sequence. Since no temporal information is present in a single frame, the main
assumption made is that neighbouring pixels share a similar temporal distribution.
Under this assumption, the pixel models can be populated using the values found in the spatial
neighbourhood of each pixel; based on the empirical observations of Barnich and Van Droogenbroeck,
selecting samples from the 8-connected neighbourhood of each pixel has proven satisfactory for
VGA resolution images.
This observation can be formalized in the following way. Let NG(x) be a spatial neighbourhood of
a pixel x; then

M_0(x) = \{\, v_0(y) \mid y \in N_G(x) \,\}, \qquad \text{(C.7)}

where the locations y ∈ NG(x) are chosen randomly according to the uniform law, Mt(x) is the model of
pixel x at time t, and vt(x) is the colour-space value of pixel x at time t.
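The initialization step can be sketched for a single-channel frame as follows (the function name, sample count and border clamping are choices made for this illustration):

```python
import numpy as np

def vibe_init(frame, n_samples=20, rng=None):
    """Initialize ViBe pixel models (equation C.7) from a single
    single-channel frame by sampling the 8-connected neighbourhood
    of every pixel."""
    rng = rng or np.random.default_rng(0)
    h, w = frame.shape[:2]
    model = np.empty((h, w, n_samples), dtype=frame.dtype)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for i in range(h):
        for j in range(w):
            for k in range(n_samples):
                di, dj = offsets[rng.integers(8)]   # uniform neighbour pick
                y = min(max(i + di, 0), h - 1)      # clamp at the border
                x = min(max(j + dj, 0), w - 1)
                model[i, j, k] = frame[y, x]
    return model
```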
C.3.2 Background Model Update
After a new pixel value vt(x) is observed, the memoryless update policy dictates that the old to-be-
discarded sample is chosen randomly from Mt−1(x), according to a uniform probability density
function. This way the probability that a sample which is present at time t will not be discarded at
time t + 1 is (N − 1)/N, where N = |Mt(x)|.

Assuming time continuity and the absence of memory in the selection procedure, the probability that
the sample in question will still be present after dt time units is

\left( \frac{N-1}{N} \right)^{dt} = \exp\!\left[ -dt \ln\!\left( \frac{N}{N-1} \right) \right], \qquad \text{(C.8)}

which is indeed an exponential decay.
Since in many practical situations it is not necessary to update each background pixel model for each
new frame, the time window covered by a pixel model of a fixed size can be extended using random
time subsampling.

In ViBe this is implemented by introducing a time subsampling factor φ. If a pixel x is classified as
belonging to the background, its value v(x) is used to update its model M(x) with probability 1/φ.
Finally, based on the assumption that neighbouring background pixels share a similar temporal
distribution, the neighbouring pixel models are stochastically updated when a new background sample
of a pixel is taken. More precisely, given the 8-connected spatial neighbourhood NG(x) of a pixel x,
the model M(y) of one of the neighbouring pixels y ∈ NG(x) is updated (y is chosen randomly, with
uniform probability).
This approach allows spatial diffusion of information using only samples classified as background,
i.e. the background model is able to adapt to changing illumination or scene structure while retaining
a conservative update2 policy.
2Conservative update policy “never includes a sample belonging to foreground in the background model.”[2]
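The two stochastic update steps above (memoryless in-place replacement with probability 1/φ, plus diffusion into a random 8-neighbour's model) can be sketched per pixel; the function signature and border clamping are illustrative choices:

```python
import numpy as np

def vibe_update(model, i, j, value, phi=16, rng=None):
    """ViBe model update for pixel (i, j) classified as background:
    with probability 1/phi replace a random sample of its model, and
    with the same probability propagate the value into the model of a
    randomly chosen 8-connected neighbour."""
    rng = rng or np.random.default_rng()
    h, w, n = model.shape
    if rng.integers(phi) == 0:                    # time subsampling
        model[i, j, rng.integers(n)] = value      # memoryless replacement
    if rng.integers(phi) == 0:                    # spatial diffusion
        while True:                               # pick a true neighbour
            di, dj = rng.integers(-1, 2), rng.integers(-1, 2)
            if (di, dj) != (0, 0):
                break
        y = min(max(i + di, 0), h - 1)
        x = min(max(j + dj, 0), w - 1)
        model[y, x, rng.integers(n)] = value
```

With φ = 1 every call fires both branches, which is convenient for testing; in practice Barnich and Van Droogenbroeck's suggested φ = 16 spreads the updates over time.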
Appendix D
Depth-Based Methods (Additional Details)
D.1 Depth Data Preprocessing
D.1.1 Depth Shadow Elimination
In order to obtain the depth values in a frame, Kinect uses infrared light to project a reference
dot pattern on the scene, which is then captured using an infrared camera. Since the projected and
captured images are not equivalent (due to the horizontal distance between the projector and the
camera), stereo triangulation can be used to calculate the depth once the correspondence problem is
solved.
However, this leads to depth shadowing (see the example in figure D.1). Since the infrared projector
is placed 2.5 cm to the right of the infrared camera, depth shadows of convex objects always
appear on their left side if the sensor is placed on a flat horizontal surface.
This suggests a straightforward depth shadow elimination technique for head tracking (as heads are
indeed convex):
1. Process the depth image one horizontal line at a time, from left to right.
2. If an unknown depth value is reported by Kinect, replace it with the last known depth value.
An example of the depth shadow removed using this technique is shown in figure 2.11.
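The two steps above amount to a single left-to-right sweep per scan line; a minimal sketch (function name and the convention that 0 encodes "unknown" are assumptions made here):

```python
import numpy as np

def fill_depth_shadows(depth, unknown=0):
    """Scan-line depth shadow elimination: sweep each row left to right
    and replace unknown readings with the last known depth on that row.
    A row that starts with unknown values keeps them until the first
    valid reading appears."""
    out = depth.copy()
    for row in out:
        last = unknown
        for i, d in enumerate(row):
            if d == unknown:
                row[i] = last
            else:
                last = d
    return out
```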
D.1.2 Real-Time Depth Image Smoothing
The noise in the depth calculation can arise from inaccurate disparity measurements within the
correlation algorithm, the limited resolutions of the Kinect infrared camera and projector, external
infrared radiation (e.g. sunlight), object surface properties (especially high specularity), and so on.
In the detection method below, every local minimum on a horizontal scan line is treated as a point
which potentially lies on the vertical head axis (a hypothesis which is confirmed or refuted using
prior knowledge about human head sizes). Since finding a local minimum essentially involves
discrete differentiation, such a method is very prone to noise. A solution proposed in [17] is to smooth
the depth image in real time using the “integral image” filter from the Viola-Jones face detection
algorithm.
As further described in section 2.4.1, the integral image can be calculated in linear time using a
dynamic programming approach; a smoothed depth value I_r(x, y) of the pixel at coordinates (x, y)
APPENDIX D. DEPTH-BASED METHODS (ADDITIONAL DETAILS) 98
Figure D.1: Kinect depth shadowing. The light blue polygon shows the area which is visible from the IR camera point of view, the light red polygon shows the region where the IR pattern is projected. Thicker blue lines indicate the areas on the objects that are visible by the IR camera, thicker red lines indicate the areas on the objects that have the IR pattern projected on them.
can be obtained by calculating
I_r(x, y) = [I(x - r, y - r) - I(x + r, y - r) - I(x - r, y + r) + I(x + r, y + r)] / (2r + 1)^2,    (D.1)
where I is the integral image and r is the half-width of the (2r + 1) × (2r + 1) averaging rectangle.
The result of smoothing using different averaging rectangle sizes is shown in figure 2.10.
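The smoothing step can be sketched in pure Python over 2-D lists (note that an exact inclusive box sum uses offsets of −r−1 on the low side, a detail that equation D.1 elides; the window is assumed to lie fully inside the image with a one-pixel margin):

```python
def integral_image(img):
    """I[y][x] = sum of img over the rectangle [0..x] x [0..y]."""
    h, w = len(img), len(img[0])
    I = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            I[y][x] = row_sum + (I[y - 1][x] if y > 0 else 0)
    return I

def smooth(img, I, x, y, r):
    """Box-average of img around (x, y) using the integral image I,
    i.e. the filter of equation D.1 with inclusive-sum indexing."""
    area = (I[y + r][x + r] - I[y - r - 1][x + r]
            - I[y + r][x - r - 1] + I[y - r - 1][x - r - 1])
    return area / (2 * r + 1) ** 2
```

Each smoothed pixel thus costs four lookups regardless of r, which is what makes the filter usable in real time.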
D.2 Depth Cue Rendering
This subsection describes two algorithms which are used to render various depth cues, as listed
in the main project aims (section 1.5). More precisely, a generalized perspective projection [30] is used to
simulate motion parallax when the viewer's head position is known, and the Z-Pass algorithm [21]
is used to simulate the pictorial shadow depth cue. More details on both of these algorithms are given
below.
D.2.1 Generalized Perspective Projection
A generalized perspective projection (as described by Kooima in [30]) is used to simulate the motion
parallax, occlusion, relative height, relative size, relative density and perspective convergence depth
cues.
The generalized perspective projection matrix G can be derived as follows. Let p_a, p_b, p_c be the three
corners of the screen as shown in figure D.2. Then the screen-local axes v_r, v_u and v_n that give the
Figure D.2: Screen definition for the generalized perspective projection. Viewer-space points p_a, p_b, p_c give the three corners of the screen, point p_e gives the position of the viewer's eye, screen-local axes v_r, v_u and v_n give the orthonormal basis for describing points relative to the screen, non-unit vectors v_a, v_b and v_c span from the eye position to the screen corners, and distances from the screen-space origin l, r, t, b give the left/right/top/bottom extents respectively of the perspective projection.
orthonormal basis for describing points relative to the screen can be calculated using
v_r = (p_b - p_a) / ||p_b - p_a||,
v_u = (p_c - p_a) / ||p_c - p_a||,
v_n = (v_r × v_u) / ||v_r × v_u||.    (D.2)
If the viewer's position changes so that the head is no longer in front of the center of the screen,
the frustum becomes asymmetric. The frustum extents l, r, b, t can be calculated as follows.
Let
v_a = p_a - p_e,  v_b = p_b - p_e,  v_c = p_c - p_e,    (D.3)
where p_e is the position of the viewer in world coordinates. Then the distance from the viewer to
the screen-space origin is d = -(v_n · v_a). Given this information, the frustum extents can be computed
using
l = (v_r · v_a) n / d,
r = (v_r · v_b) n / d,
b = (v_u · v_a) n / d,
t = (v_u · v_c) n / d,    (D.4)

where n is the distance to the near clipping plane (defined below).
Let n, f be the distances of the near and far clipping planes respectively. Then the 3D perspective
projection matrix P (which maps from a truncated pyramid frustum to a cube) is
P = | 2n/(r-l)   0          (r+l)/(r-l)    0          |
    | 0          2n/(t-b)   (t+b)/(t-b)    0          |
    | 0          0          -(f+n)/(f-n)   -2fn/(f-n) |
    | 0          0          -1             0          |    (D.5)
The base of the viewing frustum would always lie in the XY-plane in world coordinates. To enable
arbitrary positioning of the frustum, two additional matrices are needed:
MT = | v_r,x   v_r,y   v_r,z   0 |
     | v_u,x   v_u,y   v_u,z   0 |
     | v_n,x   v_n,y   v_n,z   0 |
     | 0       0       0       1 | ,    (D.6)
which transforms points lying in the plane of the screen to lie in the XY-plane (so that the perspective
projection could be applied), and
T = | 1   0   0   -p_e,x |
    | 0   1   0   -p_e,y |
    | 0   0   1   -p_e,z |
    | 0   0   0   1      | ,    (D.7)
which translates the tracked head location to the apex of the frustum.
Finally, note that the composition of linear transformations in homogeneous coordinates corresponds
to the product of the matrices that describe these transformations. This way, the overall generalized
perspective projection G (which produces a correct off-axis projection given constant screen corner
coordinates p_a, p_b, p_c and a varying head position p_e) can be calculated by taking the product of the
three matrices described above, i.e.
G = P MT T.    (D.8)
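The whole construction, from the screen-local axes (D.2) to the final product (D.8), can be sketched in pure Python; names are illustrative and the matrices act on homogeneous column vectors:

```python
def general_perspective(pa, pb, pc, pe, n, f):
    """Generalized perspective projection (after Kooima): screen corners
    pa, pb, pc, eye position pe, near/far plane distances n, f.
    Returns G = P * MT * T and the frustum extents (l, r, b, t)."""
    sub = lambda a, b: [x - y for x, y in zip(a, b)]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    cross = lambda a, b: [a[1] * b[2] - a[2] * b[1],
                          a[2] * b[0] - a[0] * b[2],
                          a[0] * b[1] - a[1] * b[0]]
    unit = lambda v: [x / dot(v, v) ** 0.5 for x in v]

    vr, vu = unit(sub(pb, pa)), unit(sub(pc, pa))       # screen axes (D.2)
    vn = unit(cross(vr, vu))
    va, vb, vc = sub(pa, pe), sub(pb, pe), sub(pc, pe)  # eye-to-corner (D.3)
    d = -dot(vn, va)                                    # eye-to-screen distance
    l, r = dot(vr, va) * n / d, dot(vr, vb) * n / d     # frustum extents (D.4)
    b, t = dot(vu, va) * n / d, dot(vu, vc) * n / d

    P = [[2 * n / (r - l), 0, (r + l) / (r - l), 0],    # off-axis frustum (D.5)
         [0, 2 * n / (t - b), (t + b) / (t - b), 0],
         [0, 0, -(f + n) / (f - n), -2 * f * n / (f - n)],
         [0, 0, -1, 0]]
    MT = [vr + [0], vu + [0], vn + [0], [0, 0, 0, 1]]   # screen-to-XY basis (D.6)
    T = [[1, 0, 0, -pe[0]], [0, 1, 0, -pe[1]],          # eye-to-apex shift (D.7)
         [0, 0, 1, -pe[2]], [0, 0, 0, 1]]

    matmul = lambda A, B: [[sum(A[i][k] * B[k][j] for k in range(4))
                            for j in range(4)] for i in range(4)]
    return matmul(P, matmul(MT, T)), (l, r, b, t)       # G = P MT T (D.8)
```

As a sanity check, with the eye centered in front of the screen the extents come out symmetric (l = -r, b = -t), and they become asymmetric as the tracked head moves off-axis.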
D.2.2 Real-Time Shadows using Z-Pass Algorithm with Stencil Buffers
As discussed in section A.2.1, cast shadows are very important for the human perception of the 3D
world. In particular, shadows play an important role in understanding the position, size and the
geometry of the light-occluding object, as well as the geometry of the objects on which the shadow is
being cast.
The first hardware-accelerated algorithm that uses stencil buffers and shadow volumes to render
shadows in real-time was described by Heidmann in 1991 [21].
His technique uses the following two-step process:
Figure D.3: Shadow volume of a triangle (white polygon) lit by a single point-light source. Any point inside this volume is in the shadow, everything outside is lit by the light.
1. The scene is rendered as if it were completely in shadow (e.g. using only ambient lighting);
2. Shadow volumes are calculated for each face and the stencil buffer is updated to mask the
areas within the shadow volumes; then, for each light source, the scene is rendered as if it were
completely lit, using the stencil buffer mask.
D.2.2.1 Shadow Volumes
Shadow volumes were first proposed by Crow [12] in 1977. A shadow volume is defined by the object-space
tessellations of the boundaries of the regions of space occluded from the light source [12].
To understand how a shadow volume can be constructed, without loss of generality consider a
triangle lit by a single point-light source. Projecting rays from the light source through each of the vertices
of the triangle to the points at infinity will form a shadow volume. Any point inside that volume is
hidden from the light source (i.e. it is in the shadow), everything outside is lit by the light (see figure
D.3).
D.2.2.2 Z-Pass Shadow Algorithm
After calculating shadow volumes, the locations in the scene where the shadows should be rendered
can be found in the following way:
1. For every pixel, project the ray from the viewpoint to the object visible at that pixel.
Figure D.4: Z-Pass algorithm. The blue polygon represents the viewing area from the camera point of view, grey polygons represent the shadow volumes, blue points indicate the "entries" to the shadow volumes, red points indicate the "exits". Numbers above the blue/red points indicate the operation that is being performed on the stencil buffer; if more shadow volumes have been entered than left (i.e. the value present in the stencil buffer is greater than zero) then the pixel in question is in the shadow.
2. Follow this ray and, for every pixel, subtract the number of times some shadow volume is left
from the number of times some shadow volume is entered.
3. If this count is greater than zero when the object is reached, more shadow volumes have been
entered than left, therefore that pixel of the object must be in the shadow.
See figure D.4 for an illustration.
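For a single ray, the entry/exit counting of the steps above can be sketched as follows (a simplified model where each shadow volume crossed by the ray is summarized by its entry and exit depths along the ray; names are illustrative):

```python
def in_shadow(volume_spans, object_depth):
    """Z-Pass counting along one ray: volume_spans is a list of
    (entry_depth, exit_depth) pairs, one per shadow volume crossed.
    The pixel is shadowed if more volumes were entered than left
    before the ray reaches the visible object."""
    count = 0
    for entry, exit_ in volume_spans:
        if entry < object_depth:
            count += 1      # ray entered a shadow volume
        if exit_ < object_depth:
            count -= 1      # ray left it again before the object
    return count > 0
```

The stencil buffer implementation below realizes exactly this count in hardware, one increment/decrement pass per volume face orientation.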
D.2.2.3 Stencil Buffer Implementation
A stencil buffer is an integer per-pixel buffer (additional to the colour and depth buffers) found in
modern graphics cards; it is typically used to limit the area of rendering.
An interesting application of the stencil buffer in real-time shadow rendering arises from the strong connection
between the depth and stencil buffers in the rendering pipeline. Since the values in the
stencil buffer can be incremented/decremented every time a pixel passes or fails the depth test,
the following implementation of the Z-Pass shadow algorithm (as described in [21]) becomes
feasible:
1. Initialize the stencil buffer to zero; render the scene with the lighting disabled. Amongst other
things, this will load the depth buffer with the depth values of the visible objects in the scene.
2. Enable back-face culling, set the stencil operation to increment on depth-test pass, and render
the shadow volumes without writing the rendering result into the colour and depth buffers.
This will count the number of “entries” into the shadow volume as described above.
3. Enable front-face culling and set the stencil operation to decrement on depth-test pass.
Again, render the shadow volumes without storing the render in the colour and depth buffers.
In this case, each pixel value in the stencil buffer will be decremented when the ray “leaves”
some shadow volume.
As described in section D.2.2.2, only the pixels that have a stencil buffer value of zero should be lit as
they are the ones that lie outside the shadow volume.
Using the zero values as a mask in the stencil buffer and rendering the scene with the lighting enabled
will correctly overwrite the previously shadowed pixels with the lit ones.
Appendix E
Implementation (Additional Details)
E.1 Viola-Jones Distributed Training Framework
The main classes of the Viola-Jones distributed training framework are shown in figures E.1 and E.2
below. The main responsibilities of these classes are summarized in table E.1.
E.2 HT3D Library
E.2.1 Head Tracker Core
A UML 2.0 class diagram of the HT3D library core is shown in figure E.3, and the responsibilities of
individual classes are summarized in table E.2.
E.2.2 Colour- and Depth-Based Background Subtractors
The class diagram for the colour- and depth-based background subtractors is shown in figure 3.13.
ViBeBackgroundSubtractor, EuclideanBackgroundSubtractor and DepthBackgroundSubtractor
classes have the shared responsibility to distinguish the moving objects (foreground) from the static
parts of the scene (background). Further implementation details of these classes are given below.
E.2.2.1 ViBe Background Subtractor
In ViBeBackgroundSubtractor, the background models of pixels obtained from the 8-bit grayscale
input bitmaps are internally represented as a three-dimensional byte array, where the first two
dimensions represent the pixel coordinates in the image and the third dimension serves as an index into
the model of that pixel. The background models are updated over time following the theory given in
section C.3.2.
The background sensitivity of the ViBe background subtractor is defined as the radius of the hypersphere
S_R in the colour space, as shown in figure 2.8.
E.2.2.2 Euclidean Background Subtractor
The background models in EuclideanBackgroundSubtractor are built under the assumption that at
the moment of the background subtractor initialization, only background objects are present in the
frame. Then the subsequent frames can be segmented into foreground and background by inspecting
APPENDIX E. IMPLEMENTATION (ADDITIONAL DETAILS) 105
Figure E.1: UML 2.0 class diagram of the Viola-Jones distributed training framework architecture (part 1 of 2).
Figure E.2: UML 2.0 class diagram of the Viola-Jones distributed training framework architecture (part 2 of 2).
ViolaJonesTrainer: serves as an entry point to the program (includes input parameter parsing and server/client protocol set-up); provides core shared training server and client functionality (e.g. multi-threaded best local rectangle feature search); implements training state preservation and restore.
ViolaJonesTrainerServer: manages connections with clients; provides means of data serialization to XML (e.g. the detector cascade) or to a compressed binary format (e.g. integral images, training weights); handles data transfer to clients over TCP/IP and CIFS.
ViolaJonesTrainerClient: implements the client end of the connection and data exchange, data deserialization and other client-specific functionality.
RectangleFeature: provides efficient means to generate, store and evaluate rectangle features.
DecisionStumpWeakClassifier: implements the Find-Best-Weak-Classifier algorithm (given in C.1.2.1) as part of the IWeakLearner interface.
StrongLearner: encapsulates a collection of rectangle features obtained using AsymBoost into a strong learner (representing a single layer in the cascade).
StrongLearnerCascade: encapsulates a collection of trained strong learners into a detector cascade.
NegativeTrainingImage: stores large resolution negative training images; implements the False-Positive-Training-Image-Bootstrapping algorithm given in 3.5.3.1.
NormalizedTrainingImage: stores normalized training images (both negative and positive) at the detector resolution scale.
Utilities: provides helper functions (e.g. conversion between different image formats); implements various workarounds to prevent PWF machines from logging off after a certain period of inactivity, as well as mouse and keyboard software locks, to prevent other users from accidentally shutting down training clients (see figure 3.3 for a picture of training machines in action).
SynchronizedList: implements a thread-safe, synchronized generic item list (e.g. used in storing bootstrapped false positive training images which are simultaneously sent by a number of clients).
Log: handles output logging to the hard drive in a thread-safe manner.
Table E.1: Responsibilities of individual classes in the Viola-Jones distributed training framework.
Figure E.3: UML 2.0 class diagram of the HT3D library core.
HeadTracker: sets up the tracking environment: i) deserializes the Viola-Jones face detector cascade from the training framework output XML file, ii) sets up the Kinect SDK (registers for DepthFrameReady and VideoFrameReady events, opens depth2 and colour3 byte streams), iii) initializes the face/head detection and tracking components. Orchestrates inputs and outputs from the face/head detection and tracking components: i) aligns colour and depth images using a calibration procedure provided by the Kinect SDK (which uses a proprietary camera model developed by the manufacturer of the Kinect sensor, PrimeSense)1, ii) maintains the head-tracking state of the depth and colour trackers (using the HeadTrackerState enumeration), iii) prepares input data for individual tracking components (e.g. converting colour bitmaps to grayscale, or combining input colour bitmaps with background/foreground segmentation information), iv) invokes tracking components as required and combines their outputs, and v) passes the tracking output to HeadTrackFrameReady event subscribers via instances of the event arguments class HeadTrackFrameReadyEventArgs.
StatisticsHandler: provides means to record aligned colour and depth frames (as a stream of 320 × 240 px bitmap images) together with the output from the head/face trackers (serialized into an XML file as a list of FaceCenterFramePair objects), allows recording and playback of the raw colour and depth frame data (one-dimensional byte arrays provided by the Kinect SDK), and gathers statistics about the face/head detection and tracking speeds.
Utilities: provides functionality to convert data between different formats (e.g. representing depth values as colours, or converting an input bitmap to an HSV byte array) and various methods that simplify bitmap manipulation (e.g. resizing, conversion to grayscale, etc.).
Table E.2: Responsibilities of individual classes in the HT3D library core.
1 The process of colour and depth image alignment is necessary since the IR and RGB cameras have different intrinsics and extrinsics (due to their physical separation). As proposed by Herrera et al. [22], the intrinsics can be modelled using a pinhole camera model with radial and tangential distortion corrections, and the extrinsics can be modelled using a rigid transformation consisting of a rotation and a translation. After the alignment, colour data is represented as 32-bit, 320 × 240 px bitmap images and depth data is represented as two-dimensional (320 × 240) short arrays, where each item in the array represents the distance of the depth pixel from the Kinect sensor in millimetres.
2 320 × 240 px, 30 Hz. (While the Kinect sensor supports 640 × 480 px depth output, 320 × 240 px is the highest resolution compatible with the colour and depth image alignment API.)
3 640 × 480 px, 30 Hz.
which individual pixels differ from the initial frame by more than the background subtractor sensitivity
threshold.
More precisely, if I_f is the initial 8-bit grayscale input frame, I_c is the current frame being segmented
and θ is the background subtractor sensitivity threshold, then a pixel (x, y) is classified as part of the
background by EuclideanBackgroundSubtractor if
|I_f(x, y) - I_c(x, y)| < θ.    (E.1)
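The per-pixel test of equation E.1 can be sketched per frame as follows (grayscale frames as 2-D lists; True marks background; the function name is illustrative):

```python
def euclidean_background_mask(initial, current, theta):
    """Classify each pixel as background (True) when its grayscale
    value differs from the initial frame by less than theta (eq. E.1)."""
    return [
        [abs(i0 - i1) < theta for i0, i1 in zip(row0, row1)]
        for row0, row1 in zip(initial, current)
    ]
```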
E.2.2.3 Depth-Based Background Subtractor
While the depth-based background subtractor inherits from the same base BackgroundSubtractor
class as colour-based background subtractors (see figure 3.13), it serves a slightly different purpose in
the head-tracking pipeline.
The main responsibility of the DepthBackgroundSubtractor class is to increase the speed and
the accuracy of colour-based face detector and tracker, using the information provided by the
DepthHeadDetectorAndTracker.
In particular, as long as the depth-based tracker is accurately locked onto the viewer’s head (i.e. if
the depth tracker state maintained in HeadTracker is equal to HeadTrackerState.TRACKING), all
pixels that are further away from the Kinect sensor than the detected head center are classified as
background.
An illustration of this process is shown in figure 3.14.
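When the depth tracker is locked on, this classification is a one-line depth threshold; a sketch (depth in millimetres, as in the HT3D depth arrays; the function name is illustrative):

```python
def depth_background_mask(depth, head_center_depth):
    """Mark as background (True) every pixel farther from the sensor
    than the tracked head center."""
    return [[d > head_center_depth for d in row] for row in depth]
```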
E.3 3D Display Simulator Components
As described in section 3.7 and illustrated in figure 3.17, the 3D display simulator consists of two small
UI modules ("3D Simulation Entry Point" and "Head Tracker Configuration") and a larger
model-view-controller-based module ("Z-Tris").
Both UI module implementations and the "Z-Tris" M-V-C architectural units are briefly described
below.
E.3.1 Application Entry Point
When the application is initialized, the Program class launches the MainForm (shown in figure 3.20).
The main form is responsible for launching the configuration form, showing help or launching the
game form.
E.3.2 Head Tracker Configuration GUI
The ConfigurationForm class handles the communication with the HT3D library DLL. Through the user
interface (shown in figure 3.21), all available head-tracking tweaking options are exposed.
All user preferences are saved by the PreferencesHandler when the configuration form is closed, and
restored when the form is reopened. PreferencesHandler achieves this functionality by recursively
walking through the configuration form's component tree and storing/reading the values of check-boxes,
sliders and combo-boxes to/from a special XML file.
Finally, a DoubleBufferedPanel class is implemented to remove the flicker-on-repaint artifacts when
rendering the output from the head-tracking library (it extends the WinForms Panel component to
enable the double-buffering functionality).
E.3.3 3D Game (Z-Tris)
Figure E.4 shows the model-view-controller architectural grouping of the Z-Tris game classes. Each
of the M-V-C architectural units are discussed in more detail below.
E.3.3.1 Model
The main responsibility of the LogicHandler class is to maintain and update:
• the status of the pit (represented as a three-dimensional byte array),
• the status of the active (falling) polycube,
• the scores/line count/current level.
The status of the pit/active polycube is updated either on the user’s key press (notified by the
KeyboardHandler controller), or when the time for the current move expires (notified by the internal
timer).
At the end of the move, LogicHandler updates the score s using the following formula:
s ← s + line_count × line_score × f_line_count × f_level + b_empty_pit × line_score × f_level,    (E.2)
where f_line_count and f_level are multiplicative factors which increase with the number of cleared layers and
the level number (since the time allowance for each move decreases at higher levels), and
b_empty_pit is equal to 1 if the pit is empty and 0 otherwise.
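As a sketch of this scoring rule (the concrete multiplicative factors are illustrative placeholders; the text does not specify their exact values):

```python
def update_score(score, line_count, level, pit_empty,
                 line_score=100, f_line_count=None, f_level=None):
    """Score update of equation E.2. f_line_count and f_level are
    assumed to grow with the cleared-layer count and the level;
    the defaults below are placeholders, not the game's real factors."""
    f_lc = f_line_count if f_line_count is not None else line_count  # placeholder
    f_lv = f_level if f_level is not None else 1 + level             # placeholder
    score += line_count * line_score * f_lc * f_lv
    if pit_empty:
        score += line_score * f_lv      # bonus for clearing the pit
    return score
```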
After a move is finished, a random polycube (represented as a 3 × 3 byte array in the Polycube.Shapes
dictionary) is added to the pit if the pit is not already full; otherwise, the game status is changed to
LogicHandler.Status.GAME_OVER.
Both the model and the view are highly customizable (i.e. they can correctly process and render
different pit sizes, polycube shape sets, timing constraints, scoring systems and so on).
E.3.3.2 Controller
The KeyboardHandler class (full code listing given in appendix H.1) is responsible for interfacing between
the user and the game logic. It operates using the following protocol:
Figure E.4: UML 2.0 class diagram of the Z-Tris game, grouped into the model-view-controller architectural units.
Figure E.5: KeyboardHandler class event timing diagram. When the user presses and holds a key on the keyboard, the first OnKeyPress event is triggered immediately, the second is triggered after INITIAL_KEY_HOLD_DELAY_MS milliseconds and all the following events are triggered after REPEATED_KEY_HOLD_DELAY_MS milliseconds.
1. A keyboard key code is registered through a call to KeyboardHandler.RegisterKey(...) and
an event handler (callback function) is registered with KeyboardHandler.OnKeyPress event.
2. KeyboardHandler monitors the state of the keyboard and, when one of the registered keys
is pressed, notifies the appropriate OnKeyPress event subscriber(s).
3. If a key is not released, OnKeyPress events are repeatedly triggered according to the timing diagram
shown in figure E.5.
This class is capable of handling multiple key events simultaneously (as required for the control of the
game), multiple keyboard event subscribers and customization of timing constraints.
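The repeat schedule of figure E.5 can be sketched as a pure function of how long a key has been held (the delay constants here are illustrative values; the dissertation does not state them):

```python
INITIAL_KEY_HOLD_DELAY_MS = 500     # illustrative placeholder values,
REPEATED_KEY_HOLD_DELAY_MS = 100    # not the game's actual constants

def key_press_times(hold_ms):
    """Times (ms since key-down) at which OnKeyPress fires while a key
    is held for hold_ms milliseconds: immediately, after the initial
    delay, then at the repeat interval."""
    times = [0]
    t = INITIAL_KEY_HOLD_DELAY_MS
    while t <= hold_ms:
        times.append(t)
        t += REPEATED_KEY_HOLD_DELAY_MS
    return times
```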
E.3.3.3 View
The view component (in particular the RenderHandler class) is responsible for:
• rendering the static game state (the pit and the active polycube),
• rendering the active polycube animations (rotations and translations).
The animation of simultaneous rotations and translations of the active polycube is achieved by keeping
two vectors θ_r and θ_t which indicate the amount of rotation/translation animation remaining. At
each frame, the active polycube is
1. translated to the coordinate origin,
2. rotated in all three directions simultaneously by the fraction θ_r × timeFromPreviousRender / KeyboardHandler.REPEATED_KEY_HOLD_DELAY_MS,
3. translated by θ_t × timeFromPreviousRender / KeyboardHandler.REPEATED_KEY_HOLD_DELAY_MS, and
4. translated back to its original location.
A screenshot of simultaneous translations and rotations is shown in figure E.6.
The active polycube is also rendered as semi-transparent, so as not to occlude the playing field.
This is achieved by
Figure E.6: Screenshot of the Z-Tris game showing a simultaneous translation and rotation of the active polycube around the X- and Y-axes.
1. Rendering the active polycube as the last element of the scene with blending enabled,
2. Hiding the internal faces of individual cubes that make up the active polycube,
3. Culling the front faces and blending the remaining back faces of the polycube onto the scene,
4. Culling the back faces and blending the remaining front faces of the polycube onto the scene.
The rest of the rendering details are described in section 3.7.1.
Appendix F
HT3D Library Evaluation (Additional Details)
F.1 Evaluation Metrics
F.1.1 Sequence Track Detection Accuracy
The STDA measure (introduced by Manohar et al. [32]) evaluates the performance of an object tracker in
terms of the overall detection (number of objects detected, false alarms and missed detections), the
spatio-temporal accuracy of the detection (the proportion of the ground truth detected both in individual
frames and in the whole tracking sequence) and the spatio-temporal fragmentation.
The following notation (following the original paper) is used:
• G_i^(t) denotes the ith ground truth object in the tth frame,
• D_i^(t) denotes the ith detected object in the tth frame,
• N_G^(t) and N_D^(t) denote the number of ground truth/detected objects in the tth frame respectively,
• N_frames is the total number of ground truth frames in the sequence,
• N_mapped is the number of mapped ground truth and detected objects in the whole sequence, and
• N_{G_i ∪ D_i ≠ ∅} is the total number of frames in which either the ground truth object i, or the
detected object i (or both), are present.
Then the Track Detection Accuracy (TDA) measure for the ith object can be calculated as the spatial
overlap (i.e. the ratio of the spatial intersection and union) between the ground truth and the tracking
output of object i. More precisely, TDA can be defined as
TDA_i = Σ_{t=1}^{N_frames} |G_i^(t) ∩ D_i^(t)| / |G_i^(t) ∪ D_i^(t)|.    (F.1)
Observe that the TDA measure penalizes both false negatives (undetected ground truth area) and
false positives (detections that do not overlap any ground truth area).
To obtain the STDA measure, TDA is averaged for the best mapping of all objects in the sequence,
i.e.
STDA = Σ_{i=1}^{N_mapped} TDA_i / N_{G_i ∪ D_i ≠ ∅}
     = Σ_{i=1}^{N_mapped} [ Σ_{t=1}^{N_frames} |G_i^(t) ∩ D_i^(t)| / |G_i^(t) ∪ D_i^(t)| ] / N_{G_i ∪ D_i ≠ ∅}.    (F.2)
APPENDIX F. HT3D LIBRARY EVALUATION (ADDITIONAL DETAILS) 117
F.1.2 Multiple Object Tracking Accuracy/Precision
CLEAR (Classification of Events, Activities and Relationships) was an "international effort to evaluate
systems for the perception of people, their activities and interactions." The CLEAR evaluation workshops
[39] held in 2006 and 2007 introduced the Multiple Object Tracking Precision (MOTP) and Multiple
Object Tracking Accuracy (MOTA) metrics for 2D face tracking task evaluation [5].
The MOTP metric evaluates the total error in the estimated positions of ground truth/detection
pairs for the whole sequence, averaged over the total number of matches made. More precisely, MOTP
is defined as

MOTP = Σ_{i=1}^{N_mapped} TDA_i / Σ_{j=1}^{N_frames} N_mapped^(j),    (F.3)
where N_mapped^(j) is the number of mapped objects in the jth frame.
The MOTA metric is derived from three error ratios (the ratios of misses, false alarms and mismatches in the
sequence, computed over the number of objects present in all frames) and attempts to assess the
accuracy aspect of the system's performance. MOTA is defined as

MOTA = 1 - [ Σ_{i=1}^{N_frames} (c_M(FN_i) + c_FP(FP_i) + ln S) ] / [ Σ_{i=1}^{N_frames} N_G^(i) ],    (F.4)
where c_M(x) and c_FP(x) are the cost functions for missed detection and false alarm penalties, FN_i
and FP_i are the numbers of false negatives/false positives in the ith frame respectively, and S is the total
number of object ID switches for all objects.
In turn, false negative and false positive counts are defined as
FN_i = Σ_{j=1}^{N_mapped} 1{ |G_j^(i) \ D_j^(i)| / |G_j^(i)| > θ_FN },    (F.5)

FP_i = Σ_{j=1}^{N_mapped} 1{ |D_j^(i) \ G_j^(i)| / |D_j^(i)| > θ_FP },    (F.6)
where θ_FN and θ_FP are the false negative/false positive ratio thresholds (illustrated in figure
F.1).
F.1.3 Average Normalized Distance from the Head Center
For motion parallax simulation, accurately localizing the face center is more important than achieving
a high spatio-temporal overlap between the detected and tagged objects. To measure the HT3D colour,
depth and combined head-trackers in this regard, an Average Normalized Distance from Head Center
(δ) metric is constructed.
The ground truth head ellipse in frame i is described by its center location c_i and the endpoints of the
semi-major and semi-minor axes (points a_i and b_i respectively).
Figure F.1: False positive (false alarm) and false negative (miss) definitions for the MOTA metric. The blue ellipse indicates the detected head D_i, the red ellipse indicates the tagged ground truth G_i.
Let hi be the head center location in frame i, as predicted by the head tracker. Then the normalized
distance between the detected and tagged head centres δi can be calculated by transforming the ellipse
into a unit circle centred around the origin, and measuring the length of the transformed head center
vector (as shown in figure F.2).
Let φ_j be the angle between the major axis of the ellipse and the x-axis in the jth frame. Observe that
φ_j = cos^{-1}( ((a_j - c_j) · i) / |a_j - c_j| ), where i is the unit vector along the x-axis.
Then the average normalized distance from the tagged head center can be calculated as

δ = (1/N_frames) Σ_{i=1}^{N_frames} δ_i = (1/N_frames) Σ_{i=1}^{N_frames} || M_i (h_i - c_i)^T ||,    (F.7)

where the matrix

M_i = |  cos φ_i / |a_i - c_i|    sin φ_i / |a_i - c_i| |
      | -sin φ_i / |b_i - c_i|    cos φ_i / |b_i - c_i| |

maps the ground truth ellipse to a unit circle centred at the origin.
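The transformation in figure F.2 amounts to a rotation by -φ followed by scaling with the two semi-axis lengths; a sketch for one frame (points as (x, y) tuples; the function name is illustrative):

```python
import math

def delta_i(a, b, c, h):
    """Normalized distance for one frame: map the ground truth ellipse
    (center c, semi-axis endpoints a and b) to a unit circle at the
    origin, then measure the transformed head prediction h."""
    phi = math.atan2(a[1] - c[1], a[0] - c[0])   # ellipse orientation
    la = math.dist(a, c)                          # semi-major length
    lb = math.dist(b, c)                          # semi-minor length
    dx, dy = h[0] - c[0], h[1] - c[1]
    u = (math.cos(phi) * dx + math.sin(phi) * dy) / la
    v = (-math.sin(phi) * dx + math.cos(phi) * dy) / lb
    return math.hypot(u, v)
```

A prediction on the ellipse boundary thus maps to δ_i = 1, and a perfect prediction at the center to δ_i = 0, regardless of the ellipse's size or orientation.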
Figure F.2: δ-metric computation: a) the ith input frame is transformed to image b) so that the ground truth ellipse shown in red is mapped to a unit circle centred at the origin. Then the normalized distance metric δ_i is the length of the position vector given by the transformed head center prediction (shown in blue).
F.2 Evaluation Set
F.2.1 Viola-Jones Face Detector Output
Figure F.3: Output of the Viola-Jones face detector for all HT3D library evaluation recordings. False positive face detections are marked with red crosses.
F.2.2 δ Metric for Individual Recordings
Head tracking accuracy (δ metric evolution) for all evaluation set recordings is shown below.
Figure F.4: "Head rotation (roll)" recording (frames 74, 136, 219, 228, 232, 239, 244, 252, 258, 655, 731 and 761 shown). The marked red area indicates the output of the combined (depth and colour) head tracker.
Figure F.5: Head-tracking accuracy (δ metric) for “Head rotation (roll)” recording.
Figure F.6: "Head rotation (yaw)" recording (frames 63, 144, 315, 431, 453, 466, 488, 507, 531, 553, 659 and 827 shown).
Figure F.7: δ metric for “Head rotation (yaw)” recording.
Figure F.8: "Head rotation (pitch)" recording (frames 63, 81, 144, 167, 204, 228, 259, 264, 290, 353, 492 and 657 shown).
Figure F.9: δ metric for “Head rotation (pitch)” recording.
Figure F.10: "Head rotation (all)" recording (frames 48, 77, 98, 114, 183, 282, 308, 343, 550, 564, 612 and 636 shown).
Figure F.11: δ metric for “Head rotation (all)” recording.
Figure F.12: "Head translation (horizontal and vertical)" recording (frames 54, 71, 139, 166, 248, 346, 465, 531, 549, 591, 718 and 735 shown).
Figure F.13: δ metric for “Head translation (horizontal and vertical)” recording.
Figure F.14: "Head translation (anterior/posterior)" recording (frames 38, 56, 75, 91, 140, 212, 368, 395, 551, 659, 696 and 729 shown).
Figure F.15: δ metric for “Head translation (anterior/posterior)” recording.
Figure F.16: "Head translation (all)" recording (frames 57, 153, 199, 237, 299, 334, 374, 479, 563, 619, 743 and 783 shown).
Figure F.17: δ metric for “Head translation (all)” recording.
Frame 41 Frame 80 Frame 124 Frame 144
Frame 188 Frame 327 Frame 350 Frame 407
Frame 420 Frame 482 Frame 670 Frame 727
Figure F.18: “Head rotation and translation (all)” recording.
Figure F.19: δ metric for “Head rotation and translation (all)” recording.
Figure F.20: “Participant #1” recording (frames 64, 104, 152, 237, 354, 386, 591, 642, 668, 687, 743 and 812 shown).
Figure F.21: δ metric for “Participant #1” recording.
Figure F.22: “Participant #2” recording (frames 150, 180, 185, 218, 354, 449, 484, 498, 528, 542, 580 and 767 shown).
Figure F.23: δ metric for “Participant #2” recording.
Figure F.24: “Participant #3” recording (frames 50, 136, 219, 313, 382, 409, 496, 538, 618, 623, 654 and 776 shown).
Figure F.25: δ metric for “Participant #3” recording.
Figure F.26: “Participant #4” recording (frames 0, 199, 232, 259, 272, 416, 549, 603, 651, 722, 738 and 847 shown).
Figure F.27: δ metric for “Participant #4” recording.
Figure F.28: “Participant #5” recording (frames 70, 174, 240, 298, 363, 458, 531, 537, 621, 705, 751 and 822 shown).
Figure F.29: δ metric for “Participant #5” recording.
Figure F.30: “Illumination (low)” recording (frames 62, 126, 142, 253, 268, 377, 439, 466, 511, 529, 671 and 782 shown).
Figure F.31: δ metric for “Illumination (low)” recording.
Figure F.32: “Illumination (changing)” recording (frames 11, 70, 156, 242, 275, 361, 534, 551, 653, 728, 770 and 840 shown).
Figure F.33: δ metric for “Illumination (changing)” recording.
Figure F.34: “Illumination (high)” recording (frames 0, 110, 174, 326, 359, 392, 464, 607, 721, 767, 823 and 842 shown).
Figure F.35: δ metric for “Illumination (high)” recording.
Figure F.36: “Facial expressions” recording (frames 22, 129, 175, 196, 215, 247, 266, 402, 510, 690, 700 and 818 shown).
Figure F.37: δ metric for “Facial expressions” recording.
Figure F.38: “Cluttered background” recording (frames 11, 69, 103, 142, 199, 278, 403, 480, 617, 658, 705 and 741 shown).
Figure F.39: δ metric for “Cluttered background” recording.
Figure F.40: “Occlusions” recording (frames 50, 106, 163, 367, 379, 410, 416, 433, 564, 616, 620 and 731 shown).
Figure F.41: δ metric for “Occlusions” recording.
Figure F.42: “Multiple viewers” recording (frames 95, 182, 249, 315, 390, 426, 515, 567, 594, 618, 694 and 767 shown).
Figure F.43: δ metric for “Multiple viewers” recording.
F.2.3 MOTA/MOTP Evaluation Results
MOTA/MOTP metrics for all evaluation recordings are summarized in figure F.44. As with the STDA metric, the depth-based and combined trackers outperform the colour-based head tracker, but fall short of the inter-annotator agreement.
Figure F.44: MOTA/MOTP metrics for all evaluation recordings, evaluated using the default tracker settings given in table 4.6. Higher values indicate better performance.
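For reference, the CLEAR-MOT metrics reported above can be computed as follows. This is a minimal Python sketch of the standard definitions (the evaluation itself was performed with the project's C# tooling), and the per-frame input format is an assumption for illustration only; note that this sketch uses a distance-based MOTP, where lower is better:

```python
def mota_motp(frames):
    """CLEAR-MOT metrics from per-frame matching results.

    Each frame is a dict with:
      "matches"         - list of distances between matched tracker/annotation pairs,
      "misses"          - annotated heads with no tracker hypothesis,
      "false_positives" - tracker hypotheses matching no annotation,
      "mismatches"      - identity switches,
      "objects"         - number of annotated heads in the frame.
    """
    matched, distance = 0, 0.0
    misses = false_positives = mismatches = objects = 0
    for f in frames:
        matched += len(f["matches"])
        distance += sum(f["matches"])
        misses += f["misses"]
        false_positives += f["false_positives"]
        mismatches += f["mismatches"]
        objects += f["objects"]
    # MOTA: one minus the ratio of all errors to the number of ground-truth objects
    mota = 1.0 - (misses + false_positives + mismatches) / objects
    # MOTP: average distance over all matched tracker/annotation pairs
    motp = distance / matched if matched else float("nan")
    return mota, motp
```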
Appendix G
3D Display Simulator (Z-Tris) Evaluation
The main intention behind the Z-Tris implementation was to provide a proof-of-concept 3D application that strengthens depth perception using continuous motion parallax (obtained by changing the perspective projection based on the viewer's head position). To verify the operation of this proof-of-concept, a combination of automated and manual tests was used. The performance of the 3D display simulator was also measured, to ensure that real-time rendering rates can be achieved while simulating all the depth cues mentioned earlier.
G.1 Automated Testing
The unit testing framework provided by the Microsoft Visual Studio IDE was used to author and run unit and “smoke” tests. A sample run is shown in figure 4.26.
As summarized in table G.1, most of the code (around 85.25%) in the Z-Tris core (the main classes from figure E.4) was covered by automated testing.
Automated “smoke” tests were also used as part of regression testing, to keep the application in a working state throughout the development iterations.
Class name           Code coverage (% blocks)   Covered (blocks)   Not covered (blocks)
RenderHandler        85.38%                     543                93
LogicHandler         84.25%                     385                72
SpriteHandler        67.27%                     74                 36
PreferencesHandler   93.94%                     93                 6
KeyboardHandler      100.00%                    58                 0
Polycube             100.00%                    45                 0
DisplayUtilities     83.33%                     10                 2
Total:               85.25%                     1,208              209
Table G.1: Z-Tris core unit test code coverage.
APPENDIX G. 3D DISPLAY SIMULATOR (Z-TRIS) EVALUATION 143
Figure G.1: Z-Tris off-axis projection and head-tracking manual integration testing: the same scene is rendered with the viewer's head positioned at the a) left, b) right, c) top and d) bottom bevels of the display.
G.2 Manual Testing
Manual testing included hours of functional testing, both to evaluate the requirements given in section 2.3 and to perform basic usability and sanity checks.
A significant amount of time was also spent on manual system integration testing. Figure G.1 shows one example integration test scenario, where the same scene is rendered from four different viewpoints based on the viewer's head position, to test the integration of head tracking and off-axis projection rendering.
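For reference, the asymmetric view frustum underlying such off-axis projection can be sketched as below. The renderer itself is written in C# with OpenTK; this is an illustrative Python transcription of the standard construction for a screen centred at the origin in the z = 0 plane, and the parameter names are assumptions:

```python
def off_axis_frustum(eye, screen_w, screen_h, near):
    """Frustum bounds (left, right, bottom, top) at the near plane for a
    viewer at eye = (ex, ey, ez), ez > 0, looking at a screen of size
    screen_w x screen_h centred at the origin in the z = 0 plane.
    The results are what one would pass to e.g. glFrustum, followed by a
    translation of the scene by (-ex, -ey, -ez)."""
    ex, ey, ez = eye
    scale = near / ez  # similar triangles: project the screen edges onto the near plane
    left = (-screen_w / 2.0 - ex) * scale
    right = (screen_w / 2.0 - ex) * scale
    bottom = (-screen_h / 2.0 - ey) * scale
    top = (screen_h / 2.0 - ey) * scale
    return left, right, bottom, top
```

With the head centred, the frustum is symmetric; as the head moves towards one bevel of the display, the frustum skews towards the opposite side, which is what produces the four distinct views in figure G.1.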
G.3 Performance
After integrating the 3D display rendering subsystem and the HT3D head-tracking library, the overall system's run-time performance was measured. The system was built in 64-bit “Release” mode, with no debug information and with optimizations enabled. The final setup of the integrated system is shown in figure 3.19.
The overall system's performance was measured on the main development machine, running a 64-bit Windows 7 OS on a dual-core hyperthreaded Intel Core i5-2410M CPU @ 2.30 GHz.
As expected, the average Z-Tris game rendering speed (with the combined colour and depth head tracker enabled) was 29.969 frames-per-second (with a standard deviation of 5.167 frames), i.e. the real-time requirements were satisfied. A single CPU core experienced an average load of 64.98% (minimum 17.13%, maximum 88.01%) over 5 minutes of game play, indicating that further processing resources were still available.
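The frame-rate statistics above can be derived from per-frame render times in the usual way; a minimal Python sketch (the actual measurements were taken inside the C# application, e.g. via a high-resolution timer sampled around each rendered frame):

```python
def fps_stats(frame_times):
    """Mean and (population) standard deviation of instantaneous FPS,
    given a list of per-frame render durations in seconds."""
    samples = [1.0 / t for t in frame_times]
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, variance ** 0.5
```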
Appendix H
Sample Code Listings
Listing H.1: KeyboardHandler code.

using System;
using System.Collections.Generic;
using System.Text;
using OpenTK.Input;
namespace ZTris
{
/// <summary>
/// <para>
/// Keyboard handler, responsible for interfacing between the user and the game
/// logic.
/// </para>
/// <para>
/// This class is capable of handling multiple key events simultaneously (as
/// required for the control of the game), multiple keyboard event subscribers
/// and customization of timing constraints.
/// </para>
/// <para>
/// It operates using the following protocol:
/// <list type="bullet">
/// <item>
/// A keyboard key code and the caller are registered through a call to
/// <see cref="RegisterKey(Key key, object sender)">, and an event
/// handler (callback function) is registered through the
/// <see cref="OnKeyPress"> event.
/// </item>
/// <item>
/// <see cref="KeyboardHandler"> monitors the state of the keyboard and,
/// when one of the registered keys is pressed, notifies all
/// <see cref="OnKeyPress"> event subscribers.
/// </item>
/// <item>
/// If a key is not released, it repeatedly triggers <see cref="OnKeyPress">
/// events according to the <see cref="INITIAL_KEY_HOLD_DELAY_MS"> and
/// <see cref="REPEATED_KEY_HOLD_DELAY_MS"> timings.
/// </item>
/// </list>
/// </para>
/// </summary>
public class KeyboardHandler
{
#region Internal classes
/// <summary>
/// Internal mutable key state representation.
/// </summary>
private class KeyState
{
    public DateTime LastPressTime;
    public bool IsRepeated;
    public bool IsFirst;
}
#endregion
#region Constants
/// <summary>Represents the initial key hold delay until the second key event is
/// triggered.</summary>
public const int INITIAL_KEY_HOLD_DELAY_MS = 400;

/// <summary>Represents the key hold delay until the third (and all subsequent)
/// key events are triggered.</summary>
public const int REPEATED_KEY_HOLD_DELAY_MS = 180;
#endregion
#region Private fields
/// <summary>Interface to the keyboard device.</summary>
private IKeyboardDevice _keyboard = null;

/// <summary>Maps keyboard keys to their states.</summary>
private Dictionary<Key, KeyState> _pressedKeys = new Dictionary<Key, KeyState>();

/// <summary>Maps registered keyboard keys to their subscribers.</summary>
private Dictionary<Key, List<object>> _registeredKeys =
    new Dictionary<Key, List<object>>();
#endregion
#region Public fields
/// <summary>Key press event handler type.</summary>
/// <param name="key">Keyboard key that triggered the event.</param>
public delegate void KeyEventHandler(Key key);

/// <summary>Key press event handler.</summary>
public event KeyEventHandler OnKeyPress;
#endregion
#region Constructors
/// <summary>Default keyboard handler constructor.</summary>
/// <param name="keyboard">Keyboard interface.</param>
public KeyboardHandler(IKeyboardDevice keyboard)
{
    _keyboard = keyboard;
}
#endregion
#region Public methods
/// <summary>
/// A method to register a subscriber's interest in a particular key press.
/// Typically this method would be called as <c>RegisterKey(..., this)</c>.
/// </summary>
/// <param name="key">Keyboard key to register.</param>
/// <param name="subscriber">Handle to the subscriber.</param>
public void RegisterKey(Key key, object subscriber)
{
    // Guard against double registration of the same key (e.g. by a second
    // subscriber), which would otherwise make Dictionary.Add throw
    if (!_pressedKeys.ContainsKey(key))
    {
        _pressedKeys.Add(key, new KeyState()
        {
            LastPressTime = DateTime.Now,
            IsRepeated = false,
            IsFirst = true
        });
    }
    if (!_registeredKeys.ContainsKey(key))
    {
        _registeredKeys.Add(key, new List<object>());
    }
    _registeredKeys[key].Add(subscriber);
}
/// <summary>
/// A method to register a subscriber's interest in particular key presses.
/// Typically this method would be called as <c>RegisterKeys(..., this)</c>.
/// </summary>
/// <param name="keys">Keyboard keys to register.</param>
/// <param name="subscriber">Handle to the subscriber.</param>
public void RegisterKeys(Key[] keys, object subscriber)
{
    foreach (Key key in keys)
    {
        this.RegisterKey(key, subscriber);
    }
}
/// <summary>
/// Main event processing loop where the subscribed key events are triggered.
/// </summary>
public void UpdateStatus()
{
    // Record the key press time before processing
    DateTime keyPressTime = DateTime.Now;

    // Check the status of each registered key
    foreach (Key key in _registeredKeys.Keys)
    {
        KeyState pressedKeyState = _pressedKeys[key];
        if (_keyboard[key])
        {
            bool triggerKeyPressEvent = false;

            // If the key is pressed for the first time, trigger the event immediately
            if (pressedKeyState.IsFirst)
            {
                triggerKeyPressEvent = true;
                pressedKeyState.IsFirst = false;
            }
            // If the key is held, trigger the events according to the timing constraints
            else
            {
                double timeSinceLastPressInMs =
                    keyPressTime.Subtract(pressedKeyState.LastPressTime).TotalMilliseconds;
                triggerKeyPressEvent = (pressedKeyState.IsRepeated ?
                    (timeSinceLastPressInMs > REPEATED_KEY_HOLD_DELAY_MS) :
                    (timeSinceLastPressInMs > INITIAL_KEY_HOLD_DELAY_MS));

                // Update the key state
                if (triggerKeyPressEvent)
                {
                    pressedKeyState.IsRepeated = true;
                }
            }

            if (triggerKeyPressEvent)
            {
                // Record the last press time
                pressedKeyState.LastPressTime = keyPressTime;
                // Trigger the subscriber event handlers
                this.CallbackSubscribers(key);
            }
        }
        else
        {
            pressedKeyState.IsRepeated = false;
            pressedKeyState.IsFirst = true;
        }
    }
}
#endregion
#region Private methods
/// <summary>
/// Calls back the event handlers of subscribers to a particular key press.
/// </summary>
/// <param name="key">Keyboard key that was pressed.</param>
private void CallbackSubscribers(Key key)
{
    if (OnKeyPress == null)
    {
        return;
    }
    foreach (Delegate eventCallback in OnKeyPress.GetInvocationList())
    {
        if (_registeredKeys[key].Contains(eventCallback.Target))
        {
            eventCallback.DynamicInvoke(key);
        }
    }
}
#endregion
}
}
Listing H.2: IKeyboardDevice interface.

using System;
using OpenTK.Input;
namespace ZTris
{
/// <summary>
/// Keyboard device interface, responsible for providing the keyboard status.
/// </summary>
public interface IKeyboardDevice
{
    /// <summary>
    /// An indexer returning the status of a particular key.
    /// </summary>
    /// <param name="key">Keyboard key of interest.</param>
    /// <returns>
    /// Status of <see cref="key"/>:
    /// <list type="table">
    /// <item>
    /// <term>true</term>
    /// <description><see cref="key"/> is pressed.</description>
    /// </item>
    /// <item>
    /// <term>false</term>
    /// <description><see cref="key"/> is released.</description>
    /// </item>
    /// </list>
    /// </returns>
    bool this[Key key] { get; }
}
}
MEASURING HEAD DETECTION AND TRACKING SYSTEM ACCURACY
EXPERIMENT CONSENT FORM
EXPERIMENT PURPOSE
This experiment is part of the Computer Science Tripos Part II project evaluation. The project in question involves using a Microsoft Kinect sensor to track the viewer's head position in space. The main purpose of the experiment is to ensure that the face detector/tracker is robust and works for different viewers.
EXPERIMENT PROCEDURE
The experiment consists of recording two colour-and-depth videos (each 30 seconds long) of the participant moving his/her head in a free-form manner.
A possible range of head/face motions that can be performed includes (but is not limited to):
- Head rotation (yaw/pitch/roll)
- Head translation (horizontal/vertical, anterior/posterior)
- Facial expressions, e.g. joy, surprise, fear, anger, disgust or sadness
CONFIDENTIALITY
The following data will be stored: two (2) colour and depth recordings (each 30 seconds long).
No other personal data will be retained. Recorded videos will be kept in accordance with the Data Protection Act and destroyed after the submission of the dissertation.
FINDING OUT ABOUT RESULT
If interested, you can find out the result of the study by contacting Manfredas Zabarauskas, after 18/05/2012. His phone number is 0754 195 8411 and his email address is [email protected].
PLEASE NOTE THAT:
- YOU HAVE THE RIGHT TO STOP PARTICIPATING IN THE EXPERIMENT AT ANY
TIME, WITHOUT GIVING A REASON.
- YOU HAVE THE RIGHT TO OBTAIN FURTHER INFORMATION ABOUT THE
PURPOSE AND THE OUTCOMES OF THE EXPERIMENT.
- NONE OF THE TASKS IS A TEST OF YOUR PERSONAL ABILITY. THE OBJECTIVE
IS TO TEST THE ACCURACY OF THE IMPLEMENTED HEAD TRACKING SYSTEM.
RECORD OF CONSENT
Your signature below indicates that you have understood the information about the “Measuring Head Detection and Tracking System Accuracy” experiment and consent to your participation. The participation is voluntary and you may refuse to answer certain questions on the questionnaire and withdraw from the study at any time with no penalty. This does not waive your legal rights. You should have received a copy of the consent form for your own record. If you have further questions related to this research, please contact the researcher.
Participant (Name, Signature): Date (dd/mm/yy): __________________________________________________________________________
__________________________________
Researcher (Name, Signature): Date (dd/mm/yy): __________________________________________________________________________
__________________________________
MEASURING HEAD DETECTION AND TRACKING SYSTEM ACCURACY
VIDEO AND DEPTH RECORDING RELEASE FORM
RELEASE STATEMENT
I HEREBY ASSIGN AND GRANT TO MANFREDAS ZABARAUSKAS THE RIGHT AND
PERMISSION TO USE AND PUBLISH (PARTIALLY OR IN FULL) THE VIDEO AND/OR
DEPTH RECORDINGS MADE DURING THE “MEASURING HEAD DETECTION AND
TRACKING SYSTEM ACCURACY” EXPERIMENT, AND I HEREBY RELEASE
MANFREDAS ZABARAUSKAS FROM ANY AND ALL LIABILITY FROM SUCH USE AND
PUBLICATION.
Participant (Name, Signature): Date (dd/mm/yy): __________________________________________________________________________
_________________________
Researcher (Name, Signature): Date (dd/mm/yy): __________________________________________________________________________
_________________________
Appendix I
Project Proposal
Computer Science Tripos Part II Project Proposal
3D Display Simulation Using Head Tracking with Microsoft Kinect
M. Zabarauskas, Wolfson College (mz297)
Originator: M. Zabarauskas
20 October 2011
Project Supervisor: Prof N. Dodgson
Signature:
Director of Studies: Dr C. Town
Signature:
Project Overseers: Dr S. Clark & Prof J. Crowcroft
Signatures:
APPENDIX I. PROJECT PROPOSAL 154
Introduction
Reliable real-time human face detection and tracking has been one of the most interesting problems in the field of computer vision for the past few decades. The emergence of the cheap and ubiquitous Microsoft Kinect sensor, containing an IR depth camera, provides new opportunities to enhance the reliability and speed of face detection and tracking. Moreover, the ability to use the depth information to track the user's head in 3D space opens up a lot of potential for new immersive user interfaces.
In my project I want to implement widely recognized, industry-standard face detection and tracking methods: the Viola-Jones object detection framework and the CAMShift (Continuously Adaptive Mean Shift) face tracker, based on the ideas presented by the authors in their original papers. Having achieved that, I want to explore the opportunities of using current state-of-the-art methods to integrate the depth information into face detection and tracking algorithms, in order to increase their speed and accuracy. As the next and final part of the project, I want to employ the depth information provided by Kinect to obtain an accurate 3D location of the viewer with respect to the display. Knowing the viewer's head coordinates in 3D will allow me to simulate the parallax motion that occurs between visually overlapping near and far objects in a 3D scene when the user's viewpoint changes, mimicking a three-dimensional display viewing experience.
Method Descriptions
The Viola-Jones face detector mentioned above is a breakthrough method for face detection, proposed by Viola and Jones [43] in 2001. They described a family of extremely simple classifiers (called “rectangle features”, reminiscent of Haar wavelets) and a representation of a grayscale image (called the “integral image”) with which these Haar-like features can be calculated in constant time. Then, using a classifier boosting algorithm based on AdaBoost, a number of the most effective features can be extracted and combined to yield an efficient “strong” classifier with an extremely low false negative rate but a high false positive rate. Finally, they proposed a method to arrange strong classifiers into a linear cascade which quickly discards non-face regions, focusing on likely face regions in the image to decrease the false positive rate.
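The “integral image” idea can be sketched in a few lines. This is an illustrative Python version (the project itself implements it in C#); after the single-pass table construction, any rectangle sum costs only four lookups:

```python
def integral_image(img):
    """img: 2D list of grayscale values. Returns the summed-area table ii,
    where ii[y][x] = sum of img over rows 0..y and columns 0..x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the image over the inclusive rectangle [x0..x1] x [y0..y1],
    computed in constant time from four table entries."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total
```

A rectangle feature is then just a signed combination of a few such `rect_sum` calls, which is what makes evaluating thousands of features per window feasible.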
After the face has been localized in the image, it can be efficiently tracked using the colour distribution of the face. CAMShift (Continuously Adaptive Mean Shift) was first proposed by Gary Bradski [8] at Intel in 1998. In this method, a hue histogram of the face being tracked is used to derive the “face probability distribution”, where the most frequently occurring colour is assigned probability 1.0 and the probabilities of other colours are computed from their frequency relative to the most frequent colour. Then, given a new search window, the “mean shift” algorithm is used (with a simple step function as the kernel) to converge to the probability centroid of the face colour probability distribution. The size of the search window is then adjusted as a function of the zeroth moment, and the repositioning/resizing is repeated until the result changes by less than a fixed threshold.
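The core of this tracker — histogram backprojection followed by iterative mean-shift window repositioning — can be sketched as follows (illustrative Python; full CAMShift additionally rescales the window as a function of the zeroth moment m00, which is omitted here):

```python
def backproject(hue_img, hist):
    """Map each pixel's hue to its relative frequency in the face histogram;
    the most frequent hue gets probability 1.0."""
    peak = max(hist.values())
    return [[hist.get(h, 0) / peak for h in row] for row in hue_img]

def mean_shift(prob, x, y, w, h, max_iter=10):
    """Repeatedly shift the w x h search window at (x, y) to the centroid
    of the probability mass it covers."""
    for _ in range(max_iter):
        m00 = m10 = m01 = 0.0
        for j in range(y, min(y + h, len(prob))):
            for i in range(x, min(x + w, len(prob[0]))):
                p = prob[j][i]
                m00 += p          # zeroth moment (total probability mass)
                m10 += i * p      # first moments give the centroid
                m01 += j * p
        if m00 == 0:
            break
        nx = max(int(round(m10 / m00)) - w // 2, 0)
        ny = max(int(round(m01 / m00)) - h // 2, 0)
        if (nx, ny) == (x, y):    # converged: window no longer moves
            break
        x, y = nx, ny
    return x, y
```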
However, these colour-based face detection and tracking methods encounter difficulties when the face orientation does not match the training orientations (e.g. when the user is facing away from the camera), when the background is visually cluttered, and so on.
Burgin et al. [10] suggested a few simple ways in which the depth information could be used to improve face detection. For example, given a certain distance from the camera, the realistic range of human head sizes in pixels can be calculated. This can then be used to reject certain window sizes, improving on the exhaustive search for faces in the entire image in the Viola-Jones algorithm. Similarly, they suggested that distance thresholding could also be used to improve face detection efficiency, since far-away points are likely to be blurry or to contain too few pixels for reliable face detection.
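The window-size rejection follows from the pinhole camera model: a head of physical width W at depth z projects to roughly f·W/z pixels. A sketch of the idea, where the focal length and the plausible head-width range are illustrative assumptions rather than values from Burgin et al.:

```python
FOCAL_LENGTH_PX = 525.0             # assumed colour-camera focal length, in pixels
HEAD_WIDTH_RANGE_M = (0.12, 0.20)   # assumed plausible human head widths, in metres

def plausible_window(window_px, depth_m):
    """Accept a detector window size only if a head at this depth could
    plausibly project to that many pixels."""
    lo = FOCAL_LENGTH_PX * HEAD_WIDTH_RANGE_M[0] / depth_m
    hi = FOCAL_LENGTH_PX * HEAD_WIDTH_RANGE_M[1] / depth_m
    return lo <= window_px <= hi
```

Running this check before the Viola-Jones cascade prunes whole scales of the exhaustive search at negligible cost.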
On a similar note, Xia et al. [31] described a 3D model fitting algorithm for head tracking. Their algorithm scales a hemisphere model to the head size estimated from the depth values at the location possibly containing a head (using an equation regressed from empirical head size/depth measurements). It then attempts to minimize the square error between the possible head region and the hemisphere template. Since this approach uses generalized head depth characteristics (front, side and back views, as well as higher and lower views of the head, all approximate a hemisphere), it is view-invariant. When combined with the CAMShift face tracker, the 3D model fitting approach should enhance the reliability of the overall tracking even when the person turns to look away for a few seconds.
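The hemisphere-fitting idea can be sketched as follows (illustrative Python; the grid sampling and the mean-squared-error measure are simplified assumptions, not Xia et al.'s exact formulation):

```python
import math

def hemisphere_template(size, radius):
    """Depth offsets of a hemisphere of the given radius, sampled on a
    size x size grid: 0 at the apex, growing towards the rim; cells outside
    the hemisphere footprint are None."""
    c = (size - 1) / 2.0
    cell = radius / c  # model units per grid cell
    tmpl = []
    for j in range(size):
        row = []
        for i in range(size):
            d2 = ((i - c) ** 2 + (j - c) ** 2) * cell * cell
            if d2 <= radius * radius:
                row.append(radius - math.sqrt(radius * radius - d2))
            else:
                row.append(None)
        tmpl.append(row)
    return tmpl

def fit_error(depth_patch, tmpl):
    """Mean squared error between a relative-depth patch and the template,
    over the cells where the template is defined."""
    err, n = 0.0, 0
    for row_d, row_t in zip(depth_patch, tmpl):
        for d, t in zip(row_d, row_t):
            if t is not None:
                err += (d - t) ** 2
                n += 1
    return err / n
```

A candidate head region with a low `fit_error` against the appropriately scaled template is accepted as a head, regardless of which way the head is facing.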
These improved ways of face detection and tracking, combined with the depth information provided
by Kinect can be employed to obtain the accurate 3D location of the viewer with respect to the
display. The location can then be used to simulate the parallax motion (near objects moving faster
in relation to far objects), evoking a visual sense of depth as perceived in real three-dimensional
environments.
Resources Required
- Hardware
  – Microsoft Kinect sensor. Acquired.
  – Development PC. Acquired: hyperthreaded dual-core Intel i5-2410M running at 2.90 GHz, 8 GB RAM, 250 GB HDD.
  – Primary back-up storage: 0.5 GB space on PWF for program and dissertation sources only. Acquired.
  – Secondary back-up storage: 16 GB USB flash drive for source code/dissertation/built snapshots. Acquired.
- Software
  – Development: Microsoft Visual Studio 2010, Kinect SDK, OpenTK (OpenGL wrapper for C#), Math.NET (open-source mathematical library for C#). All installed.
  – Back-up: Subversion version control. Installed both on the local machine and on PWF.
- Training data
  – Face/non-face training images for Viola-Jones. Acquired 4916 face images and 7960 non-face images from Robert Pless' website [11].
Starting Point
- Basic knowledge of C#,
- Minimal familiarity with OpenGL,
- Nearly no knowledge of computer vision.
Substance and Structure of the Project
As discussed in the introduction, the substance of the project can be split into the following stages:
- to implement the industry-standard colour-based face detection and tracking algorithms (viz. Viola-Jones and CAMShift),
- to extend these algorithms using the depth information provided by Microsoft Kinect's IR depth-sensing camera,
- to simulate the parallax motion effect using the calculated head movements in 3D, creating a 3D display effect.
Viola-Jones Face Detector
As described in the introduction, the main task will be to implement the AdaBoost algorithm, which will combine Haar-like weak classifiers into a strong classifier. These strong classifiers will be connected into a classifier cascade, such that early stages reject image locations that are unlikely to contain faces. It is crucial to implement this stage early, since classifier training can take days/weeks.
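The boosting stage can be sketched as standard discrete AdaBoost (illustrative Python; Viola-Jones uses a slight variant of this with threshold-on-feature weak learners, and the classifiers here are placeholder callables returning ±1):

```python
import math

def adaboost(xs, ys, weak_classifiers, rounds):
    """Combine weak classifiers (callables x -> +1/-1) into a strong one.
    ys are +1 (face) / -1 (non-face) labels."""
    n = len(xs)
    w = [1.0 / n] * n                  # uniform initial sample weights
    ensemble = []                      # chosen (alpha, classifier) pairs
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error
        def weighted_error(h):
            return sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
        best = min(weak_classifiers, key=weighted_error)
        err = max(weighted_error(best), 1e-10)       # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)    # vote weight of this round
        ensemble.append((alpha, best))
        # Re-weight: emphasize misclassified samples, then normalize
        w = [wi * math.exp(-alpha * y * best(x)) for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```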
CAMShift Face Tracker
The main tasks for the face tracker will be to implement the “histogram backprojection” method and the “mean shift” algorithm.
Depth Cue Integration for Viola-Jones Detector
Since the suggestions in the Burgin et al. [10] paper are relatively straightforward (e.g. distance thresholding), the main task will simply be to implement the elimination of unnecessary image regions before launching the Viola-Jones detector.
Tracker Extension Using Depth Cues
Based on the approach of Xia et al. [31], 3D hemisphere fitting will have to be implemented. However, additional work will be required to ensure that when the colour-based CAMShift tracker loses the face, the depth-based tracker reliably takes over, and vice versa.
3D Display Simulation Using Parallax Motion
Having obtained the head location in pixel (and depth) coordinates during the stages above, the head's location in 3D can be calculated using publicly available conversion equations (derived by measuring the focal distances, distortion coefficients and other parameters of both the depth and RGB cameras).
To simulate the effect of parallax motion, a simple OpenGL scene will be created and the scene's viewpoint will be set to follow the head's motion in 3D.
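This conversion follows the pinhole camera model. A minimal Python sketch; the intrinsic constants below are commonly quoted approximate Kinect depth-camera calibration values, used here purely as illustrative assumptions (in practice they come from per-device calibration):

```python
# Approximate Kinect depth-camera intrinsics (illustrative assumptions)
FX, FY = 594.2, 591.0   # focal lengths, in pixels
CX, CY = 339.3, 242.7   # principal point, in pixels

def pixel_to_3d(u, v, depth_m):
    """Back-project pixel (u, v) with metric depth into camera-space metres."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return x, y, depth_m
```

A pixel at the principal point maps straight down the optical axis; moving one focal length's worth of pixels off-centre at depth z yields a lateral offset of z metres.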
Success Criteria
For the project to be deemed a success, the following items have to be completed:
1. Viola-Jones face detector,
2. CAMShift tracker,
3. Viola-Jones detector extensions using depth cues,
4. 3D hemisphere-fitting tracker,
5. OpenGL program, simulating the parallax motion effect.
Furthermore, the implemented items should achieve performance comparable to that reported in the papers describing these methods.
Evaluation Criteria
The face detector and trackers can be quantitatively evaluated on their speed and ROC (receiver operating characteristic) curves, i.e. the rate of correct detections versus the false positive rate, as well as on precision (TP/(TP + FP)), recall (TP/(TP + FN)), accuracy ((TP + TN)/(TP + TN + FP + FN)) and other metrics.
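These metrics follow directly from the confusion-matrix counts; a minimal Python sketch:

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from true/false positive/negative counts."""
    precision = tp / (tp + fp)                  # fraction of detections that are faces
    recall = tp / (tp + fn)                     # fraction of faces that are detected
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of all decisions that are correct
    return precision, recall, accuracy
```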
Similarly, their robustness against different head orientations (tilt, rotation), distances to the camera, speeds of movement, global illumination conditions, etc. can be quantitatively measured.
Then the relative performance and accuracy gain/loss obtained by adding the depth cues to the face detector/tracker can be determined.
Finally, the accuracy of head location tracking in 3D (with respect to head translation along the X, Y and Z axes) can be assessed.
Possible Extensions
Given enough time, the system could be extended to deal with multiple people. This would involve only minor changes to the Viola-Jones detector, but should be more challenging for the trackers. Both the colour- and depth-based trackers would then need to deal with partial/full occlusion and object tagging (i.e. if person A passes behind person B, the tracker should not treat person A as a new person in the image, and should not confuse A and B); the depth-based tracker should have more potential for disambiguating these situations.
After implementing the extension above, the OpenGL scene could be trivially segmented so that each viewer would see her own 3D segment of the display.
Work Plan
The work will be split into 16 two-week packages, as detailed below:
07/10/11 - 20/10/11
Gain a better understanding of the face detection and tracking methods described above. Set up the development and back-up environment. Obtain the colour and depth input streams from Kinect. Write the project proposal.
Milestones: SVN set up on PWF. Written a small C# test project for Microsoft Visual Studio 2010 and the Kinect SDK that fetches the colour and depth input streams from the device and renders them on screen. Project proposal written and handed in.
21/10/11 - 03/11/11
Fully understand the Viola-Jones face detector and start implementing it. Add additional face images
to the training set if required.
Milestones: clear understanding of the Viola-Jones face detector. Pieces of working implementation.
04/11/11 - 17/11/11
Finish implementing the Viola-Jones algorithm and start the training. Start reading about the
CAMShift algorithm.
Milestones: implementation of Viola-Jones face detector.
18/11/11 - 01/12/11
Fully understand and implement the CAMShift face tracker. Integrate it with the Viola-Jones face detector as the next stage (invoked once a face is detected).
Milestones: implementation of CAMShift tracker, integrated into the system.
02/12/11 - 15/12/11
Add depth cues to the Viola-Jones face detector, start reading about the 3D hemisphere-fitting
tracker.
Milestones: depth cues added to the Viola-Jones detector. Clear understanding of 3D hemisphere-
fitting tracker.
16/12/11 - 29/12/11
Implement the 3D hemisphere-fitting tracker and integrate it into the system, so that it starts tracking in parallel with the CAMShift algorithm when Viola-Jones detects a face in the image. Start reading about the parallax motion simulation.
Milestones: implementation of 3D hemisphere-fitting tracker, integrated into the system. Clear
understanding of how the parallax motion could be simulated knowing the head’s position.
30/12/11 - 12/01/12
Prepare the presentation for the progress meeting in January. Write progress report. Slack time in case
any of the face detector/face trackers/progress report/progress presentation are not finished.
Milestones: presentation for the progress meeting and a progress report.
13/01/12 - 26/01/12
Fully understand how the head's pixel and depth coordinates can be converted into its location in 3D space. Research further how the parallax motion can be simulated from the head location in 3D. Start implementing an OpenGL scene which could be used to display the parallax motion effect.
Milestones: basic implementation of an OpenGL scene.
27/01/12 - 09/02/12
Finish implementing an OpenGL scene. Slack time for any unfinished implementation details.
Milestones: finished implementation of an OpenGL scene. At this stage the overall system should
be functional, i.e. it should combine the output from the face detector and face trackers to obtain the
head’s location in 3D and use it to simulate parallax motion on the display.
10/02/12 - 23/02/12
Start writing the dissertation. Come up with a structure, including sections, subheadings and short bullet points to be covered in each section.
Milestones: basic structure of the dissertation.
24/02/12 - 09/03/12
Write the “Introduction” and “Preparation” sections. Get feedback from the supervisor/DoS.
Milestones: complete “Introduction” and “Preparation” sections.
10/03/12 - 23/03/12
Milestones: feedback from the supervisor/DoS regarding the “Introduction” and “Preparation” sections incorporated; “Implementation” section written and sent to the supervisor/DoS for feedback.
24/03/12 - 06/04/12
Incorporate the feedback from the supervisor/DoS regarding the “Implementation” section. Gather
the numerical data for “Evaluation” section. Slack time for finishing “Introduction”, “Preparation”
and “Implementation” sections.
Milestones: finished “Introduction”, “Preparation” and “Implementation” sections. Gathered data
for “Evaluation” section.
07/04/12 - 20/04/12
Write the “Evaluation” section and send it for feedback to DoS/supervisor.
Milestones: finished “Evaluation” section.
21/04/12 - 04/05/12
Incorporate the feedback for “Evaluation” section and finish a draft dissertation. Send it for final
feedback to supervisor/DoS.
Milestones: finished draft dissertation.
05/05/12 - 18/05/12
Incorporate final feedback from supervisor/DoS and get the final version approved.
Milestones: dissertation is finished, approved, bound and handed in before 18/05/2012.