
Diss. ETH No. 18838

Visual Body Pose Analysis for

Human-Computer Interaction

A dissertation submitted to the

SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH

for the degree of

Doctor of Sciences ETH

presented by

MICHAEL VAN DEN BERGH

M.Sc. in Electrical Engineering

born 26th December 1981

citizen of Belgium

accepted on the recommendation of

Prof. Dr. Luc Van Gool, ETH Zurich and KU Leuven, examiner

Prof. Dr. Fernando De la Torre, Carnegie Mellon University,

co-examiner

2010


Abstract

Human-Computer Interaction (HCI) is the study of interaction between people (users) and computers. Recent advances in computing technology push the interest in human-computer interaction beyond the traditional keyboard, mouse or keypad devices. The work presented in this thesis uses computer vision to enhance HCI by introducing novel real-time and marker-less gesture and body movement-based systems.

Real-time systems have a high refresh rate and minimal latency, providing the user with smooth and instantaneous interaction with the system. Marker-less systems allow natural interaction without wearing special markers or tracking suits, which are generally required in present-day tracking systems. The systems described in this thesis aim to achieve this real-time marker-less HCI. They are based on vision and built with standard computers equipped with standard color cameras. The goal set for this work is hand gesture-based interaction with large displays, as well as full body pose recognition for interaction where the user is immersed in a virtual environment.

The systems described in this thesis can be divided into three components: (1) preparing the input for detection and recognition, which includes segmentation and reconstruction; (2) detection of body parts and recognition of body poses and hand gestures; (3) using the detection/recognition results to steer the application. These three components are reflected in chapters 2 to 4 of this thesis.

Segmentation and Reconstruction. The first part of the thesis provides a brief summary of foreground-background segmentation, skin color segmentation and 3D hull reconstruction. Skin color segmentation usually suffers from changes in lighting and differences between users. Therefore, a novel and improved skin color segmentation algorithm is introduced, which combines an offline and an online model. The online skin color model is updated at run-time based on color information taken from the face region of the user.

Detection and Recognition. The second part of the thesis starts with a summary of how the face, eye, hand and finger locations can be detected in a camera image. Then, a novel body pose recognition system is introduced based on Linear Discriminant Analysis (LDA) and Average Neighborhood Margin Maximization (ANMM). This system is able to classify poses based either on 2D silhouettes or 3D hulls. Using a similar technique, a novel hand gesture recognition system is introduced. Both the body pose and hand gesture recognition systems are improved for speed with the help of Haarlets. A novel Haarlet training algorithm is introduced, which aims to approximate the ANMM transformation using Haarlets. Furthermore, 3D Haarlets are introduced, which are trained with the same ANMM approximation algorithm, and can be used to classify 3D hulls in real-time.

Applications. The algorithms are demonstrated with four applications. The first application is a perceptive user interface, where the user can point at objects on a large screen and move them around on the screen. The emphasis of this application is on detecting the body parts and determining the 3D pointing direction. The second application is the CyberCarpet, a prototype platform which allows unconstrained locomotion of a walker in a virtual world. As this system is a prototype, the walker is replaced by a miniature robot. The vision part of this system consists of an overhead tracker which tracks the body position and orientation of the walker in real-time. The third application is the full-scale omni-directional treadmill (CyberWalk), which accommodates human walkers. Besides the position and orientation tracker, the vision part is completed with a full body pose recognition system. Key poses are detected to enable interaction with the virtual world in which the walker is immersed. The fourth application is a hand gesture interaction system. It detects hand gestures and movements for manipulating 3D objects or navigating through 3D models.


Zusammenfassung

The research field of Human-Computer Interaction (HCI) is concerned with the interaction between humans and computers. Advances in computing technology call for new input methods that differ from traditional devices such as keyboard, mouse or joystick. This dissertation investigates novel, vision-based ways of interacting with the computer. The goal is to control applications by means of gestures and body movements, without markers and in real time.

Real-time systems are characterized by a high refresh rate and minimal latency, which enables smooth and immediate interaction. Marker-less systems allow natural interaction without additional markers or elaborate tracking suits.

The systems developed in this dissertation enable such marker-less HCI in real time. The techniques used are based on computer vision and require only off-the-shelf computers with ordinary color cameras (e.g. webcams). The goal of this doctoral thesis is both the control of large displays by means of gestures and the recognition of body poses in order to immerse the user in a virtual environment. The presented systems can be divided into three components: (1) preparing the image for detection and recognition, which includes segmentation and reconstruction; (2) detection of body parts and recognition of body and hand poses; (3) using the detection/recognition to control applications. These three components are described in chapters 2 to 4.


Segmentation and Reconstruction. The first part of this work gives a brief overview of the techniques used for foreground-background segmentation, skin color segmentation and 3D hull reconstruction. Skin color detection is made difficult by changing lighting conditions and differences between users. Therefore a new, improved skin color segmentation algorithm was developed, which combines the advantages of an offline and an online model. The online skin color model is updated in real time with color information from the face region.

Detection and Recognition. The second part of this work begins with an overview of face, eye, hand and finger detection. Then a novel body pose recognition system based on Linear Discriminant Analysis (LDA) and Average Neighborhood Margin Maximization (ANMM) is presented. This system is able to classify body poses from 2D silhouettes or 3D hulls. Hand gestures are recognized with a similar method. Both systems are implemented efficiently with Haarlets. A new training algorithm approximates the ANMM transformations with Haarlets. Furthermore, 3D Haarlets are introduced in order to classify 3D hulls in real time.

Applications. The described methods are demonstrated with several applications. The first application is a perceptive user interface in which the user can select objects on a large screen and move them around. The focus of this system is on the detection of body parts and the estimation of the 3D pointing direction. The second application is the CyberCarpet, a platform for unconstrained locomotion of a walker in a virtual world. In the presented prototype the walker is replaced by a small robot. The vision component consists of an overhead tracker which follows the position and orientation of the walker in real time. The third application is an omni-directional treadmill (CyberWalk) which is walkable for humans. Besides the position and orientation tracker, the vision component also includes a body pose recognition system. Key poses are recognized to enable interaction with the virtual world. The fourth application is a system for interaction by means of hand gestures. The system recognizes the user's hand gestures and hand movements in order to manipulate 3D objects or to navigate through 3D models.


Contents

List of Figures  v

1 Introduction  1
  1.1 Overview  1
  1.2 Motivation  2
  1.3 Related Work  2
  1.4 Contributions  5
  1.5 Organization of the thesis  6

2 Segmentation  9
  2.1 Foreground-background Segmentation  10
    2.1.1 Collinearity Criterion  10
    2.1.2 Adaptive Threshold  11
    2.1.3 Darkness Compensation  13
  2.2 Skin Color Segmentation  14
    2.2.1 Color Spaces  14
    2.2.2 Histogram Based Approach  16
    2.2.3 Gaussian Mixture Model Based Approach  20
    2.2.4 Post Processing  22
    2.2.5 Speed Optimizations  25
    2.2.6 Discussion  26
  2.3 3D Hull Reconstruction  27

3 Detection and Recognition  33
  3.1 Face and Hand Detection  35
    3.1.1 Face Detection  35
    3.1.2 Eye Detection  38
    3.1.3 Hand Detection  41
    3.1.4 Finger Detection  43
    3.1.5 Discussion  43
  3.2 2D Body Pose Recognition  46
    3.2.1 Background  47
    3.2.2 Classifier Overview  51
    3.2.3 Linear Discriminant Analysis (LDA)  52
    3.2.4 Average Neighborhood Margin Maximization (ANMM)  53
    3.2.5 Rotation Invariance  55
    3.2.6 Discussion  56
  3.3 3D Body Pose Recognition  58
    3.3.1 Classifier Overview  59
    3.3.2 Average Neighborhood Margin Maximization (ANMM)  60
    3.3.3 Orientation Estimation  62
    3.3.4 Discussion  69
  3.4 Hand Gesture Recognition  72
    3.4.1 Background  72
    3.4.2 Inputs  73
    3.4.3 Hausdorff Distance  74
    3.4.4 Average Neighborhood Margin Maximization (ANMM)  76
    3.4.5 Discussion  77
  3.5 Haarlet Approximation  79
    3.5.1 2D Haarlets  79
    3.5.2 Training  81
    3.5.3 Classification  82
    3.5.4 3D Haarlets  84
    3.5.5 Discussion  86
  3.6 Experiments  88
    3.6.1 Body Pose Recognition: without rotation  88
    3.6.2 Body Pose Recognition: with rotation  93
    3.6.3 Hand Gesture Recognition  100

4 Applications  105
  4.1 Perceptive User Interface (BlueC 2 project)  106
    4.1.1 Introduction  107
    4.1.2 System Overview  108
    4.1.3 Calibration and 3D Extraction  113
    4.1.4 User Interface  116
    4.1.5 Integrated Setup  118
    4.1.6 Discussion  119
  4.2 CyberCarpet (CyberWalk project)  121
    4.2.1 Background  121
    4.2.2 System Overview  122
    4.2.3 Experiments  129
    4.2.4 Discussion  141
  4.3 Omni-directional Treadmill (CyberWalk project)  143
    4.3.1 Design of the Omnidirectional Treadmill  144
    4.3.2 Visual Localization  147
    4.3.3 Control Design  147
    4.3.4 Visualization  149
    4.3.5 Body Pose Recognition  153
    4.3.6 Discussion  156
  4.4 Hand Gesture Interaction (Value Lab)  157
    4.4.1 The Value Lab  158
    4.4.2 System Overview  158
    4.4.3 Object Manipulation: One Object  164
    4.4.4 Object Manipulation: Two Objects  166
    4.4.5 Model Navigation  169
  4.5 Discussion  171

5 Summary  175
  5.1 Future Work  177

A Calibration  179
  A.1 Camera Calibration: 2 cameras  179
  A.2 Camera Calibration: n cameras  181
  A.3 Screen Calibration  182

B Notation  187


List of Figures

1.1 Strategy to build HCI systems. Images are grabbed from one or more camera viewpoints, and processed in a segmentation step, e.g. to segment skin color, or foreground from background. In some cases a 3D reconstruction is also required. The segmented images or reconstructed hulls are then used to detect body parts or to classify hand/body gestures. These can then be used to steer the interaction.  7
2.1 The background model is subtracted from the observed image and yields a difference image. Pixels which differ significantly from the background model are defined as foreground.  10
2.2 The collinearity between two pixels can be measured by considering the difference vectors d_f respectively d_b between x_f respectively x_b and the estimate of the true signal direction u.  11
2.3 The parameter v_c denotes the number of pixels carrying the label foreground and lying in a 3 × 3 neighborhood of the pixel to be processed. These labels are already known for the pixels which have already been processed (gray shade). For the other pixels we simply take the labels from the previous change mask.  12
2.4 Result of our illumination-invariant background subtraction method. The resulting segmentation is shown in image (c). Note that the black loudspeaker in the background batters a hole in the foreground region. Image (d) shows the segmentation result with darkness compensation, with the region in front of the black loudspeaker now segmented correctly.  13
2.5 Online updating of the skin and non-skin color histograms. The segmented regions are marked in red. Note that the skin segmentation used in this case is after post-processing, as described in section 3.4.  18
2.6 Example of the histogram-based skin color segmentation with online training of the color models. The regions segmented as skin regions are marked in red.  20
2.7 Combining the histogram based and the Gaussian mixture model based approaches. The regions segmented as skin regions are marked in white.  23
2.8 The result of median filtering (right) compared to the original segmentation (left). The regions segmented as skin regions are marked in white.  24
2.9 The result of connected components analysis (right) compared to the original segmentation (after median filtering, left). The regions segmented as skin regions are marked in white.  25
2.10 The visual hull of the object is the intersection of the generalized cones extruded from its silhouettes.  28
2.11 A lookup table (LUT) is stored at each pixel in the image with pointers to all voxels that project onto that particular pixel. This way, expensive projections of voxels can be avoided and the algorithm can take advantage of small changes in the images by only addressing voxels whose pixel has changed.  30
2.12 Example of how silhouettes from multiple camera views can be used to reconstruct a 3D hull of the user.  31
2.13 Examples of 3D hull reconstructions.  31

3.1 Example Haarlets shown relative to the enclosing rectangle window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of the pixels in the grey rectangles. Figure taken from [1].  35
3.2 Examples of features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the upper cheeks and nose. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose. Figure taken from [1].  36
3.3 Schematic depiction of the detection cascade. A series of classifiers are applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing, the number of sub-windows has been reduced significantly. Further processing can take any form such as additional stages in the cascade. Figure taken from [1].  36
3.4 Example of a detected face shown in the white rectangle.  37
3.5 (a) region of interest, (b) probability based on luminance, (c) probability based on color, (d) probability based on integrated luminance, (e) all probabilities combined, (f) the eyes as the two components with highest probability.  40
3.6 Example of detecting the eyes.  40
3.7 Examples of the hand detection.  42
3.8 Finger detection.  44
3.9 Example of detecting the fingers on the hand.  45
3.10 Example of detecting the fingertip, shown as a green cross.  45
3.11 Examples of silhouettes which are used for classification. Note the holes in the segmentation and the artifacts due to reflections on the floor.  46
3.12 Example of 3 camera views, foreground-background segmentation, and their concatenation to a single normalized sample.  47
3.13 The full body pose tracker presented by Kehl et al. [2] in action.  49
3.14 (a) 3D hull reconstruction: (b) shows silhouettes which can be extracted from input images using foreground-background segmentation, which can be combined to reconstruct a 3D hull as shown in (c).  50
3.15 Basic structure of the classifier. The input samples (silhouettes) are projected with transformation Wopt onto a lower dimensional space, and the resulting coefficients are matched to poses in the database using nearest neighbors (NN).  52
3.16 An illustration of how ANMM works. For each sample, within a neighborhood (marked in gray), samples of the same class are pulled towards it, while samples of a different class are pushed away, as shown on the left. The figure on the right shows the data distribution in the projected space.  54
3.17 The first 4 eigenvectors for the frontal view only, after training for a 12 pose set, using the ANMM algorithm. Dark regions are positive values and white regions are negative values.  55
3.18 Examples of segmentation and classification results.  57
3.19 Example of a reconstructed 3D hull of the user.  58
3.20 Examples of different orientations of the user, resulting in similar normalized hulls.  59
3.21 Basic classifier structure. The input samples (3D hulls) are projected with transformation Wopt onto a lower dimensional space, and the resulting coefficients are matched to poses in the database using nearest neighbors (NN).  60
3.22 An illustration of how ANMM works. For each sample, within a neighborhood (marked in gray), samples of the same class are pulled towards it, while samples of a different class are pushed away, as shown on the left. The figure on the right shows the data distribution in the projected space.  61
3.23 Three example ANMM eigenvectors. The first example vector will inspect the legs, while the last example shows a feature that distinguishes between the left and the right arm stretched forward. Dark regions are positive values and white regions are negative values.  62
3.24 Each particle is modeled by an ellipse and a circle, representing the shoulders and the head, parametrized as in eq. (3.14).  64
3.25 Example of the visual localization algorithm tracking a walking person, run on the same sequence: (a) the elliptical model without head region (3 degrees of freedom); and (b) the model with head region (5 degrees of freedom).  68
3.26 Examples of reconstruction and classification results.  71
3.27 Possible inputs for the hand gesture classifier.  74
3.28 Basic classifier structure. The input samples (hand images) are projected with ANMM transformation Wopt onto a lower dimensional space, and the resulting coefficients are matched to poses in the database using nearest neighbors (NN).  76
3.29 Examples of the classifier detecting different hand gestures. An open hand is marked in red (a), a fist is marked in green (b) and a pointing hand is marked in blue (c).  77

3.30 Classifier structure illustrating the Haarlet approximation. The pre-trained set of Haarlets are computed on the input sample (grayscale image, silhouette or 3D hull). The approximated ANMM coefficients (l) are computed as a linear combination Lapprox of the Haarlet coefficients (h). The contents of the dotted line box constitute an approximation of Wopt in figure 3.15.  80
3.31 The set of possible 2D Haarlet types.  80
3.32 The top figure shows one ANMM vector, with the overhead, profile and frontal views side by side. The bottom figure shows the Haarlet approximation of this ANMM vector, using the 10 best Haarlets selected to approximate Wopt. It can be seen how the Haarlets look for arms and legs in certain areas of the image.  82
3.33 The sum of the voxels within the gray cuboid can be computed with 8 array references. If A, B, C, D, E, F, G and H are the integral volume values at the shown locations, the sum can be computed as (B + C + E + H) - (A + D + F + G).  84
3.34 The proposed 3D Haarlets. The first 15 features are extruded versions of the original 2D Haarlets in all 3 directions, and the other 2 are true 3D center-surround features.  85
3.35 (a) Three example ANMM eigenvectors. (b) Approximation using 10 Haarlets. The first example shows how a feature is selected to inspect the legs; the last example shows a feature that distinguishes between the left and the right arm stretched forward.  86
3.36 The 50 pose classes used in the body pose recognition experiments where the orientation of the user is fixed.  89
3.37 Correct classification rates comparing LDA and ANMM for the classification of 2D silhouettes.  91
3.38 Correct classification rates comparing LDA and ANMM for the classification of 3D hulls.  91
3.39 Correct classification rates comparing classification based on 2D silhouettes and 3D hulls using ANMM approximation with Haarlets.  92
3.40 The 50 pose classes used in the body pose recognition experiments, where the user is allowed to rotate around the vertical axis.  94
3.41 Correct classification rates comparing LDA and ANMM for the classification of 3D hulls.  95
3.42 Correct classification rates comparing classification based on 2D silhouettes and 3D hulls using ANMM approximation with Haarlets.  96
3.43 Correct classification rates comparing classification based on 2D silhouettes from respectively 3 and 6 cameras.  97
3.44 Correct classification rates using up to 100 Haarlets for classification, comparing 2D and 3D Haarlets.  98
3.45 Correct classification rates using up to 100 Haarlets for classification, comparing to AdaBoost.  99
3.46 Classification times in milliseconds for the pure ANMM classifier and the classifier using 100 3D Haarlets to approximate the ANMM transformation.  99
3.47 The 10 hand gestures used in this experiment.  100
3.48 (a) the input for the ANMM-based classifier and (b) the input for the Hausdorff distance-based classifier.  101
3.49 Possible inputs for the hand gesture classifier.  102
3.50 Correct classification rates for the ANMM-based method and the Hausdorff distance-based method.  103
3.51 Correct classification rates for 2 to 10 pose classes using the ANMM-based method.  103

4.1 Pointing at the screen.  106
4.2 General overview of the system. The Observer component detects the eye, hand and finger locations in the input images, and recognizes hand gestures. The 3D geometry step draws a virtual line from the eyes through the fingertip onto the screen, which allows the user to point at and manipulate objects in the graphical user interface.  108
4.3 Detailed overview of the Observer component. Depending on the state (pointing or gesturing), eye, finger and gesture detection is run based on the skin color segmented input images.  109
4.4 Example of detecting the hand and the face, using the foreground-background segmentation and the skin color segmentation.  110
4.5 Result of skin color analysis without post-processing.  111
4.6 Result of skin color analysis after post-processing.  112
4.7 Calibrating the screen by pointing at a sequence of points.  115
4.8 3D representation: pointing at the screen.  117
4.9 Using the perceptive user interface.  120
4.10 Overview of the system architecture of the CyberCarpet.  123
4.11 Platform principle.  124
4.12 The CyberCarpet platform: (a) the motion transmission principle; (b) a drawing of the preliminary design (courtesy of Max Planck Institute for Biological Cybernetics; German Patent filed in 2005); (c) the final physical realization.  125
4.13 The experimental setup with the CyberCarpet, the mobile robot carrying a picture of a human body, and the overhead camera for visual tracking.  126
4.14 A view from the overhead camera, with superimposed user localization. The white ellipse indicates the shoulder region of the walker, while the red circle indicates the head region and the white line the orientation.  128
4.15 Control system architecture of the CyberCarpet.  129
4.16 Absolute trajectory in the experiment where the walker is standing still.  131
4.17 Linear and angular velocity commands for the trajectory of figure 4.16.  131
4.18 Absolute trajectory in the experiment where the walker is moving at constant velocity, with static feedback control.  132
4.19 Linear and angular velocity commands for the trajectory of figure 4.18.  132
4.20 Absolute trajectory in the experiment where the walker is moving at constant velocity, with velocity compensation.  134
4.21 Linear and angular velocity commands for the trajectory of figure 4.20.  134
4.22 Absolute trajectory in the experiment where the walker moves along a circular path, with static feedback control.  135
4.23 Linear and angular velocity commands for the trajectory of figure 4.22.  135
4.24 Absolute trajectory in the experiment where the walker moves along a circular path, with velocity compensation.  136
4.25 Linear and angular velocity commands for the trajectory of figure 4.24.  136
4.26 Trajectory of the user in the virtual world while executing a square path.  137
4.27 Absolute trajectory in the experiment where the walker moves along a square path, with static feedback control.  139
4.28 Linear and angular velocity commands for the trajectory of figure 4.27.  139
4.29 Absolute trajectory in the experiment where the walker moves along a square path, with velocity compensation.  140
4.30 Linear and angular velocity commands for the trajectory of figure 4.29.  140

4.31 Construction of the treadmill.  145
4.32 Chain trajectory.  146
4.33 Belt construction.  146
4.34 The chain drive wheel of the omni-directional treadmill.  148
4.35 The omni-directional treadmill before being shipped to Tuebingen.  148
4.36 Apparent accelerations (inertial, centrifugal and Coriolis) felt by the walker in the x and y direction (XW and YW respectively), due to the platform motion.  149
4.37 The walker wearing a head mounted display (HMD) and a safety helmet with reflective VICON markers attached. The safety rope and HMD cables are also visible in the background.  150
4.38 Virtual model of Pompei generated with the CityEngine.  152
4.39 Virtual model of Pompei generated with the CityEngine.  152
4.40 Overview of the body pose estimation on the omnidirectional treadmill.  153
4.41 Images taken from an overhead camera of the moving treadmill. The treadmill is moving and there is a security cable dangling in the image.  155
4.42 Examples of the foreground-background segmentation on a moving treadmill with a user on it.  155
4.43 Person interacting with a camera and screen.  157
4.44 The Value Lab.  159
4.45 The Value Lab.  159
4.46 Overview of the hand gesture recognition system.  160
4.47 Input for the hand gesture classifier.  161
4.48 Examples of the classifier detecting different hand gestures. An open hand is marked in red (a), a fist is marked in green (b) and a pointing hand is marked in blue (c).  161
4.49 Computing times on a frame before optimizations. Total average computing time results in 124 ms, or a framerate of 8 Hz.  162
4.50 Computing times for a sequence of 9 frames after optimization. Frame grabbing (11 ms) has been left out for brevity. Total average computing time results in 31 ms, or a framerate of 32 Hz.  163
4.51 Hand gesture system manipulating one object.  164
4.52 The hand gestures used for the manipulation of one object on the screen.  165
4.53 The hand gestures used for the manipulation of two objects on the screen.  166
4.54 Hand gesture system manipulating two objects.  167
4.55 Hand gesture system manipulating two objects.  167
4.56 Hand gesture system selecting the object on the right (cube).  168
4.57 The hand gestures used for the manipulation of the 3D object on the screen.  170
4.58 Zooming into the model.  171
4.59 Panning the model.  172
4.60 Rotating the model.  173
A.1 For calibration, the user is required to wave the laser pointer (a) throughout the obscured working volume. A small piece of plastic is mounted on the laser pointer to produce a small and bright point in the camera images. Image (b) shows the projections of the laser pointer for four cameras. The image coordinates of the laser pointer can be easily detected and are indicated as small circles.  181
A.2 The screen plane.  183
A.3 Screen calibration based on 9 correspondences. The red points are the eye positions. The green points are the finger positions. The screen plane α is shown in blue. The actual screen coordinates are marked with black crosses, and the reconstructed points with grey crosses.  185


1 Introduction

1.1 Overview

Human-Computer Interaction (HCI) is the study of interaction between people (users) and computers. The work presented in this thesis aims to enhance HCI based on computer vision. It aims to improve the interaction with novel real-time and marker-less gesture and body movement-based systems.

Real-time systems have a high refresh rate (30 Hz) and minimal latency, providing the user with smooth and instantaneous interaction with the system. Marker-less systems allow natural interaction without wearing special markers or special tracking suits, which are generally required in modern day tracking systems used in the entertainment industry.

The systems described in this thesis aim to achieve this real-time marker-less HCI based on vision. They are built with standard computers equipped with standard color cameras, which are often even found built into the displays.

The goal set for this work is hand gesture-based interaction with large displays, as well as full body pose recognition for interaction where the user is immersed in a virtual world.


1.2 Motivation

The current advances in computing technology push the interest in interaction beyond traditional keyboard, mouse or keypad devices. For example, the recent popularity of touch screen phones demonstrates the added value of touch interfaces for small devices. But what about larger devices that you cannot hold in your hands?

There is ongoing research into touch-based interfaces for large displays, but one might wonder: is a touch-based interface, requiring the user to stand at most 30 cm from a screen, the most adequate for large displays? Science fiction works, such as the movie Minority Report, show gesture-based interaction, where the user waves with special gloves at a large screen, and interacts in a much more natural way. Similarly, new popular gaming systems with accelerometer-based controllers also demonstrate the adequacy of body-gesture based interaction over traditional systems with keypads. However, such systems require the use of markers or wearing special devices.

We believe that vision-based human-computer interaction systems are the way of the future. They allow for natural interaction without expensive hardware (small cameras are already present in most current devices) and without the necessity to wear special markers, gloves or suits. The fields of interest include home entertainment, gaming, virtual reality, collaboration, conferencing and surveillance.

1.3 Related Work

In this thesis, we describe algorithms and methods to build vision-based, marker-less and real-time HCI systems. In this section we place the work in the context of other HCI systems.


Vision-based

The first differentiation that can be made in HCI systems is whether the system is vision-based or not. Vision-based means that the system uses one or multiple cameras to observe the user, or objects that are attached to the user (markers). The advantage of these systems is that the user does not need to hold or touch the device with which he is interacting, which is especially useful if the user is interacting with large displays or is immersed in virtual environments.

Non-vision-based devices include touch screens, exoskeletons, suits with sensors, gloves with sensors and accelerometer-based remotes.

Marker-based vs. Marker-less

Within the vision-based HCI systems, we can make a distinction between systems that use markers and systems which do not. Markers are usually reflective or colored balls which are placed on the user, or are attached to a special suit, glove or helmet that the user is wearing. They can also be colored LEDs. The benefit of such markers is that they can be detected very fast and robustly, with algorithms as simple as applying a threshold in the camera image. A well known example is the motion capture system from Vicon (http://www.vicon.com), which uses reflective balls as markers and infrared cameras. G-Speak, by Oblong Industries (http://www.oblong.com), is a project where the user wears gloves with reflective markers which are tracked by a Vicon system, to allow for a Minority Report-style HCI system.

These systems require the user to wear special markers, and the cost of hardware can be very high if more than one marker is to be tracked (e.g. Vicon). The benefit of marker-less systems, on the contrary, is that the user does not need to wear special suits, gloves or helmets, making the system more natural and intuitive. Furthermore, the hardware requirements can be made smaller, as the system can use standard cameras which are already present in most recent systems. Research in the field of marker-less vision-based systems is ongoing and existing systems are very experimental.

Real-time vs. Offline

Most marker-less tracking systems that exist today are offline, which means that the video sequence is recorded first and the tracking algorithm is run afterwards. These systems allow for accurate tracking of the human body, which can be useful for analysis of the persons in the video, motion capturing, or archiving video content. However, these algorithms sometimes take hours of computation for a minute of video. Therefore these systems are not useful for HCI, because the user expects real-time interaction with the system.

Tracking vs. Recognition

Another important distinction can be made between tracking and recognition. Model-based (tracking) systems iteratively fit an articulated model of the user (or part of the user) to the observation. There is no limitation on the number of poses that can be tracked, as the model is fitted to whichever pose the user assumes. For each frame the model is updated. The more computation time is available between each frame, the better the model can be fitted. Therefore, tracking systems lend themselves perfectly to offline systems. In a real-time system, however, they will either have very little computation time available for each frame, resulting in the model being fitted very inaccurately, or the frame rate will be low, meaning that the system will lose track of the user easily. Furthermore, tracking systems require initialization of the model.

Example-based (recognition) systems, on the other hand, detect the pose of the user (or part of the user) instantaneously and not based on the previous frames. The poses are matched to prerecorded poses in a database. Therefore, unlike tracking, the number of poses that can be detected is limited, but example-based systems allow for much higher speeds by focusing on the generally few poses which are interesting for the HCI. An example of a fast example-based system is the system by Ren et al. [3], which lets the user control a dancer with body motions. The dance consists of a number of pre-defined poses that are detected on the user.

A more detailed comparison between tracking and recognition is presented in section 3.2.1.
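To make the example-based idea concrete, the following minimal sketch (Python/NumPy; the feature extraction and the pose database db_features/db_labels are hypothetical placeholders, not the thesis implementation) assigns the current frame the label of the nearest prerecorded pose, in the spirit of the nearest-neighbor matching used by the classifiers in chapter 3:

    import numpy as np

    def classify_pose(features, db_features, db_labels):
        # Example-based recognition: assign the label of the nearest
        # prerecorded pose (Euclidean nearest neighbor).
        #   features    : (d,)   feature vector of the current frame
        #   db_features : (n, d) features of the prerecorded key poses
        #   db_labels   : (n,)   pose label of each database entry
        dists = np.linalg.norm(db_features - features, axis=1)
        return db_labels[np.argmin(dists)]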

1.4 Contributions

This thesis contributes in various ways to the enhancement of HCI systems. All the systems described in this thesis are real-time and do not require the use of markers.

• Algorithms for skin color segmentation are improved to be more suited to HCI, and less vulnerable to changes in illumination. A novel and improved skin color segmentation algorithm is introduced, which combines an offline and an online model. The online skin color model is updated at run-time based on color information taken from the face region of the user.

• A novel body pose recognition system is introduced based on Linear Discriminant Analysis (LDA) and Average Neighborhood Margin Maximization (ANMM). This system is able to classify poses based on either 2D silhouettes or 3D hulls.

• A similar hand gesture recognition system is introduced, which uses ANMM to classify hand gestures.

• The speeds of both the body pose and hand gesture recognition systems are improved with the help of Haarlets. A novel Haarlet training algorithm is introduced, which aims to approximate the ANMM transformation using Haarlets.

• 3D Haarlets are introduced, and can be trained with the same ANMM approximation algorithm.


Several real-time marker-less HCI applications have been built and are presented in this thesis.

• A perceptive user interface, where the user can point at objects on a large screen and move them around on the screen. The emphasis of this application is on detecting the body parts and determining the 3D pointing direction.

• The CyberCarpet, a prototype platform which allows unconstrained locomotion of a walker in a virtual world. As this system is a prototype, the walker is replaced by a miniature robot. The vision part of this system consists of an overhead tracker which tracks the body position and orientation of the walker in real-time.

• The full-scale omni-directional treadmill, which accommodates human walkers in a virtual environment. Besides the position and orientation tracker, the vision part also includes a full body pose recognition system. Key poses of the user are detected to allow for interaction with the virtual world he is immersed in.

• A hand gesture interaction system that allows for manipulation of 3D objects or navigation through 3D models by detecting the gestures and the movements of the hands of a user in front of a camera mounted on top of a screen.

1.5 Organization of the thesis

This thesis is presented as individual building blocks that can be used to build interaction systems. The overall strategy is illustrated in figure 1.1. As shown in the figure, the building blocks can be divided into three parts: (1) preparing the input for detection and recognition (this includes segmentation and reconstruction); (2) detection of body parts and recognition of poses/gestures; and (3) using the detection/recognition to steer the application. These three parts are reflected in the next three chapters of this thesis.


[Figure 1.1 block diagram: skin and foreground-background segmentation with 3D reconstruction; detection of face, hands, fingers and eyes; LDA/ANMM classification using 2D and 3D Haarlets; matching to a database; interaction with the application.]

Figure 1.1: Strategy to build HCI systems. Images are grabbed from one or more camera viewpoints, and processed in a segmentation step, e.g. to segment skin color, or foreground from background. In some cases a 3D reconstruction is also required. The segmented images or reconstructed hulls are then used to detect body parts or to classify hand/body gestures. These can then be used to steer the interaction.


Chapter 2 describes the different segmentation steps that are performed on input images. Foreground-background segmentation separates the user from the background, while skin color segmentation allows for the detection and localization of body parts of interest (head and hands). Furthermore, several segmentations of the user from different camera views can be used to reconstruct a 3D hull of the user, which is used in body pose recognition.

Chapter 3 deals with the detection and recognition steps. First, it describes the detection of body parts such as the face, the eyes, the hands and the fingers. Then, it describes the recognition of body poses and hand gestures. It introduces Haarlet approximation as a technique to speed up this recognition process. A fast overhead tracker is described to estimate the orientation of the user. Also, the different approaches described in this chapter are evaluated against each other.

Chapter 4 discusses the four real-world applications described in the previous section that have been built using the methods and building blocks described in this thesis.


2 Segmentation

The overall strategy to build interaction systems is illustrated in figure 1.1. A typical system starts from one or several cameras pointed at the user. These cameras continuously grab images which are then processed. This chapter describes the preprocessing steps that are performed on these input images, which aim to facilitate the detection and classification of body parts, body gestures and hand gestures, as will be discussed in chapter 3.

The foreground-background segmentation (section 2.1) determines which pixels in the camera image are different from a stored background image. This allows us to extract a 2D silhouette of the user. These silhouettes can be used to aid body part detection (section 4.1) or for 2D full body pose recognition (section 3.2).

The skin color segmentation (section 2.2) detects skin colored regions and is used to detect and extract body parts, for example to detect the hand location (section 3.1.3). It is also used to extract a silhouette of the hand, which can be used to count the number of fingers, or for hand gesture recognition (sections 3.1.4 and 3.4).
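As a rough illustration of the histogram-based approach summarized in section 2.2, the sketch below (Python/NumPy) labels a pixel as skin when the ratio of its skin and non-skin histogram values exceeds a threshold. The color space, bin count and the online update from the face region are simplified away, and all names are illustrative rather than the thesis implementation:

    import numpy as np

    def skin_mask(image, skin_hist, nonskin_hist, bins=32, threshold=1.0):
        # Histogram-based skin classification on an (H, W, 3) uint8 image.
        #   skin_hist    : (bins, bins, bins) normalized histogram of skin pixels
        #   nonskin_hist : (bins, bins, bins) normalized histogram of non-skin pixels
        idx = (image.astype(np.int32) * bins) // 256      # quantize each channel
        c0, c1, c2 = idx[..., 0], idx[..., 1], idx[..., 2]
        p_skin = skin_hist[c0, c1, c2]
        p_nonskin = nonskin_hist[c0, c1, c2] + 1e-9       # avoid division by zero
        return (p_skin / p_nonskin) > threshold           # boolean (H, W) skin mask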

Silhouettes from several viewpoints can be combined to make a 3D hull reconstruction of the user (section 2.3). This hull is a 3D representation of the user and can be used for 3D body pose recognition (section 3.3). The benefit of moving from 2D silhouettes to 3D hulls is increased accuracy and rotation invariance.
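The sketch below illustrates the basic principle of such a silhouette-based hull reconstruction: a voxel belongs to the hull if it projects inside the foreground silhouette of every view. This is a plain per-voxel test with assumed calibrated projection matrices; the actual implementation described in section 2.3 avoids repeated projections by using per-pixel lookup tables (figure 2.11):

    import numpy as np

    def carve_hull(voxel_centers, silhouettes, projections):
        # Keep a voxel if it projects inside the silhouette of every camera
        # view (intersection of the silhouette cones).
        #   voxel_centers : (V, 3) voxel centers in world coordinates
        #   silhouettes   : list of (H, W) boolean foreground masks
        #   projections   : list of (3, 4) camera projection matrices
        hom = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
        occupied = np.ones(len(voxel_centers), dtype=bool)
        for sil, P in zip(silhouettes, projections):
            h, w = sil.shape
            proj = hom @ P.T                                # homogeneous image points
            u = np.rint(proj[:, 0] / proj[:, 2]).astype(int)
            v = np.rint(proj[:, 1] / proj[:, 2]).astype(int)
            inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
            in_sil = np.zeros(len(voxel_centers), dtype=bool)
            in_sil[inside] = sil[v[inside], u[inside]]
            occupied &= in_sil
        return occupied                                     # (V,) boolean occupancy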


2.1 Foreground-background Segmentation

The first step in most systems is foreground-background segmentation, which considers the difference between the observed image and a model of the background. Regions where the observed image differs significantly from the background model are defined as foreground, as illustrated in figure 2.1. The background model is typically calculated from a set of images of the empty working volume. Background subtraction works only for static backgrounds, since the same model is used for subsequent images.


Figure 2.1: The background model is subtracted from the observed image and yields a difference image. Pixels which differ significantly from the background model are defined as foreground.

2.1.1 Collinearity Criterion

It is essential to have a good similarity measurement for two colors. Considering just the difference or the angle between two observed color vectors will not yield the best results. It makes a significant difference whether an angular difference is found for long or short signal vectors. Therefore we use the illumination-invariant collinearity criterion proposed by Mester et al. [4].

Mester's method compares the color values at pixels in a reference image (background) and a given image. In particular, all color values within a small window around a pixel, here always a 3 × 3 neighborhood, are stacked into row vectors x_b resp. x_f for the background resp. the given image (figure 2.2). The latter will typically contain some additional foreground objects. In Mester's analysis, change detection amounts to assessing whether x_b and x_f are collinear. If they are, no change is judged to be present and the background is still visible at that pixel in the given image. If not, the pixels are considered to have different colors, and a foreground pixel has been found.

Rather than testing for perfect collinearity, one has to allow for some noise in the measurement process. A kind of bisector has to exist (u in figure 2.2) to which both x_b and x_f lie close. Indeed, when Gaussian noise is assumed, the unknown 'true signal' direction u can be estimated by minimizing

D^2 = |d_f|^2 + |d_b|^2.    (2.1)


Figure 2.2: The collinearity between two pixels can be measured by considering the difference vectors d_f respectively d_b between x_f respectively x_b and the estimate of the true signal direction u.
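Since D^2 only has to be minimized over the direction u, the minimum can be evaluated per pixel in closed form: it equals |x_f|^2 + |x_b|^2 minus the largest eigenvalue of the 2 × 2 Gram matrix of x_f and x_b (standard linear algebra; the thesis does not spell out its implementation). The sketch below (Python/NumPy, illustrative only) computes this value for one pixel, assuming the 3 × 3 RGB neighborhoods have already been stacked into vectors:

    import numpy as np

    def collinearity_measure(x_f, x_b):
        # Closed-form minimum of D^2 = |d_f|^2 + |d_b|^2 over the unknown
        # direction u (eq. 2.1): D^2 = |x_f|^2 + |x_b|^2 - lambda_max, where
        # lambda_max is the largest eigenvalue of the 2x2 Gram matrix of
        # x_f and x_b. The result is zero iff x_f and x_b are collinear.
        a = float(np.dot(x_f, x_f))      # |x_f|^2
        b = float(np.dot(x_b, x_b))      # |x_b|^2
        c = float(np.dot(x_f, x_b))      # x_f . x_b
        trace, det = a + b, a * b - c * c
        lam_max = 0.5 * (trace + np.sqrt(max(trace * trace - 4.0 * det, 0.0)))
        return trace - lam_max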

2.1.2 Adaptive Threshold

Minimizing D^2 estimates u and yields zero if the two vectors x_f and x_b are collinear, i.e. the difference vectors and hence the sum of their norms are zero. If the two vectors are collinear, no change is judged to be present and the background is still visible. If not, the pixels are considered to have different colors, and a foreground pixel is found. However, perfect collinearity is unlikely as our observed color vectors are noisy. Griesser et al. [5] showed that applying a static threshold T_static and an adaptive threshold T_adapt on D^2 makes the segmentation robust against noise. The adaptive threshold is used to incorporate spatiotemporal considerations. Spatial compactness is induced by giving a pixel a higher chance to be foreground if several of its neighbors have this status. A sampled Markov Random Field (MRF) is used to enforce this spatial compactness in an iterative manner, as described in [4]. This yields the following context-adaptive decision rule:

D^2 ≷ T_static + (4 - v_c) · B    (foreground if D^2 is larger, background otherwise)    (2.2)

where v_c denotes the number of pixels that carry the label foreground and lie in the 3 × 3 neighborhood of the pixel to be processed. These labels are known for those neighboring pixels which have already been processed while scanning the image raster, as symbolized by the gray shade in figure 2.3. For pixels which are not yet processed we simply take the labels from the previous change mask. This introduces some level of temporal smoothness. The parameter B is a positive cost constant.

Figure 2.3: The parameter v_c denotes the number of pixels carrying the label foreground that lie in the 3 × 3 neighborhood of the pixel to be processed. These labels are already known for the pixels which have already been processed (gray shade). For the other pixels we simply take the labels from the previous change mask.


2.1.3 Darkness Compensation

This collinearity test (which is also called darkness compensation) makes the method intensity-invariant and thus provides robustness against lighting changes and shadows. However, dark colors are problematic for this method, as they can be seen as a low-intensity version of any color and consequently match with any color. To remedy this, an additional component with a constant value O_dc is added to both vectors, as described in [5]. This additional dimension does not change the distance measurements themselves, but it separates collinear vectors with different lengths, rendering the color similarity measure more sensitive to differences when dark pixels are involved. This way, objects or backgrounds with dark colors can be segmented, as illustrated in figure 2.4, where the region in front of the black loudspeaker in the top left of the image is now correctly segmented.
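As an illustration, the sketch below computes the collinearity measure D^2 of equation (2.1) for one pixel, with the darkness compensation component O_dc appended to both color vectors. It assumes that minimizing (2.1) over the unknown direction u amounts to taking the smallest eigenvalue of the 2 × 2 Gram matrix of the two extended vectors; the offset value is an example and is tuned per camera.

```python
import numpy as np

def collinearity_measure(x_f, x_b, O_dc=30.0):
    """Collinearity measure D^2 of (2.1) with darkness compensation.

    x_f, x_b: RGB vectors of the current image and the background model.
    O_dc: constant darkness compensation component (example value).
    """
    xf = np.append(np.asarray(x_f, dtype=np.float64), O_dc)
    xb = np.append(np.asarray(x_b, dtype=np.float64), O_dc)
    # Minimizing |d_f|^2 + |d_b|^2 over the direction u leaves the
    # smallest eigenvalue of the 2x2 Gram matrix of (x_f, x_b).
    gram = np.array([[xf @ xf, xf @ xb],
                     [xb @ xf, xb @ xb]])
    return float(np.linalg.eigvalsh(gram)[0])  # zero if perfectly collinear
```

A pixel is then labeled foreground if this value exceeds the context-adaptive threshold of equation (2.2).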


Segmentation is controlled by three user-defined parameters: the static threshold T_static, the darkness offset O_dc and the spatiotemporal compactness factor B. First, the static threshold T_static is determined with O_dc and B set to zero. Then, the darkness offset O_dc is increased until a balance between appearing shadows and vanishing holes is reached. Finally, the compactness factor B is increased until the foreground regions are smooth and compact. The values of these parameters vary for different cameras.


(a) input image


(b) background


(c) segmentation


(d) with darkness compensation

Figure 2.4: Result of our illumination-invariant background subtraction method. The input image (a) is compared to the background image (b) using an illumination-invariant collinearity criterion on the RGB values. The resulting segmentation is shown in image (c). Note that the black loudspeaker in the background punches a hole in the foreground region. Image (d) shows the segmentation result with darkness compensation, with the region in front of the black loudspeaker now segmented correctly.


2.2 Skin Color Segmentation

This section describes methods to distinguish skin pixels from non-skin pixels in an image. The color of skin in the visible spectrum depends primarily on the concentration of melanin and hemoglobin [6]. Skin color is a simple but powerful characteristic, allowing for robust and efficient localization of body parts such as the head and the hands. The distribution of skin color across different ethnic groups under controlled conditions of illumination has been shown to be quite compact, with variations expressible in terms of the concentration of skin pigments (see [7] for a recent study). However, under arbitrary conditions of illumination the variation in skin color is less constrained.

Detecting skin pixels consists of two important components. First, a model has to be built of the distribution of skin color and non-skin color pixels in the color space. As the color properties depend strongly on the color space, the choice of the best color space is discussed in more detail in section 2.2.1. Second, based on this model, a decision has to be made whether an input pixel color is skin or non-skin. This is usually based on a probability calculation and a threshold. We will consider two systems. The first is a histogram-based system which can be trained online, while the system is running. The advantage of this system is that it can be adapted on-the-fly to changes in illumination and to the person using the system. The second system is trained in advance with a Gaussian mixture model and a lookup table. The benefit is that the offline-trained system can be trained with much more training data and is more robust. However, it is not adaptive to changes in illumination or to the person using the system.

2.2.1 Color Spaces

Every color can be represented as a point in a color space. Groups of color pixels of different skin types and under different lighting conditions have a different distribution in each color space. Cameras typically store color in RGB.


Therefore, a conversion is usually applied so that the skin colors have a more compact distribution in the new color space. We will first discuss color spaces with 3 channels, and then color spaces with 2 channels. The choice of color space will be discussed from the point of view of using a Gaussian Mixture Model (GMM) to model the skin colors, and from the point of view of using a color histogram.

Color Spaces with 3 Channels

Cameras typically record and store color in 3 channels. Therefore, conversions to other color spaces with 3 channels will usually retain all color information (i.e. the conversion is reversible). If the skin color is modeled by a GMM, then the GMM could also be transformed to any chosen color space, yielding the same results. When using a GMM, the choice of 3D color space therefore has little impact, given that no color information is lost in the conversion.

However, if a color histogram is used to model skin color, the choice of color space is more relevant. In a recent study, Schmugge et al. [8] test different color spaces and conclude that the HSI (hue, saturation and intensity) color space provides the highest performance for a three-dimensional color space, in combination with a histogram-based classifier. HSI is an expensive transformation, but it can be stored beforehand in a lookup table (LUT) to increase the conversion speed. The color transformation is described as

i = \max(R, G, B)    (2.3)

s = \begin{cases} \frac{\max(R,G,B) - \min(R,G,B)}{\max(R,G,B)} & \text{if } \max(R,G,B) \neq 0 \\ 0 & \text{if } \max(R,G,B) = 0 \end{cases}    (2.4)

h = \arccos \frac{(R-G) + (R-B)}{2\sqrt{(R-G)^2 + (R-B)(G-B)}}    (2.5)

where i, s and h are the intensity, saturation and hue values respectively.
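A minimal sketch of this transformation, following equations (2.3)-(2.5) as given above; the scaling of the resulting h, s and i values to the 0..255 range used later for binning is an assumption.

```python
import numpy as np

def rgb_to_hsi(rgb):
    """Convert an (..., 3) array of RGB values to HSI per (2.3)-(2.5)."""
    rgb = rgb.astype(np.float64)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx = np.maximum(np.maximum(R, G), B)
    mn = np.minimum(np.minimum(R, G), B)

    i = mx                                                          # (2.3)
    s = np.where(mx != 0, (mx - mn) / np.maximum(mx, 1e-12), 0.0)   # (2.4)
    num = (R - G) + (R - B)
    den = 2.0 * np.sqrt((R - G) ** 2 + (R - B) * (G - B)) + 1e-12
    h = np.arccos(np.clip(num / den, -1.0, 1.0))                    # (2.5), radians

    # Scale to 0..255 so the channels can be binned like 8-bit values
    # (assumed convention); h spans [0, pi], s spans [0, 1].
    return np.stack([h * 255.0 / np.pi, s * 255.0, i], axis=-1)
```

Since the transformation is expensive, it can be evaluated once for all 256^3 RGB triples and stored in a lookup table (see section 2.2.5).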


Color Spaces with 2 Channels

For reasons of color compactness and classification speed, one might want to convert to a 2D color space. When converting to a two-dimensional color space, the choice of color space becomes more critical, as information is discarded during the transformation. Switching to a 2D color space, the illumination component can be discarded, for example. In this case we use the rg color space (chromaticity). This is a normalized color space, which is obtained through the transformation

r = \frac{R}{R + G + B}    (2.6)

g = \frac{G}{R + G + B}    (2.7)

where r and g are the normalized red and green values respectively. This transformation enables both a very compact representation of skin color pixels and fast computation.

Another option would be to use the hue and saturation components of the HSI transformation. This should yield similar results, as similar color information is discarded.

2.2.2 Histogram Based Approach

As the study by Schmugge et al. [8] reveals, the highest performance can be obtained using a histogram-based method to model the skin color in the HSI color space. As this is a rather slow color space transformation, the transformed color values are stored in a lookup table. The histogram-based method introduced in this section is similar to the system presented by Jones et al. [9], but uses the HSI color space instead of RGB, and adds online training of the color models.

Histograms

Two histograms are kept, one for the skin color distribution (s), and one for the non-skin colors (n).


Schmugge et al. [8] suggest both histograms to have 8 bins per color channel. It makes more sense, however, to assign more importance to the hue information, as it defines the actual color, and less importance to the intensity information, as it varies greatly with illumination and shadow. Therefore we assign 8 bins to the hue channel, 4 bins to the saturation channel and 2 bins to the intensity channel. In our experiments this greatly improves the accuracy of the skin color segmentation.

The color hsi is assigned to bin s[h's'i'] as

h' = \lfloor h \cdot 8/256 \rfloor    (2.8)
s' = \lfloor s \cdot 4/256 \rfloor    (2.9)
i' = \lfloor i \cdot 2/256 \rfloor    (2.10)

where h', s' and i' are the indices of the bin in a three-dimensional matrix. The color is assigned to the bin by incrementing the value of the bin. The total pixel count T_s of the histogram is also incremented. Each skin color pixel in the training data is assigned to a bin in the skin color histogram s, and each non-skin color pixel is assigned to a bin in the non-skin color histogram n. The labeled training pixels are used to construct skin and non-skin color models for skin detection.

Skin and Non-skin Color Models

Given skin and non-skin histograms, we can compute the probability that a given color value belongs to the skin and non-skin classes:

P(hsi \mid \text{skin}) = \frac{s[h's'i']}{T_s}    (2.11)

P(hsi \mid \text{non-skin}) = \frac{n[h's'i']}{T_n}    (2.12)

where s[h's'i'] is the pixel count contained in bin h's'i' of the skin histogram, n[h's'i'] is the equivalent count from the non-skin histogram, and T_s and T_n are the total counts contained in the skin and non-skin histograms, respectively.


The pixel is then classified as a skin pixel by applying a threshold,

\frac{P(hsi \mid \text{skin})}{P(hsi \mid \text{non-skin})} > T.    (2.13)
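A sketch of the resulting per-pixel classification, assuming the two histograms use the 8 × 4 × 2 binning described above; the threshold value and the guard against empty bins are assumptions.

```python
import numpy as np

def bin_index(h, s, i):
    """Bin indices of equations (2.8)-(2.10) for an HSI triple in 0..255."""
    return int(h * 8 // 256), int(s * 4 // 256), int(i * 2 // 256)

def is_skin(hsi, s_hist, n_hist, T=1.0, eps=1e-6):
    """Likelihood-ratio test (2.13) for one pixel.

    s_hist, n_hist: the (8, 4, 2) skin and non-skin histograms.
    """
    idx = bin_index(*hsi)
    p_skin = s_hist[idx] / max(s_hist.sum(), eps)       # (2.11)
    p_nonskin = n_hist[idx] / max(n_hist.sum(), eps)    # (2.12)
    return p_skin / max(p_nonskin, eps) > T             # (2.13)
```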

Online updating of the histograms

The histograms s and n are first initialized using a face detector, such as the one provided in OpenCV [1]. The pixels inside the face region are considered skin pixels, while the pixels outside the region are considered non-skin pixels. This is a rough initialization, but the histograms will be refined as the algorithm proceeds.

(a) input image (b) face region

(c) non-skin region (d) skin color segmentation

Figure 2.5: Online updating of the skin and non-skin color histograms. The segmented regions are marked in red. Note that the skin segmentation used in this case is after post-processing, as described in section 3.4.


In each new frame, the face region is found using the face detector, and the pixels inside the face region are used to update the skin color histogram s. Then, the skin color detection algorithm is run and finds the face region as well as other skin regions such as the hands and arms. The pixels which are not classified as skin are then used to update the non-skin color histogram n. This is done with a margin of some pixels, so that pixels which are close to skin regions are not added to the non-skin color histogram. This is illustrated in figure 2.5.

The histograms s and n are updated by keeping 90% of the old histogram and taking 10% of the new pixel counts. This is done by keeping only 90% of the bin values and of the total pixel count in each iteration,

T_s = f_{mem} \cdot T_s',    (2.14)

and for each bin i,

s[i] = f_{mem} \cdot s'[i],    (2.15)

where T_s' and s'[i] are the old total pixel count and the old bin values respectively, and f_mem is the factor of how much of the old bins is kept in memory. The color hsi is then assigned to a bin by incrementing the bin and the total pixel count with (1 − f_mem) instead of with 1. The same is done for the non-skin color histogram. This allows the histograms to evolve over time, and to take changes in illumination into account as the system runs.
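A sketch of this exponential-decay update for one histogram; the total pixel count is taken implicitly as the sum of the bins, and f_mem = 0.9 corresponds to the 90%/10% split described above.

```python
import numpy as np

F_MEM = 0.9  # fraction of the old histogram kept in each update (90%)

def update_histogram(hist, new_pixels_hsi):
    """Decay-and-accumulate update following (2.14)-(2.15).

    hist: (8, 4, 2) float histogram; new_pixels_hsi: (N, 3) array of HSI
    values of the pixels observed in the current frame.
    """
    hist *= F_MEM                       # keep 90% of the old bin values
    for h, s, i in new_pixels_hsi:
        idx = (int(h * 8 // 256), int(s * 4 // 256), int(i * 2 // 256))
        hist[idx] += 1.0 - F_MEM        # new evidence enters with weight 0.1
    return hist
```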

Results

The resulting skin color segmentation is fairly fast and accurate, segmenting a 640 × 480 frame in about 40 ms and updating the models in about 30 ms. Our experiments show that the histogram-based segmentation is especially good at segmenting the skin from skin-like backgrounds (e.g. beige walls) and red or pink clothing (where the GMM-based method in the next section often fails). This is because the beige walls and the pink clothing are explicitly added to the non-skin color model at run time. It has some noise, however, because the models are based on limited training data.


Figure 2.6: Example of the histogram-based skin color segmentation with online training of the color models. The regions segmented as skin regions are marked in red.

Therefore the algorithm makes mistakes which sometimes seem trivial, but are due to limited training data. An example of the histogram-based skin color segmentation without post-processing of the results is shown in figure 2.6.

2.2.3 Gaussian Mixture Model Based Approach

In the Gaussian mixture model-based approach, the training pixels are transformed to the rg color space, and a Gaussian mixture model is fitted to their distributions. As this approach requires training in advance, it cannot adapt to changes in lighting. Therefore the rg color space is used, where the illumination component of the color is discarded, which allows for some degree of illumination invariance.


General Model

A Gaussian mixture model is fitted to the distribution of the training skin color pixels using the expectation maximization algorithm [10]. The mixture model looks like

p(c \mid \text{skin}) = \sum_i \frac{w_i}{2\pi\sqrt{|\sigma_i|}} \cdot \exp\left( -\frac{1}{2} (c - \mu_i)^T \sigma_i^{-1} (c - \mu_i) \right)    (2.16)

and a similar model is built for p(c | non-skin), where c is the pixel color in the rg color space. Given these models, the probability that a pixel belongs to the skin class is, according to Bayes' rule,

P(\text{skin} \mid c) = \frac{p(c \mid \text{skin})}{p(c \mid \text{skin}) + p(c \mid \text{non-skin})}    (2.17)

assuming P(skin) = P(non-skin).

Lookup table

The probabilities P(skin | c) can be computed in advance and stored in a lookup table. Pixels are then classified by looking up P(skin | c) in the lookup table and applying a threshold to the probability. Using a lookup table, a 640 × 480 frame can be segmented in about 30 ms.
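A sketch of how such a lookup table could be filled, assuming the GMM parameters (weights, means, covariances) for the skin and non-skin classes have already been fitted with EM; the grid resolution and parameter packing are illustrative choices.

```python
import numpy as np

def gmm_density(c, weights, means, covs):
    """Evaluate the 2D mixture density (2.16) at rg points c of shape (N, 2)."""
    p = np.zeros(len(c))
    for w, mu, cov in zip(weights, means, covs):
        d = c - mu
        inv = np.linalg.inv(cov)
        mah = np.einsum('ni,ij,nj->n', d, inv, d)   # Mahalanobis distances
        p += w / (2.0 * np.pi * np.sqrt(np.linalg.det(cov))) * np.exp(-0.5 * mah)
    return p

def build_skin_lut(skin_gmm, nonskin_gmm, bins=256):
    """Precompute P(skin | c) of (2.17) on a bins x bins grid over (r, g)."""
    r, g = np.meshgrid(np.linspace(0, 1, bins), np.linspace(0, 1, bins),
                       indexing='ij')
    c = np.stack([r.ravel(), g.ravel()], axis=1)
    p_skin = gmm_density(c, *skin_gmm)
    p_non = gmm_density(c, *nonskin_gmm)
    return (p_skin / (p_skin + p_non + 1e-12)).reshape(bins, bins)
```

At run time a pixel is classified by quantizing its (r, g) value to the grid, reading the stored probability and applying a threshold.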

Results

The resulting skin color segmentation is extremely robust under constrained lighting conditions. However, it requires a large amount of training data and training time to build the model and lookup table. It is not possible to update the model online, as in the histogram-based method. As the model is trained across a large variation of illumination possibilities, the range of what is considered skin is rather large. It often confuses skin with beige backgrounds and red objects in the image. These problems are addressed in the next section.


2.2.4 Post Processing

Combining histogram-based and GMM-based

The histogram-based method performs rather well at detecting skin color pixels under varying lighting conditions. However, as it bases its classification on very little input data, it produces a lot of false positives, due to colors that are not sufficiently present in the non-skin training pixels. The pre-trained Gaussian mixture model-based method performs well in constrained lighting conditions. Under varying lighting, however, it tends to falsely detect white and beige regions in the background.

Interestingly, both methods make different mistakes. Therefore we can apply both methods with a low threshold and then combine the results (AND function). The resulting skin color segmentation is very robust and fast, as a 640 × 480 frame is segmented in about 46 ms. It is illustrated in figure 2.7.

Median filtering

In both the histogram-based approach and the GMM-based approach, we assume that the skin color probabilities of the different pixels are independent. For two neighboring pixels i and j this means

P(\text{skin}_i, \text{skin}_j) = P(\text{skin}_i) \cdot P(\text{skin}_j).    (2.18)

In this case a significant amount of information is lost. Pixels that are surrounded by skin pixels will have a higher probability of being skin colored themselves, and vice versa.

An option is to perform a smoothing on P(skin | c). This is possible by building a hidden Markov model (HMM) that puts restrictions on P(skin), based on training data. For this, the horizontal and vertical neighbors of the pixel are considered,

P(\text{skin}_i \mid \{\text{skin}_j : \|p_i - p_j\| \leq 1\})    (2.19)

We assume the model to be isotropic, by combining the horizontal and the vertical case. Subsequently, based on an HMM, a maximum entropy model is considered.


(a) input image (b) histogram based

(c) gaussian mixture model based (d) combination

Figure 2.7: Combining the histogram-based and the Gaussian mixture model-based segmentations. The regions segmented as skin regions are marked in white.

[10] shows that this HMM approach results in 1% fewer false positives than the base model.

A more efficient implementation than an HMM can be achieved by using filters. Instead of building a complex HMM, median filtering can be performed on P(skin | c). Through median filtering, pixels surrounded by skin pixels get a higher skin color probability and vice versa. This method provides similar results to an HMM in this context, and can be implemented with fewer CPU cycles.

In a binary image, median filtering boils down to choosing the most frequently appearing value (skin or non-skin) in a region around the pixel. In a continuous image, the pixels are sorted according to their skin color probability, and the middle value is selected.


Noise and holes will be efficiently filtered out. Furthermore, neighboring pixels' skin color probabilities are implicitly but efficiently taken into account. The result of median filtering is shown in figure 2.8.

Median filtering is especially useful when the shape of the segmented object is very important and cannot be noisy, for example in silhouette-based classification as described in sections 3.2 and 3.4. If the skin color segmentation is only used for localization, median filtering is unnecessary.
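A sketch of this filtering step, using SciPy's median filter on the probability map before thresholding; the window size and threshold are example values.

```python
from scipy.ndimage import median_filter

def smooth_skin_probability(p_skin, size=5, threshold=0.5):
    """Median-filter the per-pixel skin probability map so that isolated
    noise pixels and small holes are removed before thresholding."""
    p_smooth = median_filter(p_skin, size=size)
    return p_smooth > threshold
```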

(a) original segmentation (b) after median filtering

Figure 2.8: The result of median filtering (right) compared to the original segmentation (left). The regions segmented as skin regions are marked in white.

Connected components

As a final step in the post-processing, the remaining noise and unwanted objects can be filtered out based on their size, by connected components analysis. Pixels are grouped per skin component and sorted by size. We only consider components which are sufficiently large, and we only keep the 3 largest components (the head and the two hands). The remaining components are considered noise and discarded. This is shown in figure 2.9.
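A sketch of the size-based filtering, assuming a boolean skin mask as input; the minimum component size is an example value.

```python
import numpy as np
from scipy.ndimage import label

def keep_largest_components(skin_mask, n_keep=3, min_size=200):
    """Keep only the n_keep largest skin components (head and hands)
    that contain at least min_size pixels."""
    labels, n = label(skin_mask)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                                   # ignore the background label
    keep = np.argsort(sizes)[::-1][:n_keep]        # labels of the largest components
    keep = [k for k in keep if sizes[k] >= min_size]
    return np.isin(labels, keep)
```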


(a) original segmentation (b) connected components

Figure 2.9: The result of connected components analysis (right) compared to the original segmentation (after median filtering, left). The regions segmented as skin regions are marked in white.

2.2.5 Speed Optimizations

In order to make the skin color segmentation real-time, a number of optimizations are introduced. Firstly, the HSI transformation (section 2.2.1) is precomputed offline for all (R, G, B) values and stored in a lookup table. This allows the color transformation to be performed much faster, as no expensive square roots and arccosines have to be computed at run time.

Furthermore, the GMM-based probabilities are precomputed for all (r, g) values and stored in a lookup table. This way, at run time all that remains to be done is to look up the probability and apply a threshold.

Finally, the skin color segmentation is first applied to a downsampled image (2×). In the full resolution image, only the pixels corresponding to skin pixels in the downsampled image are tested for skin color, while the others are discarded. This step doubles the speed of the segmentation.

With these optimizations, segmenting one 640 × 480 image takes 46 ms on average. Updating the skin and non-skin color histograms takes 30 ms on average.
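The coarse-to-fine strategy could look as follows, assuming a per-pixel classification function is available; the nearest-neighbor upsampling of the coarse mask is an implementation choice.

```python
import numpy as np

def coarse_to_fine_segmentation(image, is_skin_pixel, factor=2):
    """Run the skin test on a downsampled image first, then re-test only
    the full-resolution pixels whose coarse block was classified as skin.

    is_skin_pixel: maps an (N, 3) array of RGB values to booleans.
    """
    small = image[::factor, ::factor]
    coarse = is_skin_pixel(small.reshape(-1, 3)).reshape(small.shape[:2])
    roi = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    roi = roi[:image.shape[0], :image.shape[1]]
    mask = np.zeros(image.shape[:2], dtype=bool)
    ys, xs = np.nonzero(roi)
    mask[ys, xs] = is_skin_pixel(image[ys, xs])   # full test only inside the ROI
    return mask
```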


2.2.6 Discussion

In this section, a number of contributions have been made to the performance of skin color segmentation. The starting point was a basic GMM-based skin color segmentation algorithm, which is relatively slow and restricted to fixed lighting conditions. Most of the scenarios in this thesis take place in rooms with windows, which means that the lighting is influenced by the time of day and the weather.

The segmentation was made more robust to changing lighting conditions by adding a histogram-based model which is updated online using the color from the face, which is located using a face detector. By combining the GMM-based and the histogram-based methods, many false positives can be eliminated, resulting in a more robust segmentation. The resulting segmentation was improved further with some post-processing steps, namely median filtering and connected components analysis, which reduce the noise in the segmentation.

To improve the speed of the system, the HSI color transformation, as well as the skin probabilities based on the GMM, are precomputed offline and stored in lookup tables. Furthermore, the skin color segmentation is first applied to a downsampled image in order to determine regions of interest. Then, detailed skin color segmentation is run inside these regions of interest.

Although these steps increase the performance of the skin color segmentation significantly, improvements are still possible. The segmentation works robustly in most lighting conditions, even with large windows. Still, the aperture of the camera has to be set manually, and the white balance has to be tweaked slightly for best performance. In future work, it would be interesting to improve the segmentation algorithm to make it even more robust to changing lighting conditions, and to look into ways of detecting the current lighting conditions automatically and setting the aperture and white balance of the camera accordingly.


2.3 3D Hull Reconstruction

When determining the full body posture of a person, it is useful to generate a 3D reconstruction of the person instead of detecting the pose based on 2D silhouettes. 3D hulls allow for a more robust pose classification process, and their orientation can be normalized. This means a 3D hull-based pose classifier can be made orientation invariant.

Computing the visual hull of an object requires its silhouettes in a number of available images together with the centers of projection and the calibration matrices of the corresponding cameras. If we want to reconstruct an object, we know that it is included in the generalized cone extruded from the silhouette with origin at the camera center. The intersection of these cones from multiple calibrated camera views yields a volume which contains the object. This principle is called Shape-from-Silhouette and produces a volume which approximates the object reasonably well if a sufficient number of cameras with different lines of sight are used. This approximated volume is called the visual hull of the object and is commonly defined as the largest possible volume which exactly explains a set of consistent silhouette images [11]. Figure 2.10 illustrates the principle for three camera views.

Note that the visual hull will never be an exact representation of the object, as concave regions cannot be reconstructed from silhouettes, and an infinite number of camera views would be needed to compute the exact visual hull [12]. However, our results have shown that even a coarse approximation of the person's visual hull from 4 to 5 views is sufficient to perform body pose estimation. In [13] guidelines can be found for choosing an optimal camera setup for object reconstruction. Our definition of the visual hull in this chapter is limited to using a finite number of camera views.

Algorithms for Shape-from-Silhouette can be roughly divided into three groups:

1. Volumetric Reconstruction using Voxels: This technique divides the working volume into a discrete grid of smaller volumes, so-called voxels, and projects them successively onto the image planes of the available camera views.


Figure 2.10: The visual hull of the object is the intersection of the generalized cones extruded from its silhouettes.

Voxels lying outside of the silhouette in at least one view do not belong to the intersection of the cones and can be discarded. Due to their simplicity, voxel-based procedures have been used recently for body tracking [14, 15, 16, 17, 18]. Voxel-based methods have the drawback that they tend to be expensive, as a high number of voxels must be projected into the image planes.

2. Polyhedral Visual Hull: This is a surface-based approach to compute the visual hull. It is computed from a polygonal representation of the silhouettes and applies constructive solid geometry (CSG) to compute the intersection of the corresponding polyhedra. Real-time algorithms have been proposed by Matusik et al. [19] and by Franco and Boyer [20]. The polyhedral visual hull offers better accuracy than voxel-based procedures, as the method does not work on a discretized volume. Moreover, the resulting triangle mesh is perfectly suited for rendering on graphics hardware.


However, due to the complexity of the geometric calculations, these algorithms are limited by their overall fragility: they rely on perfect silhouettes. Corrupted silhouettes often result in incomplete or corrupted surface models. In the application described in this chapter, silhouettes will often be corrupted by reflections and noisy segmentation.

3. Space Carving and Photo Consistency: Space carving is a volumetric reconstruction technique which uses color consistency as well as silhouettes, as proposed by Kutulakos and Seitz [21] and Seitz and Dyer [22]. Voxels which are not photo-consistent across all camera views in which they are visible are carved away. Photo consistency methods often assume constant illumination and Lambertian reflectance. The reconstructed volume only contains the surface voxels and is often referred to as the photo hull. The visibility criterion of the voxels is critical for this method and is usually solved by making multiple plane-sweep passes, using each time only the cameras in front of the plane, and iterating until convergence. Unfortunately, the complexity of this method makes it difficult to achieve real-time computation. Cheung et al. [23, 24] proposed a mixed approach between visual hull and photo consistency. They use the property that the bounding edge of the visual hull touches the real object in at least one point. Therefore, photo consistency only has to be tested for bounding edges of the visual hull, which can be done at moderate cost. Unfortunately, the reconstruction is then very sparse and needs a lot of input data to be practical.

Voxel-based methods for Shape-from-Silhouette are popular but tend to be computationally expensive, as a high number of voxels have to be projected into the camera images. Most implementations speed up this process by using an octree representation to compute the result from coarser to finer resolutions (Szeliski [25]), while others exploit hardware acceleration (Hasenfratz et al. [17]). Our method addresses the problem the other way around, as proposed by Kehl et al. [2]. Instead of projecting the voxels into the camera views at each frame, we keep a fixed lookup table (LUT) for each camera view and store a list at each pixel with pointers to all voxels that project onto that particular pixel, as illustrated in figure 2.11.


This way, the image coordinates of the voxels neither have to be computed during runtime nor stored in memory. Instead, the LUTs are computed once at startup. The proposed reversal of the projection allows for a compact representation of the voxels: each voxel is represented by a bit mask where each bit b_i is 1 if its projection lies in the foreground of camera i and 0 otherwise. Thus, a voxel belongs to the object (i.e. is labeled as active) if its bit mask only contains 1's. This can be evaluated rapidly by byte comparisons.


Figure 2.11: A lookup table (LUT) is stored at each pixel in the image with pointers to all voxels that project onto that particular pixel. This way, expensive projections of voxels can be avoided and the algorithm can take advantage of small changes in the images by only addressing voxels whose pixel has changed.

Another advantage of our method is that the voxel space can be updated instead of being computed from scratch for each frame. A voxel only changes its label if one of the pixels it projects to changes from foreground to background or vice versa. Therefore, as we can directly map from image pixels to voxels, we only have to look up the voxels linked to those pixels which have changed their foreground-background status. This leads to far fewer voxel look-ups compared to standard methods, where for each frame all voxels have to be visited in order to determine their labels.


The reconstruction itself is done by going pixel by pixel through all segmented (binary) images. If a pixel of the current view i has changed its value compared to the previous frame, the corresponding bit b_i for all voxels contained in the reference list of this pixel is set to the new value and their labels are determined again. Results of our reconstruction algorithm can be seen in figures 2.12 and 2.13. Using this approach, the reconstruction of a hull from 6 cameras takes about 15 ms.
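The update step could be sketched as follows, assuming the per-pixel voxel lists have been precomputed from the camera calibration at startup; class and variable names are illustrative and the sketch supports up to 8 cameras.

```python
import numpy as np

class VoxelHull:
    """Pixel-to-voxel LUT reconstruction sketch. luts[c][p] is the list of
    voxel indices that project onto pixel p of camera c."""

    def __init__(self, n_voxels, n_cameras, luts):
        self.luts = luts
        # One bit per camera, packed into an integer bit mask per voxel.
        self.bitmasks = np.zeros(n_voxels, dtype=np.uint8)
        self.full = (1 << n_cameras) - 1   # all bits set -> voxel is active

    def update(self, cam, changed_pixels, new_foreground):
        """Update only voxels linked to pixels whose foreground-background
        status changed in camera `cam`, then return the active voxels."""
        for p, fg in zip(changed_pixels, new_foreground):
            for v in self.luts[cam][p]:
                if fg:
                    self.bitmasks[v] |= (1 << cam)
                else:
                    self.bitmasks[v] &= ~(1 << cam) & 0xFF
        return self.bitmasks == self.full
```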

Figure 2.12: Example of how silhouettes from multiple camera views can be used to reconstruct a 3D hull of the user.

Figure 2.13: Examples of 3D hull reconstructions.


3 Detection and Recognition

In the previous chapter we described how the input images can be preprocessed to facilitate the detection and classification steps. We explained how foreground-background segmentation and 3D hull reconstruction can be used to generate 2D silhouettes resp. 3D hulls, which will be used for full body pose recognition. Furthermore, skin color segmentation was introduced, which will help us to detect skin-colored regions such as the hands of the user.

The overall strategy of our vision-based HCI is shown in figure 1.1. In this chapter we will introduce how body parts can be detected, and how body and hand gestures can be recognized.

Section 3.1 describes the detection of the face, eyes, hands and fingers. For example, detecting the location of the hand is necessary to be able to recognize the hand posture (section 3.4), and detecting the finger is necessary to be able to determine the pointing direction of the user (section 4.1).

Section 3.2 describes how body poses can be recognized based on 2D silhouettes of the user. This method is then extended to the 3D case for classifying 3D hulls of the user in section 3.3. This classification technique can also be used for hand gestures, as shown in section 3.4.

As the aim is to build real-time systems, each component has to be designed to use as few CPU cycles as possible. The computation time of the classification algorithms in sections 3.2, 3.3 and 3.4 can be improved with the help of Haarlets. In section 3.5 we introduce a new training algorithm for these Haarlets.


These Haarlets are used to build fast 2D body pose and hand gesture recognition systems. Furthermore, 3D Haarlets are introduced, which are used to improve the speed of the 3D body pose recognition system.

Finally, all the techniques introduced in this chapter are evaluated in section 3.6.


3.1 Face and Hand Detection

In most human-computer interaction systems, the next step after segmenting the user is finding body parts such as the face and the hands. Detecting the body parts has two purposes: (1) selecting them for further recognition steps, such as hand gesture recognition or face recognition, and (2) obtaining their location in the image or in the working volume, for example for analyzing the motion of the hands or determining a pointing direction.

3.1.1 Face Detection

For face detection we use the standard OpenCV implementation by Viola and Jones [1], which is based on 2D Haarlets. Haarlets are rectangular features, as shown in figure 3.1, and can be computed rapidly using the integral image. In the Viola and Jones classifier, these Haarlets are selected using AdaBoost and provide characteristic features of the human face, as shown in figure 3.2.

Figure 3.1: Example Haarlets shown relative to the enclosing rectangle window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of the pixels in the grey rectangles. Figure taken from [1].


Figure 3.2: Examples of features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the upper cheeks and nose. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose. Figure taken from [1].

Figure 3.3: Schematic depiction of the detection cascade. A series of classifiers is applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing, the number of sub-windows has been reduced significantly. Further processing can take any form, such as additional stages in the cascade. Figure taken from [1].


Figure 3.4: Example of a detected face, shown in the white rectangle.

The system uses a sliding window approach, which means a sub-window of 24 × 24 pixels is moved across every candidate face location in the image, and the classifier is applied to each sub-window. This is done at different scales of the image to allow for various distances from the face to the camera. As this implies a large number of sub-windows, the classifier is implemented as a detection cascade in order to make the system real-time. The goal of the detection cascade is to reject most sub-windows with a very small amount of computation, while allowing a positive sub-window to pass through the entire cascade to ensure that there really is a face present in the sub-window. The detection cascade is illustrated in figure 3.3. An example of a detected face is shown in figure 3.4.
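A minimal sketch using the OpenCV cascade classifier; the exact cascade file name and the cv2.data helper depend on the OpenCV version installed.

```python
import cv2

# Load one of the frontal-face cascades shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Run the Viola-Jones sliding-window cascade on a color frame and
    return the detected face rectangles as (x, y, w, h) tuples."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```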


3.1.2 Eye Detection

The centroid of the face can be used as an estimate of the eye location; however, we propose a few detection steps to determine the exact eye locations. The starting point is the face object as segmented by the skin color analysis.

Applying eigen-analysis to the positions of the face pixels, an ellipse can be fitted to the face. The eigen-analysis provides µ, λ1, λ2, v1 and v2: respectively the center of gravity, the two eigenvalues and the two eigenvectors of the covariance matrix of the face pixels. Subsequently, to find the eyes, the search region is reduced to a rectangle defined by the following four corners:

r_1 = \mu + 2\sqrt{\lambda_2} \cdot v_2 - 2\sqrt{\lambda_2} \cdot v_1    (3.1)
r_2 = \mu - 2\sqrt{\lambda_2} \cdot v_2 - 2\sqrt{\lambda_2} \cdot v_1    (3.2)
r_3 = \mu + 2\sqrt{\lambda_2} \cdot v_2    (3.3)
r_4 = \mu - 2\sqrt{\lambda_2} \cdot v_2    (3.4)

The result is a rectangle as shown in figure 3.5a. The height of the rectangle is defined by λ2, and not λ1, because the latter is too sensitive to the amount of neck or forehead visible in the image. The rectangle is rotated and scaled to a normalized size and resolution.

To detect the eyes, a number of distinct features are extracted. Based on each feature a pseudo-probability is calculated, and all these probabilities are combined into one robust result. The first probability is constructed based on luminance, as the eyes (especially the pupils) are distinctly darker spots in the image. First, the image is converted to an intensity image i(p),

i(p) = \frac{65 \cdot R(p) + 129 \cdot G(p) + 24 \cdot B(p)}{256} + 16    (3.5)

and the probability based on luminance is then defined as,

p_L(p) = \exp\left( \frac{-i(p)}{100} \right)    (3.6)


The exponent is chosen to amplify the probability of very dark pixels. The value of 100 was chosen experimentally. The result is shown in figure 3.5b. The second probability is based on color, as the eyes are very distinctly not skin-colored,

p_S(p) = p(\text{non-skin} \mid c_p)    (3.7)

where c_p is the color of pixel p. The result is shown in figure 3.5c. Finally, a third probability can be extracted from the horizontally integrated luminance,

S_I(p) = \int_{-w}^{+w} i(t, y) \, dt    (3.8)

where p = (x, y) is a pixel in the image. After normalization,

p_I(p) = \frac{S_I(p) - \min(S_I(p))}{\max(S_I(p)) - \min(S_I(p))}    (3.9)

Both the eyes and the shadow region around them present a much darker region inside the region of interest. Horizontal integration is a very robust technique to exploit this characteristic, as shown in figure 3.5d. The end result is obtained by combining all of these characteristics as a product of probabilities, as shown in figure 3.5e.

The eyes are localized using connected component analysis, as the two components with the highest probability, as shown in figure 3.5f. The method's accuracy exceeds 95% with a frontal observation of the user. It remains robust with respect to head rotations up to about 30° around the vertical axis. An example of eye detection is shown in figure 3.6.
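A sketch of the cue combination of equations (3.5)-(3.9) on the normalized face rectangle; whether the integrated-luminance cue needs to be inverted so that darker rows score higher is left as a comment, and `p_nonskin` is assumed to come from the skin color model of section 2.2.

```python
import numpy as np

def eye_probability_map(roi_rgb, p_nonskin):
    """Combine the luminance, color and integrated-luminance cues into one
    probability map over the normalized face rectangle."""
    R, G, B = [roi_rgb[..., k].astype(np.float64) for k in range(3)]
    i = (65 * R + 129 * G + 24 * B) / 256 + 16                    # (3.5)
    p_lum = np.exp(-i / 100.0)                                    # (3.6)
    s_int = i.sum(axis=1, keepdims=True)                          # (3.8), per row
    p_int = (s_int - s_int.min()) / (s_int.max() - s_int.min() + 1e-12)  # (3.9)
    # Depending on convention, darker rows (eyes and their shadow) may need
    # the inverted cue (1 - p_int) to score higher.
    return p_lum * p_nonskin * p_int                              # product of cues
```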


(a) original (b) pL (c) pS

(d) pI (e) combination (f) eyes

Figure 3.5: (a) region of interest, (b) probability based on luminance, (c) probability based on color, (d) probability based on integrated luminance, (e) all probabilities combined, (f) the eyes as the two components with highest probability.

Figure 3.6: Example of detecting the eyes.


3.1.3 Hand Detection

The hand detection problem is a lot more complex than face detection, as the appearance of hands can change dramatically. The number of degrees of freedom in the hand and finger joints is very high, while faces remain quite static: the eyes, nose and mouth always stay in the same position relative to each other.

Kolsch et al. [26], Ong et al. [27] and Micilotta et al. [28] use rectangular features and AdaBoost to detect hands, similar to the face detection in section 3.1.1. The problem with this approach is that it works well for certain specific and fixed hand postures, but fails when the hand can take any shape or pose. Hands have such a high number of degrees of freedom that it is almost impossible to model their shape in an AdaBoost-based detection cascade, whereas in a face, the relative positions of the eyes, nose and mouth stay fixed throughout the detection process. Other approaches are based on tracking, such as the approach by Stenger et al. [29]. However, these model-based approaches, though very accurate, require manual initialization and are not real-time.

In our approach, we deliberately choose not to look for the hands directly, but to look for hands as skin-colored objects relative to the face. First, both face detection and skin color segmentation are run on the entire input image. Using connected components analysis, the candidate regions are labeled. Next, the region that corresponds to the detected face is eliminated from the list of hand candidates, and regions that are too small are discarded. This way, we have a fast and robust hand detection system that is independent of the articulation of the hand. The detected hands can then be used as input for hand gesture recognition (section 3.4), or for finger detection (next section). Some examples of detected hands are shown in figure 3.7. The system has some limitations, however, as the hands cannot overlap with the face.


Figure 3.7: Examples of the hand detection.


3.1.4 Finger Detection

The goal is to extract the fingers from a segmented hand silhouette. Subsequently, the number of fingers can be counted, and the tip of a pointing finger can be determined. In [30] fingers are detected using an erosion-dilation method. First, the hand silhouette pixels are eroded until all fingers are removed, as shown in figure 3.8b. The structuring element is a circle, whose radius is relative to the scale of the hand. Then, the remaining object is dilated until the palm of the hand regains its original size, as shown in figure 3.8c. By subtracting this hand palm image from the original hand silhouette, the fingers remain, as shown in figure 3.8d. Using connected component analysis, the fingers can be labeled and counted. The hand palm silhouette can be used to determine the centroid of the hand. An example of the finger detection is shown in figure 3.9.

If only one finger is counted, a pointing gesture is assumed, and the tip of the finger is determined as the finger pixel furthest away from the centroid of the hand palm, as shown in figure 3.10.
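A sketch of the erosion-dilation step with OpenCV morphology; the structuring-element radius is assumed to be derived from the hand scale.

```python
import cv2

def detect_fingers(hand_mask, radius):
    """Erosion-dilation finger extraction on a binary uint8 hand silhouette.

    Returns the number of fingers and the finger and palm masks.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (2 * radius + 1, 2 * radius + 1))
    palm = cv2.erode(hand_mask, kernel)      # remove the fingers
    palm = cv2.dilate(palm, kernel)          # grow the palm back to its size
    fingers = cv2.subtract(hand_mask, palm)  # what remains are the fingers
    n_labels, _ = cv2.connectedComponents(fingers)
    return n_labels - 1, fingers, palm       # n_labels includes the background
```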

3.1.5 Discussion

In this section, we have described fast methods to detect the locations of different body parts. Detecting these body parts is a crucial step in, for example, determining the pointing direction or interacting using hand gestures.

The main weakness so far is that the hand detection relies entirely on color information. Thus, in difficult lighting conditions the hand detection may fail. The hand detection is made more adaptive to varying lighting conditions by using the color information from the face in the skin color segmentation. However, future work should also incorporate shape information into the hand detector. Unfortunately, our experiments using an AdaBoost-based hand detector (similar to the face detector), or using trackers, did not yield satisfying results. Hence, hand detection remains an important area where further research is needed.


(a) one finger (b) erosion (c) dilation (d) subtraction

(e) two fingers (f) erosion (g) dilation (h) subtraction

(i) three fingers (j) erosion (k) dilation (l) subtraction

(m) five fingers (n) erosion (o) dilation (p) subtraction

Figure 3.8: Finger detection.


Figure 3.9: Example of detecting the fingers on the hand.

Figure 3.10: Example of detecting the fingertip, shown as a green cross.


3.2 2D Body Pose Recognition

This section describes the recognition of the full body pose of a person. The work described in this section has also been presented in [31, 32]. The approach aims to classify poses based on 2D silhouettes of the person, which are obtained using foreground-background segmentation as described in section 2.1. In this section we propose an example-based classifier, which is based on Linear Discriminant Analysis (LDA).

In example-based approaches, observations are compared and matched against stored examples of human body poses. In the 2D approach described in this section the observations consist of silhouettes of the user. These silhouettes are extracted from videos of several fixed cameras around the person. Some examples of such silhouettes are shown in figure 3.11. The extracted silhouettes are normalized to a fixed resolution and position by defining a square bounding box around the silhouette, centered horizontally on the center of gravity of the silhouette. The cropped silhouette is then rescaled to a fixed resolution. Therefore, it is possible for the user to change position in the scene without significantly affecting the resulting silhouettes. The images containing the silhouettes from the different camera views are concatenated into one single image, which the classifier can process, as illustrated in figure 3.12.

Figure 3.11: Examples of silhouettes which are used for classification. Note the holes in the segmentation and the artifacts due to reflections on the floor.



Figure 3.12: Example of 3 camera views, foreground-background segmentation, and their concatenation into a single normalized sample.
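As an illustration, a simplified Python sketch of this normalization and concatenation; the 24 × 24 target resolution follows the example used later in this chapter, and the exact cropping rules of the thesis may differ:

```python
import cv2
import numpy as np

def normalize_silhouette(silhouette, out_size=24):
    """Crop a square box centered horizontally on the silhouette's center
    of gravity, spanning its vertical extent, and rescale to out_size x out_size."""
    ys, xs = np.nonzero(silhouette)
    cx = int(xs.mean())                          # horizontal center of gravity
    top, bottom = ys.min(), ys.max()
    half = (bottom - top) // 2 + 1               # square box spanning the silhouette height
    crop = silhouette[top:bottom + 1, max(0, cx - half):cx + half]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_AREA)

def concatenate_views(silhouettes, out_size=24):
    """Concatenate the normalized silhouettes of all camera views into a
    single sample (e.g. 3 views of 24 x 24) and flatten to a vector."""
    sample = np.hstack([normalize_silhouette(s, out_size) for s in silhouettes])
    return sample.reshape(-1).astype(np.float32) / 255.0
```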

3.2.1 Background

This section provides an overview of methods to estimate body pose; they are divided into two categories: model-based and example-based. The model-based methods can also be called tracking methods, as they track individual body parts in an articulated body model. Example-based methods do not rely on body models but match the input to a set of predefined poses.

Tracking / Model-based

The introduction above contains a number of choices that have been made. The first is one in favor of example-based rather than model-based (tracking) techniques. Model-based approaches typically rely on articulated 3D body models [33, 34, 35, 36, 37]. In order to be effective they need to have a high number of degrees of freedom, in combination with non-linear anatomical constraints. Consequently, they require time-consuming per-frame optimization and the resulting trackers are too slow for real-time (> 25 Hz) approaches. They are also very sensitive to fast motions and segmentation errors.

Most model-based methods exploit 2D image information for tracking. However, these cues only offer rather weak support to the tracker, which quickly leads to sophisticated and therefore often rather slow optimization schemes. Multiple calibrated cameras allow for the computation of the 3D shape of the person, which provides a strong cue for tracking, as the 3D shape only contains information which is consistent over all the individual views with respect to some hypothesis, and thus discards, for example, clutter edges or spikes in the silhouettes. The increase of the computational power offered by cheap consumer PCs has made real-time computation of the 3D shape or hull possible and has created several interesting approaches to full body tracking.

Cheung et al. [23] introduced the SPOT algorithm, a rapid voxel-based method for the volumetric reconstruction of a person. Real-time tracking is achieved by assigning the voxels in the new frame to the closest body part of the previous one. Based on this registration, the positions of the body parts are updated over consecutive frames. However, this simple approach does not guarantee that two adjacent body parts do not drift apart, and it can lose track easily for moderately fast motions. Furthermore, to obtain a good segmentation, the person has to wear a dark suit. In [24], Cheung et al. use both color information and a shape-from-silhouette method for full body tracking, although no longer in real-time. They use colored surface points (CSPs) to segment the hull into rigidly moving body parts, based on the results of the previous frames, and take advantage of the constraint of equal motion of parts at their coupling joints to estimate joint positions. A complex initialization sequence recovers the joint positions of an actor, which are used to track the same person in new video sequences.

Mikic et al. [38] proposed a similar voxel-based method for full body tracking. After volumetric reconstruction, the different body parts are located using sequential template growing and fitting. The fitting step uses the placement of the torso computed by template growing to obtain a better starting point for the voxel labeling. Furthermore, an extended Kalman filter is used to estimate the parameters of the model given the measurements. To achieve robust tracking the method uses prior knowledge of average body part shapes and dimensions.

Kehl et al. [2] also propose a markerless solution for full body pose tracking. A model built from superellipsoids is fitted to a colored volumetric reconstruction using Stochastic Meta Descent (SMD), while taking advantage of the color information to overcome ambiguities caused by limbs touching each other. In order to increase the robustness and accuracy, the tracking is refined by matching model contours against image edges. Results of this tracker are shown in figure 3.13. Similar to the previously mentioned tracking approaches, this system was capable of tracking one frame in approximately 1.3 seconds. As the input data for a real-time system are generally recorded at 15-30 Hz, this tracking approach is too slow and as a result too sensitive to fast motions.

Figure 3.13: The full body pose tracker presented by Kehl et al. [2] in action.

The tracking-based approaches suffer from a trade-off between having a complex, accurate tracker at ± 1 Hz, or having a faster but less accurate tracker. In both cases it will be difficult not to lose track of the person in an interactive system where the user walks around and moves a lot. Therefore the decision was made to look at example-based methods. The stored examples can be 2D silhouettes of the person, or 3D hulls which have been reconstructed.

Example-based

Example-based methods benefit from the fact that the set of typically interesting poses is far smaller than the set of anatomically possible ones, which is good for robustness. As the pose is estimated on a frame-by-frame basis, it is not possible to lose track of an object. Also, not needing an explicit parametric body model makes them more amenable to real-time implementation and application to the pose analysis of other structures than human bodies, e.g. animals. Silhouettes (and their derived visual hulls) seem to capture the essence of human body pose well. Compared to model-based methods, not many example-based pose-estimation methods exist in the literature. Rosales and Sclaroff [39] train a neural network to map example 2D silhouettes to 2D positions of body joints. Shakhnarovich et al. [40] outline a framework for fast pose recognition using parameter sensitive hashing. In their framework, image features such as edge maps, vector responses of filters and edge direction histograms can be used to match silhouettes against examples in a database. In [3], Ren et al. apply this parameter sensitive hashing framework to use 2D Haarlets for pose recognition. These Haarlets are trained using AdaBoost.

Figure 3.14: 3D hull reconstruction: (a) input image; (b) silhouettes extracted from the input images using foreground-background segmentation, which can be combined to reconstruct a 3D hull as shown in (c).

The biggest limitation of silhouette-based approaches is that the stored silhouettes are not invariant to changes in orientation of the person. To overcome this, a visual hull of the person can be reconstructed using the silhouettes taken from several camera views, and can then be rotated to a standard orientation before being used for training or classification, resulting in a rotation-invariant system. An example of such a hull is shown in figure 3.14c.

The example-based approach proposed by Cohen and Li [41] matches 3D hulls using an appearance-based 3D shape descriptor and a support vector machine (SVM). This method is rotation-invariant, but, running at 1 Hz, it is not real-time. Weinland et al. [42] and Gond et al. [43] propose similar hull-based approaches but provide no statistics concerning classification speeds. In order to build a 3D hull-based system which is capable of real-time performance, we aim to combine the speed of Haarlets with the strength of ANMM.

3.2.2 Classifier Overview

The input of the classifier consists of a concatenation of normalized silhouettes as shown in figure 3.12. This concatenation is a grayscale image (for example 72 × 24 pixels, made up of three 24 × 24 silhouettes), of which the grayscale values are then stored in a vector (with, for example, 72 · 24 = 1728 values). This vector can be seen as a data point in an n-dimensional space (n = 1728). Each input sample is thus an n-dimensional vector, which the classifier projects onto a lower dimensional space with a transformation Wopt. Projecting the input vector to the lower dimensional space results in a number of coefficients, which are used to match the input sample to a pose.

This classifier's structure is shown in figure 3.15. The transformation Wopt is found using Average Neighborhood Margin Maximization (ANMM), a variation on Linear Discriminant Analysis (LDA). It projects the input vectors onto a lower dimensional space where the different pose classes are maximally separated and easier to classify. Using a nearest neighbors (NN) approach these projected samples (coefficients) are matched to stored poses in a database and the closest match is the output of the system. Later on, in order to improve the speed of the system, the transformation Wopt can be approximated using Haarlets, which will be discussed in section 3.5.
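In code, the classification step amounts to a matrix-vector product followed by a nearest-neighbor lookup. A minimal sketch (the names are illustrative; class_coeffs would hold the projected database examples, e.g. the per-class mean coefficients):

```python
import numpy as np

def classify_pose(sample, W_opt, class_coeffs, class_labels):
    """Project an input sample (flattened silhouette vector) onto the
    ANMM/LDA space and return the label of the nearest stored pose.
    W_opt: (m x n) projection matrix, class_coeffs: (k x m) projected
    examples stored in the database."""
    coeffs = W_opt @ sample                        # m-dimensional coefficient vector
    dists = np.linalg.norm(class_coeffs - coeffs, axis=1)
    return class_labels[int(np.argmin(dists))]     # nearest neighbor (NN) match
```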



Figure 3.15: Basic structure of the classifier. The input samples (silhouettes) are projected with transformation Wopt onto a lower dimensional space, and the resulting coefficients are matched to poses in the database using nearest neighbors (NN).

3.2.3 Linear Discriminant Analysis (LDA)

The goal of the LDA step is to find a transformation that will help to discriminate between the different pose classes. It provides a linear transformation which will project the input silhouettes onto a lower dimensional space where they are maximally separated before they are classified. The training examples (silhouettes) are divided into different pose classes. The pixel values of these silhouettes are stored in an n-dimensional vector, where n is the total number of pixels in the input silhouettes. The idea is to find a linear transformation such that the classes are maximally separable after the transformation into the lower-dimensional space [44]. The class separability can be measured by the ratio of the determinant of the between-class scatter matrix SB and the within-class scatter matrix SW. The optimal projection Wopt is chosen as the transformation that maximizes the ratio,

W_{opt} = \arg\max_W \frac{|W S_B W^T|}{|W S_W W^T|}, \qquad (3.10)



and is determined by calculating the generalized eigenvectors of SB and SW. Therefore,

W_{opt}^T = \begin{bmatrix} v_1 & v_2 & \cdots & v_m \end{bmatrix}, \qquad (3.11)

where vi are the generalized eigenvectors of SB and SW corresponding to the m largest generalized eigenvalues λi. The eigenvalues represent the weight of each eigenvector, and are stored in a diagonal matrix D, while the eigenvectors vi represent characteristic features of the different pose classes. A solution for the optimization problem in equation (3.10) is to compute the inverse of SW and solve an eigenproblem for the matrix S_W^{-1} S_B [44]. Unfortunately SW will be singular in most cases, because the number of training examples is smaller than the number of dimensions in the sample vector. It is however possible to solve the eigenproblem using simultaneous diagonalization [44], which produces a result even though SW is singular. Yet, it is better to look for an alternative where a different matrix is used which does not suffer from this dimensionality problem, as explained in the next section.

3.2.4 Average Neighborhood Margin Maximization (ANMM)

LDA aims to pull apart the class means while compacting the classes themselves. This introduces the small sample size problem, which renders the within-class scatter matrix singular. Furthermore, LDA can only extract c − 1 features (where c is the number of classes), which is suboptimal for many applications. ANMM, as proposed by Wang and Zhang [45], is a similar approach which avoids these limitations. For each data point, ANMM aims to pull the neighboring points with the same class label towards it as near as possible, while simultaneously pushing the neighboring points with different labels away from it as far as possible. This principle is illustrated in figure 3.16.



Figure 3.16: An illustration of how ANMM works. For each sample, within a neighborhood (marked in gray), samples of the same class are pulled towards it, while samples of a different class are pushed away, as shown on the left. The figure on the right shows the data distribution in the projected space.

Instead of using the between-class scatter matrix SB and the within-class scatter matrix SW, ANMM defines a scatterness matrix as,

S = \sum_{i,k:\, x_k \in N_i^e} \frac{(x_i - x_k)(x_i - x_k)^T}{|N_i^e|} \qquad (3.12)

and a compactness matrix as,

C = \sum_{i,j:\, x_j \in N_i^o} \frac{(x_i - x_j)(x_i - x_j)^T}{|N_i^o|} \qquad (3.13)

where N_i^o is the set of n most similar data which are in the same class as x_i (the n nearest homogeneous neighborhood), and N_i^e is the set of n most similar data which are in a different class than x_i (the n nearest heterogeneous neighborhood). The ANMM eigenvectors Wopt can then be found by the eigenvalue decomposition of S − C.



ANMM introduces three important benefits compared to traditional LDA: it avoids the small sample size problem since it does not need to compute any matrix inverse; it can find the discriminant directions without assuming a particular form of class densities (LDA assumes a Gaussian form); and finally much more than c − 1 feature dimensions are available. Some examples of resulting ANMM eigenvectors are shown in figure 3.17.
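A small numpy sketch of ANMM training following equations (3.12) and (3.13), building S and C from the n nearest homogeneous and heterogeneous neighbors and taking the leading eigenvectors of S − C; it is written for clarity rather than speed, and the function and parameter names are not from the thesis:

```python
import numpy as np

def anmm(X, y, n_neighbors=5, n_components=10):
    """Train an ANMM projection. X: (N x n) samples as row vectors,
    y: (N,) class labels. Returns W_opt (rows are ANMM eigenvectors)
    and the diagonal weight matrix D (the eigenvalues)."""
    N, n = X.shape
    S = np.zeros((n, n))                 # scatterness matrix, eq. (3.12)
    C = np.zeros((n, n))                 # compactness matrix, eq. (3.13)
    for i in range(N):
        d = np.linalg.norm(X - X[i], axis=1)
        same = np.where(y == y[i])[0]
        same = same[same != i]
        diff = np.where(y != y[i])[0]
        homo = same[np.argsort(d[same])][:n_neighbors]    # nearest homogeneous neighborhood
        hetero = diff[np.argsort(d[diff])][:n_neighbors]  # nearest heterogeneous neighborhood
        for k in hetero:
            v = (X[i] - X[k])[:, None]
            S += (v @ v.T) / len(hetero)
        for j in homo:
            v = (X[i] - X[j])[:, None]
            C += (v @ v.T) / len(homo)
    evals, evecs = np.linalg.eigh(S - C)        # S - C is symmetric
    order = np.argsort(evals)[::-1][:n_components]
    W_opt = evecs[:, order].T                   # rows are the ANMM eigenvectors
    D = np.diag(evals[order])                   # eigenvalues serve as weights
    return W_opt, D
```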

Figure 3.17: The first 4 eigenvectors for the frontal view only, after training for a 12 pose set, using the ANMM algorithm. Dark regions are positive values and white regions are negative values.

3.2.5 Rotation Invariance

While it is easy to normalize the size and location of 2D silhouettes, it is impossible to normalize their orientation. Therefore a 2D silhouette-based classifier cannot classify the pose of a person with changing orientation.

To overcome this limitation, there are two solutions: (1) classifying 3D hulls rather than 2D silhouettes, as described in section 3.3, or (2) training a separate 2D classifier for each possible orientation of the user. In the latter case the training samples are divided into 36 individual bins, depending on the angle of orientation. For each bin a separate 2D classifier is trained. In the classification stage, depending on the measured angle of rotation, the appropriate 2D classifier is used, as sketched below. This allows for a pseudo-rotation-invariant classifier, which is described in more detail in the experiments section 3.6.2, where it is compared to the 3D hull-based solution.
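The orientation binning itself is straightforward; a sketch, assuming 36 bins of 10 degrees and a measured orientation angle in degrees:

```python
import numpy as np

def select_classifier(classifiers, angle_deg, n_bins=36):
    """Pick the 2D classifier trained for the orientation bin that
    contains the measured angle (36 bins of 10 degrees each)."""
    bin_idx = int(np.floor((angle_deg % 360.0) / (360.0 / n_bins)))
    return classifiers[bin_idx]
```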



3.2.6 Discussion

In this section, we presented a state-of-the-art method for pose recognition based on 2D silhouettes. Experiments in section 3.6 tested the method for the case where the user remains in the same orientation. These experiments show that ANMM indeed performs better than LDA, and that, for the case of fixed orientation of the user, the 2D method performs as well as the 3D method described in section 3.3.

The biggest limitation of the method described in this section is that the orientation of the 2D input silhouettes cannot be normalized, so that the method is not rotation invariant. This can be overcome to some extent by training a classifier for each possible orientation, as shown in figure 3.18, where some examples of segmentation and classification results are shown. However, experiments in section 3.6.2 show that it is better to switch to a 3D approach, where the orientation of the 3D input hulls can be normalized.

Another limitation is that the method is limited to a pre-defined set of poses, and thus cannot recover the full articulated pose of the user. This articulated pose could be estimated, however, by interpolating between the closest poses found in the set of pre-defined poses.

The algorithm presented in this section is not limited to classifying body poses based on silhouettes. As will be shown in section 3.4, for example, it can also be used to classify hand gestures.

As the number of poses in the database increases, the method will slow down as more feature vectors have to be computed for classification. In order to improve the speed, the feature vectors can be approximated with Haarlets, which will be described in section 3.5.



Figure 3.18: Examples of segmentation and classification results (input, segmentation from 3 camera views, detected pose).



3.3 3D Body Pose Recognition

This section is similar to section 3.2, except that it aims to classify poses based on 3D hulls of the person, rather than 2D silhouettes, with the benefit of increased robustness and rotation invariance of the classifier. The work presented in this section is also presented in [31, 32, 46].

An example-based classifier is proposed, which means that the input samples (hulls) are compared to poses stored in a database. Each frame is classified independently from the others. In order to obtain the 3D hulls, several cameras are placed around the person. Any number of cameras can be chosen, but it is best to deploy sufficient cameras to make a good 3D voxel reconstruction. Using background subtraction as described in section 2.1, the silhouettes are extracted from each camera view. These silhouettes are then used for the 3D voxel reconstruction as described in section 2.3. An example of such a 3D hull is shown in figure 3.19.

Figure 3.19: Example of a reconstructed 3D hull of the user.

The pose classification problem using 2D silhouettes becomes more difficult when the subject can not only change position freely, but also orientation. While a change of position in 2D can easily be normalized for, it is impossible to normalize for the rotation of the subject. In a 3D hull approach, however, it is possible to normalize the rotation of the 3D hulls before classifying them. Normalizing the rotation of the hull consists of measuring the angle of its orientation, and then rotating it to a standard orientation. The goal is that, regardless of the orientation of the subject, the resulting normalized hull will look the same, as illustrated in figure 3.20.

Figure 3.20: Examples of different orientations of the user, resulting in similar normalized hulls.

3.3.1 Classifier Overview

The input of the classifier consists of 3D hulls as shown in figure 3.20. These hulls are 3-dimensional matrices of a fixed size (for example 24 × 24 × 24) containing voxel values between 0 and 1 (where 1 means the voxel is part of the user, 0 means the voxel is thin air, and the values in between allow for some smoothing of the edges of the hull). The voxel values of the input hulls are, however, stored as an n-dimensional vector (where, for example, n = 24 · 24 · 24 = 13824). This vector can be seen as a data point in an n-dimensional space. Each input sample is thus an n-dimensional vector, which the classifier projects onto a lower dimensional space with a transformation Wopt. Projecting the input vector to the lower dimensional space results in a number of coefficients, which are used to match the input sample to a pose. This classifier structure is shown in figure 3.21.
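A minimal sketch of the hull normalization and vectorization, assuming the measured orientation angle (see section 3.3.3) is given in degrees and that the rotation is applied in the horizontal plane spanned by the first two voxel axes; the axis convention and interpolation order are assumptions:

```python
import numpy as np
from scipy.ndimage import rotate

def normalize_hull(hull, orientation_deg):
    """Rotate a voxel hull (e.g. 24 x 24 x 24, values in [0, 1]) about the
    vertical axis to a standard orientation and flatten it to a vector."""
    rotated = rotate(hull, -orientation_deg, axes=(0, 1),
                     reshape=False, order=1, mode="constant", cval=0.0)
    return np.clip(rotated, 0.0, 1.0).reshape(-1)
```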



The transformation Wopt is found using Average Neighborhood Margin Maximization (ANMM), a variation of Linear Discriminant Analysis (LDA). It projects the input samples onto a lower dimensional space where the different pose classes are maximally separated and easier to classify. Using a nearest neighbors (NN) approach these projected samples are matched to stored poses in a database and the closest match is the output of the system. In order to improve the speed of the system, the transformation Wopt can be approximated using Haarlets, which will be discussed in section 3.5.

Figure 3.21: Basic classifier structure. The input samples (3D hulls) are projected with transformation Wopt onto a lower dimensional space, and the resulting coefficients are matched to poses in the database using nearest neighbors (NN).

3.3.2 Average Neighborhood Margin Maximization (ANMM)

As in the 2D approach in section 3.2, the optimal transformation to separate the pose classes is found using ANMM. Instead of silhouettes, training and classification are based on 3D hulls. The resulting ANMM eigenvectors are thus 3D features, as shown in figure 3.23. As the method is very similar to the 2D approach, we refer to section 3.2.4 for further details.

Figure 3.22: An illustration of how ANMM works. For each sample, within a neighborhood (marked in gray), samples of the same class are pulled towards it, while samples of a different class are pushed away, as shown on the left. The figure on the right shows the data distribution in the projected space.




Figure 3.23: Three example ANMM eigenvectors. The first example vector will inspect the legs, while the last example shows a feature that distinguishes between the left and the right arm stretched forward. Dark regions are positive values and white regions are negative values.

3.3.3 Orientation Estimation

A visual overhead tracker is used to find the orientation of the user. Our visual tracker is an adaptation of a color-based particle filter presented by Nummiaro et al. [47]. The tracker uses a set of particles to model the posterior probability distribution of the state of the user. During each iteration, the tracker generates a set of new hypotheses for the state by propagating the particles using a simple dynamic model. This generates the prior distribution of the next state, which is then tested using the observation of the image captured by the overlooking camera. The top view of a person is modeled by an ellipse and a circle representing the shoulders and the head, respectively, as shown in figure 3.24. The color distribution of these two regions is compared to a stored model histogram to yield the likelihood for the state of each particle.

Multiple hypotheses

Particle filtering is a multiple hypotheses approach, which means that several hypotheses s(1), . . . , s(N) exist at the same time and are kept during tracking. Each hypothesis, or particle, s(j) represents one hypothetical state of the user, with a corresponding discrete sampling probability π(j), j = 1, . . . , N. In the original tracker [47], a particle consists of an elliptical blob with a position and size, which describes the boundary of the object being tracked. Each particle (in the following, the index j is dropped for simplicity of notation) is specified as

s = \{x, y, \dot{x}, \dot{y}, H_M, H_m, a\}, \qquad (3.14)

where x and y represent the position of the center of the ellipse, ẋ and ẏ the motion, HM and Hm the major and minor axes of the ellipse, and a the corresponding scale change. As the size of the user is considered to be constant, and his motion is considered to be non-linear, ẋ, ẏ, HM, Hm and a are left out of the particle description. However, the orientation of the ellipse is needed, therefore each particle is specified as,

s = \{x, y, \beta\}, \qquad (3.15)

where β is the angle of orientation of the major axis of the ellipse. However, due to perspective changes and the head bobbing about, the appearance of the user changes continuously, and the tracker is not able to follow the user correctly. This is illustrated in figure 3.25a. Clearly, the head motion has to be modeled too. The model is extended from a simple ellipse to the combination of an elliptical shoulder region and a circular head region, which is illustrated in figure 3.24. The particle description is extended to,

s = \{x, y, \beta, c_x, c_y\}, \qquad (3.16)

where cx and cy are the position of the head circle relative to the ellipse center. The improved tracker performance is illustrated in figure 3.25b.

The particle set is initialized manually by mouse clicks on the user interface: one click to set the center of the shoulder ellipse (and of the head circle), and two on the shoulders to determine the size and orientation of the ellipse. Each of the N = 500 particles is assigned this same initial set of values. At each iteration t of the visual localization algorithm, three steps are performed to compute the updated set of hypotheses from that available at the previous iteration t − 1, and the corresponding estimation of the user's position and orientation. These steps are described in the following sections.

Figure 3.24: Each particle is modeled by an ellipse and a circle, representing the shoulders and the head, parametrized as in eq. (3.16).

Particle filtering

In the first step, each particle st−1 of the previous iteration is propagated according to a simple dynamic model. In particular, the evolution st is computed as

st = st−1 + wt−1, (3.17)

where wt−1 is a multivariate Gaussian random variable that models the user's motion in the time interval between two iterations.
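In code, this propagation step is a single line of additive Gaussian noise per particle; the standard deviations below are only placeholders for the motion model of eq. (3.17):

```python
import numpy as np

def propagate_particles(particles, sigma=(5.0, 5.0, 0.1, 2.0, 2.0)):
    """Propagate each particle s = (x, y, beta, cx, cy) with additive
    zero-mean Gaussian noise, as in eq. (3.17). sigma holds the (assumed)
    standard deviations of the motion model for each state component."""
    noise = np.random.normal(0.0, sigma, size=particles.shape)
    return particles + noise
```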

Color-based observation

To test the probability of the evolved particle st being a good hypothesis, the current image is observed. A color histogram p is computed for the shoulder region and another histogram p′ for the head region. The shoulder region is defined as all the pixels inside the ellipse, but excluding the pixels that are inside the head circle. The head region is defined as all the pixels inside the head circle.

The following procedure refers to the histogram p, and is repeated in the same way for the histogram p′.



Each pixel has 3 color channels r, g, b (red, green, and blue) with values between 0 and 255. Each channel is divided into n = 8 bins, giving a total of m = n³ = 512 bins. A pixel with color rgb is assigned to a bin p[r′g′b′] as follows,

r' = \lfloor r \cdot n/256 \rfloor \qquad (3.18)
g' = \lfloor g \cdot n/256 \rfloor \qquad (3.19)
b' = \lfloor b \cdot n/256 \rfloor \qquad (3.20)

where r′, g′ and b′ are the indices of the bin in a three-dimensional matrix. Using equations (3.18) to (3.20), each pixel is assigned to a bin p[r′g′b′], which is then incremented by a proper pixel weight w as

p[r'g'b'] = p[r'g'b'] + w. \qquad (3.21)

The weight w is computed so as to increase the reliability of the color distribution when boundary pixels belong to the background or get occluded. In particular, smaller weights are assigned to pixels that are further away from the region center, according to

w = 1 − r²,

where r is the relative (i.e. scaled by HM) distance between the pixel and the center of the ellipse (thus, 0 ≤ r ≤ 1).

In order to speed up the tracker, the number of pixels that are evaluated to build the histogram p is reduced. In fact, a random sampling is made of the pixels that lie inside the ellipse (shoulder) and circle (head) regions. This random sampling is fixed for the entire run of the tracker. When calculating the color histogram only the sampled pixels are evaluated. This not only benefits the speed, but also makes the number of evaluated pixels independent of the size of the ellipse, making the computation time constant.
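A sketch of this weighted histogram computation (eqs. 3.18-3.21), evaluating only a fixed set of pre-sampled points; rotating the sampled offsets into the current ellipse orientation is assumed to have been done by the caller, and the names are illustrative:

```python
import numpy as np

def color_histogram(image, cx, cy, H_M, sample_points, n_bins=8):
    """Build a weighted RGB histogram from a fixed random sampling of
    points inside a region. image: H x W x 3 uint8 (RGB);
    sample_points: (P x 2) offsets (px, py) relative to the region center.
    Pixels further from the center get smaller weights, w = 1 - r^2."""
    hist = np.zeros((n_bins, n_bins, n_bins))
    h, w = image.shape[:2]
    for px, py in sample_points:
        x, y = int(round(cx + px)), int(round(cy + py))
        if not (0 <= x < w and 0 <= y < h):
            continue
        r2 = (px * px + py * py) / (H_M * H_M)   # squared distance scaled by the major axis
        weight = max(0.0, 1.0 - r2)              # w = 1 - r^2
        rr, gg, bb = image[y, x]
        ri = int(rr) * n_bins // 256             # bin indices, eqs. (3.18)-(3.20)
        gi = int(gg) * n_bins // 256
        bi = int(bb) * n_bins // 256
        hist[ri, gi, bi] += weight               # eq. (3.21)
    s = hist.sum()
    return hist / s if s > 0 else hist           # normalize to sum to 1
```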

The resulting histogram p is normalized to sum up to 1, and compared to a stored histogram or target model q (acquired at the tracker initialization) using the Bhattacharyya coefficient

\rho[p, q] = \sum_{u=1}^{m} \sqrt{p[u]\, q[u]}, \qquad (3.22)



with 0 ≤ ρ ≤ 1 and where u iterates through all the m bins of the histograms p and q. For the histogram p′ (and its target model q′), we compute similarly

\rho'[p', q'] = \sum_{u=1}^{m} \sqrt{p'[u]\, q'[u]}. \qquad (3.23)

The larger ρ (respectively, ρ′) is, the more similar the histograms p and q (or p′ and q′) are. Using equations (3.22) and (3.23), the distance between the particle and the target model is defined as

d = \sqrt{1 - \frac{\rho[p, q] + \rho'[p', q']}{2}}, \qquad (3.24)

which is called the Bhattacharyya distance [48]. This similarity measure provides a likelihood of each particle that will be used to update the particle set (see Sect. 3.3.3). In fact, each element s(j) of the particle set can be assigned a probability π(j) in terms of the observations (color histogram), namely

\pi^{(j)} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{d^2}{2\sigma^2}\right), \qquad (3.25)

where d is the Bhattacharyya distance in equation (3.24) and σ, the standard deviation of the Gaussian distribution, is a constant that can be fine-tuned. The π(j)'s are further normalized to sum up to 1. Thus, the normalized π(j), j = 1, . . . , N, define the discrete posterior probability distribution of the user's state.
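The observation step then reduces to a few lines; a sketch computing the two Bhattacharyya coefficients, the combined distance of eq. (3.24) and the (still unnormalized) particle probability of eq. (3.25):

```python
import numpy as np

def particle_likelihood(p, q, p_head, q_head, sigma=0.2):
    """Bhattacharyya coefficients for the shoulder (p, q) and head
    (p_head, q_head) histograms, combined distance and Gaussian likelihood.
    sigma is the tunable constant of eq. (3.25)."""
    rho_shoulder = np.sum(np.sqrt(p * q))            # eq. (3.22)
    rho_head = np.sum(np.sqrt(p_head * q_head))      # eq. (3.23)
    d = np.sqrt(1.0 - (rho_shoulder + rho_head) / 2.0)   # eq. (3.24)
    return np.exp(-d * d / (2.0 * sigma * sigma)) / (np.sqrt(2.0 * np.pi) * sigma)
```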

Updating the particle set and tracker state

The final step of the generic tracker iteration is devoted to the generation of the new set of hypotheses (to be used at the next iteration), and to updating the tracker's state (the new estimation of the user's position and orientation).



The new set of N particles is drawn from the current set by choosing the generic particle s(j) with probability¹ π(j). The particles are drawn with replacement, so that those with higher probability are in general selected several times. Of course, the same hypothesis will propagate in a different way when eq. (3.17) is applied at the next iteration.

Finally, the updated tracker state is computed as a weighted mean over all the current particles, weighted by their Bhattacharyya distance (3.24) to the target model.
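A compact numpy sketch of this resampling and state update; as a simplification, the normalized particle probabilities are used directly as weights for the state estimate, and in practice the angle component β would need circular averaging:

```python
import numpy as np

def resample_and_estimate(particles, probs):
    """Draw a new particle set with replacement (probability proportional
    to probs) and estimate the tracker state as a weighted mean of the
    current particles."""
    probs = probs / probs.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=probs)
    state = np.average(particles, axis=0, weights=probs)
    return particles[idx].copy(), state
```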

Random sampling

The original tracker in [47] requires N = 75 particles to correctly track an object. Given the increased number of degrees of freedom in the adapted model, our experiments have shown that a minimum of N = 500 particles is needed. This, however, slows down the tracker to 3 tracker iterations per second (3 Hz).

Considering the high resolution of the input image and the relative size of the ellipse, it seems that too many pixels are being evaluated for building the histograms, and the tracker would work as well with fewer pixels being evaluated. Thus, a random sampling is made of P = 500 random points within the ellipse region. These points are stored as a vector of (px, py) positions relative to the ellipse center (x, y) and angle of orientation β. The same is done for the head region. While updating the histograms, only the pixels corresponding to these points are evaluated. A typical snapshot of the tracker state for a human user in a room is shown in figure 3.25.

¹This is done by generating a uniformly distributed random number x ∈ [0, 1] and then finding the smallest j for which

x \le \sum_{k=1}^{j} \pi^{(k)}.



Figure 3.25: Example of the visual localization algorithm tracking a walking person, run on the same sequence: (a) the elliptical model without head region (3 degrees of freedom); and (b) the model with head region (5 degrees of freedom).



3.3.4 Discussion

In this section, we have presented a body pose recognition algorithm based on 3D hulls of the user. As the orientation of the hulls can be normalized, the described algorithm is rotation invariant. Some examples of inputs and classified poses are shown in figure 3.26.

In section 3.6.2, it will be shown that this method is more robust than the 2D approach described in section 3.2. However, the 3D hull-based approach requires more hardware for real-time implementation than the 2D approach. Whereas the 2D approach performs fairly well using just 3 cameras connected to 1 computer, the 3D approach requires more cameras, each connected to its own computer.

Note that the current orientation tracker requires initialization. However, if initialization is not desirable for the application, it is possible to use alternatives such as an orientation sensor, or to determine the greatest horizontal direction of variation in the hull. The latter works well, but limits the number of usable poses, as it cannot distinguish the front from the back of a person. Therefore, all poses need to be symmetrical, which is not ideal. This could be avoided by using other cues to determine which side of the person is in front, such as face detection from the cameras placed sideways. Another option is choosing a direct approach, such as the ones proposed by Cohen et al. [41], Weinland et al. [42] and Gond et al. [43], where a rotation-invariant 3D shape descriptor is used rather than normalization of the hulls. We chose to first normalize the hulls and then classify them for two reasons. First, we believe that higher performance and lower computation times are possible this way. Both the method by Ren et al. [3] and our method achieve very high classification rates in real-time by first determining the orientation and then classifying. Secondly, disconnecting the normalization from the classification allows the classification algorithm to be used for different classification problems as well, such as, for example, hand gestures, or classification in a space where the third dimension is a time dimension (similar to [49]). In this case, a different normalization step is required, but the classification algorithm remains mostly the same.



Figure 3.26: Examples of reconstruction and classification results (input, reconstruction, detected pose).



3.4 Hand Gesture Recognition

In this section, a novel hand gesture recognition system is introduced based on ANMM. The system is example-based, meaning that it matches the observation to predefined gestures stored in a database. It is real-time and does not require the use of special gloves or markers. The work described in this section was also presented in [50].

3.4.1 Background

A review of current model-based hand pose estimation methods is presented by Erol et al. [51]. The review states that currently, the only technology that satisfies the advanced requirements of hand-based input for human-computer interaction is glove-based sensing. In this section, however, we aim to provide hand-based input without the requirement of such markers, which is possible through an example-based approach.

An example-based gesture recognition system can be decomposed into two subsystems: detecting the hand location and recognizing the hand gesture/pose. For detecting the hand location, several Haarlet-based approaches exist [27, 26], where the Haarlets are trained with boosting algorithms. These approaches are similar to face detection [1]. However, while faces remain rather static in appearance, the hand configuration has over 26 degrees of freedom, and its appearance can change drastically. Therefore, detecting the hand location in a similar way to detecting the face location only works when the hand is visible in certain orientations and poses. Adding several orientations and poses, the hand is easily confused with clutter in the background, rendering the method useless.

To overcome this problem, we detect the hands based on color. In section 2.2, we proposed a hybrid between a histogram-based method and a Gaussian mixture model (GMM) based method. This results in a hybrid between an online trained and an offline trained model for skin color segmentation. Based on this segmentation, the hands are located and segmented.

Sato et al. [52] recognize the shape of a human hand using an eigenspace method. Depth data is projected to a lower-dimensional eigenspace in order to recognize the pose of the hand. The eigenspace is found as the principal eigenvectors of the covariance matrix of vectors taken from training data, also referred to as principal component analysis (PCA). Wang and Zhang [45] show that linear discriminant analysis (LDA) and average neighborhood margin maximization (ANMM) perform better than PCA at dimensionality reduction. In sections 3.2 and 3.3, we recognize full body pose based on LDA and ANMM respectively. In this section we recognize hand gestures/poses in a similar fashion.

3.4.2 Inputs

The goal of this section is to detect the hand gesture or pose, given an image of the hand. The locations of the user's hands are first determined as described in section 3.1.3. The poses of the hands are classified based on the cropped hand images. This is done using a classifier similar to the 2D body pose classifier described in section 3.2. As inputs for the classifier we have considered four options: the cropped grayscale image of the hand, a segmented version of this cropped image where the non-skin colored background is removed, a silhouette of the hand based on the skin color segmentation, or an edge map based on the silhouette of the hand. These four possibilities are illustrated in figure 3.27.

The benefit of using just the cropped image without segmentation (figure 3.27a) is that it is very robust against noisy segmentations. Large holes in the segmentation will not impact the performance of the classifier. However, heavily cluttered backgrounds can disturb the classifier. Using the image segmented based on skin color only, the background influence is eliminated. In this case the segmented region can either consist of the original pixel values (figure 3.27b), or white pixels forming a silhouette (figure 3.27c). Finally, the input image can be an edge map of the silhouette (figure 3.27d), which can be used, for example, for classification using the Hausdorff distance measure. However, the edge map is very sensitive to holes in the segmentation.

Figure 3.27: Possible inputs for the hand gesture classifier: (a, e) cropped, (b, f) segmented, (c, g) silhouette, (d, h) edge.

The different classifier input types are compared in the experiments section 3.6.4. Intuitively, if the segmentation is noisy, using the complete cropped image (figure 3.27a) will be best, while if the background is cluttered, a segmented image (figure 3.27b) will classify more accurately.

The cropped hand image can be classified using ANMM, similar to the 2D body pose recognition in section 3.2. However, we first describe a method which classifies the edge map (figure 3.27d) based on the Hausdorff distance. The performance of both methods is compared in the experiments section 3.6.3. It is shown that the ANMM-based approach performs better.

3.4.3 Hausdorff Distance

This method is based on the work by Sanchez-Nielsen et al. [53]. To consider only relevant data, the edge of the hand is extracted from the silhouette. An edge map is obtained by moving a 3 × 3 filter over the image. Examples of such edge maps are shown in figures 3.27d and 3.27h. To measure a distance between edge maps, we use the Hausdorff distance. The benefit of using the Hausdorff distance is that there is no need for an explicit spatial relationship between matching pixels. Formally, given two sets of points A and B, the Hausdorff distance is computed as,

H(A, B) = \max\big(h(A, B),\, h(B, A)\big) \qquad (3.26)

where,

h(A, B) = \max_{a \in A} \min_{b \in B} d(a, b) \qquad (3.27)

The function h(A, B) is called the directed Hausdorff distance from set A to B. The Manhattan distance d(a, b) is used to determine the distance between two data points.
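A direct (brute-force) numpy sketch of equations (3.26) and (3.27) over two sets of edge-pixel coordinates:

```python
import numpy as np

def directed_hausdorff(A, B):
    """h(A, B): for each point of A, take the minimum Manhattan distance
    to B, then take the maximum over A. A, B: (n x 2) coordinate arrays."""
    dists = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)  # pairwise Manhattan distances
    return dists.min(axis=1).max()

def hausdorff(A, B):
    """Symmetric Hausdorff distance, eq. (3.26)."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```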

This method has a number of interesting properties. For example, the edge maps don't have to be perfectly aligned, as the method does not require an explicit spatial relation between the edge pixels. However, the method also has some limitations. It assumes a clean segmentation, and is sensitive to noise in the segmentation. It discards the grayscale and color information in the hand image, and therefore gestures with similar outlines become ambiguous. The method can only take one input training sample per gesture, so there is no generalization over multiple training samples. Furthermore, the amount of wrist/arm that is visible in the segmentation impacts the distance measure dramatically, while these pixels should be irrelevant. Finally, the method is rather slow, as the computation time of the Hausdorff distance increases with the resolution and with the number of poses. All of these limitations are lifted by using the ANMM-based approach described in the next section.



3.4.4 Average Neighborhood Margin Maximization (ANMM)

The classifier used in this section is similar to the 2D body pose classifier described in section 3.2. The main difference is that the training and input data used for the classifier are not limited to binary silhouettes, but can also consist of a grayscale image of the hand (including background) or a segmented grayscale image of the hand (no background).

Figure 3.28: Basic classifier structure. The input samples (hand images) are projected with ANMM transformation Wopt onto a lower dimensional space, and the resulting coefficients are matched to poses in the database using nearest neighbors (NN).

To extract the hand image from the input image, the hand detection algorithm described in section 3.1.3 is used. A rectangular region is cropped from the input image (as in figure 3.27a), based on the boundaries of the skin color segmentation of the hand. In figure 3.28 the classifier structure is shown. The input hand images are projected onto a lower dimensional space using transformation Wopt. Subsequently, the resulting coefficients are matched to a database of hand gestures.

The transformation Wopt is trained using ANMM as described in section 3.2.3. However, instead of using strictly black and white silhouettes, the input can be any of the inputs shown in figure 3.27.

Page 99: Visual Body Pose Analysis for Human-Computer - MVDB Live

3.4. Hand Gesture Recogntion 77

For optimal classification performance, the classification is based on a concatenation of the cropped image (figure 3.27a) and the silhouette (figure 3.27c). This way, both the appearance and the shape are explicitly modeled in the classifier. This is shown in further detail in the experiments in section 3.6.3. Examples of the method classifying some gestures using the combined cropped image and silhouette classifier are shown in figure 3.29.
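A sketch of how such a combined input vector could be assembled, assuming a grayscale frame, a skin mask and the hand bounding box from section 3.1.3; the 24 × 24 resolution and the function name are illustrative choices:

```python
import cv2
import numpy as np

def hand_input_vector(gray, skin_mask, hand_box, size=24):
    """Build the classifier input as a concatenation of the cropped
    grayscale hand image and its skin-color silhouette."""
    x, y, w, h = hand_box
    crop = cv2.resize(gray[y:y + h, x:x + w], (size, size))
    sil = cv2.resize(skin_mask[y:y + h, x:x + w], (size, size),
                     interpolation=cv2.INTER_NEAREST)
    return np.concatenate([crop.reshape(-1), sil.reshape(-1)]).astype(np.float32) / 255.0
```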

Figure 3.29: Examples of the classifier detecting different hand gestures. An open hand is marked in red (a), a fist is marked in green (b) and a pointing hand is marked in blue (c).

3.4.5 Discussion

In this section, we presented how the 2D classification approach described in section 3.2 can be applied to hand gesture recognition. The resulting classifier is fast and robust, and the hand does not have to be aligned perfectly in order for the system to properly classify the pose. This means that it allows for noisy or incomplete segmentations of the hand. The localization of the hands is based on a coarse skin color detection as described in section 2.2.

The method has been tested in experiments in section 3.6.3. It is compared to the method based on the Hausdorff distance. Furthermore, different input types for the classifier have been tested, and it is shown that the best input is a combination of the silhouette of the hand and its grayscale image.



A hand gesture classifier based on the 3D approach described in section 3.3 has not been implemented, because the resolution of the reconstructed hulls is insufficient to detect any hand details. If several cameras were aimed directly at and zoomed onto the hands of the user, a 3D reconstruction could be feasible. This would require a lot of cameras with varied pose angles with respect to the user (as the user will occlude the hand from most cameras), and either the hand would have to stay in the set working volume, or the cameras would have to tilt and rotate to be able to follow it, while staying calibrated during this process. The 3D approach has therefore not been further explored for hand gestures.



3.5 Haarlet Approximation

In this section we aim to improve the speed of the classifiers described in sections 3.2, 3.3 and 3.4 with the use of Haar-like wavelet features, or Haarlets for brevity. The work described in this section was also presented in [31, 32, 46].

Computing the ANMM transformation Wopt as shown in figure 3.15 can be computationally demanding, especially if there are a lot of ANMM eigenvectors. In order to improve the speed of the system, the transformation Wopt in the classifier can be approximated using Haarlets, as shown in figure 3.30. In this case the transformation Wopt is approximated by a linear combination Lapprox of Haarlets. An optimal set of Haarlets is selected during the training stage. Computing this set of features on the input image results in a number of coefficients. Transforming these coefficients with Lapprox results in an approximation of the coefficients that would have resulted from the transformation Wopt on the same input data, which subsequently can be used for classification in the same manner as in the pure ANMM case. Because of their speed of computation, Haarlets are very popular for real-time object detection and real-time classification. This ANMM approximation approach provides a new and powerful method for selecting or training Haarlets, especially in the 3D case where, as noted by Ke et al. [49], existing methods fail because of the large number of candidate Haarlets. This approach makes it possible to train 3D Haarlets selected from the full set of candidates.

3.5.1 2D Haarlets

Papageorgiou et al. [54] proposed a framework for object detection based on Haarlets, which can be computed with a minimum of memory accesses and CPU operations using the integral image. Viola and Jones [1] used AdaBoost to select suitable Haarlets for object detection. The same approach was used for pose recognition by Ren et al. [3]. In our approach similar Haarlets are used, but we introduce a new selection process based on ANMM. The Haarlets are selected to approximate Wopt as a linear combination. The particular set of Haarlets used here was carefully selected by Lienhart and Maydt [55] and is shown in figure 3.31. Besides the feature type, the other parameters are width, height and position in the image. All combinations are considered. At a resolution of 24 × 24 pixels and using 3 camera views, this results in over a million candidate Haarlets.

Figure 3.30: Classifier structure illustrating the Haarlet approximation. The pre-trained set of Haarlets is computed on the input sample (grayscale image, silhouette or 3D hull). The approximated ANMM coefficients (l) are computed as a linear combination Lapprox of the Haarlet coefficients (h). The contents of the dotted line box constitute an approximation of Wopt in figure 3.15.

Figure 3.31: The set of possible 2D Haarlet types.
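The speed of Haarlets comes from the integral image: once it is computed, the response of any rectangular feature costs only a handful of array references. A minimal sketch, in which the rectangle parametrization and the feature definition are illustrative:

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, padded with a zero row and
    column so that box sums can be read with four array references."""
    ii = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y),
    width w and height h, using the padded integral image."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haarlet_response(ii, pos_boxes, neg_boxes):
    """Response of a Haarlet defined as a set of positive (white) and
    negative (black) rectangles."""
    return (sum(box_sum(ii, *b) for b in pos_boxes)
            - sum(box_sum(ii, *b) for b in neg_boxes))
```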



3.5.2 Training

The best Haarlets are obtained from the set of candidate Haarlets by convolving each Haarlet with the vectors in Wopt and selecting those with the highest coefficients, i.e. the highest response magnitudes. This score is found for each candidate Haarlet by calculating the dot product of that Haarlet with each ANMM vector (each row in Wopt), and calculating the weighted sum using the weights of those ANMM vectors, as stored in the diagonal matrix D (i.e. the eigenvalues serve as weights). Thus, the entire ANMM eigenspace is approximated as a whole, giving dimensions with a higher weight higher priority when selecting Haarlets. This dot product can be computed very efficiently using the integral image.

Most selected Haarlets will be redundant unless Wopt is adapted after each new Haarlet is selected, before choosing the next one. Let H be a matrix containing the already selected Haarlets in vector form, where each row of H is a Haarlet. H can be regarded as a basis that spans the feature space represented by the Haarlet vectors selected so far. Basically we do not want the next selected Haarlet to be in the space that is already represented by H. Let N be a basis of the null space of H,

N = null(H). \qquad (3.28)

N forms a basis that spans everything that is not yet described by H. To obtain the new optimal transformation we project D · Wopt onto N, where D is the diagonal matrix containing the weights of the eigenvectors wi in Wopt.

D' \cdot W'_{opt} = D \cdot W_{opt} \cdot N \cdot N^T, \qquad (3.29)

or,

W'_{opt} = D'^{-1} \cdot D \cdot W_{opt} \cdot N \cdot N^T, \qquad (3.30)

where D′ is a diagonal matrix containing the new weights λ′i of the new eigenvectors vi in W′opt,

\lambda'_i = \|\lambda_i \cdot v_i \cdot N \cdot N^T\|. \qquad (3.31)




Figure 3.32: The top figure shows one ANMM vector, with the overhead, profile and frontal views side by side. The bottom figure shows the Haarlet approximation of this ANMM vector, using the 10 best Haarlets selected to approximate Wopt. It can be seen how the Haarlets look for arms and legs in certain areas of the image.

Every time a new Haarlet is selected based on Wopt, H is updated accordingly and the whole process is iterated until the desired number of Haarlets is selected. Examples of selected Haarlets are shown in figure 3.32.
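The full selection loop of equations (3.28)-(3.31) can be summarized in a short linear-algebra sketch. The Python/NumPy code below is our own illustration, with hypothetical names: it scores every candidate by its weighted response magnitude, picks the best one, and then projects the weighted ANMM vectors onto the null space of the Haarlets selected so far. In the actual system the dot products would be evaluated through integral images rather than explicit mask vectors.

import numpy as np
from scipy.linalg import null_space

def select_haarlets(W_opt, weights, candidates, n_select):
    """Greedy ANMM-approximation Haarlet selection (sketch).

    W_opt      : (m, d) ANMM eigenvectors, one per row
    weights    : (m,)   eigenvalues lambda_i (the diagonal of D)
    candidates : (n_cand, d) candidate Haarlet masks, flattened
    n_select   : number of Haarlets to select
    """
    W, lam = W_opt.copy(), weights.copy()
    selected = []
    for _ in range(n_select):
        # Weighted response magnitude of every candidate w.r.t. the
        # (remaining) weighted ANMM eigenspace.
        scores = np.abs(candidates @ W.T) @ lam
        selected.append(int(np.argmax(scores)))

        # H holds the Haarlets chosen so far, one per row (eq. 3.28).
        H = candidates[selected]
        N = null_space(H)                        # basis of null(H)
        P = N @ N.T                              # projector onto null(H)
        W_proj = (weights[:, None] * W_opt) @ P  # D * Wopt * N * N^T   (eq. 3.29)
        lam = np.linalg.norm(W_proj, axis=1)     # new weights lambda'_i (eq. 3.31)
        nonzero = lam > 1e-12
        W = np.zeros_like(W_opt)
        W[nonzero] = W_proj[nonzero] / lam[nonzero][:, None]  # rows of W'opt (eq. 3.30)
    return selected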

3.5.3 Classification

After the ANMM vectors have been computed and the Haarlets have been selected to approximate them, the next step is to actually classify new silhouettes. This process uses the Haarlets to extract coefficients from the normalized silhouette image, and then computes a linear combination of these coefficients to approximate the coefficients that would result from the ANMM transformation. An example of such an approximated ANMM feature vector is shown in figure 3.32. The resulting coefficients can be used to classify the pose of the silhouette.

Applying the ANMM transformation Wopt to the input image results in a vector of ANMM coefficients l. Instead of applying the ANMM transformation, it is much faster to compute the set of Haarlets on the input image, which results in a vector of Haarlet coefficients h. Using these Haarlets, the ANMM coefficients l can be approximated.

Given the coefficients h extracted with the Haarlets, the approximated ANMM coefficients l can be computed as

l = Lapprox · h, (3.32)

where Lapprox is an m × n matrix, where m is the number of ANMM eigenvectors and n is the number of Haarlets used for the approximation. Lapprox can be obtained as the least squares solution to the system

Wopt = Lapprox · H^T. (3.33)

The least squares solution to this problem yields

Lapprox = Wopt · ((H^T H)^−1 H^T)^T. (3.34)

Lapprox provides a linear transformation of the feature coefficients h to a typically smaller number of ANMM coefficients l. This allows for the samples to be classified directly based on these ANMM coefficients, whereas an AdaBoost-based method needs to be complemented with a detector cascade [1], or with a hashing function [40, 3].

Finally, using nearest neighbors search, the new silhouettes are matched to the stored examples, i.e., the mean coefficients of each class.
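As a concrete illustration, the classification path of equations (3.32)-(3.34) reduces to a least-squares fit followed by two matrix products and a nearest-neighbor search. The following Python/NumPy sketch uses our own names and assumes one flattened Haarlet mask per row of H for clarity; the real system reads the Haarlet coefficients off the integral image or integral volume instead.

import numpy as np

def train_lapprox(W_opt, H):
    """Least-squares fit of Lapprox so that Lapprox applied to the
    Haarlet responses approximates the ANMM transformation
    (cf. eqs. 3.33-3.34). W_opt is (m, d); H is (n, d)."""
    return W_opt @ np.linalg.pinv(H)               # (m, n)

def classify(sample, H, L_approx, class_means):
    """Classify one normalized silhouette / hull, flattened to d values.
    class_means holds the mean ANMM coefficients of each pose class."""
    h = H @ sample                                  # Haarlet coefficients
    l = L_approx @ h                                # approximated ANMM coefficients (eq. 3.32)
    return int(np.argmin(np.linalg.norm(class_means - l, axis=1)))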


Figure 3.33: The sum of the voxels within the gray cuboid can be computed with 8 array references. If A, B, C, D, E, F, G and H are the integral volume values at the shown locations, the sum can be computed as (B + C + E + H) − (A + D + F + G).

3.5.4 3D Haarlets

The concepts of an integral image and Haarlets can be extended to three dimensions. The 3D integral image or integral volume is defined as

ii(x, y, z) = ∑_{x′≤x, y′≤y, z′≤z} i(x′, y′, z′). (3.35)

Using the integral volume, any rectangular box sum can be computed in 8 array references as shown in figure 3.33. Accordingly, the integral volume makes it possible to construct volumetric box features similar to the 2D Haarlets. We introduce the 3D Haarlet set as illustrated in figure 3.34, simply extended from the 2D Haarlets in figure 3.31.
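A compact way to see equation (3.35) and the 8-reference box sum of figure 3.33 is the following NumPy sketch (our own code). The integral volume is zero-padded so that an index of 0 stands for the empty prefix, and the inclusion-exclusion pattern corresponds to the corner sum in the figure, with differently labelled corners.

import numpy as np

def integral_volume(vol):
    """3D integral image (eq. 3.35), zero-padded so that ii[x, y, z] is
    the sum of vol over all voxels with x' < x, y' < y, z' < z."""
    ii = vol.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)
    return np.pad(ii, ((1, 0), (1, 0), (1, 0)))

def box_sum(ii, x0, y0, z0, x1, y1, z1):
    """Sum of vol[x0:x1, y0:y1, z0:z1] from 8 reads of the padded
    integral volume (the inclusion-exclusion pattern of figure 3.33)."""
    return (  ii[x1, y1, z1] - ii[x0, y1, z1] - ii[x1, y0, z1] - ii[x1, y1, z0]
            + ii[x0, y0, z1] + ii[x0, y1, z0] + ii[x1, y0, z0] - ii[x0, y0, z0])

# Example check:
#   vol = np.random.rand(24, 24, 24)
#   ii = integral_volume(vol)
#   box_sum(ii, 2, 3, 4, 10, 12, 20) matches vol[2:10, 3:12, 4:20].sum()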

The 3D Haarlets are used to approximate the 3D ANMM eigenvectors in Wopt. The Haarlet selection and classification processes are identical to the 2D case as described in sections 3.5.2 and 3.5.3 respectively. Examples of ANMM eigenvectors approximated with 3D Haarlets are shown in figure 3.35.

Figure 3.34: The proposed 3D Haarlets. The first 15 features are extruded versions of the original 2D Haarlets in all 3 directions, and the other 2 are true 3D center-surround features.


Figure 3.35: (a) Three example ANMM eigenvectors. (b) Approximation using 10 Haarlets. The first example shows how a feature is selected to inspect the legs; the last example shows a feature that distinguishes between the left and the right arm stretched forward.

3.5.5 Discussion

In this section, we presented a new algorithm to train Haarlets, resulting in a recognition system with lower memory requirements and computation time than the existing methods. The Haarlets are trained by approximating ANMM eigenfeatures.

For the 2D case, the approach performs better than AdaBoost for multi-class problems such as pose classification (section 3.6.2). Note that for a 2-class detection problem, such as face detection, the method cannot outperform an AdaBoost-based detection cascade (figure 3.3). AdaBoost excels at selecting 2-5 very discriminative Haarlets at the beginning, after which the subsequent Haarlets will be rather random. The ANMM approximation excels at selecting a group of Haarlets which are very discriminative when combined. It performs well when sufficient Haarlets are used to approximate the ANMM features fairly well, which usually means at least 10 features. Unlike AdaBoost, ANMM features discriminate between multiple classes, which is why the approximation lends itself so well to multi-class classification problems.

In this section we also introduced 3D Haarlets. For the 3D case, boosting algorithms are not able to select Haarlets from the full candidate set, due to memory restrictions and computation time [49]. The relaxed memory requirements of the ANMM approximation allow us to train from the full set of 3D Haarlets without a problem.

The Haarlet approximation offers a significant speed increase over using the pure ANMM features, which is especially apparent in the 3D case (section 3.6.2). Adding additional features has a negligible impact on computation time, making it possible to train for a large number of pose classes (50 in the experiments). The experiments also show that very few 3D Haarlets are needed for a reasonable classification of the hulls. The 3D Haarlets are able to capture a large amount of information, and enable very robust classification with very few CPU cycles.

The third dimension of the 3D Haarlets does not necessarily have to be a spatial dimension. In future work, it would be interesting to introduce a temporal dimension, for example to classify moving gestures.


3.6 Experiments

Three sets of experiments are described in this section. First, the pose recognition system is tested on training and test sets where the orientation of the user is fixed. The system is tested for both the 2D and the 3D case. Secondly, both the 2D and 3D pose recognition systems are tested on training and test sets where the user can rotate freely. Finally, the hand gesture recognition system is evaluated.

3.6.1 Body Pose Recognition: without rotation

This section describes the experiments related to full body pose recognition where the orientation of the subject around the vertical axis is fixed. We compare the ANMM-based approach to the standard LDA-based approach, and the 3D hull-based approach to the 2D silhouette-based approach.

Fixing the orientation of the user simplifies the classification problem. However, it allows for a direct comparison between the 2D and the 3D approach, as the silhouettes in the 2D approach cannot be normalized for orientation. The more difficult problem, where the orientation of the user is not fixed, is evaluated in section 3.6.2.

Experimental Setup and Data

The experiments are set up in an office scenario with a 4 × 4 meter working space for the user. 6 cameras are placed around the user, one of which is placed above the user as an overhead camera. For these experiments, 2000 training samples are recorded of a subject in different positions but always in the same orientation. Due to the office scenario with cluttered background, the segmentations are sometimes noisy. The samples are recorded from 6 cameras which are connected to 6 computers that run foreground-background segmentation on the recorded images. The silhouettes extracted from the foreground-background segmentations are normalized for size and position to a resolution of 24 × 24 pixels. 3D hulls are reconstructed and normalized for size and position to a resolution of 24 × 24 × 24 voxels. Validation is done using 2000 test samples recorded in a similar fashion. The 50 pose classes used in this experiment are shown in figure 3.36, and are made up of different arm and leg positions.

Figure 3.36: The 50 pose classes used in the body pose recognition experiments where the orientation of the user is fixed.

LDA and ANMM

In the first experiment, we compare ANMM to LDA for classifying 2D silhouettes over a large number of poses. Using up to 50 pose classes, the test silhouettes are classified using both the LDA-based and the ANMM-based approach. When using fewer than 50 pose classes, a random selection of pose classes is made, and the results are averaged over 5 random samplings of pose classes. The results are presented in figure 3.37. There is almost no difference in performance between LDA and ANMM in this experiment. There is a large amount of training data compared to the low dimensionality of the problem, and thus ANMM does not offer a significant advantage over LDA.

However, in the case of classifying 3D hulls, the number of dimensions (voxels) in the data is much higher. This is why, in the 3D case, the standard LDA algorithm suffers from the low sample size problem, as shown in figure 3.38. The performance of ANMM is clearly higher in the 3D case.

2D and 3D

In the following experiment, we compare the 3D hull-based classifier (section 3.3) to the 2D silhouette-based approach (section 3.2), where the orientation of the person is fixed. For both the 2D and the 3D approach, the classifiers are trained using ANMM and the Haarlet approximation. In figure 3.39 we show the performance for classification with different numbers of pose classes up to 50. When using all 50 pose classes, the 3D system achieves 98.5% correct classification, whereas the 2D system achieves 98.4% correct classification. There is virtually no difference in performance, considering that the orientation of the user is fixed. The advantage of the 3D approach, however, is that it can deal with a changing orientation, which will be shown in the next section.

Figure 3.37: Correct classification rates comparing LDA and ANMM for the classification of 2D silhouettes.

Figure 3.38: Correct classification rates comparing LDA and ANMM for the classification of 3D hulls.

Figure 3.39: Correct classification rates comparing classification based on 2D silhouettes and 3D hulls using ANMM approximation with Haarlets.


3.6.2 Body Pose Recognition: with rotation

This section describes the experiments related to full body pose recognition where the subject is allowed to rotate around the vertical axis. First, we compare the ANMM-based approach to the standard LDA-based approach. Then, the 3D hull-based approach is compared to the 2D silhouette-based approach. Finally, the Haarlet approximation is evaluated for speed, and the number of Haarlets needed to approximate the pure ANMM transformation is also assessed.

Experimental Setup and Data

The experiments are set up in the same office scenario with 6 cameras as in section 3.6.1. All the experiments in this section make use of images from all 6 cameras, unless stated otherwise. For these experiments, 2000 training samples are recorded of a subject in different positions and orientations. The samples are recorded from 6 cameras which are connected to 6 computers that run foreground-background segmentation on the recorded images. An overhead tracker (section 3.3.3) is run while recording the data, and the estimated body orientation is saved for each recorded sample. The silhouettes extracted from the foreground-background segmentations are normalized for size and position to a resolution of 24 × 24 pixels. 3D hulls are reconstructed and normalized for size, position and orientation to a resolution of 24 × 24 × 24 voxels. Validation is done using 4000 test samples recorded in a similar fashion. The 50 pose classes used in this experiment are shown in figure 3.40. The poses are made up of different arm positions, but not leg positions, because the legs are needed for walking around and changing the orientation. Thus, in the recorded data, the legs are constantly moving.

LDA and ANMM

In the first experiment we show that ANMM is indeed better than LDA at classifying 3D hulls over a large number of poses. Using up to 50 pose classes, the test hulls are classified using both the LDA-based and the ANMM-based approach. When using fewer than 50 pose classes, a random selection of pose classes is made, and the results are averaged over 5 random samplings of pose classes.

Figure 3.40: The 50 pose classes used in the body pose recognition experiments, where the user is allowed to rotate around the vertical axis.

The results are presented in figure 3.41. The ANMM-based approach is more consistent and maintains high correct classification rates of around 97% even when all 50 pose classes are used. The LDA-based approach drops down to 80% correct classification.

Figure 3.41: Correct classification rates comparing LDA and ANMM for the classification of 3D hulls.

2D and 3D

To test the 3D hull-based rotation invariant classifier (section 3.3), we compare it to the 2D silhouette-based approach (section 3.2). As a 2D silhouette-based classifier cannot classify the pose of a person with changing orientation, it is impossible to compare directly. Therefore, in the 2D case, a separate classifier is trained for each possible orientation, as described in section 3.2.5.


For both the 2D and the 3D approach, the classifiers are trained using ANMM and the Haarlet approximation. This allows us to quantify how much a 3D hull-based approach improves the performance. In figure 3.42 we show the performance for classification with different numbers of pose classes up to 50. When using all 50 pose classes, the 3D system achieves 97.5% correct classification, whereas the 2D system achieves 95.6% correct classification.

Figure 3.42: Correct classification rates comparing classification based on 2D silhouettes and 3D hulls using ANMM approximation with Haarlets.

Number of cameras

In the experiments above we make use of the images from all 6 cameras. In the 2D case, however, it is possible to use fewer cameras, with some impact on the performance. The difference in performance between using 3 and 6 cameras is shown in figure 3.43.


Figure 3.43: Correct classification rates comparing classification based on 2D silhouettes from 3 and 6 cameras respectively.

Haarlet Approximation

In this section we evaluate how many Haarlets are needed for a good ANMM approximation, and measure the speed improvement over using a pure ANMM approach. For this experiment, a 50-pose classifier is trained. The resulting classifier uses 44 ANMM eigenvectors, which can be approximated almost perfectly with 100 Haarlets. The number of Haarlets used determines how well the original ANMM transformation is approximated, as shown in figure 3.44. There is therefore no overfitting: after a certain number of Haarlets, the approximation delivers the same classification performance as the pure ANMM classification. Using this 3D ANMM-based approach the classifier achieves 97.5% correct classification on 50 pose classes using 100 Haarlets.

In figure 3.45, we also show the performance for a 2D silhouette-based classifier. In this 2D case, we compare the classification performance of (1) a classifier where the Haarlets are trained with the ANMM-based approach, and (2) a classifier where the Haarlets are trained with AdaBoost [1]. The ANMM-based approach has better performance, as the AdaBoost-based approach suffers from overfitting. Due to the memory constraints of AdaBoost it is not possible to apply it to 3D Haarlets [49].

Figure 3.44: Correct classification rates using up to 100 Haarlets for classification, comparing 2D and 3D Haarlets.

As shown in figure 3.46, the Haarlet approximated approach is many times faster than the pure ANMM approach. For the ANMM transformation, the computation time increases almost linearly with the number of pose classes. This is because increasing the number of pose classes increases the number of ANMM feature vectors. Using the ANMM approximation, the integral volume of the hull has to be computed once, after which computing additional Haarlet coefficients requires virtually no computation time relative to the time of computing the integral volume. Considering the processing time required for segmentation (5 ms, in parallel) and reconstruction (15 ms), the overall total processing time is then less than 25 ms per frame. The classification is performed on a standard 3 GHz computer.


Figure 3.45: Correct classification rates using up to 100 Haarlets for classification, comparing to AdaBoost.

Figure 3.46: Classification times in milliseconds for the pure ANMM classifier and the classifier using 100 3D Haarlets to approximate the ANMM transformation.


3.6.3 Hand Gesture Recognition

In this section we first show the improved performance of the ANMM-based method over the Hausdorff distance-based method. Secondly, we evaluate the different possible input types for the ANMM-based hand gesture classifier.

Experimental Setup and Data

This experiment is executed in an office scenario where the camera is placed on top of the computer screen of the user (similar to a webcam).

We have recorded 50 samples of each hand in 10 poses, for a total of 500 samples. Each sample is normalized to a resolution of 36 × 36 pixels. The samples have varying backgrounds, and the hand moves and twists. The segmentation is relatively poor. For each pose class, 10 samples are used for training, and 40 are used for classification. The 10 pose classes used in this experiment are shown in figure 3.47.

Figure 3.47: The 10 hand gestures used in this experiment.


ANMM and Hausdorff

We compare the ANMM-based hand gesture classifier to a classifier based on the Hausdorff distance. The Hausdorff distance is based on the edge of the silhouette of the hand, as shown in figure 3.48b. To enable the comparison, the ANMM-based classifier classifies binary silhouettes of the hand, as shown in figure 3.48a.

(a) silhouette (b) edge

Figure 3.48: (a) the input for the ANMM-based classifier and (b) the input for the Hausdorff distance-based classifier.

The results of this experiment are presented in figure 3.50. They show that the method based on the Hausdorff distance performs very poorly. The reason is that it expects (1) a near-perfect segmentation, with little variation in the shape of the hand (as it does not generalize), as well as (2) poses that are sufficiently different. The ANMM-based approach performs very well, also on noisy silhouettes. Furthermore, as shown in the next section, the ANMM-based approach can even take the cropped image, without segmentation, as input.


Classifier Input

The ANMM-based classifier can take different kinds of inputs: silhouettes, grayscale images, or a combination thereof. If silhouettes are used for classification (figure 3.49b), then the classification is based on shape. If a cropped grayscale image is used as input (figure 3.49a), then the classification is based on appearance. Both are relevant in detecting hand gestures, so it is useful to use both.

(a) cropped (b) silhouette (c) segmented (d) concatenated

Figure 3.49: Possible inputs for the hand gesture classifier.

To determine which input type is optimal for the ANMM-based classifier, we tested the performance based on four input types: cropped grayscale image, binary silhouette, segmented grayscale image, and a cropped grayscale image concatenated with the binary silhouette. The four input types are shown in figure 3.49. The last option is the concatenation of the first two.
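As a minimal sketch (our own names and value conventions), the four input variants could be assembled from a 36 × 36 cropped grayscale patch and its binary silhouette as follows:

import numpy as np

def build_input(cropped_gray, silhouette, mode="concatenated"):
    """Assemble the classifier input from a 36x36 cropped grayscale
    patch and its binary silhouette (cf. figure 3.49). Arrays are
    assumed to be float in [0, 1]; 'segmented' masks the grayscale
    patch with the silhouette, 'concatenated' stacks shape and
    appearance into one feature vector."""
    if mode == "cropped":
        return cropped_gray.ravel()
    if mode == "silhouette":
        return silhouette.ravel()
    if mode == "segmented":
        return (cropped_gray * silhouette).ravel()
    if mode == "concatenated":
        return np.concatenate([cropped_gray.ravel(), silhouette.ravel()])
    raise ValueError(mode)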

The results are presented in figure 3.51. They show that concatenating the appearance and the shape (figure 3.49d) consistently yields the highest performance. Using the concatenated input for the classifier, on 10 pose classes, it achieves 98.2% correct classification, as opposed to 97.2% and 97.5% for using only the cropped image or only the silhouette respectively.


Figure 3.50: Correct classification rates for the ANMM-based method and the Hausdorff distance-based method.

Figure 3.51: Correct classification rates for 2 to 10 pose classes using the ANMM-based method.


4 Applications

Chapters 2 and 3 described different building blocks that can be used to build interaction systems. In this chapter, those building blocks are used to build four real-world applications. The first application is a perceptive user interface, where the user can point at objects on a large screen and move them around on the screen. The emphasis of this application is on detecting the body parts and determining the 3D pointing direction. The second application is the CyberCarpet, a prototype platform which allows unconstrained locomotion of a walker in a virtual world. As this system is a prototype, the walker is replaced by a miniature robot. The vision part of this system consists of an overhead tracker which tracks the body position and orientation of the walker in real-time. The third application is the CyberWalk platform, a full-scale omni-directional treadmill which accommodates human walkers. Besides the position and orientation tracker, the vision part is completed with a full body pose recognition system. Key poses are detected to enable interaction with the virtual world the user is immersed in. The fourth application is a hand gesture interaction system. It detects hand gestures and movements for manipulating 3D objects or navigating through 3D models.


4.1 Perceptive User Interface (BlueC 2 project)

This section describes the building of a prototype of a functional perceptive user interface. The work described in this section was also presented in [56]. The system uses two cameras to detect the user's head and hands. It recognizes eyes, hand gestures and fingers. This information is interpreted for interaction with an interface on screen. This allows the user to point at objects on a screen, and manipulate them with hand gestures. This concept is illustrated in figure 4.1.

Figure 4.1: Pointing at the screen.

The system is a fully implemented and integrated user interface. It is written in C++ and achieves roughly 7.5 frames per second on a Pentium IV system. All methods and algorithms in the system are optimized for speed to assure the real-time character of the system. The system is robust and intuitive: any untrained user can use our demo interface, and it can be transposed to other applications or other environments.

4.1.1 Introduction

The decreasing cost of CPU cycles and the blending of computers into the background of our living environment inspire new user interfaces. This convergence of speed and ubiquity pushes for the introduction of machines and environments that 'see'. Vision allows us to detect a number of important cues like location, identity, attention, expression, emotion, pointing at objects and gesturing. In this work, we build a perceptive user interface based around a user pointing at objects and gesturing with his hands. It is possible to point at locations and objects in space. On top of this, robust gesture recognition is developed to perform actions on the selected objects.

There are few existing systems with similar objectives. One example is PFinder, developed by C. Wren, A. Azarbayejani and A. Pentland at the Massachusetts Institute of Technology (MIT), USA [57]. This system analyzes color and form to detect the head and the hands of the user. A body contour is extracted by comparing the current image with a stored background image (foreground-background segmentation). Subsequently a number of groups of pixels in the image are selected which are thought to be a head or a hand. The model is simple and allows a real-time implementation. Unfortunately it is not robust and precise enough to be used in a user interface. In other systems, attempts are made to build a more articulated and dynamic body model. Such a system was developed by R. Plankers and P. Fua at the Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland [58]. They built a body model from soft objects. The model is iteratively fitted to the camera images. This technique is very precise, but the system is too complex for a real-time implementation.

In this work, we tried to build a real-time system similar to PFinder, but also precise enough to allow a user to point at objects on a screen and to interpret the user's hand gestures. The first step is to build a detailed model to detect skin colored objects in the camera image. Using this information, eyes, fingers and hand gestures are detected. To allow the extraction of 3D coordinates of these features, the user is observed by two cameras rather than one. The system processes these coordinates and the recognized gestures to allow interaction with the computer screen.

4.1.2 System Overview

A large (3 × 2 meter) screen is placed in front of the user, with two cameras, one placed on each side of the screen, pointing diagonally at the user. In this system there are always two input images, one from the right camera and one from the left camera respectively. The system assumes that only one user is visible in the camera images, and that the user is interfacing with one hand.

An overview of the system is shown in figure 4.2. The Observer component detects the eye, hand and finger locations in the input images. These locations are pairs of 2D coordinates in the input camera images. The 3D geometry step converts these pairs of coordinates to 3D coordinates. A virtual line can then be drawn from the eyes, through the fingertip or hand, onto the screen, resulting in a 2D coordinate on the screen. The Observer component also recognizes hand gestures (such as pointing, dragging and clicking), which allows the user to point at and manipulate objects in the graphical user interface.


Figure 4.2: General overview of the system. The Observer component detects the eye, hand and finger locations in the input images, and recognizes hand gestures. The 3D geometry step draws a virtual line from the eyes through the fingertip onto the screen, which allows the user to point at and manipulate objects in the graphical user interface.


The Observer component is shown in more detail in figure 4.3. First, a fast skin color segmentation algorithm is run on both input images. Then, depending on the state, eye, finger and gesture detection is run. There are two states: pointing and gesturing. During the pointing state, eye and finger detection are run independently in each camera image, and the system expects to detect one (pointing) finger. If more than one finger is detected, or no fingers at all, the system switches to the gesturing state. In this state, eye detection is run independently in each camera image, and gesture recognition is run on both images at once. This means that the gesture recognition step takes a stereo image as input, and the output is the recognized hand gesture (pointing, dragging or clicking). If the pointing gesture is recognized, the Observer switches back to the pointing state.
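The state logic described above can be sketched as follows; the detector callables stand in for the methods of sections 3.1 and 3.4, and all names are our own:

def observer_step(state, left, right, detect_eyes, detect_fingers, recognize_gesture):
    """One Observer iteration (sketch of the state logic of figure 4.3).
    detect_eyes / detect_fingers / recognize_gesture are callables that
    stand in for the detectors of sections 3.1 and 3.4."""
    eyes = (detect_eyes(left), detect_eyes(right))
    if state == "pointing":
        fingers = (detect_fingers(left), detect_fingers(right))
        if all(len(f) == 1 for f in fingers):       # exactly one (pointing) finger
            return "pointing", eyes, fingers, None
        state = "gesturing"                         # zero or several fingers seen
    # Gesturing state: classify the stereo pair as pointing / dragging / clicking.
    gesture = recognize_gesture(left, right)
    if gesture == "pointing":
        state = "pointing"                          # hand points again: switch back
    return state, eyes, None, gesture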


Figure 4.3: Detailed overview of the Observer component. Depending on the state (pointing or gesturing), eye, finger and gesture detection is run based on the skin color segmented input images.


Skin Color Segmentation

The aim of the skin color segmentation is to detect the location of the head and the hand in the camera images. This is achieved by first running the foreground-background segmentation algorithm described in section 2.1, and then the skin color segmentation algorithm described in section 2.2. Including the foreground-background segmentation step significantly improves the robustness. When the head and the hand are detected based on skin color, they must be inside the foreground segmentation. Therefore all near-skin-colored objects in the background can be discarded. An example of using foreground-background segmentation and skin color segmentation to detect the head and the hand is shown in figure 4.4.

Figure 4.4: Example of detecting the hand and the face, using the foreground-background segmentation and the skin color segmentation.

The skin color segmentation used in this system is described in section 2.2. As the system has to be robust, a number of post-processing steps are used as described in section 2.2.4. These steps include median filtering to improve the segmentation accuracy, and connected components analysis to remove unwanted skin colored objects. Examples of the skin color segmentation before and after post-processing are shown in figure 4.5 and figure 4.6 respectively.

Figure 4.5: Result of skin color analysis without post-processing.

As the skin color segmentation must be run twice (once for each camera) and is a subsystem of many components that require a lot of processing, it has to be very fast in order to keep the entire system real-time. Therefore, a number of speed optimization methods are implemented as described in section 2.2.5. These optimizations include using a lookup table (LUT) for the Gaussian probabilities, and downsampling the input images before running the initial skin color segmentation. Detailed segmentation is then only run in regions where skin color was detected in the low resolution pass.
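A minimal sketch of these two optimizations is given below (our own code); the skin_prob callable stands in for the Gaussian skin color model of section 2.2, and the block-wise refinement is one possible way to restrict the detailed pass to coarse hits.

import numpy as np

def build_skin_lut(skin_prob, bins=32):
    """Tabulate a skin-probability model over a bins x bins quantization
    of a 2D chromaticity space (the LUT of section 2.2.5). skin_prob is
    any callable mapping (c1, c2) in [0, 1]^2 to a probability."""
    centers = (np.arange(bins) + 0.5) / bins
    return np.array([[skin_prob(c1, c2) for c2 in centers] for c1 in centers])

def segment_skin(chroma, lut, threshold=0.5, step=4):
    """Two-pass segmentation: a coarse pass on a downsampled grid, then
    full-resolution look-ups only inside blocks with a coarse hit.
    chroma is an (H, W, 2) array of chromaticity values in [0, 1]."""
    bins = lut.shape[0]
    idx = np.minimum((chroma * bins).astype(int), bins - 1)
    coarse = lut[idx[::step, ::step, 0], idx[::step, ::step, 1]] > threshold
    mask = np.zeros(chroma.shape[:2], dtype=bool)
    for by, bx in zip(*np.nonzero(coarse)):          # refine each coarse hit
        y0, x0 = by * step, bx * step
        block = idx[y0:y0 + step, x0:x0 + step]
        mask[y0:y0 + step, x0:x0 + step] = lut[block[..., 0], block[..., 1]] > threshold
    return mask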


Figure 4.6: Result of skin color analysis after post-processing.

Finger and Eye Detection

Given the skin color segmentation, the hand and the head are detected as the two largest skin colored components in the input images. The head is assumed to be the larger of the two. This is a simplified version of the approach described in section 3.1. However, it saves processing time by avoiding running a face detector, while proving robust enough in the camera setup of this system.

The eyes are detected using the eye detector described in section 3.1.2, and the fingers are detected as described in section 3.1.4. The detected number of fingers can be used to determine if the user is pointing (1 finger) or making a gesture (no fingers or more than 1 finger). Detecting the eyes and the fingertip will allow for computing the pointing direction as the line that connects the eyes through the fingertip of the pointing hand. Intersecting this line with the screen will give us the pointing location.


Hand Gesture Recognition

The hand gesture recognition system is described in section 3.4. In this application three gestures are trained: a pointing gesture, a dragging gesture and a clicking gesture. The pointing gesture is simply the hand pointing with one finger. This way the pointing gesture can also be detected by counting the fingers. The dragging and clicking gestures can be defined at will. In our experiments we chose to use a scissor gesture for dragging, and a gun gesture for clicking.

4.1.3 Calibration and 3D Extraction

In this section we will describe the camera and screen calibration, and how the pointing direction of the user (based on the location of the eyes and the pointing finger) is related to a coordinate on the screen.

Camera calibration

To make life easier for the user, we automatically determine the 3D positions of the cameras and estimate their calibration parameters. The goal is an automatic calibration that can be performed by the end user. This allows the user to move the cameras without requiring complicated calibration by technicians. The calibration of the internal and external camera parameters is described in detail in Appendix A.1.1.

The internal parameters of the cameras are assumed to be constant. They are determined offline with a semi-automatic calibration program and calibration object [59, 60]. The external calibration comprises the fundamental matrix F of the camera pair. This matrix can be calculated from correspondences. In the calibration step, the user's fingertip is detected in each of the camera images as the user points at the screen randomly. These detections can be used as correspondences, on which the RANSAC algorithm [61] is run to calculate the F-matrix. This matrix can then be used to calculate the camera matrices [62] to achieve full calibration of the cameras.
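Using OpenCV, this external calibration step could be sketched as follows. The fundamental matrix is estimated with RANSAC from at least 8 fingertip correspondences, and a projective camera pair is then built from it using the standard construction P = [I | 0], P′ = [[e′]× F | e′]; the metric upgrade with the known internal parameters (Appendix A.1.1) is not shown, and all names are our own.

import numpy as np
import cv2

def calibrate_from_pointing(pts_left, pts_right):
    """Estimate F from fingertip correspondences (N x 2 float arrays,
    N >= 8) and build a projective camera pair from it (sketch)."""
    F, inlier_mask = cv2.findFundamentalMat(pts_left, pts_right, cv2.FM_RANSAC)
    # Epipole e' in the right image satisfies F^T e' = 0.
    _, _, vt = np.linalg.svd(F.T)
    e2 = vt[-1]
    e2_cross = np.array([[0, -e2[2], e2[1]],
                         [e2[2], 0, -e2[0]],
                         [-e2[1], e2[0], 0]])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])     # left camera  [I | 0]
    P2 = np.hstack([e2_cross @ F, e2.reshape(3, 1)])  # right camera [[e']x F | e']
    return F, P1, P2, inlier_mask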

Pointing direction

First the pointing direction must be defined. There are several options. The first option is to extend the line defined by the pointing finger. However, in order to reconstruct a reasonably accurate pointing direction, the finger would have to be detected at an extremely high resolution. Indeed, the noise on the detected finger line would be amplified when the line is extended to intersect the screen.

Another option is to choose the pointing direction of the lower arm of the user. This would require very robust elbow detection in both cameras. Furthermore, most of the time the elbow is not visible in at least one camera view. Finally, these methods would not be accurate, as pointing naturally involves the eyes.

Therefore, we chose a third option: the line connecting the eyes and the fingertip of the pointing finger. The user is pointing at the screen position that he perceives as right behind his fingertip. This principle is illustrated in figure 4.1. This definition of the pointing direction is not only intuitive, but it also benefits from the very precise eye and finger detection described in section 3.1. In the remainder of this chapter, the coordinates of the eyes (the center point between the two eyes) and the fingertip of the pointing finger are written as Xo and Xv respectively. The pointing direction is then defined as Xv − Xo.

Screen calibration

The system needs to be able to determine the coordinate on screen at which the user is pointing. This coordinate is defined as the intersection of the pointing direction Xv − Xo and the screen plane. The screen, however, is an object in space that can vary in size and location. This requires a calibration step, for which we propose the following approach.


A number of points are displayed on screen in sequence for the user to point at. As shown in figure 4.7, the finger and eye coordinates are determined each time the user points at a point. Using these sets of correspondences (eyes, finger, point on screen) it is possible to fit the screen plane in 3D space. This fitting step is described in Appendix A.1.3.

Figure 4.7: Calibrating the screen by pointing at a sequence ofpoints.

Screen coordinate

The screen coordinate at which the user is pointing is defined as the intersection of the pointing direction, Xv − Xo, and the screen plane. The inputs are pairs (xo, xv) of eye and finger coordinates in both cameras. First, the 3D coordinates of the eye Xo and the fingertip Xv are needed. The relation between a 2D point x and the corresponding 3D point X is, in homogeneous coordinates,

x = PX (4.1)


where

P = K [R^t | −R^t C], x = [x, y, w]^t and X = [X, Y, Z, W]^t, (4.2)

with P the projection matrix of the camera. Introducing the projection matrices of both cameras and a 2D point into equation (4.1) yields

x = PX and x′ = P′X. (4.3)

From equation (4.3), a homogeneous system of 4 equations can be constructed. If Pi denotes row i of the projection matrix P of the left camera, and P′i row i of the projection matrix P′ of the right camera, then the system of equations is

\begin{bmatrix} w P_1 - x P_3 \\ w P_2 - y P_3 \\ w' P'_1 - x' P'_3 \\ w' P'_2 - y' P'_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} = 0, (4.4)

from which the points Xo and Xv can be calculated.

Then, a line must be constructed through these two points and intersected with the screen plane. A line through two points is defined as

L ↔ X = Xo + k(Xv −Xo) (4.5)

Together with equation (A.7) of the screen plane (Appendix A.1.3), this yields

\begin{bmatrix} k \\ x \\ y \end{bmatrix} = \begin{bmatrix} X_v - X_o & -U_x & -U_y \end{bmatrix}^{-1} (U_0 - X_o). (4.6)

This equation can be used to directly calculate the screen coordinate (x, y) based on the 2D coordinates of the eye and the fingertip.
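The triangulation of equation (4.4) and the intersection of equations (4.5)-(4.6) map directly onto a few lines of NumPy. The sketch below uses our own names, assumes inhomogeneous pixel coordinates (w = 1), and takes the screen plane parameterization U0 + x·Ux + y·Uy from Appendix A.1.3.

import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation of one point from two views (cf. eq. 4.4);
    x1, x2 are (x, y) pixel coordinates, P1, P2 the 3 x 4 camera matrices."""
    A = np.vstack([x1[0] * P1[2] - P1[0],
                   x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                      # inhomogeneous 3D point

def screen_coordinate(X_o, X_v, U0, Ux, Uy):
    """Intersect the pointing line X_o + k (X_v - X_o) with the screen
    plane U0 + x Ux + y Uy and return the screen coordinate (x, y)
    (cf. eqs. 4.5 and 4.6)."""
    M = np.column_stack([X_v - X_o, -Ux, -Uy])
    k, x, y = np.linalg.solve(M, U0 - X_o)
    return x, y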

4.1.4 User Interface

Until now we have described methods to detect skin, eyes, hands, fingers and gestures. Based on the coordinates of the eyes and the fingertip, we can determine which point on the screen the user is pointing at. It is now important to illustrate this functionality in a simple and useful interface. The goal is to build an environment where the user can point at objects (icons) to highlight them, grab them, drag them and activate them.

Figure 4.8: 3D representation: pointing at the screen.

The icons are read from PNG files. Each icon has three states: a normal state, a highlighted state, and a grabbed state. The icons appear on the screen in their normal state. When the user points at an icon, it grows and becomes highlighted. The user can then gesture to grab the icon, which goes to the grabbed state. These states focus the attention of the user and give useful feedback. Finally, the user can gesture to activate the icon. In our implementation, activating an icon opens a hyperlink to a new interface with new icons.

The icons, their locations on the screen and their hyperlinks are defined in an XML file. Figure 4.9 shows the interface, and the result of the user pointing at an icon.

4.1.5 Integrated Setup

Our setup consists of a large screen (3 × 2 meters), with two Sony DFW-VL500 cameras alongside it. The cameras are positioned near the border of the screen, in the middle of the vertical sides. The cameras are pointed directly at the volume in which the user stands. The cameras could in fact be positioned anywhere, as long as the eyes and the hands are visible inside the picture. The cameras are connected through a FireWire interface to two separate PCI FireWire cards inside a Pentium IV 2.53 GHz.

Both cameras continuously grab images. These images are immediately passed through the skin color analysis. Two RGB images with their binary skin images are fed to the inputs of the following subsystems. Depending on the state of the system (pointing or gesturing) the system will apply fingertip detection or gesture recognition. In both cases a coordinate will be calculated in 3D space. If the user is pointing, this coordinate is the tip of his pointing finger. Otherwise, if the user is gesturing, this coordinate is the center of gravity of his hand. The gesture recognition system analyzes the stereo images and returns the ID of the recognized gesture. The eye detector localizes the coordinates of both eyes. The center of these two points is transposed to 3D space. An imaginary 3D line is constructed based on these two coordinates (hand/finger and eyes). The intersection of this line with the 3D screen plane gives us the 2D coordinate on screen. The interface will process this 2D screen coordinate along with the ID of the recognized gesture. If the user points at an icon, this icon will be highlighted. The gestures are interpreted to perform actions on the highlighted icon.


The result of the integrated setup is shown in figure 4.9. The system achieves approximately 7.5 frames per second on a Pentium IV. The average pixel error is 15 pixels at a resolution of 1024 × 768. This means that the pixel error is about 2%. The icons in our system are 168 × 168 pixels in size, comfortably bigger than the pixel error.

4.1.6 Discussion

In this section we presented a real-time marker-less HCI system which allows for intuitive interaction with objects on a large display. As a prototype, the system is relatively fast. Nonetheless, the speed could be improved with additional optimizations, such as the ones described in section 4.4.

Furthermore, the prototype system was designed to work under fixed lighting conditions, meaning in a room with fixed lighting and no windows. The system can be improved by making the foreground-background segmentation and the skin color segmentation more robust to variable lighting conditions.


(a) pointing at an object

(b) pointing at an object

Figure 4.9: Using the perceptive user interface.


4.2 CyberCarpet (CyberWalk project)

Exploration of VR worlds by allowing omni-directional unconstrained locomotion possibilities for a walking user is an active area of research. The ultimate goal is to have the user fully immersed in a VR scene, free to walk in any direction with natural speed, while remaining within the limited physical area of the platform and without the need of wearing any constraining equipment (e.g., for tracking the walker position or for characterizing the gait). To support such locomotion, the platform must counteract the intentional motion of the walker in order to keep him/her in place without altering the impression of movement. The associated perceptual effects on the walker should be taken into account, in the form of input command constraints, so as to avoid unpleasant feelings.

The CyberCarpet is an actuated platform that allows for unconstrained locomotion of a walking user for Virtual Reality (VR) exploration. The platform consists of a linear treadmill covered by a ball-array carpet and mounted on a turntable, and is equipped with two actuating devices for linear and angular motion. The main control objective is to keep the walker close to the platform center in the most natural way, counteracting his/her voluntary motion, despite the fact that the system kinematics is subject to a nonholonomic constraint. A kinematic controller uses the linear and angular platform velocities as input commands, and feedback is based only on measurements of the walker's position obtained by a visual tracking system. The system is built within the CyberWalk project, and the work described in this section is also presented in [63].

4.2.1 Background

Different locomotion interfaces exist that allow walking in virtual environments (see, for instance, the surveys in [64] and [65]). In many of them, locomotion is restricted to a 1D motion on a linear treadmill, like in the Treadport platform [66], with possible slope inclusion [67]. The user is constrained by a harness that applies stabilizing forces and other virtual effects [68]. To allow for small/slow direction changes, the treadmill can be mounted on a turning table [69]. A different approach is taken in the CirculaFloor [70], where active moving tiles follow the feet motion. Again, the walker should avoid sharp turns and high speed. For unconstrained 2D walking, the Omnidirectional Treadmill has been proposed in [71] using two perpendicular belts and a large number of rollers, while a torus-shaped belt arrangement is implemented in the Torus Treadmill [72]. Both systems allow limited speed, mostly due to poor control design. Furthermore, the mechanical implementation is complex due to the large mass of the moving parts (with associated noise). This kind of problem is not present in passive devices like the Cybersphere [73] where, however, the perception of walking is altered by the inner curvature of the spherical floor. An alternative principle is used in [74], where a conveyor belt and a turntable transmit motion to a walker through a ball-array board, thus realizing a 2D planar treadmill. In [75], the ball-array lies on a concave surface without actuation, but instrumented with sensors to detect feet contact.

Within the CyberWalk project, two different motion concepts have been considered for unconstrained 2D walking on a plane: the belt-array CyberWalk platform, similar to [71, 72], and the ball-array CyberCarpet, similar to [74]. In both approaches, we aimed at eliminating the use of any physical constraints on the feet, body, or legs of the user, as well as avoiding the need of a priori identified models of the human gait/walk. The two platform concepts have been analyzed and refined in terms of user mobility, mechanical feasibility, and perceptual effects.

4.2.2 System Overview

The system consists of three main components: the physical platform, the visual tracker and the controller.

The locomotion platform uses a conveyor belt and a turntable to transmit translational and angular motion to the walker through a ball-array board. Rotating balls are fitted into the array board and are in contact with the belt, so that a user on the board moves in the direction opposite to the underlying point on the belt (see figure 4.12a). The walker is allowed to move in a natural way and indefinitely in any planar direction. The platform controller counteracts her/his motion by pulling the walker toward the center of the platform, while taking into account physiologically acceptable velocity/acceleration bounds. The body pose on the carpet is acquired through a markerless visual tracking system using an overhead camera.

An overview of the system architecture is shown in figure 4.10. An overhead camera is placed above the ball-bearing platform. A tracker on the vision PC uses the camera to track the position and orientation of the walker. A controller is run on the platform PC and receives the coordinates of the walker from the vision PC over a TCP/IP connection. Then, the controller sends the velocity commands over a serial bus to the platform hardware.

Figure 4.10: Overview of the system architecture of the CyberCarpet (position extraction and velocity commands run at a data rate of 10 Hz; in this prototype the walking user is replaced by a remote-controlled car).


Locomotion Platform

To validate the CyberCarpet concept, a small-scale prototype with a diameter of about 0.8 m has been designed and built (see figure 4.12b and c). While the limited platform dimension is indeed not appropriate for actual VR exploration by a human user, the whole system has been conceived keeping in mind the challenges of a full-size realization. For instance, the current mechanical structure can support the weight of a human user (about 100 kg). Further, the control design allows a simple scaling of the feedback gains according, e.g., to the platform size and locomotion speed of the walker.

The principle of the platform is illustrated in figure 4.11. The platform has two degrees of freedom: a rotational one (ϕ) derived from the turntable and a linear one (x) which is generated by the belt. These two motion vectors are added up by the balls and generate a velocity vector which recenters the person or object on the platform. The finished prototype platform is shown in figure 4.13.

Figure 4.11: Platform principle.

Figure 4.12: The CyberCarpet platform: (a) the motion transmission principle; (b) a drawing of the preliminary design (courtesy of the Max Planck Institute for Biological Cybernetics; German patent filed in 2005); (c) the final physical realization.

Figure 4.13: The experimental setup with the CyberCarpet, the mobile robot carrying a picture of a human body, and the overhead camera for visual tracking.

Visual Tracker

The visual tracking algorithm itself has been tested on human walkers and is robust with respect to posture changes of the walker (see figure 3.25). To prove the platform concept and the effectiveness of the control design, we report experimental results in which a top-view human picture has been mounted on a mobile robot and used as a mock-up to emulate the performance with a real user (see figure 4.13).

For the visual localization of the mobile robot, a camera is placed overhead, above the CyberCarpet, as shown in figure 4.13. The camera is a Sony DFW-VL500 Color/VGA camera with 640×480 pixels at 30 fps, placed at about 131 cm above the surface of the carpet. The differentially-driven mobile robot, carrying a picture of a human body on top, emulates the moving user.

The initial position of the robot is set by hand with two mouse clicks, one in the middle of the head and one on the edge of the shoulder. The tracker developed for the visual localization is described in section 3.3.3. It tracks the position (x, y) and the orientation (ϕ) of the walker on the CyberCarpet. A typical view, with superimposed localization ellipse/circle, is shown in figure 4.14.

One parameter has to be set in the visual tracker: the number of particles in the particle filter. More particles translate into more accurate tracking (less noise), but a lower frame rate. In this application, 500 particles are used, in which case the visual tracker runs at 15 Hz. This means that, 15 times per second, a timestamp and the position (x, y) of the walker are sent over a TCP/IP connection to the control algorithm on the platform PC.
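As an illustration of this interface, the following Python sketch streams the tracker output to the platform PC. The message format (ASCII, comma-separated, newline-terminated), the address and the helper get_walker_position() are assumptions made for the sketch and do not describe the actual protocol used in the project.

```python
import socket
import time

def get_walker_position():
    # Placeholder: in the real system this would query the particle-filter tracker.
    return 0.0, 0.0

def run_tracker_output(host="192.168.0.2", port=5000, rate_hz=15.0):
    """Send (timestamp, x, y) tracker estimates to the platform PC over TCP/IP.

    Host, port and message layout are illustrative assumptions; only the idea
    of streaming the tracker output at its refresh rate comes from the text."""
    sock = socket.create_connection((host, port))
    period = 1.0 / rate_hz
    try:
        while True:
            x, y = get_walker_position()
            msg = "%f,%f,%f\n" % (time.time(), x, y)
            sock.sendall(msg.encode("ascii"))
            time.sleep(period)
    finally:
        sock.close()
```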

Controller

Through the ball-array surface of the CyberCarpet, any actuated motion of the underlying belt results in a reverse motion imposed on the walker on the platform, i.e. a forward motion command will move the user backwards and, due to the multiple contacts between the walker's feet and the ball-array carpet, a clockwise rotation will turn the user counterclockwise (see figure 4.12a). With this in mind it is possible to derive a kinematic model of the CyberCarpet.

Figure 4.14: A view from the overhead camera, with superimposed user localization. The white ellipse indicates the shoulder region of the walker, while the red circle indicates the head region and the white line the orientation.

Using this kinematic model, a control algorithm was designed at the velocity and acceleration level, so as to keep the user close to the platform center. The control algorithm was developed by the University of Rome; for the algorithm itself, we refer to De Luca et al. [63]. The control system architecture is illustrated in figure 4.15. The observer estimates the velocity of the walker from the coordinates provided by the visual tracker and from the angle and speed of the platform. The controller uses the offset from the center computed from the coordinates obtained by the visual tracker, the estimated velocity, and the angle of the conveyor belt to produce new motion commands for the platform. The orientation angle of the walker is not used in the controller.
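The actual controller is the one of De Luca et al. [63]; the Python fragment below is only a schematic illustration of the structure just described, combining a simple velocity observer (a low-pass filtered finite difference of the tracked position) with a proportional recentering term and a feedforward compensation of the estimated intentional velocity. The reduction of the desired velocity to a belt speed and a turntable speed, as well as all gains, are assumptions of the sketch and not the real nonholonomic control law.

```python
import math

class VelocityObserver:
    """Estimate the walker's intentional velocity by low-pass filtering finite
    differences of the tracked position (a simplifying assumption; the real
    observer also uses the platform angle and speed)."""
    def __init__(self, alpha=0.2, dt=0.1):
        self.alpha, self.dt = alpha, dt
        self.prev = None
        self.vx = self.vy = 0.0

    def update(self, x, y):
        if self.prev is not None:
            raw_vx = (x - self.prev[0]) / self.dt
            raw_vy = (y - self.prev[1]) / self.dt
            self.vx += self.alpha * (raw_vx - self.vx)
            self.vy += self.alpha * (raw_vy - self.vy)
        self.prev = (x, y)
        return self.vx, self.vy

def carpet_commands(x, y, vx_hat, vy_hat, belt_angle, k_p=0.8, k_phi=1.5):
    """Illustrative recentering law: pull toward the center, compensate the
    estimated intentional velocity, and reverse the sign because the ball
    array inverts the belt motion.  Gains and structure are assumptions."""
    des_x = -k_p * x - vx_hat
    des_y = -k_p * y - vy_hat
    des_angle = math.atan2(des_y, des_x)
    # belt speed: reversed projection of the desired velocity on the belt axis
    v_belt = -(des_x * math.cos(belt_angle) + des_y * math.sin(belt_angle))
    # turntable speed: proportional to the angular misalignment of the belt axis
    err = math.atan2(math.sin(des_angle - belt_angle),
                     math.cos(des_angle - belt_angle))
    omega_turntable = k_phi * err
    return v_belt, omega_turntable
```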

Figure 4.15: Control system architecture of the CyberCarpet

4.2.3 Experiments

The testing campaign involved four different user motions. The following scenarios were chosen:

A. User standing still off the origin;

B. User starting at the origin and moving along a straight line with a constant speed of about 0.22 m/s;

C. User starting at the origin and moving along a circular path of radius 0.35 m with a constant speed of about 0.14 m/s;

D. User starting at the origin and moving along a square path of side 0.4 m with a constant linear velocity of about 0.1 m/s, and turning at the corners with an angular velocity of about π/4 rad/s.

Furthermore, for each scenario we tested two variants of smooth velocity-level feedback control: (1) without compensation of the walker motion, and (2) with compensation of the walker motion using the velocity observer.

Standing still

In the first experiment, the user starts at rest from the absolute position (0.2 m, -0.05 m) and keeps an intentional zero velocity during the whole experiment. In this experiment there is no need for velocity compensation. In figure 4.16 a trajectory executed under the control law is shown, with a black triangle marker representing the starting position. The corresponding behavior of the linear and angular velocity commands is displayed in figure 4.17. The discrepancies are due to the presence of noise in the image processing step, to the discrete sampling of the measurements (10 Hz on the position (x, y)), and to the discrete sampling of the control output (10 Hz on the commanded platform velocities).

Moving at constant velocity

In the second experiment, we tested separately the control laws with and without the on-line velocity estimation and compensation strategy. Here, the user moves along a straight line with a constant speed of about 0.22 m/s.

Using the static feedback law, the absolute trajectory of the user is reported in figure 4.18. The starting point is at (0, 0.02 m) (black triangle) and motion proceeds in the negative y-direction. As expected, the static feedback is not able to fully compensate for a persistent intentional motion, but a steady state with non-zero position error is obtained after about 6 s. This behavior can be appreciated in figure 4.18, with the reached equilibrium position at about (0.05 m, -0.23 m) (this value depends on the chosen control gains). The linear and angular velocity commands are shown in figure 4.19. Note that, after an initial transient, the linear velocity command matches the user's intentional velocity, while the angular command is close to zero.

As shown in figure 4.20, in the case of the complete feedback/feedforward law (velocity compensation), the additional presence of the feedforward action based on the observer is able to fully recover the platform center despite the persistent intentional motion, similar to what could be obtained by an integral control action.

Figure 4.16: Absolute trajectory in the experiment where the walker is standing still.

Figure 4.17: Linear and angular velocity commands for the trajectory of figure 4.16.

Figure 4.18: Absolute trajectory in the experiment where the walker is moving at constant velocity, with static feedback control.

Figure 4.19: Linear and angular velocity commands for the trajectory of figure 4.18.

The user starts at (0.006 m, 0.022 m) (black triangle) and moves again in the negative y-direction as before. Figure 4.21 shows that, unlike in the previous case, after a transient phase the control law brings the user back to the origin.

Moving along a circular path

In this experiment, the user is moving along a circular path of radius 0.35 m with a constant speed of about 0.14 m/s and, as a consequence, with a constant angular velocity of about 0.4 rad/s. This test case is significantly different from the previous one, since the intentional velocity vector is continuously changing direction during motion: this leads to a more demanding task for the disturbance observer, which has to track a highly time-varying signal.

For the case of the static feedback law, the absolute trajectory of the user is shown in figure 4.22. As expected, the control law is able to partially compensate for the intentional motion, i.e., the user is kept within a distance of 0.2 m from the platform center, which is smaller than the radius of the circular path (0.35 m). The corresponding platform velocity commands are shown in figure 4.23. It is interesting to note that, after the initial transient, the platform linear and angular velocities match the actual linear (0.14 m/s) and angular (0.4 rad/s) velocities of the walker, confirming again that a steady-state condition has been reached. At about t = 23.5 s, the user stops its motion and is thus brought back to the center of the platform.

Despite the more challenging task for the observer, the complete feedback/feedforward control law is able to keep the user closer to the platform center than in the previous case. This can be checked by comparing the absolute trajectories in figures 4.22 and 4.24 for the two experiments. The velocity commands sent to the platform during the experiment are shown in figure 4.25. The estimated intentional speed oscillates around its nominal value of 0.14 m/s.

Figure 4.20: Absolute trajectory in the experiment where the walker is moving at constant velocity, with velocity compensation.

Figure 4.21: Linear and angular velocity commands for the trajectory of figure 4.20.

Figure 4.22: Absolute trajectory in the experiment where the walker moves along a circular path, with static feedback control.

Figure 4.23: Linear and angular velocity commands for the trajectory of figure 4.22.

Figure 4.24: Absolute trajectory in the experiment where the walker moves along a circular path, with velocity compensation.

Figure 4.25: Linear and angular velocity commands for the trajectory of figure 4.24.

Moving along a square path

In this last experiment, the user travels along a square path with a side of 0.4 m at a constant linear velocity of about 0.1 m/s. The absolute trajectory in the virtual world during the execution of the square path is shown in figure 4.26 (different colors are used for each side). The user starts from the initial position (black triangle) and moves to the left along the first side of the square. It then stops, turns 90° counterclockwise, and starts travelling along the second side, repeating the same sequence until tracing the complete square. The total motion time is approximately 24 s.

The mismatch between the actual trajectory in figure 4.26 and the ideal commanded square path is mainly due to slippage of the mobile vehicle during motion, visible in the corners of the square path. To a lesser extent it is due to the inaccurate execution of the commanded velocities, visible in the last edge, and also due to the noise of the visual tracker, visible along the straight edges of the square path.

Figure 4.26: Trajectory of the user in the virtual world while executing a square path.

The static feedback law is able to partially compensate for the intentional motion, keeping the user within a circle centered at the origin and with a radius of about 0.2 m (figure 4.27). The velocity commands sent to the platform during the execution of the motion are shown in figure 4.28.

The absolute motion for the case with velocity compensation is shown in figure 4.29, while the velocity commands sent to the platform are given in figure 4.30. The benefits of the estimation of the intentional velocity are not as evident as in the previous cases. In particular, the controlled motion of the user remains in an area around the platform center that is as wide as when using a pure feedback law. This is mainly due to the slow convergence of the velocity observer with respect to the duration of the motion along each side of the square.

Figure 4.27: Absolute trajectory in the experiment where the walker moves along a square path, with static feedback control.

Figure 4.28: Linear and angular velocity commands for the trajectory of figure 4.27.

Figure 4.29: Absolute trajectory in the experiment where the walker moves along a square path, with velocity compensation.

Figure 4.30: Linear and angular velocity commands for the trajectory of figure 4.29.

4.2.4 Discussion

In this section, we presented the design and implementation of the CyberCarpet, a novel concept of an actuated platform that supports user locomotion in an unconstrained way. The platform combines the linear mobility of a treadmill and the angular mobility of a turntable, and uses a ball-array carpet to transmit these motions to the walker. Despite the presence of a nonholonomic constraint on the instantaneous system velocities, the controller is able to keep a freely walking user close to the platform center in a natural way. The absolute walker position on the platform is obtained from an overhead camera, with a visual localization algorithm based on particle filters which is robust to potential changes of the user's posture.

To validate the CyberCarpet principle and to test the actual performance of the motion control law, a small-scale prototype has been built. Using this setup, and the proposed visual localization algorithm, a variety of control experiments have been conducted with a mobile vehicle as a mock-up of a real walking user.

As far as the vision part of the project is concerned, the main limitation lies in the refresh rate of the tracker (15 Hz), which limits the speed at which the controller can respond to movements of the walker. The refresh rate could be increased to the frame rate of the camera (30 Hz) by reducing the number of particles (which would increase the noise), or by using faster hardware. Higher refresh rates (e.g. 100 Hz or more) would require the use of special hardware and markers.

Furthermore, no measurement of the walker's orientation has been used in the control law, thus preventing any prediction of intentional turns and delaying the re-centering of the walker on the platform. Nonetheless, in no case did the walker dangerously approach the platform border, proving the effectiveness of the presented control approach up to velocities of 0.25 m/s, for a grid size of 0.8 m.

Having proven the feasibility of the CyberCarpet concept in all its components, the next step would be the construction of a full-scale device, one that could let a human user walk at normal speed while being immersed in a Virtual Reality environment. Our parallel experience with the other 2D omni-directional platform developed within the CyberWalk project suggests that the compact mechanical principle underlying the ball-array platform may still be a convenient choice in terms of weight and needed power. However, due to hardware constraints, the size of the platform was limited to 0.8 m in diameter, and increasing the size of the CyberCarpet to accommodate a human walker was deemed infeasible due to mechanical constraints. The other platform developed within the CyberWalk project, the omni-directional treadmill, is discussed in the next section.

4.3 Omni-directional Treadmill (CyberWalk project)

As part of the CyberWalk project, a second platform was developed, based on an array of conveyor belts. This omni-directional treadmill looks like the oversized chain of a horizontal escalator. However, instead of elongated metal plates that move from front to back, the chain links are made up of treadmills that, in turn, move at right angles with respect to the track chain.

The final construction was built in Tuebingen, in a building called the Cyberneum. The purpose of the omni-directional treadmill is to research the human senses, to find out how the brain combines hearing, seeing and touching, and how it uses these senses to make rational decisions in order to take action.

Two contributions of this thesis related to human-computer interaction are part of the omni-directional treadmill. The original work packages for the CyberWalk project envisioned three levels of visual tracking of the walker:

1. Head tracking to steer the head mounted display (HMD);

2. Position and orientation tracking as input for the controller to keep the walker in the center of the platform;

3. Full body tracking for interaction with the virtual world.

The first level, the head tracking, is done using a Vicon motion capture system¹, as a marker-less tracker cannot provide the refresh rate needed for a realistic visualization through the HMD. The Vicon system makes use of special reflective markers, which are placed on a safety helmet that must be worn by the walker.

The second level, the position and orientation tracker, tracks the position and orientation of the main body of the walker, and is described in section 3.3.3. This system is similar to the one employed in the CyberCarpet prototype described in section 4.2. This tracker is marker-less.

¹Vicon motion capture systems: http://www.vicon.com

The third level, the full body tracking, was developed by R. Kehl et al. [2]. However, the speed of this tracker is less than 1 Hz. At this low refresh rate, the tracker loses track too easily (within 1 second the user can change his entire body pose), and the interaction is too slow. Therefore, the full body tracker was replaced by an example-based full body pose recognition system. This means that, instead of tracking the articulated pose of the walker from frame to frame, the body pose is detected instantly from the current frame and matched to a selection of predefined key poses which are stored in a database. This full body pose recognition system is described in sections 3.2 and 3.3. This system is also marker-less: the walker does not need to wear a special suit or attach special markers to his body. As a result, the only markers attached to the walker are the ones on the safety helmet, which has to be worn for safety reasons regardless.

4.3.1 Design of the Omnidirectional Treadmill

The eleven-ton piece of hardware was designed and constructed by the TU Munich, and is described in more detail by Schwaiger et al. [76]. It is designed as a belt array, as shown in figure 4.31. As soon as a belt segment is conveyed to the upper part of the chain, it is switched on. At the end of the surface, the belt is switched off to prevent failure of the belt system, which cannot be operated vertically. Any resulting (x, y) displacement can be generated by combining the velocities of the belt chain and the belts themselves.

Thus, unlike the CyberCarpet, the omni-directional treadmill performs an (x, y) displacement, and no angular displacement of the walker. The platform is able to achieve 1 km/h in the direction of the chain and 7 km/h in the direction of the belts. The walkable area is 4.7 by 4.6 meters.

One of the main problems of chains with big flank pitches is the jump in curvature when a chain element moves from the linear section to the circular section. As the belts are mounted on the chain without gaps, the belt system slaps against the preceding belt as soon as the platform reaches a certain speed.

Figure 4.31: Construction of the treadmill.

To overcome this problem, the trajectory along which the chain is conveyed is shaped in a special way. Between the linear section and the circular section there is a clothoid section (figure 4.32) that enables the rollers (on which the belts are mounted) to move from the linear rail on the upper side towards the clothoid rail without a jump in the curvature of the surface. The belt is accelerated smoothly and the angular momentum is increased. At the point where the curvatures of the clothoid rail and the circular wheel are equal, the chain is carried on by the drive wheel.
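The reason the clothoid removes the curvature jump is that its curvature grows linearly with the arc length s, so it joins the straight rail (zero curvature) to the circular section (curvature 1/R) continuously; A is the clothoid parameter of the rail design:

```latex
% Clothoid (Euler spiral): curvature is linear in the arc length s
\kappa(s) = \frac{s}{A^{2}}, \qquad \kappa(0) = 0, \qquad \kappa(L) = \frac{L}{A^{2}} = \frac{1}{R}
```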

To create a highly immersive environment, the floor the walker is standing on has to be very homogeneous and must provide characteristics similar to ordinary floors. As soon as the walker touches parts of the floor which are not actuated, the immersion is broken. This is particularly the case if there are gaps between the belts. Normally, it is impossible to align single belts without a gap between them, because the whole system has to be mounted and powered at some point. In our prototype, this limitation is overcome by introducing a coupled pair of belts, namely the main belt (MB) and a support belt (SB), which can be seen in figure 4.33.

Figure 4.32: Chain trajectory.

Figure 4.33: Belt construction.

The track chain is actuated by electric motors. However, there is a danger that the whole construction will tip over if the motors are not driven at exactly the same speed. Therefore, the motors are kept synchronized with control electronics.

For safety reasons, a number of security measures are implemented in the platform. There is a laser barrier to stop the system immediately if someone or something comes too close to the moving mechanical parts of the system. There is also a harness retraction system, which means that the walker has to wear a harness which is attached to a rope. If the walker falls, he is pulled up and has less risk of getting hurt. Furthermore, there is a remote emergency stop: a person watches the walker on the platform with a hand on a red button, and can stop the platform at any time by pushing it.

Despite its bulk, the huge machine operates extremely smoothly. It accelerates and brakes slowly so that the walker does not stagger, and the feeling of natural walking is preserved. The omni-directional treadmill is shown in figures 4.34 and 4.35.

4.3.2 Visual Localization

The visual tracker used on the omni-directional treadmill is the same as the one used on the CyberCarpet in section 4.2, and is described in detail in section 3.3.3. There are, however, a number of differences with the setup of the CyberCarpet. The user is wearing a harness with a safety rope, which is connected to a cable above the platform. Together with this safety rope, there is a power cable and a video cable connected to the HMD. The rope and the two cables combined form a significant obstruction to the tracker. To reduce the obstruction, the cables are wrapped in a black sock, and the safety rope is painted black, which is the color of the treadmill and of the background.

As an alternative to the visual tracker, the (x, y) coordinates recovered from the Vicon system can be used as input for the controller. This would be possible as the orientation of the user is not used in the current control algorithm.

4.3.3 Control Design

The gentle and intelligent control of the treadmill was the contribution of the partners from the University of Rome. As the omni-directional platform performs (x, y) movements and does not require a rotational component, the control problem is much simpler than in the case of the CyberCarpet. It can be written as the sum of two linear control problems on a linear treadmill, one for the x direction and one for the y direction.
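As an illustration of this decomposition, the sketch below applies the same scalar recentering law independently to the x (chain) and y (belt) coordinates of the tracked walker position. The proportional/feedforward structure and the acceleration limit are assumptions chosen for the sketch, not the controller actually implemented by the project partners.

```python
def axis_command(pos, vel_hat, prev_cmd, dt, k_p=0.5, a_max=0.5):
    """One linear-treadmill controller: pull the walker back to the center of
    this axis, compensate the estimated walking speed, and limit the commanded
    acceleration.  All numerical values are illustrative assumptions."""
    desired = k_p * pos + vel_hat          # belt speed that recenters the walker
    max_step = a_max * dt                  # bound the change per control step
    step = max(-max_step, min(max_step, desired - prev_cmd))
    return prev_cmd + step

def treadmill_commands(x, y, vx_hat, vy_hat, prev, dt=0.1):
    """The 2D control problem decomposes into two independent 1D problems:
    one along the chain (x) and one along the belts (y)."""
    cmd_x = axis_command(x, vx_hat, prev[0], dt)
    cmd_y = axis_command(y, vy_hat, prev[1], dt)
    return cmd_x, cmd_y
```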

Figure 4.34: The chain drive wheel of the omni-directional treadmill.

Figure 4.35: The omni-directional treadmill before being shipped to Tuebingen.

Apparent accelerations felt by the walker are shown in figure 4.36. The controller is designed to minimize these accelerations, while keeping the walker close to the platform center.

Figure 4.36: Apparent accelerations (inertial, centrifugal and Coriolis) felt by the walker in the x and y directions (XW and YW respectively), due to the platform motion.

4.3.4 Visualization

The walker is immersed in a virtual environment using a head mounted display (HMD), as shown in figure 4.37.

For a correct visualization of the virtual environment, the position of the walker in the virtual world is needed, as well as the viewing direction (i.e. the angle of the head). The position of the user is provided by the visual tracker and the control algorithm. The position and angle of the head are tracked using a Vicon system and Vicon markers which are attached to the safety helmet of the walker. These markers are shown in figure 4.37.

Figure 4.37: The walker wearing a head mounted display (HMD) and a safety helmet with reflective Vicon markers attached. The safety rope and HMD cables are also visible in the background.

The virtual environment consists of a virtual model of Pompei. As the omni-directional treadmill is designed for behavioral experiments on the walker and his senses, it is necessary to generate new virtual models on the fly. Some examples are to generate a straight street, or a slightly curved street, or to replace some buildings in the street with other buildings, and to see how this affects the behaviour of the walker. To achieve this, virtual models are generated on the fly using the CityEngine². Some examples of generated models of Pompei are shown in figures 4.38 and 4.39.

Some basic interaction with the environment is made possible through body pose recognition. For example, if the user stretches his arms to the sides, a map of the model is shown, and if the user holds up his arms, some numeric statistics about the visualization are shown on top of the model. The end goal would be to allow the user to interact with the model, such as opening a door, or requesting information about the building he/she is looking at. However, the current visualization system is very basic and does not support animation (which would be required to open a door).

²CityEngine: http://www.procedural.com/cityengine

Figure 4.38: Virtual model of Pompei generated with the CityEngine.

Figure 4.39: Virtual model of Pompei generated with the CityEngine.

4.3.5 Body Pose Recognition

In order to interact with the virtual world, a real-time marker-less full body pose recognition system was implemented. This system is described in detail in sections 3.2 and 3.3. An overview of the system is also shown in figure 4.40.

(Figure content: color images taken from each camera are passed through foreground-background segmentation and a rotation-invariant 3D hull reconstruction with normalization; 3D Haarlets selected during the training stage approximate the LDA/ANMM transformation, and the result is matched against a database of predefined poses. The figure also reports the classification performance for detecting 12 poses on a freely moving, walking person using 7 cameras, comparing 2D LDA-approximation, 3D LDA-approximation and 3D SVM-based variants, with rates between 87.42% and 97.18% at refresh rates between 1 Hz and 100 Hz.)

Figure 4.40: Overview of the body pose estimation on the omni-directional treadmill.

To test the integration with the omni-directional treadmill, and more specifically with the visualization system, a simplified version of the pose recognition system was deployed. This version was limited to three body poses (a sketch of the corresponding pose-to-action mapping follows the list):

1. Standard walking: this consists of any walking pose where the arms of the user are naturally by the side of the body;

2. Arms stretched: the arms are stretched sideways, in which case the user requests a map of the model;

3. Arms up: the user holds up his arms, which toggles a window with statistics about the visualization system.
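The sketch below shows how the recognized pose class could be mapped onto these actions; the class identifiers and the visualization calls (show_map, show_statistics) are hypothetical names, only the three-pose mapping itself comes from the list above.

```python
# Hypothetical pose class ids returned by the pose recognition system.
POSE_WALKING, POSE_ARMS_STRETCHED, POSE_ARMS_UP = 0, 1, 2

class VisualizationInterface:
    """Toggle the map / statistics overlays based on the recognized pose.
    A pose only triggers its action once, when it is first entered."""
    def __init__(self, viewer):
        self.viewer = viewer          # hypothetical handle to the visualization system
        self.last_pose = POSE_WALKING

    def on_pose(self, pose_id):
        if pose_id == self.last_pose:
            return                    # no change: continued walking does nothing
        if pose_id == POSE_ARMS_STRETCHED:
            self.viewer.show_map(True)          # arms stretched sideways: show the map
        elif pose_id == POSE_ARMS_UP:
            self.viewer.show_statistics(True)   # arms up: show the statistics window
        else:
            self.viewer.show_map(False)         # back to walking: hide the overlays
            self.viewer.show_statistics(False)
        self.last_pose = pose_id
```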

During the integration process, we ran into a number of problems. The body pose recognition assumes a static background, but this is not the case if the moving treadmill is visible in the camera image, which causes a significant problem for the foreground-background segmentation, especially in the overhead camera, as shown in figures 4.41 and 4.42. Not only is the treadmill constantly moving, it also contains reflective parts between the belts. To overcome these segmentation problems, we do not use the overhead camera, and the other cameras are placed in such a way that the moving treadmill is not visible.

Another problem is that the Cyberneum is painted black. A black background is not suitable for foreground-background segmentation, as any shadow on the user's body will tend towards black as well, and will therefore be segmented out. A solution would be to use very low thresholds to extract the silhouette of the user; however, this would result in a lot of noise in the segmentation. A reasonable solution to this problem is to put a directed light alongside the camera (pointing in the same direction as the camera), which eliminates all shadows on the user from the camera viewpoint.

A final problem for the segmentation results from the rope and cables attached to the walker. Originally these were assumed to be in scale with the electrical wires attached to the robot walker in the CyberCarpet prototype. However, in practice they turned out to be much thicker: combined, they are as thick as the walker's own arms. To remedy this problem, the rope is painted black and the cables are wrapped in a black sock, so that they can get segmented out with the background. Despite this measure, they still influence the segmentation and thus the pose recognition, meaning that poses are sometimes not properly recognized.

Figure 4.41: Images taken from an overhead camera of the moving treadmill. The treadmill is moving and there is a security cable dangling in the image.

Figure 4.42: Examples of the foreground-background segmentation on a moving treadmill with a user on it.

4.3.6 Discussion

A large, smooth platform was constructed which can immerse a human walker into the virtual Pompei. The platform is the result of the integration of an omni-directional treadmill, a visual tracker, a controller, a visualization system and a pose recognition system.

The integration was not as smooth as with the CyberCarpet prototype, and many issues could have been avoided with a more careful integration. In any case, the project resulted in an impressive collaboration and a unique platform, which was presented at the CyberWalk workshop.

4.4 Hand Gesture Interaction (Value Lab)

The goal of the system presented in this section is to enable multiple participants to visualize and manipulate complex 3D models on a large screen. The work described in this section was also presented in [50, 77]. Traditional human-computer interaction devices (e.g. mouse, keyboard) are typically designed to fulfil single-user requirements, and are not adapted to work with large screens and, in some applications, with the multi-dimensionality of the data presented.

Figure 4.43: Person interacting with a camera and screen.

The system detects hand gestures and hand movements of a user in front of a camera mounted on top of a screen, as shown in figure 4.43. The goal is to enable interaction of the user with a 3D object or model. Section 2.2 introduced an improved skin color segmentation algorithm, and a novel hand gesture recognition system was introduced in section 3.4. The system is example-based, meaning that it matches the observation to predefined gestures stored in a database; it is real-time and does not require the use of special gloves or markers. The result is a real-time marker-less interaction system which is applied to two applications, one for manipulating 3D objects, and the other for navigating through a 3D model.

4.4.1 The Value Lab

The system is implemented at the Value Lab [78], which is a special kind of information visualization room. It was designed as a research platform to guide and visualize architectural models and planning processes. The Value Lab (shown in figures 4.44 and 4.45) is an essential space in the new Information Science Laboratory (HIT) and represents the fusion between built architecture and digital design sciences. It is used in teaching and for research purposes, such as analysis of large data sets, applications in marketing and monitoring, applications in real-time visualization, distributed real-time image rendering, and interactive screen design.

The Value Lab consists of a physical space with state-of-the-art hardware, software and intuitive human-computer interaction devices. It is equipped with five large touch displays, three on the wall and two set up as a table. Furthermore, it is equipped with projectors and video conferencing. Besides the direct on-screen manipulation of information (using a mouse or the touch interface), a technology was needed that offers a novel touch-less interaction system with camera-based gesture recognition. The goal of this environment is also to research new ways of collaborative visualization and working. The goal of this project in particular is to allow for the visualization and manipulation of 3D models in this environment using hand gestures.

4.4.2 System Overview

An overview of the hand gesture recognition system is shown in figure 4.46. A camera is placed on top of the screen and captures color images at a resolution of 640×480 pixels at 30 frames per second. For each frame, the camera image is passed to the skin color segmentation system, which is described in section 2.2. The online skin color model is updated based on the color information from the face, which is found using the standard face detector in OpenCV. Then the hands are located as the two largest skin-colored objects that are not part of the face. The two detected hands are then used as input for the hand gesture classification system, which is described in section 3.4. For each hand, the gesture classification system outputs a location (x, y) and an id (i) for the detected gesture. Based on these variables, an application can be built to manipulate 3D objects or to navigate through 3D models. Two applications were developed and are described later in this section.

Figure 4.44: The Value Lab.

Figure 4.45: The Value Lab.

(Figure content: skin color segmentation → locate face → locate hands → classify the gesture of each hand → manipulate the object.)

Figure 4.46: Overview of the hand gesture recognition system.
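The per-frame pipeline of figure 4.46 can be sketched as follows (assuming OpenCV 4 and its stock Haar-cascade face detector). The skin model and the gesture classifier are stubs standing in for the components of sections 2.2 and 3.4, and the blob-based hand localization is a simplified stand-in for the actual detection.

```python
import cv2

# Stock OpenCV frontal-face cascade; the original system may use a different model.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def _overlaps_face(contour, face):
    fx, fy, fw, fh = face
    x, y, w, h = cv2.boundingRect(contour)
    return not (x + w < fx or x > fx + fw or y + h < fy or y > fy + fh)

def process_frame(frame_bgr, skin_model, classify_gesture):
    """One iteration: face detection -> online skin model update -> skin
    segmentation -> hand localization -> gesture classification."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.2, 5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        skin_model.update(frame_bgr[y:y + h, x:x + w])   # online model from the face
    skin_mask = skin_model.segment(frame_bgr)            # binary skin mask
    # hands: the two largest skin-colored blobs that do not overlap the face
    contours, _ = cv2.findContours(skin_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    blobs = sorted(contours, key=cv2.contourArea, reverse=True)
    hands = [c for c in blobs
             if len(faces) == 0 or not _overlaps_face(c, faces[0])][:2]
    results = []
    for c in hands:
        hx, hy, hw, hh = cv2.boundingRect(c)
        crop = frame_bgr[hy:hy + hh, hx:hx + hw]
        gesture_id = classify_gesture(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY),
                                      skin_mask[hy:hy + hh, hx:hx + hw])
        results.append((hx + hw // 2, hy + hh // 2, gesture_id))
    return results   # one (x, y, id) tuple per detected hand
```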

Skin Color Segmentation

The skin color segmentation algorithm used in this system is described in section 2.2. It is a hybrid system that consists of an offline and an online model. The online model is updated using the color information taken from the face region of the user. This makes the system more robust to different users and to changes in lighting.
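A minimal sketch of such a hybrid model is given below, with the online part implemented as a hue-saturation histogram re-estimated from the face region and blended with a fixed offline histogram. The histogram sizes, the blending weight and the threshold are assumptions; only the idea of refreshing the model from the detected face comes from section 2.2.

```python
import cv2
import numpy as np

class HybridSkinModel:
    """Offline + online skin color model (sketch).  The online part is a
    hue-saturation histogram re-estimated from the face region at run-time."""
    def __init__(self, offline_hist, blend=0.5, threshold=0.2):
        self.offline = offline_hist / (offline_hist.sum() + 1e-9)
        self.online = self.offline.copy()
        self.blend = blend
        self.threshold = threshold

    def update(self, face_bgr):
        # Re-estimate the online histogram from the detected face region.
        hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        self.online = hist / (hist.sum() + 1e-9)

    def segment(self, frame_bgr):
        # Back-project the blended histogram and threshold it into a skin mask.
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        combined = (self.blend * self.online +
                    (1.0 - self.blend) * self.offline).astype(np.float32)
        prob = cv2.calcBackProject([hsv], [0, 1], combined, [0, 180, 0, 256], 1.0)
        return (prob > self.threshold * prob.max()).astype(np.uint8) * 255
```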

Hand Gesture Recognition

Once the location of the hands is determined, both a grayscale image and a black and white segmentation of each hand are cut out and concatenated, as shown in figure 4.47. This concatenated image (for example 72×36 pixels) is then stored as a vector (with, for example, 72 · 36 = 2592 values).
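A small numpy sketch of how this input vector can be assembled; the 36×36 crop size is an assumption chosen so that the concatenated image measures 72×36 pixels.

```python
import cv2
import numpy as np

def hand_feature_vector(gray_crop, mask_crop, size=36):
    """Resize the grayscale crop and its binary segmentation to size x size,
    concatenate them side by side (72 x 36 here) and flatten them into a
    single vector (72 * 36 = 2592 values).  The crop size is an assumption."""
    g = cv2.resize(gray_crop, (size, size)).astype(np.float32) / 255.0
    m = cv2.resize(mask_crop, (size, size),
                   interpolation=cv2.INTER_NEAREST).astype(np.float32) / 255.0
    combined = np.hstack([g, m])      # concatenated 72 x 36 image
    return combined.reshape(-1)       # 2592-dimensional feature vector
```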

This vector is fed into the hand gesture recognition system, which is described in detail in section 3.4. The system is trained to detect three different gestures: an open hand (do nothing), a fist (grab), and a pointing finger (select), as shown in figure 4.48. The detection of these three gestures, and the motion of the hands, will be used to steer the interaction with the applications as described further in this section.

Figure 4.47: Input for the hand gesture classifier.

Figure 4.48: Examples of the classifier detecting different hand gestures. An open hand is marked in red (a), a fist is marked in green (b) and a pointing hand is marked in blue (c).

Speed Optimizations

To improve the speed of the system, the face detection is run at a lower resolution (half of the original resolution) and the skin color segmentation is also first run on a lower resolution image, after which the segmentation is refined by evaluating the high resolution pixels around detected skin pixels. This results in face detection in 34 ms and skin color segmentation in 46 ms. Adding the time required to grab a frame from the camera (11 ms), to update the skin model (30 ms), and to detect the hands and classify the gestures (3 ms), the total processing time for a frame is still 124 ms, as shown in figure 4.49. This means that the refresh rate of the system is 8 Hz.

In order to make the system even faster, the parts that require a lot of computation are spread over a loop of 9 frames. This means that, for every 9 frames grabbed from the camera, the face detection is run on frame 0, the full skin color segmentation is run on frame 3, and the skin model is updated on frame 6. On each frame, instead of running the skin color segmentation on the full frame, it is only run on a small region of interest around the previous location of the head and hands. This optimized skin detection requires only 5 ms per frame. The resulting frame loop is shown in figure 4.50 and reduces the average computation time per frame to 31 ms, which results in a refresh rate of 32 Hz. This refresh rate allows for a very fast and smooth interaction of the user with the 3D object.
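The 9-frame schedule can be written as a simple round-robin dispatcher. The sketch below follows the assignment of figure 4.50 (face detection on frame 0, full skin color segmentation on frame 3, skin model update on frame 6, and ROI-only skin detection plus gesture classification on the remaining frames), with the heavy steps passed in as stand-in callables.

```python
def run_frame_loop(grab_frame, detect_face, full_skin_segmentation,
                   update_skin_model, skin_in_roi, classify_gestures):
    """Spread the expensive steps over a repeating loop of 9 frames.
    The callables are stand-ins for the components described in the text."""
    frame_index = 0
    while True:
        frame = grab_frame()                   # ~11 ms
        phase = frame_index % 9
        if phase == 3:
            full_skin_segmentation(frame)      # ~46 ms, replaces the ROI step
        else:
            skin_in_roi(frame)                 # ~5 ms on all other frames
        if phase == 0:
            detect_face(frame)                 # ~34 ms, once per loop
        elif phase == 6:
            update_skin_model(frame)           # ~30 ms, once per loop
        classify_gestures(frame)               # ~3 ms, every frame
        frame_index += 1
```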

(Figure content: frame 0: grab frame (11 ms), face detection (34 ms), full skin color segmentation (46 ms), update skin model (30 ms), hand gesture recognition (3 ms); 124 ms in total.)

Figure 4.49: Computing times on a frame before optimizations. The total average computing time is 124 ms, or a framerate of 8 Hz.

(Figure content: frame 0: face detection (34 ms), skin detection in ROI (5 ms) and hand gesture recognition (3 ms), 53 ms in total; frame 3: full skin color segmentation (46 ms) and hand gesture recognition (3 ms), 60 ms; frame 6: update skin model (30 ms), skin detection in ROI (5 ms) and hand gesture recognition (3 ms), 49 ms; frames 1-2, 4-5 and 7-8: skin detection in ROI (5 ms) and hand gesture recognition (3 ms), 19 ms each.)

Figure 4.50: Computing times for a sequence of 9 frames after optimization. Frame grabbing (11 ms) has been left out for brevity. The total average computing time is 31 ms, or a framerate of 32 Hz.

4.4.3 Object Manipulation: One Object

The first demonstration system consists of a 3D object (a teapot) displayed on the screen, which the user can manipulate. The manipulations consist of rotating the object, stretching it and moving it. The gestures used for the manipulation of the object are illustrated in figure 4.52.

The user initializes the system by standing in front of the screen and holding both his hands up (figure 4.52c). The user can grab the object by making two fists, and move his fists further apart or closer together to change the size of the object (figure 4.52a).

By moving his fists in a circle, he can rotate the object (figure 4.52b). This is also possible in three dimensions, by moving the fists in the horizontal plane. The relative position of the hands in the depth dimension is determined by the relative size of the hands (i.e. the ratio of the numbers of pixels in each of the segmented hands). The user can also move the object by grabbing it and dragging one single fist (figure 4.52d). A picture of this system running on a standard computer screen is shown in figure 4.51.
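The mapping from fist motion to object transformation can be sketched as follows: the scale follows the change in distance between the fists, the in-plane rotation follows the change in angle of the connecting line, and the pseudo-depth rotation follows the change in the ratio of the segmented hand sizes. The formulas below are an illustrative reconstruction of this mapping, not the exact implementation.

```python
import math

def two_fist_update(prev_left, prev_right, left, right,
                    prev_left_area, prev_right_area, left_area, right_area):
    """Derive relative scale, in-plane rotation and pseudo-depth rotation from
    two tracked fists.  Each hand is given as an (x, y) pixel position and a
    segmented pixel area; the formulas are an illustrative reconstruction."""
    def span(a, b):
        return math.hypot(b[0] - a[0], b[1] - a[1])

    def angle(a, b):
        return math.atan2(b[1] - a[1], b[0] - a[0])

    scale = span(left, right) / max(span(prev_left, prev_right), 1e-6)
    rot_z = angle(left, right) - angle(prev_left, prev_right)
    # pseudo-depth: a change in the ratio of the hand sizes approximates a
    # rotation around the vertical axis (the closer hand appears larger)
    ratio_now = left_area / max(right_area, 1e-6)
    ratio_before = prev_left_area / max(prev_right_area, 1e-6)
    rot_y = math.log(ratio_now / max(ratio_before, 1e-6))   # signed, unitless
    return scale, rot_z, rot_y
```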

Figure 4.51: Hand gesture system manipulating one object.

(Figure panels: (a) stretch (scale); (b) rotate (three dimensional); (c) release (do nothing); (d) drag (move).)

Figure 4.52: The hand gestures used for the manipulation of one object on the screen.

4.4.4 Object Manipulation: Two Objects

The second demonstration system is similar to the first, except that it displays two objects: a teapot and a cube, which the user can select and manipulate. The user can select an object with a pointing gesture, i.e. pointing with the left hand selects the object on the left, and pointing with the right hand selects the object on the right. Then, the user can manipulate the object(s) he has selected by rotating it in three dimensions. The gestures used for selecting and manipulating the objects are illustrated in figure 4.53.

(Figure panels: (a) select object (left, right or both); (b) rotate (three dimensional); (c) release (do nothing).)

Figure 4.53: The hand gestures used for the manipulation of two objects on the screen.

Some pictures of this system running in the Value Lab are shown in figures 4.54, 4.55 and 4.56.

Figure 4.54: Hand gesture system manipulating two objects.

Figure 4.55: Hand gesture system manipulating two objects.

Figure 4.56: Hand gesture system selecting the object on the right (cube).

4.4.5 Model Navigation

The hand gesture interaction in this application is composed of the five hand gestures shown in figure 4.57. It recognizes the gestures and movements of both hands to enable the navigation of the model. Pointing with one hand selects the model to start navigation. By making two fists, the user can grab and rotate the viewing frustum around the z- and y-axes. The rotation around the z-axis is measured as the relative change in angle of the straight line connecting the two hands. The rotation around the y-axis is measured as the relative difference in size (pixel surface) of the hands, resulting in a pseudo-3D rotation. By making a fist with just one hand and moving it, the user can pan through the model. By making a pointing gesture with both hands and pulling the hands apart, the user can zoom in and out of the model. Opening both hands stops the navigation, and nothing happens until the user makes a new gesture.
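The mapping of these five gestures onto navigation commands amounts to a small dispatch on the per-hand gesture ids returned by the classifier; the ids, tuple layout and command names below are hypothetical, the mapping itself follows the description above.

```python
OPEN, FIST, POINT = 0, 1, 2     # hypothetical gesture ids from the classifier

def navigation_command(left, right):
    """Map the (gesture, position, area) state of both hands to a navigation
    action, following the gesture set of figure 4.57.  Each argument is a
    tuple (gesture_id, (x, y), pixel_area)."""
    lg, lp, la = left
    rg, rp, ra = right
    if lg == FIST and rg == FIST:
        # two fists: rotate around z (angle of the connecting line) and
        # around y (relative hand size), as in the two-fist sketch above
        return ("rotate", lp, rp, la, ra)
    if lg == POINT and rg == POINT:
        return ("zoom", lp, rp)          # hand distance drives the zoom factor
    if FIST in (lg, rg) and OPEN in (lg, rg):
        return ("pan", lp if lg == FIST else rp)
    if POINT in (lg, rg) and OPEN in (lg, rg):
        return ("select", None)          # single pointing hand starts navigation
    return ("idle", None)                # two open hands: do nothing
```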

The interaction demo system has been implemented as an extension of an open-source 3D model viewer, the GLC Player³. This enables us to 1) load models in multiple formats (OBJ, 3DS, STL, and OFF) and of different sizes, and 2) use our hand interaction system in combination with standard mouse and keyboard interaction. Pressing a button in the toolbar activates the hand interaction mode, after which the user can start gesturing to navigate through the model. Pressing the button again deactivates the hand interaction mode and returns to the standard mouse-keyboard interaction mode.

³http://www.glc-player.net/

(Figure panels: (a) select (start interaction); (b) rotating; (c) panning; (d) zooming; (e) release (do nothing).)

Figure 4.57: The hand gestures used for the navigation of the 3D model on the screen.

We conducted experiments in the Value Lab and tested with multiple 3D models, in particular with a model created as part of the Dubendorf urban planning project. This model represents an area of about 0.6 km² and consists of about 4000 objects (buildings, street elements, trees) with a total of about 500,000 polygons. Despite this size, our system achieved frame rates of about 30 fps, which is sufficient for smooth interaction. Examples of the user zooming, panning and rotating through the 3D model are shown in figures 4.58, 4.59 and 4.60 respectively. In each figure, the left column shows side and back views of the system in operation at the beginning of the gesture, and the right column shows the same views at the end of the gesture.

The hand interaction mode is currently only available for model navigation (rotation, panning and zooming); all the other features of the viewer are only accessible in mouse-keyboard interaction mode. Nonetheless, our implementation enables simple extensions of the hand interaction mode. In the near future, we aim, for instance, to enable the hand interaction mode for object selection (e.g. to view its properties).

Figure 4.58: Zooming into the model.

4.5 Discussion

This section introduced a novel solution for human-computer interaction based on hand gestures. Compared to currently existing systems, it presents the advantage of being marker-less and real-time. Experiments, conducted in the Value Lab, investigated the usability of this system in a situation as realistic as possible. The results show that our system enables a smooth and natural interaction with 3D objects/models.

The hand segmentation/detection was made more robust to changes in lighting and to different users by combining a pre-trained skin model with one that is trained online. In future work, shape or depth information will be included in the detection process to further improve robustness: the system cannot detect the hands when they overlap with the face, and it also requires some fine-tuning of the white balance parameters of the cameras before use (which could be automated as well).

Figure 4.59: Panning the model.

With respect to the interaction system, the next steps in the development include: (1) extending the set of viewing features accessible through hand gestures; and (2) the extension to multiple users interacting with multiple screens.

Figure 4.60: Rotating the model.

5 Summary

We presented a number of building blocks which can be used to build vision-based human-computer interaction (HCI) systems. These systems are intended to run in real-time and without the use of special gloves or markers. These building blocks were then used to build four real-world HCI applications.

First, an overview was given of the foreground-background segmentation, skin color segmentation and 3D hull reconstruction algorithms which are used in this thesis. An important contribution is the skin color segmentation algorithm, which is a hybrid between an offline and an online trained model, allowing for more robustness to changes in the user and in the lighting.

Then, an overview was presented of the face, eye, hand and finger detection techniques used in this thesis. Furthermore, a novel full body pose recognition method was introduced based on Linear Discriminant Analysis (LDA) and Average Neighborhood Margin Maximization (ANMM). The body pose recognition was explored both for a 2D silhouette-based system and a 3D hull-based system. The 3D approach offers the interesting advantage of being rotation-invariant: the orientation of the input hull can be normalized to a standard orientation, which is not possible in the 2D case.

In order to estimate the orientation of the user, an overhead tracker was developed. This tracker is an adaptation of an existing tracker, adding degrees of freedom (rotation) and extending the model to accommodate head movements and perspective changes.


A novel hand gesture recognition system was introduced, based on ANMM and similar to the 2D body pose recognition system. This approach allows for very accurate hand gesture recognition, enabling many interesting HCI applications with hand gestures.

As the HCI systems are intended to run in real-time, the body pose recognition system and the hand gesture recognition system were sped up using integral images and Haarlets. In the 2D case these are 2D Haarlets, which are trained using a novel ANMM approximation algorithm. Furthermore, 3D Haarlets were introduced in order to classify the 3D hulls. The ANMM approximation algorithm is the first method able to train 3D Haarlets by selecting from the full set of possible Haarlets, which consists of hundreds of millions of candidate features.
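To illustrate why integral images make Haarlet evaluation fast, the sketch below computes a 2D integral image and evaluates a simple two-rectangle feature with a handful of lookups. The particular feature layout and the function names are an arbitrary example, not one of the ANMM-selected Haarlets.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; the extra zero row/column simplifies the lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from four lookups, independent of the box size."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def two_rect_haarlet(ii, y, x, h, w):
    """Example two-rectangle Haarlet response: left half minus right half."""
    half = w // 2
    return box_sum(ii, y, x, y + h, x + half) - box_sum(ii, y, x + half, y + h, x + w)
```

The 3D Haarlets operate analogously on an integral volume, where each box sum takes eight lookups instead of four.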

At the end of this thesis, we presented four real-world HCI applications which make use of the building blocks described above. The first application is a hand gesture interface where the user can point at icons on the screen. The user can then select or move the objects using hand gestures.

The second and third applications are part of the CyberWalk project, which aimed to enable unconstrained locomotion on an omnidirectional treadmill. This allows for immersion in a virtual world, where the user can walk around without constraints, interacting with the virtual world. Our contributions to this project were: (1) the overhead tracker which tracks the position and orientation of the user in order to keep him in the center of the platform, and (2) the body pose recognition system which allows for basic interaction of the user with the system by making arm gestures.

The fourth application that we presented is a novel 3D interaction system where the user can manipulate and navigate through 3D models using hand gestures. This system was implemented and demonstrated at the Value Lab.


5.1 Future Work

This thesis opened a multitude of possibilities. The applications can be extended in a number of ways using the presented techniques.

A number of improvements are still in order to make the system more reliable. For example, the hand detection currently relies entirely on color information. In the future it would be better to also add shape information to the detection process, which would make it even more robust to changes in lighting. The hand detection is used to determine the location of the hands, and to select the hand and process it for gesture recognition. It is therefore fundamental to a hand gesture-based application that the hand detection is as reliable and accurate as possible.

The current segmentation/preprocessing steps consist of foreground-background segmentation, skin color segmentation and 3D hull reconstruction. In future work we would like to add a depth estimation step, where the challenge would be to make it real-time yet reliable. Depth information would allow the hand gesture system to focus on the person who is in front of the screen, while ignoring bystanders in the background.

The presented full body pose recognition systems (both 2D and 3D) perform very well at detecting a predefined set of poses, even if the number of pose classes is very high. In the future it would be interesting to estimate the joint locations of the user by assigning joint coordinates to the predefined pose classes. The joints could then be estimated by interpolating the joint coordinates of the three or four closest pose matches.
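The interpolation idea could look roughly like the sketch below, where each predefined pose class carries a set of joint coordinates and the classifier's scores for the closest classes are used as weights. This only illustrates the proposed future work; all names, shapes and the weighting scheme are assumptions.

```python
import numpy as np

def interpolate_joints(class_scores, class_joints, k=3):
    """Estimate joint positions as a weighted average over the k best-matching pose classes.

    class_scores : (num_classes,) matching scores, assumed non-negative, higher is better.
    class_joints : (num_classes, num_joints, 3) joint coordinates assigned to each class.
    """
    top = np.argsort(class_scores)[-k:]            # indices of the k closest pose classes
    w = class_scores[top].astype(np.float64)
    w = w / w.sum()                                # normalize the weights
    return np.tensordot(w, class_joints[top], axes=(0, 0))   # (num_joints, 3) estimate
```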

Furthermore, the pose recognition systems presented are limited to static poses, and do not make use of prior information. For example, building a motion graph or HMM based on prior information could enable the detection of typical pose and gesture movements. This would be especially useful in the full body pose recognition system, as it would allow for detecting typical actions and behaviours. The hand gesture recognition system already uses motion information taken from the detected hand locations.


The current hand gesture recognition system assumes the hand orientation with respect to the camera/screen to be more or less constant. Some variation is allowed, but, for example, having the fist rotate a significant amount around the vertical axis is impossible. In future work it would be interesting to explore whether multiple viewpoints of the same gestures can be trained in the existing system, perhaps with a slight adaptation of the ANMM algorithm.

Until now, the applications described in this thesis have always assumed that one user is interacting with the system. The goal in the future is to allow for multiple participants taking part in the interaction, for example on multiple screens. The first step would be to add basic face recognition to detect which person is interacting and what interactions are allowed for that person, and in a next step a tracker which tracks the locations of the different participants. This would allow for a wealth of new possibilities. For example, in the Value Lab, one stakeholder could add a floor to a building model, after which another stakeholder widens the street on which the building is located.


A Calibration

In this appendix, the automatic calibration of the camera parameters and the screen position is detailed. The 3D parameters of all the cameras are needed to successfully make 3D reconstructions based on the silhouettes (section 2.3). They are also needed to determine the 3D position of the user's hands in the perceptive user interface application (section 4.1), as well as the 3D position of the screen, which is described further in this appendix.

A.1 Camera Calibration: 2 cameras

The internal camera parameters are assumed constant for fixed camera settings (zoom and focus), and are calculated offline using a semi-automatic calibration program with a calibration object ([59] and [60]). The external camera parameters are calibrated as follows.

Correspondences and the fundamental matrix

The fundamental matrix F of the pair of cameras expresses the relation between two corresponding points, i.e. points that correspond in the two camera images. For a pair of corresponding points x and x' of the left and the right camera respectively, this relation is (in homogeneous coordinates):

x'^t F x = 0    (A.1)


Using a set of correspondences, the fundamental matrix can be determined. In order to make a good estimate in the presence of noisy correspondences, the RANSAC method is used [61]. Then, the camera matrices of the two cameras can be calculated from F.

The correspondences are recorded by a user pointing a finger around in the working volume. The fingertip is then detected in both camera views using the methods described in section 3, as illustrated in figure 3.9.
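A minimal sketch of this estimation step is given below, using OpenCV's RANSAC-based fundamental-matrix routine as a stand-in for the procedure of [61]; the thesis does not prescribe a specific library, and the array names are illustrative.

```python
import numpy as np
import cv2

def estimate_fundamental(pts_left, pts_right):
    """Estimate F from (N, 2) arrays of corresponding fingertip detections.

    Returns F and a boolean inlier mask produced by RANSAC.
    """
    F, mask = cv2.findFundamentalMat(np.float32(pts_left), np.float32(pts_right),
                                     cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        raise RuntimeError("not enough (or degenerate) correspondences")
    return F, mask.ravel().astype(bool)
```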

Camera matrices

The internal camera parameters are stored in the calibration matrices K and K' of the left and the right camera respectively. The essential matrix E is calculated as

E = K'^t F K    (A.2)

The camera matrices P and P' of the left and the right camera respectively are computed, assuming the first camera matrix P to be at the origin of the 3D space:

P = [ I_3 | 0 ]    (A.3)

As described in [62], there are four possible solutions for the second camera matrix P'. These solutions are found by rewriting the essential matrix E in its singular value decomposition, in the form U · diag(1, 1, 0) · V^t, where diag(v) is a diagonal matrix with v on the diagonal. The four possible solutions for P' are:

P' = [ U W V^t | +u_3 ]   or   [ U W V^t | -u_3 ]   or    (A.4)
     [ U W^t V^t | +u_3 ]   or   [ U W^t V^t | -u_3 ]     (A.5)

where u_3 is the last column of U and W is written as

W = (  0  -1   0
       1   0   0
       0   0   1 )    (A.6)


These four solutions represent four possible positions of the two cameras with respect to the scene. Each camera can be flipped 180 degrees; physically, however, only one position for the camera is possible. The correct solution for P' is found using the constraint that a reconstructed point must lie in front of both cameras.
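The four candidates of equations (A.4)-(A.5) and the in-front-of-both-cameras test can be sketched as follows. Triangulation is delegated to OpenCV here, which is an assumption of this illustration rather than the method used in the thesis, and all function names are illustrative.

```python
import numpy as np
import cv2

def candidate_second_cameras(E):
    """Four candidate P' = [R | t] from E = U diag(1,1,0) V^t (equations A.4-A.6)."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:  U = -U       # keep proper rotations; E is only defined up to sign
    if np.linalg.det(Vt) < 0: Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    u3 = U[:, 2:3]
    return [np.hstack([R, t]) for R in (U @ W @ Vt, U @ W.T @ Vt) for t in (u3, -u3)]

def pick_physical_camera(E, K, Kp, x_left, x_right):
    """Keep the candidate for which a reconstructed test point lies in front of both cameras."""
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])          # first camera at the origin
    for P2 in candidate_second_cameras(E):
        Xh = cv2.triangulatePoints(P1, Kp @ P2,
                                   np.float32(x_left).reshape(2, 1),
                                   np.float32(x_right).reshape(2, 1))
        X = (Xh[:3] / Xh[3]).ravel()
        if X[2] > 0 and (P2 @ np.append(X, 1.0))[2] > 0:       # positive depth in both views
            return P2
    return None
```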

A.2 Camera Calibration: n cameras

For calibrating the multi-camera installations we employ the Multi-camera Self-calibration package from Svoboda et al. [79]. They propose a fully automatic calibration method which yields complete camera projection models and only needs a laser pointer (figure A.1a) instead of a standard calibration pattern.


Figure A.1: For calibration, the user is required to wave the laser pointer (a) throughout the obscured working volume. A small piece of plastic is mounted on the laser pointer to produce a small and bright point in the camera images. Image (b) shows the projections of the laser pointer for four cameras. The image coordinates of the laser pointer can be easily detected and are indicated as small circles.

The calibration follows three steps:

A. Waving the laser pointer in a smooth and slow fashion through the obscured working volume fills it with virtual points.


This is the only manual part of the calibration procedure. Figure A.1 shows the projections of the laser pointer for four cameras.

B. The positions of the projections of the laser pointer in the camera images are measured. Laser projections are detected in all cameras independently, where 2D Gaussians are fitted to reach subpixel accuracy. Misdetected points are discarded through pairwise RANSAC analysis [80].

C. The self-calibration software estimates the unknown parameters of the camera models.

A.3 Screen Calibration

The screen calibration is done by projecting a sequence of points on the screen. Each time, the user is asked to point at the displayed point. The pointing direction is determined from the finger and eye coordinates as described in section 4.1. Based on the set of correspondences (eye, finger, point on the screen), a general equation can be constructed that relates the position of the eye, the finger and the screen coordinate the user is pointing at.

The 2D screen coordinate of point p_i is expressed in pixels and written as (λ_i, µ_i). The 3D coordinates of the corresponding positions of the eye and finger are written as X^o_i and X^v_i respectively.

The screen plane α is defined as

α ↔ X = U_0 + x U_x + y U_y    (A.7)

where U_0 is the origin of the screen (the left upper corner), and where U_x and U_y are the vectors that define the coordinate axes of the screen plane, as illustrated in figure A.2. The parameters x and y define the position on the screen. The 3D coordinates of the finger and the eye are registered using two cameras as described in section 3.1. The line L_i connecting these two points is defined as

L_i ↔ A_i X + b_i = 0    (A.8)



Figure A.2: The screen plane.

The rows of A_i are a basis for the orthogonal complement of the subspace spanned by X^v_i - X^o_i. If this basis is chosen orthonormally, then, for each X, the following equation expresses the perpendicular distance to L_i:

d^2(X, L_i) = (A_i X + b_i)^t (A_i X + b_i)    (A.9)

Each projected point (λ_i, µ_i) in the screen plane corresponds to a line L_i. In order for this point to be the position the user is pointing at, L_i must intersect the screen plane α in this point. Therefore, for each i:

A_i (U_0 + λ_i U_x + µ_i U_y) + b_i = 0    (A.10)

Given a set of correspondences (λ_i, µ_i) ↔ L_i, we can determine α. Due to measurement errors this set of equations will not have an exact solution. Therefore we look for a least squares solution based on N correspondences. For this, we use equation (A.9): we look for the plane α that puts the screen points u(λ_i, µ_i) ∈ α as near as possible to the lines L_i,

α = arg min_u  Σ_i (A_i u(λ_i, µ_i) + b_i)^t (A_i u(λ_i, µ_i) + b_i).    (A.11)


We introduce the following notation:

u = [ U_0 ; U_x ; U_y ]   and   Λ_i = [ I_3   λ_i I_3   µ_i I_3 ]    (A.12)

where I_3 is a 3×3 identity matrix. Equation (A.11) can be rewritten as:

Σ_i (A_i Λ_i u + b_i)^t (A_i Λ_i u + b_i)    (A.13)

Taking the derivative with respect to u and setting it equal to zero yields:

[ Σ_i (A_i Λ_i)^t (A_i Λ_i) ] u + Σ_i (A_i Λ_i)^t b_i = 0   ⇔   S u = -g    (A.14)

with S = Σ_i (A_i Λ_i)^t (A_i Λ_i) and g = Σ_i (A_i Λ_i)^t b_i.

The solution of equation (A.14) is not unique: moving the screen in a direction perpendicular to the screen itself yields an equivalent solution. However, it suffices to select any solution, as we only need to determine the screen coordinates the user is pointing at. The relation between the position of the eye X^o and the finger X^v, and the screen coordinate, is detailed in equation (4.6) in section 4.1. The result of a screen calibration based on N = 9 correspondences is shown in figure A.3.
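As a compact illustration, the sketch below assembles S and g as in equation (A.14) and takes a minimum-norm least-squares solution, which stands in for the "any solution" mentioned above. The inputs A_list, b_list and screen_pts are assumed to be precomputed per correspondence; the function name and shapes are illustrative.

```python
import numpy as np

def fit_screen_plane(A_list, b_list, screen_pts):
    """Solve S u = -g (equation A.14) for u = [U0; Ux; Uy].

    A_list     : per-correspondence 2x3 matrices A_i (orthonormal rows).
    b_list     : per-correspondence 2-vectors b_i.
    screen_pts : per-correspondence screen coordinates (lambda_i, mu_i) in pixels.
    """
    S = np.zeros((9, 9))
    g = np.zeros(9)
    for A, b, (lam, mu) in zip(A_list, b_list, screen_pts):
        Lam = np.hstack([np.eye(3), lam * np.eye(3), mu * np.eye(3)])  # equation (A.12)
        M = A @ Lam
        S += M.T @ M
        g += M.T @ b
    u, *_ = np.linalg.lstsq(S, -g, rcond=None)   # minimum-norm solution; S is rank deficient
    return u[:3], u[3:6], u[6:9]                 # U0, Ux, Uy
```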



Figure A.3: Screen calibration based on 9 correspondences. The red points are the eye positions. The green points are the finger positions. The screen plane α is shown in blue. The actual screen coordinates are marked with black crosses, and the reconstructed points with grey crosses.


B Notation

This appendix presents the mathematical notations and variables used in this thesis. They are listed in Table B.1.

Table B.1: Notations used in this thesis.

Variable : Description
p : pixel
c : color
R : red value (between 0 and 1)
G : green value (between 0 and 1)
B : blue value (between 0 and 1)
h : hue value (between 0 and 1)
s : saturation value (between 0 and 1)
i : intensity value (between 0 and 1)
r : normalized red value (between 0 and 1)
g : normalized green value (between 0 and 1)
x_f : vector of all RGB values within a 3×3 neighborhood in the input image
x_b : vector of all RGB values within a 3×3 neighborhood in the background image
d_f : difference vector between x_f and the estimate of the true signal direction u
d_b : difference vector between x_b and the estimate of the true signal direction u
u : estimate of the true signal direction
D : sum of the differences
T : threshold
T_static : static threshold
T_adapt : adaptive threshold
O_dc : darkness offset
B_compact : importance factor of the spatiotemporal compactness
s : skin color histogram
n : non-skin color histogram
T_s : total pixel count contained in the skin color histogram
T_n : total pixel count contained in the non-skin color histogram
p : color histogram for the pixels inside the shoulder region
p' : color histogram for the pixels inside the head region
p(u) : bin u of histogram p
p_L(p) : probability based on luminance
p_S(p) : probability based on skin color
S_I(p) : horizontally integrated intensity
p_I(p) : probability based on horizontally integrated intensity
w : weight
µ : average or center of gravity
σ : standard deviation
λ_i : i-th eigenvalue
v_i : i-th eigenvector
W_opt : optimal LDA or ANMM transformation/projection
L_approx : linear transformation/projection that projects Haarlet coefficients to approximated LDA or ANMM coefficients
S_B : between-class scatter matrix
S_W : within-class scatter matrix
S : scatterness matrix
C : compactness matrix
N^o_i : the set of n most similar data which are in the same class as x_i (n nearest homogeneous neighborhood)
N^e_i : the set of n most similar data which are in a different class than x_i (n nearest heterogeneous neighborhood)
H : matrix containing the selected Haarlets, where each row of H is an image of the Haarlet in vector form
N : basis of the null space of H
D : diagonal matrix containing the weights of the eigenvectors
h : Haarlet coefficients
l : ANMM coefficients
i(x, y) : (grayscale) image value at (x, y)
i(x, y, z) : 3D image value (voxel) at (x, y, z)
ii(x, y) : integral image
ii(x, y, z) : integral volume
s(j) : hypothetical state or particle
π(j) : discrete sampling probability
N : number of states
x : horizontal position
y : vertical position
H_M : major axis of the ellipse
H_m : minor axis of the ellipse
β : angle of orientation
c_x : relative horizontal head position
c_y : relative vertical head position
t : iteration
w_{t-1} : multivariate Gaussian random variable
r : relative distance between the pixel and the center of the ellipse
q : target model
ρ[p, q] : Bhattacharyya coefficient between histogram p and target model q
d : Bhattacharyya distance
σ : standard deviation
H(A, B) : Hausdorff distance from set A to B
h(A, B) : directed Hausdorff distance from set A to B
d(a, b) : Manhattan distance between two data points
F : fundamental camera matrix
P : projection matrix
K : calibration matrix (of the left camera) containing the internal camera parameters
K' : calibration matrix (of the right camera) containing the internal camera parameters
E : essential matrix
P : camera matrix (of the left camera)
P' : camera matrix (of the right camera)
X^o : eye coordinate
X^v : finger coordinate
x^o : eye coordinate in the camera image
x^v : finger coordinate in the camera image
L : line connecting two points
X : coordinate in homogeneous system
X^o : eye coordinate in homogeneous system
X^v : finger coordinate in homogeneous system
U_0 : origin of the screen
U_x : vector that defines the x coordinate axis of the screen
U_y : vector that defines the y coordinate axis of the screen
α : screen plane
A_i : basis for the orthogonal complement of the subspace produced by X^v_i - X^o_i
d^2(X, L) : perpendicular distance from point X to line L


Bibliography

[1] P. Viola and M. J. Jones, "Robust real-time object detection," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, p. 747, 2001.

[2] R. Kehl, M. Bray, and L. Van Gool, "Full body tracking from multiple views using stochastic sampling," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 129–136, 2005.

[3] L. Ren, G. Shakhnarovich, J. K. Hodgins, H. Pfister, and P. Viola, "Learning silhouette features for control of human motion," ACM Transactions on Graphics, vol. 24, no. 4, pp. 1303–1331, 2005.

[4] R. Mester, T. Aach, and L. Dumbgen, "Illumination-invariant change detection using a statistical colinearity criterion," Pattern Recognition: Proceedings 23rd DAGM Symposium, pp. 170–177, 2001.

[5] A. Griesser, S. De Roeck, A. Neubeck, and L. Van Gool, "GPU-based foreground-background segmentation using an extended colinearity criterion," Proc. of VMV, pp. 319–326, November 2005.

[6] M. J. C. van Gemert, S. L. Jacques, H. J. C. M. Sterenborg, and W. M. Star, "Skin optics," IEEE Transactions on Biomedical Engineering, vol. 36, pp. 1146–1154, December 1989.

[7] S. D. Cotton and E. Claridge, "Do all human skin colors lie on a defined surface within LMS space?," Technical Report CSR-96-01, School of Computer Science, Univ. of Birmingham, UK, January 1996.

[8] S. J. Schmugge, S. Jayaram, M. C. Shin, and L. V. Tsap, "Objective evaluation of approaches of skin detection using ROC analysis," Computer Vision and Image Understanding, vol. 108, pp. 41–51, 2007.

[9] M. J. Jones and J. M. Rehg, "Statistical color models with application to skin detection," Cambridge Research Laboratory Technical Report Series CRL98/11, December 1998.

[10] B. Jedynak, H. Zheng, M. Daoudi, and D. Barret, "Maximum entropy models for skin detection," Technical Report, publication IRMA, vol. 57, no. 13, 2002.

[11] A. Laurentini, "The visual hull concept for silhouette-based image understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 150–162, 1994.

[12] A. Laurentini, "How many 2D silhouettes does it take to reconstruct a 3D object?," Computer Vision and Image Understanding, vol. 67, no. 1, pp. 81–87, 1997.

[13] K. Shanmukh and A. Pujari, "Volume intersection with optimal set of directions," Pattern Recognition Letters, vol. 12, no. 3, pp. 165–170, 1991.

[14] J. Luck, D. Small, and C. Little, "Real-time tracking of articulated human models using a 3D shape-from-silhouette method," Proceedings of the International Workshop on Robot Vision, pp. 19–26, 2001.

[15] C. Theobalt, M. Magnor, P. Schuler, and H. Seidel, "Combining 2D feature tracking and volume reconstruction for online video-based human motion capture," Proceedings of the 10th Pacific Conference on Computer Graphics and Applications, p. 96, 2002.

[16] I. Mikic, M. Trivedi, E. Hunter, and P. Cosman, "Human body model acquisition and tracking using voxel data," International Journal of Computer Vision, vol. 53, no. 3, pp. 199–223, 2003.

[17] J.-M. Hasenfratz, M. Lapierre, J.-D. Gascuel, and E. Boyer, "Real-time capture, reconstruction and insertion into virtual world of human actors," Proc. of Vision, Video and Graphics, pp. 49–56, 2003.

[18] F. Caillette and T. Howard, "Real-time markerless human body tracking using colored voxels and 3-D blobs," Proc. of ISMAR, pp. 266–267, 2004.

[19] W. Matusik, C. Buehler, and L. McMillan, "Polyhedral visual hulls for real-time rendering," Proceedings of the 12th Eurographics Workshop on Rendering Techniques, pp. 115–126, 2001.

[20] J. Franco and E. Boyer, "Exact polyhedral visual hulls," Proceedings of the British Machine Vision Conference, pp. 329–338, 2003.

[21] K. Kutulakos and S. Seitz, "A theory of shape by space carving," Technical Report TR692, Computer Science Dept., U. Rochester, 1998.

[22] S. Seitz and C. Dyer, "Photorealistic scene reconstruction by voxel coloring," International Journal of Computer Vision, vol. 25, no. 3, 1999.

[23] K. Cheung, T. Kanade, J. Bouguet, and M. Holler, "A real time system for robust 3D voxel reconstruction of human motions," Proceedings of CVPR, vol. 2, pp. 714–720, 2000.

[24] K. Cheung, S. Baker, and T. Kanade, "Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture," Proceedings of CVPR, vol. 1, pp. 77–84, 2003.

[25] R. Szeliski, "Rapid octree construction from image sequences," Computer Vision, Graphics and Image Processing, vol. 58, no. 1, pp. 23–32, 1993.

[26] M. Kolsch and M. Turk, "Robust hand detection," Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 614–619, May 2004.

[27] E.-J. Ong and R. Bowden, "A boosted classifier tree for hand shape detection," Sixth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 889–894, May 2004.

[28] A. S. Micilotta, E. J. Ong, and R. Bowden, "Detection and tracking of humans by probabilistic body part assembly," Proceedings of the British Machine Vision Conference, vol. 1, pp. 429–438, September 2005.

[29] B. Stenger, A. Thayanathan, P. H. Torr, and R. Cipolla, "Model-based hand tracking using a hierarchical Bayesian filter," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 1372–1384, September 2006.

[30] Y.-P. Hung, Y.-S. Yang, Y.-S. Chen, I.-B. Hsieh, and C.-S. Fuh, "Free-hand pointer by use of an active stereo vision system," Proceedings of Third Asian Conference on Computer Vision, vol. 1, pp. 632–639, January 1998.

[31] M. Van den Bergh, E. Koller-Meier, and L. Van Gool, "Fast body posture estimation using volumetric features," IEEE Workshop on Motion and Video Computing, pp. 1–8, January 2008.

[32] M. Van den Bergh, E. Koller-Meier, and L. Van Gool, "Real-time body pose recognition using 2D or 3D Haarlets," International Journal of Computer Vision, vol. 83, pp. 72–84, June 2009.

[33] C. Bregler and J. Malik, "Tracking people with twists and exponential maps," IEEE Conference on Computer Vision and Pattern Recognition, pp. 8–15, 1998.

[34] Q. Delamarre and O. Faugeras, "3D articulated models and multi-view tracking with silhouettes," International Conference on Computer Vision, pp. 716–721, 1999.

[35] D. M. Gavrila and L. Davis, "3D model-based tracking of humans in action: a multi-view approach," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 73–80, 1996.

[36] I. Kakadiaris and D. Metaxas, "Model-based estimation of 3D human motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1453–1459, 2000.

[37] R. Plankers and P. Fua, "Articulated soft objects for video-based body modeling," International Conference on Computer Vision, pp. 493–401, 2001.

[38] I. Mikic, M. Trivedi, E. Hunter, and P. Cosman, "Articulated body posture estimation from multi-camera voxel data," Proceedings of CVPR, vol. 1, pp. 455–460, 2001.

[39] R. Rosales and S. Sclaroff, "Specialized mappings and the estimation of body pose from a single image," IEEE Human Motion Workshop, pp. 19–24, 2000.

[40] G. Shakhnarovich, P. Viola, and T. Darrell, "Estimating articulated human motion with parameter-sensitive hashing," IEEE International Conference on Computer Vision, pp. 750–757, 2003.

[41] I. Cohen and H. Li, "Inference of human postures by classification of 3D human body shape," IEEE Workshop on Analysis and Modeling of Faces and Gestures, p. 74, 2003.

[42] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, pp. 249–257, 2006.

[43] L. Gond, P. Sayd, T. Chateau, and M. Dhome, "A 3D shape descriptor for human pose recovery," V Conference on Articulated Motion and Deformable Objects, pp. 370–379, 2008.

[44] K. Fukunaga, Introduction to Statistical Pattern Recognition (Second Edition). New York: Academic Press, 1990.

[45] F. Wang and C. Zhang, "Feature extraction by maximizing the average neighborhood margin," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1–8, 2007.

[46] M. Van den Bergh, E. Koller-Meier, and L. Van Gool, "Real-time 3D body pose estimation," in Multi-Camera Networks: Concepts and Applications, pp. 335–360, 2009.

[47] K. Nummiaro, E. Koller-Meier, and L. Van Gool, "An adaptive color-based particle filter," Image and Vision Computing, vol. 21, no. 1, pp. 99–110, 2003.

[48] F. Aherne, N. Thacker, and P. Rockett, "The Bhattacharyya metric as an absolute similarity measure for frequency coded data," Kybernetika, pp. 1–7, 1997.

[49] Y. Ke, R. Sukthankar, and M. Hebert, "Efficient visual event detection using volumetric features," International Conference on Computer Vision, pp. 166–173, 2005.

[50] M. Van den Bergh, F. Bosche, E. Koller-Meier, and L. Van Gool, "Haarlet-based hand gesture recognition for 3D interaction," IEEE Workshop on Motion and Video Computing, December 2009, in press.

[51] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly, "Vision-based hand pose estimation: a review," Computer Vision and Image Understanding, vol. 108, pp. 52–73, 2007.

[52] Y. Sato, K. Bernardin, H. Kimura, and K. Ikeuchi, "Task analysis based on observing hands and objects by vision," Proceedings of International Conference on Intelligent Robots and Systems, pp. 1208–1213, 2002.

[53] E. Sanchez-Nielsen, L. Anton-Canalis, and M. Hernandez-Tejera, "Hand gesture recognition for human-machine interaction," Journal of WSCG, vol. 12, February 2004.

[54] C. Papageorgiou, M. Oren, and T. Poggio, "A general framework for object detection," International Conference on Computer Vision, pp. 555–562, 1998.

[55] R. Lienhart and J. Maydt, "An extended set of Haar-like features for rapid object detection," IEEE International Conference on Image Processing, vol. 1, pp. 900–903, 2002.

[56] M. Van den Bergh, W. Servaes, G. Caenen, S. De Roeck, and L. Van Gool, "Perceptive user interface, a generic approach," Computer Vision in Human-Computer Interaction, pp. 60–69, October 2005.

[57] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 780–785, July 1997.

[58] R. Plankers and P. Fua, "Articulated soft objects for video-based body modeling," Proceedings 8th International Conference on Computer Vision, July 2001.

[59] T. Koninckx, A. Griesser, and L. Van Gool, "Real-time range scanning of deformable surfaces by adaptively coded structured light," Fourth International Conference on 3-D Digital Imaging and Modeling (3DIM03), S. Kawada, ed., pp. 293–302, 2003.

[60] T. Koninckx and L. Van Gool, "High-speed active 3D acquisition based on a pattern-specific mesh," vol. 5013, pp. 26–37, January 21-22, 2003.

[61] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Commun. Assoc. Comp. Mach., vol. 24, pp. 381–395, 1981.

[62] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049, 2000.

[63] A. De Luca, R. Mattone, P. R. Giordano, H. Ulbrich, M. Schwaiger, M. Van den Bergh, E. Koller-Meier, and L. Van Gool, "Motion control of the CyberCarpet platform," IEEE Transactions on Control Systems Technology, 2010, in press.

[64] H. Iwata, "Locomotion interface for virtual environments," in Proceedings of the 9th International Symposium on Robotics Research, pp. 275–282, 2000.

[65] J. M. Hollerbach, "Locomotion interfaces," in Handbook of Virtual Environments Technology (K. M. Stanney, ed.), pp. 239–254, Lawrence Erlbaum Associates, 2002.

[66] J. M. Hollerbach, Y. Xu, R. R. Christensen, and S. C. Jacobsen, "Design specifications for the second generation Sarcos Treadport locomotion interface," in Haptic Symposium, Proc. of ASME Dynamic Systems and Control Division, pp. 1293–1298, 2000.

[67] R. C. Hayward and J. M. Hollerbach, "Implementing virtual stairs on treadmills using torso force feedback," in Proc. IEEE Int. Conf. on Robotics and Automation, (Washington, DC), pp. 586–591, 2002.

[68] D. Checcacci, J. M. Hollerbach, R. Hayward, and M. Bergamasco, "Design and analysis of a harness for torso force applications in locomotion interfaces," in Proceedings of the EuroHaptics Conference, (Dublin, IR), pp. 53–67, 2003.

[69] H. Noma, T. Sugihara, and T. Miyasato, "Development of ground surface simulator for Tel-E-Merge system," in Proc. IEEE Virtual Reality Conf., pp. 217–224, 2000.

[70] H. Iwata, H. Yano, H. Fukushima, and H. Noma, "CircularFloor," IEEE Computer Graphics and Applications, vol. 25, pp. 64–67, 2005.

[71] R. Darken, W. Cockayne, and D. Carmein, "The omnidirectional treadmill: A locomotion device for virtual worlds," in Proceedings of the Symposium on User Interface Software and Technology, pp. 213–221, 1997.

[72] H. Iwata, "The torus treadmill: Realizing locomotion in VEs," IEEE Computer Graphics and Applications, vol. 9, pp. 30–35, 1999.

[73] K. J. Fernandes, V. Raja, and J. Eyre, "Cybersphere: The fully immersive spherical projection system," Communications of the ACM, vol. 46, pp. 141–146, 2003.

[74] A. Nagamori, K. Wakabayashi, and M. Ito, "The ball array treadmill: A locomotion interface for virtual worlds," in Workshop on New Directions in 3D User Interfaces (at VR 2005), (Bonn, D), 2005.

[75] J.-Y. Huang, "An omnidirectional stroll-based virtual reality interface and its application on overhead crane training," IEEE Transactions on Multimedia, vol. 5, pp. 39–51, 2003.

[76] M. Schwaiger, T. Thummel, and H. Ulbrich, "Cyberwalk: An advanced prototype of a belt array platform," Proceedings of the IEEE International Workshop on Haptic Audio Visual Environments and their Applications, 2007.

[77] M. Van den Bergh, J. Halatsch, A. Kunze, F. Bosche, L. Van Gool, and G. Schmitt, "A novel camera-based system for collaborative interaction with multi-dimensional data models," Proceedings of the 9th International Conference on Construction Applications of Virtual Reality, pp. 19–28, November 2009.

[78] J. Halatsch and A. Kunze, "Value Lab: Collaboration in space," Proceedings of the International Conference on Information Visualization, pp. 376–381, July 2007.

[79] T. Svoboda, D. Martinec, and T. Pajdla, "A convenient multi-camera self-calibration for virtual environments," PRESENCE: Teleoperators and Virtual Environments, vol. 14, pp. 407–422, August 2005.

[80] P. Doubek, T. Svoboda, and L. Van Gool, "Monkeys - a software architecture for ViRoom - low-cost multicamera system," Proceedings of the International Conference on Computer Vision Systems, pp. 386–395, April 2003.


Curriculum Vitae

Name: Michael Van den Bergh
Date of Birth: 26.12.1981
Place of Birth: Leuven, Belgium
Citizenship: Belgium

Education

2005-2009  Ph.D. in Electrical Engineering, ETH Zurich, Computer Vision Laboratory (BIWI)

2001-2004  M.Sc. in Electrical Engineering, KU Leuven, Department of Electrical Engineering (ESAT)

1999-2001  Ba. of Engineering, KU Leuven, Faculty of Engineering Sciences

1993-1999 Sint-Jan Berchmanscollege, Brussels

Academic Experience

2005-2010  Research Assistant, ETH Zurich, Computer Vision Laboratory (BIWI)

2004-2005  Research Assistant, KU Leuven, Centre for Processing Speech and Images (PSI)