
University of London
Imperial College of Science, Technology and Medicine

Department of Computing

Markerless Visual Tracking and Motion Analysis for Sports Monitoring

Julien Pansiot

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the University of London and the Diploma of Imperial College, April 2009


Declaration

I herewith certify that all material in this dissertation which is not my own work has been properly acknowledged.

Julien Pansiot


Abstract

In the past decade, detailed biomechanical motion analysis has become an important part of athletic training and performance evaluation. However, most commercially available systems are obtrusive and require complicated experimental setups and dedicated laboratory settings. With recent advances in smart vision sensors, markerless vision-based approaches have attracted significant interest for detailed sport motion analysis, as they do not involve the placement of external fiducials and thus provide pervasive measurement of athletes without affecting their normal performance. This thesis presents a robust real-time vision-based tracking system based on a miniaturised, low-power, autonomous Vision Sensor Network (VSN). The system is able to provide real-time motion monitoring with on-node processing. Detailed technical issues concerning background and silhouette segmentation, canonical view generation, 3D model-based motion parametrisation and reconstruction, wearable-ambient sensor fusion, and the extraction of biomechanical motion indices are addressed.

The proposed method is applied to motion tracking of indoor tennis training and performance evaluation. Ambiguities due to occlusion, viewpoint dependency and a lack of depth information are reduced by the deployment of a VSN. A method is proposed to derive 2D canonical views from a set of input video sequences to facilitate consistent motion monitoring. Further steps for 3D model reconstruction and parametrisation are also proposed, which involve the use of spherical harmonic parametrisation for deriving a compact 3D shape descriptor. The remaining uncertainties in motion analysis are resolved by predictive tracking based on a biomechanical model, such that issues related to occlusion are avoided. To further incorporate high-frequency biomechanical information such as ground reaction forces, a sensor fusion framework based on an integrated use of wearable and vision-based ambient sensors is proposed. The practical value of the proposed framework is demonstrated with a systematic implementation of a VSN for tennis training, which provides real-time automated generation of player motion profiles.


Acknowledgments

Looking back at the end of this four-year research ultramarathon, if I were to organise a garden party to thank all the people who pushed me towards the finish line, a decent-sized crowd would be invited.

First, I would like to express my gratitude to my supervisor, Professor Guang-Zhong Yang, for his constant support, guidance and suggestions through the course of my PhD. I have learnt a lot from him, most particularly in terms of oral presentation skills. Furthermore, I would like to point out that his rise to the challenge of reading this thesis formatted with LaTeX has been sincerely appreciated.

I would like to thank Karl Cooke from the Lawn Tennis Association (LTA) for facilitating numerous experiments on high-profile tennis players and for his insights into the needs of coaches and players.

I would like to thank all my colleagues from the Visual Information Processing group at Imperial College London for their collaboration and camaraderie, and more particularly those with whom I have been working most closely: Omer Aziz, Ahmed Elsaify, Rachel King, Benny Lo, Douglas McIlwraith, Danail Stoyanov, Johannes Totz and Alexander Warren.

Because life was not all about the PhD (although this is open for debate), I would like to thank all the Thursday people who kept me sane, as well as my friends from the UK and overseas for their moral support, despite their tendency to ask far too often when all this would eventually finish, as illustrated below.

Last but not least, I would like to thank my parents, who initiated my scientific curiosity as far back as I can remember, which eventually led to this PhD. Their invaluable patience and care have carried me throughout this epic journey.

“Piled Higher and Deeper” by Jorge Cham (www.phdcomics.com) – reproduced with authorisation from the author [41].


‘A journey of a thousand miles begins with a single step.’

Laozi


Relevant Publications

1. Louis Atallah, Mohamed ElHelw, Julien Pansiot, Danail Stoyanov, Lei Wang, Benny Lo, and Guang-Zhong Yang. Behaviour profiling with ambient and wearable sensing. In IFMBE Proceedings of the 4th International Workshop on Wearable and Implantable Body Sensor Networks 2007 (BSN), pages 133–138, Aachen, Germany, 2007.

2. Omer Aziz, Benny Lo, Julien Pansiot, Louis Atallah, Guang-Zhong Yang, and Ara Darzi. From computers to ubiquitous computing by 2010: healthcare. Philosophical Transactions of The Royal Society A, 366(1881):3805–3811, 2008.

3. Mohamed ElHelw, Julien Pansiot, Douglas McIlwraith, Raza Ali, Benny Lo, Louis Atallah, and Guang-Zhong Yang. An integrated multi-sensing framework for pervasive healthcare monitoring. Accepted for publication in Pervasive Health 2009, 2009.

4. Rachel King, Douglas McIlwraith, Benny Lo, Julien Pansiot, Alison McGregor, and Guang-Zhong Yang. Body sensor networks for monitoring rowing technique. In IEEE Proceedings of the 6th International Workshop on Wearable and Implantable Body Sensor Networks (BSN), pages 251–255, Berkeley, CA, 2009.

5. Benny Lo, Julien Pansiot, and Guang-Zhong Yang. Bayesian analysis of sub-plantar ground reaction force with BSN. In IEEE Proceedings of the 6th International Workshop on Wearable and Implantable Body Sensor Networks (BSN), pages 133–137, Berkeley, CA, 2009.

6. Douglas McIlwraith, Julien Pansiot, James Ballantyne, Salman Valibeik, and Guang-Zhong Yang. Structure learning for activity recognition in robotic assisted intelligent environments. To appear in IEEE/RSJ International Conference on Intelligent RObots and Systems (IROS), St. Louis, MO, 2009.

7. Douglas McIlwraith, Julien Pansiot, Surapa Thiemjarus, Benny Lo, and Guang-Zhong Yang. Probabilistic decision level fusion for real-time correlation of ambient and wearable sensors. In IEEE Proceedings of the 5th International Workshop on Wearable and Implantable Body Sensor Networks 2008 (BSN), pages 117–120, Hong Kong, China, 2008.

8. Julien Pansiot, Ahmed Elsaify, Benny Lo, and Guang-Zhong Yang. RACKET: Real-time autonomous computation of kinematic elements in tennis. In IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops) – Fifth IEEE Workshop on Embedded Computer Vision, pages 773–779, Kyoto, Japan, 2009.


9. Julien Pansiot, Rachel King, Douglas McIlwraith, Benny Lo, and Guang-Zhong Yang. ClimBSN: climber performance monitoring with BSN. In IEEE Proceedings of the 5th International Workshop on Wearable and Implantable Body Sensor Networks 2008 (BSN), pages 33–36, Hong Kong, China, 2008.

10. Julien Pansiot, Benny Lo, and Guang-Zhong Yang. A simulator for distributed ambient intelligence sensing. In IEE Proceedings of the 2nd International Workshop on Wearable and Implantable Body Sensor Networks (BSN), page 119, London, April 2005.

11. Julien Pansiot, Danail Stoyanov, Benny Lo, and Guang-Zhong Yang. Towards image-based modeling for ambient sensing. In IEEE Proceedings of the 3rd International Workshop on Wearable and Implantable Body Sensor Networks 2006 (BSN), pages 195–198, Cambridge, MA, April 2006.

12. Julien Pansiot, Danail Stoyanov, Douglas McIlwraith, Benny Lo, and Guang-Zhong Yang. Ambient and wearable sensor fusion for activity recognition in healthcare monitoring systems. In IFMBE Proceedings of the 4th International Workshop on Wearable and Implantable Body Sensor Networks 2007 (BSN), pages 208–212, Aachen, Germany, 2007.

13. Surapa Thiemjarus, Julien Pansiot, Douglas McIlwraith, Benny Lo, and Guang-Zhong Yang. An integrated inferencing framework for context sensing. In IEEE Proceedings of the 5th International Conference on Information Technology and Applications in Biomedicine 2008 (ITAB), Shenzhen, China, 2008.


Contents

1 Introduction

2 Biomechanics of the Human Motion
  2.1 Human biomechanics models
    2.1.1 Generalities
    2.1.2 Bones and articulations
    2.1.3 Muscles
    2.1.4 Kinetic chain
  2.2 Biomechanical measures
  2.3 Practical applications of biomechanics
    2.3.1 Running
  2.4 Biomechanics of tennis movements
    2.4.1 Motion on the court
    2.4.2 Tennis stroke
    2.4.3 Key measures
  2.5 Summary and conclusion
    2.5.1 Summary
    2.5.2 Conclusion

3 Visual Motion Monitoring and Activity Recognition
  3.1 Ambient sensing
    3.1.1 Non-visual sensors
    3.1.2 Vision-based sensors
    3.1.3 Vision Sensor Networks
    3.1.4 Privacy considerations in ubiquitous sensing
    3.1.5 Vision-based approach challenges
  3.2 Segmentation and motion evaluation
    3.2.1 Pixel-based segmentation
    3.2.2 Region-based segmentation
    3.2.3 Frame-based segmentation
    3.2.4 Stereo-based segmentation
    3.2.5 Motion-based segmentation
    3.2.6 Hybrid methods and semantic input
  3.3 Human modeling and tracking
    3.3.1 Marker-based tracking methods
    3.3.2 Markerless vision-based tracking
  3.4 Conclusion

4 Motion Reconstruction from Monocular Vision
  4.1 Related work
  4.2 VSN node design
  4.3 System configuration
    4.3.1 Node placement and focal length choice
    4.3.2 Node calibration
  4.4 Tennis player tracking
    4.4.1 On-node background segmentation
    4.4.2 Monocular player tracking
    4.4.3 Multiview player tracking
    4.4.4 Computational load analysis
  4.5 Practical applications and results
    4.5.1 Experimental setup
    4.5.2 User queries
    4.5.3 Strategy analysis
    4.5.4 Tracking accuracy validation
  4.6 Conclusions
    4.6.1 Current limitations
    4.6.2 Conclusion

5 Generation of Canonical Views for Motion Reconstruction
  5.1 Image-Based Modelling and Rendering
    5.1.1 Characteristics and classification
    5.1.2 Image and view morphing
    5.1.3 Interpolation from dense samples
    5.1.4 Image mosaics
    5.1.5 3D-based methods
  5.2 Canonical novel views generation
    5.2.1 Image-Based Visual Hulls
    5.2.2 Automated canonical view rendering
  5.3 Application to tennis player tracking
    5.3.1 Tennis strokes recognition
    5.3.2 Experimental results
  5.4 Summary and conclusion
    5.4.1 Summary
    5.4.2 Conclusion

6 Subject-Centric 3D Model Descriptors
  6.1 3D shape descriptors review
    6.1.1 Common 3D shape descriptors
    6.1.2 Spherical harmonics
    6.1.3 Spherical harmonics-based rotationally invariant descriptors
  6.2 Compact descriptor generation
    6.2.1 Fuzzy Visual Hulls
    6.2.2 Subject-centric descriptors
    6.2.3 Spherical harmonic-based descriptor
    6.2.4 Cylindrical extend-based descriptor
    6.2.5 Distributed prototype
  6.3 Application to tennis stroke recognition
    6.3.1 Signature examples
    6.3.2 Tennis strokes classification
  6.4 Summary and conclusion
    6.4.1 Summary
    6.4.2 Conclusion

7 Model-Based Motion Analysis
  7.1 Human biomechanical model
    7.1.1 Model design
    7.1.2 Superquadrics
    7.1.3 Activity modelling
    7.1.4 Activity learning
  7.2 Estimation stage
    7.2.1 Model fitting
    7.2.2 Bootstrapping
    7.2.3 Colour-based self-occlusions handling
  7.3 Prediction stage
    7.3.1 Joint-level state prediction
    7.3.2 Body-level motion prediction
  7.4 Application and results
    7.4.1 Running stride analysis
    7.4.2 Tennis stroke monitoring
    7.4.3 Noise impact on the tracking accuracy
  7.5 Summary and conclusion
    7.5.1 Summary
    7.5.2 Conclusion

8 Integration of Ambient and Wearable Sensors
  8.1 Wearable sensing
    8.1.1 Motion and position sensors
    8.1.2 Force and pressure sensors
    8.1.3 Physiological sensors
    8.1.4 Other wearable sensing modalities
  8.2 Sensor fusion: the best of both worlds
    8.2.1 Sensor fusion levels
    8.2.2 Expectation-Maximisation and Gaussian Mixture Models
  8.3 Application and results
    8.3.1 System design
    8.3.2 Ball contact detection
    8.3.3 Feature extraction
    8.3.4 Feature-level sensor fusion
    8.3.5 Decision-level and hybrid sensor fusion
  8.4 Conclusion

9 Conclusions and Future Work
  9.1 Summary
  9.2 Conclusions
  9.3 Potential improvements and future work

Bibliography


List of Figures

2.1 Left: Toppling occurs when the centre of mass does not lie directly above the support area. Right: the quadriceps angle or Q-angle.
2.2 Muscle force production; the moment of the muscle is greater on the right, allowing more torque to be produced from the same force.
2.3 The Hill muscle model.
2.4 Hill's curve for a muscle with the following properties: Tmax = 12,000 N, Vmax = 2 m/s, Pmax = 2300 W.
2.5 Motion sequence on the sagittal plane of a runner at 25 Hz. Top: video sequences recorded. Bottom: joint positions derived from three-dimensional (3D) marker-based optical tracking at 250 Hz.
2.6 Foot pressure distribution over time during ground contact for three different strikes.
2.7 Ground reaction force for three different strikes.
2.8 Main leg joint angles in the sagittal plane during a running cycle.
2.9 Biomechanics of the tennis serve.
3.1 VSN nodes.
3.2 Pixel colour histograms for background segmentation.
3.3 GMM used for adaptive background modelling.
3.4 Top: a video sequence showing a serve action in tennis. Bottom: GMM-based pixel-level background segmentation. Noise and shadows can be observed.
3.5 Top: raw binary image segmentation for the serve sequence shown in Figure 3.4. Bottom: after erosion and dilation.
3.6 Top: original video sequence. Bottom: Horn-Schunck's optical flow [99]. Red lines denote larger flow magnitudes; it can be observed that the insufficient frame rate leads to poor approximation of the racket motion in the last frame.
3.7 Optical flow intensity conservation constraint.
3.8 Background segmentation comparison carried out by Toyama et al. [220].
3.9 The main steps involved in a general video-based tracking system.
3.10 Commonly used human shape models.
3.11 Kalman filter used for tracking, following the general principle illustrated in Figure 3.9.
4.1 The main components of a self-contained VSN module.
4.2 Tennis court corner detection.
4.3 Tennis court corners automatically detected on-node for camera calibration. Left: corner detection order. Right: actual detection result.
4.4 Distortion correction. Left: original image. Right: corrected image; aliasing effects due to the integer nature of the filter are visible on the court lines.
4.5 Left: original image as captured by the VSN node camera using a wide-angle lens (focal length 2.2 mm). Right: on-node binary blob segmentation and AABB computation.
4.6 Position fusion between two VSN nodes.
4.7 Software optimisation for on-node processing showing the computational loading of the main operations involved in the player tracking.
4.8 The effect of court coverage with different camera configurations. Left: two symmetric sensors. Right: two sensors tracking from the same side of the court.
4.9 Near and far sides of the court as seen by the VSN node.
4.10 Tennis player tracking interface running on an Apple iPod Touch.
4.11 Two court occupancy plots derived during a set for two competitors.
4.12 Simultaneous playback of the player movements on the court.
4.13 Tennis court zones used for winning patterns recognition.
4.14 Tennis winning pattern recognition: "Against a wide serve - return deep crosscourt" [221].
4.15 Metric grid added to the tennis court for tracker accuracy evaluation.
4.16 Example of projective error due to incorrect background segmentation at knee level (simulated).
4.17 Factors of error in the player tracking.
5.1 Computer Graphics, Computer Vision and IBMR.
5.2 A classification of IBMR methods according to the number of views required against the use of 3D aspects.
5.3 Schematic illustration of the basic principle of space carving.
5.4 The basic principle of IBVH: sample the intersection of the perspective-projected silhouettes according to the desired POV.
5.5 IBVH: 3 original images and shaded novel view.
5.6 IBVH: slightly inaccurate 3D model leads to shading errors.
5.7 Stitching the seams. Left: original image. Right: stitched image.
5.8 Comparison of real image against novel view.
5.9 Workflow of the key steps involved in generation of a canonical novel view of a subject based on two-pass sampling.
5.10 Left to right: original colour image, binary background segmentation, thinning, pseudo-skeleton, key element detection.
5.11 Binary silhouettes derived from the player's video sequences acquired by two cameras. Left to right: stand, wait for serve return, forehand, backhand and serve. Inaccurate lower-leg segmentation can be observed for some postures.
5.12 Left: reference binary segmented images. Right: novel canonical view depth map derived from IBVH computation.
5.13 Recognition rate (sensitivity) by activity for two different player general orientations.
6.1 Extended Gaussian Image (EGI) [100].
6.2 Radial distribution: the shape is sampled with rays originating from its centre to determine the average and the variance, for instance.
6.3 Spherical Extend Function (SEF).
6.4 Shape decomposition into shells, into sectors, and the combined method proposed by [10].
6.5 The first-degree spherical harmonics (real part). Positive lobes are represented in red, whereas negative lobes are blue.
6.6 A cube reconstructed with spherical harmonics under increasing degrees (0, 4, 8, 20).
6.7 An overview of the proposed FVH framework pipeline for compact posture signature generation.
6.8 Regular IBVH and FVH comparison.
6.9 The effect of centre translation on spherical harmonic coefficients.
6.10 Coefficients 0, 1 and 2 of the spherical harmonics decomposition performed on real video data.
6.11 Tackling the centre falling outside of the 3D model with two different strategies.
6.12 Example of CEF on a subject performing squats. The axis is clearly in front of the torso, leaving large empty (white) areas.
6.13 CEF on a 3D model. Two slices are given as example. Each slice is converted as a 1D polar extend function r = f(a).
6.14 Distributed prototype of the spherical harmonics pipeline.
6.15 Examples of spherical harmonic signatures extracted from a tennis player's motion over time.
6.16 Classification confusion matrix (total and normalised): some serves and backhands are incorrectly classified as forehand. 'W' stands for 'wait for return', 'F' for forehand, 'B' for backhand and 'S' for serve.
7.1 Articulated human model structure (left) and its representation based on tapered superellipsoids (right).
7.2 Superquadrics examples.
7.3 An example of activity graph factorisation. Top: original sequential KP graph. Bottom: cyclic activity factorisation.
7.4 The four KPs used for running motion modelling.
7.5 Generate-and-test strategy: estimation loop of the 2D/3D posture matching algorithm.
7.6 Original binary blob, colour-segmented blob and coloured reprojected model.
7.7 Monocular tracking with activity model.
7.8 Monocular tracking with colour-based matching.
7.9 Examples of randomly added noise for tracking resilience evaluation.
7.10 Influence of moderate SNR in the reference image on the tracking accuracy: the tracker starts drifting before stabilising at an error of 60 to 150 mm.
7.11 Influence of poor SNR in the reference image on tracking accuracy: in both examples the tracker loses track before partially recovering.
8.1 An ear-worn Activity Recognition (e-AR) sensor [137] developed at Imperial College London and the signal recorded during a jump.
8.2 Different levels of sensor fusion (from Body Sensor Networks [245]).
8.3 Multi-level sensor fusion with feedback loops as proposed in [150].
8.4 Example of a dataset fitted by a two-component GMM.
8.5 Clustering using GMM.
8.6 Original video examples. Left: strong motion blur during a serve with the arm and racket invisible and generally poor chrominance information. Right: camera motion blur due to a ball contact with the net.
8.7 Location of the wireless inertial sensor on the tennis racket.
8.8 Raw acceleration captured by an accelerometer mounted on a tennis racket for four different strokes.
8.9 Classification: tennis strokes confusion matrices.
8.10 Voting schemes used for decision-level sensor fusion. Left: GMM and KNN flat decision-level fusion. Right: hybrid fusion based on GMM only.


List of Tables

2.1 Estimations of the contributions to the racket velocity during a tennis serve.
2.2 Key biomechanical measures.
3.1 Overview of the main VSN platforms.
4.1 VSN node image output size and frame rate comparison.
5.1 Comparison estimators.
5.2 Elementary actions recognition rates: sensitivity.
5.3 Elementary actions recognition rates: specificity.
6.1 Shape descriptors overview.
6.2 Automated stroke and posture recognition with rotationally invariant descriptor.
7.1 Treadmill speed estimation with model-based tracking.
7.2 Tracking error under increasing levels of noise. All values are expressed in millimetres and computed from the first six frames of the sequence.
8.1 Ball contact detection using the racket inertial sensor. The volleys cause issues due to the relatively low motion intensity. False positives are due to the player dribbling with the ball.
8.2 Features used in classification.
8.3 Comparison of stroke classification rates between using a vision sensor alone and with the proposed combined system.
8.4 Comparison of stroke classification rates between the feature-level, decision-level and hybrid sensor fusion methods.


List of Abbreviations

2D Two-dimensional
3D Three-dimensional
AABB Axis-Aligned Bounding Box
AJAX Asynchronous Javascript And XML
APAS Ariel Performance Analysis System
ARM Advanced RISC Machine
BN Bayesian networks
BNT Bayes Net Toolkit
BP Belief Propagation
BSN Body Sensor Network
CEF Cylindrical Extend Function
CHMM Coupled Hidden Markov Model
CMOS Complementary Metal Oxide Semiconductor (image sensor)
CONDENSATION CONditional DENSity propagATION (algorithm)
CRT Cathode Ray Tube (monitor)
DOF Degrees Of Freedom
DOG Difference Of Gaussian
DSP Digital Signal Processor
e-AR ear-worn Activity Recognition (sensor)
ECG Electrocardiogram
EGI Extended Gaussian Image
EKF Extended Kalman Filter
EM Electro-Magnetic (tracking)
EM Expectation-Maximisation (algorithm)
FFT Fast Fourier Transform
FOV Field-Of-View
FPGA Field-Programmable Gate Array
FVH Fuzzy Visual Hulls
GMM Gaussian Mixture Model
GPS Global Positioning System
GRF Ground Reaction Force
HMM Hidden Markov Model
IBM Image-Based Modelling
IBMR Image-Based Modelling and Rendering
IBR Image-Based Rendering
IBVH Image-Based Visual Hulls
IMU Inertial Measurement Unit
ICA Independent Component Analysis
IR Infrared


JTAG Joint Test Action Group (standard)
KNN k-Nearest Neighbour
KP Key Posture
LAN Local Area Network
LASER Light Amplification by Stimulated Emission of Radiation
LED Light-Emitting Diode
LTA Lawn Tennis Association
MCU Microcontroller
MDS Multi-Dimensional Scaling
MEMS Microelectromechanical System
MHI Motion History Image
NCC Normalised Cross-Correlation
NTC National Tennis Centre
OBB Oriented Bounding Box
PCA Principal Components Analysis
PD Parkinson's disease
PDF Probability Density Function
PIR Passive Infrared (sensor)
POV Point-Of-View
PPG Photoplethysmograph
RAM Random Access Memory
RFID Radio-Frequency Identification
RISC Reduced Instruction Set Computing
RMS Root Mean Square
ROI Region Of Interest
RSD Reflective Symmetry Descriptors
RSSI Received Signal Strength Indicator
SAI Simplex Angle Image
SDRAM Synchronous Dynamic Random Access Memory
SEF Spherical Extent Function
SFM Structure From Motion
SFS Structure From Silhouette
SIFT Scale-Invariant Feature Transform
SLAM Self-Localisation And Mapping
SMC Sequential Monte Carlo
SNR Signal-to-Noise Ratio
SPI Serial Peripheral Interface (bus)
SSD Sum-of-Square Difference
TOF Time Of Flight (camera)
USB Universal Serial Bus
USTA US Tennis Association
VIP Visual Information Processing (group)
VSN Vision Sensor Network
WLAN Wireless Local Area Network
XML Extensible Markup Language


Chapter 1

Introduction

For elite athletes, a tiny margin often separates a medallist from a finisher. This gap is so small that even the most subtle physical adjustments or mental conditioning can make a significant difference to the final outcome of major competitions. For this reason, research in sport science has focused on the interplay of biomechanics, exercise physiology and psychology. For human motion modelling, the human body can be formalised with biomechanical tools that rationalise the complex effects and consequences of dynamic motion alteration. The contribution of such biomechanical understanding is twofold. Firstly, measurements carried out on a large cohort of athletes provide insights into the intrinsic differences between national- and world-level athletes. Secondly, serial biomechanical measurements on individual athletes help to identify key performance issues for further improvement or fine-tuning.

In order to record such biomechanical parameters objectively, a number of devices have been developed over the years. Thus far, two families of biomechanical parameter acquisition devices have been commonly employed: motion capture systems and force sensors. Most commercially available motion capture systems involve a large set of wearable sensors or optical markers. The use of such optimally placed and highly visible markers allows for the monitoring of an athlete's motion with great accuracy. However, there are two major drawbacks associated with this solution. Firstly, the complexity of such systems prevents de facto their usage on a regular basis; they are therefore typically used in a laboratory setting to acquire a snapshot of the athlete's performance. Secondly, these systems are obtrusive, which can inadvertently interfere with the normal performance of the athlete. As a consequence, they are not used during competitions. Force sensors, on the other hand, are usually packaged into the athlete's insoles or integrated into a specifically engineered floor or exercise device. Although less intrusive, they cannot be used in competitive events for similar reasons.

Video-based monitoring without the use of wearable markers is a preferred tool of choice, as it does not impair the motion of the athlete and yet can capture relatively detailed body motion. Therefore, it provides more realistic measurements that can be used on a regular basis or during competitions. Compared to other sensing modalities, however, video sensors or cameras do not provide ready-to-use motion or biomechanical measures. Further image processing is required to derive indices such as location, velocity, acceleration or other biomechanical parameters. Therefore, a class of application-dependent algorithms must be developed to process the large amount of data generated by multiple cameras and provide meaningful high-level information to athletes and coaches.

Whilst some applications may rely on a single video source, issues such as a limited Field-Of-View (FOV), a lack of depth information, and ambiguities due to occlusions have called for the use of multiple cameras in practical applications. Further requirements on the ease of deployment, and the desire to forgo streaming video data to a central workstation, have recently led to the development of Vision Sensor Networks (VSNs). A VSN is a collaborative network composed of multiple self-contained vision sensors, sometimes referred to as smart cameras. Such nodes typically encompass a camera, a processing unit, a communication interface, and a battery. The main drive behind the VSN concept is to perform most of the low-level processing on-node and high-level inferencing in a distributed fashion, optimising resource usage in terms of both processing and communication bandwidth.
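As a schematic illustration of this principle (and not of the embedded implementation described in Chapter 4), the following Python sketch segments a synthetic frame against a background model and emits only a compact feature message, as a VSN node would:

```python
import numpy as np

# Schematic sketch of the on-node VSN principle: low-level vision runs on
# the node itself and only compact features are transmitted. The frames
# here are synthetic greyscale arrays.

def segment_foreground(frame, background, threshold=25):
    """Binary foreground mask obtained by background differencing."""
    return np.abs(frame.astype(int) - background.astype(int)) > threshold

def blob_features(mask):
    """Centroid and Axis-Aligned Bounding Box (AABB) of the foreground."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return {"centroid": (float(xs.mean()), float(ys.mean())),
            "aabb": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))}

background = np.zeros((120, 160), dtype=np.uint8)
frame = background.copy()
frame[40:90, 70:95] = 200                      # a synthetic "player" blob

features = blob_features(segment_foreground(frame, background))
print(features)   # only this compact message would leave the node, not video
```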

When multiple views of the same athlete are available, Image-Based Modelling and Rendering (IBMR) techniques allow the generation of three-dimensional (3D) models or novel two-dimensional (2D) views. When designing an IBMR-based framework, the temptation is to generate an explicit 3D representation of the subject. Whilst this may be a promising solution, 3D reconstructions are often far too complex to be processed in real-time with limited computational resources. Therefore, an efficient means of dimensionality reduction is required before further data analysis can be applied.

In the context of monitoring elite athletes for performance optimisation, the general models proposed above are usually not precise enough, as the explicit and quantitative biomechanical parameters derived may be inaccurate. To circumvent these problems, explicit human biomechanical models need to be employed. Such models, for example, can consist of “stick figures” that are matched with the source images for reconstructing key biomechanical indices.

Ultimately, one of the main weaknesses of vision-based ambient sensors lies in their inability to detect detailed high-frequency motion information, which in many cases can only be detected with the use of wearable devices. Improved accuracy can be achieved by combining both ambient and wearable sensors.

This thesis represents an innovative use of VSNs for biomechanical sensing. By using tennis training as an exemplar, it details a model-based vision algorithm framework combined with wearable sensing for high-accuracy, minimally intrusive sport performance measurement.

The main contributions of this thesis include:


• Implementation of a robust real-time tennis player tracking system on an unobtrusive, low-power, autonomous VSN node;

• Demonstration of an automated framework for the generation of player motion profiles and statistics;

• 3D subject reconstruction from multiple views and generation of canonical views;

• Derivation of a compact rotationally invariant signature based on subject-centric 3D shape descriptors;

• Model-based tracking from monocular vision with detailed validation results;

• Demonstration of the practical value of ambient and wearable sensor fusion for enhanced system performance.

The rest of this thesis is organised as follows. In Chapter 2, elementary concepts of human motion biomechanics are introduced. Key biomechanical parameters, including joint angles and velocities, as well as forces applied to external systems such as the ground or a sport instrument, are discussed. This is then followed by a review of the current state-of-the-art systems for monitoring such parameters, which include marker-based motion capture systems and force sensors.

A review of the current methods for vision-based human motion monitoring is provided in Chapter 3. Vision-based object tracking is a well-researched topic and there are many potential algorithms that can be implemented on a VSN, although not all are suitable for real-time on-node processing with distributed inferencing. In this chapter, techniques related to subject segmentation from the background, motion-specific feature detection, and predictive tracking based on Kalman filtering are discussed.

In Chapter 4, a miniaturised wireless VSN hardware platform is introduced and the system is applied to real-time tennis tracking. In order to derive the 3D world coordinates of the player on the court from the 2D image silhouette, an accurate calibration process is necessary, which generates a mapping function by computing the position of the camera in 3D with respect to the court. In order to simplify its practical use for training, standard court markings are used as a calibration grid. This, along with the tracking algorithm, is implemented on a self-contained battery-powered wireless VSN module weighing less than 100 grams. Further post-processing of the derived coordinate information allows the generation of play tactics and court movement/coverage during the game.
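As an illustration of the idea of calibrating from court markings, the sketch below estimates a homography from four detected corner pixels (made-up values) to the known dimensions of a doubles court, and maps an image point to court coordinates; the actual on-node calibration also recovers the 3D camera position and corrects lens distortion:

```python
import numpy as np
import cv2

# Sketch of calibration from court markings. The four outer corners of the
# doubles court (10.97 m x 23.77 m) are assumed to have been detected in
# the image; the pixel coordinates below are made up.
court_m = np.array([[0, 0], [10.97, 0], [10.97, 23.77], [0, 23.77]],
                   dtype=np.float32)           # court corners, metres
corners_px = np.array([[102, 418], [536, 421], [472, 95], [168, 92]],
                      dtype=np.float32)        # detected corners, pixels

H, _ = cv2.findHomography(corners_px, court_m)

# Map the player's feet position from image pixels to court coordinates.
foot_px = np.array([[[320.0, 300.0]]], dtype=np.float32)
foot_court = cv2.perspectiveTransform(foot_px, H)
print(foot_court)                              # [[[x_metres, y_metres]]]
```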

The technique proposed in Chapter 4 is based on the perspective projection of the player on the court, and the measurements derived may be biased depending on the viewing angle. To rectify this problem, a framework for canonical view generation is presented in Chapter 5. This ensures that the spatio-temporal measurements from which dynamic indices are derived can be consistently calculated, and are therefore free from projective bias. The contribution of these novel canonical views lies in their Point-Of-View (POV) invariant representation. In this chapter, the influence of the relative position and orientation of the subject with respect to the camera, with and without the use of the canonical view representation, is analysed in detail. The proposed algorithm greatly facilitates the application of further processing algorithms by enforcing orientation constraints.

Spatio-temporal tracking of an articulated object, such as a tennis player, involves high-dimensional feature vectors. An approach to dimensionality reduction through the use of compact descriptors is presented in Chapter 6. These 3D shape descriptors aim to provide a low-dimensional yet meaningful representation of the original data. The technical focus of this chapter is on subject-centric and rotationally invariant descriptors, because they further reduce the dimensionality of the search space. To this end, the dimension of the 3D model is first reduced by the use of a subject-centred spherical reprojection algorithm, which is a parametrisation of the model's surface in a spherical coordinate system. Further dimensionality reduction is provided by the application of spherical harmonics. A compact, rotationally invariant shape signature is then derived from the spherical harmonic coefficients. One important advantage of these descriptors compared to more explicit models is their ability to mask out the subject's appearance information. Indeed, under the VSN paradigm, the video data observed by the device can be processed on-board in real-time, and the sensor only transfers abstract signal features. This ensures privacy, which is important for the use of the proposed platform in applications where privacy is a high priority.
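The principle of such a rotationally invariant signature can be illustrated with the following sketch, which applies the decomposition to a synthetic subject-centred radial function standing in for the spherical reprojection of the 3D model; it illustrates the idea rather than the thesis pipeline:

```python
import numpy as np
from scipy.special import sph_harm

# A radial (extent) function is sampled on a sphere, projected onto
# spherical harmonics, and one energy value is kept per degree l; the
# per-degree norms are invariant to rotations of the subject.
L_MAX = 8
n_theta, n_phi = 64, 128
theta = np.linspace(0, np.pi, n_theta)                   # polar angle
phi = np.linspace(0, 2 * np.pi, n_phi, endpoint=False)   # azimuth
TH, PH = np.meshgrid(theta, phi, indexing="ij")

r = 1.0 + 0.3 * np.cos(TH) ** 2                          # synthetic extent function

dA = (np.pi / n_theta) * (2 * np.pi / n_phi) * np.sin(TH)  # quadrature weights

signature = []
for l in range(L_MAX + 1):
    energy = 0.0
    for m in range(-l, l + 1):
        Y = sph_harm(m, l, PH, TH)        # scipy order: (m, l, azimuth, polar)
        c_lm = np.sum(r * np.conj(Y) * dA)  # projection onto Y_lm
        energy += abs(c_lm) ** 2
    signature.append(np.sqrt(energy))     # per-degree norm: rotation invariant

print(np.round(signature, 4))             # compact posture signature
```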

A further improvement of the method is presented in Chapter 7, where a model-based tracking algorithm that allows the derivation of higher-level biomechanical motion features is presented. The proposed biomechanical model consists of an articulated structure simulating the skeleton of the body with joint-angle constraints. In order to fit this model to 2D image sequences, the body has also been modelled using tapered superellipsoids, giving it a near-human silhouette. A predictive tracking algorithm has been used to derive joint motion parameters from the silhouette, in which a Kalman filter is employed at the joint level and full-body models are used for activity-specific tracking. In this way, predictive tracking reduces the inherent ambiguities, whereas the biomechanical model acts as a set of constraints, leading to improved accuracy and a reduced number of cameras for practical applications.
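As a minimal illustration of the joint-level predictive component, the sketch below runs a constant-velocity Kalman filter on a single joint angle; the noise parameters and measurements are made up, and in the thesis this is combined with the full-body activity models of Chapter 7:

```python
import numpy as np

# Constant-velocity Kalman filter for one joint angle at 25 Hz.
dt = 1.0 / 25.0                                 # frame period
F = np.array([[1.0, dt], [0.0, 1.0]])           # state transition: [angle, rate]
H = np.array([[1.0, 0.0]])                      # only the angle is measured
Q = np.diag([1e-5, 1e-3])                       # process noise covariance
R = np.array([[1e-2]])                          # measurement noise covariance

x = np.array([[0.0], [0.0]])                    # initial state estimate
P = np.eye(2)                                   # initial state covariance

for z in [0.05, 0.11, 0.18, 0.22, 0.31]:        # noisy joint angles (rad)
    x = F @ x                                   # predict
    P = F @ P @ F.T + Q
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
    x = x + K @ (np.array([[z]]) - H @ x)       # update with the measurement
    P = (np.eye(2) - K @ H) @ P
    print(f"angle = {x[0, 0]:.3f} rad, rate = {x[1, 0]:.3f} rad/s")
```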

To integrate both wearable and ambient sensing, a Bayesian framework for ambient and wearable sensor fusion is presented in Chapter 8. The proposed method allows the fusion of features derived from both sensor types, using features extracted from the vision-based sensor and biomechanical information derived from a 3-axis accelerometer. They are combined in a Bayesian framework, and the results show that sensor fusion significantly increases the classification accuracy compared to the independent use of vision or wearable sensors. Chapter 9 concludes the thesis by summarising the key technical contributions of the work and potential areas of future research.
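As a toy illustration of such fusion (not the exact formulation of Chapter 8), the sketch below concatenates vision- and accelerometer-derived features into a single vector, fits one Gaussian Mixture Model per stroke class, and classifies by maximum likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Feature-level fusion sketch: the data is random and purely illustrative.
rng = np.random.default_rng(0)
classes = ["forehand", "backhand", "serve"]

# Fused feature vectors, e.g. [silhouette width, height, |acc| peak, ...]
train = {c: rng.normal(loc=i, scale=0.5, size=(50, 4))
         for i, c in enumerate(classes)}

models = {c: GaussianMixture(n_components=2, random_state=0).fit(X)
          for c, X in train.items()}

def classify(x):
    """Maximum-likelihood classification over the per-class GMMs."""
    scores = {c: m.score_samples(x[None, :])[0] for c, m in models.items()}
    return max(scores, key=scores.get)

print(classify(rng.normal(loc=2.0, scale=0.5, size=4)))   # likely "serve"
```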


Chapter 2

Biomechanics of the Human Motion

Biomechanics is the study of the human body using mechanical theory [26, 42, 88]. More specifically, human motion biomechanics focuses on the musculo-skeletal aspects of body motion. The muscles and skeleton are modelled mechanically, allowing the analysis, simulation and optimisation of specific human activities. In this chapter, some of the key concepts in human motion biomechanics, along with their practical applications and key challenges, are discussed.

2.1 Human biomechanics models

2.1.1 Generalities

From a biomechanical perspective, the human body is composed of an articulated structure, the skeleton, and a set of muscles imposing forces between the bones. Both internal (e.g. muscles) and external forces (e.g. gravity, the Ground Reaction Force (GRF)) are applied to the system. It should be noted that whilst the human body is at the centre of biomechanical studies, other elements such as specific sport instruments may be integrated into the system.

Human motion studies aim to model the dynamics of the body in a mathematical way. Many medical and sport-related studies rely on biomechanics. In medicine, the use of such a modelling scheme allows for the understanding of adverse events such as the unstable gait leading to the falls that frequently occur in the elderly or disabled. This is illustrated in Figure 2.1, where toppling occurs when the centre of mass does not lie directly above the support area. Preventive methods can be derived from such models. In sport science, biomechanical models can be employed to tune competitive techniques or to identify factors leading to injury. For example, the quadriceps angle or Q-angle illustrated in Figure 2.1 is a measure performed on the skeleton. It has been shown that a large Q-angle increases the risk of knee injury when running [101].


Figure 2.1: Left: Toppling occurs when the centre of mass does not lie directly above the support area. Right: the quadriceps angle or Q-angle. A large Q-angle is often attributed to problems such as knee pain (figure base adapted from [236]).
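The toppling condition of Figure 2.1 amounts to a simple geometric test, sketched below with purely illustrative coordinates:

```python
import numpy as np

# The posture is statically stable only if the ground projection of the
# centre of mass lies inside the support polygon.
def inside_convex_polygon(point, vertices):
    """True if `point` lies inside a convex, counter-clockwise polygon."""
    v = np.asarray(vertices, dtype=float)
    p = np.asarray(point, dtype=float)
    edges = np.roll(v, -1, axis=0) - v           # edge vectors
    to_point = p - v                             # vertex-to-point vectors
    cross_z = edges[:, 0] * to_point[:, 1] - edges[:, 1] * to_point[:, 0]
    return bool(np.all(cross_z >= 0))            # point left of every edge

support = [(0.0, 0.0), (0.3, 0.0), (0.3, 0.25), (0.0, 0.25)]  # feet area (m)
com_projection = (0.15, 0.10)                    # centre-of-mass projection

print("stable" if inside_convex_polygon(com_projection, support) else "toppling")
```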

2.1.2 Bones and articulations

The human skeleton consists of bones connected at articulations. Most of the articulations only allow for rotational motion. The friction between the bones in motion is greatly reduced by pieces of cartilage at the articulations, in addition to the synovial fluid lubricating the contact surfaces [239]. Some articulations, such as the knee, also feature ligaments which link the bones together.

In biomechanics, the skeleton is typically modelled by a tree of sticks connected with three rotational Degrees Of Freedom (DOF). The DOF of some articulations can be reduced to one or two. Joint friction can optionally be included for more realistic simulation.
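This stick-tree model maps naturally onto a small data structure; the following sketch is a minimal illustration (not the model used later in the thesis), with made-up joint names, bone lengths and angle limits:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Each bone hangs off its parent at a joint with up to three rotational
# DOF, each clamped to a legal range.
@dataclass
class Joint:
    name: str
    limits: List[Tuple[float, float]]       # (min, max) angle per DOF, radians
    bone_length: float                      # length of the attached bone (m)
    children: List["Joint"] = field(default_factory=list)

    def clamp(self, angles: List[float]) -> List[float]:
        """Project a candidate pose onto the joint-angle constraints."""
        return [min(max(a, lo), hi) for a, (lo, hi) in zip(angles, self.limits)]

# Example: a one-DOF hinge knee below a three-DOF ball-and-socket hip.
knee = Joint("knee", limits=[(0.0, 2.5)], bone_length=0.43)
hip = Joint("hip", limits=[(-0.5, 2.0), (-0.8, 0.8), (-0.7, 0.7)],
            bone_length=0.45, children=[knee])

print(knee.clamp([3.0]))   # -> [2.5]: knee hyperflexion is clamped
```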

2.1.3 Muscles

2.1.3.1 Muscle force production

Skeletal muscles (as opposed to smooth muscles, which can be found in the arteries or the bladder) are designed to provide a leverage force between articulated bones, to which they are attached by tendons. Muscles are composed of tens of thousands of elongated fibres bundled together into fasciculi. Two types of fibre are present in the muscles: slow and fast fibres. Slow fibres rely on an aerobic biochemical reaction, thus involving the lungs and the cardiovascular system, and therefore take some time to start producing a force. Fast fibres use an anaerobic and inefficient decomposition of glucose into poisonous lactic acid, allowing rapid but unsustainable contractions [209]. In general, the muscle contraction force can be developed in three different modes [26, 42, 88]:

Concentric contraction occurs when the muscle shortens whilst producing the work. This happens when the muscle's force is greater than the resistive forces.

Eccentric contraction occurs when the muscle elongates during contraction. This is typically the case when trying to control the motion resulting from other forces (internal or external).

Isometric contraction (at constant muscle length) can also occur, for example when holding a position or an object.

Because there is no such thing as a “decontraction”, or a force produced by a muscle that would push [42, 209], most muscles are paired with an antagonist counterpart. Such a pair of antagonist muscles allows for the production of force in either direction.

Although the muscle contraction mostly produces a translational force, most human articulations only permit rotational motion between the composing bones, as illustrated in Figure 2.2.


Figure 2.2: Muscle force production; the moment of the muscle is greater on theright, allowing more torque to be produced from the same force.

The torque τ generated by a force F applied with a lever arm r is simply defined as the cross product of the lever arm and the force:

τ = r × F  (2.1)

τ = r F sin θ  (2.2)

where τ, F and r are the magnitudes of the vectors and θ is the angle between F and r.


Therefore, the conversion of the muscle's translational force into a rotary torque depends on the angle of the articulation and the position of the attachment points of the muscle to the bones. Whilst the latter is fixed, the articulation angle plays a major role in the potential torque that a muscle can develop.
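A small numerical illustration of equations 2.1 and 2.2, with made-up force and lever-arm values:

```python
import numpy as np

# A muscle force applied 5 cm from the joint centre.
r = np.array([0.05, 0.0, 0.0])            # lever arm (m), joint -> attachment
F = np.array([100.0, 490.0, 0.0])         # muscle force (N)

tau = np.cross(r, F)                      # vector form: tau = r x F (2.1)
magnitude = np.linalg.norm(tau)

# Scalar form: tau = r F sin(theta) (2.2)
cos_theta = (r @ F) / (np.linalg.norm(r) * np.linalg.norm(F))
theta = np.arccos(cos_theta)
assert np.isclose(magnitude,
                  np.linalg.norm(r) * np.linalg.norm(F) * np.sin(theta))

print(f"tau = {tau} N m, |tau| = {magnitude:.2f} N m")
```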

2.1.3.2 Hill’s muscle model

The classic Hill's muscle model [26, 42, 88, 98] is an elementary mechanical equivalent of a skeletal muscle. This model consists of several elementary mechanical components: a contractile force, and damping and elastic components.


Figure 2.3: Hill's muscle model is composed of a contractile element in series with an elastic element, both in parallel with another elastic element. A variation is presented on the right.

The elastic elements of the model are of the highest importance. Indeed, a large number of the motion sequences performed by human beings rely significantly on the muscles' elasticity to store potential energy. The elastic energy can initially be stored through two main mechanisms: pre-loading by stretching the muscles, or the action of external forces onto the body [103]. For example, muscle elasticity plays a major role in capturing the GRF energy during walking or running and restoring it a fraction of a second later (bouncing). Because the contractile component is modelled in parallel with the elastic component, the elastic energy is not persistent. Therefore, it must be re-used as quickly as possible after storage [102].

The contractile element can be characterised by Hill’s equation [98, 219] in thecase of a concentric contraction:

(T + a) (V + b) = c (2.3)

where T is the muscle tension and V its contraction speed. In this equation, a, band c are constants defining the muscle contractile properties. Equation 2.3 can berearranged to give:

T = c/(V + b) − a (2.4)


Therefore, the muscle tension T reaches its maximum Tmax at a null velocity:

Tmax = c/b − a (2.5)

Conversely:

Vmax = c/a − b (2.6)

and the above equation can be rewritten as:

(T + a) (V + b) = b (Tmax + a) (2.7)

= a (Vmax + b) (2.8)

The main outcome of the application of this equation is that a skeletal muscle develops its maximum tension during an isometric contraction (at a null contraction speed), and conversely contracts at its highest velocity when not providing any force. That is why, for example, sprinters cannot accelerate indefinitely. Once their muscles have reached a contraction speed that is close to their maximum, they cannot produce more force to keep on accelerating. Therefore, sprinters are bound to their innate maximal speed.

The representation of Hill’s muscle equation in the (T, V) space is a hyperbola whose asymptotes are defined by T = −a and V = −b, as illustrated in Figure 2.4.

Figure 2.4: Hill’s curve for a muscle with the following properties: Tmax = 12,000 N, Vmax = 2 m s−1, Pmax = 2300 W.

It can be noted that this equation is applicable to individual fibres, fascicles and muscles. Indeed, the total tension produced by n fibres at velocity V with parameters af, bf, cf is the sum of the tensions of each fibre Tf:

T = nTf (2.9)
  = n (cf/(V + bf) − af) (2.10)
  = ncf/(V + bf) − naf (2.11)

which leads to Hill’s equation for the whole muscle:

(T + naf ) (V + bf ) = ncf (2.12)

The muscle power P generated during a contraction is defined as the product of the tension and the velocity [219]:

P = TV (2.13)
  = V (c/(V + b) − a) (2.14)
  = T (c/(T + a) − b) (2.15)

The maximum muscle power is produced at an optimal velocity Vopt, generating a tension Topt. The derivative of the above function P(V) can be expressed as:

dP/dV = −cV/(V + b)² + c/(V + b) − a (2.16)

The maximum of P(V) is reached when its derivative dP/dV is null, leading to the resolution of a second-order equation whose only positive solution is:

Vopt = √(bc/a) − b (2.17)

Similarly,

Topt = √(ac/b) − a (2.18)

which can also be expressed as:

Vopt = √(b (Vmax + b)) − b (2.19)

Topt = √(a (Tmax + a)) − a (2.20)

And therefore,

Pmax = Topt Vopt = ab + c − 2√(abc) (2.21)

It is also worth noting that:

Tmax/Vmax = Topt/Vopt = a/b (2.22)
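
These relations are easy to check numerically. The short Python sketch below uses parameter values a = 3000 N, b = 0.5 m/s and c = 7500 W, chosen here (as an assumption, not taken from the text) to reproduce the curve of Figure 2.4:

    import numpy as np

    a, b, c = 3000.0, 0.5, 7500.0          # assumed contractile constants

    T_max = c / b - a                      # Eq. 2.5 -> 12000 N
    V_max = c / a - b                      # Eq. 2.6 -> 2 m/s
    V_opt = np.sqrt(b * c / a) - b         # Eq. 2.17
    T_opt = np.sqrt(a * c / b) - a         # Eq. 2.18
    P_max = a * b + c - 2 * np.sqrt(a * b * c)   # Eq. 2.21 -> ~2292 W

    assert abs(T_max / V_max - a / b) < 1e-9     # Eq. 2.22
    assert abs(T_opt * V_opt - P_max) < 1e-6     # consistency of Eqs. 2.17-2.21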


Whilst Hill’s equation provides an elementary yet general estimation, it is not accurate unless dealing with low speeds and high forces [181, 219].

2.1.4 Kinetic chain

Muscle models provide a local parametrisation of the force that each individual muscle can produce independently. Whilst this approach is sufficient for elementary studies, it cannot provide relevant information on complex motions carried out by the human body while exercising. For this purpose, global human biomechanical models are necessary. Such models usually include a definition of the skeleton, a set of significant muscle models, the mass distribution and further information on contact points with potential external forces. This is often referred to as the kinetic chain model [223], as all elements are interdependent. The kinetic chain model allows for more detailed and realistic human motion simulation.

For example, the racket motion during a tennis serve cannot be evaluated with a model of the player’s arm or upper body alone. A full kinetic chain model is necessary to study the significant effect of the front leg drive on the final tennis ball velocity.

2.2 Biomechanical measures

Whilst it is difficult to measure the exact biomechanical parameters of human kinetics, a large number of methods exist that provide good estimates of some of the external consequences of the applied forces.

Kinematic aspects such as the position, the speed and the acceleration (and to some extent the jerk, i.e., the derivative of acceleration over time) can be measured externally at the surface of the skin. Because the skin itself is subject to complex soft-tissue deformation, the underlying motion of the skeleton cannot necessarily be derived with high accuracy [21]. In general, forces can be measured either directly or indirectly.

A dynamometer can be used to directly measure the force developed by a specific muscle or a group of muscles. This solution can only be employed on a specific test bed and to measure the maximal applied force. Indeed, the energy spent by the subject is entirely absorbed by the device, which therefore cannot be used to measure forces developed during real-life exercise. It is typically used to assess the nominal force of a subject at a given time.

In order to measure the forces involved in real life, a device that does not dissipate energy must be used. Therefore, the forces are typically estimated indirectly via the pressure applied on the ground or an object. Pressure sensors can be integrated into a variety of devices such as insoles and gloves, or they can be embedded into the ground. Whilst this can provide measurement in real-life conditions, only a small subset of the forces involved can be measured, and this indirect observation often leads to losses in accuracy.

2.3 Practical applications of biomechanics

2.3.1 Running

A normal running gait consists of two main phases: the stance phase (also referred to as the contact phase), during which a foot is in contact with the ground, and an airborne recovery phase (also referred to as the swing phase) [42]. The two phases of a running cycle are illustrated in Figure 2.5.

Figure 2.5: Motion sequence on the sagittal plane of a runner at 25 Hz. Top: video sequences recorded. Bottom: joint positions derived from three-dimensional (3D) marker-based optical tracking at 250 Hz.

2.3.1.1 Stance phase

The stance phase starts when the foot lands on the ground. A heel strike, illustrated in Figure 2.6 (middle), can be characterised by a sudden vertical reaction force applied to the leg. This fast acceleration change, or jerk, is a source of energy dissipation and potential injuries. A mid-foot strike develops a lower impact (Figure 2.6, bottom). Elastic energy can be stored in stretched tendons, particularly in the ankle and the knee, when landing, and recovered later when pushing up and forward. The energy recovery rate depends on the contact time and the dynamics of the sole.

Figure 2.6: Foot pressure distribution over time during ground contact for three different strikes (walking, running with a heel strike, and running with a mid-foot strike), as measured with a Parotec insole [180]. It is evident that walking involves lower GRF than running. These measurements also demonstrate the difference between heel strike and mid-foot landing.

During contact, the centre of pressure on the foot moves from the heel to the toes when the impulse is given. This phase is sometimes sub-divided into a support phase at landing and a drive phase, illustrated in Figures 2.6 and 2.7. The GRF is applied to the body through the foot during the stance. This external force can be decomposed into a vertical component, opposing resistance to the body weight and the vertical impulse, and a horizontal component due to the ground friction allowing forward and sideways motions.

2.3.1.2 Recovery phase

An upward motion is compulsory to allow for an airborne phase, but this needs to be minimised in order to save energy. A bend in the knee increases the clearance to the ground whilst carrying little mass upwards, but also helps with the pendulum motion of the entire leg.

Figure 2.7: Overall ground reaction force for three different strikes recorded at 300 Hz. The initial force peak during a heel strike is a source of energy dissipation. Note the difference of amplitude between running and walking.

Once airborne, the runner’s aim is to get his foot as far forward as possible. This can be achieved by increasing either the stride length or the stride frequency. Whether runners should favour stride length or frequency is a topic of constant debate in sports training.

2.3.1.3 Other aspects

Arms also play an important role during walking and running. Indeed, due to the principle of action-reaction, the pendulum motion of the leg during the recovery phase forces the torso to rotate in the opposite direction. A counter-action by the arm is then required to keep the torso straight. This effect is particularly important during large accelerations, which is why elite sprinters usually exhibit a strong upper-body musculature.

2.3.1.4 Key biomechanical measures

A complete understanding of key biomechanical parameters allows for the design of more specific and efficient training programs. Observation and analysis of these parameters (Figure 2.8, left) is therefore key to sport performance improvement.


Figure 2.8: Main leg joint angles in the sagittal plane during a running cycle. Left: illustration of the two main running phases. Right: the shape of the knee versus thigh angles during a running cycle allows one to determine fatigue level.

For example, it has been demonstrated that muscle fatigue during sprinting can be determined by the shape of the knee versus thigh angles during a running cycle [43, 88], as illustrated in Figure 2.8 (right). Although the effect of fatigue varies from athlete to athlete, the relative magnitude of the concavity of the curves provides an overall indication of its impact.

2.4 Biomechanics of tennis movements

2.4.1 Motion on the court

In order to illustrate the technical features of the sensing paradigm proposed in this thesis, tennis movement tracking will be used as the exemplar throughout. In this section, some of the key biomechanical features of tennis movements will be discussed. Acceleration is an obvious key factor to a successful tennis game [86, 188]. The player’s ability to move and change direction quickly allows the athlete to convert more opportunities into points, particularly on slow surfaces such as clay (it is less of an advantage on faster surfaces, where the player cannot ‘run’ after the ball). However, the necessity to move fast must not prevent the player from remaining balanced enough to control the ball efficiently. Therefore, dynamic balance [103], which is the capacity to remain balanced whilst moving, is a fundamental requirement for the tennis player [188].

2.4.2 Tennis stroke

In addition to generic motion such as accelerating, sprinting and decelerating, a tennis game involves more specific biomechanical aspects in each stroke. One common denominator between all kinds of stroke is that the kinetic energy propagated by the racket to the ball is a function of the mass of the system in motion (mainly the arm and the racket) and the square of the racket’s head speed. Indeed, the kinetic energy Ek possessed by a mass m in translational motion at velocity v is given by classical mechanics:

Ek = ½mv² (2.23)

Although it may appear that only the racket’s head speed matters during the stroke, it is important to understand that this motion is actually the consequence of energy propagated through the whole body’s biomechanical chain. Each specific tennis stroke involves a balance of power and precision, which can be described in biomechanical terms.

Serve Although there is no such thing as the perfect serve [67, 178], general rules do apply in order to achieve the most efficient stroke [66, 223]. To achieve the maximum power, the whole body of the tennis player is involved in order to get as much racket rotational velocity as possible. The front leg drives the movement, then the trunk and shoulders rotate. An upward and forward movement of the body is applied, providing further speed to the upper body. This is usually performed by bending the knees to store some elastic energy and releasing it by effectively jumping forward. An external rotation of the upper arm allows it to store elastic energy [102] before the final arm flexion and wrist twist during the ball toss.

It is clear that the whole kinematic chain, from the ankles to the wrist, contributes to the final ball speed, as illustrated in Table 2.1 and Figure 2.9. Because of the elastic energy usage, the whole motion sequence must be perfectly coordinated and the rhythm is fundamental [66, 223].

Return of serve Unlike the serve, the aim of the return is not velocity generation (especially on the first serve), but rather precision [120]. Therefore, the speed of all body parts is greatly reduced in favour of ball control.

Forehand stroke A large number of options are offered to the player to carry out a forehand stroke [20]. The grip determines the racket orientation and prepares the wrist for impact. For example, the Western grip (palm at 90° from the string plane) allows easier topspin and racket orientation, whereas the Eastern grip (palm parallel to the strings) provides more stability. A variety of intermediate grips are also in use. The trunk rotation is used to store elastic energy. The back-swing can be straight, providing more control, or looped, for velocity.

Backhand stroke The backhand can be performed one- or two-handed, exhibiting large differences in the resulting kinematic chain. A one-handed backhand hits the ball with a larger radius of rotation, but a lower angular velocity of the racket. Overall, the linear velocity of the racket at the impact is thought to be similar for both techniques [187]. The one-handed backhand requires more strength, and is therefore rarely favoured by beginners.

Figure 2.9: Biomechanics of the tennis serve: all major body parts play a role in the eventual ball speed and the coordination is crucial. 1 to 5: the front leg drives the motion. 3 to 11: hips’ rotation. 4 to 12: shoulders’ rotation. 4: elastic energy storage through a knee bend. 5 and 6: release of the elastic energy into an upwards and forward motion, and elastic energy storage in the arm. 7 to 10: release of the arm elastic energy; shoulder, elbow and wrist are used to their maximal extent. 10 to 12: follow through.

Motion                                  | Contrib. | Terminal point | Velocity (m s−1)
Leg drive & shoulder rotation           | 10-30%   | shoulder       | 4.2
Upper arm flexion & internal rotation   | 40-55%   | elbow          | 7.3
Forearm extension & pronation           | 5%       | wrist          | 8.7
Hand flexion                            | 30%      | racket         | 20.5
                                        |          | ball           | 44.2

Table 2.1: Estimations of the contributions to the racket velocity during a tennis serve [66, 67, 102], along with the average velocities as measured by Papadopoulos et al. [178]. Although the relative contributions are still being debated, it is commonly acknowledged that none of the motion aspects of the kinematic chain is negligible.

Volley The precision required during a volley is achieved by a forward step, in order to increase the general balance, and a lock of the racket on the arm and wrist for optimal ball control. A knee bend increases the overall balance but also allows the player to be ready to move in any direction [189], as the reaction time is bound to be very short.

2.4.3 Key measures

Most of the current biomechanical studies carried out on tennis players are focused on the serve. The leading idea is to identify the contributions of several biomechanical parameters to the final racket speed. Several types of experiments have been designed in order to measure these variables. They are summarised in Table 2.2.

The GRF during a serve has been measured in [164] using a force plate. This allows one to evaluate the difference between two variations of the tennis serve stroke, the foot-up and the foot-back. It was found that the magnitude of the vertical component is about 50% higher at the end of a foot-up serve than that of a foot-back serve, which is attributed to a greater impulse. Nevertheless, the authors assume that this force remains low enough to discard any risk of lower-body injuries in both cases.

At the other end of the human kinetic chain, Wang et al. [233] analysed the upper body kinematics during a flat serve. They relied on a HiRES Expert Vision motion capture system composed of six cameras tracking sixteen markers on the player’s body. The derived angular velocities provide an insight into the motion synchronisation along the kinematic chain during the serve. The authors suggested that fine motion control using the measured velocities should help prevent injuries.

A complete kinetic chain analysis was carried out by [178], using manually annotated video recordings. The 3D positions of 13 body points generated by the Ariel Performance Analysis System (APAS) [62] allow for the derivation of joint velocities and for drawing conclusions on what actually matters in the tennis serve. It appears that elite players do not necessarily follow the theoretical optimum serve sequence.

The overwhelming presence of the serve stroke in the tennis biomechanics literature is often justified by its importance to the final game result. Indeed, the serve is believed to be a major factor in the final outcome. Whilst this assumption is certainly correct, other strokes undoubtedly have a role to play during a tennis match. The limited number of studies on factors such as stroke variability is mainly due to the difficulty of obtaining detailed information without impeding the game itself. It would be far from practical to cover a whole tennis court with force plates in order to capture the GRF of long forehand rallies as well as the net game in match conditions.

Table 2.2 summarises the key biomechanical parameters to be measured and how they are currently measured in laboratory or outdoor settings. The main technical challenge therefore consists of generalising the previously mentioned methods to most of the tennis strokes, and even to the general court motion and posture.

Biomechanical parameter        | Measurement device          | Example
Position and velocity on court | Video-based tracking        | [183, 210]
GRF                            | Force plate                 | [164]
GRF                            | Pressure insole             | -
Joints velocity                | Manually annotated video    | [178]
Upper body joints velocity     | Marker-based motion capture | [233]

Table 2.2: Key biomechanical measures.

2.5 Summary and conclusion

2.5.1 Summary

Human motion biomechanics offers a set of tools for modelling body movements in an intuitive, yet realistic manner. Local muscle models, such as Hill’s, provide an insight into the individual muscles’ capacity to produce force and speed by their contraction, whereas global kinetic chain models describe the interactions between the skeleton and muscles in the whole body.

Biomechanical models provide coaches with a means of describing the motion of athletes with quantitative parameters that complement subjective observations. A comparison of biomechanical simulations with real-life observations is an effective way of optimising, inferring and rectifying subtle movement patterns, leading not only to performance enhancement, but also to injury prevention. To this end, key biomechanical parameters providing the most relevant indication of performance gain must be identified and extracted.


Commonly used methods for measuring biomechanical parameters include wearable inertial sensors, wearable and ambient force sensors, and video-based systems. The current state-of-the-art systems for whole-body motion capture mainly consist of multiple infrared cameras tracking fiducials attached to the body. These markers can be obtrusive during training and are unsuitable during competition.

2.5.2 Conclusion

In this chapter, it has been shown that the tennis serve is a well-studied topic in the biomechanics community. Other tennis strokes have been analysed with less scientific rigour due to a higher variability and a wider spatial coverage. Empirical studies have shown that the whole body, from the legs to the wrist, contributes to the eventual racket speed. Finer biomechanical analysis allows detailed evaluation of the relative contributions of constituting actions to the racket speed, as well as of the spatio-temporal coordination between these actions.

In order to apply such motion optimisation in situ during training and competition, there is a need to measure appropriate biomechanical parameters accurately. This means that the measurement devices must be accurate and pervasive, without affecting the normal performance of the game itself. Any disturbance to the athlete’s natural motion can lead to subtle changes in movements. Therefore, most commonly used tracking systems are not suitable for this task.

As a consequence, one of the key challenges in motion capture is to derive similarly detailed postural parameters using a less obtrusive markerless system. In the next chapter, the suitability of vision-based systems for accurate and unobtrusive motion analysis will be evaluated.


Chapter 3

Visual Motion Monitoring and Activity Recognition

Over the years, a wide variety of sensing modalities have been developed for sport performance monitoring, but not all of them are routinely employed. Physiological parameters such as heart rate are frequently used in most sports, whereas others such as the oxygen uptake (VO2 max) are limited to laboratory settings. Similarly, kinetic parameters such as stride length and stride frequency are a pre-requisite for sprint training, but finer biomechanical analysis such as detailed Ground Reaction Force (GRF) is less commonly used in the field.

In these examples, the current technology is mature enough to provide accurate and meaningful answers. Heart rate can be measured with a chest belt, oxygen uptake with a wearable mask, pace with a watch and Global Positioning System (GPS) or a foot sensor, and stride analysis with a marker-based motion capture system. The reason why oxygen uptake and motion capture analysis are less frequently used is partly their prohibitive cost, and partly their complex installation/setup and obtrusive nature.

Whilst it is impossible to measure the oxygen uptake with a vision-based system, motion analysis can be performed with markerless vision sensors. This would provide not only a more affordable but also a much less intrusive motion capture system. However, a number of issues must be addressed. In this chapter, a review of the current state-of-the-art in motion monitoring and tracking is presented. Ambient sensors such as natural-light or infrared cameras and ultrasonic sensors are discussed. Because vision sensors do not provide specific kinetic parameters directly, further processing is required, which includes object segmentation, motion evaluation and high-level tracking.


3.1 Ambient sensing

3.1.1 Non-visual sensors

Passive Infrared (PIR) sensors, also referred to as motion sensors or pyroelectric sensors, are widely used for security purposes. Such sensors capture the infrared (IR) light emitted from warm objects such as the human body. In order to increase the Signal-to-Noise Ratio (SNR), PIR sensors are often based on several subsensors evaluating the IR emission in different directions close to each other. Correlation between these subsensors provides a means of robust motion detection. In practice, PIR sensors provide a low-cost, low-power and robust way of detecting motion [22]. The addition of a Fresnel lens array further permits the evaluation of velocity information [90].

For ambient sensing, bend, pressure, magnetic or mechanical switch sensors can also be mounted on furniture or appliances [22, 127, 163] to detect their use. Bend sensors are developed using materials whose electrical properties (resistance, capacitance) change with torsion or, in the case of optical sensors, by measuring the mode splitting in an optical fibre [136]. Pressure sensors are usually based on piezoelectric materials or electro-mechanical contacts. Large-scale arrays of pressure sensors can be embedded in the floor to build a smart or active floor [1, 171].

Acoustic and vibration sensors have also been used to infer motion and activities by sampling the ambient sound [44, 49, 196] in an environment. They can also be mounted on specific targets, for example to monitor water flowing in a pipe [163] and therefore infer appliance usage. Depending on the desired frequency range, traditional microphones or piezoelectric materials can be used.

For ambient sensing, many other specific sensors have also been developed. They include smoke detectors (optical sensors, relying on a light beam scattered by the smoke particles, are widespread for this purpose), gas, carbon monoxide/dioxide, luminosity (photodiode), water level, humidity and temperature (thermistor) sensors [22, 163], and meters for the electricity consumption of specific sockets [163].

In between ambient and wearable sensing, some systems permit localisation based on the combined use of wearable and ambient sensors. This can be performed by triangulation of the Received Signal Strength Indicator (RSSI) by several receivers [213]. Radio-Frequency Identification (RFID) based identification of such sensors can permit the concurrent motion monitoring of multiple subjects.

Although these sensors are suitable for the specific tasks they are designed for, their application domain is very limited. In order to sense a wide range of activities in sport environments, a large range of sensors is necessary. Some of these devices also require careful installation and calibration, which is difficult for practical applications.


3.1.2 Vision-based sensors

Vision-based methods are typically based on visible-light or infrared cameras. Such systems can consist of a single camera [151, 230], one or several stereo pairs [70], or a multiple-view configuration [44, 138, 176]. Most of the work in this area has focused on static cameras, but dynamic cameras [159] can also be employed, particularly for human tracking in large areas. Computer vision has been an important focus in distributed ambient sensing because of its versatility and relative maturity. It is probably worth noting that the dominant human sense is vision, which suggests that vision-based methods are richer and make more “sense”.

There are two main approaches for detecting and analysing human motion based on vision sensors. Global methods aim to segment the image into regions, some of which may contain the subject. This is typically performed with statistical background segmentation, edge detection or motion detection. A human model can then be matched to these selected regions. Local methods, on the other hand, look for specific features in the image, such as the head or feet of the subject. These features are then associated with a human model during the next level of processing.

Once the subject has been identified, several techniques can be used to derive information about the posture and gait. One solution is to generate low-level features directly from the segmented object on a frame-by-frame basis. Another approach is to update a more detailed model of the body based on tracking. Direct generation of features from the extracted silhouette has the advantage of being simpler, without the need of model fitting. Such features include ellipses [213, 230] or Oriented Bounding Boxes (OBBs) [177], which provide information about the global orientation of the subject (e.g., lying, sitting or standing) and its aspect ratio. Given an ellipse (x̄, ȳ, a, b, ϕ) defined by its centre (x̄, ȳ), its major and minor axes a and b and its orientation ϕ, the best fit can easily be computed using the central moments of the image [230]:

µpq = Σx Σy (x − x̄)^p (y − ȳ)^q I(x, y)

ϕ = ½ tan⁻¹ (2µ11 / (µ20 − µ02)) (3.1)

where I(x, y) is the image intensity at a given pixel. In the case of a binary blob, the intensity value can only be 0 or 1. The values of a and b can be calculated by projection of the pixels once the centre and orientation are known.
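
For illustration, a minimal NumPy implementation of this fit for a binary blob (a sketch; the helper name is not from the thesis) could be:

    import numpy as np

    def blob_ellipse(I):
        """Centre and orientation of the best-fit ellipse of a binary
        blob I, using the central moments of the image (Eq. 3.1)."""
        ys, xs = np.nonzero(I)                    # foreground pixel coordinates
        xb, yb = xs.mean(), ys.mean()             # blob centre (x-bar, y-bar)
        mu11 = np.sum((xs - xb) * (ys - yb))      # central moments mu_pq
        mu20 = np.sum((xs - xb) ** 2)
        mu02 = np.sum((ys - yb) ** 2)
        phi = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
        return (xb, yb), phi

Using arctan2 rather than a plain arctangent avoids the degenerate case µ20 = µ02.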

Other techniques include radial shape analysis [10, 230], which measures the extent of the silhouette as a function of the angle in polar coordinates. Tracking methods will be described in more depth in Section 3.3.

Most non-tracking methods based on a single camera suffer from view-dependent bias and occlusion-induced errors. A unique Point-Of-View (POV) makes posture evaluation difficult, as the features extracted from the silhouette are dependent on the subject’s orientation. Occlusion and self-occlusion (arms occluded by the chest, for example) add further ambiguities to the scene. This issue can be tackled by the use of an omnidirectional camera [151]. Such cameras are usually less sensitive to the relative position of the subject, and also less sensitive to occlusion. Another solution to these problems is to combine several cameras [83, 176]. Multiview methods are discussed in more detail in Chapters 5 and 6.

3.1.3 Vision Sensor Networks

In order to increase the coverage of a vision-based system, a desirable solution is to rely on several cameras or vision-based sensors, forming a Vision Sensor Network (VSN). Whilst it is possible to extend single-camera systems by juxtaposition of several sensing nodes, an efficient VSN deployment involves some level of inter-node collaboration.

3.1.3.1 Hardware

A VSN is composed of a set of nodes, each embedding a vision sensor, a processor and a communication interface. A set of personal computers equipped with cameras and connected together by a Local Area Network (LAN) is one example of a VSN. As VSN technology is still in its infancy, the most advanced algorithms are often tested on this kind of generic equipment setup [47, 57] or even in simulated environments [5, 57].

VSN-specific hardware is bound to be more efficient in terms of cost and deployment. Small battery-powered devices embedding a camera, a processor for local data analysis, and a wireless interface are emerging. They are sometimes referred to as “smart cameras”. Typically, such a device would include:

• A processing unit, e.g. ARM-based [190, 215] (such as the Intel XScale PXA series [45, 71, 144, 216]) and sometimes packaged in a microcontroller (MCU) such as the Atmel AT91SAM7S [61, 97]. Low-power microcontrollers such as the Atmel ATmega128L [96, 186, 193] are also popular in designs. Other Reduced Instruction Set Computing (RISC) platforms include PowerPC [147] and Blackfin [211] processors.

• Random Access Memory (RAM) and Flash memory.

• A Complementary Metal Oxide Semiconductor (CMOS) image sensor, the most popular ones being provided by Omnivision [170]. A Universal Serial Bus (USB) webcam is sometimes employed.

• A Field-Programmable Gate Array (FPGA) [39, 45, 68, 147] or a Digital Signal Processor (DSP) module [138, 165] can be integrated for fast low-level image processing.

• A communication interface that often takes advantage of the supporting platform, providing IEEE 802.15 [216], IEEE 802.11b/g Wi-Fi [144, 211] or, even more simply, wired Ethernet [147].

Table 3.1 provides a comparison of some of the current VSN platforms, and an example VSN hardware setup is depicted in Figure 3.1.

Figure 3.1: Top, left to right: VSN node developed at Imperial College London, COTS iMote2 [216], WiCa [119]. Bottom: CMUcam3 [190], CITRIC [45], mvBlueLYNX [147]. Images are not to scale.

3.1.3.2 Bandwidth usage

In practice, the large amount of visual information acquired by each VSN node can become problematic when building collaborative tracking environments, as bandwidth is typically limited in most applications [5, 6, 17, 47, 57, 167]. As a consequence, extensive work has been carried out on limiting the inter-node communication with local on-node processing. Early work consisted of image compression before transmission, whereas the current trend is more towards distributed processing combined with a reduction of the amount of data sent across the network [125].

For example, when Cheng et al. [47, 57] proposed a method to automatically calibrate a set of VSN nodes, they considered the bandwidth constraint as part of their algorithm. Their system is designed to build a vision graph, in which the nodes represent the vision sensors and the edges show an overlap in the Field-Of-View (FOV) between two sensors. It is important to note that the topological graph for data processing may not be the same as the physical communication graph. The first stage of the algorithm therefore consists of pair-wise calibration of the VSN nodes. To this end, the overlap in FOV must first be detected before computing more accurate epipolar parameters. Each of the VSN nodes locally detects a set of features based on the Difference-of-Gaussian (DOG) and computes Scale-Invariant Feature Transform (SIFT) descriptors. In order to reduce the bandwidth usage, a subset is extracted from the feature set. Only the most distinctive and spatially distributed features are kept, in order to increase the chances of overlap with another sensor node. This “feature digest” is finally broadcast to the whole network, and only nodes showing a large overlap will request the rest of the feature set for calibration. In this work, other considerations such as transmission priorities are yet to be addressed.

Node           | Platform           | Processor (frequency)      | Add. processor | Camera               | Comm. interface | Power (mW) | Ref.
CITRIC         | Tmote Sky          | XScale PXA270 (624 MHz)    | FPGA           | OV9655               | 802.15          | 970        | [45]
CMUcam         | -                  | Ubicom SX28 (75 MHz)       | -              | OV6620               | RS232           | 1000       | [191]
CMUcam2        | -                  | Ubicom SX52 (75 MHz)       | -              | OV6620/7620          | RS232           | 850        | [192]
CMUcam3        | -                  | NXP LPC2106 (60 MHz)       | -              | OV6620/7620          | RS232           | 500        | [190]
Cognachrome    | -                  | Motorola 68332             | -              | NTSC                 | RS232           | 2000       | [166]
COTS iMote2    | iMote2             | XScale PXA271 (416 MHz)    | -              | OV7649               | 802.15          | 322        | [216]
COTS XYZ       | XYZ                | OKI ML67Q5002 (57 MHz)     | -              | ALOHA/OV7649         | 802.15          | -          | [215]
Cyclops        | Crossbow MICA2     | ATMEL ATmega128L           | XC2C256        | ADCM-1700            | Mica2           | 42         | [186]
eCAM           | Econode            | DW8051                     | OV528          | OV7640               | nRF24E1         | 230        | [179]
Elphel 313     | -                  | Axis ETRAX 100LX (100 MHz) | FPGA           | CMOS                 | Ethernet        | 3000       | [68]
Harbin IT      | -                  | Samsung S3C44B0X           | FPGA           | NS LM9628            | CC1000          | -          | [39]
ICVSN 1.3      | -                  | Blackfin BF537 (500 MHz)   | -              | OV9655               | 802.11          | 300        | [174]
Meerkats       | Crossbow Stargate  | XScale PXA255 (400 MHz)    | -              | USB webcam           | 802.11          | -          | [144]
MeshEye        | -                  | ATMEL AT91SAM7S            | -              | ADNS-3060/ADCM-2700  | 802.15          | -          | [97]
mvBlueLYNX     | -                  | PowerPC (400 MHz)          | FPGA           | CMOS                 | Ethernet        | -          | [147]
NC5200         | -                  | VISoc RISC                 | DSP            | CMOS                 | USB             | -          | [165]
Panoptes 1     | Applied Data Bitsy | StrongARM (206 MHz)        | -              | USB webcam           | 802.11          | 5200       | [71]
Panoptes 2     | Crossbow Stargate  | XScale PXA255 (400 MHz)    | -              | USB webcam           | 802.11          | -          | [71]
SenseCam       | -                  | PIC16F876                  | -              | various              | RS232           | -          | [75]
Surveyor SRV-1 | -                  | Blackfin BF537 (500 MHz)   | -              | OV9655               | 802.11          | 1980       | [211]
UbiSense       | ICBSN              | TI MSP430                  | ADSP2185       | VCSBC50              | 802.11          | -          | [138]
WeebleVideo    | Crossbow MICA2     | ATMEL ATmega128L           | -              | OV6620               | Mica2           | ?          | [193]
WiCa           | -                  | 8051                       | Xetal IC3D     | 2×VGA                | 802.15          | 400        | [119]
WiSN           | -                  | ATMEL AT91SAM7S (48 MHz)   | -              | ADNS-3060/ADCM-1670  | 802.15          | -          | [61]
WiSNAP         | CC2420DB           | ATMEL ATmega128L           | -              | ADNS-3060/ADCM-1670  | 802.15          | -          | [96]

Table 3.1: Overview of the main VSN platforms.

3.1.3.3 Power consumption

For a VSN, the power consumption depends on the communication usage and the processing power. Multi-hop communication strategies can be implemented to save energy and increase the communication coverage in wireless sensor networks [17, 167]. As the electromagnetic energy emitted by an omnidirectional source decays with the inverse square of the distance travelled, the power necessary to transmit a signal over a distance D is proportional to D². By using N hops, the total power required would be proportional to N(D/N)² = D²/N. The overall power gain is therefore directly proportional to the number of hops. A common strategy to reduce the processing power consumption is to leave part of the network idle or sleeping until a critical event is detected in the vicinity [5, 213].
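
As a toy numerical check of this scaling (the distance and hop count below are arbitrary):

    # Free-space transmission power grows with the square of the distance,
    # so relaying over N equal hops of length D/N costs N*(D/N)^2 = D^2/N.
    D, N = 100.0, 4                 # hypothetical distance (m) and hop count
    direct = D ** 2                 # single-hop cost (arbitrary units)
    relayed = N * (D / N) ** 2      # total cost over N hops
    print(direct / relayed)         # 4.0: the gain equals the number of hops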

3.1.3.4 Configuration

Cheng et al. [47, 57] proposed a method for automated camera calibration across a large VSN. It is based on Belief Propagation (BP) across the vision graph. For sparse camera coverage, Multi-Dimensional Scaling (MDS) has been used to evaluate the spatial correlation between sensors. This allows automatic node activation only when a user is about to enter the FOV [5].

3.1.3.5 Other considerations

As the overall resource for a VSN is often limited, much research has been carried out to make its usage more efficient. For example, the VSN processing power can be used in a distributed fashion [47]. Idle nodes can be triggered to share the processing burden of busy nodes. Mathematically, a VSN presents the joint characteristics of a sensor network and a multi-view computer vision system. From a computer vision perspective, a VSN has all the advantages of a multi-view system, such as three-dimensional (3D) reconstruction, occlusion handling [17, 69], and extended coverage. From a sensor network point of view, a VSN is an ideal environment for feature- and decision-level sensor fusion [40, 213]. The inherent redundancy ensures enhanced robustness to errors and sensor inaccuracies.

3.1.4 Privacy considerations in ubiquitous sensing

For practical applications, the deployment of sensors handling visual information can raise privacy concerns. Privacy is, however, a complex notion encompassing several aspects [128]. For example, privacy can refer to dignity, to empowerment or simply to the right to be left alone. The concept of privacy boundaries states that the information about a person usually does not cross certain boundaries [128]. Although this is not a major concern for sport training, the potential extension of the VSN framework to well-being monitoring needs to consider this seriously. In this regard, the blob paradigm combined with on-node processing and abstraction for cross-network communication offers an important way forward. This is the approach proposed by Lo et al. on “from blob to personal metrics to behaviour profiling”, in which appearance information is neither used nor transmitted, out of respect for privacy concerns [138].

3.1.5 Vision-based approach challenges

Vision-based methods provide overall information about the subject being monitored from an external POV, but these methods show some inherent limitations.

Spatial coverage The area covered by vision-based systems is limited by the camera FOV and occluding objects. Coverage extension can be performed either by increasing the number of vision sensors or by using wider-angle lenses. The former involves more maintenance and data processing costs, whereas the latter reduces the effective resolution of the system. Moreover, occlusions cannot always be avoided.

Spatial resolution Unless expensive high-resolution cameras are used in close proximity to the object being monitored, subtle but nevertheless important movements such as respiration may not be reliably detected. To achieve higher resolution, it is possible either to increase the number of vision sensors or to focus on specific regions of interest. This can be achieved by placing the sensor closer to the region of interest or by using longer focal-length lenses, at the cost of spatial coverage.

Temporal resolution Unless expensive high-speed cameras are used, the normal frame rate is only 25 to 30 Hz. This can be an issue for sport sensing. For example, the foot-to-ground contact time of a sprinter can be as low as 100 ms, thus only 2 to 3 video frames are recorded. The use of interpolation can introduce motion blur and inaccuracy.

Lighting conditions The lighting conditions can play a major role in the way vision-based sensors behave. IR sensors need to be used in poor lighting conditions.

Subject independence Although it is possible to recognise and track people in specific conditions, vision-based sensors are not designed to track a single person but rather a volume defined by the FOV. They are therefore prone to errors, particularly in crowded environments.

Invisible features Above everything else, many features are not visible to the camera; they need to be captured through the combined use of wearable physiological sensors.

3.2 Segmentation and motion evaluation

A prerequisite for analysing video data for activity recognition is object segmentation. Some methods, such as optical flow [99, 140] or space carving [126], can filter out the background from the subject. However, many other methods, including visual hulls [148], require prior segmentation of the subject. Thus far, several classes of background segmentation methods exist, from pixel-based to more subtle morphological methods, and several techniques are often combined in a particular context. Chroma-keying background segmentation, often used in television studios, is not applicable to real scenarios and will not be discussed here; only appearance-based segmentation will be considered in this section. Tracking methods are strongly related to segmentation, but they usually involve more knowledge about the image and are discussed in the next section.

While implementing algorithms for subject segmentation, several important issues need to be considered [220]:

Moved objects Most low-level segmentation algorithms are based on the assumption that a large moving area is to be considered as foreground. However, when a background object (e.g. a chair) is moved, it should still be considered as part of the background.

Progressive illumination changes The segmentation system must adapt to changing lighting conditions throughout the day and over different seasons.

Sudden light changes Lights may be switched on or off, thus creating dramatic changes to the images. A similar effect can also be observed when clouds move quickly in front of the sun.

Bootstrapping The absence of a good-quality training dataset may influence the algorithm’s performance. This is particularly common for scenes in public places where crowds are constantly moving. In this case, it may not be trivial to identify when the background is actually visible.

Non-static background Trees, televisions or other animated artefacts should not be classified as foreground. This means that the background model must be resilient to high spatial and temporal frequencies.

Camouflage Background and foreground of a similar colour/texture may present practical difficulties. Indeed, whilst the human eye is able to identify subtle colour/texture variations, typical vision sensors will require specific enhancement algorithms to highlight these changes.

Foreground aperture The inside of a uniform foreground object may appear static from a local perspective, and may therefore be misclassified as background. For example, if a uniformly coloured lorry is moving slowly in front of a camera, the region in the image centre may appear stationary.

Immobile subject An immobile person should not be considered as background. This is an issue because most algorithms progressively learn the background, for the aforementioned reasons.

Shadows and reflections Shadows and reflections cast on the environment often appear as foreground, as they exhibit shape and motion characteristics similar to those of the actual object. Specific algorithms need to be developed to handle the effect of shadows [139].

For foreground object segmentation, the last four issues mentioned above remain the most difficult problems to resolve in practice. Whilst no method can cope with all of them, many algorithms can deal with specific application domains, e.g., intruder detection for security applications.

3.2.1 Pixel-based segmentation

The lowest level of background segmentation is performed at the pixel level. Pixel-based segmentation methods aim at modelling the statistical distribution of the background colour of each individual pixel. If a sudden change of colour occurs, the pixel colour will not match the background distribution and the pixel will therefore be marked as foreground. These low-level methods in general require a good contrast between the subject and the background.

To achieve this, many models have been developed in the past. The simplest model consists of a static reference image. The mean value can also be computed over a training sequence, and the classification can be performed based either on a fixed threshold or on the variance of the colour distribution [121, 176, 241]. In order to account for potential illumination changes in the scene, these models can be updated progressively. These techniques are usually referred to as adaptive methods.
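
A minimal per-pixel sketch of such an adaptive mean/variance model, assuming greyscale frames held as NumPy arrays (the helper name and thresholds are illustrative, not from the cited works), could be:

    import numpy as np

    def update_background(frame, mean, var, alpha=0.02, k=2.5):
        """One step of an adaptive mean/variance background model: a pixel
        is foreground when it deviates by more than k standard deviations."""
        frame = frame.astype(np.float32)
        fg = np.abs(frame - mean) > k * np.sqrt(var)
        # adapt the model only where the pixel is classified as background
        mean = np.where(fg, mean, (1 - alpha) * mean + alpha * frame)
        var = np.where(fg, var, (1 - alpha) * var + alpha * (frame - mean) ** 2)
        return fg, mean, var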


While mean and variance models are based on a uni-modal colour description, histograms represent a more realistic statistical colour distribution. Similar to mean and variance, the histograms can be pre-computed or updated in real time. Several heuristics can be used to separate background from foreground. The simplest measures are the distance to the histogram maximum [112, 214] (unimodal) and the percentage of the maximum value. When multiple peaks are present in the histogram due to the long-term presence of objects, more sophisticated techniques involving higher-level information can be used to infer which peak in the histogram is to be considered as the background [85]. These methods are illustrated in Figure 3.2.

Figure 3.2: Pixel colour histograms for background segmentation. Left: threshold at 25% of the maximum value. Right: threshold on the distance from the histogram maximum.
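
A pixel-level sketch of the unimodal “percentage of the maximum” heuristic, assuming an 8-bit greyscale input (the helper and its parameters are illustrative), might be:

    import numpy as np

    def histogram_foreground(history, value, bins=32, ratio=0.25):
        """Classify one pixel value against the histogram of its past
        values: foreground when its bin holds less than a fraction
        `ratio` of the histogram maximum (cf. Figure 3.2, left)."""
        hist, _ = np.histogram(history, bins=bins, range=(0, 256))
        idx = min(int(value) * bins // 256, bins - 1)
        return hist[idx] < ratio * hist.max()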

A Gaussian Mixture Model (GMM) can also be used for modelling multi-modal background colour [83, 130, 207]. The main task of GMM is model building. The statistical model can be pre-computed using the Expectation-Maximisation (EM) algorithm; see Section 8.2.2 for more details on this algorithm. However, the background colour changes slowly over time due to changing lighting conditions or because of moving objects. Therefore, it is more desirable to incrementally update the model so as to adapt to such changes. An incremental variation of the EM algorithm [162] is often used for this purpose. In general, EM is considered too costly to implement with the limited processing power of a VSN node, therefore simpler methods such as k-means are sometimes used instead [207]. The balance between stability and adaptation is also an important issue to consider. In practice, fast adaptation suppresses slow-moving objects as background, whereas slow adaptation can create many false positives. To circumvent these problems, Lee [130] proposed the use of a variable learning rate that is a function of the colour change. Figure 3.3 illustrates the key mechanism of an adaptive background model based on GMM.
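
For illustration, OpenCV ships a GMM-based subtractor in the spirit of the adaptive models discussed above; a minimal usage sketch (the input file name is hypothetical) is:

    import cv2

    # Adaptive GMM background subtraction; detectShadows marks shadow
    # pixels with a separate label in the output mask.
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                    varThreshold=16,
                                                    detectShadows=True)
    cap = cv2.VideoCapture("serve.avi")      # hypothetical video sequence
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)       # 255 = foreground, 127 = shadow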

Figure 3.3: GMM used for adaptive background modelling. Top: an object enters the field, thus a new Gaussian is added to the model (red). As the object remains stationary, the weight of its Gaussian increases. Bottom-left: the object moves away and the original background model recovers to its exact original state. Bottom-right: the object stays long enough to be considered as part of the background. Its distribution is now incorporated as a part of the background model.

Other methods for foreground object segmentation include temporal derivatives [91], based on a threshold on the colour change between frames. It is also possible to use Principal Component Analysis (PCA) [169], also referred to as eigenbackground segmentation. A more sophisticated technique based on the Wiener filter has also been proposed [143, 220]. The prediction st of the value of the pixel at frame t, given its values at the previous frames {st−k} and a set of linear coefficients {ak}, is:

st = Σk ak st−k (3.2)

The expected squared prediction error E[e²t] can be calculated as follows:

E[e²t] = E[s²t] + Σk ak E[st st−k] (3.3)

The pixel is considered as foreground if the distance between the actual value and the predicted one is greater than α√E[e²t]. However, this method can be sensitive to noise. A single pixel can corrupt the entire history of many frames, and further improvements are necessary for low-SNR image sequences.

For background segmentation, a feedback loop can also be added to prevent a pixel colour from being included in the background model if it has already been detected as part of the foreground [207]. However, this should be handled carefully, as a segmentation error can be propagated over time. With this approach, the pixels of an object incorrectly classified as foreground will no longer be involved in the background model, thus preventing error recovery.

Figure 3.4: Top: a video sequence showing a serve action in tennis. Bottom: GMM-based pixel-level background segmentation. Noise and shadows can be observed.

Shadows and reflections are important issues to consider for pixel-based methods, as illustrated in Figure 3.4. They present rapid changes in the image that can be incorrectly classified as foreground. Some elementary methods can help remove shadows at the pixel level. For example, methods based on the use of the hue [176, 241] or other chromatic components of the pixel [109] assume that for pixels in shadow the inherent colour information is invariant. This, however, does not hold for achromatic pixels, as hue is not well defined in this case. Another approach is to assume that shadows are darker than the average background colour [241]. The use of several features describing the difference between two images, such as intensity difference, hue difference or colour distance, can potentially improve the segmentation result. For example, this can be implemented in a self-adaptive filter based on a neuro-fuzzy classifier, as proposed by Lo et al. [139].

3.2.2 Region-based segmentation

Morphological background segmentation methods do not treat pixels individually. Typical region-based segmentation methods aim at clustering sets of pixels sharing a similar property. The simplest region-based segmentation algorithms include edge detection [38], which uses boundaries to separate different regions of the image. Colour-based segmentation, on the other hand, uses a common colour feature to describe different parts of the image. For example, Shafarenko et al. [201] used a watershed algorithm to segment an image, and improvements can be made by using histograms in the YUV colour space [202].

Because real-life objects rarely exhibit a consistent colour, texture-based methods are commonly used. Textures can be modelled and segmented after defining a distance measure between different texture models. Unlike colour-based techniques, scale becomes an important issue when working with textures [155].

Texture-based shadow detection is another example of region-based segmentation. Shadows present a difficult problem at the pixel level, but they exhibit properties at a higher level that can be exploited. In fact, while a part of the background is covered by shadows, some textural patterns remain invariant. Leone et al. [131] proposed to use a texture signature based on Gabor filter banks [106]. With more knowledge about the observed surface, the effect of shadow can be explicitly described. For example, Barsky and Petrou [25] used fractals to model the shadowing effect.

In practice, region-based segmentation methods are often combined with pixel-based methods in order to improve the overall accuracy. For example, uniformly coloured regions in motion are often misclassified as background at a pixel level. However, the boundaries of such regions are usually detectable. By propagating the movement information [53, 220], it is possible to reconstruct the entire motion field by the use of a smoothness constraint, e.g., Horn-Schunck's optical flow computation [99], as detailed in Section 3.2.5.

Some elementary morphological filters such as erosion and dilation (sometimes known as minimum and maximum filters) can also be used. These filters output the minimum or the maximum of the values in the neighbourhood of a given pixel. An erosion followed by a dilation, for example, is an elementary and yet efficient noise filter. This filter is demonstrated in Figure 3.5.

Figure 3.5: Top: raw binary image segmentation for the serve sequence shown in Figure 3.4. Bottom: after erosion and dilation.
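A minimal sketch of this erosion-then-dilation filter (morphological opening) on a binary mask, here using scipy.ndimage with an assumed 3×3 structuring element:

```python
import numpy as np
from scipy import ndimage

def denoise_mask(mask, size=3):
    """Remove isolated foreground pixels by an erosion followed by a
    dilation (morphological opening) with a size x size structuring element."""
    structure = np.ones((size, size), dtype=bool)
    eroded = ndimage.binary_erosion(mask, structure=structure)
    return ndimage.binary_dilation(eroded, structure=structure)
```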

3.2.3 Frame-based segmentation

Frame-level methods are mostly used to detect global changes in the image. For example, when the light is switched on in a room, all the background models learnt at the pixel level become invalid, and a pixel-based method will take some time to recover, depending on the algorithm used. The simplest frame-level method to tackle such a problem is to maintain several background models and detect such sudden changes, typically signalled when foreground detection suddenly covers a large proportion of the scene [220]. A k-means algorithm is then used to determine the best background model. If no suitable model is available, generating a new one can still be more reliable than keeping the existing one.

3.2.4 Stereo-based segmentation

When multiple cameras are available, the background can also be segmented by using dense stereo matching [70]. The disparity image is computed from a stereo pair of cameras, based on the correlation between the left and right images. The depth (or range) is easily derived from the disparity, thus leading to a range image of the observed scene. The depth image can then be segmented in a similar fashion to a grey-scale image. This method can effectively deal with non-static backgrounds, changes in illumination conditions and, to some extent, occlusions, shadows and reflections. A good example of a practical application of stereo imaging is given in [122], which allows for robust background segmentation.

Range cameras also offer the same functionality in a single device. Modern Time-Of-Flight (TOF) cameras can provide a range image (or depth map) in real time. They are based on a similar principle to radar: they emit light pulses and measure the return time (hence their name) to evaluate the target distance. A fairly standard optical system gathers the reflected light onto a photo-diode array, which enables a full image to be reconstructed in one pass, whereas conventional radars only acquire a single point at a time.

3.2.5 Motion-based segmentation

Optical flow is an effective method for motion segmentation. The method estimates the most likely two-dimensional (2D) dense motion field from a temporal sequence of images and has attracted extensive interest in human motion recognition [53, 54]. As the apparent 2D motion is actually a projection of 3D movement, one may argue that the image intensity variation does not necessarily reflect the real flow [92, 226]. Nevertheless, many methods have used this as an approximation for detecting apparent 2D motion, with error rates of about 10 to 20% [24].

In previous work, Barron et al. [24] distinguished three processing stages performed by most optical flow methods:

1. Low-pass filtering to improve SNR.

2. Computation of local measurements such as the spatial and temporal derivatives or correlations.

3. Generation of a dense and smooth flow from the local measurements.

The methods used to compute the optical flow can be classified into three main categories based on their underlying techniques: differential methods, region-matching methods, and energy- and phase-based methods. A few examples of each class are given hereafter.

3.2.5.1 Differential methods

Differential techniques rely on the spatial and temporal derivatives of the image. A constraint on the intensity conservation, also referred to as the gradient or brightness conservation, states that the global illumination does not change, but local pixels can move. More specifically, if the pixel (x, y) at time t moves to (x + δx, y + δy) at t + δt, we have the constraint:

I(x, y, t) = I(x + δx, y + δy, t + δt)    (3.4)

where I(x, y, t) is the image intensity. Assuming that the movement is relatively small, the intensity can be approximated using the first order of the Taylor series expansion:

I(x + δx, y + δy, t + δt) = I(x, y, t) + (∂I/∂x)δx + (∂I/∂y)δy + (∂I/∂t)δt + O(x, y, t)

Thus:

(∂I/∂x)δx + (∂I/∂y)δy + (∂I/∂t)δt = 0

and therefore:

IxVx + IyVy = −It,  i.e.  ∇I·V = −It    (3.5)

where V = (Vx, Vy) is the optical flow vector, and I, Ix, Iy and It are the image intensity and its partial derivatives.

As the real motion happens in 3D, this constraint is not sufficient to determine the optical flow. As a matter of fact, equation 3.5 provides only a single linear relationship with two unknowns Vx and Vy. Further assumptions are necessary to solve the above problem: one can use second-order derivatives [222], combine local and global derivatives based on a least-squares optimisation of the velocity [116, 140], or rely on a smoothness constraint [99, 160].

Moreover, as the above methods are based on local derivatives, they are only able to detect the motion locally. In this case, a large area with uniform intensity will appear as stationary. For the same reason, the global motion direction of an edge cannot be determined with a local value; the motion appears locally as orthogonal to the contour. Similarly, if the motion is faster than the frame rate, it cannot be detected. This issue is commonly referred to as the aperture problem. Various solutions exist to solve this issue, which consist of further assumptions on the velocity distribution.

Horn-Schunck and Lucas-Kanade methods are two popular approaches for solving the optical flow. The Lucas-Kanade method [140] is based on a least-squares approximation of the partial derivatives of the image in a single pass:

V = [ ∑i W²Ixi²    ∑i W²IxiIyi ]⁻¹ [ −∑i IxiIti ]
    [ ∑i W²IxiIyi  ∑i W²Iyi²   ]   [ −∑i IyiIti ]    (3.6)

where the sums are carried out over the neighbourhood of a given point and W is a window function giving more weight to the central pixel (a Gaussian can be used for this purpose). This method is very sensitive to the aperture problem, as the optimisation occurs only at a local level. However, it is relatively robust to noise.
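A compact dense implementation of equation 3.6 can be sketched as follows; it assumes float grey-scale images, Sobel-based derivatives, and a uniform (box) window applied to all sums rather than a Gaussian (both are simplifying assumptions, not choices made in this thesis):

```python
import numpy as np
from scipy import ndimage

def lucas_kanade(prev, curr, win=7):
    """Dense Lucas-Kanade flow: solve the 2x2 system of eq. 3.6 per pixel.
    'prev' and 'curr' are float grey-scale images; 'win' is the window size."""
    Ix = ndimage.sobel(prev, axis=1) / 8.0               # spatial derivatives
    Iy = ndimage.sobel(prev, axis=0) / 8.0
    It = curr - prev                                     # temporal derivative
    box = lambda a: ndimage.uniform_filter(a, size=win)  # windowed sums (up to scale)
    Ixx, Ixy, Iyy = box(Ix * Ix), box(Ix * Iy), box(Iy * Iy)
    bx, by = -box(Ix * It), -box(Iy * It)
    det = Ixx * Iyy - Ixy ** 2
    det = np.where(np.abs(det) < 1e-9, np.nan, det)      # flag ill-conditioned pixels
    Vx = (Iyy * bx - Ixy * by) / det                     # closed-form 2x2 inverse
    Vy = (Ixx * by - Ixy * bx) / det
    return Vx, Vy
```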

The classical Horn-Schunck method [99] introduces a global smoothness constraint, and the optical flow is calculated by successive iterations of:

V^{k+1} = V̄^k − ∇I (IxV̄x^k + IyV̄y^k + It) / (α² + Ix² + Iy²)    (3.7)

for iteration k at a given pixel. In this equation, V^k is the optical flow vector, V̄^k is the average of V^k in its neighbourhood, and α is a regularisation constant ensuring the smoothness of the derived result. The propagation of the flow performed by the smoothness constraint partially resolves the aperture problem. However, the method is more sensitive to noise than Lucas-Kanade. This algorithm is illustrated in Figure 3.6.

Figure 3.6: Top: original video sequence. Bottom: Horn-Schunck's optical flow [99]. Red lines denote larger flow magnitudes; it can be observed that the insufficient frame rate leads to a poor approximation of the racket motion in the last frame.
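A minimal iterative implementation of equation 3.7, again assuming float grey-scale images; the derivative and neighbourhood-averaging kernels are chosen for simplicity and are not prescribed by the text:

```python
import numpy as np
from scipy import ndimage

# 4-neighbour averaging kernel used to compute the local flow average V-bar.
AVG = np.array([[0, 0.25, 0], [0.25, 0, 0.25], [0, 0.25, 0]])

def horn_schunck(prev, curr, alpha=1.0, n_iter=100):
    """Iterate eq. 3.7: propagate flow under a global smoothness constraint."""
    Ix = ndimage.sobel(prev, axis=1) / 8.0
    Iy = ndimage.sobel(prev, axis=0) / 8.0
    It = curr - prev
    Vx = np.zeros_like(prev)
    Vy = np.zeros_like(prev)
    denom = alpha ** 2 + Ix ** 2 + Iy ** 2
    for _ in range(n_iter):
        Vx_bar = ndimage.convolve(Vx, AVG)
        Vy_bar = ndimage.convolve(Vy, AVG)
        update = (Ix * Vx_bar + Iy * Vy_bar + It) / denom
        Vx = Vx_bar - Ix * update
        Vy = Vy_bar - Iy * update
    return Vx, Vy
```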

3.2.5.2 Region matching methods

When the SNR of an image is poor, the local computation of the spatio-temporal derivatives can be inaccurate. In this case, region-matching techniques can be used, which aim at identifying the spatial shift of given regions over time. These methods are usually based on the maximisation of the Normalised Cross-Correlation (NCC) or the minimisation of a distance between regions (typically the Sum-of-Square Difference (SSD)) [9, 35].

Given an image I and a feature f, the square Euclidean distance between the image and the feature at a given position (u, v), also referred to as the SSD, can be computed as follows:

SSD_{I,f}(u, v) = ∑x,y [I(x, y) − f(x − u, y − v)]²    (3.8)
               = ∑x,y [I(x, y)² + f(x − u, y − v)² − 2 I(x, y) f(x − u, y − v)]

∑x,y f(x − u, y − v)² is constant and, assuming that the intensity of the image is uniformly distributed, ∑x,y I(x, y)² is nearly constant. A cross-correlation measure can therefore be calculated as:

Corr_{I,f}(u, v) = ∑x,y I(x, y) f(x − u, y − v)    (3.9)

As this measure depends on the intensity distribution in the image and on the size of the feature, a normalised version can be derived from the previous equation:

NCC_{I,f}(u, v) = ∑x,y [I(x, y) − Ī_{u,v}] [f(x − u, y − v) − t̄] / √( ∑x,y [I(x, y) − Ī_{u,v}]² · ∑x,y [f(x − u, y − v) − t̄]² )    (3.10)

where t̄ is the mean of the feature and Ī_{u,v} is the mean of the image region on which the feature lies.
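The following sketch computes equation 3.10 directly for a single offset (u, v); it assumes the feature fits entirely inside the image at that offset. For exhaustive search over all offsets, FFT-based formulations would normally be preferred:

```python
import numpy as np

def ncc(image, feature, u, v):
    """Normalised cross-correlation (eq. 3.10) of 'feature' placed at offset
    (u, v) in 'image' (u: column shift, v: row shift). Returns a value in [-1, 1]."""
    h, w = feature.shape
    region = image[v:v + h, u:u + w].astype(float)
    feat = feature.astype(float)
    region = region - region.mean()          # subtract regional image mean
    feat = feat - feat.mean()                # subtract feature mean
    denom = np.sqrt((region ** 2).sum() * (feat ** 2).sum())
    return (region * feat).sum() / denom if denom > 0 else 0.0
```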

These methods are in fact similar to differential techniques in the sense that they both minimise a distance metric, but they are applied at a larger scale and are therefore relatively immune to noise.

3.2.5.3 Energy-based and phase-based methods

Energy-based methods rely on a measure of the motion of the energy in the image. They are also referred to as frequency-based methods, as a Fourier transform can be used to detect the translation [2, 95]. Similarly, phase-based methods have been developed by relying on the fact that phase may be more robust than the amplitude of the signal.


3.2.5.4 Other considerations

Many recent methods aim at finding alternatives to the intensity (brightness) conservation constraint for optical flow. This constraint does not hold in real cases because of the 3D nature of the scene and changes in the light source. A static object illuminated by a fading light, for example, may be interpreted by many optical flow algorithms as being in motion. This effect is illustrated in Figure 3.7. In order to address this issue, physical models have been integrated into the algorithms [92]. However, the reformulated problem is much more complex to solve, as both optical flow and physical model parameters need to be calculated. The most common solution is to rely on a well-defined physical model for each scenario [92]. Although these ad-hoc methods may work for a particular case, they are difficult to generalise.

Figure 3.7: Due to the intensity conservation constraint, a fading light on a static object is interpreted as motion by the Horn-Schunck method (on the right).

Most of the methods detailed above can be improved by a hierarchical coarse-to-fine approach, also referred to as a pyramid method in [35]. This is an elegant solution to the aperture problem, adopted by Davis [54], for example. In this study, a Motion History Image (MHI) is derived from a sequence of segmented images. In an MHI, the pixel intensity is defined according to its lifetime: the more recently a foreground pixel appeared in the image, the brighter it will be. By generating a hierarchical pyramid of MHIs at different scales, it is possible to derive optical flow at several scales, and thus to overcome the aperture problem.
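A sketch of the MHI update just described; the 8-bit range and the per-frame decay value are illustrative assumptions, not parameters given in the text:

```python
import numpy as np

def update_mhi(mhi, fg_mask, decay=8):
    """Update a Motion History Image: newly detected foreground pixels get
    full brightness (255); older history fades by 'decay' per frame."""
    mhi = np.maximum(mhi.astype(int) - decay, 0)   # fade previous history
    mhi[fg_mask] = 255                             # stamp current foreground
    return mhi.astype(np.uint8)
```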

3.2.6 Hybrid methods and semantic input

Most of the robust segmentation methods normally rely on several base classes of segmentation. Typically, pixel-based segmentation is first applied, followed by morphological filtering. The Wallflower project [220] is an example of such a hybrid approach.

It is important to note that most of the time background segmentation is only the first stage of a vision processing pipeline. Higher-level tracking and recognition algorithms are based on these segmented images. In return, these algorithms can give feedback to the lower-level background segmentation algorithm. For example, a television can be considered as foreground by a pixel-based segmentation algorithm, but a human tracking algorithm may perform a shape classification step and conclude that the television image does not match the human model. Eventually, a feedback loop will force the pixel-level algorithm to learn the television as background. However, care must be taken when designing such feedback loops: by mixing the semantic models, the global accuracy of the system can be affected. In the television example, if the pixel-level algorithm distinguishes both the human subject and the television well from the rest of the image and the tracking method can prune out the television, the overall system can work perfectly. However, if additional feedback about the television is sent to the pixel-level algorithm, this may lead to every object in the television area being considered as background, even when the human subject walks in front of it.

Several high-level methods have been developed at a semantic level to distinguish humans from other objects. A naive approach is to consider as foreground the proportion of pixels with colours falling into the human range [213]. As many objects tend to exhibit straight edges whereas the human body shows smoother contours, computing the quantity of straight edges is also an indicator of the presence of a human subject [213]. A comparison of several methods carried out by Toyama et al. [220] is shown in Figure 3.8.

Figure 3.8: Background segmentation comparison carried out by Toyama et al. [220]. The top row is the real colour image, the second the ideal hand-segmented one, followed by the output of several algorithms.

3.3 Human modelling and tracking

Direct 2D posture recognition and 3D reconstruction provide good results in non-cluttered environments. As previously discussed, noisy video data can interfere and lead to inaccurate results. For example, background segmentation errors due to shadows, reflections and occluding objects will lead to inaccurate 2D and 3D silhouettes. Self-occlusion will always be problematic in a silhouette-based posture recognition framework.

A priori knowledge of the human body and its dynamics can effectively constrain the search space, thus providing more accurate results. A human model can be constructed from a set of possible joint angles and angular motion limits. Such a model can either be pre-built using expert knowledge or learnt on the fly during data acquisition. In this way, posture recognition is turned into an optimisation problem. Predictive methods such as the Kalman filter can improve the robustness of the tracking algorithm. The model can finally be fitted directly onto the 2D image and the most likely posture then inferred. Accurate body angles and velocities can easily be derived from the human model. This permits, for example, the computation of the precise dynamics of a runner.

Using a larger number of cameras, more sophisticated fitting algorithms can be used to derive models for less constrained sports such as tennis. For example, a system is designed in this thesis to locate the player on the court and provide information about cross-court movements. More complex motion, such as a tennis backhand, can also be recognised from such a model, but a learning phase or expert knowledge is required.


3.3.1 Marker-based tracking methods

Human tracking can be performed using intrusive techniques requiring the tracked subject to wear specific clothes, markers or sensors. In the early stages of human body tracking, mechanical methods were widely used. Most of them relied on mechanical arms attached to the body: capacitive or resistive sensors at the joints provide the system with an accurate estimation of the limbs' positions. However, these devices impose strong constraints on the user's motion, and are therefore not used extensively anymore. More recent electro-mechanical methods include gyroscope- and accelerometer-based wearable sensor networks [245], which are more lightweight and less constrictive.

The VICON system [227], for example, has been popular for animation in the movie industry, along with other functionally similar systems such as BTS Smart-D [33] and Qualisys Oqus [185]. It requires actors to wear dark clothes with IR markers. The markers' 3D positions are then evaluated by multiple IR cameras across the activity volume, often with sub-millimetre accuracy. As the markers are reflective spheres, this system is considered passive. Further constraints based on the spatial relationship between the markers allow the body's elements to be identified precisely. Other similar techniques, such as the NDI OPTOTRAK [161], are based on synchronised stroboscopic IR Light-Emitting Diodes (LEDs) worn on the body. This is considered an active system, slightly more intrusive than the VICON due to wires running along the body.

Other tracking techniques rely on the use of an electromagnetic (EM) field, such as the Polhemus Fastrak [184] or the Ascension MotionStar [14]. A high-frequency magnetic field is generated by a static emitter. The intensity of this field is captured by wearable sensors placed on the body. This allows the local computation of the 3D position and orientation of the markers with respect to the emitter. The specific emission frequency allows filtering out part of the noise. The advantage of EM-based techniques over IR-based approaches is their robustness to occlusion. However, the tracking volume is usually smaller and the tracker is sensitive to ferromagnetic objects.

High-quality EM and IR based commercial tracking systems are usually relatively expensive and require an extensive pre-calibration stage. It should be mentioned that because the markers or sensors are usually placed on the subject's skin, issues due to soft tissue deformation can in fact introduce large errors [21].

3.3.2 Markerless vision-based tracking

Markerless vision-based tracking offers a cheaper and less intrusive alternative to the systems described above, although more processing is required. Preliminary work was in fact carried out as early as the end of the 19th century by photographers such as Eadweard Muybridge and Etienne-Jules Marey, who both carried out animal motion analysis based on the first cinematographic slow-motion recordings.

3.3.2.1 Generalities

The major difference between the background segmentation techniques detailed in the previous sections and vision-based tracking methods is the feedback, or prediction, loop from the model estimation to the image feature extraction module, as shown in Figure 3.9. Intuitively, rather than trying to extract the foreground from the background from scratch at each frame, a tracking system relies on a priori information such as the previous position of the subject. For instance, if the subject was detected at a position p at time t and it is known that its maximum speed is s, the tracking system will only attempt to locate the subject in an s × δt neighbourhood of p at time t + δt.

Figure 3.9: The main steps involved in a general video-based tracking system: image features are extracted from the video sequence and fed into the model estimation (informed by the human and camera models), which drives the activity analysis; a feedback loop runs from the model estimation back to the feature extraction.

A tracking system is composed of a number of elements that can be implemented with various vision algorithms. The choice will affect global characteristics of the system such as hardware installation cost and computational burden, as well as overall system flexibility and robustness. The main considerations when designing a tracking system include:

Hardware configuration Some systems rely on a single camera [30, 91], whereas others are based on a camera network with or without overlap [74, 83]. Single-camera systems are easier to implement but do not cover a large volume. Relying on several cameras can increase the tracking space, but it requires extra cross-camera matching, which may not be trivial if the cameras do not overlap. Overlapping cameras allow smoother tracking over the entire network and therefore more robust estimation. Stereo cameras are deployed to provide depth information. Active camera tracking systems such as [70, 159] rely on a pan/tilt camera mechanically following the subject of interest.

Human model The human model generally consists of two main components: the shape model and the motion (dynamic) model. These models can either be learnt by the tracking system [105] or based on explicit prior knowledge [74, 91].

Camera model As with the human model, some systems require the camera model to be known [74] whereas others do not rely on a camera model and can self-calibrate. For instance, each node of the sensor forest in [83] is able to determine its position with respect to the others.

Scene model Typically assumed to be static, a model of the scene based on heuristics [121] or built while tracking [83] is sometimes used to provide more constraints on the subject's motion. The most widely used scene model is an estimation of the ground plane. Given the ground plane, it is possible to compute the tracked and static objects' distances with respect to the camera and therefore to tackle occlusions [83, 121].

Thus far, many variations of shape models have been developed. Non-human-specific methods such as B-splines and snakes [113, 218] allow for the tracking of free-form shapes over time. Elementary human models rely on a set of ellipses (head, torso, feet and hands) [241] or on 2D cardboard models [91]. A large proportion of the research takes into account the specific structure of the human body, consisting of skeleton and flesh. A 3D articulated model can be built and joint constraints can be modelled a priori. Anthropometry (bone size ratios), as well as human kinematics (maximal joint angles, muscle contraction speeds), has been extensively studied. Some tracking methods rely only on the joints. However, many methods rely on contour detection, and therefore the flesh needs to be modelled as well. Cylinders [7], generalised cylinders, ellipsoids [30] and superellipsoids [74] are also used for this purpose. Loose-limbed models are similar, although based on spring modelling instead of rigid segments in order to partially relax the joint constraints. Some of these models are illustrated in Figure 3.10.

Technically, the main issues involved in human tracking include:

Lack of visual invariants, e.g. due to the point of view (POV), lighting conditions and clothing.

Large DOF of the human model. Some elementary 2D models tracking only the torso, head, hands and feet [91, 241] have relatively few DOFs, but some 3D models can have up to 22 DOFs [74], and are thus computationally expensive.

Occlusion is a main issue for practical tracking algorithms. Occlusions affect the global shape, and therefore subsequent recognition. A common way of dealing with occlusions is to sort the tracked objects according to their respective distances to the camera. Knowing which object can potentially occlude another, it is possible to discard the silhouette matching process in the occluded areas [83, 121]. When tracking human interactions with occlusions, a remaining issue is to determine which subject is which when they reappear as individual blobs. Motion smoothness can be used, but is unreliable at low speed. Building image signatures of the subject, such as the temporal texture template [91], allows for more reliable matching.

Figure 3.10: Commonly used human shape models, left to right: 2D loose cardboard model, 2D articulated model with ellipses modelling the flesh, and 3D tapered superellipsoids used in [74].

Shadows and reflections can also be an issue, although the model state estimation stage often includes a shadow removal stage based on a priori knowledge. For example, Zhao and Nevatia [251] make the assumption that the sun is the only light source for outdoor tracking. Moreover, this source is supposed to be directional only, without diffusion. As the position of the sun can be computed using the spatio-temporal coordinates of the camera, it is possible to remove the shadows using geometric information only. Zhao and Nevatia also use constraints based on the camera model for further shadow discrimination.

3.3.2.2 Kalman filter

The Kalman filter [110] is a recursive linear filter which estimates the state of a dynamic system from a set of measurements with random errors (noise, missing data). One of its main advantages is that it does not require a history of measurements. The Kalman filter is similar to the Hidden Markov Model (HMM), but based on continuous states.

The Kalman filter model is based on the assumption that the state xk at time k can be derived from xk−1 as follows:

xk = Fk xk−1 + Bk uk + wk    (3.11)

where Fk is the state transition model applied to the state xk−1, Bk is the control input model applied to the control vector uk, and wk is the process noise, assumed to follow a normal distribution with covariance matrix Qk.

Considering an observation zk of the state xk, it is assumed that:

zk = Hk xk + vk    (3.12)

where Hk is the observation model and vk is the observation noise with covariance Rk.

Kalman filtering is performed by successive application of the prediction and correction phases (the latter sometimes also referred to as update). When using the Kalman filter for tracking purposes, the observation zk feeds into the filter between the two phases. In this case, the predicted state value can be used to reduce the search space while observing. This is illustrated in Figure 3.11.

Prediction:

xk|k−1 = Fk xk−1|k−1 + Bk uk         (predicted state)
Pk|k−1 = Fk Pk−1|k−1 Fkᵀ + Qk        (predicted estimate covariance)

Correction:

yk = zk − Hk xk|k−1                  (innovation residual)
Sk = Hk Pk|k−1 Hkᵀ + Rk              (innovation covariance)
Kk = Pk|k−1 Hkᵀ Sk⁻¹                 (optimal Kalman gain)
xk|k = xk|k−1 + Kk yk                (updated state estimate)
Pk|k = (I − Kk Hk) Pk|k−1            (updated estimate covariance)

where at time k, xk|k is the estimated state and Pk|k the error covariance matrix.

Figure 3.11: Kalman filter used for tracking, following the general principle illustrated in Figure 3.9.
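A direct transcription of these prediction/correction equations into code — a minimal sketch assuming constant matrices F, B, H, Q and R supplied by the caller:

```python
import numpy as np

class KalmanFilter:
    """Minimal linear Kalman filter implementing the prediction and
    correction equations of Figure 3.11."""
    def __init__(self, F, B, H, Q, R, x0, P0):
        self.F, self.B, self.H, self.Q, self.R = F, B, H, Q, R
        self.x, self.P = x0, P0

    def predict(self, u):
        self.x = self.F @ self.x + self.B @ u            # predicted state
        self.P = self.F @ self.P @ self.F.T + self.Q     # predicted covariance
        return self.x

    def correct(self, z):
        y = z - self.H @ self.x                          # innovation residual
        S = self.H @ self.P @ self.H.T + self.R          # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)         # optimal Kalman gain
        self.x = self.x + K @ y                          # updated state estimate
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
        return self.x
```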

In practice, setting the observation and system noise covariance matrices Q and R is an issue. This operation is sometimes referred to as parameter tuning, and can be performed either by hand-tweaking or by off-line analysis of the data. However, if the noise is not constant over time, care should be taken to update these values during tracking.

The Extended Kalman Filter (EKF) is a variation of the Kalman filter handling non-linear state transition and observation models. In this case, the matrices F and H are defined as the Jacobian matrices of non-linear differentiable functions.

3.3.2.3 Kalman-based tracking systems

Pfinder (person finder) [241] is one of the major works in human tracking. With this technique, head, hands, feet, torso and lower body are tracked independently, using colour information (in particular the normalised flesh colour) and contours. Each of them is modelled by an ellipsoid, and the system can only track a single person. The tracker needs to be initialised in a posture where the feet, hands and head can easily be identified, such as the starfish posture. Strong priors about the human shape model can be pre-defined to allow for more robust tracking.

Zhao and Nevatia [251] relied on a gross human model (3D ellipsoid) for depth estimation to handle occlusions. A Kalman filter assuming a constant velocity model was used for tracking. A motion template based on an averaged optical flow was used for mode detection in 2D, such as standing or walking. This template allows for an estimation of the motion direction, provided that the movement does not occur along the axis of the camera. A 3D articulated model is also used for posture estimation.

The method proposed by Koller et al. [121] is based on two Kalman filters: one for the shape model and one for the motion model. This tracker relies on the ground plane constraint to evaluate the tracked object's depth and handle occlusions.

The method used in [83, 207] is an outdoor tracker based on a forest of sensors. Correspondences are established automatically between camera pairs in order to determine an approximate spatial relationship between the cameras. The ground plane is estimated automatically from the tracked subjects and used to handle occlusions. This tracker is based on linearly predictive multiple hypotheses: a pool of Kalman filters is updated simultaneously and the best fit is chosen. This method partially overcomes the linear limitation of the Kalman filter.

In [169], a Kalman filter tracks the subjects using a blob model based on features such as location, coarse shape, colour or velocity. At a higher level, the interactions between several subjects are modelled with a HMM and a Coupled Hidden Markov Model (CHMM). Both techniques allow dynamic time warping, and this aspect of the HMM is unique amongst the usual tools. The model training is performed with synthetic models, and new models can be built automatically.


3.3.2.4 Non Kalman-based methods

As an alternative, the W4 system [91] tracks head, feet, torso and hands based on a cardboard model. When tracked blobs are either splitting or merging, several cases are considered; a final decision is made after a few frames, based on the probabilities of each case. When two subject blobs split after being merged (i.e. when a subject reappears after an occlusion by another tracked subject), a system is designed to identify which subject corresponds to which blob in the image. The temporal texture template, which is basically a normalised appearance averaged over time, is computed while the subject is visible. When a subject reappears, the decision is made according to the best match over the given templates.

Gavrila and Davis [74] used an articulated 3D human model with 22 DOFs. The flesh was modelled by tapered superellipsoids, providing a realistic representation. This, for example, enables the tracking of two dancers interacting tightly. The tracking does not actually occur in the 3D space but is applied to the reprojected 2D model, relying on a generate-and-test strategy. This method consists of generating 2D projections of the model with an exhaustive set of parameters. The generated image is then tested against the reference image, and the parameters leading to the best match are chosen. In the testing phase, the similarity of the detected subject contour and the modelled contour is measured with a variation of Chamfer matching. The direct Chamfer distance DD between a test set of points T and a reference set of points R is defined by:

DD(T, R) = ∑t∈T dd(t, R) = ∑t∈T min r∈R ‖t − r‖    (3.13)

which can be normalised as follows:

D̄D(T, R) = DD(T, R) / ‖T‖    (3.14)

The main issue with the direct Chamfer distance is that it defines asymmetrically how close one set is to another; for example, a point might be very close to a line without the two being considered similar. The undirected Chamfer distance D overcomes this issue by averaging the Chamfer distance in both directions:

D(T, R) = (D̄D(T, R) + D̄D(R, T)) / 2    (3.15)

As the generate-and-test strategy can potentially be computationally intensive given the high DOF involved, a hierarchical search-space decomposition strategy is used: the global subject position is evaluated first, followed by the torso, and eventually the limbs. Multiple subject views can be used and different hypotheses fused. However, while this tracking method is markerless, the figures suggest that relatively specific, tight clothing must be worn.
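A sketch of the undirected Chamfer distance (equations 3.13–3.15) for binary edge maps. Using a Euclidean distance transform is an implementation choice for efficiency, not something prescribed by the text:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def directed_chamfer(test_mask, ref_mask):
    """Normalised directed Chamfer distance (eq. 3.13-3.14): mean distance
    from each point of boolean 'test_mask' to the nearest point of 'ref_mask'."""
    # distance_transform_edt gives, per pixel, the distance to the nearest
    # zero pixel, so the reference mask is inverted first.
    dist_to_ref = distance_transform_edt(~ref_mask)
    return dist_to_ref[test_mask].sum() / max(test_mask.sum(), 1)

def chamfer(test_mask, ref_mask):
    """Undirected Chamfer distance (eq. 3.15)."""
    return 0.5 * (directed_chamfer(test_mask, ref_mask) +
                  directed_chamfer(ref_mask, test_mask))
```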


Bregler and Malik [30] relied on twists to represent the angles between body segments. Twists have the advantage over the commonly used Euler angles of exhibiting no singularities. The flesh was modelled by ellipsoids, and the system handles both single and multiple cameras.

Chen et al. [44] rely on a particle filter, also referred to as sequential Monte Carlo (SMC), to track human motion. SMC methods approximate a distribution by random sampling of the dataset. This allows the modelling of multimodal, non-Gaussian distributions and thus complex movement tracking. A pre-calibrated set of cameras is used to track the head and feet (highest and lowest points) whilst being able to handle occlusions. The human shape model used for posture estimation is a binary three-by-three grid (9 squares) fitted to the bounding box of the subject silhouette. Colour features are used to distinguish between subjects in the case of occlusion or contact.

The CONDENSATION algorithm (CONditional DENSity propagATION) [105] can deal with non-Gaussian observations and can thus represent several hypotheses simultaneously. It is based on the concept of factored sampling, which allows the evaluation of complex probabilities by random sampling of the dataset. This concept is equivalent to that of the particle filter. Lucena et al. [141] proposed an extension of the CONDENSATION algorithm by including a local estimation of the motion computed through an optical flow method.

Agarwal and Triggs [4] used a set of regressors to evaluate the probabilities of multiple hypotheses in a monocular system. The regressors, also referred to as independent variables, were used as controls on the system input that can generate several cases or hypotheses. Wu and Yu [243] relied on a two-layer Markov random field to model complex shapes.

3.3.2.5 Other tracking aspects

Stereo pairs can also be used for tracking [70]. The range image generated by correlation calculation has many advantages over a colour image. Range-image-based tracking is less affected by occlusions, as objects can be sorted by their distances from the camera. As for segmentation, range images are also less sensitive to lighting changes, subject-specific features and shadows.

Exemplar-based tracking [65] relies on a collection of possible postures that are matched to the actual video sequence. This collection can be large, as the number of potential postures taken by a human is considerable. If the collection is based on 2D binary images, then every posture needs to be represented under several points of view. If it is based on 3D models, however, the search space will be larger. The crucial element is to find appropriate features to match efficiently.

Snakes [113], also referred to as active contour models, are a parametrisation of the contour of a shape. When using snakes for tracking purposes, the model is updated in order to constrain the snake to match the edge of the tracked subject in the image. A Kalman filter can additionally be used for motion tracking [218]. However, because of the high DOFs involved, these methods are relatively computationally intensive.

3.4 Conclusion

In this chapter, some of the major techniques for tracking have been reviewed. Vision-based approaches offer a versatile, unobtrusive and efficient means of human motion monitoring.

More specifically, a network of vision sensors is particularly well suited for sports monitoring. Indeed, the light-weight nature of the VSN nodes greatly simplifies their installation in different locations. Because the VSN nodes are light-weight, low-power, self-contained modules, a number of issues must be addressed for their practical deployment. These include, for example, algorithm optimisation, self-configuration and automated power management of the nodes. For these reasons, VSN hardware is still under extensive development in the community.

For human tracking, low-level detection algorithms are required. Two main approaches are typically used for this purpose: segmentation of the subject silhouette from the background, or identification of specific local features exhibited by a human subject. The use of a per-pixel statistical background model is an efficient means of segmenting the foreground from the relatively static background. Alternatively, optical flow based on the spatio-temporal derivatives of the input video stream may be used to distinguish the image regions. Edge detection and other feature detectors may be employed in order to identify local anatomical features such as the head, feet or hands.

Once a subject has been identified in the image by either means, it is possible to recover the human pose by the use of predictive tracking. A human pose can be defined by the 3D angles of an articulated model. Based on such a model, a tracking algorithm aims to provide a robust estimation of the human pose given an image-based observation (i.e. segmented silhouette or local features) and a solution predicted from the model dynamics and former observations. For this purpose, spatio-temporal information based on Kalman filtering can be employed.

It is worth noting that the use of a VSN for practical sport motion capture can introduce a number of further challenges. These include, for example, view-dependent object representation, which introduces large variations to the tracked subject silhouette once projected onto the 2D image. Solutions such as the deployment of multiple cameras are proposed later in this thesis.


Chapter 4

Motion Reconstruction from Monocular Vision

In the previous chapter, some of the key techniques for motion tracking were discussed. This chapter will focus on the development of motion reconstruction techniques based on monocular vision, with applications to tennis tracking. In professional tennis, the difference between top-level players is influenced by a number of factors, including subtle physical adjustments and mental strategies. During training, physical techniques aim at improving the overall fitness level and the strength and consistency of the techniques used [182], whereas mental training is focused on concentration, motivation and self-confidence [80, 104, 182].

In order to continuously refine training strategies, players and coaches rely on key information recorded over the training sessions and matches. Logbooks are commonly used to monitor progressive changes in training [182], providing general high-level information about the nature and outcome of the training sessions. Video recordings during training and matches have recently become increasingly popular [224]. On one hand, current advances in information technology allow large amounts of video data to be collected and stored easily. On the other hand, retrieving useful information from these datasets can be problematic if the sequences have been poorly annotated. High-quality annotation of the sequences is an extremely time-consuming activity, which somewhat defeats the original purpose.

In order to provide the players and coaches with ready-to-use biomechanical parameters such as the wrist angular velocity, motion tracking systems have been used. However, most commercially available systems involve a large set of wearable markers or sensors [227, 244] and are obtrusive. Therefore, they are typically used to record a snapshot of the player's technique and are not used on a regular basis.

Video-based monitoring is a tool of choice for detailed sport motion analysis as it does not affect the players, thus providing more detailed measurements for both training and competition. Different to other sensing modalities, video recordings do not provide direct biomechanical measures to the players or coaches: further image processing is required to derive information such as location, velocity or detailed posture parameters. This thesis aims to develop a system that combines the ease of video recording with the real-time extraction of meaningful features comparable to marker-based systems. To this end, the work presented in this chapter is based on the Vision Sensor Network (VSN) introduced in the previous chapters. By using multiple nodes, the system is able to provide extended coverage with improved accuracy. Wireless communication simplifies the installation process, particularly during tournaments. The unique on-board processing capability reduces communication and bandwidth requirements¹.

4.1 Related work

In this thesis, three levels of information that can be extracted from a video sequence are distinguished: the player's position and velocity on the court, the player's technique (stroke and short-term tactics), and the player's game strategy. This chapter focuses mainly on positional tracking. Early work on tennis player tracking was performed by Sudhir et al. [210] and Pingali et al. [183]. The system proposed by Sudhir et al. encompassed court line detection for camera calibration, player tracking, as well as high-level feature detection. Court detection and camera calibration are fundamental features of tennis tracking systems: they allow the derivation of the position of the player on the court from apparent two-dimensional (2D) image features. Based on the relative positions of the players, four types of annotations can be added to the video sequence: baseline rallies, passing shots, net games and serve-and-volley. Pingali et al. used four cameras to track the player and the ball, and introduced the concept of the occupancy map. This map represents the proportion of time spent by the player at different locations on the court. This information can be used to infer general tactics during the game.

Bloom and Bradley [28] later carried out similar work and extended it with stroke recognition. They recorded the sound track to detect the exact time of ball contact and derived the player's apparent skeleton to infer the stroke being played. Acoustic information is increasingly used to resolve vision-based sensing ambiguities [49]. Wang and Parameswaran [231] proposed to classify the 58 winning patterns recommended by the US Tennis Association (USTA) [221]. A Bayesian network was used to infer play patterns based on the sequences of ball landing areas.

Thus far, most of the tennis-related computer vision studies have focused on tennis ball trajectory tracking. Commercial systems such as Hawkeye [93] are already mature enough to be used routinely in major tournaments. Technically, there are two major differences between ball and player tracking. On one hand, the ball moves much faster and is much smaller than a player, making it harder to follow. On the other hand, a tennis ball is a well-defined object in terms of colour, shape and motion, compared to the complex articulation of the player, which requires more sophisticated detection and tracking techniques.

¹ A paper has been accepted for publication on this topic: RACKET: Real-time Autonomous Computation of Kinematic Elements in Tennis [174].

4.2 VSN node design

The VSN node developed at our laboratory is a self-contained module composed of three stackable boards and a battery:

1. An Omnivision OV9655 1.3-megapixel camera [170] with interchangeable lenses covering a range of focal lengths (from a 1.3 mm fisheye (full 180°) to a near-teleobjective 6.5 mm).

2. The main board embedding:

• A 500 MHz Analog Devices Blackfin BF537 processor [8] (mixed Reduced Instruction Set Computing (RISC) Microcontroller (MCU) / Digital Signal Processor (DSP) architecture).

• 256 Mb of Synchronous Dynamic Random Access Memory (SDRAM) and 32 Mb of Serial Peripheral Interface (SPI) Flash.

• Body Sensor Network (BSN) platform connector.

• Joint Test Action Group (JTAG) connector for debugging purposes.

3. A Lantronix Matchport [129] Wireless Local Area Network (WLAN) 802.11g/b Wi-Fi board for wired and wireless communication.

The physical dimensions of the module are 25×45×45 mm (not including the lens and antenna), with a total weight of 80.5 g (112 g including the casing). Its total power consumption is 220 mA in fully active mode (i.e., during concurrent frame acquisition, compression and wireless communication). A detailed breakdown of the power budget is: main processor board, 147 mA; wireless communication board, 55 mA; camera board, 21 mA. In sleep mode, the total power consumption drops to 10 mA. The module components are illustrated in Figure 4.1.

4.3 System configuration

4.3.1 Node placement and focal length choice

A tennis-specific application has been developed to evaluate precisely the position of a player on the tennis court. The optimal node configuration is a trade-off between the camera lens and the node position, as their combination determines the overall court coverage. A longer focal length provides better image quality, as it is less affected by radial distortion. However, such cameras must be placed further back from the tennis court, which is often not physically possible.

Figure 4.1: The main components of a self-contained VSN module. Left, from top down: camera/extension board, main Blackfin board, Lantronix Matchport, battery. Right: casing and antenna.

For player position tracking, the camera would ideally be placed directly above the court. In this particular case, the position of the player in the image space would translate to the actual position on the court by taking into account the scale, shift and rotation in the 2D image space. In general, the more vertical the camera axis, the better the tracking resolution that can be achieved.

Although it is possible to track the players on the entire court with a single node, it is usually easier to use at least one node per side. This allows the use of longer focal length cameras with downward viewing angles, which also facilitates the vision algorithm development.


4.3.2 Node calibration

In order to derive a meaningful three-dimensional (3D) player position on the court from 2D image sequences, it is necessary to calibrate the node in situ to determine the extrinsic camera parameters. To avoid the use of an extra calibration grid or fiducials, the markings of the tennis court itself are used for calibration. This operation is conducted in two steps: corner detection and the actual camera calibration.

4.3.2.1 Tennis court corner detection

The user is asked to point the camera roughly at the centre of the middle T of a tennis court. After acquiring a still image from the VSN node, the relevant tennis court line intersections are detected using the following method.

In order to determine the local tennis court marking topology at a given point, a polar intensity histogram is built around this point. A three-pixel-thick circular window is used for this purpose. The histogram thus represents the average image intensity for a given angular sector in the vicinity of the considered point.

The intensity histogram along the circle can be robustly and quickly segmented, with local maxima denoting the tennis court lines (as they are brighter than the court surface). In order to filter out insignificant peaks, a threshold of (max − mean)/2 was applied, where max and mean are the maximum and the mean of the histogram values, respectively.

This histogram is illustrated in Figure 4.2, and the method has proved robust enough to deal with sub-pixel-thin lines. Depending on the number of maxima, further topological assumptions are made: no maximum means there is no tennis court line around this point, and a single maximum means that the algorithm has partly lost track of the line². Two intersections mean the current window centre is close to a line, and three mean the window is located in the vicinity of a 'T' junction between lines.
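A sketch of the polar intensity histogram and its peak thresholding; the bin count and the circle rasterisation are illustrative choices, and the circle is assumed to lie fully inside the image:

```python
import numpy as np

def polar_histogram(image, cx, cy, radius=12, bins=360):
    """Average image intensity per angular sector on a circle of given
    radius around (cx, cy), sampled over a three-pixel-thick ring.
    The ring is assumed to lie entirely inside the image."""
    hist = np.zeros(bins)
    for b in range(bins):
        theta = 2.0 * np.pi * b / bins
        samples = [image[int(round(cy + r * np.sin(theta))),
                         int(round(cx + r * np.cos(theta)))]
                   for r in (radius - 1, radius, radius + 1)]
        hist[b] = np.mean(samples)
    return hist

def line_directions(hist):
    """Angular bins exceeding the (max - mean)/2 peak threshold, here
    interpreted as being applied to intensities relative to the mean."""
    threshold = hist.mean() + (hist.max() - hist.mean()) / 2.0
    return np.nonzero(hist > threshold)[0]
```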

A method has been designed to follow the tennis court lines accurately. This is effectively a tracking algorithm performed iteratively, based on successive estimation and observation/correction. From the current window position, the next position is estimated using the court line direction. The intersections of the line with the circle are then recomputed, which allows the determination of the actual position of the window with respect to the court.

² There are no free-ending lines in tennis court markings.

Figure 4.2: Left: circle/line intersections detected on a 'T' junction by histogram segmentation (a three-pixel-thick circle is actually used for more robust intensity estimation). Right: histogram used for fast tennis line intersection detection. Pixel intensities are relative to the average intensity on the circle and the three intersections are obvious. Angles on the histogram x-axis correspond to the sweeping line in the left figure.

A fully automated system has been designed that can detect the relevant tennis court corners with on-node processing. Starting with the middle T, the algorithm follows the service line in both directions until the intersections with the side lines are found. The side lines are then followed to find their intersections with the baseline at the back of the court, as illustrated in Figure 4.3 (left). An example corner detection result is depicted in Figure 4.3 (right).

This process can be implemented efficiently without floating-point operations. Although it would be possible to implement a system with sub-pixel accuracy, this would require the use of floating-point arithmetic. For this reason, the accuracy is limited to one pixel, i.e. five centimetres in the worst-case scenario of a typical configuration. The process is relatively immune to noise and can be executed in less than a millisecond on the VSN node. The detected corners are then used to calibrate the camera.


Figure 4.3: Tennis court corners automatically detected on-node for camera calibration. Left: corner detection order. Right: actual detection result.


4.3.2.2 Extrinsic camera calibration

The camera calibration aims at determining the camera position (x, y, z) and its orientation, defined by the Euler angles (α, β, γ), with respect to the tennis court. A gradient descent algorithm optimises the camera position and orientation by reprojecting the tennis court corners and matching them with the actual court geometry. The camera position and orientation are then retrieved. The focal length can optionally be estimated, although this is a fixed parameter that can be pre-calculated.

During camera calibration, the principal point is assumed to be at the centre of the image and the skew is assumed negligible. Random seeds can be used to prevent the gradient descent algorithm from being locked in local minima. Finally, the back-projection matrix of the camera is computed. The node is then ready for tracking, with aligned image and world coordinate systems.

An optimised version of this algorithm has been ported onto the VSN node, and the calibration process takes about 10 to 30 seconds. With this approach, it is possible to rely entirely on the embedded software to detect the corners and calibrate the VSN camera. The VSN node can therefore be used in a completely autonomous fashion.
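The reprojection-based calibration loop can be sketched as follows. The projection function, step size, iteration count and finite-difference gradient are all illustrative assumptions; the thesis does not specify these implementation details:

```python
import numpy as np

def reprojection_error(pose, court_pts, image_pts, project):
    """Sum of squared distances between detected corners and the court
    corners reprojected under the candidate pose (x, y, z, alpha, beta, gamma).
    'project' is a caller-supplied camera projection function (pinhole assumed)."""
    proj = np.array([project(pose, p) for p in court_pts])
    return ((proj - image_pts) ** 2).sum()

def calibrate(court_pts, image_pts, project, pose0, lr=1e-4, n_iter=5000):
    """Refine the six pose parameters by numerical gradient descent."""
    pose = np.asarray(pose0, dtype=float)
    eps = 1e-6
    for _ in range(n_iter):
        grad = np.zeros_like(pose)
        e0 = reprojection_error(pose, court_pts, image_pts, project)
        for i in range(len(pose)):        # finite-difference gradient estimate
            d = pose.copy()
            d[i] += eps
            grad[i] = (reprojection_error(d, court_pts, image_pts, project) - e0) / eps
        pose -= lr * grad
    return pose
```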

4.3.2.3 Camera distortion correction

As previously mentioned, radial distortion is a significant problem when using small lenses with relatively short focal lengths. An optional distortion correction step is therefore provided in the proposed framework. In this thesis, a polynomial radial camera distortion model [32] was used:

dN = u (1 + ∑i=1..N k_{2i+1} ‖u‖^{2i})    (4.1)

d1 = u (1 + k3 ‖u‖²)    (4.2)

where d is the distorted point in the normalised coordinate system, u is the undistorted pinhole projection point, and ki, i ∈ {1..N}, are the polynomial coefficients. The reverse equation can be approximated at the second order:

u ≈ d / (1 + k3 (‖d‖ / (1 + k3‖d‖²))²)    (4.3)

In the experiments conducted in this thesis, the average error of this step was found to be 0.75 pixels, with the maximum error being 3 pixels at the far-end corners, as shown in Figure 4.4. A more accurate solution can be achieved by recursive application of the above equation.
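A sketch of equations 4.2 and 4.3 for the single-coefficient model, operating on normalised image coordinates. Writing the inverse as a fixed-point iteration is an implementation choice: two iterations reproduce equation 4.3 exactly, and further iterations correspond to the recursive refinement mentioned above:

```python
import numpy as np

def distort(u, k3):
    """Apply the first-order radial distortion model of eq. 4.2.
    'u' is an undistorted point in normalised coordinates."""
    u = np.asarray(u, dtype=float)
    return u * (1.0 + k3 * (u ** 2).sum())

def undistort(d, k3, n_iter=2):
    """Approximate inverse of eq. 4.2: with n_iter=2 this is exactly the
    second-order approximation of eq. 4.3; more iterations refine it."""
    d = np.asarray(d, dtype=float)
    u = d.copy()
    for _ in range(n_iter):
        r2 = (u ** 2).sum()
        u = d / (1.0 + k3 * r2)
    return u
```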

The value of k3 has been determined by trial and error. Whilst accurate parameters could be obtained with an iterative calibration algorithm such as that proposed by Zhang [250], this was not implemented in this thesis.

Figure 4.4: Distortion correction. Left: original image. Right: corrected image; aliasing effects due to the integer nature of the filter are visible on the court lines.

For VSN programming, floating point operations are computationally expensive. For this reason, a look-up table providing pixel displacements is first created and used subsequently. This solution proves to be efficient but has two main weaknesses: the overall processing time is increased by about 20% and the re-sampling artefact3 is significant, leading to errors affecting the subsequent background segmentation process.

To circumvent this problem, a novel hybrid approach is proposed. The look-up table is first used to correct the distortion on the whole image during the calibration phase. At this stage, computation speed is not an issue and the re-sampling artefact only has a limited impact. During the actual tracking phase, image processing is performed on the distorted image, at no extra computational cost and without artefacts. Only the player's feet position in the image space is undistorted, using a floating point method, before the final back-mapping onto the tennis court coordinate system. As only a single point is corrected, this has virtually no effect on the overall performance.
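The sketch below illustrates the look-up table side of this hybrid scheme (the on-node version uses integer C code; here Python is used for clarity, and all names are assumptions). At tracking time, only the single foot point would be passed through the floating point inverse of the previous sketch.

    # Sketch of the displacement look-up table, computed once at calibration
    # time, and a nearest-neighbour gather through it (hence the artefacts).
    import numpy as np

    def build_lut(w, h, k3, centre, f):
        # For every undistorted output pixel, store the integer displacement
        # towards the distorted source pixel.
        lut = np.zeros((h, w, 2), dtype=np.int16)
        for v in range(h):
            for u in range(w):
                n = np.array([(u - centre[0]) / f, (v - centre[1]) / f])
                d = n * (1.0 + k3 * np.dot(n, n))      # forward model, Eq. (4.2)
                src = np.rint(d * f + centre).astype(int)
                lut[v, u] = (src[0] - u, src[1] - v)
        return lut

    def undistort_frame(img, lut):
        out = np.zeros_like(img)
        h, w = img.shape[:2]
        for v in range(h):
            for u in range(w):
                su, sv = u + lut[v, u, 0], v + lut[v, u, 1]
                if 0 <= su < w and 0 <= sv < h:
                    out[v, u] = img[sv, su]            # integer re-sampling
        return out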

4.4 Tennis player tracking

4.4.1 On-node background segmentation

In order to separate the player from the rest of the scene, background subtraction is performed. Pixel-based segmentation methods using the statistical distribution of the background colour are employed in this study.

3The re-sampling artefacts are mainly due to the integer nature of the look-up table. Whilst it would be possible to use a more accurate floating point look-up table, the required colour blending would be too computationally expensive.


If a sudden change of colour occurs, the pixel colour will not match the background distribution and the pixel is therefore marked as foreground. The colour model used in this chapter is based on Gaussian Mixture Models (GMM) [130].

Since the processing power of the VSN node is limited, for real-time operation it is only possible to use a single Gaussian for the background colour model, which is tantamount to a mean-and-variance model. Using two or more Gaussians in the GMM would require floating point operations, which would be too computationally expensive on this platform. It has been shown that the node processing power is sufficient to enable on-node, real-time background segmentation at a resolution of 320×256 pixels. Alternative methods such as histogram-based [85] or multi-modal mean [11] implementations could potentially have been used for this purpose, but they are more computationally demanding.

Further optimisation is necessary to achieve truly real-time performance. First, the background model is only updated every 20 frames. Secondly, a low-resolution segmentation is performed beforehand to determine the Region Of Interest (ROI). By down-sampling to a quarter of the native resolution, this operation only requires one-sixteenth of the time of a full-resolution segmentation. Full-resolution segmentation and morphological filtering (erosion-dilation) are then performed only within the ROI. This hierarchical method, sketched below, provides a substantial speed gain when a relatively small proportion of the image is classified as foreground, which is the case for tennis tracking.
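A minimal sketch of the hierarchical single-Gaussian test is given below; the threshold, the variance handling and the function names are assumptions (the on-node version works with integer arithmetic throughout).

    # Sketch of the hierarchical mean/variance segmentation with a coarse
    # pre-segmentation pass used to derive the ROI.
    import numpy as np

    def segment(frame, mean, var, k2=9):
        # foreground where (I - mean)^2 > k^2 * var on any colour channel
        diff = frame.astype(np.int32) - mean
        return np.any(diff * diff > k2 * var, axis=-1)

    def track_frame(frame, mean, var):
        # 1. coarse pass on a quarter-resolution image (1/16 of the work)
        coarse = segment(frame[::4, ::4], mean[::4, ::4], var[::4, ::4])
        ys, xs = np.nonzero(coarse)
        if len(xs) == 0:
            return None                               # no foreground found
        x0, x1 = 4 * xs.min(), 4 * (xs.max() + 1)     # ROI in full resolution
        y0, y1 = 4 * ys.min(), 4 * (ys.max() + 1)
        # 2. full-resolution segmentation restricted to the ROI;
        #    erosion-dilation and blob selection would follow here
        fg = segment(frame[y0:y1, x0:x1], mean[y0:y1, x0:x1], var[y0:y1, x0:x1])
        return (x0, y0, x1, y1), fg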

After segmentation, several low-level features can be computed from the binary blob image, including its centre, the Axis-Aligned Bounding Box (AABB), and the eigenvectors of the blob determining its global orientation.

Figure 4.5: Left: original image as captured by the VSN node camera using a wide-angle lens (focal length 2.2 mm). Right: on-node binary blob segmentation and AABB computation.


4.4.2 Monocular player tracking

In order to track the player's movement, background segmentation is performed and the bounding box around the player is computed. The position of the feet is then extracted from the bounding box in the 2D image space and back-projected into real-world 3D coordinates using the calibration matrix. This under-constrained problem is solved by assuming that the player's feet are touching the ground (ground plane constraint). Although the use of 3D geometry is justified by the physical setup of the scene, this operation is actually equivalent to a 2D image morphing.
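With the feet constrained to the court plane z = 0, the calibrated projection reduces to a homography between image and court, which is the essence of the back-projection step. A minimal sketch under that assumption follows; the matrix name and the foot-pixel convention are illustrative.

    # Sketch of the ground-plane back-projection of the foot point.
    import numpy as np

    def foot_pixel(aabb):
        # aabb = (x0, y0, x1, y1) in pixels, y growing downwards:
        # take the midpoint of the bottom edge of the bounding box
        return ((aabb[0] + aabb[2]) / 2.0, aabb[3])

    def foot_to_court(foot_px, H_img_to_court):
        # H_img_to_court: 3x3 homography from image to court coordinates
        p = np.array([foot_px[0], foot_px[1], 1.0])
        q = H_img_to_court @ p
        return q[:2] / q[2]          # (x, y) on the court, in metres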

4.4.3 Multiview player tracking

The above tracking algorithm relies strongly on the ground plane constraint to determine the player's position on the court. Whilst this assumption holds most of the time, it can fail and lead to inaccurate results. This happens typically in two cases: if the player jumps up, or if for some reason the legs are incorrectly segmented. In either case, the lowest visible part of the player is not in direct contact with the ground. Therefore, the tracker will return a position that is farther away from the camera than the true 3D position.

The simplest solution for overcoming this problem is to add a second VSN node. By calculating the mean of the positions returned by both nodes, a more robust position can be derived. However, this simplistic sensor fusion method does not solve the aforementioned ground constraint problem.

In this case, an error in the vertical position leads to an error in the horizontal plane. Since each camera still observes the player along the correct line of sight, the intersection of the lines of sight from each camera to its respective apparent player position provides the actual 3D position of the player without the ground plane assumption. Although this correction is applied to a single point in the horizontal 2D space, as illustrated in Figure 4.6, it is in essence a particular case of the visual hull 3D reconstruction algorithm presented in Section 5.2.1 of this thesis.
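A minimal sketch of this two-node fusion is given below, using the standard closest-point construction between two 3D lines; the fall-back to the naive average for near-parallel rays is an assumption of the example.

    # Sketch of the line-of-sight intersection between two VSN nodes.
    # c1, c2: camera optical centres; p1, p2: apparent player positions.
    import numpy as np

    def fuse(c1, p1, c2, p2):
        d1, d2 = p1 - c1, p2 - c2                  # line directions
        n = np.cross(d1, d2)
        nn = np.dot(n, n)
        if nn < 1e-12:                             # near-parallel: naive mean
            return (p1 + p2) / 2.0
        t1 = np.dot(np.cross(c2 - c1, d2), n) / nn
        t2 = np.dot(np.cross(c2 - c1, d1), n) / nn
        # midpoint of the shortest segment between the two lines of sight
        return ((c1 + t1 * d1) + (c2 + t2 * d2)) / 2.0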

4.4.4 Computational load analysis

In order to demonstrate the value of on-node processing, Table 4.1 compares the bandwidth requirements for a centralised processing scheme requiring video or feature stream communication.

Computational loading analysis was performed during the development and optimisation phases of the system. Figure 4.7 summarises the distribution of the image processing tasks previously described on the processor. It is evident that the combined use of low-resolution pre-segmentation and AABB reduces the overall computation time dramatically.


Figure 4.6: Position fusion between two VSN nodes. Cameras are represented by the black squares and their respective player position estimates by the red disks. Sensor fusion by naive position averaging is represented in orange. As the player jumps on the line, this estimate becomes inaccurate. The intersection of the lines of sight (yellow disk) is closer to reality.

4.5 Practical applications and results

4.5.1 Experimental setup

An upper concourse at the back of the Lawn Tennis Association's (LTA) National Tennis Centre (NTC) tennis courts was used for assessing the practical deployment of the VSN node. The VSN nodes were placed at a typical height of 6 meters, 20 meters away from the centre of the court (i.e. 8 meters behind the baseline). Although the sensor coverage is wide enough to track both players over the whole court with a single node, two sensors were used in order to increase the overall resolution of the system.

Ideally, the sensors would be placed at either end of the court, as illustrated in Figure 4.8 (left). Because only one concourse was accessible, an asymmetric configuration with two different focal lengths (3.6 mm and 6.5 mm) was used, as shown in Figure 4.8 (right).

Figure 4.9 illustrates the difference in perspective projection between near-side and far-side tracking. The vertical resolution on the far side is bound to be lower, even when using a longer focal length, as the angle between the optical axis of the camera and the court plane is significantly lower. It can also be noted that the lighting conditions were not as good on the far side.

4.5.2 User queries

In order to provide efficient access to the VSN node, a software environment has been developed for real-time on-node data interrogation. To facilitate its use by non-technical users, a micro web server has also been implemented on the VSN node. This allows any computer or handheld device (such as an Apple iPhone) to retrieve


Image encoding                   Size (bytes)   Frame rate (frames/s)   Proces. time (%)

Raw RGB                          245,760        0.4                     0.0
Raw YUV 422                      163,840        0.6                     0.0
JPEG (high quality)              24,650         2.6                     34.8
JPEG (medium quality)            5,960          6.4                     61.2
JPEG (low quality)               3,820          8.7                     66.2
Raw binary blob                  10,240         5.6                     41.7
Run-length encoded binary blob   847            11.3                    90.3
Features only                    40             14.7                    99.4

Table 4.1: VSN node image output size and frame rate comparison (320×256 pixels). The size of the run-length encoded binary blob is measured when an average-sized blob is visible. Assuming that raw images do not need any processing, it is possible to estimate the relative percentage of time spent on processing and on communication. It can be observed that on the given platform, on-node processing allows a higher frame rate and dramatically reduces the time spent on communication.

live data from the VSN node. In order to reduce bandwidth usage, a lightweight AJAX (Asynchronous JavaScript and XML) framework is used. With this setup, a JavaScript program on the client side requests the tennis player position from the VSN node at regular intervals and updates the VSN HTML web page accordingly, as illustrated in Figure 4.10. This removes the need to refresh the whole page constantly.
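The polling contract can be illustrated with the sketch below. The endpoint name, node address and JSON payload are assumptions made for the example (the deployed client performs the equivalent request in JavaScript from the served web page).

    # Sketch of a client polling the on-node micro web server for the
    # player position at regular intervals.
    import json, time, urllib.request

    NODE_URL = "http://vsn-node.local/position"    # hypothetical node address

    while True:
        with urllib.request.urlopen(NODE_URL) as resp:
            pos = json.loads(resp.read())          # e.g. {"x": 3.2, "y": 10.5}
        print("player at (%.2f, %.2f) m" % (pos["x"], pos["y"]))
        time.sleep(0.2)                            # regular polling interval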

4.5.3 Strategy analysis

The on-node real-time tracking was trialled during a Junior's Masters Tournament organised by the LTA, as well as two of the matches of the GB Davis Cup Team play-offs. The positions of the players were also archived in real-time on a PC for further analysis. During some of the matches, the game was also manually annotated to provide ground-truth data for comparison.

4.5.3.1 Direct feature representations

The fundamental kinetic parameters are provided in real-time or during playback. These include the total distance covered and the current speed and acceleration. Several representations of the trajectory were made available to the coaches. The actual trajectory on the court over an arbitrary length of time can be readily played back during the game. Some trajectory patterns can be easily observed. This is particularly useful during training "drills", when the player is asked to repeat the same movement a number of times. It is possible to determine at a glance whether the player was consistent in his motion.

After data collection, it is possible to derive a court occupancy map, as illustrated in Figure 4.11.


[Bar chart: computation time (ms) of the naive approach, with bounding box, and with bounding box and pre-segmentation, broken down into wait loop, frame grabbing, segmentation, erosion-dilation, AABB, compression and send.]

Figure 4.7: Software optimisation for on-node processing showing the computational loading of the main operations involved in the player tracking.

Figure 4.8: The effect of court coverage with different camera configurations. Left: two symmetric sensors. Right: two sensors tracking from the same side of the court.

This graph provides meaningful insight into the overall player movement on the court and illustrates where the player spent most of the time during a match.

When manual marking was carried out for the ground-truth data, some events such as the hits may be recovered. Combining these annotations with the tracking data allows one to play back the match with ball contact points highlighted, as illustrated in Figure 4.12. Whilst this does not provide sufficient information to establish why the player wins or loses a point, it may nevertheless provide useful information about his technical skills.

4.5.3.2 Tactics-related measures

In general, the tracked features described earlier indicate the instantaneous motion characteristics of a player. The juxtaposition of such features over a period of time provides deeper insight into the game, as illustrated in Figures 4.11 and 4.12.


Figure 4.9: Left: near side of the court as seen by the VSN node with a 3.6 mm focal length optic. Right: far side as seen with a 6.5 mm focal length. Note the difference in vertical resolution.

Figure 4.10: Tennis player tracking interface running on an Apple iPod Touch.

In either case, the outcome remains a direct representation of the recorded values. Higher-level indices can be constructed by aggregation and analysis of patterns in the original feature set over time.

Assuming that both players have been tracked and their shots have been annotated, it is possible to segment the games accordingly. Indeed, the serves can be identified automatically using low-level cues such as player position and inter-shot times, as sketched below. For each game, fundamental statistics can be derived, such as the number of shots, game duration, and distance covered by each player.
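A minimal sketch of such a cue-based serve test follows; all thresholds are illustrative assumptions rather than the values used in the system.

    # Sketch of serve identification from low-level cues: the hitting player
    # stands near the baseline centre and the preceding inter-shot gap is long.
    from collections import namedtuple

    Shot = namedtuple("Shot", "x y time")   # court position (m) and timestamp (s)

    def is_serve(shot, prev_time):
        near_baseline = abs(shot.y) > 11.4  # standard baseline at 11.885 m
        near_centre = abs(shot.x) < 2.0     # close to the centre mark
        long_gap = shot.time - prev_time > 8.0
        return near_baseline and near_centre and long_gap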

An attempt to recognise some of the winning patterns described in [221] has been carried out. Whilst this was performed initially by Wang and Parameswaran [231], they relied on ball tracking, whereas the proposed system only tracks the player. This raises certain difficulties, as the aforementioned winning patterns are defined in terms of ball ground contact. Moreover, the relationship between the ball trajectory and the player position during a particular shot is far from trivial.


Figure 4.11: Two court occupancy plots derived during a set for two competitors. Serve and return locations are clearly visible. The right graph shows more mobility and more variability in the serve return placement.

Figure 4.12: Simultaneous playback of the player movements on the court along with the estimated positions when hitting the ball, derived from the manual annotations. The player on the left side of this image was tracked across the court, hence the lower resolution compared to the player on the right side, tracked from the near end of the court.

Indeed, the absolute position of the racket during a shot is rarely, if ever, the same as the player's feet position. For example, the distance between the racket position for a forehand and for a backhand might be as large as two meters. Similarly, the player can choose to, or be forced to, hit the ball early or late.

For pattern recognition purposes, the tennis court has been divided into nine zones, as illustrated in Figure 4.13. Due to the limited amount of annotated data available, statistical and probabilistic classification methods were not attempted; a zone-based segmentation was employed instead.



Figure 4.13: Tennis court zones used for winning pattern recognition.

The focus was set on the serve, and fifteen winning patterns plus two variations related to serve and return were considered. However, due to the game style adopted by the players, only four patterns were frequently recognised: return deep crosscourt (7), return deep down the middle (10, 11) and return low at the server's feet (17). An example of such a pattern is shown in Figure 4.14.

Figure 4.14: Tennis winning pattern recognition: "Against a wide serve - return deep crosscourt" [221]. Note that the lines only represent the ball exchanges, not the actual ball trajectory.
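The zone-based segmentation amounts to quantising the back-projected court position into a zone index. The exact zone layout of Figure 4.13 is not fully recoverable here, so the sketch below assumes a simple 3×3 split for illustration; all boundary values are assumptions.

    # Sketch of a position-to-zone mapping on one half-court.
    def court_zone(x, y):
        # x across the court (m, centre mark = 0); y along the court (m, net = 0)
        col = 0 if x < -1.37 else (1 if x <= 1.37 else 2)   # approximate thirds
        if y < 6.40:
            row = 0          # between net and service line
        elif y < 11.89:
            row = 1          # between service line and baseline
        else:
            row = 2          # behind the baseline
        return 3 * row + col + 1                            # zone index 1..9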


4.5.4 Tracking accuracy validation

In order to assess the overall accuracy of the proposed system, its spatial accuracy on the nearest side of the court was evaluated using a ground-truth grid. A metric grid, as illustrated in Figure 4.15 (left), was carefully marked on the ground with masking tape. It was assumed that the grid appearance does not interfere with the VSN tracking system. Two players were then asked to move from corner to corner on the grid whilst a VSN node was tracking and transmitting their positions to a computer for recording. This experiment was performed twice to ensure complete coverage of the court (over 1000 points in total).


Figure 4.15: Metric grid added to the tennis court for tracker accuracy evaluation. Left: court as seen by the VSN node (note the radial distortion and the vignetting). Right: overall error distribution of the position estimation (without distortion correction). The radius of each disk represents the average error at the corresponding calibration point.


Figure 4.16: Example of projective error due to incorrect background segmentation at knee level (simulated). Left, top to bottom: original image, correct segmentation, zealous segmentation discarding the lower legs. Right: the red disks represent the projections of the correct and zealous segmentations onto the court ground plane.

Results have shown that the overall average accuracy of the system is 20 cm, with a minimum of 9 cm and a maximum of 36 cm. Figure 4.15 (right) shows the error distribution on the court. The error is clearly larger near the service point. This is most likely due to the camera distortion. Indeed, it can be seen in Figure 4.15 (left) that the radial distortion has a strong effect on the baseline. Three main sources of inaccuracy were observed during this experiment.

Distortion The experiment was performed without distortion correction, which may have affected the accuracy near the serve point, as can be observed in Figure 4.15.

Incorrect binary segmentation Typically, if the player wears a colour similar to that of the court, the tracker will be less sensitive to motion. If the player's feet are consequently discarded from the foreground, the tracker will consider that the player is slightly behind where he actually stands, as illustrated in Figure 4.16. In a typical setup, an incorrect segmentation at knee level would lead to an average projection error of 65 cm on the base line and 110 cm on the service line. This effect can be observed in Figure 4.17 (bottom), as the distortion-compensated error generally increases with the distance to the VSN node.

Low spatial resolution During this experiment, the horizontal resolution near the service line was 5 cm per pixel. This means that the legs of a player appear no thicker than one or two pixels, which is only just enough to be detected. The relatively low resolution combined with ambient noise can introduce significant errors into the proposed system.

Although no formal test has been carried out, the camera extrinsic calibration does not appear to be a significant source of error as long as the player stays in the calibrated side of the court.

4.6 Conclusions

4.6.1 Current limitations

Image resolution As previously stated, the angle between the optical axis of the camera and the court can influence the overall accuracy of the proposed tracking system. At a low angle (i.e. near-horizontal optical axis), the apparent court size in the image is relatively small, thus reducing the effective system resolution. For example, if the camera is placed 30 meters away from the centre of the court at a height of 10 meters, the apparent court length is 157 pixels, i.e. an average longitudinal resolution of 14 cm per pixel. At 5 meters high, this resolution reduces to 30 cm. This source of inaccuracy has a direct impact on manual or automated corner detection, camera calibration and 2D player positioning. Setting up one VSN node at each end of the court may alleviate this problem, as illustrated in Figure 4.8 (left).



Figure 4.17: Factors of error in the player tracking. Top: error due to distortion; when the player stands far from all calibration points, the camera distortion leads to an erroneous position estimate. Bottom: influence of the subject-to-node distance on the tracking accuracy.

Lighting and colours When the colour of the player's clothes is similar to that of the court (grey and blue at the NTC), the segmentation algorithm may not perform well. This effect can be accentuated in the peripheral region by the strong vignetting, as illustrated in Figure 4.15 (left). In this case, parts of the body may not be segmented correctly, leading to incorrect player detection and therefore incorrect on-court position estimation. This issue can be solved by the use of multiple VSN nodes. In extreme cases, when tracking across the court, the player may occasionally be recognised entirely as part of the background (see Figure 4.9 (right) for example).



Spatial tracking accuracy As previously discussed, the average measured 2D accuracy is about 20 cm. This is partly due to the vision sensor resolution, the colour issues and the lens distortion. Furthermore, the quantitative validation performed in this chapter has been carried out on static standing subjects. When moving at high speed, the exact definition of the position of the player may not be straightforward: whilst the proposed system considers the position of the feet, one may argue that the use of the torso would be more relevant.

Temporal accuracy For cross-court tracking, the current system runs at 10 to 15 Hz. This may not be sufficient for fast motion. Indeed, at 10 Hz, that is 100 ms per frame, a player jumping sideways at 10 m/s will move by a meter between successive image frames.

Multiple occupancy Due to its lightweight design, the system is not currently configured to track several persons simultaneously. If several persons happen to be visible at the same time, the one nearest to the node will be tracked by the current algorithm. For the purpose of tennis tracking, this is nevertheless sufficient.

Court line detection On clay surfaces, the markings on the court may not be as clear because the lines may be partially covered by clay dust. Further assessment of the system under such conditions would be necessary.

4.6.2 Conclusion

Markerless video-based tracking is ideally suited to sport training. In this chapter, it has been shown that tennis players can be tracked with an autonomous, mobile, and low-power device. The practical value of the VSN nodes in terms of ease of deployment has also been demonstrated. The miniaturised size of the VSN nodes, combined with their low power consumption and wireless communication, makes them particularly attractive for field sports.

Extensive testing has been performed on high-profile matches such as the GB Davis Cup Team play-offs, demonstrating that the proposed system can be deployed during competitive events without causing problems.

The full implementation of the calibration and player tracking on the VSN node means that no specific software is required on the user side. Any web browser can be used, and it is therefore possible to visualise the motion patterns on handheld devices. The archived data also permits detailed off-line analysis. The match or training session can be replayed, and strategic patterns can be automatically identified. It is worth noting that the current system has mainly been validated on hard-surface courts.


The extension of the monocular configuration to a multi-view framework allows explicit 3D reconstruction and further enhances the overall accuracy of the system.


Chapter 5

Generation of Canonical Views

for Motion Reconstruction

In the previous chapter, the localisation of the player was based on monocular vision and perspective reprojection. Whilst the use of a monocular visual motion analysis framework is satisfactory in certain situations, its inherent limitations, such as low spatial coverage, viewpoint dependency and occlusions, can prevent detailed motion tracking. These issues can be partly solved by introducing multiple vision sensors deployed around the subject. The same monocular vision-based methods could then be applied in parallel to each video stream in order to track the player. This fusion method can lead to an improvement over a purely monocular framework. However, some issues must be addressed before its practical use, and the method needs to take full advantage of the multi-view geometry offered by multiple vision sensors.

The use of multiple vision sensors for three-dimensional (3D) reconstruction is closely related to Image-Based Modelling and Rendering (IBMR). These methods are generally appropriate for human motion monitoring due to their ability to generate a detailed 3D model of the subject or to render novel views. In other words, when relying on IBMR for sports monitoring, the information coming from different nodes can be fused at the image level rather than at the feature or decision level. The main drawback of this approach is a dramatic increase in dimensionality. Indeed, the final 3D model can be significantly more complex than the original image data.

Given the relatively high complexity of conventional 3D model representations, there is a general need to reduce the dimensionality of the estimated models. In this chapter, the concept of canonical view is introduced, that is, a standard view of a subject or an object. The name was chosen after the mathematical definition of the word 'canonical', i.e. used in a standard or simplest form. Generating novel canonical views follows such a dimensionality reduction approach, as it provides view-independent subject representations. In essence, the novel canonical view has the same dimensionality as the original images but can be rendered from arbitrary Points-Of-View (POV). Such a POV can be chosen to be aligned with a specific coordinate system.


When tracking tennis players, for example, the view can be projected from a given direction that best depicts the whole-body articulation of the player during a specific manoeuvre across the court.

In this chapter, after introducing the main concepts of IBMR, a framework is proposed that can render such novel canonical views to facilitate player tracking using standard computer vision methods. Its application to player motion classification is then demonstrated1.

5.1 Image-Based Modelling and Rendering

On the one hand, a typical computer graphics pipeline aims to render a manually created 3D geometric model onto a two-dimensional (2D) image. On the other hand, the overall goal of computer vision consists of identifying specific features in 2D images and inferring 2D or 3D geometrical cues. IBMR uses 2D images as input, just as computer vision does, but aims to directly render novel 2D images, by-passing the explicit 3D reconstruction stage. This implicit process for 3D reconstruction is shown in Figure 5.1.


Figure 5.1: Computer Graphics, Computer Vision and IBMR.

Whilst computer graphics is mainly focused on solid modelling, IBMR methods are typically concerned with light field modelling. The IBMR approach is therefore based on the concept of the plenoptic function [3]. The plenoptic function characterises the light rays travelling in a given volume at a given time. In general, it can be modelled by a seven-dimensional function, as a ray is defined by its position (x, y, z), its orientation (θ, φ) and its wavelength (λ) at a given time (t): P(x, y, z, θ, φ, λ, t). Typically, IBMR algorithms approximate the plenoptic function from one set of 2D images in order to render another set [200]. Given the high dimensionality of this function, most methods employ specific constraints for dimensionality reduction.

In this thesis, the input camera images will be referred to as reference images and the new images to be generated as desired images.

1Preliminary results have been published on this topic: Towards Image-Based Modelling for Ambient Sensing [176].



5.1.1 Characteristics and classification

IBMR is a relatively novel research field, and therefore some of the employed techniques are still immature, with many open issues remaining. Indeed, numerous constraints can limit the performance of IBMR implementations. IBMR methods can be distinguished by their constraints on reference images, processing algorithms and desired images.

5.1.1.1 Constraints on reference images

Camera calibration required – some methods require the intrinsic and/or extrinsic calibration of the camera [126, 148], others can do without [46].

Number of reference images – only one or two reference images may be sufficient for image reconstruction [198], but some methods may require hundreds of them [78, 132].

Depth map required – some methods need an input depth map to reproject images [52, 145], which can be acquired by dense stereo matching or range sensing.

Image segmentation required – some methods are based on pre-segmented images [148].

Target volume topology – the rendered or modelled volume is generally either mostly convex (in the case of an object scan) [78, 132] or, at the opposite, mostly concave (when scanning a room or a panorama) [46, 203], but some methods allow mixed topologies [126].

Static scene and light – some algorithms require static lighting [78, 132], which is often complex to enforce; others can cope with varying lighting and even capture dynamic scenes [225].

Opaque/translucent objects – most pure IBMR methods assume an opaque scene; very few are able to deal with translucent objects.

5.1.1.2 Processing aspects

Amount of geometry involved in the rendering process – some methods first generate an explicit 3D model before rendering [126]; most others do this implicitly [148].

Required computing power – only a few methods can render in real-time, sometimes with the help of hardware acceleration [148].


Generated data volume – depending on the dimensionality of the approximated plenoptic function and the method used to compress it, the data volume can be relatively large [78, 126, 132].

5.1.1.3 Limitations on desired images

Degree Of Freedom (DOF) of the desired POV – the POV can be restricted to certain angles or particular positions, especially when a relatively low number of reference images is used [198].

Photo-realism – the final rendering may be realistic, but occlusion/dis-occlusion artefacts need to be dealt with [148].

In practice, the term IBMR covers a wide range of methods. Detailed classification of these methods may be difficult, and existing reviews tend to be inconsistent with each other. McMillan [152] distinguishes between images as approximations that are re-mapped onto a simplified geometry, images as databases relying on a large number of references but few transformations, and images as models. Oliveira [168], on the other hand, distinguishes between pure Image-Based Rendering (IBR), which only relies on 2D aspects, hybrid IBR, which relies on geometric information to enhance the rendering, and pure Image-Based Modelling (IBM), which aims to generate a 3D model. Burschka et al. [34] fuse the two former approaches into interpolation from dense samples, image and view morphing, and pixel reprojection using scene geometry. The latter is refined into image mosaics and depth-based reprojection. A graphical representation of selected IBMR methods according to their main characteristics is proposed in Figure 5.2.

The following review mostly follows the classification proposed by [34].

5.1.2 Image and view morphing

This class of techniques does not rely on 3D aspects to generate the desired images. It may therefore be geometrically incorrect and lead to distortion effects. However, by restricting the freedom of movement to particular points of view, it is possible to minimise these effects. For example, Seitz and Dyer [198] developed a technique called view morphing to generate a transition between two reference images. Lhuillier and Quan [133] relied on a propagation algorithm to effectively generate such transitions. This kind of method is powerful, does not rely on a large number of views, and can be applied to moving scenes. However, it does not provide a 3D model of the scene, and the field of view covered is limited.

5.1.3 Interpolation from dense samples

This approach relies on a very dense network of reference images, and therefore requires a very long acquisition process, during which the lighting and the object must remain static.


[Diagram: selected IBMR methods arranged by the number of reference views required (from 2 to 100+) against the explicit use of 3D features (from pure IBR to pure IBM): view morphing and match propagation (image and view morphing); panoramas, QuickTime VR and concentric mosaics (image mosaics); Light Field Rendering and Lumigraph (interpolation from dense samples); Image-Based Visual Hulls, 3D post-warping, shape from motion, space carving and voxel coloring (3D-based methods).]

Figure 5.2: A classification of IBMR methods according to the number of views required against the use of 3D aspects.

Two methods are typically employed to acquire the data set. The object can be scanned with a camera mounted on a robot arm, or alternatively a rather dense set of cameras is used simultaneously. The acquired data volume is large, although it can be compressed. Note that the compression algorithm has to take into account the access pattern of the rendering algorithm, which may require access to disseminated data.

This class of method was first implemented simultaneously, yet independently, by Gortler et al. (Lumigraph) [78] and Levoy and Hanrahan (Light Field Rendering) [132]. Both relied on a 4D plenoptic function to sample the light in a cubic or cylindrical bounding volume of the object of interest. The novel views are then interpolated from this dense sample. However photo-realistic these methods can be, they require a long acquisition process. Moreover, they are not suitable for real-time rendering of moving objects.

5.1.4 Image mosaics

These methods aim at reprojecting different reference images to produce a larger one. Image locations have to be precisely registered, either manually or by an automated process. Early work in this field includes the Movie-Map [135], where the user could navigate through a virtual 3D environment based on a large database of images stored on a video disc. 360° panoramas from a static POV are well-known applications, first implemented in QuickTime VR [46]. McMillan and Bishop [153]


extended this idea to a set of panoramas, allowing more complex reprojections to enable multiple points of view. Shum and He [203] also proposed a method based on concentric mosaics. More recently, Google Street View [77] has implemented similar technologies on a larger scale.

Related to these, Kubota et al. [124] presented an interesting method to merge reference images acquired with different focal lengths in order to remove most of the aliasing effects and render an all-focused image.

These methods are typically adapted to "open" or large environments such as landscapes and rooms, but are not really convenient for single, convex objects. The acquisition can only be fast if multiple cameras are used instead of a single mobile camera; otherwise, these methods are not suitable for acquiring data in dynamic environments.

5.1.5 3D-based methods

When available, depth information allows the generation of an explicit, approximate 3D model that can significantly improve the final rendering. When using colour reference images, depth can be computed using stereo, for example. Depth acquisition can also be performed using external devices such as a sonar, a Time-Of-Flight (TOF) camera or an infrared (IR) marker-based tracking system. It is also often approximated on the fly by some algorithms. In the particular case of synthetic reference images, the depth is generally computed anyway.

For example, Mark et al. [145, 146] proposed a method for increasing the frame rate of a synthetic image stream. They relied on two image frames and their respective depth buffers to generate extra frames at a lower computational cost. A system to fill in potential holes in the data, based on surface connectivity, has also been integrated. These methods can be very useful for real-time acquisition and rendering. Curless and Levoy [52] presented a complete volumetric model reconstruction system based on range images acquired by a laser. However, the prerequisite of a detailed depth map precludes many sport application scenarios.

Other vision-orientated techniques such as Structure from Motion (SFM) [18] and Structure from Silhouette (SFS) can also generate a simplified 3D model that could eventually be used as a basis for IBMR techniques.

5.1.5.1 Image-Based Visual Hulls

Other methods do not rely on a depth map but compute an approximate 3D model using only the reference images. In this case, the cameras usually need to be fully calibrated, although some methods exist to derive those parameters directly from the reference images. For example, Sinha et al. [205] rely on the moving silhouettes of the subject to estimate the parameters of the camera network. A well-known example of such modelling methods is the Image-Based Visual Hulls (IBVH) [148]. This framework can generate the convex hull of an object using the 3D intersection of the perspective-projected silhouettes of the object, as illustrated in Figure 5.4.


This method will be described in greater detail further in this thesis.

5.1.5.2 Voxel-based methods

The general concept of voxel-based methods consists of using a set of voxels (volumetric pixels, i.e. volume elements) and determining their colours from the set of available images. A fundamental notion is the photo-consistency of a voxel. A voxel is called photo-consistent if its projections onto all the reference camera planes are consistent with each other. Non-photo-consistent voxels are carved away. The principle is illustrated in Figure 5.3, and practical methods for computing photo-consistency are varied.


Figure 5.3: Schematic illustration of the basic principle of space carving. The considered voxel appears red to camera 1 but blue to camera 2: it is not photo-consistent and will therefore be carved away.

The well-known Voxel Colouring [199] method requires specific camera locations, whereas Space Carving [126] allows arbitrary camera configurations. Moreover, the latter also proposes an efficient way of checking photo-consistency by plane sweeping, thus avoiding CPU-consuming visibility computation. However, this can be performed only by assuming a plane separating the cameras and the voxel set (therefore some views may not be used). Generalised Voxel Colouring [51] extends the previous algorithms by using all the reference images, performing the task much faster and with less noise.

Statistical Consistency Check [31] introduces a method relying on several pixels to check photo-consistency. This is crucial when the resolution of the images is higher than that of the voxel set, as several pixels can be considered for the back projection of a single voxel. This enhancement helps to avoid the stray holes generated by the traditional space carving method.


Work has also been carried out on moving objects in Shape and Motion Carving in 6D [225]. In this case, the notion of voxel is extended to the hexel, that is, a voxel moving over time. Assuming that the colour of each voxel remains relatively constant and that the motion between two frames is bounded, space carving can also be performed in time. The sweeping plane is extended to a thick plane called a slab.

A major issue with voxel-based methods in a sparse coverage context is that photo-consistency loses its meaning when a reduced number of reference images is employed. Ultimately, a voxel visible in only one reference image is always photo-consistent.

In practice, the space carving algorithm is initialised with a set of voxels that encompasses the object to be reconstructed. Then, for each of these voxels, the photo-consistency is computed. This function returns true if the voxel can have a consistent colour across the set of reference images; to simplify the problem, it returns true if the reprojections of the reference images onto this voxel all maintain the same colour. In reality, the intensity can vary considerably, and even the hue can differ for non-Lambertian surfaces. If the considered voxel is not photo-consistent across all the reference images, it is carved away, as shown in Figure 5.3. This process is repeated until convergence, i.e. until all the remaining voxels are photo-consistent.
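The loop just described can be sketched as follows. This is a simplified instance that ignores visibility ordering and plane sweeping; project() and the colour-spread threshold are placeholders for a concrete implementation.

    # Sketch of the basic space carving loop with a naive colour-spread
    # photo-consistency test.
    import numpy as np

    def carve(voxels, cameras, images, project, thresh=30.0):
        # voxels: set of (x, y, z); project(cam, v) -> pixel (u, v) or None
        changed = True
        while changed:                        # repeat until convergence
            changed = False
            for v in list(voxels):
                colours = []
                for cam, img in zip(cameras, images):
                    px = project(cam, v)
                    if px is not None:
                        colours.append(img[px[1], px[0]].astype(float))
                if len(colours) >= 2:
                    spread = np.max(np.ptp(np.array(colours), axis=0))
                    if spread > thresh:       # not photo-consistent: carve
                        voxels.discard(v)
                        changed = True
        return voxels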

5.1.5.3 Probabilistic methods

In the original IBVH algorithm, a small segmentation error can potentially lead to relatively large errors in the final 3D model. This is due to the binary nature of the segmentation and modelling techniques. To overcome this issue, probabilistic models can be used [81, 82]. The 3D structure of the subject can be derived from the 2D shape using Bayesian inference. Grauman et al. [82] generated 20,000 synthetic view-dependent images of pedestrians to train their inference system. Synthetic images have the added advantage of being fast and easy to generate. They can also be segmented perfectly, leading to an accurate, non-biased reference dataset. With this approach, inference is performed at the image level using this ground-truth dataset. An articulated body is then matched to the image and subsequently used for rendering.

5.2 Canonical novel views generation

In general, relying on multiple views should help to ensure a view-invariant object representation. The Space Carving algorithm was implemented for this purpose, but it is not appropriate when dealing with sparse views. Indeed, the photo-consistency concept on which the Space Carving method is based requires a larger number of reference views reprojected onto each voxel in order to be statistically significant and ensure suitable results.


In this thesis, the IBVH method [148] is mainly relied upon. Indeed, this method is based on binary silhouette segmentation, which is conceptually very close to the VSN blob sensing paradigm described in earlier chapters. Moreover, IBVH is faster to compute than other techniques such as Space Carving, and therefore bodes well for future on-node VSN-based processing. The IBVH algorithm mainly consists of three steps: background subtraction (binary image segmentation), visual hull sampling and visual hull shading.

5.2.1 Image-Based Visual Hulls


Figure 5.4: The basic principle of IBVH: sample the intersection of the perspective-projected silhouettes according to the desired POV.

5.2.1.1 Subject 3D reconstruction

Initially, each camera is calibrated individually to determine its intrinsic parameters [250]. The extrinsic parameters of the camera network are then determined in a common coordinate system, and all parameters are optimised within an iterative non-linear refinement loop. This is a common and flexible approach for calibrating a multi-camera system, only requiring a planar calibration object (an excellent implementation of the core technique is available online [29]).

Once the 2D silhouettes from each camera are extracted, they are projected along the perspective of each camera, generating a generalised pyramid as illustrated in Figure 5.4. Assuming that the desired image is virtually acquired by a pinhole camera (such as the usual camera and human eye models), the intersection is then sampled using a pyramidal set of rays with their apex at the desired POV.

99

Page 100: Markerless Visual Tracking and Motion Analysis for Sports …julien.pansiot.org/suppl/Julien_Pansiot_thesis.pdf · tion, wearable-ambient sensor fusion and extraction of biomechanical

Other topologies could also be used to simulate non-pinhole cameras, such as orthographic projection. This method can at best reconstruct the convex visual hull of the subject (with an infinite number of reference images), and therefore handles neither occlusions nor concavities.
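A minimal sketch of the sampling step is given below. It tests points along each desired ray against all binary silhouettes, which is a simplification: the original IBVH algorithm of Matusik et al. intersects 2D intervals exactly rather than sampling. All names are assumptions.

    # Sketch of silhouette-cone sampling along a desired ray.
    import numpy as np

    def ray_in_hull(origin, direction, cameras, silhouettes, t_samples):
        # cameras: list of 3x4 projection matrices; silhouettes: binary masks
        depths = []
        for t in t_samples:
            p = np.append(origin + t * direction, 1.0)   # homogeneous 3D point
            inside_all = True
            for P, sil in zip(cameras, silhouettes):
                q = P @ p
                u, v = int(q[0] / q[2]), int(q[1] / q[2])
                if not (0 <= v < sil.shape[0] and 0 <= u < sil.shape[1] and sil[v, u]):
                    inside_all = False       # outside one silhouette cone
                    break
            if inside_all:
                depths.append(t)             # sample lies inside the visual hull
        return depths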

5.2.1.2 Novel view rendering

Generating a nearly photo-realistic view from the implicit 3D model requires two stages: a visibility detection pass to determine which camera can see which pixel, and the final shading process to render each pixel.

Visibility detection Considering the large number of pixels and the multiple sampling rates at various stages of the process, visibility detection is not a trivial task. In this study, a reverse projection of the sampled rays onto the camera image plane was used. Note that the sampling rate of this depth map must be chosen according to the sampling rate of the novel rays.

If the accuracy of the model is not good enough, the algorithm can detect incorrect extended visibility. This is particularly common on the edges of the model, where the rays are nearly tangential to the surface. This issue was tackled by considering only points where the camera rays do not hit the surface too tangentially.

Surface shading There are several ways to process the shading of the visual hull and choose the appropriate reference image for texturing. For example, Matusik et al. [148] relied on a view-dependent technique that simply selects the reference camera whose angle to the desired one is the smallest. Another proposed method would choose the closest camera; both are called nearest-neighbour interpolation in [234]. A method that selects the camera most normal to the surface was also implemented, as this usually means a better resolution. However, the quality of the result is highly dependent on the model and the camera configuration, so it should be used only in well-controlled environments. A shaded image is shown in Figure 5.5. It is also possible to blend several reference views, but this often leads to artefacts due to the small number of cameras used.

Moreover, a set of quality-based priority rules can be manually set up in the system. For example, it is possible to use a low-resolution reference image for shading only if no better image is available. This allows the building of a heterogeneous camera network whilst avoiding quality deterioration.

The main issues with this implementation can be summarised as follows.

Dependency on calibration – Similarly to space carving and most IBM techniques, IBVH depends strongly on camera pre-calibration.


Figure 5.5: IBVH: three original images and a shaded novel view.

Shading: camera selection – Matusik et al. [148] select the reference camera whose angle to the desired one is the smallest and do not blend several reference images. This can be useful when the reference images are sufficiently correlated when reprojected onto the surface. A method relying on this correlation could decide automatically whether or not to blend.

Shading: gaps between cameras – As the sampled 3D model is always slightly inaccurate, the reprojection of the reference images can easily lead to gaps or overlaps. Unlike Space Carving, the pixels on the surface of a visual hull are not necessarily photo-consistent, as shown in Figure 5.6: IBVH decorrelates colour and volume information.

Also, because the colours are not perfectly consistent between the reference images, seams can appear. This issue is tackled by Watson et al. [234], but it requires relatively large overlaps between reference images. A stitching algorithm can be used, which follows the edges between the reference images and interpolates the colours around them, as shown in Figure 5.7.

Sampling artefacts appear at various stages of the method, such as visibility detection and reference image selection.

Self-occlusion remains a major issue. The IBVH algorithm is theoretically only applicable to a convex hull; otherwise, the result is unpredictable. Stray effects increase with the complexity of the posture of the subject and with a diminishing number of reference images. This problem is visible as an artefact between the legs in Figure 5.5 (right).



Figure 5.6: IBVH: a slightly inaccurate 3D model leads to shading errors.

5.2.1.3 Quantitative evaluations

Two aspects of the novel images are considered in this chapter to evaluate the quality of the results: their photo-realism (self-consistency) and their similarity to what they should be (consistency with the real world). Although some mathematical formulae can help, the best way to evaluate photo-realism is by human observation.

In order to evaluate the proposed rendering algorithm objectively, a leave-one-out approach was followed. For this purpose, a scene was acquired with four cameras, but only three were used in the reconstruction process. The remaining one was kept as a reference image for comparison. The novel image was then computed from the POV of this reference camera, and the resulting images were compared statistically. To obviate bias, the images were cropped to the bounding box of the silhouettes to avoid an artificially high global correlation due to the large proportion of background within the image.

To provide an overall idea of the location of the inaccuracy, the Euclidean distance in colour space between the two images was computed for each pixel. This is equivalent to the undirected Chamfer distance mentioned earlier in Chapter 3. The corresponding results are shown in Figure 5.8 (left). It is evident that the inaccuracy is mostly concentrated on the sides of the silhouette, particularly on the upper body. The overall shape, however, is not affected. A map of the distance in hue was also computed, as shown in Figure 5.8 (right), showing the colour fidelity across the image.

Several estimators were then computed to evaluate the similarity between the reference image and the reconstructed one. They are summarised in Table 5.1; the average distance between the two silhouettes is computed first.


Figure 5.7: Stitching the seams. Left: original image. Right: stitched image.

Average silhouette distance       4.57 pixels (1.21%)
Average silhouette overlap        92%
Normalised RMS error              0.28
SNR                               11.15 dB
Pearson correlation coefficient   0.85

Table 5.1: Comparison estimators.

The standard normalised Root Mean Square (RMS) error is then evaluated. As the RMS error is rather sensitive to noise (in particular that introduced by the reference image), the Pearson product-moment correlation coefficient is also computed, which estimates the global correlation between the images on a pixel-by-pixel basis.
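These estimators can be sketched as follows; the normalisation by 255 and the SNR definition used here are assumptions, as several conventions exist.

    # Sketch of the image-similarity estimators of Table 5.1, computed on
    # images cropped to the silhouette bounding box.
    import numpy as np

    def compare(ref, novel):
        a = ref.astype(float).ravel()
        b = novel.astype(float).ravel()
        rms = np.sqrt(np.mean((a - b) ** 2)) / 255.0              # normalised RMS
        snr = 10 * np.log10(np.sum(a ** 2) / np.sum((a - b) ** 2))  # SNR (dB)
        pearson = np.corrcoef(a, b)[0, 1]                         # pixel-wise correlation
        return rms, snr, pearson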

5.2.2 Automated canonical view rendering

The main advantage of the proposed method is that canonical views can be generated. Such synthetic views could be taken from a POV that remains static in relation to the subject. This can be, for example, a virtual camera that follows the subject at all times. The generation of such novel images can then be used for monocular image-based processing. This can dramatically reduce the variability of the detected posture by removing view-dependent bias.

To achieve this goal, an approximate 3D model is sampled first. An Oriented Bounding Box (OBB) [79] of this model is then computed using the eigenvectors of the covariance matrix of the sampling rays.

Given the set {p_i} of the N extreme points of the rays, which is a sampling of the surface of the model, and its centre c = \frac{1}{N}\sum_{i=1}^{N} p_i, the covariance matrix is given as follows:

C = \frac{1}{N}\sum_{i=1}^{N} (c - p_i)(c - p_i)^T    (5.1)

Figure 5.8: Comparison of the real image against the novel view: white means no difference at all, black means the maximum distance. Left to right: RGB distance, Pearson correlation coefficient, hue distance.


The eigenvectors of this matrix can be approximated by using the power method [237] for the dominant eigenvalue and by deflation for the next two. Given a vector $x_0$ that is not orthogonal to the dominant eigenvector, the dominant eigenvalue $\lambda_{max}$ is:

$$\lambda_{max} = \lim_{n\to\infty} \frac{C^{n+1}x_0 \cdot C^{n}x_0}{C^{n}x_0 \cdot C^{n}x_0} \qquad (5.2)$$

where $n$ is a positive integer. $v_{max}$ being the dominant eigenvector, the deflation method gives a new matrix $C'$ on which the power method can be applied recursively to approximate the next eigenvalues:

$$C' = C - \lambda_{max}\, v_{max} v_{max}^T \qquad (5.3)$$
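To make the procedure concrete, the following minimal numpy sketch (an illustration under the above equations, not the thesis implementation; the function names are hypothetical) combines Equations (5.1)-(5.3) to recover the OBB centre and axes from the sampled ray end-points:

import numpy as np

def dominant_eigenpair(C, n_iter=100):
    # Power method (Eq. 5.2): the start vector must not be orthogonal to
    # the dominant eigenvector; normalising each iterate avoids overflow.
    x = np.ones(C.shape[0])
    for _ in range(n_iter):
        x = C @ x
        x /= np.linalg.norm(x)
    return x @ C @ x, x          # Rayleigh quotient gives the eigenvalue

def obb_axes(points):
    # Centre and covariance of the sampled surface points (Eq. 5.1).
    c = points.mean(axis=0)
    d = points - c
    C = d.T @ d / len(points)
    axes = []
    for _ in range(3):
        lam, v = dominant_eigenpair(C)
        axes.append(v)
        C = C - lam * np.outer(v, v)   # deflation (Eq. 5.3)
    return c, np.array(axes)           # OBB centre and three axis directions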

The orientation of such a bounding box matches the shape of the cloud of points. Assuming that in most cases this bounding box will follow the general "natural" orientation of the body, one can rely on it to automatically generate six virtual points of view (front, back, top, bottom, left, right). By using the techniques discussed earlier, six novel canonical views can be generated. A graphical summary of the entire process is shown in Figure 5.9.

However, the camera configuration can limit the "canonicality" of these views. Because the sampling of the visual hull may be inaccurate, some strong artefacts (such as a large hump in the back of the subject) may lead to a miscalculated bounding box.

Figure 5.9: Workflow of the key steps involved in generation of a canonical novel view of a subject based on two-pass sampling. (Diagram elements: reference images, background segmentation, binary silhouettes, perspective-projected silhouettes, 3D reprojection, low-resolution ray sampling, approximate 3D model, covariance & eigenvectors, OBB (Oriented Bounding Box), canonical POV, high-resolution ray sampling, canonical 3D model, visibility detection & camera choice, colour images, shading, photo-realistic novel views.)

5.3 Application to tennis player tracking

5.3.1 Tennis stroke recognition

An experiment was carried out to evaluate the improvement of motion tracking by using the canonical views generated above. Once the silhouette of the player has been segmented from the background, it is possible to recognise elementary movements of the player and the played strokes. Whilst it is possible to do so by using a monocular framework, strong constraints must be applied to ensure robust classification. For example, Bloom and Bradley [28] assumed that the camera was at the back of the court and it was also known whether the player is right- or left-handed.


In this chapter, an experiment was designed to demonstrate the value of canonical views over single views. The same tennis tracking algorithm was applied to the original view-dependent silhouettes and to the reprojected canonical views. The proposed activity recognition algorithm mainly follows Bloom and Bradley's method [28] and consists of six major steps, as illustrated in Figure 5.10 (a sketch of steps 2 and 3 follows the list):

1. Statistical background segmentation and/or generation of a novel canonical silhouette;

2. Dilation-erosion and selection of the largest blob for noise removal;

3. Zhou et al. thinning algorithm [252] to reduce the blob to a one-pixel-thin, single-connected frame;

4. Detection and tracking of the pseudo-skeleton joints and termination points;

5. Pseudo-skeleton simplification and key element annotation (central joint, head, foot, racket) using heuristics based on spatial relationships and graph connectivity;

6. Activity classification based on a cross-correlation measure with a set of reference patterns.
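As a rough illustration of steps 2 and 3, assuming OpenCV with the contrib modules and using the Zhang-Suen thinning available there as a stand-in for the Zhou et al. algorithm:

import cv2
import numpy as np

def clean_and_thin(mask):
    # Step 2: dilation-erosion (morphological closing) and largest-blob
    # selection to remove segmentation noise from the binary silhouette.
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n > 1:
        largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])  # label 0 = background
        mask = np.where(labels == largest, 255, 0).astype(np.uint8)
    # Step 3: reduce the blob to a one-pixel-wide pseudo-skeleton.
    return cv2.ximgproc.thinning(mask)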

The reference patterns are built by averaging manually annotated samples extracted from a training dataset. The use of cross-correlation provides simultaneous detection and classification of the strokes and is efficient enough in this context because the considered strokes are relatively consistent speed-wise. For other, more diverse motions, a Hidden Markov Model (HMM) may be used.
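A minimal sketch of the cross-correlation matching (the names and the detection threshold are illustrative; the actual feature traces would be the tracked key-element trajectories):

import numpy as np

def ncc_scores(trace, pattern):
    # Sliding normalised cross-correlation of a 1D feature trace against a
    # reference stroke pattern; a peak above a threshold (e.g. 0.7) yields
    # simultaneous detection (where) and classification (which pattern).
    m = len(pattern)
    p = (pattern - pattern.mean()) / (pattern.std() + 1e-9)
    out = np.empty(len(trace) - m + 1)
    for t in range(len(out)):
        w = trace[t:t + m]
        w = (w - w.mean()) / (w.std() + 1e-9)
        out[t] = float(np.mean(w * p))   # NCC score in [-1, 1]
    return out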

Figure 5.10: Left to right: original colour image, binary background segmentation, thinning, pseudo-skeleton, key element detection.

5.3.2 Experimental results

Three video sequences were recorded synchronously with four cameras, during which a player was asked to simulate several actions such as standing straight, waiting for a serve, serving, and playing forehand and backhand strokes. Binary background segmentation was performed on the sequences, as illustrated in Figure 5.11. The number of strokes played is reported in Table 5.2.

Figure 5.11: Binary silhouettes derived from the player's video sequences acquired by two cameras. Left to right: stand, wait for serve return, forehand, backhand and serve. Inaccurate lower-leg segmentation can be observed for some postures.

In order to introduce some variability in the orientation, the actions were performed facing three arbitrary directions. The IBVH algorithm was then applied to the sequences in order to generate a single canonical view for each frame, as illustrated in Figure 5.12. It can be observed that for this particular example, the novel canonical view is similar to the top-left reference image. Therefore, the algorithm is expected to perform equally well for this particular orientation. It can also be seen that other views are more ambiguous, particularly the one on the bottom right.

The recognition algorithm explained above was then applied to the segmented reference images and the novel canonical view, respectively, for performance comparison. Note that the objective of this experiment is to demonstrate the value of the canonical view as a conditioning step, not to evaluate the recognition algorithm itself. The corresponding sensitivity and specificity results are presented in Tables 5.2 and 5.3, respectively. It can be seen that the main issue encountered by this algorithm is the differentiation between forehand and backhand strokes. This is mainly due to the relaxation of the view-point constraint used in the original study [28], where the camera has to be located behind the player.

It can be seen that overall the recognition accuracy is better when the canonical view is used, although the original images occasionally gave slightly better results. For example, the algorithm detects backhands well on view 4 with the player at orientation 3, but performs poorly for the same player at orientation 1, as illustrated in Figure 5.13. This can be explained by the relative orientation at this particular angle being more favourable for detecting backhand motion. On the other hand, waiting for a serve return is much better detected under orientation 1 than 3 from this view. Overall, the use of the canonical view gives more consistent results across all orientations.

Figure 5.12: Left: reference binary segmented images. Right: novel canonical view depth map derived from IBVH computation – note that the depth information is not actually used.

Action             Played   Sensitivity on view (%)
                            1      2      3      4      Canonical
Stand for return   15       33.3   60.0   93.3   66.7   93.3
Forehand stroke    26       88.5   53.9   15.4   30.8   80.8
Backhand stroke    21       0.0    14.3   23.8   38.1   47.6
Serve              24       58.3   16.7   45.8   33.3   58.3
Overall            86       48.8   34.9   39.5   39.5   68.6

Table 5.2: Elementary actions recognition rates: sensitivity.


Action             Specificity on view (%)
                   1       2      3       4      Canonical
Stand for return   100.0   86.7   20.0    6.7    66.7
Forehand stroke    0.0     65.4   84.6    65.4   92.3
Backhand stroke    71.4    90.5   90.5    71.4   76.2
Serve              79.2    87.5   100.0   95.8   75.0
Overall            55.8    81.4   79.1    65.1   79.1

Table 5.3: Elementary actions recognition rates: specificity.


Figure 5.13: Recognition rate (sensitivity) by activity (wait for return, forehand, backhand, serve) for two different general player orientations (orientations 1 and 3), per view and for the canonical view.

5.4 Summary and conclusion

5.4.1 Summary

In an effort to reduce the POV variability, a multi-view framework was employed in this chapter. In order to fuse information from different views, the concept of IBMR has been introduced. IBMR allows the generation of novel images from arbitrary view points without the need for explicit 3D modelling. Among these methods, an approach based on IBVH has been used for generating a novel canonical view for consistent 3D recognition.

The proposed method works in two steps. First, a coarse 3D model is generated by using the IBVH with a low sampling density. This model allows the identification of the main orientation of the subject using an OBB based on Principal Component Analysis (PCA). Subsequently, the IBVH algorithm is used to render a novel view from which consistent posture recognition can be performed.

5.4.2 Conclusion

In this chapter, it has been shown that tennis player tracking from multiple view points reduces POV-dependent bias and improves the overall robustness of the algorithm. Indeed, the derived canonical view is a POV-independent representation of the player's posture.

The novel canonical images exhibit the same properties as the original images and can subsequently be processed by using any chosen vision-based posture recognition technique. Canonical view generation is therefore a versatile means of reducing the POV variability.

In order to demonstrate the practical value of canonical views, a tennis stroke recognition algorithm has been applied to the original segmented images and to those from the canonical view. It has been shown that a standard stroke recognition algorithm works more consistently when applied to the novel canonical view as opposed to the original images.

It is worth noting that the proposed method relies on implicit 3D multiple-view geometry. Therefore some information is lost, which may be useful for subsequent tracking and classification processes.


Chapter 6

Subject-Centric 3D Model Descriptors

In the previous chapter, the value of multi-view geometry for tennis player tracking has been demonstrated. The high dimensionality of the three-dimensional (3D) reconstructed model has been addressed by using its reprojection onto a canonical two-dimensional (2D) image. The main step associated with such canonical views is the computation of their orientation, which is achieved by using Principal Component Analysis (PCA). The derived canonical view is therefore viewpoint-invariant, because its orientation has been normalised. Whilst this method works well for good-quality image data, image noise and artefacts can affect the normalisation process.

In this chapter, it will be examined how the model dimensionality can be reduced by generating a rotationally invariant descriptor rather than the previously described rotationally normalised canonical view. The advantage of invariance over normalisation lies in its independence from the original data. If the original image has an artefact at one specific location, the final rotationally invariant descriptor is also expected to carry some artefact at a particular location, whereas a normalisation phase would spread the error across the entire canonical view. For this reason, the use of a subject-centric descriptor is proposed in this chapter. The position is normalised by the computation of a cylindrical or a spherical depth map, which is the maximal extent of the subject's 3D model from its centre of mass. Frequency analysis based on spherical harmonics (an extension of Fourier analysis in spherical coordinates) allows the generation of a compact 1D rotationally invariant posture signature.

The use of the Image-Based Visual Hulls (IBVH) as described in the previous chapters implies a "hard" binary segmentation on the original images before projecting into the 3D space. In this case, one poorly segmented view will have a strong impact on the final result even if all the other views are perfectly segmented. In order to address this issue, it is desirable to fuse the foreground probabilities in the 3D space and perform the binary segmentation at this level. In this chapter, the existing shape descriptors will first be reviewed briefly. The mathematical background on spherical harmonics is introduced, followed by the proposed signature generation. A Fuzzy Visual Hulls (FVH) pipeline is then described, and detailed validation results are presented.

6.1 3D shape descriptors review

Because of their high dimensionality, 3D models must be reduced before further analysis. Typical methods for dimensionality reduction include PCA and Independent Component Analysis (ICA) [245]. They are all relevant to video stream or 3D model analysis. However, many modelling methods are more specific to 3D models, and they all attempt to maximise the information-to-data-size ratio. As the type of information can vary with applications, many 3D shape descriptors have been designed. This section introduces commonly used 3D shape descriptors.

6.1.1 Common 3D shape descriptors

Thus far, the large number of 3D shape descriptors proposed is mostly due to the need to handle 3D models from an increasing number of 3D scanning techniques based on video, laser or ultrasound. In real-time computer graphics, only the model appearance matters. Therefore, the descriptor usually consists of the detailed surface of the 3D object, represented by polygons and material properties. This type of representation is relatively straightforward to use and offers high-quality rendering from arbitrary view points.

On the other hand, voxel-based descriptors allow for an in-depth description of the shape. This type of representation is memory-consuming and the rendering of a 2D image projection from an arbitrary view point is not trivial. But the full volumetric information is captured by these descriptors, which makes them suitable for many 3D simulations. Rendering 2D slices of the model is a common and easy way to display data represented by this kind of descriptor. While most of the 2.5D and 3D model descriptors are view-independent, some of them, such as a depth map, can be view-dependent.

Extensive work has been carried out to build search engines for 3D models in large databases [10, 64, 73, 114, 115, 123, 172]. Matching polygons or voxel-based models directly is usually not practical. In order to provide a robust and efficient method for computing 3D model similarity, most methods rely on a 2-step approach [115]:

1. Signature generation from the original model with information abstraction; this is usually performed offline.

2. On-line comparison of the models through the abstracted information.


The abstract descriptor, or signature, is a function defined on a canonical domain or a transformation-invariant function. These descriptors are usually meant to be compact so that shapes can be compared efficiently afterwards. Although these techniques are similar to human posture matching, they differ in the sense that for posture recognition the emphasis is on subtle changes whereas the global shape remains similar.

A non-exhaustive list of common spherical descriptors is given hereafter, followed by non-spherical ones. "Spherical descriptors" means that the descriptors are defined on a sphere. It does not mean that the generation of such a descriptor necessarily involves a form of projection from the centre of the shape onto a bounding sphere. It can be any type of mapping from a shape onto the (unit) sphere.

Extended Gaussian Image (EGI) [100] – generated by accumulating the normals to the surface in consideration onto a Gaussian (unit) sphere, as shown in Figure 6.1. In other words, for each angular element, the surface normal and the area of the surface element are computed. A mass corresponding to this area is then added on the Gaussian sphere at the unique position where the normal to the sphere matches the normal to the surface. The EGI can be computed on polygon meshes, in which case the projection on the Gaussian sphere will be discrete, or on continuous spherical functions. This descriptor is translationally invariant and scale-normalised. However, it is rotationally dependent. In fact, the EGI rotates with the same angle as the original shape.

Figure 6.1: Extended Gaussian Image (EGI) [100]: the vectors normal to the surface of the shape (on the left) are mapped onto the Gaussian sphere (on the right).

Simplex Angle Image (SAI) [55, 94] – a variation of EGI based on the curvature of a regularised 3D mesh derived from a shape. The simplex angle is equal to 0 for a flat surface, negative for a concave shape and positive for a convex one. This measure has the advantage of being invariant to scale, translation and rotation.

Radial Distribution – each angular segment is represented by the mean and standard deviation of the distribution sampled on the ray, as shown in Figure 6.2.

Figure 6.2: Radial distribution¹: the shape is sampled with rays originating from its centre to determine, for instance, the average and the variance (plotted as extent against angle in radians).

Spherical Extent Function (SEF) [229] – each angular segment is represented by the last intersection of the ray with the model, as shown in Figure 6.3. In the case of a continuous distribution, a threshold may be required to distinguish between the inside and the outside of the model.

Figure 6.3: Spherical Extent Function (SEF)¹ of a binary or surface-based model: the shape is sampled with rays originating from its centre to determine the maximal extent.

Sector Model – each angular segment is represented by the area of the surface element, as illustrated in Figure 6.4.

Shell Model – represents the shape volume contained in concentric spheres and is therefore inherently rotationally invariant, as illustrated in Figure 6.4.

¹ Whilst these figures represent a 2D model sampled in 1D for simplicity of illustration, the actual descriptors provide a 2D spherical "depth image" from a 3D model.


Shape Histograms [10] – combine sectors and shells, as illustrated in Figure 6.4.

Figure 6.4: Shape decomposition into shells, into sectors, and the combined method proposed by [10].

Reflective Symmetry Descriptors (RSD) [114] – are based on a measure of reflective self-similarity. A similarity measure is calculated between the original shape and its reflection on a plane passing through its centre of mass. The full descriptor is obtained by computing this measure for a collection of such symmetrical planes. This descriptor is scale-invariant.

Spherical texture mapping – if the original 3D shape is textured and if its surface topology is equivalent to a sphere (genus-zero), it is possible to generate a spherical image using dense correspondence [206]. This is mainly used for human body tracking, texture normalisation and compression.

Wavelets – can be used for spherical data post-processing [197].

Non-spherical descriptors include:

Voxel grid – a representation on a 3D grid, in which each element (called a voxel) is set to one only if it is inside the shape.

Shape Distributions [172] – a collection of probabilistic measures are derived from the model. The distributions of these measures form a shape signature. These measures can be, for instance, the distance between two random points, the angle between three random points, or the distance between a given point (typically the centre) and a random one. These measures are rotationally and translationally invariant. Some of them are also scale-invariant.

Spin Images [108] – 2D surface descriptors. Given a point on the surface of the shape, and the normal to the surface at this point, a cylindrical coordinate system is associated with this point. The radial and elevation coordinates of all the points on the surface in this coordinate system are then accumulated into a 2D image. Depending on the size of the accumulators (bins) and the maximal extent of the coordinate system, it is possible to generate a range of images with varying resolution to contain either local or global information.


Harmonic shape images [248] – a harmonic map is a smooth mapping between two 3D manifolds that minimises an energy measure such as the Euclidean distance. The harmonic image of a 3D surface is a representation on a 2D disc of a harmonic map matching the vertices of the given surface and a 2D flat disc. The harmonic shape image is a harmonic image exhibiting a local shape parameter such as the surface curvature.

Edge paths – a set of representative edge paths on the surface of the shape are computed. Kolognias et al. [123] combine these edge paths with the aspect ratio and a voxel grid in normalised space (referred to as a binary 3D shape mask).

Moments [64] – can be computed on normalised models.

Primitives – the model is decomposed into base elements (also referred to as geons) such as boxes, spheres, generalised cylinders [59] or superquadrics [48].

One of the main characteristics of a shape descriptor is whether its extent is local or global. Local descriptors can be defined as a collection of local features, whereas global descriptors encompass the whole shape in all their constituent elements. Local descriptors are typically more sensitive to noise. In particular, EGI and SAI are susceptible to noise as they are based on derivatives of the surface. Small random variations on the 3D vertices of the surface of the model will generate large differences in the surface normal and its curvature, thus affecting the accuracy of EGI or SAI. Global descriptors, such as Shape Distributions, Radial Distribution, SEF or RSD, are much more robust to noise. However, global descriptors are unable to distinguish subtle changes in shapes. Moreover, they are not as robust to cluttered or partially occluded shapes as local descriptors. Indeed, local descriptors can be used for partial matching by using only a subset of features.

A summary of the main properties of some of the descriptors is given in Table 6.1. For each descriptor, the table indicates whether it is a spherical descriptor (native means a direct reprojection of the model onto the sphere, whereas mapping means that a morphism has been applied to map the model onto the unit sphere), whether it is a local or global descriptor, and whether it is invariant to scale, translation and rotation.

6.1.2 Spherical harmonics

Spherical harmonics are a set of functions forming an orthonormal basis for all square-integrable spherical functions. They are similar to the Fourier basis functions (a set of sinusoids), but defined on a sphere instead. Although spherical harmonics are natural extensions of the Fourier basis functions, they are not a direct mapping of the 2D plane (x, y) sinusoids onto the spherical coordinates (θ, ϕ). This is mostly due to the variability of the point density on the surface when points are uniformly distributed in the (θ, ϕ) space (infinite near the poles, low at the equator) [212].


Descriptor                  Spherical   Extent         Sc.   Tr.   Rot.
EGI                         mapping     local          N     I     N
SAI                         mapping     local          I     I     I
Spherical texture mapping   mapping     local/global   N     N     N
Radial Distribution         native      global         N     N     N
SEF                         native      global         N     N     N
Sector Model                native      global         N     N     N
Shell Model                 native      global         N     N     I
Shape Histograms            native      global         N     N     N
RSD                         native      global         I     N     N
Spin Image                  no          local/global   N     I     I
Shape Distributions         no          global         N     I     I
Edge paths                  no          local          N     I     N
Moments                     no          global         N     N     N
Geons                       no          global         N     N     N

Table 6.1: Shape descriptors overview. Sc., Tr., Rot. stand for Scale, Translation and Rotation, whereas I stands for Invariant and N means Normalised.

Spherical harmonics have been extensively used in physics to calculate and represent the effect of central forces such as the gravitational and electromagnetic forces. However, some interest has been raised recently in the computer vision and computer graphics community for face recognition [249], anatomical modelling [58], illumination map compression [240], robotic navigation [142] and 3D model search engines [73, 115].

6.1.2.1 Mathematical definitions

In this thesis, the spherical coordinates follow the common convention used in physics, as described, for example, in [12]. Thus $\theta$ is the colatitude or polar coordinate satisfying $\theta \in [0, \pi[$ and $\varphi$ is the longitude or azimuthal coordinate satisfying $\varphi \in [0, 2\pi[$. In a complex space, $i$ is the imaginary unit and $\bar{a}$ is the complex conjugate of $a$.

On the complex spherical function space, the inner product of two square-integrable functions $f$ and $g$ is defined as:

$$< f, g > \; = \int_{\varphi=0}^{2\pi} \int_{\theta=0}^{\pi} f(\theta, \varphi)\, \overline{g(\theta, \varphi)}\, \sin\theta \, d\theta \, d\varphi$$

Therefore the natural norm on this space is defined by:

$$\|f\| = \sqrt{< f, f >} = \sqrt{\int_{\varphi=0}^{2\pi} \int_{\theta=0}^{\pi} |f(\theta, \varphi)|^2 \sin\theta \, d\theta \, d\varphi} \qquad (6.1)$$


6.1.2.2 Basis functions

The Laplace differential equation in spherical coordinates is expressed as follows [36, 238]:

$$\nabla^2 f = \mathrm{div}\,\mathrm{grad}\, f = \frac{1}{r^2}\left[\frac{\partial}{\partial r}\left(r^2 \frac{\partial f}{\partial r}\right) + \frac{1}{\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\,\frac{\partial f}{\partial\theta}\right) + \frac{1}{\sin^2\theta}\frac{\partial^2 f}{\partial\varphi^2}\right] = 0 \qquad (6.2)$$

The solutions to this equation can be interpreted physically as potential functions due to forces on the spherical coordinate system.

Assuming $f(r, \theta, \varphi) = R(r)\Theta(\theta)\Phi(\varphi)$, a particular solution of this equation can be calculated. Using the variable separation method on the angular part of this equation, also known as the spherical harmonics differential equation, it can be rewritten:

$$\frac{\Phi(\varphi)}{\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\,\frac{\partial\Theta}{\partial\theta}\right) + \frac{\Theta(\theta)}{\sin^2\theta}\frac{\partial^2\Phi}{\partial\varphi^2} + l(l+1)\,\Theta(\theta)\Phi(\varphi) = 0 \qquad (6.3)$$

where:

$$l(l+1) = \frac{r}{R(r)}\frac{\partial^2}{\partial r^2}\big(r\,R(r)\big) = \frac{1}{R(r)}\frac{\partial}{\partial r}\left(r^2\frac{\partial R}{\partial r}\right) \qquad (6.4)$$

It can be further proved [36, 238] that the solution of this equation, $Y_l^m$, is the product of an associated Legendre polynomial $P_l^m$ and trigonometric functions:

$$Y_l^m(\theta, \varphi) = \sqrt{\frac{(2l+1)}{4\pi}\,\frac{(l-m)!}{(l+m)!}}\; P_l^m(\cos\theta)\, e^{im\varphi} = \alpha_l^m\, P_l^m(\cos\theta)\, e^{im\varphi} = \tilde{P}_l^m(\theta)\, e^{im\varphi} \qquad (6.5)$$

where $\alpha_l^m$ is a regularisation term that varies with the field of research, $P_l^m(X)$ is the associated Legendre polynomial and $\tilde{P}_l^m(\theta) = \alpha_l^m P_l^m(\cos\theta)$ is the regularisation of $P_l^m(X)$. $l$ is called the degree and $m$ the order of the harmonic. They verify $l \geq 0$ and $|m| \leq l$. The set of spherical harmonics will be referred to as $SH$. The lower-degree spherical harmonics are depicted in Figure 6.5.

The associated Legendre polynomials $P_l^m$ are based on the Legendre polynomials $P_l$ and defined as [36, 235]:

$$P_l^m(X) = (-1)^m (1 - X^2)^{m/2}\, \frac{\partial^m}{\partial X^m} P_l(X) = (-1)^m (1 - X^2)^{m/2}\, \frac{\partial^m}{\partial X^m}\!\left(\frac{1}{2^l\, l!}\,\frac{\partial^l}{\partial X^l}(X^2 - 1)^l\right) = \frac{(-1)^m}{2^l\, l!}\,(1 - X^2)^{m/2}\,\frac{\partial^{m+l}}{\partial X^{m+l}}(X^2 - 1)^l \qquad (6.6)$$

In this definition, the term $(-1)^m$ is called the Condon-Shortley phase and is sometimes omitted or included in $\alpha_l^m$ [238].


Eventually, the general solution of the Laplace equation is a linear combination of the spherical harmonics:

$$f(r, \theta, \varphi) = \sum_{l=0}^{\infty}\sum_{m=-l}^{l} r^{-1-l}\, C_l^m\, Y_l^m(\theta, \varphi) + \sum_{l=0}^{\infty}\sum_{m=-l}^{l} r^{l}\, C_l^{m\prime}\, Y_l^m(\theta, \varphi) \qquad (6.7)$$

where $C_l^m$ and $C_l^{m\prime}$ are the coefficients of the combination. Thus for a spherical function defined on the unit sphere we have simply:

$$f(\theta, \varphi) = \sum_{l=0}^{\infty}\sum_{m=-l}^{l} C_l^m\, Y_l^m(\theta, \varphi) \qquad (6.8)$$

Figure 6.5: The first-degree spherical harmonics (real part). Positive lobes are represented in red, whereas negative lobes are blue.

Spherical harmonics fall into three different classes: the zonal harmonics of the form $Y_l^0$, the sectoral harmonics of the form $Y_l^l$, and the remaining ones, referred to as tesseral harmonics.

6.1.2.3 Function decomposition

The spherical harmonics normalised with the constant $\alpha_l^m$ are orthonormal [36, 238]. In other words, they satisfy:

$$\forall\, (Y_l^m, Y_{l'}^{m'}) \in SH^2, \quad < Y_l^m, Y_{l'}^{m'} > \; = \delta_{ll'}\,\delta_{mm'} \qquad (6.9)$$

where $\delta_{ab}$ is the Kronecker delta: $\delta_{ab} = 1$ if $a = b$, $\delta_{ab} = 0$ otherwise. As the spherical harmonics form an orthonormal basis of the spherical functions, the coefficients $C_l^m$ in the decomposition of $f$ can be expressed as the projection of $f$ onto $SH$ [58] using the inner product:

$$C_l^m = \; < f, Y_l^m > \; = \int_{\varphi=0}^{2\pi}\int_{\theta=0}^{\pi} f(\theta, \varphi)\, \overline{Y_l^m(\theta, \varphi)}\, \sin\theta\, d\theta\, d\varphi \qquad (6.10)$$

These coefficients can be calculated in a discrete manner on an $n \times 2n$ dataset using the following sum:

$$C_l^m = \frac{\pi^2}{n^2} \sum_{k=1}^{n} \sum_{j=1}^{2n} f(\theta_k, \varphi_j)\, \tilde{P}_l^m(\theta_k)\, \overline{e^{im\varphi_j}}\, \sin\theta_k = \frac{\pi^2}{n^2} \sum_{k=1}^{n} \tilde{P}_l^m(\theta_k)\, \sin\theta_k \sum_{j=1}^{2n} f(\theta_k, \varphi_j)\, \overline{e^{im\varphi_j}} \qquad (6.11)$$

where $\theta_k = \pi k/n$ and $\varphi_j = 2\pi j/2n$. It is then possible to recognise the complex Fourier transform of the function $f_k(x) = f(\theta_k, x)$ in the nested sum, which allows faster computation using the Fast Fourier Transform (FFT):

$$C_l^m = \begin{cases} \dfrac{\pi^2}{n^2} \displaystyle\sum_{k=1}^{n} \tilde{P}_l^m(\theta_k)\, \sin\theta_k \; \mathrm{FFT}^{-1}(f_k)[m] & \text{if } m \geq 0 \\[2ex] \dfrac{\pi^2}{n^2} \displaystyle\sum_{k=1}^{n} \tilde{P}_l^m(\theta_k)\, \sin\theta_k \; \mathrm{FFT}(f_k)[m] & \text{if } m < 0 \end{cases} \qquad (6.12)$$

As the spherical harmonics form an orthogonal basis for all square-integrable spherical functions, it is possible to decompose (i.e. project) any function into the spherical harmonic basis and to recompose these functions.
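As a concrete illustration, the following sketch (assuming scipy and numpy; the helper name sh_coefficients is hypothetical) evaluates the projection of Equation (6.10) by direct summation rather than the FFT shortcut of Equation (6.12). Note that scipy.special.sph_harm takes the azimuthal angle first and the polar angle second:

import numpy as np
from scipy.special import sph_harm

def sh_coefficients(f_grid, l_max):
    # f_grid samples f(theta, phi) on an n x 2n grid: rows index the
    # colatitude theta, columns the longitude phi.
    n, two_n = f_grid.shape
    theta = np.pi * (np.arange(n) + 0.5) / n        # colatitude in ]0, pi[
    phi = 2.0 * np.pi * np.arange(two_n) / two_n    # longitude in [0, 2pi[
    d_area = (np.pi / n) * (2.0 * np.pi / two_n)    # d(theta) * d(phi)
    coeffs = {}
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            Y = sph_harm(m, l, phi[None, :], theta[:, None])
            integrand = f_grid * np.conj(Y) * np.sin(theta)[:, None]
            coeffs[(l, m)] = integrand.sum() * d_area   # Eq. (6.10), discretised
    return coeffs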

An obvious difference with the FFT decomposition is that the number of coefficients is infinite. As a matter of fact, it is practically impossible to compute an exact decomposition with a finite number of coefficients. The finite decomposition is a least-squares approximation of the original function. As an example, a cube reconstructed using increasing degrees is shown in Figure 6.6. The remaining artefacts are very similar to those demonstrated when approximating a square wave with a Fourier series.

Figure 6.6: A cube reconstructed with spherical harmonics under increasing degrees (0, 4, 8, 20).

6.1.2.4 Other mathematical representations

Other mathematical methods similar to the spherical harmonics have also been used in computer vision. For example, Fourier descriptors have been used for gait recognition in [157]. These descriptors are similar to spherical harmonics, although they are defined on a plane and are therefore Point-Of-View (POV) dependent. Spherical wavelets have also been used to represent, model and compress functions on a sphere [197], such as topographic data. Spherical wavelets are very similar to spherical harmonics. However, they provide local rather than global frequency information.

6.1.3 Spherical harmonics-based rotationally invariant descriptors

As discussed in [115, 194, 228, 229], one of the challenges of 3D model comparison is that arbitrary similarity transformations of the same object are usually considered as equivalent. Indeed, a cube that is rescaled, translated and rotated remains a cube. However, it should be noted that this is not always true. For example, a human subject standing or lying may differ only by a rotation around a horizontal axis, but the two postures are usually not considered to be equivalent. A common idea is to transform a 3D surface into a 2D image. 2D images can then be matched in a much simpler manner by using well-known, robust and fast techniques.

To compare objects, several general methods have been proposed:

• Try to explicitly align the models. However, this is usually impractical, given the complexity of the models and the high number of degrees of freedom of the transformations.

• Normalise the models: describe the models by rotation-dependent descriptors in a canonical coordinate system. Typically the centre of mass is used for translation, the square root of the average square radius for scale, and a PCA or central moments analysis for rotation.

• Describe the models by transformation-invariant descriptors.

• A hybrid of the above methods.

Although translation and scale normalisation are relatively robust, rotation normalisation using PCA is not always sufficient. Indeed, it only catches the second-order alignment. Therefore, after scale normalisation (average square radius) and translation (with regard to the centre of mass), Kazhdan et al. [115] proposed a decomposition scheme based on the original descriptors using spherical harmonics. If the original descriptor is a spherical function, the decomposition is straightforward. However, in the case of a voxel grid description, one has to use a set of binary spherical functions. For a given radius, the function returns true if and only if the nearest voxel is inside the model. The set of functions is built by successive increments of the radius.

Kazhdan et al. proposed a rotation-invariant descriptor by computing the L2-norm of the spherical harmonic components at every frequency, or degree. This descriptor represents the "energy" contained in the model at every frequency; a minimal sketch of this accumulation is given after the list below. While reducing a 3D model to a 1D signature, information loss can occur at several levels:

• In practice, the maximum frequency is limited.

• The sum over the harmonics inside a frequency group adds loss: components can be rotated independently and several components can carry the same frequency and energy.

• In the case of the voxel representation, further loss occurs due to a possible de-correlation between radii.
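Reusing the hypothetical sh_coefficients helper sketched earlier, the per-degree accumulation could look as follows; summing |C_l^m|² within a degree discards exactly the rotation-dependent information:

import numpy as np

def rotation_invariant_signature(coeffs, l_max):
    # One value per degree l: the energy carried by that frequency band.
    return np.array([
        np.sqrt(sum(abs(coeffs[(l, m)]) ** 2 for m in range(-l, l + 1)))
        for l in range(l_max + 1)
    ])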

Kazhdan et al. applied their method to the following original descriptors:

• EGI (normals distribution)

• Radial Distribution (average distance and standard deviation of the rays)

• SEF (maximal ray length)

• Sectors (amount of surface that sits over a ray)

• Shape Histogram (multi-shells sectors)

• Voxel grid


6.2 Compact descriptor generation

To handle self-occlusions and to overcome the problem of view-dependence, a multi-view system based on FVH has been designed. Typically, three to five cameras are employed to cover the target volume. The cameras are assumed to be calibrated, and background segmentation has been performed as described in the previous chapters. Reprojection of the silhouettes is conducted according to the cameras' positions and orientations. In order to generate a view-independent representation of the subject, the centre of mass of the 3D visual hull is computed and it is subsequently reprojected onto the spherical domain. The spherical reprojection is therefore subject-centric. A frequency-domain decomposition is performed using spherical harmonics, and a rotationally invariant signature is derived from this decomposition. This signature does not permit reconstruction of the 3D shape, but is sufficient for analysing the posture of the subject. An overview of the system is given in Figure 6.7.

Several variations have been applied to the proposed architecture. A major variant is to perform a cylindrical reprojection, thus resulting in a different signature. Other variations include the use of FVH instead of the original binary one. Nevertheless, the overall principle remains the same.

6.2.1 Fuzzy Visual Hulls

As discussed previously, one of the main issues with the original IBVH method is the binary segmentation required. If the background segmentation is slightly wrong in one of the input images, the error will propagate to the 3D results. This can be illustrated by a system composed of four cameras observing a subject. In one camera's field of view, the subject's hand has a colour similar to the background, and is therefore misclassified as background. In the three remaining cameras, the hand is correctly classified as part of the foreground. All four silhouettes are then projected in 3D and the intersection is calculated. If we consider a voxel in 3D space at the position of the hand, it is considered as part of the subject by three cameras, and part of the background by one of them. The intersection of all four silhouettes at this voxel will then be considered as outside the subject. Small errors in classification, even in a limited number of input views, can therefore have a significant impact on the final 3D model. This error propagation occurs because a logical (binary) intersection is performed after binary segmentation (labelling).

If the intersection is calculated before the binary segmentation, errors can be recovered. This idea was first developed by Franco and Boyer [72]. To carry on with the previous example, let us associate a foreground probability with each camera. A pixel is considered as foreground if its probability is above 0.5. For one of the hand pixels, these probabilities are (0.75, 0.82, 0.87, 0.42), as illustrated in Figure 6.8 (left). The last camera incorrectly classifies this pixel as background (0.42 < 0.5). These probabilities can be directly reprojected for intersection before binary segmentation.


Figure 6.7: An overview of the proposed FVH framework pipeline for compact posture signature generation. (Pipeline: reference images → [background segmentation] → foreground probabilities → [probabilities reprojection] → Fuzzy Visual Hulls (FVH) → [spherical projection] → Spherical Extent Function (SEF) → [spherical harmonics decomposition] → spherical harmonics → [frequency accumulation] → compact signature.)

The simplest way of evaluating the intersection in the probability space is to compute the mean value (i.e. 0.715 here). The segmentation can eventually be performed in 3D space. This voxel will now be considered as part of the subject, as its probability is higher than 0.5. The three correct cameras have recovered the error introduced by the fourth one; a schematic illustration of the FVH is given in Figure 6.8 (right).
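The hand-voxel example above can be condensed into a few lines of numpy (illustrative values only):

import numpy as np

# One row per 3D sample, one column per camera.
probs = np.array([[0.75, 0.82, 0.87, 0.42]])   # camera 4 mislabels the hand

# IBVH-style: binary segmentation per view, then logical intersection.
ibvh = np.all(probs > 0.5, axis=1)             # -> [False]: error propagates

# FVH-style: fuse the probabilities first, segment once in 3D space.
fvh = probs.mean(axis=1) > 0.5                 # mean 0.715 -> [True]: recovered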


Figure 6.8: Left: comparison of regular IBVH and FVH. FVH performs the intersection before segmentation, thus propagating fewer errors. Right: illustration of the FVH output.

In general, the above problem can be approached as a sensor fusion problem. The intersection and segmentation stages detailed above are elementary ways of performing sensor fusion. More complex approaches can be developed. For example, Franco and Boyer [72] formulated the problem as a Bayesian inference problem where the inputs are the foreground probabilities.

In the proposed system, the binary segmentation is postponed to the last stage of the pipeline in order to reduce the information loss throughout the process. This binary labelling could also potentially be omitted completely: a 3D probability grid could indeed be generated, in which the value of each voxel would represent the probability of the voxel being part of the subject. As appealing as this idea may sound, it has a major drawback in that the probability depends on the contrast between the subject and the background. The final grid can be dependent not only on the subject's clothes and background colours, but also on the relative position of the subject in the environment. A final binary segmentation is therefore necessary to ensure a more consistent result.

6.2.2 Subject-centric descriptors

Once the 3D model has been derived, its centre must be computed in order to derive a translation- and rotation-invariant shape descriptor. However, finding this centre accurately and quickly is not trivial. One method is based on a two-pass approach: a first pass (or pre-sampling) is used to evaluate the centre, and the second pass realises the actual spherical sampling. However, this method is relatively slow. A tracking-based method has therefore been implemented. The centre needs to be initialised. Once the system is running, the actual centre is derived while performing the spherical reprojection. Considering this centre as an observation, a Kalman filter can be applied during the correction stage. This filter then provides a predicted estimate of the centre position at the next frame. This prediction can then be used during the next spherical sampling pass. This method allows for the computation of the centre and sampling in a single pass. However, the accuracy may be inadvertently reduced, in particular when the subject suddenly changes its motion direction or speed.
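A minimal constant-velocity sketch of such a single-pass tracker (assuming a unit frame interval; the class name and the noise levels q and r are illustrative):

import numpy as np

class CentreTracker:
    """Constant-velocity Kalman filter sketch for tracking the subject's
    centre between frames (state: 3D position and velocity)."""

    def __init__(self, centre0, q=1e-2, r=1e-1):
        self.x = np.concatenate([centre0, np.zeros(3)])  # [px,py,pz,vx,vy,vz]
        self.P = np.eye(6)
        self.F = np.eye(6); self.F[:3, 3:] = np.eye(3)   # p' = p + v (dt = 1)
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        self.Q = q * np.eye(6)                           # process noise
        self.R = r * np.eye(3)                           # observation noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]        # predicted centre for the next sampling pass

    def correct(self, observed_centre):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (observed_centre - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P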

In this chapter, two kinds of subject-centric descriptors were considered: a spherical harmonic signature based on a spherical reprojection, and a signature based on features derived from a cylindrical reprojection around the vertical axis.

6.2.3 Spherical harmonic-based descriptor

In order to reduce the 3D model to a rotationally invariant signature as discussed in [115], a 3D model is evaluated for each video frame using FVH. The 3D model is then reprojected onto the unit sphere using the SEF described earlier and detailed in Figure 6.3. Such a spherical 2D depth map is illustrated in Figure 6.7. A spherical harmonic decomposition is then performed on the spherical depth map. Harmonic coefficients of each frequency are accumulated to compute the final 1D descriptor, which results in a 1D frequency representation over the entire spherical domain. Therefore, the spherical harmonic signature is translation-normalised, rotationally invariant and very compact. It is also possible to extend this static posture signature to a dynamic activity signature by applying an FFT on each of the signature coefficients over time.

6.2.3.1 Intrinsic limitations

Two main limitations arise from the use of a fully translation-normalised and rotationally invariant descriptor for posture and activity recognition. Because of translation normalisation, the global motion is not considered and therefore the overall walking speed is lost in the process. In a similar fashion, the rotational invariance conceals the general orientation of the subject. As a consequence, standing and lying may have similar signatures. As the coordinate system is subject-centric, only local changes, such as limb motion, are captured. In order to recover these meaningful features, two extra values are added to the signature: the overall motion speed and a verticality coefficient, ranging from 0 (lying) to 1 (standing).

6.2.3.2 Practical limitations

In practice, determining the position of the centre is not an easy task, and a translation of the centre may have significant consequences on the spherical reprojection. Another issue arises when the centre falls outside of the 3D model volume. To illustrate the first effect, an experiment was carried out to evaluate the influence of a slight shift of the reprojection centre. The centre was gradually shifted up to a tenth of the model size. The most affected decomposition orders were, respectively, 1, 0 and 3. Empirically, this can be explained by an eccentric mass transfer. Indeed, the first harmonic is a centred sphere, whilst all the others are lobes originating from the centre. Moreover, the odd decomposition coefficients (1, 3, 5, ...) do not exhibit a symmetrically distributed mass. As illustrated in Figure 6.9, an artificially increasing centre translation leads to a decrease in coefficient 0 (uniformly distributed mass) and an increase in coefficient 1 (asymmetrically distributed mass). A similar effect is observed with attenuated amplitude in coefficients 2 and 3.

Figure 6.9: The effect of centre translation on spherical harmonic coefficients (coefficient value against centre translation in % of model size). For the lower-degree coefficients, the value is given for 10 gradually increasing translations. The mass shift from symmetric coefficients 0 and 2 to asymmetric ones 1 and 3 is clearly visible.

This effect can be observed in real video data, as illustrated in Figure 6.10. A shift in mass due to inaccurate centre localisation can significantly affect coefficient 1, but coefficients 0 and 2 still exhibit features that can be used for shape classification.

Figure 6.10: Coefficients 0, 1 and 2 of the spherical harmonic decomposition performed on real video data, plotted against time in minutes. FVH and Kalman centre tracking were used. Several activities are clearly distinguished from coefficients 0 and 2, whereas coefficient 1 mostly exhibits the eccentric mass shift due to slightly inaccurate centre tracking.

It is worth noting that for practical applications, the centre can also fall outside of the 3D model. This typically occurs when the subject is bent over or extends his arms forward. In these cases, the centre of mass is then outside the volume, as illustrated in Figure 6.11. If the maximal extent is considered, the spherical function may be undefined on a domain (Figure 6.11 – right). By considering a negative maximal extent, it is possible to define the spherical reprojection on a larger proportion of the sphere (Figure 6.11 – left). However, the orientation of the surface covered by the negative maximal extent is reversed. This can lead to discontinuities at the intersection between positive and negative surfaces. Moreover, sometimes the rays may not intersect with the subject at all, as shown in Figure 6.11.

Figure 6.11: Tackling the centre falling outside of the 3D model with two different strategies. On the left, all the rays are reprojected from the outside (negative maximal extent), but the sampling direction is not consistent over the 3D model. On the right, the model is reprojected from the inside and the reprojection is partly incomplete.

6.2.4 Cylindrical extent-based descriptor

In order to deal with some of the issues arising from the use of spherical reprojection, a cylindrical reprojection is also proposed in this study. Once the centre of the subject is known, it is also possible to reproject the model on a vertical cylinder around the subject, as illustrated in Figure 6.13. Again, the maximal extent was used for the reprojection, as shown in Figure 6.12. The vertical axis is considered as known, as the camera system must be calibrated beforehand. The result of the projection is referred to as the Cylindrical Extent Function (CEF).

Figure 6.12: Example of CEF on a subject performing squats. The axis is clearly in front of the torso, leaving large empty (white) areas.

A standard FFT is applied along each circular slice to ensure rotational invariance around the vertical axis. Technically, this is in fact a 1D FFT applied to the unwrapped CEF. The CEF is only rotationally invariant around the vertical axis. This seems to be ideal for practical applications, as it does not matter whether the subject is facing left or right, whereas the leaning and swaying angles are of interest. As for the spherical harmonic descriptor, an extra speed value is added to the signature to capture the overall motion. Instead of the FFT, other, more compact features can also be derived. An elementary signature consists of the height of the reprojection, as well as the minimum, maximum and average of the maximal extent of the model on each horizontal slice.
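A minimal sketch of this per-slice signature (the DFT magnitude is invariant to circular shifts of a slice, i.e. to rotations about the vertical axis; names are illustrative):

import numpy as np

def cef_signature(cef, n_freq=8):
    # cef: unwrapped Cylindrical Extent Function, shape (n_slices, n_angles).
    # Keeping only the first few magnitudes also discards high frequencies.
    return np.abs(np.fft.rfft(cef, axis=1))[:, :n_freq]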

Issues can arise when the axis is partially or totally out of the 3D volume, as previously described and illustrated (Figure 6.11). Such cases are common with cylindrical reprojections, as the axis often runs between the subject's legs, as illustrated by the bottom image of Figure 6.13. A large proportion of rays will not intersect at all with the 3D model.

Figure 6.13: CEF on a 3D model. Two slices are given as examples. Each slice is converted into a 1D polar extent function r = f(a).

6.2.5 Distributed prototype

A distributed implementation of the pipeline illustrated in Figure 6.7 has been provided. The visual hull spherical or cylindrical sampling can be performed in parallel by a number of nodes. The spherical harmonic signature is calculated by another processing node. These modules are connected via TCP/IP through a client/server paradigm.

A client program connects to the spherical harmonics module, which acts as a server, in order to request a signature. The spherical harmonics module then connects to a given number of sampling modules to request the spherical projection. Once every sampling node has finished its task, the spherical harmonics module gathers the samples to build the complete spherical projection. It also computes the signature and returns it to the initial client. The process is illustrated in Figure 6.14.

Figure 6.14: Distributed prototype of the spherical harmonics pipeline. (Diagram: a light-weight client requests a signature from the spherical harmonics signature module, which dispatches sampling requests 1/n to n/n to the sampling modules.)
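As an illustration of the client side of this exchange, a minimal TCP sketch follows; the message and the length-prefixed float-array wire format shown here are entirely hypothetical:

import socket
import struct

def request_signature(host="localhost", port=5000):
    # Light-weight client: ask the spherical harmonics module for one
    # signature and read back a length-prefixed array of floats.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(b"GET_SIGNATURE\n")
        n = struct.unpack("!I", sock.recv(4))[0]      # number of coefficients
        payload = b""
        while len(payload) < 4 * n:
            payload += sock.recv(4 * n - len(payload))
        return struct.unpack(f"!{n}f", payload)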

6.3 Application to tennis stroke recognition

In this chapter, the spherical harmonic decomposition has been applied to the tennis player dataset described in the previous chapter. The player was asked to play the same stroke a number of times and was recorded using four fully calibrated cameras. The background probabilities were calculated and the FVH applied. A spherical sampling of the 3D probability map was performed and decomposed using spherical harmonics. The accumulation of the harmonics over the frequencies provided a compact signature.

6.3.1 Signature examples

Figure 6.15: Examples of spherical harmonic signatures extracted from a tennis player's motion over time (panels: stand straight, wait for return, forehand stroke, backhand stroke, serve, serve (2); coefficients SH-2 to SH-5 plotted against time in seconds). In this figure, the forehand and backhand strokes are played three times, whereas the serve is played twice. Four significant coefficients are illustrated. The serve signature is illustrated for two different player orientations.

A PCA of the dataset shows that most of the information is contained in the four coefficients 2 to 5, referred to as SH-2, SH-3, SH-4 and SH-5. These components are illustrated in Figure 6.15. It is evident that coefficient 2 (SH-2) on its own may carry enough information to distinguish between these strokes.


The underlying reason for this dominance of SH-2 in terms of amplitude and variability can be explained empirically. Firstly, this coefficient is even, meaning that it carries overall mass information rather than a mass displacement from the centre. Secondly, it is the lowest coefficient with an elongated shape rather than a uniform distribution in the spherical domain, as illustrated earlier in Figure 6.5. As a consequence, it matches the shape of the human body well. Therefore, this coefficient carries information related to the maximal extent of the subject from its centre of mass.

6.3.2 Tennis stroke classification

Some characteristics of each stroke can be identified in the signal; they include:

Stand still – Near-null derivative of the signal; high SH-2 due to the maximal-extent posture.

Wait for return – Slight sideways motion; lower SH-2 due to slight leaning forward.

Forehand stroke – SH-2 exhibits a plateau during the back-swing, followed by a sharp decrease as the racket gets closer to the trunk.

Backhand stroke – Similar to the forehand, but to a lesser extent; preparing for the backhand is also well marked by two peaks in between the shots. A mass transfer from SH-2 to the odd coefficients SH-3 and SH-5 is also observed during the swing, hinting at a shift of the player's centre of mass.

Serve – SH-2 has a pyramidal shape with a relatively high maximum due to the arm's vertical extent.

The two bottom graphs in Figure 6.15 illustrate the serve from two different orientations. The pyramidal shape is observed in both cases, showing that the signature is rotationally invariant.

Action             Played   Recognition rate
Stand for return   4        100.0%
Forehand stroke    7        100.0%
Backhand stroke    7        71.4%
Serve              8        75.0%
Overall            26       86.6%

Table 6.2: Automated stroke and posture recognition with rotationally invariant descriptor.

A manually annotated dataset is used to derive the average pattern signatures associated with each stroke listed in Table 6.2. A Normalised Cross-Correlation (NCC) is then used to match the reference patterns onto another dataset recorded from a different perspective. The results are summarised in Table 6.2 and Figure 6.16.

      W   F   B   S
W     4   0   0   0
F     0   7   0   0
B     0   2   5   0
S     0   2   0   6

      W     F     B     S
W    1.0   0.0   0.0   0.0
F    0.0   1.0   0.0   0.0
B    0.0   0.29  0.71  0.0
S    0.0   0.25  0.0   0.75

Figure 6.16: Classification confusion matrix (total and normalised): some serves and backhands are incorrectly classified as forehand. 'W' stands for 'wait for return', 'F' for forehand, 'B' for backhand and 'S' for serve.
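A minimal sketch of this NCC-based matching, assuming each reference pattern and each test signature is a 1D array of SH-2 values over time, and that the test signature is at least as long as each reference (all variable names are illustrative):

```python
import numpy as np

def ncc(a, b):
    """Normalised cross-correlation between two equal-length 1D signals."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))

def classify(signature, references):
    """Slide each reference pattern over the signature; keep the best match.

    signature: 1D array of coefficient values over time.
    references: dict mapping stroke name to its average 1D pattern.
    Returns the stroke label with the highest NCC score.
    """
    best_label, best_score = None, -np.inf
    for label, ref in references.items():
        n = len(ref)
        score = max(ncc(signature[i:i + n], ref)
                    for i in range(len(signature) - n + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```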

6.4 Summary and conclusion

6.4.1 Summary

Player tracking from different view points offers a more reliable result than that of a monocular framework. However, the high dimensionality of such a model makes further classification difficult. The method proposed in this chapter for dimensionality reduction consists of three main stages.

First, the model is reprojected onto a spherical depth map that represents the maximal extent of the player's model from the centre of mass (SEF). This operation effectively reduces the dimensionality from 3D to 2D. Furthermore, the representation is subject-centric and the SEFs are position-normalised. Secondly, the SEFs are decomposed using spherical harmonics, which is a form of frequency analysis in the spherical coordinate system. This further reduces the data size by filtering out the high-frequency components of the SEF. Thirdly, the spherical harmonic coefficients are accumulated on a frequency basis, producing a 1D signature. This descriptor is compact, normalised in translation and invariant to rotation, thus making the final signature easy to analyse.
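A minimal sketch of the last two stages — projecting a sampled SEF onto spherical harmonics and accumulating the coefficients per frequency — assuming the SEF has already been sampled on a regular (θ, φ) grid; the grid handling and variable names are illustrative:

```python
import numpy as np
from scipy.special import sph_harm

def sh_signature(sef, l_max=5):
    """Accumulate spherical harmonic power per degree into a 1D signature.

    sef: (n_theta, n_phi) samples of the spherical extent function,
    theta in [0, 2*pi) (azimuth), phi in (0, pi) (polar angle).
    Entry l of the result is the rotation-invariant energy of degree l.
    """
    n_theta, n_phi = sef.shape
    theta = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    phi = np.linspace(0, np.pi, n_phi, endpoint=False) + np.pi / (2 * n_phi)
    T, P = np.meshgrid(theta, phi, indexing="ij")
    # Quadrature weights on the sphere: sin(phi) dphi dtheta.
    w = np.sin(P) * (2 * np.pi / n_theta) * (np.pi / n_phi)

    signature = np.zeros(l_max + 1)
    for l in range(l_max + 1):
        power = 0.0
        for m in range(-l, l + 1):
            Y = sph_harm(m, l, T, P)          # Y_l^m sampled on the grid
            c = np.sum(sef * np.conj(Y) * w)  # projection coefficient
            power += abs(c) ** 2
        signature[l] = np.sqrt(power)         # invariant to rotation
    return signature
```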

6.4.2 Conclusion

In this chapter, a compact viewpoint-invariant descriptor based on spherical harmonics decomposition has been used for human 3D posture analysis. The method allows visual information from multiple views to be fused whilst preserving the fundamental postural information in a low-dimensional descriptor.

Specifically, the proposed method has proved more reliable for tennis stroke recognition than a monocular framework. Indeed, it has been shown by experiment that the derived compact signatures are position and rotation invariant and can capture information about a tennis player's posture over time, thus enabling further detailed posture analysis.


Furthermore, it has been shown that the hard binary segmentation issue encountered in the original IBVH algorithm can be partially solved by using a fuzzy framework. Under this paradigm, the binary segmentation between the player and the background is not carried out in the 2D image space but postponed until later in the pipeline, during 3D reprojection, which makes the algorithm relatively immune to noise and segmentation errors.


Chapter 7

Model-Based Motion Analysis

In the previous chapters, several methods have been proposed to infer the general posture from image sequences, either using a monocular vision system or multi-view projections. In general, these methods are based on pixel or voxel information and the final descriptors are mainly geared towards posture segmentation rather than biomechanical description. For example, the shape descriptors proposed in the last chapter are not able to provide information about the joint angles during a serve. In order to calculate detailed kinematic indices, it is necessary to adopt a model-based approach. This involves building a human biomechanical model and fitting it to the observed data. This represents a high-level approach and it relies on a prediction and an estimation stage, as previously mentioned in Section 3.3. The estimation stage derives the initial state of the model from the observation, whereas the prediction stage infers the current state from the previous state combined with model constraints.

The method proposed in this chapter extends the model-based approach previously provided by Gavrila and Davis [74], as discussed in Section 3.3.2. The aim of this chapter is to rely on a reduced number of cameras and to impose no constraints on clothing. The main hurdles to overcome are associated with limited depth perception and the inherent ambiguities of the object, such as self-occlusion. In this chapter, a motion prediction method is proposed, which combines low-level prediction with global activity modelling. Because the range of motion performed by a tennis player can vary significantly, a method based on colour segmentation is designed, which can keep tracking when no suitable activity models are available. Unlike Gavrila and Davis' approach, the proposed method does not involve the use of specifically coloured garments.

The remainder of this chapter is organised as follows. The human biomechanical model based on tapered superellipsoids is first introduced. At the estimation stage, a tracking algorithm based on the fitting of the three-dimensional (3D) model to the observed two-dimensional (2D) image data is then described. Finally, the low- and high-level motion prediction methods are developed and the method is validated with detailed motion analysis of tennis players.

7.1 Human biomechanical model

7.1.1 Model design

As mentioned earlier in this thesis, the main purpose of introducing a human model is to reduce the search space and resolve ambiguities based on prior knowledge. A tracking model is fundamentally a set of constraints on the search space. In the particular case of human model tracking, these constraints may be applied at three different levels.

First, the model provides a definition of the tracked subject's posture at a fixed and reasonably low dimension. In this case, the proposed model is a human articulated skeleton, or kinematic tree. This is actually a set of constraints on the relative configuration and articulation of the skeleton.

Secondly, the model should incorporate prior knowledge of the dynamic aspects of the body to reduce model complexity and enable accurate state prediction. The dynamic aspects of the model include joint angle constraints during motion. Low-level motion prediction (e.g. Kalman filtering) can have limited efficiency due to non-linear motion. High-level motion prediction, which consists of learning activity-specific motion primitives, therefore needs to be developed.

Additionally, the model needs to compare its state to the current observation. In linear systems such as the Kalman filter, this is simply formulated by an observation matrix. The present case is more complex as there is a need to compare an observed 2D image to a 3D model. In practice, a stick model cannot accurately represent the human body for direct comparison. The muscular structure is therefore represented in this chapter by tapered superellipsoids [74], as illustrated in Figure 7.1. In this case, each joint provides one to three angular Degrees Of Freedom (DOF) and the angular ranges are limited according to average human kinematics. This model has a total of 27 DOF (6 DOF for the general position and orientation, 3 DOF for the chest, 2 DOF for the head, 4 DOF for each leg and arm), or 35 DOF when optionally including hands and feet.

7.1.2 Superquadrics

A superellipse is a 2D closed curve, defined by the following equation [23, 107]:

\left(\frac{x}{a}\right)^m + \left(\frac{y}{b}\right)^m = 1 \qquad (7.1)

where m is a positive rational number, and a and b are the sizes of the axes. For m = 2, this equation defines an ellipse, whereas m \to \infty leads towards a rectangular shape and m \to 0 towards a cross. Negative values of m lead to open Lamé curves such as hyperbolas. This equation can be rewritten in polar coordinates:

s(\theta) = \begin{bmatrix} a\cos^{\varepsilon}\theta \\ b\sin^{\varepsilon}\theta \end{bmatrix}, \quad \theta \in [-\pi, \pi] \qquad (7.2)

where the exponentiation is actually the signed power function: x^{\varepsilon} = \operatorname{sign}(x)\,|x|^{\varepsilon}.

Figure 7.1: Articulated human model structure (left) and its representation based on tapered superellipsoids (right).

Superellipsoids are a 3D extension of the superellipses, obtained by the spherical product of two Lamé curves [23]:

\begin{aligned}
r(\theta, \varphi) &= s_1(\theta) \otimes s_2(\varphi) \qquad (7.3) \\
&= \begin{bmatrix} a'_1\cos^{\varepsilon_1}\theta \\ b'_1\sin^{\varepsilon_1}\theta \end{bmatrix} \otimes \begin{bmatrix} a'_2\cos^{\varepsilon_2}\varphi \\ b'_2\sin^{\varepsilon_2}\varphi \end{bmatrix} \\
&= \begin{bmatrix} a'_1 a'_2 \cos^{\varepsilon_1}\theta \cos^{\varepsilon_2}\varphi \\ a'_1 b'_2 \cos^{\varepsilon_1}\theta \sin^{\varepsilon_2}\varphi \\ b'_1 \sin^{\varepsilon_1}\theta \end{bmatrix}, \quad (\theta, \varphi) \in \left[-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right] \times [-\pi, \pi]
\end{aligned}

which can eventually be rewritten in a simpler form without loss of generality:

r(\theta, \varphi) = \begin{bmatrix} a_1\cos^{\varepsilon_1}\theta \cos^{\varepsilon_2}\varphi \\ a_2\cos^{\varepsilon_1}\theta \sin^{\varepsilon_2}\varphi \\ a_3\sin^{\varepsilon_1}\theta \end{bmatrix}, \quad (\theta, \varphi) \in \left[-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right] \times [-\pi, \pi] \qquad (7.4)


The εi parameters are sometimes referred to as the squareness coefficients. Indeed, a value close to zero will lead to a square shape whereas a value close to one will lead to a more spherical one, as shown in Figure 7.2 (right). Superellipsoids are actually part of a broader family of 3D curves: the superquadrics. Superquadrics are the 3D extensions of the Lamé curves. These include four main categories: superellipsoids, superhyperboloids of one and two sheets, and supertoroids. Examples of each of these types of curves are shown in Figure 7.2 (left). It should be noted that superhyperboloids are asymptotic and not closed surfaces.

Figure 7.2: Left: superquadrics include superellipsoids (a), supertoroids (b) and superhyperboloids of one (c) or two sheets (d). Right: superellipsoid examples showing the influence of the squareness coefficients ε1 and ε2.

Superellipsoids are the most commonly used type of superquadrics in modelling, because they are genus-zero closed surfaces. They allow for a compact reconstruction of complex shapes by hierarchical decomposition into superellipsoids [48]. In fact, each superellipsoid is completely defined by three scale parameters and two squareness coefficients. Superellipsoids include primitive shapes such as spheres, cubes and cylinders, sometimes referred to as geons. A set of superellipsoids with different squareness coefficients is shown in Figure 7.2 (right).
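As a concrete illustration of Equation 7.4, a minimal sketch that samples points on a superellipsoid surface using the signed power function (the sampling resolution and example parameter values are arbitrary choices):

```python
import numpy as np

def signed_pow(x, eps):
    """Signed power function: sign(x) * |x|**eps."""
    return np.sign(x) * np.abs(x) ** eps

def superellipsoid(a1, a2, a3, eps1, eps2, n=64):
    """Sample surface points of a superellipsoid (Equation 7.4).

    a1, a2, a3: scale parameters; eps1, eps2: squareness coefficients.
    Returns three (n, n) arrays with the x, y, z coordinates.
    """
    theta = np.linspace(-np.pi / 2, np.pi / 2, n)
    phi = np.linspace(-np.pi, np.pi, n)
    T, P = np.meshgrid(theta, phi, indexing="ij")
    x = a1 * signed_pow(np.cos(T), eps1) * signed_pow(np.cos(P), eps2)
    y = a2 * signed_pow(np.cos(T), eps1) * signed_pow(np.sin(P), eps2)
    z = a3 * signed_pow(np.sin(T), eps1)
    return x, y, z

# eps1 = eps2 = 1 yields an ellipsoid; values close to 0 yield a box-like shape.
x, y, z = superellipsoid(1.0, 1.0, 2.0, 0.9, 0.9)
```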

7.1.3 Activity modelling

In general, modelling activities allows for better motion prediction. In order to represent potential activities in a compact and yet accurate manner, activities are defined by a set of reference states referred to as Key Postures (KPs). This is similar to the concept of synthetic "key frames" used in [84] for activity modelling. A KP is technically a set of joint angles and angular velocities at a given time. Each KP is represented by a node in a graph, whose edges denote the possible transitions between the KP states. Given the continuous nature of the motion and the uncertainty in its speed, an observed state can then be modelled by a weighted blending between two connected KPs. A blend between two KPs can be performed through linear interpolation. In essence, this is a Markov model as proposed in [37], whose transition probabilities are used as blending weights.
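A minimal sketch of this blending step, assuming a KP is stored as vectors of joint angles and angular velocities (the data layout is hypothetical):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class KeyPosture:
    angles: np.ndarray      # joint angles (rad)
    velocities: np.ndarray  # joint angular velocities (rad/s)

def blend(kp_a, kp_b, w):
    """Linearly interpolate between two connected KPs, weight w in [0, 1]."""
    return KeyPosture(
        angles=(1 - w) * kp_a.angles + w * kp_b.angles,
        velocities=(1 - w) * kp_a.velocities + w * kp_b.velocities,
    )
```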

7.1.4 Activity learning

Activities recorded during well-defined scenarios, for example when using multiple cameras or relying on marker-based tracking, can be used for activity modelling. An algorithm is proposed to learn such activities by using the following steps:

1. Gross segmentation.

2. Cycle detection and cycle factorisation.

3. Finer KP segmentation.

Gross segmentation The algorithm first determines the minimal overall angular velocities. Indeed, the low angular velocity points are often associated with a change in the motion direction, and therefore are likely to violate the linear interpolation model. As a consequence, such points are good KP candidates as well as potential activity change points. For each sub-sequence, an independent activity model is created, as illustrated in Figure 7.3 (top).

Cycle detection and factorisation The KPs are then compared using a distance function based on joint-to-joint angle differences. In the case of a cyclic activity such as walking, running or rowing, some continuous segments will be similar. These segments are merged into a cycle, as illustrated in Figure 7.3 (bottom), and the activity models are updated accordingly.

[Figure: activity graphs whose KP nodes include standing, walking, running and transition states, linked by interpolated transitions.]

Figure 7.3: An example of activity graph factorisation. Top: original sequential KP graph. Bottom: cyclic activity factorisation.


Finer KP segmentation Once the graph has been reduced, each segment is then processed more finely so that intermediate KPs can be incrementally added. They are placed as far apart as possible without exceeding the allowed modelling error for the intermediate states. For a given sequence, the KPs are selected such that:

1. No KP is an interpolation between two connected KPs. This would lead to multiple definitions of the same state as well as unnecessarily increasing the graph size.

2. The KP density is low enough to prevent the observation from skipping a KP.

3. The KP density is high enough to ensure that the approximation error introduced by interpolation is acceptable.

For example, running can be modelled with only four KPs, as illustrated in Figure 7.4.

Figure 7.4: The four KPs used for running motion modelling.

7.2 Estimation stage

7.2.1 Model fitting

During estimation, it is necessary to evaluate the model state given the current observation and the predicted state. The match between the 3D model and the two-dimensional binary segmented image provided by the blob sensor is then performed. This is usually done in the 2D image space for performance considerations. Therefore, this method involves estimating the parameters of an articulated 3D model given its projection onto a 2D image. This relatively complex task is performed with a generate-and-test strategy as suggested in [74]. In other words, 2D images are generated from the 3D model and compared to the observed image.


In general, fitting all human 3D joints at once would be time consuming due to the high dimensionality of the human model. In this study, a hierarchical skeleton tree is used instead. Multiple passes with decreasing angular increments can be applied for increased accuracy. This hierarchical model fitting is performed by using an iterative gradient descent algorithm until convergence. The starting point of the gradient descent is crucial to ensure convergence.

The main iterative loop, as illustrated in Figure 7.5, consists of four main stages:

1. Displacement of the 3D model parameters in a test direction.

2. Projection of the 3D model into 2D space as a binary image, illustrated in Figure 7.5. This has been hardware-accelerated using OpenGL.

3. Evaluation of a matching score between the reference binary image and the 2D model projection. An exclusive OR operator (XOR) is used for this purpose.

4. Decision to keep or reject the parameter changes.
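A minimal sketch of one pass of this loop, assuming a hypothetical render_silhouette(params) routine that stands in for the OpenGL projection of step 2, boolean images throughout, and one possible scoring convention (fewer mismatched pixels is better):

```python
import numpy as np

def match_score(observed, rendered):
    """XOR-based matching score between two boolean silhouette images."""
    return -np.count_nonzero(observed ^ rendered)

def refine(params, observed, render_silhouette, step=0.05):
    """One pass of coordinate descent over the model parameters."""
    best = match_score(observed, render_silhouette(params))
    for i in range(len(params)):
        for delta in (-step, +step):
            trial = params.copy()
            trial[i] += delta                        # 1. perturb one parameter
            rendered = render_silhouette(trial)      # 2. project model to 2D
            score = match_score(observed, rendered)  # 3. XOR matching score
            if score > best:                         # 4. keep or reject
                params, best = trial, score
    return params, best
```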


Figure 7.5: Generate-and-test strategy: estimation loop of the 2D/3D posture matching algorithm.


7.2.2 Bootstrapping

Bootstrapping is a key element of our tracking framework. Indeed, all prediction methods rely on the current state of the model to predict the next state. Without bootstrapping, prediction would be impossible when initiating the system. Moreover, an incorrect initialisation of the model can lead to erratic tracking results, as the tracker in general may not be able to recover if incorrectly initialised.

For this reason, bootstrapping is performed through a coarse brute-force search rather than the gradient descent method mentioned previously. In the context of a particular activity, the activity model KPs are tested as potential starting points.

7.2.3 Colour-based self-occlusion handling

Self-occlusion is a major problem for the proposed platform. The arms are typically located in front of the chest and will not be distinguishable from the chest after background segmentation. This is mainly due to the binary segmentation. Indeed, the arms are typically well distinguished from the chest in the original colour image, but may be fused erroneously in the binary blob.

Gavrila and Davis [74] originally used edge matching instead of blob silhouettes, which partly resolved this common problem. However, it must be understood that this solution was made efficient by the strong colour contrast (and therefore sharp edges) provided by specific clothes. For example, crossed legs can be detected easily due to their different colours. For our application, these clothing constraints are impractical.

In order to partially resolve this ambiguity, an automated colour blob segmentation is used in place of the binary segmentation, as illustrated in Figure 7.6. In essence, this method is a generalisation of the aforementioned algorithm by Gavrila and Davis. The key improvement of the proposed method over theirs is that it relies on the coloured blobs naturally derived from the subject. Whilst this contributes to reducing self-occlusions in some cases, such as the arms in front of the chest, it does not provide a substantial gain for others, such as the crossed-legs issue, as the legs are bound to be coloured identically.

However, this solution raises new challenges. In particular, the 2D blob matching must be performed in a coloured image space. This means that the 3D human model must be coloured beforehand. Moreover, several measures can be used to calculate the match score between the reprojected model image and the segmented image.

Figure 7.6: Left to right: original binary blob, colour-segmented blob and coloured reprojected model, partially resolving the self-occlusion ambiguities.

Let M_i be the colour of the model for a given pixel i and O_i the colour of the observed pixel. The colour is encoded as a strictly positive natural number if the pixel is part of the foreground, and null if it is part of the background. The match score can be defined as a linear combination of a binary background/foreground match and an exact colour match:

\mathrm{MatchScore} = \sum_i \left[ \alpha f(M_i, O_i) + \beta g(M_i, O_i) \right] \qquad (7.5)

where α and β are the mixing parameters and:

f(x, y) = \begin{cases} 1 & \text{if } x = y = 0 \\ 1 & \text{if } x \neq 0 \text{ and } y \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (7.6)

g(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{otherwise} \end{cases} \qquad (7.7)

Setting α = 0 leads to an exact colour match, whereas β = 0 recovers the original binary blob matching detailed previously.

The choice of values for α and β is based on the quality of the colour segmentation and the similarity between the model and the actual subject to be tracked. Indeed, if the colour segmentation is noisy, it is better not to rely on it too much and therefore to decrease β. Similarly, if the model colouring does not match the subject well, the colour information may be misleading, and β should be reduced.
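A minimal sketch of Equations 7.5 to 7.7, assuming the model reprojection and the observed segmentation are integer label images in which 0 encodes background and positive integers encode colour classes:

```python
import numpy as np

def colour_match_score(model, observed, alpha=0.5, beta=0.5):
    """Colour match score of Equation 7.5.

    model, observed: integer label images of identical shape
    (0 = background, >0 = colour class).
    """
    both_bg = (model == 0) & (observed == 0)
    both_fg = (model != 0) & (observed != 0)
    f = (both_bg | both_fg).astype(float)  # fg/bg agreement (Eq. 7.6)
    g = (model == observed).astype(float)  # exact colour agreement (Eq. 7.7)
    return float(np.sum(alpha * f + beta * g))
```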

It should be noted that whilst all the illustrations depicted in this chapter are coloured for improved visibility, the actual implementation relies on grey-scale images for performance reasons. Indeed, the bottleneck of this implementation lies in the efficiency of frame retrieval from the graphics hardware; therefore, dividing the frame size by a factor of three significantly improves the overall performance.


7.3 Prediction stage

Because the projection of a 3D model onto a 2D image is not an injective transformation, several distinct 3D model configurations can lead to the same 2D image projection. In other words, the dimensionality reduction due to the projection can result in ambiguities in 2D model fitting. Many of the 3D model states potentially matching the 2D images are unlikely to happen, either because of temporal consistency or simply because they are not representative of a common human posture in the given context. In order to resolve these ambiguities, it is necessary to select the most likely 3D model among the set of possible postures.

Therefore, the prediction stage is important not only for performance reasons, given the high dimensionality of the search space, but also for reducing the uncertainty due to noise, the limited number of views and 2D reprojection.

7.3.1 Joint-level state prediction

A low-level prediction can be performed for each individual joint by using a Kalman filter. By using the couple (angle, angular velocity) as a tracking state, an estimation of the joint position can be derived. However, the highly non-linear nature of human motion prevents good prediction results from the Kalman filter alone. Nevertheless, this estimation is a good means of noise filtering.
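A minimal sketch of such a per-joint filter with a constant-velocity model and the state (angle, angular velocity), written directly from the standard Kalman equations; the noise covariances q and r below are illustrative values:

```python
import numpy as np

def kalman_step(x, P, z, dt, q=1e-3, r=1e-2):
    """One predict/update cycle for a single joint.

    x: state [angle, angular velocity]; P: 2x2 state covariance;
    z: observed angle (rad); dt: time step (s);
    q, r: process and measurement noise (illustrative values).
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity transition
    H = np.array([[1.0, 0.0]])             # only the angle is observed
    Q = q * np.eye(2)

    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q

    # Update with the observed angle.
    y = z - (H @ x)[0]                     # innovation
    S = (H @ P @ H.T)[0, 0] + r            # innovation variance
    K = (P @ H.T)[:, 0] / S                # Kalman gain
    x = x + K * y
    P = (np.eye(2) - np.outer(K, H[0])) @ P
    return x, P
```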

Further biomechanical assumptions are considered in this study, in particular the angular ranges and the maximal angular velocities. However, over-constraining the model should be avoided in practice. Indeed, giving too much emphasis to the prediction rather than the observation can prevent the tracker from recovering from errors. For example, let us consider a knee model with a maximal angular velocity of 6 rad/s. At the first frame, the knee angle is 1.3 rad and moving at 5 rad/s, so it will be 1.5 rad 40 ms later. If the first knee angle is incorrectly estimated at 1.2 rad, the tracker will assume that the next angle is only in the range of 1.2 ± 0.24 rad. Therefore, even if the model fits correctly at 1.5 rad, the model will limit the angle to 1.44 rad.

7.3.2 Body-level motion prediction

Moreover, joint-level trackers do not take into account the subject's body posture as a whole, nor do they consider human-specific motion knowledge. Because all the joints are linked through a kinematic chain, it is possible to predict part of the motion of a single joint from the others. Also, even if a very large number of kinetic states are permitted by the model, most of them are unlikely to happen during normal human motion. The activity models described previously can be used for this purpose.

In general, a new state can be predicted from the previous state given an activity model. Firstly, the previous state is parametrised in the activity model space. That is, an optimal weighted blending of two KPs is estimated. In the case of a well-defined activity, such as running on a treadmill at a given speed for a given individual, it is possible to use this optimal model for prediction. The next state of the activity model is then generated. However, in most scenarios this may not be the case. Therefore, the angular velocities from the optimal activity model are used in conjunction with the static angles from the previous state to predict the future state. In essence, this leads to motion prediction rather than posture prediction.

7.4 Application and results

To evaluate the proposed modelling framework, two aspects of the technique are tested independently. The activity modelling was first evaluated with a running sequence. The binary blob was used to demonstrate the ability of the activity model to deal with occlusions. The coloured blob method was then assessed on a tennis sequence without activity modelling.

7.4.1 Running stride analysis

In this experiment, running on a treadmill takes place in a laboratory setting with detailed close-up views. Running exhibits a repetitive motion with a limited range of variations, allowing for enhanced motion prediction. In other words, only a single state hypothesis needs to be made given the previous state.

A runner was recorded by a single camera positioned at his side. Motion tracking on a treadmill is illustrated in Figure 7.7. This figure illustrates some of the strengths and weaknesses of model-based tracking. For example, the relatively poor observation quality can be compensated by high-level prediction. In this case, the original image shows motion blur in the lower legs and the initial segmentation result is noisy. The runner's shadow is partly included in the foreground but is nevertheless completely disregarded during tracking. However, some ambiguities still remain in the final result (e.g. an occluded arm).

A set of features can then be generated from the tracked model:

Stride frequency, also referred to as cadence, is trivial to derive from the activity cycle length and the sampling frequency.

Ground contact time can be segmented based on the vertical position of the ankle.

Relative speed to the treadmill conveyor belt can then be computed using the foot-tip displacement during the contact time, as shown in Table 7.1.

Stride length is finally derived from the speed and the stride frequency.
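A minimal sketch of how these indices chain together, assuming per-frame ankle heights and foot-tip positions are available from the fitted model (the contact threshold and variable names are illustrative):

```python
import numpy as np

def stride_metrics(ankle_y, foot_x, fps, contact_thresh=0.03):
    """Derive stride indices from tracked foot trajectories.

    ankle_y: ankle height above the belt (m) per frame;
    foot_x: horizontal foot-tip position (m) per frame; fps: frame rate.
    """
    contact = ankle_y < contact_thresh          # ground-contact segmentation
    # Stride frequency (cadence) from the number of touch-down events.
    touch_downs = np.flatnonzero(~contact[:-1] & contact[1:])
    duration = len(ankle_y) / fps
    stride_freq = len(touch_downs) / duration   # strides per second
    # Belt speed from foot-tip displacement while in contact.
    dx = np.diff(foot_x)
    speed = np.abs(dx[contact[:-1]]).mean() * fps
    # Stride length from speed and cadence.
    stride_len = speed / stride_freq
    return stride_freq, speed, stride_len
```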


Figure 7.7: Monocular tracking with activity model. Top down: original image frames, coarse binary segmentation, fitted 3D model reprojection and the underlying stick model.

The speed estimation results presented in Table 7.1 can be interpreted as follows. The measured error is a combination of two factors with opposing effects. At higher running speeds, the lower temporal resolution combined with lower spatial resolution due to motion blur will inevitably reduce the accuracy of the estimated speed. However, at low speed, frequent cross-occlusions of the legs with each other can affect the temporal accuracy of foot-ground contact detection. Furthermore, the running activity model was built using a relatively slow-speed model (10 km/h), therefore it makes sense for the tracker to perform better at similar speeds.

Activity   Treadmill speed   Estimated speed   Error
Walking    5 km/h            5.21 km/h         4.2%
Running    12 km/h           12.05 km/h        0.4%
Running    15 km/h           13.87 km/h        7.5%
Running    20 km/h           18.88 km/h        5.6%

Table 7.1: Treadmill speed estimation with model-based tracking.

7.4.2 Tennis stroke monitoring

The proposed tracking method was also applied to a tennis sequence. Unlike running on a treadmill, playing tennis exhibits a wide range of movements, which makes reliable prediction difficult. Therefore, the activity model would provide limited gain in motion prediction. The coloured blob method was instead used as an alternative to reduce the impact of self-occlusions.

From the results derived, several observations can be made from the images presented in Figure 7.8.

Firstly, the arms are well tracked by the coloured method even when positioned in front of the torso. On the other hand, it can be seen that the binary blob loses track of the arm and is unable to recover later. More specifically, it can be observed that the tracker matches the player's arm with the racket in the third frame, and her hand with the protuberance made by the tennis balls in her pocket in the fourth and fifth frames. As the arm is not distinguishable from the rest of the silhouette in these image frames, and the racket and tennis balls are not included in the superquadrics model, such results are in fact to be expected. The coloured blob does not exhibit any issue in this context, as the arm's colour is well distinguished from the player's T-shirt.

Secondly, the rotation of the player's hips and shoulders is clearly visible in the reprojected colour model and the underlying stick figure. Rotation around the vertical axis is usually difficult to capture, as discussed in [74]. This is due to the relative invariance of the appearance of the human body under small rotations around the vertical axis. The strength of the coloured blob in this respect lies mainly in the colouring of the arms. They are sufficiently different to allow for an accurate estimation of the shoulders' position, which enforces a better torso rotation.


Figure 7.8: Monocular tracking with colour-based matching. Top down: original image frames, coarse colour-based segmentation, fitted coloured 3D model reprojection, underlying stick model and fitted binary 3D model reprojection for comparison.

7.4.3 Noise impact on the tracking accuracy

The resilience of the tracker to adverse Signal-to-Noise Ratios (SNR) was then evaluated. To this end, synthetic noise was randomly added to the input colour-segmented images with varying sizes, colour and density. The noisy colour distribution is chosen according to the subject's colour distribution. Some background colour was also added randomly. The amount of noise was progressively increased in density and size, as illustrated in Figure 7.9. The level of noise is referred to as 's×n', where 's' is the maximal radius of the disks and 'n' the number of disks.

Figure 7.9: Examples of randomly added noise for tracking resilience evaluation. Left to right: original image, 1×1000, 4×1000, 2×4000 and 8×500.

Bootstrapping was performed on a noise-free image in order to observe the drift from the original posture. The tracked posture was then compared to the posture when tracked without noise. The results are shown in Figures 7.10 and 7.11.

[Figure: tracking difference (mm) against frame number for noise levels 1×1000, 2×1000 and 4×1000.]

Figure 7.10: Influence of moderate SNR in the reference image on the tracking accuracy: the tracker starts drifting before stabilising at an error of 60 to 150 mm.

It has been observed that for levels of noise under 4×1000, the tracker rapidly stabilises at an error of less than 150 mm, as illustrated in Figure 7.10. This residual error can be interpreted as a slight misalignment due to minor observation errors, but the algorithm keeps track of the subject's motion. For higher levels of noise, i.e. 2×4000 and 8×500, the drift becomes more significant, as illustrated in Figure 7.11. However, it can also be observed that in both cases the drift decreases later on, demonstrating the error recovery capabilities of the proposed algorithm.

Noise    1×500   1×1000   2×1000   4×500   4×1000   2×4000   8×500
Mean      76.3     94.7     74.9    108.5    135.0    218.5   253.1
Min.      67.7     67.5     62.3     84.8    120.7    172.9   212.1
Max.      92.2    129.0     96.1    140.1    145.3    295.4   303.9
Final     67.7     91.5     66.5     85.2    145.3    196.1   226.4

Table 7.2: Tracking error under increasing levels of noise. All values are expressed in millimetres and computed from the first six frames of the sequence.

[Figure: tracking difference (mm) against frame number for noise levels 2×4000 and 8×500.]

Figure 7.11: Influence of poor SNR in the reference image on tracking accuracy: in both examples the tracker loses track before partially recovering.

Table 7.2 summarises the influence of the different levels of noise considered on tracking accuracy.

7.5 Summary and conclusion

7.5.1 Summary

In order to provide detailed biomechanical measurements of the athletes, a model-based approach is introduced in this chapter. The use of a biomechanical model for low-level motion prediction, combined with activity detection for long-term estimation, is an efficient means of human motion tracking.

In this chapter, the musculo-skeletal properties of the human body are modelled with a stick model providing joint angles. In order to fit this model to a real image, tapered superellipsoids bound to the original stick model are used. It is possible to match the skeleton model to a binary segmented image using a gradient descent method.

7.5.2 Conclusion

It has been shown that the proposed method works reasonably well when a relatively good initialisation is provided to the algorithm.

An extension of this method to multiple colour-segmented blobs has been proposed, which has been shown to resolve some of the self-occlusion problems encountered by the binary silhouettes.

An activity or a set of activities can be modelled by a graph of KPs. In this chapter, the size of the activity graph is kept to a minimum by limiting the number of KPs used and by factorising the cyclic activities. By fitting the current postural state onto the activity model, it is possible to infer which movements are the most likely to happen in the next frame. The proposed method handles noise relatively well.

Whilst such a model can provide an overall representation of the subject's motion, parameters such as the ground reaction force cannot be inferred from vision sensors alone. Therefore, other ambient or wearable sensing modalities need to be integrated to permit a more detailed representation of the biomechanical indices.


Chapter 8

Integration of Ambient and Wearable Sensors

Whilst acknowledging the technical merit of the vision-based tracking approaches described in the previous chapters, the inherent limitations of such approaches have also been illustrated. In general, the method provides a pervasive means of cross-court tracking of the player, and certain biomechanical indices can be inferred from the vision data. For detailed biomechanical analysis, however, the use of vision sensors alone may not be enough to capture rapidly varying temporal signals. For example, a shutter speed of 1/500 to 1/1000 second is recommended to accurately track certain biomechanical indices of a tennis player [26], which is not feasible with a low-power VSN node. Furthermore, certain features such as the Ground Reaction Force (GRF) introduced in Chapter 2 are not observable by a vision-based approach. This information, however, can be readily derived from wearable sensors.

The purpose of this chapter is to introduce the use of wearable sensing to overcome some of these difficulties. Particular emphasis is placed on the use of inertial sensors such as Microelectromechanical Systems (MEMS) accelerometers and gyroscopes, for their miniaturised size and ease of integration. Sensor fusion techniques are employed to capture complementary features of both ambient and wearable sensing, and to enhance the overall system reliability. In this way, vision-based ambient sensing provides an overview of the player and his/her localisation across the court, whereas wearable sensing provides the corresponding high-fidelity biomechanical and motion information.

In this chapter, a quick overview of the wearable sensing modalities is provided, followed by commonly employed sensor fusion techniques. An experiment is then described, which combines ambient vision sensing with wearable inertial sensors for detailed tennis stroke analysis. To this end, feature-level and decision-level fusion methods are used¹.

¹A similar experiment applied in a different context has been published in: Ambient and Wearable Sensor Fusion for Activity Recognition in Healthcare Monitoring Systems [177].


8.1 Wearable sensing

Vision-based sensing provides a rich source of information but is not always accurate enough for certain kinematic features. This can be caused by insufficient frame rate or image resolution. Thus far, typical uses of wearable sensors include general activity monitoring [137], medical studies [16] and sport performance analysis [175]. Two main classes of wearable sensors have been popular: kinetic sensors (motion, position and forces) and physiological sensors.

8.1.1 Motion and position sensors

For motion sensors, MEMS-based inertial sensors such as accelerometers [19, 137, 149] and gyroscopes [19] are widely used for human motion capture. Such Inertial Measurement Units (IMUs) allow a subject-centric description of the motion. In contrast to vision sensors, motion capture from IMUs is not dependent on external conditions such as lighting or occluding objects. However, IMUs lack absolute localisation in the environment. Commonly, accelerometers only measure the three-dimensional (3D) acceleration in their own coordinate system, the orientation of which is often unknown.

Moreover, simply integrating the acceleration signal twice to evaluate the position of the sensor is often not practical, as signal noise can cause considerable drift over time. Alternatively, an accelerometer can be used to measure the angle with respect to the vertical (sometimes referred to as tilt or lean) when it is stationary. In this case, the only acceleration is that enforced by the Earth's gravity².
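A minimal sketch of this tilt estimate for a stationary 3-axis accelerometer, assuming the sensor's z axis nominally points upwards:

```python
import math

def tilt_from_gravity(ax, ay, az):
    """Angle (rad) between the sensor's z axis and the vertical.

    ax, ay, az: static accelerometer readings in g; at rest their
    magnitude is ~1 g, so their direction gives the gravity vector.
    """
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    return math.acos(az / norm)
```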

Gyroscopes, on the other hand, measure angular velocities. They lack an absolute angle measure, and the derived signal must be integrated to provide an angular value. Therefore, just as with accelerometers, angles calculated from a gyroscope will be subject to drift over time.

Digital magnetometers [19] provide an absolute angle with respect to the direction of the Earth's magnetic field, but their output is susceptible to a wide range of noise generated by ferromagnetic objects and surrounding electric and electronic devices. Furthermore, the magnetic field intensity at the surface of the Earth is subject to local variations. Kalman filtering is often used to derive a more stable signal, and a combined use of accelerometers, gyroscopes and magnetometers can also solve some of the issues mentioned above.

8.1.2 Force and pressure sensors

Strain [76] and bend [118] sensors can also be placed on the user's joints to monitor their motion. They are typically based on materials whose electromagnetic properties (resistance or capacitance) change in response to stress. Optic fibre sensors are also popular for practical applications. These sensors are often fixed onto garments or woven into the textile, making them ubiquitous.

²Strictly speaking, MEMS accelerometers do not actually measure an absolute acceleration, but rather the force applied by a cradle to a spring-mounted mass.

[Figure: 3D acceleration (g) against time (ms) during a jump, showing the sideways, front/back, vertical and overall components together with the gravity level; phases 1 to 4 are marked.]

Figure 8.1: The ear-worn activity recognition (e-AR) sensor [137] developed at our laboratory and the signal recorded during a jump. Four main phases can easily be identified from the overall acceleration. 1: the subject first stands still and the GRF exactly counterbalances the effect of gravity. 2: the subject pushes up. 3: the subject is airborne, leading to a null acceleration as measured by the sensor. 4: the subject finally lands abruptly.

Foot pressure sensors such as the Parotec system [180, 247] and force sensors [195] have also been used extensively in gait analysis for both sport and healthcare. They are typically based upon piezo-electric materials. They can provide detailed information about the GRF while the subject is performing an activity such as walking or running. The trajectory of the centre of pressure during a walking cycle provides valuable information on a runner's stride. Example results derived from the Parotec system have been illustrated earlier in Chapter 2, Figure 2.6.

8.1.3 Physiological sensors

Physiological parameters can also be acquired in real-time by wearable sensors. These include, for example, electrocardiogram (ECG) [134], pulse oximetry (heart rate and blood oxygen saturation) from photoplethysmograph (PPG) sensors [232] and body temperature [149]. Respiration can be derived from chest motion [16] or using a more intrusive oxygen uptake (VO2) sensor [50, 173].

8.1.4 Other wearable sensing modalities

A large number of specialised wearable sensors have been designed for more specific applications. These include, for example, body-mounted cameras [63] for capturing the subject's facial expression, light sensors such as in the eWatch [149] for providing information about the type of environment the user is in, and sound sensors [149, 217] to gather evidence of social activity.

8.2 Sensor fusion: the best of both worlds

For sensor network deployment, several sensor readings can be combined to derive a final decision. The use of different sensors to capture similar information is an effective way of enhancing the robustness of the system against sensor failure or noise. Positioning the sensors at different locations can also prevent localised interference and provide more balanced information. Ultimately, using several sensing modalities may overcome the inherent limitations of each sensor. In all these cases, information flows need to be combined in a process referred to as sensor fusion, which allows more accurate and robust final decision making [87, 150, 245].

In general, sensor fusion can be performed at several levels. If sensors provide comparable data, fusion can be performed directly on the raw signal. This technique is usually referred to as data-level sensor fusion. Features derived from the sensor network can be gathered to feed into a classifier, a process called feature-level sensor fusion. Decision-level fusion consists of several independent classifiers drawing conclusions from the general consensus. Finally, it is possible to use a combination of these fusion methods.

8.2.1 Sensor fusion levels

Sensor fusion is possible at multiple levels, as shown in Figure 8.2: at the hardware, raw data, feature, or decision levels. Sensor fusion is therefore closely related to classification. In many cases, the original classification methods are simply modified to handle input from different sensors. The main issue involved in sensor fusion is the exponential increase of sensor data and associated processing costs.


Figure 8.2: Different levels of sensor fusion (from Body Sensor Networks [245]).

The different levels of fusion can be illustrated by a simple example consisting of a pair of cameras used to detect the presence of a person. Both cameras are assumed to have a similar coverage of the scene. At the data level, it is possible to fuse both images into one single image. This system will be more robust to hardware failure and noise in the video streams. Another solution is to apply the fusion at the feature level. In this case, each camera will perform its own background segmentation. The segmented images can then be fused together. This system will be more robust to noise and artefacts. The last solution consists of fusing the final decisions. In this case, each camera will perform its own background segmentation and evaluate the probability that a subject is in the room. These probabilities can then be compared to provide a final decision.

More advanced fusion techniques are possible at the data or feature level if prior knowledge of the system is available. In this example, considering that both cameras are calibrated, it is possible to compute a depth image of the scene by dense stereo matching (fusion at the data level) or to identify and match features in 3D space (fusion at the feature level). In that sense, the visual hulls method employed in the previous chapters can be considered as a sensor fusion process.

8.2.1.1 Hardware level fusion

At the hardware level, sensor fusion can be achieved by using thresholds. The work presented in [127] relies on tilt switches to activate an accelerometer-based module only when significant events are occurring.

8.2.1.2 Data and feature level fusion

At the data (signal) and feature levels, dimensionality reduction such as Principal Component Analysis (PCA) is often deployed before further pattern classification takes place. At this level of sensor fusion, modelling methods such as k-Nearest Neighbour (KNN) [242], Gaussian Mixture Models (GMMs) [177], Bayesian Networks (BNs) [204], neural networks [111], Kalman filters [111] or Hidden Markov Models (HMMs) [15, 27] are common.

BNs are essentially directed graphs, whose nodes model the states and whose links represent the conditional dependencies between these states. BNs are used by propagating available evidence into the graph and extracting beliefs from the result. One of the issues of BNs is computing the probabilities associated with the conditional dependencies. This can be performed by automated training based on a given training dataset, or through expert knowledge. Singhal et al. [204] developed a multilevel BN to detect the main subject in unconstrained images. The image is first segmented into continuous regions. Various image features are then derived from each region. Every feature generation module can be considered as an individual "virtual sensor", including, for example, the aspect ratio, skin colour and centrality. Therefore, a combination of these partially independent features is, in essence, a sensor fusion problem.

HMMs have shown good capabilities for dynamic pattern recognition. Bernardin et al. [27] relied on the posture derived from bend sensors and contact information captured by an array of pressure sensors embedded in a glove to classify various hand grasping methods. During training, the HMM can learn the importance of each sensor for every type of grasp action. The influence model [13] relies on a reduced number of states with respect to the HMM, and is therefore particularly efficient for deploying large sensor networks [60]. The relative independence of HMMs with respect to time is a major advantage over other commonly employed techniques. Indeed, the same action performed at different speeds will be considered as similar by an HMM, thus making the technique suitable for behaviour profiling [15].

8.2.1.3 Decision level fusion

At the decision level, methods usually rely on a voting scheme or on some system-specific assumptions. For example, Chen et al. [44] relied on video and sound to detect interactions between people. The video and audio modules derive confidence levels from their respective measures Cv and Ca independently. Fusion is then applied at the decision level by a weighted sum: Cd = αCv + (1 − α)Ca. Typically, when the decision of a particular sensor is known to be unreliable over a specific range of values, priority will be given to another sensor. For example, Tabar et al. [213] rely on a wearable sensor to monitor activities. If a fall is detected, the vision sensors are switched on to confirm the fall.

Some researchers have also proposed a higher level of data fusion at the performance level [208]. This fusion level aims at assessing how well the result of sensor fusion matches the expected result. The result of the performance-level fusion can then be used as feedback to the lower-level fusion methods for appropriate adaptation.

8.2.1.4 Multilevel fusion

In practice, it is also possible to perform sensor fusion at several levels simultaneously. An example is presented in [150]. This work aims at identifying targets from two kinds of radar images. Several types of classifiers, such as neural network, maximum likelihood, nearest neighbour or piecewise sequential, were used in parallel and then combined using a voting scheme. An overview of this system is shown in Figure 8.3.

8.2.2 Expectation-Maximisation and Gaussian Mixture Models

A GMM can be used to model the Probability Density Function (PDF) of a dataset. It is composed of a linear combination of Gaussian PDF components, as shown in Figure 8.4. In a dataset composed of several clusters, each of the clusters can be modelled by a Gaussian mixture, as illustrated in Figure 8.5.



Figure 8.3: Multi-level sensor fusion with feedback loops as proposed in [150].

Figure 8.4: Example of a dataset fitted by a two-component GMM. Blue histogram: original dataset. Plain red curve: GMM. Dotted red lines: Gaussian components.

The Expectation-Maximisation (EM) algorithm is a method to compute the maximum likelihood estimates of a PDF on a dataset [56]. The algorithm relies on the iterative application of expectation and maximisation stages until convergence. The expectation stage computes an estimate of the current model parameters, given the dataset and the current estimate. The maximisation stage computes the maximum likelihood parameters given the current estimates on the dataset. In the case of a GMM, which is differentiable, the method is relatively easy to implement.

Given a dataset \{x_{1..N}\} to be classified into C classes, and assuming that the conditional PDF p(X = x) for each of those classes is a Gaussian G, the algorithm [56] aims at finding the best fit of \lambda_t = \{\mu_{1..C}(t), \Sigma_{1..C}(t), \omega_{1..C}(t)\}, where \mu_j(t), \Sigma_j(t) and \omega_j(t) are respectively the mean, the covariance and the weight of mixture j at iteration t. The Gaussian G_j is evaluated by:

G_j(x) = \frac{1}{(2\pi)^{m/2}\,\|\Sigma_j\|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)} \qquad (8.1)

Let c_j(t) = p(\omega_j) at iteration t. The expectation and maximisation steps are performed iteratively until convergence. The expectation (E) step for each point x_i and each class j, based on Bayes' law, is the following:

p(\omega_j \mid x_i, \lambda_t) = \frac{p(x_i \mid \omega_j, \mu_j(t), \Sigma_j(t))\, c_j(t)}{\sum_{k=1}^{C} p(x_i \mid \omega_k, \mu_k(t), \Sigma_k(t))\, c_k(t)} = \frac{G_j(x_i)\, c_j(t)}{\sum_{k=1}^{C} G_k(x_i)\, c_k(t)} \qquad (8.2)

The likelihood maximisation (M) step is then:

\mu_j(t+1) = \frac{\sum_{i=1}^{N} p(\omega_j \mid x_i, \lambda_t)\, x_i}{\sum_{i=1}^{N} p(\omega_j \mid x_i, \lambda_t)} \qquad (8.3)

\Sigma_j(t+1) = \frac{\sum_{i=1}^{N} p(\omega_j \mid x_i, \lambda_t)\,(x_i - \mu_j(t+1))(x_i - \mu_j(t+1))^T}{\sum_{i=1}^{N} p(\omega_j \mid x_i, \lambda_t)} \qquad (8.4)

c_j(t+1) = \frac{1}{N} \sum_{i=1}^{N} p(\omega_j \mid x_i, \lambda_t) \qquad (8.5)

The EM algorithm can be used either with a supervised approach (in which case a labelled training dataset is required) or in an unsupervised fashion (in which case the data clusters are not associated with a semantic class).

The main issue associated with EM is the initialisation of the GMM parameters, as well as the choice of the number of classes. If the GMM is not well initialised, the system may converge towards a non-optimal solution (corresponding to a local maximum).
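A minimal sketch of one EM iteration for a GMM, written directly from Equations 8.2 to 8.5; initialisation and the convergence test are omitted, and scipy's multivariate normal PDF stands in for Equation 8.1:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, mus, sigmas, cs):
    """One EM iteration for a GMM (Equations 8.2 to 8.5).

    X: (N, m) data; mus: list of C mean vectors; sigmas: list of C
    covariance matrices; cs: length-C array of mixture weights.
    """
    C = len(cs)
    # E step (Eq. 8.2): posterior responsibility of class j for each point.
    resp = np.array([cs[j] * multivariate_normal.pdf(X, mus[j], sigmas[j])
                     for j in range(C)]).T           # shape (N, C)
    resp /= resp.sum(axis=1, keepdims=True)
    # M step (Eqs. 8.3 to 8.5).
    for j in range(C):
        w = resp[:, j]
        mus[j] = w @ X / w.sum()                      # Eq. 8.3
        d = X - mus[j]
        sigmas[j] = (w[:, None] * d).T @ d / w.sum()  # Eq. 8.4
        cs[j] = w.mean()                              # Eq. 8.5
    return mus, sigmas, cs
```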

The K-means [154] algorithm is a simplified variation of the EM algorithm, where the classes are represented by their centres only. In other words, the expectation step consists of marking each point with the nearest class centre, whereas the maximisation step simply recalculates the class centres given the current points in each class.

Figure 8.5: Clustering using GMMs. Left to right: raw dataset, clusters fitted with GMMs (represented as ellipses) and space partitioned by the GMM parameters for classification.

8.3 Application and results

The purpose of this experiment is to introduce a robust framework for enhanced tennis stroke recognition by integrating inertial sensors with ambient blob-based vision sensors. Data obtained by each sensor independently is pre-processed for dimensionality reduction before the application of a Bayesian classifier. To assess the improved accuracy of the proposed method, the classifier was evaluated against a purposely built low-quality dataset.

A player was recorded whilst performing different "drills" in tennis training, which involve repeated patterns of play. Ball contacts were first detected using a wearable sensor before the actual classification of the stroke type was performed.

8.3.1 System design

The vision-based sensor employed in this experiment is far from optimal. Figure 8.6 illustrates the video images captured during the experiment. The recorded image quality appears to be below average for a number of reasons:

• incomplete coverage;

• relatively low resolution;

• generally noisy image and blurry edges;

• frequent duplicated and dropped frames;

• unstable auto colour balance causing disturbance to the background modelling algorithm;

• compression artefacts;

• camera mounted on a net post and therefore subject to high-intensity vibrations when the ball touches the net.

Figure 8.6: Original video examples. Left: strong motion blur during a serve with the arm and racket invisible and generally poor chrominance information. Right: camera motion blur due to a ball contact with the net.

The tennis player was fitted with two inertial sensors. She was asked to wear an e-AR sensor, depicted in Figure 8.1 and developed by Lo et al. [137, 245]. It uses a Nordic nRF24E1 chipset with built-in 2.4 GHz RF transceiver, memory, and an analogue-digital converter to retrieve data from a 3-axis accelerometer. For the purpose of this experiment, the sensor was programmed to transmit the data continuously to a base station. The other sensor was mounted on the inside of the racket triangle, as illustrated in Figure 8.7.

Figure 8.7: Location of the wireless inertial sensor on the tennis racket (the sensor's x, y and z axes are indicated).

The acceleration recorded from the sensor mounted on the racket exhibits specific patterns, as illustrated in Figure 8.8. However, due to the very high intensity of the motion, the accelerometer tends to saturate during the shots. For example, Mitchell et al. [156] produced a graph showing a change of horizontal velocity at the centre of the racket of approximately 20.1 m/s in just 23 ms during a serve. This represents an average acceleration of 910 m/s², i.e. over 92 g, whereas the employed sensor saturates at 3 g.

Figure 8.8: Raw acceleration captured by an accelerometer mounted on a tennis racket for four different strokes (forehand, backhand, serve and volley; each panel plots the x, y and z acceleration in g, with the sensor's saturation level marked). The signal patterns are clearly distinct even with an obvious signal saturation and a relatively low sampling rate, although the plots also illustrate the relatively poor signal quality available for deriving such indices.


8.3.2 Ball contact detection

The shots were first detected in the sequences before detailed stroke classification was performed; the racket sensor alone was used for this purpose. As previously stated, the accelerometer saturates during the shots due to the high acceleration imposed. Therefore, sensor saturation over a period longer than 50 ms is a robust indicator for shot detection. A total of 122 shots were manually annotated and tested against the automated detection algorithm. The detection rates are summarised in Table 8.1.

Stroke            Played   Detection rate
Forehand          30       96.9%
Backhand          26       92.3%
Serve             17       100.0%
Forehand volley   16       93.8%
Backhand volley   14       50.0%
Dribbles          7        100.0%
Forehand throw    12       100.0%
Overall           122      90.1%
False positives   1        0.8%

Table 8.1: Ball contact detection using the racket inertial sensor. The volleys cause issues due to their relatively low motion intensity. The false positives are due to the player dribbling with the ball.
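The detector itself is simple enough to sketch. The fragment below flags a shot whenever any accelerometer axis stays at (or very near) its saturation limit for longer than 50 ms; the 50 Hz sampling rate and the 2.9 g saturation margin are illustrative assumptions rather than values taken from the experiment.

```python
# Saturation-based ball-contact detection: a run of saturated samples longer
# than 50 ms on any axis is treated as a shot.
import numpy as np

FS_HZ = 50                              # assumed accelerometer sampling rate
MIN_SAMPLES = int(0.050 * FS_HZ) + 1    # shortest run spanning more than 50 ms

def detect_shots(acc_g, sat_level=2.9):
    """acc_g: (N, 3) acceleration in g. Returns start indices of detected shots."""
    saturated = np.any(np.abs(acc_g) >= sat_level, axis=1)
    shots, run_start = [], None
    for i, flag in enumerate(saturated):
        if flag and run_start is None:
            run_start = i                      # a saturated run begins
        elif not flag and run_start is not None:
            if i - run_start >= MIN_SAMPLES:   # run exceeded 50 ms: count a shot
                shots.append(run_start)
            run_start = None
    if run_start is not None and len(saturated) - run_start >= MIN_SAMPLES:
        shots.append(run_start)                # run reaching the end of the record
    return shots
```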

8.3.3 Feature extraction

In order to fuse the high-dimensional input data efficiently, features were first selected from each of the sensing modalities. The average accelerations over the three axes during the shot were computed from the racket inertial sensor. This appears to be sufficient given the large differences in the raw signals illustrated in Figure 8.8.

The vision-based blob sensor exhibits strong artefacts, so a more subtle approach was developed to generate reliable features. A counter white balance filter was first applied, followed by a regular statistical binary segmentation and morphological noise filtering. The largest blob in the image was then selected and assumed to represent the player. A bounding box was fitted around the blob and a set of features was derived from it. The enhanced pseudo-skeleton algorithm originally proposed by [28] was then used to derive tennis-specific features.
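A compressed sketch of this vision feature pipeline is given below, with OpenCV used as a stand-in implementation. The grey-world gain normalisation approximates the counter white balance filter, a precomputed background image stands in for the statistical background model, and the threshold and kernel size are illustrative assumptions.

```python
# Blob-based feature extraction: colour-balance the frame, segment it against
# a background model, clean the mask morphologically, and keep the largest
# blob as the player.
import cv2
import numpy as np

def blob_features(frame_bgr, background_bgr, thresh=40):
    # Counter the camera's unstable auto colour balance (grey-world gains).
    means = frame_bgr.reshape(-1, 3).mean(axis=0)
    balanced = np.clip(frame_bgr * (means.mean() / means), 0, 255).astype(np.uint8)
    # Binary segmentation against the background model.
    diff = cv2.absdiff(balanced, background_bgr).max(axis=2)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    # Morphological noise filtering.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    # Select the largest blob, assumed to represent the player.
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n < 2:
        return None                            # nothing but background found
    player = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    x, y, w, h, _ = stats[player]
    return {"bbox": (x, y, w, h), "aspect": w / h}
```

The pseudo-skeleton features (head position and relative racket position, listed in Table 8.2) would then be derived from this mask and bounding box.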

A complete list of the features used for the classifier is summarised in Table 8.2. Sensor fusion is performed using a Gaussian Bayes EM classifier based on the wearable and the blob sensor data. A GMM is used to model each activity class. The Bayes Net Toolkit (BNT) [158] was used for the implementation of the classifier, and six classes were modelled to describe different tennis strokes: forehand, backhand, serve, forehand volley, backhand volley and hand-thrown forehand3.

Sensor     Feature                       Dimension
Inertial   Racket average acceleration   3
Vision     Pseudo-skeleton head          2
Vision     Relative racket position      1

Table 8.2: Features used in classification.

For each of the classes considered, three quarters of the data were used for training the inference system and the rest for validation. For training, the iterative EM algorithm was used to compute the maximum-likelihood fit [56]. Once the model has been computed through EM, the remaining data are used to evaluate the accuracy of the classifier based on the marginal probability of every activity. The class with the highest probability was chosen as the final classification.
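A minimal sketch of this per-class Gaussian Bayes classifier is shown below, with scikit-learn's GaussianMixture standing in for the Bayes Net Toolkit actually used. The six stroke labels come from the text; the feature dimensionality, component count and synthetic data are illustrative assumptions.

```python
# One GMM per stroke class, fitted by EM; classification picks the class whose
# model gives the highest marginal (log-)likelihood, assuming a flat prior.
import numpy as np
from sklearn.mixture import GaussianMixture

STROKES = ["forehand", "backhand", "serve",
           "forehand volley", "backhand volley", "forehand throw"]

def train(features_by_class, n_components=1):
    """Fit one GMM per stroke class on its training feature vectors."""
    return {c: GaussianMixture(n_components=n_components).fit(X)
            for c, X in features_by_class.items()}

def classify(models, x):
    """Return the class whose GMM scores the feature vector x highest."""
    return max(models, key=lambda c: models[c].score_samples(x[None])[0])

# Illustrative usage with random 6-D features (3 inertial + 3 vision, Table 8.2).
rng = np.random.default_rng(0)
training = {c: rng.normal(i, 1.0, size=(30, 6)) for i, c in enumerate(STROKES)}
models = train(training)
print(classify(models, rng.normal(2.0, 1.0, size=6)))
```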

8.3.4 Feature-level sensor fusion

Once a ball contact has been detected, it is used to classify the actual stroke played. The classification results obtained with the proposed method, compared with those obtained using the vision sensor alone, are presented in Table 8.3. A detailed analysis of inter-class misclassification, described by confusion matrices (with and without sensor fusion), is provided in Figure 8.9.

Stroke            Inertial alone   Vision alone   Vision + Inertial   Improvement (% point)
Forehand          50.0%            80.0%          90.0%               10.0%
Backhand          72.7%            54.5%          90.9%               36.4%
Serve             71.4%            85.7%          100.0%              14.3%
Forehand volley   66.7%            16.7%          33.3%               16.6%
Backhand volley   100.0%           0.0%           50.0%               50.0%
Forehand throw    20.0%            20.0%          60.0%               40.0%
Overall4          61.0%            53.7%          78.1%               24.4%

Table 8.3: Comparison of stroke classification rates between the vision sensor alone and the proposed combined system.

From the results shown in Table 8.3, it is evident that stroke classification accuracy is improved significantly by combining wearable sensing with ambient sensing. The confusion matrices shown in Figure 8.9 demonstrate that the two major misclassifications (regular backhand against backhand volley, and regular forehand against hand-thrown forehand) are dramatically reduced. The latter example is particularly interesting: each sensing modality alone classifies the hand-thrown forehand correctly at only 20%, but the fused result increases three-fold to 60%. It is not surprising that the serve is correctly classified at 100% using the proposed scheme. However, the integration of the wearable sensor does not resolve the vision sensor's confusion between the two types of volley, which are well classified by the wearable sensor alone, suggesting an inherent drawback of the vision sensors.

3 This is not, technically speaking, a specific stroke, but it was detected as such.

4 The overall classification rate is defined as the ratio of the total number of correctly classified strokes to the total number of strokes; it is not the average of the individual classification rates for each stroke.

Wearable sensor alone (counts):
      F   B   S  FV  BV  FT
F     5   4   0   0   0   1
B     2   8   0   1   0   0
S     2   0   5   0   0   0
FV    0   0   0   4   1   1
BV    0   0   0   0   2   0
FT    3   0   1   0   0   1

Wearable sensor alone (normalised):
      F     B     S     FV    BV    FT
F     0.50  0.40  0.00  0.00  0.00  0.10
B     0.18  0.73  0.00  0.09  0.00  0.00
S     0.29  0.00  0.71  0.00  0.00  0.00
FV    0.00  0.00  0.00  0.67  0.17  0.17
BV    0.00  0.00  0.00  0.00  1.00  0.00
FT    0.60  0.00  0.20  0.00  0.00  0.20

Vision sensor alone (counts):
      F   B   S  FV  BV  FT
F     8   1   1   0   0   0
B     1   6   2   1   0   1
S     1   0   6   0   0   0
FV    1   0   0   1   3   1
BV    0   1   0   1   0   0
FT    3   1   0   0   0   1

Vision sensor alone (normalised):
      F     B     S     FV    BV    FT
F     0.80  0.10  0.10  0.00  0.00  0.00
B     0.09  0.55  0.18  0.09  0.00  0.09
S     0.14  0.00  0.86  0.00  0.00  0.00
FV    0.17  0.00  0.00  0.17  0.50  0.17
BV    0.00  0.50  0.00  0.50  0.00  0.00
FT    0.60  0.20  0.00  0.00  0.00  0.20

Sensor fusion (counts):
      F   B   S  FV  BV  FT
F     9   1   0   0   0   0
B     0  10   1   0   0   0
S     0   0   7   0   0   0
FV    0   0   0   2   3   1
BV    0   0   0   1   1   0
FT    2   0   0   0   0   3

Sensor fusion (normalised):
      F     B     S     FV    BV    FT
F     0.90  0.10  0.00  0.00  0.00  0.00
B     0.00  0.91  0.09  0.00  0.00  0.00
S     0.00  0.00  1.00  0.00  0.00  0.00
FV    0.00  0.00  0.00  0.33  0.50  0.17
BV    0.00  0.00  0.00  0.50  0.50  0.00
FT    0.40  0.00  0.00  0.00  0.00  0.60

Figure 8.9: The tennis stroke confusion matrices, showing how the fusion algorithm improves the classification accuracy. Top to bottom: wearable sensor alone, vision sensor alone, and sensor fusion (counts followed by the normalised matrix in each case). The two types of volley remain the main source of misclassification (off-diagonal elements) after sensor fusion.

8.3.5 Decision-level and hybrid sensor fusion

Due to the radically different nature of the two sensing modalities employed in this experiment, it appears that some strokes are much better classified using a single modality. This is, for example, the case for the forehand stroke, which is correctly classified at 80% by the blob sensor alone, but only at 50% by the inertial sensor. Feature fusion is therefore likely to be unnecessary, if not counter-productive, for this specific stroke. Indeed, if the worse sensor is intrinsically unable to describe the considered class, fusing the features simply adds noise during the learning phase of the algorithm, and therefore has an adverse effect on the final classification results. In this case, it may be more appropriate to fuse the information from both sensing modalities at the decision level. Several classifiers are trained independently for each sensor, and both feature streams are classified accordingly.

In order to demonstrate the practical value of decision-level fusion, KNN classifiers were introduced. As opposed to GMMs, KNN classifiers do not rely on an overall model of the classes, but take a rather local approach to classification. The KNN principle consists of classifying a test point by counting the number of samples from each class among the k nearest neighbours in the training dataset, hence the algorithm's name. For both the inertial and the vision sensors, the previously described GMM was complemented by two KNN classifiers, with k=3 and k=4. All the classifiers were then fused using a voting scheme, as illustrated in Figure 8.10 (left).

Figure 8.10: Voting schemes used for decision-level sensor fusion. Left: flat decision-level fusion, where the features extracted from each sensing modality (vision and inertial) feed a GMM classifier and two KNN classifiers (k=3, k=4), whose decisions are combined by a voting scheme. Right: hybrid fusion based on GMM only, where the per-sensor GMM decisions are combined with the decision of a GMM trained on the feature-level fused data.
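A minimal sketch of the flat voting scheme on the left of Figure 8.10 is given below, again with scikit-learn as a stand-in implementation. The per-class GMM wrapper, the tie-breaking behaviour and the feature arrays are illustrative assumptions.

```python
# Decision-level fusion: per modality, one GMM-based classifier and two KNN
# classifiers (k=3, k=4) each cast a vote; the majority label wins.
from collections import Counter
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier

class GMMClassifier:
    """One single-component GMM per class; predicts the best-scoring class."""
    def fit(self, X, y):
        y = np.asarray(y)
        self.models = {c: GaussianMixture(1).fit(X[y == c]) for c in np.unique(y)}
        return self

    def predict_one(self, x):
        return max(self.models,
                   key=lambda c: self.models[c].score_samples(x[None])[0])

def fused_decision(train_sets, test_vectors, y):
    """train_sets: one training array per modality; test_vectors: the matching
    test feature vector per modality. Returns the majority-vote label."""
    votes = []
    for X, x in zip(train_sets, test_vectors):
        votes.append(GMMClassifier().fit(X, y).predict_one(x))
        for k in (3, 4):
            knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
            votes.append(knn.predict(x[None])[0])
    return Counter(votes).most_common(1)[0][0]
```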

The results of this implementation are summarised in Table 8.4 (middle column). The decision-level fusion algorithm performs similarly to the original algorithm, with two notable exceptions. The forehand volley is classified much better with the decision-level fusion method, showing that the vision sensor's poor ability to classify this stroke no longer affects the results provided by the wearable sensor. On the other hand, the backhand volley is no longer classified correctly, demonstrating that the feature-level sensor fusion relied on inter-sensor complementarity for this stroke.

In order to compensate for this issue, a hybrid system is designed that fuses the decisions taken by the GMMs trained on each sensor individually, as well as the decision taken by the feature-level fusion method, as illustrated in Figure 8.10 (right). The results presented in the right column of Table 8.4 show a small overall improvement over the feature-level and decision-level fusion methods. A closer look at the results shows the consistency of this method over the different considered strokes: the lowest stroke classification rate is 40%.
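Under the same illustrative assumptions, the hybrid scheme on the right of Figure 8.10 can be sketched by adding a third GMM voter trained on the concatenated (feature-level fused) vectors, reusing the GMMClassifier from the previous sketch:

```python
# Hybrid fusion: vote over the two per-sensor GMM decisions and the decision
# of a GMM trained on the feature-level fused vectors.
from collections import Counter
import numpy as np

def hybrid_decision(X_vision, X_inertial, y, x_vision, x_inertial):
    X_fused = np.hstack([X_vision, X_inertial])          # feature-level fusion
    x_fused = np.concatenate([x_vision, x_inertial])
    votes = [
        GMMClassifier().fit(X_vision, y).predict_one(x_vision),
        GMMClassifier().fit(X_inertial, y).predict_one(x_inertial),
        GMMClassifier().fit(X_fused, y).predict_one(x_fused),
    ]
    return Counter(votes).most_common(1)[0][0]
```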


Stroke            Feature-level   Decision-level   Hybrid
                  (GMM)           (GMM+KNN)        (GMM)
Forehand          90.0%           90.0%            90.0%
Backhand          90.9%           90.9%            81.8%
Serve             100.0%          100.0%           100.0%
Forehand volley   33.3%           83.3%            66.7%
Backhand volley   50.0%           0.0%             50.0%
Forehand throw    60.0%           40.0%            40.0%
Overall4          78.1%           80.5%            82.9%

Table 8.4: Comparison of stroke classification rates between the feature-level, decision-level and hybrid sensor fusion methods.

8.4 Conclusion

In this chapter, the practical value of sensor fusion for improving classification accuracy has been illustrated. It clearly demonstrates that wearable sensors can be used to overcome some of the ambiguities in tennis stroke recognition caused by a non-optimal vision sensor installation.

With the proposed technique, a feature-level fusion method was first employed, which consists of training the GMMs on the wearable and vision sensor features together. It was found that the main advantage of feature-level sensor fusion is that sensor complementarity is taken into account by the EM algorithm during the learning phase.

Subsequently, a decision-level method was used, which consists of training several classifiers per sensing modality and relying on a voting scheme over the classifiers' independent decisions to provide a final decision. It appears that such a method is not able to exploit local sensor complementarity, but it prevents a poor sensor or classifier from misleading the EM algorithm during training.

Finally, a hybrid method fusing at both the feature and decision levels was proposed. This method exhibits more consistent results over the considered tennis strokes. Overall, all three methods perform similarly well, with the hybrid method showing slightly improved results.


Chapter 9

Conclusions and Future Work

9.1 Summary

In sport training, human biomechanical models provide an important means of describing the motion of athletes for optimising, inferring and rectifying subtle movement patterns objectively, leading not only to performance enhancement, but also to injury prevention. To this end, a variety of techniques have been developed over the years, and a major focus has been directed at the key biomechanical parameters providing the most relevant indication of potential performance gain. In order to apply such motion optimisation in situ during training, there is a need to measure appropriate biomechanical parameters pervasively. This means that the measurement devices must be accurate and unobtrusive, without affecting the normal performance of the sport itself. The current state-of-the-art systems for whole-body motion capture mainly consist of multiple infrared cameras tracking fiducials attached to the body. These markers make such systems obtrusive during training and unsuitable during competition. Therefore, one of the key challenges in motion capture is to derive similar parameters using a markerless system.

In this thesis, a robust real-time tennis player tracking system has been successfully implemented on a low-power Vision Sensor Network (VSN) device. Due to the limited computational resources available on this platform, a statistical background segmentation algorithm has been optimised through the use of pre-sampling and early detection of the Regions of Interest (ROI). The relatively high tracking accuracy makes it perfectly suitable for the task. To make the device easy to use for real-time field deployment, the embedded software comprises a micro web server, and the device is capable of self-calibration. The large amount of raw data derived from the VSN node has been reduced to key statistical indices.

The canonical view representation proposed in this thesis is a Point-Of-View (POV) independent representation of the player's posture. The resulting image can be processed using any chosen posture recognition technique, and it has been shown that such techniques perform consistently better when combined with canonical view reprojection. It is worth noting that the proposed method relies on implicit three-dimensional (3D) multi-view geometry, so some information that may be useful for subsequent tracking and classification processes is lost. Nevertheless, the strength of such an implicit approach to 3D object representation is its inherent robustness, which is important for practical applications.

In addition to the canonical view representation, a view-invariant descriptor for 3D motion tracking based on spherical harmonics has been introduced in this thesis. The spherical harmonic coefficients are accumulated on a frequency basis, producing a 1D signature that is compact, normalised in translation and invariant to rotation. It has been shown that the derived compact position- and rotation-invariant signatures can capture information about a tennis player's posture over time, thereby enabling further detailed posture analysis.

In this thesis, a human kinematic model based on biomechanical parameters has also been introduced. A reliable tracking algorithm has been developed that can fit the rendered 3D human model onto the reference silhouette image. An activity modelling algorithm has been proposed, which relies on a graphical model for human movement recognition. The combination of joint-level prediction based on Kalman filters and activity-level modelling has proved to be an efficient means of reducing ambiguities due to occlusions.

In summary, the techniques presented in this thesis have shown that it is possible to perform high-quality motion analysis without the need for expensive, bulky and, above all, intrusive sensors; a single set of miniaturised, unobtrusive VSN nodes can be used instead. The avoidance of intrusive markers makes the system suitable for a range of training environments, and the development of such unobtrusive techniques can lead to more ubiquitous sport performance analysis, allowing detailed sensing to be performed routinely in all training sessions. The advantage of the proposed system is twofold. First, it helps to derive long-term trends rather than episodic snapshots of an athlete's performance. Second, it allows the collection of data during competitions as well as training sessions, thus providing much more comprehensive performance data to coaches and players.

9.2 Conclusions

An accurate and unobtrusive means of motion monitoring is required in order to provide feedback on athletes' performance. Disturbance to the athlete must be kept to a minimum in order to collect meaningful data during competition; therefore, a markerless vision-based approach is proposed.

A real-time tennis player tracker has been successfully implemented on an embedded platform. The player's motion can be monitored in real time with an autonomous and mobile device, thus facilitating ubiquitous data collection.

The introduction of canonical novel views generated from multiple-view geometry reduces the POV-dependent bias. The proposed system shows an improvement in terms of tennis stroke recognition as compared to a monocular framework.

Compact 3D postural descriptors based on spherical harmonics have been introduced. The descriptors are position- and rotation-invariant, and carry enough postural information to enable fast stroke recognition.

A biomechanical human model for detailed posture analysis is combined with a coloured flesh model and activity models in order to deal with occlusions more efficiently. It has been shown that the model is resilient to noise.

Finally, because the vision-based approach has technical limitations in terms of coverage and temporal resolution, the value of fusing information from inertial-based wearable and vision-based sensors has been demonstrated.

9.3 Potential improvements and future work

Despite the achievements made in this thesis, there are a number of potential improvements that can be made in future studies. The main issue with the VSN implementation of the tennis tracking system introduced in Chapter 4 is its low spatial resolution when tracking across the court. Due to the relatively small angle between the ground plane and the camera's optical axis, the resolution along the long axis of the court is bound to be limited.

The canonical views proposed in Chapter 5 are indeed view-invariant, but they are not specifically facing the subject as originally intended. Furthermore, because the novel view point calculation is based on Principal Components Analysis (PCA), the view is normalised rather than truly invariant. As a consequence, local inaccuracies can have a global impact on the canonical image. The spherical harmonics presented in Chapter 6 can be used for tennis stroke detection, but their extreme compactness may make it difficult to detect subtle changes in motion or posture.

It is also worth noting that the classification algorithms for sensor fusion presented in Chapters 5, 6 and 8 are based on static states, and temporal information is not explicitly incorporated. The use of Hidden Markov Models (HMM), for example, would be expected to lead to better classification accuracies.

In this thesis, it has been shown that the combined use of wearable and ambient sensors can offer much in-depth information about the spatio-temporal dynamics and biomechanical indices of the player. Preliminary experiments have shown that an inertial sensor mounted on the racket or worn on the wrist can reliably detect the impact of the tennis ball and the transient motion of the racket. Indeed, by incorporating a wearable Inertial Measurement Unit (IMU) composed of an accelerometer and a gyroscope, it is possible to calculate the transient trajectory of the racket. Ambiguities due to integration drift of the IMU and the lack of an absolute horizontal orientation reference can be compensated for by tracking the player's hand using vision-based techniques. Fusing global position and orientation from the embedded vision sensor with such wearable IMUs can potentially lead to a versatile monitoring tool that is not currently available. It would not only allow patterns of play to be learnt and recognised automatically, but would also provide much more detailed information about the correlation between the player's position on the court and the type and intensity of the stroke played. Although this thesis is mainly focused on tennis player tracking, there is also a strong need for simultaneous tracking of the tennis ball, as this provides further insight into the final outcome of the shots. In this regard, some of the existing techniques for ball tracking can be incorporated into the current VSN framework. The main technical improvements that need to be addressed are increasing the frame rate of the vision sensor and its spatial resolution. Combining the player's position and motion on court, detailed transient trajectories of the racket, and ball tracking would open up a whole range of opportunities in tennis training, automatic player profiling and strategy analysis in the future.

Because the VSN nodes are low cost and easy to deploy, the proposed platform can potentially be installed all around the court, leading to better coverage of the court and a higher resolution. In this context, the court coverage would be distributed between the nodes, and the player would therefore not always be in the Field-of-View (FOV) of all the nodes simultaneously. In this case, a truly distributed processing paradigm could be implemented such that the processing power of all the nodes can be fully utilised. Indeed, it has been shown that the run-length encoded binary blob can be sent across the network at rates of over 10 Hz. Therefore, given the gain in resolution and the spare processing capabilities available, it might be possible to perform stroke recognition on-node. The current pipeline architecture is built on VSN nodes processing the player position independently and performing the actual information fusion at the very last stage. The processing pipeline is completely fixed and can therefore be considered only loosely distributed, as it does not react dynamically to changes in the observed scene. In order to implement the multi-stage processing method proposed above, there is a need to develop a more dynamic collaboration scheme between the nodes, because the tasks assigned to each node would change constantly depending on the player's position and other contextual information of the game. This means that light-weight communication protocols for real-time reconfiguration are required.

It is also worth noting that Image-Based Modelling and Rendering (IBMR) methods can be employed to generate photo-realistic scenes from novel view points, as a generalisation of the preliminary work by Kimura and Saito [117]. Interesting novel view points include a virtual camera facing the player and the actual player's POV. For this purpose, other IBMR algorithms may be used for model reconstruction, such as space carving. This method typically requires a larger number of cameras and is computationally much more demanding than Image-Based Visual Hulls (IBVH), but it does not require prior motion segmentation. This can reduce ambiguities due to low contrast within the image and enable the whole scene, rather than the player only, to be reconstructed. This would undoubtedly be of interest for broadcasting, but could also lead to some novel applications, such as research on the player's decision-making process based on his/her current perception of the game. Such analysis could potentially determine the extent to which cognitive/psychological factors contribute to the choice of a particular strategy.

Although this thesis is focused on sport monitoring, and more specifically on tennis player motion tracking, a large proportion of the proposed methods is applicable to other environments such as healthcare. Indeed, it has been shown that impaired gaits and postures are sometimes related to underlying diseases that have not yet been diagnosed. The best-known example is probably Parkinson's Disease (PD): it is now well acknowledged that patients affected by early PD exhibit an impaired gait [246]. For most diseases, an early diagnosis is highly desirable, as it is important to start treatment or intervention before irreversible damage is caused. Therefore, detailed gait and posture analysis using the model-based approach proposed in this thesis would be of high value for such applications. Furthermore, because the common denominator of all the methods proposed in this thesis is a preliminary binary segmentation stage, many privacy issues are naturally avoided. Indeed, whilst transmitting raw video sequences of athletes during their training sessions is not an issue, it is highly sensitive in home-monitoring applications. This makes the proposed techniques particularly suited to home monitoring environments.


Bibliography

[1] Addlesee, M. ORL active floor. IEEE Personal Communications 4, 5 (1997), 35–41.

[2] Adelson, E. H., and Bergen, J. R. Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America 2, 2 (1985), 284–299.

[3] Adelson, E. H., and Bergen, J. R. The plenoptic function and the elements of early vision. Computational Models of Visual Processing (1991).

[4] Agarwal, A., and Triggs, B. Monocular human motion capture with a mixture of regressors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Washington, DC, USA, 2005), IEEE Computer Society, p. 72.

[5] Agathangelou, D., Lo, B., Wang, L., and Yang, G.-Z. Self-configuring video-sensor networks. In Proceedings of the 3rd International Conference on Pervasive Computing (2005), pp. 29–32.

[6] Akdere, M., Cetintemel, U., Crispell, D., Jannotti, J., Mao, J., and Taubin, G. Data-centric visual sensor networks for 3D sensing. In 2nd International Conference on Geosensor Networks (GEO'06) (Boston, MA, USA, October 2006).

[7] Amat, J., Casals, A., and Frigola, M. Stereoscopic system for human body tracking in natural scenes. In Proceedings of the IEEE International Workshop on Modelling People (1999), pp. 70–76.

[8] Analog Devices. Blackfin BF537. http://www.analog.com/en/embedded-processing-dsp/blackfin/adsp-bf537/processors/product.html.

[9] Anandan, P. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision (IJCV) 2 (1989), 283–310.

[10] Ankerst, M., Kastenmuller, G., Kriegel, H.-P., and Seidl, T. 3D shape histograms for similarity search and classification in spatial databases. In 6th International Symposium on Advances in Spatial Databases (Hong Kong, China, 1999), R. Guting, D. Papadias, and F. Lochovsky, Eds., vol. 1651, Springer, pp. 207–228.


[11] Apewokin, S., Valentine, B., Bales, R., Wills, L., and Wills, S. Tracking multiple pedestrians in real-time using kinematics. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (June 2008), pp. 1–6.

[12] Arfken, G. B., Weber, H. J., and Weber, H.-J. Mathematical Methods for Physicists. Academic Press, 1995.

[13] Asavathiratham, C. The Influence Model: A Tractable Representation for the Dynamics of Networked Markov Chains. PhD thesis, MIT, 2000.

[14] Ascension. MotionStar. http://www.ascension-tech.com/products/motion-star.php.

[15] Atallah, L., ElHelw, M., Pansiot, J., Stoyanov, D., Wang, L., Lo, B., and Yang, G.-Z. Behaviour profiling with ambient and wearable sensing. In IFMBE Proceedings of the 4th International Workshop on Wearable and Implantable Body Sensor Networks 2007 (BSN) (Aachen, Germany, 2007), pp. 133–138.

[16] Atallah, L., Elsaify, A., Lo, B., Hopkinson, N., and Yang, G.-Z. Gaussian process prediction for cross channel consensus in body sensor networks. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks (BSN) (2008), pp. 165–168.

[17] Awad, M., Jiang, X., and Motai, Y. Incremental support vector machine framework for visual sensor networks. EURASIP Journal on Applied Signal Processing (2007), 222–222.

[18] Azarbayejani, A., Horowitz, B., and Pentland, A. Recursive estimation of structure and motion using relative orientation constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (1993), pp. 294–299.

[19] Bachmann, E. R., McGhee, R. B., Yun, X., and Zyda, M. J. Inertial and magnetic posture tracking for inserting humans into networked virtual environments. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology (VRST) (New York, NY, USA, 2001), ACM, pp. 9–16.

[20] Bahamonde, R. Biomechanics of the forehand stroke. ITF Coaching & Sport Science Review (CSSR) 24 (2001), 6–8.

[21] Baker, R. Gait analysis methods in rehabilitation. Journal of NeuroEngineering and Rehabilitation 3, 1 (2006), 4.

[22] Barnes, N., Edwards, N., Rose, D., and Garner, P. Lifestyle monitoring - technology for supported independence. Computing & Control Engineering Journal 9 (1998), 169–174.

[23] Barr, A. Superquadrics and angle-preserving transformations. IEEE Computer Graphics and Applications 1, 1 (1981), 11–23.

[24] Barron, J., Fleet, D., Beauchemin, S., and Burkitt, T. Performance of optical flow techniques. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (1992), pp. 236–242.


[25] Barsky, S., and Petrou, M. The shadow function for rough surfaces. Journal of Mathematical Imaging and Vision 23, 3 (2005), 281–295.

[26] Bartlett, R. Introduction to Sports Biomechanics: Analysing Human Movement Patterns. Routledge, 2006.

[27] Bernardin, K., Ogawara, K., Ikeuchi, K., and Dillmann, R. A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models. IEEE Transactions on Robotics 21, 1 (2005), 47–57.

[28] Bloom, T. Player tracking and stroke recognition in tennis video. In Proceedings of WDIC (2003), pp. 93–97.

[29] Bouguet, J.-Y. Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc.

[30] Bregler, C., and Malik, J. Tracking people with twists and exponential maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (1998), pp. 8–15.

[31] Broadhurst, A., and Cipolla, R. A statistical consistency check for the space carving algorithm. In Proceedings of the British Machine Vision Conference (BMVC) (2000), pp. 282–291.

[32] Brown, D. Close-range camera calibration. Photogrammetric Engineering 37, 8 (1971), 855–866.

[33] BTS. Smart-D. http://www.bts.it/eng/proser/smartd.htm.

[34] Burschka, D., Hager, G., Dodds, Z., Jagersand, M., Cobzas, D., and Yerex, K. Recent methods for image-based modeling and rendering. In IEEE Virtual Reality (March 2003), p. 299.

[35] Burt, P. J., and Adelson, E. H. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications COM-31, 4 (1983), 532–540.

[36] Byerly, W. E. An Elementary Treatise on Fourier's Series and Spherical, Cylindrical, and Ellipsoidal Harmonics. Dover Publications, 1893.

[37] Caillette, F., Galata, A., and Howard, T. Real-time 3-D human body tracking using learnt models of behaviour. Computer Vision and Image Understanding 109, 2 (2008), 112–125.

[38] Canny, J. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 8, 6 (1986), 679–698.

[39] Cao, Z.-Y., Ji, Z.-Z., and Hu, M.-Z. An image sensor node for wireless sensor networks. In International Conference on Information Technology: Coding and Computing (Los Alamitos, CA, USA, 2005), vol. 2, IEEE Computer Society, pp. 740–745.

[40] Castanedo, F., Patricio, M. A., García, J., and Molina, J. M. Extending surveillance systems capabilities using BDI cooperative sensor agents. In Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks (VSSN) (New York, NY, USA, 2006), ACM, pp. 131–138.

[41] Cham, J. Grad student etiquette. http://www.phdcomics.com/comics.php?f=47.

[42] Chapman, A. E. Biomechanical Analysis of Fundamental Human Movements. Human Kinetics, 2008.

[43] Chapman, A. E., and Medhurst, C. W. Cyclographic evidence of fatigue in sprinting. Journal of Human Movement Studies 7 (1981), 225–272.

[44] Chen, D., Malkin, R., and Yang, J. Multimodal detection of human interaction events in a nursing home environment. In Proceedings of the 6th International Conference on Multimodal Interfaces (ICMI) (New York, NY, USA, 2004), ACM Press, pp. 82–89.

[45] Chen, P., Ahammad, P., Boyer, C., Huang, S.-I., Lin, L., Lobaton, E., Meingast, M., Oh, S., Wang, S., Yan, P., Yang, A., Yeo, C., Chang, L.-C., Tygar, J., and Sastry, S. CITRIC: A low-bandwidth wireless camera network platform. Second ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC) (Sept. 2008), 1–10.

[46] Chen, S. E. QuickTime VR – an image-based approach to virtual environment navigation. Computer Graphics 29, Annual Conference Series (1995), 29–38.

[47] Cheng, Z., Devarajan, D., and Radke, R. J. Determining vision graphs for distributed camera networks using feature digests. EURASIP Journal on Applied Signal Processing (2007), 220–220.

[48] Chevalier, L., and Jaillet, F. Segmentation and superquadric modeling of 3D objects. In Proceedings of WSCG (2003), vol. 11.

[49] Christmas, W., Jaser, E., Messer, K., and Kittler, J. A multimedia system architecture for automatic annotation of sports videos. Computer Vision Systems 2626 (2003), 513–522.

[50] Cosmed. K4b2 VO2 sensor. http://www.cosmed.it/products.cfm?p=1&fi=1&a=1&cat=1.

[51] Culbertson, W. B., Malzbender, T., and Slabaugh, G. G. Generalized voxel coloring. In Workshop on Vision Algorithms (1999), pp. 100–115.

[52] Curless, B., and Levoy, M. A volumetric method for building complex models from range images. Computer Graphics 30, Annual Conference Series (1996), 303–312.

[53] Cutler, R., and Turk, M. View-based interpretation of real-time optical flow for gesture recognition. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition (Apr 1998), pp. 416–421.


[54] Davis, J. Hierarchical motion history images for recognizing human motion. In Proceedings of the IEEE Workshop on Detection and Recognition of Events in Video (2001), pp. 39–46.

[55] Delingette, H., Hebert, M., and Ikeuchi, K. A spherical representation for the recognition of curved objects. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (May 1993), pp. 103–112.

[56] Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (JRSSB) 39 (1977), 1–38.

[57] Devarajan, D., and Radke, R. J. Calibrating distributed camera networks using belief propagation. EURASIP Journal on Applied Signal Processing (2007), 221–221.

[58] Dillenseger, J.-L., Guillaume, H., and Patard, J.-J. Spherical harmonics based intrasubject 3-D kidney modeling/registration technique applied on partial information. IEEE Transactions on Biomedical Engineering 53 (2006), 2185–2193.

[59] Dion, D., Laurendeau, D., and Bergevin, R. Generalized cylinders extraction in a range image. In Proceedings of the International Conference on Recent Advances in 3-D Digital Imaging and Modeling (NRC) (Washington, DC, USA, 1997), IEEE Computer Society, p. 141.

[60] Dong, W., and Pentland, A. Multi-sensor data fusion using the influence model. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks (BSN) (Washington, DC, USA, 2006), IEEE Computer Society, pp. 72–75.

[61] Downes, I., Rad, L. B., and Aghajan, H. Development of a mote for wireless image sensor networks. In Proceedings of Cognitive Systems and Interactive Sensors (COGIS) (Paris, 2006).

[62] Dynamics, A. APAS. http://www.arielnet.com.

[63] el Kaliouby, R., Teeters, A., and Picard, R. An exploratory social-emotional prosthetic for autism spectrum disorders. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks (BSN) (April 2006), pp. 3–4.

[64] Elad, M., Tal, A., and Ar, S. Directed search in a 3D objects database using SVM. Tech. Rep. HPL-2000-20(R.1), HP Laboratories, 2000.

[65] Elgammal, A., Shet, V., Yacoob, Y., and Davis, L. Exemplar-based tracking and recognition of arm gestures. In Proceedings of the 3rd International Symposium on Image and Signal Processing and Analysis (ISPA 2003) (2003), pp. 656–661.

[66] Elliott, B. Biomechanics of the serve. ITF Coaching & Sport Science Review (CSSR) 24 (2001), 3–5.

[67] Elliott, B. Biomechanics and tennis. British Journal of Sports Medicine 40 (2006), 392–396.


[68] Elphel. Model 313 Camera. http://www3.elphel.com/.

[69] Ercan, A., El Gamal, A., and Guibas, L. Object tracking in the presence of occlusions via a camera network. In Proceedings of the International Conference on Information Processing in Sensor Networks (IPSN) (April 2007), pp. 509–518.

[70] Eveland, C., Konolige, K., and Bolles, R. C. Background modeling for segmentation of video-rate stereo sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Washington, DC, USA, 1998), IEEE Computer Society, p. 266.

[71] Feng, W.-C., Kaiser, E., Feng, W. C., and Baillif, M. L. Panoptes: scalable low-power video sensor networking technologies. ACM Transactions on Multimedia Computing, Communications and Applications 1, 2 (2005), 151–167.

[72] Franco, J.-S., and Boyer, E. Fusion of multiview silhouette cues using a space occupancy grid. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2005), vol. 2, pp. 1747–1753.

[73] Funkhouser, T., Min, P., Kazhdan, M., Chen, J., Halderman, A., Dobkin, D., and Jacobs, D. A search engine for 3D models. ACM Transactions on Graphics 22 (2003), 83–105.

[74] Gavrila, D. M., and Davis, L. S. 3-D model-based tracking of humans in action: a multi-view approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Washington, DC, USA, 1996), IEEE Computer Society, p. 73.

[75] Gemmell, J., Williams, L., Wood, K., Lueder, R., and Bell, G. Passive capture and ensuing issues for a personal lifetime store. In Proceedings of the 1st ACM Workshop on Continuous Archival and Retrieval of Personal Experiences (CARPE) (New York, NY, USA, 2004), ACM, pp. 48–55.

[76] Giorgino, T., Quaglini, S., Lorassi, F., and De Rossi, D. Experiments in the detection of upper limb posture through kinestetic strain sensors. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks (BSN) (April 2006), pp. 9–12.

[77] Google. Street View. http://maps.google.co.uk/help/maps/streetview.

[78] Gortler, S. J., Grzeszczuk, R., Szeliski, R., and Cohen, M. F. The Lumigraph. Computer Graphics 30, Annual Conference Series (1996), 43–54.

[79] Gottschalk, S., Lin, M. C., and Manocha, D. OBBTree: A hierarchical structure for rapid interference detection. Computer Graphics 30, Annual Conference Series (1996), 171–180.

[80] Gould, D., Medbery, R., Damarjian, N., and Lauer, L. A survey of mental skills training knowledge, opinions, and practices of junior tennis coaches. Journal of Applied Sport Psychology 11 (1999), 28–50.


[81] Grauman, K., Shakhnarovich, G., and Darrell, T. A Bayesian approach to image-based visual hull reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2003), vol. 1, pp. 187–194.

[82] Grauman, K., Shakhnarovich, G., and Darrell, T. Inferring 3D structure with a statistical image-based shape model. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (Nice, France, October 2003).

[83] Grimson, W. E. L., Stauffer, C., Romano, R., and Lee, L. Using adaptive tracking to classify and monitor activities in a site. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Washington, DC, USA, 1998), IEEE Computer Society, p. 22.

[84] Guo, F., and Qian, G. Monocular 3D tracking of articulated human motion in silhouette and pose manifolds. EURASIP Journal on Image and Video Processing 2008, 3 (2008), 1–18.

[85] Gutchess, D., C, M. T., Cohen-Solal, E., Lyons, D., and Jain, A. K. A background model initialization algorithm for video surveillance. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2001), pp. 733–740.

[86] Haake, S. J., and Coe, A. O., Eds. Tennis Science & Technology. Blackwell Sciences, 2000.

[87] Hall, D. L., and Llinas, J. Handbook of Multisensor Data Fusion. CRC Press, 2001.

[88] Hamill, J., and Knutzen, K. M. Biomechanical Basis of Human Movement. Wolters Kluwer, 2009.

[89] Hamming, R. You and your research. Transcript of the Bell Communications Research Colloquium Seminar, 7 Mar. 1986.

[90] Hao, Q., Brady, D. J., Guenther, B. D., Burchett, J., Shankar, M., and Feller, S. Human tracking with wireless distributed pyroelectric sensors. IEEE Sensors Journal 6, 6 (2006), 1683–1696.

[91] Haritaoglu, I., Harwood, D., and Davis, L. W4: Who? When? Where? What? A real time system for detecting and tracking people. In Proceedings of the Third Face and Gesture Recognition Conference (1998), pp. 222–227.

[92] Haussecker, H. W., and Fleet, D. J. Computing optical flow with physical models of brightness variation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 23, 6 (2001), 661–673.

[93] Hawkeye. Hawkeye Innovations. http://www.hawkeyeinnovations.co.uk.

[94] Hebert, M., Ikeuchi, K., and Delingette, H. A spherical representation for recognition of free-form surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 17, 7 (1995), 681–690.


[95] Heeger, D. J. Optical flow using spatiotemporal filters. International Journal of Computer Vision (IJCV) 1, 4 (1988), 270–302.

[96] Hengstler, S., and Aghajan, H. WiSNAP: A wireless image sensor network application platform. In International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities (Barcelona, Spain, 2006).

[97] Hengstler, S., Prashanth, D., Fong, S., and Aghajan, H. MeshEye: a hybrid-resolution smart camera mote for applications in distributed intelligent surveillance. In Proceedings of the International Conference on Information Processing in Sensor Networks (IPSN) (New York, NY, USA, 2007), ACM, pp. 360–369.

[98] Hill, A. V. The heat of shortening and the dynamic constants of muscle. Proceedings of the Royal Society London Series B 126, 843 (1938), 136–195.

[99] Horn, B., and Schunck, B. Determining optical flow. Artificial Intelligence 17, 1–3 (1981), 185–203.

[100] Horn, B. K. P. Extended Gaussian images. Proceedings of the IEEE 72 (1984), 1671–1686.

[101] Huberti, H., and Hayes, W. Patellofemoral contact pressures: The influence of Q-angle and tendofemoral contact. Journal of Bone and Joint Surgery 66 (1984), 715–724.

[102] International Tennis Federation (ITF). Biomechanical principles for the serve in tennis. Coach Education Series.

[103] International Tennis Federation (ITF). Biomechanics of tennis: An introduction. Coach Education Series.

[104] International Tennis Federation (ITF). Mental training for tournament tennis players. Coach Education Series.

[105] Isard, M., and Blake, A. CONDENSATION – conditional density propagation for visual tracking. International Journal of Computer Vision (IJCV) 29, 1 (1998), 5–28.

[106] Jain, A. K., and Karu, K. Learning texture discrimination masks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 18, 2 (1996), 195–205.

[107] Jaklic, A., Leonardis, A., and Solina, F. Segmentation and Recovery of Superquadrics, vol. 20 of Computational Imaging and Vision. Kluwer, Dordrecht, 2000. ISBN 0-7923-6601-8.

[108] Johnson, A. E., and Hebert, M. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 21, 5 (1999), 433–449.

[109] KaewTraKulPong, P., and Bowden, R. An improved adaptive background mixture model for real-time tracking with shadow detection. In 2nd European Workshop on Advanced Video-based Surveillance Systems (2001).


[110] Kalman, R. E. A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering 82, Series D (1960), 35–45.

[111] Kam, M., Zhu, X., and Kalata, P. Sensor fusion for mobile robot navigation. Proceedings of the IEEE 85, 1 (1997), 108–119.

[112] Kamijo, S., Matsushita, Y., Ikeuchi, K., and Sakauchi, M. Traffic monitoring and accident detection at intersections. IEEE Transactions on Intelligent Transportation Systems 1, 2 (Jun 2000), 108–118.

[113] Kass, M., Witkin, A., and Terzopoulos, D. Snakes: Active contour models. International Journal of Computer Vision (IJCV) 1, 4 (1988), 321–331.

[114] Kazhdan, M., Chazelle, B., Dobkin, D., Funkhouser, T., and Rusinkiewicz, S. A reflective symmetry descriptor for 3D models. Algorithmica (2003).

[115] Kazhdan, M., Funkhouser, T., and Rusinkiewicz, S. Rotation invariant spherical harmonic representation of 3D shape descriptors. In Symposium on Geometry Processing (2003), pp. 167–175.

[116] Kearney, J. K., Thompson, W. B., and Boley, D. L. Optical flow estimation: an error analysis of gradient-based methods with local optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 9, 2 (1987), 229–244.

[117] Kimura, K., and Saito, H. Video synthesis at tennis player viewpoint from multiple view videos. In IEEE Proceedings of Virtual Reality (March 2005), pp. 281–282.

[118] King, R., Atallah, L., Lo, B., and Yang, G.-Z. Development of a wireless sensor glove for surgical skills assessment. To appear in IEEE Transactions on Information Technology in Biomedicine (2008).

[119] Kleihorst, R., Schueler, B., and Danilin, A. Architecture and applications of wireless smart cameras (networks). IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4 (April 2007), 1373–1376.

[120] Kleinoder, H. Biomechanics of the return of serve. ITF Coaching & Sport Science Review (CSSR) 24 (2001), 5–6.

[121] Koller, D., Weber, J., and Malik, J. Robust multiple car tracking with occlusion reasoning. In Proceedings of the European Conference on Computer Vision (ECCV) (1994), pp. 189–196.

[122] Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., and Rother, C. Bi-layer segmentation of binocular stereo video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Washington, DC, USA, 2005), IEEE Computer Society, pp. 407–414.

[123] Kolonias, I., Tzovaras, D., Malassiotis, S., and Strintzis, M. Fast content-based search of VRML models based on shape descriptors. IEEE Transactions on Multimedia 7, 1 (2005), 114–126.


[124] Kubota, A., Takahashi, K., Aizawa, K., and Chen, T. All-focused light field rendering, 2004.

[125] Kundur, D., Lin, C.-Y., and Lu, C.-S. Editorial: visual sensor networks. EURASIP Journal on Applied Signal Processing (2007), 227–227.

[126] Kutulakos, K. N., and Seitz, S. M. A theory of shape by space carving. International Journal of Computer Vision (IJCV) 38, 3 (2000), 199–218.

[127] Laerhoven, K., Gellersen, H., and Malliaris, Y. Long term activity monitoring with a wearable sensor node. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks (BSN) (2006), pp. 171–174.

[128] Langheinrich, M. Privacy invasions in ubiquitous computing. In Proceedings of UBICOMP'02 (2002).

[129] Lantronix. MatchPort. http://www.lantronix.com/device-networking/embedded-device-servers/matchport.html.

[130] Lee, D.-S. Effective Gaussian mixture learning for video background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 27, 5 (2005), 827–832.

[131] Leone, A., Distante, C., and Buccolieri, F. A texture-based approach for shadow detection. In IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS) (2005), pp. 371–376.

[132] Levoy, M., and Hanrahan, P. Light field rendering. Computer Graphics 30, Annual Conference Series (1996), 31–42.

[133] Lhuillier, M., and Quan, L. Match propagation for image-based modeling and rendering. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 24 (August 2002), 1140–1146.

[134] Linz, T., Kallmayer, C., Aschenbrenner, R., and Reichl, H. Fully integrated EKG shirt based on embroidered electrical interconnections with conductive yarn and miniaturized flexible electronics. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks (BSN) (April 2006), pp. 23–26.

[135] Lippman, A. Movie-maps: an application of the optical videodisc to computer graphics. In Proceedings of ACM SIGGRAPH (1980).

[136] Liu, Y., Williams, J., and Bennion, I. Optical bend sensor based on measurement of resonance mode splitting of long-period fiber grating. IEEE Photonics Technology Letters 12, 5 (May 2000), 531–533.

[137] Lo, B., Atallah, L., Aziz, O., ElHelw, M., Darzi, A., and Yang, G.-Z. Real time pervasive monitoring for post operative care. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks (BSN) (Aachen, Germany, 2007), pp. 122–127.


[138] Lo, B., Wang, J. L., and Yang, G. Z. From imaging networks to behavior profiling: Ubiquitous sensing for managed homecare of the elderly. In Adjunct Proceedings of the 3rd International Conference on Pervasive Computing (PERVASIVE 2005) (Munich, Germany, 2005), pp. 101–104.

[139] Lo, B. P. L., and Yang, G.-Z. Neuro-fuzzy shadow filter. In Proceedings of the European Conference on Computer Vision (ECCV) (May 2002), pp. 381–392.

[140] Lucas, B. D., and Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of Imaging Understanding Workshop (1981), pp. 121–130.

[141] Lucena, M., Fuertes, J., Perez de la Blanca, N., and Garrido, B. An optical flow probabilistic observation model for tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP) (2003), vol. 3, pp. 957–960.

[142] Makadia, A., Sorgi, L., and Daniilidis, K. Rotation estimation from spherical images. In Proceedings of the International Conference on Pattern Recognition (ICPR) (Washington, DC, USA, 2004), IEEE Computer Society, pp. 590–593.

[143] Makhoul, J. Linear prediction: A tutorial review. Proceedings of the IEEE 63, 4 (April 1975), 561–580.

[144] Margi, C. B., Lu, X., Zhang, G., Stanek, G., Manduchi, R., and Obraczka, K. Meerkats: A power-aware, self-managing wireless camera network for wide area monitoring. In International Workshop on Distributed Smart Cameras (DSC-06) (Boulder, Colorado, U.S.A., 2006), pp. 26–30.

[145] Mark, W. Post-rendering 3D image warping: Visibility, reconstruction, and performance for depth-image warping. Tech. Rep. TR99-022, University of North Carolina, Chapel Hill, NC, USA, 1999.

[146] Mark, W. R., McMillan, L., and Bishop, G. Post-rendering 3D warping. In Symposium on Interactive 3D Graphics (1997), pp. 7–16.

[147] Matrix Vision. mvBlueLYNX smart camera. http://www.matrix-vision.com/products/hardware/mvbluelynx.php.

[148] Matusik, W., Buehler, C., Raskar, R., Gortler, S. J., and McMillan, L. Image-based visual hulls. In SIGGRAPH '00: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (New York, NY, USA, 2000), ACM Press/Addison-Wesley Publishing Co., pp. 369–374.

[149] Maurer, U., Rowe, A., Smailagic, A., and Siewiorek, D. eWatch: a wearable sensor and notification platform. In Proceedings of the International Workshop on Wearable and Implantable Body Sensor Networks (BSN) (April 2006), pp. 142–145.

[150] McCullough, C., Dasarathy, B., and Lindberg, P. Multi-level sensor fusion for improved target discrimination. In Proceedings of the 35th IEEE Conference on Decision and Control (1996), vol. 4, pp. 3674–3675.

