

TEAM DEV - CSCI5561: COMPUTER VISION 1

Stereo System for Real-Time Pose Tracking

Ben Dischinger, Robert Edge, and Joshua Vander Hook

Abstract—Here we describe a stereo camera system capable of tracking a single subject through a series of dynamic poses in real time. Our system is capable of estimating the 5 key points of the person’s pose (head, feet, hands) as well as their center of gravity. All of this is accomplished using two passive cameras running on commodity laptop hardware.

We provide a geometric description of the setup which precludes the requirement for disparity calculations. We also provide implementation details and experimental results, including videos of the system at work. Finally, a number of interesting applications and directions for future work are identified.

Index Terms—pose tracking, background segmentation

I. INTRODUCTION

This paper discusses the implementation of a real-time game controller using commodity web cameras and a combination of computer vision algorithms from the literature. Our testbed is a game called “Rally Ball,” which consists of a three-dimensional room where virtual balls spontaneously fly towards the player, who must hit them in order to receive points.

A. Requirements

The first requirement is to detect ball collisions with the player. If the ball shares the same three-dimensional space as the player, a hit should be detected. Depth information must somehow be determined to resolve occlusion ambiguities. Our approach uses an orthogonal stereo baseline along with robust background subtraction in order to easily disambiguate depth information.

In addition to simple hit detection, we wish to determine which body part was struck by the ball. We present a solution using fast skeletonization, skin detection, and K-means clustering. As an extra challenge, the player should be able to catch the ball as it approaches. From our body segmentation, we use the estimated hand positions to determine whether the player caught the ball. Finally, in order to be a true game controller, the system should run in real time. We consider real time to be less than one second of delay at 10 Hz performance.

This suggested three main components: human detection, body part segmentation, and augmented reality. We address each of these individually, describe our system in Section III, and present the results in Section IV.

II. SURVEY

This section presents algorithms in the literature which we found pertinent to our implementation of a multi-view real-time game controller.

Ben Dischinger is with the Department of Computer Science, University of Minnesota, Minneapolis, USA, e-mail: [email protected].

Robert Edge is with the Department of Computer Science, University of Minnesota, Minneapolis, USA, e-mail: [email protected].

Joshua Vander Hook is with the Department of Computer Science, University of Minnesota, Minneapolis, USA, e-mail: [email protected].

A. Real Time Human Motion Analysis

The authors of [1] seek to analyze the motion of a human target in a video stream using foreground boundary information. This does not directly apply to our problem, but a number of the techniques used in the paper helped us skeletonize our mask. The problem approached in [1] is: given some video stream containing humans, how can the activities of the humans be determined, e.g., walking or running? Discussion of their motion analysis is not included in this paper.

Following their lead, we used background subtraction to segment the target from the cluttered background. Lim et al. [2] provide an interesting result that was similar to our requirements. They attempt to place cameras so as to identify regions occluded by moving objects, with the intent of removing false positives. This would have been particularly useful had we found shadows to be a problem. Noriega et al. [3] provided a probabilistic model that was the inspiration for our implementation.

Figure 1. The foreground mask containing the human is dilated twice and eroded. [1]

This mask is then processed using a series of morphological operations with the intention of removing any small holes. Figure 1 shows this process, which leads to border extraction. The goal of the process is to create a signal containing the border’s distance from the mask centroid, which is determined by Equation 1:

$x_{\mathrm{centroid}} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_{\mathrm{centroid}} = \frac{1}{N}\sum_{i=1}^{N} y_i \qquad (1)$

This signal is smoothed using a linear filter to reduce noise, and the local maxima are then taken as extreme points. [1] then uses these extreme points to create a “star” skeleton by connecting them to the centroid. Figure 2 shows this process, starting with smoothing


the raw distance signal and ending up with the extreme limb location points.

Figure 2. Skeletonization process. Top: raw distance function around the silhouette. Bottom: star skeleton from maxima of the smoothed distance function. [1]

The paper discusses three main advantages of this type of skeletonization process, the first of which is performance. Other techniques such as thinning or distance transformations are computationally expensive, whereas this type of skeletonization is non-iterative, making it computationally cheaper. The other advantages are that an explicit mechanism for controlling scale sensitivity is provided, and that this technique does not rely on any prior human model.

The idea of a fast, model-free process is desirable for our controller, so we implemented a similar approach, and it delivers the promised performance. A disadvantage of this approach is that there are a handful of degenerate poses where the locations of all five points may not be known, because they are not local maxima. While this is potentially acceptable for their intended application of motion analysis, it does not give us a full solution to our controller problem.

B. Body Part Segmentation of Noisy Human Silhouette Images

The motivation of [4] is to develop a solution to the problem of body part segmentation in noisy silhouette images. Their main contribution is that they address the issue of insufficient labeled training data with synthetically generated images and use this data to train a hidden Markov model. The model is then used to segment the silhouettes into arms, legs, body, and head.

To create the synthetic training data, motion capture techniques are used to capture the 3D information of various poses. These point sets are then used to generate silhouettes from various angles.

Shape context features [4], [5] are extracted from these labeled silhouettes and are then used to train Gaussian mixture models for each body part. These features describe the shape of the silhouette outline and were originally used in shape matching and in finding the transformation between two objects. Creating a shape context feature involves sampling the silhouette edge at regular intervals. For each sampled edge point, the distance and direction to all other points within some distance are calculated. These pairs are then stored in a histogram whose bins are weighted based on the proposed scheme:

$d(r) = \begin{cases} \frac{R}{2N} & \text{if } \frac{R}{3} < r < \frac{2R}{3} \\ \frac{R}{N} & \text{otherwise} \end{cases} \qquad (2)$
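To make the feature concrete, the following sketch builds one such distance-direction histogram for a single sampled edge point. This is our own illustrative code, not from [4]: the bin counts, the uniform distance binning, and the function name are assumptions, with the weighting of Equation 2 applied to each contribution.

```python
import math

def shape_context(points, ref, n_dist_bins=3, n_ang_bins=8, R=1.0):
    # Histogram over (distance, direction) from one sampled edge point
    # `ref` to every other edge point closer than R. The bin layout is
    # our own choice for illustration.
    N = len(points)
    hist = [[0.0] * n_ang_bins for _ in range(n_dist_bins)]
    for (x, y) in points:
        dx, dy = x - ref[0], y - ref[1]
        r = math.hypot(dx, dy)
        if r == 0.0 or r >= R:
            continue
        d_bin = min(int(r / R * n_dist_bins), n_dist_bins - 1)
        theta = math.atan2(dy, dx) % (2.0 * math.pi)
        a_bin = min(int(theta / (2.0 * math.pi) * n_ang_bins), n_ang_bins - 1)
        # Down-weight mid-range distances, mirroring Eq. (2)
        weight = R / (2.0 * N) if R / 3.0 < r < 2.0 * R / 3.0 else R / N
        hist[d_bin][a_bin] += weight
    return hist
```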

Figure 3 shows the binning scheme for a feature at a shoulder point.

Figure 3. The binning scheme used in creating the histograms for context features. [4]

The mixture models are then used to form the states of the hidden Markov model. Figure 4 shows the process of extracting models from a labeled silhouette. The approach

Figure 4. The process for creating body part models. First, labeled synthetic data is processed to find border points. The points are then turned into context features, which are used in the creation of Gaussian mixture models for the body parts. [4]

proposed in [4] limits the impact of incomplete training data with synthetic data. We decided not to pursue this approach because we wanted a model-free implementation; our reasons were also partly logistical, including time constraints and the limited availability of 3D training data for use in creating silhouettes.

Figure 5. Conics induced by two spheres. [6]

C. Calibration Using Spheres

In [6], Wong presents a novel technique for calibrating multiple cameras based on the epipolar geometry of spheres. The advantage of such a system is that, instead of requiring a planar calibration pattern to be visible from all viewpoints, spheres can be placed in the environment and viewed more easily from all viewpoints. For a system with many cameras, this approach would make it possible to calibrate the world frame consistently and simultaneously.

The intuition behind Wong’s approach is that two spheres precisely define two intersection points for the lines tangent along the sides of the two spheres. Figure 5 shows the two conics induced by the two spheres. Every pair of spheres produces two point correspondences. Therefore a minimum of three spheres is needed, which can produce nine point correspondences: six that are intersections of the conics, and three sphere centers that can be calculated from the intersections of the lines created by the six other correspondences. We did not use this approach in our implementation, but as more cameras are added, using spheres could greatly simplify the extrinsic calibration process.

III. SYSTEM DESCRIPTION

A. Environment

We used two Logitech C120 cameras attached to a Dell Inspiron with two 2 GHz processors. We limited the system to a small room. Our main objective was to detect the collision between a subject and one or more time-varying world coordinates (the ball). Our system used a wide-baseline stereo configuration, with the camera image planes roughly orthogonal (see Figure 6b).

Intuitively, we could project the ball’s image onto the same image frames and detect a collision as though it had actually been present. In practice, this amounts to comparing the 2D bounding polygons of each camera’s subject to the 3D coordinates of the ball projected onto the same image. When each camera detects a collision, we register a hit and attempt to recognize which body part collided.

Our setup is based on the simple observation that, when two objects overlap (our definition of a collision), their axis-aligned bounding boxes must also overlap. We treat an image of the scene as a rough approximation of a planar projection, and delay match-checking until both planar projections are overlapping (see Figure 6b). This condition does not guarantee a collision, but it performed well in practice as a first approximation of a collision state.
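This overlap test can be sketched in a few lines; the box representation and function names here are our own, not taken from our actual implementation.

```python
def boxes_overlap(a, b):
    # Boxes are (xmin, ymin, xmax, ymax) in one camera's image plane.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def collision_candidate(player_boxes, ball_boxes):
    # Flag a candidate collision only when the player's box and the
    # projected ball's box overlap in every camera view.
    return all(boxes_overlap(p, q) for p, q in zip(player_boxes, ball_boxes))
```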

B. Setup and Calibration

We created a program to calibrate the cameras individually in order to determine the intrinsic camera properties. This uses the same technique as in [7], where multiple images of a checkerboard pattern are used to introduce homographic geometry constraints on the intrinsic properties.

Before playing, the world frame must also be calibrated across the cameras. To do this we use a checkerboard pattern that is visible from both viewpoints. The world origin is taken to be the upper-left rectangle of the pattern in both cameras.

Since the checkerboard pattern is a plane, it is related to the image of the checkerboard by a homography. Since we know the intrinsic properties of the cameras, we can solve for the extrinsic properties using the homographic constraints [7].

Let Pi = (Xi, Yi, 1) be the set of homogeneous world coordinates with zero Z-axis coordinate for the ith corner of the checkerboard pattern, and let pi = (xi, yi, 1) be its image coordinates. We know that pi is related to Pi by both a projective and a homographic transform. The projection is given by the usual transformation

pi = K[R t]Pi (3)

where K is the intrinsic matrix of the camera, and R and t are the extrinsic rotation and translation we are trying to determine.

Since the pattern is related to its image by a homography, we know that

αpi = HPi (4)

where α is some scale factor. This homography can be estimated using least-squares techniques.

Since we know that Zi = 0, we can rewrite Equation 3 as

pi = K[r1 r2 t]Pi

where r1 and r2 are column vectors of the rotation matrix we are trying to determine; r3 provides no constraint, given Zi = 0. Using the homography H = [h1 h2 h3] estimated above, we can combine Equations 3 and 4:

αK[r1 r2 t]Pi = [h1 h2 h3]Pi

α[r1 r2 t] = K−1[h1 h2 h3]

We can solve for R and t as follows, using the orthonormality constraints on the ri:


Figure 6. The camera positions and test area. 6a: an example false positive from a narrow baseline. 6b: using a wide baseline to resolve depth information without disparity.

r1 = K−1h1/α

r2 = K−1h2/α

r3 = r1 × r2

t = K−1h3/α

α = 1/||K−1h1||

We do this for each viewpoint of the scene. Since each camera views the same checkerboard pattern, the world frame should now be calibrated to each of our viewpoints. We assumed that this calibration was sufficient and did not attempt to reduce the projection error between the two cameras. This can be done in real time to determine the world frame dynamically by projecting an object to the world frame origin; the world frame can then be placed by the player wherever they would like.
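For illustration, the construction above can be written as a short routine. This is our own sketch, not our original code: it assumes the homography H has already been estimated by least squares, and the routine name and use of NumPy are our choices.

```python
import numpy as np

def extrinsics_from_homography(K, H):
    # Recover R and t from a plane-induced homography H and intrinsics K,
    # following the construction in the text:
    #   alpha = 1/||K^-1 h1||, r1 = alpha K^-1 h1, r2 = alpha K^-1 h2,
    #   r3 = r1 x r2, t = alpha K^-1 h3.
    Kinv = np.linalg.inv(K)
    h1, h2, h3 = H[:, 0], H[:, 1], H[:, 2]
    alpha = 1.0 / np.linalg.norm(Kinv @ h1)
    r1 = alpha * (Kinv @ h1)
    r2 = alpha * (Kinv @ h2)
    r3 = np.cross(r1, r2)
    t = alpha * (Kinv @ h3)
    return np.column_stack([r1, r2, r3]), t
```

Note that the recovered R is only approximately orthonormal when H is noisy; a polar decomposition could project it back onto a true rotation.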

C. Human Detection

To isolate a human target from the background, we used background subtraction. Following the calibration procedure, we collected frames to form a mean background image Iµ. Over the set of images we also calculated the standard deviation, Iσ. Once this was complete, we began the game. During play, we collected images (It for time t) and calculated a boolean mask of changed pixels. First, form the symmetric difference:

Dt = |It − Iµ|

then form the mask by evaluating:

$I_{\mathrm{mask}}(i,j) = \begin{cases} 1 & \text{if } D_t(i,j) > T \cdot I_\sigma(i,j) \\ 0 & \text{otherwise} \end{cases}$

for some threshold T. Values near T = 3.5 worked well in practice. The resulting mask reflected any large change between the current frame and the average background. In practice we found it necessary to apply a Gaussian blur to both Iµ and It before calculating the difference. We also found it necessary to apply morphological operators to the resulting mask to both increase connectivity and remove Laplacian noise. Figure 7 shows the results of these operations.
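The per-pixel threshold test can be sketched as follows. This toy version (our own, for illustration) operates on nested lists of intensities and omits the Gaussian blur and morphological clean-up steps described above.

```python
def foreground_mask(frame, mean_bg, std_bg, T=3.5):
    # A pixel is foreground when |I_t - I_mu| > T * I_sigma, with
    # T = 3.5 as found to work well in practice.
    h, w = len(frame), len(frame[0])
    return [[1 if abs(frame[i][j] - mean_bg[i][j]) > T * std_bg[i][j] else 0
             for j in range(w)]
            for i in range(h)]
```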

D. Five Point Estimator

We implement a skeletonization procedure similar to that outlined in Section II-A and [1]. The key difference is that instead of computing the distances from the centroid around the border, our implementation computes the distances from the centroid from 0 to 2π radians in 0.02-radian increments. This design choice was made both for ease of implementation and for speed. Figure 8 contains an example signal from our implementation, as well as the smoothed result from which the limbs can be identified.

Figure 8. Left: sample human silhouette. Middle: raw distance signal. Right: smoothed distance signal; notice the 5 unique maxima.

This skeletonization technique works well for any human pose where the arms do not occlude other body parts and where the arms are not parallel to the ground normal. Our implementation always uses this technique for finding the feet, but for degenerate poses, additional information is needed to find the hands and head. This extra information comes from head tracking and skin detection.
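The radial sampling procedure can be sketched as follows. This is our own simplified version of the step: the smoothing window size is an assumption, and a real foreground mask would supply far more pixels than the toy inputs used here.

```python
import math

def five_point_candidates(mask_pixels, step=0.02, window=9):
    # Sweep 0..2*pi in `step`-radian increments, record the farthest mask
    # pixel from the centroid in each direction, smooth the resulting
    # circular signal, and return the angles of its local maxima
    # (candidate head/hand/foot directions).
    n = len(mask_pixels)
    cx = sum(x for x, _ in mask_pixels) / n
    cy = sum(y for _, y in mask_pixels) / n
    nbins = int(round(2.0 * math.pi / step))
    dist = [0.0] * nbins
    for x, y in mask_pixels:
        r = math.hypot(x - cx, y - cy)
        b = int((math.atan2(y - cy, x - cx) % (2.0 * math.pi)) / step) % nbins
        dist[b] = max(dist[b], r)
    half = window // 2
    smooth = [sum(dist[(i + k) % nbins] for k in range(-half, half + 1)) / window
              for i in range(nbins)]
    return [i * step for i in range(nbins)
            if smooth[i] > smooth[i - 1] and smooth[i] > smooth[(i + 1) % nbins]]
```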

E. Skin Detection

To compute pose information when the skeletonization procedure is insufficient, we use the color information of the scene to compute hand and head positions. To detect skin, the


Figure 7. Background subtraction process. (a) The background image. (b) A human subject. (c) A sample symmetric difference. (d) The mask (orange) after morphological operators.

foreground image is thresholded using experimentally verified skin color values. To reduce noise, we use a median filter on the resulting mask image. K-means is employed to assign the skin pixels to clusters. Sub-Figure 9d demonstrates this successful clustering.
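As a sketch of the clustering step, the following toy k-means groups skin-pixel coordinates into clusters (head and two hands). The RGB rule shown is a commonly cited skin heuristic, not our exact experimentally tuned thresholds, and seeding from the first k points is a simplification.

```python
def is_skin(r, g, b):
    # One widely used RGB skin rule (illustrative only; our system's
    # thresholds were tuned experimentally).
    return r > 95 and g > 40 and b > 20 and r > g and r > b and (r - g) > 15

def kmeans(points, k, iters=20):
    # Plain k-means on (x, y) skin-pixel coordinates.
    centers = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                  (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        centers = [(sum(x for x, _ in cl) / len(cl),
                    sum(y for _, y in cl) / len(cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers
```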

F. Segmentation

The approach we use to assign body parts uses domain knowledge of the standard poses a human will take while playing Rally Ball. The algorithm first assigns all the pixels below the centroid to the legs. We found that, viewed from a horizontal camera, the centroid is very close to the waist through a wide range of poses.

The head is next identified using a combination of nearest neighbors, thresholds, and relative location. The centroids of the three skin patches are first pruned for head candidacy using relative position information, e.g., a candidate must be above the body centroid. After pruning the highly unlikely candidates, the head is assigned to the centroid nearest to the average of the last 10 head positions (over ~1 s).

The hands are chosen as the two remaining skin clusters. The shoulders are set at a stock distance and width ratio between the head and the centroid. This works well experimentally, but we could improve it by adding an initial calibration phase that finds the five extreme points using local minima instead; that would give the true shoulder points and the true ratio for further renderings. The torso is assigned as a box that extends from the shoulders to the centroid. As all other body parts have been accounted for, we can assign the remaining mask pixels to the arms with certainty. The biggest problem with this approach is that the arms are not appropriately segmented if they occlude the torso or legs. This could potentially be resolved by assigning pixels near the arm skeleton as arm pixels. Figure 9 shows an example segmentation using our heuristics.
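The head-selection heuristic described above can be sketched as follows; this is an illustrative reconstruction, and the empty-history fallback is our own addition for the first frame.

```python
def pick_head(candidates, centroid_y, history, max_history=10):
    # Choose the head among skin-cluster centroids: discard candidates
    # below the body centroid (image y grows downward), then take the
    # one nearest the mean of the recent head positions.
    above = [c for c in candidates if c[1] < centroid_y]
    if not above:
        return None
    if history:
        hx = sum(x for x, _ in history) / len(history)
        hy = sum(y for _, y in history) / len(history)
        head = min(above, key=lambda c: (c[0] - hx) ** 2 + (c[1] - hy) ** 2)
    else:
        head = min(above, key=lambda c: c[1])  # highest remaining cluster
    history.append(head)
    if len(history) > max_history:
        history.pop(0)
    return head
```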

G. Ball Projection

For robustness and efficiency, we performed all calculations for ball-player interaction in the image frames. This allowed us to use very simple depth calculations instead of constructing a world-centric simulation that requires dense depth information. Since our camera frames were orthogonally placed, most collisions can be accurately detected by a simple mask calculation: if the ball intersects with the foreground masks in

each frame, then there is a hit. The accuracy of this calculation only improves with the number of viewpoints added.

In order to perform this calculation, we must know the ball’s image coordinates in each image frame. The ball coordinates are tracked and updated in the world frame, but since we have calibrated each camera, we can apply the projective transform to determine the image coordinates of the ball. It is assumed that the camera distortion is minimal, so that the projection of the sphere will be a circle. We can then use similar triangles to calculate the radius of the projected sphere:

rproj = f(y + r)/z − fy/z = fr/z

where f is the camera focal length, z is the distance of the ball center from the camera, and r is the ball’s radius in the world frame.

Using this information, we can render the ball in each camera frame. This calculation also defines a region of interest in each of the camera images. Given that our cameras are placed orthogonally to one another, if this region intersects with the player in each frame, we can be very certain that a collision has occurred.
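A sketch of the projection and per-view hit test follows. This is our own illustrative code: the function names are assumptions, `ball_cam` is the ball center already expressed in one camera's coordinate frame, and `mask_pixels` stands in for the foreground mask.

```python
def project_ball(f, ball_cam, r_world):
    # Pinhole projection of the ball center, plus the similar-triangles
    # radius from the text: r_proj = f*(y + r)/z - f*y/z = f*r/z.
    x, y, z = ball_cam
    return (f * x / z, f * y / z, f * r_world / z)

def ball_region_hits_mask(u, v, r_proj, mask_pixels):
    # A hit in one view: some foreground pixel lies inside the ball's
    # projected circle. A collision requires a hit in every view.
    return any((px - u) ** 2 + (py - v) ** 2 <= r_proj ** 2
               for px, py in mask_pixels)
```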

IV. RESULTS

Our work resulted in the successful implementation of a Rally Ball controller. The described system is able to detect virtual ball collisions with the player, detect which body part was hit, and register when a catch has been made.

Figure 9 shows a player kicking a virtual ball using our system. This demonstration uses the body part segmentation in Sub-Figure 9d to determine which body part collided with the ball. Sub-Figure 9e displays an example of a catch scenario, where the skin segmentation is used to determine whether the two hands are within some factor of the radius from the center of the ball and farther than the diameter apart from each other.

The system also functions at speeds greater than 10 Hz with less than one second of delay, meeting the real-time requirement. The non-iterative skeletonization and the method of processing only the virtual ball’s region of the image were key design features that helped achieve these speeds.

V. DISCUSSION

We presented a system capable of real-time pose tracking using only off-the-shelf cameras. We used no artificial lighting or active sensing. We would like to discuss some of the restrictions we placed on the system and possible ways to relax them. In turn, this suggests future directions of research and some possible target applications.


Figure 9. 9a: side frame with the player kicking the ball. 9b: front frame with skeletonization. 9c: body part identification; notice that the leg was detected for the hit. 9d: segmentation used for body part identification. 9e: player catching a ball.

A. Restrictions

1) Static Background: Our implementation relies heavily on background subtraction. During testing, shadows and displaced objects presented the most trouble. For instance, if we accidentally moved an object during testing, it would be considered part of the foreground for the duration of the test. An adaptive background approach might help solve this problem for longer trials, but the system would still suffer in the short term after an object has been moved.

2) Static Cameras: Currently, we assume the cameras are static. The main reason for this restriction is that movement of a camera changes the observable scene, resulting in a non-static background. If the background problem were solved, the cameras could move freely in the environment so long as all cameras could observe a common calibration pattern to calculate the extrinsic parameters.

3) Single Human of Interest: The system only works with one person, who is assumed to be standing upright. It would be difficult to implement multi-user capability because increasing the number of cameras increases the chances that the masks of multiple users will overlap. This occlusion would render our current collision detection scheme ineffective.

4) Two Cameras: For testing, two cameras were used with viewing angles orthogonal to each other. Adding further cameras at appropriate viewing positions would increase the robustness of the system by resolving ambiguity and denying false positives. The main restriction preventing us from using more than two cameras is the bandwidth of the USB controller on our test laptop. The sphere-based calibration discussed previously would allow us to calibrate the extrinsic parameters for the entire setup.

B. Applications

We would like to propose a few applications which might find our results useful.

1) Surveillance: Our multiple-camera setup is capable of detecting interactions between a human subject and an object we project into 3D space. However, we could just as easily detect interactions between objects which are actually present in the scene. Imagine an art gallery in which exhibits are covered by multiple cameras at different angles. By employing a system such as ours, we could quickly flag humans who interact with exhibits, simply by tracking the bounding polygons of objects and humans and alerting when they overlap in sufficiently many cameras.

2) Boxing Judge: An interesting challenge for computer vision in general might be judging boxing matches. We believe there is sufficient information present to count punches which hit or miss. Robust tracking would need to be employed to estimate the positions of the gloves precisely. Additionally, the assumption of a static background does not hold in this case. However, using the ropes and other geometric properties of the ring as cues, background segmentation should be possible. The advantage of a pure vision system is that it requires no additional hardware.

VI. CONCLUSIONS

This paper discussed the implementation of a stereo vision controller for the Rally Ball game. The objectives of ball


collision detection, body part segmentation, and catch recognition were achieved with real-time performance. Future work could include testing multiple camera configurations to resolve collision and catching ambiguities.

ACKNOWLEDGMENT

Thanks to Pratap Tokekar and Vineet Bhatawadekar for setting up the sample video footage for our initial testing. Thanks to Dr. Volkan Isler for proposing the challenge and loaning equipment.

REFERENCES

[1] H. Fujiyoshi, A. J. Lipton, and T. Kanade, “Real-time human motion analysis by image skeletonization,” IEICE Trans. Inf. and Syst., vol. 87-D, no. 1, 2004.

[2] S.-N. Lim, L. S. Davis, and N. Paragios, “Fast illumination-invariant background subtraction using two views: Error analysis, sensor placement and applications.”

[3] P. Noriega, O. Bernier, and P. Marzin, “Real time illumination invariant background subtraction using local kernel histograms,” pp. 1–10.

[4] M. B. M. Matilainen and J. Heikkila, “Body part segmentation of noisy human silhouette images,” ICME, pp. 1189–1192, 2008.

[5] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509–522, 2002.

[6] K.-Y. Wong and G. Zhang, “A stratified approach for camera calibration using spheres,” IEEE Transactions on Image Processing, 2011.

[7] Z. Zhang, “Flexible camera calibration by viewing a plane from unknown orientations,” IEEE International Conference on Computer Vision, 1999.