
The Hand Mouse: GMM Hand-color Classification and Mean Shift Tracking

Takeshi Kurata Takashi Okuma Masakatsu Kourogi Katsuhiko Sakaue

Intelligent Systems Institute, National Institute of Advanced Industrial Science and Technology (AIST)

1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan
[email protected]

(RATFG-RTS 2001, in conjunction with ICCV 2001, Vancouver, Canada, pp. 119-124, 2001)

Abstract

This paper describes an algorithm to detect and track a hand in each image taken by a wearable camera. We primarily use color information; however, instead of predefined skin-color models, we dynamically construct hand- and background-color models by using a Gaussian mixture model (GMM) to approximate the color histogram. Not only to obtain the estimated mean of hand color necessary for the restricted EM algorithm that estimates the GMM, but also to classify hand pixels based on Bayes decision theory, we use a spatial probability distribution of hand pixels. Because a static distribution is inadequate for the hand-tracking stage, we translate the distribution with the hand motion based on the mean shift algorithm. Using the proposed method, we implemented the Hand Mouse, which uses the wearer's hand as a pointing device, on our Wearable Vision System.

1. Introduction

As computers, displays, cameras, and other sensors become smaller, less expensive, and more efficient, systems consisting of these components and the services provided by them are expected to spread and become important in our daily life. One distinctive advantage of wearing computers and sensors is that such systems have the potential to assist the wearer more adaptively than usual desktop systems, by sharing experiences with the wearer all the time and by understanding the wearer's context [10]. Vision plays an important role in understanding contextual information, so we have researched context-aware systems and services based on computer vision techniques, which we call Visual Wearables [1, 8, 9, 13].

Visual context awareness [7, 11, 12, 16] can be regarded as an autonomous input interface that is needed to construct augmented environments adaptively. Explicit input interfaces are also essential to enabling interaction with the environment by, for example, pointing and clicking on visual tags and web links, which are associated with real objects and overlaid on video frames taken by the wearer's camera, as shown in Figure 1.

However, it is very difficult to apply the human interfaces of usual desktop environments to wearable augmented environments due to the lack of mobility and operability. Several attempts have been made to solve this problem by using computer vision techniques [11, 14, 15, 17, 19]. This paper describes an algorithm to detect and track a hand in each image taken by a wearable camera, in order to develop the Hand Mouse [9], which uses the wearer's hand as a pointing device, as described in [11, 15].

Hand segmentation is an essential process in detecting and tracking a hand. Compared to images taken by a camera fixed on a desktop PC or a wall, images taken by a wearable camera can have varying lighting conditions and backgrounds that result from the wearer's motions. This makes background subtraction and color segmentation with predefined color models rather difficult to perform.

In this study, we primarily use color information to detect and track the hand. Instead of using predefined skin-color models, we dynamically construct hand- and background-color models based on the hand-color-segmentation method proposed in [19]. The method uses a Gaussian mixture model (GMM) to approximate the color histogram of each input image. The GMM is estimated by the restricted Expectation-Maximization (EM) algorithm, in which the standard EM algorithm [5] was modified to make the first Gaussian distribution an approximation of the hand-color distribution.

Not only to obtain the estimated mean of hand color necessary for the restricted EM algorithm that estimates the GMM, but also to classify hand pixels based on Bayes decision theory, the method uses a static spatial probability distribution of hand pixels. However, the static distribution is inadequate for the Hand Mouse because the hand location is not fixed. In this paper we describe a method that translates the distribution to the appropriate position based on the mean shift algorithm [4, 6]. This method is computationally inexpensive, and it also works effectively because it can track a hand while dynamically updating the hand- and background-color models.

Figure 1: The wearer is clicking on a visual tag (link) associated with the poster on the wall in front of him.

We preliminarily evaluated the performance of the Hand Mouse in complicated environments with different backgrounds and changing lighting conditions. We then implemented it on our Wearable Vision System, which allows us to execute various software modules cooperatively and in parallel.

2. Hand-color Segmentation

This section briefly describes the Bayes decision theory framework for hand-color segmentation proposed in [19]. To minimize the influence of lighting, this method uses the HS color space of the HSV (or HSI) color system to calculate a 2-D color histogram. The color histogram $P$ is approximated by using a GMM, which is a weighted sum of $K$ Gaussian distributions $N_1, \ldots, N_K$ and can be defined as follows:

$$P'(c; \alpha, \mu, \sigma) = \sum_{i=1}^{K} \alpha_i N(c; \mu_i, \sigma_i^2), \qquad \sum_{i=1}^{K} \alpha_i = 1,$$

where $c$ represents color, $\alpha_i$ is the weight of each Gaussian distribution $N_i$, and each $N_i$ has mean $\mu_i$ and standard deviation $\sigma_i$.
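As a concrete reading of this model, the following Python/NumPy sketch builds the normalized H-S histogram (the experiments in Section 4 use 64 x 64 bins) and evaluates the mixture density. Treating each scalar $\sigma_i$ as an isotropic 2-D standard deviation is our assumption, and the function names are illustrative rather than from the paper.

```python
import numpy as np

def hs_histogram(hsv, bins=64):
    """Normalized 2-D H-S histogram of an image already converted to HSV.

    hsv : (H, W, 3) array with the H and S channels scaled to [0, 1].
    """
    hist, _, _ = np.histogram2d(hsv[..., 0].ravel(), hsv[..., 1].ravel(),
                                bins=bins, range=[[0, 1], [0, 1]])
    return hist / hist.sum()

def gmm_density(c, alphas, mus, sigmas):
    """P'(c) = sum_i alpha_i N(c; mu_i, sigma_i^2) over 2-D HS colors,
    reading each sigma_i as an isotropic standard deviation (an assumption).
    """
    c = np.atleast_2d(c)                        # (M, 2) HS color vectors
    p = np.zeros(len(c))
    for a, mu, s in zip(alphas, mus, sigmas):
        d2 = np.sum((c - mu) ** 2, axis=1)      # squared distance to the mean
        p += a * np.exp(-d2 / (2 * s ** 2)) / (2 * np.pi * s ** 2)
    return p
```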

The GMM is estimated by using the restricted EM algorithm, in which the standard EM algorithm is modified to make the first Gaussian distribution $N_1$ an approximation of the hand-color distribution. To model the hand color in $N_1$, $\sigma_1$ is fixed and $\mu_1$ is controlled as follows:

$$\mu_1 = E(C_{hand}), \qquad \sigma_1 = (\sigma_{low} + \sigma_{high})/2,$$

where $\sigma_{low}$ and $\sigma_{high}$ are the lower and upper limits of $\sigma_1$, respectively, which are obtained from training data, and $E(C_{hand})$ is the estimated mean of the hand-color distribution.

To obtain $E(C_{hand})$, the method uses each input image and a spatial probability distribution of hand pixels $P(hand \mid x, y)$, so that the color information of pixels with high probability is weighted according to $P(hand \mid x, y)$. Figure 2 shows an example of $P(hand \mid x, y)$ generated from training data such as that in Figure 4 (right).
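A minimal sketch of these two ingredients follows: the $P(hand \mid x, y)$-weighted color mean, and the restriction pinning the first mixture component. Where exactly the restriction is re-applied inside the EM loop is not spelled out in the paper, so the placement and names here are our assumptions.

```python
import numpy as np

def estimate_hand_color_mean(hs_image, p_hand_xy):
    """E(C_hand): mean HS color weighted by the spatial prior P(hand|x,y),
    so pixels likely to belong to the hand dominate the estimate."""
    w = p_hand_xy / p_hand_xy.sum()                          # normalized weights
    return np.tensordot(w, hs_image, axes=([0, 1], [0, 1]))  # (2,) mean HS color

def pin_hand_component(mus, sigmas, e_c_hand, sigma_low, sigma_high):
    """Restriction on component 1 (the hand component): mu_1 is set to
    E(C_hand) and sigma_1 to the midpoint of its training-data limits;
    the remaining components are left free for the standard EM updates."""
    mus[0] = e_c_hand
    sigmas[0] = 0.5 * (sigma_low + sigma_high)
    return mus, sigmas
```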

The hand pixels and background pixels are then classified based on the Bayes decision criterion:

$$P(c \mid hand)\, P(hand \mid x, y) > P(c \mid background)\, \bigl(1 - P(hand \mid x, y)\bigr), \tag{1}$$

where $P(c \mid hand) = \alpha_1 N_1(c)$ and $P(c \mid background) = \sum_{i=2}^{K} \alpha_i N_i(c)$.
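A per-pixel implementation of criterion (1) might look like the following sketch, which reuses the isotropic-Gaussian reading from the earlier snippet. The convention that component 0 plays the role of the hand component $N_1$ follows the restricted EM setup above; the array layout is our assumption.

```python
import numpy as np

def classify_hand_pixels(hs_image, p_hand_xy, alphas, mus, sigmas):
    """Label each pixel 'hand' iff
    P(c|hand) * P(hand|x,y) > P(c|background) * (1 - P(hand|x,y))."""
    H, W, _ = hs_image.shape
    c = hs_image.reshape(-1, 2)                     # (H*W, 2) HS colors
    comp = np.stack([
        a * np.exp(-np.sum((c - mu) ** 2, axis=1) / (2 * s ** 2))
        / (2 * np.pi * s ** 2)
        for a, mu, s in zip(alphas, mus, sigmas)
    ])                                              # (K, H*W): alpha_i * N_i(c)
    p_c_hand = comp[0]                              # alpha_1 * N_1(c)
    p_c_background = comp[1:].sum(axis=0)           # sum_{i>=2} alpha_i * N_i(c)
    prior = p_hand_xy.reshape(-1)
    mask = p_c_hand * prior > p_c_background * (1.0 - prior)
    return mask.reshape(H, W)
```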

3. Hand Detection and Tracking

3.1. Hand Detection around Its Initial Location

In [19], the authors assumed that the wearer's hand is widely open and occupies a central area in each image. In this case, the static and wide-based spatial probability distribution shown in Figure 2 can be considered reasonable. However, such a wide-based distribution is inadequate for the Hand Mouse, because the appearance of the hand is smaller than that assumed in [19]. This inconsistency in appearance can cause incorrect estimation of $E(C_{hand})$ or make hand-pixel classification using (1) difficult to perform.

To simplify this problem, we divide the whole process into two stages: the hand-detection stage and the tracking stage. In the hand-detection stage, we assume that the wearer puts the forefinger's tip into a guide circle, as shown in Figure 5, when using the Hand Mouse interface. Figure 3 shows $P(hand \mid x, y)$ for an initial location obtained from our training data, in which the diameter of the guide circle was set to 25% of the vertical angle of view.

This assumption can be considered reasonable because it makes hand-pixel classification easy to perform, and furthermore it is useful for distinguishing explicit input like the Hand Mouse from autonomous visual context sensing using the wearer's hand motion, as described in [16].

Figure 2: Spatial probability distribution of hand pixels obtained from 660 images such as those in Figure 4.

Figure 3: Spatial probability distribution of hand pixels from 42 images with the forefinger's tip located within the guide circle in Figure 5.

Figure 6 shows the GMM ($K = 5$) for Frame 2 in Figure 7, estimated by the restricted EM algorithm. In this figure, (GMM HAND), (GMM BACKGROUND), (HAND), and (BACKGROUND) indicate the estimated hand color $P(c \mid hand)$, the estimated background color $P(c \mid background)$, the actual hand-color histogram, and the histogram of the input image, respectively. In Figure 7, the estimated hand pixels are overlaid with black pixels.

Figure 4: Training data: (left) input image; (right) hand portion segmented manually.

Figure 5: Hand Mouse Indicators.

3.2. Hand Tracking Based on the Mean Shift Algorithm

As described above, $P(hand \mid x, y)$ in Figure 3 is the spatial probability distribution of a hand at its assumed initial location, and the wearer's hand is detected only around that location with this distribution. Once detected, the hand should be tracked until it disappears out of the frame. However, since there is no constraint on how the hand moves from the initial position, we cannot use $P(hand \mid x, y)$ in Figure 3 as it is.

To overcome this problem, we propose a method that translates $P(hand \mid x, y)$ to the appropriate position derived by the mean shift algorithm [4, 6], which is a simple iterative procedure that climbs the gradient of a probability distribution to find the nearest dominant mode. We combine each iteration of the mean shift algorithm with hand-pixel classification using (1), so that $P(hand \mid x, y)$ can be gradually translated to the current position of the hand.

In the next frame, $E(C_{hand})$ is obtained with $P(hand \mid x, y)$ translated to the mean location of the hand pixels computed in the previous frame, and then the GMM is estimated by the restricted EM algorithm. As a result, this method can track the hand while dynamically updating the hand- and background-color models. In Figure 7, Frames 6, 9, and 12 show that this method could successfully track the hand detected in Frame 2, and Figure 8 shows the trajectory of the pointer that follows the motion of the forefinger's tip.

Figure 6: GMM for Frame 2 in Figure 7.

Figure 7: Estimated hand pixels (Frames 1, 2, 6, 9, and 12) indicated by black pixels.

Figure 8: Trajectory of the pointer.

The algorithm overview for each frame in the hand-tracking stage can be described as follows (a code sketch follows the list):

1. Translate the center of mass of $P(hand \mid x, y)$ to the mean location of the hand pixels computed in the previous frame.

2. Obtain $E(C_{hand})$ from the translated $P(hand \mid x, y)$ and the present video frame, and construct hand- and background-color models with a GMM computed by using the restricted EM algorithm.

3. Using the translated $P(hand \mid x, y)$ and the hand- and background-color models, classify the hand pixels and background pixels based on the Bayes decision criterion (1).

4. Compute the mean location of the classified hand pixels.

5. Translate the center of mass of $P(hand \mid x, y)$ to the mean location computed in Step 4.

6. Repeat Steps 3 to 5 until convergence is achieved (or until the mean location moves less than a preset threshold).
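Under the assumptions introduced earlier, the per-frame loop can be sketched as follows. Here `fit_restricted_gmm` is a hypothetical stand-in for the restricted EM step, `classify_hand_pixels` is the earlier sketch, and the SciPy-based translation of the prior is our choice of mechanism, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import center_of_mass, shift as nd_shift

def track_hand(frame_hs, p_hand_xy, prev_mean, fit_restricted_gmm,
               max_iter=10, eps=1.0):
    """One-frame tracking step combining mean shift with classification (1).

    frame_hs           : (H, W, 2) HS image of the current frame
    p_hand_xy          : (H, W) spatial prior, in its untranslated position
    prev_mean          : (y, x) mean hand location from the previous frame
    fit_restricted_gmm : callable(frame_hs, prior) -> (alphas, mus, sigmas)
    """
    centre = np.array(center_of_mass(p_hand_xy))
    mean = np.asarray(prev_mean, dtype=float)
    for _ in range(max_iter):
        prior = nd_shift(p_hand_xy, mean - centre, order=1)    # Steps 1/5
        alphas, mus, sigmas = fit_restricted_gmm(frame_hs, prior)  # Step 2
        mask = classify_hand_pixels(frame_hs, prior,
                                    alphas, mus, sigmas)       # Step 3
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            break                                              # hand lost
        new_mean = np.array([ys.mean(), xs.mean()])            # Step 4
        if np.linalg.norm(new_mean - mean) < eps:              # Step 6
            mean = new_mean
            break
        mean = new_mean
    return mean
```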

4. Experimental Results

4.1. Preliminary Evaluation

Using a single 933-MHz Pentium III, it took 40-50 msec to convert RGB (320 x 240) into HSI (160 x 120), generate an HS color histogram (64 x 64), estimate the GMM by the restricted EM algorithm ($K = 5$), and fit the simple hand-shape model to the classified hand pixels.

We preliminarily evaluated the accuracy of pointing using 53 sample images taken in several different environments, including the images in Figures 7 and 9(a). In this experiment, the hand shape and state were determined by fitting the simple skeleton model of hand shapes to the classified hand pixels and by using the state transition diagram, as described in [9].

Page 5: The Hand Mouse - 産業技術総合研究所seam.pj.aist.go.jp/papers/distribution/2001/RATFG-RTS... · 2017-03-06 · the Hand Mouse [9] that uses the w earer's hand as a p ointing

The averages of the distance along the x axis, the distance along the y axis, and the Euclidean distance between the estimated location of the forefinger's tip and the ground-truth location measured manually were 1.8%, 4.0%, and 4.7% of the vertical angle of view, respectively. The accuracy along the y axis was greatly affected by the overly simple method we used for determining the hand shape and state.

Figure 9 shows example results of the hand-pixel classification using our method. Though Frames (a)-1 and (a)-11 belonged to the same sequence, and the hand was detected in Frame (a)-1, the highlighted part of the hand could not be classified in Frame (a)-11. As a result, the hand-shape model could not be fit to the classified hand pixels. This means that a single Gaussian model is not necessarily sufficient for the hand-color approximation. In the case of (c), we can see that the short dynamic range of cameras makes the problem difficult.

Figure 9: Estimated hand pixels under different lighting conditions and in different environments (panels: Frames (a)-1, (a)-11, (b), and (c)).

4.2. Online Experiments

We have implemented the Hand Mouse on our Wearable Vision System [1]. Figure 10 shows the wearable apparatus of the system. The wearable camera is located under the wearer's ear, like an earring. The system consists of a wearable client, a vision server (a PC cluster), and a wireless LAN, as shown in Figure 11. We have performed live technical demonstrations of this system dozens of times [1, 3, 9].

A mobile PC in the wearable client captures and compresses each image taken by the wearable camera by using JPEG encoding, and transmits it to the vision server. The wearable client then receives the system output from the server. Although many vision algorithms are computationally heavy for existing stand-alone wearable computers, our wearable system allows us to run different vision tasks in (near) real time, cooperatively, and in parallel.
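The paper does not describe the wire protocol between the client and the vision server, so the following is only a minimal sketch of such a capture-compress-transmit loop, assuming a length-prefixed TCP stream and Pillow for the JPEG encoding.

```python
import io
import socket
import struct
from PIL import Image

def send_frame(sock, frame):
    """JPEG-compress one captured RGB frame and send it length-prefixed."""
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="JPEG", quality=80)
    data = buf.getvalue()
    sock.sendall(struct.pack(">I", len(data)) + data)

def recv_result(sock):
    """Receive the server's length-prefixed reply (e.g. pointer state)."""
    n = struct.unpack(">I", _recv_exact(sock, 4))[0]
    return _recv_exact(sock, n)

def _recv_exact(sock, n):
    chunks = []
    while n:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("server closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)
```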

In our online experiments, the system throughput, which included capturing, compressing, and transmitting an image, performing the Hand Mouse operation, and displaying the system output, was 100-120 msec, and the latency was 300-600 msec (wearable's CPU: 650-MHz Mobile Pentium III; server's CPUs: dual 933-MHz Pentium III). Figures 12 and 1 demonstrate an example of a graphical user interface for a wearable display. The hand-tracking results are also shown in these figures.

Figure 10: Apparatus for our system's wearer.

Figure 11: Wearable client and vision server.

5. Conclusion and Future Work

In this paper, we developed a new approach that effectively combines the dynamic generation of hand- and background-color models with GMMs and the mean shift algorithm into a hand detection and tracking method. The method was applied to the Hand Mouse, which was implemented on our Wearable Vision System.

Figure 12: Soft keyboard pointing.

However, we have not yet rigorously evaluated the performance of our method in classifying hand pixels, nor have we thoroughly considered the convergence of the tracking procedure. Our future work will have to address these issues.

If the background, lighting, or reflections have a similar color, color information is not sufficient to obtain hand regions. We will also focus on improving our method so that it can take advantage of different image features, such as texture and edge information [18].

Acknowledgments: This work is supported in part by the Real World Computing (RWC) Program [2] of METI and also by the Special Coordination Funds for Promoting Science and Technology of MEXT of the Japanese Government.

References

[1] http://www.aist.go.jp/ETL/~7234/visualwearables/.

[2] http://www.rwcp.or.jp/.

[3] 2000 RWC Symposium, http://www.rwcp.or.jp/events/rwc2000/home-e.html.

[4] G. R. Bradski. Computer vision face tracking for use in a perceptual user interface. Intel Technology Journal, Q2, 1998.

[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc., 39(B):1-38, 1977.

[6] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Boston, 1990.

[7] T. Jebara, B. Schiele, N. Oliver, and A. Pentland. DyPERS: Dynamic personal enhanced reality system. Technical Report 463, M.I.T. Media Lab, Perceptual Computing Section, 1998.

[8] M. Kourogi, T. Kurata, K. Sakaue, and Y. Muraoka. A panorama-based technique for annotation overlay and its real-time implementation. In Proc. Int'l Conf. on Multimedia and Expo (ICME2000), TA2.05, 2000.

[9] T. Kurata, T. Okuma, M. Kourogi, and K. Sakaue. The Hand-Mouse: A human interface suitable for augmented reality environments enabled by visual wearables. In Proc. 2nd Int'l Symp. on Mixed Reality (ISMR2001), pages 188-189, 2001.

[10] M. Lamming and M. Flynn. "Forget-me-not": Intimate computing in support of human memory. Technical Report EPC-1994-103, RXRC Cambridge Laboratory, 1994.

[11] S. Mann. Wearable computing: A first step toward personal imaging. Computer, 30(2):25-32, 1997.

[12] W. Mayol, B. Tordoff, and D. Murray. Wearable visual robots. In Proc. 4th Int'l Symp. on Wearable Computers (ISWC2000), pages 95-102, 2000.

[13] T. Okuma, T. Kurata, and K. Sakaue. 3-D annotation of images captured from a wearer's camera based on object recognition. In Proc. 2nd Int'l Symp. on Mixed Reality (ISMR2001), pages 184-185, 2001.

[14] T. Starner, J. Auxier, D. Ashbrook, and M. Gandy. The Gesture Pendant: A self-illuminating, wearable, infrared computer vision system for home automation control and medical monitoring. In Proc. 4th Int'l Symp. on Wearable Computers (ISWC2000), pages 87-94, 2000.

[15] T. Starner, S. Mann, B. Rhodes, J. Levine, J. Healey, D. Kirsch, R. W. Picard, and A. Pentland. Augmented reality through wearable computing. Technical Report 397, M.I.T. Media Lab, Perceptual Computing Section, 1997.

[16] T. Starner, B. Schiele, and A. Pentland. Visual contextual awareness in wearable computing. In Proc. 2nd Int'l Symp. on Wearable Computers (ISWC'98), pages 50-57, 1998.

[17] A. Vardy, J. Robinson, and L.-T. Cheng. The WristCam as input device. In Proc. 3rd Int'l Symp. on Wearable Computers (ISWC'99), pages 199-202, 1999.

[18] Y. Wu and T. S. Huang. View-independent recognition of hand postures. In Proc. IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition (CVPR2000), volume 2, pages 88-94, 2000.

[19] X. Zhu, J. Yang, and A. Waibel. Segmenting hands of arbitrary color. In Proc. 4th Int'l Conf. on Automatic Face and Gesture Recognition (FG2000), pages 446-453, 2000.