Real-Time Fingertip Tracking and Gesture Recognition

Kenji Oka and Yoichi Sato, University of Tokyo
Hideki Koike, University of Electro-Communications, Tokyo

Our hand and fingertip tracking method, developed for augmented desk interface systems, reliably tracks multiple fingertips and hand gestures against complex backgrounds and under dynamic lighting conditions without any markers.



Augmented desk interfaces and other virtual reality systems depend on accurate, real-time hand and fingertip tracking for seamless integration between real objects and associated digital information. We introduce a method for discerning fingertip locations in image frames and measuring fingertip trajectories across image frames. We also propose a mechanism for combining direct manipulation and symbolic gestures based on multiple fingertip motions.

Our method uses a filtering technique, in addition to detecting fingertips in each image frame, to predict fingertip locations in successive image frames and to examine the correspondences between the predicted locations and detected fingertips. This lets us obtain multiple fingertips' trajectories in real time and improves fingertip tracking. This method can track multiple fingertips reliably even on a complex background under changing lighting conditions without invasive devices or color markers.

Distinguishing the thumb lets us differentiate manipulative (extended thumb) from symbolic (folded thumb) gestures. We base this on the observation that users generally use only a thumb and forefinger in fine manipulation. The method then uses the Hidden Markov Model (HMM),1 which interprets hand and finger motions as symbolic events based on a probabilistic framework, to recognize symbolic gestures for application to interactive systems. Other researchers have used HMM to recognize body, hand, and finger motions.2,3

Augmented desk interfaces

Several augmented desk interface systems have been developed recently.4,5 One of the earliest attempts in this domain, DigitalDesk,6 uses a charge-coupled device (CCD) camera and a video projector to let users operate projected desktop applications using a fingertip. Inspired by DigitalDesk, we've developed an augmented desk interface system, EnhancedDesk7 (Figure 1), that lets users perform tasks by manipulating both physical and electronically displayed objects simultaneously with their own hands and fingers.

Figure 2 shows an application of our proposed tracking and gesture recognition methods.8 This two-handed drawing tool assigns different roles to each hand. After selecting radial menus with the left hand, users draw objects or select objects to be manipulated with the right hand. For example, to color an object, a user selects the color menu with the left hand and indicates the object to be colored with the right hand (Figure 2a). The system also uses gesture recognition to let users draw objects such as circles, ellipses, triangles, and rectangles and directly manipulate them using the right hand and fingers (Figure 2b).

Real-time fingertip tracking

This work evolves from other vision-based hand and finger tracking methods (see the "Related Work" sidebar), including our earlier multiple-fingertip tracking method.9

Detecting multiple fingertips in an image frame

We must first extract multiple fingertips in each input image frame in real time.

Extracting hand regions. Extracting hands based on color image segmentation or background subtraction often fails when the scene has a complicated background and dynamic lighting. We therefore use an infrared camera adjusted to measure a temperature range approximating human body temperature (30 to 34 degrees C). This raises pixel values corresponding to human skin above those for other pixels (Figure 3a). Therefore, even with complex backgrounds and changing light, our system easily identifies image regions corresponding to human skin by binarizing the input image with a proper threshold value.


Because hand temperature varies somewhat among people, our system determines an appropriate threshold value for image binarization during initialization by examining the histogram of an image of a user's hand placed open on a desk. It similarly obtains other parameters, such as approximate hand and fingertip sizes.

We then remove small regions from the binarized image and select the two largest regions to obtain an image of both hands.
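As an illustration only (not the authors' implementation), a minimal OpenCV sketch of this extraction step might look like the following; the threshold and minimum-area values are placeholders that would come from the initialization step described above.

```python
# Minimal sketch of the hand-region extraction step, assuming OpenCV/NumPy and
# an 8-bit infrared image in which warmer pixels (skin) have higher values.
# The threshold and minimum-area values are illustrative, not the paper's.
import cv2
import numpy as np

def extract_hand_regions(ir_image: np.ndarray, threshold: int = 200,
                         min_area: int = 2000) -> np.ndarray:
    """Binarize an infrared frame and keep the two largest skin regions."""
    # Binarize: pixels warmer than the calibrated threshold become foreground.
    _, binary = cv2.threshold(ir_image, threshold, 255, cv2.THRESH_BINARY)

    # Label connected components and measure their areas.
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary)

    # Rank foreground components (label 0 is background) by area.
    areas = [(stats[i, cv2.CC_STAT_AREA], i) for i in range(1, n_labels)]
    areas.sort(reverse=True)

    # Keep at most the two largest sufficiently large regions (the two arms).
    mask = np.zeros_like(binary)
    for area, label in areas[:2]:
        if area >= min_area:
            mask[labels == label] = 255
    return mask
```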

Finding fingertips. Once we've found a user's arm regions, including hands, in an input image, we search for fingertips within those regions. This search process is more computationally expensive than arm extraction, so we define search windows for the fingertips rather than searching the entire arm region.

We determine a search window based on arm orientation, which is estimated as the extracted arm region's principal axis from the image moments up to the second order. We then set a fixed-size search window corresponding to the user's hand size so that it includes the hand part of the arm region based on the arm's orientation. The approximate distance from the infrared camera to a user's hand should determine the search window's size.

Figure 1. EnhancedDesk, an augmented desk interface system, applies fingertip tracking and gesture recognition to let users manipulate physical and virtual objects. (Components labeled in the figure: LCD projector for front projection, LCD projector for rear projection, infrared camera, color camera, and plasma display.)

Figure 2. EnhancedDesk's two-handed drawing system: (a), (b).

Figure 3. Fingertip detection: (a), (b), (c).


However, we found that a fixed-size search window works reliably because the distance from the infrared camera to a user's hand on the augmented desk interface system remains relatively constant.
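A rough sketch of this window-placement step, under the assumption that the hand lies toward one end of the arm region along its principal axis, might look like the following; the window size and the offset along the axis are illustrative parameters, not values from the paper.

```python
# Sketch of search-window placement, assuming the binary arm mask from the
# previous step. Window size and the offset along the principal axis are
# illustrative parameters, not values from the paper.
import cv2
import numpy as np

def hand_search_window(arm_mask: np.ndarray, window_size: int = 120,
                       offset: int = 80) -> tuple:
    """Return (x0, y0, x1, y1) of a fixed-size window around the hand."""
    m = cv2.moments(arm_mask, binaryImage=True)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]

    # Principal-axis angle from the second-order central moments.
    theta = 0.5 * np.arctan2(2 * m["mu11"], m["mu20"] - m["mu02"])

    # Assume the hand lies toward the far end of the arm region along this
    # axis; step from the centroid along the axis to center the window there.
    hx = cx + offset * np.cos(theta)
    hy = cy + offset * np.sin(theta)

    half = window_size // 2
    return (int(hx - half), int(hy - half), int(hx + half), int(hy + half))
```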

We then search for fingertips within the new window. A cylinder with a hemispherical cap approximates finger shape, and the projected finger shape in an input image appears to be a rectangle with a semicircle at its tip, so we can search for a fingertip based on its geometrical features. Our method uses normalized correlation with a template of a properly sized circle corresponding to a user's fingertip size.

Although a semicircle reasonably approximates projected fingertip shape, we must consider false detection from the template matching and must also find a sufficiently large number of candidates. Our current implementation selects 20 candidates with the highest matching scores inside each search window, a sample we consider large enough to include all true fingertips.

Once we've selected the fingertip candidates, we remove false candidates using two methods. We remove multiple matching around a fingertip's true location by suppressing neighbor candidates around a candidate with the highest matching score. We then remove matching that occurs in the finger's middle by examining the pixels surrounding a matched template's center. If multiple diagonal pixels lie inside the hand region, we consider the candidate not part of a fingertip and therefore discard it (Figure 3b).
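For illustration, a simplified sketch of the candidate search (normalized correlation with a circular template, followed by greedy neighbor suppression) is shown below; the template radius, candidate count, and suppression radius are assumed values, and the diagonal-pixel test for mid-finger matches is omitted.

```python
# Sketch of fingertip-candidate detection inside a search window, assuming
# OpenCV. Template radius, candidate count, and suppression radius are
# illustrative; the diagonal-pixel test for mid-finger matches is omitted.
import cv2
import numpy as np

def fingertip_candidates(window: np.ndarray, radius: int = 8,
                         n_candidates: int = 20, suppress: int = 10) -> list:
    """Return up to n_candidates (x, y, score) fingertip candidates."""
    # Circular template approximating the projected fingertip.
    size = 2 * radius + 1
    template = np.zeros((size, size), np.uint8)
    cv2.circle(template, (radius, radius), radius, 255, -1)

    # Normalized correlation between the window and the circle template.
    scores = cv2.matchTemplate(window, template, cv2.TM_CCORR_NORMED).copy()

    candidates = []
    for _ in range(n_candidates):
        _, max_val, _, max_loc = cv2.minMaxLoc(scores)
        x, y = max_loc[0] + radius, max_loc[1] + radius  # template center
        candidates.append((x, y, max_val))
        # Suppress neighboring responses so one fingertip yields one candidate.
        cv2.circle(scores, max_loc, suppress, -1.0, -1)
    return candidates
```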

Finding a palm's center. In our method, the center of a user's hand is given as the point whose distance to the closest region boundary is the maximum. This makes the hand's center insensitive to changes such as opening and closing of the hand. We compute this location by a morphological erosion operation on an extracted hand region. First, we obtain a rough shape of the user's palm by cutting out the hand region at the estimated wrist. We assume the wrist's location is at a predetermined distance from the top of the search window and perpendicular to the hand region's principal direction (Figure 3c).


Related Work

Augmented reality systems can use a tracked hand's or fingertip's position as input for direct manipulation. For instance, some researchers have used their tracking techniques for drawing or for 3D graphic object manipulation.1-4

Many researchers have studied and used glove-based devices to measure hand location and shape, especially for virtual reality. In general, glove-based devices measure hand postures and locations with high accuracy and speed, but they aren't suitable for some applications because the cables connected to them restrict hand motion.

This has led to research on and adoption of computer vision techniques. One approach uses markers attached to a user's hands or fingertips to facilitate their detection. While markers help in more reliably detecting hands and fingers, they present obstacles to natural interaction similar to glove-based devices. Another approach extracts image regions corresponding to human skin by either color segmentation or background image subtraction. Because human skin isn't uniformly colored and changes significantly under different lighting conditions, such methods often produce unreliable segmentation of human skin regions. Methods based on background image subtraction also prove unreliable when applied to images with a complex background.

After a system identifies image regions in input images, it can analyze the regions to estimate hand posture. Researchers have developed several techniques to estimate pointing directions of one or multiple fingertips based on 2D hand or fingertip geometrical features.1,2 Another approach used in hand gesture analysis uses a 3D human hand model. To determine the model's posture, this approach matches the model to a hand image obtained by one or more cameras.3,5-7 Using a 3D human hand model solves the problem of self-occlusion, but these methods don't work well for natural or intuitive interactions because they're too computationally expensive for real-time processing and require controlled environments with a relatively simple background.

Pavlovic et al. provide a comprehensive survey of hand tracking methods and gesture analysis algorithms.8

References
1. M. Fukumoto, Y. Suenaga, and K. Mase, "Finger-pointer: Pointing Interface by Image Processing," Computer and Graphics, vol. 18, no. 5, 1994, pp. 633-642.
2. J. Segen and S. Kumar, "Shadow Gestures: 3D Hand Pose Estimation Using a Single Camera," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 99), IEEE Press, Piscataway, N.J., 1999, pp. 479-485.
3. A. Utsumi and J. Ohya, "Multiple-Hand-Gesture Tracking Using Multiple Cameras," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 99), IEEE Press, Piscataway, N.J., 1999, pp. 473-478.
4. J. Crowley, F. Berard, and J. Coutaz, "Finger Tracking as an Input Device for Augmented Reality," Proc. IEEE Int'l Workshop Automatic Face and Gesture Recognition (FG 95), IEEE Press, Piscataway, N.J., 1995, pp. 195-200.
5. J. Rehg and T. Kanade, "Model-Based Tracking of Self-Occluding Articulated Objects," Proc. IEEE Int'l Conf. Computer Vision (ICCV 95), IEEE Press, Piscataway, N.J., 1995, pp. 612-617.
6. N. Shimada et al., "Hand Gesture Estimation and Model Refinement Using Monocular Camera-Ambiguity Limitation by Inequality Constraints," Proc. 3rd IEEE Int'l Conf. Automatic Face and Gesture Recognition (FG 98), IEEE Press, Piscataway, N.J., 1998, pp. 268-273.
7. Y. Wu, J. Lin, and T. Huang, "Capturing Natural Hand Articulation," Proc. IEEE Int'l Conf. Computer Vision (ICCV 01), vol. 2, IEEE Press, Piscataway, N.J., 2001, pp. 426-432.
8. V. Pavlovic, R. Sharma, and T. Huang, "Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, July 1997, pp. 677-695.


We then apply a morphological erosion operator to the obtained shape until the region becomes smaller than a predetermined threshold value. This yields a small region at the palm's center. Finally, the center of the hand region is given as the resulting region's center of mass.
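A minimal sketch of this erosion-based palm-center computation, assuming an OpenCV binary palm mask, might look like this; the structuring element and area threshold are assumed values.

```python
# Sketch of the palm-center step: erode the palm mask until its area falls
# below a threshold, then take the centroid of what remains. Assumes OpenCV;
# the kernel and area threshold are illustrative.
import cv2
import numpy as np

def palm_center(palm_mask: np.ndarray, min_area: int = 100) -> tuple:
    """Return (x, y) of the palm center of a binary palm mask."""
    kernel = np.ones((3, 3), np.uint8)
    region = palm_mask.copy()

    # Erode repeatedly; stop just before the region shrinks below min_area.
    while True:
        eroded = cv2.erode(region, kernel)
        if cv2.countNonZero(eroded) < min_area:
            break
        region = eroded

    # Center of mass of the small remaining region.
    m = cv2.moments(region, binaryImage=True)
    if m["m00"] == 0:
        raise ValueError("empty palm mask")
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])
```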

Measuring fingertip trajectories

We obtain multiple fingertip trajectories by taking correspondences of detected fingertips between successive image frames.

Determining trajectories. Suppose that we detect $n_t$ fingertips in the $t$th image frame $I_t$. We refer to these $n_t$ fingertips' locations as $F_{i,t}$ ($i = 1, 2, \ldots, n_t$), as Figure 4a shows. First, we predict the locations $F'_{i,t+1}$ of fingertips in the next frame $I_{t+1}$. Then we compare the locations $F_{j,t+1}$ ($j = 1, 2, \ldots, n_{t+1}$) of the $n_{t+1}$ fingertips detected in the $(t+1)$th image frame $I_{t+1}$ with the predicted locations $F'_{i,t+1}$ (Figure 4b). Finding the best combination among these two sets of fingertips lets us reliably determine multiple fingertip trajectories in real time (Figure 5).

Predicting fingertip locations. We use the Kalman filter to predict fingertip locations in one image frame based on their locations detected in the previous frame. We apply this process separately for each fingertip.

First, we measure each fingertip's location and velocity in each image frame. Hence we define the state vector $\mathbf{x}_t$ as

$$\mathbf{x}_t = \left( x(t), y(t), v_x(t), v_y(t) \right)^{\mathrm{T}} \tag{1}$$

where $(x(t), y(t))$ is the fingertip's location and $(v_x(t), v_y(t))$ is its velocity in the $t$th image frame. We define the observation vector $\mathbf{y}_t$ to represent the location of the fingertip detected in the $t$th frame. The state vector $\mathbf{x}_t$ and observation vector $\mathbf{y}_t$ are related by the following basic system equations:

$$\mathbf{x}_{t+1} = F\mathbf{x}_t + G\mathbf{w}_t \tag{2}$$

$$\mathbf{y}_t = H\mathbf{x}_t + \mathbf{v}_t \tag{3}$$

where $F$ is the state transition matrix, $G$ is the driving matrix, $H$ is the observation matrix, $\mathbf{w}_t$ is system noise added to the velocity of the state vector $\mathbf{x}_t$, and $\mathbf{v}_t$ is the observation noise, that is, the error between the real and detected locations.

Here we assume approximately uniform straight motion for each fingertip between two successive image frames because the frame interval $\Delta T$ is short. Then $F$, $G$, and $H$ are given as follows:

$$F = \begin{pmatrix} 1 & 0 & \Delta T & 0 \\ 0 & 1 & 0 & \Delta T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{4}$$

$$G = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} \tag{5}$$

$$H = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \tag{6}$$

Figure 4. Taking fingertip correspondences: (a) detecting fingertips and (b) comparing detected and predicted fingertip locations to determine trajectories.

Figure 5. Measuring fingertip trajectories.


The $(x, y)$ coordinates of the state vector $\mathbf{x}_t$ coincide with those of the observation vector $\mathbf{y}_t$ defined with respect to the image coordinate system. This is for simplicity of discussion, without loss of generality. The observation matrix $H$ should be in an appropriate form, depending on the transformation between the world coordinate system defined in the work space (for example, a desktop of our augmented desk interface system) and the image coordinate system.

Also, we assume that both the system noise $\mathbf{w}_t$ and the observation noise $\mathbf{v}_t$ are constant Gaussian noise with a zero mean. Thus the covariance matrices for $\mathbf{w}_t$ and $\mathbf{v}_t$ become $\sigma_w^2 I_{2\times 2}$ and $\sigma_v^2 I_{2\times 2}$ respectively, where $I_{2\times 2}$ represents a $2 \times 2$ identity matrix. This is a rather coarse approximation, and those two noise components should be estimated for each image frame based on some clue such as a matching score for normalized correlation for template matching. We plan to study this in the future.

Finally, we formulate a Kalman filter as

$$K_t = \tilde{P}_t H^{\mathrm{T}} \left( I_{2\times 2} + H \tilde{P}_t H^{\mathrm{T}} \right)^{-1} \tag{7}$$

$$\tilde{\mathbf{x}}_{t+1} = F \left\{ \tilde{\mathbf{x}}_t + K_t \left( \mathbf{y}_t - H \tilde{\mathbf{x}}_t \right) \right\} \tag{8}$$

$$\tilde{P}_{t+1} = F \left( \tilde{P}_t - K_t H \tilde{P}_t \right) F^{\mathrm{T}} + \frac{\sigma_w^2}{\sigma_v^2} \Lambda \tag{9}$$

where $\tilde{\mathbf{x}}_t$ equals $\hat{\mathbf{x}}_{t|t-1}$, the estimated value of $\mathbf{x}_t$ from $\mathbf{y}_0, \ldots, \mathbf{y}_{t-1}$; $\tilde{P}_t$ equals $\hat{\Sigma}_{t|t-1} / \sigma_v^2$, where $\hat{\Sigma}_{t|t-1}$ represents the covariance matrix of the estimation error of $\hat{\mathbf{x}}_{t|t-1}$; $K_t$ is the Kalman gain; and $\Lambda$ equals $GG^{\mathrm{T}}$.

Then the predicted location of the fingertip in the $(t+1)$th image frame is given as $(x(t+1), y(t+1))$ of $\tilde{\mathbf{x}}_{t+1}$. If we need a predicted location after more than one image frame, we can calculate it as follows:

$$\hat{\mathbf{x}}_{t+m|t} = F^m \left\{ \tilde{\mathbf{x}}_t + K_t \left( \mathbf{y}_t - H \tilde{\mathbf{x}}_t \right) \right\} \tag{10}$$

$$\hat{P}_{t+m|t} = F^m \left( \tilde{P}_t - K_t H \tilde{P}_t \right) \left( F^{\mathrm{T}} \right)^m + \frac{\sigma_w^2}{\sigma_v^2} \sum_{l=0}^{m-1} F^l \Lambda \left( F^{\mathrm{T}} \right)^l \tag{11}$$

where $\hat{\mathbf{x}}_{t+m|t}$ is the estimated value of $\mathbf{x}_{t+m}$ from $\mathbf{y}_0, \ldots, \mathbf{y}_t$; $\hat{P}_{t+m|t}$ equals $\hat{\Sigma}_{t+m|t} / \sigma_v^2$; and $\hat{\Sigma}_{t+m|t}$ represents the covariance matrix of the estimation error of $\hat{\mathbf{x}}_{t+m|t}$.
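To make the recursion concrete, a NumPy sketch of the per-fingertip constant-velocity predictor of Equations 7 through 9 might look like the following; the frame interval and the noise ratio $\sigma_w^2 / \sigma_v^2$ are assumed values, and the class name is illustrative.

```python
# NumPy sketch of the per-fingertip constant-velocity Kalman predictor in
# Equations 7-9. The frame interval and the noise ratio sigma_w^2/sigma_v^2
# are illustrative values, not the paper's.
import numpy as np

DT = 1.0 / 30.0          # assumed frame interval (seconds)
NOISE_RATIO = 0.1        # assumed sigma_w^2 / sigma_v^2

F = np.array([[1, 0, DT, 0],
              [0, 1, 0, DT],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
G = np.array([[0, 0], [0, 0], [1, 0], [0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
LAMBDA = G @ G.T

class FingertipKalman:
    """Tracks one fingertip; x holds (x, y, vx, vy), P the normalized covariance."""
    def __init__(self, x0, y0):
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4)

    def step(self, measurement):
        """Fold in the detected location y_t and predict the next frame (Eqs. 7-9)."""
        y = np.asarray(measurement, dtype=float)
        K = self.P @ H.T @ np.linalg.inv(np.eye(2) + H @ self.P @ H.T)       # Eq. 7
        self.x = F @ (self.x + K @ (y - H @ self.x))                         # Eq. 8
        self.P = F @ (self.P - K @ H @ self.P) @ F.T + NOISE_RATIO * LAMBDA  # Eq. 9
        return self.x[:2]   # predicted (x, y) for frame t+1
```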

Fingertip correspondences between successive frames. For each image frame, we detect fingertips as described earlier and examine correspondences between the detected fingertips' locations and the predicted fingertip locations from Equation 8 or 10.

More precisely, we compute the sum of the squared distances between detected and predicted fingertips for all possible combinations and consider the combination with the least sum to be the most reasonable.

To avoid a high computational cost for examining all possible combinations, we reduce the number of combinations by considering the clockwise (or counterclockwise) fingertip order around the hand's center (Figure 6a). In other words, we assume the fingertip order in input images doesn't change. For instance, in Figure 6a, we consider only three combinations (see the sketch after this list):

- O1–∆1 and O2–∆2
- O1–∆1 and O2–∆3
- O1–∆2 and O2–∆3

This reduces the maximum number of possible combinations from ${}_5P_5 = 120$ to ${}_5C_5 = 1$.
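As an illustration of this order-preserving search (not the authors' code), the following sketch enumerates only matchings that keep the fingertip order and picks the one with the least sum of squared distances; the function name and interface are hypothetical, and both input lists are assumed to be already sorted by angle around the palm center.

```python
# Sketch of the order-preserving correspondence search. Both lists are assumed
# to be sorted by (counter)clockwise angle around the hand's center; the
# function name and interface are illustrative.
from itertools import combinations
import numpy as np

def match_fingertips(predicted, detected):
    """Return index pairs (i_pred, j_det) minimizing the sum of squared
    distances, considering only matchings that preserve fingertip order."""
    predicted = np.asarray(predicted, dtype=float)
    detected = np.asarray(detected, dtype=float)
    n_pred, n_det = len(predicted), len(detected)
    k = min(n_pred, n_det)

    best_cost, best_pairs = np.inf, []
    # Choose which k predictions and which k detections participate, in order.
    for pred_idx in combinations(range(n_pred), k):
        for det_idx in combinations(range(n_det), k):
            diffs = predicted[list(pred_idx)] - detected[list(det_idx)]
            cost = float(np.sum(diffs ** 2))
            if cost < best_cost:
                best_cost, best_pairs = cost, list(zip(pred_idx, det_idx))
    return best_pairs
```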

Occasionally, the system doesn't detect one or more fingertips in an input image frame. Figure 6b illustrates an example where an error prevents detection of the thumb and little finger. To improve our method's reliability for tracking multiple fingertips, we use a missing fingertip's predicted location to continue tracking it. If we find no fingertip corresponding to the predicted one, we examine the element (1, 1) of the covariance matrix $\tilde{P}_{t+1}$ in Equation 9 for the predicted fingertip. This element represents the ambiguity of the predicted fingertip's location; if it's smaller than a predetermined ambiguity threshold, we consider the fingertip to be undetected because of an image frame error. We then use the fingertip's predicted location as its true location and continue tracking it.

If the element (1, 1) of the covariance matrix exceeds a predetermined threshold, we determine that the fingertip prediction is unreliable and terminate its tracking.


Figure 6. Correspondences of detected and predicted fingertips: (a) fingertip order and (b) incorrect thumb and finger detection.


Our current implementation fixes an experimentally chosen ambiguity threshold.

If we detect more fingertips than predicted, we start tracking a fingertip that doesn't correspond to any of the predictions. We treat its trajectory as that of a new fingertip after the predicted fingertip location's ambiguity falls below a predetermined threshold.
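A rough sketch of this track-maintenance rule, reusing the FingertipKalman sketch above, could look like the following; the ambiguity threshold value and the function are hypothetical, since the paper fixes the threshold experimentally.

```python
# Sketch of the track-maintenance rule, reusing the FingertipKalman sketch
# above. The ambiguity threshold is illustrative; the paper fixes it
# experimentally.
AMBIGUITY_THRESHOLD = 4.0   # assumed bound on element (1, 1) of P~

def update_track(track, detection):
    """Advance one fingertip track; return False if the track should end."""
    if detection is not None:
        track.step(detection)              # matched detection: normal update
        return True
    ambiguity = track.P[0, 0]              # element (1, 1) of the covariance
    if ambiguity < AMBIGUITY_THRESHOLD:
        # Missed detection but prediction still confident: continue by using
        # the predicted location as the true location (fed back as measurement).
        track.step(track.x[:2].copy())
        return True
    return False                           # prediction too uncertain: drop track
```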

Evaluating the tracking method

To test our method, we experimentally evaluated the reliability improvement gained by considering fingertip correspondences between successive image frames, using seven test subjects. Our tracking system consists of a Linux-based PC with an Intel Pentium III 500-MHz processor and a Hitachi IP5000 image processing board, and a Nikon Laird-S270 infrared camera.

We asked test subjects to move their hands naturally on our augmented desk interface system while keeping the number of extended fingers constant in each trial. In the first trial, subjects moved their hands with one extended finger for 30 seconds, then extended two, three, four, and finally five fingers. Each trial lasted 30 seconds and produced about 900 image frames. To ensure a fair comparison, we first recorded the infrared camera output using a video recorder, then applied our method to the recorded video.

We compared our tracking method's reliability with and without correspondences between successive image frames. Figure 7 shows the results, with bar charts indicating the average rate at which the number of tracked fingertips was correct and line charts indicating the lowest rate among the seven test subjects.

As Figure 7 shows, tracking reliability improves significantly when we account for fingertip correspondences between image frames. In particular, tracking accuracy approaches 100 percent for one or two fingers, and the lowest rate also improves. Our method reliably tracks multiple fingertips and could prove useful in real-time human–computer interaction applications.

Gesture recognition

Our tracking method also works well for gesture recognition and lets us achieve interactions based on symbolic gestures while we perform direct manipulation with our hands and fingers, using the mechanism shown in Figure 8. First, our system determines from measured fingertip trajectories whether a user's hand motions represent direct manipulation or symbolic gestures. For direct manipulation, it then selects operating modes such as rotate, move, or resize based on the distance between two fingertips or the number of extended fingers, and controls the selected modes' parameters. For symbolic gestures, the system recognizes gesture types using a symbolic gesture recognizer, in addition to recognizing gesture locations and sizes based on trajectories.


Figure 7. Finger tracking evaluation: accuracy (percent, 85 to 100) versus number of extended fingers (1 to 5), comparing the average and lowest rates of the methods with and without fingertip correspondence.

Figure 8. Interaction based on direct manipulation and symbolic gestures. The tracking result (number of fingers, location, trajectory) feeds a gesture mode selector, which routes direct manipulation to an operating mode selector (rotate, resize, and so on) and an application controller for direct manipulation, and routes symbolic gestures to a symbolic gesture recognizer and an application controller for symbolic gestures, passing the operating mode, the kind of gesture, and the gesture's size and location to the application.

To distinguish symbolic gestures from direct manipulation, our system locates a thumb in the measured trajectories. As described earlier, we regard gestures with an extended thumb as direct manipulation and those with a bent thumb as symbolic gestures, and we use the HMM to recognize the segmented symbolic gestures. Our gesture recognition system should prove useful for augmented desk interface applications such as the drawing tool shown in Figure 2.

Detecting a thumb

Our method for distinguishing a thumb from the other tracked fingertips uses the angle θ between the finger direction (the direction from the hand's center to the finger's base) and the arm orientation (Figure 9). We use the finger's base because it's more stable than the tip even if the finger moves.

In the initialization stage, we define the thumb's standard angle $\theta_T$ and that of the forefinger, $\theta_F$ ($\theta_T > \theta_F$). First, we apply the morphological process and the image subtraction to a binarized hand image to extract finger regions. We regard the end of the extracted finger opposite the fingertip as the base of the finger and calculate θ.

Here we define $\theta_k$ as θ in the $k$th frame from the finger trajectory's origin, with the current frame being the $N$th frame from the origin. Then the score $s_T$, which represents a thumb's likelihood, is given as follows:

$$s_T = \frac{1}{N} \sum_{k=1}^{N} s'_T(k) \tag{12}$$

$$s'_T(k) = \begin{cases} 1.0 & \text{if } \theta_k > \theta_T \\[4pt] \dfrac{\theta_k - \theta_F}{\theta_T - \theta_F} & \text{if } \theta_F \le \theta_k \le \theta_T \\[4pt] 0.0 & \text{if } \theta_k < \theta_F \end{cases} \tag{13}$$

If $s_T$ exceeds 0.5, we regard the finger as a thumb.
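A direct sketch of Equations 12 and 13 is shown below; the function names are illustrative, and the angles are assumed to share one unit (for example, radians).

```python
# Sketch of the thumb-likelihood score in Equations 12 and 13. The angles are
# in the same units (e.g., radians); theta_t and theta_f are the standard thumb
# and forefinger angles measured during initialization.
def thumb_frame_score(theta_k: float, theta_t: float, theta_f: float) -> float:
    """Per-frame score s'_T(k) from Equation 13."""
    if theta_k > theta_t:
        return 1.0
    if theta_k < theta_f:
        return 0.0
    return (theta_k - theta_f) / (theta_t - theta_f)

def thumb_score(thetas, theta_t: float, theta_f: float) -> float:
    """Trajectory score s_T from Equation 12; thetas holds theta_1 .. theta_N."""
    return sum(thumb_frame_score(t, theta_t, theta_f) for t in thetas) / len(thetas)

def is_thumb(thetas, theta_t: float, theta_f: float) -> bool:
    """A finger is regarded as a thumb when s_T exceeds 0.5."""
    return thumb_score(thetas, theta_t, theta_f) > 0.5
```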

To evaluate this method's reliability, we performed three kinds of tests mimicking actual desktop work: drawing with only a thumb, picking up with a thumb and forefinger, and drawing with only a forefinger. Table 1 shows the results, demonstrating that the method reliably distinguishes the thumb from other hand parts.

Table 1. Evaluating thumb distinction.

Task                           Drawing with   Picking Up with        Drawing with
                               Thumb Only     Thumb and Forefinger   Forefinger Only
Average (percent)              98.2           99.4                   98.3
Standard deviation (percent)   3.6            0.8                    4.6

Symbolic gesture recognition

Like other recognition techniques,2,3 our symbolic gesture recognition system uses HMM. The input to the HMM consists of two components for recognizing multiple fingertip trajectories: the number of tracked fingertips and a discrete code from 1 to 16 that represents the direction of the tracked fingertips' average motion. It's unlikely that we would move each of our fingers independently unless we consciously tried to do so. Thus, we decided to use the direction of the multiple fingertips' average motion instead of each fingertip's direction. We used code 17 to represent no motion.
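For illustration, one way to produce this direction code from per-fingertip frame-to-frame displacements is sketched below; the stillness threshold is an assumed value, not one given in the paper.

```python
# Sketch of the HMM input encoding: quantize the fingertips' average motion
# into direction codes 1-16, with 17 meaning "no motion". The stillness
# threshold (in pixels per frame) is an illustrative value.
import numpy as np

def direction_code(displacements, still_thresh: float = 1.0) -> int:
    """displacements: list of per-fingertip (dx, dy) between two frames."""
    avg = np.mean(np.asarray(displacements, dtype=float), axis=0)
    if np.hypot(avg[0], avg[1]) < still_thresh:
        return 17                                     # no motion
    angle = np.arctan2(avg[1], avg[0]) % (2 * np.pi)  # 0 .. 2*pi
    return int(angle / (2 * np.pi / 16)) + 1          # codes 1 .. 16
```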

We tested our recognition system using 12 kinds of hand gestures with the fingertip trajectories shown in Figure 10. As a training data set for each gesture, we used 80 hand gestures made by a single person to initialize the HMM. Six other people also participated in testing. For each trial, a test subject made one of the 12 gestures 20 times at arbitrary locations and with arbitrary sizes.




Figure 9. Definition of the angle θ between finger direction and arm orientation, for thumb detection.

Figure 10. Symbolic gesture examples.


Table 2 shows this experiment's results, indicating the average accuracy and standard deviation for single-finger and double-finger gestures.

Table 2. Evaluating gesture recognition.

Gesture Type                   Single-Finger   Double-Finger
Average (percent)              99.2            97.5
Standard deviation (percent)   0.5             1.8

Our system offers reliable, near-perfect recognition of single-finger gestures and high accuracy for double-finger gestures. Our gesture recognition system proves suitable for natural interactions using hand gestures.

Future work

We plan to improve our tracking method's reliability by incorporating additional sensors. Although using an infrared camera had some advantages, it didn't work well on cold hands. We'll solve this problem by using a color camera in addition to the infrared camera.

We also plan to extend our system for 3D tracking. Currently, our tracking method is limited to 2D motion on a desktop. Although this is enough for our augmented desk interface system, other application types require interaction based on 3D hand and finger motion. We'll therefore investigate a practical 3D hand and finger tracking technique using multiple cameras.

References
1. L. Rabiner and B. Juang, "An Introduction to Hidden Markov Models," IEEE Acoustic Signal and Speech Processing (ASSP), vol. 3, no. 1, Jan. 1986, pp. 4-16.
2. T. Starner and A. Pentland, "Visual Recognition of American Sign Language Using Hidden Markov Models," Proc. IEEE Int'l Workshop Automatic Face and Gesture Recognition (FG 95), IEEE Press, Piscataway, N.J., 1995, pp. 189-194.
3. J. Martin and J. Durand, "Automatic Handwriting Gestures Recognition Using Hidden Markov Models," Proc. 4th IEEE Int'l Conf. Automatic Face and Gesture Recognition (FG 2000), IEEE Press, Piscataway, N.J., 2000, pp. 403-409.
4. J. Underkoffler and H. Ishii, "Illuminating Light: An Optical Design Tool with a Luminous-Tangible Interface," Proc. ACM Conf. Human Factors and Computing Systems (CHI 98), ACM Press, New York, 1998, pp. 542-549.
5. J. Rekimoto and M. Saito, "Augmented Surfaces: A Spatially Continuous Work Space for Hybrid Computing Environments," Proc. ACM Conf. Human Factors and Computing Systems (CHI 99), ACM Press, New York, 1999, pp. 378-385.
6. P. Wellner, "Interacting with Paper on the DigitalDesk," Comm. ACM, vol. 36, no. 7, July 1993, pp. 87-96.
7. H. Koike et al., "Interactive Textbook and Interactive Venn Diagram: Natural and Intuitive Interface on Augmented Desk System," Proc. ACM Conf. Human Factors and Computing Systems (CHI 2000), ACM Press, New York, 2000, pp. 121-128.
8. X. Chen et al., "Two-Handed Drawing on Augmented Desk System," Proc. Int'l Working Conf. Advanced Visual Interfaces (AVI 2002), ACM Press, New York, 2002, pp. 219-222.
9. Y. Sato, Y. Kobayashi, and H. Koike, "Fast Tracking of Hands and Fingertips in Infrared Images for Augmented Desk Interface," Proc. 4th IEEE Int'l Conf. Automatic Face and Gesture Recognition (FG 2000), IEEE Press, Piscataway, N.J., 2000, pp. 462-467.

Kenji Oka is a PhD candidate at the University of Tokyo Graduate School of Information Science and Technology. His research interests include human–computer interaction and computer vision, particularly perceptual user interfaces and human behavior understanding. He received BS and MS degrees in information and communication engineering from the University of Tokyo.

Yoichi Sato is an associate professor at the University of Tokyo Institute of Industrial Science. His primary research interests are in computer vision (physics-based vision, image-based modeling), human–computer interaction (perceptual user interfaces), and augmented reality. He received a BS in mechanical engineering from the University of Tokyo and an MS and PhD in robotics from the School of Computer Science, Carnegie Mellon University. He is a member of IEEE and ACM.

Hideki Koike is an associate professor at the Graduate School of Information Systems, University of Electro-Communications, Tokyo. His research interests include information visualization and vision-based human–computer interaction for perceptual user interfaces. He received a BS in mechanical engineering and an MS and Dr.Eng. in information engineering from the University of Tokyo. He is a member of IEEE and ACM.

Readers may contact Kenji Oka at the Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan, email [email protected].

For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.

