
Real-time Hand Tracking using Modificated Flocks of Features Algorithm

Andrej Fogelton∗

Institute of Applied Informatics
Faculty of Informatics and Information Technologies
Slovak University of Technology in Bratislava
Ilkovičova 3, 842 16 Bratislava, Slovakia

[email protected]

Abstract
There is a growing demand to interact with computers in a more natural way. For example, using hand gestures to interact with certain types of applications could be more efficient than the traditional mouse and keyboard. To achieve this, we need to be able to track the human hand efficiently in real time. We focus on the Flocks of Features algorithm introduced by Mathias Kölsch and Matthew Turk, which can track a human hand continuously during various movements and pose variations. It uses the Lucas-Kanade tracker for features located on the hand. The algorithm can handle fast movements of non-rigid, highly articulated objects such as hands. We propose modifications to this algorithm that mainly concern preprocessing the input frame with histogram back projection of the skin color. According to our testing, this modification provides more reliable feature tracking, which results in better hand tracking performance.

Categories and Subject Descriptors
I.5.3 [Clustering]: Density Based, Flock Behavioral Model; I.4.6 [Segmentation]: Histogram Matching, Adaptive Histogram

Keywords
Lucas-Kanade Tracker, Histogram Matching, Flocks of Features, Real-time Hand Tracking, Image Segmentation

∗ Master degree study programme in the field of Software Engineering. Supervisor: Matej Makula, Institute of Applied Informatics, Faculty of Informatics and Information Technologies, STU in Bratislava.
Work described in this paper was presented at the 7th Student Research Conference in Informatics and Information Technologies IIT.SRC 2011.

© Copyright 2011. All rights reserved. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from STU Press, Vazovova 5, 811 07 Bratislava, Slovakia.
Fogelton, A. Real-time Hand Tracking using Modificated Flocks of Features Algorithm. Information Sciences and Technologies Bulletin of the ACM Slovakia, Special Section on Student Research in Informatics and Information Technologies, Vol. 3, No. 2 (2011) 1-5

1. Introduction
In the last few years there has been a desire to control computers in a more interactive way than with just a mouse and keyboard. One of the pioneers of this new kind of interaction was the Nintendo Wii¹, which uses infrared LEDs and an infrared camera with a proximity sensor. This device is used in a game console and offers a completely new gaming experience. For example, you can play tennis by holding the Wii remote controller in your hand instead of a tennis racket, and play a match against the computer or against a friend on the other side of the world. But what about this kind of interaction without any accessories, using just one's own hands? We assume that in several cases using hands to interact with the computer would be much more efficient than the traditional mouse and keyboard.

To handle highly articulated objects such as hands effectively in the most common situations with an arbitrary background, a tracker has to meet several requirements: background invariance, no gloves or other markers, illumination invariance, the ability to track both hands of the user in real time, hand shape (pose) invariance, and hand size invariance. There is a variety of hand tracking algorithms. We focused on Flocks of Features (FoF) because it is one of the few that can handle the majority of these requirements.

The original FoF algorithm is described in Section 2. Section 3 presents our modifications to this algorithm, which overcome some of its difficulties. Testing results are described in Section 4, and the conclusion with future work is given in the last section.

2. Flocks of Features
Mathias Kölsch and Matthew Turk presented Flocks of Features in the paper Fast 2D hand tracking with flocks of features and multi-cue integration [2]. This algorithm can track the human hand without any artificial markers such as gloves. It is robust to various lighting conditions and, furthermore, a non-stationary camera can be used. The tracker’s core idea is motivated by the seemingly chaotic flight behavior of a flock of birds [5], such as pigeons, in which a minimum and a maximum safe distance are maintained during flight. Features of the hand are likewise close together, like birds in a flock (Figure 1). These features are selected as good-features-to-track [6] and tracked by the Lucas-Kanade tracker (often called the KLT tracker after Kanade, Lucas and Tomasi). The minimum distance between any two features and the maximum distance from the center (median) are defined. The median position of the features is computed, and the optical flow search is performed only up to the maximum distance from this position. Very good results are achieved during rapid movements and continuous pose changes of the human hand. An overview of the entire algorithm is given in Listing 1.

¹ http://www.nintendo.com/wii

Listing 1: Flocks of features algorithm [3].

    input:
        bnd_box - rectangular area containing hand
        mindist - minimum pixel distance between features
        n       - number of features to track
        winsize - size of feature search windows

    initialization:
        learn color histogram
        find n*k good-features-to-track with mindist   // k=3 was used
        rank them based on color and fixed hand mask
        pick the n highest-ranked features

    tracking:
        update KLT feature locations with image pyramids
        compute median feature
        for each feature
            if less than mindist from any other feature
               or outside bnd_box, centered at median
               or low match correlation
            then relocate feature onto good color spot that
                 meets the flocking conditions

    output:
        median - the average feature location
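The flocking test inside the tracking loop of Listing 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the square box described by a hypothetical `half_box` parameter, and the sample coordinates are not from the paper, and the low match correlation test is left out.

```python
import math
from statistics import median


def flock_median(features):
    """Component-wise median of a list of (x, y) feature positions."""
    return (median(x for x, _ in features), median(y for _, y in features))


def violates_flocking(i, features, center, mindist, half_box):
    """True if feature i is outside the box centered at the median,
    or closer than mindist to any other feature."""
    x, y = features[i]
    if abs(x - center[0]) > half_box or abs(y - center[1]) > half_box:
        return True
    return any(
        j != i and math.dist(features[i], other) < mindist
        for j, other in enumerate(features)
    )


# Hypothetical feature positions in pixels.
features = [(10, 10), (11, 10), (30, 12), (90, 90), (20, 25)]
center = flock_median(features)
to_relocate = [
    i for i in range(len(features))
    if violates_flocking(i, features, center, mindist=3, half_box=25)
]
```

Here both members of a pair that is too close are flagged. A real tracker would probably relocate only one of them; the listing does not say which.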

2.1 Lucas-Kanade Tracker
The Lucas-Kanade tracking algorithm calculates a brightness gradient (Sobel operator) along at least two directions for a promising feature candidate to be tracked over time [6, 7]. In combination with image pyramids (a series of progressively smaller-resolution interpolations of the original image), a feature’s image area can be matched efficiently to the most similar area within a search window in the following video frame. If the feature match correlation between two consecutive frames is below a threshold, the feature is considered “lost”. A hand detection method supplies both a rectangular bounding box and a probability distribution to initialize tracking.
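The "lost feature" test can be illustrated with a small sketch. The paper does not say which correlation measure or threshold is used, so the zero-mean normalized cross-correlation and the threshold of 0.8 below are assumptions chosen for illustration, as are the function names.

```python
import math


def normalized_correlation(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equal-size gray patches."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0:
        return 0.0  # at least one patch is flat, so there is no texture to match
    return sum(x * y for x, y in zip(da, db)) / denom


def is_lost(patch_prev, patch_next, threshold=0.8):
    """A feature counts as lost when the match correlation drops below threshold."""
    return normalized_correlation(patch_prev, patch_next) < threshold


patch = [[10, 20], [30, 40]]
brighter = [[15, 25], [35, 45]]   # same texture, uniformly brighter
inverted = [[40, 30], [20, 10]]   # texture reversed
```

With this measure a uniform brightness shift leaves the correlation at 1, while a flat patch is always reported as lost.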

Figure 1: Snapshots of sequences with hand motions; the cloud of small dots marks the features and the big dot is their median [2].

The probability mask states, for every pixel in the bounding box, the likelihood that it belongs to the hand. Features are selected within the bounding box according to their ranking while observing a pairwise minimum distance. Features are ranked according to the combined probability of their location and color. Highly ranked features are tracked individually from frame to frame. The new location of each feature is the area with the highest match correlation between the two frames.

Individual features can latch onto arbitrary artifacts of the tracked object, such as the fingers of a hand. They move independently along with the artifact, without disturbing other features. Overly dense concentrations of features, which would ignore other parts of the object, are avoided by the minimum distance constraint, while stray features that drift too far from the object of interest are brought back into the flock by the maximum distance constraint. To get more stable results, about 15% of the features furthest from the median are removed from the median computation. The speed of pyramid-based KLT [6, 7] feature tracking overcomes the computational limitations of model-based tracking approaches and achieves real-time performance.
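Excluding the furthest features from the median computation might look like the sketch below. The two-pass scheme (a first median, then a second one without the outliers) and the name `robust_flock_center` are assumptions; the text only states that about 15% of the furthest features are removed.

```python
import math
from statistics import median


def robust_flock_center(features, drop_fraction=0.15):
    """Median of the features after dropping the fraction furthest from a first median."""
    cx = median(x for x, _ in features)
    cy = median(y for _, y in features)
    by_distance = sorted(features, key=lambda p: math.dist(p, (cx, cy)))
    keep = max(len(features) - int(len(features) * drop_fraction), 1)
    kept = by_distance[:keep]
    return (median(x for x, _ in kept), median(y for _, y in kept))


# Six features on the hand and one stray feature far away.
features = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (100, 100)]
center = robust_flock_center(features)
```

With seven features, one (the stray feature) is dropped before the final median is taken.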

2.2 Color Classification
During the calibration process, the hand color is observed and a normalized-RGB histogram is calculated. Using this technique exclusively is not a good solution, because it also detects objects with similar color histograms, such as wooden objects or other parts of the human body. The color information is therefore used as a probability map. At tracker initialization, the KLT features are placed preferably at locations with high skin color probability. The new location of a relocated feature is chosen where the color probability is high (more than 50%). Changing lighting conditions can degrade tracking performance, but only for relocated features, because most features will continue to follow gray-level artifacts. The method thus combines cues from feature movement based on gray-level image texture with cues from texture-less skin color probability. How often features are relocated, and how important the color modality is, depends on the algorithm parameters.
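A normalized-RGB color histogram used as a skin probability lookup can be sketched as follows. The bin count, the scaling to a peak of 1.0, and all function names are assumptions made for this example, not details taken from the original tracker.

```python
def normalized_rg(pixel):
    """Map an (R, G, B) pixel to normalized (r, g); b = 1 - r - g is redundant."""
    r, g, b = pixel
    total = r + g + b
    if total == 0:
        return (1 / 3, 1 / 3)  # treat pure black as neutral
    return (r / total, g / total)


def rg_bin(pixel, bins):
    r, g = normalized_rg(pixel)
    return (min(int(r * bins), bins - 1), min(int(g * bins), bins - 1))


def learn_histogram(pixels, bins=16):
    """Histogram over normalized (r, g), scaled so that the largest bin is 1.0."""
    hist = [[0] * bins for _ in range(bins)]
    for p in pixels:
        i, j = rg_bin(p, bins)
        hist[i][j] += 1
    peak = max(max(row) for row in hist) or 1
    return [[count / peak for count in row] for row in hist]


def skin_probability(hist, pixel):
    i, j = rg_bin(pixel, len(hist))
    return hist[i][j]


# Hypothetical calibration samples taken from the hand region.
samples = [(200, 120, 100)] * 10 + [(180, 110, 90)] * 5
hist = learn_histogram(samples)
```

A pixel such as (100, 60, 50), which has the same chromaticity at half the brightness, falls into the same bin as the calibration samples. This is the sense in which normalized RGB tolerates luminance changes, and a relocation candidate would be accepted where the looked-up value exceeds 0.5.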

This algorithm was used to interact with a wearable computer [4]. A webcam was mounted on the head-mounted display, so the hand size was approximately constant. The algorithm can be used to track both hands [1] or even other objects, where the skin color is replaced by a given sample. Its drawback is that it is not size invariant, due to the constant threshold for the maximum distance from the center of the flock.

3. Modifications
The original FoF uses a gray-level image as the input for KLT feature tracking. Because of this, we found the FoF algorithm to be vulnerable to edges in the background. During movements over strong edges, many KLT features can be relocated to incorrect positions. This shifts the median incorrectly and eventually leads to tracking failure (Figure 2).

Figure 2: The FoF algorithm is vulnerable to edges in the background because of bad KLT feature relocation.

A histogram is calculated from a region containing the hand during the initialization procedure. We process the histogram to eliminate noise. We assume that skin color makes up the majority of the given region; the histogram is normalized and thresholded, and then cropped to contain only the biggest Gaussian-shaped peak (which represents the skin color in our case). During tracking, we create a probability map for every frame by back projecting this histogram. One slight difference from the original algorithm is the use of the HSV color model instead of normalized RGB, because HSV is more common in histogram matching applications. The HSV color model and normalized RGB have similar characteristics in terms of luminance invariance.
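Histogram back projection over hue and saturation can be sketched with the Python standard library alone. The bin count, the noise threshold of 0.1, and the function names are assumptions, and the cropping to the biggest Gaussian-shaped peak is omitted for brevity.

```python
import colorsys


def hs_bin(pixel, bins):
    """Hue/saturation bin of an 8-bit (R, G, B) pixel; the Value channel is ignored."""
    r, g, b = (c / 255 for c in pixel)
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)
    return (min(int(h * bins), bins - 1), min(int(s * bins), bins - 1))


def hs_histogram(pixels, bins=8, noise_threshold=0.1):
    """Histogram scaled to a peak of 1.0, with weak (noisy) bins set to zero."""
    hist = [[0.0] * bins for _ in range(bins)]
    for p in pixels:
        hb, sb = hs_bin(p, bins)
        hist[hb][sb] += 1
    peak = max(max(row) for row in hist) or 1.0
    return [[c / peak if c / peak >= noise_threshold else 0.0 for c in row]
            for row in hist]


def back_project(image, hist):
    """Replace every pixel by the histogram value of its hue/saturation bin."""
    bins = len(hist)
    out = []
    for row in image:
        out_row = []
        for pixel in row:
            hb, sb = hs_bin(pixel, bins)
            out_row.append(hist[hb][sb])
        out.append(out_row)
    return out


skin, blue = (200, 120, 100), (0, 0, 255)
# Hypothetical hand region: mostly skin plus one background pixel (noise).
hist = hs_histogram([skin] * 20 + [blue])
prob_map = back_project([[skin, blue]], hist)
```

The single blue sample falls below the noise threshold, so blue pixels map to zero probability in the resulting map.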

3.1 Image Processing
We realized that if we use the probability map instead of the original image, we get rid of the edges in the background, because they are not skin colored and therefore do not appear in the back projection image. Running FoF on this probability map (Figure 3, middle) proved to be a step forward, but as can be seen, the back projection image contains a lot of noise. To reduce this noise, the morphological open operation is used.

Figure 3: Result of the histogram back projection with noise and after the open operation (only the relevant part of the images is shown).
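The open operation is an erosion followed by a dilation. A minimal sketch on a binary image, assuming a 3x3 square structuring element and treating pixels outside the image as background (both choices are assumptions, since the paper does not give them):

```python
def _neighborhood(img, y, x):
    """Values in the 3x3 window around (y, x); out-of-image pixels count as 0."""
    h, w = len(img), len(img[0])
    return [
        img[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
    ]


def erode(img):
    """A pixel survives only if its whole 3x3 window is set."""
    return [[int(all(_neighborhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def dilate(img):
    """A pixel is set if anything in its 3x3 window is set."""
    return [[int(any(_neighborhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def morph_open(img):
    return dilate(erode(img))


noisy = [
    [0, 0, 0, 0, 0, 1],   # isolated noise pixel in the corner
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],   # 3x3 skin blob
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = morph_open(noisy)
```

The isolated pixel disappears while the 3x3 blob is restored to its original extent.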

Because of this modification we no longer need to rank features, and we can be almost sure that every feature is located somewhere in the skin region. Luminance commonly changes a lot within a sequence of frames. We achieved partial luminance invariance thanks to the HSV color model, but this is not sufficient. Artificial light can be turned on or off, and sunlight varies with the weather. Our modification relies heavily on a proper skin color probability map. The color appearance of the hand also changes considerably with forward and backward hand movements, as shown in Figure 4. Because of this inaccuracy, the KLT tracker loses more features, and tracking can easily fail for lack of them.

Figure 4: Example of inaccurate back projection due to the luminance variation of the skin.

3.2 Adaptive Histogram
In order to solve this issue we decided to recalculate the histogram during tracking. To make sure that the skin color is segmented properly, we perform a weighted merge of the initial and the current histogram (Figure 5).

Figure 5: Flowchart of the histogram adaptation.

The resulting histogram contains the initial one with a weight of 70 percent, because the region from which the initial histogram is calculated during initialization is known to represent the skin color. As a result, an incorrect median relocation cannot make the histogram converge significantly toward anything other than the skin color. The current histogram is calculated from the region around the median of the hand. The problem is that the median does not represent the center of the hand precisely, so we need to eliminate background colors and noise, just as for the initial histogram. Before merging the histograms it is necessary to normalize them, because the regions they are calculated from differ in size and the raw values therefore do not correspond to comparable probability distributions.
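The weighted merge can be written down directly. The sketch below uses one-dimensional histograms and hypothetical bin counts for brevity; only the 70 percent weight of the initial histogram and the normalization before merging come from the text.

```python
def normalize(hist):
    """Scale a histogram so that its bins sum to 1 (a probability distribution)."""
    total = sum(hist)
    return [v / total for v in hist] if total else list(hist)


def merge_histograms(initial, current, initial_weight=0.7):
    """Weighted merge of two histograms computed from regions of different size."""
    a, b = normalize(initial), normalize(current)
    return [initial_weight * x + (1 - initial_weight) * y for x, y in zip(a, b)]


initial = [0, 8, 2, 0]      # counts from the small initialization region
current = [0, 10, 30, 0]    # counts from a larger region around the median
merged = merge_histograms(initial, current)
```

Without the normalization step the larger region would dominate the result, whatever weights were chosen.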

The important question is when to run the histogram adaptation. As mentioned above, skin color segmentation fails under varying luminance conditions. We realized that we can use the Value channel of the HSV color model to measure the average luminance around the median location, assuming that the median lies somewhere in the hand region. We compute this average every seventh frame and compare it to the previous one. If the difference is large enough, the histogram adaptation is performed. The problem is that the results of image segmentation based on this adaptation are very good, and the KLT tracker then fails more often due to a lack of features located on the hand. To counter this, we removed the three biggest values in the skin color histogram, which creates “white color” inside the hand region and thus more features for the KLT tracker. The other operation increases the differences between histogram bins to create more valuable features for KLT tracking. This is done by normalizing the histogram to a lower maximum value, e.g. 20, and then multiplying every bin value by another 20 (Figure 6).
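The trigger for histogram adaptation might be organized as in the following sketch. The class name, the Value range of 0 to 1, and the `min_change` threshold are assumptions; the text only says that the average is computed every seventh frame and that the difference must be sufficient.

```python
class LuminanceMonitor:
    """Decides when to adapt the histogram from the mean Value around the median."""

    def __init__(self, interval=7, min_change=0.1):
        self.interval = interval      # check every seventh frame
        self.min_change = min_change  # assumed threshold on the mean Value (0..1)
        self.previous = None

    def should_adapt(self, frame_index, values_around_median):
        if frame_index % self.interval != 0:
            return False
        mean_v = sum(values_around_median) / len(values_around_median)
        previous, self.previous = self.previous, mean_v
        return previous is not None and abs(mean_v - previous) >= self.min_change


monitor = LuminanceMonitor()
decisions = [
    monitor.should_adapt(0, [0.50, 0.50]),   # first measurement, nothing to compare
    monitor.should_adapt(3, [0.90]),         # not a checked frame, ignored
    monitor.should_adapt(7, [0.52, 0.50]),   # small change
    monitor.should_adapt(14, [0.80, 0.80]),  # light switched on
]
```

Frames between checks neither trigger adaptation nor update the stored average.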

Figure 6: Sample of inaccurate back projection with the initial histogram, with the merged histograms, and with the modified resulting histogram.
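One possible reading of the two histogram modifications is sketched below. Several points are assumptions rather than facts from the paper: that "removing" a bin means setting it to zero, that bins are rounded to integers after the normalization to 20, that the result is clipped to an 8-bit ceiling of 255, and the order of the two steps.

```python
def modify_histogram(hist, low_max=20, gain=20, drop_top=3, ceiling=255):
    """Zero the largest bins, then quantize to 0..low_max and stretch by gain."""
    out = list(hist)
    # Assumption: the three biggest values are removed by setting them to zero.
    largest = sorted(range(len(out)), key=lambda i: out[i], reverse=True)[:drop_top]
    for i in largest:
        out[i] = 0
    peak = max(out)
    if peak == 0:
        return out
    # Rounding to a few integer levels merges similar bins; multiplying by gain
    # then spreads the remaining levels apart in steps of `gain`.
    return [min(round(v * low_max / peak) * gain, ceiling) for v in out]


modified = modify_histogram([100, 90, 80, 40, 20, 4, 0])
```

Under these assumptions neighboring skin tones end up at clearly separated back projection values, which would give the KLT tracker more texture to lock onto.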


This principle of histogram adaptation works well when the median is correctly detected within the hand region. The skin color detection in the histogram is based on the assumption that, after applying a threshold to eliminate noise, the skin color region is bounded by zeros. A problem occurs when the median is located near the hand border (or on the fingers) and the skin color does not make up the majority of the area around the median. In such cases the segmentation is degraded and the tracker can fail. We partially avoid this behavior by restricting the region of the histogram in which we look for the biggest Gaussian color distribution. We noticed from testing that the skin color distribution lies among the lower saturation values.

Using the modified histogram, a lot of good-features-to-track are created (Figure 7, left), but the open operation decreases their number (Figure 7, right). Because of this, we no longer apply the open operation to the back projection image; the remaining noise is eliminated by the flocking behavioral model, more precisely by the maximum distance parameter. One side effect of the precise skin color segmentation is that tracking is disturbed more by other skin colored objects, such as the head or the other hand.

Figure 7: Difference between the original and the processed (open operator) image in terms of good features to track.

Listing 2: Modified Flocks of Features.

    input:
        mindist      - minimum pixel distance between features
        max_distance - maximum distance used for flocking behavioral model

    initialization:
        observe color histogram from hand region
        find all good-features-to-track with mindist in a hand region
            on back projection image

    tracking:
        for all features run tracking on back projection image   // about 200 features
        for each tracked feature
            if more than mindist from every other feature
               and closer than max_distance from median
            then
                copy it to temp array
        if the array is bigger than max features   // max was about 70
        then
            use only first max features
        if the luminance changed significantly
        then
            recalculate the histogram

    output:
        compute the median from the features in the temp array
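The selection step in the tracking part of Listing 2 can be sketched as follows. The function name is hypothetical, and using the median of the previous frame as the reference point for `max_distance` is an assumption; the KLT tracking itself and the histogram recalculation are not shown.

```python
import math
from statistics import median


def select_flock(tracked, prev_median, mindist, max_distance, max_features=70):
    """Keep features that satisfy both flocking constraints, cap their number,
    and return them together with the new median."""
    temp = [
        p for i, p in enumerate(tracked)
        if math.dist(p, prev_median) < max_distance
        and all(math.dist(p, q) > mindist
                for j, q in enumerate(tracked) if j != i)
    ]
    temp = temp[:max_features]
    if not temp:
        return temp, prev_median  # nothing usable: keep the old median
    return temp, (median(x for x, _ in temp), median(y for _, y in temp))


# Hypothetical tracked positions: two features too close, one far outlier.
tracked = [(50, 50), (50, 51), (60, 58), (55, 45), (200, 200), (58, 52)]
kept, new_median = select_flock(tracked, prev_median=(55, 50),
                                mindist=2, max_distance=30)
```

Read literally, the listing discards both members of a pair that is closer than `mindist`, which is what this sketch does; keeping one of them would be an equally plausible variant.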

We present a sequence of images with tracking results (Figure 8). We considered tracking to be lost when the median no longer corresponded to the hand movements. Our modification can also fail, like the original algorithm (Figure 9), mostly because of rapid movements of the hand over the face or other skin colored objects. An ordinary webcam achieves 30 frames per second, which is not enough for rapid movements. The reason for the KLT tracking failure in this case is an optimization: the new location of a given feature is searched for only within a range of 7 pixels (the bigger the range, the longer the computation takes). As a result, the tracker takes other skin colored parts for the hand and the median is disrupted.

Figure 8: Example of hand tracking.

Figure 9: Tracking failure due to rapid movements.

4. Comparison
For testing purposes we created a dataset of videos (Table 1) that deliberately contain rapid and other movements intended to disturb the tracker. Each pair of videos shows a person wearing a sweater (videos labeled a) and a T-shirt (videos labeled b). We tested the original Flocks of Features, the modification using the preprocessed image, and the modification using the preprocessed image with an adaptive histogram. None of the algorithms used any method for detecting new features during tracking. The testing does not give exact results because of the manual selection of the hand region (initialization). Whenever a tracking failure occurred, manual reinitialization was performed.

The results are presented in Figure 10, where the values for each video represent the number of tracking failures during the video sequence (a higher value means worse tracking). We can see that the values for the original algorithm are much higher than those for our modifications. In total, the original Flocks of Features failed 124 times during testing, the modification without the adaptive histogram 58 times, and the modification with the adaptive histogram 52 times. From these numbers we can say that the modifications perform approximately 2 times better than the original algorithm. All testing videos and the documented video results of every test can be found on my personal website².

Figure 10: Testing results (number of failures; a higher value represents worse tracking results). [Bar chart. Horizontal axis: videos 1a, 1b, 2a, 2b, 3a, 3b, 4a, 4b, 5a, 5b, 6a, 6b. Vertical axis: number of failures, 0 to 18. Series: Original; Image preprocessing; Image preprocessing with adaptive histogram.]

Number  Condition
1       rapid movements of the hand
2       size of the hand is changing a lot
3       arbitrary movements
4       arbitrary movements
5       moving background
6       outside lighting conditions

Table 1: Explanation of listed videos
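The "approximately 2 times" figure can be checked from the reported failure totals. The short calculation below uses only the three totals given in the text; the variable names are illustrative.

```python
failures = {
    "original": 124,
    "image preprocessing": 58,
    "image preprocessing with adaptive histogram": 52,
}

baseline = failures["original"]
# How many times fewer failures each modification has than the original.
improvement = {name: round(baseline / count, 2)
               for name, count in failures.items() if name != "original"}
# Relative reduction of failures in percent.
reduction = {name: round(100 * (1 - count / baseline), 1)
             for name, count in failures.items() if name != "original"}
```

This gives factors of about 2.14 and 2.38, or roughly 53% and 58% fewer failures than the original algorithm.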

Conclusions and Future Work
We proposed two modifications which, according to our testing, achieve better results. Processing the image by histogram matching makes the Lucas-Kanade tracker work more reliably and, thanks to this, hand tracking performance has increased. A disadvantage of this modification is its dependence on the quality of the skin color segmentation. To achieve greater luminance invariance, an adaptive histogram was introduced. Histogram adaptation yields better skin color segmentation most of the time, but there are still some false positive cases in which the adaptation causes tracking failure.

Flocks of Features can fail due to movements in front of other skin colored objects such as the head. This could be solved by obtaining depth information about the object. Microsoft has introduced a new version of its Xbox 360 game console which uses Kinect. Kinect is a webcam extended with an infrared camera and an infrared projector. The projector illuminates the scene with infrared light, and the infrared camera is able to compute depth information from the image. This hardware could provide proper depth information without the computational cost that has so far been incurred by using two or more cameras.

² henryi.yweb.sk

References
[1] J. Hoey. Tracking using flocks of features, with application to assisted handwashing. In British Machine Vision Conference (BMVC), 2006.

[2] M. Kölsch and M. Turk. Fast 2D hand tracking with flocks of features and multi-cue integration. In Computer Vision and Pattern Recognition Workshop, 2004. CVPRW ’04. Conference on, page 158, 2004.

[3] M. Kölsch and M. Turk. Robust hand detection. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 614-619, 2004.

[4] M. Kölsch, M. Turk, and T. Höllerer. Vision-based interfaces for mobility. In Mobile and Ubiquitous Systems: Networking and Services, 2004. MOBIQUITOUS 2004. The First Annual International Conference on, pages 86-94, August 2004.

[5] C. W. Reynolds. Flocks, herds, and schools: A distributed behavioral model. Computer Graphics, 21(4):25-34, July 1987.

[6] J. Shi and C. Tomasi. Good features to track. In Computer Vision and Pattern Recognition, 1994. Proceedings CVPR ’94., 1994 IEEE Computer Society Conference on, pages 593-600, 1994.

[7] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University, April 1991.