

Noname manuscript No. (will be inserted by the editor)

Real time palm and fingertip detection based on depth maps

Jonathan Robin Langford-Cervantes · Octavio Navarro-Hinojosa · Moises Alencastre-Miranda · Lourdes Munoz-Gomez · Gilberto Echeverria-Furio · Cristina Manrique-Juan

Received: date / Accepted: date

Abstract We present a system that is able to recover and track the 3D position and orientation of the palms and fingertips of both human hands from markerless visual observations obtained from 3D depth maps generated by a depth camera, specifically, Microsoft Kinect version 1. Since depth map processing can be computationally expensive, we used octrees and clustering algorithms to process the point clouds and estimate the palm and fingertip positions. Our solution is focused on working on low-end or commercial hardware in real time and does not require GPGPU. We are able to track and differentiate both hands' palms and fingertips in real time.

Keywords Real time · Depth Map · Depth Camera · Hand detection · Computer vision

1 Introduction

Automatic capture and analysis of human motion is a highly active research area due both to the number of potential applications and its inherent complexity. Most of the attention in this regard has been given to body recognition and tracking, face recognition, augmented reality, object scanning, among others, with a focus on Human-Computer Interaction (HCI).

Recently, there has been great emphasis in HCI research on creating easier-to-use interfaces by directly employing the natural communication and manipulation skills of humans. There are many applications that track and use different parts of the body, or a set of gestures, in order to interact with them. Among the different body parts, the hand is the most effective, general-purpose interaction tool due to its dexterous functionality in communication and manipulation.

Moises Alencastre-Miranda
Av Carlos Lazo 100, Santa Fe, Alvaro Obregon, 01389 Mexico City, Federal District
Tel.: 01 55 9177 8000
E-mail: [email protected]


Hand tracking and recognition has gained great importance in recent years in the field of computer vision research, because of its extensive applications in virtual and augmented reality, HCI, sign language recognition, robotics, computer games, among others. In vision-based interfaces, hand tracking is often used to support user interactions such as cursor control, 3D navigation, and recognition of dynamic gestures.

In general, hand pose estimation is very challenging due to the many degrees of freedom (DoFs) of the hand as an articulated object, which leads to great variability in hand appearance and self-occlusions. Hand Gesture Recognition (HGR) is a popular and effective means of HCI. It has been used in many applications, including embedded systems, vision-based systems, and medical applications.

Currently, the most effective tools for capturing hand motion are electro-mechanical or magnetic sensing devices (data gloves). These devices are worn on the hand to measure the location of the hand and the finger joint angles. They deliver the most complete, application-independent set of real-time measurements that allow importing all the functionality of the hand into HCI. However, they have several drawbacks in terms of casual use, as they are very expensive, hinder the naturalness of hand motion, and require complex calibration and setup procedures in order to obtain precise measurements.

The advent of relatively cheap image and depth sensors has spurred research in the field of object tracking and gesture recognition. One of the more popular devices used for this type of research is Microsoft's Kinect version 1 (Kinect v1) sensor, which has sensors that capture both RGB and depth data. Using such data, researchers have developed algorithms that not only identify humans in a scene, but perform full body tracking; they can infer a person's skeletal structure in real time, allowing for the recognition and classification of a set of full body actions. However, the first version of the Kinect is unable to recognize and track hands and hand gestures.

Our goal is to build a reliable, markerless, hand palm and fingertip recognition and tracking system using the depth data collected from a Kinect v1 sensor. This project was commissioned by Larva Game Studios [4] (see acknowledgements), with the intention of using the final solution in video game development. Because of that, we focus our solution on working on low-end or commercial hardware in real time, using only the available CPU, and develop it without the need for GPGPU.

2 Related Work

Markerless hand tracking systems rely on vision-based approaches, which process and analyze video inputs to recognize hand poses and gestures [38]. This type of gesture recognition is natural and convenient for users, but hand and finger pose estimation can be challenging due to hand appearance, skin color, skin


texture, and lighting conditions. One main problem is segmenting (distinguishing) the hand from other objects, which is mainly done by skin color segmentation [15, 20, 21, 30, 36]; however, this approach is sensitive to lighting and skin color variation, and cannot handle complex backgrounds [38]. Recently, with the advent of cheap depth sensors, those problems can be mitigated, which may help create robust, computationally efficient, and more user-tolerant applications. Multiple vision-based approaches for hand and finger pose estimation have been made using color and depth cameras; following [24], these approaches may be divided into appearance-, model-, and feature-based methods.

Model-based methods [16, 25–27] use a 2D or 3D virtual hand model consisting of multiple kinematic joints, which represent the DoFs of the hand, to compare against the user's real-time hand observations and fit them to the virtual image or model. Essentially, this type of approach is formulated as an optimization problem, whose objective is to find the best match between a virtual hand model pose and the real hand's observations by measuring the discrepancy between them. In [25] an RGB multi-camera system, consisting of 8 synchronized and calibrated PointGrey cameras, is used. In each of the acquired views of the cameras, edge and skin color maps form 2D cues of the presence of a hand, which are combined and compared with a predefined model to deduce the hand's pose. This implementation efficiently solves a full-DoF joint hand tracking problem with occlusions by using Particle Swarm Optimization (PSO), which is a probabilistic search algorithm that can be efficiently parallelized [35]. Even though it is implemented on GPU, it does not run in real time (average of 15Hz); it is also sensitive to light and requires an elaborate setup. Also, [26, 27] solve an optimization problem using PSO, but in addition to color information they use depth information of the hand, which makes it rather insensitive to illumination conditions and does not require a complex hardware setup, but it still does not work in real time (average of 15Hz). The resulting computational complexity is the main drawback of these methods, but they do not require a training stage and can handle occlusions.

Another model-based approach is [37], where the key idea is to leverage marker position data recorded by a twelve-camera optical motion capture system and RGB/depth data obtained from a single Microsoft Kinect v1 sensor, in order to acquire a wide range of high-fidelity hand motion data. The two capturing devices are complementary to each other, as they focus on different aspects of hand performances. Marker-based motion capture systems can obtain high-resolution 3D position data at very high frame rates, but they are often not capable of reconstructing 3D hand articulations accurately, particularly in the case of significant self-occlusion. On the other hand, an RGB-D camera such as the Kinect v1 sensor can capture per-pixel information for both color and depth data. However, the data from the Kinect v1 sensor is often noisy and sampled at a much lower frame rate. By complementing a marker-based motion capture system with RGB-D image data obtained from a single Kinect v1 sensor, the effect of missing markers is significantly reduced and high-quality hand motion can be reconstructed.

Other approaches taken are appearance-based, data-based, or template-based methods [20, 22, 33], in which a large database of images corresponding to a finite number of hand configurations must be defined beforehand to train the system.


When using the system, the user's hand gestures are compared with the database to find a correspondence. [22] uses 2 Kinect v1 sensors to acquire the structural information of one hand. From the obtained data, 6 descriptors are identified and compared with the training set to find the best match. This approach was able to achieve 100% accuracy for static hand poses without rotation, with a database of 10 gestures, but when the gestures were performed in motion the accuracy could drop to 93%. This approach has a high level of accuracy but requires a training stage, and increasing the number of available gestures will increase the recognition errors. Madhuri, Y. et al. [20] use a similar approach to create a translator for sign language. The program acquires images of the hand through a color camera, processes the unique features, and compares them with a 36-sign word lexicon. It has the disadvantages of a color hand tracking method, and similar gestures can be misinterpreted, but it runs in real time on an iOS mobile phone, making it portable and easy to use by deaf people. This type of approach is quicker than model-based approaches but is prone to error when a user cannot reproduce a gesture from the defined database [21], and it also requires a training stage.

In [34] a solution to the difficult problem of inferring the continuous pose of a human hand is presented. They first construct an accurate database of labeled ground-truth data in an automatic process and then train a system, using convolutional networks, capable of real-time inference. Since the human hand represents a particularly difficult kind of articulable object to track, their solution is applicable to a wide range of articulable objects. The final method has a small latency equal to one frame of video, is robust to self-occlusion, requires no special markers, and can handle objects with self-similar parts such as fingers.

When applications do not need the detection of the DoFs of the hand, a feature-based approach can be taken [10, 14, 17, 30], in which only certain features such as the fingertips and the center of the palm are obtained. [30] uses an intensity-based histogram to detect the direction of the hand, the wrist's end, and the fingers' ends, depending on the pixels' values of the binary silhouette. This method is fast and efficient for tracking the hand and fingers, but is only useful when all the hand's fingers are present and fully opened. In the case of tasks such as pointing, resizing, and navigating, the fingertips are the most important tracking elements, so in [14] they combine depth and color information to segment the hand and find the fingers using edge extraction. The isolated contours are taken as blobs, which are tracked using the K-nearest neighbors (KNN) classification algorithm, and then these blobs are interpreted as gestures.

This procedure is easy to implement, but it is only useful for unsophisticated or low-precision tasks. [10] segments the hand from the background and body by thresholding the depth information, then by using convexity detection it locates the convex and concave points of the hand shape. According to the convex and concave points detected, digits are predicted. It is able to detect hand digits from 0 to 5 in real time with an accuracy of 94% for just one hand. [17] also uses convexity detection to identify the fingers, but can track two hands by applying the K-means clustering algorithm after segmenting the hands using a depth threshold. It can also detect more gestures by applying three layers of classifiers, and has an accuracy of at least 84% for one hand, while it increases to more than 90% if


both hands are doing the same gesture. A feature-based approach is useful and efficient when the application just needs to track hands and fingertips, and recognize simple gestures.

Furthermore, commercial applications have recently been developed for hand and finger tracking, such as Sculpting from Leap Motion [23] and IISU from SoftKinetic [1], but they have not disclosed how they work.

There are applications such as [8, 13] where the hand detection problem is solved, but how the results were achieved has only been partially disclosed, or not disclosed at all. The latter uses the Kinect v1 sensor, and the libfreenect driver for interfacing with the Kinect v1 sensor in Linux, to create hand tracking software and a graphical interface like the one in the movie "Minority Report". The hand detection software is able to distinguish hands and fingers in a cloud of more than 60,000 points at 30 frames per second, allowing natural, real-time interaction. The former presents a new real-time articulated hand tracker which enables new possibilities for HCI. This system accurately reconstructs complex hand poses across a variety of subjects using a standard Xbox One Kinect. It also allows for a high degree of robustness, continually recovering from tracking failures.

In our work we focus on developing a system that can track two hands and their fingers simultaneously in real time, distinguishing the left and right hand from each other. Also, the system should not require the use of a GPU, should be insensitive to light, and should not require training. To accomplish this, a feature-based approach is taken using a simple setup consisting of one depth sensor.

3 Development

Our proposed method takes a depth map as input from the Kinect v1, and tries to determine the position of the fingertips and palms of the hands of a single user. The general steps are as follows:

– Retrieval and filtering of the depth map by boundaries. These boundaries are meant to filter out everything that does not belong to the user's hands and forearms.

– Point clustering. A modified clustering algorithm is used to group all the points in the filtered map into clusters for further processing.

– Reference object creation, so that each cluster can be tracked.

– Fingertip selection. A graph is created for each hand's 3D point cluster and a set of branches is chosen as finger candidates. The graph is intended to be seen as a skeleton of the hand.

– Palm position estimation in the scene using a distance transformation on a projection of the hand's 3D point clusters along the Z axis.


3.1 Depth map retrieval

The depth map is read and filtered in order to keep only the regions corresponding to the user's hands and forearms; Figure 1 shows a depth map before filtering, which commonly contains elements such as the head, the arms, or other objects that are in the scene. In order to filter said regions, we use boundaries that can be set either statically or dynamically: if set statically, the user selects the top, bottom, right, left, near, and far boundaries to be used; if set dynamically, the far clipping plane changes by placing it at a fixed distance from the nearest point found in the map. For each 3D point in the depth map, its position is checked and the point is ignored if it lies outside the top, bottom, right, or left boundaries. Each accepted 3D point is then filtered again by depth value: every 3D point lying outside of the far or near boundary is ignored. The remaining map is the input for the rest of the analysis. A filtered map can be seen in Figure 2.
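The following is a minimal sketch of this filtering step, assuming the depth map has already been converted into a list of 3D points; the type and parameter names are illustrative and not taken from the original implementation.

```cpp
// Sketch of the boundary filtering of a 3D point cloud (Section 3.1).
// FilterBounds groups the static lateral/near boundaries and the offset used
// to place the dynamic far clipping plane behind the nearest point.
#include <algorithm>
#include <limits>
#include <vector>

struct Point3 { float x, y, z; };

struct FilterBounds {
    float left, right, top, bottom;  // static lateral boundaries
    float nearZ;                     // static near clipping plane
    float farOffset;                 // dynamic far plane = nearest point + offset
};

std::vector<Point3> filterDepthMap(const std::vector<Point3>& cloud,
                                   const FilterBounds& b) {
    // Dynamic far plane: keep only points within farOffset of the nearest point.
    float nearest = std::numeric_limits<float>::max();
    for (const auto& p : cloud) nearest = std::min(nearest, p.z);
    const float farZ = nearest + b.farOffset;

    std::vector<Point3> kept;
    for (const auto& p : cloud) {
        if (p.x < b.left || p.x > b.right) continue;   // outside lateral bounds
        if (p.y < b.bottom || p.y > b.top) continue;   // outside vertical bounds
        if (p.z < b.nearZ || p.z > farZ) continue;     // outside depth bounds
        kept.push_back(p);
    }
    return kept;
}
```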

Fig. 1: 3D point cloud before filtering.

Fig. 2: 3D point cloud after filtering.

3.2 Point grouping into clusters

After the depth map has been filtered, a way to determine which subset of 3D points belongs to a given hand is needed. In order to do this, we used cluster analysis (or clustering), which is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other clusters. Clustering is normally used in data mining, statistical data analysis, machine learning, pattern recognition, image analysis, computer graphics, information retrieval, and bioinformatics; here, we decided to generate clusters by grouping the 3D points based on their proximity to each other. Since the 3D points had to be grouped into meaningful clusters, the use of clustering algorithms such as K-MEANS [19] and DBSCAN [12] was considered. These algorithms were selected because our proposed method needs to work in real time, and they offer computational efficiency and effectiveness at finding meaningful clusters.

We first tested the viability of the K-MEANS algorithm by searching for two clusters, one for each hand, in the filtered depth map. The K-MEANS algorithm


produced two clusters, delimited by the green boxes, that could be further analyzed (as seen in Figure 3). However, since K-MEANS needs a predefined number of clusters to generate, there were some cases where the hands were so close to each other that the algorithm generated the two clusters, but some points that belonged to one hand were assigned to the other (see Figure 4). Another problem with the use of K-MEANS was that if two hands appeared in the scene and additional elements appeared, the clusters would not be correctly generated, as seen in Figure 5: if the hands are placed close together and another element is inserted in the scene, the hands are shown inside a single green box, and the extra element in the scene is shown inside another.

Fig. 3: Correctly generated clusters using K-MEANS.

Fig. 4: Incorrect assignment of points to a cluster with K-MEANS.

Fig. 5: Incorrectly generated clusters using K-MEANS.

Since DBSCAN considers the density of the clusters, and produces a number of clusters that depends on the data itself, it solves the problems we found with K-MEANS. When we tested the basic DBSCAN algorithm on our sample dataset, we found that the hands were properly clustered into different groups, but, since each sample had such a large number of points, the algorithm did not run in real time.

DBSCAN offered us the best results considering the clusters generated, so we looked into making the algorithm perform faster, ideally in milliseconds. There are


some works, such as [9, 18, 28], that have proposed changes to the DBSCAN algorithm and have been able to decrease the processing time considerably; however, those algorithms were still not able to run in real time.

Our first attempt at improving the DBSCAN algorithm was to use octrees in order to reduce the search space of the DBSCAN algorithm and improve the clustering speed. Although we obtained a speedup, the clustering was still taking too much time if we wanted to achieve real time. To further reduce the clustering time, we modified the DBSCAN algorithm: we kept the octree-based DBSCAN and, instead of clustering a large number of points within a single small radius, we perform two clustering steps with two different search radii.

Fig. 6: Clusters generated after the first clustering step.

Fig. 7: Final clusters generated after the second clustering step.

The first step uses a search radius that is 1/1000 of the depth map size. We tested different radii and found that this radius gave us the best clustering results. This first clustering produces a group of clusters whose size is around 1/100 of the initial depth map size. Figure 6 shows the result of the first clustering step, where each colored segment is a different cluster. The second step clusters all the points of the generated clusters based on their centroids. We use a search radius of 2/1000 of the depth map size so that we can find the clusters' centroids and group them together. Once we have the group of centroids, all the points associated with a given centroid are added to a final cluster. Figure 7 shows the result of the second clustering step.
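A compact sketch of this two-step clustering is shown below. The neighbour search is written as a brute-force pass for brevity, whereas the implementation described above stores the points in an octree to make the radius queries fast enough for real time; `mapSize` stands in for whatever measure of depth map size the 1/1000 and 2/1000 radii are derived from, and all names are illustrative.

```cpp
// Two-step proximity clustering: a small radius over the raw points, then a
// larger radius over the resulting centroids, merging member points per group.
#include <cmath>
#include <vector>

struct P3 { float x, y, z; };

static float dist(const P3& a, const P3& b) {
    return std::sqrt((a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) + (a.z-b.z)*(a.z-b.z));
}

// Greedy density-style grouping: every point within `radius` of a cluster
// member joins that cluster (stands in for the octree-backed DBSCAN pass).
std::vector<std::vector<int>> radiusCluster(const std::vector<P3>& pts, float radius) {
    std::vector<int> label(pts.size(), -1);
    std::vector<std::vector<int>> clusters;
    for (size_t i = 0; i < pts.size(); ++i) {
        if (label[i] != -1) continue;
        std::vector<int> cluster{static_cast<int>(i)};
        label[i] = static_cast<int>(clusters.size());
        for (size_t k = 0; k < cluster.size(); ++k)            // expand the cluster
            for (size_t j = 0; j < pts.size(); ++j)
                if (label[j] == -1 && dist(pts[cluster[k]], pts[j]) <= radius) {
                    label[j] = label[i];
                    cluster.push_back(static_cast<int>(j));
                }
        clusters.push_back(cluster);
    }
    return clusters;
}

static P3 centroid(const std::vector<P3>& pts, const std::vector<int>& idx) {
    P3 c{0, 0, 0};
    for (int i : idx) { c.x += pts[i].x; c.y += pts[i].y; c.z += pts[i].z; }
    c.x /= idx.size(); c.y /= idx.size(); c.z /= idx.size();
    return c;
}

std::vector<std::vector<int>> twoStepCluster(const std::vector<P3>& pts, float mapSize) {
    // First step: small radius over the raw points.
    auto firstPass = radiusCluster(pts, mapSize / 1000.0f);

    std::vector<P3> centroids;
    for (const auto& c : firstPass) centroids.push_back(centroid(pts, c));

    // Second step: larger radius over the centroids of the first-step clusters.
    auto secondPass = radiusCluster(centroids, 2.0f * mapSize / 1000.0f);

    std::vector<std::vector<int>> finalClusters;
    for (const auto& group : secondPass) {         // merge points behind each centroid
        std::vector<int> merged;
        for (int ci : group)
            merged.insert(merged.end(), firstPass[ci].begin(), firstPass[ci].end());
        finalClusters.push_back(merged);
    }
    return finalClusters;
}
```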

After this step, all the elements in the scene are grouped into different clusters, and will be segmented in the following steps. A sample of the clustering step can be seen in Figure 8.

3.3 Reference creation and tracking

The number of points in each cluster is counted and only the clusters whose count is above a fixed threshold (which we selected experimentally by testing the point clouds of different hands) are considered for further analysis. Each of the remaining


Fig. 8: DBSCAN algorithm detecting 3 clusters.

Fig. 9: ID Assignment Flow. t is the current frame, t-1 is the previous frame, t-2 is the second previous frame.

clusters is clipped in size to a defined range (this value was experimentally obtained while considering the average size of an open hand facing the camera) along both the Z and Y axes in order to ignore the forearm section, and its centroid is recalculated. Each cluster is then turned into a reference object for tracking. The finger detection algorithm is designed to work correctly when the hands' palms are presented to the sensor pointing upwards.

Each reference object is given a unique ID, and its position is tracked through time. IDs are assigned by finding the best match for a current centroid given the centroids of the past frame, as can be seen in Figure 9.

This algorithm works in the following way (a sketch of the matching step follows the list):

– For every centroid that exists in the current frame, a distance is calculated to every centroid that exists in the past frame. The results are arranged in a list of vectors, where there is a list item for every current centroid, and this item is a vector that holds the distances and references to each centroid in the past frame.


Fig. 10: ID Assignment with Centroid Pushing. t is the current frame, t-1 is the previous frame, t-2 is the second previous frame.

– Each vector is sorted in ascending order according to its distance values so that the best match for a given centroid is always placed in the first position of the vector.

– A conflict resolution routine is implemented to avoid assigning the same ID to centroids whose best match points to the same past centroid. This routine, which is repeated until the vector list is empty, consists of the following steps:

• For every centroid whose best match equals another centroid's best match, their distances to the best match are compared. The centroid whose distance is the lowest acquires the past ID.

• Every reference to the already matched past centroid is removed from each vector in the list.

• The vector corresponding to the successfully assigned centroid is removed from the list.

• If any remaining vector has no elements, it is removed from the list.

– All current centroids are checked. For every current centroid with an unassigned ID, a new ID is generated, and an uncertainty level (a value that allows us to control whether the centroid is maintained in the scene) of zero is assigned.

– All past centroids are checked. Every past centroid with no match in the current frame and an "uncertainty level" lower than a given threshold (we experimentally used an uncertainty level of 20, achieving good results) is pushed into the current frame and its "uncertainty level" counter increased.

– If the centroid's uncertainty level is more than the defined threshold, the centroid is deleted and will not be tracked anymore.

– The current frame is pushed into the frame list so it can be used as the past frame in the next iteration; see Figure 10.

– If a centroid has been tracked for more than a given number of frames, it will be considered a strong object, and will be tracked in the scene until its uncertainty level is beyond the defined threshold.
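The sketch below illustrates the matching step under simplifying assumptions: a single past frame is kept instead of a frame list, conflicts are resolved greedily in global distance order (which corresponds to the per-vector conflict resolution above in the common case), and the uncertainty level of a matched centroid is assumed to reset to zero. Names and signatures are illustrative.

```cpp
// Greedy nearest-centroid ID assignment with uncertainty-based centroid pushing.
#include <algorithm>
#include <cmath>
#include <vector>

struct Centroid { float x, y, z; int id; int uncertainty; };

void assignIds(std::vector<Centroid>& current, std::vector<Centroid>& past,
               int& nextId, int maxUncertainty = 20) {
    struct Pair { float d; size_t cur, prev; };
    std::vector<Pair> pairs;
    for (size_t i = 0; i < current.size(); ++i)
        for (size_t j = 0; j < past.size(); ++j) {
            float dx = current[i].x - past[j].x, dy = current[i].y - past[j].y,
                  dz = current[i].z - past[j].z;
            pairs.push_back({std::sqrt(dx*dx + dy*dy + dz*dz), i, j});
        }
    std::sort(pairs.begin(), pairs.end(),
              [](const Pair& a, const Pair& b) { return a.d < b.d; });

    std::vector<bool> curDone(current.size(), false), prevDone(past.size(), false);
    for (const auto& p : pairs) {                  // closest pair wins each past ID
        if (curDone[p.cur] || prevDone[p.prev]) continue;
        current[p.cur].id = past[p.prev].id;
        current[p.cur].uncertainty = 0;            // assumed reset on a successful match
        curDone[p.cur] = prevDone[p.prev] = true;
    }
    for (size_t i = 0; i < current.size(); ++i)    // unmatched current centroid: new ID
        if (!curDone[i]) { current[i].id = nextId++; current[i].uncertainty = 0; }
    for (size_t j = 0; j < past.size(); ++j)       // unmatched past centroid: push or drop
        if (!prevDone[j] && past[j].uncertainty < maxUncertainty) {
            Centroid pushed = past[j];
            pushed.uncertainty++;
            current.push_back(pushed);
        }
    past = current;                                // current frame becomes the past frame
}
```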


Each of the tracked reference objects receives a label according to a set of context rules that are defined as follows:

– If the number of strong objects is greater than one (a sketch of this labeling follows the list):

• The reference objects are sorted according to the X value of their moving average position.

• The leftmost element is labeled as the left hand and the rightmost is labeled as the right hand. All the elements in the middle are labeled as undefined regions.

• We generate centroid anchors (or update the existing ones) by calculating the average position of the centroid of the element with a certain label (left or right), over a defined number of frames.

• The current positions of the centroid anchors are compared; if the anchors cross each other over X, their labels are swapped.

– If the number of strong groups equals one, the label will be set as left hand if the reference object's moving average X value is lower than the workspace's X origin coordinate, and it will be set as right hand otherwise.
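A minimal sketch of these labeling rules is given below; the centroid anchors and the label-swapping step are omitted, and the names are illustrative.

```cpp
// Labels strong tracked objects as left hand, right hand, or undefined region.
#include <algorithm>
#include <string>
#include <vector>

struct StrongObject { float avgX; std::string label; };

void labelHands(std::vector<StrongObject>& objs, float workspaceOriginX) {
    if (objs.empty()) return;
    if (objs.size() == 1) {
        // Single strong object: compare its moving-average X with the workspace origin.
        objs[0].label = (objs[0].avgX < workspaceOriginX) ? "left hand" : "right hand";
        return;
    }
    // Several strong objects: sort by the moving-average X position.
    std::sort(objs.begin(), objs.end(),
              [](const StrongObject& a, const StrongObject& b) { return a.avgX < b.avgX; });
    for (auto& o : objs) o.label = "undefined region";  // elements in the middle
    objs.front().label = "left hand";                   // leftmost element
    objs.back().label  = "right hand";                  // rightmost element
}
```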

3.4 Palm estimation using distance transform

To estimate the center of the palm, we used the distance transform algorithm as a starting point, which is also used in [29]. The data corresponding to each hand is translated to a local reference frame and then analyzed in order to find the location of the palm. The centers of the palms are detected by applying the distance transform to binary images of a hand. In order to apply the distance transform and estimate the center of the palm, the following steps are taken (a sketch follows the list):

– Project the cloud orthogonally over Z and create a bounded binarized image from the projection, as seen in Figure 11 (a).

– Fill any holes found in the hand area, as seen in Figure 11 (b).

– Calculate the distance transform of the resulting image, as seen in Figure 11 (c).

– Find the valid point in the distance map with the greatest distance. If there is more than one point whose value matches the greatest value, the point with the lowest Y coordinate is selected. That point serves as the palm's center, as seen in Figure 11 (d).
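Below is a sketch of the palm-center search using OpenCV's distance transform; the use of OpenCV is an assumption for illustration only, as the paper does not name the image processing library that was used.

```cpp
// Palm-centre estimation from a binarised hand projection (OpenCV assumed).
#include <opencv2/imgproc.hpp>

// `handMask` is a binary image (CV_8U, 255 = hand) obtained by projecting the
// hand's 3D points over Z and filling holes. Returns the palm centre in pixels.
cv::Point estimatePalmCenter(const cv::Mat& handMask) {
    cv::Mat dist;
    cv::distanceTransform(handMask, dist, cv::DIST_L2, 3);

    cv::Point best(0, 0);
    float bestValue = -1.0f;
    for (int y = 0; y < dist.rows; ++y)
        for (int x = 0; x < dist.cols; ++x) {
            float d = dist.at<float>(y, x);
            // Keep the pixel farthest from the hand edge; strict '>' with the
            // ascending y scan keeps the lowest-Y pixel on ties.
            if (d > bestValue) { bestValue = d; best = cv::Point(x, y); }
        }
    return best;
}
```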

3.5 Fingertip selection

Once we have a group of tracked objects, and an estimation of the palm of the hand, we proceed to process each cluster in order to estimate the position of the fingertips. We decided to use a technique similar to the one used in [31, 32]. The main idea is to generate a centered curve skeleton of a cluster of 3D points, in order to help us estimate the shape of the hand and the fingertips.


Fig. 11: Palm center estimation through distance transform on a binarized image.

The first step was to arrange all the points in a cluster into an octree; this allows the points to be searched and further processed in real time. With all the 3D points arranged in an octree, we calculate the centroid of each voxel of the octree, and we use DBSCAN to group said centroids for each layer of the octree along the Y axis, as can be seen in Figure 12. For each of the new clusters generated, their centroid is calculated (see Figure 13), in order to become a node of the curve skeleton. We then connect these new nodes using the following set of rules:

– Nodes are connected from bottom to top along the Y axis.

– Each node connects to the next set of closest nodes available.

– Each node searches only the next five levels of the octree. If the next closest node is farther away, it is not connected.

Fig. 12: Points in the octree grouped into clusters.

Fig. 13: Centroids of each of the layers of the points in the octree.


Fig. 14: Final fingertip selection.

Using this technique, the curve skeleton generated had several bifurcations which had to be trimmed; these bifurcations are the paths of the skeleton that have less than two following connections. The Euclidean distance of all such paths that stem from each node is calculated, and only the shortest is maintained; the other paths are erased and not considered for further analysis.

The final nodes of each path are the ones that will be considered as fingertip candidates. Considering the length of each path from the root to the finger candidates, all the paths are sorted in descending order, and only the first five are selected. In order to validate the fingertip candidates, we used a geometric approach. We defined an approval radius, which is inversely proportional to the longest path from the center of the palm to the fingertip candidates, and all the candidates that lie outside that radius are validated as fingertips. We first tried using a fixed value; however, there were many false positives because of the different hand sizes we tested. Using this dynamic approval radius helped us to avoid false positives. The fingertip selection can be seen in Figure 14.
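A small sketch of the approval-radius check is shown below. The constant `k` and the exact form of the inverse proportionality are assumptions for illustration; the text only states that the radius is inversely proportional to the longest palm-to-candidate path.

```cpp
// Validation of fingertip candidates with a dynamic approval radius.
#include <algorithm>
#include <cmath>
#include <vector>

struct Pt { float x, y, z; };

static float d3(const Pt& a, const Pt& b) {
    return std::sqrt((a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) + (a.z-b.z)*(a.z-b.z));
}

// pathLengths[i] is the skeleton path length from the palm centre to candidate i.
std::vector<Pt> validateFingertips(const std::vector<Pt>& candidates,
                                   const std::vector<float>& pathLengths,
                                   const Pt& palmCenter, float k) {
    float longest = 0.0f;
    for (float l : pathLengths) longest = std::max(longest, l);
    const float approvalRadius = k / longest;      // inversely proportional to the longest path

    std::vector<Pt> fingertips;
    for (size_t i = 0; i < candidates.size(); ++i)
        if (d3(candidates[i], palmCenter) > approvalRadius)  // outside the radius: accepted
            fingertips.push_back(candidates[i]);
    return fingertips;
}
```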

4 Results

In this section we present the results that we obtained with both the cluster generation, and the estimation and tracking of the hand. All of the development was done in C++ using the Kinect v1 SDK version 1.8 [3]. We conducted our tests on a sample dataset that we generated, and on a live capture of the Kinect v1. The sample dataset consists of fifteen different depth maps, each consisting of around thirty thousand 3D points, that capture different positions, sizes, and numbers of hands. The live capture had an average of 30,000 points after the clipping steps. For both the sample dataset and the live feed of the Kinect v1, we decided to set the Kinect v1 at a fixed position. The Kinect v1 was positioned 80 cm in front of


the capture point, 40 cm lower than the capture point, and had an elevation angle of 25 degrees.

4.1 Cluster generation

A key part of our proposed solution was the clustering of the 3D points obtained from the Kinect v1. As mentioned in Section 3.2, we modified the DBSCAN algorithm in order to try to process the data in real time. On average, on a sample of thirty thousand 3D points, the DBSCAN algorithm took five seconds to generate the clusters. On a sample of 46,000 points, the algorithm took an average of 12 seconds. Our first attempt at speeding up the algorithm was using octrees to reduce the search space. With said technique, we achieved clustering of 30,000 points into separate groups in an average of 0.3 seconds. On a sample of 46,000 points, we were able to cluster the points in an average of 0.5 seconds. Even though we obtained a speedup of between 16 and 24, this still was not real-time processing.

Average Sample Size (3D points) | DBSCAN vs DBSCAN Octrees | DBSCAN Octrees vs DBSCAN 2 clustering steps | DBSCAN vs DBSCAN 2 clustering steps
30,000 | 16.666 | 100 | 1666.666
46,000 | 24 | 100 | 2400

Table 1: Speedups for the different clustering algorithms tested.

Our second attempt at speeding up the DBSCAN algorithm was using the algorithm with octrees and the two clustering phases. That modification yielded the best results, since we were able to reduce the clustering time to an average of 0.003 seconds on a sample of thirty thousand points. On a sample of forty-six thousand points, the algorithm took an average of 0.005 seconds. This provided the best speedup, between 1,600 and 2,400, which allowed us to process the clusters in real time. Table 1 shows the speedups obtained for each algorithm.

4.2 Palm and fingertip estimation

Our proposed solution is capable of estimating, tracking, and rendering a scene with the position of the fingertips and palms of two hands at an average of 60 Hz, using the Kinect v1. In the following figures, the fingertips are marked with green circles, while the palms of the hands are marked with a red circle. Figure 15 shows a live capture of our system, tracking two hands' fingertips and palms.

The system can track the recognized palms and fingertips as long as the hands' palms are pointing towards the Kinect v1 and the hands are not occluded. The system is also able to determine when no fingertips are present, as can be seen in Figure 16, where only a fist is showing and only the palm is tracked.


Fig. 15: Live capture of the proposed solution, tracking two hands.

Fig. 16: Only the palm is tracked when a fist is shown.

Since we had issues with several hand sizes, we implemented the approval radius described in Section 3.5. Even though this approval radius allowed us to filter some finger candidates, it also filters the thumb when it is inside said radius, as can be seen in Figure 17, where a hand with four fingers pointing upwards is shown.

To further test our solution, we developed a simple digital sculpting game using the Unity Game Engine [11]. Within the game, a set of gestures (such as grabbing and releasing) was programmed using the information obtained from the tracked hands' fingertips and palms. With those gestures, we were able to control the user interface (UI) and the objects in the scene. The UI consisted of buttons and slide bars, which were operated by grabbing, and by dragging in the case of the slide bars. The game had two modes: digital clay sculpting, where you could model using marching cubes resembling clay; and digital sculpting, where you used or deformed basic shapes, such as a sphere or a torus, in order to model something else. This game showed us that our solution was accurate enough to develop a simple HCI application that allowed the user to accurately model different objects. A sample of the game can be seen in Figures 18 and 19.


Fig. 17: Only 4 fingertips and the center of the palm are tracked, from a hand showing only 4 fingers.

Fig. 18: Sample game, sculpting a new object using a torus as a base.

Fig. 19: Sample game, modeling a new object using marching cubes as clay.

5 Conclusions and future work

We developed a system that is able to track and estimate the 3D position and orientation of both human palms and fingertips in real time using the 3D depth map information from the Kinect v1. Even though our solution only tracks fingertips and the centers of the palms, and can sometimes miss the thumb of the hands, it yields acceptable results for applications that need simple HCI and do not have the dedicated hardware, such as GPUs, needed for processing. Some of these can be video games, simulators, and, with the advent of mobile depth sensors [5–7], mobile applications.

Normally, clustering can be a computationally expensive task and, because of that, it is not used in real-time applications. Because of all the advantages that clustering provides, we also developed a modified DBSCAN algorithm using octrees and an additional clustering step, which allowed us to process all the 3D points obtained in real time.

For future work, we would like to generate a model to track the complete fingers of each hand. Also, the generation of the curve skeleton relies on the orientation of the hand (which should be pointed upward towards the depth sensor). For this


reason, we would like to implement an orientation vector that can identify the hand's orientation, so that the previous requirement does not have to be met and more freedom is given to the available hand movements.

Acknowledgements This project was done over 6 months (November 2013 to April 2014) thanks to the PROINNOVA fund of the Innovation Incentives Program [2], granted by CONACYT. This project was developed for Larva Game Studios [4], a Mexico-based game development company. We thank Jorge Morales and Juan Miral Rio, from Larva Game Studios, for their collaboration in this project.

References

1. IISU middleware (2014). URL http://www.softkinetic.com/products/iisumiddleware.aspx, Accessed: 2014-09-13

2. Innovation incentive program (in Spanish) (2014). URL http://www.conacyt.mx/index.php/fondos-y-apoyos/programa-de-estimulos-a-la-innovacion, Accessed: 2014-10-29

3. Kinect for Windows SDK v1.8 (2014). URL http://www.microsoft.com/en-us/download/details.aspx?id=40278, Accessed: 2014-09-13

4. Larva Game Studios (2014). URL http://larvagamestudios.com/, Accessed: 2014-10-29

5. ATAP Project Tango - Google (2014). URL https://www.google.com/atap/projecttango/#project, Accessed: 2014-09-13

6. SoftKinetic - DepthSense modules (2014). URL http://www.softkinetic.com/products/depthsensemodules.aspx, Accessed: 2014-09-13

7. The Structure Sensor is the first 3D sensor for mobile devices (2014). URL http://structure.io/, Accessed: 2014-09-13

8. Andrew, F., Daniel, F., Shahram, I., Cem, K., Eyal, K., Ido, L., Christoph, R., Toby, S., Jamie, S., Jonathan, T., Alon, V., Yichen, W.: Fully articulated hand tracking (2014). URL http://research.microsoft.com/en-us/projects/handpose/default.aspx, Accessed: 2014-10-29

9. Borah, B., Bhattacharyya, D.: An improved sampling-based DBSCAN for large spatial databases. In: Proceedings of the International Conference on Intelligent Sensing and Information Processing, pp. 92–96. IEEE (2004)

10. Du, H., To, T.: Hand gesture recognition using Kinect. Technical Report, Boston University (2011)

11. Engine, U.G.: Unity. URL http://unity3d.com/, Accessed: 2014-09-13

12. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: 2nd International Conference on Knowledge Discovery and Data Mining, vol. 96, pp. 226–231. AAAI Press, Portland, OR, USA (1996)

13. Garratt, G., Leslie, P.K., Russ, T., Tomas, L.P.: Kinect hand detection (2010). URL http://www.csail.mit.edu/videoarchive/research/hci/kinect-detection, Accessed: 2014-10-29

14. Hongyong, T., Youling, Y.: Finger tracking and gesture recognition with Kinect. In: 2012 IEEE 12th International Conference on Computer and Information Technology (CIT), pp. 214–218. IEEE, Chengdu, Sichuan, China (2012)

15. Kakumanu, P., Makrogiannis, S., Bourbakis, N.: A survey of skin-color modeling and detection methods. Pattern Recognition 40(3), pp. 1106–1122 (2007)

16. Kuznetsova, A., Rosenhahn, B.: Hand pose estimation from a single RGB-D image. In: Advances in Visual Computing, Lecture Notes in Computer Science, vol. 8034, pp. 592–602. Springer Berlin Heidelberg (2013)

17. Li, Y.: Hand gesture recognition using Kinect. In: 2012 IEEE 3rd International Conference on Software Engineering and Service Science (ICSESS), pp. 196–199. IEEE, Haidian District, Beijing, China (2012)

18. Liu, B.: A fast density-based clustering algorithm for large databases. In: 2006 International Conference on Machine Learning and Cybernetics, pp. 996–1000. IEEE (2006)


19. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. California, USA (1967)

20. Madhuri, Y., Anitha, G., Anburajan, M.: Vision-based sign language translation device. In: 2013 International Conference on Information Communication and Embedded Systems (ICICES), pp. 565–568. Chennai, Tamilnadu, India (2013)

21. Manigandan, M., Jackin, I.: Wireless vision based mobile robot control using hand gesture recognition through perceptual color space. In: 2010 International Conference on Advances in Computer Engineering (ACE), pp. 95–99. Bangalore, India (2010)

22. Mihail, R., Jacobs, N., Goldsmith, J.: Static hand gesture recognition with 2 Kinect sensors. In: Proceedings of the 2012 International Conference on Image Processing, Computer Vision & Pattern Recognition, pp. 911–917. CSREA Press, Las Vegas, Nevada, USA (2012)

23. Motion, L.: Sculpting (2014). URL https://apps.leapmotion.com/apps/sculpting/windows, Accessed: 2014-09-13

24. Murthy, G., Jadon, R.: A review of vision based hand gestures recognition. International Journal of Information Technology and Knowledge Management 2(2), pp. 405–410 (2009)

25. Oikonomidis, I., Kyriazis, N., Argyros, A.: Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2088–2095. Barcelona, Spain (2011)

26. Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Efficient model-based 3D tracking of hand articulations using Kinect. In: 22nd British Machine Vision Conference, vol. 1, p. 3. Dundee, United Kingdom (2011)

27. Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Tracking the articulated motion of two strongly interacting hands. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1862–1869. IEEE, Providence, USA (2012)

28. Patwary, M.M.A., Palsetia, D., Agrawal, A., Liao, W.k., Manne, F., Choudhary, A.: A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11. IEEE (2012)

29. Raheja, J.L., Chaudhary, A., Singal, K.: Tracking of fingertips and centers of palm using Kinect. In: 2011 Third International Conference on Computational Intelligence, Modelling and Simulation (CIMSiM), pp. 248–252. IEEE, Langkawi, Malaysia (2011)

30. Raheja, J.L., Das, K., Chaudhary, A.: Fingertip detection: A fast method with natural hand. International Journal of Embedded Systems and Computer Engineering 3(2), pp. 85–89 (2011)

31. Sam, V., Kawata, H., Kanai, T.: A robust and centered curve skeleton extraction from 3D point cloud. Computer-Aided Design and Applications 9(6), pp. 869–879 (2012)

32. Tagliasacchi, A., Zhang, H., Cohen-Or, D.: Curve skeleton extraction from incomplete point cloud. In: ACM SIGGRAPH 2009 Papers, SIGGRAPH '09, pp. 71:1–71:9. ACM, New York, NY, USA (2009)

33. Tang, M.: Recognizing hand gestures with Microsoft's Kinect. California, USA (2011)

34. Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (TOG) 33(5), 169 (2014)

35. Venter, G., Sobieszczanski-Sobieski, J.: Particle swarm optimization. AIAA Journal 41(8), pp. 1583–1589 (2003)

36. Yasukochi, N., Mitome, A., Ishii, R.: A recognition method of restricted hand shapes in still image and moving image as a man-machine interface. In: 2008 Conference on Human System Interactions, pp. 306–310. IEEE (2008)

37. Zhao, W., Chai, J., Xu, Y.Q.: Combining marker-based mocap and RGB-D camera for acquiring high-fidelity hand motion data. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 33–42. Eurographics Association (2012)

38. Zhu, Y., Yang, Z., Yuan, B.: Vision based hand gesture recognition. In: 2013 International Conference on Service Sciences (ICSS), pp. 260–265 (2013)