A Robust Visual Human Detection Approach with UKF Based Motion Tracking for a Mobile Robot

Meenakshi Gupta, Laxmidhar Behera, Senior Member, IEEE, K. S. Venkatesh, and Mo Jamshidi, Fellow, IEEE

Abstract—Robust tracking of a human in a video sequence is an essential prerequisite to an increasing number of applications where a robot needs to interact with a human user or operates in a human inhabited environment. This paper presents a robust approach that enables a mobile robot to detect and track a human using an on-board RGB-D sensor. Such robots could be used for security, surveillance, and assistive robotics applications. Our approach achieves real-time computation through a unique combination of new ideas and well established techniques. In the proposed method, background subtraction is combined with a depth segmentation detector and a template matching method to initialize the human tracking automatically. We introduce the novel concept of Head and Hand creation based on depth of interest to track the human silhouette in a dynamic environment, when the robot is moving. To make the algorithm robust, we utilize a series of detectors (e.g. height, size, shape) to distinguish the target human from other objects. Because of the relatively high computation time of the silhouette matching based method, we define a confidence level which allows us to use the matching based method only where it is imperative. An unscented Kalman filter (UKF) is used to predict the human location in the image frame so as to maintain the continuity of the robot motion. The efficacy of the approach is demonstrated through a real experiment on a mobile robot navigating in an indoor environment.

Index Terms—Human silhouette, Projection histogram, Head and Hand creation, Distance transform, Unscented Kalman filter.

I. INTRODUCTION

Introducing visual tracking capabilities in artificial visual systems is one of the most active research challenges in mobile robotics. Visual tracking of a non-rigid object such as a human is an interesting research field in mobile robotics and has received much attention in recent years because of its potential applications, such as site security [1], rehabilitation in hospitals [2], [3], guidance in museums [4], assistance in offices [5], and other military applications. In such applications, a mobile robot not only needs to detect the human, but also needs to track the human continuously in a dynamic environment where the usual background subtraction cannot be used. It is also necessary to be able to give motion commands to the robot at regular intervals in order to maintain a continuous and smooth motion of the robot, even when the image processing may take more time than the permitted interval. In such a case, it becomes necessary to predict the human location in the image plane based on an approximate human motion model [6].

Manuscript received ; revised . Current version published.

Meenakshi Gupta, Laxmidhar Behera and K. S. Venkatesh are with the Department of Electrical Engineering, Indian Institute of Technology, Kanpur, 208016, India (e-mail: {meenug, lbehera, venkats}@iitk.ac.in).

Mo Jamshidi is with the Department of Electrical and Computer Engineering and ACE Center, University of Texas, San Antonio, TX 78249 USA (e-mail: [email protected]).

The primary sensor used for human tracking in robotic applications is a vision sensor such as a camera [7], [8]. Vision is an attractive choice as it facilitates passive sensing of the environment and provides valuable information about the scene that is unavailable through other sensors. Owing to this fact, many algorithms have been developed which detect a human in color images by extracting features such as the face [9], skin color [10], or cloth color [11], and these have been implemented on mobile robotic platforms. Although algorithms developed using a single feature (e.g. face, skin, or cloth color) are computationally effective, they fail to detect the human robustly in a dynamic environment. For example, the algorithm of [12], which uses face detection for tracking a human, fails to detect the human when implemented on a mobile robotic platform: in practical scenarios, when a robot starts tracking a human, the face is often not visible to the robot. Therefore, researchers have started to combine multiple visual features to make human detection robust.
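As an aside, a single-feature detector of the kind cited above can be prototyped in a few lines. The sketch below (our illustration, not the implementation of [9] or [12]) uses OpenCV's stock frontal-face Haar cascade; its failure mode is exactly the one just noted, since it returns nothing as soon as the person faces away from the camera.

    import cv2

    # Single-feature (face) human detector built from OpenCV's bundled
    # Haar cascade; illustrative sketch only, not the method of [9] or [12].
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    frame = cv2.imread("frame.png")  # hypothetical frame from the robot camera
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    # 'faces' is empty whenever the person is turned away from the camera,
    # which is the failure mode discussed above.
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)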

Darrell et al. [13] combine multiple visual modalities for real-time person tracking. Depth information is extracted using a dense real-time stereo technique and is used to segment the user from the background. Skin color and face detection algorithms are then applied on the segmented regions. Their algorithm assumes that the user is nearest to the stereo rig and that the human face is visible to the robot. Gavrila [14] presented a multi-cue vision system for the real-time detection and tracking of pedestrians from a moving vehicle. The algorithm integrates consecutive modules such as stereo-based ROI generation, shape-based detection, texture-based classification and stereo-based verification. The algorithm has a high computation time, as stereo is used for disparity map generation. In the literature, the human detection algorithm developed by Navneet et al. [15] is found to be the most robust. They used Histograms of Oriented Gradient (HOG) descriptors and an SVM classifier to detect the human. Although the algorithm is robust, its high computation time limits its application for real-time systems. In [16], Liyuan et al. integrate multiple vision models for robust human detection and tracking. They combined HOG-based and stereo-based human detection through mean-shift tracking.
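The HOG detector of [15] has a stock implementation in OpenCV, so its behavior is easy to reproduce. The following sketch shows how such a baseline is typically invoked; it is our illustration with assumed parameter values, not the code of [15] or [16].

    import cv2

    # HOG descriptor paired with OpenCV's pre-trained linear-SVM people
    # detector, i.e. the detector family of [15]; parameter values are assumed.
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    frame = cv2.imread("frame.png")  # hypothetical frame from the robot camera
    rects, weights = hog.detectMultiScale(
        frame,
        winStride=(8, 8),    # sliding-window step: larger is faster but coarser
        padding=(16, 16),
        scale=1.05)          # image-pyramid scale factor
    for (x, y, w, h) in rects:  # one rectangle per detected person
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

Scanning a full image pyramid with a dense sliding window is what makes this family of detectors difficult to run in real time on a continuously commanded robot.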

Combining multiple vision models makes human detection robust but simultaneously increases the computational cost of the system. To meet the real-time requirements of human tracking, most existing systems either employ a laser sensor or combine laser sensor information with color camera information. Woojin et al. [17] proposed detection and tracking schemes for human legs using a single LRF. They derived common attributes of legs from a large number of samples and classified them using a support vector data description scheme. Their scheme works under the assumption that the attributes derived from the leg samples do not match other objects. In [18], Nicola et al. implemented a multisensor data fusion technique for human tracking with a mobile robot. The approach is based on the recognition of typical leg patterns extracted from laser scans, which are shown to be very discriminative in cluttered environments. Furthermore, faces are detected using the robot's camera, and this information is fused with the leg positions using a sequential implementation of the UKF. The state model adopted there does not consider the robot motion. Moreover, the state equations use real-world coordinates and thus require odometry data, which is inaccurate. A control action for a robot to follow a human is not defined by the authors in [18]. In [19], Sabeti et al. fuse laser range and color data to train a robot vision system. For each pixel in the robot's field of view, the system has color, depth, and surface normal information, which helps to extract 3D features. This technique has good detection accuracy, but the speed of the algorithm is far from real-time.

The inclusion of a laser sensor increases the robustness of the algorithm but also increases the experimental cost of the system. While both sensing modalities (vision and laser) have advantages and drawbacks, their distinction may become obsolete with the availability of the Microsoft Kinect, which provides both RGB image and depth (range) data. Hao et al. [20] developed an algorithm for real-time human tracking using a color-depth camera. They remove the ground and ceiling planes from the 3-D point cloud input to separate candidate point clusters. The concept of depth of interest (DOI) is used to identify the candidates for detection. A cascade of detectors is used to distinguish humans from other objects among the possible candidates. The algorithm works robustly in real time but requires prior knowledge of the ground and ceiling planes.
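To make the DOI idea concrete, the sketch below selects candidate depth bands directly from the histogram of a depth image, without processing a 3-D point cloud. It is our rough illustration of the concept, not the implementation of [20]; the bin width and peak threshold are assumed values.

    import numpy as np

    # Depth-of-interest (DOI) selection from a depth image (values in mm).
    # Bin width and peak threshold are assumed values, not those of [20].
    def find_dois(depth_mm, bin_width_mm=200, min_pixels=2000):
        """Return (near, far) depth bands whose pixel counts peak locally."""
        valid = depth_mm[depth_mm > 0]  # a depth of 0 means no sensor reading
        bins = np.arange(0, valid.max() + bin_width_mm, bin_width_mm)
        hist, edges = np.histogram(valid, bins=bins)
        dois = []
        for i in range(1, len(hist) - 1):
            # A local maximum with enough pixel support becomes a candidate DOI.
            if hist[i] >= min_pixels and hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]:
                dois.append((edges[i], edges[i + 1]))
        return dois

    def doi_mask(depth_mm, doi):
        """Binary mask of the pixels falling inside one DOI band."""
        near, far = doi
        return (depth_mm >= near) & (depth_mm < far)

Each mask can then be passed to per-candidate detectors, which is the role the DOI plays in [20] and in the Head and Hand creation step introduced below.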

In spite of the advancements made in the field of human tracking, vision-based human tracking still necessitates an algorithm that is robust, runs in real time, and does not require any a priori knowledge about the environment. In this paper, we introduce a new real-time human tracking algorithm that can detect and track a target human in dynamic indoor environments, using a mobile robot equipped with a color-depth camera. Our algorithm initializes the human tracking automatically and learns the hue histograms of the target human's torso and legs. For tracking the human in a dynamic environment, the hue histograms are back-projected. To form the complete human silhouette, all blobs found in the back-projected image are passed through a Head and Hand creation algorithm that is based on DOI. Afterwards, a series of detectors is used to distinguish between the human and other objects.
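The histogram learning and back-projection steps are standard OpenCV operations. The sketch below shows one way to implement them; it is a minimal illustration under our own assumptions (hypothetical ROI coordinates and threshold value), not the exact code of the system described here.

    import cv2
    import numpy as np

    def learn_hue_histogram(bgr_frame, roi):
        """Learn a normalized hue histogram from a torso or leg region."""
        x, y, w, h = roi  # region found at initialization (hypothetical)
        hsv = cv2.cvtColor(bgr_frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
        # Ignore dark or unsaturated pixels, whose hue is unreliable.
        mask = cv2.inRange(hsv, np.array((0., 60., 32.)),
                           np.array((180., 255., 255.)))
        hist = cv2.calcHist([hsv], [0], mask, [180], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        return hist

    def back_project(bgr_frame, hist):
        """Back-project the learned hue histogram onto a new frame."""
        hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
        bp = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # Threshold (assumed value) to obtain candidate blobs for the
        # Head and Hand creation stage.
        _, blobs = cv2.threshold(bp, 50, 255, cv2.THRESH_BINARY)
        return blobs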

As the image processing used to detect the human is computationally expensive, it is a good choice to use a filter to predict the motion [21]. In this work, a UKF is used to predict the position of the human in the image so that the robot can be commanded continuously at regular intervals, even when the image processing may take more time than the permissible limit. Unlike [18], which uses a motion model based on odometry information, we propose a model based directly on image information, thereby avoiding the dependence on noisy odometry data.
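As an illustration of such an image-plane predictor, the sketch below runs a UKF over a constant-velocity pixel-coordinate state using the filterpy library. The state layout, noise values and command interval are our assumptions for illustration; the motion model actually used is defined later in the paper.

    import numpy as np
    from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

    dt = 0.1  # robot command interval in seconds (assumed)

    def fx(x, dt):
        # Constant-velocity motion in image coordinates: x = [u, v, du, dv].
        F = np.array([[1, 0, dt, 0],
                      [0, 1, 0, dt],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1]], dtype=float)
        return F @ x

    def hx(x):
        # The detector measures only the pixel position of the human.
        return x[:2]

    points = MerweScaledSigmaPoints(n=4, alpha=0.1, beta=2.0, kappa=-1.0)
    ukf = UnscentedKalmanFilter(dim_x=4, dim_z=2, dt=dt,
                                fx=fx, hx=hx, points=points)
    ukf.x = np.array([320.0, 240.0, 0.0, 0.0])  # start at image center (assumed)
    ukf.P *= 50.0
    ukf.R = np.diag([25.0, 25.0])  # pixel measurement noise (assumed)
    ukf.Q = np.eye(4) * 0.1        # process noise (assumed)

    # Each control cycle: always predict; update only when the (slower)
    # image-processing pipeline actually delivers a detection.
    ukf.predict()
    detection = np.array([330.0, 236.0])  # hypothetical detector output
    ukf.update(detection)
    predicted_pixel = ukf.x[:2]  # commands the robot between detections

Because the state lives entirely in image coordinates, the predictor needs no odometry input, which is the point of the design choice above.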

The major contributions of this paper include:

• The introduction of the new Head and Hand creation concept using color-depth images, which allows us to detect and track the complete human silhouette without processing the computationally expensive 3D point cloud data.

• The use of multiple detectors and a confidence level allows us to tra