

A Fast Pipeline for Textured Object Recognition in Clutter using an RGB-D Sensor

Kanzhi Wu
Centre for Autonomous Systems
University of Technology, Sydney
Sydney, Australia, 2007
Email: [email protected]

Ravindra Ranasinghe and Gamini Dissanayake
Centre for Autonomous Systems
University of Technology, Sydney
Sydney, Australia, 2007
Email: Ravindra.Ranasinghe, [email protected]

Abstract—This paper presents a modular algorithm pipeline for recognizing textured household objects in cluttered environments and estimating their 6 DOF poses using an RGB-D sensor. The method draws from recent advances in this area and introduces a number of innovations that enable improved performance and faster operation in comparison with the state of the art. The pipeline consists of (i) support plane subtraction, (ii) SIFT feature extraction and approximate nearest neighbour based matching, (iii) feature clustering using 3D Euclidean distances, (iv) SVD based pose estimation in combination with an outlier rejection strategy named SORSAC (Spatially ORdered SAmple Consensus), and (v) a pose combination and refinement step to combine overlapping identical instances, refine the pose estimates and remove incorrect hypotheses. Quantitative comparisons with the MOPED [1] system on a self-constructed dataset demonstrate the effectiveness of the object recognition pipeline.

I. INTRODUCTION

Despite significant progress in the past few years, object recognition in cluttered environments remains a challenging problem in both the computer vision and robotics communities. Reliable object recognition and accurate pose estimation are essential prerequisites for the successful deployment of service robots that can understand and interact with unstructured daily environments such as kitchens and offices.

Object recognition can be understood on two different levels: (1) category-level recognition: is this a car or a bicycle? (2) instance-level recognition: is this a specific brand of cereal box? Even though these two problems are closely related, category-level recognition has received more attention in the computer vision and machine learning communities, where feature learning [2] and feature matching [3] are the key techniques. The robotics community, in contrast, has typically focused on instance-level object recognition. In this context, the number of objects in a given environment is relatively small (< 20) and each object needs to be recognized specifically. The significant challenge in robotics is the presence of clutter: objects may only be partially observed. By comparison, datasets widely used in category-level recognition, such as Caltech 101 [4], usually have one target object per image. Additionally, in robotics, 6 DOF pose estimation of the target object is also of interest, since a manipulation step typically follows and the speed of operation is important.

In past research, the RGB camera has been the dominant sensor at both levels of object recognition. In recent years, however, RGB-D sensors such as the Microsoft Kinect [5], which provide synchronized point clouds, depth images and color images, have become popular. These RGB-D sensors allow robots to use multimodal information to achieve better object recognition and pose estimation accuracy [17].

In this paper, we propose a new object recognition and pose estimation framework for everyday cluttered scenes using multimodal data from an RGB-D sensor. The pipeline is motivated by MOPED [1] but avoids some of its more computationally expensive steps and introduces a number of new innovations that make the proposed system faster and more robust. We first remove the support plane of the objects using RANSAC [8] model fitting on the 3D point cloud (Section III-B) and extract SIFT features [9] in the selected regions using the generated mask image; additional feature descriptors such as SURF, BRIEF and ORB are also compared and discussed in that section. Feature correspondences are obtained using the Approximate Nearest Neighbour (ANN) [10] algorithm (Section III-B). The matches for each object are grouped by Mean Shift [11] using 3D geometric information, and each cluster is associated with a hypothetical object and its related pose. We propose a simple yet effective iterative pose estimation strategy named SORSAC (Spatially ORdered SAmple Consensus) that works from an imperfect set of 3D-3D correspondences.

The system is evaluated on multi-object recognition in cluttered environments. On a self-constructed cluttered-object dataset, the proposed system takes approximately 1.2 seconds per frame to detect at least 4 different or identical objects on an Intel Core™ i5-2400 CPU without GPU processing. Approximately 80% of the total time is spent on SIFT feature extraction, so significant potential exists for further efficiency improvements using GPU processors.

II. RELATED WORK

Reliable and robust object recognition and pose estimation are critical prerequisites for robotic manipulation. Estimating the pose of a rigid object from images is a well-studied problem in the literature, also known as the Perspective-n-Point (PnP) problem when using point features such as SIFT [9]. Gordon and Lowe [13] first introduced a 3D object recognition algorithm that extracts and groups features from the input image; in their work, the PnP problem is solved using the Levenberg-Marquardt (LM) non-linear least squares optimization. More recently, the MOPED system developed by Collet et al. [1] improved Gordon's method by iteratively combining clustering and pose estimation on 3D-2D correspondences, achieving a reliable real-time solution in slightly cluttered environments.

During the past few years, consumer-level depth cameras such as the Microsoft Kinect, which provide high quality synchronized depth and color data, have become popular sensors in robotics research. Using point clouds, Rusu et al. proposed several feature detectors and descriptors, such as the Fast Point Feature Histogram (FPFH) [14] and the Viewpoint Feature Histogram (VFH) [15]. With 3D feature matching, a 6 DOF pose can be identified even for untextured objects. Aldoma et al. further improved VFH into the Clustered Viewpoint Feature Histogram (CVFH) [16] and presented a modular recognition and pose estimation algorithm combining three different recognition pipelines that makes full use of the multimodal information. Tang et al. [18] presented a new textured object recognition pipeline based on additional color histogram information and achieved outstanding performance on the Willow and Challenge datasets [22]; however, their pose estimates are still obtained by solving the PnP problem with the LM algorithm. Xie et al. [19] improved the accuracy and robustness of pose estimation over Tang's work by using dense feature extraction and multimodal blending.

III. METHODOLOGY

The goal of the proposed system is to recognize objects and estimate their poses using a single RGB-D sensor, given a set of object models. In this section, the input data, models and pose estimation results are formulated, the framework is briefly introduced and the key contributions are presented.

A. Problem Formulation

The input and output parameters are listed in Table I. The input for each frame consists of a color image Ic, a depth image Id and a dense point cloud P of size nP = |P| captured from the Kinect. Before recognition, the model for each object is created using a Structure-from-Motion technique, bundler [20], in a manner similar to MOPED [7]. The model M consists of a set of 3D features fi, where each fi includes 3D coordinates pi = [xi, yi, zi] in the object coordinate system and an associated SIFT descriptor di ∈ R^128.

The output of the proposed algorithm is an object recognition hypothesis O (one hypothesis per recognized object) and the associated pose P = [R, t], where R is the rotation matrix relating the observed object to the model and t is the translation vector from the sensor to the observed object.
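To make the formulation concrete, the sketch below shows one possible in-memory representation of the notations in Table I. The class and field names (ModelFeature, ObservedFeature, Hypothesis, etc.) are our own illustration and are not defined in the paper.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class ModelFeature:
        p: np.ndarray    # 3D coordinates p_i = [x_i, y_i, z_i] in the object frame
        d: np.ndarray    # associated SIFT descriptor d_i (128-dimensional)

    @dataclass
    class ObjectModel:
        name: str
        features: list   # set of ModelFeature, built offline via Structure-from-Motion

    @dataclass
    class ObservedFeature:
        p_c: np.ndarray  # 3D coordinates in the camera frame (from the point cloud)
        p_e: np.ndarray  # 2D pixel coordinates [u_e, v_e] in the image
        d: np.ndarray    # SIFT descriptor

    @dataclass
    class Hypothesis:
        model: ObjectModel  # recognized object O
        R: np.ndarray       # 3x3 rotation relating the model frame to the camera frame
        t: np.ndarray       # translation from the sensor to the observed object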

B. Algorithm Pipeline

This section gives a brief summary of the proposed framework and the basic strategy in each step; the key contributions in the pipeline are described in depth in Section III-C.

TABLE I
SUMMARY OF NOTATIONS

Symbol       Description

Input parameters
Ic           RGB color image
Id           depth image
P            point cloud with size nP
M            object model, consisting of a set of features fi
fi           point feature in the model, with 3D coordinate pi and associated descriptor di: fi = (pi, di)

Output parameters
H            recognized object hypothesis O with pose P = [R, t]
fi           detected feature, with 3D coordinates pci in the point cloud, 2D coordinates pei in image space and associated descriptor di: fi = (pci, pei, di)
Im           mask image after plane subtraction
c(fi, fj)    match between the i-th feature in the model and the j-th feature in the observation
G            clustered group, consisting of a set of matches M

Fig. 1. Overview of the object recognition and pose estimation system: the input data (RGB image, depth image and point cloud) pass through support plane subtraction, feature extraction, ANN based feature matching against the object models, feature clustering, SVD based pose estimation with outlier rejection, and pose combination and refinement.

1) Support plane subtraction: In our experiments, all target objects are assumed to be placed on one support plane. Given the raw input point cloud P, we extract the largest plane Π in the scene as the support plane using RANSAC [8] model fitting. Unlike the scenario settings in the Willow dataset [22], in which a chessboard marker placed on the plane assists plane subtraction (see Figure 6), we consider it more general and realistic to use the geometric information of the environment itself. Moreover, RANSAC plane removal is much faster than the chessboard-marker-assisted method, which requires corner point extraction.

The normal vector vπ shared by the point set Pπ of the extracted plane can then be obtained. The centroid pπ of the subset Pπ is selected as the origin of the support plane. For every other 3D point pi, we compute the projection of vi = pi − pπ onto the normal vector vπ, i.e. its signed height above the plane:

hi = vi · vπ    (1)

The point pi is considered to be above the plane if hi > vτ. In our experiments, vτ is positive with |vτ| = 0.005 m, meaning a point is kept only if it is at least 0.005 m above the plane. A binary mask image Im is then generated in which only points above the plane with available depth information are set to 1, and all others to 0.
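A minimal sketch of this step is shown below, assuming an (N, 3) array of valid cloud points. The hand-rolled RANSAC loop and the normal-orientation heuristic are illustrative stand-ins for whatever implementation the authors used (e.g. PCL [21]); the mask image Im would be built by scattering the returned boolean flags back onto the pixel grid of the organized cloud.

    import numpy as np

    def subtract_support_plane(points, v_tau=0.005, iters=500, inlier_tol=0.01, rng=None):
        """RANSAC plane fit followed by the height test of Eq. (1).
        points: (N, 3) array of valid 3D points from the Kinect cloud."""
        rng = np.random.default_rng() if rng is None else rng
        best_inliers, best_n = None, None
        for _ in range(iters):
            # Fit a candidate plane through 3 randomly chosen points.
            a, b, c = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(b - a, c - a)
            norm = np.linalg.norm(n)
            if norm < 1e-9:          # degenerate (collinear) sample
                continue
            n = n / norm
            inliers = np.abs((points - a) @ n) < inlier_tol
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers, best_n = inliers, n
        # Plane origin: centroid of the inlier set; orient the normal "up"
        # (assumption: camera y-axis points downward, as in the Kinect frame).
        p_pi = points[best_inliers].mean(axis=0)
        v_pi = best_n if best_n[1] < 0 else -best_n
        # Signed height of every point above the plane, Eq. (1).
        h = (points - p_pi) @ v_pi
        return h > v_tau, p_pi, v_pi   # keep points at least v_tau above the plane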

2) Feature extraction and matching: The widely used SIFT feature extraction is implemented with CPU-only computation, and we perform the extraction using the generated mask image. Since we have both RGB and depth information, the feature coordinates pei = [ue, ve] in image space and the corresponding 3D coordinates pci = [xc, yc, zc] are available and are kept for the subsequent 3D-3D pose estimation.
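A sketch of masked SIFT extraction with OpenCV (>= 4.4, where SIFT lives in the main module) is shown below; color_image, mask_image (uint8, non-zero above the plane) and the organized, image-registered point_cloud are assumed inputs, not names from the paper.

    import cv2

    def extract_masked_sift(color_image, mask_image, point_cloud):
        """Detect SIFT keypoints only where the mask Im is non-zero and
        attach the 3D point looked up from the organized cloud."""
        gray = cv2.cvtColor(color_image, cv2.COLOR_BGR2GRAY)
        sift = cv2.SIFT_create()
        keypoints, descriptors = sift.detectAndCompute(gray, mask_image)
        features = []
        for kp, d in zip(keypoints, descriptors):
            u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
            p_c = point_cloud[v, u]            # (x_c, y_c, z_c) at that pixel
            features.append((p_c, (u, v), d))  # f_i = (p_ci, p_ei, d_i)
        return features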

To reduce the time spent finding matched correspondences, the fast ANN search method is applied. Although this improves efficiency, outliers are likely to be introduced in this phase. c(fi, fj) denotes the correspondence between fi in the model M and fj in the observation, and all matches between the observation and model M are denoted cM = ∪i c(fi, fj).
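The paper uses the ANN library of Mount and Arya [10]; the sketch below substitutes OpenCV's FLANN KD-tree matcher as a readily available approximate nearest-neighbour search, and adds Lowe's ratio test as a common filtering heuristic that the paper does not specify.

    import cv2
    import numpy as np

    def match_descriptors(scene_desc, model_desc, ratio=0.8):
        """Approximate NN matching of SIFT descriptors via FLANN KD-trees.
        Remaining outliers are handled later by clustering and SORSAC."""
        index_params = dict(algorithm=1, trees=4)   # 1 = FLANN_INDEX_KDTREE
        flann = cv2.FlannBasedMatcher(index_params, dict(checks=64))
        knn = flann.knnMatch(scene_desc.astype(np.float32),
                             model_desc.astype(np.float32), k=2)
        # Keep a match only if it is clearly better than the runner-up.
        return [m for m, n in knn if m.distance < ratio * n.distance]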

Fig. 2. Clustering results for the same scene at four different ranges using 3D distances (first row) and image-space distances (second row). 3D-distance-based clustering gives consistent and reliable results across ranges, whereas 2D clustering produces many more incorrect groups and its results vary randomly.

3) Feature clustering: For all potential matches cM between the observation and model M, we group geometrically close features in the 3D camera coordinate system using the Mean Shift clustering technique. The 3D distances between pairs of points remain fixed as the camera moves, since the environment is static and all objects are rigid; 2D distances, in contrast, may change considerably with perspective, so 3D-distance-based clustering is much more robust than clustering in image space. A comparison of the two clustering methods is presented in Figure 2. After clustering, each group Gi consists of a subset of correspondences between model M and observation, and each group is hypothesized to contain one object instance with a related pose. Outliers inevitably remain after clustering and must therefore be removed during pose estimation.
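A minimal sketch of the 3D clustering using scikit-learn's MeanShift follows; the paper does not name an implementation, and the bandwidth value here is an assumed tuning parameter.

    import numpy as np
    from sklearn.cluster import MeanShift

    def cluster_matches_3d(match_points_xyz, bandwidth=0.08):
        """Group candidate matches by the 3D camera-frame position of the
        observed feature; each returned group of match indices is one
        object/pose hypothesis. bandwidth in metres (assumed value)."""
        ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
        labels = ms.fit_predict(np.asarray(match_points_xyz))
        return [np.flatnonzero(labels == k) for k in np.unique(labels)]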

Fig. 3. 3D-distance-based clustering results: the only target object to be recognized is placed in the center region of each image, marked by the red polygon. In each image, the correct cluster contains one dense area of matches related to the target object, while the outlier matches are randomly distributed over the image. Since inlier and outlier data are not distributed in the same way in image space, this can be used as prior information when rejecting outliers.

4) Iterative pose estimation with outlier rejection: Some of the dispersedly distributed features are already removed in the clustering step because there are not enough features in their neighbourhood. However, outliers still exist in each group of matches; examples are shown in Figure 3. According to Figure 3 and thorough experimentation, inlier matches are distributed unimodally or multimodally depending on the texture, whereas erroneous matches are randomly scattered. Based on this observation, and similar to NAPSAC (N Adjacent Points SAmple Consensus) [23], we assume the distribution of the inliers follows a common pattern: inlier data are likely to be adjacent to each other in the center region of the cluster, while outliers are randomly distributed within the cluster.

To remove the outliers in each group, we propose a strategy using biased data point selection and hypothesize-and-verify, named SORSAC (Spatially ORdered SAmple Consensus). Instead of selecting samples from the input data randomly, we re-rank the data based on spatial information and then perform pose estimation and verification using the ordered data. This reduces the time consumption relative to RANSAC while preserving the accuracy of outlier rejection. The detailed algorithm is discussed in Section III-C.

5) Pose combination and refinement: There are two cases when handling overlapping recognized hypotheses. 1) If multiple adjacent hypotheses Hi name the same recognized object and have overlapping poses, it is highly likely that a single candidate object was separated into several groups during clustering, generating multiple hypotheses with similar poses. We combine these hypotheses into one group and recompute a final pose for the single object. 2) If the overlapping hypotheses belong to different kinds of instances, only one of them can be correct. Without further information, we simply select the hypothesis with the larger number of correct matches as the recognized object.

C. Iterative Pose Estimation using SORSAC

After feature clustering in 3D space, incorrect matches that do not belong to the potential object inevitably remain in each group. RANSAC is one of the most popular outlier rejection methods and is also implemented in MOPED [1]. In this paper, however, using prior information about the distribution of the inlier data, we propose an outlier rejection method similar to RANSAC but with biased sample selection. We formulate a reasonable and realistic assumption on the distribution of the correct matches in each group; this idea significantly reduces the time consumption while still converging to the correct [R, t].

Example images of clustering results in a highly cluttered environment are presented in Figure 3. Only the red polygon contains the correct object; the other groups are all false clustering results caused by spurious feature extraction and inaccurate feature matching. It is also clear that the matches in the correct cluster are distributed unimodally, centered on the textured region, whereas the matches in the false groups do not share a common distribution pattern.

To remove the outliers, we compute the 3D median center point pco of the cluster. We re-rank all observed features in the cluster G by their distance from pci to the center pco, in descending order. Since 4 pairs of matches are sufficient to obtain a pose hypothesis [R, t] using Singular Value Decomposition, we start the estimation from the 4 farthest matches, which also have large distances among themselves. We then verify all other points using the generated [R, t] and remove the farthest match if the pose is not agreed to by most of the correspondences in G. The iteration finishes when [R, t] agrees with 80% of the remaining matches in G. Finally, the removed features are double-checked against [R, t]; consistent features are added back into the inlier group and the final pose [R, t] is recomputed using all correct matches. The details are presented in Algorithm 1.

The proposed method is also more robust to the measurement noise present in the depth information: using pairs of matches that are close together in 3D space can lead to pose estimates with larger errors, so starting the estimation from features that are relatively far apart yields more accurate results.

If we cannot achieve a consistent final pose estimate [R, t] by the time the relative distances between the remaining pairs of matches fall below dτ, the cluster is removed and the algorithm declares that it contains no object; a reasonable [R, t] cannot be computed from points that are too close together. dτ is set empirically. To measure the accuracy of pose estimation, we use the average re-projection error over the correct matches:

δo = (1/|M|) Σ_{i=1..|M|} ||R pmi + t − poi||    (2)

Algorithm 1: Pose Estimation with Outlier Rejection

Input: n pairs of matches Ci = c(fj, fi), C = ∪ni=1 Ci;
       3D point sets pm = ∪ni=1 pmi, po = ∪ni=1 poi
Output: pose [R, t]

for i = 1 : n do
    doi = ||poi − Median(po)||                    // distance to the median center
sort C by doi in descending order; O = ∅          // O: outlier set
repeat
    S = the first 4 matches of C                  // 4 farthest matches for SVDPose
    C = C − S                                     // remaining matches
    [R, t] = SVDPose(pm(S), po(S))
    count = |{Cj ∈ C : ||R pmj + t − poj|| < ε}|  // count consistent matches
    if count / |C| > τ then break                 // consistent pose estimate
    else
        O = O ∪ {S1, S2}                          // discard the 2 farthest matches
        C = C ∪ {S3, S4}                          // and return the other 2 to C
until |C| < 4
for Ci ∈ O do                                     // double-check removed matches
    if ||R pmi + t − poi|| < ε then C = C ∪ {Ci}
[R, t] = SVDPose(pm(C), po(C))                    // final estimation over all inliers
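Algorithm 1 leaves a few details open (e.g. the exact consensus test and termination), so the Python sketch below is one reading of it rather than the definitive implementation: svd_pose is the closed-form SVD (Kabsch) solution named in the text, while eps and tau are assumed threshold values (tau = 0.8 mirrors the 80% agreement criterion above).

    import numpy as np

    def svd_pose(pm, po):
        """Least-squares rigid transform (Kabsch): R, t with R @ pm_i + t ≈ po_i.
        pm, po: (N, 3) corresponding model and observed points, N >= 3."""
        cm, co = pm.mean(axis=0), po.mean(axis=0)
        H = (pm - cm).T @ (po - co)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T                 # reflection-corrected rotation
        return R, co - R @ cm

    def sorsac_pose(pm, po, eps=0.01, tau=0.8):
        """Sketch of the SORSAC iteration of Algorithm 1.
        pm, po: (N, 3) matched 3D points in model and camera frames."""
        # Rank matches by distance from the cluster's median center,
        # farthest first (spatially ordered rather than random sampling).
        d = np.linalg.norm(po - np.median(po, axis=0), axis=1)
        order = np.argsort(-d)
        pm, po = pm[order], po[order]
        removed = np.zeros(len(pm), dtype=bool)
        while (~removed).sum() >= 4:
            active = np.flatnonzero(~removed)
            sample = active[:4]            # 4 farthest remaining matches
            R, t = svd_pose(pm[sample], po[sample])
            resid = np.linalg.norm(pm[active] @ R.T + t - po[active], axis=1)
            if (resid < eps).mean() > tau:
                # Consistent pose: re-admit removed matches that now agree,
                # then re-estimate the final pose from all inliers.
                resid_all = np.linalg.norm(pm @ R.T + t - po, axis=1)
                inl = resid_all < eps
                return svd_pose(pm[inl], po[inl]) if inl.sum() >= 4 else (R, t)
            removed[active[:2]] = True     # discard the 2 farthest matches
        return None                        # declare: no object in this cluster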

δo does not depend on the number of features and provides a quantitative evaluation of both rotation and translation errors in multi-object recognition.
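In code, Eq. (2) reduces to the mean 3D alignment residual over the correct matches (argument names follow the SORSAC sketch above):

    import numpy as np

    def mean_pose_error(R, t, pm, po):
        """Average residual of Eq. (2). pm, po: (|M|, 3) model and observed points."""
        return float(np.mean(np.linalg.norm(pm @ R.T + t - po, axis=1)))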

Fig. 4. Sample images from the dataset

IV. EXPERIMENTS AND DISCUSSION

To evaluate the performance of the outlier rejection strategy and the accuracy of the pose estimation results, we built a cluttered-object dataset, shown in Figure 4, which contains at least 4 different or identical objects in each frame. There are 25 differently shaped objects overall in the dataset, including cubic, cylindrical and irregularly shaped objects. All images and point clouds were captured using a Kinect sensor in real-world situations, and the experiments were performed on a 3.10 GHz quad-core Intel Core™ i5-2400 CPU with 4 GB of RAM running 32-bit Ubuntu 12.04 LTS, without GPU processing.

In this section: 1) we evaluate the object recognition and pose estimation system on the cluttered-object dataset and compare the framework against the open source MOPED system; 2) to verify the feasibility of the system on the Willow dataset, we construct a scenario of similar complexity and demonstrate the results. We also discuss the current limitations, possible further improvements of the system and future work.

A. Multiple Object Recognition and Pose Estimation

To evaluate our approach, we compare the proposed system against MOPED. We present results on our cluttered-object dataset, which includes 100 frames with a total of 439 objects to be recognized. Our algorithm achieves almost perfect precision in object recognition except in cases where 3 pairs of similar objects are present, and it fails to recognize only 32 of the 439 objects, those under extreme occlusion. Since we are not able to obtain an accurate ground-truth relative pose between object and Kinect sensor in this scenario, we use an alternative quantitative measure of pose accuracy: from the estimated [R, t] and Eq. (2), we calculate the re-projection error δ, which captures both translation and rotation accuracy. The system achieves comparable precision-recall recognition performance and slightly better pose estimation accuracy, which is expected since we exploit 3D information while MOPED relies on images alone. More importantly, our algorithm requires only about 50% of the computation time of MOPED, for the following reasons:
• RANSAC based plane subtraction generates a mask image that approximately separates the support plane from the target objects, which decreases the number of extracted features and speeds up the subsequent steps;
• Compared with 2D clustering in image space, 3D geometric Mean Shift clustering is much more accurate and robust to variations in sensor range and viewpoint; this also enables a simpler outlier rejection algorithm;
• The biased sample consensus method, SORSAC, takes advantage of prior information about the distribution of both correct and erroneous matches, and is faster than RANSAC in our framework;
• Given the additional 3D point cloud data, we use an SVD solver to obtain the relative pose between two sets of 3D points, which is much faster than the LM optimizer in MOPED given reliable matches.

Since SIFT extraction is CPU-only in our system and accounts for almost 80% of the runtime, a much lower time consumption (within 400 ms) is expected with GPU computation.

TABLE II
PRECISION AND RECALL RESULTS USING THE PROPOSED METHOD

Method            Precision   Recall    Error (cm)   Time consumption (s)
Proposed system   96.54%      85.26%    1.94         0.8405
MOPED [1]         94.20%      83.39%    2.73         1.5473

Fig. 5. Object recognition in a cluttered environment

B. Discussion

At this point, we are not able to achieve comparable or better performance on the Willow dataset relative to Xie's work [19], mainly for the following reasons:
• The objects in the Willow dataset are placed far from the Kinect, so far fewer SIFT features are extracted and the descriptors are not discriminative enough to obtain a reasonable number of correct matches;
• Even though an RGB-D sensor is used in our system, the recognition step only uses SIFT features from the gray-scale image, whereas the Willow dataset contains many weakly textured and mutually similar objects that can only be recognized using color and depth information.

To demonstrate the capability of our system on Willow-style scenes, we built a scenario of similar complexity based on one of the Willow dataset images (shown in Figure 6). In Figure 7, there are 7 different box-shaped and cylindrical objects, placed in a similar manner to Figure 6 (7 objects with slight occlusion and 1 object lying flat on the plane). Given the closer range and a sufficient number of observed features (Figure 7), we successfully recognize all the objects with accurate poses. The most important advantage of our system is its time efficiency, as discussed in Section IV-A.

The current limitations of our system are as follows: 1) depth information is not fully used in the recognition step, so recognition performance is limited by keypoint features such as SIFT; 2) the current system does not perform well in highly cluttered environments, especially when an inadequate number of features is observed.

In future research, we will focus on exploiting the 3D information of a given model and differentiating similarly textured objects using shape features. Another important direction is to actively control the viewpoint and position of the camera in order to achieve better recognition and pose estimation results in extremely occluded situations.

Fig. 6. Example image from Willow

Fig. 7. Constructed scenario with similar complexity

Fig. 8. Recognition results

V. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a textured object recognition and pose estimation pipeline using an RGB-D sensor in highly cluttered environments. Based on a reasonable assumption about the inlier distribution, we propose a biased sample consensus algorithm named SORSAC that is both faster and robust in pose estimation. Our system makes full use of the depth information by estimating pose directly from 3D-3D correspondences with a closed-form solution, which is faster to execute and also more accurate. Combined with the improvements in each step, thorough experiments show that the proposed system performs well in a number of different indoor scenarios and reduces time consumption by 50% compared with the state of the art.

ACKNOWLEDGMENT

Kanzhi Wu is supported by the Joint University of Technology, Sydney – China Scholarship Council (UTS-CSC) International Research Scholarship.

REFERENCES

[1] A. Collet, M. Martinez and S. S. Srinivasa, The MOPED framework: object recognition and pose estimation for manipulation, IJRR, vol. 30, no. 10, pp. 1284–1306, 2011.
[2] J.J. Wang, J.C. Yang, K. Yu, F.J. Lv, T. Huang and Y.H. Gong, Locality-constrained linear coding for image classification, 2010 IEEE CVPR, pp. 3360–3367.
[3] K. Grauman and T. Darrell, The pyramid match kernel: discriminative classification with sets of image features, 2005 IEEE ICCV, pp. 1458–1465.
[4] F.F. Li, R. Fergus and P. Perona, Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories, 2004 IEEE CVPR Workshop on Generative-Model Based Vision, pp. 59–70.
[5] Microsoft Kinect, http://www.xbox.com/en-US/kinect
[6] D.F. Fouhey, A. Collet, M. Hebert and S. Srinivasa, Object recognition robust to imperfect depth data, 2012 ECCV, pp. 83–92.
[7] A. Collet, D. Berenson, S.S. Srinivasa and D. Ferguson, Object recognition and full pose registration from a single image for robotic manipulation, 2009 IEEE ICRA, pp. 48–55.
[8] M.A. Fischler and R.C. Bolles, Random Sample Consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[9] D.G. Lowe, Distinctive image features from scale-invariant keypoints, IJCV, vol. 60, no. 2, pp. 91–110, 2004.
[10] D.M. Mount and S. Arya, ANN: A Library for Approximate Nearest Neighbor Searching, http://www.cs.umd.edu/mount/ANN/
[11] D. Comaniciu and P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. PAMI, vol. 24, no. 5, pp. 603–619, 2002.
[12] E. Staffan, D. Kragic and F. Hoffmann, Object recognition and pose estimation using color cooccurrence histograms and geometric modeling, Image and Vision Computing, vol. 23, no. 11, pp. 943–955, 2005.
[13] I. Gordon and D.G. Lowe, What and where: 3D object recognition with accurate pose, 2006 LNCS in Toward Category-Level Object Recognition, pp. 67–82.
[14] R.B. Rusu, N. Blodow and M. Beetz, Fast Point Feature Histograms (FPFH) for 3D registration, 2009 IEEE ICRA, pp. 3212–3217.
[15] R.B. Rusu, G. Bradski, R. Thibaux and J. Hsu, Fast 3D recognition and pose using the Viewpoint Feature Histogram, 2010 IEEE/RSJ IROS, pp. 2155–2162.
[16] A. Aldoma, F. Tombari, R.B. Rusu and M. Vincze, OUR-CVFH – oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6DOF pose estimation, 2012 LNCS in Pattern Recognition, pp. 113–122.
[17] A.A. Buchaca, F. Tombari, J. Prankl, A. Richtsfeld, L. Di Stefano and M. Vincze, Multimodal cue integration through hypotheses verification for RGB-D object recognition and 6DOF pose estimation, 2013 IEEE ICRA, pp. 2096–2103.
[18] J. Tang, S. Miller, A. Singh and P. Abbeel, A textured object recognition pipeline for color and depth image data, 2012 IEEE ICRA, pp. 3467–3474.
[19] Z. Xie, A. Singh, J. Uang, K.S. Narayan and P. Abbeel, Multimodal blending for high-accuracy instance recognition, 2013 IEEE/RSJ IROS, pp. 2214–2221.
[20] N. Snavely, S.M. Seitz and R. Szeliski, Photo tourism: exploring photo collections in 3D, 2006 ACM SIGGRAPH, pp. 835–846.
[21] R.B. Rusu and S. Cousins, 3D is here: Point Cloud Library (PCL), 2011 IEEE ICRA, pp. 1–4.
[22] Solutions in Perception Instance Recognition Challenge, ICRA 2011.
[23] D. Myatt, P.H.S. Torr, J.M. Bishop, R. Craddock and S. Nasuto, NAPSAC: high noise, high dimensional robust estimation – it's in the bag, 2002 BMVC, pp. 458–467.