integrating kinect depth data with a stochastic object ... kinect depth data with a stochastic...

Integrating Kinect Depth Data with a Stochastic Object ClassificationFramework for Forestry Robots

Mostafa Pordel, Thomas Hellstrom and A hmad OstovarUmea University, Umea, Sweden

Keywords: Image Processing, Depth Integration.

A bstract: In this paper we study the integration of a depth sensor and an R G B camera for a stochastic classificationsystem for forestry robots. The images are classified as bush, tree, stone and human and are expected tocome from a robot working in forest environment. A set of features is extracted from labeled images to traina number of stochastic classifiers. The outputs of the classifiers are then combined in a meta-classifier toproduce the final result. The results show that using depth information in addition to the R G B results in higherclassification performance.

1 INTRODUCTION

The major contribution of this paper is a study of theeffects of adding depth sensor information to an ob-ject classification system that uses only R G B camera.The system is aimed for a forestry robot and the task isto classify images as one of the four classes bush, tree,stone and human. In this paper object classification isused to make the forestry robot capable of conduct-ing applied tasks such as navigating in forest environ-ments, collecting information about interest objectsetc. Object classification for this kind of ResourceConstrained Embedded Systems (R C ES) is an emerg-ing research area where real-time performance is oneof the main criteria for developed algorithms e.g. real-time vision system for simultaneous localization andmapping(Davison et al., 2007). In this work all theneeded calculations are done in less than a second thatis fast enough for the experimental level for R C ESs.A lso it is possible to improve the calculation timeby using more powerful operating systems. In gen-eral, object classification methods are either bottom-up or top-down (Kokkinos et al., 2006) (Fang et al.,2011). Top-down methods typically include match-ing input data to a certain model or features of objects(see e.g. (X iao et al., 2010)). Bottom-up approachesstart with low level features of the image, like depth-based features, color-based features etc. (see e.g. (Heet al., 2011)). More specifically, stochastic classifiersin bottom-up style are studied for integration of depthand R G B sensor features. A set of all features in-cluding depth and R G B are used to train classifiers in

a three tier architecture for object detection as illus-trated in F igure 1. It is possible to add very sophisti-cated edge and corner features such that the four ob-jects (trees, bushes, stones and humans) would be de-tected easier. However this is beyond the main goal ofthis paper, which is to study the integration of depthsensor information in a generic stochastic classifica-tion method. F inally, depth information is used to fa-cilitate image labeling.

We extract 12 color-based features from R G B im-ages, inspired by a project (A strand and Baerveldt,2002) with classification of sugar beets and weeds foragricultural robots. Depth images are used to cal-culate additional 6 features. The system uses fivestochastic classifiers and a Weighted Majority Vote(W M V ) method as a meta-classifier (Robert et al.,2003) to fuse the individual classifiers. C lassificationresults are presented for three cases; color-based fea-tures extracted from the entire image, color-based fea-tures extracted from the color foreground image andfinally color-based and depth-based features extractedfrom the color foreground image and depth image.A database of images has been collected and labeledwith a tool described in Section 2. Section 3 describeshow the different components of the system are con-nected. The sensor device is presented in Section4.The used features are described in Section 5. Section6 explains the used classifiers and also how these clas-sifiers are combined together to increase the overallperformance. The results are presented in Section 7and the effect of adding the depth sensor is analyzed.F inally, conclusions and future research are described

��Pord el M., H ellström T. a nd O stovar A . (2012).Inte grating Kin e ct D e pth D ata with a Stoch a stic O bje ct C la ssific ation Fram ework for Fore stry R obots.In Proce e dings of th e 9th Intern ation al C onfere nce on Inform atics in C ontrol, Autom ation a nd R obotics, p a g e s 314-320D O I: 10.5220/0004045203140320C opyright c S ciTe Pre ss

F igure 1: Framework for generic stochastic feature-basedobject classification. (SF: Sensor Fusion, F F: Feature Fu-sion, C F: C lassifier Fusion).

in Section 8.

2 ODAR TOO L

O D A R (Object Detection for A griculture Robots) isa tool that integrates software components written inC#, M AT L A B and C ++ for image analysis (see F ig-ure 2). It interacts with sensors and collects imagesinto a database. O D A R also contains functionality forimage labeling. The user may select the desired ob-ject in an R G B image and associate it with a specificclass. This process is called labeling. More specif-ically, O D A R uses an algorithm that selects an areawhen the user clicks on the image. It uses a filter tosearch for connected pixels with similar R G B values.For objects with many different colors e.g. a humanwith clothes with many colors, the user may click onmultiple points to cover the entire object. This inter-active solution works well, but the procedure is timeconsuming and requires multiple clicks for some im-ages. For this reason, we developed an automaticmethod that uses depth images for object selectionwithin the R G B image, by removing the backgroundpixels from the R G B image. The procedure is similarto what is described in Section 4.2 for background de-tection used for calculation of features. In our experi-ment, all images have been labeled with this method.

3 SYSTE M ARCHITECTURE

One way to increase the performance of an existingclassification solution is to add and fuse new compo-nents, e.g. new sensors, features or classifiers. Oursystem is built on a framework (F igure 1) that sup-ports this approach. It comprises a Sensor Layer, a

F igure 2: O D A R is a tool that supports image analysis. Itcollects images to a database and computes foreground im-ages by using R G B and depth images.

Feature Layer and a C lassification Layer. In eachlayer, fusion of information may take place. In thispaper, a depth sensor and an R G B camera are fusedin the Sensor Layer to generate foreground images. Inthe Feature Layer, 18 different depth-based and R G B-based features are generated as described in Section5. F inally, in the classification layer, classifier out-puts are fused to generate a combined output. In thefollowing, the three layers will be described in moredetail.

4 SENSOR L AYER

This layer reads and fuses data from sensors (or fromthe O D A R database) and provides the inputs for fea-ture extraction in the Feature layer. In the remainderof this section of paper, the K inect sensor and sensorfusion for background detection are presented.

4.1 Microsoft Kinect

In late 2010 M icrosoft released the K inect Device forthe X box console. Since then, several groups withinthe robotics community have integrated K inect withtheir applications. ROS (Robotic Operating System)(Van den Bergh et al., 2011) supports the device,which increased its usefulness in robotics. In sum-mer 2011, M icrosoft introduced the SD K for K inectwith a user friendly A PI that makes interaction mucheasier. K inect has an R G B camera, an infrared baseddepth camera and a row of microphones. In this re-search the microphones are not used. The depth cam-era works well indoors in the limited range of 0.4 to 9meters but has poor performance outdoors, in partic-

�%*��(�* %�� %��*��'*��*��, *��*&��)* ��!��*��#�)) � ��* &%��(�$�,&("��&(��&(�)*(-��&�&*)

��

ular in sharp sunlight. However, we have successfullyused the K inect camera for the experiments reportedin this paper. To collect a proper set of images we con-cerned some limitations such as distance of object tothe camera, the strength and direction of sunlight. Thedeveloped solutions can of course also be used withother kinds of sensors that provide depth informatione.g. laser scanner which has wider working range andhigher depth image quality in different lightning con-ditions in comparison to K inect depth sensor.

4.2 Background Detection

The object of interest (the one that is the basis for thelabeling) in each picture is likely to be located in theforeground of image. Background detection makesit possible to compute features for the interesting ob-ject, rather than for the entire image. Background de-tection is often discussed and solved for sequences ofimages or video frames rather than single R G B im-ages. One common main technique in video back-ground detection uses the differences between con-secutive video frames to detect the background (Cuc-chiara et al., 2003). However, this approach is not ap-plicable for single images or for videos where the in-teresting objects are not moving, such as stones in for-est environments. A nother common technique, whichis applicable for single images, uses morphologicalfilters to classify a specific pattern as background (e.g.Poor lighting in parts of image) (Jimenez-Sanchezet al., 2009). This approach is challenging in for-est environments where objects and backgrounds arevery similar. A nother possibility is to add a range sen-sor like time of flight (T O F) sensor (Van den Berghand Van Gool, 2011) or stereo vision and build a 3Dmodel and detect the background using depth infor-mation. In this project, we use a simple backgrounddetection algorithm based on depth information andthe fact that is very likely in our object classificationproblem that the object of interest would be locatedin foreground image. The background mask B(I,j) iscomputed from a depth image D(i,j), as follows:

if D (i; j) a D mean thenB(i; j) 0

elseB(i; j) 1

end ifWhere D-mean is the mean value of all elements inD. This calculation is repeated for all pixels (i,j) inD. The depth foreground image is then computed bymultiplying D element wise with mask B. For theK inect camera, D has lower resolution than R G B im-age. B is therefore resized and then multiplied ele-ment wise with the R G B image to generate the RG B

F igure 3: R G B Image of a tree.

F igure 4: Depth values from the K inect camera is used todetect the foreground objects of F igure 3.

foreground image. Parameter a is a number between0 and 1 and determines how far from D-mean pix-els are considered to be in foreground image ratherthan background. It is set with maximum likelihoodestimation using a part of the training images, suchthat manually marked foregrounds match the com-puted foregrounds the most. For the images in ourdatabase, a =0.5 was computed as optimal.

The computation of depth foreground and colorforeground images is an example of how sensor datamay be fused to produce new sensor data that can beused by the subsequent Feature layer. F igure 4 showshow the objects of interest appear in the Depth fore-ground image. A ll depth-based features are extractedfrom Depth foreground images while color-based fea-tures are calculated from R G B foreground images (inthe result section, results with color-based featurescalculated from the entire R G B images are also re-ported).

�� *��%*�(%�* &%�#��&%��(�%��&%��%�&($�* �)� %��&%*(&#��+*&$�* &%��%��&�&* �)

��

5 FEATURE L AYER

In total, 18 features are extracted for each image pair(R G B and Depth). They represent either color-basedor depth-based properties. Depth-based features areextracted from depth foreground images (see Section4.2). Hue, Saturation and Value (HSV ) images arecalculated from the R G B images and are less sensitiveto image intensity and vary less to varying light condi-tions. Therefore they often lead to better performancein classification problems (Chen et al., 2007). For theHue, Saturation and Value images, as well as for eachcolor channel (R, G and B) the mean and varianceare computed as separate features. If the generatedfeatures were directly given to the classifiers, featureswith larger ranges would dominate smaller featureseffects. Therefore, all features are normalized to zeromean and unity standard deviation.

A ll color-based features are listed in Table 1.

Table 1: Color-based Features.

R,G,B-Mean Mean values of red, green, blueR,G,B-Var Variances of red, green, blueH,S,V-Mean Mean values of Hue, Saturation, ValueH,S,V-Var Variances of Hue, Saturation, Value

Table 2 shows depth-based features that are ex-tracted from depth foreground images. The Skeletonfeature is based on the idea of removing maximumnumber of pixels in the image without losing the over-all connectivity in the image. The value of the Skele-ton feature equals the number of remaining pixels.This feature has been used for a long time in com-puter vision, optical character recognition and patternclassification (Lakshmi and Punithavalli, 2009). TheSkeleton feature in F igure 6 is computed for the depthforeground image in F igure 4.

Table 2: Depth-based features.

D-mean Mean value of foreground depth imageD-var Variance of foreground depth imageSkeleton Number of pixels in Skeleton imageD ispersion D istribution of foreground depth imagePerimeter Length of perimeter of foreground depth imageA rea A rea of foreground object

The A rea feature is computed as the number ofpixels in the depth foreground image. The D ispersionfeature is calculated by dividing the Skeleton featureby the A rea. For stones, which typically have lowdepth variations, the Skeleton feature is small. TheD ispersion is also small for stones compared to treesor bushes where, for the same A rea, the Skeleton fea-ture is larger. In short, D ispersion shows how dis-tributed an object is over the image. The D ispersionfeature in F igure 9 is computed for the depth fore-ground image in F igure 4.

F igure 5: C lass histograms and fitted distributions for theA rea feature. It helps distinguishing trees from bushes.

F igure 6: C lass histograms and fitted distributions for theSkeleton feature.

The computation of D ispersion feature is an ex-ample of how features may be fused to produce addi-tional features for the subsequent C lassifier layer.

Depth features help distinguishing trees frombushes, which can be a hard task even for humans.Images with trees usually include trunks that have lessvariation in depth and also smaller depth mean. F ig-ure 5 illustrates how trunks effect depth information.

Perimeter is another feature that gives informationabout shape of the object. It is computed as the num-ber of pixels in the identified perimeter in the image(see F igure 7).

F igure 8 shows the distribution of normalizedperimeter for the collected images of trees, bushes,stones and human.

6 C L ASSIFIER L AYER

F ive stochastic classifiers are used for classification:Naive Bayes (Friedman et al., 1997), Decision Trees

�%*��(�* %�� %��*��'*��*��, *��*&��)* ��!��*��#�)) � ��* &%��(�$�,&("��&(��&(�)*(-��&�&*)

��

F igure 7: The Perimeter for the depth foreground imageshown in F igure 4. The Perimeter feature is the number ofpixels in the perimeter.

F igure 8: C lass histograms and fitted distributions for thePerimeter feature.

F igure 9: C lass histograms and fitted distribution for thenormalized D ispersion feature.

(Wang et al., 2008), K -Nearest Neighbor (K N N)(Zhang et al., 2006), Support Vector Machine (SV M)(L i et al., 2009) and L inear D iscriminant A nalysis(L D A ) (A n et al., 2009). These classifiers are generat-ing the naive results that are given to a meta-classifierto be fused in order to make the final result. A ll men-

tioned classifiers, except SV M, train very fast. Witha 3 gigahertz processor, the entire training phase (ex-cluding SV M) with 650 images, (see Section 2) takesless than one second, which is fast enough also forreal-time online training (see (Wang et al., 2008), (A net al., 2009)) if the system would be implemented ona real-time platforms like FPG A (Pordel et al., 2011),(A n et al., 2009). SV M trains slower, but all meth-ods including SV M (L i et al., 2009) can provide real-time classification after training. The basic version ofan SV M classifier only works with two classes. Inthis project we use the ” One Versus Rest” technique(Zhang et al., 2007) to extend the SV M to work withfour classes.

Table 3: A ccuracy without background subtraction usingcolor-based features only.

C lassifier Training Testingaccuracy accuracy

K N N 94.74% 90.51%D T 95.84% 77.22%SV M 93.91% 90.51%N B 86.15% 79.75%L D A 96.687% 77.85%W M V - 89.87%

A ll classifier outputs are combined by WeightedMajority Vote rule (W M V )(De Stefano et al., 2002).W M V does not need training, but uses the accuracyof the individual classifiers to determine the weightsfor each classifier. The accuracy A(i) (fraction cor-rect classifications) for each classifier i is computedby averaging the accuracy from 10-folds cross valida-tion. w(i), the weight for classifier i is computed suchthat all weights add up to 1:

w(i) = A(i)=(5

Âj = 1

A( j))

The W M V classifier computes a fused classification faccording to:

f = arg maxi

5

Âj = 1

(w( j) ⇥ I (C ( j) = i))

Where C(j) is the classification delivered by classifierj and I is an indicator function (James and StanfordUniversity, 1998) computing to 1 if the argument istrue and to 0 otherwise. In other words, f is the classwith the largest number of weighted votes.

7 RESULTS

A s a rule of thumb, the training set should be about10 times the number of features used in classifica-tion (Jain et al., 2000). For the 18 features that are

�� *��%*�(%�* &%�#��&%��(�%��&%��%�&($�* �)� %��&%*(&#��+*&$�* &%��%��&�&* �)

��

Table 4: A ccuracy with background subtraction using color-based features only.



Table 5: A ccuracy with background subtraction using color-based and depth-based features.



described in Section 5, 150 samples have been col-lected for each object type and the database has intotal 650 R G B and depth images. From this set, 80%is used for training, 10% for 10 fold cross-validationof the classifiers and the last 10% for final testing ofW M V meta-classifier. The entire process is repeated10 times for 10 fold cross-validation also for W M V. Inorder to study the effect of adding depth information,classification results are computed for three differentcases. C lassification results for the five classifiers andthe fused W M V classifier are presented in Tables 3,4, and 5. A ccuracy for both training data and test dataare presented and is computed as the fraction of suc-cessful classifications over the respective set.

The results for the first case, color-based featuresextracted from the entire input images (without back-ground subtraction) have the worst performance.

In the second case, with color-based features com-puted from color foreground images, the classificationresults are better. The reason is that the features arecomputed for a more relevant part of the images, andhence are less noisy compared to the first case.

In the third case, depth-based and color-based fea-tures are extracted from foreground images. The re-sults show how the added depth-based features im-prove the classification accuracy.

Table 6, shows the confusion matrix for W M V andcase three. O f special interest is the low confusion be-tween trees and bushes, which may be hard to distin-guish between also for a human.

8 CONC LUSIONS AND FUTUREWORKS

The results obviously show that using depth sensorinformation for subtracting background and addingdepth based features improves the classification accu-racy. Background subtraction is essential for compu-tation of all features and can be easily implementedusing depth sensor information. Furthermore, depthsensor information can facilitate image labeling as de-scribed in Section 4.2.

On the other hand, due to the limited depth rangeof the K inect sensor, objects have to be located be-tween 0.4 and 9 meters from the device. A lso, thedepth image quality is too low in challenging outdoorenvironments such as in sharp sunlight. However, theresults show that it is possible to use the K inect sensoroutdoors in non-direct sunlight and for cloudy days.

A continuation of this project is to improve the ac-curacy by adding corner and edge features that in gen-eral could describe the image structure more specifi-cally. In later stages of this project, other types ofobjects like branches, leaves and peduncles will beadded to the list of object classes. In addition, we mayintegrate laser scanner and a high resolution R G Bcamera to improve depth and color information. A d-dressing the problem of over-fitting, which may occurin any learning method, we will integrate an algorithmto diagnose over-fitting and address it by a feature se-lection algorithm. B y selecting the best subsets of fea-tures, the accuracy of the overall system may increase(see e.g. (Marcano-Cede ando et al., 2010)). F inally,we plan to use compression methods like PC A andGPLV M to maximize the overall accuracy similar towhat is described in (Zhong et al., 2008).

Table 6: Confusion matrix for W M V classifier in Table 5.C lass Tree Bush Stone Human Total M iss-classifiedTree 38 1 0 0 39 2%Bush 0 58 0 0 58 0%Stone 2 0 10 0 12 16%Human 3 0 0 46 49 6%

AC KNOW L EDGE M ENTS

This work was funded by the European Commissionin the 7th Framework Programme (C ROPS G A no246252).

REFERENCES

A n, K . H. and Park, S. H. and Chung, Y. S. and Moon, K .Y. and Chung, M. J. (2009). Learning discriminative

�%*��(�* %�� %��*��'*��*��, *��*&��)* ��!��*��#�)) � ��* &%��(�$�,&("��&(��&(�)*(-��&�&*)

��

multi-scale and multi-position lbp features for face de-tection based on ada-lda. In Robotics and Biomimet-ics (RO BI O), 2009 I E E E International Conference on,pages 1117 –1122.

A strand, B. and Baerveldt, A .-J. (2002). A n agriculturalmobile robot with vision-based perception for me-chanical weed control. Autonomous Robots, 13:21–35.

Chen, W., Shi, Y., and X uan, G. (2007). Identifying com-puter graphics using hsv color model and statisticalmoments of characteristic functions. In Multimediaand Expo, 2007 I E E E International Conference on,pages 1123 –1126.

Cucchiara, R., Grana, C., Piccardi, M., and Prati, A . (2003).Detecting moving objects, ghosts, and shadows invideo streams. Pattern Analysis and Machine Intel-ligence, I E E E Transactions on, 25(10):1337 – 1342.

Davison, A . J., Reid, I. D., Molton, N. D. and Stasse, O.(2007). Monoslam: Real-time single camera slam.IE E E Transactions on Pattern Analysis and MachineIntelligence, 29(6):1052–1067.

De Stefano, C., Della C ioppa, A ., and Marcelli, A . (2002).A n adaptive weighted majority vote rule for combin-ing multiple classifiers. In Pattern Recognition, 2002.Proceedings. 16th International Conference on, vol-ume 2, pages 192 – 195 vol.2.

Fang, Y., L in, W., Lau, C. T., and Lee, B.-S. (2011). A vi-sual attention model combining top-down and bottom-up mechanisms for salient object detection. In Acous-tics, Speech and Signal Processing (I CASSP), 2011IE E E International Conference on, pages 1293 –1296.

Friedman, N., Geiger, D. and Goldszmidt, M. (1997).Bayesian network classifiers. Machine Learning,29(2-3):131–163.

He, L., Wang, H., and Zhang, H. (2011). Object detectionby parts using appearance, structural and shape fea-tures. In Mechatronics and Automation (I CMA), 2011International Conference on, pages 489 –494.

Jain, A . K ., Duin, R. P. W., and Mao, J. (2000). Statisticalpattern recognition: A review. IE E E Transactions onPattern Analysis and Machine Intelligence, 22(1):4–37.

James, G. and Stanford University. Dept. of Statistics(1998). Majority vote classifiers: theory and appli-cations. Stanford University.

Jimenez-Sanchez, A ., Mendiola-Santibanez, J., Terol-Villalobos, I., Herrera-Ruiz, G., Vargas-Vazquez, D.,Garcia-Escalante, J., and Lara-Guevara, A . (2009).Morphological background detection and enhance-ment of images with poor lighting. Image Processing,IE E E Transactions on, 18(3):613 –623.

Kokkinos, I., Maragos, P., and Yuille, A . (2006). Bottom-upamp; top-down object detection using primal sketchfeatures and graphical models. In Computer Visionand Pattern Recognition, 2006 IE E E Computer Soci-ety Conference on, volume 2, pages 1893 – 1900.

Lakshmi, J. and Punithavalli, M. (2009). A survey on skele-tons in digital image processing. In D igital Image Pro-cessing, 2009 International Conference on, pages 260–269.

L i, R., Lu, J., Zhang, Y., Lu, Z., and X u, W. (2009). Aframework of large-scale and real-time image annota-tion system. In Artificial Intelligence, 2009. JCAI ’09.International Joint Conference on, pages 576 –579.

Marcano-Cede ando, A ., Quintanilla-Domi andnguez, J.,Cortina-Januchs, M., and A ndina, D. (2010). Fea-ture selection using sequential forward selection andclassification applying artificial metaplasticity neuralnetwork. In I E C O N 2010 - 36th Annual Conferenceon I E E E Industrial E lectronics Society, pages 2845 –2850.

Pordel, M., K halilzad, N., Yekeh, F., and A splund, L.(2011). A component based architecture to im-prove testability, targeted fpga-based vision systems.In Communication Software and Networks (IC CSN),2011 I E E E 3rd International Conference on, pages601 –605.

Robert S. Lynch Jr. and Peter K . Willett (2003). Use ofbayesian data reduction for the fusion of legacy clas-sifiers. Information F usion, 4(1):23–34.

Van den Bergh, M., Carton, D., De N ijs, R., M itsou,N., Landsiedel, C., Kuehnlenz, K ., Wollherr, D.,Van Gool, L., and Buss, M. (2011). Real-time 3d handgesture interaction with a robot for understanding di-rections from humans. In RO-MAN, 2011 I E E E, pages357 –362.

Van den Bergh, M. and Van Gool, L. (2011). Combining rgband tof cameras for real-time 3d hand gesture inter-action. In Applications of Computer Vision (WACV),2011 IE E E Workshop on, pages 66 –72.

Wang, J., Wu, Q., Deng, H., and Yan, Q. (2008). Real-time speech/music classification with a hierarchicaloblique decision tree. In Acoustics, Speech and SignalProcessing, 2008. I CASSP 2008. IE E E InternationalConference on, pages 2033 –2036.

X iao, Q., Hu, X ., Gao, S., and Wang, H. (2010). Object de-tection based on contour learning and template match-ing. In Intelligent Control and Automation (WC IC A),2010 8th World Congress on, pages 6361 –6365.

Zhang, J., Marszalek, M., Lazebnik, S., and Schmid, C.(2007). Local features and kernels for classification oftexture and object categories: A comprehensive study.International Journal of Computer Vision, 73(2):213–238.

Zhang, H., Berg, A . C., Maire, M. and Malik, J. (2006).Svm-knn: D iscriminative nearest neighbor classifica-tion for visual category recognition. volume 2, pages2126–2136.

Zhong, F., Capson, D., and Schuurman, D. (2008). Par-allel architecture for pca image feature detection us-ing fpga. In E lectrical and Computer Engineering,2008. C C E C E 2008. C anadian Conference on, pages001341 –001344.

�� *��%*�(%�* &%�#��&%��(�%��&%��%�&($�* �)� %��&%*(&#��+*&$�* &%��%��&�&* �)

��

integrating kinect depth data with a stochastic object ... kinect depth data with a stochastic...

Documents