2014 Third International Conference on Agro-Geoinformatics, Beijing, China
Leaf Recognition and Segmentation by Using Depth Image
Xiaowei Shao
Earth Observation Data Integration and Fusion Research Initiative
University of Tokyo, Tokyo, Japan
Email: [email protected]

Yun Shi, Wenbing Wu, Peng Yang, Zhongxin Chen
Institute of Agricultural Resources and Regional Planning
Chinese Academy of Agricultural Sciences, Beijing, China

Ryosuke Shibasaki
Center for Spatial Information Science
University of Tokyo, Tokyo, Japan
Abstract—Measuring the geometric structural traits of plants, especially the shape of leaves, plays an important role in agricultural science. However, most existing techniques and systems offer limited accuracy, efficiency, and descriptive ability, which is insufficient for many real applications. In this study, a new kind of sensing device, the Kinect depth sensor, which measures the real distance to objects directly and captures high-resolution depth images, is exploited for the automatic recognition and extraction of leaves. The pixels of the depth image are converted into a set of 3D points and transformed into a standard coordinate system after ground calibration. Leaves are extracted based on height information, and a hierarchical clustering algorithm, which combines the density-based spatial clustering algorithm and the mean-shift algorithm, is proposed for the automatic segmentation of leaves. Experimental results show the effectiveness of the proposed method.
Keywords—depth image, leaf recognition, leaf segmentation
I. INTRODUCTION
Developing automated and accurate methods for measuring the geometric structural features of plants is an important issue for various agricultural applications such as field-based phenotyping [1], the design and evaluation of plant ideotypes, and precision agriculture [2]–[4]. Among the structural information of plants, the shape of leaves provides a direct and clear description of the status of a plant and is therefore a very important topic in plant science.
In general, measuring the 3D shape of the leaves of a plant efficiently and accurately is very challenging. Most existing methods have limited overall performance and therefore cannot satisfy the requirements of real applications. Direct methods require manual measurement of each leaf and are highly labor-intensive and sometimes destructive [5]. In addition, usually only 2D information can be obtained with these methods. Video cameras provide another option by capturing images of plants. However, it is challenging to separate the leaves of a plant, since they have similar colors and often overlap each other from the view of the camera. Some progress has been made recently on reconstructing the 3D shape of plants from stereo images [6], but the accuracy of the reconstructed shape may still be problematic in practice.
Fig. 1: The Kinect sensor
In this study, the depth sensor, which has achieved remarkable progress and received increasing attention in recent years, is introduced for analyzing the leaf information of a plant. The depth image, which records the real distance between the sensor and the object, is converted into a set of calibrated 3D points. Leaves are extracted according to height information, and a hierarchical clustering algorithm is proposed for the automatic recognition and segmentation of leaves.
II. MEASUREMENT SYSTEM
In this research, the 3D shape of a plant is reconstructed by using a depth sensor, which measures the distances to objects by emitting light beams in controlled directions and calculating the time-of-flight from the reflected signals. Traditional long-range 3D sensors, which are able to work at distances of several hundred meters, such as the Leica HDS3000 and Topcon GLS-1000, are accurate but very expensive devices. Some studies have utilized such devices to measure the canopy and estimate the leaf area index of forests [7], [8]. However, they are seldom used for agricultural applications due to the high cost. The previous generation of short-range 3D sensors, such as the SwissRanger SR4000 and Panasonic D-Imager EKL3103, are much cheaper but have very limited resolution (e.g., 176 × 144 pixels) and are very sensitive to light sources.
In recent years, advances in range sensing technology have made highly cost-effective range sensors available on the market. In this research, the Microsoft Kinect sensor was exploited to capture depth images of plants, and a method for the automatic recognition and extraction of leaves was developed.
Fig. 2: Captured data from the Kinect sensor. (a) original depth image; (b) 3D point view of the depth image; (c) corresponding color image
TABLE I: Specifications of Kinect

Feature         Description
Field of view   57° horizontal, 43° vertical
Vertical tilt   ±27°
Sensing range   approximately 0.7 - 6 m (recommended distance: 1.2 - 3.5 m)
Depth image     640 × 480 pixels
Color image     640 × 480 pixels
Size            37.6 × 15.0 × 12.2 cm
Weight          1.41 kg
Kinect is a sensing input device developed by Microsoft for the Xbox 360 video game console and Windows PCs. It was first launched in 2010 and has received increasing attention from researchers worldwide due to its outstanding sensing ability (see Table I for specifications). A 3D range sensor and a video camera are embedded in Kinect, so it can capture high-resolution depth images and color images simultaneously.
In general, Kinect is cheap, small, and lightweight, in addition to its powerful sensing ability. However, its field of view and sensing range are somewhat limited, which means it can only cover a small area of several square meters. In addition, Kinect is designed for indoor applications only, since its emitted light lies in the spectrum of sunlight and it cannot collect valid measurements in outdoor environments.
III. METHODOLOGY
A. Generating 3D Point Cloud
In this step, the captured Kinect depth image is converted into a set of 3D points by utilizing a calibration technique. The raw depth image from Kinect consists of pixels represented by 11-bit values, and each value can be converted approximately to a distance as follows:

z = 1 / (n · dc1 + dc2),   n ∈ [0, 2047]   (1)
where dc1 = −0.0030711016 and dc2 = 3.3309495161 according to the settings in [9]. It is noteworthy that the distance is represented in a non-linear way, and its resolution decreases as the distance increases. For example, the resolution of the distance at 1 / 3 / 5 m is about 0.3 / 3 / 8 cm, respectively. Therefore, in our experiments the object to be captured is placed at a distance of around 0.8-2 m.
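The conversion in Eq. (1) can be sketched as follows; the function name is illustrative, not part of any Kinect API, while the two constants are the values from [9] quoted above:

```python
# Convert a raw 11-bit Kinect depth value n into a metric distance z (Eq. 1).
DC1 = -0.0030711016
DC2 = 3.3309495161

def raw_to_meters(n):
    """Convert a raw Kinect depth reading (0..2047) to meters; 2047 marks invalid pixels."""
    if n >= 2047:          # saturated / invalid measurement
        return float("nan")
    return 1.0 / (n * DC1 + DC2)
```

Note that the mapping is non-linear: equal steps in n correspond to ever larger steps in z as the distance grows, which is the resolution loss described above.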
Once the distance z is derived, a pin-hole camera model is employed to estimate the 3D position [X_K, Y_K, Z_K]^T of each pixel (u, v) in the depth image as follows:

x = (1 / f_x) · (u − c_x) · z   (2)

y = (1 / f_y) · (v − c_y) · z   (3)

[X_K, Y_K, Z_K]^T = R_K · [x, y, z]^T + T_K   (4)
Here (f_x, f_y, c_x, c_y), along with the rotation matrix R_K and the translation vector T_K, are calibration parameters. For simplicity, lens distortion is not considered in this model. Readers may refer to [10] and [11] for the standard calibration technique and the parameter settings of Kinect, respectively. In addition, the 3D points can be attached with corresponding colors if similar calibration steps are applied to the collected color images.
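A minimal sketch of the back-projection in Eqs. (2)-(4) is given below. The intrinsic parameters used as defaults are placeholder values for illustration, not calibrated Kinect parameters:

```python
import numpy as np

def depth_to_points(depth, fx=580.0, fy=580.0, cx=320.0, cy=240.0,
                    R=np.eye(3), T=np.zeros(3)):
    """Back-project a depth image (meters) into an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx                      # Eq. (2)
    y = (v - cy) * z / fy                      # Eq. (3)
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    pts = pts[np.isfinite(pts).all(axis=1)]    # drop invalid (NaN) depths
    return pts @ R.T + T                       # Eq. (4)
```

Invalid pixels (NaN depths produced by the raw conversion) are dropped before the rigid transform is applied.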
B. Ground Calibration
The 3D points acquired in the previous step are defined in the coordinate system of the Kinect depth sensor. Therefore, their values may change drastically according to the pose of the Kinect. For further analysis, it is necessary to convert the points into a standard coordinate system. Here we select points which lie in the ground plane and impose the condition that the ground plane is represented by the equation z = 0.
In this step, a small set of points indicating the ground plane is first selected (e.g., the blue rectangle in Fig. 3, which can be easily recognized in the depth image). Then a least-mean-square plane-fitting method is applied to the selected points, yielding the equation of the fitted plane. Based on some geometric analysis, a rotation matrix R_G and a translation vector T_G can be calculated so that the fitted plane matches the reference ground plane after the transformation.
Fig. 3: Ground calibration of 3D points
For more details about ground calibration, please refer to our previous work in [12]. Finally, the 3D points after ground calibration are represented by

[X, Y, Z]^T = R_G · [X_K, Y_K, Z_K]^T + T_G   (5)
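One way to compute R_G and T_G is sketched below: fit the plane by SVD (a least-squares fit), then rotate its normal onto the z-axis with Rodrigues' formula. This reproduces the spirit of Eq. (5); the exact procedure in [12] may differ:

```python
import numpy as np

def ground_transform(ground_pts):
    """Return (R_G, T_G) such that R_G @ p + T_G maps the fitted plane to z = 0."""
    centroid = ground_pts.mean(axis=0)
    # The smallest singular vector of the centered points is the plane normal.
    _, _, vt = np.linalg.svd(ground_pts - centroid)
    n = vt[-1]
    if n[2] < 0:                      # make the normal point "up"
        n = -n
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(n, z)
    c = float(n @ z)
    if np.linalg.norm(v) < 1e-12:     # plane already horizontal
        R = np.eye(3)
    else:
        # Rodrigues' formula rotating n onto z.
        K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        R = np.eye(3) + K + K @ K * (1.0 / (1.0 + c))
    T = -R @ centroid                 # the fitted plane lands at z = 0
    return R, T
```

Applying the returned transform to all points places the ground at z = 0, so heights can be read directly from the z-coordinate.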
C. Leaf Extraction and Segmentation
In this subsection, we extract and segment leaves from the calibrated 3D points. To begin with, unwanted objects such as the ground need to be removed. As discussed in the previous section, after ground calibration the 3D points are represented in a standard coordinate system. Therefore, the ground can be recognized by examining the height information (z-value) of the 3D points. An example of the histogram of z is shown in Fig. 4. It is clear that the heights of the points can be divided into two groups, where the height of points representing the ground ranges from 0 to 0.2 m and the height of points belonging to the plant varies from 0.4 to 1.0 m. Here we set the threshold to 0.3 m, and points below this height are not analyzed further.
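The ground-removal step amounts to a simple filter on the calibrated height; 0.3 m is the threshold chosen above from the histogram in Fig. 4:

```python
import numpy as np

def remove_ground(points, z_threshold=0.3):
    """Keep only points whose calibrated height (z-value) exceeds z_threshold (meters)."""
    return points[points[:, 2] > z_threshold]
```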
Considering the spatial consistency of the 3D points, we employ a connectivity-based clustering algorithm for leaf segmentation. The algorithm, density-based spatial clustering of applications with noise (DBSCAN) [13], performs clustering based on the density distribution and is well known for its robustness against noise. In DBSCAN, two parameters, ε and N_b, need to be specified in addition to the input point set. If the Euclidean distance between two points P_A and P_B is less than ε, then P_A and P_B are considered "connected": P_A belongs to the neighborhood of P_B and vice versa. For any point P, if the number of its neighbors is less than N_b, it is regarded as an outlier. Based on these two rules, DBSCAN generates clusters among connected points using a region-growing strategy, and outliers are excluded.
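The two rules above can be sketched as a compact re-implementation; this is illustrative code, not the paper's implementation, and it uses a brute-force distance matrix that would be too slow for the full point sets discussed next:

```python
import numpy as np

def dbscan(points, eps, min_nb):
    """Minimal DBSCAN: per-point cluster labels, -1 marks outliers."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]  # eps-connectivity
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_nb:
            continue                          # already labeled, or not a core point
        labels[i] = cluster                   # region growing from core point i
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_nb:   # expand only through core points
                    stack.extend(neighbors[j])
        cluster += 1
    return labels
```

Points never reached by region growing keep the label -1 and are excluded as outliers, matching the behavior described above.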
DBSCAN performs well on small point sets but can be very computationally intensive when handling a large number of points, due to the frequent queries for neighboring points. In our case there are usually more than 100,000 valid 3D points to be clustered, and the computational cost can be very high if we apply DBSCAN directly to the data set. To solve this problem, we developed a hierarchical clustering strategy by combining it with the mean-shift algorithm.

Fig. 4: Histogram of the z-value of 3D points (two peaks, labeled "Ground points" and "Plant points")
Mean shift is a well-known algorithm for finding local maxima of an underlying density function [14]. It has been widely used in clustering problems and has proved to be robust and effective. The mean-shift iterative scheme can be described by:
y_{j+1} = Σ_{i ∈ N(y_j)} x_i · g(‖(y_j − x_i) / h‖²) / Σ_{i ∈ N(y_j)} g(‖(y_j − x_i) / h‖²)   (6)
where {x_i}, i = 1, 2, ..., M, stands for the input point set, y_j for the value of a seed in the j-th iteration, N(y_j) for the neighborhood of y_j, g(·) for the kernel-related function, and h for the kernel size. First, seeds are generated all over the domain, and then they iteratively shift along the gradient direction of the density function toward local maxima. When seeds become very close to each other, they are merged. The iteration stops when the change of the seeds is trivial, and the remaining seeds are considered as clusters. The mean-shift algorithm can be greatly accelerated by using a block-based approximation, so it remains very efficient when handling a large number of points. For more details please refer to our previous work in [15].
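The update in Eq. (6) can be sketched as follows, assuming a Gaussian kernel g(t) = exp(−t/2). For brevity the neighborhood N(y_j) is taken to be all points, and the block-based acceleration of [15] is not reproduced:

```python
import numpy as np

def mean_shift_step(y, points, h):
    """One iteration of Eq. (6): shift seed y toward the local density maximum."""
    t = np.sum(((y - points) / h) ** 2, axis=1)   # ||(y_j - x_i) / h||^2
    w = np.exp(-0.5 * t)                          # Gaussian kernel weights
    return (w[:, None] * points).sum(axis=0) / w.sum()

def mean_shift(seed, points, h, tol=1e-6, max_iter=100):
    """Iterate Eq. (6) until the seed's movement falls below tol."""
    y = np.asarray(seed, dtype=float)
    for _ in range(max_iter):
        y_next = mean_shift_step(y, points, h)
        if np.linalg.norm(y_next - y) < tol:
            break
        y = y_next
    return y
```

Each seed converges to a mode of the kernel density estimate; seeds that land at (nearly) the same mode are then merged into one cluster.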
The final clustering algorithm works hierarchically as follows. First we apply the mean-shift algorithm to the full 3D point set, using a small kernel size to build a large number of over-segmented groups. Each group includes tens of points, and together the groups can be regarded as a coarse representation of the original 3D point set. Then the DBSCAN algorithm is used to connect groups which satisfy the connectivity condition, using all the points belonging to the corresponding groups. In this way, the computational cost can be greatly reduced while the clustering performance remains almost the same.
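The two-stage idea can be sketched compactly. As simplifications of the paper's strategy, a voxel grid stands in for the fine-kernel mean-shift over-segmentation, and group connectivity is judged by group centers only (the paper uses all member points), merged with a union-find in place of a full DBSCAN:

```python
import numpy as np

def hierarchical_cluster(points, voxel=0.02, eps=0.05):
    """Two-stage clustering: over-segment into groups, then merge nearby groups."""
    # Stage 1: over-segment points into small voxel groups (coarse representation).
    keys = np.floor(points / voxel).astype(int)
    _, group_id = np.unique(keys, axis=0, return_inverse=True)
    centers = np.array([points[group_id == g].mean(axis=0)
                        for g in range(group_id.max() + 1)])
    # Stage 2: union-find merge of groups whose centers lie within eps.
    parent = list(range(len(centers)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    for a, b in zip(*np.where(d <= eps)):
        parent[find(a)] = find(b)
    roots = np.array([find(g) for g in range(len(centers))])
    _, labels = np.unique(roots[group_id], return_inverse=True)
    return labels
```

Because stage 2 operates on group centers rather than raw points, its cost depends on the number of groups (thousands) instead of the number of points (over 100,000), which is the source of the speed-up.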
IV. EXPERIMENTAL RESULT AND DISCUSSION
To verify the performance of the proposed method, we tested our system using a pot of bracket plant (Chlorophytum). The captured depth image and color image are shown in Fig. 2.
Fig. 5: Mean-shift clustering results
In Fig. 2(a) and (b), the color stands for the distance from the sensor. The black area in (a) indicates invalid measurements due to the sensing mechanism of Kinect. In this depth image there are a total of 246,513 valid pixels. After performing ground calibration and ground recognition, we also restricted the 3D points to a specified area in the x-y plane so that unwanted objects (such as people standing nearby) would not be included. Finally, a 3D point set containing 114,364 points was obtained.
We performed the mean-shift clustering algorithm exactly as described in [14], using a Gaussian kernel with a kernel size of 1 cm in the x and y axes and 0.5 cm in the z axis. Fig. 5 shows the result of the mean-shift clustering, where the color indicates the index of the clusters. As expected, this clustering result is over-segmented, with a total of 8,425 clusters.
The final hierarchical clustering result is shown in Fig. 6. This picture is a projection of the clustered 3D points onto the x-y plane and includes a total of 87 clusters. Here 8 colors are employed to represent the cluster indices, and black dots stand for the noise detected by DBSCAN. Because of the limited number of colors, points with the same color may belong to different clusters. According to our observation, most leaves have been successfully segmented. We also found a small number of incorrect segmentations, where leaves which are close to each other are segmented as one cluster. However, this is not a clustering error, since such leaves are considered "connected" due to the close distance. To solve this problem, advanced leaf-shape analysis is necessary, which is part of our future work.
V. CONCLUSION
In this research, a method for the automatic recognition and segmentation of leaves has been developed based on depth images captured by the Kinect sensor. Pixels of the depth image are converted into 3D points and registered in a standard coordinate system after ground calibration. After extracting leaves based on height information, a hierarchical clustering algorithm is applied for the automatic segmentation of leaves.
Future improvements to this work should consider incorporating prior knowledge about the shape of leaves into the segmentation model, so that it can discriminate adjacent leaves and provide more detailed descriptions of the geometric structural characteristics of plants.
Fig. 6: Hierarchical clustering results
REFERENCES
[1] X. P. Burgos-Artizzu, A. Ribeiro, A. Tellaeche, G. Pajares, and C. Fernandez-Quintanilla, "Analysis of natural images processing for the extraction of agricultural elements," Image and Vision Computing, vol. 28, no. 1, pp. 138–149, 2010.
[2] D. J. Mulla, "Twenty five years of remote sensing in precision agriculture: Key advances and remaining knowledge gaps," Biosystems Engineering, vol. 114, no. 4, pp. 358–371, 2013.
[3] A. McBratney, B. Whelan, T. Ancev, and J. Bouma, "Future directions of precision agriculture," Precision Agriculture, vol. 6, no. 1, pp. 7–23, 2005.
[4] R. Bongiovanni and J. Lowenberg-DeBoer, "Precision agriculture and sustainability," Precision Agriculture, vol. 5, no. 4, pp. 359–387, 2004.
[5] J. M. Chen, P. M. Rich, S. T. Gower, J. M. Norman, and S. Plummer, "Leaf area index of boreal forests: Theory, techniques, and measurements," Journal of Geophysical Research: Atmospheres, vol. 102, no. D24, pp. 29429–29443, 1997.
[6] Y.-H. F. Yeh, T.-C. Lai, T.-Y. Liu, C.-C. Liu, W.-C. Chung, and T.-T. Lin, "An automated growth measurement system for leafy vegetables," Biosystems Engineering, vol. 117, pp. 43–50, 2014.
[7] D. Riano, F. Valladares, S. Condes, and E. Chuvieco, "Estimation of leaf area index and covered ground from airborne laser scanner (lidar) in two contrasting forests," Agricultural and Forest Meteorology, vol. 124, no. 3, pp. 269–275, 2004.
[8] M. A. Lefsky, W. B. Cohen, G. G. Parker, and D. J. Harding, "Lidar remote sensing for ecosystem studies," BioScience, vol. 52, no. 1, pp. 19–30, 2002.
[9] J. Kramer, M. Parker, H. C. Daniel, F. Echtler, and N. Burrus, Hacking the Kinect. Springer, 2012.
[10] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[11] N. Burrus, "Kinect Calibration," http://nicolas.burrus.name/index.php/Research/KinectCalibration, 2012.
[12] X. Shao, H. Zhao, R. Shibasaki, Y. Shi, and K. Sakamoto, "3D crowd surveillance and analysis using laser range scanners," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011, pp. 2036–2043.
[13] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. KDD, vol. 96, 1996, pp. 226–231.
[14] D. Comaniciu and P. Meer, "Distribution free decomposition of multivariate data," Pattern Analysis & Applications, vol. 2, no. 1, pp. 22–30, 1999.
[15] X. Shao, K. Katabira, R. Shibasaki, and H. Zhao, "Multiple people extraction using 3D range sensor," in IEEE International Conference on Systems, Man and Cybernetics (SMC), 2010, pp. 1550–1554.