

Pattern Recognition 38 (2005) 985–996. www.elsevier.com/locate/patcog

Integration of multiresolution image segmentation and neural networks for object depth recovery

Li Ma (a), R.C. Staunton (b,∗)

(a) Department of Computer Science, Zhengzhou Institute of Light Industry, Zhengzhou 450002, PR China

(b) School of Engineering, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK

Received 19 September 2003; received in revised form 29 September 2004; accepted 6 January 2005

Abstract

A novel technique for three-dimensional depth recovery based on two coaxial defocused images of an object with added pattern illumination is presented. The approach integrates object segmentation with depth estimation. Firstly, segmentation is performed by a multiresolution based approach to isolate object regions from the background given the presence of blur and pattern illumination. The segmentation has three sub-procedures: image pyramid formation; linkage adaptation; and unsupervised clustering. These maximise the object recognition capability while ensuring accurate position information. For depth estimation, lower resolution information with a strong correlation to depth is fed into a three-layered neural network as input feature vectors and processed using a Back-Propagation algorithm. The resulting depth model of object recovery is then used with higher resolution data to obtain high accuracy depth measurements. Experimental results are presented that show low error rates and the robustness of the model with respect to pattern variation and inaccuracy in optical settings.
© 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Depth from defocus; Neural network; Multiresolution image segmentation; Fuzzy clustering

1. Introduction

Depth measurement is one of the most important tasks in many computer vision applications, including three-dimensional object recognition, scene interpretation, and part inspection and manipulation. Three-dimensional (3-D) positional information can be recovered using various techniques, among which depth from defocus (DFD) methods have the advantage that they require only two coaxial images obtained with different optical settings. DFD avoids the missing part and correspondence problems that occur with stereo. Despite these merits, DFD shares one inherent weakness with stereo and motion techniques in that it requires the scene to contain natural or projected textures. In the work presented here we used projected texture (active illumination).

∗ Corresponding author. Tel.: +44 24 76523980; fax: +44 24 76418922. E-mail addresses: [email protected] (L. Ma), [email protected] (R.C. Staunton).

0031-3203/$30.00 © 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2005.01.005

Several researchers have developed accurate, dense depth estimations from defocused images in the past decade. The DFD method, originally developed by Pentland [1], uses the relative defocus in two images taken with different camera settings to determine scene structures. Many other techniques have followed, and these fall into the two main categories of Fourier and spatial domain based modelling. Subbarao has successively presented depth models for both domains [2]. Nayar et al. [3] give a precise blur analysis in the frequency domain using focus operators as models. They considered both actively and passively illuminated scenes [4]. Furthermore, they proposed telecentric optics [5] to achieve magnification invariance under changes in the focus setting. Their technique employed a small bank of


broadband rational filters [6] able to handle arbitrary textures. The method computed efficiently and produced accurate results even for weak textures. Ghita and Whelan [7] reported a practical DFD implementation based on simple filters and a striped illumination pattern. They later used this algorithm in a bin picking application [8].

Recently, the techniques of artificial neural networks (ANNs), an empirical modelling approach in the spatial domain, have been applied to the DFD problem. ANNs have the properties of robustness and adaptation needed to approximate any non-linear function, so the stringent requirements for optical settings have been reduced compared to the earlier techniques. Tsai [9] proposed an algorithm to estimate the amount of blur from a single camera, in which the blur is calculated using a moment-preserving technique. The ANNs are only used to compensate for certain depth errors. Pham and Aslantas [10] have presented a technique employing a multi-layer perceptron (MLP) network to compute distances from derivative images of blurred edges. The theory of the MLP is described by Pinkus [11]. In addition, Jong and Huang [12] have explored the Radial Basis Function (RBF) neural network for blur scale detection of the point-spread function (PSF). Presently, there are few ANN based approaches to DFD oriented object recovery in the literature. The main problems for ANN based depth models are to build robust, accurate depth estimates with reasonably small networks, and the trade-off between the amount of pre-processed input data required and the efficiency achieved by the training procedure.

In this paper, a novel ANN based approach for depth measurement is reported that simplifies the model architecture and improves model performance. We have integrated image segmentation with neural network learning, to solve depth recovery by a two-stage procedure, in which two-dimensional (2-D) object segmentation is followed by 3-D depth model formation. The first stage can be viewed as data pre-processing before the depth modelling stage. A multiresolution scheme, used for edge detection in [13,14], was applied at the first stage with the objectives of reducing the data needed to form the depth model in the later stage, and of providing a reliable segmentation for the pattern-based image. Firstly, the data from one defocused image is processed to form a multiresolution pyramid, in which the subsequent levels have progressively lower image resolution, but preserve the essential depth information in the similarity measures between parent–child nodes at the neighbouring levels. Only one image is required in this stage as a telecentric lens was employed to eliminate any magnification of objects between images. Finally, an unsupervised fuzzy clustering is applied at a working level, defined in Section 3.4, to produce isolated object regions. In the depth estimation stage, a depth model in a three-layered neural network, whose architecture is determined by depth feature extraction, is generated using a back-propagation algorithm with training data derived from the previous stage at a low resolution level and camera calibration data. The basic framework of our approach is shown in Fig. 1.

Fig. 1. System structure: Image Capture → Image Segmentation → Fuzzy C-Means → Feature Extraction → Neural Network → Depth Estimate.

Firstly, the two defocused grey level images with the projected illumination pattern are segmented to decompose the scene into distinct meaningful regions. The resulting data are derived from the object regions at a low resolution in order to ease the burden of network learning and to reduce the uncertainty in object detection. To estimate the 3-D information, reliable feature vectors from the first stage, that are to be input to the nodes of the neural network, are selected to provide useful data related to depth. Finally, the ANN model is generated to perform the object depth recovery. After building the ANN model, it can be used to calculate the depth of objects in unseen or partly seen images, that is, images that were not part of the training set. This work is the first reported in the literature to use neural networks to calculate the depth from a pair of defocused images. It is also novel in using an object detection stage followed by the depth recovery stage. Experimental results with different illumination patterns are demonstrated to show the effectiveness of the approach.

This paper is organized as follows: Section 2 overviews the theory of DFD and the derivation of the depth formulae. Section 3 gives a detailed illustration of the multiresolution techniques for 2-D object detection using the image segmentation and boundary forming algorithms on blurred images. Section 4 presents an MLP solution to object recovery using the pre-processed images and depth related features. Section 5 demonstrates the experimental results and includes discussion on the depth accuracy and the effect of varying the illumination pattern. The paper is concluded in Section 6.

2. Depth from defocus

To illustrate the concept of recovering depth from defocus, the basic image formation geometry is shown in Fig. 2. When an image is in focus, all the photons that are radiated by a point object O pass through the aperture A and are refracted by the lens to converge at the point Q in the focused image plane If. The focused plane position v depends on the depth u of the object and the focal length f of the lens. According to the lens law,

$\frac{1}{u} + \frac{1}{v} = \frac{1}{f}$. (1)

Fig. 2. Image formation and depth from defocus.

However, when the point object O is not in focus, its image on the far-focused plane I1 and the near-focused plane I2 is not a point but a patch, which is indicated by a blur circle of diameter b1 on plane I1 and b2 on plane I2, respectively. The bi (i = 1, 2), called the degree of defocus, is a function of the lens settings and the depth of the object. From similar triangles and Eq. (1),

$b_2 = 2dw\left(\frac{1}{f} - \frac{1}{u} - \frac{1}{w}\right)$, (2)

where d is the radius of the aperture, and w is the distance between the lens and plane I2. In a similar way, it is easy to determine the blur circle b1 on plane I1 given the distance m between the two planes. The relative defocus in the two images with different optical settings can be used to determine the three dimensional structure (depth) of the object. To solve the problem of magnification variation due to the change in focus setting, an external aperture is placed at the front-focal point to greatly reduce magnification variation [5]. As a result, the effective image coordinates of point O in image planes I1, I2 and If are the same, and only one image need be used in the segmentation stage.
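As a concrete illustration of Eq. (2), the short Python sketch below evaluates the blur-circle diameter. The lens values loosely echo the experimental settings of Section 5; the object depth and the focus distance for plane I2 are assumptions.

    # Illustrative evaluation of Eq. (2); all values are assumptions.
    f = 25.0          # focal length (mm)
    d = 7.9 / 2.0     # aperture radius (mm)
    u = 1000.0        # hypothetical object depth (mm)
    # w: lens-to-plane-I2 distance, here chosen so that plane I2 focuses
    # a point at 865 mm (the near-focus distance quoted in Section 5).
    w = 1.0 / (1.0 / f - 1.0 / 865.0)
    b2 = 2.0 * d * w * (1.0 / f - 1.0 / u - 1.0 / w)
    print(b2)  # blur-circle diameter (mm), roughly 0.03 for these values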

The analysis model of the two defocused images i1 and i2 can be denoted as

$i_1(x, y) = f(x, y) * h(x, y, \sigma_1)$,
$i_2(x, y) = f(x, y) * h(x, y, \sigma_2)$, (3)

where i1 and i2 are the defocused images, f(x, y) is the focused image, h(x, y, σ) is a 2-D Gaussian approximation of the blur function, '∗' denotes convolution, and σ1, σ2 are the Gaussian blur scales proportional to b1 and b2. The DFD computation works on these defocused images to compute the depth map. Two images are required because equal radius blur circles can result from image planes placed in front of or behind the in-focus plane. This would lead to a depth ambiguity if only one image were used.

If the scene has a weak texture or is textureless, the defocused images cannot provide sufficient information for depth recovery, so here we project an artificial texture onto the surface and the depth can be obtained by measuring the blurring of the projected pattern. Previous research [3] has shown depth measurement to be more accurate when active illumination is used.
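For readers who want to experiment, a minimal Python/SciPy sketch of the image-formation model of Eq. (3) is given below; the blur scales sigma1 and sigma2 are hypothetical, and a random texture stands in for the focused image f(x, y).

    import numpy as np
    from scipy.ndimage import gaussian_filter

    rng = np.random.default_rng(0)
    focused = rng.random((64, 64))          # stand-in for f(x, y)
    sigma1, sigma2 = 1.5, 3.0               # assumed blur scales (prop. to b1, b2)
    i1 = gaussian_filter(focused, sigma1)   # far-focused image, f * h(., sigma1)
    i2 = gaussian_filter(focused, sigma2)   # near-focused image, f * h(., sigma2)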

3. Pattern-based image segmentation

3.1. The image segmentation algorithm

The first stage of depth estimation is to isolate each object within a scene. Here we need to accurately segment a scene onto which an illumination pattern has been projected, to ensure high spatial resolution in the eventual computed depth map. A crucial problem for segmentation is to manage two sources of uncertainty. These are the uncertainty in estimating the feature property in each small object region, and the spatial uncertainty of where the region's boundary lies. Moreover, these two uncertainties are inversely related, in that the smaller the uncertainty in the boundary position, the larger the area needed to compute it.

There are three popular approaches to segmentation: histogram based thresholding, edge-based techniques and region-based methods. Histogram based thresholding is a statistical method making decisions on the grey levels of a region's pattern, which contains a central component with a maximal spatial uncertainty. Edge-based methods have a weakness when contour lines are broken and are prone to failure in the presence of blurring. Region-based image segmentation techniques make use of similarities in intensity, colour, and texture to determine the partitioning of an image. They are suitable for pattern-based image segmentation but sensitive to noise and blur.

The multiresolution analysis proposed here expedites the task of detecting the object in a defocused image. It is very useful in reducing the uncertainties because the class features at lower resolutions are better defined, while the higher resolutions are needed to obtain accurate borders. The histograms taken from a defocused image in Fig. 3 show that it is easier to distinguish the two regions at a lower resolution, as a bimodal pattern appears, whereas at high resolution the histogram is unimodal. Our efforts have been focused on implementing an unsupervised segmentation algorithm to detect disjoint objects, which can then be further processed to achieve depth estimation. The algorithm has been developed in several stages. First, a pyramid is built up to a predefined level in which each lower resolution level is obtained by smoothing the preceding higher level. The hierarchical feature structures related to the illumination pattern are computed simultaneously. In order to set up the linking of homogenous regions between corresponding resolution levels, the structures are adapted in a bottom-up way according to similarity measures. Eventually an appropriate working level is reached at which a stabilisation condition is satisfied. Furthermore, a new measurement of stabilisation using a histogram distance is proposed here to speed up the segmentation. It compares favourably to the one published in [15]. Thirdly, based on the uniformity of adjoining sub-regions, an unsupervised clustering technique, fuzzy C-means (FCM) [16], is employed to merge similar sub-regions and to isolate the object from the background in the working level of the pyramid. Finally, a pixel-wise refinement procedure is carried out to improve object boundary detection.

Fig. 3. Histogram profiles from different resolution levels: (a) histogram of the block object at the highest resolution, l = 0; (b) histogram of the block object at a lower resolution, l = 3. (Both panels plot normalised frequency against grey level values.)

3.2. Feature extraction

A region with a particular texture can be identified using the properties of that texture. Previously, in order to characterise a pattern, it was necessary to define a feature vector (texton) related to any given pixel. A feature vector is a descriptor related to a pixel but dependent on its relationship with other pixels in its neighbourhood. Line detection will preserve the important structural properties of an image and can provide elements of the vector. In one of its simplest forms, the line information is extracted using four 3 × 3 local masks, which provide orientation information:

$p_i(x, y) = f_d(x, y) * m_i(x, y)$, i = 1, 2, ..., 4,

where $p_i(x, y)$ is the orientation response of the ith line mask $m_i$, convolved with the original defocused image $f_d$, at the position (x, y). The line masks used in this paper are

$m_1 = \begin{bmatrix} -1 & -1 & -1 \\ 2 & 2 & 2 \\ -1 & -1 & -1 \end{bmatrix}$,  $m_2 = \begin{bmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{bmatrix}$,

$m_3 = \begin{bmatrix} -1 & 2 & -1 \\ -1 & 2 & -1 \\ -1 & 2 & -1 \end{bmatrix}$,  $m_4 = \begin{bmatrix} -1 & -1 & 2 \\ -1 & 2 & -1 \\ 2 & -1 & -1 \end{bmatrix}$.

The line masks above respond maximally to lines at orientations of 0°, 45°, 90° and 135°, respectively. So, the pattern feature vector at any image point has five elements: its magnitude and four orientation values. These were the features used in the image segmentation stage.
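A minimal sketch of this feature computation in Python/SciPy follows; the function name and the input image are placeholders.

    import numpy as np
    from scipy.ndimage import convolve

    # The four line masks of Section 3.2.
    m1 = np.array([[-1, -1, -1], [2, 2, 2], [-1, -1, -1]])   # 0 degrees
    m2 = np.array([[2, -1, -1], [-1, 2, -1], [-1, -1, 2]])   # 45 degrees
    m3 = np.array([[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]])   # 90 degrees
    m4 = np.array([[-1, -1, 2], [-1, 2, -1], [2, -1, -1]])   # 135 degrees

    def pattern_features(defocused):
        """Return an (H, W, 5) array: grey level plus four orientation responses."""
        g = defocused.astype(float)
        responses = [convolve(g, m.astype(float)) for m in (m1, m2, m3, m4)]
        return np.dstack([g] + responses)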

3.3. Pyramid structure and similarity measurement

Multiresolution methods attempt to obtain a global view of an image by examining it at various resolution levels. The primary goal of pyramid segmentation is to find the location of the objects within an image. In general, the pyramid's base is the image of highest resolution, and analysis then proceeds to successively lower resolutions until the object of interest is detected. Although the precise boundary of an object may not be obtained at this low resolution, further processing is then performed at a higher resolution to determine a more accurate segmentation. In the first stage, the pyramid structure is constructed by means of quadtrees. Let $p_k(x, y, l)$ be the value of the kth feature at position (x, y) in level l of the pyramid. The value of a parent node at level l is the average value of its four children at level l − 1:

$p_k(x, y, l) = \frac{1}{4}\sum_{m=0}^{1}\sum_{n=0}^{1} p_k(2x + m, 2y + n, l - 1)$, (4)

where k = 1, 2, ..., 5 corresponds to the five different feature elements at the given levels l = 0, 1, 2, 3. Level l = 3 is the predefined highest level, and allows a trade-off between information loss and pattern extraction.
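A sketch of this pyramid construction, reusing the hypothetical pattern_features output above and assuming image dimensions divisible by 2^(levels−1):

    # Quadtree pyramid of Eq. (4): each parent is the mean of its four
    # children; pyramid[l] holds p_k(x, y, l) for l = 0, ..., levels-1.
    def build_pyramid(features, levels=4):
        pyramid = [features.astype(float)]
        for _ in range(levels - 1):
            f = pyramid[-1]
            # average non-overlapping 2x2 blocks of children
            f = 0.25 * (f[0::2, 0::2] + f[0::2, 1::2] +
                        f[1::2, 0::2] + f[1::2, 1::2])
            pyramid.append(f)
        return pyramid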

The goal of the second stage is to optimise the mapping between adjacent levels in the pyramid according to a similarity measure. The similarity measure is computed progressively to modify the parent–child relationship in a bottom-up fashion. Suppose a child node i of level l − 1 is located at position (x, y), and a parent node j in the next highest level has coordinates (u, v). The similarity measure between the nodes i and j, denoted $S_{ij}$, is defined as the sum of feature distances between the two nodes:

$S_{ij} = \sum_{k=1}^{5}\left(p_k(x, y, l - 1) - p_k(u, v, l)\right)^2$. (5)

It should be noted that j belongs to one of node i's four immediate parents, so any node in level l − 1 is rearranged and linked to the closest parent among the four at level l. The linkage between parent and child is updated based on where the minimum value of $S_{ij}$ is found. The parent value of node j is recomputed as the average value of all the children that are currently connected to it. This process, called linkage adaptation, is repeated until no detectable difference can be found in the histogram distribution at level L, where level L is defined as the working level at which further unsupervised segmentation classification will be carried out. The stabilisation measure at a level l is defined as

$s_l = \frac{1}{MN}\sum_{i=0}^{255}\left(h_l(i) - h'_l(i)\right)^2$, (6)

where $h_l(i)$ and $h'_l(i)$ are the histogram values for the intensity i before and after linkage adaptation, and MN is the number of pixels. In practice $s_l$ is compared with a threshold value to determine the working level.
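The sketch below gives one simplified reading of a linkage-adaptation pass and of the stabilisation measure of Eq. (6). The choice of the four candidate parent cells (the containing cell plus its three nearest neighbours) is an assumption, as the text does not spell out the neighbourhood.

    import numpy as np

    def relink_level(children, parents):
        """One pass: link each child to the best of four candidate parents
        (Eq. (5)) and recompute parents as means of their linked children."""
        H, W, K = parents.shape
        sums = np.zeros_like(parents)
        counts = np.zeros((H, W))
        for x in range(children.shape[0]):
            for y in range(children.shape[1]):
                px, py = x // 2, y // 2
                cands = [(min(max(px + dx, 0), H - 1), min(max(py + dy, 0), W - 1))
                         for dx in (0, 1 if x % 2 else -1)
                         for dy in (0, 1 if y % 2 else -1)]
                best = min(cands, key=lambda uv: np.sum(
                    (children[x, y] - parents[uv]) ** 2))   # S_ij of Eq. (5)
                sums[best] += children[x, y]
                counts[best] += 1
        counts[counts == 0] = 1.0
        return sums / counts[..., None]

    def stabilisation(h_before, h_after, mn):
        """Eq. (6): per-pixel squared histogram distance."""
        return np.sum((np.asarray(h_before) - np.asarray(h_after)) ** 2) / mn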

3.4. The FCM algorithm for unsupervised clustering

FCM [16] is an unsupervised data clustering technique in which each data point's membership of a cluster is specified by a fuzzy membership grade. We used it here to partition the processed image at the working level L into individual object regions and background. The FCM aims to minimise an objective function given by

$J_m(U, v) = \sum_{k=1}^{n}\sum_{i=1}^{c} (u_{ik})^m (d_{ik})^2$, (7)

subject to the conditions $\sum_{i=1}^{c} u_{ik} = 1$, $0 < \sum_{k=1}^{n} u_{ik} < n$ and $u_{ik} \ge 0$, where U is a fuzzy C-partition of the data set X, which is a matrix containing feature vector values for each pixel. The elements $u_{ik}$ of matrix U are in the range 0–1, and represent the fuzzy membership degree of the kth data point (pixel) to the ith class. The ith cluster centre $v_i$, whose number of components depends on the number of feature vectors, is called the seed point of a particular class. The distance between the ith cluster centre of v and the kth data point of X, $d_{ik} = \|x_k - v_i\|$, is used to update the cluster centres based on the previous matrix U in each iteration until there is an insignificant difference between cluster centres in two consecutive iterations. In Eq. (8), a weighting exponent m, m ∈ [1, ∞), determines the level of cluster fuzziness. The following equations are computed in the unsupervised clustering process with m = 2.0:

$u_{ik} = 1 \Big/ \sum_{j=1}^{c} (d_{ik}/d_{jk})^{2/(m-1)}$ (8)

and

$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}$, for $1 \le i \le c$. (9)

The steps in the FCM procedure are:

(1) Assume the number of clusters c, 2 ≤ c ≤ n. Choose a set of n data points to be clustered, $x_i$. Then initialise $U^{(0)} = \{u_{ik}^{(0)}\}$.
(2) Calculate the fuzzy cluster centres $v_i$, where i = 1, ..., c, using Eq. (9).
(3) Calculate the distance measures $d_{ik}$ for all clusters, where i = 1, 2, ..., c, and k = 1, 2, ..., n.
(4) Update the fuzzy membership matrix U using Eq. (8).
(5) If $\|U^{(l+1)} - U^{(l)}\| \le \epsilon$ then stop; otherwise go to (2).

The FCM will converge to a local minimum or a saddle point of $J_m$. Finally, defuzzification must be applied to assign each data point to a specific cluster by attaching it to the cluster for which the membership value is maximal.
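A compact NumPy sketch of this procedure, vectorised over all n points (names are illustrative):

    import numpy as np

    def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
        """Fuzzy C-means following Eqs. (7)-(9); X is an (n, K) data matrix.
        Returns memberships U (c, n), centres v (c, K) and hard labels."""
        rng = np.random.default_rng(seed)
        U = rng.random((c, X.shape[0]))
        U /= U.sum(axis=0)                                # enforce sum_i u_ik = 1
        for _ in range(max_iter):
            Um = U ** m
            v = (Um @ X) / Um.sum(axis=1, keepdims=True)  # Eq. (9)
            d = np.linalg.norm(X[None, :, :] - v[:, None, :], axis=2)
            d = np.maximum(d, 1e-12)                      # guard against d_ik = 0
            U_new = 1.0 / ((d[:, None, :] / d[None, :, :])
                           ** (2.0 / (m - 1.0))).sum(axis=1)   # Eq. (8)
            done = np.linalg.norm(U_new - U) <= eps       # stopping rule, step (5)
            U = U_new
            if done:
                break
        return U, v, U.argmax(axis=0)                     # defuzzification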

3.5. Refinement of object boundaries

At the end of the clustering process, a binary image is formed and rough object boundaries are obtained at the working level L. It was observed that some pixels were misclassified because no spatial connectivity constraints were used in the FCM algorithm. A refinement was carried out in which a pixel at location (x, y) was considered misclassified if its value differed from the majority of its $N_8(x, y)$ neighbourhood ($N_8(x, y) = \{(x + u, y + v)\}$, $-1 \le u, v \le 1$). The misclassified pixels were then reassigned to the most represented class in $N_8(x, y)$.
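A sketch of this majority-vote refinement for a two-class (binary) image, using SciPy's generic_filter as a stand-in implementation:

    import numpy as np
    from scipy.ndimage import generic_filter

    def refine(binary):
        """Reassign pixels that disagree with the majority of N8(x, y)."""
        def vote(window):
            centre = window[4]                 # the pixel itself
            ones = window.sum() - centre       # count of 1s among 8 neighbours
            if ones > 4:
                return 1.0
            if ones < 4:
                return 0.0
            return centre                      # tie: keep the original label
        out = generic_filter(binary.astype(float), vote, size=3, mode='nearest')
        return out.astype(binary.dtype)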

It should be noted that the grey level values of each object isolated from the background image at the coarser resolution level L (the working level) were fed forward, together with the pixel positions, as part of the training data for the neural network used to generate the depth information, while the segmentation results at the highest resolution were needed to test the depth model that was subsequently built by the neural network. The technique used here for boundary extraction is simply label linking in a top-down process. Starting from the working level, at which c objects have been classified and each pixel has been assigned a label, the label of a parent is directly assigned to its children, as the similarity validation is performed during the pyramid formation (Section 3.3). The label assignment is repeated until the bottom has been reached at level l = 0.
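Because each parent simply passes its label to its four quadtree children, the top-down label linking reduces to repeated 2 × 2 upsampling; a minimal sketch (levels counts the steps from the working level down to l = 0):

    import numpy as np

    def propagate_labels(labels_working, levels=3):
        """Copy each parent label to its four children, level by level."""
        labels = np.asarray(labels_working)
        for _ in range(levels):
            labels = labels.repeat(2, axis=0).repeat(2, axis=1)
        return labels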

4. Depth estimation using neural networks

The properties and training algorithms of MLP networks are well documented, but less information is available on how to configure the network. In general, a feed-forward neural network with one hidden layer, composed of neurons with activation functions, and a linear output layer can approximate any continuous function to the desired accuracy. In our case, the use of the three-layer MLP was predetermined as its simplicity and good performance were required. The number of hidden neurons was assigned during the training experiment and the number of output neurons was set to one, to indicate the depth measured at the corresponding position. The key design issue here is how many input neurons and what types of data are needed for depth estimation.
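As an illustration only, the three-layer network could be configured as below. Scikit-learn stands in for the authors' implementation; the hidden-layer size and training settings follow the values quoted in Section 5.2, while the logistic activation is an assumption.

    from sklearn.neural_network import MLPRegressor

    # 4 inputs -> 60 hidden nodes -> 1 depth output, trained by
    # gradient descent with momentum (a stand-in for classic BP).
    mlp = MLPRegressor(hidden_layer_sizes=(60,), activation='logistic',
                       solver='sgd', learning_rate_init=0.001,
                       momentum=0.9, max_iter=1000)
    # mlp.fit(inputs, depths)   # inputs: (n, 4); depths: (n,) scaled targets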

Input data selection, called feature extraction, can be an intricate task which largely depends on a solid understanding of the problem domain. It was shown in Section 2 that the grey level values from the two defocused images contain the depth information of the observed objects, and the pixel coordinates projected onto the image plane are also indicators of the spatial positions of the objects. In our case, there are four features fed to the input layer: the first two are the pixel grey levels of the isolated object images captured with the two different focus settings at working level L, and the other two are the pixel's horizontal and vertical coordinates. These features enabled fast training and reduced testing errors in the evaluation stage described in the following section. Once the input data had been selected, it needed to be pre-processed. Data normalisation was employed to scale the input data to match the range of the input neurons. This greatly improved the network's performance.
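A sketch of this input assembly and normalisation follows; the image names and the mask are hypothetical, and the default range of [−0.5, 0.5] follows the scaling quoted in Section 5.2.

    import numpy as np

    def make_inputs(i1_L, i2_L, mask, lo=-0.5, hi=0.5):
        """Per object pixel: two defocused grey levels plus (x, y), rescaled."""
        ys, xs = np.nonzero(mask)
        feats = np.stack([i1_L[ys, xs], i2_L[ys, xs],
                          xs.astype(float), ys.astype(float)], axis=1)
        mn, mx = feats.min(axis=0), feats.max(axis=0)
        span = np.where(mx > mn, mx - mn, 1.0)   # avoid division by zero
        return lo + (hi - lo) * (feats - mn) / span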

5. Experiments

We have implemented the proposed techniques for depth recovery. The scene was imaged using a TAMRON 25 mm lens converted to be telecentric by an additional aperture. The aperture diameter was normally set to 7.9 mm unless otherwise stated, which gave an f-number of 3.17. A Pulnix monochrome camera model TM-745E and a frame grabber were set to capture images of size 512 × 512 with 256 grey levels. Different illumination patterns, based on checkerboards and stripes, were simulated by attaching printed sheets of paper to the objects before sensing. The far-focused image i1 was taken with the lens focused at 1265 mm from the camera, and the near-focused image i2 with the lens focused at 865 mm, unless otherwise stated. These two distances were chosen so that all scene points lay between them.

5.1. Segmentation

Object detection was carried out using only one of the defocused images, since telecentric optics provide the advantage of magnification invariance with respect to focus setting. This meant the segmentation results for one image could be used to directly segment the other. An image pyramid was formed with lmax = 4 as described in Section 3.3. Then the child–parent linkages in the neighbouring levels were updated based on the similarity measure (Eq. (5)) until the working level L (l = 3) was reached. At this level, the histogram distribution showed no significant changes from the previous level (the threshold used with Eq. (6) was set to 0.8). Finally, unsupervised clustering based on fuzzy C-means was employed at level L and the patterned objects isolated from the background image. Fig. 4 shows results from each stage of the segmentation procedure. The original image of a five-step object is shown in Fig. 4(a). Fig. 4(b) shows the image at the working level (resolution 64 × 64) after linkage adaptation. It appears blurred compared to the original in the lowest level of the pyramid. An advantage of reducing the resolution to 64 × 64 was that although the average intensity value of the pixels in the object remained the same, the minimum intensity values in the dark stripes at l = 0 were smoothed with values from the light stripes, and the resulting values at l = 3 were significantly higher than the average background intensity values. It was noted that the object edges were smoothed at l = 3, but this had little influence on the clustering as the intensity was only one element of the feature vector. The output image after FCM clustering (Fig. 4(c)) had an error of 1.37%, where the error is defined as the ratio of the number of misclassified pixels to the total number of pixels in the object. The worst clustering appeared at the bottom of the object because here there was some aliasing between the illumination pattern and the edge. Fig. 4(d) shows the result of applying the local neighbourhood refinement to the previous image. The error is reduced to 0.637%. This error is low compared to the 26.74% error we achieved with a traditional segmentation method using the Canny edge detector. The computation time for the segmentation stage in Matlab was 253.8 s using a PC with a 600 MHz processor.
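Putting the segmentation pieces together, an illustrative driver over the hypothetical sketches given earlier (the linkage-adaptation pass is omitted for brevity, 'img' is a stand-in 512 × 512 defocused image, and which of the two clusters is the object is arbitrary):

    feats = pattern_features(img)            # Section 3.2: five-element features
    pyr = build_pyramid(feats, levels=4)     # Section 3.3: levels l = 0..3
    X = pyr[3].reshape(-1, 5)                # working level L (64 x 64)
    U, v, labels = fcm(X, c=2)               # Section 3.4: object vs. background
    seg = refine(labels.reshape(pyr[3].shape[:2]).astype(bool))  # Section 3.5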

5.2. Depth recovery

Fig. 4. Image segmentation. (a) Original step-block image (512 × 512), (b) lower resolution image at l = 3 (64 × 64), (c) object region segmentation after FCM at l = 3, (d) object region after refinement at l = 3.

We calibrated the camera with a total of 10 images of a planar checkerboard pattern by implementing programs in the MATLAB camera calibration toolbox. The results of the calibration were applied to the target generation used to train the neural network. The training objects were a five-step block with a step interval of 40 mm, a box (120 mm × 86 mm × 35 mm), and a cylinder (height 85 mm, diameter 135 mm). These objects are shown in Fig. 5 together with an "untrained" ramp object. However, Fig. 5(d) shows the steps with their natural wood texture, and without the illumination pattern.

They were imaged at two focus settings and used to train the neural network. A three-layer MLP network with four input nodes, 60 hidden nodes and one output node was used to estimate the depth. The MLP training parameters used were: learning rate 0.001, momentum factor 0.9, and 1000 epochs. The data set derived from the 2-D object segmentation procedure was then generated by automatically splitting the whole set into a training subset and a testing subset. The input data in the training subset were scaled to the range [−0.5, 0.5] and the target data were readjusted to the range [0.2, 0.8] to achieve good performance. Before training, the network inputs were selected by a comparative study of different feature arrangements, including grey level, pixel coordinates, edge orientation and gradient magnitude. The experiments showed that an architecture with four input nodes converged to a minimum least-square error in the shortest time.

Based on the MLP training at the working level, the depth maps of the cylinder, box and ramp objects were as shown in Fig. 6. The camera was located in front of the previously trained cylinder with its optical axis perpendicular to the curved surface. Only the front half of the cylinder could thus be observed. The depth map in Fig. 6(a) shows discontinuous depth data, influenced by the striped illumination pattern. The map was smoother when data from a higher resolution level was subsequently used. It was also noticed that a lower depth estimate occurred towards the right of the cylinder. This error was possibly due to a less distinct illumination pattern at this end of the cylinder. Fig. 6(b) was the output of the MLP for the trained box object. The shape has been reasonably well recovered. The depth map in Fig. 6(c) was from a ramp object recovered using untrained data. It generally matched the original object well, but the surface had a wave-like structure superimposed on it.

Fig. 5. Images of test objects. (a) Cylinder with stripe pattern, (b) box with stripe pattern, (c) ramp with stripe pattern, (d) five-step block with natural (weak) texture.

Several experiments were implemented to demonstrate the performance of the MLP network's adaptive and associative capabilities with unseen data. First we have shown the effect of changing the input resolution and surface pattern in Fig. 7. The depth map in Fig. 7(a) was created using data from the working level. This coarse depth map (64 × 64) has the general shape of the object, but is slightly discontinuous along the horizontal axis. In contrast, as shown in Fig. 7(b), when data from the highest resolution was used, not only was the object's shape preserved, but also a smoother depth map resulted. The MLP model was able to generate a smoother output when using higher resolution data than it was trained on. We conclude that more smoothing can be achieved by further increasing the input resolution. The depth map in Fig. 7(c) is again from high resolution data, but the artificial illumination was removed, so that the MLP was working only on the natural wood pattern of the steps. The input data was defocused images from the scene shown in Fig. 5(d). There is a curved surface imposed on the depth map here, and errors have been introduced towards the step edge boundaries.

In the following tests we have demonstrated the performance of the system with more complicated scenes under more realistic illumination. Fig. 8 shows two unseen scenes with checkerboard illumination patterns. In the previous tests the background was black, but for these tests a checkerboard with white components at 30% of the brightness of those on the object has been superimposed on it. This particular checkerboard was not used during training. The front and back focused image planes were at different distances to those used for training. In addition, the lens aperture (6 mm) and the objects themselves were not used in the training phase. The first object was a mug with a handle. The segmentation result is shown beside its front focused image. Using the previously built MLP depth model and pre-processed data from the multiresolution image segmentation (class parameter c = 2, weighting exponent m = 2), the depth map of the recovered objects is as shown in Fig. 9. The object structure has been well preserved without any influence from the illumination pattern; however, slight distortions can be seen along the rim, and where the base meets the handle.

The other scene in Fig. 8 contains the same mug (its diameter was 78 mm and its rearmost point was 72 mm in front of the background), a disc placed 100 mm in front of the furthest forward point on the mug, and an inclined plane whose front edge was 54 mm in front of the background and that was inclined at 45° to the background. This scene was processed in the same way as the mug scene. The segmentation results are displayed beside the scene in Fig. 8. The depth map of the recovered objects is as shown in Fig. 10. The structure of the objects and their relative positions have been well preserved in the depth map without any influence from the illumination pattern. However, the inclined plane's left edge has been drawn towards the camera. The plane object was constructed from a 2 mm thick plastic sheet and, as imaged, had a 90° corner along this left edge. The resulting 2 mm wide strip object was only illuminated by a half period of the checkerboard pattern, which has led to this error.

Fig. 6. Depth maps of objects recovered by the MLP network: (a) cylinder, (b) box, (c) ramp (axes: x-pixel, y-pixel, depth).

Fig. 7. Comparison of outputs of the MLP network at different resolutions. (a) Depth map of a step-block at the working level, (b) depth map of the step-block at higher resolution, (c) depth map of Fig. 5(d).

Fig. 8. Unseen object images (original and segmented shown side by side). Top: mug with checkerboard pattern. Bottom: mug, disk, and plane.

Fig. 9. Depth map of mug recovered by the MLP.

Table 1 summarises the performance of the MLP depth model. It lists the errors at different resolution levels in the image pyramid and the errors for both trained and untrained objects. The errors were calculated as the sum of the differences between the actual depth data and the network output at each pixel, divided by the number of object pixels in the image. The results show that a more accurate depth model can be achieved using higher resolution input data after the model has been built at low resolution. The generalisation properties of the ANN enable the depth of objects that have not been trained to be recovered, but with higher error rates.

Fig. 10. Depth map of the multiple object scene shown in Fig. 8.

Table 1. Errors in depth measurements of the objects processed by the MLP model

RMS²                 | Trained objects                      | Untrained objects
                     | Step-block   Box        Cylinder     | Ramp       Multi-objects
Working level        | 1.2e−3       7.6e−4     2.6e−3       | 4.6e−3     10.54e−3
Higher level L = 2   | 8.54e−4      7.23e−4    12.34e−4     | 2.56e−3    5.31e−3

6. Conclusions

The proposed technique for object recovery consisted of two main components, image segmentation and 3-D depth estimation. Pairs of defocused images captured with different optical settings were processed to produce dense depth maps of various objects. The implementation was based on active DFD, although it was achieved by attaching an illumination pattern to the objects. The image segmentation was able to decompose the image into disjoint meaningful regions that had a strong correlation with the observed objects. We concluded that the multiresolution segmentation technique was best able to isolate the pattern-based objects from the background at the lower resolution levels, but that this resulted in a heavier computation. Our segmentation algorithm enabled a dramatic reduction in the data needed to form the MLP-based depth model, and made it easy to preserve the essential object features for unsupervised clustering. In Section 3, we described an efficient feature vector based approach for region linking between neighbouring levels of the pyramid. A particular novelty of this was the feature forming and stabilisation scheme that determined the working level of the pyramid at which the MLP was trained. A pixel refinement scheme was used to accurately position object boundaries. The 3-D depth estimation, which was implemented by the neural network, used the three stages of data pre-processing, training, and testing. The novelty of the training scheme was that only essential data associated with the depth information at the working level were fed to the input nodes of the MLP network. As a consequence, a better performance was found at higher image resolution levels than at the working level. Since the MLP depth model was very robust and adaptive, it was able to make measurements on data that had not been observed in the training phase. In contrast to previous DFD related implementations, the current approach is an empirical modelling technique which does not require the precise optical settings to be known and is not sensitive to the uniformity of the illumination pattern. It worked well even with weak textures. Furthermore, the depth map recovered by the model did not need the additional smoothing operation that was required with methods that do not use ANNs. The MLP network was found to work much faster in the working phase than in the training phase. With better hardware it may be suitable in the future for real time implementation.

Summary

In this paper, a novel neural-network based approach to depth measurement, based on two defocused images with added pattern illumination, is reported. It simplifies the model architecture and improves model performance. We have integrated image segmentation with neural network learning. The solution to depth recovery is therefore a two-stage procedure: firstly objects are detected in 2-D and then 3-D depth is estimated. The object detection is performed by a multiresolution image segmentation to effectively isolate meaningful object regions from the background in the presence of blur and pattern illumination. The segmentation process consists of three sub-procedures: image pyramid formation; linkage adaptation; and unsupervised clustering, to maximise the object recognition capacity while still ensuring accurate positional information. Then, for depth estimation, the data at a lower resolution, but with a strong correlation to the depth information, are fed into a three-layered neural network as input feature vectors and processed using a BP algorithm. The depth model has been achieved when the training procedure terminates. The trained model has been further used with data from a higher resolution level in the image pyramid to improve the estimation accuracy. Both stages are seamlessly connected to greatly reduce the data dimensionality required for modelling and to ease the burden of object recovery. Experimental results are presented to show the low error rates and robustness of the model with respect to pattern variation and inaccuracy in the optical settings.

Acknowledgements

The authors acknowledge the China Scholarship Council for providing financial support, and the University of Warwick, UK, for the research facilities provided. Thanks are also given to Mr. C. Claxton for his help with the experiments.

References

[1] A.P. Pentland, A new sense of depth of field, IEEE Trans. Pattern Anal. Mach. Intell. 9 (4) (1987) 523–531.

[2] M. Subbarao, G. Surya, Depth from defocus: a spatial domain approach, Int. J. Comput. Vision 13 (3) (1994) 271–294.

[3] S.K. Nayar, M. Watanabe, M. Noguchi, Real-time range sensor, IEEE Trans. Pattern Anal. Mach. Intell. 18 (12) (1996) 1186–1198.

[4] M. Noguchi, S.K. Nayar, Microscopic shape from focus using a projected illumination pattern, Math. Comput. Modelling 24 (6) (1996) 31–48.

[5] M. Watanabe, S.K. Nayar, Telecentric optics for focus analysis, IEEE Trans. Pattern Anal. Mach. Intell. 19 (12) (1997) 1360–1365.

[6] M. Watanabe, S.K. Nayar, Rational filters for passive depth from defocus, Int. J. Comput. Vision 27 (3) (1998) 203–255.

[7] O. Ghita, P. Whelan, A video-rate range sensor based on depth from defocus, Opt. Laser Technol. 33 (2001) 167–176.

[8] O. Ghita, P.F. Whelan, A bin picking system based on depth from defocus, Mach. Vision Appl. 13 (2003) 234–244.

[9] D. Tsai, C.T. Lin, A moment-preserving approach for depth from defocus, Pattern Recognition 31 (5) (1998) 551–560.

[10] D.T. Pham, V. Aslantas, Depth from defocusing using a neural network, Pattern Recognition 32 (1999) 715–727.

[11] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica 8 (1999) 143–195.

[12] S.M. Jong, J.S. Huang, Using radial basis function networks to approach the depth from defocus, J. Imaging Sci. Technol. 45 (4) (2001) 400–406.

[13] D.J. Park, K.M. Nam, R.H. Park, Multiresolution edge detection techniques, Pattern Recognition 28 (2) (1995) 211–229.

[14] P. Schroeter, J. Bigun, Hierarchical image segmentation by multi-dimensional clustering and orientation-adaptive boundary refinement, Pattern Recognition 28 (5) (1995) 695–709.

[15] A. Bandera, Scale-dependent hierarchical unsupervised segmentation of textured images, Pattern Recognition Lett. 22 (2001) 171–181.

[16] D.P. Mukherjee, P. Pal, J. Das, Sodar image segmentation by fuzzy C-means, Signal Processing 54 (3) (1996) 295–301.

About the Author—LI MA received the B.S. and Ph.D. degrees in Electrical Engineering from Central South University, PR China, in 1976 and 1998 respectively. She is currently a professor at the Department of Computer Science, Zhengzhou Institute of Light Industry, PR China. She was an academic visitor at the College of Cardiff, University of Wales, UK, during 1993–1994, and she is currently a senior visiting fellow at the University of Warwick. Her research interests include signal and image processing, pattern recognition, and neural networks.

About the Author—RICHARD C. STAUNTON received the B.Sc. (honours) degree in electronic engineering from the City University, UK, in 1973, and the Ph.D. degree in engineering from the University of Warwick, UK, in 1992. From 1973 to 1977 he worked for the aerospace industry, and from 1977 to 1986 for the UK National Health Service, where he engaged in research and development of medical image processing systems. Since 1986 he has been a lecturer at the University of Warwick. His current research interests include industrial image processing, hexagonal sampling systems, colour image processing, manufactured surface analysis, and depth from defocus.