
A Fast and Accurate Approach for Computing the Dimensions of Boxes from Single Perspective Images

Leandro A. F. Fernandes¹, Manuel M. Oliveira¹, Roberto da Silva¹ & Gustavo J. Crespo²

¹ Universidade Federal do Rio Grande do Sul
Instituto de Informática - PPGC - CP 15064
91501-970 - Porto Alegre - RS - BRAZIL
{laffernandes, oliveira, rdasilva}@inf.ufrgs.br

² Stony Brook University
Department of Computer Science - 11794
New York - NY - USA
{guscrespo}@yahoo.com

Abstract

This paper describes an accurate method for computing the dimensions of boxes directly from perspective projection images acquired by conventional cameras. The approach is based on projective geometry and computes the box dimensions using data extracted from the box silhouette and from the projection of two parallel laser beams on one of the imaged faces of the box. In order to identify the box silhouette, we have developed a statistical model for homogeneous-background-color removal that works with a moving camera, and an efficient voting scheme for the Hough transform that allows the identification of almost collinear groups of pixels. We demonstrate the effectiveness of the proposed approach by automatically computing the dimensions of real boxes using a scanner prototype that implements the algorithms and methods described in the paper. We also present a discussion of the performed measurements, and an error propagation analysis that allows the method to estimate, from each single video frame, the uncertainty associated with all measurements made over that frame, in real time.

Keywords: Computing dimensions of boxes, image-based metrology, extraction of geometric information from scenes, uncertainty analysis, real time.

Figure 1. Scanner prototype: (left) Its operation. (right) Camera's view from another position showing the recovered dimensions and uncertainty computed in real time.

1. INTRODUCTION

The ability to measure the dimensions of three-dimensional objects directly from images has many practical applications including quality control, surveillance, analysis of forensic records, storage management and cost estimation. Unfortunately, unless some information relating distances measured in image space to distances measured in 3D is available, the problem of making measurements directly on images is not well defined. This results from the inherent ambiguity of perspective projection caused by the loss of depth information.

This paper presents a method for computing box dimensions from single perspective projection images in a completely automatic way. The approach uses information extracted from the silhouette of the target boxes and can be applied when at least two of their faces are visible, even when the target box is partially occluded by other objects in the scene (Figure 1). We eliminate the inherent ambiguity associated with perspective images by projecting two parallel laser beams, apart from each other by a known distance, onto one of the visible faces of the box. We demonstrate this technique by building a scanner prototype for computing box dimensions and using it to compute the dimensions of boxes in real time (Figure 1). This can be an invaluable tool for companies that manipulate boxes in their day-to-day operations, such as couriers, airlines and warehouses.

The paper presents a revised and significantly extended version of the work originally described in [9]. The new material in this extended version includes: (i) the use of variable-size elliptical Gaussian kernels in the Hough transform voting procedure (Section 5). The use of such kernels makes the transform more robust to discretization errors and allows the proper detection of supporting silhouette lines of boxes with bent edges; (ii) the determination of the plane spanned by the laser beams using a calibration procedure (Subsection 3.2.1). The use of calibrated data removed the only assumption in the original derivation [9] of the equations shown in Section 3 and improved the accuracy of the method; (iii) the statistical analysis of the results (Section 7.1) was significantly enhanced, considering a larger number of real boxes and new graphs that improve the interpretation of these results. Such an analysis shows that the proposed approach is both accurate and precise; and (iv) a modeling of the error propagation along all steps of the algorithm that allows our system to estimate the uncertainty in the computed measurements in real time (Section 7.2).

The main contributions of this paper include:

• An algorithm for computing the dimensions of boxes in a completely automatic way in real time (Section 3);

• An algorithm for extracting box silhouettes in the presence of partial occlusion of the box edges (Section 3.1);

• A statistical model of homogeneous background color for use with a moving camera under different lighting conditions (Section 4);

• An efficient voting scheme for identifying nearly collinear line segments with a Hough transform (Section 5);

• A derivation of how to estimate the error associated with the computed dimensions from a single image in real time (Section 7.2).

2. RELATED WORK

Many optical devices have been created for making measurements in the real world. Those based on active techniques project some kind of energy onto the surfaces of the target objects and analyze the reflected energy. Examples of active techniques include optical triangulation [1] and laser range finding [22] to capture the shapes of objects at proper scale [17], and ultrasound to measure distances [10]. In contrast, passive techniques rely only on the use of cameras for extracting the three-dimensional structure of a scene and are primarily based on the use of stereo [18]. In order to achieve metric reconstruction [13], both optical triangulation and stereo-based systems require careful calibration. For optical triangulation, several images of the target object with a superimposed moving pattern are usually required for accurate reconstruction.

Labeling schemes for trihedral junctions [4, 15] have been used to estimate the spatial orientation of polyhedral objects from images. These techniques tend to be computationally expensive when too many junctions are identified. Additional information from the shading of the objects can be used to improve the process. Silhouettes have been used in computer vision and computer graphics for object shape extraction [16, 21]. These techniques require precise camera calibration and use silhouettes obtained from multiple images to define a set of cones whose intersections approximate the shapes of the objects.

Criminisi et al. [5] presented a technique for making 3D affine measurements from a single perspective image. They show how to compute distances between planes parallel to a reference one. If the distance from some scene element to the reference plane is known, it is possible to compute the distances between scene points and the reference plane. If such a distance is not known, the computed dimensions are correct up to a scaling factor. The technique requires user interaction and cannot be used for computing dimensions automatically. Photogrammetrists have also made measurements based on single images [23], but these techniques also require user intervention.

In a work closely related to ours, Lu [20] described a method for finding the dimensions of boxes from single gray-scale images. In order to simplify the task, Lu assumes that the images are acquired using parallel orthographic projection and that three faces of the box are visible simultaneously. The computed dimensions are approximately correct up to a scaling factor. Also, special care is required to distinguish the actual box edges from lines in the box texture, causing the method not to perform in real time.

Our approach computes the dimensions of boxes from single perspective projection images, producing metric reconstructions in real time and in a completely automatic way. The method can be applied to boxes with arbitrary textures and can be used when only two faces of the box are visible, even when the edges of the target box are partially occluded by other objects in the scene.

Figure 2. Identifying the target box silhouette. Naive approach: (b) Background segmentation, followed by (c) High-pass filter (note the spurious "edge" pixels). Proposed approach: (d) Contouring of the foreground region, (e) Contour segmentation, (f) Grouping candidate segments for the target silhouette, and (g) Recovery of supporting lines for silhouette edges and vertices.

3. COMPUTING BOX DIMENSIONS

We model boxes as parallelepipeds although real boxes can present many imperfections (e.g., bent edges and corners, asymmetries, etc.). The dimensions of a parallelepiped can be computed from the 3D coordinates of four of its non-coplanar vertices. Conceptually, the 3D coordinates of the vertices of a box can be obtained by intersecting rays, defined by the camera's center and the projections of the box vertices on the camera's image plane, with the planes containing the actual faces of the box in 3D. Thus, before one can compute the dimensions of a given box (Section 3.3), it is necessary to find the projections of the vertices on the image (Section 3.1), and then find the equations of the planes containing the box faces in 3D (Section 3.2).

In the following derivations, we assume that the origin of the image coordinate system is at the center of the image, with the X-axis growing to the right and the Y-axis growing down, and assume that the imaged boxes have three visible faces. The case involving only two visible faces is similar. Also, we assume that the images used for computing the dimensions were obtained through linear projection (i.e., using a pinhole camera). Although images obtained with real cameras contain some amount of radial and tangential distortion, we compensate for such distortions in real time with the use of simple warping procedures [13].

The number of visible faces of the box can be obtained by checking whether the projections of some of the edges that share the same direction in 3D are (almost) parallel in 2D. In this case, two faces of the box are visible; otherwise, three faces are visible simultaneously. Although this approach has proven to produce good results for well-constructed boxes, most real boxes present some distorted edges, which breaks the parallelism assumption. Thus, in practice, it is more effective to assume that three box faces are visible. Since the system is capable of computing the dimensions of boxes at 30 Hz, we can afford to discard frames if the silhouettes recovered from the acquired images do not satisfy the imposed requirements. In this case, the perception of the user is similar to that of a barcode scanner user: if no answer is coming out, just slightly change the scanner's orientation relative to the target object in order to get it.

3.1. FINDING THE PROJECTIONS OF THE VERTICES

The projections of the vertices can be obtained as the corners of the box silhouette. Although edge detection techniques [3] could be used to find the box silhouette, these algorithms tend to be very sensitive to the presence of other high-frequency content in the image. In order to minimize the occurrence of spurious edges and support the use of boxes with arbitrary textures, we perform silhouette detection using a model for the background pixels. Since the images are acquired using a handheld camera, proper modeling of the background pixels is required and will be discussed in detail in Section 4.

However, as shown in Figure 2 (a, b and c), a naive approach that just models the background and applies simple image processing operations, like background removal and high-pass filtering, does not properly identify the silhouette pixels of the target box (selected by the user by pointing the laser beams onto one of its faces). This is because the scene may contain other objects, whose silhouettes possibly overlap with the one of the target box. Also, the occurrence of some misclassified pixels (see Figure 2, c) may lead to the detection of spurious edges. Thus, a suitable method was developed to deal with these problems. The steps of our algorithm are shown in Figures 2 (a, d, e, f and g).

In our approach, the target box silhouette is obtained by starting from one of the laser dots, finding a silhouette pixel, and using a contour-following procedure [12]. The seed silhouette pixel for the contour following is found by stepping from the laser dot within the target foreground region and checking whether the current pixel matches the background model. In order to form a valid silhouette, both laser dots need to fall inside the contoured region. Notice that this procedure produces a much cleaner set of border pixels (Figure 2, d) compared to the results shown in Figure 2 (c). But the resulting silhouette may include overlapping objects, and one still needs to identify which border pixels belong to the target box. To facilitate the handling of the border pixels, the contour is subdivided into its most perceptually significant straight line segments [19] (Figure 2, e). Then, the segments resulting from the clipping of a foreground object against the limits of the frame (e.g., segments h, o and p in Figure 2, e) are discarded. Since a box silhouette defines a convex polygon, the remaining segments whose two endpoints are not visible by both laser dots can also be discarded. This test is performed using a 2D BSP-tree [11]. In the example of Figure 2, only segments c, d, k, l, t, u and v pass this test.

Still, there is no guarantee that all the remaining segments belong to the target box silhouette. In order to restrict the number of possible combinations, the remaining chains of segments defining convex fragments are grouped (e.g., groups A, B and C in Figure 2, f). We then try to find the largest combination of groups into valid portions of the silhouette. In order to be considered valid, a combination of groups must satisfy the following validation rules: (i) it must characterize a convex polygon; (ii) the silhouette must have six edges (the silhouette of a parallelepiped with at least two visible faces); (iii) the laser dots must be on the same box face; and (iv) the computed lengths for pairs of parallel edges in 3D must be approximately the same. In case more than one combination of groups passes the validation tests, the system discards this ambiguous data and starts processing a new frame (our system is capable of processing frames at a rate of about 29 fps).

Figure 3. Vanishing points (ω_i) and vanishing lines (λ_i). e_j, v_j, m_0, and f_i are the supporting lines for silhouette edges, the silhouette vertices, the inner vertex and the faces of the box, respectively.

Once the box silhouette is known, the projections of the six vertices are obtained by intersecting pairs of adjacent supporting lines for the silhouette edges (Figure 2, g). Section 5 discusses how to obtain those supporting lines.

3.2. COMPUTING THE PLANE EQUATIONS

The set of all parallel lines in 3D sharing the same direction intersect at a point at infinity whose image under perspective projection is called a vanishing point ω. The line defined by all vanishing points from all sets of parallel lines on a plane Π is called the vanishing line λ of Π (Figure 3). The normal vector to Π in a given camera's coordinate system can be obtained by multiplying the transpose of the camera's intrinsic-parameter matrix by the coefficients of λ [13]. Since the resulting vector is not necessarily a unit vector, it needs to be normalized. Equations (1) and (2) and Figure 3 show the relationship among the vanishing points ω_i, vanishing lines λ_i and the supporting lines e_j for the edges that coincide with the imaged silhouette of a parallelepiped with three visible faces. The supporting lines are ordered clockwise.

    ω_i = e_i × e_{i+3}                      (1)

    λ_i = ω_i × ω_{(i+1) mod 3}              (2)

where 0 ≤ i ≤ 2, 0 ≤ j ≤ 5, λ_i = (a_{λ_i}, b_{λ_i}, c_{λ_i})^T and × is the cross-product operator. The normal N_{Π_i} to plane Π_i is then given by

    N_{Π_i} = R K^T λ_i / ‖R K^T λ_i‖        (3)

where N_{Π_i} = (A_{Π_i}, B_{Π_i}, C_{Π_i}). K is the matrix that models the intrinsic camera parameters [13] and R is a reflection matrix (Equation 4) used to make the Y-axis of the image coordinate system grow in the up direction.

    K = | α_x  γ    o_x |         R = | 1   0   0 |
        | 0    α_y  o_y |   ,         | 0  −1   0 |      (4)
        | 0    0    1   |             | 0   0   1 |

In Equation (4), α_x = f/s_x and α_y = f/s_y, where f is the focal length, and s_x and s_y are the dimensions of the pixel in centimeters. γ, o_x and o_y represent the skew and the coordinates of the principal point, respectively.
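For illustration, the fragment below walks through Equations (1)-(4) in code: cross products of the six supporting lines yield the vanishing points and lines, and R K^T λ_i, normalized, yields the face normals. This is a minimal sketch in C++ (the language of our prototype), not the prototype's actual code; the Vec3 type, the cross and normalize helpers, and the way the entries of K are passed in are our own illustrative choices.

    #include <array>
    #include <cmath>

    struct Vec3 { double x, y, z; };

    // Cross product of homogeneous 2D lines/points (Equations 1 and 2).
    Vec3 cross(const Vec3& a, const Vec3& b) {
        return { a.y * b.z - a.z * b.y,
                 a.z * b.x - a.x * b.z,
                 a.x * b.y - a.y * b.x };
    }

    Vec3 normalize(const Vec3& v) {
        double n = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        return { v.x / n, v.y / n, v.z / n };
    }

    // Face normals from the six supporting lines e[0..5], ordered
    // clockwise. ax, ay, gamma, ox, oy are the entries of K (Equation 4).
    std::array<Vec3, 3> faceNormals(const std::array<Vec3, 6>& e,
                                    double ax, double ay, double gamma,
                                    double ox, double oy) {
        std::array<Vec3, 3> omega, lambda, N;
        for (int i = 0; i < 3; ++i)
            omega[i] = cross(e[i], e[i + 3]);                 // Equation (1)
        for (int i = 0; i < 3; ++i)
            lambda[i] = cross(omega[i], omega[(i + 1) % 3]);  // Equation (2)
        for (int i = 0; i < 3; ++i) {
            // v = K^T * lambda[i], then flip the Y component (reflection R):
            Vec3 v = { ax * lambda[i].x,
                       gamma * lambda[i].x + ay * lambda[i].y,
                       ox * lambda[i].x + oy * lambda[i].y + lambda[i].z };
            v.y = -v.y;
            N[i] = normalize(v);                              // Equation (3)
        }
        return N;
    }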

Once we have N_{Π_i}, finding D_{Π_i}, the fourth coefficient of the plane equation, is equivalent to solving the projective ambiguity and will require the introduction of one more constraint. Thus, consider the situation depicted in 2D in Figure 4 (left), where two laser beams, parallel to each other, are projected onto one of the faces of the box. Let the 3D coordinates of the laser dots, defined with respect to the camera coordinate system, be P_0 = (X_{P_0}, Y_{P_0}, Z_{P_0})^T and P_1 = (X_{P_1}, Y_{P_1}, Z_{P_1})^T, respectively (Figure 4, right). Since P_0 and P_1 are on the same plane Π, one can write

    A_Π X_{P_0} + B_Π Y_{P_0} + C_Π Z_{P_0} = A_Π X_{P_1} + B_Π Y_{P_1} + C_Π Z_{P_1}    (5)

Using the linear projection model and given p_i = (x_{p_i}, y_{p_i}, 1)^T, the homogeneous coordinates of the pixel associated with the projection of point P_i, one can reproject p_i on the plane Z = 1 (in 3D) using

    p'_i = (x_{p'_i}, y_{p'_i}, 1)^T = R K^{-1} p_i    (6)

and express the 3D coordinates of the laser dots on the face of the box as

    X_{P_i} = x_{p'_i} Z_{P_i},   Y_{P_i} = y_{p'_i} Z_{P_i}   and   Z_{P_i}    (7)

Substituting the expressions for X_{P_0}, Y_{P_0}, X_{P_1} and Y_{P_1} (Equation 7) in Equation (5) and solving for Z_{P_0}, we obtain

    Z_{P_0} = k Z_{P_1}    (8)

where

    k = (A_Π x_{p'_1} + B_Π y_{p'_1} + C_Π) / (A_Π x_{p'_0} + B_Π y_{p'_0} + C_Π)    (9)

Now, let d_lb and d_ld be the distances, in 3D, between the two parallel laser beams and between the two laser dots projected onto one of the faces of the box, respectively (Figure 4). Section 6 discusses how to find the laser dots on the image. d_ld can be directly computed from N_Π, the normal vector of the face onto which the dots project, and the known distance d_lb:

    d_ld = d_lb / cos(α) = d_lb / −(N_L · L)    (10)

where α is the angle between −L and N_L, the normalized projection of N_Π onto the plane defined by the two laser beams, and L is the vector representing the laser beam direction. For now, we will assume that the laser plane is parallel to the camera XZ plane and L = (0, 0, 1)^T. Therefore, N_L is obtained by dropping the Y coordinate of N_Π and normalizing the resulting vector. d_ld can also be expressed as the Euclidean distance between the two laser dots in 3D:

    d²_ld = (X_{P_1} − X_{P_0})² + (Y_{P_1} − Y_{P_0})² + (Z_{P_1} − Z_{P_0})²    (11)

Figure 4. Top view of a scene. Two laser beams, apart in 3D by d_lb, project onto one box face at points P_0 and P_1, whose distance in 3D is d_ld. α is the angle between −L and N_L.

Substituting Equations (7), (8) and (10) into (11) and solving for Z_{P_1}, one gets

    Z_{P_1} = √( d²_ld / (a k² − 2 b k + c) )    (12)

where a = (x_{p'_0})² + (y_{p'_0})² + 1, b = x_{p'_0} x_{p'_1} + y_{p'_0} y_{p'_1} + 1 and c = (x_{p'_1})² + (y_{p'_1})² + 1. Given Z_{P_1}, the 3D coordinates of P_1 can be computed as

    P_1 = (X_{P_1}, Y_{P_1}, Z_{P_1})^T = (x_{p'_1} Z_{P_1}, y_{p'_1} Z_{P_1}, Z_{P_1})^T    (13)

The projective ambiguity can finally be removed by computing the D_Π coefficient for the plane equation of the face containing the two dots:

    D_Π = −(A_Π X_{P_1} + B_Π Y_{P_1} + C_Π Z_{P_1})    (14)


3.2.1. Estimating the Laser Plane: In practice, it is difficult to guarantee that the plane defined by the laser beams is parallel to the camera's XZ plane, and that the L vector is aligned with the camera Z-axis. In our scanner prototype, we noticed that although the laser beams are parallel to each other, the plane they define (Π_lb) is not parallel to the camera's XZ plane. Therefore, it is necessary to take into account the angle between these two planes before computing N_L and then d_ld (Equation 10).

The orientation of Π_lb and the direction of the L vector were estimated by projecting the laser beams on a planar checkerboard calibration pattern placed at varying distances from the scanner. By collecting the coordinates of a set of 3D points (corresponding to these projections) along the laser lines, we estimated both Π_lb's orientation and L's direction with respect to the camera's coordinate system.
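The paper does not spell out the fitting step, but one simple way to recover each beam's direction from the collected dot positions is a straight-line fit in 3D: the direction is the dominant eigenvector of the point scatter. The sketch below, entirely our own assumption, approximates that eigenvector with a few power iterations to avoid an eigensolver dependency; Π_lb's orientation then follows from the two fitted beams (e.g., its normal is perpendicular to L and to the vector between the two lines).

    #include <vector>

    // Fit a laser beam direction to 3D dot positions collected while the
    // checkerboard target is moved to several distances (Section 3.2.1).
    Vec3 fitBeamDirection(const std::vector<Vec3>& pts) {
        // Centroid of the samples.
        Vec3 c = { 0, 0, 0 };
        for (const Vec3& p : pts) { c.x += p.x; c.y += p.y; c.z += p.z; }
        c.x /= pts.size(); c.y /= pts.size(); c.z /= pts.size();

        // 3x3 scatter matrix of the centered points.
        double S[3][3] = {{0}};
        for (const Vec3& p : pts) {
            double d[3] = { p.x - c.x, p.y - c.y, p.z - c.z };
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j) S[i][j] += d[i] * d[j];
        }

        // Dominant eigenvector by power iteration; the largest spread is
        // along the beam, since the dots lie on a 3D line.
        double v[3] = { 0, 0, 1 };   // start near the camera Z-axis
        for (int it = 0; it < 50; ++it) {
            double w[3] = { 0, 0, 0 };
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j) w[i] += S[i][j] * v[j];
            double n = std::sqrt(w[0]*w[0] + w[1]*w[1] + w[2]*w[2]);
            for (int i = 0; i < 3; ++i) v[i] = w[i] / n;
        }
        return { v[0], v[1], v[2] };
    }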

3.3. COMPUTING THE BOX DIMENSIONS

Having computed the plane equation of a face of the box, one can recover the 3D coordinates of the vertices of that face. For each such vertex v on the image, we compute v' using Equation (6). We then compute its corresponding Z_V coordinate by substituting Equation (7) into the plane equation for the face. Given Z_V, both the X_V and Y_V coordinates are computed using Equation (7). Since all visible faces of the box share some vertices with each other, the D coefficients for the other faces of the box can also be obtained, allowing the recovery of the 3D coordinates of all vertices on the box silhouette, from which the dimensions are computed.
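This back-projection amounts to one division per vertex: substituting X_V = x'_V Z_V and Y_V = y'_V Z_V into A X + B Y + C Z + D = 0 gives Z_V = −D / (A x'_V + B y'_V + C). A minimal sketch, again reusing the Vec3 type from the earlier listings (the names are ours):

    // Back-project an image vertex onto a known face plane (Section 3.3).
    // vp is the vertex reprojected on the plane Z = 1 (Equation 6); the
    // face plane is N·P + D = 0 with N = (A, B, C).
    Vec3 vertexOnFace(const Vec3& vp, const Vec3& N, double D) {
        double zv = -D / (N.x * vp.x + N.y * vp.y + N.z);
        return { vp.x * zv, vp.y * zv, zv };   // Equation (7)
    }

Each edge length is then simply the Euclidean distance between its two recovered endpoint vertices.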

Although not required for computing the dimensions of the box, the 3D coordinates of the inner vertex m_0 (Figure 3, top) can also be computed. Its 2D coordinates are obtained as the intersection among three lines. Each such line is defined by a vanishing point and the silhouette vertex falling in between the two box edges used to compute that vanishing point. This situation is illustrated in Figure 3. Since it is unlikely that these three lines will intersect exactly at one point, we approximate this intersection using least squares. Given the inner vertex 2D coordinates, its corresponding 3D coordinates are computed using the same algorithm used to compute the 3D coordinates of the other vertices.
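The least-squares intersection is a small linear system: writing each line as a x + b y + c = 0 with (a, b) normalized, minimizing the sum of squared point-line distances yields 2 × 2 normal equations. A sketch under those conventions (ours, not code from the paper):

    // Least-squares intersection of three (or more) 2D lines, used for
    // the inner vertex m0. Each Vec3 holds line coefficients (a, b, c)
    // with (a, b) normalized; returns the homogeneous point minimizing
    // the sum of squared distances to all lines.
    Vec3 intersectLines(const std::vector<Vec3>& lines) {
        double AtA00 = 0, AtA01 = 0, AtA11 = 0, Atb0 = 0, Atb1 = 0;
        for (const Vec3& l : lines) {
            AtA00 += l.x * l.x; AtA01 += l.x * l.y; AtA11 += l.y * l.y;
            Atb0  -= l.x * l.z; Atb1  -= l.y * l.z;
        }
        double det = AtA00 * AtA11 - AtA01 * AtA01;
        return { (Atb0 * AtA11 - Atb1 * AtA01) / det,
                 (Atb1 * AtA00 - Atb0 * AtA01) / det, 1.0 };
    }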

4. A MODEL FOR BACKGROUND PIXELS

In order to obtain the box silhouette, we need to classify the pixels as either background or foreground pixels. One of the most popular techniques for object segmentation is chroma keying [24]. Unfortunately, standard chroma keying techniques do not produce satisfactory results for our application. Shading variations in the background and shadows cast by the boxes usually lead to misclassification of background pixels. Horprasert et al. [14] describe a statistical method that computes a per-pixel model of the background from a set of static background images. While this technique is fast and produces very good segmentation results for scenes acquired from a static camera, it is not appropriate for use with moving cameras. It also requires a complete new calibration when the lighting conditions change too much. To avoid problems from lighting changes, a threshold solution based on the hue component of the HSV color space might seem to be a good solution. However, such an approach tends to misclassify foreground pixels whose colors are close to the background color.

Figure 5. Chromaticity axis rotated to align with a horizontal axis. The curve is a polynomial fit to the chromaticity distortion threshold for each slice (the small rectangles).

In order to support a moving camera, we have developed an approach that proved to be robust, leading to very satisfactory results. It works under different lighting conditions by computing a statistical model of the background, which contains a single hue. Such a model is defined by a chromaticity axis that represents the mean expected shade of the background under various lighting conditions and a polynomial curve describing a variable threshold along the chromaticity axis.

The algorithm takes as input a set of n images I_i of the background acquired under different lighting conditions. In the first step, we compute E, the average color of all pixels in all images I_i, and the eigenvalues and eigenvectors associated with the colors of those pixels. E and the eigenvector associated with the highest eigenvalue define an axis in the RGB color space (the chromaticity axis). The chromaticity distortion d of a given color C is computed as the distance from C to the chromaticity axis.

After discarding the pixels whose projections on the chromaticity axis have at least one saturated channel (they lead to misclassification of bright foreground pixels), we subdivide the chromaticity axis into m slices (Figure 5). For each slice, we compute d̄_j and σ_{d_j}, the mean and the standard deviation, respectively, for the chromaticity distortion of the pixels in the slice. Then, we compute a threshold d_{T_j} for the maximum acceptable slice-chromaticity distortion considering a confidence level of 99% as d_{T_j} = d̄_j + 2.33 σ_{d_j}.

Figure 6. Hough transform parameter space obtained with the conventional (left) and the new voting scheme (right) applied to the segments shown in Figure 2 (f). The peaks represent the supporting lines for silhouette edges.

Finally, the coefficients of a polynomial that models the chromaticity-distortion thresholds are computed by fitting a curve through the d_T values at the centers of the slices (Figure 5). Intuitively, such a polynomial describes a variable threshold for the different shades of the background color. Once the coefficients have been computed, the d_T values are discarded and the tests are performed against the model. Figure 5 illustrates the case of a color C being tested against the background color model. C' is the projection of C on the chromaticity axis. In this example, as the distance between C and the chromaticity axis is bigger than the threshold defined by the polynomial, C will be classified as foreground.

Changing the background color only requires obtaining samples of the new background and computing the new values for the chromaticity axis and the coefficients of the polynomial. It is possible that the box texture may contain some pixels whose chromaticity distortions are smaller than the threshold defined by the polynomial for a given background color shade. In this case, the classification process would incorrectly indicate the presence of background pixels inside the box region. However, the silhouette detection approach described in Section 3.1 can handle groups of misclassified foreground pixels and, in practice, no problems have been noticed as a result of such possible misclassification. According to our experiments, 100 slices and a polynomial of degree 3 produce very satisfactory results.
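The resulting per-pixel test is cheap: project the color onto the chromaticity axis, evaluate the threshold polynomial at the projected position, and compare it with the distance from the color to the axis. The sketch below assumes the model has already been built (mean color E, unit axis direction, cubic coefficients); it is our reading of the test, with illustrative names.

    // Background test against the statistical model of Section 4.
    // E is the mean background color, axis the unit chromaticity-axis
    // direction in RGB space, and coeff[] the fitted cubic, so that
    // threshold(t) = coeff[0] + coeff[1]*t + coeff[2]*t^2 + coeff[3]*t^3,
    // where t locates the projection C' along the axis.
    bool isBackground(const Vec3& color, const Vec3& E, const Vec3& axis,
                      const double coeff[4]) {
        Vec3 d = { color.x - E.x, color.y - E.y, color.z - E.z };
        double t = d.x * axis.x + d.y * axis.y + d.z * axis.z;

        // Chromaticity distortion: distance from the color to the axis.
        Vec3 r = { d.x - t * axis.x, d.y - t * axis.y, d.z - t * axis.z };
        double dist = std::sqrt(r.x * r.x + r.y * r.y + r.z * r.z);

        // Variable threshold along the axis (the polynomial of Figure 5).
        double thr = coeff[0] + t * (coeff[1] + t * (coeff[2] + t * coeff[3]));
        return dist <= thr;
    }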

5. IDENTIFYING ALMOST COLLINEAR SEGMENTS

To compute the image coordinates of the box vertices, we first need to obtain the supporting lines for the silhouette edges. We do this using a Hough transform procedure [7]. However, the conventional voting process and the detection of the most significant lines are computationally intensive and turned out to be a bottleneck for our system. To reduce the amount of computation, an alternative to the conventional voting process was developed.

As seen in Section 3.1, silhouette pixels are organized into perceptually most significant straight-line segments. The new voting scheme consists in casting votes directly for these segments, instead of for individual pixels as is traditionally done [7]. Thus, for each segment, the (ρ, θ) parameters of its supporting line are computed from the average position of the set of pixels defining the segment and from the 2D eigenvectors of that pixel distribution. The eigenvector with the smaller eigenvalue is the normal to the line, so ρ can be computed as the dot product between this eigenvector and the average pixel. θ is the angle between the X-axis of the image and the secondary eigenvector. The use of eigenvectors makes the process robust, allowing it to handle lines with arbitrary orientations in a consistent way.

For each segment, we distribute its votes in the parameter space using an elliptical Gaussian kernel (Figure 6, right), whose central position is defined by the (ρ, θ) parameters of the line fit to its set of pixels. The Gaussian kernel spreads the votes over a region of the parameter space around (ρ, θ), according to the quality of the fit. Notice, however, that different segments have different numbers of pixels, as well as different degrees of dispersion around their corresponding best-fitting lines. The smaller the dispersion, the more concentrated these votes should be in the parameter space. We estimate the quality of the line fit by computing the variances and covariance of the (ρ, θ) parameters. The variances give the dimensions of the axes of the elliptical kernel, while the covariance gives the orientation of the ellipse. One can compute the variances and covariance of ρ and θ using a linear regression procedure [6]. However, standard linear regression procedures use the slope-intercept line notation, so one can use a first-order uncertainty propagation analysis [25] to compute the variances and covariance of ρ and θ from the variances and covariances of the slope-intercept parameters. Once one has computed the variances and covariance associated with ρ and θ, the votes are cast using a bivariate Gaussian distribution [26]. The use of a Gaussian kernel distributes the votes around a neighborhood, allowing the identification of approximately collinear segments. This is a very important and unique feature of our approach that allows our system to better handle discretization errors and boxes with slightly bent edges. A detailed explanation of the proposed voting process can be found in [8].
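The sketch below shows the two ingredients of the scheme for a single segment: the (ρ, θ) parameters obtained from the mean pixel and the eigenvectors of the pixel scatter, and the spreading of votes with a Gaussian falloff. For brevity it uses an axis-aligned (uncorrelated) kernel driven by the standard deviations of ρ and θ rather than the full bivariate Gaussian with covariance described above, and it ignores the wrap-around near θ = 0°; all names and the accumulator layout are ours.

    struct Pix { double x, y; };

    // Cast kernel-shaped votes for one segment (Section 5). The parameter
    // space uses 360 theta steps in [0, pi) and 1600 rho steps in [-400, 400).
    void voteSegment(std::vector<std::vector<float>>& acc,
                     const std::vector<Pix>& seg,
                     double sigmaRho, double sigmaTheta) {
        const double PI = 3.14159265358979;

        // Mean pixel position.
        double mx = 0, my = 0;
        for (const Pix& p : seg) { mx += p.x; my += p.y; }
        mx /= seg.size(); my /= seg.size();

        // 2x2 scatter matrix of the pixels around the mean.
        double sxx = 0, sxy = 0, syy = 0;
        for (const Pix& p : seg) {
            sxx += (p.x - mx) * (p.x - mx);
            sxy += (p.x - mx) * (p.y - my);
            syy += (p.y - my) * (p.y - my);
        }

        // Minor eigenvector = line normal; rho is its dot product with the
        // mean pixel, theta its angle with the X-axis.
        double major = 0.5 * std::atan2(2.0 * sxy, sxx - syy);
        double nx = -std::sin(major), ny = std::cos(major);
        double rho = nx * mx + ny * my;
        double theta = std::atan2(ny, nx);
        if (theta < 0) { theta += PI; rho = -rho; }   // keep theta in [0, pi)

        // Spread the votes with a Gaussian falloff over +-3 sigma.
        const int NT = 360, NR = 1600;
        for (int dt = -3; dt <= 3; ++dt)
            for (int dr = -3; dr <= 3; ++dr) {
                double th = theta + dt * sigmaTheta;
                double r  = rho + dr * sigmaRho;
                double w  = std::exp(-0.5 * (dt * dt + dr * dr));
                int it = int(th / PI * NT);
                int ir = int((r + 400.0) / 800.0 * NR);
                if (it >= 0 && it < NT && ir >= 0 && ir < NR)
                    acc[it][ir] += float(w * seg.size());
            }
    }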

Using the new approach, the voting process and the peak detection are significantly improved because the number of cells that receive votes is substantially reduced. Figure 6 shows the parameter space after the traditional (left) and the new (right) voting processes have been applied to the segments shown in Figure 2 (f). Using the conventional (i.e., per-pixel) voting scheme [7], 376,884 votes are distributed over 228,255 cells. In contrast, the new approach casts only 6,382 votes distributed over 5,020 cells, which represents 1.7% of the number of votes and 2.2% of the number of cells used in the conventional approach. As a result, the produced voting map is very clean (Figure 6, right), reducing ambiguities and improving the identification of the most important lines. The extra cost involved in computing the covariance matrices associated with a few segments and in the use of Gaussian elliptical kernels to cast votes is more than compensated by the huge savings achieved.

Special care must be taken when the θ parameter is close to 0° or to 180°. In this situation, the voting process continues in the diagonally opposing quadrant, at the −ρ position (see Figure 6, peaks d and t). For the examples shown in the paper, the parameter space was discretized using 360 angular steps in the range θ = [0°, 180°) and 1,600 ρ values in the range [−400, 400].

6. FINDING THE LASER DOTS

The ability to find the proper positions of the laser dots in the image can be affected by several factors such as the camera's shutter speed, the box materials and textures, and ambient illumination. Although we are using a red laser (650 nm, class II), we cannot rely simply on the red channel of the image to identify the positions of the dots. Such a procedure would not distinguish between the laser dots and red texture elements on the box. Since the pixels corresponding to the laser dots present very high luminance, we identify them by thresholding the luminance image. However, simple thresholding may not work for boxes containing white regions, which tend to have large areas with saturated pixels. We solved this problem by setting the camera's shutter speed so that the laser dots are the only elements in the image with high luminance.

Since the image of a laser spot is composed of several pixels, we approximate the actual position of the dot by the centroid of its pixels. According to our experiments, a variation of one pixel in the estimated center of the laser spot produces a variation of a few millimeters in the computed dimensions. These numbers were obtained assuming a camera standing about two meters from the box. After computing the position of the inner vertex (Section 3.3), the face that contains the laser dots is identified.
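Thresholding plus a centroid takes only a few lines of code; the sketch below assumes an 8-bit luminance image and a caller-supplied search window and threshold value (both our illustrative choices).

    // Locate one laser dot (Section 6): threshold the luminance image
    // inside a small search window and return the centroid of the bright
    // pixels. lum is a row-major w x h 8-bit luminance image.
    bool findLaserDot(const unsigned char* lum, int w, int h,
                      int x0, int y0, int x1, int y1,
                      unsigned char threshold, double& cx, double& cy) {
        double sx = 0, sy = 0;
        long count = 0;
        for (int y = y0; y < y1 && y < h; ++y)
            for (int x = x0; x < x1 && x < w; ++x)
                if (lum[y * w + x] >= threshold) {
                    sx += x; sy += y; ++count;
                }
        if (count == 0) return false;   // dot lost (e.g., a black region)
        cx = sx / count; cy = sy / count;
        return true;
    }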

The system may fail to properly detect the laser dots if they project onto some black region or if the surface exhibits specular peaks. This, however, can be avoided by aiming the beams at other portions of the box. Due to the construction of the scanner prototype and to some epipolar constraints [13], one only needs to search for the laser dots inside a small window in the image. Although a single laser beam could be used to break the projective ambiguity, the use of two beams introduces additional constraints that make silhouette identification more robust.

7. RESULTS

We have built a prototype of a scanner for computing box dimensions and implemented the techniques described in the paper using C++. The system was tested on several real boxes. For a typical scene, such as the one shown in Figure 2, it can process video and compute box dimensions at about 29 fps. For comparison, the frame rate drops to 10 fps if the traditional pixel-based Hough-transform voting scheme (Figure 6, left) is used. Such numbers illustrate the effectiveness of the proposed voting solution. These measurements were made on a 2.8 GHz PC with 1.0 GB of memory. A video sequence illustrating the use of our scanner can be found at http://www.inf.ufrgs.br/~laffernandes/boxdimensions.

Figure 1 (left) shows the scanner prototype, whose hardware is comprised of a FireWire color camera (Point Grey Research DragonFly with 640 × 480 pixels), a 16 mm lens (Computar M1614, with manual focus, no iris and 30.9 degrees horizontal field of view) and two laser pointers. The camera is mounted on a plastic box and the laser pointers were aligned and glued to the sides of this box. In such an assembly, the laser beams are 15.8 cm apart. For our experiments, we acquired pictures of boxes at distances varying from 1.7 to 3.0 meters from the camera. The background was created using a piece of green cloth and its statistical model was computed from a set of 23 images. Figure 7 shows some examples of boxes used to test our system. Some of these boxes are particularly challenging: (e), (f) and (g) are very bright; (f) and (g) have a reflective plastic finishing; and box (b) is mostly covered with red texture. The dimensions of these boxes vary from 13.8 to 48.2 cm. The intrinsic parameters of the camera (Equation 4) were estimated using a calibration procedure [2].

Figure 7. Examples of real boxes (a) through (h) used for testing.

The geometry of a real box is somewhat different from a parallelepiped due to imperfections introduced during construction and handling. For instance, bent edges, different sizes for two parallel edges of the same face, lack of parallelism between opposing faces, and warped corners are not unlikely to be found in practice. Such inconsistencies lead to errors in the orientation of the silhouette edges, which are cascaded into the computation of the box dimensions.

In order to estimate the inherent inaccuracies of the proposed algorithm, we performed measurements on a wooden box (Figure 7, h) that was carefully constructed to avoid these imperfections. We have also implemented a simulator that performs the same computations on images of synthetic boxes (exact parallelepipeds) generated using computer graphics techniques. Using images generated by the simulator, our system can recover the dimensions of the box with an average relative error of 0.58%. Next, we analyze some of the results obtained on real boxes.

7.1. STATISTICAL ANALYSIS ON REAL BOXES

In order to evaluate the effectiveness of the proposed approach, we carried out a few statistical experiments. First, we selected several real boxes (Figure 7) and manually measured the dimensions of all their edges with a ruler. Each edge was measured twice, once per shared face of the edge. The eight measurements of each box dimension were averaged to produce a single value per dimension. All measurements were made in centimeters. We then used our system to collect a total of 30 measurements of each dimension of the same box. For each collected sample, we projected the laser beams on different parts of the box. We used this data to compute the mean, standard deviation and confidence intervals for each of the computed dimensions. The confidence intervals were computed as

    CI = [ x̄ − t_γ σ/√n , x̄ + t_γ σ/√n ]

where x̄ is the mean, σ is the standard deviation, n is the size of the sample and t_γ is a t-Student variable with n − 1 degrees of freedom, such that the probability of a measure x belonging to CI is γ. The tighter the CI, the more precise the computed values.
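Computing such an interval from the 30 samples is straightforward; in the sketch below the t-Student critical value t_γ is supplied by the caller (e.g., looked up from a table), since the standard library provides no such lookup.

    #include <utility>

    // Confidence interval for one box dimension (Section 7.1) from a set
    // of measurements x and the t-Student critical value for n-1 degrees
    // of freedom at the desired confidence level.
    std::pair<double, double> confidenceInterval(const std::vector<double>& x,
                                                 double tGamma) {
        double n = double(x.size()), mean = 0, var = 0;
        for (double v : x) mean += v;
        mean /= n;
        for (double v : x) var += (v - mean) * (v - mean);
        double sigma = std::sqrt(var / (n - 1.0));   // sample std deviation
        double half = tGamma * sigma / std::sqrt(n);
        return { mean - half, mean + half };
    }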

Figure 8 shows the computed confidence intervals with γ = 99.5%. Note that the values of the actual dimensions (the red line) fall inside most of these confidence intervals, indicating accurate measurements. Boxes (f), (g) and especially the wooden box (h) are the ones with the tightest confidence intervals. Those are well-constructed boxes. Wider confidence intervals were obtained for boxes with bent faces and edges, like boxes (a) and (e). The only box whose actual dimensions do not fall inside the confidence interval is box (d). This box has a cardboard lid that changes the box silhouette, shifting the computed mean values away from the true ones.

Another estimate of the error can be expressed as the relative error ε = |x − x_v| / x_v, where x is the computed dimension and x_v is the value of the actual dimension. Figure 9 shows a histogram of the relative errors in the measurements obtained with our scanner prototype for the boxes shown in Figure 7. The higher relative errors were computed for boxes (a), (d) and (e) (the ones exhibiting imperfections), which is in accordance with the experiment summarized in Figure 8. Considering all measurements, the mean relative error for all real boxes is 3.75%, indicating very good accuracy.

7.2. ERROR PROPAGATION

The error associated with a variable w computed from a set of experimental data can be estimated using the following error propagation model [25]:

    Λ_w = ∇f Λ_ϑ ∇f^T    (15)

where Λ_w is the covariance matrix that models the errors in w; ∇f is the Jacobian matrix of the function f(ϑ) that computes each term of w from the n input variables; and Λ_ϑ is the covariance matrix that models the errors of the input variables. The model assumes a Gaussian distribution of the errors around the mean values estimated by f(ϑ) and allows the computation of confidence intervals for the length of each visible edge using a single input image.

Figure 8. Confidence intervals computed from measurements of the edges of the boxes shown in Figure 7. The box edges are sorted by length.

To apply this error propagation model, one needs to estimate the error associated with each input variable. In the proposed method, the input variables are:

• h_i = (ρ_i, θ_i)^T, 0 ≤ i ≤ 5: the coefficients of the normal equation of the supporting lines for the silhouette edges (12 variables);

• p_j = (x_{p_j}, y_{p_j})^T, 0 ≤ j ≤ 1: the image coordinates of the laser dots (4 variables);

• d_lb: the distance between the laser beams (1 variable);

• K: the camera's intrinsic-parameters matrix (5 variables);

• L = (X_L, Y_L, Z_L)^T: the laser beam direction (3 variables).

So, Λ_ϑ is a 25 × 25 covariance matrix, comprised of the variances and covariances of all input variables:

    Λ_ϑ = diag(Λ_{h_0}, …, Λ_{h_5}, Λ_{p_0}, Λ_{p_1}, Λ_{d_lb}, Λ_K, Λ_L)    (16)

Figure 9. Histogram of the relative errors in the computed dimensions for the edges of the boxes in Figure 7.

Given these variables, the Jacobian of the function that computes the length of the target box edges can be obtained as shown in Equation (17). The partial derivatives in ∇f are calculated using the chain rule over the set of equations presented in Section 3.

    ∇f = ( ∂w/∂ρ_0   ∂w/∂θ_0   …   ∂w/∂X_L   ∂w/∂Y_L   ∂w/∂Z_L )    (17)

The error propagation model expressed by Equations 15, 16 and 17 allows our system to estimate the uncertainty associated with the measurement of each edge of the box in real time. This information can be used to discard unreliable measurements, which may result if the box is relatively far from the camera, or if one of the box edges approaches a direction almost perpendicular to the camera's image plane.
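As a generic illustration of Equation (15) for a scalar measurement, the sketch below propagates a diagonal input covariance through a function f using a central-difference Jacobian. Our system uses analytic derivatives (the chain rule over Section 3) and the full 25 × 25 Λ_ϑ; the simplifications here are for brevity only.

    #include <functional>

    // First-order propagation (Equation 15) for one edge length w = f(theta):
    // var(w) = ∇f Λ_ϑ ∇f^T, with Λ_ϑ assumed diagonal and ∇f approximated
    // by central differences.
    double propagateVariance(
        const std::function<double(const std::vector<double>&)>& f,
        std::vector<double> theta, const std::vector<double>& variances) {
        double var = 0.0;
        for (size_t i = 0; i < theta.size(); ++i) {
            double h = 1e-6 * (std::fabs(theta[i]) + 1.0);   // step size
            double saved = theta[i];
            theta[i] = saved + h; double fp = f(theta);
            theta[i] = saved - h; double fm = f(theta);
            theta[i] = saved;
            double J = (fp - fm) / (2.0 * h);                // ∂w/∂θ_i
            var += J * J * variances[i];                     // diagonal Λ_ϑ term
        }
        return var;   // variance of the computed edge length
    }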

Special care must be taken when choosing the distance between the laser beams. The uncertainty in the computed dimensions increases as the distance between the lasers decreases, because the relative error tends to increase as the distance becomes smaller. However, the distance between the laser beams constrains the minimal accepted size for a box, since both laser dots must fall inside the same face. So, for a given application, one should consider a trade-off between the minimal box size and the accepted uncertainty in the measurements.

8. CONCLUSIONS AND FUTURE WORK

We have presented a completely automatic approach for computing the dimensions of boxes from single perspective projection images in real time. The approach uses information extracted from the silhouette of the target box and removes the projective ambiguity with the use of two parallel laser beams. We demonstrated the effectiveness of the proposed techniques by building a prototype of a scanner and using it to compute the dimensions of several real boxes, even when the edges of the target box are partially occluded by other objects.

We have also introduced an algorithm for extracting box silhouettes in the presence of partially occluded edges, an efficient voting scheme for grouping approximately collinear segments using a Hough transform, and a statistical model for background removal that works with a moving camera and under different lighting conditions. We validated the proposed approach by performing a statistical analysis over measurements obtained with our scanner prototype from real boxes. In addition, we presented an analytical derivation of the uncertainty propagated along the entire computation chain that allows real-time estimation of the error in the computed measurements. The statistics and experimental validation have shown that the proposed approach is accurate and precise.

Our algorithm for computing box dimensions can also be used by applications requiring heterogeneous backgrounds. For that, background detection can be performed using a technique like the one described in [14]. In this case, the camera should remain static while the boxes are moved on some conveyor belt.

We believe that these ideas may lead to optimizations of several procedures that are currently based on manual measurements of box dimensions. We are currently exploring ways of using arbitrary backgrounds with a moving camera.

Acknowledgments

This work was partially sponsored by CNPq - Brasil (Processo No 477344/2003-8), Petrobras (Processo 502009/2003-9), and Microsoft Brazil.

References

[1] P. Besl. Active Optical Range Imaging Sensors. Advances in Machine Vision, pages 1-63. Springer-Verlag, New York, NY, USA, 1988.

[2] J. Y. Bouguet. Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc, Jan. 2005.

[3] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, Nov. 1986.

[4] M. B. Clowes. On seeing things. Artificial Intelligence, 2:79-116, 1971.

[5] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology. In Proceedings of the 7th IEEE International Conference on Computer Vision (ICCV-99), volume 1, pages 434-441, Kerkyra, Greece, Sept. 1999. IEEE Computer Society.

[6] N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley & Sons, New York, 1966.

[7] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11-15, Jan. 1972.

[8] L. A. F. Fernandes. Um método projetivo para cálculo de dimensões de caixas em tempo real. Master's thesis, Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil, Jan. 2006. (in Portuguese).

[9] L. A. F. Fernandes, M. M. Oliveira, R. da Silva, and G. Crespo. Computing box dimensions from single perspective images in real time. In Proceedings of the XVIII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI 2005), pages 155-162, Natal, RN, Brazil, Oct. 2005. IEEE Computer Society.

[10] F. Figueroa and A. Mahajan. A robust method to determine the coordinates of a wave source for 3-D position sensing. ASME Journal of Dynamic Systems, Measurements and Control, 116:505-511, Sept. 1994.

[11] H. Fuchs, Z. M. Kedem, and B. F. Naylor. On visible surface generation by a priori tree structures. In Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH-80), pages 124-133, New Orleans, Louisiana, 1980. ACM Press.

[12] J. Gauch. KUIM, image processing system. http://www.ittc.ku.edu/~jgauch/research, Jan. 2003.

[13] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK, 2000.

[14] T. Horprasert, D. Harwood, and L. S. Davis. A statistical approach for real-time robust background subtraction and shadow detection. In Proceedings of the 7th IEEE ICCV-99, FRAME-RATE Workshop, Kerkyra, Greece, Sept. 1999. IEEE Computer Society.

[15] D. A. Huffman. Impossible objects as nonsense sentences. In Machine Intelligence, volume 6, pages 295-324. Edinburgh University Press, Edinburgh, 1971.

[16] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(2):150-162, Feb. 1994.

[17] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, S. Anderson, J. Davis, J. Ginsberg, J. Shade, and D. Fulk. The digital Michelangelo project: 3D scanning of large statues. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH-00), pages 131-144, New Orleans, Louisiana, Jul. 2000. ACM Press.

[18] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133-135, Sept. 1981.

[19] D. G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31:355-395, Mar. 1987.

[20] K. Lu. Box dimension finding from a single gray-scale image. Master's thesis, SUNY Stony Brook, New York, 2000.

[21] W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan. Image-based visual hulls. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH-00), pages 369-374, New Orleans, Louisiana, Jul. 2000. ACM Press.

[22] L. Nyland, D. McAllister, V. Popescu, C. McCue, A. Lastra, P. Rademacher, M. Oliveira, G. Bishop, G. Meenakshisundaram, M. Cutts, and H. Fuchs. The impact of dense range data on computer graphics. In Proceedings of the Multi-View Modeling and Analysis Workshop (MVIEW99), pages 3-10, Fort Collins, CO, Jun. 1999. IEEE Computer Society.

[23] J. W. H. Tangelder, P. Ermes, G. Vosselman, and F. A. van den Heuvel. CAD-based photogrammetry for reverse engineering of industrial installations. Computer-Aided Civil and Infrastructure Engineering, 18:264-274, Jul. 2003.

[24] P. Vlahos. Composite color photography. U.S. Patent 3,158,477, 1964.

[25] J. H. Vuolo. Fundamentos da Teoria de Erros. Edgard Blücher, São Paulo, SP, Brazil, 1992. (in Portuguese).

[26] E. W. Weisstein. Bivariate normal distribution. http://mathworld.wolfram.com/BivariateNormalDistribution.html, Oct. 2005.