

REAL-TIME OBJECT DETECTION AND TRACKING FOR INDUSTRIAL APPLICATIONS

Selim Benhimane 1, Hesam Najafi 2, Matthias Grundmann 3, Ezio Malis 4, Yakup Genc 2, Nassir Navab 1

1 Department of Computer Science, Technical University Munich, Boltzmannstr. 3, 85748 Garching, Germany

2 Real-Time Vision & Modeling Dept., Siemens Corporate Research, Inc., College Rd E, Princeton, NJ 08540, USA
3 College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

4 I.N.R.I.A. Sophia-Antipolis, France
[email protected], hesam.najafi@siemens.com

Keywords: Real-time Vision, Template-based Tracking, Object Recognition, Object Detection and Pose Estimation, Augmented Reality.

Abstract: Real-time tracking of complex 3D objects has been shown to be a challenging task for industrial applications where robustness, accuracy and run-time performance are of critical importance. This paper presents a fully automated object tracking system which is capable of overcoming some of the problems faced in industrial environments. This is achieved by combining a real-time tracking system with a fast object detection system for automatic initialization and re-initialization at run-time. This ensures robustness of object detection, and at the same time accuracy and speed of recursive tracking. For the initialization we build a compact representation of the object of interest using statistical learning techniques during an off-line learning phase, in order to achieve speed and reliability at run-time by imposing geometric and photometric consistency constraints. The proposed tracking system is based on a novel template management algorithm which is incorporated into the ESM algorithm. Experimental results demonstrate the robustness and high precision of tracking of complex industrial machines with poor textures under severe illumination conditions.

1 INTRODUCTION

Many applications require tracking of complex 3D objects in real-time. 3D tracking aims at continuously estimating the 3D displacement of the camera relative to an object of interest, i.e. recovering all six degrees of freedom that define the camera position and orientation relative to that object. The applications include Augmented Reality, where real-time registration is essential for visual augmentation of the object.

This paper presents a fully automated 3D object detection and tracking system which is capable of overcoming the problems of robust and high precision tracking of complex objects, such as industrial machines with poor textures, in real-time. The system proves to be fast and reliable enough for industrial service scenarios, where 3D information is presented to the service technician, e.g. on a head-mounted display for hands-free operation. This is achieved by combining a real-time tracking system with a fast object detection system for automatic initialization. For this purpose, we first introduce a template-based tracking system using temporal continuity constraints for accurate short baseline tracking in real-time. This is achieved by incorporating a template management algorithm into the ESM algorithm (Benhimane and Malis, 2006), which allows an unsupervised selection of the regions of the image belonging to the object and their automatic update.

Imposing temporal continuity constraints across frames increases the accuracy and quality of the tracking results. However, because of the recursive nature of tracking algorithms, they still require an initialization, i.e. providing the system with the initial pose of the object or camera. In fact, once tracking is started, rapid camera or object movements cause image features to undergo large motion between frames, which can cause visual tracking systems to fail. Furthermore, the lighting during a shot can change significantly; reflections and specularities may confuse the tracker. Finally, complete or partial occlusions, large amounts of background clutter, or simply the object moving out of the field of view may result in tracking failure. Once the system loses track, a re-initialization is required to continue tracking. In our particular case, namely industrial applications of Augmented Reality where the user wears a head-mounted camera, this problem becomes very challenging. Due to the fast head movements of the user while working in a collaborative industrial environment, frequent and fast initialization is required. Manual initialization or the use of hybrid configurations, for instance instrumenting the camera with inertial sensors or gyroscopes, is not a practical solution and in many cases not accepted by the end-users.

We therefore propose a purely vision-based method that can detect the target object and compute its three-dimensional pose from a single image. However, wide baseline matching tends to be both less accurate and more computationally intensive than the short baseline variety. Therefore, in order to overcome the above problems, we combine the object detection system with the tracking system using a management framework. This allows us to get robustness from object detection, and at the same time accuracy from recursive tracking.

2 RELATED WORK

Several papers have addressed the problem of template-based tracking in order to obtain the direct estimation of 3D camera displacement parameters. The authors of (Cobzas and Jagersand, 2004) avoid the explicit computation of the Jacobian that relates the variation of the 3D pose parameters to the appearance in the image by using implicit function derivatives. In that case, the minimization of the image error is done using the inverse compositional algorithm (Baker and Matthews, 2004). However, since the parametric models are the 3D camera displacement parameters, this method is valid only for small displacements around the reference positions, i.e. around the positions of the keyframes (Baker et al., 2004). The authors of (Buenaposada and Baumela, 2002) extend the method proposed in (Hager and Belhumeur, 1998) to homographic warpings and make the assumption that the true camera pose can be approximated by the current estimated pose (i.e. the camera displacement is sufficiently small). In addition, the Euclidean constraints are not directly imposed during the tracking; once a homography has been estimated, the rotation and the translation of the camera are then extracted from it. The authors of (Sepp and Hirzinger, 2003) go one step further and extend the method by including the constraint that a set of control points on a three-dimensional surface undergo the same camera displacement.

These methods work well for small interframe displacements. Their major drawback is their restricted convergence radius, since they are all based on a very local minimization which assumes that the image pixel intensities vary linearly with respect to the estimated motion parameters. Consequently, tracking inevitably fails when the relative motion between the camera and the tracked objects is fast. Recently, in (Benhimane and Malis, 2006), the authors proposed a second-order optimization algorithm that considerably increases the convergence domain and rate of standard tracking algorithms while having an equivalent computational complexity. This algorithm works well for planar and simple piecewise-planar objects. However, it fails when the tracking needs to take new regions into account or when the appearance of the object changes during the camera motion.

In this paper, we adopt an extension of this algorithm and propose a template management algorithm that allows an unsupervised selection of regions belonging to the object that become newly visible in the image, and their automatic update. This greatly improves the tracking results in real industrial applications and makes the system much more scalable and applicable to complex objects and severe illumination conditions.

For the initialization of the tracking system, a desirable approach would be to use purely vision-based methods that can detect the target object and compute its three-dimensional pose from a single image. If this can be done fast enough, it can then be used to initialize and re-initialize the system as often as needed. This problem of automated initialization is difficult because, unlike in short baseline tracking, a crucial source of information is not available: a strong prior on the pose and, with it, the spatial-temporal adjacency of features.

The computer vision literature includes many object detection approaches based on representing objects of interest by a set of local 2D features such as corners or edges. Combinations of such features provide robustness against partial occlusion and cluttered backgrounds. However, the appearance of the features can be distorted by large geometric and significant illumination changes. Recently, there have been some exciting breakthroughs in addressing the problem of wide-baseline feature matching. For instance, feature detectors and descriptors such as SIFT (Lowe, 2004) and SURF (Bay et al., 2006) and affine covariant feature detectors (Matas et al., 2002; Mikolajczyk and Schmid, 2004; Tuytelaars and Van Gool, 2004) have shown maturity in real applications. (Mikolajczyk et al., 2005) showed that a characterization of affine covariant regions with the SIFT descriptor can cope with the geometric and photometric deformations between wide baseline images. More recently, (Lepetit and Fua, 2006) introduced a very robust feature matching approach that runs in real-time. They treat the matching problem as a classification problem using Randomized Trees.

In this paper, we propose a framework that alleviates some of the limitations of these approaches in order to make them more scalable for large industrial environments. In our applications, both 3D models and several training images are available or can be created easily during an off-line process. The key idea is to use the underlying 3D information to limit the number of hypotheses, reducing the effect of the complexity of the 3D model on the run-time matching performance.

3 TRACKING INITIALIZATION

3.1 The approach

The goal is to automatically detect objects and recover their pose in arbitrary images. The proposed object detection approach is based on two stages: a learning (or training) stage which is done off-line, and the matching and pose estimation stage at run-time. The entire learning and matching processes are fully automated and unsupervised. In the training phase, a compact appearance and geometric representation of the target object is built. In the second phase, when the initialization is invoked, an image is processed for detecting the target object using the representation built in the training phase, enforcing both photometric and geometric consistency constraints.

3.2 The training phase

Given a set of calibrated training images, in the first step of the learning stage a set of stable feature regions is selected from the object by analyzing a) their detection repeatability and accuracy from different viewpoints and b) their distinctiveness, i.e. how distinguishable the detected object regions are.

3.2.1 Feature extraction and evaluation

First, affine covariant features are extracted from the images. In our experiments we use a set of state-of-the-art affine region detectors (Mikolajczyk et al., 2005). Region detection is performed in two major steps. First, interest points are extracted in scale-space, e.g. Harris corners or blob-like structures obtained by taking local extrema of a Difference-of-Gaussians pyramid. In the second step, an elliptical region is determined for each point in an affine invariant way. See (Mikolajczyk et al., 2005) for more details and a comparison of affine covariant region detectors. An accelerated version of the detectors can be achieved by using libraries such as Intel's IPP. Next, the intensity pattern within each feature region is described with the SIFT descriptor (Lowe, 2004), representing the local orientation histograms. The SIFT descriptor has been shown to be very effective on a number of measures (Mikolajczyk and Schmid, 2005). Given a sequence of images, the elliptical feature regions are detected independently in each viewpoint. The corresponding image regions cover approximately the same object surface region (model region). Then, the feature regions are normalized against the geometric and photometric deformations and described by the SIFT descriptor to obtain a viewpoint and illumination invariant description of the intensity patterns within each region. Hence, each model region is represented by a set of descriptors from multiple views (multiple view descriptors). However, for a better statistical representation of the variations in the multiple view descriptors, each model region needs to be sampled from different viewpoints. Since not all samplings can be covered by the limited number of training images, synthesized views are created from other viewpoints using computer graphics rendering techniques. Having a set of calibrated images and the model of the target object, the viewing space is coarsely sampled at discrete viewpoints and a set of synthesized views is generated.
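As an illustration, the per-view extraction step could look as follows. This is a minimal sketch using OpenCV's SIFT (available as cv2.SIFT_create in OpenCV 4.4 and later) as a convenient stand-in for the affine covariant detectors evaluated in (Mikolajczyk et al., 2005); the function name and the unit-magnitude normalization mirror the description above and in Section 3.2.2, not an implementation from the paper.

```python
import cv2
import numpy as np

def extract_features(image_bgr):
    """Detect interest regions and compute local descriptors for one view.

    OpenCV's SIFT stands in for the affine covariant detectors used in
    the paper: it extracts scale-space extrema of a Difference-of-Gaussians
    pyramid and describes each region with local orientation histograms.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:
        return [], np.empty((0, 128), np.float32)
    # Normalize each descriptor to unit magnitude to reduce sensitivity
    # to linear illumination changes (cf. Section 3.2.2).
    descriptors /= np.linalg.norm(descriptors, axis=1, keepdims=True)
    return keypoints, descriptors
```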

Once all the features are extracted, we select the most stable and distinctive feature regions, which are characterized by their detection and descriptor performance. The evaluation is done as follows. The detection performance of a feature region is measured by its detection repeatability and accuracy. The basic measure of accuracy and repeatability is the relative amount of overlap between the detected regions in each image and the respective reference regions projected onto that image using the ground truth transformation. The reference regions are determined from the views parallel to the corresponding model region. The performance of the descriptor is measured by the matching criterion, i.e. how well the descriptor represents an object region. This is determined by comparing the number of corresponding regions obtained with the known ground truth and the number of correctly matched regions.
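A sketch of the overlap criterion, under the assumption that regions are represented as rotated ellipses in image coordinates; rasterizing the two regions and measuring intersection-over-union is one simple way to compute the relative overlap (the exact computation in the cited evaluation differs in its details).

```python
import cv2
import numpy as np

def region_overlap(ellipse_a, ellipse_b, image_shape=(480, 640)):
    """Relative overlap between a detected region and the projected
    ground-truth reference region, both given as OpenCV rotated-rect
    ellipses ((cx, cy), (width, height), angle_deg).

    Rasterizes both regions and returns intersection-over-union.
    """
    mask_a = np.zeros(image_shape, np.uint8)
    mask_b = np.zeros(image_shape, np.uint8)
    cv2.ellipse(mask_a, ellipse_a, 1, thickness=-1)  # filled ellipse
    cv2.ellipse(mask_b, ellipse_b, 1, thickness=-1)
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union else 0.0
```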

Depending on the 3D structure of the target object, a model region may be clearly visible only from certain viewpoints in the scene. Therefore, for each model feature we create a similarity map which represents its visibility distribution, obtained by comparing it with the corresponding extracted features. As a similarity measure, we use the Mahalanobis distance between the respective feature descriptors. For each model region, the visibility map is the set of its appearances in the views from all possible viewpoints. Based on the computed similarity values of each model region, we cluster groups of viewpoints together using the mean-shift algorithm (Comaniciu and Meer, 2002).
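A possible sketch of this clustering step, assuming the per-viewpoint similarity values have already been computed; scikit-learn's MeanShift is used here in place of the paper's mean-shift implementation, and the visibility threshold is illustrative.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_viewpoints(viewpoints, similarities, sim_threshold=0.5):
    """Build the viewpoint clusters W(m_j) for one model region m_j.

    viewpoints:   (V, 3) camera positions sampled over the viewing space.
    similarities: (V,)   Mahalanobis-based similarity of the region's
                  appearance in each view to the model descriptor.
    """
    visible = viewpoints[similarities >= sim_threshold]
    if len(visible) == 0:
        return []  # region never reliably visible
    labels = MeanShift().fit(visible).labels_
    return [visible[labels == k] for k in np.unique(labels)]
```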


The clustered viewpoints for a model region $m_j$ are $W(m_j) = \{\mathbf{v}_{j,k} \in \mathbb{R}^3 \mid 0 < k \leq N_j\}$, where $\mathbf{v}_{j,k}$ is a viewpoint of that region.

3.2.2 Learning the statistical feature representation

This section describes a method to incorporate the multiple view descriptors into our statistical model. To minimize the impact of variations of illumination, especially between the real and synthesized images, the descriptor vectors are normalized to unit magnitude. Two different methods have been tested for the feature representation: PCA and Randomized Trees. We use the PCA-SIFT descriptor (Ke and Sukthankar, 2004) for a compact representation using the first 32 components. The descriptor vectors $\mathbf{g}_{i,j}$ are projected into the feature space, yielding feature vectors $\mathbf{e}_{i,j}$. For each region, we take samples from the respective views so that the distribution of their feature vectors $\mathbf{e}_{i,j}$, for $0 < j \leq K$, in the feature space is Gaussian. To ensure the Gaussian distribution of the vectors for each set of viewpoints $W(m_j)$, we apply the $\chi^2$ test for a maximal number of samples. If the $\chi^2$ test still fails after a certain number of samplings for a region, the region is considered not reliable enough and is excluded. For each input set of viewpoints, we then learn the covariance matrix and the mean of the distribution.
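A sketch of this learning step under stated assumptions: the PCA basis is taken as given, and SciPy's normaltest (which combines skewness and kurtosis into a chi-squared statistic) stands in for the paper's $\chi^2$ test; the significance level and function names are illustrative.

```python
import numpy as np
from scipy import stats

def learn_class_model(descriptors, pca_mean, pca_basis, alpha=0.05):
    """Fit a Gaussian to the PCA projections of one region's multi-view
    descriptors; reject the region if the projections fail a normality test.

    pca_basis: (32, D) matrix holding the first 32 principal components.
    Returns (mean, covariance) or None if the region is not reliable.
    """
    # Project the unit-normalized descriptors into the 32-D feature space.
    feats = (descriptors - pca_mean) @ pca_basis.T
    # Per-component normality check; scipy's normaltest combines skew and
    # kurtosis into a chi-squared statistic (stand-in for the paper's test).
    for c in range(feats.shape[1]):
        if stats.normaltest(feats[:, c]).pvalue < alpha:
            return None  # region excluded as not reliable enough
    return feats.mean(axis=0), np.cov(feats, rowvar=False)
```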

In the case of Randomized Trees, we use a random set of features to build the decision trees, similar to (Lepetit and Fua, 2006). At each internal node, a set of tests involving comparisons between two descriptor components is randomly drawn. The test that results in maximum information gain is chosen as the splitting criterion at the node. Each leaf node stores an estimate of the conditional distribution over the classes given that a feature reaches that leaf. Furthermore, to reduce the variance of the conditional distribution estimates, multiple Randomized Trees are trained. It has been shown by (Lepetit and Fua, 2006) that the same tree structure can be reused, with good performance, for a different object by only updating the leaf nodes. Our experiments confirm this observation.
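A simplified sketch of how one internal node of such a tree could be trained: candidate tests compare two randomly chosen descriptor components, and the test with maximum information gain is kept. Function and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a class-label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def choose_node_test(descriptors, labels, n_candidates=100, rng=None):
    """Pick the binary test 'descriptor[a] < descriptor[b]' with maximum
    information gain among randomly drawn candidates (one tree node)."""
    rng = rng or np.random.default_rng()
    dim = descriptors.shape[1]
    h_parent = entropy(labels)
    best_gain, best_test = -1.0, None
    for _ in range(n_candidates):
        a, b = rng.integers(0, dim, size=2)
        go_left = descriptors[:, a] < descriptors[:, b]
        if go_left.all() or (~go_left).all():
            continue  # degenerate split, try another test
        h_split = (go_left.mean() * entropy(labels[go_left])
                   + (~go_left).mean() * entropy(labels[~go_left]))
        gain = h_parent - h_split
        if gain > best_gain:
            best_gain, best_test = gain, (a, b)
    return best_test
```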

3.3 The matching and pose estimation phase

At run-time, features are first extracted from the test image in the same manner as in the learning stage, and their descriptor vectors are computed. Matching is the task of finding groups of corresponding pairs between the regions extracted from the test image and the feature database that are consistent with both appearance and geometric constraints. This matching problem can be formulated as a classification problem. The goal is to construct a classifier such that the misclassification rate is low. We have tested two different classification techniques: PCA-based Bayesian classification and Randomized Trees (RDTs).

In the first method, the descriptors are projected into the feature space using PCA. We then use a Bayesian classifier to decide whether a test descriptor belongs to a view set class or not. Let $C = \{C_1, \ldots, C_N\}$ be the set of all classes representing the view sets, and let $F = \{f_1, \ldots, f_K\}$ denote the set of 2D features extracted from the test image. Using the Bayesian rule, the a posteriori probability $P(C_i \mid f_j)$ that a test feature $f_j$ belongs to the class $C_i$ is calculated as

$$P(C_i \mid f_j) = \frac{p(f_j \mid C_i)\,P(C_i)}{\sum_{k=1}^{N} p(f_j \mid C_k)\,P(C_k)}. \qquad (1)$$

For each test descriptor, the a posteriori probability of all classes is computed and the best candidate matches are selected by thresholding: let $m(f_j) = \{C_i \mid P(C_i \mid f_j) \geq T\}$ be the respective set of most probable potential matches. The purpose of this threshold is only to accelerate the run-time matching by not considering matching candidates with low probability; it is not crucial for the results of the pose estimation.
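For concreteness, here is a sketch of equation (1) with Gaussian class-conditional likelihoods, as suggested by the per-class mean and covariance learned in Section 3.2.2; the threshold value and function names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def candidate_matches(feat, class_means, class_covs, priors, T=0.01):
    """Evaluate the posterior P(C_i | f_j) of equation (1) for one test
    feature and keep the classes above the acceleration threshold T.

    class_means/class_covs: per-class Gaussian parameters learned off-line.
    priors: P(C_i), e.g. uniform over the view-set classes.
    """
    likelihoods = np.array([
        multivariate_normal.pdf(feat, mean=m, cov=c)
        for m, c in zip(class_means, class_covs)
    ])
    posterior = likelihoods * priors
    posterior /= posterior.sum()   # denominator of equation (1)
    return [i for i, p in enumerate(posterior) if p >= T]
```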

In the second method, using RDTs, the feature descriptor is dropped down each tree independently. The average of the distributions stored in all reached leaf nodes is used to classify the input feature via maximum a posteriori estimation.
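A minimal sketch of this classification step, assuming trees stored as nested dictionaries (internal nodes hold a two-component comparison test, leaves hold the learned class distribution); this data layout is an assumption made for illustration.

```python
import numpy as np

def classify_with_forest(feat, trees):
    """Drop a descriptor down each randomized tree and average the class
    distributions stored in the reached leaves; return the MAP class."""
    dists = []
    for root in trees:
        node = root
        while 'dist' not in node:          # descend until a leaf is reached
            a, b = node['test']
            node = node['left'] if feat[a] < feat[b] else node['right']
        dists.append(node['dist'])
    avg = np.mean(dists, axis=0)           # average leaf distributions
    return int(np.argmax(avg)), avg        # maximum a posteriori estimate
```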

Once initial matches are established, the respective visibility maps are used as geometric consistency constraints to find subsets of $N \geq 3$ candidate matches for pose estimation, as described in (Najafi et al., 2006). This way the search space is further constrained compared to a plain or "exhaustive" RANSAC, where all pairs of candidate matches are examined. Furthermore, contextual information, such as flexible feature templates (Li et al., 2005) or shape context (Belongie et al., 2002), can be used to impose further local geometric constraints.

In the case of using RDTs for classification, the estimated pose can further be used to update the decision trees to the actual viewing conditions at run-time (Boffy et al., 2006). Our experiments showed that this can significantly improve the matching performance and make it more robust to illumination changes, such as cast shadows or reflections.

In our experiments, the RDT-based classification has shown a slightly better classification rate than the PCA-based one. Our non-optimized implementation needs about 200 ms for the PCA-based and about 120 ms for the RDT-based approach.


Figure 1: Comparison of pose estimation results against ground truth. The solid red line in each plot represents the true value of one of the three coordinates (X, Y and Z, in mm) of the camera position, and the dashed blue line depicts the value estimated by our detection algorithm for each frame of a long sequence which includes strong viewpoint and scale changes as well as occlusions (bottom row).

To evaluate the robustness and accuracy of our detection system, we used long sequences taken of a control panel. Each frame was then used separately as an input image to detect and estimate the pose of the panel. Fig. 1 depicts the pose estimation results compared to the ground truth data, created by tracking the object using RealViz™ software. The median error is about 2.0 cm in camera position and about 10 degrees in camera rotation.

4 VISUAL TRACKING

4.1 Theoretical Background

In general, the CAD models used in industry are stored as polygon meshes, i.e. sets of vertices and polygons defining the 3D shape of the objects. These meshes are made of simple convex polygons and, for most industrial 3D models, of triangles. Consequently, it is legitimate to consider that the image contains the projection of a piecewise-planar environment, since every triangular face defines a plane. For each plane in the scene, there exists a $(3\times3)$ homography matrix $\mathbf{G}$ that links the coordinates $\mathbf{p}$ of a point in a reference image $\mathcal{I}^*$ to its corresponding point $\mathbf{q}$ in the current image $\mathcal{I}$. We suppose that the "image constancy assumption" is verified, i.e., after an image intensity normalization (which makes the images insensitive to linear illumination changes), we have: $\mathcal{I}^*(\mathbf{p}) = \mathcal{I}(\mathbf{q})$. The homography matrix is defined up to a scale factor; in order to fix the scale, we choose $\det(\mathbf{G}) = 1$. Let $w(\mathbf{G}) : \mathbb{R}^2 \to \mathbb{R}^2$ be the coordinate transformation (a warping) such that $\mathbf{q} = w(\mathbf{G})(\mathbf{p})$.

The camera displacement between two views can be represented by a $(4\times4)$ matrix $\mathbf{T} \in \mathbb{SE}(3)$ (the Special Euclidean group):

$$\mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix},$$

where $\mathbf{R}$ is the rotation matrix and $\mathbf{t}$ the translation vector of the camera between the two views. The 2D projective transformation $\mathbf{G}$ is similar to a matrix $\mathbf{H}$ that depends on $\mathbf{T}$: $\mathbf{G}(\mathbf{T}) = \mathbf{K}\,\mathbf{H}(\mathbf{T})\,\mathbf{K}^{-1}$, where $\mathbf{K}$ is the upper triangular matrix containing the camera intrinsic parameters. The matrix $\mathbf{H}(\mathbf{T})$ can be written as:

$$\mathbf{H}(\mathbf{T}) = \frac{\mathbf{R} + \mathbf{t}\,\mathbf{n}^{*\top}}{\sqrt[3]{1 + \mathbf{t}^\top \mathbf{R}\,\mathbf{n}^*}},$$

where $\mathbf{n}^*$ is the normal vector of the plane scaled such that $\|\mathbf{n}^*\| = 1/d^*$, with $d^*$ the distance of the plane to the origin of the reference frame.

Now, let $\mathbf{A}_i$, with $i \in \{1, \ldots, 6\}$, be a basis of the Lie algebra $\mathfrak{se}(3)$ (the Lie algebra associated with the Lie group $\mathbb{SE}(3)$). Any matrix $\mathbf{A} \in \mathfrak{se}(3)$ can be written as a linear combination of the matrices $\mathbf{A}_i$:

$$\mathbf{A}(\mathbf{x}) = \sum_{i=1}^{6} x_i \mathbf{A}_i,$$

where $\mathbf{x} = [x_1, \ldots, x_6]^\top$ and $x_i$ is the $i$-th element of the base field. The exponential map links the Lie algebra to the Lie group, $\exp : \mathfrak{se}(3) \to \mathbb{SE}(3)$. Consequently, a homography matrix $\mathbf{H}$ is a function of $\mathbf{T}$ that can be locally parameterized as: $\mathbf{H}(\mathbf{T}(\mathbf{x})) = \mathbf{H}(\exp(\mathbf{A}(\mathbf{x})))$.
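To make the parameterization concrete, the following sketch builds a basis of $\mathfrak{se}(3)$, evaluates $\mathbf{T}(\mathbf{x}) = \exp(\mathbf{A}(\mathbf{x}))$ with SciPy's matrix exponential, and forms the normalized plane homography; the ordering and signs of the basis matrices follow a common convention and are not necessarily the paper's.

```python
import numpy as np
from scipy.linalg import expm

# Basis A_1..A_6 of se(3): three translation and three rotation generators.
SE3_BASIS = []
for k in range(3):                       # translation generators
    A = np.zeros((4, 4)); A[k, 3] = 1.0
    SE3_BASIS.append(A)
for (i, j) in [(1, 2), (2, 0), (0, 1)]:  # rotation generators (skew-symmetric)
    A = np.zeros((4, 4)); A[i, j] = -1.0; A[j, i] = 1.0
    SE3_BASIS.append(A)

def T_of_x(x):
    """Local parameterization T(x) = exp(A(x)), A(x) = sum_i x_i A_i."""
    A = sum(xi * Ai for xi, Ai in zip(x, SE3_BASIS))
    return expm(A)

def H_of_T(T, n_star):
    """Plane homography under displacement T, normalized to det(H) = 1.
    n_star is the plane normal scaled by the inverse plane distance."""
    R, t = T[:3, :3], T[:3, 3]
    H = R + np.outer(t, n_star)
    # det(R + t n*^T) = 1 + t^T R n*, so dividing by its cube root
    # enforces det(H) = 1 as in the formula above.
    return H / np.cbrt(np.linalg.det(H))
```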

4.2 Problem statement

Let $q = n\,m$ be the total number of pixels in some image region corresponding to the projection of a planar patch of the scene in the image $\mathcal{I}^*$. This image region is called the reference template. To track the template in the current image $\mathcal{I}$ is to find the transformation $\mathbf{T} \in \mathbb{SE}(3)$ that warps each pixel of that region in the image $\mathcal{I}^*$ onto its correspondent in the image $\mathcal{I}$, i.e. to find $\mathbf{T}$ such that, $\forall i$:

$$\mathcal{I}\big(w(\mathbf{G}(\mathbf{T}))(\mathbf{p}_i)\big) = \mathcal{I}^*(\mathbf{p}_i).$$

For each plane in the scene, we consider its corresponding template and its homography; here, for the sake of simplicity, we describe the computations for a single plane. Knowing an approximation $\widehat{\mathbf{T}}$ of the transformation $\mathbf{T}$, the problem is to find the incremental transformation $\mathbf{T}(\mathbf{x})$ such that the difference between the current image region $\mathcal{I}$ warped with $\widehat{\mathbf{T}}\,\mathbf{T}(\mathbf{x})$ and the reference image $\mathcal{I}^*$ is null, i.e. $\forall i \in \{1, \ldots, q\}$:

$$y_i(\mathbf{x}) = \mathcal{I}\big(w(\mathbf{G}(\widehat{\mathbf{T}}\,\mathbf{T}(\mathbf{x})))(\mathbf{p}_i)\big) - \mathcal{I}^*(\mathbf{p}_i) = 0.$$

Let $\mathbf{y}(\mathbf{x})$ be the $(q \times 1)$ vector containing the image differences: $\mathbf{y}(\mathbf{x}) = [y_1(\mathbf{x}), \ldots, y_q(\mathbf{x})]^\top$. The problem then consists of estimating $\mathbf{x} = \widetilde{\mathbf{x}}$ verifying the system:

$$\mathbf{y}(\widetilde{\mathbf{x}}) = \mathbf{0}. \qquad (2)$$

The system is non-linear and needs to be solved iteratively.

4.3 The ESM algorithm

Let $\mathbf{J}(\mathbf{x})$ be the $(q \times 6)$ Jacobian matrix describing the variation of the vector $\mathbf{y}(\mathbf{x})$ with respect to the Cartesian transformation parameters $\mathbf{x}$, and let $\mathbf{M}(\mathbf{x}_1, \mathbf{x}_2)$ be the $(q \times 6)$ matrix verifying, $\forall (\mathbf{x}_1, \mathbf{x}_2) \in \mathbb{R}^6 \times \mathbb{R}^6$: $\mathbf{M}(\mathbf{x}_1, \mathbf{x}_2) = \nabla_{\mathbf{x}_1}\big(\mathbf{J}(\mathbf{x}_1)\,\mathbf{x}_2\big)$. The Taylor series of $\mathbf{y}(\mathbf{x})$ about $\mathbf{x} = \widetilde{\mathbf{x}}$ can be written:

$$\mathbf{y}(\widetilde{\mathbf{x}}) \approx \mathbf{y}(\mathbf{0}) + \Big(\mathbf{J}(\mathbf{0}) + \frac{1}{2}\,\mathbf{M}(\mathbf{0}, \widetilde{\mathbf{x}})\Big)\,\widetilde{\mathbf{x}} = \mathbf{0}. \qquad (3)$$

Although the optimization problem can be solved using first-order methods, we use an efficient second-order minimization algorithm (Benhimane and Malis, 2006). Indeed, with little extra computation, it is possible to obtain stable and locally quadratic convergence when the second-order approximation is verified: for $\mathbf{x} = \widetilde{\mathbf{x}}$, the system (2) can then be written:

$$\mathbf{y}(\widetilde{\mathbf{x}}) \approx \mathbf{y}(\mathbf{0}) + \frac{1}{2}\,\big(\mathbf{J}(\widetilde{\mathbf{x}}) + \mathbf{J}(\mathbf{0})\big)\,\widetilde{\mathbf{x}} = \mathbf{0},$$

and the update $\widetilde{\mathbf{x}}$ of the second-order minimization algorithm can be computed as:

$$\widetilde{\mathbf{x}} = -2\,\big(\mathbf{J}(\mathbf{0}) + \mathbf{J}(\widetilde{\mathbf{x}})\big)^{+}\,\mathbf{y}(\mathbf{0}).$$

Thanks to the simultaneous use of the reference and the current image derivatives, and thanks to the Lie algebra parameterization, the ESM algorithm improves the convergence frequency and the convergence rate of standard first-order algorithms (see (Benhimane and Malis, 2006) for more details).
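A sketch of a single ESM update, assuming the two Jacobians and the error vector have already been assembled; the pseudo-inverse is applied via a least-squares solve.

```python
import numpy as np

def esm_update(J_ref, J_cur, y0):
    """One ESM step: solve y(0) + (1/2)(J(0) + J(x~)) x = 0 in the
    least-squares sense, i.e. x = -2 (J(0) + J(x~))^+ y(0).

    J_ref: (q, 6) Jacobian using the reference image gradient, J(x~).
    J_cur: (q, 6) Jacobian using the current image gradient, J(0).
    y0:    (q,)  current intensity differences, y(0).
    """
    J_esm = J_cur + J_ref
    x, *_ = np.linalg.lstsq(J_esm, -2.0 * y0, rcond=None)
    return x  # se(3) increment; compose as T_hat <- T_hat @ exp(A(x))
```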

4.4 Tracking multiple faces and speeding up the system

In the formulation of the ESM algorithm, it is implicitly assumed that all reference faces come from the same image and were extracted in the same coordinate system. In practice this is not true, since two faces can be seen for the first time under different poses. This problem is solved by introducing into the algorithm the pose $\widetilde{\mathbf{T}}_j$ under which the face $j$ was extracted for the first time. This leads to a system different from the one described in equation (2); it becomes:

$$\sum_j \mathbf{y}_j(\widetilde{\mathbf{x}}) = \mathbf{0},$$

where $\mathbf{y}_j(\mathbf{x})$ is the $(q_j \times 1)$ vector containing the image differences of the face $j$, $\mathbf{y}_j(\mathbf{x}) = [y_{j1}(\mathbf{x}), \ldots, y_{jq_j}(\mathbf{x})]^\top$, with, $\forall i \in [1, q_j]$:

$$y_{ji}(\mathbf{x}) = \mathcal{I}\big(w(\mathbf{G}(\widehat{\mathbf{T}}\,\mathbf{T}(\mathbf{x})\,\widetilde{\mathbf{T}}_j^{-1}))(\mathbf{p}_i)\big) - \mathcal{I}^*(\mathbf{p}_i).$$

This modification was proposed in (Ladikos et al., 2007).

In order to speed up the tracking process, the templates are repeatedly reduced in size by a factor of two to create a stack of templates at different scales. The optimization starts at the highest scale level (lowest resolution), and the cost function is minimized on this level until convergence is achieved. This is continued until the lowest scale level (highest resolution) is reached.
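A sketch of this coarse-to-fine scheme; `minimize_on_level` stands for one run of the ESM minimization on a given pyramid level and is an assumed callback, not part of the paper's code.

```python
import cv2

def build_template_pyramid(template, levels=3):
    """Stack of templates, each half the size of the previous one."""
    pyramid = [template]
    for _ in range(levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid  # pyramid[-1] has the lowest resolution

def track_coarse_to_fine(pyramid, minimize_on_level, T_hat):
    """Run the minimization from the coarsest to the finest level,
    warm-starting each level with the pose found on the previous one."""
    for template in reversed(pyramid):   # lowest -> highest resolution
        T_hat = minimize_on_level(template, T_hat)
    return T_hat
```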

5 TEMPLATE MANAGEMENT

5.1 Simplifying the 3D model

The ESM algorithm can be used to track any industrial object, since every model can be described by a set of triangular faces (each face defining a plane). However, the number of faces grows very quickly with the model complexity. For instance, the model shown in Figure 2(a) is described by a mesh of more than 20,000 faces, while the number of planes that can usefully be tracked is much smaller. In addition, given the camera position, the visibility check of that many faces cannot be processed in real-time.

For these reasons, we transform the original CAD model in order to simplify the triangulated mesh, so that the largest connected sets of coplanar faces are merged into polygons. This procedure reduces the number of faces to work on by around 75% in our examples.

5.2 Building the BSP trees

Using the simplified model, we construct a 3D Binary Space Partitioning tree (BSP tree) that allows us to determine the visible surfaces of the considered object given the camera position. The BSP tree is static and image-size invariant, and it depends only on the 3D model, so it can be processed off-line once and for all. Building the BSP tree is fast, requiring just O(N) time, where N is the number of faces in the model. For more details about BSP trees, refer to (Foley et al., 1990; Schneider and Eberly, 2002). From any relative position between the camera and the object, the visible faces can be automatically determined. In fact, the BSP tree is used to determine the order of all faces of the model with respect to the optical axis toward the camera. Back-faces and too-small faces are discarded because they do not provide usable information to the tracking process. From the z-ordered sequence of faces, we obtain a list of those which are not occluded by others. Once a face is determined to be occluded, it never has to be checked again; so even though the worst case of this occlusion culling is $O(N^2)$, the amortized cost is expected to be around O(N), although we have not proved this.
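A simplified BSP sketch to illustrate the idea: the construction recursively partitions the faces by the supporting plane of a chosen splitter, and a traversal relative to the camera position yields the faces in z-order. A full implementation would split polygons that straddle a plane; here straddlers are simply assigned by the side of their centroid.

```python
import numpy as np

class BSPNode:
    """Node of a BSP tree built from the mesh faces."""
    def __init__(self, face, front=None, back=None):
        self.face, self.front, self.back = face, front, back

def face_plane(face):
    """Supporting plane (unit normal n, offset d) of a polygonal face,
    where each face is a list of 3D vertices (np arrays)."""
    v0, v1, v2 = face[0], face[1], face[2]
    n = np.cross(v1 - v0, v2 - v0)
    n = n / np.linalg.norm(n)
    return n, float(np.dot(n, v0))

def build_bsp(faces):
    """Recursive construction: the first face's plane partitions the rest.
    Simplification: straddling faces go by their centroid's side instead
    of being split, as a full implementation would do."""
    if not faces:
        return None
    splitter, rest = faces[0], faces[1:]
    n, d = face_plane(splitter)
    side = lambda f: np.mean([np.dot(n, v) - d for v in f])
    front = [f for f in rest if side(f) >= 0]
    back = [f for f in rest if side(f) < 0]
    return BSPNode(splitter, build_bsp(front), build_bsp(back))

def back_to_front(node, eye, out):
    """Visit faces in back-to-front z-order w.r.t. the camera position
    'eye'; the tracker walks this order to discard back-faces and
    occluded faces."""
    if node is None:
        return
    n, d = face_plane(node.face)
    near, far = ((node.front, node.back) if np.dot(n, eye) - d > 0
                 else (node.back, node.front))
    back_to_front(far, eye, out)
    out.append(node.face)
    back_to_front(near, eye, out)
```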


To deal with partial occlusions, a face is considered occluded if more than 5% of its area is covered by other faces. This calculation is done by a 2D polygon intersection test, using 2D BSP trees, which makes it possible to determine the intersection in just $O(E_1 + E_2)$, where $E_1$ and $E_2$ are the numbers of edges of the two faces to intersect.

5.3 Template update and database management

The automatic extraction of the trackable faces leads to the question whether this can also be done during the tracking procedure, in order to update the faces and increase the number of images that can be tracked from the reference images. The naive approach of simply re-extracting the faces leads to error integration and severe drift. To prevent this, we use a two-step approach. When a face $j$ is seen for the first time, it is stored in a database. For each face there exist two templates: the one in the database (called the reference template), referring to the first time of extraction, and the current one, referring to the last time of extraction. The face is first tracked with the current template, which leads to camera pose parameters $\mathbf{x}_1$. Then, it is tracked with the reference template starting at $\mathbf{x}_1$, which leads to camera pose parameters $\mathbf{x}_2$. The second step is to examine the difference in camera motion $\|\mathbf{x}_1 - \mathbf{x}_2\|$: if it is below a given threshold, the reference template in the database is updated by the current one (see (Matthews et al., 2003) for more details). The procedure of determining the visible faces and updating the templates is executed every 0.5 seconds. Its processing time is about 40 ms; therefore, one image has to be dropped each time (if the acquisition framerate is 25 fps).
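A sketch of this two-step update, with `track` standing for one run of the ESM tracker on a given template; the threshold value is illustrative, since the text only specifies "a given threshold".

```python
import numpy as np

def track_and_maybe_update(face_id, database, track, threshold=1e-2):
    """Two-step template update guarding against drift.

    track(template, x_init) runs the tracker and returns pose parameters.
    database[face_id] holds (reference_template, current_template).
    """
    ref_tpl, cur_tpl = database[face_id]
    x1 = track(cur_tpl, None)     # 1) track with the current template
    x2 = track(ref_tpl, x1)       # 2) re-track with the reference template
    if np.linalg.norm(x1 - x2) < threshold:
        # Motions agree: safe to replace the reference template
        # by the current one without integrating drift.
        database[face_id] = (cur_tpl, cur_tpl)
    return x2
```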

6 EXPERIMENTAL RESULTS

In this section, we describe experiments showing the behavior of the selected detection and tracking algorithms. The experiments were done in a real industrial scenario (see Figure 2). Given the CAD model of an industrial machine (see Figure 2(a)) and a few calibrated images (see Figure 2(b)), the objective is to estimate, in real-time, the pose of an HMD camera worn by a worker observing the scene. This makes it possible to overlay virtual objects containing useful information onto the scene in order to support the training and maintenance processes for the workers. In this experiment, we overlay a colored 3D model of the object (see Figure 2(c)). The acquisition framerate is 25 frames per second, and the image dimensions are (640×480). The object used is poor in texture and composed of a material with non-Lambertian surfaces (and therefore highly sensitive to the lighting conditions). This demonstrates the benefit of the selected approaches and the intrinsic robustness of the algorithms to illumination changes.

Figure 2: The industrial machine: (a) the 3D model; (b) the keyframe; (c) the overlay.

The method proposed in this paper makes it possible to handle large interframe displacements and to overlay the CAD model of the machine more accurately on the acquired images along the whole sequence. In Figure 3, despite the lack of texture on the considered object, we can see that the camera pose has been accurately estimated, even when very few parts of the object are visible. Thanks to the fast convergence rate and the high convergence frequency of the second-order minimization algorithm, and thanks to the template management algorithm, the augmentation task is performed correctly. Thanks to the tracking initialization, which was invoked each time the object left the camera field of view, the pose estimation given by the complete system was available whenever the object was in the image, and it was precise enough to nicely augment the observed scene with virtual information (see the video provided as additional material).

7 CONCLUSION

Real-time 3D tracking of complex machines in industrial environments has been shown to be a challenging task. While frame-based tracking systems can achieve high precision using temporal continuity constraints, they need an initialization and usually lack robustness against occlusions, motion blur and abrupt camera motions. On the other hand, object detection systems using wide baseline matching techniques tend to be robust, but not accurate and fast enough for real-time applications. This paper presents a fully automated system to overcome these problems by combining a real-time tracking system with a fast object detection system for automatic initialization. This allows us to get robustness from object detection, and at the same time accuracy from recursive tracking. For the automatic initialization, we build a scalable and compact representation of the object of interest during a training phase based on statistical learning techniques, in order to achieve speed and reliability at run-time by imposing both geometric and photometric consistency constraints.


Figure 3: Excerpts from the test sequence showing the result of the proposed tracking system and the quality of the augmentation achieved. The trackable regions selected by the template management algorithm are the ones with red borders.

Furthermore, we propose a tracking approach along with a novel template management algorithm which makes it more scalable and applicable to complex objects under severe illumination conditions. Experimental results showed that the system is able to successfully detect and accurately track industrial machines in real-time.

REFERENCES

Baker, S. and Matthews, I. (2004). Lucas-Kanade 20 years on: a unifying framework. IJCV, 56(3):221–255.

Baker, S., Patil, R., Cheung, K., and Matthews, I. (2004). Lucas-Kanade 20 years on: Part 5. Technical Report CMU-RI-TR-04-64, Robotics Institute, CMU.

Bay, H., Tuytelaars, T., and Van Gool, L. J. (2006). SURF: Speeded up robust features. In ECCV, pages 404–417.

Belongie, S., Malik, J., and Puzicha, J. (2002). Shape matching and object recognition using shape contexts. PAMI, 24(4):509–522.

Benhimane, S. and Malis, E. (2006). Integration of Euclidean constraints in template-based visual tracking of piecewise-planar scenes. In IROS, pages 1218–1223.

Boffy, A., Tsin, Y., and Genc, Y. (2006). Real-time feature matching using adaptive and spatially distributed classification trees. In BMVC.

Buenaposada, J. and Baumela, L. (2002). Real-time tracking and estimation of planar pose. In ICPR.

Cobzas, D. and Jagersand, M. (2004). 3D SSD tracking from uncalibrated video. In Proc. of Spatial Coherence for Visual Motion Analysis, in conjunction with ECCV.

Comaniciu, D. and Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. PAMI, 24(5):603–619.

Foley, J. D., van Dam, A., Feiner, S. K., and Hughes, J. F. (1990). Computer Graphics: Principles and Practice. Addison-Wesley Longman Publishing Co., Inc.

Hager, G. and Belhumeur, P. (1998). Efficient region tracking with parametric models of geometry and illumination. PAMI, 20(10):1025–1039.

Ke, Y. and Sukthankar, R. (2004). PCA-SIFT: a more distinctive representation for local image descriptors. In CVPR, pages 506–513.

Ladikos, A., Benhimane, S., and Navab, N. (2007). A real-time tracking system combining template-based and feature-based approaches. In VISAPP.

Lepetit, V. and Fua, P. (2006). Keypoint recognition using randomized trees. PAMI, 28(9):1465–1479.

Li, Y., Tsin, Y., Genc, Y., and Kanade, T. (2005). Object detection using 2D spatial ordering constraints. In CVPR.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110.

Matas, J., Chum, O., Urban, M., and Pajdla, T. (2002). Robust wide baseline stereo from maximally stable extremal regions. In BMVC.

Matthews, I., Ishikawa, T., and Baker, S. (2003). The template update problem. In BMVC.

Mikolajczyk, K. and Schmid, C. (2004). Scale and affine invariant interest point detectors. IJCV, 60(1):63–86.

Mikolajczyk, K. and Schmid, C. (2005). A performance evaluation of local descriptors. PAMI, 27(10).

Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., and Van Gool, L. (2005). A comparison of affine region detectors. IJCV, 65(1-2):43–72.

Najafi, H., Genc, Y., and Navab, N. (2006). Fusion of 3D and appearance models for fast object detection and pose estimation. In ACCV, pages 415–426.

Schneider, P. J. and Eberly, D. (2002). Geometric Tools for Computer Graphics. Elsevier Science Inc.

Sepp, W. and Hirzinger, G. (2003). Real-time texture-based 3-D tracking. In DAGM Symposium, pages 330–337.

Tuytelaars, T. and Van Gool, L. (2004). Matching widely separated views based on affine invariant regions. IJCV, 59(1):61–85.