

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 10, PP. 1790–1803, OCTOBER 2009

A Probabilistic Framework for 3D Visual Object Representation

Renaud Detry Nicolas Pugeault Justus Piater

Abstract—We present an object representation framework that encodes probabilistic spatial relations between 3D features and organizes these features in a hierarchy. Features at the bottom of the hierarchy are bound to local 3D descriptors. Higher-level features recursively encode probabilistic spatial configurations of more elementary features. The hierarchy is implemented in a Markov network. Detection is carried out by a belief propagation algorithm, which infers the pose of high-level features from local evidence and reinforces local evidence from globally consistent knowledge, effectively producing a likelihood for the pose of the object in the detection scene. We also present a simple learning algorithm that autonomously builds hierarchies from local object descriptors. We explain how to use our framework to estimate the pose of a known object in an unknown scene. Experiments demonstrate the robustness of hierarchies to input noise, viewpoint changes, and occlusions.

Index Terms—Computer vision, 3D object representation, pose estimation, Nonparametric Belief Propagation.


1 INTRODUCTION

THE merits of part-based and hierarchical approaches to object modeling have often been put forward in the vision community [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Part-based models typically separate structure from appearance, which allows them to deal with variability separately in each modality. A hierarchy of parts takes this idea further by introducing scale-dependent variability: small part configurations can be tightly constrained, while wider associations can allow for more variability. Furthermore, part-based models not only allow for the detection and localization of an object, but also for the parsing of its constituent parts. They lend themselves to part sharing and reuse, which should help in overcoming the problem of storage size and detection cost in large object databases. Finally, these models allow not only for bottom-up inference of object parameters based on features detected in images, but also for top-down inference of image-space appearance based on object parameters.

A large body of the object modeling literature focuses on modeling the 2D projections of a 3D object. A major issue with this approach is that all variations introduced by projective geometry (geometrical transformations, self-occlusions) have to be robustly captured and handled by the model. In the past few years, modeling objects directly in 3D has become increasingly popular [11], [12], [13], [14], [15], [16]. The main advantage of these methods lies in their natural ability to handle projective transformations and self-occlusions.

• R. Detry ([email protected]) and J. Piater are with theINTELSIG Laboratory, Univ. of Liege, Belgium.

• N. Pugeault is with the Cognitive Vision Lab, The Maersk Mc-KinneyMoller Inst., Univ. of Southern Denmark, Denmark.

Manuscript received 16 Aug. 2008; revised 18 Dec. 2008; accepted 5 Mar.2009; published online 17 Mar. 2009.Recommended for acceptance by Q. Ji, A. Torralba, T. Huang, E. Sudderth,and J. Luo.This is an author postprint.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log NumberTPAMISI-2008-08-0529.Digital Object Identifier no. 10.1109/TPAMI.2009.64.


The main contribution of this paper is a framework that encodes the 3D geometry and visual appearance of an object into a part-based model, and mechanisms for autonomous learning and probabilistic inference of the model. Our representation combines local appearance and 3D spatial relationships through a hierarchy of increasingly expressive features. Features at the bottom of the hierarchy are bound to local 3D visual perceptions called observations. Features at other levels represent combinations of more elementary features, encoding probabilistic relative spatial relationships between their children. The top level of the hierarchy contains a single feature which represents the whole object.

The hierarchy is implemented in a Markov random field, where features correspond to hidden variables, and spatial relationships define pairwise potentials. To detect instances of a model in a scene, observational evidence is propagated throughout the hierarchy by probabilistic inference mechanisms, leading to one or more consistent scene interpretations. Thus, the model is able to suggest a number of likely poses for the object, a pose being composed of a 3D world location and a 3D world orientation. The inference process follows a nonparametric belief propagation scheme [17] which uses importance sampling for message products. The model is bound to no particular learning scheme. In this paper, we present an autonomous learning method that builds hierarchies in a bottom-up fashion.

Learning and detection algorithms reason directly on sets of local 3D visual perceptions which we will refer to as (input) observations. These observations should represent visual input in terms of 3D descriptors, i.e. points characterized by a 3D location, a local appearance, and possibly an orientation. Prior research on model detection directly in 3D data can be found e.g. in the work of Rogers et al. [13].


Fig. 1. ECV observations from Kruger et al. [18]

Fig. 2. Pose estimation system. (Diagram blocks: early vision, learning, representation (incl. detection), 3D object pose, to action/problem solving.)


In this paper, we evaluate our object representation on input observations produced with the 3D Early-Cognitive-Vision (ECV) system of Kruger et al. [18], [19]. From stereo imagery, this system extracts pixel patches along image contours, and computes 3D patch positions and orientations by stereopsis across image pairs. The resulting reconstruction consists of a set of short edge segments in 3D space (Fig. 1), which we will refer to as ECV observations. Each ECV observation corresponds to a patch of about 25 square millimeters of object surface. Since image patches are extracted on edges, each ECV observation can be characterized with an appearance descriptor composed of the two colors found on the sides of the edge. An ECV reconstruction of a scene typically contains 1000–5000 observations.

The task on which we evaluate our model is object pose estimation. Fig. 2 illustrates the pose estimation process. Observations are provided by the ECV system. The learning algorithm builds a hierarchy from a set of observations from a segmented object; the hierarchy is then used to recover the pose of the object in a cluttered scene. Preliminary results appeared in conference and workshop proceedings [20], [21], [22]. We note that even though we concentrate our evaluation on 3D data from Kruger et al., our system can in principle be applied to other 3D sources, such as dense stereo or range data.

We emphasize that we intend to develop generative representations that allow for detection and localization of known objects within highly cluttered scenes, as opposed to object classification, which is best achieved using discriminative models.

2 RELATED WORK

Recent work has shown many successful implementations of the hierarchical approach to object modeling in 2D, with both bottom-up [7], [9], [20] and top-down [8] unsupervised learning of object parts and structure. These generally aim at classifying objects into generic categories, and put a strong accent on learning in order to capture intra-class variability. Intra-class variability is very important for these methods also because it allows them to capture the affine transformations that the object undergoes when projected to 2D images. Our learning procedure is similar in spirit, although it requires much less sophistication. Since we work directly in 3D, projection-related issues are handled by the method that produces the observations, with e.g. stereopsis when using ECV observations. Our system is thus intrinsically robust to problems like the decreasing apparent 2D size of an object when moved away from the camera, and projective deformations do not need to be learned. Compared to these 2D methods, the most distinguishing aspects of our approach are its explicit 3D support and a unified probabilistic formalization through a Markov network.

Part-based models have proved very efficient for representing articulated objects in 2D [10], [23], [24], [25], [26] or 3D [13], [27], [28], and matching them in 2D images [10], [23], [24], [25], [26], [27], [28] or 3D range data [13]. Articulated objects are often formalized through a graphical model, and probabilistic inference algorithms such as belief propagation [29] or its variants are widely used for detection [23], [24], [27], [28]. The model topology, i.e. the parts, is typically defined by hand. Model parameters (compatibility and observation potentials) are also often defined by hand [10], [13], [24], although it has been shown that compatibility potentials can be learned from motion [26], [27], and observation potentials (image likelihoods) may emerge from annotated images [23], [26].

Despite the similar formalism, the issue addressed by these methods is rather different from ours. The work cited in the previous paragraph seeks unique matches of relatively high-level entities (body parts, ...), and represents loose, one-to-one relations between them. In our work, a part is an abstract concept that may have any number of instances; a parent-child potential thus encodes a one-to-many relationship, as a set of relationships between the parent part and many child part instances. Furthermore, our parts reach a much finer level of granularity; a part can for instance correspond to a "red-ish world-surface patch" as small as 5 × 5 mm.

Coughlan et al. [30] have demonstrated a 2D part-based object model where parts are relatively low-level (e.g. short 2D edge segments). However, in the same way as the articulated models mentioned above, potentials encode one-to-one relationships between neighboring parts.



When targeting 3D objects, a crucial model property is viewpoint invariance, i.e. the ability to encode object information independently of the viewpoint. Although viewpoint invariance has been achieved in 2D [31], an increasingly popular means of achieving it is to represent object geometry directly in 3D.

Very interesting work [11], [12], [32] takes the approach of representing an object with a large set of local affine-invariant descriptors and the relative 3D spatial relationships between the corresponding surface patches. Other methods organize local appearance descriptors on a 3D shape model, obtained by CAD [16] or 3D homography [14]. To detect an object in an image, these methods start with an appearance-based matching of image descriptors with model descriptors. In a second phase, global optimization is used to evaluate the geometrical consistency of the appearance-based matches using the geometrical information contained in the model, and to compute a 2D bounding box [14] or a 3D pose [16]. Rothganger et al. [11] do not speak of 3D pose in their results, although it seems obvious that a 3D pose is implicitly computed during detection.

Instead of using a precise 3D shape model, objects have also successfully been represented by a set of near-planar parts connected through their mutual homographic transformations [15]. In this context, a part is a large region of the object represented with many local invariant descriptors.

Compared to our approach, the preceding 3D methods represent an object as a whole, and do not intrinsically allow for parsing an object into parts. Another distinguishing aspect of our work is its probabilistic formalization through a graphical model. Finally, the 3D methods cited above work with image data, and encompass the entire reasoning process from image pixels to object pose. In our case, a large part of the work is done by the upstream system that produces 3D observations, e.g. the ECV system or range scanning. This allows us to concentrate on the encoding of 3D geometry and appearance, and yields a framework that can cope with different sources of 3D observations.

The performance of our system is strongly influenced by the characteristics of the 3D observations we use. While the 3D methods cited above rely on affine-invariant descriptors, for which textured objects are ideal, the ECV system used in this paper extracts surface edges, and thus strongly prefers objects with clear edges and little texture.

For prior work on top-down parsing of scenes and objects, we refer the reader to the work of Lee and Mumford [5], and of Tu et al. [6]. We note that our representation and methods are compatible [33] with the ideas developed by Lee and Mumford.

Fig. 3. On the right: an example of a hierarchy for the traffic sign shown in the bottom-left corner. X_1 through X_3 are primitive features; each of these is linked to an observed variable Y_i. X_4 through X_6 are meta-features. Top-left: ECV observations of the traffic sign. (Diagram labels: bridge pattern, triangular frame, traffic sign.)

3 HIERARCHICAL MODEL

Our object model consists of a set of generic features organized in a hierarchy. Features that form the bottom level of the hierarchy, referred to as primitive features, are bound to visual observations. The rest of the features are meta-features, which embody relative spatial configurations of more elementary features, either meta or primitive.

A feature can intuitively be associated to a "part" of an object, i.e. a generic component instantiated once or several times during a "mental reconstruction" of the object. At the bottom of the hierarchy, primitive features correspond to local parts that each may have many instances in the object. Climbing up the hierarchy, meta-features correspond to increasingly complex parts defined in terms of constellations of lower parts. Eventually, parts become complex enough to satisfactorily represent the whole object. In this paper, a primitive feature represents a class of ECV observations of similar appearance, e.g. an ECV observation with colors close to red and white. Given the large number of observations produced by the ECV system, a primitive feature will usually have hundreds of instances in a scene.

Fig. 3 shows an example of a hierarchy for a traffic sign. Ignoring the nodes labeled Y_i for now, the figure shows the traffic sign as the combination of two features: a triangular frame (feature 5) and a bridge pattern (feature 4). The fact that the bridge pattern has to be in the center of the triangle to form the traffic sign is encoded in the links between features 4-6-5. The triangular frame is further encoded using a single (generic) feature: a short red-white edge segment (feature 3). The link between feature 3 and feature 5 encodes the fact that many short red-white edge segments (several hundreds of instances of feature 3, i.e. several hundreds of red-ish ECV observations) are necessary to form the triangular frame, and the fact that these edges have to be arranged along a triangle-shaped structure. During instantiation, the activation of a (single) feature (e.g. feature 3) will represent all its instances (the pose of hundreds of instances of feature 3).



We emphasize that for us, a "feature" is an abstract concept that may have any number of instances. The lower-level the feature, the larger generally the number of instances. Conversely, the higher-level the feature, the richer its spatio-appearance description (because it represents a combination of lower-level features), and thus the lower generally the number of its instances. The next section explains how instances are represented as one spatial probability density per feature, thereby avoiding specific model-to-scene correspondences.

3.1 Parametrization

Formally, the hierarchy is implemented in a Markov tree (Fig. 3). Features correspond to hidden nodes of the network. When a model is associated to a scene (during learning or instantiation), the pose of feature i in that scene is represented by the probability density function of a random variable X_i, effectively linking feature i to its instances. Random variables are thus defined over the pose space, which corresponds exactly to the Special Euclidean group SE(3) = R^3 × SO(3).

The structure of the hierarchy is reflected by the edge pattern of the network; each meta-feature is thus linked to its child features. As noted above, a meta-feature encodes the relationship between its children, which is done by recording the relative relationships between the meta-feature and each of its children. The relationship between a meta-feature i and one of its children j is parametrized by a compatibility potential function ψ_ij(X_i, X_j) which reflects, for any given relative configuration of feature i and feature j, the likelihood of finding these two features in that relative configuration. The potential between i and j will be denoted equivalently by ψ_ij(X_i, X_j) or ψ_ji(X_j, X_i). We only consider rigid-body, relative spatial configurations. A compatibility potential is thus equivalent to the spatial distribution of the child feature in a reference frame that matches the pose of the parent feature; a potential can be represented by a probability density defined on SE(3).

Finally, each primitive feature is linked to an observed variable Y_i. Observed variables are tagged with an appearance descriptor, called a codebook vector, that defines a class of observation appearance. In the case of ECV observations, a codebook vector is composed of two colors. The set of all codebook vectors forms a codebook that binds the object model to feature observations. The statistical dependency between a hidden variable X_i and its observed variable Y_i is parametrized by a potential ψ_ii(X_i, Y_i). We generally cannot observe meta-features; their observation potentials are thus uniform.

In the applications we are considering, observations are assumed constant throughout inference. Furthermore, the observations we are dealing with reflect directly the state of their corresponding hidden variable; ψ_ii(X_i, Y_i) thus encodes an identity transformation. Consequently, the notation for the statistical dependency between a hidden variable X_i and its observed variable Y_i is simplified into an observation potential φ_i(X_i), which corresponds to the spatial distribution of the input observations that hold an appearance descriptor resembling the codebook vector of Y_i. An observation potential φ_i(X_i) will also be referred to as evidence for X_i.

3.2 Instantiation

Model instantiation is the process of detecting instances of an object model in a scene. It provides pose densities for all features of the model, indicating where the learned object and its sub-parts are likely to be present.

Because a hierarchy is implemented with a Markov network, there is a clear separation between the model and the algorithms that make use of the model. Fundamentally, instantiation involves two operations:

1) Define priors (observation potentials) from input observations;

2) Propagate this information through the graph using an applicable inference algorithm.

During the definition of priors, each observation potential φ_i(X_i) is built, by means described in Section 4.2, from a subset of the input observations. The subset that serves to build the potential φ_i(X_i) linking X_i to Y_i is the subset of observations that hold an appearance descriptor that is close enough, in the appearance space, to the codebook vector associated to Y_i.

The inference algorithm we use to propagate information is currently the belief propagation (BP) algorithm [29], [34], [35], discussed in Section 5. BP works by exchanging messages between neighboring nodes. Each message carries the belief that the sending node has about the pose of the receiving node. Let us consider, for example, nodes X_3 and X_5 in the network of Fig. 3. Through the message that X_3 sends to X_5, X_3 probabilistically votes for all the poses of X_5 that are compatible with its own pose; this compatibility is defined through the compatibility potential ψ_5,3. Through this exchange of messages, each feature probabilistically votes for all possible object configurations consistent with its pose density. During inference, a consensus emerges among the available evidence, leading to one or more consistent scene interpretations. The system never commits to specific feature correspondences, and is thus robust to substantial clutter and occlusions. After inference, the pose likelihood of the whole object can be read out of the top feature; if the object is present twice in a scene, the top feature density should present two major modes.

4 DENSITY REPRESENTATION

We opted for a nonparametric computational encoding of density functions, for both random variables and potentials. The idea behind nonparametric methods is to represent a density simply by the samples we see from it.


Formally, a density is represented by a set of weighted samples called particles. The probability density in a region of space is given by the local density of the particles in that region. The continuous density function is accessed by assigning a kernel function to each particle, a technique generally known as kernel density estimation [36]. Evaluation is performed by summing the evaluations of all kernels. Sampling is performed by sampling from the kernel of a particle ℓ selected with probability p(ℓ = i) ∝ w_i, where w_i is the weight of particle i.
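For concreteness, here is a minimal numpy sketch of these two operations on a weighted particle set, using isotropic Gaussian kernels in R^3 only; the kernels actually used in this paper also carry an orientation factor (Section 4.1), and all names below are illustrative.

```python
import numpy as np

def kde_evaluate(x, particles, weights, sigma):
    """Evaluate the density at x by summing one isotropic Gaussian
    kernel (bandwidth sigma) per weighted particle."""
    d = particles.shape[1]
    sq = np.sum((particles - x) ** 2, axis=1)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return float(np.sum(weights * np.exp(-sq / (2.0 * sigma ** 2))) / norm)

def kde_sample(n, particles, weights, sigma, rng):
    """Sample the density: pick a particle l with p(l = i) proportional
    to w_i, then sample from that particle's kernel."""
    idx = rng.choice(len(particles), size=n, p=weights / weights.sum())
    noise = sigma * rng.standard_normal((n, particles.shape[1]))
    return particles[idx] + noise

rng = np.random.default_rng(0)
particles = rng.standard_normal((500, 3))   # 500 particles supporting a density
weights = np.ones(500) / 500.0              # uniform weights
print(kde_evaluate(np.zeros(3), particles, weights, sigma=0.3))
print(kde_sample(5, particles, weights, sigma=0.3, rng=rng))
```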

The expressiveness of a single "nonparametric" kernel is generally limited; the kernel we use models location and orientation independently, and the location/orientation components are both isotropic (see below). Nonparametric methods account for the simplicity of individual kernels by employing a large number of them: in this framework, a density will typically be supported by 500 particles.

Compared to traditional parametric methods, the nonparametric approach eliminates problems such as fitting of mixtures or the choice of a number of components. Also, no assumption concerning density shapes (normality, ...) has to be made.

4.1 Kernel Definition

Random variables and potentials are defined over the Special Euclidean group SE(3) = R^3 × SO(3), where SO(3) is the Special Orthogonal group (the group of 3D rotations). We use a kernel that factorizes into two functions defined on R^3 and SO(3), which is consistent with the policy of kernel simplicity in the nonparametric approach. Denoting the separation of SE(3) elements into translations and rotations by

x = (λ, θ), µ = (µ_t, µ_r), σ = (σ_t, σ_r),

we define our kernel with

K(x; µ, σ) = N(λ; µ_t, σ_t) · Θ(θ; µ_r, σ_r)    (1)

where µ is the kernel mean point, σ is the kernel bandwidth, N(·) is a trivariate isotropic Gaussian kernel, and Θ(·) is an orientation kernel defined on SO(3). Denoting by θ′ and µ′_r the quaternion representations of θ and µ_r [37], we define the orientation kernel with the Dimroth-Watson distribution [38], [39]

Θ(θ; µ_r, σ_r) = W(θ′; µ′_r, σ_r) = C_w(σ_r) · exp(σ_r (µ′_r^T θ′)²)    (2)

where C_w(σ_r) is a normalizing factor. This kernel corresponds to a Gaussian-like distribution on SO(3). The Dimroth-Watson distribution inherently handles the double cover of SO(3) by quaternions [40]. Similar approaches have been explored in prior work [28].
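As a hedged sketch, the kernel (1) can be evaluated as below, with orientations as unit quaternions and the Watson factor left unnormalized (the normalizer C_w(σ_r) is omitted); all function names are illustrative.

```python
import numpy as np

def gaussian_iso(lam, mu_t, sigma_t):
    # Trivariate isotropic Gaussian N(lambda; mu_t, sigma_t).
    sq = np.sum((lam - mu_t) ** 2)
    return np.exp(-sq / (2.0 * sigma_t ** 2)) / (2.0 * np.pi * sigma_t ** 2) ** 1.5

def dimroth_watson(q, mu_q, kappa):
    # Dimroth-Watson density on unit quaternions, up to the normalizer
    # C_w(kappa): exp(kappa * (mu^T q)^2). Squaring the dot product makes
    # W(q) = W(-q), which handles the quaternion double cover of SO(3).
    return np.exp(kappa * np.dot(mu_q, q) ** 2)

def se3_kernel(x, mu, sigma):
    """K(x; mu, sigma) = N(lambda; mu_t, sigma_t) * Theta(theta; mu_r, sigma_r),
    with x = (lambda, theta): a location in R^3 and a unit quaternion."""
    lam, theta = x
    mu_t, mu_r = mu
    sigma_t, sigma_r = sigma
    return gaussian_iso(lam, mu_t, sigma_t) * dimroth_watson(theta, mu_r, sigma_r)

identity = np.array([1.0, 0.0, 0.0, 0.0])        # unit quaternion (w, x, y, z)
x = (np.array([0.1, 0.0, 0.0]), identity)
mu = (np.zeros(3), identity)
print(se3_kernel(x, mu, sigma=(0.2, 50.0)))      # sigma_r acts as a concentration
```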

The bandwidth σ associated to a density should ideally be selected jointly in R^3 and SO(3). However, this is difficult to do. Instead, we set the orientation bandwidth σ_r to a constant allowing about 10° of deviation; the location bandwidth σ_t is then selected using a k-nearest-neighbor technique [36].
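The paper does not spell out the k-NN rule it borrows from [36], so the sketch below assumes one common variant: setting σ_t to the mean distance from each particle to its k-th nearest neighbor.

```python
import numpy as np

def knn_location_bandwidth(locations, k=8):
    """Return sigma_t as the mean distance from each particle to its k-th
    nearest neighbour (an assumed variant of the k-NN bandwidth rule)."""
    diff = locations[:, None, :] - locations[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    dist.sort(axis=1)          # column 0 is the distance of a point to itself
    return float(dist[:, k].mean())

rng = np.random.default_rng(1)
print(knn_location_bandwidth(rng.standard_normal((500, 3))))
```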

Fig. 4. Example of a hierarchical model of a traffic sign, with explicit potentials and feature pose densities. (Diagram legend: hidden nodes (features) X_1–X_6, observed nodes Y_1–Y_3, potentials ψ_6,4, ψ_6,5, ψ_4,1, ψ_4,2, ψ_5,3, evidence φ_1–φ_3, codebook vectors and classification; the object side is learned, the scene side is used during detection.)

4.2 Evidence

In Section 3.2, we explain that an observation potential is built from a set of input observations. Given the nonparametric representation, building an observation potential is straightforward: input observations are defined in 3D, and can thus directly be used as particles representing the observation potential. However, input observations may not provide a full SO(3) orientation. For example, an ECV observation has an orientation parametrized by the local tangent to the 3D edge it is extracted from (θ ∈ S^2_+); the orientation about this 3D edge is unspecified. Observation potentials will thus generally be defined on a subspace of SE(3).

Section 5.1 and Section 5.2 describe inference in the general case of SE(3) densities. Section 5.3 refines the mechanisms of Section 5.2 to handle a network in which potentials are defined on different domains.

4.3 A Complete Example

Fig. 4 shows in greater detail the hierarchy of Fig. 3. Feature 2 is a primitive feature that corresponds to a local black-white edge segment (the white looks greenish in the picture). The blue patch pattern in the φ_2(X_2) box is the nonparametric representation of the evidence distribution for feature 2, which corresponds to black-white edge segments produced by the ECV system. The blue patch pattern in the X_2 box is the nonparametric representation of the posterior density of X_2, i.e. the poses in which the part "feature 2" is likely to be found. Feature 4 corresponds to the opening-bridge drawing. Its inferred pose in the scene is shown by the red patch in the X_4 box. For clarity, only one patch is shown; the density would normally be represented by a large number of patches around that point. The ψ_4,2(X_4, X_2) box shows the encoding of the relationship between features 4 and 2: for a fixed pose of feature 4 (in red), it shows the likely poses of feature 2 (in blue).

5 INFERENCE

Inference provides marginal pose densities for all features of the model, given the observations extracted from a scene. After inference, the top feature indicates where the learned object is likely to be present.

Inference is preceded by the definition of evidence for all features (hidden nodes) of the model (Section 3 and Section 4.2). Once these priors have been defined, inference can be carried out with any applicable algorithm. We currently use a belief propagation algorithm, of which we give a complete, top-down view below.

5.1 Belief Propagation

In the case of a Markov tree, belief propagation (BP) [29], [34], [35] is based on an exchange of messages between neighboring hidden nodes. A message that feature i sends to feature j is denoted by m_ij(X_j), and carries feature i's belief about the state of feature j. To prepare a message for feature j, feature i starts by computing a "local pose belief estimate" as the product of the local evidence and all incoming messages except the one that comes from j. This product is then multiplied with the compatibility potential of i and j, and marginalized over X_i. The complete message expression is

m_ij(X_j) = ∫ ψ_ij(X_i, X_j) φ_i(X_i) ∏_{k∈N(i)\j} m_ki(X_i) dX_i    (3)

where N(i) denotes the set of features adjacent to feature i. When all messages have been exchanged once, a belief is computed for each feature as

b_j(X_j) = (1/Z) φ_j(X_j) ∏_{i∈N(j)} m_ij(X_j),    (4)

where Z is a normalizing constant. The belief b_j(X_j) corresponds to the posterior marginal pose density of feature j.
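As a toy illustration of (3) and (4), the sketch below runs BP on a discrete-state analogue of the tree of Fig. 3 (random potentials, S possible states per node); in the paper the variables are continuous SE(3) poses and the sums become integrals, handled nonparametrically as described in Section 5.2.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4                                              # toy discrete pose space
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 4), (2, 4), (3, 5), (4, 6), (5, 6)]   # topology of Fig. 3
nbrs = {i: [] for i in nodes}
for a, b in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)

psi = {e: rng.random((S, S)) for e in edges}       # compatibility potentials
def pot(i, j):                                     # psi_ij indexed [x_i, x_j]
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

phi = {i: rng.random(S) for i in (1, 2, 3)}        # evidence on primitives
phi.update({i: np.ones(S) for i in (4, 5, 6)})     # uniform on meta-features

msgs = {(i, j): np.ones(S) for i in nodes for j in nbrs[i]}
for _ in range(len(nodes)):                        # enough sweeps for a tree
    new = {}
    for (i, j) in msgs:
        # m_ij(x_j) = sum_xi psi_ij(x_i,x_j) phi_i(x_i) prod_{k!=j} m_ki(x_i)
        pre = phi[i] * np.prod([msgs[(k, i)] for k in nbrs[i] if k != j], axis=0)
        m = pre @ pot(i, j)
        new[(i, j)] = m / m.sum()                  # normalize for stability
    msgs = new

beliefs = {j: phi[j] * np.prod([msgs[(i, j)] for i in nbrs[j]], axis=0)
           for j in nodes}
beliefs = {j: b / b.sum() for j, b in beliefs.items()}
print(beliefs[6])                                  # posterior over object pose
```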

BP propagates the collected evidence from primitive features to the top of the hierarchy, permitting the inference of the object pose likelihood. Furthermore, the global belief about the object pose is also propagated from the top node down the hierarchy, reinforcing globally consistent evidence and permitting the inference of occluded features.

5.2 Nonparametric Belief Propagation

Several instantiations of the BP algorithm for continuous, non-Gaussian potentials are available in the literature, e.g. NBP [17], PAMPAS [41], or more recently MSBP [24]. These methods support nonparametric density representations, and are applicable to our graphical model.

We are currently using the NBP algorithm. Since NBP tends to outperform PAMPAS for graphical models with multimodal potentials containing large numbers of kernels [17], [41], it is a good choice for our models, where potentials linking primitive features to their parents are inherently multimodal and are usually represented with more than a hundred particles. Park et al. recently presented the MSBP algorithm [24], which uses mean shift to perform nonparametric mode-seeking on belief surfaces generated within the belief propagation framework. While MSBP has been shown to outperform NBP in terms of speed and accuracy, the ability to compute complete posteriors with NBP is a strong advantage for some applications.

NBP is easier to explain if we decompose the analytical message expression (3) into two steps:

1) Computation of the local belief estimate:

β_ts(X_t) = φ_t(X_t) ∏_{i∈N(t)\s} m_it(X_t),    (5)

2) Combination of β_ts with the compatibility function ψ_ts, and marginalization over X_t:

m_ts(X_s) = ∫ ψ_ts(X_t, X_s) β_ts(X_t) dX_t.    (6)

NBP forms a message by first sampling from the product (5) to collect a nonparametric representation of β_ts(X_t), then sampling from the integral (6) to collect a nonparametric representation of m_ts(X_s).

Sampling from the message product (5) is a complex task for which Ihler et al. explored multiple efficient solutions [42], one of which is to sample from the product using importance sampling (IS). Importance sampling can be used to produce a nonparametric representation of an unknown distribution p(x) by correctly weighting samples from a known distribution q(x): IS accounts for the difference between the target distribution p(x) and the proposal distribution q(x) by assigning to each sample x^ℓ a weight defined as w^ℓ = p(x^ℓ)/q(x^ℓ). Since p(x) is often rather different from q(x), many samples may end up with a very small weight. To compute a set of n representative samples of p(x), one usually takes rn samples from q(x) (r ≥ 1), computes their weights as defined above, and eventually resamples to a size of n. The closer q(x) is to p(x), the better {(x^ℓ, w^ℓ)} will approximate p(x).
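A minimal 1D numpy illustration of this weight-and-resample scheme, with an illustrative bimodal target p (known only up to evaluation) and a broad Gaussian proposal q:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x): bimodal, easy to evaluate but awkward to sample directly.
def p(x):
    return 0.5 * np.exp(-(x - 2.0) ** 2 / 0.5) + 0.5 * np.exp(-(x + 2.0) ** 2 / 0.5)

# Proposal q(x): density of the N(0, 2) distribution we sample from.
def q(x):
    return np.exp(-x ** 2 / 8.0) / np.sqrt(8.0 * np.pi)

n, r = 500, 4
x = rng.normal(0.0, 2.0, size=r * n)       # take rn samples from q
w = p(x) / q(x)                            # importance weights w = p/q
idx = rng.choice(r * n, size=n, p=w / w.sum())
samples = x[idx]                           # resampled: n representatives of p
print(samples.mean(), samples.std())       # mean near 0, spread covering both modes
```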


To sample from a message product (5) with IS, we use a proposal function that corresponds to the sum of the factors of the product; importance weights are computed as

w^ℓ = [φ_t(x_t^ℓ) ∏_{i∈N(t)\s} m_it(x_t^ℓ)] / [φ_t(x_t^ℓ) + Σ_{i∈N(t)\s} m_it(x_t^ℓ)].

The second phase of the NBP message construction computes an approximation of the integral (6) by stochastic integration. Stochastic integration takes a set of samples {x_t^i} from β_ts(X_t), and propagates them to feature s by sampling from ψ_ts(x_t^i, X_s) for each x_t^i. It would normally also be necessary to take into account the marginal influence of ψ_ts(X_t, X_s) on X_t. In our case however, potentials only depend on the difference between their arguments; the marginal influence is a constant and can be ignored. In NBP inference, the efficiency trade-off is largely governed by the number of particles n supporting density functions: while the success of NBP inference highly depends on a sufficient density resolution, increasing n has a hard impact on computational cost and memory footprint.
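A location-only sketch of this stochastic integration step, assuming ψ_ts is stored as particles of the relative displacement between t and s (orientations are omitted for brevity, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.01                                 # kernel bandwidth of psi_ts

# Nonparametric potential psi_ts: particles over the pose of s relative to t.
rel_particles = rng.normal([0.5, 0.0, 0.0], 0.01, size=(400, 3))

def propagate(beta_particles):
    """Stochastic integration of (6): for each sample x_t of beta_ts, draw
    one sample of X_s from psi_ts(x_t, X_s)."""
    n = len(beta_particles)
    rel = rel_particles[rng.integers(0, len(rel_particles), n)]
    rel = rel + sigma * rng.standard_normal((n, 3))   # sample each kernel
    return beta_particles + rel                       # shift by x_t

beta = rng.normal([1.0, 2.0, 0.0], 0.05, size=(400, 3))
print(propagate(beta).mean(axis=0))          # message particles near (1.5, 2, 0)
```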

5.3 Analytic Messages

In Section 4.2, we mention that observation potentials may be defined on a subspace of SE(3). This can effectively be represented in SE(3) by setting the free dimensions to a uniform distribution. Each observation would be represented with multiple SE(3) particles which, in the case of a completely non-oriented observation, would each be assigned a random orientation. However, the number of particles needed to account for as little as one degree of freedom in orientation without imposing too much noise on the other dimensions may already be quite high, and will surely impede inference. Another way to approach the problem is to define observation-related densities on the specific SE(3) subspace D on which observations are defined. For example, with completely non-oriented observations, the domain would simply be D = R^3. This way, each observation can be represented by a single particle. Primitive features are the hidden counterpart of observations; if a different domain D is used for observation potentials, primitive features will also be defined on D. Finally, the domain of potentials linking primitive features to second-level features should be changed. Since a potential holds the spatial distribution of the child feature in a reference frame that corresponds to the pose of the parent feature, these potentials will also be defined on D. The next paragraphs show how analytic messages [28], [43] can be used to allow NBP to efficiently manage messages exchanged between pairs of variables defined on different domains.

Let us turn to the propagation equation (3), which we analytically decomposed into a multiplication (5) and an integration (6). We explained that NBP implements BP by numerically performing the same decomposition, i.e. computing explicit nonparametric representations for messages and local estimates alternately.

(a) Part of a hierarchy. (b) 2D cut of a nonparametric representation of β_ut and m_ut. (c) 2D cut of a representation of ψ_ut(X_u, 0).

Fig. 5. Illustration of the issue of sending a message from a feature u defined on D ⊂ SE(3) to a feature t defined on SE(3). We note that in Fig. (b), m_ut should be distributed on the surface of a sphere centered at a. For the purpose of this illustration, we only show the particles that are near the intersection of the sphere with the xy plane. Since X_t is defined on SE(3), its particles have a full 3D orientation. Given the information contained in β_ut and ψ_ut, we know that the particles in m_ut should have their x axis oriented towards a. As for their y and z axes, there is no evidence. This degree of freedom is illustrated in Fig. (b) by the random direction of the smaller line of the frame picturing each particle.

Forming an explicit representation of a message sent from a feature u defined on D ⊂ SE(3) to a feature t defined on SE(3) unfortunately leads to representing in SE(3) a distribution that is likely to span a large space. Let us consider the example of Fig. 5a, where u is defined on D = R^3 and is located at a distance d from t in the x direction. The potential ψ_ut(X_u, X_t), which represents the distribution of u in a reference frame defined by t, will be a unimodal distribution centered at (d, 0, 0) (Fig. 5c). Let us assume we know that u is located around a center point a, i.e. β_ut(X_u) is a unimodal distribution centered at a. In forming m_ut(X_t), all we can say is that t will be located about the surface of a sphere of radius d centered at a (Fig. 5b). In general, m_ut(X_t) will be a lot more expensive to represent than ψ_ut(X_u, X_t) and β_ut(X_u) together. Yet, the representational cost will typically be much smaller for β_ts(X_t) than for m_ut(X_t), thanks to the disambiguation brought by m_vt(X_t). The following paragraphs show how to avoid m_ut(X_t) by using ψ_ut(X_u, X_t) and β_ut(X_u) directly in computing β_ts(X_t).



Let us assume we are in the process of constructing a nonparametric representation of β_ts(X_t), i.e. the local estimate of feature t that includes all incoming information but that from s. In general IS-based NBP, we first form a proposal function q(X_t) from the sum of incoming messages and local evidence; then, we repeatedly take a sample x_t^ℓ from q(X_t), and compute its importance weight by multiplying the evaluations of the incoming-message nonparametric representations at x_t^ℓ:

w^ℓ = (1/q(x_t^ℓ)) · φ_t(x_t^ℓ) ∏_{i∈N(t)\{s}} m_it(x_t^ℓ).    (7)

However, this expression can be rewritten as

w^ℓ = (φ_t(x_t^ℓ)/q(x_t^ℓ)) ∏_{i∈N(t)\{s}} ∫ ψ_it(X_i, x_t^ℓ) β_it(X_i) dX_i,    (8)

and w^ℓ can thus be computed directly from β_it(X_i) and ψ_it(X_i, X_t) (i ∈ N(t)\{s}), i.e. by using analytic messages [28], [43]. Each integral is evaluated by sampling p examples x_i^k from ψ_it(X_i, x_t^ℓ), evaluating β_it(x_i^k), and taking the average over k.

When β_it(X_i) is defined on a proper subspace of SE(3), forming an explicit message representation to compute importance weights (7) would suffer from the issue illustrated in Fig. 5. In (8), we evaluate β_it(x_i^k), which means that its degrees of freedom are taken into account analytically. We can thus afford to use a rather sparse proposal q, computed from explicit messages, because we then precisely weight these samples directly through potentials and local estimates.

5.4 Two-Level Importance Sampling

One known weakness of IS-based NBP is that it cannot intrinsically concentrate its attention on the modes of a product, which is an issue since individual messages often present many irrelevant modes [42]. We overcome this problem with a two-level IS: we first compute an intermediate representation of the product with the procedure explained above; we then use this very representation as the proposal function for a second IS that is geared towards the relevant modes. Let us denote by n the number of particles supporting a density. The intermediate representation is obtained with sparse analytic messages (p ≪ n) but many importance samples (r ≫ 1), while the second IS uses rich analytic messages (p ≈ n) but a low value for r. We typically choose r and p such that r × p = n. There are rn particles to weight in the proposal density. Denoting by d the number of messages in the product (5), weighting one particle (8) involves the evaluation of d densities at p points, the evaluation of the evidence, and the evaluation of the proposal. Density evaluation is logarithmic in the number of density particles if implemented with kd-trees [42]. The product of d analytic messages is thus

O(dn² log n).    (9)

We note that comparing this with the algorithmic complexity of other NBP implementations would need to be done with care, because the magnitude of n required to achieve a given resolution depends on the method.

The two-level IS described above and the product of analytic messages have proved crucial for the successful application to real-world objects such as those presented in Section 7.

6 LEARNING

Model learning is the process of building a hierarchy from a set of input observations. The information contained in a hierarchy is held in the topology of the network, in the compatibility potentials, and in the codebook. A learning procedure is responsible for defining these three elements; there is no additional constraint. We are primarily interested in unsupervised algorithms, which can be divided into top-down and bottom-up approaches. A top-down approach would recursively divide the set of available observations following structural or semantic criteria, each division giving rise to a new meta-feature. When reaching observation groups that are simple enough, primitive features would be formed and codebook vectors would possibly be extracted. A bottom-up approach works the other way around. Observations are first separated into k sets, each of which gives rise to a primitive feature. If appearance descriptors are used in the separation process, the codebook is trivially defined from the appearance of each set. To form the hierarchy, features are iteratively combined with one another, based again on structural or semantic criteria. Previous work on top-down [8] and bottom-up [7], [9], [20] approaches is available in the literature.

In the next sections, we develop a bottom-up learning method which autonomously builds a tree-structured hierarchy from a set of observations extracted from a segmented object. The method was designed with simplicity in mind. Despite this simplicity, it allows us to produce hierarchies that can readily be used for object pose estimation, as shown in Section 7.

6.1 Primitive Features

Learning starts with a set of n observations

O = {(x_i, w_i, a_i)}_{i∈[1,n]}

where x_i contains geometric information, w_i is a weight, and a_i is an appearance descriptor. The geometric information includes a 3D location and an orientation. As discussed in previous sections, the orientation does not need to be a full 3D orientation; partially defined orientations also work. The appearance descriptor is defined in R^d (d ≥ 0), and it is assumed that the Euclidean metric in R^d is a meaningful measure of the closeness of two descriptors. There is no particular requirement on the quality of the observations used for learning, but accurate observations naturally produce better models.

To define primitive features, we essentially divide O into k subsets. The division criterion we are currently using is exclusively appearance-based. K-means clustering is applied in the appearance space to the set of appearance descriptors {a_i}, producing k cluster centers A_1, ..., A_k. These k cluster centers directly form the codebook; each primitive feature i corresponds to a generic element characterized by an appearance A_i [7], [20].
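A sketch of this clustering step, assuming scikit-learn's k-means and appearance descriptors made of two RGB colors (vectors in R^6); the descriptor layout and the value of k are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Appearance descriptors: e.g. two RGB colors per ECV observation (R^6).
A = rng.random((3000, 6))

k = 5
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A)
codebook = km.cluster_centers_       # codebook vectors A_1, ..., A_k
labels = km.labels_                  # maps observation -> primitive feature

# The pose density of primitive feature i is then obtained by resampling
# the (location, orientation) entries of the observations in cluster i.
obs_per_feature = [np.flatnonzero(labels == i) for i in range(k)]
print([len(s) for s in obs_per_feature])
```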

The next section explains how meta-features are created. Meta-features encode relative spatial relationships; for the purpose of their creation, pose densities have to be available for all existing features. The pose distribution of each primitive feature i is computed from the spatial distribution of the observations that were associated to the i-th cluster during the k-means clustering. Since distributions are encoded in a nonparametric way, feature pose densities are computed simply by resampling the set of corresponding observations. These pose densities only serve to build the hierarchy; they are discarded at the end of the learning process.

To illustrate the previous paragraphs, we can come back to Fig. 4, in which feature 3 is a generic red-white edge segment. During the construction of the model, the density associated to feature 3 would show that this feature is mainly distributed along the outer edges of the traffic sign.

6.2 Meta-Features

After primitive features have been defined, the graph is built incrementally, one level at a time. The learning algorithm works by repeatedly grouping sets of features into a higher-level meta-feature. We currently restrict the number of features that can be combined into a meta-feature to 2. Limiting the number of children of a meta-feature is a rather soft restriction though, since wider combinations can intuitively be traded for deeper hierarchies.

A new level is built on top of the previous level, which we will refer to as the supporting level. Meta-features encode geometric configurations; when starting a new level, the pose of each feature of the supporting level is assumed to be already defined. The definition of primitive-feature poses has been explained in the previous section; the definition of meta-feature poses is explained below. We note again that the purpose of defining feature pose densities is to extract relative configurations and define the potentials linking to the new level. Feature densities are specific to the learning scene; they cannot help in the inference of the model in a different scene.

To create an additional level, the features of the supporting level are randomly grouped into pairs. (If there is an odd number of features, one will stand alone.) Each pair forms a meta-feature. The pose of a meta-feature in the scene can be defined in an arbitrary way. Since a meta-feature encodes the composition of its child features, its pose is defined as the mean of the densities of its children, which helps in interpreting the final hierarchy. Each meta-feature i thus has a single pose x_i^p in the learning scene.

Compatibility potentials are then defined through the extraction of relative spatial relations between each new meta-feature and its children. The potential ψ_ij(X_i, X_j) between meta-feature i and its child j is formed by repeatedly sampling from the relative transformation between i and j. Drawing a sample x_ψ from the relative transformation is achieved by drawing a sample x_j from feature j and combining it with the pose x_i^p of its parent i. Denoting the separation of pose variables into translations and rotations by

x_ψ = (λ_ψ, θ_ψ), x_j = (λ_j, θ_j), x_i^p = (λ_i, θ_i),

the sample x_ψ from the transformation is formed from x_i^p and x_j as

λ_ψ = θ_i^T (λ_j − λ_i), θ_ψ = θ_i^T θ_j,

where SO(3) orientations are represented as rotation matrices for simplicity. In this last expression, x_ψ corresponds to the projection of x_j into the reference frame defined by x_i^p. The set {x_ψ} represents the pose distribution of feature j in the reference frame defined by the pose of its parent i, which actually corresponds to ψ_ij(X_i = 0, X_j). During inference, to evaluate ψ_ij(X_i = x_i, X_j = x_j), we first transform ψ_ij(X_i = 0, X_j) into ψ_ij(X_i = x_i, X_j) by applying the transformation defined by x_i to all the supporting particles,

λ′_ψ = θ_i λ_ψ + λ_i, θ′_ψ = θ_i θ_ψ,

and then evaluate ψ_ij(X_i = x_i, X_j = x_j) with kernel density estimation on the set {x′_ψ = (λ′_ψ, θ′_ψ)}. (For efficiency, we actually evaluate ψ_ij(X_i = 0, X_j = x′_j), where x′_j is the result of the transformation of x_j by the inverse of the transformation defined by x_i.)
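A numpy sketch of these two projections, with orientations represented as rotation matrices as in the text (the example poses and helper names are illustrative):

```python
import numpy as np

def sample_potential(x_parent, x_child):
    """Project a child pose into the parent frame:
    lambda_psi = theta_i^T (lambda_j - lambda_i), theta_psi = theta_i^T theta_j."""
    (lam_i, R_i), (lam_j, R_j) = x_parent, x_child
    return R_i.T @ (lam_j - lam_i), R_i.T @ R_j

def apply_parent(x_parent, x_psi):
    """Move a particle of psi_ij(X_i = 0, X_j) to psi_ij(X_i = x_i, X_j):
    lambda' = theta_i lambda_psi + lambda_i, theta' = theta_i theta_psi."""
    lam_i, R_i = x_parent
    lam_psi, R_psi = x_psi
    return R_i @ lam_psi + lam_i, R_i @ R_psi

def rot_z(a):
    # Rotation about the z axis by angle a (helper for the example).
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

parent = (np.array([1.0, 0.0, 0.0]), rot_z(np.pi / 2.0))
child = (np.array([1.0, 1.0, 0.0]), rot_z(np.pi))
rel = sample_potential(parent, child)
print(rel[0])                          # child location in the parent frame
print(apply_parent(parent, rel)[0])    # recovers the child location
```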

Finally, the pose of each new meta-feature is encoded in the feature pose density, to provide the next level-building iteration with spatial information. The pose of meta-feature i is encoded as a Dirac peak, with a set of particles concentrated at x_i^p.

The procedure described above divides the number of nodes by two from one level to the next. It stops when there is only one node at the top level. The extraction of spatial relations constitutes its principal outcome. The complexity (number of features) of the models it produces is mainly influenced by the color variety within objects. Objects with a larger color variety give rise to more primitive features, which in turn leads to wider and taller hierarchies.

We note that speaking about the pose of an object implies that a reference pose (Fig. 6) has been defined for that object. Within a hierarchical model, the reference pose of an object is defined through the pose x_t^p associated to the top feature t during model learning.

7 EVALUATION

This section illustrates the applicability of our model to pose estimation.


Fig. 6. Reference pose of an object. The reference frame marked A corresponds to the coordinate system of the scene. The relative configuration of the frame B and the gripper is the reference pose of the gripper. The pose of the gripper corresponds to the pose of B in A, i.e. the transformation between A and B.

7.1 Early Cognitive Vision

In this work, observations are provided by the ECV system presented in Section 1, which extracts local 3D edge segments from stereo imagery. Each edge segment has an appearance descriptor composed of two colors. The orientation θ associated to these observations is a 3D line (θ ∈ S^2_+). The space on which observation-related densities are defined is D = R^3 × S^2_+. The Dimroth-Watson kernel (2) is directly applicable in S^2_+, and the kernel we use on D is

K_D((λ, θ); µ, σ) = N(λ; µ_t, σ_t) · W(θ; µ_r, σ_r),

with θ ∈ S^2_+, µ = (µ_t, µ_r), and σ = (σ_t, σ_r).

Fig. 1 illustrates typical ECV reconstructions. The system can work in two different modes. In the first mode (Fig. 1, left), it produces an ECV reconstruction from a single stereo view of a scene. In the second mode (Fig. 1, right), it integrates information over a set of stereo views to produce a reconstruction that is much less affected by self-occlusions [19], [44]. We will refer to the latter as an accumulated reconstruction. ECV accumulated reconstructions are typically computed from a dozen stereo views.

We usually provide the ECV system with stereo images of 1280 × 960 pixels. In the following illustrations, we only show the left image of a stereo pair.

7.2 Pose Estimation

Model learning is ideally done from an accumulated ECV reconstruction. Learning from single-view ECV reconstructions is also possible, but the resulting models are less robust to large viewpoint changes. Detection is always performed on a single view (see Fig. 2). Pose estimates are inferred by sending messages from primitive features to the top feature; each feature sends one message to each of its parents. Inference produces an object pose likelihood in the density associated with the top feature of the object model. When instantiating a model in a scene that contains exactly one instance of the object, the top feature density should generally present a single mode indicating the pose of the object. However, if an object very similar to the one we are searching for appears in the scene, the top density will present a second, weaker mode. Consequently, densities have to be clustered to extract prominent modes. We currently limit experiments to scenes that include a known number p of object instances: if p pose estimates have to be produced, the p largest modes are selected.
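The paper does not specify the clustering algorithm; a simple greedy scheme that repeatedly takes the heaviest remaining particle and suppresses its neighborhood would look like the sketch below (the particle layout and the suppression radius are our assumptions):

    import numpy as np

    def p_largest_modes(particles, weights, p, radius):
        # Greedy mode extraction from a weighted particle set: pick the
        # heaviest remaining particle as a mode estimate, score it by the
        # local probability mass, then suppress its neighborhood so the
        # next pick lands on a distinct mode.
        particles = np.asarray(particles, dtype=float)
        weights = np.asarray(weights, dtype=float).copy()
        modes = []
        for _ in range(p):
            i = int(np.argmax(weights))
            near = np.linalg.norm(particles - particles[i], axis=1) < radius
            modes.append((particles[i], weights[near].sum()))
            weights[near] = 0.0  # suppress this mode's basin
        return modes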

Visualizing a pose estimate is done through top-down propagation. To visualize a pose estimate x, we fix the evidence of the top feature to x and propagate this information to the bottom of the graph to produce a nonparametric representation of the primitive feature densities. Using camera calibration parameters, particles from these densities are projected onto the 2D images, as in Fig. 7b. We can then easily verify that the projection matches the object we are looking for. We usually remove primitive feature evidence before down-propagating an estimate, which allows down-propagation to clearly reveal occluded object portions (Figs. 7, 9–11). Alternatively, if evidence is kept, primitive feature posteriors show particles that are coherent with both the model and the scene evidence (Fig. 12, second and third images).
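The 2D verification step amounts to a standard pinhole projection of the down-propagated particle locations. A minimal sketch, assuming a calibrated camera with intrinsics K and extrinsics (R, t); the calibration values are illustrative:

    import numpy as np

    def project_particles(points_3d, K, R, t):
        # Project 3D particle locations to pixel coordinates; the result
        # can be drawn over the left stereo image to check a pose estimate.
        cam = np.asarray(points_3d) @ R.T + t   # camera coordinates, (N, 3)
        uvw = cam @ K.T
        return uvw[:, :2] / uvw[:, 2:3]         # perspective division

    # Illustrative intrinsics for a 1280 x 960 image.
    K = np.array([[1000.0, 0.0, 640.0],
                  [0.0, 1000.0, 480.0],
                  [0.0, 0.0, 1.0]])
    uv = project_particles(np.random.rand(400, 3) + [0.0, 0.0, 2.0],
                           K, np.eye(3), np.zeros(3))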

We typically use hierarchies of 2 or 3 levels, very similar to the example of Fig. 3. We fix the number of particles supporting feature densities and potentials to 400. We keep the number of particles in observation potentials, n_e, equal to the number of observations produced by ECV, typically 1000–5000. A larger number of particles is necessary in observation potentials because, in a scene containing many different objects, only a small fraction of the particles comes from the object we are trying to detect. This larger number is acceptable because it appears only through its logarithm in the algorithmic complexity of inference (9): inference within a hierarchy of M meta-features is O(M n² log n_e), where n < n_e. We are developing a computer implementation of our learning and inference algorithms that, using a few reasonable approximations, allows us to detect hierarchies similar to that of Fig. 3 (M = 3) in about 10 seconds on a standard desktop computer. The memory footprint of the pose estimation process is always below 50 MB. The cost of detecting multiple objects is linear in the number of objects.
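The log n_e factor is characteristic of organizing the observation particles in a space-partitioning tree, so that each query touches only a logarithmic number of nodes. The kd-tree sketch below illustrates the idea; it is our illustration of the complexity argument, not the authors' implementation:

    import numpy as np
    from scipy.spatial import cKDTree

    obs_particles = np.random.rand(5000, 3)    # n_e observation particles
    tree = cKDTree(obs_particles)              # built once per potential

    query = np.random.rand(400, 3)             # n message particles
    dists, idx = tree.query(query, k=1)        # each lookup is ~O(log n_e)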

7.3 Experimental Setups

This section presents experiments through which we evaluate the performance impact of various parameter setups. Our input data consist of a sequence showing a plastic pot undergoing a rotation of 360° [45]. The sequence contains 72 stereo frames; the object is rotated by 5° between two successive views (Fig. 7). We learned two models of the pot: one from observations extracted from the first view of the sequence (Fig. 7a), another from observations accumulated over 10 stereo pairs of the sequence. Both models were then instantiated in all views; the results are summarized in Fig. 8 and Tab. 1. The single-view model only works for rotations of about 40° around the learning view.



Fig. 7. Example frames from the rotating pot sequence: (a) 0°, (b) 135°. Fig. (b) shows a correct estimate of the pose of the pot obtained with a hierarchy learned from an accumulated ECV reconstruction. See text for details.

TABLE 1
Running time per frame, success rate, and absolute error (location and orientation) for different experimental setups in the rotating pot sequence (Fig. 7), including learning from accumulated or single-view ECV, 2-level or 1-level importance-sampling NBP, and 400 or 200 particles per density. Location and orientation errors are averaged over successful estimations.

#   ECV     IS     n    time   suc. rate   loc. err.   ori. err.
1   acc.    2-lev  400  11s    66/72       10 mm       4.9°
2   single  2-lev  400  11s    40/72       9 mm        5.4°
3   acc.    1-lev  400  11s    41/72       18 mm       13°
4   acc.    2-lev  200  3.4s   52/72       12 mm       5.2°

When the opening of the pot faces away from the camera, the system fails to make a correct estimate. Fig. 8 shows that the model learned from accumulated observations performs better over a much wider range of viewpoints. Robustness drops around 90° and 270°, where the pot is viewed from the side; good stereopsis is then difficult to achieve and the ECV system is unable to produce a reliable 3D structure. As shown in Tab. 1, the average accuracy of successful estimates is similar for both models.
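For reference, location and orientation errors of the kind reported in Tab. 1 can be computed from estimated and ground-truth poses as follows; this is a standard formulation, since the paper does not spell out its exact error definition:

    import numpy as np

    def pose_errors(T_est, T_true):
        # Location error: Euclidean distance between translations (e.g. mm).
        loc_err = np.linalg.norm(T_est[:3, 3] - T_true[:3, 3])
        # Orientation error: angle of the relative rotation, in degrees.
        R_rel = T_est[:3, :3].T @ T_true[:3, :3]
        cos_a = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
        return loc_err, np.degrees(np.arccos(cos_a))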

The third row of Tab. 1 summarizes the detection results of the model learned from accumulated ECV when single-level importance sampling is used instead of the two-level IS of Section 5.4. For the same running time as the two-level IS, single-level IS presents a substantially lower detection rate, and the error in correct estimates is about twice as large.

The fourth row of Tab. 1 summarizes the detection results of the model learned from accumulated ECV with only half the number of particles of the other setups. Pose estimates are less accurate, and Fig. 8 shows that the system mainly loses robustness around difficult viewpoints (90° and 270°). Detection is, however, much faster.

7.4 Detection in Cluttered Scenes

To evaluate the robustness of our model to clutter and occlusions, we have computed the pose of various objects in several cluttered scenes [45], [46].

Fig. 8. Proportion of correct estimates (percentage, y-axis) in the rotating pot sequence (Fig. 7) as a function of the viewpoint angle (0–350°, x-axis). The four plots correspond to the four cases of Tab. 1: single-view 2-lev n = 400, acc. 2-lev n = 400, acc. 2-lev n = 200, and acc. 1-lev n = 400.

For technical reasons, most of the models were built from single-view ECV observations, or from ECV observations accumulated from two views. Only the basket model was built from ECV observations accumulated over 10 stereo pairs. Fig. 9 shows a set of scenes for which all estimates are correct. The first three images of the top row demonstrate that our system can handle some object deformations: small deformations are absorbed by the soft probabilistic reasoning, while larger deformations (e.g., towel folding) are handled as occlusions.

Fig. 10 shows scenes in which an estimation failed. Failures are usually due to one of two causes. The first is a poor ECV reconstruction, which may be due to stereopsis-related issues, low contrast, or poor image resolution. For example, in the third image of Fig. 10, the handle of the brush is almost horizontal, which makes it nearly impossible for ECV to find correct stereo matches. The second cause is the presence in a scene of a structure that has the same geometry and appearance as the object. This is, for example, the case in the first image of Fig. 10, where half of the pink kitchen cloth is occluded and a better match is found in the pink structure of the towel.

Fig. 11 shows a similar experiment with traffic signs that share a common outer structure but vary in their central pattern. The fourth image illustrates the utility of color and of a hierarchy of multiple parts. It shows the same scene as the third image, but in greyscale. When color is not used, learned hierarchies degenerate into a two-feature network: all observations are associated with a single primitive feature, and the geometry of the object is encoded in a single potential ψ. The number of particles of ψ that encode the pattern is small in proportion to the particles encoding the triangular frame, and their weight in the message sent through ψ is rather small. In that context, we observed that the pose likelihood of a sign generally presented several major modes, some corresponding to the correct sign, others centered on wrong signs.



Fig. 9. Cluttered scenes with correct pose estimates for all objects. Colors are used to identify the object to which 2D-projected particles belong. See text for details.

Fig. 10. Cluttered scenes where an estimation fails. Estimations failed respectively for the pink kitchen cloth, the breadboard, the brush, and the blue ice brick. Colors are used to identify the object to which 2D-projected particles belong. See text for details.

Fig. 11. Detection of objects of similar shapes. The last image shows pose estimates computed without taking color into account. See text for details.



Fig. 12. Detection of an articulated object. See text for details.

The largest estimate does not always correspond to the correct sign, as illustrated in the fourth image. Computing estimates in the first three images without using color led to only 4 correct poses out of 9; the results shown in these three images correspond to experiments where color is taken into account, and all estimates are correct. When color is used, a multi-part hierarchy is built and the central pattern is encoded in its own potential; during inference, it can contribute to the correct pose through a distinct message, which effectively discards incorrect modes.

7.5 Articulated Objects

This section explains how our framework can be used to represent articulated objects by providing training data that reflect articulations. We chose a toy weighing scale (Fig. 12). Our training data are a set of ECV observations extracted from several images exclusively showing the scale. The poses of the camera and of the toy scale are identical in all images, but each image shows the scale in a different equilibrium position. A model learned from such data represents all equilibrium configurations simultaneously and indistinctly. Fig. 12 shows pose estimates obtained from this model. The first image corresponds to a top-down propagation with prior deletion of primitive feature evidence (see Section 7.2); it clearly shows the different configurations supported by the model. The other two images correspond to a top-down propagation including all evidence, which reveals the object evidence that effectively contributed to the final pose estimate.

8 CONCLUSION AND FUTURE WORK

We presented an object representation framework that encodes probabilistic relations between 3D features organized in a Markov-tree hierarchy. We started with a discussion of nonparametric function representation in the Special Euclidean group SE(3). We then showed how an object can be detected in a novel scene, and how to build a hierarchy for a given object. We finally demonstrated the applicability of our method through a set of pose estimation experiments.

Detection largely amounts to inferring the pose of the top feature using nonparametric belief propagation. We detailed an importance sampling implementation of NBP message products, along with means of efficiently supporting networks that mix different density domains. Inference produces a likelihood of the object pose in the entire scene.

Learning is currently handled by a simple procedure which first clusters object observations according to their appearance, then randomly combines these clusters to form a hierarchy. The learning procedure leaves room for improvement, and we plan to explore several approaches in the near future. Yet, despite its simplicity, the current method is able to produce models that allow for rather robust pose estimation.

Our method is readily applicable to object pose estimation. Experiments demonstrate a high level of robustness to input noise, and show that a satisfactory accuracy can be reached within reasonable computational constraints. We are thus able to achieve pose recovery without prior object models, and without explicit point correspondences.

Our method can in principle incorporate features from perceptual modalities other than vision. Our objective is to observe haptic and kinematic features that correlate with successful grasps, and to integrate them into the feature hierarchy. Grasps will generally not relate to an object as a whole. Our part-based representation offers an elegant way to link grasps to the 3D visual features that predict their applicability, and to learn visual features that permit the generalization of grasps across objects. Grasp success likelihoods can then emerge from visual evidence through top-down inference.

ACKNOWLEDGMENTS

The authors warmly thank Prof. Norbert Kruger for his tireless support, and for very fruitful discussions. The authors also thank Dirk Kraft for providing us with object imagery annotated with ground-truth motion.

This work was supported by the Belgian National Fund for Scientific Research (FNRS) and the EU Cognitive Systems project PACO-PLUS (IST-FP6-IP-027657).

REFERENCES

[1] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.

[2] E. Bienenstock, S. Geman, and D. Potter, "Compositionality, MDL priors, and object recognition," in Advances in Neural Information Processing Systems, 1996.

[3] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nat. Neurosci., 1999.

[4] S.-C. Zhu, "Embedding Gestalt laws in Markov random fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 11, pp. 1170–1187, 1999.

[5] T. S. Lee and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," Journal of the Optical Society of America, pp. 1434–1448, Jul. 2003.

[6] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu, "Image parsing: Unifying segmentation, detection, and recognition," Int. J. Comput. Vision, vol. 63, no. 2, pp. 113–140, 2005.

[7] G. Bouchard and B. Triggs, "Hierarchical part-based visual object categorization," in Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 710–715.

[8] B. Epshtein and S. Ullman, "Feature hierarchies for object classification," in IEEE International Conference on Computer Vision, 2005.

[9] S. Fidler and A. Leonardis, "Towards scalable representations of object categories: Learning a hierarchy of parts," in Computer Vision and Pattern Recognition, 2007.


[10] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient matching of pictorial structures," in Computer Vision and Pattern Recognition, 2000, pp. 2066–.

[11] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce, "3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints," Int. J. Comput. Vision, vol. 66, no. 3, pp. 231–259, 2006.

[12] A. Kushal and J. Ponce, "Modeling 3D objects from stereo views and recognizing them in photographs," in European Conference on Computer Vision, 2006, pp. 563–574.

[13] J. Rodgers, D. Anguelov, H.-C. Pang, and D. Koller, "Object pose detection in range scan data," in Computer Vision and Pattern Recognition, 2006.

[14] P. Yan, S. M. Khan, and M. Shah, "3D model based object class detection in an arbitrary view," in International Conference on Computer Vision, 2007, pp. 1–6.

[15] S. Savarese and L. Fei-Fei, "3D generic object categorization, localization and pose estimation," in International Conference on Computer Vision, Oct. 2007, pp. 1–8.

[16] J. Liebelt, C. Schmid, and K. Schertler, "Viewpoint-independent object class detection using 3D feature maps," in Computer Vision and Pattern Recognition, June 2008.

[17] E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky, "Nonparametric belief propagation," in Computer Vision and Pattern Recognition, Los Alamitos, CA, USA, 2003.

[18] N. Kruger and F. Worgotter, "Multi-modal primitives as functional models of hyper-columns and their use for contextual integration," in BVAI, ser. Lecture Notes in Computer Science, M. D. Gregorio, V. D. Maio, M. Frucci, and C. Musio, Eds., vol. 3704. Springer, 2005, pp. 157–166.

[19] N. Pugeault, Early Cognitive Vision: Feedback Mechanisms for the Disambiguation of Early Visual Representation. Verlag Dr. Muller, 2008.

[20] F. Scalzo and J. H. Piater, "Statistical learning of visual feature hierarchies," in Workshop at Computer Vision and Pattern Recognition, Washington, DC, USA, 2005.

[21] R. Detry and J. H. Piater, "Hierarchical integration of local 3D features for probabilistic pose recovery," in Robot Manipulation: Sensing and Adapting to the Real World (Workshop at Robotics, Science and Systems), 2007.

[22] R. Detry, N. Pugeault, and J. H. Piater, "Probabilistic pose recovery using learned hierarchical object models," in International Cognitive Vision Workshop (Workshop at the 6th International Conference on Vision Systems), 2008.

[23] G. Hua, M.-H. Yang, and Y. Wu, "Learning to estimate human pose with data driven belief propagation," in Computer Vision and Pattern Recognition. Washington, DC, USA: IEEE Computer Society, 2005, pp. 747–754.

[24] M. Park, Y. Liu, and R. Collins, "Efficient mean shift belief propagation for vision tracking," in Computer Vision and Pattern Recognition, June 2008 (to appear).

[25] J. Zhang, Y. Liu, J. Luo, and R. Collins, "Body localization in still images using hierarchical models and hybrid search," in Computer Vision and Pattern Recognition, 2006, pp. 1536–1543.

[26] L. Sigal, Y. Zhu, D. Comaniciu, and M. J. Black, "Tracking complex objects using graphical object models," in First International Workshop on Complex Motion, 2004, pp. 223–234.

[27] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard, "Tracking loose-limbed people," in Computer Vision and Pattern Recognition, vol. 1, 2004, pp. I-421–I-428.

[28] E. B. Sudderth, "Graphical models for visual object recognition and tracking," Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, USA, 2006. Advisers: William T. Freeman and Alan S. Willsky.

[29] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[30] J. Coughlan and H. Shen, "Dynamic quantization for belief propagation in sparse spaces," Comput. Vis. Image Underst., vol. 106, no. 1, pp. 47–58, 2007.

[31] M. Toews and T. Arbel, "Detecting and localizing 3D object classes using viewpoint invariant reference frames," in International Conference on Computer Vision, 2007, pp. 1–8.

[32] I. Gordon and D. G. Lowe, "What and where: 3D object recognition with accurate pose," in Toward Category-Level Object Recognition, 2006, pp. 67–82.

[33] J. Piater, F. Scalzo, and R. Detry, "Vision as inference in a hierarchical Markov network," in Twelfth International Conference on Cognitive and Neural Systems, 2008.

[34] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Understanding belief propagation and its generalizations," Mitsubishi Electric Research Laboratories, Tech. Rep., 2002.

[35] M. I. Jordan and Y. Weiss, "Graphical models: Probabilistic inference," in The Handbook of Brain Theory and Neural Networks, 2nd ed., M. Arbib, Ed. MIT Press, 2002.

[36] B. W. Silverman, Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.

[37] J. Kuffner, "Effective sampling and distance metrics for 3D rigid body path planning," in Int'l Conf. on Robotics and Automation. IEEE, May 2004.

[38] K. V. Mardia and P. E. Jupp, Directional Statistics, ser. Wiley Series in Probability and Statistics. Wiley, 1999.

[39] I. L. Dryden, "Statistical analysis on high-dimensional spheres and shape spaces," ArXiv Mathematics e-prints, Aug. 2005.

[40] C. de Granville, J. Southerland, and A. H. Fagg, "Learning grasp affordances through human demonstration," in International Conference on Development and Learning, 2006.

[41] M. Isard, "PAMPAS: Real-valued graphical models for computer vision," in Computer Vision and Pattern Recognition, 2003, pp. 613–620.

[42] A. T. Ihler, E. B. Sudderth, W. T. Freeman, and A. S. Willsky, "Efficient multiscale sampling from products of Gaussian mixtures," in Neural Information Processing Systems, 2003.

[43] A. T. Ihler, J. W. Fisher III, R. L. Moses, and A. S. Willsky, "Nonparametric belief propagation for self-calibration in sensor networks," IEEE Journal of Selected Areas in Communication, vol. 23, no. 4, pp. 809–819, 2005.

[44] D. Kraft, N. Pugeault, E. Baseski, M. Popovic, D. Kragic, S. Kalkan, F. Worgotter, and N. Kruger, "Birth of the object: Detection of objectness and extraction of object shape through object action complexes," Special Issue on "Cognitive Humanoid Robots" of the International Journal of Humanoid Robotics, 2008 (accepted).

[45] D. Kraft and N. Kruger, "Object sequences," 2009. [Online]. Available: http://www.mip.sdu.dk/covig/sequences.html

[46] R. Detry and J. Piater, "A probabilistic framework for 3D visual object representation: Experimental data," 2009. [Online]. Available: http://intelsig.org/publications/Detry-2009-PAMI/

Renaud Detry earned his engineering degree at the University of Liege in 2006. He is currently a Ph.D. student at the University of Liege, working under the supervision of Prof. Justus Piater on multimodal object representations for autonomous sensorimotor interaction. He has contributed probabilistic representations and methods for multi-sensory learning that allow a robotic agent to infer useful behaviors from perceptual observations.

Nicolas Pugeault obtained an M.Sc. from the University of Plymouth in 2002 and an Engineer degree from the Ecole Superieure d'Informatique, Electronique, Automatique (Paris) in 2004. He obtained his Ph.D. from the University of Gottingen in 2008, and is currently working jointly as a lecturer at the University of Southern Denmark and as a research associate at the University of Edinburgh.



Justus H. Piater is a professor of computer science at the University of Liege, Belgium, where he directs the Computer Vision research group. His research interests include computer vision and machine learning, with a focus on visual learning, closed-loop interaction of sensorimotor systems, and video analysis. He has published about 80 technical papers in scientific journals and at international conferences. He earned his Ph.D. degree at the University of Massachusetts Amherst, USA, in 2001, where he held a Fulbright graduate student fellowship, and subsequently spent two years on a European Marie-Curie Individual Fellowship at INRIA Rhone-Alpes, France, before joining the University of Liege.