

Learning View-Model Joint Relevance for 3D Object Retrieval

Ke Lu, Member, IEEE, Ning He, Jian Xue, Jiyang Dong, and Ling Shao, Senior Member, IEEE

Abstract—3D object retrieval has attracted extensive research efforts and become an important task in recent years. It is noted that how to measure the relevance between 3D objects is still a difficult issue. Most existing methods employ just the model-based or the view-based approach, which may lead to incomplete information for 3D object representation. In this paper, we propose to jointly learn the view-model relevance among 3D objects for retrieval, in which the 3D objects are formulated in different graph structures. With the view information, the multiple views of 3D objects are employed to formulate the 3D object relationship in an object hypergraph structure. With the model data, model-based features are extracted to construct an object graph that describes the relationship among 3D objects. Learning on the two graphs is conducted to estimate the relevance among 3D objects, in which the view/model graph weights can also be optimized in the learning process. This is the first work to jointly explore the view-based and model-based relevance among 3D objects in a graph-based framework. The proposed method has been evaluated on three datasets. The experimental results and the comparison with the state-of-the-art methods demonstrate the effectiveness, in terms of retrieval accuracy, of the proposed 3D object retrieval method.

Index Terms—3D Object Retrieval, View Information, Model Data, Joint Learning.

I. INTRODUCTION

3D objects have been widely applied in a variety of applications [1]–[3], e.g., computer graphics, the medical industry, and virtual reality, due to the fast advances in graphics hardware, computing techniques and networks. Large-scale databases of 3D objects are growing rapidly, which creates a strong demand for effective and efficient 3D object retrieval algorithms.

Recently, extensive research efforts have been dedicated to 3D object retrieval technologies [4]–[7]. Existing 3D object retrieval approaches can be broadly divided into two paradigms, i.e., model-based methods and view-based methods.

Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

This work was supported by the NSFC (Grant nos. 61271435, 61370138, U1301251), the Beijing Natural Science Foundation (4152017, 4141003), the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (IDHT20130225), the National Program on Key Basic Research Projects (973 Program) (2011CB706901-4), and the Instrument Developing Project of the Chinese Academy of Sciences (Grant No. YZ201321).

Ke Lu, Jian Xue, and Jiyang Dong are with the University of Chinese Academy of Sciences, Beijing, 100049, China. Ke Lu is also with the Beijing Center for Mathematics and Information Interdisciplinary Sciences ([email protected], [email protected], [email protected]).

Ning He is with Beijing Union University ([email protected]). Ling Shao is with Northumbria University ([email protected]).

Fig. 1. Example views of two 3D objects.

In model-based methods [8]–[10], 3D objects are described by model-based features, such as low-level features (e.g., the volumetric descriptor [11], the surface distribution [9] and surface geometry [8], [12], [13]) or high-level features, e.g., the method in [14]. In [14], both visual and geometric characteristics are taken into consideration, and a high-level semantic space mapped from the low-level features is further learned with user relevance feedback; this space is another Euclidean space, and the mapping can be regarded as a dimension reduction or feature selection method. One advantage of model-based methods is that they can preserve the global spatial information of 3D objects. Although model-based methods are effective, they require 3D model information explicitly, which limits their applications. The 3D model information is not always available, especially in some practical applications.

In view-based methods [15]–[17], 3D objects are represented by a group of images captured from different directions. Depending on the method, these views may be captured with a static camera array or without such a constraint. For view-based methods, the matching between two 3D objects is accomplished via multiple-view matching. Figure 1 shows some examples of multiple views of 3D objects. View-based methods benefit from existing image processing/matching technologies. These methods make 3D object retrieval more flexible, because they do not require 3D model information. Existing works [18] also show that view-based methods can be highly discriminative for 3D objects and can even provide better retrieval performance than model-based methods [3], [19]. Compared with model-based methods, one disadvantage of view-based methods is that, when the camera array information is not available, it is difficult for them to describe the spatial relationship among different views.

One typical scenario in which 3D model information is not available is when we want to search for objects in the real world.


Fig. 2. The framework of the proposed method.

For example, when a tourist finds some interesting object and wants to find similar ones in a dataset, it is hard to obtain the model information; only several pictures can be taken. In this case, model-based methods cannot work and only image-based methods can be applied. For model-based methods, CAD is a very important application area. Other areas where model-based methods work well are entertainment, such as 3D TV and games, and the medical field, such as tele-medical treatment and diagnosis. It is noted that visual information has recently become more important in the above applications. Both the model information and the view-based information can bring in useful perspectives, which can further improve the performance.

It is noted that most existing methods separate the model-based and the view-based paradigms, and employ either model information or view features for 3D object retrieval. In this work, we propose to jointly employ both the model and the view information for 3D object relevance estimation. In the view part, representative views are first selected for each object, and then the view-level distances are calculated. Following the method in [20], an object hypergraph is constructed using the view star expansion. In the model part, the spatial structure circular descriptor [21] is extracted and a simple graph is generated using the pairwise object distances. In this way, the view information and the model data can be formulated in two graph structures. Learning on the two graphs is conducted to estimate the relevance among 3D objects, in which the graph weights can also be optimized. Figure 2 shows the schematic framework of the proposed approach. Evaluation on three datasets has shown superior 3D object retrieval accuracy compared with the state-of-the-art methods.

The rest of the paper is organized as follows. Related work on 3D object retrieval is reviewed in Section II. The proposed method is presented in Section III. Experiments and discussion are given in Section IV. We conclude the paper in Section V.

II. RELATED WORK

In this section, we briefly review existing methods for 3D object retrieval. To represent 3D objects, low-level features, such as the volumetric descriptor [11] and surface geometry [8], [12], and high-level features, such as the method in [14], were employed in previous works.

For model-based 3D object retrieval, the shape descriptor plays an important role in 3D object representation. According to [22], 3D shape descriptors can be divided into four categories, i.e., histogram-based methods [9], [23]–[25], transform-based methods [26]–[29], graph-based methods [30]–[32] and view-based methods [21], [33], [34].

In histogram-based methods, a histogram-like feature is extracted from the 3D model to collect numerical values of certain attributes. Typical histogram-based descriptors include the shape distribution [9], the generalized shape distribution [23], the extended Gaussian image [24] and the 3D Hough transform [25]. In transform-based methods, transform coefficients are employed as the 3D shape descriptor, such as the 3D Fourier transform [26], the spherical trace transform [27], the radialized extent function [28], and the concrete radialized spherical projection [29]. Graph-based methods aim to represent 3D objects by graph structures, so that the comparison between 3D objects turns into the matching of two graphs. Some typical graph-based methods include Reeb graphs [30], [31] and skeletal graphs [32].

Given the 3D model, a spatial structure circular descriptor (SSCD) was introduced in [21], which projects the model information onto a circular region to preserve the global spatial information of the 3D model. In this method, the histogram of each SSCD view is calculated to measure the distance between two 3D objects. A panoramic view, named PANORAMA, was employed in [35] for 3D model representation. The panoramic view is generated by projecting the model onto the lateral surface of a cylinder, and the distance between two models can be calculated by matching their two PANORAMA images. Leng et al. [14] employ both the DBuffer descriptor [36], which contains six depth buffer images from the front, lateral and vertical views, and GEDT coefficients [37] as the descriptors. These two descriptors are then combined into the TUGE descriptor, which is 982-dimensional. With user feedback, these low-level features are mapped to a high-level semantic space, which is another Euclidean space, and the mapping can be regarded as a dimension reduction or feature selection method. A bipartite graph learning method is introduced in [38], where the comparison between two groups of multiple views is formulated in a bipartite graph. A learning-based method for bipartite graph matching is proposed in [39].

In view-based 3D object retrieval methods, how to generate the multiple views is an important issue. Some existing methods employ predefined camera arrays to capture views, while other works do not have such constraints. The Light Field Descriptor (LFD) [33] is the first view-based 3D object


retrieval method. In LFD, each 3D object is represented by several groups of representative views. Each group contains 10 views, and Zernike moments and Fourier descriptors are employed as the view features. The minimal distance between two groups of views from two compared 3D objects is employed as the pairwise object distance. Different from LFD, the Elevation Descriptor (ED) [34] employs six range views from different directions of a 3D object. A depth histogram is extracted to describe the EDs, and the matching between two groups of EDs is conducted to calculate the distance between two 3D objects. In the Compact Multi-View Descriptor (CMVD) [18], 18 views are captured from the 18 vertices of a 32-hedron. Seven characteristic views are generated in [40] from different directions. In the camera constraint-free method (CCFV) [41], a set of representative views is selected from the originally captured multiple views via view clustering, and a probabilistic matching method is then employed to calculate the similarity between each pair of 3D objects.

Some other methods first generate large-scale raw views, and then select representative views from this big view pool. One typical method is Adaptive Views Clustering (AVC) [15]. In AVC, 320 initial views are first captured, and representative views, generally about 20 to 40, are selected from these raw views. The comparison between 3D objects is formulated as a probabilistic approach that measures the posterior probability of the target object given the query. In [41], a positive matching model and a negative matching model are used to measure the relevance between a target object and the query. This is the first attempt to explore the relevance of one candidate object with both positive and negative samples, and evaluation has shown satisfactory performance.

In [42], the curvature scale space was employed as the view descriptor, which was further combined with Zernike moments to measure the distance between two 3D models. In the depth gradient image (DGI) model [43], both the surface and the contour information are synthesized, which avoids restrictions concerning the layout and visibility of the models in the scene.

Distance estimation between two groups of views is an important problem in view-based 3D object retrieval. Gao et al. [44] proposed a learning-based Hausdorff distance for 3D object retrieval. In this method, a Mahalanobis distance metric is learnt for the view-level distance measure, which is further used in the object-level Hausdorff distance calculation. This method addresses the challenge that the labels are at the object level while the distance metric is at the view level. To estimate the relevance among 3D objects, semi-supervised learning has been investigated in recent years. In [20], a hypergraph structure was employed to formulate the relationship among 3D objects. In this method, view clustering is conducted to generate hyperedges, which connect the 3D objects. Based on different view clustering results, multiple hypergraphs can be constructed, and learning is conducted on the hypergraphs to estimate the relevance among 3D objects. This method extends existing view-based 3D object retrieval to a semi-supervised learning approach, which has been shown to deliver state-of-the-art performance. A Gaussian mixture model (GMM) was used in [45] to formulate the distribution of multiple views of 3D objects. In this method, the KL divergence is employed to measure the distance between two 3D objects.

III. LEARNING VIEW-MODEL JOINT RELEVANCE FOR 3D OBJECT RETRIEVAL

In this section, we introduce the view-model joint relevance learning method for 3D object retrieval. This method explores both the view information and the model data of 3D objects. The proposed method is composed of three key components, as shown in Figure 2. Given the view information of the 3D objects, the proposed method first constructs a hypergraph that formulates the relationship among 3D objects through their view connections. Then, with the model data, a spatial structure circular descriptor is extracted from each 3D model, and the distance between each pair of 3D models is used to generate a simple graph that explores the relationship among 3D models. Finally, learning on the joint view-model graphs is conducted to estimate the relevance among 3D objects.

A. View-based hypergraph generation

Here the view-based hypergraph is generated following the method in [20], briefly introduced as follows. Let O = {O_1, O_2, ..., O_n} denote the n 3D objects in the dataset, and V_i = {v_{i1}, v_{i2}, ..., v_{in_i}} denote the n_i views of the i-th 3D object O_i. In this part, we aim to explore the relevance among 3D objects using the multiple-view information.

Generally, although multiple views can represent rich information about a 3D object, they also bring in redundant data, which may incur a high computational cost and even lead to false results. Here we first select representative views for each 3D object, and only these representative views are employed in the 3D object retrieval process.

Given the n_i views V_i = {v_{i1}, v_{i2}, ..., v_{in_i}} of O_i, we conduct hierarchical agglomerative clustering (HAC) [46] to group these views into view clusters. The HAC method is selected here because it can guarantee that the intra-cluster distance between each pair of views does not exceed a given threshold. The widely employed Zernike moments [47] are used as the view features; they are robust to image rotation, scaling and translation, and have been used in many 3D object retrieval tasks [15], [20], [33], [48]. The 49-D Zernike moments are extracted from each view of the 3D objects. Based on the view clustering results, one representative view is selected from each view cluster. We let V_i = {v_{i1}, v_{i2}, ..., v_{im_i}} denote the m_i representative views of O_i. In our experiments, m_i mostly ranges from 5 to 20.
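For illustration, the following sketch shows one way this selection step could be implemented, assuming each view is already described by its 49-D Zernike moment vector; the function name, the medoid rule and the value of the clustering threshold are illustrative assumptions rather than details specified in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def select_representative_views(features, threshold):
    """Group views by complete-linkage HAC, so that the pairwise distance
    inside each cluster never exceeds `threshold`, then keep one medoid
    view per cluster. `features`: (n_views, 49) Zernike moment vectors."""
    dists = pdist(features)                      # condensed pairwise distances
    labels = fcluster(linkage(dists, method="complete"),
                      t=threshold, criterion="distance")
    dmat = squareform(dists)
    reps = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # medoid: the view minimizing the summed distance to its cluster mates
        reps.append(idx[np.argmin(dmat[np.ix_(idx, idx)].sum(axis=1))])
    return sorted(reps)
```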

Hypergraphs have been used in many multimedia information retrieval tasks, such as image retrieval [49], [50], and have shown superiority in representing high-order relationships. In our work, we propose to employ the star expansion to construct an object hypergraph with views, which formulates the relationship among 3D objects. We denote the object hypergraph by G_H = (V_H, E_H, W_H). For the n objects in the dataset, there are n vertices in G_H, where each vertex represents one 3D object.


Fig. 3. An illustration of hyperedge construction. In this figure, there are seven objects with representative views. One view from O4 is selected as the central view, and its four closest views, which come from O1, O3, O6 and O7, are located in the figure. The corresponding hyperedge then connects O1, O3, O4, O6 and O7.

The hyperedges are generated as follows. Assume there are in total n_r representative views for all n objects. We first calculate the Zernike moments-based distance between each pair of views, so that the top K closest views can be found for each representative view. For each representative view, one hyperedge is constructed, which connects the objects owning views among the top K closest views. In our experiments, K is set to 10. Figure 3 shows an example of hyperedge generation.

Generally, n_r hyperedges can be generated for G_H. The weight of a hyperedge e_H can be calculated by

\[
w_H(e_H) = \frac{1}{K} \sum_{v_x} \exp\left(-\frac{d(v_x, v_c)^2}{\sigma_H^2}\right), \qquad (1)
\]

where v_c is the central view of the hyperedge, v_x is one of the top K closest views to v_c, d(v_x, v_c) is the distance between v_c and v_x, and σ_H is empirically set as the median of all view-pair distances.
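A minimal sketch of this star-expansion construction and the weighting of Eq. (1) is given below; the function name, the `view_owner` array and the dense distance computation are illustrative assumptions.

```python
import numpy as np

def build_hyperedges(view_feats, view_owner, K=10):
    """Star expansion: one hyperedge per representative view, connecting the
    objects that own its K closest views, weighted by Eq. (1).
    `view_feats`: (n_r, d) features; `view_owner[i]`: object index of view i."""
    n_r = view_feats.shape[0]
    d = np.linalg.norm(view_feats[:, None, :] - view_feats[None, :, :], axis=2)
    sigma_h = np.median(d[np.triu_indices(n_r, k=1)])  # median view-pair distance
    edges, weights = [], []
    for c in range(n_r):                               # c is the central view
        nn = np.argsort(d[c])[1:K + 1]                 # top K closest views
        verts = {view_owner[i] for i in nn} | {view_owner[c]}
        edges.append(sorted(verts))
        weights.append(np.exp(-d[c, nn] ** 2 / sigma_h ** 2).sum() / K)
    return edges, np.array(weights)
```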

Given the object hypergraph G_H = (V_H, E_H, W_H), the incidence matrix H can be generated by

\[
h(v_H, e_H) =
\begin{cases}
1 & \text{if } v_H \in e_H \\
0 & \text{if } v_H \notin e_H
\end{cases}
\qquad (2)
\]

The vertex degree of v_H can be defined as

\[
\rho(v_H) = \sum_{e_H \in E_H} w_H(e_H)\, h(v_H, e_H). \qquad (3)
\]

The edge degree of e_H can be defined as

\[
\rho(e_H) = \sum_{v_H \in V_H} h(v_H, e_H). \qquad (4)
\]

The vertex degree matrix and the edge degree matrix can be denoted by two diagonal matrices D_v and D_e, respectively.

In the constructed hypergraph, when two 3D objects share more similar views, they are connected by more hyperedges with high weights, which indicates a high correlation between these 3D objects.
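Under the same assumptions as the sketch above, the incidence and degree matrices of Eqs. (2)-(4) can be assembled as follows (dense matrices are used purely for clarity):

```python
import numpy as np

def hypergraph_matrices(edges, weights, n):
    """Incidence matrix H (Eq. (2)) and the diagonal vertex/edge degree
    matrices Dv (Eq. (3)) and De (Eq. (4)) for n objects."""
    H = np.zeros((n, len(edges)))
    for j, verts in enumerate(edges):
        H[verts, j] = 1.0
    Wh = np.diag(weights)              # hyperedge weight matrix
    Dv = np.diag(H @ weights)          # rho(v) = sum_e w_H(e) h(v, e)
    De = np.diag(H.sum(axis=0))        # rho(e) = sum_v h(v, e)
    return H, Wh, Dv, De
```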

B. Model-based graph generation

Given the model data of the 3D objects, we further explore the model-based object relationship. The spatial structure circular descriptor (SSCD) [21] is employed as the model feature. SSCD aims to represent the depth information of the model surface on the projected minimal bounding box of the 3D model, and a depth histogram is generated as the feature of the 3D model. Following [21], bipartite graph matching is conducted to measure the distance between each pair of 3D models, i.e., d_SSCD(O_i, O_j).
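The exact matching procedure of [21] is not reproduced here; as a rough sketch, a generic min-cost bipartite matching between two sets of depth histograms could be computed as below, where the L1 histogram cost and the mean aggregation are illustrative choices only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_distance(hists_a, hists_b):
    """Min-cost bipartite matching between two sets of depth histograms
    (one histogram per row); an illustrative stand-in for d_SSCD of [21]."""
    cost = np.abs(hists_a[:, None, :] - hists_b[None, :, :]).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
    return cost[rows, cols].mean()
```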

The relationship among 3D objects is then formulated in a simple object graph structure G = (V, E, W). Each vertex in G represents one 3D object, i.e., there are n vertices in G. The weight of an edge e(i, j) in G is calculated from the similarity between the two corresponding 3D objects O_i and O_j as

\[
W(v_i, v_j) = \exp\left(-\frac{d_{SSCD}(v_i, v_j)^2}{\sigma_s^2}\right), \qquad (5)
\]

where d_SSCD(v_i, v_j) is the distance between O_i and O_j, and σ_s is set as the median of all model-pair distances.
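A minimal sketch of Eq. (5), assuming the pairwise SSCD distance matrix has already been computed:

```python
import numpy as np

def model_graph(d_sscd):
    """Edge weights W of Eq. (5) from the pairwise SSCD distance matrix,
    with sigma_s set to the median model-pair distance."""
    n = d_sscd.shape[0]
    sigma_s = np.median(d_sscd[np.triu_indices(n, k=1)])
    W = np.exp(-d_sscd ** 2 / sigma_s ** 2)
    np.fill_diagonal(W, 0.0)                   # no self-loops
    return W
```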

C. Learning on the joint graphs

Now we have two formulations of the relationship among 3D objects, i.e., view-based and model-based. These two formulations are jointly explored to estimate the relevance among 3D objects.

In this part, we first introduce the learning framework in which the view-based and model-based information are regarded with equal weight, and then we propose a joint learning framework that learns the optimal combination weight for each modality.

1) The initial learning framework: We start from the learning framework that regards the different modalities, i.e., model and view, as equal. The 3D object retrieval task can be formulated as a one-class classification problem, as shown in [51]. The main objective is to learn the optimal pairwise object relevance under both the graph and the hypergraph structure. Given the initially labeled data (the query object in our case), an empirical loss term can be added as a constraint for the learning process. The transductive inference can be formulated as a regularization problem:

\[
\arg\min_{f} \left\{ \Omega_V(f) + \Omega_M(f) + \mu R(f) \right\} \qquad (6)
\]

In this formulation, f is the to-be-learnt relevance vector, Ω_V(f) is the regularizer on the view-based hypergraph structure, Ω_M(f) is the regularizer on the model-based graph structure, and R(f) is the empirical loss. This objective function aims to minimize the empirical loss and the regularizers on the model-based graph and the view-based hypergraph simultaneously, which leads to the optimal relevance vector f for retrieval. The two regularizers and the empirical loss term are defined as follows.

The view-based hypergraph regularizer Ω_V(f) is defined as

\[
\begin{aligned}
\Omega_V(f) &= \frac{1}{2} \sum_{e_H \in E_H} \sum_{u,v \in V_H} \frac{w_H(e_H)\, h(u, e_H)\, h(v, e_H)}{\rho(e_H)} \left(\frac{f(u)}{\sqrt{\rho(u)}} - \frac{f(v)}{\sqrt{\rho(v)}}\right)^2 \\
&= \sum_{e_H \in E_H} \sum_{u,v \in V_H} \frac{w_H(e_H)\, h(u, e_H)\, h(v, e_H)}{\rho(e_H)} \left(\frac{f^2(u)}{\rho(u)} - \frac{f(u) f(v)}{\sqrt{\rho(u)\rho(v)}}\right) \\
&= f^T (I - \Theta_H)\, f, \qquad (7)
\end{aligned}
\]


where Θ_H is defined as Θ_H = D_v^{-1/2} H W_H D_e^{-1} H^T D_v^{-1/2}. Denoting Δ_H = I − Θ_H, Ω_V(f) can be written as

\[
\Omega_V(f) = f^T \Delta_H f. \qquad (8)
\]

The model-based graph regularizer Ω_M(f) is defined as

\[
\begin{aligned}
\Omega_M(f) &= \frac{1}{2} \sum_{u,v \in V} W(u, v) \left(\frac{f(u)}{\sqrt{d(u)}} - \frac{f(v)}{\sqrt{d(v)}}\right)^2 \\
&= \sum_{u,v \in V} W(u, v) \left(\frac{f^2(u)}{d(u)} - \frac{f(u) f(v)}{\sqrt{d(u) d(v)}}\right) \\
&= f^T (I - \Theta_S)\, f, \qquad (9)
\end{aligned}
\]

where d(u) is the degree of vertex u in G and Θ_S = D^{-1/2} W D^{-1/2}. Denoting Δ_S = I − Θ_S, Ω_M(f) can be written as

\[
\Omega_M(f) = f^T \Delta_S f. \qquad (10)
\]

The empirical loss term R(f) is defined as

\[
R(f) = \| f - y \|^2, \qquad (11)
\]

where y is the initial label vector. In the retrieval process, it is defined as an n × 1 vector in which only the query entry is set to 1 and all other components are set to 0.

Now the objective function can be rewritten as

\[
\arg\min_{f} \left\{ f^T \Delta_H f + f^T \Delta_S f + \mu \| f - y \|^2 \right\}. \qquad (12)
\]

Setting the gradient of Eq. (12) with respect to f to zero gives (Δ_H + Δ_S + μI) f = μy, so f can be solved in closed form as

\[
f = \left( I + \frac{1}{\mu}(\Delta_H + \Delta_S) \right)^{-1} y. \qquad (13)
\]

f encodes the relevance of all the objects in the dataset with respect to the query object: the higher the relevance value, the more similar the object is to the query. With the generated object relevance f, all the objects in the dataset can be sorted in descending order according to f.
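As a sketch, the closed-form solution of Eq. (13) can be computed from the matrices built in Sections III-A and III-B as follows (dense numpy matrices; the value of μ is a free parameter):

```python
import numpy as np

def relevance(H, Wh, Dv, De, W, y, mu=10.0):
    """Closed-form relevance vector of Eq. (13), given the hypergraph
    matrices of Section III-A and the model graph W of Section III-B."""
    n = H.shape[0]
    Dv_isq = np.diag(1.0 / np.sqrt(np.diag(Dv)))
    Theta_H = Dv_isq @ H @ Wh @ np.linalg.inv(De) @ H.T @ Dv_isq
    D_isq = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    Theta_S = D_isq @ W @ D_isq
    Delta_H = np.eye(n) - Theta_H
    Delta_S = np.eye(n) - Theta_S
    f = np.linalg.solve(np.eye(n) + (Delta_H + Delta_S) / mu, y)
    return f, Delta_H, Delta_S
```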

2) Learning the combination weights: We note that the view information and the model information may not have the same impact on 3D object representation. In some scenarios the view information may be more important, while in other cases the model data may play the more important role. Under such circumstances, we further learn the optimal weights for the view information and the model data. In this part, we introduce the learning framework with embedded combination weight learning. The objective of the learning process is composed of three parts, i.e., the graph/hypergraph structure regularizers, the empirical loss, and the combination weight regularizer.

Let α and β denote the combination weights for the view-based and the model-based information, respectively, where α + β = 1. After adding an ℓ2 norm on the combination weights, the objective function can be further revised as

\[
\arg\min_{f, \alpha, \beta} \left\{ \alpha f^T \Delta_H f + \beta f^T \Delta_S f + \mu \| f - y \|^2 + \eta (\alpha^2 + \beta^2) \right\}, \qquad (14)
\]

where α + β = 1.

The solution to the above optimization task is obtained as follows. We alternately optimize f and α/β. First, we fix α and β and optimize f; the objective function becomes

\[
\arg\min_{f} \left\{ \alpha f^T \Delta_H f + \beta f^T \Delta_S f + \mu \| f - y \|^2 \right\}. \qquad (15)
\]

Following Eq. (13), it can be solved by

\[
f = \left( I + \frac{1}{\mu}(\alpha \Delta_H + \beta \Delta_S) \right)^{-1} y. \qquad (16)
\]

Then we optimize α/β with f fixed. Employing the Lagrangian method, the objective function becomes

\[
\arg\min_{\alpha, \beta} \left\{ \alpha f^T \Delta_H f + \beta f^T \Delta_S f + \eta (\alpha^2 + \beta^2) + \xi (\alpha + \beta - 1) \right\}. \qquad (17)
\]

Solving the above optimization problem, we obtain

\[
\xi = -\frac{f^T \Delta_H f + f^T \Delta_S f}{2} - \eta, \qquad (18)
\]

\[
\alpha = \frac{1}{2} - \frac{f^T \Delta_H f - f^T \Delta_S f}{4\eta}, \qquad (19)
\]

and

\[
\beta = \frac{1}{2} - \frac{f^T \Delta_S f - f^T \Delta_H f}{4\eta}. \qquad (20)
\]

The above alternating optimization is repeated until the optimal f is achieved, which is then used for 3D object retrieval. With the learned combination weights, the model-based and view-based data can be optimally explored simultaneously and the relevance vector f can be obtained. The main merit of the proposed method is that it jointly explores the view information and the model data of 3D objects in a hypergraph/graph framework for 3D object retrieval.
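A sketch of this alternating scheme under the definitions above is given below; the iteration count and the values of μ and η are illustrative, and the closed-form updates of Eqs. (19)-(20) assume η is large enough to keep α and β within [0, 1].

```python
import numpy as np

def joint_relevance(Delta_H, Delta_S, y, mu=10.0, eta=1.0, iters=10):
    """Alternate between Eq. (16) (solve f with alpha, beta fixed) and
    Eqs. (19)-(20) (update the combination weights with f fixed)."""
    n = len(y)
    alpha = beta = 0.5                       # start from equal weighting
    for _ in range(iters):
        A = np.eye(n) + (alpha * Delta_H + beta * Delta_S) / mu
        f = np.linalg.solve(A, y)
        a = f @ Delta_H @ f                  # f^T Delta_H f
        b = f @ Delta_S @ f                  # f^T Delta_S f
        alpha = 0.5 - (a - b) / (4.0 * eta)  # Eq. (19); beta follows Eq. (20)
        beta = 1.0 - alpha
    return f, alpha, beta
```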

IV. EXPERIMENT

To evaluate the proposed approach for 3D object retrieval, we conduct experiments on three datasets and compare our approach with state-of-the-art methods. In this section, we first introduce the testing datasets, followed by the compared methods and the evaluation criteria. Then we provide the experimental results and comparisons.

A. Testing datasets

In our experiments, three datasets are employed to evaluate the performance of the proposed method, i.e., the National Taiwan University 3D model database (NTU) [33], the Princeton Shape Benchmark (PSB) [19] and the Shape Retrieval Contest 2009 (SHREC) dataset [2].

The NTU dataset is composed of 500 models from 50 categories, such as bed, car, tank, truck, and plane. Each category contains 10 3D models, and each object has a corresponding 3D model. To generate the multiple views of each 3D model, a virtual camera array of 60 cameras is employed. These 60 virtual cameras are set on the vertices of a polyhedron with the same structure as Buckminsterfullerene (C60). In this way, we obtain 60 views for each 3D model. The PSB dataset [19] contains 1,814


models from 161 classes. The SHREC dataset [2] contains 800 3D models from 40 categories, with 20 3D models per category. The 60-camera array used for the NTU dataset is employed here to capture 60 views for each 3D model in the PSB and SHREC datasets. Figure 4 shows some example 3D objects.

(a) Example views of 3D models in the NTU dataset

(b) Example views of 3D models in the PSB dataset

Fig. 4. 3D object examples in the NTU and PSB datasets.

B. Compared methods

In our experiments, the following state-of-the-art methods are selected for comparison to evaluate the performance of the proposed method.

• Elevation Descriptor (ED) [34]. ED employs six range views as the descriptor, which contain the altitude information of the 3D model from six directions. The distance between two models is calculated by matching their ED descriptors.

• Extension Ray-based Descriptor (ERD) [52]. In ERD, a ray-based descriptor is employed for 3D model representation. Concentric spheres are used to extract the surface information, and each sampled surface point is given a corresponding value on the nearest sphere surface. The distance between two models is measured by matching these feature vectors.

• Query View Selection (QVS) [53]. In QVS, the employed query views are incrementally selected with relevance feedback, and a distance metric is learnt for each newly selected query view.

• Hypergraph Learning (HL) [20]. In HL, the relevance among 3D objects is formulated in a hypergraph structure, where the hyperedges are generated using the view clustering results. Semi-supervised learning is conducted to learn the relevance among 3D objects.

• Distance Combination (DC). In DC, both the model-based distance and the view-based distance are employed for 3D object matching. This method calculates the model-based distance and the view-based distance first, and then combines these two distances into the pairwise 3D object distance.

• Learning View-Model Joint Relevance (VMJR), i.e., the proposed method.

We have implemented ED, ERD, QVS and HL following the descriptions in [20], [34], [52], [53]. In the experiments, for each dataset, one 3D model at a time is selected as the query, and 3D object retrieval is conducted in the dataset. Every 3D model is employed as the query once. The average results over all queries in each dataset are employed for comparison.

C. Evaluation criteria

In our experiments, the following criteria are employed to evaluate the retrieval performance of the different methods.

• Precision-Recall Curve [15]. The Precision-Recall curve comprehensively demonstrates the retrieval performance: it illustrates the precision and recall measures obtained by varying the threshold that distinguishes relevance from irrelevance in model retrieval. Precision and Recall are measured according to

\[
\text{Precision} = \frac{\#\{(\text{all relevant models}) \cap (\text{retrieved models})\}}{\#(\text{retrieved models})}
\]

and

\[
\text{Recall} = \frac{\#\{(\text{all relevant models}) \cap (\text{retrieved models})\}}{\#(\text{all relevant models})}.
\]

• F-Measure (F). The F-measure considers the top 20 returned results for each query and is defined as

\[
F = \frac{2 \times P_{20} \times R_{20}}{P_{20} + R_{20}},
\]

where P_{20} and R_{20} are the precision and recall values of the top 20 retrieval results.

• Average normalized modified retrieval rank (ANMRR) [54]. ANMRR evaluates the ranking performance by considering the ranking order. A low ANMRR value indicates high precision of the top returned results.
To calculate the ANMRR, we first introduce the average retrieval rank (ARR). Given the k-th query Q_k, the top S_k = min{4 × τ_k, 2 × τ_max} returned results are taken into consideration, where τ_k is the number of relevant objects for Q_k, and τ_max is the maximal number of relevant objects over all queries. Among these results, if the i-th relevant object is retrieved within the top S_k, its rank r(i) is its ranking position; otherwise r(i) = S_k + 1. The ARR is accordingly calculated as

\[
\text{ARR}(Q_k) = \frac{1}{\tau_k} \sum_{i=1}^{\tau_k} r(i). \qquad (21)
\]

The modified retrieval rank (MRR) can be calculated according to

\[
\text{MRR}(Q_k) = \text{ARR}(Q_k) - \frac{\tau_k}{2} - 0.5. \qquad (22)
\]


The MRR can then be normalized to obtain the normalized MRR (NMRR) as

\[
\text{NMRR}(Q_k) = \frac{\text{MRR}(Q_k)}{S_k - \frac{\tau_k}{2} + 0.5}. \qquad (23)
\]

The ANMRR is obtained by averaging the NMRR values over all queries:

\[
\text{ANMRR} = \frac{1}{n_q} \sum_{k=1}^{n_q} \text{NMRR}(Q_k), \qquad (24)
\]

where n_q is the number of queries.
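From these definitions, the per-query criteria can be computed directly from the 1-based ranking positions of the relevant objects, as in the following sketch (ANMRR is then the mean of the per-query NMRR values):

```python
import numpy as np

def nmrr(relevant_positions, tau, tau_max):
    """NMRR of one query (Eqs. (21)-(23)). `relevant_positions` holds the
    1-based ranking positions of the tau relevant objects."""
    S = min(4 * tau, 2 * tau_max)
    ranks = [p if p <= S else S + 1 for p in relevant_positions]
    arr = sum(ranks) / tau                       # Eq. (21)
    mrr = arr - tau / 2.0 - 0.5                  # Eq. (22)
    return mrr / (S - tau / 2.0 + 0.5)           # Eq. (23)

def f_measure(relevant_positions, tau, cutoff=20):
    """F-measure over the top-20 results."""
    hits = sum(1 for p in relevant_positions if p <= cutoff)
    if hits == 0:
        return 0.0
    p20, r20 = hits / cutoff, hits / tau
    return 2 * p20 * r20 / (p20 + r20)

# ANMRR (Eq. (24)) is the mean of nmrr(...) over all queries.
```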

D. Experimental results

Figure 5 provides the Precision-Recall curves of the compared methods on the NTU, SHREC and PSB datasets, and Figure 6 illustrates the quantitative measures on the three datasets. As shown in these results, the proposed VMJR method outperforms all the compared methods and achieves the best 3D object retrieval results. From the experimental results, we can make the following observations.

• The view-based methods achieve better results than the model-based methods. As shown in Figures 5 and 6, QVS and HL outperform ED and ERD. For example, on the NTU dataset, HL achieves gains of 95.53% and 66.30% in terms of the F measure and gains of 20.13% and 16.38% in terms of the ANMRR measure compared with ED and ERD, respectively. Similarly, on the SHREC dataset, HL achieves gains of 49.04% and 58.70% in terms of the F measure and gains of 28.63% and 28.10% in terms of the ANMRR measure compared with ED and ERD, respectively. A similar conclusion can be drawn on the PSB dataset. This advantage of view-based methods can be attributed to the highly discriminative information of multiple views for 3D object representation.

• Compared with the model-based methods, the proposed VMJR method achieves better retrieval accuracy. For instance, on the NTU dataset, VMJR achieves gains of 120.61% and 87.63% in terms of the F measure and gains of 32.38% and 29.20% in terms of the ANMRR measure compared with ED and ERD, respectively. Similarly, on the SHREC dataset, VMJR achieves gains of 61.22% and 71.66% in terms of the F measure and gains of 34.87% and 34.38% in terms of the ANMRR measure compared with ED and ERD, respectively. A similar conclusion can be drawn on the PSB dataset. The proposed method employs both the view information and the model data for 3D object representation. By jointly exploring the two parts of information in the hypergraph/graph structure, the proposed method can achieve a better 3D object representation, which further leads to better 3D object retrieval performance.

• Compared with QVS, the proposed VMJR achieves better performance. VMJR achieves gains of 40.00%, 17.13% and 16.08% in terms of the F measure and gains of 25.82%, 16.70% and 18.00% in terms of the ANMRR measure compared with QVS on the NTU, SHREC and PSB datasets, respectively. The proposed method benefits

(a) PR curves in the NTU dataset

(b) PR curves in the SHREC dataset

(c) PR curves in the PSB dataset

Fig. 5. Performance comparison of the compared methods in terms of PR curves in the NTU, SHREC and PSB datasets.

from the joint graph formulation and the richer 3D object information.

• In comparison with HL, the proposed method outperforms HL on all datasets. VMJR achieves gains of 12.82%, 8.16% and 10.89% in terms of the F measure and gains of 15.34%, 8.73% and 14.78% in terms of the ANMRR measure compared with HL on the NTU, SHREC and PSB datasets, respectively. Both HL and VMJR employ graph-based learning, but the proposed VMJR considers not only the view information but also the model data, formulating them in two different structures. VMJR can thus explore more complete information and learn better object relevance through the joint graph structures.

• Compared with DC, the proposed method achieves better


(a) F and ANMRR in the NTU dataset

(b) F and ANMRR in the SHREC dataset

(c) F and ANMRR in the PSB dataset

Fig. 6. Performance comparison of the compared methods in terms of F and ANMRR in the NTU, SHREC and PSB datasets.

accuracy in 3D object retrieval. More specifically, VMJR obtains gains of 18.96%, 12.17% and 13.75% in terms of the F measure and gains of 18.92%, 13.59% and 16.60% in terms of the ANMRR measure compared with DC on the NTU, SHREC and PSB datasets, respectively. These results demonstrate that the proposed VMJR explores the joint view-model information better.

• It is noted that when the query is outside of the database, we need to further embed it into the existing graph structure. First, the graph structure of the entire database is obtained. When a new object arrives, the similarity between this object and the other objects is calculated, and a new graph structure should be generated. Automatically updating the graph structure is quite a hard task, which we leave for future work.

E. On the graph weights

In this subsection, we further evaluate the influence of the graph weights α and β for the hypergraph part and the graph part. α and β are used to balance the impact of the view information and the model data in 3D object representation and relevance estimation. A large α value makes the method focus on the view information more than on the model data, while a small α value makes it focus on the model data more than on the view information. It is noted that for different 3D object data, the view information and the model data can have varied influence.

Figure 7 compares the proposed method with its equal-weighting variant in terms of F and ANMRR. As shown in the results, with the learnt weights, the proposed method achieves better 3D object retrieval performance than with equal weighting.

Fig. 7. Performance comparison of the proposed method with and without weight learning in terms of F and ANMRR in the NTU, SHREC and PSB datasets.

F. On parameter μ

In this subsection, we evaluate the influence of the parameter μ on the 3D object retrieval performance. μ modulates the effect of the loss term. We vary μ from 0.01 to 1000, and the performance in terms of F and ANMRR is shown in Figure 8. As shown in the results, the proposed method achieves satisfactory results when μ varies over a large range. When μ is too small, the regularization on the graph/hypergraph structures dominates the learning process; when μ is too large, the influence of this regularization is diminished. Experiments show that μ = 10 is a good choice for retrieval performance.

G. Comparison with the reported results of state-of-the-art methods

In this part, we further compare the proposed method with the results reported on the NTU and PSB datasets. Figure 9 shows the comparison on the two datasets, where the results of the other compared methods are taken directly from the performance reported in the corresponding papers.


(a) In the NTU dataset

(b) In the SHREC dataset

(c) In the PSB dataset

Fig. 8. Performance with respect to the variation of μ in terms of F and ANMRR in the NTU, SHREC and PSB datasets.

CCFV [41] captures a group of multiple views and then selects a set of representative views via view clustering; a probabilistic matching method is then employed to calculate the similarity between each pair of 3D objects. In GMM [45], the features of the views are formulated in a Gaussian mixture model, and the Kullback-Leibler divergence between two GMMs is calculated to measure the distance between two 3D objects. In the visual-topic-model method (VTM) [55], latent Dirichlet allocation is employed to formulate the visual features of the multiple views, and the Kullback-Leibler divergence is used to measure the distance between two 3D objects. As shown in these results, the proposed method achieves better results in comparison with the reported state-of-the-art methods.

H. Computational cost

In this part, we analyze the computational cost of the proposed method. According to the algorithm introduced in Section III, the computational cost comes from three parts, i.e., the generation of the view-based hypergraph, the generation of the model-based graph, and the learning on the joint graphs. For the generation of the view-based hypergraph, the computational complexity is O(n·n_av), where n_av is the average number of views per object. The computational cost of the model-based graph generation is O(n). The computational cost of the learning procedure scales as O(n³t), where t is the number of iterations, which is around 10 in our experiments.

Figure 10 shows the computational cost of the proposed method on the three testing datasets, where the experiments were conducted on a PC with an i5 2.4 GHz CPU and 16 GB memory. As shown in the results, the running time for each

(a) F and ANMRR in the NTU dataset

(b) F and ANMRR in the PSB dataset

Fig. 9. Performance comparison with the reported results in the NTU and PSB datasets.

query increases as the dataset becomes larger, but the time remains affordable in practice.

Fig. 10. The running time of the proposed method in the NTU, SHREC and PSB datasets.

V. CONCLUSION

In this paper, we propose a joint view-model relevance learning method for 3D object retrieval. This method employs both the view information and the model data of 3D objects to jointly learn the relevance among 3D objects in hypergraph/graph structures. For the view information, an object hypergraph is generated using the view star expansion. For the model data, an object graph is constructed using the pairwise object distances. Learning on the graphs is conducted to estimate the object relevance, which can be used for 3D object retrieval. Evaluations on the NTU, SHREC and PSB datasets demonstrate superior retrieval accuracy in comparison with the state-of-the-art methods.

The proposed method is flexible with respect to the 3D object data. When both the view information and the model data are available, it can employ all of them in the framework. When either the view information or the model data is not available, the proposed


method degrades to a single-modal method, which simply employs the view content or the model feature individually.

REFERENCES

[1] A. del Bimbo and P. Pala, "Content-based retrieval of 3D models," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 2, no. 1, pp. 20–43, 2006.
[2] A. Godil, H. Dutagaci, C. Akgul, A. Axenopoulos, B. Bustos, M. Chaouch, P. Daras, T. Furuya, S. Kreft, Z. Lian, T. Napoleon, A. Mademlis, R. Ohbuchi, P. L. Rosin, B. Sankur, T. Schreck, X. Sun, M. Tezuka, A. Verroust-Blondet, M. Walter, and Y. Yemez, "SHREC'09 track: Generic shape retrieval," in Proceedings of the Eurographics Workshop on 3D Object Retrieval, Munich, Germany, 2009.
[3] B. Bustos, D. Keim, D. Saupe, T. Schreck, and D. Vranic, "Feature-based similarity search in 3D object databases," ACM Computing Surveys, vol. 37, no. 4, pp. 345–387, 2005.
[4] J. W. H. Tangelder and R. C. Veltkamp, "A survey of content based 3D shape retrieval methods," Multimedia Tools and Applications, vol. 39, pp. 441–471, 2008.

[5] Y. Yang, H. Lin, and Y. Zhang, "Content-based 3D model retrieval: A survey," IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, vol. 37, pp. 1081–1098, 2007.

[6] K. Lu, Q. Wang, J. Xue, and W. Pan, “3d model retrieval and clas-sification by semi-supervised learning with content-based similarity,”Information Sciences, vol. 281, pp. 703–713, 2014.

[7] K. Lu, N. He, and J. Xue, “Content-based similarity for 3d modelretrieval and classification,” Progress in Natural Science, vol. 19, no. 4,pp. 495–499, 2009.

[8] A. E. Johnson and M. Hebert, “Using spin images for efficient objectrecognition in cluttered 3D scenes,” IEEE Transactions on PatternAnalysis and Machine Intelligence, vol. 21, no. 5, pp. 433–449, 1999.

[9] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, “Shape distri-butions,” ACM Transactions on Graphic, vol. 21, no. 4, pp. 807–832,2002.

[10] E. Paquet and M. Rioux, “Nefertiti: A query by content system for three-dimensional model and image databases management,” Image VisionComputing, vol. 17, pp. 157–166, 1999.

[11] J. Tangelder and R. Veltkamp, “Polyhedral model retrieval using weight-ed point sets,” International Journal of Image and Graphics, vol. 3,no. 1, pp. 209–229, 2003.

[12] C. Ip, D. Lapadat, L. Soeger, and W. C. Regli, “Using shape distributionsto compare solid models,” in Proc. ACM Symposium on Solid Modelingand Applications, 2002, pp. 273–280.

[13] A. Makadia and K. Daniilidis, “Spherical correlation of visual repre-sentations for 3D model retrieval,” International Journal of ComputerVision, vol. 89, no. 2, pp. 193–210, 2010.

[14] B. Leng and Z. Xiong, “Modelseek: an effective 3D model retrievalsystem,” Multimedia Tools and Applications, 2010.

[15] T. Filali Ansary, M. Daoudi, and J. P. Vandeborre, “A Bayesian 3Dsearch engine using adaptive views clustering,” IEEE Transactions onMultimedia, vol. 9, no. 1, pp. 78–88, 2007.

[16] R. Ohbuchi, K. Osada, T. Furuya, and T. Banno, “Salient local visualfeatures for shape-based 3D model retrieval,” in Proceedings of IEEEConference on Shape Modeling and Applications, 2008, pp. 93–102.

[17] W. Li, G. Bebis, and N. Bourbakis, “3D object recognition using 2Dviews,” IEEE Transactions on Image Processing, vol. 17, no. 11, pp.2236–2255, 2008.

[18] P. Daras and A. Axenopoulos, “A 3D shape retrieval framework sup-porting multimodal queries,” International Journal of Computer Vision,vol. 89, no. 2, pp. 229–247, 2010.

[19] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, “The Princetonshape benchmark,” in Proceedings of Shape Modeling International,2004, pp. 167–178.

[20] Y. Gao, M. Wang, D. Tao, R. Ji, and Q. Dai, “3D object retrievaland recognition with hypergraph analysis,” IEEE Transactions on ImageProcessing, vol. 21, no. 9, pp. 4290–4303, 2012.

[21] Y. Gao, Q. H. Dai, and N. Y. Zhang, “3D model comparison usingspatial structure circular descriptor,” Pattern Recognition, vol. 43, no. 3,pp. 1142–1151, 2010.

[22] C. B. Akgul, B. Sankur, Y. Yemez, and F. Schmitt, “3d model retrievalusing probability density-based shape descriptors,” Pattern Analysis andMachine Intelligence, IEEE Transactions on, vol. 31, no. 6, pp. 1117–1133, 2009.

[23] Y. Liu, H. Zha, and H. Qin, “The generalized shape distributions forshape matching and analysis,” in IEEE International Conference onShape Modeling and Applications. IEEE, 2006, pp. 16–16.

[24] S. B. Kang and K. Ikeuchi, “The complex egi: A new representation for3-d pose determination,” IEEE Transactions on Pattern Analysis andMachine Intelligence, vol. 15, no. 7, pp. 707–721, 1993.

[25] T. Zaharia and F. Preteux, “Shape-based retrieval of 3D mesh models,” in Proceedings of the IEEE International Conference on Multimedia and Expo, vol. 1, 2002, pp. 437–440.

[26] H. Dutagaci, B. Sankur, and Y. Yemez, “Transform-based methods for indexing and retrieval of 3D objects,” in Proceedings of the Fifth International Conference on 3-D Digital Imaging and Modeling, 2005, pp. 188–195.

[27] D. Zarpalas, P. Daras, A. Axenopoulos, D. Tzovaras, and M. G. Strintzis, “3D model search and retrieval using the spherical trace transform,” EURASIP Journal on Advances in Signal Processing, vol. 2007, 2006.

[28] D. V. Vranic, “An improvement of rotation invariant 3D-shape based on functions on concentric spheres,” in Proceedings of the IEEE International Conference on Image Processing, vol. 3, 2003, pp. 757–760.

[29] P. Papadakis, I. Pratikakis, S. Perantonis, and T. Theoharis, “Efficient 3D shape matching and retrieval using a concrete radialized spherical projection representation,” Pattern Recognition, vol. 40, no. 9, pp. 2437–2452, 2007.

[30] M. Hilaga, Y. Shinagawa, T. Kohmura, and T. L. Kunii, “Topology matching for fully automatic similarity estimation of 3D shapes,” in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 203–212.

[31] T. Tung and F. Schmitt, “The augmented multiresolution Reeb graph approach for content-based retrieval of 3D shapes,” International Journal of Shape Modeling, vol. 11, no. 1, pp. 91–120, 2005.

[32] H. Sundar, D. Silver, N. Gagvani, and S. Dickinson, “Skeleton based shape matching and retrieval,” in Proceedings of Shape Modeling International, 2003, pp. 130–139.

[33] D. Y. Chen, X. P. Tian, Y. T. Shen, and M. Ouhyoung, “On visual similarity based 3D model retrieval,” Computer Graphics Forum, vol. 22, no. 3, pp. 223–232, 2003.

[34] J. L. Shih, C. H. Lee, and J. T. Wang, “A new 3D model retrieval approach based on the elevation descriptor,” Pattern Recognition, vol. 40, pp. 283–295, 2007.

[35] P. Papadakis, I. Pratikakis, T. Theoharis, and S. Perantonis, “PANORAMA: A 3D shape descriptor based on panoramic views for unsupervised 3D object retrieval,” International Journal of Computer Vision, vol. 89, no. 2, pp. 177–192, 2010.

[36] D. V. Vranic, “3D model retrieval,” Ph.D. dissertation, University of Leipzig, Germany, 2004.

[37] T. Funkhouser, P. Min, M. Kazhdan, J. Chen, A. Halderman, D. Dobkin, and D. Jacobs, “A search engine for 3D models,” ACM Transactions on Graphics, vol. 22, no. 1, pp. 83–105, 2003.

[38] Y. Gao, Q. Dai, M. Wang, and N. Zhang, “3D model retrieval using weighted bipartite graph matching,” Signal Processing: Image Communication, vol. 26, no. 1, pp. 39–47, 2011.

[39] K. Lu, R. Ji, J. Tang, and Y. Gao, “Learning-based bipartite graph matching for view-based 3D model retrieval,” IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4553–4563, 2014.

[40] S. Mahmoudi and M. Daoudi, “3D models retrieval by using characteristic views,” in Proceedings of the International Conference on Pattern Recognition, Quebec, Canada, Aug. 2002, pp. 11–15.

[41] Y. Gao, J. Tang, R. Hong, S. Yan, Q. Dai, N. Zhang, and T. Chua, “Camera constraint-free view-based 3D object retrieval,” IEEE Transactions on Image Processing, vol. 21, no. 4, 2012.

[42] S. Mahmoudi, M. Benjelloun, and T. Filali Ansary, “3D objects retrieval using curvature scale space and Zernike moments,” Journal of Pattern Recognition Research, vol. 6, no. 1, 2011.

[43] A. Adan, P. Merchan, and S. Salamanca, “3D scene retrieval and recognition with depth gradient images,” Pattern Recognition Letters, vol. 32, no. 9, pp. 1337–1353, 2011.

[44] Y. Gao, M. Wang, R. Ji, X. Wu, and Q. Dai, “3D object retrieval with Hausdorff distance learning,” IEEE Transactions on Industrial Electronics, vol. 61, no. 4, pp. 2088–2098, 2014.

[45] M. Wang, Y. Gao, K. Lu, and Y. Rui, “View-based discriminative probabilistic modeling for 3D object retrieval and recognition,” IEEE Transactions on Image Processing, 2013.

[46] M. Steinbach, G. Karypis, and V. Kumar, “A comparison of document clustering techniques,” in Proceedings of the KDD Workshop on Text Mining, 2000.

[47] A. Khotanzad and Y. H. Hong, “Invariant image recognition by Zernike moments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 5, pp. 489–497, 1990.

[48] W. Y. Kim and Y. S. Kim, “A region-based shape descriptor using Zernike moments,” Signal Processing: Image Communication, vol. 16, no. 1–2, pp. 95–102, 2000.

[49] Y. Gao, M. Wang, Z. Zha, J. Shen, X. Li, and X. Wu, “Visual-textual joint relevance learning for tag-based social image search,” IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 363–376, 2013.

[50] Y. Huang, Q. Liu, and D. Metaxas, “Video object segmentation by hypergraph cut,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009, pp. 1738–1745.

[51] Y. Huang, Q. Liu, S. Zhang, and D. Metaxas, “Image retrieval via probabilistic hypergraph ranking,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3376–3383.

[52] D. Vranic, “An improvement of rotation invariant 3D shape descriptor based on functions on concentric spheres,” in Proceedings of the IEEE International Conference on Image Processing, 2003, pp. 757–760.

[53] Y. Gao, M. Wang, Z. Zha, Q. Tian, Q. Dai, and N. Zhang, “Less is more: Efficient 3D object retrieval with query view selection,” IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 1007–1018, 2011.

[54] “Description of core experiments for MPEG-7 color/texture descriptors,” ISO/IEC JTC1/SC29/WG11 MPEG98/M2819, MPEG Video Group, 1999.

[55] J. Zeng, B. Leng, and Z. Xiong, “3-D object retrieval using topic model,” Multimedia Tools and Applications, pp. 1–23, 2014.

Ke Lu was born in Ningxia on March 13, 1971. He received his master’s degree and Ph.D. degree from the Department of Mathematics and the Department of Computer Science at Northwest University in July 1998 and July 2003, respectively. He worked as a postdoctoral fellow at the Institute of Automation, Chinese Academy of Sciences, from July 2003 to April 2005. He is currently a professor at the University of the Chinese Academy of Sciences. His current research focuses on computer vision, 3D image reconstruction, and computer graphics.

Ning He was born in Panjin, Liaoning Province, on July 29, 1970. She graduated from the Department of Mathematics at Ningxia University in July 1993, and received her M.S. and Ph.D. degrees in applied mathematics from Northwest University and Capital Normal University in July 2003 and July 2009, respectively. She is currently an associate professor and master’s supervisor at Beijing Union University. Her research interests include digital image processing and computer graphics.

Jian Xue received the Ph.D. degree in computer applied technology from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2007. He is currently with the College of Engineering and Information Technology, University of Chinese Academy of Sciences, Beijing. Since 2003, he has been engaged in research on out-of-core medical data analysis and processing, and on visualization in scientific computing. His current research interests include image processing, computer graphics, and scientific visualization.

Jiyang Dong was born in 1990. He received the B.S. degree in electronic engineering from Tsinghua University in 2008 and is currently pursuing the Ph.D. degree in digital image processing at the University of Chinese Academy of Sciences. His current research interests include medical image segmentation and image registration.

Ling Shao (M’09–SM’10) is a Professor with the Department of Computer Science and Digital Technologies at Northumbria University, Newcastle upon Tyne, U.K. Previously, he was a Senior Lecturer (2009–2014) with the Department of Electronic and Electrical Engineering at the University of Sheffield and a Senior Scientist (2005–2009) with Philips Research, The Netherlands. His research interests include computer vision, image/video processing, and machine learning. He is an associate editor of the IEEE Transactions on Image Processing, the IEEE Transactions on Cybernetics, and several other journals. He is a Fellow of the British Computer Society and the Institution of Engineering and Technology.