
Measures for Benchmarking of Automatic Correspondence Algorithms

Anders Ericsson and Johan Karlsson
Centre for Mathematical Sciences, Lund University
[email protected] and [email protected]

September 20, 2007

Abstract

Automatic localisation of correspondences for the construction of Statistical Shape Models from examples has been the focus of intense research during the last decade. Several algorithms are available and benchmarking is needed to rank the different algorithms. Prior work has argued that the quality of the models produced by the algorithms can be evaluated by measuring compactness, generality and specificity. In this paper severe problems with these standard measures are analysed both theoretically and experimentally, on natural as well as synthetic datasets. We also propose that a Ground Truth Correspondence Measure (GCM) is used for benchmarking, and in this paper benchmarking is performed on several state-of-the-art algorithms using seven real and one synthetic dataset.

1 Background

Statistical shape modeling [11] has turned out to be a very effective tool in image segmentation and image interpretation. A major drawback is that a dense correspondence between the shapes in the training set must be established. In practice this has been done by hand, a process that commonly requires a lot of work and can be difficult, especially in 3D.

In recent years there has been a lot of work on automatic construction of Shape Models. There are several different algorithms for this automatic model construction. The algorithms normally optimise parameterisations of the shapes in the training set to get correspondences between the shapes.

The correspondence problem is one of the important unsolved problems in computer vision. It is hard since, without prior information, the problem does not have a unique solution. For contours, for example, only changes in curvature and the relative position of points, where some of the points have specific curvature, provide information that can be used to locate correspondences. The correspondence problem for two straight lines or two circles is unsolvable without other information. If additional information is available, correspondences can still be established in a unique way. For example, in images of the iris the shape may always be a circle, but correspondences can still be established by the use of colour intensities. Such a shape model could be useful, for example, for classification.

There have been many suggestions on how to automate the process of building shape models, or more precisely finding a dense correspondence among a set of shapes [1, 3, 21, 23, 35, 39]. Many approaches that correspond to human intuition attempt to locate landmarks on curves using shape features [3, 21, 35], such as high curvature. The located features have been used to establish point correspondences, using equal-distance interpolation in between. Local geometric properties, such as geodesics, have been tested for surfaces [39]. Different ways of parameterising the training shape boundaries have been proposed [1, 23]. The methods cited above are not clearly optimal in any sense. Many have stated the correspondence problem as an optimisation problem [2, 4, 8, 12, 15, 14, 19, 20, 24, 30]. For example, in [30] a measure is proposed and dynamic programming is applied to find the reparameterisation functions, and in [2] shapes are matched using shape contexts.

Minimum Description Length (MDL) [8] is a paradigm that has been used in many different applications, often in connection with model optimisation. In recent papers [9, 8, 24] this paradigm is used to locate a dense correspondence between shapes.

In this paper we focus on the correspondence problem for shape models, where the shape is the boundary of some object, i.e. a curve or a surface. Thus the correspondences should match, as one-to-one and onto functions, from each shape to all the others in the training set, and the correspondences can be described by monotonic parameterisation functions. There have been some recent interesting papers, including code, on the problem of matching one point set to another [5, 6, 40]. However, these algorithms only match one shape to another, instead of working with the training set as a whole. Therefore these algorithms are not directly useful for shape modeling of curves and surfaces.

Note that these algorithms do not operate on the images that the shapes originate from, but on the curves or surfaces that have been extracted from the images in a previous step. Therefore, for algorithms for correspondence localisation it does not matter what kind of image the shape originally comes from.

2 Introduction

In short, the field has matured and there are many algorithms available, so there is a need for benchmarking of these algorithms. In recent years a similar development has taken place in the field of stereo [28]. In order to evaluate these algorithms, different measures of the quality of the parameterisations and the resulting models have been used. If the model is to be used for a specific purpose, such as segmentation of the heart in scintigrams [27], the choice of algorithm should be made using a criterion based on the application. For a more general evaluation of shape model quality the standard measures are compactness, specificity and generality [7]. It is also common to evaluate correspondences subjectively by plotting the shapes with some corresponding points marked. These methods have become very popular since they do not require any ground truth data, which is very time consuming to collect.

In [29] the quality of registrations of images is evaluated both by measuring the overlap with ground truth and by measuring model quality measures such as specificity and generality of models constructed from the registered images. There it is claimed that ground truth correlates with generality and specificity. This is shown by performing random perturbations of ground truth and noting that all the measures increase monotonically as the level of perturbation increases. However, this does not show that the measures are minimal at the ground truth; the path to the minimum is not likely to be found by a random perturbation. Measuring the sensitivity to perturbation of ground truth, it is also claimed that specificity is more sensitive than the other two measures. Instead of measuring sensitivity to random perturbation it would be more interesting to examine which measure is most suitable for choosing between two training sets produced by different non-random strategies. This might be done by letting human experts choose which is the best of the two compared training sets.

In this paper problems with the standard general model quality measures, namely compactness, generality and specificity, are analysed. We show that especially specificity and compactness do not succeed in measuring what they attempt to measure. With practical experiments we also show that the standard measures do not correlate well with correspondence quality. Moreover, these measures are not quantitative in the sense that they do not assign a single number to describe the quality of the shape model. What should be considered a shape model of high quality is highly dependent on the application. For example, the model that performs best on segmentation might not be the model that performs best on classification. The qualities that the standard measures attempt to measure are often, but certainly not always, important. However, even when they are important, as we will see, it is problematic to actually measure them.

For most applications, high quality of the correspondences is desirable. A shape model built from correct correspondences is a model that correctly describes the shape variation within the class, whereas, as will be shown, a simplified model can get excellent measures of specificity, generality and compactness while relevant shape information may have been lost. We propose that a Ground Truth Correspondence Measure (GCM), measuring the quality of the correspondences at important locations, is used for benchmarking. Verification using ground truth is an established method in computer vision, see for example the CAVIAR project [17]. Ground truth has also been used for verification of segmentation, see for example [26].

In [34] four algorithms for 3D correspondence are evaluated, and there a simple Ground Truth Correspondence Measure is used together with the standard measures to compare the algorithms. However, no evaluation of the evaluation measures is done. In our paper a close examination of the different evaluation measures is performed.

The two major contributions in this paper are: (i) It is shown by theoretical analysis and practical experiments on both natural (extracted from real images) and synthetic datasets that the established shape model measures compactness, specificity and generality have severe weaknesses. (ii) A simple correspondence quality measure (GCM) is proposed and it is shown that GCM together with a database of ground truth correspondences is well suited for benchmarking algorithms for correspondence localisation. A database of seven natural and one synthetic dataset and MATLAB code for evaluating algorithms are published. Benchmarking of several state-of-the-art algorithms is presented. Apart from analysing the different measures, one purpose of this paper is that people with newly developed algorithms download our database and code to benchmark their algorithms and compare with the results published in this paper.

In Section 3 preliminaries of shape models and parameterisation optimisation for locating correspondences are given. In Section 4 the compactness, specificity and generality measures are recapitulated. In Section 5 the theoretical weaknesses of these measures are discussed. In Section 6 a measure called Ground Truth Correspondence Measure (GCM) is introduced. This measure makes the subjective evaluation of correspondences more objective and quantitative. In Section 7 experiments show that all the standard measures can lead to very wrong conclusions when used to evaluate algorithms. GCM, however, always corresponds to the subjective evaluation of correspondence quality or, in synthetic examples, to the known ground truth. In Section 8 several algorithms for finding correspondences are benchmarked using GCM.

3 Preliminaries

3.1 Statistical Shape Models

A shape is what is left of a geometrical object after similarity transformations have been filtered out. When analysing a set of n_s similar shapes, it is convenient and usually effective to describe them using Statistical Shape Models. The shapes are typically boundaries of some objects and are commonly defined by landmarks, curves or surfaces. In this paper we treat shapes defined as n-dimensional manifolds. Specifically we consider one-dimensional manifolds, i.e. curves, but the measures discussed work equivalently for n-dimensional manifolds, for example surfaces. For more complicated shapes consisting of several manifolds, correspondences are established by treating one manifold at a time.

Assume first that the shapes are sampled in a finite number of points so that each shape can be represented as a vector. After the shapes x_i (i = 1, ..., n_s) have been aligned and normalised to the same size, a principal component analysis of the covariance matrix for the shapes is normally performed (other types of dimensionality reduction, such as Independent Component Analysis, are also possible). A shape from the shape class can now be described by a linear shape model of the form,

x = x̄ + Φb + e,  (1)

where x̄ is the mean shape, the columns of Φ describe a set of orthogonal modes of shape variation, b is the vector of shape parameters for the shape and e is the residual. If the shapes are continuous curves and not finite vectors, then we interpret (1) as

x(t) = x̄(t) + ∑_{i=1}^{n_m} Φ_i(t) b_i + e(t),  (2)

where n_m is the number of modes of shape variation and x(t), x̄(t), Φ_i(t) and e(t) are functions of the continuous parameter t. For details of how to perform the principal component analysis for continuous curves, see [22].
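The finite-dimensional model (1) can be sketched in code. The following is a minimal illustration (not the authors' published MATLAB code), assuming the shapes have already been aligned, normalised and flattened into vectors; the function names are our own.

```python
import numpy as np

def build_shape_model(shapes):
    """PCA shape model x = xbar + Phi b (eq. 1), for pre-aligned shapes.

    shapes: (ns, d) array, one flattened (x1, y1, ..., xn, yn) shape per row.
    Returns the mean shape, the modes Phi (as columns) and the
    per-mode variances lambda_i.
    """
    xbar = shapes.mean(axis=0)
    X = shapes - xbar
    # PCA via SVD of the centred data matrix; squared singular values
    # (scaled by ns) are the eigenvalues of the covariance matrix.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    lam = s**2 / shapes.shape[0]
    return xbar, Vt.T, lam

def project(shape, xbar, Phi, nm):
    """Shape parameters b for the first nm modes."""
    return Phi[:, :nm].T @ (shape - xbar)

def reconstruct(b, xbar, Phi):
    """x = xbar + Phi b (the residual e is ignored)."""
    return xbar + Phi[:, :b.shape[0]] @ b
```

Projecting a training shape onto all its modes and reconstructing it recovers the shape up to numerical precision.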


3.2 Parameterisation Optimisation

To illustrate, we look at shapes which are curves; the principles are the same for surfaces. Assume that a population of geometric objects, represented as continuous curves c_i(t), i = 1, ..., n_s, t ∈ [0, 1], is given and that the shape is to be modeled.

Each curve is represented using some arbitrary parameterisation,

c_i : [0, 1] ∋ s ↦ c_i(s) ∈ R².  (3)

For simplicity it is assumed that the shapes are parameterised by arc length. We want to model the shape of these curves. A linear representation of the model for the continuous curves, as in the previous section, would be desirable.

To model the shape it is necessary to solve the correspondence problem, i.e. to find a set of reparameterisation functions {γ_i}_{i=1}^{n_s}, where γ_i : [0, 1] → [0, 1] are strictly increasing bijective functions such that c_i(γ_i(s)) corresponds to c_j(γ_j(s)) for all pairs (i, j) and all parameter values s ∈ [0, 1], i.e. if c_i(γ_i(s)) is at a certain anatomical point on shape i then c_j(γ_j(s)) should be at the same anatomical point on shape j. Correspondence between the curves c_i(γ_i(s)) and c_j(γ_j(s)) is denoted c_i(γ_i(s)) :=: c_j(γ_j(s)). The above describes an idealized case without outliers and noise. In the presence of such disturbances of the shape, a good optimiser will put low weight on those parts of the curves. The same formulation can be used for closed curves by changing the interval [0, 1] to the circle S¹. For general N-dimensional manifolds, γ will be a function from [0, 1]^N to [0, 1]^N and s will be an N-dimensional vector.

In practice the alignment method will influence the correspondences slightly. For a more detailed investigation of these effects see [16]. In this paper we use the standard Procrustes alignment method, see [18]. Procrustes aligns the shapes by finding the similarity transformations that minimise the squared distances from all the shapes to the mean shape.
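As a rough sketch of the Procrustes idea, assuming the shapes are sampled at corresponding points: each shape is centred, scaled to unit norm, and rotated onto the current mean, which is then re-estimated. This is our own minimal formulation, not the exact algorithm of [18]; the function names are hypothetical.

```python
import numpy as np

def align_to(shape, target):
    """Similarity-align `shape` (n x 2) onto `target`: remove translation
    and scale, then find the rotation minimising the squared distance."""
    A = shape - shape.mean(axis=0)
    B = target - target.mean(axis=0)
    A = A / np.linalg.norm(A)
    B = B / np.linalg.norm(B)
    # Optimal rotation via SVD of the 2x2 cross-covariance (Kabsch).
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    if np.linalg.det(R) < 0:        # keep a proper rotation, no reflection
        U[:, -1] *= -1
        R = U @ Vt
    return A @ R

def procrustes_mean(shapes, iters=10):
    """Generalised Procrustes: iteratively align all shapes to the mean."""
    ref = shapes[0]
    for _ in range(iters):
        aligned = [align_to(s, ref) for s in shapes]
        ref = np.mean(aligned, axis=0)
    return ref, aligned
```

Aligning a rotated, scaled and translated copy of a shape onto the original recovers the original's normalised form.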

In order to determine correspondences, a model of shape is constructed from the parameterised curves. A cost function, such as the minimum description length (MDL), is calculated from the shape variation model. Then the parameters of the basis functions controlling the parameterisation functions are optimised, using a suitable optimiser, to minimise this cost function. At each iteration in the optimisation procedure, before a model can be constructed, the shapes are resampled with a fixed number of sampling points defined by the current set of parameterisation functions. The sampling of the shapes is done in order to perform the numerical computations necessary to evaluate the cost function.

3.3 Using MDL for Correspondence Localisation

One popular way to solve the correspondence problem in shape modeling is to apply the MDL technique, cf. [9, 10]. MDL has proven to be a successful cost function for locating the parameterisation functions γ_i. The cost in MDL is derived from information theory and is, in simple words, the number of data bits needed to transmit the model and the training shapes, which the model approximates, bit by bit. The MDL principle searches iteratively for the set of functions γ_i that gives the cheapest model to transmit. The idea behind the introduction of this cost function was that it should make a trade-off between a model that is general (it can represent any instance of the object), specific (it can only represent valid instances of the object) and compact (it can represent the variation with as few parameters as possible).

Since the idea of using MDL for landmark determination was first published [9], the cost function has been refined and tuned. Here we use the simple cost function stated in [36]:

DL = ∑_{λ_i ≥ c} (1 + log(λ_i / c)) + ∑_{λ_i < c} λ_i / c.  (4)

The scalar DL is the description length and is the cost of transmitting the model according to information theory. The scalars λ_i are the eigenvalues of the linear model in equation (1) and c is a cutoff constant. Information can only be sent up to a certain degree of accuracy; the constant c expresses this accuracy. Typically we have set it to c = 10⁻⁵, which corresponds to an acceptable error of 0.3 pixels for shapes with an original radius of 100 pixels.
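Equation (4) is straightforward to evaluate from the model eigenvalues. A minimal sketch (the function name is our own):

```python
import numpy as np

def description_length(lams, c=1e-5):
    """Simplified MDL cost of eq. (4): modes with variance at or above the
    accuracy cutoff c cost 1 + log(lambda_i / c); the rest cost lambda_i / c."""
    lams = np.asarray(lams, dtype=float)
    big = lams[lams >= c]
    small = lams[lams < c]
    return np.sum(1.0 + np.log(big / c)) + np.sum(small / c)
```

For example, a single mode with λ = c costs exactly 1 (since log 1 = 0), and modes below the cutoff contribute linearly.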

In [13] the simpler version above is derived from the version published by Davies et al. in [9] using a few natural assumptions. The original version in [9] is not continuous; therefore an additional step involving numerical integration is introduced in [7]. However, the version used in this paper is continuous, so integration is not necessary.

In [37] it was shown that a curvature cost can be added to the description length to improve the correspondences produced by the optimisation (MDL+Cur). The curvature cost is a measure of the variation in the set of curvature functions at corresponding points along the shapes. With good correspondences the curvature at corresponding points will be similar, and so the variance, and thereby the curvature cost, will be low. The total cost function is then E = E_DL + K E_cc, where E_DL is the description length, K is a constant and E_cc is the curvature cost. In [22] MDL is used with a parameterisation-invariant scalar product and this gives better results (called MDL+Par in this paper). In [14] a technique that formerly was used in stereo for finding correspondences is shown to work for shape models, AIAS+MDL. AIAS+MDL+Cur is a version with a curvature cost added. There are also many other methods to locate correspondences among shapes in a training set.
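As a sketch of the curvature-cost idea (our own simplified formulation; [37] may differ in detail), the cost can be taken as the summed variance of sampled curvature values across the training set:

```python
import numpy as np

def curvature_cost(curv):
    """Variance of curvature across the training set at corresponding points.

    curv: (ns, n_pts) array of curvature values of each shape, sampled at
    the current correspondences. Good correspondences give similar
    curvature at corresponding points and hence a low cost."""
    return np.var(curv, axis=0).sum()

def total_cost(dl, curv, K=1.0):
    """E = E_DL + K * E_cc, as in the MDL+Cur formulation."""
    return dl + K * curvature_cost(curv)
```

Identical curvature profiles give zero cost; any disagreement at corresponding points increases it.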

4 Measuring Shape Model Quality

First of all, the model should be able to represent the training set. A model that is able to represent any instance of the class is said to be general. Secondly, the model should not be able to model illegal instances. A specific model is only capable of representing legal instances of the class. Thirdly, the model should use as little information as possible. A compact model represents instances with as few parameters as possible.

Compactness, specificity and generality are the current state-of-the-art criteria for measuring shape model quality. To evaluate correspondences it is common to plot the sample points on the shapes and, in a subjective manner, visually evaluate the correspondences.

In the rest of the paper the norm used is ||f(x)|| = √(∫ |f(x)|² dx), i.e. the L²-norm, unless otherwise stated. As we will see, many of the problems of the standard measures are related to the fact that they measure similarity between shapes using a norm of this type. If there were a general measure that could pick out relevant information and determine shape similarity so that the distance between shapes reflects the way they differ, and not only the distance in the L^N-norm, and if this measure were invariant to the parameterisations, many of the problems with the standard measures might be solved.

Compactness. Qualitatively, a compact model is a model that represents all shapes of the class with as little variance as possible in the shape variation modes, and preferably with few modes. A measure that tries to capture this is the sum of variances of the model [7],

C(n_m) = ∑_{j=1}^{n_m} λ_j,  (5)

where λ_j is the variance in shape mode j and C(n_m) is the compactness using n_m modes.

Specificity. It is in general desirable that a shape model is specific. A specific model can only represent shapes from the shape class for valid parameter values. This has been measured by generating a large number (N) of shapes by picking random parameter values for the model according to the parameter space distribution. Each sample shape is then compared to the most similar shape in the training set. A quantitative measure for this [7] is:

S(n_m) = (1/N) ∑_{j=1}^{N} ||y_j − y′_j(n_m)||,  (6)

where y′_j are shape examples generated by the model, y_j is the nearest member of the training set to y′_j and n_m is the number of modes used to create the samples. Formally,

y_j = arg inf_{x_j ∈ {x_1, ..., x_{n_s}}} ||x_j − y′_j(n_m)||.  (7)

In [29] this is interpreted as an approximation of

S(n_m) = ∫_Φ p(y′_j) ||y_j − y′_j(n_m)|| dy′_j,  (8)

where p(y′_j) is the distribution function describing the probability that the model generates the shape y′_j and Φ is the shape space. A more general definition is

S(n_m) = ∫_Φ ||y_j − y′_j(n_m)|| dμ_p(y′_j),  (9)


where μ_p is the probability measure for the shapes y′_j. In the special case where the shapes are vectors this coincides with the above definition.

In [38] specificity is interpreted as a graph-based estimator of cross-entropy and KL divergence. However, in order to do this estimation a number of limits and assumptions are needed, which means that the estimation is not valid in practice.

Generality. Qualitatively, generality means how well the model can generalise to formerly unseen shapes. The model should be able to describe all shapes of the class and not only those of the training set. This is measured by "leave one out" tests, where a model is built using all but one of the shapes in the training set. The model then tries to describe the left-out shape. In [7] the error for one left-out shape is the norm of the difference between the shape produced by the model and the true shape. Generality is measured as the mean over all n_s left-out shapes. Usually this is plotted over the number of modes used by the approximating model. The expression for generality is

G(n_m) = (1/n_s) ∑_{j=1}^{n_s} ||x_j − x′_j(n_m)||,  (10)

where x_j is the left-out shape and x′_j(n_m) is the attempted description using the model with n_m modes.

Comparing Quality of Models. For generality and specificity it is common to plot the measures over the number of modes used for the constructed shape model. If the curve of the error measure over the number of modes for one model is below or equal to the curve for another model for all n_m, and lower for some n_m, the model represented by the lower curve can be said to be more specific or general depending on the measured quantity [7]. For compactness it is common to compare the total variance, i.e. let n_m = n_s − 1, but as we will see it can be important to look at the whole curve.

5 Theoretical Analysis of Standard Model Quality Measures

Assume that a set of training shapes with perfect correspondence, c_i(γ_i(s)) :=: c_j(γ_j(s)) over all the shapes at all points, is given. This means that every anatomical point in one shape is matched to the same anatomical point in all other shapes in the training set (even for non-medical data, we still use the term anatomical to refer to points such as, for example, the upper left corner of the right headlight on a car). A model built from these correspondences will describe all the shape variation that is present in the training set. If we perturb the correspondences for this model, new artificial shape modes are likely to be introduced. These shape modes describe inferred movement of points along the shape, due to incorrect correspondence. It may also be the case that there are real shape variations along the boundary of the objects, and these modes may be lost by algorithms favoring closeness between corresponding points. Therefore correct correspondences are very important for model quality. Specificity and generalisation ability are also two important properties of a shape model, but as we will see, they are both difficult to measure.


Figure 1: Correspondences on the same shapes with perfect correspondences but with different parameterisation functions. Different weight has been put on the different parts of the curves in the two examples. In the figure the weighting is illustrated by the point density. Assume that between each pair of crosses there is an equal number of sample points. Model 2 has many more points on the three straight lines that do not have any shape changes in the set.

Assume two training sets that have perfect correspondence and are sampled as in Figure 1. Model 1 is built from shapes with sample points more equally distributed around the shapes, but model 2 is built from shapes where the straight lines, without any shape variation information, are weighted with many more points than the upper part of the curve where the shape change occurs. Since both training sets represent the same correspondences, the models are theoretically equally good. The sampling is of course important when the models are to be used. So, if the current sampling is to be used, model 1 will be better than model 2, since it will not neglect the upper part of the curve. However, resampling using inverse parameterisation functions can always be performed so that all parts of the curves are weighted equally [22], and using such a procedure model 1 and model 2 will be equal. The standard measures, however, will all favor model 2 as the model with highest quality. This is because the part of the shapes with no shape change is up-weighted and will reduce the errors that are only measurable in the upper part of the shape. This problem is avoided with the Ground Truth Measure we will use in this paper, since it works with inverse parameterisation functions as part of its definition, so that no resampling as in [22] is necessary.

A common situation when comparing the quality of two models using generality, specificity and compactness is also that the plotted curves of the quality measures (G(n_m), S(n_m) or C(n_m)) for the two models cross each other, and in this case it is not possible to choose which model is of higher quality. Sometimes this problem can be avoided, since it may be enough to only examine the asymptotic value of the quality measures.

Compactness. One problem with the compactness measure is that it does not take into consideration the value of gathering the variance in as few modes as possible. In Figure 2 the compactness functions of two models are plotted. The total variances of the two models are equal. For model A all variance is concentrated in the first mode. The graph for model A therefore goes up to the total variance already at the first mode.

Figure 2: The compactness curve of model A is above that of model B, so model B is more compact judging from the measured quantity. However, from the original definition model A is more compact than model B, since it captures all the variation in one mode.

For model B the variance is distributed equally over all modes. According to the quantitative definition of the compactness measure described above, model B is more compact than model A. However, from the qualitative definition, compactness means as little variance as possible in as few modes as possible, so from that definition the conclusion is that model A is more compact than model B, since they have the same total variance but model A captures it in one mode. Therefore we can conclude that the quantitative compactness measure does not capture the notion of compactness very well.
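The argument can be checked with a toy computation (the variance numbers are our own illustrative choice): two models with equal total variance, one concentrating it in a single mode, the other spreading it evenly.

```python
import numpy as np

# Total variance 10 in both cases (hypothetical numbers).
lam_A = np.array([10.0] + [0.0] * 9)   # model A: everything in one mode
lam_B = np.full(10, 1.0)               # model B: spread over ten modes

C_A = np.cumsum(lam_A)                 # C(nm) for nm = 1..10
C_B = np.cumsum(lam_B)

# The measured compactness curve of A lies above that of B for nm < 10,
# so the quantitative measure ranks B as "more compact" -- against the
# qualitative definition, under which A (one mode) is the more compact model.
assert C_A[-1] == C_B[-1] == 10.0
assert all(C_A[:-1] > C_B[:-1])
```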

Also, optimising MDL and compactness both mean that the shapes are parameterised so that they can be described as simply as possible. This approach works well for finding approximate correspondences, but there is also the risk that these simple descriptions come at the cost of neglecting real information about shape variation when determining shape modes. Such simplifications should be detected by a good measure of generality, but the standard measure will not detect them.

Specificity Assume first that the norm used for specificity is ||f(x)|| = ∫ |f(x)| dx, i.e. the L1-norm. Assume two models built from a training set containing rectangles such as shape A and shape B in Figure 3. Model 1 generates samples like shape C and model 2 generates samples like shape D. These models will have equal specificity measure, since C and D have equal distance to the closest shape in the training set. However, shape C belongs to the shape class and shape D does not. One conclusion is that disturbances with zero mean will not be detected by the specificity measure.

Figure 3: Shape A and shape B are in the training set. One model generates examples like shape C and another model generates shapes like D. These models are equal with regard to specificity.

To make the above precise, consider the training set T = {y_1, ..., y_{n_s}} consisting of rectangles with equal width w and varying height h ∈ {h_1, ..., h_{n_s}}. Assume that model 1 generates all rectangles with width w and height h ∈ (0, ∞). For simplicity assume w = 1. Call this space Ω_1 and call its probability distribution μ_1. In this case we identify Ω_1 with R. Model 2 generates the same rectangles and also any shape that can be constructed by adding functions with mean zero and bounded amplitude to the top side of the rectangle; call this space Ω_2 with probability distribution μ_2. Let F = C^0([0,1], R), i.e. the space of continuous functions from [0,1] to R. Let

B_{m,a} = { f ∈ F : sup_{x ∈ [0,1]} |f(x) − m| ≤ a, ∫_0^1 f dx = m }.   (11)

Define

a(h) = inf_{h_i ∈ {h_1, ..., h_{n_s}}} |h − h_i|.   (12)

We denote Ω_h = B_{h, a(h)}. Note that

Ω_{h_1} ∩ Ω_{h_2} = ∅   (13)

for all h_1, h_2 ∈ Ω_1, h_1 ≠ h_2. Now, let

Ω_2 = ∪_{h ∈ Ω_1} Ω_h.   (14)

In the example in Figure 3 the function f is a sine plus the constant m. The shape space generated by model 2 is an extension of the shape space generated by model 1, Ω_1 ⊂ Ω_2 ⊂ Φ. Model 2 is therefore clearly less specific than model 1.

Let π : Ω_2 → Ω_1, π : f ↦ ∫_0^1 f dx. These definitions are illustrated in Figure 4.

Figure 4: The space Ω_2 and the projection π : Ω_2 → Ω_1.

Now assume that, loosely speaking (thinking for example of Ω_2 as a subset of R^n and assuming probability density functions p_1 and p_2 corresponding to μ_1 and μ_2), we want

∫_{Ω_h} p_2(z) dμ_0(z) = p_1(h)   (15)

where Ω_h is the space consisting of shapes generated from the shape h, which is a rectangle with height h. This can be stated as demanding that μ_2 is normalised such that

∫_{Ω_2} φ ∘ π(z) dμ_2(z) = ∫_{Ω_1} φ(h) dμ_1(h)   (16)

for all functions φ ∈ C^0(Ω_1, R). Let us now evaluate the specificity measure for model 2 using the general definition,

∫_{Ω_2} inf_{y_j ∈ T} ||y_j − z|| dμ_2(z).   (17)

Define φ̄(z) = inf_{y_j ∈ T} ||y_j − z|| and note that φ̄(z) = φ ∘ π(z), where φ(h) = inf_{h_j} |h_j − h|. Then use (16) to show that (17) is equal to

∫_{Ω_1} inf_{h_j} |h_j − h| dμ_1(h)   (18)


Figure 5: Two models are built from shape A and shape B. It turns out that the model that generates shapes like shape D has better specificity than the model that generates shapes like shape C, but shape D does not belong to the shape class whereas shape C does.

and this is the specificity measure for model 1. So the specificity measures for model 1 and model 2 are equal, but as explained above, model 2 is clearly less specific than model 1. In the calculations above the shapes clearly do not have to be rectangles. In fact Ω_1 can be any shape space and F can be any space of functions with mean zero.

Using the standard L2-norm the above proof is not as intuitive and the space Ω_h gets more complicated, but we still get the same specificity for the space Ω_1 as for Ω_2.

To illustrate the problem with the specificity measure further, another example is given in Figure 5, where a number of shapes are shown. For shapes A, B and C, landmark 40 is at the left corner of the bump and landmark 50 is at the right corner of the bump. Landmarks 10, 20, 30 and 60 are at the corners of the rectangle in clockwise order. Between these landmarks the remaining landmarks are placed by arclength parameterisation. Shape D is constructed by taking shape B and projecting the bump down onto the rectangle. Assume that two different models of Box-bumps are built from shape A and shape B. Shape C also belongs to the class of Box-bumps but shape D does not. Say that model 1 generates shapes like shape C and model 2 generates shapes like shape D. Look at the terms in (6). The term for model 2 that comes from shape D will be less than the term for model 1 that comes from shape C. This implies that model 2 is more specific than model 1, which is clearly wrong.

So we can conclude that it is difficult to design a specificity measure that takes into consideration in what way shapes are similar to each other, and that the quantitative measure discussed above does not succeed. The measure above is based on the idea that the model should only be able to represent shapes that look like the ones in the training set. This might be a reasonable approach if the training set is very large, but in practice it is often rather small. The model must then be able to represent not only shapes that look like the ones in the training set but also shapes that deviate from the training shapes, yet in a way such that they still belong to the shape class. The standard specificity measure can give worse error for shapes that belong to the shape class than it does for shapes that do not belong to the shape class.
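The zero-mean argument can be verified numerically. The sketch below (all heights and the ripple amplitude are invented for illustration) represents the shapes by their top sides only, so the L1 shape distance reduces to a one-dimensional integral; the ripple amplitude (0.3) is kept below the height gap (0.4), matching the bound in the construction of B_{m,a}:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10001)      # sample the top side of the box on [0, 1]
train_heights = [1.0, 2.0]            # training rectangles of width 1 (hypothetical)

def l1_specificity_term(top):
    """Contribution of one generated shape to the L1 specificity sum:
    distance to the nearest training shape (the shapes differ only in
    the top side here, so the integral reduces to a 1-D one)."""
    return min(float(np.mean(np.abs(top - h))) for h in train_heights)

h = 1.4                                     # generated height between the training heights
shape_C = np.full_like(x, h)                # flat top: belongs to the shape class
shape_D = h + 0.3 * np.sin(2 * np.pi * x)   # zero-mean ripple: does not

# Both shapes contribute the same term, so the zero-mean disturbance
# goes undetected by the measure, exactly as argued above.
assert abs(l1_specificity_term(shape_C) - l1_specificity_term(shape_D)) < 1e-6
```

Because the ripple never changes the sign of the integrand, its zero mean makes it integrate away, so shape C and shape D are indistinguishable to the measure.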

Generality One problem with measuring generality is that the parameterisation of the left-out shape is unknown. This is often solved by letting the shape be included in the correspondence localisation. This produces a bias, implying that the generality cost for the optimised model will always be underestimated. This is especially true when optimising costs like MDL, since there is the risk of over-simplifying the shapes when optimising the parameterisations using MDL. To avoid these problems it would be necessary to run the optimisation procedure once for each shape that is left out, then pick a suitable parameterisation for the left-out shape and try to represent this shape using the model. There is no obvious way to pick the parameterisation of the left-out shape. To decide this parameterisation function the correspondence problem must be solved. This means that evaluating solutions to the correspondence problem in turn requires solving the correspondence problem for the left-out shape. Clearly a problem.

In [29], a different version of the generalisation measure is used,

G(n_m) = (1/n_s) Σ_{j=1}^{n_s} ||y_j − y'_j(n_m)||   (19)

where y_j are the shapes in the training set, y'_j is the closest shape to y_j minimised over a set of model-generated shapes, and n_m is the number of modes used to create the samples. Formally,

y'_j = arg inf_{x_j ∈ {x_1, ..., x_N}} ||y_j − x_j(n_m)||.   (20)


This version avoids the leave-one-out problem, but it does not measure generalisation ability to shapes outside the training set.

Consider the shape space Φ and let Ω_M be the subspace generated by the model. Then arg inf_{x_j ∈ {x_1, ..., x_N}} ||y_j − x_j(n_m)|| is actually an approximation of the orthogonal projection of the shape y_j in the training set onto the subspace Ω_M. There is therefore no need to approximate the distance, since the projection can be calculated using the scalar product and the vectors generating Ω_M. This is illustrated in Figure 6.

Figure 6: The generality measure (19) approximates the sum of distances from the shapes in the training set to the shape space generated by the model.

If the model spans the whole space spanned by the training set the projection distance will of course be zero. It may then be argued that we are instead interested in measuring the closest distance for a bounded number of shapes in Ω_M. However, this does not measure generality. In fact, a more general model will have more spread-out sample shapes, and therefore the distance to the closest one will be greater, so the measure will be larger when it should be smaller.
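The exact projection alternative to (19) can be sketched as follows (training set, sizes and variable names are invented for illustration); it computes the orthogonal projection onto the mode subspace directly, and shows that the distance reaches zero once the model spans the training set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: ns shape vectors of concatenated landmark
# coordinates.
ns, dim = 8, 20
Y = rng.normal(size=(ns, dim))
mean = Y.mean(axis=0)

# PCA via SVD of the centred data; the rows of Vt are the model modes.
_, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)

def projection_generality(n_modes):
    """Exact counterpart of (19): mean distance from each training shape
    to its orthogonal projection onto the first n_modes model modes."""
    P = Vt[:n_modes]                 # orthonormal rows spanning the model subspace
    C = Y - mean
    resid = C - C @ P.T @ P          # residual after projection
    return float(np.mean(np.linalg.norm(resid, axis=1)))

g = [projection_generality(m) for m in range(ns)]
# Adding modes can only shrink the residual, and once the model spans
# the training set (rank ns - 1 after centring) the distance is zero.
assert all(a + 1e-12 >= b for a, b in zip(g, g[1:]))
assert g[ns - 1] < 1e-10
```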

6 Ground Truth Correspondence Measure

If ground truth correspondences are known it is possible to measure the quality of the model by measuring the closeness of the produced correspondences to the known truth. Since ground truth is required, this measure cannot be used to locate correspondences, but instead to evaluate algorithms for finding those correspondences.

6.1 Basic Definition

In order to measure the quality of the correspondences produced by an algorithm for automatic correspondence localisation, datasets with manually located ground truth landmarks and synthetic datasets with known corresponding points can be used.

Let the parameterisations γ_i of the shapes x_i be optimised by the algorithm that is to be evaluated. Then, for every shape x_i (i = 1, ..., n_s) together with its ground truth points g_ij (j = 1, ..., n_g), find t_ij so that x_i(γ_i(t_ij)) = g_ij. This means that the parameter values that correspond to the ground truth points on the shape are calculated. Formally, t_ij = γ_i^{-1}(x_i^{-1}(g_ij)). Now, keeping i fixed for a moment, for every shape x_k (1 ≤ k ≤ n_s, k ≠ i) use the same parameter values t_ij computed for x_i. If the parameterisation functions represent correspondences of high quality, the points produced should be close to the ground truth points of this shape. That is, x_k(γ_k(t_ij)) should be close to g_kj. This is measured as the mean distance in the metric d over all shapes in the dataset. The procedure described above is illustrated in Figure 7. We get

GCM = 1/(n_s(n_s − 1)n_g) Σ_{i=1}^{n_s} Σ_{j=1}^{n_g} Σ_{k∈K} d(x_k(γ_k(t_ij)), g_kj)   (21)

K = {1, ..., i−1, i+1, ..., n_s},

where t_ij is the parameter value for ground truth point j on shape i, i.e.

t_ij = γ_i^{-1}(x_i^{-1}(g_ij)).   (22)

The constant n_s is the number of shapes and n_g is the number of ground truth points. Any metric could be used, but in this paper d(a, b) is the length of the shortest path between the point a and the point b along the shape, which has been normalised to have area one. Locally (usually also globally) the shortest path is a geodesic. Apart from the advantage of measuring the error along the shape, this also gives scale invariance. On curves the metric d corresponds to the arclength distance on curves normalised to length one. Note that in case of intersections of the shape the path is not allowed to use the intersection as a short cut. Think of the letter α for an example of a shape with an intersection.

Figure 7: Illustration of GCM calculation: 1. The optimiser determines the parameterisation functions γ_i, i = 1, ..., n_s. 2. Given ground truth point j on shape i, 3. calculate the corresponding parameter t_ij. 4. Use t_ij on shape k. 5. Calculate the distance d_ijk from x_k(γ_k(t_ij)) to ground truth point j on shape k.

In [25] a similar measure is used to evaluate correspondences on registered surfaces. However, there the distance between the corresponding points is measured as the Euclidean distance. Also, the distance is measured between points on a deformed template and points on the target surface, whereas here we focus on groupwise correspondence.
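As a sketch of how (21) and (22) can be evaluated, the following toy example (shapes, parameterisations and ground truth positions are all invented for illustration) treats each shape as the unit interval traced by its own monotone parameterisation, maps ground truth parameters between shapes and averages the arclength errors:

```python
import numpy as np

# Each "shape" is the unit interval with x_i(s) = s, traced by its own
# monotone parameterisation gamma_i: [0,1] -> [0,1], so parameter value
# t lands at arclength position gamma_i(t) on shape i.
gammas     = [lambda t: t,             # shape 1: arclength parameterisation
              lambda t: t ** 2,        # shape 2: points crowd toward 0
              lambda t: np.sqrt(t)]    # shape 3: points crowd toward 1
inv_gammas = [lambda s: s,
              lambda s: np.sqrt(s),
              lambda s: s ** 2]
gt = np.array([0.25, 0.5, 0.75])       # ground truth arclength positions,
ns, ng = len(gammas), len(gt)          # here identical on every shape

total = 0.0
for i in range(ns):
    t_i = inv_gammas[i](gt)            # (22): t_ij = gamma_i^-1(x_i^-1(g_ij))
    for k in range(ns):
        if k == i:
            continue
        # d: arclength distance on a curve normalised to length one.
        total += np.sum(np.abs(gammas[k](t_i) - gt))

gcm = total / (ns * (ns - 1) * ng)     # the mean in (21)
```

With identical parameterisations on all shapes the measure would be zero; here the mismatched parameterisations give a positive GCM.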


6.2 Ground Truth Correspondence Measure with Mahalanobis Distance

For synthetic examples ground truth marks are exact, but for manually placed marks there is of course a subjective element, and the introduction of a small error is inevitable.

Therefore, for natural shapes, statistics about the ground truth points can be taken into account. Let a number of people mark ground truth points on the same dataset. One example from the database, which has been marked by 18 people, can be seen in Figure 8. Means and variances can then be calculated, and the norm used to calculate GCM can then be the Mahalanobis-normalised distance,

GCM = 1/(n_s(n_s − 1)n_g) Σ_{i=1}^{n_s} Σ_{j=1}^{n_g} Σ_{k∈K} d(x_k(γ_k(t_ij)), g_kj) / σ_kj   (23)

K = {1, ..., i−1, i+1, ..., n_s},

where g_kj is the mean and σ_kj is the standard deviation for landmark j on shape k.
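The effect of the normalisation in (23) can be illustrated with a small sketch (the annotation statistics below are invented, as if estimated from the marks of 18 annotators):

```python
import numpy as np

# Hypothetical per-landmark statistics for three landmarks on one shape,
# as arclength positions in [0, 1]: means g_kj and standard deviations
# sigma_kj from repeated manual annotation.
g_kj     = np.array([0.20, 0.50, 0.90])
sigma_kj = np.array([0.010, 0.050, 0.020])

# Points mapped onto this shape via the optimised parameterisations,
# each 0.02 off in arclength.
predicted = np.array([0.22, 0.52, 0.92])

# One term of (23) per landmark: the arclength error divided by the
# per-landmark spread, so an error at a landmark that the annotators
# themselves disagree on is penalised less.
terms = np.abs(predicted - g_kj) / sigma_kj
assert terms[1] < terms[2] < terms[0]   # the "uncertain" middle landmark counts least
```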

Figure 8: One example of the rat shapes with ground truth points set by 18 people. Note that there is different variance for different ground truth points.

6.3 Discussion

Note that it is not the case that we optimise the position of a set of landmarks that can then be compared to ground truth; there are no such landmarks during optimisation. The optimisation works directly on the parameterisation functions to establish correspondence, and the ground truth points are used to evaluate the established parameterisation functions. The correspondence is completely described by the parameterisation functions. By mapping the ground truth points from one shape via the parameterisation functions to another shape we can measure the quality of the correspondence. Note that there are infinitely many sets of parameterisation functions describing the same correspondences. This is the reason why it is necessary to map ground truth points between shapes.

Although the measure is based on points on the shape, if these points are chosen carefully they will represent most of the valuable information of the shape. In handwriting recognition such points are called core points, see [33]. The most valuable information of a shape is often situated at points of changing curvature. As an example, a linear segment needs only the two end points to represent the information. For additional points along the segment there is no information in the shape itself regarding their correspondence to other shapes. To establish such correspondences additional information is needed, such as image intensity.

This Ground Truth Correspondence Measure measures the precision of the established correspondences, and thereby the extent to which the shape model represents the true shapes. Using similar techniques other ground truth measures can be defined; for example, using ground truth segmented images a ground truth segmentation measure can be defined. In medical applications the diagnosis could be used to define a ground truth classification measure.


7 Experimental Analysis of Standard Quality Measures and of GCM

7.1 Setup

In this section GCM, generality, compactness and specificity are evaluated experimentally. It is also shown that description length fails to give optimal correspondences and that adding a curvature cost can be useful. The experiments show that GCM, together with its database of ground truth landmarks, is a better measure than the standard evaluation methods. The experiments in this section treat two datasets.
Silhouettes: The silhouette dataset consists of 22 contours of face silhouettes from digital images. The silhouettes were extracted using an edge detector.
Box-bumps: A synthetic shape consisting of a box with a bump.

Examples of these shapes can be seen in Figure 18. For the optimisation, seven control nodes and 128 sample points were used for each parameterisation function.

7.2 Experiment 1

The first experiment was to start from correct correspondences and then optimise the parameterisations so as to minimise the description length. Synthetic Box-bump shapes, consisting of a rectangle with a bump at different positions on the top side, were used for this test, since we know the true correspondence here. The value of the description length (DL) and the Ground Truth Correspondence Measure (GCM) over the number of iterations is plotted in Figure 9.

It is interesting to note here that the GCM increases (gets worse) as the description length decreases (gets better). The minimum, when the parameterisation is optimised with description length as goal function, does not correspond to true correspondences. In Figure 10 it can be seen that minimising the description length from true correspondences has resulted in worse correspondences. The plot shows 4 Box-bump shapes zoomed in on the bump. Since the plot does not show the actual curve of the shape but instead straight lines between the sample points on the shapes, there appear to have been shape changes at the corners. In an attempt to ensure that the algorithm considers the original shapes properly, the number of sampling points could be increased, but the sample points will still not be exactly at the corners. Also, the main problem with ensuring that the entire shape is taken into consideration is not the number of sample points but the relative weight given to different parts of the shape depending on the parameterisation function, independent of the number of sample points.

Figure 9: The Ground Truth Correspondence Measure (GCM) and the description length (DL) plotted over the number of iterations.

In Davies' thesis [7] a somewhat similar experiment was performed, where the description length of the Box-bump model was measured when the correspondences initialised at ground truth were perturbed by Gaussian noise. The description length near ground truth was examined in Figure 6.1 in [7] by showing that the description length increases monotonically in mean when applying random perturbations of increasing magnitude. However, this does not show that the minimum of the description length is located at ground truth, since it does not show that there is no path that leads to lower description length than that at ground truth correspondences. In fact, it is unlikely that the path to the optimum would be found by random perturbations. So these experiments do not contradict each other.


Figure 10: To the left the true correspondences for the Box-bumps can be seen. To the right the correspondences established by minimising the description length are shown. The figure shows the bump part of the shapes.

In Figure 11 the compactness and specificity measures indicate that the optimised model has higher quality. In this case we started from ground truth, and as can be seen in Figure 10 the correspondences are worse for the optimised model. So we can conclude that compactness and specificity do not correlate with correct correspondences. The problems with compactness were already noted in [7]. As an extreme example, it is possible to get perfect compactness, specificity and generality (zero) by placing all sample points at one point on all shapes.

Figure 11: Generalisation, compactness and specificity of the ground truth Box-bump model and the optimised Box-bump model.

In Figure 12 the specificity is evaluated visually. The upper model is built from ground truth correspondences and the lower model is built from DL-optimised correspondences. The plots show the mean shape plus three standard deviations of the first two shape modes. Since the data in this case has only one shape mode it is enough to examine the first two modes as in this plot. The model built from ground truth correspondences is clearly more specific than the DL-optimised model. The slight distortion in model 1 is due to the alignment of the shapes; the Procrustes alignment introduces nonlinearities in the shape modes. From Figure 12 it can be concluded that the model built from better correspondences has better specificity, contrary to the conclusion of the plot of (6) in Figure 11. So the specificity measure defined by (6) does not work in practice either. The problem with the specificity measure is that if the training set is limited (which is often the case) it cannot be assumed that all shapes of a class are close to one of the shapes in the training set.

Summing up this experiment, the conclusion is that although minimising the description length is a good method for finding approximate correspondences, in this experiment it fails to locate the correct correspondences. We also conclude that compactness and specificity do not correlate with the quality of the correspondences in general and that the specificity measure fails to measure the desired property.

In Section 5.3.2 in [7] a similar experiment shows that MDL does succeed in finding optimal correspondences on the synthetic Box-bump dataset. However, in that experiment the important Procrustes alignment step is skipped and the shapes are given perfect shape alignment.

7.3 Experiment 2

In the second experiment silhouette shapes initialised with arclength parameterisation were used. The parameterisations were optimised with respect to MDL [8] until convergence (40 iterations). We then continued the optimisation with respect to MDL plus a curvature cost [37] until convergence (another 40 iterations).

Figure 12: Mean shape plus three standard deviations of the first two shape modes for models built from ground truth correspondences (Model 1) and DL-optimised correspondences (Model 2).

Figure 13 shows the resulting correspondences on the part of the shapes corresponding to the eye. The plots show sample points 25 to 40. Anatomically this shows the end of the forehead and the beginning of the nose of a person looking to the left. The nose begins approximately at sample point 34 in the bottom row. The correspondences are clearly better when using curvature. Other parts of the curves are similar. Figure 14 shows how GCM decreases when curvature is added. The middle plot shows how DL first decreases as it is minimised, but then, when DL + curvature cost is minimised in the second part, DL increases. So GCM captures an improvement in correspondences that DL misses. In Figure 15 it can be seen how the measures of generality, compactness and specificity all indicate that the model without curvature cost has higher quality. So the standard measures cannot be used to measure correspondence quality.

This experiment shows that GCM captures an improvement in correspondences that generality, compactness, specificity and DL all miss.

The above experiment on silhouettes was performed on 4 shapes. A similar experiment using all 22 silhouette shapes was also done, and the results were similar, as can be seen in Figure 16 and Figure 17.

Figure 13: Corresponding sample points on parts of silhouettes (DL above, DL+CC below). To the left the part zoomed in on is marked bold on a whole silhouette.

8 Benchmarking using GCM

8.1 Database and Algorithms

First a database of eight shape classes (one synthetic dataset and seven natural datasets of curves extracted from real images) was collected. The first five shape classes (sharks, birds, flight birds, rats, forks) are curves generated from images in the LEMS database from Brown University, and each of these shape classes consists of 13-23 shapes [31, 32]. The sixth and eighth shape classes are the silhouette and Box-bump shapes from the previous section. The seventh dataset consists of 23 contours of a hand. Figure 18 shows one example from each shape class in the database.

All the natural examples have been marked with ground truth correspondences (10-21 landmarks) by 18-25 people. In total the database consists of 28518 manually set ground truth landmarks (not including the synthetic dataset). The Box-bumps are synthetic, with a total of 1464 ground truth points on 24 shapes. This database together with code can be downloaded from www.maths.lth.se/matematiklth/vision/ddd

Figure 14: Optimisation of DL and DL + curvature cost for 4 silhouettes.

Figure 15: Generalisation, compactness and specificity of the DL silhouette model and the DL + curvature cost silhouette model.

Figure 16: Optimisation of DL and DL + curvature cost for 22 silhouettes.

The following algorithms have been benchmarked using GCM:
arclength: All the landmarks are placed with equal arclength distance from each other.
MDL: The parameterisations are optimised so as to minimise the description length of the model and the dataset. See for example [8].
MDL+Cur: A curvature cost measuring the difference in curvature at corresponding points is added to the description length [37].
MDL+me: One of the shapes is used as a master example. Its parameterisation function is held fixed during optimisation. This is a technique aimed at avoiding clustering of sample points [8].
MDL+nodecost: Another technique for avoiding clustering of sample points. A node cost is introduced to ensure that the parameterisation node parameters should in mean be close to zero for each parameterisation node [37].
MDL+Par: An alternative scalar product, which is invariant to mutual reparameterisations [22], is used. This is another alternative for avoiding clustering.
AIAS+MDL: The description length is minimised for an Affine Invariant Active Shape model [14].
AIAS+MDL+Cur: The description length is minimised for an Affine Invariant Active Shape model [14] and a curvature cost is added.
Eucl: The parameterisations are optimised to minimise the Euclidean distance between corresponding points on all the shapes.
Eucl+Cur: The parameterisations are optimised to minimise a distance defined as the Euclidean distance plus a distance in curvature difference between corresponding points on all the shapes.
Cur: The curvature cost used in the algorithms above is here used by itself as a cost function to be minimised.
Handmade: Handmade models of all the datasets were also built, by a different person than the ones marking ground truth. This was done without knowing which anatomical points were used as ground truth.

8.2 Benchmarking

All tests were performed with 128 sample points, 40 iterations and 7 reparameterisation control nodes.

Algorithm          Sharks    Birds     Flightbirds  Rats      Forks     Silhouettes  Hands     Mean      Median    Box-bumps*
Arclength (GCM)    15.55     22.88     7.12         13.37     19.11     8.93         14.00     14.42     14.00     -
MDL (%)            27.21^7   65.17^6   56.33^6      29.03^7   19.41^3   46.23^6      17.60^8   37.28^6   29.03^7   29.20^3
MDL+Cur (%)        21.51^1   79.93^8   45.02^1      27.02^1   23.26^6   22.54^2      14.80^4   33.44^3   23.26^1   17.42^1
MDL+me (%)         29.26^9   91.67^9   61.80^10     30.10^8   19.83^4   47.76^8      16.81^7   42.46^8   30.10^8   50.20^5
MDL+nodecost (%)   26.13^6   66.71^7   56.34^7      28.90^6   19.21^2   47.37^7      16.24^6   37.27^5   28.90^6   30.97^4
MDL+Par (%)        24.40^5   62.40^5   47.94^4      27.80^3   18.28^1   39.88^5      14.17^2   33.55^4   27.80^4   85.80^7
AIAS+MDL           21.93^3   23.38^1   57.69^8      28.37^5   20.37^5   38.81^4      14.57^3   29.30^2   23.38^2   75.93^6
AIAS+MDL+Cur (%)   21.61^2   23.73^2   47.73^3      27.42^2   23.92^7   26.12^3      14.17^1   26.38^1   23.92^3   111.00^8
Eucl (%)           43.76^10  60.02^4   58.85^9      36.92^10  27.32^8   129.74^10    26.39^10  54.71^10  43.76^10  116.82^9
Eucl+Cur (%)       28.84^8   55.18^3   53.54^5      35.03^9   28.59^9   103.57^9     18.26^9   46.14^9   35.03^9   118.88^10
Cur (%)            22.01^4   111.16^10 45.82^2      27.96^4   31.19^10  21.12^1      15.84^5   39.30^7   27.96^5   17.60^2
Semiauto (%)       20.31     20.40     47.68        24.58     14.32     16.67        9.27      21.89     20.31     16.52
Handmade (%)       17.94     8.84      14.24        10.44     7.33      9.53         6.67      10.71     9.53      0.00

Table 1: The second row of the table shows the GCM for arclength parameterisation of the different datasets. The following rows show the percentage of GCM error (Mahalanobis normalised) left after optimising from arclength parameterisation. The upper index indicates the rank of the algorithm on that dataset, so rank 1 marks the winning algorithm on each dataset. Since the Box-bumps (*) are so easy to mark, only one person has marked this dataset and the GCM without Mahalanobis has been used.

Table 1 shows the remaining percentage of GCM (with Mahalanobis) after optimising from arclength (100% means equal GCM as when using arclength parameterisation and 0% means perfect correspondences according to GCM). AIAS+MDL+Cur is the algorithm that is best in mean. This algorithm succeeds especially well on the bird dataset. MDL+Cur has the lowest median result. Since the median is a more stable measure, and since MDL+Cur is the best algorithm on three natural datasets and also performs best on the synthetic dataset, MDL+Cur is selected as the winning algorithm. There is no algorithm that is best on all datasets and no algorithm gives as good correspondences as the manually annotated correspondences.

In Figure 19 the correspondences on the legs of the birds (the most difficult dataset) are shown. Four birds out of fourteen were picked to illustrate the results, and the legs are where the differences are most easily seen. See Figure 18 for an example of the whole shape. Since AIAS+MDL turned out to be outstanding on the birds dataset in the benchmarking, as seen in Table 1, it is interesting to visually compare the results of the different algorithms on this dataset. In the columns of Figure 19, starting with the left column, are the results of the MDL+Cur algorithm, the Eucl+Cur algorithm, the AIAS+MDL algorithm and, in the right column, the handmade correspondences. MDL+Cur is the best overall algorithm, but as can be seen in Table 1 it is not good on the birds dataset. The Euclidean algorithm is a simple algorithm, but as can be seen in the table it gets decent results on this dataset. The AIAS+MDL algorithm is the best on the birds dataset. It can easily be seen that MDL+Cur does not perform well on this dataset. Eucl+Cur also has problems, which can be seen most easily on the first shape. AIAS+MDL is quite good and the handmade correspondences are even better. So the visual inspection confirms the results of the GCM benchmarking.

For the winning algorithm, GCM was then used to pick optimal parameter values, such as the number of sample points and the number of parameterisation nodes, by evaluating GCM on the shark dataset. The algorithm was then run with these parameters on all datasets, which verified that this resulted in an even better algorithm.

8.3 Semi-Automatic Algorithm

Since handmade models are best, a semi-automatic algorithm was tested. Five shapes were manually marked and then kept fixed, while the rest of the shapes in the dataset were optimised one by one, using DL with curvature cost, to fit the five fixed shapes. This resulted in an algorithm better than all the automatic algorithms, see Table 1. Seven control nodes were used for all natural datasets, but for the synthetic Box-bumps 15 nodes were used. Experiments with 15 nodes for the automatic algorithms resulted in worse correspondences for all algorithms except Eucl and Eucl+Cur, where only slightly better results were obtained.
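The semi-automatic scheme can be sketched as a simple loop. The callable `optimise_to_fixed` is hypothetical: it stands in for one run of the DL-with-curvature-cost optimiser, which reparameterises a single shape so that it best fits the set of fixed, manually marked shapes:

```python
def semi_automatic(shapes, optimise_to_fixed, n_fixed=5):
    """Keep the first n_fixed (manually marked) shapes fixed and
    optimise each remaining shape against them, one by one."""
    fixed = list(shapes[:n_fixed])   # assumed already manually marked
    result = list(fixed)
    for shape in shapes[n_fixed:]:
        # One shape at a time is fitted to the fixed set; the fixed
        # shapes themselves are never re-optimised.
        result.append(optimise_to_fixed(shape, fixed))
    return result
```

The design point is that the manual annotations act as an anchor: since each remaining shape is optimised independently against the same fixed set, errors cannot propagate between the automatically processed shapes.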



[Figure 19 image: four columns of correspondence markers on bird legs, labelled MDL+Cur, Eucl+Cur, AIAS+MDL and Handmade.]

Figure 19: Correspondences on the bird dataset (the most difficult dataset) for four different algorithms, zoomed in on the legs.

8.4 Benchmarking Summary

In Table 1 it can be seen that, in median, MDL+Cur is the best algorithm, and it is also best on the synthetic dataset. No algorithm is best on all datasets, and no algorithm gives correspondences as good as those marked manually. The semi-automatic algorithm is better than the automatic algorithms on all datasets but the flightbird dataset.

In previous work it has been claimed that automatic algorithms give better models than models built by hand [9]. These claims are supported by measures like generality, specificity and compactness, but we have shown that these measures have severe problems and should not be used for evaluation. Measuring GCM, it is concluded that models carefully built by hand are actually very good,



[Figure 17 image: three panels plotted against 0–25 modes, titled "MDL vs MDL + curvature, silhouettes" (Gen.), "Compactness MDL all silhouettes" (Comp.) and "Specificity MDL all silhouettes" (Spec.), each with curves for MDL+curvature and MDL; x-axis: Nr of modes.]

Figure 17: Generalisation, compactness and specificity of the arclength DL silhouette model and the DL + curvature cost silhouette model, using 22 shapes.

at least in 2D. In some cases it may not be reasonable to manually mark the full dataset but, as seen, a semi-automatic approach, where only five shapes need to be manually marked, works very well.

9 Summary and Conclusions

For evaluation of the quality of shape models built from automatically located correspondences, there exist a number of standard measures. These measures can give a rough idea about the quality of a model in some cases, but in other cases the conclusions drawn from them can be wrong. It is shown in this paper that there is reason to question the ability of these measures to capture model quality. These measures are analysed both theoretically and experimentally, on real and synthetic data. It is shown that they do not succeed in measuring the desired properties, and also that they do not capture the quality of the correspondences.

Figure 18: One example from each class of shapes in the database.

Since the general model quality measures do not work, we instead propose a Ground Truth Correspondence Measure (GCM) for the evaluation of correspondence quality, to be used in benchmarking. It is shown in experiments on different datasets (both natural and synthetic) that this measure corresponds well to subjective evaluation of correspondence quality, whereas the standard measures do not. In this paper several state of the art algorithms are benchmarked using GCM.

As a final note, it would of course be desirable to have ground-truth-free measures but, as we have shown here, the measures available all have severe problems and cannot be used for benchmarking. It is our view that ground truth based measures must be used for the evaluation of correspondence algorithms used for automatic model building.

10 Future Work

The measure GCM can also be used to evaluate algorithms for finding correspondences on shapes of any dimension, such as surfaces. A simple examination is done in [34], but a more thorough examination along the lines of this paper remains to be done.

It would be very interesting to examine a ground truth segmentation measure, using a database of segmented images and measuring how well the shape model is able to segment them. This could be done using, for example, the methods used in [26].



Also, in medical decision support for example, it would be interesting to evaluate a ground truth diagnosis measure.

A possibility for future generations of shape correspondence algorithms could be to use machine learning and huge databases of correspondences to learn which points should be in correspondence. This could imitate how humans manage to solve the problem. The database published here could be a first step in this direction. Another argument for using machine learning is that anatomical landmarks are defined by humans, and it is unlikely that there exists a mathematical expression that coincides with the human notion of correspondence.

11 Acknowledgments

We thank H. Thodberg (IMM, DTU) for the silhouettes and B. Kimia et al. for the images of the sharks, birds, flight birds, rats and forks.

This work has been financed by the SSF sponsored project 'Vision in Cognitive Systems' (VISCOS) and by UMAS and the Swedish Knowledge Foundation through the Industrial PhD program in Medical Bioinformatics at the Centre for Medical Innovations (CMI) at the Karolinska Institute.

References

[1] A. Baumberg and D. Hogg. Learning flexible models from image sequences. In Proc. European Conf. on Computer Vision, ECCV'94, pages 299–308, 1994.

[2] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(24):509–522, 2002.

[3] A. Benayoun, N. Ayache, and I. Cohen. Adaptive meshes and nonrigid motion computation. In Proc. International Conference on Pattern Recognition, Jerusalem, Israel, pages 730–732, 1994.

[4] F. Bookstein. Landmark methods for forms without landmarks: Morphometrics of group differences in outline shape. Medical Image Analysis, 3:225–243, 1999.

[5] H. Chui and A. Rangarajan. A feature registration framework using mixture models. In IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA), pages 190–197, 2000.

[6] H. Chui and A. Rangarajan. A new algorithm for non-rigid point matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume II, pages 44–51, 2000.

[7] R. Davies. Learning Shape: Optimal Models for Analysing Natural Variability. PhD thesis, University of Manchester, 2002.

[8] R. Davies, C. Twining, T. Cootes, J. Waterton, and C. Taylor. A minimum description length approach to statistical shape modeling. IEEE Trans. Medical Imaging, 21(5):525–537, 2002.

[9] R. H. Davies, T. F. Cootes, J. C. Waterton, and C. J. Taylor. An efficient method for constructing optimal statistical shape models. In Medical Image Computing and Computer-Assisted Intervention, MICCAI'2001, pages 57–65, 2001.

[10] R. H. Davies, C. J. Twining, P. D. Allen, T. F. Cootes, and C. J. Taylor. Shape discrimination in the hippocampus using an MDL model. In Information Processing in Medical Imaging, 2003.

[11] I. L. Dryden and K. V. Mardia. Statistical Shape Analysis. John Wiley & Sons, Inc., 1998.

[12] A. Ericsson. Automatic shape modelling and applications in medical imaging. Technical report, Mathematics LTH, Centre for Mathematical Sciences, Box 118, SE-22100, Lund, Sweden, Nov 2003.

[13] A. Ericsson. Automatic Shape Modelling with Applications in Medical Imaging. PhD thesis, Lund University, Centre for Mathematical Sciences, Box 118, SE-22100, Lund, Sweden, Sep 2006.

[14] A. Ericsson and K. Åström. An affine invariant deformable shape representation for general curves. In Proc. 9th Int. Conf. on Computer Vision, Nice, France, pages 1142–1149, 2003.

[15] A. Ericsson and K. Åström. Minimizing the description length using steepest descent. In Proc. British Machine Vision Conference, Norwich, United Kingdom, volume 2, pages 93–102, 2003.

[16] A. Ericsson and J. Karlsson. Aligning shapes by minimising the description length. In Proc. Scandinavian Conf. on Image Analysis, SCIA'05, Joensuu, Finland, volume 3540/2005, pages 709–718, 2005.

[17] R. Fisher. CAVIAR project, 2005. Ground truth labelled video sequences. Available at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.

[18] J. Gower. Generalized Procrustes analysis. Psychometrika, 40:33–50, 1975.

[19] A. Hill and C. Taylor. Automatic landmark generation for point distribution models. In Proc. British Machine Vision Conference, pages 429–438, 1994.

[20] A. Hill and C. Taylor. A framework for automatic landmark identification using a new method of nonrigid correspondence. IEEE Trans. Pattern Analysis and Machine Intelligence, 22:241–251, 2000.

[21] C. Kambhamettu and D. Goldgof. Point correspondence recovery in non-rigid motion. In Proc. Conf. Computer Vision and Pattern Recognition, CVPR'92, pages 222–237, 1992.

[22] J. Karlsson, A. Ericsson, and K. Åström. Parameterisation invariant statistical shape models. In Proc. International Conference on Pattern Recognition, Cambridge, UK, 2004.

[23] A. Kelemen, G. Szekely, and G. G. Elastic model-based segmentation of 3D neuroradiological data sets. IEEE Trans. Medical Imaging, 18(10):828–839, 1999.

[24] A. Kotcheff and C. Taylor. Automatic construction of eigenshape models by direct optimization. Medical Image Analysis, 2:303–314, 1998.

[25] Z. Mao, X. Ju, J. Siebert, W. Cockshott, and A. Ayoub. Constructing dense correspondences for the analysis of 3D facial morphology. Pattern Recognition Letters, 27(6):597–608, April 2006.

[26] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549, 2004.

[27] J. Richter, A. Ericsson, K. Åström, F. Kahl, and L. Edenbrant. Automated interpretation of cardiac scintigrams. In Proc. 13th Scandinavian Conf. on Image Analysis, Gothenburg, Sweden, 2003.

[28] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47(1/2/3):7–42, 2002.

[29] R. Schestowitz, C. Twining, T. Cootes, V. Petrovic, C. Taylor, and B. Crum. Assessing the accuracy of non-rigid registration with and without ground truth. In Proc. IEEE International Symposium on Biomedical Imaging, 2006.

[30] T. Sebastian, P. Klein, and B. Kimia. Constructing 2D curve atlases. In IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, pages 70–77, 2000.

[31] T. Sebastian, P. Klein, and B. Kimia. On aligning curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(1):116–125, 2003.

[32] D. Sharvit, J. Chan, H. Tek, and B. Kimia. Symmetry-based indexing of image databases. Journal of Visual Communication and Image Representation, 9(4):366–380, 1998.

[33] J. Sternby and A. Ericsson. Core points - a framework for structural parameterization. In Proc. International Conference on Document Analysis and Recognition, ICDAR'05, Seoul, Korea, 2005.

[34] M. Styner, K. Rajamani, L. Nolte, G. Zsemlye, G. Szekely, C. Taylor, and R. H. Davies. Evaluation of 3D correspondence methods for model building. In Information Processing in Medical Imaging (IPMI), pages 63–75, 2003.

[35] H. Tagare. Shape-based nonrigid correspondence with application to heart motion analysis. IEEE Trans. Medical Imaging, 18:570–579, 1999.

[36] H. H. Thodberg. Minimum description length shape and appearance models. In Information Processing in Medical Imaging, IPMI, 2003.

[37] H. H. Thodberg and H. Olafsdottir. Adding curvature to minimum description length shape models. In Proc. British Machine Vision Conference, 2003.

[38] C. Twining and C. Taylor. Specificity as a graph-based estimator of cross-entropy. In Proc. British Machine Vision Conference, Edinburgh, United Kingdom, volume 2, pages 459–468, 2006.

[39] Y. Wang, B. Peterson, and L. Staib. Shape-based 3D surface correspondence using geodesics and local geometry. In Proc. Conf. Computer Vision and Pattern Recognition, CVPR'00, pages 644–651, 2000.

[40] Y. Zheng and D. Doermann. Robust point matching for non-rigid shapes: A relaxation labeling based approach. Technical Report LAMP-TR-117, University of Maryland, College Park, 2004.
