modern features-part-4-evaluation
TRANSCRIPT
Andrea Vedaldi, University of Oxford
Feature evaluation: old and new benchmarks, and new software
HARVEST Programme
Benchmarking: why and how
• Dozens of feature detectors and descriptors have been proposed
• Benchmarks
  - compare methods empirically
  - select the best method for a task
• Public benchmarks
  - reproducible research
  - simplify your life!
• Ingredients of a benchmark:
  - theory
  - data
  - software
• Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
• Direct evaluation: image retrieval (data: Oxford 5K)
• Software: VLBenchmarks
Indirect feature evaluation
• Intuition: test how well features persist and can be matched across image transformations.
• Data
  - must be representative of the transformations (viewpoint, illumination, noise, etc.)
• Performance measures
  - repeatability: persistence of features
  - matching score: matchability of features
K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1/2):43–72, 2005.
Affine Testbed
[Image sequences covering viewpoint, scale, and rotation changes, as well as lighting, compression, and blur changes.]
Intuition
Detector repeatability
• two pairs of features correspond
• another pair of features does not
• repeatability = 2/3
Formal definition
Region overlap
Let H be the homography between the two images, and let Ra and Rb be regions detected in the first and second image. Project Rb into the first image as HRb and measure the area of intersection over the area of union:

  overlap(a, b) = |Ra ∩ HRb| / |Ra ∪ HRb|
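As a concrete illustration of this intersection-over-union measure, the two regions can be rasterised onto a pixel grid and their overlap counted. This is a minimal sketch, not the benchmark's own implementation (which works with elliptical frames and applies the warp HRb analytically); the function names are hypothetical:

```python
import numpy as np

def ellipse_mask(xs, ys, cx, cy, A):
    """Boolean mask of the elliptical region {p : (p - c)^T A (p - c) <= 1}."""
    dx, dy = xs - cx, ys - cy
    return A[0, 0] * dx * dx + 2 * A[0, 1] * dx * dy + A[1, 1] * dy * dy <= 1.0

def overlap(mask_a, mask_b):
    """Area of intersection over area of union of two region masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union

# Two concentric circles of radius 10 and 20 pixels: the exact
# intersection-over-union is (pi 10^2) / (pi 20^2) = 0.25.
ys, xs = np.mgrid[-40:41, -40:41]
Ra = ellipse_mask(xs, ys, 0, 0, np.eye(2) / 10.0 ** 2)
HRb = ellipse_mask(xs, ys, 0, 0, np.eye(2) / 20.0 ** 2)
print(round(overlap(Ra, HRb), 2))  # approximately 0.25
```

The rasterised value approaches the analytic ratio as the grid resolution grows; the example already recovers it to two decimal places.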
Intuition
Region overlap
[Figure 12 of "A Comparison of Affine Region Detectors": examples of ellipses projected onto the corresponding ellipse with the ground-truth transformation, with the resulting overlap error. The overlap error comes from differences in the size, orientation, and position of the ellipses.]
• Examples of ellipses overlapping by different amounts
• Usually, features are tested at 40% overlap error (= 60% overlap); overlap error = 1 − overlap
Intuition
Normalised region overlap
• Larger scale = better overlap: bigger regions trivially overlap more, so overlap must be measured at a normalised scale.
• Rescale so that the reference region has an area of 30² pixels.
Formal definition
Normalised region overlap
Let H be the homography between the two images.
1. Detect regions Ra and Rb.
2. Warp: project Rb into the first image as HRb.
3. Normalise: rescale both regions by a factor s chosen so that the reference region sRa has an area of 30² pixels.
4. Intersection over union:

  overlap(a, b) = |sRa ∩ sHRb| / |sRa ∪ sHRb|
Formal definition
Detector repeatability
Let H be the homography between the two images.
1. Find the features in the common (co-visible) area: {Ra : a ∈ A} and {Rb : b ∈ B}.
2. Threshold the overlap score:

  sab = overlap(a, b)  if overlap(a, b) ≥ 1 − εo,
  sab = −∞             otherwise.

3. Find the geometric matches: the one-to-one (bipartite) assignment M* maximising the total overlap, computed with a greedy approximation:

  M* = arg max over bipartite M of Σ(a,b)∈M sab

  repeatability(A, B) = |M*| / min{|A|, |B|}
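The matching step can be sketched as follows, assuming the candidate pairs have already been scored with the overlap measure. This is an illustrative toy version with hypothetical names, not the benchmark's own code:

```python
def greedy_matches(overlaps, eps_o=0.4):
    """Greedy approximation to the best one-to-one (bipartite) matching.

    overlaps: dict {(a, b): overlap(a, b)} for candidate feature pairs.
    Pairs whose overlap error exceeds eps_o, i.e. overlap < 1 - eps_o,
    are discarded (their score s_ab is effectively -inf).
    """
    candidates = sorted(((s, a, b) for (a, b), s in overlaps.items()
                         if s >= 1.0 - eps_o), reverse=True)
    used_a, used_b, matches = set(), set(), []
    for s, a, b in candidates:
        if a not in used_a and b not in used_b:  # enforce one-to-one matching
            matches.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return matches

def repeatability(overlaps, num_a, num_b, eps_o=0.4):
    return len(greedy_matches(overlaps, eps_o)) / min(num_a, num_b)

# Toy example mirroring the earlier intuition slide: three features per
# image; two pairs overlap enough to match, one fails the 60% overlap test.
ov = {(0, 0): 0.9, (1, 1): 0.7, (2, 2): 0.3}
print(repeatability(ov, 3, 3))  # 2/3
```

The greedy pass (take the highest-scoring remaining pair, then exclude its two features) is the same approximation the slide mentions in place of an exact bipartite assignment.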
Intuition
Descriptor matching score
• In addition to being stable, features must be visually distinctive.
• Descriptor matching score
  - similar to repeatability
  - but matches are constructed by comparing descriptors
Formal definition
Descriptor matching score
Let H be the homography between the two images.
1. Find the features in the common area: {Ra : a ∈ A} and {Rb : b ∈ B}.
2. Compute the descriptor distances between {da : a ∈ A} and {db : b ∈ B}:

  dab = ‖da − db‖₂

3. Descriptor matches: the bipartite assignment minimising the total descriptor distance:

  M*d = arg min over bipartite M of Σ(a,b)∈M dab

4. Geometric matches (as before):

  M* = arg max over bipartite M of Σ(a,b)∈M sab

  match-score(A, B) = |M* ∩ M*d| / min{|A|, |B|}
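A minimal sketch of this score, again with hypothetical names and a greedy assignment; for simplicity the descriptor matching is restricted to the same candidate pairs as the geometric matching, whereas the benchmark considers all pairs in the common area:

```python
import numpy as np

def greedy_assignment(scores, maximize):
    """Greedy approximation to the best one-to-one bipartite assignment."""
    order = sorted(((s, a, b) for (a, b), s in scores.items()),
                   reverse=maximize)
    used_a, used_b, matches = set(), set(), set()
    for _, a, b in order:
        if a not in used_a and b not in used_b:
            matches.add((a, b))
            used_a.add(a)
            used_b.add(b)
    return matches

def matching_score(overlaps, desc_a, desc_b, num_a, num_b, eps_o=0.4):
    # geometric matches M*: maximise total overlap over valid pairs
    geometric = greedy_assignment(
        {p: s for p, s in overlaps.items() if s >= 1.0 - eps_o},
        maximize=True)
    # descriptor matches M*_d: minimise total L2 descriptor distance
    distances = {(a, b): float(np.linalg.norm(desc_a[a] - desc_b[b]))
                 for (a, b) in overlaps}
    by_descriptor = greedy_assignment(distances, maximize=False)
    # a feature pair counts only if it is both geometrically consistent
    # and the closest in descriptor space
    return len(geometric & by_descriptor) / min(num_a, num_b)

# Toy example: features 0 and 1 are both stable and distinctive;
# feature 2 is not stable enough (overlap 0.3 < 0.6).
desc_a = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]), 2: np.array([5.0, 5.0])}
desc_b = {0: np.array([0.0, 0.1]), 1: np.array([1.1, 0.0]), 2: np.array([9.0, 9.0])}
ov = {(0, 0): 0.9, (1, 1): 0.7, (2, 2): 0.3}
print(matching_score(ov, desc_a, desc_b, 3, 3))  # 2/3
```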
Example of a repeatability graph
[Plot: repeatability (%) on the graffiti sequence as a function of viewpoint angle (20 to 60 degrees), comparing VL DoG, VL Hessian, VL HessianLaplace, VL HarrisLaplace (each also in a "double" variant), VLFeat SIFT, CMP Hessian, VGG hes, and VGG har.]
• Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
• Direct evaluation: image retrieval (data: Oxford 5K)
• Software: VLBenchmarks
Indirect evaluation
• Indirect evaluation: a "synthetic" performance measure in a "synthetic" setting
• The good
  - independent of a specific application / implementation
  - allows evaluating single components, e.g.
    ▪ repeatability of the detector
    ▪ matching score of the descriptor
• The bad
  - difficult to design well
  - unclear correlation with performance in applications
Direct evaluation
• Direct evaluation: the performance of a real system using the feature
• The good
  - tied to the "real" performance of the feature
• The bad
  - tied to one application
  - worse, tied to one implementation
  - difficult to evaluate single aspects of a feature
• Tasks used to evaluate features: object instance retrieval, object category recognition, object detection, text recognition, semantic segmentation, ...
• In what follows we focus on object instance retrieval.
Image retrieval
Represent images as bags of features.

Image retrieval pipeline
input image → detector → frames {f1, . . . , fn} → descriptor → descriptors {d1, . . . , dn}
• Detectors: Harris (Laplace), Hessian (Laplace), DoG, MSER, Harris Affine, Hessian Affine, ...
• Descriptors: SIFT, LIOP, BRIEF, Jets, ...
Image retrieval pipeline
Step 1: find the neighbours of each query descriptor among the database descriptors, in increasing descriptor distance.
H. Jégou, M. Douze, and C. Schmid. Exploiting descriptor distances for precise image search. Technical Report 7656, INRIA, 2011.
Image retrieval pipeline
Step 2: each query descriptor casts a vote for each database image. Let d1 ≤ d2 ≤ . . . ≤ dk be the distances from a query descriptor d to its k nearest database descriptors; the image containing the i-th neighbour receives a vote of strength max{dk − di, 0}.
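This voting rule can be sketched as follows. It is a simplified brute-force version with illustrative names; the cited work uses approximate nearest-neighbour search over millions of descriptors:

```python
import numpy as np

def rank_database(query_descs, db_descs, db_image_ids, k=3):
    """Rank database images by total votes from the query descriptors.

    query_descs: (m, D) array of query descriptors.
    db_descs:    (n, D) array of all database descriptors.
    db_image_ids[j]: index of the database image containing descriptor j.
    """
    votes = {img: 0.0 for img in set(db_image_ids)}
    for q in query_descs:
        dist = np.linalg.norm(db_descs - q, axis=1)  # distance to every DB descriptor
        nn = np.argsort(dist)[:k]                    # k nearest neighbours
        d_k = dist[nn[-1]]                           # distance to the k-th neighbour
        for j in nn:                                 # the i-th neighbour's image gets
            votes[db_image_ids[j]] += max(d_k - dist[j], 0.0)  # a vote d_k - d_i
    return sorted(votes, key=votes.get, reverse=True)

# One query descriptor; image 0 holds two close descriptors, image 1 a far one.
query = np.array([[0.0, 0.0]])
db = np.array([[0.0, 0.1], [2.0, 2.0], [0.2, 0.0]])
ranking = rank_database(query, db, db_image_ids=[0, 1, 0], k=3)
print(ranking[0])  # image 0 collects the strongest votes
```

Note that the k-th neighbour itself contributes dk − dk = 0, so only genuinely close descriptors add to an image's score.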
Image retrieval pipeline
Step 3: sort the database images by decreasing total votes.
[Example: ranked results for a query, marked correct (✔) or incorrect (✗); the quality of the ranking for one query is summarised by the Average Precision (AP), the area under the precision-recall curve, here 35%.]
Image retrieval pipeline
Step 4: compute the overall performance score. The per-query APs (e.g. 35%, 100%, 75%, and so on over all queries) are averaged into the Mean Average Precision (mAP), here 53%.
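The two scores can be computed as follows. This uses the common approximation of AP as the mean of the precision at each correctly retrieved result, which closely tracks the area under the precision-recall curve that the slide describes:

```python
def average_precision(relevance):
    """relevance: 1/0 flags for the ranked results (1 = correct match)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall level
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_query_relevance):
    """mAP: the per-query APs averaged over all queries."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)

# One query whose 1st and 3rd results are correct:
# AP = (1/1 + 2/3) / 2 = 5/6
print(round(average_precision([1, 0, 1]), 3))
# Averaging a perfect query (AP = 1.0) with a weaker one (AP = 0.5):
print(mean_average_precision([[1, 1, 1], [0, 1]]))  # 0.75
```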
A retrieval benchmark dataset
Oxford 5K data
• ~5K images of Oxford
  - For each of 58 queries
    ▪ about XX matching images
    ▪ about XX confounder images
• Larger datasets are possible, but slow for extensive evaluation
• The relative ranking of features seems to be representative
[Example: a query image and its retrieved images, marked correct (✔) or incorrect (✗).]
• Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
• Direct evaluation: image retrieval (data: Oxford 5K)
• Software: VLBenchmarks
A new easy-to-use benchmarking suite
VLBenchmarks
• A novel MATLAB framework for feature evaluation
  - Repeatability and matching scores (VGG affine testbed)
  - Image retrieval (Oxford 5K)
• Goodies
  - Simple to use MATLAB code
  - Automatically downloads datasets & runs evaluations
  - Backward compatible with published results
http://www.vlfeat.org/benchmarks/index.html
Obtaining and installing the code
VLBenchmarks
• Installation
  - Download the latest version
  - Unpack the archive
  - Launch MATLAB and type
    >> install
• Requirements
  - MATLAB R2008a (7.6) or later
  - A C compiler (e.g. Visual Studio, GCC, or Xcode)
  - Do not forget to set up MATLAB to use your C compiler:
    >> mex -setup
Example usage

import datasets.*;
import localFeatures.*;
import benchmarks.*;

detector = VggAffine('Detector', 'haraff');      % choose a detector (Harris Affine, VGG version)
dataset = VggAffineDataset('category', 'graf');  % choose a dataset (graffiti sequence)
benchmark = RepeatabilityBenchmark();            % choose a test (detector repeatability)

% run the evaluation (repeatability = 0.66)
repeatability = benchmark.testFeatureExtractor(detector, ...
    dataset.getTransformation(2), ...
    dataset.getImagePath(1), ...
    dataset.getImagePath(2));
Testing on a sequence of images

import datasets.*;
import localFeatures.*;
import benchmarks.*;

detector = VggAffine('Detector', 'haraff');
dataset = VggAffineDataset('category', 'graf');
benchmark = RepeatabilityBenchmark();

for j = 1:5
  repeatability(j) = benchmark.testFeatureExtractor(detector, ...
      dataset.getTransformation(j), ...
      dataset.getImagePath(1), ...
      dataset.getImagePath(j));
end

Use parfor on a cluster!
[Plot: repeatability as a function of the image number (1 to 5) in the sequence.]
Comparing two features

import datasets.*;
import localFeatures.*;
import benchmarks.*;

detectors{1} = VlFeatSift();
detectors{2} = VlFeatCovdet('EstimateAffineShape', true);
dataset = VggAffineDataset('category', 'graf');
benchmark = RepeatabilityBenchmark();

for d = 1:2
  for j = 1:5
    repeatability(j,d) = ...
        benchmark.testFeatureExtractor(detectors{d}, ...
                                       dataset.getTransformation(j), ...
                                       dataset.getImagePath(1), ...
                                       dataset.getImagePath(j));
  end
end

clf; plot(repeatability, 'linewidth', 4);
xlabel('image number');
ylabel('repeatability');
grid on;
[Plot: repeatability of the two features as a function of the image number (1 to 5).]
Affine Adaptation
• Compare the following features
  - SIFT, MSER, and features on a grid
  - on the Graffiti sequence
  - for repeatability and number of correspondences
[Plots: repeatability (%) and number of correspondences as functions of viewpoint angle (20 to 60 degrees) for SIFT, MSER, and features on a grid on the Graffiti sequence.]
Example
Backward compatible
• Previously published results can be easily reproduced
  - if interested, try the script reproduceIjcv05.m
[First page of the paper: K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 2005. DOI: 10.1007/s11263-005-3848-x.]
Other useful tricks
• Compare different parameter settings:

detectors{1} = VggAffine('Detector', 'haraff', 'Threshold', 500);
detectors{2} = VggAffine('Detector', 'haraff', 'Threshold', 1000);

• Visualising matches:

[~, ~, matches, reprojFrames] = benchmark.testFeatureExtractor(...);
benchmarks.helpers.plotFrameMatches(matches, reprojFrames);

[Figures: matched and unmatched reference/test image frames, for SIFT and for a mean-variance-median descriptor, on image 4 of the VggAffineDataset graf sequence.]
Other benchmarks
• Detector matching score:

benchmark = RepeatabilityBenchmark('mode', 'MatchingScore');

• Image retrieval (example: Oxford 5K lite, mAP evaluation):

dataset = VggRetrievalDataset('Category', 'oxbuild', ...
                              'BadImagesNum', 100);
benchmark = RetrievalBenchmark();
mAP = benchmark.testFeatureExtractor(detectors{d}, dataset);
Summary
• Benchmarks
  - Indirect: repeatability and matching score
  - Direct: image retrieval
• VLBenchmarks
  - a simple to use, convenient MATLAB framework
• The future
  - Existing measures have many shortcomings
  - Hopefully better benchmarks will be available soon
  - And they will be added to VLBenchmarks for your convenience
http://www.vlfeat.org/benchmarks/index.html
Credits
HARVEST Programme
Karel Lenc
Andrew Zisserman
Cordelia Schmid
Krystian Mikolajczyk
Jiri Matas
Tinne Tuytelaars
Varun Gulshan
VLFeat: http://www.vlfeat.org/
VLBenchmarks: http://www.vlfeat.org/benchmarks/
Thank you for coming!