modern features-part-4-evaluation
TRANSCRIPT
Andrea Vedaldi, University of Oxford
Feature evaluation: old and new benchmarks, and new software
HARVEST Programme
Benchmarking: why and how
• Dozens of feature detectors and descriptors have been proposed
• Benchmarks
  - compare methods empirically
  - select the best method for a task
• Public benchmarks
  - reproducible research
  - simplify your life!
• Ingredients of a benchmark:
  - theory
  - data
  - software
• Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
• Direct evaluation: image retrieval (data: Oxford 5K)
• Software: VLBenchmarks
Indirect feature evaluation
• Intuition: test how well features persist and can be matched across image transformations.
• Data
  - must be representative of the transformations (viewpoint, illumination, noise, etc.)
• Performance measures
  - repeatability: persistence of features
  - matching score: matchability of features
K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1/2):43–72, 2005.
Affine Testbed
[Image sequences covering viewpoint, scale, and rotation changes, as well as lighting, compression, and blur changes.]
Intuition
Detector repeatability
• two pairs of features correspond
• another pair of features does not
• repeatability = 2/3
Formal definition
Region overlap
Let H be the homography between the two images, and let Ra and Rb be regions detected in the first and second image. Project Rb into the first image as HRb and measure the area of intersection over the area of union:

  overlap(a, b) = |Ra ∩ HRb| / |Ra ∪ HRb|
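As a concrete illustration of this intersection-over-union measure, the two regions can be rasterised onto a pixel grid and their overlap counted. This is a minimal sketch, not the benchmark's own implementation (which works with elliptical frames and applies the warp HRb analytically); the function names are hypothetical:

```python
import numpy as np

def ellipse_mask(xs, ys, cx, cy, A):
    """Boolean mask of the elliptical region {p : (p - c)^T A (p - c) <= 1}."""
    dx, dy = xs - cx, ys - cy
    return A[0, 0] * dx * dx + 2 * A[0, 1] * dx * dy + A[1, 1] * dy * dy <= 1.0

def overlap(mask_a, mask_b):
    """Area of intersection over area of union of two region masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union

# Two concentric circles of radius 10 and 20 pixels: the exact
# intersection-over-union is (pi 10^2) / (pi 20^2) = 0.25.
ys, xs = np.mgrid[-40:41, -40:41]
Ra = ellipse_mask(xs, ys, 0, 0, np.eye(2) / 10.0 ** 2)
HRb = ellipse_mask(xs, ys, 0, 0, np.eye(2) / 20.0 ** 2)
print(round(overlap(Ra, HRb), 2))  # approximately 0.25
```

The rasterised value approaches the analytic ratio as the grid resolution grows; the example already recovers it to two decimal places.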
Intuition
Region overlap
[Figure 12 of "A Comparison of Affine Region Detectors": examples of ellipses projected onto the corresponding ellipse with the ground-truth transformation, with the resulting overlap error. The overlap error comes from differences in the size, orientation, and position of the ellipses.]
• Examples of ellipses overlapping by different amounts
• Usually, features are tested at 40% overlap error (= 60% overlap); overlap error = 1 − overlap
Intuition
Normalised region overlap
• Larger scale = better overlap: bigger regions trivially overlap more, so overlap must be measured at a normalised scale.
• Rescale so that the reference region has an area of 30² pixels.
Formal definition
Normalised region overlap
Let H be the homography between the two images.
1. Detect regions Ra and Rb.
2. Warp: project Rb into the first image as HRb.
3. Normalise: rescale both regions by a factor s chosen so that the reference region sRa has an area of 30² pixels.
4. Intersection over union:

  overlap(a, b) = |sRa ∩ sHRb| / |sRa ∪ sHRb|
Formal definition
Detector repeatability
Let H be the homography between the two images.
1. Find the features in the common (co-visible) area: {Ra : a ∈ A} and {Rb : b ∈ B}.
2. Threshold the overlap score:

  sab = overlap(a, b)  if overlap(a, b) ≥ 1 − εo,
  sab = −∞             otherwise.

3. Find the geometric matches: the one-to-one (bipartite) assignment M* maximising the total overlap, computed with a greedy approximation:

  M* = arg max over bipartite M of Σ(a,b)∈M sab

  repeatability(A, B) = |M*| / min{|A|, |B|}
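The matching step can be sketched as follows, assuming the candidate pairs have already been scored with the overlap measure. This is an illustrative toy version with hypothetical names, not the benchmark's own code:

```python
def greedy_matches(overlaps, eps_o=0.4):
    """Greedy approximation to the best one-to-one (bipartite) matching.

    overlaps: dict {(a, b): overlap(a, b)} for candidate feature pairs.
    Pairs whose overlap error exceeds eps_o, i.e. overlap < 1 - eps_o,
    are discarded (their score s_ab is effectively -inf).
    """
    candidates = sorted(((s, a, b) for (a, b), s in overlaps.items()
                         if s >= 1.0 - eps_o), reverse=True)
    used_a, used_b, matches = set(), set(), []
    for s, a, b in candidates:
        if a not in used_a and b not in used_b:  # enforce one-to-one matching
            matches.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return matches

def repeatability(overlaps, num_a, num_b, eps_o=0.4):
    return len(greedy_matches(overlaps, eps_o)) / min(num_a, num_b)

# Toy example mirroring the earlier intuition slide: three features per
# image; two pairs overlap enough to match, one fails the 60% overlap test.
ov = {(0, 0): 0.9, (1, 1): 0.7, (2, 2): 0.3}
print(repeatability(ov, 3, 3))  # 2/3
```

The greedy pass (take the highest-scoring remaining pair, then exclude its two features) is the same approximation the slide mentions in place of an exact bipartite assignment.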
Intuition
Descriptor matching score
• In addition to being stable, features must be visually distinctive.
• Descriptor matching score
  - similar to repeatability
  - but matches are constructed by comparing descriptors
Formal definition
Descriptor matching score
Let H be the homography between the two images.
1. Find the features in the common area: {Ra : a ∈ A} and {Rb : b ∈ B}.
2. Compute the descriptor distances between {da : a ∈ A} and {db : b ∈ B}:

  dab = ‖da − db‖₂

3. Descriptor matches: the bipartite assignment minimising the total descriptor distance:

  M*d = arg min over bipartite M of Σ(a,b)∈M dab

4. Geometric matches (as before):

  M* = arg max over bipartite M of Σ(a,b)∈M sab

  match-score(A, B) = |M* ∩ M*d| / min{|A|, |B|}
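A minimal sketch of this score, again with hypothetical names and a greedy assignment; for simplicity the descriptor matching is restricted to the same candidate pairs as the geometric matching, whereas the benchmark considers all pairs in the common area:

```python
import numpy as np

def greedy_assignment(scores, maximize):
    """Greedy approximation to the best one-to-one bipartite assignment."""
    order = sorted(((s, a, b) for (a, b), s in scores.items()),
                   reverse=maximize)
    used_a, used_b, matches = set(), set(), set()
    for _, a, b in order:
        if a not in used_a and b not in used_b:
            matches.add((a, b))
            used_a.add(a)
            used_b.add(b)
    return matches

def matching_score(overlaps, desc_a, desc_b, num_a, num_b, eps_o=0.4):
    # geometric matches M*: maximise total overlap over valid pairs
    geometric = greedy_assignment(
        {p: s for p, s in overlaps.items() if s >= 1.0 - eps_o},
        maximize=True)
    # descriptor matches M*_d: minimise total L2 descriptor distance
    distances = {(a, b): float(np.linalg.norm(desc_a[a] - desc_b[b]))
                 for (a, b) in overlaps}
    by_descriptor = greedy_assignment(distances, maximize=False)
    # a feature pair counts only if it is both geometrically consistent
    # and the closest in descriptor space
    return len(geometric & by_descriptor) / min(num_a, num_b)

# Toy example: features 0 and 1 are both stable and distinctive;
# feature 2 is not stable enough (overlap 0.3 < 0.6).
desc_a = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]), 2: np.array([5.0, 5.0])}
desc_b = {0: np.array([0.0, 0.1]), 1: np.array([1.1, 0.0]), 2: np.array([9.0, 9.0])}
ov = {(0, 0): 0.9, (1, 1): 0.7, (2, 2): 0.3}
print(matching_score(ov, desc_a, desc_b, 3, 3))  # 2/3
```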
Example of a repeatability graph
[Plot: repeatability (%) on the graffiti sequence as a function of viewpoint angle (20 to 60 degrees), comparing VL DoG, VL Hessian, VL HessianLaplace, VL HarrisLaplace (each also in a "double" variant), VLFeat SIFT, CMP Hessian, VGG hes, and VGG har.]
• Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
• Direct evaluation: image retrieval (data: Oxford 5K)
• Software: VLBenchmarks
Indirect evaluation
• Indirect evaluation: a "synthetic" performance measure in a "synthetic" setting
• The good
  - independent of a specific application / implementation
  - allows evaluating single components, e.g.
    ▪ repeatability of the detector
    ▪ matching score of the descriptor
• The bad
  - difficult to design well
  - unclear correlation with performance in applications
Direct evaluation
• Direct evaluation: the performance of a real system using the feature
• The good
  - tied to the "real" performance of the feature
• The bad
  - tied to one application
  - worse, tied to one implementation
  - difficult to evaluate single aspects of a feature
• Tasks used to evaluate features: object instance retrieval, object category recognition, object detection, text recognition, semantic segmentation, ...
• In what follows we focus on object instance retrieval.
Image retrieval
Represent images as bags of features.

Image retrieval pipeline
input image → detector → frames {f1, . . . , fn} → descriptor → descriptors {d1, . . . , dn}
• Detectors: Harris (Laplace), Hessian (Laplace), DoG, MSER, Harris Affine, Hessian Affine, ...
• Descriptors: SIFT, LIOP, BRIEF, Jets, ...
Image retrieval pipeline
Step 1: find the neighbours of each query descriptor among the database descriptors, in increasing descriptor distance.
H. Jégou, M. Douze, and C. Schmid. Exploiting descriptor distances for precise image search. Technical Report 7656, INRIA, 2011.
Image retrieval pipeline
Step 2: each query descriptor casts a vote for each database image. Let d1 ≤ d2 ≤ . . . ≤ dk be the distances from a query descriptor d to its k nearest database descriptors; the image containing the i-th neighbour receives a vote of strength max{dk − di, 0}.
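This voting rule can be sketched as follows. It is a simplified brute-force version with illustrative names; the cited work uses approximate nearest-neighbour search over millions of descriptors:

```python
import numpy as np

def rank_database(query_descs, db_descs, db_image_ids, k=3):
    """Rank database images by total votes from the query descriptors.

    query_descs: (m, D) array of query descriptors.
    db_descs:    (n, D) array of all database descriptors.
    db_image_ids[j]: index of the database image containing descriptor j.
    """
    votes = {img: 0.0 for img in set(db_image_ids)}
    for q in query_descs:
        dist = np.linalg.norm(db_descs - q, axis=1)  # distance to every DB descriptor
        nn = np.argsort(dist)[:k]                    # k nearest neighbours
        d_k = dist[nn[-1]]                           # distance to the k-th neighbour
        for j in nn:                                 # the i-th neighbour's image gets
            votes[db_image_ids[j]] += max(d_k - dist[j], 0.0)  # a vote d_k - d_i
    return sorted(votes, key=votes.get, reverse=True)

# One query descriptor; image 0 holds two close descriptors, image 1 a far one.
query = np.array([[0.0, 0.0]])
db = np.array([[0.0, 0.1], [2.0, 2.0], [0.2, 0.0]])
ranking = rank_database(query, db, db_image_ids=[0, 1, 0], k=3)
print(ranking[0])  # image 0 collects the strongest votes
```

Note that the k-th neighbour itself contributes dk − dk = 0, so only genuinely close descriptors add to an image's score.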
Image retrieval pipeline
Step 3: sort the database images by decreasing total votes.
[Example: ranked results for a query, marked correct (✔) or incorrect (✗); the quality of the ranking for one query is summarised by the Average Precision (AP), the area under the precision-recall curve, here 35%.]
Image retrieval pipeline
Step 4: compute the overall performance score. The per-query APs (e.g. 35%, 100%, 75%, and so on over all queries) are averaged into the Mean Average Precision (mAP), here 53%.
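The two scores can be computed as follows. This uses the common approximation of AP as the mean of the precision at each correctly retrieved result, which closely tracks the area under the precision-recall curve that the slide describes:

```python
def average_precision(relevance):
    """relevance: 1/0 flags for the ranked results (1 = correct match)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall level
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_query_relevance):
    """mAP: the per-query APs averaged over all queries."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)

# One query whose 1st and 3rd results are correct:
# AP = (1/1 + 2/3) / 2 = 5/6
print(round(average_precision([1, 0, 1]), 3))
# Averaging a perfect query (AP = 1.0) with a weaker one (AP = 0.5):
print(mean_average_precision([[1, 1, 1], [0, 1]]))  # 0.75
```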
A retrieval benchmark dataset
Oxford 5K data
• ~5K images of Oxford
  - For each of 58 queries
    ▪ about XX matching images
    ▪ about XX confounder images
• Larger datasets are possible, but slow for extensive evaluation
• The relative ranking of features seems to be representative
[Example: a query image and its retrieved images, marked correct (✔) or incorrect (✗).]
• Indirect evaluation: repeatability and matching score (data: affine covariant testbed)
• Direct evaluation: image retrieval (data: Oxford 5K)
• Software: VLBenchmarks
A new easy-to-use benchmarking suite
VLBenchmarks
• A novel MATLAB framework for feature evaluation
  - Repeatability and matching scores (VGG affine testbed)
  - Image retrieval (Oxford 5K)
• Goodies
  - Simple to use MATLAB code
  - Automatically downloads datasets & runs evaluations
  - Backward compatible with published results
http://www.vlfeat.org/benchmarks/index.html
Obtaining and installing the code
VLBenchmarks
• Installation
  - Download the latest version
  - Unpack the archive
  - Launch MATLAB and type
    >> install
• Requirements
  - MATLAB R2008a (7.6) or later
  - A C compiler (e.g. Visual Studio, GCC, or Xcode)
  - Do not forget to set up MATLAB to use your C compiler:
    >> mex -setup
Example usage

import datasets.*;
import localFeatures.*;
import benchmarks.*;

detector = VggAffine('Detector', 'haraff');      % choose a detector (Harris Affine, VGG version)
dataset = VggAffineDataset('category', 'graf');  % choose a dataset (graffiti sequence)
benchmark = RepeatabilityBenchmark();            % choose a test (detector repeatability)

% run the evaluation (repeatability = 0.66)
repeatability = benchmark.testFeatureExtractor(detector, ...
    dataset.getTransformation(2), ...
    dataset.getImagePath(1), ...
    dataset.getImagePath(2));
Testing on a sequence of images

import datasets.*;
import localFeatures.*;
import benchmarks.*;

detector = VggAffine('Detector', 'haraff');
dataset = VggAffineDataset('category', 'graf');
benchmark = RepeatabilityBenchmark();

for j = 1:5
  repeatability(j) = benchmark.testFeatureExtractor(detector, ...
      dataset.getTransformation(j), ...
      dataset.getImagePath(1), ...
      dataset.getImagePath(j));
end

Use parfor on a cluster!
[Plot: repeatability as a function of the image number (1 to 5) in the sequence.]
Comparing two features

import datasets.*;
import localFeatures.*;
import benchmarks.*;

detectors{1} = VlFeatSift();
detectors{2} = VlFeatCovdet('EstimateAffineShape', true);
dataset = VggAffineDataset('category', 'graf');
benchmark = RepeatabilityBenchmark();

for d = 1:2
  for j = 1:5
    repeatability(j,d) = ...
        benchmark.testFeatureExtractor(detectors{d}, ...
                                       dataset.getTransformation(j), ...
                                       dataset.getImagePath(1), ...
                                       dataset.getImagePath(j));
  end
end

clf; plot(repeatability, 'linewidth', 4);
xlabel('image number');
ylabel('repeatability');
grid on;
[Plot: repeatability of the two features as a function of the image number (1 to 5).]
Affine Adaptation
• Compare the following features
  - SIFT, MSER, and features on a grid
  - on the Graffiti sequence
  - for repeatability and number of correspondences
[Plots: repeatability (%) and number of correspondences as functions of viewpoint angle (20 to 60 degrees) for SIFT, MSER, and features on a grid on the Graffiti sequence.]
Example
Backward compatible
• Previously published results can be easily reproduced
  - if interested, try the script reproduceIjcv05.m
[First page of the paper: K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 2005. DOI: 10.1007/s11263-005-3848-x.]
Other useful tricks
• Compare different parameter settings:

detectors{1} = VggAffine('Detector', 'haraff', 'Threshold', 500);
detectors{2} = VggAffine('Detector', 'haraff', 'Threshold', 1000);

• Visualising matches:

[~, ~, matches, reprojFrames] = benchmark.testFeatureExtractor(...);
benchmarks.helpers.plotFrameMatches(matches, reprojFrames);

[Figures: matched and unmatched reference/test image frames, for SIFT and for a mean-variance-median descriptor, on image 4 of the VggAffineDataset graf sequence.]
Other benchmarks
• Detector matching score:

benchmark = RepeatabilityBenchmark('mode', 'MatchingScore');

• Image retrieval (example: Oxford 5K lite, mAP evaluation):

dataset = VggRetrievalDataset('Category', 'oxbuild', ...
                              'BadImagesNum', 100);
benchmark = RetrievalBenchmark();
mAP = benchmark.testFeatureExtractor(detectors{d}, dataset);
Summary
• Benchmarks
  - Indirect: repeatability and matching score
  - Direct: image retrieval
• VLBenchmarks
  - a simple to use, convenient MATLAB framework
• The future
  - Existing measures have many shortcomings
  - Hopefully better benchmarks will be available soon
  - And they will be added to VLBenchmarks for your convenience
http://www.vlfeat.org/benchmarks/index.html
Credits
HARVEST Programme
Karel Lenc
Andrew Zisserman
Cordelia Schmid
Krystian Mikolajczyk
Jiri Matas
Tinne Tuytelaars
Varun Gulshan
VLFeat: http://www.vlfeat.org/
VLBenchmarks: http://www.vlfeat.org/benchmarks/
Thank you for coming!