TRANSCRIPT
Source: lear.inrialpes.fr/~verbeek/mlcr.slides.11.12/visual_search_gre2.pdf
Efficient visual search of local features
Cordelia Schmid
Bag-of-features [Sivic & Zisserman ’03]

Pipeline: query image → set of SIFT descriptors (Harris-Hessian-Laplace regions) → bag-of-features processing + tf-idf weighting → sparse frequency vector (centroids = visual words) → querying the inverted file → ranked image short-list → geometric verification → re-ranked list

• “visual words”: 1 “word” (index) per local descriptor
• only image ids in the inverted file => 8 GB fits! [Chum & al. 2007]
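As a rough illustration of the inverted-file scoring sketched above — a toy sketch, not the actual system; the function names and the unnormalized tf-idf dot-product scoring are illustrative assumptions:

```python
# Toy bag-of-features retrieval with an inverted file and tf-idf weighting.
# All names are illustrative; real systems store only image ids per word.
import math
from collections import Counter, defaultdict

def build_index(db_words):
    """db_words: {image_id: list of visual-word ids}."""
    inverted = defaultdict(set)            # visual word -> set of image ids
    for img, words in db_words.items():
        for w in set(words):
            inverted[w].add(img)
    n = len(db_words)
    # idf: rare visual words are more discriminative
    idf = {w: math.log(n / len(imgs)) for w, imgs in inverted.items()}
    return inverted, idf

def query(q_words, db_words, inverted, idf):
    """Score only images sharing words with the query (tf-idf dot product)."""
    q_tf = Counter(q_words)
    scores = defaultdict(float)
    for w, qc in q_tf.items():
        for img in inverted.get(w, ()):
            scores[img] += (qc * idf[w]) * (db_words[img].count(w) * idf[w])
    return sorted(scores.items(), key=lambda kv: -kv[1])   # ranked short-list
```

The returned short-list would then be re-ranked by geometric verification, as the pipeline above describes.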
Geometric verification
Use the position and shape of the underlying features to improve retrieval quality
Both images have many matches – which is correct?
Geometric verification
We can measure spatial consistency between the query and each result to improve retrieval quality
Many spatially consistent matches – correct result
Few spatially consistent matches – incorrect result
Geometric verification
Gives localization of the object
Geometric verification
• Remove outliers: putative matches contain a high number of incorrect ones
• Estimate the geometric transformation
• Robust strategies
  – RANSAC
  – Hough transform
Example: estimating 2D affine transformation
• Simple fitting procedure (linear least squares)
• Approximates viewpoint changes for roughly planar objects and roughly orthographic cameras
• Can be used to initialize fitting for more complex models
Matches consistent with an affine transformation
Fitting an affine transformation
Assume we know the correspondences, how do we get the transformation?
With corresponding points $(x_i, y_i) \leftrightarrow (x'_i, y'_i)$:

$$
\begin{pmatrix} x'_i \\ y'_i \end{pmatrix}
=
\begin{pmatrix} m_1 & m_2 \\ m_3 & m_4 \end{pmatrix}
\begin{pmatrix} x_i \\ y_i \end{pmatrix}
+
\begin{pmatrix} t_1 \\ t_2 \end{pmatrix}
$$

Equivalently, rearranged as a linear system in the unknown parameters:

$$
\begin{pmatrix}
\vdots & & & & & \\
x_i & y_i & 0 & 0 & 1 & 0 \\
0 & 0 & x_i & y_i & 0 & 1 \\
\vdots & & & & &
\end{pmatrix}
\begin{pmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_1 \\ t_2 \end{pmatrix}
=
\begin{pmatrix} \vdots \\ x'_i \\ y'_i \\ \vdots \end{pmatrix}
$$
Fitting an affine transformation
$$
\begin{pmatrix}
\vdots & & & & & \\
x_i & y_i & 0 & 0 & 1 & 0 \\
0 & 0 & x_i & y_i & 0 & 1 \\
\vdots & & & & &
\end{pmatrix}
\begin{pmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ t_1 \\ t_2 \end{pmatrix}
=
\begin{pmatrix} \vdots \\ x'_i \\ y'_i \\ \vdots \end{pmatrix}
$$

Linear system with six unknowns. Each match gives us two linearly independent equations: we need at least three matches to solve for the transformation parameters.
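A minimal sketch of solving the stacked system by linear least squares; the function name and the use of NumPy's `lstsq` are my own choices for illustration, not from the slides:

```python
# Fit the 2D affine transform (m1..m4, t1, t2) from point correspondences
# by building the stacked linear system and solving it in least squares.
import numpy as np

def fit_affine(pts, pts_prime):
    """pts, pts_prime: (N, 2) arrays of corresponding points, N >= 3."""
    rows, rhs = [], []
    for (x, y), (xp, yp) in zip(pts, pts_prime):
        rows.append([x, y, 0, 0, 1, 0])    # equation for x'_i
        rows.append([0, 0, x, y, 0, 1])    # equation for y'_i
        rhs.extend([xp, yp])
    params, *_ = np.linalg.lstsq(np.asarray(rows, dtype=float),
                                 np.asarray(rhs, dtype=float), rcond=None)
    m1, m2, m3, m4, t1, t2 = params
    return np.array([[m1, m2], [m3, m4]]), np.array([t1, t2])
```

With three exact matches the system is square; with more matches, least squares averages out noise.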
Dealing with outliers
The set of putative matches may contain a high percentage (e.g. 90%) of outliers
How do we fit a geometric transformation to a small subset of all possible matches?
Possible strategies:
• RANSAC
• Hough transform
Strategy 1: RANSAC
RANSAC loop (Fischler & Bolles, 1981):
• Randomly select a seed group of matches
• Compute transformation from seed group
• Find inliers to this transformation
• If the number of inliers is sufficiently large, re-compute least-squares estimate of transformation on all of the inliers
• Keep the transformation with the largest number of inliers
Algorithm summary – RANSAC robust estimation of a 2D affine transformation

Repeat:
1. Select 3 point-to-point correspondences
2. Compute H (2×2 matrix) + t (2×1 vector) for the translation
3. Measure support (number of inliers within threshold distance, i.e. d²_transfer < t)

Choose the (H, t) with the largest number of inliers
(Re-estimate (H, t) from all inliers)
Strategy 2: Hough Transform
• Origin: Detection of straight lines in cluttered images
• Can be generalized to arbitrary shapes
• Can extract feature groupings from cluttered images in linear time.
• Illustrate on extracting sets of local features consistent with a similarity transformation
Hough transform for object recognition

Suppose our features are scale- and rotation-covariant
• Then a single feature match provides an alignment hypothesis (translation, scale, orientation)
[Figure: model image and target image]
David G. Lowe. “Distinctive image features from scale-invariant keypoints”, IJCV 60 (2), pp. 91-110, 2004.
Hough transform for object recognition

Suppose our features are scale- and rotation-covariant
• Then a single feature match provides an alignment hypothesis (translation, scale, orientation)
• Of course, a hypothesis obtained from a single match is unreliable
• Solution: Coarsely quantize the transformation space. Let each match vote for its hypothesis in the quantized space.
David G. Lowe. “Distinctive image features from scale-invariant keypoints”, IJCV 60 (2), pp. 91-110, 2004.
Basic algorithm outline
1. Initialize the accumulator H to all zeros
2. For each tentative match, compute the transformation hypothesis (tx, ty, s, θ) and vote:
   H(tx, ty, s, θ) = H(tx, ty, s, θ) + 1
3. Find all bins (tx, ty, s, θ) where H(tx, ty, s, θ) has at least three votes

(H is a 4D accumulator array; only the 2D slice over tx and ty is shown in the slide figure.)

• Correct matches will consistently vote for the same transformation, while mismatches will spread votes
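The voting outline above might be sketched as follows, using a sparse dictionary as the 4D accumulator; the bin widths here are illustrative placeholders, not the values used in any particular system:

```python
# Coarse Hough voting over (tx, ty, s, theta) hypotheses, one per match.
# A sparse dict stands in for the 4D accumulator array; bin sizes are
# illustrative (translation in pixels, scale in octaves, angle in radians).
import math
from collections import defaultdict

def hough_votes(hypotheses, bin_t=32.0,
                bin_log_s=math.log(2), bin_theta=math.radians(30)):
    """hypotheses: list of (tx, ty, s, theta) from single feature matches."""
    H = defaultdict(int)                           # sparse 4D accumulator
    for tx, ty, s, theta in hypotheses:
        bin_ = (round(tx / bin_t), round(ty / bin_t),
                round(math.log(s) / bin_log_s),    # quantize scale in log space
                round(theta / bin_theta))
        H[bin_] += 1
    return {b: v for b, v in H.items() if v >= 3}  # bins with >= 3 votes
```

Consistent matches pile up in one bin; mismatches scatter across the space and fall below the three-vote threshold.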
Hough transform details (D. Lowe’s system)
Training phase: for each model feature, record the 2D location, scale, and orientation of the model (relative to a normalized feature frame)

Test phase: let each match between a test and a model feature vote in a 4D Hough space
• Use broad bin sizes of 30 degrees for orientation, a factor of 2 for scale, and 0.25 times the image size for location
• Vote for the two closest bins in each dimension

Find all bins with at least three votes and perform geometric verification
• Estimate a least-squares affine transformation
• Use stricter thresholds on the transformation residual
• Search for additional features that agree with the alignment
Comparison
Hough Transform

Advantages
• Can handle a high percentage of outliers (>95%)
• Extracts groupings from clutter in linear time
• Can be generalized to arbitrary shapes and objects [Leibe08]

Disadvantages
• Quantization issues
• Only practical for a small number of dimensions (up to 4)

Improvements available
• Probabilistic extensions
• Continuous voting space

RANSAC

Advantages
• General method suited to a large range of problems
• Easy to implement
• “Independent” of the number of dimensions

Disadvantages
• Basic version only handles a moderate number of outliers (<50%)

Many variants available, e.g.
• PROSAC: Progressive RANSAC [Chum05]
• Preemptive RANSAC [Nister05]
Geometric verification – example
1. Query
…
2. Initial retrieval set (bag of words model)
3. Spatial verification (re-rank on # of inliers)
Evaluation dataset: Oxford buildings
All Soul's
Ashmolean
Balliol
Bridge of Sighs
Keble
Magdalen
Bodleian
Thom Tower
Cornmarket
University Museum
Radcliffe Camera
• Ground truth obtained for 11 landmarks
• Evaluate performance by mean Average Precision
Measuring retrieval performance: Precision - Recall
• Precision: % of returned images that are relevant
• Recall: % of relevant images that are returned

[Figure: precision-recall curve (precision vs. recall, both from 0 to 1); Venn diagram of all images, returned images, and relevant images]
Average Precision
[Figure: precision-recall curve illustrating AP]

• A good AP score requires both high recall and high precision
• Application-independent

Performance measured by mean Average Precision (mAP) over 55 queries on 100K or 1.1M image datasets
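These metrics can be made concrete with a small sketch; this is the standard definition of precision, recall, and average precision for a ranked list, not code from the lecture:

```python
# Precision/recall at each rank, and average precision (AP), for a ranked
# result list evaluated against a set of relevant image ids.
def precision_recall(ranked, relevant):
    """ranked: list of image ids; relevant: set of relevant ids."""
    hits, out = 0, []
    for k, img in enumerate(ranked, 1):
        if img in relevant:
            hits += 1
        out.append((hits / k, hits / len(relevant)))   # (precision@k, recall@k)
    return out

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks where a relevant image appears."""
    hits, total = 0, 0.0
    for k, img in enumerate(ranked, 1):
        if img in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)
```

mAP is then simply the mean of AP over all queries (55 for the Oxford benchmark above).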
INRIA holidays dataset
• Evaluation for the INRIA holidays dataset, 1491 images
  – 500 query images + 991 annotated true positives
  – Most images are holiday photos of friends and family
• 1 million & 10 million distractor images from Flickr
• Vocabulary construction on a different Flickr set
• Evaluation metric: mean average precision (in [0,1], bigger = better)
  – Average over the precision/recall curve
Holiday dataset – example queries
Dataset : Venice Channel
[Image grid: Query, Base 1–Base 4]
Dataset : San Marco square
[Image grid: Query, Base 1–Base 9]
Example distractors - Flickr
Experimental evaluation
• Evaluation on our holidays dataset, 500 query images, 1 million distractor images
• Metric: mean average precision (in [0,1], bigger = better)
[Figure: mAP vs. database size (1,000 to 1,000,000 images), curves for baseline, HE, and HE + re-ranking]
Results – Venice Channel
[Image grid: Query, with results Base 1, Base 4, and Flickr distractors]
Demo at http://bigimbaz.inrialpes.fr
Towards larger databases?
• BOF can handle up to ~10 M images
  ► with a limited number of descriptors per image
  ► 40 GB of RAM
  ► search = 2 s
• Web-scale = billions of images
  ► with 100 M images per machine → search = 20 s, RAM = 400 GB → not tractable!
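The scaling arithmetic above is easy to reproduce; the per-image memory figure below is inferred from the slide's numbers (40 GB for 10 M images), not stated explicitly:

```python
# Back-of-the-envelope scaling from the slide's BOF numbers,
# assuming RAM and search time both grow linearly with database size.
bof_images = 10_000_000
bof_ram_gb = 40
bof_search_s = 2.0

ram_per_image_kb = bof_ram_gb * 1024**2 / bof_images          # ~4 KB per image
images_per_machine = 100_000_000
ram_needed_gb = images_per_machine * ram_per_image_kb / 1024**2
search_time_s = bof_search_s * images_per_machine / bof_images

print(ram_per_image_kb, ram_needed_gb, search_time_s)
```

This reproduces the slide's conclusion: 100 M images per machine would need ~400 GB of RAM and ~20 s per search, hence "not tractable" at web scale.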
Recent approaches for very large scale indexing
Pipeline: query image → set of SIFT descriptors (Hessian-Affine regions) → bag-of-features processing + tf-idf weighting → sparse frequency vector (centroids = visual words) → vector compression → vector search → ranked image short-list → geometric verification → re-ranked list
Related work on very large scale image search
• GIST descriptors with Spectral Hashing [Torralba et al. ’08]
• Compressing the BoF representation (miniBoF) [Jegou et al. ’09]
• Aggregating local descriptors into a compact image representation [Jegou et al. ’10]
• Efficient object category recognition using classemes [Torresani et al. ’10]