Image retrieval: challenges and opportunities
TRANSCRIPT
Image retrieval: challenges and opportunities
Oge Marques Florida Atlantic University
Boca Raton, FL - USA
June 4, 2012 – UTFPR
Curitiba, PR - Brazil
Watch this…
Oge Marques http://www.google.com/mobile/goggles
Google Goggles
• Mobile visual search (MVS) solution
  – Android and iPhone
  – Narrow-domain search and retrieval
Oge Marques http://www.google.com/mobile/goggles
Outline
• How does it work?
• Why is it relevant?
• What else is going on?
• Which challenges and opportunities lie ahead?
Oge Marques
Fundamentals
How does it work?
Fundamentals
• Google Goggles is (one of) the first – and maybe the best-known – solutions for MVS
• It is a contemporary example of content-based image retrieval (CBIR)
• Its technical details (algorithms, etc.) are not publicly available
• However…
Oge Marques
MVS: Pipeline for image retrieval
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: 3 scenarios
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: descriptor extraction
• Interest point detection
• Feature descriptor computation
Oge Marques Girod et al. IEEE Multimedia 2011
Interest point detection
• Numerous interest-point detectors have been proposed in the literature:
  – Harris corners (Harris and Stephens 1988)
  – Scale-Invariant Feature Transform (SIFT) Difference-of-Gaussian (DoG) (Lowe 2004)
  – Maximally Stable Extremal Regions (MSERs) (Matas et al. 2002)
  – Hessian affine (Mikolajczyk et al. 2005)
  – Features from Accelerated Segment Test (FAST) (Rosten and Drummond 2006)
  – Hessian blobs (Bay, Tuytelaars and Van Gool 2006)
• Different detectors offer different tradeoffs between repeatability and computational complexity (a minimal detection sketch follows below).
• See (Mikolajczyk and Schmid 2005) for a comparative performance evaluation of local descriptors in a common framework.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
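As a concrete illustration of this step, here is a minimal sketch of interest-point detection using OpenCV's FAST detector, one of the detectors listed above. The `opencv-python` dependency and the file name `query.jpg` are assumptions for the example, not part of the talk.

```python
# Minimal sketch: interest-point detection with the FAST detector
# (Rosten and Drummond 2006), as implemented in OpenCV.
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
fast = cv2.FastFeatureDetector_create(threshold=25)  # corner-response threshold
keypoints = fast.detect(img, None)                   # candidate interest points
print(f"{len(keypoints)} interest points detected")
```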
Feature descriptor computation
• After interest-point detection, we compute a visual word descriptor on a normalized patch.
• Ideally, descriptors should be:
  – robust to small distortions in scale, orientation, and lighting conditions;
  – discriminative, i.e., characteristic of an image or a small set of images;
  – compact, due to typical mobile computing constraints (a minimal descriptor-computation sketch follows below).
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
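To make the properties above tangible, the sketch below computes SIFT descriptors on a query image: robust and discriminative, but at 128 floats per keypoint not especially compact, which is exactly the gap CHoG targets. OpenCV and the file name `query.jpg` are again assumptions.

```python
# Minimal sketch: SIFT descriptor computation (Lowe 2004) with OpenCV.
# Each keypoint yields a 128-D histogram of gradient orientations computed
# over a scale- and rotation-normalized patch.
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(descriptors.shape)  # (num_keypoints, 128) float32 -- robust, but not compact
```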
Feature descriptor computation
• Examples of feature descriptors in the literature:
  – SIFT (Lowe 1999)
  – Speeded-Up Robust Features (SURF) (Bay et al. 2008)
  – Gradient Location and Orientation Histogram (GLOH) (Mikolajczyk and Schmid 2005)
  – Compressed Histogram of Gradients (CHoG) (Chandrasekhar et al. 2009, 2010)
• See (Winder and Brown CVPR 2007), (Winder, Hua, and Brown CVPR 2009), and (Mikolajczyk and Schmid PAMI 2005) for comparative performance evaluations of different descriptors.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• What about compactness?
  – Option 1: Compress off-the-shelf descriptors.
    • Result: poor rate-constrained image-retrieval performance.
  – Option 2: Design a descriptor with compression in mind.
    • Example: CHoG (Compressed Histogram of Gradients) (Chandrasekhar et al. 2009, 2010)
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
CHoG: Compressed Histogram of Gradients
Oge Marques Chandrasekhar et al. CVPR 09,10 Bernd Girod: Mobile Visual Search
[Figure: CHoG descriptor pipeline – an image patch goes through gradient computation (dx, dy), spatial binning, and per-bin gradient distributions; histogram compression then yields a compact bitstring descriptor.]
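The figure's pipeline can be approximated in a few lines of NumPy. The sketch below is not the published CHoG implementation: it only illustrates the gradients, spatial binning, and per-cell gradient-distribution stages, and omits the histogram-compression stage entirely; the grid size, bin count, and random patch are illustrative assumptions.

```python
import numpy as np

def chog_like_descriptor(patch, grid=3, bins=9):
    """Per-cell gradient-orientation histograms over a spatial grid."""
    dy, dx = np.gradient(patch.astype(np.float32))  # gradients (dy, dx)
    angles = np.arctan2(dy, dx)                     # orientation per pixel
    weights = np.hypot(dx, dy)                      # magnitude per pixel
    h, w = patch.shape
    descriptor = []
    for i in range(grid):                           # spatial binning
        for j in range(grid):
            cell = (slice(i * h // grid, (i + 1) * h // grid),
                    slice(j * w // grid, (j + 1) * w // grid))
            hist, _ = np.histogram(angles[cell], bins=bins,
                                   range=(-np.pi, np.pi),
                                   weights=weights[cell])
            descriptor.append(hist / (hist.sum() + 1e-9))  # distribution per cell
    return np.concatenate(descriptor)  # CHoG would now quantize/entropy-code this

patch = np.random.rand(32, 32)             # stand-in for a normalized image patch
print(chog_like_descriptor(patch).shape)   # (grid * grid * bins,) == (81,)
```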
CHoG: Compressed Histogram of Gradients
• Performance evaluation
  – Recall vs. bit rate
Oge Marques Girod et al. IEEE Multimedia 2011
by approximately a factor of two. Moreover, transmission of features allows yet another optimization: it is possible to use progressive transmission of image features and let the server execute searches on a partial set of features as they arrive [15]. Once the server finds a result that has a sufficiently high matching score, it terminates the search and immediately sends the results back. This optimization reduces system latency by another factor of two.

Overall, the SPS system demonstrates that, using the described array of technologies, mobile visual-search systems can achieve high recognition accuracy, scale to realistically large databases, and deliver search results in an acceptable time.
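The early-termination idea is easy to sketch. The code below is not the SPS system's implementation: the toy `match_score` function and the random data are stand-ins, meant only to show how scoring a partial feature set after each network chunk lets the server stop before the full query arrives.

```python
import numpy as np

def match_score(features, database):
    """Toy scorer: fraction of query features lying near any database feature."""
    best_id, best = None, 0.0
    feats = np.asarray(features)                  # (n, d) partial query set
    for img_id, db_feats in database.items():
        dists = np.linalg.norm(feats[:, None] - db_feats[None], axis=2)
        score = float((dists.min(axis=1) < 0.1).mean())
        if score > best:
            best_id, best = img_id, score
    return best_id, best

def progressive_search(feature_chunks, database, threshold=0.8):
    received, result = [], (None, 0.0)
    for chunk in feature_chunks:                  # chunks arrive over the network
        received.extend(chunk)
        result = match_score(received, database)  # search on the partial set
        if result[1] >= threshold:                # sufficiently high score:
            break                                 # stop early, skip the rest
    return result

rng = np.random.default_rng(1)
db = {i: rng.random((40, 8)) for i in range(5)}   # toy 8-D "descriptors"
query = db[2][rng.permutation(40)]                # query = image 2's features
chunks = np.array_split(query, 8)                 # progressive transmission
print(progressive_search([c.tolist() for c in chunks], db))  # early hit: image 2
```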
Emerging MPEG standard

As we have seen, key component technologies for mobile visual search already exist, and we can choose among several possible architectures to design such a system. We have shown these options at the beginning, in Figure 2. The architecture shown in Figure 2a is the easiest one to implement on a mobile phone, but it requires fast networks such as Wi-Fi to achieve good performance. The architecture shown in Figure 2b reduces network latency and allows fast response over today's 3G networks, but requires descriptors to be extracted on the phone. Many applications might be accelerated further by using a cache of the database on the phone, as exemplified by the architecture shown in Figure 2c.

However, this immediately raises the question of interoperability. How can we enable mobile visual-search applications and databases across a broad range of devices and platforms if the information is exchanged in the form of compressed visual descriptors rather than images? This question was initially posed during the Workshop on Mobile Visual Search, held at Stanford University in December 2009. The discussion led to a formal request by the US delegation to MPEG, suggesting that the potential interest in a standard for visual search applications be explored [16]. As a result, an exploratory activity was started in MPEG, which produced a series of documents in the subsequent year describing applications, use cases, objectives, scope, and requirements for a future standard [17].

As MPEG exploratory work progressed, it was recognized that the suite of existing MPEG technologies, such as MPEG-7 Visual, does not yet include tools for robust image-based retrieval and that a new standard should therefore be defined. It was further recognized
[Figure 7] Comparison of different schemes with regard to classification accuracy and query size (classification accuracy, %, vs. query size, Kbytes, for send feature (CHoG), send image (JPEG), and send feature (SIFT)). CHoG descriptor data is an order of magnitude smaller compared to JPEG images or uncompressed SIFT descriptors.
[Figure 8] End-to-end latency for different schemes (response time in seconds, split into feature extraction, network transmission, and retrieval, for JPEG (3G), feature (3G), feature progressive (3G), JPEG (WLAN), and feature (WLAN)). Compared to a system transmitting a JPEG query image, a scheme employing progressive transmission of CHoG features achieves approximately a fourfold reduction in system latency over a 3G network.
MVS: feature indexing and matching
• Goal: produce a data structure that can quickly return a short list of the database candidates most likely to match the query image.
  – The short list may contain false positives as long as the correct match is included.
  – Slower pairwise comparisons can then be performed on just the short list of candidates rather than on the entire database.
• Example technique: Vocabulary Tree (VT)-based retrieval (see the sketch below).
Oge Marques Girod et al. IEEE Multimedia 2011
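Below is a minimal sketch of the vocabulary-tree idea: hierarchical k-means quantization with inverted files at the leaves. It is not the original Nistér-Stewénius implementation; the tiny branch factor and depth, the random descriptors, and the use of scikit-learn's `KMeans` are all illustrative assumptions.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def build_tree(descriptors, branch=4, depth=2):
    """Recursively cluster descriptors; each leaf acts as one visual word."""
    if depth == 0 or len(descriptors) < branch:
        return {"leaf": True, "inverted": defaultdict(float)}
    km = KMeans(n_clusters=branch, n_init=4).fit(descriptors)
    children = [build_tree(descriptors[km.labels_ == c], branch, depth - 1)
                for c in range(branch)]
    return {"leaf": False, "km": km, "children": children}

def quantize(tree, d):
    """Walk from the root to a leaf using branch*depth distance tests."""
    while not tree["leaf"]:
        c = tree["km"].predict(d.reshape(1, -1))[0]
        tree = tree["children"][c]
    return tree

rng = np.random.default_rng(0)
db = {img_id: rng.random((50, 128)) for img_id in range(10)}  # toy descriptors
tree = build_tree(np.vstack(list(db.values())))

# Indexing: record per-image feature counts in the leaf inverted files.
for img_id, descs in db.items():
    for d in descs:
        quantize(tree, d)["inverted"][img_id] += 1.0

# Query: images sharing many visual words with the query rise to the top.
scores = defaultdict(float)
for d in db[3]:                        # use image 3's own features as the query
    for img_id, w in quantize(tree, d)["inverted"].items():
        scores[img_id] += w
print(max(scores, key=scores.get))     # head of the short list; expect 3
```

The point of the tree structure is that quantizing a descriptor costs only branch × depth distance computations instead of a comparison against every visual word, and the inverted files restrict scoring to images that share at least one word with the query.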
MVS: geometric verification
• Goal: use location information of features in query and database images to confirm that the feature matches are consistent with a change in viewpoint between the two images.
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: geometric verification
• Method: perform pairwise matching of feature descriptors and evaluate the geometric consistency of the correspondences.
• Techniques:
  – The geometric transform between the query and database image is usually estimated with robust regression techniques such as:
    • Random sample consensus (RANSAC) (Fischler and Bolles 1981)
    • Hough transform (Lowe 2004)
  – The transformation is often represented by an affine mapping or a homography.
• Note: GV is computationally expensive, which is why it is applied only to a subset of images selected during the feature-matching stage (see the sketch below).
Oge Marques
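A common concrete realization of this stage, sketched below with OpenCV: match SIFT descriptors pairwise, filter with Lowe's ratio test, then let RANSAC estimate a homography and count the inliers. The file names and the 0.75 and 5.0 thresholds are conventional choices, not values from the talk.

```python
import cv2
import numpy as np

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)      # hypothetical files
img2 = cv2.imread("candidate.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Pairwise descriptor matching, filtered with Lowe's ratio test.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]

# RANSAC keeps only correspondences consistent with a single homography.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # 5.0-px inlier radius
print(f"{int(mask.sum())} of {len(good)} matches survive geometric verification")
```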
[11] use weak geometric consistency checks to rerank images based on the orientation and scale information of all features. The authors in [53] and [69] propose incorporating geometric information into the VT matching or hashing step. In [70] and [71], the authors investigate how to speed up RANSAC estimation itself. Philbin et al. [72] use single pairs of matching features to propose hypotheses of the geometric transformation model and verify only possible sets of hypotheses. Weak geometric consistency checks are typically used to rerank a larger candidate list of images before a full GV is performed on a shorter candidate list.

To speed up GV, one can add a geometric reranking step before the RANSAC GV step, as illustrated in Figure 5. In [73], we propose a reranking step that incorporates geometric information directly into the fast index lookup stage and use it to reorder the list of top matching images (see "Fast Geometric Reranking"). The main advantage of the scheme is that it only requires x, y feature-location data and does not use scale
INVERTED INDEX COMPRESSION

For a database containing 1 million images and a VT that uses soft binning, each image ID can be stored in a 32-bit unsigned integer, and each fractional count can be stored in a 32-bit float in the inverted index. The memory usage of the entire inverted index is $\sum_{k=1}^{K} N_k \cdot 64$ bits, where $N_k$ is the length of the inverted list at the $k$th leaf node. For a database of 1 million product images, this amount of memory reaches 10 GB, a huge amount for even a modern server. Such a large memory footprint limits the ability to run other concurrent processes on the same server, such as recognition systems for other databases. When the inverted index's memory usage exceeds the server's available random access memory (RAM), swapping between main and virtual memory occurs, which significantly slows down all processes.

A compressed inverted index [58] can significantly reduce memory usage without affecting recognition accuracy. First, because each list of IDs $\{i_{k1}, i_{k2}, \ldots, i_{kN_k}\}$ is sorted, it is more efficient to store consecutive ID differences $\{d_{k1} = i_{k1},\ d_{k2} = i_{k2} - i_{k1},\ \ldots,\ d_{kN_k} = i_{kN_k} - i_{k(N_k-1)}\}$ in place of the IDs. This practice is also commonly used in text retrieval [62]. Second, the fractional visit counts can be quantized to a few representative values using Lloyd-Max quantization. Third, the distributions of the ID differences and visit counts are far from uniform, so variable-length coding can be much more rate-efficient than fixed-length coding. Using the distributions of the ID differences and visit counts, each inverted list can be encoded using an arithmetic code (AC) [63]. Since keeping the decoding delay low is very important for interactive mobile visual-search applications, a scheme that allows ultra-fast decoding is often preferred over AC. The carryover code [64] and the recursive bottom-up complete (RBUC) code [65] have been shown to be at least ten times faster in decoding than AC, while achieving comparable compression gains. The carryover and RBUC codes attain these speedups by enforcing word-aligned memory accesses.

Figure S6(a) compares the memory usage of the inverted index with and without compression using the RBUC code. Index compression reduces memory usage from nearly 10 GB to 2 GB. This five-times reduction leads to a substantial speedup in server-side processing, as shown in Figure S6(b). Without compression, the large inverted index causes swapping between main and virtual memory and slows down the retrieval engine. After compression, memory swapping is avoided and memory congestion delays no longer contribute to the query latency.
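The first of the three steps above (delta-coding the sorted ID lists) is illustrated below, with a simple byte-oriented varint standing in as the variable-length code. The carryover and RBUC codes used in the paper are word-aligned and much faster to decode, but the rate-saving principle is the same; the example list is made up.

```python
# Sketch of inverted-list compression: store sorted image IDs as consecutive
# differences, then encode the small gaps with a variable-length code.
def delta_encode(ids):
    """[5, 9, 23] -> [5, 4, 14]: first ID, then consecutive gaps."""
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def varint_bytes(n):
    """Encode a non-negative int in 7-bit groups; the MSB flags continuation."""
    out = bytearray()
    while True:
        n, low = n >> 7, n & 0x7F
        out.append(low | (0x80 if n else 0))
        if not n:
            return bytes(out)

inverted_list = [5, 9, 23, 24, 1000004]          # sorted image IDs at one leaf
gaps = delta_encode(inverted_list)               # small numbers compress well
encoded = b"".join(varint_bytes(g) for g in gaps)
print(len(encoded), "bytes vs", 4 * len(inverted_list), "bytes uncompressed")
```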
[FIG S6] (a) Memory usage for the inverted index with and without compression; a five-times savings in memory is achieved with compression. (b) Server-side query latency (per image) with and without compression. The RBUC code is used to encode the inverted index.
[FIG5] An image retrieval pipeline (query data → VT → geometric reranking → GV → identify information) can be greatly sped up by incorporating a geometric reranking stage.
[FIG4] In the GV step, we match feature descriptors pairwise and find feature correspondences that are consistent with a geometric model. True feature matches are shown in red. False feature matches are shown in green.
Girod et al. IEEE Multimedia 2011
Relevance
Why is it relevant?
Relevance
• Explosive growth and increasing popularity of mobile devices and apps
• (Finally!) a good use case for CBIR
• Many commercial opportunities
Oge Marques
Mobile visual search: driving factors
• Age of mobile computing
Oge Marques http://60secondmarketer.com/blog/2011/10/18/more-mobile-phones-than-toothbrushes/
Mobile visual search: driving factors
• Why do I need a camera? I have a smartphone… (22 Dec 2011)
Oge Marques http://www.cellular-news.com/story/52382.php
Mobile visual search: driving factors
• Powerful devices
1 GHz ARM Cortex-A9 processor, PowerVR SGX543MP2, Apple A5 chipset
Oge Marques http://www.apple.com/iphone/specs.html http://www.gsmarena.com/apple_iphone_4s-4212.php
Mobile visual search: driving factors
• Powerful devices
Oge Marques http://europe.nokia.com/PRODUCT_METADATA_0/Products/Phones/8000-series/808/Nokia808PureView_Whitepaper.pdf http://www.nokia.com/fr-fr/produits/mobiles/808/
Mobile visual search: driving factors
• Instagram:
  – 50 million registered users (35 M in the last four months)
  – 7 employees
  – A growing ecosystem based on it!
    • Search
    • Send postcards
    • Manage your photos
    • Build a poster
    • etc.
  – Sold to Facebook (for $1 billion!) earlier this year
Oge Marques http://thenextweb.com/apps/2011/12/07/instagram-hits-15m-users-and-has-2-people-working-on-an-android-app-right-now/ http://www.nuwomb.com/instagram/
Mobile visual search: driving factors
• A natural use case for CBIR with query-by-example (QBE) – at last!
  – The example is right in front of the user!
Oge Marques Girod et al. IEEE Multimedia 2011
• The mobile client processes the query image, extracts features, and transmits feature data. The image-retrieval algorithms run on the server using the feature data as the query.
• The mobile client downloads data from the server, and all image matching is performed on the device.

One could also imagine a hybrid of the approaches mentioned above. When the database is small, it can be stored on the phone, and image-retrieval algorithms can be run locally [8]. When the database is large, it has to be placed on a remote server and the retrieval algorithms are run remotely.

In each case, the retrieval framework has to work within the stringent memory, computation, power, and bandwidth constraints of the mobile device. The size of the data transmitted over the network needs to be as small as possible to reduce network latency and improve the user experience. The server latency has to be low as we scale to large databases. This article reviews the recent advances in content-based image retrieval with a focus on mobile applications. We first review large-scale image retrieval, highlighting recent progress in mobile visual search. As an example, we then present the Stanford Product Search system, a low-latency interactive visual search system. Several sidebars in this article invite the interested reader to dig deeper into the underlying algorithms.

ROBUST MOBILE IMAGE RECOGNITION

Today, the most successful algorithms for content-based image retrieval use an approach that is referred to as bag of features (BoFs) or bag of words (BoWs). The BoW idea is borrowed from text retrieval. To find a particular text document, such as a Web page, it is sufficient to use a few well-chosen words. In the database, the document itself can likewise be represented by a bag of salient words, regardless of where these words appear in the text. For images, robust local features take the analogous role of visual words. Like text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the initial stages of the retrieval pipeline. However, the variability of features extracted from different images of the same object makes the problem much more challenging.

A typical pipeline for image retrieval is shown in Figure 2. First, the local features are extracted from the query image. The set of image features is used to assess the similarity between query and database images. For mobile applications, individual features must be robust against geometric and photometric distortions encountered when the user takes the query photo from a different viewpoint and with different lighting compared to the corresponding database image.

Next, the query features are quantized [9]–[12]. The partitioning into quantization cells is precomputed for the database, and each quantization cell is associated with a list of database images in which the quantized feature vector appears somewhere. This inverted file circumvents a pairwise comparison of each query feature vector with all the feature vectors in the database and is the key to very fast retrieval. Based on the number of features they have in common with the query image, a short list of potentially similar images is selected from the database.

Finally, a geometric verification (GV) step is applied to the most similar matches in the database. The GV finds a coherent spatial pattern between the features of the query image and the candidate database image to ensure that the match is plausible. Example retrieval systems are presented in [9]–[14].

For mobile visual search, there are considerable challenges in providing users with an interactive experience. Currently deployed systems typically transmit an image from the client to the server, which might require tens of seconds. As we scale to large databases, the inverted file index becomes very large, with memory-swapping operations slowing down the feature-matching stage. Further, the GV step is computationally expensive and thus increases the response time. We discuss each block of the retrieval pipeline in the following, focusing on how to meet the challenges of mobile visual search.
[FIG1] A snapshot of an outdoor mobile visual search system being used. The system augments the viewfinder with information about the objects it recognizes in the image taken with a camera phone.
[FIG2] A pipeline for image retrieval (query image → feature extraction → feature matching against the database → geometric verification). Local features are extracted from the query image. Feature matching finds a small set of images in the database that have many features in common with the query image. The GV step rejects all matches with feature locations that cannot be plausibly explained by a change in viewing position.
MOBILE IMAGE-RETRIEVAL APPLICATIONS POSE A UNIQUE SET OF CHALLENGES.
MVS: commercial opportunities
• Example app (La Redoute by pixlinQ)
Oge Marques http://www.youtube.com/watch?v=qUZCFtc42Q4
Context
What else is going on?
Context
• Research: datasets and groups
• Standardization: MPEG CDVS efforts
• Commercial: main players (so far)
Oge Marques
Datasets for MVS research
• Stanford Mobile Visual Search Data Set (http://web.cs.wpi.edu/~claypool/mmsys-dataset/2011/stanford/)
  – Key characteristics:
    • rigid objects
    • widely varying lighting conditions
    • perspective distortion
    • foreground and background clutter
    • realistic ground-truth reference data
    • query data collected from heterogeneous low- and high-end camera phones
Oge Marques Chandrasekhar et al. ACM MMSys 2011
SMVS Data Set: categories and examples
• DVD covers
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/dvd_covers.html
SMVS Data Set: categories and examples
• CD covers
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/cd_covers.html
SMVS Data Set: categories and examples
• Museum paintings
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/museum_paintings.html
Other MVS data sets
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 - July 2011, Torino, IT
MPEG Compact Descriptors for Visual Search (CDVS)
• Objective
  – Define a standard that enables efficient implementation of visual search functionality on mobile devices
• Scope
  – bitstream of descriptors
  – parts of the descriptor extraction process (e.g., key-point detection) needed to ensure interoperability
• Additional info:
  – https://mailhost.tnt.uni-hannover.de/mailman/listinfo/cdvs
  – http://mpeg.chiariglione.org/meetings/geneva11-1/geneva_ahg.htm (ad hoc groups)
Oge Marques Bober, Cordara, and Reznik (2010)
MPEG CDVS
• Summarized timeline
Oge Marques
that among several component technologies for image retrieval, such a standard should focus primarily on defining the format of descriptors and parts of their extraction process (such as interest point detectors) needed to ensure interoperability. Such descriptors must be compact, image-format independent, and sufficient for robust image matching. Hence, the title Compact Descriptors for Visual Search was coined as an interim name for this activity. Requirements and Evaluation Framework documents have subsequently been produced to formulate precise criteria and evaluation methodologies to be used in the selection of technology for this standard. The Call for Proposals [17] was issued at the 96th MPEG meeting in Geneva, in March 2011, and responses are now expected by November 2011. Table 1 lists milestones to be reached in the subsequent development of this standard.

It is envisioned that, when completed, this standard will
• ensure interoperability of visual search applications and databases,
• enable a high level of performance of implementations conformant to the standard,
• simplify the design of descriptor extraction and matching for visual search applications,
• enable hardware support for descriptor extraction and matching in mobile devices, and
• reduce the load on wireless networks carrying visual search-related information.

To build full visual-search applications, this standard may be used jointly with other existing standards, such as MPEG Query Format, HTTP, XML, JPEG, and JPSearch.
Conclusions and outlook

Recent years have witnessed remarkable technological progress, making mobile visual search possible today. Robust local image features achieve a high degree of invariance against scale changes, rotation, as well as changes in illumination and other photometric conditions. The BoW approach offers resiliency to partial occlusions and background clutter, and allows the design of efficient indexing schemes. The use of compressed image features makes it possible to communicate query requests using only a fraction of the rate needed by JPEG, and further accelerates search by storing a cache of the visual database on the phone.

Nevertheless, many improvements are still possible and much needed. Existing image features are robust to much of the variability between query and database images, but not all. Improvements in complexity and compactness are also critically important for mobile visual-search systems. In mobile augmented-reality applications, annotations of the viewfinder content simply pop up without the user ever pressing a button. Such continuous annotations require video-rate processing on the mobile device. They may also require improvements in indexing structures, retrieval algorithms, and moving more retrieval-related operations to the phone.

Standardization of compact descriptors for visual search, such as the new initiative within MPEG, will undoubtedly provide a further boost to an already exciting area. In the near
Table 1. Timeline for development of the MPEG standard for visual search.

When           | Milestone                           | Comments
March 2011     | Call for Proposals published        | Registration deadline: 11 July 2011; proposals due: 21 November 2011
December 2011  | Evaluation of proposals             | None
February 2012  | 1st Working Draft                   | First specification and test software model that can be used for subsequent improvements
July 2012      | Committee Draft                     | Essentially complete and stabilized specification
January 2013   | Draft International Standard        | Complete specification; only minor editorial changes are allowed after DIS
July 2013      | Final Draft International Standard  | Finalized specification, submitted for approval and publication as an International Standard
Girod et al. IEEE Multimedia 2011
Commercial apps
• SnapTell
• oMoby (and the IQ Engines API)
• Moodstocks
Oge Marques
SnapTell
• One of the earliest (ca. 2008) MVS apps for iPhone
  – Eventually acquired by Amazon (A9)
• Proprietary technique (“highly accurate and robust algorithm for image matching: Accumulated Signed Gradient (ASG)”)
Oge Marques http://www.snaptell.com/technology/index.htm
oMoby (and the IQ Engines API)
• iPhone app
Oge Marques http://omoby.com/pages/screenshots.php
oMoby (and the IQ Engines API)
• The IQ Engines API: “vision as a service”
Oge Marques http://www.iqengines.com/applications.php
Moodstocks: overview
• Offline image recognition thanks to smart synchronization of image signatures
Oge Marques http://www.youtube.com/watch?v=tsxe23b12eU
Perspective
Which challenges and opportunities lie ahead?
MVS: technical challenges
• How to ensure low latency (and interactive queries) under constraints such as:
  – Network bandwidth
  – Computational power
  – Battery consumption
• How to achieve robust visual recognition in spite of low-resolution cameras, varying lighting conditions, etc.
• How to handle broad and narrow domains
Oge Marques
Other technical challenges
• How to handle the (infamous) "semantic gap"
• Combination of text-based and visual queries
• Visualization of results
• Users' needs and intentions
Oge Marques
The semantic gap
• The semantic gap is the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation.
• “The pivotal point in content-based retrieval is that the user seeks semantic similarity, but the database can only provide similarity by data processing. This is what we called the semantic gap.” [Smeulders et al., 2000]
Oge Marques
Alipr
Oge Marques
Google similarity search
Oge Marques
Google sort by subject
http://www.google.com/landing/imagesorting/
Oge Marques
Google image swirl
http://image-swirl.googlelabs.com/ Oge Marques
Challenge: users’ needs and intentions
• Users and developers have quite different views
• Cultural and contextual information should be taken into account
• User intentions are hard to infer
  – Privacy issues
  – Users themselves don't always know what they want
  – Who misses the MS Office paper clip?
Oge Marques
Concluding thoughts
Oge Marques
(Mobile) visual search and retrieval is a fascinating research field with many open challenges and opportunities, which have the potential to impact the way we organize, annotate, and retrieve visual data (images and videos).
Learn more about it
• http://savvash.blogspot.com/
Oge Marques