Image retrieval: challenges and opportunities
TRANSCRIPT
Image retrieval: challenges and opportunities
Oge Marques Florida Atlantic University
Boca Raton, FL - USA
June 4, 2012 – UTFPR
Curitiba, PR - Brazil
Watch this…
Oge Marques http://www.google.com/mobile/goggles
Google Goggles
• Mobile visual search (MVS) solution
  – Android and iPhone
  – Narrow-domain search and retrieval
Oge Marques http://www.google.com/mobile/goggles
Outline
• How does it work?
• Why is it relevant?
• What else is going on?
• Which challenges and opportunities lie ahead?
Oge Marques
Fundamentals
How does it work?
Fundamentals
• Google Goggles is (one of) the first – and maybe the best-known – solutions for MVS
• It is a contemporary example of content-based image retrieval (CBIR)
• Its technical details (algorithms, etc.) are not publicly available
• However…
Oge Marques
MVS: Pipeline for image retrieval
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: 3 scenarios
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: descriptor extraction
• Interest point detection
• Feature descriptor computation
Oge Marques Girod et al. IEEE Multimedia 2011
Interest point detection
• Numerous interest-point detectors have been proposed in the literature:
  – Harris corners (Harris and Stephens 1988)
  – Scale-Invariant Feature Transform (SIFT) Difference-of-Gaussian (DoG) (Lowe 2004)
  – Maximally Stable Extremal Regions (MSERs) (Matas et al. 2002)
  – Hessian affine (Mikolajczyk et al. 2005)
  – Features from Accelerated Segment Test (FAST) (Rosten and Drummond 2006)
  – Hessian blobs (Bay, Tuytelaars and Van Gool 2006)
• Different detectors offer different tradeoffs between repeatability and computational complexity (a minimal detection sketch follows below).
• See (Mikolajczyk and Schmid 2005) for a comparative performance evaluation of local descriptors in a common framework.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
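As a concrete illustration of this step, here is a minimal sketch of interest-point detection using OpenCV's FAST detector, one of the detectors listed above. The `opencv-python` dependency and the file name `query.jpg` are assumptions for the example, not part of the talk.

```python
# Minimal sketch: interest-point detection with the FAST detector
# (Rosten and Drummond 2006), as implemented in OpenCV.
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
fast = cv2.FastFeatureDetector_create(threshold=25)  # corner-response threshold
keypoints = fast.detect(img, None)                   # candidate interest points
print(f"{len(keypoints)} interest points detected")
```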
Feature descriptor computation
• After interest-point detection, we compute a visual word descriptor on a normalized patch.
• Ideally, descriptors should be:
  – robust to small distortions in scale, orientation, and lighting conditions;
  – discriminative, i.e., characteristic of an image or a small set of images;
  – compact, due to typical mobile computing constraints (a minimal descriptor-computation sketch follows below).
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
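To make the properties above tangible, the sketch below computes SIFT descriptors on a query image: robust and discriminative, but at 128 floats per keypoint not especially compact, which is exactly the gap CHoG targets. OpenCV and the file name `query.jpg` are again assumptions.

```python
# Minimal sketch: SIFT descriptor computation (Lowe 2004) with OpenCV.
# Each keypoint yields a 128-D histogram of gradient orientations computed
# over a scale- and rotation-normalized patch.
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(descriptors.shape)  # (num_keypoints, 128) float32 -- robust, but not compact
```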
Feature descriptor computation
• Examples of feature descriptors in the literature:
  – SIFT (Lowe 1999)
  – Speeded-Up Robust Features (SURF) (Bay et al. 2008)
  – Gradient Location and Orientation Histogram (GLOH) (Mikolajczyk and Schmid 2005)
  – Compressed Histogram of Gradients (CHoG) (Chandrasekhar et al. 2009, 2010)
• See (Winder and Brown CVPR 2007), (Winder, Hua, and Brown CVPR 2009), and (Mikolajczyk and Schmid PAMI 2005) for comparative performance evaluations of different descriptors.
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
Feature descriptor computation
• What about compactness?
  – Option 1: Compress off-the-shelf descriptors.
    • Result: poor rate-constrained image-retrieval performance.
  – Option 2: Design a descriptor with compression in mind.
    • Example: CHoG (Compressed Histogram of Gradients) (Chandrasekhar et al. 2009, 2010)
Oge Marques Girod et al. IEEE Signal Processing Magazine 2011
CHoG: Compressed Histogram of Gradients
Oge Marques Chandrasekhar et al. CVPR 09,10 Bernd Girod: Mobile Visual Search
[Figure: CHoG descriptor pipeline – an image patch goes through gradient computation (dx, dy), spatial binning, and per-bin gradient distributions; histogram compression then yields a compact bitstring descriptor.]
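The figure's pipeline can be approximated in a few lines of NumPy. The sketch below is not the published CHoG implementation: it only illustrates the gradients, spatial binning, and per-cell gradient-distribution stages, and omits the histogram-compression stage entirely; the grid size, bin count, and random patch are illustrative assumptions.

```python
import numpy as np

def chog_like_descriptor(patch, grid=3, bins=9):
    """Per-cell gradient-orientation histograms over a spatial grid."""
    dy, dx = np.gradient(patch.astype(np.float32))  # gradients (dy, dx)
    angles = np.arctan2(dy, dx)                     # orientation per pixel
    weights = np.hypot(dx, dy)                      # magnitude per pixel
    h, w = patch.shape
    descriptor = []
    for i in range(grid):                           # spatial binning
        for j in range(grid):
            cell = (slice(i * h // grid, (i + 1) * h // grid),
                    slice(j * w // grid, (j + 1) * w // grid))
            hist, _ = np.histogram(angles[cell], bins=bins,
                                   range=(-np.pi, np.pi),
                                   weights=weights[cell])
            descriptor.append(hist / (hist.sum() + 1e-9))  # distribution per cell
    return np.concatenate(descriptor)  # CHoG would now quantize/entropy-code this

patch = np.random.rand(32, 32)             # stand-in for a normalized image patch
print(chog_like_descriptor(patch).shape)   # (grid * grid * bins,) == (81,)
```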
CHoG: Compressed Histogram of Gradients
• Performance evaluation
  – Recall vs. bit rate
Oge Marques Girod et al. IEEE Multimedia 2011
by approximately a factor of two. Moreover, transmission of features allows yet another optimization: it is possible to use progressive transmission of image features and let the server execute searches on a partial set of features as they arrive [15]. Once the server finds a result that has a sufficiently high matching score, it terminates the search and immediately sends the results back. This optimization reduces system latency by another factor of two.

Overall, the SPS system demonstrates that, using the described array of technologies, mobile visual-search systems can achieve high recognition accuracy, scale to realistically large databases, and deliver search results in an acceptable time.
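The early-termination idea is easy to sketch. The code below is not the SPS system's implementation: the toy `match_score` function and the random data are stand-ins, meant only to show how scoring a partial feature set after each network chunk lets the server stop before the full query arrives.

```python
import numpy as np

def match_score(features, database):
    """Toy scorer: fraction of query features lying near any database feature."""
    best_id, best = None, 0.0
    feats = np.asarray(features)                  # (n, d) partial query set
    for img_id, db_feats in database.items():
        dists = np.linalg.norm(feats[:, None] - db_feats[None], axis=2)
        score = float((dists.min(axis=1) < 0.1).mean())
        if score > best:
            best_id, best = img_id, score
    return best_id, best

def progressive_search(feature_chunks, database, threshold=0.8):
    received, result = [], (None, 0.0)
    for chunk in feature_chunks:                  # chunks arrive over the network
        received.extend(chunk)
        result = match_score(received, database)  # search on the partial set
        if result[1] >= threshold:                # sufficiently high score:
            break                                 # stop early, skip the rest
    return result

rng = np.random.default_rng(1)
db = {i: rng.random((40, 8)) for i in range(5)}   # toy 8-D "descriptors"
query = db[2][rng.permutation(40)]                # query = image 2's features
chunks = np.array_split(query, 8)                 # progressive transmission
print(progressive_search([c.tolist() for c in chunks], db))  # early hit: image 2
```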
Emerging MPEG standard

As we have seen, key component technologies for mobile visual search already exist, and we can choose among several possible architectures to design such a system. We have shown these options at the beginning, in Figure 2. The architecture shown in Figure 2a is the easiest one to implement on a mobile phone, but it requires fast networks such as Wi-Fi to achieve good performance. The architecture shown in Figure 2b reduces network latency and allows fast response over today's 3G networks, but requires descriptors to be extracted on the phone. Many applications might be accelerated further by using a cache of the database on the phone, as exemplified by the architecture shown in Figure 2c.

However, this immediately raises the question of interoperability. How can we enable mobile visual-search applications and databases across a broad range of devices and platforms if the information is exchanged in the form of compressed visual descriptors rather than images? This question was initially posed during the Workshop on Mobile Visual Search, held at Stanford University in December 2009. The discussion led to a formal request by the US delegation to MPEG, suggesting that the potential interest in a standard for visual search applications be explored [16]. As a result, an exploratory activity was started in MPEG, which produced a series of documents in the subsequent year describing applications, use cases, objectives, scope, and requirements for a future standard [17].

As MPEG exploratory work progressed, it was recognized that the suite of existing MPEG technologies, such as MPEG-7 Visual, does not yet include tools for robust image-based retrieval and that a new standard should therefore be defined. It was further recognized
[Figure 7] Comparison of different schemes with regard to classification accuracy and query size (classification accuracy, %, vs. query size, Kbytes, for send feature (CHoG), send image (JPEG), and send feature (SIFT)). CHoG descriptor data is an order of magnitude smaller compared to JPEG images or uncompressed SIFT descriptors.
[Figure 8] End-to-end latency for different schemes (response time in seconds, split into feature extraction, network transmission, and retrieval, for JPEG (3G), feature (3G), feature progressive (3G), JPEG (WLAN), and feature (WLAN)). Compared to a system transmitting a JPEG query image, a scheme employing progressive transmission of CHoG features achieves approximately a fourfold reduction in system latency over a 3G network.
MVS: feature indexing and matching
• Goal: produce a data structure that can quickly return a short list of the database candidates most likely to match the query image.
  – The short list may contain false positives as long as the correct match is included.
  – Slower pairwise comparisons can then be performed on just the short list of candidates rather than on the entire database.
• Example technique: Vocabulary Tree (VT)-based retrieval (see the sketch below).
Oge Marques Girod et al. IEEE Multimedia 2011
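Below is a minimal sketch of the vocabulary-tree idea: hierarchical k-means quantization with inverted files at the leaves. It is not the original Nistér-Stewénius implementation; the tiny branch factor and depth, the random descriptors, and the use of scikit-learn's `KMeans` are all illustrative assumptions.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def build_tree(descriptors, branch=4, depth=2):
    """Recursively cluster descriptors; each leaf acts as one visual word."""
    if depth == 0 or len(descriptors) < branch:
        return {"leaf": True, "inverted": defaultdict(float)}
    km = KMeans(n_clusters=branch, n_init=4).fit(descriptors)
    children = [build_tree(descriptors[km.labels_ == c], branch, depth - 1)
                for c in range(branch)]
    return {"leaf": False, "km": km, "children": children}

def quantize(tree, d):
    """Walk from the root to a leaf using branch*depth distance tests."""
    while not tree["leaf"]:
        c = tree["km"].predict(d.reshape(1, -1))[0]
        tree = tree["children"][c]
    return tree

rng = np.random.default_rng(0)
db = {img_id: rng.random((50, 128)) for img_id in range(10)}  # toy descriptors
tree = build_tree(np.vstack(list(db.values())))

# Indexing: record per-image feature counts in the leaf inverted files.
for img_id, descs in db.items():
    for d in descs:
        quantize(tree, d)["inverted"][img_id] += 1.0

# Query: images sharing many visual words with the query rise to the top.
scores = defaultdict(float)
for d in db[3]:                        # use image 3's own features as the query
    for img_id, w in quantize(tree, d)["inverted"].items():
        scores[img_id] += w
print(max(scores, key=scores.get))     # head of the short list; expect 3
```

The point of the tree structure is that quantizing a descriptor costs only branch × depth distance computations instead of a comparison against every visual word, and the inverted files restrict scoring to images that share at least one word with the query.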
MVS: geometric verification
• Goal: use location information of features in query and database images to confirm that the feature matches are consistent with a change in viewpoint between the two images.
Oge Marques Girod et al. IEEE Multimedia 2011
MVS: geometric verification
• Method: perform pairwise matching of feature descriptors and evaluate the geometric consistency of the correspondences.
• Techniques:
  – The geometric transform between the query and database image is usually estimated with robust regression techniques such as:
    • Random sample consensus (RANSAC) (Fischler and Bolles 1981)
    • Hough transform (Lowe 2004)
  – The transformation is often represented by an affine mapping or a homography.
• Note: GV is computationally expensive, which is why it is applied only to a subset of images selected during the feature-matching stage (see the sketch below).
Oge Marques
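A common concrete realization of this stage, sketched below with OpenCV: match SIFT descriptors pairwise, filter with Lowe's ratio test, then let RANSAC estimate a homography and count the inliers. The file names and the 0.75 and 5.0 thresholds are conventional choices, not values from the talk.

```python
import cv2
import numpy as np

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)      # hypothetical files
img2 = cv2.imread("candidate.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Pairwise descriptor matching, filtered with Lowe's ratio test.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]

# RANSAC keeps only correspondences consistent with a single homography.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # 5.0-px inlier radius
print(f"{int(mask.sum())} of {len(good)} matches survive geometric verification")
```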
[11] use weak geometric consistency checks to rerank images based on the orientation and scale information of all features. The authors in [53] and [69] propose incorporating geometric information into the VT matching or hashing step. In [70] and [71], the authors investigate how to speed up RANSAC estimation itself. Philbin et al. [72] use single pairs of matching features to propose hypotheses of the geometric transformation model and verify only possible sets of hypotheses. Weak geometric consistency checks are typically used to rerank a larger candidate list of images before a full GV is performed on a shorter candidate list.

To speed up GV, one can add a geometric reranking step before the RANSAC GV step, as illustrated in Figure 5. In [73], we propose a reranking step that incorporates geometric information directly into the fast index lookup stage and use it to reorder the list of top matching images (see "Fast Geometric Reranking"). The main advantage of the scheme is that it only requires x, y feature-location data and does not use scale
INVERTED INDEX COMPRESSION

For a database containing 1 million images and a VT that uses soft binning, each image ID can be stored in a 32-bit unsigned integer, and each fractional count can be stored in a 32-bit float in the inverted index. The memory usage of the entire inverted index is $\sum_{k=1}^{K} N_k \cdot 64$ bits, where $N_k$ is the length of the inverted list at the $k$th leaf node. For a database of 1 million product images, this amount of memory reaches 10 GB, a huge amount for even a modern server. Such a large memory footprint limits the ability to run other concurrent processes on the same server, such as recognition systems for other databases. When the inverted index's memory usage exceeds the server's available random access memory (RAM), swapping between main and virtual memory occurs, which significantly slows down all processes.

A compressed inverted index [58] can significantly reduce memory usage without affecting recognition accuracy. First, because each list of IDs $\{i_{k1}, i_{k2}, \ldots, i_{kN_k}\}$ is sorted, it is more efficient to store consecutive ID differences $\{d_{k1} = i_{k1},\ d_{k2} = i_{k2} - i_{k1},\ \ldots,\ d_{kN_k} = i_{kN_k} - i_{k(N_k-1)}\}$ in place of the IDs. This practice is also commonly used in text retrieval [62]. Second, the fractional visit counts can be quantized to a few representative values using Lloyd-Max quantization. Third, the distributions of the ID differences and visit counts are far from uniform, so variable-length coding can be much more rate-efficient than fixed-length coding. Using the distributions of the ID differences and visit counts, each inverted list can be encoded using an arithmetic code (AC) [63]. Since keeping the decoding delay low is very important for interactive mobile visual-search applications, a scheme that allows ultra-fast decoding is often preferred over AC. The carryover code [64] and the recursive bottom-up complete (RBUC) code [65] have been shown to be at least ten times faster in decoding than AC, while achieving comparable compression gains. The carryover and RBUC codes attain these speedups by enforcing word-aligned memory accesses.

Figure S6(a) compares the memory usage of the inverted index with and without compression using the RBUC code. Index compression reduces memory usage from nearly 10 GB to 2 GB. This five-times reduction leads to a substantial speedup in server-side processing, as shown in Figure S6(b). Without compression, the large inverted index causes swapping between main and virtual memory and slows down the retrieval engine. After compression, memory swapping is avoided and memory congestion delays no longer contribute to the query latency.
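The first of the three steps above (delta-coding the sorted ID lists) is illustrated below, with a simple byte-oriented varint standing in as the variable-length code. The carryover and RBUC codes used in the paper are word-aligned and much faster to decode, but the rate-saving principle is the same; the example list is made up.

```python
# Sketch of inverted-list compression: store sorted image IDs as consecutive
# differences, then encode the small gaps with a variable-length code.
def delta_encode(ids):
    """[5, 9, 23] -> [5, 4, 14]: first ID, then consecutive gaps."""
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def varint_bytes(n):
    """Encode a non-negative int in 7-bit groups; the MSB flags continuation."""
    out = bytearray()
    while True:
        n, low = n >> 7, n & 0x7F
        out.append(low | (0x80 if n else 0))
        if not n:
            return bytes(out)

inverted_list = [5, 9, 23, 24, 1000004]          # sorted image IDs at one leaf
gaps = delta_encode(inverted_list)               # small numbers compress well
encoded = b"".join(varint_bytes(g) for g in gaps)
print(len(encoded), "bytes vs", 4 * len(inverted_list), "bytes uncompressed")
```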
[FIG S6] (a) Memory usage for the inverted index with and without compression; a five-times savings in memory is achieved with compression. (b) Server-side query latency (per image) with and without compression. The RBUC code is used to encode the inverted index.
[FIG5] An image retrieval pipeline (query data → VT → geometric reranking → GV → identify information) can be greatly sped up by incorporating a geometric reranking stage.
[FIG4] In the GV step, we match feature descriptors pairwise and find feature correspondences that are consistent with a geometric model. True feature matches are shown in red. False feature matches are shown in green.
Girod et al. IEEE Multimedia 2011
Relevance
Why is it relevant?
Relevance
• Explosive growth and increasing popularity of mobile devices and apps
• (Finally!) a good use case for CBIR
• Many commercial opportunities
Oge Marques
Mobile visual search: driving factors
• Age of mobile computing
Oge Marques http://60secondmarketer.com/blog/2011/10/18/more-mobile-phones-than-toothbrushes/
Mobile visual search: driving factors
• Why do I need a camera? I have a smartphone… (22 Dec 2011)
Oge Marques http://www.cellular-news.com/story/52382.php
Mobile visual search: driving factors
• Powerful devices
1 GHz ARM Cortex-A9 processor, PowerVR SGX543MP2, Apple A5 chipset
Oge Marques http://www.apple.com/iphone/specs.html http://www.gsmarena.com/apple_iphone_4s-4212.php
Mobile visual search: driving factors
• Powerful devices
Oge Marques http://europe.nokia.com/PRODUCT_METADATA_0/Products/Phones/8000-series/808/Nokia808PureView_Whitepaper.pdf http://www.nokia.com/fr-fr/produits/mobiles/808/
Mobile visual search: driving factors
• Instagram:
  – 50 million registered users (35 M in the last four months)
  – 7 employees
  – A growing ecosystem based on it!
    • Search
    • Send postcards
    • Manage your photos
    • Build a poster
    • etc.
  – Sold to Facebook (for $1 billion!) earlier this year
Oge Marques http://thenextweb.com/apps/2011/12/07/instagram-hits-15m-users-and-has-2-people-working-on-an-android-app-right-now/ http://www.nuwomb.com/instagram/
Mobile visual search: driving factors
• A natural use case for CBIR with query-by-example (QBE) – at last!
  – The example is right in front of the user!
Oge Marques Girod et al. IEEE Multimedia 2011
• The mobile client processes the query image, extracts features, and transmits feature data. The image-retrieval algorithms run on the server using the feature data as the query.
• The mobile client downloads data from the server, and all image matching is performed on the device.

One could also imagine a hybrid of the approaches mentioned above. When the database is small, it can be stored on the phone, and image-retrieval algorithms can be run locally [8]. When the database is large, it has to be placed on a remote server and the retrieval algorithms are run remotely.

In each case, the retrieval framework has to work within the stringent memory, computation, power, and bandwidth constraints of the mobile device. The size of the data transmitted over the network needs to be as small as possible to reduce network latency and improve the user experience. The server latency has to be low as we scale to large databases. This article reviews the recent advances in content-based image retrieval with a focus on mobile applications. We first review large-scale image retrieval, highlighting recent progress in mobile visual search. As an example, we then present the Stanford Product Search system, a low-latency interactive visual search system. Several sidebars in this article invite the interested reader to dig deeper into the underlying algorithms.

ROBUST MOBILE IMAGE RECOGNITION

Today, the most successful algorithms for content-based image retrieval use an approach that is referred to as bag of features (BoFs) or bag of words (BoWs). The BoW idea is borrowed from text retrieval. To find a particular text document, such as a Web page, it is sufficient to use a few well-chosen words. In the database, the document itself can likewise be represented by a bag of salient words, regardless of where these words appear in the text. For images, robust local features take the analogous role of visual words. Like text retrieval, BoF image retrieval does not consider where in the image the features occur, at least in the initial stages of the retrieval pipeline. However, the variability of features extracted from different images of the same object makes the problem much more challenging.

A typical pipeline for image retrieval is shown in Figure 2. First, the local features are extracted from the query image. The set of image features is used to assess the similarity between query and database images. For mobile applications, individual features must be robust against geometric and photometric distortions encountered when the user takes the query photo from a different viewpoint and with different lighting compared to the corresponding database image.

Next, the query features are quantized [9]–[12]. The partitioning into quantization cells is precomputed for the database, and each quantization cell is associated with a list of database images in which the quantized feature vector appears somewhere. This inverted file circumvents a pairwise comparison of each query feature vector with all the feature vectors in the database and is the key to very fast retrieval. Based on the number of features they have in common with the query image, a short list of potentially similar images is selected from the database.

Finally, a geometric verification (GV) step is applied to the most similar matches in the database. The GV finds a coherent spatial pattern between the features of the query image and the candidate database image to ensure that the match is plausible. Example retrieval systems are presented in [9]–[14].

For mobile visual search, there are considerable challenges in providing users with an interactive experience. Currently deployed systems typically transmit an image from the client to the server, which might require tens of seconds. As we scale to large databases, the inverted file index becomes very large, with memory-swapping operations slowing down the feature-matching stage. Further, the GV step is computationally expensive and thus increases the response time. We discuss each block of the retrieval pipeline in the following, focusing on how to meet the challenges of mobile visual search.
[FIG1] A snapshot of an outdoor mobile visual search system being used. The system augments the viewfinder with information about the objects it recognizes in the image taken with a camera phone.
[FIG2] A pipeline for image retrieval (query image → feature extraction → feature matching against the database → geometric verification). Local features are extracted from the query image. Feature matching finds a small set of images in the database that have many features in common with the query image. The GV step rejects all matches with feature locations that cannot be plausibly explained by a change in viewing position.
MOBILE IMAGE-RETRIEVAL APPLICATIONS POSE A UNIQUE SET OF CHALLENGES.
MVS: commercial opportunities
• Example app (La Redoute by pixlinQ)
Oge Marques http://www.youtube.com/watch?v=qUZCFtc42Q4
Context
What else is going on?
Context
• Research: datasets and groups
• Standardization: MPEG CDVS efforts
• Commercial: main players (so far)
Oge Marques
Datasets for MVS research
• Stanford Mobile Visual Search Data Set (http://web.cs.wpi.edu/~claypool/mmsys-dataset/2011/stanford/)
  – Key characteristics:
    • rigid objects
    • widely varying lighting conditions
    • perspective distortion
    • foreground and background clutter
    • realistic ground-truth reference data
    • query data collected from heterogeneous low- and high-end camera phones
Oge Marques Chandrasekhar et al. ACM MMSys 2011
SMVS Data Set: categories and examples
• DVD covers
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/dvd_covers.html
SMVS Data Set: categories and examples
• CD covers
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/cd_covers.html
SMVS Data Set: categories and examples
• Museum paintings
Oge Marques http://web.cs.wpi.edu/~claypool/mmsys-2011-dataset/stanford/mvs_images/museum_paintings.html
Other MVS data sets
Oge Marques ISO/IEC JTC1/SC29/WG11/N12202 - July 2011, Torino, IT
MPEG Compact Descriptors for Visual Search (CDVS)
• Objective
  – Define a standard that enables efficient implementation of visual search functionality on mobile devices
• Scope
  – bitstream of descriptors
  – parts of the descriptor extraction process (e.g., key-point detection) needed to ensure interoperability
• Additional info:
  – https://mailhost.tnt.uni-hannover.de/mailman/listinfo/cdvs
  – http://mpeg.chiariglione.org/meetings/geneva11-1/geneva_ahg.htm (ad hoc groups)
Oge Marques Bober, Cordara, and Reznik (2010)
MPEG CDVS
• Summarized timeline
Oge Marques
that among several component technologies for image retrieval, such a standard should focus primarily on defining the format of descriptors and parts of their extraction process (such as interest point detectors) needed to ensure interoperability. Such descriptors must be compact, image-format independent, and sufficient for robust image matching. Hence, the title Compact Descriptors for Visual Search was coined as an interim name for this activity. Requirements and Evaluation Framework documents have subsequently been produced to formulate precise criteria and evaluation methodologies to be used in the selection of technology for this standard. The Call for Proposals [17] was issued at the 96th MPEG meeting in Geneva, in March 2011, and responses are now expected by November 2011. Table 1 lists milestones to be reached in the subsequent development of this standard.

It is envisioned that, when completed, this standard will
• ensure interoperability of visual search applications and databases,
• enable a high level of performance of implementations conformant to the standard,
• simplify the design of descriptor extraction and matching for visual search applications,
• enable hardware support for descriptor extraction and matching in mobile devices, and
• reduce the load on wireless networks carrying visual search-related information.

To build full visual-search applications, this standard may be used jointly with other existing standards, such as MPEG Query Format, HTTP, XML, JPEG, and JPSearch.
Conclusions and outlook

Recent years have witnessed remarkable technological progress, making mobile visual search possible today. Robust local image features achieve a high degree of invariance against scale changes, rotation, as well as changes in illumination and other photometric conditions. The BoW approach offers resiliency to partial occlusions and background clutter, and allows the design of efficient indexing schemes. The use of compressed image features makes it possible to communicate query requests using only a fraction of the rate needed by JPEG, and further accelerates search by storing a cache of the visual database on the phone.

Nevertheless, many improvements are still possible and much needed. Existing image features are robust to much of the variability between query and database images, but not all. Improvements in complexity and compactness are also critically important for mobile visual-search systems. In mobile augmented-reality applications, annotations of the viewfinder content simply pop up without the user ever pressing a button. Such continuous annotations require video-rate processing on the mobile device. They may also require improvements in indexing structures, retrieval algorithms, and moving more retrieval-related operations to the phone.

Standardization of compact descriptors for visual search, such as the new initiative within MPEG, will undoubtedly provide a further boost to an already exciting area. In the near
Table 1. Timeline for development of the MPEG standard for visual search.

When           | Milestone                           | Comments
March 2011     | Call for Proposals published        | Registration deadline: 11 July 2011; proposals due: 21 November 2011
December 2011  | Evaluation of proposals             | None
February 2012  | 1st Working Draft                   | First specification and test software model that can be used for subsequent improvements
July 2012      | Committee Draft                     | Essentially complete and stabilized specification
January 2013   | Draft International Standard        | Complete specification; only minor editorial changes are allowed after DIS
July 2013      | Final Draft International Standard  | Finalized specification, submitted for approval and publication as an International Standard
Girod et al. IEEE Multimedia 2011
Commercial apps
• SnapTell
• oMoby (and the IQ Engines API)
• Moodstocks
Oge Marques
SnapTell
• One of the earliest (ca. 2008) MVS apps for iPhone
  – Eventually acquired by Amazon (A9)
• Proprietary technique (“highly accurate and robust algorithm for image matching: Accumulated Signed Gradient (ASG)”)
Oge Marques http://www.snaptell.com/technology/index.htm
oMoby (and the IQ Engines API)
• iPhone app
Oge Marques http://omoby.com/pages/screenshots.php
oMoby (and the IQ Engines API)
• The IQ Engines API: “vision as a service”
Oge Marques http://www.iqengines.com/applications.php
Moodstocks: overview
• Offline image recognition thanks to smart synchronization of image signatures
Oge Marques http://www.youtube.com/watch?v=tsxe23b12eU
Perspective
Which challenges and opportunities lie ahead?
MVS: technical challenges
• How to ensure low latency (and interactive queries) under constraints such as:
  – Network bandwidth
  – Computational power
  – Battery consumption
• How to achieve robust visual recognition in spite of low-resolution cameras, varying lighting conditions, etc.
• How to handle broad and narrow domains
Oge Marques
Other technical challenges
• How to handle the (infamous) "semantic gap"
• Combination of text-based and visual queries
• Visualization of results
• Users' needs and intentions
Oge Marques
The semantic gap
• The semantic gap is the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation.
• “The pivotal point in content-based retrieval is that the user seeks semantic similarity, but the database can only provide similarity by data processing. This is what we called the semantic gap.” [Smeulders et al., 2000]
Oge Marques
Alipr
Oge Marques
Google similarity search
Oge Marques
Google sort by subject
http://www.google.com/landing/imagesorting/
Oge Marques
Google image swirl
http://image-swirl.googlelabs.com/ Oge Marques
Challenge: users’ needs and intentions
• Users and developers have quite different views
• Cultural and contextual information should be taken into account
• User intentions are hard to infer
  – Privacy issues
  – Users themselves don't always know what they want
  – Who misses the MS Office paper clip?
Oge Marques
Concluding thoughts
Oge Marques
(Mobile) visual search and retrieval is a fascinating research field with many open challenges and opportunities, which have the potential to impact the way we organize, annotate, and retrieve visual data (images and videos).
Learn more about it
• http://savvash.blogspot.com/
Oge Marques