
Proceedings of AKBC 2016, pages 57–62, San Diego, California, June 12-17, 2016. ©2016 Association for Computational Linguistics

Know2Look: Commonsense Knowledge for Visual Search

Sreyasi Nag Chowdhury, Niket Tandon, Gerhard Weikum
Max Planck Institute for Informatics

Saarbrücken, Germany
sreyasi, ntandon, [email protected]

Abstract

With the rise in popularity of social media, images accompanied by contextual text form a huge section of the web. However, search and retrieval of such documents still depend largely on textual cues alone. Although visual cues have started to gain attention, the imperfection of object/scene detection does not lead to significantly improved results. We hypothesize that background commonsense knowledge on query terms can significantly aid the retrieval of documents with associated images. To this end we deploy three different modalities - text, visual cues, and commonsense knowledge pertaining to the query - as a recipe for efficient search and retrieval.

1 Introduction

Motivation: Image retrieval by querying visual contents has been on the agenda of the database, information retrieval, multimedia, and computer vision communities for decades (Liu et al., 2007; Datta et al., 2008). Search engines like Baidu, Bing or Google perform reasonably well on this task, but crucially rely on textual cues that accompany an image: tags, caption, URL string, adjacent text, etc.

In recent years, deep learning has led to a boost in the quality of visual object recognition in images with fine-grained object labels (Simonyan and Zisserman, 2014; LeCun et al., 2015; Mordvintsev et al., 2015). Methods like LSDA (Hoffman et al., 2014) are trained on more than 15,000 classes of ImageNet (Deng et al., 2009) (which are mostly leaf-level synsets of WordNet (Miller, 1995)), and annotate newly seen images with class labels for bounding boxes of objects.

Figure 1: Example cases where visual object detection may or may not aid in search and retrieval. (a) Good object detection - detected visual objects: traffic light, car, person, bicycle, bus, car, grille, radiator grille. (b) Poor object detection - detected visual objects: tv or monitor, cargo door, piano.

For the image in Figure 1a, for example, the object labels traffic light, car, person, bicycle and bus have been recognized, making it easily retrievable for queries with these concepts. However, these labels come with uncertainty. For the image in Figure 1b, there is much higher noise in the visual object labels, so querying by visual labels would not work here.

Opportunity and Challenge: These limitations of text-based search, on one hand, and visual-object search, on the other hand, suggest combining the cues from text and vision for more effective retrieval. Although each side of this combined feature space is incomplete and noisy, the hope is that the combination can improve retrieval quality.


Figure 2: Sample queries containing abstract concepts and expected results of image retrieval: "environment friendly traffic", "downsides of mountaineering", "street-side soulful music".

Unfortunately, images that show more sophisticated scenes, or the emotions they evoke in the viewer, are still out of reach. Figure 2 shows three examples, along with query formulations that would likely consider these sample images as relevant results. These answers would best be retrieved by queries with abstract words (e.g. "environment friendly") or activity words (e.g. "traffic") rather than words that directly correspond to visual objects (e.g. "car" or "bike"). So there is a vocabulary gap, or even concept mismatch, between what users want and express in queries and the visual and textual cues that come directly with an image. This is the key problem addressed in this paper.

Approach and Contribution: To bridge the concepts and vocabulary between user queries and image features, we propose an approach that harnesses commonsense knowledge (CSK). Recent advances in automatic knowledge acquisition have produced large collections of CSK: physical (e.g. color or shape) as well as abstract (e.g. abilities) properties of everyday objects (e.g. bike, bird, sofa, etc.) (Tandon et al., 2014), subclass and part-whole relations between objects (Tandon et al., 2016), activities and their participants (Tandon et al., 2015), and more. This kind of knowledge allows us to establish relationships between our example queries and observable objects or activities in the image. For example, the following CSK triples establish relationships between 'backpack', 'tourist' and 'travel map': (backpacks, are carried by, tourists), (tourists, use, travel maps). This allows for retrieval of images with generic queries like "travel with backpack".

This idea is worked out into a query expansion model where we leverage a CSK knowledge base to automatically generate additional query words. Our model unifies three kinds of features: textual features from the page context of an image, visual features obtained from recognizing fine-grained object classes in an image, and CSK features in the form of additional properties of the concepts referred to by query words. The weighting of the different features is crucial for query-result ranking. To this end, we have devised a method based on statistical language models (Zhai, 2008).

The paper's contribution can be characterized as follows. We present the first model for incorporating CSK into image retrieval. We develop a full-fledged system architecture for this purpose, along with a query processor and an answer-ranking component. Our system, Know2Look, uses commonsense knowledge to look for images relevant to a query by looking at the components of the images in greater detail. We further discuss experiments that compare our approach to state-of-the-art image search in various configurations. Our approach substantially improves query result quality.

2 Related Work

Existing Commonsense Knowledge Bases: Traditionally, commonsense knowledge bases were curated manually by experts (Lenat, 1995) or through crowd-sourcing (Singh et al., 2002). Modern methods of CSK acquisition are automatic, either from text corpora (Liu and Singh, 2004) or from the web (Tandon et al., 2014).

Vision and NLP: Research at the intersection of Natural Language Processing and Computer Vision has been in the limelight in the recent past. There has been work on automatic image annotation (Wang et al., 2014), description generation (Vinyals et al., 2014; Ordonez et al., 2011; Mitchell et al., 2012), scene understanding (Farhadi et al., 2010), image retrieval through natural language queries (Malinowski and Fritz, 2014), etc.

Commonsense knowledge from text and vision: There have been attempts at learning CSK from real images (Chen et al., 2013) as well as from non-photo-realistic abstractions (Vedantam et al., 2015). Recent work has also leveraged CSK for visual verification of relational phrases (Sadeghi et al., 2015) and for non-visual tasks like fill-in-the-blanks by intelligent agents (Lin and Parikh, 2015). Learning commonsense from visual cues continues to be a challenge in itself. The CSK used in our work is motivated by research on CSK acquisition from the web (Tandon et al., 2014).

3 Multimodal Document Retrieval

The text adjoining an image may or may not explicitly annotate its visual contents. Search engines relying only on textual matches ignore information that may be available solely in the visual cues. Moreover, the intuition behind using CSK is that humans innately interpolate visual or textual information with associated latent knowledge for analysis and understanding. Hence we believe that leveraging CSK in addition to textual and visual information would bring results closer to human users' preferences. In order to use such background knowledge, curating a CSK knowledge base is of primary importance. Since automatic acquisition of canonicalized CSK from the web can be costly, we conjecture that noisy subject-predicate-object (SPO) triples extracted through Open Information Extraction (Banko et al., 2007) may be used as CSK. We hypothesize that the combination of the noisy ingredients – CSK, object classes, and textual descriptions – would create an ensemble effect providing for efficient search and retrieval. We describe the components of our architecture in the following sections.

3.1 Data, Knowledge and Features

We consider a document $x$ from a collection $X$ with two kinds of features:

• Visual features $x_{vj}$: labels of object classes recognized in the image, including their hypernyms (e.g., king cobra, cobra, snake).

• Textual features $x_{xj}$: words that occur in the text that accompanies the image, for example the image caption.

We assume that the two kinds of features can be combined into a single feature vector $x = \langle x_1 \dots x_M \rangle$, with hyper-parameters $\alpha_v$ and $\alpha_x$ to weight visual vs. textual features.

CSK is denoted by a set $Y$ of triples $y_k$, each with components $y_{sk}$, $y_{pk}$, $y_{ok}$ ($s$ - subject, $p$ - predicate, $o$ - object). Each component consists of one or more words. This yields a feature vector $y_{kj}$ $(j = 1..M)$ for the triple $y_k$.
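To make this representation concrete, the following minimal sketch (Python; the class and field names are our own illustrative choices, not from the paper) stores a document's textual and visual unigrams together with their weights, and a CSK triple as a weighted bag of words:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Document:
    """A web document: an image plus its accompanying text."""
    # textual unigrams x_xj -> weight (e.g., idf-based informativeness)
    text_features: Dict[str, float] = field(default_factory=dict)
    # visual unigrams x_vj (object labels + hypernyms) -> detector confidence
    visual_features: Dict[str, float] = field(default_factory=dict)

@dataclass
class CSKTriple:
    """A commonsense triple y_k = (subject, predicate, object)."""
    s: str
    p: str
    o: str
    confidence: float = 1.0

    def unigrams(self) -> Dict[str, float]:
        # every word of the triple becomes a feature of y_k; words in the
        # s and o slots can later be boosted (Section 3.3)
        return {w: self.confidence
                for part in (self.s, self.p, self.o) for w in part.split()}

# a document resembling Figure 1a, plus one bridging triple
doc = Document(text_features={"tourist": 1.2, "road": 0.8},
               visual_features={"car": 0.9, "bus": 0.7, "person": 0.95})
triple = CSKTriple("backpacks", "are carried by", "tourists")
```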

3.2 Language Models for Ranking

We study a variety of query-likelihood language models (LM) for ranking documents $x$ with regard to a given query $q$. We assume that a query is simply a set of keywords $q_i$ $(i = 1..L)$. In the following we formulate equations for unigram LMs, which can simply be extended to bigram LMs by using word pairs instead of single words.

Basic LM:

$$P_{basic}[q|x] = \prod_i P[q_i|x] \quad (1)$$

where we set the weight of word $q_i$ in $x$ as follows:

$$P[q_i|x] = \alpha_x P[q_i|x_{xj}] P[x_{xj}|x] + \alpha_v P[q_i|x_{vj}] P[x_{vj}|x] \quad (2)$$

Here, $x_{xj}$ and $x_{vj}$ are unigrams in the textual and visual components of a document; $\alpha_x$ and $\alpha_v$ are hyper-parameters that weight the textual and visual features, respectively.
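A minimal sketch of Equations 1 and 2 follows, assuming the weighted-unigram document representation sketched above; since the paper does not spell out how $P[q_i|x_{xj}] P[x_{xj}|x]$ is estimated, the normalized-weight estimator and the $\alpha$ values below are assumptions:

```python
from typing import Dict, List

def p_word_given_side(q_i: str, side: Dict[str, float]) -> float:
    """Normalized weight of q_i in one side (textual or visual) of a document.
    This stands in for P[q_i|x_.j] * P[x_.j|x]; the exact estimator is an
    assumption, not taken from the paper."""
    total = sum(side.values())
    return side.get(q_i, 0.0) / total if total > 0 else 0.0

def p_basic(query: List[str], text: Dict[str, float], visual: Dict[str, float],
            alpha_x: float = 0.6, alpha_v: float = 0.4) -> float:
    """Basic LM (Eq. 1-2): product over query terms of the weighted mixture
    of textual and visual unigram probabilities."""
    score = 1.0
    for q_i in query:
        score *= (alpha_x * p_word_given_side(q_i, text)
                  + alpha_v * p_word_given_side(q_i, visual))
    return score
```

Without smoothing, a single query term that appears in neither the caption nor the detected labels drives the whole product to zero, which motivates the next two models.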

Smoothed LM:

$$P_{smoothed}[q|x] = \alpha P_{basic}[q|x] + (1 - \alpha) P[q|B] \quad (3)$$

where $B$ is a background corpus model and $P[q|B] = \prod_i P[q_i|B]$. We use Flickr tags from the YFCC100M dataset (Thomee et al., 2015), along with their frequencies of occurrence, as a background corpus.
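A sketch of Equation 3, assuming the background model $B$ is given as a dictionary of Flickr tag counts; the zero-floor and the value of $\alpha$ are our own placeholders:

```python
from typing import Dict, List

def p_background(query: List[str], tag_counts: Dict[str, int]) -> float:
    """P[q|B]: product of relative tag frequencies in the background corpus,
    with a small floor so unseen tags do not zero out the product."""
    total = sum(tag_counts.values())
    prob = 1.0
    for q_i in query:
        prob *= max(tag_counts.get(q_i, 0) / total, 1e-9)
    return prob

def p_smoothed(query: List[str], basic_score: float,
               tag_counts: Dict[str, int], alpha: float = 0.8) -> float:
    """Smoothed LM (Eq. 3): interpolate the basic document score with the
    background model."""
    return alpha * basic_score + (1 - alpha) * p_background(query, tag_counts)
```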

Commonsense-aware LM (a translation LM):

$$P_{CS}[q|x] = \prod_i \left[ \frac{\sum_k P[q_i|y_k] \, P[y_k|x]}{|k|} \right] \quad (4)$$


The summation ranges over all $y_k$ that can bridge the query vocabulary with the image-feature vocabulary; so both of the probabilities $P[q_i|y_k]$ and $P[y_k|x]$ must be non-zero. For example, when the query asks for "electric car" and an image has features "vehicle" (visual) and "energy saving" (textual), triples such as (car, is a type of, vehicle) and (electric engine, saves, energy) would have this property. That is, we consider only commonsense triples that overlap with both the query and the image features.

The probabilities $P[q_i|y_k]$ and $P[y_k|x]$ are estimated based on the word-wise overlap between $q_i$ and $y_k$, and between $y_k$ and $x$, respectively. They also take into account the confidence of the words in $y_k$ and $x$.
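The sketch below instantiates Equation 4 under two simplifying assumptions: $P[q_i|y_k]$ and $P[y_k|x]$ are plain word-overlap fractions (the paper additionally weights words by confidence and position), and $|k|$ is read as the number of bridging triples for the query term:

```python
from typing import Dict, List

def overlap(word: str, bag: Dict[str, float]) -> float:
    """Share of the bag's weight mass carried by `word` (a crude stand-in
    for the paper's overlap-based probability estimates)."""
    total = sum(bag.values())
    return bag.get(word, 0.0) / total if total > 0 else 0.0

def p_cs(query: List[str], doc_features: Dict[str, float],
         triples: List[Dict[str, float]]) -> float:
    """Commonsense-aware LM (Eq. 4): for each query term, average the
    bridging strength over triples that overlap BOTH the query term and
    the document's (textual + visual) features."""
    score = 1.0
    for q_i in query:
        bridging = []
        for y_k in triples:                   # each triple as a weighted word bag
            p_q_given_y = overlap(q_i, y_k)   # P[q_i | y_k]
            p_y_given_x = sum(overlap(w, doc_features) for w in y_k) / len(y_k)
            if p_q_given_y > 0 and p_y_given_x > 0:
                bridging.append(p_q_given_y * p_y_given_x)
        # no bridging triple: near-zero contribution for this query term
        score *= sum(bridging) / len(bridging) if bridging else 1e-9
    return score
```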

Mixture LM (the final ranking LM):
Since a document $x$ can capture a query term or its commonsense expansion, we formulate a mixture model for the ranking of a document with respect to a query:

$$P[q|x] = \beta_{CS} P_{CS}[q|x] + (1 - \beta_{CS}) P_{smoothed}[q|x] \quad (5)$$

where $\beta_{CS}$ is a hyper-parameter weighting the commonsense features of the expanded query.
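Equation 5 then only combines the two scores; a small sketch of the final ranking step, with $\beta_{CS}$ set to an arbitrary placeholder value:

```python
from typing import List, Tuple

def mixture_score(cs_score: float, smoothed_score: float,
                  beta_cs: float = 0.5) -> float:
    """Final query likelihood (Eq. 5); beta_cs is manually chosen."""
    return beta_cs * cs_score + (1 - beta_cs) * smoothed_score

def rank(scored_docs: List[Tuple[str, float, float]]) -> List[str]:
    """Given (doc_id, P_CS[q|x], P_smoothed[q|x]) per document, return the
    doc ids ordered by the mixture score, best first."""
    return [doc_id for doc_id, cs, sm in
            sorted(scored_docs, key=lambda t: mixture_score(t[1], t[2]),
                   reverse=True)]
```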

3.3 Feature Weights

By casting all features into word-level unigrams, we obtain a unified feature space with hyper-parameters ($\alpha_x$, $\alpha_v$, and $\beta_{CS}$). For this submission the hyper-parameters are chosen manually.

For the weight of a visual object class $x_{vj}$ of document $x$, we take the confidence score from LSDA (Hoffman et al., 2014). We extend these object classes with their hypernyms from WordNet, which are set to the same confidence as their detected hyponyms. Although not in common parlance, this kind of expansion can also be considered CSK. We define the weight of a textual unigram $x_{xj}$ as its informativeness – the inverse document frequency with respect to a background corpus (Flickr tags with frequencies).
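The hypernym expansion of detected object classes can be illustrated with NLTK's WordNet interface (the paper does not name its tooling, so this is only an assumed realization); each detected label propagates its detector confidence to all hypernyms on its WordNet paths:

```python
# requires: pip install nltk, then nltk.download('wordnet') once
from nltk.corpus import wordnet as wn

def expand_with_hypernyms(label: str, confidence: float) -> dict:
    """Map a detected object label to itself plus all WordNet hypernyms,
    each carrying the same confidence score (Section 3.3)."""
    key = label.replace(" ", "_")
    expanded = {key.lower(): confidence}
    for synset in wn.synsets(key, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for hypernym in path:
                for lemma in hypernym.lemma_names():
                    expanded.setdefault(lemma.lower(), confidence)
    return expanded

# expand_with_hypernyms("king cobra", 0.8) also yields "cobra", "snake",
# "reptile", ... all with confidence 0.8
```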

The words in a CSK triple $y_k$ have non-uniform weights proportional to their similarity with the query words, their idf with respect to a background corpus, and the salience of their position – boosting the weight of words in the $s$ and $o$ components of $y_k$. The function computing similarity between two unigrams favors exact matches over partial matches.
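The paper states only the preference ordering (exact matches over partial ones) and the three weighting factors; the following sketch is one possible instantiation, with the partial-match discount and the s/o boost as assumed placeholder values:

```python
from difflib import SequenceMatcher
from typing import Dict, List

def unigram_similarity(a: str, b: str) -> float:
    """1.0 for an exact match, otherwise a down-weighted character-overlap
    ratio; the 0.5 discount is an assumption."""
    if a == b:
        return 1.0
    return 0.5 * SequenceMatcher(None, a, b).ratio()

def triple_word_weight(word: str, slot: str, query: List[str],
                       idf: Dict[str, float], s_o_boost: float = 1.5) -> float:
    """Weight of a word inside a CSK triple (Section 3.3): similarity to the
    query, idf against the background corpus, and a boost for the subject
    and object slots."""
    sim = max(unigram_similarity(word, q) for q in query)
    boost = s_o_boost if slot in ("s", "o") else 1.0
    return sim * idf.get(word, 1.0) * boost
```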

3.4 Example

Query string: travel with backpack
Commonsense triples to expand the query:
t1: (tourists, use, travel maps)
t2: (tourists, carry, backpacks)
t3: (backpack, is a type of, bag)

Say we have a document $x$ with features:
Textual - "A tourist reading a map by the road."
Visual - person, bag, bottle, bus

The query will now successfully retrieve the above document, whereas it would have been missed by text-only systems.

4 Datasets

For the purpose of demonstration we choose a topical domain – Tourism. Our CSK knowledge base and image dataset obey this constraint.

CSK acquisition through OpenIE: We consider a slice of Wikipedia pertaining to the domain of tourism as the text corpus from which to extract CSK. Nouns from the Wikipedia article titled 'Tourism' (the seed document) constitute our basic language model. We collect articles by traversing the Wiki Category hierarchy tree while pruning out those with substantial topic drift. The Jaccard distance (Equation 6) of a document from the seed document is used as the metric for pruning.

$$JaccardDistance = 1 - WeightedJaccardSimilarity \quad (6)$$

where

$$WeightedJaccardSimilarity = \frac{\sum_n \min[f(d_i, w_n), f(D, w_n)]}{\sum_n \max[f(d_i, w_n), f(D, w_n)]} \quad (7)$$

In Equation 7, acquired Wikipedia articles $d_i$ are compared to the seed document $D$; $f(d', w)$ is the frequency of occurrence of word $w$ in document $d'$. For simplicity, only articles with a Jaccard distance of 1 from the seed document are pruned out. The corpus of domain-specific pages thus collected constitutes ~5000 Wikipedia articles.
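A direct sketch of Equations 6 and 7 over word-frequency counts (using Python's Counter; the toy documents are made up for illustration):

```python
from collections import Counter

def weighted_jaccard_similarity(doc: Counter, seed: Counter) -> float:
    """Eq. 7: sum of per-word minimum frequencies over sum of maxima,
    comparing a candidate Wikipedia article to the seed document."""
    vocab = set(doc) | set(seed)
    num = sum(min(doc[w], seed[w]) for w in vocab)
    den = sum(max(doc[w], seed[w]) for w in vocab)
    return num / den if den else 0.0

def jaccard_distance(doc: Counter, seed: Counter) -> float:
    """Eq. 6: articles at distance 1 (no shared vocabulary) are pruned."""
    return 1.0 - weighted_jaccard_similarity(doc, seed)

seed = Counter("tourism travel tourist attraction travel".split())
off_topic = Counter("quantum chromodynamics lattice".split())
assert jaccard_distance(off_topic, seed) == 1.0   # would be pruned
```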


Table 1: Query benchmark for evaluation

aircraft international    diesel transport
airport vehicle           dog park
backpack travel           fish market
ball park                 housing town
bench high                lamp home
bicycle road              old clock
bicycle trip              road signal
bird park                 table home
boat tour                 tourist bus
bridge road               van road

The OpenIE tool ReVerb (Fader et al., 2011), run against our corpus, produces around 1 million noisy SPO triples. After filtering with our basic language model we retain ~22,000 moderately clean assertions.

Image Dataset: For our experiments we construct our own image dataset. ~50,000 images with descriptions are collected from the following datasets: Flickr30k (Young et al., 2014), Pascal Sentences (Rashtchian et al., 2010), the SBU Captioned Photo Dataset (Ordonez et al., 2011), and MSCOCO (Lin et al., 2014). The images are collected by comparing their textual descriptions with our basic language model for Tourism. An existing object detection algorithm – LSDA (Hoffman et al., 2014) – is used for object detection in the images. The detected object classes are based on the 7000 leaf nodes of ImageNet (Deng et al., 2009). We also expand these classes by adding their super-classes, or hypernyms, with the same confidence score.

Query Benchmark: We construct a benchmark of 20 queries from co-occurring Flickr tags in the YFCC100M dataset (Thomee et al., 2015). This benchmark is shown in Table 1. Each query consists of two keywords that have appeared together with high frequency as user tags on Flickr images.

5 Experiments

Baseline: Google search results on our image dataset form the baseline for the evaluation of Know2Look. We consider the results in two settings – search only on the original image captions (Vanilla Google), and search on image captions along with detected object classes (Extended Google). The latter is done to aid Google in its search by providing additional visual cues.

Table 2: Comparison of Know2Look with baselines (Average Precision@10)

Vanilla Google     0.47
Extended Google    0.64
Know2Look          0.85

We exploit the domain restriction facility of Google search (query string: site:domain name) to restrict Google search results to our dataset.

Know2Look: In addition to the setup for Extended Google, Know2Look also performs query expansion with CSK. In most cases we win over the baselines since CSK captures additional concepts related to the query terms, enhancing latent information that may be present in the images. We consider the top 10 retrieval results of the two baselines and Know2Look for the 20 queries in our query benchmark¹. We compare the three systems by Precision@10. Table 2 shows the Precision@10 values averaged over the 20 queries for each of the three systems – Know2Look performs better than the baselines.
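For reference, the evaluation metric reduces to the following (the variable names are ours):

```python
from typing import Dict, List, Set

def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents judged relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k

def average_precision_at_k(runs: Dict[str, List[str]],
                           judgments: Dict[str, Set[str]], k: int = 10) -> float:
    """Precision@k averaged over the query benchmark, as in Table 2."""
    return sum(precision_at_k(ranked, judgments[q], k)
               for q, ranked in runs.items()) / len(runs)
```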

6 Conclusion

In this paper we propose the incorporation of commonsense knowledge into image retrieval. Our architecture, Know2Look, expands queries with related commonsense knowledge and retrieves images based on their visual and textual contents. By utilizing the visual and commonsense modalities we make search results more appealing to human users than traditional text-only approaches. We support our claim by comparing Know2Look to Google search on our image dataset. The proposed concept can easily be extrapolated to general document retrieval. Moreover, in addition to using noisy OpenIE triples as commonsense knowledge, we aim to leverage existing commonsense knowledge bases in future evaluations of Know2Look.

Acknowledgment: We would like to thank Anna Rohrbach for her assistance with visual object detection on our image dataset using LSDA. We also thank Ali Shah for his help with the visualization of the evaluation results.

¹ http://mpi-inf.mpg.de/~sreyasi/queries/evaluation.html


References

Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction for the web. In IJCAI, volume 7, pages 2670–2676.

Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. 2013. NEIL: Extracting visual knowledge from web data. In Proceedings of the IEEE International Conference on Computer Vision, pages 1409–1416.

Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z Wang. 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 40(2):5.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535–1545. Association for Computational Linguistics.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision–ECCV 2010, pages 15–29. Springer.

Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. 2014. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems, pages 3536–3544.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

Douglas B Lenat. 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

Xiao Lin and Devi Parikh. 2015. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2984–2993.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer.

Hugo Liu and Push Singh. 2004. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.

Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma. 2007. A survey of content-based image retrieval with high-level semantics. Pattern Recognition, 40(1):262–282.

Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682–1690.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756. Association for Computational Linguistics.

Alexander Mordvintsev, Christopher Olah, and Mike Tyka. 2015. Inceptionism: Going deeper into neural networks. Google Research Blog.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS).

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147. Association for Computational Linguistics.

Fereshteh Sadeghi, Santosh K Divvala, and Ali Farhadi. 2015. VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1456–1464. IEEE.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Push Singh, Thomas Lin, Erik T Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In On the Move to Meaningful Internet Systems 2002: CoopIS, DOA, and ODBASE, pages 1223–1237. Springer.

Niket Tandon, Gerard de Melo, Fabian Suchanek, and Gerhard Weikum. 2014. WebChild: Harvesting and organizing commonsense knowledge from the web. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 523–532. ACM.

Niket Tandon, Gerard de Melo, Abir De, and Gerhard Weikum. 2015. Knowlywood: Mining activity knowledge from Hollywood narratives. In Proc. CIKM.

Niket Tandon, Charles Hariman, Jacopo Urbani, Anna Rohrbach, Marcus Rohrbach, and Gerhard Weikum. 2016. Commonsense in parts: Mining part-whole relations from the web and image tags. In AAAI.

Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2015. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817.

Ramakrishna Vedantam, Xiao Lin, Tanmay Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Learning common sense through visual abstraction. In Proceedings of the IEEE International Conference on Computer Vision, pages 2542–2550.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.

Josiah K Wang, Fei Yan, Ahmet Aker, and Robert Gaizauskas. 2014. A poodle or a dog? Evaluating automatic image annotation using human descriptions at different levels of granularity. V&L Net 2014, page 38.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

ChengXiang Zhai. 2008. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies, 1(1):1–141.
