Lecture Notes in Computer Science: Computer Vision – ECCV 2010, Volume 6314
-
Lecture Notes in Computer Science 6314
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison    Lancaster University, UK
Takeo Kanade    Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler    University of Surrey, Guildford, UK
Jon M. Kleinberg    Cornell University, Ithaca, NY, USA
Alfred Kobsa    University of California, Irvine, CA, USA
Friedemann Mattern    ETH Zurich, Switzerland
John C. Mitchell    Stanford University, CA, USA
Moni Naor    Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz    University of Bern, Switzerland
C. Pandu Rangan    Indian Institute of Technology, Madras, India
Bernhard Steffen    TU Dortmund University, Germany
Madhu Sudan    Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos    University of California, Los Angeles, CA, USA
Doug Tygar    University of California, Berkeley, CA, USA
Gerhard Weikum    Max Planck Institute for Informatics, Saarbruecken, Germany
-
Kostas Daniilidis · Petros Maragos · Nikos Paragios (Eds.)
Computer Vision – ECCV 2010
11th European Conference on Computer Vision
Heraklion, Crete, Greece, September 5-11, 2010
Proceedings, Part IV
-
Volume Editors
Kostas Daniilidis
GRASP Laboratory, University of Pennsylvania
3330 Walnut Street, Philadelphia, PA 19104, USA
E-mail: [email protected]

Petros Maragos
National Technical University of Athens
School of Electrical and Computer Engineering
15773 Athens, Greece
E-mail: [email protected]

Nikos Paragios
Ecole Centrale de Paris
Department of Applied Mathematics
Grande Voie des Vignes, 92295 Chatenay-Malabry, France
E-mail: [email protected]
Library of Congress Control Number: 2010933243
CR Subject Classification (1998): I.2.10, I.3, I.5, I.4, F.2.2, I.3.5
LNCS Sublibrary: SL 6 Image Processing, Computer Vision, Pattern Recognition,and Graphics
ISSN 0302-9743
ISBN-10 3-642-15560-X Springer Berlin Heidelberg New York
ISBN-13 978-3-642-15560-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180
-
Preface
The 2010 edition of the European Conference on Computer Vision was held in Heraklion, Crete. The call for papers attracted an absolute record of 1,174 submissions. We describe here the selection of the accepted papers:
Thirty-eight Area Chairs were selected, coming from Europe (18), the USA and Canada (16), and Asia (4). Their selection was based on the following criteria: (1) researchers who had served at least two times as Area Chairs within the past two years at major vision conferences were excluded; (2) researchers who served as Area Chairs at the 2010 Computer Vision and Pattern Recognition conference were also excluded (exception: the ECCV 2012 Program Chairs); (3) overlap introduced by Area Chairs being former students and advisors was minimized; (4) 20% of the Area Chairs had never served before at a major conference; (5) the Area Chair selection process made all possible efforts to achieve a reasonable distribution across countries, thematic areas, and trends in computer vision.
Each Area Chair was assigned between 28 and 32 papers by the Program Chairs. Based on paper content, the Area Chair recommended up to seven potential reviewers per paper. This assignment was made using all reviewers in the database, including conflicting ones; the Program Chairs manually entered the missing conflict domains of approximately 300 reviewers. Based on the recommendations of the Area Chairs, three reviewers were selected per paper (with at least one from the top three suggestions), and 99.7% of the assignments followed the Area Chairs' recommendations. When this was not possible, senior reviewers were assigned to these papers by the Program Chairs, with the consent of the Area Chairs. Upon completion of this process there were 653 active reviewers in the system.
Each reviewer had a maximum load of eight reviews; in a few cases a reviewer received nine papers when re-assignments were made manually because of hidden conflicts. When the review deadline passed, 38 reviews were missing. The Program Chairs proceeded with fast re-assignment of these papers to senior reviewers. Prior to the deadline of submitting the rebuttal by
the authors, all papers had three reviews. The distribution of the reviews was the following: 100 papers with an average score of weak accept or higher, 125 papers with an average score leaning toward weak accept, and 425 papers with an average score around borderline.
For papers with strong consensus among the reviewers, we introduced a procedure to handle a potential override of that recommendation by the Area Chair. In particular, for all papers with an average score of weak accept or higher, or of weak reject or lower, the Area Chair had to seek an additional review prior to the Area Chair meeting. The decision on such a paper could be changed only if the additional reviewer supported the recommendation of the Area Chair, and the Area Chair was able to convince his/her group of Area Chairs of that decision.
The discussion phase between the Area Chair and the reviewers was initiated once the reviews became available. The Area Chairs had to disclose their identity to the reviewers. The discussion remained open until the Area Chair meeting, which was held in Paris on June 5-6. Each Area Chair was paired with a buddy, and the decisions for all papers were made jointly, or, when needed, using the opinions of other Area Chairs. The pairing was done considering conflicts, thematic proximity, and, when possible, geographic diversity. The Area Chairs were responsible for the decisions on their papers. Prior to the Area Chair meeting, 92% of the consolidation reports and decision suggestions had been entered by the Area Chairs. These recommendations were used as the basis for the final decisions.
Orals were discussed in groups of Area Chairs. Four groups were formed, such that no participating Area Chair had a conflict with the papers under discussion. The Area Chair recommending a paper had to present it to the whole group and explain why the contribution deserved publication as an oral. In most cases consensus was reached within the group; where the Area Chairs' views differed, the decision was taken by majority opinion.
The final outcome of the Area Chair meeting was 38 papers accepted for oral presentation and 284 as posters. The ratios of submissions and acceptances per area are the following:
Thematic area                            # submitted  % over submitted  # accepted  % over accepted  % acceptance in area
Object and Scene Recognition                 192          16.4%             66          20.3%              34.4%
Segmentation and Grouping                    129          11.0%             28           8.6%              21.7%
Face, Gesture, Biometrics                    125          10.6%             32           9.8%              25.6%
Motion and Tracking                          119          10.1%             27           8.3%              22.7%
Statistical Models and Visual Learning       101           8.6%             30           9.2%              29.7%
Matching, Registration, Alignment             90           7.7%             21           6.5%              23.3%
Computational Imaging                         74           6.3%             24           7.4%              32.4%
Multi-view Geometry                           67           5.7%             24           7.4%              35.8%
Image Features                                66           5.6%             17           5.2%              25.8%
Video and Event Characterization              62           5.3%             14           4.3%              22.6%
Shape Representation and Recognition          48           4.1%             19           5.8%              39.6%
Stereo                                        38           3.2%              4           1.2%              10.5%
Reflectance, Illumination, Color              37           3.2%             14           4.3%              37.8%
Medical Image Analysis                        26           2.2%              5           1.5%              19.2%
We received 14 complaints/reconsideration requests. All of them were sent to the Area Chairs who handled the papers. Based on the reviewers' arguments and the reaction of the Area Chairs, three papers were accepted as posters on top of the 322 accepted at the Area Chair meeting, bringing the total number of accepted papers to 325, or 27.6%. The selection rate for the 38 orals was 3.2%. The acceptance rate for papers submitted by the group of Area Chairs was 39%.
Award nominations were proposed by the Area and Program Chairs based on the reviews and the consolidation reports. An external award committee was formed, comprising David Fleet, Luc Van Gool, Bernt Schiele, Alan Yuille, and Ramin Zabih. Additional reviews were considered for the nominated papers, and the decisions on the paper awards were made by the award committee. We thank the Area Chairs, Reviewers, Award Committee members, and General Chairs for their hard work, and we gratefully acknowledge Microsoft Research for accommodating the ECCV needs by generously providing the CMT Conference Management Toolkit. We hope you enjoy the proceedings.
September 2010 Kostas Daniilidis Petros Maragos Nikos Paragios
-
Organization
General Chairs
Argyros, Antonis    University of Crete/FORTH, Greece
Trahanias, Panos    University of Crete/FORTH, Greece
Tziritas, George    University of Crete, Greece
Program Chairs
Daniilidis, Kostas    University of Pennsylvania, USA
Maragos, Petros    National Technical University of Athens, Greece
Paragios, Nikos    Ecole Centrale de Paris/INRIA Saclay, Ile-de-France, France
Workshops Chair
Kutulakos, Kyros University of Toronto, Canada
Tutorials Chair
Lourakis, Manolis FORTH, Greece
Demonstrations Chair
Kakadiaris, Ioannis University of Houston, USA
Industrial Chair
Pavlidis, Ioannis University of Houston, USA
Travel Grants Chair
Komodakis, Nikos University of Crete, Greece
Area Chairs
Bach, Francis    INRIA Paris - Rocquencourt, France
Belongie, Serge    University of California, San Diego, USA
Bischof, Horst    Graz University of Technology, Austria
Black, Michael    Brown University, USA
Boyer, Edmond    INRIA Grenoble - Rhone-Alpes, France
Cootes, Tim    University of Manchester, UK
Dana, Kristin    Rutgers University, USA
Davis, Larry    University of Maryland, USA
Efros, Alyosha    Carnegie Mellon University, USA
Fermuller, Cornelia    University of Maryland, USA
Fitzgibbon, Andrew    Microsoft Research, Cambridge, UK
Jepson, Alan    University of Toronto, Canada
Kahl, Fredrik    Lund University, Sweden
Keriven, Renaud    Ecole des Ponts-ParisTech, France
Kimmel, Ron    Technion - Israel Institute of Technology, Israel
Kolmogorov, Vladimir    University College London, UK
Lepetit, Vincent    Ecole Polytechnique Federale de Lausanne, Switzerland
Matas, Jiri    Czech Technical University, Prague, Czech Republic
Metaxas, Dimitris    Rutgers University, USA
Navab, Nassir    Technical University of Munich, Germany
Nister, David    Microsoft Research, Redmond, USA
Perez, Patrick    THOMSON Research, France
Perona, Pietro    California Institute of Technology, USA
Ramesh, Visvanathan    Siemens Corporate Research, USA
Raskar, Ramesh    Massachusetts Institute of Technology, USA
Samaras, Dimitris    State University of New York, Stony Brook, USA
Sato, Yoichi    University of Tokyo, Japan
Schmid, Cordelia    INRIA Grenoble - Rhone-Alpes, France
Schnoerr, Christoph    University of Heidelberg, Germany
Sebe, Nicu    University of Trento, Italy
Szeliski, Richard    Microsoft Research, Redmond, USA
Taskar, Ben    University of Pennsylvania, USA
Torr, Phil    Oxford Brookes University, UK
Torralba, Antonio    Massachusetts Institute of Technology, USA
Tuytelaars, Tinne    Katholieke Universiteit Leuven, Belgium
Weickert, Joachim    Saarland University, Germany
Weinshall, Daphna    Hebrew University of Jerusalem, Israel
Weiss, Yair    Hebrew University of Jerusalem, Israel
Conference Board
Horst Bischof    Graz University of Technology, Austria
Hans Burkhardt    University of Freiburg, Germany
Bernard Buxton    University College London, UK
Roberto Cipolla    University of Cambridge, UK
Jan-Olof Eklundh    Royal Institute of Technology, Sweden
Olivier Faugeras    INRIA, Sophia Antipolis, France
David Forsyth    University of Illinois, USA
Anders Heyden    Lund University, Sweden
Ales Leonardis    University of Ljubljana, Slovenia
Bernd Neumann    University of Hamburg, Germany
Mads Nielsen    IT University of Copenhagen, Denmark
Tomas Pajdla    CTU Prague, Czech Republic
Jean Ponce    Ecole Normale Superieure, France
Giulio Sandini    University of Genoa, Italy
Philip Torr    Oxford Brookes University, UK
David Vernon    Trinity College, Ireland
Andrew Zisserman    University of Oxford, UK
Reviewers
Abd-Almageed, Wael; Agapito, Lourdes; Agarwal, Sameer; Aggarwal, Gaurav; Ahlberg, Juergen; Ahonen, Timo; Ai, Haizhou; Alahari, Karteek; Aleman-Flores, Miguel; Aloimonos, Yiannis; Amberg, Brian; Andreetto, Marco; Angelopoulou, Elli; Ansar, Adnan; Arbel, Tal; Arbelaez, Pablo; Astroem, Kalle; Athitsos, Vassilis; August, Jonas; Avraham, Tamar; Azzabou, Noura; Babenko, Boris; Bagdanov, Andrew;
Bahlmann, Claus; Baker, Simon; Ballan, Luca; Barbu, Adrian; Barnes, Nick; Barreto, Joao; Bartlett, Marian; Bartoli, Adrien; Batra, Dhruv; Baust, Maximilian; Beardsley, Paul; Behera, Ardhendu; Beleznai, Csaba; Ben-ezra, Moshe; Berg, Alexander; Berg, Tamara; Betke, Margrit; Bileschi, Stan; Birchfield, Stan; Biswas, Soma; Blanz, Volker; Blaschko, Matthew; Bobick, Aaron;
Bougleux, Sebastien; Boult, Terrance; Boureau, Y-Lan; Bowden, Richard; Boykov, Yuri; Bradski, Gary; Bregler, Christoph; Bremond, Francois; Bronstein, Alex; Bronstein, Michael; Brown, Matthew; Brown, Michael; Brox, Thomas; Brubaker, Marcus; Bruckstein, Freddy; Bruhn, Andres; Buisson, Olivier; Burkhardt, Hans; Burschka, Darius; Caetano, Tiberio; Cai, Deng; Calway, Andrew; Cappelli, Raffaele;
Caputo, Barbara; Carreira-Perpinan, Miguel; Caselles, Vincent; Cavallaro, Andrea; Cham, Tat-Jen; Chandraker, Manmohan; Chandran, Sharat; Chetverikov, Dmitry; Chiu, Han-Pang; Cho, Taeg Sang; Chuang, Yung-Yu; Chung, Albert C. S.; Chung, Moo; Clark, James; Cohen, Isaac; Collins, Robert; Colombo, Carlo; Cord, Matthieu; Corso, Jason; Costen, Nicholas; Cour, Timothee; Crandall, David; Cremers, Daniel; Criminisi, Antonio; Crowley, James; Cui, Jinshi; Cula, Oana; Dalalyan, Arnak; Darbon, Jerome; Davis, James; Davison, Andrew; de Bruijne, Marleen; De la Torre, Fernando; Dedeoglu, Goksel; Delong, Andrew; Demirci, Stefanie; Demirdjian, David; Denzler, Joachim; Deselaers, Thomas; Dhome, Michel; Dick, Anthony; Dickinson, Sven; Divakaran, Ajay; Dollar, Piotr;
Domke, Justin; Donoser, Michael; Doretto, Gianfranco; Douze, Matthijs; Draper, Bruce; Drbohlav, Ondrej; Duan, Qi; Duchenne, Olivier; Duric, Zoran; Duygulu-Sahin, Pinar; Eklundh, Jan-Olof; Elder, James; Elgammal, Ahmed; Epshtein, Boris; Eriksson, Anders; Espuny, Ferran; Essa, Irfan; Farhadi, Ali; Farrell, Ryan; Favaro, Paolo; Fehr, Janis; Fei-Fei, Li; Felsberg, Michael; Ferencz, Andras; Fergus, Rob; Feris, Rogerio; Ferrari, Vittorio; Ferryman, James; Fidler, Sanja; Finlayson, Graham; Fisher, Robert; Flach, Boris; Fleet, David; Fletcher, Tom; Florack, Luc; Flynn, Patrick; Foerstner, Wolfgang; Foroosh, Hassan; Forssen, Per-Erik; Fowlkes, Charless; Frahm, Jan-Michael; Fraundorfer, Friedrich; Freeman, William; Frey, Brendan; Fritz, Mario;
Fua, Pascal; Fuchs, Martin; Furukawa, Yasutaka; Fusiello, Andrea; Gall, Juergen; Gallagher, Andrew; Gao, Xiang; Gatica-Perez, Daniel; Gee, James; Gehler, Peter; Genc, Yakup; Georgescu, Bogdan; Geusebroek, Jan-Mark; Gevers, Theo; Geyer, Christopher; Ghosh, Abhijeet; Glocker, Ben; Goecke, Roland; Goedeme, Toon; Goldberger, Jacob; Goldenstein, Siome; Goldluecke, Bastian; Gomes, Ryan; Gong, Sean; Gorelick, Lena; Gould, Stephen; Grabner, Helmut; Grady, Leo; Grau, Oliver; Grauman, Kristen; Gross, Ralph; Grossmann, Etienne; Gruber, Amit; Gulshan, Varun; Guo, Guodong; Gupta, Abhinav; Gupta, Mohit; Habbecke, Martin; Hager, Gregory; Hamid, Raffay; Han, Bohyung; Han, Tony; Hanbury, Allan; Hancock, Edwin; Hasinoff, Samuel;
Hassner, Tal; Haussecker, Horst; Hays, James; He, Xuming; Heas, Patrick; Hebert, Martial; Heibel, T. Hauke; Heidrich, Wolfgang; Hernandez, Carlos; Hilton, Adrian; Hinterstoisser, Stefan; Hlavac, Vaclav; Hoiem, Derek; Hoogs, Anthony; Hornegger, Joachim; Hua, Gang; Huang, Rui; Huang, Xiaolei; Huber, Daniel; Hudelot, Celine; Hussein, Mohamed; Huttenlocher, Dan; Ihler, Alex; Ilic, Slobodan; Irschara, Arnold; Ishikawa, Hiroshi; Isler, Volkan; Jain, Prateek; Jain, Viren; Shotton, Jamie; Jegou, Herve; Jenatton, Rodolphe; Jermyn, Ian; Ji, Hui; Ji, Qiang; Jia, Jiaya; Jin, Hailin; Jogan, Matjaz; Johnson, Micah; Joshi, Neel; Juan, Olivier; Jurie, Frederic; Kakadiaris, Ioannis; Kale, Amit;
Kamarainen, Joni-Kristian; Kamberov, George; Kamberova, Gerda; Kambhamettu, Chandra; Kanatani, Kenichi; Kanaujia, Atul; Kang, Sing Bing; Kappes, Jorg; Kavukcuoglu, Koray; Kawakami, Rei; Ke, Qifa; Kemelmacher, Ira; Khamene, Ali; Khan, Saad; Kikinis, Ron; Kim, Seon Joo; Kimia, Benjamin; Kittler, Josef; Koch, Reinhard; Koeser, Kevin; Kohli, Pushmeet; Kokiopoulou, E.; Kokkinos, Iasonas; Kolev, Kalin; Komodakis, Nikos; Konolige, Kurt; Koschan, Andreas; Kukelova, Zuzana; Kulis, Brian; Kumar, M. Pawan; Kumar, Sanjiv; Kuthirummal, Sujit; Kutulakos, Kyros; Kweon, In So; Ladicky, Lubor; Lai, Shang-Hong; Lalonde, Jean-Francois; Lampert, Christoph; Landon, George; Langer, Michael; Langs, Georg; Lanman, Douglas; Laptev, Ivan;
Larlus, Diane; Latecki, Longin Jan; Lazebnik, Svetlana; Lee, ChanSu; Lee, Honglak; Lee, Kyoung Mu; Lee, Sang-Wook; Leibe, Bastian; Leichter, Ido; Leistner, Christian; Lellmann, Jan; Lempitsky, Victor; Lenzen, Frank; Leonardis, Ales; Leung, Thomas; Levin, Anat; Li, Chunming; Li, Gang; Li, Hongdong; Li, Hongsheng; Li, Li-Jia; Li, Rui; Li, Ruonan; Li, Stan; Li, Yi; Li, Yunpeng; Liefeng, Bo; Lim, Jongwoo; Lin, Stephen; Lin, Zhe; Ling, Haibin; Little, Jim; Liu, Ce; Liu, Jingen; Liu, Qingshan; Liu, Tyng-Luh; Liu, Xiaoming; Liu, Yanxi; Liu, Yazhou; Liu, Zicheng; Lourakis, Manolis; Lovell, Brian; Lu, Le; Lucey, Simon;
Luo, Jiebo; Lyu, Siwei; Ma, Xiaoxu; Mairal, Julien; Maire, Michael; Maji, Subhransu; Maki, Atsuto; Makris, Dimitrios; Malisiewicz, Tomasz; Mallick, Satya; Manduchi, Roberto; Manmatha, R.; Marchand, Eric; Marcialis, Gian; Marks, Tim; Marszalek, Marcin; Martinec, Daniel; Martinez, Aleix; Matei, Bogdan; Mateus, Diana; Matsushita, Yasuyuki; Matthews, Iain; Maxwell, Bruce; Maybank, Stephen; Mayer, Helmut; McCloskey, Scott; McKenna, Stephen; Medioni, Gerard; Meer, Peter; Mei, Christopher; Michael, Nicholas; Micusik, Branislav; Minh, Nguyen; Mirmehdi, Majid; Mittal, Anurag; Miyazaki, Daisuke; Monasse, Pascal; Mordohai, Philippos; Moreno-Noguer, Francesc; Mori, Greg; Morimoto, Carlos; Morse, Bryan; Moses, Yael; Mueller, Henning;
Mukaigawa, Yasuhiro; Mulligan, Jane; Munich, Mario; Murino, Vittorio; Namboodiri, Vinay; Narasimhan, Srinivasa; Narayanan, P.J.; Naroditsky, Oleg; Neumann, Jan; Nevatia, Ram; Nicolls, Fred; Niebles, Juan Carlos; Nielsen, Mads; Nishino, Ko; Nixon, Mark; Nowozin, Sebastian; O'Donnell, Thomas; Obozinski, Guillaume; Odobez, Jean-Marc; Odone, Francesca; Ofek, Eyal; Ogale, Abhijit; Okabe, Takahiro; Okatani, Takayuki; Okuma, Kenji; Olson, Clark; Olsson, Carl; Ommer, Bjorn; Osadchy, Margarita; Overgaard, Niels Christian; Ozuysal, Mustafa; Pajdla, Tomas; Panagopoulos, Alexandros; Pandharkar, Rohit; Pankanti, Sharath; Pantic, Maja; Papadopoulo, Theo; Parameswaran, Vasu; Parikh, Devi; Paris, Sylvain; Patow, Gustavo; Patras, Ioannis; Pavlovic, Vladimir;
Peleg, Shmuel; Perera, A.G. Amitha; Perronnin, Florent; Petrou, Maria; Petrovic, Vladimir; Peursum, Patrick; Philbin, James; Piater, Justus; Pietikainen, Matti; Pinz, Axel; Pless, Robert; Pock, Thomas; Poh, Norman; Pollefeys, Marc; Ponce, Jean; Pons, Jean-Philippe; Potetz, Brian; Prabhakar, Salil; Qian, Gang; Quattoni, Ariadna; Radeva, Petia; Radke, Richard; Rakotomamonjy, Alain; Ramanan, Deva; Ramanathan, Narayanan; Ranzato, Marc'Aurelio; Raviv, Dan; Reid, Ian; Reitmayr, Gerhard; Ren, Xiaofeng; Rittscher, Jens; Rogez, Gregory; Rosales, Romer; Rosenberg, Charles; Rosenhahn, Bodo; Rosman, Guy; Ross, Arun; Roth, Peter; Rother, Carsten; Rothganger, Fred; Rougon, Nicolas; Roy, Sebastien; Rueckert, Daniel; Ruether, Matthias; Russell, Bryan;
Russell, Christopher; Sahbi, Hichem; Stiefelhagen, Rainer; Saad, Ali; Saffari, Amir; Salgian, Garbis; Salzmann, Mathieu; Sangineto, Enver; Sankaranarayanan, Aswin; Sapiro, Guillermo; Sara, Radim; Sato, Imari; Savarese, Silvio; Savchynskyy, Bogdan; Sawhney, Harpreet; Scharr, Hanno; Scharstein, Daniel; Schellewald, Christian; Schiele, Bernt; Schindler, Grant; Schindler, Konrad; Schlesinger, Dmitrij; Schoenemann, Thomas; Schroff, Florian; Schubert, Falk; Schultz, Thomas; Se, Stephen; Seidel, Hans-Peter; Serre, Thomas; Shah, Mubarak; Shakhnarovich, Gregory; Shan, Ying; Shashua, Amnon; Shechtman, Eli; Sheikh, Yaser; Shekhovtsov, Alexander; Shet, Vinay; Shi, Jianbo; Shimshoni, Ilan; Shokoufandeh, Ali; Sigal, Leonid; Simon, Loic; Singaraju, Dheeraj; Singh, Maneesh;
Singh, Vikas; Sinha, Sudipta; Sivic, Josef; Slabaugh, Greg; Smeulders, Arnold; Sminchisescu, Cristian; Smith, Kevin; Smith, William; Snavely, Noah; Snoek, Cees; Soatto, Stefano; Sochen, Nir; Sochman, Jan; Sofka, Michal; Sorokin, Alexander; Southall, Ben; Souvenir, Richard; Srivastava, Anuj; Stauffer, Chris; Stein, Gideon; Strecha, Christoph; Sugimoto, Akihiro; Sullivan, Josephine; Sun, Deqing; Sun, Jian; Sun, Min; Sunkavalli, Kalyan; Suter, David; Svoboda, Tomas; Syeda-Mahmood, Tanveer; Susstrunk, Sabine; Tai, Yu-Wing; Takamatsu, Jun; Talbot, Hugues; Tan, Ping; Tan, Robby; Tanaka, Masayuki; Tao, Dacheng; Tappen, Marshall; Taylor, Camillo; Theobalt, Christian; Thonnat, Monique; Tieu, Kinh; Tistarelli, Massimo;
Todorovic, Sinisa; Toreyin, Behcet Ugur; Torresani, Lorenzo; Torsello, Andrea; Toshev, Alexander; Trucco, Emanuele; Tschumperle, David; Tsin, Yanghai; Tu, Peter; Tung, Tony; Turek, Matt; Turk, Matthew; Tuzel, Oncel; Tyagi, Ambrish; Urschler, Martin; Urtasun, Raquel; Van de Weijer, Joost; van Gemert, Jan; van den Hengel, Anton; Vasilescu, M. Alex O.; Vedaldi, Andrea; Veeraraghavan, Ashok; Veksler, Olga; Verbeek, Jakob; Vese, Luminita; Vitaladevuni, Shiv; Vogiatzis, George; Vogler, Christian; Wachinger, Christian; Wada, Toshikazu; Wagner, Daniel; Wang, Chaohui; Wang, Hanzi; Wang, Hongcheng; Wang, Jue; Wang, Kai; Wang, Song; Wang, Xiaogang; Wang, Yang; Weese, Juergen; Wei, Yichen; Wein, Wolfgang; Welinder, Peter; Werner, Tomas; Westin, Carl-Fredrik;
Wilburn, Bennett; Wildes, Richard; Williams, Oliver; Wills, Josh; Wilson, Kevin; Wojek, Christian; Wolf, Lior; Wright, John; Wu, Tai-Pang; Wu, Ying; Xiao, Jiangjian; Xiao, Jianxiong; Xiao, Jing; Yagi, Yasushi; Yan, Shuicheng; Yang, Fei; Yang, Jie; Yang, Ming-Hsuan; Yang, Peng; Yang, Qingxiong; Yang, Ruigang; Ye, Jieping; Yeung, Dit-Yan; Yezzi, Anthony; Yilmaz, Alper; Yin, Lijun; Yoon, Kuk Jin; Yu, Jingyi; Yu, Kai; Yu, Qian; Yu, Stella; Yuille, Alan; Zach, Christopher; Harchaoui, Zaid; Zelnik-Manor, Lihi; Zeng, Gang; Zhang, Cha; Zhang, Li; Zhang, Sheng; Zhang, Weiwei; Zhang, Wenchao; Zhao, Wenyi; Zheng, Yuanjie; Zhou, Jinghao; Zhou, Kevin; Zhu, Leo; Zhu, Song-Chun; Zhu, Ying; Zickler, Todd; Zikic, Darko; Zisserman, Andrew; Zitnick, Larry; Zivny, Stanislav; Zuffi, Silvia
Sponsoring Institutions
Platinum Sponsor
Gold Sponsors
Silver Sponsors
-
Table of Contents Part IV
Spotlights and Posters W1
Kernel Sparse Representation for Image Classification and Face Recognition ....... 1
    Shenghua Gao, Ivor Wai-Hung Tsang, and Liang-Tien Chia

Every Picture Tells a Story: Generating Sentences from Images ....... 15
    Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth

An Eye Fixation Database for Saliency Detection in Images ....... 30
    Subramanian Ramanathan, Harish Katti, Nicu Sebe, Mohan Kankanhalli, and Tat-Seng Chua

Face Image Relighting Using Locally Constrained Global Optimization ....... 44
    Jiansheng Chen, Guangda Su, Jinping He, and Shenglan Ben

Correlation-Based Intrinsic Image Extraction from a Single Image ....... 58
    Xiaoyue Jiang, Andrew J. Schofield, and Jeremy L. Wyatt

ADICT: Accurate Direct and Inverse Color Transformation ....... 72
    Behzad Sajadi, Maxim Lazarov, and Aditi Majumder

Real-Time Specular Highlight Removal Using Bilateral Filtering ....... 87
    Qingxiong Yang, Shengnan Wang, and Narendra Ahuja

Learning Artistic Lighting Template from Portrait Photographs ....... 101
    Xin Jin, Mingtian Zhao, Xiaowu Chen, Qinping Zhao, and Song-Chun Zhu

Photometric Stereo from Maximum Feasible Lambertian Reflections ....... 115
    Chanki Yu, Yongduek Seo, and Sang Wook Lee

Part-Based Feature Synthesis for Human Detection ....... 127
    Aharon Bar-Hillel, Dan Levi, Eyal Krupka, and Chen Goldberg

Improving the Fisher Kernel for Large-Scale Image Classification ....... 143
    Florent Perronnin, Jorge Sanchez, and Thomas Mensink

Max-Margin Dictionary Learning for Multiclass Image Categorization ....... 157
    Xiao-Chen Lian, Zhiwei Li, Bao-Liang Lu, and Lei Zhang
Towards Optimal Naive Bayes Nearest Neighbor ....... 171
    Regis Behmo, Paul Marcombes, Arnak Dalalyan, and Veronique Prinet

Weakly Supervised Classification of Objects in Images Using Soft Random Forests ....... 185
    Riwal Lefort, Ronan Fablet, and Jean-Marc Boucher

Learning What and How of Contextual Models for Scene Labeling ....... 199
    Arpit Jain, Abhinav Gupta, and Larry S. Davis

Adapting Visual Category Models to New Domains ....... 213
    Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell

Improved Human Parsing with a Full Relational Model ....... 227
    Duan Tran and David Forsyth

Multiresolution Models for Object Detection ....... 241
    Dennis Park, Deva Ramanan, and Charless Fowlkes

Accurate Image Localization Based on Google Maps Street View ....... 255
    Amir Roshan Zamir and Mubarak Shah

A Minimal Case Solution to the Calibrated Relative Pose Problem for the Case of Two Known Orientation Angles ....... 269
    Friedrich Fraundorfer, Petri Tanskanen, and Marc Pollefeys

Bilinear Factorization via Augmented Lagrange Multipliers ....... 283
    Alessio Del Bue, Joao Xavier, Lourdes Agapito, and Marco Paladini

Piecewise Quadratic Reconstruction of Non-Rigid Surfaces from Monocular Sequences ....... 297
    Joao Fayad, Lourdes Agapito, and Alessio Del Bue

Extrinsic Camera Calibration Using Multiple Reflections ....... 311
    Joel A. Hesch, Anastasios I. Mourikis, and Stergios I. Roumeliotis

Probabilistic Deformable Surface Tracking from Multiple Videos ....... 326
    Cedric Cagniart, Edmond Boyer, and Slobodan Ilic

Theory of Optimal View Interpolation with Depth Inaccuracy ....... 340
    Keita Takahashi

Practical Methods for Convex Multi-view Reconstruction ....... 354
    Christopher Zach and Marc Pollefeys

Building Rome on a Cloudless Day ....... 368
    Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, and Marc Pollefeys
Camera Pose Estimation Using Images of Planar Mirror Reflections ....... 382
    Rui Rodrigues, Joao P. Barreto, and Urbano Nunes

Element-Wise Factorization for N-View Projective Reconstruction ....... 396
    Yuchao Dai, Hongdong Li, and Mingyi He

Learning Relations among Movie Characters: A Social Network Perspective ....... 410
    Lei Ding and Alper Yilmaz

Scene and Object Recognition

What, Where and How Many? Combining Object Detectors and CRFs ....... 424
    Lubor Ladicky, Paul Sturgess, Karteek Alahari, Chris Russell, and Philip H.S. Torr

Visual Recognition with Humans in the Loop ....... 438
    Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie

Localizing Objects While Learning Their Appearance ....... 452
    Thomas Deselaers, Bogdan Alexe, and Vittorio Ferrari

Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes ....... 467
    Christian Wojek, Stefan Roth, Konrad Schindler, and Bernt Schiele

Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics ....... 482
    Abhinav Gupta, Alexei A. Efros, and Martial Hebert

Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding ....... 497
    Huayan Wang, Stephen Gould, and Daphne Koller

Spotlights and Posters W2

Visual Tracking Using a Pixelwise Spatiotemporal Oriented Energy Representation ....... 511
    Kevin J. Cannons, Jacob M. Gryn, and Richard P. Wildes

A Globally Optimal Approach for 3D Elastic Motion Estimation from Stereo Sequences ....... 525
    Qifan Wang, Linmi Tao, and Huijun Di

Occlusion Boundary Detection Using Pseudo-depth ....... 539
    Xuming He and Alan Yuille
Multiple Target Tracking in World Coordinate with Single, Minimally Calibrated Camera ....... 553
    Wongun Choi and Silvio Savarese

Joint Estimation of Motion, Structure and Geometry from Stereo Sequences ....... 568
    Levi Valgaerts, Andres Bruhn, Henning Zimmer, Joachim Weickert, Carsten Stoll, and Christian Theobalt

Dense, Robust, and Accurate Motion Field Estimation from Stereo Image Sequences in Real-Time ....... 582
    Clemens Rabe, Thomas Muller, Andreas Wedel, and Uwe Franke

Estimation of 3D Object Structure, Motion and Rotation Based on 4D Affine Optical Flow Using a Multi-camera Array ....... 596
    Tobias Schuchert and Hanno Scharr

Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces ....... 610
    Carl Vondrick, Deva Ramanan, and Donald Patterson

Robust and Fast Collaborative Tracking with Two Stage Sparse Optimization ....... 624
    Baiyang Liu, Lin Yang, Junzhou Huang, Peter Meer, Leiguang Gong, and Casimir Kulikowski

Nonlocal Multiscale Hierarchical Decomposition on Graphs ....... 638
    Moncef Hidane, Olivier Lezoray, Vinh-Thong Ta, and Abderrahim Elmoataz

Adaptive Regularization for Image Segmentation Using Local Image Curvature Cues ....... 651
    Josna Rao, Rafeef Abugharbieh, and Ghassan Hamarneh

A Static SMC Sampler on Shapes for the Automated Segmentation of Aortic Calcifications ....... 666
    Kersten Petersen, Mads Nielsen, and Sami S. Brandt

Fast Dynamic Texture Detection ....... 680
    V. Javier Traver, Majid Mirmehdi, Xianghua Xie, and Raul Montoliu

Finding Semantic Structures in Image Hierarchies Using Laplacian Graph Energy ....... 694
    Yi-Zhe Song, Pablo Arbelaez, Peter Hall, Chuan Li, and Anupriya Balikai

Semantic Segmentation of Urban Scenes Using Dense Depth Maps ....... 708
    Chenxi Zhang, Liang Wang, and Ruigang Yang
Tensor Sparse Coding for Region Covariances . . . . . . . . . . . . . . . . . . . . . . . . 722Ravishankar Sivalingam, Daniel Boley, Vassilios Morellas, andNikolaos Papanikolopoulos
Improving Local Descriptors by Embedding Global and Local SpatialInformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
Tatsuya Harada, Hideki Nakayama, and Yasuo Kuniyoshi
Detecting Faint Curved Edges in Noisy Images . . . . . . . . . . 750
Sharon Alpert, Meirav Galun, Boaz Nadler, and Ronen Basri
Spatial Statistics of Visual Keypoints for Texture Recognition . . . . . . . . . . 764
Huu-Giao Nguyen, Ronan Fablet, and Jean-Marc Boucher
BRIEF: Binary Robust Independent Elementary Features . . . . . . . . . . 778
Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua
Multi-label Feature Transform for Image Classifications . . . . . . . . . . 793
Hua Wang, Heng Huang, and Chris Ding
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807
Kernel Sparse Representation for Image Classification and Face Recognition
Shenghua Gao, Ivor Wai-Hung Tsang, and Liang-Tien Chia
School of Computer Engineering, Nanyang Technological University, Singapore
{gaos0004,IvorTsang,asltchia}@ntu.edu.sg
Abstract. Recent research has shown the effectiveness of using sparse coding (Sc) to solve many computer vision problems. Motivated by the fact that the kernel trick can capture the nonlinear similarity of features, which may reduce the feature quantization error and boost the sparse coding performance, we propose Kernel Sparse Representation (KSR). KSR is essentially the sparse coding technique in a high dimensional feature space mapped by an implicit mapping function. We apply KSR to both image classification and face recognition. By incorporating KSR into Spatial Pyramid Matching (SPM), we propose KSRSPM for image classification. KSRSPM can further reduce the information loss in the feature quantization step compared with Spatial Pyramid Matching using Sparse Coding (ScSPM). KSRSPM can be regarded both as a generalization of Efficient Match Kernel (EMK) and as an extension of ScSPM. Compared with sparse coding, KSR can learn more discriminative sparse codes for face recognition. Extensive experimental results show that KSR outperforms sparse coding and EMK, and achieves state-of-the-art performance for image classification and face recognition on publicly available datasets.
1 Introduction
The sparse coding technique is attracting more and more researchers' attention in computer vision due to its state-of-the-art performance in many applications, such as image annotation [25], image restoration [20], and image classification [28]. It aims at selecting the fewest possible basis vectors from a large basis pool to linearly recover a given signal under a small reconstruction error constraint. Therefore, sparse coding can be easily applied to feature quantization in Bag-of-Words (BoW) model based image representation. Moreover, under the assumption that a test face image can be reconstructed from the images of the same category, sparse coding can also be used in face recognition [26].
The BoW model [23] is widely used in computer vision [27,21] due to its concise representation and robustness to scale and rotation variance. Generally, it contains three modules: (i) region selection and representation; (ii) codebook generation and feature quantization; (iii) frequency histogram based image representation. Among these three modules, codebook generation and feature quantization are the most important for image representation. The codebook is
K. Daniilidis, P. Maragos, N. Paragios (Eds.): ECCV 2010, Part IV, LNCS 6314, pp. 1-14, 2010. (c) Springer-Verlag Berlin Heidelberg 2010
a collection of basic patterns used to reconstruct the local features. Each basic pattern is known as a visual word. Usually k-means is adopted to generate the codebook, and each local feature is quantized to its nearest visual word. However, such a hard assignment method may cause severe information loss [3,6], especially for those features located at the boundary of several visual words. To minimize such errors, soft assignment [21,6] was introduced by assigning each feature to more than one visual word. However, choosing the parameters, including the weight assigned to each visual word and the number of visual words to be assigned, is not trivial.
Recently, Yang et al. [28] proposed using sparse coding in the codebook generation and feature quantization module. Sparse coding can learn a better codebook that further reduces the quantization error compared with k-means. Meanwhile, the weights assigned to each visual word are learnt concurrently. By applying sparse coding to Spatial Pyramid Matching [13] (referred to as ScSPM), their method achieves state-of-the-art performance in image classification.
Another application of sparse coding is face recognition. Face recognition is a classic problem in computer vision, and has great potential in many real world applications. It generally contains two stages: (i) feature extraction; and (ii) classifier construction and label prediction. Usually Nearest Neighbor (NN) [5] and Nearest Subspace (NS) [11] are used. However, NN predicts the label of the test image by only using its nearest neighbor in the training data, so it can easily be affected by noise. NS approximates the test image by using all the images belonging to the same category, and assigns the image to the category which minimizes the reconstruction error. But NS may not work well when classes are highly correlated with each other [26]. To overcome these problems, Wright et al. proposed a sparse coding based face recognition framework [26], which automatically selects images from the training set to approximate the test image. Their method is robust to occlusion, illumination and noise, and achieves excellent performance.
Existing work based on sparse coding only seeks the sparse representation of the given signal in the original signal space. Recall that the kernel trick [22] maps non-linearly separable features into a high dimensional feature space, in which features of the same type are more easily grouped together and linearly separable. In this case we may find the sparse representation of a signal more easily, and the reconstruction error may be reduced as well. Motivated by this, we propose Kernel Sparse Representation (KSR), which is sparse coding in the mapped high dimensional feature space.
The contributions of this paper can be summarized as follows: (i) We propose the idea of kernel sparse representation, which is sparse coding in a high dimensional feature space. Experiments show that KSR greatly reduces the feature reconstruction error. (ii) We propose KSRSPM for image classification. KSRSPM can be regarded as a generalized EMK, which can evaluate the similarity between local features accurately. Compared with EMK, our KSRSPM is more robust by using the quantized feature rather than the approximated high dimensional feature. (iii) We extend KSR to face recognition. KSR can achieve more discriminative sparse codes compared with sparse coding, which boosts the performance of face recognition.
The rest of this paper is organized as follows: In Section 2, we describe the details of KSR, including its objective function and its implementation. By incorporating KSR into the SPM framework, we propose KSRSPM in Section 3; we also detail the relationship between our KSRSPM and EMK, and report image classification performance on several publicly available datasets at the end of that section. In Section 4, we use KSR for face recognition, and compare sparse coding and KSR on the Extended Yale B Face Dataset. Finally, we conclude our work in Section 5.
2 Kernel Sparse Representation and Implementation
2.1 Kernel Sparse Representation
General sparse coding aims at finding the sparse representation of a signal under a given basis U (U \in \mathbb{R}^{d \times k}) while minimizing the reconstruction error. This is equal to solving the following objective:

\min_{U,v} \; \|x - Uv\|^2 + \lambda \|v\|_1 \quad \text{subject to: } \|u_m\|^2 \le 1    (1)

where U = [u_1, u_2, \ldots, u_k]. The first term of Equation (1) is the reconstruction error, and the second term is used to control the sparsity of the sparse code v. Empirically, a larger \lambda corresponds to a sparser solution.
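To make Equation (1) concrete, the sketch below solves for the sparse code v with the codebook fixed, using ISTA (iterative soft-thresholding). This is a minimal stand-in for the feature-sign search solver [14] used later in the paper; the dictionary, the signal and the value of lambda are all illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_ista(x, U, lam, n_iter=500):
    """Minimize ||x - U v||^2 + lam * ||v||_1 over v (Eq. (1) with U fixed)."""
    step = 1.0 / (2.0 * np.linalg.norm(U, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    v = np.zeros(U.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * U.T @ (U @ v - x)               # gradient of the quadratic term
        v = soft_threshold(v - step * grad, step * lam)
    return v

# illustrative data: 64 unit-norm atoms in R^16, a signal built from two of them
rng = np.random.default_rng(0)
U = rng.normal(size=(16, 64))
U /= np.linalg.norm(U, axis=0)                       # enforce ||u_m|| <= 1
x = 1.5 * U[:, 3] + 0.5 * U[:, 10]
v = sparse_code_ista(x, U, lam=0.1)
print(np.count_nonzero(np.abs(v) > 1e-3))            # only a handful of atoms should be active
```

With the codebook U also treated as a variable, real implementations alternate this coding step with a dictionary update, as the paper does in Section 2.2.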
Suppose there exists a feature mapping function \phi: \mathbb{R}^d \to \mathbb{R}^K (d < K). It maps the feature and the basis vectors to the high dimensional feature space: x \to \phi(x), U = [u_1, u_2, \ldots, u_k] \to \tilde{U} = [\phi(u_1), \phi(u_2), \ldots, \phi(u_k)]. Substituting the mapped features and basis into the formulation of sparse coding, we arrive at kernel sparse representation (KSR):

\min_{\tilde{U},v} \; \|\phi(x) - \tilde{U}v\|^2 + \lambda \|v\|_1    (2)

where \tilde{U} = [\phi(u_1), \phi(u_2), \ldots, \phi(u_k)]. In our work, we use the Gaussian kernel due to its excellent performance in many works [22,2]: \kappa(x_1, x_2) = \exp(-\gamma\|x_1 - x_2\|^2). Note that \phi(u_i)^T \phi(u_i) = \kappa(u_i, u_i) = \exp(-\gamma\|u_i - u_i\|^2) = 1, so we can remove the constraint on u_i. Kernel sparse representation seeks the sparse representation of a mapped feature under the mapped basis in the high dimensional space.
2.2 Implementation
The objective of Equation (2) is not convex. Following the work of [28,14], we optimize the sparse code v and the codebook U alternately.
Learning the Sparse Codes in the New Feature Space. When the codebook \tilde{U} is fixed, the objective in Equation (2) can be rewritten as:

\min_{v} \; \|\phi(x) - \tilde{U}v\|^2 + \lambda\|v\|_1 = \kappa(x,x) + v^T K_{UU} v - 2 v^T K_{U}(x) + \lambda\|v\|_1 = L(v) + \lambda\|v\|_1    (3)

where L(v) = 1 + v^T K_{UU} v - 2 v^T K_{U}(x), K_{UU} is a k x k matrix with \{K_{UU}\}_{ij} = \kappa(u_i, u_j), and K_{U}(x) is a k x 1 vector with \{K_{U}(x)\}_i = \kappa(u_i, x). The objective is the same as that of sparse coding except for the definition of K_{UU} and K_{U}(x), so we can easily extend the Feature-Sign Search Algorithm [14] to solve for the sparse codes. As for the computational cost, the two are the same except for the difference in calculating the kernel matrix.
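Since Equation (3) is an l1-regularized quadratic in v once K_{UU} and K_{U}(x) are formed, any standard l1 solver applies. The sketch below uses ISTA rather than the extended feature-sign search of the paper; the Gaussian kernel and the values of gamma and lambda are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """kappa(a, b) = exp(-gamma * ||a - b||^2) for all column pairs of A and B."""
    d2 = (np.sum(A ** 2, axis=0)[:, None] + np.sum(B ** 2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_sparse_code(x, U, gamma, lam, n_iter=500):
    """Minimize L(v) + lam * ||v||_1, with L(v) = 1 + v^T K_UU v - 2 v^T K_U(x)
    as in Eq. (3), via ISTA."""
    K_UU = gaussian_kernel(U, U, gamma)                   # k x k kernel matrix
    k_x = gaussian_kernel(U, x[:, None], gamma).ravel()   # k x 1 vector K_U(x)
    step = 1.0 / (2.0 * np.linalg.norm(K_UU, 2))
    v = np.zeros(U.shape[1])
    for _ in range(n_iter):
        z = v - step * 2.0 * (K_UU @ v - k_x)             # gradient step on L(v)
        v = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    err = 1.0 + v @ K_UU @ v - 2.0 * v @ k_x              # ||phi(x) - U~ v||^2 via the kernel
    return v, err

# illustrative data: a feature lying near one of 32 codewords
rng = np.random.default_rng(1)
U = rng.normal(size=(8, 32))
x = U[:, 5] + 0.01 * rng.normal(size=8)
v, err = kernel_sparse_code(x, U, gamma=0.5, lam=0.05)
```

Note that the reconstruction error in the mapped space is itself computed through kernel evaluations only; phi is never formed explicitly.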
Learning the Codebook. When v is fixed, we learn the codebook U. Due to the large number of features, it is hard to use all of them to learn the codebook. Following the work [28,2], we randomly sample some features to learn the codebook U, then use the learnt U to sparsely encode all the features. Suppose we randomly sample N features; then we rewrite the objective as follows (m, s, t index the columns of the codebook):

f(U) = \frac{1}{N} \sum_{i=1}^{N} \left[ \|\phi(x_i) - \tilde{U}v_i\|^2 + \lambda\|v_i\|_1 \right]
     = \frac{1}{N} \sum_{i=1}^{N} \left[ 1 + \sum_{s=1}^{k}\sum_{t=1}^{k} v_{i,s} v_{i,t} \kappa(u_s, u_t) - 2\sum_{s=1}^{k} v_{i,s} \kappa(u_s, x_i) + \lambda\|v_i\|_1 \right]    (4)
Since U appears inside the kernel (\kappa(u_i, \cdot)), it is very challenging to adopt commonly used methods, such as Stochastic Gradient Descent [2], to find the optimal codebook. Instead, we optimize each column of U alternately. The derivative of f(U) with respect to u_m (the column to be updated) is:

\frac{\partial f}{\partial u_m} = \frac{4\gamma}{N} \sum_{i=1}^{N} \left[ \sum_{t=1}^{k} v_{i,m} v_{i,t} \kappa(u_m, u_t)(u_m - u_t) - v_{i,m} \kappa(u_m, x_i)(u_m - x_i) \right]    (5)
To find the optimal u_m, we set \frac{\partial f}{\partial u_m} = 0. However, it is not easy to solve this equation due to the terms involving \kappa(u_m, \cdot). As a compromise, we use an approximate solution in place of the exact solution. Similar to the fixed point algorithm [12], in the n-th u_m updating iteration we use the result of u_m from the (n-1)-th iteration to compute the part inside the kernel function. Denoting the u_m in the n-th updating process as u_{m,n}, the equation with respect to u_{m,n} becomes:

\frac{\partial f}{\partial u_{m,n}} = \frac{4\gamma}{N} \sum_{i=1}^{N} \left[ \sum_{t=1}^{k} v_{i,m} v_{i,t} \kappa(u_{m,n-1}, u_t)(u_{m,n} - u_t) - v_{i,m} \kappa(u_{m,n-1}, x_i)(u_{m,n} - x_i) \right] = 0
When all the remaining columns are fixed, this becomes a linear equation in u_{m,n} and can be solved easily. Following the work [2], the codebook is initialized with the results of k-means.
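One update of this approximate fixed-point scheme can be written in closed form: with the kernel values frozen at the previous iterate of u_m, the stationarity condition is linear, and the new u_m is a weighted combination of the codewords and the sampled features. A minimal sketch (dimensions, gamma and the toy data are illustrative):

```python
import numpy as np

def update_codeword(m, U, X, V, gamma):
    """One approximate fixed-point update of codeword u_m.

    Freezing the kernel values at the previous u_m makes the condition
    'derivative in Eq. (5) equals zero' linear in the new u_m:
        (sum_t a_t - sum_i b_i) u_m = sum_t a_t u_t - sum_i b_i x_i.
    U: d x k codebook, X: d x N sampled features, V: k x N sparse codes.
    """
    u_old = U[:, m]
    k_ut = np.exp(-gamma * np.sum((U - u_old[:, None]) ** 2, axis=0))  # kappa(u_m, u_t)
    k_xi = np.exp(-gamma * np.sum((X - u_old[:, None]) ** 2, axis=0))  # kappa(u_m, x_i)
    a = (V @ V[m]) * k_ut    # a_t = kappa(u_m, u_t) * sum_i v_{i,m} v_{i,t}
    b = V[m] * k_xi          # b_i = kappa(u_m, x_i) * v_{i,m}
    # note: if a.sum() is close to b.sum() this system is ill-conditioned
    return (U @ a - X @ b) / (a.sum() - b.sum())

# toy check on random data
rng = np.random.default_rng(5)
U = rng.normal(size=(5, 6))
X = rng.normal(size=(5, 40))
V = rng.uniform(0.0, 1.0, size=(6, 40))
u_new = update_codeword(2, U, X, V, gamma=0.1)
```

In a full dictionary update this step would be swept over all columns m and repeated until the codebook stabilizes.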
3 Application I: Kernel Sparse Representation for Image Classification
In this section, we apply kernel sparse representation in the SPM framework, and propose KSRSPM. On the one hand, KSRSPM is an extension of ScSPM [28], replacing sparse coding with KSR. On the other hand, KSRSPM can be regarded as a generalization of Efficient Match Kernel (EMK) [2].
3.1 Sparse Coding for Codebook Generation
k-means clustering is usually used to generate the codebook in the BoW model. In k-means, the whole local feature space X = [x_1, x_2, \ldots, x_N] (where x_i \in \mathbb{R}^{d \times 1}) is split into k clusters S = [S_1, S_2, \ldots, S_k]. Denote the corresponding cluster centers as U = [u_1, u_2, \ldots, u_k] \in \mathbb{R}^{d \times k}. In hard assignment, each feature is assigned only to its nearest cluster center, and the weight the feature contributes to that center is 1. The objective of k-means can be formulated as the following optimization problem:

\min_{U,S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - u_i\|^2 = \min_{U,V} \sum_{i=1}^{N} \|x_i - U v_i\|^2, \quad \text{subject to: } Card(v_i) = 1, |v_i| = 1, v_i \ge 0, \forall i    (6)

Here V is a cluster index matrix, V = [v_1, v_2, \ldots, v_N] (where v_i \in \mathbb{R}^{k \times 1}). Each column of V indicates which visual word the corresponding local feature is assigned to. To reduce the information loss in feature quantization, the cardinality constraint on v_i is relaxed. Meanwhile, to avoid each feature being assigned to too many clusters, a sparsity constraint is imposed on v_i. We then arrive at the optimization problem of sparse coding:

\min_{U,V} \sum_{i=1}^{N} \|x_i - U v_i\|^2 + \lambda\|v_i\|_1, \quad \text{subject to: } \|u_j\| \le 1, \; j = 1, \ldots, k    (7)
3.2 Maximum Feature Pooling and Spatial Pyramid Matching Based Image Representation
Following the work of [28,4], we use the maximum pooling method to represent images. Maximum pooling uses the largest response to each basic pattern to represent a region. More specifically, suppose one image region has D local features, and the codebook size is k. After maximum pooling, the region is represented by a k dimensional vector y, whose l-th entry is the largest response to the l-th basis vector over all the sparse codes in the selected region (v_D is the sparse code of the D-th feature in this local region, and v_{Dl} is the l-th entry of v_D):

y_l = \max\{|v_{1l}|, |v_{2l}|, \ldots, |v_{Dl}|\}    (8)

The SPM technique is also used to preserve spatial information. The whole image is divided into increasingly finer regions, and maximum pooling is used in each subregion.
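A small sketch of Eq. (8) combined with a three-level spatial pyramid follows; the cell-assignment scheme, level layout and array shapes are illustrative, not the authors' exact implementation.

```python
import numpy as np

def max_pool(codes):
    """Eq. (8): y_l = max_d |v_{dl}| over the D sparse codes of one region.
    codes: (D, k) array of sparse codes."""
    return np.abs(codes).max(axis=0)

def spm_pool(codes, positions, levels=(1, 2, 4)):
    """Max-pool over a spatial pyramid: at each level the image is split into
    l x l cells and the per-cell pooled vectors are concatenated."""
    feats = []
    for l in levels:
        # cell index of each feature; positions are normalized to [0, 1)
        cx = np.minimum((positions[:, 0] * l).astype(int), l - 1)
        cy = np.minimum((positions[:, 1] * l).astype(int), l - 1)
        for i in range(l):
            for j in range(l):
                mask = (cx == i) & (cy == j)
                cell = codes[mask] if mask.any() else np.zeros((1, codes.shape[1]))
                feats.append(max_pool(cell))
    return np.concatenate(feats)

# illustrative data: 100 sparse codes over a 16-word codebook
rng = np.random.default_rng(2)
codes = rng.normal(size=(100, 16)) * (rng.random((100, 16)) < 0.1)
pos = rng.random((100, 2))
y = spm_pool(codes, pos)   # (1 + 4 + 16) cells x 16 dims
```

The level-1 block of y is the global max-pooled vector; finer levels append localized responses, which is what preserves the spatial layout.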
3.3 KSRSPM: A Generalization of Efficient Match Kernel
Besides being interpreted as an extension of ScSPM [28], KSRSPM can also be interpreted as a generalization of Efficient Match Kernel (EMK) [2]. Let X = [x_1, x_2, \ldots, x_p] be the set of local features in one image, and V(X) = [v_1(x), v_2(x), \ldots, v_p(x)] the corresponding cluster index vectors from Equation (6). In the BoW model, each image is represented by a normalized histogram \bar{v}(X) = \frac{1}{|X|}\sum_{x \in X} v(x), which characterizes its visual word distribution. Using a linear classifier, the resulting kernel function is:

K_B(X,Y) = \frac{1}{|X||Y|} \sum_{x \in X} \sum_{y \in Y} v(x)^T v(y) = \frac{1}{|X||Y|} \sum_{x \in X} \sum_{y \in Y} \delta(x, y)    (9)

where

\delta(x,y) = 1 if v(x) = v(y), and 0 otherwise.    (10)
\delta(x,y) is a positive definite kernel, which is used to measure the similarity between two local features. However, such a hard assignment based similarity measure increases the information loss and reduces classification accuracy. Thus a continuous kernel is introduced to measure the similarity between local features x and y more accurately:
K_S(X,Y) = \frac{1}{|X||Y|} \sum_{x \in X} \sum_{y \in Y} k(x, y)    (11)
Here k(x, y) is a positive definite kernel, referred to as the local kernel. This is related to the normalized sum match kernel [19,9].
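For reference, the naive evaluation of K_S with a Gaussian local kernel looks as follows; its O(|X||Y|) cost per image pair is exactly what EMK is designed to avoid. The value of gamma and the feature shapes are illustrative.

```python
import numpy as np

def sum_match_kernel(X, Y, gamma=0.5):
    """Normalized sum match kernel of Eq. (11):
    K_S(X, Y) = 1/(|X||Y|) * sum_{x in X} sum_{y in Y} k(x, y),
    with Gaussian local kernel k(x, y) = exp(-gamma * ||x - y||^2).
    X, Y: (num_features, dim) arrays of local features from two images."""
    d2 = (np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return float(np.exp(-gamma * np.maximum(d2, 0.0)).mean())

# illustrative data: local feature sets of two unrelated images
rng = np.random.default_rng(3)
A = rng.normal(size=(50, 8))
B = rng.normal(size=(60, 8))
k_aa = sum_match_kernel(A, A)
k_ab = sum_match_kernel(A, B)
```

Each image pair requires a full |X| x |Y| kernel matrix here, which is why EMK instead approximates phi(x) by a k-dimensional code once per feature.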
Due to the large number of local features, directly using the local kernel is prohibitive in both storage and computation for image classification. To decrease the computational cost, Efficient Match Kernel (EMK) was introduced. Under the definition of a finite dimensional kernel function [2], k(x, y) = \phi(x)^T \phi(y), we can approximate \phi(x) by a low dimensional feature v_x in the space spanned by k basis vectors H = [\phi(u_1), \phi(u_2), \ldots, \phi(u_k)]:

\min_{H, v_x} \|\phi(x) - H v_x\|^2    (12)

In this way, each image can be represented by \bar{v}(X)_{new} = \frac{1}{|X|} H \sum_{x \in X} v_x beforehand. As a consequence, the computation can be accelerated.
EMK maps each local feature to a high dimensional feature space to evaluate the similarity between local features more accurately, and uses the approximated feature H v_x to construct a linear classifier for image classification. It can be summarized as two stages: (i) x \to \phi(x): map the feature to the new feature space; (ii) \phi(x) \to H v_x: reconstruct \phi(x) using the basis H.
Note that directly using the original feature for image classification may cause overfitting [3]. To avoid this, and following the BoW model, we use v_x for image classification. We want each \phi(x) to be assigned to only a few clusters, so we add the sparsity constraint to the objective of EMK:

\min_{H, v_x} \|\phi(x) - H v_x\|^2 + \lambda\|v_x\|_1    (13)

This is the same as the objective of our kernel sparse representation, so EMK can be regarded as the special case of our KSRSPM at \lambda = 0. Compared with EMK, our KSRSPM uses the quantized feature indices for image classification, so it is more robust to noise. What is more, by using maximum pooling, the robustness of our KSRSPM to intra-class variance and noise is further strengthened.
3.4 Experiments
Parameter Settings. SIFT [16] is widely used in image recognition due to its excellent performance. For a fair comparison, and to be consistent with previous work [28,13,2], we use SIFT features under the same feature extraction settings. Specifically, we use a dense grid sampling strategy and fix the step size and patch size to 8 and 16 respectively. We also resize the maximum side (width/length) of each image to 300 pixels(1). After obtaining the SIFT features, we use the l2-norm to normalize the feature length to 1. For the codebook size, we set k = 1024 in k-means, and randomly select (5.0-8.0) x 10^4 features to generate the codebook for each dataset. Following the work [28], we set \lambda = 0.30 for all the datasets. As for the parameter \gamma in the Gaussian kernel, we set \gamma to 1/64, 1/64, 1/128, 1/256 on Scene 15, UIUC-Sports, Caltech 256 and Corel10 respectively. For SPM, we use the top 3 layers with equal weight for each layer. We use a one-vs-all linear SVM due to its advantage in speed [28] and its excellent performance in maximum feature pooling based image classification. All the results for each dataset are based on six independent experiments, with training images selected randomly.
Scene 15 Dataset. The Scene 15 [13] dataset is usually used for scene classification. It contains 4485 images divided into 15 categories, each with about 200 to 400 images. The image content is diverse, covering suburb, coast, forest, highway, inside city, mountain, open country, street, tall building, office, bedroom, industrial, kitchen, living room and store. For a fair comparison, we follow the same experimental setting as [28,13]: randomly select 100 images per category as training data and use the remaining images as test data. The results are listed in Table 1.

(1) For the UIUC-Sport dataset, we resize the maximum side to 400 due to the high resolution of the original images.
Table 1. Performance Comparison on Scene 15 Dataset (%)

Method       Average Classification Rate
KSPM [13]    81.40 +/- 0.50
EMK [2]      77.89 +/- 0.85
ScSPM [28]   80.28 +/- 0.93
KSRSPM       83.68 +/- 0.61
Caltech 256. Caltech 256(2) is a very challenging dataset in both image content and scale. First, compared with Caltech 101, the objects in Caltech 256 exhibit larger intra-class variance, and the objects are no longer centered in the image. Second, Caltech 256 contains 29780 images divided into 256 categories. More categories inevitably increase the inter-class similarity and aggravate the performance degradation. We evaluate the method under four different settings: selecting 15, 30, 45 and 60 images per category as training data respectively, and using the rest as test data. The results are listed in Table 2.
Table 2. Performance Comparison on Caltech 256 Dataset (%) (KC: Kernel Codebook)

Trn No.  KSPM [8]  KC [6]          EMK [2]        ScSPM [28]      KSRSPM
15       NA        NA              23.2 +/- 0.6   27.73 +/- 0.51  29.77 +/- 0.14
30       34.10     27.17 +/- 0.46  30.5 +/- 0.4   34.02 +/- 0.35  35.67 +/- 0.10
45       NA        NA              34.4 +/- 0.4   37.46 +/- 0.55  38.61 +/- 0.19
60       NA        NA              37.6 +/- 0.5   40.14 +/- 0.91  40.30 +/- 0.22
UIUC-Sport Dataset. UIUC-Sport [15] contains images from 8 different sports: badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding. There are 1792 images in all, with the number of images per category ranging from 137 to 250. Following the work of Wu et al. [27], we randomly select 70 images from each category as training data, and randomly select another 60 images from each category as test data. The results are listed in Table 3.
Table 3. Performance Comparison on UIUC-Sport Dataset (%)

Method            Average Classification Rate
HIK+ocSVM [27]    83.54 +/- 1.13
EMK [2]           74.56 +/- 1.32
ScSPM [28]        82.74 +/- 1.46
KSRSPM            84.92 +/- 0.78
2 www.vision.caltech.edu/Image_Datasets/Caltech256/
Table 4. Performance Comparison on Corel10 Dataset (%) (SMK: Spatial Markov Model)

Method       Average Classification Rate
SMK [17]     77.9
EMK [2]      79.90 +/- 1.73
ScSPM [28]   86.2 +/- 1.01
KSRSPM       89.43 +/- 1.27
Corel10 Dataset. Corel10 [18] contains 10 categories: skiing, beach, buildings, tigers, owls, elephants, flowers, horses, mountains and food. Each category contains 100 images. Following the work of Lu et al. [18], we randomly select 50 images per category as training data and use the rest as test data. The results are listed in Table 4.
Results Analysis. From Tables 1-4, we can see that on Scene 15, UIUC-Sports and Corel10, KSRSPM outperforms EMK by around (5.7-10.4)% and outperforms ScSPM by around (2.2-3.4)%. For Caltech 256, due to the large number of classes, the improvements are not very substantial, but still higher than EMK and ScSPM. We also show the confusion matrices of the Scene 15, UIUC-Sports and Corel10 datasets in Figure 1 and Figure 2. The entry in the i-th row and j-th column of a confusion matrix represents the percentage of class i being classified as class j. From the confusion matrices, we can see that some classes are easily misclassified as certain others.
Feature Quantization Error. Define the Average Quantization Error (AverQE) as: AverQE = \frac{1}{N}\sum_{i=1}^{N} \|\phi(x_i) - \tilde{U}v_i\|_F^2. It evaluates the information loss in the feature quantization process. To retain more information, the feature quantization error should be reduced. We compute the AverQE of our kernel sparse representation (KSR) and sparse coding (Sc) on all the features used for codebook generation, and list the results in Table 5. The results show that kernel sparse representation greatly decreases the feature quantization error.
Fig. 1. Confusion Matrix on Scene 15 dataset (%) [figure: the 15 x 15 confusion matrix over the scene categories; diagonal accuracies range from 61.6 (living room) to 99.3 (suburb), with the largest confusions between coast and open country and among the indoor classes bedroom, kitchen and living room]
Fig. 2. Confusion Matrices on UIUC-Sports and Corel10 (%) [figure: two confusion matrices; on UIUC-Sports, diagonal accuracies range from 62.19 (bocce) to 95.16 (rock climbing), with bocce and croquet most often confused; on Corel10, diagonal accuracies range from 76.67 (elephants) to 99.00 (tiger)]
Table 5. Average Feature Quantization Error on Different Datasets

       Scene      Caltech 256  Sport      Corel
Sc     0.8681     0.9164       0.8864     0.9295
KSR    9.63E-02   5.72E-02     9.40E-02   4.13E-02
This may be the reason that our KSRSPM outperforms ScSPM. The results also agree with our assumption that sparse coding in a high dimensional space can reduce the feature quantization error.
4 Application II: Kernel Sparse Representation for Face Recognition
4.1 Sparse Coding for Face Recognition
For face recognition, if sufficient training samples are available for each class, it is possible to represent a test sample as a linear combination of the training samples from the same class [26].
Suppose there are N classes in all, and the training instances for class i are A_i = [a_{i,1}, \ldots, a_{i,n_i}] \in \mathbb{R}^{d \times n_i}, in which each column corresponds to one instance. Let A = [A_1, \ldots, A_N] \in \mathbb{R}^{d \times \sum_{i=1}^{N} n_i} be the training set, and y \in \mathbb{R}^{d \times 1} be the test sample. When noise e exists, the face recognition problem [26] can be formulated as follows:

\min \|x_0\|_1 \quad s.t. \quad y = Ax + e = [A \; I][x^T \; e^T]^T = A_0 x_0    (14)

Sparse coding based face recognition aims at selecting only a few images from all the training instances to reconstruct the test image. Let \alpha_i = [\alpha_{i,1}, \ldots, \alpha_{i,n_i}] (1 \le i \le N) be the coefficients corresponding to A_i in x_0. The reconstruction error using the instances from class i can be computed as: r_i(y) = \|y - A_i \alpha_i\|_2. The test image is then assigned to the category that minimizes the reconstruction error: identity(y) = \arg\min_i \{r_1(y), \ldots, r_N(y)\}.
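The classification rule above can be sketched as follows; the l1 solver here is a plain ISTA stand-in for the solvers used in [26] (and it omits the explicit noise block [A I] for brevity), and the toy data are illustrative.

```python
import numpy as np

def ista(y, A, lam=0.01, n_iter=1000):
    """Minimal l1 solver for min ||y - A x||_2^2 + lam * ||x||_1
    (a stand-in for the l1-minimization solvers used in [26])."""
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * 2.0 * A.T @ (A @ x - y)
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x

def src_classify(y, A, labels):
    """Assign y to the class c minimizing r_c(y) = ||y - A delta_c(x)||_2,
    where delta_c keeps only the coefficients of class c."""
    x = ista(y, A)
    classes = np.unique(labels)
    res = [np.linalg.norm(y - A @ np.where(labels == c, x, 0.0)) for c in classes]
    return int(classes[int(np.argmin(res))])

# toy example: 3 classes of 10 unit-norm "faces" each; the test sample is
# synthesized from two class-1 atoms (columns 12 and 15)
rng = np.random.default_rng(4)
labels = np.repeat([0, 1, 2], 10)
A = rng.normal(size=(20, 30))
A /= np.linalg.norm(A, axis=0)
y = 0.7 * A[:, 12] + 0.3 * A[:, 15]
```

Because nearly all of the recovered coefficient mass falls on the class that actually generated y, that class attains the smallest residual.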
4.2 Kernel Sparse Representation for Face Recognition
The kernel method can make features belonging to the same category closer to each other [22]. Thus we apply kernel sparse representation to face recognition.
First, the l1 norm on the reconstruction error is replaced by the l2 norm (we assume that the noise may not be sparsely reconstructed from the training samples). By mapping features to a high dimensional space, y \to \phi(y), A = [a_{1,1}, \ldots, a_{N,n_N}] \to \tilde{A} = [\phi(a_{1,1}), \ldots, \phi(a_{N,n_N})], we obtain the objective of kernel sparse representation for face recognition:

\min \; \lambda\|x\|_1 + \|\phi(y) - \tilde{A}x\|_2^2    (15)

The parameter \lambda balances the sparsity against the reconstruction error. Following the work of Wright et al. [26], the test image is assigned to the category which minimizes the reconstruction error in the high dimensional feature space.
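With the Gaussian kernel, the per-class residuals in the mapped space can be evaluated without ever forming phi, by expanding ||phi(y) - A~ delta_c||^2 through kernel values exactly as in Equation (3). A sketch, where the helper names and the toy data are hypothetical:

```python
import numpy as np

def rbf(A, B, gamma):
    """Gaussian kernel between the columns of A and B."""
    d2 = (np.sum(A ** 2, axis=0)[:, None] + np.sum(B ** 2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_residuals(y, A, labels, x, gamma):
    """Per-class residuals in the mapped space:
    ||phi(y) - Phi(A) delta_c||^2 = kappa(y, y) + delta_c^T K delta_c
                                    - 2 delta_c^T k_A(y),
    where x is a sparse code for y (e.g. a minimizer of Eq. (15))."""
    K = rbf(A, A, gamma)                         # kernel matrix of training faces
    k_y = rbf(A, y[:, None], gamma).ravel()      # kernel vector against the test face
    out = {}
    for c in np.unique(labels):
        d = np.where(labels == c, x, 0.0)        # keep only class-c coefficients
        out[int(c)] = float(1.0 + d @ K @ d - 2.0 * d @ k_y)   # kappa(y, y) = 1
    return out

# toy check: the test face equals one class-0 training face exactly
rng = np.random.default_rng(6)
A = rng.normal(size=(10, 9))                     # 3 classes x 3 training faces
labels = np.repeat([0, 1, 2], 3)
y = A[:, 0].copy()
x = np.zeros(9)
x[0] = 1.0                                       # idealized sparse code for y
res = kernel_residuals(y, A, labels, x, gamma=0.1)
```

In this idealized case the class-0 residual is exactly zero, while classes with no active coefficients keep the full residual kappa(y, y) = 1.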
4.3 Evaluation on Extended Yale B Database
We evaluate our method on the Extended Yale B Database [7], which contains 2414 frontal-face images of 38 subjects. The cropped image size is 192 x 168. Following the work [26], we randomly select half of the images in each category for training, and use the rest for testing. The following five features are used for evaluation: RandomFace [26], LaplacianFace [10], EigenFace [24], FisherFace [1] and Downsample [26]; each feature is normalized to unit length using the l2 norm. The Gaussian kernel is used in our experiments: \kappa(x_1, x_2) = \exp(-\gamma\|x_1 - x_2\|^2). For Eigenfaces, Laplacianfaces, Downsample and Fisherfaces, we set \gamma = 1/d, where d is the feature dimension. For Randomfaces, \gamma = 1/32d.
The Effect of \lambda. We first evaluate \lambda using the 56D Downsample feature, and list the results for different \lambda in Table 6. When \lambda \neq 0, as \lambda decreases, the performance increases and the proportion of non-zero elements in the coefficients increases, but the computational time also increases. When \lambda = 0, the objective coincides with that of Efficient Match Kernel, but the performance is not as good as in the case of \lambda \neq 0. This shows the effectiveness of the sparsity term.
Result Comparison. Considering both the computational cost and the accuracy in Table 6, we set \lambda = 10^{-5}. The experimental results are listed in Table 7. All the results are based on 10 independent experiments. The experimental results show that kernel sparse representation outperforms sparse coding in face recognition.
Table 6. The Effect of the Sparsity Parameter \lambda: 56D Downsample Feature (here sparsity is the percentage of non-zero elements in the sparse codes)

\lambda               10^-1   10^-2   10^-3   10^-4   10^-5   10^-6   10^-7   0
sparsity (%)          0.58    0.75    0.88    2.13    4.66    8.35    16.69   -
reconstruction error  0.2399  0.1763  0.1651  0.1113  0.0893  0.0671  0.0462  -
time (sec)            0.0270  0.0280  0.0299  0.0477  0.2445  0.9926  6.2990  -
accuracy (%)          76.92   84.12   85.19   90.32   91.65   93.30   93.47   84.37
Table 7. Performance of Sparse Coding for Face Recognition (%)

Feature      Method    Dim 30  Dim 56  Dim 120  Dim 504
Eigen        Sc [26]   86.5    91.63   93.95    96.77
             KSR       89.01   94.42   97.49    99.16
Laplacian    Sc [26]   87.49   91.72   93.95    96.52
             KSR       88.86   94.24   97.11    98.12
Random       Sc [26]   82.6    91.47   95.53    98.09
             KSR       85.46   92.36   96.14    98.37
Downsample   Sc [26]   74.57   86.16   92.13    97.1
             KSR       83.57   91.65   95.31    97.8
Fisher       Sc [26]   86.91   NA      NA       NA
             KSR       88.93   NA      NA       NA
To further illustrate the performance of KSR, we calculate the similarity between the sparse codes of KSR and Sc on three classes (each class contains 32 images). The results are shown in Figure 3, in which the entry at (i, j) is the sparse code similarity (normalized correlation) between images i and j. A good sparse coding method should make the sparse codes of the same class more similar, so the similarity matrix should be block-wise. From Figure 3 we can see that our KSR obtains more discriminative sparse codes than sparse coding, which facilitates the better recognition performance.
Fig. 3. Similarity between the sparse codes of KSR and Sc [figure: two 96 x 96 similarity matrices over three classes of 32 images each; the KSR matrix shows a clearer block-diagonal structure than the Sc matrix]
5 Conclusion
In this paper, we propose a new technique, Kernel Sparse Representation, which is the sparse coding technique in a high dimensional feature space mapped by an implicit mapping function. We apply KSR to image classification and face recognition. For image classification, our proposed KSRSPM can be regarded both as an extension of ScSPM and as a generalization of EMK. For face recognition, KSR learns more discriminative sparse codes for face category identification. Experimental results on several publicly available datasets show that our KSR outperforms both ScSPM and EMK, and achieves state-of-the-art performance.
-
Every Picture Tells a Story: Generating Sentences from Images

Ali Farhadi1, Mohsen Hejrati2, Mohammad Amin Sadeghi2, Peter Young1, Cyrus Rashtchian1, Julia Hockenmaier1, David Forsyth1

1 Computer Science Department, University of Illinois at Urbana-Champaign
{afarhad2,pyoung2,crashtc2,juliahmr,daf}@illinois.edu
2 Computer Vision Group, School of Mathematics, Institute for Studies in Theoretical Physics and Mathematics (IPM)
{m.a.sadeghi,mhejrati}@gmail.com
Abstract. Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.
1 Introduction
For most pictures, humans can prepare a concise description in the form of a sentence relatively easily. Such descriptions might identify the most interesting objects, what they are doing, and where this is happening. These descriptions are rich, because they are in sentence form. They are accurate, with good agreement between annotators. They are concise: much is omitted, because humans tend not to mention objects or events that they judge to be less significant. Finally, they are consistent: in our data, annotators tend to agree on what is mentioned. Barnard et al. name two applications for methods that link text and images: illustration, where one finds pictures suggested by text (perhaps to suggest illustrations from a collection); and annotation, where one finds text annotations for images (perhaps to allow keyword search to find more images) [1].
This paper investigates methods to generate short descriptive sentences from images. Our contributions include: we introduce a dataset to study this problem (section 3.1); we introduce a novel representation intermediate between images and sentences (section 2.1); we describe a novel, discriminative approach that produces very good results at sentence annotation (section 2.4). For illustration, out-of-vocabulary words pose serious difficulties, and we show methods to use distributional semantics to cope with these issues (section 3.4). Evaluating sentence generation is very difficult, because sentences are fluid, and quite different sentences can describe the same phenomena. Worse, synecdoche (for example, substituting "animal" for "cat" or "bicycle" for "vehicle") and the general richness of vocabulary mean that many different words can quite legitimately be used to describe the same picture. In section 3, we describe a quantitative evaluation of sentence generation at a useful scale.

K. Daniilidis, P. Maragos, N. Paragios (Eds.): ECCV 2010, Part IV, LNCS 6314, pp. 15–29, 2010.
© Springer-Verlag Berlin Heidelberg 2010
Linking individual words to images has a rich history, and space allows only a mention of the most relevant papers. A natural strategy is to try to predict words from image regions. The first image annotation system is due to Mori et al. [2]; Duygulu et al. continued this tradition using models from machine translation [3]. Since then, a wide range of models has been deployed (reviews in [4,5]); the current best performer is a form of nearest-neighbour matching [6]. The most recent methods perform fairly well, but still find difficulty placing annotations on the correct regions.
Sentences are richer than lists of words, because they describe activities, properties of objects, and relations between entities (among other things). Such relations are revealing: Gupta and Davis show that respecting likely spatial relations between objects markedly improves the accuracy of both annotation and placing [7]. Li and Fei-Fei show that event recognition is improved by explicit inference on a generative model representing the scene in which the event occurs and also the objects in the image [8]. Using a different generative model, Li and Fei-Fei demonstrate that relations improve object labels, scene labels and segmentation [9]. Gupta and Davis show that respecting relations between objects and actions improves recognition of each [10,11]. Yao and Fei-Fei use the fact that objects and human poses are coupled and show that recognizing one helps the recognition of the other [12]. Relations between words in annotating sentences can reveal image structure. Berg et al. show that word features suggest which names in a caption are depicted in the attached picture, and that this improves the accuracy of links between names and faces [13]. Mensink and Verbeek show that complex co-occurrence relations between people improve face labelling, too [14]. Luo, Caputo and Ferrari [15] show benefits of associating faces and poses to names and verbs in predicting who's doing what in news articles. Coyne and Sproat describe an auto-illustration system that gives naive users a method to produce rendered images from free text descriptions (WordsEye [16]; http://www.wordseye.com).
There are few attempts to generate sentences from visual data. Gupta et al. generate sentences narrating a sports event in video using a compositional model based around AND-OR graphs [17]. The relatively stylised structure of the events helps both in sentence generation and in evaluation, because it is straightforward to tell which sentence is right. Yao et al. show some examples of both temporal narrative sentences (i.e., this happened, then that) and scene description sentences generated from visual data, but there is no evaluation [18].
[Figure 1 content: three panels labelled "Image Space", "Meaning Space", and "Sentence Space". The meaning space contains triplets such as <bus, park, street>, <plane, fly, sky>, <ship, sail, sea>, <train, move, rail>, and <bike, ride, grass>, linked to sentences such as "A yellow bus is parking in the street", "There is a small plane flying in the sky", "An old fishing ship sailing in a blue sea", "The train is moving on rails close to the station", and "An adventurous man riding a bike in a forest".]
Fig. 1. There is an intermediate space of meaning which has different projections to the space of images and sentences. Once we learn the projections, we can generate sentences for images and find images best described by a given sentence.
These methods generate a direct representation of what is happening in a scene, and then decode it into a sentence.
An alternative, which we espouse, is to build a scoring procedure that evaluates the similarity between a sentence and an image. This approach is attractive, because it is symmetric: given an image (resp. sentence), one can search for the best sentence (resp. image) in a large set. This means that one can do both illustration and annotation with one method. Another attraction is that the method does not need a strong syntactic model, which is represented by the prior on sentences. Our scoring procedure is built around an intermediate representation, which we call the meaning of the image (resp. sentence). In effect, image and sentence are each mapped to this intermediate space, and the results are compared; similar meanings result in a high score. The advantage of doing so is that each of these maps can be adjusted discriminatively. While the meaning space could be abstract, in our implementation we use a direct representation of simple sentences as a meaning space. This allows us to exploit distributional semantics ideas to deal with out-of-vocabulary words. For example, we have no detector for "cattle"; but we can link sentences containing this word to images, because distributional semantics tells us that "cattle" is similar to "sheep" and "cow", etc. (Figure 6).
2 Approach
Our model assumes that there is a space of Meanings that comes between the space of Sentences and the space of Images. We evaluate the similarity between a sentence and an image by (a) mapping each to the meaning space and then (b) comparing the results. Figure 1 depicts the intermediate space of meanings. We will learn the mapping from images (resp. sentences) to meaning discriminatively, from pairs of images (resp. sentences) and assigned meaning representations.
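The comparison step can be sketched concretely. The code below is our own illustrative sketch, not the authors' implementation: the tiny triplet vocabulary, the toy potentials, and the additive best-triplet scoring rule are all assumptions made for the example.

```python
from itertools import product

# Tiny illustrative meaning space (assumed vocabulary, not the paper's).
OBJECTS = ["bus", "plane", "ship"]
ACTIONS = ["park", "fly", "sail"]
SCENES = ["street", "sky", "sea"]

def pair_score(image_potential, sentence_potential):
    """Score an (image, sentence) pair by the <object, action, scene>
    triplet that the image-side and sentence-side mappings jointly
    support best."""
    best = float("-inf")
    for triplet in product(OBJECTS, ACTIONS, SCENES):
        s = image_potential(triplet) + sentence_potential(triplet)
        best = max(best, s)
    return best

# Toy potentials: count agreement with a fixed meaning <bus, park, street>.
target = ("bus", "park", "street")
image_side = lambda t: sum(a == b for a, b in zip(t, target))
sentence_side = lambda t: sum(a == b for a, b in zip(t, target))
print(pair_score(image_side, sentence_side))  # 6: both sides agree fully
```

A high score thus means there exists some meaning triplet that both projections rate highly, which is what makes the procedure usable in either direction (annotation or illustration).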
2.2 Image Potentials
We need informative features to drive the mapping from the image space to the meaning space.
Node Potentials. To provide information about the nodes on the MRF, we first need to construct image features. Our image features consist of:
Felzenszwalb et al. detector responses. We use Felzenszwalb detectors [19] to predict confidence scores on all the images. We set the threshold such that all of the classes get predicted at least once in each image. We then consider the max confidence of the detections for each category, the location of the center of the detected bounding box, the aspect ratio of the bounding box, and its scale.
Hoiem et al. classification responses. We use the classification scores of Hoiem et al. [20] for the PASCAL classification tasks. These classifiers are based on geometry, HOG features, and detection responses.
Gist-based scene classification responses. We encode global information of images using gist [21]. Our features for scenes are the confidences of our AdaBoost-style classifier for scenes.
First we build node features by fitting a discriminative classifier (a linear SVM) to predict each of the nodes independently on the image features. Although the classifiers are being learned independently, they are well aware of other objects and scene information. We call these estimates node features. This is a number-of-nodes-dimensional vector, and each element in this vector provides a score for a node given the image. This can be a node potential for object, action, and scene nodes. We expect similar images to have similar meanings, and so we obtain a set of features by matching our test image to training images. We combine these features into various other node potentials as below:
– By matching image features, we obtain the k-nearest neighbours in the training set to the test image, then compute the average of the node features over those neighbours, computed from the image side. By doing so, we have a representation of what the node features are for similar images.
– By matching image features, we obtain the k-nearest neighbours in the training set to the test image, then compute the average of the node features over those neighbours, computed from the sentence side. By doing so, we have a representation of what the sentence representation does for images that look like our image.
– By matching those node features derived from classifiers and detectors (above), we obtain the k-nearest neighbours in the training set to the test image, then compute the average of the node features over those neighbours, computed from the image side. By doing so, we have a representation of what the node features are for images that produce similar classifier and detector outputs.
– By matching those node features derived from classifiers and detectors (above), we obtain the k-nearest neighbours in the training set to the test image, then compute the average of the node features over those neighbours, computed from the sentence side. By doing so, we have a representation of what the sentence representation does for images that produce similar classifier and detector outputs.
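All four potentials above share one operation: find the k nearest training items under some matching features, then average node features taken from either the image side or the sentence side. A minimal pure-Python sketch of that shared operation (our own naming; the Euclidean distance and toy data are assumptions):

```python
def knn_average(query, match_feats, node_feats, k=2):
    """Average `node_feats` over the k nearest training items to `query`,
    measured by Euclidean distance in the matching-feature space.
    Depending on which features fill the two arguments (raw image
    features or classifier/detector node features for matching; image-
    side or sentence-side node features for averaging), this yields each
    of the four potentials described above."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(range(len(match_feats)),
                     key=lambda i: dist(match_feats[i], query))[:k]
    dim = len(node_feats[0])
    return [sum(node_feats[i][j] for i in nearest) / k for j in range(dim)]

# Toy training set: two clusters in the matching space, whose items carry
# different node features.
match = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
nodes = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(knn_average([0.5, 0.0], match, nodes, k=2))  # [1.0, 0.0]
```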
Edge Potentials. Introducing a parameter for each edge results in an unmanageable number of parameters. In addition, estimates of the parameters for the majority of edges would be noisy. There are serious smoothing issues. We adopt an approach similar to Good-Turing smoothing methods to (a) control the number of parameters and (b) do smoothing. We have multiple estimates for the edge potentials, which can provide more accurate estimates if used together. We form linear combinations of these potentials. Therefore, in learning we are interested in finding the weights of the linear combination of the initial estimates so that the final linearly combined potentials provide values on the MRF such that the ground-truth triplet is the highest-scored triplet for all examples. This way we limit the number of parameters to the number of initial estimates.
We have four different estimates for edges. Our final score on the edges takes the form of a linear combination of these estimates. Our four estimates for the edge from node A to node B are:

– The normalized frequency of the word A in our corpus, f(A).
– The normalized frequency of the word B in our corpus, f(B).
– The normalized frequency of A and B appearing together, f(A, B).
– The ratio f(A, B) / (f(A) f(B)).
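A sketch of how these four estimates might be computed from a corpus of (A, B) word pairs. This is illustrative code with assumed names; the last estimate is a pointwise-mutual-information-style ratio (PMI before taking the log), which is large when A and B co-occur more often than independence would predict.

```python
from collections import Counter

def edge_estimates(pairs):
    """Return a function giving the four edge estimates for words A, B:
    f(A), f(B), the joint frequency f(A, B), and f(A, B)/(f(A) f(B))."""
    n = len(pairs)
    f_a = Counter(a for a, _ in pairs)
    f_b = Counter(b for _, b in pairs)
    f_ab = Counter(pairs)

    def estimates(a, b):
        pa, pb, pab = f_a[a] / n, f_b[b] / n, f_ab[(a, b)] / n
        ratio = pab / (pa * pb) if pa and pb else 0.0
        return pa, pb, pab, ratio

    return estimates

est = edge_estimates([("bus", "park"), ("bus", "park"),
                      ("plane", "fly"), ("bus", "fly")])
print(est("bus", "park"))  # (0.75, 0.5, 0.5, 1.333...)
```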
2.3 Sentence Potentials
We need a representation of the sentences. We represent a sentence by computing the similarity between the sentence and our triplets. For that we need to have a notion of similarity for objects, scenes and actions in text.
We used the Curran & Clark parser [22] to generate a dependency parse for each sentence. We extracted the subject, direct object, and any nmod dependencies involving a noun and a verb. These dependencies were used to generate the (object, action) pairs for the sentences. In order to extract the scene information from the sentences, we extracted the head nouns of the prepositional phrases (except for the prepositions "of" and "with"), and the head nouns of the phrase "X in the background".
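The extraction step can be illustrated on hand-made dependency tuples. The real C&C parser output format differs; the relation names and (relation, governor, dependent) tuple shape below are assumptions made for the sketch.

```python
def extract(deps):
    """From (relation, governor, dependent) tuples, collect (object,
    action) pairs from subject/object relations, and scene words from
    prepositional phrases other than "of" and "with"."""
    pairs = [(dep, gov) for rel, gov, dep in deps
             if rel in ("subj", "dobj")]
    scenes = [dep for rel, gov, dep in deps
              if rel.startswith("prep_")
              and rel not in ("prep_of", "prep_with")]
    return pairs, scenes

# "A yellow bus is parking in the street", as hand-made dependencies.
deps = [("subj", "park", "bus"), ("prep_in", "park", "street")]
print(extract(deps))  # ([('bus', 'park')], ['street'])
```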
Lin Similarity Measure for Objects and Scenes. We use the Lin similarity measure [23] to determine the semantic distance between two words. The Lin similarity measure uses WordNet synsets as the possible meanings of each word. The noun synsets are arranged in a hierarchy based on hypernym (is-a) and hyponym (instance-of) relations. Each synset is defined as having an information content based on how frequently the synset or a hyponym of the synset occurs in a corpus (in this case, SemCor). The similarity of two synsets is defined as twice the information content of the least common ancestor of the synsets divided by the sum of the information content of the two synsets. Similar synsets will have an LCA that covers the two synsets, and very little else. When we compared two nouns, we considered all pairs of a filtered list of synsets for each noun, and used the most similar synsets. We filtered the list of synsets for each noun by limiting it to the first four synsets that were at least 10% as frequent as the most common synset of that noun. We also required the synsets to be physical entities.
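The Lin measure itself is simple to state. Below is a minimal sketch over a toy is-a hierarchy with made-up counts; the paper uses WordNet synsets with information content estimated from SemCor, so the hierarchy, counts, and names here are assumptions for illustration only.

```python
import math

# Toy is-a hierarchy and toy corpus counts (a count includes hyponyms).
PARENT = {"cat": "animal", "dog": "animal", "animal": "entity",
          "vehicle": "entity", "entity": None}
COUNT = {"cat": 10, "dog": 10, "animal": 25, "vehicle": 5, "entity": 30}
TOTAL = COUNT["entity"]  # everything is an entity in this toy corpus

def ic(s):
    """Information content: -log p(synset or one of its hyponyms)."""
    return -math.log(COUNT[s] / TOTAL)

def ancestors(s):
    out = []
    while s is not None:
        out.append(s)
        s = PARENT[s]
    return out

def lin(s1, s2):
    """Twice the IC of the least common ancestor, over the summed ICs."""
    in_s2 = set(ancestors(s2))
    lca = next(a for a in ancestors(s1) if a in in_s2)
    denom = ic(s1) + ic(s2)
    return 2 * ic(lca) / denom if denom else 1.0

print(lin("cat", "dog") > lin("cat", "vehicle"))  # True
```

Because the LCA of "cat" and "vehicle" is the root, its information content is zero and the pair scores 0, while "cat" and "dog" share the informative ancestor "animal".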
Action Co-occurrence Score. We generated a second image caption dataset consisting of roughly 8,000 images pulled from six Flickr groups. For all pairs of verbs, we used the likelihood ratio to determine if the two verbs co-occurring in the different captions of the same image was significant. We then used the likelihood ratio as the similarity score for the positively correlated verb pairs, and the negative of the likelihood ratio as the similarity score for the negatively correlated verb pairs. Typically, we found that this procedure discovered verbs that were either describing the same action or describing two actions that commonly co-occurred.
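One plausible instantiation of such a signed likelihood-ratio score is Dunning's G² statistic on the 2x2 co-occurrence table, signed by whether the two verbs co-occur more or less often than chance. The paper does not spell out its exact formula, so treat this as an assumption, not the authors' implementation.

```python
import math

def signed_g2(n_ab, n_a, n_b, n):
    """Signed log-likelihood ratio for verbs A and B co-occurring in
    captions of the same image: n_ab images captioned with both verbs,
    n_a and n_b images with each verb, n images in total."""
    obs = [[n_ab, n_a - n_ab],
           [n_b - n_ab, n - n_a - n_b + n_ab]]
    rows, cols = [n_a, n - n_a], [n_b, n - n_b]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o, e = obs[i][j], rows[i] * cols[j] / n
            if o > 0:
                g2 += 2 * o * math.log(o / e)
    # positive when observed co-occurrence exceeds the independence rate
    sign = 1 if n_ab * n >= n_a * n_b else -1
    return sign * g2

# Toy counts: a frequently co-occurring pair vs. a rarely co-occurring one.
print(signed_g2(15, 20, 20, 100) > 0, signed_g2(1, 20, 20, 100) < 0)
```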
Node Potentials. We can now provide a similarity measure between sentences and objects, actions, and scenes using the scores explained above. Below we explain our estimates of the sentence node potentials.
– First we compute the similarity of each object, scene, and action extracted from each sentence. This gives us the first estimate for the potentials over the nodes. We call this the sentence node feature.
– For each sentence, we also compute the average of the sentence node features for the other four sentences describing the same image in the train set.
– We compute the average of the k nearest neighbors in the sentence node feature space for a given sentence. We consider this as our third estimate for nodes.
– We also compute the average of the image node features for the images corresponding to the nearest neighbors in the item above.
– The average of the sentence node features of the reference sentences for the nearest neighbors in item 3 is considered as our fifth estimate for nodes.
– We also include the sentence node feature for the reference sentence.
Edge Potentials. The edge estimates for sentences are identical to the edge estimates for the images, explained in the previous section.
2.4 Learning
There are two mappings that need to be learned: the map from the image space to the meaning space uses the image potentials, and the map from the sentence space to the meaning space uses the sentence potentials. Learning the mapping from images to meaning involves finding the weights on the linear combinations of our image potentials on nodes and edges so that the ground-truth triplets score highest among all other triplets for all examples. This is a structure learning problem [24], which takes the form of

    min_w  (λ/2) ‖w‖² + (1/n) Σ_{i ∈ examples} ξ_i                                   (1)

    subject to
    w·Φ(x_i, y_i) + ξ_i ≥ max_{y ∈ meaning space} [ w·Φ(x_i, y) + L(y_i, y) ]   ∀ i ∈ examples
    ξ_i ≥ 0                                                                     ∀ i ∈ examples

where λ is the tradeoff factor between the regularization and the slack variables ξ, Φ is our feature function, x_i corresponds to our ith image, and y_i is the structured label for the ith image. We use the stochastic subgradient descent method [25] to solve this minimization.
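The optimization can be sketched as a Pegasos-style stochastic subgradient loop: sample an example, run loss-augmented inference to find the most violated label, and step along the regularized subgradient. This is illustrative code with a toy two-label problem and feature map; all names, data, and dimensions are our assumptions, not the paper's.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train(examples, labels, phi, loss, lam=0.1, eta=0.05, steps=200):
    """Stochastic subgradient descent for the structural objective above.
    phi(x, y) is the joint feature map, loss(y_true, y) is L."""
    w = [0.0] * len(phi(*examples[0]))
    for _ in range(steps):
        x, y_true = random.choice(examples)
        # loss-augmented inference: the most violated label
        y_hat = max(labels, key=lambda y: dot(w, phi(x, y)) + loss(y_true, y))
        f_hat, f_true = phi(x, y_hat), phi(x, y_true)
        w = [wi - eta * (lam * wi + fh - ft)
             for wi, fh, ft in zip(w, f_hat, f_true)]
    return w

# Toy problem: 2-d "images", two labels; phi stacks x into the label's block.
labels = ["cat", "dog"]
phi = lambda x, y: x + [0.0, 0.0] if y == "cat" else [0.0, 0.0] + x
loss = lambda a, b: 0.0 if a == b else 1.0

random.seed(0)
data = [([1.0, 0.0], "cat"), ([0.0, 1.0], "dog")]
w = train(data, labels, phi, loss)
predict = lambda x: max(labels, key=lambda y: dot(w, phi(x, y)))
print(predict([1.0, 0.0]), predict([0.0, 1.0]))  # cat dog
```

When no label is violated, `f_hat` equals `f_true` and the step reduces to the regularization shrink, so the weights hover near the margin condition.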
3 Evaluation
We emphasize quantitative evaluation in our work. Our vocabulary of meaning is significantly larger than the equivalent in [8,9]. Evaluation requires innovation both in datasets and in measurement, described below.
3.1 Dataset
We need a dataset with images and corresponding sentences, and also labels for our representations of the meaning space. No such dataset exists. We build our own dataset of images and sentences around the PASCAL 2008 images. This means we can use and compare to state-of-the-art models and image annotations on the PASCAL dataset.
PASCAL Sentence data set. To generate the sentences, we started with the 2008 PASCAL development kit. We randomly selected 50 images belonging to each of the 20 categories. Once we had a set of 1000 images, we used Amazon's Mechanical Turk to generate five captions for each image. We required the annotators to be based in the US, and that they pass a qualification exam testing their ability to identify spelling errors, grammatical errors, and descriptive captions. More details about the methods of collection can be found in [26]. Our dataset has 5 sentences for each of the thousand images, resulting in 5000 sentences. We also manually add labels of (object, action, scene) triplets for each image. These triplets label the main object in the image, the main action, and the main place. There are 173 different triplets in our train set and 123 in the test set. There are 80 triplets in the test set that appeared in the train set. The dataset is available at http://vision.cs.uiuc.edu/pascal-sentences/.
3.2 Inference
Our model is learned to maximize the sum of the scores along the path identified by a triplet. In inference we search for the triplet which gives us the best additive score, argmax_y w·Φ(x_i, y). These models prefer triplets with combination o