
  • Lecture Notes in Computer Science 6314
    Commenced Publication in 1973
    Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

    Editorial Board

    David Hutchison, Lancaster University, UK
    Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
    Josef Kittler, University of Surrey, Guildford, UK
    Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
    Alfred Kobsa, University of California, Irvine, CA, USA
    Friedemann Mattern, ETH Zurich, Switzerland
    John C. Mitchell, Stanford University, CA, USA
    Moni Naor, Weizmann Institute of Science, Rehovot, Israel
    Oscar Nierstrasz, University of Bern, Switzerland
    C. Pandu Rangan, Indian Institute of Technology, Madras, India
    Bernhard Steffen, TU Dortmund University, Germany
    Madhu Sudan, Microsoft Research, Cambridge, MA, USA
    Demetri Terzopoulos, University of California, Los Angeles, CA, USA
    Doug Tygar, University of California, Berkeley, CA, USA
    Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

  • Kostas Daniilidis, Petros Maragos, Nikos Paragios (Eds.)

    Computer Vision – ECCV 2010

    11th European Conference on Computer Vision
    Heraklion, Crete, Greece, September 5-11, 2010
    Proceedings, Part IV

  • Volume Editors

    Kostas Daniilidis
    GRASP Laboratory, University of Pennsylvania
    3330 Walnut Street, Philadelphia, PA 19104, USA
    E-mail: [email protected]

    Petros Maragos
    National Technical University of Athens
    School of Electrical and Computer Engineering
    15773 Athens, Greece
    E-mail: [email protected]

    Nikos Paragios
    Ecole Centrale de Paris
    Department of Applied Mathematics
    Grande Voie des Vignes, 92295 Chatenay-Malabry, France
    E-mail: [email protected]

    Library of Congress Control Number: 2010933243

    CR Subject Classification (1998): I.2.10, I.3, I.5, I.4, F.2.2, I.3.5

    LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics

    ISSN 0302-9743
    ISBN-10 3-642-15560-X Springer Berlin Heidelberg New York
    ISBN-13 978-3-642-15560-4 Springer Berlin Heidelberg New York

    This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

    springer.com

    © Springer-Verlag Berlin Heidelberg 2010
    Printed in Germany

    Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
    Printed on acid-free paper 06/3180

  • Preface

    The 2010 edition of the European Conference on Computer Vision was held in Heraklion, Crete. The call for papers attracted an absolute record of 1,174 submissions. We describe here the selection of the accepted papers:

    Thirty-eight Area Chairs were selected, coming from Europe (18), USA and Canada (16), and Asia (4). Their selection was based on the following criteria: (1) researchers who had served at least twice as Area Chairs within the past two years at major vision conferences were excluded; (2) researchers who served as Area Chairs at the 2010 Computer Vision and Pattern Recognition conference were also excluded (exception: ECCV 2012 Program Chairs); (3) overlap introduced by Area Chairs being former students and advisors was minimized; (4) 20% of the Area Chairs had never served before at a major conference; (5) the Area Chair selection process made all possible efforts to achieve a reasonable geographic distribution between countries, thematic areas and trends in computer vision.

    Each Area Chair was assigned by the Program Chairs between 28 and 32 papers. Based on paper content, the Area Chair recommended up to seven potential reviewers per paper. This assignment was made using all reviewers in the database, including the conflicting ones. The Program Chairs manually entered the missing conflict domains of approximately 300 reviewers. Based on the recommendations of the Area Chairs, three reviewers were selected per paper (with at least one being among the top three suggestions), with 99.7% being recommendations of the Area Chairs. When this was not possible, senior reviewers were assigned to these papers by the Program Chairs, with the consent of the Area Chairs. Upon completion of this process there were 653 active reviewers in the system.

    Each reviewer got a maximum load of eight reviews; in a few cases a reviewer had nine papers, when re-assignments were made manually because of hidden conflicts. Upon the review deadline, 38 reviews were missing. The Program Chairs proceeded with fast re-assignment of these papers to senior reviewers. Prior to the deadline for submitting the rebuttal by


    the authors, all papers had three reviews. The distribution of the reviews was the following: 100 papers with an average score of weak accept and higher, 125 papers with an average score toward weak accept, 425 papers with an average score around borderline.

    For papers with strong consensus among the reviewers, we introduced a procedure to handle potential overriding of the recommendation by the Area Chair. In particular, for all papers rated weak accept and higher or weak reject and lower, the Area Chair had to seek an additional reviewer prior to the Area Chair meeting. The decision on a paper could be changed only if the additional reviewer supported the recommendation of the Area Chair, and the Area Chair was able to convince his/her group of Area Chairs of that decision.

    The discussion phase between the Area Chair and the reviewers was initiated once the reviews became available. The Area Chairs had to provide their identity to the reviewers. The discussion remained open until the Area Chair meeting, held in Paris, June 5-6. Each Area Chair was paired with a buddy, and the decisions for all papers were made jointly, or when needed using the opinion of other Area Chairs. The pairing was done considering conflicts, thematic proximity, and, when possible, geographic diversity. The Area Chairs were responsible for making the decisions on their papers. Prior to the Area Chair meeting, 92% of the consolidation reports and decision suggestions had been completed by the Area Chairs. These recommendations were used as a basis for the final decisions.

    Orals were discussed in groups of Area Chairs. Four groups were formed, with no direct conflicts between the papers under discussion and the participating Area Chairs. The Area Chair recommending a paper had to present it to the whole group and explain why such a contribution is worth being published as an oral. In most cases consensus was reached in the group, while in the cases where discrepancies existed between the Area Chairs' views, the decision was taken according to the majority of opinions.

    The final outcome of the Area Chair meeting was 38 papers accepted for oral presentation and 284 as posters. The submission and acceptance statistics per thematic area are the following:


    Thematic area                            # submitted  % over submitted  # accepted  % over accepted  % acceptance in area
    Object and Scene Recognition                     192             16.4%          66            20.3%                 34.4%
    Segmentation and Grouping                        129             11.0%          28             8.6%                 21.7%
    Face, Gesture, Biometrics                        125             10.6%          32             9.8%                 25.6%
    Motion and Tracking                              119             10.1%          27             8.3%                 22.7%
    Statistical Models and Visual Learning           101              8.6%          30             9.2%                 29.7%
    Matching, Registration, Alignment                 90              7.7%          21             6.5%                 23.3%
    Computational Imaging                             74              6.3%          24             7.4%                 32.4%
    Multi-view Geometry                               67              5.7%          24             7.4%                 35.8%
    Image Features                                    66              5.6%          17             5.2%                 25.8%
    Video and Event Characterization                  62              5.3%          14             4.3%                 22.6%
    Shape Representation and Recognition              48              4.1%          19             5.8%                 39.6%
    Stereo                                            38              3.2%           4             1.2%                 10.5%
    Reflectance, Illumination, Color                  37              3.2%          14             4.3%                 37.8%
    Medical Image Analysis                            26              2.2%           5             1.5%                 19.2%

    We received 14 complaints/reconsideration requests. All of them were sent to the Area Chairs who handled the papers. Based on the reviewers' arguments and the reaction of the Area Chairs, three papers were accepted as posters on top of the 322 accepted at the Area Chair meeting, bringing the total number of accepted papers to 325, or 27.6%. The selection rate for the 38 orals was 3.2%. The acceptance rate for papers submitted by the group of Area Chairs was 39%.

    Award nominations were proposed by the Area and Program Chairs based on the reviews and the consolidation report. An external award committee was formed comprising David Fleet, Luc Van Gool, Bernt Schiele, Alan Yuille, Ramin Zabih. Additional reviews were considered for the nominated papers and the decision on the paper awards was made by the award committee. We thank the Area Chairs, Reviewers, Award Committee Members, and the General Chairs for their hard work and we gratefully acknowledge Microsoft Research for accommodating the ECCV needs by generously providing the CMT Conference Management Toolkit. We hope you enjoy the proceedings.

    September 2010 Kostas Daniilidis Petros Maragos Nikos Paragios

  • Organization

    General Chairs

    Argyros, Antonis    University of Crete/FORTH, Greece
    Trahanias, Panos    University of Crete/FORTH, Greece
    Tziritas, George    University of Crete, Greece

    Program Chairs

    Daniilidis, Kostas    University of Pennsylvania, USA
    Maragos, Petros       National Technical University of Athens, Greece
    Paragios, Nikos       Ecole Centrale de Paris / INRIA Saclay, Île-de-France, France

    Workshops Chair

    Kutulakos, Kyros University of Toronto, Canada

    Tutorials Chair

    Lourakis, Manolis FORTH, Greece

    Demonstrations Chair

    Kakadiaris, Ioannis University of Houston, USA

    Industrial Chair

    Pavlidis, Ioannis University of Houston, USA

    Travel Grants Chair

    Komodakis, Nikos University of Crete, Greece


    Area Chairs

    Bach, Francis             INRIA Paris - Rocquencourt, France
    Belongie, Serge           University of California-San Diego, USA
    Bischof, Horst            Graz University of Technology, Austria
    Black, Michael            Brown University, USA
    Boyer, Edmond             INRIA Grenoble - Rhone-Alpes, France
    Cootes, Tim               University of Manchester, UK
    Dana, Kristin             Rutgers University, USA
    Davis, Larry              University of Maryland, USA
    Efros, Alyosha            Carnegie Mellon University, USA
    Fermuller, Cornelia       University of Maryland, USA
    Fitzgibbon, Andrew        Microsoft Research, Cambridge, UK
    Jepson, Alan              University of Toronto, Canada
    Kahl, Fredrik             Lund University, Sweden
    Keriven, Renaud           Ecole des Ponts-ParisTech, France
    Kimmel, Ron               Technion Institute of Technology, Israel
    Kolmogorov, Vladimir      University College London, UK
    Lepetit, Vincent          Ecole Polytechnique Federale de Lausanne, Switzerland
    Matas, Jiri               Czech Technical University, Prague, Czech Republic
    Metaxas, Dimitris         Rutgers University, USA
    Navab, Nassir             Technical University of Munich, Germany
    Nister, David             Microsoft Research, Redmond, USA
    Perez, Patrick            THOMSON Research, France
    Perona, Pietro            Caltech, USA
    Ramesh, Visvanathan       Siemens Corporate Research, USA
    Raskar, Ramesh            Massachusetts Institute of Technology, USA
    Samaras, Dimitris         State University of New York - Stony Brook, USA
    Sato, Yoichi              University of Tokyo, Japan
    Schmid, Cordelia          INRIA Grenoble - Rhone-Alpes, France
    Schnoerr, Christoph       University of Heidelberg, Germany
    Sebe, Nicu                University of Trento, Italy
    Szeliski, Richard         Microsoft Research, Redmond, USA
    Taskar, Ben               University of Pennsylvania, USA
    Torr, Phil                Oxford Brookes University, UK
    Torralba, Antonio         Massachusetts Institute of Technology, USA
    Tuytelaars, Tinne         Katholieke Universiteit Leuven, Belgium
    Weickert, Joachim         Saarland University, Germany
    Weinshall, Daphna         Hebrew University of Jerusalem, Israel
    Weiss, Yair               Hebrew University of Jerusalem, Israel


    Conference Board

    Horst Bischof         Graz University of Technology, Austria
    Hans Burkhardt        University of Freiburg, Germany
    Bernard Buxton        University College London, UK
    Roberto Cipolla       University of Cambridge, UK
    Jan-Olof Eklundh      Royal Institute of Technology, Sweden
    Olivier Faugeras      INRIA, Sophia Antipolis, France
    David Forsyth         University of Illinois, USA
    Anders Heyden         Lund University, Sweden
    Ales Leonardis        University of Ljubljana, Slovenia
    Bernd Neumann         University of Hamburg, Germany
    Mads Nielsen          IT University of Copenhagen, Denmark
    Tomas Pajdla          CTU Prague, Czech Republic
    Jean Ponce            Ecole Normale Superieure, France
    Giulio Sandini        University of Genoa, Italy
    Philip Torr           Oxford Brookes University, UK
    David Vernon          Trinity College, Ireland
    Andrew Zisserman      University of Oxford, UK

    Reviewers

    Abd-Almageed, WaelAgapito, LourdesAgarwal, SameerAggarwal, GauravAhlberg, JuergenAhonen, TimoAi, HaizhouAlahari, KarteekAleman-Flores, MiguelAloimonos, YiannisAmberg, BrianAndreetto, MarcoAngelopoulou, ElliAnsar, AdnanArbel, TalArbelaez, PabloAstroem, KalleAthitsos, VassilisAugust, JonasAvraham, TamarAzzabou, NouraBabenko, BorisBagdanov, Andrew

    Bahlmann, ClausBaker, SimonBallan, LucaBarbu, AdrianBarnes, NickBarreto, JoaoBartlett, MarianBartoli, AdrienBatra, DhruvBaust, MaximilianBeardsley, PaulBehera, ArdhenduBeleznai, CsabaBen-ezra, MosheBerg, AlexanderBerg, TamaraBetke, MargritBileschi, StanBircheld, StanBiswas, SomaBlanz, VolkerBlaschko, MatthewBobick, Aaron

    Bougleux, SebastienBoult, TerranceBoureau, Y-LanBowden, RichardBoykov, YuriBradski, GaryBregler, ChristophBremond, FrancoisBronstein, AlexBronstein, MichaelBrown, MatthewBrown, MichaelBrox, ThomasBrubaker, MarcusBruckstein, FreddyBruhn, AndresBuisson, OlivierBurkhardt, HansBurschka, DariusCaetano, TiberioCai, DengCalway, AndrewCappelli, Raaele


    Caputo, BarbaraCarreira-Perpinan,

    MiguelCaselles, VincentCavallaro, AndreaCham, Tat-JenChandraker, ManmohanChandran, SharatChetverikov, DmitryChiu, Han-PangCho, Taeg SangChuang, Yung-YuChung, Albert C. S.Chung, MooClark, JamesCohen, IsaacCollins, RobertColombo, CarloCord, MatthieuCorso, JasonCosten, NicholasCour, TimotheeCrandall, DavidCremers, DanielCriminisi, AntonioCrowley, JamesCui, JinshiCula, OanaDalalyan, ArnakDarbon, JeromeDavis, JamesDavison, Andrewde Bruijne, MarleenDe la Torre, FernandoDedeoglu, GokselDelong, AndrewDemirci, StefanieDemirdjian, DavidDenzler, JoachimDeselaers, ThomasDhome, MichelDick, AnthonyDickinson, SvenDivakaran, AjayDollar, Piotr

    Domke, JustinDonoser, MichaelDoretto, GianfrancoDouze, MatthijsDraper, BruceDrbohlav, OndrejDuan, QiDuchenne, OlivierDuric, ZoranDuygulu-Sahin, PinarEklundh, Jan-OlofElder, JamesElgammal, AhmedEpshtein, BorisEriksson, AndersEspuny, FerranEssa, IrfanFarhadi, AliFarrell, RyanFavaro, PaoloFehr, JanisFei-Fei, LiFelsberg, MichaelFerencz, AndrasFergus, RobFeris, RogerioFerrari, VittorioFerryman, JamesFidler, SanjaFinlayson, GrahamFisher, RobertFlach, BorisFleet, DavidFletcher, TomFlorack, LucFlynn, PatrickFoerstner, WolfgangForoosh, HassanForssen, Per-ErikFowlkes, CharlessFrahm, Jan-MichaelFraundorfer, FriedrichFreeman, WilliamFrey, BrendanFritz, Mario

    Fua, PascalFuchs, MartinFurukawa, YasutakaFusiello, AndreaGall, JuergenGallagher, AndrewGao, XiangGatica-Perez, DanielGee, JamesGehler, PeterGenc, YakupGeorgescu, BogdanGeusebroek, Jan-MarkGevers, TheoGeyer, ChristopherGhosh, AbhijeetGlocker, BenGoecke, RolandGoedeme, ToonGoldberger, JacobGoldenstein, SiomeGoldluecke, BastianGomes, RyanGong, SeanGorelick, LenaGould, StephenGrabner, HelmutGrady, LeoGrau, OliverGrauman, KristenGross, RalphGrossmann, EtienneGruber, AmitGulshan, VarunGuo, GuodongGupta, AbhinavGupta, MohitHabbecke, MartinHager, GregoryHamid, RaayHan, BohyungHan, TonyHanbury, AllanHancock, EdwinHasino, Samuel


    Hassner, TalHaussecker, HorstHays, JamesHe, XumingHeas, PatrickHebert, MartialHeibel, T. HaukeHeidrich, WolfgangHernandez, CarlosHilton, AdrianHinterstoisser, StefanHlavac, VaclavHoiem, DerekHoogs, AnthonyHornegger, JoachimHua, GangHuang, RuiHuang, XiaoleiHuber, DanielHudelot, CelineHussein, MohamedHuttenlocher, DanIhler, AlexIlic, SlobodanIrschara, ArnoldIshikawa, HiroshiIsler, VolkanJain, PrateekJain, VirenJamie Shotton, JamieJegou, HerveJenatton, RodolpheJermyn, IanJi, HuiJi, QiangJia, JiayaJin, HailinJogan, MatjazJohnson, MicahJoshi, NeelJuan, OlivierJurie, FredericKakadiaris, IoannisKale, Amit

    Kamarainen,Joni-Kristian

    Kamberov, GeorgeKamberova, GerdaKambhamettu, ChandraKanatani, KenichiKanaujia, AtulKang, Sing BingKappes, JorgKavukcuoglu, KorayKawakami, ReiKe, QifaKemelmacher, IraKhamene, AliKhan, SaadKikinis, RonKim, Seon JooKimia, BenjaminKittler, JosefKoch, ReinhardKoeser, KevinKohli, PushmeetKokiopoulou, EKokkinos, IasonasKolev, KalinKomodakis, NikosKonolige, KurtKoschan, AndreasKukelova, ZuzanaKulis, BrianKumar, M. PawanKumar, SanjivKuthirummal, SujitKutulakos, KyrosKweon, In SoLadicky, LuborLai, Shang-HongLalonde, Jean-FrancoisLampert, ChristophLandon, GeorgeLanger, MichaelLangs, GeorgLanman, DouglasLaptev, Ivan

    Larlus, DianeLatecki, Longin JanLazebnik, SvetlanaLee, ChanSuLee, HonglakLee, Kyoung MuLee, Sang-WookLeibe, BastianLeichter, IdoLeistner, ChristianLellmann, JanLempitsky, VictorLenzen, FrankLeonardis, AlesLeung, ThomasLevin, AnatLi, ChunmingLi, GangLi, HongdongLi, HongshengLi, Li-JiaLi, RuiLi, RuonanLi, StanLi, YiLi, YunpengLiefeng, BoLim, JongwooLin, StephenLin, ZheLing, HaibinLittle, JimLiu, CeLiu, JingenLiu, QingshanLiu, Tyng-LuhLiu, XiaomingLiu, YanxiLiu, YazhouLiu, ZichengLourakis, ManolisLovell, BrianLu, LeLucey, Simon


    Luo, JieboLyu, SiweiMa, XiaoxuMairal, JulienMaire, MichaelMaji, SubhransuMaki, AtsutoMakris, DimitriosMalisiewicz, TomaszMallick, SatyaManduchi, RobertoManmatha, R.Marchand, EricMarcialis, GianMarks, TimMarszalek, MarcinMartinec, DanielMartinez, AleixMatei, BogdanMateus, DianaMatsushita, YasuyukiMatthews, IainMaxwell, BruceMaybank, StephenMayer, HelmutMcCloskey, ScottMcKenna, StephenMedioni, GerardMeer, PeterMei, ChristopherMichael, NicholasMicusik, BranislavMinh, NguyenMirmehdi, MajidMittal, AnuragMiyazaki, DaisukeMonasse, PascalMordohai, PhilipposMoreno-Noguer,

    FrancescMori, GregMorimoto, CarlosMorse, BryanMoses, YaelMueller, Henning

    Mukaigawa, YasuhiroMulligan, JaneMunich, MarioMurino, VittorioNamboodiri, VinayNarasimhan, SrinivasaNarayanan, P.J.Naroditsky, OlegNeumann, JanNevatia, RamNicolls, FredNiebles, Juan CarlosNielsen, MadsNishino, KoNixon, MarkNowozin, SebastianOdonnell, ThomasObozinski, GuillaumeOdobez, Jean-MarcOdone, FrancescaOfek, EyalOgale, AbhijitOkabe, TakahiroOkatani, TakayukiOkuma, KenjiOlson, ClarkOlsson, CarlOmmer, BjornOsadchy, MargaritaOvergaard, Niels

    ChristianOzuysal, MustafaPajdla, TomasPanagopoulos,

    AlexandrosPandharkar, RohitPankanti, SharathPantic, MajaPapadopoulo, TheoParameswaran, VasuParikh, DeviParis, SylvainPatow, GustavoPatras, IoannisPavlovic, Vladimir

    Peleg, ShmuelPerera, A.G. AmithaPerronnin, FlorentPetrou, MariaPetrovic, VladimirPeursum, PatrickPhilbin, JamesPiater, JustusPietikainen, MattiPinz, AxelPless, RobertPock, ThomasPoh, NormanPollefeys, MarcPonce, JeanPons, Jean-PhilippePotetz, BrianPrabhakar, SalilQian, GangQuattoni, AriadnaRadeva, PetiaRadke, RichardRakotomamonjy, AlainRamanan, DevaRamanathan, NarayananRanzato, MarcAurelioRaviv, DanReid, IanReitmayr, GerhardRen, XiaofengRittscher, JensRogez, GregoryRosales, RomerRosenberg, CharlesRosenhahn, BodoRosman, GuyRoss, ArunRoth, PeterRother, CarstenRothganger, FredRougon, NicolasRoy, SebastienRueckert, DanielRuether, MatthiasRussell, Bryan


    Russell, ChristopherSahbi, HichemStiefelhagen, RainerSaad, AliSaari, AmirSalgian, GarbisSalzmann, MathieuSangineto, EnverSankaranarayanan,

    AswinSapiro, GuillermoSara, RadimSato, ImariSavarese, SilvioSavchynskyy, BogdanSawhney, HarpreetScharr, HannoScharstein, DanielSchellewald, ChristianSchiele, BerntSchindler, GrantSchindler, KonradSchlesinger, DmitrijSchoenemann, ThomasSchro, FlorianSchubert, FalkSchultz, ThomasSe, StephenSeidel, Hans-PeterSerre, ThomasShah, MubarakShakhnarovich, GregoryShan, YingShashua, AmnonShechtman, EliSheikh, YaserShekhovtsov, AlexanderShet, VinayShi, JianboShimshoni, IlanShokoufandeh, AliSigal, LeonidSimon, LoicSingaraju, DheerajSingh, Maneesh

    Singh, VikasSinha, SudiptaSivic, JosefSlabaugh, GregSmeulders, ArnoldSminchisescu, CristianSmith, KevinSmith, WilliamSnavely, NoahSnoek, CeesSoatto, StefanoSochen, NirSochman, JanSofka, MichalSorokin, AlexanderSouthall, BenSouvenir, RichardSrivastava, AnujStauer, ChrisStein, GideonStrecha, ChristophSugimoto, AkihiroSullivan, JosephineSun, DeqingSun, JianSun, MinSunkavalli, KalyanSuter, DavidSvoboda, TomasSyeda-Mahmood,

    TanveerSusstrunk, SabineTai, Yu-WingTakamatsu, JunTalbot, HuguesTan, PingTan, RobbyTanaka, MasayukiTao, DachengTappen, MarshallTaylor, CamilloTheobalt, ChristianThonnat, MoniqueTieu, KinhTistarelli, Massimo

    Todorovic, SinisaToreyin, Behcet UgurTorresani, LorenzoTorsello, AndreaToshev, AlexanderTrucco, EmanueleTschumperle, DavidTsin, YanghaiTu, PeterTung, TonyTurek, MattTurk, MatthewTuzel, OncelTyagi, AmbrishUrschler, MartinUrtasun, RaquelVan de Weijer, Joostvan Gemert, Janvan den Hengel, AntonVasilescu, M. Alex O.Vedaldi, AndreaVeeraraghavan, AshokVeksler, OlgaVerbeek, JakobVese, LuminitaVitaladevuni, ShivVogiatzis, GeorgeVogler, ChristianWachinger, ChristianWada, ToshikazuWagner, DanielWang, ChaohuiWang, HanziWang, HongchengWang, JueWang, KaiWang, SongWang, XiaogangWang, YangWeese, JuergenWei, YichenWein, WolfgangWelinder, PeterWerner, TomasWestin, Carl-Fredrik


    Wilburn, BennettWildes, RichardWilliams, OliverWills, JoshWilson, KevinWojek, ChristianWolf, LiorWright, JohnWu, Tai-PangWu, YingXiao, JiangjianXiao, JianxiongXiao, JingYagi, YasushiYan, ShuichengYang, FeiYang, JieYang, Ming-Hsuan

    Yang, PengYang, QingxiongYang, RuigangYe, JiepingYeung, Dit-YanYezzi, AnthonyYilmaz, AlperYin, LijunYoon, Kuk JinYu, JingyiYu, KaiYu, QianYu, StellaYuille, AlanZach, ChristopherZaid, HarchaouiZelnik-Manor, LihiZeng, Gang

    Zhang, ChaZhang, LiZhang, ShengZhang, WeiweiZhang, WenchaoZhao, WenyiZheng, YuanjieZhou, JinghaoZhou, KevinZhu, LeoZhu, Song-ChunZhu, YingZickler, ToddZikic, DarkoZisserman, AndrewZitnick, LarryZivny, StanislavZu, Silvia


    Sponsoring Institutions

    Platinum Sponsor

    Gold Sponsors

    Silver Sponsors

  • Table of Contents Part IV

    Spotlights and Posters W1

    Kernel Sparse Representation for Image Classification and Face Recognition . . . . 1
      Shenghua Gao, Ivor Wai-Hung Tsang, and Liang-Tien Chia

    Every Picture Tells a Story: Generating Sentences from Images . . . . 15
      Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth

    An Eye Fixation Database for Saliency Detection in Images . . . . 30
      Subramanian Ramanathan, Harish Katti, Nicu Sebe, Mohan Kankanhalli, and Tat-Seng Chua

    Face Image Relighting Using Locally Constrained Global Optimization . . . . 44
      Jiansheng Chen, Guangda Su, Jinping He, and Shenglan Ben

    Correlation-Based Intrinsic Image Extraction from a Single Image . . . . 58
      Xiaoyue Jiang, Andrew J. Schofield, and Jeremy L. Wyatt

    ADICT: Accurate Direct and Inverse Color Transformation . . . . 72
      Behzad Sajadi, Maxim Lazarov, and Aditi Majumder

    Real-Time Specular Highlight Removal Using Bilateral Filtering . . . . 87
      Qingxiong Yang, Shengnan Wang, and Narendra Ahuja

    Learning Artistic Lighting Template from Portrait Photographs . . . . 101
      Xin Jin, Mingtian Zhao, Xiaowu Chen, Qinping Zhao, and Song-Chun Zhu

    Photometric Stereo from Maximum Feasible Lambertian Reflections . . . . 115
      Chanki Yu, Yongduek Seo, and Sang Wook Lee

    Part-Based Feature Synthesis for Human Detection . . . . 127
      Aharon Bar-Hillel, Dan Levi, Eyal Krupka, and Chen Goldberg

    Improving the Fisher Kernel for Large-Scale Image Classification . . . . 143
      Florent Perronnin, Jorge Sanchez, and Thomas Mensink

    Max-Margin Dictionary Learning for Multiclass Image Categorization . . . . 157
      Xiao-Chen Lian, Zhiwei Li, Bao-Liang Lu, and Lei Zhang


    Towards Optimal Naive Bayes Nearest Neighbor . . . . 171
      Regis Behmo, Paul Marcombes, Arnak Dalalyan, and Veronique Prinet

    Weakly Supervised Classification of Objects in Images Using Soft Random Forests . . . . 185
      Riwal Lefort, Ronan Fablet, and Jean-Marc Boucher

    Learning What and How of Contextual Models for Scene Labeling . . . . 199
      Arpit Jain, Abhinav Gupta, and Larry S. Davis

    Adapting Visual Category Models to New Domains . . . . 213
      Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell

    Improved Human Parsing with a Full Relational Model . . . . 227
      Duan Tran and David Forsyth

    Multiresolution Models for Object Detection . . . . 241
      Dennis Park, Deva Ramanan, and Charless Fowlkes

    Accurate Image Localization Based on Google Maps Street View . . . . 255
      Amir Roshan Zamir and Mubarak Shah

    A Minimal Case Solution to the Calibrated Relative Pose Problem for the Case of Two Known Orientation Angles . . . . 269
      Friedrich Fraundorfer, Petri Tanskanen, and Marc Pollefeys

    Bilinear Factorization via Augmented Lagrange Multipliers . . . . 283
      Alessio Del Bue, Joao Xavier, Lourdes Agapito, and Marco Paladini

    Piecewise Quadratic Reconstruction of Non-Rigid Surfaces from Monocular Sequences . . . . 297
      Joao Fayad, Lourdes Agapito, and Alessio Del Bue

    Extrinsic Camera Calibration Using Multiple Reflections . . . . 311
      Joel A. Hesch, Anastasios I. Mourikis, and Stergios I. Roumeliotis

    Probabilistic Deformable Surface Tracking from Multiple Videos . . . . 326
      Cedric Cagniart, Edmond Boyer, and Slobodan Ilic

    Theory of Optimal View Interpolation with Depth Inaccuracy . . . . 340
      Keita Takahashi

    Practical Methods for Convex Multi-view Reconstruction . . . . 354
      Christopher Zach and Marc Pollefeys

    Building Rome on a Cloudless Day . . . . 368
      Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, and Marc Pollefeys


    Camera Pose Estimation Using Images of Planar Mirror Reflections . . . . 382
      Rui Rodrigues, Joao P. Barreto, and Urbano Nunes

    Element-Wise Factorization for N-View Projective Reconstruction . . . . 396
      Yuchao Dai, Hongdong Li, and Mingyi He

    Learning Relations among Movie Characters: A Social Network Perspective . . . . 410
      Lei Ding and Alper Yilmaz

    Scene and Object Recognition

    What, Where and How Many? Combining Object Detectors and CRFs . . . . 424
      Lubor Ladicky, Paul Sturgess, Karteek Alahari, Chris Russell, and Philip H.S. Torr

    Visual Recognition with Humans in the Loop . . . . 438
      Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie

    Localizing Objects While Learning Their Appearance . . . . 452
      Thomas Deselaers, Bogdan Alexe, and Vittorio Ferrari

    Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes . . . . 467
      Christian Wojek, Stefan Roth, Konrad Schindler, and Bernt Schiele

    Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics . . . . 482
      Abhinav Gupta, Alexei A. Efros, and Martial Hebert

    Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding . . . . 497
      Huayan Wang, Stephen Gould, and Daphne Koller

    Spotlights and Posters W2

    Visual Tracking Using a Pixelwise Spatiotemporal Oriented Energy Representation . . . . 511
      Kevin J. Cannons, Jacob M. Gryn, and Richard P. Wildes

    A Globally Optimal Approach for 3D Elastic Motion Estimation from Stereo Sequences . . . . 525
      Qifan Wang, Linmi Tao, and Huijun Di

    Occlusion Boundary Detection Using Pseudo-depth . . . . 539
      Xuming He and Alan Yuille

    Multiple Target Tracking in World Coordinate with Single, Minimally Calibrated Camera . . . . 553
      Wongun Choi and Silvio Savarese

    Joint Estimation of Motion, Structure and Geometry from Stereo Sequences . . . . 568
      Levi Valgaerts, Andres Bruhn, Henning Zimmer, Joachim Weickert, Carsten Stoll, and Christian Theobalt

    Dense, Robust, and Accurate Motion Field Estimation from Stereo Image Sequences in Real-Time . . . . 582
      Clemens Rabe, Thomas Muller, Andreas Wedel, and Uwe Franke

    Estimation of 3D Object Structure, Motion and Rotation Based on 4D Affine Optical Flow Using a Multi-camera Array . . . . 596
      Tobias Schuchert and Hanno Scharr

    Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces . . . . 610
      Carl Vondrick, Deva Ramanan, and Donald Patterson

    Robust and Fast Collaborative Tracking with Two Stage Sparse Optimization . . . . 624
      Baiyang Liu, Lin Yang, Junzhou Huang, Peter Meer, Leiguang Gong, and Casimir Kulikowski

    Nonlocal Multiscale Hierarchical Decomposition on Graphs . . . . 638
      Moncef Hidane, Olivier Lezoray, Vinh-Thong Ta, and Abderrahim Elmoataz

    Adaptive Regularization for Image Segmentation Using Local Image Curvature Cues . . . . 651
      Josna Rao, Rafeef Abugharbieh, and Ghassan Hamarneh

    A Static SMC Sampler on Shapes for the Automated Segmentation of Aortic Calcifications . . . . 666
      Kersten Petersen, Mads Nielsen, and Sami S. Brandt

    Fast Dynamic Texture Detection . . . . 680
      V. Javier Traver, Majid Mirmehdi, Xianghua Xie, and Raul Montoliu

    Finding Semantic Structures in Image Hierarchies Using Laplacian Graph Energy . . . . 694
      Yi-Zhe Song, Pablo Arbelaez, Peter Hall, Chuan Li, and Anupriya Balikai

    Semantic Segmentation of Urban Scenes Using Dense Depth Maps . . . . 708
      Chenxi Zhang, Liang Wang, and Ruigang Yang


    Tensor Sparse Coding for Region Covariances . . . . 722
      Ravishankar Sivalingam, Daniel Boley, Vassilios Morellas, and Nikolaos Papanikolopoulos

    Improving Local Descriptors by Embedding Global and Local Spatial Information . . . . 736
      Tatsuya Harada, Hideki Nakayama, and Yasuo Kuniyoshi

    Detecting Faint Curved Edges in Noisy Images . . . . 750
      Sharon Alpert, Meirav Galun, Boaz Nadler, and Ronen Basri

    Spatial Statistics of Visual Keypoints for Texture Recognition . . . . 764
      Huu-Giao Nguyen, Ronan Fablet, and Jean-Marc Boucher

    BRIEF: Binary Robust Independent Elementary Features . . . . 778
      Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua

    Multi-label Feature Transform for Image Classifications . . . . 793
      Hua Wang, Heng Huang, and Chris Ding

    Author Index . . . . 807

  • Kernel Sparse Representation for Image Classification and Face Recognition

    Shenghua Gao, Ivor Wai-Hung Tsang, and Liang-Tien Chia

    School of Computer Engineering, Nanyang Technological University, Singapore
    {gaos0004,IvorTsang,asltchia}@ntu.edu.sg

    Abstract. Recent research has shown the effectiveness of using sparse coding (Sc) to solve many computer vision problems. Motivated by the fact that the kernel trick can capture the nonlinear similarity of features, which may reduce the feature quantization error and boost the sparse coding performance, we propose Kernel Sparse Representation (KSR). KSR is essentially the sparse coding technique in a high-dimensional feature space mapped by an implicit mapping function. We apply KSR to both image classification and face recognition. By incorporating KSR into Spatial Pyramid Matching (SPM), we propose KSRSPM for image classification. KSRSPM can further reduce the information loss in the feature quantization step compared with Spatial Pyramid Matching using Sparse Coding (ScSPM). KSRSPM can be regarded both as a generalization of Efficient Match Kernel (EMK) and as an extension of ScSPM. Compared with sparse coding, KSR can learn more discriminative sparse codes for face recognition. Extensive experimental results show that KSR outperforms sparse coding and EMK, and achieves state-of-the-art performance for image classification and face recognition on publicly available datasets.

    1 Introduction

    The sparse coding technique is attracting more and more researchers' attention in computer vision due to its state-of-the-art performance in many applications, such as image annotation [25], image restoration [20], and image classification [28]. It aims at selecting the fewest possible basis vectors from a large basis pool to linearly recover a given signal under a small reconstruction error constraint. Therefore, sparse coding can easily be applied to feature quantization in Bag-of-Words (BoW) model based image representation. Moreover, under the assumption that a test face image can be reconstructed from the images of the same category, sparse coding can also be used in face recognition [26].

    The BoW model [23] is widely used in computer vision [27,21] due to its concise representation and robustness to scale and rotation variance. Generally, it contains three modules: (i) region selection and representation; (ii) codebook generation and feature quantization; (iii) frequency histogram based image representation. Among these three modules, codebook generation and feature quantization are the most important for image representation. The codebook is a collection of basic patterns used to reconstruct the local features; each basic pattern is known as a visual word. Usually k-means is adopted to generate the codebook, and each local feature is quantized to its nearest visual word. However, such a hard assignment may cause severe information loss [3,6], especially for features located at the boundary between several visual words. To minimize such errors, soft assignment [21,6] was introduced, assigning each feature to more than one visual word. However, choosing the parameters, including the weight assigned to each visual word and the number of visual words to assign, is not trivial.
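    To make the contrast concrete, the following is a minimal sketch (not taken from the paper, and not the specific soft-assignment scheme of [21,6]) of hard versus soft assignment against a k-means codebook; the Gaussian weighting, the `sigma` parameter and the `top` cutoff are illustrative assumptions.

```python
# Hard vs. soft quantization of one local feature against a k-means codebook.
import numpy as np

def hard_assign(x, codebook):
    """Assign feature x (d,) to its nearest visual word with weight 1."""
    dists = np.linalg.norm(codebook - x, axis=1)      # codebook: (k, d)
    v = np.zeros(len(codebook))
    v[np.argmin(dists)] = 1.0
    return v

def soft_assign(x, codebook, sigma=1.0, top=5):
    """Spread the feature over its `top` nearest words with Gaussian weights."""
    dists = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(dists)[:top]
    w = np.exp(-dists[idx] ** 2 / (2 * sigma ** 2))
    v = np.zeros(len(codebook))
    v[idx] = w / w.sum()                               # weights sum to 1
    return v
```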

    Recently, Yang et al. [28] proposed using sparse coding in the codebook generation and feature quantization module. Sparse coding can learn a better codebook that further reduces the quantization error compared with k-means, and the weights assigned to each visual word are learnt concurrently. By applying sparse coding to Spatial Pyramid Matching [13] (referred to as ScSPM), their method achieves state-of-the-art performance in image classification.

    Another application of sparse coding is face recognition. Face recognition is a classic problem in computer vision and has great potential in many real-world applications. It generally contains two stages: (i) feature extraction; and (ii) classifier construction and label prediction. Usually Nearest Neighbor (NN) [5] and Nearest Subspace (NS) [11] are used. However, NN predicts the label of the test image using only its nearest neighbor in the training data, so it is easily affected by noise. NS approximates the test image using all the images belonging to the same category, and assigns the image to the category that minimizes the reconstruction error; but NS may not work well when classes are highly correlated with each other [26]. To overcome these problems, Wright et al. proposed a sparse coding based face recognition framework [26], which automatically selects the images in the training set used to approximate the test image. Their method is robust to occlusion, illumination and noise, and achieves excellent performance.

    Existing work based on sparse coding only seeks the sparse representation of a given signal in the original signal space. Recall that the kernel trick [22] maps non-linearly separable features into a high-dimensional feature space, in which features of the same type are more easily grouped together and linearly separable. In this space we may find the sparse representation of a signal more easily, and the reconstruction error may be reduced as well. Motivated by this, we propose Kernel Sparse Representation (KSR), which is sparse coding in the mapped high-dimensional feature space.

    The contributions of this paper can be summarized as follows: (i) We propose the idea of kernel sparse representation, which is sparse coding in a high-dimensional feature space; experiments show that KSR greatly reduces the feature reconstruction error. (ii) We propose KSRSPM for image classification. KSRSPM can be regarded as a generalized EMK, which can evaluate the similarity between local features accurately; compared with EMK, our KSRSPM is more robust because it uses the quantized features rather than the approximated high-dimensional features. (iii) We extend KSR to face recognition. KSR can achieve more discriminative sparse codes than sparse coding, which boosts face recognition performance.

    The rest of this paper is organized as follows. In Section 2, we describe the details of KSR, including its objective function and its implementation. By incorporating KSR into the SPM framework, we propose KSRSPM in Section 3; we also detail the relationship between our KSRSPM and EMK, and report image classification performance on several publicly available datasets at the end of that section. In Section 4, we use KSR for face recognition and compare sparse coding and KSR on the Extended Yale B Face Dataset. Finally, we conclude our work in Section 5.

    2 Kernel Sparse Representation and Implementation

    2.1 Kernel Sparse Representation

    General sparse coding aims at finding the sparse representation of a signal under a given basis U (U ∈ R^{d×k}) while minimizing the reconstruction error. This amounts to solving the following objective:

        min_{U,v}  ‖x − Uv‖² + λ‖v‖₁    subject to: ‖u_m‖² ≤ 1                                  (1)

    where U = [u_1, u_2, ..., u_k]. The first term of Equation (1) is the reconstruction error, and the second term controls the sparsity of the sparse codes v. Empirically, a larger λ corresponds to a sparser solution.

    Suppose there exists a feature mapping function φ: R^d → R^K (d < K). It maps the feature and the basis to the high-dimensional feature space: x → φ(x), U = [u_1, u_2, ..., u_k] → φ(U) = [φ(u_1), φ(u_2), ..., φ(u_k)]. We substitute the mapped features and basis into the formulation of sparse coding and arrive at kernel sparse representation (KSR):

        min_{U,v}  ‖φ(x) − φ(U)v‖² + λ‖v‖₁                                                       (2)

    where φ(U) = [φ(u_1), φ(u_2), ..., φ(u_k)]. In our work, we use the Gaussian kernel due to its excellent performance in much prior work [22,2]: κ(x_1, x_2) = exp(−γ‖x_1 − x_2‖²). Note that φ(u_i)ᵀφ(u_i) = κ(u_i, u_i) = exp(−γ‖u_i − u_i‖²) = 1, so we can remove the constraint on u_i. Kernel sparse representation seeks the sparse representation of a mapped feature under the mapped basis in the high-dimensional space.

    2.2 Implementation

    The objective of Equation (2) is not convex. Following the work of [28,14], we optimize the sparse codes v and the codebook U alternately.


    Learning the Sparse Codes in the New Feature Space. When the codebook U is fixed, the objective in Equation (2) can be rewritten as:

        min_v ‖φ(x) − φ(U)v‖² + λ‖v‖₁
            = κ(x, x) + vᵀ K_UU v − 2 vᵀ K_U(x) + λ‖v‖₁
            = L(v) + λ‖v‖₁                                                                       (3)

    where L(v) = 1 + vᵀ K_UU v − 2 vᵀ K_U(x), K_UU is a k × k matrix with {K_UU}_ij = κ(u_i, u_j), and K_U(x) is a k × 1 vector with {K_U(x)}_i = κ(u_i, x). The objective is the same as that of sparse coding except for the definition of K_UU and K_U(x), so we can easily extend the Feature-Sign Search algorithm [14] to solve for the sparse codes. The computational cost is also the same, except for the difference in calculating the kernel matrices.
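    As a rough illustration of how Eq. (3) can be solved once K_UU and K_U(x) are available, the sketch below minimizes the same objective with a plain proximal-gradient (ISTA) loop instead of the Feature-Sign Search algorithm that the paper extends; the function names and the solver choice are assumptions, not the authors' implementation.

```python
# Kernelized sparse coding: minimize kappa(x,x) + v^T K_UU v - 2 v^T K_U(x) + lam*||v||_1.
import numpy as np

def gaussian_gram(A, B, gamma):
    """Pairwise Gaussian kernel matrix kappa(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ksr_codes(x, U, gamma, lam, n_iter=200):
    """Sparse codes of feature x (d,) under codebook U (k, d) in the mapped space."""
    K_UU = gaussian_gram(U, U, gamma)                   # (k, k)
    K_Ux = gaussian_gram(U, x[None, :], gamma)[:, 0]    # (k,)
    step = 1.0 / (2 * np.linalg.eigvalsh(K_UU).max())   # step size from the Lipschitz constant
    v = np.zeros(len(U))
    for _ in range(n_iter):
        grad = 2 * (K_UU @ v - K_Ux)                    # gradient of the smooth part L(v)
        z = v - step * grad
        v = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft thresholding
    return v
```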

    Learning the Codebook. When v is fixed, we learn the codebook U. Due to the large number of features, it is hard to use all of them to learn the codebook. Following [28,2], we randomly sample some features to learn the codebook U, then use the learnt U to sparsely encode all the features. Suppose we randomly sample N features; then we rewrite the objective as follows (m, s, t index the columns of the codebook):

        f(U) = (1/N) Σ_{i=1}^{N} [ ‖φ(x_i) − φ(U)v_i‖² + λ‖v_i‖₁ ]
             = (1/N) Σ_{i=1}^{N} [ 1 + Σ_{s=1}^{k} Σ_{t=1}^{k} v_{i,s} v_{i,t} κ(u_s, u_t)
                                   − 2 Σ_{s=1}^{k} v_{i,s} κ(u_s, x_i) + λ‖v_i‖₁ ]               (4)

    Since U appears inside the kernel (κ(u_i, ·)), it is very challenging to adopt the commonly used methods, for example stochastic gradient descent [2], to find the optimal codebook. Instead we optimize each column of U in turn. The derivative of f(U) with respect to u_m (the column to be updated) is:

        ∂f/∂u_m = −(4γ/N) Σ_{i=1}^{N} [ Σ_{t=1}^{k} v_{i,m} v_{i,t} κ(u_m, u_t)(u_m − u_t)
                                        − v_{i,m} κ(u_m, x_i)(u_m − x_i) ]                       (5)

    To find the optimal u_m, we set ∂f/∂u_m = 0. However, it is not easy to solve this equation because of the terms involving κ(u_m, ·). As a compromise, we use an approximate solution in place of the exact one. Similar to the fixed-point algorithm [12], in the n-th update of u_m we use the result from the (n−1)-th update to compute the parts inside the kernel function. Denoting the value of u_m in the n-th update as u_{m,n}, the equation with respect to u_{m,n} becomes:

        ∂f/∂u_{m,n} = −(4γ/N) Σ_{i=1}^{N} [ Σ_{t=1}^{k} v_{i,m} v_{i,t} κ(u_{m,n−1}, u_t)(u_{m,n} − u_t)
                                            − v_{i,m} κ(u_{m,n−1}, x_i)(u_{m,n} − x_i) ] = 0


    When all the remaining columns are fixed, this becomes a linear equation in u_{m,n} and can be solved easily. Following [2], the codebook is initialized with the result of k-means.
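    A hedged sketch of one such column update follows: freezing the kernel values at the previous iterate u_{m,n−1} turns the stationarity condition (5) = 0 into a linear equation in u_m, which the helper below solves in closed form. The function name, array layout, and the degenerate-case fallback are assumptions made for illustration.

```python
# One fixed-point update of codebook column u_m (kernel values frozen at the old u_m).
import numpy as np

def update_column(m, U, X, V, gamma):
    """U: (k, d) codebook, X: (N, d) sampled features, V: (N, k) sparse codes."""
    u_old = U[m]
    k_uu = np.exp(-gamma * ((U - u_old) ** 2).sum(1))   # kappa(u_m_old, u_t), shape (k,)
    k_ux = np.exp(-gamma * ((X - u_old) ** 2).sum(1))   # kappa(u_m_old, x_i), shape (N,)
    # Collapsing Eq. (5) = 0 gives a linear equation  a * u_m = b.
    w_basis = (V * V[:, [m]]).sum(0) * k_uu             # sum_i v_{i,m} v_{i,t} * kappa, (k,)
    w_data = V[:, m] * k_ux                             # v_{i,m} * kappa, (N,)
    a = w_basis.sum() - w_data.sum()
    b = w_basis @ U - w_data @ X                        # right-hand side, shape (d,)
    return b / a if abs(a) > 1e-12 else u_old           # keep the old column if degenerate
```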

    3 Application I: Kernel Sparse Representation for Image Classification

    In this section, we apply kernel sparse representation in the SPM framework and propose KSRSPM. On the one hand, KSRSPM is an extension of ScSPM [28], obtained by replacing sparse coding with KSR. On the other hand, KSRSPM can be regarded as a generalization of Efficient Match Kernel (EMK) [2].

    3.1 Sparse Coding for Codebook Generation

    k-means clustering is usually used to generate the codebook in the BoW model. In k-means, the whole local feature set X = [x_1, x_2, ..., x_N] (where x_i ∈ R^{d×1}) is split into k clusters S = [S_1, S_2, ..., S_k]. Denote the corresponding cluster centers as U = [u_1, u_2, ..., u_k] ∈ R^{d×k}. In hard assignment, each feature is assigned only to its nearest cluster center, and the weight with which the feature contributes to that center is 1. The objective of k-means can be formulated as the following optimization problem:

        min_{U,S} Σ_{i=1}^{k} Σ_{x_j ∈ S_i} ‖x_j − u_i‖² = min_{U,V} Σ_{i=1}^{N} ‖x_i − U v_i‖²
        subject to: Card(v_i) = 1, |v_i| = 1, v_i ⪰ 0, ∀i.                                       (6)

    Here V = [v_1, v_2, ..., v_N] (where v_i ∈ R^{k×1}) holds the cluster indices; each column of V indicates which visual word the local feature should be assigned to. To reduce the information loss in feature quantization, the cardinality constraint on v_i is relaxed. Meanwhile, to avoid each feature being assigned to too many clusters, a sparsity constraint is imposed on v_i. We then arrive at the optimization problem of sparse coding:

        min_{U,V} Σ_{i=1}^{N} ‖x_i − U v_i‖² + λ‖v_i‖₁
        subject to: ‖u_j‖ ≤ 1, j = 1, ..., k.                                                    (7)

    3.2 Maximum Feature Pooling and Spatial Pyramid Matching Based Image Representation

    Following the work of [28,4], we use the maximum pooling method to represent the images. Maximum pooling uses the largest response to each basic pattern to represent a region. More specifically, suppose one image region has D local features and the codebook size is k. After maximum pooling, the region is represented by a k-dimensional vector y whose l-th entry is the largest response to the l-th basis vector over all the sparse codes in the selected region (v_D denotes the sparse codes of the D-th feature in this local region, and v_{Dl} its l-th entry):

        y_l = max{ |v_{1l}|, |v_{2l}|, ..., |v_{Dl}| }                                           (8)

    The SPM technique is also used to preserve spatial information: the whole image is divided into increasingly finer regions, and maximum pooling is applied in each subregion.
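    The following sketch (assumed helper code, not from the paper) shows Eq. (8) max pooling combined with a three-level spatial pyramid over normalized feature coordinates; the 1×1 / 2×2 / 4×4 grid matches the standard SPM layout, while the array layout and function names are illustrative.

```python
# Max pooling of sparse codes over a 3-level spatial pyramid.
import numpy as np

def max_pool(codes):
    """y_l = max_j |v_{jl}| over the sparse codes of one (sub)region; codes: (D, k)."""
    return np.abs(codes).max(axis=0) if len(codes) else np.zeros(codes.shape[1])

def spm_pool(codes, xy, levels=(1, 2, 4)):
    """Concatenate max-pooled vectors over 1x1, 2x2 and 4x4 grids.

    codes: (D, k) sparse codes of all features in the image,
    xy:    (D, 2) feature coordinates normalized to [0, 1).
    """
    feats = []
    for g in levels:
        cell = np.minimum((xy * g).astype(int), g - 1)   # grid cell index per feature
        for cx in range(g):
            for cy in range(g):
                mask = (cell[:, 0] == cx) & (cell[:, 1] == cy)
                feats.append(max_pool(codes[mask]))
    return np.concatenate(feats)                          # length k * (1 + 4 + 16)
```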

    3.3 KSRSPM: A Generalization of Efficient Match Kernel

    Besides being interpreted as an extension of ScSPM [28], KSRSPM can also be interpreted as a generalization of Efficient Match Kernel (EMK) [2]. Let X = [x_1, x_2, ..., x_p] be the set of local features in one image, and let V(x) = [v_1(x), v_2(x), ..., v_p(x)] be the corresponding cluster index vectors from Equation (6). In the BoW model, each image is represented by a normalized histogram v̄(X) = (1/|X|) Σ_{x∈X} v(x), which characterizes its visual word distribution. Using a linear classifier, the resulting kernel function is:

        K_B(X, Y) = (1/(|X||Y|)) Σ_{x∈X} Σ_{y∈Y} v(x)ᵀ v(y) = (1/(|X||Y|)) Σ_{x∈X} Σ_{y∈Y} δ(x, y)    (9)

    where

        δ(x, y) = 1 if v(x) = v(y), and 0 otherwise.                                             (10)

    δ(x, y) is a positive definite kernel, used to measure the similarity between two local features. However, such a hard-assignment based similarity measure increases the information loss and reduces classification accuracy. Thus a continuous kernel is introduced to measure the similarity between local features x and y more accurately:

        K_S(X, Y) = (1/(|X||Y|)) Σ_{x∈X} Σ_{y∈Y} k(x, y)                                         (11)

    Here k(x, y) is a positive definite kernel, referred to as the local kernel. This is related to the normalized sum match kernel [19,9].

    Due to the large number of local features, directly using the local kernel is prohibitive in both storage and computation for image classification. To decrease the computational cost, Efficient Match Kernel (EMK) was introduced. Under the definition of a finite dimensional kernel function [2], k(x, y) = φ(x)ᵀφ(y), we can approximate φ(x) by a low-dimensional feature v_x in the space spanned by k basis vectors H = [φ(u_1), φ(u_2), ..., φ(u_k)]:

        min_{H, v_x} ‖φ(x) − H v_x‖²                                                             (12)

    In this way, each image can be represented by v̄(X)_new = (1/|X|) H Σ_{x∈X} v_x beforehand. As a consequence, the computation can be accelerated.


    EMK maps the local feature to a high-dimensional feature space to evaluate the similarity between local features more accurately, and uses the approximated feature H v_x to construct the linear classifier for image classification. It can be summarized as two stages: (i) x → φ(x): map the feature to the new feature space; (ii) φ(x) ≈ H v_x: reconstruct φ(x) using the basis H.

    Note that directly using the original feature for image classification may cause overfitting [3]. To avoid this, and following the BoW model, we use v_x for image classification. We want each φ(x) to be assigned to only a few clusters, so we add a sparsity constraint to the objective of EMK:

        min_{H, v_x} ‖φ(x) − H v_x‖² + λ‖v_x‖₁                                                   (13)

    This is the same as the objective of our kernel sparse representation, so EMK can be regarded as the special case of our KSRSPM with λ = 0. Compared with EMK, our KSRSPM uses the quantized feature codes for image classification, so it is more robust to noise. What's more, by using maximum pooling, the robustness of our KSRSPM to intra-class variation and noise can be further strengthened.

    3.4 Experiments

    Parameters Setting. SIFT [16] is widely used in image recognition due to its excellent performance. For a fair comparison and to be consistent with previous work [28,13,2], we use SIFT features under the same feature extraction setting. Specifically, we use a dense grid sampling strategy and fix the step size and patch size to 8 and 16, respectively. We also resize the maximum side (width/length) of each image to 300 pixels (for the UIUC-Sport dataset, 400, due to the high resolution of the original images). After obtaining the SIFT features, we use the ℓ2 norm to normalize the feature length to 1. For the codebook size, we set k = 1024 in k-means, and randomly select (5.0–8.0) × 10⁴ features to generate the codebook for each dataset. Following [28], we set λ = 0.30 for all the datasets. As for the parameter γ in the Gaussian kernel, we set it to 1/64, 1/64, 1/128, and 1/256 on Scene 15, UIUC-Sports, Caltech 256 and Corel10, respectively. For SPM, we use the top 3 layers with the same weight for each layer. We use one-vs-all linear SVMs due to their advantage in speed [28] and excellent performance with maximum feature pooling based image classification. All the results for each dataset are based on six independent runs, and the training images are selected randomly.

    Scene 15 Dataset. The Scene 15 dataset [13] is usually used for scene classification. It contains 4485 images divided into 15 categories, with about 200 to 400 images per category. The image content is diverse, covering suburb, coast, forest, highway, inside city, mountain, open country, street, tall building, office, bedroom, industrial, kitchen, living room and store. For a fair comparison, we follow the same experimental setting as [28,13]: we randomly select 100 images per category as training data and use the remaining images as test data. The results are listed in Table 1.

    Table 1. Performance comparison on the Scene 15 dataset (%)

        Method       Average Classification Rate
        KSPM [13]    81.40 ± 0.50
        EMK [2]      77.89 ± 0.85
        ScSPM [28]   80.28 ± 0.93
        KSRSPM       83.68 ± 0.61

    Caltech 256. Caltech 256 (www.vision.caltech.edu/Image_Datasets/Caltech256/) is a very challenging dataset in both image content and scale. First, compared with Caltech 101, the objects in Caltech 256 have larger intra-class variance, and the object locations are no longer centered in the image. Second, Caltech 256 contains 29,780 images divided into 256 categories; more categories inevitably increase the inter-class similarity and the performance degradation. We evaluate the method under four different settings, selecting 15, 30, 45, and 60 images per category as training data and using the rest as test data. The results are listed in Table 2.

    Table 2. Performance comparison on the Caltech 256 dataset (%) (KC: kernel codebook)

        Trn No.  KSPM [8]  KC [6]        EMK [2]     ScSPM [28]    KSRSPM
        15       NA        NA            23.2 ± 0.6  27.73 ± 0.51  29.77 ± 0.14
        30       34.10     27.17 ± 0.46  30.5 ± 0.4  34.02 ± 0.35  35.67 ± 0.10
        45       NA        NA            34.4 ± 0.4  37.46 ± 0.55  38.61 ± 0.19
        60       NA        NA            37.6 ± 0.5  40.14 ± 0.91  40.30 ± 0.22

    UIUC-Sport Dataset. UIUC-Sport [15] contains images collected from 8 different sports: badminton, bocce, croquet, polo, rock climbing, rowing, sailing and snowboarding. There are 1792 images in all, and the number of images per category ranges from 137 to 250. Following the work of Wu et al. [27], we randomly select 70 images from each category as training data, and randomly select another 60 images from each category as test data. The results are listed in Table 3.

    Table 3. Performance comparison on the UIUC-Sport dataset (%)

        Method           Average Classification Rate
        HIK+ocSVM [27]   83.54 ± 1.13
        EMK [2]          74.56 ± 1.32
        ScSPM [28]       82.74 ± 1.46
        KSRSPM           84.92 ± 0.78


    Table 4. Performance comparison on the Corel10 dataset (%) (SMK: Spatial Markov Model)

        Method       Average Classification Rate
        SMK [17]     77.9
        EMK [2]      79.90 ± 1.73
        ScSPM [28]   86.2 ± 1.01
        KSRSPM       89.43 ± 1.27

    Corel10 Dataset. Corel10 [18] contains 10 categories: skiing, beach, buildings, tigers, owls, elephants, flowers, horses, mountains and food. Each category contains 100 images. Following the work of Lu et al. [18], we randomly select 50 images per category as training data and use the rest as test data. The results are listed in Table 4.

    Results Analysis. From Tables 1–4, we can see that on Scene 15, UIUC-Sports and Corel10, KSRSPM outperforms EMK by around (5.7–10.4)% and ScSPM by around (2.2–3.4)%. For Caltech 256, due to the large number of classes, the improvements are not as substantial, but still higher than EMK and ScSPM. We also show the confusion matrices of the Scene 15, UIUC-Sports and Corel10 datasets in Figure 1 and Figure 2. The entry in the i-th row and j-th column of a confusion matrix represents the percentage of class i being misclassified as class j. From the confusion matrices, we can see that some classes are easily misclassified as certain others.

    Feature Quantization Error. Define the Average Quantization Error (AverQE) as AverQE = (1/N) Σ_{i=1}^{N} ‖φ(x_i) − φ(U)v_i‖²_F. It can be used to evaluate the information loss in the feature quantization process; to retain more information, we want the feature quantization error to be small. We compute the AverQE of our kernel sparse representation (KSR) and of sparse coding (Sc) on all the features used for codebook generation, and list the values in Table 5. The results show that kernel sparse representation greatly decreases the feature quantization error.
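    As a sketch of how AverQE can be computed without ever forming φ explicitly, note that for the Gaussian kernel ‖φ(x_i) − φ(U)v_i‖² expands to 1 + v_iᵀ K_UU v_i − 2 v_iᵀ K_U(x_i); the code below (an assumption, not the authors' script) averages exactly that quantity.

```python
# Average quantization error in the mapped feature space (Gaussian kernel).
import numpy as np

def _gram(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def average_quantization_error(X, V, U, gamma):
    """X: (N, d) features, V: (N, k) sparse codes, U: (k, d) codebook."""
    K_UU = _gram(U, U, gamma)                            # (k, k)
    K_UX = _gram(U, X, gamma)                            # (k, N)
    errs = (1.0
            + np.einsum('ik,kl,il->i', V, K_UU, V)       # v_i^T K_UU v_i
            - 2.0 * np.einsum('ik,ki->i', V, K_UX))      # 2 v_i^T K_U(x_i)
    return errs.mean()
```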

    Fig. 1. Confusion matrix on the Scene 15 dataset (%).


Rock Climbing  95.16  0.00   0.67   0.13   0.54   0.81   0.13   2.55
badminton      0.13   93.72  1.67   1.15   1.15   0.64   0.13   1.41
bocce          5.22   3.98   62.19  16.17  3.23   4.98   0.25   3.98
croquet        4.42   0.00   12.85  78.41  2.71   0.70   0.30   0.60
polo           3.13   1.93   3.42   2.38   85.27  2.08   0.15   1.64
rowing         1.11   1.11   1.11   0.00   2.31   89.91  1.85   2.59
sailing        0.28   0.00   0.56   2.50   0.42   2.64   92.78  0.83
snowboarding   5.97   1.25   3.75   1.67   0.83   3.61   0.97   81.94

flower     90.67  0.00   1.67   6.00   0.00   0.00   1.33   0.00   0.33   0.00
elephants  1.67   76.67  6.33   0.33   0.00   3.33   1.00   1.33   5.67   3.67
owls       0.33   4.67   84.33  0.00   1.00   0.67   0.00   3.00   2.33   3.67
tiger      1.00   0.00   0.00   99.00  0.00   0.00   0.00   0.00   0.00   0.00
building   1.00   0.00   0.00   6.00   89.67  0.00   1.67   0.00   0.67   1.00
beach      0.67   0.00   0.00   0.00   0.00   90.00  0.00   6.00   3.33   0.00
skiing     0.00   0.00   0.00   0.00   0.00   0.00   95.33  0.00   3.33   1.33
horses     0.00   0.33   0.33   0.00   0.00   3.00   0.00   96.33  0.00   0.00
mountains  0.00   9.67   3.33   0.00   0.67   1.00   0.00   0.00   84.00  1.33
food       0.00   2.33   2.67   0.33   0.33   0.67   2.00   0.00   3.33   88.33

Fig. 2. Confusion matrices on UIUC-Sports (top) and Corel10 (bottom) (%). Rows are true classes; columns are predicted classes in the same order as the rows.

Table 5. Average Feature Quantization Error on Different Datasets

        Scene     Caltech 256  Sport     Corel
Sc      0.8681    0.9164       0.8864    0.9295
KSR     9.63E-02  5.72E-02     9.40E-02  4.13E-02

This may be the reason that our KSRSPM outperforms ScSPM. The results also agree with our assumption that sparse coding in a high-dimensional space can reduce the feature quantization error.

4 Application II: Kernel Sparse Representation for Face Recognition

    4.1 Sparse Coding for Face Recognition

For face recognition, if sufficient training samples are available from each class, it is possible to represent a test sample as a linear combination of the training samples from the same class [26].

Suppose there are N classes in all, and the training instances for class i are A_i = [a_{i,1}, ..., a_{i,n_i}] ∈ R^{d×n_i}, in which each column corresponds to one instance. Let A = [A_1, ..., A_N] ∈ R^{d×Σ_{i=1}^{N} n_i} be the training set, and y ∈ R^{d×1} be the test sample. When noise e exists, the problem for face recognition [26] can be formulated as follows:

    min ||x_0||_1   s.t.   y = Ax + e = [A  I][x^T  e^T]^T = A_0 x_0        (14)

Sparse coding based image recognition aims at selecting only a few images from all the training instances to reconstruct the image to be tested. Let δ_i = [δ_{i,1}, ..., δ_{i,n_i}] (1 ≤ i ≤ N) be the coefficients corresponding to A_i in x_0. The reconstruction error using the instances from class i can be computed as r_i(y) = ||y − A_i δ_i||_2. The test image is then assigned to the category that minimizes the reconstruction error: identity(y) = arg min_i {r_1(y), ..., r_N(y)}.
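A minimal sketch of this classification rule follows, using scikit-learn's Lasso as a stand-in for the constrained l1 program in Eq. (14); the class-wise residual rule follows [26], but the helper names and the choice of solver are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(A_list, y, alpha=1e-3):
    """Sparse-representation classification: code the test sample y over the
    concatenated training matrix, then assign the class whose training
    columns give the smallest residual r_i(y) = ||y - A_i d_i||_2.
    A_list: list of (d, n_i) arrays, one per class; y: (d,) test sample."""
    A = np.hstack(A_list)                                  # d x sum(n_i)
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    coder.fit(A, y)
    x = coder.coef_                                        # sparse coefficients over all classes
    residuals, start = [], 0
    for A_i in A_list:
        n_i = A_i.shape[1]
        d_i = x[start:start + n_i]                         # coefficients belonging to class i
        residuals.append(np.linalg.norm(y - A_i @ d_i))
        start += n_i
    return int(np.argmin(residuals))
```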

    4.2 Kernel Sparse Representation for Face Recognition

Kernel methods can make features belonging to the same category closer to each other [22]. We therefore apply kernel sparse representation to face recognition.


Firstly, the l1 norm on the reconstruction error is replaced by an l2 norm (we assume that the noise may not be sparsely reconstructed from the training samples). By mapping features to a high-dimensional space, y → φ(y) and A = [a_{1,1}, ..., a_{N,n_N}] → A_φ = [φ(a_{1,1}), ..., φ(a_{N,n_N})], we obtain the objective of kernel sparse representation for face recognition:

    min_x  λ||x||_1 + ||φ(y) − A_φ x||_2^2        (15)

in which the parameter λ is used to balance the weight between the sparsity and the reconstruction error. Following the work of Wright et al. [26], the test image is assigned to the category which minimizes the reconstruction error in the high-dimensional feature space.
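Because Eq. (15) involves φ only through inner products, it can be expanded with the kernel trick: ||φ(y) − A_φ x||^2 = κ(y, y) − 2 k_y^T x + x^T K x, where K_{ij} = κ(a_i, a_j) and (k_y)_i = κ(a_i, y). The sketch below solves the resulting problem with a plain ISTA (proximal gradient) loop and a Gaussian kernel, matching the kernel used in the experiments below; the solver itself is our assumption, not the authors' algorithm.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """kappa(x, y) = exp(-gamma * ||x - y||^2) for all columns of X and Y."""
    d2 = (np.sum(X ** 2, axis=0)[:, None] + np.sum(Y ** 2, axis=0)[None, :]
          - 2.0 * X.T @ Y)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_sparse_code(A, y, gamma, lam=1e-5, n_iter=500):
    """Minimize lam*||x||_1 + ||phi(y) - A_phi x||_2^2 using kernel values only
    (ISTA / proximal gradient; an illustrative solver, not the authors').
    A: (d, n) training samples as columns, y: (d,) test sample."""
    K = gaussian_kernel(A, A, gamma)                     # (n, n) training kernel matrix
    k_y = gaussian_kernel(A, y[:, None], gamma).ravel()  # (n,) kernel values against the test sample
    x = np.zeros(A.shape[1])
    step = 0.5 / np.linalg.norm(K, 2)                    # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * (K @ x - k_y)                       # gradient of the quadratic term
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return x
```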

    4.3 Evaluation on Extended Yale B Database

We evaluate our method on the Extended Yale B Database [7], which contains 38 categories and 2414 frontal-face images. The cropped image size is 192×168. Following the work of [26], we randomly select half of the images in each category for training and use the rest for testing. The following five features are used for evaluation: RandomFace [26], LaplacianFace [10], EigenFace [24], FisherFace [1] and Downsample [26], and each feature is normalized to unit length using the l2 norm. A Gaussian kernel is used in our experiments: κ(x_1, x_2) = exp(−γ||x_1 − x_2||^2). For Eigenfaces, Laplacianfaces, Downsample and Fisherfaces, we set γ = 1/d where d is the feature dimension. For Randomfaces, γ = 1/32d.

The Effect of λ. We first evaluate λ using the 56-D Downsample feature. We list the results for different values of λ in Table 6. When λ ≠ 0, as λ decreases, the performance increases and the proportion of non-zero elements in the coefficients increases, but the computational time also increases. When λ = 0, the objective happens to be that of Efficient Match Kernel, but the performance is not as good as in the case λ ≠ 0. This shows the effectiveness of the sparsity term.

Result Comparison. Considering both the computational cost and the accuracy in Table 6, we set λ = 10^-5. The experimental results are listed in Table 7. All results are averaged over 10 independent runs. The experiments show that kernel sparse representation outperforms sparse coding in face recognition.

Table 6. The Effect of the Sparsity Parameter λ: 56-D Downsample Feature (here sparsity is the percentage of non-zero elements in the sparse codes)

λ                     10^-1   10^-2   10^-3   10^-4   10^-5   10^-6   10^-7   0
sparsity (%)          0.58    0.75    0.88    2.13    4.66    8.35    16.69   -
reconstruction error  0.2399  0.1763  0.1651  0.1113  0.0893  0.0671  0.0462  -
time (sec)            0.0270  0.0280  0.0299  0.0477  0.2445  0.9926  6.2990  -
accuracy (%)          76.92   84.12   85.19   90.32   91.65   93.30   93.47   84.37


Table 7. Performance of Sparse Coding (Sc) and Kernel Sparse Representation (KSR) for Face Recognition (%)

Feature      Method    Dimension 30   56     120    504
Eigen        Sc [26]   86.5           91.63  93.95  96.77
             KSR       89.01          94.42  97.49  99.16
Laplacian    Sc [26]   87.49          91.72  93.95  96.52
             KSR       88.86          94.24  97.11  98.12
Random       Sc [26]   82.6           91.47  95.53  98.09
             KSR       85.46          92.36  96.14  98.37
Downsample   Sc [26]   74.57          86.16  92.13  97.1
             KSR       83.57          91.65  95.31  97.8
Fisher       Sc [26]   86.91          NA     NA     NA
             KSR       88.93          NA     NA     NA

To further illustrate the performance of KSR, we calculate the similarity between the sparse codes of KSR and Sc for three classes (each class contains 32 images). We show the results in Figure 3, in which the entry at (i, j) is the sparse-code similarity (normalized correlation) between images i and j. A good sparse coding method should make the sparse codes of images from the same class more similar, so the similarity matrix should be block-wise. From Figure 3 we can see that KSR produces more discriminative sparse codes than sparse coding, which explains its better recognition performance.

Fig. 3. Similarity between the sparse codes of KSR and Sc (one pairwise similarity matrix per method over the 96 images; axis ticks and colour bar omitted)
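The similarity matrices in Figure 3 can be reproduced from a matrix of sparse codes with a few lines of NumPy; the sketch below (with illustrative names, and taking "normalized correlation" as cosine similarity between codes) computes the pairwise similarities. With images grouped by class, a discriminative coder yields a visibly block-diagonal pattern.

```python
import numpy as np

def sparse_code_similarity(V):
    """Pairwise normalized correlation between sparse codes.
    V: (codebook_size, n_images); column j holds the sparse code of image j.
    Returns an (n_images, n_images) similarity matrix."""
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    Vn = V / np.maximum(norms, 1e-12)     # unit-normalize each code
    return Vn.T @ Vn                      # cosine / normalized correlation
```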

    5 Conclusion

In this paper, we propose a new technique, Kernel Sparse Representation, which is sparse coding in a high-dimensional feature space induced by an implicit feature mapping. We apply KSR to image classification and face recognition. For image classification, our proposed KSRSPM can be regarded both as an extension of ScSPM and as a generalization of EMK. For face recognition, KSR learns more discriminative sparse codes for face category


identification. Experimental results on several publicly available datasets show that our KSR outperforms both ScSPM and EMK, and achieves state-of-the-art performance.

    References

1. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. TPAMI 19(7), 711-720 (1997)
2. Bo, L., Sminchisescu, C.: Efficient match kernels between sets of features for visual recognition. In: NIPS (2009)
3. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based image classification. In: CVPR (2008)
4. Boureau, Y., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition (2010)
5. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley & Sons, Chichester (2001)
6. van Gemert, J.C., Geusebroek, J.M., Veenman, C.J., Smeulders, A.W.M.: Kernel codebooks for scene categorization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 696-709. Springer, Heidelberg (2008)
7. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: Illumination cone models for face recognition under variable lighting and pose. TPAMI 23(6), 643-660 (2001)
8. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report (2007)
9. Haussler, D.: Convolution kernels on discrete structures. Technical Report (1999)
10. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.: Face recognition using Laplacianfaces. TPAMI 27(3), 328-340 (2005)
11. Ho, J., Yang, M.H., Lim, J., Lee, K.C., Kriegman, D.J.: Clustering appearances of objects under varying illumination conditions. In: CVPR (2003)
12. Hyvarinen, A.: The fixed-point algorithm and maximum likelihood estimation for independent component analysis. Neural Process. Lett. 10(1) (1999)
13. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR, pp. 2169-2178 (2006)
14. Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: NIPS, pp. 801-808 (2006)
15. Li, L.J., Fei-Fei, L.: What, where and who? Classifying events by scene and object recognition. In: ICCV (2007)
16. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91-110 (2004)
17. Lu, Z., Ip, H.H.: Image categorization by learning with context and consistency. In: CVPR (2009)
18. Lu, Z., Ip, H.H.: Image categorization with spatial mismatch kernels. In: CVPR (2009)
19. Lyu, S.: Mercer kernels for object recognition with local features. In: CVPR, pp. 223-229 (2005)
20. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Non-local sparse models for image restoration. In: ICCV (2009)
21. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)
22. Scholkopf, B., Smola, A.J., Muller, K.R.: Kernel principal component analysis. In: International Conference on Artificial Neural Networks, pp. 583-588 (1997)
23. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: ICCV, pp. 1470-1477 (2003)
24. Turk, M., Pentland, A.: Eigenfaces for recognition. In: CVPR (1991)
25. Wang, C., Yan, S., Zhang, L., Zhang, H.J.: Multi-label sparse coding for automatic image annotation. In: CVPR (2009)
26. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. TPAMI 31(2), 210-227 (2009)
27. Wu, J., Rehg, J.M.: Beyond the Euclidean distance: Creating effective visual codebooks using the histogram intersection kernel. In: ICCV (2009)
28. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: CVPR (2009)

Every Picture Tells a Story: Generating Sentences from Images

Ali Farhadi1, Mohsen Hejrati2, Mohammad Amin Sadeghi2, Peter Young1, Cyrus Rashtchian1, Julia Hockenmaier1, David Forsyth1

1 Computer Science Department, University of Illinois at Urbana-Champaign
{afarhad2,pyoung2,crashtc2,juliahmr,daf}@illinois.edu
2 Computer Vision Group, School of Mathematics, Institute for Studies in Theoretical Physics and Mathematics (IPM)
{m.a.sadeghi,mhejrati}@gmail.com

Abstract. Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.

    1 Introduction

For most pictures, humans can prepare a concise description in the form of a sentence relatively easily. Such descriptions might identify the most interesting objects, what they are doing, and where this is happening. These descriptions are rich, because they are in sentence form. They are accurate, with good agreement between annotators. They are concise: much is omitted, because humans tend not to mention objects or events that they judge to be less significant. Finally, they are consistent: in our data, annotators tend to agree on what is mentioned. Barnard et al. name two applications for methods that link text and images: illustration, where one finds pictures suggested by text (perhaps to suggest illustrations from a collection); and annotation, where one finds text annotations for images (perhaps to allow keyword search to find more images) [1].

This paper investigates methods to generate short descriptive sentences from images. Our contributions include: We introduce a dataset to study this problem (section 3.1). We introduce a novel representation intermediate between images and sentences (section 2.1). We describe a novel, discriminative approach that produces very good results at sentence annotation (section 2.4). For illustration,



out of vocabulary words pose serious difficulties, and we show methods to use distributional semantics to cope with these issues (section 3.4). Evaluating sentence generation is very difficult, because sentences are fluid, and quite different sentences can describe the same phenomena. Worse, synecdoche (for example, substituting "animal" for "cat" or "bicycle" for "vehicle") and the general richness of vocabulary mean that many different words can quite legitimately be used to describe the same picture. In section 3, we describe a quantitative evaluation of sentence generation at a useful scale.

Linking individual words to images has a rich history, and space allows only a mention of the most relevant papers. A natural strategy is to try and predict words from image regions. The first image annotation system is due to Mori et al. [2]; Duygulu et al. continued this tradition using models from machine translation [3]. Since then, a wide range of models has been deployed (reviews in [4,5]); the current best performer is a form of nearest neighbours matching [6]. The most recent methods perform fairly well, but still find difficulty placing annotations on the correct regions.

Sentences are richer than lists of words, because they describe activities, properties of objects, and relations between entities (among other things). Such relations are revealing: Gupta and Davis show that respecting likely spatial relations between objects markedly improves the accuracy of both annotation and placing [7]. Li and Fei-Fei show that event recognition is improved by explicit inference on a generative model representing the scene in which the event occurs and also the objects in the image [8]. Using a different generative model, Li and Fei-Fei demonstrate that relations improve object labels, scene labels and segmentation [9]. Gupta and Davis show that respecting relations between objects and actions improves recognition of each [10,11]. Yao and Fei-Fei use the fact that objects and human poses are coupled and show that recognizing one helps the recognition of the other [12]. Relations between words in annotating sentences can reveal image structure. Berg et al. show that word features suggest which names in a caption are depicted in the attached picture, and that this improves the accuracy of links between names and faces [13]. Mensink and Verbeek show that complex co-occurrence relations between people improve face labelling, too [14]. Luo, Caputo and Ferrari [15] show benefits of associating faces and poses to names and verbs in predicting who is doing what in news articles. Coyne and Sproat describe an auto-illustration system that gives naive users a method to produce rendered images from free text descriptions (WordsEye; [16]; http://www.wordseye.com).

There are few attempts to generate sentences from visual data. Gupta et al. generate sentences narrating a sports event in video using a compositional model based around AND-OR graphs [17]. The relatively stylised structure of the events helps both in sentence generation and in evaluation, because it is straightforward to tell which sentence is right. Yao et al. show some examples of both temporal narrative sentences (i.e. this happened, then that) and scene description sentences generated from visual data, but there is no evaluation [18].


Fig. 1. There is an intermediate space of meaning which has different projections to the space of images and sentences. Once we learn the projections we can generate sentences for images and find images best described by a given sentence. (The figure shows the image space, the meaning space of triplets such as <bus, park, street>, and the sentence space of descriptions such as "A yellow bus is parking in the street".)

These methods generate a direct representation of what is happening in a scene, and then decode it into a sentence.

An alternative, which we espouse, is to build a scoring procedure that evaluates the similarity between a sentence and an image. This approach is attractive, because it is symmetric: given an image (resp. sentence), one can search for the best sentence (resp. image) in a large set. This means that one can do both illustration and annotation with one method. Another attraction is that the method does not need a strong syntactic model, which is represented by the prior on sentences. Our scoring procedure is built around an intermediate representation, which we call the meaning of the image (resp. sentence). In effect, image and sentence are each mapped to this intermediate space, and the results are compared; similar meanings result in a high score. The advantage of doing so is that each of these maps can be adjusted discriminatively. While the meaning space could be abstract, in our implementation we use a direct representation of simple sentences as a meaning space. This allows us to exploit distributional semantics ideas to deal with out of vocabulary words. For example, we have no detector for "cattle"; but we can link sentences containing this word to images, because distributional semantics tells us that cattle is similar to sheep and cow, etc. (Figure 6)

    2 Approach

Our model assumes that there is a space of Meanings that comes between the space of Sentences and the space of Images. We evaluate the similarity between a sentence and an image by (a) mapping each to the meaning space then (b) comparing the results. Figure 1 depicts the intermediate space of meanings. We will learn the mapping from images (resp. sentences) to meaning discriminatively from pairs of images (resp. sentences) and assigned meaning representations.
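To make the two-step scoring concrete, here is a toy sketch (names and the comparison function are ours): both the image and the sentence are projected to a meaning triplet (object, action, scene) and the two projections are compared. The actual system uses learned discriminative projections and graded word similarities rather than exact slot matching.

```python
from typing import Callable, Tuple

Triplet = Tuple[str, str, str]          # (object, action, scene)

def meaning_similarity(m1: Triplet, m2: Triplet) -> float:
    """Toy comparison in the meaning space: fraction of matching slots."""
    return sum(a == b for a, b in zip(m1, m2)) / 3.0

def score(image, sentence,
          image_to_meaning: Callable[[object], Triplet],
          sentence_to_meaning: Callable[[str], Triplet]) -> float:
    """Symmetric image-sentence score: map both into the meaning space,
    then compare the two meaning estimates."""
    return meaning_similarity(image_to_meaning(image), sentence_to_meaning(sentence))

# usage: with hypothetical learned projections img2mean and sent2mean,
# score(img, "A yellow bus is parking in the street", img2mean, sent2mean)
```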




    2.2 Image Potentials

We need informative features to drive the mapping from the image space to the meaning space.

Node Potentials. To provide information about the nodes on the MRF we first need to construct image features. Our image features consist of:

Felzenszwalb et al. detector responses. We use Felzenszwalb detectors [19] to predict confidence scores on all the images. We set the threshold such that all of the classes get predicted at least once in each image. We then consider the max confidence of the detections for each category, the location of the center of the detected bounding box, the aspect ratio of the bounding box, and its scale.

Hoiem et al. classification responses. We use the classification scores of Hoiem et al. [20] for the PASCAL classification tasks. These classifiers are based on geometry, HOG features, and detection responses.

Gist-based scene classification responses. We encode global information of images using gist [21]. Our features for scenes are the confidences of our Adaboost-style classifier for scenes.

First we build node features by fitting a discriminative classifier (a linear SVM) to predict each of the nodes independently from the image features. Although the classifiers are learned independently, they are well aware of other objects and scene information. We call these estimates node features. This is a number-of-nodes-dimensional vector, and each element in this vector provides a score for a node given the image. This can serve as a node potential for object, action, and scene nodes. We expect similar images to have similar meanings, and so we obtain a set of features by matching our test image to training images. We combine these features into various other node potentials as below (a small sketch of the neighbour-averaging step follows the list):

- By matching image features, we obtain the k nearest neighbours in the training set to the test image, then compute the average of the node features over those neighbours, computed from the image side. By doing so, we have a representation of what the node features are for similar images.

- By matching image features, we obtain the k nearest neighbours in the training set to the test image, then compute the average of the node features over those neighbours, computed from the sentence side. By doing so, we have a representation of what the sentence representation does for images that look like our image.

- By matching the node features derived from classifiers and detectors (above), we obtain the k nearest neighbours in the training set to the test image, then compute the average of the node features over those neighbours, computed from the image side. By doing so, we have a representation of what the node features are for images that produce similar classifier and detector outputs.

- By matching the node features derived from classifiers and detectors (above), we obtain the k nearest neighbours in the training set to the test image, then compute the average of the node features over those neighbours, computed from the sentence side. By doing so, we have a representation of what the sentence representation does for images that produce similar classifier and detector outputs.
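A sketch of the neighbour-averaging step shared by the four variants above (the helper and array names are illustrative): the query is matched either in image-feature space or in node-feature space, and the averaged node features can come from the image side or the sentence side of the training pairs.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_average_node_features(query_feats, train_feats, train_node_feats, k=5):
    """Average the node features of the k training images closest to each query.
    query_feats: (n_q, d) query features; train_feats: (n_t, d) training
    features in the same space; train_node_feats: (n_t, m) per-node scores
    for the training images (image-side or sentence-side). Returns (n_q, m)."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
    _, idx = nn.kneighbors(query_feats)          # (n_q, k) indices of nearest training images
    return train_node_feats[idx].mean(axis=1)    # (n_q, m) averaged node features
```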

Edge Potentials. Introducing a parameter for each edge results in an unmanageable number of parameters. In addition, estimates of the parameters for the majority of edges would be noisy, so there are serious smoothing issues. We adopt an approach similar to Good-Turing smoothing methods to a) control the number of parameters and b) do smoothing. We have multiple estimates for the edge potentials, which can provide more accurate estimates if used together. We form linear combinations of these potentials. Therefore, in learning we are interested in finding the weights of the linear combination of the initial estimates, so that the final linearly combined potentials provide values on the MRF such that the ground truth triplet is the highest scored triplet for all examples. This way we limit the number of parameters to the number of initial estimates.

We have four different estimates for edges. Our final score on an edge takes the form of a linear combination of these estimates. Our four estimates for the edge from node A to node B are (a sketch of these corpus statistics follows the list):

- The normalized frequency of the word A in our corpus, f(A).
- The normalized frequency of the word B in our corpus, f(B).
- The normalized frequency of A and B occurring at the same time, f(A, B).
- f(A, B) / (f(A) f(B)).
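A minimal sketch of these corpus statistics (function and variable names are ours): treating the corpus as a list of word lists, it returns the four normalized-frequency estimates for a word pair.

```python
from collections import Counter
from itertools import combinations

def edge_estimates(corpus, a, b):
    """Corpus-frequency estimates for the edge between words a and b.
    `corpus` is a list of word lists (e.g. one list per caption); counts are
    normalized by the number of documents. Returns f(a), f(b), f(a, b),
    and f(a, b) / (f(a) f(b))."""
    n = len(corpus)
    word_counts, pair_counts = Counter(), Counter()
    for doc in corpus:
        words = set(doc)
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    f_a = word_counts[a] / n
    f_b = word_counts[b] / n
    f_ab = pair_counts[frozenset((a, b))] / n
    ratio = f_ab / (f_a * f_b) if f_a and f_b else 0.0
    return f_a, f_b, f_ab, ratio
```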

    2.3 Sentence Potentials

We need a representation of the sentences. We represent a sentence by computing the similarity between the sentence and our triplets. For that we need to have a notion of similarity for objects, scenes and actions in text.

We used the Curran & Clark parser [22] to generate a dependency parse for each sentence. We extracted the subject, direct object, and any nmod dependencies involving a noun and a verb. These dependencies were used to generate the (object, action) pairs for the sentences. In order to extract the scene information from the sentences, we extracted the head nouns of the prepositional phrases (except for the prepositions "of" and "with"), and the head nouns of the phrase "X in the background".

Lin Similarity Measure for Objects and Scenes. We use the Lin similarity measure [23] to determine the semantic distance between two words. The Lin similarity measure uses WordNet synsets as the possible meanings of each word. The noun synsets are arranged in a hierarchy based on hypernym (is-a) and hyponym (instance-of) relations. Each synset is defined as having an information content based on how frequently the synset or a hyponym of the synset occurs in


a corpus (in this case, SemCor). The similarity of two synsets is defined as twice the information content of the least common ancestor of the synsets divided by the sum of the information content of the two synsets. Similar synsets will have an LCA that covers the two synsets, and very little else. When we compared two nouns, we considered all pairs of a filtered list of synsets for each noun, and used the most similar synsets. We filtered the list of synsets for each noun by limiting it to the first four synsets that were at least 10% as frequent as the most common synset of that noun. We also required the synsets to be physical entities.
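For illustration, NLTK exposes WordNet synsets and Lin similarity with SemCor information content; the sketch below compares all noun-synset pairs of two words and keeps the best score, omitting the frequency and physical-entity filtering described above. It assumes the NLTK 'wordnet' and 'wordnet_ic' data packages are installed.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content computed from SemCor, as described above.
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

def noun_similarity(word1, word2):
    """Lin similarity between two nouns: compare all pairs of their noun
    synsets and keep the best score (no synset filtering here)."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            try:
                best = max(best, s1.lin_similarity(s2, semcor_ic))
            except Exception:   # synset pairs with no usable IC can raise
                continue
    return best

print(noun_similarity('cattle', 'cow'))  # expected to be relatively high
```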

Action Co-occurrence Score. We generated a second image caption dataset consisting of roughly 8,000 images pulled from six Flickr groups. For all pairs of verbs, we used the likelihood ratio to determine whether the two verbs co-occurring in the different captions of the same image is significant. We then used the likelihood ratio as the similarity score for positively correlated verb pairs, and the negative of the likelihood ratio as the similarity score for negatively correlated verb pairs. Typically, we found that this procedure discovered verbs that were either describing the same action or describing two actions that commonly co-occurred.
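The likelihood ratio can be computed from a 2x2 contingency table of caption co-occurrence counts. The sketch below uses the standard Dunning-style G^2 form, which is one common reading of "likelihood ratio"; the paper does not spell out its exact formula, so treat this as an assumption.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning-style log-likelihood ratio for a 2x2 co-occurrence table:
    k11 = #images whose captions contain both verbs, k12/k21 = only one,
    k22 = neither. Larger values indicate stronger association; the sign of
    the correlation decides whether the score is used as a positive or
    negative similarity, as described above."""
    def entropy(*counts):
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)
    return 2.0 * (entropy(k11, k12, k21, k22)
                  - entropy(k11 + k12, k21 + k22)
                  - entropy(k11 + k21, k12 + k22))

print(log_likelihood_ratio(30, 20, 25, 7925))  # toy counts
```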

Node Potentials. We can now provide a similarity measure between sentences and objects, actions, and scenes using the scores explained above. Below we explain our estimates of the sentence node potentials.

- First we compute the similarity of each object, scene, and action extracted from each sentence. This gives us the first estimate for the potentials over the nodes. We call this the sentence node feature.

- For each sentence, we also compute the average of the sentence node features of the other four sentences describing the same image in the train set.

- We compute the average of the k nearest neighbours in the sentence node feature space for a given sentence. We consider this our third estimate for nodes.

- We also compute the average of the image node features for the images corresponding to the nearest neighbours in the item above.

- The average of the sentence node features of the reference sentences for the nearest neighbours in item 3 is our fifth estimate for nodes.

- We also include the sentence node feature for the reference sentence.

Edge Potentials. The edge estimates for sentences are identical to the edge estimates for the images explained in the previous section.

    2.4 Learning

There are two mappings that need to be learned. The map from the image space to the meaning space uses the image potentials, and the map from the sentence space to the meaning space uses the sentence potentials. Learning the mapping from images to meaning involves finding the weights on the linear combinations of our image potentials on nodes and edges so that the ground truth triplets score


highest among all other triplets for all examples. This is a structure learning problem [24] which takes the form of

    min_w  (λ/2) ||w||^2 + (1/n) Σ_{i ∈ examples} ξ_i        (1)

    subject to
        w·Φ(x_i, y_i) + ξ_i ≥ max_{y ∈ meaning space} [ w·Φ(x_i, y) + L(y_i, y) ]    ∀ i ∈ examples
        ξ_i ≥ 0    ∀ i ∈ examples

where λ is the trade-off factor between the regularization and the slack variables ξ, Φ is our feature function, x_i corresponds to our ith image, and y_i is our structured label for the ith image. We use the stochastic subgradient descent method [25] to solve this minimization.
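One stochastic subgradient step for this objective looks as follows (a schematic sketch; feature_fn, loss_fn and the triplet enumeration are placeholders for the potentials and loss described above, not the authors' exact implementation):

```python
import numpy as np

def structured_subgradient_step(w, x_i, y_i, feature_fn, loss_fn, candidate_ys, lam, lr):
    """One stochastic subgradient step for the margin-rescaled structured
    objective above. feature_fn(x, y) returns the joint feature vector
    Phi(x, y) as a NumPy array; candidate_ys enumerates the meaning space
    (object, action, scene triplets)."""
    # loss-augmented inference: the most violating triplet
    y_hat = max(candidate_ys,
                key=lambda y: w @ feature_fn(x_i, y) + loss_fn(y_i, y))
    grad = lam * w                                  # gradient of the regularizer
    if w @ feature_fn(x_i, y_hat) + loss_fn(y_i, y_hat) > w @ feature_fn(x_i, y_i):
        # subgradient of the hinge term when the constraint is violated
        grad = grad + feature_fn(x_i, y_hat) - feature_fn(x_i, y_i)
    return w - lr * grad
```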

    3 Evaluation

We emphasize quantitative evaluation in our work. Our vocabulary of meaning is significantly larger than the equivalent in [8,9]. Evaluation requires innovation both in datasets and in measurement, described below.

    3.1 Dataset

We need a dataset with images and corresponding sentences, and also labels for our representations of the meaning space. No such dataset exists. We build our own dataset of images and sentences around the PASCAL 2008 images. This means we can use and compare to state-of-the-art models and image annotations in the PASCAL dataset.

PASCAL Sentence data set. To generate the sentences, we started with the 2008 PASCAL development kit. We randomly selected 50 images belonging to each of the 20 categories. Once we had a set of 1000 images, we used Amazon's Mechanical Turk to generate five captions for each image. We required the annotators to be based in the US, and that they pass a qualification exam testing their ability to identify spelling errors, grammatical errors, and descriptive captions. More details about the methods of collection can be found in [26]. Our dataset has 5 sentences for each of the thousand images, resulting in 5000 sentences. We also manually add labels for triplets of objects, actions, and scenes for each image. These triplets label the main object in the image, the main action, and the main place. There are 173 different triplets in our train set and 123 in the test set. There are 80 triplets in the test set that appeared in the train set. The dataset is available at http://vision.cs.uiuc.edu/pascal-sentences/.

    3.2 Inference

Our model is learned to maximize the sum of the scores along the path identified by a triplet. In inference we search for the triplet which gives us the best


additive score, argmax_y w^T Φ(x_i, y). These models prefer triplets with combination o