
Diagnosing State-Of-The-Art Object Proposal Methods

Hongyuan Zhu¹
[email protected]

Shijian Lu¹
[email protected]

Jianfei Cai²
[email protected]

Guangqing Lee³
[email protected]

¹ Institute for Infocomm Research, A*STAR, Singapore
² Nanyang Technological University, Singapore
³ Nanyang Polytechnic, Singapore

© 2015. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms. Pages 11.1-11.12. DOI: https://dx.doi.org/10.5244/C.29.11

Abstract

Object proposal has become a popular paradigm to replace exhaustive sliding window search in current top-performing methods on PASCAL VOC and ImageNet. Recently, Hosang et al. [17] conducted the first unified study of existing methods in terms of various image-level degradations. However, the vital question "what object-level characteristics really affect existing methods' performance?" has not yet been answered. Inspired by Hoiem et al.'s work in categorical object detection [16], this paper conducts the first meta-analysis of the impact of various object-level characteristics on state-of-the-art object proposal methods. Specifically, we examine the effects of object size, aspect ratio, iconic view, color contrast, shape regularity and texture. We also analyse existing methods' localization accuracy and latency for the various PASCAL VOC object classes. Our study reveals the limitations of existing methods with respect to non-iconic views, small object size, low color contrast, shape regularity, etc. Based on our observations, lessons are learned and shared regarding both the selection of existing object proposal techniques and the design of future ones.

1 Introduction

Recent top-performing methods on PASCAL VOC [11] and ImageNet [29] make use of object proposals to replace exhaustive window search [5, 10, 13, 14, 34]. Object proposals' effectiveness is rooted in the assumption that there are general cues that differentiate objects from the background. Since the very first work by Alexe et al. [1], many object proposal methods have been proposed [2, 3, 4, 9, 18, 20, 23, 26, 27, 32, 37] and tested on various large-scale datasets [11, 22, 29], and their overall detection rates versus different thresholds or window numbers have been reported. Yet such partial performance summaries give us little idea of a method's strengths and weaknesses for further improvement, and users still face difficulties in choosing methods for their applications. Therefore, a more detailed analysis of existing state-of-the-art methods is critical for future research and applications.

There is considerable work in categorical object detection [6, 8, 15, 33, 35] studying the impact of different object-level properties (such as occlusion, aspect ratio and viewpoint change) on the performance of categorical object detectors. Others have investigated dataset design [11, 25, 31] and the impact of the amount of training data [36]. Our work is greatly inspired by the recent research of Hoiem et al. [16] and Russakovsky et al. [29] in categorical object detection and of Hosang et al. [17] in object proposals. Hoiem et al.'s work [16] provides an in-depth analysis of existing categorical object detectors on PASCAL VOC in terms of object size, aspect ratio, parts and viewpoints, yet their annotations cover only a limited number of object classes. Russakovsky et al. [29] provide a further in-depth analysis of state-of-the-art categorical object detectors on ILSVRC in terms of various object attributes such as color, shape and texture, though those annotations are not publicly available. Hosang et al. [17] conduct the first unified and comprehensive comparison of existing object proposal methods in terms of various image-level degradations, but the more relevant object-level characteristics analysis is missing.

Our contributions can be summarized in three aspects. First, we investigate the influence of object-level characteristics on state-of-the-art object proposal methods for the first time. Although there is some similar work in categorical object detection, little research has been conducted on the object proposal side, to the best of our knowledge. Second, we introduce the concept of localization latency to evaluate a method's localization efficiency and accuracy. Third, we create a fully annotated PASCAL VOC dataset with various object-level characteristics to facilitate our analysis. The annotations took nearly one month and will be released to facilitate further related research.

2 Evaluation Protocol

2.1 PASCAL VOC detection dataset

Dataset Selection and Annotation: Our experiments are based on the PASCAL VOC 2007 test set, which has been widely used to evaluate object proposal methods [1, 4, 17, 37]. The test set has around 15000 object instances across 20 object classes. Hoiem et al. [16] provide annotations for seven object classes with characteristics such as object size, aspect ratio and occlusion. In our work, we keep the existing annotations for object size and aspect ratio, and extend them to all other object classes. In addition, we annotate other object properties (color contrast, shape regularity and textureness), as studied in Russakovsky et al. [29]. The annotations were made by two students and finally confirmed by an expert to reduce labelling ambiguity.

Evaluation Criteria: A proposed window $B$ is treated as detected if its Intersection-over-Union (IoU) with a ground-truth bounding box $\bar{B}$,

$$\mathrm{IoU}(\bar{B}, B) = \frac{\mathrm{area}(B \cap \bar{B})}{\mathrm{area}(B \cup \bar{B})},$$

is above a certain threshold $T$. The 'best instance IoU' (BIoU) measures the maximum IoU between a ground-truth instance $G$ and a group of window candidates $S$:

$$\mathrm{BIoU}(G, S) = \max_{s \in S} \mathrm{IoU}(G, s) \quad (1)$$

The windows for all methods are pre-computed by Hosang et al. [17]. We adopt the commonly used evaluation criterion of fixing the number of proposals and measuring how recall varies with the IoU threshold. Due to the page limit, we choose the 'recall versus IoU' curve for 1000 windows to better reflect existing methods' performance in terms of localization accuracy and detection rate. The numbers next to each method in Figs. 1-7 indicate the area under the curve and the average window number per image, respectively. We also attach the figures for other window numbers (100 and 10000) in the supplementary material for reference. The curves for 'recall versus region number' and 'area under recall-versus-IoU' are also attached to provide complementary information.
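To make the protocol concrete, below is a minimal Python sketch of IoU, BIoU (Eq. 1) and the recall-versus-IoU curve for a fixed proposal set. The [x1, y1, x2, y2] box format and all function names are our own illustrative assumptions, not the released evaluation code.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def best_instance_iou(gt_box, proposals):
    """BIoU (Eq. 1): maximum IoU between one ground truth and a proposal set."""
    return max(iou(gt_box, p) for p in proposals)

def recall_vs_iou(gt_boxes, proposals, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Recall-versus-IoU curve: fraction of instances detected at each threshold."""
    bious = np.array([best_instance_iou(g, proposals) for g in gt_boxes])
    return [(float(t), float((bious > t).mean())) for t in thresholds]
```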


2.2 Object Proposal Methods

We choose representative and recent top-performing object proposal methods on the PASCAL VOC 2007 dataset as our experimental candidates to reflect recent developments. Existing methods can be divided into two major groups: window based methods and region based methods.

Window based methods generate proposals by ranking each sampled window according to the likelihood that it contains an object of interest.

1) Objectness (O) [1] assigns an objectness likelihood by sampling initial windows at salient object locations; cues from color contrast, edge distribution and superpixel straddling are then integrated using Bayes' rule.

2) Rahtu2011 (R1) [26] extends [1] with new objectness cues from superpixels and image gradients, and applies a structured output ranking method to combine them.

3) BING (B) [4] ranks coarse-to-fine sliding windows using classifiers trained on the normalized gradient feature. It captures closed contour information and can run at 300fps on a desktop.

4) EdgeBox (EB) [37] also evaluates objectness with sliding windows, but the scoring is derived from the state-of-the-art structured forest contour detector [7]. Each window is assigned a score corresponding to the strength of the enclosed contour signal.

Region based methods generate multiple foreground regions that correspond to objects. The process starts from one or multiple over-segmentations; the regions are then grouped according to multiple cues, and finally the candidate regions are ranked either according to a region hierarchy or by using learned ranking classifiers. A region proposal can be transformed into a window proposal by enclosing the region with its tightest bounding box.
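As a small illustration of the region-to-window conversion just described, the following sketch (function name is our own) computes the tightest bounding box of a binary region mask:

```python
import numpy as np

def mask_to_window(mask):
    """Tightest [x1, y1, x2, y2] window enclosing a binary region mask."""
    ys, xs = np.nonzero(mask)  # coordinates of foreground pixels
    return [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]
```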

1) CPMC (C) [3] uses non-overlapping seeds to train appearance models for foreground and background, and solves graph cuts to generate foreground proposals. The proposals are then ranked according to various cues. Endres et al. [9] apply a similar pipeline, but use a seeding map derived from occlusion boundaries and a different ranking model.

2) MCG (M) [2] makes use of a novel multi-scale eigenvector solver to quickly globalize object contours. Regions are then merged into candidates according to contour similarities at different scales. Further candidates are generated by merging up to four regions, and are ranked according to various cues, similar to CPMC.

3) SelectiveSearch (SS) [32] also starts from superpixels, but applies a diversified approach that forms object regions by merging them with a set of hand-crafted features and similarity functions.

4) RandomizedPrim (RP) [23] follows an approach similar to SelectiveSearch, but the weights between superpixels are learned and the merging process is randomized.

5) Geodesic (G) [21] places seeds using heuristics or learned classifiers for a subsequent geodesic distance transform. The level sets of each geodesic transform define and rank the foregrounds.

6) RIGOR (RI) [18] is similar to CPMC, but the contour detector is replaced with the fast structured forest [7] and the algorithm is sped up by reusing inference.

Baselines are also considered in our work to provide reference points for studying the correlation between existing methods and their components. We consider two baselines from Hosang et al. [17]. 1) The Uniform (U) baseline generates proposals by uniformly sampling the object center, square-root area and aspect ratio. 2) The Superpixel (SP) baseline of [12] is adopted, as four methods in this work use its superpixels; it over-segments an image into regions, each of which is treated as a candidate.


3 Profiling Localization Accuracy and Latency

| Method | Person | Bird | Cat | Cow | Dog | Horse | Sheep | Plane | Bike | Boat | Bus | Car | MBike | Train | Bottle | Chair | Table | Plant | Sofa | TV | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BING | 0.58 | 0.57 | 0.70 | 0.58 | 0.67 | 0.63 | 0.57 | 0.61 | 0.61 | 0.53 | 0.62 | 0.56 | 0.62 | 0.67 | 0.49 | 0.55 | 0.65 | 0.56 | 0.68 | 0.59 | 0.60 |
| EB | 0.65 | 0.69 | 0.80 | 0.72 | 0.80 | 0.77 | 0.70 | 0.74 | 0.74 | 0.64 | 0.78 | 0.66 | 0.74 | 0.78 | 0.52 | 0.63 | 0.74 | 0.63 | 0.77 | 0.76 | 0.71 |
| OBJ | 0.57 | 0.57 | 0.72 | 0.59 | 0.69 | 0.67 | 0.56 | 0.64 | 0.63 | 0.54 | 0.68 | 0.57 | 0.63 | 0.70 | 0.45 | 0.53 | 0.70 | 0.54 | 0.72 | 0.60 | 0.61 |
| Rahtu | 0.57 | 0.57 | 0.79 | 0.61 | 0.77 | 0.73 | 0.59 | 0.69 | 0.67 | 0.53 | 0.73 | 0.57 | 0.69 | 0.77 | 0.38 | 0.50 | 0.74 | 0.52 | 0.79 | 0.68 | 0.64 |
| Mean | 0.59 | 0.60 | 0.75 | 0.63 | 0.73 | 0.70 | 0.60 | 0.67 | 0.66 | 0.56 | 0.70 | 0.59 | 0.67 | 0.73 | 0.46 | 0.55 | 0.71 | 0.56 | 0.74 | 0.66 | 0.64 |
| CPMC | 0.62 | 0.65 | 0.87 | 0.70 | 0.85 | 0.76 | 0.66 | 0.72 | 0.68 | 0.55 | 0.77 | 0.64 | 0.72 | 0.80 | 0.46 | 0.61 | 0.76 | 0.59 | 0.85 | 0.74 | 0.70 |
| GOP | 0.66 | 0.65 | 0.86 | 0.72 | 0.82 | 0.77 | 0.68 | 0.69 | 0.72 | 0.59 | 0.80 | 0.69 | 0.74 | 0.81 | 0.52 | 0.66 | 0.78 | 0.64 | 0.83 | 0.76 | 0.72 |
| MCG | 0.71 | 0.70 | 0.87 | 0.77 | 0.85 | 0.79 | 0.73 | 0.76 | 0.75 | 0.64 | 0.82 | 0.72 | 0.77 | 0.83 | 0.58 | 0.69 | 0.78 | 0.65 | 0.87 | 0.80 | 0.75 |
| RP | 0.63 | 0.64 | 0.85 | 0.68 | 0.82 | 0.73 | 0.66 | 0.77 | 0.71 | 0.59 | 0.77 | 0.65 | 0.73 | 0.78 | 0.49 | 0.65 | 0.79 | 0.61 | 0.86 | 0.75 | 0.71 |
| RIGOR | 0.61 | 0.64 | 0.87 | 0.69 | 0.82 | 0.75 | 0.64 | 0.69 | 0.70 | 0.57 | 0.76 | 0.65 | 0.72 | 0.78 | 0.48 | 0.62 | 0.74 | 0.59 | 0.84 | 0.72 | 0.69 |
| SS | 0.67 | 0.69 | 0.87 | 0.72 | 0.85 | 0.77 | 0.68 | 0.79 | 0.75 | 0.63 | 0.79 | 0.68 | 0.76 | 0.82 | 0.53 | 0.67 | 0.82 | 0.64 | 0.87 | 0.77 | 0.74 |
| Mean | 0.65 | 0.66 | 0.86 | 0.71 | 0.83 | 0.76 | 0.68 | 0.74 | 0.72 | 0.60 | 0.78 | 0.67 | 0.74 | 0.80 | 0.51 | 0.65 | 0.78 | 0.62 | 0.85 | 0.76 | 0.72 |

Table 1: Each method's localization accuracy (defined in Sec. 3) for each PASCAL VOC class. The upper rows are window based methods; the lower rows are region based methods. In the original table, boldface marks the best result within each group, and red boldface the best result over all methods.

To motivate our object-level analysis, we first need to answer the question 'Are the state-of-the-art methods really good at all kinds of objects?'. We therefore juxtapose each PASCAL VOC class's localization accuracy in Table 1. The localization accuracy is measured by averaging the 'mean best instance IoU' over different window numbers |S| ∈ {100, 1000, 10000}, where the 'mean best instance IoU' is the mean BIoU (Eq. 1) between all ground-truth instances and a group of window candidates S. As Table 1 demonstrates, each method's localization accuracy varies across classes, which hints that object proposals are not as 'object agnostic' as originally assumed. The region based methods have higher localization accuracy than the window based methods. MCG and SelectiveSearch are the top performing region based methods, though the window based EdgeBox shows comparable performance. The localization accuracies of the region based methods are similar to one another; one potential explanation is that all region based methods follow a similar pipeline, grouping superpixels with either learned or hand-crafted edge measures.
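Under the same assumptions as the sketch in Sec. 2.1 (and reusing its best_instance_iou), the per-class localization accuracy of Table 1 could be computed roughly as follows:

```python
import numpy as np

def localization_accuracy(gt_boxes, ranked_proposals, budgets=(100, 1000, 10000)):
    """Mean BIoU over all class instances, averaged over proposal budgets |S|."""
    means = []
    for k in budgets:
        top_k = ranked_proposals[:k]  # proposals come in ranked order
        means.append(np.mean([best_instance_iou(g, top_k) for g in gt_boxes]))
    return float(np.mean(means))
```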

| Method | Person | Bird | Cat | Cow | Dog | Horse | Sheep | Plane | Bike | Boat | Bus | Car | MBike | Train | Bottle | Chair | Table | Plant | Sofa | TV | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BING | 18.2 | 15.1 | 13.4 | 14.3 | 14.2 | 14.1 | 14.2 | 14.1 | 14.3 | 14.7 | 13.6 | 16.6 | 14.1 | 13.6 | 15.6 | 16.4 | 13.7 | 15.2 | 13.9 | 14.3 | 14.7 |
| EB | 18.0 | 14.6 | 13.0 | 13.6 | 13.6 | 13.4 | 13.7 | 13.5 | 13.8 | 14.3 | 12.7 | 16.1 | 13.6 | 13.1 | 15.4 | 16.2 | 13.5 | 15.0 | 13.5 | 13.4 | 14.2 |
| OBJ | 18.3 | 15.1 | 13.4 | 14.2 | 14.2 | 14.0 | 14.3 | 13.9 | 14.3 | 14.7 | 13.3 | 16.5 | 14.2 | 13.3 | 15.7 | 16.6 | 13.5 | 15.3 | 13.6 | 14.3 | 14.6 |
| Rahtu | 18.5 | 15.3 | 14.0 | 14.4 | 14.7 | 14.5 | 14.4 | 14.1 | 14.5 | 14.8 | 13.6 | 16.6 | 14.4 | 13.8 | 15.8 | 16.7 | 13.8 | 15.5 | 14.1 | 14.3 | 14.9 |
| Mean | 18.3 | 15.0 | 13.4 | 14.1 | 14.2 | 14.0 | 14.1 | 13.9 | 14.2 | 14.6 | 13.3 | 16.5 | 14.1 | 13.4 | 15.6 | 16.5 | 13.6 | 15.2 | 13.8 | 14.1 | 14.6 |
| CPMC | 18.3 | 14.9 | 13.2 | 13.9 | 13.9 | 14.0 | 14.0 | 13.8 | 14.4 | 14.7 | 13.4 | 16.4 | 14.1 | 13.4 | 15.7 | 16.4 | 13.7 | 15.3 | 13.6 | 14.0 | 14.6 |
| GOP | 18.4 | 15.1 | 13.9 | 14.0 | 14.5 | 14.4 | 14.0 | 14.1 | 14.5 | 14.7 | 13.6 | 16.4 | 14.5 | 14.0 | 15.6 | 16.4 | 14.0 | 15.3 | 14.1 | 14.0 | 14.8 |
| MCG | 18.2 | 14.9 | 13.3 | 13.8 | 14.1 | 14.1 | 13.8 | 13.8 | 14.3 | 14.6 | 13.3 | 16.3 | 14.1 | 13.6 | 15.5 | 16.3 | 13.9 | 15.2 | 13.8 | 13.9 | 14.5 |
| RP | 18.5 | 15.2 | 14.1 | 14.3 | 14.8 | 14.6 | 14.3 | 14.0 | 14.6 | 14.7 | 13.9 | 16.7 | 14.5 | 14.1 | 15.7 | 16.5 | 14.1 | 15.4 | 14.2 | 14.3 | 14.9 |
| RIGOR | 18.3 | 14.9 | 13.1 | 14.0 | 13.9 | 14.0 | 14.0 | 13.7 | 14.2 | 14.6 | 13.3 | 16.3 | 14.0 | 13.3 | 15.6 | 16.3 | 13.6 | 15.2 | 13.3 | 14.0 | 14.5 |
| SS | 18.4 | 15.1 | 13.8 | 14.2 | 14.4 | 14.4 | 14.2 | 13.7 | 14.4 | 14.7 | 13.6 | 16.6 | 14.3 | 13.8 | 15.7 | 16.5 | 13.8 | 15.4 | 14.0 | 14.3 | 14.8 |
| Mean | 18.3 | 15.0 | 13.6 | 14.0 | 14.3 | 14.2 | 14.0 | 13.9 | 14.4 | 14.7 | 13.5 | 16.5 | 14.2 | 13.7 | 15.6 | 16.4 | 13.9 | 15.3 | 13.8 | 14.1 | 14.7 |

Table 2: Each method's localization latency (defined in Eq. 2) for each PASCAL VOC class. The upper rows are window based methods; the lower rows are region based methods. In the original table, blue marks each method's five lowest-latency classes and red its five highest-latency classes.

A good object proposal method should not only produce candidates with high accuracy, but also use as few windows as possible. To summarize a method's performance in terms of accuracy and window number, we propose the localization latency metric, inspired by the clutterness measure in [28], which measures the average number of windows required to localize all object instances. Moreover, different object classes should intuitively have different levels of localization latency. For example, some classes (e.g. sheep or cow) often appear against a clean background and are easy to localize with a few windows. On the other hand, when an object is located in a cluttered indoor scene, more windows may be needed to localize it. By studying the localization latency, we can form an initial idea of which object appearance properties may affect the performance of existing methods.

To quantify method $I$'s localization latency, we sample $K$ ($K \in \{100, 1000, 10000\}$) windows $W^I_1, W^I_2, \ldots, W^I_K$ in returned ranking order. Let $\mathrm{OBJ}(m, T, K)$ be the total number of sampled windows needed to localize all instances of the target category in image $m$ under varying threshold $T \in \{0.5, 0.7, 0.85\}$ and budget $K$, i.e.

$$\mathrm{OBJ}(m, T, K) = \sum_i \min\{k \le K : \mathrm{IoU}(W^m_k, B^m_i) > T\}.$$

If the object still cannot be localized using $K$ windows, $k$ is set to $K + 1$. Let $M$ be the number of images containing the instances. The localization latency is defined as follows:

$$\mathrm{LocalizationLatency} = \frac{1}{|T|} \sum_{T} \log_2\Big(\frac{1}{M} \sum_{K} \sum_{m} \mathrm{OBJ}(m, T, K)\Big) \quad (2)$$
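A minimal sketch of Eq. 2 under stated assumptions: `windows` maps each image id to its ranked proposals, `gt` maps it to the class's ground-truth boxes, and `iou()` is the function sketched in Sec. 2.1; all names are illustrative.

```python
import numpy as np

def obj(windows_m, gt_m, thr, budget):
    """OBJ(m, T, K): windows needed to localize every instance in one image;
    an instance not hit within `budget` windows costs budget + 1."""
    total = 0
    for g in gt_m:
        hits = [k + 1 for k, w in enumerate(windows_m[:budget]) if iou(w, g) > thr]
        total += hits[0] if hits else budget + 1
    return total

def localization_latency(windows, gt, thresholds=(0.5, 0.7, 0.85),
                         budgets=(100, 1000, 10000)):
    """Eq. 2: log-average window count over thresholds, budgets and images."""
    images = [m for m in gt if len(gt[m]) > 0]  # images containing the class
    per_threshold = []
    for t in thresholds:
        s = sum(obj(windows[m], gt[m], t, k) for k in budgets for m in images)
        per_threshold.append(np.log2(s / len(images)))
    return float(np.mean(per_threshold))
```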

Table 2 juxtaposes the per-class localization latency of the selected methods. The higher the measure, the more windows are needed to localize the object. The top 5 most difficult classes are 'Person', 'Car', 'Chair', 'Plant' and 'TV', whose objects exhibit strong intra-class variation and thin structures, and often come from images with background clutter. On the other hand, the top 5 easiest classes are 'Cat', 'Bus', 'Train', 'Table' and 'Plane', which tend to have a large homogeneous foreground or background, or to be highly contrasted. It can therefore be concluded that some object properties can indeed influence the performance of existing proposal methods. Moreover, inspecting the table shows that the window based methods have lower localization latency than the region based methods. Our interpretation of this result is that the window based approach inherently assumes that objects are spatially compact; as a result, windows sampled with carefully hand-crafted parameters can capture more complete objects in the combinatorial space, which are difficult to detect with the bottom-up region based approach. Finally, the measures also reflect the complementarity among different methods, which is difficult to reveal with an overall IoU metric.

4 Impact of Object Characteristics

So far, we have studied existing methods' localization accuracy and latency, and can conclude that object appearance indeed influences the performance of different proposal methods. We now look further into which object properties bring such influences. First, we study the impact of natural versus man-made objects. Then, we examine the impact of additional properties: iconic view, object size, aspect ratio, color contrast, shape regularity and textureness.

Figure 1: Recall versus IoU threshold curves for 'Natural' (a) and 'Man-Made' (b) objects. Dashed lines are for window based methods and solid lines are for region based methods. Dashed lines with dots are for the baseline methods.

Figure 2: Recall versus IoU threshold curves for 'Non-iconic' (a) and 'Iconic' (b) objects. Dashed lines are for window based methods and solid lines are for region based methods. Dashed lines with dots are for the baseline methods.

Natural vs Man-made classifies objects as 'natural' (e.g. horse, person, cow) or 'man-made' (e.g. motorbike, tv monitor). The 'recall versus IoU threshold' curves for 'Natural' and 'Man-Made' objects are shown in Figure 1. The recall rate for natural objects is slightly higher than for man-made ones, though the difference is almost negligible. This suggests that existing objectness rules generalize across man-made and natural objects. Region based methods maintain a smooth recall transition, while the window based EdgeBox achieves the highest recall when it is tuned for IoU = 0.7.

Iconic Objects refers to objects in a canonical perspective, centred in the image with a sizeable scale; non-iconic objects are the opposite. Many commercial photos contain such 'iconic objects'. On the other hand, in images such as street or indoor scenes, objects appear at various locations, scales and aspect ratios. We are therefore interested in studying whether existing methods favour such 'iconic objects'. To decide whether an object is iconic, we estimate a multivariate Gaussian distribution over the bounding box parameters (centre position, square-root area and log aspect ratio) from the PASCAL VOC training set, as in Hosang et al. [17]. We then sample bounding boxes from this distribution; an object is deemed iconic if some sampled bounding box has an IoU ≥ 0.6 with its ground truth. Non-iconic objects make up nearly 68% of the objects. The 'recall versus IoU threshold' curves for 'Non-iconic' and 'Iconic' objects are shown in Figure 2. To our surprise, existing methods strongly favour iconic objects. One conjecture is that some methods already incorporate prior knowledge of iconic objects by applying classifiers with location features (e.g. MCG, CPMC, RIGOR) or a window sampling technique (e.g. Rahtu2011). On the other hand, many methods that do not incorporate such a prior (e.g. BING, EdgeBox, Objectness, SelectiveSearch, Geodesic) also exhibit various degrees of performance loss, though the loss is smaller than for the methods applying the location prior (except MCG). This phenomenon suggests that existing methods have limitations in localizing objects with unusual positions and appearances. Our subsequent analysis also supports this hypothesis.
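The iconic-view test could be sketched as follows. This is a rough illustration under stated assumptions (normalized box statistics, 1000 samples, and the iou() from Sec. 2.1); the exact parameterization in [17] may differ.

```python
import numpy as np

def fit_box_prior(train_boxes, img_sizes):
    """Fit a multivariate Gaussian to normalized (cx, cy, sqrt-area, log-aspect)."""
    feats = []
    for (x1, y1, x2, y2), (w, h) in zip(train_boxes, img_sizes):
        bw, bh = x2 - x1, y2 - y1
        feats.append([(x1 + x2) / (2.0 * w), (y1 + y2) / (2.0 * h),
                      np.sqrt(bw * bh / float(w * h)), np.log(bw / float(bh))])
    feats = np.array(feats)
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def is_iconic(gt_box, img_size, mean, cov, n_samples=1000, seed=0):
    """An object is iconic if any box sampled from the prior reaches IoU >= 0.6."""
    w, h = img_size
    rng = np.random.default_rng(seed)
    for cx, cy, s, la in rng.multivariate_normal(mean, cov, n_samples):
        area = (s ** 2) * w * h                # undo the normalizations
        bw = np.sqrt(area * np.exp(la))        # width/height from area and aspect
        bh = area / max(bw, 1e-6)
        box = [cx * w - bw / 2, cy * h - bh / 2, cx * w + bw / 2, cy * h + bh / 2]
        if iou(box, gt_box) >= 0.6:
            return True
    return False
```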

Figure 3: Recall versus IoU threshold curves for different object sizes: 'extra-small' (a), 'small' (b), 'medium' (c), 'large' (d) and 'extra-large' (e). Dashed lines are for window based methods and solid lines are for region based methods. Dashed lines with dots are for the baseline methods.

Figure 4: Recall versus IoU threshold curves for different aspect ratios: 'extra-tall' (a), 'tall' (b), 'medium' (c), 'wide' (d) and 'extra-wide' (e). Dashed lines are for window based methods and solid lines are for region based methods. Dashed lines with dots are for the baseline methods.

Object Size is measured as the pixel area of the bounding box; each bounding box is assigned to a size category depending on its size percentile within the object instances: 'extra-small' (XS: bottom 10%); 'small' (S: next 20%); 'medium' (M: next 40%); 'large' (L: next 20%); 'extra-large' (XL: next 10%). The 'recall versus IoU threshold' curves for different object sizes are shown in Figure 3. Existing methods exhibit varying performance across sizes, particularly for objects from extra-small to medium size. EdgeBox shows the best performance for small to medium objects when IoU ≤ 0.7, and BING's performance is also stable across this size range. For other sizes, EdgeBox shows performance comparable to the region based approach. Rahtu2011 (R1) is the most size-sensitive window based method; it works decently for localizing large and extra-large objects. This phenomenon for window based methods is correlated with the initial window sampling techniques (as shown by the 'uniform' baseline), which favour searching for sizeable objects first. MCG is the best performing region based method across different sizes. Interestingly, the superpixel baseline shows better performance than all existing methods in the extra-small case, which supports our previous hypothesis that the pipeline of existing methods tends to ignore small to medium scale objects or has difficulty picking them out, though more component-wise analysis is left for future work.
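A minimal sketch of the percentile binning just described (the aspect ratio bins in the next paragraph follow the same recipe); cut points at the 10th/30th/70th/90th percentiles implement the stated 10/20/40/20/10% split, and the function name is our own:

```python
import numpy as np

def size_categories(areas):
    """Map each bounding-box pixel area to XS/S/M/L/XL by dataset percentile."""
    areas = np.asarray(areas)
    cuts = np.percentile(areas, [10, 30, 70, 90])  # 10% / 20% / 40% / 20% / 10%
    labels = np.array(['XS', 'S', 'M', 'L', 'XL'])
    return labels[np.searchsorted(cuts, areas)]
```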

Aspect Ratio is defined as the object width divided by the height. The objects are categorized into 'extra-tall' (XT), 'tall' (T), 'medium' (M), 'wide' (W) and 'extra-wide' (XW), again according to percentiles within the dataset. The 'recall versus IoU threshold' curves for different aspect ratios are shown in Figure 4. Most objects from 'medium' (M) to 'wide' (W) are easier to localize than other aspect ratios, as many 'extra-tall' and 'tall' objects are correlated with the 'bottle' and 'person' classes. Existing methods' performance varies with aspect ratio. MCG and EdgeBox are the top performing methods when IoU ≤ 0.7. The performance of the region based methods is correlated with the superpixel baseline, as superpixels tend to over-segment the image into regularly shaped regions. Another notable phenomenon is the performance variation of the seeded region based approaches (CPMC and RIGOR). One potential explanation is that CPMC and RIGOR apply regular grid based seeding, which may not produce stable seeds for elongated objects, whereas MCG and Geodesic apply an irregular seeding approach based on the contour map, which is more stable for objects at various aspect ratios. The performance of the window based methods is also correlated with the initial window sampling, as can be observed from the 'uniform' baseline: the window based methods maintain performance in the middle aspect ratio range, while showing lower performance at the two ends of the spectrum due to insufficient parameter sampling.

Color Distinctiveness is divided into three classes: 'low' (e.g. a black car moving at night), 'medium' (e.g. a pedestrian in the person class), 'high' (e.g. a white horse on grassland). One common assumption behind object proposals is that objects have properties that make them stand out from the background; therefore, objects with strong contrast against the background should be easier to localize. The 'recall versus IoU threshold' curves for different color contrast levels are shown in Figure 5. Our analysis did suggest such a preference for highly contrasted objects, since many existing methods either explicitly model color contrast (e.g. CPMC, RIGOR, SelectiveSearch, Objectness) or implicitly make use of color contrast based features (e.g. MCG, EdgeBox, Geodesic, RandomizedPrim). A direct consequence of this design flavour is that region based methods are the most sensitive to color contrast changes, whereas window based methods suffer less because they benefit from gradient and shape cues (e.g. EdgeBox, BING and Objectness). On the other hand, Rahtu2011 is quite sensitive to contrast changes, as it relies on a superpixel based window sampling technique. As color distinctiveness is also correlated with non-iconic objects, we evaluate color contrast's influence on iconic and non-iconic views individually; the improvement when shifting from low to high contrast follows a similar pattern for both kinds of objects (the figures can be found in the last section of the supplementary material). Therefore, color contrast is an important factor influencing the existing state of the art.


Figure 5: Recall versus IoU threshold curves for different color contrast levels: 'low' (a), 'medium' (b) and 'high' (c). Dashed lines are for window based methods and solid lines are for region based methods. Dashed lines with dots are for the baseline methods.

Figure 6: Recall versus IoU threshold curves for different shape regularities: 'low' (a), 'medium' (b) and 'high' (c). Dashed lines are for window based methods and solid lines are for region based methods. Dashed lines with dots are for the baseline methods.


Shape Regularity is classified into three categories: 'low' (e.g. tv monitor, bus), 'medium' (e.g. car) and 'high' (e.g. horse, dog). The 'recall versus IoU threshold' curves for different shape regularities are shown in Figure 6. Nearly all existing methods make use of some shape or gradient information (e.g. BING and EdgeBox), so it is interesting to inspect whether objects with more irregular shapes are easier to localize. Our experimental results partially support this intuition and are also consistent with the method designs. For example, EdgeBox directly evaluates the enclosed contour strength: the more irregular the enclosed shape, the more contour it may include, making it easier to localize. BING, on the other hand, is less sensitive to shape regularity changes, as it relies on a compactly represented shape template. Region based methods are more sensitive to shape regularity changes, as they also rely on contour detection. Moreover, when methods rely too heavily on such cues, they face difficulties under weak illumination changes or other image degradations where the shape cues are easily contaminated. This is supported by evaluating shape regularity's influence on iconic and non-iconic views individually: the shape information carried by non-iconic views is weak, so the performance gap between these two kinds of objects is large (the figures can be found in the last section of the supplementary material).

Textureness is defined by the following classes: 'low' (e.g. horse), 'medium' (e.g. sheep), 'high' (e.g. dalmatian). The 'recall versus IoU threshold' curves for different texture levels are shown in Figure 7. According to the figures, there is little correlation between performance and the amount of texture, except for Rahtu2011.


Figure 7: Recall versus IoU threshold curves for different texture levels: 'low' (a), 'medium' (b) and 'high' (c). Dashed lines are for window based methods and solid lines are for region based methods. Dashed lines with dots are for the baseline methods.

We conjecture that this phenomenon is caused by the spatial support and the feature design. Existing methods use large spatial support (e.g. bounding boxes and superpixels) to extract mid-level features and produce candidate regions, which is more robust to small-scale texture changes. Note that although a single superpixel segmentation is not very repeatable across texture levels, as indicated in the figure, existing region based approaches use diversified superpixel generation, which greatly alleviates the sensitivity to textureness changes.

5 Conclusion

In this work, we study the influence of object-level characteristics on existing object proposal methods. We examine existing methods' localization accuracy and latency, which reveals their performance variation across classes and their complementarity. Our further analysis along different object characteristics reveals existing methods' pros and cons. BING, Geodesic, EdgeBox, SelectiveSearch and MCG achieve stable performance across different characteristics. In terms of localization accuracy, MCG is the best region based method across various properties, though its speed is mediocre. BING is fast, but its localization is quite poor. EdgeBox, Geodesic and SelectiveSearch achieve a good balance between accuracy and speed. A more detailed summary along each dimension can be found in the supplementary material.

The study of iconic views reveals existing methods' difficulty in localizing objects outside canonical perspectives; hence future work should focus more on improving the recall rate for non-canonical views and reducing the influence of the location prior. This also applies to experiment and dataset design. Our concern is that further object proposal research on PASCAL VOC and ImageNet risks over-fitting to the large number of iconic objects while achieving little improvement on non-iconic objects. The recent MSCOCO dataset [22] is a good attempt, collecting images with a more balanced distribution of perspective change.

Existing methods also have limitations in terms of object size, aspect ratio, color contrast and shape change. The improvement with respect to these attributes could be substantial, though it may not be an easy task, as difficult cases combine several challenges. A large improvement in a few aspects may bring only a small improvement in overall performance; therefore, further analysis along different attributes is encouraged to better reflect performance development.

There are some potential directions worth exploring, e.g. developing specific templates for challenging cases such as low contrast, illumination and resolution [24]. Further gains can also be achieved by using context [30]. Moreover, most existing methods use bottom-up cues to generate object proposals; combining them with mid-level shape cues can be helpful, as investigated in ShapeSharing [19].

In our future work, we will conduct similar research on MSCOCO, which has a different distribution from PASCAL VOC and includes more annotated attributes. Such work will further reveal insights into existing methods and shed light on future research.

6 Acknowledgement

The project is funded by the Office for Space Technology and Industry (OSTIn) Space Research Program under the Economic Development Board, Singapore. The project reference number is S14-1136-IAF OSTIn-SIAG.

References

[1] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? In CVPR, 2010.

[2] Pablo Andrés Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marqués, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014.

[3] João Carreira and Cristian Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.

[4] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, and Philip H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.

[5] Ramazan Gokberk Cinbis, Jakob J. Verbeek, and Cordelia Schmid. Segmentation driven object detection with Fisher vectors. In ICCV, 2013.

[6] Santosh Kumar Divvala, Derek Hoiem, James Hays, Alexei A. Efros, and Martial Hebert. An empirical study of context in object detection. In CVPR, 2009.

[7] Piotr Dollár and C. Lawrence Zitnick. Structured forests for fast edge detection. In ICCV, 2013.

[8] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In CVPR, pages 304-311, 2009.

[9] Ian Endres and Derek Hoiem. Category independent object proposals. In ECCV, 2010.

[10] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In CVPR, pages 2155-2162, 2014.

[11] Mark Everingham, Luc J. Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303-338, 2010.

[12] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167-181, 2004.


[13] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.

[15] Derek Hoiem, Carsten Rother, and John M. Winn. 3D LayoutCRF for multi-view object class recognition and segmentation. In CVPR, 2007.

[16] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In ECCV, 2012.

[17] Jan Hendrik Hosang, Rodrigo Benenson, and Bernt Schiele. How good are detection proposals, really? In BMVC, 2014.

[18] Ahmad Humayun, Fuxin Li, and James M. Rehg. RIGOR: Reusing inference in graph cuts for generating object regions. In CVPR, 2014.

[19] Jaechul Kim and Kristen Grauman. Shape sharing for object segmentation. In ECCV, 2012.

[20] Philipp Krähenbühl and Vladlen Koltun. Geodesic object proposals. In ECCV, 2014.

[21] Philipp Krähenbühl and Vladlen Koltun. Geodesic object proposals. In ECCV, 2014.

[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[23] Santiago Manen, Matthieu Guillaumin, and Luc J. Van Gool. Prime object proposals with randomized Prim's algorithm. In ICCV, 2013.

[24] Dennis Park, Deva Ramanan, and Charless Fowlkes. Multiresolution models for object detection. In ECCV, 2010.

[25] Nicolas Pinto, David D. Cox, and James J. DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1), 2008.

[26] Esa Rahtu, Juho Kannala, and Matthew B. Blaschko. Learning a category independent object detection cascade. In ICCV, 2011.

[27] Pekka Rantalankila, Juho Kannala, and Esa Rahtu. Generating object segmentation proposals using global and local search. In CVPR, 2014.

[28] Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C. Berg, and Li Fei-Fei. Detecting avocados to zucchinis: What have we done, and where are we going? In ICCV, 2013.

[29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014. URL http://arxiv.org/abs/1409.0575.

[30] Antonio Torralba. Contextual priming for object detection. IJCV, 53(2):169-191, 2003.


[31] Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521-1528, 2011.

[32] Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, and Arnold W. M. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.

[33] Xiaoyu Wang, Tony X. Han, and Shuicheng Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, pages 32-39, 2009.

[34] Xiaoyu Wang, Ming Yang, Shenghuo Zhu, and Yuanqing Lin. Regionlets for generic object detection. In ICCV, 2013.

[35] Bo Wu and Ramakant Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In ICCV, 2005.

[36] Xiangxin Zhu, Carl Vondrick, Deva Ramanan, and Charless Fowlkes. Do we need more training data or better models for object detection? In BMVC, 2012.

[37] C. Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.