
Exploring visual dictionaries: A model driven perspective☆

Sinem Aslan a,⁎, Ceyhun Burak Akgül b, Bülent Sankur b, E. Turhan Tunalı c
a International Computer Institute, Ege University, İzmir, Turkey
b Electrical and Electronics Engineering Department, Boğaziçi University, İstanbul, Turkey
c Department of Computer Engineering, İzmir University of Economics, İzmir, Turkey

ARTICLE INFO

Keywords: Model-driven; Visual dictionary; Bag of Visual Words; Shape models; Primitive image structures; Image understanding; Object recognition; Scene classification

ABSTRACT

Good representative dictionaries are the most critical part of the Bag of Visual Words (BoVW) scheme, used for such tasks as category identification. The paradigm of learning dictionaries from datasets is by far the most widely used approach, and there exists a plethora of methods to this effect. Dictionary learning methods demand abundant data, and when the amount of training data is limited, the quality of the dictionaries and consequently the performance of BoVW methods suffer. A much less explored path for creating visual dictionaries starts from the knowledge of primitives in appearance models and creates families of parametric shape models. In this work, we develop shape models starting from a small number of primitives and build a visual dictionary using various nonlinear operations and nonlinear combinations. Compared with the existing model-driven schemes, our method is able to represent and characterize images in various image understanding applications with competitive, and often better, performance.

1. Introduction

The Bag-of-Visual-Words (BoVW) paradigm provides state-of-the-art performance for tasks of object recognition, image category determination, and, in general, scene understanding. BoVW methods use visual words, extracted for instance from Scale-Invariant Feature Transform (SIFT) vectors, as mid-level representations of image patches. It is important in the BoVW framework to obtain good representative dictionaries. A plethora of visual dictionaries have been generated in the literature according to the following three paradigms:

1. Dictionaries built from mixtures of the column sets of known transform matrices, such as DCT [1], DWT [2], Gabor filters [2], curvelets [3], edgelets [4], ridgelets [5], contourlets [6], bandelets [7], and steerable filters [8]. The main advantage of these dictionaries is that they admit fast implementations. However, these dictionaries have limitations, i.e., they can only be as successful as their underlying model. For example, DCT is good at representing images with homogeneous components, DWT is good at representing point singularities, and edgelets, curvelets, ridgelets, contourlets, and bandelets are good at representing line singularities in images [9].

2. Dictionaries learned from data. In a number of methods, one obtains dictionaries directly from pixel data, based on matrix factorization principles under sparsity constraints, such as K-Singular Value Decomposition (K-SVD) [10] and Online Dictionary Learning (ODL) [11]. Another set of approaches follows the steps of dense sampling of images, obtaining local features such as HOG [12] or SIFT [13], and building a dictionary via clustering [14,15] (a clustering-based sketch follows this list). These can be grouped under the name of unsupervised dictionary learning techniques. Recent studies have introduced supervised dictionary learning [16–22] for better classification performance, where a class-specific discrimination term is added to the learning algorithm. The main advantage of both unsupervised and supervised dictionary-learning techniques is that the dictionaries can be fine-tuned to the underlying dataset, as compared to the transform-based approaches. Furthermore, results in the literature indicate that better performance can be achieved. Their main disadvantage is that unsupervised techniques result in an unstructured dictionary, and they are computationally costlier to apply compared to the transform-based ones. Supervised techniques are more discriminative than unsupervised ones and better in classification tasks, yet they still have some drawbacks: very large dictionaries may be encountered in [16,17]; supervised pruning of a dictionary initially learned without supervision does not necessarily improve the performance [18,19]; and the related optimization problem is non-convex and can become quite complex, as in [20–22].

http://dx.doi.org/10.1016/j.jvcir.2017.09.009
Received 3 October 2016; Received in revised form 27 July 2017; Accepted 19 September 2017

☆ This paper has been recommended for acceptance by Zicheng Liu.
⁎ Corresponding author.
E-mail addresses: [email protected], [email protected] (S. Aslan), [email protected] (C.B. Akgül), [email protected] (B. Sankur), [email protected] (E. Turhan Tunalı).

Journal of Visual Communication and Image Representation 49 (2017) 315–331

Available online 25 September 2017
1047-3203/© 2017 Elsevier Inc. All rights reserved.


3. Dictionaries crafted from models of local image appearances. These are typically models of gray-level image topological features, such as ramps, corners, wedges, bars, crosses, saddles, mesas, valleys, potholes, depressions, gorges, ridges, flat zones, etc. This has been a much less explored path for creating visual dictionaries. Marr's studies in the 1980s [23,24] can be accepted as the beginning of describing natural images in terms of a set of geometrical structures. Inspired by findings in physiology [25–27], Marr claimed that in order to achieve visual perception in machine vision systems, some primitive shape structures such as edges, bars and blobs should first be detected in the images. Recently, Griffin et al. [28–30] have introduced a dictionary construction method where images are described in terms of a pre-determined dictionary of merely 7 basic qualitative structures, namely flat, dark and light bar, dark and light blob, and saddle, called Basic Image Features (BIFs). The shape models are defined by a parametric mapping from a jet space to a partitioned orbifold. These authors have subsequently enriched their coarse dictionary by replicating BIFs in different orientations, though its performance in object categorization tasks was far from being competitive [29].
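To make the clustering route of paradigm 2 concrete, the sketch below builds a small visual codebook by clustering densely sampled local descriptors with k-means. It is a generic illustration under our own assumptions (random data standing in for HOG/SIFT descriptors, arbitrary codebook size), not the code of any of the cited methods.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for densely sampled local descriptors (e.g., 128-D SIFT vectors);
# in practice these would be extracted from the training images.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(10000, 128)).astype(np.float32)

# Unsupervised dictionary: each cluster center becomes one visual word.
n_words = 256
kmeans = MiniBatchKMeans(n_clusters=n_words, batch_size=1024, random_state=0)
kmeans.fit(descriptors)
codebook = kmeans.cluster_centers_        # shape: (n_words, 128)

# A new descriptor is mapped to its nearest visual word (hard assignment).
word_ids = kmeans.predict(descriptors[:5])
print(codebook.shape, word_ids)
```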

We believe that model-based dictionary methods have further room for exploration and improvement. The potential for improvement lies in a more detailed quantization of the parameter space of the shape models as well as in exploring new representative shape types. This paper presents our work in this direction.

Fig. 1 shows the three main operations in the pipeline of visual dictionary construction. These operations are (i) feature extraction, (ii) descriptor computation, and (iii) signature extraction.

Feature extraction. In the first stage, characteristics of the local patches around selected image points are used. The simplest image feature can be the vector of pixel values or their histogram within a patch. However, raw pixel values are sensitive to position, illumination, and noise variations, or to geometrical transform effects. Thus, image features have been developed in the literature [31,32] that, if not totally invariant, mitigate spatial and/or photometric transformations. One can use HOG [12], SIFT [13], GLOH [31], SURF [33], etc. features, on sparse points of interest or on densely sampled points of a regular grid [34]. Other possibilities consist of the family of filter kernels, e.g., steerable filters [8] and Gabor filters [35,36]. Derivative-based features investigated by Koenderink and van Doorn [37] are some other examples. These have been used successfully in many applications such as image coding [38], foreground/background segmentation [39,40], moving object detection [41], pose estimation [42] or image registration [43]. Recently, binary features, i.e., BRIEF [44], ORB [45], BRISK [46], FREAK [47], have attracted some attention due to their computational simplicity, memory efficiency and inherent robustness against image variability. In the training stage, the local features are processed to extract a visual dictionary (a.k.a. codebook), consisting of code words.

Descriptor computation. The extracted local features are encoded in a descriptor according to their association with the elements (a.k.a. code words) of a predetermined visual dictionary (a.k.a. codebook). The principal encoding methods that have been used in the literature [48] can be grouped into the categories of (i) voting-based methods such as hard voting [14] and soft voting [49], (ii) reconstruction-based methods such as sparse coding [50], Local Coordinate Coding (LCC) [51], and Locality-constrained Linear Coding (LLC) [52], and (iii) Fisher coding methods [53,54]. Fisher coding and reconstruction-based methods outperform voting-based methods [48]. Among all, Fisher coding is reported as the best performing one, as the Gaussian mixture model (GMM) provides richer information and it is more robust to unusual, noisy features. However, Fisher coding gives rise to very high dimensional descriptor vectors. Reconstruction-based methods yield a more exact representation of features than voting-based methods, but their computational complexity is higher and they are the least robust among all, as reported in [48].
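The following sketch contrasts hard voting and soft voting, the simplest of the encoding families above. It assumes a codebook matrix and descriptors are already available, and the weighting rule is only an illustration of the idea, not the specific implementations cited.

```python
import numpy as np

def hard_voting(descriptors, codebook):
    """Each descriptor votes for its single nearest code word (one-hot code)."""
    # Pairwise squared Euclidean distances, shape (n_descriptors, n_words).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = np.zeros_like(d2)
    codes[np.arange(len(descriptors)), d2.argmin(axis=1)] = 1.0
    return codes

def soft_voting(descriptors, codebook, beta=1.0):
    """Each descriptor distributes its vote over all code words with
    weights that decay with distance (a soft assignment)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    # Subtract the row minimum before exponentiating for numerical stability.
    w = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)

# Toy data: 4 descriptors, 8 code words, 16-D feature space.
rng = np.random.default_rng(1)
desc, cb = rng.normal(size=(4, 16)), rng.normal(size=(8, 16))
print(hard_voting(desc, cb).sum(axis=1))   # each row sums to 1
print(soft_voting(desc, cb).shape)         # (4, 8)
```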

Signature extraction. The signature vector is a unique representation of the image that enables its similarity comparison with other images. One way to accomplish this is to combine the descriptor occurrences (hits) into a "bag of features" vector. Essentially this is a spatial pooling operation. Spatial pooling provides not only compactness of representation, but also invariance to transformations such as changes in position, and robustness to lighting conditions, noise and clutter [55]. Sum (or average) and max pooling are the two common ways used for this purpose [55,56]. Sum pooling can reduce discriminability since it is influenced strongly by the most frequent features, which may not, however, be informative, as in the stop-words case in text retrieval [56]. Max pooling can have better discrimination as it focuses on the most strongly expressed features. However, it is not necessarily the best method for every coding scheme. For example, it does not perform well with Fisher coding, but works quite well with soft voting and sparse coding [56]. Furthermore, pooling spatially close descriptors, as in Spatial Pyramid Matching (SPM) [34] and macro-features [57], has been shown to bring substantial improvements.
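A minimal sketch of the pooling step: given the per-descriptor codes from the previous stage, sum (average) pooling and max pooling collapse them into a single signature vector. The normalization choice here is ours, for illustration only.

```python
import numpy as np

def sum_pool(codes):
    """Average the per-descriptor codes: a normalized 'bag of features' histogram."""
    signature = codes.sum(axis=0)
    return signature / max(signature.sum(), 1e-12)

def max_pool(codes):
    """Keep, for each code word, its strongest response over the image."""
    return codes.max(axis=0)

# codes: (n_descriptors, n_words) array produced by any of the encoders above.
rng = np.random.default_rng(2)
codes = rng.random(size=(500, 8))
print(sum_pool(codes).round(3))
print(max_pool(codes).round(3))
```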

In this paper, we propose a novel dictionary generation method adopting the model-driven perspective. The proposed dictionary-based scheme, which we call Symbolic Patch Dictionary (SymPaD), follows the steps of the BoVW paradigm in that pixels are visited on a dense grid, local image characteristics are extracted in terms of shape similarity scores to the dictionary atoms, the scores are pooled, and finally an image signature is obtained. We differ from the BoVW schemes in the literature in the generation of our shape dictionary. These shape patterns are generated by mathematical formulae encoding qualitative image characteristics [23,24,58–60,28,29,61].

Our contributions can be summarized in two items. First, our scheme can incorporate any shape primitive in the visual dictionary thanks to its parametric generative function. More importantly, the parametric representation allows a more thorough sampling of the parameter space in order to produce a variety of shapes.

Fig. 1. Processing steps in the pipeline of a dictionary-based computer vision task (a dashed line indicates the dictionary learning stage).


Second, we achieve a more generic shape dictionary and minimize the dependence of the dictionary on any specific image database. This is especially convenient when training data is not abundant, and one can also avoid overfitting problems.

The paper is organized as follows: Section 2 reviews the related literature in two parts, namely a background part, in which the studies investigating the visual primitive structures of natural images are presented, and the state of the art in the model-driven approach. The generative models designed to construct the shape dictionary and their parametrization are explained in Section 3. Section 4 introduces the scheme utilized to compute SymPaD vectors. We present experimental results in Section 5, and Section 6 concludes.

2. Literature survey

In the first part of this section, we briefly review qualitative features of images that can inspire model-driven patch shape dictionaries. We present the state of the art in visual dictionary construction adopting the model-driven approach in the second part.

2.1. Modelling local structures of natural images

Analysis of natural images, i.e., investigating the meaningful primitive representatives and their statistics, is crucial to design proper shape models to generate a useful visual dictionary. Many researchers have addressed the objective of finding evidence of visual primitives at an early stage of perception, so that higher-level processing, e.g., recognition, compression, etc., could be made computationally more feasible.

Marr was one of the earliest researchers who studied visual perception from a computational perspective, i.e., for machine vision systems [23,24]. He established a representational framework for visual perception and claimed that at the first stage, information is made explicit via symbolic representations of tokens that are intensity discontinuities in various formations such as edges, bars, and blobs. He also suggested using various spatial groupings of the initial tokens for a more meaningful description, named place tokens, i.e., each group of tokens is defined as an individual token [23]. Marr argued that such representations, which include primitive tokens and place tokens and are called the primal sketch, should be sufficient to represent the original image [24,23]. In the same decade, Julesz investigated the problem of pre-attentive discrimination of texture images and asserted that texture images are characterized by the repetition of a few atomic elements, such as bars, edges and terminations (or end points), which were named textons [62]. Tenenbaum and Witkin [63] and Lowe [60] suggested the non-accidentalness of successful groupings of primitive structures for image representation. Lowe [60] demonstrated that certain relations between initial primitives in the 2D image, such as collinearity, curvilinearity, co-termination, crossings, parallelism and symmetry, are invariant across viewpoint changes. Thus they are unlikely to emerge by accident; furthermore, these grouping structures are more informative about the object.

Another research stream investigated the statistics of intensity discontinuities in natural images. Field demonstrated that the histograms of Gabor filter responses on natural images have high kurtosis [64]. Geman and Koloydenko analyzed a large set of 3×3 patches from two natural image datasets by a modified order statistic and found that the non-background patches have the appearance of edges with high probability [65]. Lee et al. analyzed the probability distribution of a large set of 3×3 high-contrast patches sampled from natural images and found that the patch space is extremely sparse, with patches located around the manifold of edges of different orientations and positions [66]. The common outcome of these studies [64–66], i.e., the observed high kurtosis in image statistics, demonstrates that it should be possible to represent natural image primitives with a limited number of visual elements.

These initial studies have inspired many researchers to develop a variety of vision applications. Extracting primitive structures of edge features, and describing their attributes and spatial relations, Vilnrotter et al. obtained good performance for texture recognition [67]. Saund proposed to add a scale dimension to Marr's primal sketch for shape recognition applications [58]. Horaud et al. suggested an intermediate-level description accomplished by an exhaustive search to detect geometric structures, and groupings of them, such as linear and curved contours, junctions, and local symmetries like parallels, ribbons, and parallelograms. Horaud et al. claimed that such a representation is useful for object recognition [59] and stereo image matching [68]. By using primal sketch priors such as edges, ridges, corners, T-junctions and terminations, Sun et al. obtained encouraging results in enhancing the quality of hallucinated high-resolution generic images [69]. Guo et al. used a visual dictionary including primitive shapes of blobs, terminations, edges, ridges, multi-ridges, corners, junctions, and crosses to model the structural components of natural images and demonstrated that such a representation is useful for lossy image coding [70]. In a more recent study, Griffin et al. proposed to represent images in terms of a set of image primitives, namely flat, ramp, dark/light line, dark/light circle, and saddle, in a bag-of-words scheme, and demonstrated that state-of-the-art performance is obtained in texture classification [30] when their occurrences in scale-space are considered. These successful implementations adopting model-driven schemes demonstrate that the model-driven approach deserves further investigation.

The last decade has seen a plethora of methods to obtain local image representations based on data-driven methods. These methods, which learn local structures of images automatically, range from Principal Component Analysis (PCA) to clustering, and from Nonnegative Matrix Factorization (NMF) to Independent Component Analysis (ICA). Recently, dictionary learning techniques based on sparsity-constrained optimization have been popular [71]. These learned dictionaries have proved surprisingly effective in representing images for classification, detection and recognition tasks. In fact, most of these methods can be conceived under a single mathematical formalism, that of matrix factorization. Let $X = [x_1, \ldots, x_n] \in \mathbb{R}^{p \times n}$ be the training set of $p$-dimensional images or image patches, $D = [d_1, \ldots, d_r]$ the matrix of dictionary elements represented by $d_i \in \mathbb{R}^p$, and $A = [\alpha_1, \ldots, \alpha_n]$ the matrix of decomposition coefficients represented by $\alpha_i \in \mathbb{R}^r$. The matrix is factorized into the product of a dictionary matrix and the mixture coefficients as $X \approx DA$, such that one obtains a good approximation of each sample, $x_i \approx D\alpha_i = \sum_{j=1}^{r} \alpha_i[j]\, d_j$ [71]. The various methods proposed differ in the regularizer terms, that is, in the constraints imposed, such as sparsity of the mixing coefficients $A$, and low rank, orthogonality, or the existence of structure in $D$. The dictionaries learned by the mentioned methods consist of low-frequency patterns, i.e., ramps and bars in various orientations and positions, and Gabor-like patterns with different orientations, frequencies and positions [71]. These observed primitive shape structures coincide with the choice of primitive shapes used by model-driven approaches [23,24,67,58,59,68–70,30].
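As a small, hedged illustration of the X ≈ DA factorization view (using scikit-learn's generic sparse dictionary learner, not K-SVD or ODL themselves), the snippet below learns r atoms from random vectors standing in for real image patches.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# X: n patches of dimension p, stacked as rows (scikit-learn's convention;
# the paper stacks them as columns).
rng = np.random.default_rng(0)
p, n, r = 15 * 15, 1000, 64
X = rng.normal(size=(n, p))

# Learn D (r atoms) and sparse coefficients A such that X ~= A @ D.
learner = MiniBatchDictionaryLearning(n_components=r, alpha=1.0,
                                      batch_size=64, random_state=0)
A = learner.fit_transform(X)      # decomposition coefficients, shape (n, r)
D = learner.components_          # dictionary atoms, shape (r, p)

relative_error = np.linalg.norm(X - A @ D) / np.linalg.norm(X)
print(D.shape, A.shape, round(float(relative_error), 3))
```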

2.2. State-of-the-art in model-driven visual dictionaries

The most recent study using visual primitives designed according to a model-driven scheme and used in a BoVW-type image representation was developed by Griffin et al. [28–30]. This shape dictionary, called Basic Image Features (BIFs) [28], has been applied to a wide variety of image understanding problems such as object categorization [29], texture classification [30], quartz sand grain classification for forensic analysis [72], character recognition in natural images [73], writer identification [74], and biomedical image analysis [75,76].

The shape models of BIFs are defined by partitioning the re-parametrized response space of six Derivative of Gaussian (DtG) filters (one 0th order, two 1st order, and three 2nd order) into seven regions, each corresponding to one of seven qualitative image structures, namely flat, ramp, dark and light line, dark and light circle, and saddle.


Such a re-parametrization yields a mapping from the six-dimensional filter response space to a partitioned orbifold, where the intrinsic information of the scene is separated from extrinsic information such as geometrical and photometrical variations in the scene [28]. To improve the recognition performance, the authors proposed scaled templates of BIFs, called BIF-columns, which consist of a stack of BIFs computed at the same pixel position of the image at different scales. With the augmented image signature, consisting of the 1296 co-occurrences of six structures at four scales ($6^4 = 1296$), state-of-the-art performance (98.5%) was obtained on KTH-TIPS, which is a challenging texture dataset [30].

For the object categorization problem, oBIFs (oriented BIFs) [29] proved useful. The 7 original feature types were increased to 23 by quantizing the orientations of the ramp shape into 8 levels, and those of the line and saddle shapes into 4 levels. Since a histogram with 23 bins was still too limited, they used oBIF columns computed in scale-space and selected the 1000 most informative ones (based on a Mutual Information criterion) [29]. Griffin et al. [28] also stressed that using filters up to 4th or 5th order could provide a more enriched description of local structures, though they do not seem to have elaborated on it. We do not follow the path of using higher-order filters; instead, starting from a core of basic shapes, we proceed to build a visual dictionary using, on the one hand, a detailed parametrization, and on the other hand, selected nonlinear compositions.

3. Shape dictionary

Based on our knowledge of both the visual primitives and their groupings employed by model-driven approaches and the local shapes in the data-driven, i.e., learned, dictionaries, we decided to define a small number of essential base models for the shapes of local image appearances. These base shapes, also called core shapes, consist of flat, ramp, valley (dark line), ridge (light line), basin (dark circular blob), summit (light circular blob), elongated basin/summit, termination, saddle, corner, several kinds of junctions, cross, curves (like an L-junction) and Gabor-like shapes. As reviewed in detail in Section 2, the rationale for these choices lies in the rich tradition of visual primitives of generic images in the literature. Some of the milestones on the path leading to our models are the primal sketch [23,24] of Marr, the spatial groupings into textons of Julesz [62], the works of Tenenbaum and Witkin [63] and Lowe [60] emphasizing the non-accidental nature of collinearity, curvilinearity, co-termination, crossing, etc. primitives, and Griffin et al.'s Basic Image Features [28,77,61]. In what follows, we first introduce the shape models forming the core dictionary, then parametrize this dictionary and generate variations based on rotation, scaling and shifts, and finally prune it to a more expressive dictionary based on the mutual information principle.

3.1. Shape models

The core shape models or patterns are intended to represent the most basic local image appearances. Each pattern will spawn a number of other atoms under the geometrical transformations of translation, rotation, stretching, and contraction. Furthermore, these shape models are combined nonlinearly to obtain a richer dictionary. Accordingly, we have three groups of models:

• Group I models (Table 1) consist of linear, quadratic and cubic polynomials, as well as exponential and transcendental functions, all within the argument of a sigmoid. The sigmoid transformation $\frac{1}{1 + e^{-\alpha x}}$ provides us with one extra degree of freedom in its rate parameter α.

• Group II models (Table 2) are compound models obtained by various nonlinear combinations of Group I models, inspired by perceptual groupings in the literature [23,63,60,58,59,70].

• Group III models (Table 3) generate a set of filter patterns, including first and second derivative of Gaussian filters [78] and Gabor filters [36], to complement our shape library. We incorporated these transform-based patterns since they are well-established shapes in the literature.

Notice that this grouping is done simply for convenience of the discourse; our final shape dictionary consists of the union of these three sets.

3.1.1. Group I models

A gray-level image can be thought of as a landscape $I(x, y)$, with $(x, y)$ as the spatial dimensions and luminance as the third dimension, similar to a physical landscape [79]. Then, for example, uniform luminance regions correspond to flat surfaces or planes, negative and positive bar-like shapes to valleys or ridges, and edges to ramps, scarps or hillsides, etc. In this context, ramps represent rapid intensity transitions and correspond to edges, parametrized by orientation and intensity transition rate. Similarly, valleys and ridges correspond, respectively, to dark and light lines vis-à-vis their background, characterized by their orientation and the intensity transition rate of their hillsides. Continuing this line of analogy, one can conceive basins and mesas (or tumuli) as representing dark and light circular blobs, respectively, parametrized by the intensity transition rate from background to foreground. Finally, elongated versions of the basin and mesa represent elliptical dark and light blobs; the latter are parametrized by their orientation, intensity transition rate and eccentricity. Other more complex structures, potentially corresponding to more complex local image structure, are given in Table 1.

New dictionary atoms can be built from the collection of core shapes by varying one or more of the rotation, eccentricity, translation and transition rate parameters, whenever applicable. By varying these parameters, oriented or shifted versions of the models, or models differing in steepness of transition and/or in eccentricity of the patterns, can be created. One must, however, sample the parameter space, that is, quantize the parameter values, judiciously so that the shape models become sufficiently distinctive and informative. Classification performance results indicate that the recognition rates have sensitivities to the parameter sampling interval that differ from parameter to parameter. For example, the fineness of the rotation intervals affects the performance much more than the sampling step size of the transition rate α.

Table 1
Generator functions for Group I shape patterns. $(x_\theta, y_\theta)$ denotes the rotated coordinates of $(x, y)$ with angle $\theta$; $(x', y')$ denotes the coordinates translated by the amount $u = (u_x, u_y)$, i.e., $x' = x + u_x$, $y' = y + u_y$, with $u = (u_x, u_y) = (0, 0)$ for the non-shifted shapes; $f(x, y, u; \theta)$ denotes the rotated and shifted versions, in that order, of $f(x, y)$; $\rho$ controls the eccentricity of the elongated basin/mesa, and we use $\rho = 2.4$.

Flat: $f_0(x, y) = c$
Ramp: $f_1(x, y, u; \theta) = x'_\theta + y'_\theta$
Valley: $f_2(x, y, u; \theta) = (x'_\theta + y'_\theta)^2$
Ridge: $f_3(x, y, u; \theta) = -(x'_\theta + y'_\theta)^2$
Basin: $f_4(x, y, u; \theta) = x'^2_\theta + y'^2_\theta$
Mesa: $f_5(x, y, u; \theta) = -(x'^2_\theta + y'^2_\theta)$
Elongated basin: $f_6(x, y, u; \theta) = x'^2_\theta + y'^2_\theta/\rho$
Elongated mesa: $f_7(x, y, u; \theta) = -(x'^2_\theta + y'^2_\theta/\rho)$
–: $f_8(x, y, u; \theta) = x'_\theta \times y'^2_\theta$
–: $f_9(x, y, u; \theta) = x'_\theta \times y'^2_\theta + x'^2_\theta \times y'_\theta$
–: $f_{10}(x, y, u; \theta) = (x'_\theta + y'_\theta)^3$
–: $f_{11}(x, y, u; \theta) = x'^3_\theta + y'^3_\theta$
–: $f_{12}(x, y, u; \theta) = \exp(x'_\theta \times y'_\theta)$
–: $f_{13}(x, y, u; \theta) = x'_\theta \times \cos\frac{y'_\theta}{2}$
–: $f_{14}(x, y, u; \theta) = \cos\frac{x'_\theta + y'_\theta}{2}$


The rotated versions of these models are obtained via the coordinate transformation

$\begin{pmatrix} x_\theta \\ y_\theta \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$,

where θ is the rotation angle.

Notice that θ is not applicable to rotationally symmetric shapes such as flat ($f_0$), basin ($f_4$) and mesa ($f_5$). Shifting constitutes a nonlinear operation due to the cropping and uncovering outside the given patch. More specifically, we generate new shape varieties by shifting the central shape patterns, i.e., those with $u = (u_x, u_y) = (0, 0)$, by some translation vector $u = (u_x, u_y)$, $u_x > 0$ or $u_y > 0$, so that the parts of the shape moving out of the patch window are cropped out, while the uncovered regions are filled with the continuation of the shape that was out of view before the translation. The steepness of the gray-level transitions on the hillsides of ramps, valleys and basins is controlled by α, the transition rate parameter of the sigmoidal function, as in Eq. (1), where $F: \mathbb{R} \to [0, 1]$ in Table 1. Finally, the parameter $\rho > 1$ controls the eccentricity of the elongated basin/mesa varieties.

$F(x, y, u, \theta, \alpha) = \dfrac{1}{1 + e^{-\alpha f(x, y, u; \theta)}}$   (1)

Notice that in Table 1 we have denoted the shape primitive function as $f_i(\cdot)$, while the actual shape model used is $F_i(\cdot)$, that is, the primitive after being subjected to a sigmoidal operation $f_i(\cdot) \xrightarrow{\text{sigmoid}} F_i(\cdot)$ with steepness parameter α. Each $F_i(\cdot)$ takes values in the range [0, 1] with support $\left[-\frac{p-1}{2}, \frac{p-1}{2}\right] \times \left[-\frac{p-1}{2}, \frac{p-1}{2}\right]$, p odd. For illustration, we show in Fig. 2 the 3D shape surfaces and the corresponding gray-level pattern (lower left corner) for the ramp primitive function $f_1(x, y; \theta = 0°)$, and two instances of the actual ramp model, $F_1(x, y, u = 0; \theta = 0°, \alpha = 0.4)$ and $F_1(x, y, u = 0; \theta = 0°, \alpha = 1.2)$.
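The following is a minimal sketch of how one such atom can be rendered from its generator: the ramp primitive $f_1$ is evaluated on a rotated, translated coordinate grid and passed through the sigmoid of Eq. (1). The patch size and the θ, u, α values mirror the examples above; the function name and the order in which the coordinate transforms are composed are our own assumptions.

```python
import numpy as np

def ramp_atom(p=15, theta=0.0, u=(0, 0), alpha=0.4):
    """Render a ramp model F1 on a p x p patch (p odd), following Eq. (1)."""
    half = (p - 1) / 2
    coords = np.arange(-half, half + 1)
    x, y = np.meshgrid(coords, coords)
    # Translate, then rotate the coordinates (one reading of Table 1's conventions).
    xs, ys = x + u[0], y + u[1]
    x_t = np.cos(theta) * xs + np.sin(theta) * ys
    y_t = -np.sin(theta) * xs + np.cos(theta) * ys
    f1 = x_t + y_t                              # ramp primitive f1
    return 1.0 / (1.0 + np.exp(-alpha * f1))    # sigmoid, values in (0, 1)

shallow = ramp_atom(alpha=0.4)   # gentle transition, as in Fig. 2 (center)
steep = ramp_atom(alpha=1.2)     # sharper transition, as in Fig. 2 (right)
print(shallow.shape, float(shallow.min()), float(steep.max()))
```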

One could also consider using as primitive shape models orthogonal polynomials such as Chebyshev, Hermite, Krawtchouk, etc. [80], or higher-order polynomial functions. We argue that, within the patch sizes used, e.g., 15×15, sufficient shape diversity is produced by the low-order polynomials and the exponential and transcendental functions in Table 1.

Table 2
Group II models. Note that the parent shapes, e.g., $F_1$ and $F_2$, are first scaled, or rotated, or shifted, and then compounded.

Saddle (parents $F_1, F_1$): $F_{15}(x, y, u; \theta, \psi, \alpha) = \min[\max(F_1(x, y, u; \theta, \alpha), F_1(x, y, u; \theta + \psi, \alpha)),\ 1 - \min(F_1(x, y, u; \theta, \alpha), F_1(x, y, u; \theta + \psi, \alpha))]$
Dark corner (parents $F_1, F_1$): $F_{16}(x, y, u; \theta, \psi, \alpha) = 1 - \min(F_1(x, y, u; \theta, \alpha), F_1(x, y, u; \theta + \psi, \alpha))$
Light corner (parents $F_1, F_1$): $F_{17}(x, y, u; \theta, \psi, \alpha) = \min(F_1(x, y, u; \theta, \alpha), F_1(x, y, u; \theta + \psi, \alpha))$
Dark junction (parents $F_1, F_2$): $F_{18}(x, y, u; \theta, \psi, \alpha) = \min(F_1(x, y, u; \theta, \alpha), F_2(x, y, u; \theta + \psi, \alpha))$
Light junction (parents $F_1, F_2$): $F_{19}(x, y, u; \theta, \psi, \alpha) = 1 - \min(F_1(x, y, u; \theta, \alpha), F_2(x, y, u; \theta + \psi, \alpha))$
Dark cross (parents $F_2, F_2$): $F_{20}(x, y, u; \theta, \psi, \alpha) = \min(F_2(x, y, u; \theta, \alpha), F_2(x, y, u; \theta + \psi, \alpha))$
Light cross (parents $F_2, F_2$): $F_{21}(x, y, u; \theta, \psi, \alpha) = 1 - \min(F_2(x, y, u; \theta, \alpha), F_2(x, y, u; \theta + \psi, \alpha))$
Dark termination (parents $F_1, F_3$): $F_{22}(x, y, u; \theta, \psi, \alpha) = 1 - \min(F_1(x, y, u; \theta, \alpha), F_3(x, y, u; \theta + \psi, \alpha))$
Light termination (parents $F_1, F_3$): $F_{23}(x, y, u; \theta, \psi, \alpha) = \min(F_1(x, y, u; \theta, \alpha), F_3(x, y, u; \theta + \psi, \alpha))$
Dark T-junction (parents $F_2, F_{22}$): $F_{24}(x, y, u; \theta, \psi, \alpha) = \min(F_2(x, y, u; \theta, \alpha), F_{22}(x, y, u; \theta + \psi, \alpha))$
Light T-junction (parents $F_3, F_{23}$): $F_{25}(x, y, u; \theta, \psi, \alpha) = \min(F_3(x, y, u; \theta, \alpha), F_{23}(x, y, u; \theta + \psi, \alpha))$
Light curve (parents $F_4, F_5, F_1, F_1$): $F_{26}(x, y, u; \theta, \psi, \alpha) = \min[1 - \max(F_4(x, y, u; \alpha), F_5(x, y, u; \alpha)),\ 1 - \min(F_1(x, y, u; \theta, \alpha), F_1(x, y, u; \theta + \psi, \alpha))]$
Dark curve (parent $F_{26}$): $F_{27}(x, y, u; \theta, \psi, \alpha) = 1 - F_{26}(x, y, u; \theta, \psi, \alpha)$

Table 3
Group III models: $v = (v_x, v_y)$ denotes the translation vector used to create terminations of the shape patterns. $(x', y')$ denotes the coordinates translated by $u = (u_x, u_y)$, i.e., $x' = x + u_x$, $y' = y + u_y$. Notice that $u = (u_x, u_y) = (0, 0)$ for central shapes.

Edge kernel: $F_{28}(x, y, u; \theta, \sigma) = \frac{\partial}{\partial y'_\theta}\left[\frac{1}{2\pi\sigma^2}\exp\left(-\frac{x'^2_\theta + y'^2_\theta}{2\sigma^2}\right)\right]$
Edge kernel, termination 1: $F_{29}(x, y, v, u; \theta, \sigma) = \frac{\partial}{\partial y'_\theta}\left[\frac{1}{2\pi\sigma^2}\exp\left(-\frac{(x'_\theta + v_x)^2 + (y'_\theta + v_y)^2}{2\sigma^2}\right)\right]$
Edge kernel, termination 2: $F_{30}(x, y, v, u; \theta, \sigma) = \frac{\partial}{\partial y'_\theta}\left[\frac{1}{2\pi\sigma^2}\exp\left(-\frac{(x'_\theta - v_x)^2 + (y'_\theta - v_y)^2}{2\sigma^2}\right)\right]$
Bar kernel: $F_{31}(x, y, u; \theta, \sigma) = \frac{\partial^2}{\partial y'^2_\theta}\left[\frac{1}{2\pi\sigma^2}\exp\left(-\frac{x'^2_\theta + y'^2_\theta}{2\sigma^2}\right)\right]$
Bar kernel, termination 1: $F_{32}(x, y, v, u; \theta, \sigma) = \frac{\partial^2}{\partial y'^2_\theta}\left[\frac{1}{2\pi\sigma^2}\exp\left(-\frac{(x'_\theta + v_x)^2 + (y'_\theta + v_y)^2}{2\sigma^2}\right)\right]$
Bar kernel, termination 2: $F_{33}(x, y, v, u; \theta, \sigma) = \frac{\partial^2}{\partial y'^2_\theta}\left[\frac{1}{2\pi\sigma^2}\exp\left(-\frac{(x'_\theta - v_x)^2 + (y'_\theta - v_y)^2}{2\sigma^2}\right)\right]$
Gabor kernel, phase 1: $F_{34}(x, y, u; \theta, \lambda, \sigma, \xi) = -\cos\left(\frac{2\pi y'_\theta}{\xi}\right) \times \exp\left(-\frac{x'^2_\theta}{\lambda^2\sigma^2} - \frac{y'^2_\theta}{\sigma^2}\right)$
Gabor kernel, phase 1, termination 1: $F_{35}(x, y, v, u; \theta, \lambda, \sigma, \xi) = -\cos\left(\frac{2\pi(y'_\theta + v_y)}{\xi}\right) \times \exp\left(-\frac{(x'_\theta + v_x)^2}{\lambda^2\sigma^2} - \frac{(y'_\theta + v_y)^2}{\sigma^2}\right)$
Gabor kernel, phase 1, termination 2: $F_{36}(x, y, v, u; \theta, \lambda, \sigma, \xi) = -\cos\left(\frac{2\pi(y'_\theta - v_y)}{\xi}\right) \times \exp\left(-\frac{(x'_\theta - v_x)^2}{\lambda^2\sigma^2} - \frac{(y'_\theta - v_y)^2}{\sigma^2}\right)$
Gabor kernel, phase 2: $F_{37}(x, y, u; \theta, \lambda, \sigma, \xi) = \mathrm{Hilbert}(F_{34}(x, y, u; \theta, \lambda, \sigma, \xi))$
Gabor kernel, phase 2, termination 1: $F_{38}(x, y, v, u; \theta, \lambda, \sigma, \xi) = \mathrm{Hilbert}(F_{35}(x, y, v, u; \theta, \lambda, \sigma, \xi))$
Gabor kernel, phase 2, termination 2: $F_{39}(x, y, v, u; \theta, \lambda, \sigma, \xi) = \mathrm{Hilbert}(F_{36}(x, y, v, u; \theta, \lambda, \sigma, \xi))$


Higher-order terms of the orthogonal polynomials tend to have rapid oscillations, and polynomial powers beyond three give rise to rapid amplitude excursions; both behaviours are not commonly encountered in images and/or cannot be accommodated within the patch sizes we used.

3.1.2. Group II models

In order to enrich the shape dictionary, we first investigated various linear combinations of Group I shapes. For example, we collected a training set of images and their patches, and represented each patch in terms of Group I models using Sparse Representation Coding (SRC) [71]. Then we clustered their representation coefficients, and finally obtained new shape models learned from the data in terms of sparse linear combinations of Group I shapes. We observed that adding these partly data-learned shapes to the dictionary did not improve the classification performance significantly.

We then conjectured that pairwise nonlinear combinations of the basic shapes would create new discriminative patterns. We have used min-max addition operations as in Table 2 to obtain the compound shapes. One could contemplate other nonlinearity alternatives, such as the SVM hinge nonlinearity, the Lorentzian distance $\log(1 + \exp(\alpha \times x))$, the rectifier nonlinearity (ReLU) $f(x) = \max(x, 0)$ [81], the Leaky-ReLU $f(x) = \max(x, \alpha x)$, $\alpha < 1$ [82], the maxout function $f(x) = \max(W_1 x + b_1, W_2 x + b_2)$ [83], etc. Notice that these are many-to-one nonlinearities, hence they are not convenient for generating a new image patch from two or more image patches. Other alternatives could have been a soft version of the max operation, e.g., $y = \max(\alpha x_1 + \beta x_2, \alpha x_2 + \beta x_1)$, where $x_1, x_2$ are the pixels from the two input patches, and bottom-hat and top-hat morphological operations for vector inputs, where the two or more channels are simply the two or more primitive patches to be joined nonlinearly [84]. We have not considered these alternatives because (i) for the soft version of the max operation, it is not very clear how to optimize the choice of the weighting coefficients, and (ii) for the vector morphological operation, it is not very clear which set of primitive patches should be used to obtain the set of meaningful shapes for image representation that we aimed to model at the outset. In fact, discussions in the literature on the role of structure in vision indicate that groups of primitive structures that yield meaningful image representations are unlikely to emerge by accident; instead, they are formed according to certain grouping relations that are invariant across viewpoint changes [63,60]. Such well-known shapes are corners, a variety of junctions, crosses, parallel valleys/ridges, and saddles [23,63,60,58,59,70,30]. These shape groupings can be obtained by combining the core shapes nonlinearly, i.e., via simple min-max addition operations as in Table 2.

In the min-max operation we have one new parameter, the relative angle between the two parent models, e.g., $F_i, F_j$, $i \neq j$, called the shape compounding angle or simply the compounding angle, ψ. The compounded shapes are also subject to rotations by θ angles, much as in the first group of models. While compounding nonlinearly, we have chosen shape models from Table 1 having similar α parameter values. For example, for some chosen $q_\alpha$, let the values of the transition parameter be $(\alpha^1_{P_1}, \ldots, \alpha^{q_\alpha}_{P_1})$ for parent $P_1$ and $(\alpha^1_{P_2}, \ldots, \alpha^{q_\alpha}_{P_2})$ for parent $P_2$. The values are sorted in ascending order as $\alpha^{(1)}_{P_i} < \cdots < \alpha^{(q_\alpha)}_{P_i}$, $i = 1, 2$, and transition parameters with the same ordered rank are paired, e.g., $\alpha^{(1)}_{P_1}$ with $\alpha^{(1)}_{P_2}$, $\alpha^{(2)}_{P_1}$ with $\alpha^{(2)}_{P_2}$, etc. Note that this matching of scales is needed since the parent shapes $F_i, F_j$ must have the same scale, as argued in [58]. This reduces the number of combinations to be compounded by a factor equal to the cardinality of the transition rates. In any case, it was observed that the performance was not too sensitive to the transition rate parameter.
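A minimal sketch of the min-max compounding of Table 2: a light corner is the pixelwise minimum of two ramps whose orientations differ by the compounding angle ψ, and its dark counterpart is its complement. The helper names and default parameter values below are ours, chosen only for illustration.

```python
import numpy as np

def ramp(p, theta, alpha):
    """Ramp model F1 (Eq. (1)) on a p x p patch; see the earlier sketch."""
    half = (p - 1) / 2
    coords = np.arange(-half, half + 1)
    x, y = np.meshgrid(coords, coords)
    x_t = np.cos(theta) * x + np.sin(theta) * y
    y_t = -np.sin(theta) * x + np.cos(theta) * y
    return 1.0 / (1.0 + np.exp(-alpha * (x_t + y_t)))

def light_corner(p=15, theta=0.0, psi=np.pi / 2, alpha=0.4):
    """F17-style compound: pixelwise min of two ramps offset by psi."""
    return np.minimum(ramp(p, theta, alpha), ramp(p, theta + psi, alpha))

def dark_corner(p=15, theta=0.0, psi=np.pi / 2, alpha=0.4):
    """F16-style compound: complement of the light corner."""
    return 1.0 - light_corner(p, theta, psi, alpha)

patch = light_corner()
print(patch.shape, float(patch.min()), float(patch.max()))
```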

3.1.3. Group III models

Group III shapes consist of models that are filter-bank functions commonly used in the literature. We used 1st and 2nd derivative of Gaussian (edge and bar filter) patterns, which are also part of the Leung-Malik (LM) and Maximum Response (MR) sets [85,78], and high-frequency Gabor patterns [36]. We have not included Gaussian filters, since our library already contains circular dark and light spot shapes, i.e., the basin and mesa in Table 1. We also tested Laplacian of Gaussian and isotropic Schmidt filter patterns [85], but they were encountered very rarely in our test images. Gabor filter patterns are included to enrich the dictionary with multi-valley/ridge appearances.

We used the open source code in [86] to create the first and second derivative of Gaussian filters. The MR8 and LM sets contain 1st and 2nd derivative of Gaussian filters at 6 orientations and 3 scales of $(\sigma_x, \sigma_y) \in \{(1, 3), (2, 6), (4, 12)\}$ for a block size of 49 pixels [86]. We similarly employed 6 orientations, but just 2 scales of $(\sigma_x, \sigma_y) \in \{(1, 3), (2, 6)\}$, making 24 patterns convenient for our $p = 15 \times 15$ patch size. We created the Gabor patterns in two phases by employing the Hilbert transform using the PMT toolbox [87]. The Gabor parameters are the standard deviation σ to set the scale, λ to set the eccentricity of the Gaussian mask, and finally ξ to adjust the wavelength of the sinusoidal modulation.

We also include in the shape dictionary the terminations (end points) of the patterns of the 1st and 2nd derivative of Gaussian and Gabor filters. More specifically, we shift the shape by half the patch size, $(p - 1)/2$, in the direction perpendicular to the maximum gradient direction, i.e., using the translation vector $v = (v_x, v_y)$. The uncovered regions after shifting are filled with the continuation of the shape that was out of view before the translation. The Group III models, both the models of the conventional filters, i.e., $F_{28}, F_{31}, F_{34}, F_{37}$, and the models used to create their terminations, i.e., $F_{29}, F_{30}, F_{32}, F_{33}, F_{35}, F_{36}, F_{38}, F_{39}$, are given in Table 3.
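As an illustration of the Group III generators, the sketch below renders the edge kernel (first derivative of a Gaussian, in the spirit of F28) at a given orientation, together with a termination obtained by translating the pattern by v as described above. The discrete grid, the analytic derivative, the single-σ Gaussian and the function names are our own assumptions, not the filter-bank code of [86].

```python
import numpy as np

def edge_kernel(p=15, theta=0.0, sigma=2.0, v=(0.0, 0.0)):
    """First derivative of a Gaussian along the rotated y axis (cf. F28-F30).
    A nonzero v translates the pattern so that its end point enters the patch."""
    half = (p - 1) / 2
    coords = np.arange(-half, half + 1)
    x, y = np.meshgrid(coords, coords)
    x_t = np.cos(theta) * x + np.sin(theta) * y      # rotated coordinates
    y_t = -np.sin(theta) * x + np.cos(theta) * y
    xs, ys = x_t + v[0], y_t + v[1]                  # v[0]: along the pattern, v[1]: across it
    g = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return -(ys / sigma ** 2) * g                    # analytic d/dy of the Gaussian

central = edge_kernel(theta=np.pi / 6)                       # central edge pattern
termination = edge_kernel(theta=np.pi / 6, v=(7.0, 0.0))     # shifted by (p - 1)/2
print(central.shape, float(np.abs(termination).max()))
```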

3.2. Parametrization of the shape dictionary

In this section, we discuss the choices of the shape generation parameters and their effect.

Fig. 2. 3D surface illustrations of the ramp model and its gray-level appearance. Left: the surface generated by the ramp primitive function $f_1(\cdot)$ computed for θ = 0°; the next two surfaces are its sigmoidally transformed versions, i.e., under the $F_1(\cdot)$ function computed at two values of the transition rate α. Center: α = 0.4; Right: α = 1.2.


The parameters considered are: (i) the quanta of the rotation angles; (ii) the quanta of the relative angle for shape compounding; (iii) the steepness of the hillsides, i.e., the transition rate; and (iv) the shape position vis-à-vis the patch center. We have used the Caltech-101 dataset to tune the dictionary parameters in this section, since it is one of the most diverse datasets in terms of inter-class variability. The preprocessing steps described in Section 5.3 are applied to the Caltech-101 images.

3.2.1. Quantization scheme applied to Group I and Group II models

We quantize the ranges of the rotation angle, θ, and transition rate, α, parameters into $q_\theta$ and $q_\alpha$ discrete values, respectively, $\theta \in \{\theta_1, \ldots, \theta_{q_\theta}\}$ and $\alpha \in \{\alpha_1, \ldots, \alpha_{q_\alpha}\}$, and with the addition of these parameters the shape models are denoted as $F(x, y; \theta, \alpha)$. For each of the $q_\theta \times q_\alpha$ pairs, i.e., $\{(\theta_1, \alpha_1), (\theta_1, \alpha_2), \ldots, (\theta_1, \alpha_{q_\alpha}), \ldots, (\theta_{q_\theta}, \alpha_1), \ldots, (\theta_{q_\theta}, \alpha_{q_\alpha})\}$, we obtain a shape variety with a particular orientation and transition rate. For the compound shapes, we also set $q_\psi$ quanta for the compounding angle parameter, ψ, resulting in $q_\theta \times q_\alpha \times q_\psi$ combinations, so that the shape function becomes $F(x, y; \theta, \psi, \alpha)$. The quantization scheme applied to each type of generation parameter is described below.

Rotation angle parameter (θ). The shapes that possess inherent symmetry, such as a circular basin or mesa, are obviously rotation invariant, as in $F_4, F_5$. The shapes that possess two gradients in opposite senses, such as a bar, have 180° symmetry, and hence can be rotated within the range $[0, \pi]$. They correspond to the models $F_2, F_3, F_6, F_7, F_{14}, F_{15}$, and these can be denoted as bi-directional. Finally, the shapes that possess one dominant gradient, such as a ramp, can be rotated over the range $[0, 2\pi]$; these are $F_1, F_8, F_9, F_{10}, F_{11}, F_{12}, F_{13}, F_{16}, F_{17}, F_{18}, F_{19}, F_{22}, F_{23}, F_{24}, F_{27}$, and we refer to them as uni-directional. We experimented with different values of $q_\theta$, that is, splitting the 360° into $q_\theta^{(\text{uni-dir.})} \in \{4, 8, 12, 16, 20, 24\}$ and the 180° into $q_\theta^{(\text{bi-dir.})} \in \{2, 4, 6, 8, 10, 12\}$ slices, corresponding to rotational angle steps of $\{\frac{\pi}{2}, \frac{\pi}{4}, \frac{\pi}{6}, \frac{\pi}{8}, \frac{\pi}{10}, \frac{\pi}{12}\}$. The transition rate and compounding angle parameters were fixed at $q_\alpha = 1$ and $q_\psi = 3$, as described in the sequel. A rotational angle step of $\frac{\pi}{4}$ is commonly used in the literature [13,29] (i.e., $q_\theta^{(\text{uni-dir.})} = 8$ and $q_\theta^{(\text{bi-dir.})} = 4$). However, we have found that a finer splitting, namely a $\frac{\pi}{8}$ radian separation, i.e., $q_\theta^{(\text{uni-dir.})} = 16$ and $q_\theta^{(\text{bi-dir.})} = 8$, brings about a ~3% performance improvement (see Fig. 4). We did not experiment with rotation angle steps smaller than $\frac{\pi}{12}$, since the performance reached a saturation plateau. This is actually an expected outcome, since the BRIEF features computed on the shape patterns (to be described in Section 4) were reported to be insensitive to angular shifts smaller than ~π/9 radians (20°) [30]. Two exceptions were the cross shape patterns, $F_{20}$ and $F_{21}$, which had only three quanta for their rotation angles, i.e., $q_\theta = 3$ with $\theta \in \{\frac{\pi}{4}, \frac{7\pi}{12}, \frac{11\pi}{12}\}$, to preclude their duplication.

Compounding angle parameter (ψ). We applied uniform quantization to the compounding angle and determined that $q_\psi = 3$ quanta, i.e., $\psi \in \{\frac{\pi}{4}, \frac{\pi}{2}, \frac{3\pi}{4}\}$, suffice, since finer quantization levels result in shape patterns that are too close to each other. Sample images of compound shapes with the three settings of the compounding angle are presented in Fig. 3. Two exceptions were the curved shape patterns, $F_{26}$ and $F_{27}$, where only two quanta of the compounding angle, i.e., $q_\psi = 2$ with $\psi \in \{\frac{\pi}{4}, \frac{\pi}{2}\}$, are used to preclude their duplication.

Summary. The rotating and compounding schemes applied to the shape models of Groups I and II result in a shape dictionary of size $D = D_\theta^{(\text{Group I})} + D_{\theta,\psi}^{(\text{Group II})} = 580$:

$16 \times F_1 + 8 \times F_2 + 8 \times F_3 + 1 \times F_4 + 1 \times F_5 + 8 \times F_6 + 8 \times F_7 + 16 \times F_8 + 16 \times F_9 + 16 \times F_{10} + 16 \times F_{11} + 16 \times F_{12} + 16 \times F_{13} + 8 \times F_{14} + 24 \times F_{15} + 48 \times F_{16} + 48 \times F_{17} + 48 \times F_{18} + 48 \times F_{19} + 9 \times F_{20} + 9 \times F_{21} + 16 \times F_{22} + 16 \times F_{23} + 48 \times F_{24} + 48 \times F_{25} + 32 \times F_{26} + 32 \times F_{27}$.

Here, $D_\theta^{(\text{Group I})} = \sum_{i=1}^{14} q_\theta^{F_i} \times F_i$, where $q_\theta^{F_i} \times F_i$ denotes the contribution of shape model $F_i$ in Table 1 to the dictionary with $q_\theta^{F_i}$ rotated variants of it, and $D_{\theta,\psi}^{(\text{Group II})} = \sum_{i=15}^{27} q_\theta^{F_i} \times q_\psi^{F_i} \times F_i$, where $q_\theta^{F_i} \times q_\psi^{F_i} \times F_i$ denotes the contribution of shape model $F_i$ in Table 2 to the dictionary with $q_\theta^{F_i} \times q_\psi^{F_i}$ rotated and compounded variants of it.
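This bookkeeping can be verified in a few lines; the per-model counts below are copied from the D = 580 breakdown above, and the variable names are ours (a worked check, not part of the method itself).

```python
# Rotated (and, for Group II, compounded) variants per shape model F1..F27,
# as listed in the D = 580 breakdown above.
group1 = {"F1": 16, "F2": 8, "F3": 8, "F4": 1, "F5": 1, "F6": 8, "F7": 8,
          "F8": 16, "F9": 16, "F10": 16, "F11": 16, "F12": 16, "F13": 16,
          "F14": 8}
group2 = {"F15": 24, "F16": 48, "F17": 48, "F18": 48, "F19": 48, "F20": 9,
          "F21": 9, "F22": 16, "F23": 16, "F24": 48, "F25": 48, "F26": 32,
          "F27": 32}

d_group1 = sum(group1.values())           # 154
d_group2 = sum(group2.values())           # 426
assert d_group1 + d_group2 == 580         # dictionary size with q_alpha = 1
assert 2 * (d_group1 + d_group2) == 1160  # size after adopting q_alpha = 2
print(d_group1, d_group2, d_group1 + d_group2)
```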

Transition rate parameter (α). Since the parameter α appears in the exponent of the sigmoid function (Eq. (1)), uniform quantization is not advisable. Instead, we determined discrete values of α for each shape model by clustering the BRIEF features as follows: (i) densely sample the variable α over its range and generate the corresponding shapes, e.g., $F_i(x, y; \theta, \psi, \alpha)$ for some i and for fixed θ and ψ; (ii) cluster these shape patterns using K-medoids [88]; and (iii) adopt as quanta of the α parameter the values of the cluster medoidal shapes, i.e., $q_\alpha = K$.

We experimented with different values of $q_\alpha$, that is, $q_\alpha \in \{1, 2, 3, 4\}$ for all models of Groups I and II, while the rotation angle and compounding angle parameters were fixed at $q_\theta^{(\text{uni-dir.})} = 16$, $q_\theta^{(\text{bi-dir.})} = 8$, and $q_\psi = 3$. Given the insensitivity of the performance results to higher numbers of quanta for this parameter (see Fig. 4), we decided to adopt $q_\alpha = 2$. In fact, the performance gain is a modest ~2% for $q_\alpha = 2$ vis-à-vis $q_\alpha = 1$, and larger values of $q_\alpha$ did not yield any significant advantage. Adopting $q_\alpha = 2$ resulted in a shape dictionary of size D = 1160 ($2 \times 580$).
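The α quantization procedure can be sketched as follows: densely sample α, render the corresponding shape patterns, describe each with a binary descriptor (a toy stand-in for BRIEF here), cluster the descriptors with a small K-medoids loop under Hamming distance, and adopt the medoids' α values as the quanta. Everything below (the toy descriptor, the Voronoi-iteration clustering loop, the parameter values) is our own illustrative choice, not the authors' exact procedure or the implementation of [88].

```python
import numpy as np

def ramp(p, alpha, theta=0.0):
    """Ramp model F1 (Eq. (1)) on a p x p patch."""
    half = (p - 1) / 2
    coords = np.arange(-half, half + 1)
    x, y = np.meshgrid(coords, coords)
    x_t = np.cos(theta) * x + np.sin(theta) * y
    y_t = -np.sin(theta) * x + np.cos(theta) * y
    return 1.0 / (1.0 + np.exp(-alpha * (x_t + y_t)))

def toy_brief(patch, pairs):
    """Binary descriptor: compare patch values at fixed random pixel pairs."""
    flat = patch.ravel()
    return (flat[pairs[:, 0]] < flat[pairs[:, 1]]).astype(np.uint8)

def k_medoids(dist, k, n_iter=20, seed=0):
    """Tiny Voronoi-iteration K-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)     # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):                           # update each medoid
            members = np.flatnonzero(labels == j)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return medoids

# Densely sample alpha, render ramps, describe them, cluster, keep medoid alphas.
p, K = 15, 2
alphas = np.linspace(0.1, 2.0, 40)
rng = np.random.default_rng(0)
pairs = rng.integers(0, p * p, size=(128, 2))
descs = np.array([toy_brief(ramp(p, a), pairs) for a in alphas])
hamming = (descs[:, None, :] != descs[None, :, :]).sum(axis=-1)  # pairwise Hamming
alpha_quanta = np.sort(alphas[k_medoids(hamming, K)])
print(np.round(alpha_quanta, 2))   # adopted q_alpha = K transition-rate values
```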

3.2.2. Quantization scheme applied to Group III models

We generated the 1st and 2nd derivative of Gaussian patterns in 6 orientations as in [78], but at the lowest two scales, $(\sigma_x, \sigma_y) \in \{(1, 3), (2, 6)\}$, which resulted in 24 shape patterns. We used the open source code published in [86] to generate these patterns.

We created Gabor patterns in 6 orientations with the PMT toolbox published in [87]. In order to preclude duplications with the 1st and 2nd derivative of Gaussian filter patterns, we use only odd-phased Gabor kernel patterns at the scale σ = 2.6. The odd phase is obtained via the Hilbert transform, which introduces a 90-degree phase shift to the sinusoidal components of the Gabor waveforms. We also generate even- and odd-phased patterns at a higher scale of σ = 6.6. These constitute 18 shape patterns in total. Here, we use λ = 1.8 for the elongation of the Gaussian mask and ξ = 1.1 for the wavelength of the modulating sinusoid.

The 42 shape patterns created by the 1st and 2nd derivative of Gaussian and Gabor filter generators of the Group III models, with the mentioned scale and orientation parameter values, are visualized in Fig. 5. Finally, we create terminations of these shapes by shifting them by the amount $v = (v_x, v_y)$, so the set of shapes in Group III triples from 42 to 126 patterns. Thus, the size of the shape dictionary including the rotated, scaled, even- and odd-phased variants of the Group III models is $D_{\theta, \sigma, \lambda, \xi, v}^{(\text{Group III})} = 126$.

3.2.3. Enriching the dictionary by shift operations

For $p = 15 \times 15$ sized patches, we found it adequate to use only one shift size, 3 pixels, as this corresponds to 43% of one symmetric half of the patch size. In the patterns $F_1, F_2, F_3, F_{10}$ and $F_{14}$, whose generator functions are given in Table 1 and which are illustrated in Fig. 6, shifting is applied across the gradient. These shapes, which have only one dominant gradient direction, can be shifted in one direction along two senses. Shapes in Table 1 that have two or more dominant directions, such as a corner, and the terminations of the Group III models, are shifted along four senses using the shift vectors $(u_x, u_y) \in \{(3, 0), (0, 3), (-3, 0), (0, -3)\}$.

Fig. 3. Three quantization values for the compounding angle of the dark junction shape pattern F18.


The center symmetric shapes of Group III models are shifted in only twosenses to avoid duplicates with their terminations. Some examples ofcentral shapes and their shifted patterns are given in Fig. 6.

3.2.4. Overview of shape dictionary

We give an overview of the shape dictionaries generated so far, together with performance results, in Fig. 7. Quantizing the rotation and compounding angles with $q_\theta^{(\text{uni-dir.})} = 16$, $q_\theta^{(\text{bi-dir.})} = 8$, $q_\psi = 3$, and $q_\theta^{(\text{Group III})} = 6$, with $q_\alpha = 1$ fixed for the Group I and II patterns and the Group III patterns generated at a single scale, i.e., $(\sigma_x, \sigma_y) = (2, 6)$ for the 1st and 2nd derivative of Gaussian and σ = 6.6 for the Gabor functions, respectively, the first dictionary has 652 atoms and gives a performance of 70.8% ± 0.7. In addition to the orientation quantization, quantizing the transition rate to $q_\alpha = 2$ for the Group I and II patterns and generating the Group III patterns at two scales brings the dictionary to 1286 atoms, resulting in a performance of 72.27% ± 0.8. By also admitting shifts of these 1286 shape patterns, we reach a shape dictionary with D = 6122 atoms, giving a performance of 75.29% ± 0.6. This final shape dictionary with 6122 atoms contains 1316 atoms from Group I, 4260 atoms from Group II, and 546 atoms from Group III.

3.3. Pruning the shape dictionary

Once we have built a dictionary as a collection of shape patterns guided by qualitative image criteria, the next step is to prune it into a more compact dictionary. This alleviates not only the computational effort but may also improve the classification performance. We chose an information-theoretic selection method using the mutual information principle [89] to reduce the class uncertainties or, equivalently, to maximize the class posterior probabilities of the shape models. We opted for mutual-information-based feature selection since (1) it has been successfully used in a number of studies in the literature, including but not limited to [90,91], and (2) it is a simple method with low computational complexity and acceptable classification performance [92,93].

We compute the Mutual Information (MI) between the occurrence probability, $w_i$, of each dictionary atom, $i = 1, \ldots, D$, of a D-dimensional dictionary and the image categories (classes, objects, etc.), $y \in \{1, 2, \ldots, C\}$, of a C-category image dataset. The score $w_i$, the ith element of the signature vector W (to be described in Section 4, Eq. (5)), is the empirical distribution of the ith shape pattern score over the training images. A higher dependency between the occurrence probability of a shape pattern and the category labels results in a higher mutual information score, which in turn indicates that that particular shape pattern is probably significant for discriminating some dataset categories. Thus, we rank the shape patterns according to their MI scores and select the highest R of them to form the pruned dictionary (a minimal sketch of this selection follows the list below). Briefly:

1. Choose a training set of images on which to calculate the shape occurrence probabilities w_i (see Eq. (5)), where i = 1, …, D and D is the number of shape patterns.

2. Discretize the components of the vector w_i (Eq. (5)), i = 1, …, D, using their empirical distribution and name the discretized vector X_i;

Fig. 4. Effect of multiple shape orientations for Group I and II models. The vertical axis is the correct category recognition rate on the Caltech-101 dataset with the preprocessing steps of Section 5.3. Left: Changing the rotational angle step for uni-directionally and bi-directionally oriented shape models when the other parameters are fixed at q_α = 1 and q_ψ = 3. Right: Effect of having multiple transition slopes, q_α, while the other parameters are fixed at q_θ(uni-dir.) = 16 and q_θ(bi-dir.) = 8 (i.e., rotational angle step is π/8), and q_ψ = 3.

Fig. 5. Shape patterns generated by Group III models, block size p = 15 × 15. Left: patterns of the first and second derivative of Gaussian, generated at scales (σ_x, σ_y) ∈ {(1,3), (2,6)} and at six orientations. Right: Gabor patterns generated with ψ = 1.1 and λ = 1.8 (the first row shows Hilbert-transformed Gabor patterns generated at scale σ = 2.6; the second and third rows show Gabor kernels and their Hilbert-transformed versions, respectively, at scale σ = 6.6) and at six orientations.

Fig. 6. Some examples of central shape patterns from Groups I to III and their shifted versions.


notice that this discretization can be quite coarse; even a two-level quantization may suffice.

3. Compute the mutual information s_i of the i-th shape pattern as in Eq. (2). In Eq. (2), X_i denotes the discretized signature component of shape pattern i, Y denotes the category label of the training images, i.e., y ∈ {1, …, C}, and C is the number of categories in the image set. In this work it was sufficient to apply binary quantization to the signature components, hence x ∈ {0, 1}.

s_i = \mathrm{MI}(X_i, Y) = \sum_{x \in \{0,1\}} \sum_{y \in \{1,2,\ldots,C\}} \Pr[X_i = x, Y = y] \, \log \frac{\Pr[X_i = x, Y = y]}{\Pr[X_i = x]\,\Pr[Y = y]}    (2)

4. Rank the shape patterns, i.e., the dictionary atoms, according to their MI scores from highest to lowest, and select the top R shape patterns as in Eq. (3). In Eq. (3), γ denotes the parametrization of the shape model F_i, whatever applicable, i.e., γ ∈ {θ, ψ, α, σ, …}, and s_(i) denotes the mutual information score of the shape pattern at the i-th rank order.

\text{Pruned dictionary} = \bigcup_{i=1}^{R} \{ F_i(\cdot\,;\gamma) \mid s_{(i)} \geq s_{(R)} \}    (3)

We computed the ranked and pruned dictionaries for each of the four image datasets, i.e., COIL, ALOI-View, Caltech-101, and ZuBuD, separately. The training sets for the MI score computation are constructed by randomly sampling 30 images from each category, as in the standard setting of the Caltech-101 experiments, and the remaining images are incorporated into the testing set. The preprocessing operations applied to the Caltech images are explained in Section 5.3. This scheme is employed for all datasets except ZuBuD; since ZuBuD has a much smaller number of images, we sample just 5 images from each of its categories. We used the lowest-resolution Caltech images, i.e., 75 pixels on the longest side, to speed up the process. The numbers of categories for Caltech, COIL, ALOI, and ZuBuD are, respectively, C = 100, 100, 250, and 201, while the numbers of training images per category for the MI computation are 30, 30, 30, and 5.

To discretize the components w_i of the signature vector, i.e., the shape occurrence probabilities, various methods of feature discretization have been employed in the literature [94–97], e.g., unsupervised methods that do not make use of class membership information, such as Equal Frequency Binning (EFB) and Equal Width Binning (EWB) [94,95], and supervised methods that use class labels to accomplish discretization [96,97]. We simply applied unsupervised discretization into binary values, as it worked quite well, even slightly better (∼1%) than discretization by EFB into a larger number of levels, e.g., 10 discrete values, in our case. For the two-level quantization of the w_i scores we tried thresholding with respect to both the sample mean and the sample median, and experimental results demonstrated that median thresholding was slightly better.

We obtained pruned dictionaries with sizes R ∈ {256, 512, 768, 1024} out of the D = 6122 shapes for the four datasets.
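As an illustration of the pruning procedure described above, the following Python sketch (NumPy assumed; function and variable names are ours, not from the paper) binarizes each signature component by median thresholding, estimates the MI score of Eq. (2) empirically, and returns the indices of the top-R atoms as in Eq. (3).

```python
import numpy as np

def mi_prune(W, y, R):
    """Rank dictionary atoms by the mutual information (Eq. (2)) between their
    binarized occurrence scores and the class labels, and keep the top R (Eq. (3)).
    W : (n_images, D) matrix of signature vectors (Eq. (5)); y : (n_images,) labels."""
    n_images, D = W.shape
    _, y_idx = np.unique(y, return_inverse=True)
    C = y_idx.max() + 1
    scores = np.zeros(D)
    for i in range(D):
        # two-level quantization by thresholding at the sample median
        x = (W[:, i] > np.median(W[:, i])).astype(int)
        joint = np.zeros((2, C))
        np.add.at(joint, (x, y_idx), 1.0)          # empirical joint distribution
        joint /= n_images
        px = joint.sum(axis=1, keepdims=True)      # Pr[X_i = x]
        py = joint.sum(axis=0, keepdims=True)      # Pr[Y = y]
        nz = joint > 0
        scores[i] = np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz]))
    return np.argsort(scores)[::-1][:R]            # indices of the R highest-MI atoms
```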

The main findings were the following:

• Group II models, i.e., the compounded models, are predominantly represented in the final dictionary. One reason is that this set generates more discriminative shape patterns, which is to be expected, since Group II shape patterns are organized forms of Group I patterns; they are less likely to occur by accident and more prone to occur on the interesting parts of objects, as asserted in the literature [23,63,60,59].

• The portions of the pruned dictionary obtained from the three model groups (Groups I, II and III) are shown in Fig. 8 (Left). The dictionaries pruned according to the four image datasets do not differ significantly from each other; the whiskers on top of the bars denote the variance of the dictionary sizes from each group.

• Fig. 8 (Right) shows the percentage of top shape patterns averaged over the four image datasets. Note that only 19 shapes appear below the bars in this figure; these are the shapes collapsed from the R selected patterns of the pruned dictionary, when we have accumulated their hit probabilities over all parametrizations, i.e., rotations, offsets from center, compounding angles, transition rates and light/dark contractions. Notice that corners (F_16, F_17), junctions (F_18, F_19), T-junctions (F_24, F_25) and ramps (F_1) are the most significant shape patterns.

The performance results obtained by the selected shape patterns on the datasets are given in Fig. 9. It is important to note that no decrease in performance is observed even though ∼90% of the dictionary was pruned (768 atoms retained out of 6122); a slight improvement, i.e., ∼1%, is even achieved for Caltech-101. We decided to continue with the pruned dictionaries of sizes R ∈ {256, 512, 768, 1024}, since we think that dictionaries of these sizes sufficiently probe the relationship between the number of features and the performance. In fact, as can be seen in Fig. 9, we reach a saturation point at the R = 512 and R = 768 dictionaries.

4. Extraction of image descriptors

Given a shape dictionary, an image descriptor is typically a function of the expression strength or of the occurrence frequencies of the dictionary atoms. To this effect, one must first determine the degree of similarity between the image models and the local appearances.

To measure the similarity between a test patch and one of the dictionary elements, we have used BRIEF features [44].

Fig. 7. An overview of shape dictionary sizes and their performance on the Caltech-101 dataset. Left: Number of atoms vs. generated dictionaries. Right: Performance vs. generated dictionaries.


BRIEF features have the advantage of being binary strings, hence they are computationally efficient. Furthermore, they have been shown to be robust to illumination variations, and their performance is on a par with other high-performance features such as SURF [33], which are non-binary and hence computationally more complex. BRIEF features are not robust to geometric variations such as rotation and viewpoint changes. While orientation invariance appears to be a desirable property, its downside is that the shape templates become less discriminatory. With orientation-invariant features, we lose the capability to discriminate shapes based on the distribution of differently oriented local patterns. In other words, not only the shape model to which a test patch belongs is important, but its angular orientation as well. It is for this reason that our shape dictionary contains shape models, each at several angular orientations.

Three factors are crucial for the computation of BRIEF features: (i) the spatial arrangement of the point pairs within the patch, that is, the analysis window, (ii) the number n_d of pairwise pixel comparisons, and (iii) the type of filtering applied as a preprocessing step to reduce noise sensitivity. While Calonder et al. [44] recommended choosing the coordinates of the test sampling pairs from a Gaussian distribution, our experiments indicated that a uniform distribution works better. We use n_d = 256, which was the default feature length in [44], and we smooth the images with 3 × 3 box filtering as a preprocessing step.
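A minimal sketch of how such binary features can be computed is given below, assuming NumPy and SciPy; it follows the choices stated above (uniformly sampled point pairs, n_d = 256, 3 × 3 box filtering) but is not the reference BRIEF implementation of [44], and the function names are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sample_pairs(patch_size=15, n_d=256, seed=0):
    # Coordinates of the n_d test pairs, drawn uniformly inside the patch
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch_size, size=(n_d, 4))   # rows: (y1, x1, y2, x2)

def brief_bits(patch, pairs):
    # 3 x 3 box filtering to reduce noise sensitivity, then binary intensity comparisons
    smoothed = uniform_filter(patch.astype(float), size=3)
    y1, x1, y2, x2 = pairs.T
    return (smoothed[y1, x1] < smoothed[y2, x2]).astype(np.uint8)   # bit string of length n_d
```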

Let H = {0, 1} and let the n_d-dimensional Hamming space H^{n_d} consist of the binary vectors, or bit strings, of length n_d. Each point a ∈ H^{n_d} in this space is a string a = (a_1, a_2, …, a_{n_d}) of 0's and 1's, and the Hamming distance d_H(a, b) between two given points a, b ∈ H^{n_d} is the number of positions where these strings differ from each other, i.e., d_H(a, b) = \sum_{i=1}^{n_d} |a_i - b_i|. The i-th shape pattern, i.e., the i-th dictionary atom, and a test patch p are represented by their corresponding n_d-dimensional BRIEF features, denoted as b_{s_i} ∈ H^{n_d}, i = 1, …, D, and b_p ∈ H^{n_d}, respectively. Finally, the patch/image descriptor Z_p = [z_{(p,1)}, z_{(p,2)}, …, z_{(p,D)}], called in this work the SymPaD vector, becomes the D-dimensional normalized histogram of occurrences of the dictionary elements, computed as in Eq. (4) by hard-voting. Here d_H(a, b) denotes the Hamming distance between a and b. We implement the nearest-neighbor linear search via the FLANN library [98] using the Hamming distance. We also investigated the localized soft-voting proposed by Liu et al. [99], which assigns continuous weights to a subset of shape models proportional to their similarity (Hamming distance). However, since we obtain almost the same performance results as with hard-voting, we continue with the hard-voting scheme.

z_{(p,i)} = \begin{cases} 1, & \text{if } i = \operatorname*{argmin}_{j \in \{1,\ldots,D\}} d_H(b_p, b_{s_j}) \\ 0, & \text{otherwise} \end{cases}    (4)

Once the similarity decisions are computed for all of the p = 1, …, N test points on the image, the image signature W = [w_1, w_2, …, w_D] is computed by average pooling over the Z_p = [z_{(p,1)}, z_{(p,2)}, …, z_{(p,D)}] point descriptors, i.e., the SymPaD vectors, as in Eq. (5). We preferred average pooling since it is more appropriate in conjunction with a hard-voting coding scheme, i.e., max pooling of hard-voted descriptors yields less discriminative image signatures.

w_i = \frac{1}{N} \sum_{p=1}^{N} z_{(p,i)}, \qquad i = 1, \ldots, D    (5)
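The following sketch illustrates Eqs. (4) and (5) directly, with a brute-force Hamming search standing in for the FLANN-based nearest-neighbor search; the array shapes and function name are illustrative assumptions.

```python
import numpy as np

def image_signature(patch_bits, dict_bits):
    """Hard-vote each test patch to its nearest dictionary atom in Hamming
    distance (Eq. (4)) and average-pool the votes into the signature W (Eq. (5)).
    patch_bits : (N, n_d) binary BRIEF features of the N test patches,
    dict_bits  : (D, n_d) binary BRIEF features of the D dictionary atoms."""
    N, D = patch_bits.shape[0], dict_bits.shape[0]
    W = np.zeros(D)
    for p in range(N):
        # Hamming distance to every atom; a linear scan stands in for FLANN [98]
        d_H = np.count_nonzero(patch_bits[p] != dict_bits, axis=1)
        W[np.argmin(d_H)] += 1.0          # z_(p,i) = 1 for the nearest atom only
    return W / N                          # average pooling over the N patches
```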

We also tried a version where some spatial information about the occurrence locations of the dictionary atoms was taken into account via Spatial Pyramid Matching (SPM) [34]. According to this method, the 2D image space is divided into a sequence of grids at resolutions l = 0, …, L, such that the grid has a total of 2^{2l} cells at level l. After computing the descriptor vectors in each cell at all resolution levels, they are appropriately weighted and concatenated into a long vector to form the image signature. We applied the same normalization scheme as in [34], i.e., we divided the number of occurrences of the dictionary atoms in each cell by their total occurrences in the whole image. The normalized histograms of the 2^{2l} cells computed for level l are then concatenated to obtain the spatial histogram for level l. Then, the level l = 0 and the higher-level (l = 1, …, L) spatial histograms are weighted by 1/2^L and 1/2^{L−l+1}, respectively [34]. This weighting scheme gives the higher levels bigger weights. Finally, the spatial histograms of all levels are concatenated to obtain the image signature with dimensionality D·Σ_{l=0}^{L} 4^l, where D is the dictionary size. Thus, for example, in Table 4 (to be presented in Section 5), one has signature size D if L = 0 is used, and if both levels l = 0 and l = 1 are used (L = 1), the signature dimension becomes 5D.
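A possible realization of this SPM signature construction is sketched below; the function name, the (x, y) point format and the per-image normalization follow our reading of the description above and of [34], and are not taken verbatim from either.

```python
import numpy as np

def spm_signature(points, votes, img_w, img_h, D, L=1):
    """Spatial-pyramid signature in the spirit of [34]: per-cell histograms of
    dictionary-atom hits, normalized by the total hits in the image, weighted
    by level and concatenated (dimension D * sum_l 4**l).
    points : (N, 2) array of (x, y) patch locations,
    votes  : (N,) index of the winning dictionary atom for each patch."""
    total = len(votes)
    sig = []
    for l in range(L + 1):
        cells = 2 ** l
        hist = np.zeros((cells, cells, D))
        for (x, y), i in zip(points, votes):
            cx = min(int(x * cells / img_w), cells - 1)
            cy = min(int(y * cells / img_h), cells - 1)
            hist[cy, cx, i] += 1
        hist /= total                                   # normalize by hits in the whole image
        weight = 1 / 2 ** L if l == 0 else 1 / 2 ** (L - l + 1)
        sig.append(weight * hist.ravel())
    return np.concatenate(sig)
```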

5. Performance analysis of image understanding applications

We evaluated the SymPaD performance on three types of image understanding problems, i.e., category recognition, object recognition, and image retrieval, on four benchmark datasets, namely the Columbia Object Image Library (COIL-100) [100], the Amsterdam Library of Object Images (ALOI-VIEW) [101], Caltech-101 [102], and the Zurich Buildings Dataset (ZuBuD) [103], all in their gray-scale versions.

Fig. 8. Percentage of populations of highest-ranked patterns averaged over all dictionary sizes R ∈ {256, 512, 768, 1024} and the four datasets, i.e., COIL-100, ALOI-VIEW, Caltech-101 and ZuBuD. Left: portions of the three groups, Group I, II and III. Right: portions of 19 shape models.

Fig. 9. Recognition performance with dictionaries of various sizes. Dictionary atoms are selected according to their MI scores on the Caltech, COIL, ALOI, and ZuBuD images.


We compare the performance of our method with the state-of-the-art model-driven methods, i.e., BIFs [30], oBIFs [29] and BIF-columns [30]. In addition, we report for comparative purposes the performance of recent data-driven methods from the literature. In the implementation of SymPaD, we discard the votes on the visual word Flat, as was also done in [29,30]. The performance of SymPaD is evaluated using four dictionaries of sizes R ∈ {256, 512, 768, 1024} that have been selected out of D = 6122 atoms according to the mutual information criterion; this selection has been repeated for each of the four datasets as described in Section 3.3. In addition to the performance results obtained by standard BoW pooling, we also present performance results obtained by employing SPM [34], described in Section 4.

The descriptors of the model-driven methods, namely BIFs [30], oBIFs [29] and BIF-columns [30], are computed by running the open-source code [104] on the benchmark datasets with the default parameter settings mentioned in [29,30]. We implement SPM for BIF-columns somewhat differently than the SPM [34] applied to the other methods, i.e., SymPaD, BIFs and oBIFs. In the standard implementation of SPM, the descriptors are computed on the whole image and then pooled in each cell of the spatial pyramid. In contrast, we compute BIF-columns on each cell of the spatial pyramid separately and then concatenate them all to obtain the image signature. We weight the BIF-column vectors of the cells of level l equally by 1/2^{2l}, so that the sum of the concatenated vectors of level l becomes 1. Finally, the same weighting scheme as in [34] is applied to concatenate the vectors of the different levels.

For classification purposes, we used a linear SVM classifier with One-vs-One (OVO) decisions for all methods, i.e., SymPaD, BIFs, oBIFs, and BIF-columns. Finally, we worked with the square-rooted image signatures computed by all methods, since this yielded a modest performance improvement, i.e., 1–2%.
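A minimal classification sketch consistent with this setup is given below; it assumes scikit-learn (whose multi-class SVC is one-vs-one by construction), which is our choice for illustration and not necessarily the toolbox used in the experiments, and the function name is ours.

```python
import numpy as np
from sklearn.svm import SVC

def classify(train_sigs, train_labels, test_sigs, test_labels):
    # Square-rooted image signatures with a linear SVM (multi-class SVC is one-vs-one)
    clf = SVC(kernel='linear')
    clf.fit(np.sqrt(train_sigs), train_labels)
    return clf.score(np.sqrt(test_sigs), test_labels)
```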

5.1. COIL-100

COIL-100 [100] contains 7200 color images of 100 objects under in-plane rotation, with a pose interval of 5 degrees, and with 128 × 128 resolution. A set of randomly chosen object images from COIL-100 is given in Fig. 10.

For a more exhaustive analysis, we followed the three experimental setups that have been used in the literature. In SETUP1, we use 9-fold cross-validation as in [105,106]. The images in each object class are randomly divided into 9 groups, and in each iteration one of the groups is used for testing while the remaining ones are used for training. The average of the nine recognition rates is given as the final result. In SETUP2, for each object we regularly chose 8 images with a 45-degree viewing angle between them to make the training set; the remaining 64 images of each object are incorporated into the testing set, as in [107–110]. Finally, we applied SETUP3 to make a comparison with the SalBayes, HMAX and SIFT methods, whose performances are reported in [111]. In the SETUP3 setting, 18 images with a 20-degree viewing angle between them are regularly selected for the training set, and the remaining ones go to the testing set.
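The pose-angle splits of SETUP2 and SETUP3 can be written down directly; the snippet below is a sketch of those index sets (variable names are ours).

```python
# 72 views per object at 5-degree intervals (COIL-100)
all_poses = list(range(0, 360, 5))
setup2_train = list(range(0, 360, 45))                         # 8 views, 45 degrees apart
setup2_test = [p for p in all_poses if p not in setup2_train]  # remaining 64 views
setup3_train = list(range(0, 360, 20))                         # 18 views, 20 degrees apart
setup3_test = [p for p in all_poses if p not in setup3_train]  # remaining 54 views
assert (len(setup2_test), len(setup3_test)) == (64, 54)
```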

The performance results obtained by the methods SymPaD, BIFs, oBIFs and BIF-columns in all experimental settings are presented in Table 4. We outperform BIFs, oBIFs and BIF-columns in all setups. We obtain 100% performance with the selected D = 768 and D = 1024 dictionary atoms in the standard BoW scheme, that is L = 0, for the SETUP1 configuration. SPM with L = 1 (2 × 2) improved performance by ∼3% for the smallest dictionary, i.e., R = 256, with respect to L = 0 for SETUP2. Performance results were beyond 99% with both the L = 0 and L = 1 schemes for SETUP3.

A comparison of our results with the performance results of data-driven approaches from the literature is given in Table 5. We outperformed the existing literature methods of Rotation Invariant Kernels (RIK) [105] and Manifold Kernel SVM [106], which are contour-based descriptors extracted from the shape contour. LAFs [110], which is defined on affine covariant features, was better in the SETUP2 setting, and in SETUP3 it was equal to ours.

5.2. ALOI-VIEW

ALOI-VIEW [101] is a dataset similar to COIL-100 but of much larger extent, i.e., it contains 72,000 images of 1000 objects under in-plane rotation, with an interval of 5 degrees. Example images from randomly selected objects are presented in Fig. 11.

We followed the SETUP2 and SETUP3 settings as in the COIL-100 experiments, which have also been used in the literature for the ALOI-VIEW dataset. Experiments were run with the SETUP2 setting on 250 ALOI objects uniformly randomly sampled from the 1000, as in [109]. We ran experiments with the SETUP3 setting on the images of all 1000 objects, as in [111]. The published gray-scale ALOI-VIEW images at quarter resolution, with size 192 × 144, were used in the experiments with both the SETUP2 and SETUP3 settings.

The performance results obtained by SymPaD, BIFs, oBIFs and BIF-columns with/without the SPM scheme applied are given in Table 6. We outperformed BIFs, oBIFs and BIF-columns in both the L = 0 and L = 1 schemes. The use of SPM (L = 1) improves performance especially when smaller dictionaries are used; however, for larger sizes of the shape dictionary, SPM is not crucial anymore.

We present performance comparisons with recent works from the literature in Table 7. M-CORD [109], using both shape and color cues of the images in the SETUP2 setting, slightly outperforms SymPaD, which is based merely on shape cues. SymPaD significantly outperforms the SIFT, HMAX and SalBayes methods [111] in the SETUP3 setting.

Fig. 10. Sample images randomly chosen from COIL-100 dataset.

Table 4
Performance comparison of SymPaD to BIFs, oBIFs, and BIF-columns for object recognition on the COIL-100 dataset in three experimental setups. L = 0 denotes the standard BoW scheme, L = 1 denotes SPM applied at levels l = 0 and l = 1.

Method        R     SETUP1         SETUP2                   SETUP3
                    L=0 (1×1)      L=0 (1×1)   L=1 (2×2)    L=0 (1×1)   L=1 (2×2)
SymPaD        256   99.8% ± 0.2    93.4%       96.4%        98.9%       99.6%
SymPaD        512   99.9% ± 0.1    95.5%       97.1%        99.2%       99.7%
SymPaD        768   100%           95.8%       97.2%        99.5%       99.9%
SymPaD        1024  100%           96.0%       97.2%        99.5%       99.9%
BIFs          6     58.6% ± 1.5    49.9%       77.5%        57.1%       85.9%
oBIFs         22    94.8% ± 0.7    81.3%       89.3%        91.9%       97.7%
BIF-columns   1296  99.4% ± 0.3    92.9%       95.8%        97.7%       98.9%

The best performances obtained for each setup are shown in bold.


5.3. CALTECH-101

Caltech-101 [102] is one of the most diverse datasets in terms of inter-class variability, and it has become almost a de facto test standard for category recognition algorithms. It consists of 101 object categories, each containing from 31 to 800 images. Example images from randomly selected categories are presented in Fig. 12.

A widely used experimental setup for the Caltech-101 dataset consists in constructing a training set from 30 randomly chosen images and a testing set from 50 randomly chosen images among the remaining ones in each category. If fewer than 50 images are left after the training set is drawn from a category, all the remaining ones are used for testing. This process is repeated 10 times, and the average and the standard deviation of the recognition performances are reported.

Many studies in the literature treated image classification on Caltech-101 as a scene matching problem and used image-based representations [112,57,113,114,34,50,52,115,116], while others argued that for better classification performance it is necessary to home in on the object instance, since Caltech-101 images are not constrained in pose and have significant background clutter [117–121]. Motivated by the reported improvements in recognition accuracy on the Caltech-101 dataset [117–121], we adopted the second approach and made the assumption that the object foreground has been segmented from the background in a preprocessing stage. Therefore, we first crop a rectangular region of interest (RoI) and resize it by setting the longer side of the rectangle to 300 pixels via bilinear interpolation. We also remove any background clutter by filling the non-object parts with a constant intensity value. In our experiments, we did not include the Faces-easy and Background classes, since Faces-easy is redundant in the presence of the Faces class when foreground segmentation is employed, and the Background class is not intended for object recognition.

Following the scheme employed by algorithms on similar shape/category recognition problems, we used a Gaussian pyramid with two scales, where images were smoothed with a Gaussian filter with σ = 2 and down-sampled to half-size at each step. Hence, the image sizes became 150 pixels and 75 pixels on the longest side at the subsequent levels of the scale pyramid. We also incorporated information about the spatial layout of the occurrences of dictionary shapes in the images at every scale of the Gaussian pyramid by applying SPM [34].
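A sketch of this preprocessing chain, assuming OpenCV and a given RoI and foreground mask (the fill value 128 is an arbitrary illustrative constant, and the function name is ours), could look as follows.

```python
import cv2
import numpy as np

def preprocess_pyramid(image, roi, mask, fill=128):
    """Crop the RoI, blank the background with a constant intensity, set the
    longer side to 300 px via bilinear interpolation, then build a two-step
    Gaussian pyramid (smooth with sigma = 2, halve the size), giving images
    of roughly 300, 150 and 75 px on the longest side.
    image : grayscale array, roi = (x, y, w, h), mask : binary foreground mask."""
    x, y, w, h = roi
    patch = np.where(mask[y:y + h, x:x + w] > 0, image[y:y + h, x:x + w], fill)
    scale = 300.0 / max(w, h)
    patch = cv2.resize(patch.astype(np.uint8), None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_LINEAR)
    pyramid = [patch]
    for _ in range(2):
        blurred = cv2.GaussianBlur(pyramid[-1], (0, 0), sigmaX=2)
        pyramid.append(cv2.resize(blurred, None, fx=0.5, fy=0.5,
                                  interpolation=cv2.INTER_LINEAR))
    return pyramid
```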

We present the performance results obtained by SymPaD, BIFs, oBIFs and BIF-columns in Table 8, with/without spatial histograms applied on a single scale of the images. The performance results in Table 8 are the highest ones obtained among the scales of 300, 150 and 75 pixels on the longest side. Since we did not observe any significant increase in performance beyond L = 2, we do not present those results here. It is observed that SPM brings a significant performance gain, similar to what is reported in the literature, i.e., ∼12% for SymPaD and ∼15% for BIFs, oBIFs and BIF-columns. For the SymPaD case we see two opposite tendencies: in the absence of any spatial pyramid, the performance improves as lower-resolution images are used; in the presence of a spatial pyramid (L = 1 and L = 2), the performance is better with higher-resolution images. We outperform BIFs, oBIFs and BIF-columns, probably because we use a larger variety of shape models.

For the multiscale performance of SymPaD presented in Table 9, we concatenated the signatures of each scale with equal weights of 1/3 (since there are 3 scales) for the L = 0, L = 1, and L = 2 schemes into a long vector of length 3R, 15R, and 63R, respectively. The performance improves significantly, i.e., by ∼8%, for the L = 0 scheme with the aid of multiscaling; however, when the spatial pyramid scheme is used in conjunction with multiscaling, the performance gain becomes marginal, i.e., ∼1%.

The categories recognized with the best and worst performance by the multiscale SymPaD with R = 1024 and L = 2 are presented in Fig. 13. Our method performs well on categories with strong spatial layout priors, such as binocular, car_side and motorbikes, but is less successful on categories with large intra-class variation and high pose diversity, such as gerenuk, lotus, beaver and crocodile.

Table 5
Performance comparison of SymPaD to state-of-the-art methods for object recognition on the COIL-100 dataset in three experimental setups. L = 0 denotes the standard BoW scheme, L = 1 denotes SPM applied at levels l = 0 and l = 1.

Method                      SETUP1   SETUP2   SETUP3
HMAX [111]                  –        –        77.0%
SIFT [111]                  –        –        87.2%
RIK [105]                   95.8%    –        –
Manifold Ker. SVM [106]     97.0%    –        –
SNOW/intensity [107]        –        85.1%    92.3%
SNOW/edges [107]            –        89.2%    94.1%
SalBayes [111]              –        –        97.2%
Extra Trees [108]           –        92.5%    99.5%
SymPaD (L = 0, R = 1024)    100%     96.0%    99.5%
SymPaD (L = 1, R = 1024)    –        97.2%    99.9%
Sub-windows [108]           –        98.5%    99.6%
M-CORD-Edge [109]           –        99.0%    99.9%
LAFs [110]                  –        99.4%    99.9%

The best performances obtained for each setup are shown in bold.

Fig. 11. Sample images randomly chosen from ALOI-VIEW dataset.

Table 6
Performance comparison of SymPaD to BIFs, oBIFs, and BIF-columns for object recognition on the ALOI-VIEW dataset in two experimental setups. L = 0 denotes the standard BoW scheme, L = 1 denotes SPM applied at levels l = 0 and l = 1.

Method        R     SETUP2 (250 categories)     SETUP3 (1000 categories)
                    L=0 (1×1)   L=1 (2×2)       L=0 (1×1)   L=1 (2×2)
SymPaD        256   97.1%       98.1%           98.8%       99.2%
SymPaD        512   97.9%       98.4%           99.2%       99.6%
SymPaD        768   97.9%       98.5%           99.2%       99.8%
SymPaD        1024  98.1%       98.6%           99.2%       99.9%
BIFs          6     50.0%       84.7%           44.2%       87.4%
oBIFs         22    88.2%       95.5%           93.0%       97.7%
BIF-columns   1296  91.6%       94.6%           96.1%       97.9%

The best performances obtained for each setup are shown in bold.

Table 7
Performance comparison of SymPaD to state-of-the-art methods for object recognition on the ALOI-VIEW dataset in two experimental setups. L = 0 denotes the standard BoW scheme, L = 1 denotes SPM applied at levels l = 0 and l = 1.

Method                      SETUP2   SETUP3
SIFT [111]                  –        71.0%
HMAX [111]                  –        80.8%
SalBayes [111]              –        89.7%
SymPaD (L = 1, R = 1024)    98.6%    99.9%
M-CORD-Edge [109]           99.6%    –
M-CORD-Cluster [109]        99.7%    –

The best performances obtained for each setup are shown in bold.


The performance comparison with data-driven approaches is presented in Table 10. Since we home in on the object instance, we additionally present, for a fair comparison, the literature work adopting the same approach. In one of the earlier methods [122], SIFT features are computed around image points detected by a Hessian-affine detector on the foreground images of the Caltech-101 dataset, and a visual dictionary is learned by quantizing the feature space. Among the methods computing SIFT or HOG descriptors on the entire image, performance improvements are obtained by employing SPM [34], soft-voting via sparse coding methods [50,52], and Fisher kernel coding by learning the dictionary with GMM clustering [113]. More recent studies that reported state-of-the-art performance results employ deep learning architectures [112,114,115]. Extremely discriminative features can be learned from the data in deep learning architectures; however, they have a high computational cost, and in some circumstances simpler solutions might be required.

Leveraging segmentation results for object recognition provides comparable or better recognition performance than methods dealing with the entire image region. [117] computes contour shape and geometric blur [123] features on the segmented foreground, while [120] utilizes a saliency detection algorithm that yields salient regions attached to the object instance and uses SIFT features densely on these detected regions. [117,120] outperform the SPM implementation on the entire image region in [34], without incorporating any localization information in [117] and by employing the same scheme as [34] in [120], by ∼9% and ∼13%, respectively. [118] did not employ any segmentation or saliency detection algorithm but simply trivialized the unimportant regions, which mostly belong to non-object regions, by thresholding the densely extracted SIFT norm magnitude. [118] outperformed the advanced methods of Fisher kernel [113] and sparse coding [50,52], which work on the entire image region with rich image representations and costly sparse coding optimization, by ∼1% and ∼5% in performance, respectively.

Under the assumption that the object instances were properly segmented in the preprocessing, our SymPaD method, employing simply hard-voting of BRIEF features and utilizing SPM and multiscaling, performed better than the data-driven schemes using SIFT/HOG features [34,50,52,113,118,122,120], dictionary learning methods such as K-means, GMM and MI-KSVD [34,112,113,118,122,120], or sparse coding/Fisher kernel [50,52,112,113]. We also did better than the ConvNet method in [115]. However, recent studies employing deep learning architectures [116,114] have reported better state-of-the-art performance results on the Caltech-101 dataset. Although deep learning architectures perform almost always better provided there is a sufficient amount of training data, there are still instances where solutions outside the deep learning paradigm are worth pursuing. One case in point is feature extraction and classification problems where the volume of the training dataset is limited. For example, Pu et al. [116] presented performance results of the 5-layer ConvNet of Zeiler and Fergus [115] on Caltech-101 when the architecture was pretrained with and without 1.2 M ImageNet images. The performance is reported as 86.5% when the architecture was trained with a very large dataset, i.e., ImageNet, while the reported performance fell to 46.5% when training was done on a smaller dataset, i.e., Caltech-101 images (30 training images per category). Another case in point is the set of recent studies where

Fig. 12. Sample images randomly chosen from Caltech-101 dataset.

Table 8
Performance comparison of SymPaD to BIFs, oBIFs, and BIF-columns for category recognition on the Caltech-101 dataset. L = 0 denotes the standard BoW scheme, L = 1 denotes SPM applied at levels l = 0 and l = 1, L = 2 denotes SPM applied at levels l = 0, l = 1 and l = 2. The best performance results obtained in single scale for each R are presented.

Method        R     L=0 (1×1)      L=1 (2×2)      L=2 (4×4)
SymPaD        256   70.5% ± 0.5    80.8% ± 0.6    84.7% ± 0.5
SymPaD        512   74.5% ± 0.6    82.5% ± 0.5    85.1% ± 0.6
SymPaD        768   76.4% ± 0.6    83.4% ± 0.6    85.3% ± 0.6
SymPaD        1024  76.5% ± 0.6    82.9% ± 0.5    85.6% ± 0.7
BIFs          6     24.1% ± 0.5    48.6% ± 1.1    67.1% ± 0.8
oBIFs         22    49.4% ± 0.6    72.8% ± 0.8    81.8% ± 0.9
BIF-columns   1296  45.2% ± 0.7    62.4% ± 0.4    70.5% ± 0.4

The best performances obtained for each setup are shown in bold.

Table 9
Category recognition performance results of SymPaD when spatial histograms are concatenated over scale on the Caltech-101 dataset. Performance results in single scale are taken from Table 8.

R      L=0 (1×1)                      L=1 (2×2)                      L=2 (4×4)
       Single scale   Scale concat.   Single scale   Scale concat.   Single scale   Scale concat.
256    70.5 ± 0.5     78.6 ± 0.8      80.8 ± 0.6     85.0 ± 0.6      84.7 ± 0.5     86.3 ± 0.2
512    74.4 ± 0.6     81.2 ± 0.5      82.5 ± 0.5     85.9 ± 0.5      85.1 ± 0.6     86.8 ± 0.2
768    76.4 ± 0.6     82.3 ± 0.9      83.4 ± 0.6     86.6 ± 0.6      85.3 ± 0.6     86.9 ± 0.8
1024   76.5 ± 0.6     82.4 ± 0.6      82.9 ± 0.5     86.2 ± 0.4      85.6 ± 0.7     87.1 ± 0.4

The best performances obtained for each setup are shown in bold.

Fig. 13. Example images from the categories recognized at the best and worst rates (L = 2, R = 1024, multiscale concatenation applied).


the joint use of CNN features and other features under early or late fusion results in even better performance (see for example [124,125]). To support this point, we ran experiments on the Caltech-101 dataset where we concatenated the 1024-long SymPaD image signatures with 4096-long CNN features. The CNN features were obtained from the AlexNet architecture pretrained on the ImageNet dataset. Specifically, we used the output of the last fully-connected layer (fc7), which leads to feature vectors of dimension 4096 [124,125]. Using a multiclass SVM classifier, the performance obtained by the CNN features is improved by ∼2% when the SymPaD descriptors are included in the feature vector. In conclusion, we believe that model-based dictionaries can find niche applications in situations where a sufficient abundance of data is not available for training a CNN.
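A simple early-fusion sketch along these lines is shown below; it assumes the SymPaD signatures and the fc7 activations have already been computed and stacked row-wise per image, and it uses scikit-learn's linear SVC purely for illustration (function and argument names are ours).

```python
import numpy as np
from sklearn.svm import SVC

def fused_accuracy(sympad_tr, cnn_tr, y_tr, sympad_te, cnn_te, y_te):
    # Early fusion: per-image SymPaD signatures (e.g. 1024-dim) concatenated
    # with CNN fc7 features (e.g. 4096-dim), classified with a linear SVM
    X_tr = np.hstack([sympad_tr, cnn_tr])
    X_te = np.hstack([sympad_te, cnn_te])
    clf = SVC(kernel='linear').fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```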

We also present the performance results from [121,119], which are obtained by utilizing object instances segmented via ground-truth information, as in our case. We differ from [121] in that they extract dense SIFT features and employ costly sparse coding. By using the SymPaD dictionary and hard-voting we obtain performance results competitive with [121,119], even with R = 256 dictionary atoms. The performance obtained on segmented object instances is improved by ∼1% in [119] over [121] by extracting a larger variety of features.

5.4. ZuBuD

The Zurich Buildings (ZuBuD) image dataset [103] is a testbed for image retrieval that consists of 1005 color images of 201 buildings. The images of the buildings are taken from five different viewpoints and may also contain occlusions. The dataset contains an additional set of 115 query images taken under different imaging conditions, and matching is performed against the 1005 database images, for 1120 images in total. We converted them to gray-scale images, smoothed them with a Gaussian filter, and downsampled the original resolution from 640 × 480 to 160 × 120. Example images of a building in the ZuBuD dataset are given in Fig. 14.

We follow the experimental procedure in [126–130] and investigate the recall rate r_Z = n_Z / N_Z, where n_Z is the number of correct matches among the first Z retrieved images and N_Z is the number of possible correct matches. For the ZuBuD dataset, since every query image has 5 corresponding images in the dataset, one has N_Z = min(Z, 5). We present the average recall rates for Z = 1.
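The recall computation can be written compactly; the sketch below assumes per-query ranked retrieval lists and ground-truth match sets (names are ours).

```python
import numpy as np

def average_recall_at_Z(rankings, relevant, Z=1):
    """Recall r_Z = n_Z / N_Z averaged over queries: rankings[q] is the list of
    retrieved database indices for query q (best first), relevant[q] the set of
    its correct matches (5 per query in ZuBuD), and N_Z = min(Z, number of matches)."""
    recalls = []
    for ranked, rel in zip(rankings, relevant):
        n_Z = len(set(ranked[:Z]) & set(rel))
        recalls.append(n_Z / min(Z, len(rel)))
    return float(np.mean(recalls))
```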

Table 11 shows the performance comparison of SymPaD, BIFs, oBIFs and BIF-columns on the ZuBuD dataset. We observe that the SPM implementation does not make any difference in the performance, yet we outperform BIF-columns by 4% with image signatures of almost half the length, i.e., R = 512 vs. R = 1296.

Performance figures from the recent literature are given in Table 12. We outperform [130,126,127]. We obtain results competitive with [128], which uses randomized tree ensembles. By computing DCT coefficients on local affine frames, [129] obtained a superior performance.

6. Conclusion

In this paper, we have addressed the problem of creating visual dictionaries for various image understanding applications from a model-driven perspective. The dictionary building starts from a core of shape primitives, which have commonalities with the shape models envisaged by the earliest to the latest proponents of the idea, i.e., from Marr [23,24] to Griffin et al. [77,28,61,29]. We then proceed to enrich the dictionary by using a detailed parametrization of the shape space and by applying nonlinear dyadic compositions.

The visual dictionary is applied in a Bag of Visual Words context to various image understanding applications such as texture classification, category recognition and object identification. The experimental results showed that our method performs almost always better, or at least competitively, in comparison with the existing alternative model-driven schemes as well as with some data-driven schemes. Deep Learning (DL) methods, provided they are trained on appropriate and very large datasets, are expected to perform better than other hand-crafted dictionaries or data-driven dictionaries constructed outside the DL methodology.

However, there are niche applications for our method when the training data for DL is inadequate, or as a complement to the CNN features. This work will proceed along two avenues: (i) developing dictionaries for multispectral or hyperspectral data, where dictionary atoms must summarize spatio-spectral information; and (ii) further investigating the complementary roles of CNN features and hand-crafted, pruned dictionaries.

Table 10
Performance comparison on Caltech-101. GT denotes methods applied on foreground images segmented by ground-truth segmentation masks, Sal/Seg. denotes methods applied on salient regions or on regions detected by a segmentation algorithm proposed in the respective work, Entire image denotes methods applied on the whole image.

Method                                                                 Region         Performance
SIFT, VQ (SOM), Hard Voting [122]                                      GT             20%
Dense SIFT, VQ (KMeans), Hard Voting, SPM [34]                         Entire image   64.6% ± 0.8
Contour shape (Region), Geometric Blur (Point) [117]                   Sal/Seg.       73.1%
Dense SIFT, SC, SPM [50]                                               Entire image   73.2% ± 0.5
Dense HOG, LLC, SPM [52]                                               Entire image   73.4%
Dense SIFT, VQ (KMeans), SPM [120]                                     Sal/Seg.       77.3% ± 0.6
Dense SIFT, GMM, Fisher kernel [113]                                   Entire image   77.8% ± 0.6
Dense SIFT, VQ (KMeans), (Hard/Soft-voting), (Avg/Max) pooling [118]   Sal/Seg.       78.5% ± 1.0
MI-KSVD, SC [112]                                                      Entire image   82.5% ± 0.5
SymPaD (R = 256, SPM, scale concatenation, Hard-voting, Avr-pooling)   GT             86.3% ± 0.2
ConvNet [115]                                                          Entire image   86.5% ± 0.5
SymPaD (R = 1024, SPM, scale concatenation, Hard-voting, Avr-pooling)  GT             87.1% ± 0.4
Dense SIFT, SC, SPM [121]                                              GT             88.3%
Dense SIFT, Dense color SIFT, HOG, VQ (Kmeans) [119]                   GT             89.3%
ConvNet using Bayesian deep learning [116]                             Entire image   93.2%
ConvNet with SPM (SPP-net) [114]                                       Entire image   93.4% ± 0.5

The best performances obtained for each setup are shown in bold.

Fig. 14. Example images of a building in the ZuBuD dataset. The first image is the query image, which contains an occlusion. The remaining images show the same building in the dataset.


References

[1] R. Rubinstein, A.M. Bruckstein, M. Elad, Dictionaries for sparse representationmodeling, Proc. IEEE 98 (6) (2010) 1045–1057.

[2] R. Figueras i Ventura, P. Vandergheynst, P. Frossard, Low rate and flexible imagecoding with redundant representations, IEEE Trans. Image Process. 15 (3) (2006)726–739.

[3] E. Candes, L. Demanet, D. Donoho, L. Ying, Fast discrete curvelet transforms,Multiscale Model. Simul. 5 (3) (2006) 861–899.

[4] D.L. Donoho, X. Huo, Combined image representation using edgelets and wavelets,in: SPIE’s International Symposium on Optical Science, Engineering, andInstrumentation, International Society for Optics and Photonics, 1999, pp.468–476.

[5] D.L. Donoho, Orthonormal Ridgelets and Linear Singularities, Tech. Rep., StanfordUniv, 1998.

[6] M.N. Do, M. Vetterli, The contourlet transform: an efficient directional multi-resolution image representation, IEEE Trans. Image Process. 14 (12) (2005)2091–2106.

[7] E. Le Pennec, S. Mallat, Sparse geometric image representations with bandelets,IEEE Trans. Image Process. 14 (4) (2005) 423–438.

[8] W.T. Freeman, E.H. Adelson, The design and use of steerable filters, IEEE Trans.Pattern Anal. Mach. Intell. 13 (9) (1991) 891–906.

[9] X. Huo, Sparse Image Representation via Combined Transforms (Ph.D. Thesis),Stanford University, 1999.

[10] M. Aharon, M. Elad, A. Bruckstein, K-svd: an algorithm for designing overcompletedictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006)4311–4322.

[11] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization andsparse coding, J. Mach. Learn. Res. 11 (January) (2010) 19–60.

[12] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in:Proceedings of IEEE Conference on Computer Vision and Pattern Recognition(CVPR’05), vol. 1, 2005, pp. 886–893.

[13] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J.Comput. Vision 60 (2) (2004) 91–110.

[14] G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization withbags of keypoints, in: Workshop on Statistical Learning in Computer Vision, ECCV,vol. 1, 2004, pp. 1–2.

[15] F. Jurie, B. Triggs, Creating efficient codebooks for visual recognition, in:International Conference on Computer Vision (ICCV’05), vol. 1, IEEE, 2005, pp.604–610.

[16] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition viasparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009)210–227.

[17] M. Yang, L. Zhang, J. Yang, D. Zhang, Metaface learning for sparse representationbased face recognition, in: IEEE International Conference on Image Processing,2010, pp. 1601–1604.

[18] B. Fulkerson, A. Vedaldi, S. Soatto, Localizing objects with smart dictionaries,European Conference on Computer Vision, Springer, 2008, pp. 179–192.

[19] J. Winn, A. Criminisi, T. Minka, Object categorization by learned universal visual

dictionary, in: International Conference on Computer Vision (ICCV’05), vol. 2,IEEE, 2005, pp. 1800–1807.

[20] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, F.R. Bach, Supervised dictionarylearning, in: Advances in Neural Information Processing Systems, 2009, pp.1033–1040.

[21] Q. Zhang, B. Li, Discriminative k-svd for dictionary learning in face recognition, in:Proceedings of IEEE Conference on Computer Vision and Pattern Recognition(CVPR’10), 2010, pp. 2691–2698.

[22] D.S. Pham, S. Venkatesh, Joint learning and dictionary construction for patternrecognition, in: Proceedings of IEEE Conference on Computer Vision and PatternRecognition (CVPR’08), 2008, pp. 1–8.

[23] D. Marr, Early processing of visual information, Philosoph. Trans. Roy. Soc.London B: Biol. Sci. 275 (942) (1976) 483–519.

[24] D. Marr, Vision: A Computational Investigation into the Human Representationand Processing of Visual Information, Henry Holt and Co., New York, 1982.

[25] H.B. Barlow, Summation and inhibition in the frog’s retina, J. Physiol. 119 (1)(1953) 69.

[26] D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functionalarchitecture in the cat’s visual cortex, J. Physiol. 160 (1) (1962) 106–154.

[27] H.K. Hartline, The response of single optic nerve fibers of the vertebrate eye toillumination of the retina, Am. J. Physiol.–Legacy Content 121 (2) (1938)400–415.

[28] L.D. Griffin, M. Lillholm, Feature category systems for 2nd order local imagestructure induced by natural image statistics and otherwise, in: Electronic Imaging2007, International Society for Optics and Photonics, 2007, pp. 649209–649209.

[29] M. Lillholm, L.D. Griffin, Novel image feature alphabets for object recognition, in:Proceedings of International Conference on Pattern Recognition (ICPR), Citeseer,2008, pp. 1–4.

[30] M. Crosier, L.D. Griffin, Using basic image features for texture classification, Int. J.Comput. Vision 88 (3) (2010) 447–460.

[31] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEETrans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1615–1630.

[32] J. Li, N.M. Allinson, A comprehensive review of current local features for com-puter vision, Neurocomputing 71 (10) (2008) 1771–1787.

[33] H. Bay, T. Tuytelaars, L. Van Gool, Surf: speeded up robust features, EuropeanConference on Computer Vision, Springer, 2006, pp. 404–417.

[34] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramidmatching for recognizing natural scene categories, in: Proceedings of IEEEConference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2, 2006,pp. 2169–2178.

[35] J.G. Daugman, Two-dimensional spectral analysis of cortical receptive field pro-files, Vision Res. 20 (10) (1980) 847–856.

[36] J.G. Daugman, Uncertainty relation for resolution in space, spatial frequency, andorientation optimized by two-dimensional visual cortical filters, JOSA A 2 (7)(1985) 1160–1169.

[37] J.J. Koenderink, A.J. van Doorn, Representation of local geometry in the visualsystem, Biol. Cybernet. 55 (6) (1987) 367–375.

[38] J. Zhu, Image compression using wavelets and jpeg2000: a tutorial, Electron.Commun. Eng. J. 14 (3) (2002) 112–121.

[39] D.R. Martin, C.C. Fowlkes, J. Malik, Learning to detect natural image boundariesusing local brightness, color, and texture cues, IEEE Trans. Pattern Anal. Mach.Intell. 26 (5) (2004) 530–549.

[40] M. Heiler, C. Schnörr, Natural image statistics for natural image segmentation, Int.J. Comput. Vision 63 (1) (2005) 5–19.

[41] J. Dou, J. Li, Modeling the background and detecting moving objects based on siftflow, Optik-Int. J. Light Electron Opt. 125 (1) (2014) 435–440.

[42] M. Dantone, J. Gall, C. Leistner, L. Van Gool, Body parts dependent joint regressorsfor human pose estimation in still images, IEEE Trans. Pattern Anal. Mach. Intell.36 (11) (2014) 2131–2143.

[43] F. Cen, Y. Jiang, Z. Zhang, H.-T. Tsui, T.K. Lau, H. Xie, Robust registration of 3-dultrasound images based on gabor filter and mean-shift method, Computer Visionand Mathematical Methods in Medical and Biomedical Image Analysis, Springer,2004, pp. 304–316.

[44] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, P. Fua, Brief: com-puting a local binary descriptor very fast, IEEE Trans. Pattern Anal. Mach. Intell.34 (7) (2012) 1281–1298.

[45] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, Orb: an efficient alternative to siftor surf, in: Proceedings of IEEE International Conference on Computer Vision(ICCV’11), 2011, pp. 2564–2571.

[46] S. Leutenegger, M. Chli, R.Y. Siegwart, Brisk: Binary robust invariant scalablekeypoints, in: Proceedings of IEEE International Conference on Computer Vision(ICCV), 2011, pp. 2548–2555.

[47] A. Alahi, R. Ortiz, P. Vandergheynst, Freak: Fast retina keypoint, in: Proceedings ofIEEE Conference on Computer vision and Pattern Recognition (CVPR’12), 2012,pp. 510–517.

[48] Y. Huang, Z. Wu, L. Wang, T. Tan, Feature coding in image classification: acomprehensive study, IEEE Trans. Pattern Anal. Mach. Intell. 36 (3) (2014)493–506.

[49] J.C. Van Gemert, J.-M. Geusebroek, C.J. Veenman, A.W. Smeulders, Kernel co-debooks for scene categorization, in: Proceedings of European Conference onComputer Vision, Springer, 2008, pp. 696–709.

[50] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparsecoding for image classification, in: Proceedings of IEEE Conference on ComputerVision and Pattern Recognition (CVPR’09), 2009, pp. 1794–1801.

[51] K. Yu, T. Zhang, Y. Gong, Nonlinear learning using local coordinate coding, in:Advances in neural information processing systems, 2009, pp. 2223–2231.

Table 11
Performance comparison of SymPaD to BIFs, oBIFs, and BIF-columns for image retrieval on the ZuBuD dataset. L = 0 denotes the standard BoW scheme, L = 1 denotes SPM applied at levels l = 0 and l = 1.

Method        R     L=0 (1×1)   L=1 (2×2)
SymPaD        256   93.0%       94.8%
SymPaD        512   94.8%       94.8%
SymPaD        768   94.8%       94.8%
SymPaD        1024  94.8%       94.8%
BIFs          6     23.5%       64.4%
oBIFs         22    76.5%       89.6%
BIF-columns   1296  90.4%       89.6%

The best performances obtained for each setup are shown in bold.

Table 12
Performance comparisons on the ZuBuD dataset.

Method                                                                                        Performance
HPAT [130]                                                                                    86.1%
Feature histograms of relational kernel, color histograms & Tamura texture histograms [126]  89.6%
Fast wide baseline matching [127]                                                             92.0%
SymPaD (L = 0, R = 512)                                                                       94.8%
Random subwindows [128]                                                                       95.7%
DCT 15 coeffs [129]                                                                           100%

The best performances obtained for each setup are shown in bold.


[52] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linearcoding for image classification, in: Proceedings of IEEE Conference on ComputerVision and Pattern Recognition (CVPR’10), 2010, pp. 3360–3367.

[53] F. Perronnin, C. Dance, Fisher kernels on visual vocabularies for image categor-ization, in: Proceedings of IEEE Conference on Computer Vision and PatternRecognition (CVPR’07), 2007, pp. 1–8.

[54] F. Perronnin, J. Sánchez, T. Mensink, Improving the fisher kernel for large-scaleimage classification, Proceedings of European Conference on Computer Vision,Springer, 2010, pp. 143–156.

[55] Y.-L. Boureau, J. Ponce, Y. LeCun, A theoretical analysis of feature pooling invisual recognition, in: Proceedings of the 27th International Conference onMachine Learning (ICML’10), 2010, pp. 111–118.

[56] N. Murray, F. Perronnin, Generalized max pooling, in: Proceedings of IEEEConference on Computer Vision and Pattern Recognition (CVPR’14), 2014, pp.2473–2480.

[57] Y.-L. Boureau, N. Le Roux, F. Bach, J. Ponce, Y. LeCun, Ask the locals: multi-waylocal pooling for image recognition, in: Proceedings of IEEE InternationalConference on Computer Vision (ICCV’11), 2011, pp. 2651–2658.

[58] E. Saund, Symbolic construction of a 2-d scale-space image, IEEE Trans. PatternAnal. Mach. Intell. 12 (8) (1990) 817–830.

[59] R. Horaud, F. Veillon, T. Skordas, Finding geometric and relational structures in animage, Proceedings of European Conference on Computer Vision, Springer, 1990,pp. 374–384.

[60] D. Lowe, Perceptual Organization and Visual Recognition, Tech. Rep., DTICDocument, 1984.

[61] L.D. Griffin, M. Lillholm, M. Crosier, J. van Sande, Basic image features (bifs)arising from approximate symmetry type, International Conference on Scale Spaceand Variational Methods in Computer Vision, Springer, 2009, pp. 343–355.

[62] B. Julesz, Textons the elements of texture perception, and their interactions,Nature 290 (5802) (1981) 91–97.

[63] J.M. Tenenbaum, A. Witkin, On the role of structure in vision, Hum. Mach. Vision(1983) 481–543.

[64] D.J. Field, What is the goal of sensory coding? Neural Comput. 6 (4) (1994)559–601.

[65] D. Geman, A. Koloydenko, Invariant statistics and coding of natural microimages,in: IEEE Workshop on Statistical and Computational Theories of Vision, 1999.

[66] A.B. Lee, K.S. Pedersen, D. Mumford, The complex statistics of high-contrastpatches in natural images, in: IEEE Workshop on Statistical and ComputationalTheories of Vision, Vancouver, CA, 2001.

[67] F. Vilnrotter, R. Nevatia, K.E. Price, Structural analysis of natural textures, IEEETrans. Pattern Anal. Mach. Intell. 1 (1986) 76–89.

[68] R. Horaud, T. Skordas, Stereo correspondence through feature grouping andmaximal cliques, IEEE Trans. Pattern Anal. Mach. Intell. 11 (11) (1989)1168–1180.

[69] J. Sun, N.N. Zheng, H. Tao, H.-Y. Shum, Image hallucination with primal sketchpriors, in: Proceedings of IEEE Conference on Computer Vision and PatternRecognition (CVPR’03), vol. 2, 2003, pp. II–729.

[70] C.E. Guo, S.C. Zhu, Y.N. Wu, Primal sketch: integrating structure and texture,Comput. Vis. Image Underst. 106 (1) (2007) 5–19.

[71] J. Mairal, F. Bach, J. Ponce, Sparse Modeling for Image and Vision Processing,Available from: arXiv preprint< arXiv:1411.3230> .

[72] A.J. Newell, R.M. Morgan, L.D. Griffin, P.A. Bull, J.R. Marshall, G. Graham,Automated texture recognition of quartz sand grains for forensic applications, J.Forensic Sci. 57 (5) (2012) 1285–1289.

[73] A.J. Newell, L.D. Griffin, Natural image character recognition using oriented basicimage features, in: Proceedings of IEEE International Conference on Digital ImageComputing Techniques and Applications (DICTA),2011, pp. 191–196.

[74] A.J. Newell, L.D. Griffin, Writer identification using oriented basic image featuresand the delta encoding, Pattern Recogn. 47 (6) (2014) 2255–2265.

[75] N. Jaccard, N. Szita, L.D. Griffin, Trainable segmentation of phase contrast mi-croscopy images based on local basic image features histograms, in: MIUA, 2014,pp. 47–52.

[76] N. Jaccard, N. Szita, L. Griffin, Segmentation of phase contrast microscopy imagesbased on multi-scale local basic image features histograms, Comput. MethodsBiomech. Biomed. Eng.: Imaging Visual. (2015) 1–9.

[77] L.D. Griffin, The second order local-image-structure solid, IEEE Trans. PatternAnal. Mach. Intell. 29 (8) (2007) 1355–1366.

[78] M. Varma, A. Zisserman, A statistical approach to texture classification from singleimages, Int. J. Comput. Vision 62 (1-2) (2005) 61–81.

[79] M.J. Morgan, Features and the “primal sketch”, Vision Res. 51 (7) (2011)738–753.

[80] R. Koekoek, R.F. Swarttouw, The askey-scheme of hypergeometric orthogonalpolynomials and its q-analogue, Available from: arXiv preprint< arXiv:math/9602214> .

[81] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in:Proceedings of the 14th International Conference on Artificial Intelligence andStatistics, 2011, pp. 315–323.

[82] A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural networkacoustic models, in: ICML, vol. 30, 2013.

[83] I.J. Goodfellow, D. Warde-Farley, M. Mirza, A.C. Courville, Y. Bengio, Maxoutnetworks, ICML 28 (2013) 1319–1327.

[84] A. Hanbury, The morphological top-hat operator generalised to multi-channelimages, in: Proceedings of the 17th IEEE International Conference on PatternRecognition (ICPR), vol. 1,2004, pp. 672–675.

[85] M. Varma, A. Zisserman, Classifying images of materials: achieving viewpoint and illumination independence, in: European Conference on Computer Vision, 2002, pp. 255–271.

[86] M. Varma, A. Zisserman, The Maximum Response (MR) Filter Banks, 2007. Available at: <http://www.robots.ox.ac.uk/vgg/research/texclass/filters.html> (date accessed: 10.09.2016).

[87] P. Dollár, Piotrs Computer Vision Matlab Toolbox (pmt).< http://vision.ucsd.edu/pdollar/toolbox/doc/index.html > .

[88] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to ClusterAnalysis, vol. 344, John Wiley & Sons, 2009.

[89] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteriaof max-dependency, max-relevance, and min-redundancy, IEEE Trans. PatternAnal. Mach. Intell. 27 (8) (2005) 1226–1238.

[90] G. Doquire, M. Verleysen, Mutual information-based feature selection for multi-label classification, Neurocomputing 122 (2013) 148–155.

[91] R. Battiti, Using mutual information for selecting features in supervised neural netlearning, IEEE Trans. Neural Networks 5 (4) (1994) 537–550.

[92] L. Wang, Toward a discriminative codebook: codeword selection across multi-re-solution, in: Proceedings of IEEE Conference on Computer vision and PatternRecognition (CVPR’07), 2007, pp. 1–8.

[93] E. Nowak, F. Jurie, Vehicle categorization: parts for speed and accuracy, in: 2ndJoint IEEE International Workshop on Visual Surveillance and PerformanceEvaluation of Tracking and Surveillance, 2005, pp. 277–283.

[94] S. Kotsiantis, D. Kanellopoulos, Discretization techniques: a recent survey, GESTSInt. Trans. Comput. Sci. Eng. 32 (1) (2006) 47–58.

[95] J. Dougherty, R. Kohavi, M. Sahami, et al., Supervised and unsupervised dis-cretization of continuous features, in: Proceedings of the 12th InternationalConference Machine Learning, vol. 12, 1995, pp. 194–202.

[96] R.C. Holte, Very simple classification rules perform well on most commonly useddatasets, Mach. Learn. 11 (1) (1993) 63–90.

[97] R. Kerber, Chimerge: discretization of numeric attributes, in: Proceedings of theTenth National Conference on Artificial Intelligence, Aaai Press, 1992, pp.123–128.

[98] M. Muja, D.G. Lowe, Fast matching of binary features, in: Ninth IEEE Conferenceon Computer and Robot Vision (CRV), 2012, pp. 404–410.

[99] L. Liu, L. Wang, X. Liu, In defense of soft-assignment coding, in: InternationalConference on Computer Vision (ICCV), IEEE, 2011, pp. 2486–2493.

[100] S.A. Nene, S.K. Nayar, H. Murase, et al., Columbia Object Image Library (coil-20),Tech. Rep., Technical Report CUCS-005-96, 1996.

[101] J.-M. Geusebroek, G.J. Burghouts, A.W. Smeulders, The amsterdam library ofobject images, Int. J. Comput. Vision 61 (1) (2005) 103–112.

[102] L. Fei-Fei, R. Fergus, P. Perona, One-shot learning of object categories, IEEE Trans.Pattern Anal. Mach. Intell. 28 (4) (2006) 594–611.

[103] H. Shao, T. Svoboda, L. Van Gool, Zubud-zurich Buildings Database for Imagebased Recognition, Computer Vision Lab, Swiss Federal Institute of Technology,Switzerland, Tech. Rep 260, 2003, p. 20.

[104] L.D. Griffin et al., Basic Image Features (BIFs) Implementation, 2015. Availableat: < https://github.com/GriffinLab/BIFs> (date accessed: 10.09.2016).

[105] O.C. Hamsici, A.M. Martinez, Rotation invariant kernels and their application toshape analysis, IEEE Trans. Pattern Anal. Mach. Intell. 31 (11) (2009) 1985–1999.

[106] S. Jayasumana, M. Salzmann, H. Li, M. Harandi, A framework for shape analysisvia hilbert space embedding, in: International Conference on Computer Vision(ICCV’13), IEEE, 2013, pp. 1249–1256.

[107] M.H. Yang, D. Roth, N. Ahuja, Learning to recognize 3d objects with snow,European Conference on Computer Vision, Springer, 2000, pp. 439–454.

[108] R. Marée, P. Geurts, J. Piater, L. Wehenkel, A generic approach for image classi-fication based on decision tree ensembles and local sub-windows, in: Proceedingsof the 6th Asian Conference on Computer Vision, vol. 2, 2004, pp. 860–865.

[109] S. Naik, C. Murthy, Distinct multicolored region descriptors for object recognition,IEEE Trans. Pattern Anal. Mach. Intell. 7 (29) (2007) 1291–1296.

[110] S. Obdrzalek, J. Matas, Object recognition using local affine frames on distinguished regions, in: BMVC, vol. 1, Citeseer, 2002, p. 3.

[111] L. Elazary, L. Itti, A Bayesian model for efficient visual search and recognition, Vision Res. 50 (14) (2010) 1338–1352.

[112] L. Bo, X. Ren, D. Fox, Multipath sparse coding using hierarchical matching pursuit, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 660–667.

[113] K. Chatfield, V.S. Lempitsky, A. Vedaldi, A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, in: BMVC, vol. 2, 2011, p. 8.

[114] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, European Conference on Computer Vision, Springer, 2014, pp. 346–361.

[115] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, European Conference on Computer Vision, Springer, 2014, pp. 818–833.

[116] Y. Pu, X. Yuan, A. Stevens, C. Li, L. Carin, A deep generative deconvolutional image model, in: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016, pp. 741–750.

[117] C. Gu, J.J. Lim, P. Arbeláez, J. Malik, Recognition using regions, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09), 2009, pp. 1030–1037.

[118] M.T. Law, N. Thome, M. Cord, Bag-of-words image representation: key ideas and further insight, Fusion in Computer Vision, Springer, 2014, pp. 29–52.

[119] F. Li, J. Carreira, C. Sminchisescu, Object recognition as ranking holistic figure-ground hypotheses, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR’10), 2010, pp. 1712–1719.

[120] L. Yang, Q. Hu, L. Zhao, Y. Li, Salience based hierarchical fuzzy representation for object recognition, in: Proceedings of IEEE International Conference on Image Processing (ICIP), 2015, pp. 4873–4877.


[121] F. Zhu, Z. Jiang, L. Shao, Submodular object recognition, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR’14), 2014, pp. 2457–2464.

[122] T. Kinnunen, Bag-of-features Approach to Unsupervised Visual Object Categorisation (Ph.D. Thesis), Acta Universitatis Lappeenrantaensis, 2011.

[123] A.C. Berg, J. Malik, Geometric blur for template matching, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR’01), vol. 1, 2001, pp. I–607.

[124] E.G. Danaci, N. Ikizler-Cinbis, Low-level features for visual attribute recognition: an evaluation, Pattern Recogn. Lett. 84 (2016) 185–191.

[125] C. Cusano, P. Napoletano, R. Schettini, Combining multiple features for color texture classification, J. Electron. Imaging 25 (6) (2016) 061410.

[126] T. Deselaers, D. Keysers, H. Ney, Classification error rate for quantitative evaluation of content-based image retrieval systems, in: Proceedings of the 17th International Conference on Pattern Recognition (ICPR’04), vol. 2, IEEE, 2004, pp. 505–508.

[127] T. Goedemé, T. Tuytelaars, L. Van Gool, Fast wide baseline matching for visual navigation, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR’04), vol. 1, 2004, pp. I–24.

[128] R. Marée, P. Geurts, L. Wehenkel, Content-based image retrieval by indexing random subwindows with randomized trees, Asian Conference on Computer Vision, Springer, 2007, pp. 611–620.

[129] Š. Obdržálek, J. Matas, Image retrieval using local compact DCT-based representation, in: Joint Pattern Recognition Symposium, Springer, 2003, pp. 490–497.

[130] H. Shao, T. Svoboda, T. Tuytelaars, L. Van Gool, HPAT indexing for fast object/scene recognition based on local appearance, International Conference on Image and Video Retrieval, Springer, 2003, pp. 71–80.

Sinem Aslan received her BSc degree in Electronics Engineering from Ankara University in 2002, and her MSc and PhD degrees from the International Computer Institute of Ege University in 2007 and 2016, respectively. She is currently a postdoctoral researcher in the Department of Informatics, Systems and Communication at the University of Milano-Bicocca. During her PhD studies, she held a visiting position at the BUSIM laboratory at Boğaziçi University. Her research interests are in the area of image processing and computer vision. In particular, her research has focused on image descriptors, image segmentation, and image understanding applications.

Ceyhun Burak Akgül is the co-founder and CTO of Vispera Information Technologies (Istanbul, Turkey). Vispera IT is an Istanbul-based visual search startup, founded in early 2014. Ceyhun was previously R&D Manager at Vistek ISRA Vision (Istanbul, Turkey) from 2012 to 2014. He planned, coordinated and contributed to the design and development of several computer vision-based industrial automation systems and was the local coordinator of several international multi-partner R&D projects during his time at Vistek. Ceyhun is also Adjunct Faculty and Senior Research Associate at the EE Dept., Bogazici University (Istanbul, Turkey) and Technology Adviser at the Digital Architectural Design Program, Istanbul Technical University. During 2008–2009, he was a Marie Curie postdoctoral fellow at Philips Research Europe (Video Processing and Analysis Group) in the framework of the FP6 IRonDB project, where he filed a patent on automated diagnosis of age-related mental disorders from MR images and metadata. Ceyhun obtained his PhD from both Télécom ParisTech Signals-Images (Paris, France) and the Boğaziçi University EE Dept. (2007). During his PhD, he developed state-of-the-art shape description, similarity learning and intelligent visual search algorithms for content-based 3D object retrieval. His research and technology interests are focused on computer vision and machine learning with applications in content-based visual search and retrieval, human action recognition, and soft biometrics. Ceyhun has published 6 journal and 50+ conference papers.

Bülent Sankur is presently at Bogazici University in the Department of Electrical-Electronic Engineering. His research interests are in the areas of digital signal processing, security and biometry, cognition and multimedia systems. He has served as a consultant in several industrial and government projects and has been involved in various European framework and/or bilateral projects. He has held visiting positions at the University of Ottawa, Technical University of Delft, and Ecole Nationale Supérieure des Télécommunications, Paris. He was the chairman of EUSIPCO’05, the European Conference on Signal Processing, as well as technical chairman of ICASSP’00. Dr. Sankur is presently an associate editor of IEEE Trans. on Image Processing, Journal of Image and Video Processing, and IET Biometrics.

E. Turhan Tunalı earned a BSc degree in Electrical Engineering from Middle East Technical University and an MSc degree in Applied Statistics from Ege University, both in Turkey. He then earned a DSc degree in Systems Science and Mathematics from Washington University in St. Louis, USA, in 1985. After his doctoral study, he joined the Computer Engineering Department of Ege University as an assistant professor, where he became an associate professor in 1988. During the period 1992–1994, he worked in the Department of Computer Technology of Nanyang Technological University of Singapore as a Visiting Senior Fellow. He then joined the International Computer Institute of Ege University as a Professor, where he served as director. In 2000–2001 he worked in the Department of Computer Science of Loyola University of Chicago as a Visiting Professor. He is currently a Professor in the Department of Computer Engineering of Izmir University of Economics. His current research interests include adaptive video streaming and Internet performance measurements.
