
Attn-HybridNet: Improving Discriminability of Hybrid Features with Attention Fusion

Sunny Verma, Chen Wang, Liming Zhu, Member, IEEE and Wei Liu, Senior Member, IEEE

Abstract—The principal component analysis network (PCANet) is an unsupervised parsimonious deep network, utilizing principal components as filters in its convolution layers. Albeit powerful, the PCANet consists of basic operations such as principal components and spatial pooling, and suffers from two fundamental problems. First, the principal components obtain information by transforming it to column vectors (which we call the amalgamated view), which incurs a loss of the spatial information in the data. Second, the generalized spatial pooling utilized in the PCANet induces feature redundancy and also fails to accommodate the spatial statistics of natural images. In this research, we first propose a tensor-factorization based deep network called the Tensor Factorization Network (TFNet). The TFNet extracts features from the spatial structure of the data (which we call the minutiae view). We then show that the information obtained by the PCANet and the TFNet is distinctive and non-trivial but individually insufficient. This phenomenon necessitates the development of the proposed HybridNet, which integrates information discovery from the two views of the data. To enhance the discriminability of the hybrid features, we propose Attn-HybridNet, which alleviates the feature redundancy by performing attention-based feature fusion. The significance of our proposed Attn-HybridNet is demonstrated on multiple real-world datasets, where the features obtained with Attn-HybridNet achieve better classification performance than other popular baseline methods, demonstrating the effectiveness of the proposed technique.

Index Terms—Tensor Decomposition, Feature Extraction, Attention Networks, Feature Fusion

I. INTRODUCTION

Feature engineering is an essential task in the development of machine learning systems and has been well studied, with substantial efforts from communities including computer vision, data mining, and signal processing [1]. In the era of deep learning, features are extracted by processing the data through multiple stacked layers in deep neural networks. These deep neural networks sequentially perform sophisticated operations to discover critical information concealed in the data [2]. However, the training time required to obtain these superior data representations is exponentially large, as these networks have an exhaustive hyper-parameter search space and usually suffer from various training difficulties [3]. Besides, deep networks are complex models that require high computational resources for their training and deployment. Hence, this limits the usability of these networks on micro-devices such as cellphones [4], [5]. The current research trend focuses on alleviating the memory and space requirements associated with deep networks [6].

Sunny Verma is with the Data Science Institute, University of Technology Sydney, Australia (email: [email protected]).

Wei Liu is with the School of Computer Science, University of Technology Sydney, Australia (email: [email protected]).

Chen Wang and Liming Zhu are with Data61, Commonwealth Scientific and Industrial Research Organization (CSIRO), Sydney, Australia (email: [email protected], [email protected]).

The source code of the proposed technique and the extracted features are available at https://github.com/sverma88/Attn-HybridNet—IEEE-TCYB.

arXiv:2010.06096v2 [cs.CV] 14 Oct 2020

In this regard, to produce lightweight convolutional neural network (CNN) architectures, most existing solutions either 1) approximate the convolution and fully connected layers by factorization [7], [8], 2) compress the layers with quantization/hashing [5], [9], or 3) replace the fully connected layer with a tensorized layer and optimize the weights of this layer by retraining [6]. However, all these techniques require a pre-trained CNN and can only work as post-optimization techniques. By contrast, the goal of this research is to build lightweight deep networks that are computationally inexpensive to train. In other words, this research aims to build deep networks that are independent of a) high-performance hardware and b) an exhaustive hyper-parameter search space, where the PCANet represents one such promising architecture.

The PCANet is an unsupervised deep parsimonious network utilizing principal components as convolution filters for extracting features in its cascaded layers [10]. Due to the remarkable performance of PCANet on several benchmark face datasets, the network is recognized as a simple deep learning baseline for image classification. However, the features extracted by PCANet do not achieve competitive performance on challenging object recognition datasets such as CIFAR-10 [11]. There are two major reasons for this performance limitation: 1) the PCANet vectorizes the data while extracting principal components, which results in a loss of the spatial information present in the data, and 2) the output layer (spatial pooling) utilized in the PCANet induces feature redundancy and does not adapt to the structure of natural images, deteriorating the performance of classifiers [12], [13]. However, the vectorization of the data is inherent to principal components, and this motivates devising techniques that can alleviate the loss of the spatial information present in the data; in other words, techniques that can extract information from the untransformed view of the data¹, which has proven beneficial in the literature [14], [15], [16].

In this research, we first propose an unsupervised tensor factorization based deep network called the Tensor Factorization Network (TFNet). The TFNet, contrary to the PCANet, does not vectorize the data while obtaining the weights of its convolution filters.

¹Throughout this paper we refer to the vectorized representation of the data as the amalgamated view, where all modes of the data (also called dimensions for higher-order matrices, i.e., tensors) are collapsed to obtain a vector. The untransformed view of the data, i.e., when viewed with its multiple modes (e.g., as tensors), is referred to as the minutiae view of the data.



Therefore, it is able to extract the information associated with the spatial structure of the data, i.e., the minutiae view of the data. Besides, the information is independently obtained from each mode of the data, providing several degrees of freedom to the information extraction procedure of TFNet.

Importantly, we hypothesize that the information obtained from either the amalgamated view or the minutiae view of the data is essential but individually insufficient, as the two views² respectively conceal complementary information [17], [15]. Therefore, integrating the information from these two views can enhance the performance of classification systems. To this end, we propose the Hybrid Network (HybridNet), which integrates information discovery and feature extraction from the minutiae view and the amalgamated view of the data simultaneously in its consolidated architecture.

Although the HybridNet reduces the information loss by integrating information from the two views of the data, it may still suffer from feature redundancy problems arising from the generalized spatial pooling utilized in the output layer of PCANet. Therefore, we propose an attention-based fusion scheme, Attn-HybridNet, that performs feature selection and aggregation, thus enhancing the discriminability of the hybrid features.

The superiority of the feature representations obtained with Attn-HybridNet is validated by performing comprehensive experiments on multiple real-world benchmark datasets. The differences and similarities between the PCANet, TFNet, HybridNet, and Attn-HybridNet from the data-view perspective are summarized in Table I.

Our contributions in this paper are summarized below:

• We propose the Tensor Factorization Network (TFNet), which extracts features from the minutiae view of the data and hence is able to preserve the spatial information present in the data, which has proven beneficial for image classification.

• We propose the Left One Mode Out Orthogonal Iteration (LoMOI) algorithm, which optimizes the convolution weights from the minutiae view of the data utilized in the proposed TFNet.

• We introduce the Hybrid Network (HybridNet), which integrates the feature extraction and information discovery procedures from the two views of the data. This integration reduces the information loss from the data by combining the merits of the PCANet and the TFNet, and obtains features superior to those of either scheme.

• We propose the Attn-HybridNet, which alleviates feature redundancy among the hybrid features by performing feature selection and aggregation with an attention-based fusion scheme. The Attn-HybridNet enhances the discriminability of the feature representations, which further improves the classification performance of our scheme.

²Throughout this paper, by two views we mean the amalgamated view and the minutiae view.

TABLE I: Comparison of different feature extraction models.

Methods             Amalgamated View   Minutiae View   Attention Fusion
PCANet [10]                ✓                  ✗                ✗
TFNet [18]                 ✗                  ✓                ✗
HybridNet [18]             ✓                  ✓                ✗
Attn-HybridNet             ✓                  ✓                ✓

• We perform comprehensive evaluations and case studies to demonstrate the effectiveness of the features obtained by Attn-HybridNet and HybridNet on multiple real-world benchmark datasets.

The rest of the paper is organized as follows: in Sec. II we present the literature review, including prior works and background on PCANet and tensor preliminaries. We then present the details of our proposed TFNet, HybridNet, and Attn-HybridNet in Sec. III, Sec. IV, and Sec. V, respectively. Next, we describe our experimental setup, results, and discussions in Sec. VI and Sec. VII. Finally, we conclude our work and specify future directions for its improvement in Sec. VIII.

II. LITERATURE REVIEW

The success of utilizing CNNs for multiple computer vision tasks, such as visual categorization and semantic segmentation, has led to extensive research and development in the deep learning field. At the same time, to supersede human performance on these tasks, CNNs require an enormous amount of computational resources. For example, ResNet [19] achieves a top-5 error rate of 3.57% with 152 layers, accounting for a total of 60M parameters, and requires 2.25 × 10^10 FLOPs at inference. This substantial computational cost restricts the applicability of such models on devices with limited computational resources, such as mobile devices. Reducing the size and time complexities of CNNs has therefore become a non-trivial task for their practical application. In this regard, researchers actively pursue three main research directions: 1) compression of trained CNN weights with quantization, 2) approximation of convolution layers with factorization, and 3) replacement of fully connected layers with custom-built layers.

In the first category, the aim is to reduce the size of trained CNNs by compressing their layers with quantization or hashing, as in [5], [9], [20]. These quantized CNN models achieve similar recognition accuracy with significantly lower computational requirements during inference. Similarly, the works in [7], [8] obtain approximations of fully connected and convolution layers by utilizing factorization for compressing CNN models. However, both the quantization and factorization based methods compress a pre-trained CNN model instead of building a smaller or faster CNN model in the first place. Therefore, these techniques inherit the limitations of the pre-trained CNN models.

In the second and the third categories, the aim is to replace fully connected layers with customized lightweight layers that substantially reduce the size of any CNN model. For example, the work in [6] proposes a neural tensor layer, while the work in [4] proposes a BoF (bag-of-features) model as a neural pooling layer.


These techniques augment conventional CNN layers and produce lightweight versions that are trainable in an end-to-end fashion. However, a major limitation of these works is that they are only capable of replacing a fully connected layer; in order to replace a convolution layer, they usually end up functioning similarly to the works in the first category.

Different from the above research, possible solutions for obtaining lightweight CNN architectures with lower computational requirements on smaller-sized images are proposed in PCANet [10] and TFNN [16]. The PCANet is a deep unsupervised parsimonious feature extractor, whereas TFNN is a supervised CNN architecture utilizing neural tensor factorizations for extracting information from multiway data. Both these networks achieve very high classification performance on handwritten digits datasets but fail to obtain competitive performance on object recognition datasets. This is because the PCANet (and its later variant FANet [21]) incurs a loss of the information associated with the spatial structure of the data, as it obtains the weights of its convolution filters from the amalgamated view of the data. Contrarily, the TFNN extracts information by isolating each view of the multi-view data and fails to efficiently consolidate them for their utmost utilization, incurring a loss of the common information present in the data.

Therefore, the information from both the amalgamated view and the minutiae view is essential for classification, and their integration can enhance classification performance [17], [15]. In this research, we first propose HybridNet, which integrates the two kinds of information in its deep parsimonious feature extraction architecture. A major difference between HybridNet and PCANet is that HybridNet obtains information from both views of the data simultaneously, whereas the PCANet is restricted to obtaining information from the amalgamated view of the data. The HybridNet is also notably different from TFNN, as the HybridNet is an unsupervised deep network while the TFNN is a supervised deep neural network. Moreover, the HybridNet extracts information from the minutiae view of the data, whereas the TFNN extracts information by isolating each mode of the multi-view data.

Moreover, to enhance the discriminability of the features obtained with HybridNet, we propose the Attn-HybridNet, which performs attention-based fusion on the hybrid features. The Attn-HybridNet reduces feature redundancy by performing feature selection and obtains superior feature representations for supervised classification. We present the related background preliminaries in the next subsection.

A. Background

We briefly summarize PCANet's 2-layer architecture and provide background on tensor preliminaries in this section.

1) The First Layer: The procedure begins by extracting overlapping patches of size $k_1 \times k_2$ around each pixel of the image, where the patches from image $I_i$ are denoted as $x_{i,1}, x_{i,2}, \ldots, x_{i,\tilde{m}\tilde{n}} \in \mathbb{R}^{k_1 k_2}$, with $\tilde{m} = m - \lceil k_1/2 \rceil$ and $\tilde{n} = n - \lceil k_2/2 \rceil$ (the operator $\lceil z \rceil$ gives the smallest integer greater than or equal to $z$). Next, the obtained patches are zero-centered by subtracting the mean of the image patches and vectorized to obtain $X_i \in \mathbb{R}^{k_1 k_2 \times \tilde{m}\tilde{n}}$ as the patch matrix. After repeating the same procedure for all the training images, we obtain $X \in \mathbb{R}^{k_1 k_2 \times N\tilde{m}\tilde{n}}$ as the final patch matrix from which the PCA filters are obtained. The PCA minimizes the reconstruction error with a family of orthonormal filters, known as the principal eigenvectors of $XX^T$, calculated as in Eq. 1:

$$\min_{V \in \mathbb{R}^{k_1 k_2 \times L_1}} \|X - VV^T X\|_F, \quad \text{s.t.} \quad V^T V = I_{L_1} \qquad (1)$$

where $I_{L_1}$ is an identity matrix of size $L_1 \times L_1$ and $L_1$ is the total number of obtained filters. These convolution filters can now be expressed as:

$$W^1_{l,\mathrm{PCANet}} = \mathrm{mat}_{k_1,k_2}\big(q_l(XX^T)\big) \in \mathbb{R}^{k_1 \times k_2} \qquad (2)$$

where $\mathrm{mat}_{k_1,k_2}(v)$ is a function that maps $v \in \mathbb{R}^{k_1 k_2}$ to a matrix $W \in \mathbb{R}^{k_1 \times k_2}$, and $q_l(XX^T)$ denotes the $l$-th principal eigenvector of $XX^T$. Next, each training image $I_i$ is convolved with the $L_1$ filters as in Eq. 3:

$$I^l_{i,\mathrm{PCANet}} = I_i * W^1_{l,\mathrm{PCANet}} \qquad (3)$$

where $*$ denotes 2D convolution and $i$, $l$ are the image and filter indices, respectively. Importantly, the boundary of image $I_i$ is padded before convolution so that $I^l_{i,\mathrm{PCANet}}$ has the same dimensions as $I_i$. From Eq. 3 a total of $N \times L_1$ images are obtained, which constitute the output of the first layer.

2) The Second Layer: The methodology of the second layer is similar to that of the first layer. We collect overlapping patches of size $k_1 \times k_2$ around each pixel from all input images of this layer, i.e., from $I^l_{i,\mathrm{PCANet}}$. Next, we vectorize and zero-center these image patches to obtain the final patch matrix, denoted as $Y \in \mathbb{R}^{k_1 k_2 \times L_1 N \tilde{m}\tilde{n}}$. This patch matrix is then utilized to obtain the convolution PCA filters of layer 2 as in Eq. 4:

$$W^2_{l,\mathrm{PCANet}} = \mathrm{mat}_{k_1,k_2}\big(q_l(YY^T)\big) \in \mathbb{R}^{k_1 \times k_2} \qquad (4)$$

where $l \in [1, L_2]$ indexes the PCA filters obtained in this layer. Next, the input images of this layer $I^l_{i,\mathrm{PCANet}}$ are convolved with the learned filters $W^2_{l,\mathrm{PCANet}}$ to obtain the output of this layer as in Eq. 5. These images are then passed to the feature aggregation phase described in the next subsection.

$$O^l_{i,\mathrm{PCANet}} = I^l_{i,\mathrm{PCANet}} * W^2_{l,\mathrm{PCANet}} \qquad (5)$$

3) The Output Layer: The output layer combines the outputs from all the convolution layers of PCANet to obtain the feature vectors. The process begins by binarizing each of the real-valued outputs from Eq. 5 with a Heaviside function $H(O^l_{i,\mathrm{PCANet}})$, which converts positive entries to 1 and the rest to 0. Then, these $L_2$ outputs are assembled into $L_1$ batches, where all images in a batch belong to the same convolution filter of the first layer. The images in each batch are then combined into a single image by applying a weighted sum as in Eq. 6, whose pixel values lie in the range $[0, 2^{L_2} - 1]$:

$$I^l_{i,\mathrm{PCANet}} = \sum_{l=1}^{L_2} 2^{l-1} H\big(O^l_{i,\mathrm{PCANet}}\big) \qquad (6)$$

Next, these images are partitioned into $B$ blocks and a histogram with $2^{L_2}$ bins is computed for each block. Finally, the histograms from all $B$ blocks are concatenated to form the feature vector of the amalgamated view of the image in Eq. 7:

$$f_{i,\mathrm{PCANet}} = \big[\mathrm{Bhist}(I^1_{i,\mathrm{PCANet}}), \ldots, \mathrm{Bhist}(I^{L_1}_{i,\mathrm{PCANet}})\big]^T \in \mathbb{R}^{(2^{L_2}) L_1 B} \qquad (7)$$


Algorithm 1 Left One Mode Out Orthogonal Iteration (LoMOI)

1: Input: n-mode tensor X ∈ R^{i1 × i2 × ... × in}; factorization ranks for each mode of the tensor [r1, ..., r(m-1), r(m+1), ..., rn], where rk ≤ ik for all k ∈ {1, 2, ..., n}, k ≠ m; factorization error tolerance ε; maximum allowable iterations Maxiter; m = mode to discard while factorizing.
2: for i = 1, 2, ..., n and i ≠ m do
3:   Xi ← unfold tensor X on mode i
4:   U^(i) ← ri leading left singular vectors of Xi            ▷ extract the leading ri matrix factors
5: G ← X ×_1 (U^(1))^T ... ×_(m-1) (U^(m-1))^T ×_(m+1) (U^(m+1))^T ... ×_n (U^(n))^T            ▷ core tensor
6: X̂ ← G ×_1 U^(1) ... ×_(m-1) U^(m-1) ×_(m+1) U^(m+1) ... ×_n U^(n)            ▷ reconstructed tensor obtained by the multilinear product of the core tensor with the factor matrices; Eq. 8
7: loss ← ‖X − X̂‖
8: count ← 0
9: while (loss ≥ ε) and (count ≤ Maxiter) do            ▷ loop until convergence
10:   for i = 1, 2, ..., n and i ≠ m do
11:     Y ← X ×_1 (U^(1))^T ... ×_(i-1) (U^(i-1))^T ×_(i+1) (U^(i+1))^T ... ×_n (U^(n))^T, skipping modes i and m            ▷ obtain the variance in mode i
12:     Yi ← unfold tensor Y on mode i
13:     U^(i) ← ri leading left singular vectors of Yi
14:   G ← X ×_1 (U^(1))^T ... ×_(m-1) (U^(m-1))^T ×_(m+1) (U^(m+1))^T ... ×_n (U^(n))^T
15:   X̂ ← G ×_1 U^(1) ... ×_(m-1) U^(m-1) ×_(m+1) U^(m+1) ... ×_n U^(n)
16:   loss ← ‖X − X̂‖
17:   count ← count + 1
18: Output: X̂, the reconstructed tensor, and [U^(1), ..., U^(m-1), U^(m+1), ..., U^(n)], the factor matrices

This block-wise encoding process encapsulates the L1 images from Eq. 6 into a single feature vector, which can be utilized for any machine learning task such as clustering or classification.
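The following is a hedged sketch of this encoding step (Eqs. 6–7) for a single first-stage image; it assumes non-overlapping blocks for simplicity, whereas the experiments also use overlapping blocks, and the block size is an illustrative choice.

```python
import numpy as np

def pcanet_output_layer(O, L2, block=(8, 8)):
    """Sketch of the output layer (Eqs. 6-7): O is a list of L2 response maps
    (one per second-stage filter) for a single first-stage image.
    Returns the 2^L2-bin block-wise histogram feature for that image."""
    # Heaviside step: positive entries -> 1, otherwise 0
    bits = [(o > 0).astype(np.int64) for o in O]
    # weighted sum turns the L2 binary maps into one "decimal" map in [0, 2^L2 - 1]
    decimal = sum((2 ** l) * b for l, b in enumerate(bits))
    h, w = decimal.shape
    bh, bw = block
    feats = []
    for r in range(0, h - bh + 1, bh):          # non-overlapping blocks for simplicity
        for c in range(0, w - bw + 1, bw):
            hist, _ = np.histogram(decimal[r:r + bh, c:c + bw],
                                   bins=np.arange(2 ** L2 + 1))
            feats.append(hist)
    return np.concatenate(feats)    # one Bhist(.) term of Eq. 7

# concatenating the L1 such vectors of an image yields f_i of Eq. 7
```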

B. Tensor Preliminaries

Tensors are multi-mode arrays, or higher-order matrices of dimension > 2; the modes (dimensions) of a tensor are analogous to the rows and columns of a matrix. In this paper, vectors, denoted as x, are called first-order tensors, and matrices, denoted as X, are called second-order tensors. Analogously, arrays of order 3 or higher are called tensors and are denoted in boldface as X. A few important multilinear algebraic operations utilized in this paper are described below.

a) Matricization: also known as tensor unfolding, is the operation of rearranging the elements of an $n$-mode tensor $\mathbf{X} \in \mathbb{R}^{i_1 \times i_2 \times \ldots \times i_N}$ into a matrix $X_{(n)} \in \mathbb{R}^{i_n \times j}$ along a chosen mode $n$, where $j = i_1 \times \ldots \times i_{n-1} \times i_{n+1} \times \ldots \times i_N$.

b) n-mode Product: the product of an $n$-mode tensor $\mathbf{X} \in \mathbb{R}^{i_1 \times \ldots \times i_{m-1} \times i_m \times i_{m+1} \times \ldots \times i_n}$ and a matrix $A \in \mathbb{R}^{j \times i_m}$ is denoted as $\mathbf{X} \times_m A$. The result of this product is also a tensor, $\mathbf{Y} \in \mathbb{R}^{i_1 \times \ldots \times i_{m-1} \times j \times i_{m+1} \times \ldots \times i_n}$, which can equivalently be expressed through the matricized tensor as $Y_{(m)} = A X_{(m)}$ (both operations are illustrated in the code sketch that follows the next definition).

c) Tensor Decomposition: Tensor decomposition is a form of generalized matrix factorization for approximating multi-mode tensors. The factorization of an $n$-mode tensor $\mathbf{X} \in \mathbb{R}^{i_1 \times i_2 \times \ldots \times i_n}$ yields two sub-components: 1) $\mathbf{G} \in \mathbb{R}^{r_1 \times r_2 \times \ldots \times r_n}$, a lower-dimensional tensor called the core tensor, and 2) $U^{(j)} \in \mathbb{R}^{i_j \times r_j}$ for $j = 1, \ldots, n$, the matrix factors associated with each mode of the tensor. The entries of the core tensor $\mathbf{G}$ signify the level of interaction between tensor elements, while the factor matrices $U^{(j)}$ are analogous to the principal components associated with the respective mode $j$. This scheme of tensor factorization falls under the Tucker family of tensor decompositions [22]. The original tensor $\mathbf{X}$ can be reconstructed by taking the $n$-mode products of the core tensor with the factor matrices as in Eq. 8:

$$\mathbf{G} \times_1 U^{(1)} \times_2 U^{(2)} \ldots \times_n U^{(n)} \approx \mathbf{X} \qquad (8)$$

The advantages of Tucker-based factorization methods have already been studied in several domains such as computer vision [14], data mining [23], and signal processing [24], [22]. However, in this research, we factorize tensors to obtain the weights of the convolution-tensorial filters of TFNet by devising a custom tensor factorization scheme, which we call Left One Mode Out Orthogonal Iteration (LoMOI), presented in Alg. 1.

III. THE TENSOR FACTORIZATION NETWORK

The development of the Tensor Factorization Network (TFNet) is motivated by the need to reduce the loss of spatial information that occurs in the PCANet when vectorizing image patches. This transformation of the data is inherent to extracting principal components and destroys the geometric structure of the object encapsulated in the data, a structure that has proven beneficial in many image classification tasks [14], [15], [16]. Furthermore, the vectorization of the data results in high-dimensional vectors and generally requires more computational resources. Motivated by the above shortcomings of the PCANet, we propose the TFNet, which preserves the spatial structure of the data while obtaining the weights of its convolution-tensor filters. The unsupervised feature extraction procedure with the minutiae view of the data is detailed in the next subsection.

A. The First Layer

Similar to the first layer in PCANet, we begin by collecting all overlapping patches of size $k_1 \times k_2$ around each pixel of the image $I_i$. However, contrary to PCANet, the spatial structure of these patches is preserved: instead of a matrix we obtain a 3-mode tensor $\mathbf{X}_i \in \mathbb{R}^{k_1 \times k_2 \times \tilde{m}\tilde{n}}$. Mode-1 and mode-2 of this tensor represent the row space and the column space spanned by the pixels of the image, whereas mode-3 represents the total number of image patches obtained from the input image. Iterating this process over all the training images, we obtain $\mathbf{X} \in \mathbb{R}^{k_1 \times k_2 \times N\tilde{m}\tilde{n}}$ as our final patch-tensor. The matrix factors utilized to generate the convolution-tensorial filters for the first two modes of $\mathbf{X}$ are obtained with our custom-designed LoMOI (presented in Alg. 1) in Eq. 9:

$$[\hat{\mathbf{X}}, U^{(1)}, U^{(2)}] \leftarrow \mathrm{LoMOI}(\mathbf{X}, r_1, r_2) \qquad (9)$$

where $\hat{\mathbf{X}} \in \mathbb{R}^{r_1 \times r_2 \times N\tilde{m}\tilde{n}}$, $U^{(1)} \in \mathbb{R}^{k_1 \times r_1}$, and $U^{(2)} \in \mathbb{R}^{k_2 \times r_2}$. We do not obtain matrix factors from mode-3 of the tensor $\mathbf{X}$ (i.e., $X_{(3)}$), as this mode is equivalent to the transpose of the patch matrix $X$ in layer 1 of the PCANet, which is likewise not factorized in the PCANet when obtaining the weights of its convolution filters. Moreover, the matrix factors of this mode span the sample space of the data, which is trivial. A total of $L_1 = r_1 \times r_2$ convolution-tensor filters are obtained from the factor matrices $U^{(1)}$ and $U^{(2)}$ as in Eq. 10:

$$W^1_{l,\mathrm{TFNet}} = U^{(1)}_{(:,i)} \otimes U^{(2)}_{(:,j)} \in \mathbb{R}^{k_1 \times k_2} \qquad (10)$$


Fig. 1: Comparison of convolution outputs from Layer 1 in PCANet, MPCANet, and TFNet on the CIFAR-10 dataset. (a) Convolution responses from the PCANet, where each response is visually distinct from the rest; this illustrates the extraction of unique information from the amalgamated view of the data. (b) Convolution responses from the MPCANet; although the output is less diversified than that of TFNet, visual resemblance is partially observable. (c) Convolution responses from our TFNet, where visual resemblance is observed in sequences of three responses; this illustrates the extraction of common information from the minutiae view of the data. These plots demonstrate the contrast between the kinds of information obtained with the amalgamated and the minutiae views of the data.

where '$\otimes$' is the outer product between two vectors, $i \in [1, r_1]$, $j \in [1, r_2]$, $l \in [1, L_1]$, and $U^{(m)}_{(:,i)}$ denotes the $i$-th column of the $m$-th factor matrix. Importantly, our convolution-tensorial filters do not require any explicit reshaping, as the outer product of two vectors naturally results in a matrix. Therefore, we can straightforwardly convolve the input images with the obtained convolution-tensorial filters as described in Eq. 11, where $i \in [1, N]$ and $l \in [1, L_1]$ are the image and filter indices, respectively:

$$I^l_{i,\mathrm{TFNet}} = I_i * W^1_{l,\mathrm{TFNet}} \qquad (11)$$

However, when the data are RGB images, each extracted patch is a 3-order tensor $\mathbf{X} \in \mathbb{R}^{k_1 \times k_2 \times 3}$ (i.e., row pixels × column pixels × color channels). After collecting patches from all the training images, we obtain a 4-mode tensor $\mathbf{X} \in \mathbb{R}^{k_1 \times k_2 \times 3 \times N\tilde{m}\tilde{n}}$, which is decomposed with LoMOI ($[\hat{\mathbf{X}}, U^{(1)}, U^{(2)}, U^{(3)}] \leftarrow \mathrm{LoMOI}(\mathbf{X}, r_1, r_2, r_3)$) to obtain the convolution-tensorial filters in Eq. 12:

$$W^1_{l,\mathrm{TFNet}} = U^{(1)}_{(:,i)} \otimes U^{(2)}_{(:,j)} \otimes U^{(3)}_{(:,k)} \qquad (12)$$

where $i \in [1, r_1]$, $j \in [1, r_2]$, and $k \in [1, r_3]$.
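A minimal sketch of Eqs. 10–12, assuming the LoMOI factor matrices are already available; the toy sizes and the QR-based dummy factor matrices are illustrative only.

```python
import numpy as np
from scipy.signal import convolve2d

def tf_filters(U1, U2, U3=None):
    """Sketch of Eqs. 10/12: build the L1 = r1*r2(*r3) convolution-tensor filters
    as outer products of the columns of the LoMOI factor matrices."""
    filters = []
    for i in range(U1.shape[1]):
        for j in range(U2.shape[1]):
            if U3 is None:                                   # grayscale case, Eq. 10
                filters.append(np.outer(U1[:, i], U2[:, j]))                 # k1 x k2
            else:                                            # RGB case, Eq. 12
                for k in range(U3.shape[1]):
                    filters.append(np.einsum('a,b,c->abc',
                                             U1[:, i], U2[:, j], U3[:, k]))  # k1 x k2 x 3
    return filters

# usage with toy factor matrices (k1 = k2 = 5, r1 = r2 = 3): 9 filters of size 5 x 5
U1 = np.linalg.qr(np.random.rand(5, 3))[0]
U2 = np.linalg.qr(np.random.rand(5, 3))[0]
W1 = tf_filters(U1, U2)
response = convolve2d(np.random.rand(32, 32), W1[0], mode='same')   # Eq. 11
```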

B. The Second Layer

Similar to the first layer, we extract overlapping patches from the input images and zero-center them to build a 3-mode patch-tensor denoted as $\mathbf{Y} \in \mathbb{R}^{k_1 \times k_2 \times N L_1 \tilde{m}\tilde{n}}$, which is decomposed as $[\hat{\mathbf{Y}}, V^{(1)}, V^{(2)}] \leftarrow \mathrm{LoMOI}(\mathbf{Y}, r_1, r_2)$ to obtain the convolution-tensor filters of layer 2 in Eq. 13:

$$W^2_{l,\mathrm{TFNet}} = V^{(1)}_{(:,i)} \otimes V^{(2)}_{(:,j)} \in \mathbb{R}^{k_1 \times k_2} \qquad (13)$$

where $\hat{\mathbf{Y}} \in \mathbb{R}^{r_1 \times r_2 \times N L_1 \tilde{m}\tilde{n}}$, $V^{(1)} \in \mathbb{R}^{k_1 \times r_1}$, $V^{(2)} \in \mathbb{R}^{k_2 \times r_2}$, $i \in [1, r_1]$, $j \in [1, r_2]$, and $l \in [1, L_2]$. We now convolve each of the $L_1$ input images from the first layer with the obtained convolution-tensorial filters as in Eq. 14:

$$O^l_{i,\mathrm{TFNet}} = I^l_{i,\mathrm{TFNet}} * W^2_{l,\mathrm{TFNet}}, \quad l = 1, 2, \ldots, L_2 \qquad (14)$$

The number of output images obtained here equals $L_1 \times L_2$, which is identical to the number of images obtained at layer 2 of PCANet. Finally, we utilize the output layer of PCANet (Sec. II-A3) to obtain the feature vectors from the minutiae view of the image in Eq. 15:

$$I^l_{i,\mathrm{TFNet}} = \sum_{l=1}^{L_2} 2^{l-1} H\big(O^l_{i,\mathrm{TFNet}}\big), \qquad f_{i,\mathrm{TFNet}} = \big[\mathrm{Bhist}(I^1_{i,\mathrm{TFNet}}), \ldots, \mathrm{Bhist}(I^{L_1}_{i,\mathrm{TFNet}})\big]^T \in \mathbb{R}^{(2^{L_2}) L_1 B} \qquad (15)$$

Despite the close resemblance between the feature extraction mechanisms of the PCANet and the TFNet, the two networks capture visibly distinguishable features from the two views of the images, as shown in Fig. 1. These plots are obtained by convolving an image of a cat with the convolution filters obtained in the first layer of each network.

Undoubtedly, each of the L1 convolution responses within the PCANet is visibly distinct, whereas the convolution responses within the TFNet show visual similarity, i.e., the images in each triplet sequence resemble one another consecutively. These plots demonstrate that the TFNet emphasizes mining the common information from the minutiae view of the data, whereas the PCANet emphasizes mining the unique information from the amalgamated view of the data. Both kinds of information have proven beneficial for classification [17], [15] and motivate the development of HybridNet.

Besides, we also include the convolution responses of MPCANet [25] in the comparison. MPCANet employs tensorized convolution but differs from our TFNet in two ways: 1) the construction of its convolution kernels and 2) its convolution operation.


Fig. 2: Workflow of the proposed Attn-HybridNet model: patch extraction, PCA and LoMOI filters in Layer 1 and Layer 2 of HybridNet, PCA-filter and tensor-filter outputs, the resulting hybrid features, and the attention-fusion step that produces the attention features.

Technically, the convolution kernel and the convolution operation in MPCANet [25] (and its predecessor [26]) are combined as conventional tensor operations: a) factor matrices are obtained for each mode of the patch-tensor, and b) the n-mode products of the factor matrices with the patch-tensor are computed. In other words, the convolution in MPCANet is an n-mode product of the patch-tensor with the factor matrices, whereas in TFNet the convolution kernels are obtained by taking outer products of the factor matrices in Eq. 10, followed by convolution in Eq. 11. From Fig. 1, it is visible that the convolution responses of MPCANet are less diversified than those of TFNet, although visual resemblance is still observable. We believe this is a consequence of how the convolution filters and the convolution operation are constructed in MPCANet, as analyzed above.
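The following toy snippet contrasts the two operations as we read them: an MPCANet-style n-mode projection of a patch versus a TFNet-style outer-product kernel followed by 2D convolution (Eqs. 10–11); all sizes and factor matrices are placeholders.

```python
import numpy as np
from scipy.signal import convolve2d

k, r1, r2 = 5, 3, 3
U1 = np.linalg.qr(np.random.rand(k, r1))[0]      # mode-1 factor matrix (assumed given)
U2 = np.linalg.qr(np.random.rand(k, r2))[0]      # mode-2 factor matrix (assumed given)
patch = np.random.rand(k, k)
image = np.random.rand(32, 32)

# MPCANet-style: an n-mode product projects each patch into the factor space
mpca_response = U1.T @ patch @ U2                # r1 x r2, i.e. patch x_1 U1^T x_2 U2^T

# TFNet-style: the factors first form an explicit k x k kernel (Eq. 10),
# which is then convolved with the whole image (Eq. 11)
kernel = np.outer(U1[:, 0], U2[:, 0])
tfnet_response = convolve2d(image, kernel, mode='same')
```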

IV. THE HYBRID NETWORK

The PCANet and the TFNet extract contrasting information from the amalgamated view and the minutiae view of the data, respectively. However, we hypothesize that the information from both of these views is essential, as they conceal complementary information, and that their integration can enhance the performance of classification systems. Motivated by the above, we propose the HybridNet, which simultaneously extracts information from both views of the data and is detailed in the next subsection. For ease of understanding, we illustrate the complete procedure of feature extraction with Attn-HybridNet in Fig. 2.

A. The First Layer

Similar to the previous networks, we begin the feature extraction process by collecting all overlapping patches of size $k_1 \times k_2$ around each pixel of the image $I_i$. Importantly, the first layer of HybridNet represents the image patches both as a tensor $\mathbf{X} \in \mathbb{R}^{k_1 \times k_2 \times 3 \times N\tilde{m}\tilde{n}}$ and as a matrix $X \in \mathbb{R}^{k_1 k_2 \times N\tilde{m}\tilde{n}}$, and both are utilized to obtain the weights of the convolution filters in layer 1 of HybridNet.

Fig. 3: Comparison of factorization strength in Layer 2 of the PCANet, TFNet, and HybridNet on the CIFAR-10 dataset. (a) Comparison of eigenvalues between the networks. (b) Comparison of core-tensor strength (core-tensor norm in Layer 1 and Layer 2 of TFNet and HybridNet) between the networks.

This enables this layer (and the subsequent layers) of HybridNet to learn superior filters, as they perceive more information from both views of the data. The weights of the PCA filters are obtained as the principal eigenvectors, $W^1_{l,\mathrm{PCA}} = \mathrm{mat}_{k_1,k_2}(q_l(XX^T))$, and the weights of the convolution-tensor filters are obtained with LoMOI as $W^1_{l,\mathrm{TF}} = U^{(1)}_{(:,i)} \otimes U^{(2)}_{(:,j)} \otimes U^{(3)}_{(:,k)}$. Furthermore, the output of this layer is obtained by convolving the input images with a) the PCA filters and b) the convolution-tensorial filters as in Eq. 16, which injects more diversity into the input of the succeeding layer of HybridNet.

$$I^l_{i,\mathrm{PCA}} = I_i * W^1_{l,\mathrm{PCA}}, \qquad I^l_{i,\mathrm{TF}} = I_i * W^1_{l,\mathrm{TF}} \qquad (16)$$

Since we obtain $L_1$ PCA filters and $L_1$ convolution-tensor filters, a total of $2 \times L_1$ output images are obtained in this layer.
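A short sketch of Eq. 16 for a single image, assuming the two filter banks have already been learned; it simply returns the 2 × L1 response maps that feed the second layer.

```python
from scipy.signal import convolve2d

def hybridnet_layer1(image, pca_filters, tf_filters):
    """Sketch of Eq. 16: convolve one image with both filter banks,
    yielding 2 * L1 response maps that feed the second layer of HybridNet."""
    out_pca = [convolve2d(image, W, mode='same') for W in pca_filters]   # L1 maps
    out_tf = [convolve2d(image, W, mode='same') for W in tf_filters]     # L1 maps
    return out_pca + out_tf                                              # 2 * L1 maps
```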

B. The Second Layer

Similar to the first layer, we begin by collecting all overlapping patches of size $k_1 \times k_2$ around each pixel of the images. However, contrary to the layer above, the weights of the PCA filters $W^2_{l,\mathrm{PCA}}$ and of the convolution-tensor filters $W^2_{l,\mathrm{TF}}$ are learned from the data obtained by convolving the input images with both the PCA filters and the convolution-tensor filters, i.e., both $I_{i,\mathrm{PCA}}$ and $I_{i,\mathrm{TF}}$.


Algorithm 2 The HybridNet Algorithm

1: Input: Ii, i = 1, 2, ..., n, where n is the total number of training images; L = [l1, l2, ..., lD], the number of filters in each layer; k1 and k2, the patch size; B, the block size; D, the depth of the network.
2: for i = 1, 2, ..., n do            ▷ for each image in the first convolution layer
3:   X ← extract patches of size k1 × k2 around each pixel of Ii            ▷ mean-centered and vectorized
4:   X ← extract patches of size k1 × k2 around each pixel of Ii            ▷ mean-centered, but retaining their spatial shape
5: W_PCA ← obtain PCA filters by factorizing X
6: W_TF ← obtain tensor filters by factorizing X with LoMOI, Alg. 1
7: for i = 1, 2, ..., n do
8:   I¹_i,PCA ← Ii ∗ W_PCA            ▷ store the convolutions with the PCA filters
9:   I¹_i,TF ← Ii ∗ W_TF            ▷ store the convolutions with the tensorial filters
10: for l = 2, ..., D do            ▷ for the remaining convolution layers
11:   I ← [I^(l−1)_PCA, I^(l−1)_TF]            ▷ utilize both views together
12:   for i = 1, 2, ..., ñ do            ▷ ñ = 2 × n × l_(l−1)
13:     X ← extract patches of size k1 × k2 around each pixel of Ii
14:     X ← extract patches of size k1 × k2 around each pixel of Ii
15:   W^l_PCA ← obtain PCA filters by factorizing X
16:   W^l_TF ← obtain tensor filters by factorizing X with LoMOI, Alg. 1
17:   for i = 1, 2, ..., ñ do            ▷ ñ = n × l_(l−1)
18:     I^l_i,PCA ← I^(l−1)_i,PCA ∗ W^l_PCA
19:     I^l_i,TF ← I^(l−1)_i,TF ∗ W^l_TF
20: for i = 1, 2, ..., n do            ▷ for each image
21:   I_i,PCA ← Σ_{k=1}^{lD} 2^(k−1) H(I^D_{k,PCA})            ▷ binarize and accumulate the outputs from all convolution layers
22:   I_i,TF ← Σ_{k=1}^{lD} 2^(k−1) H(I^D_{k,TF})
23:   f_i,PCA ← Bhist(I_i,PCA)            ▷ create block-wise histograms
24:   f_i,TF ← Bhist(I_i,TF)
25: Output: features from the amalgamated view, f_PCA, and from the minutiae view, f_TF

Hence, both the patch matrix $Y \in \mathbb{R}^{k_1 k_2 \times 2 L_1 N\tilde{m}\tilde{n}}$ and the patch tensor $\mathbf{Y} \in \mathbb{R}^{k_1 \times k_2 \times 2 L_1 N\tilde{m}\tilde{n}}$ contain image patches obtained from $[I_{i,\mathrm{PCA}}, I_{i,\mathrm{TF}}]$. This enables the hybrid filters to assimilate more of the variability present in the data while obtaining the weights of their convolution filters, as evident in Fig. 3.

The plot in Fig. 3(a) compares the eigenvalues obtained in layer 2 (we exclude the eigenvalues from layer 1 as they completely overlap, which is their expected behavior). The leading eigenvalues obtained by the principal components in layer 2 of the HybridNet have a much higher magnitude than the corresponding eigenvalues obtained by the principal components in PCANet. This demonstrates that the PCA filters in the HybridNet capture more variability than those in the PCANet.

Similarly, Fig. 3(b) compares the core-tensor strength in different layers of the HybridNet and the TFNet. We plot the norm of the core tensor for both networks, as the values in the core tensor are analogous to eigenvalues for higher-order matrices and its norm signifies the compression strength of the factorization [27]. Again, the norm of the core tensor in layer 2 of HybridNet is much lower than that of the TFNet, suggesting a relatively higher factorization strength in HybridNet. Besides, as expected, the norms of the core tensors in layer 1 of the two networks coincide, signifying equal factorization strength at this layer. Consequently, this leads to better-disentangled feature representations in the HybridNet and hence enhances its generalization performance over the PCANet and the TFNet by integrating information from the two views of the data.

In the second layer, the weights of the PCA filters are obtained from the principal components as $W^2_{l,\mathrm{PCA}} = \mathrm{mat}_{k_1,k_2}(q_l(YY^T))$, and the weights of the convolution-tensor filters are obtained as $W^2_{l,\mathrm{TF}} = V^{(1)}_{(:,i)} \otimes V^{(2)}_{(:,j)}$, where the matrix factors are obtained using LoMOI, $[\hat{\mathbf{Y}}, V^{(1)}, V^{(2)}] \leftarrow \mathrm{LoMOI}(\mathbf{Y}, r_1, r_2)$. Analogous to the previous networks, the output images of this layer of HybridNet are obtained by a) convolving the $L_1$ images corresponding to the output of the PCA filters in the first layer with the $L_2$ PCA filters obtained in the second layer (Eq. 17), and b) convolving the $L_1$ images corresponding to the output of the convolution-tensorial filters in the first layer with the $L_2$ convolution-tensorial filters obtained in the second layer (Eq. 18). This generates a total of $2 \times L_1 \times L_2$ output images in this layer.

$$O^l_{i,\mathrm{PCA}} = I^l_{i,\mathrm{PCA}} * W^2_{l,\mathrm{PCA}} \qquad (17)$$

$$O^l_{i,\mathrm{TF}} = I^l_{i,\mathrm{TF}} * W^2_{l,\mathrm{TF}} \qquad (18)$$

The output images obtained from the PCA filters ($O^l_{i,\mathrm{PCA}}$) in layer 2 are then processed with the output layer of the PCANet (Sec. II-A3) to obtain $f_{i,\mathrm{PCA}}$ as the information from the amalgamated view of the image. Similarly, the output images obtained from the convolution-tensor filters ($O^l_{i,\mathrm{TF}}$) are processed to obtain $f_{i,\mathrm{TF}}$ as the information from the minutiae view of the image. Finally, these two kinds of information are concatenated to obtain the hybrid features as in Eq. 19:

$$f_{i,\mathrm{hybrid}} = [f_{i,\mathrm{PCA}}\; f_{i,\mathrm{TF}}] \in \mathbb{R}^{(2^{L_2}) 2 L_1 B} \qquad (19)$$

The whole procedure for obtaining the hybrid features is detailed in Alg. 2. Although these hybrid features bring together the best of both the common and the unique information obtained respectively from the two views of the data, they still suffer from feature redundancy problems induced by the spatial pooling operation in the output layer. To alleviate this drawback, we propose the Attn-HybridNet, which further enhances the discriminability of the hybrid features.
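The sketch below assembles the hybrid features in the two layouts used later: the concatenated form of Eq. 19 (fed to the linear SVM) and the stacked N × D × 2 form expected by the attention fusion of Alg. 3; the shapes are illustrative.

```python
import numpy as np

def assemble_hybrid(f_pca, f_tf):
    """Sketch of Eq. 19 and of the input expected by Alg. 3.
    f_pca, f_tf: arrays of shape (N, D), with D = (2^L2) * L1 * B."""
    f_hybrid_concat = np.concatenate([f_pca, f_tf], axis=1)   # (N, 2D), Eq. 19
    f_hybrid_stacked = np.stack([f_pca, f_tf], axis=2)        # (N, D, 2), for attention fusion
    return f_hybrid_concat, f_hybrid_stacked

# toy example: N = 4 images, D = 6 features per view
f_pca, f_tf = np.random.rand(4, 6), np.random.rand(4, 6)
concat, stacked = assemble_hybrid(f_pca, f_tf)
print(concat.shape, stacked.shape)    # (4, 12) (4, 6, 2)
```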

V. ATTENTION-BASED FUSION - ATTN-HYBRIDNET

Our proposed HybridNet reduces the loss of information by integrating the learning schemes of PCANet and TFNet, thus obtaining features superior to those of either network. However, the feature encoding scheme in the output layer is elementary and induces redundancy in the feature representations [12], [28]. Moreover, the generalized spatial pooling operation in the output layer is unable to accommodate the spatial structure of natural images, i.e., it is more effective for aligned-image datasets such as faces and handwritten digits than for object recognition datasets. Put simply, the design of the output layer is unable to obtain the best possible feature representations on object recognition datasets, resulting in performance degradation for the HybridNet. Moreover, efficient ways to alleviate this drawback of the output layer have not been addressed in the literature, which necessitates the development of our proposed attention-based fusion scheme, i.e., the Attn-HybridNet.

Our proposed attention-based fusion scheme is presented in Alg. 3.


Algorithm 3 The Attn-HybridNet Algorithm

1: Input: f_hybrid = [f_PCA; f_TF] ∈ R^{N×(2^L2)L1B×2}, the hybrid feature vectors of the training images; y = [0, 1, ..., C], the ground truth of the training images; the dimensionality d of the feature-level context vector w ∈ R^d, where d ≪ (2^L2)L1B.
2: randomly initialize W, fc, and w
3: loss ← 1000            ▷ arbitrary large number to start training
4: do
5:   [f_batch, y_batch] ← sample batch([f_hybrid, y])
6:   P_F ← tanh(W · f_batch)            ▷ get the hidden representation of the hybrid features
7:   α ← softmax(w^T · P_F)            ▷ measure and normalize the importance
8:   F_attn ← f_batch · α^T            ▷ perform attention fusion
9:   ŷ ← fc(F_attn)            ▷ fully connected layer
10:  loss ← LogLoss(y_batch, ŷ)            ▷ compute the loss for optimizing the parameters
11:  back-propagate the loss to optimize W, fc, and w
12: while (loss ≥ ε)            ▷ loop until convergence
13: Output: the parameters used to perform attention fusion: W, fc, and w ∈ R^d

In Alg. 3, $f_{\mathrm{hybrid}} \in \mathbb{R}^{N \times (2^{L_2}) L_1 B \times 2}$ are the hybrid feature vectors obtained with the HybridNet, $w \in \mathbb{R}^d$ is the feature-level context vector of dimension $d \ll (2^{L_2}) L_1 B$, $\alpha^T \in \mathbb{R}^2$ is the normalized importance-weight vector for combining the two kinds of information with attention fusion, and $F_{\mathrm{attn}} \in \mathbb{R}^{(2^{L_2}) L_1 B}$ are the attention features. The fully connected layers, i.e., $W \in \mathbb{R}^{d \times (2^{L_2}) L_1 B}$ and $fc$, are utilized to obtain hidden representations of the features while performing attention fusion.
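A minimal tf.keras sketch of Alg. 3 under assumed sizes (D, d, and the number of classes are placeholders); it mirrors lines 6–9 of the algorithm but is not the authors' TensorFlow implementation.

```python
import tensorflow as tf

def build_attn_fusion(D, d, num_classes):
    """Sketch of Alg. 3 as a Keras model (D, d, num_classes are assumed values).
    Input: hybrid features of shape (D, 2) -- one column per view."""
    f = tf.keras.Input(shape=(D, 2))
    # hidden representation P_F = tanh(W . f), per view -> (batch, 2, d)
    P = tf.keras.layers.Dense(d, activation='tanh', use_bias=False)(
        tf.keras.layers.Permute((2, 1))(f))                       # (batch, 2, D) -> (batch, 2, d)
    # importance alpha = softmax(w^T . P_F) over the two views
    scores = tf.keras.layers.Dense(1, use_bias=False)(P)          # (batch, 2, 1)
    alpha = tf.keras.layers.Softmax(axis=1)(scores)               # (batch, 2, 1)
    # attention fusion F_attn = f . alpha^T -> (batch, D)
    F_attn = tf.keras.layers.Dot(axes=(2, 1))([f, alpha])         # (batch, D, 1)
    F_attn = tf.keras.layers.Flatten()(F_attn)
    out = tf.keras.layers.Dense(num_classes, activation='softmax')(F_attn)
    return tf.keras.Model(f, out)

model = build_attn_fusion(D=1024, d=100, num_classes=10)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```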

A few numerical-optimization-based techniques proposed in [12], [13] exist for alleviating the feature redundancy of architectures utilizing generalized spatial pooling layers. However, these techniques require a grid search over the dictionary size (the number of convolution filters in our case) and the pooling blocks of the output layer while performing the optimization. Besides, the transition of these filter-pruning techniques from single-layer networks to multi-layer networks is not smooth. A major difference between our proposed Attn-HybridNet and the existing proposals in [12], [13] is that we reduce the feature redundancy by performing feature selection with an attention-based fusion scheme, whereas the existing techniques prune the filters to eliminate the feature redundancy. Therefore, our proposed Attn-HybridNet is superior to these existing techniques, as it decouples the two sub-processes, i.e., information discovery with the convolution layers and feature aggregation in the pooling layer, while alleviating the redundancy exhibited in the feature representations.

The discriminative features obtained by Attn-HybridNet, i.e., $F_{\mathrm{attn}}$, are utilized with a softmax layer for classification, where the parameters of the proposed fusion scheme (i.e., $W$, $fc$, and $w$) are optimized via gradient descent on the classification loss. This simple yet effective scheme substantially enhances the classification performance by obtaining highly discriminative features. Comprehensive experiments demonstrating the superiority of Attn-HybridNet are detailed in Sec. VI.

A. Computational Complexity

To calculate the computational complexity of Attn-HybridNet, we assume the HybridNet is composed of two layers with a patch size of $k_1 = k_2 = k$ in each layer, followed by our attention-based fusion scheme.

In each layer of the HybridNet, we have to account for the time complexities arising from learning the convolution weights from the two views of the data. The formation of the zero-centered patch matrix $X$ and of the zero-centered patch tensor $\mathbf{X}$ have identical complexities of $k^2(1 + \tilde{m}\tilde{n})$. The complexities of the eigen-decomposition of the patch matrix and of the tensor factorization of the patch tensor with LoMOI are also identical and equal to $O((k^2)^3)$, where $k$ is a whole number $< 7$ in our experiments. Further, convolving the images with the convolution filters at stage $i$ requires $L_i k^2 \tilde{m}\tilde{n}$ flops. The conversion of $L_2$ binary bits to a decimal number in the output layer costs $2 L_2 \tilde{m}\tilde{n}$, where $\tilde{m} = m - \lceil k/2 \rceil$ and $\tilde{n} = n - \lceil k/2 \rceil$, and the naive histogram operation for this conversion results in a complexity of $O(\tilde{m}\tilde{n} B L_2 \log 2)$.

The complexity of performing the matrix multiplications in Attn-HybridNet is $O\big(2 L_1 B\,(d(1 + 2^{L_2}) + 2^{L_2})\big)$, which can be handled efficiently with modern deep learning packages such as TensorFlow [29] using stochastic updates. To optimize the parameters of the attention-based fusion scheme ($W$, $fc$, and $w$), we back-propagate the loss through the attention network until the error on the training features converges.

VI. EXPERIMENTS AND RESULTS

A. Experimental Setup

In our experiments, we utilized a two-layer architecture for each of the networks in comparison, while the numbers of convolution filters in the first and the second layers were optimized via cross-validation on the training datasets. The dimensionality of the feature vectors extracted with PCANet and TFNet is then $B L_1 2^{L_2}$, where $L_1$ and $L_2$ are the numbers of convolution filters in layer 1 and layer 2, respectively. The dimensionality of the feature vector of HybridNet is then $2 B L_1 2^{L_2}$. We utilized a linear SVM [30] as the classifier for the features obtained with the PCANet, TFNet, and HybridNet.

The attention-based fusion scheme is performed by following the procedure described in Alg. 3, where the optimal attention dimension for the context-level feature vector $w \in \mathbb{R}^d$ was selected on the training data from $\{10, 50, 100, 150, 200, 400\}$. The obtained attention features, i.e., $F_{\mathrm{attn}} \in \mathbb{R}^{B L_1 2^{L_2}}$, are utilized with a softmax layer for classification. The parameters of the attention-based fusion scheme ($W$, $fc$, and $w$) are optimized via back-propagation on the classification loss, implemented in TensorFlow [29]. We observed that the optimization of the attention network took fewer than 15 epochs to converge on all the datasets utilized in this paper.
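For reference, the feature dimensionalities implied by these settings follow directly from the formulas above; the block count B below is an assumed illustrative value, not a figure reported in the paper.

```python
# Feature dimensionalities from the formulas above (illustrative values only;
# B is the number of pooling blocks, which depends on the image and block sizes).
L1, L2, B = 9, 8, 49                  # e.g. L1 = 9, L2 = 8, B = 7*7 blocks (assumed)
dim_single_view = B * L1 * 2 ** L2    # PCANet or TFNet feature length
dim_hybrid = 2 * dim_single_view      # HybridNet feature length (Eq. 19)
print(dim_single_view, dim_hybrid)    # 112896 225792
```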

B. Datasets

The details of the datasets and hyper-parameters are as follows:

1) MNIST variations [31], which consist of grayscale handwritten digits of size 28 × 28 with controlled factors of variation such as background noise, rotations, etc. Each variation contains 10K training and 50K testing images. We cite the results of baseline techniques such as the 2-stage ScatNet [32] (ScatNet-2) and the 2-stage contractive auto-encoders [33] (CAE-2) as reported in [10]. The parameters of HybridNet (and of the other networks) are set to L1 = 9, L2 = 8, k1 = k2 = 7, with a block size of B = 7 × 7, keeping the size of the overlapping regions equal to half of the block size for feature pooling.


Fig. 4: Performance comparison by varying the size of the training data: (a) CIFAR-10, (b) MNIST Rotation, (c) MNIST bg-img-rot.

2) The CUReT dataset [34] consists of 61 texture categories, where each category has images of the same material with different poses, illumination, specularity, shadowing, and surface normals. Following the standard procedure in [34], [10], a subset of 92 cropped images was taken from each category and randomly partitioned into train and test sets with a split ratio of 50%. The classification results are averaged over 10 different trials with hyper-parameters L1 = 9, L2 = 8, k1 = k2 = 5, and block size B = 50 × 50, with overlapping regions equal to half of the block size. Again, we cite the results of the baseline techniques as published in [10].

3) The ORL⁵ and Extended Yale-B [35] datasets are utilized to investigate the performance of face recognition. The ORL dataset consists of 10 different images of 40 distinct subjects taken at different times, varying the lighting, facial expressions, and facial details. The Extended Yale-B dataset consists of face images from 38 individuals under 9 poses and 64 illumination conditions. The images in both datasets are cropped to a size of 64 × 64 pixels, followed by unit-length normalization. The classification results are averaged over 5 different trials by progressively increasing the number of training examples⁶, with hyper-parameters L1 = 9, L2 = 8, k1 = k2 = 5 and a non-overlapping block of size B = 7 × 7.

4) The CIFAR-10 [11] dataset consists of RGB images of dimensions 32 × 32 for object recognition, with 50K and 10K images for training and testing, respectively. These images are distributed among 10 classes and vary significantly in object positions, object scales, colors, and textures within each class. We varied the number of filters in layer 1, i.e., L1, as 9 and 27, and kept the number of filters in layer 2 as L2 = 8. The patch sizes k1 and k2 are kept equal and varied as 5, 7, and 9, with block size B = 8 × 8. Following [10], we also applied spatial pyramid pooling (SPP) [36] to the output layer of HybridNet (and similarly to the output layer of the other networks). We additionally applied PCA to reduce the dimension of each pooled feature to 100⁷. These features are utilized with the linear SVM for classification and with Attn-HybridNet for obtaining the attention features Fattn.

⁵The ORL database is publicly available and can be obtained from the website: http://www.uk.research.att.com/facedatabase.html

⁶The training and test splits are obtained from http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html

5) The CIFAR-100 [11] dataset closely follows the CIFAR-10 dataset and consists of 50K training and 10K testing images roughly evenly distributed among 100 categories. We use the same experimental setup as for CIFAR-10 on this dataset.

C. Baselines

On the face recognition datasets, we compare the performance of Attn-HybridNet and HybridNet against three recent baselines: Deep-NMF [37], PCANet+ [38], and PCANet-II [39]⁸. On the MNIST variations and CUReT datasets we select the baselines as in [10]. Besides, the hyper-parameters of these baselines are set equal to those of HybridNet.

On the CIFAR-10 dataset, we compare our schemes against multiple comparable baselines such as Tiled CNN [40], CUDA-Convnet [41], VGG-style CNN (VGG-CIFAR-10, as reported in [42]), K-means (tri) [43], Spatial Pyramid Pooling for CNN (SPP) [44], Convolution Bag-of-Features (CBoF) [45], and Spatial-CBoF [4]. Note that we do not compare our schemes against approaches that aim to either compress deep neural networks or transfer pre-learned CNN filters, such as [46], [47], [48], [49], [50], [51], as these approaches do not train a deep network from scratch, whereas our proposed schemes and the comparable baselines do so. Besides, we also report the performance of ResNet [19] and DenseNet [52] on the CIFAR datasets, as mentioned in their respective publications.

Lastly, we perform a qualitative case study on the CIFAR-10 dataset, studying the performance of the baselines and of our scheme while varying the size of the training data. Besides, we also study the effect of the attention-based fusion scheme on the discriminability of the hybrid features.

VII. RESULTS AND DISCUSSIONS

Our two main contributions in this research are: 1) the integration of information available from both the amalgamated view (i.e., the unique information) and the minutiae view (i.e., the common information), and 2) the attention-based fusion of the information obtained from these two views for supervised classification.

(Footnote 7) Results do not vary significantly on increasing the projection dimension.
(Footnote 8) The paper did not provide its source code; the results are based on our independent implementation of their second-order pooling technique.


    L1  L2  k1  k2 | PCANet [10] | TFNet [18] | HybridNet | Attn-HybridNet
     8   8   5   5 |    34.80    |   32.57    |   31.39   |     28.08
     8   8   7   7 |    39.92    |   37.19    |   35.24   |     30.94
     8   8   9   9 |    43.91    |   39.65    |   38.04   |     35.33
    27   8   5   5 |    26.43    |   29.25    |   23.84   |     18.41
    27   8   7   7 |    30.08    |   32.57    |   28.53   |     25.67
    27   8   9   9 |    33.94    |   34.79    |   31.36   |     27.70

TABLE II: Classification Error (%) obtained by varying hyper-parameters on the CIFAR-10 dataset.

Fig. 5: Accuracy of Attn-HybridNet on the CIFAR-10 dataset by varying the dimension of w in Alg. 3.

We evaluate the significance of these contributions under the following research questions:

Q1: Is the integration of both the minutiae view and the amalgamated view beneficial? Or does their integration deteriorate the generalization performance of HybridNet?

To evaluate this, we varied the amount of training data for HybridNet, the PCANet, and the TFNet and obtained the classification performance from their corresponding features on the CIFAR-10 and MNIST variations datasets. We cross-validated the performance of these schemes 5 times and present their means and variances in Fig. 4.
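The evaluation protocol behind Fig. 4 can be outlined as follows. This is an illustrative sketch only, assuming pre-extracted features and using hypothetical names: a linear SVM is trained on random subsamples of the training features, with five repetitions per training-set fraction.

    import numpy as np
    from sklearn.svm import LinearSVC

    def accuracy_vs_training_size(F_train, y_train, F_test, y_test,
                                  fractions=(0.1, 0.25, 0.5, 0.75, 1.0), repeats=5):
        # For each training-set fraction, repeatedly subsample the training
        # features, fit a linear SVM, and record mean/std test accuracy.
        rng = np.random.default_rng(0)
        summary = {}
        for frac in fractions:
            n = int(frac * len(F_train))
            scores = []
            for _ in range(repeats):
                idx = rng.choice(len(F_train), size=n, replace=False)
                clf = LinearSVC().fit(F_train[idx], y_train[idx])
                scores.append(clf.score(F_test, y_test))
            summary[frac] = (float(np.mean(scores)), float(np.std(scores)))
        return summary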

Firstly, these plots clearly suggest that the classification accuracies obtained with the features from HybridNet (and also from the PCANet and the TFNet) increase almost linearly with the size of the training data. Secondly, these plots also demonstrate that the information obtained from the amalgamated view in the PCANet is superior to the information obtained from the minutiae view in the TFNet on the object-recognition dataset. However, the two kinds of information achieve competitive classification performance on the variations of the handwritten digits dataset, which contains nearly aligned images.

Most importantly, these plots unambiguously demonstrate that integrating both kinds of information enhances the quality of the feature representations, consequently improving the classification performance of the proposed HybridNet.

Q2: How do the hyper-parameters affect the discriminability of the feature representations? Moreover, how do they affect the performance of HybridNet and Attn-HybridNet?

To address this question, we present a detailed study of how the hyper-parameters affect the performance of HybridNet and Attn-HybridNet. In this regard, we compare the classification performance of the PCANet, TFNet, HybridNet, and Attn-HybridNet on the CIFAR-10 dataset in Table II. The lowest error is highlighted in a slightly larger font, while the minimum error achieved in each row is highlighted in bold font.

    ORL Dataset                       Number of Training Instances
    Methods                             4             6             8
    Deep-NMF [37]                  9.50 ± 1.94   6.50 ± 2.59   1.75 ± 2.09
    PCANet-II [39]                16.16 ± 1.29   7.87 ± 1.29   5.00 ± 2.05
    PCANet+ [38]                   1.25 ± 0.83   0.50 ± 0.52   0.25 ± 0.55
    PCANet [10]                    1.75 ± 0.95   0.37 ± 0.34   0.40 ± 0.68
    TFNet [18]                     1.98 ± 0.54   0.50 ± 0.68   0.25 ± 0.59
    HybridNet (proposed)           1.48 ± 0.72   0.25 ± 0.32   0.21 ± 0.55
    Attn-HybridNet (proposed)      5.43 ± 0.78   3.11 ± 0.27   1.13 ± 0.31

TABLE III: Classification Error on ORL dataset.

    Extended YaleB Dataset                  Number of Training Instances
    Methods                          20            30            40            50
    Deep-NMF [37]                10.94 ± 0.89   8.03 ± 0.61   5.43 ± 0.94   4.78 ± 0.76
    PCANet-II [39]               11.40 ± 0.97   5.54 ± 1.49   2.86 ± 0.35   1.98 ± 0.75
    PCANet+ [38]                  1.15 ± 0.14   0.28 ± 0.07   0.23 ± 0.14   0.22 ± 0.23
    PCANet [10]                   1.35 ± 0.17   0.40 ± 0.18   0.38 ± 0.16   0.38 ± 0.13
    TFNet [18]                    1.97 ± 0.27   0.91 ± 0.28   0.40 ± 0.16   0.42 ± 0.21
    HybridNet (proposed)          1.32 ± 0.35   0.55 ± 0.26   0.32 ± 0.25   0.34 ± 0.21
    Attn-HybridNet (proposed)     5.11 ± 0.65   2.80 ± 0.42   2.12 ± 0.15   1.88 ± 0.40

TABLE IV: Classification Error on Extended YaleB dataset.

Moreover, we also illustrate the performance of Attn-HybridNet by varying the dimension of the context-level feature vector w utilized in our attention-fusion scheme in Fig. 5.

A clear trend is visible in Table II among the performances of all the networks, where the classification error decreases with an increase in the number of filters in the first layer. This trend also demonstrates the effect of the factorization rank while obtaining the principal components and the matrix factors with LoMOI, signifying that increasing the number of filters in the first layer allows all the networks to capture more data variability, which aids in obtaining better feature correspondences in the output stage. In addition, this also increases the dimensionality of the features extracted by the networks, suggesting that comparatively higher-dimensional features have lower intra-class variability among the feature representations of objects from the same category.

Another trend can be observed in the table, where the classification error increases with the patch size. Since the dimension of images in CIFAR-10 is 32 × 32, this may be due to the presence of less background within smaller image patches, as increasing the patch size gradually amounts to non-stationary data [10].

Importantly, our proposed Attn-HybridNet substantially reduces the classification error, by 22.78% relative to HybridNet on the CIFAR-10 dataset. The plot in Fig. 5 shows the effect on classification accuracy of varying the dimension of the feature-level context vector w in Attn-HybridNet.
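For concreteness, the relative reduction quoted above follows directly from the best-performing row of Table II (L1 = 27, k1 = k2 = 5), where E denotes classification error (%):

    \Delta_{\mathrm{rel}}
      = \frac{E_{\mathrm{HybridNet}} - E_{\mathrm{Attn\text{-}HybridNet}}}{E_{\mathrm{HybridNet}}}
      = \frac{23.84 - 18.41}{23.84}
      \approx 22.78\%

The same ratio, applied to the corresponding entries of Tables VI and VII, reproduces the 9.80%, 5.91%, and 4.87% reductions reported below.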

Q3: How do the proposed Attn-HybridNet and HybridNet perform in comparison to the baseline techniques?

To evaluate this, we compare the performance of the proposed Attn-HybridNet and HybridNet against the baselines detailed in Sec. VI-C.


    Methods                      basic    rot   bg-rand  bg-img  bg-img-rot  rect-image  convex
    CAE-2 [33]                    2.48   9.66    10.90    15.50     45.23       21.54       -
    TIRBM [53]                      -    4.20       -        -      35.50          -        -
    PGBM [54]                       -      -      6.08    12.25     36.76        8.02       -
    ScatNet-2 [32]                1.27   7.48    12.30    18.40     50.48       15.94     6.50
    PCANet [10]                   1.07   6.88     6.99    11.16     35.46       13.59     4.15
    TFNet [18]                    1.07   7.15     6.96    11.44     37.02       16.87     4.98
    HybridNet (proposed)          1.01   6.32     5.46    10.08     33.87       12.91     3.55
    Attn-HybridNet (proposed)     0.94   4.31     3.73     8.68     31.33       10.65     2.81

    (a) MNIST Variations Datasets

    Methods                      Error (%)
    Textons [55]                   1.50
    BIF [56]                       1.40
    Histogram [57]                 1.00
    ScatNet [32]                   0.20
    PCANet [10]                    0.84
    TFNet [18]                     0.96
    HybridNet (proposed)           0.81
    Attn-HybridNet (proposed)      0.72

    (b) CuReT Dataset

TABLE V: Classification Error on MNIST variations and CUReT datasets.

    Methods                            #Depth   #Params   Error
    Tiled CNN [40]                        -         -      26.90
    K-means (tri.) [43] (1600 dim.)       1         5      22.10
    CUDA-Convnet [41]                     4      1.06M     18.00
    VGG-CIFAR-10 [58]                     5      2.07M     20.04
    SPP [44]                              5     256.5K     19.39
    CBoF [45]                             5     174.6K     20.47
    Spatial-CBoF [4]                      5     199.1K     21.37
    ResNet reported in [59]             110       1.7M     13.63
    DenseNet-BC reported in [52]        250      15.3M      5.20
    PCANet [10]                           3         7      26.43
    TFNet [18]                            3         7      29.25
    HybridNet (proposed)                  3         7      23.84
    Attn-HybridNet (proposed)             3      12.7k     18.41

TABLE VI: Classification Error on CIFAR-10 dataset with no data augmentation. The DenseNet achieves the lowest classification error but at the expense of huge depth and substantial computational cost among all techniques.

In this regard, we present the performance comparison on the face recognition datasets in Table III and Table IV. The performance comparison on the handwritten digits and texture classification datasets is presented in Table V. Furthermore, the classification results on the CIFAR-10 and CIFAR-100 datasets are presented in Table VI and Table VII, respectively.

Besides, we visualize the discriminability of the feature representations obtained from HybridNet and Attn-HybridNet with t-SNE plots [60] in Fig. 7 to perform a qualitative analysis. We further aid this analysis with a plot studying their classification performance with an increasing amount of training data on the CIFAR-10 dataset in Fig. 6.
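A visualization in the spirit of Fig. 7 can be produced with a few lines of scikit-learn. The snippet below is only a sketch: the random matrix and the variable names stand in for the actual extracted Attn-HybridNet features and labels.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Placeholder feature matrix and labels; replace with the extracted features.
    rng = np.random.default_rng(0)
    F_attn = rng.normal(size=(1000, 100))
    labels = rng.integers(0, 10, size=1000)

    # Project the high-dimensional features to 2-D and color points by class.
    embedding = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(F_attn)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab10")
    plt.title("t-SNE of Attn-HybridNet features (illustrative)")
    plt.show()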

a) Performance on face recognition: A similar trend is noticeable in the classification performance on the ORL and Extended YaleB datasets. First, for all schemes, the classification error decreases as the number of training examples increases. This is expected, as with more training data all schemes can better estimate the variations in lighting, facial expression, and pose. Secondly, among the baselines, PCANet+ performs substantially better than Deep-NMF and PCANet-II on both face datasets. The poor performance of Deep-NMF can be explained by the large amount of data it needs to estimate its parameters, whereas for PCANet-II the requirement of explicitly aligned face images can explain its degradation in performance. Besides, PCANet+ also performs slightly better than PCANet, as the former enhances the latter with a better feature encoding scheme.

    Methods                          #Depth   #Params   Error
    ResNet reported in [59]            110      1.7M     37.80
    DenseNet-BC reported in [52]       250     15.3M     19.64
    PCANet [10]                          3        7      53.00
    TFNet [18]                           3        7      53.63
    HybridNet (proposed)                 3        7      49.87
    Attn-HybridNet (proposed)            4      42.2k    47.44

TABLE VII: Classification Error on CIFAR-100 dataset with no data augmentation.

Lastly, our proposed HybridNet outperforms the baselines and the individual schemes, i.e., PCANet and TFNet, on both datasets. This validates our hypothesis that the common and unique information are both essential and that their fusion can enhance classification performance. However, the classification performance of Attn-HybridNet is slightly worse than that of HybridNet. As with Deep-NMF, this is likely due to the limited amount of data available for learning the attention parameters. Attn-HybridNet nevertheless performs better than Deep-NMF because its underlying features are highly discriminative, making it less strenuous to discover the attention weights for the proposed fusion scheme than to estimate the parameters of Deep-NMF.

b) Performance on digit recognition and texture classification: On the MNIST handwritten digits variations dataset, Attn-HybridNet (and also HybridNet) outperforms the baselines on five out of seven variations. In particular, for the bg-rand and bg-img variations, we decreased the error (compared to [18]) by 31.68% and 13.80%, respectively. On the CUReT texture classification dataset, Attn-HybridNet achieves the lowest classification error among all the networks, albeit slightly higher than the state of the art. However, the difference in classification error between the state of the art [32] and Attn-HybridNet is marginal, at only 0.5%.

c) Quantitative performance on CIFAR datasets: We compare the performance of the proposed Attn-HybridNet and HybridNet against the baselines described in Sec. VI-C on the CIFAR-10 and CIFAR-100 datasets and report the results in Table VI and Table VII, respectively.

The proposed Attn-HybridNet achieves the best performance among all comparable networks studied in this paper. Specifically, the proposed HybridNet reduces the error by 9.80% on CIFAR-10 and 5.91% on CIFAR-100 in comparison to min(PCANet, TFNet).


Fig. 6: (Best viewed in color) Accuracy of various methods on the CIFAR-10 dataset by varying the size of the training data.

Additionally, the proposed Attn-HybridNet further reduces the error by 22.78% on CIFAR-10 and 4.87% on CIFAR-100 in comparison to HybridNet. Besides, on the CIFAR-10 dataset, Attn-HybridNet achieves a substantially lower error than Tiled CNN [40], K-means (tri), and the PCANet; in particular, 16.70% lower than K-means (tri), which has 2× higher feature dimensionality than our proposed HybridNet and utilizes an L2-regularized SVM instead of a Linear-SVM for classification.

The performance of our proposed Attn-HybridNet is also better than that of VGG-CIFAR-10 [42] and comparable to CUDA-Convnet [41] (footnote 9), both of which have more depth than the proposed Attn-HybridNet. In particular, we reduced the error by 1.63% compared to VGG-CIFAR-10 with 99.63% fewer trainable parameters. At the same time, we performed very competitively with CUDA-Convnet, achieving a 0.41% higher error rate but with 88% fewer tunable parameters.

Our proposed Attn-HybridNet also performs marginally better compared to deep quantized networks such as SPP [44], CBoF [45], and Spatial-CBoF [4]. This is because the proposed scheme decouples feature extraction from feature pooling, and hence the effort required to estimate its tunable parameters is negligible. The quantization schemes reduce the parameters in the fully connected layer but still require the same effort to find the optimal parameters of the higher layers.

Besides, the performance gap between Attn-HybridNet and the state-of-the-art ResNet and DenseNet is not directly comparable, as the depth and computational complexity of the latter networks are tremendously large. The main bottleneck for these schemes is therefore the requirement of high-performance hardware, which is opposite to the motivation of this work, namely the alleviation of such requirements; hence the trade-off.

d) Qualitative discussion on CIFAR-10: We now present a qualitative discussion of the performance of various baselines and our proposals by varying the size of the training data in Fig. 6. Although our proposed Attn-HybridNet consistently achieved the highest classification performance, a few interesting patterns are noticeable in the performance curves.

A paramount observation in this regard is the lower classification performance achieved by both CUDA-Convnet [41] and VGG-CIFAR-10 [58] with smaller amounts of training data, particularly up to 40% of the training set.

(Footnote 9) We cite the accuracy as published.

Fig. 7: (Best viewed in color) t-SNE visualization of features from HybridNet and Attn-HybridNet on the CIFAR-10 dataset. (a) HybridNet. (b) Attn-HybridNet.

This is intuitive and justifiable, since such a limited amount of training data is not sufficient to learn the parameters of these deep networks. However, on increasing the amount of training data (above 50%), the performance of these networks increases substantially, i.e., with a larger margin compared to the performance of the SVM-based schemes HybridNet and K-means (tri) [43].

The second observation concerns the classification performance of HybridNet and K-means (tri). Both of these networks achieve higher classification accuracy than the deep networks when little training data is available; in particular, HybridNet has an 11.56% higher classification rate than the second-highest accuracy, achieved by K-means (tri), with only 10% of the training dataset. However, the accuracy of these networks does not scale or increase as substantially with additional training data as observed for the deep-network-based schemes.

Besides, Attn-HybridNet achieved the highest classification performance across the different sizes of the training dataset among all techniques. A possible explanation is that the proposed attention-based fusion requires relatively few parameters while performing feature selection to alleviate the feature redundancy. Moreover, the t-SNE plot in Fig. 7 compares the discriminability of the features obtained with HybridNet and Attn-HybridNet. The features obtained from Attn-HybridNet (Fig. 7(b)) visually achieve better clustering than those obtained from HybridNet (Fig. 7(a)), which justifies the performance improvement with our proposal.

VIII. CONCLUSION AND FUTURE WORK

In this paper, we have introduced HybridNet, which integrates the information discovery and feature extraction procedures from the amalgamated view and the minutiae view of the data. The development of HybridNet is motivated by the fact that the information obtained from the two views of the data is individually insufficient but necessary for classification. To extract features from the minutiae view of the data, we proposed the TFNet, which obtains the weights of its convolution-tensor filters by utilizing our custom-built LoMOI factorization algorithm. We then demonstrated how the information obtained from the two views of the data is complementary, and provided details on simultaneously extracting the common information from the amalgamated view and the unique information from the minutiae view of the data in our proposed HybridNet.


The significance of integrating these two kinds of information with HybridNet is demonstrated by performing classification on multiple real-world datasets.

Although HybridNet achieves higher classification accuracy, it still suffers from feature redundancy arising from the generalized spatial pooling operation utilized to aggregate the features in its output layer. Therefore, we proposed Attn-HybridNet to alleviate the feature redundancy by performing attentive feature selection. Our proposed Attn-HybridNet enhances the discriminability of the features, which further improves their classification performance.

We performed comprehensive experiments on multiple real-world datasets to validate the significance of our proposed Attn-HybridNet and HybridNet. The features extracted using our proposed Attn-HybridNet achieved comparable or better classification performance than popular baseline methods, with significantly fewer hyper-parameters and less training time required for their optimization. Besides, we also conducted multiple case studies with other popular baseline methods to provide qualitative justification for the superiority of the features extracted by our proposed Attn-HybridNet.

Furthermore, our research can be extended in two interesting directions. The first is the design of HybridNet filters that accommodate various non-linearities in the data, such as misalignment and occlusion. The second is the design of attention-based fusion for more general tasks such as face verification and gait recognition.

REFERENCES

[1] L. Zheng, Y. Yang, and Q. Tian, "SIFT meets CNN: A decade survey of instance retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 5, pp. 1224–1244, 2018.
[2] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[3] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[4] N. Passalis and A. Tefas, "Training lightweight deep convolutional neural networks using bag-of-features pooling," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[5] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016.
[6] J. Kossaifi, A. Khanna, Z. Lipton, T. Furlanello, and A. Anandkumar, "Tensor contraction layers for parsimonious deep nets," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE, 2017, pp. 1940–1946.
[7] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1984–1992.
[8] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," in ICLR, 2015.
[9] J. Cheng, J. Wu, C. Leng, Y. Wang, and Q. Hu, "Quantized CNN: A unified approach to accelerate and compress convolutional networks," IEEE Transactions on Neural Networks and Learning Systems, no. 99, pp. 1–14, 2017.
[10] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, "PCANet: A simple deep learning baseline for image classification?" IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5017–5032, 2015.
[11] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Master's thesis, Department of Computer Science, University of Toronto, 2009.
[12] Y. Jia, O. Vinyals, and T. Darrell, "On compact codes for spatially pooled features," in International Conference on Machine Learning, 2013, pp. 549–557.
[13] Y. Jia, C. Huang, and T. Darrell, "Beyond spatial pyramids: Receptive field learning for pooled image features," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3370–3377.
[14] M. A. O. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: TensorFaces," in European Conference on Computer Vision. Springer, 2002, pp. 447–460.
[15] S. Verma, W. Liu, C. Wang, and L. Zhu, "Extracting highly effective features for supervised learning via simultaneous tensor factorization," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[16] J.-T. Chien and Y.-T. Bao, "Tensor-factorized neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1998–2011, 2017.
[17] W. Liu, J. Chan, J. Bailey, C. Leckie, and K. Ramamohanarao, "Mining labelled tensors by discovering both their common and discriminative subspaces," in Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, 2013, pp. 614–622.
[18] S. Verma, W. Liu, C. Wang, and L. Zhu, "Hybrid networks: Improving deep learning networks via integrating two views of images," in Neural Information Processing – 25th International Conference, ICONIP, Proceedings, Part I, 2018, pp. 46–58.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[20] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in International Conference on Machine Learning, 2015, pp. 2285–2294.
[21] J. Huang and C. Yuan, "FANet: Factor analysis neural network," in International Conference on Neural Information Processing. Springer, 2015, pp. 172–181.
[22] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan, "Tensor decompositions for signal processing applications: From two-way to multiway component analysis," IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 145–163, 2015.
[23] B. Savas and L. Elden, "Handwritten digit classification using higher order singular value decomposition," Pattern Recognition, vol. 40, no. 3, pp. 993–1003, 2007.
[24] F. Cong, Q.-H. Lin, L.-D. Kuang, X.-F. Gong, P. Astikainen, and T. Ristaniemi, "Tensor decomposition of EEG signals: A brief review," Journal of Neuroscience Methods, vol. 248, pp. 59–69, 2015.
[25] J. Wu, S. Qiu, R. Zeng, Y. Kong, L. Senhadji, and H. Shu, "Multilinear principal component analysis network for tensor object classification," IEEE Access, vol. 5, pp. 3322–3331, 2017.
[26] R. Zeng, J. Wu, L. Senhadji, and H. Shu, "Tensor object classification via multilinear discriminant analysis network," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 1971–1975.
[27] H. Lu, K. N. Plataniotis, and A. Venetsanopoulos, Multilinear Subspace Learning: Dimensionality Reduction of Multidimensional Data. Chapman and Hall/CRC, 2013.
[28] A. Coates and A. Y. Ng, "The importance of encoding versus training with sparse coding and vector quantization," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 921–928.
[29] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[30] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[31] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 473–480.
[32] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
[33] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: Explicit invariance during feature extraction," in Proceedings of the 28th International Conference on Machine Learning. Omnipress, 2011, pp. 833–840.
[34] M. Varma and A. Zisserman, "A statistical approach to material classification using image patch exemplars," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2032–2047, 2009.


[35] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[36] K. Grauman and T. Darrell, "The pyramid match kernel: Discriminative classification with sets of image features," in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol. 2. IEEE, 2005, pp. 1458–1465.
[37] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. W. Schuller, "A deep matrix factorization method for learning attribute representations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 417–429, 2016.
[38] C.-Y. Low, A. B.-J. Teoh, and K.-A. Toh, "Stacking PCANet+: An overly simplified ConvNets baseline for face recognition," IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1581–1585, 2017.
[39] C. Fan, X. Hong, L. Tian, Y. Ming, M. Pietikäinen, and G. Zhao, "PCANet-II: When PCANet meets the second order pooling," IEICE Transactions on Information and Systems, vol. 101, no. 8, pp. 2159–2162, 2018.
[40] J. Ngiam, Z. Chen, D. Chia, P. W. Koh, Q. V. Le, and A. Y. Ng, "Tiled convolutional neural networks," in Advances in Neural Information Processing Systems, 2010, pp. 1279–1287.
[41] A. Krizhevsky, "Cuda-convnet," 2012, accessed May 20, 2019. [Online]. Available: https://code.google.com/archive/p/cuda-convnet/
[42] S. Chintala, "VGG style CNN on CIFAR10," 2017, accessed September 3, 2017. [Online]. Available: https://github.com/soumith/DeepLearningFrameworks
[43] A. Coates, A. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
[44] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[45] N. Passalis and A. Tefas, "Learning bag-of-features pooling for deep convolutional neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5755–5763.
[46] R. Keshari, M. Vatsa, R. Singh, and A. Noore, "Learning structure and strength of CNN filters for small sample size training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9349–9358.
[47] X. Yu, T. Liu, X. Wang, and D. Tao, "On compressing deep models by low rank and sparse decomposition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7370–7379.
[48] S. Lin, R. Ji, C. Chen, D. Tao, and J. Luo, "Holistic CNN compression via low-rank decomposition with knowledge transfer," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 12, pp. 2889–2905, 2018.
[49] S. Lin, R. Ji, Y. Li, Y. Wu, F. Huang, and B. Zhang, "Accelerating convolutional networks via global & dynamic filter pruning," in IJCAI, 2018, pp. 2425–2432.
[50] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
[51] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5058–5066.
[52] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[53] K. Sohn and H. Lee, "Learning invariant representations with local transformations," in Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 1339–1346.
[54] K. Sohn, G. Zhou, C. Lee, and H. Lee, "Learning and selecting features jointly with point-wise gated Boltzmann machines," in International Conference on Machine Learning, 2013, pp. 217–225.
[55] E. Hayman, B. Caputo, M. Fritz, and J.-O. Eklundh, "On the significance of real-world conditions for material classification," in European Conference on Computer Vision. Springer, 2004, pp. 253–266.
[56] M. Crosier and L. D. Griffin, "Using basic image features for texture classification," International Journal of Computer Vision, vol. 88, no. 3, pp. 447–460, 2010.
[57] R. E. Broadhurst, "Statistical estimation of histogram variation for texture classification," in Proc. Intl. Workshop on Texture Analysis and Synthesis, 2005, pp. 25–30.
[58] I. Karmanov, "VGG style CNN on CIFAR10," accessed May 20, 2019. [Online]. Available: https://github.com/soumith/DeepLearningFrameworks
[59] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," in European Conference on Computer Vision. Springer, 2016, pp. 646–661.
[60] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

Sunny Verma received his Ph.D. degree in Computer Science from the University of Technology Sydney in 2020. He is currently working as a Postdoctoral Research Fellow at the Data Science Institute, University of Technology Sydney, and as a visiting scientist at Data61, CSIRO. Before joining UTS, he was a Research Assistant at the Department of Electrical Engineering, IIT Delhi, India, and then worked as a Senior Research Assistant at Hong Kong Baptist University, Hong Kong. His research interests include data mining, fairness in machine learning, interpretable deep learning, and open-set recognition systems.

Chen Wang is a senior research scientist with Data61, CSIRO. His research is in distributed and parallel computing, with a recent focus on data analytics systems and deep learning interpretability. He has published more than 70 papers in major journals and conferences such as TPDS, TC, WWW, SIGMOD, and HPDC. He has industrial experience and developed a high-throughput event system and a medical image archive system used by many hospitals and medical centers in the USA.

Liming Zhu is a Research Director at Data61, CSIRO. He is also a conjoint full professor at the University of New South Wales (UNSW) and the chairperson of Standards Australia's blockchain and distributed ledger committee. His research program has more than 300 people innovating in the areas of big data platforms, computational science, blockchain, regulation technology, privacy, and cybersecurity. He has published more than 200 academic papers on software architecture, secure systems, data analytics infrastructure, and blockchain.

Wei Liu (M'15–SM'20) is a Senior Lecturer and the Data Science Research Leader at the Advanced Analytics Institute, School of Computer Science, University of Technology Sydney. Before joining UTS, he was a Research Fellow at the University of Melbourne and then a Machine Learning Researcher at NICTA. He obtained his Ph.D. from the University of Sydney. He works in the areas of machine learning and data mining and has published more than 80 papers on tensor factorization, game theory, adversarial learning, graph mining, causal inference, and anomaly detection. He has won three best paper awards.