
Pattern Recognition 47 (2014) 454–469


Multiple rank multi-linear SVM for matrix data classification

Chenping Hou a,*, Feiping Nie b, Changshui Zhang c, Dongyun Yi a, Yi Wu a

a Department of Mathematics and Systems Science, National University of Defense Technology, Changsha 410073, China
b Department of Computer Science and Engineering, University of Texas, Arlington 76019, USA
c Department of Automation, Tsinghua University, Beijing 100084, China

* Corresponding author. Tel.: +86 731 8457 3238. E-mail address: [email protected] (C. Hou).

Article info

Article history: Received 23 August 2012; Received in revised form 5 June 2013; Accepted 4 July 2013; Available online 15 July 2013.

Keywords: Pattern recognition; Matrix data classification; Learning capacity; Generalization; SVM; STM

Abstract

Matrices, or more generally, multi-way arrays (tensors) are common forms of data that are encountered in a wide range of real applications. How to classify this kind of data is an important research topic for both pattern recognition and machine learning. In this paper, by analyzing the relationship between two famous traditional classification approaches, i.e., SVM and STM, a novel tensor-based method, i.e., multiple rank multi-linear SVM (MRMLSVM), is proposed. Different from traditional vector-based and tensor-based methods, multiple-rank left and right projecting vectors are employed to construct the decision boundary and establish the margin function. We reveal that the rank of the transformation can be regarded, in essence, as a tradeoff parameter that balances the capacity of learning and generalization. We also propose an effective approach to solve the resulting non-convex optimization problem. The convergence behavior, initialization, computational complexity and parameter determination problems are analyzed. Compared with vector-based classification methods, MRMLSVM achieves higher accuracy and has lower computational complexity. Compared with traditional supervised tensor-based methods, MRMLSVM performs better for matrix data classification. Promising experimental results on various kinds of data sets are provided to show the effectiveness of our method.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Matrices, or more generally, multi-way arrays (tensors) are common forms of data that are encountered in a wide range of real applications. For example, all raster images are essentially digital readings of a grid of sensors, and matrix analysis is widely applied in image processing, e.g., to photorealistic images of faces [1] and palms [2], and to medical images [3]. In web search, one can easily obtain a large volume of images represented in the form of matrices. Besides, in video data mining, the data at each time frame is also a matrix. Therefore, matrix data analysis, in particular classification, has become one of the most important topics for both pattern recognition and computer vision.

Classification is arguably the most common task in pattern recognition, and relevant techniques are abundant in the literature. Standard methods, such as the K-nearest neighbors classifier (KNN) [4], the support vector machine (SVM) [5,6] and one-dimensional regression methods [7], are widely used in many fields. Among these approaches, some are similarity based, such as KNN, and some are margin based, such as SVM. Due to its practical effectiveness and theoretical soundness, SVM and its variants, especially linear SVM, have been widely used in many real applications [8,9]. For example, SVM has been combined with the factorization machine (FM) [10] for spammer discovery in social networks [11].

Nevertheless, traditional classification methods are usually vector-based. They assume that the inputs of an algorithm are vectors, not matrix data or tensor data. When they are applied to matrix data, the matrix structure must be collapsed to make vector inputs for the algorithm. One common way is to concatenate the rows (or columns) of a matrix to form a vector.

Although traditional classification approaches have achieved satisfactory performance in many cases, they may lack efficiency in managing matrix data by simply reformulating them into vectors. The main reasons are as follows: (1) When we reformulate a matrix as a vector, the dimensionality of this vector is often very high. For example, for a small image of resolution 256×256, the dimensionality of the reformulated vector is 65,536. The performance of traditional vector-based methods will degrade due to the increase of dimensionality [12,13]. (2) With the increase of dimensionality, the computation time increases drastically. For example, the computational complexity of SVM is closely related to the dimensionality of the input data [9]. If the matrix scale is large, traditional approaches cannot be implemented in this scenario. (3) When a matrix is collapsed into a vector, the spatial correlations of the matrix are lost [14]. For example, if an image of size m×n is represented as an mn-dimensional vector, it suggests that the image is specified by mn independent variables. However, in practice, there are generally only a few aspects of interest about an image, and the degrees of freedom of the model are far less than mn.

In order to solve the above mentioned problems, researchers have proposed many approaches, which fall mainly into two types. The first type of methods reduces the dimensionality of the original tensor data and then employs off-the-shelf classifiers on the projected vector data. For example, in the image processing field, there have been many feature based methods, such as the elastic graph model, which can preserve spatial information in a compact dimensionality [15]. Recently, much interest has been devoted to tensor-based approaches for matrix data analysis. Vasilescu et al. have proposed the famous tensor face for face recognition [16]. After that, many researchers have also extended traditional subspace learning methods, such as principal component analysis (PCA) [17], linear discriminant analysis (LDA) [18], locality preserving projection (LPP) [19], etc., to their tensor counterparts [1,20-24]. Nevertheless, since most of these methods are unsupervised, they lose label information in learning the subspace and cannot be used for tensor data classification directly.

The second type of methods can classify tensor data directly [25-30]. For example, Tao et al. have proposed a framework to extend traditional approaches into their tensor counterparts, e.g., the support tensor machine (STM) [25,26]. Other related works in multiple view learning have also used the same strategy to incorporate data representations from different views [31]. When we apply these tensor-based approaches to classify matrix data, their performances can still be improved since their training error is often large. For example, STM only uses one left projecting vector together with one right projecting vector; its training error is larger than that of the approach proposed below.

In this paper, by discovering the relationship between SVM and STM, we introduce a novel matrix classification model using multiple rank projections. It is named multiple rank multi-linear SVM (MRMLSVM). Instead of converting an m×n matrix into an mn-dimensional vector as in traditional linear SVM, and instead of using only one left and one right transformation vector as in STM, we employ two groups of transformation vectors in designing the constraints and establishing the objective function. More importantly, there are several transformation vectors in each group. Essentially, we discover that the number of transformation vectors is a tradeoff parameter between the capacity of learning and the capacity of generalization of a learning machine. Besides, we also propose an effective optimization strategy and some deep analyses, including convergence analysis, initialization, computational complexity and parameter determination. Compared with other vector-based methods, such as LDASVM (LDA for subspace learning and linear SVM for classification) and linear SVM, and with tensor-based approaches, such as 2DLDASVM (2DLDA for subspace learning and linear SVM for classification) and STM, it achieves more promising results in matrix data classification. Compared with traditional classification approaches that convert the matrix into a vector, the computational complexity is also reduced. Plenty of experiments on different kinds of data are presented for illustration.

It is worthwhile to highlight the contributions of our algorithm:

(1) We have revealed the relationship between traditional linear SVM and STM. Based on this analysis, we have proposed a novel approach, i.e., MRMLSVM, for matrix data classification. Compared with other related vector-based and tensor-based approaches, it can achieve promising classification accuracy.

(2) We have provided an effective way to solve the proposed non-convex problem. Compared with its vector-based counterparts, its computational complexity is low, especially when the data scale is large. Experimental results are provided for demonstration.

(3) The most important parameter in MRMLSVM is the rank of the regression. We have revealed its essence: it can be regarded as a parameter to balance the capacity of learning and generalization of a learning machine. This is important for tightening the relationship between two famous learning methods, i.e., SVM and STM.

(4) MRMLSVM is just one instance of using multiple rank projections. Intrinsically, we can regard it as a common model; other linear methods can also be extended in a similar way.

The rest of this paper is organized as follows. In Section 2, we provide some notations and related works. In Section 3, by analyzing the relation between linear SVM and STM, we propose the MRMLSVM algorithm in detail, together with an effective way to solve the resulting problem. We present the performance analyses, including convergence behavior, initialization, computational complexity and parameter determination, in Section 4. Section 5 provides some promising comparison results on various kinds of data sets, followed by the conclusions and future work in Section 6.

2. Notations and related works

In this section, we would like to introduce some representative works, including 2DLDA, linear SVM and STM, since 2DLDA is a typical supervised two-dimensional dimensionality reduction method and our algorithm has a close relationship with linear SVM and STM. Before going into the details, let us first introduce some notations.

2.1. Notations

In this paper, we try to solve a supervised matrix data classification problem. Besides, we only introduce our algorithm for matrix data (second-order tensors) in the binary classification task. As we will explain later, it is straightforward to extend our approach to tensors of any order and to any number of categories.

Denote {X_i ∈ R^{m×n} | i = 1, 2, ..., l} as a set of training examples. The associated class labels are {y_1, y_2, ..., y_l}, where y_i = 1 iff X_i belongs to the first category and y_i = −1 otherwise. m and n are the first and second dimensions of each matrix datum respectively, and l is the number of training points. Additionally, we also have t testing points {X_i ∈ R^{m×n} | i = l+1, l+2, ..., l+t}. The vectorization of X_i, denoted as x_i, is formed by concatenating the column vectors of X_i for i = 1, 2, ..., l+t.

Define u_j ∈ R^m and v_j ∈ R^n as the j-th left and right projection vectors, where j = 1, 2, ..., k and k is the rank of the linear transformation. Other important notations are summarized in Table 1. We will explain their concrete meanings when they are first used.

Table 1. Notations.

l: Number of training points
t: Number of testing points
m: The first dimensionality of matrix data
n: The second dimensionality of matrix data
k: Rank of the linear transformation
X_i ∈ R^{m×n}: The i-th training matrix datum
x_i ∈ R^{mn}: The vectorization of X_i
y_i ∈ {1, −1}: The label of X_i
u_i ∈ R^m: The i-th left regression vector
v_i ∈ R^n: The i-th right regression vector
U = [u_1, ..., u_k]: The left regression matrix
V = [v_1, ..., v_k]: The right regression matrix
∥·∥: The Frobenius norm
⊗: The Kronecker product
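For concreteness, the column-wise vectorization used throughout the paper corresponds to Fortran-order flattening in NumPy; the short sketch below (with our own toy variable names) makes the convention explicit.

```python
import numpy as np

m, n = 3, 2
X = np.arange(m * n).reshape(m, n)   # toy stand-in for a matrix datum X_i
# [[0 1]
#  [2 3]
#  [4 5]]

# x_i: stack the columns of X_i on top of each other (column-major / Fortran order)
x = X.reshape(-1, order="F")
print(x)                             # [0 2 4 1 3 5]
```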

2.2. 2DLDA

2DLDA is one of the most important supervised two-dimensional subspace learning methods. It can be regarded as the two-dimensional extension of the traditional LDA approach. Denote M_i as the set of training points in the i-th class, where M_i has l_i samples. Let X̄_i = (1/l_i) Σ_{X_i ∈ M_i} X_i be the mean of the samples in the i-th class for 1 ≤ i ≤ c, and denote X̄ = (1/l) Σ X_i as the mean of all training data.

2DLDA tries to find two transformation matrices L and R, which project X_i to its low-dimensional embedding Z_i by Z_i = L^T X_i R. Define the within-class distance D_w and the between-class distance D_b as follows:
\[
D_w = \sum_{j=1}^{c} \sum_{X_i \in M_j} \|X_i - \bar{X}_j\|^2, \qquad
D_b = \sum_{j=1}^{c} l_j \|\bar{X}_j - \bar{X}\|^2, \tag{1}
\]
where ∥·∥ is the Frobenius norm of a matrix. Intuitively, D_b measures the sum of the divergences between any two classes, and D_w is the sum of the data variances within each category.

Similar to LDA, in the low-dimensional space the optimal transformation matrices L and R of 2DLDA should minimize D̃_w and maximize D̃_b, the low-dimensional counterparts of D_w and D_b, shown as follows:
\[
\tilde{D}_w = \mathrm{Tr}\Big( \sum_{j=1}^{c} \sum_{X_i \in M_j} L^T (X_i - \bar{X}_j) R R^T (X_i - \bar{X}_j)^T L \Big), \qquad
\tilde{D}_b = \mathrm{Tr}\Big( \sum_{j=1}^{c} l_j\, L^T (\bar{X}_j - \bar{X}) R R^T (\bar{X}_j - \bar{X})^T L \Big). \tag{2}
\]
Since it is difficult to derive the optimal L and R simultaneously, 2DLDA solves the problem in Eq. (2) in an alternating way. Briefly, it fixes L when computing R and fixes R when computing L. See more details in [21].
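A minimal sketch of this alternating scheme is given below. It assumes NumPy, a list of matrices Xs with integer labels ys and target dimensions l1 and l2, and uses a generalized-eigenvalue heuristic for each half-step; all names are ours, and details (e.g., the exact eigen-solver) may differ from the reference procedure in [21].

```python
import numpy as np

def two_dlda(Xs, ys, l1, l2, n_iter=5):
    """Alternating 2DLDA sketch: fix R to update L, then fix L to update R (cf. Eq. (2))."""
    classes = np.unique(ys)
    means = {c: np.mean([X for X, y in zip(Xs, ys) if y == c], axis=0) for c in classes}
    counts = {c: int(np.sum(np.asarray(ys) == c)) for c in classes}
    overall = np.mean(np.asarray(Xs), axis=0)
    m, n = Xs[0].shape
    R = np.vstack([np.eye(l2), np.zeros((n - l2, l2))])      # 'Fixed'-style starting point

    def top_eigvecs(Sw, Sb, d):
        vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
        order = np.argsort(-vals.real)[:d]
        return vecs[:, order].real

    for _ in range(n_iter):
        Sw = sum((X - means[y]) @ R @ R.T @ (X - means[y]).T for X, y in zip(Xs, ys))
        Sb = sum(counts[c] * (means[c] - overall) @ R @ R.T @ (means[c] - overall).T for c in classes)
        L = top_eigvecs(Sw, Sb, l1)                          # L is m x l1
        Sw = sum((X - means[y]).T @ L @ L.T @ (X - means[y]) for X, y in zip(Xs, ys))
        Sb = sum(counts[c] * (means[c] - overall).T @ L @ L.T @ (means[c] - overall) for c in classes)
        R = top_eigvecs(Sw, Sb, l2)                          # R is n x l2
    return L, R                                              # embed a sample as L.T @ X @ R
```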

As seen from the above formulation, although 2DLDA inherits discriminative power in deriving low-dimensional representations, it cannot be used for classification directly.

2.3. Linear SVM

SVM is one of the most popular classifiers in real applications. It maximizes the margin between two categories. More concretely, denoting w as the vector orthogonal to the decision boundary and b as a scalar "offset" term, we can write the decision boundary as w^T x + b = 0 for any vector datum x. To relax the hard constraints, one common way is to soften them as y_i(w^T x_i + b) ≥ 1 − ξ_i, where ξ_i ≥ 0 is a slack variable. Since the margin is proportional to 1/∥w∥, the concrete formulation of linear SVM can be defined as follows:
\[
\arg\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l, \tag{3}
\]
where C > 0 is the regularization parameter and ξ = [ξ_1, ξ_2, ..., ξ_l]^T consists of all slack variables.

Let us explain the meaning of each part. The objective function aims to maximize the margin, and the constraints indicate that the training points should be correctly classified by the relaxed decision function w^T x + b. The optimization problem in Eq. (3) can be solved by casting it into its dual form. We can also use the kernel trick to handle nonlinear classification tasks. See more details in [32].
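As a concrete point of reference, the vector-based baseline used later in the paper amounts to flattening each matrix and calling an off-the-shelf soft-margin solver for Eq. (3); a minimal sketch with scikit-learn (the arrays Xs and ys below are placeholders of ours, not the paper's data) is:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: 40 random 28 x 23 matrices with labels in {+1, -1}
rng = np.random.default_rng(0)
Xs = rng.standard_normal((40, 28, 23))
ys = np.array([1] * 20 + [-1] * 20)

# Collapse each matrix to its column-stacked mn-dimensional vector x_i
x_vec = np.stack([X.reshape(-1, order="F") for X in Xs])

clf = SVC(kernel="linear", C=1.0).fit(x_vec, ys)   # soft-margin problem of Eq. (3)
w, b = clf.coef_.ravel(), clf.intercept_[0]
pred = np.sign(x_vec @ w + b)                      # decision rule sign(w^T x + b)
```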

As seen from the formulation in Eq. (3), traditional linear SVM only involves vector data. In manipulating matrix data, we can only employ its vectorization as the input. This loses the spatial information [14] of the original matrix data and enlarges the computational cost.

2.4. STM

STM is the tensor extension of traditional SVM [25]. When it is used to manipulate matrix data, it uses two transformation vectors u and v to replace the original transformation vector w in Eq. (3). More concretely, the margin function is replaced by ∥u ⊗ v∥^2 and the transformation w^T x_i is replaced by u^T X_i v. STM has the following formulation:
\[
\arg\min_{u, v, \xi, b} \ \frac{1}{2} \|u \otimes v\|^2 + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i (u^T X_i v + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l, \tag{4}
\]
where ⊗ indicates the Kronecker product between two vectors and u ⊗ v = Vec(uv^T).

As stated in [25], since the derivation of u depends on v, we cannot solve the optimization problem in Eq. (4) directly. One possible way is to optimize them alternately. Moreover, when v is fixed, the problem in Eq. (4) amounts to a standard SVM problem; similarly, when u is fixed, we can use the same procedure to derive v. See more details in [25].
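A sketch of this alternating scheme, using scikit-learn's linear SVC as the inner solver, is shown below. The rescalings follow from substituting ũ = ∥v∥u (and symmetrically for v) so that each half-step is exactly a standard SVM; all names are ours and this is only an illustration, not the reference STM implementation of [25].

```python
import numpy as np
from sklearn.svm import SVC

def stm_fit(Xs, ys, C=1.0, n_iter=10):
    """Alternate the two SVM subproblems of Eq. (4): fix v and solve for u, then vice versa."""
    m, n = Xs[0].shape
    v = np.ones(n) / np.sqrt(n)                            # simple starting point
    for _ in range(n_iter):
        # Fix v: substituting u_tilde = ||v|| u turns Eq. (4) into an SVM on features X_i v / ||v||
        Fu = np.stack([X @ v for X in Xs]) / np.linalg.norm(v)
        clf = SVC(kernel="linear", C=C).fit(Fu, ys)
        u = clf.coef_.ravel() / np.linalg.norm(v)
        # Fix u: the symmetric step on features X_i^T u / ||u||
        Fv = np.stack([X.T @ u for X in Xs]) / np.linalg.norm(u)
        clf = SVC(kernel="linear", C=C).fit(Fv, ys)
        v = clf.coef_.ravel() / np.linalg.norm(u)
        b = clf.intercept_[0]
    return u, v, b                                         # predict with sign(u^T X v + b)
```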

3. Multiple rank multi-linear SVM

In this section, we introduce our multiple rank multi-linear SVM (MRMLSVM) algorithm. Before going into the details, we analyze the relationship between SVM and STM. Then, the formulation of MRMLSVM is introduced step by step. Since the formulated problem is not convex, we propose an effective method to find an approximate solution in an alternating way. To show the effectiveness theoretically, we also provide some deeper analyses in the next section.

3.1. Relationship between SVM and STM

Comparing the formulation of SVM in Eq. (3) with that of STM in Eq. (4), we see that the transformation vector w in SVM's objective function has been replaced by u ⊗ v in STM. Correspondingly, the constraints have been changed from y_i(w^T x_i + b) ≥ 1 − ξ_i to y_i(u^T X_i v + b) ≥ 1 − ξ_i.

Notice that
\[
u^T X_i v = \mathrm{Tr}(u^T X_i v) = \mathrm{Tr}(X_i v u^T) = \mathrm{Tr}\big(X_i (v u^T)\big)
= \mathrm{Tr}\big((u v^T)^T X_i\big) = \big(\mathrm{Vec}(u v^T)\big)^T \mathrm{Vec}(X_i) \tag{5}
\]
and
\[
\|u \otimes v\|^2 = \|\mathrm{Vec}(u v^T)\|^2. \tag{6}
\]
Here Vec(·) is the vectorization of a matrix (obtained by stacking the columns of the matrix).
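Both identities are easy to confirm numerically; a quick sanity check (with our own variable names) is:

```python
import numpy as np

m, n = 4, 3
u, v, X = np.random.randn(m), np.random.randn(n), np.random.randn(m, n)

vec = lambda A: A.reshape(-1, order="F")            # column-stacking vectorization

# Eq. (5): u^T X v equals (Vec(u v^T))^T Vec(X)
assert np.isclose(u @ X @ v, vec(np.outer(u, v)) @ vec(X))

# Eq. (6): ||u (x) v||^2 equals ||Vec(u v^T)||^2
assert np.isclose(np.linalg.norm(np.kron(u, v)) ** 2,
                  np.linalg.norm(vec(np.outer(u, v))) ** 2)
```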

As seen from the above deduction, when we employ STM to classify matrix data, it is equivalent to employing SVM on the vectorization of the corresponding data, provided that the constraint w = Vec(uv^T) is imposed. More concretely, we have the following discussions.

On one hand, STM can be regarded as a special case of SVM obtained by adding the constraint w = Vec(uv^T). This indicates that the corresponding w is determined by only m+n variables. In many real applications, such as image processing, a real matrix of size m×n has mn elements; it cannot be simply modeled by just m+n independent variables [14]. In other words, this added constraint is too strict, since we only have m+n free variables to model the mn-dimensional vectors derived from the original matrices. It makes the model lack flexibility in modeling matrix data. By contrast, w in SVM has mn free variables. From the viewpoint of optimization, the feasible region of STM is much smaller than that of SVM due to this additional constraint. Consequently, compared with SVM, which can find the global solution in a larger feasible region, the objective function value of STM is larger, which indicates that it has a larger training error.

On the other hand, compared with STM, SVM has a larger degree of freedom in selecting a feasible w, since it regards the mn elements of the matrix as independent. Nevertheless, it treats traditional vector data and vectorized matrix data equally. Compared with traditional vector data, matrix data also have some spatial dependence; for example, the pixels of an image cannot be treated independently [14]. SVM neglects the spatial dependence of matrix data and treats the elements independently. In other words, this model has too many free variables, which will cause the problem of overfitting.

Besides, the above connection between SVM and STM also holds when the order of the tensor is larger than 2. For illustration, we take a third-order tensor as an example; the extension to any order can be derived in a similar way. Assume X ∈ R^{m1×m2×m3} is a third-order tensor. The corresponding transformation vectors are u^{(j)} ∈ R^{m_j} for j = 1, 2, 3. To show the relationship, we only need to prove that Eqs. (5) and (6) also hold in this scenario.

Notice that
\[
\Big\| \prod_{j=1}^{3} \otimes\, u^{(j)} \Big\|^2
= \sum_{p_1=1}^{m_1} \sum_{p_2=1}^{m_2} \sum_{p_3=1}^{m_3} \big(u^{(1)}_{p_1} u^{(2)}_{p_2} u^{(3)}_{p_3}\big)^2
= \Big\| \mathrm{Vec}\Big( \prod_{j=1}^{3} \otimes\, u^{(j)} \Big) \Big\|^2
\]
and
\[
X \times_1 u^{(1)} \times_2 u^{(2)} \times_3 u^{(3)}
= \sum_{p_1=1}^{m_1} \sum_{p_2=1}^{m_2} \sum_{p_3=1}^{m_3} X_{p_1, p_2, p_3}\, u^{(1)}_{p_1} u^{(2)}_{p_2} u^{(3)}_{p_3}
= \Big( \mathrm{Vec}\Big( \prod_{j=1}^{3} \otimes\, u^{(j)} \Big) \Big)^T \mathrm{Vec}(X),
\]
where ×_1 is the mode-1 product between a tensor and a vector (and similarly for the other modes). Vec(·) is the vectorization of a third-order tensor, obtained by first stacking the columns of each slice and then concatenating the resulting vectors across slices. See more details in [25].

As seen from the above two equations, STM can be regarded as a special case of SVM, provided that w = Vec(∏_{j=1}^{3} ⊗ u^{(j)}). This also holds in the matrix scenario since u ⊗ v = Vec(uv^T). Additionally, the above deductions can also be conducted on higher-order tensors.

3.2. Multiple rank multi-linear constraints

Based on the above analysis, we know that the constraints of STM and SVM have their own merits and disadvantages. To relax the too strict constraints of STM and avoid the overfitting of SVM, instead of using merely one couple of projecting vectors, i.e., the left projecting vector u and the right projecting vector v, we propose to use k couples of left and right projecting vectors in our MRMLSVM formulation. They are denoted as {u_j}_{j=1}^{k} and {v_j}_{j=1}^{k}. More concretely, the constraints in Eq. (4) become
\[
y_i \big( u_1^T X_i v_1 + u_2^T X_i v_2 + \cdots + u_k^T X_i v_k + b \big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l. \tag{7}
\]
Denoting U = [u_1, u_2, ..., u_k] ∈ R^{m×k} and V = [v_1, v_2, ..., v_k] ∈ R^{n×k}, the constraints in Eq. (7) can be reformulated as follows:
\[
y_i \big( \mathrm{Tr}(U^T X_i V) + b \big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l. \tag{8}
\]
Intuitively, compared with the employment of only one couple of projecting vectors, there are in total k(m+n) free variables in our formulation. Since k ≥ 2, this is more than in STM, which only has m+n free variables. In other words, the too strict constraint w = Vec(uv^T) of STM has been relaxed by introducing more free parameters: this constraint no longer needs to hold, and our model is more flexible in characterizing matrix data. Besides, one couple of projecting vectors is a special case of our setting, obtained when u_j = 0, v_j = 0 for j ≥ 2. More importantly, we have the following proposition, which reveals the essence of the parameter k, the rank of the projection.

Proposition 1. Assume {u_1, u_2, ..., u_k} are any k vectors of dimensionality m and {v_1, v_2, ..., v_k} are k vectors of dimensionality n. If k = min(m, n), then the dimensionality of the space spanned by Vec(Σ_{i}^{k} u_i v_i^T) is mn. Here Vec(·) represents the vectorization of a matrix.

The proof is listed in the Appendix. Based on the above proposition, we have the following discussions:

(1) When k = min(m, n) and the constraints are Tr((U^{(r)})^T X_i V^{(r)}) ≥ 1 − ξ_i, the above proposition indicates that the searching space for U^{(r)}(V^{(r)})^T is very similar to the feasible region determined by (Vec(Σ_{i}^{k} u_i(v_i)^T))^T Vec(X_i). In other words, (Vec(Σ_{i}^{k} u_i(v_i)^T))^T Vec(X_i) and Tr((U^{(r)})^T X_i V^{(r)}) are more likely to be the same. That is to say, the constraints in Eq. (8) are close to the corresponding constraints of SVM.

(2) As shown in the above proposition, SVM has the larger k and more free parameters (m×n). Thus, the feasible region of SVM is larger and it can find the global optimum; consequently, its training error is often smaller. Nevertheless, since it has too many parameters and the model is too complicated, it tends to overfit. When we use a smaller k, as in STM (k = 1), the overfitting problem is avoided; nevertheless, this imposes a too strict constraint, since we only have m+n free parameters. Compared with SVM, which can find the global optimum in a larger feasible region, the training error of STM is often larger.

3.3. Formulation

Motivated by the above comparison, we now introduce our algorithm formally. Mathematically, MRMLSVM has the constraints shown in Eq. (8). Similar to SVM and STM, the objective function of MRMLSVM should describe the margin. Considering the objective functions of SVM and STM shown in Eqs. (3) and (4), we have
\[
\sum_{j=1}^{k} u_j^T X_i v_j = \sum_{j=1}^{k} \mathrm{Tr}(u_j^T X_i v_j)
= \sum_{j=1}^{k} \mathrm{Tr}(X_i v_j u_j^T)
= \mathrm{Tr}\Big( X_i \Big( \sum_{j=1}^{k} v_j u_j^T \Big) \Big)
= \mathrm{Tr}(X_i V U^T) = \mathrm{Tr}\big((U V^T)^T X_i\big) = \big(\mathrm{Vec}(U V^T)\big)^T \mathrm{Vec}(X_i). \tag{9}
\]
Comparing the formulated constraints of MRMLSVM (in Eq. (8)) with those of SVM (in Eq. (3)), we know that the tensor counterpart of w in Eq. (3) should be UV^T. Since the objective function of SVM is
\[
\frac{1}{2} \mathrm{Tr}(w w^T) + C \sum_{i=1}^{l} \xi_i,
\]
the tensor extension of the original objective function is
\[
\frac{1}{2} \mathrm{Tr}(U V^T V U^T) + C \sum_{i=1}^{l} \xi_i. \tag{10}
\]
In summary, by combining the constraints in Eq. (8) with the objective function in Eq. (10), we formulate our MRMLSVM algorithm as follows:


\[
\arg\min_{U, V, \xi, b} \ \frac{1}{2} \mathrm{Tr}(U V^T V U^T) + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i \big( \mathrm{Tr}(U^T X_i V) + b \big) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l, \tag{11}
\]
where U = [u_1, u_2, ..., u_k] and V = [v_1, v_2, ..., v_k] are defined as before.

As seen from the above formulation of MRMLSVM, there is one important parameter, k. We would like to reveal its essence.

First, we show the influence of k on the training error. As seen from the above formulation and the following solving strategy, we assume 1 < k < min(m, n) and use a similar way of formulating and solving the problem as in STM. More importantly, if we use the solution of STM to initialize MRMLSVM, we have the following proposition.

Proposition 2. If we use the optimal value v* of STM to initialize MRMLSVM by setting v_1 = v* and the other vectors to any values satisfying [v_2, v_3, ..., v_k] ≠ 0, then the optimal function value of MRMLSVM is no larger than that of STM.

The proof is listed in the Appendix. As stated in the above proposition, if we use this kind of initialization, the training error of our method is no larger than that of STM. Besides, the experimental results in Section 5.2 also show that other kinds of initializations (mentioned in Section 4.2) have similar training errors, which are also smaller than that of STM. On the contrary, since SVM has the largest k and can find the global optimum, its feasible region is larger than that of MRMLSVM and its training error is no larger than that of MRMLSVM. The experimental results in Section 5.2 also validate this analysis. In summary, if we use STM to initialize MRMLSVM, a larger k indicates a smaller training error.

Second, we show the influence of k on the extent to which overfitting is avoided. As mentioned in Proposition 1, SVM has the largest k and the most free parameters (m×n), so it tends to overfit; when we use a smaller k, the overfitting problem is alleviated. In summary, the larger k is, the more likely overfitting occurs.

Finally, based on the above two points, k can be regarded as a parameter that balances the training error and the extent to which overfitting is avoided. In learning theory, the training error measures the capacity of learning, and the extent to which overfitting is avoided measures the capacity of generalization [5]. Therefore, k is in essence a parameter that balances these two capacities. Recalling the basic rule of learning theory, these two capacities cannot be improved simultaneously [5]. In our following formulations, we assume 1 < k < min(m, n). Thus, our method is a general tradeoff between two famous methods, i.e., SVM and STM.

3.4. Solution

Since the problem in Eq. (11) is not jointly convex with respect to U and V, in this section we try to find an approximate solution to the proposed problem. Recalling the basic solving procedure of STM, we optimize U and V in an alternating way; more concretely, we fix one of them and optimize the other. Before going into the details, we first introduce a lemma.

Lemma 1. Assume that A, B and C are three matrices such that ABC exists. Then:

(1) Vec(ABC) = (C^T ⊗ A) Vec(B).
(2) Tr(A^T B) = Vec(A)^T Vec(B).

The proof of this lemma is direct and we omit it.

Based on Lemma 1, we have the following result, which is vital for our solution.

Corollary 1.
\[
\mathrm{Tr}(U V^T V U^T) = \mathrm{Tr}(U^T U V^T V) = \big(\mathrm{Vec}(U)\big)^T \big( (V^T V) \otimes I_{m \times m} \big) \mathrm{Vec}(U), \tag{12}
\]
where I_{m×m} represents the m×m identity matrix.

The proof is listed in the Appendix; see more details there.
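Corollary 1 is straightforward to check numerically; the small snippet below (our naming, column-major Vec) does so for random U and V.

```python
import numpy as np

m, n, k = 5, 4, 2
U, V = np.random.randn(m, k), np.random.randn(n, k)

lhs = np.trace(U @ V.T @ V @ U.T)
D = np.kron(V.T @ V, np.eye(m))                  # (V^T V) (x) I_{m x m}
vecU = U.reshape(-1, order="F")                  # Vec(U): columns u_1, ..., u_k stacked
assert np.isclose(lhs, vecU @ D @ vecU)          # Eq. (12)
```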

(1) Fixing V and optimizing U and b. As seen from Eq. (11), we can rewrite the optimization problem of MRMLSVM as
\[
\arg\min_{U, V, \xi, b} \ \frac{1}{2} \mathrm{Tr}(U V^T V U^T) + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i \Big( \sum_{j=1}^{k} u_j^T X_i v_j + b \Big) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l. \tag{13}
\]
When we fix V, or equivalently v_1, v_2, ..., v_k, we need to compute the set of u_j for j = 1, 2, ..., k and b. First, let us reformulate the problem in Eq. (13). Denote
\[
f_i = \begin{bmatrix} X_i v_1 \\ X_i v_2 \\ \vdots \\ X_i v_k \end{bmatrix} \in \mathbb{R}^{mk \times 1}, \qquad
\bar{u} = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_k \end{bmatrix} \in \mathbb{R}^{mk \times 1}, \qquad
D = (V^T V) \otimes I_{m \times m}. \tag{14}
\]
Then, recalling that ū = Vec(U) and using Corollary 1, we know that Eq. (13) is equivalent to
\[
\arg\min_{\bar{u}, \xi, b} \ \frac{1}{2} \bar{u}^T D \bar{u} + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i (\bar{u}^T f_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l. \tag{15}
\]
Denote
\[
u = D^{1/2} \bar{u}, \qquad \hat{f}_i = D^{-1/2} f_i. \tag{16}
\]
Then, Eq. (15) becomes
\[
\arg\min_{u, \xi, b} \ \frac{1}{2} u^T u + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i (u^T \hat{f}_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l. \tag{17}
\]

Comparing the formulation in Eq. (17) with that in Eq. (3), we see that the problem in Eq. (17) amounts to a standard linear SVM problem when V is fixed. More concretely, we solve the problem in Eq. (17) by forming the Lagrangian
\[
L = \frac{1}{2} u^T u + C \sum_{i=1}^{l} \xi_i
- \sum_{i=1}^{l} \alpha_i \big( y_i ( u^T \hat{f}_i + b ) - 1 + \xi_i \big)
- \sum_{i=1}^{l} \beta_i \xi_i, \tag{18}
\]
where α_i ≥ 0 and β_i ≥ 0 are the Lagrange multipliers. Recalling the basic idea of the duality theorem [33], we can reformulate the problem in Eq. (17) as

\[
\min_{u, b} \ \max_{\alpha_i \ge 0, \beta_i \ge 0} \
\Bigg( \frac{1}{2} u^T u + C \sum_{i=1}^{l} \xi_i
- \sum_{i=1}^{l} \alpha_i \big( y_i ( u^T \hat{f}_i + b ) - 1 + \xi_i \big)
- \sum_{i=1}^{l} \beta_i \xi_i \Bigg). \tag{19}
\]
Its dual form is
\[
\max_{\alpha_i \ge 0, \beta_i \ge 0} \ \min_{u, b} \
\Bigg( \frac{1}{2} u^T u + C \sum_{i=1}^{l} \xi_i
- \sum_{i=1}^{l} \alpha_i \big( y_i ( u^T \hat{f}_i + b ) - 1 + \xi_i \big)
- \sum_{i=1}^{l} \beta_i \xi_i \Bigg). \tag{20}
\]
Minimizing L in Eq. (20) with respect to u, b and ξ, i.e., setting the derivatives of L with respect to u, b and ξ to zero, we have
\[
\frac{\partial L}{\partial u} = u - \sum_{i=1}^{l} \alpha_i y_i \hat{f}_i = 0, \qquad
\frac{\partial L}{\partial b} = \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad
\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0. \tag{21}
\]


Plugging Eq. (21) back into Eq. (20) and after some simple deduction, we obtain the following dual optimization problem:
\[
\arg\max_{\alpha} \ G = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \hat{f}_i^T \hat{f}_j
\quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \ \ 0 \le \alpha_i \le C \ \ \text{for} \ i = 1, 2, \ldots, l. \tag{22}
\]
As in traditional linear SVM, the optimization problem in Eq. (22) can be solved in an effective way, although it is still a quadratic program [32].

In summary, when we fix {v_i}_{i=1}^{k}, the solution of MRMLSVM can be computed by solving the optimization problem in Eq. (22) directly. In other words, we can regard it as taking the data, represented by {f̂_i}_{i=1}^{l}, as the input of the original linear SVM and computing the corresponding projection vectors. In Section 4, we will show that when V is fixed, the derived results are the global solutions to the problem in Eq. (11). Besides, as seen from the above deduction, the dimensionality of f̂_i is mk, which is less than that of the vectorization of X_i, whose dimensionality is mn. Thus, the computational cost is also reduced.
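In code, the whitening of Eqs. (14)-(16) never needs the full mk×mk matrix D: since D^{−1/2} = (V^T V)^{−1/2} ⊗ I_m, the transformed feature of each sample is just the column-stacking of X_i V (V^T V)^{−1/2}. A small sketch (our naming; it assumes V^T V is well-conditioned) is shown below, and it is reused in the end-to-end sketch after Table 2.

```python
import numpy as np

def features_fixed_V(Xs, V):
    """Whitened features \\hat f_i of Eq. (16) for the SVM subproblem in U (Eq. (17))."""
    # (V^T V)^{-1/2} from the eigen-decomposition of the small k x k Gram matrix
    w, P = np.linalg.eigh(V.T @ V)
    S = P @ np.diag(w ** -0.5) @ P.T
    # \hat f_i = D^{-1/2} f_i = Vec(X_i V S): mk-dimensional instead of mn-dimensional
    return np.stack([(X @ V @ S).reshape(-1, order="F") for X in Xs]), S
```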

(2) Fixing U and optimizing V and b. Similarly, when U is fixed, we can also change the formulation of our algorithm in Eq. (11) and derive the optimal V and b by solving another linear SVM problem. More concretely, denote
\[
g_i = \begin{bmatrix} X_i^T u_1 \\ X_i^T u_2 \\ \vdots \\ X_i^T u_k \end{bmatrix} \in \mathbb{R}^{nk \times 1}, \qquad
\bar{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_k \end{bmatrix} \in \mathbb{R}^{nk \times 1}, \tag{23}
\]
\[
Q = (U^T U) \otimes I_{n \times n}. \tag{24}
\]
We can reformulate Eq. (11) in the following form:

\[
\arg\min_{\bar{v}, \xi, b} \ \frac{1}{2} \bar{v}^T Q \bar{v} + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i (\bar{v}^T g_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l. \tag{25}
\]
Denote
\[
v = Q^{1/2} \bar{v}, \qquad \hat{g}_i = Q^{-1/2} g_i. \tag{26}
\]
Then, Eq. (25) becomes

\[
\arg\min_{v, \xi, b} \ \frac{1}{2} v^T v + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad y_i (v^T \hat{g}_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l. \tag{27}
\]
The problem in Eq. (27) amounts to a standard SVM problem taking {ĝ_i} as the input. We can adopt the same method as before to derive the global optimum of this problem. The dimensionality of the input is nk, which is also less than mn. Due to the limitation of space, we omit the details.

In summary, as seen from the above procedure, when we fix one parameter and optimize the other, we can derive the solutions by employing traditional SVM on two different data sets with different formulations.

Several points should be highlighted here:

(1) The first is about initialization. As seen from the above procedure, MRMLSVM is solved in an iterative way; thus, initialization is vital for searching for the optimal solution. In the next section, we will provide three different kinds of initialization methods, i.e., fixed, uniform and norm. See more details there.

(2) The second is about the convergence behavior of our algorithm. As seen from the problem in Eq. (11), our formulation is not jointly convex with respect to U, V and b; thus, it is solved in an alternating way. In the next section, we will show that the objective function in Eq. (11) decreases through this kind of iteration. In other words, the iteration converges to a local optimum of the problem in Eq. (11). Our experiments also show that this kind of iteration converges fast; it often takes fewer than ten iterations.

(3) The third point is about the extension to tensors of any order. As seen from the above deduction, the extension to any order is direct, and the formulated problem can be solved in a similar way. For example, if the data are cubic, we can fix any two projection matrices and compute the remaining one.

(4) Although the finally computed projecting vectors are u and v, not U and V, they are just different arrangements of the same vectors. We can use Eqs. (14), (16), (23) and (26) to derive U and V directly. Besides, in the above deductions, b and ξ are updated twice in each iteration; we have not distinguished them when deriving U and V.

In summary, the procedure of MRMLSVM is listed in Table 2.

4. Performance analysis

In this section, we will analyze our proposed algorithm in four different aspects, including the convergence behavior, the initialization problem, the computational complexity and parameter determination.

4.1. Convergence analysis

As seen from the procedure in Table 2, we solve the optimization problem in an alternating way: we fix one variable and compute the other. When V is fixed, the following proposition shows that we can find the global optimum of the problem in Eq. (17); similarly, the results derived from the problem in Eq. (27) are also solutions to the problem in Eq. (11) when U is fixed.

Proposition 3. When V is fixed, the results U and b derived by solving the problem in Eq. (17) are the global solutions to the problem in Eq. (11). Similarly, when U is fixed, the results derived by solving the problem in Eq. (27) are also the global solutions to the problem in Eq. (11).

The proof is direct. Briefly, when V is fixed, the optimization problem in Eq. (11) is equivalent to the problem in Eq. (17). Considering the procedure of traditional SVM, we can derive its global optimum by solving the dual problem in Eq. (22). Thus, the solution to Eq. (17) is also the global optimum of the problem in Eq. (11), provided that V is fixed. Similarly, we can draw the same conclusion when U is fixed. Intuitively, when one parameter is fixed, we compress the matrices along one direction (rows or columns), and MRMLSVM finds the best decision boundary to classify the compressed vector data.

The above proposition only shows that the proposed algorithm can find a global optimum in each iteration. Based on this result, we present another proposition. It indicates that the iteration in Table 2 finds solutions which satisfy the constraints in Eq. (11) and monotonically decrease the objective function of the problem in Eq. (11).

Proposition 4. The iterative procedure shown in Table 2 finds solutions that satisfy the constraints in Eq. (11) and monotonically decreases the objective function of the problem in Eq. (11) in each iteration.

The proof is listed in the Appendix. This proposition indicates that we can find a locally optimal solution to the problem in Eq. (11). Besides, the final results are closely related to the initialization; if we initialize suitably, the derived solution will be near the optimal solution. We will discuss the initialization problem in the next section. Experimental results also show that this kind of iteration is effective: it converges fast, and the number of iterations is often less than 10.


Table 2. Procedure of MRMLSVM.

Training process:
Input: training matrix data set {X_i | i = 1, 2, ..., l}, labels {y_i | i = 1, 2, ..., l}.
Output: left and right projection matrices U, V and the bias b.
1. Initialize V by one of the three strategies shown in Section 4.2 and formulate f̂_i as in Eq. (16).
2. Alternately update u and v until convergence:
   a. Update u and b by solving the problem in Eq. (17), where D and f̂_i in Eqs. (14) and (16) are computed based on the latest V.
   b. Update v and b by solving the problem in Eq. (27), where Q and ĝ_i in Eqs. (24) and (26) are computed based on the latest U.
3. Reformulate u and v into U and V according to Eqs. (14), (16), (23) and (26).

Testing process:
Input: testing matrix data set {X_i | i = l+1, l+2, ..., l+t}; the projection matrices U, V and the bias b.
Output: the labels of the testing data {y_i | i = l+1, l+2, ..., l+t}.
1. Compute the decision value of X_i: Tr(U^T X_i V) + b.
2. Assign the label of X_i, i.e., y_i for i = l+1, l+2, ..., l+t:
   y_i = 1 if Tr(U^T X_i V) + b > 0, and y_i = −1 if Tr(U^T X_i V) + b < 0.
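Putting the pieces together, a compact end-to-end sketch of the procedure in Table 2 is given below, with scikit-learn's linear SVC as the inner solver. All names are ours, the 'Fixed' initialization is used, and the k×k Gram matrices are assumed to stay well-conditioned; this is an illustration of the alternating scheme, not the authors' reference implementation.

```python
import numpy as np
from sklearn.svm import SVC

def inv_sqrt(G):
    """G^{-1/2} for a small k x k symmetric positive-definite Gram matrix."""
    w, P = np.linalg.eigh(G)
    return P @ np.diag(w ** -0.5) @ P.T

def mrmlsvm_fit(Xs, ys, k=2, C=1.0, n_iter=10):
    m, n = Xs[0].shape
    V = np.vstack([np.eye(k), np.zeros((n - k, k))])          # 'Fixed' initialization
    for _ in range(n_iter):
        # Step a (Eq. (17)): fix V, train an SVM on \hat f_i and map its weights back to U
        S = inv_sqrt(V.T @ V)
        F = np.stack([(X @ V @ S).reshape(-1, order="F") for X in Xs])
        clf = SVC(kernel="linear", C=C).fit(F, ys)
        U = clf.coef_.ravel().reshape(m, k, order="F") @ S     # u_bar = D^{-1/2} u, reshaped
        # Step b (Eq. (27)): fix U, train an SVM on \hat g_i and map its weights back to V
        T = inv_sqrt(U.T @ U)
        G = np.stack([(X.T @ U @ T).reshape(-1, order="F") for X in Xs])
        clf = SVC(kernel="linear", C=C).fit(G, ys)
        V = clf.coef_.ravel().reshape(n, k, order="F") @ T
        b = clf.intercept_[0]
    return U, V, b

def mrmlsvm_predict(Xs, U, V, b):
    """Decision rule of Table 2: sign(Tr(U^T X V) + b)."""
    return np.array([np.sign(np.trace(U.T @ X @ V) + b) for X in Xs])
```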


4.2. Initialization

Since our method is solved in an alternating way, the final solution is closely related to the initialization. We introduce three different kinds of initialization strategies in this section.

The first initialization is an empirical method. Since the initialization of 2DLDA is R = [I_{l2×l2}, 0_{l2×(n−l2)}]^T, where l_2 is the second reduced dimensionality, we can initialize MRMLSVM in the same way, i.e., V = [I_{k×k}, 0_{k×(n−k)}]^T. For convenience, we call this kind of initialization 'Fixed' in the following. As seen from the following results, although this kind of initialization is simple, it is effective. In the following experiments, we use this kind of initialization unless otherwise specified.

The second kind of initialization is random. We generate random elements sampled from a uniform distribution (ranging from 0 to 1) to form matrices of the corresponding sizes. In other words, we form R ∈ R^{n×l2} for 2DLDA, v ∈ R^n for STM and V ∈ R^{n×k} for MRMLSVM from these elements. Intuitively, each column vector of the initialization matrix plays the role of weighting the column vectors of X. In the following, it is named 'Uniform' for convenience.

To show the influence of different sampling methods, we also generate a random matrix whose elements are sampled from another distribution, i.e., the standard normal distribution. It is named 'Norm' in the following. In the next section, we provide some experimental results comparing the different kinds of initializations; see more details in Section 5.3.

Finally, as mentioned above, we can also use the solution of STM to initialize MRMLSVM by simply setting v_1 to STM's solution and the others to values that are not all zeros. By using this kind of initialization, we can guarantee that the training error of MRMLSVM is no larger than that of STM. The objective function values obtained with this initialization are reported in Section 5.2.
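In code, the three strategies only differ in how the n×k matrix V is generated; a sketch (with our own function name) is:

```python
import numpy as np

def init_V(n, k, strategy="fixed", seed=0):
    rng = np.random.default_rng(seed)
    if strategy == "fixed":       # V = [I_{k x k}, 0_{k x (n-k)}]^T
        return np.vstack([np.eye(k), np.zeros((n - k, k))])
    if strategy == "uniform":     # entries sampled from U(0, 1)
        return rng.uniform(0.0, 1.0, size=(n, k))
    if strategy == "norm":        # entries sampled from the standard normal distribution
        return rng.standard_normal((n, k))
    raise ValueError(f"unknown strategy: {strategy}")
```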

4.3. Computational complexity

In this section, we compare the computational complexity of different methods, including linear SVM, LDA, 2DLDA, STM and MRMLSVM. Since different implementations of the same method may cost different amounts of time, we only give a general analysis.

The first group of methods contains LDA and 2DLDA. Considering the procedure of 2DLDA in Section 2.2, we can see that the most time-consuming step is the eigen-decomposition. If the dimensionality of the original data is D, its computational complexity is about O(D³). Thus, the computational complexity of traditional LDA is O(m³n³). Since 2DLDA iteratively solves two eigen-decomposition problems of sizes m and n, it has computational complexity O(s(m³ + n³)), where s is the number of iterations.

The second group of methods consists of SVM, STM and MRMLSVM. The most time-consuming step of these methods is the formulated QP problem in Eq. (22). Since different implementations of the same method may cost different amounts of time, we consider one famous implementation, LibSVM [9], whose computational complexity is about O(slD). Here D is again the dimensionality of the input, l is the number of training points and s is the number of iterations. When SVM is employed, its computational complexity is O(slmn). By contrast, since MRMLSVM is solved by alternating between two different SVMs, its computational complexity is O(sl(m+n)k). Since STM is a special case of MRMLSVM with k = 1, its computational complexity is O(sl(m+n)).

Besides, in the implementation of MRMLSVM, it still costs a lot of time to compute the inverse and square root of the matrices D and Q in Eqs. (16) and (26). Recalling the above deduction, their dimensionalities are mk×mk and nk×nk respectively. Noting the formulations of D and Q in Eqs. (14) and (24), we can see that they are both Kronecker products of two positive semi-definite matrices. To compute their inverses and square roots in a more effective way, we use the following results.

Lemma 2. Suppose the SVD decompositions of A and B are A = U_A Σ_A V_A^T and B = U_B Σ_B V_B^T. Then
\[
A \otimes B = (U_A \otimes U_B)(\Sigma_A \otimes \Sigma_B)(V_A \otimes V_B)^T. \tag{28}
\]
We omit the proof; see more details in [34]. Based on the above lemma, we can obtain the following proposition, which facilitates the computation of D^i and Q^i for any i.

Proposition 5. Suppose A and B are positive semi-definite and i is an integer. Then
\[
(A \otimes B)^i = A^i \otimes B^i. \tag{29}
\]
The proof is listed in the Appendix. Based on the above proposition, we can decompose the inverse computation of a large matrix D (a Kronecker product of dimensionality mk×mk) into the inverse computation of the small matrix V^T V (of dimensionality k×k). Moreover, in the above iteration, s is less than 10 and k is far less than min{m, n}. Thus, the computational complexity of MRMLSVM is smaller than that of SVM, especially when the scale of the matrices is large. We will show some experimental results in Section 5.5.
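Proposition 5 is what makes the whitening cheap in practice: rather than inverting the mk×mk matrix D, one inverts the k×k block and lifts it back with a Kronecker product. A quick numerical check (our naming) is:

```python
import numpy as np

m, k = 6, 2
V = np.random.randn(5, k)
A = V.T @ V                                      # small k x k positive semi-definite block
D = np.kron(A, np.eye(m))                        # D = (V^T V) (x) I_{m x m}, size mk x mk

# Proposition 5 with i = -1: invert only the k x k block and lift it back
D_inv_cheap = np.kron(np.linalg.inv(A), np.eye(m))
assert np.allclose(D_inv_cheap, np.linalg.inv(D))
```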

4.4. Parameter determination

In this section, we discuss the problem of parameter determination. As seen from Eq. (11), there is one important parameter k in our method: the rank of the regression.

Recalling the essence of k discussed in Section 3.3, we can regard it as a parameter that balances the capacity of learning and the capacity of generalization. A larger k indicates a stronger learning capacity and a weaker generalization capacity, and vice versa. Based on the basic rule of learning theory, these two capacities cannot be improved simultaneously. Thus, k should be neither too large nor too small in our implementation. We will show some numerical results concerning the objective function and the classification accuracy for different k in Section 5.6.

Since parameter determination is still an open problem, we determine k heuristically and empirically in this paper. One direct way is grid search, as in most unsupervised learning: we vary this parameter within a certain range and choose the value with the best performance. Another way is cross validation, as in most supervised learning: we split all the data into several portions, use part of them for training and the rest for testing, and select the parameter with the best classification accuracy.

Although the above mentioned strategies are effective, they are very time consuming for real applications. As seen from the experimental results shown in Figs. 5 and 6 in Section 5.6, when k = 2, the classification accuracy of MRMLSVM is near its peak. Thus, we empirically choose k = 2 in our following experiments.
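A cross-validated grid search over k can be sketched as follows; mrmlsvm_fit and mrmlsvm_predict refer to the illustrative functions sketched after Table 2 (they are our stand-ins, not released code), and the fold setup is only an example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def select_k(Xs, ys, k_grid=(1, 2, 3, 4), C=1.0, n_splits=5):
    Xs, ys = np.asarray(Xs), np.asarray(ys)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_k, best_acc = None, -1.0
    for k in k_grid:
        accs = []
        for tr, te in skf.split(np.zeros((len(ys), 1)), ys):   # indices only; stratified by label
            U, V, b = mrmlsvm_fit(Xs[tr], ys[tr], k=k, C=C)
            accs.append(np.mean(mrmlsvm_predict(Xs[te], U, V, b) == ys[te]))
        if np.mean(accs) > best_acc:
            best_k, best_acc = k, float(np.mean(accs))
    return best_k, best_acc
```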


5. Experiments

In this section, we evaluate our method in five different aspects. The first is the convergence behavior. The second is the comparison of classification accuracy, including results on binary classification and on the multi-class scenario; in the multi-class scenario, we classify the data via the one-vs-rest strategy as in traditional methods. We show the influence of different initializations in the third group of experiments. In the fourth group, we compare the actual running time of different methods. Finally, experimental results with different values of the parameter k are presented.

Since SVM, 2DLDA and STM are closely related to MRMLSVM, we compare the performance of our method with them. Additionally, we also provide the results of LDA, since 2DLDA can be regarded as its two-dimensional extension. Nevertheless, LDA and 2DLDA cannot be used for classification directly; we use them for projection and employ another classifier, such as SVM, for classification. They are denoted as LDASVM and 2DLDASVM in the following presentation. We also present the results of the K-nearest neighbors classifier (KNN) as the baseline, where K = 1. Besides, we also compare with methods that take spatial information into account. The first method is S-LDA [14]; we use it as a dimensionality reduction approach and employ SVM for classification. It is named S-LDASVM. In the second approach, we simply add the spatial matrix of [14], i.e., Δ^T Δ, as a regularizer in SVM. We name it spatial regularized SVM (SRSVM), and its objective function is
\[
\arg\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i + \alpha\, w^T \Delta^T \Delta\, w
\quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \text{for} \ i = 1, 2, \ldots, l, \tag{30}
\]
where α is a balance parameter. To solve this problem, we transform it into a formulation which can be handled by linear SVM. Moreover, we use the first kind of initialization strategy. The dimensionality of the subspace is set to c − 1 in LDA and 2DLDA, as in traditional approaches, where c is the number of classes. Unless otherwise specified, we set k = 2, and C and α are determined by five-fold cross validation.
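One way to realize this transformation is the same whitening trick used for Eq. (17): since (1/2)w^T w + αw^T Δ^T Δ w = (1/2)w^T (I + 2αΔ^T Δ)w, mapping w̃ = (I + 2αΔ^T Δ)^{1/2}w and x̃_i = (I + 2αΔ^T Δ)^{−1/2}x_i turns Eq. (30) into a standard linear SVM. The sketch below is our own illustration of that idea (Δ stands for whatever spatial difference operator is adopted from [14]); it is only practical when mn is small, since it eigen-decomposes an mn×mn matrix.

```python
import numpy as np
from sklearn.svm import SVC

def srsvm_fit(x_vec, ys, Delta, C=1.0, alpha=0.1):
    """Whiten Eq. (30) with M = I + 2*alpha*Delta^T Delta and call a standard linear SVM."""
    d = x_vec.shape[1]
    M = np.eye(d) + 2.0 * alpha * (Delta.T @ Delta)        # symmetric positive definite
    w_eig, P = np.linalg.eigh(M)
    M_inv_half = P @ np.diag(w_eig ** -0.5) @ P.T          # M^{-1/2}
    clf = SVC(kernel="linear", C=C).fit(x_vec @ M_inv_half, ys)
    w = M_inv_half @ clf.coef_.ravel()                     # map back: w = M^{-1/2} w_tilde
    return w, clf.intercept_[0]
```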

5.1. Data description

In the following experiments, we use six different kinds of matrix data sets for evaluation: the Coil data set,1 the Pedestrian data set,2 the Binucleate data set,3 the Umist data set,4 the FingerDB data set5 and the Pollen data set.6 These matrix data sets are selected from different applications. The data size ranges from about 50 to 2000 and the data scale ranges from 25×25 to 1204×1280; namely, the dimensionality of a vectorized sample ranges from about 600 to 1,000,000. After some preprocessing, we select part of them for illustration, and the detailed statistics are listed in Table 3.

1 http://www1.cs.columbia.edu/CAVE/research/softlib/coil-20.html
2 http://www.lookingatpeople.com/download-daimler-ped-mcue-occl-class-benchmark/index.html
3 http://ome.grc.nia.nih.gov/iicbu2008/binucleate/index.html
4 http://images.ee.umist.ac.uk/danny/database.html
5 http://bias.csr.unibo.it/fvc2000/databases.asp
6 http://ome.grc.nia.nih.gov/iicbu2008/pollen/index.html

5.2. Convergence behavior

In order to show the convergence behavior and compare the objective functions of different methods, we provide some numerical results on three data sets. We focus on the training error in the binary classification task, choose three data sets, i.e., the Pedestrian data, the Umist data (1 vs 18) and the Pollen data (2 vs 7), as training points, and report their objective function values with different initializations. The objective function values over 20 iterations are shown in Fig. 1. In the top and bottom panels, STM and MRMLSVM are initialized using the 'Fixed' and 'Norm' strategies respectively. Besides, we also show the objective function values of SVM. When we use STM's results to initialize MRMLSVM, the results are denoted by 'MRMLSVM+STM'. In these experiments, we simply set k = 2 in MRMLSVM.

There are mainly two observations from the results in Fig. 1:

(1) With the increase of the number of iterations, the function values of STM and MRMLSVM converge to fixed points. This indicates that the above solving strategy converges, and it also validates the theoretical result shown in Proposition 4.

(2) As seen from all the sub-figures, SVM's objective function value is the smallest. If we use STM to initialize MRMLSVM, the objective function value of MRMLSVM is smaller than that of STM. Besides, if we use the same initialization strategy, the objective function value of MRMLSVM is also smaller than that of STM. This also validates our previous statement that k is a parameter that orders the objective function values.

5.3. Classification results

In the binary classification task, every two classes are combined and part of the results are presented. For illustration, we report two results (1 vs 18, 4 vs 14) on the Umist data, one comparison (1 vs 8) on the FingerDB data and one comparison (2 vs 7) on the Pollen data. The Pedestrian data and the Binucleate data are also employed for binary classification. By randomly splitting the original data into a training set and a testing set for 50 independent runs, we select 2, 3, ..., 11 samples from each category, with the rest as testing samples, for the Binucleate, Umist and Pollen data sets. Since Pedestrian has many samples and FingerDB has few points per category, we select 2, 4, ..., 20 and 2, 3, ..., 7 training samples per category from the Pedestrian and FingerDB data sets respectively. For the different data sets and different numbers of training samples, the mean classification accuracies over 50 independent runs are shown in Fig. 2: (a) Pedestrian data, (b) Binucleate data, (c) Umist data (1 vs 18), (d) Umist (4 vs 14), (e) FingerDB data (1 vs 8) and (f) Pollen data (2 vs 7). Note that, since the matrix Δ^T Δ in SRSVM and S-LDASVM is mn×mn dimensional, these methods cannot be implemented on the Binucleate and FingerDB data sets; we have not reported their results.

Table 3. Characters of different data sets.

Data        Size (l+t)   Scale (m×n)   Class number
Coil        1440         32×32         20
Pedestrian  2000         38×18         2
Binucleate  40           1204×1280     2
Umist       575          28×23         20
FingerDB    80           300×300       10
Pollen      630          25×25         7

Fig. 1. Objective function values on three data sets with different initializations. The x-axis represents the number of iterations. MRMLSVM+STM means that we used the results of STM to initialize MRMLSVM. (a) Pedestrian data with 'Fixed' initialization; (b) Umist data (1 vs 18) with 'Fixed' initialization; (c) Pollen data (2 vs 7) with 'Fixed' initialization; (d) Pedestrian data with 'Norm' initialization; (e) Umist data (1 vs 18) with 'Norm' initialization; (f) Pollen data (2 vs 7) with 'Norm' initialization.

To classify data from multiple classes, we choose four representative data sets, i.e., the Coil, Umist, FingerDB and Pollen data sets. As seen from the description in Table 3, they are data sets with multiple classes. In each category, we randomly selected 2, 3, ..., 11 training samples for the Coil, Umist and Pollen data and 2, 3, ..., 7 for the FingerDB data. The means and standard deviations computed over 50 random splits are shown in Tables 4-7. The results with statistical significance are boldfaced. For the same reason as above, the results of SRSVM and S-LDASVM on FingerDB are not reported either.

There are several observations from the above comparisons, as follows:

(1) Among different methods and different data sets, MRMLSVM achieves the highest accuracy in most cases. This is mainly due to the fact that MRMLSVM inherits the merits of SVM, STM and spatial regularization methods.

(2) With the increase of the number of training points, all methods achieve higher accuracies. This is consistent with intuition, since we have more prior information for training.



(3) By adding the spatial smoothness regularization, the improved methods perform better than their original counterparts in most cases. This may be due to the fact that the spatial information has been characterized, while the traditional methods can be regarded as their special cases.

(4) For classification, 2D based methods do not always perform better than 1D based methods. Taking the results in Table 6 as an example, LDASVM performs better than 2DLDASVM in most cases. The reason may be that the added constraints in 2DLDASVM degrade the performance.

(5) Representing the original data by its low-dimensional embedding is not always helpful for classification. Taking the results in Table 6 as an example, compared with classifying the embeddings derived by LDA and 2DLDA, SVM achieves higher accuracy when it is used to classify the original data directly.

Fig. 2. Classification results of different methods on six different data sets with different numbers of training points. The number of training points varies from 4 to 22. (a) Pedestrian data; (b) Binucleate data; (c) Umist data (1 vs 18); (d) Umist (4 vs 14); (e) FingerDB data (1 vs 8); (f) Pollen data (2 vs 7).

(6) When the dimensionality of the original data is high, margin based methods, such as SVM, seem to perform better than similarity based methods, such as KNN. This is mainly because the distance between any two high-dimensional data points tends to be similar [13], and the performance of similarity based methods, such as KNN, degrades in this scenario.

5.4. Different initializations

In this section, we show the influence of different initializations. In Section 4.2 we have discussed three kinds of initializations. Briefly, the first is 'Fixed', which uses fixed values. The second is 'Uniform', which initializes randomly with a uniform distribution. The third is 'Norm', whose initialization matrix consists of elements sampled from a standard normal distribution.
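The three strategies can be written down in a few lines; this is an illustrative sketch, and the constant used for 'Fixed' (1 here) and the helper name init_V are assumptions, since the text does not specify the fixed value.

import numpy as np

def init_V(n, k, strategy, seed=0):
    # Build an initial right projection matrix V0 of size n x k for MRMLSVM.
    rng = np.random.default_rng(seed)
    if strategy == "fixed":      # 'Fixed': every entry set to the same constant (1 is assumed here)
        return np.ones((n, k))
    if strategy == "uniform":    # 'Uniform': entries drawn from a uniform distribution on [0, 1)
        return rng.uniform(size=(n, k))
    return rng.standard_normal((n, k))   # 'Norm': entries drawn from a standard normal distribution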


Table 4
Classification accuracy of different methods on Coil data (mean±std/%).

Training no.  KNN          SVM          SRSVM        LDASVM       2DLDASVM     S-LDASVM     STM          MRMLSVM
40            71.80±1.30   72.18±1.38   73.95±1.33   70.60±1.64   68.04±2.97   70.68±1.83   73.27±2.02   75.09±1.74
60            77.17±1.14   79.75±1.12   79.79±1.14   74.25±1.35   71.03±2.74   75.98±1.23   79.36±1.84   82.54±1.58
80            81.03±0.82   83.57±1.25   84.99±1.26   78.54±1.54   79.24±2.37   79.71±1.30   83.80±1.52   86.78±1.31
100           82.60±0.99   85.95±1.25   86.43±1.29   80.37±1.26   82.66±2.19   81.37±1.95   86.24±1.42   89.08±1.13
120           85.10±0.81   88.95±0.89   89.57±0.89   82.77±1.24   86.86±1.58   83.87±1.18   88.56±1.20   91.79±1.03
140           86.51±0.89   90.93±1.21   91.90±1.19   83.91±1.16   88.95±1.47   85.07±1.24   90.09±1.31   92.88±1.13
160           87.66±0.82   92.04±0.93   93.09±0.93   84.86±1.00   90.41±1.32   85.94±1.18   91.99±0.87   94.06±0.75
180           88.87±0.66   93.13±0.77   94.07±0.76   86.33±0.81   92.17±0.85   87.24±0.81   92.31±0.85   94.88±0.73
200           89.70±0.76   93.88±0.92   95.20±0.96   86.85±0.95   92.99±0.98   88.15±1.07   92.93±0.72   95.30±0.79
220           90.96±0.49   95.24±0.49   95.28±0.46   88.03±0.68   94.06±0.81   88.72±0.53   93.99±0.50   96.23±0.43

Table 5
Classification accuracy of different methods on Umist data (mean±std/%).

Training no.  KNN          SVM          SRSVM        LDASVM       2DLDASVM     S-LDASVM     STM          MRMLSVM
40            60.72±1.58   64.55±1.53   66.03±1.48   65.87±1.32   65.57±1.55   65.95±1.55   65.98±1.37   68.32±1.27
60            69.88±1.15   75.97±1.14   77.76±1.14   77.62±1.24   78.61±1.39   77.87±1.04   77.70±1.05   79.06±0.97
80            78.36±1.04   84.94±0.93   84.99±0.95   85.39±0.95   85.20±1.08   85.39±0.92   84.49±1.08   87.21±1.00
100           82.56±1.04   88.82±0.88   89.68±0.89   88.17±0.98   88.32±0.96   88.65±0.80   87.68±0.84   90.85±0.78
120           87.95±0.86   92.71±0.88   92.75±0.86   91.82±0.88   91.45±0.74   91.82±0.90   91.12±0.74   93.92±0.69
140           90.92±0.77   93.41±0.59   93.68±0.57   93.27±1.04   93.85±0.65   93.32±0.55   93.54±0.70   95.70±0.65
160           92.22±0.75   94.68±0.57   94.87±0.58   94.43±0.81   94.10±0.51   94.94±0.50   94.54±0.64   96.48±0.59
180           93.96±0.64   95.63±0.44   96.24±0.43   95.15±0.55   95.58±0.46   95.86±0.49   96.01±0.47   97.07±0.44
200           95.32±0.55   96.18±0.52   97.01±0.55   95.84±0.55   95.92±0.58   96.74±0.52   96.56±0.46   97.64±0.43
220           95.94±0.62   96.50±0.47   97.60±0.50   96.23±0.67   96.22±0.43   97.34±0.45   96.71±0.45   98.01±0.43

Table 6
Classification accuracy of different methods on FingerDB data (mean±std/%).

Training no.  KNN          SVM          LDASVM       2DLDASVM     STM          MRMLSVM
20            10.02±1.34   49.83±2.39   44.16±2.77   34.75±2.98   50.89±2.05   54.91±2.00
30            10.10±1.05   60.52±2.28   45.40±2.79   37.30±2.68   59.40±2.03   64.30±2.00
40            10.38±1.11   64.25±2.31   48.25±2.27   40.25±2.34   66.50±1.97   70.50±1.92
50            10.67±1.18   69.00±2.38   51.50±2.12   45.50±2.05   69.83±1.92   78.50±1.87
60            11.00±1.31   77.50±1.70   51.75±1.81   46.50±1.62   73.50±1.72   82.00±1.68
70            11.36±1.04   78.00±1.80   57.00±1.62   49.50±1.58   78.45±1.63   83.50±1.59

Table 7
Classification accuracy of different methods on Pollen data (mean±std/%).

Training no.  KNN          SVM          SRSVM        LDASVM       2DLDASVM     S-LDASVM     STM          MRMLSVM
14            21.70±2.36   37.59±2.00   39.88±1.87   32.18±1.88   24.12±2.06   41.01±2.14   40.18±1.68   41.05±1.87
21            21.88±2.18   42.40±2.14   43.07±1.95   37.14±1.88   28.41±1.75   44.35±2.27   43.62±1.58   44.86±1.81
28            22.37±1.96   44.97±1.93   46.64±1.91   39.26±1.58   32.72±1.94   46.04±1.94   45.24±1.53   47.29±1.75
35            23.16±1.88   46.65±1.49   49.77±1.46   41.76±1.54   35.00±1.68   48.47±2.15   46.73±1.50   49.73±1.72
42            23.57±1.86   47.19±1.33   50.45±1.26   42.68±1.49   39.93±1.66   49.24±1.60   49.20±1.24   50.85±1.42
49            23.64±1.76   50.81±1.25   52.74±1.29   44.27±1.43   41.08±1.63   49.97±1.45   51.01±1.20   52.90±1.37
56            24.11±1.74   51.42±1.12   53.27±1.17   44.95±1.45   44.87±1.56   50.62±1.22   51.54±1.11   53.99±1.27
63            24.74±1.67   53.32±1.11   55.47±1.09   45.32±1.34   47.66±1.56   50.75±1.18   54.42±1.06   55.61±1.22
70            24.83±1.72   54.05±1.09   56.20±0.95   45.79±1.38   48.52±1.48   50.84±1.19   54.50±0.91   56.52±1.04
77            25.37±1.62   55.00±1.06   57.06±1.01   46.50±1.32   51.16±1.43   50.91±1.11   55.09±0.90   57.21±1.03


We have conducted experiments on two data sets, i.e., Umist and Pollen. With fixed training points (two for each category) and testing points, we randomly initialize our method for 50 runs. Since 2DLDA and STM are also solved in an alternating way, their results are also presented. The other settings are the same as before; for example, we also set k = 2. The experimental results, which are the average classification accuracies of the 50 runs, are shown in Fig. 3.

As seen from the results in Fig. 3, all methods perform differently with different initializations. Nevertheless, the variance across different initializations is not very significant. Thus, we have used the 'Uniform' initialization strategy in the other experiments. Besides, different initializations have different influences on different data sets. For example, the variance of 2DLDASVM is larger on Pollen data than on Umist data. More importantly, as shown in Fig. 3, MRMLSVM achieves the highest classification accuracy in almost all cases, which also demonstrates the effectiveness of our method.


Table 9
Computational time of different methods with different number of training points on FingerDB data (mean±std). Note that LDASVM cannot be implemented on this data.

Training no.  SVM              2DLDASVM          STM              MRMLSVM
20            9.6719±0.1597    35.6031±0.0210    3.6728±0.4907    8.9375±0.2294
30            10.0719±0.2075   42.0625±0.0291    3.7321±0.7846    9.0500±0.2318
40            10.2906±0.1975   51.4219±0.0269    3.7897±0.4190    9.3812±0.2203
50            10.0313±0.1293   58.1313±0.0259    4.1199±0.3414    9.3875±0.2204
60            11.8875±0.1920   65.2156±0.0257    4.3824±0.3257    9.7000±0.1755
70            12.3125±0.1065   72.3719±0.0211    4.4481±0.8383    10.2000±0.1809

Table 10
Computational time of different methods with different number of training points on Binucleate data (mean±std). Note that LDASVM cannot be implemented on this data.

Training no.  SVM              2DLDASVM         STM              MRMLSVM
4             3.5469±0.7562    1079±13.2386     0.4011±0.0364    0.8854±0.0203
6             6.9531±0.3461    1114±3.9416      0.5330±0.0345    1.2185±0.0287
8             7.4687±0.2923    1164±2.9812      0.6340±0.0391    1.5312±0.0203
10            8.0356±0.2413    1172±1.7536      0.7624±0.0311    1.8645±0.0204
12            8.6406±0.1445    1227±1.4561      0.8933±0.0361    2.2083±0.0212


5.5. Computational complexity

Commonly, another motivation for investigating tensor based methods is reducing the computational complexity of the original methods in manipulating vectorized high-dimensional data. For comparison, we show some experimental results on several data sets with different sizes and scales.

We have selected three representative data sets, i.e., Pedestrian, FingerDB and Binucleate. The Pedestrian data has the largest data size and Binucleate has the highest resolution. We compare MRMLSVM with SVM, SRSVM, LDASVM, 2DLDASVM, S-LDASVM and STM, and use LibSVM for the implementation of SVM. For fairness, these methods are implemented in their original formulations, without using other accelerating strategies. When the number of training points is fixed, we randomly select them for 50 independent runs. With a naive MATLAB implementation, the calculations are conducted on a 3.2-GHz, 4 GB RAM Windows machine. The computational time of different methods is listed in Tables 8–10. Since LDASVM, SRSVM and S-LDASVM cannot be implemented on the FingerDB and Binucleate data sets, we only report their results on Pedestrian data. The results with statistical significance are also boldfaced.

As seen from the results in Tables 8–10, we analyze the influence of different factors. (1) When the scale of the matrix data is small and the data size is large, MRMLSVM costs a similar amount of time to SVM and consumes less time than LDASVM and 2DLDASVM. With the increase of the data scale, the superiority of two-dimensional methods emerges. For example, LDASVM, SRSVM and S-LDASVM cannot be implemented on FingerDB, while 2DLDASVM still works. Compared with SVM, MRMLSVM is more suitable for dealing with large scale matrix data. Certainly, STM costs the least time. (2) The computational complexity of different methods is dominated by different factors. For example, LDASVM and 2DLDASVM are very sensitive to the data scale; it is the key factor dominating their computational complexities. Compared with the dimensionality, the influence of the data size is not so significant. (3) The methods with spatial regularization often cost more time than their original counterparts. This may be caused by the fact that we also need some time to compute the regularization matrix.

Fig. 3. The classification accuracy of 2DLDA, STM and MRMLSVM with different kinds of initialization, i.e., the fixed, uniform and norm strategies. The x-axis represents different methods and the y-axis is the classification accuracy. (a) Umist data; (b) Pollen data.

Table 8
Computational time of different methods with different number of training points on Pedestrian data (mean±std).

Training no.  SVM              SRSVM            LDASVM           2DLDASVM         S-LDASVM         STM              MRMLSVM
200           0.3484±0.0338    0.7953±0.0354    2.2234±0.0589    0.6438±0.0264    2.3172±0.2291    0.1910±0.0877    0.4531±0.0516
600           0.8172±0.0478    1.3672±0.0488    2.3359±0.0467    1.9859±0.1852    2.4078±0.1884    0.3977±0.0249    1.0281±0.1346
1000          1.4047±0.0627    1.8891±0.0601    2.4281±0.0331    3.1516±0.4558    2.5531±0.1169    0.5986±0.0284    1.9813±0.0628
1400          2.2594±0.0715    2.7344±0.0667    2.5656±0.0395    4.2516±0.4382    2.6313±0.1423    1.0795±0.1062    3.8438±0.2300
1800          2.9922±0.0773    3.4016±0.0749    2.6703±0.0379    5.4188±0.7376    2.7360±0.1393    1.4793±0.0498    5.2156±0.1212



5.6. Parameter determination

As seen from the discussion in Sections 3.2 and 4.4, we know that k plays an essential role in the proposed method. In this section, we would like to show its influence in three aspects.

In the first group of experiments, we show the influence of k on the objective function values. As mentioned in Proposition 2, the objective function value of MRMLSVM should decrease monotonically with respect to k, provided that we initialize it in a suitable way. For demonstration, we conduct experiments on three data sets, including Pedestrian, Umist and Pollen. The parameter k is varied from 1 to 10. We choose 2 and 10 training points from each category. With the 'Fixed' initialization, each experiment is repeated for 50 independent runs and the average objective function values are shown in Fig. 4.

Fig. 4. The objective function of MRMLSVM with different k. The x-axis represents the selected k and the y-axis is the objective function. The lines with different colors represent different numbers of training points. (a) Pedestrian data; (b) Umist data; (c) Pollen data. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. The classification accuracy of MRMLSVM with different k. The x-axis represents the selected k and the y-axis is the classification accuracy. The lines with different colors represent different numbers of training points. (a) Pedestrian data; (b) Umist data; (c) FingerDB data; (d) Pollen data. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


As seen from the results in Fig. 4, with the increase of k, the average objective function values decrease monotonically. This validates the role of k in dominating the training error.
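A sweep of this kind can be sketched by reusing the hypothetical helpers introduced earlier (fit_mrmlsvm and init_V); this is an illustration of the experimental protocol, not the authors' scripts.

# X: (l, m, n) array of training matrices, y: labels in {-1, +1}, n: number of matrix columns.
final_objs = []
for k in range(1, 11):
    V0 = init_V(n, k, "fixed")                         # 'Fixed' initialization, as in this experiment
    _, _, _, objs = fit_mrmlsvm(X, y, V0, C=1.0, n_iter=20)
    final_objs.append(objs[-1])                        # by Proposition 2, should not increase with k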



Fig. 6. The classification accuracy of MRMLSVM with different k and initialization. The x-axis represents the selected k and the y-axis is the classification accuracy. The lines with different colors represent different kinds of initialization. (a) Pedestrian data; (b) Umist data; (c) FingerDB data; (d) Pollen data. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


In the second group of experiments, we show the influence of k on the classification accuracy. There are four different data sets in total, including Pedestrian, Umist, FingerDB and Pollen. We vary k from 1 to 10 and conduct experiments using the 'Fixed' initialization with different numbers of training points. After 50 independent splits into training and testing points, we provide the average classification accuracy of MRMLSVM in Fig. 5.

As shown in Fig. 5, with the increase of k, the feasible region is expanded. Nevertheless, the classification accuracy does not always increase consistently. This is due to the fact that k is a parameter that balances the training error against the extent of overfitting avoidance. These two factors are important in dominating the performance of a learning machine; SVM and STM can be regarded as two extreme cases. Additionally, these results also validate the effectiveness of employing multiple transformation vectors.

In the third group of experiments, we show whether the superior performance of MRMLSVM depends on the combination of k and the initialization strategy. We employ the same data sets as in the previous experiments. With different k and different initialization strategies, we report the average classification accuracy by randomly selecting 10 points in each category as training samples and assigning the others to the testing set. With 50 independent runs, the average classification accuracies are shown in Fig. 6.

As seen from the results in Fig. 6, we have the following intuitions. Compared with the influence of the initialization strategy, k plays a more important role in dominating the final classification accuracy. The reasons may be that (1) k is a parameter that determines the learning capacity of our method and (2) when k is suitable, our solving strategy tends to find a good local optimum, no matter which initialization strategy we select.

6. Conclusion

In this paper, we have proposed an efficient multiple rank multi-linear SVM, i.e., MRMLSVM, for matrix data classification. Different from linear SVM, which reformulates matrix data into a vector, and STM, which only uses one left and one right projection vector, we have used several left projecting vectors and the same number of right projecting vectors to formulate the objective function and construct the constraints. The convergence behavior, initialization strategy, computational complexity and parameter determination problems were also analyzed. Plenty of experimental results have been provided for illustration. Further research includes the extension of MRMLSVM to nonlinear cases. We will also focus on accelerating our algorithm.

Conflict of interest

None declared.

Acknowledgments

Supported by 973 Program (2013CB329503) and NSFC (Grant No. 91120301, 61005003, 61075004 and 61021063).

Appendix A

Proof of Proposition 1. We prove it from two directions. On one hand, for any $mn$-dimensional vector $z$, its matrix form is denoted as $Z \in \mathbb{R}^{m \times n}$. Since $\mathrm{rank}(Z) \le \min(m, n)$, there exist two matrices $U = [u_1, u_2, \ldots, u_k]$ and $V = [v_1, v_2, \ldots, v_k]$ such that $Z = UV^{T}$. In other words, for any $mn$-dimensional vector $z$, we can find $k$ couples of vectors such that $Z = \sum_{i=1}^{k} u_i v_i^{T}$, or equivalently, $z = \mathrm{Vec}\big(\sum_{i=1}^{k} u_i v_i^{T}\big)$.

On the other hand, for any two groups of vectors, denoted as $\{u_1, u_2, \ldots, u_k\}$ and $\{v_1, v_2, \ldots, v_k\}$, $\mathrm{Vec}\big(\sum_{i=1}^{k} u_i v_i^{T}\big) \in \mathbb{R}^{mn}$. Combining the above two results, we get the conclusion. □
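The factorization argument can be checked numerically; the following numpy snippet is an illustration added for this rewrite, using the SVD to produce the k = min(m, n) pairs of vectors.

import numpy as np

m, n = 6, 4
rng = np.random.default_rng(2)
z = rng.standard_normal(m * n)                   # any mn-dimensional vector
Z = z.reshape(m, n)                              # its matrix form Z
P, s, Qt = np.linalg.svd(Z, full_matrices=False) # rank(Z) <= min(m, n) = k
U = P * s                                        # k left vectors u_i (scaled by singular values)
V = Qt.T                                         # k right vectors v_i
assert np.allclose(Z, U @ V.T)                   # Z = sum_i u_i v_i^T, hence z = Vec(sum_i u_i v_i^T)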

Proof of Proposition 2. Assume that the objective function of STM with the projecting vectors $u$ and $v$ is $f(u, v)$. Correspondingly, the objective function of MRMLSVM with projecting vectors $\{u_1, u_2, \ldots, u_k\}$ and $\{v_1, v_2, \ldots, v_k\}$ is $g(u_1, u_2, \ldots, u_k; v_1, v_2, \ldots, v_k)$. Denote the optimal value of STM as $f(u^{*}, v^{*})$. Note that

$f(u^{*}, v^{*}) = g(u^{*}, 0, \ldots, 0; v^{*}, v_2, \ldots, v_k).$

As seen from the results in Proposition 3, in each iteration the objective function of MRMLSVM does not increase. Thus, the optimal function value of MRMLSVM is no larger than that of STM. □

Proof of Corollary 1. The first equation is obvious and we would like to prove the second one. Based on Lemma 1, we have

$\mathrm{Tr}(U^{T} U V^{T} V) = (\mathrm{Vec}(U))^{T}\,\mathrm{Vec}(U V^{T} V) = (\mathrm{Vec}(U))^{T}\,\mathrm{Vec}(I_{m \times m} U V^{T} V) = (\mathrm{Vec}(U))^{T} \big((V^{T} V) \otimes I_{m \times m}\big)\,\mathrm{Vec}(U).$

Thus, the result follows. □
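The identity is easy to verify numerically; the snippet below is an illustration added for this rewrite, and it assumes the column-major Vec convention under which Vec(AXB) = (B^T ⊗ A) Vec(X).

import numpy as np

m, n, k = 5, 4, 2
rng = np.random.default_rng(0)
U = rng.standard_normal((m, k))
V = rng.standard_normal((n, k))
G = V.T @ V
lhs = np.trace(U.T @ U @ G)                      # Tr(U^T U V^T V)
vecU = U.flatten(order="F")                      # column-major Vec of U
rhs = vecU @ np.kron(G, np.eye(m)) @ vecU        # Vec(U)^T ((V^T V) kron I_m) Vec(U)
assert np.allclose(lhs, rhs)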

Proof of Proposition 4. Assume that we have derived $U^{(s)}$ in the $s$-th iteration. We now update $V$, $\xi$ and $b$. The following result holds:

$\{V^{(s)}, \xi^{(s)}, b^{(s)}\} = \arg\min_{V, b, \xi} \; \frac{1}{2}\mathrm{Tr}\big(U^{(s)} V^{T} V (U^{(s)})^{T}\big) + C \sum_{i=1}^{l} \xi_i$
$\text{s.t.} \quad y_i\big(\mathrm{Tr}((U^{(s)})^{T} X_i V) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \text{for } i = 1, 2, \ldots, l. \qquad (31)$

In the $(s+1)$-th iteration, we fix $V$ as $V^{(s)}$ and optimize $U$, $\xi$ and $b$ by solving the problem in Eq. (11). We have the following result:

$\{U^{(s+1)}, \xi^{(s+1)}, b^{(s+1)}\} = \arg\min_{U, \xi, b} \; \frac{1}{2}\mathrm{Tr}\big(U (V^{(s)})^{T} V^{(s)} U^{T}\big) + C \sum_{i=1}^{l} \xi_i$
$\text{s.t.} \quad y_i\big(\mathrm{Tr}(U^{T} X_i V^{(s)}) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \text{for } i = 1, 2, \ldots, l. \qquad (32)$

Similarly, when we fix $U$ as $U^{(s+1)}$ and optimize $V$, $\xi$ and $b$ by solving the problem in Eq. (11), the following result holds:

$\{V^{(s+1)}, \xi^{(s+1)}, b^{(s+1)}\} = \arg\min_{V, \xi, b} \; \frac{1}{2}\mathrm{Tr}\big(U^{(s+1)} V^{T} V (U^{(s+1)})^{T}\big) + C \sum_{i=1}^{l} \xi_i$
$\text{s.t.} \quad y_i\big(\mathrm{Tr}((U^{(s+1)})^{T} X_i V) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \text{for } i = 1, 2, \ldots, l. \qquad (33)$

Combining the results in Eqs. (30) and (32), we know that $\{U^{(s)}, V^{(s)}, \xi^{(s)}, b^{(s)}\}$ and $\{U^{(s+1)}, V^{(s+1)}, \xi^{(s+1)}, b^{(s+1)}\}$ are both feasible solutions to the problem in Eq. (11). More importantly, recalling the results in Proposition 3, we have the following inequality:

$\frac{1}{2}\mathrm{Tr}\big(U^{(s+1)} (V^{(s+1)})^{T} V^{(s+1)} (U^{(s+1)})^{T}\big) + C \sum_{i=1}^{l} \xi_i^{(s+1)} \;\le\; \frac{1}{2}\mathrm{Tr}\big(U^{(s+1)} (V^{(s)})^{T} V^{(s)} (U^{(s+1)})^{T}\big) + C \sum_{i=1}^{l} \xi_i^{(s+1)} \;\le\; \frac{1}{2}\mathrm{Tr}\big(U^{(s)} (V^{(s)})^{T} V^{(s)} (U^{(s)})^{T}\big) + C \sum_{i=1}^{l} \xi_i^{(s)}. \qquad (34)$

It indicates the decrease of the objective function during the iteration. □
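The monotone decrease established here can be observed empirically with the objective trace returned by the fit_mrmlsvm sketch given in the convergence discussion; this is again a hypothetical helper, not the authors' code, and the check is only accurate up to the tolerance of the inner SVM solver.

# objs is the per-iteration objective trace returned by the fit_mrmlsvm sketch above.
_, _, _, objs = fit_mrmlsvm(X, y, init_V(n, k=2, strategy="norm"), C=1.0, n_iter=20)
diffs = [objs[i + 1] - objs[i] for i in range(len(objs) - 1)]
print("largest increase across iterations:", max(diffs))   # should be <= 0 up to solver tolerance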

Proof of Proposition 5. Since $A$ and $B$ are positive semi-definite, their SVD decompositions are $A = U_A \Sigma_A U_A^{T}$ and $B = U_B \Sigma_B U_B^{T}$. Thus

$A^{i} \otimes B^{i} = (U_A \Sigma_A^{i} U_A^{T}) \otimes (U_B \Sigma_B^{i} U_B^{T}) = (U_A \otimes U_B)(\Sigma_A^{i} \otimes \Sigma_B^{i})(U_A \otimes U_B)^{T} = (U_A \otimes U_B)(\Sigma_A \otimes \Sigma_B)^{i}(U_A \otimes U_B)^{T} = (A \otimes B)^{i}. \qquad (35)$ □
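A quick numerical check of this Kronecker-power identity, added for this rewrite as an illustration:

import numpy as np

rng = np.random.default_rng(1)
A0 = rng.standard_normal((3, 3)); A = A0 @ A0.T   # positive semi-definite A
B0 = rng.standard_normal((4, 4)); B = B0 @ B0.T   # positive semi-definite B
i = 3
lhs = np.kron(np.linalg.matrix_power(A, i), np.linalg.matrix_power(B, i))   # A^i kron B^i
rhs = np.linalg.matrix_power(np.kron(A, B), i)                              # (A kron B)^i
assert np.allclose(lhs, rhs)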

References

[1] J. Yang, D. Zhang, A.F. Frangi, J. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 131–137.
[2] A.W.-K. Kong, D.D. Zhang, M.S. Kamel, A survey of palmprint recognition, Pattern Recognition 42 (7) (2009) 1408–1418.
[3] J.J. Koo, A.C. Evans, W.J. Gross, 3-D brain MRI tissue classification on FPGAs, IEEE Transactions on Image Processing 18 (12) (2009) 2735–2746.
[4] G. Shakhnarovich, T. Darrell, P. Indyk, Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing), The MIT Press, 2006.
[5] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, USA, 1995.
[6] B. Scholkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, USA, 2001.
[7] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, Secaucus, NJ, USA, 2006.
[8] T. Joachims, Training linear SVMs in linear time, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, ACM, New York, NY, USA, 2006, pp. 217–226.
[9] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1–27:27.
[10] S. Rendle, Factorization machines with libFM, ACM Transactions on Intelligent Systems and Technology (TIST) 3 (3) (2012) 57.
[11] Y. Zhu, X. Wang, E. Zhong, N.N. Liu, H. Li, Q. Yang, Discovering spammers in social networks, in: AAAI, 2012.
[12] C. Hou, C. Zhang, Y. Wu, Y. Jiao, Stable local dimensionality reduction approaches, Pattern Recognition 42 (9) (2009) 2054–2066.
[13] D.L. Donoho, High-dimensional data analysis: the curses and blessings of dimensionality, in: American Mathematical Society Conference on Mathematical Challenges of the 21st Century, 2000.
[14] D. Cai, X. He, Y. Hu, J. Han, T. Huang, Learning a spatially smooth subspace for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007.
[15] K. Ma, X. Tang, Discrete wavelet face graph matching, in: ICIP'01, 2001, pp. II:217–220.
[16] M.A.O. Vasilescu, D. Terzopoulos, Multilinear subspace analysis of image ensembles, in: CVPR, 2003, pp. 93–99.
[17] I. Jolliffe, Principal Component Analysis, 2nd edition, Springer, New York, 2002.
[18] J. Ye, Q. Li, LDA/QR: an efficient and effective dimension reduction algorithm and its theoretical foundation, Pattern Recognition 37 (4) (2004) 851–854.
[19] X. He, P. Niyogi, Locality preserving projections, in: NIPS, 2003.
[20] F. Nie, S. Xiang, Y. Song, C. Zhang, Extracting the optimal dimensionality for local tensor discriminant analysis, Pattern Recognition 42 (1) (2009) 105–114.
[21] J. Ye, R. Janardan, Q. Li, Two-dimensional linear discriminant analysis, in: NIPS, 2004.
[22] M. Li, B. Yuan, 2D-LDA: a statistical linear discriminant analysis for image matrix, Pattern Recognition Letters 26 (5) (2005) 527–532.
[23] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, H. Zhang, Multilinear discriminant analysis for face recognition, IEEE Transactions on Image Processing 16 (1) (2007) 212–220.
[24] X. He, D. Cai, P. Niyogi, Tensor subspace analysis, in: NIPS, 2005.
[25] D. Tao, X. Li, X. Wu, W. Hu, S.J. Maybank, Supervised tensor learning, Knowledge and Information Systems 13 (1) (2007) 1–42.
[26] D. Tao, X. Li, X. Wu, S.J. Maybank, Tensor rank one discriminant analysis - a convergent method for discriminative multilinear subspace selection, Neurocomputing 71 (10–12) (2008) 1866–1882.
[27] S. Chen, Z. Wang, Y. Tian, Matrix-pattern-oriented Ho-Kashyap classifier with regularization learning, Pattern Recognition 40 (5) (2007) 1533–1543.
[28] Z. Wang, S. Chen, J. Liu, D. Zhang, Pattern representation in feature extraction and classifier design: matrix versus vector, IEEE Transactions on Neural Networks 19 (5) (2008) 758–769.
[29] Z. Zhang, T.W.S. Chow, Maximum margin multisurface support tensor machines with application to image classification and segmentation, Expert Systems with Applications 39 (1) (2012) 849–860.
[30] C. Hou, F. Nie, D. Yi, Y. Wu, Efficient image classification via multiple rank regression, IEEE Transactions on Image Processing 22 (1) (2013) 340–352.
[31] Z. Wang, S. Chen, D. Gao, A novel multi-view learning developed from single-view patterns, Pattern Recognition 44 (10–11) (2011) 2395–2413.
[32] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[33] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, New York, NY, USA, 2004.
[34] A.N. Langville, W.J. Stewart, The Kronecker product and stochastic automata networks, Journal of Computational and Applied Mathematics (2003) 429–447.

Chenping Hou received his B.S. and Ph.D. degrees in Applied Mathematics from the National University of Defense Technology, Changsha, China, in 2004 and 2009, respectively. He is now a lecturer in the Department of Mathematics and Systems Science. He is a member of both IEEE and ACM. His current research fields include pattern recognition, machine learning and computer vision.

Feiping Nie received his B.S. degree in Computer Science from North China University of Water Conservancy and Electric Power, China, in 2000, his M.S. degree in Computer Science from Lanzhou University, China, in 2003, and his Ph.D. degree in Computer Science from Tsinghua University, China, in 2009. Currently, he is a research assistant professor at the University of Texas, Arlington, USA. His research interests include machine learning and its application fields, such as pattern recognition, data mining, computer vision, image processing and information retrieval.

Changshui Zhang received the B.S. degree in mathematics from Peking University, Beijing, China, in 1986 and the M.S. and Ph.D. degrees in control science and engineering from Tsinghua University, Beijing, China, in 1989 and 1992, respectively. In 1992, he joined the Department of Automation, Tsinghua University, where he is currently a Professor. His research interests include pattern recognition, machine learning, etc. He has authored more than 200 papers. Prof. Zhang is currently an Associate Editor of the Pattern Recognition journal.

Dongyun Yi is a professor in the Department of Mathematics and Systems Science at the National University of Defense Technology in Changsha, China. He worked as a visiting researcher at the University of Warwick in 2007. His research interests include systems science, statistics and data processing.

Yi Wu is a professor in the Department of Mathematics and Systems Science. He earned his bachelor's and master's degrees in Applied Mathematics at the National University of Defense Technology in 1981 and 1988. He worked as a visiting researcher at New York State University in 1999. His research interests include applied mathematics, statistics and data processing.