
1076 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 5, MAY 2014

Adaptation Regularization: A General Framework for Transfer Learning

Mingsheng Long, Jianmin Wang, Guiguang Ding, Sinno Jialin Pan, and Philip S. Yu, Fellow, IEEE

Abstract—Domain transfer learning, which learns a target classifier using labeled data from a different distribution, has shown promising value in knowledge discovery yet remains a challenging problem. Most previous works designed adaptive classifiers by exploring two learning strategies independently: distribution adaptation and label propagation. In this paper, we propose a novel transfer learning framework, referred to as Adaptation Regularization based Transfer Learning (ARTL), to model them in a unified way based on the structural risk minimization principle and the regularization theory. Specifically, ARTL learns the adaptive classifier by simultaneously optimizing the structural risk functional, the joint distribution matching between domains, and the manifold consistency underlying the marginal distribution. Based on the framework, we propose two novel methods using Regularized Least Squares (RLS) and Support Vector Machines (SVMs), respectively, and use the Representer theorem in reproducing kernel Hilbert space to derive the corresponding solutions. Comprehensive experiments verify that ARTL can significantly outperform state-of-the-art learning methods on several public text and image datasets.

Index Terms—Transfer learning, adaptation regularization, distribution adaptation, manifold regularization, generalization error

1 INTRODUCTION

IT is very difficult, if not impossible, to induce a supervised classifier without any labeled data. For the emerging domains where labeled data are sparse, to save the manual labeling efforts, one may expect to leverage abundant labeled data available in a related source domain for training an accurate classifier to be reused in the target domain. Recently, the literature has witnessed an increasing interest in developing transfer learning [1] methods for cross-domain knowledge transfer problems. Transfer learning has proven to be promising in many real-world applications, e.g., text categorization [2], [3], sentiment analysis [4], [5], image classification [6] and retrieval [7], video summarization [8], and collaborative recommendation [9].

Recall that the probability distributions in different domains may change tremendously and have very different statistical properties, e.g., mean and variance. Therefore, one major computational issue of transfer learning is how to reduce the difference in distributions between the source and target data.

• M. Long is with the School of Software, Tsinghua University, Beijing 100084, and with the Department of Computer Science, Tsinghua University, Beijing 100084, China. E-mail: [email protected].

• J. Wang and G. Ding are with the School of Software, Tsinghua University, Beijing 100084, China. E-mail: {jimwang, dinggg}@tsinghua.edu.cn.

• S. J. Pan is with the Institute of Infocomm Research, Singapore 138632. E-mail: [email protected].

• P. S. Yu is with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607 USA. E-mail: [email protected].

Manuscript received 15 Aug. 2012; revised 19 June 2013; accepted 20 June 2013. Date of publication 30 June 2013; date of current version 7 May 2014. Recommended for acceptance by J. Bailey. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier 10.1109/TKDE.2013.111

Recent works aim to discover a good feature representation across domains, which can simultaneously reduce the distribution difference and preserve the important properties of the original data [10]. Under the new feature representation, standard supervised learning algorithms can be trained on the source domain and reused on the target domain [11], [12]. Pan et al. [11] proposed Maximum Mean Discrepancy Embedding (MMDE), in which the MMD [13] distance measure for comparing different distributions is explicitly minimized. Si et al. [12] proposed a general Transfer Subspace Learning (TSL) framework, in which the Bregman divergence is imposed as a regularization on a variety of subspace learning methods, e.g., PCA and LDA. Another line of work aims to directly construct an adaptive classifier by imposing the distance measure as a regularization on supervised learning methods, e.g., SVMs [14]–[17]. However, these methods only utilized the source domain labeled data to train a classifier. We show that such labeled data can be further explored to reduce the difference in the conditional distributions across domains. Also, these methods only utilized the target domain unlabeled data to reduce the difference in the marginal distributions across domains. We show that these unlabeled data can be further explored to boost classification performance.

It is noteworthy that, in some real-world scenarios, only minimizing the difference in marginal distributions between domains is not good enough for knowledge transfer, since the discriminative directions of the source and target domains may still be different [10], [18]. Therefore, another major computational issue of transfer learning is how to further explore the marginal distributions to potentially match the discriminative directions between domains. In this direction, the unlabeled data may often reveal the underlying truth of the target domain [19]–[21].


Fig. 1. Motivation of ARTL. f: hyperplane; Ds: source domain; Dt: target domain; ◦: domain/class centroid; MDA: marginal distribution adaptation; CDA: conditional distribution adaptation; MR: manifold regularization. (a) Original domains. (b) After MDA. (c) After CDA. (d) After MR.

Bruzzone et al. [19] proposed the Domain Adaptation Support Vector Machine (DASVM), which extended Transductive SVM (TSVM) to progressively classify the unlabeled target data and simultaneously remove some labeled source data. Bahadori et al. [20] proposed Latent Transductive Transfer Learning (LATTL) to combine subspace learning and transductive classification (TSVM) in a unified framework. However, all these methods adopt TSVM as a building block, which is difficult to solve and is not natural for out-of-sample data [22]. In addition, these methods do not minimize the difference between the conditional distributions across domains.

Based on the aforementioned discussions, we summarize the computational issues of transfer learning in Fig. 1 and highlight our motivation. Given a labeled source domain Ds and an unlabeled target domain Dt as in subplot (a), we can see that the hyperplane f trained on Ds cannot discriminate Dt correctly due to the substantial distribution difference. Similar to most previous works, we minimize the distance between the marginal distributions in subplot (b), i.e., the sample moments of the two domains are drawn closer. Then the hyperplane f can classify Dt more correctly. Noteworthily, it is indispensable to minimize the distance between the conditional distributions as in subplot (c), which can make the intra-class centroids closer and the inter-class centroids more separable. Finally, as shown in subplot (d), it is important to maximize the manifold consistency underlying the marginal distributions, which can "rotate" the hyperplane f to respect the ground truth of the target data. This motivates us to design a general framework that integrates all these learning objectives.

In this paper, we propose a general transfer learning framework, referred to as Adaptation Regularization based Transfer Learning (ARTL), to model joint distribution adaptation and manifold regularization in a unified way underpinned by the structural risk minimization principle and the regularization theory. More specifically, ARTL learns an adaptive classifier by simultaneously optimizing the structural risk functional, the joint distribution matching between both marginal and conditional distributions, and the manifold consistency of the marginal distribution. The contributions of this paper are summarized as follows.

• To cope with the considerable change between data distributions of different domains, ARTL aims to minimize the structural risk functional, the joint adaptation of both marginal and conditional distributions, and the manifold regularization. To the best of our knowledge, ARTL is the first semi-supervised domain transfer learning framework that explores all these learning criteria simultaneously. In particular, ARTL remains simple by introducing only one additional term (parameter) compared with the state-of-the-art graph-based semi-supervised learning framework [22].

• Many standard supervised methods, e.g., RLS and SVMs, can be incorporated into the ARTL framework to tackle domain transfer learning. A revised Representer theorem in the Reproducing Kernel Hilbert Space (RKHS) is presented to facilitate easy handling of the optimization problems.

• Under the ARTL framework, we further propose two novel methods, ARRLS and ARSVM, both of which are convex optimization problems enjoying globally optimal solutions.

• Comprehensive experiments on text (Reuters-21578 and 20-Newsgroups) and image (PIE, USPS, and MNIST) datasets verify the effectiveness of the ARTL framework in real-world applications.

The remainder of the paper is organized as follows. We start by reviewing related works in Section 2. In Section 3, we present the ARTL framework, the two methods ARRLS and ARSVM, and an analysis of their time complexity. In Sections 4 and 5, we theoretically analyze the generalization error bound of ARTL and conduct empirical studies on real-world datasets, respectively. Finally, we conclude the paper in Section 6.

2 RELATED WORK

In this section, we discuss previous works on transfer learning that are most related to our work, and highlight their differences. According to the literature survey [1], most previous methods can be roughly organized into two categories: instance reweighting [23], [24] and feature extraction. Our work belongs to the feature extraction category, which includes two subcategories: transfer subspace learning and transfer classifier induction.

2.1 Transfer Subspace Learning

These methods aim to extract a shared subspace in which the distributions of the source and target data are drawn close. Typical learning strategies include:

1) Correspondence Learning, which first identifies the correspondence among features and then explores this correspondence for transfer subspace learning [4], [5];

2) Property Preservation, which extracts shared latent factors between domains by preserving the important properties of the original data, e.g., statistical property [2], [25], geometric structure [26]–[28], or both [3];

3) Distribution Adaptation, which learns a shared subspace in which the distribution difference is explicitly reduced by minimizing predefined distance measures, e.g., MMD or Bregman divergence [10]–[12], [29].

2.2 Transfer Classifier Induction

These methods aim to directly design an adaptive classifier by incorporating the adaptation of different distributions through model regularization.


For easy discussion, the learning strategies of these methods are summarized below. Our ARTL framework belongs to this subcategory, with substantial extensions.

1) Subspace Learning + Classifier Induction: These methods simultaneously extract a shared subspace and train a supervised [30] or semi-supervised classifier [20] in this subspace. The advantage is that the subspace and classifier can establish mutual reinforcement. Different from these methods, ARTL does not involve subspace learning and thus is more generic.

2) Distribution Adaptation + Classifier Induction: These methods directly integrate the minimization of distribution difference as a regularization term into a standard supervised classifier [15]–[17], [19]. But all these methods only minimize the distance between the marginal distributions. Different from these methods, ARTL minimizes the distance between both the marginal and conditional distributions. Our work also explores the manifold structure to improve performance.

3) Feature Replication + Co-Regularization: In these methods, the distribution difference is first reduced through feature replication, and then both the source and target classifiers are required to agree on the unlabeled target data [31]. These methods require some labeled data in the target domain, which is not required by ARTL.

4) Parameter Sharing + Manifold Regularization: This strategy is explored by semi-supervised multi-task learning methods [32], [33], which aim to improve the performance of multiple related tasks by exploring the common structure through a common prior. However, these methods ignore the distribution adaptation between multiple tasks, which is different from ARTL.

5) Kernel Matching + Manifold Regularization: These methods simultaneously perform classifier induction, kernel matching, and manifold preservation [21]. The differences between these methods and ARTL are that: 1) these methods do not reduce the distance between conditional distributions; 2) kernel matching is usually formulated as an integer program, which is difficult to solve; and 3) it is difficult to encode kernel matching as a regularization for standard classifiers, and as a result a Representer theorem is missing for these methods.

3 ADAPTATION REGULARIZATION BASED TRANSFER LEARNING FRAMEWORK

In this section, we first define the problem setting and learning goal for domain transfer learning. After that, we present the proposed general framework, ARTL. Based on the framework, we propose two methods using RLS and SVMs, and derive learning algorithms using the Representer theorem in RKHS. Finally, we analyze the computational complexity of the algorithms.

TABLE 1. Notations and Descriptions Used in this Paper

3.1 Problem Definition

Notations, which are frequently used in this paper, are summarized in Table 1.

Definition 1 (Domain) [1]. A domain D is composed of a d-dimensional feature space X and a marginal probability distribution P(x), i.e., D = {X, P(x)}, where x ∈ X.

In general, if two domains Ds and Dt are different, then they may have different feature spaces or marginal distributions, i.e., Xs ≠ Xt ∨ Ps(xs) ≠ Pt(xt).

Definition 2 (Task) [1]. Given a domain D, a task T is composed of a label space Y and a prediction function f(x), i.e., T = {Y, f(x)}, where y ∈ Y, and f(x) = Q(y|x) can be interpreted as the conditional probability distribution.

In general, if two tasks Ts and Tt are different, then they may have different label spaces or conditional distributions, i.e., Ys ≠ Yt ∨ Qs(ys|xs) ≠ Qt(yt|xt).

Definition 3 (Domain Transfer Learning). Given a labeled source domain Ds = {(x1, y1), ..., (xn, yn)} and an unlabeled target domain Dt = {xn+1, ..., xn+m}, the goal of domain transfer learning is to learn a target prediction function ft: xt ↦ yt with low expected error on Dt, under the assumptions Xs = Xt, Ys = Yt, Ps(xs) ≠ Pt(xt), and Qs(ys|xs) ≠ Qt(yt|xt).

Addressing domain transfer learning problems directly by estimating the distribution densities is challenging. Although the marginal distribution Pt(xt) can be estimated using kernel density estimation (KDE) [18], this is impossible for the conditional distribution Qt(yt|xt) since there are no labeled data in the target domain. Most previous works thus assume that there exists a proper feature transformation F such that Ps(F(xs)) = Pt(F(xt)) and Qs(ys|F(xs)) ≈ Qt(yt|F(xt)). The transformation F can be inferred by minimizing the distribution distance between the marginal distributions and preserving the properties of the original data [10].

In this paper, we put forward several justifications.

• It is insufficient to minimize only the distribution distance between the marginal distributions. The distribution distance between the conditional distributions should also be explicitly minimized.

• It is useful to further explore the marginal distributions. Noteworthily, preserving the manifold consistency underlying the marginal distribution allows us to benefit from semi-supervised learning [22].

• It may be more generic to explore the data distributions in the original feature space or kernel space, instead of various dimension-reduced subspaces.

Based on these justifications, we propose our general ARTL framework in the following section.

3.2 General Framework

We design the general ARTL framework underpinned by the structural risk minimization principle and the regularization theory. Specifically, we aim to optimize three complementary objective functions as follows:

1) Minimizing the structural risk functional on the source domain labeled data Ds;

2) Minimizing the distribution difference between the joint probability distributions Js and Jt;

3) Maximizing the manifold consistency underlying the marginal distributions Ps and Pt.

Suppose the prediction function (i.e., classifier) is f = w^T φ(x), where w is the classifier parameter vector and φ: X → H is the feature mapping that projects the original feature vector into a Hilbert space H. The learning framework of ARTL is formulated as

$$f = \arg\min_{f \in \mathcal{H}_K} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \sigma \|f\|_K^2 + \lambda D_{f,K}(J_s, J_t) + \gamma M_{f,K}(P_s, P_t), \qquad (1)$$

where K is the kernel function induced by φ such that ⟨φ(x_i), φ(x_j)⟩ = K(x_i, x_j), and σ, λ, and γ are positive regularization parameters. We interpret each term of Framework (1) in the following subsections.

3.2.1 Structural Risk Minimization

Our ultimate goal is to learn an adaptive classifier for the target domain Dt. To begin with, we can induce a standard classifier f on the labeled source domain Ds. We adopt the structural risk minimization principle [34], and minimize the structural risk functional as

$$f = \arg\min_{f \in \mathcal{H}_K} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \sigma \|f\|_K^2, \qquad (2)$$

where H_K is a set of classifiers in the kernel space, ‖f‖²_K is the squared norm of f in H_K, σ is the shrinkage regularization parameter, and ℓ is the loss function that measures the fitness of f for predicting the labels of training samples. Two widely used loss functions are the hinge loss for SVMs, ℓ = max(0, 1 − y_i f(x_i)), and the squared loss for RLS, ℓ = (y_i − f(x_i))².

3.2.2 Joint Distribution Adaptation

Unfortunately, the standard classifier f inferred by (2) may not generalize well to the target domain Dt, since the structural risk minimization principle requires the training and test data to be sampled from identical probability distributions [34]. Thus the first major computational issue is how to minimize the distribution distance between the joint probability distributions Js and Jt. By probability theory, J = P · Q; thus we seek to minimize the distribution distance 1) between the marginal distributions Ps and Pt, and 2) between the conditional distributions Qs and Qt, simultaneously.

Marginal Distribution Adaptation: We minimize D_{f,K}(Ps, Pt), the distance between the marginal distributions Ps and Pt. Since directly estimating probability densities is nontrivial, we resort to nonparametric statistics. We adopt the empirical Maximum Mean Discrepancy (MMD) [11], [13] as the distance measure, which compares different distributions based on the distance between the sample means of the two domains in a reproducing kernel Hilbert space (RKHS) H, namely

$$\mathrm{MMD}_{\mathcal{H}}^2(\mathcal{D}_s, \mathcal{D}_t) = \left\| \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) - \frac{1}{m} \sum_{j=n+1}^{n+m} \phi(x_j) \right\|_{\mathcal{H}}^2,$$

where φ: X → H is the feature mapping. To make MMD a proper regularization for the classifier f, we adopt the projected MMD [15], which is computed as

$$D_{f,K}(P_s, P_t) = \left\| \frac{1}{n} \sum_{i=1}^{n} f(x_i) - \frac{1}{m} \sum_{j=n+1}^{n+m} f(x_j) \right\|_{\mathcal{H}}^2, \qquad (3)$$

where f(x) = w^T φ(x), and K is the kernel function induced by φ such that ⟨φ(x_i), φ(x_j)⟩ = K(x_i, x_j).
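For concreteness, the following sketch estimates the projected MMD of Equation (3) from classifier outputs. This is a minimal NumPy illustration of the formula, not the authors' code; the array f_vals holding f(x) for all n + m examples (source first, then target) is an assumed input.

```python
import numpy as np

def projected_mmd(f_vals, n, m):
    """Projected MMD of Eq. (3): squared difference between the mean
    classifier outputs on the source (first n) and target (last m) examples."""
    f_src = f_vals[:n]          # f(x_i) on the source domain
    f_tgt = f_vals[n:n + m]     # f(x_j) on the target domain
    return (f_src.mean() - f_tgt.mean()) ** 2

# Toy usage with n = 4 source and m = 3 target outputs.
f_vals = np.array([0.9, 0.8, 0.7, 1.0, -0.2, 0.1, 0.0])
print(projected_mmd(f_vals, n=4, m=3))
```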

Conditional Distribution Adaptation: We minimize D_{f,K}(Qs, Qt), the distance between the conditional distributions Qs and Qt. Since calculating the nonparametric statistics of Qs(ys|xs) and Qt(yt|xt) is difficult, we instead explore the nonparametric statistics of Qs(xs|ys) and Qt(xt|yt), which can well approximate Qs(ys|xs) and Qt(yt|xt) when the sample sizes are large. Unfortunately, it is impossible to calculate the sample moments of Qt(xt|yt) w.r.t. each class (class centroids), since there are no labels in the target domain data. In this paper, we propose to use the pseudo target labels predicted by some supervised classifier (e.g., SVMs) trained on the source domain labeled data. Though many of the pseudo target labels may be incorrect due to the substantial distribution difference, we assume that the pseudo class centroids calculated from them reside not far from the true class centroids. Therefore, we can use both true and pseudo labels to compute the projected MMD w.r.t. each class c ∈ {1, ..., C} and make the intra-class centroids of the two distributions Qs(xs|ys) and Qt(xt|yt) closer in H:

$$D_{f,K}^{(c)}(Q_s, Q_t) = \left\| \frac{1}{n^{(c)}} \sum_{x_i \in \mathcal{D}_s^{(c)}} f(x_i) - \frac{1}{m^{(c)}} \sum_{x_j \in \mathcal{D}_t^{(c)}} f(x_j) \right\|_{\mathcal{H}}^2, \qquad (4)$$

where D_s^{(c)} = {x_i : x_i ∈ Ds ∧ y(x_i) = c} is the set of examples belonging to class c in the source data, y(x_i) is the true label of x_i, and n^{(c)} = |D_s^{(c)}|. Correspondingly, D_t^{(c)} = {x_j : x_j ∈ Dt ∧ y(x_j) = c} is the set of examples belonging to class c in the target data, y(x_j) is the pseudo (predicted) label of x_j, and m^{(c)} = |D_t^{(c)}|.

Integrating (3) and (4) leads to the regularization for joint distribution adaptation, computed as follows:

$$D_{f,K}(J_s, J_t) = D_{f,K}(P_s, P_t) + \sum_{c=1}^{C} D_{f,K}^{(c)}(Q_s, Q_t). \qquad (5)$$

By regularizing (2) with (5), the sample moments of both the marginal and conditional distributions are drawn closer in H. It is noteworthy that, if we use an adaptive classifier to obtain the pseudo labels, then we can usually obtain a more accurate labeling for the target data, which can further boost classification accuracy. ARTL can readily integrate any base classifier via (5).

3.2.3 Manifold Regularization

In domain transfer learning, there are both labeled and unlabeled data. Using (5), we can only match the sample moments between different distributions; however, we expect that knowledge of the marginal distributions Ps and Pt can be further exploited for better function learning. In other words, the unlabeled data may often reveal the underlying truth of the target domain, e.g., the sample variances. By the manifold assumption [22], if two points xs, xt ∈ X are close in the intrinsic geometry of the marginal distributions Ps(xs) and Pt(xt), then the conditional distributions Qs(ys|xs) and Qt(yt|xt) are similar. Under geodesic smoothness, the manifold regularization is computed as

$$M_{f,K}(P_s, P_t) = \sum_{i,j=1}^{n+m} \big(f(x_i) - f(x_j)\big)^2 W_{ij} = \sum_{i,j=1}^{n+m} f(x_i)\, L_{ij}\, f(x_j), \qquad (6)$$

where W is the graph affinity matrix and L is the normalized graph Laplacian matrix. W is defined as

$$W_{ij} = \begin{cases} \cos(x_i, x_j), & \text{if } x_i \in \mathcal{N}_p(x_j) \vee x_j \in \mathcal{N}_p(x_i) \\ 0, & \text{otherwise}, \end{cases} \qquad (7)$$

where N_p(x_i) is the set of p-nearest neighbors of point x_i. L is computed as L = I − D^{-1/2} W D^{-1/2}, where D is a diagonal matrix with each entry D_{ii} = Σ_{j=1}^{n+m} W_{ij}.

By regularizing (2) with (6), the marginal distributions can be fully exploited to maximize the consistency between the predictive structure of f and the intrinsic manifold structure of the data. This can substantially match the discriminative hyperplanes between domains.
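The graph quantities in Equations (6)-(7) are straightforward to construct. The sketch below is a hedged NumPy illustration of our reading of (7) and of L = I − D^{-1/2} W D^{-1/2}, not the authors' implementation; X is assumed to stack all n + m source and target examples row-wise, with cosine similarity used for cos(x_i, x_j).

```python
import numpy as np

def graph_laplacian(X, p=10):
    """Build the p-NN cosine affinity W (Eq. 7) and the normalized
    graph Laplacian L = I - D^{-1/2} W D^{-1/2} over all n+m examples."""
    norms = np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    S = (X / norms) @ (X / norms).T            # pairwise cosine similarities
    N = X.shape[0]
    W = np.zeros((N, N))
    # Skip column 0: each point's most similar point is assumed to be itself.
    knn = np.argsort(-S, axis=1)[:, 1:p + 1]
    for i in range(N):
        W[i, knn[i]] = S[i, knn[i]]
    # Keep an edge if x_i is among x_j's neighbors OR vice versa (Eq. 7).
    W = np.maximum(W, W.T)
    D = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(D + 1e-12))
    L = np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
    return W, L
```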

3.3 Learning Algorithms

We extend standard algorithms (RLS and SVMs) under the ARTL framework with different choices of the loss function ℓ. The major difficulty lies in that the kernel mapping φ: X → H may have infinite dimensions. To solve (1) effectively, we need to reformulate it using the following revised Representer theorem.

Theorem 1 (Representer Theorem) [22], [35]. The minimizer of optimization problem (1) admits an expansion

$$f(x) = \sum_{i=1}^{n+m} \alpha_i K(x_i, x) \quad \text{and} \quad w = \sum_{i=1}^{n+m} \alpha_i \phi(x_i) \qquad (8)$$

in terms of the cross-domain labeled and unlabeled examples, where K is the kernel induced by φ and α_i are coefficients.

We focus on reformulating the regularization terms. By incorporating Equation (8) into Equation (5), we have

$$D_{f,K}(J_s, J_t) = \operatorname{tr}\!\big(\boldsymbol{\alpha}^T K M_0 K \boldsymbol{\alpha}\big) + \sum_{c=1}^{C} \operatorname{tr}\!\big(\boldsymbol{\alpha}^T K M_c K \boldsymbol{\alpha}\big) = \operatorname{tr}\!\big(\boldsymbol{\alpha}^T K M K \boldsymbol{\alpha}\big) \quad \text{with} \quad M = \sum_{c=0}^{C} M_c, \qquad (9)$$

where K ∈ R^{(n+m)×(n+m)} is the kernel matrix with K_{ij} = K(x_i, x_j), and α = (α_1, ..., α_{n+m}) is the vector of classifier parameters. M_c, c ∈ {0, 1, ..., C}, are the MMD matrices computed as

$$(M_c)_{ij} = \begin{cases} \dfrac{1}{n^{(c)} n^{(c)}}, & x_i, x_j \in \mathcal{D}_s^{(c)} \\[4pt] \dfrac{1}{m^{(c)} m^{(c)}}, & x_i, x_j \in \mathcal{D}_t^{(c)} \\[4pt] \dfrac{-1}{n^{(c)} m^{(c)}}, & \begin{cases} x_i \in \mathcal{D}_s^{(c)},\ x_j \in \mathcal{D}_t^{(c)} \\ x_j \in \mathcal{D}_s^{(c)},\ x_i \in \mathcal{D}_t^{(c)} \end{cases} \\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (10)$$

where n^{(c)}, m^{(c)}, D_s^{(c)}, D_t^{(c)}, c ∈ {1, ..., C}, are defined as in (4). For clarity, M_0 can also be computed with (10) by substituting n^{(0)} = n, m^{(0)} = m, D_s^{(0)} = Ds, D_t^{(0)} = Dt.

Similarly, by incorporating (8) into (6), we obtain

$$M_{f,K}(P_s, P_t) = \operatorname{tr}\!\big(\boldsymbol{\alpha}^T K L K \boldsymbol{\alpha}\big). \qquad (11)$$

With (9) and (11), we can readily implement new algorithms under ARTL by extending RLS and SVMs.
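The aggregate MMD matrix M of Equations (9)-(10) can be assembled from simple outer products. The following NumPy sketch illustrates this under our reading of (10); it is not the released code, and it assumes integer label vectors ys (true source labels) and yt_pseudo (pseudo target labels) taking values in {1, ..., C}.

```python
import numpy as np

def mmd_matrices(ys, yt_pseudo, C):
    """Return M = sum_{c=0}^{C} M_c of Eq. (9), each M_c built as in Eq. (10).
    ys: true source labels (length n); yt_pseudo: pseudo target labels (length m)."""
    ys = np.asarray(ys)
    yt_pseudo = np.asarray(yt_pseudo)
    n, m = len(ys), len(yt_pseudo)
    N = n + m

    def block(src_mask, tgt_mask):
        """One M_c (or M_0 when the masks cover the whole domains)."""
        e = np.zeros((N, 1))
        nc, mc = src_mask.sum(), tgt_mask.sum()
        if nc > 0:
            e[:n][src_mask] = 1.0 / nc
        if mc > 0:
            e[n:][tgt_mask] = -1.0 / mc
        # Outer product yields the +1/n^2, +1/m^2, -1/(nm) pattern of Eq. (10).
        return e @ e.T

    M = block(np.ones(n, dtype=bool), np.ones(m, dtype=bool))   # M_0: marginal term
    for c in range(1, C + 1):                                   # M_c: one per class
        M += block(ys == c, yt_pseudo == c)
    return M
```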

3.3.1 ARRLS: ARTL Using Squared Loss

Using the squared loss ℓ(f(x_i), y_i) = (y_i − f(x_i))², the structural risk functional can be formulated as follows:

$$\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \sigma \|f\|_K^2 = \sum_{i=1}^{n+m} E_{ii} \big(y_i - f(x_i)\big)^2 + \sigma \|f\|_K^2, \qquad (12)$$

where E is a diagonal label indicator matrix with E_{ii} = 1 if x_i ∈ Ds, and E_{ii} = 0 otherwise. By substituting the Representer theorem (8) into (12), we obtain

$$\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \sigma \|f\|_K^2 = \big\| (Y - \boldsymbol{\alpha}^T K)\, E \big\|_F^2 + \sigma \operatorname{tr}\!\big(\boldsymbol{\alpha}^T K \boldsymbol{\alpha}\big), \qquad (13)$$

where Y = [y_1, ..., y_{n+m}] is the label matrix. It does not matter that the target labels are unknown, since they are filtered out by the label indicator matrix E. Integrating Equations (13), (9), and (11) into Framework (1), we obtain the objective for ARRLS based on RLS:

$$\boldsymbol{\alpha} = \arg\min_{\boldsymbol{\alpha} \in \mathbb{R}^{n+m}} \big\| (Y - \boldsymbol{\alpha}^T K)\, E \big\|_F^2 + \operatorname{tr}\!\big(\sigma \boldsymbol{\alpha}^T K \boldsymbol{\alpha} + \boldsymbol{\alpha}^T K (\lambda M + \gamma L) K \boldsymbol{\alpha}\big). \qquad (14)$$

Setting the derivative of the objective function to 0 leads to

$$\boldsymbol{\alpha} = \big((E + \lambda M + \gamma L) K + \sigma I\big)^{-1} E Y^T. \qquad (15)$$

Note that when λ = γ = 0, (15) assigns zero weight to the joint distribution adaptation and manifold regularization terms and thus degenerates to standard RLS.
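The closed form (15) is a single linear solve. The sketch below illustrates ARRLS in its multi-class form under the conventions above; it is an assumption-laden illustration rather than the authors' implementation, and it expects the kernel matrix K, MMD matrix M, Laplacian L, and one-hot label matrix Y (C × (n + m), with target columns set to zero) as inputs, e.g., produced by the earlier sketches.

```python
import numpy as np

def arrls_fit(K, M, L, Y, n, sigma=0.1, lam=10.0, gamma=1.0):
    """Solve Eq. (15): alpha = ((E + lam*M + gamma*L) K + sigma*I)^{-1} E Y^T.
    K, M, L: (n+m) x (n+m); Y: C x (n+m) one-hot labels (target columns ignored)."""
    N = K.shape[0]
    E = np.zeros((N, N))
    E[:n, :n] = np.eye(n)                  # label indicator: E_ii = 1 only for source points
    A = (E + lam * M + gamma * L) @ K + sigma * np.eye(N)
    alpha = np.linalg.solve(A, E @ Y.T)    # (n+m) x C coefficient matrix
    return alpha

def arrls_predict(alpha, K):
    """Predict via the Representer expansion f(x_j) = sum_i alpha_ic K(x_i, x_j)."""
    scores = K.T @ alpha                   # (n+m) x C decision values
    return scores.argmax(axis=1) + 1       # classes indexed 1..C, as in the M-matrix sketch
```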


Multi-Class Extension: Denote by y ∈ R^C a label vector such that y_c = 1 if y(x) = c, and y_c = 0 otherwise. The label matrix is Y = [y_1, ..., y_{n+m}] ∈ R^{C×(n+m)}, and the parameter matrix is α ∈ R^{(n+m)×C}. In this way, ARRLS can be extended to multi-class problems.

3.3.2 ARSVM: ARTL Using Hinge Loss

Using the hinge loss ℓ(f(x_i), y_i) = max(0, 1 − y_i f(x_i)), the structural risk functional can be formulated as

$$\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \sigma \|f\|_K^2 = \sum_{i=1}^{n} \max\big(0, 1 - y_i f(x_i)\big) + \sigma \|f\|_K^2. \qquad (16)$$

By substituting the Representer theorem (8) into (16), and integrating Equations (16), (9), and (11) into (1), we obtain the objective for ARSVM based on SVMs:

$$\begin{aligned}
\min_{\boldsymbol{\alpha} \in \mathbb{R}^{n+m},\, \boldsymbol{\xi} \in \mathbb{R}^{n}} \quad & \sum_{i=1}^{n} \xi_i + \sigma \boldsymbol{\alpha}^T K \boldsymbol{\alpha} + \boldsymbol{\alpha}^T K (\lambda M + \gamma L) K \boldsymbol{\alpha} \\
\text{s.t.} \quad & y_i \left( \sum_{j=1}^{n+m} \alpha_j K(x_i, x_j) + b \right) \ge 1 - \xi_i, \quad i = 1, \ldots, n \\
& \xi_i \ge 0, \quad i = 1, \ldots, n. \qquad (17)
\end{aligned}$$

To solve Equation (17) effectively, we follow [22] and reformulate (17) using the Lagrange dual, which leads to

$$\begin{aligned}
\boldsymbol{\beta} = \arg\max_{\boldsymbol{\beta} \in \mathbb{R}^{n}} \quad & \sum_{i=1}^{n} \beta_i - \frac{1}{2} \boldsymbol{\beta}^T Q \boldsymbol{\beta} \\
\text{s.t.} \quad & \sum_{i=1}^{n} \beta_i y_i = 0, \quad 0 \le \beta_i \le \frac{1}{n}, \quad i = 1, \ldots, n \\
\text{with} \quad & Q = Y E K \big(2\sigma I + 2(\lambda M + \gamma L) K\big)^{-1} E^T Y, \qquad (18)
\end{aligned}$$

where Y = diag(y_1, ..., y_n) and E = [I_n, 0] ∈ R^{n×(n+m)}. ARSVM can be easily implemented by using a standard SVM solver with the quadratic form induced by the Q matrix, and then using β to obtain the classifier parameters via α = (2σI + 2(λM + γL)K)^{-1} E^T Y β.
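To illustrate the matrix manipulations behind (18), the sketch below builds the dual quadratic form Q and recovers α from a dual solution β. It is a hedged illustration, not the authors' code; solving the box-constrained QP for β is left to any standard SVM dual solver, and y_src denotes the source labels in {+1, −1}.

```python
import numpy as np

def arsvm_matrices(K, M, L, y_src, n, sigma=0.1, lam=10.0, gamma=1.0):
    """Build Q = Y E K (2*sigma*I + 2*(lam*M + gamma*L) K)^{-1} E^T Y of Eq. (18),
    returning Q together with the pieces needed to recover alpha from beta."""
    N = K.shape[0]
    Y = np.diag(y_src)                                   # diag(y_1, ..., y_n)
    E = np.hstack([np.eye(n), np.zeros((n, N - n))])     # E = [I_n, 0]
    B = np.linalg.inv(2 * sigma * np.eye(N) + 2 * (lam * M + gamma * L) @ K)
    Q = Y @ E @ K @ B @ E.T @ Y
    return Q, Y, E, B

def arsvm_alpha(B, E, Y, beta):
    """Recover alpha = (2*sigma*I + 2*(lam*M + gamma*L)*K)^{-1} E^T Y beta."""
    return B @ E.T @ Y @ beta
```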

The learning algorithms are summarized in Algorithm 1. To make the parameters λ and γ easy to tune, we normalize the graph Laplacian matrix and the MMD matrix.

3.4 Computational Complexity

Denote by s the average number of non-zero features per example, with s ≤ d and p ≪ min(n + m, d). The computational complexity of the framework consists of three parts.

1) Solving the linear systems (15) or (18) using LU decomposition requires O((n + m)^3), which may be greatly reduced using the conjugate gradient method. For ARSVM, solving the SVM optimization (18) with a widely used SVM solver [36] requires O((n + m)^{2.3}).

2) Constructing the graph Laplacian matrix L requires O(s(n + m)^2), which is performed once.

3) Constructing the kernel matrix K and the aggregate MMD matrix M requires O(C(n + m)^2).

In summary, the computational complexity of Algorithm 1 is O((n + m)^3 + (s + C)(n + m)^2) using exact computations, which is adopted in this paper. It is not difficult to speed up the algorithms using conjugate gradient methods, and this is left for future work.

3.5 Connections to Existing Works

As discussed in Section 2, our work is substantially different from a variety of prior cross-domain learning methods such as [4]–[6], [25], [37], which do not explicitly consider distribution matching or manifold regularization. In this subsection, we specifically distinguish our work from an insightful perspective.

Distribution Adaptation: These methods explicitly reduce the distribution difference by minimizing predefined distance measures, e.g., MMD or Bregman divergence [11], [12], [14]–[17], [19]. However, they only reduce the distance between marginal distributions, while the distance between conditional distributions is not minimized. Several works considered matching both the marginal and conditional distributions [18], [29]; however, they require some labeled data in the target domain, which is not required by our method. Also, the manifold structure underlying the marginal distributions is not considered in any of these methods.

Manifold Regularization: These methods explicitly maximize the consistency of the induced embeddings (subspace learning) [3], [26]–[28] or classifiers (supervised learning) [22], [33] with respect to the intrinsic manifold structure. However, these methods do not explicitly reduce the distribution difference between domains and may overfit the target domain data.

To our knowledge, the works most closely related to our ARTL are Graph co-regularized Transfer Learning (GTL) [3], Semi-Supervised Transfer Component Analysis (SSTCA) [10], Discriminative Feature Extraction (DFE) [30], Latent Transductive Transfer Learning (LATTL) [20], and Semi-Supervised Kernel Matching (SSKM) [21]. All these methods can be categorized as "semi-supervised transfer learning", since they explore the combination of semi-supervised learning and transfer learning. For clear comparison, the differences among these methods are illustrated in Table 2.


TABLE 2. Comparison between Most Closely Related Works

• GTL and SSTCA are dimensionality reduction methods where label information and manifold structure are explored only for subspace learning. Our ARTL is a framework for adaptive classifiers.

• DFE is a joint learning method for distribution adaptation and classifier training. It explores the manifold structure in a separate second step, and it does not match the conditional distributions.

• LATTL is a combination of subspace learning and transductive classification. By using Transductive SVM (TSVM) to explore both the labeled and unlabeled data, LATTL can naturally achieve better generalization capability to the target domain. It has two weaknesses: 1) it does not explicitly reduce the distribution distance; 2) its TSVM learning framework is not natural for out-of-sample data.

• SSKM is the work most similar to ours. It simultaneously considers structural risk minimization, kernel matching, and manifold preservation. The differences between SSKM and ARTL are that: 1) SSKM does not reduce the distance between conditional distributions; 2) the kernel matching is an integer programming problem and is difficult to solve; and 3) the kernel matching is not directly imposed as a regularization on the classifier, so it does not exhibit a generic Representer theorem.

In summary, our proposed ARTL can simultaneously explore 1) structural risk minimization, 2) distribution adaptation of both the marginal and conditional distributions, and 3) manifold consistency maximization. ARTL is underpinned by the regularization theory in RKHS and exhibits a revised Representer theorem. Thus, ARTL is a general framework in which a variety of supervised algorithms can be readily incorporated. Furthermore, ARTL is a convex optimization problem enjoying global optima. We will compare ARTL with TCA and SSKM empirically to validate its advantage.

4 GENERALIZATION BOUND ANALYSIS

We analyze the generalization error bound of ARTL on the target domain based on the structural risk on the source domain, following the approaches in [30], [38]. First, we denote the induced prediction function by f(x) = sgn(w^T φ(x)), and the true labeling function by h(x): X → {1, −1}. Let ℓ(x) be the continuous loss function ℓ(x) = |h(x) − f(x)|; then 0 ≤ ℓ(x) ≤ 2. The expected error of f in Dt is defined as

$$\epsilon_t(f) = \mathbb{E}_{x \sim P_t}\big[\,|h(x) - f(x)|\,\big] = \mathbb{E}_{x \sim P_t}\big[\ell(x)\big].$$

Similarly, the expected error of f in Ds is defined as

$$\epsilon_s(f) = \mathbb{E}_{x \sim P_s}\big[\,|h(x) - f(x)|\,\big] = \mathbb{E}_{x \sim P_s}\big[\ell(x)\big].$$

We now present the target error bound in terms of the source risk in the following theorem, which is essentially a restatement of [38] with a slight modification.

Theorem 2. Suppose the hypothesis space containing f is of VC dimension d. Then the expected error of f in Dt is bounded with probability at least 1 − δ by

$$\epsilon_t(f) \le \epsilon_s(f) + \sqrt{\frac{4}{n}\left(d \log \frac{2en}{d} + \log \frac{4}{\delta}\right)} + D_{f,K}(J_s, J_t) + \Lambda, \qquad (19)$$

where e is the base of the natural logarithm, ε_s(f) is the empirical error of f in Ds, and Λ = inf_{f ∈ H_K} [ε_s(f) + ε_t(f)].

From Theorem 2, the expected error in Dt, i.e., ε_t(f), is bounded if we can simultaneously minimize 1) the empirical error of the labeled data in Ds, i.e., ε_s(f), 2) the distribution distance between Ds and Dt in the RKHS H, i.e., D_{f,K}(Js, Jt), and 3) the adaptability of the true function h in terms of the hypothesis space H_K, i.e., Λ.

In the ARTL framework, i.e., Equation (1), ε_s(f) is explicitly minimized by structural risk minimization in Equation (2); D_{f,K}(Js, Jt) is explicitly minimized by distribution adaptation in Equation (5); Λ is implicitly minimized by manifold regularization in Equation (6).

Non-rigorously, we interpret why the manifold regularization in Equation (6) can implicitly minimize Λ, the adaptability of the true function h in terms of the hypothesis space H_K. First, we introduce the following theorem, which states the error bound of semi-supervised learning based on manifold regularization.

Theorem 3 [39]. Consider the collection (x_i, y_i) for i ∈ Z_{n+m} = {1, ..., n + m}. Assume that we randomly pick n distinct integers j_1, ..., j_n from Z_{n+m} uniformly (without replacement), and denote this index set by Z_n. Let h be the true predictor and f(Z_n) be the semi-supervised learner trained using the labeled data in Z_n and the unlabeled data in Z_{n+m} \ Z_n:

$$f(Z_n) = \arg\inf_{f \in \mathbb{R}^{n+m}} \left[ \frac{1}{n} \sum_{i \in Z_n} \ell(f_i, y_i) + \gamma f^T L f \right].$$

If |∂ℓ(h, y)/∂h| ≤ τ and ℓ(h, y) is convex with respect to h, then we have the following generalization error bound on Z_{n+m} \ Z_n:

$$\mathbb{E}_{Z_n} \frac{1}{m} \sum_{i \in Z_{n+m} \setminus Z_n} \ell\big(f_i(Z_n), y_i\big) \le \inf_{f \in \mathbb{R}^{n+m}} \left[ \frac{1}{n+m} \sum_{i=1}^{n+m} \ell(f_i, y_i) + \gamma f^T L f + \frac{\tau^2 \operatorname{tr}(L^{-1})}{2\gamma n (n+m)} \right].$$

In ARTL, the manifold regularization (6) is performed in the RKHS H where the distribution distance has been minimized by Equation (5). Thus, Theorem 3 states that the classifier f trained in a semi-supervised way on Ds ∪ Dt in the RKHS H is guaranteed by an error bound in Dt. In other words, the manifold regularization (6) can implicitly minimize Λ, the adaptability of the true function h in terms of the hypothesis space H_K.


TABLE 3. Top Categories and Subcategories in 20-Newsgroups

5 EXPERIMENTS

In this section, we perform extensive experiments on two real-world applications (i.e., text classification and image recognition) to evaluate ARTL. Datasets and codes will be available online upon publication.

5.1 Data Preparation

5.1.1 Text Datasets

The 219 cross-domain text datasets are generated from 20-Newsgroups and Reuters-21578, which are two benchmark text corpora widely used for evaluating transfer learning algorithms [2], [10], [17], [25], [26].

20-Newsgroups (http://people.csail.mit.edu/jrennie/20newsgroups) has approximately 20,000 documents distributed evenly across 20 subcategories. The corpus contains four top categories: comp, rec, sci, and talk. Each top category has four subcategories, which are listed in Table 3. In the experiments, we construct 6 dataset groups for binary classification by randomly selecting two top categories (one positive and one negative) from the four top categories. The 6 dataset groups are comp vs rec, comp vs sci, comp vs talk, rec vs sci, rec vs talk, and sci vs talk. Similar to the approach in [25], we set up one dataset (including a source domain and a target domain) for cross-domain classification as follows. For each pair of top categories P and Q (e.g., P positive and Q negative), their four subcategories are denoted by P1, P2, P3, P4 and Q1, Q2, Q3, Q4, respectively. We randomly select (without replacement) two subcategories from P (e.g., P1 and P2) and two subcategories from Q (e.g., Q1 and Q2) to form a source domain; the remaining subcategories in P and Q (i.e., P3, P4 and Q3, Q4) then form a target domain. This dataset construction strategy ensures that the domains of labeled and unlabeled data are related, since they are under the same top categories. Besides, the domains are also ensured to be different, since they are drawn from different subcategories. In this way, for each dataset group P vs Q, we can generate $C_4^2 \cdot C_4^2 = 36$ datasets. Clearly, for each example in a generated dataset, its class label is either P or Q. In total, we can generate 6 dataset groups consisting of 6 · 36 = 216 datasets. For fair comparison, the 216 datasets are constructed using a preprocessed version of 20-Newsgroups [2], which contains 25,804 features and 15,033 documents, with each document weighted by term frequency-inverse document frequency (TF-IDF).

Fig. 2. Benchmark image datasets PIE, USPS, and MNIST.
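To make the construction concrete, the snippet below enumerates the $C_4^2 \cdot C_4^2 = 36$ source/target splits for one dataset group. The subcategory names P1, ..., Q4 are placeholders standing in for the actual 20-Newsgroups subcategories of Table 3.

```python
from itertools import combinations

# Placeholder subcategory identifiers for a top-category pair P vs Q.
P_subs = ["P1", "P2", "P3", "P4"]
Q_subs = ["Q1", "Q2", "Q3", "Q4"]

splits = []
for p_src in combinations(P_subs, 2):         # C(4,2) = 6 positive source choices
    for q_src in combinations(Q_subs, 2):     # C(4,2) = 6 negative source choices
        source = set(p_src) | set(q_src)
        target = (set(P_subs) - set(p_src)) | (set(Q_subs) - set(q_src))
        splits.append((sorted(source), sorted(target)))

print(len(splits))   # 36 datasets per group; 6 groups -> 216 datasets
```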

Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578) has three top categories: orgs, people, and place. Using the same strategy, we construct 3 cross-domain text datasets: orgs vs people, orgs vs place, and people vs place. For fair comparison, we use the preprocessed version of Reuters-21578 studied in [40].

5.1.2 Image Datasets

USPS, MNIST, and PIE (refer to Fig. 2 and Table 4) are three handwritten digit/face datasets broadly adopted in computer vision and pattern recognition.

USPS (http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html) is composed of 7,291 training images and 2,007 test images of size 16 × 16.

MNIST (http://yann.lecun.com/exdb/mnist) has a training set of 60,000 examples and a test set of 10,000 examples of size 28 × 28.

From Fig. 2, we see that USPS and MNIST follow different distributions. They share 10 semantic classes, each corresponding to one digit. We construct one dataset, USPS vs MNIST, by randomly sampling 1,800 images from USPS to form the source domain and 2,000 images from MNIST to form the target domain. Then we switch the source/target pair to get another dataset, MNIST vs USPS. We uniformly rescale all images to size 16 × 16, and represent each image by a 256-dimensional vector encoding the gray-scale values of all pixels. In this way, the source and target domains are ensured to share the same feature space.

PIE (http://vasc.ri.cmu.edu/idb/html/face), standing for "Pose, Illumination, Expression", is a benchmark face database. It contains 41,368 face images of 68 individuals, sized 32 × 32. The images were captured by 13 synchronized cameras and 21 flashes, under varying poses, illuminations, and expressions.

In our experiments, we simply adopt the preprocessed versions of PIE (http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html), i.e., PIE1 [41] and PIE2 [42], which are generated by randomly sampling the face images from the near-frontal poses (C27) under different lighting and illumination conditions. We construct one dataset, PIE1 vs PIE2, by selecting all 2,856 images in PIE1 to form the source domain and all 3,329 images in PIE2 to form the target domain. We switch the source/target pair to get another dataset, PIE2 vs PIE1. Thus the source and target domains are guaranteed to follow different distributions in the same feature space, due to variations in lighting and illumination.


TABLE 4. Statistics of the 4 Benchmark Image Datasets

5.2 Experimental Setup

5.2.1 Baseline Methods

We compare the ARTL approaches, i.e., ARSVM and ARRLS, with eight state-of-the-art supervised and transfer learning methods for text and image classification:

• Logistic Regression (LR)
• Support Vector Machine (SVM)
• Laplacian SVM (LapSVM) [22]
• Cross-Domain Spectral Classification (CDSC) [26]
• Spectral Feature Alignment (SFA) [5]
• Transfer Component Analysis (TCA) [10]
• Large Margin Transductive TL (LMTTL) [15]
• Semi-Supervised Kernel Matching (SSKM) [21]

Specifically, LMTTL is a special case of ARTL with γ = 0, C = 0, while SSKM can be viewed as a special case of ARTL with C = 0. SSKM adopts a kernel matching strategy that needs an additional mapping matrix to match different kernels. Different from SSKM, ARTL seamlessly integrates the distribution adaptation term into the classifier based on the regularization theory. Note that we do not compare with [30] because that work cannot cope with thousands of training samples.

5.2.2 Implementation Details

Following [1], [10], [21], LR and SVM are trained on the labeled source data and tested on the unlabeled target data; CDSC, SFA, and TCA are run on all data as a dimensionality reduction step, and then an LR classifier is trained on the labeled source data to classify the unlabeled target data; LapSVM, LMTTL, SSKM, ARSVM, and ARRLS are trained on all data in a transductive way to directly induce domain-adaptive classifiers.

Under our experimental setup, it is impossible to automatically tune the optimal parameters for the target classifier using cross validation, since we have no labeled data in the target domain. Therefore, we evaluate the eight baseline methods on our datasets by empirically searching the parameter space for the optimal parameter settings, and report the best results of each method. For LR (http://www.csie.ntu.edu.tw/~cjlin/liblinear) and SVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm), we set the trade-off parameter C (i.e., 1/2σ in ARTL) by searching C ∈ {0.1, 0.5, 1, 5, 10, 50, 100}. For LapSVM (http://vikas.sindhwani.org/manifoldregularization.html), we set the regularization parameters γA and γI (i.e., σ and γ in ARTL) by searching γA, γI ∈ {0.01, 0.05, 0.1, 0.5, 1, 5, 10}. For the transfer subspace learning methods CDSC, SFA, and TCA, we set the optimal subspace dimension k by searching k ∈ {4, 8, 16, 32, 64, 128}. For the transfer classifier induction methods LMTTL and SSKM, we set the trade-off parameter λ between the structural risk functional and the distribution adaptation term by searching λ ∈ {0.01, 0.1, 1, 10, 100}. We use the linear kernel, i.e., K(xi, xj) = ⟨xi, xj⟩, for all kernel methods.

The ARTL approaches involve four tunable parameters: the shrinkage/MMD/manifold regularization parameters σ, λ, γ, and the number of nearest neighbors p. Sensitivity analysis validates that ARTL can achieve stable performance under a wide range of parameter values, especially for σ, λ, and p. In the comparative study, we fix σ = 0.1, λ = 10, p = 10, and set 1) γ = 10 for the text datasets and 2) γ = 1 for the image datasets. In practice, we can simplify model selection by sequentially choosing optimal parameter values from the most stable ones to the most sensitive ones. First, since the adaptation regularization can largely control model complexity, ARTL is very robust to σ, and we can simply choose a small σ such that ARTL does not degenerate. Second, since distribution adaptation is indispensable for transfer learning, we choose λ such that ARTL can sufficiently match both the marginal and conditional distributions across domains. Finally, we can choose γ by following the graph-based semi-supervised learning framework [22], where p is often predetermined as in KNN methods.

We use the classification Accuracy on the test data (unlabeled target data) as the evaluation metric, since it is widely adopted in the literature [5], [10], [17], [30]:

$$\mathrm{Accuracy} = \frac{\big| \{ x : x \in \mathcal{D}_t \wedge f(x) = y(x) \} \big|}{\big| \{ x : x \in \mathcal{D}_t \} \big|},$$

where y(x) is the ground-truth label of x and f(x) is the label predicted by the classification algorithm.
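As a small sanity check of the metric, assuming hypothetical NumPy arrays of ground-truth and predicted target labels:

```python
import numpy as np

y_true = np.array([1, 2, 2, 3, 1])      # hypothetical ground-truth labels on D_t
y_pred = np.array([1, 2, 1, 3, 1])      # hypothetical predictions f(x)
accuracy = np.mean(y_pred == y_true)    # |{x in D_t : f(x) = y(x)}| / |D_t|
print(accuracy)                          # 0.8
```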

5.3 Experimental Results

In this section, we compare our ARTL with the eight baseline methods in terms of classification accuracy.

5.3.1 Results of Text Classification

As 20-Newsgroups and Reuters-21578 are different in hierarchical structure, we report the results separately.

20-Newsgroups: The average classification accuracy of the ARTL approaches, including ARSVM and ARRLS, and the eight baseline methods on the 6 cross-domain dataset groups (216 datasets) is reported in Table 5. The detailed results for the 6 dataset groups are listed in Fig. 3, subplots (a)-(f). Each of these six figures contains the results on the 36 datasets in the corresponding group. The 36 datasets are sorted in increasing order of the classification accuracy obtained by Logistic Regression (LR). Therefore, the x-axis in each figure essentially indicates the degree of difficulty of cross-domain knowledge transfer. From these figures, we can make the following observations.

The ARTL approaches achieve much better performance than the eight baseline methods with statistical significance. The average classification accuracy of ARRLS on the 216 datasets is 93.40%. The performance improvement is 5.37% compared to the best baseline method, SSKM, which corresponds to a very significant error reduction of 44.86%. Since these results are obtained on a large number of datasets, they convincingly verify that ARTL can build robust adaptive classifiers for classifying cross-domain documents accurately.


TABLE 5. Average Classification Accuracy (%) on the 6 Cross-Domain Text Dataset Groups Comprising 216 Datasets

Secondly, we observe that all the transfer learning methods achieve better classification accuracy than the standard learning methods. A major limitation of existing standard learning methods is that they treat the data from different domains as if they were drawn from a homogeneous distribution. In reality, the identical-distribution assumption does not hold in cross-domain learning problems, which results in their unsatisfactory performance. It is important to notice that the state-of-the-art semi-supervised learning method LapSVM does not perform better than LR and SVM. Although LapSVM can explore the target data in a transductive way, it does not minimize the distribution difference between domains. Therefore, it may overfit the target data when the discriminative directions are significantly different between domains.

Thirdly, we notice that ARTL significantly outperforms CDSC, SFA, and TCA, which are state-of-the-art transfer subspace learning methods based on feature transformation. A major limitation of existing transfer subspace learning methods is that they are prone to overfitting, due to their inability to simultaneously reduce the difference in both marginal and conditional distributions between domains. Although SFA works particularly well for sentiment classification, it works only fairly for text classification; the reason is that SFA only explores feature co-occurrence for feature alignment without considering feature frequency, which is effective for low-frequency sentiment data but not for high-frequency text data. ARTL addresses these limitations and achieves much better results.

Fourthly, we observe that ARTL achieves much better performance than LMTTL and SSKM. Notice that LMTTL and SSKM are typical transfer classifier induction methods, which induce a supervised/semi-supervised classifier and meanwhile minimize the distribution difference between domains. However, since the difference between the conditional distributions is not minimized, and the regularization terms are not imposed on the classifier, these methods may not fully reduce the distribution difference and may get stuck in poor local optima. ARTL achieves superior performance by alleviating these limitations.

Lastly, ARTL approaches often perform more robustly on difficult-to-classify datasets than all baseline methods. This can be observed from Fig. 3(a)∼(f), where the improvements of ARTL over the baseline methods are more remarkable on datasets where LR attains extremely low accuracy (below 70%).

Reuters-21578: The classification accuracy of ARTL and the baseline methods on the 3 datasets generated from Reuters-21578 is illustrated in Fig. 4(a). We observe that ARTL outperforms, or achieves performance comparable to, the baseline methods.


Fig. 3. Classification accuracy of LR, CDSC, SFA, TCA, LMTTL, SSKM, and ARRLS on the 216 text datasets: (a) comp vs rec. (b) comp vs sci. (c) comp vs talk. (d) rec vs sci. (e) rec vs talk. (f) sci vs talk.



Fig. 4. Classification accuracy of LR, SVM, LapSVM, CDSC, TCA, LMTTL, SSKM, ARSVM, ARRLS, ARRLS+: (a) Reuters-21578. (b) Image datasets.

We notice that Reuters-21578 is more challenging than 20-Newsgroups, since each of its top categories consists of many subcategories, i.e., clusters or subclasses. Therefore, it is more difficult to minimize the distribution difference by only matching the sample moments between domains. This explains the unsatisfactory performance obtained by the distribution adaptation methods, i.e., TCA, LMTTL, and SSKM.

By minimizing the difference between both the marginal and conditional distributions, ARTL can naturally match more statistical properties, i.e., both domain centroids and class centroids. Also, by maximizing the manifold consistency, ARTL can fully exploit the marginal distributions, which implicitly “rotates” the decision hyperplane to better respect the target data. In this way, ARTL performs better on difficult datasets with many classes or subclasses.
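To make the centroid-matching intuition concrete, here is an illustrative sketch (not the paper's exact RKHS formulation): with a linear kernel, the marginal and conditional discrepancies reduce to squared distances between domain means and per-class means, where the target classes come from pseudo labels. All function and variable names below are hypothetical.

```python
import numpy as np

def marginal_conditional_discrepancy(Xs, ys, Xt, yt_pseudo):
    """Squared distances between domain centroids (marginal) and
    per-class centroids (conditional); a linear-kernel illustration."""
    # Marginal term: distance between the two domain means.
    marginal = np.sum((Xs.mean(axis=0) - Xt.mean(axis=0)) ** 2)
    # Conditional term: distance between class means, using pseudo labels
    # for the (unlabeled) target domain.
    conditional = 0.0
    for c in np.unique(ys):
        src_c, tgt_c = Xs[ys == c], Xt[yt_pseudo == c]
        if len(src_c) and len(tgt_c):
            conditional += np.sum((src_c.mean(axis=0) - tgt_c.mean(axis=0)) ** 2)
    return marginal, conditional
```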

5.3.2 Results of Image Recognition
The average classification accuracy of ARTL and the six baseline methods on the four image datasets is illustrated in Fig. 4(b). SFA is not compared since it cannot handle non-sparse image data, while LapSVM and ARSVM are not compared since their original implementations cannot deal with multi-class problems.

We notice that the transfer subspace learning methods, e.g., CDSC, generally outperform standard LR and SVM. This is an expected result, since subspace learning methods, e.g., PCA, are very effective for image representation. Unfortunately, TCA generally underperforms CDSC in this case. The main reasons are twofold: 1) the MMD distance measure is not very suitable for image data, as exemplified by [12]; 2) the distribution difference is significantly large in the image datasets, which results in overfitting.

We also notice that the transfer classifier induction methods, i.e., LMTTL and SSKM, outperform CDSC on the face datasets but underperform CDSC on the handwritten digits datasets. We conjecture the reasons as follows: 1) the face datasets contain 68 classes, so transfer classifier induction methods, which directly inject the label information into the learning procedure, are more effective; 2) for the handwritten digits datasets, data reconstruction may be a more important process for reducing the distribution difference.

In conclusion, ARTL generally outperforms all baseline methods. Therefore, we can often achieve a robust adaptive classifier by minimizing the difference between both the marginal and conditional distributions while preserving the manifold consistency.

5.4 Effectiveness Verification
We verify the effectiveness of ARTL by inspecting the impacts of the base classifier and the adaptation regularization.

5.4.1 Base Classifier Integration
ARTL utilizes some base classifier, e.g., SVM, to obtain the pseudo labels for the target data, through which the difference between the conditional distributions is minimized. Unsurprisingly, if we use an adaptive classifier, e.g., ARRLS, to obtain more accurate pseudo labels for the target data, then we can match the conditional distributions more accurately and further boost the classification accuracy. It is very interesting that ARTL can accept its own outputs as inputs to iteratively improve itself. We denote this iteratively enhanced version of ARRLS as ARRLS+. We run ARRLS+ on the image datasets and show its classification accuracy in Fig. 4(b); similar results on the other datasets are omitted due to space limitations. We note that ARRLS+ significantly outperforms ARRLS by 10.60%, which verifies that ARTL can naturally integrate base classifiers.
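A minimal sketch of this self-refinement loop is given below, assuming a hypothetical `fit_arrls` solver and an ordinary linear SVM as the base classifier; it is an illustration of the idea rather than the authors' implementation.

```python
from sklearn.svm import LinearSVC

def arrls_plus(Xs, ys, Xt, fit_arrls, n_rounds=5):
    """Iteratively refine target pseudo labels with an adaptive classifier.
    fit_arrls(Xs, ys, Xt, yt_pseudo) is assumed to return a fitted model
    exposing .predict(); it is a placeholder for an ARRLS implementation."""
    # Round 0: pseudo labels from a plain base classifier (e.g., a linear SVM).
    yt_pseudo = LinearSVC().fit(Xs, ys).predict(Xt)
    for _ in range(n_rounds):
        model = fit_arrls(Xs, ys, Xt, yt_pseudo)   # adapt with current pseudo labels
        yt_pseudo = model.predict(Xt)              # feed outputs back as inputs
    return yt_pseudo
```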

5.4.2 Adaptation Regularization
To inspect the effectiveness of each criterion, we run ARRLS on a randomly selected dataset, e.g., rec vs sci 1, removing one term at a time from its objective function.


Fig. 5. Classification predictions Kα and classifier parameters α output by ARRLS on the rec vs sci 1 dataset: (a) C = 0. (b) λ = 0. (c) γ = 0. (d) Optimal parameters.



Fig. 6. Parameter sensitivity study for ARTL on selected datasets (dashed lines show the best baseline results): (a) Nearest neighbors p. (b) Shrinkage regularization σ. (c) MMD regularization λ. (d) Manifold regularization γ.

First, we remove the conditional distribution adaptation term by setting C = 0, as in Fig. 5(a). In this case, we cannot even find a clear decision hyperplane for the target data, i.e., the target data are not well separated at all. This verifies the crucial role played by conditional distribution adaptation. Similar results can be observed in Fig. 5(b), in which we remove the whole distribution adaptation term by setting λ = 0. The similar results for C = 0 and λ = 0 imply that minimizing the difference between the conditional distributions is much more important than minimizing that between the marginal distributions. With conditional distribution adaptation, we can make the intra-class centroids closer and the inter-class centroids more separable, as can be clearly observed in Fig. 5(d).

Secondly, we remove the manifold regularization term by setting γ = 0, as in Fig. 5(c). In this case, the predictions scatter over a wider range than the ground truth [−1, 1]. In other words, the manifold consistency underlying the target data is violated, even though the target data are better separated thanks to the adaptation of both the marginal and conditional distributions. Therefore, to induce a good adaptive classifier using the ARTL framework, it is very important to preserve the manifold consistency. The importance of manifold regularization can be observed by comparing Fig. 5(c) with Fig. 5(d).
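For completeness, the ablation itself is straightforward to script. The sketch below assumes a hypothetical `arrls_accuracy` helper that trains ARRLS with the given regularization weights and returns target-domain accuracy, with a `use_conditional=False` flag standing in for the C = 0 setting.

```python
def run_ablation(arrls_accuracy, Xs, ys, Xt, yt_true,
                 sigma=0.1, lam=10.0, gamma=1.0):
    """Re-run ARRLS with one regularization term removed at a time.
    `arrls_accuracy` is a hypothetical callable; the default weights
    are illustrative only."""
    settings = {
        "C = 0 (no conditional adaptation)":   dict(lam=lam, gamma=gamma, use_conditional=False),
        "lambda = 0 (no distribution adapt.)": dict(lam=0.0, gamma=gamma, use_conditional=True),
        "gamma = 0 (no manifold regulariz.)":  dict(lam=lam, gamma=0.0,   use_conditional=True),
        "full objective":                      dict(lam=lam, gamma=gamma, use_conditional=True),
    }
    for name, kwargs in settings.items():
        acc = arrls_accuracy(Xs, ys, Xt, yt_true, sigma=sigma, **kwargs)
        print(f"{name}: accuracy = {acc:.4f}")
```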

5.5 Parameter Sensitivity
We conduct an empirical parameter sensitivity analysis, which validates that ARTL can achieve optimal performance under a wide range of parameter values. Due to space limitations, we randomly select one generated dataset from each of 20-Newsgroups, Reuters-21578, USPS & MNIST, and PIE, and discuss the results.

#Nearest Neighbors p: We run ARTL with varying values of p. Theoretically, p should be neither too large nor too small, since an extremely dense graph (p → ∞) will connect two examples which are not similar at all, while an extremely sparse graph (p → 0) will capture limited similarity information between examples. We plot the classification accuracy w.r.t. different values of p in Fig. 6(a), which indicates a wide range p ∈ [4, 64] for optimal parameter values.
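For reference, one common way to build the p-nearest-neighbor graph and a normalized graph Laplacian for manifold regularization is sketched below; the symmetrization and normalization choices are assumptions and not necessarily those used in the paper.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X, p=10):
    """p-nearest-neighbor affinity graph and the normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2} (one standard construction)."""
    W = kneighbors_graph(X, n_neighbors=p, mode='connectivity', include_self=False)
    W = W.maximum(W.T)                        # symmetrize the kNN relation
    d = np.asarray(W.sum(axis=1)).ravel()     # node degrees
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return sp.eye(X.shape[0]) - d_inv_sqrt @ W @ d_inv_sqrt
```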

Shrinkage Regularization σ: We run ARTL with varying values of σ. Theoretically, σ controls the model complexity of the adaptive classifier. When σ → 0, the classifier degenerates and overfitting occurs. On the contrary, when σ → ∞, ARTL is dominated by the shrinkage regularization without fitting the input data. We plot the classification accuracy w.r.t. different values of σ in Fig. 6(b), and choose σ ∈ [0.001, 1].

MMD Regularization λ: We run ARTL with varying values of λ. Theoretically, larger values of λ make distribution adaptation more effective. When λ → 0, the distribution difference is not reduced and overfitting occurs. We plot the classification accuracy w.r.t. different values of λ in Fig. 6(c), and can choose λ ∈ [5, 1000].

Manifold Regularization γ: We run ARTL with varying values of γ. Theoretically, larger values of γ make manifold consistency more important in ARTL. When γ → ∞, only the manifold consistency is preserved while the label information is discarded, which reduces to unsupervised learning. We plot the classification accuracy w.r.t. different values of γ in Fig. 6(d), and choose γ ∈ [0.1, 10].
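The curves in Fig. 6 can be reproduced with a simple one-parameter-at-a-time sweep; the `evaluate` callable and the default values below are hypothetical placeholders rather than the paper's settings.

```python
import numpy as np

def sensitivity_sweep(evaluate, defaults=None):
    """Vary one hyperparameter at a time over the ranges discussed above.
    `evaluate(p=..., sigma=..., lam=..., gamma=...)` is a hypothetical
    callable returning target-domain accuracy; defaults are illustrative."""
    defaults = defaults or dict(p=10, sigma=0.1, lam=10.0, gamma=1.0)
    grids = {
        "p":     [4, 8, 16, 32, 64],            # graph nearest neighbors
        "sigma": np.logspace(-3, 0, 7),         # shrinkage regularization
        "lam":   [5, 10, 50, 100, 500, 1000],   # MMD regularization
        "gamma": np.logspace(-1, 1, 5),         # manifold regularization
    }
    results = {}
    for name, values in grids.items():
        results[name] = [(v, evaluate(**{**defaults, name: v})) for v in values]
    return results
```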

5.6 Time Complexity
We empirically check the time cost of all algorithms by running them on the comp vs rec 1 dataset, which has 25,800 features and 8,000 documents, and show the results in Table 6. We see that ARRLS achieves a time cost comparable to that of the baseline methods.

TABLE 6. Time Complexity of ARTL and the Baseline Methods
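Such wall-clock measurements can be collected with a small harness like the following; the method callables are placeholders for the competing implementations.

```python
import time

def time_methods(methods, Xs, ys, Xt):
    """Measure wall-clock training-plus-prediction time of each method.
    `methods` maps a display name to a callable fit_predict(Xs, ys, Xt)
    returning target predictions; the callables are hypothetical."""
    timings = {}
    for name, fit_predict in methods.items():
        start = time.perf_counter()
        fit_predict(Xs, ys, Xt)
        timings[name] = time.perf_counter() - start
    return timings
```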

6 CONCLUSION

In this paper, we proposed a general framework, referred to as Adaptation Regularization based Transfer Learning (ARTL), to address cross-domain learning problems. ARTL aims to simultaneously optimize the structural risk functional, the joint adaptation of both the marginal and conditional distributions, and the manifold consistency. An important advantage of ARTL is that it can explore as many necessary learning objectives as possible, yet still remain simple to implement in practice. Furthermore, many existing supervised learning algorithms, e.g., RLS and SVM, can be readily incorporated into the ARTL framework. ARTL is robust to the distribution difference between domains, and can significantly improve performance on cross-domain text/image classification problems. Extensive experiments on 219 text datasets and 4 image datasets validate that the proposed approach achieves superior performance over state-of-the-art adaptation methods.

ACKNOWLEDGMENTS

This work was supported in part by the National HGJ Key Project (2010ZX01042-002-002), the National High-Tech Development Program (2012AA040911), the National Basic Research Program (2009CB320700), and the National Natural Science Foundation of China (61073005, 61271394). P. S. Yu was supported in part by US NSF through Grants OISE-1129076, CNS-1115234, and DBI-0960443, and in part by the US Department of Army through Grant W911NF-12-1-0066.

REFERENCES

[1] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[2] F. Zhuang et al., “Mining distinction and commonality across multiple domains using generative model for text classification,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 11, pp. 2025–2039, Nov. 2011.
[3] M. Long, J. Wang, G. Ding, D. Shen, and Q. Yang, “Transfer learning with graph co-regularization,” in Proc. 26th AAAI, 2012.
[4] J. Blitzer, R. McDonald, and F. Pereira, “Domain adaptation with structural correspondence learning,” in Proc. 2006 Conf. EMNLP, 2006.
[5] S. J. Pan, X. Ni, J.-T. Sun, Q. Yang, and Z. Chen, “Cross-domain sentiment classification via spectral feature alignment,” in Proc. 19th Int. Conf. WWW, Raleigh, NC, USA, 2010.
[6] Y. Zhu et al., “Heterogeneous transfer learning for image classification,” in Proc. 25th AAAI, 2011.
[7] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele, “What helps where—And why? Semantic relatedness for knowledge transfer,” in Proc. 23rd IEEE Conf. CVPR, 2010.
[8] L. Li, K. Zhou, G.-R. Xue, H. Zha, and Y. Yu, “Video summarization via transferrable structured learning,” in Proc. Int. Conf. WWW, Hyderabad, India, 2011.
[9] B. Li, Q. Yang, and X. Xue, “Transfer learning for collaborative filtering via a rating-matrix generative model,” in Proc. 26th ICML, Montreal, Canada, 2009.
[10] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011.
[11] S. J. Pan, J. T. Kwok, and Q. Yang, “Transfer learning via dimensionality reduction,” in Proc. 22nd AAAI, 2008.
[12] S. Si, D. Tao, and B. Geng, “Bregman divergence-based regularization for transfer subspace learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 7, pp. 929–942, Jul. 2010.
[13] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola, “A kernel method for the two-sample problem,” in Proc. NIPS, 2006.
[14] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive SVMs,” in Proc. 15th ACM MM, Augsburg, Germany, 2007.
[15] B. Quanz and J. Huan, “Large margin transductive transfer learning,” in Proc. 18th ACM CIKM, Hong Kong, China, 2009.
[16] J. Tao, F.-L. Chung, and S. Wang, “On minimum distribution discrepancy support vector machine for domain adaptation,” Pattern Recognit., vol. 45, no. 11, pp. 3962–3984, 2012.
[17] L. Duan, I. W. Tsang, and D. Xu, “Domain transfer multiple kernel learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 3, pp. 465–479, Mar. 2012.
[18] E. Zhong et al., “Cross domain distribution adaptation via kernel mapping,” in Proc. 15th ACM SIGKDD Int. Conf. KDD, Paris, France, 2009.
[19] L. Bruzzone and M. Marconcini, “Domain adaptation problems: A DASVM classification technique and a circular validation strategy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 770–787, May 2010.
[20] M. T. Bahadori, Y. Liu, and D. Zhang, “Learning with minimum supervision: A general framework for transductive transfer learning,” in Proc. 11th IEEE ICDM, 2011.
[21] M. Xiao and Y. Guo, “Semi-supervised kernel matching for domain adaptation,” in Proc. 26th AAAI, 2012.
[22] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Nov. 2006.
[23] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting for transfer learning,” in Proc. 24th ICML, Corvallis, OR, USA, 2007.
[24] J. Jiang and C. Zhai, “Instance weighting for domain adaptation in NLP,” in Proc. 45th ACL, Prague, Czech Republic, 2007.
[25] W. Dai, G.-R. Xue, Q. Yang, and Y. Yu, “Co-clustering based classification for out-of-domain documents,” in Proc. 13th ACM SIGKDD Int. Conf. KDD, San Jose, CA, USA, 2007.
[26] X. Ling, W. Dai, G.-R. Xue, Q. Yang, and Y. Yu, “Spectral domain-transfer learning,” in Proc. 14th ACM SIGKDD Int. Conf. KDD, Las Vegas, NV, USA, 2008.
[27] C. Wang and S. Mahadevan, “Heterogeneous domain adaptation using manifold alignment,” in Proc. 25th AAAI, 2011.
[28] X. Shi, Q. Liu, W. Fan, and P. S. Yu, “Transfer across completely different feature spaces via spectral embedding,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 4, pp. 906–918, Apr. 2013.
[29] B. Quanz, J. Huan, and M. Mishra, “Knowledge transfer with low-quality data: A feature extraction issue,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 10, pp. 1789–1802, Oct. 2012.
[30] B. Chen, W. Lam, I. Tsang, and T.-L. Wong, “Extracting discriminative concepts for domain adaptation in text mining,” in Proc. 15th ACM SIGKDD Int. Conf. KDD, 2009.
[31] H. Daumé III, A. Kumar, and A. Saha, “Co-regularization based semi-supervised domain adaptation,” in Proc. Adv. NIPS, 2010.
[32] A. Argyriou and T. Evgeniou, “Multi-task feature learning,” in Proc. NIPS, 2006.
[33] Q. Liu, X. Liao, and L. Carin, “Semi-supervised multitask learning,” in Proc. Adv. NIPS, 2007.
[34] V. Vapnik, Statistical Learning Theory. Wiley, 1998.
[35] B. Schölkopf, R. Herbrich, and A. J. Smola, “A generalized representer theorem,” in Proc. 14th Annu. Conf. COLT, Amsterdam, The Netherlands, 2001.
[36] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, 2011. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[37] M. Long, J. Wang, G. Ding, W. Cheng, X. Zhang, and W. Wang, “Dual transfer learning,” in Proc. 12th SIAM Int. Conf. SDM, 2012.
[38] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in Proc. Adv. NIPS, 2006.
[39] R. Johnson and T. Zhang, “Graph-based semi-supervised learning and spectral kernel design,” IEEE Trans. Inf. Theory, vol. 54, no. 1, pp. 275–288, Jan. 2008.
[40] J. Gao, W. Fan, J. Jiang, and J. Han, “Knowledge transfer via multiple model local structure mapping,” in Proc. 14th ACM SIGKDD Int. Conf. KDD, 2008.
[41] D. Cai, X. He, J. Han, and T. S. Huang, “Graph regularized nonnegative matrix factorization for data representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[42] D. Cai, X. He, and J. Han, “Spectral regression: A unified approach for sparse subspace learning,” in Proc. IEEE ICDM, Omaha, NE, USA, 2007.

Mingsheng Long received the B.S. degree in 2008 from the Department of Electrical Engineering, Tsinghua University, Beijing, China. He is a Ph.D. candidate in the Department of Computer Science and Technology, Tsinghua University. His current research interests include transfer learning, deep learning, sparse learning, and large-scale data mining.


Jianmin Wang graduated from Peking University, Beijing, China, in 1990, and received the M.E. and Ph.D. degrees in computer software from Tsinghua University, Beijing, China, in 1992 and 1995, respectively. He is now a Professor in the School of Software, Tsinghua University. His current research interests include unstructured data management, workflow and BPM technology, benchmarks for database systems, information system security, and large-scale data analytics. He has published over 100 DBLP-indexed papers in major journals (TKDE, DMKD, DKE, WWWJ, etc.) and conferences (SIGMOD, VLDB, ICDE, CVPR, AAAI, etc.). He led the development of a product data/lifecycle management system, which has been deployed in hundreds of enterprises in China, and leads the development of an unstructured data management system named LaUDMS.

Guiguang Ding received the Ph.D. degree in electronic engineering from Xidian University, Xi’an, China. He is an Associate Professor in the School of Software, Tsinghua University, Beijing, China. His current research interests include multimedia information retrieval and mining, with a specific focus on visual object recognition, automatic semantic annotation, image coding and representation, and social media recommendation. He has published over 40 research papers in international conferences and journals and has applied for 18 patent rights in China.

Sinno Jialin Pan received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology (HKUST), Hong Kong. He is currently the acting Head of the Text Analytics Lab of the Data Analytics Department at the Institute for Infocomm Research (I2R), Singapore. He also holds an adjunct position of Assistant Professor in the Department of Computer Science at the National University of Singapore (NUS), Singapore. His current research interests include transfer learning, active learning, semi-supervised learning, and their applications in text mining, pervasive computing, medical engineering, and software engineering.

Philip S. Yu received the Ph.D. degree in E.E. from Stanford University, Stanford, CA, USA. He is a Distinguished Professor in Computer Science at the University of Illinois at Chicago, Chicago, IL, USA, and holds the Wexler Chair in Information Technology. Dr. Yu is a Fellow of the ACM and the IEEE. He is the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data. He was the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001–2004). He received a Research Contributions Award from the IEEE International Conference on Data Mining (2003) and a Technical Achievement Award from the IEEE Computer Society (2013).
