Neurocomputing 70 (2007) 2931–2939
www.elsevier.com/locate/neucom
Robust self-tuning semi-supervised learning
Fei Wang*, Changshui Zhang
State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing 100084, PR China
Received 16 July 2006; received in revised form 30 October 2006; accepted 4 November 2006
Communicated by S. Choi
Available online 6 December 2006
Abstract
We investigate the issue of graph-based semi-supervised learning (SSL). The labeled and unlabeled data points are represented as
vertices in an undirected weighted neighborhood graph, with the edge weights encoding the pairwise similarities between data objects in
the same neighborhood. The SSL problem can be then formulated as a regularization problem on this graph. In this paper we propose a
robust self-tuning graph-based SSL method, which (1) can determine the similarities between pairwise data points automatically; (2) is
not sensitive to outliers. Promising experimental results are given for both synthetic and real data sets.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Semi-supervised learning; Graph
1. Introduction
In many practical applications of pattern classification and machine learning, one often faces a lack of sufficient labeled data, since labeling often requires expensive human labor. However, in many cases, large numbers of unlabeled data can be far easier to obtain. For example, in web page classification, one may have easy access to a large database of web pages by crawling the web, but only a small part of them are classified by hand. Therefore, the problem of effectively combining unlabeled data with labeled data is of central importance in machine learning.
Consequently, semi-supervised learning (SSL) methods, which aim to learn from partially labeled data, have been proposed [6]. The key to SSL problems is the cluster assumption, which states that two points are likely to have the same class label if there is a path connecting them that passes only through regions of high density [7]. The geometric intuition behind this assumption is two-fold [19]: (1) nearby points are likely to have the same label (local consistency); (2) points on the same structure (such as a cluster or a submanifold) are likely to have the same label (global consistency).
0925-2312/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2006.11.004
*Corresponding author. Tel.: +86 10 62796872; fax: +86 10 62786911.
E-mail address: [email protected] (F. Wang).
It is natural to connect the cluster assumption with the nonlinear dimensionality reduction methods developed in recent years [6], since the central idea of these methods is to construct a low-dimensional global coordinate system for the data set in which the local structure of the data is preserved. It is well known that a graph can be regarded as the discretization of a manifold [3], and graph-based SSL methods have recently become one of the most active research areas in the SSL community [6].

Although graph-based SSL methods have received considerable interest in recent years, some problems have still not been properly addressed. The first is graph construction. As Zhu's literature survey puts it [21], "although the graph is at the heart of these graph-based methods, its construction has not been studied extensively". More concretely, most of these methods [17,19,23] adopt a Gaussian function to compute the edge weights of the graph (i.e., the weight of the edge linking $x_i$ and $x_j$ is $e_{ij} = \exp(-\|x_i - x_j\|^2/(2\sigma^2))$), but the variance $\sigma$ of the Gaussian function affects the classification results significantly. We provide a toy example to illustrate this problem. Fig. 1(a) shows the original data set, which contains a two-moon pattern. On each moon we label only one point. Fig. 1(b) shows the classification result of Zhou's consistency method with $\sigma = 0.1$, and Fig. 1(c) shows the result of the same method with $\sigma = 0.2$. We
Fig. 1. Classification results on the two-moon pattern using the method in [19], a powerful transductive approach operating on a graph whose edge weights are computed by a Gaussian function. (a) Toy data set with two labeled points; (b) classification result with $\sigma = 0.1$; and (c) classification result with $\sigma = 0.2$. We can see that a small variation of $\sigma$ causes a dramatically different classification result.
¹In this paper we focus on the transduction problem; for induction one can refer to the method introduced in [10].
can see that a slight variation of $\sigma$ may cause significantly different results.
Another problem is the robustness of these traditional graph-based methods. Consider the toy example shown in Fig. 2(a), which is the same problem as in Fig. 1(a) except that we add two bridging points between the two moons. Fig. 2(b) shows the classification result of Zhou's method without these bridging points with $\sigma = 0.1$ (which is identical to Fig. 1(b)), and Fig. 2(c) shows the classification result on the data set containing the bridging points (Fig. 2(a)), obtained by the same method with the same parameter setting as in Fig. 2(b). We can see that these bridging points can bias the final classification results severely.
This robustness problem also exists in other graph-based SSL algorithms (such as [1,18,20,23]). The reason why this situation occurs can be easily explained if we regard these approaches as random walk procedures (in fact, most graph-based SSL methods can essentially be understood as random walks [20]), which will be introduced in detail in Section 3.1. Unfortunately, bridging points can also be found in many real-world problems; e.g., in hand-written digit recognition, if we want to distinguish the digit "2" from "3", we may find many "2"s that look like "3"s, with their tails elongated and curved.
To address the above two problems, we propose a novel robust self-tuning graph-based SSL method in this paper. The main advantages of our method are: (1) it can determine the similarities between pairwise data points automatically; (2) it is not sensitive to outliers (including bridging points). Experimental results on both toy and real data sets are provided to show the effectiveness of our method.
The rest of this paper is organized as follows. The basic algorithmic framework is introduced in Section 2. In Section 3, we analyze the robustness of this framework and propose a more robust method. Promising experimental results are given in Section 4, followed by conclusions and future work in Section 5.
2. Basic algorithm framework
We suppose that there is a set of data points $X = \{x_1, \ldots, x_l, \ldots, x_{l+u}\}$ with $x_i \in \mathbb{R}^d$ $(1 \leq i \leq l+u)$, of which $X_L = \{x_1, x_2, \ldots, x_l\}$ are labeled as $t_i \in \mathcal{L}$ $(1 \leq i \leq l$, where $\mathcal{L} = \{1, 2, \ldots, C\}$ is the label set$)$, and the remaining points $X_U = \{x_{l+1}, \ldots, x_{l+u}\}$ are unlabeled. Our task is to predict the labels of $X_U$.¹
Our strategy is to first construct a connected weighted neighborhood graph $G = (V, E)$, where the node set $V$ corresponds to the data set $X = X_L \cup X_U$, and $E$ is the edge set, with a weight $r(e_{ij})$ on each edge $e_{ij} \in E$ (here $r(\cdot)$ is some similarity function). We define a neighborhood system for $X$ as follows.

Definition 1 (Neighborhood system). Let $\mathcal{N} = \{\mathcal{N}_i \mid \forall x_i \in X\}$ be a neighborhood system for $X$, where $\mathcal{N}_i$ is the neighborhood of $x_i$. Then $\mathcal{N}_i$ satisfies: (1) $x_i \notin \mathcal{N}_i$ (self-exclusion); (2) $x_i \in \mathcal{N}_j \Leftrightarrow x_j \in \mathcal{N}_i$ (symmetry).

In this paper, $\mathcal{N}_i$ is defined in the following way: $x_j \in \mathcal{N}_i$ iff $x_j \in \mathcal{K}_i$ or $x_i \in \mathcal{K}_j$, where $\mathcal{K}_i$ is the set containing the $k$ nearest neighbors of $x_i$.

Based on the above definitions, we can construct the graph $G$, in which there is an edge linking nodes $x_i$ and $x_j$ iff $x_j \in \mathcal{N}_i$. Thus we can define an $n \times n$ $(n = l + u)$ weight matrix $W$ for graph $G$, with its $(i,j)$th entry

$$W_{ij} = \begin{cases} r(e_{ij}) & \text{if } x_j \in \mathcal{N}_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
After the graph construction, we then define $C$ functions $\{f^1, f^2, \ldots, f^C\}$ on this graph; the value $f^c(x_i)$ represents the likelihood that $x_i$ belongs to class $c$, and for
Fig. 2. Classification on the two-moon pattern, with some bridging points connecting the two moons. (a) Toy data set with two labeled points; (b) classification result given by Zhou's method [19] without the bridging points; and (c) classification result by Zhou's method on the data set shown in (a); the parameter configuration is the same as in (b).
labeled points, we define

$$f^c(x_i) = \begin{cases} 1 & \text{if } t_i = c, \\ 0 & \text{otherwise,} \end{cases} \quad (1 \leq i \leq l,\ 1 \leq c \leq C), \qquad (2)$$

and these $C$ functions are called classification functions throughout the paper.
2.1. The similarity measure
As stated in Section 1, the graph can be regarded as the discretized form of the data manifold. Thus we should define a proper similarity function to represent the data structure. There are many ways to compute $w_{ij}$ [6,18], and the most popular among them is the typical Gaussian weighting function:

$$w_{ij} = \exp(-\beta \|x_i - x_j\|^2). \qquad (3)$$

However, the choice of $\beta$ affects the final classification results significantly [18,19], and how to determine an optimal $\beta$ is still an open problem.
To avoid the tedious work of tuning an optimal $\beta$, we propose to use the neighborhood information of each point to compute its similarities with other points [18]. For computational convenience, we assume that each data point can be optimally reconstructed by a linear combination of its $k$ nearest neighbors [13]. Hence our objective is to minimize

$$\varepsilon = \sum_i \Big\| x_i - \sum_{j: x_j \in \mathcal{K}_i} w_{ij} x_j \Big\|^2. \qquad (4)$$

Here $w_{ij}$ can be regarded as the contribution of $x_j$ to the reconstruction of $x_i$, and we further constrain $\sum_{j \in \mathcal{K}_i} w_{ij} = 1$, $w_{ij} \geq 0$. Obviously, the more similar $x_j$ is to $x_i$, the larger $w_{ij}$ will be (as an extreme case, when $x_i = x_k \in \mathcal{K}_i$, the optimal solution is $w_{ik} = 1$ and $w_{ij} = 0$ for $j \neq k$, $x_j \in \mathcal{K}_i$). Thus $w_{ij}$ can be used to measure how similar $x_j$ is to $x_i$. One issue that should be noted is that usually $w_{ij} \neq w_{ji}$.
It can be easily shown that

$$\varepsilon_i = \Big\| x_i - \sum_{j: x_j \in \mathcal{K}_i} w_{ij} x_j \Big\|^2 = \Big\| \sum_{j: x_j \in \mathcal{K}_i} w_{ij}(x_i - x_j) \Big\|^2 = \sum_{j,k: x_j, x_k \in \mathcal{K}_i} w_{ij} w_{ik} (x_i - x_j)^{\mathrm T}(x_i - x_k) = \sum_{j,k: x_j, x_k \in \mathcal{K}_i} w_{ij} G^i_{jk} w_{ik}, \qquad (5)$$

where $G^i_{jk} = (x_i - x_j)^{\mathrm T}(x_i - x_k)$ represents the $(j,k)$th entry of the local Gram matrix at point $x_i$. Thus the reconstruction weights of each data object can be solved by the following $n$ standard quadratic programming problems:
$$\min_{w_{ij}} \ \sum_{j,k: x_j, x_k \in \mathcal{K}_i} w_{ij} G^i_{jk} w_{ik} \quad \text{s.t.} \quad \sum_j w_{ij} = 1,\ w_{ij} \geq 0. \qquad (6)$$
Recalling the definition of the neighborhood system introduced at the beginning of Section 2, we can construct the weight matrix $W$ by

$$W_{ij} = w_{ij} + w_{ji}. \qquad (7)$$

Note that $w_{ij} = 0$ if $x_j \notin \mathcal{K}_i$. Intuitively, $W_{ij}$ reflects the similarity between $x_i$ and $x_j$.
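As an illustration, the per-point quadratic program of Eq. (6) and the symmetrization of Eq. (7) can be sketched in a few lines of Python. This is a hedged sketch, not the authors' implementation: the function name `reconstruction_weights` is ours, SciPy's SLSQP solver stands in for whatever QP solver was actually used, and a small ridge term is added to keep the local Gram matrix well conditioned.

```python
import numpy as np
from scipy.optimize import minimize

def reconstruction_weights(X, k=5):
    """Solve the per-point QP of Eq. (6): minimize w^T G w over the
    probability simplex, where G is the local Gram matrix of Eq. (5)."""
    n = X.shape[0]
    W = np.zeros((n, n))
    # squared pairwise distances, used to find the k nearest neighbors
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]       # skip the point itself
        diffs = X[i] - X[nbrs]                  # shape (k, d)
        G = diffs @ diffs.T                     # local Gram matrix G^i
        G += 1e-6 * np.eye(k)                   # ridge for numerical stability
        res = minimize(lambda w: w @ G @ w,
                       np.full(k, 1.0 / k),     # start at uniform weights
                       jac=lambda w: 2 * G @ w,
                       method="SLSQP",
                       bounds=[(0.0, 1.0)] * k,
                       constraints={"type": "eq",
                                    "fun": lambda w: w.sum() - 1.0})
        W[i, nbrs] = res.x
    return W + W.T                              # symmetrize as in Eq. (7)
```

Each local problem is only $k$-dimensional, so the $n$ solves are cheap and embarrassingly parallel.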
2.2. Collaborative label prediction
Having obtained all the pairwise similarities, we now propose a novel scheme to predict the labels of the unlabeled points. More concretely, we assume that the label of an unlabeled data point can be linearly reconstructed from those of its neighbors, which is consistent with the way we compute the pairwise similarities. Mathematically, we should solve the following optimization
problem:

$$\min_{\mathbf{f}^c} \ J(\mathbf{f}^c) = \sum_i \Big| f^c(x_i) - \sum_j \tilde{W}_{ij} f^c(x_j) \Big|^2 \quad \text{s.t.} \quad f^c(X_L) = t_{X_L}, \qquad (8)$$
where $f^c$ is the classification function of the $c$th class, and $\mathbf{f}^c$ is the classification vector of the $c$th class, i.e.

$$\mathbf{f}^c = (f^c(x_1), f^c(x_2), \ldots, f^c(x_l), \ldots, f^c(x_{l+u}))^{\mathrm T}, \qquad (9)$$

and the constraint in Eq. (8) states that we should keep the labels of the labeled points fixed. $\tilde{W}_{ij}$ is the $(i,j)$th entry of the label reconstruction weight matrix $\tilde{W}$. Without loss of generality, we impose that the label reconstruction be convex, i.e. $\tilde{W}_{ij} \geq 0$, $\sum_j \tilde{W}_{ij} = 1$. Based on the geometric intuition, we simply use the row-normalized $W$ matrix as $\tilde{W}$, i.e. $\tilde{W}_{ij} = W_{ij} / \sum_j W_{ij}$.
To solve Eq. (8), we first write $J(\mathbf{f}^c)$ in its matrix form as

$$J(\mathbf{f}^c) = \sum_i \Big| f^c(x_i) - \sum_j \tilde{W}_{ij} f^c(x_j) \Big|^2 = \sum_i \| I_i \mathbf{f}^c - \tilde{W}_i \mathbf{f}^c \|^2 = \sum_i \| (I_i - \tilde{W}_i) \mathbf{f}^c \|^2 = (\mathbf{f}^c)^{\mathrm T} (I - \tilde{W})^{\mathrm T} (I - \tilde{W}) \mathbf{f}^c, \qquad (10)$$

where $I_i$ is the $i$th row of $I$, an $n \times n$ $(n = l + u)$ identity matrix, and $\tilde{W}_i$ is the $i$th row of $\tilde{W}$. Therefore the optimization problem is equivalent to

$$(I - \tilde{W}) \mathbf{f}^c = \mathbf{0} \quad \text{s.t.} \quad f^c(X_L) = t_{X_L}. \qquad (11)$$
Moreover, we can split $\mathbf{f}^c$ and $I - \tilde{W}$ as

$$\mathbf{f}^c = ((\mathbf{f}^c_L)^{\mathrm T}, (\mathbf{f}^c_U)^{\mathrm T})^{\mathrm T}, \quad I - \tilde{W} = \begin{bmatrix} (I - \tilde{W})_{LL} & (I - \tilde{W})_{LU} \\ (I - \tilde{W})_{UL} & (I - \tilde{W})_{UU} \end{bmatrix}. \qquad (12)$$
Combining Eqs. (11) and (12), we can get the labels of the unlabeled points:

$$\mathbf{f}^c_U = (I - \tilde{W}_{UU})^{-1} \tilde{W}_{UL} \mathbf{f}^c_L. \qquad (13)$$

So our algorithm just needs to compute $C$ classification vectors $\{\mathbf{f}^1, \mathbf{f}^2, \ldots, \mathbf{f}^C\}$ and assign $x_u$ the label $t$ satisfying $t = \arg\max_c f^c_u$, where $f^c_u$ represents the $u$th entry of $\mathbf{f}^c$. Note that the computation of these $C$ classification vectors can be carried out in parallel.
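The closed-form transduction of Eq. (13) amounts to one linear solve per class, and all classes can be solved at once by stacking the one-hot label vectors into a matrix. The following sketch is ours (the function name `predict_labels` is not from the paper); it assumes the labeled points come first and labels are integers $0, \ldots, C-1$:

```python
import numpy as np

def predict_labels(W, labels, l):
    """Closed-form transduction of Eq. (13): W is the symmetric similarity
    matrix, `labels` holds the classes of the first l points."""
    n = W.shape[0]
    Wt = W / W.sum(axis=1, keepdims=True)       # row-normalized ~W
    A = np.eye(n - l) - Wt[l:, l:]              # I - ~W_UU
    C = labels.max() + 1
    F_L = np.eye(C)[labels]                     # one-hot labels, shape (l, C)
    # (I - ~W_UU)^{-1} ~W_UL f_L, all C classes in one solve
    F_U = np.linalg.solve(A, Wt[l:, :l] @ F_L)
    return F_U.argmax(axis=1)                   # arg max_c f^c_u
```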
2.3. The regularization framework
A common principle guiding the design of SSL algorithms is that the predicted labels of the data points should be sufficiently smooth with respect to the underlying data structure [6], in accordance with the cluster assumption introduced in Section 1. In this section we show that our algorithm can also be derived from this smoothness regularization framework.
Without loss of generality, we assume that the data points reside (roughly) on a low-dimensional manifold $\mathcal{M}$, and $f^c$ $(1 \leq c \leq C)$ is a classification function defined on $\mathcal{M}$; then the smoothness of $f^c$ over $\mathcal{M}$ can be measured by the following Dirichlet integral [2]:

$$D[f^c] = \frac{1}{2} \int_{\mathcal{M}} \| \nabla f^c \|^2 \, d\mathcal{M}, \qquad (14)$$
and the smoothest $f^c$ that we seek is the one that minimizes $D[f^c]$. On graph $G$, it turns out that the minimization of $D[f^c]$ corresponds to the minimization of the following combinatorial Dirichlet integral [2]:

$$E(\mathbf{f}^c) = \frac{1}{2} \sum_{i,j} W_{ij} (f^c_i - f^c_j)^2, \qquad (15)$$

where $f^c_i = f^c(x_i)$ and $f^c_j = f^c(x_j)$. We can further expand Eq. (15) as
$$E(\mathbf{f}^c) = \frac{1}{2} \sum_{i,j} W_{ij} (f^c_i - f^c_j)^2 = \sum_i d_i (f^c_i)^2 - \sum_{i,j} W_{ij} f^c_i f^c_j = (\mathbf{f}^c)^{\mathrm T} L \mathbf{f}^c, \qquad (16)$$

where $\mathbf{f}^c$ is defined in Eq. (9), $d_i = \sum_j W_{ij}$ is the degree of $x_i$, and $L$ is the combinatorial Laplacian matrix with entries

$$L_{ij} = \begin{cases} d_i & \text{if } i = j, \\ -W_{ij} & \text{if } x_i \in \mathcal{N}_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (17)$$
Therefore, our goal is to find the $f^{c*}$ that minimizes $E(\mathbf{f}^c)$ $(1 \leq c \leq C)$. Using the same technique as in Section 2.2, we can split $\mathbf{f}^c$ and $L$ as

$$\mathbf{f}^c = ((\mathbf{f}^c_L)^{\mathrm T}, (\mathbf{f}^c_U)^{\mathrm T})^{\mathrm T}, \quad L = \begin{bmatrix} L_{LL} & L_{LU} \\ L_{UL} & L_{UU} \end{bmatrix}, \qquad (18)$$

and then, letting $\partial E(\mathbf{f}^c)/\partial \mathbf{f}^c_U = 0$, we get

$$\mathbf{f}^c_U = -L_{UU}^{-1} L_{UL} \mathbf{f}^c_L. \qquad (19)$$
Note that Eq. (19) has a form very similar to Eq. (13). Recalling that the label reconstruction weight matrix in Eq. (13) is just

$$\tilde{W} = D^{-1} W, \qquad (20)$$

where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ $(n = l + u)$ is the degree matrix, and the Laplacian matrix is $L = D - W$, we can transform Eq. (19) into

$$\mathbf{f}^c_U = -L_{UU}^{-1} L_{UL} \mathbf{f}^c_L = (D - W)_{UU}^{-1} W_{UL} \mathbf{f}^c_L = (I - D_{UU}^{-1} W_{UU})^{-1} D_{UU}^{-1} W_{UL} \mathbf{f}^c_L = (I - \tilde{W}_{UU})^{-1} \tilde{W}_{UL} \mathbf{f}^c_L, \qquad (21)$$

which is just the solution in Eq. (13); that is, our method can also be derived from the smoothness regularization framework.
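The equivalence asserted in Eq. (21) is easy to verify numerically: for any symmetric nonnegative $W$, the harmonic solution of Eq. (19) and the label reconstruction solution of Eq. (13) coincide. A small self-contained check, under our own choice of a random $W$ and arbitrary real labels on the first $l$ points:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 12, 3
# random symmetric nonnegative similarity matrix with zero diagonal
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))
L = D - W                                    # combinatorial Laplacian, Eq. (17)
f_L = rng.random((l, 1))                     # arbitrary labels on the first l points

# Eq. (19): harmonic solution f_U = -L_UU^{-1} L_UL f_L (note -L_UL = W_UL)
f_harm = np.linalg.solve(L[l:, l:], W[l:, :l] @ f_L)
# Eq. (13): label reconstruction f_U = (I - ~W_UU)^{-1} ~W_UL f_L
Wt = np.linalg.inv(D) @ W
f_rec = np.linalg.solve(np.eye(n - l) - Wt[l:, l:], Wt[l:, :l] @ f_L)

assert np.allclose(f_harm, f_rec)            # the two derivations coincide
```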
3. Robustness analysis
In this section we first present a random walk view of the basic algorithm introduced in Section 2 and show that it may be sensitive to "bridging points". We then propose a robust SSL algorithm that solves this problem efficiently. Finally, we also present an iterative gradient-based method for learning the hyperparameters of our model.
3.1. Relationship with random walks
Now let us consider the SSL problem from a random walk view. Given the neighborhood graph $G$, we regard its $n = l + u$ vertices as $n$ places, where the first $l$ places hold $C$ types of candies, with only one type per place. Assuming an ant starts at a place with no candies, what is the probability that it first reaches each of the $C$ types of candies? We constrain the ant to crawl only along the edges of $G$, and the weight on an edge, computed by Eq. (7), corresponds to the likelihood that the ant will cross that edge.
It has been previously established [12] that the probability that the ant first reaches a given candy place exactly equals the solution of the Dirichlet problem with boundary conditions at the candy places, where the candy place in question is fixed to unity while the others are set to zero. A Dirichlet problem is to find a harmonic function $f$ that satisfies the Laplace equation

$$\nabla^2 f = 0 \qquad (22)$$

subject to its boundary values. It has also been shown that the harmonic function satisfying the boundary conditions minimizes the Dirichlet integral defined in Eq. (14) [12].
Now let us return to our random walk problem. Assuming the ant starts at place $x_i$, we denote the probability of it getting the candies of type $c$ by $f^c_i$, and set $\mathbf{f}^c = (f^c_1, f^c_2, \ldots, f^c_n)^{\mathrm T}$. Then, from the discussion in Section 2.3, we know that $\mathbf{f}^c$ can be solved by minimizing the following combinatorial Dirichlet integral

$$E(\mathbf{f}^c) = \tfrac{1}{2} (\mathbf{f}^c)^{\mathrm T} L \mathbf{f}^c \qquad (23)$$

subject to

$$f^c_j = \begin{cases} 1 & \text{if } t_j = c, \\ 0 & \text{otherwise,} \end{cases}$$

where $x_j$ is a candy place with candy type $t_j = c$ $(1 \leq c \leq C)$. Since the probabilities of the ant getting all kinds of candies must sum to one, we further constrain $\sum_c f^c_i = 1$. Therefore, the solutions to this random walk problem are identical to the solutions presented in Section 2.3, so the basic algorithm proposed in Section 2 can also be understood as a random walk procedure.
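The correspondence between absorption probabilities and the harmonic solution can be checked on a toy graph by simulation. In this sketch (the graph weights are our own arbitrary choice), nodes 0 and 1 are the two candy places, and a Monte Carlo estimate of the probability of first reaching node 0 is compared with the harmonic solution of Section 2.3:

```python
import numpy as np

rng = np.random.default_rng(1)
# tiny weighted graph: nodes 0 and 1 hold candies of types 0 and 1
W = np.array([[0, 0, 2, 1],
              [0, 0, 1, 2],
              [2, 1, 0, 3],
              [1, 2, 3, 0]], float)
P = W / W.sum(axis=1, keepdims=True)         # edge-crossing probabilities

def first_candy(start, trials=20_000):
    """Monte Carlo estimate of P(ant starting at `start` reaches node 0
    before node 1), crossing each edge with probability proportional
    to its weight."""
    hits = 0
    for _ in range(trials):
        v = start
        while v > 1:                         # nodes 0 and 1 are absorbing
            v = rng.choice(4, p=P[v])
        hits += (v == 0)
    return hits / trials

# harmonic solution with boundary f(0)=1, f(1)=0 (Eq. (19), -L_UL = W_UL)
L = np.diag(W.sum(axis=1)) - W
f_U = np.linalg.solve(L[2:, 2:], W[2:, :2] @ np.array([1.0, 0.0]))
```

For this graph the harmonic value at node 2 is $5/9 \approx 0.556$, and the simulated absorption frequency agrees to within sampling error.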
3.2. Robustness analysis
Let us revisit the problem described in Fig. 2 from the random walk view. Fig. 3(a) illustrates the five-nearest-neighbor graph of the data set shown in Fig. 2(a), which clearly shows that the two moons are "connected" by the two bridging points. From the random walk's view, there is a possibility that an ant starting on the upper moon crawls across these bridging points to get the candies stored at the place represented by the red triangle, and the probability that the ant crosses each edge equals the weight on that edge. Fig. 3(b) shows the final classification result of the basic algorithm in Section 2.

Generally, we regard a point as a bridging point or outlier if and only if it lies in a sparse region; when this region lies between different classes, the point is a bridging point, and when this region lies far away from the data set, the point is usually called an outlier. It can easily be seen that the degree of a data point is large if it is in a region of high density (where its neighborhood is dense) and small if it is in a sparse region.

The robustness of algorithms has been extensively studied in the dimensionality reduction field, e.g. robust subspace learning [9], robust kernel Isomap [8], and robust locally linear embedding [5]. However, little research has been done on the robustness of graph-based semi-supervised learning methods. In the following we propose a scheme to make our method more robust to bridging points and outliers.

An intuitive way to robustify our method is to reduce the edge weights associated with the bridging points (and outliers), making it harder for the ant to cross these edges. Therefore, we propose to calculate the similarity between $x_i$ and $x_j$ by
$$\tilde{w}_{ij} = d_i d_j W_{ij}, \qquad (24)$$
where $W_{ij}$ is the similarity computed by Eq. (7) and $d_i$ is the degree of $x_i$.

Intuitively, Eq. (24) reflects the genuine similarity between $x_i$ and $x_j$ even when outliers exist. If $x_i$ and $x_j$ are both in a high-density region and there is an edge connecting them, then $\tilde{w}_{ij}$ will have a high value. On the other hand, if an edge connecting $x_i$ and $x_j$ involves a low value of either $W_{ij}$ (which indicates $x_i$ and $x_j$ may belong to different clusters) or $d_i$, $d_j$ (which implies $x_i$ or $x_j$ may be an outlier), then $\tilde{w}_{ij}$ will be low.

We define the robust similarity $\tilde{w}$ and use it in place of the similarity computed by Eq. (7) in the basic algorithm presented in Section 2. The resulting approach is called the robust self-tuning semi-supervised learning (RS3L) method throughout the paper, for the following reasons: (1) we adopt a robust similarity measure; (2) we need not do the tedious work of tuning the free parameter in Eq. (3). To show its effectiveness, we also apply RS3L to
Fig. 3. Classification result on the toy data set shown in Fig. 2(a) using the basic algorithm proposed in Section 2. (a) shows the five nearest neighbor
graph constructed on this data set, with the bridging points denoted by green filled circles. (b) shows the classification result using the basic algorithm in
Section 2.
Fig. 4. Classification result by our robust self-tuning semi-supervised learning method on the two-moon toy data set. (a) The original data set which is
identical to Fig. 2(a). (b) shows the classification result by our RS3L method.
Table 1
Robust self-tuning semi-supervised learning

Input: data set $X = X_L \cup X_U$ from $C$ classes, where $X_L$ is the labeled set and $X_U$ is the unlabeled set; the number of nearest neighbors $k$.
Output: the labels of all the data points.
1. Construct the neighborhood graph by solving Eq. (6) for the reconstruction weights of each data object from its $k$ nearest neighbors.
2. Compute the pairwise similarities $\tilde{w}_{ij}$ by Eq. (24), and compute the degree $\tilde{d}_i = \sum_j \tilde{w}_{ij}$ for each data point.
3. Construct the combinatorial Laplacian matrix using Eq. (17), replacing $W_{ij}$ and $d_i$ with $\tilde{w}_{ij}$ and $\tilde{d}_i$.
4. Solve the classification vector $\mathbf{f}^c$ for each class $c$ $(1 \leq c \leq C)$ by Eq. (19). Output the label $t$ for $x_u$ by $t = \arg\max_c f^c_u$.
solve the problem shown in Fig. 4(a) with $k = 5$; the result is given in Fig. 4(b), which agrees well with human judgement. The main procedure of RS3L is summarized in Table 1.
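Steps 2-4 of Table 1 (reweight, build the Laplacian, solve Eq. (19)) can be sketched as follows. The function names are ours, and the input `W` is assumed to be the symmetric similarity matrix of Eq. (7) with the labeled points ordered first:

```python
import numpy as np

def rs3l_reweight(W):
    """Robust reweighting of Eq. (24): scale each edge by the degrees of
    its endpoints, so that edges touching low-degree points (bridging
    points, outliers) shrink relative to edges inside dense regions."""
    d = W.sum(axis=1)
    return d[:, None] * d[None, :] * W       # ~w_ij = d_i d_j W_ij

def rs3l_predict(W, labels, l):
    """Steps 2-4 of Table 1: reweight, build the Laplacian with the new
    degrees ~d_i, and solve the harmonic system of Eq. (19)."""
    Wt = rs3l_reweight(W)
    Lap = np.diag(Wt.sum(axis=1)) - Wt       # Laplacian on ~w, ~d
    C = labels.max() + 1
    F_L = np.eye(C)[labels]                  # one-hot labels of X_L
    F_U = np.linalg.solve(Lap[l:, l:], Wt[l:, :l] @ F_L)
    return F_U.argmax(axis=1)
```

Step 1 (the reconstruction-weight QP) is the per-point problem of Eq. (6) and is omitted here; any simplex-constrained QP solver can supply `W`.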
4. Experiments
In this section, we present a set of experiments in which RS3L was used for semi-supervised classification, including toy examples, digit recognition and text classification.
4.1. Toy examples
Although traditional graph-based semi-supervised classification can perform very well on data sets with manifold structure [1,19,23], it is no longer robust enough to give satisfactory results when some noise points are added. This can easily be observed in the toy example of Fig. 2. In this subsection we show another synthetic example.

As shown in Fig. 5(a), the toy data set consists of two circular clusters with some bridging points connecting them. Initially, we label only two data points, one in each circle,
Fig. 5. Classification on the two circular patterns, with some bridging points connecting them. (a) The original data set with two labeled points; (b) classification result given by Zhu's harmonic Gaussian field method [23]; (c) classification result given by Zhou's consistency method [19]; (d) classification result by Belkin's Tikhonov regularization method [1]; (e) classification result by the basic algorithm introduced in Section 2; and (f) classification result by our RS3L method.
2http://www.kernel-machines.org/data.html.
and our goal is to predict the labels of the remaining points. Figs. 5(b)-(d) show the classification results of some traditional graph-based semi-supervised learning methods, whose free parameters were adjusted so that they achieved their highest classification accuracies. Fig. 5(e) is the classification result of our basic algorithm introduced in Section 2, with the number of nearest neighbors $k = 5$. Fig. 5(f) shows the classification result obtained by the RS3L method, with $k$ also set to 5.
It can be observed from Figs. 5(b)-(e) that the traditional graph-based methods, as well as the basic algorithm we proposed in Section 2, fail to find the two circular patterns due to the existence of the bridging points. This can be easily explained from the random walk viewpoint presented in Section 3.1, since the random walker may walk across the edges linking the two circular patterns and acquire a wrong label. However, using the robust similarity in Eq. (24), our RS3L algorithm gives a much more satisfactory result, since the weights of the edges containing the bridging points are reduced, which makes it harder for the random walker to walk from one circle to the other.
It is interesting that among the three traditional graph-based methods (Figs. 5(b)-(d)), Zhou's consistency approach [19] seems the most sensitive to the bridging points (Fig. 5(c)). We believe this is because the smoothness matrix used in Zhou's method is the normalized Laplacian matrix [1], which is equivalent to using $w_{ij} / \sqrt{d_i d_j}$ as the similarity measure between $x_i$ and $x_j$ [4]. Therefore, the weights on the edges containing bridging points and outliers are relatively enlarged due to their small degrees, which causes the random walker to run across these edges more easily and get wrong labels.
4.2. Digits recognition
In this case study, we focus on the problem of classifying hand-written digits. The data set we adopt is the USPS² hand-written $16 \times 16$ digits data set. The images of digits 1, 2, 3 and 4 are used in this experiment as four classes, containing 1269, 929, 824 and 852 examples respectively, for a total of 3874.

We used a nearest neighbor classifier and one-vs-rest SVMs [14] as baselines. The width of the RBF kernel for the SVM was set to 5. In RS3L, the number of nearest neighbors $k$
Fig. 6. Digit recognition on the USPS data set. (a) shows the recognition accuracies of different algorithms on a subset only containing digits ‘‘2’’ and ‘‘3’’.
(b) shows the recognition accuracies of different algorithms on the total data set containing all four digits. In both figures, the abscissa represents the
number of randomly labeled data in the data set (we guarantee that there is at least one labeled point in each class), and the ordinate is the total recognition
accuracy value averaged over 50 independent runs.
Fig. 7. Text classification on the 20newgroup data set. The abscissa
represents the number of randomly labeled data in the data set (we
guarantee that there is at least one labeled point in each class), and the
ordinate is the total recognition accuracy value averaged over 50
independent runs.
was set to 5 when constructing the graph. For comparison, we also provide the classification results achieved by Zhou et al.'s consistency method [19] and Zhu et al.'s Gaussian fields approach [23]. The affinity matrices in both methods were constructed by a Gaussian function with variance 1.25. All these parameters were set by five-fold cross validation. Note that the diagonal elements of the affinity matrix in Zhou's consistency method were set to 0. The recognition accuracies averaged over 50 independent trials are summarized in Fig. 6.

Fig. 6(a) illustrates the recognition accuracies of the different algorithms on a two-class task that aims at discriminating digits "2" and "3"; Fig. 6(b) provides the classification results of those algorithms on the multi-class task of discriminating all four digits. The effectiveness of our RS3L method can easily be seen in both figures.
4.3. Text classification
In this experiment, we address the task of text classification using the 20 newsgroups data set.³ The topic rec, containing autos, motorcycles, baseball and hockey, was selected from the version 20news-18828. The articles were preprocessed by the same procedure as in [19]. The resulting 3970 document vectors were all 8014-dimensional, and were normalized into the TFIDF representation.

We use the inner-product distance to find the $k$ nearest neighbors when constructing the neighborhood graph in RS3L, i.e. $d(x_i, x_j) = 1 - x_i^{\mathrm T} x_j / (\|x_i\| \|x_j\|)$, where $x_i$ and $x_j$ are document vectors, and the value of $k$ is set to 10. For Zhou's consistency and Zhu's Gaussian fields (GF) methods, the affinity matrices were all computed by $W_{ij} = \exp(-(1/(2\sigma^2))(1 - x_i^{\mathrm T} x_j / (\|x_i\| \|x_j\|)))$ with $\sigma = 0.15$.

³http://people.csail.mit.edu/jrennie/20Newsgroups/.

The SVM and nearest neighbor classifiers also served as baseline algorithms, and the width of the RBF kernel in the SVM was set to 1.5. All these parameters were also set by five-fold cross validation. The classification accuracies are summarized in Fig. 7, from which we can clearly see the advantage of our RS3L method.
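The inner-product distance above is simply a cosine distance on the TFIDF vectors. A minimal sketch (the function name is ours) of the $k$-nearest-neighbor search under this distance:

```python
import numpy as np

def cosine_knn(X, k=10):
    """Inner-product distance of Section 4.3:
    d(x_i, x_j) = 1 - x_i^T x_j / (||x_i|| ||x_j||).
    Returns the indices of each row's k nearest neighbors."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T                     # pairwise cosine distances
    np.fill_diagonal(D, np.inf)             # exclude self-matches
    return np.argsort(D, axis=1)[:, :k]
```

For TFIDF vectors, which are nonnegative, this distance lies in $[0, 1]$ and depends only on document direction, not length.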
5. Conclusions and discussions
In this paper we propose a novel semi-supervised learning algorithm called robust self-tuning semi-supervised learning (RS3L). The main advantages of our method are: (1) it computes the similarities between pairwise data points automatically, i.e. in closed form; (2) it is not sensitive to outliers. Experimental results on both synthetic and real data sets are presented to show the effectiveness of our method. In future work, we will focus on theoretical analysis and acceleration of the RS3L algorithm.
References
[1] M. Belkin, I. Matveeva, P. Niyogi, Regularization and semi-
supervised learning on large graphs, in: Proceedings of the 17th
COLT, 2004.
[2] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction
and data representation, Neural Comput. 15 (2003) 1373–1396.
[3] M. Belkin, P. Niyogi, Semi-supervised learning on Riemannian
manifolds, Mach. Learn. 56 (2004) 209–239.
[4] Y. Bengio, J. Paiement, P. Vincent, Out-of-sample extensions for LLE,
Isomap, MDS, eigenmaps, and spectral clustering, in: NIPS03, 2003.
[5] H. Chang, D.Y. Yeung, Robust locally linear embedding, Pattern
Recognition 39 (6) (2006) 1053–1065.
[6] O. Chapelle, B. Scholkopf, A. Zien (Eds.), Semi-supervised
Learning, MIT Press, Cambridge, MA, 2006.
[7] O. Chapelle, J. Weston, B. Scholkopf, Cluster kernels for semi-
supervised learning, NIPS 15, 2003.
[8] H. Choi, S. Choi, Kernel Isomap on noisy manifold, in: Proceedings
of the IEEE International Conference on Development and Learning
(ICDL), Osaka, Japan, July 19–21, 2005, pp. 208–213.
[9] F. De la Torre, M.J. Black, A framework for robust subspace
learning, Int. J. Comput. Vision 54 (1–3) (2003) 117–142.
[10] O. Delalleau, Y. Bengio, N. Le Roux, Non-parametric function
induction in semi-supervised learning, in: Proceedings of the 10th
AISTATS, 2005.
[12] S. Kakutani, Markov process and the Dirichlet problem, Proc. Jpn.
Acad. 21 (1945) 227–233.
[13] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by
locally linear embedding, Science 290 (2000) 2323–2326.
[14] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press,
Cambridge, MA, 2002.
[17] M. Szummer, T. Jaakkola, Partially labeled classification with
Markov random walks. Adv. Neural Inf. Process. Syst. 14 (2002).
[18] F. Wang, C. Zhang, Label propagation through linear neighbor-
hoods, in: Proceedings of the 23rd ICML, 2006, to appear.
[19] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Scholkopf, Learning
with local and global consistency, NIPS 16 (2004).
[20] D. Zhou, B. Scholkopf, Learning from labeled and unlabeled data
using random walks, in: Pattern Recognition, Proceedings of the 26th
DAGM, 2004.
[21] X. Zhu, Semi-supervised learning literature survey. Computer
Sciences Technical Report 1530, University of Wisconsin-Madison,
2005.
[23] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using
Gaussian fields and harmonic functions, in: Proceedings of the 20th
ICML, 2003.
Fei Wang is a fourth-year Ph.D. candidate in the Department of Automation, Tsinghua University, Beijing, PR China. He has published several papers in top conferences on machine learning and pattern recognition, such as CVPR and ICML.
Changshui Zhang is a professor in the Department of Automation, Tsinghua University, Beijing, PR China. He is an associate editor of the Pattern Recognition journal.