Neurocomputing 70 (2007) 2931–2939
www.elsevier.com/locate/neucom
Robust self-tuning semi-supervised learning
Fei Wang*, Changshui Zhang
State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing 100084, PR China
Received 16 July 2006; received in revised form 30 October 2006; accepted 4 November 2006
Communicated by S. Choi
Available online 6 December 2006
Abstract
We investigate the issue of graph-based semi-supervised learning (SSL). The labeled and unlabeled data points are represented as
vertices in an undirected weighted neighborhood graph, with the edge weights encoding the pairwise similarities between data objects in
the same neighborhood. The SSL problem can be then formulated as a regularization problem on this graph. In this paper we propose a
robust self-tuning graph-based SSL method, which (1) can determine the similarities between pairwise data points automatically; (2) is
not sensitive to outliers. Promising experimental results are given for both synthetic and real data sets.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Semi-supervised learning; Graph
1. Introduction
In many practical applications of pattern classification and machine learning, one often faces a lack of sufficient labeled data, since labeling often requires expensive human labor. However, in many cases, large numbers of unlabeled data can be far easier to obtain. For example, in web page classification, one may have easy access to a large database of web pages by crawling the web, but only a small part of them are classified by hand. Therefore, the problem of effectively combining unlabeled data with labeled data is of central importance in machine learning.
Consequently, semi-supervised learning (SSL) methods, which aim to learn from partially labeled data, have been proposed [6]. The key to SSL problems is the cluster assumption, which states that two points are likely to have the same class label if there is a path connecting them that passes only through regions of high density [7]. The geometric intuition behind this assumption is two-fold [19]: (1) nearby points are likely to have the same label (local consistency); (2) points on the same structure (such as a cluster or a submanifold) are likely to have the same label (global consistency).
0925-2312/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2006.11.004
*Corresponding author. Tel.: +86 10 62796872; fax: +86 10 62786911.
E-mail address: [email protected] (F. Wang).
It is natural to connect the cluster assumption with the nonlinear dimensionality reduction methods developed in recent years [6], since the central idea of these methods is to construct a low-dimensional global coordinate system for the data set in which the local structure of the data is preserved. It is well known that a graph can be regarded as the discretization of a manifold [3], and graph-based SSL methods have recently become one of the most active research areas in the SSL community [6].

Although graph-based SSL methods have received considerable interest in recent years, some problems have still not been properly addressed. The first is graph construction. As Zhu's literature survey puts it [21], "although the graph is at the heart of these graph-based methods, its construction has not been studied extensively". More concretely, most of these methods [17,19,23] adopt a Gaussian function to compute the edge weights of the graph (i.e., the weight of the edge linking $x_i$ and $x_j$ is $e_{ij} = \exp(-\|x_i - x_j\|^2/(2\sigma^2))$), but the variance $\sigma$ of the Gaussian function affects the classification results significantly. We provide a toy example to illustrate this problem. Fig. 1(a) shows the original data set, which contains a two-moon pattern. On each moon we label only one point. Fig. 1(b) shows the classification result of Zhou's consistency method with $\sigma = 0.1$, and Fig. 1(c) shows the result of the same method with $\sigma = 0.2$. We
Fig. 1. Classification results on the two-moon pattern using the method in [19], a powerful transductive approach operating on a graph whose edge weights are computed by a Gaussian function. (a) Toy data set with two labeled points; (b) classification result with $\sigma = 0.1$; and (c) classification result with $\sigma = 0.2$. We can see that a small variation of $\sigma$ causes a dramatically different classification result.
¹In this paper we focus on the transduction problem; for induction one can refer to the method introduced in [10].
can see that a slight variation of $\sigma$ may cause significantly different results.
Another problem is the robustness of these traditional graph-based methods. Consider the toy example shown in Fig. 2(a), which is the same problem as in Fig. 1(a) except that we add two bridging points between the two moons. Fig. 2(b) shows the classification result of Zhou's method without these bridging points with $\sigma = 0.1$ (which is identical to Fig. 1(b)), and Fig. 2(c) shows the classification result on the data set containing the bridging points (Fig. 2(a)), obtained by the same method with the same parameter setting as in Fig. 2(b). We can see that these bridging points can bias the final classification results severely.
This robustness problem also exists in other graph-based SSL algorithms (such as [1,18,20,23]). The reason why this situation occurs can be easily explained if we regard these approaches as random walk procedures (in fact, most graph-based SSL methods can essentially be understood as random walks [20]), which will be introduced in detail in Section 3.1. Unfortunately, bridging points can also be found in many real-world problems; e.g., in hand-written digit recognition, if we want to distinguish the digit "2" from "3", we may find many "2"s that look like "3"s, with their tails elongated and curved.
To address the above two problems, we propose a novel robust self-tuning graph-based SSL method in this paper. The main advantages of our method are: (1) it can determine the similarities between pairwise data points automatically; (2) it is not sensitive to outliers (including bridging points). Experimental results on both toy and real data sets are provided to show the effectiveness of our method.
The rest of this paper is organized as follows. The basic algorithmic framework is introduced in Section 2. In Section 3, we analyze the robustness of this framework and propose a more robust method. Promising experimental results are given in Section 4, followed by conclusions and future work in Section 5.
2. Basic algorithm framework
We suppose that there is a set of data points $X = \{x_1, \ldots, x_l, \ldots, x_{l+u}\}$ with $x_i \in \mathbb{R}^d$ $(1 \leq i \leq l+u)$, of which $X_L = \{x_1, x_2, \ldots, x_l\}$ are labeled as $t_i \in \mathcal{L}$ $(1 \leq i \leq l$, where $\mathcal{L} = \{1, 2, \ldots, C\}$ is the label set$)$, and the remaining points $X_U = \{x_{l+1}, \ldots, x_{l+u}\}$ are unlabeled. Our task is to predict the labels of $X_U$.¹
Our strategy is to first construct a connected weighted neighborhood graph $G = (V, E)$, where the node set $V$ corresponds to the data set $X = X_L \cup X_U$, and $E$ is the edge set, with a weight $r(e_{ij})$ on each edge $e_{ij} \in E$ (here $r(\cdot)$ is some similarity function). We define a neighborhood system for $X$ as follows.

Definition 1 (Neighborhood system). Let $\mathcal{N} = \{\mathcal{N}_i \mid \forall x_i \in X\}$ be a neighborhood system for $X$, where $\mathcal{N}_i$ is the neighborhood of $x_i$. Then $\mathcal{N}_i$ satisfies: (1) $x_i \notin \mathcal{N}_i$ (self-exclusion); (2) $x_i \in \mathcal{N}_j \Leftrightarrow x_j \in \mathcal{N}_i$ (symmetry).

In this paper, $\mathcal{N}_i$ is defined in the following way: $x_j \in \mathcal{N}_i$ iff $x_j \in \mathcal{K}_i$ or $x_i \in \mathcal{K}_j$, where $\mathcal{K}_i$ is the set containing the $k$ nearest neighbors of $x_i$.

Based on the above definitions, we can construct the graph $G$, in which there is an edge linking nodes $x_i$ and $x_j$ iff $x_j \in \mathcal{N}_i$. Thus we can define an $n \times n$ $(n = l + u)$ weight matrix $W$ for graph $G$, with its $(i,j)$th entry

$$W_{ij} = \begin{cases} r(e_{ij}) & \text{if } x_j \in \mathcal{N}_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
After the graph construction, we then define $C$ functions $\{f^1, f^2, \ldots, f^C\}$ on this graph; the value $f^c(x_i)$ represents the likelihood that $x_i$ belongs to class $c$, and for
Fig. 2. Classification on the two-moon pattern, with some bridging points connecting the two moons. (a) Toy data set with two labeled points; (b) classification result given by Zhou's method [19] without the bridging points; and (c) classification result by Zhou's method on the data set shown in (a); the parameter configuration is the same as in (b).
labeled points, we define

$$f^c(x_i) = \begin{cases} 1 & \text{if } t_i = c, \\ 0 & \text{otherwise,} \end{cases} \quad (1 \leq i \leq l,\ 1 \leq c \leq C), \qquad (2)$$

and these $C$ functions are called classification functions throughout the paper.
2.1. The similarity measure
As stated in Section 1, the graph can be regarded as the discretized form of the data manifold. Thus we should define a proper similarity function to represent the data structure. There are many ways to compute $w_{ij}$ [6,18], and the most popular among them is the typical Gaussian weighting function:

$$w_{ij} = \exp(-\beta \|x_i - x_j\|^2). \qquad (3)$$

However, the choice of $\beta$ affects the final classification results significantly [18,19], and how to determine an optimal $\beta$ is still an open problem.
To avoid the tedious work of tuning an optimal $\beta$, we propose to use the neighborhood information of each point to compute its similarities with other points [18]. For computational convenience, we assume that each data point can be optimally reconstructed by a linear combination of its $k$ nearest neighbors [13]. Hence our objective is to minimize

$$\varepsilon = \sum_i \Big\| x_i - \sum_{j: x_j \in \mathcal{K}_i} w_{ij} x_j \Big\|^2. \qquad (4)$$

Here $w_{ij}$ can be regarded as the contribution of $x_j$ to the reconstruction of $x_i$, and we further constrain $\sum_{j \in \mathcal{K}_i} w_{ij} = 1$, $w_{ij} \geq 0$. Obviously, the more similar $x_j$ is to $x_i$, the larger $w_{ij}$ will be (as an extreme case, when $x_i = x_k \in \mathcal{K}_i$, the optimal solution is $w_{ik} = 1$ and $w_{ij} = 0$ for $j \neq k$, $x_j \in \mathcal{K}_i$). Thus $w_{ij}$ can be used to measure how similar $x_j$ is to $x_i$. One issue that should be noted is that usually $w_{ij} \neq w_{ji}$.
It can be easily shown that

$$\varepsilon_i = \Big\| x_i - \sum_{j: x_j \in \mathcal{K}_i} w_{ij} x_j \Big\|^2 = \Big\| \sum_{j: x_j \in \mathcal{K}_i} w_{ij}(x_i - x_j) \Big\|^2 = \sum_{j,k: x_j, x_k \in \mathcal{K}_i} w_{ij} w_{ik} (x_i - x_j)^{\mathrm T}(x_i - x_k) = \sum_{j,k: x_j, x_k \in \mathcal{K}_i} w_{ij} G^i_{jk} w_{ik}, \qquad (5)$$

where $G^i_{jk} = (x_i - x_j)^{\mathrm T}(x_i - x_k)$ represents the $(j,k)$th entry of the local Gram matrix at point $x_i$. Thus the reconstruction weights of each data object can be solved by the following $n$ standard quadratic programming problems:
$$\min_{w_{ij}} \ \sum_{j,k: x_j, x_k \in \mathcal{K}_i} w_{ij} G^i_{jk} w_{ik} \quad \text{s.t.} \quad \sum_j w_{ij} = 1,\ w_{ij} \geq 0. \qquad (6)$$
Recalling the definition of the neighborhood system introduced at the beginning of Section 2, we can construct the weight matrix $W$ by

$$W_{ij} = w_{ij} + w_{ji}. \qquad (7)$$

Note that $w_{ij} = 0$ if $x_j \notin \mathcal{K}_i$. Intuitively, $W_{ij}$ reflects the similarity between $x_i$ and $x_j$.
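As an illustration, the per-point quadratic program of Eq. (6) and the symmetrization of Eq. (7) can be sketched in a few lines of Python. This is a hedged sketch, not the authors' implementation: the function name `reconstruction_weights` is ours, SciPy's SLSQP solver stands in for whatever QP solver was actually used, and a small ridge term is added to keep the local Gram matrix well conditioned.

```python
import numpy as np
from scipy.optimize import minimize

def reconstruction_weights(X, k=5):
    """Solve the per-point QP of Eq. (6): minimize w^T G w over the
    probability simplex, where G is the local Gram matrix of Eq. (5)."""
    n = X.shape[0]
    W = np.zeros((n, n))
    # squared pairwise distances, used to find the k nearest neighbors
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]       # skip the point itself
        diffs = X[i] - X[nbrs]                  # shape (k, d)
        G = diffs @ diffs.T                     # local Gram matrix G^i
        G += 1e-6 * np.eye(k)                   # ridge for numerical stability
        res = minimize(lambda w: w @ G @ w,
                       np.full(k, 1.0 / k),     # start at uniform weights
                       jac=lambda w: 2 * G @ w,
                       method="SLSQP",
                       bounds=[(0.0, 1.0)] * k,
                       constraints={"type": "eq",
                                    "fun": lambda w: w.sum() - 1.0})
        W[i, nbrs] = res.x
    return W + W.T                              # symmetrize as in Eq. (7)
```

Each local problem is only $k$-dimensional, so the $n$ solves are cheap and embarrassingly parallel.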
2.2. Collaborative label prediction
Having obtained all the pairwise similarities, we now propose a novel scheme to predict the labels of the unlabeled points. More concretely, we assume that the label of an unlabeled data point can be linearly reconstructed from those of its neighbors, which is consistent with the way we compute the pairwise similarities. Mathematically, we should solve the following optimization
problem:

$$\min_{\mathbf{f}^c} \ J(\mathbf{f}^c) = \sum_i \Big| f^c(x_i) - \sum_j \tilde{W}_{ij} f^c(x_j) \Big|^2 \quad \text{s.t.} \quad f^c(X_L) = t_{X_L}, \qquad (8)$$
where $f^c$ is the classification function of the $c$th class, and $\mathbf{f}^c$ is the classification vector of the $c$th class, i.e.

$$\mathbf{f}^c = (f^c(x_1), f^c(x_2), \ldots, f^c(x_l), \ldots, f^c(x_{l+u}))^{\mathrm T}, \qquad (9)$$

and the constraint in Eq. (8) states that we should keep the labels of the labeled points fixed. $\tilde{W}_{ij}$ is the $(i,j)$th entry of the label reconstruction weight matrix $\tilde{W}$. Without loss of generality, we impose that the label reconstruction be convex, i.e. $\tilde{W}_{ij} \geq 0$, $\sum_j \tilde{W}_{ij} = 1$. Based on the geometric intuition, we simply use the row-normalized $W$ matrix as $\tilde{W}$, i.e. $\tilde{W}_{ij} = W_{ij} / \sum_j W_{ij}$.
To solve Eq. (8), we first write $J(\mathbf{f}^c)$ in its matrix form as

$$J(\mathbf{f}^c) = \sum_i \Big| f^c(x_i) - \sum_j \tilde{W}_{ij} f^c(x_j) \Big|^2 = \sum_i \| I_i \mathbf{f}^c - \tilde{W}_i \mathbf{f}^c \|^2 = \sum_i \| (I_i - \tilde{W}_i) \mathbf{f}^c \|^2 = (\mathbf{f}^c)^{\mathrm T} (I - \tilde{W})^{\mathrm T} (I - \tilde{W}) \mathbf{f}^c, \qquad (10)$$

where $I_i$ is the $i$th row of $I$, an $n \times n$ $(n = l + u)$ identity matrix, and $\tilde{W}_i$ is the $i$th row of $\tilde{W}$. Therefore the optimization problem is equivalent to

$$(I - \tilde{W}) \mathbf{f}^c = \mathbf{0} \quad \text{s.t.} \quad f^c(X_L) = t_{X_L}. \qquad (11)$$
Moreover, we can split $\mathbf{f}^c$ and $I - \tilde{W}$ as

$$\mathbf{f}^c = ((\mathbf{f}^c_L)^{\mathrm T}, (\mathbf{f}^c_U)^{\mathrm T})^{\mathrm T}, \quad I - \tilde{W} = \begin{bmatrix} (I - \tilde{W})_{LL} & (I - \tilde{W})_{LU} \\ (I - \tilde{W})_{UL} & (I - \tilde{W})_{UU} \end{bmatrix}. \qquad (12)$$
Combining Eqs. (11) and (12), we can get the labels of the unlabeled points:

$$\mathbf{f}^c_U = (I - \tilde{W}_{UU})^{-1} \tilde{W}_{UL} \mathbf{f}^c_L. \qquad (13)$$

So our algorithm just needs to compute $C$ classification vectors $\{\mathbf{f}^1, \mathbf{f}^2, \ldots, \mathbf{f}^C\}$ and assign $x_u$ the label $t$ satisfying $t = \arg\max_c f^c_u$, where $f^c_u$ represents the $u$th entry of $\mathbf{f}^c$. Note that the computation of these $C$ classification vectors can be carried out in parallel.
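The closed-form transduction of Eq. (13) amounts to one linear solve per class, and all classes can be solved at once by stacking the one-hot label vectors into a matrix. The following sketch is ours (the function name `predict_labels` is not from the paper); it assumes the labeled points come first and labels are integers $0, \ldots, C-1$:

```python
import numpy as np

def predict_labels(W, labels, l):
    """Closed-form transduction of Eq. (13): W is the symmetric similarity
    matrix, `labels` holds the classes of the first l points."""
    n = W.shape[0]
    Wt = W / W.sum(axis=1, keepdims=True)       # row-normalized ~W
    A = np.eye(n - l) - Wt[l:, l:]              # I - ~W_UU
    C = labels.max() + 1
    F_L = np.eye(C)[labels]                     # one-hot labels, shape (l, C)
    # (I - ~W_UU)^{-1} ~W_UL f_L, all C classes in one solve
    F_U = np.linalg.solve(A, Wt[l:, :l] @ F_L)
    return F_U.argmax(axis=1)                   # arg max_c f^c_u
```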
2.3. The regularization framework
A common principle guiding the design of SSL algorithms is that the predicted labels of the data points should be sufficiently smooth with respect to the underlying data structure [6], in accordance with the cluster assumption introduced in Section 1. In this section we show that our algorithm can also be derived from this smoothness regularization framework.
Without loss of generality, we assume that the data points reside (roughly) on a low-dimensional manifold $\mathcal{M}$, and $f^c$ $(1 \leq c \leq C)$ is a classification function defined on $\mathcal{M}$; then the smoothness of $f^c$ over $\mathcal{M}$ can be measured by the following Dirichlet integral [2]:

$$D[f^c] = \frac{1}{2} \int_{\mathcal{M}} \| \nabla f^c \|^2 \, d\mathcal{M}, \qquad (14)$$
and the smoothest $f^c$ that we seek is the one that minimizes $D[f^c]$. On graph $G$, it turns out that the minimization of $D[f^c]$ corresponds to the minimization of the following combinatorial Dirichlet integral [2]:

$$E(\mathbf{f}^c) = \frac{1}{2} \sum_{i,j} W_{ij} (f^c_i - f^c_j)^2, \qquad (15)$$

where $f^c_i = f^c(x_i)$ and $f^c_j = f^c(x_j)$. We can further expand Eq. (15) as
$$E(\mathbf{f}^c) = \frac{1}{2} \sum_{i,j} W_{ij} (f^c_i - f^c_j)^2 = \sum_i d_i (f^c_i)^2 - \sum_{i,j} W_{ij} f^c_i f^c_j = (\mathbf{f}^c)^{\mathrm T} L \mathbf{f}^c, \qquad (16)$$

where $\mathbf{f}^c$ is defined in Eq. (9), $d_i = \sum_j W_{ij}$ is the degree of $x_i$, and $L$ is the combinatorial Laplacian matrix with entries

$$L_{ij} = \begin{cases} d_i & \text{if } i = j, \\ -W_{ij} & \text{if } x_i \in \mathcal{N}_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (17)$$
Therefore, our goal is to find the $f^{c*}$ that minimizes $E(\mathbf{f}^c)$ $(1 \leq c \leq C)$. Using the same technique as in Section 2.2, we can split $\mathbf{f}^c$ and $L$ as

$$\mathbf{f}^c = ((\mathbf{f}^c_L)^{\mathrm T}, (\mathbf{f}^c_U)^{\mathrm T})^{\mathrm T}, \quad L = \begin{bmatrix} L_{LL} & L_{LU} \\ L_{UL} & L_{UU} \end{bmatrix}, \qquad (18)$$

and then, letting $\partial E(\mathbf{f}^c)/\partial \mathbf{f}^c_U = 0$, we get

$$\mathbf{f}^c_U = -L_{UU}^{-1} L_{UL} \mathbf{f}^c_L. \qquad (19)$$
Note that Eq. (19) has a form very similar to Eq. (13). Recalling that the label reconstruction weight matrix in Eq. (13) is just

$$\tilde{W} = D^{-1} W, \qquad (20)$$

where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ $(n = l + u)$ is the degree matrix, and the Laplacian matrix is $L = D - W$, we can transform Eq. (19) into

$$\mathbf{f}^c_U = -L_{UU}^{-1} L_{UL} \mathbf{f}^c_L = (D - W)_{UU}^{-1} W_{UL} \mathbf{f}^c_L = (I - D_{UU}^{-1} W_{UU})^{-1} D_{UU}^{-1} W_{UL} \mathbf{f}^c_L = (I - \tilde{W}_{UU})^{-1} \tilde{W}_{UL} \mathbf{f}^c_L, \qquad (21)$$

which is just the solution in Eq. (13); that is, our method can also be derived from the smoothness regularization framework.
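The equivalence asserted in Eq. (21) is easy to verify numerically: for any symmetric nonnegative $W$, the harmonic solution of Eq. (19) and the label reconstruction solution of Eq. (13) coincide. A small self-contained check, under our own choice of a random $W$ and arbitrary real labels on the first $l$ points:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 12, 3
# random symmetric nonnegative similarity matrix with zero diagonal
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))
L = D - W                                    # combinatorial Laplacian, Eq. (17)
f_L = rng.random((l, 1))                     # arbitrary labels on the first l points

# Eq. (19): harmonic solution f_U = -L_UU^{-1} L_UL f_L (note -L_UL = W_UL)
f_harm = np.linalg.solve(L[l:, l:], W[l:, :l] @ f_L)
# Eq. (13): label reconstruction f_U = (I - ~W_UU)^{-1} ~W_UL f_L
Wt = np.linalg.inv(D) @ W
f_rec = np.linalg.solve(np.eye(n - l) - Wt[l:, l:], Wt[l:, :l] @ f_L)

assert np.allclose(f_harm, f_rec)            # the two derivations coincide
```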
3. Robustness analysis
In this section we first present a random walk view of the basic algorithm introduced in Section 2 and show that it may be sensitive to "bridging points". We then propose a robust SSL algorithm that solves this problem efficiently. Finally, we also present an iterative gradient-based method for learning the hyperparameters of our model.
3.1. Relationship with random walks
Now let us consider the SSL problem from a random walk view. Given the neighborhood graph $G$, we regard its $n = l + u$ vertices as $n$ places, where the first $l$ places hold $C$ types of candies, with only one type per place. Assuming an ant starts at a place with no candies, what is the probability that it first reaches each of the $C$ types of candies? We constrain the ant to crawl only along the edges of $G$, and the weight on an edge, computed by Eq. (7), corresponds to the likelihood that the ant will cross that edge.
It has been previously established [12] that the probability that the ant first reaches a given candy place exactly equals the solution of the Dirichlet problem with boundary conditions at the candy places, where the candy place in question is fixed to unity while the others are set to zero. A Dirichlet problem is to find a harmonic function $f$ that satisfies the Laplace equation

$$\nabla^2 f = 0 \qquad (22)$$

subject to its boundary values. It has also been shown that the harmonic function satisfying the boundary conditions minimizes the Dirichlet integral defined in Eq. (14) [12].
Now let us return to our random walk problem. Assuming the ant starts at place $x_i$, we denote the probability of it getting the candies of type $c$ by $f^c_i$, and set $\mathbf{f}^c = (f^c_1, f^c_2, \ldots, f^c_n)^{\mathrm T}$. Then, from the discussion in Section 2.3, we know that $\mathbf{f}^c$ can be solved by minimizing the following combinatorial Dirichlet integral

$$E(\mathbf{f}^c) = \tfrac{1}{2} (\mathbf{f}^c)^{\mathrm T} L \mathbf{f}^c \qquad (23)$$

subject to

$$f^c_j = \begin{cases} 1 & \text{if } t_j = c, \\ 0 & \text{otherwise,} \end{cases}$$

where $x_j$ is a candy place with candy type $t_j = c$ $(1 \leq c \leq C)$. Since the probabilities of the ant getting all kinds of candies must sum to one, we further constrain $\sum_c f^c_i = 1$. Therefore, the solutions to this random walk problem are identical to the solutions presented in Section 2.3, so the basic algorithm proposed in Section 2 can also be understood as a random walk procedure.
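The correspondence between absorption probabilities and the harmonic solution can be checked on a toy graph by simulation. In this sketch (the graph weights are our own arbitrary choice), nodes 0 and 1 are the two candy places, and a Monte Carlo estimate of the probability of first reaching node 0 is compared with the harmonic solution of Section 2.3:

```python
import numpy as np

rng = np.random.default_rng(1)
# tiny weighted graph: nodes 0 and 1 hold candies of types 0 and 1
W = np.array([[0, 0, 2, 1],
              [0, 0, 1, 2],
              [2, 1, 0, 3],
              [1, 2, 3, 0]], float)
P = W / W.sum(axis=1, keepdims=True)         # edge-crossing probabilities

def first_candy(start, trials=20_000):
    """Monte Carlo estimate of P(ant starting at `start` reaches node 0
    before node 1), crossing each edge with probability proportional
    to its weight."""
    hits = 0
    for _ in range(trials):
        v = start
        while v > 1:                         # nodes 0 and 1 are absorbing
            v = rng.choice(4, p=P[v])
        hits += (v == 0)
    return hits / trials

# harmonic solution with boundary f(0)=1, f(1)=0 (Eq. (19), -L_UL = W_UL)
L = np.diag(W.sum(axis=1)) - W
f_U = np.linalg.solve(L[2:, 2:], W[2:, :2] @ np.array([1.0, 0.0]))
```

For this graph the harmonic value at node 2 is $5/9 \approx 0.556$, and the simulated absorption frequency agrees to within sampling error.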
3.2. Robustness analysis
Let us revisit the problem described in Fig. 2 from the random walk view. Fig. 3(a) illustrates the five-nearest-neighbor graph of the data set shown in Fig. 2(a), which clearly shows that the two moons are "connected" by the two bridging points. From the random walk's view, there is a possibility that an ant starting on the upper moon crawls across these bridging points to get the candies stored at the place represented by the red triangle, and the probability that the ant crosses each edge equals the weight on that edge. Fig. 3(b) shows the final classification result of the basic algorithm in Section 2.

Generally, we regard a point as a bridging point or outlier if and only if it lies in a sparse region; when this region lies between different classes, the point is a bridging point, and when this region lies far away from the data set, the point is usually called an outlier. It can easily be seen that the degree of a data point is large if it is in a region of high density (where its neighborhood is dense) and small if it is in a sparse region.

The robustness of algorithms has been extensively studied in the dimensionality reduction field, e.g. robust subspace learning [9], robust kernel Isomap [8], and robust locally linear embedding [5]. However, little research has been done on the robustness of graph-based semi-supervised learning methods. In the following we propose a scheme to make our method more robust to bridging points and outliers.

An intuitive way to robustify our method is to reduce the edge weights associated with the bridging points (and outliers), making it harder for the ant to cross these edges. Therefore, we propose to calculate the similarity between $x_i$ and $x_j$ by
$$\tilde{w}_{ij} = d_i d_j W_{ij}, \qquad (24)$$
where $W_{ij}$ is the similarity computed by Eq. (7) and $d_i$ is the degree of $x_i$.

Intuitively, Eq. (24) reflects the genuine similarity between $x_i$ and $x_j$ even when outliers exist. If $x_i$ and $x_j$ are both in a high-density region and there is an edge connecting them, then $\tilde{w}_{ij}$ will have a high value. On the other hand, if an edge connecting $x_i$ and $x_j$ involves a low value of either $W_{ij}$ (which indicates $x_i$ and $x_j$ may belong to different clusters) or $d_i$, $d_j$ (which implies $x_i$ or $x_j$ may be an outlier), then $\tilde{w}_{ij}$ will be low.

We define the robust similarity $\tilde{w}$ and use it in place of the similarity computed by Eq. (7) in the basic algorithm presented in Section 2. The resulting approach is called the robust self-tuning semi-supervised learning (RS3L) method throughout the paper, for the following reasons: (1) we adopt a robust similarity measure; (2) we need not do the tedious work of tuning the free parameter in Eq. (3). To show its effectiveness, we also apply RS3L to
Fig. 3. Classification result on the toy data set shown in Fig. 2(a) using the basic algorithm proposed in Section 2. (a) shows the five nearest neighbor
graph constructed on this data set, with the bridging points denoted by green filled circles. (b) shows the classification result using the basic algorithm in
Section 2.
Fig. 4. Classification result by our robust self-tuning semi-supervised learning method on the two-moon toy data set. (a) The original data set which is
identical to Fig. 2(a). (b) shows the classification result by our RS3L method.
Table 1
Robust self-tuning semi-supervised learning

Input: data set $X = X_L \cup X_U$ from $C$ classes, where $X_L$ is the labeled set and $X_U$ is the unlabeled set; the number of nearest neighbors $k$.
Output: the labels of all the data points.
1. Construct the neighborhood graph by solving Eq. (6) for the reconstruction weights of each data object from its $k$ nearest neighbors.
2. Compute the pairwise similarities $\tilde{w}_{ij}$ by Eq. (24), and compute the degree $\tilde{d}_i = \sum_j \tilde{w}_{ij}$ for each data point.
3. Construct the combinatorial Laplacian matrix using Eq. (17), replacing $W_{ij}$ and $d_i$ with $\tilde{w}_{ij}$ and $\tilde{d}_i$.
4. Solve the classification vector $\mathbf{f}^c$ for each class $c$ $(1 \leq c \leq C)$ by Eq. (19). Output the label $t$ for $x_u$ by $t = \arg\max_c f^c_u$.
solve the problem shown in Fig. 4(a) with $k = 5$; the result is given in Fig. 4(b), which agrees well with human judgement. The main procedure of RS3L is summarized in Table 1.
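Steps 2-4 of Table 1 (reweight, build the Laplacian, solve Eq. (19)) can be sketched as follows. The function names are ours, and the input `W` is assumed to be the symmetric similarity matrix of Eq. (7) with the labeled points ordered first:

```python
import numpy as np

def rs3l_reweight(W):
    """Robust reweighting of Eq. (24): scale each edge by the degrees of
    its endpoints, so that edges touching low-degree points (bridging
    points, outliers) shrink relative to edges inside dense regions."""
    d = W.sum(axis=1)
    return d[:, None] * d[None, :] * W       # ~w_ij = d_i d_j W_ij

def rs3l_predict(W, labels, l):
    """Steps 2-4 of Table 1: reweight, build the Laplacian with the new
    degrees ~d_i, and solve the harmonic system of Eq. (19)."""
    Wt = rs3l_reweight(W)
    Lap = np.diag(Wt.sum(axis=1)) - Wt       # Laplacian on ~w, ~d
    C = labels.max() + 1
    F_L = np.eye(C)[labels]                  # one-hot labels of X_L
    F_U = np.linalg.solve(Lap[l:, l:], Wt[l:, :l] @ F_L)
    return F_U.argmax(axis=1)
```

Step 1 (the reconstruction-weight QP) is the per-point problem of Eq. (6) and is omitted here; any simplex-constrained QP solver can supply `W`.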
4. Experiments
In this section, we present a set of experiments in which RS3L was used for semi-supervised classification, including toy examples, digit recognition and text classification.
4.1. Toy examples
Although traditional graph-based semi-supervised classification can perform very well on data sets with manifold structure [1,19,23], it is no longer robust enough to give satisfactory results when some noise points are added. This can easily be observed in the toy example of Fig. 2. In this subsection we show another synthetic example.

As shown in Fig. 5(a), the toy data set consists of two circular clusters with some bridging points connecting them. Initially, we label only two data points, one in each circle,
Fig. 5. Classification on the two circular patterns, with some bridging points connecting them. (a) The original data set with two labeled points; (b) classification result given by Zhu's harmonic Gaussian field method [23]; (c) classification result given by Zhou's consistency method [19]; (d) classification result by Belkin's Tikhonov regularization method [1]; (e) classification result by the basic algorithm introduced in Section 2; and (f) classification result by our RS3L method.
2http://www.kernel-machines.org/data.html.
and our goal is to predict the labels of the remaining points. Figs. 5(b)-(d) show the classification results of some traditional graph-based semi-supervised learning methods, whose free parameters were adjusted so that they achieved their highest classification accuracies. Fig. 5(e) is the classification result of our basic algorithm introduced in Section 2, with the number of nearest neighbors $k = 5$. Fig. 5(f) shows the classification result obtained by the RS3L method, with $k$ also set to 5.
It can be observed from Figs. 5(b)-(e) that the traditional graph-based methods, as well as the basic algorithm we proposed in Section 2, fail to find the two circular patterns due to the existence of the bridging points. This can be easily explained from the random walk viewpoint presented in Section 3.1, since the random walker may walk across the edges linking the two circular patterns and acquire a wrong label. However, using the robust similarity in Eq. (24), our RS3L algorithm gives a much more satisfactory result, since the weights of the edges containing the bridging points are reduced, which makes it harder for the random walker to walk from one circle to the other.
It is interesting that among the three traditional graph-based methods (Figs. 5(b)-(d)), Zhou's consistency approach [19] seems the most sensitive to the bridging points (Fig. 5(c)). We believe this is because the smoothness matrix used in Zhou's method is the normalized Laplacian matrix [1], which is equivalent to using $w_{ij} / \sqrt{d_i d_j}$ as the similarity measure between $x_i$ and $x_j$ [4]. Therefore, the weights on the edges containing bridging points and outliers are relatively enlarged due to their small degrees, which causes the random walker to run across these edges more easily and get wrong labels.
4.2. Digits recognition
In this case study, we focus on the problem of classifying hand-written digits. The data set we adopt is the USPS² hand-written $16 \times 16$ digits data set. The images of digits 1, 2, 3 and 4 are used in this experiment as four classes, containing 1269, 929, 824 and 852 examples respectively, for a total of 3874.

We used a nearest neighbor classifier and one-vs-rest SVMs [14] as baselines. The width of the RBF kernel for the SVM was set to 5. In RS3L, the number of nearest neighbors $k$
Fig. 6. Digit recognition on the USPS data set. (a) shows the recognition accuracies of different algorithms on a subset only containing digits ‘‘2’’ and ‘‘3’’.
(b) shows the recognition accuracies of different algorithms on the total data set containing all four digits. In both figures, the abscissa represents the
number of randomly labeled data in the data set (we guarantee that there is at least one labeled point in each class), and the ordinate is the total recognition
accuracy value averaged over 50 independent runs.
Fig. 7. Text classification on the 20newgroup data set. The abscissa
represents the number of randomly labeled data in the data set (we
guarantee that there is at least one labeled point in each class), and the
ordinate is the total recognition accuracy value averaged over 50
independent runs.
was set to 5 when constructing the graph. For comparison, we also provide the classification results achieved by Zhou et al.'s consistency method [19] and Zhu et al.'s Gaussian fields approach [23]. The affinity matrices in both methods were constructed by a Gaussian function with variance 1.25. All these parameters were set by five-fold cross validation. Note that the diagonal elements of the affinity matrix in Zhou's consistency method were set to 0. The recognition accuracies averaged over 50 independent trials are summarized in Fig. 6.

Fig. 6(a) illustrates the recognition accuracies of the different algorithms on a two-class task that aims at discriminating digits "2" and "3"; Fig. 6(b) provides the classification results of those algorithms on the multi-class task of discriminating all four digits. The effectiveness of our RS3L method can easily be seen in both figures.
4.3. Text classification
In this experiment, we address the task of text classification using the 20 newsgroups data set.³ The topic rec, containing autos, motorcycles, baseball and hockey, was selected from the version 20news-18828. The articles were preprocessed by the same procedure as in [19]. The resulting 3970 document vectors were all 8014-dimensional, and were normalized into the TFIDF representation.

We use the inner-product distance to find the $k$ nearest neighbors when constructing the neighborhood graph in RS3L, i.e. $d(x_i, x_j) = 1 - x_i^{\mathrm T} x_j / (\|x_i\| \|x_j\|)$, where $x_i$ and $x_j$ are document vectors, and the value of $k$ is set to 10. For Zhou's consistency and Zhu's Gaussian fields (GF) methods, the affinity matrices were all computed by $W_{ij} = \exp(-(1/(2\sigma^2))(1 - x_i^{\mathrm T} x_j / (\|x_i\| \|x_j\|)))$ with $\sigma = 0.15$.

³http://people.csail.mit.edu/jrennie/20Newsgroups/.

The SVM and nearest neighbor classifiers also served as baseline algorithms, and the width of the RBF kernel in the SVM was set to 1.5. All these parameters were also set by five-fold cross validation. The classification accuracies are summarized in Fig. 7, from which we can clearly see the advantage of our RS3L method.
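The inner-product distance above is simply a cosine distance on the TFIDF vectors. A minimal sketch (the function name is ours) of the $k$-nearest-neighbor search under this distance:

```python
import numpy as np

def cosine_knn(X, k=10):
    """Inner-product distance of Section 4.3:
    d(x_i, x_j) = 1 - x_i^T x_j / (||x_i|| ||x_j||).
    Returns the indices of each row's k nearest neighbors."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    D = 1.0 - Xn @ Xn.T                     # pairwise cosine distances
    np.fill_diagonal(D, np.inf)             # exclude self-matches
    return np.argsort(D, axis=1)[:, :k]
```

For TFIDF vectors, which are nonnegative, this distance lies in $[0, 1]$ and depends only on document direction, not length.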
5. Conclusions and discussions
In this paper we propose a novel semi-supervised learning algorithm called robust self-tuning semi-supervised learning (RS3L). The main advantages of our method are: (1) it computes the similarities between pairwise data points automatically, i.e. in closed form; (2) it is not sensitive to outliers. Experimental results on both synthetic and real data sets are presented to show the effectiveness of our method. In future work, we will focus on theoretical analysis and acceleration of the RS3L algorithm.
References
[1] M. Belkin, I. Matveeva, P. Niyogi, Regularization and semi-
supervised learning on large graphs, in: Proceedings of the 17th
COLT, 2004.
[2] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction
and data representation, Neural Comput. 15 (2003) 1373–1396.
[3] M. Belkin, P. Niyogi, Semi-supervised learning on Riemannian
manifolds, Mach. Learn. 56 (2004) 209–239.
[4] Y. Bengio, J. Paiement, P. Vincent, Out-of-sample extensions for LLE,
Isomap, MDS, eigenmaps, and spectral clustering, in: NIPS03, 2003.
[5] H. Chang, D.Y. Yeung, Robust locally linear embedding, Pattern
Recognition 39 (6) (2006) 1053–1065.
[6] O. Chapelle, B. Scholkopf, A. Zien (Eds.), Semi-supervised
Learning, MIT Press, Cambridge, MA, 2006.
[7] O. Chapelle, J. Weston, B. Scholkopf, Cluster kernels for semi-
supervised learning, NIPS 15, 2003.
[8] H. Choi, S. Choi, Kernel Isomap on noisy manifold, in: Proceedings
of the IEEE International Conference on Development and Learning
(ICDL), Osaka, Japan, July 19–21, 2005, pp. 208–213.
[9] F. De la Torre, M.J. Black, A framework for robust subspace
learning, Int. J. Comput. Vision 54 (1–3) (2003) 117–142.
[10] O. Delalleau, Y. Bengio, N. Le Roux, Non-parametric function
induction in semi-supervised learning, in: Proceedings of the 10th
AISTATS, 2005.
[12] S. Kakutani, Markov process and the Dirichlet problem, Proc. Jpn.
Acad. 21 (1945) 227–233.
[13] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by
locally linear embedding, Science 290 (2000) 2323–2326.
[14] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press,
Cambridge, MA, 2002.
[17] M. Szummer, T. Jaakkola, Partially labeled classification with
Markov random walks. Adv. Neural Inf. Process. Syst. 14 (2002).
[18] F. Wang, C. Zhang, Label propagation through linear neighbor-
hoods, in: Proceedings of the 23rd ICML, 2006, to appear.
[19] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Scholkopf, Learning
with local and global consistency, NIPS 16 (2004).
[20] D. Zhou, B. Scholkopf, Learning from labeled and unlabeled data
using random walks, in: Pattern Recognition, Proceedings of the 26th
DAGM, 2004.
[21] X. Zhu, Semi-supervised learning literature survey. Computer
Sciences Technical Report 1530, University of Wisconsin-Madison,
2005.
[23] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using
Gaussian fields and harmonic functions, in: Proceedings of the 20th
ICML, 2003.
Fei Wang is a fourth-year Ph.D. candidate in the Department of Automation, Tsinghua University, Beijing, PR China. He has published several papers in top conferences on machine learning and pattern recognition, such as CVPR and ICML.
Changshui Zhang is a professor in the Department of Automation, Tsinghua University, Beijing, PR China. He is an associate editor of the Pattern Recognition journal.