SEMI SUPERVISED LEARNING

ABSTRACT

As a supervised learning algorithm, the standard Gaussian Process achieves excellent classification performance. In this report, we present a semi-supervised algorithm for learning a Gaussian Process classifier: it incorporates a graph-based construction of semi-supervised kernels from labelled and unlabeled data, thereby extending the standard Gaussian Process algorithm into the semi-supervised learning framework. Our algorithm uses a spectral decomposition to obtain the kernel matrices and employs a convex optimization method to learn an optimal semi-supervised kernel, which is then incorporated into the Gaussian Process model. For Gaussian Process classification, the expectation propagation algorithm is applied to approximate the posterior with a Gaussian distribution. The main characteristic of the proposed algorithm is that the geometric properties of the unlabeled data are incorporated through globally defined kernel functions. The semi-supervised Gaussian Process model has an explicit probabilistic interpretation, can model the uncertainty in the data, and can handle complex non-linear inference problems. In the presence of few labelled examples, the proposed algorithm outperforms cross-validation-based methods, and we present experimental results demonstrating its effectiveness in comparison with related work in the literature.


CHAPTER 1

INTRODUCTION

Semi-supervised learning [1] has attracted increasing attention in recent years and spans many research areas, such as semi-supervised classification, semi-supervised regression, semi-supervised clustering, and co-training. In this report, we primarily consider semi-supervised classification. Standard supervised learning methods use only labelled data (or features) to train classifiers. Because of the diversity of data, labelled instances are often difficult, expensive, and time-consuming to obtain, whereas unlabeled data are usually relatively easy to collect in practice. Compared with supervised learning, semi-supervised learning can build better classifiers by using a large amount of unlabeled data together with few labelled data.

Statistics and machine learning share much of their basic theory and many of their algorithms. The primary differences between the two fields lie in the goal of learning and the type of problem solved. Statistics mainly asks how to understand the relationships between data and models, such as linearity or independence. Machine learning, in contrast, focuses primarily on making accurate predictions and on understanding the behaviour of algorithms. Because of these different objectives, the two fields have developed along different lines: in machine learning, algorithms are widely used as black boxes where only the inputs and outputs matter, whereas in statistics it is usually difficult to describe such models and obtain satisfactory results in that way. To some extent, the Gaussian process model [2-9] bridges the two fields effectively: it has an explicit probabilistic interpretation that facilitates modelling the uncertainty of complex data sets, and it simultaneously provides a complete theoretical framework for model selection and probabilistic prediction.

In the standard Gaussian process model, which is a supervised learning algorithm, the posterior distribution cannot be affected by unlabeled data, so the location of the decision boundary is not influenced by it. In this report, we show how to effectively extend the Gaussian process model into the semi-supervised framework by incorporating unlabeled data, and thereby improve the performance of Gaussian process classifiers. Because Gaussian processes are built on the Bayesian framework, this problem can be addressed from two directions, the likelihood function and the prior distribution: (1) combine a Gaussian process prior with a suitable likelihood function so that the posterior distribution incorporates the cluster assumption and influences the location of the decision boundary. Lawrence [4] proposed the Null Category Noise Model (NCNM), which acts as a probabilistic margin; Rogers [5] replaced the NCNM with a multinomial probit likelihood function, generalizing the binary setting to the multi-class setting. (2) Directly modify the kernel function of the Gaussian process prior so that it has the properties of a semi-supervised kernel and incorporates the information of both labelled and unlabeled data. Spectral clustering [11], diffusion kernels [12] and Gaussian random fields [13] are semi-supervised kernel methods of this kind. They are parametric approaches, for which it is difficult to choose an appropriate function family and to model the data accurately without enough degrees of freedom. In this report, we follow the second direction and work on the Gaussian process prior distribution. The proposed algorithm incorporates the geometric properties of unlabeled data through a graph-based spectral decomposition and obtains an optimal non-parametric semi-supervised kernel, which is then combined with the Gaussian process model.

1.1 SUPERVISED, UNSUPERVISED, AND SEMI-SUPERVISED LEARNING

In order to understand the nature of semi-supervised learning, it will be useful first to take a look at supervised and unsupervised learning.

1.1.1 SUPERVISED AND UNSUPERVISED LEARNING

Traditionally, there have been two fundamentally different types of tasks in machine learning. The first one is unsupervised learning. Let X = (x1, . . . , xn) be a set of n examples (or points), where xi ∈ X for all i ∈ [n] := {1, . . . , n}. Typically it is assumed that the points are drawn i.i.d. (independently and identically distributed) from a common distribution on X. It is often convenient to define the (n × d)-matrix X = (xi)i∈[n] that contains the data points as its rows. The goal of unsupervised learning is to find interesting structure in the data X. It has been argued that the problem of unsupervised learning is fundamentally that of estimating a density which is likely to have generated X. However, there are also weaker forms of unsupervised learning, such as quantile estimation, clustering, outlier detection, and dimensionality reduction.

The second task is supervised learning. The goal is to learn a mapping from x to y, given a training set made of pairs (xi, yi). Here, the yi ∈ Y are called the labels or targets of the examples xi. If the labels are numbers, y = (yi)i∈[n] denotes the column vector of labels. Again, a standard requirement is that the pairs (xi, yi) are sampled i.i.d. from some distribution which here ranges over X × Y. The task is well defined, since a mapping can be evaluated through its predictive performance on test examples. When Y = R or Y = Rd (or more generally, when the labels are continuous), the task is called regression. This report focuses on classification, i.e., the case where y takes values in a finite set (discrete labels). There are two families of algorithms for supervised learning. Generative algorithms try to model the class-conditional density p(x|y) by some unsupervised learning procedure. A predictive density can then be inferred by applying Bayes theorem:

p(y|x) = p(x|y) p(y) / p(x),  with  p(x) = Σy p(x|y) p(y)        (1.1)

In fact, p(x|y)p(y) = p(x, y) is the joint density of the data, from which pairs (xi, yi) could be generated. Discriminative algorithms do not try to estimate how the xi have been generated, but instead concentrate on estimating p(y|x). Some discriminative methods even limit themselves to modelling whether p(y|x) is greater than or less than 0.5; an example of this is the support vector machine (SVM). It has been argued that discriminative models are more directly aligned with the goal of supervised learning and therefore tend to be more efficient in practice.
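To make the generative route concrete, the following sketch (not from the text above; the class priors and the one-dimensional Gaussian class-conditional densities are illustrative assumptions) applies Bayes theorem (1.1) to turn p(x|y) and p(y) into the predictive probability p(y|x).

```python
# A minimal sketch of a generative classifier: model p(x|y) with 1-D Gaussians
# per class (an assumption for illustration), then apply Bayes theorem (1.1).
from scipy.stats import norm

priors = {0: 0.5, 1: 0.5}                     # p(y)
class_cond = {0: norm(loc=-1.0, scale=1.0),   # p(x | y = 0)
              1: norm(loc=+1.0, scale=1.0)}   # p(x | y = 1)

def posterior(x):
    """Return p(y = 1 | x) via Bayes theorem."""
    joint = {y: class_cond[y].pdf(x) * priors[y] for y in priors}   # p(x|y) p(y)
    evidence = sum(joint.values())                                  # p(x)
    return joint[1] / evidence

print(posterior(0.3))   # probability that the point x = 0.3 belongs to class 1
```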

1.1.2 SEMI-SUPERVISED LEARNING

Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some supervision information, but not necessarily for all examples. Often, this information will be the targets associated with some of the examples. In this case, the data set X = (xi)i∈[n] can be divided into two parts: the points Xl := (x1, . . . , xl), for which labels Yl := (y1, . . . , yl) are provided, and the points Xu := (xl+1, . . . , xl+u), the labels of which are not known. This is the "standard" semi-supervised learning setting considered in this report. Other forms of partial supervision are possible. For example, there may be constraints such as "these points have (or do not have) the same target" (cf. Abu-Mostafa, 1995). Such a setting corresponds to a different view of semi-supervised learning, in which SSL is seen as unsupervised learning guided by constraints. In contrast, most other approaches see SSL as supervised learning with additional information on the distribution of the examples x. The latter interpretation seems to be more in line with most applications, where the goal is the same as in supervised learning: to predict a target value for a given xi. However, this view does not readily apply if the number and nature of the classes are not known in advance but have to be inferred from the data; SSL as unsupervised learning with constraints may still remain applicable in such situations. A problem related to SSL was introduced by Vapnik several decades ago: so-called transductive learning. In this setting, one is given a (labelled) training set and an (unlabeled) test set, and the idea of transduction is to perform predictions only for the test points. This is in contrast to inductive learning, where the goal is to output a prediction function defined on the entire space X. Many semi-supervised methods are transductive; in particular, this is rather natural for inference based on graph representations of the data.

1.1.3 BRIEF HISTORY OF SEMI-SUPERVISED LEARNING

Probably the earliest idea about using unlabeled data in classification is self-learning, which is also known as self-training, self-labeling, or decision-directed learning. This is a wrapper algorithm that repeatedly uses a supervised learning method. It starts by training on the labeled data only. In each step a part of the unlabeled points is labeled according to the current decision function; then the supervised method is retrained using its own predictions as additional labelled points. This idea has appeared in the literature for some time (e.g., Scudder (1965); Fralick (1967); Agrawala (1970)).


An unsatisfactory aspect of self-learning is that the effect of the wrapper depends on the supervised method used inside it. If self-learning is used with empirical risk minimization and the 0-1 loss, the unlabeled data will have no effect on the solution at all. If instead a margin-maximizing method is used, the decision boundary is pushed away from the unlabeled points. In other cases it is unclear what self-learning is really doing, and which assumption it corresponds to.

Closely related to semi-supervised learning is the concept of transductive inference, or transduction, pioneered by Vapnik (Vapnik and Chervonenkis, 1974; Vapnik and Sterin, 1977). In contrast to inductive inference, no general decision rule is inferred, but only the labels of the unlabeled (or test) points are predicted. An early instance of transduction (albeit without explicitly considering it as a concept) was already proposed by Hartley and Rao (1968). They suggested a combinatorial optimization on the labels of the test points in order to maximize the likelihood of their model.

It seems that semi-supervised learning really took off in the 1970s, when the problem of estimating the Fisher linear discriminant rule with unlabeled data was considered (Hosmer, 1973; McLachlan, 1977; O'Neill, 1978; McLachlan and Ganesalingam, 1982). More precisely, the setting was one where each class-conditional density is Gaussian with equal covariance matrix. The likelihood of the model is then maximized using the labeled and unlabeled data with the help of an iterative algorithm such as the expectation-maximization (EM) algorithm (Dempster et al., 1977). Instead of a mixture of Gaussians, the use of a mixture of multinomial distributions estimated with labeled and unlabeled data was investigated in (Cooper and Freeman, 1970).

Later, this one component per class setting has been extended to several components per class (Shahshahani and Landgrebe, 1994) and further generalized by Miller and Uyar (1997). Learning rates in a probably approximately correct (PAC) framework (Valiant, 1984) have been derived for the semi-supervised learning of a mixture of two Gaussians by Ratsaby and Venkatesh (1995). In the case of an identifiable mixture, Castelli and Cover (1995) showed that with an infinite number of unlabeled points, the probability of error has an exponential convergence (w.r.t. the number of labelled examples) to the Bayes risk. Identifiable means that given P(x), the decomposition in Σy P(y) P(x|y) is unique. This seems a relatively strong assumption, but it is satisfied, for instance, by mixtures of Gaussians. Related is the analysis in (Castelli and Cover, 1996), in which the class-conditional densities are known but the class priors are not. Finally, the interest in semi-supervised learning increased in the 1990s, mostly due to applications in natural language problems and text classification (Yarowsky, 1995; Nigam et al., 1998; Blum and Mitchell, 1998; Collins and Singer, 1999; Joachims, 1999). Note that, to our knowledge, Merz et al. (1992) were the first to use the term "semi-supervised" for classification with both labelled and unlabeled data. It had in fact been used before, but in a different context than what is developed here; see, for instance, (Board and Pitt, 1989).

1.1.4 SEMI-SUPERVISED LEARNING IN PRACTICE

Semi-supervised learning will be most useful whenever there are far more unlabeled data than labelled. This is likely to occur if obtaining data points is cheap, but obtaining the labels costs a lot of time, effort, or money. This is the case in many application areas of machine learning, for example:

- In speech recognition, it costs almost nothing to record huge amounts of speech, but labelling it requires some human to listen to it and type a transcript.

- Billions of Web pages are directly available for automated processing, but to classify them reliably, humans have to read them.

- Protein sequences are nowadays acquired at industrial speed (by genome sequencing, computational gene finding, and automatic translation), but to resolve a three-dimensional (3D) structure or to determine the functions of a single protein may require years of scientific work.

Since unlabeled data carry less information than labelled data, they are required in large amounts in order to increase prediction accuracy significantly. This implies the need for fast and efficient SSL algorithms.


CHAPTER 2

GAUSSIAN PROCESSES

The Gaussian process (GP) [2] is a generalization of the multivariate Gaussian distribution and has the marginalization property. A GP controls the properties of random data x through a random process f(x) and simultaneously describes this random process by a probability distribution. A GP describes a distribution over functions and is fully specified by the mean function m(x) and the covariance function (kernel function) K(x, x′) of the random process f:

f(x) ~ GP(m(x), K(x, x′))        (2.1)

where the kernel K(x, x′) is usually chosen to be a Mercer kernel. For example, the RBF kernel has the form

K(x, x′) = θ1 exp(−||x − x′||² / (2θ2²))        (2.2)

where θ1 and θ2 are the hyperparameters of the RBF kernel, which are generally selected by maximizing the marginal likelihood (evidence).
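As a quick illustration, here is a minimal sketch (not from the report; the parameterization of θ1 and θ2 follows Equation (2.2) as written above and is otherwise an assumption) that builds an RBF kernel matrix and draws sample functions from the zero-mean GP prior of Equation (2.1).

```python
# Build an RBF kernel matrix and sample latent functions f ~ N(0, K).
import numpy as np

def rbf_kernel(X1, X2, theta1=1.0, theta2=1.0):
    """K[i, j] = theta1 * exp(-||x1_i - x2_j||^2 / (2 * theta2^2))."""
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return theta1 * np.exp(-sq_dists / (2.0 * theta2**2))

X = np.linspace(-3, 3, 50)[:, None]        # 50 one-dimensional inputs
K = rbf_kernel(X, X)
# The small jitter keeps the covariance numerically positive semi-definite.
f_samples = np.random.multivariate_normal(np.zeros(len(X)),
                                           K + 1e-8 * np.eye(len(X)), size=3)
```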

2.1 GAUSSIAN PROCESSES CLASSIFICATION

In this report, we only consider binary classification. We assume we are given a dataset X = {Xl, Xu}, where Xl = {x1, ..., xm} is the labelled dataset with associated label set Yl = {y1, ..., ym}, and Xu = {xm+1, ..., xn} is the unlabeled dataset. The main idea is to assume that there is an unobservable latent function f(x) on which a Gaussian process prior p(f) ~ GP(0, K) is imposed, and that the latent function f preserves the mapping relationship between the dataset X and the label set Y.

The likelihood function (class probability) over the latent function is

p(yi | fi) = Θ(yi fi)        (2.3)

where Θ is a sigmoid function, such as the logistic function or the cumulative Gaussian function. Based on Bayes theorem, the posterior probability can be written as

p(f | X, Y, θ) = p(Y | f) p(f | X, θ) / p(Y | X, θ)        (2.4)

where θ denotes the hyperparameters of the kernel function K and p(Y | X, θ) is the normalization factor, known as the evidence for the hyperparameters. As a discriminative model, the graphical representation of Gaussian processes is shown in Figure 2.1: the nodes are shaded to represent different treatments. White nodes are unobserved variables, grey nodes are observed variables, and black nodes are optimized.

Fig. 2.1: The graphical representation of Gaussian Processes in the discriminative framework

For a labelled data point xi ∈ Xl, xi is not d-separated [14] (conditionally independent) from K, because they have a common descendant yi which is observed. For an unlabeled data point xj ∈ Xu, xj is d-separated from K because yj is unobserved. In other words, the unlabeled data xj have no effect on the posterior distribution of the latent function f, so the location of the decision boundary is not influenced. In the following chapter, we present how to learn a semi-supervised kernel that captures the information in the unlabeled data and thereby influences the location of the decision boundary.
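To fix ideas, here is a minimal sketch (illustrative only; the helper names are assumptions) of the quantities in Equations (2.3) and (2.4): a cumulative-Gaussian (probit) likelihood and the unnormalized log posterior log p(Y|f) + log p(f|X, θ) for a given latent vector f.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(f, y):
    """Sum of log Theta(y_i * f_i), with Theta the cumulative Gaussian (Eq. 2.3)."""
    return np.sum(norm.logcdf(y * f))

def log_prior(f, K):
    """Log density of f under the zero-mean GP prior N(0, K)."""
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(f)))    # jitter for stability
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f))  # K^{-1} f
    return -0.5 * f @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(f) * np.log(2 * np.pi)

def unnormalized_log_posterior(f, y, K):
    """log p(f | X, Y, theta) up to the evidence term in Eq. (2.4)."""
    return log_likelihood(f, y) + log_prior(f, K)
```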


CHAPTER 3

SEMI-SUPERVISED KERNEL


The prior, p(f | X), plays a significant role in semi-supervised learning, especially in the presence of a small amount of labelled data. The prior can be constructed by forming an undirected graph G = {V, E} over the data points. The data points are the nodes V of the graph, and the weights W = {wij} of the edges E between nodes are based on similarity. The prior imposes a smoothness constraint over the data points, giving higher probability to labellings that respect the similarity structure of the graph. The similarity is usually captured by the kernel matrix K. Given the diagonal matrix Dii = Σj Wij, we can construct the normalized Laplacian Δ = I − D^(−1/2) W D^(−1/2) of the graph, which is a symmetric and positive semi-definite matrix. Consider the spectral decomposition of the normalized Laplacian:

Δ = Σi λi vi vi^T        (3.1)

where {vi} are the eigenvectors of the normalized Laplacian and {λi} the corresponding eigenvalues. By applying a transformation r(λ) to the eigenvalues {λi}, we obtain the semi-supervised kernel

K = Σi r(λi) vi vi^T        (3.2)

where λ1 ≤ ⋅⋅⋅ ≤ λn. Eigenvectors vi with large λi correspond to rather uneven functions on the graph and can be regarded as noise; in the semi-supervised learning framework we should therefore penalize them more strongly than eigenvectors with small λi, which represent large cluster structures within the data. For this reason, we choose the transformation r(λ) to be a decreasing function, r(λi) ≥ r(λi+1), reversing the order of the eigenvalues. In the following subsection, we apply the Kernel Alignment algorithm to learn the transformation function r(λ).
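The construction in Equations (3.1)-(3.2) can be sketched as follows (illustrative only: the kNN graph construction and the fixed transform r(λ) = 1/(λ + ε) are assumptions, not the learned transform described in the next sections).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def spectral_kernel(X, k=10, eps=1e-2):
    # Unweighted, symmetrized kNN adjacency matrix W
    W = kneighbors_graph(X, n_neighbors=k, mode='connectivity').toarray()
    W = np.maximum(W, W.T)
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    lam, V = np.linalg.eigh(L)                         # spectral decomposition (3.1)
    r = 1.0 / (lam + eps)                              # a decreasing transform r(lambda)
    return (V * r) @ V.T                               # K = sum_i r(lam_i) v_i v_i^T  (3.2)
```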

3.1 KERNEL ALIGNMENT:

Kernel Alignment [15, 16] is used to evaluate the similarity between the kernel matrix induced by the labelled dataset and the target matrix induced by the labels. To obtain the optimal semi-supervised kernel, we maximize the kernel alignment score

A(K~, T) = <K~, T>F / sqrt(<K~, K~>F <T, T>F)        (3.3)

where K~ is the sub-matrix (over the labelled points) of the semi-supervised kernel K built on the whole dataset, T with {Tij = yi yj} is the target matrix preserving the labels of the training data, and <·,·>F denotes the Frobenius inner product. Kernel Alignment can also be read as a measure of clustering, since

<K~, T>F = Σ(yi = yj) K~ij − Σ(yi ≠ yj) K~ij        (3.4)

where the first term on the right-hand side of Equation (3.4) measures the similarities within classes and the second term the similarities between classes. Maximizing the Kernel Alignment score therefore means maximizing the within-class similarities while minimizing the between-class similarities.
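For reference, the alignment score of Equation (3.3) can be computed directly (a minimal sketch; the function name is an assumption):

```python
import numpy as np

def kernel_alignment(K_sub, y):
    """A(K~, T) = <K~, T>_F / sqrt(<K~, K~>_F <T, T>_F), with T_ij = y_i * y_j."""
    T = np.outer(y, y)                    # target matrix for labels y in {-1, +1}
    num = np.sum(K_sub * T)               # Frobenius inner product <K~, T>_F
    den = np.sqrt(np.sum(K_sub * K_sub) * np.sum(T * T))
    return num / den
```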

3.2 CONVEX OPTIMIZATION ALGORITHM:

Different choices of the transformation function r(λ) lead to different semi-supervised learning algorithms, and the function is often chosen from a parametric family [11, 12, 13]. In practice, however, it is difficult to choose an appropriate function family and to model the data accurately without enough degrees of freedom. In this report, instead of a parametric method, we apply a convex optimization algorithm to learn the transform vector (weights) {ri} of the semi-supervised kernel over the whole data space. The convex optimization problem [10] is the following:

Maximize the objective  A(K~, T)        (3.5)

Subject to  K* = Σi ri vi vi^T        (3.6)

            Trace(K*) = 1        (3.7)

            ri ≥ 0        (3.8)

            ri ≥ ri+1,  i = 1, 2, ..., n−1        (3.9)

where K* is the optimal semi-supervised kernel, Equation (3.7) ensures the scale invariance of Kernel Alignment, Equation (3.8) ensures that K* is a positive semi-definite matrix, and Equation (3.9) imposes the order constraints that provide a valid decreasing penalty.
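As a simplified illustration (not the exact solver of [10]), the weights {ri} can be found with an off-the-shelf linear-programming routine by maximizing the alignment numerator <K~, T>F under the trace and order constraints; the full formulation also accounts for the alignment denominator, which leads to a QCQP/SDP. Function and variable names below are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def learn_spectral_weights(V, y_labelled, labelled_idx):
    """V: n x n eigenvector matrix (columns v_i); y_labelled in {-1, +1}."""
    n = V.shape[0]
    T = np.outer(y_labelled, y_labelled)
    # c_i = <(v_i v_i^T) restricted to the labelled points, T>_F
    c = np.array([np.sum(np.outer(V[labelled_idx, i], V[labelled_idx, i]) * T)
                  for i in range(n)])
    # Order constraints r_{i+1} - r_i <= 0, i.e. r_i >= r_{i+1}   (Eq. 3.9)
    A_ub = np.zeros((n - 1, n))
    for i in range(n - 1):
        A_ub[i, i], A_ub[i, i + 1] = -1.0, 1.0
    res = linprog(-c,                                   # linprog minimizes, so negate
                  A_ub=A_ub, b_ub=np.zeros(n - 1),
                  A_eq=np.ones((1, n)), b_eq=[1.0],     # Trace(K*) = sum_i r_i = 1 (Eq. 3.7)
                  bounds=[(0, None)] * n,               # r_i >= 0                 (Eq. 3.8)
                  method="highs")
    return res.x                                        # weights r_i; K* = V diag(r) V^T
```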

CHAPTER 4

SEMI-SUPERVISED LEARNING WITH GP


Based on the ideas of the previous sections, we propose the following algorithm (a code sketch follows the list):

1. Extract the features from the data space and form the graph over the data points;

2. Compute the normalized Laplacian Δ = I − D^(−1/2) W D^(−1/2) and its spectral decomposition, obtaining the eigenvectors {vi} and eigenvalues {λi};

3. Learn the optimal transform vector (weights) {ri} of the semi-supervised kernel K* by maximizing the Kernel Alignment score;

4. Incorporate the semi-supervised kernel K* into the GP classification framework.
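A minimal end-to-end sketch of steps 1-3 follows (illustrative; the function names are assumptions, and learn_spectral_weights refers to the sketch in Section 3.2). Step 4, feeding K* into an EP-based GP classifier, is what the equations below describe.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_semi_supervised_kernel(X, y_labelled, labelled_idx, k=10):
    # Steps 1-2: kNN graph, normalized Laplacian, spectral decomposition
    W = kneighbors_graph(X, n_neighbors=k, mode='connectivity').toarray()
    W = np.maximum(W, W.T)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12)))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    lam, V = np.linalg.eigh(L)
    # Step 3: learn the spectral weights r_i by maximizing the alignment objective
    r = learn_spectral_weights(V, y_labelled, labelled_idx)
    K_star = (V * r) @ V.T            # K* = sum_i r_i v_i v_i^T
    return K_star                     # Step 4: pass K* to the EP-based GP classifier
```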

In this report, we use the cumulative Gaussian likelihood function

p(yi | fi) = Φ(yi fi)        (4.1)

Given the GP prior p(f) and the likelihood function p(y | f), the posterior distribution over the latent function is obtained as

p(f | X, Y, K*) = p(Y | f) p(f | X, K*) / p(Y | X, K*)        (4.2)

Given a test point xt, we obtain the predictive class probability

p(yt | X, Y, K*, xt) = ∫ p(yt | ft) p(ft | X, Y, K*, xt) dft        (4.3)

Because the likelihood is a sigmoid function, the non-Gaussian likelihood in Equation (4.1) makes the posterior distribution p(f | X, Y, K*) and the predictive distribution p(yt | X, Y, K*, xt) analytically intractable, so analytic approximations of the integrals are needed.


In this report, we apply the Expectation Propagation (EP) [17] algorithm to find a Gaussian approximation q(f | X, Y, K*) = N(f | m, Σ) of the non-Gaussian posterior p(f | X, Y, K*) by moment matching of the approximate marginal distributions. Given a test point xt, we obtain the approximate posterior over the latent function value ft:

E[ft] = kt^T (K*)^(−1) m        (4.4)

Var[ft] = K*(xt, xt) − kt^T (K*)^(−1) kt + kt^T (K*)^(−1) Σ (K*)^(−1) kt        (4.5)

where kt is the vector of prior covariances between the test point xt and the training data X. The approximate posterior produced by the EP algorithm is global, because the latent function values are coupled through the GP prior.
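A minimal sketch of the prediction step (illustrative; names are assumptions) computes the latent mean and variance of Equations (4.4)-(4.5) from a Gaussian approximation N(m, Σ), and then the probit class probability of Equation (4.3):

```python
import numpy as np
from scipy.stats import norm

def predict(K_star, m, Sigma, k_t, k_tt):
    """K_star: n x n training kernel; k_t: covariances between test and training points;
    k_tt: prior variance at the test point; (m, Sigma): Gaussian approximation to the
    posterior over the training latents (e.g. from EP)."""
    K_inv = np.linalg.inv(K_star + 1e-8 * np.eye(len(K_star)))   # jitter for stability
    mean_t = k_t @ K_inv @ m                                               # Eq. (4.4)
    var_t = k_tt - k_t @ K_inv @ k_t + k_t @ K_inv @ Sigma @ K_inv @ k_t   # Eq. (4.5)
    # Predictive class probability for the cumulative-Gaussian likelihood (4.1)
    return norm.cdf(mean_t / np.sqrt(1.0 + var_t))
```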

CHAPTER 5

EXPERIMENTS


5.1 EXPERIMENTAL DATA

To evaluate the semi-supervised learning algorithm with Gaussian processes, we use the four datasets [10] shown in Table 5.1. The 'One vs. Two' dataset and the 'Odd vs. Even' dataset are handwritten digit recognition tasks: 'One vs. Two' is to classify the digit "1" vs. "2", and 'Odd vs. Even' is the artificial task of classifying the odd digits "1, 3, 5, 7, 9" vs. the even digits "0, 2, 4, 6, 8". The 'Pc vs. Mac' dataset and the 'Baseball vs. Hockey' dataset are taken from the 20-newsgroups collection for binary document categorization. For graph construction, we use an unweighted Euclidean 10-nearest-neighbour (10NN) graph on 'One vs. Two' and 'Odd vs. Even', and the cosine similarity of TF-IDF vectors on 'Pc vs. Mac' and 'Baseball vs. Hockey'.
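The two graph-weighting schemes mentioned above can be sketched as follows (illustrative; function names and parameters are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import kneighbors_graph

def digit_graph_weights(X_digits, k=10):
    """Unweighted, symmetrized Euclidean 10NN graph for the digit tasks."""
    W = kneighbors_graph(X_digits, n_neighbors=k, mode='connectivity').toarray()
    return np.maximum(W, W.T)

def text_graph_weights(documents):
    """Cosine similarity of TF-IDF vectors for the text tasks."""
    tfidf = TfidfVectorizer().fit_transform(documents)
    W = cosine_similarity(tfidf)
    np.fill_diagonal(W, 0.0)          # no self-loops
    return W
```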

Table 5.1: Information about the datasets

5.2 EXPERIMENTAL RESULTS AND ANALYSIS

In the experiments, we choose five different training (labelled) set sizes for each dataset and use the remaining examples as test (unlabeled) data. For each labelled-set size, we perform 20 random trials, and in each trial we optimize the Kernel Alignment score to learn the optimal semi-supervised kernel, which is then incorporated into the Gaussian process model. To verify the effectiveness of the proposed algorithm, we compare it with two alternatives: (1) the algorithm proposed by Zhu [10], which uses an SVM as the classifier and chooses the bound on the SVM slack variables by 5-fold cross-validation; (2) the GP model combined with the RBF kernel of Equation (2.2), i.e., the standard supervised learning method. The hyperparameters θ of the RBF kernel are obtained by maximizing the marginal likelihood function

q(y | X, θ) = ∫ p(y | f) p(f | X, θ) df        (5.1)

The experimental results are shown in Tables 5.2 to 5.5. The results of the proposed algorithm appear in the second and fourth columns of each table, and the results of Zhu [10] in the third and fifth columns. Comparing the second and third columns of each table, the proposed algorithm outperforms the SVM with cross-validation in the presence of few labelled data, which demonstrates that the GP model can reliably describe and model the data space within the Bayesian framework. Comparing the fourth and fifth columns, the results show that the classifiers improve their performance by incorporating the information of the unlabeled data together with the labelled data, which confirms the effectiveness of the proposed semi-supervised learning algorithm.

Table 5.2: The Average Accuracy on the One vs. Two Datasets

Table 5.3: The Average Accuracy on the Odd vs. Even Datasets


Table 5.4: The Average Accuracy on the Pc vs. Mac Datasets

Table 5.5: The Average Accuracy on the Baseball vs. Hockey Datasets

CONCLUSION

In this report we have presented a semi-supervised learning algorithm based on the Gaussian process model, which combines a graph-based construction of semi-supervised kernels with the GP model. We have empirically demonstrated the reliability of the proposed algorithm, which builds better classifiers by exploiting the information in the unlabeled data.


REFERENCES

[1] A. Zien, B. Schölkopf, O. Chapelle (eds.). Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.

[2] C. E. Rasmussen, C. K. I. Williams. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.

[3] C. E. Rasmussen. Advances in Gaussian Processes. Advances in Neural Information Processing Systems, 2006.

[4] N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 2005, 6: 1783-1816.

[5] S. Rogers, M. Girolami. Multi-class semi-supervised learning with the ε-truncated multinomial probit Gaussian process. Journal of Machine Learning Research, 2007, 1: 17-32.

[6] N. D. Lawrence. Learning for larger datasets with the Gaussian process latent variable model. Proceedings of the Eleventh International Workshop on Artificial Intelligence and Statistics, 2007.

[7] N. D. Lawrence, A. J. Moore. Hierarchical Gaussian process latent variable models. Proceedings of the International Conference on Machine Learning, 2007: 481-488.

[8] R. Urtasun, T. Darrell. Discriminative Gaussian process latent variable model for classification. Proceedings of the International Conference on Machine Learning, 2007: 927-934.

[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, 1999.

[10] X. Zhu, J. Kandola, Z. Ghahramani, J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems 17. Cambridge, MA: MIT Press, 2005.

[11] O. Chapelle, J. Weston, B. Schölkopf. Cluster kernels for semi-supervised learning. Advances in Neural Information Processing Systems 15, 2002.

[12] R. I. Kondor, J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. Proceedings of the 19th International Conference on Machine Learning, 2002.

[13] X. Zhu, Z. Ghahramani, J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. Proceedings of the 20th International Conference on Machine Learning, 2003.

[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.

[15] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, J. Kandola. On kernel-target alignment. Advances in Neural Information Processing Systems, 2002.

[16] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 2004, 5: 27-72.

[17] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.

[18] H. Li, Y. Li, H. Lu. Semi-supervised Learning with Gaussian Processes. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.