semi-supervised learning with support vector...

Semi-supervised Learning WithSupport Vector Machines

B A K K A L A U R E A T S A R B E I T

von Andre GuggenbergerMatrikelnummer 0327514

eingereicht an der

Technischen Universitat Wien

im September 2008

ABSTRACTSupport Vector Machines are a modern technique in the field of machine learningand have been successfully used in different fields of application. In general theyare used for some kind of classification task, where they learn from a randomly

selected training set, which has been classified in advance and then are applied onunseen instances. To get a good classification result it is often necessary that thistraining set contains a huge set of labeled instances. But for humans labeling of

data is a time-consuming and boring task. Some algorithms address this problemand overcome this by learning on both, a small amount of labeled and a huge

amount of unlabeled instances. There the learner has access to the pool ofunlabeled instances and requests the label for some specific instances from an

user. Then the learner uses all labeled data to learn the model. The choice of theunlabeled instances which should be labeled next has a significant impact on the

quality of the resulting model. This kind of learning is called semi-supervisedlearning or active learning. Currently there exist some different solutions for

semi-supervised learning. This work focuses on the most known ones and gives anoverview about them.

KURZFASSUNGSupport Vector Machines sind eine moderne Technik im Bereich vom maschinellenLernen und wurden mittlerweile in verschiedenen Anwendungsgebieten erfolgreicheingesetzt. Generell werden sie fur Klassifikationsaufgaben verwendet, wobei sievon einer zufallig gewahlte Menge von schon vorklassifizierten Trainingsdaten

lernen und dann auf noch unbekannte Daten angewendet werden. Um ein gutesKlassifikationsergebnis zu erhalten, ist es oft notwendig, eine große Menge von

vorklassifizierten Trainingsdaten zum Training zu verwenden. Das manuelleKlassifizieren von den Daten durch Menschen ist oft eine zeitaufwendige und

langweilige Aufgabe. Um dies zu erleichtern wurden Algorithmen entwickelt, ummit schon wenigen klassifizierten und vielen nichtklassifizierten Daten ein Modell

zu erstellen. Dabei hat der Klassifikator Zugang zu dem Pool vonnichtklassifizierten Daten und fragt einen Benutzer nach der Klasse fur einige

spezielle Instanzen. Dann benutzt er alle klassifizierten Daten zum Erstellen desModells. Die Wahl jener noch nicht klassifizierten Instanzen, die von einemExperten klassifiziert werden sollen, hat einen signifikanten Einfluss auf die

Qualitat des resultierenden Modells. Diese Art des maschinellen Lernens wird alsSemi-uberwachtes Lernen oder aktives Lernen) bezeichnet. Momentan existierenverschiedenste Ansatze fur Semi-uberwachtes Lernen. Diese Arbeit behandelt die

bekanntesten und liefert eine Ubersicht uber die verschiedenen Ansatze.

1

Contents

1 Introduction 4

2 Basic Definitions of Support Vector Machines 5

3 Semi-supervised Learning 93.1 Random Subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.3 Version Space Based Methods . . . . . . . . . . . . . . . . . . . . . 11

3.3.1 Theory of the Version Space . . . . . . . . . . . . . . . . . . 113.3.2 Simple Method . . . . . . . . . . . . . . . . . . . . . . . . . 123.3.3 Batch-Simple Method . . . . . . . . . . . . . . . . . . . . . 143.3.4 Angle Diversity Strategy . . . . . . . . . . . . . . . . . . . . 153.3.5 Multi-Class Problem . . . . . . . . . . . . . . . . . . . . . . 16

3.4 Probability Based Method . . . . . . . . . . . . . . . . . . . . . . . 173.4.1 The Probability Model . . . . . . . . . . . . . . . . . . . . . 173.4.2 Least Certainty and Breaking Ties . . . . . . . . . . . . . . 18

3.5 Other approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.5.1 A Semidefinite Programming Approach . . . . . . . . . . . . 183.5.2 S3VM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Experiments 214.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1.1 Evaluated Approaches . . . . . . . . . . . . . . . . . . . . . 214.1.2 ssSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.1.3 ssSVMToolbox . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.2 Artificial Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314.2.1 Gaussian Distributed Data . . . . . . . . . . . . . . . . . . . 314.2.2 Two Spirals Dataset . . . . . . . . . . . . . . . . . . . . . . 354.2.3 Chain Link Dataset . . . . . . . . . . . . . . . . . . . . . . . 384.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.3 Datasets from UCI Machine Learning Repository . . . . . . . . . . 40

2

5 Conclusion 42

A Relevant Links 43

3

Chapter 1

Introduction

Support Vector Machines (SVMs) are a modern technique in the field of machinelearning and have been successfully used in different fields of application. Most ofthe time they were used in a supervised learning context. There the learner hasaccess to a large set of labeled data and builds a model using these informations.After this learning step the learner is presented new instances and tries to predictthe correct labels. Beside supervised learning there exists also unsupervised learningwhere the learner cannot access the labels of the instances. In this case the learnertries to predict the labels by partitioning the data and creating so called clusters.

Providing a huge set of labeled data (as in the supervised case) can be verytime-consuming (and therefore costly). Semi-supervised learning tries to reducethe needed amount of labeled data by analyzing the unlabeled data. There onlyrelevant instances have to be labeled by a human expert. Of course the overallaccuracy has to be on par with the supervised learning accuracy.

In this work I explain the most common approaches for semi-supervised learningwith SVMs. I begin by introducing some basic definitions i.e. the SVM hyperplane,the kernel function and the SVM maximization task (Chapter 2). A detailed dis-cussion about the theory of Support Vector Machines is not provided. The mainpart of the work focuses on semi-supervised learning. I present a definition of semi-supervised learning in contrast to supervised and unsupervised learning, discuss themost common approaches (Chapter 3) for Support Vector Machines, compare semi-supervised SVMs and supervised SVMs and present the results of my experimentswith some of them. I show how they perform with different datasets including somecommon machine learning datasets and one real-world datasets (Chapter 4).

4

Chapter 2

Basic Definitions of SupportVector Machines

Consider a typical classification problem. Some input vectors (feature vectors) andsome labels are given. The objective of classification problems is to predict thelabels of new input vectors so that the error rate of the classification is minimal.

There are many algorithms to solve such kind of problems. Some of them re-quire that the input data is linearly separable (by a hyperplane). But for manyapplications this assumption is not appropriate. And even if the assumption holds,most of the time there are many possible solutions for the hyperplane (Figure 2.1).

Because we are looking for a hyperplane where the classification error is minimalthis can be seen as an optimization problem. In 1965 Vapnik ( [VC04], [Vap00])introduced a mathematical approach to find a hyperplane with low generalizationerror. It is based on the theory of structural risk minimization, which states thatthe generalization error is influenced by the error on the training set and the com-plexity of the model. Based on this work Support Vector Machines were developed.They belong to the family of generalized linear classifiers and are so called max-imum margin classifier. This means that the resulting hyperplane maximizes thedistance between the ’nearest’ vectors of different classes with the assumption thata large margin is better for the gerneraliziation ability of the SVM. These ’nearest’vectors are called support vectors (SV) and SVMs consider only these vectors forthe classification task. All other vectors can be ignored. Figure 2.2 illustrates amaximum margin classifier and the support vectors.

In the context of SVMs it is also important to mind kernel functions. Theyproject the low-dimensional training data to a higher dimensional feature space,because the separation of the training data is often easier achieved in this higherdimensional space. Moreover through this projection it is possible that trainingdata, which couldn’t be separated linearly in the low-dimensional feature space,can be separated linearly in the high-dimensionl space.

To understand semi-supervised learning we have to consider some mathematical

5

Figure 2.1: Positive samples (green boxes) and negative samples (red circles). Thereare many possible solutions for the hyperplane (from [Mar03])

Figure 2.2: Maximum margin, the middle line is the hyperplane, the vectors on theother lines are the support vectors (from [Mar03])

6

background of SVMs. This is just a very short summary, beside very good resourceson the internet Vapnik, Cristianini and Shawe-Taylor provide comprehensive intro-ductions to Support Vector Machines [Vap00], [VC04] or [CST00].

At first we have to define the hyperplane, which separates the data and acts asthe decision boundary.

H(ω, b) = x|ωT ∗ x+ b = 0 (2.1)

where ω is a weight vector, x is an input vector and b is the bias.Note that ω points orthogonal to H.

Because we are interested in maximizing the margin, we have to define thedistance from a support vector to the hyperplane.

ωT ∗ x+ b

||ω||=±1

||ω||(2.2)

From this definition the margin m follows straightforward (see Figure 2.2 for anillustration).

2

||ω||(2.3)

The maximization task can be summarized as [TC01]:

maxw∈F

mini{yi(ω ∗ φ(x))} (2.4)

subject to ||ω|| = 1

yi(ω ∗ φ(xi)) ≥ 1, i = 1...n.

Note that this definition is only correct, if the data is linearly separable. In anon-linearly separable case we have to introduce slack variables.

maxw∈F

mini{yi(ω ∗ φ(x))} (2.5)

subject to yi(ω ∗ φ(xi)) ≥ 1− ξi, i = 1...n

ξi ≥ 0

where ξi are slack variables.Because SVMs try to maximize the margin we can restate the optimization task

using the definition of the margin:

minω,ξ

1

2||ω||2 + C

n∑i=1

ξi (2.6)

7

subject to yi(ω ∗ φ(xi)) ≥ 1− ξi, i = 1...n

ξi ≥ 0

where C is the complexity parameter. This parameter controls the complexityof the decision boundary. Large C penalize errors whereas small C penalize thecomplexity [Mei02].

As said Support Vector Machines usually use so called kernels or kernel functionsto project the data from a low-dimensional input space to a high-dimensional featurespace. The kernel function K satisfies Mercer’s condition and we define K as:

K(u, v) = φ(u) ∗ φ(v) (2.7)

where φ : X− > F is a feature map [Mei02], [MIJ04]. One example of a featuremap:

f(x1, x2) = (x21,√

2x1x2, x22) (2.8)

Using these feature map we can calculate the projection using the kernel K(u,v) =φ(u)∗φ(v) by computing the inner products of the data vectors and not the featurevectors φ(u) ∗ φ(v):

K(u, v) = φ(u) ∗ φ(v) (2.9)

= u21v

21 + 2u1u2v1v2 + u2

2v22 (2.10)

= (u1v1 + u2v2)2 (2.11)

= (< u, v >)2 (2.12)

whereas < u, v > is the inner product of u and v.In the context of SVMs we consider classifiers of the form:

f(x) =n∑i=1

αiK(x, x). (2.13)

where αi are the Lagrange multipliers.

8

Chapter 3

Semi-supervised Learning

The task of classification is also called supervised learning. In contrast to this thetask of clustering is called unsupervised learning. There the learner doesn’t uselabeled data, instead the learner tries to partitioning a datasets into clusters sothat the data in a cluster share some common characteristics.

Semi-supervised learning is a combination of supervised and unsupervised learn-ing where typically a small amount of labeled and a large amount of unlabeled dataare used for training. This is done because of two reasons. First labeling of a hugeset of instances can be a time-consuming task. This classification has to be done bya skilled human expert and can be quiet costly. Semi-supervised learning reducesthe needed amount of labeled instances and the associated costs. Note that in con-trast the acquisition of the unlabeled data is usually relatively inexpensive. Secondit has been shown that using unlabeled data for learning improves the accuracy ofthe produced learner [BD99]. G. Schohn and D. Cohn [SC00] report similar results,they state that a trained SVM on a well-chosen subset performs often better thanon all available instances.

Summing up the advantages of semi-supervised learning are (in many cases)better accuracy, fewer data and less training time. To achieve these the examplesto be labeled should be selected properly.

There are many different algorithms for semi-supervised learning with SupportVector Machines. Most of them require some kind of querying unlabeled instancesto request the labels for them from a human expert. They differ in the way theyselect the next instances. The process of querying is called selective sampling.

Sometimes semi-supervised learning is called active learning. As opposed to pas-sive learning, where a classifier is trained using randomly selected labeled data, anactive learner asks a user to label only ’important’ instances. Because the classifiergets feedback (the labels) about the for the classification relevant instances from anuser, this process is called relevant feedback.

Note that the approaches presented in 3.5.1 and 3.5.2 differ in that because hereno feedback is necessary.

9

3.1 Random Subset

Obviously if we use a random process to select the unlabeled instances this learningcannot be considered as real semi-supervised learning. To get an appropriate accu-racy the sampling strategy is as important as it is in case of supervised learning.Supervised learning and random subset semi-supervised learning are very similarand share most of the characteristics.

Some researchers have experimented with this and have stated that the accuracycannot keep up with real semi-supervised strategies. But they used this approach tocompare it with the other semi-supervised learning approaches [FM01], [LKG+05].

3.2 Clustering

One approach is to use a clustering algorithm (unsupervised learning) on the un-labeled data. Then we can e.g. choose the cluster centers (centroids) as instancesto be labeled by an expert. G. Fung and O. Mangasarian have used k-medianclustering and report a good classification accuracy in comparison with supervisedlearning but with fewer labeled instances [FM01]. It’s worth to keep in mind thatone has to define the correct number of the clusters in advance. Correct means thatthe clusters should be good representatives of the available classes. G. Fung andO. Mangasarian do not really address this but as for other clustering algorithm thechoice of the number of clusters can be assumed to be critical. An obvious solutionis to set the number of clusters equal to the number of classes. Additionally G.Fung and O. Mangasarian extend the clustering by an approach similiar to thatdescribed in chapter 3.5.

A general algorithm could be described this way:

1. Use the labeled data to build a model

2. Using the unlabeled data calculate n clusters

3. Query some instances for labeling by a human expert. Which instances de-pends on the alogrithm. Some examples:

(a) Query the centroids

(b) Query instances on the cluster boundaries

(c) A combination of the above approaches

Cebron and Berthold introduced an advanced clustering technique, they pro-posed a prototype based learning approach using a density estimation techniqueand a probability model for selecting prototypes to obtain the labels from an ex-pert [CB07].

10

3.3 Version Space Based Methods

Random Subset (chapter 3.1) and clustering (chapter 3.2) are simple but effectivemethods for semi-supervised learning. Depending on the given classification taskthe results can be quite good. Note that both can be used with other classifiers andare not limited to Support Vector Machines. Version space based methods are amore advanced technique, which use specific properties of Support Vector Machinesfor semi-supervised learning. But as we will see these approaches suffer from somecritical limitations.

The following approaches can be analyzed by their influence to the version space.Therefore it is worth to consider the theory of version spaces.

3.3.1 Theory of the Version Space

The version space was introduced by Tom Mitchell [Mit97]. It is the space con-taining all consistent hypotheses from the hypotheses space whereas the hypothesesspace contains all possible hypotheses.

In the context of SVMs the hypotheses are the hyperplanes and the versionspaces contain all hyperplanes consistent with the current training data [TC01].More formally, the hypotheses space (all possible hypotheses) is defined as:

H = {f |f(x) =φ(x) ∗ ω||ω||

whereω ∈ W} (3.1)

where the parameter space W is equal to the feature space F, f is a hypothesis. Asin chapter 2 explained φ(x)∗ω

||ω|| is the definition of the (normalized) hyperplanes (Def-

inition 2.1). So this space contains all possible hyperplanes. Using this definitionwe can define the version space:

V = {f ∈ H|∨

i∈{1...n}

yif(xi) > 0} (3.2)

where yi is the class label. This definition eliminates all hypotheses (hyperplanes)not consistent with the given training data (Definition 2.4)

Because there is a bijection between W (containing the unit vectors) and H(containing hyperplanes) we can redefine V [TC01]:

V = {w ∈ W |||ω|| = 1, yi ∗ φ(xi)) > 0, i = 1...n} (3.3)

There is a restriction of this definition: the training data has to be linearlyseparable in the feature space. But because it is possible to make every data linearlyseparable by modifying the kernel we can ignore this issue [STC99]. Furthermorebecause we often work in a high-dimensional feature space in many cases the datawill be linearly separable.

11

For our analysis it is important to note the duality between the feature spaceF and the parameter space W [TC01]. The unit vectors ω correspond to the deci-sion boundaries f in F. This follows intuitively from the above definitions but thiscorrespondence exists also converse. Let’s have a closer look on this issue. If oneobserves a new training instance xi in the feature space this instance reduces theset of all allowable hyperplane to ones that classify xi correctly. We can write thisdown more formally: every hyperplane must satisfy yi(ωφ(xi)) > 0, where yi is thelabel for the instance xi. As said before ω is the normal vector of the hyperplanein F. But we can think of yiφ(xi) as being the normal vector of a hyperplane inW. It follows that ω(yiφ(xi)) = 0 defines a hyperplane in W. Recall that we havedefined the version space V in W. Therefore the hyperplane is a boundary to theversion space. It can be shown that the hyperplanes in W delimit the version spaceand from the definition of the maximization task of the SVMs it maximizes theminimum distance to any of this hyperplanes in W. SVMs find a center of thelargest hypersphere in the version space, whose radius is the maximum margin andit can be shown that the hyperplanes touched by the hypersphere correspond to thesupport vectors and that the ωi often lie in the center of the version space [TC01].

3.3.2 Simple Method

Linear SVMs perform best when applied in high-dimensional domains (such astext classification). There the number of features is much larger than the numberof examples and therefore the training data cannot cover the whole dimensions,meaning that the subspace spanned by the training examples is much smaller thanthe space containing all dimensions. Considering this observation G. Schohn andD. Cohn propose that a simple method to select instances for labeling is to searchfor examples that are orthogonal to the space spanned by the current trainingdata [SC00]. Doing this would give the learner informations about yet not covereddimensions. Alternatively one can choose those instances which are near to thedividing hyperplane to improve the confidence in currently known dimensions. Thisis an attempt to narrow the existing margin. To maximally narrow the margin onewould select those instances lying on the hyperplane. The interesting result fromG. Schohn and D. Cohn is that training on a small subset of the data leads in mostcases to a better performance than training on all available data.

Remains the analysis of the computation of the proximity of a training instanceto the hyperplane: this is inexpensive, because one can compute the hyperplaneand evaluate each instance using a single dot product.The distance between a feature vector φ(x) and the hyperplane ω:

|φ(x) ∗ ω| (3.4)

Let’s have a look, how this simple method influences the version space. Givenan unlabeled instance xi we can test how close the corresponding hyperplane in

12

Figure 3.1: The gray line is the old hyperplane, the green lines are the old margins,’o’ is a new example and the black line the new hyperplane, when the new instancewas labeld as ’-’ (from [SC00])

W comes to the center of the hypersphere (the ωi). If we choose the instance xiclosest to the center we can reduce the version space as much as possible (andthis will of course reduce the amount of consistent hypotheses). This distancecan be easily computed using the above formular. By choosing the instance xi,who come closest to the hyperplane in F, we maximally reduce the margin andthe version space. Figure 3.1 shows the effect of an instance on the hyperplanegraphically. There the bottom figure shows that by placing an instance to thecenter of the old hyperplane the margin (calculated using the new hyperplane) getschanged significantly. Placing an instance on the old hyperplane too far out haslittle impact on the margin, as we can see on the top figure.

A more sophisticated description of this can be found in [TK02]. There threedifferent approaches are presented, each trying to reduce the version space as muchas possible. Note that these definitions rely on the assumption that the givenproblem is binary (two classes).

1. Simple Margin: This is the method already described: choose the next in-stance closest to the hyperplane

2. MaxMin Margin: Let the instance x be a candidate for being labeled by ahuman expert. This instance gets labeled as -1, assigning it to class -1. Then

13

the margin m− of the resulting SVM gets calculated. After this x gets labeledas +1, assigning it to class +1 and again the margin m+ gets computed.This procedure is repeated for all instances and the instance with the largestmin(m−,m+) is chosen.

3. Ratio Margin: This is similar to the MaxMin Margin method, but uses therelative sizes of m− and m+: choose the instance with largest min(m

−

m+ ,m+

m−).

All three methods perform well, the simple margin method is computationallythe fastest. But it has to be used carefully, because it can be unstable undersome circumstances [HGC01], [TK02]. MaxMin Margin and Ratio Margin try toovercome these instability problems. The results of the experiments of S. Tong andD. Koller show that all three methods outperform random sampling [TK02].

3.3.3 Batch-Simple Method

One possible problem with the above methods is that every instance has to belabeled separately. That means that after each instance the user has to determinethe label. A new hyperplane will be calculated and the next instance has to bequeried. Often this approach is not practicable and some kind of batch mechanismis necessary. There exist different approaches of batch sampling for version spacebased algorithms [Cha05]. One of those approaches is the batch-simple samplingalgorithm, where h unlabeled instances closest to the hyperplane are chosen andhave to be labeled by a user. This could be seen as a rather naive extension of theabove methods (of course naive doesn’t mean bad). The batch-simple method hasbeen used to classify images [TC01] and the researchers in this paper report goodresults. The algorithm can be expressed as follows:

1. initial model building: Build a model using the labeled data

2. feedback round: query n instances closest to the hyperplane and ask the userto label them

The feedback round can be repeated m times. Because this algorithm can beunstable during the first feedback round [TC01], Tong and Chang suggest an initialfeedback round with random sampling:

1. initial modell building: Build a modell using the labeled data

2. first feedback round: choose randomly n instances for labeling

3. advanced feedback round: query n instances closest to the hyperplane andask the user to label them

14

Now the advanced feedback round could be repeated m times. But how tochoose ’good’ values for n and m? Simon Tong and Edward Chang do not explaina way to determine these values [TC01]. But it is clear that n has to be set inadvance. They have used a query size of 20. m can be determined by using somekind of cross validation. It is also obvious that by decreasing the query size n onehas to increase the number of rounds m and vice versa. Otherwise the accuracy ofthe classifier would decrease. Beside the technical reasons the choice of the valuesdepends on the user, whose task is to label the instances. To take advantage ofactive learning this user should not have to label a huge set of examples. As anstarting point one can use the values from [TC01]: query size = 20, number ofrounds = 5.

3.3.4 Angle Diversity Strategy

One problem with the batch-simple method is that by sampling a batch of instancesthe diversity of them is not guaranteed. One can expect that divers instancescan reduce the version space more efficiently, considering the diversity can havea significant impact on the performance of the classifier. A measurement of thediversity is the angles between the samples. The angle diversity strategy proposed in[Cha05] balances the closeness to the hyperplane and the diversity of the instances.

More formally the angle between two instances xi and xj (respective their cor-responding hyperplanes hi and hj:

|cos(< (hi, hj))| =|φ(xi).φ(xj)|||φ(xi)||||φ(xj)||

=|K(xi, xj)|√

K(xi, xi)K(xj, xj)(3.5)

where xi is an instance, φ(xi) is its normal vector and K(xi, xj) is the kernel func-tion, which satisfies Mercer’s condition [Bur98].

From these theoretical considerations the algorithm follows straightforward:

1. Train a hyperplane hi by the given labeled set

2. Calculate for each unlabeled instance xj its distance to the hyperplane hi

3. Calculate the maximal angle from xj to any instance xi in the current labeledset

What’s left is to consider the distance to the hyperplane, until now we havefocused on the diversity of the samples. To do this we introduce another parameterα [Cha05]. This parameter balances the distance to the hyperplane and the diversityamong the instances. The final decision rule can be expressed this way:

α ∗ |f(xi|+ (1− α) ∗ (argmaxxj

|K(xi, xj)|√K(xi, xi)K(xj, xj)

) (3.6)

15

As we can see α acts as a trade-off-factor between proximity and diversity. Thisparameter has to be set in advance and it is suggest to set it to 0.5 [Cha05]. Theyalso present a more sophisticated solution for determining this parameter and clearlyit is possible to use cross validation to get the best value for α.

Some version space based methods have been tested in different fields of appli-cation [Cha05], [MPE06]. Whereas former have concentrated on image datasetsand latter have tested these strategies on music datasets both come to the conclu-sion that the angle diversity strategy works best. Furthermore Tong concludes thatactive learning outperforms passive learning [Cha05].

3.3.5 Multi-Class Problem

So far we have just considered and analyzed the two-class case. But to be useful ingeneral a semi-supervised learning approach should be easily used in a multi-classenvironment.

There exist different strategies for solving a multi-class problem with N classesfor supervised learning with SVMs. In the case of the one-versus-one approachN(N−1)

2SVMs are developed and a majority vote is used to determine the class of the

given instance. In contrast the one-versus-all method uses N SVMs and assigns thelabel of the class which SVM has the largest margin. An overview about differentmulti-class approaches for SVMs can be found here [Pal08]. The one-versus-allmethod was introduced by Vapnik [Vap00]. Hsu and Lin have compared differentmulti-class approaches for SVMs [HL02]. Platt has described another multi-classSVM approach: the decision directed acyclic graph [PCT00].

From the above discussions it becomes not clear how to use these version spacebased methods for multi-class problems. Consider the simple method and the one-versus-all approach. In the case of a multi-class problem we have N decision bound-aries, so which of the margins do we want to narrow? There a single instance hasN distances (to the N hyperplanes) and narrowing one margin doesn’t automati-cally mean to narrow all margins. Until now little work has done solving multi-classsemi-supervised problems. Mitra, Shankar, and Pal have applied the simple methodto multi-class problems [MSP04]. They used a ’naive’ approach where they labeledN samples at a time. As said this approach lacks the analysis which example is bestfor all hyperplanes, because the influence of an example can be very large for onehyperplane but for other hyperplanes it can be useless. The angle diversity strategysuffers from the same problem, additionally it is not clear, which angle should beconsidered.

The following section 3.4 describes probability based methods which overcomethese problems and are more suitable for multi-class problem.

16

3.4 Probability Based Method

As we have seen the version space based methods lack of considering multi-classproblems. An approach which can handle multi-class problems easily are probabilitybased method [LKG+05]. There a probability model for multiple SVMs is created.The results of each SVMs are interpreted as a probability and can be seen as ameasurement of certainty that a given instance belongs to the class. In the caseof semi-supervised learning using this approach is straightforward and using theprobabilities we have many possibilities to query unlabeled instances for labeling.A simple method would be to train a model on the given labeled datasets. Thanthis model is applied on the unlabeled data and each of these unlabeled instances isgiven probabilities that these instances belong to a given class. Now we can querythe least certain instances or the most certain instances. It is also possible to querythe instances with the smallest difference in probability between its most likelyand second most likely class. Using these probabilities there exist many differentapproaches and it is also possible to mixture some of them [LKG+05].

3.4.1 The Probability Model

To get probabilities we have to extend the default Support Vector Machines. Fora given instance the results of the default SVMs are distances where f.ex. 0 meansthat the instance lies on the hyperplane and 1 that the instance is a support vector.

To assign a probability value to a class the sigmoid function can be used. Thenthe parametric model has the following form [LKG+05]:

P (y = 1|f) =1

1 + exp(Af +B)′(3.7)

where A and B are scalar values, which have to be estimated and f is the decisionfunction of the SVM. Based on this parametric model there are some approachesfor calculating the probabilities. As we can see, when we use this model we have tocalculate the SVM parameters (complexity parameter C, kernel parameter k) andthe parameter A and B where the parameter A and B have to be calculated foreach binary SVM. We can use cross validation for this calculation but it is clearthat this can be computationally expensive.

A pragmatic approximation method could assume that all binary SVMs havethe same A, eliminate B by assigning 0.5 to instances lying on the decision boundaryand by trying to compute the SVM parameters and A simultaneously [LKG+05].The decision function can be normalized by its margin to include the margin in thecalculation of the probabilities. More formally:

Ppq(y = 1|f) =1

1 + exp( Af||ω||)′

(3.8)

17

where we currently look at class p and Ppq is the probability of class p versus classq. We assume that Ppq, q=1,2,... are independent. The final probability for classp:

P (p) =

q 6=p∏q

Ppq(y = 1|f) (3.9)

It has been reported that this approximation is very fast and delivers goodaccuracy results. Using this probability model there exist different approaches forsemi-supervised learning. The next section outlines some.

3.4.2 Least Certainty and Breaking Ties

The algorithms for both are very similar.

1. Built a multi-class model from the labeled training data

2. Compute the probabilities

3. Least Certainty: Query the instances with the smallest classification confi-dence for labeling by a human expert. Add them to the training set.

4. Breaking ties: Query the instances with the smallest difference in probabilitiesfor the two highest probability classes and obtain the correct label from ahuman expert. Add them to the training set.

5. Goto 1

Suppose a is the class with the highest probability, b is the class with secondhighest probability and P(a) and P(b) are the probabilities of the classes. Thenleast certainty tries to improve P(a) and breaking ties tries to improve P(a) - P(b).Intuitively, both methods improve the confidence of the classification. The numberof instances, which should be queried, has to be set by the SVM designer.

These approaches were tested on a gray-scale image datasets [LKG+05]. Theyreport a good accuracy and a reduced number of labeled images required to reach it.The breaking ties approach outperforms least certainty and using batch samplingwas also effective.

3.5 Other approaches

3.5.1 A Semidefinite Programming Approach

Semidefinite programming is an extension of linear and quadratic programming. Asemidefinite programming problem is a convex constrained optimization problem.With semidefinite programming one tries to optimize a symmetric n× n matrix of

18

variables X [XS05]. Semidefinite programming can be used to use Support VectorMachines in an unsupervised and semi-supervised context. For clustering the goalis not to find a large margin classifier using the labeled data (as with supervisedlearning) but instead to find a labeling that results in a large margin classifier.Therefore every possible labeling has to be computed and the labeling with themaximum margin has to be chosen. Obviously this is computationally very expen-sive but Xu and Schuurmans have found out that it can be approximated usingsemidefinite programming. This unsupervised approach can be easily extended tosemi-supervised learning where a small labeled training set has to be considered.Note that this approach also works for multi-class problems [XS05]. There is one im-portant difference between this approach and the other above discussed approaches:Here the algorithm uses the unlabeled data directly that means no human expert isasked to label them. In this case the semi-supervised learning is a combination ofsupervised learning using the given labeled training set and unsupervised learningusing the unlabeled data.

3.5.2 S3VM

This approach was introduced by Bennet and Demiriz [BD99]. Similar to the aboveapproach no human gets asked to label instances. Instead the unlabeled data getsincorporated to the formulation of the optimization problem. S3VM reformulatesthe original definition by adding two constraints to the instances of the unlabeleddatasets. Considering a binary SVM one constraint calculates the misclassificationerror as if the instance were in class 1 and the second constraint as if the instancewere in class -1. S3VM tries to minimize these two possible misclassification errors.The labeling with the smallest error is the final labeling. Moreover Bennet andDemiriz introduce some optimization techniques for this. An analysis, how thisapproach performs in a multi-class environment, is not presented.

3.6 Summary

Semi-supervised learning is a promising approach to reduce the amount of neededlabeled instances for training SVMs by asking a human expert to label relevantinstances from an unlabeled pool of instances. As outlined there are many differentapproaches available. We can use clustering, which can also be used as a semi-supervised learning approach with other machine learning algorithm. In contrastthe here presented version space based methods focus on SVMs and promise goodaccuracy results but are primary usable for binary classification tasks. Extendingthese approaches for multi-class problems is an ongoing research topic. Simple buteffective approaches are the probability based methods which can be easily used ina multi-class context and are therefore very convenient. S3VM and the semidefiniteprogramming approach are also semi-supervised learning approaches but here no

19

human gets asked to label relevant instances. Whereas the former incorporatesunlabeled instances to the formulation of the optimization problem, the latter onetries to find the labeling with the largest margin.

20

Chapter 4

Experiments

4.1 Experiment Setting

To experiment with different approaches presented in this work I have implementedtwo applications. ssSVM is a semi-supervised SVM implementation and suppportsdifferent semi-supervised learning approaches like Least Certainty and BreakingTies. ssSVM uses RapidMiner, an open-source data mining plattform, which pro-vides a comprehensive API for machine learning tasks like different classification,different clustering and of course different SVM implementations. ssSVM is alsobased on Spring, mainly an inversion-of-control container and therefore it is highlyconfigurable and extensible. Furthermore it wraps the WordVector Tool for creat-ing word vectors from texts. The second implemented application is the GUI forssSVM. It is called ssSVMToolbox and is based on Eclipse RCP. The chapters 4.1.2and 4.1.3 as well as the links in the appendix A provide detailed informations.

4.1.1 Evaluated Approaches

I compared following approaches and evaluated their performances on the differentdata sets:

1. Least Certainty (LS)

2. Breaking Ties (BT)

3. Most Certainty (MC)

4. Simple Margin (SM)

5. Random Sampling (RS)

I separated every datasets into three sub sets.

1. training set for supervised learning (in this work also called reduced set)

21

2. training set for semi-supervised learning (is used to query instances for thefeedback)

3. test set to evaluate the performance

.Using the reduced and the training set for semi-supervised learning (merged also

called the whole set) I trained a common SVM to get an upper bound and usedthe reduced set alone to get the lower bound. So the accuracies of the differentapproaches should be between these bounds. Furthermore I used a random samplingstrategy (RS) to show that the different approaches are better than an approachwhich randomly chooses instances for feedback.

I compared two different modes:

1. incremental increased training size: there the feedback size is set to 1 and thetraining size is incremental increased

2. batch mode: there the feedback size is set to a certain value (f.ex. 50), insome iterations the feedback size is increased and results with these differentfeedback sizes are compared

4.1.2 ssSVM

ssSVM (semi-supervised Support Vector Machine) is a Java application capableof performing semi-supervised learning tasks with Support Vector Machines. It isbased on RapidMiner , an Open Source data mining tool, and on Spring, an IOCcontainer. See Relevant Links for more informations (appendix A).

The core of the application is the application context sssvmContext.xml. AsRapidMiner ssSVM supports different operators, this file configures which operatorsssSVM actually supports (which input sources, which SVM implementations, whichvalidators,...).<?xml version="1.0" encoding="UTF -8"?>

<beans xmlns="http://www.springframework.org/schema/beans"

xmlns:xsi="http: //www.w3.org /2001/ XMLSchema -instance"

xsi:schemaLocation="http: //www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring -beans.xsd">

<bean id="wvTool" class="sssvm.text.WVTool" lazy -init="true">

<property name="textOperator"

value="com.rapidminer.operator.TextInputOperator" />

<property name="params">

<ref bean="wvToolParams" />

</property >

<property name="labeledTexts">

<ref bean="labeledTexts" />

</property >

<property name="unlabeledTexts">

<ref bean="unlabeledTexts" />

</property >

<property name="testTexts">

<ref bean="testTexts" />

</property >

<property name="preprocessing">

<ref bean="preprocessing" />

</property >

<property name="supportedTokenProcessor">

22

<props>

<prop key="StringTokenizer">

com.rapidminer.operator.tokenizer.SimpleTokenizer

</prop>

<prop key="NGramTokenizer">

com.rapidminer.operator.tokenizer.NGramTokenizer

</prop>

<prop key="TermNGramGenerator">

com.rapidminer.operator.tokenizer.TermNGramGenerator

</prop>

<prop key="GermanStemmer">

com.rapidminer.operator.reducer.GermanStemmer

</prop>

<prop key="LovinsStemmer">

com.rapidminer.operator.reducer.LovinsStemmer

</prop>

<prop key="PorterStemmer">

com.rapidminer.operator.reducer.PorterStemmer

</prop>

<prop key="SnowballStemmer">

com.rapidminer.operator.reducer.SnowballStemmer

</prop>

<prop key="ToLowerCaseConverter">

com.rapidminer.operator.reducer.ToLowerCaseConverter

</prop>

<prop key="EnglishStopwordFilter">

com.rapidminer.operator.wordfilter.EnglishStopwordFilter

</prop>

<prop key="GermanStopwordFilter">

com.rapidminer.operator.wordfilter.GermanStopwordFilter

</prop>

<prop key="StopwordFilterFile">

com.rapidminer.operator.wordfilter.StopwordFilterFile

</prop>

<prop key="TokenLengthFilter">

com.rapidminer.operator.wordfilter.TokenLengthFilter

</prop>

</props>

</property >

<property name="tokenProcessors">

<ref bean="tokenProcessors" />

</property >

</bean>

<bean id="inputReader" class="sssvm.InputReader">

<property name="supportedReader">

<props>

<prop key="csv">

com.rapidminer.operator.io.CSVExampleSource

</prop>

<prop key="sparse">

com.rapidminer.operator.io.SparseFormatExampleSource

</prop>

<prop key="arff">

com.rapidminer.operator.io.ArffExampleSource

</prop>

</props>

</property >

</bean>

<bean id="svmLearner" class="sssvm.SVMLearner">


<ref bean="svmParams" />

</property >

<property name="supportedSVMLearer">

<props>

<prop key="libSVM">

com.rapidminer.operator.learner.functions.kernel.LibSVMLearner

</prop>

<prop key="mySVM">

com.rapidminer.operator.learner.functions.kernel.JMySVMLearner

</prop>

</props>

</property >

<property name="supportedValidator">

<props>

<prop key="xval">

com.rapidminer.operator.validation.XValidation

</prop>

<prop key="fixedSplit">

com.rapidminer.operator.validation.FixedSplitValidationChain

</prop>

</props>

23

</property >

<property name="supportedPerfEvaluator">

<props>

<prop key="simple">

com.rapidminer.operator.performance.SimplePerformanceEvaluator

</prop>

<prop key="poly">

com.rapidminer.operator.performance.PolynominalClassificationPerformanceEvaluator

</prop>

</props>

</property >

</bean>

<bean id="clusterer" class="sssvm.Clusterer">

<property name="supportedClusterer">

<props>

<prop key="kmeans">

com.rapidminer.operator.learner.clustering.clusterer.KMeans

</prop>

<prop key="svClustering">

com.rapidminer.operator.learner.clustering.clusterer.SVClusteringOperator

</prop>

<prop key="kernelKmeans">

com.rapidminer.operator.learner.clustering.clusterer.KernelKMeans

</prop>

</props>

</property >

</bean>

</beans>

To use ssSVM for a concrete experiment another configuration is necessary.There the runtime properties for the experiment have to be provided. Instead ofdescribing this file I provide an example. For a detailed description which pa-rameters and parameter values are supported, see the RapidMiner documentation(appendix A).<?xml version="1.0" encoding="UTF -8"?>

<beans xmlns="http://www.springframework.org/schema/beans"

xmlns:xsi="http: //www.w3.org /2001/ XMLSchema -instance"

xsi:schemaLocation="http: //www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring -beans.xsd">

<bean id="runtimeConfig" class="sssvm.RuntimeConfiguration">



<property name="svmLearner" value="libSVM" />

<property name="svmValidator" value="xval" />

<property name="svmPerfEvaluator" value="simple" />



<property name="inputReader">

<props>

<prop key="format">csv</prop>

</props>

</property >

</bean>

<bean id="svmParams" class="sssvm.ParameterWrapper">

<property name="props">

<props>

<prop key="confidence_for_multiclass">false</prop>

<prop key="calculate_confidences">false </prop>

<prop key="kernel_type">linear </prop>

<prop key="scale">true</prop>

</props>

</property >

</bean>

<bean id="preprocessing"

class="sssvm.preprocessing.Preprocessing">

<property name="discretizeLabelFeature" value="false" />

<property name="numberOfBins" value="2" />

<property name="nominal2numericRegularFeatures" value="true" />

</bean>

<bean id="exampleParameter" class="sssvm.ParameterWrapper">


<props>

<prop key="id_column">1</prop>

<prop key="label_column">2</prop>

24

<prop key="read_attribute_names">false</prop>

</props>

</property >

</bean>

<bean id="labeledExampleSource" class="sssvm.ExampleSource">



</property >

<property name="parameter">

<ref bean="exampleParameter" />

</property >

<property name="additionalProps">

<props>

<prop key="filename">

./ datasets/breast_cancer_wisconsin/wdbc_as_labeled.data

</prop>

</props>

</property >

</bean>

<bean id="testExampleSource" class="sssvm.ExampleSource">



</property >



</property >


<props>


./ datasets/breast_cancer_wisconsin/wdbc_testset.data

</prop>

</props>

</property >

</bean>

<bean id="unlabeledExampleSource" class="sssvm.ExampleSource">



</property >



</property >


<props>


./ datasets/breast_cancer_wisconsin/wdbc_as_unlabeled.data

</prop>

</props>

</property >

</bean>

<bean id="centroidBasedClusterModelHandler"

class="sssvm.clustermodel.SSSVMCentroidBasedClusterModelHandler" />

<bean id="flatCrispClusterModelHandler"

class="sssvm.clustermodel.SSSVMFlatCrispClusterModelHandler">

<property name="numberOfInstancesForFeedback" value="0" />

</bean>

<bean id="hierarchicalClusterModelHandler"

class="sssvm.clustermodel.SSSVMHierarchicalClusterModelHandler" />

<bean id="breakingTiesComparator"

class="sssvm.confidencemodel.BreakingTiesComparator">

</bean>

<bean id="leastCertaintyComparator"

class="sssvm.confidencemodel.LeastCertaintyComparator">

</bean>

<bean id="simpleMarginComparator"

class="sssvm.sampling.SimpleMarginComparator">

</bean>

<bean id="breakingTies"

class="sssvm.sampling.SamplingHandler">

<property name="comparator">

<ref bean="breakingTiesComparator" />

</property >


</bean>

25

<bean id="leastCertainty"



<ref bean="leastCertaintyComparator" />

</property >


</bean>

<bean id="simpleMargin"



<ref bean="simpleMarginComparator" />

</property >


</bean>

<bean id="maxMinMargin"

class="sssvm.sampling.MaxMinMargin" >




</property >

<property name="sssvmLearner">

<ref bean="sssvmLearner" />

</property >

</bean>

<bean id="randomSampling"

class="sssvm.sampling.RandomSampling">

<property name="seed" value="123456789"/>


</bean>

<bean id="clusterParams" class="sssvm.ParameterWrapper">


<props>

<prop key="kernel_type">

com.rapidminer.operator.learner.functions.kernel.jmysvm.kernel.KernelRadial

</prop>

<prop key="kernel_gamma">0.8</prop>

</props>

</property >

</bean>

<bean id="sssvmLearner" class="sssvm.SSSVMLearner">



</property >

<property name="clusterParams">

<ref bean="clusterParams" />

</property >

<property name="svmLearner" value="libSVM" />

<property name="validator" value="xval" />

<property name="perfEvaluator" value="simple" />

<property name="clusterModelHandler">

<map>

<entry

key="com.rapidminer.operator.learner.clustering.clusterer.KMeansClusterModel">

<ref bean="centroidBasedClusterModelHandler" />

</entry >

<entry

key="com.rapidminer.operator.learner.clustering.FlatCrispClusterModel">

<ref bean="flatCrispClusterModelHandler" />

</entry>

<entry

key="com.rapidminer.operator.learner.clustering.HierarchicalClusterModel">

<ref bean="hierarchicalClusterModelHandler" />

</entry>

</map>

</property >

<property name="clusterers">

<list>





</list>

</property >

<property name="samplingStrategies">

<list>









<ref bean="maxMinMargin" />

26

</list>

</property >

</bean>

<bean id="wvToolParams" class="sssvm.ParameterWrapper">

</bean>

<bean id="labeledTexts" class="sssvm.ParameterWrapper">

</bean>

<bean id="unlabeledTexts" class="sssvm.ParameterWrapper">

</bean>

<bean id="testTexts" class="sssvm.ParameterWrapper">

</bean>

</beans>

Following code performs this experiment:final RuntimeHandler r = new RuntimeHandler("wdbc.xml");

final SSSVMLearner learner = r.getSSSVMLearner ();

// one feedback round

final ExampleSet feedbackSet = learner.queryInstances(r.getLabeledExampleSet (), r.getUnlabeledExampleSet ());

final ExampleSet all = ExampleSetUtils.merge(r.getLabeledExampleSet (), feedbackSet );

// use a SVM implementation for training

final IOObject [] resultSSSVM = r.getSVMLearner (). learn(r.getRuntimeConfig (). getSvmLearner (),

r.getRuntimeConfig (). getSvmPerfEvaluator (), all);

// get performance of self test

final PerformanceVector pvXval = (( PerformanceVector) resultSSSVM [1]);

// use model on a separate test set

final PerformanceVector pvTest = r.getSVMLearner (). test((Model) resultSSSVM [0],

r.getRuntimeConfig (). getSvmPerfEvaluator (), r.getTestExampleSet ());

The incrementally increased training size mode can be executed by this code:protected List<Performance > performSSSVMStepwise(final String experiment , final SamplingStrategy samplingStrategy)

throws Exception {

final RuntimeHandler r = new RuntimeHandler(experiment );

// learn sssvm

final SSSVMLearner learner = r.getSSSVMLearner ();

ExampleSet all = (ExampleSet) r.getLabeledExampleSet (). clone ();

final ExampleSet unlabeledSet = (ExampleSet) r.getUnlabeledExampleSet (). clone ();

final List<Performance > results = new LinkedList <Performance >();

final int feedbackSize = 10;

learner.getSamplingStrategies (). clear ();

learner.addSamplingStrategy(samplingStrategy );

samplingStrategy.setNumberOfInstancesForFeedback(feedbackSize );

for (int i = 0; i < unlabeledSet.size() / 10; i++) {

final ExampleSet feedbackSet = learner.queryInstances(all ,

ExampleSetUtils.intersect(unlabeledSet , all ));

all = ExampleSetUtils.merge(all , feedbackSet );

final IOObject [] resultSSSVM = r.getSVMLearner (). learn(r.getRuntimeConfig (). getSvmLearner (),

r.getRuntimeConfig (). getSvmPerfEvaluator (), all);

final PerformanceVector pvXval = (( PerformanceVector) resultSSSVM [1]);

final PerformanceVector pvTest = r.getSVMLearner (). test((Model) resultSSSVM [0],

r.getRuntimeConfig (). getSvmPerfEvaluator (), r.getTestExampleSet ());

final Performance perf = new Performance(pvXval , pvTest , all.size(), all.size()

- r.getLabeledExampleSet (). size ());

results.add(perf);

}

return results;

}

If f.ex. a new input source should be used it has to be configured in sssvmCon-text.xml and after that it can be used in the experiment configuration.

Table 4.1 describes the important packages of ssSVM.

27

package name describtionsssvm contains the ssSVM implementation

and the core classes for running experimentssssvm.clustermodel contains cluster models

for using clusterer in a semi-supervised mannersssvm.confidencemodel contains the implementation

for probability based semi-supervised approaches(Breaking Ties, Least Certainty,...)

sssvm.sampling contains different sampling strategiessssvm.preprocessing contains preprocessing methodssssvm.text wraps the WVTool for creating word vectors from texts

Table 4.1: Packages

4.1.3 ssSVMToolbox

This is the graphical user interface of ssSVM. It is based on Eclipse RCP. Using thessSVMToolbox one can create, configure and run experiments. This applicationuses ssSVM to perform supervised and semi-supervised learning with SVMs andhas the same abilities as ssSVM. Technically the toolbox is a GUI to manipulateexperiment xml files.

Running experiments is straightforward. At first one has to create a new ex-periment. The toolbox consists of some tabs. On the Input tab one can configurethe datasources of the experiment. Here one has to provide the input format (f.ex.cvs), the filenames of the example sets and additional parameters for the examplesets. The Preprocessing tab provides the configuration for preprocessing tasks likediscretization and transformation of nominal to numeric attributes. The ssSVMLearner tab is the core of the toolbox. Here one can choose between different SVMlearners, has to set the SVM parameters like kernel type and can activate and deac-tivate the different sampling strategies. For every sampling strategy one can set thefeedback size. Finally one can execute the ssSVM experiment. After doing this, theFeedback Set table shows the instances for labeling by the human expert. Theresome features are shown (by double clicking on the row a dialog is opened and thewhole instance is shown) and the user can label the instances by clicking on the cellLabel. The current accuracy on the testset is also shown. The Result tab showsthe accuracies and the confusion matrix.

Figure 4.1 shows the Input tab whereas Figure 4.2 represents the ssSVM tab.By repeatedly executing the experiment one can experiment with incrementally

increased training sizes, by setting the feedback sizes to values > 1 one can testthe batch mode. By choosing different sampling strategies one can experiment withdifferent combinations of them.

In the next sections I show the results of my experiments. For some of theseexperiments I used the ssSVMToolbox. For more sophisticated results (f.ex. to

28

Figure 4.1: Screenshot of the Input tab of the ssSVMToolbox

29

Figure 4.2: Screenshot of the ssSVM tab of the ssSVMToolbox

30

Figure 4.3: Binary Gaussian Distribution (µ1 = 3, σ1 = 3, µ2 = 4, σ2 = 3)

create the different figures) I used a programmatic approach where I could executedifferent experiments with different settings all at once. See Section 4.1.2 for detailedinformations and example source code.

4.2 Artificial Datasets

4.2.1 Gaussian Distributed Data

For these experiments I used generated gaussian distributed data. I generated twodifferent datasets with two different classes where the two classes overlap each other.In the first datasets ds1 the σs are equal, in the second ds2 the σs are different. TheFigures 4.3 and 4.4 show plots of these datasets.

For these datasets I evaluated different approaches (section 4.1.1). Table 4.2shows the upper and the lower bounds and the result of the ssSVM approach usinga feedback size of 50.

whole set reduced set LC BT MC SM RSself test 0.67 0.5 0.56 0.6 0.68 0.74 0.38test set 0.67 0.5 62 0.5 0.66 0.67 0.46

trainingset size 840 40 50 50 50 50 50

Table 4.2: Summary experiments with ds1

31

Figure 4.4: Binary Gaussian Distribution (µ1 = 12, σ1 = 15, µ2 = 17, σ2 = 1)

Figure 4.5 gives a more detailed insight into the performance of the semi-supervised SVM. There the feedback size was set equal to 1 and ssSVM were usedto incrementally increase the training set size. As we can see after approx. 50 iter-ations Simple Margin and Most Certainty deliver good results in comparison withconventional SVMs but with much fewer data. Breaking Ties, Least Certainty andMost Certainty are most stable and outperform Random Sampling.

Figure 4.6 shows how the implementation performs with different feedback sizesin a batch mode.

The lower and upper bounds of the second datasets and the performance ofssSVM with feedback size 50 can be found in Table 4.3 .

whole set reduced set LC BT MC SM RSself test 0.77 0.9 0.83 0.78 0.95 0.66 0.76test set 0.77 0.43 0.70 0.63 0.44 0.5 0.59


Table 4.3: Summary experiments with ds2

The performance of ssSVM with feedback size 1 and incremental increased feed-back size is highlighted in Figure 4.7.

Remains the overview how ssSVM performs on this datasets in a batch mode.Figure 4.8 highlights the results of this.

32

Figure 4.5: Incremental increased training size ds1

Figure 4.6: different feedback sizes in batch mode ds1

33

Figure 4.7: Incremental increased training size ds2

Figure 4.8: different feedback sizes in batch mode for ds2

34

Figure 4.9: Incremental increased training size ds1, RBF kernel

Both datasets show that the semi-supervised SVM approaches delivers similarresults than the supervised approach but with a smaller training set. The incre-mental version outperforms the supervised approach with respect to the trainingset size and is better than the batch semi-supervised version, which is of coursemore practically and also performs better then the supervised approach.

Different Kernels

For the above experiments I used the Linear kernel. To look how the chosen kernelinfluences the result of the semi-supervised learning approaches I used polynomialand RBF kernels for experimenting with the datasets ds1. The Figures 4.10 and4.9 are similar to the Figure 4.5. For these datasets we can conclude that thechosen kernel influences the result of the SVM but has no specific impact on thesemi-supervised approaches.

4.2.2 Two Spirals Dataset

I also applied ssSVM to a Two Spirals dataset (Figure 4.11).Table 4.4 shows lower and upper bound and the ssSVM accuracy of the dataset.The performance of ssSVM with feedback size 1 and incremental increased feed-

back size is highlighted in Figure 4.12, the results of using a batch mode can befound in Figure 4.13.

35

Figure 4.10: Incremental increased training size ds1, polynomial kernel (degree =3)

Figure 4.11: Two Spirals Dataset

36

whole set reduced set LC BT MC SM RSself test 1 0.1 0 0 1 1 1test set 0.85 0.32 0.33 0.31 0.67 0.74 0.72


Table 4.4: Summary experiments with Two Spirals Dataset

Figure 4.12: Incremental increased training size Two Spirals Dataset

37

Figure 4.13: different feedback sizes in batch mode for Two Spirals Datasets

As the Gaussian Datasets these experiments show that with ssSVM the neces-sary amount of training instances can be reduced significantly.

4.2.3 Chain Link Dataset

The last artificial dataset I used to evaluate ssSVM is the Chain Link Dataset 4.14.

Table 4.5 shows upper and lower bounds, the Figures 4.15 and 4.16 show theaccuracies with incremental increased training sets and with different batch sizes.

whole set reduced set LC BT MC SM RSself test 0.89 0.66 0.77 0.7 0.67 0.67 0.86test set 0.9 0.76 0.86 0.75 0.73 0.81 0.66


Table 4.5: Summary experiments with Chain Link dataset

4.2.4 Summary

We could see that the semi-supervised SVM approaches reduced the amount ofneeded labeled data significantly. They delivered similar accuracies than the com-mon SVM approach but the training set size was much smaller. As expected the

38

Figure 4.14: Chain Link Dataset

Figure 4.15: Incremental increased training size Chain Link dataset

39

Figure 4.16: different feedback sizes in batch mode for Chain Link Dataset

incremental version performs better than the batch version. Breaking Ties, LeastCertainty, Simple Margin and Most Certainty perform better than Random Sam-pling but no single ’winner’ could be found.

4.3 Datasets from UCI Machine Learning Repos-

itory

Beside the generated datasets I evaluated my implementation using some datasetsfrom the UCI Machine Learning Repository (appendix A).

I used following datasets:

1. abalone

2. breast cancer (WDBC)

3. heart scale

4. hill valley

5. kr-vs-kp

Detailed informations about the datasets can be found on the UCI Machine Learn-ing Repository homepage. Again I separated the datasets into training sets for

40

supervised learning, semi-supervised learning and testing. Note that I did not tryto optimize the SVM kernel parameters to get good accuracies and therefore someaccuracies are rather low. Instead I used different parameters for different datasets(f.ex. different kernel types) and for each datasets the same parameters for com-paring supervised and semi-supervised learning.

Two modes were used for the semi-supervised approach. First a simple batchmode where only one feedback round is used. The other mode uses 10 feedbackrounds.

The tables 4.6 and 4.7 outline the results of these experiments. Again, the semi-supervised approaches deliver good accuracy but with reduced sample size withrespect to the whole training set.

whole set reduced set LC BT MC SM RSheart scale 0.84 0.75 0.83 0.83 0.78 0.77 0.80

WDBC 0.94 0.75 0.94 0.94 0.77 0.80 0.87WDBC (RBF) 0.85 0.25 0.52 0.52 0.28 0.50 0.45

abalone 0.54 0.44 0.51 0.53 0.44 0.51 0.51hill valley 0.94 0.85 0.89 0.89 0.87 0.85 0.86kr-vs-kp 0.44 0.29 0.39 0.43 0.42 0.33 0.22

Table 4.6: Evaluation of semi-supervised SVM approaches (1 iteration, feedbacksize 50)

whole set reduced set LC BT MC SM RSheart scale 0.84 0.75 0.83 0.83 0.76 0.84 0.8

WDBC 0.94 0.75 0.94 0.94 0.77 0.87 0.80WDBC (RBF) 0.85 0.25 0.69 0.69 0.28 0.47 0.46

abalone 0.54 0.44 0.50 0.51 0.43 0.49 0.51hill valley 0.94 0.85 0.89 0.88 0.88 0.86 0.86kr-vs-kp 0.44 0.29 0.47 0.53 0.31 0.21 0.16

Table 4.7: Evaluation of semi-supervised SVM approaches, (10 iterations, feedbacksize 50)

These datasets show that Least Certainty and Breaking Ties often delivers sim-ilar results and outperform the other approaches.

41

Chapter 5

Conclusion

In this work I summarized different approaches of semi-supervised learning for Sup-port Vector Machines. We have seen that most of them try to narrow the marginof the hyperplane. The version space based and the probability based methodsbelong to this category. Semi-supervised learning approaches promise to reduce theamount of the needed training data through performing so called feedback roundswhere a human expert gets asked for labeling instances which are relevant for thegiven classification task. The experiments with different datasets have shown thatssSVM, my semi-supervised learning implementation for SVMs, keep this promise.With ssSVM one can obtain similar accuracies with fewer training data as withusual SVMs.

One drawback of the presented semi-supervised learning approaches is that theyintroduce a new parameter, the feedback size. The feedback size influences not onlythe accuracy but also the acceptance of the human expert. If the feedback size istoo large, the human expert has to label many instances and can get bored (as inthe supervised case), if the feedback size is too small the accuracy can be too low.Because the optimal value for the feedback size depends on the datasets and thechosen approach there is no general rule how to set it. Additionally the number offeedback rounds must also be chosen.

I compared Least Certainty, Breaking Ties, Most Certainty and Simple Mar-gin with Random Sampling and could show that these approaches outperform thelatter one. Which approach should be chosen depends on the datasets althoughLeast Certainty and Breaking Ties seem to be most stable and are general goodapproaches.

A problem is that until now there does not exist a practical online tuning al-gorithm for kernel parameters. If we add a new instance to the training set theoptimal kernel parameters can change.

Nevertheless my experiments with ssSVM show that using semi-supervised ap-proaches help to reduce the size of needed labeled training data and are thereforevaluable.

42

Appendix A

Relevant Links

• Word Vector Tool - An Open-Source Tool for creating word vectors from textshttp://www.wvtool.nemoz.org/• RapidMiner - An Open-Source Datamining Tool http://www.rapidminer.com• Spring Framework - An IOC Container http://springframework.org/• Eclipse RCP - The Eclipse Rich Client Platform http://wiki.eclipse.org/index.

php/Rich Client Platform• UCI Machine Learning Repository - Repository containg different data sets

http://archive.ics.uci.edu/ml/

43

Bibliography

[BD99] Kristin P. Bennett and Ayhan Demiriz. Semi-supervised support vectormachines. In Proceedings of the 1998 conference on Advances in neuralinformation processing systems II, pages 368–374, Cambridge, MA, USA,1999. MIT Press.

[Bur98] Christopher J. C. Burges. A tutorial on support vector machines forpattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[CB07] Nicolas Cebron and Michael R. Berthold. An adaptive multi objective se-lection strategy for active learning. Konstanzer Schriften in Mathematikund Informatik, No. 235, 2007.

[Cha05] Edward Chang Simon Tong Kingsby Goh Chang-Wei Chang. Supportvector machine concept-dependent active learning for image retrieval.IEEE Transactions on Multimedia 2005, 2005.

[CST00] Nello Cristianini and John Shawe-Taylor. An introduction to supportVector Machines: and other kernel-based learning methods. CambridgeUniversity Press, New York, NY, USA, 2000.

[FM01] F. Fung and O. Mangasarian. Semi-supervised support vector machinesfor unlabeled data classification. Optimization Methods and Software,15:29–44, 2001.

[HGC01] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point ma-chines. Journal of Machine Learning Research, 1:245–279, 2001.

[HL02] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks,13:415–425, 2002.

[LKG+05] Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, ScottSamson, Andrew Remsen, and Thomas Hopkins. Active learning torecognize multiple types of plankton. Journal of Machine Learning Re-search, 6:589–613, 2005.

44

[Mar03] Florian Markowetz. Klassifikation mit support vector machines.http://lectures.molgen.mpg.de/statistik03/docs/Kapitel 16.pdf, 2003.Lectures, Max Planck Institute For Molecular Genetics.

[Mei02] Ron Meir. Support vector machines - an introduction.http://www.ee.technion.ac.il/ rmeir/SVMReview.pdf, 2002. Elec-trical Engineering Department, Israel Institute of Technology, Tutorial.

[MIJ04] Romain Thibaux Michael I. Jordan. The kerneltrick. http://www.cs.berkeley.edu/ jordan/courses/281B-spring04/lectures/lec3.pdf, Spring 2004. Lectures, CS Berkeley.

[Mit97] Thomas Mitchell. Machine Learning. McGraw-Hill Education (ISE Edi-tions), October 1997.

[MPE06] Michael Mandel, Graham Poliner, and Daniel Ellis. Support vector ma-chine active learning for music retrieval. Multimedia Systems, 12(1):3–13,2006.

[MSP04] Pabitra Mitra, B. Uma Shankar, and Sankar K. Pal. Segmentation ofmultispectral remote sensing images using active support vector ma-chines. Pattern Recognition Letters, 25(9):1067–1074, 2004.

[Pal08] Mahesh Pal. Multiclass approaches for support vector machine basedland cover classification. CoRR, abs/0802.2411, 2008. informal publica-tion.

[PCT00] John C. Platt, Nello Cristianini, and Shawe J. Taylor. Large marginDAGs for multiclass classification. In Sara A. Solla, T. K. Leen, andK. R. Muller, editors, Advances in Neural Information Processing Sys-tems, volume 12. MIT Press, 2000.

[SC00] Greg Schohn and David Cohn. Less is more: Active learning with sup-port vector machines. In ICML ’00: Proceedings of the Seventeenth In-ternational Conference on Machine Learning, pages 839–846, San Fran-cisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[STC99] John Shawe-Taylor and Nello Cristianini. Further results on the margindistribution. In COLT ’99: Proceedings of the twelfth annual conferenceon Computational learning theory, pages 278–285, New York, NY, USA,1999. ACM.

[TC01] Simon Tong and Edward Chang. Support vector machine active learningfor image retrieval. In MULTIMEDIA ’01: Proceedings of the ninth ACMinternational conference on Multimedia, pages 107–118, New York, NY,USA, 2001. ACM.

45

[TK02] Simon Tong and Daphne Koller. Support vector machine active learningwith applications to text classification. Journal of Machine LearningResearch, 2:45–66, 2002.

[Vap00] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA, 2000.

[VC04] V. N. Vapnik and A. Ya. Chervonenkis. Theory of pattern recognition.www.cs.berkeley.edu/ jordan/courses/281B-spring04/lectures/lec3.pdf,Spring 2004. Lectures, CS Berkeley.

[XS05] Linli Xu and Dale Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In Manuela M. Veloso and SubbaraoKambhampati, editors, AAAI, pages 904–910. AAAI Press / The MITPress, 2005.

46

semi-supervised learning with support vector...

Documents