sponsored by aiat.or.th and kindml, siit · classification, known as a most major supervised...

111
Table of Contents Chapter 3. Classification and Prediction ........................................................................................ 61 3.1. Classification......................................................................................................................... 61 3.1.1. Fisher’s linear discriminant or centroid-based method ............................................... 62 3.1.2. k-nearest neighbor method ......................................................................................... 70 3.1.3. Statistical Classifiers ..................................................................................................... 74 3.1.4. Decision Trees .............................................................................................................. 87 3.1.5. Classification Rules: Covering Algorithm .................................................................... 113 3.1.6. Artificial Neural Networks .......................................................................................... 124 3.1.7. Support Vector Machines (SVMs) .............................................................................. 127 3.2. Numerical Prediction.......................................................................................................... 140 3.2.1. Regression .................................................................................................................. 140 3.2.2. Tree for prediction: Regression Tree and Model Tree ............................................... 146 3.3. Regression as Classification ................................................................................................ 148 3.3.1. One-Against-the-Other Regression ............................................................................ 148 3.3.2. Pairwise Regression .................................................................................................... 150 3.4. Model Ensemble Techniques ............................................................................................. 153 3.4.1. Bagging: Bootstrap Aggregating ................................................................................. 155 3.4.2. Boosting: AdaBoost Algorithm ................................................................................... 157 3.4.3. Stacking ...................................................................................................................... 160 3.4.4. Co-training .................................................................................................................. 163 3.5. Historical Bibliography ....................................................................................................... 164 Exercise........................................................................................................................................... 167 Sponsored by AIAT.or.th and KINDML, SIIT CC: BY NC ND

Upload: others

Post on 04-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

Table of Contents

Chapter 3. Classification and Prediction ........................................................................................ 61 3.1. Classification ......................................................................................................................... 61

3.1.1. Fisher’s linear discriminant or centroid-based method ............................................... 62

3.1.2. k-nearest neighbor method ......................................................................................... 70

3.1.3. Statistical Classifiers ..................................................................................................... 74

3.1.4. Decision Trees .............................................................................................................. 87

3.1.5. Classification Rules: Covering Algorithm .................................................................... 113

3.1.6. Artificial Neural Networks .......................................................................................... 124

3.1.7. Support Vector Machines (SVMs) .............................................................................. 127

3.2. Numerical Prediction .......................................................................................................... 140 3.2.1. Regression .................................................................................................................. 140

3.2.2. Tree for prediction: Regression Tree and Model Tree ............................................... 146

3.3. Regression as Classification ................................................................................................ 148 3.3.1. One-Against-the-Other Regression ............................................................................ 148

3.3.2. Pairwise Regression .................................................................................................... 150

3.4. Model Ensemble Techniques ............................................................................................. 153 3.4.1. Bagging: Bootstrap Aggregating ................................................................................. 155

3.4.2. Boosting: AdaBoost Algorithm ................................................................................... 157

3.4.3. Stacking ...................................................................................................................... 160

3.4.4. Co-training .................................................................................................................. 163

3.5. Historical Bibliography ....................................................................................................... 164 Exercise ........................................................................................................................................... 167

Sponsored by AIAT.or.th and KINDML, SIIT

CC: BY NC ND

Page 2: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

61

Chapter 3. Classification and Prediction

This chapter presents a number of data mining/knowledge discovery techniques used to discover

meaningful hidden knowledge or patterns from a pile of data, in the form of transactions, where each

transaction is assumed independent of the others. Here, two rough classes of data mining techniques

are supervised and unsupervised learning. The first class includes classification and prediction while

the second one relates to clustering and association rule mining. This chapter presents the

supervised learning tasks. The unsupervised learning tasks will be explained in the next chapter.

Whereas classification aims to predict a categorical (discrete, unordered) label of a given object on

test, prediction sets a target to model continuous valued functions. These two supervised tasks need

a set of examples to create a predictive model for forecasting the value of the new coming event or

object. It is possible for us to build a classification model to categorize medical test applications, such

as either positive or negative.

There are many classification and prediction methods proposed by researchers in machine

learning, pattern recognition, and statistics. Typical classification methods are k-nearest neighbor

classifiers, Bayesian classifiers, decision tree classifiers, rule-based classifiers and artificial neural

networks. Linear regression, nonlinear regression, regression trees and model trees are prediction

models. Since these algorithms need huge computational space when the data set to be mined is large,

it is necessary to develop scalable classification and prediction techniques capable of handling large

disk-resident data, instead of memory-resident approach. Classification and prediction have

numerous applications, including fraud detection, target marketing, performance prediction,

manufacturing, and medical diagnosis. This chapter provides basic techniques for data classification

and prediction in order.

3.1. Classification

Classification, known as a most major supervised learning task in pattern recognition and

machine learning, aims to deduce a predictive function from a set of training data (also called

cases, observations or examples), each of which has its class label known beforehand. Later, the

function is used to predict a class label for a new coming case. Typically, the training data are

pairs of input objects (viewed as vectors), and desired outputs (known as class labels). The

output of the classification function is a class label of the input object. However, the function is

known as prediction if the output is a continuous value. In other words, the task of a supervised

learner (classification and prediction learners) is to predict the value of the function for any valid

input object after having seen a number of training examples (i.e. pairs of input and target

output). Two types of models generated from supervised learning are global and local models.

For the former, as more common cases, supervised learning generates a global model that maps

input objects to desired outputs. Typical models are decision trees, classification rules, Bayesian

models, artificial neural networks, and support vector machines. For the latter, supervised

learning is lazy with no construction of a generalized model but use data themselves as local

models, such as nearest neighbor or case-based reasoning. The following indicates a number of

steps towards classification.

1. Problem Formulation

The first step in classification is to figure out the overview of the problem by determining

which type of the task we are going to solve, what the input and output look like, how we

obtain training examples. For example, to classify a single handwritten character to one of

Sponsored by AIAT.or.th and KINDML, SIIT

Page 3: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

62

possible alphabets, an entire handwritten word, or an entire line of handwriting may be set as

the input and a predicted character for each character in the word is the targeted output.

2. Feature Design and Collection

After the rough specification of the input and output is determined, we have to design how to

characterize the raw input, i.e., how to transform it into a set of features. After the design of the

features, for each object (each sample), its feature values and corresponding target value are

collected, either from human experts or from measurements, to form a training set. That is,

typically, the input object is transformed into a feature vector, which contains a number of

features that are descriptive of the object. The preciseness of the learned classification model

depends strongly on how precisely the features characterize the input object. In general, the

number of features should not be too large and too small to accurately predict the output.

Moreover, in several situations, it is unfortunate that there is no special design of features but

the training set is formed as it is.

3. Algorithm Selection and Model Generation

Once the training set is ready, we have to select and apply a learning/mining algorithm, for

example decision tree induction, Bayesian learning, artificial neural network, to generate a

model from the training set. Parameters in the learned model may be adjusted to optimize

performance using a holdout subset (called a validation set) of the training set, or via cross-

validation.

4. Model Evaluation and Usage

After the final model is constructed from the training and validation set, the performance of the

algorithm may be measured using a test set that is separate from the training set. The

classification model can be used to reveal the most probable class of any unseen datum. In

general, classifier performance depends strongly on the characteristics of the data to be

classified. There is no single classifier that works the best for all kinds of given problems. It is

necessary to perform various empirical tests to compare classifier performance and to find the

characteristics of data that determine classifier performance. Finding a suitable classifier for a

given problem is however still more an art than a science. At present, the most widely used

classifiers are decision trees, naïve Bayes, k-nearest neighbor, rule-based methods, centroid-

based methods (Gaussian mixture model), artificial neural network (multilayer perceptron or

back propagation) and support vector machines.

3.1.1. Fisher’s linear discriminant or centroid-based method

The Fisher’s linear discriminant or centroid-based method, an early classification procedure, was

implemented widely due to their simplicity and low computational cost. The method simply

divides the sample space by a series of linear equations. For 2-D cases, the line dividing two

classes is drawn to bisect the line joining the centers of those classes. These lines implicitly

indicate the minimum distance from each center. Figure 3-1 shows Fisher’s Linear Discriminant

(Centroid-based) for the Iris data. The following are the linear equation for each class.

Virginica :

Versicolor : &

Setosa :

Sponsored by AIAT.or.th and KINDML, SIIT

Page 4: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

63

Figure 3-1: Fisher’s Linear Discriminant (Centroid-based) for the Iris data.

In 2-D cases, first, a linking line is drawn between the centroids of the two classes and then the

linear discriminant lines can be constructed by drawing a line perpendicular to the linking line at

the middle point of that linking line. The following is the formula of the discriminant line when

the middle points of two classes are and . Figure 3-2 shows an example when

and .

The linking line : where

and

The discriminant line : where

and

Figure 3-2: An example of Fisher’s Linear Discriminant (2-D example)

Sponsored by AIAT.or.th and KINDML, SIIT

Page 5: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

64

As another viewpoint of the centroid-based classification, an explicit profile of a class (also called

a class prototype) is calculated and used as the representative of all positive documents of the

class. The classification task is to find the most similar class to the object we will classify, by way

of comparing the object with the class prototype of the focused class. Figure 3-3 shows an

example of how to calculate centroid vectors and classify a new datum using these centroid

vectors. Figure 3-4 shows an example of classification of a new datum using the equivalent plane.

Outlook Temp. Humidity Windy Play

90.00 40.00 80.00 10.00 No 95.00 32.00 85.00 80.00 No 50.00 35.00 90.00 20.00 Yes 10.00 24.00 80.00 5.00 Yes 15.00 10.00 50.00 15.00 Yes 20.00 12.00 55.00 90.00 No 55.00 9.00 45.00 95.00 Yes 85.00 22.00 95.00 25.00 No 95.00 7.00 50.00 5.00 Yes 5.00 26.00 45.00 10.00 Yes

80.00 25.00 40.00 80.00 Yes 45.00 24.00 85.00 85.00 Yes 40.00 37.00 60.00 15.00 Yes 25.00 23.00 90.00 95.00 No

(a) A sample data set (the real-valued Play-Tennis data set with categorical classes)

Outlook Temp. Humidity Windy Play

90.00 40.00 80.00 10.00 No

95.00 32.00 85.00 80.00 No

20.00 12.00 55.00 90.00 No

85.00 22.00 95.00 25.00 No

(Average vector) 25.00 23.00 90.00 95.00 No

Centroid of 'No' 63.00 25.80 81.00 60.00 No

Outlook Temp. Humidity Windy Play

50.00 35.00 90.00 20.00 Yes

10.00 24.00 80.00 5.00 Yes

15.00 10.00 50.00 15.00 Yes

55.00 9.00 45.00 95.00 Yes

95.00 7.00 50.00 5.00 Yes

5.00 26.00 45.00 10.00 Yes

80.00 25.00 40.00 80.00 Yes

45.00 24.00 85.00 85.00 Yes

(Average vector) 40.00 37.00 60.00 15.00 Yes

Centroid of 'Yes' 43.89 21.89 60.56 36.67 Yes

(b) Average vectors as the centroids for ‘No’ and ‘Yes’

Outlook Temp. Humidity Windy Play

Test datum 70 40 30 60 ?

Outlook Temp. Humidity Windy Play

Centroid of 'No' 63.00 25.80 81.00 60.00 No

Centroid of 'Yes' 43.89 21.89 60.56 36.67 Yes

Distance between Test and ‘No’ : 53 4

Distance between Test and ‘Yes’ :

The closest class for the test datum is ‘Yes’.

(c) Classification of the test datum

Figure 3-3: An example of centroid-based classification: centroid vectors and classification

Sponsored by AIAT.or.th and KINDML, SIIT

Page 6: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

65

Distance from ‘No’ and Distance from ‘Yes’:

Distance from ‘No’ :

Distance from ‘Yes’ :

The equivalent plane satisfies the following condition:

=

=

= 0

+

= 0

= 0

The equivalent plane (discriminant) is as follows.

= 0

Outlook Temp. Humidity Windy Play

Centroid of 'No' 63.00 25.80 81.00 60.00 No

Centroid of 'Yes' 43.89 21.89 60.56 36.67 Yes

Center of ‘No’ and Yes' 53.45 23.85 70.78 48.34 -

Difference of ‘No’ and Yes' 19.11 3.91 20.44 23.33 -

Discriminant plane: = 0

Condition of ‘No’: > 0

Condition of ‘Yes’: < 0

Outlook Temp. Humidity Windy Play

Test datum 70 40 30 60 ?

Calculation for the test data:

= -181.87 yes

Figure 3-4: An example of centroid-based classification using equivalent hyperplane.

As a formal description, first let us consider the family of discriminant functions that are linear

combinations of the t components of

Sponsored by AIAT.or.th and KINDML, SIIT

Page 7: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

66

The above equation is a linear discriminant function, prescribing the weight vector w and

threshold weight w0. It represents a hyperplane with unit normal in the direction of w and a

perpendicular distance |w0|/|w| from the origin. The value of the discriminant function for a

pattern x, normalized by the size of the weight vector

is a measure of the perpendicular

distance from the hyperplane. Figure 3-5 shows the graphical representation of linear

discriminant function given by the discriminant equation with a concrete example.

Figure 3-5: An example of Fisher’s Linear Discriminant (2-D example)

A linear discriminant classifier can be viewed as a linear machine, an important special case of

which is the minimum-distance classifier or nearest-neighbor rule. Given a set of prototype

(centroid) points , one for each of the C classes . The minimum-distance

classifier assigns a pattern x to the class associated with the nearest point . For each point,

the squared Euclidean distance between the current pattern x and a prototype can be

represented as follows.

The classification can be achieved by finding the prototype which has the minimum distance to

the current pattern x. However, since the first term in the equation is identical for the calculation

of any prototype , the comparison can be performed on only the second and the third terms, i.e.,

. Thus, the linear discriminant function is as follows.

where

and

Sponsored by AIAT.or.th and KINDML, SIIT

Page 8: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

67

Therefore, the minimum-distance classifier is a linear form (also called linear machine). If the

prototype points are the class means, then we have the nearest class mean classifier. Decision

regions for a minimum-distance classifier are illustrated in Figure 3-6. Each boundary is the

perpendicular bisector of the lines joining the prototype points of regions that are contiguous.

Also, note from the figure that the decision regions are convex (that is, two arbitrary points lying

in the region can be joined by a straight line that lies entirely within the region). However, since

decision regions of a linear machine are always convex, it cannot cope with concave shape. Figure

3-7 illustrates a two-class problem, which cannot be separated by a linear form. To overcome this

difficulty, two generalizations of linear discriminant functions (linear machines) are piecewise

linear discriminant functions and generalized linear discriminant functions as shown below.

Figure 3-6: An 2-D example of Decision regions for a minimum-distance classifier

Figure 3-7: Two-class problems which cannot separable by a linear discriminant

Sponsored by AIAT.or.th and KINDML, SIIT

Page 9: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

68

Piecewise linear discriminant functions

To solve concave decision boundary, the piecewise linear discriminant allows more than one

prototype per class, instead of only one per class. For example, it is possible to assume

prototypes

for the i-th class . We can define the discriminant function for class

i as

where is a subsidiary discriminant function, which is linear and is given by

A pattern (object) x is assigned to the class for which is largest; that is, to the class of the

nearest prototype vector. In other words, we have partitioned the space into regions. This

partition is known as the Dirichlet tessellation of the space. When each pattern in the training set

is taken as a prototype vector, then we have the nearest-neighbor decision rule. This

discriminant function generates a piecewise linear decision boundary. It is possible to apply a

clustering scheme to construct prototypes

. Moreover, rather than using the

complete design set as prototypes, it is also possible to use only its subset. There are some

methods of reducing the number of prototype vectors (edit and condense) along with the

nearest-neighbor algorithm. Figure 3-8 shows an example of piecewise linear discriminant

functions.

Figure 3-8: An example of piecewise linear discriminant functions

Generalized linear discriminant function

Another solution to concave property of decision space is a generalized linear discriminant

function, also termed a phi machine by Nilsson (1965). It is a discriminant function of the form

Sponsored by AIAT.or.th and KINDML, SIIT

Page 10: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

69

where

is a non-linear mapping (kernel) function of . If q = t, the number

of original variables, and , then the formula is equivalent to a linear discriminant

function.

Figure 3-9: Nonlinear mapping (kernel) function from to where

.

By the mapping function, it is possible to transform the discriminant function on the original

measurement ’s which is not originally linear to a new discriminant function, which is linear in

the functions of . Figure 3-9 shows a nonlinear mapping (kernel) function from to

where . As seen in the figure, the two classes can be

separated in the -space by a straight line. Similarly, disjoint classes can be transformed into a

-space in which a linear discriminant function can separate the classes, even they are separable

in the original space. This mapping is sometimes known as a kernel function. Although this

transformation can help us to find linear discriminant function to separate the classes, it is hard

to determine the form of . The following table lists some common mapping functions , used in

several kernel methods.

Kernel function

(mapping function)

Mathematical form

Linear

Quadratic

, , and

j-th order polynomial

, , and

Radial basic function , is the center and is a mapping function.

Multilayer perceptron for is the direction, is an offset and f is

the logistic function .

Sponsored by AIAT.or.th and KINDML, SIIT

Page 11: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

70

Among these functions, as the number of functions (dimensions) that are used as a basis set

increases, so does the number of parameters that must be determined using the limited training

set. For example, a complete quadratic discriminant function requires

terms for

one class and for C classes, we need

parameters to estimate. For this large

number, we may need to apply some kinds of constraints in order to regularize the model to

ensure that there is no over-fitting. An alternative to having a set of different functions is to have

a set of functions of the same parametric form, but which differ in the values of the parameters

they take,

where is a set of parameters. It is possible to have different models on the way the variable x

and the parameters v are combined. For radial basis function, the function is only the

magnitude of the difference between the pattern x and the weight vector (parameter) v as follow.

On the other hand, if is a function of the scalar product of the two vectors, as shown below, the

discriminant function is known as a multilayer perceptron, especially when is the logistic

function .

Both the radial basis function and the multilayer perceptron models can be used in regression.

3.1.2. k-nearest neighbor method

The k-NN is a type of instance-based learning, or lazy learning where the classification function is

locally approximated without constructing any generalized model (no learning phase) and all

computation is deferred until classification phase. By a vector space model, the training examples

are represented by vectors in a multidimensional feature space, each with a class label. The

training phase of the algorithm consists only of storing the feature vectors and class labels of the

training samples. There is no explicit learning process. In the classification phase, the k-nearest

neighbor method (k-NN) gives a class label to an unlabeled object (one of which class is

unknown) by finding k closest training examples (neighbors) in the feature space and then

assigning it the majority class of these neighboring examples. Normally, k is a user-defined

constant, which is a positive integer and typically small, say 3-50. For the special case of k = 1,

the object is simply assigned to the class of its nearest neighbor. As for distance measure or

metric, Euclidean distance is usually used in cases of numeric attributes (features). In cases of

nominal, ordinal or binary attributes, different types of metrics, such as the overlap metric (or

Hamming distance), can be used. These metrics are normally used for measuring the distance or

similarity in the processing of clustering as shown in Chapter 4. Figure 3-10 shows the decision

boundary of linear regression of two-class response. Figure 3-11 and Figure 3-12 illustrate the

decision boundaries of k-NN when k is 1 and 5, respectively.

Sponsored by AIAT.or.th and KINDML, SIIT

Page 12: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

71

Figure 3-10: The decision boundary of linear regression of two-class response

Figure 3-11: The decision boundary of k-NN when k = 1.

Figure 3-12: The decision boundary of k-NN when k = 5

However, one main drawback of the "majority voting" criterion on classification decision is

that the classes with more examples tend to be selected as the prediction of the test object since

Sponsored by AIAT.or.th and KINDML, SIIT

Page 13: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

72

they tend to be the label of the k nearest neighbors. One way to overcome this problem is to

weight the classification with consideration of the distance from the test point to each of its k

nearest neighbors. This weighting distance can also be applied for prediction, by just assigning

the predicted value for the object with the average of the values of its k nearest neighbors,

according to their distance-based contributions. That is, a closer neighbor has more contribution.

A common weighting scheme is to give each neighbor a weight of 1/d, where d is the distance to

the neighbor, called a generalization of linear interpolation.

While the naive version of the algorithm seems to be easily implemented by computing the

distances from the test sample (a test vector) to the training data (stored vectors), but this

process is computationally intensive, especially when the number of the training data (stored

vectors in the training set) is large. In decades, many researchers have proposed efficient nearest

neighbor search algorithms to find nearest neighbors of the test vector with tractable

computational time, even for large data sets. The nearest neighbor search (NNS), sometimes

known as proximity search, similarity search or closest point search, is an optimization problem

for finding closest points in metric spaces. The problem is “given a set P of points in a metric

space S and a query point q S, find the closest point (or k closest points) in P to q.” Normally, S is

taken to be n-dimensional Euclidean space and distance is measured by Euclidean distance or

Manhattan distance. This problem is also known as the post-office problem, referring to an

application of assigning a residence to the nearest post office.

The nearest neighbor algorithm has some strong consistency results. As the amount of data

approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the

Bayes error rate (the minimum achievable error rate given the distribution of the data). The k-

nearest neighbor is guaranteed to approach the Bayes error rate, for some value of k (where k

increases as a function of the number of data points). Various improvements to k-nearest

neighbor methods are possible by using proximity graphs.

It is observed that the result of 5-NN in Figure 3-12 shows fewer misclassified training

observations than the result of linear regression in Figure 3-10. It is more extreme in the case of

1-NN in Figure 3-11. There is none of the training observations are misclassified. As common

sense, the error on the training data should be approximately an increasing function of k, and will

always be 0 when k is 1. However, it is not the case when we test on an independent test set.

Therefore, it is more suitable to evaluate the method by using a separate test set since it is a real

situation. For complexity, the k-nearest-neighbor or k-NN fits have a single parameter, which is

the number of neighbors k while the linear regressions or least-squares fits in the previous

section have t+1 parameters, where t is the number of components or dimensions. However,

since k-NN depends on not only the parameter k but also the N training data themselves, the

effective number of parameters of k-NN is N/k and this number is normally bigger than t in linear

regression. The effective number of k-NN parameters decreases with increasing k. If the

neighborhoods were non-overlapping, there would be N/k neighborhood groups and we can

have one parameter (a mean) for each group. For k-NN, we cannot optimize the parameters by

sum-of-squared errors on the training set to find the optimal k since we will always get k=1.

Figure 3-13 shows an example of classification of the test datum using k-NN when k is 1, 3 or

5. For 1-NN, we select the closest point (object) which is the object No. 11. Since its label (class) is

‘Yes’, therefore the class for the test datum (85, 3 , 6 , 6 ) is ‘Play=Yes’. For 3-NN and 5-NN, we

select three and five nearest neighbors, respectively. For this example, 3-NN returns ‘No’ while 5-

NN gives ‘Yes’. We can observe that the answers for these three cases (k = 1, 3, 5) are not

consistent. Here, ‘Yes’, ‘No’ and ‘Yes’ for 1-NN, 3-NN and 5-NN, respectively.

Sponsored by AIAT.or.th and KINDML, SIIT

Page 14: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

73

No. Outlook Temp. Humidity Windy Play

1 90.00 40.00 80.00 10.00 No 2 95.00 32.00 85.00 80.00 No 3 50.00 35.00 90.00 20.00 Yes 4 10.00 24.00 80.00 5.00 Yes 5 15.00 10.00 50.00 15.00 Yes 6 20.00 12.00 55.00 90.00 No 7 55.00 9.00 45.00 95.00 Yes 8 85.00 22.00 95.00 25.00 No 9 95.00 7.00 50.00 5.00 Yes

10 5.00 26.00 45.00 10.00 Yes 11 80.00 25.00 40.00 80.00 Yes 12 45.00 24.00 85.00 85.00 Yes 13 40.00 37.00 60.00 15.00 Yes 14 25.00 23.00 90.00 95.00 No

(a) A sample data set (the real-valued Play-Tennis data set with categorical classes)

Outlook Temp. Humidity Windy Play

Test datum 85 30 60 60 ?

No Outlook Temp. Humidity Windy Play Distance Rank

1 90.00 40.00 80.00 10.00 No 55.00 6 2 95.00 32.00 85.00 80.00 No 33.60 2 3 50.00 35.00 90.00 20.00 Yes 61.24 7 4 10.00 24.00 80.00 5.00 Yes 95.32 13 5 15.00 10.00 50.00 15.00 Yes 86.17 12 6 20.00 12.00 55.00 90.00 No 73.99 10 7 55.00 9.00 45.00 95.00 Yes 52.83 4 8 85.00 22.00 95.00 25.00 No 50.14 3 9 95.00 7.00 50.00 5.00 Yes 61.27 8

10 5.00 26.00 45.00 10.00 Yes 95.61 14 11 80.00 25.00 40.00 80.00 Yes 29.15 1 12 45.00 24.00 85.00 85.00 Yes 53.72 5 13 40.00 37.00 60.00 15.00 Yes 64.02 9 14 25.00 23.00 90.00 95.00 No 75.99 11

(b) Distance between the test datum and all training data (last column)

Method K Nearest neighbors Predicted Class (majority vote)

1-NN: 1 No.11 (Yes) Yes

3-NN: 3 No.11 (Yes), No.2 (No), No.8 (No) No 5-NN: 5 No.11 (Yes), No.2 (No), No.8 (No), No.7 (Yes), No.12 (Yes) Yes

(c) Predicted class for the test datum when k = 1, 3 and 5 (last column)

Figure 3-13: An example of k-NN classification

As a formal description, given and an input

, the nearest neighbor methods attempt to find a set of the closest objects from the

observations in T. That is, . The

class of the input , will be determined by observing the classes of its nearest neighbors in

the set as follows.

To find nearest neighbors of , it is necessary to define a metric. Possible metrics include

Euclidean distance, Mahattan distance or a distance-based or statistical measure. This metric will

determine the k observations , which are closest to in the input space, and then we can

average their response to obtain the final class of the input .

Sponsored by AIAT.or.th and KINDML, SIIT

Page 15: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

74

3.1.3. Statistical Classifiers

As a typical statistical classification method, a Bayesian classifier predicts the most plausible

class for an unlabeled object by calculating class membership probabilities of all possible classes

the object can belong to and then comparing these probabilities to find the class with the highest

probability. In general, Bayesian classification is based on Bayes’ theorem (named after Thomas

Bayes, a nonconformist English clergyman who had produced early works in probability and

decision theory during the 18th century) as follows.

Given a set of possible classes and an unlabeled object , its most plausible class is the

class the probability of which, , achieves the highest value when the object is

encoded by the function , that is . The function is arbitrary but expresses the properties of

the object. It is usually represented by a set of n attributes ( ) as follows.

In Bayesian terms, are considered as a set of evidence (as a set of attributes) for

an object . The class of the object can be viewed as a hypothesis, such as that the object

(data tuple) belongs to a specified class . As the classification problem, the classifier

finds , the probability that the class holds given the evidence or attributes

observed attributes . In other words, we are looking for the probability that the

object belongs to class C, given that we know the attribute description of , i.e., .

The is called the posterior probability, or a posteriori probability, of

conditioned on .

Given an example of the Play-Tennis dataset, each tuple is described by the values of four

attributes; outlook, temperature, humidity and windy, with two possible classes; ‘play’

( ) and ‘not play’ ( ). Then the classification of an object can be done by

comparing two probabilities of , one for the ‘play’ class and the other for the

‘not play’ class as follows.

(1)

(2)

For the Play-Tennis dataset and the test object in Figure 3.14, the probabilities are defined by

(1’) , and

(2’) .

These two probabilities are calculated and compared to find the maximum one and then the class

with the maximum value will be assigned to the object.

In general, it requires a large set of existing records (samples) as a training dataset to find the

estimated value of . For example, from the dataset in Figure 3-14, we can

calculate the estimated probabilities of as follows.

Sponsored by AIAT.or.th and KINDML, SIIT

Page 16: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

75

Outlook Temperature Humidity Windy Play

sunny hot high false No sunny hot high true No

overcast hot high false Yes rainy mild high false Yes rainy cool normal false Yes rainy cool normal true No

overcast cool normal true Yes sunny mild high false No sunny cool normal false Yes rainy mild normal false Yes sunny mild normal true Yes

overcast mild high true Yes overcast hot normal false Yes

rainy mild high true No

(a) The Play-Tennis dataset

Outlook Temperature Humidity Windy Play

sunny mild high true ?

(b) test data

Figure 3-14: The Play-Tennis dataset and the test data

= 1

= 0

= 1

= 0 ... …

= 0

= 1

= 1

= 0

As the above example, when a dataset is small, the following issues needed to be considered.

1. In real application when there are many features, it is likely that the training set may not

cover all possible cases. For those cases, it is impossible to obtain their probabilities. In the

above example, since there is no record (sample) for the test example, we cannot find the

following two probabilities.

2. In several cases, there are not enough data examples to estimate plausible probability. In the

example, there is merely a single record (sample) for each case. By this condition, the

probability for ‘yes’ (‘no’) is either 0 or 1. The issue is whether a single record is a good

representative for the case or not. In terms of statistics, for example the -test may be used

for testing, such case will have low reliability since the mere record in the training set may

occasionally appear by chance. To have good estimation of probabilities, we need to have a

very large dataset. For example, suppose that the dataset has ten attributes with one class to

be predicted and each attribute has two different possible values. Theoretically there are

= 1024 possible combinations for these ten attributes. To have enough data for reliability, we

Sponsored by AIAT.or.th and KINDML, SIIT

Page 17: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

76

may need up to 20-30 cases for each combination. Therefore, we need approximately 20000-

30000 records. This number may be possible. However, a case of 20 binary attributes

requires records, approximately 1 million possible combinations and then 20-30 million

records. It is quite hard to have this number of records in the real situation.

To solve the above two issues, it is possible to apply Bayes theorem with some independence

assumptions. The strongest assumption is to suppose that all attributes are independently with

each other. As mentioned previously, given a set of possible classes , n attribute value domains

, and an unlabeled object , the most plausible class of the object is the class

the probability of which, , achieves the highest value when the object is

represented by a set of n attributes as follows.

Here, is called the posterior probability, or a posteriori probability, of

conditioned on . Based on the Bayes’ rule ,

can be derived easily. By this rule, we obtain the following equation.

From the example, the meanings of the related probability can be summarized below.

Description

Example Meaning Probability that play is ‘yes’ when we know that outlook is ‘sunny’, temperature

is ‘mild’, humidity is ‘high’ and windy is ‘true’.

Description Example Meaning Probability that play is ‘yes’, regardless of any condition.

Description Example Meaning Probability that outlook is sunny, temperature is mild, humidity is high and

windy is true when we know that play is yes.

Description

Example Meaning Probability that outlook is sunny, temperature is mild, humidity is high and

windy is true, regardless of any condition.

Here, the function argmax will return only the plausible class and all classes will have the same

denominator. With this, the denominator can be ignored and the equation can be simplified as

follows.

In this equation, instead of , the (the prior probability or a priori

probability of ) and the , are used. The prior probability, or a priori

probability, of ( ) is the probability of an object, regardless of its attribute values. The

posterior probability, , is the probability that the known class has the

attributes . Here, the original probability means the

probability that given the attributes , the class is expected to be . For the above

Sponsored by AIAT.or.th and KINDML, SIIT

Page 18: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

77

equation, the number of parameters of the first component equals to the numbers of

classes in consideration. In the Play-Tennis example, it is two, i.e. and

. For the second component , the number of its parameters is

equivalent to the number of parameters in the original probability .

However, unlike the original one, it is possible for us to transform this equation to a more

convenient form. To do this, it is possible to use the joint probability.

Then the class estimation can be concluded as follows.

For the above equation, the number of parameters of the second component

equals to the number of classes multiplied by the number of possible values of . The third

component would have more parameters, equivalent to the number of classes

multiplied by the number of possible values of and then multiplied by the number of

possible values of . The later components have more parameters. Moreover, the last

component has the most parameters, equivalent to those of the original one. As stated above, it

requires a very large dataset to compute components with a large number of parameters.

Naïve Bayes Classifier

In order to avoid this limitation, conditional independence can be made. In the extreme case, it is

possible to presume that all attributes are independent of each other. In other words, the values

of the attributes are conditionally independent of one another, given the class label of the object.

By this assumption, the formula is simplified as follows.

In this formula, each term, except the first term, is the probability to obtain the attribute value,

given only the class value. This simplification is known as naïve Bayes since it is the most

intuitive and simple. It is also possible to simplify with more complex constraints. For example, if

we assume that depends on , depends on and , and others are

independent of each other, the class prediction can be formulated as follows.

Note that the fourth and the fifth terms need more parameters while the others are the same

with naïve Bayes. This generalization can be set up by manual with human’s predefined

knowledge. The dependency can be expressed in the form of Bayesian belief networks which are

graphical models, unlike naïve Bayesian classifiers, allow the representation of dependencies

among subsets of attributes. Explained later, this Bayesian belief networks could be used for

classification. To be summarized, the naïve Bayesian classifier works as follows.

1. Given a training set of objects and their associated class labels, denoted by

, each object is represented by an n-dimensional attribute vector,

depicting the measure values of n attributes

of the object with its class , one from m possible

classes, .

Sponsored by AIAT.or.th and KINDML, SIIT

Page 19: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

78

2. The Bayesian (statistical) classifier assigns (or predicts) a class to the object when the

class has the highest posterior probability over the others, conditioned on the object’s

attribute values . That is, the Bayesian classifier predicts that the

object belongs to the class if and only if

It could be represented by a function argmax, that is, we maximize , called the

maximum posteriori hypothesis.

By Bayes’ theorem, this equation is equated to

3. Since is constant for all classes , only

need be maximized.

if

Note that the class prior probabilities may be estimated by , where

is the number of training objects, which belong to the class in the training set

and is the total number of training objects. However, if the class prior probabilities are

not known, it is commonly assumed that the classes are equally likely, that is,

. Therefore, we would simplify the term to .

if

4. Several datasets may have a large number of attributes. In these cases,

will have high complexity (need a large set of examples to calculate)

since it may include so many parameters. In order to reduce complexity in evaluating

, the naïve assumption of class conditional independence can be made.

This assumes that the values of the attributes are conditionally independent of one another,

given the class label of the object (i.e., that there are no dependence relationships among

the attributes). Thus,

Normally we can easily estimate the probabilities

from the data in the training dataset. For classification, the class value is categorical (a label,

not a continuous-valued number) while an attribute can be either categorical or

continuous-valued. The method to compute can be done as follows.

(a) If the j-th attribute is categorical, then is the number of objects of class

in the training dataset TR, having the attribute value of for the attribute

, divided by , the number of objects of class in T as follows.

Sponsored by AIAT.or.th and KINDML, SIIT

Page 20: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

79

(b) If the j-th attribute is continuous-valued (numeric), then can be

calculated by a Gaussian distribution of the attribute in the class with a mean and standard deviation , defined as follows.

Here, is the density function and is a small slack value. There is no need to know the exact value of this slack value since it will be cancelled later when the probabilities are compared with each other. Given the class the class means and the standard deviation of the j-th attribute, respectively denoted by and , can be derived by

the following equations.

Here, is the value of the j-th attribute of the object , is the set of the

objects belonging to the class , which is a subset of the whole training set T, and

is the number of objects in .

5. In order to predict the class label of , is evaluated for each class

. The classifier predicts that the class label of the object is the class if and only if

In other words, the predicted class label is the class for which

is the maximum. One interesting query on naïve Bayesian classifiers is how effective it is.

As stated before, two unrealistic assumptions for this method are (1) all attributes are

equally important, and (2) the second one states that all attributes are statistically

independent (given the class value) with each other. This means that knowledge about the

value of a particular attribute does not tell us anything about the value of another attribute

(if the class is known). In general, although based on these unrealistic assumptions that are

almost never correct, this scheme works well in practice. The naïve Bayesian has shown in

several literatures to obtain high accuracy.

However, a well-known issue of these statistical approaches is a so-called sparseness problem.

The problem occurs when we have a limited number of data. This makes some values not occur

in the training set in conjunction with every class value or some value sets may never occur in

the training set. This situation will introduce a zero-valued probability for some events,

. This zero value will make the following class prediction output a zero value for

the class , even other terms may give high probabilities, except one specific term provides zero.

That is, even though, without the zero probability, we may have ended up with a high probability,

suggesting that belonged to class . A zero probability cancels the effects of all of the other

(posteriori) probabilities (on ) involved in the product.

Sponsored by AIAT.or.th and KINDML, SIIT

Page 21: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

80

This zero-valued probability may not be realistic. It may be triggered since we do not have

enough data for training. Therefore, the value may not be a real zero. To avoid this problem, it is

possible to apply a simple trick by assuming a small probability for unseen cases. This technique

for probability estimation is known as the Laplacian correction or Laplace estimator, named after

Pierre Laplace, a French mathematician who lived from 1749 to 1827. In this technique, we will

add one additional count for all the conditions. For example,

Here, the j-th attribute can take one value from the set . In the above

equation, we add one count for each attribute value. Since we will have one added for all the

attribute values, the corresponding denominator will be added with L, the number of possible

values of the attribute, when the probability is calculated.

Patient

No. Blood Pressure

(feature #1) Protein Level (feature #2)

Glucose Level (feature #3)

Heart Beat (feature #4)

diseased (class)

1 High Medium 143 H Slow Positive 2 High High 92 N Fast Negative 3 High Low 150 H Slow Positive 4 High Low 99 N Fast Negative 5 Normal Low 93 N Fast Negative 6 Normal High 75 N Slow Negative 7 Normal Medium 80 N Slow Negative 8 Low Medium 139 H Slow Positive 9 High Medium 105 H Slow Positive

10 High High 90 N Fast Negative 11 High Low 91 N Slow Positive 12 High Low 107 H Fast Negative 13 Normal Low 95 N Fast Negative 14 Normal High 96 N Slow Negative 15 Normal Medium 81 N Slow Negative 16 Low Medium 144 H Slow Positive 17 High Medium 150 H Slow Positive 18 High High 98 N Fast Negative 19 High Low 96 N Slow Positive 20 High Low 83 N Fast Negative 21 Normal Low 95 N Fast Negative 22 Normal High 98 N Slow Negative 23 Normal Medium 105 H Slow Negative 24 Low Medium 128 H Slow Positive 25 High Medium 145 H Slow Positive 26 High High 94 N Fast Negative 27 High Low 92 N Slow Positive 28 High Low 108 H Fast Negative 29 Normal Low 93 N Fast Negative 30 Normal High 109 H Slow Negative 31 Normal Medium 95 N Slow Negative 32 Low Medium 127 H Slow Positive

(a) A medical laboratory test dataset (four features and one class)

Blood Pressure Protein Level Glucose Level Heart Beat Diseased

Normal High 104 Fast ?

(b) A test case

Figure 3-15: A medical laboratory test dataset (For glucose, H for 99, N for 99)

Sponsored by AIAT.or.th and KINDML, SIIT

Page 22: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

81

There are several variants of Laplacian correction or Laplace estimator. Two possibilities are

(1) addition of an equal small value to each attribute value based on the number of possible

attribute values, in order to make the total correction become 1, and (2) addition of different

small values to attribute values based on their contributions ( ), but maintaining the total

correction to 1. These two options are depicted in the following two equations in order.

(1)

(2)

where

Figure 3-15 illustrates another example, which shows a medical laboratory test dataset and a

test record. Figure 3-16 shows the construction of a naïve Bayes model in the form of a table

from the dataset. In this example, we use Laplacian correction or Laplace estimator for nominal

attributes while we use the mean and the standard deviation (S.D.) to calculate Gaussian-based

probability for the numeric attribute, i.e., Glucose level.

Blood Pressure (feature #1)

Protein Level (feature #2)

Glucose Level (feature #3)

Heart Beat (feature #3)

Diseased (class)

Positive Negative Positive Negative Positive Negative Positive Negative Positive Negative

High 8+(1/3) =8.33

8+(1/3) =8.33

High 0+(1/3) =0.33

8+(1/3) =8.33

143 92 Fast 0+(1/2) =0.5

12+(1/2) =12.5

12+1=13 20+1=21

Normal 0+(1/3) =0.33

12+(1/3) =12.33

Medium 8+(1/3) =8.33

4+(1/3) =4.33

150 99 Slow 12+(1/2) =12.5

8+(1/2) =8.5

Low 4+(1/3) =4.33

0+(1/3) =0.33

Low 4+(1/3) =4.33

8+(1/3) =8.33

139 93

105 75

91 80

144 90

150 107

96 95

128 96

145 81

92 98

127 83

95

98

105

94

108

93

109

95

High 8.33/13 =0.641

8.33 / 21 =0.397

High 0.33/13 =0.0254

8.33/21 =0.397

Mean 125.83 94.30 Fast 0.5/13 =0.0385

12.5/21 =0.595

13/34 =0.382

21/34 =0.618

Normal 0.33/13 =0.0254

12.33/21 =0.587

Medium 8.33/13 =0.641

4.33/21 =0.206

S.D. 23.40 9.30 Slow 12.5/13 =0.962

8.5/21 =0.405

Low 4.33/13 =0.333

0.33/21 =0.0157

Low 4.33/13 =0.333

8.33/21 =0.397

Figure 3-16: Construction of a naïve Bayes model for the medical laboratory dataset

In the classification process, now suppose that the following new case is encountered as shown

in Figure 3-15. It is also shown below.

Blood Pressure Protein Level Glucose Level Heart Beat Diseased Normal High 104 Fast ?

Sponsored by AIAT.or.th and KINDML, SIIT

Page 23: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive

82

Then the target is to predict what the value is for ‘Diseased’. In the naïve Bayes model, all features

are treated equally for their importance and independently with each other. In this example,

blood pressure, protein level, glucose level, heart beat and also the class ‘Diseased’ are equally

important and independent of each other. The most likely class can be determined by selecting

the class with the highest probability. This assertion is stated previously.

Therefore the overall probability likelihood fraction of ‘Diseased=positive’ (for short, ‘pos’) and

that of ‘Diseased=negative’ (for short, ‘neg’) are as follows.

Likelihood of ‘pos’

Likelihood of ‘neg’

According to the above calculation, we can observe that the likelihood of ‘negative’ is much

higher than that of ‘positive’ in this case. Therefore, the case should be assigned with ‘negative’.

Moreover, after the normalization, the probabilities for ‘positive’ and that of ‘negative’ are as

follows. Here, X means the current environment. i.e., blood pressure, protein level, glucose level

and heart beat.

P(positive|X) =

= 0.00005

P(negative|X) =

= 0.99995

In conclusion, the Naïve Bayes classification method is based on Bayes's rule and "naïvely"

assumes independence: it is only valid to multiply probabilities when the events are

independent. The assumption that attributes are independent given the class is certainly

a simplistic one in real life. However, despite the disparaging name, Naïve Bayes works very

well when tested on actual datasets, particularly when combined with attribute

selection procedures that eliminate redundant, and hence non-independent, attributes. One special

treatment is to apply the Laplacian correction or Laplace estimator to solve the problem of zero

probabilities caused by the limited size of the training data set.
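To make the above procedure concrete, the following Python sketch (an illustration written for this discussion, not code from the text or any particular library) builds the kind of model summarized in Figure 3-16: nominal attributes receive Laplace-corrected conditional probabilities, the numeric attribute is modeled by a per-class Gaussian density, and classification multiplies the class prior by the per-attribute factors and normalizes. The dictionary-based record format and the function names are assumptions made only for this example.

import math
from collections import Counter, defaultdict

def train_naive_bayes(records, labels, nominal_attrs, numeric_attrs):
    """Count-based training: Laplacian correction for nominal attributes,
    per-class mean and standard deviation for numeric attributes."""
    classes = Counter(labels)
    model = {"prior": {}, "nominal": defaultdict(dict), "numeric": defaultdict(dict)}
    for c, n_c in classes.items():
        # Laplace-corrected prior, e.g. (12 + 1) / (32 + 2) = 0.382 in Figure 3-16.
        model["prior"][c] = (n_c + 1) / (len(labels) + len(classes))
        for a in nominal_attrs:
            values = sorted({r[a] for r in records})
            counts = Counter(r[a] for r, y in zip(records, labels) if y == c)
            # Add 1/|V| to each value count and 1 to the class total (option (1) above).
            model["nominal"][c][a] = {
                v: (counts[v] + 1.0 / len(values)) / (n_c + 1) for v in values
            }
        for a in numeric_attrs:
            xs = [r[a] for r, y in zip(records, labels) if y == c]
            mean = sum(xs) / len(xs)
            sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
            model["numeric"][c][a] = (mean, sd)
    return model

def gaussian_density(x, mean, sd):
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def nb_posteriors(model, record):
    """Multiply the prior by each attribute factor, then normalize over the classes."""
    scores = {}
    for c, prior in model["prior"].items():
        score = prior
        for a, table in model["nominal"][c].items():
            score *= table[record[a]]
        for a, (mean, sd) in model["numeric"][c].items():
            score *= gaussian_density(record[a], mean, sd)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

With the training data of Figure 3-15, this sketch should reproduce posteriors close to the 0.00005 and 0.99995 obtained above for the test record.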

Bayesian Belief Networks

As mentioned above, while the naïve Bayesian classifier simplifies the calculation process by

making the assumption of class conditional independence (i.e., given the class label of a tuple, the


values of the attributes are assumed to be conditionally independent of one another), in practice,

however, dependencies can exist between attributes or features (variables). For this purpose,

Bayesian belief networks, as more general models, provide a framework to specify joint

conditional probability distributions among attributes. They allow class conditional

independencies to be defined between subsets of attributes. They can be expressed with a

graphical model of causal relationships, on which learning can be performed. In place of naïve

Bayes classification, trained Bayesian belief networks can be used for classification. Bayesian

belief networks are also known as belief networks, Bayesian networks, and probabilistic

networks. A belief network is defined by two components, a directed acyclic graph and a set of

conditional probability tables. Each node in the directed acyclic graph represents a random

variable (attribute). The variables (attributes) may be discrete or continuous-valued. Each edge

(arc) represents a probabilistic dependence, represented by a so-called conditional probability

table (CPT). If an edge is drawn from a node P to a node Q, then P is a parent or immediate

predecessor of Q, and Q is a descendant of P. Each variable is conditionally independent of its

non-descendants in the graph, given its parents.

Figure 3-17: Attribute dependency graph (a kind of Bayesian Network) with probabilities

simply calculated from the medical laboratory data in Figure 3-15.

Figure 3-17 illustrates a sample dependency network in the medical laboratory where ‘diseased’

may affect ‘protein level’ and ‘glucose level’, the combination of ‘diseased’ and ‘protein level’ may

affect ‘heart beat, and the combination of ‘diseased’ and ‘glucose level’ may affect ‘blood pressure’.

Note that there is a conditional probability table (CPT) for each edge and an unconditional probability table for

each root node; no unconditional probability table is kept for the intermediate and leaf nodes. Figure 3-18 is the

same Bayesian network with the Laplacian correction applied. For statistical reasoning, the most likely

class can be determined by selecting the class with the highest probability, as follows.


Tailored to this example, the following equations can be assumed.

Figure 3-18: Attribute dependency graph (a kind of Bayesian Network) with Laplacian correction

According to the dependency defined in Figure 3-17 and Figure 3-18, it is possible to ignore some

items in the conditional part of the chain-rule expansion as follows (writing D, PL, GL, HB and BP for Diseased, Protein Level, Glucose Level, Heart Beat and Blood Pressure):

P(D, PL, GL, HB, BP) = P(D) x P(PL|D) x P(GL|D,PL) x P(HB|D,PL,GL) x P(BP|D,PL,GL,HB)
                     ≈ P(D) x P(PL|D) x P(GL|D) x P(HB|D,PL) x P(BP|D,GL)

Note that the third, fourth and fifth terms can be approximated by a reduced representation when

we assume the attribute dependency graph. Next, in the classification process, suppose

that the previously mentioned case is encountered as follows.


Blood Pressure = Normal, Protein Level = High, Glucose Level = High, Heart Beat = Fast, Diseased = ?

For this setting, the overall likelihoods of 'Diseased = positive' (for short, 'pos')

and of 'Diseased = negative' (for short, 'neg') can be calculated from the tables in Figure 3-18 as follows.

Likelihood of 'pos' = P(pos) x P(PL=High|pos) x P(GL=High|pos) x P(HB=Fast|pos, PL=High) x P(BP=Normal|pos, GL=High)

Likelihood of 'neg' = P(neg) x P(PL=High|neg) x P(GL=High|neg) x P(HB=Fast|neg, PL=High) x P(BP=Normal|neg, GL=High)

According to the above calculation, we can observe that the likelihood of 'positive' is considerably higher

than that of 'negative' in this case. Therefore, the case should be assigned the class 'positive'.

Moreover, after normalization, the probabilities of 'positive' and 'negative' are as

follows. Here, X denotes the observed evidence, i.e., the blood pressure, protein level, glucose level

and heart beat of the case.

P(positive|X) = Likelihood of 'pos' / (Likelihood of 'pos' + Likelihood of 'neg') = 0.7892

P(negative|X) = Likelihood of 'neg' / (Likelihood of 'pos' + Likelihood of 'neg') = 0.2108

For comparison with the result of naïve Bayes (NB), the NB results are listed again below.

P(positive|X) = 0.00005

P(negative|X) = 0.99995

We can observe that the two results contradict each other. Two possible factors behind this phenomenon are

(1) whether the 'Glucose' attribute is treated as a nominal attribute or as a numeric attribute (discrete

or continuous-valued), and (2) the different settings of attribute dependency. For the first factor,

although the choice depends on the modeler, treating an inherently numeric measurement as a numeric

attribute is usually more realistic. For the second factor, more structural information can help us obtain more

precise evidence for determining which class the object should belong to.
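The difference between the two models can also be seen in code. The short Python sketch below, a simplified illustration rather than the text's exact procedure, scores each class with the factorization used above, P(D) x P(PL|D) x P(GL|D) x P(HB|D,PL) x P(BP|D,GL); the CPT dictionaries and their key layout are assumptions, since the concrete values of Figure 3-18 are not reproduced here.

def bn_posteriors(record, prior, cpt_pl, cpt_gl, cpt_hb, cpt_bp):
    """Posterior of 'Diseased' under the dependency structure of Figure 3-17:
    Diseased -> Protein Level, Diseased -> Glucose Level,
    (Diseased, Protein Level) -> Heart Beat, (Diseased, Glucose Level) -> Blood Pressure."""
    scores = {}
    for c in prior:                                   # c is 'positive' or 'negative'
        score = prior[c]                              # P(D = c)
        score *= cpt_pl[(record["protein"], c)]       # P(PL | D)
        score *= cpt_gl[(record["glucose"], c)]       # P(GL | D)
        score *= cpt_hb[(record["heartbeat"], c, record["protein"])]      # P(HB | D, PL)
        score *= cpt_bp[(record["bloodpressure"], c, record["glucose"])]  # P(BP | D, GL)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # normalized posteriors

Compared with the naïve Bayes sketch, only the last two factors change: they condition on a second variable in addition to the class.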

In the above example, we have shown a simple method to calculate a Bayesian belief network

from a given dataset. However, the learning or training of a belief network may occur under several

different situations, as described next. If the network topology (i.e., the layout of nodes and arcs) is known and

all the variables are observable, then training the network is straightforward: the training process

can be performed in the same way as the calculation of the probabilities in naive Bayesian classification.


The above example is such a case. On the other hand, if the network topology is given in advance

but some of the network variables are hidden (referred to as missing values or incomplete data), there are

various methods to choose from for training the belief network. One promising method is gradient

descent. Without an advanced mathematical background, it may be hard to follow the full derivation,

since it involves calculus-heavy formulae; however, packaged software exists that solves these

equations without requiring a deep understanding of them. In the following, the general idea,

which is not so difficult, is described.

Let D = {X_1, X_2, ..., X_|D|} be a training set of data tuples. Training the belief network means

that we must learn the values of the CPT entries. Let w_ijk = P(Y_i = y_ij | U_i = u_ik) be a CPT entry

for the variable Y_i taking its j-th value y_ij given that its parents U_i take the k-th combination of values u_ik. For example, if w_ijk is the upper leftmost

CPT entry of Figure 3-17, then Y_i is "Protein Level"; y_ij is its value, either "low", "medium" or

"high"; U_i is the list of the parent nodes of Y_i, in this case "Diseased"; and u_ik shows the values

of the parent nodes, i.e., either "positive" or "negative". The w_ijk's are viewed as weights, analogous to

the weights in hidden units of neural networks. The set of weights is collectively referred to as W.

The weights are initialized to random probability values. A gradient descent strategy is then used to

search for the w_ijk values that best model the data, based on the assumption that each possible setting of

W is equally likely a priori. Such a strategy is iterative and performs greedy hill-climbing: at each

iteration or step, the algorithm moves toward what appears to be the best solution at the moment,

without backtracking, following the gradient of a criterion function, and the weights eventually

converge to a local optimum solution. For our problem, the criterion to maximize is

P_W(D) = Π_{d=1..|D|} P_W(X_d), or equivalently its logarithm ln P_W(D).

Given the network topology and the initialized W, the algorithm proceeds as follows:

1. Compute the gradients: For each i, j, k, compute

      ∂ ln P_W(D) / ∂ w_ijk  =  Σ_{d=1..|D|}  P(Y_i = y_ij, U_i = u_ik | X_d) / w_ijk

The probability on the right-hand side of the above equation is to be calculated for each

training tuple X_d in D. For brevity, let us refer to this probability simply as p. When the

variables represented by Y_i and U_i are hidden for some X_d, then the corresponding

probability p can be computed from the observed variables of the tuple using standard

algorithms for Bayesian network inference, available online.

2. Take a small step in the direction of the gradient: The weights are updated by

      w_ijk  <-  w_ijk  +  l x ( ∂ ln P_W(D) / ∂ w_ijk )

   where l is the learning rate representing the step size and ∂ ln P_W(D) / ∂ w_ijk is the gradient computed in the

   first step. The learning rate is set to a small constant and helps with convergence.


3. Renormalize the weights: Because the weights are probability values, they must lie

   between 0.0 and 1.0, and Σ_j w_ijk must equal 1 for all i, k. These criteria are achieved by

   renormalizing the weights after they have been updated in the second step.
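The three steps can be summarized in a short Python sketch. It is only an outline: it assumes that a routine joint_posterior(X, i, j, k, W), returning P(Y_i = y_ij, U_i = u_ik | X) by standard Bayesian-network inference, is supplied from elsewhere, and that the CPT entries are stored in a dictionary W keyed by (i, j, k).

def train_cpts(data, W, joint_posterior, learning_rate=0.01, iterations=100):
    """Gradient-based update of the CPT entries w_ijk, followed by renormalization
    so that, for every (i, k), the entries sum to 1 over j."""
    for _ in range(iterations):
        # 1. Compute the gradient of ln P_W(D) with respect to each w_ijk.
        grad = {}
        for (i, j, k), w in W.items():
            grad[(i, j, k)] = sum(joint_posterior(X, i, j, k, W) for X in data) / w
        # 2. Take a small step in the direction of the gradient.
        for key in W:
            W[key] += learning_rate * grad[key]
        # 3. Renormalize each conditional distribution.
        totals = {}
        for (i, j, k), w in W.items():
            totals[(i, k)] = totals.get((i, k), 0.0) + w
        for (i, j, k) in W:
            W[(i, j, k)] /= totals[(i, k)]
    return W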

Algorithms that follow this form of learning are called Adaptive Probabilistic Networks. Other

methods for training belief networks are referenced in the bibliographic notes at the end of this

chapter. Belief networks are computationally intensive. Because belief networks provide explicit

representations of causal structure, a human expert can provide prior knowledge to the training

process in the form of network topology and/or conditional probability values. This can

significantly improve the learning rate.

3.1.4. Decision Trees

Also known as a classification tree for discrete outcome and a regression tree for continuous

outcome, a decision tree is a tree-like graph or model used for predicting a class (consequence or

outcome) of an event based on observed properties of that event. Commonly used for decision

analysis in operations research to help identify an optimal action towards a goal, a decision tree is

a predictive model that maps observations about an event to conclusions about its target value.

In general, classification using a decision tree achieves high accuracy, but the performance usually

depends on the characteristics of the data at hand. Decision tree induction algorithms have

been used for classification in many application areas, such as medicine, manufacturing and

production, financial analysis, astronomy, and molecular biology. In general, a decision tree

consists of three components: (1) outcome nodes (rectangular), (2) decision criterion nodes

(ovals), and (3) decision branches (lines), as shown in Figure 3-19.

Outlook = sunny    -> Humidity = high -> No;  Humidity = normal -> Yes
Outlook = overcast -> Yes
Outlook = rainy    -> Windy = false -> Yes;  Windy = true -> No

Figure 3-19: A decision tree (the Play-Tennis data)

The leaf nodes represent classification (decision) outcome, the root and the intermediate nodes

express a decision criterion, and the branches under a node indicate possible values of the

decision criterion of the node. Similar to any classification techniques for machine learning or

data mining, basically two general phases for decision tree are (1) learning and (2) classification.

In the first phase, given a data set, one can create a decision tree by creating nodes one by one

from the root toward the leaves. In the second phase, after getting a decision tree, the tree can be

used for the classification of unknown or unseen data. While the first phase is complex, the second

phase is very simple. For example, given the decision tree in Figure 3-19, the following case can

be classified as 'Yes', as depicted by the solid arrow in Figure 3-20. In this case, the branches

'Outlook = sunny' and 'Humidity = normal' are followed.



(a) The test datum:  Outlook = sunny, Temp. = hot, Humidity = normal, Windy = false, Play = ?

(b) Usage of the decision tree to classify the test datum: the path Outlook = sunny -> Humidity = normal leads to the leaf 'Yes'.

Figure 3-20: Usage of the decision tree to classify the test datum
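The classification phase is simple enough to express directly. The Python sketch below uses a nested-dictionary encoding of the tree in Figure 3-19 (an encoding chosen only for this illustration) and walks from the root to a leaf by following the branch that matches each attribute value of the record.

# Internal nodes are (attribute, {branch value: subtree}) pairs; leaves are class labels.
play_tennis_tree = ("Outlook", {
    "sunny":    ("Humidity", {"high": "No", "normal": "Yes"}),
    "overcast": "Yes",
    "rainy":    ("Windy", {"false": "Yes", "true": "No"}),
})

def classify_with_tree(tree, record):
    """Follow decision branches until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree

# The test datum of Figure 3-20: Outlook=sunny, Temp.=hot, Humidity=normal, Windy=false.
print(classify_with_tree(play_tennis_tree,
                         {"Outlook": "sunny", "Temp.": "hot",
                          "Humidity": "normal", "Windy": "false"}))   # prints: Yes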

To learn a decision tree model from a training data set, an attribute selection measure is

used to choose the best attribute, i.e., the one that optimally splits the tuples into distinct classes. Popular

measures of attribute selection are information gain and gain ratio, described later. Since

the construction of a decision tree may introduce branches that merely reflect noise or

outliers in the training data, one may need tree pruning to identify and remove such branches.

This can be done by investigating the improvement of classification accuracy on unseen

data. The technique for constructing a decision tree from data is called decision tree learning or

decision tree induction. Decision trees have several advantages, as follows.

Decision trees are simple to interpret and understand; decision tree models can be

understood after a brief explanation.

Decision trees are a white-box model; they can provide a result together with an explicit

explanation of why that result was reached.

Decision Tree Induction

The decision tree induction consists of recursive steps as follows. First, one has to select the

attribute that best partitions the training data set, to place at the root node and then make one

branch for each possible value. By this, the training data set is split up into subsets, one for every

value of the attribute. In the same manner, the process is repeated recursively for each branch,

using only those instances that reach the branch. As a termination criterion, one can stop

splitting a branch when all instances at a node possess the same class (a pure node). However, in

real situations there are often cases where we cannot obtain such a pure node, or where, if we keep splitting

without stopping, we end up with leaf nodes containing only one instance. Such a situation is called

overfitting and is not preferable, since nodes with too few instances are not reliable. As a

solution, a pruning process is needed.

In conclusion, decision tree induction copes with two important issues: (1) how to

determine which attribute to split on at each step, and (2) how to prevent the overfitting

problem. Decision tree induction works as follows. Given a

training set of objects and their associated class labels, denoted by T, each

object is represented by an n-dimensional attribute vector x = (x_1, x_2, ..., x_n), depicting

the measured values of the n attributes A_1, A_2, ..., A_n of the object, together with its class c, one from m


possible classes C_1, C_2, ..., C_m. Here, suppose that the attribute A_i has p_i possible values

v_i1, v_i2, ..., v_ip_i; that is, the value x_i of A_i for an object is one of these p_i values.

1. Select the best attribute for the first node in order to split the training set into a number of

subsets. In this attribute selection, the two most popular criteria are information gain and gain

ratio, though other possible criteria include a minimal number of instances at a node, a

maximal depth of the tree, or a threshold value for the information gain (or gain ratio).

Figure 3-21 shows how a training set of objects is split into several subsets when a node

is selected for splitting. The formal definitions of information gain and gain ratio are as follows.

Here, C_k is the k-th class, T is the training set before splitting, T_k is the set of the instances

with the class C_k in the set T, T_j is a subset of the training set after splitting, containing the

objects which have the value v_ij for the attribute A_i (that is, T_j = { x in T : x_i = v_ij }), and T_jk is the set of

the instances with the class C_k in the subset T_j. With this notation, |T| is the total number

of instances in the training set before splitting, |T_k| is the number of class-k instances in the

set T, |T_j| is the number of instances in the subset which have x_i = v_ij, and |T_jk| is the

number of class-k instances in the subset T_j.

Information gain:

   Gain(A_i) = Info(T) - Info_{A_i}(T)

   where  Info(T) = - Σ_{k=1..m} (|T_k|/|T|) log2(|T_k|/|T|)
          Info_{A_i}(T) = Σ_{j=1..p_i} (|T_j|/|T|) x Info(T_j)

Gain ratio:

   GainRatio(A_i) = Gain(A_i) / SplitInfo(A_i)

   where  SplitInfo(A_i) = - Σ_{j=1..p_i} (|T_j|/|T|) log2(|T_j|/|T|)

(A small code sketch of these two measures is given after Figure 3-23.)

Figure 3-21: A training set is split into subsets when a node is selected for splitting.


2. Repeat the above process on each subset by selecting another attribute to perform further

splitting. This process is terminated when the subset includes only a few instances or only

instances from one class (a pure subset), or when some other termination criterion is satisfied.

Figure 3-22 shows an example of the iterative process for constructing a decision tree,

starting from the root node towards the leaf nodes. The final tree is constructed as in Figure 3-23.

(a) The best attribute is selected as the root node and then a number of branches are generated according to its possible attribute values.

(b) A node is created for each branch, and those nodes are split further if they are not of one class and still include enough instances.

(c) Each succeeding node is split further until it includes only instances from one class or too few instances.

Figure 3-22: A tree is constructed from the root by splitting nodes repeatedly.

Figure 3-23: The final tree, with each leaf node assigned a class label according to the majority class of the instances at that node.
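The following is a minimal Python sketch of the two measures defined in step 1, written for nominal attributes only; applied to the 32 records of Table 3-1 below, it should reproduce the hand-calculated values of the worked example (for instance, an information gain of about 0.4544 and a gain ratio of about 0.3233 for 'Blood Pressure').

import math
from collections import Counter

def info(labels):
    """Info(T): entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(records, labels, attribute):
    """Information gain and gain ratio of splitting on a nominal attribute."""
    n = len(labels)
    subsets = {}
    for r, y in zip(records, labels):                 # group class labels by attribute value
        subsets.setdefault(r[attribute], []).append(y)
    expected = sum(len(s) / n * info(s) for s in subsets.values())   # Info_A(T)
    gain = info(labels) - expected                                   # Gain(A)
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets.values())
    return gain, (gain / split_info if split_info > 0 else 0.0)      # (Gain, GainRatio)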


To be more concrete, the following example shows a complete decision tree induction process on a

health-checking database when either information gain or gain ratio is applied as the splitting

criterion. This example includes both information gain and gain ratio, but we can select either one

of them. Given the following table (Table 3-1), the process is enumerated.

Table 3-1: A patient health-checkup data

Patient Blood Pressure Protein Level Glucose Level Heart Beat Diseased

1 High Medium High Slow Positive

2 High High Medium Fast Negative

3 High Low Medium Slow Positive

4 High Low High Fast Negative

5 Normal Low Medium Fast Negative

6 Normal High Very High Slow Negative

7 Normal Medium Very High Slow Negative

8 Low Medium Very High Slow Positive

9 High Medium High Slow Positive

10 High High Medium Fast Negative

11 High Low Medium Slow Positive

12 High Low High Fast Negative

13 Normal Low Medium Fast Negative

14 Normal High Very High Slow Negative

15 Normal Medium Very High Slow Negative

16 Low Medium Very High Slow Positive

17 High Medium High Slow Positive

18 High High Medium Fast Negative

19 High Low Medium Slow Positive

20 High Low High Fast Negative

21 Normal Low Medium Fast Negative

22 Normal High Very High Slow Negative

23 Normal Medium Very High Slow Negative

24 Low Medium Very High Slow Positive

25 High Medium High Slow Positive

26 High High Medium Fast Negative

27 High Low Medium Slow Positive

28 High Low High Fast Negative

29 Normal Low Medium Fast Negative

30 Normal High Very High Slow Negative

31 Normal Medium Very High Slow Negative

32 Low Medium Very High Slow Positive


1. Select the best attribute to be the root node of the decision tree. Here, we test each attribute one by one, starting from the first attribute 'Blood Pressure'.

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

2 High High Medium Fast Negative

4 High Low High Fast Negative

10 High High Medium Fast Negative

12 High Low High Fast Negative

18 High High Medium Fast Negative

20 High Low High Fast Negative

26 High High Medium Fast Negative

28 High Low High Fast Negative

1 High Medium High Slow Positive

3 High Low Medium Slow Positive

9 High Medium High Slow Positive

11 High Low Medium Slow Positive

17 High Medium High Slow Positive

19 High Low Medium Slow Positive

25 High Medium High Slow Positive

27 High Low Medium Slow Positive

5 Normal Low Medium Fast Negative

6 Normal High Very High Slow Negative

7 Normal Medium Very High Slow Negative

13 Normal Low Medium Fast Negative

14 Normal High Very High Slow Negative

15 Normal Medium Very High Slow Negative

21 Normal Low Medium Fast Negative

22 Normal High Very High Slow Negative

23 Normal Medium Very High Slow Negative

29 Normal Low Medium Fast Negative

30 Normal High Very High Slow Negative

31 Normal Medium Very High Slow Negative

8 Low Medium Very High Slow Positive

16 Low Medium Very High Slow Positive

24 Low Medium Very High Slow Positive

32 Low Medium Very High Slow Positive

Attribute Value Information

Blood Pressure High Info([8,8]) = entropy(8/16, 8/16)

= -(8/16)xlog2(8/16) - (8/16)xlog2(8/16) = 1.0000 bits

Normal Info([12,0]) = entropy(12/12, 0/12)

= -(12/12)xlog2(12/12) - (0/12)xlog2(0/12) = 0.0000 bits

Low Info([0,4]) = entropy(0/4, 4/4)

= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits

Expected information for “Blood Pressure”

= Info([8,8],[12,0],[0,4]) = (16/32x1.0000) + (12/32x0.0000)+(4/32x0.0000)= 0.5000 bits

Information gain for “Blood Pressure”

= Info([12,20]) - Info([8,8],[12,0],[0,4])

= 0.9544-0.5000 = 0.4544 bits

Split Info for “Blood Pressure” (Intrinsic_Info for “Blood Pressure”)

= Info([16,12,4])

= -(16/32)xlog2(16/32) – (12/32)xlog2(12/32) – (4/32)xlog2(4/32)

= 0.5000+0.5306+0.3750 = 1.4056 bits

Gain ratio for “Blood Pressure”

= Information gain (“Blood Pressure”) / Split Info (“Blood Pressure”)

= 0.4544 /1.4056

= 0.3233


2. The next is to calculate information gain or gain ratio for “Protein Level”.

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

2 High High Medium Fast Negative

10 High High Medium Fast Negative

18 High High Medium Fast Negative

26 High High Medium Fast Negative

6 Normal High Very High Slow Negative

14 Normal High Very High Slow Negative

22 Normal High Very High Slow Negative

30 Normal High Very High Slow Negative

7 Normal Medium Very High Slow Negative

15 Normal Medium Very High Slow Negative

23 Normal Medium Very High Slow Negative

31 Normal Medium Very High Slow Negative

1 High Medium High Slow Positive

8 Low Medium Very High Slow Positive

9 High Medium High Slow Positive

16 Low Medium Very High Slow Positive

17 High Medium High Slow Positive

24 Low Medium Very High Slow Positive

25 High Medium High Slow Positive

32 Low Medium Very High Slow Positive

4 High Low High Fast Negative

5 Normal Low Medium Fast Negative

12 High Low High Fast Negative

20 High Low High Fast Negative

28 High Low High Fast Negative

13 Normal Low Medium Fast Negative

21 Normal Low Medium Fast Negative

29 Normal Low Medium Fast Negative

3 High Low Medium Slow Positive

11 High Low Medium Slow Positive

19 High Low Medium Slow Positive

27 High Low Medium Slow Positive

Attribute Value Information

Protein Level High Info([8,0]) = entropy(8/8, 0/8)

= -(8/8)xlog2(8/8) - (0/8)xlog2(0/8) = 0.0000 bits

Medium Info([4,8]) = entropy(4/12, 8/12)

= -(4/12)xlog2(4/12) - (8/12)xlog2(8/12) = 0.9183 bits

Low Info([8,4]) = entropy(8/12, 4/12)

= -(8/12)xlog2(8/12) - (4/12)xlog2(4/12) = 0.9183 bits

Expected information for “Protein Level”

= Info([8,0],[4,8],[8,4]) = (8/32x0.0000) + (12/32x0.9183)+(12/32x0.9183)= 0.6887 bits

Information gain for” Protein Level”

= Info([12,20]) - Info([8,0],[4,8],[8,4])

= 0.9544-0.6887 = 0.2657 bits

Split Info for “Protein Level” (Intrinsic_Info for “Protein Level”)

= Info([8,12,12])

= -(8/32)xlog2(8/32) – (12/32)xlog2(12/32) – (12/32)xlog2(12/32)

= 0.5000+0.5306+0.5306 = 1.5613 bits

Gain ratio for “Protein Level”

= Information gain ("Protein Level") / Split Info ("Protein Level")

= 0.2657 /1.5613

= 0.1702


3. The next is to calculate information gain or gain ratio for “Glucose Level”.

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

6 Normal High Very High Slow Negative

7 Normal Medium Very High Slow Negative

14 Normal High Very High Slow Negative

15 Normal Medium Very High Slow Negative

22 Normal High Very High Slow Negative

23 Normal Medium Very High Slow Negative

30 Normal High Very High Slow Negative

31 Normal Medium Very High Slow Negative

8 Low Medium Very High Slow Positive

16 Low Medium Very High Slow Positive

24 Low Medium Very High Slow Positive

32 Low Medium Very High Slow Positive

4 High Low High Fast Negative

12 High Low High Fast Negative

20 High Low High Fast Negative

28 High Low High Fast Negative

1 High Medium High Slow Positive

9 High Medium High Slow Positive

17 High Medium High Slow Positive

25 High Medium High Slow Positive

2 High High Medium Fast Negative

5 Normal Low Medium Fast Negative

10 High High Medium Fast Negative

13 Normal Low Medium Fast Negative

18 High High Medium Fast Negative

21 Normal Low Medium Fast Negative

26 High High Medium Fast Negative

29 Normal Low Medium Fast Negative

3 High Low Medium Slow Positive

11 High Low Medium Slow Positive

19 High Low Medium Slow Positive

27 High Low Medium Slow Positive

Attribute Value Information

Glucose Level Very High Info([8,4]) = entropy(8/12, 4/12)

= -(8/12)xlog2(8/12) - (4/12)xlog2(4/12) = 0.9183 bits

High Info([4,4]) = entropy(4/8, 4/8)

= -(4/8)xlog2(4/8) - (4/8)xlog2(4/8) = 1.0000 bits

Medium Info([8,4]) = entropy(8/12, 4/12)

= -(8/12)xlog2(8/12) - (4/12)xlog2(4/12) = 0.9183 bits

Expected information for “Glucose Level”

= Info([8,4],[4,4],[8,4]) = (12/32x0.9183) + (8/32x1.0000)+(12/32x0.9183)= 0.9387 bits

Information gain for “Glucose Level”

= Info([12,20]) - Info([8,4],[4,4],[8,4])

= 0.9544-0.9387 = 0.0157 bits

Split Info for “Glucose Level” (Intrinsic_Info for “Glucose Level”)

= Info([12,8,12])

= -(12/32)xlog2(12/32) – (8/32)xlog2(8/32) – (12/32)xlog2(12/32)

= 0.5306+0.5000+0.5306 = 1.5613 bits

Gain ratio for “Glucose Level”

= Information gain ("Glucose Level") / Split Info ("Glucose Level")

= 0.0157 /1.5613

= 0.0101


4. The next is to calculate information gain or gain ratio for “Heart Beat”.

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

6 Normal High Very High Slow Negative

7 Normal Medium Very High Slow Negative

14 Normal High Very High Slow Negative

15 Normal Medium Very High Slow Negative

22 Normal High Very High Slow Negative

23 Normal Medium Very High Slow Negative

30 Normal High Very High Slow Negative

31 Normal Medium Very High Slow Negative

1 High Medium High Slow Positive

3 High Low Medium Slow Positive

8 Low Medium Very High Slow Positive

9 High Medium High Slow Positive

11 High Low Medium Slow Positive

16 Low Medium Very High Slow Positive

17 High Medium High Slow Positive

19 High Low Medium Slow Positive

24 Low Medium Very High Slow Positive

25 High Medium High Slow Positive

27 High Low Medium Slow Positive

32 Low Medium Very High Slow Positive

2 High High Medium Fast Negative

4 High Low High Fast Negative

5 Normal Low Medium Fast Negative

10 High High Medium Fast Negative

12 High Low High Fast Negative

13 Normal Low Medium Fast Negative

18 High High Medium Fast Negative

20 High Low High Fast Negative

21 Normal Low Medium Fast Negative

26 High High Medium Fast Negative

28 High Low High Fast Negative

29 Normal Low Medium Fast Negative

Attribute Value Information

Heart Beat Slow Info([8,12]) = entropy(8/20, 12/20)

= -(8/20)xlog2(8/20) - (12/20)xlog2(12/20) = 0.9710 bits

Fast Info([12,0]) = entropy(12/12, 0/12)

= -(12/12)xlog2(12/12) - (0/12)xlog2(0/12) = 0.0000 bits

Expected information for “Heart Beat”

= Info([8,12],[12,0]) = (20/32x0.9710) + (12/32x0.0000) = 0.6068 bits

Information gain for “Heart Beat”

= Info([12,20]) - Info([8,12],[12,0])

= 0.9544 - 0.6068 = 0.3476 bits

Split Info for “Heart Beat” (Intrinsic_Info for “Heart Beat”)

= Info([20,12])

= -(20/32)xlog2(20/32) – (12/32)xlog2(12/32)

= 0.4238+0.5306 = 0.9544 bits

Gain ratio for "Heart Beat"

= Information gain ("Heart Beat") / Split Info ("Heart Beat")

= 0.3476 /0.9544

= 0.3642


Summary of information gain and gain ratio for the first node.

Attribute         Info      Info Gain   Split Info   Gain Ratio
Blood Pressure    0.5000    0.4544      1.4056       0.3233
Protein Level     0.6887    0.2657      1.5613       0.1702
Glucose Level     0.9387    0.0157      1.5613       0.0101
Heart Beat        0.6068    0.3476      0.9544       0.3642

Based on the above table, it is possible to use either information gain (Info Gain) or gain ratio (Gain Ratio)

to select the best attribute for the root node. In the case of information gain, the best attribute is 'Blood

Pressure', since it has the highest information gain of 0.4544, compared to 'Protein Level',

'Glucose Level', and 'Heart Beat'. On the other hand, in the case of gain ratio, the best attribute is 'Heart

Beat' (gain ratio 0.3642). The following shows the results of both cases.

Information Gain (root = Blood Pressure):
  Blood Pressure = High   -> 8 Negative, 8 Positive
  Blood Pressure = Normal -> 12 Negative, 0 Positive
  Blood Pressure = Low    -> 0 Negative, 4 Positive

Gain Ratio (root = Heart Beat):
  Heart Beat = Slow -> 8 Negative, 12 Positive
  Heart Beat = Fast -> 12 Negative, 0 Positive

The next step is to place a decision node at the lower nodes, which are impure. In the ‘Information Gain’

case, it is the node of the branch ‘High’ (blood pressure = high). On the other hand, for the ‘Gain Ratio’

case, the node we need to focus on is the node under the branch 'Slow' (heart beat = slow). The following

shows the case of finding the second node for 'Information Gain'. The case of finding the second node for

'Gain Ratio' is shown afterwards.


5. Find the second node under the 'Blood Pressure' node ('Information Gain').

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

2 High High Medium Fast Negative

10 High High Medium Fast Negative

18 High High Medium Fast Negative

26 High High Medium Fast Negative

1 High Medium High Slow Positive

9 High Medium High Slow Positive

17 High Medium High Slow Positive

25 High Medium High Slow Positive

4 High Low High Fast Negative

12 High Low High Fast Negative

20 High Low High Fast Negative

28 High Low High Fast Negative

3 High Low Medium Slow Positive

11 High Low Medium Slow Positive

19 High Low Medium Slow Positive

27 High Low Medium Slow Positive

Attribute Value Information

Protein Level High Info([4,0]) = entropy(4/4, 0/4)

= -(4/4)xlog2(4/4) - (0/4)xlog2(0/4) = 0.0000 bits

Medium Info([0,4]) = entropy(0/4, 4/4)

= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits

Low Info([4,4]) = entropy(4/8, 4/8)

= -(4/8)xlog2(4/8) - (4/8)xlog2(4/8) = 1.0000 bits

Expected information for “Protein Level”

= Info([4,0],[0,4],[4,4]) = (4/16x0.0000) + (4/16x0.0000)+(8/16x1.0000)= 0.5000 bits

Information gain for” Protein Level”

= Info([8,8]) - Info([4,0],[0,4],[4,4])

= 1.0000-0.5000 = 0.5000 bits

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

4 High Low High Fast Negative

12 High Low High Fast Negative

20 High Low High Fast Negative

28 High Low High Fast Negative

1 High Medium High Slow Positive

9 High Medium High Slow Positive

17 High Medium High Slow Positive

25 High Medium High Slow Positive

2 High High Medium Fast Negative

10 High High Medium Fast Negative

18 High High Medium Fast Negative

26 High High Medium Fast Negative

3 High Low Medium Slow Positive

11 High Low Medium Slow Positive

19 High Low Medium Slow Positive

27 High Low Medium Slow Positive

Attribute Value Information

Glucose Level High Info([4,4]) = entropy(4/8, 4/8)

= -(4/8)xlog2(4/8) - (4/8)xlog2(4/8) = 1.0000 bits

Medium Info([4,4]) = entropy(4/8, 4/8)

= -(4/8)xlog2(4/8) - (4/8)xlog2(4/8) = 1.0000 bits

Expected information for “Glucose Level”

= Info([4,4],[4,4]) = (8/16x1.0000) + (8/16x1.0000) = 1.0000 bits

Information gain for” Glucose Level”

= Info([8,8]) - Info([4,4],[4,4])

= 1.0000-1.0000 = 0.0000 bits


6. Continue finding the second node under the 'Blood Pressure' node ('Information Gain').

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

1 High Medium High Slow Positive

3 High Low Medium Slow Positive

9 High Medium High Slow Positive

11 High Low Medium Slow Positive

17 High Medium High Slow Positive

19 High Low Medium Slow Positive

25 High Medium High Slow Positive

27 High Low Medium Slow Positive

2 High High Medium Fast Negative

4 High Low High Fast Negative

10 High High Medium Fast Negative

12 High Low High Fast Negative

18 High High Medium Fast Negative

20 High Low High Fast Negative

28 High Low High Fast Negative

26 High High Medium Fast Negative

Attribute Value Information

Heart Beat Slow Info([0,8]) = entropy(0/8, 8/8)

= -(0/8)xlog2(0/8) - (8/8)xlog2(8/8) = 0.0000 bits

Fast Info([8,0]) = entropy(8/8,0/8)

= -(8/8)xlog2(8/8) - (0/8)xlog2(0/8) = 0.0000 bits

Expected information for “Heart Beat”

= Info([0,8],[8,0]) = (8/16x0.0000) + (8/16x0.0000) = 0.0000 bits

Information gain for” Heart Beat”

= Info([8,8]) - Info([0,8],[8,0])

= 1.0000-0.0000 = 1.0000 bits

Summary of the second node for information gain

Attribute        Info      Info Gain
Protein Level    0.5000    0.5000
Glucose Level    1.0000    0.0000
Heart Beat       0.0000    1.0000

Based on the above table of 'Information Gain', the best attribute for the second node is 'Heart Beat',

since it has the highest information gain of 1.0000, compared to 'Protein Level' and 'Glucose

Level'. The final decision tree for information gain is as follows.

Blood Pressure = High   -> Heart Beat = Slow -> Positive;  Heart Beat = Fast -> Negative
Blood Pressure = Normal -> Negative
Blood Pressure = Low    -> Positive
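The whole induction loop can be sketched as a short recursive Python function. The version below is an illustration under simplifying assumptions (nominal attributes, information gain as the criterion, no pruning, and a stop only at pure nodes or when attributes run out); run on the 32 records of Table 3-1 it should rebuild the tree shown above, with 'Blood Pressure' at the root and 'Heart Beat' under the 'High' branch.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, attribute):
    n = len(labels)
    subsets = {}
    for r, y in zip(records, labels):
        subsets.setdefault(r[attribute], []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def build_tree(records, labels, attributes):
    """Recursively grow a decision tree, choosing the attribute with the highest
    information gain at each node and stopping at pure nodes."""
    if len(set(labels)) == 1:
        return labels[0]                                 # pure node -> leaf with that class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]      # no attribute left -> majority class
    best = max(attributes, key=lambda a: information_gain(records, labels, a))
    branches = {}
    for value in sorted({r[best] for r in records}):
        subset = [(r, y) for r, y in zip(records, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in subset], [y for _, y in subset],
                                     [a for a in attributes if a != best])
    return (best, branches)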


7. Find the second node under the ‘Heart Beat’ (for ‘Gain Ratio’).

On the other hand, in the case of gain ratio, the best attribute is ‘Heart Beat’. Similar to the case of

information gain, the next step is to place a decision node at the lower nodes, which are impure. Here, the

node we need to focus on is the node under the branch 'Slow' (Heart Beat = slow).

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

1 High Medium High Slow Positive

3 High Low Medium Slow Positive

9 High Medium High Slow Positive

11 High Low Medium Slow Positive

17 High Medium High Slow Positive

19 High Low Medium Slow Positive

25 High Medium High Slow Positive

27 High Low Medium Slow Positive

6 Normal High Very High Slow Negative

7 Normal Medium Very High Slow Negative

14 Normal High Very High Slow Negative

15 Normal Medium Very High Slow Negative

22 Normal High Very High Slow Negative

23 Normal Medium Very High Slow Negative

30 Normal High Very High Slow Negative

31 Normal Medium Very High Slow Negative

8 Low Medium Very High Slow Positive

16 Low Medium Very High Slow Positive

24 Low Medium Very High Slow Positive

32 Low Medium Very High Slow Positive

Attribute Value Information

Blood Pressure High Info([0,8]) = entropy(0/8, 8/8)

= -(0/8)xlog2(0/8) - (8/8)xlog2(8/8) = 0.0000 bits

Normal Info([8,0]) = entropy(8/8, 0/8)

= -(8/8)xlog2(8/8) - (0/8)xlog2(0/8) = 0.0000 bits

Low Info([0,4]) = entropy(0/4, 4/4)

= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits

Expected information for “Blood Pressure”

= Info([0,8],[8,0],[0,4]) = (8/20x0.0000) + (8/20x0.0000)+(4/20x0.0000)= 0.0000 bits

Information gain for "Blood Pressure"

= Info([8,12]) - Info([0,8],[8,0],[0,4])

= 0.9710 - 0.0000 = 0.9710 bits

Split Info for “Blood Pressure” (Intrinsic_Info for “Blood Pressure”)

= Info([8,8,4])

= -(8/20)xlog2(8/20) – (8/20)xlog2(8/20) – (4/20)xlog2(4/20)

= 0.5288+0.5288+0.4644 = 1.5219 bits

Gain ratio for “Blood Pressure”

= Information gain (“Blood Pressure”) / Split Info (“Blood Pressure”)

= 0.9710 / 1.5219

= 0.6380


8. Continue finding the second node under the ‘Heart Beat’ (for ‘Gain Ratio’).

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

6 Normal High Very High Slow Negative

14 Normal High Very High Slow Negative

22 Normal High Very High Slow Negative

30 Normal High Very High Slow Negative

7 Normal Medium Very High Slow Negative

15 Normal Medium Very High Slow Negative

23 Normal Medium Very High Slow Negative

31 Normal Medium Very High Slow Negative

1 High Medium High Slow Positive

8 Low Medium Very High Slow Positive

9 High Medium High Slow Positive

16 Low Medium Very High Slow Positive

17 High Medium High Slow Positive

24 Low Medium Very High Slow Positive

25 High Medium High Slow Positive

32 Low Medium Very High Slow Positive

3 High Low Medium Slow Positive

11 High Low Medium Slow Positive

19 High Low Medium Slow Positive

27 High Low Medium Slow Positive

Attribute Value Information

Protein Level High Info([4,0]) = entropy(4/4, 0/4)

= -(4/4)xlog2(4/4) - (0/4)xlog2(0/4) = 0.0000 bits

Medium Info([4,8]) = entropy(4/12, 8/12)

= -(4/12)xlog2(4/12) - (8/12)xlog2(8/12) = 0.9183 bits

Low Info([0,4]) = entropy(0/4, 4/4)

= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits

Expected information for “Protein Level”

= Info([4,0],[4,8],[0,4]) = (4/20x0.0000) + (12/20x0.9183)+(4/20x0.0000)= 0.5510 bits

Information gain for” Protein Level”

= Info([8,12]) - Info([4,0],[4,8],[0,4])

= 0.9710-0.5510 = 0.4200 bits

Split Info for “Protein Level” (Intrinsic_Info for “Protein Level”)

= Info([4,12,4])

= -(4/20)xlog2(4/20) – (12/20)xlog2(12/20) – (4/20)xlog2(4/20)

= 0.4644+0.4422+0.4644 = 1.3710 bits

Gain ratio for “Protein Level”

= Information gain (“Protein Level”) / Split Info (“Protein Level”)

= 0.4200 / 1.3710

= 0.3063


9. Continue finding the second node under the ‘Heart Beat’ (for ‘Gain Ratio’).

Patient Blood P. Protein L. Glucose L. Heart Beat Diseased

6 Normal High Very High Slow Negative

14 Normal High Very High Slow Negative

22 Normal High Very High Slow Negative

30 Normal High Very High Slow Negative

7 Normal Medium Very High Slow Negative

15 Normal Medium Very High Slow Negative

23 Normal Medium Very High Slow Negative

31 Normal Medium Very High Slow Negative

8 Low Medium Very High Slow Positive

16 Low Medium Very High Slow Positive

24 Low Medium Very High Slow Positive

32 Low Medium Very High Slow Positive

1 High Medium High Slow Positive

9 High Medium High Slow Positive

17 High Medium High Slow Positive

25 High Medium High Slow Positive

3 High Low Medium Slow Positive

11 High Low Medium Slow Positive

19 High Low Medium Slow Positive

27 High Low Medium Slow Positive

Attribute Value Information

Glucose Level Very High Info([8,4]) = entropy(8/12, 4/12)

= -(8/12)xlog2(8/12) - (4/12)xlog2(4/12) = 0.9183 bits

High Info([0,4]) = entropy(0/4, 4/4)

= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits

Medium Info([0,4]) = entropy(0/4, 4/4)

= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits

Expected information for “Glucose Level”

= Info([8,4],[0,4],[0,4]) = (12/20x0.9183) + (4/20x0.0000) + (4/20x0.0000)= 0.5510 bits

Information gain for “Glucose Level”

= Info([8,12]) - Info([8,4],[0,4],[0,4])

= 0.9710-0.5510 = 0.4200 bits

Split Info for “Glucose Level” (Intrinsic_Info for “Glucose Level”)

= Info([4,12,4])

= -(4/20)xlog2(4/20) – (12/20)xlog2(12/20) – (4/20)xlog2(4/20)

= 0.4644+0.4422+0.4644 = 1.3710 bits

Gain ratio for “Glucose Level”

= Information gain (“Glucose Level”) / Split Info (“Glucose Level”)

= 0.4200 / 1.3710

= 0.3063


Summary of the second node for gain ratio

Attribute         Info      Info Gain   Split Info   Gain Ratio
Blood Pressure    0.0000    0.9710      1.5219       0.6380
Protein Level     0.5510    0.4200      1.3710       0.3063
Glucose Level     0.5510    0.4200      1.3710       0.3063

Based on the above table of 'Gain Ratio', the best attribute for the second node is 'Blood Pressure',

since it has the highest gain ratio of 0.6380, compared to 'Protein Level' and 'Glucose Level'.

The final decision tree for gain ratio is as follows.

Heart Beat = Fast -> Negative
Heart Beat = Slow -> Blood Pressure = High -> Positive;  Blood Pressure = Normal -> Negative;  Blood Pressure = Low -> Positive

Besides the information gain and gain ratio, the GINI index is also widely used, especially in

CART. Unlike information gain and gain ratio, the GINI index considers a binary split for each attribute,

instead of splitting by all of its possible values. The GINI index indicates the impurity of a

partition and can be calculated as follows. Following the notation of information gain and gain

ratio, the training set of objects and their associated class labels is denoted by T, each object is

represented by an n-dimensional attribute vector x = (x_1, ..., x_n) depicting the measured values of the

n attributes A_1, ..., A_n of the object, together with its class, one of the m possible classes C_1, ..., C_m,

and the attribute A_i has p_i possible values v_i1, ..., v_ip_i. A binary partition p divides the values of the

attribute A_i into two subsets S_1 and S_2 (i.e., S_1 and S_2 are disjoint and together contain all of

v_i1, ..., v_ip_i). C_k is the k-th class, T is the training

set before splitting, T_k is the set of the instances with the class C_k in the set T, T_j (j = 1, 2) is a subset of

the training set after splitting, containing the objects whose value of the attribute A_i falls in S_j,

and T_jk is the set of the instances with the class C_k in the subset T_j. With this

notation, |T| is the total number of instances in the training set before splitting, |T_k| is the

number of class-k instances in the set T, |T_j| is the number of instances whose value of A_i is in S_j, and

|T_jk| is the number of class-k instances in the subset T_j.


GINI Index:

   Gini(T) = 1 - Σ_{k=1..m} (|T_k|/|T|)^2

   Gini_{A_i,p}(T) = (|T_1|/|T|) x Gini(T_1) + (|T_2|/|T|) x Gini(T_2)

   ΔGini(A_i, p) = Gini(T) - Gini_{A_i,p}(T)

While the GINI index indicates the impurity of a partition, GINI-index-based decision tree

induction selects the split at a node by considering, for each attribute A_i, every binary

partition p of its values into two subsets S_1 and S_2. It attempts to select the attribute

and the partition that make the largest reduction of the impurity, ΔGini(A_i, p),

as shown in the above formulae; equivalently, it chooses the partition p

that generates the minimum Gini_{A_i,p}(T).
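A minimal Python sketch of the GINI computation follows. It evaluates one binary partition at a time; best_binary_split then simply enumerates the two-way groupings of an attribute's values and keeps the one with the largest impurity reduction (complementary groupings are evaluated twice, which is harmless in a sketch).

from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini(T) = 1 - sum_k (|T_k| / |T|)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_after_split(records, labels, attribute, subset_1):
    """Weighted Gini index after the binary split: value in subset_1 vs. not."""
    left = [y for r, y in zip(records, labels) if r[attribute] in subset_1]
    right = [y for r, y in zip(records, labels) if r[attribute] not in subset_1]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def best_binary_split(records, labels, attribute):
    """Return (delta_gini, S1, S2) for the grouping with the largest impurity reduction."""
    values = sorted({r[attribute] for r in records})
    best = None
    for size in range(1, len(values) // 2 + 1):
        for s1 in combinations(values, size):
            reduction = gini(labels) - gini_after_split(records, labels, attribute, set(s1))
            if best is None or reduction > best[0]:
                best = (reduction, set(s1), set(values) - set(s1))
    return best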

Tree Pruning

When we split nodes into branches during the construction of a decision tree, many of the branches may

reflect anomalies in the training data due to noise or outliers. In general, the "fully grown" tree

usually obtains a high prediction rate on the training data but a low prediction rate on unseen test

data, as shown in Figure 3-24. From the figure, we can observe that the

prediction performance of the learned tree on the training data set increases with the

size of the learned tree, while the performance on the test data set decreases.

Figure 3-24: Prediction rate (accuracy) trend with various tree sizes

(Axes: accuracy from 0.600 to 1.000 versus tree size (number of nodes) from 0 to 100; two curves, Train Data and Test Data.)


In general, this problem is well known as overfitting. Overfitting occurs when our learned model

is too complex with a high degree of freedom in relation to the amount of data available and then

attempts to describe random errors or noises instead of the underlying relationship. A model

with overfitting usually has poor predictive performance since it is too specific to the training

data set and is not generalized for an unknown data set. In order to avoid overfitting, it is

necessary to use additional techniques (e.g. cross-validation, regularization, early stopping,

Bayesian priors on parameters or model comparison), that can indicate when further training is

not resulting in better generalization. In the process of the decision tree induction, we can avoid

overfitting by way of tree pruning. Typically, we can use statistical or information-based

measures to remove the least reliable branches. An unpruned tree and its pruned version are

shown in Figure 3-25.

An Unpruned Tree:
  Heart Beat [20 negative, 12 positive]
    = Fast -> Negative [12 negative, 0 positive]
    = Slow -> Blood Pressure [8 negative, 12 positive]
        = High   -> Positive [0 negative, 8 positive]
        = Normal -> Negative [8 negative, 0 positive]
        = Low    -> Positive [0 negative, 4 positive]

A Pruned Tree:
  Heart Beat [20 negative, 12 positive]
    = Fast -> Negative [12 negative, 0 positive]
    = Slow -> Positive [8 negative, 12 positive]

Figure 3-25: An example of a decision tree and its pruned tree.

Normally a pruned tree tends to be smaller and less complex, and it sometimes even represents the

knowledge more precisely by ignoring noise or outliers. Moreover, it is usually faster and better

at correctly classifying independent test data (i.e., previously unseen tuples) than an unpruned

version. In general, the two common approaches to tree pruning are prepruning and postpruning. In

the prepruning approach, a tree is "pruned" by terminating its construction at an early stage.


That is, we can decide not to split or partition the subset of training tuples further at a given node.

When we stop splitting, the current node becomes a leaf node. The leaf node may be assigned

the most frequent class among the subset tuples, or the probability distribution of those

tuples. In the construction of a tree, while measures such as information gain, gain ratio or

GINI index can be used to select the best node to split, another type of measures, such as

statistical significance, can be used to assess the reliability of that split. If partitioning the tuples

at a node results in a split that has statistical significance below a pre-specified threshold, then

further partitioning of the given subset is not made. However, it is difficult to decide an

appropriate threshold. Too high thresholds may result in oversimplified trees, whereas too low

thresholds could result in very little simplification.

As the second and more common approach, postpruning removes subtrees from a "fully

grown" tree. Instead of considering terminating the splitting at an early stage, postpruning

allows the tree to be grown fully and then prunes it. Since the tree is allowed to grow fully, it is

possible to prevent the situation in which the tree is pruned too early: even when splitting

the current node does not appear worthwhile, the split may be made first, and the subtree

under the current node can be pruned later if necessary. As in prepruning, a subtree at a given node is

pruned by removing its branches and replacing it with a leaf, and the leaf is labeled with the most

frequent class among the tuples of the subtree being replaced. Also as in prepruning, besides the splitting

selection criterion, we need another criterion to compare an original tree with its pruned version.

Besides the choice of prepruning and postpruning, two additional kinds of choices that

characterize a pruning method are (1) exploitation of a holdout set and (2) utilization of

knowledge complexity. For the exploitation of a holdout set, a method called reduced error

pruning uses a separate set of examples, called a holdout set, which is a distinct set from the one

of training examples, to evaluate the utility of each node in the tree for pruning. On the other

hand, the statistical reasoning method uses only data available for training, by applying a

statistical test, e.g. Chi-square test, to estimate whether expanding (or pruning) a particular node

is likely to produce an improvement beyond the training set. From the viewpoint of the utilization of

knowledge complexity, most methods rarely consider the size or complexity of the learned knowledge

during the learning process. Methods that do consider it normally use only a training set, without the help of a

holdout set. Although knowledge complexity is hard to define, some works use

information-theoretic complexity measures, such as the Minimum Description Length (MDL), to

encode the size of the decision tree together with the exceptions occurring in the training examples. Intuitively,

when the tree becomes large, it covers all cases in the training set, making no errors and leaving no

exceptional cases; that is, most examples in the training set can be explained by the tree

and only a few exceptional examples are left for individual consideration. On the other hand, when

the tree is small, it may not cover many examples, and those examples are left for individual

consideration as exceptions. The MDL approach tries to minimize the description length based on

this tradeoff; see the details in (Grünwald, 2007). Moreover, the cost-complexity pruning used in CART is

another method that utilizes knowledge complexity, but together with a holdout set. The

following presents the details of the methods described above.

Reduced error pruning with cost complexity

The cost-complexity pruning is used in CART as postpruning. However, it is also possible to

utilize it in the prepruning process. This approach considers the cost complexity of a tree as a

tradeoff function between the number of leaves in the tree and the error rate of the tree. Here,

the error rate is the percentage of tuples misclassified by the tree. Unlike prepruning, the cost-

complexity postpruning starts from the bottom of the tree. For each internal node, N, it computes

the cost complexity of the subtree at N, and the cost complexity of the subtree at N if it were to be


pruned and replaced by a leaf node. These two values are compared. If pruning the subtree at

node N would result in a smaller cost complexity, then the subtree is pruned. Otherwise, it is kept.

A holdout set consisting of a number of separated class-labeled tuples is used as a pruning set to

estimate cost complexity. This holdout set is usually independent of the training set used to build

the unpruned tree and of any test set used for accuracy estimation. The algorithm generates a set

of progressively pruned trees and the smallest decision tree that minimizes the cost complexity

is preferred. Formally, cost-complexity pruning generates a series of trees T_0, T_1, ..., T_K, where T_0

is the initial tree and T_K is the tree with only the root node. At step i, the tree T_i is created by

removing a subtree t from the tree T_{i-1} and replacing it with a leaf node whose value is chosen as in the

tree construction algorithm. The selection of the subtree t to remove is decided

based on the error rates of the original tree and the pruned tree over the holdout data set

S, written err(T, S) and err(prune(T, t), S), and on the numbers of leaf nodes before and after pruning,

|leaves(T)| and |leaves(prune(T, t))|. An example of a selection criterion is to remove, at each step, the subtree t that minimizes

   ( err(prune(T, t), S) - err(T, S) ) / ( |leaves(T)| - |leaves(prune(T, t))| )

i.e., the subtree whose removal increases the error the least per removed leaf.

Once the series of trees have been created, the best tree is chosen by generalized accuracy as

measured by a training set or cross-validation. Sometimes, this approach is also called reduced-

error pruning.
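As a small illustration of the selection step, the Python sketch below (an assumed helper, not the CART implementation) ranks candidate subtrees by the increase in holdout error per removed leaf and returns the one that is cheapest to prune.

def prune_cost(err_pruned, err_full, leaves_full, leaves_pruned):
    """Increase in holdout error per leaf removed when pruning one candidate subtree t."""
    return (err_pruned - err_full) / (leaves_full - leaves_pruned)

def choose_subtree_to_prune(candidates):
    """candidates: iterable of (subtree, err_pruned, err_full, leaves_full, leaves_pruned).
    Returns the candidate whose removal costs the least error per removed leaf."""
    return min(candidates, key=lambda c: prune_cost(c[1], c[2], c[3], c[4]))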

Statistical reasoning

It is also possible to prune a tree by considering only information from the training data itself

without considering a holdout set. The pruning can be made based on true errors estimated from

observed errors. Some methods, including C4.5, use a heuristic based on a kind of statistical

reasoning to prune the tree; it may be criticized that the statistical underpinning is weak

and ad hoc, but in practice it seems to work well. The main principle is to consider the set of

instances that reach each node and imagine that the majority class is chosen to represent that

node, with a certain number of "errors," E, out of the total number of instances, N. Therefore the

observed error rate is f = E/N. Here, suppose that the expected (true) probability of errors at the

node is q, and that the N instances are generated by a Bernoulli process with parameter q, of which

E instances turn out to be errors. We can then calculate confidence intervals on the true error

probability q given the observed error rate f. By this, we can make a pessimistic

estimate of the error rate by calculating the upper confidence limit using the following formula.

Given a particular confidence c (for example, c = 25%), we find the confidence limit z that satisfies

   Pr[ (f - q) / sqrt( q(1-q)/N ) > z ] = c

where N is the number of samples, f = E/N is the observed error rate,

and q is the expected (true) error rate. Given the value of c, the value of z can be derived using

the standard normal table or the Z table in Appendix A.

Normally, we will use the upper confidence limit as an estimate (pessimistic case) for the error

rate at the node. The following indicates that the error rate can be derived from the value of

, z, and N.


Here, the expected error rates are calculated for two alternative situations: (1) the node is not split and (2) the node is split. These two error rates are compared, and the situation with the lower error rate is selected. Suppose that there are $m$ nodes after splitting. Let the observed error rates at these $m$ nodes be $f_1, f_2, \ldots, f_m$, the numbers of samples at these $m$ nodes be $N_1, N_2, \ldots, N_m$, the observed error rate at the node before splitting be $f$, and the number of samples at the node before splitting be $N$. Here, intuitively $N = N_1 + N_2 + \cdots + N_m$ and $fN \ge \sum_i f_i N_i$. The expected (pessimistic) error rates before and after splitting, $e_{\text{before}}$ and $e_{\text{after}}$, are as follows. If $e_{\text{after}} < e_{\text{before}}$, then we split; otherwise, we do not split.

Before splitting: $e_{\text{before}} = e(f, N)$

After splitting: $e_{\text{after}} = \sum_{i=1}^{m} \dfrac{N_i}{N}\, e(f_i, N_i)$

where $e(f, N)$ denotes the pessimistic error estimate computed from the observed error rate $f$, the number of samples $N$, and the confidence limit $z$ using the formula above.

To see how all this works in practice, let us consider the unpruned tree in Figure 3-25, where the number of training examples that reach each node is stated. Now consider whether we should stop splitting the 'Blood Pressure' node or not. Here, we use a 25% confidence, which makes z equal to 0.69, according to the standard normal table or Z table in Appendix A. The error rates before and after splitting the node are as follows. Before splitting, the left leaf node under the 'Heart Beat' node has an observed error rate of 8/20 (E = 8, N = 20, f = 0.4), since the node is labeled 'Positive' and contains 8 negatives and 12 positives. Its expected error rate is calculated using the above formula. After splitting, all three child nodes have zero observed error rates, and the expected error rates of these three nodes are calculated and then combined into one single expected error rate. The settings for the three nodes from left to right are (1) (E = 0, N = 8), (2) (E = 0, N = 8), and (3) (E = 0, N = 4). The error rate after splitting is calculated as a weighted combination: the error estimates for the three leaves are combined in the ratio of the numbers of examples they cover, 8 : 8 : 4, which leads to a combined error estimate of 0.06621. The detailed calculation is given below. Since e_after < e_before (i.e., 0.06621 < 0.47706), we should split the node.


Before splitting (f = 8/20 = 0.4, N = 20, z = 0.69):

$$e_{\text{before}} = \frac{0.4 + \frac{0.69^2}{2 \cdot 20} + 0.69\sqrt{\frac{0.4}{20} - \frac{0.4^2}{20} + \frac{0.69^2}{4 \cdot 20^2}}}{1 + \frac{0.69^2}{20}} = 0.47706$$

After splitting, each leaf has zero observed errors, for which the pessimistic estimate simplifies to $e = z^2/(N + z^2)$:

$$e(0, 8) = \frac{0.69^2}{8 + 0.69^2} = 0.05617, \qquad e(0, 4) = \frac{0.69^2}{4 + 0.69^2} = 0.10636$$

The error rate after splitting is the weighted combination of these three leaf estimates:

$$e_{\text{after}} = \frac{8 \cdot 0.05617 + 8 \cdot 0.05617 + 4 \cdot 0.10636}{20} = 0.06621$$
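The numbers above can be reproduced with a short script. This is a minimal sketch assuming the formula and the z value quoted in the text; the function name and the hard-coded leaf sizes are ours.

import math

def pessimistic_error(f, n, z=0.69):
    # Upper confidence limit on the true error rate, given observed error rate f
    # over n instances; z = 0.69 corresponds to the 25% confidence used above.
    return (f + z*z/(2*n) + z*math.sqrt(f/n - f*f/n + z*z/(4*n*n))) / (1 + z*z/n)

# Before splitting: 8 errors out of 20 instances
e_before = pessimistic_error(8/20, 20)                                   # ~0.47706

# After splitting: three pure leaves with 8, 8 and 4 instances
leaves = [(0.0, 8), (0.0, 8), (0.0, 4)]
n_total = sum(n for _, n in leaves)
e_after = sum(n * pessimistic_error(f, n) for f, n in leaves) / n_total  # ~0.06621

print(e_before, e_after, e_after < e_before)   # split, since 0.06621 < 0.47706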

Chi-squared test on rule pruning

Another popular pruning method is to translate a tree into a set of rules and then perform pruning on the rules. It is straightforward to read a set of rules directly off a decision tree by generating one rule for each leaf and forming the antecedent (the left-hand side) of the rule as the conjunction of all the tests encountered on the path from the root to that leaf. This procedure produces unambiguous rules whose conditions do not depend on the order of the tests. However, in general, the rules produced from a tree are usually too complex and too specific. Sometimes, it is better to prune some conditions from the antecedent of a rule. This pruning can be done by calculating a pessimistic estimate of the error rate of the new rule after a condition is removed and comparing it with the pessimistic estimate for the original rule. If the new rule is better, we delete that condition and then look for the next condition to delete. We keep the rule once no remaining condition can be removed without worsening the estimated error rate. This procedure is applied to all rules. After the rules have been pruned in this way, we check whether any duplicates remain; if so, we remove them from the rule set. Normally the procedure is performed iteratively with a greedy approach to detecting redundant conditions in a rule. Therefore, there is no guarantee that the best set of conditions will be removed. However, it is intractable to consider all subsets of conditions since this is usually


prohibitively expensive. Although one good solution is to apply an optimization technique such as simulated annealing or a genetic algorithm to select the best subset of conditions for each rule, the simple greedy solution seems to work well enough to generate quite good rule sets. Even with the greedy method, however, this approach has a computational cost problem: for every condition that is a candidate for deletion, the effect of the rule must be re-evaluated on all the training instances. In summary, the pruning process can be done in the following steps.

1. Convert the decision tree to a set of classification rules.

2. For each classification rule, calculate a contingency table for each antecedent and test the

antecedent for statistical independence.

3. Prune the rule if the antecedent is independent.

4. Repeat steps 2 and 3 for the remaining antecedents and for all classification rules.

In the first step, we convert the decision tree to a set of classification rules, as shown in the following artificial example (used here due to space limitations).

R1: If (X=x1 & Y=y1) then Class = C1
R2: If (X=x1 & Y=y2) then Class = C2
R3: If (X=x2 & Z=z1) then Class = C1
R4: If (X=x2 & Z=z2) then Class = C3
R5: If (X=x3) then Class = C1

Here, assume that we also have the following table of counts obtained from the dataset. Note that the value combinations in this table correspond to paths in the tree, although some of them may not actually appear in the tree; the counts themselves can be obtained from the data table.

                       Class=C1   Class=C2   Class=C3
X=x1   Y=y1   Z=z1         4          0          0
              Z=z2         6          0          0
       Y=y2   Z=z1         0         20          0
              Z=z2         0         10          0
X=x2   Y=y1   Z=z1         0          5          0
              Z=z2         0          0         20
       Y=y2   Z=z1         0          5          0
              Z=z2         0          0         10
X=x3   Y=y1   Z=z1         5          0          0
              Z=z2        10          0          0
       Y=y2   Z=z1         5          0          0
              Z=z2         5          0          0


As the second step, we construct a contingency table for each rule to test the statistical independence. For example, given the above data set, we construct the table for the first rule to test the statistical independence of the first antecedent, i.e., X = ‘x1’ as follows.

R1: If (X=x1 & Y=y1) then Class = C1

The contingency table for the first antecedent of this rule can be constructed using the above table as follows.

                  Class = ‘C1’   Class ≠ ‘C1’   Marginal Sum
X = ‘x1’             10 (A)         30 (B)        40 (A+B)
X ≠ ‘x1’             25 (C)         40 (D)        65 (C+D)
Marginal Sum         35 (A+C)       70 (B+D)     105 (T)

The expected value of each cell can be calculated from the row and column marginal sums as follows.

                  Class = ‘C1’            Class ≠ ‘C1’            Marginal Sum
X = ‘x1’          13.33                   26.67                   40
                  [(A+C)*(A+B)]/T         [(B+D)*(A+B)]/T
X ≠ ‘x1’          21.67                   43.33                   65
                  [(A+C)*(C+D)]/T         [(B+D)*(C+D)]/T
Marginal Sum      35                      70                      105

In the next step, we calculate the χ² statistic using one of the following formulae, depending on the highest expected frequency m.

If m > 10, use the chi-square test:

$$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$$

If 5 ≤ m ≤ 10, use Yates' correction for continuity:

$$\chi^2 = \sum_i \frac{(|o_i - e_i| - 0.5)^2}{e_i}$$

If m < 5, use Fisher's exact test; the details can be found at http://mathworld.wolfram.com/FishersExactTest.html.

In this case, the highest expected frequency is m = 43.33, so we use the chi-square test. Here, oi is the i-th observed value and ei is the i-th expected value.
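Before working through the numbers by hand, here is a rough sketch of how this selection rule might be coded. The helper name is ours, and Fisher's exact test is delegated to scipy.stats.fisher_exact, which we assume is available.

import numpy as np
from scipy.stats import fisher_exact

def independence_statistic(observed):
    # observed: 2x2 contingency table of counts (rows x columns).
    # Returns (name, statistic_or_pvalue) following the m-based selection rule above.
    observed = np.asarray(observed, dtype=float)
    total = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
    m = expected.max()                      # highest expected frequency
    if m > 10:                              # plain chi-square
        chi2 = ((observed - expected) ** 2 / expected).sum()
        return "chi-square", chi2
    elif m >= 5:                            # Yates' continuity correction
        chi2 = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()
        return "chi-square (Yates)", chi2
    else:                                   # small counts: Fisher's exact test
        _, p_value = fisher_exact(observed)
        return "fisher-exact p-value", p_value

print(independence_statistic([[10, 30], [25, 40]]))   # ('chi-square', ~2.0192)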

$$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i} = \frac{(10-13.33)^2}{13.33} + \frac{(30-26.67)^2}{26.67} + \frac{(25-21.67)^2}{21.67} + \frac{(40-43.33)^2}{43.33}$$

$$= 0.8333 + 0.4167 + 0.5128 + 0.2564 = 2.0192$$

In this case, the degrees of freedom (df) are calculated as df = (r − 1)(c − 1), where r is the number of rows and c is the number of columns in the contingency table. Since both the number of rows and the number of columns are two, the degrees of freedom are (2 − 1)(2 − 1) = 1.


From the chi-square table in Section 2.4 and Appendix B, when we set the p-value (α) to 0.05, the critical value χ² with one degree of freedom is 3.84. If we set the threshold to 3.84 and the computed χ² is less than this threshold, we accept the null hypothesis of independence, H0. Here, we accept the statistical independence of the first antecedent X = ‘x1’ since 2.0192 < 3.84.

R1: If (X=x1 & Y=y1) then Class = C1

We thus conclude that "Class = C1" is independent of "X=x1", and we eliminate this antecedent from the rule as follows.

R1: If (Y=y1) then Class = C1

Next we construct the contingency table for the second antecedent of the first rule to test the statistical independence of the second antecedent, i.e., Y = ‘y1’ as follows.

R1: If (X=x1 & Y=y1) then Class = C1

                  Class = ‘C1’   Class ≠ ‘C1’   Marginal Sum
Y = ‘y1’             25 (A)         25 (B)        50 (A+B)
Y ≠ ‘y1’             10 (C)         45 (D)        55 (C+D)
Marginal Sum         35 (A+C)       70 (B+D)     105 (T)

The expected value of each cell can be calculated from the row and column marginal sums as follows.

                  Class = ‘C1’            Class ≠ ‘C1’            Marginal Sum
Y = ‘y1’          16.67                   33.33                   50
                  [(A+C)*(A+B)]/T         [(B+D)*(A+B)]/T
Y ≠ ‘y1’          18.33                   36.67                   55
                  [(A+C)*(C+D)]/T         [(B+D)*(C+D)]/T
Marginal Sum      35                      70                      105

In the next step, we calculate the χ² statistic for this antecedent as follows.

$$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i} = \frac{(25-16.67)^2}{16.67} + \frac{(25-33.33)^2}{33.33} + \frac{(10-18.33)^2}{18.33} + \frac{(45-36.67)^2}{36.67}$$

$$= 4.1667 + 2.0833 + 3.7879 + 1.8940 = 11.9318$$

In this case, the degrees of freedom (df) are the same, i.e., 1. When we set the p-value (α) to 0.05, the critical value is again 3.84. If the computed χ² were less than this threshold, we would accept the null hypothesis of independence, H0. Here, however, we reject the statistical independence of the second antecedent Y = ‘y1’ since 11.9318 > 3.84.

R1: If (X=x1 & Y=y1) then Class = C1

We thus conclude that "Class = C1" depends on "Y=y1", and we keep this antecedent. The rule remains as follows.

R1: If (Y=y1) then Class = C1

We also need to check the antecedents of the other rules, R2-R5, in the same way.
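The two tests above can be reproduced quickly with scipy (assumed available); correction=False disables Yates' correction so the statistics match the hand calculations.

from scipy.stats import chi2_contingency

CRITICAL = 3.84   # chi-square critical value at alpha = 0.05 with df = (2-1)(2-1) = 1

# Antecedent X = 'x1' of rule R1 (counts from the tables above)
chi2_x, _, dof_x, _ = chi2_contingency([[10, 30], [25, 40]], correction=False)
print(round(chi2_x, 4), dof_x, chi2_x < CRITICAL)    # 2.0192 1 True  -> prune X=x1

# Antecedent Y = 'y1' of rule R1
chi2_y, _, dof_y, _ = chi2_contingency([[25, 25], [10, 45]], correction=False)
print(round(chi2_y, 4), dof_y, chi2_y < CRITICAL)    # 11.9318 1 False -> keep Y=y1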


Issues in Decision Trees

This section summarizes five issues in the decision-tree based classification. However, they are

also common in other classification methods.

1. Overfitting the data

Given a hypothesis space H (i.e., the set of all possible trees), a hypothesis h (i.e., a tree) is said to overfit the training data if there exists some alternative hypothesis h′ (i.e., another tree) such that h generates fewer errors than h′ over the training examples, but h′ gives fewer errors than h over the entire distribution of instances. Two common heuristics against overfitting are prepruning and postpruning. The first does not try to fit all examples but stops growing the tree before all the training data are used. The second fits all examples with the constructed tree and then prunes the resulting tree. The problem is how to know whether a given tree overfits the data or not. One solution is to use a validation set, which does not include data used for training (i.e., not in the training set), to check for overfitting. Usually the validation set consists of one-third of the training data, chosen randomly. Statistical tests, such as the chi-squared metric, can then be used to determine whether pruning the tree improves its performance over the validation set. An alternative is to use MDL to check whether modifying the tree increases its description length with respect to the validation set or not. If we use the validation set to guide pruning, we again need to guarantee that the tree does not overfit the validation set; in this case, we need to hold out yet another set, called the test set, from the training data and use it for the final check.

2. Good attribute selection

While the information gain measure seems a good criterion for attribute selection, it has a bias that favors attributes with many values over those with only a few. For example, for an attribute such as 'Date' that has a unique value for each training example, the gain Gain(S, Date) will yield the highest value, since the attribute is completely unambiguous on the training data and no non-unique attribute can do better. This results in a very broad tree of depth 1 that generalizes poorly. To solve this problem, it is possible to use the gain ratio, GainRatio(S, A) = Gain(S, A)/SplitInfo(S, A), instead of Gain(S, A) (a small code sketch after this list of issues illustrates both measures).

3. Handling continuous valued attributes

Continuous valued attributes can be partitioned into a discrete number of disjoint intervals, and we can then test for membership in these intervals. For example, the Temperature attribute in the Play-Tennis example in Figure 3-13 takes continuous values. It is not suitable to treat each numeric value as a label, since Temperature would then become a poor choice for classification: the temperature values alone may perfectly classify the training examples and therefore promise the highest information gain, as in the earlier example related to 'Date', yet be a poor predictor on the test set. The solution to this problem is to classify based not on the actual temperature, but on dynamically determined intervals within which the temperature falls. For instance, we can introduce Boolean attributes of the form Temperature > c, for some threshold c, instead of the real-valued Temperature. The threshold c can be computed by discretization methods.

4. Handling missing attribute values

When some of the training examples contain one or more missing values ('value not known') instead of the actual attribute values, we can use one of the following options.

1. Replace the unknown value with the most common value of that attribute (column).


2. Replace it with the most common value among all training examples that have been sorted into the tree at that node.

3. Replace it with the most common value among all training examples that have been sorted into the tree at that node and that have the same classification as the incomplete example.

5. Handling attributes with different costs

While attribute selection in the original version of decision tree induction depends mainly on discrimination power (classification performance) over the classes, sometimes we would like to introduce another type of criterion (or bias) against the selection of certain attributes. Such selection criteria may relate to the cost of testing the attribute rather than to its discrimination power. For example, the cost of a blood test is higher than that of measuring temperature or blood pressure. Because of this, even if the attribute 'temperature' or 'blood pressure' has slightly lower discrimination power than the 'blood test' result, we may prefer these low-cost attributes as nodes in the decision tree. In general, it is possible to assign a reasonable cost Cost(A) to each attribute and use it together with a conventional criterion (i.e., information gain or gain ratio) to construct the decision tree. For example, we can define a CostedGain(S, A) along the lines of the following combination functions (two forms proposed in the literature):

$$\text{CostedGain}(S,A) = \frac{\text{Gain}(S,A)^2}{\text{Cost}(A)} \qquad \text{or} \qquad \text{CostedGain}(S,A) = \frac{2^{\text{Gain}(S,A)} - 1}{(\text{Cost}(A)+1)^w}$$

where $w \in [0,1]$ is a weighting constant that determines the relative importance of cost versus information gain.
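As a concrete illustration of these selection measures, the following sketch computes information gain, gain ratio, and the two cost-weighted combinations above for a categorical attribute. It is our own illustrative code (the function names, the default w, and the toy data are assumptions), not an excerpt from a particular decision tree library.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(S, A): entropy reduction obtained by splitting the rows on attribute attr.
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attr):
    # GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A); penalizes many-valued attributes.
    n = len(labels)
    counts = Counter(row[attr] for row in rows)
    split_info = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return info_gain(rows, labels, attr) / split_info if split_info else 0.0

def costed_gain(gain, cost, w=0.5, form="tan"):
    # The two cost-sensitive combinations mentioned above (illustrative).
    if form == "tan":
        return gain ** 2 / cost
    return (2 ** gain - 1) / (cost + 1) ** w

# Toy usage: two attributes, where 'date' is unique for every record.
rows = [{"temp": "high", "date": "d1"}, {"temp": "high", "date": "d2"},
        {"temp": "low", "date": "d3"}, {"temp": "low", "date": "d4"}]
labels = ["yes", "yes", "no", "no"]
print(info_gain(rows, labels, "date"), gain_ratio(rows, labels, "date"))
print(info_gain(rows, labels, "temp"), gain_ratio(rows, labels, "temp"))
print(costed_gain(1.0, 5.0), costed_gain(1.0, 5.0, form="other"))

On this toy data, 'date' and 'temp' have the same information gain (1.0), but the gain ratio prefers 'temp' (1.0 versus 0.5), illustrating the bias correction discussed in issue 2.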

3.1.5. Classification Rules: Covering Algorithm

In principle, decision tree algorithms apply a divide-and-conquer approach to the classification problem. They work top-down, seeking at each stage an attribute that best splits (separates) the classes, and then recursively process the subproblems that result from the split. This procedure generates a decision tree which, as seen in the previous section, can be converted into a set of classification rules if necessary. Although it is simple to produce a set of effective rules this way, the converted rules have limitations in their form, especially in their antecedents (the rules' conditions, i.e., their left-hand sides). An alternative, known as a separate-and-conquer approach, is to consider each class in turn and seek a set of conditions (rule antecedents) that covers all instances in the class while excluding instances not in the class. In this (sequential) covering approach, at each stage a rule is produced that covers instances of the class not yet covered and excludes unrelated instances. The approach leads directly to a set of rules rather than a decision tree, and the rules produced by the covering algorithm are more general than those converted from a decision tree.

There are many sequential covering algorithms. The most basic one is the PRISM algorithm (developed by Cendrowska in 1987), which uses p/t (accuracy) as the criterion to express the goodness of a rule (If A then C). Here, p is the number of instances in the training set correctly covered by the rule, i.e., the number of instances that satisfy both the antecedent A and the consequent C, and t is the total number of instances that satisfy the antecedent A, whether or not they satisfy the consequent. Besides the accuracy p/t, there are also other criteria


for selecting good rules, such as accuracy with negative consideration, information gain and

positive-negative difference. Their characteristics are summarized as follows.

Accuracy: p/t

o The covering algorithm using accuracy (p/t) attempts to produce rules that exclude negative instances as quickly as possible.

o However, it may produce a rule with very small coverage, such as a rule covering only one instance but with an accuracy of 100% (p/t = 1/1). The instances used to form such a rule need to be judged as to whether they are special cases or just noise.

o A typical problem with the accuracy criterion is that the algorithm prefers a rule with accuracy 1/1 (100%) to a rule with accuracy 999/1000 (99.9%). The first rule is supported by only one instance, while the second rule relies on far more evidence (999 instances). Intuitively, the second rule is therefore more reliable and should be selected instead of the first rule.

Accuracy with negative consideration: [p + (N − n)]/T

o The covering algorithm using accuracy with negative consideration, [p + (N − n)]/T, attempts to produce rules under the assumption that non-coverage of negatives is as important as coverage of positives. Therefore, it uses the sum of the number of positive instances covered by the rule (p) and the number of negative instances not covered by the rule (N − n). Here, N is the total number of negative instances in the whole dataset and n is the number of negative instances covered by the rule. Intuitively, the rule is selected based on the difference between the numbers of positive and negative instances covered by the rule, i.e., (p − n).

o However, this criterion still shares a similar issue with the accuracy: the difference between the numbers of positive and negative instances (p − n) is not always a good representative of rule quality.

o Given a data set with 5000 positive instances and 5000 negative instances (T=10000, P=5000, and N=5000), a typical problem of this criterion occurs when the algorithm prefers a rule with a score of (3000 + (5000 − 2000))/10000 (i.e., 6000/10000) to a rule with a score of (999 + (5000 − 1))/10000 (i.e., 5998/10000). That is, the comparison is between p=3000, n=2000 and p=999, n=1. In the first rule the numbers of positive (3000) and negative (2000) instances are quite similar, while the second rule shows a dominant difference between positives and negatives (999 positive instances and 1 negative instance). Intuitively, the second rule has high contrast between positives and negatives, still covers many instances, and seems more reliable; it should be selected instead of the first rule.

Information gain: p × [log₂(p/t) − log₂(P/T)]

o The covering algorithm with information gain attempts to produce a set of rules with the highest information gain, p × [log₂(p/t) − log₂(P/T)], where p is the number of positive instances covered by the rule, t is the number of instances that satisfy the antecedent, P is the total number of positive instances in the data set, and T is the total number of instances in the data set. Moreover, if two rules have equivalent information gain, this criterion selects the rule with the larger number of positive instances (p).


o However, this criterion still shares a similar issue with the accuracy since it focuses mainly on the number of positive instances.

o Given a data set with 5000 positive instances and 5000 negative instances (T=10000, P=5000, and N=5000), a typical problem of this criterion occurs when the algorithm prefers a rule with information gain of 2000 × [log₂(2000/2500) − log₂(5000/10000)] (= 1356.14) to a rule with information gain of 999 × [log₂(999/1000) − log₂(5000/10000)] (= 997.56). That is, the comparison is between p=2000, n=500 and p=999, n=1. In the first rule the numbers of positive (2000) and negative (500) instances are not so different, while the second rule shows a dominant difference between positives and negatives (999 positive instances and 1 negative instance). Intuitively, the second rule has high contrast between the numbers of positive and negative instances and seems more reliable while still covering many instances, and it should be selected instead of the first rule.

Positive-negative difference: (p − n)/t

o The positive-negative difference criterion is essentially equivalent to the accuracy (p/t): since n = t − p, it can be rewritten as (p − (t − p))/t = (2p − t)/t. Finally, it is 2(p/t) − 1, a monotonic function of the accuracy.
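The trade-offs above are easy to check numerically. The short sketch below is illustrative only (the helper name and printout are ours); it evaluates the four criteria for the pairs of rules used in the examples, assuming T = 10000, P = N = 5000 and base-2 logarithms.

import math

T, P, N = 10000, 5000, 5000   # dataset totals used in the examples above

def criteria(p, n):
    # Return the four rule-quality criteria for a rule covering p positives and n negatives.
    t = p + n
    return {
        "accuracy p/t": p / t,
        "accuracy with negatives [p+(N-n)]/T": (p + (N - n)) / T,
        "information gain p*(log2(p/t)-log2(P/T))": p * (math.log2(p / t) - math.log2(P / T)),
        "positive-negative difference (p-n)/t": (p - n) / t,
    }

# Information-gain example: p=2000, n=500 scores about 1356.14 while p=999, n=1
# scores about 997.56, so the gain criterion prefers the first rule even though
# the second looks more reliable.
print(criteria(2000, 500))
print(criteria(999, 1))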

PRISM Algorithm

The pseudocode of the PRISM rule learner is shown as follows. In order to understand the

PRISM algorithm, the following health-check data set is used to describe the rule construction

(learning) process using the accuracy as the criterion to select the best rule.

Patient No.   Blood Pressure   Protein Level   Glucose Level   Heart Beat    Diseased
              (Feature #1)     (Feature #2)    (Feature #3)    (Feature #4)  (Class)
 1            High             Medium          High            Slow          Positive
 2            High             High            Normal          Fast          Negative
 3            High             Low             Normal          Slow          Positive
 4            High             Low             High            Fast          Negative
 5            Normal           Low             Normal          Fast          Negative
 6            Normal           High            Very High       Slow          Negative
 7            Normal           Medium          Very High       Slow          Negative
 8            Low              Medium          Very High       Slow          Positive
 9            High             Medium          High            Slow          Positive
10            High             High            Normal          Fast          Negative
11            High             Low             Normal          Slow          Positive
12            High             Low             High            Fast          Negative
13            Normal           Low             Normal          Fast          Negative
14            Normal           High            Very High       Slow          Negative
15            Normal           Medium          Very High       Slow          Negative
16            Low              Medium          Very High       Slow          Positive
17            High             Medium          High            Slow          Positive
18            High             High            Normal          Fast          Negative
19            High             Low             Normal          Slow          Positive
20            High             Low             High            Fast          Negative
21            Normal           Low             Normal          Fast          Negative
22            Normal           High            Very High       Slow          Negative
23            Normal           Medium          Very High       Slow          Negative
24            Low              Medium          Very High       Slow          Positive
25            High             Medium          High            Slow          Positive
26            High             High            Normal          Fast          Negative
27            High             Low             Normal          Slow          Positive
28            High             Low             High            Fast          Negative
29            Normal           Low             Normal          Fast          Negative
30            Normal           High            Very High       Slow          Negative
31            Normal           Medium          Very High       Slow          Negative
32            Low              Medium          Very High       Slow          Positive


Algorithm 3.1. The PRISM algorithm (Pseudocode of the PRISM rule learner)

FOREACH class C {
    INITIALIZE E to the instance set (the training set)
    WHILE E contains instances in class C {
        CREATE a rule R with an empty left-hand side that
            predicts class C
        UNTIL R is perfect (or there are no more attributes to use) {
            FOREACH attribute A not mentioned in R, and each value v {
                CONSIDER ADDING the condition A=v to the LHS of R
            }
            SELECT the A and v that maximize the accuracy p/t
                (break ties by choosing the condition with the largest p)
            ADD A=v to R
        }
        REMOVE the instances covered by R from E
    }
}
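For readers who prefer runnable code, the following is a compact Python sketch of the PRISM idea using the accuracy criterion with ties broken by coverage. It is a simplified illustration written for this text (data structures and variable names are our own), not a canonical implementation.

def prism(instances, labels):
    # instances: list of dicts {attribute: value}; labels: list of class labels.
    # Returns a list of (conditions, class) rules, where conditions is a dict.
    rules = []
    for cls in sorted(set(labels)):
        remaining = [(x, y) for x, y in zip(instances, labels)]
        while any(y == cls for _, y in remaining):
            conditions = {}
            covered = remaining
            # Refine the rule until it is perfect or no attributes are left.
            while True:
                pos = [x for x, y in covered if y == cls]
                if len(pos) == len(covered) or len(conditions) == len(instances[0]):
                    break
                best = None  # (accuracy p/t, coverage p, attribute, value)
                for attr in instances[0]:
                    if attr in conditions:
                        continue
                    for value in {x[attr] for x, _ in covered}:
                        t = [(x, y) for x, y in covered if x[attr] == value]
                        p = sum(1 for _, y in t if y == cls)
                        if best is None or (p / len(t), p) > best[:2]:
                            best = (p / len(t), p, attr, value)
                conditions[best[2]] = best[3]
                covered = [(x, y) for x, y in covered if x[best[2]] == best[3]]
            rules.append((dict(conditions), cls))
            # Remove the instances covered by the new rule.
            remaining = [(x, y) for x, y in remaining
                         if not all(x[a] == v for a, v in conditions.items())]
    return rules

Applied to the health-check table above with the attributes taken in table order, this sketch should reproduce rules equivalent to R1 and R2 for the 'positive' class.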

Based on this data set, the PRISM algorithm (Algorithm 3.1) will form rules that cover each of

the two alternative classes, positive and negative, in turn. Let us begin with the class ‘positive’. A

rule with empty left-hand-side and ‘positive’ class is considered as follows.

If ? then diseased = ‘positive’

For the antecedent, there are eleven possibilities. The following table shows the candidate rules and their values of p/t. The rule with the highest value is selected. If more than one rule holds the highest value, we select among them the rule with the highest coverage (i.e., the highest value of p). If more than one rule still holds the highest coverage, we randomly select one of them.

No. Rule p/t (= accuracy)

1 If (blood = ‘high’) then diseased = ‘positive’ 8/16 = 0.50

2 If (blood = ‘normal’) then diseased = ‘positive’ 0/12 = 0.00

3 If (blood = ‘low’) then diseased = ‘positive’ 4/4 = 1.00

4 If (protein = ‘high’) then diseased = ‘positive’ 0/8 = 0.00

5 If (protein = ‘medium’) then diseased = ‘positive’ 8/12 = 0.67

6 If (protein = ‘low’) then diseased = ‘positive’ 4/12 = 0.33

7 If (glucose = ‘normal’) then diseased = ‘positive’ 4/12 = 0.33

8 If (glucose = ‘high’) then diseased = ‘positive’ 4/8 = 0.50

9 If (glucose = ‘very high’) then diseased = ‘positive’ 4/12 = 0.33

10 If (heart = ‘fast’) then diseased = ‘positive’ 0/12 = 0.00

11 If (heart = ‘slow’) then diseased = ‘positive’ 12/20 = 0.60

From the above table, the third rule has the highest value of p/t (=4/4). Therefore, we

include the third rule into the final classification rule set.

(R1) If blood = ‘low’ then diseased = ‘positive’ p/t = 4/4 (=1.0)

Since this rule is perfect (the accuracy of 1.0), there is no need to refine this rule. Next, we

delete all instances covered by the rule and then find another rule to cover the remaining

instances. The following table expresses the status after the four instances covered by the rule

are deleted.


Patient No.   Blood Pressure   Protein Level   Glucose Level   Heart Beat    Diseased
              (Feature #1)     (Feature #2)    (Feature #3)    (Feature #4)  (Class)
 1            High             Medium          High            Slow          Positive
 2            High             High            Normal          Fast          Negative
 3            High             Low             Normal          Slow          Positive
 4            High             Low             High            Fast          Negative
 5            Normal           Low             Normal          Fast          Negative
 6            Normal           High            Very High       Slow          Negative
 7            Normal           Medium          Very High       Slow          Negative
 9            High             Medium          High            Slow          Positive
10            High             High            Normal          Fast          Negative
11            High             Low             Normal          Slow          Positive
12            High             Low             High            Fast          Negative
13            Normal           Low             Normal          Fast          Negative
14            Normal           High            Very High       Slow          Negative
15            Normal           Medium          Very High       Slow          Negative
17            High             Medium          High            Slow          Positive
18            High             High            Normal          Fast          Negative
19            High             Low             Normal          Slow          Positive
20            High             Low             High            Fast          Negative
21            Normal           Low             Normal          Fast          Negative
22            Normal           High            Very High       Slow          Negative
23            Normal           Medium          Very High       Slow          Negative
25            High             Medium          High            Slow          Positive
26            High             High            Normal          Fast          Negative
27            High             Low             Normal          Slow          Positive
28            High             Low             High            Fast          Negative
29            Normal           Low             Normal          Fast          Negative
30            Normal           High            Very High       Slow          Negative
31            Normal           Medium          Very High       Slow          Negative

(Patients 8, 16, 24, and 32, which are covered by R1, have been removed.)

Based on this table, we will form another rule to perfectly cover the instances in the positive

class. Another rule with empty left-hand-side and ‘positive’ class is considered as follows.

If ? then diseased = ‘positive’

For the antecedent, there are ten possibilities. The following table shows the rules and their

values of p/t.

No. Rule p/t (= accuracy)

1 If (blood = ‘high’) then diseased = ‘positive’ 8/16 = 0.50

2 If (blood = ‘normal’) then diseased = ‘positive’ 0/12 = 0.00

- If (blood = ‘low’) then diseased = ‘positive’ - -

3 If (protein = ‘high’) then diseased = ‘positive’ 0/8 = 0.00

4 If (protein = ‘medium’) then diseased = ‘positive’ 4/8 = 0.50

5 If (protein = ‘low’) then diseased = ‘positive’ 4/12 = 0.33

6 If (glucose = ‘normal’) then diseased = ‘positive’ 4/12 = 0.33

7 If (glucose = ‘high’) then diseased = ‘positive’ 4/8 = 0.50

8 If (glucose = ‘very high’) then diseased = ‘positive’ 0/8 = 0.00

9 If (heart = ‘fast’) then diseased = ‘positive’ 0/12 = 0.00

10 If (heart = ‘slow’) then diseased = ‘positive’ 8/16 = 0.50

From the above table, the first rule and the tenth rule have the highest value of p/t (=8/16)

and also the highest coverage (p=8). At this point, we randomly select the first rule. Since it does

not have 100% accuracy, further refinement is necessary.

If blood = ‘high’ then diseased = ‘positive’ p/t = 8/16 (=0.5)


Since this rule is not perfect (its accuracy is below 1.0), we need to refine it. At this point we select the instances that have blood = 'high' and then add another antecedent so as to cover only the instances with the positive class. The following are the instances with blood = 'high'.

Patient No.   Blood Pressure   Protein Level   Glucose Level   Heart Beat    Diseased
              (Feature #1)     (Feature #2)    (Feature #3)    (Feature #4)  (Class)
 1            High             Medium          High            Slow          Positive
 2            High             High            Normal          Fast          Negative
 3            High             Low             Normal          Slow          Positive
 4            High             Low             High            Fast          Negative
 9            High             Medium          High            Slow          Positive
10            High             High            Normal          Fast          Negative
11            High             Low             Normal          Slow          Positive
12            High             Low             High            Fast          Negative
17            High             Medium          High            Slow          Positive
18            High             High            Normal          Fast          Negative
19            High             Low             Normal          Slow          Positive
20            High             Low             High            Fast          Negative
25            High             Medium          High            Slow          Positive
26            High             High            Normal          Fast          Negative
27            High             Low             Normal          Slow          Positive
28            High             Low             High            Fast          Negative

Based on this table, we try to add another antecedent to perfectly cover the instances in the

positive class. Therefore, the following rule template can be considered.

If blood = ‘high’ & ( ? ) then diseased = ‘positive’

For the additional antecedent, there are eight possibilities. The following table shows the candidate rules and their values of p/t.

No. Rule p/t (= accuracy)

1 If (blood = ‘high’ & protein = ‘high’) then diseased = ‘positive’ 0/4 = 0.00

2 If (blood = ‘high’ & protein = ‘medium’) then diseased = ‘positive’ 4/4 = 1.00

3 If (blood = ‘high’ & protein = ‘low’) then diseased = ‘positive’ 4/8 = 0.50

4 If (blood = ‘high’ & glucose = ‘normal’) then diseased = ‘positive’ 4/8 = 0.50

5 If (blood = ‘high’ & glucose = ‘high’) then diseased = ‘positive’ 4/8 = 0.50

6 If (blood = ‘high’ & glucose = ‘very high’) then diseased = ‘positive’ - -

7 If (blood = ‘high’ & heart = ‘fast’) then diseased = ‘positive’ 0/8 = 0.00

8 If (blood = ‘high’ & heart = ‘slow’) then diseased = ‘positive’ 8/8 = 1.00

From the above table, the second rule and the eighth rule have the highest accuracy (=1.0) but

the eighth rule has the highest coverage (p=8). Therefore, we select the eighth rule.

(R2) If blood = ‘high’ & heart = ‘slow’ then diseased = ‘positive’ p/t = 8/8 (=1.0)

Since this rule is perfect (the accuracy of 1.0), there is no need to refine this rule. Next, we

delete all instances covered by the rule and then find another rule to cover the remaining

instances. However, after the construction of the eighth rule, there is no instance with the

‘positive’ class left. Therefore, the PRISM algorithm will output the following two rules as the set

of classification rules.

(R1) If blood = ‘low’ then diseased = ‘positive’

(R2) If blood = ‘high’ & heart = ‘slow’ then diseased = ‘positive’

While these two perfect rules completely cover all instances in the training set, they may be too specific and perform poorly on unseen data in general. This situation is known as overfitting. In a covering algorithm, a rule tends to become more overfitted every time an antecedent is added to its left-hand side. To avoid overfitting, we can stop adding antecedents at some point even though the rule is still not perfect (below 100% accuracy). That is, sometimes it is better not to generate perfect rules that guarantee the correct classification of all training instances. We need to consider which rules are worthwhile and to determine when a rule becomes counterproductive as more antecedents (conditions) are added just to exclude a few pesky instances of the wrong type. There are two main strategies for pruning rules: global pruning (post-pruning) and incremental pruning (pre-pruning). The first calculates the full set of rules and then prunes them, while the second prunes each rule at the time it is refined or generated. As we have seen, rules are normally created with a criterion such as p/t; for pruning, however, we need another criterion to measure when refinement becomes counterproductive. Three general pruning criteria are the MDL principle (Minimum Description

Length), reduced-error pruning (or calculating errors in the holdout set) and statistical

significance (as done in INDUCT algorithm). The MDL principle is a formalization of Occam's

razor in which the best hypothesis for a given set of data is the one that leads to the largest

compression of the data. The MDL was introduced by Jorma Rissanen in 1978 and it is an

important concept in information theory and learning theory. Any set of data can be represented

by a string of symbols from a finite (say, binary) alphabet. The fundamental idea behind the MDL

principle is that any regularity in a given set of data can be used to compress the data, i.e. to

describe it using fewer symbols than needed to describe the data literally (Grünwald, 1998).

Moreover, it may not be possible to capture all the data with regularities (general knowledge); some exceptional data may remain. Since the MDL approach attempts to select the hypothesis that captures the most regularity in the data while leaving the fewest exceptions, it aims to find the best compression (the smallest description of the regularities together with the fewest exceptions). Applied to classification rules, this approach requires a method to measure the size of the rules and the size of the instances that are not covered by the rules, and then selects the smallest set of rules that produces the smallest number of exceptions.

In the reduced-error pruning approaches, the training data are split into two parts: a growing set and a pruning set. First, the growing set is used to form a rule using the basic covering algorithm. Then the rule is tested on the pruning set, and the effect is evaluated by seeing whether the rule also performs well on the pruning set or not. Based on the timing of pruning, two variants are reduced-error pruning and incremental reduced-error pruning. Normal reduced-error pruning uses the growing set to build the complete set of classification rules and then uses the pruning set to evaluate the antecedents of each rule in order to omit useless ones. The incremental version prunes a rule immediately, checking whether the most recently added antecedent (test) is effective and throwing it out if its performance on the pruning set is not good enough.

The statistical significance approach uses statistical criteria to decide the effectiveness of adding an antecedent to a rule. One common approach is to apply the hypergeometric distribution or the binomial distribution to calculate the probability that the rule would be produced by chance: the lower this probability, the more significant (the better) the rule is. Therefore, by incorporating such a test into the covering algorithm, it is possible to compare the probabilities (statistical significances) of a rule with and without a newly added antecedent. The hypergeometric distribution indicates how likely the rule is to be generated by chance. Suppose that a data set contains P positive instances and N negative instances (i.e., the total number of instances is T = P + N) and that a rule (If A then C) covers p positive instances while the conjunction of antecedents A is satisfied by t instances (i.e., the number of negative instances covered by this rule is n = t − p). Figure 3-26 shows the conceptual diagram for the hypergeometric distribution in the rule significance calculation. Following


the combinatorial argument, to generate the rule we need to select t instances from the T instances in the whole data set, composed of p positive instances chosen from the P positive instances and t − p negative instances chosen from the T − P negative instances. The probability of the rule under the hypergeometric distribution is therefore

$$\Pr(p \mid T, P, t) = \frac{\binom{P}{p}\binom{T-P}{t-p}}{\binom{T}{t}}, \qquad n = t - p.$$

Figure 3-26: Conceptual diagram of the hypergeometric distribution for rule significance

The significance of a rule R can then be defined as the probability that a randomly drawn rule covering t instances performs at least as well as R, i.e., covers at least p positive instances. This is called the statistical significance of the rule, m(R):

$$m(R) = \sum_{i=p}^{\min(t,P)} \frac{\binom{P}{i}\binom{T-P}{t-i}}{\binom{T}{t}}.$$

As mentioned above, in the process of the covering algorithm, a rule is revised by adding an

antecedent to increase the accuracy. Here, let R be the current rule and R- be the rule without the

last additional antecedent. When we refine the rule with an additional antecedent, the rule will

become more specific and cover a smaller set of instances. At each refinement step, it is possible

to calculate the significance of the rule before and after the refinement. If the significance increases, i.e., m(R) < m(R-) (recall that a smaller probability means a more significant rule), we should add the antecedent to the rule. Otherwise, we should not, and the rule refinement process should be suspended. The pseudocode of the PRISM rule learner with rule significance testing, known as the INDUCT algorithm, is shown in Algorithm 3.2 below.


Algorithm 3.2. The PRISM algorithm with significance testing (called INDUCT algorithm)

INITIALIZE E to the instance set (the training set)
WHILE E contains instances {
    FOREACH class C for which E contains an instance {
        CREATE a rule R with an empty left-hand side that
            predicts class C
        UNTIL R is perfect (or there are no more attributes) {
            FOREACH attribute A not mentioned in R, and each value v {
                CONSIDER ADDING the condition A=v to the LHS of R
            }
            SELECT the A and v that maximize the accuracy p/t
                (break ties by choosing the condition with the largest p)
            CALCULATE the significance m(R) of the rule R with A=v added
            CALCULATE the significance m(R-) of the rule with this
                final condition omitted
            IF (m(R-) < m(R)) { LEAVE UNTIL-LOOP } ELSE { ADD A=v to R }
        }
    }
    COMPARE the rules generated for the different classes and select
        the most significant rule (i.e. the one with the smallest m(R))
    ADD the most significant rule R to the set of output rules
    REMOVE the instances covered by R from E
}

To elaborate the process, we calculate the rule significance for the second rule (R2), compared with the rule without its last antecedent (R2-), as follows. Here, there are 12 positive instances (P=12) among the total of 32 instances (T=32) in the original data set.

(R2)  If blood = ‘high’ & heart = ‘slow’ then diseased = ‘positive’   p/t = 8/8 (=1.0)
(R2-) If blood = ‘high’ then diseased = ‘positive’                    p/t = 8/16 (=0.5)

The significance level of the rule R2, with T=32, P=12, t=8, and p=8, is

$$m(R2) = \frac{\binom{12}{8}\binom{20}{0}}{\binom{32}{8}} \approx 4.7 \times 10^{-5}.$$

The significance level of the rule R2-, with T=32, P=12, t=16, and p=8, is

$$m(R2\text{-}) = \sum_{i=8}^{12}\frac{\binom{12}{i}\binom{20}{16-i}}{\binom{32}{16}} \approx 0.137.$$
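These tail probabilities can be checked with a few lines of code; the sketch below is our own (the function name is illustrative) and uses only the Python standard library.

from math import comb

def rule_significance(T, P, t, p):
    # Probability that a random rule covering t of the T instances picks up
    # at least p of the P positive instances (hypergeometric tail), i.e. m(R).
    return sum(comb(P, i) * comb(T - P, t - i) for i in range(p, min(t, P) + 1)) / comb(T, t)

m_r2 = rule_significance(T=32, P=12, t=8, p=8)          # ~4.7e-05
m_r2_minus = rule_significance(T=32, P=12, t=16, p=8)   # ~0.137
print(m_r2 < m_r2_minus)   # True: the added antecedent makes the rule more significant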


Since m(R2) < m(R2-), we should add the antecedent into the rule. Therefore, R2 is accepted as

the refinement.

(R2) If blood = ‘high’ & heart = ‘slow’ then diseased = ‘positive’ p/t = 8/8 (=1.0)

Although not shown here, in cases where m(R2) > m(R2-), we would instead omit the antecedent from the rule.

Besides this most basic algorithm, some popular variations include AQ (Michalski, 1969), CN2 (Clark and Niblett, 1989), and RIPPER (Cohen, 1995). The AQ and CN2 algorithms are described in Algorithms 3.3 and 3.4, respectively. Michalski's AQ and related algorithms were inspired by methods used by electrical engineers for simplifying Boolean circuits (Higonnet & Grea, 1958). They exemplify the specific-to-general direction and typically start with a maximally specific rule for assigning cases to a given class. In a classic example of the AQ algorithm, a set of examples of the class MAMMAL in a taxonomy of vertebrates is provided in order to learn a set of rules for classifying or characterizing objects of that class; starting from the most specific example, a generalization process (a bottom-up process) is performed. In contrast, CN2 and RIPPER are top-down approaches. The CN2 algorithm aims to modify the basic AQ

algorithm in such a way as to equip it to cope with noise and other complications in the data. In particular, during its search for good complexes, CN2 does not automatically remove from consideration a candidate that is found to include one or more negative examples. Rather, it retains in its search a set of complexes that are evaluated statistically as covering a large number

of examples of a given class and few of other classes. Moreover, the manner in which the search is

conducted is general-to-specific. Each trial specialization step takes the form of either adding a

new conjunctive term or removing a disjunctive one. Having found a good complex, the algorithm

removes those examples it covers from the training set and adds the rule “if <complex> then

predict <class>” to the end of the rule list. The process terminates for each given class when no

more acceptable complexes can be found. As shown in Algorithm 3.4, the CN2 algorithm has the

following main features: (1) the dependence on specific training examples during search (a

feature of the AQ algorithm) is removed; (2) it combines the efficiency and ability to cope with

noisy data of decision tree learning with the if-then rule form and flexible search strategy of the

AQ family; (3) it contrasts with other approaches to modify AQ to handle noise in that the basic

AQ algorithm itself is generalized rather than “patched” with additional pre- and post-processing

techniques; and (4) it produces both ordered and unordered rules.


Algorithm 3.3. The AQR algorithm for generating a class cover (a set of classification rules)

LET pos be a set of positive examples of class C.

LET neg be a set of negative examples of class C.

PROCEDURE AQR(pos, neg){

LET cover be the empty cover.

WHILE cover does not cover all examples in pos {
        SELECT a seed (a positive example not covered by cover).
        LET star be STAR(seed, neg) (a set of complexes that
            cover seed but cover no examples in neg).
        LET best be the best complex in star
            according to user-defined criteria.
        ADD best as an extra disjunction to cover.
    }
    RETURN cover
}

PROCEDURE STAR(seed, neg){

LET star be the set containing the empty complex.

WHILE any complex in star covers some negative examples in neg,

SELECT a negative example Eneg covered by a complex in star.

SPECIALIZE complexes in star to exclude Eneg by:

LET extension be all selectors that cover seed but not Eneg.

LET star be the set {x ∧ y | x ∈ star, y ∈ extension}.
        REMOVE all complexes in star subsumed by other complexes.

REPEAT UNTIL size-of-star < maxstar (a user-defined maximum):

REMOVE the worst complex from STAR.

RETURN star.

Algorithm 3.4. The CN2 algorithm for generating a class cover (a set of classification rules)

LET e be a set of classified examples.

LET selectors be the set of all possible selectors.

PROCEDURE CN2(e){

LET rule-list be the empty list.

REPEAT UNTIL best_complex is nil or e is empty:

LET best_complex be FIND-BEST-COMPLEX(e).

IF best_complex is not nil,

THEN LET e' be the examples covered by best_complex.

REMOVE from e the examples e' covered by best_complex.

LET C be the most common class of examples in e'.

ADD the rule 'IF best_complex THEN the class is C'

TO the end of rule-list.

RETURN rule-list }

PROCEDURE FIND-BEST-COMPLEX(e) {

LET star be the set containing the empty complex.

LET best_complex be nil.

WHILE star is not empty {

SPECIALIZE all complexes in star as follows:

LET newstar be the set {x ∧ y | x ∈ star, y ∈ selectors}.
        REMOVE all complexes in newstar that are either in star (i.e., the
            unspecialized ones) OR null (e.g., big=y ∧ big=n).
        FOR every complex Ci in newstar:

IF Ci is statistically significant and better than

best_complex by user-defined criteria when tested on e,

THEN replace the current value of best_complex by Ci.

REPEAT UNTIL size of NEWSTAR < user-defined maximum:

REMOVE the worst complex from NEWSTAR.

LET star be newstar.

RETURN best_complex. }


3.1.6. Artificial Neural Networks

The field of artificial neural networks (ANNs) was originally initiated by psychologists and neurobiologists who attempted to develop and test computational analogues of neurons. To date, ANNs have been applied to imitate human abilities such as the use of language (speech) and concept learning, as well as to many practical commercial, scientific, and engineering tasks in pattern recognition, modeling, and prediction (Hertz et al., 1991; Wasserman, 1989). In general, an ANN consists of connected input/output units and their

Wasserma, 1989). In general, an ANN consists of connected input/output units and their

weighted connection. During the learning phase, the network is gradually adapted by adjusting

the weights among units in order to obtain the best weights that predict the correct class label of

an input. A well-known neural network learning algorithm is backpropagation, also referred to as

connectionist learning due to the connections between units. Since neural networks involve long

training times, they may not be suitable for applications that require real-time learning process.

In many cases, a number of parameters are typically best determined empirically, such as the

network topology or structure. One more criticism related to neural networks is their poor

interpretability. It is difficult for us to interpret the symbol behind the learned weights and of

“hidden units” in the network. This poor interpretability made ANNs less desirable for data

mining. However, recently a number of techniques have recently been developed for the

extraction of rules from trained neural networks. Moreover, ANNs have advantages in their high

tolerance of noisy data as well as their ability to classify patterns on which they have not been

trained. ANNs can be used when little knowledge is available about the relationships between attributes and classes. In addition, they are well suited to continuous-valued inputs and outputs, unlike most decision tree algorithms. Their successful applications to a wide array of real-world data include handwritten character recognition, pathology and laboratory medicine, and training a computer to pronounce English text. Moreover, it is quite straightforward to apply parallelization techniques to neural network algorithms, and this parallelism speeds up the computation. These factors contribute toward the

usefulness of neural networks for classification and prediction in data mining. Currently, there

are many different kinds of neural network algorithms.

As mentioned above, the most popular neural network algorithm is backpropagation, invented in the 1980s for multilayer feed-forward neural networks. Backpropagation iteratively learns a set of weights for predicting the class labels of tuples. A multilayer feed-forward neural

network consists of an input layer, one or more hidden layers, and an output layer as shown in

Figure 3-27. For the input, the hidden and the output layers, each of them is composed of a

number of nodes (units). The inputs to the network correspond to the values of the

attributes measured for each training tuple. The inputs are fed simultaneously into the nodes of

input layer. After these inputs pass through the input layer, they will be weighted ( for the

node i to the node j) and fed to the nodes of the second layer, a hidden layer. The outputs of the

hidden layer will become inputs to another hidden layer, and so on. Although the number of

hidden layers can be arbitrary, the widely-used network is the one with only one hidden layer, as

shown in Figure 3-27. The outputs of the last hidden layer are also weighted (with a weight w_jk on the connection from node j to node k) and used as inputs to the nodes making up the output layer, which finally sends out the

prediction for given tuples. The nodes in the input layer are called input nodes. The nodes in the

hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic

biological basis, or as output nodes. A multilayer network with one hidden layer and one output layer is called a two-layer neural network; normally the input layer is not counted because it serves only to pass the input values to the next layer. Similarly, a network containing two hidden layers is called a

three-layer neural network, and so on. The network is feed-forward in that none of the weights


cycles back to an input node or to an output node of a previous layer. However, it is fully

connected in that each node provides input to each node in the next forward layer. Each output

node takes, as input, a weighted sum of the outputs from nodes in the previous layer. It applies a

nonlinear (activation) function to the weighted input. Multilayer feed-forward neural networks

are able to model the class prediction as a nonlinear combination of the inputs. From a statistical

viewpoint, they perform nonlinear regression. Multilayer feed-forward networks, given enough

hidden units and training samples, can closely approximate any function.

Figure 3-27: The output calculation for a hidden or output node (unit j). The inputs to unit j are the outputs from the previous layer. These inputs are multiplied by their corresponding weights to form a weighted sum, which is then added to the bias associated with unit j before a nonlinear activation function is applied to produce the final output. Note that this calculation occurs at each node in the hidden layer and the output layer; although the figure shows the calculation for the last node in the hidden layer, the same procedure is applied to all other nodes in the hidden layer and the output layer.

The main step in learning the ANN is the calculation of the weight of each association link in

the network. As mentioned above, the learning can be done by the process named

backpropagation. The backpropagation procedure iteratively processes each tuple (data item) from the


dataset of training tuples, comparing the network's prediction for each tuple with the actual known target value, in order to adjust the weights of the association links. For classification, the target value is the known class label of the training tuple; for prediction, it is a continuous value. As each training tuple is presented to the network during learning, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value. These modifications are made in the "backwards" direction, that is, from the output layer through each hidden layer down to the first hidden layer; because of this property, the method is called backpropagation. In practice, the weights generally converge eventually and the learning process stops, although convergence is not guaranteed in general.

Algorithm 3.5. Backpropagation Neural network learning for ANN for classification or prediction

INPUT:  T, a training dataset; l, the learning rate;
        network, a multilayer feed-forward network.
OUTPUT: A trained neural network.
PROCEDURE:
INITIALIZE all weights w_ij and biases b_j in network (e.g., to small random values)
WHILE the terminating condition is NOT satisfied {
    FOREACH training tuple X = (x_1, ..., x_n) in T {
        // Propagate the inputs forward:
        FOREACH input-layer unit i in [1,n] {
            O_i = x_i }                              // output of an input unit is its actual input value
        FOREACH hidden-layer unit j in [1,m] {
            I_j = SUM over i of (w_ij * O_i) + b_j   // net input to hidden unit j from the previous layer i
            O_j = 1 / (1 + exp(-I_j)) }              // output of hidden unit j (sigmoid activation)
        FOREACH output-layer unit k in [1,p] {
            I_k = SUM over j of (w_jk * O_j) + b_k   // net input to output unit k from the hidden layer j
            O_k = 1 / (1 + exp(-I_k)) }              // output of output unit k
        // Backpropagate the errors:
        FOREACH unit k in the output layer {
            Err_k = O_k * (1 - O_k) * (T_k - O_k) }  // error with respect to the target value T_k
        FOREACH unit j in the hidden layers, from the last to the first hidden layer {
            Err_j = O_j * (1 - O_j) * SUM over k of (Err_k * w_jk) }  // error w.r.t. the next higher layer k
        FOREACH weight w_ij in network {
            delta_w_ij = l * Err_j * O_i             // weight increment
            w_ij = w_ij + delta_w_ij }               // weight update
        FOREACH bias b_j in network {
            delta_b_j = l * Err_j                    // bias increment
            b_j = b_j + delta_b_j }                  // bias update
    }
}
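As a runnable counterpart to Algorithm 3.5, the sketch below trains a one-hidden-layer feed-forward network with sigmoid units, updating the weights after each tuple in the spirit of the pseudocode. The layer sizes, learning rate, stopping rule, and toy data are illustrative assumptions, not values from the text.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, lr = 4, 3, 1, 0.5           # illustrative sizes and learning rate

# Small random initial weights and zero biases
W1, b1 = rng.normal(scale=0.5, size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.5, size=(n_hidden, n_out)), np.zeros(n_out)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy training data: 4 binary attributes, target = value of the first attribute
X = rng.integers(0, 2, size=(32, n_in)).astype(float)
y = X[:, [0]]

for epoch in range(2000):                           # terminating condition: fixed number of epochs
    for x, t in zip(X, y):
        # Forward pass
        o_hidden = sigmoid(x @ W1 + b1)             # outputs of the hidden layer
        o_out = sigmoid(o_hidden @ W2 + b2)         # outputs of the output layer
        # Backpropagate the errors
        err_out = o_out * (1 - o_out) * (t - o_out)
        err_hidden = o_hidden * (1 - o_hidden) * (W2 @ err_out)
        # Update weights and biases (per-tuple, as in the pseudocode)
        W2 += lr * np.outer(o_hidden, err_out);  b2 += lr * err_out
        W1 += lr * np.outer(x, err_hidden);      b1 += lr * err_hidden

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.mean((pred > 0.5) == (y > 0.5)))           # should be close to 1.0 on this toy task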


Algorithm 3.5 illustrates the procedure for learning a neural network using backpropagation. Given a training set of objects and their associated class labels, denoted by T, each object is represented by an n-dimensional attribute vector X = (x_1, x_2, ..., x_n) depicting the measured values of the n attributes A_1, A_2, ..., A_n of the object, together with its class y, one of m possible classes C_1, C_2, ..., C_m (or, for numeric prediction, its actual value y, a real number). That is, y ∈ {C_1, C_2, ..., C_m}. In the common ANN structure, each input node corresponds to one attribute (A_i) and each output node corresponds to one class C_k (or to the target value y).

The algorithm may look complicated, but on careful investigation each step is inherently simple. In summary, the backpropagation algorithm performs learning on a multilayer feed-forward neural network. Each layer is made up of units. The inputs to the network correspond to the attributes measured for each training tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass through the input layer and are then weighted and fed simultaneously to a second layer of "neuron-like" units, known as a hidden layer. The outputs of the hidden-layer units can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice usually only one is used. The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction for given tuples. The units in the input layer are called input units. The units in the hidden layers and the output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. A multilayer neural network of the kind described with Figure 3-27, with one hidden layer and one output layer, is said to have two layers. In artificial neural networks (ANNs), backpropagation is the best-known algorithm for learning the weights in the network.
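As a concrete illustration of Algorithm 3.5, the following is a minimal NumPy sketch of backpropagation for a network with one hidden layer and sigmoid activations. The toy XOR-style data, the layer sizes, the learning rate and the number of epochs are arbitrary choices made for illustration; they are not prescribed by the text, and a small network like this may need more epochs to converge on other data.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training data: 2 inputs, 1 output (XOR-like targets)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 3, 1
l = 0.5                                   # learning rate

# (1) Initialize all weights and biases with small random values
W1 = rng.normal(scale=0.5, size=(n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.5, size=(n_hid, n_out)); b2 = np.zeros(n_out)

for epoch in range(5000):                 # terminating condition: fixed number of epochs
    for x, t in zip(X, T):                # process each training tuple
        # Propagate the inputs forward
        o_hid = sigmoid(x @ W1 + b1)      # outputs of the hidden units
        o_out = sigmoid(o_hid @ W2 + b2)  # outputs of the output units
        # Backpropagate the errors
        err_out = o_out * (1 - o_out) * (t - o_out)
        err_hid = o_hid * (1 - o_hid) * (W2 @ err_out)
        # Update weights and biases
        W2 += l * np.outer(o_hid, err_out); b2 += l * err_out
        W1 += l * np.outer(x, err_hid);     b1 += l * err_hid

print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))   # predictions after training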

3.1.7. Support Vector Machines (SVMs)

Support Vector Machines (SVMs) are supervised learning methods for analyzing data and recognizing patterns, used for classification and regression on both linear and nonlinear data; they were proposed by Boser, Guyon and Vapnik (1992). Although the training time of even the fastest SVMs can be extremely slow, they are highly accurate, owing to their ability to model complex nonlinear decision boundaries. They are much less prone to overfitting than other methods, and the support vectors found also provide a compact description of the learned model. SVMs can be used for prediction as well as classification. They have been applied to a number of areas, including handwritten digit recognition, object recognition, and speaker identification, as well as benchmark time-series prediction tests.

Given a set of input data, the standard SVM predicts to which of two possible classes a given input belongs; it is thus a non-probabilistic binary linear classifier. In other words, an SVM takes a set of training examples, each marked with one of two classes, and builds a model that predicts whether a new example falls into one class (category) or the other. If we represent the examples as points in space, the learned optimal SVM is a model that divides the examples of the two classes with as large a gap as possible. After the best model has been acquired, a new example is classified by mapping it into that same space and predicting the class it should belong to, based on which side of the gap it falls on.

In a two-class, linearly separable classification problem, there are many possible decision boundaries, and these decision boundaries are not equally good (see Figure 3-28). In general, the perceptron algorithm (an artificial neural network) or a genetic algorithm can find such a boundary, but there is also an analytical approach to this problem, which is described below.


Figure 3-28: Examples of bad decision boundaries

Figure 3-29: Linear separating hyperplanes for the separable case. The support vectors are circled, together with a concrete example of linear separating hyperplanes.


To find the optimal separation, formally, an SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks. The best SVM model is the one that gives the best separation, i.e., the hyperplane that has the largest distance to the nearest training data points of the two classes (the so-called functional margin). The larger the margin, the lower the generalization error of the classifier. Figure 3-29 shows linear separating hyperplanes for the separable case; the support vectors are highlighted with large circles, together with a concrete example of the hyperplanes in a 2-D case. Intuitively, the decision boundary should be as far away from the data of both classes as possible, which implies maximizing the margin. The formal description is as follows.

Given the training data {(x_i, y_i)}, i = 1, ..., N, where x_i is a datum represented by a vector of dimension d and y_i is a binary class of -1 or +1, the support vector machine finds the best hyperplane which separates the positive from the negative examples (a "separating hyperplane"). In principle, the points on the hyperplane satisfy the formula w.x + b = 0, where w is a normal vector perpendicular to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w. Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) example. The margin of a separating hyperplane is d+ + d-. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

   x_i.w + b >= +1   for y_i = +1
   x_i.w + b <= -1   for y_i = -1

These two constraints can be combined into one set of inequalities:

   y_i (x_i.w + b) - 1 >= 0   for all i

Consider the points satisfying the equality in the first constraint, i.e., x_i.w + b = +1. These points lie on the hyperplane H1: x_i.w + b = +1 with normal w and perpendicular distance from the origin |1 - b|/||w||. Similarly, the points for which the equality in the second constraint holds lie on the hyperplane H2: x_i.w + b = -1, with the same normal w and perpendicular distance from the origin |-1 - b|/||w||. Hence d+ = d- = 1/||w||, and the margin is simply 2/||w||. Note that H1 and H2 are parallel, since they have the same normal, and that no training points fall between them. Thus, we can find the pair of hyperplanes that gives the maximum margin by minimizing ||w||^2, subject to the combined constraints above. The solution for a typical two-dimensional case has the graphical representation shown in Figure 3-29. The points in the training data set satisfying the equality in the combined constraints (i.e., those which lie on one of the hyperplanes H1, H2) are called support vectors. They are circled in the figure.

Based on this formulation, the decision boundary can be found by solving the following constrained optimization problem:

   Minimize    (1/2) ||w||^2
   Subject to  y_i (x_i.w + b) >= 1   for all i

It is not easy to solve this problem directly, since it requires some complex transformations.


Towards a solution, the problem of finding the support vectors can be recast in a Lagrangian formulation. There are two reasons for this. The first is that the constraints y_i(x_i.w + b) >= 1 can be replaced by constraints on the Lagrange multipliers themselves, which are much easier to handle. The second is that in this reformulation of the problem, the training data appear (in the actual training and test algorithms) only in the form of dot products between vectors. This is a crucial property, which allows us to generalize the procedure to the nonlinear case. The steps are as follows.

Consider the following general optimization problem: minimize f(x) subject to g(x) = 0. A necessary condition for x_0 to be a solution is

   d/dx [ f(x) + a g(x) ] at x = x_0 is 0,   with g(x_0) = 0,

where a is the Lagrange multiplier. For multiple constraints g_i(x) = 0, i = 1, ..., M, we use one Lagrange multiplier a_i for each constraint:

   d/dx [ f(x) + SUM_i a_i g_i(x) ] at x = x_0 is 0,   with g_i(x_0) = 0.

For the more general case of inequality constraints g_i(x) <= 0, the formula changes only slightly: although it is similar to the equality case, this case requires the Lagrange multipliers to be non-negative. Here, the problem is to minimize f(x) subject to g_i(x) <= 0 for i = 1, ..., M, and there must exist a_i >= 0 such that x_0 satisfies

   d/dx [ f(x) + SUM_i a_i g_i(x) ] at x = x_0 is 0,   with g_i(x_0) <= 0.

The function L(x, a) = f(x) + SUM_i a_i g_i(x) is known as the Lagrangian, and we would like to set its gradient to 0.

We can now apply this general optimization framework to the decision-boundary problem of the original support vector machine by setting

   f(w) = (1/2) ||w||^2   and   g_i(w, b) = 1 - y_i (x_i.w + b) <= 0,

that is:

   Minimize    (1/2) ||w||^2
   Subject to  1 - y_i (x_i.w + b) <= 0   for i = 1, ..., N

The Lagrangian and its application to SVM

The Lagrangian of this problem can be summarized as follows:

   L(w, b, a) = (1/2) ||w||^2 + SUM_{i=1..N} a_i ( 1 - y_i (x_i.w + b) )


Note that a_i >= 0. Here, we set the gradient of the Lagrangian with respect to w and b to zero. First, we perform partial differentiation of the Lagrangian with respect to w:

   dL/dw = w - SUM_i a_i y_i x_i = 0,   hence   w = SUM_i a_i y_i x_i

Second, we perform partial differentiation of the Lagrangian with respect to b:

   dL/db = - SUM_i a_i y_i = 0,   hence   SUM_i a_i y_i = 0

At this point, we substitute w = SUM_i a_i y_i x_i into the Lagrangian L(w, b, a). Since SUM_i a_i y_i = 0, the following formula is derived:

   W(a) = SUM_i a_i - (1/2) SUM_i SUM_j a_i a_j y_i y_j (x_i.x_j)

After this transformation, the new objective function is in terms of a only. This is known as the dual problem: if we know w, then we know all a, and vice versa. The original problem is known as the primal problem, while the new function is known as the dual problem. Moreover, this particular dual formulation of the problem is called the Wolfe dual (Fletcher, 1987). It maximizes L with respect to a while requiring that the derivatives of L with respect to w and b vanish, all subject to the constraints a_i >= 0. This is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints form a convex set (any linear constraint defines a convex set, and a set of simultaneous linear constraints defines the intersection of convex sets, which is also a convex set). The objective function of the dual problem to be maximized is as follows:

   Maximize    W(a) = SUM_i a_i - (1/2) SUM_i SUM_j a_i a_j y_i y_j (x_i.x_j)
   Subject to  SUM_i a_i y_i = 0  and  a_i >= 0   for i = 1, ..., N


This is a quadratic programming (QP) problem, and a global maximum of W(a) can always be found. The normal w can then be recovered as w = SUM_i a_i y_i x_i. In this problem, most a_i are normally zero, so w is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, like that in the construction of a k-NN classifier. The data points x_i with non-zero a_i are called support vectors (SV); the decision boundary is determined only by the SVs. Figure 3-30 shows the graphical interpretation of support vectors and their Lagrange multipliers.

Figure 3-30: The graphical interpretation of support vectors and their Lagrange multipliers.

Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write

   w = SUM_{j=1..s} a_{t_j} y_{t_j} x_{t_j}

It is simple to test whether a new datum z belongs to one of the two classes, say Class 1 or Class 2, by calculating

   f(z) = w.z + b = SUM_{j=1..s} a_{t_j} y_{t_j} (x_{t_j}.z) + b

If the output is positive, the new datum is classified as Class 1, otherwise as Class 2. Note that since the normal w can be expressed in terms of the support vectors, the testing can be done locally by computing the dot product between the new datum and the support vectors x_{t_j}, taking into consideration the class y_{t_j} and the Lagrange multiplier a_{t_j} of each support vector.

As mentioned previously, finding the global maximum of W(a) is a quadratic programming (QP) problem. Several solvers have been proposed, such as LOQO and CPLEX; some are provided online, for example at http://www.numerical.rl.ac.uk/qp/qp.html. Most of them employ a so-called interior-point approach, which starts with an initial solution that may violate the constraints and then tries to improve this solution by optimizing the objective function and/or reducing the amount of constraint violation. For SVMs, sequential minimal optimization (SMO), which repeatedly solves an analytically tractable QP sub-problem with only two variables, seems to be the most popular. The SMO method repeatedly selects a pair of multipliers a_i, a_j and solves the QP with these two variables, and the selection is repeated until convergence. In practice, we can treat the QP solver as a "black box" without worrying about how it works.
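As a sketch of treating the solver as a black box, the following uses scikit-learn's SVC, whose underlying libsvm implementation is based on an SMO-type solver; the toy data and the value of C are arbitrary illustrations and not part of the original text.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, two classes labelled -1 and +1
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=10.0)       # the QP/SMO solver is hidden inside fit()
clf.fit(X, y)

print(clf.support_vectors_)              # the support vectors x_{t_j}
print(clf.dual_coef_)                    # y_{t_j} * a_{t_j} for each support vector
print(clf.coef_, clf.intercept_)         # w (= sum of a_i y_i x_i) and b
print(clf.predict([[1.0, 2.0], [5.0, 5.0]]))   # classify new data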

Soft Margin

In several problems, it is not possible to find a linearly separating hyperplane. In those cases, it is possible to allow errors in the marginal area; this allowance is known as a "soft margin." The error of each datum is based on the output of the discriminant function w.x_i + b: slack variables s_i >= 0 are introduced for the samples that fall on the wrong side of their margin, in the area between the two classes. Figure 3-31 shows the graphical interpretation of the soft margin, where some errors (misclassifications) are allowed in the marginal area.

Figure 3-31: The graphical interpretation of soft margin.

The objective is to minimize (1/2)||w||^2 + C SUM_i s_i. Here, the s_i are determined by the constraints

   y_i (w.x_i + b) >= 1 - s_i,   s_i >= 0

The s_i are "slack variables" in the optimization. Note that s_i = 0 if there is no error for x_i, and SUM_i s_i is an upper bound on the number of errors.

Imitating the original formulation, a problem with a soft margin can be formulated as follows, where C is a tradeoff parameter between error and margin:

   Minimize    (1/2) ||w||^2 + C SUM_i s_i
   Subject to  y_i (w.x_i + b) >= 1 - s_i  and  s_i >= 0   for i = 1, ..., N

With the new constraints, the objective function of the dual problem to be maximized becomes:

   Maximize    W(a) = SUM_i a_i - (1/2) SUM_i SUM_j a_i a_j y_i y_j (x_i.x_j)
   Subject to  SUM_i a_i y_i = 0  and  0 <= a_i <= C   for i = 1, ..., N

As in the original problem, w is recovered as w = SUM_i a_i y_i x_i. This is very similar to the optimization problem in the linearly separable case, except that there is an upper bound C on the a_i. Given a tradeoff value C, we can use a quadratic programming (QP) solver to find the optimal a_i. This upper bound is best determined experimentally.

Non-linear Assumption – Linearly inseparable space

In the original problem of a finite-dimensional space, the space often cannot be linearly separated into two sub-spaces each of which includes only members of one class. As mentioned above, it is possible to use the concept of a soft margin to allow some members of a sub-space to be located in the opposite sub-space. However, in several cases the boundary between the two classes should not be assumed to be linear at all. Towards a solution, several works have proposed mapping the original finite-dimensional space into a much higher-dimensional space, presumably making the separation easier in that space. In other words, each point x in the original space is transformed to a point denoted by phi(x) in a higher-dimensional space. In the literature, the original space is called the input space while the target space is called the feature space. By this transformation, a linear operation in the feature space is equivalent to a non-linear operation in the input space, which makes classification possible in tasks where it was not before. For example, the well-known XOR problem can be solved by introducing a new feature x3 = x1 * x2, where x1 and x2 are the two dimensions of the XOR problem; the value of x3 is positive when x1 and x2 are both positive or both negative, and negative otherwise. Figure 3-32 shows the graphical interpretation of the transformation into a higher-dimensional space using a function phi. Here, both spaces are drawn in a two-dimensional representation; normally, however, the feature space has a higher dimension than the input space, and computation in the feature space can be costly because it is high dimensional (the feature space can even be infinite-dimensional). Each point in the input space is mapped to the feature space by the function phi. For this purpose, the kernel trick can be used, as follows.

Figure 3-32: The graphical interpretation of the transformation to a higher-dimensional space. Note that the feature space is of higher dimension than the input space.

In SVM schemes, a mapping into a larger space is used in such a way that inner products in that space can be computed easily in terms of the variables in the original space, so that the computational load remains reasonable. The inner products in the larger space are defined in terms of a kernel function K(x, y), which can be selected to suit the problem; the important issue is how to select a kernel function that is well suited to the problem at hand.

Before describing the usage of kernel functions in SVM, let us explain the semantics of a hyperplane. A hyperplane in a large space can be defined as the set of points whose inner product with a fixed vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters a_i, of images of feature vectors that occur in the database. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation

   SUM_i a_i K(x_i, x) = constant


Note that if K(x, y) becomes small as y grows farther from x, each term in the sum measures the degree of closeness of the test point x to the corresponding database point x_i. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note also that the set of points x mapped into any hyperplane can be quite convoluted, as a result allowing much more complex discrimination between sets which are far from convex in the original space. Let us recall the SVM optimization problem:

   Maximize    W(a) = SUM_i a_i - (1/2) SUM_i SUM_j a_i a_j y_i y_j (x_i.x_j)
   Subject to  SUM_i a_i y_i = 0  and  0 <= a_i <= C   for i = 1, ..., N

The term x_i.x_j indicates the inner product of each pair of data points, and these inner products are summed. As long as we can calculate the inner product in the feature space, it is not necessary to define the mapping explicitly. In general, the inner product can be expressed in terms of common geometric operations, such as angles or distances. At this point, we can define a kernel function K by K(x_i, x_j) = phi(x_i).phi(x_j). Applying this kernel function to the above formulation gives:

   Maximize    W(a) = SUM_i a_i - (1/2) SUM_i SUM_j a_i a_j y_i y_j K(x_i, x_j)
   Subject to  SUM_i a_i y_i = 0  and  0 <= a_i <= C   for i = 1, ..., N

After applying this kernel function, we find a maximal separating hyperplane. The procedure is similar to that described above, with a user-specified upper bound C on the Lagrange multipliers a_i. After that, we can use a quadratic programming (QP) solver to find the optimal a_i and the support vectors. The upper bound C is best determined experimentally.

To grasp the concept of a kernel function, consider the following example with two dimensions, where a point is x = (x1, x2). The mapping function phi is given by

   phi(x) = ( 1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2 )

Therefore, the inner product in the feature space can be written as

   phi(x).phi(y) = ( 1 + x1 y1 + x2 y2 )^2

Here, there is no need to know explicitly what the function phi is; we simply define the kernel function as

   K(x, y) = ( 1 + x1 y1 + x2 y2 )^2

Towards the practical usage of SVM, the user has to specify the kernel function but can leave the transformation function phi implicitly unknown. The application of a kernel function without knowing the function phi is known as the kernel trick.
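A quick numerical check of this equivalence (a minimal sketch; the two sample points are arbitrary): the explicit degree-2 feature map phi and the kernel K(x, y) = (1 + x.y)^2 yield the same inner product.

import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D point v = (v1, v2)."""
    v1, v2 = v
    return np.array([1.0,
                     np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1 ** 2, v2 ** 2,
                     np.sqrt(2) * v1 * v2])

def K(x, y):
    """Polynomial kernel of degree 2 evaluated in the original 2-D input space."""
    return (1.0 + np.dot(x, y)) ** 2

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])

print(np.dot(phi(x), phi(y)))   # inner product computed in the 6-D feature space
print(K(x, y))                  # the same value computed directly in the input space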

Given a kernel function K(x, y), the transformation function phi is given by its eigenfunctions (a concept in functional analysis), but eigenfunctions are difficult to construct explicitly. Therefore, in most cases we only specify the kernel function without worrying about the exact transformation. From another point of view, the kernel function, which is equivalent to an inner product, is a similarity measure between the data points (or objects). Many kernel functions have been used in practice, of which four common ones are the following:

   Linear kernel (no transformation):             K(x, y) = x.y
   Polynomial kernel (with degree k):             K(x, y) = (x.y + 1)^k
   Gaussian radial basis function (RBF) kernel:   K(x, y) = exp( -||x - y||^2 / (2 sigma^2) )
   Sigmoid kernel (with parameters kappa, theta): K(x, y) = tanh( kappa x.y + theta )

The linear kernel performs no transformation on the original dot product. The polynomial kernel is parameterized with a degree k to form different classifiers. The Gaussian radial basis function (RBF) kernel applies a Gaussian function to the distance between a pair of points. The sigmoid kernel gives a particular kind of two-layer sigmoidal neural network. Note that all the kernels above are symmetric, in the sense that K(x, y) = K(y, x).

When a new datum z is to be classified, the original linear SVM determines its class by letting t_j (j = 1, ..., s) be the indices of the s support vectors and calculating the weight w and the function f as follows:

   w = SUM_{j=1..s} a_{t_j} y_{t_j} x_{t_j},      f(z) = w.z + b = SUM_{j=1..s} a_{t_j} y_{t_j} (x_{t_j}.z) + b

If the output is positive, the new datum is classified as Class 1, otherwise as Class 2. With the kernel trick, the function is modified as follows:

   f(z) = SUM_{j=1..s} a_{t_j} y_{t_j} K(x_{t_j}, z) + b

Following the same procedure, when the output is positive the new datum is classified as Class 1; otherwise it is assigned to Class 2. Therefore, we calculate the kernel function between the testing datum z and the support vectors x_{t_j}.

Note that both the training and the testing of an SVM require only the value of K(x, y). This property means that there is no restriction on the form of x and y; x can be any data representation, including a sequence or a tree, instead of a feature vector. Semantically, K(x, y) is just a similarity measure between x and y. However, not every similarity measure can be used as a kernel function. Since the kernel function replaces the dot product of two mapped data points, it is assumed to be symmetric, K(x, y) = K(y, x). It is also required to satisfy the Cauchy-Schwarz inequality, that is, K(x, y)^2 <= K(x, x) K(y, y). However, these two properties are not sufficient to guarantee the existence of a feature space. The kernel function also needs to satisfy Mercer's condition, which states that a necessary and sufficient condition for a symmetric function K(x, y) to be a kernel is that it is positive semi-definite (PSD). This implies that the n-by-n kernel matrix, in which the (i, j)-th entry is K(x_i, x_j), is always positive semi-definite. This property also means that the quadratic programming (QP) problem is convex and can be solved in polynomial time. Mercer's condition thus guarantees the existence of a feature space. Given two arbitrary kernel functions K1 and K2, the following four closure properties can be used to produce new kernel functions [14][16]:

   1. K(x, y) = a K1(x, y) + b K2(x, y),  where a and b are positive scalar values
   2. K(x, y) = a K1(x, y) K2(x, y),      where a is a positive scalar value
   3. K(x, y) = exp( K1(x, y) ),          where exp(.) is the exponential function
   4. K(x, y) = x^T A y,                  where A is a positive semi-definite matrix

An example of the kernel trick

Suppose that five data points are given in a one-dimensional space as follows.

X Y

1 1

2 1

4 -1

5 -1

6 1

Here, assume the polynomial kernel of degree 2, i.e., K(x, y) = (xy + 1)^2, and set the tradeoff value C to 100. First we find a_i (i = 1, 2, ..., 5) by solving:

   Maximize    W(a) = SUM_i a_i - (1/2) SUM_i SUM_j a_i a_j y_i y_j K(x_i, x_j)
   Subject to  SUM_i a_i y_i = 0  and  0 <= a_i <= 100   for i = 1, ..., 5

By using a QP solver, we can find the solution a = (0, 2.5, 0, 7.333, 4.833). Note that the constraints, i.e., 0 <= a_i <= C and SUM_i a_i y_i = 0, are satisfied. The support vectors are {x2 = 2, x4 = 5, x5 = 6}. The discriminant function is then defined as:

   f(z) = 2.5 (+1)(2z + 1)^2 + 7.333 (-1)(5z + 1)^2 + 4.833 (+1)(6z + 1)^2 + b
        = 0.6667 z^2 - 5.333 z + b


The value of b can be recovered by using the boundary condition f(2) = +1, f(6) = +1, or f(5) = -1, since x2 = 2 and x5 = 6 lie on the line w.phi(x) + b = +1 and x4 = 5 lies on the line w.phi(x) + b = -1. With any of these three boundary conditions, b equals 9. Therefore, the final discriminant function is

   f(z) = 0.6667 z^2 - 5.333 z + 9

Figure 3-33 shows the graphical interpretation of the discriminant function curve, which is a parabola. Points located in the upper part of this curve belong to Class 2, whereas those in the lower part belong to Class 1. The function therefore splits the line into three portions, corresponding to Class 1, Class 2 and Class 1, respectively.

Figure 3-33: The graphical interpretation of the discriminant function (parabolic curve).
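One way to check this worked example is the following sketch with scikit-learn rather than a hand-run QP solver; in scikit-learn's parameterization, kernel='poly' with degree=2, gamma=1 and coef0=1 corresponds to the kernel K(x, y) = (xy + 1)^2 used above (the grid of test points is an arbitrary choice for illustration).

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])   # the five 1-D data points
y = np.array([1, 1, -1, -1, 1])                     # their classes

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
clf.fit(X, y)

print(clf.support_vectors_.ravel())    # expected to be a subset of {2, 5, 6}, as in the text
print(np.abs(clf.dual_coef_))          # the alpha values of the support vectors

# The decision function plays the role of f(z); its sign splits the line into
# three portions (positive, negative, positive).
grid = np.linspace(0, 7, 15).reshape(-1, 1)
print(np.sign(clf.decision_function(grid)))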

Discussion on high-dimensional space and VC-dimension

The kernel trick implies mapping the original space to a higher-dimensional space, and in several cases the feature space becomes very high dimensional. This transformation may seem to trigger the issue called the curse of dimensionality: a classifier in a high-dimensional space has many parameters, and it is hard to estimate them. Vapnik (1979) argued, however, that the fundamental problem is not the number of parameters to be estimated but rather the flexibility of the classifier. Typically, a classifier with many parameters is very flexible, but there are exceptions; even a classifier with only one parameter can be extremely flexible. For example, the one-parameter classifier f(x) = sign(sin(ax)) can classify any set of points x_1, ..., x_n correctly for every possible combination of class labels, provided the frequency a is chosen appropriately. More details can be found in (Vapnik, 1979; Vapnik, 1995; Vapnik, 1998; Law, 2005).

Vapnik argues that the flexibility of a classifier should therefore not be characterized by its number of parameters, but by its capacity. This property is formalized by the so-called "VC-dimension" of a classifier. For example, consider a linear classifier in a two-dimensional space with two classes (circles and rectangles) and three data points, as shown in Figure 3-34. Although only a few specific cases are drawn, in general, if we have three training data points, no matter how those points are labeled, we can classify them perfectly.

Figure 3-34: Three data points can be perfectly classified.


However, when we add one more point to the space (i.e., four points), there can be situations where a linear classifier cannot perfectly separate the points into two classes with a single line, as shown in Figure 3-35.

Figure 3-35: Four data points may not be perfectly classified.

We can observe that three (3) is the critical number. The VC-dimension of a linear classifier in a 2-D space is three because, if we have three points (in general position) in the training set, perfect classification is always possible irrespective of the labeling, whereas for four points, perfect classification can be impossible.

Let us consider the VC-dimension of other classification methods. For example, the VC-dimension of the nearest-neighbor classifier is said to be infinite, since no matter how many points there are, perfect classification of the training data is obtained when k is set to 1 (as long as any two identical data points are assigned the same class). In general, the higher the VC-dimension, the more flexible the classifier. However, the VC-dimension is a theoretical concept, and in practice the VC-dimension of most classifiers is difficult to compute exactly. Conceptually, we can expect that a flexible classifier probably has a high VC-dimension.

The steps of SVM classification can be summarized as follows (a code sketch following these steps is given after the list).

(1) Prepare the training data in the form of a pattern matrix: training data {(x_i, y_i)}, i = 1, ..., N, where x_i is a datum represented by a vector of dimension d and y_i is a binary class of -1 (Class 1) or +1 (Class 2).

(2) Select the kernel function K that will be applied for classification.

(3) Select the values of the parameters in the kernel function and the value of C. This setting can be done manually, or a validation set can be used to determine the parameter values.

(4) Execute the training algorithm (a QP solver) to obtain the a_i and the support vectors by solving:

   Maximize    W(a) = SUM_i a_i - (1/2) SUM_i SUM_j a_i a_j y_i y_j K(x_i, x_j)
   Subject to  SUM_i a_i y_i = 0  and  0 <= a_i <= C   for i = 1, ..., N

(5) Classify unseen data z by using the acquired a_i and support vectors x_{t_j} in the following equation; the sign of f determines the class, i.e., negative for Class 1 and positive for Class 2:

   f(z) = SUM_{j=1..s} a_{t_j} y_{t_j} K(x_{t_j}, z) + b
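The five steps might look as follows in code (a sketch with scikit-learn; the synthetic dataset, the choice of the RBF kernel and the parameter grid are illustrative assumptions, and step (3) is carried out here with cross-validation rather than a fixed validation set).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# (1) Prepare the training data as a pattern matrix with labels in {-1, +1}
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
y = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (2) Select the kernel function; here the Gaussian RBF kernel
# (3) Select kernel parameters and C using a validation procedure
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)

# (4) Execute the training algorithm (the QP solver is hidden inside fit)
search.fit(X_train, y_train)
print(search.best_params_)

# (5) Classify unseen data with the acquired alphas and support vectors
print(search.best_estimator_.predict(X_test[:5]))
print(search.score(X_test, y_test))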


Strengths and Weaknesses of Support Vector Machines

The training process for SVMs is relatively straightforward, since there are no local optima, unlike in neural networks. The method implicitly forms a feature space from the input space by replacing the original data with higher-dimensional data, and the tradeoff between classifier complexity and error can be controlled explicitly; a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space. The input is also flexible, in the sense that non-traditional data such as strings can be used as input to an SVM in place of feature vectors. However, to obtain high accuracy, a "good" kernel function has to be selected by trial and error; there is still no method to automatically find good kernels.

3.2. Numerical Prediction

Numeric prediction is the task of predicting continuous (or ordered) values for a given input; for example, predicting the potential future price of gold given the current economic situation, or predicting the level of a river given the weather report. For this task, most of the methods mentioned previously are not suitable, since they involve prediction of a nominal attribute, not a numeric one. In general, prediction of numerical values is more complicated than prediction of categorical values, since the output is more sensitive to small changes in the model. In this section, we first explain linear and non-linear regression models. After that, we explain two extensions of regression to decision trees for numerical prediction. Besides regression, some classification techniques, e.g., backpropagation, support vector machines, and k-nearest-neighbor classifiers, can be adapted for numeric prediction. Conversely, regression can be adapted for classification, as shown in Section 3.3.

3.2.1. Regression

In cases where the outcome (class) is numeric and all the attributes are numeric, it is possible to apply linear regression for prediction. This is a very common method in statistics, which expresses the class as a linear combination of the attributes, with weights determined from the training data. Regression analysis is a good choice when all of the predictor variables are continuous-valued as well. Many problems can be solved by linear regression, and even more can be tackled by applying transformations to the variables so that a nonlinear problem is converted to a linear one. Due to limited space, this section does not give a full-length description of regression but instead provides an intuitive introduction. Several software packages exist to solve regression problems, including SAS (www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com). Regression is a statistical methodology developed by Sir Francis Galton (1822–1911), a mathematician who was also a cousin of Charles Darwin. It can be used to model the relationship between one or more independent or predictor variables (features or known attributes) and a continuous-valued dependent or response variable (the target attribute).

Linear regression

A single-independent-variable regression involves a response variable, y, and a single independent (predictor) variable, x. In this simplest form of regression, the response variable y is modelled as a linear function of x:

   y = w0 + w1 x

where the variance of the error is assumed to be constant, and w0 and w1 are regression coefficients (or weights) specifying the Y-intercept and the slope of the line, respectively. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.

Let S be a training set composed of values of the predictor variable, x, for some population and their associated values of the response variable, y. The training set S contains |S| data points of the form (x_1, y_1), (x_2, y_2), ..., (x_|S|, y_|S|). The regression coefficients can be estimated as

   w1 = SUM_i (x_i - mean(x)) (y_i - mean(y)) / SUM_i (x_i - mean(x))^2,      w0 = mean(y) - w1 mean(x)

where mean(x) is the mean value of x_1, ..., x_|S| and mean(y) is the mean value of y_1, ..., y_|S|. These coefficients provide good approximations that minimize the error between the actual data and the estimated line. An example is shown in Figure 3-36.

X (years) Y (weight)

3 10

5 24

4 20

8 29

14 36

6 20

20 45

12 24

Figure 3-36: An example of a single-independent-variable regression (left: the 2-D data; right: the scatter plot and its regression equation).

The 2-D data can be graphed on a scatter plot, and the plot suggests a linear relationship between the two variables x and y. Given the data in the table, we can compute the averages of x and y: mean(x) = 9 and mean(y) = 26. Substituting these averages into the above equations, we obtain the coefficient values

   w1 = 402 / 242 = 1.66,      w0 = 26 - 1.66 * 9 = 11.05

With these values, the equation of the least squares line is y = 1.66x + 11.05. By this equation, we can predict the value of y given a value of x; for example, y = 40.95 when x = 18.
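A minimal sketch of this calculation in code (it simply evaluates the two estimator formulas on the data from the table above):

import numpy as np

x = np.array([3, 5, 4, 8, 14, 6, 20, 12], dtype=float)
y = np.array([10, 24, 20, 29, 36, 20, 45, 24], dtype=float)

x_bar, y_bar = x.mean(), y.mean()                                # 9.0 and 26.0
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

print(w1, w0)          # approximately 1.66 and 11.05
print(w0 + w1 * 18)    # prediction for x = 18, approximately 40.95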

It is possible to extend linear regression from a single independent variable to multiple independent variables; this is called multiple linear regression (MLR). Multiple linear regression involves more than one predictor variable and allows the response variable y to be modeled as a linear function of n predictor variables or attributes, A_1, ..., A_n. A data tuple X can be described by the n variables x_1, x_2, ..., x_n with an associated response variable y. A multiple linear regression model can be expressed in the following form:

   y = w0 + w1 x1 + w2 x2 + ... + wn xn



where y is the class, x_1, ..., x_n are the attributes, and w0, w1, ..., wn are weights. We are given training data S where each datum (the i-th instance) is represented in the form (X^(i), y^(i)), where X^(i) = (x_1^(i), x_2^(i), ..., x_n^(i)) is an n-dimensional training tuple with its associated class value y^(i). Here, the superscript (i) denotes the index of the instance; for example, X^(1) expresses the first instance. Moreover, to simplify the notation, we can assume an extra variable (attribute) x_0 whose value is always one, i.e., x_0^(i) = 1 for every instance. Given the weight values w0, w1, ..., wn, the predicted class value of the i-th instance, denoted by yhat^(i), can be written as

   yhat^(i) = SUM_{j=0..n} wj xj^(i)

In a real situation, this predicted class value differs from the actual class value. Regression tries to find the values of the weights w0, ..., wn that minimize the total difference (the sum of squares) between the predicted and the actual class values over all data in the training set. The difference between the predicted and the actual class value of the i-th instance is

   y^(i) - SUM_{j=0..n} wj xj^(i)

and the sum of the squares of these differences over the training data S is

   E(w) = SUM_{i=1..|S|} ( y^(i) - SUM_{j=0..n} wj xj^(i) )^2

Here, the expression inside the parentheses is the difference between the i-th instance's actual class and its predicted class. This sum of squares is what we have to minimize by choosing the coefficients appropriately:

   w* = argmin_w SUM_{i=1..|S|} ( y^(i) - SUM_{j=0..n} wj xj^(i) )^2

To solve this, it is possible to derive the coefficients using standard matrix operations. One observation is that the coefficients can be calculated only if the number of instances is not smaller than the number of attributes and the instances are linearly independent; if there are fewer instances, there will be more than one solution for the coefficients. That is, we need enough examples relative to the number of attributes in order to select the weights that minimize the sum of squared differences. The solution is related to the matrix inversion operation and can easily be computed with standard software packages. Here, we provide a formal description of the solution. The representation can be translated into matrix form as follows (note that, as stated above, x_0^(i) = 1):

   E(w) = (y - Xw)^T (y - Xw)

where X is the |S| x (n+1) matrix whose i-th row is (x_0^(i), x_1^(i), ..., x_n^(i)), y is the column vector of actual class values y^(1), ..., y^(|S|), and w is the column vector of weights (w0, w1, ..., wn)^T.


To find the weight values, we perform partial differentiation of the above expression with respect to the weights:

   dE/dw = -2 X^T (y - Xw) = 0

The differentiation result gives the normal equations

   X^T X w = X^T y

and the weights are therefore obtained by the matrix operations

   w = (X^T X)^{-1} X^T y

The following shows an example of regression where the play-tennis data set is used.

Outlook Temp. Humidity Windy Play

90 40 80 10 5

95 32 85 80 10

50 35 90 20 80

10 24 80 5 95

15 10 50 15 85

20 12 55 90 15

55 9 45 95 80

85 22 95 25 10

95 7 50 5 100

5 26 45 10 85

80 25 40 80 95

45 24 85 85 90

40 37 60 15 75

25 23 90 95 5

By decimal scaling with a factor of 100 (dividing each value by 100), we obtain the following data.

Outlook Temp. Humidity Windy Play

0.90 0.40 0.80 0.10 0.05

0.95 0.32 0.85 0.80 0.10

0.50 0.35 0.90 0.20 0.80

0.10 0.24 0.80 0.05 0.95

0.15 0.10 0.50 0.15 0.85

0.20 0.12 0.55 0.90 0.15

0.55 0.09 0.45 0.95 0.80

0.85 0.22 0.95 0.25 0.10

0.95 0.07 0.50 0.05 1.00

0.05 0.26 0.45 0.10 0.85

0.80 0.25 0.40 0.80 0.95

0.45 0.24 0.85 0.85 0.90

0.40 0.37 0.60 0.15 0.75

0.25 0.23 0.90 0.95 0.05

To obtain the solution, we apply the formulation w = (X^T X)^{-1} X^T y. Here, the matrices X, X^T X, (X^T X)^{-1} and X^T y are computed from the normalized table above (with a leading column of ones for x_0), and the weight vector w follows from the matrix product. Finally, the resultant regression equation is obtained in the form

   y = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4

where y = play, x1 = outlook, x2 = temperature, x3 = humidity, and x4 = windy.
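The same computation can be sketched in NumPy using the normalized table above; the code solves the normal equations X^T X w = X^T y directly, with the column order following the table.

import numpy as np

data = np.array([
    [0.90, 0.40, 0.80, 0.10, 0.05],
    [0.95, 0.32, 0.85, 0.80, 0.10],
    [0.50, 0.35, 0.90, 0.20, 0.80],
    [0.10, 0.24, 0.80, 0.05, 0.95],
    [0.15, 0.10, 0.50, 0.15, 0.85],
    [0.20, 0.12, 0.55, 0.90, 0.15],
    [0.55, 0.09, 0.45, 0.95, 0.80],
    [0.85, 0.22, 0.95, 0.25, 0.10],
    [0.95, 0.07, 0.50, 0.05, 1.00],
    [0.05, 0.26, 0.45, 0.10, 0.85],
    [0.80, 0.25, 0.40, 0.80, 0.95],
    [0.45, 0.24, 0.85, 0.85, 0.90],
    [0.40, 0.37, 0.60, 0.15, 0.75],
    [0.25, 0.23, 0.90, 0.95, 0.05],
])

X = np.hstack([np.ones((len(data), 1)), data[:, :4]])   # prepend x0 = 1 (bias column)
y = data[:, 4]                                          # play

w = np.linalg.solve(X.T @ X, X.T @ y)                   # normal equations
print(w)   # w0, w1 (outlook), w2 (temperature), w3 (humidity), w4 (windy)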

Non-linear regression

In several cases, we would like to model data that does not show a linear dependence; for example, a given response variable and predictor variable may have a relationship described by a polynomial function, e.g., a parabola or some other higher-order polynomial. Polynomial regression is often of interest when there is just one predictor variable. It can be modeled by adding polynomial terms to the basic linear model, and by applying transformations to the variables we can convert the nonlinear model into a linear one that can then be solved by the method of least squares. For example, consider a cubic polynomial relationship between a single predictor variable x and the response variable y:

   y = w0 + w1 x + w2 x^2 + w3 x^3

This equation can be converted to linear form by defining the new variables

   x1 = x,   x2 = x^2,   x3 = x^3

so that the non-linear equation becomes the linear equation

   y = w0 + w1 x1 + w2 x2 + w3 x3

which can be solved by the method of least squares using software for regression analysis. We can observe that polynomial regression is a special case of multiple linear regression: the addition of high-order terms such as x^2 and x^3, which are simple functions of the single variable x, is equivalent to adding new independent variables.
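A sketch of this conversion in code (the synthetic cubic data and noise level are arbitrary illustrations): the new columns x, x^2, x^3 are treated as independent variables and fitted by ordinary least squares.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 2 - x + 0.5 * x**2 + 0.2 * x**3 + rng.normal(scale=0.3, size=x.size)

# Define new variables x1 = x, x2 = x^2, x3 = x^3 and add the constant column
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Solve the resulting *linear* regression by least squares
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # estimates of w0, w1, w2, w3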


3.2.2. Tree for prediction: Regression Tree and Model Tree

It is possible to apply decision trees or rules that are designed for predicting categories to

estimate numeric quantities. For example, the weather data can be used to construct a regression

tree and a model tree from the following data.

Outlook Temp. Humidity Windy Play

90 40 80 10 5

95 32 85 80 10

50 35 90 20 80

10 24 80 5 95

15 10 50 15 85

20 12 55 90 15

55 9 45 95 80

85 22 95 25 10

95 7 50 5 100

5 26 45 10 85

80 25 40 80 95

45 24 85 85 90

40 37 60 15 75

25 23 90 95 20

In a regression tree, the leaf nodes contain a numeric value that is the average of all the training-set values to which the leaf applies. Such trees are called regression trees because statisticians use the term regression for the process of computing an expression that predicts a numeric quantity; decision trees with averaged numeric values at the leaves are therefore regression trees. Each leaf node of the regression tree represents the average outcome, the number of instances, and the standard deviation of the instances that reach the leaf. The tree is much larger and more complex than a regression equation, but the average of the absolute errors between the predicted and the actual values usually turns out to be significantly lower than that of the regression equation. The regression tree is more accurate because a simple linear model represents the data in this problem poorly; however, the tree is cumbersome and difficult to interpret because of its large size. From the above data, we can obtain the regression tree as follows.

Here the data for each leaf node are as follows.

[The first leaf node] (Average Play = 8.33, Number = 3, Std. dev. = 2.89)
   Outlook  Temp.  Humidity  Windy  Play
   90       40     80        10     5
   95       32     85        80     10
   85       22     95        25     10

[The second leaf node] (Average Play = 97.5, Number = 2, Std. dev. = 3.54)
   Outlook  Temp.  Humidity  Windy  Play
   95       7      50        5      100
   80       25     40        80     95

[The third leaf node] (Average Play = 81.25, Number = 4, Std. dev. = 6.29)
   Outlook  Temp.  Humidity  Windy  Play
   50       35     90        20     80
   55       9      45        95     80
   45       24     85        85     90
   40       37     60        15     75

[The fourth leaf node] (Average Play = 88.33, Number = 3, Std. dev. = 5.77)
   Outlook  Temp.  Humidity  Windy  Play
   10       24     80        5      95
   15       10     50        15     85
   5        26     45        10     85

[The fifth leaf node] (Average Play = 17.50, Number = 2, Std. dev. = 3.54)
   Outlook  Temp.  Humidity  Windy  Play
   20       12     55        90     15
   25       23     90        95     20

It is also possible to combine regression equations with regression trees. A tree whose leaves contain linear expressions (that is, regression equations) rather than single predicted values is called a model tree. The following is a model tree with equations at its leaf nodes. The model tree contains the five linear models that belong at the five leaves, labeled LM1 to LM5; however, since there are not enough instances in this data set, a reliable linear regression equation cannot be generated for each node. The model tree approximates continuous functions by linear "patches," a more sophisticated representation than either linear regression or regression trees. The model tree is smaller and more comprehensible than the regression tree, and its average error values on the training data are lower.
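A regression tree for the weather data above can be grown, for instance, with scikit-learn's DecisionTreeRegressor. This is only a sketch: the depth limit is an arbitrary choice to keep the tree small, and the learned splits need not coincide exactly with the tree discussed in the text.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Columns: outlook, temperature, humidity, windy; target: play
X = np.array([
    [90, 40, 80, 10], [95, 32, 85, 80], [50, 35, 90, 20], [10, 24, 80, 5],
    [15, 10, 50, 15], [20, 12, 55, 90], [55, 9, 45, 95], [85, 22, 95, 25],
    [95, 7, 50, 5], [5, 26, 45, 10], [80, 25, 40, 80], [45, 24, 85, 85],
    [40, 37, 60, 15], [25, 23, 90, 95],
], dtype=float)
y = np.array([5, 10, 80, 95, 85, 15, 80, 10, 100, 85, 95, 90, 75, 20], dtype=float)

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

# Each leaf predicts the average "play" value of the training tuples reaching it
print(export_text(tree, feature_names=["outlook", "temp", "humidity", "windy"]))
print(tree.predict([[50, 20, 70, 10]]))   # prediction for a new day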


3.3. Regression as Classification

One interesting application of regression is to use it for classification. Linear regression can easily be used for classification in domains with numeric attributes; indeed, we can use any regression technique, whether linear or nonlinear, for classification. There are two possible methods for using regression for classification, described in the following two subsections.

3.3.1. One-Against-the-Other Regression

To apply regression for classification with the one-against-the-other approach, we perform a regression for each class, setting the output equal to one for training instances that belong to the class and zero for those that do not. The result is a linear expression for that class, and we perform the same procedure for the other classes. In the testing procedure, given a test example of unknown class, we calculate the value of each linear expression and choose the one with the largest outcome. This method is sometimes called multiresponse linear regression. Since there is one expression for each class, the number of expressions equals the number of classes; therefore, if there are n classes, there will be n regression expressions. The one-against-the-other regression is explained with the following training dataset.

Temp. Humidity Windy Class

40 80 false C1

32 85 true C2

35 90 false C2

24 80 false C4

10 50 false C2

12 55 true C3

9 45 true C1

22 95 false C2

7 50 false C4

26 45 false C3

25 40 true C1

24 85 true C2

37 60 false C1

23 90 true C3

In this process, the data are normalized, and the class in focus is set to 1 while the others are set to 0. First, the model for Class 1 is learned from the following data.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1 0.40 0.80 0.0 1

32 85 true C2 0.32 0.85 1.0 0

35 90 false C2 0.35 0.90 0.0 0

24 80 false C4 0.24 0.80 0.0 0

10 50 false C2 0.10 0.50 0.0 0

12 55 true C3 0.12 0.55 1.0 0

9 45 true C1 0.09 0.45 1.0 1

22 95 false C2 0.22 0.95 0.0 0

7 50 false C4 0.07 0.50 0.0 0

26 45 false C3 0.26 0.45 0.0 0

25 40 true C1 0.25 0.40 1.0 1

24 85 true C2 0.24 0.85 1.0 0

37 60 false C1 0.37 0.60 0.0 1

23 90 true C3 0.23 0.90 1.0 0


Second, the model learned for Class 2 is as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1 0.40 0.80 0.0 0

32 85 true C2 0.32 0.85 1.0 1

35 90 false C2 0.35 0.90 0.0 1

24 80 false C4 0.24 0.80 0.0 0

10 50 false C2 0.10 0.50 0.0 1

12 55 true C3 0.12 0.55 1.0 0

9 45 true C1 0.09 0.45 1.0 0

22 95 false C2 0.22 0.95 0.0 1

7 50 false C4 0.07 0.50 0.0 0

26 45 false C3 0.26 0.45 0.0 0

25 40 true C1 0.25 0.40 1.0 0

24 85 true C2 0.24 0.85 1.0 1

37 60 false C1 0.37 0.60 0.0 0

23 90 true C3 0.23 0.90 1.0 0

Third, the model learned for Class 3 is as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1 0.40 0.80 0.0 0

32 85 true C2 0.32 0.85 1.0 0

35 90 false C2 0.35 0.90 0.0 0

24 80 false C4 0.24 0.80 0.0 0

10 50 false C2 0.10 0.50 0.0 0

12 55 true C3 0.12 0.55 1.0 1

9 45 true C1 0.09 0.45 1.0 0

22 95 false C2 0.22 0.95 0.0 0

7 50 false C4 0.07 0.50 0.0 0

26 45 false C3 0.26 0.45 0.0 1

25 40 true C1 0.25 0.40 1.0 0

24 85 true C2 0.24 0.85 1.0 0

37 60 false C1 0.37 0.60 0.0 0

23 90 true C3 0.23 0.90 1.0 1

Fourth, the model learned for Class 4 is as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1 0.40 0.80 0.0 0

32 85 true C2 0.32 0.85 1.0 0

35 90 false C2 0.35 0.90 0.0 0

24 80 false C4 0.24 0.80 0.0 1

10 50 false C2 0.10 0.50 0.0 0

12 55 true C3 0.12 0.55 1.0 0

9 45 true C1 0.09 0.45 1.0 0

22 95 false C2 0.22 0.95 0.0 0

7 50 false C4 0.07 0.50 0.0 1

26 45 false C3 0.26 0.45 0.0 0

25 40 true C1 0.25 0.40 1.0 0

24 85 true C2 0.24 0.85 1.0 0

37 60 false C1 0.37 0.60 0.0 0

23 90 true C3 0.23 0.90 1.0 0


The regression expressions learned for the four classes (Class 1, Class 2, Class 3 and Class 4) can then be summarized, one per class. Now suppose that we have the following test object. We can predict the class of this test object by substituting the attribute values into each of the four expressions.

Temp. Humid. Windy Class
50    75     false ?

Evaluating the four expressions on this object, Class 2 receives the highest score, so the test object is assigned to Class 2.

While multiresponse linear regression yields good results in practice, it has two drawbacks. First, the membership values it produces are not proper probabilities, because they can fall outside the range 0 to 1. Second, least-squares regression assumes that the errors are not only statistically independent but also normally distributed with the same standard deviation, an assumption that is violated when the target takes only the values 0 and 1. To address these issues, instead of approximating the 0 and 1 values directly, logistic regression can be used to build a linear model based on a transformed target variable.
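A sketch of the one-against-the-other procedure above, using the normalized data from the tables: each class gets its own least-squares model with 0/1 targets, and the test object is assigned to the class whose expression gives the largest output. (The code re-derives the expressions rather than reusing the numbers in the text.)

import numpy as np

# Normalized attributes: temperature, humidity, windy (false = 0, true = 1)
X = np.array([
    [0.40, 0.80, 0], [0.32, 0.85, 1], [0.35, 0.90, 0], [0.24, 0.80, 0],
    [0.10, 0.50, 0], [0.12, 0.55, 1], [0.09, 0.45, 1], [0.22, 0.95, 0],
    [0.07, 0.50, 0], [0.26, 0.45, 0], [0.25, 0.40, 1], [0.24, 0.85, 1],
    [0.37, 0.60, 0], [0.23, 0.90, 1],
], dtype=float)
labels = np.array(["C1", "C2", "C2", "C4", "C2", "C3", "C1",
                   "C2", "C4", "C3", "C1", "C2", "C1", "C3"])
classes = ["C1", "C2", "C3", "C4"]

Xb = np.hstack([np.ones((len(X), 1)), X])           # add the constant term

# One regression per class: target 1 for members of the class, 0 otherwise
W = []
for c in classes:
    t = (labels == c).astype(float)
    w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
    W.append(w)
W = np.array(W)                                     # one weight vector per class

test = np.array([1.0, 0.50, 0.75, 0.0])             # bias, temp=50, humid=75, windy=false
scores = W @ test
print(dict(zip(classes, scores.round(3))))
print("predicted:", classes[int(np.argmax(scores))])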

3.3.2. Pairwise Regression

As an alternative to multiresponse linear regression, pairwise regression can be used for classification. In this method, a regression expression is found for every pair of classes, using only the instances from those two classes; during the regression analysis, one class is assigned the value +1 while the other is assigned -1. For k classes, there will be k(k-1)/2 expressions. This may seem computationally intensive, but in fact it is at least as fast as any other multiclass method. The reason is that each pairwise regression expression involves only the instances belonging to the two classes under consideration. Suppose that n instances are divided evenly among k classes; then there are 2n/k instances available for learning each expression. If the learning algorithm for a two-class problem with n instances takes time proportional to n seconds to execute, the run time for pairwise classification is proportional to (k(k-1)/2) * (2n/k) = (k-1)n seconds. In other words, the method scales linearly with both the number of classes and the number of instances. In the testing step, an unknown test example is assigned to the class that receives the most votes over the pairwise expressions. This method generally yields accurate results in terms of classification error, compared to the one-against-the-other method. Assume the same data set as for the one-against-the-other regression above.


The model for Class 1 vs. Class 2 can be learned as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1 0.40 0.80 0.0 +1

32 85 true C2 0.32 0.85 1.0 -1

35 90 false C2 0.35 0.90 0.0 -1

24 80 false C4

10 50 false C2 0.10 0.50 0.0 -1

12 55 true C3

9 45 true C1 0.09 0.45 1.0 +1

22 95 false C2 0.22 0.95 0.0 -1

7 50 false C4

26 45 false C3

25 40 true C1 0.25 0.40 1.0 +1

24 85 true C2 0.24 0.85 1.0 -1

37 60 false C1 0.37 0.60 0.0 +1

23 90 true C3

The model for Class 1 vs. Class 3 can be learned as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1 0.40 0.80 0.0 +1

32 85 true C2

35 90 false C2

24 80 false C4

10 50 false C2

12 55 true C3 0.12 0.55 1.0 -1

9 45 true C1 0.09 0.45 1.0 +1

22 95 false C2

7 50 false C4

26 45 false C3 0.26 0.45 0.0 -1

25 40 true C1 0.25 0.40 1.0 +1

24 85 true C2

37 60 false C1 0.37 0.60 0.0 +1

23 90 true C3 0.23 0.90 1.0 -1

The model for Class 1 vs. Class 4 can be learned as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1 0.40 0.80 0.0 +1

32 85 true C2

35 90 false C2

24 80 false C4 0.24 0.80 0.0 -1

10 50 false C2

12 55 true C3

9 45 true C1 0.09 0.45 1.0 +1

22 95 false C2

7 50 false C4 0.07 0.50 0.0 -1

26 45 false C3

25 40 true C1 0.25 0.40 1.0 +1

24 85 true C2

37 60 false C1 0.37 0.60 0.0 +1

23 90 true C3


The model for Class 2 vs. Class 3 can be learned as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1

32 85 true C2 0.32 0.85 1.0 +1

35 90 false C2 0.35 0.90 0.0 +1

24 80 false C4

10 50 false C2 0.10 0.50 0.0 +1

12 55 true C3 0.12 0.55 1.0 -1

9 45 true C1

22 95 false C2 0.22 0.95 0.0 +1

7 50 false C4

26 45 false C3 0.26 0.45 0.0 -1

25 40 true C1

24 85 true C2 0.24 0.85 1.0 +1

37 60 false C1

23 90 true C3 0.23 0.90 1.0 -1

The model for Class 2 vs. Class 4 can be learned as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1

32 85 true C2 0.32 0.85 1.0 +1

35 90 false C2 0.35 0.90 0.0 +1

24 80 false C4 0.24 0.80 0.0 -1

10 50 false C2 0.10 0.50 0.0 +1

12 55 true C3

9 45 true C1

22 95 false C2 0.22 0.95 0.0 +1

7 50 false C4 0.07 0.50 0.0 -1

26 45 false C3

25 40 true C1

24 85 true C2 0.24 0.85 1.0 +1

37 60 false C1

23 90 true C3

The model for Class 3 vs. Class 4 can be learned as follows.

Temp. Humid. Windy Class Temp. Humid. Windy Class

40 80 false C1

32 85 true C2

35 90 false C2

24 80 false C4 0.24 0.80 0.0 -1

10 50 false C2

12 55 true C3 0.12 0.55 1.0 +1

9 45 true C1

22 95 false C2

7 50 false C4 0.07 0.50 0.0 -1

26 45 false C3 0.26 0.45 0.0 +1

25 40 true C1

24 85 true C2

37 60 false C1

23 90 true C3 0.23 0.90 1.0 +1


The six pairwise regression expressions (Class 1 vs. Class 2, Class 1 vs. Class 3, Class 1 vs. Class 4, Class 2 vs. Class 3, Class 2 vs. Class 4, and Class 3 vs. Class 4) can then be summarized, one per pair. Now suppose that we have the following test object. We can predict its class by substituting the attribute values into each pairwise expression.

Temp. Humid. Windy Class
50    75     false ?

Substituting the values, the three expressions involving Class 1 are all positive (for example, Class 1 vs. Class 4 gives 1.37), while Class 2 vs. Class 3 gives 0.35 and Class 3 vs. Class 4 gives 1.97. Since Class 1 wins all of its pairwise comparisons (positive values for the first three regressions), the test datum is assigned to Class 1.
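A corresponding sketch of pairwise regression on the same data: one least-squares model per pair of classes (targets +1 and -1, using only the instances of those two classes), with the final class chosen by majority vote. As before, the code re-derives the expressions rather than reproducing the exact numbers in the text.

import numpy as np
from itertools import combinations

# Normalized attributes: temperature, humidity, windy (false = 0, true = 1)
X = np.array([
    [0.40, 0.80, 0], [0.32, 0.85, 1], [0.35, 0.90, 0], [0.24, 0.80, 0],
    [0.10, 0.50, 0], [0.12, 0.55, 1], [0.09, 0.45, 1], [0.22, 0.95, 0],
    [0.07, 0.50, 0], [0.26, 0.45, 0], [0.25, 0.40, 1], [0.24, 0.85, 1],
    [0.37, 0.60, 0], [0.23, 0.90, 1],
], dtype=float)
labels = np.array(["C1", "C2", "C2", "C4", "C2", "C3", "C1",
                   "C2", "C4", "C3", "C1", "C2", "C1", "C3"])
classes = ["C1", "C2", "C3", "C4"]
Xb = np.hstack([np.ones((len(X), 1)), X])           # add the constant term

# One regression per pair of classes, trained only on instances of that pair
models = {}
for a, b in combinations(classes, 2):               # k(k-1)/2 = 6 pairs
    mask = (labels == a) | (labels == b)
    t = np.where(labels[mask] == a, 1.0, -1.0)      # +1 for class a, -1 for class b
    w, *_ = np.linalg.lstsq(Xb[mask], t, rcond=None)
    models[(a, b)] = w

# Classify a test object by majority vote over the six pairwise expressions
test = np.array([1.0, 0.50, 0.75, 0.0])             # bias, temp=50, humid=75, windy=false
votes = {c: 0 for c in classes}
for (a, b), w in models.items():
    votes[a if w @ test > 0 else b] += 1            # each pair casts one vote
print(votes)
print("predicted:", max(votes, key=votes.get))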

3.4. Model Ensemble Techniques

In the last decade, the idea of building ensembles of classifiers has gained interest. Instead of

building a single complex classifier, a combination of several simple (weak) classifiers (of either

the same type or different types) is an alternative. For instance, instead of training a large

decision tree (DT), we train several simpler DT and combine their individual output to form the

final decision. Alternatively, we can train different kinds of classifiers (such as DT, NB, NN or

SVM), use them to classify a test object and then combine their results to obtain the final decision.

This procedure resembles a committee whose members jointly reach a verdict. Sometimes

this allows us to have faster training and to focus each classifier on a given portion of the training

set. Figure 3-37 illustrates the concept of ensemble of classifiers. The input pattern x is first

classified by each weak classifier. The i-th weak classifier returns the plausibility of the input x belonging to class c_k, denoted by f_i(x, c_k). The outputs of these weak classifiers are then combined in order to establish the final classification decision F(x, c_k). Finally, we can select the class that achieves the maximum value, i.e. c* = argmax_k F(x, c_k). Intuitively, when the individual

classifiers are uncorrelated, the result by majority voting (or other operations) of an ensemble of

classifiers is likely to be better than the result obtained from one individual classifier.
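As a small illustration of this combination step, the Python sketch below sums the per-class plausibility scores returned by several weak classifiers and selects the class with the maximum combined score; the three classifiers and their scores are invented purely for the example.

def combine(scores_per_classifier):
    # scores_per_classifier: list of dicts mapping class label -> plausibility
    total = {}
    for scores in scores_per_classifier:
        for label, s in scores.items():
            total[label] = total.get(label, 0.0) + s
    # final decision: the class with the maximum combined score
    return max(total, key=total.get), total

weak_outputs = [
    {"yes": 0.7, "no": 0.3},   # weak classifier 1
    {"yes": 0.4, "no": 0.6},   # weak classifier 2
    {"yes": 0.8, "no": 0.2},   # weak classifier 3
]
label, total = combine(weak_outputs)
print(label)                   # the ensemble decides "yes"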


Figure 3-37: The concept of an ensemble of classifiers (test phase). The outputs of the weak classifiers are combined to obtain the final decision, for (a) classification and (b) numeric prediction.

Moreover, in general, ensembles are more flexible in the functions they can represent, and this may enable them to over-fit the training data more than a single model would. Nevertheless, in practice, some ensemble techniques tend to reduce problems related to over-fitting of the training data. For example, the bagging ensemble technique splits the data into several subsets and then applies a learning algorithm to each subset to form a model. The obtained models are used to predict the test object and then their results are combined.

Empirically, ensembles tend to yield better results when there is a significant diversity

among the models. Many ensemble methods, therefore, seek to promote diversity among the

models they combine. Although perhaps non-intuitive, more random algorithms (like random

decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like

entropy-reducing decision trees). Using a variety of strong learning algorithms, however, has

been shown to be more effective than using techniques that attempt to dumb-down the models in

order to promote diversity.

Note that, as described before, naïve Bayes classification can be viewed as a kind of ensemble technique. It is an ensemble of classifiers, each of which predicts the class based on a single attribute with a Gaussian distribution. The predicted result from each classifier is given a vote proportional to the probability of the prediction, and the votes of the classifiers (one for each

attribute) are multiplied to form the final probability. Besides this simple example, there are

several types of ensemble techniques. Three common ones are known as bagging, boosting and

stacking. Their techniques are described in sequence below.

3.4.1. Bagging: Bootstrap Aggregating

Bootstrap aggregating, or bagging for short, is a simple ensemble meta-algorithm to improve the stability and accuracy of classification or regression. It can reduce variance and help avoid overfitting. Bagging is a special case of the model-averaging approach in which each model in the ensemble votes with equal weight. To obtain variation among the models, bagging trains each model in the ensemble on a randomly drawn subset of the training set. As an example, the random forest algorithm combines a number of decision trees learnt from different subsets extracted from the training dataset, and it has been shown to achieve very high classification accuracy. One of the most popular methods to generate multiple training datasets from a single dataset is called bootstrap sampling, which is briefly introduced below.

Given a training set T with n instances, the bagging algorithm will first generate m new training sets Ti, each with n’ instances (n’ ≤ n), by uniformly sampling examples from T with replacement. By sampling with replacement, it is likely that some examples will be repeated in each Ti. As a special case, when the dataset is large enough (large n) and n’ is set to n, it was shown that the set Ti is expected to include 63.2% of the unique examples from T, the rest being duplicates. The fraction of instances never being selected is calculated from the following equation.

P(an instance is never selected) = (1 - 1/n)^n ≈ e^(-1) ≈ 0.368

To understand the formula, first observe that (1 - 1/n) is the probability that a particular instance is not selected in a single draw, i.e. one of the other instances is selected instead. If we sample n times, the probability that the instance is never selected is (1 - 1/n)^n. When the value of n is large enough, this converges to e^(-1), which is approximately 0.368. Therefore the probability that an instance is selected at least once is 1 - e^(-1) ≈ 0.632. This kind of sampling is known as a bootstrap sample.
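The 63.2% figure can be checked empirically with a few lines of Python; the sketch below draws a bootstrap sample of size n from n items and counts how many distinct originals appear.

import random

n = 100_000
sample = [random.randrange(n) for _ in range(n)]    # sample n items with replacement
unique_fraction = len(set(sample)) / n
print(f"unique fraction ~ {unique_fraction:.3f}")   # close to 1 - 1/e = 0.632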

Figure 3-38 illustrates the graphical concept of bagging. By bootstrap sampling, m sets of samples are generated, and the bagging method then learns m models, one from each


sample set. In the test phase, given a test instance, the m models will judge the result (label or

numeric value) and combine the results by averaging the output (for regression) or voting (for

classification) as shown in Figure 3-38.

Figure 3-38: Graphical concepts of bootstrap sampling and bagging (learning phase): (a) the graphic concept and (b) a conceptual representation with the training dataset as a table.


Algorithm 3.6 presents a pseudo-code for the bagging ensemble method, composed of model

generation (learning phase) and classification (test phase).

Algorithm 3.6. Pseudo-code for the bagging ensemble method

Model generation

1: LET n be the number of instances in the training dataset T.

2: FOREACH i of m iterations

3: SAMPLE n’ instances with replacement from the set T to form the sample set Ti
4: LEARN Mi by applying the learning algorithm to the sample set Ti

5: STORE the resultant model Mi

Classification

1: FOREACH i of m iterations

2: PREDICT the class of the instance using the model Mi

3: RETURN the class that has been predicted most often
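A minimal runnable sketch of Algorithm 3.6 is shown below, assuming scikit-learn decision trees as the base learner and a synthetic dataset from make_classification; any other base learner or dataset could be substituted. Replacing the vote by an average of predicted values gives the regression variant.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)

# model generation: m bootstrap samples, one tree per sample
m = 25
models = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=len(X))     # sample n' = n with replacement
    models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# classification: every model votes and the most frequent class wins
def bagged_predict(x):
    votes = [int(model.predict(x.reshape(1, -1))[0]) for model in models]
    return Counter(votes).most_common(1)[0][0]

print(bagged_predict(X[0]), y[0])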

In general, the method has been criticized because averaging several predictors may not be useful when combining linear models, since the average of linear models is itself a linear model. Moreover, bagging does not improve classification much for very stable models such as k-nearest neighbors.

3.4.2. Boosting: AdaBoost Algorithm

Unlike bagging, boosting involves incrementally building an ensemble by training each new

model instance to emphasize the training instances that previous models mis-classified. In some

cases, boosting has been shown to yield better accuracy than bagging, but it is more likely to

overfit the training data. By far the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results. AdaBoost, short for Adaptive Boosting, was formulated by Yoav Freund and Robert Schapire. It is a meta-algorithm,

and can be used in conjunction with many other learning algorithms to improve their

performance. AdaBoost is adaptive in the sense that subsequent classifiers are built in favor of

those instances misclassified by previous classifiers. The class or value for a test instance can be

obtained by using voting or averaging the results where each voted class or value is weighted

according to the performance of the model that gives that class or value.

AdaBoost is sensitive to noisy data and outliers. However, in some problems it can be less

susceptible to the overfitting problem than most learning algorithms. Figure 3-39 and Figure 3-40 illustrate the overview of the AdaBoost-like process. AdaBoost-like algorithms construct weak classifiers one by one over a series of rounds until m classifiers have been built. For each classifier construction, a distribution of weights over the instances in the training

dataset is updated. On each round, the weights of each incorrectly classified example are

increased (or alternatively, the weights of each correctly classified example are decreased), so

that the new classifier focuses more on those examples. The succeeding models are expected to

be experts that complement each other. For example, the training dataset T2 is constructed by

testing the dataset T1 using the model M1, the training dataset T3 is constructed by testing the

dataset T2 using the model M2 and then the same procedure is applied until the number of

constructed models is m (as we set). In the test phase, given a test instance, the m models will

judge the result (label or numeric value) and combine the results by averaging the output (for

regression) or voting (for classification) as shown in Figure 3-37.


Figure 3-39: Graphical concepts of AdaBoost (Learning Phase) (Overview)


Figure 3-40: Graphical concepts of AdaBoost (Learning Phase) (Detail)

Algorithm 3.7 presents a pseudo-code for the AdaBoost approach, composed of model generation

(learning phase) and classification (test phase).

Algorithm 3.7. A pseudo-code for the AdaBoost algorithm

Model generation

1: ASSIGN an equal weight to each training instance in the dataset T

2: FOREACH i of m iterations
3: LEARN Mi by applying a learning algorithm to the weighted dataset T
4: STORE the resultant model Mi
5: COMPUTE the error rate e of the model Mi on the weighted dataset and store the error
6: IF (e equals zero) OR (e is greater than or equal to 0.5)
7: TERMINATE model generation
8: FOREACH instance in the dataset
9: IF the instance is classified correctly by the model Mi
10: MULTIPLY the weight of the instance by e/(1-e)
11: NORMALIZE the weights of all instances

Classification

1: ASSIGN a weight of zero to each class
2: FOREACH of the t stored models Mi
3: ADD -log(e/(1-e)) to the weight of the class predicted by the model Mi, where e is the stored error of Mi

4: RETURN the class with the highest weight
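The following compact sketch follows the spirit of Algorithm 3.7, using decision stumps from scikit-learn as weak learners on a synthetic dataset; the weight update e/(1-e) and the vote weight -log(e/(1-e)) match the pseudo-code above, so more accurate models (small e) receive larger votes.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
n, m = len(X), 20
w = np.full(n, 1.0 / n)                          # equal initial weights
models, alphas = [], []

for _ in range(m):                               # m boosting rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    e = w[pred != y].sum()                       # weighted error rate
    if e == 0 or e >= 0.5:                       # stop if perfect or too weak
        break
    models.append(stump)
    alphas.append(-np.log(e / (1 - e)))          # vote weight of this model
    w[pred == y] *= e / (1 - e)                  # shrink weights of correct instances
    w /= w.sum()                                 # renormalize

def boosted_predict(x):
    scores = {}
    for model, a in zip(models, alphas):
        c = int(model.predict(x.reshape(1, -1))[0])
        scores[c] = scores.get(c, 0.0) + a       # weighted vote for the predicted class
    return max(scores, key=scores.get)

print(boosted_predict(X[0]), y[0])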



3.4.3. Stacking

Stacking, also called stacked generalization, is a different way of combining multiple models.

Compared to bagging and boosting, it is less recognized since it is difficult to analyze theoretically

and there are also several different variations. Unlike bagging and boosting, stacking usually combines models of several different types, such as naïve Bayes, neural networks or decision trees, and uses the outputs from these models to form the training dataset for the upper-level classifier.

Suppose that we use a decision tree inducer (DT), a naïve Bayes learner (NB), and an instance-

based learning method (k-NN) to form a classifier for a given dataset. A potential method to

combine the outputs is voting, similar to bagging. However, voting may not work well if some of the learning schemes perform poorly. Instead of voting, stacking introduces the concept of a meta-learner,

which replaces the voting procedure. The meta-learner can be learned, by using a holdout set, to

discover the best way to combine the outputs of the base learners. For simplicity, assume that there are only two levels under consideration.

Figure 3-41 to Figure 3-43 illustrate the overview of the stacking ensemble method. The figures show (1) the base-model (level-0) learning, (2) the level-1 dataset generation and (3) the level-1 model learning. The classifiers at the first level are called the base models or level-0 models. In the previous example, DT, NB and k-NN are the base models. It is possible to use the predictions from the base models as input for learning the level-1 model (the meta-learner). Therefore, the number of features used as input to the level-1 model is equal to the number of

base models. In the test phase, an instance is first fed into the level-0 models where each model

guesses a class value and then these guesses are fed into the level-1 model, which combines them

into the final prediction. To obtain the training dataset for the level-1 model, we need to find a

way to transform the level-0 training data into level-1 training data. A naïve method is to apply

level-0 models to classify a training instance, and then use their predictions together with the instance's actual class value as a training instance to construct the level-1 model. However, this may lead to an overfitting problem, i.e. a level-1 model that works well on the training data but not on the test data.

Figure 3-41: Stacking (Phase 1/3) (Learning Phase)


Figure 3-42: Stacking (Phase 2/3) (Learning Phase)

Figure 3-43: Stacking (Phase 3/3) (Learning Phase)


Towards a solution, stacking uses a so-called holdout dataset for independent evaluation to

create the level-1 classifier. After the level-0 classifiers have been built by using the training

dataset, they are used to classify the instances in the holdout dataset to generate the level-1

training data. Since the level-0 classifiers never see instances in the holdout set, their predictions

will be unbiased. In other words, the level-1 training data accurately reflects the true

performance of the level-0 learning algorithms. Once the level-1 data have been generated by this

holdout procedure, the level-0 learners can be reapplied to generate classifiers from the full

training set, making slightly better use of the data and leading to better predictions.

In the classification phase, the instance is first classified by the level-0 classifiers; the results are then used as input to the level-1 classifier, which makes the final decision. Figure

3-44 shows the graphical concept of the classification using the level-0 classifiers and the level-1

classifier obtained by stacking.

Figure 3-44: Graphical concepts of stacking (Classification Phase)

Stacking can also be applied to numeric prediction. In that case, both the level-0 models and

the level-1 model predict numeric values. The basic mechanism remains the same. The only

difference lies in the nature of the level-1 data. In the numeric case, each level-1 attribute

represents the numeric prediction made by one of the level-0 models, and instead of a class value,

the numeric target value is attached to level-1 training instances. Algorithm 3.8 presents a

pseudo-code for the stacking approach, composed of (1) the base-model (level-0) learning, (2)

level-1 dataset generation and (3) the level-1 model learning.


Algorithm 3.8. A pseudo-code for the stacking algorithm

Model generation

1: LET T be the training data.

2: FOREACH i of m iterations # Level-0 model Learning

3: LEARN a model Mi using the i-th model learner with the training data T

4: STORE the resultant model Mi

5: FOREACH xj of t instances in T # Level-1 Data Generation

6: FOREACH i of m iterations

7: CLASSIFY (OR PREDICT) the class (or value) of the instance using

the model Mi

8: STORE the result as one feature (one column)

9: ADD the actual class (or value) as the last feature (target column) and

finish creating one record of the level-1 data T’.

10: LEARN the level-1 model M’ using the newly created training data T’.

# Level-1 Model Learning

Classification

1: FOREACH i of m iterations

2: PREDICT the class of the instance using the model Mi

3: STORE the result as one feature (one column) in order to create the

input record for the level-1 model.

4: PREDICT the class of the instance using the level-1 model M’
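Below is a minimal sketch of stacking with a holdout set, assuming scikit-learn learners (a decision tree, naïve Bayes and k-NN at level 0, logistic regression as the level-1 meta-learner); the learner choices and the synthetic data are only illustrative, and for brevity the level-0 models are not refitted on the full training set afterwards as the text suggests.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=6, random_state=2)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=2)

# level-0 model learning on the training part only
level0 = [DecisionTreeClassifier(max_depth=3), GaussianNB(), KNeighborsClassifier(5)]
for model in level0:
    model.fit(X_train, y_train)

# level-1 data: one column per level-0 prediction, built from the holdout set
Z_hold = np.column_stack([model.predict(X_hold) for model in level0])
meta = LogisticRegression().fit(Z_hold, y_hold)      # level-1 model learning

# classification: level-0 predictions become the meta-model's input features
def stacked_predict(x):
    z = np.array([[model.predict(x.reshape(1, -1))[0] for model in level0]])
    return meta.predict(z)[0]

print(stacked_predict(X[0]), y[0])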

Stacking (sometimes called stacked generalization) exploits the belief that performance on holdout data reflects true generalization. It does this by using performance on the holdout data to combine the models rather than to choose among them, thereby typically obtaining better performance than any single one of the trained models. It has been successfully used on both supervised learning tasks (e.g. regression) and unsupervised learning (e.g. density estimation). It has also been used to estimate bagging's error rate. Because the prior belief concerning holdout data is so powerful, stacking often outperforms Bayesian model averaging. Indeed, under the name blending, stacking was used extensively by the two top performers in the Netflix competition.

3.4.4. Co-training

Introduced by Avrim Blum and Tom Mitchell in 1998, co-training is an algorithm for learning a classification model from a small set of labeled data together with a large set of unlabeled data; it was originally applied to text mining for search engines. As a semi-supervised learning technique, co-training

requires two views of the data, describing two different feature sets that provide different,

complementary information about the instance. Ideally, the two views are assumed to be

conditionally independent and each view is sufficient to classify instances.

In the initial stage, co-training first learns a separate model (classifier) for each view using a

small set of labeled examples. After that, the two models are used to classify the unlabeled data. The

unlabeled data with the most confident predictions are added into the training set and then used

to iteratively learn and refine the previous models.

One of the most classic examples is the classification of Web pages. There are two well-

known and useful perspectives: the web content (content-based information) and the incoming

links (hyperlink-based information). Currently, several successful Web search engines use these

two kinds of information. The text label used as the link to another Web page usually provides a


clue about what that page is. For example, a link labeled ’My university’ usually indicates that

its destination page is the home page of a university.

The procedure of co-training is as follows. Given a (small) set of labeled examples, co-training first learns two different models, one for each perspective. In the previous example, they are a content-based and a hyperlink-based model. As the second step, we apply each model separately to label the unlabeled examples. For each model, we select the example (or a few examples) it most confidently labels as positive and the example (or a few examples) it most confidently labels as negative, and add these examples to the pool of labeled

examples. It is also possible to maintain the ratio of positive and negative examples in the labeled

pool by choosing more of one kind than the other. Finally, we repeat the whole procedure,

training both models on the augmented pool of labeled examples, until the unlabeled pool is

exhausted. Algorithm 3.9 shows a brief description of the co-training method. The original can be

found in (Blum and Mitchell, 1998). It has also been adapted for use in many later studies, such as (Nigam and Ghani, 2000; Nigam et al., 2000; Ghani, 2002; Brefeld and Scheffer, 2004).

Algorithm 3.9. A pseudo-code for the co-training algorithm

[Modified from Blum and Mitchell, 1998]

1: Let L be a set of labeled training examples.

U be a set of unlabeled training examples.

2: CREATE a pool U’ of examples by choosing u examples at

random from U

3: LOOP for k iterations:

4: TRAIN a classification model M1 using L but consider only

the feature set of the view V1

5: TRAIN a classification model M2 using L but consider only

the feature set of the view V2

6: USE M1 to classify the examples in U’ and

LABEL the most confident p positive and the most

confident n negative examples from the classified U’

7: USE M2 to classify the examples in U’ and

LABEL the most confident p positive and the most

confident n negative examples from the classified U’

8: ADD these self-labeled examples to L

9: RANDOMLY CHOOSE 2p+2n examples from U to replenish U’
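The sketch below condenses Algorithm 3.9, assuming naïve Bayes models over two views of a synthetic binary dataset; the labeled-set size, the pool size u, the one positive and one negative self-labeled example per view per round, and the number of iterations k are arbitrary illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=8, n_informative=6, random_state=3)
V1, V2 = X[:, :4], X[:, 4:]                                  # the two feature views

rng = np.random.default_rng(3)
labeled = list(rng.choice(len(X), size=30, replace=False))   # small labeled set L
unlabeled = [i for i in range(len(X)) if i not in labeled]   # unlabeled set U
pool = [unlabeled.pop() for _ in range(75)]                  # pool U' of u examples
labels = {i: int(y[i]) for i in labeled}                     # only these labels are "known"

for _ in range(10):                                          # k iterations
    idx = list(labels)
    m1 = GaussianNB().fit(V1[idx], [labels[i] for i in idx]) # view-1 model
    m2 = GaussianNB().fit(V2[idx], [labels[i] for i in idx]) # view-2 model
    for model, view in ((m1, V1), (m2, V2)):
        if not pool:
            break
        proba = model.predict_proba(view[pool])[:, 1]        # assumes both classes occur in L
        for j in (int(np.argmax(proba)), int(np.argmin(proba))):
            i = pool[j]                                      # most confident +/- examples
            labels[i] = int(model.predict(view[i].reshape(1, -1))[0])   # self-label
        pool = [i for i in pool if i not in labels]
    while unlabeled and len(pool) < 75:                      # replenish the pool U'
        pool.append(unlabeled.pop())

print("labeled examples after co-training:", len(labels))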

3.5. Historical Bibliography

Classification principles and techniques can be found in several books, such as (Weiss and

Kulikowski, 1991), (Michie, Spiegelhalter and Taylor, 1994), (Russel and Norvig, 1995), (Mitchell,

1997), (Duda, Hart and Stork, 2001), (Alpaydin, 2004), (Han and Kamber, 2004), (Witten and

Frank, 2005), (Bishop, 2006), (Camastra and Vinciarelli, 2008), (Izenman, 2008), (Theodoridis

and Koutroumbas, 2008), (Alpaydin, 2009), (Hastie, Tibshirani and Friedman, 2009) and (Rogers

and Girolami, 2011). In the past two decades, several collections containing seminal articles on

machine learning can be found in (Michalski, Carbonell and Mitchell, 1983), (Michalski, Carbonell

and Mitchell, 1986), (Kodratoff and Michalski, 1990), (Shavlik and Dietterich, 1990), and

(Michalski and Tecuci, 1994). Recently several collections have been gathered as reports of

current research works such as (Balcazar, Bonchi, Gionis and Sebag, 2010), (Gunopulos, Hofmann,


Malerba and Vazirgiannis, 2011) and (Gama, Bradley and Hollmen, 2011). A good introduction to applying probability to machine learning is provided by DasGupta (2011). Many of these books

describe each of the basic methods of classification discussed in this chapter, as well as practical

techniques for the evaluation of classifier performance. For a presentation of machine learning

with respect to data mining applications, see (Michalski, Bratko, and Kubat, 1998). A linear

discriminant classifier as a linear machine was described in Nilsson (1965). A good introduction to applied linear regression was presented by Weisberg (1980) and later by Breiman and Friedman (1997). Theoretical aspects of linear discriminant classifiers can be found in (Hastie,

Tibshirani and Friedman, 2009). The k-nearest-neighbor method was first described by Fix and Hodges (1951). The method is labor intensive when given large training sets. Although k-NN received little attention at the beginning, with the increasing power of computing since the 1960s it has become widely used in the area of pattern recognition. At the early stage, Cover and Hart (1967) gathered a collection of articles on nearest-neighbor classification. As a more recent advance, Dasarathy (1991) gathered a modern collection on the k-NN approach. The k-NN method is also explained in several textbooks, including Duda et al. (2001), James (1985), and Fukunaga and Hummels (1987). To improve nearest-neighbor classification time,

Friedman, Bentley, and Finkel (1977) presented a usage of search trees while Hart (1968)

proposed a method to remove unrelated training data by applying edit distance. The

computational complexity of nearest-neighbor classifiers is given in Preparata and Shamos

(1985). As a chapter in a book, Bayesian classification and its algorithms for inference on belief

networks was described by Duda, Hart, and Stork (2001), Weiss and Kulikowski (1991), Mitchell

(1997) and Russell and Norvig (1995). Domingos and Pazzani (1996) provided an analysis of the

predictive power of naïve Bayesian classifiers when the class conditional independence

assumption is violated. Heckerman (1996) and Jensen (1996) presented an introduction to

Bayesian belief networks. The computational complexity of belief networks was described by

Laskey and Mahoney (1997). Decision trees were first introduced by Quinlan (1986). Tree pruning was described in (Quinlan, 1987). An empirical comparison between genetic and

decision-tree classifiers was done in (Quinlan, 1988). An approach to deal with unknown

attribute values in tree induction was shown in (Quinlan, 1989). As a concrete version of decision

trees, the C4.5 algorithm is described in a book by Quinlan (1993). Bagging and boosting in C4.5 were described in (Quinlan, 1996). From another research group, the CART (Classification and Regression

Trees) system was developed by Breiman, Friedman, Olshen, and Stone (1984). C4.5 has a

commercial successor, known as C5.0, which can be found at www.rulequest.com. ID3, a

predecessor of C4.5, is detailed in Quinlan (1986). Incremental versions of ID3 include ID4

(Schlimmer and Fisher, 1986) and ID5 (Utgoff, 1988). Quinlan (1987 and 1993) presented how

to extract rules from decision trees. A comprehensive survey related to decision tree induction,

such as attribute selection and pruning, was written by Murthy (1998). To construct classification rules, a simple version of the covering or separate-and-conquer approach was

implemented as an algorithm named PRISM by Cendrowska (1987). As pruning techniques, the

idea of incremental reduced-error pruning was proposed by Fürnkranz and Widmer (1994) and

forms the basis for fast and effective rule induction. Later, an algorithm called RIPPER (repeated

incremental pruning to produce error reduction) was proposed by Cohen (1995). A good

summary of the Minimum Description Length principle was introduced by Grünwald (2007).

Besides this most basic algorithm, some popular variations include AQ by Ryszard Michalski

(1969) as well as its successor AQ15 by Hong, Mozetic, and Michalski (1986), CN2 by Peter Clark

and Tim Niblett (1989), FOIL by Quinlan and Cameron-Jones (1993) and RIPPER by William W.

Cohen (1995). The artificial neural network was first proposed as the perceptron by Rosenblatt (1958). Later, several publications addressed its limitations and improvements, such as


books by (Wasserman, 1989), Hecht-Nielsen (1990), Hertz, Krogh, and Palmer (1991), Bishop

(1995), Ripley (1996) and Haykin (1999). Support Vector Machines (SVMs) are rooted in the statistical learning theory proposed by Vapnik and Chervonenkis (1971). However, the first paper on SVMs

was published by Boser, Guyon, and Vapnik (1992). Later, Vapnik (1995, 1998) published his

original idea and its extension to classification. Law (2005), and Cristianini and Shawe-Taylor

(2000) gave a comprehensive introduction to SVMs. Boser, Guyon, and Vapnik (1992) provided a

training algorithm for optimal margin classifiers. Readers can find more comprehensive material provided by Burges (1998) and a textbook written by Kecman (2001). Fletcher (1987) as well as Nocedal and Wright (1999) provided good descriptions of how to solve the optimization problems in SVMs. Some applications of SVMs to regression were provided by Drucker, Burges, Kaufman, Smola, and Vapnik (1997) and Schölkopf, Bartlett, Smola, and Williamson (1999). Nilsson (1965) provides an excellent reference for the linear classification models that were popular in the 1960s. Linear regression is described in most standard statistical texts, and Lawson and Hanson (1995) provided a comprehensive description in their book. Friedman (1996) describes the technique of pairwise classification. Fürnkranz

(2002) further analyzes pairwise classification. Hastie and Tibshirani (1998) extend it to

estimate probabilities using pairwise coupling. Moreover, there are many good textbooks on classification and regression, provided by James (1985), Dobson (2001) and Johnson and Wichern (2002). A good introduction to the holdout, cross-validation, leave-one-out and bootstrapping methods was provided by Efron and Tibshirani (1993), with a theoretical and empirical study by Kohavi (1995). While combining multiple models has become a popular topic in machine learning research, the bagging (for “bootstrap aggregating”) technique was initiated by Breiman

(1996). As a special case of boosting, the AdaBoost.M1 boosting algorithm was developed by

Freund and Schapire (1997) with several different classifiers, including decision tree induction

by Quinlan (1996) and naive Bayesian classification by Elkan (1997). Drucker (1997) adapted

AdaBoost.M1 for numeric prediction. Freund and Schapire (1996) developed and derived

theoretical bounds for its performance. Friedman, Hastie and Tibshirani (2000) proposed the

LogitBoost algorithm. Later, Friedman (2001) describes how to make boosting more resilient in

the presence of noisy data. Bay (1999) suggests using randomization for ensemble learning with

nearest neighbor classifiers. Bagging, boosting and randomization were evaluated by Dietterich

(2000). More recently, Zhang and Ma (2012) and Okun, Valentini and Re (2011) provide descriptions of ensembles and their applications to machine learning. The theoretical model for

co-training was first proposed by Blum and Mitchell (1998) for the use of labeled and unlabeled

data from different independent perspectives. Nigam and Ghani (2000) analyzed the effectiveness and applicability of co-training and used standard EM to fill in missing values, an approach called

the co-EM algorithm. Applied to text classification, Nigam, McCallum, Thrun, and Mitchell (2000)

applied the EM clustering algorithm to exploit unlabeled data to improve an initial naïve Bayes

classifier. Later, Ghani (2002) extended co-training and co-EM to multiclass situations with error-correcting output codes. Brefeld and Scheffer (2004) extended co-EM to use a support vector machine rather than naïve Bayes.


Exercise

1. Explain the steps towards classification or prediction.

2. Describe the Fisher’s linear discriminant or centroid-based method using the following data.

Fat (F)  Protein (P)  Glucose (G)  Positive (C)
250  5.10  4.56   Y
220  8.65  8.91   Y
280  0.34  4.12   Y
230  0.45  9.74   N
150  9.48  10.45  Y
170  7.62  3.66   N
160  0.25  9.67   N
140  0.47  5.49   N
80   6.59  4.83   N
100  5.82  11.54  Y
90   0.54  3.52   N
110  0.62  4.81   N

3. From the result in question (2), what is the class when the following test object is observed?

Fat (F) Protein (P) Glucose (G) Positive (C)

90 7.67 4.57 ?

4. Given the table in question (2), what is the class when the k-NN is used to classify the test

datum in question (3)? Here, calculate the result when k = 1 and 3.

5. Compare the k-NN and centroid-based methods. What is the effect when k (for k-NN) becomes larger?

6. Given the table below, construct a probabilistic model based on naïve Bayes by calculating a

set of prior probabilities p(H) and posterior probabilities p(H|E), where H is a hypothesis and E is an

evidence. Here, use the ‘Carbon’ attribute with numerical values and also use Laplace

estimation by adding 1 for each class.

Temp Color Carbon Burn

High Red 90 (H) Y

High Red 60 (M) Y

High Yellow 95 (H) Y

High Yellow 30 (L) N

High Blue 80 (H) Y

High Blue 60 (M) Y

Low Red 55 (M) Y

Low Red 25 (L) N

Low Yellow 10 (L) N

Low Yellow 25 (L) N

Low Blue 65 (M) Y

Low Blue 90 (H) Y

7. Apply the acquired model in question (6) to classify the following case.

Temp Color Carbon Burn

Low Yellow 20 (L) ?


8. Given the table below, construct a decision tree. Here, compare the results obtained using information gain and gain ratio.

Hair    Height   Weight   Lotion  Result
Blonde  Average  Light    N       Sunburned
Blonde  Tall     Average  Y       None
Brown   Short    Average  Y       None
Blonde  Short    Average  N       Sunburned
Red     Average  Heavy    N       Sunburned
Brown   Tall     Heavy    N       None
Brown   Average  Heavy    N       None
Blonde  Short    Light    Y       None

9. Describe the effect of tree pruning in decision tree induction.

10. Given the table below, construct a set of covering rules using the criteria of either p/t or (p-

n)/t.

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         High      False  Yes
Rainy     Cool         Normal    False  Yes
Rainy     Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Sunny     Mild         High      False  No
Sunny     Cool         Normal    False  Yes
Rainy     Mild         Normal    False  Yes
Sunny     Mild         Normal    True   Yes
Overcast  Mild         High      True   Yes
Overcast  Hot          Normal    False  Yes
Rainy     Mild         High      True   No


11. From the result of the previous question, apply hypergeometric distribution to calculate rule

significance for pruning.

12. Explain how to use a neural network to classify a handwritten digit (0-9).

13. From the set of the given objects and their classes shown below, specify the support vectors,

draw the linear separating hyperplanes, decision boundary, calculate the margin and explain

the formulae of the linear separating hyperplanes.

x y class x y class

1 2 A 4 5 B

1 3 A 5 5 B

3 1 A 5 4 B

3 3 A 5 3 B

4 2 A 6 4 B

14. Apply linear regression to predict the value of ‘positive’. Here, y = positive, x1 = fat, x2 = protein and x3 = glucose. Then use the obtained linear regression function to predict the test case.

Fat (F)  Protein (P)  Glucose (G)  Positive (C)
250  5.10  4.56   0.82
220  8.65  8.91   0.75
280  0.34  4.12   0.92
230  0.45  9.74   0.25
150  9.48  10.45  0.85
170  7.62  3.66   0.15
160  0.25  9.67   0.25
140  0.47  5.49   0.30
80   6.59  4.83   0.14
100  5.82  11.54  0.78
90   0.54  3.52   0.12
110  0.62  4.81   0.08

Test case

Fat (F)  Protein (P)  Glucose (G)  Positive (C)
100  8.00  3.00  ?


15. Use the following table to learn regression for classification and then classify the test case.

Here, compare two approaches: one-against-the-other regression and pairwise regression.

Fat (F)  Protein (P)  Glucose (G)  Positive (C)
250  5.10  4.56   Class A
220  8.65  8.91   Class B
280  0.34  4.12   Class A
230  0.45  9.74   Class C
150  9.48  10.45  Class B
170  7.62  3.66   Class C
160  0.25  9.67   Class B
140  0.47  5.49   Class B
80   6.59  4.83   Class C
100  5.82  11.54  Class A
90   0.54  3.52   Class C
110  0.62  4.81   Class C

Test case

Fat (F)  Protein (P)  Glucose (G)  Positive (C)
100  8.00  3.00  ?

16. Compare the merits and demerits of bagging, boosting, stacking and co-training.
