Table of Contents
Chapter 3. Classification and Prediction
3.1. Classification
3.1.1. Fisher’s linear discriminant or centroid-based method
3.1.2. k-nearest neighbor method
3.1.3. Statistical Classifiers
3.1.4. Decision Trees
3.1.5. Classification Rules: Covering Algorithm
3.1.6. Artificial Neural Networks
3.1.7. Support Vector Machines (SVMs)
3.2. Numerical Prediction
3.2.1. Regression
3.2.2. Tree for prediction: Regression Tree and Model Tree
3.3. Regression as Classification
3.3.1. One-Against-the-Other Regression
3.3.2. Pairwise Regression
3.4. Model Ensemble Techniques
3.4.1. Bagging: Bootstrap Aggregating
3.4.2. Boosting: AdaBoost Algorithm
3.4.3. Stacking
3.4.4. Co-training
3.5. Historical Bibliography
Exercise
Sponsored by AIAT.or.th and KINDML, SIIT
CC: BY NC ND
Chapter 3. Classification and Prediction
This chapter presents a number of data mining/knowledge discovery techniques used to discover
meaningful hidden knowledge or patterns from a collection of data in the form of transactions, where
each transaction is assumed to be independent of the others. Data mining techniques fall into two
rough classes: supervised and unsupervised learning. The first class includes classification and
prediction, while the second relates to clustering and association rule mining. This chapter presents
the supervised learning tasks; the unsupervised learning tasks are explained in the next chapter.
Whereas classification aims to predict a categorical (discrete, unordered) label for a given test
object, prediction targets the modeling of continuous-valued functions. Both supervised tasks need
a set of examples to create a predictive model for forecasting the value of a newly arriving event or
object. For example, we can build a classification model to categorize the outcomes of medical tests
as either positive or negative.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics. Typical classification methods are k-nearest neighbor
classifiers, Bayesian classifiers, decision tree classifiers, rule-based classifiers, and artificial neural
networks. Linear regression, nonlinear regression, regression trees, and model trees are typical
prediction models. Since these algorithms need a large amount of computational space when the
data set to be mined is large, it is necessary to develop scalable classification and prediction
techniques capable of handling large disk-resident data, instead of memory-resident approaches.
Classification and prediction have numerous applications, including fraud detection, target
marketing, performance prediction, manufacturing, and medical diagnosis. This chapter presents
basic techniques for data classification and prediction in turn.
3.1. Classification
Classification, one of the most important supervised learning tasks in pattern recognition and
machine learning, aims to deduce a predictive function from a set of training data (also called
cases, observations, or examples), each of which has its class label known beforehand. Later, the
function is used to predict a class label for a newly arriving case. Typically, the training data are
pairs of input objects (viewed as vectors) and desired outputs (known as class labels). The
output of the classification function is a class label for the input object. The task is instead known
as prediction if the output is a continuous value. In other words, the task of a supervised
learner (a classification or prediction learner) is to predict the value of the function for any valid
input object after having seen a number of training examples (i.e., pairs of inputs and target
outputs). Supervised learning generates two types of models: global and local. In the former,
more common case, supervised learning generates a global model that maps input objects to
desired outputs; typical models are decision trees, classification rules, Bayesian models, artificial
neural networks, and support vector machines. In the latter case, supervised learning is lazy, with
no construction of a generalized model; instead, the data themselves are used as local models, as
in nearest neighbor or case-based reasoning. The following are the main steps towards
classification.
1. Problem Formulation
The first step in classification is to form an overview of the problem by determining
which type of task we are going to solve, what the input and output look like, and how we
obtain training examples. For example, to classify a single handwritten character into one of the
possible letters, an entire handwritten word or an entire line of handwriting may be set as
the input, and a predicted character for each character in the word is the targeted output.
2. Feature Design and Collection
After the rough specification of the input and output is determined, we have to design how to
characterize the raw input, i.e., how to transform it into a set of features. After the features are
designed, the feature values and corresponding target value of each object (each sample) are
collected, either from human experts or from measurements, to form a training set. That is,
typically, the input object is transformed into a feature vector, which contains a number of
features that describe the object. The accuracy of the learned classification model depends
strongly on how precisely the features characterize the input object. In general, the number of
features should be neither too large nor too small to accurately predict the output. Moreover, in
several situations there is unfortunately no special design of features, and the training set is
formed as it is.
3. Algorithm Selection and Model Generation
Once the training set is ready, we have to select and apply a learning/mining algorithm, for
example decision tree induction, Bayesian learning, or an artificial neural network, to generate a
model from the training set. Parameters of the learned model may be adjusted to optimize
performance using a holdout subset of the training set (called a validation set), or via cross-
validation.
4. Model Evaluation and Usage
After the final model is constructed from the training and validation sets, the performance of the
algorithm may be measured on a test set that is separate from the training set. The
classification model can then be used to reveal the most probable class of any unseen datum. In
general, classifier performance depends strongly on the characteristics of the data to be
classified. There is no single classifier that works best for all kinds of problems. It is
necessary to perform various empirical tests to compare classifier performance and to find the
characteristics of data that determine it. Finding a suitable classifier for a given problem is,
however, still more an art than a science. At present, the most widely used classifiers are
decision trees, naïve Bayes, k-nearest neighbor, rule-based methods, centroid-based methods
(Gaussian mixture models), artificial neural networks (multilayer perceptron with back-
propagation), and support vector machines.
3.1.1. Fisher’s linear discriminant or centroid-based method
Fisher’s linear discriminant, or the centroid-based method, an early classification procedure, has
been implemented widely due to its simplicity and low computational cost. The method simply
divides the sample space by a series of linear equations. In 2-D cases, the line dividing two
classes is drawn to bisect the line joining the centers of those classes. These lines implicitly
indicate the minimum distance from each center. Figure 3-1 shows Fisher’s Linear Discriminant
(centroid-based) for the Iris data, with the linear equation for each class given below.
Virginica :
Versicolor :
Setosa :
Figure 3-1: Fisher’s Linear Discriminant (Centroid-based) for the Iris data.
In 2-D cases, first, a linking line is drawn between the centroids of the two classes, and then the
linear discriminant line can be constructed by drawing a line perpendicular to the linking line at
the middle point of that linking line. The following are the formulas of the linking and discriminant
lines when the centroids of the two classes are (x1, y1) and (x2, y2), with middle point
(xm, ym) = ((x1 + x2)/2, (y1 + y2)/2). Figure 3-2 shows an example.

The linking line : y = m1·x + b1, where m1 = (y2 − y1)/(x2 − x1)
and b1 = y1 − m1·x1

The discriminant line : y = m2·x + b2, where m2 = −1/m1 = −(x2 − x1)/(y2 − y1)
and b2 = ym − m2·xm
Figure 3-2: An example of Fisher’s Linear Discriminant (2-D example)
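The construction above can be sketched in a few lines of Python. This is a minimal sketch under the assumptions stated in the text (2-D, non-vertical linking line); the function name `discriminant_line` is ours, not the text’s.

```python
def discriminant_line(c1, c2):
    """Return slope and intercept (m2, b2) of the discriminant line:
    the perpendicular bisector of the segment joining centroids c1, c2."""
    (x1, y1), (x2, y2) = c1, c2
    m1 = (y2 - y1) / (x2 - x1)              # slope of the linking line
    xm, ym = (x1 + x2) / 2, (y1 + y2) / 2   # middle point of the linking line
    m2 = -1.0 / m1                          # perpendicular slope
    b2 = ym - m2 * xm                       # line passes through the middle point
    return m2, b2

m2, b2 = discriminant_line((1.0, 1.0), (3.0, 3.0))
# linking line has slope 1 and midpoint (2, 2), so the discriminant is y = -x + 4
```

The degenerate cases (vertical or horizontal linking line) would need separate handling, which is omitted here.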
From another viewpoint of centroid-based classification, an explicit profile of a class (also called
a class prototype) is calculated and used as the representative of all positive examples of the
class. The classification task is then to find the class most similar to the object to be classified, by
comparing the object with the class prototype of each candidate class. Figure 3-3 shows an
example of how to calculate centroid vectors and classify a new datum using these centroid
vectors. Figure 3-4 shows an example of classification of a new datum using the equivalent plane.
Outlook Temp. Humidity Windy Play
90.00 40.00 80.00 10.00 No
95.00 32.00 85.00 80.00 No
50.00 35.00 90.00 20.00 Yes
10.00 24.00 80.00 5.00 Yes
15.00 10.00 50.00 15.00 Yes
20.00 12.00 55.00 90.00 No
55.00 9.00 45.00 95.00 Yes
85.00 22.00 95.00 25.00 No
95.00 7.00 50.00 5.00 Yes
5.00 26.00 45.00 10.00 Yes
80.00 25.00 40.00 80.00 Yes
45.00 24.00 85.00 85.00 Yes
40.00 37.00 60.00 15.00 Yes
25.00 23.00 90.00 95.00 No
(a) A sample data set (the real-valued Play-Tennis data set with categorical classes)
Outlook Temp. Humidity Windy Play
90.00 40.00 80.00 10.00 No
95.00 32.00 85.00 80.00 No
20.00 12.00 55.00 90.00 No
85.00 22.00 95.00 25.00 No
25.00 23.00 90.00 95.00 No
Centroid of 'No' (average vector) 63.00 25.80 81.00 60.00 No
Outlook Temp. Humidity Windy Play
50.00 35.00 90.00 20.00 Yes
10.00 24.00 80.00 5.00 Yes
15.00 10.00 50.00 15.00 Yes
55.00 9.00 45.00 95.00 Yes
95.00 7.00 50.00 5.00 Yes
5.00 26.00 45.00 10.00 Yes
80.00 25.00 40.00 80.00 Yes
45.00 24.00 85.00 85.00 Yes
40.00 37.00 60.00 15.00 Yes
Centroid of 'Yes' (average vector) 43.89 21.89 60.56 36.67 Yes
(b) Average vectors as the centroids for ‘No’ and ‘Yes’
Outlook Temp. Humidity Windy Play
Test datum 70 40 30 60 ?
Outlook Temp. Humidity Windy Play
Centroid of 'No' 63.00 25.80 81.00 60.00 No
Centroid of 'Yes' 43.89 21.89 60.56 36.67 Yes
Distance between Test and ‘No’ : √((70−63)² + (40−25.8)² + (30−81)² + (60−60)²) = √2851.64 ≈ 53.40
Distance between Test and ‘Yes’ : √((70−43.89)² + (40−21.89)² + (30−60.56)² + (60−36.67)²) = √2487.91 ≈ 49.88
Since 49.88 < 53.40, the closest class for the test datum is ‘Yes’.
(c) Classification of the test datum
Figure 3-3: An example of centroid-based classification: centroid vectors and classification
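The centroid computation and nearest-centroid decision of Figure 3-3 can be sketched as follows, using the Play-Tennis data above. This is a minimal Python sketch; the helper names (`centroid`, `classify`) are ours, not the text’s.

```python
import math

# Real-valued Play-Tennis data: (Outlook, Temp., Humidity, Windy) -> Play.
data = [
    ((90, 40, 80, 10), 'No'),  ((95, 32, 85, 80), 'No'),
    ((50, 35, 90, 20), 'Yes'), ((10, 24, 80, 5),  'Yes'),
    ((15, 10, 50, 15), 'Yes'), ((20, 12, 55, 90), 'No'),
    ((55, 9, 45, 95),  'Yes'), ((85, 22, 95, 25), 'No'),
    ((95, 7, 50, 5),   'Yes'), ((5, 26, 45, 10),  'Yes'),
    ((80, 25, 40, 80), 'Yes'), ((45, 24, 85, 85), 'Yes'),
    ((40, 37, 60, 15), 'Yes'), ((25, 23, 90, 95), 'No'),
]

def centroid(rows):
    """Average vector (class prototype) of a list of feature tuples."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows) / n for i in range(len(rows[0])))

def classify(x, centroids):
    """Assign x to the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

centroids = {c: centroid([x for x, y in data if y == c]) for c in ('No', 'Yes')}
print(classify((70, 40, 30, 60), centroids))   # the closest centroid is 'Yes'
```

Running this reproduces the centroids (63.00, 25.80, 81.00, 60.00) for ‘No’ and (43.89, 21.89, 60.56, 36.67) for ‘Yes’, and classifies the test datum as ‘Yes’.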
Squared distances from ‘No’ and from ‘Yes’ (with centroids c_No and c_Yes):

Distance² from ‘No’ : |x − c_No|² = x·x − 2 c_No·x + c_No·c_No
Distance² from ‘Yes’ : |x − c_Yes|² = x·x − 2 c_Yes·x + c_Yes·c_Yes

The equivalent plane satisfies the condition that the two distances are equal:

|x − c_No|² = |x − c_Yes|²
−2 c_No·x + c_No·c_No + 2 c_Yes·x − c_Yes·c_Yes = 0
2 (c_Yes − c_No)·x + (c_No·c_No − c_Yes·c_Yes) = 0

The equivalent plane (discriminant) can thus be written as follows, where d = c_No − c_Yes and
m = (c_No + c_Yes)/2 is the center of the two centroids.

d·(x − m) = 0
Outlook Temp. Humidity Windy Play
Centroid of 'No' 63.00 25.80 81.00 60.00 No
Centroid of 'Yes' 43.89 21.89 60.56 36.67 Yes
Center of ‘No’ and ‘Yes’ (m) 53.45 23.85 70.78 48.34 -
Difference of ‘No’ and ‘Yes’ (d = c_No − c_Yes) 19.11 3.91 20.44 23.33 -
Discriminant plane: d·(x − m) = 19.11(x1 − 53.45) + 3.91(x2 − 23.85) + 20.44(x3 − 70.78) + 23.33(x4 − 48.34) = 0
Condition of ‘No’: d·(x − m) > 0
Condition of ‘Yes’: d·(x − m) < 0
Outlook Temp. Humidity Windy Play
Test datum 70 40 30 60 ?
Calculation for the test datum:
d·(x − m) = 19.11(70 − 53.45) + 3.91(40 − 23.85) + 20.44(30 − 70.78) + 23.33(60 − 48.34) = −181.87 < 0, hence ‘Yes’.
Figure 3-4: An example of centroid-based classification using equivalent hyperplane.
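The equivalent-hyperplane decision of Figure 3-4 can be sketched directly from the two centroids. This is a minimal sketch; the function name `g` is ours.

```python
# Centroids of the two classes from the Play-Tennis example.
c_no  = (63.00, 25.80, 81.00, 60.00)
c_yes = (43.89, 21.89, 60.56, 36.67)

d = [a - b for a, b in zip(c_no, c_yes)]        # difference vector c_No - c_Yes
m = [(a + b) / 2 for a, b in zip(c_no, c_yes)]  # center of the two centroids

def g(x):
    """Discriminant d . (x - m): positive on the 'No' side, negative on 'Yes'."""
    return sum(di * (xi - mi) for di, xi, mi in zip(d, x, m))

value = g((70, 40, 30, 60))
print('No' if value > 0 else 'Yes')   # value is about -181.87, so 'Yes'
```

This agrees with the nearest-centroid computation: the sign of d·(x − m) tells which centroid is closer, without computing either distance explicitly.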
As a formal description, first let us consider the family of discriminant functions that are linear
combinations of the t components of x = (x1, …, xt)ᵀ:

g(x) = wᵀx + w0 = w1·x1 + ⋯ + wt·xt + w0
The above equation is a linear discriminant function, prescribed by the weight vector w and the
threshold weight w0. It represents a hyperplane with unit normal in the direction of w and a
perpendicular distance |w0|/|w| from the origin. The value of the discriminant function for a
pattern x, normalized by the size of the weight vector, g(x)/|w|, is a measure of the perpendicular
distance of x from the hyperplane. Figure 3-5 shows the graphical representation of the linear
discriminant function given by the discriminant equation, with a concrete example.
Figure 3-5: An example of Fisher’s Linear Discriminant (2-D example)
A linear discriminant classifier can be viewed as a linear machine, an important special case of
which is the minimum-distance classifier or nearest-neighbor rule. Suppose we are given a set of
prototype (centroid) points p1, …, pC, one for each of the C classes ω1, …, ωC. The minimum-distance
classifier assigns a pattern x to the class ωi associated with the nearest point pi. For each point,
the squared Euclidean distance between the current pattern x and a prototype pi can be
represented as follows.

|x − pi|² = xᵀx − 2 xᵀpi + piᵀpi

The classification can be achieved by finding the prototype which has the minimum distance to
the current pattern x. However, since the first term xᵀx in the equation is identical for the calculation
of any prototype pi, the comparison can be performed on only the second and third terms, i.e., by
maximizing xᵀpi − ½ piᵀpi. Thus, the linear discriminant function is as follows.

gi(x) = wiᵀx + wi0

where wi = pi
and wi0 = −½ piᵀpi
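The equivalence between nearest-prototype classification and the linear form gi(x) = piᵀx − ½ piᵀpi can be sketched as follows, with hypothetical 2-D prototypes of our own choosing.

```python
# Hypothetical prototype (centroid) points, one per class.
prototypes = {'A': (0.0, 0.0), 'B': (4.0, 0.0), 'C': (0.0, 4.0)}

def g(x, p):
    """Linear discriminant equivalent to minimum distance: p . x - 0.5*|p|^2."""
    return sum(xi * pi for xi, pi in zip(x, p)) - 0.5 * sum(pi * pi for pi in p)

def classify(x):
    """Assign x to the class with the largest discriminant value."""
    return max(prototypes, key=lambda c: g(x, prototypes[c]))

print(classify((3.0, 1.0)))   # nearest prototype is (4, 0), i.e. class 'B'
```

Maximizing gi(x) here gives exactly the same decision as minimizing the squared Euclidean distance to each prototype, since the xᵀx term is common to all classes.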
Therefore, the minimum-distance classifier is a linear form (also called a linear machine). If the
prototype points are the class means, then we have the nearest class mean classifier. Decision
regions for a minimum-distance classifier are illustrated in Figure 3-6. Each boundary is the
perpendicular bisector of the line joining the prototype points of regions that are contiguous.
Also, note from the figure that the decision regions are convex (that is, any two points lying
in a region can be joined by a straight line that lies entirely within the region). However, since the
decision regions of a linear machine are always convex, it cannot cope with concave shapes. Figure
3-7 illustrates a two-class problem which cannot be separated by a linear form. To overcome this
difficulty, two generalizations of linear discriminant functions (linear machines) are piecewise
linear discriminant functions and generalized linear discriminant functions, as shown below.

Figure 3-6: A 2-D example of decision regions for a minimum-distance classifier

Figure 3-7: A two-class problem which cannot be separated by a linear discriminant
Piecewise linear discriminant functions
To handle concave decision boundaries, the piecewise linear discriminant allows more than one
prototype per class, instead of only one per class. For example, it is possible to assume ni
prototypes pi1, …, pi,ni for the i-th class ωi. We can define the discriminant function for class
i as

gi(x) = max over j = 1, …, ni of gij(x)

where gij is a subsidiary discriminant function, which is linear and is given by

gij(x) = pijᵀx − ½ pijᵀpij

A pattern (object) x is assigned to the class for which gi(x) is largest; that is, to the class of the
nearest prototype vector. In other words, we have partitioned the space into regions, one per
prototype. This partition is known as the Dirichlet tessellation of the space. When each pattern
in the training set is taken as a prototype vector, we have the nearest-neighbor decision rule. This
discriminant function generates a piecewise linear decision boundary. It is possible to apply a
clustering scheme to construct the prototypes pi1, …, pi,ni. Moreover, rather than using the
complete design set as prototypes, it is also possible to use only a subset of it. There are some
methods for reducing the number of prototype vectors (edit and condense) used along with the
nearest-neighbor algorithm. Figure 3-8 shows an example of piecewise linear discriminant
functions.
Figure 3-8: An example of piecewise linear discriminant functions
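A piecewise linear discriminant with several prototypes per class can be sketched as below. The prototypes are hypothetical, chosen so that one class surrounds the other: a single linear machine cannot carve out such a concave region, but the max over per-prototype linear discriminants can.

```python
# Hypothetical prototypes: 'outer' has four prototypes arranged around 'inner'.
prototypes = {
    'inner': [(0.0, 0.0)],
    'outer': [(4.0, 0.0), (-4.0, 0.0), (0.0, 4.0), (0.0, -4.0)],
}

def g_sub(x, p):
    """Subsidiary linear discriminant for one prototype: p . x - 0.5*|p|^2."""
    return sum(a * b for a, b in zip(x, p)) - 0.5 * sum(b * b for b in p)

def g(x, cls):
    """Class discriminant: maximum over that class's prototypes."""
    return max(g_sub(x, p) for p in prototypes[cls])

def classify(x):
    return max(prototypes, key=lambda c: g(x, c))

print(classify((0.5, 0.5)))   # near the center -> 'inner'
print(classify((3.5, 0.5)))   # near one of the surrounding prototypes -> 'outer'
```

The decision region of 'inner' is the intersection of the half-spaces won against each 'outer' prototype, so the overall boundary is piecewise linear.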
Generalized linear discriminant function
Another solution to the concavity of the decision regions is a generalized linear discriminant
function, also termed a phi machine by Nilsson (1965). It is a discriminant function of the form

g(x) = wᵀφ(x) + w0 = w1·φ1(x) + ⋯ + wq·φq(x) + w0
where φ = (φ1, …, φq)ᵀ is a non-linear mapping (kernel) function of x. If q = t, the number
of original variables, and φi(x) = xi, then the formula is equivalent to a linear discriminant
function.
Figure 3-9: Nonlinear mapping (kernel) function from x-space to φ-space.
By the mapping function, it is possible to transform a discriminant function on the original
measurements xi, which is not linear, into a new discriminant function which is linear in the
functions φi(x). Figure 3-9 shows a nonlinear mapping (kernel) function from x-space to
φ-space. As seen in the figure, the two classes can be separated in the φ-space by a straight line.
Similarly, disjoint classes can be transformed into a φ-space in which a linear discriminant
function can separate the classes, even if they are not linearly separable in the original space.
This mapping is sometimes known as a kernel function. Although this transformation can help us
find a linear discriminant function to separate the classes, it is hard to determine the form of φ.
The following table lists some common mapping functions φ used in several kernel methods.
Kernel (mapping) function φ — Mathematical form

Linear : φi(x) = xi
Quadratic : products of up to two components, i.e., xi, xi·xj (i ≤ j), and a constant
j-th order polynomial : products of up to j components, and a constant
Radial basis function : φ(x) = f(|x − v|), where v is the center and f is a mapping function
Multilayer perceptron : φ(x) = f(vᵀx + v0), where v is the direction, v0 is an offset, and f is
the logistic function f(z) = 1/(1 + e^(−z))
Among these functions, as the number of functions (dimensions) used as a basis set
increases, so does the number of parameters that must be determined from the limited training
set. For example, a complete quadratic discriminant function requires (t + 1)(t + 2)/2 terms for
one class, and for C classes we need C(t + 1)(t + 2)/2 parameters to estimate. For this large
number, we may need to apply some kinds of constraints in order to regularize the model and
ensure that there is no over-fitting. An alternative to having a set of different functions is to have
a set of functions of the same parametric form, which differ only in the values of the parameters
they take,

φ(x; v)

where v is a set of parameters. Different models arise from the way the variable x
and the parameters v are combined. For the radial basis function, the function depends only on the
magnitude of the difference between the pattern x and the weight vector (parameter) v, as follows.

φ(x; v) = f(|x − v|)

On the other hand, if φ is a function of the scalar product of the two vectors, as shown below, the
discriminant function is known as a multilayer perceptron, especially when f is the logistic
function f(z) = 1/(1 + e^(−z)).

φ(x; v) = f(vᵀx)

Both the radial basis function and the multilayer perceptron models can be used in regression.
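The effect of a nonlinear mapping φ can be sketched with a small example of our own: two classes given by the inside and outside of the unit circle x1² + x2² = 1 are not linearly separable in x-space, but the quadratic mapping φ(x) = (x1², x2²) turns the circle into the straight line z1 + z2 = 1 in φ-space.

```python
def phi(x):
    """Hypothetical quadratic mapping: (x1, x2) -> (x1^2, x2^2)."""
    return (x[0] ** 2, x[1] ** 2)

def g(x, w=(1.0, 1.0), w0=-1.0):
    """Linear discriminant in phi-space: w . phi(x) + w0."""
    z = phi(x)
    return sum(wi * zi for wi, zi in zip(w, z)) + w0

inside  = [(0.2, 0.3), (-0.5, 0.1)]    # points inside the unit circle
outside = [(1.5, 0.0), (-1.0, 1.0)]    # points outside the unit circle
print([g(x) < 0 for x in inside])      # negative inside the circle
print([g(x) > 0 for x in outside])     # positive outside the circle
```

The boundary g(x) = 0 is a straight line in φ-space but a circle back in x-space, which is exactly the phi-machine idea described above.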
3.1.2. k-nearest neighbor method
The k-NN is a type of instance-based learning, or lazy learning, where the classification function is
locally approximated without constructing any generalized model (no learning phase) and all
computation is deferred until the classification phase. Under a vector space model, the training
examples are represented by vectors in a multidimensional feature space, each with a class label.
The training phase of the algorithm consists only of storing the feature vectors and class labels of
the training samples; there is no explicit learning process. In the classification phase, the k-nearest
neighbor method (k-NN) gives a class label to an unlabeled object (one whose class is
unknown) by finding the k closest training examples (neighbors) in the feature space and then
assigning it the majority class of these neighboring examples. Normally, k is a user-defined
constant, a positive integer that is typically small, say 3-50. For the special case of k = 1,
the object is simply assigned to the class of its nearest neighbor. As a distance measure or
metric, Euclidean distance is usually used in the case of numeric attributes (features). In the case
of nominal, ordinal, or binary attributes, different types of metrics, such as the overlap metric (or
Hamming distance), can be used. These metrics are normally used for measuring distance or
similarity in clustering, as shown in Chapter 4. Figure 3-10 shows the decision
boundary of linear regression for a two-class response. Figure 3-11 and Figure 3-12 illustrate the
decision boundaries of k-NN when k is 1 and 5, respectively.
Figure 3-10: The decision boundary of linear regression for a two-class response
Figure 3-11: The decision boundary of k-NN when k = 1.
Figure 3-12: The decision boundary of k-NN when k = 5
However, one main drawback of the "majority voting" criterion for the classification decision is
that classes with more examples tend to be selected as the prediction for the test object, since
they tend to dominate the labels of the k nearest neighbors. One way to overcome this problem is
to weight the classification by the distance from the test point to each of its k nearest neighbors.
Distance weighting can also be applied to prediction, by assigning as the predicted value for
the object the average of the values of its k nearest neighbors, weighted by their distance-based
contributions; that is, a closer neighbor has a larger contribution. A common weighting scheme is
to give each neighbor a weight of 1/d, where d is the distance to the neighbor; this can be viewed
as a generalization of linear interpolation.
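The 1/d weighting scheme can be sketched as follows on hypothetical toy data (the case d = 0, an exact match, is not handled in this sketch).

```python
from collections import defaultdict
import math

def weighted_knn(train, x, k=3):
    """Distance-weighted k-NN vote: train is a list of (vector, label) pairs;
    returns the label with the largest sum of 1/d over the k nearest neighbors."""
    neighbors = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    votes = defaultdict(float)
    for v, label in neighbors:
        votes[label] += 1.0 / math.dist(v, x)   # closer neighbors count more
    return max(votes, key=votes.get)

train = [((0, 0), 'A'), ((0, 1), 'A'), ((3, 0), 'B'), ((3, 1), 'B')]
print(weighted_knn(train, (1, 0), k=3))   # two close 'A's outweigh one far 'B'
```

For prediction instead of classification, the same weights would be used to form a weighted average of the neighbors' numeric target values.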
While the naive version of the algorithm is easily implemented by computing the
distances from the test sample (a test vector) to all of the training data (stored vectors), this
process is computationally intensive, especially when the number of training data (stored
vectors in the training set) is large. Over the decades, many researchers have proposed efficient
nearest neighbor search algorithms that find the nearest neighbors of the test vector in tractable
computational time, even for large data sets. Nearest neighbor search (NNS), sometimes
known as proximity search, similarity search, or closest point search, is an optimization problem
for finding closest points in metric spaces. The problem is: "given a set P of points in a metric
space S and a query point q ∈ S, find the closest point (or k closest points) in P to q." Normally, S is
taken to be n-dimensional Euclidean space and distance is measured by Euclidean or
Manhattan distance. This problem is also known as the post-office problem, referring to the
application of assigning a residence to the nearest post office.
The nearest neighbor algorithm has some strong consistency results. As the amount of data
approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the
Bayes error rate (the minimum achievable error rate given the distribution of the data). The k-
nearest neighbor is guaranteed to approach the Bayes error rate, for some value of k (where k
increases as a function of the number of data points). Various improvements to k-nearest
neighbor methods are possible by using proximity graphs.
It can be observed that the result of 5-NN in Figure 3-12 shows fewer misclassified training
observations than the result of linear regression in Figure 3-10. The case of 1-NN in Figure 3-11
is more extreme: none of the training observations is misclassified. As a rule of thumb, the error
on the training data is approximately an increasing function of k, and will always be 0 when k is 1.
However, this does not hold when we test on an independent test set. Therefore, it is more
suitable to evaluate the method using a separate test set, since this reflects the real situation. As
for complexity, the k-nearest-neighbor (k-NN) fit has a single parameter, the number of neighbors
k, while the linear regression or least-squares fit in the previous section has t + 1 parameters,
where t is the number of components or dimensions. However, since k-NN depends not only on
the parameter k but also on the N training data themselves, the effective number of parameters of
k-NN is N/k, and this number is normally bigger than t in linear regression. The effective number
of k-NN parameters decreases with increasing k: if the neighborhoods were non-overlapping,
there would be N/k neighborhood groups, and we could fit one parameter (a mean) in each group.
For k-NN, we cannot choose k by minimizing the sum of squared errors on the training set, since
we would always get k = 1.
Figure 3-13 shows an example of classification of a test datum using k-NN when k is 1, 3, or
5. For 1-NN, we select the closest point (object), which is object No. 11. Since its label (class) is
‘Yes’, the class for the test datum (85, 30, 60, 60) is ‘Play = Yes’. For 3-NN and 5-NN, we
select the three and five nearest neighbors, respectively. For this example, 3-NN returns ‘No’ while
5-NN gives ‘Yes’. We can observe that the answers for these three cases (k = 1, 3, 5) are not
consistent: ‘Yes’, ‘No’, and ‘Yes’ for 1-NN, 3-NN, and 5-NN, respectively.
No. Outlook Temp. Humidity Windy Play
1 90.00 40.00 80.00 10.00 No
2 95.00 32.00 85.00 80.00 No
3 50.00 35.00 90.00 20.00 Yes
4 10.00 24.00 80.00 5.00 Yes
5 15.00 10.00 50.00 15.00 Yes
6 20.00 12.00 55.00 90.00 No
7 55.00 9.00 45.00 95.00 Yes
8 85.00 22.00 95.00 25.00 No
9 95.00 7.00 50.00 5.00 Yes
10 5.00 26.00 45.00 10.00 Yes
11 80.00 25.00 40.00 80.00 Yes
12 45.00 24.00 85.00 85.00 Yes
13 40.00 37.00 60.00 15.00 Yes
14 25.00 23.00 90.00 95.00 No
(a) A sample data set (the real-valued Play-Tennis data set with categorical classes)
Outlook Temp. Humidity Windy Play
Test datum 85 30 60 60 ?
No. Outlook Temp. Humidity Windy Play Distance Rank
1 90.00 40.00 80.00 10.00 No 55.00 6
2 95.00 32.00 85.00 80.00 No 33.60 2
3 50.00 35.00 90.00 20.00 Yes 61.24 7
4 10.00 24.00 80.00 5.00 Yes 95.32 13
5 15.00 10.00 50.00 15.00 Yes 86.17 12
6 20.00 12.00 55.00 90.00 No 73.99 10
7 55.00 9.00 45.00 95.00 Yes 52.83 4
8 85.00 22.00 95.00 25.00 No 50.14 3
9 95.00 7.00 50.00 5.00 Yes 61.27 8
10 5.00 26.00 45.00 10.00 Yes 95.61 14
11 80.00 25.00 40.00 80.00 Yes 29.15 1
12 45.00 24.00 85.00 85.00 Yes 53.72 5
13 40.00 37.00 60.00 15.00 Yes 64.02 9
14 25.00 23.00 90.00 95.00 No 75.99 11
(b) Distance between the test datum and all training data (last column)
Method K Nearest neighbors Predicted Class (majority vote)
1-NN:   1   No.11 (Yes)                                                     Yes
3-NN:   3   No.11 (Yes), No.2 (No), No.8 (No)                               No
5-NN:   5   No.11 (Yes), No.2 (No), No.8 (No), No.7 (Yes), No.12 (Yes)      Yes
(c) Predicted class for the test datum when k = 1, 3 and 5 (last column)
Figure 3-13: An example of k-NN classification
As a formal description, given a training set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} and an input x, the nearest neighbor methods attempt to find the set N_k(x) of the k closest objects to x among the observations in T. That is, N_k(x) is a subset of T with |N_k(x)| = k such that every object in N_k(x) is at least as close to x as any object outside it. The class of the input x is determined by observing the classes of its k nearest neighbors in the set N_k(x), i.e., by a majority vote:

C(x) = argmax_c Σ_{(x_i, y_i) ∈ N_k(x)} I(y_i = c)

To find the nearest neighbors of x, it is necessary to define a metric. Possible metrics include the Euclidean distance, the Manhattan distance, or another distance-based or statistical measure. This metric determines the k observations in T that are closest to x in the input space; their classes are then aggregated to obtain the final class of the input x.
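As a concrete sketch (not part of the original figure), the distance computation and majority vote can be written in a few lines of Python. The data and the test datum come from Figure 3-13, and the Euclidean metric is assumed here since it reproduces the distances listed in the figure (e.g., 29.15 for object No. 11).

```python
import math
from collections import Counter

# (Outlook, Temp., Humidity, Windy) -> Play, from Figure 3-13(a)
data = [
    ((90, 40, 80, 10), "No"),  ((95, 32, 85, 80), "No"),
    ((50, 35, 90, 20), "Yes"), ((10, 24, 80,  5), "Yes"),
    ((15, 10, 50, 15), "Yes"), ((20, 12, 55, 90), "No"),
    ((55,  9, 45, 95), "Yes"), ((85, 22, 95, 25), "No"),
    ((95,  7, 50,  5), "Yes"), (( 5, 26, 45, 10), "Yes"),
    ((80, 25, 40, 80), "Yes"), ((45, 24, 85, 85), "Yes"),
    ((40, 37, 60, 15), "Yes"), ((25, 23, 90, 95), "No"),
]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(x, k):
    # sort the training objects by distance to x and vote over the k closest
    neighbors = sorted(data, key=lambda row: euclidean(x, row[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

test = (85, 30, 60, 60)
for k in (1, 3, 5):
    print(k, knn_predict(test, k))  # 1 -> Yes, 3 -> No, 5 -> Yes
```

Running the sketch reproduces the inconsistent answers of Figure 3-13(c): 'Yes' for k = 1, 'No' for k = 3 and 'Yes' for k = 5.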
3.1.3. Statistical Classifiers
As a typical statistical classification method, a Bayesian classifier predicts the most plausible
class for an unlabeled object by calculating class membership probabilities of all possible classes
the object can belong to and then comparing these probabilities to find the class with the highest
probability. In general, Bayesian classification is based on Bayes’ theorem (named after Thomas
Bayes, a nonconformist English clergyman who had produced early works in probability and
decision theory during the 18th century) as follows.
Given a set of possible classes C = {C_1, ..., C_m} and an unlabeled object x, its most plausible class is the class C_i whose probability P(C_i | f(x)) achieves the highest value when the object x is encoded by the function f. The function f is arbitrary but expresses the properties of the object. It is usually represented by a set of n attribute values (a_1, a_2, ..., a_n) as follows.

C(x) = argmax_{C_i} P(C_i | a_1, a_2, ..., a_n)

In Bayesian terms, a_1, a_2, ..., a_n are considered as a set of evidence (a set of attribute values) for an object x. The class of the object can be viewed as a hypothesis, such as that the object (data tuple) x belongs to a specified class C_i. As the classification problem, the classifier finds P(C_i | a_1, ..., a_n), the probability that the class C_i holds given the observed attributes a_1, ..., a_n. In other words, we are looking for the probability that the object x belongs to class C_i, given that we know the attribute description of x. P(C_i | a_1, ..., a_n) is called the posterior probability, or a posteriori probability, of C_i conditioned on a_1, ..., a_n.
Given an example of the Play-Tennis dataset, each tuple is described by the values of four attributes (outlook, temperature, humidity and windy) with two possible classes: 'play' (Play = yes) and 'not play' (Play = no). Then the classification of an object can be done by comparing two probabilities, one for the 'play' class and the other for the 'not play' class, as follows.

(1) P(Play = yes | outlook, temperature, humidity, windy)
(2) P(Play = no | outlook, temperature, humidity, windy)

For the Play-Tennis dataset and the test object in Figure 3-14, the probabilities are defined by

(1') P(Play = yes | outlook = sunny, temperature = mild, humidity = high, windy = true), and
(2') P(Play = no | outlook = sunny, temperature = mild, humidity = high, windy = true).

These two probabilities are calculated and compared to find the maximum one, and the class with the maximum value is then assigned to the object.
In general, it requires a large set of existing records (samples) as a training dataset to find the estimated value of P(Play | outlook, temperature, humidity, windy). For example, from the dataset in Figure 3-14, we can calculate the estimated probabilities as follows.
Outlook Temperature Humidity Windy Play
sunny      hot    high     false   No
sunny      hot    high     true    No
overcast   hot    high     false   Yes
rainy      mild   high     false   Yes
rainy      cool   normal   false   Yes
rainy      cool   normal   true    No
overcast   cool   normal   true    Yes
sunny      mild   high     false   No
sunny      cool   normal   false   Yes
rainy      mild   normal   false   Yes
sunny      mild   normal   true    Yes
overcast   mild   high     true    Yes
overcast   hot    normal   false   Yes
rainy      mild   high     true    No
(a) The Play-Tennis dataset
Outlook Temperature Humidity Windy Play
sunny mild high true ?
(b) test data
Figure 3-14: The Play-Tennis dataset and the test data
P(Play = no  | sunny, hot, high, false) = 1
P(Play = yes | sunny, hot, high, false) = 0
P(Play = no  | sunny, hot, high, true) = 1
P(Play = yes | sunny, hot, high, true) = 0
... ...
P(Play = no  | overcast, hot, normal, false) = 0
P(Play = yes | overcast, hot, normal, false) = 1
P(Play = no  | rainy, mild, high, true) = 1
P(Play = yes | rainy, mild, high, true) = 0
As the above example shows, when a dataset is small, the following issues need to be considered.
1. In real applications with many features, it is likely that the training set does not cover all possible cases, and for the uncovered cases it is impossible to estimate their probabilities. In the above example, since there is no record (sample) matching the test example, we cannot find the following two probabilities.

P(Play = yes | sunny, mild, high, true)
P(Play = no | sunny, mild, high, true)
2. In several cases, there are not enough examples to estimate a plausible probability. In the example, there is merely a single record (sample) for each case. Under this condition, the probability for 'yes' ('no') is either 0 or 1. The issue is whether a single record is a good representative of the case or not. In statistical terms (a significance test may be used for checking this), such a case has low reliability since the lone record in the training set may appear by chance. To have a good estimation of the probabilities, we need a very large dataset. For example, suppose that the dataset has ten attributes with one class to be predicted and each attribute has two possible values. Theoretically there are 2^10 = 1024 possible combinations of these ten attributes. To have enough data for reliability, we
may need up to 20-30 cases for each combination. Therefore, we need approximately 20,000-30,000 records. This number may be feasible. However, a case of 20 binary attributes requires 2^20, approximately 1 million, possible combinations and then 20-30 million records. It is quite hard to obtain this number of records in a real situation.
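The record-count arithmetic above can be checked with a short script (assuming, as in the text, 20-30 examples per attribute combination):

```python
# number of records needed if every combination of binary attributes
# should be observed 20-30 times
for n_attrs in (10, 20):
    combos = 2 ** n_attrs
    print(n_attrs, combos, 20 * combos, 30 * combos)
# 10 attributes:     1,024 combinations -> roughly 20,480-30,720 records
# 20 attributes: 1,048,576 combinations -> roughly 21-31 million records
```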
To solve the above two issues, it is possible to apply Bayes' theorem with some independence assumptions. The strongest assumption is to suppose that all attributes are independent of each other. As mentioned previously, given a set of possible classes C = {C_1, ..., C_m}, n attribute value domains A_1, ..., A_n, and an unlabeled object x = (a_1, a_2, ..., a_n), the most plausible class of the object is the class C_i whose probability P(C_i | a_1, ..., a_n) achieves the highest value when the object is represented by the set of n attribute values:

C(x) = argmax_{C_i} P(C_i | a_1, ..., a_n)

Here, P(C_i | a_1, ..., a_n) is called the posterior probability, or a posteriori probability, of C_i conditioned on a_1, ..., a_n. Based on Bayes' rule, P(A | B) = P(B | A) P(A) / P(B), the posterior P(C_i | a_1, ..., a_n) can be derived easily. By this rule, we obtain the following equation.

P(C_i | a_1, ..., a_n) = P(a_1, ..., a_n | C_i) P(C_i) / P(a_1, ..., a_n)
From the example, the meanings of the related probabilities can be summarized below.

P(C_i | a_1, ..., a_n), e.g., P(Play = yes | outlook = sunny, temperature = mild, humidity = high, windy = true): the probability that play is 'yes' when we know that outlook is 'sunny', temperature is 'mild', humidity is 'high' and windy is 'true'.

P(C_i), e.g., P(Play = yes): the probability that play is 'yes', regardless of any condition.

P(a_1, ..., a_n | C_i), e.g., P(outlook = sunny, temperature = mild, humidity = high, windy = true | Play = yes): the probability that outlook is sunny, temperature is mild, humidity is high and windy is true when we know that play is yes.

P(a_1, ..., a_n), e.g., P(outlook = sunny, temperature = mild, humidity = high, windy = true): the probability that outlook is sunny, temperature is mild, humidity is high and windy is true, regardless of any condition.
Here, the function argmax returns only the most plausible class, and all classes share the same denominator P(a_1, ..., a_n). With this, the denominator can be ignored and the equation can be simplified as follows.

C(x) = argmax_{C_i} P(a_1, ..., a_n | C_i) P(C_i)

In this equation, instead of P(C_i | a_1, ..., a_n), the prior probability P(C_i) and the class-conditional probability P(a_1, ..., a_n | C_i) are used. The prior probability, or a priori probability, of C_i, P(C_i), is the probability of the class, regardless of the attribute values. The class-conditional probability, P(a_1, ..., a_n | C_i), is the probability that an object of the known class C_i has the attribute values a_1, ..., a_n. Here, the original probability P(C_i | a_1, ..., a_n) means the probability that, given the attributes a_1, ..., a_n, the class is expected to be C_i. For the above
equation, the number of parameters of the first component, P(C_i), equals the number of classes in consideration. In the Play-Tennis example, it is two, i.e., P(Play = yes) and P(Play = no). For the second component, P(a_1, ..., a_n | C_i), the number of its parameters is equivalent to the number of parameters in the original probability P(C_i | a_1, ..., a_n). However, unlike the original one, it is possible to transform this component into a more convenient form. To do this, it is possible to use the chain rule of joint probability.

P(a_1, ..., a_n | C_i) = P(a_1 | C_i) × P(a_2 | a_1, C_i) × ... × P(a_n | a_1, ..., a_{n-1}, C_i)

Then the class estimation can be concluded as follows.

C(x) = argmax_{C_i} P(C_i) × P(a_1 | C_i) × P(a_2 | a_1, C_i) × ... × P(a_n | a_1, ..., a_{n-1}, C_i)
For the above equation, the number of parameters of the second component, P(a_1 | C_i), equals the number of classes multiplied by the number of possible values of a_1. The third component, P(a_2 | a_1, C_i), has more parameters, equivalent to the number of classes multiplied by the number of possible values of a_1 and then multiplied by the number of possible values of a_2. The later components have still more parameters, and the last component has the most, equivalent to those of the original probability. As stated above, a very large dataset is required to estimate components with a large number of parameters.
Naïve Bayes Classifier
In order to avoid this limitation, conditional independence assumptions can be made. In the extreme case, it is possible to presume that all attributes are independent of each other given the class; in other words, the values of the attributes are conditionally independent of one another, given the class label of the object. By this assumption, the formula is simplified as follows.

C(x) = argmax_{C_i} P(C_i) × P(a_1 | C_i) × P(a_2 | C_i) × ... × P(a_n | C_i)

In this formula, each term except the first is the probability of an attribute value given only the class value. This simplification is known as naïve Bayes since it is the most intuitive and simple one. It is also possible to simplify with more complex constraints. For example, if we assume that a_3 depends on a_1, that a_4 depends on a_1 and a_2, and that the other attributes are independent of each other, the class prediction can be formulated as follows.

C(x) = argmax_{C_i} P(C_i) × P(a_1 | C_i) × P(a_2 | C_i) × P(a_3 | a_1, C_i) × P(a_4 | a_1, a_2, C_i)
Note that the fourth and the fifth terms need more parameters, while the others are the same as in naïve Bayes. This generalization can be set up manually with predefined human knowledge. The dependencies can be expressed in the form of Bayesian belief networks, graphical models which, unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes. As explained later, these Bayesian belief networks can also be used for classification. To summarize, the naïve Bayesian classifier works as follows.
1. Given a training set of objects and their associated class labels, denoted by T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, each object x is represented by an n-dimensional attribute vector x = (a_1, a_2, ..., a_n), depicting the measured values of the n attributes A_1, A_2, ..., A_n of the object, with its class y being one of the m possible classes C_1, C_2, ..., C_m.
2. The Bayesian (statistical) classifier assigns (or predicts) a class for the object x when that class has the highest posterior probability over the others, conditioned on the object's attribute values a_1, a_2, ..., a_n. That is, the Bayesian classifier predicts that the object x belongs to the class C_i if and only if

P(C_i | a_1, ..., a_n) > P(C_k | a_1, ..., a_n) for 1 ≤ k ≤ m, k ≠ i.

This can be represented by the function argmax, that is, we maximize P(C_i | a_1, ..., a_n), called the maximum a posteriori hypothesis.

C(x) = argmax_{C_i} P(C_i | a_1, ..., a_n)

By Bayes' theorem, this equation is equated to

C(x) = argmax_{C_i} P(a_1, ..., a_n | C_i) P(C_i) / P(a_1, ..., a_n)

3. Since P(a_1, ..., a_n) is constant for all classes C_i, only P(a_1, ..., a_n | C_i) P(C_i) need be maximized. The object x is assigned to the class C_i if

P(a_1, ..., a_n | C_i) P(C_i) > P(a_1, ..., a_n | C_k) P(C_k) for 1 ≤ k ≤ m, k ≠ i.
Note that the class prior probabilities may be estimated by P(C_i) = |C_{i,T}| / |T|, where |C_{i,T}| is the number of training objects belonging to the class C_i in the training set and |T| is the total number of training objects. However, if the class prior probabilities are not known, it is commonly assumed that the classes are equally likely, that is, P(C_1) = P(C_2) = ... = P(C_m). In that case, the criterion simplifies to maximizing P(a_1, ..., a_n | C_i) alone; the object x is assigned to the class C_i if

P(a_1, ..., a_n | C_i) > P(a_1, ..., a_n | C_k) for 1 ≤ k ≤ m, k ≠ i.
4. Several datasets may have a large number of attributes. In these cases, P(a_1, ..., a_n | C_i) has high complexity (it needs a large set of examples to estimate) since it may include very many parameters. In order to reduce the complexity of evaluating P(a_1, ..., a_n | C_i), the naïve assumption of class conditional independence can be made. This assumes that the values of the attributes are conditionally independent of one another, given the class label of the object (i.e., that there are no dependence relationships among the attributes). Thus,

P(a_1, ..., a_n | C_i) = Π_{j=1}^{n} P(a_j | C_i)
Normally we can easily estimate the probabilities P(a_1 | C_i), P(a_2 | C_i), ..., P(a_n | C_i) from the data in the training dataset. For classification, the class value is categorical (a label, not a continuous-valued number), while an attribute can be either categorical or continuous-valued. The computation of P(a_j | C_i) can be done as follows.
(a) If the j-th attribute is categorical, then P(a_j | C_i) is the number of objects of class C_i in the training set T having the value a_j for the j-th attribute, divided by |C_{i,T}|, the number of objects of class C_i in T, as follows.

P(a_j | C_i) = |{x ∈ C_{i,T} : x_j = a_j}| / |C_{i,T}|
(b) If the j-th attribute is continuous-valued (numeric), then P(a_j | C_i) can be calculated from a Gaussian distribution of the attribute in the class C_i with a mean μ_{ij} and a standard deviation σ_{ij}, defined as follows.

P(a_j | C_i) = g(a_j, μ_{ij}, σ_{ij}) × ε, where g(a_j, μ, σ) = (1 / (σ √(2π))) exp(-(a_j - μ)^2 / (2σ^2))

Here, g is the Gaussian density function and ε is a small slack value. There is no need to know the exact value of this slack since it is cancelled later when the probabilities are compared with each other. Given the class C_i, the class mean and the standard deviation of the j-th attribute, respectively denoted by μ_{ij} and σ_{ij}, can be derived by the following equations.

μ_{ij} = (1 / |C_{i,T}|) Σ_{x ∈ C_{i,T}} x_j
σ_{ij} = √( (1 / (|C_{i,T}| - 1)) Σ_{x ∈ C_{i,T}} (x_j - μ_{ij})^2 )

Here, x_j is the value of the j-th attribute of the object x, C_{i,T} is the set of the objects belonging to the class C_i, which is a subset of the whole training set T, and |C_{i,T}| is the number of objects in C_{i,T}.
5. In order to predict the class label of x, P(a_1, ..., a_n | C_i) P(C_i) is evaluated for each class C_i. The classifier predicts that the class label of the object x is the class C_i if and only if

P(a_1, ..., a_n | C_i) P(C_i) > P(a_1, ..., a_n | C_k) P(C_k) for 1 ≤ k ≤ m, k ≠ i.

In other words, the predicted class label is the class C_i for which P(a_1, ..., a_n | C_i) P(C_i) is the maximum.

One interesting question about naïve Bayesian classifiers is how effective they are. As stated before, the method rests on two unrealistic assumptions: (1) all attributes are equally important, and (2) all attributes are statistically independent of each other, given the class value. The second assumption means that knowledge about the value of a particular attribute does not tell us anything about the value of another attribute (if the class is known). In general, although these assumptions are almost never correct, the scheme works well in practice, and the naïve Bayesian classifier has been shown in several studies to obtain high accuracy.
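The five steps above can be sketched end-to-end on the Play-Tennis data of Figure 3-14. All four attributes are categorical, so only relative-frequency estimates are needed; no Laplacian correction is applied in this minimal version.

```python
from collections import Counter, defaultdict

# (outlook, temperature, humidity, windy) -> play, from Figure 3-14(a)
rows = [
    ("sunny","hot","high","false","No"),      ("sunny","hot","high","true","No"),
    ("overcast","hot","high","false","Yes"),  ("rainy","mild","high","false","Yes"),
    ("rainy","cool","normal","false","Yes"),  ("rainy","cool","normal","true","No"),
    ("overcast","cool","normal","true","Yes"),("sunny","mild","high","false","No"),
    ("sunny","cool","normal","false","Yes"),  ("rainy","mild","normal","false","Yes"),
    ("sunny","mild","normal","true","Yes"),   ("overcast","mild","high","true","Yes"),
    ("overcast","hot","normal","false","Yes"),("rainy","mild","high","true","No"),
]

class_count = Counter(r[-1] for r in rows)   # numerators for P(C)
cond_count = defaultdict(Counter)            # numerators for P(a_j | C)
for *attrs, label in rows:
    for j, value in enumerate(attrs):
        cond_count[label][(j, value)] += 1

def likelihood(x, label):
    # P(C) * prod_j P(a_j | C), estimated by relative frequencies
    p = class_count[label] / len(rows)
    for j, value in enumerate(x):
        p *= cond_count[label][(j, value)] / class_count[label]
    return p

test = ("sunny", "mild", "high", "true")     # the test data of Figure 3-14(b)
scores = {c: likelihood(test, c) for c in class_count}
print(max(scores, key=scores.get))  # -> No
```

For this test case the 'No' likelihood, 5/14 × 3/5 × 2/5 × 4/5 × 3/5, exceeds the 'Yes' likelihood, 9/14 × 2/9 × 4/9 × 3/9 × 3/9, so 'Play = no' is predicted.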
However, a well-known issue of these statistical approaches is the so-called sparseness problem. The problem occurs when we have a limited amount of data, so that some attribute values never occur in the training set in conjunction with some class value, or some value combinations never occur in the training set at all. This situation introduces a zero-valued probability for some event P(a_j | C_i). Such a zero makes the class prediction output a zero value for the class C_i, even when all the other terms give high probabilities; a single zero-valued term is enough. That is, without the zero probability, we might have ended up with a high probability suggesting that x belonged to class C_i. A zero probability cancels the effects of all of the other (conditional) probabilities of C_i involved in the product.
This zero-valued probability may not be realistic. It may merely be triggered because we do not have enough data for training, so the value may not be a real zero. To avoid this problem, it is possible to apply a simple trick: assume a small probability for unseen cases. This technique for probability estimation is known as the Laplacian correction or Laplace estimator, named after Pierre Laplace, a French mathematician who lived from 1749 to 1827. In this technique, we add one additional count for each condition. For example,

P(a_j = v_l | C_i) = ( |{x ∈ C_{i,T} : x_j = v_l}| + 1 ) / ( |C_{i,T}| + L )

Here, the j-th attribute can take one value from the set {v_1, v_2, ..., v_L}. In the above equation, we add one count for each attribute value. Since one is added for every attribute value, the corresponding denominator is increased by L, the number of possible values of the attribute, when the probability is calculated.
Patient No.   Blood Pressure   Protein Level   Glucose Level   Heart Beat   Diseased
              (feature #1)     (feature #2)    (feature #3)    (feature #4) (class)
 1            High             Medium          143 (H)         Slow         Positive
 2            High             High             92 (N)         Fast         Negative
 3            High             Low             150 (H)         Slow         Positive
 4            High             Low              99 (N)         Fast         Negative
 5            Normal           Low              93 (N)         Fast         Negative
 6            Normal           High             75 (N)         Slow         Negative
 7            Normal           Medium           80 (N)         Slow         Negative
 8            Low              Medium          139 (H)         Slow         Positive
 9            High             Medium          105 (H)         Slow         Positive
10            High             High             90 (N)         Fast         Negative
11            High             Low              91 (N)         Slow         Positive
12            High             Low             107 (H)         Fast         Negative
13            Normal           Low              95 (N)         Fast         Negative
14            Normal           High             96 (N)         Slow         Negative
15            Normal           Medium           81 (N)         Slow         Negative
16            Low              Medium          144 (H)         Slow         Positive
17            High             Medium          150 (H)         Slow         Positive
18            High             High             98 (N)         Fast         Negative
19            High             Low              96 (N)         Slow         Positive
20            High             Low              83 (N)         Fast         Negative
21            Normal           Low              95 (N)         Fast         Negative
22            Normal           High             98 (N)         Slow         Negative
23            Normal           Medium          105 (H)         Slow         Negative
24            Low              Medium          128 (H)         Slow         Positive
25            High             Medium          145 (H)         Slow         Positive
26            High             High             94 (N)         Fast         Negative
27            High             Low              92 (N)         Slow         Positive
28            High             Low             108 (H)         Fast         Negative
29            Normal           Low              93 (N)         Fast         Negative
30            Normal           High            109 (H)         Slow         Negative
31            Normal           Medium           95 (N)         Slow         Negative
32            Low              Medium          127 (H)         Slow         Positive

(a) A medical laboratory test dataset (four features and one class)

Blood Pressure   Protein Level   Glucose Level   Heart Beat   Diseased
Normal           High            104             Fast         ?

(b) A test case

Figure 3-15: A medical laboratory test dataset (for Glucose, H marks values > 99 and N marks values ≤ 99)
There are several variants of the Laplacian correction or Laplace estimator. Two possibilities are (1) addition of an equal small value to each attribute value based on the number of possible attribute values, in order to make the total correction become 1, and (2) addition of different small values to the attribute values based on their contributions (their proportions p_l), while maintaining the total correction of 1. These two options are depicted in the following two equations, in order.

(1) P(a_j = v_l | C_i) = ( |{x ∈ C_{i,T} : x_j = v_l}| + 1/L ) / ( |C_{i,T}| + 1 )

(2) P(a_j = v_l | C_i) = ( |{x ∈ C_{i,T} : x_j = v_l}| + p_l ) / ( |C_{i,T}| + 1 )

where p_l is the proportion of the value v_l in the whole training set and Σ_{l=1}^{L} p_l = 1.
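A minimal sketch of the estimators discussed above (the add-one version and the two variants). The check against Figure 3-16 uses the 'Blood Pressure = High, Diseased = positive' cell, where 8 of the 12 positive patients have high blood pressure and L = 3 possible values.

```python
def laplace_add_one(count_v, count_c, L):
    # classic Laplace estimator: add one count per possible attribute value
    return (count_v + 1) / (count_c + L)

def laplace_equal(count_v, count_c, L):
    # variant (1): spread a total correction of 1 equally over the L values
    return (count_v + 1 / L) / (count_c + 1)

def laplace_weighted(count_v, count_c, p_v):
    # variant (2): spread the correction according to the proportions p_v
    return (count_v + p_v) / (count_c + 1)

# Figure 3-16 cell: P(BloodPressure=High | positive) = (8 + 1/3) / (12 + 1)
print(round(laplace_equal(8, 12, 3), 3))   # 0.641
# an unseen case no longer yields a zero probability
print(laplace_equal(0, 12, 3))             # (1/3)/13, small but non-zero
```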
Figure 3-15 illustrates another example, showing a medical laboratory test dataset and a test record. Figure 3-16 shows the construction of a naïve Bayes model, in the form of a table, from this dataset. In this example, we use the Laplacian correction (Laplace estimator) for the nominal attributes, while we use the mean and the standard deviation (S.D.) to calculate the Gaussian-based probability for the numeric attribute, i.e., the Glucose level.
Counts (with the Laplacian correction of variant (1)):

Blood Pressure (feature #1)   Positive           Negative
  High                        8 + 1/3 = 8.33     8 + 1/3 = 8.33
  Normal                      0 + 1/3 = 0.33     12 + 1/3 = 12.33
  Low                         4 + 1/3 = 4.33     0 + 1/3 = 0.33

Protein Level (feature #2)    Positive           Negative
  High                        0 + 1/3 = 0.33     8 + 1/3 = 8.33
  Medium                      8 + 1/3 = 8.33     4 + 1/3 = 4.33
  Low                         4 + 1/3 = 4.33     8 + 1/3 = 8.33

Glucose Level (feature #3)
  Positive values: 143, 150, 139, 105, 91, 144, 150, 96, 128, 145, 92, 127
  Negative values: 92, 99, 93, 75, 80, 90, 107, 95, 96, 81, 98, 83, 95, 98, 105, 94, 108, 93, 109, 95

Heart Beat (feature #4)       Positive           Negative
  Fast                        0 + 1/2 = 0.5      12 + 1/2 = 12.5
  Slow                        12 + 1/2 = 12.5    8 + 1/2 = 8.5

Diseased (class)              12 + 1 = 13        20 + 1 = 21

Probabilities:

Blood Pressure                Positive           Negative
  High                        8.33/13 = 0.641    8.33/21 = 0.397
  Normal                      0.33/13 = 0.0254   12.33/21 = 0.587
  Low                         4.33/13 = 0.333    0.33/21 = 0.0157

Protein Level                 Positive           Negative
  High                        0.33/13 = 0.0254   8.33/21 = 0.397
  Medium                      8.33/13 = 0.641    4.33/21 = 0.206
  Low                         4.33/13 = 0.333    8.33/21 = 0.397

Glucose Level                 Positive           Negative
  Mean                        125.83             94.30
  S.D.                        23.40              9.30

Heart Beat                    Positive           Negative
  Fast                        0.5/13 = 0.0385    12.5/21 = 0.595
  Slow                        12.5/13 = 0.962    8.5/21 = 0.405

Diseased (class)              13/34 = 0.382      21/34 = 0.618

Figure 3-16: Construction of a naïve Bayes model for the medical laboratory dataset
In the classification process, now suppose that the following new case, also shown in Figure 3-15, is encountered.

Blood Pressure   Protein Level   Glucose Level   Heart Beat   Diseased
Normal           High            104             Fast         ?
Then the target is to predict the value of 'Diseased'. In the naïve Bayes model, all features are treated as equally important and as conditionally independent of one another given the class. In this example, blood pressure, protein level, glucose level and heart beat are assumed to be equally important and conditionally independent given the class 'Diseased'. The most likely class can be determined by selecting the class with the highest probability, as stated previously.

C(x) = argmax_C P(C) × P(BloodPressure | C) × P(Protein | C) × g(Glucose, μ_C, σ_C) × P(HeartBeat | C)

Therefore the overall likelihoods of 'Diseased = positive' (for short, 'pos') and of 'Diseased = negative' (for short, 'neg') are as follows.
Likelihood of 'pos'
= P(pos) × P(BP = Normal | pos) × P(Protein = High | pos) × g(104, 125.83, 23.40) × P(HeartBeat = Fast | pos)
= 0.382 × 0.0254 × 0.0254 × 0.0110 × 0.0385 ≈ 1.05 × 10^-7

Likelihood of 'neg'
= P(neg) × P(BP = Normal | neg) × P(Protein = High | neg) × g(104, 94.30, 9.30) × P(HeartBeat = Fast | neg)
= 0.618 × 0.587 × 0.397 × 0.0249 × 0.595 ≈ 2.13 × 10^-3

According to the above calculation, we can observe that the likelihood of 'negative' is much higher than that of 'positive' in this case. Therefore, the case should be assigned 'negative'. Moreover, after the normalization, the probabilities for 'positive' and 'negative' are as follows. Here, X means the current evidence, i.e., blood pressure, protein level, glucose level and heart beat.

P(positive|X) = 1.05 × 10^-7 / (1.05 × 10^-7 + 2.13 × 10^-3) = 0.00005
P(negative|X) = 2.13 × 10^-3 / (1.05 × 10^-7 + 2.13 × 10^-3) = 0.99995
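The whole naïve Bayes calculation for this test case can be reproduced from the Figure 3-16 table; the slack ε is omitted since it cancels in the normalization.

```python
import math

def gaussian(x, mu, sigma):
    # density of N(mu, sigma^2) at x; the slack factor cancels on normalization
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# parameters read off Figure 3-16 (Laplacian-corrected for nominal attributes)
model = {
    "pos": {"prior": 13 / 34, "bp_normal": 0.33 / 13, "protein_high": 0.33 / 13,
            "mu": 125.83, "sd": 23.40, "hb_fast": 0.5 / 13},
    "neg": {"prior": 21 / 34, "bp_normal": 12.33 / 21, "protein_high": 8.33 / 21,
            "mu": 94.30, "sd": 9.30, "hb_fast": 12.5 / 21},
}

# test case: Blood Pressure=Normal, Protein=High, Glucose=104, Heart Beat=Fast
def class_likelihood(m):
    return (m["prior"] * m["bp_normal"] * m["protein_high"]
            * gaussian(104, m["mu"], m["sd"]) * m["hb_fast"])

lik = {c: class_likelihood(m) for c, m in model.items()}
total = sum(lik.values())
print(round(lik["pos"] / total, 5))  # 0.00005
print(round(lik["neg"] / total, 5))  # 0.99995
```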
In conclusion, the naïve Bayes classification method is based on Bayes' rule and "naïvely" assumes independence; indeed, it is only valid to multiply probabilities when the events are independent. The assumption that attributes are independent (given the class) is certainly a simplistic one in real life. However, despite the disparaging name, naïve Bayes works very well when tested on actual datasets, particularly when combined with attribute selection procedures that eliminate redundant (and hence dependent) attributes. One special treatment is to apply the Laplacian correction (Laplace estimator) to solve the problem of zero probabilities caused by the limitations of the training dataset.
Bayesian Belief Networks
As mentioned above, while the naïve Bayesian classifier simplifies the calculation process by
making the assumption of class conditional independence (i.e., given the class label of a tuple, the
values of the attributes are assumed to be conditionally independent of one another), in practice dependencies can exist between attributes or features (variables). For this purpose, Bayesian belief networks, as more general models, provide a framework to specify joint conditional probability distributions among attributes. They allow class conditional independencies to be defined between subsets of attributes, and they can be expressed as a graphical model of causal relationships, on which learning can be performed. In place of naïve Bayes classification, trained Bayesian belief networks can be used for classification. Bayesian belief networks are also known as belief networks, Bayesian networks, and probabilistic networks. A belief network is defined by two components: a directed acyclic graph and a set of
networks. A belief network is defined by two components, a directed acyclic graph and a set of
conditional probability tables. Each node in the directed acyclic graph represents a random
variable (attribute). The variables (attributes) may be discrete or continuous-valued. Each edge
(arc) represents a probabilistic dependence, represented by a so-called conditional probability
table (CPT). If an edge is drawn from a node P to a node Q, then P is a parent or immediate
predecessor of Q, and Q is a descendant of P. Each variable is conditionally independent of its
non-descendants in the graph, given its parents.
Figure 3-17: Attribute dependency graph (a kind of Bayesian Network) with probabilities
simply calculated from the medical laboratory data in Figure 3-15.
Figure 3-17 illustrates a sample dependency network for the medical laboratory data, where 'diseased' may affect 'protein level' and 'glucose level', the combination of 'diseased' and 'protein level' may affect 'heart beat', and the combination of 'diseased' and 'glucose level' may affect 'blood pressure'. Note that there is a conditional probability table (CPT) for each dependent node and a probability table for each root node; intermediate and leaf nodes carry no separate unconditional table. Figure 3-18 is the same Bayesian network with the Laplacian correction applied. For statistical reasoning, the most likely class can be determined by selecting the class with the highest probability as follows.
Sponsored by AIAT.or.th and KINDML, SIIT
![Page 25: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive](https://reader033.vdocuments.site/reader033/viewer/2022050419/5f8f191697ef577c8d13d386/html5/thumbnails/25.jpg)
84
Tailored to this example, with the dependencies of Figure 3-17, the following factorization can be assumed.

C(x) = argmax_C P(C) × P(Protein | C) × P(Glucose | C) × P(HeartBeat | Protein, C) × P(BloodPressure | Glucose, C)
Figure 3-18: Attribute dependency graph (a kind of Bayesian Network) with Laplacian correction
According to the dependency defined in Figure 3-17 and Figure 3-18, it is possible to ignore some variables in the conditional part, as follows.

P(C) × P(BP, Protein, Glucose, HB | C)
= P(C) × P(Protein | C) × P(Glucose | Protein, C) × P(HB | Protein, Glucose, C) × P(BP | Protein, Glucose, HB, C)
≈ P(C) × P(Protein | C) × P(Glucose | C) × P(HB | Protein, C) × P(BP | Glucose, C)

Note that the third, fourth and fifth terms are approximated by a reduced representation when we assume the attribute dependency graph. Next, in the classification process, now suppose that the previously mentioned case is encountered, as follows.
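Given the reduced factorization, classification is again a single product per class. The CPT values of Figures 3-17 and 3-18 are not reproduced in this text, so the sketch below uses hypothetical numbers (they are not the figures' values) purely to show the mechanics of belief-network classification.

```python
# hypothetical probability tables; the real values come from Figure 3-18
prior = {"pos": 0.4, "neg": 0.6}                            # P(Diseased)
p_protein = {("high", "pos"): 0.1, ("high", "neg"): 0.4}    # P(Protein | Diseased)
p_glucose = {("high", "pos"): 0.8, ("high", "neg"): 0.2}    # P(Glucose | Diseased)
p_hb = {("fast", "high", "pos"): 0.2,                       # P(HB | Protein, Diseased)
        ("fast", "high", "neg"): 0.6}
p_bp = {("normal", "high", "pos"): 0.3,                     # P(BP | Glucose, Diseased)
        ("normal", "high", "neg"): 0.4}

def bn_likelihood(c, bp="normal", protein="high", glucose="high", hb="fast"):
    # P(C) P(Protein|C) P(Glucose|C) P(HB|Protein,C) P(BP|Glucose,C)
    return (prior[c] * p_protein[(protein, c)] * p_glucose[(glucose, c)]
            * p_hb[(hb, protein, c)] * p_bp[(bp, glucose, c)])

lik = {c: bn_likelihood(c) for c in ("pos", "neg")}
total = sum(lik.values())
print({c: round(v / total, 4) for c, v in lik.items()})
```

With the real CPTs of Figure 3-18, the same product yields the probabilities reported in the text below.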
Blood Pressure   Protein Level   Glucose Level   Heart Beat   Diseased
Normal           High            High            Fast         ?
For this setting, the overall likelihoods of 'Diseased = positive' (for short, 'pos') and of 'Diseased = negative' (for short, 'neg') can be calculated as follows.

Likelihood of 'pos' = P(pos) × P(Protein = High | pos) × P(Glucose = High | pos) × P(HeartBeat = Fast | Protein = High, pos) × P(BP = Normal | Glucose = High, pos)

Likelihood of 'neg' = P(neg) × P(Protein = High | neg) × P(Glucose = High | neg) × P(HeartBeat = Fast | Protein = High, neg) × P(BP = Normal | Glucose = High, neg)

According to the above calculation with the tables of Figure 3-18, the likelihood of 'positive' is considerably higher than that of 'negative' in this case. Therefore, the case should be assigned 'positive'. Moreover, after the normalization, the probabilities for 'positive' and 'negative' are as follows. Here, X means the current evidence, i.e., blood pressure, protein level, glucose level and heart beat.

P(positive|X) = 0.7892
P(negative|X) = 0.2108
For comparison, the result of the naïve Bayes (NB) calculation is listed again below.

P(positive|X) = 0.00005
P(negative|X) = 0.99995
We can observe that the two results contradict each other. Two possible factors in this phenomenon are (1) the treatment of the 'Glucose' attribute as a nominal or a numeric attribute (discrete or continuous-valued), and (2) the different settings of attribute dependency. For the first factor, although it depends on an individual decision, the numeric treatment may be more realistic for a genuinely numeric attribute. For the second factor, more structural information can help us obtain more precise information for determining which class the object should belong to.
In the above example, we have shown a simple method to calculate a Bayesian belief network from a given dataset. However, the learning or training of a belief network may arise in several different situations, as shown next. If the network topology (i.e., the layout of nodes and arcs) is known and all the variables are observable, then training the network is straightforward: the training process can be performed just like the calculation of the probabilities in naïve Bayesian classification.
The above example is such a case. On the other hand, if the network topology is given in advance but some of the network variables are hidden (giving missing values or incomplete data), there are various methods to choose from for training the belief network. One promising method is gradient descent. Without an advanced mathematical background, it may be hard to follow the full description of gradient descent, since it involves calculus-heavy formulae. However, packaged software exists to solve these equations without requiring a deep derivation. In the following, the general idea, which is not so difficult, is described.
Let D = {X_1, X_2, ..., X_|D|} be a training set of data tuples. Training the belief network means that we must learn the values of the CPT entries. Let w_ijk = P(Y_i = y_ij | U_i = u_ik) be a CPT entry for the variable Y_i = y_ij having the parents U_i = u_ik. For example, if w_ijk is the upper leftmost CPT entry of Figure 3-17, then Y_i is "Protein Level"; y_ij is its value, either "low", "medium" or "high"; U_i is the list of the parent nodes of Y_i, in this case "Diseased"; and u_ik gives the values of the parent nodes, i.e., either "positive" or "negative". The w_ijk are viewed as weights, analogous to the weights in the hidden units of neural networks. The set of weights is collectively referred to as W. The weights are initialized to random probability values. A gradient descent strategy performs greedy hill-climbing: at each iteration, the weights are updated and eventually converge to a local optimum solution. The strategy searches for the values of W that best model the data, based on the assumption that each possible setting of W is equally likely. Such a strategy is iterative; it searches for a solution along the negative of the gradient (i.e., steepest descent) of a criterion function, and what we need is to find the set of weights W that maximizes this function.
The gradient descent method performs greedy hill-climbing in that, at each iteration or step
along the way, the algorithm moves toward what appears to be the best solution at the moment,
without backtracking. The weights are updated at each iteration. Eventually, they converge to a
local optimum solution. For our problem, we maximize or
. Given the network topology and initialized , the algorithm does as follows:
1. Compute the gradients: For each i, j, k, compute
   ∂ln P_w(D)/∂w_ijk = Σ_{d=1..|D|} P(Y_i = y_ij, U_i = u_ik | X_d) / w_ijk
   The probability on the right-hand side is to be calculated for each training tuple, X_d, in D.
   For brevity, let us refer to this probability simply as p. When the variables represented by
   Y_i and U_i are hidden for some X_d, then the corresponding probability p can be computed
   from the observed variables of the tuple using standard algorithms for Bayesian network
   inference, available online.
2. Take a small step in the direction of the gradient: The weights are updated by
   w_ijk ← w_ijk + l × (∂ln P_w(D)/∂w_ijk)
   where l is the learning rate representing the step size and ∂ln P_w(D)/∂w_ijk is computed in the
   first step. The learning rate is set to a small constant and helps with convergence.
3. Renormalize the weights: Because the weights are probability values, they must be
   between 0.0 and 1.0, and Σ_j w_ijk must equal 1 for all i, k. These criteria are achieved by
   renormalizing the weights after they have been updated in the second step.
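The three steps above can be sketched as a short update loop. This is an illustrative sketch only: `posterior` is a hypothetical placeholder for a Bayesian-network inference routine returning P(Y_i = y_ij, U_i = u_ik | X_d), and the dictionary layout of the CPT weights is an assumption of this sketch.

```python
def train_cpt(weights, data, posterior, lr=0.01, n_iter=50):
    # weights[(i, k)] is the list over j of CPT entries w_ijk (sums to 1).
    # posterior(d, i, j, k) is assumed to return P(Yi = yij, Ui = uik | Xd),
    # e.g. via a standard Bayesian-network inference routine.
    for _ in range(n_iter):
        # Step 1: gradient of ln P_w(D) with respect to each w_ijk
        grads = {key: [0.0] * len(w) for key, w in weights.items()}
        for d in data:
            for (i, k), w in weights.items():
                for j in range(len(w)):
                    grads[(i, k)][j] += posterior(d, i, j, k) / w[j]
        # Step 2: take a small step along the gradient
        for key, w in weights.items():
            for j in range(len(w)):
                w[j] += lr * grads[key][j]
        # Step 3: renormalize each CPT row so the entries stay probabilities
        for key, w in weights.items():
            total = sum(w)
            weights[key] = [x / total for x in w]
    return weights
```

Each pass accumulates the gradient over all tuples (step 1), nudges every weight along it (step 2), and renormalizes each CPT row (step 3).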
Algorithms that follow this form of learning are called adaptive probabilistic networks. Other
methods for training belief networks are referenced in the bibliographic notes at the end of this
chapter. Training belief networks is computationally intensive. Because belief networks provide explicit
representations of causal structure, a human expert can provide prior knowledge to the training
process in the form of network topology and/or conditional probability values. This can
significantly improve the learning rate.
3.1.4. Decision Trees
Also known as a classification tree for a discrete outcome and a regression tree for a continuous
outcome, a decision tree is a tree-like graph or model used for predicting the class (consequence or
outcome) of an event based on observed properties of that event. Commonly used for decision
analysis in operations research to help identify an optimal action towards a goal, a decision tree is
a predictive model that maps observations about an event to conclusions on its target value.
In general, classification using a decision tree has high accuracy, but the performance usually
depends on the characteristics of the data at hand. Decision tree induction algorithms have
been used for classification in many application areas, such as medicine, manufacturing and
production, financial analysis, astronomy, and molecular biology. In general, a decision tree
consists of three components: (1) outcome nodes (rectangles), (2) decision criterion nodes
(ovals), and (3) decision branches (lines), as shown in Figure 3-19.
Figure 3-19: A decision tree (the Play-Tennis data)
The leaf nodes represent the classification (decision) outcome, the root and the intermediate nodes
express a decision criterion, and the branches under a node indicate the possible values of the
decision criterion of that node. As with any classification technique in machine learning or
data mining, there are basically two general phases for a decision tree: (1) learning and (2) classification.
In the first phase, given a data set, one creates a decision tree by creating nodes one by one
from the root toward the leaves. In the second phase, the obtained tree can be
used to classify unknown or unseen data. While the first phase is complex, the second
phase is very simple. For example, given the decision tree in Figure 3-19, the following case is
classified as ‘Yes’, as depicted by the SOLID arrow in Figure 3-20. In this case, the branches
‘Outlook = sunny’ and then ‘Humidity = normal’ are followed.
[Figure: the Play-Tennis decision tree: root ‘Outlook’ with branches sunny (to ‘Humidity’: high → No, normal → Yes), overcast (→ Yes), and rainy (to ‘Windy’: false → Yes, true → No)]
Outlook Temp. Humidity Windy Play
sunny hot normal false ?
(a) The test datum
(b) Usage of the decision tree to classify the test datum (the answer is ‘Yes’)
Figure 3-20: Usage of the decision tree to classify the test datum
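The classification phase is easy to sketch. Here the tree of Figure 3-19 is encoded as nested dictionaries (one illustrative representation among many) and walked from the root to a leaf:

```python
# The Play-Tennis tree of Figure 3-19 as nested dicts: each inner node maps
# an attribute name to a dict of {branch value: subtree or leaf class}.
tree = {"Outlook": {
    "sunny":    {"Humidity": {"high": "No", "normal": "Yes"}},
    "overcast": "Yes",
    "rainy":    {"Windy": {"false": "Yes", "true": "No"}},
}}

def classify(node, datum):
    while isinstance(node, dict):
        attr = next(iter(node))          # decision criterion at this node
        node = node[attr][datum[attr]]   # follow the matching branch
    return node                          # a leaf: the predicted class

datum = {"Outlook": "sunny", "Temp.": "hot", "Humidity": "normal", "Windy": "false"}
print(classify(tree, datum))  # → Yes
```

The test datum above follows ‘Outlook = sunny’ and then ‘Humidity = normal’, reaching the leaf ‘Yes’.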
To learn a decision tree model from a training data set, an attribute selection measure is
used to choose the attribute that best splits the tuples into distinct classes. Popular
measures of attribute selection are information gain and gain ratio, described later. Since
construction of a decision tree may introduce branches that merely reflect noise or
outliers in the training data, tree pruning may be needed to identify and remove such branches.
This can be done by investigating the improvement of classification accuracy on unseen
data. The technique for constructing a decision tree from data is called decision tree learning or
decision tree induction. Decision trees have several advantages, as follows.
Decision trees are simple to interpret and understand. We can understand decision tree
models after a brief explanation.
Decision trees are a white-box model. They can provide a result together with an explicit
explanation of why that result was determined.
Decision Tree Induction
Decision tree induction consists of recursive steps, as follows. First, one selects the
attribute that best partitions the training data set, places it at the root node, and then makes one
branch for each possible value. By this, the training data set is split up into subsets, one for every
value of the attribute. In the same manner, the process is repeated recursively for each branch,
using only those instances that reach the branch. As a termination criterion, one can stop
splitting branches when all instances at a node possess the same class (a pure node). However, in
real situations we often cannot obtain such a pure node, or, if we keep splitting without stopping,
we end up with leaf nodes containing only one instance. Such a situation is called
overfitting, and it is not preferable since nodes with too few instances are not reliable. As a
solution, a pruning process is needed.
In conclusion, decision tree induction copes with two important issues: (1) how to
determine which attribute to split on at each step and (2) how to prevent the overfitting
problem. Decision tree induction works as follows. Given a
training set of objects and their associated class labels, denoted by T = {(X_1, y_1), ..., (X_N, y_N)}, each
object is represented by an n-dimensional attribute vector X = (x_1, x_2, ..., x_n), depicting
the measured values of the n attributes, A_1, A_2, ..., A_n, of the object, together with its class y, one from m
possible classes, C_1, C_2, ..., C_m. Here, suppose that the attribute A under consideration has v
possible values a_1, a_2, ..., a_v; that is, A ∈ {a_1, a_2, ..., a_v}.
1. Select the best attribute for the first node in order to split the training set into a number of
subsets. For this attribute selection, the two most popular criteria are information gain and gain
ratio, though other possible criteria include a minimal occupation of the node, a
maximal depth of the tree, or a threshold value for the information gain (or gain ratio).
Figure 3-21 shows that a training set T of objects is split into several subsets when a node
is selected for splitting. The formalism of information gain and gain ratio is as follows.
Here, C_k is the k-th class, T is the training set before splitting, T^(k) is the set of the instances
with the class C_k in the set T, T_i is a subset of the training set after splitting, containing the
objects which have the value a_i for the attribute A, that is T_i = {X ∈ T | A = a_i}, and T_i^(k) is the set of
the instances with the class C_k in the subset T_i. With this notation, |T| is the total number
of instances in the training set before splitting, |T^(k)| is the number of class-k instances in the
set T, |T_i| is the number of instances in the subset which have A = a_i, and |T_i^(k)| is the
number of class-k instances in the subset T_i.
Information gain:
   Gain(A) = Info(T) - Info_A(T)
   where Info(T) = - Σ_k (|T^(k)|/|T|) log2(|T^(k)|/|T|) and Info_A(T) = Σ_i (|T_i|/|T|) Info(T_i)
Gain ratio:
   GainRatio(A) = Gain(A) / SplitInfo(A)
   where SplitInfo(A) = - Σ_i (|T_i|/|T|) log2(|T_i|/|T|)
Figure 3-21: A training set is split into subsets when a node is selected for splitting.
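The two measures translate directly into code. Below is a small sketch (the helper names are ours, not the book’s) that computes Info(T), the expected information after a split, the split information, and hence the gain and gain ratio for a single attribute:

```python
from math import log2
from collections import Counter

def info(labels):
    # Info(T): entropy of the class distribution, in bits
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(values, labels):
    # values[i] is the attribute value of instance i; labels[i] is its class
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    expected = sum(len(s) / n * info(s) for s in subsets.values())      # Info_A(T)
    split = -sum(len(s) / n * log2(len(s) / n) for s in subsets.values())
    gain = info(labels) - expected
    return gain, (gain / split if split > 0 else 0.0)
```

For instance, for the ‘Blood Pressure’ split of the worked example below (16 High instances with an 8/8 class mix, plus 12 Normal and 4 Low instances that are each pure), it returns Gain ≈ 0.4544 and GainRatio ≈ 0.3233.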
2. Repeat the above process on each subset by selecting another attribute to perform further
splitting. This process terminates when the subset includes too few instances or only
instances from one class (a pure subset), or when some other termination criterion is satisfied.
Figure 3-22 shows an example of the iterative process for constructing a decision tree,
starting from the root node towards the leaf nodes. The final tree is constructed as in Figure 3-23.
(a) The best attribute is selected as the root node and then a number of branches are
generated according to its possible attribute values.
(b) A node is created for each branch, and those nodes are split further if they are
neither of one class nor left with too few instances.
(c) Each succeeding node is split further until it includes only instances from one class
or too few instances.
Figure 3-22: A tree is constructed from the root by splitting a node repetitively.
Figure 3-23: The final tree is constructed with each leaf node assigned with a class
label, according to the majority class of the instances at that node.
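The recursive procedure illustrated in Figures 3-21 to 3-23 can be written compactly. The following is an ID3-style sketch under simplifying assumptions: information gain is the selection criterion, pure nodes become leaves, and pruning and other termination criteria are omitted:

```python
from math import log2
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    # rows: list of attribute tuples; labels: the class of each row;
    # attrs: indices of the attributes still available for splitting
    if len(set(labels)) == 1:                 # pure node: make a leaf
        return labels[0]
    if not attrs:                             # no attribute left: majority class
        return Counter(labels).most_common(1)[0][0]
    def gain(a):
        parts = {}
        for r, y in zip(rows, labels):
            parts.setdefault(r[a], []).append(y)
        return info(labels) - sum(len(s) / len(labels) * info(s)
                                  for s in parts.values())
    best = max(attrs, key=gain)               # best attribute for this node
    branches = {}
    for v in {r[best] for r in rows}:         # one branch per attribute value
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        srows, slabels = zip(*sub)
        branches[v] = build_tree(list(srows), list(slabels),
                                 [a for a in attrs if a != best])
    return (best, branches)
```

Applied to the health-checkup data of Table 3-1 below, this sketch produces the same tree as the worked example: ‘Blood Pressure’ at the root and ‘Heart Beat’ under its ‘High’ branch.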
To be more concrete, the following example shows a complete decision tree induction process on a
health-checkup database, where either information gain or gain ratio is applied as the splitting
criterion. This example works through both information gain and gain ratio, but in practice we select either one
of them. Given the following table (Table 3-1), the process is enumerated.
Table 3-1: Patient health-checkup data
Patient Blood Pressure Protein Level Glucose Level Heart Beat Diseased
1 High Medium High Slow Positive
2 High High Medium Fast Negative
3 High Low Medium Slow Positive
4 High Low High Fast Negative
5 Normal Low Medium Fast Negative
6 Normal High Very High Slow Negative
7 Normal Medium Very High Slow Negative
8 Low Medium Very High Slow Positive
9 High Medium High Slow Positive
10 High High Medium Fast Negative
11 High Low Medium Slow Positive
12 High Low High Fast Negative
13 Normal Low Medium Fast Negative
14 Normal High Very High Slow Negative
15 Normal Medium Very High Slow Negative
16 Low Medium Very High Slow Positive
17 High Medium High Slow Positive
18 High High Medium Fast Negative
19 High Low Medium Slow Positive
20 High Low High Fast Negative
21 Normal Low Medium Fast Negative
22 Normal High Very High Slow Negative
23 Normal Medium Very High Slow Negative
24 Low Medium Very High Slow Positive
25 High Medium High Slow Positive
26 High High Medium Fast Negative
27 High Low Medium Slow Positive
28 High Low High Fast Negative
29 Normal Low Medium Fast Negative
30 Normal High Very High Slow Negative
31 Normal Medium Very High Slow Negative
32 Low Medium Very High Slow Positive
1. Select the best attribute to be the root node of the decision tree. Here, we test each attribute one by one, starting from the first attribute ‘Blood Pressure’.
Patient Blood P. Protein L. Glucose L. Heart Beat Diseased
2 High High Medium Fast Negative
4 High Low High Fast Negative
10 High High Medium Fast Negative
12 High Low High Fast Negative
18 High High Medium Fast Negative
20 High Low High Fast Negative
26 High High Medium Fast Negative
28 High Low High Fast Negative
1 High Medium High Slow Positive
3 High Low Medium Slow Positive
9 High Medium High Slow Positive
11 High Low Medium Slow Positive
17 High Medium High Slow Positive
19 High Low Medium Slow Positive
25 High Medium High Slow Positive
27 High Low Medium Slow Positive
5 Normal Low Medium Fast Negative
6 Normal High Very High Slow Negative
7 Normal Medium Very High Slow Negative
13 Normal Low Medium Fast Negative
14 Normal High Very High Slow Negative
15 Normal Medium Very High Slow Negative
21 Normal Low Medium Fast Negative
22 Normal High Very High Slow Negative
23 Normal Medium Very High Slow Negative
29 Normal Low Medium Fast Negative
30 Normal High Very High Slow Negative
31 Normal Medium Very High Slow Negative
8 Low Medium Very High Slow Positive
16 Low Medium Very High Slow Positive
24 Low Medium Very High Slow Positive
32 Low Medium Very High Slow Positive
Attribute Value Information
Blood Pressure High Info([8,8]) = entropy(8/16, 8/16)
= -(8/16)xlog2(8/16) - (8/16)xlog2(8/16) = 1.0000 bits
Normal Info([12,0]) = entropy(12/12, 0/12)
= -(12/12)xlog2(12/12) - (0/12)xlog2(0/12) = 0.0000 bits
Low Info([0,4]) = entropy(0/4, 4/4)
= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits
Expected information for “Blood Pressure”
= Info([8,8],[12,0],[0,4]) = (16/32x1.0000) + (12/32x0.0000)+(4/32x0.0000)= 0.5000 bits
Information gain for “Blood Pressure”
= Info([12,20]) - Info([8,8],[12,0],[0,4])
= 0.9544-0.5000 = 0.4544 bits
Split Info for “Blood Pressure” (Intrinsic_Info for “Blood Pressure”)
= Info([16,12,4])
= -(16/32)xlog2(16/32) – (12/32)xlog2(12/32) – (4/32)xlog2(4/32)
= 0.5000+0.5306+0.3750 = 1.4056 bits
Gain ratio for “Blood Pressure”
= Information gain (“Blood Pressure”) / Split Info (“Blood Pressure”)
= 0.4544 /1.4056
= 0.3233
[Diagram: a node ‘Blood Pressure’ with branches High, Normal, Low]
2. Next, calculate the information gain or gain ratio for “Protein Level”.
Patient Blood P. Protein L. Glucose L. Heart Beat Diseased
2 High High Medium Fast Negative
10 High High Medium Fast Negative
18 High High Medium Fast Negative
26 High High Medium Fast Negative
6 Normal High Very High Slow Negative
14 Normal High Very High Slow Negative
22 Normal High Very High Slow Negative
30 Normal High Very High Slow Negative
7 Normal Medium Very High Slow Negative
15 Normal Medium Very High Slow Negative
23 Normal Medium Very High Slow Negative
31 Normal Medium Very High Slow Negative
1 High Medium High Slow Positive
8 Low Medium Very High Slow Positive
9 High Medium High Slow Positive
16 Low Medium Very High Slow Positive
17 High Medium High Slow Positive
24 Low Medium Very High Slow Positive
25 High Medium High Slow Positive
32 Low Medium Very High Slow Positive
4 High Low High Fast Negative
5 Normal Low Medium Fast Negative
12 High Low High Fast Negative
20 High Low High Fast Negative
28 High Low High Fast Negative
13 Normal Low Medium Fast Negative
21 Normal Low Medium Fast Negative
29 Normal Low Medium Fast Negative
3 High Low Medium Slow Positive
11 High Low Medium Slow Positive
19 High Low Medium Slow Positive
27 High Low Medium Slow Positive
Attribute Value Information
Protein Level High Info([8,0]) = entropy(8/8, 0/8)
= -(8/8)xlog2(8/8) - (0/8)xlog2(0/8) = 0.0000 bits
Medium Info([4,8]) = entropy(4/12, 8/12)
= -(4/12)xlog2(4/12) - (8/12)xlog2(8/12) = 0.9183 bits
Low Info([8,4]) = entropy(8/12, 4/12)
= -(8/12)xlog2(8/12) - (4/12)xlog2(4/12) = 0.9183 bits
Expected information for “Protein Level”
= Info([8,0],[4,8],[8,4]) = (8/32x0.0000) + (12/32x0.9183)+(12/32x0.9183)= 0.6887 bits
Information gain for “Protein Level”
= Info([12,20]) - Info([8,0],[4,8],[8,4])
= 0.9544-0.6887 = 0.2657 bits
Split Info for “Protein Level” (Intrinsic_Info for “Protein Level”)
= Info([8,12,12])
= -(8/32)xlog2(8/32) – (12/32)xlog2(12/32) – (12/32)xlog2(12/32)
= 0.5000+0.5306+0.5306 = 1.5613 bits
Gain ratio for “Protein Level”
= Information gain (“Protein Level”) / Split Info (“Protein Level”)
= 0.2657 /1.5613
= 0.1702
[Diagram: a node ‘Protein Level’ with branches High, Medium, Low]
3. Next, calculate the information gain or gain ratio for “Glucose Level”.
Patient Blood P. Protein L. Glucose L. Heart Beat Diseased
6 Normal High Very High Slow Negative
7 Normal Medium Very High Slow Negative
14 Normal High Very High Slow Negative
15 Normal Medium Very High Slow Negative
22 Normal High Very High Slow Negative
23 Normal Medium Very High Slow Negative
30 Normal High Very High Slow Negative
31 Normal Medium Very High Slow Negative
8 Low Medium Very High Slow Positive
16 Low Medium Very High Slow Positive
24 Low Medium Very High Slow Positive
32 Low Medium Very High Slow Positive
4 High Low High Fast Negative
12 High Low High Fast Negative
20 High Low High Fast Negative
28 High Low High Fast Negative
1 High Medium High Slow Positive
9 High Medium High Slow Positive
17 High Medium High Slow Positive
25 High Medium High Slow Positive
2 High High Medium Fast Negative
5 Normal Low Medium Fast Negative
10 High High Medium Fast Negative
13 Normal Low Medium Fast Negative
18 High High Medium Fast Negative
21 Normal Low Medium Fast Negative
26 High High Medium Fast Negative
29 Normal Low Medium Fast Negative
3 High Low Medium Slow Positive
11 High Low Medium Slow Positive
19 High Low Medium Slow Positive
27 High Low Medium Slow Positive
Attribute Value Information
Glucose Level Very High Info([8,4]) = entropy(8/12, 4/12)
= -(8/12)xlog2(8/12) - (4/12)xlog2(4/12) = 0.9183 bits
High Info([4,4]) = entropy(4/8, 4/8)
= -(4/8)xlog2(4/8) - (4/8)xlog2(4/8) = 1.0000 bits
Medium Info([8,4]) = entropy(8/12, 4/12)
= -(8/12)xlog2(8/12) - (4/12)xlog2(4/12) = 0.9183 bits
Expected information for “Glucose Level”
= Info([8,4],[4,4],[8,4]) = (12/32x0.9183) + (8/32x1.0000)+(12/32x0.9183)= 0.9387 bits
Information gain for “Glucose Level”
= Info([12,20]) - Info([8,4],[4,4],[8,4])
= 0.9544-0.9387 = 0.0157 bits
Split Info for “Glucose Level” (Intrinsic_Info for “Glucose Level”)
= Info([12,8,12])
= -(12/32)xlog2(12/32) – (8/32)xlog2(8/32) – (12/32)xlog2(12/32)
= 0.5306+0.5000+0.5306 = 1.5613 bits
Gain ratio for “Glucose Level”
= Information gain (“Glucose Level”) / Split Info (“Glucose Level”)
= 0.0157 /1.5613
= 0.0101
[Diagram: a node ‘Glucose Level’ with branches Very High, High, Medium]
4. Next, calculate the information gain or gain ratio for “Heart Beat”.
Patient Blood P. Protein L. Glucose L. Heart Beat Diseased
6 Normal High Very High Slow Negative
7 Normal Medium Very High Slow Negative
14 Normal High Very High Slow Negative
15 Normal Medium Very High Slow Negative
22 Normal High Very High Slow Negative
23 Normal Medium Very High Slow Negative
30 Normal High Very High Slow Negative
31 Normal Medium Very High Slow Negative
1 High Medium High Slow Positive
3 High Low Medium Slow Positive
8 Low Medium Very High Slow Positive
9 High Medium High Slow Positive
11 High Low Medium Slow Positive
16 Low Medium Very High Slow Positive
17 High Medium High Slow Positive
19 High Low Medium Slow Positive
24 Low Medium Very High Slow Positive
25 High Medium High Slow Positive
27 High Low Medium Slow Positive
32 Low Medium Very High Slow Positive
2 High High Medium Fast Negative
4 High Low High Fast Negative
5 Normal Low Medium Fast Negative
10 High High Medium Fast Negative
12 High Low High Fast Negative
13 Normal Low Medium Fast Negative
18 High High Medium Fast Negative
20 High Low High Fast Negative
21 Normal Low Medium Fast Negative
26 High High Medium Fast Negative
28 High Low High Fast Negative
29 Normal Low Medium Fast Negative
Attribute Value Information
Heart Beat Slow Info([8,12]) = entropy(8/20, 12/20)
= -(8/20)xlog2(8/20) - (12/20)xlog2(12/20) = 0.9710 bits
Fast Info([12,0]) = entropy(12/12, 0/12)
= -(12/12)xlog2(12/12) - (0/12)xlog2(0/12) = 0.0000 bits
Expected information for “Heart Beat”
= Info([8,12],[12,0]) = (20/32x0.9710) + (12/32x0.0000) = 0.6068 bits
Information gain for “Heart Beat”
= Info([12,20]) - Info([8,12],[12,0])
= 0.9544 - 0.6068 = 0.3476 bits
Split Info for “Heart Beat” (Intrinsic_Info for “Heart Beat”)
= Info([20,12])
= -(20/32)xlog2(20/32) – (12/32)xlog2(12/32)
= 0.4238+0.5306 = 0.9544 bits
Gain ratio for “Heart Beat”
= Information gain (“Heart Beat”) / Split Info (“Heart Beat”)
= 0.3476 /0.9544
= 0.3642
[Diagram: a node ‘Heart Beat’ with branches Slow, Fast]
Summary of information gain and gain ratio for the first node.
Attribute        Info    Info Gain  Split Info  Gain Ratio
Blood Pressure   0.5000  0.4544     1.4056      0.3233
Protein Level    0.6887  0.2657     1.5613      0.1702
Glucose Level    0.9387  0.0157     1.5613      0.0101
Heart Beat       0.6068  0.3476     0.9544      0.3642
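As a cross-check, the summary can be reproduced programmatically. The sketch below encodes Table 3-1 (its eight distinct patient profiles, each appearing four times) and recomputes the gain and gain ratio of every attribute:

```python
from math import log2
from collections import Counter

# Table 3-1: eight patient profiles, each repeated four times (32 rows)
base = [
    ("High",   "Medium", "High",      "Slow", "Positive"),
    ("High",   "High",   "Medium",    "Fast", "Negative"),
    ("High",   "Low",    "Medium",    "Slow", "Positive"),
    ("High",   "Low",    "High",      "Fast", "Negative"),
    ("Normal", "Low",    "Medium",    "Fast", "Negative"),
    ("Normal", "High",   "Very High", "Slow", "Negative"),
    ("Normal", "Medium", "Very High", "Slow", "Negative"),
    ("Low",    "Medium", "Very High", "Slow", "Positive"),
]
rows = base * 4

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = [r[-1] for r in rows]
for a, name in enumerate(["Blood Pressure", "Protein Level",
                          "Glucose Level", "Heart Beat"]):
    parts = {}
    for r in rows:
        parts.setdefault(r[a], []).append(r[-1])
    expected = sum(len(s) / len(rows) * info(s) for s in parts.values())
    split = -sum(len(s) / len(rows) * log2(len(s) / len(rows))
                 for s in parts.values())
    gain = info(labels) - expected
    print(f"{name}: gain = {gain:.4f}, gain ratio = {gain / split:.4f}")
```

Running it reproduces the summary: information gain ranks ‘Blood Pressure’ (0.4544) first, while gain ratio ranks ‘Heart Beat’ (0.3642) first.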
Based on the above table, it is possible to use either information gain (Info Gain) or gain ratio (Gain Ratio)
to select the best attribute for the root node. In the case of information gain, the best attribute is ‘Blood
Pressure’ since its information gain of 0.4544 is the highest, compared to ‘Protein Level’,
‘Glucose Level’, and ‘Heart Beat’. In the case of gain ratio, on the other hand, the best attribute is ‘Heart
Beat’. The following shows the results of both cases.
The next step is to place a decision node at the lower nodes which are still impure. In the ‘Information Gain’
case, it is the node under the branch ‘High’ (blood pressure = high). For the ‘Gain Ratio’ case, on the other
hand, the node to focus on is the one under the branch ‘Slow’ (heart beat = slow). The following
shows how the second node is found in the ‘Information Gain’ case; the ‘Gain Ratio’ case is shown
afterwards.
[Diagram (Information Gain): root ‘Blood Pressure’; High: Negative 8 / Positive 8, Normal: Negative 12 / Positive 0, Low: Negative 0 / Positive 4]
[Diagram (Gain Ratio): root ‘Heart Beat’; Slow: Negative 8 / Positive 12, Fast: Negative 12 / Positive 0]
5. Find the second node, under the ‘Blood Pressure = High’ branch (‘Information Gain’ case).
Patient Blood P. Protein L. Glucose L. Heart Beat Diseased
2 High High Medium Fast Negative
10 High High Medium Fast Negative
18 High High Medium Fast Negative
26 High High Medium Fast Negative
1 High Medium High Slow Positive
9 High Medium High Slow Positive
17 High Medium High Slow Positive
25 High Medium High Slow Positive
4 High Low High Fast Negative
12 High Low High Fast Negative
20 High Low High Fast Negative
28 High Low High Fast Negative
3 High Low Medium Slow Positive
11 High Low Medium Slow Positive
19 High Low Medium Slow Positive
27 High Low Medium Slow Positive
Attribute Value Information
Protein Level High Info([4,0]) = entropy(4/4, 0/4)
= -(4/4)xlog2(4/4) - (0/4)xlog2(0/4) = 0.0000 bits
Medium Info([0,4]) = entropy(0/4, 4/4)
= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits
Low Info([4,4]) = entropy(4/8, 4/8)
= -(4/8)xlog2(4/8) - (4/8)xlog2(4/8) = 1.0000 bits
Expected information for “Protein Level”
= Info([4,0],[0,4],[4,4]) = (4/16x0.0000) + (4/16x0.0000)+(8/16x1.0000)= 0.5000 bits
Information gain for “Protein Level”
= Info([8,8]) - Info([4,0],[0,4],[4,4])
= 1.0000-0.5000 = 0.5000 bits
Patient Blood P. Protein L. Glucose L. Heart Beat Diseased
4 High Low High Fast Negative
12 High Low High Fast Negative
20 High Low High Fast Negative
28 High Low High Fast Negative
1 High Medium High Slow Positive
9 High Medium High Slow Positive
17 High Medium High Slow Positive
25 High Medium High Slow Positive
2 High High Medium Fast Negative
10 High High Medium Fast Negative
18 High High Medium Fast Negative
26 High High Medium Fast Negative
3 High Low Medium Slow Positive
11 High Low Medium Slow Positive
19 High Low Medium Slow Positive
27 High Low Medium Slow Positive
Attribute Value Information
Glucose Level High Info([4,4]) = entropy(4/8, 4/8)
= -(4/8)xlog2(4/8) - (4/8)xlog2(4/8) = 1.0000 bits
Medium Info([4,4]) = entropy(4/8, 4/8)
= -(4/8)xlog2(4/8) - (4/8)xlog2(4/8) = 1.0000 bits
Expected information for “Glucose Level”
= Info([4,4],[4,4]) = (8/16x1.0000) + (8/16x1.0000) = 1.0000 bits
Information gain for “Glucose Level”
= Info([8,8]) - Info([4,4],[4,4])
= 1.0000-1.0000 = 0.0000 bits
[Diagram: under ‘Blood Pressure = High’, candidate node ‘Protein Level’ (High → Negative, Medium → Positive, Low → mixed) and candidate node ‘Glucose Level’ (High → mixed, Medium → mixed)]
6. Continue finding the second node under the ‘Blood Pressure = High’ branch.
Patient Blood P. Protein L. Glucose L. Heart Beat Diseased
1 High Medium High Slow Positive
3 High Low Medium Slow Positive
9 High Medium High Slow Positive
11 High Low Medium Slow Positive
17 High Medium High Slow Positive
19 High Low Medium Slow Positive
25 High Medium High Slow Positive
27 High Low Medium Slow Positive
2 High High Medium Fast Negative
4 High Low High Fast Negative
10 High High Medium Fast Negative
12 High Low High Fast Negative
18 High High Medium Fast Negative
20 High Low High Fast Negative
28 High Low High Fast Negative
26 High High Medium Fast Negative
Attribute Value Information
Heart Beat Slow Info([0,8]) = entropy(0/8, 8/8)
= -(0/8)xlog2(0/8) - (8/8)xlog2(8/8) = 0.0000 bits
Fast Info([8,0]) = entropy(8/8,0/8)
= -(8/8)xlog2(8/8) - (0/8)xlog2(0/8) = 0.0000 bits
Expected information for “Heart Beat”
= Info([0,8],[8,0]) = (8/16x0.0000) + (8/16x0.0000) = 0.0000 bits
Information gain for “Heart Beat”
= Info([8,8]) - Info([0,8],[8,0])
= 1.0000-0.0000 = 1.0000 bits
Summary of the second node for information gain
Attribute       Info    Info Gain
Protein Level   0.5000  0.5000
Glucose Level   1.0000  0.0000
Heart Beat      0.0000  1.0000
Based on the above table of ‘Information Gain’, the best attribute for the second node is ‘Heart Beat’,
since its information gain of 1.0000 is the highest, compared to ‘Protein Level’ and ‘Glucose
Level’. The final decision tree for the information gain case is as follows.
[Figure: the final tree for information gain: root ‘Blood Pressure’; Normal → Negative, Low → Positive, High → ‘Heart Beat’ (Slow → Positive, Fast → Negative)]
7. Find the second node under the ‘Heart Beat = Slow’ branch (‘Gain Ratio’ case).
In the case of gain ratio, the best attribute for the root is ‘Heart Beat’. Similar to the case of
information gain, the next step is to place a decision node at the lower nodes which are impure. Here, the
node to focus on is the one under the branch ‘Slow’ (Heart Beat = slow).
Patient Blood P. Protein L. Glucose L. Heart Beat Diseased
1 High Medium High Slow Positive
3 High Low Medium Slow Positive
9 High Medium High Slow Positive
11 High Low Medium Slow Positive
17 High Medium High Slow Positive
19 High Low Medium Slow Positive
25 High Medium High Slow Positive
27 High Low Medium Slow Positive
6 Normal High Very High Slow Negative
7 Normal Medium Very High Slow Negative
14 Normal High Very High Slow Negative
15 Normal Medium Very High Slow Negative
22 Normal High Very High Slow Negative
23 Normal Medium Very High Slow Negative
30 Normal High Very High Slow Negative
31 Normal Medium Very High Slow Negative
8 Low Medium Very High Slow Positive
16 Low Medium Very High Slow Positive
24 Low Medium Very High Slow Positive
32 Low Medium Very High Slow Positive
Attribute Value Information
Blood Pressure High Info([0,8]) = entropy(0/8, 8/8)
= -(0/8)xlog2(0/8) - (8/8)xlog2(8/8) = 0.0000 bits
Normal Info([8,0]) = entropy(8/8, 0/8)
= -(8/8)xlog2(8/8) - (0/8)xlog2(0/8) = 0.0000 bits
Low Info([0,4]) = entropy(0/4, 4/4)
= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits
Expected information for “Blood Pressure”
= Info([0,8],[8,0],[0,4]) = (8/20x0.0000) + (8/20x0.0000)+(4/20x0.0000)= 0.0000 bits
Information gain for “Blood Pressure”
= Info([8,12]) - Info([0,8],[8,0],[0,4])
= 0.9710-0.0000 = 0.9710 bits
Split Info for “Blood Pressure” (Intrinsic_Info for “Blood Pressure”)
= Info([8,8,4])
= -(8/20)xlog2(8/20) – (8/20)xlog2(8/20) – (4/20)xlog2(4/20)
= 0.5288+0.5288+0.4644 = 1.5219 bits
Gain ratio for “Blood Pressure”
= Information gain (“Blood Pressure”) / Split Info (“Blood Pressure”)
= 0.9710 / 1.5219
= 0.6380
[Diagram: node ‘Heart Beat’; Fast → Negative, Slow → candidate node ‘Blood P.’ (High → Positive, Normal → Negative, Low → Positive)]
8. Continue finding the second node under the ‘Heart Beat’ (for ‘Gain Ratio’).
Patient  Blood P.  Protein L.  Glucose L.  Heart Beat  Class
6        Normal    High        Very High   Slow        Negative
14       Normal    High        Very High   Slow        Negative
22       Normal    High        Very High   Slow        Negative
30       Normal    High        Very High   Slow        Negative
7        Normal    Medium      Very High   Slow        Negative
15       Normal    Medium      Very High   Slow        Negative
23       Normal    Medium      Very High   Slow        Negative
31       Normal    Medium      Very High   Slow        Negative
1        High      Medium      High        Slow        Positive
8        Low       Medium      Very High   Slow        Positive
9        High      Medium      High        Slow        Positive
16       Low       Medium      Very High   Slow        Positive
17       High      Medium      High        Slow        Positive
24       Low       Medium      Very High   Slow        Positive
25       High      Medium      High        Slow        Positive
32       Low       Medium      Very High   Slow        Positive
3        High      Low         Medium      Slow        Positive
11       High      Low         Medium      Slow        Positive
19       High      Low         Medium      Slow        Positive
27       High      Low         Medium      Slow        Positive
Attribute Value Information
Protein Level High Info([4,0]) = entropy(4/4, 0/4)
= -(4/4)xlog2(4/4) - (0/4)xlog2(0/4) = 0.0000 bits
Medium Info([4,8]) = entropy(4/12, 8/12)
= -(4/12)xlog2(4/12) - (8/12)xlog2(8/12) = 0.9183 bits
Low Info([0,4]) = entropy(0/4, 4/4)
= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits
Expected information for “Protein Level”
= Info([4,0],[4,8],[0,4]) = (4/20x0.0000) + (12/20x0.9183)+(4/20x0.0000)= 0.5510 bits
Information gain for "Protein Level"
= Info([8,12]) - Info([4,0],[4,8],[0,4])
= 0.9710-0.5510 = 0.4200 bits
Split Info for “Protein Level” (Intrinsic_Info for “Protein Level”)
= Info([4,12,4])
= -(4/20)xlog2(4/20) – (12/20)xlog2(12/20) – (4/20)xlog2(4/20)
= 0.4644+0.4422+0.4644 = 1.3710 bits
Gain ratio for “Protein Level”
= Information gain (“Protein Level”) / Split Info (“Protein Level”)
= 0.4200 / 1.3710
= 0.3063
[Figure: partial decision tree. Root 'Heart Beat'; branch 'Fast' leads to Negative; branch 'Slow' leads to a 'Protein' node with branches 'High', 'Medium' and 'Low'.]
9. Continue finding the second node under the ‘Heart Beat’ (for ‘Gain Ratio’).
Patient Blood P. Protein L. Glucose L. Heart Beat Positive
6 Normal High Very High Slow Negative
14 Normal High Very High Slow Negative
22 Normal High Very High Slow Negative
30 Normal High Very High Slow Negative
7 Normal Medium Very High Slow Negative
15 Normal Medium Very High Slow Negative
23 Normal Medium Very High Slow Negative
31 Normal Medium Very High Slow Negative
8 Low Medium Very High Slow Positive
16 Low Medium Very High Slow Positive
24 Low Medium Very High Slow Positive
32 Low Medium Very High Slow Positive
1 High Medium High Slow Positive
9 High Medium High Slow Positive
17 High Medium High Slow Positive
25 High Medium High Slow Positive
3 High Low Medium Slow Positive
11 High Low Medium Slow Positive
19 High Low Medium Slow Positive
27 High Low Medium Slow Positive
Attribute Value Information
Glucose Level Very High Info([8,4]) = entropy(8/12, 4/12)
= -(8/12)xlog2(8/12) - (4/12)xlog2(4/12) = 0.9183 bits
High Info([0,4]) = entropy(0/4, 4/4)
= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits
Medium Info([0,4]) = entropy(0/4, 4/4)
= -(0/4)xlog2(0/4) - (4/4)xlog2(4/4) = 0.0000 bits
Expected information for “Glucose Level”
= Info([8,4],[0,4],[0,4]) = (12/20x0.9183) + (4/20x0.0000) + (4/20x0.0000)= 0.5510 bits
Information gain for “Glucose Level”
= Info([8,12]) - Info([8,4],[0,4],[0,4])
= 0.9710-0.5510 = 0.4200 bits
Split Info for “Glucose Level” (Intrinsic_Info for “Glucose Level”)
= Info([4,12,4])
= -(4/20)xlog2(4/20) – (12/20)xlog2(12/20) – (4/20)xlog2(4/20)
= 0.4644+0.4422+0.4644 = 1.3710 bits
Gain ratio for “Glucose Level”
= Information gain (“Glucose Level”) / Split Info (“Glucose Level”)
= 0.4200 / 1.3710
= 0.3063
[Figure: partial decision tree. Root 'Heart Beat'; branch 'Fast' leads to Negative; branch 'Slow' leads to a 'Glucose' node with branches 'Very High', 'High' and 'Medium'.]
Summary of the second node for gain ratio

Measure      Blood Pressure  Protein Level  Glucose Level
Info         0.0000          0.5510         0.5510
Info Gain    0.9710          0.4200         0.4200
Split Info   1.5219          1.3710         1.3710
Gain Ratio   0.6380          0.3063         0.3063
Based on the above table of 'Gain Ratio' values, the best attribute for the second node is 'Blood Pressure',
since its gain ratio of 0.6380 is the highest, compared to 'Protein Level' and 'Glucose Level'.
The final decision tree (for the gain ratio criterion) is as follows.
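The gain-ratio computations above can be reproduced with a short Python sketch; the helper names (`entropy`, `gain_ratio`) are mine, not from the text:

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a class-count list, e.g. [8, 12]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    """parent_counts: class counts before the split, e.g. [8, 12].
    child_counts: per-branch class counts, e.g. [[0, 8], [8, 0], [0, 4]]."""
    n = sum(parent_counts)
    expected = sum(sum(ch) / n * entropy(ch) for ch in child_counts)
    gain = entropy(parent_counts) - expected          # information gain
    split_info = entropy([sum(ch) for ch in child_counts])  # intrinsic info
    return gain, split_info, gain / split_info

# Second node under 'Heart Beat' = 'Slow' (8 negatives, 12 positives):
for name, branches in [("Blood Pressure", [[0, 8], [8, 0], [0, 4]]),
                       ("Protein Level",  [[4, 0], [4, 8], [0, 4]]),
                       ("Glucose Level",  [[8, 4], [0, 4], [0, 4]])]:
    gain, split, ratio = gain_ratio([8, 12], branches)
    print(f"{name}: gain={gain:.4f} split={split:.4f} ratio={ratio:.4f}")
```

Running it reproduces the summary table: 0.9710/1.5219/0.6380 for Blood Pressure and 0.4200/1.3710/0.3063 for the other two attributes.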
Besides information gain and gain ratio, the GINI index is also widely used, especially in
CART. Unlike information gain and gain ratio, the GINI index considers a binary split for each
attribute, instead of splitting by all of its possible values. The GINI index indicates the impurity
of a partition and can be calculated as follows. Following the notation of information gain and
gain ratio, given a training set of objects and their associated class labels, denoted by
T = {(x_1, c_1), ..., (x_|T|, c_|T|)}, each object x_i is represented by an n-dimensional attribute
vector x_i = (x_i1, ..., x_in), depicting the measured values of the n attributes A_1, ..., A_n, of
the object, with its class c_i being one of m possible classes C_1, ..., C_m. Here, suppose that an
attribute A_j has v possible values a_j1, ..., a_jv, i.e. dom(A_j) = {a_j1, ..., a_jv}; a binary
partition p divides the values of the attribute into two subsets, S_p and its complement
S_p' = dom(A_j) - S_p. Let C_k be the k-th class, T the training set before splitting, T(k) the set
of the instances with the class C_k in the set T, T_p a subset of the training set after splitting,
containing the objects whose value of the attribute A_j belongs to S_p, and T_p(k) the set of the
instances with the class C_k in the subset T_p. With this notation, |T| is the total number of
instances in the training set before splitting, |T(k)| is the number of class-k instances in the set
T, |T_p| is the number of instances whose value of A_j belongs to S_p, and |T_p(k)| is the number
of class-k instances in the subset T_p.
[Figure: final decision tree. Root 'Heart Beat'; branch 'Fast' leads to Negative; branch 'Slow' leads to a 'Blood Pressure' node with 'High' leading to Positive, 'Normal' to Negative, and 'Low' to Positive.]
GINI Index:

    Gini(T) = 1 - sum_{k=1}^{m} ( |T(k)| / |T| )^2

While the GINI index indicates the impurity of a partition, GINI-index-based decision tree
induction selects a node to split based on the best binary partition p (i.e., S_p and its
complement dom(A_j) - S_p) for each attribute A_j. A binary partition p splits T into T_p and
T_p', and the impurity after splitting is

    Gini_p(T) = ( |T_p| / |T| ) Gini(T_p) + ( |T_p'| / |T| ) Gini(T_p')

The induction attempts to select the attribute A_j with the partition p that makes the largest
reduction of the impurity,

    dGini(p) = Gini(T) - Gini_p(T)

as shown in the above formulae. The chosen partition is argmin_p Gini_p(T), the value of p that
generates the minimum Gini_p(T), which is equivalent to the maximum impurity reduction.
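The binary-split search can be sketched as follows (the function names are mine; the 20 tuples are the second-node subset of the running example, with 'Blood Pressure' values and their classes):

```python
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """For one attribute, try every binary partition of its value set and
    return (impurity reduction, chosen subset) for the best CART-style split."""
    base = gini(labels)
    vals = sorted(set(values))
    best = None
    # every non-empty proper subset S_p (its complement is the other branch)
    for r in range(1, len(vals)):
        for subset in combinations(vals, r):
            left = [l for v, l in zip(values, labels) if v in subset]
            right = [l for v, l in zip(values, labels) if v not in subset]
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or base - g > best[0]:
                best = (base - g, set(subset))
    return best

# 'Blood Pressure' over the 20 tuples at the second node (Slow heart beat):
bp = ["High"] * 8 + ["Normal"] * 8 + ["Low"] * 4
cls = ["Pos"] * 8 + ["Neg"] * 8 + ["Pos"] * 4
print(best_binary_split(bp, cls))  # best split: {'Normal'} vs {'High', 'Low'}
```

The partition {'Normal'} against {'High', 'Low'} separates the classes perfectly, so its impurity reduction equals the parent impurity of 0.48.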
Tree Pruning
While we split a node into branches during the construction of a decision tree, many branches
may reflect anomalies in the training data due to noise or outliers. In general, the "fully grown"
tree usually obtains a high prediction rate on the training data but a low prediction rate on
unseen test data, as shown in Figure 3-24. From the figure, we can observe that the prediction
performance of the learned tree on the training set increases with the size of the tree, while the
performance on the test set decreases.
Figure 3-24: Prediction rate (accuracy) trend with various tree sizes
[Plot: accuracy (y-axis, 0.60 to 1.00) versus tree size in number of nodes (x-axis, 0 to 100); the 'Train Data' curve rises while the 'Test Data' curve falls.]
In general, this problem is well known as overfitting. Overfitting occurs when our learned model
is too complex, with a high degree of freedom relative to the amount of data available, and then
attempts to describe random errors or noise instead of the underlying relationship. An overfitted
model usually has poor predictive performance since it is too specific to the training data set
and does not generalize to unseen data. To avoid overfitting, it is necessary to use additional
techniques (e.g. cross-validation, regularization, early stopping, Bayesian priors on parameters,
or model comparison) that can indicate when further training does not result in better
generalization. In the process of decision tree induction, we can avoid overfitting by means of
tree pruning. Typically, we can use statistical or information-based measures to remove the least
reliable branches. An unpruned tree and its pruned version are shown in Figure 3-25.
Figure 3-25: An example of a decision tree and its pruned version.
Normally a pruned tree tends to be smaller and less complex. Sometimes it even represents the
knowledge more precisely by ignoring noise or outliers. Moreover, it is usually faster and more
accurate at classifying independent test data (i.e., previously unseen tuples) than an unpruned
version. In general, two common approaches to tree pruning are prepruning and postpruning. In
the prepruning approach, a tree is "pruned" by terminating its construction at an early stage.
[Figure content: the unpruned tree is the final tree above. Root 'Heart Beat' covers 20 negatives and 12 positives; 'Fast' leads to Negative (12 negatives, 0 positives); 'Slow' leads to 'Blood Pressure' (8 negatives, 12 positives), whose branches 'High', 'Normal' and 'Low' cover 0/8, 8/0 and 0/4 negatives/positives respectively. The pruned tree keeps only the 'Heart Beat' node: 'Fast' leads to Negative (12 negatives, 0 positives) and 'Slow' to Positive (8 negatives, 12 positives).]
That is, we can decide not to split or partition the subset of training tuples further at a given node.
When we stop splitting, the current node will become a leaf node. The leaf node may be assigned
with the most frequent class among the subset tuples or the probability distribution of those
tuples. In the construction of a tree, while the measures such as information gain, gain ratio or
GINI index can be used to select the best node to split, another type of measure, such as
statistical significance, can be used to assess the reliability of that split. If partitioning the tuples
at a node results in a split that has statistical significance below a pre-specified threshold, then
further partitioning of the given subset is not made. However, it is difficult to decide an
appropriate threshold. Too high thresholds may result in oversimplified trees, whereas too low
thresholds could result in very little simplification.
As the second and more common approach, postpruning removes subtrees from a "fully
grown" tree. Instead of considering termination of splitting at an early stage, postpruning
allows the tree to grow fully and then prunes it. Since the tree is allowed to grow fully, it is
possible to prevent the situation where the tree is pruned too early. More concretely, even if
splitting the current node is not recommended, the split may be performed first, and the subtree
under the node may be pruned later instead. As in prepruning, a subtree at a given node is
pruned by removing its branches and replacing it with a leaf. The leaf is labeled with the most
frequent class among the subtree being replaced. As with prepruning, besides the splitting
selection criterion, we need another criterion to compare an original tree with its pruned version.
Besides the choice of prepruning and postpruning, two additional kinds of choices that
characterize a pruning method are (1) exploitation of a holdout set and (2) utilization of
knowledge complexity. For the exploitation of a holdout set, a method called reduced error
pruning uses a separate set of examples, called a holdout set, which is distinct from the set of
training examples, to evaluate the utility of each node in the tree for pruning. On the other
hand, the statistical reasoning method uses only the data available for training, applying a
statistical test, e.g. the chi-square test, to estimate whether expanding (or pruning) a particular
node is likely to produce an improvement beyond the training set. From the viewpoint of
utilization of knowledge complexity, most methods rarely consider the size or complexity of the
knowledge during the learning process. Normally such a method uses only a training set, without
the help of a holdout set. For this approach, although knowledge complexity is hard to define,
some works use information-theoretic complexity measures, such as Minimum Description Length
(MDL), to encode the size of the decision tree together with the exceptions occurring in the
training examples. Intuitively, when the tree becomes large, it will cover all cases in the
training set, leaving no errors or exceptional cases. That is, most examples in the training set
can be explained by the tree and only a few exceptional examples are left for individual
consideration. On the other hand, when the tree is small, it may not cover many examples, and
those examples are left for individual consideration as exceptions. The MDL approach tries to
minimize the description length based on this tradeoff. See details in (Grünwald, 2007).
Moreover, cost-complexity pruning in CART is another method which utilizes knowledge
complexity, but with a holdout set. The following presents the details of the methods described
above.
Reduced error pruning with cost complexity
The cost-complexity pruning is used in CART as postpruning. However, it is also possible to
utilize it in the prepruning process. This approach considers the cost complexity of a tree as a
tradeoff function between the number of leaves in the tree and the error rate of the tree. Here,
the error rate is the percentage of tuples misclassified by the tree. Unlike prepruning, the cost-
complexity postpruning starts from the bottom of the tree. For each internal node, N, it computes
the cost complexity of the subtree at N, and the cost complexity of the subtree at N if it were to be
pruned and replaced by a leaf node. These two values are compared. If pruning the subtree at
node N would result in a smaller cost complexity, then the subtree is pruned. Otherwise, it is kept.
A holdout set consisting of a number of separated class-labeled tuples is used as a pruning set to
estimate cost complexity. This holdout set is usually independent of the training set used to build
the unpruned tree and of any test set used for accuracy estimation. The algorithm generates a set
of progressively pruned trees and the smallest decision tree that minimizes the cost complexity
is preferred. Formally, cost-complexity pruning generates a series of trees T_0, T_1, ..., T_m,
where T_0 is the initial tree and T_m is the tree with only the root node. At step i, the tree
T_i is created by removing a subtree from tree T_{i-1} and replacing it with a leaf node whose
value is chosen as in the tree construction algorithm. The selection of the subtree to remove,
denoted by t, is decided based on the error rates of the original tree T and the pruned tree
prune(T, t) over the holdout data set S, written err(T, S) and err(prune(T, t), S), and the
numbers of leaf nodes before and after pruning, leaves(T) and leaves(prune(T, t)). An example
is to minimize the following criterion.

    ( err(prune(T, t), S) - err(T, S) ) / ( leaves(T) - leaves(prune(T, t)) )

Once the series of trees has been created, the best tree is chosen by generalized accuracy as
measured by a separate pruning set or cross-validation. Sometimes, this approach is also called
reduced-error pruning.
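The selection criterion can be sketched in a few lines; the function name and the candidate error rates and leaf counts below are illustrative, not from the text:

```python
def cost_complexity_alpha(err_tree, err_pruned, leaves_tree, leaves_pruned):
    """Error-rate increase on the pruning set per leaf removed:
    ( err(prune(T,t),S) - err(T,S) ) / ( leaves(T) - leaves(prune(T,t)) )."""
    return (err_pruned - err_tree) / (leaves_tree - leaves_pruned)

# Hypothetical example: the current tree has error 0.10 on the pruning set and
# 5 leaves. Two candidate subtrees t could be collapsed into a leaf:
candidates = {
    "t1": (0.12, 3),  # pruning t1 -> error 0.12, 3 leaves remain
    "t2": (0.20, 4),  # pruning t2 -> error 0.20, 4 leaves remain
}
best = min(candidates,
           key=lambda t: cost_complexity_alpha(0.10, candidates[t][0], 5, candidates[t][1]))
print(best)  # t1: it costs only 0.01 extra error per removed leaf
```

At each step the subtree with the smallest criterion value is removed, producing the series T_0, T_1, ..., T_m from which the best tree is then selected.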
Statistical reasoning
It is also possible to prune a tree by considering only information from the training data
itself, without a holdout set. The pruning can be made based on true error rates estimated from
observed errors. Some methods, including C4.5, use a heuristic based on a kind of statistical
reasoning to prune the tree; it may be criticized that the statistical underpinning is weak and
ad hoc, but in practice it seems to work well. The main principle is to consider the set of
instances that reach each node and imagine that the majority class is chosen to represent that
node, with a certain number of "errors," E, out of the total number of instances, N. Therefore,
the observed error rate is f = E/N. Suppose that the expected (true) probability of errors at
the node is q and that the N instances are generated by a Bernoulli process with parameter q,
where E instances turn out to be errors. We can then calculate confidence intervals on the true
error probability q given the observed rate f. By this, we can make a pessimistic estimate of
the error rate by calculating the upper confidence limit using the following formula. Given a
particular confidence c (for example, c = 25%), we find the confidence limit z that satisfies

    Pr[ (f - q) / sqrt( q(1 - q)/N ) > z ] = c

Here, N is the number of samples, f is the observed error rate, and q is the expected (true)
error rate. Given the value of c, the value of z can be derived using the standard normal table
or the Z table in Appendix A. Normally, we use the upper confidence limit as a (pessimistic)
estimate of the error rate at the node. The following indicates that the error rate e can be
derived from the values of f, z, and N.

    e = ( f + z^2/(2N) + z * sqrt( f/N - f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )
Sponsored by AIAT.or.th and KINDML, SIIT
![Page 48: Sponsored by AIAT.or.th and KINDML, SIIT · Classification, known as a most major supervised learning task in pattern recognition and machine learning, aims to deduce a predictive](https://reader033.vdocuments.site/reader033/viewer/2022050419/5f8f191697ef577c8d13d386/html5/thumbnails/48.jpg)
107
Here, the expected error rates are calculated for two alternative situations: (1) the node to be
split and (2) the nodes after splitting. These two error rates are compared, and the situation
with the lower error rate is selected. Suppose that there are m nodes after splitting. Let the
observed error rates at these m nodes be f_1, f_2, ..., f_m, the numbers of samples at the m
nodes be N_1, N_2, ..., N_m, the observed error rate at the node before splitting be f, and the
number of samples at the node before splitting be N. Intuitively, N = N_1 + N_2 + ... + N_m and
f = (N_1 f_1 + N_2 f_2 + ... + N_m f_m) / N. The expected error rates before and after
splitting, e_before and e_after, are as follows. If e_after < e_before, then split. Otherwise,
we do not split.

Before splitting:
    e_before = ( f + z^2/(2N) + z * sqrt( f/N - f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )

After splitting:
    e_after = (1/N) * sum_{j=1}^{m} N_j * e_j

where each e_j is computed from f_j and N_j using the same formula.
To see how all this works in practice, let us consider the unpruned tree in Figure 3-25, where
the number of training examples that reach each node is stated. Now consider whether we
should stop splitting at the 'Blood Pressure' node or not. Here, we use a 25% confidence, which
makes z equal to 0.69, according to the standard normal table or Z table in Appendix A. The
error rates before and after splitting the node are as follows. Before splitting, the node under
the 'Slow' branch of 'Heart Beat' has the observed error rate f = 8/20 = 0.4 (N = 20), since the
node has the label 'Positive' with 8 negatives and 12 positives. Its expected error rate is
calculated using the above formula. After splitting, all three nodes have zero as their observed
error rates, and the expected error rates of these three nodes are calculated and then combined
into one single expected error rate. The settings for the three nodes from left to right are
(1) (f_1 = 0/8, N_1 = 8), (2) (f_2 = 0/8, N_2 = 8), and (3) (f_3 = 0/4, N_3 = 4). The error
rate after splitting is calculated by the weighted combination: the error estimates for the
three leaves are combined in the ratio of the numbers of examples they cover, 8 : 8 : 4, which
leads to a combined error estimate of 0.06621. The detailed calculation is given below. Since
e_after < e_before (i.e., 0.06621 < 0.47706), we should split the node.
Before splitting (f = 0.4, N = 20, z = 0.69):
    e_before = ( 0.4 + 0.69^2/(2 x 20) + 0.69 x sqrt( 0.4/20 - 0.4^2/20 + 0.69^2/(4 x 20^2) ) )
               / ( 1 + 0.69^2/20 )
             = 0.47706

After splitting (for f_j = 0, the general formula simplifies to e_j = z^2 / (N_j + z^2)):
    e_1 = e_2 = 0.69^2 / (8 + 0.69^2) = 0.05617
    e_3 = 0.69^2 / (4 + 0.69^2) = 0.10636
    e_after = (8 x 0.05617 + 8 x 0.05617 + 4 x 0.10636) / 20 = 0.06621
Chi-squared test on rule pruning
Another popular pruning method is to translate a tree into a set of rules and then perform
pruning. It is possible to read a set of rules directly off a decision tree by generating one
rule for each leaf and creating the antecedent (the left-hand side) of the rule as a conjunction
of all the tests encountered on the path from the root to that leaf. This procedure provides
unambiguous rules without concern for the order of the tests. However, in general, the rules
produced from a tree are usually too complex and too specific. Sometimes it is better to prune
some conditions in the antecedent of a rule. This pruning can be done by calculating a
pessimistic estimate of the error rate of the new rule after a condition is removed and
comparing it with the pessimistic estimate for the original rule. If the new rule is better, we
delete that condition and then look for the next possible condition to delete. We decide to use
the rule once no remaining condition improves the error rate by its removal. This procedure is
applied to all rules. After they have been pruned in this way, we check whether there are any
duplicates; if there are, we remove them from the rule set. Normally the procedure is done
iteratively with a greedy approach to detecting redundant conditions in a rule. Therefore, there
is no guarantee that the best set of conditions will be removed. However, it is intractable to
consider all subsets of conditions, since this is usually
prohibitively expensive. Although one good solution is to apply an optimization technique
such as simulated annealing or a genetic algorithm to select the best condition subset of a
rule, the simple greedy solution seems to work well enough to generate quite good rule sets.
However, even with the greedy method, this approach has a computational cost problem: for every
condition that is a candidate for deletion, the effect of the rule must be re-evaluated on all
the training instances. In summary, the pruning process can be done in the following steps.
1. Convert the decision tree to a set of classification rules.
2. For each classification rule, calculate a contingency table for each antecedent and test the
antecedent for statistical independence.
3. Prune the rule if the antecedent is independent.
4. Repeat steps 2-3 for the remaining antecedents and for all classification rules.
In the first step, we will convert the decision tree to a set of classification rules as shown in the
following example. Due to the limitation of space, we use this artificial example as follows.
R1: If (X=x1 & Y=y1) then Class = C1
R2: If (X=x1 & Y=y2) then Class = C2
R3: If (X=x2 & Z=z1) then Class = C1
R4: If (X=x2 & Z=z2) then Class = C3
R5: If (X=x3) then Class = C1
Here, assume that we also have the following table of counts obtained from the dataset. Note
that some of the value combinations in this table may not appear as paths in the tree; the
counts themselves can be taken directly from the data table.
                    Class=C1  Class=C2  Class=C3
X=x1  Y=y1  Z=z1        4         0         0
            Z=z2        6         0         0
      Y=y2  Z=z1        0        20         0
            Z=z2        0        10         0
X=x2  Y=y1  Z=z1        0         5         0
            Z=z2        0         0        20
      Y=y2  Z=z1        0         5         0
            Z=z2        0         0        10
X=x3  Y=y1  Z=z1        5         0         0
            Z=z2       10         0         0
      Y=y2  Z=z1        5         0         0
            Z=z2        5         0         0
As the second step, we construct a contingency table for each rule to test statistical
independence. For example, given the above data set, we construct the table for the first rule
to test the statistical independence of its first antecedent, X = 'x1'.

R1: If (X=x1 & Y=y1) then Class = C1

The contingency table for the first antecedent of this rule can be constructed using the above
table as follows.

              Class = 'C1'   Class ≠ 'C1'   Marginal Sum
X = 'x1'      10 (A)         30 (B)         40 (A+B)
X ≠ 'x1'      25 (C)         40 (D)         65 (C+D)
Marginal Sum  35 (A+C)       70 (B+D)       105 (T)
The expected value of each cell can be calculated from the row and column marginal sums as
follows.

              Class = 'C1'            Class ≠ 'C1'            Marginal Sum
X = 'x1'      13.33 [(A+C)(A+B)/T]    26.67 [(B+D)(A+B)/T]    40
X ≠ 'x1'      21.67 [(A+C)(C+D)/T]    43.33 [(B+D)(C+D)/T]    65
Marginal Sum  35                      70                      105
In the next step, we calculate the chi-square statistic using one of the following formulae,
depending on the highest expected frequency (m). Here, o_i is the i-th observed value and e_i is
the i-th expected value.

If m > 10, use the chi-square test:
    chi^2 = sum_i (o_i - e_i)^2 / e_i

If 5 <= m <= 10, use Yates' correction for continuity:
    chi^2 = sum_i ( |o_i - e_i| - 0.5 )^2 / e_i

If m < 5, use Fisher's exact test. The details can be found at
http://mathworld.wolfram.com/FishersExactTest.html

In this case, the highest expected frequency is m = 43.33, so we use the chi-square test:

    chi^2 = (10 - 13.33)^2/13.33 + (30 - 26.67)^2/26.67 + (25 - 21.67)^2/21.67 + (40 - 43.33)^2/43.33
          = 0.8333 + 0.4167 + 0.5128 + 0.2564
          = 2.0192
The degrees of freedom (df) are calculated as df = (r - 1) x (c - 1), where r is the number of
rows and c is the number of columns. In the contingency table, both the number of rows and the
number of columns are two (2). Therefore, df = (2 - 1) x (2 - 1) = 1.
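The contingency-table test can be sketched as follows (the helper name is mine); it reproduces both chi-square values computed in this section:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]],
    with expected counts from the marginal sums; df = (2-1) x (2-1) = 1."""
    total = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = (a, b, c, d)
    expected = (rows[0] * cols[0] / total, rows[0] * cols[1] / total,
                rows[1] * cols[0] / total, rows[1] * cols[1] / total)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Antecedent X = 'x1' of rule R1 against Class = 'C1':
print(round(chi_square_2x2(10, 30, 25, 40), 4))  # 2.0192  < 3.84 -> independent
# Antecedent Y = 'y1' of rule R1 (tested later in this section):
print(round(chi_square_2x2(25, 25, 10, 45), 4))  # 11.9318 > 3.84 -> dependent
```

Comparing each statistic against the 0.05 critical value of 3.84 (df = 1) gives the prune/keep decisions derived below.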
From the chi-square table in Section 2.4 and Appendix B, when we set the p-value (alpha) to 0.05
with df = 1, the critical value chi^2_alpha is 3.84. If chi^2 < chi^2_alpha, then we accept the
null hypothesis of independence, H0. Here, we accept the statistical independence of the first
antecedent X = 'x1' since 2.0192 < 3.84.

R1: If (X=x1 & Y=y1) then Class = C1

We thus conclude that "Class = C1" is independent of "X=x1" and eliminate this antecedent from
the rule as follows.

R1: If (Y=y1) then Class = C1
Next, we construct the contingency table for the second antecedent of the first rule, Y = 'y1',
to test its statistical independence.

R1: If (X=x1 & Y=y1) then Class = C1

              Class = 'C1'   Class ≠ 'C1'   Marginal Sum
Y = 'y1'      25 (A)         25 (B)         50 (A+B)
Y ≠ 'y1'      10 (C)         45 (D)         55 (C+D)
Marginal Sum  35 (A+C)       70 (B+D)       105 (T)
The expected value of each cell can be calculated from the row and column marginal sums as
follows.

              Class = 'C1'            Class ≠ 'C1'            Marginal Sum
Y = 'y1'      16.67 [(A+C)(A+B)/T]    33.33 [(B+D)(A+B)/T]    50
Y ≠ 'y1'      18.33 [(A+C)(C+D)/T]    36.67 [(B+D)(C+D)/T]    55
Marginal Sum  35                      70                      105
In the next step, we calculate the chi-square statistic for this antecedent as follows.

    chi^2 = (25 - 16.67)^2/16.67 + (25 - 33.33)^2/33.33 + (10 - 18.33)^2/18.33 + (45 - 36.67)^2/36.67
          = 4.1667 + 2.0833 + 3.7879 + 1.8940
          = 11.9318
In this case, the degrees of freedom are the same, i.e., 1. When we set the p-value (alpha) to
0.05, the critical value chi^2_alpha is 3.84. If chi^2 < chi^2_alpha, we accept the null
hypothesis of independence, H0. Here, we reject the statistical independence of the second
antecedent Y = 'y1' since 11.9318 > 3.84.

R1: If (X=x1 & Y=y1) then Class = C1

We thus conclude that "Class = C1" is dependent on "Y=y1" and keep this antecedent, so the rule
remains as follows.

R1: If (Y=y1) then Class = C1

We also need to check the antecedents of the other rules, R2-R5, in the same way.
Issues in Decision Trees
This section summarizes five issues in decision-tree-based classification; however, they are
also common to other classification methods.
1. Overfitting the data
Given a hypothesis space H (i.e. the set of all possible trees), a hypothesis h in H (i.e. a
tree) is said to overfit the training data if there exists some alternative hypothesis h' in H
(i.e. another tree) such that h generates fewer errors than h' over the training examples, but
h' gives fewer errors than h over the entire distribution of instances. Two common heuristics
are prepruning and postpruning. The first tries not to fit all examples, stopping the growth of
the tree before using all data in the training set. The second fits all examples with the
constructed tree but prunes the resultant tree. The problem is how to know whether a given tree
overfits the data. One solution is to use a validation set, which does not include data used for
training (not in the training set), to check for overfitting. Usually the validation set
consists of one-third of the training set, chosen randomly. Then statistical tests, such as the
chi-squared metric, can determine whether pruning the tree improves its performance on the
validation set. An alternative is to use MDL to check whether modifying the tree increases its
MDL with respect to the validation set. If we use the validation set to guide pruning, we again
need to guarantee that the tree does not overfit the validation set. In this case, we need to
extract yet another set, called the test set, from the training set and use it for the final
check.
2. Good attribute selection
While the information gain measure seems a good criterion for attribute selection, it has a bias
that favors attributes with many values over those with only a few. For example, for an
attribute such as 'Date' with unique values for each training example, the gain will be the
highest possible, since such an attribute is obviously not ambiguous and no non-unique attribute
can do better. This results in a very broad tree of depth 1. To solve this problem, it is
possible to use the gain ratio instead of the information gain.
3. Handling continuous valued attributes
Continuous valued attributes can be partitioned into a discrete number of disjoint intervals.
Then we can test for membership to these intervals. For example, the Temperature attribute
in the Play-Tennis example in Figure 3-13, takes continuous values. It is not suitable to treat
each number as a discrete label: the temperature value alone may perfectly classify the training
examples and therefore promise the highest information gain, as in the earlier example related
to ‘Date’. However, it will be a poor predictor on the test set. The solution to this problem is to
classify based not on the actual temperature, but on dynamically determined intervals within
which the temperature falls. For instance, we can introduce Boolean attributes such as
(Temperature > c), for one or more threshold values c, instead of the real-valued Temperature.
The thresholds c can be computed by standard discretization methods.
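A common way to choose such a threshold is to try the midpoints between consecutive sorted values and keep the cut with the highest information gain. The sketch below uses made-up temperatures and Play-Tennis-style labels, not the actual table from Figure 3-13:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the cut point c for a Boolean test (value > c) that
    maximizes information gain, trying midpoints between
    consecutive distinct sorted values."""
    pairs = sorted(zip(values, labels))
    base = entropy([lab for _, lab in pairs])
    best_c, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # equal values: no cut here
        c = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= c]
        right = [lab for v, lab in pairs if v > c]
        gain = base - (len(left) * entropy(left) +
                       len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Hypothetical temperatures and play/no-play labels.
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
plays = ["y", "n", "y", "y", "y", "n", "n", "y", "n", "y", "y", "n"]
cut, gain = best_threshold(temps, plays)
```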
4. Handling missing attribute values
When some of the training examples contain one or more missing values (‘value not known’)
instead of actual attribute values, we can use one of the following options.
1. Replace the unknown value with the most common value in that column.
Sponsored by AIAT.or.th and KINDML, SIIT
2. Replace it with the most common value among all training examples that have been
sorted into the tree at that node.
3. Replace it with the most common value among all training examples that have been
sorted into the tree at that node and that have the same classification as the incomplete
example.
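Options 1 and 3 can be sketched as follows (the attribute names are hypothetical):

```python
from collections import Counter

MISSING = None

def most_common(values):
    return Counter(values).most_common(1)[0][0]

def impute_overall(rows, attr):
    """Option 1: fill a missing value with the most common value
    observed in that column over all training examples."""
    known = [r[attr] for r in rows if r[attr] is not MISSING]
    fill = most_common(known)
    return [dict(r, **{attr: fill}) if r[attr] is MISSING else r
            for r in rows]

def impute_by_class(rows, attr, cls_attr):
    """Option 3: fill it with the most common value among examples
    having the same classification as the incomplete example."""
    out = []
    for r in rows:
        if r[attr] is MISSING:
            known = [s[attr] for s in rows
                     if s[cls_attr] == r[cls_attr] and s[attr] is not MISSING]
            out.append(dict(r, **{attr: most_common(known)}))
        else:
            out.append(r)
    return out

rows = [{"temp": "high", "cls": "pos"},
        {"temp": "high", "cls": "pos"},
        {"temp": "high", "cls": "pos"},
        {"temp": "low", "cls": "neg"},
        {"temp": "low", "cls": "neg"},
        {"temp": MISSING, "cls": "neg"}]
```

Note that the two options can disagree: here the column-wide mode is "high", but among examples of the same class ("neg") the mode is "low".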
5. Handling attributes with different costs
While attribute selection in the original version of decision tree induction depends mainly on
discrimination power (or classification performance) on classes, sometimes we would like to
introduce another type of criteria (or bias) against the selection of certain attributes. These
selection criteria may relate with cost of testing the attribute, rather than discrimination
power. For example, the cost of having a blood test is higher than that of measuring
temperature or blood pressure. Therefore, even if the attribute ‘temperature’ or the attribute
‘blood pressure’ has slightly lower discrimination power than the ‘blood test’ result, we may
still select these low-cost attributes as nodes in a decision tree. In general, it is possible to
assign a reasonable cost to each attribute and use it together with the conventional criterion
(i.e., information gain or gain ratio) to construct a decision tree. For example, we can define a
measure CostedGain(S,A) along a combination function such as

CostedGain(S,A) = Gain(S,A)^2 / Cost(A)    or    CostedGain(S,A) = (2^Gain(S,A) − 1) / (Cost(A) + 1)^w

where w (0 ≤ w ≤ 1) is a weighting constant that determines the relative importance of cost
versus information gain.
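Two such combination functions that appear in the cost-sensitive decision-tree literature are Gain(S,A)^2/Cost(A) and (2^Gain(S,A) − 1)/(Cost(A) + 1)^w. The sketch below uses made-up gain and cost figures purely for illustration:

```python
def costed_gain_squared(gain, cost):
    """Gain(S,A)^2 / Cost(A): rewards gain, penalizes cost linearly."""
    return gain ** 2 / cost

def costed_gain_exponential(gain, cost, w=0.5):
    """(2^Gain(S,A) - 1) / (Cost(A) + 1)^w, where the weighting
    constant w in [0, 1] sets the importance of cost vs. gain."""
    return (2 ** gain - 1) / (cost + 1) ** w

# Made-up figures: a cheap attribute (temperature) with slightly lower
# gain can outrank an expensive one (blood test) once cost is included.
cheap = costed_gain_squared(gain=0.60, cost=1.0)
expensive = costed_gain_squared(gain=0.70, cost=10.0)
```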
3.1.5. Classification Rules: Covering Algorithm
In principle, decision tree algorithms apply a divide-and-conquer approach to solve the
classification problem. They work in the top-down manner by seeking at each stage an attribute
that best splits (separates) the classes; then recursively processes the subproblems that result
from the split. Finally, this procedure will generate a decision tree. As seen in the previous
section, if necessary, the decision tree can be converted into a set of classification rules. Although
it is simple to produce a set of effective rules this way, the converted rules have limitations in
their forms, especially in their rule antecedents (the rules’ conditions, or left-hand sides). An
alternative, known as a separate-and-conquer approach, is to consider each class in turn and
seek a set of conditions (rule antecedents) that cover all instances in the class while excluding
instances not in the class. Called a (sequential) covering approach, at each stage a rule is
produced to cover instances of the class that are not yet covered, while excluding unrelated
instances. This approach directly leads to a set of rules rather than a decision tree. The rules
produced by the covering algorithm are more general than those converted from a decision tree.
There are many sequential covering algorithms. The most basic one is known as the PRISM
algorithm (developed by Cendrowska in 1987), which uses p/t (accuracy) as the criterion to
express the goodness of a rule (if A then C). Here, p is the number of instances (in the
training set) covered by the rule, i.e. the number of instances that satisfy both the antecedent A
and the consequent C, and t is the number of instances that satisfy the antecedent A
but may or may not satisfy the consequent. Besides the accuracy p/t, there are also other criteria
for selecting good rules, such as accuracy with negative consideration, information gain and
positive-negative difference. Their characteristics are summarized as follows.
Accuracy: [p/t]
o The covering algorithm using accuracy (p/t) attempts, as quickly as possible, to produce rules that cover no negative instances.
o However, it may produce a rule with very small coverage, such as a rule covering only 1
instance but with an accuracy of 100% (p/t = 1/1). The instances used to form such a
rule need to be judged as to whether they are special cases or just noise.
o A typical problem with the accuracy criterion is that the algorithm will prefer a rule with
accuracy 1/1 (100%) to a rule with accuracy 999/1000 (99.9%). The first rule is
supported by only a single instance, while the second relies on 999 instances, far more
evidence. Therefore, intuitively the second rule is more reliable and should be selected
instead of the first.
Accuracy with negative consideration: [(p + (N − n))/T]
o The covering algorithm using accuracy with negative consideration ((p + (N − n))/T)
attempts to produce rules under the assumption that noncoverage of negatives is equally
important as coverage of positives. Therefore, it uses the summation of the number of
positive instances covered by the rule (p) and the number of negative instances not
covered by the rule (N-n). Here, N is the total number of negative instances in the whole
dataset and n is the total number of negative instances covered by the rule. Intuitively,
the rule will be selected based on the number difference between positive and negative
instances covered by the rule, i.e. (p-n).
o However, this criterion still shares a similar issue with the accuracy criterion when the
difference between the numbers of positive and negative instances (p − n) is not a good
representative of rule quality.
o Given a data set with 5000 positive instances and 5000 negative instances (T=10000,
P=5000, and N=5000), a typical problem of this criterion occurs when the algorithm
prefers a rule with a score of (3000+(5000−2000))/10000 (i.e., 6000/10000) to a rule
with a score of (999+(5000−1))/10000 (i.e., 5998/10000). That is, the comparison is
between p=3000, n=2000 and p=999, n=1. In the first rule the numbers of positive
instances (3000) and negative instances (2000) are very similar, while the second rule
shows a dominant difference between the numbers of positive and negative instances
(999 positive instances and 1 negative instance). Therefore, intuitively the second rule
has high contrast between positives and negatives and, with its coverage of many
instances, seems more reliable; it should be selected instead of the first rule.
Information gain: [p × (log2(p/t) − log2(P/T))]
o The covering algorithm with information gain, p × (log2(p/t) − log2(P/T)), attempts to
produce a set of rules with the highest information gain, where p is the number of
positive instances covered by the rule, t is the number of instances that satisfy the
antecedent, P is the total number of positive instances in the data set and T is the total
number of instances in the data set. Moreover, if two rules have equivalent information
gain, this criterion will select the rule with the largest number of positive instances (p).
o However, this criterion still shares a similar issue with the accuracy criterion since it
still focuses on the number of positive instances.
o Given a data set with 5000 positive instances and 5000 negative instances (T=10000,
P=5000, and N=5000), a typical problem of this criterion occurs when the algorithm
prefers a rule with information gain of 2000 × (log2(2000/2500) − log2(5000/10000))
(= 1356.14) to a rule with information gain of 999 × (log2(999/1000) − log2(5000/10000))
(= 997.56). That is, the comparison is between p=2000, n=500 and
p=999, n=1. In the first rule the numbers of positive instances (2000) and negative
instances (500) are not so different, while the second rule shows a dominant difference
between the numbers of positive and negative instances (999 positive instances and 1
negative instance). Therefore, intuitively the second rule has high contrast between the
numbers of positive and negative instances and, with its coverage of many instances,
seems more reliable; it should be selected instead of the first rule.
Positive-negative difference: [(p − n)/t]
o The positive-negative difference (p − n)/t is closely related to the accuracy (p/t):
substituting n = (t − p), it becomes (p − (t − p))/t = (2p − t)/t. Finally, it is
2(p/t) − 1, a linear transformation of the accuracy, so it ranks rules in the same order as p/t.
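The four criteria can be placed side by side in a short sketch; the assertions below reproduce the worked comparisons from the discussion above (P = N = 5000):

```python
from math import log2

def accuracy(p, t):
    return p / t

def acc_with_negatives(p, n, N, T):
    # (p + (N - n)) / T: positives covered plus negatives excluded
    return (p + (N - n)) / T

def rule_info_gain(p, t, P, T):
    # p * (log2(p/t) - log2(P/T))
    return p * (log2(p / t) - log2(P / T))

def pos_neg_difference(p, t):
    # (p - n)/t with n = t - p, i.e. (2p - t)/t = 2*(p/t) - 1
    return (2 * p - t) / t

P, N, T = 5000, 5000, 10000
```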
PRISM Algorithm
The pseudocode of the PRISM rule learner is shown as follows. In order to understand the
PRISM algorithm, the following health-check data set is used to describe the rule construction
(learning) process using the accuracy as the criterion to select the best rule.
Patient No.  Blood Pressure  Protein Level  Glucose Level  Heart Beat    Diseased
             (Feature #1)    (Feature #2)   (Feature #3)   (Feature #4)  (Class)
 1           High            Medium         High           Slow          Positive
 2           High            High           Normal         Fast          Negative
 3           High            Low            Normal         Slow          Positive
 4           High            Low            High           Fast          Negative
 5           Normal          Low            Normal         Fast          Negative
 6           Normal          High           Very High      Slow          Negative
 7           Normal          Medium         Very High      Slow          Negative
 8           Low             Medium         Very High      Slow          Positive
 9           High            Medium         High           Slow          Positive
10           High            High           Normal         Fast          Negative
11           High            Low            Normal         Slow          Positive
12           High            Low            High           Fast          Negative
13           Normal          Low            Normal         Fast          Negative
14           Normal          High           Very High      Slow          Negative
15           Normal          Medium         Very High      Slow          Negative
16           Low             Medium         Very High      Slow          Positive
17           High            Medium         High           Slow          Positive
18           High            High           Normal         Fast          Negative
19           High            Low            Normal         Slow          Positive
20           High            Low            High           Fast          Negative
21           Normal          Low            Normal         Fast          Negative
22           Normal          High           Very High      Slow          Negative
23           Normal          Medium         Very High      Slow          Negative
24           Low             Medium         Very High      Slow          Positive
25           High            Medium         High           Slow          Positive
26           High            High           Normal         Fast          Negative
27           High            Low            Normal         Slow          Positive
28           High            Low            High           Fast          Negative
29           Normal          Low            Normal         Fast          Negative
30           Normal          High           Very High      Slow          Negative
31           Normal          Medium         Very High      Slow          Negative
32           Low             Medium         Very High      Slow          Positive
Algorithm 3.1. The PRISM algorithm (Pseudocode of the PRISM rule learner)
FOREACH class C {
INITIALIZE E to the instance set (the training set)
WHILE E contains instances in class C {
CREATE a rule R with an empty left-hand side that
predicts class C
UNTIL R is perfect (or there are no more attributes to use) {
FOREACH attribute A not mentioned in R, and each value v {
CONSIDER ADDING the condition A=v to the LHS of R
}
SELECT the A and v that maximize the accuracy p/t
(break ties by choosing the condition with the largest p)
ADD A=v to R
}
REMOVE the instances covered by R from E
}
}
Based on this data set, the PRISM algorithm (Algorithm 3.1) will form rules that cover each of
the two alternative classes, positive and negative, in turn. Let us begin with the class ‘positive’. A
rule with empty left-hand-side and ‘positive’ class is considered as follows.
If ? then diseased = ‘positive’
For the antecedent, there are eleven possibilities. The following table shows the candidate
rules and their values of p/t. The rule with the highest value will be selected. If more than one
rule holds the highest value, among them we select the rule with the highest coverage (i.e., the
highest value of p). At this point, it is again possible to have more than one rule holding the
highest coverage; in that case, we randomly select one of them.
No. Rule p/t (= accuracy)
1 If (blood = ‘high’) then diseased = ‘positive’ 8/16 = 0.50
2 If (blood = ‘normal’) then diseased = ‘positive’ 0/12 = 0.00
3 If (blood = ‘low’) then diseased = ‘positive’ 4/4 = 1.00
4 If (protein = ‘high’) then diseased = ‘positive’ 0/8 = 0.00
5 If (protein = ‘medium’) then diseased = ‘positive’ 8/12 = 0.67
6 If (protein = ‘low’) then diseased = ‘positive’ 4/12 = 0.33
7 If (glucose = ‘normal’) then diseased = ‘positive’ 4/12 = 0.33
8 If (glucose = ‘high’) then diseased = ‘positive’ 4/8 = 0.50
9 If (glucose = ‘very high’) then diseased = ‘positive’ 4/12 = 0.33
10 If (heart = ‘fast’) then diseased = ‘positive’ 0/12 = 0.00
11 If (heart = ‘slow’) then diseased = ‘positive’ 12/20 = 0.60
From the above table, the third rule has the highest value of p/t (=4/4). Therefore, we
include the third rule into the final classification rule set.
(R1) If blood = ‘low’ then diseased = ‘positive’ p/t = 4/4 (=1.0)
Since this rule is perfect (the accuracy of 1.0), there is no need to refine this rule. Next, we
delete all instances covered by the rule and then find another rule to cover the remaining
instances. The following table expresses the status after the four instances covered by the rule
are deleted.
Patient No.  Blood Pressure  Protein Level  Glucose Level  Heart Beat    Diseased
             (Feature #1)    (Feature #2)   (Feature #3)   (Feature #4)  (Class)
 1           High            Medium         High           Slow          Positive
 2           High            High           Normal         Fast          Negative
 3           High            Low            Normal         Slow          Positive
 4           High            Low            High           Fast          Negative
 5           Normal          Low            Normal         Fast          Negative
 6           Normal          High           Very High      Slow          Negative
 7           Normal          Medium         Very High      Slow          Negative
 9           High            Medium         High           Slow          Positive
10           High            High           Normal         Fast          Negative
11           High            Low            Normal         Slow          Positive
12           High            Low            High           Fast          Negative
13           Normal          Low            Normal         Fast          Negative
14           Normal          High           Very High      Slow          Negative
15           Normal          Medium         Very High      Slow          Negative
17           High            Medium         High           Slow          Positive
18           High            High           Normal         Fast          Negative
19           High            Low            Normal         Slow          Positive
20           High            Low            High           Fast          Negative
21           Normal          Low            Normal         Fast          Negative
22           Normal          High           Very High      Slow          Negative
23           Normal          Medium         Very High      Slow          Negative
25           High            Medium         High           Slow          Positive
26           High            High           Normal         Fast          Negative
27           High            Low            Normal         Slow          Positive
28           High            Low            High           Fast          Negative
29           Normal          Low            Normal         Fast          Negative
30           Normal          High           Very High      Slow          Negative
31           Normal          Medium         Very High      Slow          Negative
Based on this table, we will form another rule to perfectly cover the instances in the positive
class. Another rule with empty left-hand-side and ‘positive’ class is considered as follows.
If ? then diseased = ‘positive’
For the antecedent, there are ten possibilities. The following table shows the rules and their
values of p/t.
No. Rule p/t (= accuracy)
1 If (blood = ‘high’) then diseased = ‘positive’ 8/16 = 0.50
2 If (blood = ‘normal’) then diseased = ‘positive’ 0/12 = 0.00
- If (blood = ‘low’) then diseased = ‘positive’ - -
3 If (protein = ‘high’) then diseased = ‘positive’ 0/8 = 0.00
4 If (protein = ‘medium’) then diseased = ‘positive’ 4/8 = 0.50
5 If (protein = ‘low’) then diseased = ‘positive’ 4/12 = 0.33
6 If (glucose = ‘normal’) then diseased = ‘positive’ 4/12 = 0.33
7 If (glucose = ‘high’) then diseased = ‘positive’ 4/8 = 0.50
8 If (glucose = ‘very high’) then diseased = ‘positive’ 0/8 = 0.00
9 If (heart = ‘fast’) then diseased = ‘positive’ 0/12 = 0.00
10 If (heart = ‘slow’) then diseased = ‘positive’ 8/16 = 0.50
From the above table, the first rule and the tenth rule have the highest value of p/t (=8/16)
and also the highest coverage (p=8). At this point, we randomly select the first rule. Since it does
not have 100% accuracy, further refinement is necessary.
If blood = ‘high’ then diseased = ‘positive’ p/t = 8/16 (=0.5)
Since this rule is not perfect (its accuracy is below 1.0), we need to refine it. At this point we
select the instances which have blood = ‘high’ and then add another antecedent to cover only
the instances with the positive class. The following table shows the instances with blood = ‘high’.
Patient No.  Blood Pressure  Protein Level  Glucose Level  Heart Beat    Diseased
             (Feature #1)    (Feature #2)   (Feature #3)   (Feature #4)  (Class)
 1           High            Medium         High           Slow          Positive
 2           High            High           Normal         Fast          Negative
 3           High            Low            Normal         Slow          Positive
 4           High            Low            High           Fast          Negative
 9           High            Medium         High           Slow          Positive
10           High            High           Normal         Fast          Negative
11           High            Low            Normal         Slow          Positive
12           High            Low            High           Fast          Negative
17           High            Medium         High           Slow          Positive
18           High            High           Normal         Fast          Negative
19           High            Low            Normal         Slow          Positive
20           High            Low            High           Fast          Negative
25           High            Medium         High           Slow          Positive
26           High            High           Normal         Fast          Negative
27           High            Low            Normal         Slow          Positive
28           High            Low            High           Fast          Negative
Based on this table, we try to add another antecedent to perfectly cover the instances in the
positive class. Therefore, the following rule template can be considered.
If blood = ‘high’ & ( ? ) then diseased = ‘positive’
For the antecedent, there are eight possibilities. The following table shows the rules and their
values of p/t.
No. Rule p/t (= accuracy)
1 If (blood = ‘high’ & protein = ‘high’) then diseased = ‘positive’ 0/4 = 0.00
2 If (blood = ‘high’ & protein = ‘medium’) then diseased = ‘positive’ 4/4 = 1.00
3 If (blood = ‘high’ & protein = ‘low’) then diseased = ‘positive’ 4/8 = 0.50
4 If (blood = ‘high’ & glucose = ‘normal’) then diseased = ‘positive’ 4/8 = 0.50
5 If (blood = ‘high’ & glucose = ‘high’) then diseased = ‘positive’ 4/8 = 0.50
6 If (blood = ‘high’ & glucose = ‘very high’) then diseased = ‘positive’ - -
7 If (blood = ‘high’ & heart = ‘fast’) then diseased = ‘positive’ 0/8 = 0.00
8 If (blood = ‘high’ & heart = ‘slow’) then diseased = ‘positive’ 8/8 = 1.00
From the above table, the second rule and the eighth rule have the highest accuracy (=1.0), but
the eighth rule has the higher coverage (p=8). Therefore, we select the eighth rule.
(R2) If blood = ‘high’ & heart = ‘slow’ then diseased = ‘positive’ p/t = 8/8 (=1.0)
Since this rule is perfect (the accuracy of 1.0), there is no need to refine this rule. Next, we
delete all instances covered by the rule and then find another rule to cover the remaining
instances. However, after this second rule is formed, there is no instance with the
‘positive’ class left. Therefore, the PRISM algorithm will output the following two rules as the set
of classification rules.
(R1) If blood = ‘low’ then diseased = ‘positive’
(R2) If blood = ‘high’ & heart = ‘slow’ then diseased = ‘positive’
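The whole construction above can be reproduced with a compact sketch of Algorithm 3.1. Ties in the p/t criterion are broken by attribute order here, which happens to match the text's choice of the blood = 'high' rule:

```python
# The 32-row health-check set repeats this 8-row pattern four times.
ROWS = [
    ("High", "Medium", "High", "Slow", "Positive"),
    ("High", "High", "Normal", "Fast", "Negative"),
    ("High", "Low", "Normal", "Slow", "Positive"),
    ("High", "Low", "High", "Fast", "Negative"),
    ("Normal", "Low", "Normal", "Fast", "Negative"),
    ("Normal", "High", "Very High", "Slow", "Negative"),
    ("Normal", "Medium", "Very High", "Slow", "Negative"),
    ("Low", "Medium", "Very High", "Slow", "Positive"),
] * 4
ATTRS = ["blood", "protein", "glucose", "heart"]

def covers(conds, row):
    return all(row[ATTRS.index(a)] == v for a, v in conds)

def prism(rows, cls):
    """Learn perfect rules for class `cls` with the p/t criterion,
    breaking ties by the largest p (Algorithm 3.1)."""
    rules, E = [], rows[:]
    while any(r[-1] == cls for r in E):
        conds = []
        while True:
            t = sum(1 for r in E if covers(conds, r))
            p = sum(1 for r in E if covers(conds, r) and r[-1] == cls)
            if t > 0 and p == t:            # rule is perfect on E
                break
            candidates = []
            for i, a in enumerate(ATTRS):
                for v in sorted({r[i] for r in E}):
                    if (a, v) in conds:
                        continue
                    new = conds + [(a, v)]
                    tc = sum(1 for r in E if covers(new, r))
                    pc = sum(1 for r in E
                             if covers(new, r) and r[-1] == cls)
                    if tc:
                        candidates.append((pc / tc, pc, new))
            conds = max(candidates, key=lambda c: (c[0], c[1]))[2]
        rules.append(conds)
        E = [r for r in E if not covers(conds, r)]   # remove covered
    return rules

rules = prism(ROWS, "Positive")
```

Running this yields exactly R1 and R2 from the worked example.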
While these two perfect rules completely cover all instances in the training set, they may be
too specific and improper in general for unseen data. This situation is known as overfitting. In
a covering algorithm, a rule becomes more prone to overfitting every time an antecedent is added to the
left-hand-side of the rule. To avoid overfitting, we can stop adding antecedents at a certain
point while the rule is still not perfect (below 100% accuracy). That is, sometimes it is better not to
generate perfect rules that guarantee correct classification of all training instances, in order to
avoid overfitting. We need to consider which rules are worthwhile, and to determine when a
rule becomes counterproductive as we continue adding antecedents (conditions) to exclude a
few pesky instances of the wrong type. There are two main strategies for pruning
rules: global pruning (post-pruning) and incremental pruning (pre-pruning). The first calculates
the full set of rules and then prunes them, while the second prunes the rules at the time they
are refined or generated. As we have seen, the rules are normally created with a criterion such
as p/t. However, as a pruning mechanism, we need another criterion to measure when
refinement becomes counterproductive. Three general pruning criteria are the MDL principle
(Minimum Description Length), reduced-error pruning (calculating errors on a holdout set)
and statistical significance (as done in the INDUCT algorithm). The MDL principle is a
formalization of Occam's razor in which the best hypothesis for a given set of data is the one
that leads to the largest
compression of the data. The MDL was introduced by Jorma Rissanen in 1978 and it is an
important concept in information theory and learning theory. Any set of data can be represented
by a string of symbols from a finite (say, binary) alphabet. The fundamental idea behind the MDL
principle is that any regularity in a given set of data can be used to compress the data, i.e. to
describe it using fewer symbols than needed to describe the data literally (Grünwald, 1998).
Moreover, it is possible that not all data can be represented by regularity (general knowledge)
and some exceptional data may exist. Since the MDL approach attempts to select the hypothesis
that captures the most regularity in the data and leaves the fewest exceptional data, it aims to
find the best compression (the smallest size of regularity together with the fewest exceptions).
Applied to classification rules, this approach requires a method to measure the size of the rule
set and the size of the instances that are not covered by the rules, and then to select the
smallest set of rules that produces the fewest exceptions.
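As a toy sketch of this idea (the per-item bit costs below are made-up constants, not a real coding scheme), the description length of a theory can be scored as the cost of its rules plus the cost of its exceptions:

```python
def description_length(rules, exceptions, rule_bits=4.0, exception_bits=6.0):
    """Toy MDL score: bits to transmit the rule set plus bits to
    transmit the instances it leaves as exceptions. The per-item
    costs are illustrative constants only."""
    return len(rules) * rule_bits + len(exceptions) * exception_bits

# A slightly larger theory with no exceptions can still compress the
# data better than a single rule that leaves three exceptions.
two_rules = description_length(rules=["r1", "r2"], exceptions=[])
one_rule = description_length(rules=["r1"], exceptions=["e1", "e2", "e3"])
```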
In the reduced-error pruning approaches, the training data is split into two parts: a growing
set and a pruning set. At the first step, the growing set is used to form a rule
using the basic covering algorithm. Then a test is performed on the rule using the pruning set,
and the effect is evaluated by seeing whether the rule also performs well on the pruning set or
not. Based on the timing of pruning, two variants are reduced-error pruning and incremental
reduced-error pruning. The normal reduced-error pruning is to apply the growing set to build
the complete set of classification rules and then to use the pruning set to evaluate the antecedent
of each rule in order to omit useless ones. The incremental version prunes a rule immediately,
by checking whether the current antecedent (test) is effective or not and throwing it out if its
performance on the pruning set is not good enough.
The statistical significance approach uses statistical criteria to decide the effectiveness of
adding an antecedent into the rule. One common statistical significance is to apply the
hypergeometric distribution or binomial distribution to calculate the probability that the rule
will be produced. The lower probability the rule has, the more significant (the better) the rule is.
Therefore, by incorporating this criterion into the covering algorithm, it is possible to compare the
probabilities (statistical significances) of the rules when an antecedent is added or not added
to the rule. The hypergeometric distribution of a rule (if A then C) indicates how
likely this rule will be generated by chance. Suppose that a data set contains P positive instances and N
negative instances (i.e., the total number of instances is T = P + N) and a rule (if A then C)
covers p positive instances, while its antecedent A is satisfied by t
instances (i.e., the number of negative instances covered by this rule is n = t − p). Figure 3-26 shows the
conceptual diagram for the hypergeometric distribution in rule significance calculation. Following
combinatorial theory, to generate the rule we need to select t instances from the T instances in the
whole data set, composed of p positive instances chosen from the P positive instances and t − p
negative instances chosen from the T − P negative instances. The probability of the rule based
on the hypergeometric distribution is therefore

Pr(R) = C(P, p) × C(T − P, t − p) / C(T, t)

where C(n, k) = n! / (k! (n − k)!) denotes the number of ways to choose k items from n.
Figure 3-26: Conceptual diagram for hypergeometric distribution for rule significance
The significance of a rule (R) can be defined as the probability that a randomly generated rule
covering the same number of instances, t, performs at least as well, i.e. covers at least p
positive instances. This is called the statistical significance of the rule, m(R):

m(R) = Σ (i = p to min(t, P)) C(P, i) × C(T − P, t − i) / C(T, t)
As mentioned above, in the process of the covering algorithm, a rule is revised by adding an
antecedent to increase the accuracy. Here, let R be the current rule and R- be the rule without the
last additional antecedent. When we refine the rule with an additional antecedent, the rule will
become more specific and cover a smaller set of instances. At each refinement step, it is possible
to calculate the significance of the rule before and after the refinement. If the significance
increases (i.e., m(R) < m(R-), since a smaller probability means a more significant rule), we
should add the antecedent to the rule. Otherwise, we should not, and the rule refinement
process should be suspended. The pseudocode of the PRISM rule learner
with rule significance testing, known as INDUCT algorithm, is shown in the Algorithm 3.2 below.
Algorithm 3.2. The PRISM algorithm with significance testing (called INDUCT algorithm)
INITIALIZE E to the instance set (the training set)
WHILE E contains instances {
FOREACH class C which E contains an instance {
CREATE a rule R with an empty left-hand side that
predicts class C
UNTIL R is perfect (or there are no more attributes) {
FOREACH attribute A not mentioned in R, and each value v {
CONSIDER ADDING the condition A=v to the LHS of R
}
SELECT the A and v that maximize the accuracy p/t
(break ties by choosing the condition with the largest p)
CALCULATE significance m(R) for the rule R with A=v added
CALCULATE significance m(R-) for the rule R with the
final condition omitted
IF (m(R-) < m(R)) { LEAVE UNTIL-LOOP } ELSE { ADD A=v to R }
}
}
COMPARE the rules generated for different classes to select
the most significant rule (i.e. the one with smallest m(R))
ADD the most significant rule R into the set of output rules
REMOVE the instances covered by R from E
}
To elaborate the process, we calculate the rule significance for the second rule (R2),
compared to the rule without the last antecedent (R2-) as follows. Here, there are 12 positive
instances (P=12) in the total 32 instances (T=32) as shown in the original data set.
(R2) If blood = ‘high’ & heart = ‘slow’ then diseased = ‘positive’ p/t = 8/8 (=1.0)
(R2-) If blood = ‘high’ then diseased = ‘positive’ p/t = 8/16 (=0.5)
The significance level of the rule (R2) is as follows. Here, T=32, P=12, t=8, and p=8:

m(R2) = C(12, 8) × C(20, 0) / C(32, 8) = 495 / 10,518,300 ≈ 0.000047

The significance level of the rule (R2-) is as follows. Here, T=32, P=12, t=16, and p=8:

m(R2-) = Σ (i = 8 to 12) C(12, i) × C(20, 16 − i) / C(32, 16) = 82,158,603 / 601,080,390 ≈ 0.1367
Since m(R2) < m(R2-), we should add the antecedent into the rule. Therefore, R2 is accepted as
the refinement.
(R2) If blood = ‘high’ & heart = ‘slow’ then diseased = ‘positive’ p/t = 8/8 (=1.0)
However, although not shown here, in cases where m(R2) > m(R2-), we would omit the last
antecedent from the rule.
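These significance computations can be checked with a short sketch of the hypergeometric calculation described above:

```python
from math import comb

def hypergeom_prob(T, P, t, p):
    """Probability that t instances drawn from T (of which P are
    positive) contain exactly p positives."""
    return comb(P, p) * comb(T - P, t - p) / comb(T, t)

def significance(T, P, t, p):
    """m(R): probability that a random rule covering t instances
    does at least as well, i.e. covers p or more positives."""
    return sum(hypergeom_prob(T, P, t, i)
               for i in range(p, min(t, P) + 1))

m_r2 = significance(32, 12, 8, 8)         # R2:  blood=high & heart=slow
m_r2_minus = significance(32, 12, 16, 8)  # R2-: blood=high only
```

Since m_r2 is far smaller than m_r2_minus, the refined rule is the more significant one and the added antecedent is kept, as in the text.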
Besides this most basic algorithm, some popular variations include AQ (Michalski, 1969),
CN2 (Clark and Niblett, 1989), and RIPPER (Cohen, 1995). The descriptions of AQ and CN2
algorithm are attached in Algorithm 3.3 and 3.4, respectively. Michalski’s AQ and related
algorithms were inspired by methods used by electrical engineers for simplifying Boolean
circuits (Higonnet & Grea, 1958). They exemplify the specific-to-general approach, and typically start with
a maximally specific rule for assigning cases to a given class. In the first example of the AQ algorithm,
a set of examples of the class MAMMAL in a taxonomy of vertebrates is provided in order to learn a set of
rules for classifying or characterizing objects in the class MAMMAL. Starting from the most
specific example, the generalization process (a bottom-up process) is performed. In contrast with
this, CN2 and RIPPER are top-down approaches. The CN2 algorithm aims to modify the basic AQ
algorithm in such a way as to equip it to cope with noise and other complications in the data. In
particular, during its search for good complexes, CN2 does not automatically remove from
consideration a candidate that is found to include one or more negative examples. Rather, it
retains in its search a set of complexes that are evaluated statistically as covering a large number
of examples of a given class and few of other classes. Moreover, the manner in which the search is
conducted is general-to-specific. Each trial specialization step takes the form of either adding a
new conjunctive term or removing a disjunctive one. Having found a good complex, the algorithm
removes those examples it covers from the training set and adds the rule “if <complex> then
predict <class>” to the end of the rule list. The process terminates for each given class when no
more acceptable complexes can be found. As shown in Algorithm 3.4, the CN2 algorithm has the
following main features: (1) the dependence on specific training examples during search (a
feature of the AQ algorithm) is removed; (2) it combines the efficiency and ability to cope with
noisy data of decision tree learning with the if-then rule form and flexible search strategy of the
AQ family; (3) it contrasts with other approaches to modify AQ to handle noise in that the basic
AQ algorithm itself is generalized rather than “patched” with additional pre- and post-processing
techniques; and (4) it produces both ordered and unordered rules.
Sponsored by AIAT.or.th and KINDML, SIIT
Algorithm 3.3. The AQR algorithm for generating a class cover (a set of classification rules)

LET pos be a set of positive examples of class C.
LET neg be a set of negative examples of class C.

PROCEDURE AQR(pos, neg){
  LET cover be the empty cover.
  WHILE cover does not cover all examples in pos {
    SELECT a seed (a positive example not covered by cover).
    LET star be STAR(seed, neg) (a set of complexes that
        cover seed but cover no examples in neg).
    LET best be the best complex in star
        according to user-defined criteria.
    ADD best as an extra disjunct to cover.
  }
  RETURN cover.
}

PROCEDURE STAR(seed, neg){
  LET star be the set containing the empty complex.
  WHILE any complex in star covers some negative examples in neg {
    SELECT a negative example Eneg covered by a complex in star.
    SPECIALIZE complexes in star to exclude Eneg by:
      LET extension be all selectors that cover seed but not Eneg.
      LET star be the set {x ∧ y | x ∈ star, y ∈ extension}.
      REMOVE all complexes in star subsumed by other complexes.
    REPEAT UNTIL size-of-star < maxstar (a user-defined maximum):
      REMOVE the worst complex from star.
  }
  RETURN star.
}
Algorithm 3.4. The CN2 algorithm for generating a class cover (a set of classification rules)

LET e be a set of classified examples.
LET selectors be the set of all possible selectors.

PROCEDURE CN2(e){
  LET rule-list be the empty list.
  REPEAT UNTIL best_complex is nil or e is empty:
    LET best_complex be FIND-BEST-COMPLEX(e).
    IF best_complex is not nil,
    THEN LET e' be the examples covered by best_complex.
         REMOVE from e the examples e' covered by best_complex.
         LET C be the most common class of examples in e'.
         ADD the rule 'IF best_complex THEN the class is C'
             TO the end of rule-list.
  RETURN rule-list.
}

PROCEDURE FIND-BEST-COMPLEX(e){
  LET star be the set containing the empty complex.
  LET best_complex be nil.
  WHILE star is not empty {
    SPECIALIZE all complexes in star as follows:
      LET newstar be the set {x ∧ y | x ∈ star, y ∈ selectors}.
      REMOVE all complexes in newstar that are either in star (i.e., the
          unspecialized ones) OR null (e.g., big=y ∧ big=n).
    FOR every complex Ci in newstar:
      IF Ci is statistically significant and better than
         best_complex by user-defined criteria when tested on e,
      THEN replace the current value of best_complex by Ci.
    REPEAT UNTIL size of newstar < user-defined maximum:
      REMOVE the worst complex from newstar.
    LET star be newstar.
  }
  RETURN best_complex.
}
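To make the covering idea that AQR and CN2 share concrete, the following sketch greedily grows one conjunctive complex at a time, removes the examples it covers, and appends the rule to an ordered rule list. The toy dataset, the purity score, and the simple hill-climbing search are illustrative assumptions; the beam search and statistical significance test of the real CN2 algorithm are omitted for brevity.

```python
def covers(complex_, example):
    """A complex is a set of (attribute, value) selectors; all must match."""
    return all(example[a] == v for a, v in complex_)

def score(complex_, examples, target_class):
    """Purity of the covered set: fraction of covered examples in target_class."""
    covered = [ex for ex in examples if covers(complex_, ex)]
    if not covered:
        return -1.0
    return sum(ex["class"] == target_class for ex in covered) / len(covered)

def find_best_complex(examples, target_class):
    """General-to-specific hill climbing: add one selector at a time."""
    selectors = {(a, ex[a]) for ex in examples for a in ex if a != "class"}
    current = frozenset()
    improved = True
    while improved:
        improved = False
        for sel in selectors - current:
            cand = current | {sel}
            if score(cand, examples, target_class) > score(current, examples, target_class):
                current, improved = cand, True
    return current if current else None

def covering(examples, target_class):
    """Sequential covering: learn a rule, remove covered examples, repeat."""
    rule_list, remaining = [], list(examples)
    while any(ex["class"] == target_class for ex in remaining):
        best = find_best_complex(remaining, target_class)
        if best is None:
            break
        rule_list.append((best, target_class))
        remaining = [ex for ex in remaining if not covers(best, ex)]
    return rule_list

data = [
    {"outlook": "sunny", "windy": "no",  "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rain",  "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "no",  "class": "play"},
]
rules = covering(data, "play")  # ordered list of (complex, class) pairs
```

On this toy dataset the learner finds two single-selector rules that together cover all positive examples and no negative one; a real implementation would also test each candidate complex for statistical significance before accepting it.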
3.1.6. Artificial Neural Networks
The field of artificial neural networks (ANNs) was originally proposed by psychologists and neurobiologists who attempted to develop and test computational analogues of neurons. To date, ANNs have been applied to imitate human abilities such as the use of language (speech) and concept learning, as well as to many practical commercial, scientific, and engineering tasks of pattern recognition, modeling, and prediction (Hertz et al., 1991; Wasserman, 1989). In general, an ANN consists of connected input/output units and their weighted connections. During the learning phase, the network is gradually adapted by adjusting the weights among units in order to obtain the weights that best predict the correct class label of an input. A well-known neural network learning algorithm is backpropagation, also referred to as connectionist learning due to the connections between units. Since neural networks involve long training times, they may not be suitable for applications that require real-time learning. In many cases, a number of parameters, such as the network topology or structure, are best determined empirically. One more criticism of neural networks is their poor interpretability: it is difficult to interpret the meaning behind the learned weights and the "hidden units" in the network. This poor interpretability has made ANNs less desirable for data mining, although a number of techniques have recently been developed for extracting rules from trained neural networks. On the other hand, ANNs have advantages in their high tolerance of noisy data as well as their ability to classify patterns on which they have not been trained. ANNs can be used when little knowledge is available about the relationships between attributes and classes. In addition, they are well suited to continuous-valued inputs and outputs, unlike most decision tree algorithms. Their successful applications on a wide array of real-world data include handwritten character recognition, pathology and laboratory medicine, and training a computer to pronounce English text. Moreover, parallelization techniques can be applied to neural network algorithms quite straightforwardly, which speeds up the computation process. These factors contribute to the usefulness of neural networks for classification and prediction in data mining. Currently, there are many different kinds of neural network algorithms.
As mentioned above, the most popular neural network algorithm is backpropagation, invented in the 1980s for multilayer feed-forward neural networks. Backpropagation iteratively learns a set of weights for predicting the class labels of tuples. A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer, as shown in Figure 3-27. Each of the input, hidden, and output layers is composed of a number of nodes (units). The inputs to the network correspond to the values of the attributes measured for each training tuple. The inputs are fed simultaneously into the nodes of the input layer. After these inputs pass through the input layer, they are weighted (with a weight w_ij from node i to node j) and fed to the nodes of the second layer, a hidden layer. The outputs of the hidden layer may in turn become inputs to another hidden layer, and so on. Although the number of hidden layers can be arbitrary, the most widely used network has only one hidden layer, as shown in Figure 3-27. The outputs of the last hidden layer are also weighted (with a weight w_jk from node j to node k) and used as inputs to the nodes making up the output layer, which finally emits the prediction for given tuples. The nodes in the input layer are called input nodes. The nodes in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output nodes. A multilayer neural network with two such layers is called a two-layer neural network; normally the input layer is not counted, because it serves only to pass the input values to the next layer. Similarly, a network containing two hidden layers is called a three-layer neural network, and so on. The network is feed-forward in that none of the weights
cycles back to an input node or to an output node of a previous layer. However, it is fully
connected in that each node provides input to each node in the next forward layer. Each output
node takes, as input, a weighted sum of the outputs from nodes in the previous layer. It applies a
nonlinear (activation) function to the weighted input. Multilayer feed-forward neural networks
are able to model the class prediction as a nonlinear combination of the inputs. From a statistical
viewpoint, they perform nonlinear regression. Multilayer feed-forward networks, given enough
hidden units and training samples, can closely approximate any function.
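The forward computation just described can be sketched in a few lines of NumPy. The layer sizes, weights, biases, and input values below are arbitrary illustrative assumptions, not taken from the text.

```python
import numpy as np

def sigmoid(z):
    # nonlinear activation applied to the weighted input
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_ih, b_h, W_ho, b_o):
    """One hidden layer: input -> hidden -> output, fully connected."""
    h = sigmoid(W_ih @ x + b_h)   # hidden-layer outputs
    o = sigmoid(W_ho @ h + b_o)   # output-layer outputs (the prediction)
    return o

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])        # one training tuple (3 attributes)
W_ih = rng.normal(size=(4, 3))        # weights w_ij, input node i -> hidden node j
b_h = np.zeros(4)                     # hidden-layer biases
W_ho = rng.normal(size=(2, 4))        # weights w_jk, hidden node j -> output node k
b_o = np.zeros(2)                     # output-layer biases
y = forward(x, W_ih, b_h, W_ho, b_o)  # two class scores, each in (0, 1)
```

Each layer is just a weighted sum plus a bias, passed through the activation function; stacking more hidden layers repeats the same pattern.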
Figure 3-27: The output calculation for a hidden- or output-layer unit j. The inputs to unit j are the outputs from the previous layer. These inputs are multiplied by their corresponding weights to form a weighted sum, which is then added to the bias associated with unit j before a nonlinear activation function is applied to produce the final output. Note that this calculation occurs at each node in the hidden layer and the output layer; although the figure shows the calculation for the last node in the hidden layer, the same procedure applies to all other nodes in the hidden and output layers.
The main step in learning an ANN is the calculation of the weight of each connection in the network. As mentioned above, this learning can be done by the process named backpropagation. Backpropagation iteratively processes each tuple (data item) from the dataset of training tuples, comparing the network's prediction for each tuple with the actual known target value, in order to adjust the weights of the connections. The target value is the known class label of the training tuple for classification, or a continuous value for prediction. As each training tuple enters the network for learning, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer, down to the first hidden layer; because of this property, the method is called backpropagation. In general, however, the weights are not guaranteed to converge, and the learning process is usually stopped by a terminating condition.
Algorithm 3.5. Backpropagation: neural network learning for classification or prediction

INPUT:  T, a training dataset; l, the learning rate;
        network, a multilayer feed-forward network.
OUTPUT: A trained neural network.
PROCEDURE:
(1)  INITIALIZE all weights and biases in network;
(2)  WHILE terminating condition is NOT satisfied {
(3)    FOREACH training tuple X ∈ T {
(4)      // PROPAGATE the inputs forward:
(5)      FOREACH input-layer unit i ∈ [1,n] {
(6)        O_i = I_i; }                     // for each input unit, output = actual input value
(7)      FOREACH hidden-layer unit j ∈ [1,m] {
(8)        FOREACH input-layer unit i ∈ [1,n] {
(9)          I_j = Σ_i w_ij O_i + b_j; }    // compute the net input into hidden unit j
                                            // with respect to the previous layer, i
(10)       O_j = 1/(1+e^(-I_j)); }          // compute the output of each hidden unit j
(11)     FOREACH output-layer unit k ∈ [1,p] {
(12)       FOREACH hidden-layer unit j ∈ [1,m] {
(13)         I_k = Σ_j w_jk O_j + b_k; }    // compute the net input into output unit k
                                            // with respect to the previous layer, j
(14)       O_k = 1/(1+e^(-I_k)); }          // compute the output of each output unit k
(15)     // Backpropagate the errors:
(16)     FOREACH unit k in the output layer
(17)       Err_k = O_k (1 - O_k)(T_k - O_k);      // compute the error from
                                                  // the target value T_k
(18)     FOREACH unit j in the hidden layers,
         from the last to the first hidden layer
(19)       Err_j = O_j (1 - O_j) Σ_k Err_k w_jk;  // compute the error with respect to
                                                  // the next higher layer, k
(20)     FOREACH weight w_ij in network {
(21)       Δw_ij = l · Err_j · O_i;         // weight increment
(22)       w_ij = w_ij + Δw_ij; }           // weight update
(23)     FOREACH bias b_j in network {
(24)       Δb_j = l · Err_j;                // bias increment
(25)       b_j = b_j + Δb_j; }              // bias update
(26) }}
Algorithm 3.5 illustrates an algorithm for learning a neural network using backpropagation. Given a training set of objects and their associated class labels, denoted by T, each object is represented by an n-dimensional attribute vector X = (x_1, x_2, ..., x_n), depicting the measured values of the n attributes A_1, A_2, ..., A_n of the object, with its class y, one from m possible classes (or, for prediction, its actual value y, a real number). Here, suppose that the class attribute has m possible values C_1, C_2, ..., C_m; that is, y ∈ {C_1, C_2, ..., C_m}. In the common structure of an ANN, each input node corresponds to one attribute x_i, and each output node corresponds to one class C_k (or to a target value y).
The algorithm may look complicated, but with careful investigation each step is inherently simple. In summary, the backpropagation algorithm performs learning on a multilayer feed-forward neural network of the kind described above: the inputs, corresponding to the attributes measured for each training tuple, are fed simultaneously into the units making up the input layer; they are then weighted and fed forward through one or more hidden layers of "neuron-like" units; and the weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction for given tuples. The errors of that prediction are then propagated backwards through the network to update the weights. In ANNs, backpropagation is the best-known algorithm for learning the weights in such a network.
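The update rules of Algorithm 3.5 can be sketched in NumPy for a single hidden layer with sigmoid units and a squared-error objective. The dataset, layer sizes, and learning rate below are illustrative assumptions; repeated small steps along these updates should reduce the squared error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.01):
    """One backpropagation update for a 1-hidden-layer sigmoid network (MSE)."""
    # Propagate the inputs forward
    h = sigmoid(W1 @ x + b1)              # hidden outputs O_j
    o = sigmoid(W2 @ h + b2)              # network outputs O_k
    # Backpropagate the errors
    err_o = o * (1 - o) * (t - o)         # Err_k = O_k (1 - O_k)(T_k - O_k)
    err_h = h * (1 - h) * (W2.T @ err_o)  # Err_j = O_j (1 - O_j) sum_k Err_k w_jk
    # Weight and bias updates: increment = lr * Err * (output of previous layer)
    W2 = W2 + lr * np.outer(err_o, h)
    b2 = b2 + lr * err_o
    W1 = W1 + lr * np.outer(err_h, x)
    b1 = b1 + lr * err_h
    return W1, b1, W2, b2

def loss(x, t, W1, b1, W2, b2):
    o = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)
    return float(np.sum((t - o) ** 2))

rng = np.random.default_rng(1)
x, t = np.array([1.0, 0.0]), np.array([1.0])   # one training tuple and its target
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)  # input -> hidden
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # hidden -> output
before = loss(x, t, W1, b1, W2, b2)
for _ in range(100):
    W1, b1, W2, b2 = backprop_step(x, t, W1, b1, W2, b2)
after = loss(x, t, W1, b1, W2, b2)
```

A full trainer would loop over all tuples and epochs and check a terminating condition; this sketch only shows that the per-tuple updates of lines (16)-(25) move the prediction toward the target.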
3.1.7. Support Vector Machines (SVMs)
Support Vector Machines (SVMs), proposed by Boser, Guyon, and Vapnik (1992), are supervised learning methods that analyze data and recognize patterns; they can be used for classification and regression on both linear and nonlinear data. Although the training time of even the fastest SVMs can be extremely slow, they are highly accurate, owing to their ability to model complex nonlinear decision boundaries, and they are much less prone to overfitting than many other methods. The support vectors found also provide a compact description of the learned model. SVMs can be used for prediction as well as classification. They have been applied to a number of areas, including handwritten digit recognition, object recognition, and speaker identification, as well as benchmark time-series prediction tests.
Given a set of input data, the standard SVM predicts to which of two possible classes a given input belongs; the SVM is thus known as a non-probabilistic binary linear classifier. In other words, an SVM takes a set of training examples, each marked with one of two classes, and builds a model that predicts whether a new example falls into one class (category) or the other. If we represent the examples as points in space, the learned optimal SVM is a model that divides the examples of the two classes with as large a gap as possible. After the best model is acquired, a new example is classified by mapping it into that same space and predicting the class it should belong to, based on which side of the gap it falls on.

In a two-class, linearly separable classification problem, there may be several possible decision boundaries, and these decision boundaries are not equally good (see Figure 3-28). In general, the perceptron algorithm (an artificial neural network) or a genetic algorithm can find such a boundary, but there is also an analytical approach to this problem.
Figure 3-28: Examples of bad decision boundaries
Figure 3-29: Linear separating hyperplanes for the separable case. The support vectors are circled, together with a concrete example of linear separating hyperplanes.
To find the optimal separation, in a formal description, an SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. The best SVM model is the one that gives the best separation: a hyperplane that has the largest distance to the nearest training data points of the two classes (the so-called functional margin). The larger the margin, the lower the generalization error of the classifier. Figure 3-29 shows the linear separating hyperplanes for the separable case; the support vectors are highlighted with large circles, together with a concrete example of the hyperplanes in a 2-D case. Intuitively, the decision boundary should be as far away from the data of both classes as possible. This property implies the maximization of the margin. The formal description can be given as follows.
Given the training data {(x_i, y_i)}, i = 1, ..., N, where each x_i is a datum represented by a vector of dimension n and y_i is a binary class label of -1 or +1, the support vector machine finds the best hyperplane which separates the positive from the negative examples (a "separating hyperplane"). In principle, the points x on the hyperplane satisfy the formula w · x + b = 0, where w is a normal vector, that is, perpendicular to the hyperplane; |b|/||w|| is the perpendicular distance from the hyperplane to the origin; and ||w|| is the Euclidean norm of w. Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) example. The margin of a separating hyperplane is d+ + d-. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

    x_i · w + b ≥ +1   for y_i = +1
    x_i · w + b ≤ -1   for y_i = -1

These two constraints can be combined into one set of inequalities as follows:

    y_i (x_i · w + b) - 1 ≥ 0   for all i

Consider the points satisfying the equality in the first equation, i.e., x_i · w + b = +1. These points lie on the hyperplane H1: x · w + b = +1, with normal w and perpendicular distance from the origin |1 - b|/||w||. Similarly, the points for which the equality in the second equation holds lie on the hyperplane H2: x · w + b = -1, with the same normal w and perpendicular distance from the origin |-1 - b|/||w||. Hence, d+ = d- = 1/||w||, and the margin is simply 2/||w||. Note that H1 and H2 are parallel, since they have the same normal, and that no training points fall between them. Thus, we can find the pair of hyperplanes which gives the maximum margin by minimizing ||w||^2, subject to the constraints in the third equation. The solution for a typical two-dimensional case is expected to have the graphical representation shown in Figure 3-29. The points in the training data set satisfying the equality in the third equation (i.e., those which lie on one of the hyperplanes H1, H2) are called support vectors. They are circled in the figure.
Based on this formulation, the decision boundary can be found by solving the following constrained optimization problem:

    Minimize    (1/2) ||w||^2
    Subject to  y_i (x_i · w + b) ≥ 1   for all i

This is a constrained optimization problem, and it is not easy to solve directly, since it requires some complex transformations.
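A small numeric check may make the constraints concrete. The data and the hyperplane (w, b) below are hand-picked assumptions, not the output of a solver; the sketch only verifies that every point satisfies y_i (x_i · w + b) ≥ 1 and computes the margin 2/||w||.

```python
import numpy as np

# Toy 2-D linearly separable data with a hand-picked separating hyperplane
# w.x + b = 0 (illustrative values, not learned by any optimizer).
X = np.array([[2.0, 2.0], [3.0, 3.0],    # class +1
              [0.0, 0.0], [-1.0, 0.0]])  # class -1
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])                 # normal vector of the hyperplane
b = -1.0

# Every training point must satisfy y_i (w.x_i + b) >= 1; points where the
# value equals exactly 1 lie on H1 or H2 and would be the support vectors.
functional = y * (X @ w + b)
# The geometric gap between H1 and H2 is 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)
```

Here the points (2, 2) and (0, 0) attain the value 1 exactly, so they lie on H1 and H2; shrinking ||w|| further, while keeping all constraints satisfied, is exactly what widens the margin.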
Towards a solution, the problem of finding the support vectors can be cast into a Lagrangian formulation. There are two reasons for this reformulation. The first is that the constraints will be replaced by constraints on the Lagrange multipliers themselves, which are much easier to handle. The second is that in this reformulation the training data will appear (in the actual training and test algorithms) only in the form of dot products between vectors. This is a crucial property, which will allow us to generalize the procedure to the nonlinear case. The steps are shown in the following.
Consider the following general optimization problem: minimize f(x) subject to g(x) = 0. A necessary condition for a point x_0 to be a solution is as follows, where α is the Lagrange multiplier:

    ∂/∂x [ f(x) + α g(x) ] = 0   at x = x_0,   with g(x_0) = 0

For multiple constraints g_i(x) = 0, i = 1, ..., N, it is possible to use one Lagrange multiplier α_i for each of the constraints:

    ∂/∂x [ f(x) + Σ_i α_i g_i(x) ] = 0   at x = x_0,   with g_i(x_0) = 0 for all i

In the more general case of inequality constraints g_i(x) ≤ 0, the formula changes only slightly: although it is similar to the equality case, this case requires the Lagrange multipliers to be non-negative. Here, the problem is to minimize f(x) subject to g_i(x) ≤ 0 for i = 1, ..., N. There must exist α_i ≥ 0, i = 1, ..., N, such that the solution x_0 satisfies the following condition:

    ∂/∂x [ f(x) + Σ_i α_i g_i(x) ] = 0   at x = x_0,   with g_i(x_0) ≤ 0 and α_i ≥ 0 for all i

The function L(x, α) = f(x) + Σ_i α_i g_i(x) is known as the Lagrangian, and we would like to set its gradient to 0.

We can now apply this general optimization problem to the decision-boundary problem of the support vector machine by setting f(w) = (1/2) ||w||^2 and g_i(w, b) = 1 - y_i (x_i · w + b), as follows:

    Minimize    (1/2) ||w||^2
    Subject to  1 - y_i (x_i · w + b) ≤ 0   for i = 1, ..., N
The Lagrangian and its application to SVM

The Lagrangian can be summarized as follows:

    L(w, b, α) = (1/2) ||w||^2 - Σ_i α_i [ y_i (x_i · w + b) - 1 ]
Note that α_i ≥ 0. Here, we set the gradient of the Lagrangian L with respect to w and b to zero, as follows. First, we perform partial differentiation on the Lagrangian with respect to w:

    ∂L/∂w = w - Σ_i α_i y_i x_i = 0   ⟹   w = Σ_i α_i y_i x_i

Second, we perform partial differentiation on the Lagrangian with respect to b:

    ∂L/∂b = - Σ_i α_i y_i = 0   ⟹   Σ_i α_i y_i = 0

At this point, we substitute w = Σ_i α_i y_i x_i into the Lagrangian L. Then we will have the following equation:

    L = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j) - b Σ_i α_i y_i

Since Σ_i α_i y_i = 0, the following formula is derived:

    L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)

After this transformation, the new objective function is in terms of the α_i only. This is known as the dual problem: if we know w, then we know all α_i, and vice versa. The original problem is known as the primal problem, while the new function is known as the dual problem. Moreover, this particular dual formulation of the problem is called the Wolfe dual (Fletcher, 1987): we maximize L with respect to the α_i while simultaneously requiring that the derivatives of L with respect to w and b vanish, all subject to the constraints α_i ≥ 0. This is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of N simultaneous linear constraints defines the intersection of N convex sets, which is also a convex set). The objective function of the dual problem to be maximized is as follows:

    Maximize    L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
    Subject to  Σ_i α_i y_i = 0,   α_i ≥ 0 for i = 1, ..., N
This is known as a quadratic programming (QP) problem, and a global maximum of the α_i can always be found. Then the normal vector can be recovered by w = Σ_i α_i y_i x_i. In this problem, most α_i are normally zero, so that w is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, as in the construction of a k-NN classifier. The data points x_i with non-zero α_i are called support vectors (SVs), and the decision boundary is determined only by the SVs. Figure 3-30 shows the graphical interpretation of support vectors and their Lagrange multipliers.
Figure 3-30: The graphical interpretation of support vectors and their Lagrange multipliers.
Here, let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write

    w = Σ_{j=1..s} α_{t_j} y_{t_j} x_{t_j}

It is then simple to test whether a new datum z belongs to one of the two classes, say Class 1 or Class 2, by calculating the following formula:

    f(z) = w · z + b = Σ_{j=1..s} α_{t_j} y_{t_j} (x_{t_j} · z) + b

If the output is positive, the new datum is classified as Class 1; otherwise, Class 2. Note that since the normal w can be expressed in terms of the support vectors, the testing can be done locally by computing the dot products between the new datum z and the support vectors x_{t_j}, weighted by the class y_{t_j} and the Lagrange multiplier α_{t_j} of each support vector.
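A minimal sketch of this local test, assuming a 1-D toy problem whose separating hyperplane is known analytically (two points at -1 and +1 give w = 1, b = 0, so α = 0.5 for both support vectors; these values are assumptions for illustration, not solver output):

```python
import numpy as np

# Classify a new point using only the support vectors:
#   f(z) = sum_j alpha_j y_j (x_j . z) + b
sv_x = np.array([[-1.0], [1.0]])  # the support vectors
sv_y = np.array([-1.0, 1.0])      # their classes
alpha = np.array([0.5, 0.5])      # Lagrange multipliers (non-zero only for SVs)
b = 0.0

def decide(z):
    # dot products between the new datum and each support vector,
    # weighted by class and Lagrange multiplier, plus the bias
    s = np.sum(alpha * sv_y * (sv_x @ z).ravel()) + b
    return 1 if s > 0 else -1
```

For this toy problem f(z) reduces to z itself, so points right of the origin go to Class 1 and points left of it to Class 2; nothing besides the support vectors, their multipliers, and b is needed at test time.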
As mentioned previously, finding a global maximum of L_D is a quadratic programming (QP) problem. Several solution methods have been proposed, such as LOQO, CPLEX, and so on; some of them are available online, for example at http://www.numerical.rl.ac.uk/qp/qp.html. They generally adopt a so-called interior-point approach, which starts with an initial solution that may violate the constraints and then tries to improve this solution by optimizing the objective function and/or reducing the amount of constraint violation. For SVMs, sequential minimal optimization (SMO), a quadratic programming solver that works on only two variables at a time, seems to be the most popular. The SMO method repetitively selects a pair of multipliers α_i, α_j and solves the QP with these two variables, repeating the selection until convergence. In practice, we can treat the QP solver as a "black box" without worrying about how it works.
Soft Margin
In several problems, it is not possible to find a linearly separating hyperplane. In those cases, it is possible to allow errors in the marginal area. In theoretical terms, we can allow "error" in classification; this allowance is known as a "soft margin." The error of each sample is based on the output of the discriminant function w · x + b. So-called slack variables ξ_i are provided for all misclassified samples in the area between the two classes. Figure 3-31 shows the graphical interpretation of the soft margin, when some errors (misclassifications) are allowed in the marginal area.
Figure 3-31: The graphical interpretation of soft margin.
The objective is to minimize (1/2) ||w||^2 + C Σ_i ξ_i. Here, the ξ_i are determined by the following constraints:

    x_i · w + b ≥ +1 - ξ_i   for y_i = +1
    x_i · w + b ≤ -1 + ξ_i   for y_i = -1,   with ξ_i ≥ 0

Here, the ξ_i are "slack variables" in the optimization. Note that ξ_i = 0 if there is no error for x_i, and that Σ_i ξ_i is an upper bound on the number of training errors.
Imitating the original formulation, a problem with a soft margin can be formulated as follows, where C is called a tradeoff parameter between error and margin:

    Minimize    (1/2) ||w||^2 + C Σ_i ξ_i
    Subject to  y_i (x_i · w + b) ≥ 1 - ξ_i   and   ξ_i ≥ 0   for i = 1, ..., N
With the new constraints, the objective function of the dual problem to be maximized is as follows:

    Maximize    L_D = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
    Subject to  Σ_i α_i y_i = 0,   0 ≤ α_i ≤ C for i = 1, ..., N

As in the original problem, w is recovered as w = Σ_i α_i y_i x_i. This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the α_i. Given a tradeoff value C, we can use a quadratic programming (QP) solver to find the optimal α_i. This upper bound C is best determined experimentally.
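The role of the slack variables can be illustrated numerically. The data, hyperplane, and tradeoff C below are hand-picked assumptions; the sketch computes the minimal slack ξ_i = max(0, 1 - y_i (w · x_i + b)) satisfying the constraints, and the soft-margin objective.

```python
import numpy as np

# Soft-margin bookkeeping for a fixed hyperplane (w, b chosen by hand):
#   slack    xi_i = max(0, 1 - y_i (w.x_i + b))
#   objective     = ||w||^2 / 2 + C * sum_i xi_i
X = np.array([[2.0], [1.5], [-2.0], [0.5]])  # last point falls inside the margin
y = np.array([1, 1, -1, 1])
w, b, C = np.array([1.0]), 0.0, 10.0

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))  # slack variables
objective = 0.5 * float(w @ w) + C * float(np.sum(xi))
```

Only the point inside the margin receives a non-zero slack (xi = 0.5 here); since each misclassified point would need xi > 1, the sum of slacks upper-bounds the number of errors, and a larger C penalizes such violations more heavily relative to the margin term.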
Non-linear Assumption – Linearly Inseparable Space
In the original problem setting of a finite-dimensional space, the space often cannot be linearly separated into two sub-spaces such that each sub-space includes only members of one class. As mentioned above, it is possible to use the concept of a soft margin to allow some members of a sub-space to be located in the opposite sub-space. However, in several cases the boundary between the two classes should not be assumed to be linear at all. Towards a solution, several works have proposed mapping the original finite-dimensional space into a much higher-dimensional space, presumably making the separation easier in that space. In other words, each point x in the original space is transformed to a point denoted by φ(x) in a higher-dimensional space. In the literature, the original space is called the input space, while the target space is named the feature space. By this transformation, a linear operation in the feature space is equivalent to a non-linear operation in the input space, which enables classification in some tasks. For example, the well-known XOR problem can be solved by introducing a new feature x_1 x_2, where x_1 and x_2 are the two dimensions of the XOR problem: the value of x_1 x_2 is positive when both x_1 and x_2 are positive or both are negative, and negative otherwise. Figure 3-32 shows the graphical interpretation of the transformation into a higher-dimensional space using a function φ. Here, both spaces are shown in a two-dimensional representation; in practice, however, the feature space normally has a higher dimension than the input space, and may even be infinite-dimensional. Each point in the input space is mapped to the feature space by the function φ, but computation in the feature space can be costly because of its high dimension. For this purpose, the kernel trick can be used, as follows.
Figure 3-32: The graphical interpretation of the transformation to a higher-dimensional space. Note that the feature space is of higher dimension than the input space.
In SVM schemes, a mapping into a larger space is used in such a way that dot products in the larger space can be computed easily in terms of the variables in the original space, keeping the computational load reasonable. The dot products in the larger space are defined in terms of a kernel function K(x, y), which can be selected to suit the problem. Note that at this point the important issue is how to select a kernel function that is optimal for the problem.

Before getting to the usage of kernel functions in SVMs, let us explain the semantics of a hyperplane. A hyperplane in the large space can be defined as the set of points whose dot product with a fixed vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations, with parameters α_i, of images of feature vectors which occur in the database. With this choice of hyperplane, the points x in the feature space which are mapped onto the hyperplane are defined by the relation:

    Σ_i α_i K(x_i, x) = constant
Note that if K(x, y) becomes small as y grows further from x, each term in the sum measures the degree of closeness of the test point x to the corresponding database point x_i. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note also that the set of points mapped onto any such hyperplane can be quite convoluted, as a result allowing much more complex discrimination between sets which are far from convex in the original space. Let us recall the SVM optimization problem, as shown below.
Maximize  L(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
Subject to  Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C, for i = 1, ..., N
The term (x_i · x_j) indicates the inner product of each pair of data points. These inner products are summed by the Σ. As long as we can calculate the inner product in the feature space, it is not necessary to define the mapping φ explicitly. In general, it is possible to express the inner product in terms of common geometric operations, such as angles or distances. At this point, we can define a kernel function K by K(x_i, x_j) = φ(x_i) · φ(x_j). This kernel function can be applied to the above equations to derive the following equations.
Maximize  L(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
Subject to  Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C, for i = 1, ..., N
After applying this kernel function, we find a maximal separating hyperplane. The procedure is similar to that described above, with a user-specified upper bound C on the Lagrange multipliers α_i. We can then use a quadratic programming (QP) solver to find the optimal α_i and the support vectors. The upper bound C is best determined experimentally.
In order to grasp the concept of a kernel function, we use the following example in two dimensions with the point x = (x1, x2). The transformation function φ is given as follows.

φ(x) = (1, √2·x1, √2·x2, x1^2, x2^2, √2·x1·x2)

Therefore, the inner product in the feature space can be defined as follows.

φ(x) · φ(y) = (1 + x1·y1 + x2·y2)^2

Here, there is no need to know explicitly what the function φ is; we just define the kernel function as follows.

K(x, y) = (1 + x1·y1 + x2·y2)^2
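The expansion above can be checked numerically. The following sketch (plain NumPy; the point values are arbitrary) verifies that the explicit six-dimensional map φ and the kernel computed directly in the input space agree:

```python
import numpy as np

def phi(x):
    # explicit feature map for K(x, y) = (1 + x.y)^2 in two dimensions
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    # the same quantity computed directly in the input space (the "trick")
    return (1.0 + np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
# the inner product in the 6-D feature space equals the kernel value
assert np.isclose(np.dot(phi(x), phi(y)), K(x, y))
```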
Towards the practical usage of SVM, the user has to specify the kernel function K but can leave the transformation function φ implicitly unknown. The application of a kernel function without knowing the function φ is known as the kernel trick.
Given a kernel function K(x, y), the transformation function φ is given by its eigenfunctions, a concept in functional analysis. However, it is difficult to construct
eigenfunctions explicitly. Therefore, in most cases, we will only specify the kernel function without worrying about the exact transformation. From another point of view, the kernel function, which is equivalent to an inner product, is a similarity measure between the data points (or objects). Many kernel functions have been used in practice, but four common kernels are as follows.
Linear kernel (no transformation): K(x, y) = x · y
Polynomial kernel (with degree k): K(x, y) = (x · y + 1)^k
Gaussian radial basis function (RBF) kernel: K(x, y) = exp(−‖x − y‖^2 / (2σ^2))
Sigmoid kernel (with parameters κ and θ): K(x, y) = tanh(κ (x · y) + θ)
The linear kernel performs no transformation on the original dot product. The polynomial kernel is parameterized with a degree k to form different classifiers. The Gaussian radial basis function (RBF) kernel applies a Gaussian function to the distance between a pair of points. The sigmoid kernel gives a particular kind of two-layer sigmoidal neural network. Note that all the kernels above are symmetric in the sense that K(x, y) = K(y, x).
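The four kernels above can be written down directly; the sketch below (NumPy, with illustrative parameter defaults) also checks the symmetry property K(x, y) = K(y, x):

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def poly_kernel(x, y, k=2):
    return (np.dot(x, y) + 1.0) ** k

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for K in (linear_kernel, poly_kernel, rbf_kernel, sigmoid_kernel):
    assert np.isclose(K(x, y), K(y, x))   # symmetry holds for all four
```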
When a new datum z is to be classified, the original linear SVM determines its class by letting j (j = 1, ..., s) index the s support vectors and calculating the weight w and the function f as follows.

w = Σ_j α_j y_j x_j,    f(z) = w · z + b = Σ_j α_j y_j (x_j · z) + b

If the output is positive, the new datum is classified as Class 1; otherwise, as Class 2. However, with the kernel trick, the weight cannot be formed explicitly, and the function is modified as follows.

f(z) = Σ_j α_j y_j K(x_j, z) + b
Following the same procedure, when the output is positive, the new datum is classified as Class 1; otherwise, it is assigned Class 2. Therefore, we only need to calculate the kernel function between the test datum z and the support vectors x_j.
Note that both the training and the testing of an SVM only require the value of K(x, y). This property means that there is no restriction on the form of x and y. Therefore, x can be any data representation, including a sequence or a tree, instead of a feature vector. Semantically, K(x, y) is just a similarity measure between x and y. However, not every similarity measure can be used as a kernel function. Since the kernel function replaces the dot product of two mapped data points, it is assumed to be symmetric, K(x, y) = K(y, x). It is also required to satisfy the Cauchy–Schwarz inequality, that is, K(x, y)^2 ≤ K(x, x) K(y, y). However, these two properties are not sufficient to guarantee the existence of a feature space. The kernel function needs to satisfy Mercer's condition. This condition states that a necessary
and sufficient condition for a symmetric function K(x, y) to be a kernel is that it be positive semi-definite (PSD). This implies that the n-by-n kernel matrix, in which the (i, j)-th entry is K(x_i, x_j), is always positive semi-definite. This property also means that the quadratic programming (QP) problem is convex and can be solved in polynomial time. Mercer's condition thus makes the kernel guarantee the existence of a feature space. Kernels that satisfy the condition can also be combined: the following four properties are valid [14][16]. Here, K1 and K2 are two arbitrary kernel functions. With these four properties, a new kernel function K can be produced.
1. K(x, y) = a·K1(x, y) + b·K2(x, y),
where a and b are positive scalar values.
2. K(x, y) = a·K1(x, y)·K2(x, y),
where a is a positive scalar value.
3. K(x, y) = exp(K1(x, y)),
where exp(·) is the exponential function.
4. K(x, y) = xᵀAy,
where A is a positive semi-definite matrix.
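Mercer's condition can be probed empirically by building the kernel (Gram) matrix for a sample of points and inspecting its eigenvalues; the sketch below does this for the RBF kernel on random data (the data and σ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 random 3-D points

def rbf(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

# n-by-n kernel (Gram) matrix: the (i, j)-th entry is K(x_i, x_j)
G = np.array([[rbf(a, b) for b in X] for a in X])

assert np.allclose(G, G.T)                    # symmetry
assert np.linalg.eigvalsh(G).min() > -1e-8    # positive semi-definite
```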
An example of the kernel trick
Suppose that five data points are given in a one-dimensional space as follows.
X Y
1 1
2 1
4 -1
5 -1
6 1
Here, assume the polynomial kernel of degree 2, i.e., K(x, y) = (xy + 1)^2, and the tradeoff value C is set to 100. At this point, we first find the α_i (i = 1, 2, ..., 5) by solving the following problem.

Maximize  L(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i x_j + 1)^2
Subject to  Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ 100, for i = 1, ..., 5

By using a QP solver, we can find the solution α = (α_1, ..., α_5) = (0, 2.5, 0, 7.333, 4.833). Note that the constraints, i.e., 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0, are satisfied. The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}. Then the discriminant function is defined as follows.
f(x) = Σ_i α_i y_i K(x_i, x) + b
= 2.5 (1) (2x + 1)^2 + 7.333 (−1) (5x + 1)^2 + 4.833 (1) (6x + 1)^2 + b
= 0.6667 x^2 − 5.333 x + b
The value of b can be recovered by using the boundary condition f(2) = 1, f(6) = 1, or f(5) = −1, since x_2 and x_5 lie on the line f(x) = 1 and x_4 lies on the line f(x) = −1. With any of these three boundary conditions, b equals 9. Therefore, the final discriminant function is as follows.

f(x) = 0.6667 x^2 − 5.333 x + 9
Figure 3-33 shows the graphical interpretation of the discriminant function, whose curve is a parabola. Points for which the discriminant is positive are assigned to one class, whereas those for which it is negative are assigned to the other. Therefore, the discriminant splits the input line into three portions, for Class 1, Class 2, and Class 1, respectively.
Figure 3-33: The graphical interpretation of the discriminant function (parabolic curve).
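The worked example can be replayed numerically. The sketch below uses the multipliers reported above, written as the exact fractions 22/3 ≈ 7.333 and 29/6 ≈ 4.833 so that the constant terms cancel exactly, and checks both the parabola and the classification of the training points:

```python
import numpy as np

X = np.array([1.0, 2.0, 4.0, 5.0, 6.0])          # 1-D training points
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])        # class labels
alpha = np.array([0.0, 2.5, 0.0, 22/3, 29/6])    # QP solution (≈ 0, 2.5, 0, 7.333, 4.833)
b = 9.0

def K(u, v):
    # polynomial kernel of degree 2
    return (u * v + 1.0) ** 2

def f(z):
    # discriminant: sum of alpha_i y_i K(x_i, z) over the data, plus b
    return float(np.sum(alpha * y * K(X, z)) + b)

# the kernel expansion collapses to the parabola 0.6667 x^2 - 5.333 x + 9
for z in (0.0, 1.0, 3.0, 7.0):
    assert abs(f(z) - (0.6667 * z ** 2 - 5.333 * z + 9.0)) < 0.01
# the boundary conditions hold and all five points are classified correctly
assert abs(f(2.0) - 1.0) < 1e-9 and abs(f(5.0) + 1.0) < 1e-9 and abs(f(6.0) - 1.0) < 1e-9
assert all(np.sign(f(z)) == t for z, t in zip(X, y))
```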
Discussion on high-dimensional space and VC-dimension
The kernel trick implies a mapping of the original space to a higher-dimensional space. In several cases, the feature space is assumed to become very high-dimensional. This transformation may trigger an issue called the curse of dimensionality: a classifier in a high-dimensional space has many parameters, which are hard to estimate. Vapnik (1979) argued that the fundamental problem is not the number of parameters to be estimated, but rather the flexibility of the classifier. Typically, a classifier with many parameters is very flexible and can fit many exceptional cases, yet even a classifier with only one parameter can be highly flexible. A classic example is the classifier f(x) = sign(sin(αx)): for any points x_1, ..., x_m, a suitable value of the single parameter α classifies them correctly for every possible combination of class labels. More details can be found in (Vapnik, 1979; Vapnik, 1995; Vapnik, 1998; Law, 2005).
Vapnik argued that the flexibility of a classifier should be characterized not by its number of parameters, but by its capacity. This property is formalized by the so-called "VC-dimension" of a classifier. For example, let us consider a linear classifier in a two-dimensional space with two classes (circle and rectangle) and three data points, as shown in Figure 3-34. Although only three specific cases are shown, in general, if we have three (non-collinear) training data points, no matter how those points are labeled, we can classify them perfectly.
Figure 3-34: Three data points can be perfectly classified.
However, when we add one more point to the space (i.e., four points), there can be labelings that a linear classifier cannot perfectly separate into two classes with a single line, as shown in Figure 3-35.
Figure 3-35: Four data points may not be perfectly classified.
We can observe that the number three (3) is the critical number. The VC-dimension of a linear
classifier in a 2D space is three because, if we have three points in the training set, perfect
classification is always possible irrespective of the labeling, whereas for four points, perfect
classification can be impossible.
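This observation can be verified by brute force: a 2-D labeling is linearly separable exactly when some projection direction puts all points of one class below all points of the other. The following sketch scans a fine grid of directions (the three points chosen are arbitrary non-collinear ones):

```python
import itertools
import numpy as np

def separable(points, labels, n_angles=720):
    # a labeling is linearly separable iff some direction puts every
    # projection of the positive class strictly below the negative class
    pos, neg = points[labels == 1], points[labels == -1]
    if len(pos) == 0 or len(neg) == 0:
        return True                      # only one class: trivially separable
    for t in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        d = np.array([np.cos(t), np.sin(t)])
        if (pos @ d).max() < (neg @ d).min():
            return True
    return False

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# all 2^3 labelings of three non-collinear points are linearly separable
assert all(separable(three, np.array(lab))
           for lab in itertools.product([-1, 1], repeat=3))

xor = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
# the XOR labeling of four points is not separable by any line
assert not separable(xor, np.array([1, 1, -1, -1]))
```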
Let us consider the VC-dimension of other classification methods. For example, the VC-dimension of the nearest-neighbor classifier is said to be infinite since, no matter how many points there are, a 1-nearest-neighbor classifier achieves perfect classification on the training data (provided that identical points always share the same class). In general, the higher the VC-dimension, the more flexible a classifier is. However, the VC-dimension is a theoretical concept, and in practice the VC-dimension of most classifiers is difficult to compute exactly. Conceptually, we can expect that a flexible classifier probably has a high VC-dimension.
The steps of SVM classification can be summarized as follows.
(1) Prepare the training data in the form of a pattern matrix. The training data are given as S = {(x_1, y_1), ..., (x_N, y_N)}, where each x_i is a datum represented by a vector of dimension n, and each y_i is a binary class label of −1 (Class 1) or +1 (Class 2).
(2) Select the kernel function K that we will apply for classification.
(3) Select the values of the parameters in the kernel function and the value of C. This setting can be done manually, or a validation set can be used to determine the parameter values.
(4) Execute the training algorithm (a QP solver) to obtain the α_i and the support vectors by solving the following problem.
Maximize  L(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
Subject to  Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C, for i = 1, ..., N
(5) Classify an unseen datum z by using the acquired α_j and support vectors x_j in the following equation. The sign of f determines the class, i.e., negative for Class 1 and positive for Class 2.
f(z) = Σ_j α_j y_j K(x_j, z) + b
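As a sketch of these five steps, assuming the scikit-learn library is available, the earlier five-point example can be trained and tested as follows (SVC's polynomial kernel with gamma=1 and coef0=1 computes (x·y + 1)^2):

```python
import numpy as np
from sklearn.svm import SVC

# step (1): pattern matrix and class labels (the 1-D example from the text)
X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

# steps (2)-(3): polynomial kernel of degree 2, (x.y + 1)^2, and C = 100
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)

# step (4): the QP solver runs inside fit()
clf.fit(X, y)

# step (5): classify by the sign of the decision function
assert (clf.predict(X) == y).all()
```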
Strengths and Weaknesses of Support Vector Machines
The training process for SVMs is relatively easy since there are no local optima, unlike in neural networks. The method forms a feature space from the input space by replacing the original data with higher-dimensional data, where the tradeoff between classifier complexity and error can be controlled explicitly. That is, a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space. The input is also flexible in the sense that non-traditional data, such as strings, can be used as input to an SVM in place of feature vectors. However, to obtain high accuracy, we need to select a "good" kernel function for the SVM by trial and error. There is still no method to automatically find good kernels.
3.2. Numerical Prediction
Numeric prediction is the task of predicting continuous (or ordered) values for a given input. For example, it involves predicting the potential future price of gold given the current economic situation, or predicting the water level of a river given the weather report. For this task, most methods mentioned previously are not suitable since they predict a nominal attribute, not a numeric one. In general, prediction of numerical values is more complicated than prediction of categorical values since the output is more sensitive to small changes in the input. In this section, we first explain linear and nonlinear regression models. After that, we explain two extensions of regression to decision trees for numerical prediction. Besides regression, some classification techniques, e.g., backpropagation, support vector machines, and k-nearest-neighbor classifiers, can be adapted for numeric prediction. On the other hand, regression can be adapted for classification, as shown in Section 3.3.
3.2.1. Regression
In cases where the outcome (class) is numeric and all the attributes are numeric, it is possible to apply linear regression for prediction. This is a very common method in statistics. The method expresses the class as a linear combination of the attributes, with weights determined from the training data. Regression analysis is a good choice when all of the predictor variables are continuous-valued as well. Many problems can be solved by linear regression, and even more can be tackled by applying transformations to the variables so that a nonlinear problem can be converted to a linear one. Due to limited space, this section will not give a full-length description of regression but instead provides an intuitive introduction. Several software packages exist to solve regression problems, including SAS (www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com). Regression is a statistical methodology developed by Sir Francis Galton (1822–1911), a scientist who was also a cousin of Charles Darwin. It can be used to model the relationship between one or more independent or predictor variables (features or known attributes) and a continuous-valued dependent or response variable (the target attribute).
Linear regression
A single-independent-variable regression deals with a response variable, y, and a single independent (predictor) variable, x. As the simplest form of regression, the response variable y is modelled as a linear function of x. It takes the following form.
y = w0 + w1 x + ε,

where the variance of the error term ε is assumed to be constant, and w0 and w1 are regression coefficients (or weights), specifying the Y-intercept and slope of the line, respectively. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
Let S be a training set composed of values of the predictor variable, x, for some population and their associated values for the response variable, y. The training set S contains |S| data points of the form (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(|S|), y^(|S|)). The regression coefficients, w1 and w0, can be estimated as follows.

w1 = Σ_i (x^(i) − x̄)(y^(i) − ȳ) / Σ_i (x^(i) − x̄)^2,    w0 = ȳ − w1 x̄,

where x̄ is the mean value of x^(1), ..., x^(|S|) and ȳ is the mean value of y^(1), ..., y^(|S|). The coefficients w1 and w0 provide good approximations to minimize the error between the actual data and the estimate of the line. An example is shown in Figure 3-36.
X (years) Y (weight)
3 10
5 24
4 20
8 29
14 36
6 20
20 45
12 24
Figure 3-36: An example of a single-independent-variable regression (left: 2-D data; right: the scatter plot and its regression equation).
The 2-D data can be graphed on a scatter plot, and the plot suggests a linear relationship between the two variables, x and y. Given the data in the table, we can compute the averages of x and y: x̄ = (3+5+4+8+14+6+20+12)/8 = 9 and ȳ = (10+24+20+29+36+20+45+24)/8 = 26. Thereafter, substituting these averages into the above equations, we can find the coefficient values as follows.

w1 = 402/242 = 1.66,    w0 = 26 − 1.66 × 9 = 11.05

With the above calculated values, the equation of the least-squares line is y = 1.66x + 11.05. By this equation, we can predict the value of y given a value of x; for example, y = 40.95 when x = 18.
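The computation above can be reproduced with a few lines of NumPy:

```python
import numpy as np

x = np.array([3, 5, 4, 8, 14, 6, 20, 12], dtype=float)
y = np.array([10, 24, 20, 29, 36, 20, 45, 24], dtype=float)

x_bar, y_bar = x.mean(), y.mean()                              # 9.0 and 26.0
w1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
w0 = y_bar - w1 * x_bar

assert x_bar == 9.0 and y_bar == 26.0
assert abs(w1 - 1.66) < 0.01 and abs(w0 - 11.05) < 0.01
# prediction for x = 18
assert abs((w0 + w1 * 18) - 40.95) < 0.01
```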
It is possible to extend linear regression from a single independent variable to multiple independent variables. This is called multiple linear regression (MLR). Multiple linear regression involves more than one predictor variable and allows the response variable y to be modeled as a linear function of n predictor variables or attributes, x_1, x_2, ..., x_n. A data tuple X can thus be described by n variables, with the associated response variable y. A multiple linear regression model can be expressed in the following form.

y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n
where y is the class, x_1, ..., x_n are the attributes, and w_0, w_1, ..., w_n are weights. We are given training data S where each datum (the i-th instance) in S is represented in the form (x^(i), y^(i)), where x^(i) = (x_1^(i), x_2^(i), ..., x_n^(i)) is an n-dimensional training tuple with its associated class label y^(i). Here the superscript (i) denotes the index of the instance. For example, (x^(1), y^(1)) expresses the first instance. Moreover, to make the notation simple, we can assume an extra variable (attribute) x_0 whose value is always one (1), i.e., x_0^(i) = 1 for all i. Given the weight values w_0, w_1, ..., w_n, the predicted class value of the i-th instance, denoted by ŷ^(i), can be written as follows.

ŷ^(i) = Σ_{j=0..n} w_j x_j^(i)
In a real situation, this predicted class value ŷ^(i) differs from the actual class value y^(i). Regression tries to find the values of the weights w_0, ..., w_n that minimize the total difference (the sum of squares) between the predicted and the actual class values over all data in the training set. The difference between the predicted and the actual class values of the i-th instance can be represented as follows.

y^(i) − Σ_{j=0..n} w_j x_j^(i)

The sum of the squares of the differences over the training data S can be denoted as follows.

E(w) = Σ_{i=1..|S|} ( y^(i) − Σ_{j=0..n} w_j x_j^(i) )^2

Here, the expression inside the parentheses is the difference between the i-th instance's actual class and its predicted class. This sum of squares is what we have to minimize by choosing the coefficients appropriately. To find suitable values of the weights, the problem can be formulated as follows.

w* = argmin_w Σ_{i=1..|S|} ( y^(i) − Σ_{j=0..n} w_j x_j^(i) )^2
To solve this, it is possible to derive the coefficients using standard matrix operations. One observation is that the coefficients can be calculated only if the number of instances is not smaller than the number of attributes and the instances are linearly independent of one another. If there are fewer instances, there will be more than one solution for the coefficients. That is, we need enough examples, relative to the number of attributes, to select the optimal weights that minimize the sum of squared differences. The solution involves a matrix inversion operation and can easily be found in any software package. Here, we provide a formal description of the solution. The representation can be translated into matrix form as y ≈ Xw, where X is the |S| × (n+1) matrix whose i-th row is (x_0^(i), x_1^(i), ..., x_n^(i)), y is the column vector of actual class values, and w is the column vector of weights. Note that, as stated above, x_0^(i) = 1.
To find the weight values, we can take the partial derivative of the above sum of squares with respect to each weight, w_0, w_1, ..., w_n, and set it to zero:

∂E/∂w_j = −2 Σ_i ( y^(i) − Σ_k w_k x_k^(i) ) x_j^(i) = 0,  for j = 0, 1, ..., n

The differentiation result, in matrix form, is the system of normal equations XᵀXw = Xᵀy. The values of the weights are then the result of the following matrix operations.

w = (XᵀX)⁻¹ Xᵀ y
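A minimal sketch of this closed-form solution, on synthetic data with a known weight vector (the data and weights here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30
# design matrix with x0 = 1 (the extra attribute that is always one)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 4))])
true_w = np.array([2.0, 1.0, -3.0, 0.5, 4.0])
y = X @ true_w + 0.01 * rng.normal(size=n)    # targets with a little noise

# closed-form least-squares solution: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

assert np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0])
assert np.allclose(w, true_w, atol=0.05)      # recovers the true weights
```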
The following shows an example of regression where the play-tennis data set is used.
Outlook Temp. Humidity Windy Play
90 40 80 10 5
95 32 85 80 10
50 35 90 20 80
10 24 80 5 95
15 10 50 15 85
20 12 55 90 15
55 9 45 95 80
85 22 95 25 10
95 7 50 5 100
5 26 45 10 85
80 25 40 80 95
45 24 85 85 90
40 37 60 15 75
25 23 90 95 5
With decimal scaling by a factor of 100, we obtain the following data.
Outlook Temp. Humidity Windy Play
0.90 0.40 0.80 0.10 0.05
0.95 0.32 0.85 0.80 0.10
0.50 0.35 0.90 0.20 0.80
0.10 0.24 0.80 0.05 0.95
0.15 0.10 0.50 0.15 0.85
0.20 0.12 0.55 0.90 0.15
0.55 0.09 0.45 0.95 0.80
0.85 0.22 0.95 0.25 0.10
0.95 0.07 0.50 0.05 1.00
0.05 0.26 0.45 0.10 0.85
0.80 0.25 0.40 0.80 0.95
0.45 0.24 0.85 0.85 0.90
0.40 0.37 0.60 0.15 0.75
0.25 0.23 0.90 0.95 0.05
To obtain the solution, we apply the formulation w = (XᵀX)⁻¹Xᵀy.
Here, the matrices X, XᵀX, (XᵀX)⁻¹, and Xᵀy can be calculated from the table above.
By the above formulation, the solution w can be derived, and the resultant regression equation takes the form

y = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4,

where y = play, x1 = outlook, x2 = temperature, x3 = humidity, and x4 = windy.
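The weight vector for the scaled play-tennis data can be computed with the same normal-equation formula; the sketch below only checks the computation against NumPy's least-squares routine rather than printing particular coefficients:

```python
import numpy as np

# scaled play-tennis data: outlook, temp, humidity, windy -> play
data = np.array([
    [0.90, 0.40, 0.80, 0.10, 0.05], [0.95, 0.32, 0.85, 0.80, 0.10],
    [0.50, 0.35, 0.90, 0.20, 0.80], [0.10, 0.24, 0.80, 0.05, 0.95],
    [0.15, 0.10, 0.50, 0.15, 0.85], [0.20, 0.12, 0.55, 0.90, 0.15],
    [0.55, 0.09, 0.45, 0.95, 0.80], [0.85, 0.22, 0.95, 0.25, 0.10],
    [0.95, 0.07, 0.50, 0.05, 1.00], [0.05, 0.26, 0.45, 0.10, 0.85],
    [0.80, 0.25, 0.40, 0.80, 0.95], [0.45, 0.24, 0.85, 0.85, 0.90],
    [0.40, 0.37, 0.60, 0.15, 0.75], [0.25, 0.23, 0.90, 0.95, 0.05],
])
X = np.hstack([np.ones((14, 1)), data[:, :4]])   # x0 = 1 bias column
y = data[:, 4]

w = np.linalg.inv(X.T @ X) @ X.T @ y             # normal-equation solution
assert w.shape == (5,)
assert np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0])
```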
Non-linear regression
In several cases, we would like to model data that does not show a linear dependence. For example, a given response variable and predictor variable may have a relationship described by a polynomial function, e.g., a parabola or some other higher-order polynomial. Polynomial regression is often of interest when there is just one predictor variable. It can be modeled by adding polynomial terms to the basic linear model. By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares. For example, consider a cubic polynomial relationship between a single predictor variable x and the response variable y:

y = w_0 + w_1 x + w_2 x^2 + w_3 x^3

This equation can be converted to a linear form by defining the following new variables:

x_1 = x,  x_2 = x^2,  x_3 = x^3

By this definition, the non-linear equation becomes the linear equation

y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3,

which can be solved by the method of least squares using software for regression analysis. We can observe that polynomial regression is a special case of multiple linear regression. That is, the addition of high-order terms such as x^2, x^3, and so on, which are simple functions of the single variable x, can be considered equivalent to adding new independent variables.
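A small sketch of this transformation, fitting an illustrative cubic (coefficients chosen arbitrarily) by ordinary least squares on the transformed variables:

```python
import numpy as np

# illustrative cubic: y = 2 + x - 3x^2 + 0.5x^3
x = np.linspace(-2.0, 2.0, 25)
y = 2.0 + x - 3.0 * x ** 2 + 0.5 * x ** 3

# transformation x1 = x, x2 = x^2, x3 = x^3 makes the model linear in the weights
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# least squares recovers the cubic's coefficients exactly (noise-free data)
assert np.allclose(w, [2.0, 1.0, -3.0, 0.5])
```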
3.2.2. Tree for prediction: Regression Tree and Model Tree
It is possible to adapt decision trees or rules that are designed for predicting categories so that they estimate numeric quantities. For example, the weather data below can be used to construct a regression tree and a model tree.
Outlook Temp. Humidity Windy Play
90 40 80 10 5
95 32 85 80 10
50 35 90 20 80
10 24 80 5 95
15 10 50 15 85
20 12 55 90 15
55 9 45 95 80
85 22 95 25 10
95 7 50 5 100
5 26 45 10 85
80 25 40 80 95
45 24 85 85 90
40 37 60 15 75
25 23 90 95 20
In a regression tree, the leaf nodes contain a numeric value that is the average of all the training-set values to which the leaf applies. These are called regression trees because statisticians use the term regression for the process of computing an expression that predicts a numeric quantity; decision trees with averaged numeric values at the leaves are thus called regression trees. Each leaf node in the regression tree represents the average outcome, the number of instances, and the standard deviation of the instances that reach the leaf. The tree is much larger and more complex than a regression equation. In general, the average of the absolute errors between the predicted and the actual values turns out to be significantly lower than that of the regression equation. The regression tree is more accurate because a simple linear model poorly represents the data in this problem. However, the tree is cumbersome and difficult to interpret because of its large size.
because of its large size. From the above data, we can obtain the regression tree as follows.
Here the data for each node are as follows.
[The first leaf node] (Average Play = 8.33, Number = 3, Std. dev. = 2.89)
Outlook Temp. Humidity Windy Play
90 40 80 10 5
95 32 85 80 10
85 22 95 25 10
[The second leaf node] (Average Play = 97.5, Number = 2, Std. dev. = 3.54)
Outlook Temp. Humidity Windy Play
95 7 50 5 100
80 25 40 80 95
[The third leaf node] (Average Play = 81.25, Number = 4, Std. dev. = 6.29)
Outlook Temp. Humidity Windy Play
50 35 90 20 80
55 9 45 95 80
45 24 85 85 90
40 37 60 15 75
[The fourth leaf node] (Average Play = 88.33, Number = 3, Std. dev. = 5.77)
Outlook Temp. Humidity Windy Play
10 24 80 5 95
15 10 50 15 85
5 26 45 10 85
[The fifth leaf node] (Average Play = 17.50, Number = 2, Std. dev. = 3.54)
Outlook Temp. Humidity Windy Play
20 12 55 90 15
25 23 90 95 20
It is also possible to combine regression equations with regression trees. A tree whose leaves contain linear expressions, that is, regression equations, rather than single predicted values, is called a model tree. Such a model tree would contain five linear models belonging to the five leaves, labeled LM1 to LM5. However, since there are not enough instances in this data set, the linear regression equation for each node cannot be generated. The model tree approximates continuous functions by linear "patches," a more sophisticated representation than either linear regression or regression trees. The model tree is smaller and more comprehensible than the regression tree, and its average error on the training data is also lower.
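As a sketch, a regression tree with five leaves can be grown from the same table using scikit-learn's DecisionTreeRegressor (a CART implementation), assuming the library is available; its variance-based splits need not match the tree described in the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# outlook, temp, humidity, windy -> play (the table above)
X = np.array([[90, 40, 80, 10], [95, 32, 85, 80], [50, 35, 90, 20],
              [10, 24, 80, 5], [15, 10, 50, 15], [20, 12, 55, 90],
              [55, 9, 45, 95], [85, 22, 95, 25], [95, 7, 50, 5],
              [5, 26, 45, 10], [80, 25, 40, 80], [45, 24, 85, 85],
              [40, 37, 60, 15], [25, 23, 90, 95]], dtype=float)
y = np.array([5, 10, 80, 95, 85, 15, 80, 10, 100, 85, 95, 90, 75, 20],
             dtype=float)

tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0).fit(X, y)
assert tree.get_n_leaves() <= 5
# each leaf predicts the mean Play value of the training rows reaching it,
# so predictions stay within the range of the training targets
pred = tree.predict(X)
assert pred.min() >= y.min() and pred.max() <= y.max()
```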
3.3. Regression as Classification
One interesting application of regression is to use it for classification. Linear regression can easily be used for classification in domains with numeric attributes. Indeed, we can use any regression technique, whether linear or nonlinear, for classification. There are two possible methods to use regression for classification, described in the following subsections.
3.3.1. One-Against-the-Other Regression
To apply regression for classification with the one-against-the-other approach, we perform a regression for each class, setting the output equal to one for training instances that belong to the class and zero for those that do not. The result is a linear expression for that class. We perform the same procedure for the other classes. For the testing procedure, given a test example of unknown class, we calculate the value of each linear expression and choose the one with the largest outcome. This method is sometimes called multiresponse linear regression. Since there is one expression for each class, the number of expressions equals the number of classes. Therefore, if there are n classes, there will be n regression expressions. One-against-the-other regression is explained with the following training dataset.
Temp. Humidity Windy Class
40 80 false C1
32 85 true C2
35 90 false C2
24 80 false C4
10 50 false C2
12 55 true C3
9 45 true C1
22 95 false C2
7 50 false C4
26 45 false C3
25 40 true C1
24 85 true C2
37 60 false C1
23 90 true C3
In this process, the data are normalized; the class in focus is set to 1, while the others are set to 0. First, the training targets for the Class 1 model are as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Class
40 80 false C1 0.40 0.80 0.0 1
32 85 true C2 0.32 0.85 1.0 0
35 90 false C2 0.35 0.90 0.0 0
24 80 false C4 0.24 0.80 0.0 0
10 50 false C2 0.10 0.50 0.0 0
12 55 true C3 0.12 0.55 1.0 0
9 45 true C1 0.09 0.45 1.0 1
22 95 false C2 0.22 0.95 0.0 0
7 50 false C4 0.07 0.50 0.0 0
26 45 false C3 0.26 0.45 0.0 0
25 40 true C1 0.25 0.40 1.0 1
24 85 true C2 0.24 0.85 1.0 0
37 60 false C1 0.37 0.60 0.0 1
23 90 true C3 0.23 0.90 1.0 0
Second, the model learned for Class 2 is as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Class
40 80 false C1 0.40 0.80 0.0 0
32 85 true C2 0.32 0.85 1.0 1
35 90 false C2 0.35 0.90 0.0 1
24 80 false C4 0.24 0.80 0.0 0
10 50 false C2 0.10 0.50 0.0 1
12 55 true C3 0.12 0.55 1.0 0
9 45 true C1 0.09 0.45 1.0 0
22 95 false C2 0.22 0.95 0.0 1
7 50 false C4 0.07 0.50 0.0 0
26 45 false C3 0.26 0.45 0.0 0
25 40 true C1 0.25 0.40 1.0 0
24 85 true C2 0.24 0.85 1.0 1
37 60 false C1 0.37 0.60 0.0 0
23 90 true C3 0.23 0.90 1.0 0
Third, the model learned for Class 3 is as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Class
40 80 false C1 0.40 0.80 0.0 0
32 85 true C2 0.32 0.85 1.0 0
35 90 false C2 0.35 0.90 0.0 0
24 80 false C4 0.24 0.80 0.0 0
10 50 false C2 0.10 0.50 0.0 0
12 55 true C3 0.12 0.55 1.0 1
9 45 true C1 0.09 0.45 1.0 0
22 95 false C2 0.22 0.95 0.0 0
7 50 false C4 0.07 0.50 0.0 0
26 45 false C3 0.26 0.45 0.0 1
25 40 true C1 0.25 0.40 1.0 0
24 85 true C2 0.24 0.85 1.0 0
37 60 false C1 0.37 0.60 0.0 0
23 90 true C3 0.23 0.90 1.0 1
Fourth, the model learned for Class 4 is as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Class
40 80 false C1 0.40 0.80 0.0 0
32 85 true C2 0.32 0.85 1.0 0
35 90 false C2 0.35 0.90 0.0 0
24 80 false C4 0.24 0.80 0.0 1
10 50 false C2 0.10 0.50 0.0 0
12 55 true C3 0.12 0.55 1.0 0
9 45 true C1 0.09 0.45 1.0 0
22 95 false C2 0.22 0.95 0.0 0
7 50 false C4 0.07 0.50 0.0 1
26 45 false C3 0.26 0.45 0.0 0
25 40 true C1 0.25 0.40 1.0 0
24 85 true C2 0.24 0.85 1.0 0
37 60 false C1 0.37 0.60 0.0 0
23 90 true C3 0.23 0.90 1.0 0
The regression equations for the four classes (Class 1 through Class 4) can each be summarized as a linear expression over the transformed attributes.
Here, suppose that we have the following test object. We can predict the class of this test object
by substituting the values for each parameter (attribute).
Temp. Humid. Windy Class
50 75 false ?
Among the four class scores, Class 2 obtains the highest value; the test object is therefore assigned Class 2.
While multiresponse linear regression yields good results in practice, it has two
drawbacks. First, the membership values it produces are not proper probabilities, because they
can fall outside the range 0 to 1. Second, least squares regression assumes that the errors are not
only statistically independent but also normally distributed with the same standard
deviation. To address this, instead of approximating the 0 and 1 values directly, logistic regression
can be used to build a linear model based on a transformed target variable.
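The one-against-the-other decision rule can be sketched as follows: once one linear expression per class has been fitted, classification is simply evaluating each expression on the test object and taking the argmax. This is a minimal illustration, and the coefficient vectors below are hypothetical placeholders, not the equations learned in the example above.

```python
# A minimal sketch of the one-against-the-other decision rule. The coefficient
# vectors below are hypothetical placeholders.
def linear_score(weights, bias, x):
    """Evaluate a fitted linear expression w . x + b on one instance."""
    return sum(w * v for w, v in zip(weights, x)) + bias

# Hypothetical fitted models, one per class:
# (weights for Temp., Humid., Windy; bias).
models = {
    "C1": ([1.2, -0.5, 0.1], 0.0),
    "C2": ([0.4, 1.1, -0.3], 0.2),
    "C3": ([-0.8, 0.2, 0.9], 0.1),
    "C4": ([0.3, -0.9, -0.2], 0.4),
}

def predict(x):
    """Assign the class whose regression yields the highest membership score."""
    scores = {c: linear_score(w, b, x) for c, (w, b) in models.items()}
    return max(scores, key=scores.get)

# A test object scaled like the training data: Temp=0.50, Humid=0.75, Windy=0.0.
label = predict([0.50, 0.75, 0.0])
```

With these placeholder coefficients, the argmax rule itself behaves exactly as in the worked example: whichever class expression scores highest wins.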
3.3.2. Pairwise Regression
As an alternative to multiresponse linear regression, pairwise regression can be used for
classification. In this method, a regression expression is found for every pair of classes, using
only the instances from those two classes. During the regression analysis, one class of the pair is
assigned the value +1 while the other is assigned -1. For k classes, there will be k(k-1)/2
expressions. This seems computationally intensive, but in fact it is at least as fast as any other
multiclass method, because each pairwise regression expression involves only the instances
belonging to the two classes under consideration. Suppose that the n instances are divided
evenly among the k classes; then there will be 2n/k instances for learning each regression
expression. If the learning algorithm for a two-class problem with n instances takes time
proportional to n seconds to execute, the run time for pairwise classification is proportional to
(k(k-1)/2) × (2n/k) = (k-1) × n seconds. In other words, the method scales linearly with both the
number of classes and the number of instances. In the testing step, an unknown test example is
assigned to the class that receives the most votes among the pairwise expressions. This method
generally yields accurate results in terms of classification error, compared to the
one-against-the-other method. Assume the same data set as in the one-against-the-other
regression above.
The model for Class 1 vs. Class 2 can be learned as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Target
40 80 false C1 0.40 0.80 0.0 +1
32 85 true C2 0.32 0.85 1.0 -1
35 90 false C2 0.35 0.90 0.0 -1
24 80 false C4
10 50 false C2 0.10 0.50 0.0 -1
12 55 true C3
9 45 true C1 0.09 0.45 1.0 +1
22 95 false C2 0.22 0.95 0.0 -1
7 50 false C4
26 45 false C3
25 40 true C1 0.25 0.40 1.0 +1
24 85 true C2 0.24 0.85 1.0 -1
37 60 false C1 0.37 0.60 0.0 +1
23 90 true C3
The model for Class 1 vs. Class 3 can be learned as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Target
40 80 false C1 0.40 0.80 0.0 +1
32 85 true C2
35 90 false C2
24 80 false C4
10 50 false C2
12 55 true C3 0.12 0.55 1.0 -1
9 45 true C1 0.09 0.45 1.0 +1
22 95 false C2
7 50 false C4
26 45 false C3 0.26 0.45 0.0 -1
25 40 true C1 0.25 0.40 1.0 +1
24 85 true C2
37 60 false C1 0.37 0.60 0.0 +1
23 90 true C3 0.23 0.90 1.0 -1
The model for Class 1 vs. Class 4 can be learned as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Target
40 80 false C1 0.40 0.80 0.0 +1
32 85 true C2
35 90 false C2
24 80 false C4 0.24 0.80 0.0 -1
10 50 false C2
12 55 true C3
9 45 true C1 0.09 0.45 1.0 +1
22 95 false C2
7 50 false C4 0.07 0.50 0.0 -1
26 45 false C3
25 40 true C1 0.25 0.40 1.0 +1
24 85 true C2
37 60 false C1 0.37 0.60 0.0 +1
23 90 true C3
The model for Class 2 vs. Class 3 can be learned as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Target
40 80 false C1
32 85 true C2 0.32 0.85 1.0 +1
35 90 false C2 0.35 0.90 0.0 +1
24 80 false C4
10 50 false C2 0.10 0.50 0.0 +1
12 55 true C3 0.12 0.55 1.0 -1
9 45 true C1
22 95 false C2 0.22 0.95 0.0 +1
7 50 false C4
26 45 false C3 0.26 0.45 0.0 -1
25 40 true C1
24 85 true C2 0.24 0.85 1.0 +1
37 60 false C1
23 90 true C3 0.23 0.90 1.0 -1
The model for Class 2 vs. Class 4 can be learned as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Target
40 80 false C1
32 85 true C2 0.32 0.85 1.0 +1
35 90 false C2 0.35 0.90 0.0 +1
24 80 false C4 0.24 0.80 0.0 -1
10 50 false C2 0.10 0.50 0.0 +1
12 55 true C3
9 45 true C1
22 95 false C2 0.22 0.95 0.0 +1
7 50 false C4 0.07 0.50 0.0 -1
26 45 false C3
25 40 true C1
24 85 true C2 0.24 0.85 1.0 +1
37 60 false C1
23 90 true C3
The model for Class 3 vs. Class 4 can be learned as follows.
Temp. Humid. Windy Class Temp. Humid. Windy Target
40 80 false C1
32 85 true C2
35 90 false C2
24 80 false C4 0.24 0.80 0.0 -1
10 50 false C2
12 55 true C3 0.12 0.55 1.0 +1
9 45 true C1
22 95 false C2
7 50 false C4 0.07 0.50 0.0 -1
26 45 false C3 0.26 0.45 0.0 +1
25 40 true C1
24 85 true C2
37 60 false C1
23 90 true C3 0.23 0.90 1.0 +1
The regression equations for the six pairwise models (Class 1 vs. Class 2 through Class 3 vs. Class 4) can each be summarized as a linear expression over the transformed attributes.
Here, suppose that we have the following test object. We can predict the class of this test
object by substituting the values for each parameter (attribute).
Temp. Humid. Windy Class
50 75 false ?
Class 1 vs. Class 2  Class 1 vs. Class 3  Class 1 vs. Class 4: 1.37  Class 2 vs. Class 3: 0.35  Class 2 vs. Class 4  Class 3 vs. Class 4: 1.97
Since Class 1 wins against every other class (positive values for the first three pairwise regressions), the test datum is assigned Class 1.
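The voting step can be sketched as follows. Each fitted pairwise model is assumed to return a signed score, positive meaning the first class of the pair wins. Only the three scores recoverable from the example above (1.37, 0.35, 1.97) are real; the remaining scores are hypothetical values consistent with the stated outcome.

```python
from collections import Counter

# Pairwise (one-vs-one) voting: one vote per fitted pairwise model.
def pairwise_vote(pair_scores):
    """Give one vote per pairwise model and return the majority class."""
    votes = Counter()
    for (a, b), score in pair_scores.items():
        votes[a if score > 0 else b] += 1
    return votes.most_common(1)[0][0]

# Signed scores for the test object (positive -> first class of the pair wins).
scores = {
    ("C1", "C2"): 0.9,   # hypothetical (stated positive in the text)
    ("C1", "C3"): 0.6,   # hypothetical (stated positive in the text)
    ("C1", "C4"): 1.37,
    ("C2", "C3"): 0.35,
    ("C2", "C4"): -0.2,  # hypothetical
    ("C3", "C4"): 1.97,
}
winner = pairwise_vote(scores)
```

Class 1 collects three votes (over Classes 2, 3 and 4), more than any other class, so the voting rule returns Class 1 as in the text.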
3.4. Model Ensemble Techniques
In the last decade, the idea of building ensembles of classifiers has gained interest. Instead of
building a single complex classifier, a combination of several simple (weak) classifiers, of either
the same type or different types, is an alternative. For instance, instead of training a large
decision tree (DT), we can train several simpler DTs and combine their individual outputs to form
the final decision. Alternatively, we can train different kinds of classifiers (such as DT, NB, NN or
SVM), use each of them to classify a test object, and then combine their results to obtain the final
decision. This procedure resembles a committee whose members jointly reach a verdict. Sometimes
this allows faster training and lets each classifier focus on a given portion of the training
set. Figure 3-37 illustrates the concept of an ensemble of classifiers. The input pattern x is first
classified by each weak classifier, which returns the plausibility of the input x belonging to each
class. The outputs of these weak classifiers are then combined to establish the final classification
decision, and we select the class that achieves the maximum combined value. Intuitively, when the
individual classifiers are uncorrelated, the result obtained by majority voting (or other
combination operations) over an ensemble of classifiers is likely to be better than the result
obtained from any one individual classifier.
Figure 3-37: The concept of an ensemble of classifiers (test phase). The outputs of the weak
classifiers are combined to obtain the final decision, for (a) classification and (b) numeric
prediction.
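The combination step can be sketched as follows; the per-class plausibility tables below are hypothetical outputs of three weak classifiers, invented for illustration.

```python
# Combination step of an ensemble: each weak classifier returns a plausibility
# score per class, the ensemble averages the scores, and the final decision is
# the argmax. The score tables are hypothetical.
def combine(score_tables):
    """Average per-class plausibilities over all weak classifiers."""
    classes = score_tables[0].keys()
    m = len(score_tables)
    return {c: sum(t[c] for t in score_tables) / m for c in classes}

def decide(score_tables):
    """Select the class with the maximum combined plausibility."""
    combined = combine(score_tables)
    return max(combined, key=combined.get)

# Hypothetical outputs of three weak classifiers for one input x.
tables = [
    {"C1": 0.6, "C2": 0.3, "C3": 0.1},
    {"C1": 0.2, "C2": 0.5, "C3": 0.3},
    {"C1": 0.5, "C2": 0.4, "C3": 0.1},
]
final = decide(tables)
```

Averaging is only one choice of combination operation; majority voting or weighted voting fit the same skeleton by replacing combine.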
Moreover, ensembles are generally more flexible in the functions they can represent, and this
may enable them to over-fit the training data more than a single model would. Nevertheless, in
practice, some ensemble techniques tend to reduce problems related to over-fitting of the
training data. For example, the bagging ensemble technique splits the data into several subsets
and then applies a learning algorithm to each subset to form a model. The obtained models are
used to predict the test object, and their results are then combined.
Empirically, ensembles tend to yield better results when there is significant diversity
among the models. Many ensemble methods therefore seek to promote diversity among the
models they combine. Although perhaps non-intuitive, more random algorithms (like random
decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like
entropy-reducing decision trees). Using a variety of strong learning algorithms, however, has
been shown to be more effective than using techniques that attempt to dumb down the models in
order to promote diversity.
Note that, as described before, naïve Bayes classification can be viewed as a kind of ensemble
technique: an ensemble of classifiers, each of which predicts the class based on a single
attribute with a Gaussian distribution. The prediction from each classifier is given a vote
proportional to the probability of the prediction, and the votes of the classifiers (one per
attribute) are multiplied to form the final probability. Beyond this simple example, there are
several types of ensemble techniques. Three common ones are known as bagging, boosting and
stacking. These techniques are described in sequence below.
3.4.1. Bagging: Bootstrap Aggregating
Bootstrap aggregating, in short bagging, is a simple ensemble meta-algorithm that improves
classification or regression in terms of stability and accuracy. It can reduce variance
and help avoid overfitting. Bagging is a special case of the model-averaging approach in which each
model in the ensemble votes with equal weight. To introduce variation among the models, bagging
trains each model in the ensemble on a randomly drawn subset of the training set. As an example,
the random forest algorithm combines a number of decision trees, each learned from a different
subset extracted from the training dataset, and has been shown to achieve very high classification
accuracy. One of the most popular methods to generate multiple training datasets from a single
dataset is called bootstrap sampling. A brief introduction is given below.
Given a training set T with n instances, the bagging algorithm will first generate m new
training sets Ti, each with n' instances (n' ≤ n), by uniformly sampling examples from T with
replacement. By sampling with replacement, it is likely that some examples will be repeated in
each Ti. As a special case, when the dataset is large enough (large n) and n' is set to n, it was
shown that the set Ti is expected to include 63.2% of the unique examples from T, the rest being
duplicates. The fraction of instances never selected is calculated from the following limit:

(1 - 1/n)^n → e^(-1) ≈ 0.368 as n grows large.

To understand the formula, first note that 1 - 1/n is the probability that a particular
instance is not selected in one draw, i.e., one of the other instances is selected instead. If we
draw n times, the probability that the instance is never selected is (1 - 1/n)^n. When n is
large enough, this converges to e^(-1), which is approximately 0.368. Therefore the
probability that an instance is selected at least once is 1 - e^(-1) ≈ 0.632. This kind of sampling
is known as a bootstrap sample.
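The convergence argument can be verified numerically; the sketch below (standard library only) computes the exact miss probability for finite n and simulates one bootstrap sample to measure its coverage.

```python
import math
import random

# Numeric check of the bootstrap argument: (1 - 1/n)^n approaches
# e^-1 ~ 0.368, so a bootstrap sample of size n covers about 63.2% of the
# unique instances in T.
def never_selected_prob(n):
    """Probability that a fixed instance is missed by all n draws."""
    return (1 - 1 / n) ** n

def bootstrap_unique_fraction(n, seed=0):
    """Simulate one bootstrap sample of size n; return its coverage of T."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

p_missed = never_selected_prob(1000)         # close to e^-1 ~ 0.368
coverage = bootstrap_unique_fraction(10000)  # close to 1 - e^-1 ~ 0.632
```

Already at n = 1000 the exact probability agrees with e^(-1) to three decimal places, and a single simulated bootstrap sample covers close to 63.2% of the instances.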
Figure 3-38 illustrates the graphical concept of bagging. By bootstrap sampling, m sample sets
are generated, and the bagging method then learns m models, one from each sample set. In the
test phase, given a test instance, the m models each judge the result (a label or a numeric value),
and the results are combined by averaging the outputs (for regression) or voting (for
classification), as shown in Figure 3-38.
Figure 3-38: Graphical concepts of bootstrap sampling and bagging (learning phase): (a) the
graphical concept; (b) a conceptual representation with the training dataset as a table.
Algorithm 3.6 presents a pseudo-code for the bagging ensemble method, composed of model
generation (learning phase) and classification (test phase).
Algorithm 3.6. Pseudo-code for the bagging ensemble method
Model generation
1: LET n be the number of instances in the training dataset T.
2: FOREACH i of m iterations
3: SAMPLE n’ instances with replacement from the set T to form Ti
4: LEARN Mi by applying the learning algorithm to the sampled set Ti
5: STORE the resultant model Mi
Classification
1: FOREACH i of m iterations
2: PREDICT the class of the instance using the model Mi
3: RETURN the class that has been predicted most often
In general, it has been criticized that the method simply averages several predictors, which may
not be useful when combining linear models. Moreover, bagging does not improve classification
much for very stable models such as k-nearest neighbors.
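Algorithm 3.6 can be sketched as runnable code for a 1-D toy problem, using a decision stump as the (assumed) base learner; the dataset, m and seed below are invented for illustration.

```python
import random
from collections import Counter

# A bagging sketch following Algorithm 3.6, on 1-D toy data with a decision
# stump as the assumed base learner.
def learn_stump(data):
    """Pick the threshold and orientation that minimize training error."""
    best = None
    xs = sorted(x for x, _ in data)
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    for t in thresholds:
        for lo, hi in ((0, 1), (1, 0)):
            err = sum(1 for x, y in data if (lo if x < t else hi) != y)
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x < t else hi

def bagging_train(data, m, seed=0):
    """Learn m stumps, each on a bootstrap sample of size n drawn from data."""
    rng = random.Random(seed)
    models = []
    for _ in range(m):
        sample = [rng.choice(data) for _ in data]  # sample n with replacement
        models.append(learn_stump(sample))
    return models

def bagging_predict(models, x):
    """Return the class predicted most often by the m models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Illustrative data: label 1 for x >= 5, label 0 otherwise.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging_train(data, m=11)
```

An odd m avoids ties in the two-class vote; any base learner could replace the stump without changing the bagging skeleton.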
3.4.2. Boosting: AdaBoost Algorithm
Unlike bagging, boosting builds an ensemble incrementally by training each new model instance
to emphasize the training instances that previous models misclassified. In some cases, boosting
has been shown to yield better accuracy than bagging, but it is also more likely to overfit the
training data. By far the most common implementation of boosting is AdaBoost (short for
Adaptive Boosting), formulated by Yoav Freund and Robert Schapire, although some newer
algorithms are reported to achieve better results. It is a meta-algorithm and can be used in
conjunction with many other learning algorithms to improve their performance. AdaBoost is
adaptive in the sense that subsequent classifiers are built in favor of those instances
misclassified by previous classifiers. The class or value for a test instance is obtained by voting
or averaging the results, where each voted class or value is weighted according to the
performance of the model that produced it.
AdaBoost is sensitive to noisy data and outliers. However, in some problems it can be less
susceptible to overfitting than most learning algorithms. Figure 3-39 and Figure 3-40
illustrate the overview of the AdaBoost-like process. AdaBoost-like algorithms construct weak
classifiers repeatedly, one by one, in a series of rounds towards m classifiers. Each time a
classifier is constructed, the distribution of weights given to the instances in the training
dataset is updated. On each round, the weights of incorrectly classified examples are increased
(or, alternatively, the weights of correctly classified examples are decreased), so that the new
classifier focuses more on those examples. The succeeding models are expected to become
experts that complement each other. For example, the training dataset T2 is constructed by
testing the dataset T1 using the model M1, the training dataset T3 is constructed by testing the
dataset T2 using the model M2, and the same procedure is applied until the number of
constructed models reaches m (as we set). In the test phase, given a test instance, the m models
each judge the result (label or numeric value), and the results are combined by averaging the
outputs (for regression) or voting (for classification), as shown in Figure 3-37.
Figure 3-39: Graphical concepts of AdaBoost (Learning Phase) (Overview)
Figure 3-40: Graphical concepts of AdaBoost (Learning Phase) (Detail)
Algorithm 3.7 presents a pseudo-code for the AdaBoost approach, composed of model generation
(learning phase) and classification (test phase).
Algorithm 3.7. A pseudo-code for the AdaBoost algorithm
Model generation
1: ASSIGN an equal weight to each training instance in the dataset T
2: FOREACH i of m iterations
3: LEARN Mi by applying a learning algorithm on the weighted dataset T
4: STORE the resultant model Mi
5: COMPUTE the error rate e of the model on the weighted dataset and
store the error
6: IF (e equals zero) OR (e is greater than or equal to 0.5)
7: TERMINATE model generation
8: FOREACH instance in the dataset
9: IF the instance is classified correctly by the model
10: MULTIPLY the weight of the instance by e/(1-e)
11: NORMALIZE the weights of all instances
Classification
1: ASSIGN a weight of zero to all classes
2: FOREACH of the t models
3: ADD -log(e/(1-e)) to the weight of the class predicted by the model
4: RETURN the class with the highest weight
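The pseudo-code above can be sketched as runnable code on a 1-D toy problem, with a weighted decision stump as the (assumed) base learner; the dataset and m are invented for illustration.

```python
import math

# A runnable sketch of Algorithm 3.7 (AdaBoost) on 1-D toy data, with a
# weighted decision stump as the assumed base learner.
def learn_weighted_stump(data, weights):
    """Choose the threshold and orientation with the lowest weighted error."""
    best = None
    xs = sorted(x for x, _ in data)
    for t in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        for lo, hi in ((0, 1), (1, 0)):
            err = sum(w for (x, y), w in zip(data, weights)
                      if (lo if x < t else hi) != y)
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    err, t, lo, hi = best
    return (lambda x: lo if x < t else hi), err

def adaboost_train(data, m):
    n = len(data)
    weights = [1.0 / n] * n                  # step 1: equal weights
    models = []
    for _ in range(m):
        model, e = learn_weighted_stump(data, weights)
        if e == 0 or e >= 0.5:               # steps 6-7: stop conditions
            if e == 0:
                models.append((model, e))
            break
        models.append((model, e))
        for j, (x, y) in enumerate(data):
            if model(x) == y:                # steps 8-10: shrink weights of
                weights[j] *= e / (1 - e)    # correctly classified instances
        total = sum(weights)                 # step 11: normalize
        weights = [w / total for w in weights]
    return models

def adaboost_predict(models, x):
    """Each model votes with weight -log(e/(1-e)) for its predicted class."""
    score = {0: 0.0, 1: 0.0}
    for model, e in models:
        alpha = -math.log(e / (1 - e)) if e > 0 else 10.0  # cap for e == 0
        score[model(x)] += alpha
    return max(score, key=score.get)

# Toy data that no single stump classifies perfectly.
data = [(1, 0), (2, 1), (3, 0), (6, 1), (7, 1)]
models = adaboost_train(data, m=3)
```

On this data, no single stump is error-free, yet the weighted vote of the three boosted stumps classifies every training point correctly, illustrating how the reweighting makes successive models complement each other.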
3.4.3. Stacking
Stacking, also called stacked generalization, is a different way of combining multiple models.
Compared to bagging and boosting, it is less widely recognized, since it is difficult to analyze
theoretically and exists in several different variations. Unlike bagging and boosting, stacking
usually combines models of several different types, such as naïve Bayes, neural networks or
decision trees, and uses the outputs from these models to form the training dataset for an
upper-level classifier. Suppose that we use a decision tree inducer (DT), a naïve Bayes learner
(NB), and an instance-based learning method (k-NN) to form a classifier for a given dataset. A
potential method to combine their outputs is voting, as in bagging. However, voting may not
work well if the learning schemes do not perform equally well. Instead of voting, stacking
introduces the concept of a meta-learner, which replaces the voting procedure. The meta-learner
can be trained, using a holdout set, to discover the best way to combine the outputs of the base
learners. For simplicity, assume that there are only two levels under consideration.
Figure 3-41 to Figure 3-43 illustrate the overview of the stacking ensemble method: (1)
base-model (level-0) learning, (2) level-1 dataset generation, and (3) level-1 model learning.
The classifiers at the first level are called the base models or level-0 models. In the previous
example, DT, NB and k-NN are the base models. The predictions of the base models are used as
input for learning the level-1 model (the meta-learner); therefore, the number of input features
for the level-1 model equals the number of base models. In the test phase, an instance is first fed
into the level-0 models, each of which guesses a class value; these guesses are then fed into the
level-1 model, which combines them into the final prediction. To obtain the training dataset for
the level-1 model, we need a way to transform the level-0 training data into level-1 training
data. A naïve method is to apply the level-0 models to classify a training instance and then use
their predictions, together with the instance's actual class value, as a training instance for the
level-1 model. However, this may cause overfitting, that is, working well on the training data
but not on the test data.
Figure 3-41: Stacking (Phase 1/3) (Learning Phase)
Figure 3-42: Stacking (Phase 2/3) (Learning Phase)
Figure 3-43: Stacking (Phase 3/3) (Learning Phase)
Towards a solution, stacking uses a so-called holdout dataset for independent evaluation to
create the level-1 classifier. After the level-0 classifiers have been built by using the training
dataset, they are used to classify the instances in the holdout dataset to generate the level-1
training data. Since the level-0 classifiers never see instances in the holdout set, their predictions
will be unbiased. In other words, the level-1 training data accurately reflects the true
performance of the level-0 learning algorithms. Once the level-1 data have been generated by this
holdout procedure, the level-0 learners can be reapplied to generate classifiers from the full
training set, making slightly better use of the data and leading to better predictions.
In the classification phase, an instance is first classified by the level-0 classifiers; their
results are then used as input to the level-1 classifier, which makes the final decision. Figure
3-44 shows the graphical concept of classification using the level-0 classifiers and the level-1
classifier obtained by stacking.
Figure 3-44: Graphical concepts of stacking (Classification Phase)
Stacking can also be applied to numeric prediction. In that case, both the level-0 models and
the level-1 model predict numeric values. The basic mechanism remains the same. The only
difference lies in the nature of the level-1 data. In the numeric case, each level-1 attribute
represents the numeric prediction made by one of the level-0 models, and instead of a class value,
the numeric target value is attached to level-1 training instances. Algorithm 3.8 presents a
pseudo-code for the stacking approach, composed of (1) the base-model (level-0) learning, (2)
level-1 dataset generation and (3) the level-1 model learning
Algorithm 3.8. A pseudo-code for the stacking algorithm
Model generation
1: LET T be the training data.
2: FOREACH i of m iterations # Level-0 model learning
3: LEARN a model Mi using the i-th model learner with the training data T
4: STORE the resultant model Mi
5: FOREACH xj of t instances in T # Level-1 data generation
6: FOREACH i of m iterations
7: CLASSIFY (OR PREDICT) the class (or value) of the instance using the model Mi
8: STORE the result as one feature (one column)
9: ADD the actual class (or value) as the last feature (target column) to finish creating one record of the level-1 data T’.
10: LEARN the level-1 model M’ using the newly created training data T’. # Level-1 model learning
Classification
1: FOREACH i of m iterations
2: PREDICT the class of the instance using the model Mi
3: STORE the result as one feature (one column) to create the input record for the level-1 model
4: PREDICT the class of the instance using the level-1 model M’
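Algorithm 3.8 can be sketched as a toy: two hypothetical rule-based level-0 models, level-1 data built from their predictions on a holdout set, and a simple lookup-table meta-learner. Every model, rule and data point here is an invented placeholder.

```python
from collections import Counter, defaultdict

# A toy sketch of Algorithm 3.8 (stacking) with invented level-0 models.
def model_a(x):                 # level-0 model 1 (hypothetical rule)
    return 1 if x["temp"] > 0.3 else 0

def model_b(x):                 # level-0 model 2 (hypothetical rule)
    return 1 if x["humid"] > 0.6 else 0

def make_level1_data(models, holdout):
    """One level-1 instance per holdout example: base-model predictions
    as features, plus the actual class as the target."""
    return [([m(x) for m in models], y) for x, y in holdout]

def learn_level1(level1_data):
    """Toy meta-learner: for each prediction pattern seen on the holdout
    set, remember the majority true class; fall back to averaging."""
    by_pattern = defaultdict(Counter)
    for feats, y in level1_data:
        by_pattern[tuple(feats)][y] += 1
    table = {p: c.most_common(1)[0][0] for p, c in by_pattern.items()}
    return lambda feats: table.get(tuple(feats),
                                   round(sum(feats) / len(feats)))

# Invented holdout instances with their true classes.
holdout = [({"temp": 0.4, "humid": 0.8}, 1), ({"temp": 0.1, "humid": 0.5}, 0),
           ({"temp": 0.35, "humid": 0.5}, 1), ({"temp": 0.2, "humid": 0.9}, 0)]
models = [model_a, model_b]
meta = learn_level1(make_level1_data(models, holdout))

def stacked_predict(x):
    """Test phase: level-0 predictions feed the level-1 model."""
    return meta([m(x) for m in models])
```

A real meta-learner would be a trainable classifier rather than a lookup table, but the data flow (base predictions as level-1 features, true class as level-1 target) is the same.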
Stacking exploits the holdout data further by using performance on the holdout set to combine
the models rather than to choose among them, thereby typically achieving performance better
than any single one of the trained models. It has been successfully used on both supervised
learning tasks (such as regression) and unsupervised learning (such as density estimation). It
has also been used to estimate bagging's error rate. Because combining on holdout performance
is so effective, stacking often outperforms Bayesian model averaging. Indeed, renamed blending,
stacking was extensively used by the two top performers in the recent Netflix competition.
3.4.4. Co-training
Introduced by Avrim Blum and Tom Mitchell in 1998, co-training is an algorithm for learning a
classification model from a small set of labeled data together with a large set of unlabeled data;
it originated in the field of text mining for search engines. As a semi-supervised learning
technique, co-training requires two views of the data, described by two different feature sets
that provide different, complementary information about each instance. Ideally, the two views
are assumed to be conditionally independent, and each view alone is sufficient to classify
instances.
In the initial stage, co-training learns a separate model (classifier) for each view using the
small set of labeled examples. The two models are then used to classify the unlabeled data. The
unlabeled instances with the most confident predictions are added to the training set, which is
then used to iteratively re-learn and refine the previous models.
One of the most classic examples is the classification of Web pages. There are two well-known
and useful perspectives: the web content (content-based information) and the incoming links
(hyperlink-based information). Several successful Web search engines currently use these two
kinds of information. The text label used as the link to another Web page usually provides a clue
to what that page is. For example, a link labeled 'My university' usually indicates that its
destination page is the home page of a university.
The procedure of co-training is as follows. Given a (small) set of labeled examples, co-training
first learns two different models, one for each perspective; in the previous example, these are a
content-based model and a hyperlink-based model. As the second step, each model is applied
separately to label the examples without a label. For each model, we select the example (or a few
examples) it most confidently labels as positive and the example (or a few examples) it most
confidently labels as negative, and add these examples to the pool of labeled examples. It is also
possible to maintain the ratio of positive to negative examples in the labeled pool by choosing
more of one kind than of the other. Finally, we repeat the whole procedure, training both models
on the augmented pool of labeled examples, until the unlabeled pool is exhausted. Algorithm 3.9
shows a brief description of the co-training method. The original can be found in (Blum and
Mitchell, 1998). It has also been adapted in many works, such as (Nigam and Ghani, 2000; Nigam
et al., 2000; Ghani, 2002; Brefeld and Scheffer, 2004).
Algorithm 3.9. A pseudo-code for the co-training algorithm
[Modified from Blum and Mitchell, 1998]
1: LET L be a set of labeled training examples and
   U be a set of unlabeled training examples.
2: CREATE a pool U’ of examples by choosing u examples at random from U
3: LOOP for k iterations:
4: TRAIN a classification model M1 using L, considering only the feature set of view V1
5: TRAIN a classification model M2 using L, considering only the feature set of view V2
6: USE M1 to classify the examples in U’ and LABEL the most confident p positive and the most confident n negative examples from the classified U’
7: USE M2 to classify the examples in U’ and LABEL the most confident p positive and the most confident n negative examples from the classified U’
8: ADD these self-labeled examples to L
9: RANDOMLY CHOOSE 2p+2n examples from U to replenish U’
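Algorithm 3.9 can be sketched as a runnable toy under strong simplifying assumptions: each "view" is one coordinate of a 2-D instance, each view's model is a nearest-centroid classifier, confidence is its distance margin, and the U'/U pool distinction with replenishment (step 9) is collapsed into a single unlabeled set for brevity. All data points are invented.

```python
# A runnable toy of the co-training loop, under the assumptions stated above.
def train_view(L, view):
    """Nearest-centroid model on a single feature (the given view index)."""
    pos = [x[view] for x, y in L if y == 1]
    neg = [x[view] for x, y in L if y == 0]
    cp, cn = sum(pos) / len(pos), sum(neg) / len(neg)
    predict = lambda x: 1 if abs(x[view] - cp) < abs(x[view] - cn) else 0
    margin = lambda x: abs(abs(x[view] - cn) - abs(x[view] - cp))
    return predict, margin

def cotrain(L, U, k=3, p=1, n=1):
    L, U = list(L), list(U)
    for _ in range(k):                       # step 3: k iterations
        if not U:
            break
        for view in (0, 1):                  # steps 4-5: one model per view
            predict, margin = train_view(L, view)
            for target in (1, 0):            # steps 6-7: most confident picks
                cands = sorted((x for x in U if predict(x) == target),
                               key=margin, reverse=True)
                for x in cands[:p if target == 1 else n]:
                    L.append((x, target))    # step 8: self-label into L
                    U.remove(x)
    return L

# Two labeled seeds and six invented unlabeled 2-D points.
labeled = [((0.9, 0.8), 1), ((0.1, 0.2), 0)]
unlabeled = [(0.8, 0.9), (0.2, 0.1), (0.7, 0.6),
             (0.3, 0.3), (0.6, 0.7), (0.4, 0.2)]
grown = cotrain(labeled, unlabeled)
```

Starting from only two labeled seeds, the two view-specific models confidently self-label the remaining points, growing the labeled pool exactly as in steps 6-8 of the algorithm.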
3.5. Historical Bibliography
Classification principles and techniques can be found in several books, such as (Weiss and
Kulikowski, 1991), (Michie, Spiegelhalter and Taylor, 1994), (Russel and Norvig, 1995), (Mitchell,
1997), (Duda, Hart and Stork, 2001), (Alpaydin, 2004), (Han and Kamber, 2004), (Witten and
Frank, 2005), (Bishop, 2006), (Camastra and Vinciarelli, 2008), (Izenman, 2008), (Theodoridis
and Koutroumbas, 2008), (Alpaydin, 2009), (Hastie, Tibshirani and Friedman, 2009) and (Rogers
and Girolami, 2011). In the past two decades, several collections containing seminal articles on
machine learning can be found in (Michalski, Carbonell and Mitchell, 1983), (Michalski, Carbonell
and Mitchell, 1986), (Kodratoff and Michalski, 1990), (Shavlik and Dietterich, 1990), and
(Michalski and Tecuci, 1994). Recently several collections have been gathered as reports of
current research works such as (Balcazar, Bonchi, Gionis and Sebag, 2010), (Gunopulos, Hofmann,
Malerba and Vazirgiannis, 2011), (Gama, Bradley and Hollmen, 2011). A good introduction to applying probability to machine learning is provided by DasGupta (2011). Many of these books
describe each of the basic methods of classification discussed in this chapter, as well as practical
techniques for the evaluation of classifier performance. For a presentation of machine learning
with respect to data mining applications, see (Michalski, Bratko, and Kubat, 1998). A linear
discriminant classifier as a linear machine was described in Nilsson (1965). A good introduction
to applied linear regression was presented by Weisberg (1980) and later by Breiman and
Friedman (1997). Theoretical aspects of linear discriminant classifiers can be found in (Hastie,
Tibshirani and Friedman, 2009). The k-nearest-neighbor method was first described by Fix and
Hodges (1951). The method is computationally intensive when given large training sets.
Although k-NN attracted little attention at first, it came into wide use in pattern recognition
during the 1960s as computing power increased. Early work on nearest-neighbor classification
was contributed by Cover and Hart (1967), and Dasarathy (1991) later gathered a modern
collection of articles on the k-NN approach. The k-NN method is also explained in several
textbooks, including Duda et al. (2001) and James (1985), as well as Fukunaga and Hummels
(1987). To improve nearest-neighbor classification time, Friedman, Bentley, and Finkel (1977)
presented the use of search trees, while Hart (1968) proposed a method to remove redundant
training data. The computational complexity of nearest-neighbor classifiers is given in
Preparata and Shamos
(1985). Bayesian classification and algorithms for inference on belief networks were described in chapters by Duda, Hart, and Stork (2001), Weiss and Kulikowski (1991), Mitchell (1997) and Russell and Norvig (1995). Domingos and Pazzani (1996) provided an analysis of the
predictive power of naïve Bayesian classifiers when the class conditional independence
assumption is violated. Heckerman (1996) and Jensen (1996) presented an introduction to
Bayesian belief networks. The computational complexity of belief networks was described by
Laskey and Mahoney (1997). Decision trees were first introduced by Quinlan (1986). Tree pruning was described in (Quinlan, 1987). An empirical comparison between genetic and
decision-tree classifiers was done in (Quinlan, 1988). An approach to deal with unknown
attribute values in tree induction was shown in (Quinlan, 1989). As a concrete version of decision
trees, the C4.5 algorithm is described in a book by Quinlan (1993). Bagging and boosting in C4.5 were described in (Quinlan, 1996). From another research group, the CART (Classification and Regression
Trees) system was developed by Breiman, Friedman, Olshen, and Stone (1984). C4.5 has a
commercial successor, known as C5.0, which can be found at www.rulequest.com. ID3, a
predecessor of C4.5, is detailed in Quinlan (1986). Incremental versions of ID3 include ID4
(Schlimmer and Fisher, 1986) and ID5 (Utgoff, 1988). Quinlan (1987 and 1993) presented how
to extract rules from decision trees. A comprehensive survey of decision-tree induction, covering issues such as attribute selection and pruning, was written by Murthy (1998). For constructing classification rules, a simple version of the covering (separate-and-conquer) approach was implemented as an algorithm named PRISM by Cendrowska (1987). As a pruning technique, the
idea of incremental reduced-error pruning was proposed by Fürnkranz and Widmer (1994) and
forms the basis for fast and effective rule induction. Later, an algorithm called RIPPER (repeated
incremental pruning to produce error reduction) was proposed by Cohen (1995). A good
summary of the Minimum Description Length principle was introduced by Grünwald (2007).
Besides this most basic algorithm, some popular variations include AQ by Ryszard Michalski
(1969) as well as its successor AQ15 by Hong, Mozetic, and Michalski (1986), CN2 by Peter Clark
and Tim Niblett (1989), FOIL by Quinlan and Cameron-Jones (1993) and RIPPER by William W.
Cohen (1995). The artificial neural network was first proposed, as the perceptron, by Rosenblatt (1958). Later literature on its limitations and improvements includes
books by Wasserman (1989), Hecht-Nielsen (1990), Hertz, Krogh, and Palmer (1991), Bishop
(1995), Ripley (1996) and Haykin (1999). Support Vector Machines (SVMs) build on the
statistical learning theory of Vapnik and Chervonenkis (1971); however, the first paper on SVMs
themselves was published by Boser, Guyon, and Vapnik (1992), who provided a training
algorithm for optimal-margin classifiers. Later, Vapnik (1995, 1998) published the original idea
and its extension to classification. Law (2005) and Cristianini and Shawe-Taylor (2000) gave
comprehensive introductions to SVMs. Readers can find more comprehensive material in the
tutorial by Burges (1998) and the textbook by Kecman (2001). Fletcher (1987) as well as
Nocedal and Wright (1999) provided good descriptions of how to solve the optimization
problems arising in SVMs. Applications of SVMs to regression were provided by Drucker,
Burges, Kaufman, Smola, and Vapnik (1997) and Schölkopf, Bartlett, Smola, and Williamson
(1999). Nilsson (1965) provides an excellent reference for the linear classification models that
were popular in the 1960s. Linear regression is described in most standard statistical texts, but
Lawson and Hanson (1995) provided a comprehensive treatment in their book. Friedman
(1996) describes the technique of pairwise classification. Fürnkranz
(2002) further analyzes pairwise classification. Hastie and Tibshirani (1998) extend it to
estimate probabilities using pairwise coupling. Moreover, there are many good textbooks on
classification and regression, provided by James (1985), Dobson (2001) and Johnson and
Wichern (2002). A good introduction to the holdout, cross-validation, leave-one-out and
bootstrapping methods was provided by Efron and Tibshirani (1993), with a theoretical and
empirical study by Kohavi (1995). As combining multiple models became a popular research
topic in machine learning, the bagging (for “bootstrap aggregating”) technique was introduced
by Breiman (1996). The AdaBoost.M1 boosting algorithm was developed by
Freund and Schapire (1997) with several different classifiers, including decision tree induction
by Quinlan (1996) and naive Bayesian classification by Elkan (1997). Drucker (1997) adapted
AdaBoost.M1 for numeric prediction. Freund and Schapire (1996) developed and derived
theoretical bounds for its performance. Friedman, Hastie and Tibshirani (2000) proposed the
LogitBoost algorithm. Later, Friedman (2001) describes how to make boosting more resilient in
the presence of noisy data. Bay (1999) suggests using randomization for ensemble learning with
nearest neighbor classifiers. Bagging, boosting and randomization were evaluated by Dietterich
(2000). More recently, Zhang and Ma (2012) and Okun, Valentini and Re (2011) provide descriptions of ensembles and their applications to machine learning. The theoretical model for
co-training was first proposed by Blum and Mitchell (1998) for the use of labeled and unlabeled
data from different independent perspectives. Nigam and Ghani (2000) analyzed the
effectiveness and applicability of co-training and used standard EM to fill in missing values, in
what is called the co-EM algorithm. Applied to text classification, Nigam, McCallum, Thrun, and
Mitchell (2000) used the EM clustering algorithm to exploit unlabeled data to improve an initial
naïve Bayes classifier. Later, Ghani (2002) extended co-training and co-EM to multiclass
situations with error-correcting output codes. Brefeld and Scheffer (2004) extended co-EM to
use a support vector machine rather than naïve Bayes.
Exercise
1. Explain the steps towards classification or prediction.
2. Describe Fisher’s linear discriminant or the centroid-based method using the following data.
Fat (F)   Protein (P)   Glucose (G)   Positive (C)
250       5.10          4.56          Y
220       8.65          8.91          Y
280       0.34          4.12          Y
230       0.45          9.74          N
150       9.48          10.45         Y
170       7.62          3.66          N
160       0.25          9.67          N
140       0.47          5.49          N
80        6.59          4.83          N
100       5.82          11.54         Y
90        0.54          3.52          N
110       0.62          4.81          N
3. From the result in question (2), what is the class when the following test object is observed?
Fat (F)   Protein (P)   Glucose (G)   Positive (C)
90        7.67          4.57          ?
4. Given the table in question (2), what is the class when the k-NN is used to classify the test
datum in question (3)? Here, calculate the result when k = 1 and 3.
5. Compare the k-NN and centroid-based methods. What is the effect when k (for k-NN) becomes
larger?
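For checking answers to questions 4 and 5, a minimal k-NN sketch in Python, with the question-2 table typed in as (fat, protein, glucose) feature tuples and class labels. Note that, as in the exercise, the features are left unscaled, so the Fat attribute dominates the Euclidean distance:

```python
from collections import Counter

# (Fat, Protein, Glucose) -> class, from the table in question 2.
TRAIN = [((250, 5.10, 4.56), "Y"), ((220, 8.65, 8.91), "Y"),
         ((280, 0.34, 4.12), "Y"), ((230, 0.45, 9.74), "N"),
         ((150, 9.48, 10.45), "Y"), ((170, 7.62, 3.66), "N"),
         ((160, 0.25, 9.67), "N"), ((140, 0.47, 5.49), "N"),
         ((80, 6.59, 4.83), "N"), ((100, 5.82, 11.54), "Y"),
         ((90, 0.54, 3.52), "N"), ((110, 0.62, 4.81), "N")]

def knn_predict(train, query, k):
    """Majority vote among the k nearest training examples (squared
    Euclidean distance gives the same ordering as Euclidean distance)."""
    nearest = sorted(train,
                     key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

For the question-3 test object, `knn_predict(TRAIN, (90, 7.67, 4.57), 1)` and the same call with `k=3` can be compared directly.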
6. Given the table below, construct a probabilistic model based on naïve Bayes by calculating the
prior probabilities p(H) and the posterior probabilities p(H|E), where H is a hypothesis and E is
the evidence. Here, use the ‘Carbon’ attribute with its numerical values and also use Laplace
estimation by adding 1 for each class.
Temp Color Carbon Burn
High Red 90 (H) Y
High Red 60 (M) Y
High Yellow 95 (H) Y
High Yellow 30 (L) N
High Blue 80 (H) Y
High Blue 60 (M) Y
Low Red 55 (M) Y
Low Red 25 (L) N
Low Yellow 10 (L) N
Low Yellow 25 (L) N
Low Blue 65 (M) Y
Low Blue 90 (H) Y
7. Apply the acquired model in question (6) to classify the following case.
Temp Color Carbon Burn
Low Yellow 20 (L) ?
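Questions 6 and 7 can be checked mechanically with the sketch below. It treats Carbon by its discretized level (L/M/H) shown in parentheses in the table, which is one possible reading of the exercise, and applies add-one (Laplace) smoothing to each conditional probability:

```python
# Rows of the question-6 table; Carbon is represented by its
# discretized level (L/M/H) shown in parentheses in the table.
DATA = [("High", "Red", "H", "Y"), ("High", "Red", "M", "Y"),
        ("High", "Yellow", "H", "Y"), ("High", "Yellow", "L", "N"),
        ("High", "Blue", "H", "Y"), ("High", "Blue", "M", "Y"),
        ("Low", "Red", "M", "Y"), ("Low", "Red", "L", "N"),
        ("Low", "Yellow", "L", "N"), ("Low", "Yellow", "L", "N"),
        ("Low", "Blue", "M", "Y"), ("Low", "Blue", "H", "Y")]

def nb_score(x, cls):
    """Unnormalized naive-Bayes score p(cls) * prod_i p(x_i | cls),
    with add-one (Laplace) smoothing on each conditional."""
    in_cls = [r for r in DATA if r[-1] == cls]
    score = len(in_cls) / len(DATA)              # prior p(cls)
    for i, v in enumerate(x):
        n_values = len({r[i] for r in DATA})     # distinct values of attribute i
        count = sum(1 for r in in_cls if r[i] == v)
        score *= (count + 1) / (len(in_cls) + n_values)
    return score

def nb_classify(x):
    return max(("Y", "N"), key=lambda c: nb_score(x, c))
```

The question-7 case is then `nb_classify(("Low", "Yellow", "L"))`. One design note: the smoothing denominator here adds the number of distinct attribute values, one common convention; the exercise's "adding 1 for each class" admits other readings.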
8. Given the table below, construct a decision tree. Here, compare the results obtained using
information gain and gain ratio.
Hair     Height    Weight    Lotion   Result
Blonde   Average   Light     N        Sunburned
Blonde   Tall      Average   Y        None
Brown    Short     Average   Y        None
Blonde   Short     Average   N        Sunburned
Red      Average   Heavy     N        Sunburned
Brown    Tall      Heavy     N        None
Brown    Average   Heavy     N        None
Blonde   Short     Light     Y        None
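The attribute choice in question 8 can be verified with a short script. It assumes the eight distinct rows of the table are the intended dataset (the extraction repeats them), and uses the standard definitions: gain is H(class) minus the expected entropy after the split, and gain ratio divides the gain by the split information:

```python
from math import log2
from collections import Counter

# The eight distinct rows of the question-8 table.
ROWS = [("Blonde", "Average", "Light", "N", "Sunburned"),
        ("Blonde", "Tall", "Average", "Y", "None"),
        ("Brown", "Short", "Average", "Y", "None"),
        ("Blonde", "Short", "Average", "N", "Sunburned"),
        ("Red", "Average", "Heavy", "N", "Sunburned"),
        ("Brown", "Tall", "Heavy", "N", "None"),
        ("Brown", "Average", "Heavy", "N", "None"),
        ("Blonde", "Short", "Light", "Y", "None")]
ATTRS = ["Hair", "Height", "Weight", "Lotion"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, i):
    """H(class) minus the expected entropy after splitting on attribute i."""
    remainder = 0.0
    for v in {r[i] for r in rows}:
        sub = [r[-1] for r in rows if r[i] == v]
        remainder += len(sub) / len(rows) * entropy(sub)
    return entropy([r[-1] for r in rows]) - remainder

def gain_ratio(rows, i):
    """Information gain normalized by the split information H(attribute i)."""
    split = entropy([r[i] for r in rows])
    return info_gain(rows, i) / split if split else 0.0

best_by_gain = max(ATTRS, key=lambda a: info_gain(ROWS, ATTRS.index(a)))
best_by_ratio = max(ATTRS, key=lambda a: gain_ratio(ROWS, ATTRS.index(a)))
```

The two criteria disagree at the root on this data, which is exactly the comparison the exercise asks for.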
9. Describe the effect of tree pruning in decision tree induction.
10. Given the table below, construct a set of covering rules using the criterion of either p/t or (p-n)/t.
Outlook    Temperature   Humidity   Windy   Play
Sunny      Hot           High       False   No
Sunny      Hot           High       True    No
Overcast   Hot           High       False   Yes
Rainy      Mild          High       False   Yes
Rainy      Cool          Normal     False   Yes
Rainy      Cool          Normal     True    No
Overcast   Cool          Normal     True    Yes
Sunny      Mild          High       False   No
Sunny      Cool          Normal     False   Yes
Rainy      Mild          Normal     False   Yes
Sunny      Mild          Normal     True    Yes
Overcast   Mild          High       True    Yes
Overcast   Hot           Normal     False   Yes
Rainy      Mild          High       True    No
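The first step of the covering algorithm in question 10 can be checked with a few lines of Python. This sketch scores every attribute=value test by p/t for the class ‘Yes’; the (p-n)/t variant changes only the scoring line, since n = t - p:

```python
# Rows of the question-10 weather table.
ROWS = [("Sunny", "Hot", "High", "False", "No"),
        ("Sunny", "Hot", "High", "True", "No"),
        ("Overcast", "Hot", "High", "False", "Yes"),
        ("Rainy", "Mild", "High", "False", "Yes"),
        ("Rainy", "Cool", "Normal", "False", "Yes"),
        ("Rainy", "Cool", "Normal", "True", "No"),
        ("Overcast", "Cool", "Normal", "True", "Yes"),
        ("Sunny", "Mild", "High", "False", "No"),
        ("Sunny", "Cool", "Normal", "False", "Yes"),
        ("Rainy", "Mild", "Normal", "False", "Yes"),
        ("Sunny", "Mild", "Normal", "True", "Yes"),
        ("Overcast", "Mild", "High", "True", "Yes"),
        ("Overcast", "Hot", "Normal", "False", "Yes"),
        ("Rainy", "Mild", "High", "True", "No")]
ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]

def best_test(rows, target):
    """Return the attribute=value test maximizing p/t, where t counts the
    covered rows and p the covered rows belonging to the target class."""
    best, best_score = None, -1.0
    for i, attr in enumerate(ATTRS):
        for v in sorted({r[i] for r in rows}):
            covered = [r for r in rows if r[i] == v]
            p = sum(1 for r in covered if r[-1] == target)
            score = p / len(covered)   # for (p-n)/t use (2*p - len(covered)) / len(covered)
            if score > best_score:
                best, best_score = (attr, v), score
    return best, best_score
```

A full covering run would then remove the rows the finished rule covers and repeat the search for the next rule.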
11. From the result of the previous question, apply the hypergeometric distribution to calculate rule
significance for pruning.
12. Explain how to use a neural network to classify a handwritten digit (0-9).
13. From the set of the given objects and their classes shown below, specify the support vectors,
draw the separating hyperplanes and the decision boundary, calculate the margin, and explain
the formulae of the separating hyperplanes.
x y class x y class
1 2 A 4 5 B
1 3 A 5 5 B
3 1 A 5 4 B
3 3 A 5 3 B
4 2 A 6 4 B
14. Apply linear regression to predict the value of ‘positive’. Here, y = positive, x1 = fat, x2 = protein
and x3 = glucose. Then use the obtained linear regression function to predict the test case.
Fat (F)   Protein (P)   Glucose (G)   Positive (C)
250       5.10          4.56          0.82
220       8.65          8.91          0.75
280       0.34          4.12          0.92
230       0.45          9.74          0.25
150       9.48          10.45         0.85
170       7.62          3.66          0.15
160       0.25          9.67          0.25
140       0.47          5.49          0.30
80        6.59          4.83          0.14
100       5.82          11.54         0.78
90        0.54          3.52          0.12
110       0.62          4.81          0.08
Test case
Fat (F)   Protein (P)   Glucose (G)   Positive (C)
100       8.00          3.00          ?
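Question 14 can be checked numerically with the sketch below: ordinary least squares via the normal equations A^T A w = A^T y, solved by small-scale Gaussian elimination. This is adequate for a tiny, well-conditioned problem like this one; a library QR solver would be preferable in general:

```python
# (Fat, Protein, Glucose, Positive) rows from the question-14 table.
DATA = [(250, 5.10, 4.56, 0.82), (220, 8.65, 8.91, 0.75),
        (280, 0.34, 4.12, 0.92), (230, 0.45, 9.74, 0.25),
        (150, 9.48, 10.45, 0.85), (170, 7.62, 3.66, 0.15),
        (160, 0.25, 9.67, 0.25), (140, 0.47, 5.49, 0.30),
        (80, 6.59, 4.83, 0.14), (100, 5.82, 11.54, 0.78),
        (90, 0.54, 3.52, 0.12), (110, 0.62, 4.81, 0.08)]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_linear(X, y):
    """Least squares with an intercept: w solves (A^T A) w = A^T y."""
    A = [[1.0] + list(row) for row in X]
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
    Aty = [sum(A[k][i] * y[k] for k in range(m)) for i in range(n)]
    return solve(AtA, Aty)

w = fit_linear([r[:3] for r in DATA], [r[3] for r in DATA])
prediction = w[0] + w[1] * 100 + w[2] * 8.00 + w[3] * 3.00   # the test case
```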
15. Use the following table to learn regressions for classification and then classify the test case.
Here, compare two approaches: one-against-the-other regression and pairwise regression.
Fat (F)   Protein (P)   Glucose (G)   Positive (C)
250       5.10          4.56          Class A
220       8.65          8.91          Class B
280       0.34          4.12          Class A
230       0.45          9.74          Class C
150       9.48          10.45         Class B
170       7.62          3.66          Class C
160       0.25          9.67          Class B
140       0.47          5.49          Class B
80        6.59          4.83          Class C
100       5.82          11.54         Class A
90        0.54          3.52          Class C
110       0.62          4.81          Class C
Test case
Fat (F)   Protein (P)   Glucose (G)   Positive (C)
100       8.00          3.00          ?
16. Compare the merits and demerits of bagging, boosting, stacking and co-training.
Sponsored by AIAT.or.th and KINDML, SIIT