

Machine Learning in Pervasive Computing - DiVA portal: ltu.diva-portal.org/smash/get/diva2:997349/FULLTEXT01.pdf

Luleå University of Technology

Study Report

Machine Learning in Pervasive Computing

Author:

Samuel Idowu

Supervisor:

Prof. Christer Åhlund

Co-Supervisor:

Olov Schelén

Co-Supervisor:

Robert Brännström

Pervasive and Mobile Computing

Department of Computer Science, Electrical and Space Engineering

September 2013


LULEÅ UNIVERSITY OF TECHNOLOGY

Abstract

Pervasive and Mobile Computing

Department of Computer Science, Electrical and Space Engineering

Machine Learning in Pervasive Computing

The increase in data quantities and in the number of pervasive systems has given rise to many decision-making systems. Most of these systems employ Machine Learning (ML) in various practical scenarios and applications. The enormous amount of data generated by sensors can be useful in decision-making systems, and the rising number of sensor-driven pervasive systems presents interesting research areas on how to adapt and apply existing ML techniques effectively to the domain of pervasive computing. In the face of the data deluge, ML has proved viable in many application areas, such as data mining and self-customizing programs, and could have a great impact on the field of pervasive computing.

The objective of this study is to present the underlying concepts of ML techniques that can be applied to problems in the domain of pervasive and mobile computing. The scope of this study covers the three primary types of ML: supervised, unsupervised and reinforcement learning methods. In providing this fundamental knowledge of ML, we present some conceptual terms of ML and the steps required in developing an ML system with a great impact on domains outside the scope of ML.

Our findings show that previous works in the area of ubiquitous computing have successfully applied supervised learning and reinforcement learning methods. Hence, this study focuses more on supervised learning and reinforcement learning. In conclusion, we discuss some basic performance evaluation metrics and methods for obtaining reliable classifier estimates, such as cross-validation and leave-one-out validation.


Contents

Abstract
List of Figures
List of Tables
1 Machine Learning in Pervasive Computing
2 Related Study and Scope of Study
3 Machine Learning
  3.1 Machine Learning, Data Mining and Artificial Intelligence
  3.2 Significance of Machine Learning
  3.3 A Machine Learning Example
4 Types of Machine Learning
  4.1 Supervised Learning
  4.2 Unsupervised Learning
  4.3 Reinforcement Learning
  4.4 Common Machine Learning Paradigms and Categories
5 Essential Steps in Machine Learning
  5.1 Impacting the Real World Outside Machine Learning
6 Supervised Learning
  6.1 Bayesian Classification
    6.1.1 Bayes Theorem
    6.1.2 Naive Bayes Classifier
  6.2 Instance-Based Learners
  6.3 Decision Trees
  6.4 Support Vector Machines
  6.5 Hidden Markov Model
    6.5.1 Discrete Markov Model
    6.5.2 Hidden Markov Model
  6.6 Comparing Supervised Learning Algorithms
7 Unsupervised Learning
  7.1 Clustering Analysis
    7.1.1 k-means Clustering
  7.2 Association Analysis
    7.2.1 Apriori Algorithm
8 Reinforcement Learning
  8.1 Elements of Reinforcement Learning
    8.1.1 Passive Reinforcement Learning
    8.1.2 Active Reinforcement Learning
9 Performance Evaluation in Machine Learning
  9.1 Confusion Matrix
  9.2 Accuracy and Error Rate
  9.3 Sensitivity and Specificity
  9.4 Precision, Recall and F-measure
  9.5 Methods for Obtaining Reliable Evaluation Measures
10 Conclusions
Bibliography


List of Figures

1 Training and evaluation phases using a training set and a test set respectively
2 Common algorithms used in ML
3 Major steps involved in designing a learning phase
4 Main evaluation factors
5 Three stages of a machine learning research program
6 Sensor-readings training data set
7 Calculated mean and variance
8 Illustration of kNN classification
9 A decision tree in flowchart form
10 Support Vector Machines (SVM) optimal hyperplane and maximum margin
11 Complex data types for mining [1]
12 A simple Markov chain
13 Second-order Markov model
14 Partially observable system
15 Apriori principle illustration
16 Learning agent - environment


List of Tables

1 Lenses Data Set [2]
2 Approaches to defining the distance between two instances (x and y) [3]
3 State sequence probabilities
4 HMM probabilities
5 Comparing learning algorithms (**** represents the best and * the worst) [3]
6 A simple itemset
7 Confusion matrix with totals for positive and negative examples
8 Example of a confusion matrix with totals for positive and negative examples


Abbreviations

ML Machine Learning

CART Classification and Regression Tree

kNN k–Nearest Neighbor

ANNs Artificial Neural Networks

SVM Support Vector Machines

SVR Support Vector Regression

ID3 Iterative Dichotomiser 3

MDP Markov Decision Process

DUE Direct Utility Estimation

DP Dynamic Programming

ADP Adaptive Dynamic Programming

TD Temporal Difference

QP Quadratic Problem

SMO Sequential Minimal Optimization

GPS Global Positioning System

AI Artificial Intelligence

NLP Natural Language Processing

HMM Hidden Markov Model

HCI Human Computer Interaction

CLS Concept Learning System

IID Independent and Identically Distributed



1 Machine Learning in Pervasive Computing

There is a growing interest in the field of Machine Learning (ML). This interest can be partly attributed to the big data era, which has led to a deluge of data in modern times [4]. The bulk of today's data is generated by sensors (e.g. context-aware sensing devices), which periodically produce information about contexts such as location and time [5]. Other sensing devices monitor the state of structures such as buildings by observing their vibration [6]. These examples, and many others, contribute to today's data deluge. The total volume of data generated on earth exceeded one zettabyte (ZB) in 2010 and will continue to grow exponentially [7]. Estimates also show that the amount of data stored in the world's databases doubles every 20 months [8].

The advantages and potential of data are highly valuable in many fields. However, realizing these benefits requires considerable effort to derive useful knowledge from a tremendous amount of data. It is no surprise that much attention has shifted towards automated and effective ways of analyzing large and complex data sets, which ML provides [4].

The research presented in this document aims to build knowledge about ML and its techniques. This involves a comprehensive understanding of ML and knowledge of the basis of commonly used algorithms. Since there is no win-all technique or algorithm for every data set or problem scenario in ML [9], it is crucial to become well-versed in the suitability of various algorithms for different types of data sets and problems. We shall later use this experience to motivate the choice of algorithms appropriate for specific problems. We shall continue this work by applying the understanding gained in the field of ML to two problem cases. The first case involves activity recognition on a wooden bridge based on the movement of the bridge. The second case applies ML in mobile networks, specifically the selection of access points or base stations based on prior network performance.

Section 2 describes the scope of and motivations for this study. Section 3 covers the definition of ML, its significance and its relationship with data mining and artificial intelligence; it also presents the important terms in ML using a simple classification problem. Section 4 describes the three main types of ML and discusses ML paradigms. Section 5 presents the general steps in developing an ML



system. Sections 6, 7 and 8 discuss general techniques under the supervised, unsupervised and reinforcement learning types respectively. In these sections, we present the basic ideas behind the techniques; we do not delve into the trickier issues, but explain the underlying concepts. In Section 9, we present commonly used metrics for evaluating the accuracy of ML tasks.

2 Related Study and Scope of Study

In this section, we give and justify the scope of this study. The two main factors that motivate this scope are:

• previous research in pervasive computing environments that applies ML, and

• the nature of our research problems, to which we shall apply ML in future work.

Much ML research in pervasive computing environments has favored the use of supervised learning (see Section 4.1) in many sensor-data applications such as activity recognition [10] [11] [12], intelligent environments [13], mobility prediction [9], structural health monitoring [14] [15] and Human Computer Interaction (HCI) [16]. Pervasive computing applications are by nature context-aware; they have as possible input the entire observable state of the environment [17]. A pervasive computing system is capable of basing computational decisions on interactions within its environment as a whole. This is possible since pervasive systems typically have partial knowledge of their physical environment with the help of sensing devices (e.g. location sensors, energy meters) and an almost complete understanding of the computational environment (e.g. networked device state, application state and service state) [17]. Appropriately employed supervised ML techniques remove the limitation of having to manually specify a decision process for each contextual situation, and thereby increase automation. For instance, instead of expressing the rules for an action, the system can be provided with examples, under varying contextual situations, of when that action should be taken [17].

Some related projects that have employed forms of supervised learning include the following. The ContAct project [10] involves a system that derives user activity and availability from sensor data in an office space. It uses a Naive Bayes classifier to learn user activity and availability from sensor data according to given user



feedback. A predictive building energy optimization system [18] controls and improves the efficiency and reliability of building operations without requiring large amounts of additional capital investment; it applies a predictive ML model known as Support Vector Regression (SVR) to the building's historical energy use. The Lumiere project [19] at Microsoft Research applied Bayesian models to develop techniques and a platform for reasoning about the goals and needs of software users as they work with software, using a large amount of example data from users and experts. Its objective is to learn proper interactions with the user, according to perceived user activities, in order to provide assistance within a software environment. These systems directly or indirectly learn a model from a history of contextual situations, perceived through contextual sensors, as feedback. The generated model then helps in decision-making or in predicting future contexts.

Supervised learning can be considered limited in systems that require some form of direct feedback to facilitate the learning process; hence, supervised learning alone is not adequate for continuous learning from interaction [20]. In interactive problems, it is mostly impractical to obtain examples of desired behavior that are both correct and representative of all situations in which a learning agent has to act [20]. Direct feedback enables a system to learn from its own experience. Reinforcement learning (see Section 4.3) is a form of ML suitable for these types of systems. As an example, in the MavHome project [21], the environment is represented using a hierarchical Hidden Markov Model (HMM), and a reinforcement learning algorithm is employed to predict environmental preferences based on sensors within the environment.

Our research interest centers on applying ML to research problems in pervasive computing. For future work, possible domains include activity recognition on bridges, model-based access network selection and sensor-data-driven energy optimization. In view of our research interest and of prior works that have employed ML, we limit the scope of this study to the basis of common supervised learning algorithms such as naive Bayes, decision trees, SVM and HMM. The scope also includes common unsupervised learning tasks (i.e. clustering analysis and association analysis) and reinforcement learning: its principal elements and the key types of reinforcement learning.



3 Machine Learning

Machine Learning (ML) applies suitable algorithms to data for a learning problem. Tom Mitchell [22] defined a well-posed learning problem as follows: a computer program is said to learn from experience E (observed examples) with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. We can therefore characterize an ML system as one that learns from experience with respect to some class of tasks and a performance measure [22].

In general terms, ML can be defined as a set of methods that automatically detect patterns (general regularities) in empirical data, such as sensor data or databases, and then use the discovered patterns to predict future data or to perform other kinds of decision making under uncertainty [4]. Various disciplines utilize ML, including the obvious ones like computer science and statistics, as well as many other fields from politics to the geosciences [23]. ML research today focuses largely on automatically learning patterns in large data sets of complex data types [1].

This section discusses ML in relation to the two principal fields that often apply ML algorithms, Artificial Intelligence (AI) and data mining. It also presents the significance of ML and its applications. Lastly, we introduce the relevant terms in ML using a simple supervised learning problem.

3.1 Machine Learning, Data Mining and Artificial Intelligence

Data mining and artificial intelligence are two popular fields that regularly apply ML techniques. Data mining, also known as knowledge discovery from databases, advanced data analysis and ML [24], combines methods and tools from at least three areas: ML, statistics and databases [25]. In the field of artificial intelligence, it is difficult to regard an intelligent system that lacks a learning capability as a true intelligent system [26]. In these fields the ultimate objectives are quite similar, although there are slight differences in the approach to and use of ML in each. Data mining bridges other technical areas, including databases, human-computer interaction and statistical analysis [24]. Although data mining centers on ML techniques, it also involves other essential steps such as database creation and maintenance, data formatting and cleansing, data visualization and summarization, and the use of human expert knowledge to



formulate the inputs to the learning algorithm and to evaluate the empirical patterns it discovers [24]. AI attempts to understand and build intelligent entities. These entities require the capability to adapt to new circumstances and to detect and extrapolate patterns [27]. AI employs various ML techniques to provide this vital ability. In general, AI also involves other capabilities such as natural language processing, knowledge representation and automated reasoning [27].

3.2 Significance of Machine Learning

In the face of the current data explosion, the question arises of how to handle such data efficiently. It is crucial for systems to process data both efficiently and quickly. Recent research in ML tends towards adapting existing techniques and algorithms to cater for complex data such as data streams. Recent studies also seek efficient means of utilizing memory and processor cycles when dealing with huge amounts of data [28]. Applications of ML algorithms in data mining include medical record analysis and credit card fraud detection [24]. ML has also been used for understanding and predicting customer purchase behavior and manufacturing processes, and for predicting the personal interests of web users [24] (e.g. Amazon and Netflix product recommendations). Another use of ML is in complex systems that cannot be programmed by hand; such applications include autonomous helicopters, handwriting recognition, computer vision and Natural Language Processing (NLP). Though the earlier classic ML methods are able to produce acceptable results with simple data sets in several applications, the huge volume of data and complex data types of recent times have prompted much research into the best and most effective ways of knowledge discovery and ML [1].

3.3 A Machine Learning Example

Before we go further in this report, it is beneficial to give a simple ML example and establish some common terminology that we shall use henceforth. To make this as effective as possible, we shall use an example of a common ML task: an expert system that is capable of lens prescription using the values presented in Table 1. The presented data is from the UCI ML repository [2]. As we discuss this system, we shall emphasize some common terms that are essential in ML.



     Age of patient   Prescription   Astigmatic   Tear rate   Lenses
 1   young            myope          no           reduced     none
 2   young            myope          no           normal      soft
 3   young            myope          yes          reduced     none
 4   young            myope          yes          normal      hard
 5   young            hypermetrope   no           reduced     none
 6   young            hypermetrope   no           normal      soft
 7   young            hypermetrope   yes          reduced     none
 8   young            hypermetrope   yes          normal      hard
 9   pre-presbyopic   myope          no           reduced     none
10   pre-presbyopic   myope          no           normal      soft
11   pre-presbyopic   myope          yes          reduced     none
12   pre-presbyopic   myope          yes          normal      hard
13   pre-presbyopic   hypermetrope   no           reduced     none
14   pre-presbyopic   hypermetrope   no           normal      soft
15   pre-presbyopic   hypermetrope   yes          reduced     none
16   pre-presbyopic   hypermetrope   yes          normal      none
17   presbyopic       myope          no           reduced     none
18   presbyopic       myope          no           normal      none
19   presbyopic       myope          yes          reduced     none
20   presbyopic       myope          yes          normal      hard
21   presbyopic       hypermetrope   no           reduced     none
22   presbyopic       hypermetrope   no           normal      soft
23   presbyopic       hypermetrope   yes          reduced     none
24   presbyopic       hypermetrope   yes          normal      none

Table 1: Lenses Data Set [2]

The expert system needs to know some information about its patient to be able to prescribe lenses correctly, just as a real human optician does. For a simple illustration, we will use four items of information about each patient. These are often called features, attributes or covariates [23] [4], and they are the first four columns in Table 1. Each row in Table 1 is called an instance, which consists of these features. Features can be basic data types (e.g. nominal or numerical) [8]; they can also be more complex structured data such as images, texts, emails and raw sensor data [4]. In this example, the first feature, age of patient, is a nominal data type taking one value from the set {young, pre-presbyopic, presbyopic}. The second, third and fourth features are binary data types, taking the values myope/hypermetrope, yes/no and reduced/normal respectively. In real-world scenarios, data is not always as friendly as it appears in Table 1. Data usually contains a very large number of features (often thousands of dimensions), possibly with missing values, which makes a learning task more difficult. The curse of dimensionality [29] refers to the phenomena that arise when dealing with such high-dimensional data.
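Most learning algorithms expect numeric inputs, so nominal features like these are usually mapped to integer codes (or one-hot vectors) first. Below is a minimal Python sketch; the feature names and value sets are taken from Table 1, while the encoding scheme itself is our own illustration, not part of the original data set.

```python
# Map each nominal feature value in the lenses data to an integer code.
# The value sets come from Table 1; the integer assignment is arbitrary.
FEATURE_VALUES = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "prescription": ["myope", "hypermetrope"],
    "astigmatic": ["no", "yes"],
    "tear_rate": ["reduced", "normal"],
}

def encode(instance):
    """Turn a dict of nominal feature values into an integer vector."""
    return [FEATURE_VALUES[name].index(instance[name])
            for name in ("age", "prescription", "astigmatic", "tear_rate")]

x = encode({"age": "young", "prescription": "myope",
            "astigmatic": "no", "tear_rate": "reduced"})
print(x)  # -> [0, 0, 0, 0], the first instance in Table 1
```

One-hot encoding would instead expand each feature into one column per possible value, which avoids imposing an artificial ordering on nominal values.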



Each example has four different features and one target variable. The target variable is known as a class or label [23] [4], and it is what the expert system will try to predict. In this illustration, the class (target variable) is Lenses, whose value comes from the three-item set {none, soft, hard}. In principle, the class of an instance could be anything, but it is mostly assumed to be a categorical or nominal variable from some finite set [4]. The lens prescription problem that the described expert system handles can be solved by a widely used ML technique known as classification. A classification task answers questions such as: how do we decide when to prescribe soft lenses rather than another lens type [23]? There are many problems in this category of ML tasks; one can certainly argue that it is the most explored aspect of ML, both in application and in research [30]. Several algorithms can accomplish the classification task, but for now we shall only discuss the basic idea behind the classification problem in general. What we need to do is teach the system how, for instance, human opticians perform prescriptions. We provide some quality example instances and allow the system to learn from them. These examples are known as a training set [23]; the training set in Table 1 has 24 examples. Algorithms that find some connection between the features and the target variable can then solve the classification task.
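As a concrete (if naive) illustration of learning such a connection, the sketch below stores a few training rows from Table 1 and predicts the label of the stored example that agrees with the query on the most features, a crude nearest-neighbour rule. This toy example is ours, chosen only to make the idea tangible; it is not the algorithm the report later develops.

```python
# A minimal nearest-neighbour classifier over nominal features.
# Each training example is (features, label); rows copied from Table 1.
TRAIN = [
    (("young", "myope", "no", "reduced"), "none"),
    (("young", "myope", "no", "normal"), "soft"),
    (("young", "myope", "yes", "normal"), "hard"),
    (("presbyopic", "hypermetrope", "yes", "reduced"), "none"),
]

def overlap(x, y):
    """Number of features on which two instances agree."""
    return sum(a == b for a, b in zip(x, y))

def predict(query):
    """Label of the training example most similar to the query."""
    best_features, best_label = max(TRAIN, key=lambda ex: overlap(ex[0], query))
    return best_label

print(predict(("young", "myope", "no", "reduced")))  # -> none
```

Section 6.2 treats instance-based learners like this one properly, including better distance measures than simple feature overlap.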

Figure 1: Training and evaluation phase using a training set and a test set respectively.

In the field of ML, it is vital to know how accurately a system performs; this process is known as evaluation (see Section 9). To test the illustrated lens-predictor system, we need a set of examples, called a test set, which is different from the training set. The major difference between the test set and the training set is that a test set only has the feature values



of each example, with no class value given. The missing class is what the system should predict. After a learning phase with the training set and an evaluation phase with the test set, the next step is to determine whether the classifier's level of accuracy is sufficient. Figure 1 illustrates a training phase and an evaluation phase. The output of a classification algorithm is called a classifier, model or knowledge representation.
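The train-then-evaluate loop of Figure 1 can be sketched in a few lines. The "classifier" below is deliberately trivial, it always predicts the most frequent training class, so that the focus stays on the evaluation step; the class counts mirror Table 1, where 15 of the 24 instances are labeled none.

```python
from collections import Counter

def fit_majority(train_labels):
    """'Learn' the most frequent class in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(model_label, test_labels):
    """Fraction of test examples whose true class equals the prediction."""
    hits = sum(label == model_label for label in test_labels)
    return hits / len(test_labels)

# Class distribution of the lenses data in Table 1: 15 none, 5 soft, 4 hard.
train = ["none"] * 15 + ["soft"] * 5 + ["hard"] * 4
model = fit_majority(train)
print(model, accuracy(model, ["none", "soft", "none", "hard"]))  # -> none 0.5
```

A majority-class predictor like this is often used as a baseline: any real classifier should beat its accuracy, which is why Section 9 discusses more informative measures than raw accuracy.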

4 Types of Machine Learning

ML is divided into three principal parts: supervised learning (the predictive learning approach), unsupervised learning (the descriptive learning approach) and reinforcement learning [4]. Supervised learning methods learn from given examples; the case given in Section 3.3 belongs to this group. In situations where we are not fortunate enough to have example data from which to learn, we can apply unsupervised ML. The aim of unsupervised learning is to find interesting patterns in input data; it is largely employed in the data mining field and is sometimes referred to as knowledge discovery [4]. In reinforcement learning, the task is to learn how to behave when given a series of reinforcements, rewards or punishments. Unlike the previous types, reinforcement learning is less commonly used [4].

4.1 Supervised Learning

In this approach, the task is to learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D = {(x_i, y_i)}, i = 1, ..., N, where D is the training set and N is the number of training examples [4]. Each example input x_i is a fixed-length vector of values representing the features of an instance, and the output y_i represents the class or label of the corresponding features [4]. In relation to the illustration in Section 3.3, D is the complete lenses data set [2], where x_i comprises the features age, prescription, astigmatic and tear rate, the label variable y_i represents lenses, and the number of instances in the training set, N, is 24 [2].

The type of the output variable y_i differentiates the two kinds of task performed under supervised learning. In cases similar to the example in Section 3.3, where the output variable y_i is categorical or nominal, the task is referred to as classification or pattern recognition [4] [23]. Common algorithms used for classification tasks



include Naive Bayes, C4.5 decision trees and k-Nearest Neighbor (kNN). The second type of supervised learning is referred to as regression, where the output variable y_i is a real-valued scalar [4] [23]. A variant of regression in which the class space Y has some natural ordering, such as grades A-F, is called ordinal regression [4]. Common algorithms used for regression include linear regression as well as Classification and Regression Tree (CART), which can also be used for classification tasks.
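For intuition about regression, one-dimensional least-squares linear regression has a closed form: the slope is the covariance of x and y divided by the variance of x. This is a generic textbook sketch, not tied to any data in this report.

```python
def linear_fit(xs, ys):
    """Closed-form least-squares fit y = slope * x + intercept for 1-D data."""
    n = len(xs)
    mx = sum(xs) / n                     # mean of the inputs
    my = sum(ys) / n                     # mean of the outputs
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope, intercept

# The points below lie exactly on y = 2x + 1, so the fit recovers it.
a, b = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # -> 2.0 1.0
```

With noisy data the same formula returns the line minimizing the sum of squared prediction errors, which is the criterion behind the regression algorithms named above.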

4.2 Unsupervised Learning

Unsupervised learning methods use only the input, D = {xi}, i = 1, ..., N, and the goal is to find "interesting patterns" in the data [4]. No class or label value is given for the data D [23]. This is a much less well-defined problem, since the exact pattern to detect is not specified and there is no obvious error metric to use (i.e. we cannot compare our prediction y for a given x to an observed value) [4]. Unsupervised learning is commonly used in tasks such as clustering, where the interest is to group similar items together, and association analysis. Density estimation tasks are unsupervised learning tasks that discover the statistical values describing a data set [23]. Dimension reduction tasks are applied to reduce the number of columns (i.e. features) of data sets in high-dimensional spaces.
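Clustering can be made concrete with a minimal k-means sketch in Python. The two-dimensional points, the choice of k = 2 and the fixed iteration count below are invented for illustration and are not from the report:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: move each non-empty centroid to its cluster mean
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centroids, clusters

points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),   # one compact group
          (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]   # another compact group
centroids, clusters = kmeans(points, k=2)
```

Note that no labels are involved: the "interesting pattern" (two compact groups) is recovered from the inputs alone.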

4.3 Reinforcement Learning

Reinforcement learning is learning what to do, i.e. how to map situations to actions so as to maximize a numerical reward signal [20]. The actions to be taken to achieve the goal are not given, as they are with other types of ML; instead, the learner must discover which actions yield the most reward by trying them [20]. Russell et al. describe reinforcement learning as a form of learning where the agent learns from a series of reinforcements – rewards or punishments [27]. The most important aspects of a reinforcement learning system include: sensation – the ability to sense the state of the environment; action – it must be able to take actions that affect the state of its environment; and goal – the system must have goals relating to the state of the environment [20]. Algorithms used under this approach of ML include Q-learning and Temporal Difference (TD) learning.
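A sketch of tabular Q-learning on a toy problem shows how the agent discovers rewarding actions by trying them. The 5-state corridor environment, rewards and parameters below are invented for illustration; the update rule Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) − Q(s,a)] is the standard one:

```python
import random

def train_q(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=1):
    """Tabular Q-learning on a toy 5-state corridor.
    States 0..4; actions: 0 = left, 1 = right.
    Reaching state 4 gives reward 1 and ends the episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        s = 0
        while s != 4:
            # Epsilon-greedy action selection: mostly exploit, sometimes explore
            a = rng.randrange(2) if rng.random() < eps else (1 if q[s][1] >= q[s][0] else 0)
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == 4 else 0.0
            # Q-learning update rule
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train_q()
# Greedy policy after training: 1 (right) in every non-terminal state
policy = [1 if q[s][1] > q[s][0] else 0 for s in range(4)]
```

The learner is never told that "right" is correct; the preference emerges from the reward signal alone.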


4.4 Common Machine Learning Paradigms and Categories

It is common practice to categorize algorithms into paradigms based on similarities in their basic assumptions about representation, performance methods and learning algorithms [30]. Langley et al. present five paradigms of ML: neural networks (including perceptron-based learning and ANNs), instance-based learning, genetic algorithms, rule induction and analytic learning [30]. Kotsiantis (2007) [24] also identifies five paradigms within supervised ML. These are

• logic-based algorithms, which include decision trees and rule-based classifiers,

• perceptron-based techniques, which include single-layered and multi-layered perceptrons (ANNs),

• statistical learning algorithms – Naive Bayes and Bayesian networks,

• instance-based learning – this includes kNN, and lastly,

• SVMs

Figure 2: Common algorithms used in ML

ML algorithms can also be grouped based on application tasks. A survey paper [31] presents the top 10 data mining algorithms. As mentioned earlier, data mining extensively employs ML techniques. In the study, the nominated ML algorithms are grouped into 10 groups: classification, statistical learning, association analysis, link mining, clustering, bagging and boosting, sequential patterns, integrated mining, rough sets and graph mining. The following list shows each grouping of the final 18 nominated algorithms; the top 10 were highlighted in the original survey. Figure 2 shows a few common ML algorithms and their respective categories and paradigms.

1. Classification – C4.5, CART, kNN, Naive Bayes

2. Statistical learning – SVM, EM

3. Association analysis – Apriori, FP–Tree

4. Link mining – PageRank, HITS

5. Clustering – K-Means, BIRCH

6. Bagging and Boosting – AdaBoost

7. Sequential Patterns – GSP, PrefixSpan

8. Integrated Mining – CBA

9. Rough Sets – Finding reduct

10. Graph Mining – gSpan

5 Essential Steps in Machine Learning

Kotsiantis (2007) and Harrington (2012) provide the general steps required in developing a supervised ML system [3][23]. These steps help to derive a classifier that best suits the specific problem at hand.

1. Identify and collect data – It is essential to identify the type of data that may be used in the learning system, because learning varies between data types, and data properties such as size can affect the choice of learning techniques the system requires. For example, real-valued time-series data and static nominal data might require different data pre-processing or ML techniques. Data can be collected manually or automatically

depending on the nature of its application. Data source options are endless; data can be collected by scraping a website, or by extracting data from an RSS feed or an API [23].

Figure 3: Major steps involved in designing a learning phase [3]

In the pervasive computing field, however, sensors are the usual source

of data. There are many times when we might require publicly available data for experimental purposes; the UCI ML repository is a popular repository that provides useful data for such purposes [2]. In some cases, knowledge about which fields (attributes, features) are most informative can be exploited in data collection, which gives a more refined and minimized data set for learning. Where this is not possible, the simplest method is that of "brute force", which means measuring everything available [3]. After such data collection, the data might not be directly suitable for learning because of noise or missing values, so it requires significant pre-processing [32].

2. Pre-process data – This step often includes pre-processing of the data, such as outlier removal and replacement of missing data points. Various techniques are employed for handling missing data [33] and for outlier detection; Hodge et al. (2004) present a survey of the pros and cons of contemporary techniques for outlier (noise) detection. It is essential to prepare the data in a usable format suited to the specific algorithm under consideration: while some algorithms require features in a special format, some can deal with string types, some with integer types, and some with both [23].
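A minimal pre-processing sketch along these lines, assuming numeric columns, mean imputation for missing values, and a simple z-score rule for outliers (the data rows and the threshold of 2 standard deviations are invented for illustration):

```python
def preprocess(rows, z_thresh=2.0):
    """Replace missing values (None) with the column mean, then drop rows
    containing values more than z_thresh standard deviations from the mean."""
    cols = list(zip(*rows))
    means, stds = [], []
    for col in cols:
        present = [v for v in col if v is not None]
        m = sum(present) / len(present)
        sd = (sum((v - m) ** 2 for v in present) / len(present)) ** 0.5
        means.append(m)
        stds.append(sd)
    # Imputation: fill each missing value with its column mean
    filled = [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]
    # Outlier removal: keep only rows within z_thresh standard deviations everywhere
    cleaned = [row for row in filled
               if all(sd == 0 or abs(v - m) <= z_thresh * sd
                      for v, m, sd in zip(row, means, stds))]
    return cleaned

data = [[1.0, 2.0], [1.1, None], [0.9, 2.1],
        [1.05, 1.9], [0.95, 2.05], [50.0, 2.0]]  # last row is an outlier
clean = preprocess(data)
```

Real pipelines would use more robust techniques (the survey by Hodge et al. covers many), but the shape of the step is the same: impute, then filter.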

3. Definition of the training set (and the test set) – The training set is the heart of supervised learning, and it is important to use an appropriate training set in order to achieve an accurate prediction model. In this step, the input data are represented in the form of features that map to an observed output class or label. It is common to have input data that require some form of transformation or feature extraction in order to increase the accuracy of the target classifier. A standard practice is to separate a subset of the collected data as test data; the test data comes in handy in the evaluation step.

4. Selection of the algorithm – Since there are various classification techniques, and prediction accuracy depends on the algorithm selected, it becomes necessary to choose the most suitable technique. This requires some examination of the data properties and their suitability for the respective classifiers. It is also important to consider system resources and algorithm requirements. Figure 3 shows a loop that allows preliminary tests of algorithms and evaluation of their performance. This flow makes it easier to select a satisfactory algorithm while considering the necessary factors.

How does one choose the best algorithm for a data set? In general, there are a number of factors that need to be considered in algorithm selection. The crucial ones are prediction accuracy, speed, interpretability (credibility versus comprehensibility) and simplicity. Prediction accuracy relates to how well an algorithm learns from a given data set. Speed relates to the time taken to learn from a given data set as well as the time it takes to predict unknown instances (i.e. the speed of applying the generated knowledge model). Interpretability of a generated model may be desired in situations where understanding the learning process is essential. The two general perspectives on the interpretability of a hypothesis are the black-box and white-box models: white-box models are generally easy to understand (e.g. decision trees), while black-box models are difficult to understand (e.g. Support Vector Machines (SVMs)). Credibility versus comprehensibility relates to the trade-off between generating an understandable hypothesis (model) and a complex model with possibly better accuracy. In general, rule-based learning algorithms produce models that are easier to understand than statistical ones [34]. Simplicity relates to the amount of fiddling or parameter adjustment needed by the algorithm: while some algorithms require little or no parameter adjustment, others require an appreciable amount of parameter tweaking.

Of these factors, prediction accuracy is the most important. A learning system with decent speed, acceptable interpretability and simplicity but poor prediction accuracy may still be considered a poor learning system. Hence, it is common to make trade-offs between prediction accuracy and the other factors, as depicted in Figure 4.

Figure 4: Main evaluation factors and tradeoffs

Other factors that may be taken into consideration when choosing a learning algorithm include sensitivity to outliers, ability to handle missing values, ability to handle non-vector data, ability to handle class imbalance, efficacy in high dimensions, support for incremental learning, and extensions from Independent and Identically Distributed (IID) data to dependent data (e.g. time series). It is also useful to consider the nature of the features. Statistical learning methods such as SVMs and ANNs often perform better on multi-dimensional and continuous attributes, whereas rule-based systems such as decision trees are more likely to perform well on discrete (categorical) attributes [34]. The ability to handle class imbalance (i.e. the ratio of classes in the training data) might also be considered when choosing an algorithm. The division of the training data plays a crucial role in evaluating the performance of an algorithm. If the True Positive (TP) and True Negative (TN) (see Section 9) instances are almost equal in size, algorithms tend to construct classifier models with better performance. On the contrary, if the number of TP instances is extremely small compared to the number of TN instances, the classifier tends to overfit the positive instances and hence perform badly during the validation stage [34].

5. Execute the training process – This step involves the actual learning process, in which the selected algorithm creates a model from the training examples defined in step 3. The generated output can be stored in a readily usable format to be further used as the knowledge of the learning system.

6. Perform evaluation with the test set – One advantage of supervised learning is that its accuracy can be easily determined. Evaluation determines the performance of a supervised learning algorithm whenever it is necessary to verify the accuracy of a learned model. It is common practice to use a separate data set, the test set (defined in step 3), to evaluate the generated classifier model. Section 9 shows common measures used for evaluating ML algorithms. Figure 3 also shows a step for parameter tuning; this is applicable to algorithms that have parameters that can be tuned to adjust their performance, e.g. SVM and kNN.
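Steps 3, 5 and 6 above can be sketched end-to-end as follows. The data set, the 25% test split and the trivial majority-class baseline standing in for the selected algorithm of step 4 are all invented for illustration:

```python
import random
from collections import Counter

def split(data, test_ratio=0.25, seed=0):
    """Step 3: hold out a subset of the collected data as the test set."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

def train_majority(train_set):
    """Step 5: 'train' a trivial baseline that always predicts the most common label."""
    majority = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda features: majority

def accuracy(model, test_set):
    """Step 6: fraction of held-out test instances the model labels correctly."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

data = [([i], "none") for i in range(15)] + [([i], "soft") for i in range(5)]
train_set, test_set = split(data)
model = train_majority(train_set)
acc = accuracy(model, test_set)
```

Because the test set is never seen during training, the accuracy it yields is an honest estimate of generalization, which is the whole point of the split in step 3.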

These steps clearly associate and work best with supervised ML, since they focus on learning from a training set. Mannila [25], however, identifies five main steps that should be taken in knowledge discovery from data (i.e. unsupervised ML tasks). These are:

1. understanding the domain,

2. preparing the data set,

3. discovering patterns (data mining),

4. post-processing of discovered patterns, and

5. putting the results into use.


Figure 5: Three stages of a machine learning research program. Current publishing incentives are highly biased towards the middle row only [35]

5.1 Impacting Real World Outside Machine Learning

As stated earlier, our research focus cuts across ML and pervasive computing. Therefore, it is beneficial to identify the challenges that could limit the application of ML in the pervasive environment, which is a different field from ML.

The steps highlighted in Figure 3 are the main aspects involved in conducting ML research. However, Wagstaff (2012) [35] presents vital stages that should be considered when applying ML in a real-world domain (see Figure 5). These stages ensure research that maximizes the impact of ML on the applied domain. According to Wagstaff, it is easy to run an existing implementation of an algorithm on a data set downloaded from the Internet. It is far harder to identify a problem for which ML may offer a solution, determine what data should be collected, choose or extract relevant features, select a suitable learning method, select an evaluation method, interpret the results, involve domain experts, publicize the results to the relevant scientific community, and persuade users to adopt the technique. In order to conduct research that makes a difference in the applied domain, it is essential to follow through all of these steps, as each one is a necessary component of any research program that aims to have a real impact on the world outside ML. Wagstaff identifies a lack of follow-through as a limitation in ML research, because many researchers focus only on the middle row (i.e. the "machine learning" contribution) of Figure 5. While these contributions are essential, it is important to place an equal amount of focus on the upper and lower rows of Figure 5. In addition to lack of follow-through, hyper-focus on benchmark data sets and on abstract metrics are two practices in ML that limit its impact on the larger world [35].


6 Supervised Learning

In this section we take a closer look at supervised ML and discuss some common algorithms used for supervised learning tasks.

6.1 Bayesian Classification

Bayesian algorithms are statistical learning algorithms based on Bayes' theorem. Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities from known values. The simplest algorithm in this paradigm is the Naive Bayes algorithm, which assumes that the effect of the value of an attribute on the class attribute is independent of the values of the other attributes, given the value of the class attribute. This is commonly termed conditional independence [36]. Bayesian networks are another type of Bayesian learning algorithm. They are appealing because they avoid the overly restrictive conditional independence assumption of Naive Bayes by modelling the probabilistic relationships between the variables of interest. Naive Bayes works well with a small amount of data, handles multiple classes and works with nominal values; it is, however, sensitive to how the input data are prepared [23]. The Naive Bayes model is sometimes called a Bayesian classifier and in practice can work surprisingly well [27]. The Naive Bayes algorithm provides a practical learning method alongside other widely used algorithms such as decision trees and k-nearest neighbor [36] [5] [37] [12]. It has been used successfully in diagnosis and in text document classification, and requires relatively little data to give satisfactory results in most cases [38] [39] [40].

6.1.1 Bayes Theorem

Let x be a data instance described by measurements made on n features. In Bayesian terms, x is considered "evidence". Let wi be the hypothesis that the data instance x belongs to a specified class wi from a set of c classes, wi, i = 1, 2, ..., c. For classification tasks, we are interested in the value of P(wi|x), i.e. the probability that the hypothesis wi holds given the observed data instance x (i.e. the probability that x belongs to class wi, given that we know the feature description of x). The values of P(wi), P(x|wi) and P(x) can be calculated from a given data set, while Bayes' theorem can be used to


calculate the posterior probability P(wi|x). Bayes' theorem states that

P(wi|x) = P(x|wi) P(wi) / P(x)

where

P(x) = Σ_{i=1}^{c} P(x|wi) P(wi)

where P (wi|x) is the posterior probability (a posteriori probability) of class wi given the

value of x.

For example, suppose a data x (representing a patient) is an instance from a data set

similar to Table 1. Let x be described by Age and Prescription features only. Let the

values of the features of x be ’young’ and ’myope’ respectively. Suppose that wsoft is

the hypothesis that the patient, x, will require a ’Soft’ lens. Then P (wsoft|x) reflects the

probability that patient x requires ’Soft’ given that we know the age and prescription.

The prior probability of P (wsoft), is the probability that any patient will require ’Soft’

lenses, regardless of their age and prescription, or any other information. P (wsoft|x) is

based on more information (e.g. age and prescription) of x than the prior probability,

P(wsoft), which is independent of x. Similarly, P(x|wsoft) is the likelihood of x conditioned on wsoft (i.e. the probability that a patient from our examples is young and has

myope prescription, given that we know that the patient requires ’Soft’ lenses). P (x) is

the prior probability of x, (i.e. the probability that a patient from our system is ’young’

and has ’myope’ prescription).

6.1.2 Naive Bayes classifier

This is one of the simplest density estimation methods, from which a standard classification method in ML is formed. It has properties that make it widely adopted for various purposes: it is easy to implement, fast to train and use as a classifier, and deals with missing attributes easily. There are many data sets for which Naive Bayes does not do well, because attributes are treated as though they were independent given the class; also, the addition of redundant attributes skews the learning process. For numeric data, a probability distribution function such as the Gaussian (normal) distribution can be assumed over the data set. The only difference between nominal and numeric attributes is that, instead of normalizing counts into probabilities as in the former case, we calculate the mean and the standard deviation for each class and each numeric attribute. The probability density function can then be used to find the probability of a value, x, given the mean and standard deviation. For cases where the distribution in the data set is not normal, the density function can be replaced by an appropriate standard estimate for the distribution. Another solution is to use kernel density estimation, which does not assume any distribution for the attribute values [3]. A further possibility for numeric data is to simply discretize the data first. We shall give two practical examples of Naive Bayes classification, on discrete and on continuous data types. General discussions of the methods and merits of Naive Bayes are presented in [41][42].

Example with discrete data type

This example is based on the data set in Table 1. There are three possible classes – none, soft, hard – and four attributes – age, prescription, astigmatic, tear-rate. Since the lenses data set [2] contains 24 instances, we shall use the first 23 as the training set and the last instance as the test data. Our task is therefore to classify a new instance, x, with the following attribute values – age: presbyopic, prescription: hypermetrope, astigmatic: yes, tear-rate: normal.

First we need to calculate the prior probabilities, P(wi), for the classes. This is simply the ratio of each class to the total training set.

P (class = none) = 14/23

P (class = soft) = 5/23

P (class = hard) = 4/23

Next, we derive the class-conditional probabilities of x for all three classes, i.e. P(x|class) for class = none, soft, hard.

For attribute age

P (age = presbyopic|class = none) = 5/8

P (age = presbyopic|class = soft) = 2/8

P (age = presbyopic|class = hard) = 1/8


For attribute prescription

P (prescription = hypermetrope|class = none) = 4/7

P (prescription = hypermetrope|class = soft) = 2/7

P (prescription = hypermetrope|class = hard) = 1/7

For attribute astigmatic

P (astigmatic = yes|class = none) = 7/11

P (astigmatic = yes|class = soft) = 0/11

P (astigmatic = yes|class = hard) = 4/11

For attribute tear-rate

P (tear − rate = normal|class = none) = 2/11

P (tear − rate = normal|class = soft) = 5/11

P (tear − rate = normal|class = hard) = 4/11

Hence,

P(x|none) = P(age = presbyopic|none) × P(prescription = hypermetrope|none)
            × P(astigmatic = yes|none) × P(tear-rate = normal|none)
          = 5/8 × 4/7 × 7/11 × 2/11
          = 0.04132    (1)

Note that we have a zero probability for P(astigmatic = yes|soft), since class soft has no instance with attribute value astigmatic = yes in the training data set. This is problematic, since it wipes out the information in the other probabilities of P(x|soft). A solution to the zero-count problem is to apply the Laplace correction [43]: we assume that our training set is large enough that adding one to each count makes a negligible difference to the estimated probabilities, yet avoids the case of zero probabilities.

By default, conditional probabilities are derived using

P(Xi = xk|wj) = Nijk / Nj ,


and conditional probabilities with the Laplace correction are derived instead using

P(Xi = xk|wj) = (Nijk + 1) / (Nj + k),

where Nijk is the number of instances in the data set such that Xi = xk and class = wj, Nj is the number of instances in the data set with class = wj, and k is the number of possible values of Xi.

To cancel the effect of the zero probability, we apply the Laplace correction to the probabilities for class = soft, i.e. 2/8, 2/7, 0/11, 5/11. These become (2+1)/(8+4), (2+1)/(7+4), (0+1)/(11+4), (5+1)/(11+4) respectively.

P(x|soft) = P(age = presbyopic|soft) × P(prescription = hypermetrope|soft)
            × P(astigmatic = yes|soft) × P(tear-rate = normal|soft)
          = (2+1)/(8+4) × (2+1)/(7+4) × (0+1)/(11+4) × (5+1)/(11+4)
          = 3/12 × 3/11 × 1/15 × 6/15
          = 0.001818    (2)

P(x|hard) = P(age = presbyopic|hard) × P(prescription = hypermetrope|hard)
            × P(astigmatic = yes|hard) × P(tear-rate = normal|hard)
          = 1/8 × 1/7 × 4/11 × 4/11
          = 0.00236    (3)

Finally, we find the posterior probabilities in order to classify the instance.

P(none|x) = P(x|none) × P(none) = 0.04132 × 14/23 = 0.025151
P(soft|x) = P(x|soft) × P(soft) = 0.001818 × 5/23 = 0.000395
P(hard|x) = P(x|hard) × P(hard) = 0.00236 × 4/23 = 0.000410

According to Bayes' rule, x is assigned to the class wi if

P(wi|x) > P(wj|x), ∀ j ≠ i


Therefore, since P(none|x) > P(hard|x) > P(soft|x), we classify x as none.
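The discrete worked example above can be reproduced in a few lines of Python. The fractions below are copied from the example, with the Laplace-corrected values used for the soft class:

```python
# Class priors from the 23-instance training set
priors = {"none": 14 / 23, "soft": 5 / 23, "hard": 4 / 23}

# Conditional probabilities for the test instance's attribute values
# (age, prescription, astigmatic, tear-rate), taken from the worked example;
# soft uses the Laplace-corrected counts
likelihoods = {
    "none": [5 / 8, 4 / 7, 7 / 11, 2 / 11],
    "soft": [(2 + 1) / (8 + 4), (2 + 1) / (7 + 4),
             (0 + 1) / (11 + 4), (5 + 1) / (11 + 4)],
    "hard": [1 / 8, 1 / 7, 4 / 11, 4 / 11],
}

def naive_bayes_score(cls):
    """Unnormalized posterior: the prior times the product of the likelihoods."""
    score = priors[cls]
    for p in likelihoods[cls]:
        score *= p
    return score

scores = {cls: naive_bayes_score(cls) for cls in priors}
prediction = max(scores, key=scores.get)  # class with the largest posterior
```

Note that P(x) is omitted: it is the same for every class, so the argmax is unaffected.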

Example with continuous data type

When dealing with continuous values, we can assume that the values follow a Gaussian distribution (probability density function) with mean µ and standard deviation σ, defined by

g(x, µ, σ) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))    (4)

so that P(x|wi) = g(x, µ, σ).

Consider the data set given in Figure 6. It shows the readings of three accelerometer sensors – sens A, sens B, sens C – which monitor the state of an object. The data were collected for two different classes (states) of the object – equilibrium and motion. The task is to predict the state of the object given the following sensor readings: x = (sens A = 1.40, sens B = 0.80, sens C = 1.30).

Figure 6: Sensor-readings training data set

From the training set we calculate the prior probabilities, P(Equilibrium) = 6/10 and P(Motion) = 4/10, and the mean µ and variance σ² given in Figure 7.

Assuming a Gaussian distribution in the training set, and using Eqn (4) with the given means and variances, we can calculate the probability P(x|wi) for both classes:

P(sens A = 1.40|Equilibrium) = 3.67e−101
P(sens B = 0.80|Equilibrium) = 0.166
P(sens C = 1.30|Equilibrium) = 5.684e−162


Figure 7: Calculated mean and variance

The posterior P(Equilibrium|x) = P(x|Equilibrium) × P(Equilibrium) = 2.076e−263.

For the motion class:

P(sens A = 1.40|Motion) = 0.6142
P(sens B = 0.80|Motion) = 0.486
P(sens C = 1.30|Motion) = 1.222

Therefore, P(x|Motion) = 0.6142 × 0.486 × 1.222 = 0.3647, and the posterior P(Motion|x) = P(x|Motion) × P(Motion) = 0.1459.

Since P(Motion|x) > P(Equilibrium|x), we predict the state of the object given x as motion.
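Eqn (4) and the class-conditional product used in this example can be sketched directly. The sensor readings and the priors 6/10 and 4/10 come from the example, but since Figures 6 and 7 are not reproduced here, the per-class means and variances below are hypothetical stand-ins:

```python
import math

def gaussian(x, mu, var):
    """Probability density of x under a Gaussian with mean mu and variance var (Eqn 4)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_score(readings, params, prior):
    """Unnormalized posterior: the prior times the product of per-sensor densities."""
    score = prior
    for x, (mu, var) in zip(readings, params):
        score *= gaussian(x, mu, var)
    return score

# Hypothetical per-class (mean, variance) for sens A, sens B, sens C
motion_params = [(1.5, 0.04), (0.9, 0.09), (1.2, 0.05)]
equilibrium_params = [(0.1, 0.01), (0.8, 0.09), (0.1, 0.01)]

x = (1.40, 0.80, 1.30)
p_motion = class_score(x, motion_params, prior=4 / 10)
p_equilibrium = class_score(x, equilibrium_params, prior=6 / 10)
prediction = "motion" if p_motion > p_equilibrium else "equilibrium"
```

Because the densities for a class far from the observed readings shrink super-exponentially (cf. the e−101 and e−162 factors above), the comparison is usually decisive; real implementations sum log-densities instead to avoid underflow.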

6.2 Instance-Based Learners

Instance-based learning is a category of learning algorithms under the header of statistical learning [3]. It represents knowledge in terms of cases or experiences, and it uses flexible matching methods to retrieve these known cases and apply them to new situations. In contrast to most paradigms, which construct a general, explicit description of the target model when the training set is given, instance-based learning techniques merely store the training examples. For every new instance, the relationship with the previously stored examples is analyzed in order to assign a target function value to the new instance. Examples of instance-based learning include nearest neighbor and locally weighted regression methods. These learning techniques are termed lazy learning [24] methods because they delay the learning process until a new instance that needs to be classified or predicted is given. They require less computation time during the training phase than eager-learning algorithms (such as decision trees, neural networks and Bayes nets), but more computation time during the classification phase [3]. Aha (1997) and De Mantaras (1998) give general reviews of instance-based learners [44] [45]. A classic


example of an instance-based learner is k-Nearest Neighbor (kNN). kNN is one of the simplest and most straightforward instance-based classifiers; hence, in this study, we give a brief description of the nearest neighbor algorithm.

Figure 8 illustrates how kNN classification works. The test instance (an unclassified instance, denoted by the object with a question mark) should be classified as male or female. If k = 3, the nearest 3 neighbors (the instances within the solid circle) consist of 2 males and 1 female, so the test object is classified as male. However, if k = 5, the 5 nearest neighbors consist of 3 females and 2 males, so the test instance is classified as female.

Figure 8: Illustration of kNN classification.

The kNN algorithm is an easy and effective way to classify data. It compares a new unknown instance (i.e. a set of features without a label) with every instance in the training set (the training set is kept in memory for this purpose). It is based on the principle that the instances within a data set will generally exist in close proximity to other instances that have similar properties [46]. The comparison involves calculating the distance of the new instance from the stored instances based on their respective features. The common approach is to normalize all features and then compute the Euclidean distance of the new instance from the other instances stored in memory. There are various distance metrics; some significant ones are presented in Table 2 [3]. In practice, a given distance metric should maximize the distance between two instances of different classes and minimize the distance between two instances of the same class. Algorithm 6.1 [3] summarizes the kNN classification method. The top k most similar instances are collected (k is a non-negative integer, usually less than 20). The final action is to take a majority vote over the k most similar instances, and the majority class becomes the class of the unknown instance. With the aim of obtaining better classification accuracy, many kNN algorithms apply weighting schemes that change the distance measurements and the voting influence of each instance. Wettschereck et al. (1997) [47] present a survey of weighting schemes.

Euclidean: $D(x, y) = \left[\sum_{i=1}^{m} |x_i - y_i|^2\right]^{1/2}$

Minkowsky: $D(x, y) = \left[\sum_{i=1}^{m} |x_i - y_i|^r\right]^{1/r}$

Manhattan: $D(x, y) = \sum_{i=1}^{m} |x_i - y_i|$

Chebychev: $D(x, y) = \max_{i=1}^{m} |x_i - y_i|$

Camberra: $D(x, y) = \sum_{i=1}^{m} \frac{|x_i - y_i|}{|x_i + y_i|}$

Kendall's Rank Correlation: $D(x, y) = 1 - \frac{2}{m(m-1)} \sum_{i=1}^{m} \sum_{j=1}^{i-1} \mathrm{sign}(x_i - x_j)\,\mathrm{sign}(y_i - y_j)$

Table 2: Approaches to define the distance between two instances (x and y) [3]
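The metrics in Table 2 translate directly into code. The following sketch is our own illustrative implementation (the function names are not from the source); it computes the first five metrics for two equal-length feature vectors:

```python
import math

def euclidean(x, y):
    # [sum |xi - yi|^2]^(1/2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def minkowsky(x, y, r):
    # [sum |xi - yi|^r]^(1/r); r = 2 recovers the Euclidean distance
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1.0 / r)

def manhattan(x, y):
    # sum |xi - yi|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def chebychev(x, y):
    # max |xi - yi|
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def camberra(x, y):
    # sum |xi - yi| / |xi + yi|
    return sum(abs(xi - yi) / abs(xi + yi) for xi, yi in zip(x, y))

x, y = [1.0, 2.0, 3.0], [4.0, 2.0, 7.0]
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
print(chebychev(x, y))   # 4.0
```

Note that with r = 2 the Minkowsky distance reduces to the Euclidean distance.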

Algorithm 6.1: k-Nearest Neighbor(D = {(x_1, c_1), ..., (x_n, c_n)}, X)

    distances <- empty list
    for each instance (x_i, c_i) in D do
        d <- distance(x_i, X)
        append d to distances
    sort distances from lowest to highest
    return the k nearest instances in distances
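Algorithm 6.1 can be realized in a few lines. The sketch below is an illustrative implementation of our own, using Euclidean distance and a simple (unweighted) majority vote:

```python
import math
from collections import Counter

def knn_classify(D, X, k):
    """Classify query X by majority vote among its k nearest neighbors in D.

    D is a list of (feature_vector, class_label) pairs kept in memory,
    as in instance-based learning.
    """
    # Distance of X to every stored instance (the costly step noted in the text)
    distances = sorted((math.dist(xi, X), ci) for xi, ci in D)
    # Majority vote among the k nearest labels
    top_k = [ci for _, ci in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

D = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((6.0, 6.0), "B"), ((5.5, 6.2), "B")]
print(knn_classify(D, (1.1, 1.0), k=3))  # A
```

A weighting scheme of the kind surveyed in [47] would replace the plain majority vote with one weighted, for example, by inverse distance.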

kNN has a number of advantages that make it an attractive choice for supervised learning: it is highly accurate in many cases, insensitive to outliers, and makes no assumptions about the underlying data distribution. It works with both numeric and nominal values. A popular result by Cover and Hart [48] shows that the error of the nearest neighbor


rule is bounded above by twice the Bayes error under certain reasonable assumptions

[31]. The major drawback of instance-based learning is its high computational cost, since the algorithm must calculate a distance to every instance in the data set. It also requires substantial memory, since the full data set must be kept. Unlike approaches such as decision trees, kNN gives no insight into the underlying structure of the data. In terms of classification performance, it does not necessarily generalize well when the examples are not clustered, and it is sensitive to the choice of similarity function (distance metric) used to compare instances [49]. The choice of k also affects the performance of kNN: if k is too large, the neighborhood can contain too many instances of other classes, while an extremely small k makes the result sensitive to noise points. There is no principled way to choose k, except through computationally expensive methods such as cross-validation [49].

6.3 Decision Trees

Algorithms in this paradigm use a decision tree as a predictive model. Decision trees are commonly used for supervised classification as well as regression tasks. Recent surveys claim that they are the most commonly used technique [50]. They have been successfully applied in various works such as activity recognition [12][11], and in location prediction, for example of mobile users' locations [36][9][5].

Figure 9 shows a decision tree in the form of a flowchart; it has decision blocks (rectangles), terminating blocks (ovals) and branches in the form of arrows that lead to either a decision block or a terminating block. A decision block is a non-leaf node associated with a feature. The terminating blocks are the tree leaves that represent class labels, while branches represent conjunctions of features that lead to those labels. In a regression tree, the terminating block holds a real number, while in a classification tree it holds a nominal value belonging to a finite set of possible classes.

There are various types of decision trees in common use today. Among the most significant are the C4.5 and CART trees [31].

Figure 9: A decision tree in flowchart form

C4.5 [51] is a successor of the Concept Learning System (CLS) [52] and Iterative Dichotomiser 3 (ID3) [53] algorithms. C4.5 can either generate classifier models in the

form of decision trees or in the form of a more comprehensible rule set. A C4.5-generated model is used only for classification tasks, and C4.5 is commonly used as a benchmark against which newer supervised learning algorithms are compared. In contrast, the CART decision tree, presented in [54], is a binary decision tree that can be used for both classification and regression tasks. C4.5 and CART adopt a greedy (i.e. non-backtracking) approach that constructs trees in a top-down, recursive, divide-and-conquer manner. The approach starts with a training set of examples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is being built. The method for generating a decision tree is shown in Listing 1.

The core step in decision tree construction is the second procedure in Listing 1: growing the tree by splitting a set S of instances. How do we select the attribute test that determines how the training instances are split? This process is commonly referred to as feature (attribute) selection: a heuristic procedure for selecting the attribute that 'best' classifies the training data according to the class labels. Typically, the feature that best divides the training set becomes the root node of the tree. There are numerous measures for finding this feature, such as the Gini index used in CART [54], information gain used in ID3 [53] and gain ratio used by the C4.5 algorithm [51].

If there are k classes {C_1, C_2, ..., C_k} and a training set denoted by S, then


- If all instances in S belong to the same class C_i, the decision tree is a leaf labeled with class C_i.
- Else, if S contains a mixture of classes, choose a test based on a single attribute with two or more outcomes {O_1, O_2, ..., O_n}. Set S is divided into subsets S_1, S_2, ..., S_n, where S_i contains all instances in S with outcome O_i for the given test attribute.
- Recursively apply the procedure above to each subset {S_1, S_2, ..., S_n}.

Listing 1: Pseudo-code for building decision trees

Information gain, used in ID3 as the feature selection measure, determines how important a given attribute of a feature vector is. The feature with the highest information gain is selected as the splitting feature. The selected feature minimizes the information required to classify the instances in the resulting subsets and reflects the least randomness in those subsets. This measure favors the construction of a simple tree by minimizing the expected number of tests required to classify a given set of instances. Information gain is a measure of the change in entropy. The entropy of a set S is the amount of information required to identify or classify an instance in S; it is also defined as a measure of the uncertainty associated with a random variable. Lower entropy corresponds to an ordered, patterned variable that is easy to predict, since there is less uncertainty. Higher entropy corresponds to a variable with a high degree of randomness that is very unpredictable, since there is greater uncertainty. Entropy is given by

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where n is the number of classes in S and $p_i$ is the fraction of instances in S with output value $C_i$.

Given a set S of instances with A as one of its attributes, let $S_k$ be the subset of S with attribute A = k, and let val(A) be the set of all possible values of A. The information gain is then given by

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{k \in val(A)} \frac{|S_k|}{|S|}\,\mathrm{Entropy}(S_k)$$

where $|S|$ and $|S_k|$ denote the sizes of the respective sets.


As one of its main drawbacks, information gain is biased towards features with a large number of possible outcomes. C4.5, however, uses an extension of information gain as its feature selection measure, known as gain ratio. Gain ratio attempts to overcome the bias associated with information gain by applying a form of normalization to the information gain, using a split information value defined as

$$\mathrm{SplitInfo}_A(S) = -\sum_{j=1}^{k} \frac{|S_j|}{|S|} \log_2\!\left(\frac{|S_j|}{|S|}\right)$$

This gives the measure of information generated by dividing the training set S into k subsets using a test attribute A. The gain ratio is therefore defined as

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}_A(S)}$$

C4.5 selects the attribute that maximizes the gain ratio as the splitting attribute. If the split is near-trivial, the split information will be close to zero and the ratio unstable. A constraint added to avoid this is to select only test attributes whose information gain is large, at least as large as the average gain over all tests examined [1]. C4.5 handles both continuous and discrete attribute types. To handle a continuous attribute A, C4.5 creates a threshold h and splits the instances into {A ≤ h, A > h}.

CART uses a different attribute selection measure known as the Gini index. The Gini index measures the impurity of a set S instead of its entropy, and is given by

$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{n} p_i^2$$

where n is the number of classes and $p_i$ is the probability that an instance in S belongs to class $C_i$.
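To make these selection measures concrete, the following sketch (an illustrative toy example of our own, not from the source) computes entropy, information gain, split information, gain ratio and the Gini index for nominal attributes:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum p_i log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(S) = 1 - sum p_i^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(S, attr_index):
    # Gain(S, A) = Entropy(S) - sum |S_k|/|S| Entropy(S_k)
    gain = entropy([c for _, c in S])
    for v in set(x[attr_index] for x, _ in S):
        Sk = [c for x, c in S if x[attr_index] == v]
        gain -= len(Sk) / len(S) * entropy(Sk)
    return gain

def split_info(S, attr_index):
    # SplitInfo_A(S): the entropy formula applied to the attribute's values
    return entropy([x[attr_index] for x, _ in S])

def gain_ratio(S, attr_index):
    return info_gain(S, attr_index) / split_info(S, attr_index)

# Toy set: attribute 0 separates the classes perfectly, attribute 1 does not.
S = [(("sunny", "hot"), "yes"), (("sunny", "cold"), "yes"),
     (("rainy", "hot"), "no"), (("rainy", "cold"), "no")]
print(info_gain(S, 0))   # 1.0 (a perfect split of a 1-bit class)
print(info_gain(S, 1))   # 0.0
print(gain_ratio(S, 0))  # 1.0
```

Both ID3 (information gain) and C4.5 (gain ratio) would pick attribute 0 here; the Gini index of the class labels is 0.5, reflecting the same 50/50 impurity that the entropy of 1.0 expresses.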

A major drawback of decision tree algorithms is that they can generate highly complex decision trees that overfit the training set. Overfitting is the use of models or procedures that violate Occam's Razor (also known as the parsimony principle), which calls for using models and procedures that contain all that is necessary for the modeling but nothing more [55]. Overfitting results from using more terms, or more complicated approaches, than are necessary. A good example of overfitting is using a model that accommodates both curved and linear relationships on a data set


that perfectly fits a linear model. As a negative consequence, overfitting increases the

complexity of a model without any associated benefit in performance. It sometimes can

result in poorer performance than a simple model. In decision trees, overfitting can be

avoided with methods such as pruning. Both C4.5 and CART support pruning, which takes place after the initial tree has been generated.

Aside from the difference in attribute selection measure, another significant difference between the C4.5 and CART algorithms is the allowed number of outcomes from each split test. While C4.5 allows two or more outcomes, CART test outcomes are always binary. Also, C4.5 uses a single-pass algorithm derived from binomial confidence limits

for pruning, but CART prunes trees using a cost-complexity model whose parameters

are estimated by cross-validation [31].

In general, decision trees have many advantages that make them widely accepted for both classification and regression tasks. A decision tree model (its knowledge representation) is easy to understand and can be explained with simple Boolean logic. Decision trees also require little pre-processing, unlike techniques that need steps such as data normalization and missing-data replacement. Some ML techniques suit only numeric features while others are more appropriate for nominal features (relational rules work with nominal variables, while neural networks work only with numerical variables); decision trees, however, handle both cases. They are robust and perform quite well even when assumptions made on the training set are violated in the test set. Decision trees work well with large amounts of data while requiring only standard computing resources.

6.4 Support Vector Machines

SVM is a method for the classification of both linear and non-linear data. It uses a non-linear mapping to transform the original training data into a higher dimension and then searches for the optimal linear separating hyperplane in that higher dimension. With an appropriate non-linear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. To find the hyperplane, an SVM uses support vectors (i.e. essential training examples) and margins (the regions on either side of a hyperplane that separates two data classes, defined by the support vectors). In spite


of their extremely slow training time, SVMs have attracted considerable attention because of their high accuracy and their ability to model complex non-linear decision boundaries. Classification with a trained SVM is also computationally inexpensive: the chosen support vectors provide a compact description of the learned model. SVMs can be applied to both classification and regression [1]. The idea of SVM revolves around the notion of

margins. The shortest distance from a hyperplane to one side of its margin is equal to the shortest distance from the hyperplane to the other side of its margin, and the "sides" of the margin are parallel to the hyperplane [1]. Maximizing the margin (i.e. creating the largest possible distance between the separating hyperplane and the instances on either side of it) has been proven to reduce an upper bound on the expected generalization error [3]. Separating data with a hyperplane exposes SVMs to the risk of finding trivial solutions that overfit the training data. SVMs nevertheless tend to resist overfitting, even when the number of instances is lower than the number of attributes, by using regularization techniques, which choose a fitting function that has low error on the training set. Linear SVMs avoid overfitting through careful tuning of the regularization parameter C; the same applies to non-linear SVMs through the selection of an appropriate kernel and careful tuning of the kernel parameters.

Figure 10: SVM optimal hyperplane and maximum margin

A separating hyperplane can be written as $w \cdot x + b = 0$, where w is a weight vector, namely $w = (w_1, w_2, ..., w_n)$, n is the number of features, and b is a scalar often referred to as the bias (with $-b$ termed the threshold). If the training data is linearly separable, then a pair (w, b) exists such that

$$w^T x_i + b \geq 1, \quad \text{for all } x_i \in P$$

$$w^T x_i + b \leq -1, \quad \text{for all } x_i \in N$$

with the decision rule given by

$$f_{w,b}(x) = \mathrm{sgn}(w^T x + b)$$

where P and N denote the sets of positive and negative training instances, respectively.
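The decision rule and the margin constraints can be checked numerically. The sketch below (with a hand-chosen toy hyperplane, our own illustration) evaluates $f_{w,b}(x) = \mathrm{sgn}(w^T x + b)$ and verifies that each training instance satisfies its margin constraint:

```python
def decision(w, b, x):
    # f_{w,b}(x) = sgn(w^T x + b)
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1

def satisfies_margin(w, b, x, y):
    # A correctly separated instance satisfies y_i (w^T x_i + b) >= 1
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1

# Hand-chosen separating hyperplane w.x + b = 0 for a toy 2-D problem
w, b = [1.0, 1.0], -3.0
P = [(2.0, 2.0), (3.0, 2.0)]   # positive instances (y = +1)
N = [(0.0, 1.0), (1.0, 0.0)]   # negative instances (y = -1)

assert all(decision(w, b, x) == 1 for x in P)
assert all(decision(w, b, x) == -1 for x in N)
assert all(satisfies_margin(w, b, x, +1) for x in P)
assert all(satisfies_margin(w, b, x, -1) for x in N)
```

Here (2, 2) and the two negative points lie exactly on the margin boundaries ($w^T x + b = \pm 1$), so they would be the support vectors of this toy problem.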

When it is possible to linearly separate two classes, an optimal separating hyperplane can be found by minimizing the squared norm of the separating hyperplane. This minimization can be set up as a convex Quadratic Programming (QP) problem:

$$\min_{w,b} \Phi(w) = \frac{1}{2}\|w\|^2 \qquad (5)$$

subject to $y_i(w^T x_i + b) \geq 1,\; i = 1, ..., l$.

In the case of linearly separable data, once the optimum separating hyperplane is found,

data points that lie on its margin are called the support vector points and the solution

is represented as a linear combination of only these points while other points are ignored

[3]. Since the number of support vector points selected by the SVM learning algorithm is usually small, the model complexity is unaffected by the number of features encountered in the training data. For this reason, SVMs are well suited to learning tasks where the number of features is large with respect to the number of training instances [3].

1) Introduce positive Lagrange multipliers $\alpha_i$, one for each of the inequality constraints. This gives the Lagrangian:
   $L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{N} \alpha_i$
2) Minimize $L_P$ with respect to w, b. This is a convex quadratic programming problem.
3) In the solution, those points for which $\alpha_i > 0$ are called "support vectors".


Listing 2: General pseudo-code for SVMs [3]

The maximum-margin criterion allows the SVM to choose among different candidate hyperplanes. In cases where the SVM cannot find any separating hyperplane due to misclassified instances, the problem can be addressed by using a soft margin that accepts some misclassification of training instances [56].

For most real-world problems, data sets involve data for which no hyperplane exists that successfully separates the positive instances from the negative instances (i.e. linearly inseparable data, also called non-linearly separable data or nonlinear data for short). SVM solves the inseparability problem in two main steps. In the first step, it maps the original data onto a higher-dimensional space. Once the data has been transformed into this new, higher-dimensional space, the second step searches for a linear separating hyperplane in the new space. The higher-dimensional space is referred to as the transformed feature space, as opposed to the input space occupied by the training instances [3]. It is known that any consistent training set can be made separable with an appropriately chosen transformed feature space of a sufficient number of dimensions. Having linearly separable data in the transformed feature space once again yields a quadratic optimization problem that can be solved using the linear SVM formulation. The maximum-margin hyperplane (linear separation) found in the new, higher-dimensional space (the transformed feature space) corresponds to a non-linear separation (a non-linear hypersurface) in the original input space.

Suppose the data in the original space is mapped to some higher-dimensional space H via $\Phi : R^d \to H$. The training algorithm then depends on the data only through dot products in H, i.e. on functions of the form $\Phi(x_i) \cdot \Phi(x_j)$. Kernels are a special class of functions that allow inner products to be calculated directly in the feature space, without performing the mapping $\Phi : R^d \to H$ [57]. If there is a kernel function K such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, we only need to use K in the training algorithm and never need to determine $\Phi$ explicitly. Selecting an appropriate kernel function is vital, because the kernel function defines the transformed feature space in which the training set is classified. To find the best kernel function, it is common practice to evaluate a range of potential settings and use cross-validation over the training set to select the best one [3]. This practice adds to the main limitation of SVMs: their low training speed.


SVM training is done by solving an N-dimensional QP problem, where N is the number of examples in the training data set. Standard QP methods use large matrix operations and time-consuming numerical computations that render them impractical for large problems (computational complexity of O(n^3)). A simpler algorithm known as Sequential Minimal Optimization (SMO) is therefore useful: SMO can solve the SVM QP problem without any extra matrix storage and without using numerical QP optimization steps at all [58]. SMO is more efficient than standard QP methods, since its computation is based only on the support vectors.

6.5 Hidden Markov Model

Morgan (2011) [1] identifies complex data types in data mining; Figure 11 summarizes them. Sequence data are a type of complex data that requires unique learning methods. Examples include time-series data and symbolic sequences. A time series is a sequence of numeric data recorded at equal time intervals (e.g. temperature readings over time). A symbolic sequence consists of events or nominal data that are not observed at equal time intervals (e.g. customer shopping sequences and web click streams). Most classification methods generate a model from feature vectors. Sequence classification, however, is a challenging task, since the sequential nature of the features is difficult to capture. ML sequence classification can be achieved using: (1) feature-based classification, which transforms a sequence into a feature vector and then applies conventional classification methods [1] (e.g. [9] [11]); (2) sequence distance-based classification, where the distance function that measures the similarity between sequences largely determines the quality of the classification [1]; (3) model-based classification, which uses statistical models such as HMMs to classify sequences; and (4) the most recently proposed method, shapelets, which achieves quality classification results by using the time-series sub-sequences that maximally represent a class as the features [1]. This section discusses a model-based classification method known as the Hidden Markov Model (HMM).

Most real-world processes generate observable outputs that can be described as signals. A signal can take many forms, such as discrete or continuous, and can be stationary or non-stationary (its properties changing over time). A common problem of interest is characterizing such signals in terms of signal models. Signal models have been used to realize


Figure 11: Complex data types for mining [1]

important practical systems (e.g. prediction, recognition and identification systems) in a highly efficient manner [59]. There are two broad classes of signal models: deterministic models and statistical models [59]. Deterministic models usually exploit known specific properties of the signal (e.g. the amplitude, frequency and phase of a sine wave, or the amplitudes and rates of exponentials). Statistical signal models instead try to characterize only the statistical properties of the signal. Gaussian processes, Poisson processes, Markov processes and hidden Markov processes are all examples of statistical signal models. The basic assumption of a statistical model is that the signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be estimated in a precise, well-defined manner [59].

HMM is a statistical (stochastic) signal model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. To begin, we review the Markov process (Markov chain) and then extend the ideas to the class of HMMs.

Page 43: Machine Learning in Pervasive Computing - DiVA portalltu.diva-portal.org/smash/get/diva2:997349/FULLTEXT01.pdf · Pervasive and Mobile Computing Department of Computer Science, Electrical

Section 6 - Supervised Learning 36

6.5.1 Discrete Markov Model

Let us consider a system that at any time may be described as being in one of a set D of n distinct states, D = {X_1, X_2, ..., X_n}. For such a system, the Markov assumption states that the future is independent of the past, given the present. More generally, X_t may depend on X_{t-1}, X_{t-2}, ..., X_{t-m} for a fixed m:

$$P(X_t | X_1, ..., X_{t-1}) = P(X_t | X_{t-1}, X_{t-2}, ..., X_{t-m}) \qquad (6)$$

Discrete random variables $X_1, ..., X_n$ form a discrete-time Markov chain if they respect the graphical model in Figure 12.

Figure 12: A simple Markov chain

From the Markov assumption, Eqn (6), a first-order Markov model (m = 1) can be written as

$$P(X_t | X_1, ..., X_{t-1}) = P(X_t | X_{t-1}) \qquad (7)$$
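Under the first-order assumption of Eqn (7), the probability of a whole state sequence factorizes as $P(X_1)\prod_t P(X_t | X_{t-1})$. The sketch below (with an assumed two-state chain of our own choosing) illustrates the computation:

```python
def chain_probability(pi, A, sequence):
    """P(X1, ..., XT) = pi[X1] * prod A[X_{t-1}][X_t] for a first-order chain."""
    p = pi[sequence[0]]
    for prev, cur in zip(sequence, sequence[1:]):
        p *= A[prev][cur]
    return p

# Assumed example chain: two states with initial and transition probabilities
pi = {"rain": 0.5, "dry": 0.5}
A = {"rain": {"rain": 0.7, "dry": 0.3},
     "dry": {"rain": 0.2, "dry": 0.8}}

print(chain_probability(pi, A, ["dry", "dry", "rain"]))  # 0.5 * 0.8 * 0.2 = 0.08
```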

Figure 13 and Eqn (8), where m = 2, both represent the second-order Markov model:

$$P(X_t | X_1, ..., X_{t-1}) = P(X_t | X_{t-1}, X_{t-2}) \qquad (8)$$

Figure 13: Second order Markov model

The limitation of a Markov chain is that it assumes the true state of the system is fully observed. However, in real-world scenarios it is difficult to perfectly observe and represent the true state of a system (partially observable systems); one reason for this could be noise. HMM addresses this limitation by assuming there is hidden information that represents the system. This information is commonly known as latent or hidden


variable(s). Figure 14 shows a partially observable system with its observable variables

and latent variables.

Figure 14: Partially observable system

6.5.2 Hidden Markov Model

The Hidden Markov Model results from extending the Markov model: it is a doubly embedded stochastic process with an underlying stochastic process that is not observable (it is hidden) but can only be observed through another stochastic process that produces the sequence of observations. An HMM is characterized by the following elements:

1. N, the number of states in the model.

2. T, the length of the observation sequence.

3. A, the state transition probability distribution (transition probability matrix), $A = \{a_{ij}\}$, where $a_{ij} = P(\text{state } X_j \mid \text{state } X_i)$.

4. B, the observation (emission) symbol probability distribution, $B = \{b_j(k_t)\}$, where $b_j(k_t) = P(\text{observation } k \text{ at time } t \mid \text{state } X_j)$.

5. $\pi$, the vector of initial state probabilities.


Given the values of the above elements N, T, A, B, $\pi$, the HMM can be used as a generator that produces an observation sequence $O = O_1, O_2, ..., O_T$, where each $O_t$ is one of the observation symbols. The complete parameter set of an HMM is denoted by $\lambda = (A, B, \pi)$.
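Used as a generator, $\lambda = (A, B, \pi)$ produces an observation sequence by sampling an initial state from $\pi$ and then alternately sampling an emission from B and a transition from A. The sketch below (with assumed illustrative parameters, not from the source) shows this:

```python
import random

def generate(pi, A, B, T, rng):
    """Use an HMM lambda = (A, B, pi) as a generator of T observations."""
    states, observations = list(pi), []
    # Draw the initial state from pi, then alternate emission and transition
    state = rng.choices(states, weights=[pi[s] for s in states])[0]
    for _ in range(T):
        symbols = list(B[state])
        observations.append(
            rng.choices(symbols, weights=[B[state][o] for o in symbols])[0])
        state = rng.choices(states, weights=[A[state][s] for s in states])[0]
    return observations

# Assumed illustrative parameters for a two-state, two-symbol HMM
pi = {"S1": 0.6, "S2": 0.4}
A = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
B = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

O = generate(pi, A, B, T=5, rng=random.Random(0))
print(O)  # a length-5 sequence of symbols drawn from {"a", "b"}
```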

There are three fundamental HMM problems: the evaluation problem, the decoding problem and the learning problem [59] [60]. In the evaluation problem, given an HMM $\lambda = (A, B, \pi)$ and an observation sequence O, the task is to calculate $P(O|\lambda)$, i.e. the probability that $\lambda$ has generated O. In the decoding problem, given the HMM $\lambda = (A, B, \pi)$ and the observation sequence O, the task is to calculate the most likely sequence of hidden states that produced the observation sequence. In the learning problem, given some training observation sequence O and the general structure of the HMM (i.e. the number of hidden states and the visible observation symbols), the task is to determine the HMM parameters $\lambda = (A, B, \pi)$ that best fit the observed data, i.e. the $\lambda$ that maximizes $P(O|\lambda)$.

For illustration purposes, let us consider a weather system with two states, Snowy and Sunny. Assume the states of the system cannot be directly observed; instead, we can make three observations, walk, shop and ski, which depend on the state of the system. Let the initial probability be $\pi = [0.6, 0.4]$ for Snowy and Sunny respectively, with transition probability matrix

A =
            Snowy   Sunny
    Snowy    0.7     0.3
    Sunny    0.4     0.6

and observation probability matrix

B =
            walk    shop    ski
    Snowy    0.1     0.4    0.5
    Sunny    0.7     0.2    0.1

As can be noted, the given matrices are row stochastic. Assume we observe the sequence (walk, shop, walk, ski); the observation sequence over time $t_i$ (i = 1, 2, 3, 4) is thus O = [walk, shop, walk, ski]. How do we find the most likely true state sequence of the weather system? In this problem we have an HMM $\lambda = (A, B, \pi)$ and a sequence of observations O. Computing the likelihood of the observed sequence O being generated by the HMM is an instance of the evaluation problem, while finding an optimal state sequence for the underlying Markov process is an instance of the decoding problem. We can find the most likely state of the system by finding the probability of


all possible state sequences of length 4. Table 3 lists all $2^4 = 16$ possible state sequences, their probabilities and a normalized probability column that sums to 1.

If $\pi_{X_0}$ is the probability of starting in state $X_0$ (i.e. the initial probability), $b_{X_0}(O_0)$ is the probability of observing $O_0$ in state $X_0$, and $a_{X_0,X_1}$ is the probability of transiting from state $X_0$ to state $X_1$, then the probability of a state sequence X is given by

$$P(X) = \pi_{X_0}\, b_{X_0}(O_0)\, a_{X_0,X_1}\, b_{X_1}(O_1) \cdots a_{X_{T-1},X_T}\, b_{X_T}(O_T)$$

For the given observation sequence (walk, shop, walk, ski) we can compute, say, P(Snowy, Snowy, Sunny, Sunny) = 0.6(0.1) × 0.7(0.4) × 0.3(0.7) × 0.6(0.1) ≈ 0.000212.
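The brute-force evaluation behind Table 3 can be reproduced directly: enumerate all $2^4$ state sequences, apply the product formula for P(X) above, sum to obtain $P(O|\lambda)$, and normalize. The sketch below is our own illustration using the example's parameters:

```python
from itertools import product

pi = {"Snowy": 0.6, "Sunny": 0.4}
A = {"Snowy": {"Snowy": 0.7, "Sunny": 0.3},
     "Sunny": {"Snowy": 0.4, "Sunny": 0.6}}
B = {"Snowy": {"walk": 0.1, "shop": 0.4, "ski": 0.5},
     "Sunny": {"walk": 0.7, "shop": 0.2, "ski": 0.1}}
O = ["walk", "shop", "walk", "ski"]

def joint(states):
    # P(X, O) = pi(X0) b_{X0}(O0) * prod a_{X(t-1),Xt} b_{Xt}(Ot)
    p = pi[states[0]] * B[states[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][O[t]]
    return p

table = {seq: joint(seq) for seq in product(["Snowy", "Sunny"], repeat=len(O))}
total = sum(table.values())                    # P(O): the evaluation problem
normalized = {seq: p / total for seq, p in table.items()}

print(round(table[("Snowy",) * 4], 6))         # 0.000412, as in Table 3
# Per-position marginals, as in Table 4
for t in range(len(O)):
    p_sunny = sum(p for seq, p in normalized.items() if seq[t] == "Sunny")
    print(t, round(p_sunny, 4))
```

Picking the argmax of `table` recovers the DP-sense sequence, while picking the larger marginal at each position recovers the HMM-sense sequence discussed below.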

State Sequence Probability Normalized probability

1 Snowy, Snowy, Snowy, Snowy 0.000412 0.042787

2 Snowy, Snowy, Snowy, Sunny 0.000035 0.003635

3 Snowy, Snowy, Sunny, Snowy 0.000706 0.073320

4 Snowy, Snowy, Sunny, Sunny 0.000212 0.022017

5 Snowy, Sunny, Snowy, Snowy 0.000050 0.005193

6 Snowy, Sunny, Snowy, Sunny 0.000004 0.000415

7 Snowy, Sunny, Sunny, Snowy 0.000302 0.031364

8 Snowy, Sunny, Sunny, Sunny 0.000091 0.009451

9 Sunny, Snowy, Snowy, Snowy 0.001098 0.114031

10 Sunny, Snowy, Snowy, Sunny 0.000094 0.009762

11 Sunny, Snowy, Sunny, Snowy 0.001882 0.195451

12 Sunny, Snowy, Sunny, Sunny 0.000564 0.058573

13 Sunny, Sunny, Snowy, Snowy 0.000470 0.048811

14 Sunny, Sunny, Snowy, Sunny 0.000040 0.004154

15 Sunny, Sunny, Sunny, Snowy 0.002822 0.293073

16 Sunny, Sunny, Sunny, Sunny 0.000847 0.087963

Table 3: State sequence probabilities

To find the optimal state sequence in the Dynamic Programming (DP) sense, we simply choose the sequence with the highest probability [60], i.e. (Sunny, Sunny, Sunny, Snowy). In the HMM sense, we instead choose the most probable state at each position [60]. To achieve this, we sum the normalized probabilities in Table 3 that have Snowy in the first position, and likewise those that have Sunny in the first position. This gives 0.18817 and 0.81183 respectively, so the HMM chooses Sunny as the first state of the optimal state sequence. We repeat this for each position of the sequence, obtaining the results shown in Table 4. From


Table 4 we find that the most probable sequence in the sense of HMM is Sunny, Snowy,

Sunny, Snowy.

             first     second    third     fourth
P(Snowy)     0.188182  0.519576  0.228788  0.804029
P(Sunny)     0.811818  0.480424  0.771212  0.195971

Table 4: HMM probabilities

6.6 Comparing Supervised Learning Algorithms

In this section, we discuss the comparison of supervised learning algorithms presented in [3]. The comparisons are based on accuracy, learning rate, classification speed, tolerance to missing values, tolerance to redundant attributes and other factors. Note that a detailed discussion of all the pros and cons of each algorithm, as well as their empirical comparison, depends on the learning task under consideration. Table 5 summarizes this comparison.

In general, SVM tends to attain better accuracy than other supervised learning methods, especially when dealing with multiple dimensions and continuous features. For discrete features, logic-based systems such as decision trees seem to perform better. SVM usually requires a large sample size to achieve its maximum prediction accuracy, while Naive Bayes requires a relatively small data set. When considering classification speed, i.e. using a generated model for classification, kNN takes considerably longer, because it is a lazy learner: all computation over the training instances is deferred until a classification is actually requested.

Algorithms such as kNN require complete instances in a data set to perform accurately; hence, kNN is not robust towards missing values. Naive Bayes is naturally robust to missing values, since missing values are simply not considered when computing the probabilities and therefore have no impact on the final result.

Irrelevant features are often present in a data set. kNN is particularly sensitive to irrelevant attributes because of its distance-based approach to classification. On the contrary, SVM is insensitive to the number of dimensions and has a high tolerance to irrelevant features.


                                        Decision  Naive
                                        Trees     Bayes    kNN      SVM      Rule-learners

Accuracy in general                     **        *        **       ****     **
Speed of learning with respect to
number of attributes and instances      ***       ****     ****     *        **
Speed of classification                 ****      ****     *        ****     ****
Tolerance to missing values             ***       ****     *        **       **
Tolerance to irrelevant attributes      ***       **       **       ****     **
Tolerance to redundant attributes       **        *        **       ***      **
Tolerance to highly interdependent
attributes (e.g. parity problems)       **        *        *        ***      **
Dealing with discrete/binary/
continuous attributes                   ***       *** (a)  *** (b)  ** (c)   *** (d)
Tolerance to noise                      **        ***      *        **       *
Dealing with danger of overfitting      **        ***      ***      **       **
Attempts for incremental learning       **        ****     ****     **       *
Explanation ability/transparency of
knowledge/classification                ****      ****     **       *        ****
Model parameter handling                ***       ****     ***      *        ***

(a) not continuous  (b) not directly discrete  (c) not discrete  (d) not directly continuous

Table 5: Comparing learning algorithms (**** represents the best and * the worst) [3]

The term bias measures the contribution to the error of the central tendency of the classifier when trained on different data, while the term variance measures the contribution to the error of deviations from that central tendency. Learning algorithms with a high-bias profile usually generate simpler, highly constrained models that are quite insensitive to data fluctuations, so their variance is low. Algorithms with a high-variance profile usually generate complex models that fit data variations more readily. An example of a high-bias learner is Naive Bayes, while decision trees and SVMs are examples of high-variance learners. Overfitting is a common problem with high-variance model classes, but algorithms such as SVM avoid overfitting by adopting techniques like regularization.

kNN is considered intolerant of noise because outliers can easily distort its similarity calculation and hence lead to misclassification. On the contrary, decision tree learners are resistant to noise because they employ pruning strategies that avoid overfitting the data.

Naive Bayes requires little storage space during both the training and classification stages: it only needs the memory required to store the prior and conditional probabilities. The kNN technique requires large storage space for training, and its execution space is at least as large as its training space. The execution space of non-lazy learners is usually much smaller than their training space, since the output model is often a highly condensed summary of the data set.

Online learning is a learning approach that allows incremental learning. Naive Bayes and kNN can easily be used as incremental learners, while other methods require greater effort to handle incremental learning tasks.

A learning method that requires fewer runtime parameters to be tuned is considered easier to apply to a data set. SVMs have more parameters than the other methods, while kNN has only one parameter, k, which is easy to tune.

7 Unsupervised Learning

In this section, we briefly discuss clustering analysis and association analysis, which are two common tasks under unsupervised learning. For clustering analysis, we discuss the classic k-means algorithm, and for association analysis, we review the Apriori technique.

7.1 Clustering Analysis

Clustering analysis is the assignment of a set of observations into subsets (clusters) so that the observations within the same cluster are similar according to some pre-designated criterion or criteria. It is an unsupervised learning task that automatically forms clusters of similar things, and can be described as automatic classification. The main difference is that in classification we know what we are looking for, whereas in clustering we do not have such information. Since clustering produces the same kind of result as classification, but without predefined classes, it is sometimes called unsupervised classification [23].


There are several methods employed for clustering analysis, including partitioning, hierarchical, density-based and grid-based methods. The partitioning method involves finding mutually exclusive clusters of spherical shape within a given data set. It is a distance-based approach and often uses the mean (or medoid) to represent the cluster center (i.e. the centroid). The major drawback of this method is its inefficiency on large data sets. The hierarchical method uses hierarchical decomposition (i.e. multiple levels) to find clusters within a data set. This method is sensitive to merging and splitting, since it cannot correct erroneous merges or splits. Density-based methods can find arbitrarily shaped clusters: clusters are dense regions of objects in space that are separated by low-density regions, and the density-based approach may filter out outliers. Grid-based methods use a multiresolution grid data structure. They have the advantage of fast processing time, which is typically independent of the number of data objects but dependent on grid size.

The k-means clustering algorithm belongs to the partitioning group of clustering methods and is a common algorithm for unsupervised clustering tasks. It is called k-means because it finds k unique clusters, and the center of each cluster is the mean of the values in that cluster.

7.1.1 k-means clustering

The k-means algorithm finds k clusters for a given data set, where the number of clusters k is user defined. Each cluster is described by a single point called the centroid, which lies at the center of all the points in the cluster. The algorithm starts by defining the value of k; then k centroids are assigned to random points. Next, each point in the data set is assigned to a cluster by finding the nearest centroid and assigning the point to that centroid's cluster. The next step updates the centroids: the mean of all the points in each cluster is computed, and the centroid is moved to that mean. These steps can be summarized as two main steps – data assignment and relocation of the "means". Listing 3 shows the pseudo-code of the k-means algorithm.

Create k points for starting centroids (often randomly)
While any point has changed cluster assignment:
    for every point in the data set:
        for every centroid:
            calculate the distance between the centroid and the point
        assign the point to the cluster with the lowest distance
    for every cluster:
        calculate the mean of the points in that cluster
        assign the centroid to the mean

Listing 3: Pseudo-code for k-means clustering [23]

As seen from the pseudo-code, k-means is an easy-to-implement method for clustering. The algorithm converges when the assignment of points to clusters no longer changes. The complexity of each iteration is O(N * k), where N is the number of instances in the data set, since this is the required number of distance comparisons per iteration.

A significant drawback of k-means is that it can converge at a local minimum rather than the global minimum (i.e. it sometimes gives a decent result, but not necessarily the best result). It is also known to be extremely slow on large data sets [23]. Different distance measures can be employed for finding the "closest" centroid, and the choice of distance measure also plays a role in the performance of k-means on a given data set. The problem of local minima can be mitigated by running the algorithm multiple times with different starting centroids, or by performing a limited local search around the converged solution [31]. Despite its drawbacks, k-means remains the most widely used partitioning clustering algorithm in practice. Its advantages include simplicity, ease of understanding, reasonable scalability, and the fact that it can easily be modified to handle streaming data [31].
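The two-step loop of Listing 3 can be sketched as follows. This is a minimal, stdlib-only sketch: Euclidean distance is assumed, and the initial centroids are passed in explicitly (rather than chosen randomly, as the pseudo-code suggests) to keep the example deterministic:

```python
# Minimal k-means sketch: alternate data assignment and centroid
# relocation until no point changes cluster.
import math

def kmeans(points, centroids):
    assignment = None
    while True:
        # Data assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster: converged
            return centroids, assignment
        assignment = new_assignment
        # Relocation step: move each centroid to the mean of its cluster.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))

# Two well-separated blobs; k = 2.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, [(0, 0), (10, 10)])
```

On this toy data the algorithm converges after one relocation, with each blob forming one cluster.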

7.2 Association Analysis

Association analysis, or association rule learning, is a popular and well-researched method for discovering interesting relations between variables in large databases. It is commonly applied in business-related areas such as marketing promotions and customer relationship management, and is also applicable in the domains of bioinformatics, web mining and scientific data analysis. The uncovered relationships can be represented in two forms – association rules or frequent itemsets. Association rules suggest that a strong relationship exists between two items, while frequent itemsets are lists of items that commonly appear together.


Action number   Items

0               item1, item2
1               item3, item4, item5, item6
2               item1, item4, item5, item7
3               item2, item1, item4, item5
4               item2, item1, item4, item7

Table 6: A simple itemset

The two most essential concepts in association analysis are support and confidence, which are used to select interesting itemsets. The support of an itemset is defined as the percentage of the data set that contains the itemset. From Table 6, the support of {item1} is 4/5, since it occurs in 4 of the 5 actions, and the support of {item1, item4} is 3/5 because, of all five actions, three contain both item1 and item4. Support applies to an itemset, so we can define a minimum support and filter out itemsets that do not meet it. Confidence, on the other hand, is defined for an association rule such as {item4} → {item5}. The confidence of this rule is defined as support({item4, item5})/support({item4}). The support of {item4, item5} is 3/5 and the support of {item4} is 4/5, so the confidence of {item4} → {item5} is 3/4 = 0.75. This means that for 75% of the actions in the data set containing item4, our rule is correct.
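The worked example above can be reproduced directly from the Table 6 data. This is a sketch; `Fraction` is used so the ratios stay exact:

```python
# Support and confidence computed from the Table 6 actions.
from fractions import Fraction

dataset = [
    {"item1", "item2"},
    {"item3", "item4", "item5", "item6"},
    {"item1", "item4", "item5", "item7"},
    {"item2", "item1", "item4", "item5"},
    {"item2", "item1", "item4", "item7"},
]

def support(itemset):
    """Fraction of actions that contain every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= action for action in dataset), len(dataset))

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"item1"}))                # 4/5
print(support({"item1", "item4"}))       # 3/5
print(confidence({"item4"}, {"item5"}))  # 3/4
```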

With support and confidence, we can quantify the success of the association analysis. Assuming we want to find all sets of items with a support greater than 0.9, we could generate a list of all combinations of items and then count how frequently they occur. This is a brute-force approach, which is not suitable in most cases, especially for large data sets: finding all combinations of items is time-consuming and computationally expensive. Hence, this task requires an intelligent approach to find frequent itemsets in a reasonable amount of time. The Apriori principle allows us to reduce the number of calculations we need to perform in learning association rules.

7.2.1 Apriori algorithm

Finding frequent itemsets (itemsets with frequency ≥ minimum support) is not trivial because of the combinatorial explosion [31]. Generating association rules with confidence ≥ a specified minimum confidence becomes an easy task once we have derived the frequent itemsets. The Apriori principle helps us reduce the number of possibly interesting itemsets. Apriori is a seminal algorithm for finding frequent itemsets using candidate generation [61]. The algorithm is based on the Apriori principle, which says that if an itemset is frequent, then all of its subsets are frequent. In reverse, this means that if an itemset is infrequent, then its supersets are also infrequent [23].

For example, consider Figure 15a, which shows the possible combinations of items from the set {0,1,2,3}. The first set is ∅, the null set. To calculate the support for a given set, say {0,3}, we would go through all nodes, obtain the total number of actions containing both 0 and 3, and divide that total by the number of actions. Counting through all possible itemsets is not feasible for large data, as a data set with N unique items generates 2^N − 1 possible itemsets. Even with only N = 100 items, this generates about 1.27 * 10^30 possible itemsets, which is cumbersome on the processing side [23].

Figure 15: (A) All possible itemsets from the available set {0,1,2,3}. (B) Highlighted infrequent itemsets [23]

For each action a in the data set:
    For each candidate itemset c:
        Check to see if c is a subset of a
        If so, increment the count of c
For each candidate itemset:
    If the support meets the minimum, keep this itemset
Return the list of frequent itemsets

Listing 4: Pseudo-code for Apriori algorithm [23]

Consider Figure 15b: applying the reversed version of the Apriori principle, the shaded itemset {2,3} is known to be infrequent. From this knowledge, we know that the itemsets {0,2,3}, {1,2,3} and {0,1,2,3} are also infrequent. This implies that, once we have derived the support value for {2,3}, we do not have to compute the support of {0,2,3}, {1,2,3} and {0,1,2,3}, since they cannot meet our requirements. This principle can be used to halt the exponential growth of itemsets and speed up the computation time for listing frequent itemsets [23].

The Apriori algorithm (Listing 4) requires a minimum support value and a data set as input. The algorithm first scans the data set and generates a list of all candidate itemsets with one item, then calculates the support of each candidate. The itemsets that meet the minimum support requirement are then combined to make itemsets with two elements. Again, the algorithm scans the data set, and itemsets that do not meet the minimum support value are removed. This procedure is repeated until all sets are removed.
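The scan-generate-prune loop just described can be sketched compactly on the Table 6 data. The minimum support of 0.6 is an illustrative choice, and the candidate-generation strategy below is one simple variant, not necessarily the exact procedure of [61]:

```python
# Compact Apriori sketch: grow frequent itemsets level by level,
# pruning candidates whose subsets are not frequent.
from itertools import combinations

dataset = [
    {"item1", "item2"},
    {"item3", "item4", "item5", "item6"},
    {"item1", "item4", "item5", "item7"},
    {"item2", "item1", "item4", "item5"},
    {"item2", "item1", "item4", "item7"},
]

def apriori(dataset, min_support):
    n = len(dataset)
    # Level 1: frequent single items.
    items = sorted({i for action in dataset for i in action})
    level = [frozenset([i]) for i in items
             if sum(i in a for a in dataset) / n >= min_support]
    frequent = list(level)
    k = 2
    while level:
        # Candidate generation: unions of frequent (k-1)-itemsets ...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ... pruned by the Apriori principle: every (k-1)-subset of a
        # frequent itemset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(level)
                             for s in combinations(c, k - 1))}
        level = [c for c in candidates
                 if sum(c <= a for a in dataset) / n >= min_support]
        frequent += level
        k += 1
    return frequent

freq = apriori(dataset, 0.6)
```

With min_support = 0.6, the frequent itemsets are the four singles {item1}, {item2}, {item4}, {item5} and the three pairs {item1, item2}, {item1, item4}, {item4, item5}; no triple survives the pruning step.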

8 Reinforcement Learning

The approach of reinforcement learning requires a system to learn from success and failure, from reward and punishment. In the supervised learning approach, a system needs to be told the correct move for each position it encounters, but such feedback is seldom available [27]. In situations where this feedback is not available, such as in uncharted territory, a system can learn a model from its own experience. The system also needs to know the aftermath of its actions, whether good or bad. This aftermath, a type of feedback to the learning system, is called reward or reinforcement [27]. Reinforcement learning is learning what to do (i.e. how to map situations to actions) so as to maximize a numerical reward signal. Unlike other types of learning, a reinforcement learner is not told which actions to take [20]. Supervised learning is an important form of learning, but it is not adequate for learning from interaction and in complex domains [20][27]. For interactive problems, it is difficult to obtain examples of desired behavior that are both correct and representative of all the situations in which a system has to act [20]. A reinforcement learning agent must be able to sense the state of the environment (sensation), it must be able to take actions that affect the state of the environment (action), and it must have a goal or goals relating to the state of the environment [20].


A good example of reinforcement learning is an adaptive controller that adjusts parameters of a petroleum refinery's operation in real time. The controller optimizes the yield/cost/quality trade-off based on specified marginal costs, without sticking strictly to the set points originally suggested by engineers. Another possible application of reinforcement learning can be seen in a mobile robot that decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station; it makes its decision based on how quickly and easily it has been able to find the recharger in the past [20]. In general, reinforcement learning has been successfully applied to various problems, including operations research, robotics, telecommunications, game playing and economics/finance [20].

8.1 Elements of Reinforcement Learning

There are four main sub-elements to a reinforcement learning system: a policy, a reward

function, a value function, and, optionally, a model of the environment [20].

• A policy – defines the learning agent's way of behaving at a given time. It is a mapping from perceived states of the environment to the actions to be taken in those states. A policy is essential to reinforcement learning, since it alone is sufficient to determine behavior.

• A reward (or reinforcement) function – defines the goal of a reinforcement-learning problem. It maps perceived states (or state-action pairs) of the environment to a single number, an immediate reward, indicating the underlying desirability of the state. The main objective of a reinforcement learner is to maximize the total reward it receives in the long run; the reward function helps a learning system determine which events are good and which are bad.

• A value (or utility) function – unlike the reward function, which denotes what is good in an immediate sense, a value function specifies what is good in the long run. The value of a state can be described as the total amount of reward a learner can expect to accumulate over the future, starting from that state.

• A model of the environment – imitates the behavior of the environment. It helps to predict the resultant next state and next reward, given a state and an action.


From the elements of reinforcement learning above, we can form a learning model consisting of:

1. An environment state St ∈ S, where S is the set of possible states

2. An action at ∈ A(St), where A(St) is the set of actions available in state St

3. Rules that describe the learner's behavior, i.e. a policy, π(S) = a

4. Rules of transition between states, i.e. a transition function, δ(S, a) = S′

5. Rules that determine the scalar immediate reward of a transition from state S given action a, i.e. a reward function, r(S, a)

6. Rules that determine the long-term reward (utility) of a state (or state-action pair), i.e. a value function, Uπ(S)

These rules are usually stochastic. A reinforcement learner interacts with its environment in discrete time steps. At each time t, the agent receives an observation Ot, which includes the reward rt. The agent then chooses an action at from the set of available actions and sends it to the environment. The environment changes its state from the current state st to a new state st+1, and the reward rt+1 associated with the transition (st, at, st+1) is determined. As mentioned earlier, the goal of the learning agent is to collect as much reward as possible. Action selection can be a function of the history, and it can even be randomized. Figure 16 shows the relationship between a learning agent and its environment.

Figure 16: Learning agent - environment
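The interaction loop just described can be sketched as a skeleton. The environment and policy below are illustrative toy stand-ins, not taken from the report:

```python
# Skeleton of the discrete-time agent-environment loop.
def run_episode(env_step, policy, start_state, horizon):
    """env_step(state, action) -> (next_state, reward)."""
    state, total_reward, history = start_state, 0.0, []
    for t in range(horizon):
        action = policy(state)                   # agent chooses a_t
        state, reward = env_step(state, action)  # environment returns s_{t+1}, r_{t+1}
        total_reward += reward
        history.append((action, state, reward))
    return total_reward, history

# Toy chain environment: an action of +1/-1 moves along integer states;
# reward 1 for reaching state 3, else 0.
def step(state, action):
    nxt = state + action
    return nxt, 1.0 if nxt == 3 else 0.0

# A fixed "always move right" policy reaches state 3 in three steps.
total, history = run_episode(step, lambda s: +1, start_state=0, horizon=3)
```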

Reinforcement learning is based on the Markov Decision Process (MDP), but differs in the prior knowledge assumed about the model parameters. For an MDP, the set of states S, the set of actions A, the transition probabilities T(S, a, S′) and the reward function R(S, a, S′) are known, and the aim is to determine the optimal value function and/or the optimal policy. This can be solved using dynamic programming methods such as value iteration or policy iteration; these approaches are known as model-based algorithms. In true reinforcement learning problems, neither the transition model nor the reward model is known in advance. In this case, algorithms such as temporal-difference learning and Q-learning are employed.

8.1.1 Passive Reinforcement Learning

Passive learning involves learning with a fixed policy; the task is to learn the utilities of states (or state-action pairs), which could also involve learning a model of the environment. Some of the methods used for learning the utilities of states are Direct Utility Estimation (DUE), Adaptive Dynamic Programming (ADP) and Temporal-Difference (TD) learning. DUE was invented in the area of adaptive control theory by Widrow and Hoff [27]. The basic idea is that the utility of a state is the expected total reward from that state onward (i.e. the expected reward-to-go), and each trial provides a sample of this quantity for each state visited. From the observed samples, the algorithm calculates the observed reward-to-go for each visited state and updates the estimated utility for that state accordingly, by keeping a running average for each state in a table. DUE succeeds in reducing the reinforcement-learning problem to an inductive learning problem, but it ignores the inter-dependency between states: the utility of each state equals its own reward plus the expected utility of its successor states. Utility values should obey the Bellman equations (9), but DUE does not exploit them. Hence, the algorithm often converges extremely slowly and usually takes a large number of sequences to get close to the correct values.

Uπ(S) = R(S) + γ ∑_S′ T(S, π(S), S′) Uπ(S′)        (9)

ADP makes use of the Bellman equation to obtain the utilities of states Uπ(S). The ADP approach takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the resulting Markov decision process using a dynamic programming method. The transition model T(S, π(S), S′) and reward function R(S) are estimated from trials, and the resulting learned models are then used in the Bellman equations (9) to calculate the utilities of the states. The Bellman equations are linear, so they can be solved with any linear algebra method. It is also possible to use an alternative approach, modified policy iteration, which uses a simplified value iteration to update the utility estimates after each change to the learned models T and R. This method converges quickly, since the models T and R change only slightly between trials. In both the DUE and ADP methods, the learning problem is reduced to an MDP, which is then solved. Another way to bring the Bellman equations to bear on the learning problem is to use the observed transitions to adjust the utilities of the observed states so that they agree with the constraints of the Bellman equations (9). This is the approach employed in temporal-difference learning: it uses the constraints to adjust the utilities but does not solve the equations for all the states.

Uπ(S) ← (1 − α) Uπ(S) + α (R(S) + γ Uπ(S′))        (10)

Equation (10) presents the update rule for adjusting the utilities, where α is the learning-rate parameter. The name temporal-difference is adopted because this update rule uses the difference in utilities between successive states. The fundamental idea of the temporal-difference method is to adjust the utility estimates towards the ideal equilibrium that holds locally when the utility estimates are correct. While Equation (9) holds in the case of passive learning, Equation (10) causes the agent to reach the equilibrium given by Equation (9). The difference, however, is that the update only involves the observed successor S′, whereas the actual equilibrium conditions involve all possible next states. Temporal-difference learning is a model-free method, since it does not need a transition model to perform its updates. Temporal Difference (TD) and ADP are closely related: both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors. A key difference is that TD adjusts a state to agree with its observed successor, while ADP adjusts the state to agree with all of the successors that might occur. Another difference is that TD makes a single adjustment per observed transition, while ADP makes as many adjustments as it needs to restore consistency between the utility estimates and the environment model.

Direct Utility Estimation is a model-free approach well known for its simplicity. It is easy to implement and each update is fast, but it does not exploit the Bellman constraints and converges slowly. ADP is a model-based approach and is harder to implement than DUE; the cost of each update is high, since each update is a full policy evaluation, but it exploits the Bellman constraints and converges faster in terms of updates. TD is slower to converge than ADP, but it uses less computation per observation and partially exploits the Bellman constraints.
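As a sketch of the update rule in Equation (10), the following evaluates a fixed "move right" policy on a tiny deterministic chain. The chain, rewards, γ and α below are illustrative assumptions, not taken from the report:

```python
# TD(0) policy evaluation on a deterministic chain: states 0..3,
# terminal state 3 with utility 1, reward -0.04 in non-terminal states.
GAMMA, ALPHA, EPISODES = 1.0, 0.5, 200
R = {0: -0.04, 1: -0.04, 2: -0.04}   # rewards in non-terminal states

U = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}  # terminal utility fixed at 1
for _ in range(EPISODES):
    s = 0
    while s != 3:
        s_next = s + 1  # fixed policy: always move right
        # Temporal-difference update, Equation (10):
        U[s] = (1 - ALPHA) * U[s] + ALPHA * (R[s] + GAMMA * U[s_next])
        s = s_next

# The exact utilities satisfy U(s) = R(s) + U(s+1):
# U(2) = 0.96, U(1) = 0.92, U(0) = 0.88.
```

Because the chain is deterministic, the running estimates converge geometrically to the exact utilities even with a constant learning rate; in stochastic environments, α is typically decayed over time.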

8.1.2 Active Reinforcement Learning

An active agent is free to decide what actions to take, because it does not have a fixed policy that determines its behavior. It updates its policy as it learns, and it must also consider what the outcomes of its actions may be and how they will affect the rewards received. Adaptive dynamic programming can be used in this case, although it needs to be modified to handle the freedom of an active learning agent. The same ADP learning mechanism used for the passive agent can be employed to learn a complete model with outcome probabilities for all actions. Taking into consideration that the agent has a choice of actions, the utilities it needs to learn are those defined by the optimal policy, and they obey the Bellman equations (11):

U(S) = R(S) + γ max_a ∑_S′ P(S′|S, a) U(S′)        (11)

Equation (11) can be solved using the value iteration or policy iteration algorithms to obtain a utility function U that is optimal for the learned model [27]. The final issue is what action to take at each step. An active learning agent can extract an optimal action by a one-step look-ahead that maximizes the expected utility; this is applicable if it uses value iteration for Equation (11). If it uses policy iteration, the optimal policy is already available. An ADP agent may become a greedy agent when it follows the greedy policy suggested by the current value estimates. In such cases, the policy learned is not the true optimal policy: choosing actions from this policy leads to sub-optimal results, because the learned model is not the same as the true environment (i.e. what is optimal in the learned model can be sub-optimal in the true environment). Since the agent is not aware of the true environment, it cannot obtain an optimal action for the true environment. To solve this problem, an agent must make a trade-off between exploitation, to maximize its immediate reward, and exploration, to maximize its long-term reward. With exploitation, we exploit our current knowledge to get a payoff; with exploration, we gather more information about the world to discover actions that result in better rewards. To explore, we typically need to take actions that do not seem best according to our current model. Managing the trade-off between exploitation and exploration is a critical issue in reinforcement learning; the basic intuition behind most approaches is to explore more when knowledge is weak and exploit more as knowledge grows. Approaches that use pure exploitation often get stuck in bad policies, while those using pure exploration build better models by learning but obtain small rewards due to the cost of exploration.
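A minimal value-iteration sketch for Equation (11) on a toy deterministic MDP; the states, rewards and γ below are illustrative assumptions:

```python
# Value iteration on a 3-state line: state 2 is terminal and worth 1,
# non-terminal states cost -0.1, actions move left/right with walls.
GAMMA = 0.9
STATES = [0, 1, 2]                      # state 2 is terminal
ACTIONS = {"left": -1, "right": +1}

def next_state(s, a):
    return min(max(s + ACTIONS[a], 0), 2)   # walls at both ends

def reward(s):
    return 1.0 if s == 2 else -0.1

U = {s: 0.0 for s in STATES}
for _ in range(100):                    # iterate Equation (11) to convergence
    U = {s: reward(s) + (0.0 if s == 2 else
                         GAMMA * max(U[next_state(s, a)] for a in ACTIONS))
         for s in STATES}

# One-step look-ahead extracts the greedy policy from the utilities.
policy = {s: max(ACTIONS, key=lambda a: U[next_state(s, a)])
          for s in STATES if s != 2}
```

On this MDP the iteration settles at U(2) = 1.0, U(1) = 0.8 and U(0) = 0.62, and the extracted policy moves right in both non-terminal states.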

9 Performance Evaluation in Machine Learning

In this section, we present common evaluation metrics and methods used in ML. The accuracy of supervised learning is easy to evaluate due to the nature of the learning problems; for unsupervised learning, however, it is difficult to evaluate the accuracy of algorithms. Jain et al. (1998) [62] support this claim: the authors state that the validation of clustering structures is the most difficult and frustrating part of cluster analysis, and suggest that it is best to rely on labeled data for model validation in clustering (an unsupervised learning task).

It is crucial to evaluate the performance of learning algorithms, as this is a large factor in the choice of algorithm. We first define some terms that will be used in this section. In the following definitions, we speak in terms of positive examples, P (examples of the main class of interest) and negative examples, N (all other examples apart from the main class of interest).

True positives, TP - the positive examples that were correctly predicted by the classifier.

True negatives, TN - the negative examples that were correctly labeled by the classifier.

False positives, FP - the negative examples that were incorrectly labeled as positive.

False negatives, FN - the positive examples that were mislabeled as negative.


9.1 Confusion Matrix

A confusion matrix is a table layout that allows visualization of the performance of an algorithm (typically a supervised prediction algorithm). For unsupervised learning, a similar structure is known as a matching matrix. The columns of a confusion matrix represent the instances of a predicted class, and its rows represent the instances of an actual class (see Table 7). The term confusion was adopted because the matrix makes it easy to see whether the learning system is confusing two classes (i.e., commonly mislabeling one as another). It is a useful tool for analyzing how well a classifier recognizes examples of different classes, since it summarizes TP, TN, FP and FN: the TP and TN values show when the classifier predicts correctly, and the FP and FN values show when it predicts wrongly.

                           Predicted class
                        Yes     No      Total
Actual class    Yes     TP      FN      P
                No      FP      TN      N
                Total   P'      N'      P + N

Table 7: Confusion matrix with totals for positive and negative examples

For example, if a learning system has been trained to distinguish between three classes A, B and C, a confusion matrix summarizes the results of testing the algorithm for further inspection. Consider a test set of 30 samples: 5 samples of class A, 12 samples of class B and 13 samples of class C.

                               Predicted class
                          Class A  Class B  Class C  Total
Actual class   Class A       4        1        0       5
               Class B       5        6        1      12
               Class C       0        3       10      13
               Total         9       10       11      30

Table 8: Example of a confusion matrix with totals for positive and negative examples

In this confusion matrix, of the 5 actual samples of class A, the system classified 4 correctly and predicted 1 as class B. The matrix makes it easy to visually inspect the prediction performance.

As seen in Table 7, the confusion matrix has an additional row and column providing the totals. It shows P and N; it also shows P' and N'. P' is the number of examples that were labeled as positive (TP + FP), while N' is the number of examples that were labeled as negative (TN + FN). The total number of examples is TP + TN + FP + FN, which equals P + N or P' + N'.
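A confusion matrix such as Table 8 can be tallied directly from paired lists of actual and predicted labels. A minimal sketch, where the label lists are constructed to reproduce Table 8:

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes (as in Table 8)."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Reproduce Table 8: 30 samples over classes A, B and C.
actual    = ["A"]*4 + ["A"]   + ["B"]*5 + ["B"]*6 + ["B"]   + ["C"]*3 + ["C"]*10
predicted = ["A"]*4 + ["B"]   + ["A"]*5 + ["B"]*6 + ["C"]   + ["B"]*3 + ["C"]*10
matrix = confusion_matrix(actual, predicted, labels=["A", "B", "C"])
# matrix == [[4, 1, 0], [5, 6, 1], [0, 3, 10]]
```

The diagonal entries are the correct predictions; every off-diagonal entry shows one way the classifier confuses a pair of classes.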

9.2 Accuracy and Error rate

The accuracy, or recognition rate, of a classifier is the percentage of test set tuples that are correctly classified by the classifier [1], i.e.

Accuracy = (TP + TN) / (P + N)    (12)

The error rate, or misclassification rate, is the percentage of test set examples that are misclassified. The error rate is simply 1 − accuracy(M), where accuracy(M) is the accuracy of a model M. It can also be computed as

Error rate = (FP + FN) / (P + N)    (13)

As mentioned earlier in this section, to avoid misleading, overoptimistic estimates, it is best to perform the evaluation on a data set (the test set) other than the one used to build the classifier (the training set); this is the holdout method. The resubstitution error is the error rate obtained by using the training set, instead of a separate test set, to estimate the error rate. It is an optimistic estimate of the true error rate (and the corresponding accuracy estimate is likewise optimistic).

Accuracy can also be misleading when the distribution of classes within a data set is uneven, i.e., when there are very few examples of the class of interest (the positive class). For example, in fraud detection systems the positive class is fraud, which occurs very rarely. A classifier with 97% accuracy might seem quite accurate, but what if only 3% of the training examples are actually fraud cases? The classifier could be predicting all non-fraud activity correctly while misclassifying the class of interest, fraud. This clearly makes accuracy an unacceptable measure of the classifier's performance in such a case. The problem is referred to as the class imbalance problem, and it requires measures that assess how well the classifier performs on the positive and on the negative examples separately; the sensitivity and specificity measures presented next can be used for this purpose [1].
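The fraud example can be made concrete. With 3% positives, a classifier that labels everything as non-fraud still scores 97% accuracy by Equation (12) while detecting no fraud at all; the counts below are illustrative:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # Equation (12)

def error_rate(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)   # Equation (13)

# 1000 examples, 30 frauds (3%); the classifier labels everything non-fraud,
# so TP = 0, FN = 30, TN = 970, FP = 0.
acc = accuracy(tp=0, tn=970, fp=0, fn=30)    # 0.97, despite missing every fraud
err = error_rate(tp=0, tn=970, fp=0, fn=30)  # 0.03
```

The 97% figure says nothing about the positive class here, which is exactly why per-class measures are needed under class imbalance.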


9.3 Sensitivity and Specificity

Sensitivity and specificity are statistical measures of the performance of a classification function. Sensitivity measures the proportion of actual positives that are correctly identified as such (the TP recognition rate), while specificity measures the proportion of negatives that are correctly identified as such (the TN recognition rate).

Sensitivity = TP / P    (14)

Specificity = TN / N    (15)

Accuracy can be expressed as a function of sensitivity and specificity:

Accuracy = Sensitivity × P / (P + N) + Specificity × N / (P + N)

The sensitivity and specificity measures are closely related to the concepts of type I and type II errors in statistics [1]. A perfect classifier would achieve 100% sensitivity (i.e., it predicts all actual class-A samples as class-A) and 100% specificity (i.e., it does not predict any non-class-A sample as class-A).
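Taking the Table 8 counts with class A as the positive class (TP = 4, FN = 1, FP = 5, TN = 20, so P = 5 and N = 25), the two measures and the accuracy decomposition above can be checked numerically:

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)   # Equation (14): TP / P

def specificity(tn, fp):
    return tn / (tn + fp)   # Equation (15): TN / N

# Table 8 with class A as the positive class.
tp, fn, fp, tn = 4, 1, 5, 20
P, N = tp + fn, tn + fp                 # P = 5, N = 25
sens = sensitivity(tp, fn)              # 4/5  = 0.8
spec = specificity(tn, fp)              # 20/25 = 0.8
acc = (tp + tn) / (P + N)               # 24/30 = 0.8
# Accuracy is the class-prior-weighted mean of sensitivity and specificity:
assert abs(acc - (sens * P / (P + N) + spec * N / (P + N))) < 1e-12
```
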

9.4 Precision, Recall and F-measure

Precision and recall are widely used measures for evaluating classification models. Precision can be defined as a measure of exactness (i.e., the percentage of examples predicted as positive that actually are positive), while recall is a measure of completeness (i.e., the percentage of positive examples that are predicted as such).

precision = TP / (TP + FP)

recall = TP / (TP + FN) = TP / P


There is an inverse relationship of sorts between precision and recall. They are typically used together, with recall values compared at a fixed value of precision, or vice versa. In some situations it is beneficial to combine precision and recall into a single estimate; such combinations are referred to as the F measure (F1 score or F-score) and the Fβ measure.

F = (2 × precision × recall) / (precision + recall)

Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)

Here β is a non-negative real number. The F measure is defined as the harmonic mean of precision and recall, which gives equal weight to both. Similarly, the Fβ measure is a weighted estimate of precision and recall, which assigns β times as much weight to recall as to precision. F2 and F0.5 are two commonly used Fβ measures [1].
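Continuing the same example (class A of Table 8 as the positive class, with TP = 4, FP = 5 and FN = 1), precision, recall and the F measures can be computed directly:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Weighted combination of precision and recall; beta=1 gives the F1 score."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Table 8 with class A as the positive class: TP = 4, FP = 5, FN = 1.
p = precision(4, 5)        # 4/9: many of the "A" predictions were wrong
r = recall(4, 1)           # 4/5: most actual A samples were found
f1 = f_beta(p, r)          # harmonic mean, equals 2*p*r/(p + r)
f2 = f_beta(p, r, beta=2)  # emphasizes recall over precision
```

Note that F2 lies closer to recall and F0.5 would lie closer to precision, which is the point of the β weighting.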

We have discussed some accuracy-based measures that are widely employed in estimating the performance of a classifier. In addition to accuracy-based measures, other aspects can be considered when comparing classifiers, e.g., speed, robustness and scalability (see Subsection 6.6).

9.5 Methods for Obtaining Reliable Evaluation Measures

Using a data set to derive a classifier and then estimating the performance of the resulting model on the same data set can produce fallaciously overoptimistic estimates, because the learning algorithm has overspecialized on that data. At least three techniques are used to calculate a classifier's accuracy [3]. One simple method splits the data set into two-thirds for training and one-third for performance evaluation. Another method is cross-validation, a statistical method widely employed in ML for obtaining reliable classifier accuracy estimates. In n-fold cross-validation, the data set is randomly partitioned into n mutually exclusive subsets, or folds, D1, D2, ..., Dn, each of approximately equal size.

Training and testing are performed n times: in iteration i, partition Di is reserved as the test set and the union of the remaining partitions is used to build the classifier. For example, in the first round of an n-fold validation, the subsets D2, ..., Dn jointly serve as the training set for the first model, which is tested on D1; the second round trains on D1, D3, ..., Dn and tests on D2; and so on until the nth iteration. The average over all n rounds is the final accuracy estimate. There are variants of cross-validation such as leave-one-out, holdout and stratified cross-validation [8]. Leave-one-out validation uses a single instance as the test sample, while the algorithm is trained on all other instances. The holdout method, also called 2-fold cross-validation, is the simplest variant of n-fold cross-validation: the data set is randomly divided into two sets D1 and D2 of equal size; D1 is used for training and D2 for testing, and vice versa. The stratified variant ensures that each fold contains approximately the same proportions of the class labels. Stratified 10-fold cross-validation is recommended and often used for estimating accuracy because of its relatively low bias and variance [1].
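The n-fold procedure can be sketched independently of any particular learning algorithm. The learner below is a deliberately trivial stand-in (a majority-class predictor) and the data are made up, purely to exercise the fold bookkeeping:

```python
import random

def n_fold_cross_validation(data, labels, train_fn, predict_fn, n=10, seed=0):
    """Partition the data into n folds; in round i, fold Di is the test set
    and the union of the other folds is the training set. Returns the
    average accuracy over the n rounds."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n] for i in range(n)]   # n roughly equal folds
    accuracies = []
    for i in range(n):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        model = train_fn([data[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        correct = sum(predict_fn(model, data[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / n

# Stand-in learner: always predict the majority class of the training set.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

data = list(range(100))
labels = ["pos"] * 70 + ["neg"] * 30
acc = n_fold_cross_validation(data, labels, train_majority, predict_majority, n=10)
```

A stratified variant would additionally balance the "pos"/"neg" proportions within each fold before the loop.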

10 Conclusions

Machine learning is a broad area that is applied in many fields. Its application in areas such as data mining and AI has produced numerous techniques and algorithms. In general, these methods have aided many decision-making systems by learning from data of different types. The recent interest in the Internet of Things and sensor networks, which are expected to grow rapidly to 50 billion devices and beyond, has sparked great interest in applying ML in pervasive environments. In view of this, this study is a rudimentary introduction to common ML techniques that have been applied in the area of pervasive computing. The scope of the study is limited to methods that are identified as applicable, or that have potential application, in the field of pervasive computing. These techniques are identified from previous work in pervasive environments as fundamental and essential methods. The report presents the main types of ML (supervised, unsupervised and reinforcement learning), with more focus on supervised learning, since it is the most widely applied to learning problems. The supervised learning algorithms discussed include Naive Bayes, kNN, decision trees, SVM and HMM. Other methods covered are the unsupervised learning methods k-means and the Apriori algorithm, used for clustering and association analysis respectively, and finally reinforcement learning. Lastly, various performance and accuracy metrics are highlighted, as well as methods for obtaining reliable evaluation measures.


Bibliography

[1] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011. ISBN 0123814790, 9780123814791.

[2] A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010. URL http://archive.ics.uci.edu/ml.

[3] S. B. Kotsiantis. Supervised machine learning: A review of classification techniques. In Proceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pages 3–24, Amsterdam, The Netherlands, 2007. IOS Press. ISBN 978-1-58603-780-2. URL http://dl.acm.org/citation.cfm?id=1566770.1566773.

[4] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series). The MIT Press, August 2012. ISBN 0262018020. URL http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0262018020.

[5] Theodoros Anagnostopoulos, Christos B. Anagnostopoulos, Stathes Hadjiefthymiades, Alexandros Kalousis, and Miltos Kyriakakos. Path prediction through data mining. In Pervasive Services, IEEE International Conference on, pages 128–135, 2007.

[6] H. Kosorus, J. Hönigl, and J. Küng. Using R, Weka and RapidMiner in time series analysis of sensor data for structural health monitoring. In Database and Expert Systems Applications (DEXA), 2011 22nd International Workshop on, pages 306–310, 2011. ISSN 1529-4188.


[7] Teradata, 2013.

[8] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Amsterdam, 3rd edition, 2011. ISBN 978-0-12-374856-0. URL http://www.sciencedirect.com/science/book/9780123748560.

[9] T. Anagnostopoulos, C. Anagnostopoulos, and S. Hadjiefthymiades. Mobility prediction based on machine learning. In Mobile Data Management (MDM), 2011 12th IEEE International Conference on, volume 2, pages 27–30, 2011.

[10] M. Mühlenbrock, O. Brdiczka, D. Snowdon, and J.-L. Meunier. Learning to detect user activity and availability from a variety of sensor data. In Pervasive Computing and Communications, 2004. PerCom 2004. Proceedings of the Second IEEE Annual Conference on, pages 13–22, 2004.

[11] Jennifer R. Kwapisz, Gary M. Weiss, and Samuel A. Moore. Activity recognition using cell phone accelerometers. SIGKDD Explor. Newsl., 12(2):74–82, March 2011. URL http://doi.acm.org/10.1145/1964897.1964918.

[12] Ling Bao and Stephen S. Intille. Activity recognition from user-annotated acceleration data. pages 1–17. Springer, 2004.

[13] Oliver Brdiczka, Patrick Reignier, and James L. Crowley. Supervised learning of an abstract context model for an intelligent environment. In Proceedings of the 2005 Joint Conference on Smart Objects and Ambient Intelligence: Innovative Context-Aware Services: Usages and Technologies, sOc-EUSAI '05, pages 259–264, New York, NY, USA, 2005. ACM. ISBN 1-59593-304-2. URL http://doi.acm.org/10.1145/1107548.1107612.

[14] Keith Worden and Graeme Manson. The application of machine learning to structural health monitoring. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 365(1851):515–537, February 2007.

[15] J. M. Ko and Y. Q. Ni. Technology developments in structural health monitoring of large-scale bridges. Engineering Structures, 27(12):1715–1725, 2005.

[16] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, and T. S. Huang. Semisupervised learning of classifiers: theory, algorithms, and their application to human-computer interaction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(12):1553–1566, 2004.

[17] B. D. Ziebart, D. Roth, R. H. Campbell, and A. K. Dey. Learning automation policies for pervasive computing environments. In Autonomic Computing, 2005. ICAC 2005. Proceedings. Second International Conference on, pages 193–203, 2005.

[18] L. Wu, G. Kaiser, D. Solomon, R. Winter, A. Boulanger, and R. Anderson. Improving efficiency and reliability of building systems using machine learning and automated online evaluation. In Systems, Applications and Technology Conference (LISAT), 2012 IEEE Long Island, pages 1–6, 2012.

[19] Eric Horvitz, Jack Breese, David Heckerman, David Hovel, and Koos Rommelse. The Lumière project: Bayesian user modeling for inferring the goals and needs of software users. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI'98, pages 256–265, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 1-55860-555-X. URL http://dl.acm.org/citation.cfm?id=2074094.2074124.

[20] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. ISBN 0262193981. URL http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html.

[21] D. J. Cook, M. Youngblood, E. O. Heierman III, K. Gopalratnam, S. Rao, A. Litvin, and F. Khawaja. MavHome: an agent-based smart home. In Pervasive Computing and Communications, 2003. (PerCom 2003). Proceedings of the First IEEE International Conference on, pages 521–524, 2003.

[22] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 1997. ISBN 0070428077, 9780070428072.

[23] Peter Harrington. Machine Learning in Action. Manning Publications Co., Greenwich, CT, USA, 2012. ISBN 1617290181, 9781617290183.

[24] Tom M. Mitchell. Machine learning and data mining. Commun. ACM, 42(11):30–36, November 1999. URL http://doi.acm.org/10.1145/319382.319388.


[25] H. Mannila. Data mining: machine learning, statistics, and databases. In Scientific and Statistical Database Systems, 1996. Proceedings., Eighth International Conference on, pages 2–9, 1996.

[26] Ming Xue and Changjun Zhu. A study and application on machine learning of artificial intelligence. In Artificial Intelligence, 2009. JCAI '09. International Joint Conference on, pages 272–274, 2009.

[27] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, December 2009.

[28] M. Kholghi, H. Hassanzadeh, and M. R. Keyvanpour. Classification and evaluation of data mining techniques for data stream requirements. In Computer Communication Control and Automation (3CA), 2010 International Symposium on, volume 1, pages 474–478, 2010.

[29] Frances Y. Kuo and Ian H. Sloan. Lifting the curse of dimensionality. Notices of the AMS, 52:1320–1329, 2005.

[30] Pat Langley and Herbert A. Simon. Applications of machine learning and rule induction. Commun. ACM, 38(11):54–64, November 1995. URL http://doi.acm.org/10.1145/219717.219768.

[31] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1):1–37, December 2007. URL http://dx.doi.org/10.1007/s10115-007-0114-2.

[32] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data preparation for data mining. Applied Artificial Intelligence, 17:375–381, 2003.

[33] Gustavo E. A. P. A. Batista and Maria Carolina Monard. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6):519–533, 2003. URL http://dx.doi.org/10.1080/713827181.

[34] Aik Choon Tan and David Gilbert. An empirical comparison of supervised machine learning techniques in bioinformatics. In Proceedings of the First Asia-Pacific Bioinformatics Conference on Bioinformatics 2003 - Volume 19, APBC '03, pages 219–222, Darlinghurst, Australia, 2003. Australian Computer Society, Inc. ISBN 0-909-92597-6. URL http://dl.acm.org/citation.cfm?id=820189.820218.

[35] Kiri Wagstaff. Machine learning that matters. CoRR, abs/1206.4656, 2012.

[36] Theodoros Anagnostopoulos, Christos Anagnostopoulos, Stathes Hadjiefthymiades, Miltos Kyriakakos, and Alexandros Kalousis. Predicting the location of mobile users: a machine learning approach. In Proceedings of the 2009 International Conference on Pervasive Services, ICPS '09, pages 65–72, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-644-1. URL http://doi.acm.org/10.1145/1568199.1568210.

[37] U. Weiss, P. Biber, S. Laible, K. Bohlmann, and A. Zell. Plant species classification using a 3D LIDAR sensor and machine learning. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on, pages 339–345, 2010.

[38] A. M. Bidgoli and M. Boraghi. A language independent text segmentation technique based on Naive Bayes classifier. In Signal and Image Processing (ICSIP), 2010 International Conference on, pages 11–16, 2010.

[39] M. J. Meena and K. R. Chandran. Naive Bayes text classification with positive features selected by statistical method. In Advanced Computing, 2009. ICAC 2009. First International Conference on, pages 28–33, 2009.

[40] S. Viaene, R. A. Derrig, and G. Dedene. A case study of applying boosting Naive Bayes to claim fraud diagnosis. Knowledge and Data Engineering, IEEE Transactions on, 16(5):612–620, 2004.

[41] David J. Hand and Keming Yu. Idiot's Bayes—not so stupid after all? International Statistical Review, 69(3):385–398, 2001.

[42] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn., 29(2-3):103–130, November 1997. URL http://dx.doi.org/10.1023/A:1007413511361.


[43] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103–130, 1997. URL http://dx.doi.org/10.1023/A%3A1007413511361.

[44] David W. Aha. Lazy Learning. Kluwer Academic Publishers, Norwell, MA, USA, 1997. ISBN 0-7923-4584-3.

[45] Ramon López de Mántaras and Eva Armengol. Machine learning from examples: Inductive and lazy methods. Data & Knowledge Engineering, 25(1-2):99–123, 1998.

[46] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans. Inf. Theor., 13(1):21–27, September 2006. URL http://dx.doi.org/10.1109/TIT.1967.1053964.

[47] Dietrich Wettschereck, David W. Aha, and Takao Mohri. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11:273–314, 1997.

[48] T. Cover and P. Hart. Nearest neighbor pattern classification. Information Theory, IEEE Transactions on, 13(1):21–27, 1967.

[49] Gongde Guo, Hui Wang, David A. Bell, Yaxin Bi, and Kieran Greer. KNN model-based approach in classification. In Robert Meersman, Zahir Tari, and Douglas C. Schmidt, editors, On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE - OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003, volume 2888 of Lecture Notes in Computer Science, pages 986–996. Springer, 2003. ISBN 3-540-20498-9.

[50] Giovanni Seni and John Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan and Claypool Publishers, 2010. ISBN 1608452840, 9781608452842.

[51] J. Ross Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, 1st edition, October 1992. ISBN 9781558602380.

[52] E. B. Hunt, J. Martin, and P. Stone. Experiments in Induction. Academic Press, New York, 1966.


[53] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986. URL http://dx.doi.org/10.1007/BF00116251.

[54] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, New York, 1984.

[55] Douglas M. Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004.

[56] K. Veropoulos, I. C. G. Campbell, and N. Cristianini. Controlling the sensitivity of support vector machines. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI99), Stockholm, Sweden, page 55, 1999. Workshop ML3.

[57] Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Advances in Kernel Methods, chapter Kernel principal component analysis, pages 327–352. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3. URL http://dl.acm.org/citation.cfm?id=299094.299113.

[58] John C. Platt. Using analytic QP and sparseness to speed training of support vector machines. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 557–563, Cambridge, MA, USA, 1999. MIT Press. ISBN 0-262-11245-0. URL http://dl.acm.org/citation.cfm?id=340534.340735.

[59] Lawrence R. Rabiner. Readings in Speech Recognition, chapter A tutorial on hidden Markov models and selected applications in speech recognition, pages 267–296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990. ISBN 1-55860-124-4. URL http://dl.acm.org/citation.cfm?id=108235.108253.

[60] Mark Stamp. A revealing introduction to hidden Markov models, 2004.

[61] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, VLDB'94, Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile, pages 487–499. Morgan Kaufmann, 1994. ISBN 1-55860-153-8.

[62] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988. ISBN 0-13-022278-X.