SELECTIVE KNOWLEDGE TRANSFER FROM K-NEAREST NEIGHBOUR TASKS
USING FUNCTIONAL SIMILARITY
AT THE CLASSIFICATION LEVEL
by
Yuan Su
Thesis
submitted in partial fulfillment of the
requirements for the Degree of
Bachelor of Computer Science with Honours
Acadia University
April 2005
© Copyright by Yuan Su, 2005
This thesis by Yuan Su
is accepted in its present form by the
Jodrey School of Computer Science
as satisfying the thesis requirements for the degree of
Bachelor of Computer Science with Honours
Approved by the Thesis Supervisor
__________________________        ____________________
Dr. Daniel L. Silver              Date
Approved by the Director of the School
__________________________        ____________________
Dr. Leslie Oliver                 Date
Approved by the Honours Committee
__________________________ ____________________ Date
I, Yuan Su, grant permission to the University Librarian at Acadia University to reproduce, loan or distribute copies of my thesis in microform, paper or electronic
formats on a non-profit basis. I, however, retain the copyright in my thesis.
_________________________________Signature of Author
_________________________________Date
Table of Contents
TABLE OF CONTENTS............................................................................................................................III
LIST OF TABLES......................................................................................................................................VII
LIST OF FIGURES..................................................................................................................................VIII
ABSTRACT..................................................................................................................................................IX
CHAPTER 1 INTRODUCTION...................................................................................................................1
1.1 OVERVIEW OF PROBLEM..................................................................................2
1.2 RESEARCH OBJECTIVES.................................................................................3
1.3 MOTIVATION...........................................................................................4
1.4 OVERVIEW OF THESIS..................................................................................4
CHAPTER 2 BACKGROUND.....................................................................................................................6
2.1 BACKGROUND ON INDUCTIVE LEARNING AND KNN............................................................6
2.1.1 Supervised Inductive Learning.....................................................................6
2.1.2 Classification....................................................................................6
2.1.3 Generalization Error..............................................................................7
2.1.4 The kNN Algorithm.................................................................................7
2.1.5 Distance Weighted kNN.............................................................................9
2.2 BACKGROUND OF KNOWLEDGE TRANSFER...................................................................10
2.2.1 Inductive Bias and Prior Knowledge...............................................................10
2.2.2 Knowledge Based Inductive Learning...............................................................11
2.2.3 Task Relatedness.................................................................................11
2.2.4 Functional Similarity vs. Structural Similarity..................................................12
2.3 RELEVANT BACKGROUND IN PROBABILITY AND STATISTICS..................................................13
2.3.1 Conditional Probability..........................................................................13
2.3.2 Conditional Probability Distributions for Discrete Random Variables..............................13
2.4 PREVIOUS RESEARCH ON KNOWLEDGE TRANSFER IN THE CONTEXT OF KNN......................................15
2.4.1 Task Clustering (TC) Algorithm...................................................................15
2.4.2 Weight Vector....................................................................................15
2.4.3 Summary..........................................................................................16
CHAPTER 3 SELECTIVE KNOWLEDGE TRANSFER FROM KNN TASKS.................................18
3.1 FORMAL DEFINITION OF THE PROBLEM...................................................................18
3.2 THEORY OF KNOWLEDGE TRANSFER FOR KNN CONCEPT LEARNING..............................................18
3.2.1 A Synthetic Example..............................................................................19
3.2.2 Conditional Probability Distributions............................................................22
3.2.3 Generation of Virtual Instances..................................................................23
3.2.4 The Need for a Measure of Relatedness at the Classification Level................................28
3.2.4 Using Variance to Measure Classification Relatedness.............................................29
3.2.5 Duplicated Instances.............................................................................31
3.2.6 A Neighbourhood of Virtual Instances.............................................................31
3.2.7 Steps of Knowledge Transfer from a kNN Concept Learning Task.....................................32
3.3 GENERALIZING TO MULTI-CLASSES......................................................................33
3.3.1 Conditional Probability Distribution for Multi-class Tasks.......................................33
3.3.2 Classification Relatedness.......................................................................35
3.3.3 Steps of Knowledge Transfer for a kNN Multi-Class Task...........................................37
3.4 IMPLEMENTATION.....................................................................................37
3.5 CONCLUSION.........................................................................................41
CHAPTER 4 EMPIRICAL STUDIES.......................................................................................................42
4.1 THE BITMAP DOMAIN..................................................................................42
4.1.1 Task T0..........................................................................................43
4.1.2 Task T1..........................................................................................43
4.1.3 Task T2..........................................................................................44
4.1.4 Task T3..........................................................................................44
4.2 EXPERIMENT 1: TRANSFER FROM A PARTIALLY RELATED TASK...............................................45
4.2.1 Tasks............................................................................................45
4.2.2 Method...........................................................................................45
4.2.3 Results..........................................................................................46
4.3 EXPERIMENT 2: UNRELATED TASKS......................................................................47
4.3.1 Tasks............................................................................................48
4.3.2 Method...........................................................................................48
4.3.3 Results..........................................................................................49
4.4 EXPERIMENT 3: VARIATION IN TRANSFER FROM MORE AND LESS RELATED TASKS...............................50
4.4.1 Tasks............................................................................................50
4.4.2 Method...........................................................................................51
4.4.3 Results..........................................................................................51
4.5 EXPERIMENT 4: KNOWLEDGE TRANSFER FROM MULTIPLE TASKS...............................................52
4.5.1 Tasks............................................................................................53
4.5.2 Method...........................................................................................53
4.5.3 Results..........................................................................................54
4.6 EXPERIMENT 5: THE ERROR OF ESTIMATION..............................................................55
4.6.1 Tasks and Method.................................................................................55
4.6.2 Results..........................................................................................56
4.7 EXPERIMENT 6: A REAL-WORLD DOMAIN: CHARACTER RECOGNITION...........................................57
4.7.1 Dataset..........................................................................................58
4.7.2 Method...........................................................................................58
4.7.3 Results..........................................................................................59
4.8 DISCUSSION..........................................................................................................................................61
CHAPTER 5 CONCLUSION.....................................................................................................................62
5.1 MAJOR CONTRIBUTIONS................................................................................62
5.1.1 A new functional measure of relatedness for kNN based on virtual instances.......................62
5.1.2 Measure of relatedness is at the classification level............................................63
5.1.3 Tolerance to unrelated tasks and scaling.........................................................63
5.2 LIMITATIONS........................................................................................63
5.2.1 Minimizing Conditional Probability Estimation Errors.............................................63
5.2.2 Relatedness Based on Sub-spaces of the Input Space...............................................64
5.3 OTHER SUGGESTIONS FOR FUTURE WORK..........................................................................................65
References.....................................................................................................................................................67
List of Tables
TABLE 1. ESTIMATION OF CONDITIONAL PROBABILITIES.......................................................23
TABLE 2. RESULTS OF EXPERIMENT 1.......................................................................46
TABLE 3. RESULTS OF EXPERIMENT 2.......................................................................49
TABLE 4. RESULTS OF EXPERIMENT 3.......................................................................52
TABLE 5. RESULTS OF EXPERIMENT 4.......................................................................54
TABLE 6. RESULTS OF EXPERIMENT 5.......................................................................56
TABLE 7. RESULTS OF EXPERIMENT 6.......................................................................59
List of Figures
FIGURE 1 AN EXAMPLE OF A KNN TASK...........................................................................................................8
FIGURE 2 AN EXAMPLE OF PROBABILITY DISTRIBUTION................................................................................14
FIGURE 3. THE PRIMARY TASK, T0..................................................................................................................20
FIGURE 4 POSSIBLE DECISION BOUNDARIES FOR T0........................................................................................21
FIGURE 5. SECONDARY TASK, T1....................................................................................................................21
FIGURE 6 ESTIMATE CONDITIONAL PROBABILITIES USING TRAINING INSTANCES OF T0.................................24
FIGURE 7 GENERATE VIRTUAL INSTANCES FROM INSTANCES OF T1...............................................................27
FIGURE 8. ARCHITECTURE OF VND-KNN......................................................................................................40
FIGURE 9. BITMAP FOR T0, T2 AND T3.............................................................................................................43
FIGURE 10. BITMAP FOR T1............................................................................................................................44
FIGURE 11. THE RELATION BETWEEN TASKS IN BITMAP DOMAIN..................................................................45
FIGURE 12. RESULTS OF EXPERIMENT 1.........................................................................................................47
FIGURE 13. RESULTS OF EXPERIMENT 2.........................................................................................................50
FIGURE 14. RESULTS OF EXPERIMENT 3.........................................................................................................52
FIGURE 15. RESULTS OF EXPERIMENT 4.........................................................................................................55
FIGURE 16. RESULTS OF EXPERIMENT 5.........................................................................................................57
FIGURE 17. RESULTS OF EXPERIMENT 6...................................................................60
ABSTRACT
The thesis explores how a learning system can utilize previously learned
knowledge to develop a more accurate hypothesis in the context of the k nearest
neighbour (kNN) learning algorithm. Several previous methods of knowledge transfer for
kNN have proposed measures based on structural similarity at the task level. A theory of
selective knowledge transfer is presented using a measure of relatedness based on
functional similarity at the classification level.
The new method of knowledge transfer relies on the generation of virtual
instances for the primary task from training instances of the secondary task. Each virtual
instance is non-deterministic in that the probability of its class value is conditioned upon
the class value of the secondary task. Virtual-instance-based Non-Deterministic kNN
(VND-kNN) is introduced as an implementation of the theory.
A prototype system based on the theory is tested against a synthetic domain and a
letter recognition domain. Experiments show that knowledge transfer from secondary
tasks based on the conditional probability distributions can improve the generalization
accuracy of the primary task if the secondary tasks are related to the primary task.
Furthermore, experiments show that the method is able to mitigate negative transfer of
knowledge when the secondary tasks are unrelated to the primary task.
Chapter 1
Introduction
Machine Learning has been defined as the study of computer algorithms that
improve automatically through experience (Mitchell, 1997). Machine learning theories
imply that the larger the set of training examples, the better the probability of developing
an accurate hypothesis (Valiant, 1984). However, in practice, most applications of
machine learning systems suffer from a deficiency of training examples. Collecting
training examples can be difficult, time consuming and costly. Life-long learning is
dedicated to utilizing prior knowledge in lieu of training examples to more efficiently
learn a more effective hypothesis for a new task.
The k-nearest neighbour algorithm (kNN) is a popular machine learning method.
kNN considers every instance to be a point in an n-dimensional space, where n is the
number of input attributes. kNN is trained by simply storing training examples and it
classifies a query instance q based on the k training examples that are closest to q. Three
methods of life-long learning through knowledge transfer using kNN have previously
been proposed (Caruana, 1993; Thrun, 1995; Silver, 2000). Knowledge is selectively
transferred based on structural measures of relatedness at the task level. This thesis
introduces a method of selective knowledge transfer for kNN that is based on functional
measures at the classification level, using virtual training instances.
Chapter 1 has four sections. The first section defines the terms used above that
may not be familiar to the reader and provides an overview of the problem. The
second section contains the research objectives and the third section explains the
motivation for the new method. The final section gives an overview of the structure of the
thesis.
1.1 Overview of Problem
Machine learning systems often encounter insufficient training examples per task
to develop a sufficiently accurate hypothesis. For example, a hospital may have records
on only 100 patients with a particular type of heart disease. One approach to overcoming
the deficiency of training examples is to utilize knowledge that has been acquired during
the learning of previous, related tasks. For example, suppose we have learned a
model for identifying patients with high blood pressure; we can use its knowledge to help
identify patients with heart disease. The process is to transfer the previously
acquired knowledge (high blood pressure diagnosis) to the new and related learning task
(heart disease diagnosis).
Previous work has provided a fundamental theory of knowledge transfer along
with methods of selective knowledge transfer in the context of kNN (Caruana, 1993;
Thrun, 1995; Silver, 2000). All of these methods use the
similarity between the distance metric (a structural measure) used in each task and do not
consider the functional relationship between the output values of the tasks.
Other previous research has explored methods of determining how much two
tasks are functionally related; for example, linear coefficient of correlation, coefficient of
determination and Hamming distance (Silver, 2000). All of the proposed methods measure
functional relatedness at the task level: the relationship between two tasks is based on all
target values. However, it can be beneficial to measure the relationship at the
classification level. For example, if the output value of a previously learned task T1 is the
same as that of the new task T0 for a particular class value, then this sense of relationship
should not be entirely dismissed. For some sub-region of the input attribute space, the
two tasks are similar and the transfer could be beneficial.
Only a few researchers have examined knowledge transfer in the context of the kNN
algorithm, and all previously proposed methods transfer knowledge based on structural measures at
the task level. The thesis develops a new functional measure of relatedness at the
classification level and uses the measure to achieve knowledge transfer between kNN
tasks.
1.2 Research Objectives
The thesis investigates possible ways of selective task knowledge transfer in the
context of kNN. The research reported has three objectives. The first objective is to find
an alternative to the previously presented measures of relatedness between tasks.
The second objective is to develop a theoretical model of selective knowledge transfer
based on the new measure of relationship between tasks. The last objective is to build a
prototype system based on this theory and test the system against synthetic and real task
domains.
1.3 Motivation
The research is motivated by observation of human learning and fundamental
knowledge of machine learning theory.
Observation of human learning. In practice, humans often get into a situation
where they cannot decide on a best answer. For example, a person wants to decide
whether he should sell all his stock. Half of his consultants suggest that he sell but the
other half do not. If the person has background knowledge of a previous and similar
decision, it can increase the probability of making a correct choice of action. Similarly,
when the kNN algorithm classifies a query instance, if there is related prior task
knowledge that beneficially biases the algorithm, the resulting prediction will be more
accurate.
Machine learning theory. The PAC theory shows that the probability of a
learning system developing an accurate hypothesis increases with the number of training
examples used (Valiant, 1984). I propose that knowledge from previously learned kNN
tasks can provide additional virtual instances for better learning of a new task. This
requires a clever method of generating virtual instances from previously learned tasks to
enrich the pool of instances for the new task.
1.4 Overview of Thesis
The remainder of the thesis is organized as follows.
Chapter 2 provides background knowledge of inductive learning, kNN and
knowledge transfer. It describes previous research achievements and proposals for
knowledge transfer in the context of kNN. Based on the advantages and limitations of
these methods, the objective and scope of the research are refined.
Chapter 3 develops a theory of knowledge transfer in the context of kNN using a
measure of relationship between class values of tasks. During the development of the
theory, a modified version of kNN, Non-Deterministic kNN (ND-kNN), is defined and
used.
Chapter 4 tests the theory developed in Chapter 3 using a prototype ND-kNN
system. Based on the results from the experiments, the advantages and limitations of the
method are discussed.
Chapter 5 concludes with a summary of the research, some important limitations,
and suggestions for future work.
Chapter 2
Background
2.1 Background on Inductive Learning and kNN
This section reviews the basic knowledge of inductive machine learning and the
kNN algorithm required for this research.
2.1.1 Supervised Inductive Learning
Inductive learning is an inference process that builds a model or hypothesis, h, of
the task, f, by using a set of training examples. If each training example contains input
attributes, x, and a correct target output value f(x), we call it supervised inductive
learning. If only input attributes are provided, then unsupervised learning, often called
clustering, can be undertaken, but not supervised learning. The objective of a supervised
inductive learning system is to develop or select a hypothesis h such that h(x) = f(x) for
all possible x. This research deals with supervised inductive learning and henceforth, the
word “learning” refers to supervised learning.
2.1.2 Classification
Classification is one type of learning task that, given a set of input attribute
values, outputs one of a set of discrete values, known as categories or classes. This
research is limited to discussing classification learning. Concept learning is the simplest
form of classification learning, where there are only two possible class values, such as
True or False, Positive or Negative, 1 or 0. If a classification task has more than two
class values, it is known as a multi-class learning problem.
2.1.3 Generalization Error
The generalization error is defined to be the error between the learner's
hypothesis, h(x), and the target function, f(x), for all x (Niyogi & Girosi, 1994). For
classification learning, the generalization error can be estimated as the fraction of
incorrect classifications made by the developed model on an independent test set. The
generalization accuracy equals one minus the generalization error.
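As an illustrative sketch (the function names and the toy model below are my own, not from the thesis), this estimate can be computed as follows:

```python
def generalization_error(model, test_set):
    """Estimate generalization error as the fraction of independent
    test instances that the model classifies incorrectly."""
    errors = sum(1 for x, target in test_set if model(x) != target)
    return errors / len(test_set)

def generalization_accuracy(model, test_set):
    """Generalization accuracy is one minus the generalization error."""
    return 1.0 - generalization_error(model, test_set)
```

For instance, a toy sign classifier that misclassifies one of four test instances has an estimated generalization error of 0.25 and a generalization accuracy of 0.75.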
2.1.4 The kNN Algorithm
Instead of developing a representation of a hypothesis, as Artificial Neural
Networks and Inductive Decision Trees do, the kNN algorithm simply stores the training
instances in a knowledge base. Generalization is postponed until classification time, when
a query instance is presented to the system for classification. kNN considers every
instance to be a point in an n-dimensional space, where n is the number of input
attributes. The basic assumption behind the kNN algorithm is that similar instances have
similar outputs. Similarity is based on the Euclidean Distance between instances. For
example, in Figure 1, instance Q is closer to instance A than to instance B, so Q should have
an output similar to A.
Figure 1. An example of a kNN task
Formally, as per (Mitchell, 1997), the kNN training algorithm is as follows:
Algorithm 2.1 kNN training algorithm
Given a training example x with n input attributes a1 … an and an output value v,
kNN learns x by simply storing it in its knowledge base X.
When kNN classifies a query instance q, a neighbourhood that contains the k
nearest training examples to q is created. Formally, as per (Mitchell, 1997), classification
is accomplished with the following algorithm:
Algorithm 2.2 kNN classification algorithm
1. Let x1 … xk denote the k instances from the training examples that are nearest to
xq in Euclidean distance.
2. Return
        f^(xq) ← argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(xi))
where v ranges over the set of possible output values V, and δ(a, b) = 1 if a = b,
else δ(a, b) = 0.
Euclidean distance is defined as follows:
Definition 2.1 Euclidean Distance
Consider two instances x and y with ith input attributes xi and yi; the
Euclidean distance is:
        d(x, y) = √( Σ_{i=1..n} (xi − yi)² )
Once a kNN task is trained, decision boundaries, which separate instances of
different classes, can be derived by enumerating every possible point in the input space.
In general, decision boundaries along with decision regions, which are the spaces
between decision boundaries, are considered the hypothesis developed by kNN for a task.
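The training and classification procedures of Algorithms 2.1 and 2.2 can be sketched in Python (a minimal illustration under the definitions above; the class and helper names are my own, not from the thesis):

```python
import math
from collections import Counter

def euclidean_distance(x, y):
    """Definition 2.1: straight-line distance between two instances."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

class KNN:
    """Plain (unweighted) kNN: training merely stores examples;
    all generalization is deferred to classification time."""

    def __init__(self, k=3):
        self.k = k
        self.knowledge_base = []  # list of (input attributes, class value)

    def train(self, x, v):
        # Algorithm 2.1: learning is just storage.
        self.knowledge_base.append((x, v))

    def classify(self, q):
        # Algorithm 2.2: majority vote among the k nearest neighbours of q.
        neighbours = sorted(self.knowledge_base,
                            key=lambda ex: euclidean_distance(ex[0], q))[:self.k]
        votes = Counter(v for _, v in neighbours)
        return votes.most_common(1)[0][0]
```

On a toy two-cluster concept task, a query near one cluster is assigned that cluster's class by the vote of its k nearest stored instances.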
2.1.5 Distance Weighted kNN
As we can see from section 2.1.4, the basic form of kNN treats every instance in
the neighbourhood equally. However, it can be argued that more similar instances
should contribute more to the final output. This motivation leads to a variation
called distance weighted kNN. This form takes the distance between the query point and
the instance into consideration when calculating the final output. Formally, the
classification algorithm becomes:
        f^(xq) ← argmax_{v ∈ V} Σ_{i=1..k} wi · δ(v, f(xi))
where wi = 1 / d(xq, xi)² and d is the Euclidean distance of Definition 2.1. If xq
exactly matches a stored instance xi (so that d(xq, xi) = 0), f(xi) is returned directly.
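A hedged sketch of the distance-weighted variant, assuming the common weighting wi = 1/d² and returning a stored instance's class directly on an exact match (the function name is illustrative, not from the thesis):

```python
import math
from collections import defaultdict

def weighted_knn_classify(knowledge_base, q, k=3):
    """Distance-weighted kNN: each of the k nearest neighbours votes
    with weight w_i = 1 / d(q, x_i)^2, so closer instances count more."""
    def dist(x):
        return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, q)))

    neighbours = sorted(knowledge_base, key=lambda ex: dist(ex[0]))[:k]
    scores = defaultdict(float)
    for x, v in neighbours:
        d = dist(x)
        if d == 0.0:
            return v  # an exact match decides the class outright
        scores[v] += 1.0 / d ** 2
    return max(scores, key=scores.get)
```

Unlike the unweighted vote, a single very close neighbour can here outweigh several more distant ones of a different class.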
2.2 Background of Knowledge Transfer
Haussler showed that the number of training examples required for developing a
sufficiently accurate hypothesis depends on the size of the hypothesis space (Haussler,
1988). If we can introduce more inductive bias, such as prior knowledge, to further
restrict the hypothesis space, the number of training examples that are required can be
reduced. The process of utilizing prior knowledge from previously learned tasks, or
secondary tasks, to influence the learning of a new task, the primary task, is called
knowledge transfer.
2.2.1 Inductive Bias and Prior Knowledge
The learning of a task cannot be accomplished without some assumptions about
the nature of the task. “A learner that makes no a priori assumptions regarding the
identity of the target concept has no rational basis for classifying any unseen instances”
(Mitchell, 1997). In the context of inductive learning, we call this a priori assumption
inductive bias. Inductive bias exists in every learning algorithm and influences the
hypotheses the algorithm develops or selects. Inductive bias can be classified into two
categories: preference and restriction. A preference bias favours certain
hypotheses over others. A restriction bias restricts the space of
possible hypotheses (the hypothesis space) of the learning system.
Prior knowledge is one type of inductive bias that includes knowledge of intended
use, knowledge of the source, analogy with previously learned tasks and/or knowledge of
the task domain (Silver, 2000). In this thesis, analogy with previously learned tasks and
knowledge of the task domain are jointly called task domain knowledge.
2.2.2 Knowledge Based Inductive Learning
Knowledge based inductive learning, or KBIL, is a learning method that relies on
knowledge of the task domain (domain knowledge) to reduce the hypothesis space that
the learning system has to search. After each new task is learned, the knowledge from
that task is retained in domain knowledge, so it can be reused when learning future tasks.
In the extreme case, if the new task is exactly the same as the old task, the inductive bias
introduced by domain knowledge should help develop an accurate hypothesis rapidly
from a minimum number of training examples.
2.2.3 Task Relatedness
The relatedness between tasks has been identified as a critical issue for the
success of knowledge transfer (Thrun, 1995). Consider a situation where the learner has a
diverse set of secondary tasks in domain knowledge. There are some secondary tasks that
are unrelated to the primary task and other secondary tasks that are highly related to the
primary task. The most related tasks must be selected by some measure of task
relatedness in order to maximize the contribution from the task domain.
The concept of task relatedness was first defined in (Caruana, 1997). Later on, the
definition of task relatedness was extended by Silver (2000). Formally, as per (Silver,
2000), task relatedness is defined as follows:
Definition 2.2: Task relatedness
Let Tk be a secondary task and T0 a primary task of the same domain,
with training examples Sk and S0 respectively. The relatedness of Tk
with respect to T0, in the context of a learning system L that uses
knowledge transfer, is the utility of using Sk along with S0 toward the
efficient development of an effective hypothesis for T0.
The nature of task relatedness can be examined from different perspectives: task
relatedness as a distance metric, task relatedness as similarity and task relatedness as
shared invariance (Silver, 2000). This thesis will focus on task relatedness in terms of
similarity.
2.2.4 Functional Similarity vs. Structural Similarity
Previous researchers have suggested two distinct forms of similarity: functional
similarity and structural similarity (Robins, 1996; Vosniadou & Ortony, 1989). In the
context of machine learning, functional similarity or surface similarity can be defined as
shallow, easily perceived, external similarity (Silver, 2000). In the context of kNN,
functional similarity can be described as the degree of sharing of decision regions.
Structural similarity can be defined as deep, often complex, internal feature similarity
(Silver, 2000). In the context of kNN, structural similarity has been defined as the use of
similar distance metrics for each of the input attributes.
2.3 Relevant Background in Probability and Statistics
This section reviews the relevant mathematics used in this research.
2.3.1 Conditional Probability
There are situations where the information “an event B has occurred” can
influence the probability of event A occurring. The probability of one event given that
another event has occurred is known as conditional probability. Formally, as per
(Devore, 2004), conditional probability is defined as follows:
Definition 2.3 Conditional Probability
For any two events A and B with P(B) > 0, the conditional probability
of A given that B has occurred is defined by
P(A | B) = P(A ∩ B) / P(B)
2.3.2 Conditional Probability Distributions for Discrete Random
Variables
The probability distribution of a discrete random variable, X, describes how the
total probability of 1 is distributed among the various possible X values (Devore, 2004)
(e.g. see Figure 2). Formally, the probability distribution or probability mass function
(pmf) of a discrete random variable is defined for every number x by p(x) = P(X = x). If all
probabilities in a probability distribution are conditional probabilities, the distribution is
called a conditional probability distribution.
Figure 2 An example of probability distribution with p(x) = 0.2 for all x
The variance of a probability distribution measures the spread of values in the
distribution. Formally, as per (Devore, 2004), variance of a probability distribution is
defined as follows:
Definition 2.4: Variance of probability distribution
Let X = {x1, x2, … , xn}. Then the variance of X is
Var(X) = Σi (xi − x̄)² / (n − 1)
where x̄ is the mean of X.
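The definition above can be checked numerically. The following sketch (illustrative Python, not part of the thesis prototype) computes the sample variance with the n − 1 denominator, which matches the worked values used later in the thesis (0.18 for {0.8, 0.2} and 0.5 for {1, 0}):

```python
def sample_variance(xs):
    """Sample variance with the (n - 1) denominator, as used in Definition 2.4."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

print(sample_variance([0.8, 0.2]))  # approx. 0.18
print(sample_variance([1.0, 0.0]))  # 0.5
```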
2.4 Previous Research on Knowledge Transfer in the Context of
kNN
2.4.1 Task Clustering (TC) Algorithm
One previous paper discusses the knowledge transfer in the context of the kNN
algorithm (Thrun & O'Sullivan, 1995). The proposed TC algorithm partitions task
domain knowledge into clusters of related tasks. All related tasks in the same cluster use
a common Euclidean distance metric, which is generally beneficial to all related tasks in
the cluster. When a new primary task is being learned, the relatedness between the new
task and each cluster is estimated by using each cluster’s distance metric to bias the
learning of the primary task. The cluster that helps the primary task achieve the highest
generalization accuracy is considered the most related. The primary task then uses the
distance metric of that cluster for classifying future query instances.
In summary, the TC algorithm measures the relatedness of two tasks by structural
similarity at the task level and transfers the structure (distance metric) of the secondary
task to the primary task.
2.4.2 Weight Vectors
Another paper proposes one alternative method of knowledge transfer for kNN as
a possible extension to knowledge transfer in ANN (Silver, 2000). The method relies on
the generation of virtual examples from previously learned kNN models and their
distance weight vectors. A measure of primary to secondary task relatedness is computed
based on the cosine of the angle between their respective weight vectors. A weight
vector for the new task is then re-computed using a gradient descent approach that
minimizes the error over all tasks weighted by their measure of relatedness. The
structural knowledge of the secondary tasks is transferred to the primary task while
minimizing a global error function. The resulting weight vector is used to predict the
output class for test query instances for the new task.
Once again, this method measures the relatedness by structural similarity at the
task level.
2.4.3 Summary
All methods that have been proposed to achieve knowledge transfer in the context
of kNN utilize structural similarity between tasks. From previous research, we know that
functional similarity and structure similarity are both important measures of relatedness
(Silver, 2000). Using only one of the two may miss identifying domain knowledge that is
beneficial to the primary task. Our objective is to capture the functional similarity
between kNN tasks.
Moreover, all methods that have been proposed measure the relatedness at the
task level. Previous research explores various functional measures of relatedness such as
the linear coefficient of correlation, the coefficient of determination and the Hamming distance
(Silver, 2000). All of these measures are aimed at the task level and may miss capturing
related portions of secondary tasks. For example, if the previously learned task T1 is
related to the primary task T0 only when T1 outputs class 1, then the measure of
relatedness at the task level will fail to capture this partial relatedness within a sub-region
of the input attribute space.
In summary, little research has been done on knowledge transfer in the context of
kNN and all of the methods that have been proposed transfer knowledge based on
structural similarity at the task level. The objective of this thesis is to develop a new
measure of relatedness based on functional similarity at the classification level and to use
the measure to achieve knowledge transfer from secondary kNN tasks to a new primary task.
Chapter 3
Selective Knowledge Transfer from kNN Tasks
3.1 Formal Definition of the Problem
In Chapter 2, several methods of knowledge transfer were presented but all of
them measure the relatedness at the task level. In the context of kNN, little work has
been done on knowledge transfer, and the methods that have been proposed focus on the structural
similarity. This thesis proposes a new theory of selective knowledge transfer in the
context of kNN such that:
1. Relatedness is measured between tasks at the classification level;
2. Relatedness is based purely on the functional knowledge of previously
learned tasks;
3. Functional knowledge is in the form of virtual instances.
3.2 Theory of Knowledge Transfer for kNN Concept Learning
kNN does not explicitly generalize the training examples to form a hypothesis. The
knowledge of a kNN system is represented by a pool of instances. Therefore, the most
natural way to transfer knowledge from previously learned kNN tasks is to utilize training
instances from those tasks.
Section 3.2 focuses on the theory of knowledge transfer with kNN when learning
concept tasks. A domain of synthetic tasks is defined and used throughout the thesis in
order to present and test the theory and associated methods. The theory is extended to
multi-class learning in Section 3.3.
3.2.1 A Synthetic Example
The kNN algorithm derives the decision boundaries directly from training
instances. If the number of available training examples is not sufficient to derive accurate
decision boundaries, then related background knowledge can be very helpful. Consider a
simple concept learning task as shown in Figure 3. The primary task, T0 has 7 training
instances, three of which have class value ‘-’. The kNN decision boundary, for k = 3, is
shown as roughly a vertical line between the positive and negative training instances.
The query “?” would be classified as positive under standard kNN.
It is also possible that the actual decision boundary for T0 is more complex than
this naïve hypothesis. For example, Figure 4 shows a horizontal region of negative
instances surrounded by positive instances. In this case the query instance would be
classified as negative. Provided that there is no background knowledge, the two possible
decision boundaries are equally likely, but standard kNN will prefer the boundary shown
in Figure 3.
Figure 3. The primary task, T0. The shaded area is the decision
region where all instances are of class value ‘-’. The decision region
is derived from the training examples of T0 using k = 3.
Because the knowledge of a kNN task is represented by instances, knowledge
transfer from T1 can be accomplished through the generation of virtual instances for T0
from T1’s kNN model. The problem of transfer can be reduced to answering two
questions:
1. How do we determine the class value of a virtual instance?
2. How do we select training examples from the secondary task so as to
generate virtual instances?
The following sections are dedicated to answering these two questions.
Figure 4 Possible decision boundaries for T0. The query, in this
case, is classified as negative.
Figure 5. Secondary task, T1. The shaded area is the decision
region where all instances are of class value ‘-’. The decision region
is derived from the training examples of T1 using k = 1.
3.2.2 Conditional Probability Distributions
The functional similarity between the training instances of two tasks describes a
degree of relatedness between the tasks. This relatedness can also be represented by
conditional probabilities. For the example shown in Section 3.2.1, let P(T0 = + | T1 = +)
equal the probability that an instance of T0 is positive given that an instance of T1 is
positive. Then we can express the relatedness of T1 to T0 by the conditional probabilities
P(T0 = + | T1 = +), P(T0 = - | T1 = +), P(T0 = + | T1 = -) and P(T0 = - | T1 = -). We do not
know the exact value of these conditional probabilities but they can be estimated by
observing the primary task and secondary task training instances.
One approach is to classify each training example of the primary task by the kNN
model of the secondary task. Let U+ be the number of training examples of the primary task that are
classified as positive by the secondary task and let C+ be the number of positive training
examples of the primary task that are classified as positive by the secondary task.
Then, P(T0 = + | T1 = +) can be estimated by C+/U+. Similarly, let U- be the number of
training examples of the primary task that are classified as negative by the secondary task
and let C- be the number of positive training examples of the primary task that are
classified as negative by the secondary task. Then, P(T0 = + | T1 = -) can be estimated by
C-/U-.
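This estimation procedure can be sketched in Python. Here the secondary task’s classifications are abstracted as a list of (primary label, secondary prediction) pairs; the function name and data representation are illustrative, not taken from the thesis prototype:

```python
def estimate_conditionals(pairs):
    """pairs: list of (primary_label, secondary_prediction), labels '+' or '-'.
    Returns estimates of P(T0=+ | T1=+) = C+/U+ and P(T0=+ | T1=-) = C-/U-."""
    u_pos = sum(1 for _, s in pairs if s == '+')               # U+
    c_pos = sum(1 for p, s in pairs if s == '+' and p == '+')  # C+
    u_neg = sum(1 for _, s in pairs if s == '-')               # U-
    c_neg = sum(1 for p, s in pairs if s == '-' and p == '+')  # C-
    return c_pos / u_pos, c_neg / u_neg

# The seven training examples of T0 from the worked example
# (U+ = 2, C+ = 0, U- = 5, C- = 4):
pairs = [('-', '+'), ('-', '+'),
         ('+', '-'), ('+', '-'), ('+', '-'), ('+', '-'), ('-', '-')]
print(estimate_conditionals(pairs))  # (0.0, 0.8)
```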
Let’s use the tasks in Section 3.2.1 as an example. All training examples of T0 are
classified by T1 with k = 1 (see footnote 1). There are 2 training instances of T0 that are classified as
positive by T1, so U+ = 2. There are 0 positive instances of T0 that are classified as
positive by T1, so C+ = 0. Therefore, P(T0 = + | T1 = +) = C+/U+ = 0/2 = 0. Similarly, there are 5
training instances of T0 that are classified as negative by T1, so U- = 5. There are 4
positive training instances of T0 that are classified as negative by T1, so C- = 4. Therefore,
P(T0 = + | T1 = -) = C-/U- = 4/5 = 0.8.
Because P(T0 = - | T1 = +) = 1- P(T0 = + | T1 = +) and P(T0 = - | T1 = -) = 1- P(T0 =
+ | T1 = -), we now have a complete conditional probability distribution (See Table 1).
3.2.3 Generation of Virtual Instances
Virtual instances for T0 can now be generated from the kNN model for T1 by
considering the conditional probability distributions in Table 1.
Table 1 Estimation of Conditional Probabilities

            T1 = “+”     T1 = “-”
T0 = “+”    0/2 = 0      4/5 = 0.8
T0 = “-”    2/2 = 1      1/5 = 0.2
1 k is chosen to be constant 1 while we estimate the conditional probabilities so as to make knowledge
transfer independent of the value of k specified by the user.
Figure 6 Estimating conditional probabilities using the training instances
of T0.
Because P(T0 = + | T1 = -) = 0.8, all virtual instances generated for T0 from
negative instances of T1 are positive with probability of 0.8 (or negative with probability
of 0.2). Similarly, because P(T0 = + | T1 = +) = 0, all virtual instances generated for T0
from positive instances of T1 are positive with probability of 0 (or negative with
probability of 1). In other words, the output value of a virtual instance is the conditional
probability distribution P(T0 | T1 = v), where v is the class value of the source instance of
T1.
Figure 7 shows the resulting T0 decision boundary with k = 3 after all virtual
instances are generated. Note that virtual instances are non-deterministic instances. They
have two possible class values, each with a probability. To make proper use of these
probabilities, the kNN algorithm has to be modified so that the probabilities are used to
weight the vote of the virtual instances. For example, the neighbourhood of the query
with k = 3 as shown in Figure 7 contains three instances, two of which are negative with
probability 1. The other instance in the neighbourhood is positive with probability 0.8
and negative with probability 0.2. Therefore, the resulting vote is 2.2 negative
(0.2+1+1 = 2.2) vs. 0.8 positive (0.8+0+0 = 0.8). Consequently, the query is classified as
negative. In Figure 7, the new decision boundaries reflect the horizontal boundaries
transferred from T1. The prior knowledge of T1 has biased the learning of T0.
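The weighted vote just described can be sketched as follows; each neighbour is represented as a mapping from class value to probability (an assumed representation, chosen for this sketch):

```python
from collections import defaultdict

def weighted_vote(neighbourhood):
    """Sum each neighbour's class probabilities and return the winning class."""
    tally = defaultdict(float)
    for instance in neighbourhood:
        for cls, p in instance.items():
            tally[cls] += p
    return max(tally, key=tally.get), dict(tally)

# The k = 3 neighbourhood of the query in Figure 7:
neighbourhood = [{'+': 0.0, '-': 1.0},   # virtual instance, 100% negative
                 {'+': 0.0, '-': 1.0},   # virtual instance, 100% negative
                 {'+': 0.8, '-': 0.2}]   # virtual instance, 80% positive
winner, tally = weighted_vote(neighbourhood)
print(winner)  # the query is classified negative (2.2 vs. 0.8)
```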
The first question in Section 3.2.1 can now be answered: How do we determine
the class value of a virtual instance? The class value of a virtual instance is determined
based on:
1. The class value of the corresponding instance of the secondary task
2. The conditional probability distributions calculated by classifying training
instances of the primary task using the secondary task. The class value of
the virtual instance is actually a conditional probability distribution over
all class values of the primary task (in this case, + or -).
Figure 7 Generating virtual instances from the instances of T1 based on
the conditional probability distributions
3.2.4 The Need for a Measure of Relatedness at the Classification
Level
In the previous example, all instances of T1 are transferred to T0 as virtual
instances. This can be problematic as the number of instances from T1 increases. The
computational time and space for classifying a query instance for T0 grows as a function
of the number of training instances (actual and virtual). Ideally, we only want to transfer
instances that are beneficial to the learning of the primary task, from the instances of a
secondary task that are most related to the primary task. Therefore, it is crucial to find a
measure of relatedness that can choose the best instances to transfer.
Sometimes tasks are partially related at the classification level. For instance, the
example in Section 3.2.1 shows that T1 is totally related to T0 when T1 outputs “+” but is
far less related to T0 when T1 outputs “-”. If the relatedness was measured at the task
level, the high relatedness between the two tasks when T1 outputs “+” would be mitigated
by the comparatively low relatedness between two tasks when T1 outputs “-”. In order to
utilize the relatedness between two tasks more effectively, a relatedness measure at the
class level is needed.
We propose that conditional probability distributions can be used to estimate
relatedness separately for each class value of the primary and secondary tasks. The only
question left is how to quantify the relatedness, implied by the conditional probability
distribution, between T0 and the current secondary task relative to that of other secondary
tasks.
3.2.4 Using Variance to Measure Classification Relatedness
If P(T0 = + | T1 = +) = P(T0 = - | T1 = +) = 0.5, the associated virtual instances will
have a probability of 0.5 positive and 0.5 negative so they will add no value to the
development of decision boundaries when learning T0. We can say that these virtual
instances are unrelated to the learning of the primary task. However, if P(T0 = + | T1 = +)
= 1, associated virtual instances are very related to T0. Therefore, the variance of the
conditional probability distribution indicates the relative degree of relatedness between
the tasks at the class level. The greater the variance, the greater the relatedness
between the tasks.
Consider that two concept learning tasks, T0 and T1 are totally unrelated when T1
outputs “+”, then P(T0 = + | T1 = +) = P(T0 = - | T1 = +) = 0.5. The variance of the
distribution P(T0 | T1 = +), Var[P(T0 | T1 = +)] = 0. In this case, T0 and T1 are minimally
related when T1 outputs +. On the other hand, when T0 and T1 are identical, P(T0 = + | T1
= +) = 1 and Var[P(T0 | T1 = +)] = 0.5. In this case, T0 and T1 are maximally related. If T0
= ¬T1, then P(T0 = + | T1 = +) = 0 but P(T0 = - | T1 = +) = 1.0 and Var[P(T0 | T1 = +)] =
0.5. Once again, T0 and T1 can be considered maximally related for the positive output
class of T1. For the tasks shown in Section 3.2.1, P(T0 = + | T1 = -) = 0.8 and Var[P(T0 | T1
= -)] = 0.18. In this case, two tasks are partially related when T1 outputs negative.
As the relatedness between two tasks increases, the variance of the resulting
conditional probability distribution increases and is always in the range of 0 to 0.5. For
simplicity, we normalize the variance to the range of 0 to 1. Formally,
Definition 3.1: Classification Relatedness of Concept Learning
Tasks
The classification relatedness of the secondary task Ti with respect to
the primary task T0 when Ti outputs class v:
Rel(T0, Ti = v) = Var[P(T0 | Ti = v)] / 0.5
As mentioned in Section 3.2.3, the class value of a virtual instance is the
conditional probability distribution P(T0 | Ti = v). Therefore:
Definition 3.2: Rel(T0, x), the relatedness of an instance x with
respect to the primary task, can be measured by the variance of the
output probability distribution of x. The relatedness of the original
training examples of the primary task is always equal to 1.
Once the classification relatedness is calculated, a minimum Rel(T0, Ti = v), the
acceptance threshold of relatedness, can be set to filter out the virtual instances that are
generated from tasks that do not meet a minimum acceptable level of relatedness. The
setting of the acceptance threshold of relatedness is currently done manually and is
intended to reduce the computational time and space for classifying a query instance. In
general, this value should be a small number in the range 0.001-0.02.
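A minimal sketch of the normalized measure, under the same sample-variance convention used earlier (function names are illustrative):

```python
def sample_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def relatedness(dist):
    """Classification relatedness for concept tasks: Var[P(T0 | Ti = v)] / 0.5."""
    return sample_variance(dist) / 0.5

print(relatedness([0.5, 0.5]))  # 0.0 -> unrelated
print(relatedness([1.0, 0.0]))  # 1.0 -> maximally related
print(relatedness([0.8, 0.2]))  # approx. 0.36 -> partially related
```

Virtual instances whose relatedness falls below the acceptance threshold would simply be discarded.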
Section 3.2.1 also asks the following question: How do we select training
examples from the secondary task so as to generate virtual instances? The answer is
that selection is based on the classification relatedness. Virtual instances are only
generated from training instances of secondary tasks whose classification relatedness
with respect to the primary task is greater than the acceptance threshold of relatedness.
3.2.5 Duplicated Instances
Instances from two or more tasks that share the same input attribute values are
defined to be duplicated instances. It is important that duplicated instances are carefully
handled when transferring knowledge from secondary tasks. For example, if we
overwrite a related virtual instance with a less related virtual instance, the quality of the
knowledge transfer is reduced.
One approach to handling duplicated instances is to mathematically combine their
values such as taking the average of all probability distributions. However, this requires
that the tasks are conditionally independent of each other. If this assumption does not
hold, the simple linear combination of probabilities is not valid either.
Therefore, we propose a naïve method: an existing (actual or virtual) instance can
be overwritten only if a duplicate is more related. This guarantees that the most beneficial
instance is retained. Since the classification relatedness of an actual training example is
always 1, no duplicate instance can overwrite an actual training instance.
3.2.6 A Neighbourhood of Virtual Instances
In Figure 7, k =3, so by definition the k nearest neighbours of the query point
contains three virtual instances and no original training example of T0. It doesn’t seem
problematic at first glance. However, if the virtual instances in the neighbourhood are not
related to the primary task, there is little confidence in the final classification. For
example, if all instances in the neighbourhood are virtual instances indicating 51%
positive, the final classification is “+” but only at the 51% confidence level. In order to
maintain a reasonable confidence level of the final classification, it is important to
include sufficient actual training instances during kNN generalization.
One approach is to employ a dynamic neighbourhood size when the k nearest
neighbours are selected. The method works in the following way:
1. Let k be the size of the neighbourhood specified by the user. Select the k
nearest actual training instances to the query point.
2. Select at most k virtual instances nearest to the query point such that none
are farther from the query point than the kth nearest actual training
instance.
Therefore, the number of nearest neighbours ranges from k to 2k, where there are,
at most, as many virtual instances as actual training instances. Limiting the number of
virtual instances ensures that the knowledge of the virtual instances does not overwhelm
the knowledge of the actual training examples.
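The two-step selection above can be sketched as follows, with instances represented as (distance-to-query, payload) pairs (an illustrative representation):

```python
def dynamic_neighbourhood(actual, virtual, k):
    """Select the k nearest actual instances, then at most k virtual instances
    that are no farther from the query than the k-th nearest actual instance."""
    nearest_actual = sorted(actual)[:k]
    d = nearest_actual[-1][0]                       # distance of the k-th actual
    nearest_virtual = [v for v in sorted(virtual) if v[0] <= d][:k]
    return nearest_actual + nearest_virtual         # between k and 2k neighbours

actual = [(0.9, 'a1'), (0.4, 'a2'), (1.5, 'a3')]
virtual = [(0.3, 'v1'), (0.8, 'v2'), (1.0, 'v3'), (1.2, 'v4')]
print(dynamic_neighbourhood(actual, virtual, k=2))
# two actual instances (a2, a1) plus the virtual instances within distance 0.9 (v1, v2)
```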
3.2.7 Steps of Knowledge Transfer from kNN Concept Learning Tasks
In summary, knowledge transfer from a set of secondary tasks T1…Tn to the
primary task T0 in the context of kNN concept learning tasks with acceptance threshold of
relatedness α involves the following steps:
For each secondary task Ti ( 1 ≤ i ≤ n)
1. Determine conditional probability distributions P(T0 | Ti = +) and P(T0 | Ti
= -) by classifying training examples of T0 using Ti
2. For each training instance x for Ti, whose class value is v (v = + or -):
a. Generate a virtual instance x', which shares the same set of input
attributes as x and outputs P(T0 | Ti = v).
b. If Rel(T0, x') < α, discard x'.
c. Add x' to T0, overwriting any existing duplicated instances
according to the following: If there exists an instance x0 in T0,
which shares the same set of input attribute values as x', discard
x' if Rel(T0, x') < Rel(T0, x0)
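The steps above can be combined into a single sketch. The secondary kNN model is abstracted as a classification function, instances are keyed by their input-attribute tuples, and all names are illustrative rather than taken from the thesis prototype:

```python
def sample_variance(xs):
    n, mean = len(xs), sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def transfer(primary, classify_secondary, secondary_instances, alpha):
    """primary: dict {input tuple: '+'/'-'}; classify_secondary: input -> '+'/'-';
    secondary_instances: list of (input tuple, class value) from task Ti.
    Returns a pool {input tuple: (class distribution, relatedness)}."""
    # Step 1: estimate P(T0 = + | Ti = v) from the primary training examples.
    counts = {'+': [0, 0], '-': [0, 0]}                       # v -> [C_v, U_v]
    for x, label in primary.items():
        v = classify_secondary(x)
        counts[v][1] += 1
        if label == '+':
            counts[v][0] += 1
    cpd = {v: (c / u if u else 0.5) for v, (c, u) in counts.items()}
    # Step 2: generate virtual instances, filter by alpha, resolve duplicates.
    pool = {x: ({lab: 1.0}, 1.0) for x, lab in primary.items()}  # actual: Rel = 1
    for x, v in secondary_instances:
        p = cpd[v]
        rel = sample_variance([p, 1 - p]) / 0.5
        if rel < alpha:
            continue                              # step 2b: discard unrelated
        if x in pool and pool[x][1] >= rel:
            continue                              # step 2c: keep the more related
        pool[x] = ({'+': p, '-': 1 - p}, rel)
    return pool
```

For example, with a secondary model that agrees with the primary labels on every training example, the estimated distributions are deterministic, every generated instance has relatedness 1, and only inputs unseen by the primary task are added to the pool.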
3.3 Generalizing to Multi-Classes
Section 3.2 has presented the theory of knowledge transfer in the context of kNN
concept learning tasks. This section extends the theory to multi-class learning problems.
The rationale is the same, i.e. use conditional probabilities to determine the relatedness of
virtual instances to the primary task. The only difference between concept learning tasks
and multi-class tasks is the number of classes. This requires extending the formulas found
in Section 3.2.
3.3.1 Conditional Probability Distribution for Multi-class Tasks
Though we can still use the definition of the conditional probability distribution, a
new formula is required for the purpose of the multi-class problem. Formally:
Definition 3.3: Conditional Probability Distribution for Multi-Class
Tasks
First, define the conditional probability distribution per
classification, CPDC(T0 | Ti = vm), as a function that takes the mth
element vm of the set M, which contains all possible class values of Ti, as
the input and outputs the set:
{(vn, P(T0 = vn | Ti = vm)) | vn ∈ N}
where vn is the nth element of the set N, which contains all possible
class values of T0.
Then, define CPDT(T0 | Ti) as a function that takes Ti as the input
and outputs the set:
{(vm, CPDC(T0 | Ti = vm)) | vm ∈ M}
where M is the set of possible output values of Ti.
Essentially, CPDC is the conditional probability distribution and CPDT is the
joint probability distribution. They are different from the classical definition of the
conditional probability distribution in that both CPDC and CPDT output the actual class
value with which the probability is associated.
As an example, the conditional probability distributions for the example in
Section 3.2 can be expressed as:
CPDC(T0 | T1 = +) = {(+, 0), (-, 1)},
CPDC(T0 | T1 = -) = {(+, 0.8), (-, 0.2)} and
CPDT(T0 | T1) = {(+, {(+, 0), (-, 1)}), (-, {(+, 0.8), (-, 0.2)})}.
In general, the output value of virtual instances of T0, which are generated from
instances with class value v in T1, is CPDC(T0 | T1 = v).
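Represented as data, a CPDC maps each class value of T0 to a probability, and a CPDT maps each class value of Ti to a CPDC. The sketch below (illustrative names, not from the thesis prototype) rebuilds the example distributions from (T0 label, T1 prediction) pairs:

```python
def build_cpdt(pairs, classes_t0, classes_t1):
    """pairs: list of (t0_label, t1_prediction) over the primary training set."""
    cpdt = {}
    for v in classes_t1:
        hits = [t0 for t0, t1 in pairs if t1 == v]
        cpdt[v] = {u: (hits.count(u) / len(hits) if hits else 0.0)
                   for u in classes_t0}
    return cpdt

pairs = [('-', '+'), ('-', '+'),
         ('+', '-'), ('+', '-'), ('+', '-'), ('+', '-'), ('-', '-')]
print(build_cpdt(pairs, ['+', '-'], ['+', '-']))
# {'+': {'+': 0.0, '-': 1.0}, '-': {'+': 0.8, '-': 0.2}}
```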
3.3.2 Classification Relatedness
The classification relatedness also needs to be extended for multi-class problems.
But first, the variance of a CPDC must be defined:
Definition 3.4: Variance of CPDC
First, define PP(CPDC(T0 | Ti = vm)) as a function to extract the
probability portion of a CPDC:
PP(CPDC(T0 | Ti = vm)) = {P(T0 = vn | Ti = vm) | vn ∈ N}
where vm is the mth element of the set M, which contains all
possible class values of Ti, and vn is the nth element of the set N,
which contains all possible class values of T0.
Then, define Var[CPDC(T0 | Ti = vm)] as a function that calculates
the variance of a CPDC:
Var[CPDC(T0 | Ti = vm)] = Var[PP(CPDC(T0 | Ti = vm))]
As with concept learning tasks, relatedness increases as the variance
increases. The maximum variance is not always equal to 0.5 as in concept learning tasks
but depends on the number of class values that the primary task has.
Claim 3.1: Range of the Variance of CPDC
Given a CPDC(T0 | Ti = vm) whose probability portion has n
elements, i.e. |PP(CPDC(T0 | Ti = vm))| = n (n > 1), define:
1. X = PP(CPDC(T0 | Ti = vm)), whose elements are non-negative and sum to 1;
2. Vmax = 1/n.
Then the variance, Var[CPDC(T0 | Ti = vm)], is in the range [0, Vmax].
By the definition of the variance, the variance is always non-negative. Therefore, we
only need to justify whether the maximum value of the variance equals Vmax.
Proof 3.1: Maximum value of the variance
Let X = {x1, x2, … , xn}. By definition of the sample variance,
Var(X) = (Σi xi² − n·x̄²) / (n − 1).
Because Σi xi = 1, x̄ = 1/n. So if Σi xi² is maximized, Var(X) is
maximized.
Consider Σi xi². Since xi ≥ 0 and Σi xi = 1, we can get Σi xi² ≤ 1, and
Σi xi² = 1 if there is one xi = 1.
Therefore, Vmax = (1 − n(1/n)²) / (n − 1) = (1 − 1/n) / (n − 1) = 1/n.
Note that in the case of concept learning, n = 2. Therefore, Vmax = 0.5, which is
exactly the maximum value of variance concluded in Section 3.2.4.
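Claim 3.1 can also be checked numerically: a one-hot probability distribution of length n attains the maximum sample variance, which equals 1/n (up to floating point):

```python
def sample_variance(xs):
    n, mean = len(xs), sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

for n in (2, 3, 4, 5):
    one_hot = [1.0] + [0.0] * (n - 1)     # one class certain, the rest impossible
    print(n, sample_variance(one_hot))    # equals Vmax = 1/n for each n
```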
Then, the definition of classification relatedness can be extended to:
Definition 3.5: Classification Relatedness for Multi-Class Tasks
The classification relatedness of the secondary task Ti with respect to
the primary task T0 when Ti outputs vm is defined as follows:
Rel(T0, Ti = vm) = Var[CPDC(T0 | Ti = vm)] / Vmax
Also, define the classification relatedness of an instance x in T0,
whose class value is CPDC(T0 | Ti = vm), as:
Rel(T0, x) = Rel(T0, Ti = vm)
3.3.3 Steps of Knowledge Transfer from kNN Multi-Class Tasks
In summary, knowledge transfer from a set of secondary multi-class tasks T1…Tn
to the primary multi-class task T0 in the context of kNN with acceptance threshold of
relatedness α involves:
For each secondary task Ti (1 ≤ i ≤ n):
1. Determine CPDT(T0 | Ti) by classifying training examples of T0 using Ti
2. For each training instance x in Ti, whose class value is vm:
a. Generate a virtual instance x', which shares the same set of input
attributes as x and outputs CPDC(T0 | Ti = vm).
b. If Rel(T0, x') < α, discard x'.
c. Add x' to T0, overwriting any existing duplicated instances in
accord with the following: If there exists an instance x0 in T0,
which shares the same set of input attribute values as x', discard
x' if Rel(T0, x') < Rel(T0, x0).
3.4 Implementation
3.4.1 VND-kNN
In order to implement the theory of knowledge transfer for kNN tasks, the classic
kNN algorithm had to be modified so that:
1. It supports non-deterministic instances, which use CPDC as the class
value
2. It supports a dynamic nearest neighbourhood with a size ranging from k to
2k.
ND-kNN will henceforth denote this modified version of kNN algorithm, which
supports non-deterministic instances. ND-kNN must have a data structure for storing
virtual non-deterministic instances including the CPDC value. Formally,
Algorithm 3.1: ND-kNN Training Algorithm for Virtual Instances
Given a virtual non-deterministic instance x, which has m input
attributes a1…am and n output class values v1…vn, each with an
associated probability Pi (1 ≤ i ≤ n)
ND-kNN learns x by just storing it into its knowledge base X.
Then according to the theory described in the previous sections, ND-kNN shall
create a neighbourhood, which contains k original training examples and at most k virtual
non-deterministic instances as follows:
Algorithm 3.2: ND-kNN Selection of Neighbourhood Algorithm
Given a query q, add the k nearest original training examples from X
into the neighbourhood N. Let d be the distance to q from the
farthest instance in N. Select at most k nearest virtual instances from
X, whose distance to q is less than d and add them to N.
Finally, in the voting phase, every virtual instance contributes a vote for each
class value vi weighted by its probability Pi.
Algorithm 3.3 ND-kNN Classification Algorithm
Let x1…xr denote the r instances in the neighbourhood N of size r for
query q. Every instance has m input attributes a1…am and n output
values v1…vn, each being associated with a probability Pi (1 ≤ i ≤ n).
Return argmax over v ∈ V of Σj Σi δ(vij, v) · Pij
where vij is the ith class value of the jth instance in the neighbourhood
N; δ(vij, v) = 1 if vij = v, else δ(vij, v) = 0; V contains all possible
class values; Pij is the Pi of the jth instance in the neighbourhood N.
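The probability-weighted vote of Algorithm 3.3 applies to any number of classes; a sketch, with neighbours represented as class-to-probability mappings (an assumed representation):

```python
from collections import defaultdict

def nd_knn_classify(neighbourhood):
    """Return the class value with the largest probability-weighted vote."""
    votes = defaultdict(float)
    for instance in neighbourhood:
        for v, p in instance.items():   # only matching class values accumulate
            votes[v] += p
    return max(votes, key=votes.get)

# A three-class neighbourhood: 'b' wins with weight 0.3 + 0.7 + 1.0 = 2.0.
neighbourhood = [{'a': 0.6, 'b': 0.3, 'c': 0.1},
                 {'a': 0.2, 'b': 0.7, 'c': 0.1},
                 {'b': 1.0}]            # deterministic training instance
print(nd_knn_classify(neighbourhood))  # b
```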
Using ND-kNN, a Virtual-instance-based ND-kNN (VND-kNN) system can be
built to achieve life-long learning. As shown in Figure 8, when a new task is presented, the
system first measures the similarity between the new task and all tasks in the task
domain. Then, based on the similarity, virtual instances are generated from the secondary
tasks for the new task. Once the new task is successfully learned, the new task (i.e. all its
instances) is stored into the task domain.
Figure 8. Architecture of VND-kNN
3.4.2 The Prototype System
A VND-kNN software system was constructed in C++ using Microsoft Visual
C++. The system consists of around 3000 lines of code and employs multi-threaded
programming so as to enhance the overall performance. The system implements a
standard Windows graphical user interface and allows the user to manipulate multiple
tasks in the task domain at the same time. All tasks and domains can be saved for future
use.
An object-oriented approach was used when the system was first designed. Each
kNN model is an object containing methods that allow the user to train the model with
training instances and classify a query instance with the k nearest neighbours. Every
instance in a kNN model is represented by an Instance object. The Instance object uses a
vector to represent the input attributes and an Output object to represent the instance’s
output class value. The Output object overrides the data type conversion operators. In this
way, both actual training instances and virtual instances can share the same data
representation. The Instance object also has an attribute indicating whether the instance is
an actual training instance or a virtual instance so that ND-kNN can process them
separately.
3.5 Summary
The method of knowledge transfer presented in Sections 3.2 and 3.3 uses
conditional probability distributions that utilize prior knowledge of previously learned
tasks at the classification level. Moreover, the employment of an acceptance threshold of
relatedness can prevent unrelated virtual instances from being generated so that the
computational space and time required to classify a query instance is minimized.
Limiting the number of virtual instances in the k nearest neighbourhood also prevents the
growth of negative inductive bias from a large number of unrelated secondary tasks. In
this way, virtual instances generated from secondary tasks can provide a positive
inductive bias through knowledge transfer when learning the primary task.
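As a sketch of the acceptance test (the data layout and names here are illustrative, not the prototype's interface), filtering candidate virtual instances against the threshold amounts to:

```python
def accept_virtual(candidates, relatedness, threshold=0.001):
    """Keep a candidate virtual instance only if the relatedness of the
    secondary-task class value it was generated from meets the threshold.
    candidates: list of (input_vector, secondary_class) pairs
    relatedness: dict mapping a secondary class value c to Rel(T0 | T1 = c)
    """
    return [(x, c) for (x, c) in candidates if relatedness[c] >= threshold]

# With the relatedness values reported for one trial of Experiment 1:
rel = {0: 0.0019, 1: 0.1933}
pool = [([1.0, 2.0], 0), ([3.0, 4.0], 1), ([5.0, 6.0], 1)]
kept = accept_virtual(pool, rel, threshold=0.001)   # all three pass
tight = accept_virtual(pool, rel, threshold=0.01)   # only class-1 instances
```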
Chapter 4
Empirical Studies
This chapter summarizes and discusses the results of several experiments
conducted using a prototype system based on the theory of Chapter 3. The first
experiment shows the modified kNN method’s ability to transfer knowledge from a
partially related secondary task. The second experiment shows the method’s ability to
mitigate the negative inductive bias from an unrelated task. The third experiment shows
the variation in positive inductive bias as a function of the relatedness between the
primary and secondary tasks. The fourth experiment shows the method’s ability to
transfer knowledge from multiple secondary tasks. The fifth experiment focuses on a
recognized limitation of the method. The last experiment tests the method’s maximum
capacity of knowledge transfer in the context of real world tasks and the ability to handle
multi-class learning problems.
4.1 The Bitmap Domain
In order to test the effectiveness of knowledge transfer in the context of kNN, a
synthetic domain of tasks was developed, called the Bitmap Domain. The domain
contained 4 different tasks, each having 1000 instances. The output class values of
instances are determined by the bitmaps shown in Figure 9 and Figure 10. Each instance
has 4 numerical input attributes a, b, c and d, each ranging from 0 to 10. The attributes of
each instance in every task are divided into 2 pairs: (a, b) and (c, d). Each pair of
attributes can determine a class value from the corresponding number in the bitmap, using
the attributes as 2-D coordinates (always rounding up). For example, a = 1.3 and b = 7.8
outputs 1 because the 2nd column, 8th row of the bitmap in Figure 9 is 1.
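The lookup described above can be sketched as follows; the 10x10 bitmap contents here are a stand-in (the real patterns are those of Figures 9 and 10):

```python
import math

def bitmap_class(x, y, bitmap):
    """Map a pair of attribute values in (0, 10] to a bitmap cell.
    Coordinates are always rounded up, so x = 1.3 selects column 2
    and y = 7.8 selects row 8; the cell's value is the class label."""
    col = math.ceil(x)              # 1..10
    row = math.ceil(y)              # 1..10
    return bitmap[row - 1][col - 1]

# Stand-in bitmap: all 0s except the cell at column 2, row 8.
bitmap = [[0] * 10 for _ in range(10)]
bitmap[7][1] = 1
print(bitmap_class(1.3, 7.8, bitmap))   # -> 1
```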
4.1.1 Task T0
The class values of instances in T0 depend on input attributes a and b using
bitmap Figure 9. The instance is classified as 1 if and only if (a, b) selects a 1 in the
bitmap. Input attributes c and d are generated randomly.
Figure 9. Bitmap for T0, T2 and T3.
The input space is divided into 100 small rectangular regions. Each
number represents the class value of one of the 100 regions.
4.1.2 Task T1
The class values of instances in T1 are based on input attribute a and b using the
bitmap Figure 10. The instance is classified as 1 if and only if the pair (a, b) selects a 1 in
the bitmap. Input attributes c and d are generated randomly. Note that T1 is quite related
to T0 for the lower portion of the bitmaps.
Figure 10. Bitmap for T1.
It is somewhat similar to the one for T0, especially for the
distribution of 1’s
4.1.3 Task T2
The class values of instances in T2 depend on input attributes c and d using the
bitmap of Figure 10. An instance is classified as 1 if and only if the pair (c, d) selects a 1 in
the bitmap. Input attributes a and b are generated randomly. The output class values of T2
depend only on the last two attributes. T2 is unrelated to T0 or T1.
4.1.4 Task T3
The class values of instances in T3 depend on all four input attributes using the
bitmap of Figure 9. An instance is classified as 1 if and only if both pairs (a, b) and (c, d)
select a 1 in the bitmap. T3 is related to both T0 and T1; however, it is most related to T0.
Figure 11 shows the relations between the tasks in the domain.
Figure 11. The relation between tasks in Bitmap domain
4.2 Experiment 1: Transfer from a Partially Related Task
In this experiment we are interested in whether a previously learned task will
benefit the learning of the primary task if the two tasks are partially related.
4.2.1 Tasks
Experiment 1 uses T0 of the Bitmap domain as the primary task and T1 as the
previously learned secondary task. T0 has 200 training instances and a test set of 800
instances. T1 was previously trained using 1000 instances. The goal of this experiment is
to find out whether knowledge transfer from T1 will improve the generalization accuracy
of T0.
4.2.2 Method
The experiment consists of 10 repeated trials where each trial had the following
steps:
1. Generate random training and test sets for T0
2. Train the kNN system by loading the training instances.
3. Test the generalization accuracy of kNN for T0 using the test set with k =
3, 5, 7
4. Transfer knowledge from T1 by generating virtual instances for T0. The
acceptance threshold of relatedness was set to 0.001 based on preliminary
testing.
5. Test the generalization accuracy of kNN for T0 using the test set with k =
3, 5, 7
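The summary statistics reported with each table (mean, sample standard deviation, and a 95% confidence half-width, which matches 1.96·s/√n) can be reproduced from the per-trial accuracies; the values below are the k = 3, T0-alone column of Table 2:

```python
import math
import statistics

def summarize(accuracies):
    """Mean, sample standard deviation, and 95% confidence half-width
    (normal approximation: 1.96 * s / sqrt(n)) over repeated trials."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    stdev = statistics.stdev(accuracies)
    conf95 = 1.96 * stdev / math.sqrt(n)
    return mean, stdev, conf95

# k = 3 accuracies of T0 on its own, trials 1-10 of Table 2:
acc = [0.676654, 0.690387, 0.674157, 0.665418, 0.670412,
       0.667915, 0.705368, 0.734082, 0.722846, 0.704120]
mean, stdev, conf95 = summarize(acc)
# mean ~ 0.691136, stdev ~ 0.024381, conf95 ~ 0.01511
```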
4.2.3 Results
The results of the experiment are summarized in Table 2 and Figure 12.
Table 2. Results of Experiment 1. The generalization accuracy of
T0 before and after the knowledge transfer with k = 3, 5 and 7
          T0 on its own                 T0 with T1's knowledge
Trials    k = 3     k = 5     k = 7     k = 3     k = 5     k = 7
1         0.676654  0.680400  0.689139  0.684145  0.689139  0.700375
2         0.690387  0.680400  0.686642  0.699126  0.699126  0.676654
3         0.674157  0.675406  0.655431  0.687890  0.686642  0.672909
4         0.665418  0.675406  0.691635  0.671660  0.696629  0.699126
5         0.670412  0.672909  0.665418  0.676654  0.694132  0.679151
6         0.667915  0.661673  0.665418  0.681648  0.679151  0.676654
7         0.705368  0.701623  0.691635  0.720350  0.707865  0.706617
8         0.734082  0.726592  0.714107  0.739076  0.741573  0.730337
9         0.722846  0.702871  0.699126  0.727840  0.710362  0.714107
10        0.704120  0.682896  0.684145  0.715356  0.696629  0.687890
Stdev     0.024381  0.018963  0.017611  0.023651  0.017277  0.019014
95%Conf   0.015111  0.011753  0.010915  0.014659  0.010708  0.011785
Mean      0.691136  0.686018  0.684270  0.700375  0.700125  0.694382
Figure 12. Results of Experiment 1. Mean generalization accuracy of
T0 before and after the knowledge transfer.
Results show that the mean generalization accuracy of T0 after knowledge transfer
from T1 was higher than the mean generalization accuracy of T0 on its own (p < 0.001, p
< 0.001 and p = 0.002 for k = 3, 5 and 7 respectively). It is important to note that the
relatedness between T0 and T1 in one of the test trials was Rel(T0 | T1 = 0) = 0.0019 and
Rel(T0 | T1 = 1) = 0.1933. This suggests that T1 is more related to T0 when T1 outputs 1 than
when T1 outputs 0, and shows that a measure of relatedness at the classification level can
be helpful.
4.3 Experiment 2: Unrelated Tasks
The previous experiment concerned only related secondary tasks. However, it is
possible that background knowledge contains some tasks that are unrelated to the primary
task. In this experiment, we examine the method’s ability to mitigate the transfer of
negative inductive bias to a primary task from an unrelated secondary task.
4.3.1 Tasks
T0 of the Bitmap domain is the primary task and T2 is the previously learned
secondary task. T0 has 200 training instances and a test set of 800 instances. T2 was
previously trained to a 0.85 level of accuracy using 1000 instances. Based on the
description of the Bitmap domain in Section 4.1, we consider T2 to be totally unrelated to
T0, because T2 and T0 do not use the same input attributes.
4.3.2 Method
The experiment consisted of 10 repeated trials where each trial had the following
steps:
1. Generate random training and test sets for T0
2. Train the kNN system by loading the training instances.
3. Test the generalization accuracy of kNN for T0 using the test set with k =
3, 5, 7
4. Transfer knowledge from T2 by generating virtual instances for T0. The
acceptance threshold of relatedness was set to 0.001 based on preliminary
testing.
5. Test the generalization accuracy of kNN for T0 using the test set with k =
3, 5, 7
4.3.3 Results
Table 3 and Figure 13 show that the knowledge transfer method ensures that prior
knowledge from T2 does not adversely affect the learning of T0. When k = 3 and k = 5,
the generalization accuracy of T0 before and after the knowledge transfer remains the
same. This indicates that T2 does not affect the learning of T0 at all. When k = 7, we still
cannot conclude statistically that the knowledge from T2 introduces a negative bias into T0 (p
= 0.721). In summary, the experiment shows that the unrelated task T2 did not affect the
learning of T0 negatively.
Table 3. Results of Experiment 2. The generalization accuracy of
T0 before and after the knowledge transfer with k = 3, 5 and 7
          T0 on its own                 T0 with T2's knowledge
Trials    k = 3     k = 5     k = 7     k = 3     k = 5     k = 7
1         0.729089  0.705368  0.716604  0.729089  0.705368  0.710362
2         0.675406  0.686642  0.677903  0.675406  0.686642  0.691635
3         0.701623  0.670412  0.699126  0.701623  0.670412  0.699126
4         0.667915  0.675406  0.664170  0.667915  0.675406  0.670412
5         0.697878  0.694132  0.704120  0.697878  0.694132  0.704120
6         0.699126  0.685393  0.677903  0.699126  0.685393  0.677903
7         0.672909  0.691635  0.692884  0.672909  0.691635  0.687890
8         0.696629  0.701623  0.697878  0.696629  0.701623  0.694132
9         0.696629  0.701623  0.697878  0.696629  0.701623  0.694132
10        0.705368  0.690387  0.682896  0.705368  0.690387  0.695381
Stdev     0.018086  0.011327  0.015349  0.018086  0.011327  0.011712
95%Conf   0.011210  0.007021  0.009513  0.011210  0.007021  0.007259
Mean      0.694257  0.690262  0.691136  0.694257  0.690262  0.692509
Figure 13. Results of Experiment 2. Mean generalization accuracy
of T0 before and after the knowledge transfer.
4.4 Experiment 3: Variation in Transfer from More and Less
Related Tasks
This experiment examines the transfer of knowledge from two related tasks to the
primary task, where one secondary task is more related to the primary task than the other.
We expect the more related secondary task to benefit the primary task the most by
generating the greater positive inductive bias. This should result in better generalization
accuracy for the primary task.
4.4.1 Tasks
T3 of Bitmap domain is the primary task. T0 and T1 are used as the previously
learned secondary tasks. T3 has 200 training examples and a test set of 800 instances. T1
and T0 were previously trained using 1000 instances each. Based on the
description of the Bitmap domain in Section 4.1, T3 is considered more related to T0 than
to T1.
4.4.2 Method
The experiment consisted of 10 repeated trials where each trial had the following
steps:
1. Generate random training and test sets for T3
2. Train the kNN system by loading the training instances.
3. Test the generalization accuracy of kNN for T3 using the test set with k =3
4. Transfer knowledge from T0 by generating virtual instances for T3. The
acceptance threshold of relatedness was set to 0.001 based on preliminary
testing.
5. Test the generalization accuracy of kNN for T3 using the test set with k = 3
6. Transfer knowledge from T1 by generating virtual instances for T3. The
acceptance threshold of relatedness was set to 0.001 based on preliminary
testing.
7. Test the generalization accuracy of kNN for T3 using the test set with k = 3
4.4.3 Results
The results in Table 4 and Figure 14 show that both secondary tasks
improve the generalization accuracy of T3. T0 provides the greater positive inductive bias to
T3's hypotheses, with an accuracy of 0.788 (p < 0.001), as compared to the hypotheses
developed with the aid of T1, with an accuracy of 0.771 (p = 0.066). We conclude that
knowledge transferred from the more related task, T0, is of greater value than that from T1.
Table 4. Results of Experiment 3. The generalization accuracy of
T3 before and after the knowledge transfer with k = 3
Trials    T3 alone   T3 with T0  T3 with T1
1         0.781523   0.779026    0.760300
2         0.796504   0.807740    0.771536
3         0.772784   0.801498    0.784020
4         0.735331   0.780275    0.775281
5         0.750312   0.787765    0.775281
6         0.769039   0.792759    0.787765
7         0.750312   0.771536    0.744070
8         0.735331   0.787765    0.774032
9         0.751561   0.791511    0.771536
10        0.750312   0.776529    0.767790
Stdev     0.020016   0.011297    0.012233
95%Conf   0.012406   0.007002    0.007582
Mean      0.759301   0.787640    0.771161
Figure 14. Results of Experiment 3. Mean generalization accuracy
of T3 before and after the knowledge transfer from either T0 or T1.
4.5 Experiment 4: Knowledge Transfer from Multiple Tasks
Previous experiments focused on transferring knowledge from one secondary task
to a primary task. In this experiment, we examine the effect of transferring knowledge
from several secondary tasks to a primary task, where the secondary tasks vary in their
degree of relatedness.
4.5.1 Tasks
T3 of the Bitmap domain is the primary task. T0, T1 and T2 in the same task
domain are the previously learned secondary tasks. T3 has 200 training instances and a
test set of 800 instances. T0, T1 and T2 were each previously trained using 1000 instances
to an accuracy of 0.85. We expect that T3 will receive a net benefit from the transfer,
because the method will promote positive inductive bias from related tasks and mitigate
the negative inductive bias from unrelated tasks.
4.5.2 Method
The experiment consisted of 10 repeated trials where each trial had the following
steps:
1. Generate random training and test sets for T3
2. Train the kNN system by loading the training instances.
3. Test the generalization accuracy of kNN for T3 using the test set with
k = 3
4. Transfer knowledge from T0, T1 and T2 by generating virtual instances for
T3. The acceptance threshold of relatedness was set to 0.001 based on
preliminary testing.
5. Test the generalization accuracy of kNN for T3 using the test set with k = 3
4.5.3 Results
The results in Table 5 and Figure 15 show that the generalization accuracy
of T3 is improved after transferring knowledge from all secondary tasks (p < 0.001). In
addition, there is also some evidence to suggest that knowledge transfer from all three
tasks improved the generalization accuracy of T3 more than knowledge transfer from
just T0 (p = 0.162).
Table 5. Results of Experiment 4. The generalization accuracy of
T3 before and after the knowledge transfer with k = 3
Trials    T3 alone   With ALL   With T0
1         0.796504   0.815231   0.815231
2         0.762797   0.826467   0.803995
3         0.771536   0.787765   0.799001
4         0.765293   0.795256   0.791511
5         0.741573   0.780275   0.774032
6         0.751561   0.801498   0.789014
7         0.742821   0.813983   0.812734
8         0.765293   0.792759   0.790262
9         0.746567   0.766542   0.781523
10        0.776529   0.820225   0.805243
Stdev     0.017150   0.019068   0.013340
95%Conf   0.010629   0.011819   0.008268
Mean      0.762047   0.800000   0.796255
Figure 15. Results of Experiment 4. Mean generalization accuracy
of T3 before and after the knowledge transfer
4.6 Experiment 5: The Error of Estimation
The results of Experiment 1 show that knowledge transferred from T1 can benefit
the learning of T0. However, all results in the previous experiments were based on a
comparatively accurate estimation of CPDT. In this experiment we reduce the number of
training instances of T0 so that the accuracy of the estimation of CPDT(T0 | T1) is reduced.
We are interested in the impact that a less accurate estimation of CPDT has on knowledge
transfer.
4.6.1 Tasks and Method
This experiment uses T0 of the Bitmap domain as the primary task and T1 as the
previously learned secondary task. T0 has 100 training instances and a test set of 900
instances. T1 was previously trained using 1000 instances. The same method used in
Experiment 1 was used here.
4.6.2 Results
The results shown in Table 6 and Figure 16 indicate no significant improvement in
generalization accuracy (p = 0.119, 0.265 and 0.088 for k = 3, 5 and 7 respectively). The
positive effect of knowledge transfer is not as pronounced as in Experiment 1.
Table 6. Results of Experiment 5. The generalization accuracy of
T0 before and after the knowledge transfer with k = 3, 5 and 7
          T0 on its own                 T0 with T1's knowledge
Trials    k = 3     k = 5     k = 7     k = 3     k = 5     k = 7
1         0.657048  0.677026  0.657048  0.632630  0.639290  0.624861
2         0.648169  0.653718  0.624861  0.675916  0.682575  0.684795
3         0.613762  0.594895  0.588235  0.627081  0.617092  0.620422
4         0.645949  0.619312  0.625971  0.645949  0.641509  0.624861
5         0.594895  0.618202  0.620422  0.586016  0.586016  0.594895
6         0.633740  0.618202  0.613762  0.633740  0.618202  0.629301
7         0.660377  0.663707  0.661487  0.673696  0.674806  0.682575
8         0.653718  0.655938  0.640400  0.669256  0.670366  0.665927
9         0.642619  0.648169  0.642619  0.653718  0.661487  0.662597
10        0.617092  0.621532  0.604883  0.628191  0.625971  0.614872
Stdev     0.021560  0.026055  0.022907  0.027387  0.030717  0.030951
95%Conf   0.013363  0.016149  0.014197  0.016974  0.019038  0.019183
Mean      0.636737  0.637070  0.627969  0.642619  0.641731  0.640511
Figure 16. Results of Experiment 5. Mean generalization accuracy
of T0 before and after the knowledge transfer.
To explain the less effective knowledge transfer in this experiment, we examined
the measure of relatedness between T0 and T1, which was estimated during the process of
knowledge transfer. For trial #1 of this experiment, Rel(T0 | T1 = 0) = 0.0242 and Rel(T0 |
T1 = 1) = 0.0104, meaning that T1 is measured as more related to T0 when T1 outputs class 0
than when T1 outputs class 1. This is inconsistent with the results of Experiment 1, because
the 100 training instances of the primary task do not estimate the relatedness to the
secondary task as accurately as the 200 training instances of Experiment 1. We will examine
this further in Chapter 5.
4.7 Experiment 6: A Real-World Domain: Character Recognition
This experiment demonstrates the system's ability to make maximal use of
related prior task knowledge. The primary task is deliberately supplied with too few
training examples to be learned well under standard kNN.
4.7.1 Dataset
A small portion of the "Letter Recognition" dataset from the UCI
Machine Learning Repository2, consisting of 2000 instances, is used to train and test the
system. There are 26 class values, each representing an English letter. Every instance of
this dataset has 16 numerical input attributes, which are features extracted from raw
images of characters, and a class target value indicating one of the 26 English letters. T0,
as the primary task, has 200 training instances, which were randomly selected from this
dataset. The remaining 1800 instances are used as the test set. The secondary task, T1, was
trained using all 2000 training instances from another "Letter Recognition" dataset, also
from the UCI Machine Learning Repository. Both T0 and T1 identify capitalized English
characters from the same 16 input attributes.
4.7.2 Method
The experiment consisted of 5 repeated trials where each trial had the following
steps:
1. Generate random training and test sets for T0
2. Train the kNN system by loading the training instances.
3. Test the generalization accuracy of kNN for T0 using the test set with k =
3, 5, 7
4. Transfer knowledge from T1 by generating virtual instances for T0. The
acceptance threshold of relatedness was set to 0.001 based on preliminary
testing.
2 http://www.ics.uci.edu/~mlearn/MLRepository.html
5. Test the generalization accuracy of kNN for T0 using the test set with k =
3, 5, 7
4.7.3 Results
The results shown in Table 7 and Figure 17 show that the generalization accuracy
of T0 after knowledge transfer from T1 is statistically higher (p < 0.001) than the
generalization accuracy of T0 on its own.
Table 7. Results of Experiment 6. The generalization accuracy of
T0 before and after the knowledge transfer with k = 3, 5 and 7
          Without T1                    With T1
Trials    k = 3     k = 5     k = 7     k = 3     k = 5     k = 7
1         0.485841  0.461966  0.446419  0.781233  0.760689  0.745697
2         0.454192  0.409772  0.378679  0.739589  0.720711  0.689617
3         0.504720  0.466963  0.443642  0.782898  0.766796  0.736258
4         0.475292  0.448640  0.415325  0.756802  0.762909  0.752915
5         0.513604  0.464742  0.428651  0.764020  0.759023  0.735147
Stdev     0.023644  0.023809  0.027517  0.018012  0.018850  0.024746
95%Conf   0.020725  0.020869  0.024119  0.015788  0.016522  0.021690
Mean      0.486730  0.450417  0.422543  0.764908  0.754026  0.731927
Lower     0.466005  0.429547  0.398424  0.749121  0.737503  0.710237
Upper     0.507454  0.471286  0.446662  0.780696  0.770548  0.753617
Figure 17. Results of Experiment 6. Generalization accuracy of T0
before and after the knowledge transfer
As a comparison, the generalization accuracy of T0 trained with 1600 training
instances from the original "Letter Recognition" dataset was around 80%. This is about 5%
better than the mean generalization accuracy of T0 with transfer from T1 when k =
3. However, one must keep in mind that this was achieved with 8 times the number of
training examples.
The slightly lower accuracy under knowledge transfer from T1 can be attributed to
errors in the estimated CPDT. For instance, for test trial 1, CPDC(T0 | T1 = B) = { {B, .56},
{G, .11}, {Q, .11}, {R, .22}, {A, 0}, …, {Z, 0} }, but the ideal CPDC(T0 | T1 = B) should
be { {B, 1}, {A, 0}, …, {Z, 0} }.
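To make the effect of this estimation error concrete, a small sketch (hypothetical code, using the trial-1 distribution quoted above) of how a CPDC row assigns output probabilities to a virtual instance:

```python
# Estimated CPDC(T0 | T1 = 'B') from test trial 1 of Experiment 6;
# letters not listed received probability 0.
estimated = {'B': 0.56, 'G': 0.11, 'Q': 0.11, 'R': 0.22}
ideal = {'B': 1.0}

def transferred_signal(cpd, true_class):
    """Fraction of a virtual instance's probability mass that lands on the
    correct primary-task class; the remainder dilutes the transferred
    knowledge and accounts for the accuracy gap."""
    return cpd.get(true_class, 0.0)

print(transferred_signal(estimated, 'B'))   # -> 0.56
print(transferred_signal(ideal, 'B'))       # -> 1.0
```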
4.8 Discussion
Note that the task domain used in Experiment 4 contained a related task T0, a less
related task T1 and an unrelated task T2. Therefore, Experiment 4, along with all other
experiments, showed that knowledge transfer using conditional probability distributions
can selectively transfer the more beneficial knowledge to the primary task, and that this
transfer increases the performance of the resulting hypothesis.
The experiments also showed a limitation of the method. The effectiveness of
knowledge transfer in this kNN method depends on the accuracy of the CPDT estimation.
Because the CPDT estimation introduces some error, knowledge transfer using
conditional probability distributions can impart an inaccurate bias on the primary task if
the training examples are not sufficient to correctly measure the relatedness between two
tasks. We will discuss this matter in more detail in Chapter 5.
Chapter 5
Conclusion
This thesis presents a theory of selective knowledge transfer using conditional
probability distributions in the context of kNN. An implementation of the theory was
developed and tested on a synthetic domain of tasks. The results of several experiments
indicate that the theory has some merit. This chapter concludes with a discussion of
major contributions from the research, some known limitations and suggestions for future
work.
5.1 Major Contributions
The research has made several contributions to machine learning research.
5.1.1 A new functional measure of relatedness for kNN based on
virtual instances
All knowledge transfer methods that have been previously proposed for kNN are
based on the structural similarity, most notably the similarity of the weight vectors of the
distance metric. The method proposed in this thesis develops a measure of relatedness
based on the conditional probability distribution of virtual instances created from
secondary tasks.
5.1.2 Measure of relatedness is at the classification level
All previously proposed measures of relatedness operate at the task level. This
research has created a new measure of relatedness at the classification level, so that the
resulting sense of relatedness between tasks is more detailed and the relatedness between
two tasks can be exploited more effectively. This is a first step towards an instance-level
measure of relatedness, which would measure the relatedness of every single piece of
knowledge from the secondary task with respect to the primary task.
5.1.3 Tolerance to unrelated tasks and scaling
The results of Experiment 2 show that the new method is tolerant of negative
inductive bias from unrelated tasks. Because the number of virtual instances in the k
nearest neighbourhood is limited, the proposed method is also tolerant of increases in the
number of secondary tasks.
5.2 Limitations
5.2.1 Conditional Probability Estimation Errors
As seen in Section 4.7, the errors introduced when estimating CPDT between tasks limit
the maximum generalization accuracy improvement that can be achieved by knowledge
transfer. As shown in Section 4.6, the error of the CPDT estimation negatively affects the
learning of the primary task. Two sources of this estimation error have been identified.
The deficiency of training examples in T0. The results of Section 4.6 show that
if the number of training instances in T0 is not sufficient, the estimation of conditional
probabilities will not be accurate. Unfortunately, a deficiency of training examples for the
primary task is not avoidable, as this is the scenario in which knowledge transfer is most
needed.
The accuracy of secondary tasks. The estimation of CPDT is based on how
secondary tasks classify the training examples of the primary task. If a secondary task
is not sufficiently accurate, the estimation of CPDT will be unable to reflect the true
relationship between the secondary task and the primary task. One approach to
minimizing this error is to use the accuracy of the secondary task as part of the
calculation of relatedness of the virtual instances. Less accurate secondary tasks would
then have less effect on the accuracy of T0's hypothesis.
5.2.2 Relatedness Based on Sub-spaces of the Input Space
With the proposed method, the relatedness between two tasks is measured at the
classification level; instances of a secondary task sharing the same class value will generate
virtual instances with the same output probability distribution. However, in many cases,
tasks are only related within certain sub-spaces of the input attribute space. For example,
task T1 may only be similar to task T0 when the first attribute is less than 0.5 and the
second attribute is larger than 0.1. Using the method presented in this thesis, a highly
related sub-region is somewhat neutralized by other, less related sub-regions that share the
same class value. Thus, relatedness measured at the sub-space level is necessary to
capture a more refined sense of functional relatedness between tasks.
Knowledge transfer at the sub-space level might be approached as follows:
1. Divide the input space of T1 into n sub-regions. Each sub-region can have a
different size, depending upon the distribution of training instances.
2. For each sub-region Ri, estimate the probability distribution of T0 given
that T1 outputs each class value and the query point is in Ri.
3. For each sub-region Ri, generate virtual instances in T0 from instances that
are in Ri of T1 by applying the probability distributions calculated in step 2 for
Ri.
Note that when n is large enough, the relatedness is actually measured at the
instance level. The major research question along this direction is finding a way to divide
the input space so that each sub-space provides a comparatively accurate estimation of
conditional probabilities.
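Steps 1 and 2 could be prototyped roughly as follows (a sketch assuming a uniform grid over a 2-D input space; how best to partition the space remains the open question noted above):

```python
from collections import defaultdict

def subregion_cpds(samples, grid=2, lo=0.0, hi=10.0):
    """Carve the input space into grid x grid uniform cells and, per cell R,
    estimate P(T0 = y0 | T1 = y1, x in R) by counting.
    samples: list of (x, y0, y1) where x is a 2-D point from T0's training
    set, y0 its T0 class, and y1 the class the trained T1 assigns to x."""
    cell = (hi - lo) / grid
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for x, y0, y1 in samples:
        region = tuple(min(int((xi - lo) / cell), grid - 1) for xi in x)
        counts[region][y1][y0] += 1
    cpds = {}
    for region, by_y1 in counts.items():
        cpds[region] = {}
        for y1, by_y0 in by_y1.items():
            total = sum(by_y0.values())
            cpds[region][y1] = {y0: n / total for y0, n in by_y0.items()}
    return cpds

# Step 3 would then generate virtual instances from T1's instances in each
# region R using cpds[R][y1] as the output distribution.
data = [((1.0, 1.0), 1, 1), ((1.5, 1.2), 0, 1), ((9.0, 9.0), 1, 0)]
cpds = subregion_cpds(data, grid=2)
```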
5.3 Other Suggestions for Future Work
Apart from the possible extensions mentioned in Section 5.2.1 and Section 5.2.2,
there are some other variations that may improve the effectiveness of knowledge transfer.
Weighted distance. The ND-kNN algorithm can easily be extended to a weighted-
distance version, in which each virtual instance is further weighted by its distance to the
query instance. The nearer a virtual instance is to a query instance, the more strongly it
affects the classification. One would have to guard against the over-amplification of
virtual instances by placing a limit on the maximum distance weight.
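A minimal sketch of such a distance-weighted vote (the names and the cap value are illustrative, not part of the prototype):

```python
from collections import defaultdict

def weighted_vote(neighbours, max_inverse=100.0):
    """Distance-weighted kNN vote over the k nearest neighbours.
    neighbours: list of (distance, class_label, weight), where weight is 1
    for actual instances and the relatedness-derived probability for
    virtual ones. Capping the inverse-distance factor keeps a virtual
    instance that lands on top of the query from dominating the vote."""
    votes = defaultdict(float)
    for dist, label, weight in neighbours:
        inv = max_inverse if dist == 0 else min(1.0 / dist, max_inverse)
        votes[label] += weight * inv
    return max(votes, key=votes.get)

# A nearby class-1 neighbour outweighs two farther class-0 neighbours:
print(weighted_vote([(0.5, 1, 1.0), (2.0, 0, 1.0), (2.5, 0, 1.0)]))  # -> 1
```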
Duplicated instances. In Section 3.3.1, a naïve method is adopted to deal with
duplicated instances, which are instances sharing the same set of input attributes across
two or more tasks. Ideally, an approach similar to that used in Bayes networks should be
used to calculate the conditional probabilities; for example, more complex conditional
probabilities such as P(T0 | T1 = + ∩ T2 = -) would have to be calculated. The major
research question is to find a fast way to estimate these conditional probabilities with
reasonable accuracy.
Density of virtual instances. kNN derives decision boundaries from training
examples. One important factor that greatly affects the shape of the decision boundary is
the density of instances. If the existing decision boundary happens to be the optimal one,
one more example may decrease the generalization accuracy; this is similar to overtraining
an ANN. To accommodate this situation, the virtual instances could be generated in such a
way that the density of instances is constant throughout the input space. Other techniques
such as model-based kNN (Guo, Wang, Bell, Bi, & Greer, 2003) would also help.
Combining structural and functional measures of relatedness. The measures
of relatedness suggested by previous research capture the structural similarity between
two kNN tasks, while the CPDT captures the functional similarity. It would seem
important to consider both methods when transferring knowledge between tasks. Future
research could investigate a combination of these two methods.
References
Caruana, R. A. (1993). Multitask Connectionist Learning. Paper presented at the Connectionist Models Summer School, School of Computer Science, Carnegie Mellon University.
Caruana, R. A. (1997). Multitask Learning. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA.
Devore, J. L. (2004). Probability and Statistics for Engineering and the Sciences. Belmont, CA: Brooks/Cole -Thomson Learning.
Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). KNN Model-Based Approach in Classification. Paper presented at the International Conference on Ontologies, Databases and Applications of Semantics, Catania, Sicily (Italy).
Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36, 177-221.
Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.
Niyogi, P., & Girosi, F. (1994). On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions (Technical Report No. AIM-1467).
Robins, A. V. (1996). Transfer in Cognition. In L. Pratt (Ed.), Connection Science Special Issue: Transfer in Inductive Systems (Vol. 8, pp. 185-203). Cambridge, MA: Carfax Publishing Company.
Silver, D. L. (2000). Selective transfer of neural network task knowledge. PhD Thesis, Faculty of Graduate Studies, University of Western Ontario, London, Ont.
Thrun, S. (1995). Lifelong Learning: A Case Study (Technical Report No. CMU-CS-95-208). Pittsburgh, PA: Carnegie Mellon University, Computer Science Department.
Thrun, S., & O'Sullivan, J. (1995). Clustering learning tasks and selective cross-task transfer of knowledge (Technical Report No. CMU-CS-95-209). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
Valiant, L. G. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142.
Vosniadou, S., & Ortony, A. (1989). Similarity and Analogical Reasoning: A Synthesis. In S. Vosniadou & A. Ortony (Eds.), Similarity and Analogical Reasoning. New York: Cambridge University Press.