
Page 1: Chapter 1 Introduction - lmlr.acadiau.calmlr.acadiau.ca/wp-content/uploads/2017/08/thesis_final.…  · Web viewThis thesis by Yuan Su is accepted in its present form by the Jodrey

SELECTIVE KNOWLEDGE TRANSFER FROM K-NEAREST NEIGHBOUR TASKS

USING FUNCTIONAL SIMILARITY

AT THE CLASSIFICATION LEVEL

by

Yuan Su

Thesis

submitted in partial fulfillment of the

requirements for the Degree of

Bachelor of Computer Science with Honours

Acadia University

April 2005

© Copyright by Yuan Su, 2005


This thesis by Yuan Su

is accepted in its present form by the

Jodrey School of Computer Science

as satisfying the thesis requirements for the degree of

Bachelor of Computer Science with Honours

Approved by the Thesis Supervisor

__________________________ ____________________Dr. Daniel L. Silver Date

Approved by the Director of the School

__________________________ ____________________Dr. Leslie Oliver Date

Approved by the Honours Committee

__________________________ ____________________ Date


I, Yuan Su, grant permission to the University Librarian at Acadia University to reproduce, loan or distribute copies of my thesis in microform, paper or electronic

formats on a non-profit basis. I however, retain the copyright in my thesis.

_________________________________Signature of Author

_________________________________Date


Table of Contents

TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1 INTRODUCTION
1.1 OVERVIEW OF PROBLEM
1.2 RESEARCH OBJECTIVES
1.3 MOTIVATION
1.4 OVERVIEW OF THESIS

CHAPTER 2 BACKGROUND
2.1 BACKGROUND ON INDUCTIVE LEARNING AND KNN
2.1.1 Supervised Inductive Learning
2.1.2 Classification
2.1.3 Generalization Error
2.1.4 The kNN Algorithm
2.1.5 Distance Weighted kNN
2.2 BACKGROUND OF KNOWLEDGE TRANSFER
2.2.1 Inductive Bias and Prior Knowledge
2.2.2 Knowledge Based Inductive Learning
2.2.3 Task Relatedness
2.2.4 Functional Similarity vs. Structural Similarity
2.3 RELEVANT BACKGROUND IN PROBABILITY AND STATISTICS
2.3.1 Conditional Probability
2.3.2 Conditional Probability Distributions for Discrete Random Variables
2.4 PREVIOUS RESEARCH ON KNOWLEDGE TRANSFER IN THE CONTEXT OF KNN
2.4.1 Task Clustering (TC) Algorithm
2.4.2 Weight Vector
2.4.3 Summary

CHAPTER 3 SELECTIVE KNOWLEDGE TRANSFER FROM KNN TASKS
3.1 FORMAL DEFINITION OF THE PROBLEM
3.2 THEORY OF KNOWLEDGE TRANSFER FOR KNN CONCEPT LEARNING
3.2.1 A Synthetic Example
3.2.2 Conditional Probability Distributions
3.2.3 Generation of Virtual Instances
3.2.4 The Need for a Measure of Relatedness at the Classification Level
3.2.4 Using Variance to Measure Classification Relatedness
3.2.5 Duplicated Instances
3.2.6 A Neighbourhood of Virtual Instances
3.2.7 Steps of Knowledge Transfer from a kNN Concept Learning Task
3.3 GENERALIZING TO MULTI-CLASSES
3.3.1 Conditional Probability Distribution for Multi-class Tasks
3.3.2 Classification Relatedness
3.3.3 Steps of Knowledge Transfer for a kNN Multi-Class Task
3.4 IMPLEMENTATION
3.5 CONCLUSION

CHAPTER 4 EMPIRICAL STUDIES
4.1 THE BITMAP DOMAIN
4.1.1 Task T0
4.1.2 Task T1
4.1.3 Task T2
4.1.4 Task T3
4.2 EXPERIMENT 1: TRANSFER FROM A PARTIALLY RELATED TASK
4.2.1 Tasks
4.2.2 Method
4.2.3 Results
4.3 EXPERIMENT 2: UNRELATED TASKS
4.3.1 Tasks
4.3.2 Method
4.3.3 Results
4.4 EXPERIMENT 3: VARIATION IN TRANSFER FROM MORE AND LESS RELATED TASKS
4.4.1 Tasks
4.4.2 Method
4.4.3 Results
4.5 EXPERIMENT 4: KNOWLEDGE TRANSFER FROM MULTIPLE TASKS
4.5.1 Tasks
4.5.2 Method
4.5.3 Results
4.6 EXPERIMENT 5: THE ERROR OF ESTIMATION
4.6.1 Tasks and Method
4.6.2 Results
4.7 EXPERIMENT 6: A REAL-WORLD DOMAIN: CHARACTER RECOGNITION
4.7.1 Dataset
4.7.2 Method
4.7.3 Results
4.8 DISCUSSION

CHAPTER 5 CONCLUSION
5.1 MAJOR CONTRIBUTIONS
5.1.1 A new functional measure of relatedness for kNN based on virtual instances
5.1.2 Measure of relatedness is at the classification level
5.1.3 Tolerance to unrelated tasks and scaling
5.2 LIMITATIONS
5.2.1 Minimizing Conditional Probability Estimation Errors
5.2.2 Relatedness Based on Sub-spaces of the Input Space
5.3 OTHER SUGGESTIONS FOR FUTURE WORK

References

List of Tables

Table 1. Estimation of Conditional Probabilities
Table 2. Results of Experiment 1
Table 3. Results of Experiment 2
Table 4. Results of Experiment 3
Table 5. Results of Experiment 4
Table 6. Results of Experiment 5
Table 7. Results of Experiment 6

List of Figures

Figure 1. An example of a kNN task
Figure 2. An example of a probability distribution
Figure 3. The primary task, T0
Figure 4. Possible decision boundaries for T0
Figure 5. Secondary task, T1
Figure 6. Estimate conditional probabilities using training instances of T0
Figure 7. Generate virtual instances from instances of T1
Figure 8. Architecture of VND-kNN
Figure 9. Bitmap for T0, T2 and T3
Figure 10. Bitmap for T1
Figure 11. The relation between tasks in the bitmap domain
Figure 12. Results of Experiment 1
Figure 13. Results of Experiment 2
Figure 14. Results of Experiment 3
Figure 15. Results of Experiment 4
Figure 16. Results of Experiment 5
Figure 17. Results of Experiment 6

ABSTRACT

The thesis explores how a learning system can utilize previously learned

knowledge to develop a more accurate hypothesis in the context of the k nearest

neighbour (kNN) learning algorithm. Several previous methods of knowledge transfer for

kNN have proposed measures based on structural similarity at the task level. A theory of

selective knowledge transfer is presented using a measure of relatedness based on

functional similarity at the classification level.

The new method of knowledge transfer relies on the generation of virtual

instances for the primary task from training instances of the secondary task.  Each virtual

instance is non-deterministic in that the probability of its class value is conditioned upon

the class value of the secondary task. Virtual-instance-based Non-Deterministic kNN

(VND-kNN) is introduced as an implementation of the theory.

A prototype system based on the theory is tested against a synthetic domain and a

letter recognition domain. Experiments show that knowledge transfer from secondary

tasks based on the conditional probability distributions can improve the generalization

accuracy of the primary task if the secondary tasks are related to the primary task. 

Furthermore, experiments show that the method is able to mitigate negative transfer of

knowledge when the secondary tasks are unrelated to the primary task.


Chapter 1

Introduction

Machine Learning has been defined as the study of computer algorithms that

improve automatically through experience (Mitchell, 1997). Machine learning theories

imply that the larger the set of training examples, the better the probability of developing

an accurate hypothesis (Valiant, 1984). However, in practice, most applications of

machine learning systems suffer from a deficiency of training examples. Collecting

training examples can be difficult, time consuming and costly. Life-long learning is

dedicated to utilizing prior knowledge in lieu of training examples to more efficiently

learn a more effective hypothesis for a new task.

The k-nearest neighbour algorithm (kNN) is a popular machine learning method.

kNN considers every instance to be a point in an n-dimensional space, where n is the

number of input attributes. kNN is trained by simply storing training examples and it

classifies a query instance q based on the k training examples that are closest to q. Three

methods of life-long learning through knowledge transfer using kNN have previously

been proposed (Caruana, 1993; Thrun, 1995; Silver, 2000). Knowledge is selectively

transferred based on structural measures of relatedness at the task level. This thesis

introduces a method of selective knowledge transfer for kNN that is based on functional

measures at the classification level using virtual training instances.

Chapter 1 has four sections. The first section defines the terms used above that

may be unfamiliar to the reader and contains an overview of the problem. The


second section contains the research objectives and the third section explains the

motivation for the new method. The final section gives an overview of the structure of the

thesis.

1.1 Overview of Problem

Machine learning systems often encounter insufficient training examples per task

to develop a sufficiently accurate hypothesis. For example, a hospital may have records

on only 100 patients with a particular type of heart disease. One approach to overcoming

the deficiency of training examples is to utilize knowledge that has been acquired during

the learning of previous tasks that are related. For example, assuming we have learned a

model for identifying patients with high blood pressure, we can use its knowledge to help

us identify patients with heart disease. The process is to transfer the previously

acquired knowledge (high blood pressure diagnosis) to the new and related learning task

(heart disease diagnosis).

Previous work has provided a fundamental theory of knowledge

transfer along with methods of selective knowledge transfer in the

context of kNN (Caruana, 1993; Thrun, 1995; Silver, 2000). All of these methods use the

similarity between the distance metrics (a structural measure) used in each task and do not

consider the functional relationship between the output values of the tasks.

Other previous research has explored methods of determining how much two

tasks are functionally related; for example, linear coefficient of correlation, coefficient of

determination and Hamming distance (Silver, 2000). All of the methods proposed measure


functional relatedness at the task level: the relationship between two tasks is based on all

target values. However, it can be beneficial to measure the relationship at the

classification level. For example, if the output value of a previously learned task T1 is the

same as that of the new task T0 for a particular class value, then this sense of relationship

should not be entirely dismissed. For some sub-region of the input attribute space, the

two tasks are similar and the transfer could be beneficial.

Only a few researchers have investigated knowledge transfer in the context of the kNN

algorithm and all methods previously proposed transfer based on structural measures at

the task level. The thesis develops a new functional measure of relatedness at the

classification level and uses the measure to achieve knowledge transfer between kNN

tasks.

1.2 Research Objectives

The thesis investigates possible ways of selective task knowledge transfer in the

context of kNN. The research reported has three objectives. The first objective is to find

a measure of relatedness between tasks that is an alternative to those presented previously.

The second objective is to develop a theoretical model of selective knowledge transfer

based on the new measure of relationship between tasks. The last objective is to build a

prototype system based on this theory and test the system against synthetic and real task

domains.


1.3 Motivation

The research is motivated by observation of human learning and fundamental

knowledge of machine learning theory.

Observation of human learning. In practice, humans often get into a situation

where they cannot decide on a best answer. For example, a person wants to decide

whether he should sell all his stock. Half of his consultants suggest that he sell but the

other half do not. If the person has background knowledge of a previous and similar

decision, it can increase the probability of making a correct choice of action. Similarly,

when the kNN algorithm classifies a query instance, if there is related prior task

knowledge that beneficially biases the algorithm, the resulting prediction will be more

accurate.

Machine learning theory. The PAC theory shows that the probability of a

learning system developing an accurate hypothesis increases with the number of training

examples used (Valiant, 1984). I propose that knowledge from previously learned kNN

tasks can provide additional virtual instances for better learning of a new task. This

requires a clever method of generating virtual instances from previously learned tasks to

enrich the pool of instances for the new task.

1.4 Overview of Thesis

The remainder of the thesis is organized as follows.

Chapter 2 provides background knowledge of inductive learning, kNN and

knowledge transfer. It describes previous research achievements and proposals for


knowledge transfer in the context of kNN. Based on the advantages and limitations of

these methods, the objective and scope of the research is refined.

Chapter 3 develops a theory of knowledge transfer in the context of kNN using a

measure of relationship between class values of tasks. During the development of the

theory, a modified version of kNN, Non-Deterministic kNN (ND-kNN), is defined and

used.

Chapter 4 tests the theory developed in Chapter 3 using a prototype ND-kNN

system. Based on the results from the experiments, the advantages and limitation of the

method are discussed.

Chapter 5 concludes with a summary of the research, some important limitations,

and suggestions for future work.


Chapter 2

Background

2.1 Background on Inductive Learning and kNN

This section reviews the basic knowledge of inductive machine learning and the

kNN algorithm required for this research.

2.1.1 Supervised Inductive Learning

Inductive learning is an inference process that builds a model or hypothesis, h, of

the task, f, by using a set of training examples. If each training example contains input

attributes, x, and a correct target output value f(x), we call it supervised inductive

learning. If only input attributes are provided, then unsupervised learning, often called

clustering, can be undertaken but not supervised learning. The objective of a supervised

inductive learning system is to develop or select a hypothesis h such that h(x) = f(x) for

all possible x. This research deals with supervised inductive learning and henceforth, the

word “learning” refers to supervised learning.

2.1.2 Classification

Classification is one type of learning task that, given a set of input attribute

values, outputs one of a set of discrete values, known as categories or classes. This

research is limited to discussing classification learning. Concept learning is the simplest

form of classification learning, where there are only two possible class values, such as


True or False, Positive or Negative, 1 or 0. If a classification task has more than two

class values, it is known as a multi-class learning problem.

2.1.3 Generalization Error

The generalization error is defined to be the error between the learner's

hypothesis, h(x), and the target function, f(x), for all x (Niyogi & Girosi, 1994). For

classification learning, the generalization error can be estimated by the number of

incorrect classifications made by the developed model on an independent test set. The

generalization accuracy equals one minus the generalization error.
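As a concrete sketch of this estimate (the threshold hypothesis `h` and the small test set below are hypothetical examples of mine, not data from the thesis), the error and accuracy can be computed as:

```python
def estimate_generalization(h, test_set):
    # Count test instances where the hypothesis h disagrees with the
    # target value f(x); the error rate estimates the generalization error.
    errors = sum(1 for x, f_x in test_set if h(x) != f_x)
    error_rate = errors / len(test_set)
    return error_rate, 1.0 - error_rate  # accuracy = 1 - error

# Hypothetical threshold hypothesis and independent test set of (x, f(x)) pairs
h = lambda x: 1 if x >= 0.5 else 0
test = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.3, 1)]
err, acc = estimate_generalization(h, test)  # err = 0.2, acc = 0.8
```

Here only the instance (0.3, 1) is misclassified, giving an estimated error of 1/5.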

2.1.4 The kNN Algorithm

Instead of developing a representation of a hypothesis as artificial neural

networks and inductive decision trees do, the kNN algorithm simply stores the training

instances in a knowledge base. Generalization is postponed until classification time when

a query instance is presented to the system for classification. kNN considers every

instance to be a point in an n-dimensional space, where n is the number of input

attributes. The basic assumption behind the kNN algorithm is that similar instances have

similar outputs. Similarity is based on the Euclidean distance between instances. For

example, in Figure 1, instance Q is closer to instance A than to instance B, so Q should

have an output similar to that of A.


Figure 1. An example of a kNN task

Formally, as per (Mitchell, 1997), the kNN training algorithm is as follows:

Algorithm 2.1 kNN training algorithm

Given a training example x, which has n input attributes a1…an and an

output value v,

kNN learns x by simply storing it in its knowledge base X.

When kNN classifies a query instance q, a neighbourhood that contains the k

nearest training examples to q is created. Formally, as per (Mitchell, 1997), classification

is accomplished with the following algorithm:

Algorithm 2.2 kNN classification algorithm

1. Let x1…xk denote the k instances from the training examples that are nearest to

xq in Euclidean distance.

2. Return $\hat{f}(x_q) \leftarrow \operatorname{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$


where $v$ is one of the possible output values in $V$, and $\delta(a, b) = 1$ if $a = b$;

otherwise $\delta(a, b) = 0$.

Euclidean distance is defined as follows:

Definition 2.1 Euclidean Distance

Consider two instances x and y with ith input attributes $x_i$ and $y_i$;

the Euclidean distance is $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.

Once a kNN task is trained, decision boundaries, which separate instances of

different classes, can be derived by enumerating every possible point in the input space.

In general, decision boundaries along with decision regions, which are the spaces

between decision boundaries, are considered the hypothesis developed by kNN for a task.
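The training and classification procedures above can be sketched as follows (a minimal Python illustration of Algorithms 2.1 and 2.2, not the prototype system described later in the thesis; the class and function names are my own):

```python
import math
from collections import Counter

def euclidean_distance(x, y):
    # Definition 2.1: square root of the summed squared attribute differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

class BasicKNN:
    def __init__(self, k=3):
        self.k = k
        self.knowledge_base = []  # Algorithm 2.1: training is just storage

    def train(self, x, v):
        # Store the training example (input attributes x, output value v)
        self.knowledge_base.append((x, v))

    def classify(self, q):
        # Algorithm 2.2: majority vote among the k stored instances nearest to q
        nearest = sorted(self.knowledge_base,
                         key=lambda xv: euclidean_distance(xv[0], q))[:self.k]
        votes = Counter(v for _, v in nearest)
        return votes.most_common(1)[0][0]
```

As in Figure 1, a query that falls nearer the stored instances of one class is assigned that class by the vote.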

2.1.5 Distance Weighted kNN

As we can see from section 2.1.4, the basic form of kNN treats every instance in

the neighbourhood equally. However, it can be argued that more similar instances should contribute more to the final output. This motivation leads to a variation

called distance weighted kNN. This form takes the distance between the query point and

the instance into consideration when calculating the final output. Formally, the

classification algorithm becomes:

$$\hat{f}(x_q) \leftarrow \operatorname*{argmax}_{v \in V} \sum_{i=1}^{k} w_i\, \delta(v, f(x_i))$$

where

$$w_i = \frac{1}{d(x_q, x_i)^2}$$
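A sketch of the distance-weighted variant, assuming the common inverse-square weighting and returning the stored output directly when the query exactly matches a training instance (distance of zero):

```python
import math
from collections import defaultdict

def weighted_knn_classify(examples, q, k):
    # Each of the k nearest neighbours votes with weight 1 / d^2;
    # an exact match (d == 0) returns the stored output directly.
    def dist(x):
        return math.sqrt(sum((xi - qi) ** 2 for xi, qi in zip(x, q)))
    nearest = sorted(examples, key=lambda ex: dist(ex[0]))[:k]
    votes = defaultdict(float)
    for x, v in nearest:
        d = dist(x)
        if d == 0:
            return v
        votes[v] += 1.0 / d ** 2
    return max(votes, key=votes.get)

examples = [((1, 1), '+'), ((2, 2), '+'), ((4, 4), '-')]
print(weighted_knn_classify(examples, (3.5, 3.5), 3))  # prints -
```

Although two of the three neighbours are positive, the single nearby negative instance dominates the weighted vote.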

2.2 Background of Knowledge Transfer

Haussler showed that the number of training examples required for developing a

sufficiently accurate hypothesis depends on the size of the hypothesis space (Haussler,

1988). If we can introduce more inductive bias, such as prior knowledge, to further

restrict the hypothesis space, the number of training examples that are required can be

reduced. The process of utilizing prior knowledge from previously learned tasks, or

secondary tasks, to influence the learning of a new task, the primary task, is called

knowledge transfer.

2.2.1 Inductive Bias and Prior Knowledge

The learning of a task cannot be accomplished without some assumptions about

the nature of the task. “A learner that makes no a priori assumptions regarding the

identity of the target concept has no rational basis for classifying any unseen instances”

(Mitchell, 1997). In the context of inductive learning, we call this a priori assumption

inductive bias. Inductive bias exists in every learning algorithm and influences the

hypotheses the algorithm develops or selects. Inductive bias can be classified into two

categories: preference or restriction. Preference is an inductive bias that prefers certain hypotheses over others. Restriction is an inductive bias that restricts the space of

possible hypotheses (hypothesis space) of the learning system.


Prior knowledge is one type of inductive bias that includes knowledge of intended

use, knowledge of the source, analogy of previously learned tasks and/or knowledge of

the task domain (Silver, 2000). In this thesis, analogy with previously learned tasks and

knowledge of the task domain are jointly called task domain knowledge.

2.2.2 Knowledge Based Inductive Learning

Knowledge based inductive learning, or KBIL, is a learning method that relies on

knowledge of the task domain (domain knowledge) to reduce the hypothesis space that

the learning system has to search. After each new task is learned, the knowledge from

that task is retained in domain knowledge, so it can be reused when learning future tasks.

In the extreme case, if the new task is exactly the same as the old task, the inductive bias

introduced by domain knowledge should help develop an accurate hypothesis rapidly

from a minimum number of training examples.

2.2.3 Task Relatedness

The relatedness between tasks has been identified as a critical issue for the

success of knowledge transfer (Thrun, 1995). Consider a situation where the learner has a

diverse set of secondary tasks in domain knowledge. There are some secondary tasks that

are unrelated to the primary task and other secondary tasks that are highly related to the

primary task. The most related tasks must be selected by some measure of task

relatedness in order to maximize the contribution from the task domain.


The concept of task relatedness was first defined in (Caruana, 1997). Later on, the

definition of task relatedness was extended by (Silver, 2000). Formally, as per (Silver, 2000), task relatedness is defined as follows:

Definition 2.2: Task relatedness

Let Tk be a secondary task and T0 a primary task of the same domain, with training examples Sk and S0 respectively. The relatedness of Tk with respect to T0, in the context of a learning system L that uses knowledge transfer, is the utility of using Sk along with S0 toward the efficient development of an effective hypothesis for T0.

The nature of task relatedness can be examined from different perspectives: task

relatedness as a distance metric, task relatedness as similarity and task relatedness as

shared invariance (Silver, 2000). This thesis will focus on task relatedness in terms of

similarity.

2.2.4 Functional Similarity vs. Structural Similarity

Previous researchers have suggested two distinct forms of similarity: functional similarity and structural similarity (Robins, 1996; Vosniadou & Ortony, 1989). In the

context of machine learning, functional similarity or surface similarity can be defined as

shallow, easily perceived, external similarity (Silver, 2000). In the context of kNN,

functional similarity can be described as the degree of sharing of decision regions.

Structural similarity can be defined as deep, often complex, internal feature similarity


(Silver, 2000). In the context of kNN, structural similarity has been defined as the use of

similar distance metrics for each of the input attributes.

2.3 Relevant Background in Probability and Statistics

This section reviews the relevant mathematics used in this research.

2.3.1 Conditional Probability

There are situations where the information “an event B has occurred” can

influence the probability of event A occurring. The probability of one event given that

another event has occurred is known as conditional probability. Formally, as per

(Devore, 2004), conditional probability is defined as follows:

Definition 2.3 Conditional Probability

For any two events A and B with P(B) > 0, the conditional probability of A given that B has occurred is defined by

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

2.3.2 Conditional Probability Distributions for Discrete Random

Variables

The probability distribution of a discrete random variable, X, describes how the

total probability of 1 is distributed among the various possible X values (Devore, 2004)

(e.g. See Figure 2). Formally, the probability distribution or probability mass function

(pmf) of a discrete random variable is defined for every number x by p(x) = P(X = x). If all


probabilities in a probability distribution are conditional probabilities, the distribution is

called a conditional probability distribution.

Figure 2 An example of probability distribution with p(x) = 0.2 for all x

The variance of a probability distribution measures the spread of values in the

distribution. Formally, as per (Devore, 2004), variance of a probability distribution is

defined as follows:

Definition 2.4: Variance of probability distribution

Let X = {x1, x2, …, xn} be a set of probability values. Then the variance of X is

$$Var(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

where x̄ is the mean of the values in X.
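This definition can be checked numerically. The sketch below assumes the sample form of the variance (n − 1 denominator), which is the form consistent with the variance values quoted later in Chapter 3:

```python
def variance(values):
    # Sample variance (n - 1 denominator); applied to the probability values
    # of a conditional distribution this reproduces the figures used in
    # Chapter 3: {1, 0} -> 0.5 and {0.8, 0.2} -> 0.18
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

print(variance([1.0, 0.0]))   # prints 0.5
print(variance([0.8, 0.2]))   # ≈ 0.18
```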

2.4 Previous Research on Knowledge Transfer in the Context of

kNN

2.4.1 Task Clustering (TC) Algorithm

One previous paper discusses knowledge transfer in the context of the kNN

algorithm (Thrun & O'Sullivan, 1995). The proposed TC algorithm partitions task


domain knowledge into clusters of related tasks. All related tasks in the same cluster use

a common Euclidean distance metric, which is generally beneficial to all related tasks in

the cluster. When a new primary task is being learned, the relatedness between the new

task and each cluster is estimated by using each cluster's distance metric to bias the learning of the primary task. The cluster that helps the primary task achieve the lowest generalization error is considered the most related. The primary task then uses the

distance metric of that cluster for classifying future query instances.

In summary, the TC algorithm measures the relatedness of two tasks by structural

similarity at the task level and transfers the structure (distance metric) of the secondary

task to the primary task.

2.4.2 Weight Vectors

Another paper proposes one alternative method of knowledge transfer for kNN as

a possible extension to knowledge transfer in ANN (Silver, 2000). The method relies on

the generation of virtual examples from previously learned kNN models and their

distance weight vectors. A measure of primary to secondary task relatedness is computed

based on the cosine of the angle between their respective weight vectors. A weight

vector for the new task is then re-computed using a gradient descent approach that

minimizes the error over all tasks weighted by their measure of relatedness. The

structural knowledge of the secondary tasks is transferred to the primary task while

minimizing a global error function. The resulting weight vector is used to predict the

output class for test query instances for the new task.


Once again, this method measures the relatedness by structural similarity at the

task level.

2.4.3 Summary

All methods that have been proposed to achieve knowledge transfer in the context

of kNN utilize structural similarity between tasks. From previous research, we know that

functional similarity and structural similarity are both important measures of relatedness

(Silver, 2000). Using only one of the two may miss identifying domain knowledge that is

beneficial to the primary task. Our objective is to capture the functional similarity

between kNN tasks.

Moreover, all methods that have been proposed measure the relatedness at the

task level. Previous research explores various functional measures of relatedness such as

linear coefficient of correlation, coefficient of determination and Hamming distance

(Silver, 2000). All of these measures are aimed at the task level and may miss capturing

related portions of secondary tasks. For example, if the previously learned task T1 is

related to the primary task T0 only when T1 outputs class 1, then the measure of

relatedness at the task level will fail to capture this partial relatedness within a sub-region

of the input attribute space.

In summary, little research has been done on knowledge transfer in the context of

kNN and all of the methods that have been proposed transfer knowledge based on

structural similarity at the task level. The objective of this thesis is to develop a new measure

of relatedness based on the functional similarity at the classification level and use the

measure to achieve knowledge transfer from secondary kNN tasks to a new primary task.

Chapter 3

Selective Knowledge Transfer from kNN Tasks

3.1 Formal Definition of the Problem

In Chapter 2, several methods of knowledge transfer were presented but all of

them measure the relatedness at the task level. In the context of kNN, few people have

tried knowledge transfer and the methods that have been proposed focus on the structural

similarity. This thesis proposes a new theory of selective knowledge transfer in the

context of kNN such that:

1. Relatedness is measured between tasks at the classification level;

2. Relatedness is based purely on the functional knowledge of previously

learned tasks;

3. Functional knowledge is in the form of virtual instances.

3.2 Theory of Knowledge Transfer for kNN Concept Learning

kNN does not explicitly generalize the training examples to form a hypothesis. The

knowledge of a kNN system is represented by a pool of instances. Therefore, the most

natural way to transfer knowledge from previously learned kNN tasks is to utilize training

instances from those tasks.

Section 3.2 focuses on the theory of knowledge transfer with kNN when learning

concept tasks. A domain of synthetic tasks is defined and used throughout the thesis in


order to present and test the theory and associated methods. The theory is extended to the

multi-class learning in Section 3.3.

3.2.1 A Synthetic Example

The kNN algorithm derives the decision boundaries directly from training

instances. If the number of available training examples is not sufficient to derive accurate

decision boundaries, then related background knowledge can be very helpful. Consider a

simple concept learning task as shown in Figure 3. The primary task, T0 has 7 training

instances, three of which have class value ‘-’. The kNN decision boundary, for k = 3, is shown as roughly a vertical line between the positive and negative training instances. The query “?” would be classified as positive under normal kNN.

It is also possible that the actual decision boundary for T0 is more complex than

this naïve hypothesis. For example, Figure 4 shows a horizontal region of negative

instances surrounded by positive instances. In this case the query instance would be

classified as negative. Provided that there is no background knowledge, the two possible

decision boundaries are equally likely, but standard kNN will prefer the boundary shown

in Figure 3.


Figure 3. The primary task, T0. The shaded area is the decision region where all instances are of class value ‘-’. The decision region is derived from the training examples of T0 using k = 3.

Because the knowledge of a kNN task is represented by instances, knowledge

transfer from T1 can be accomplished through the generation of virtual instances for T0

from T1’s kNN model. The problem of transfer can be reduced to answering two

questions:

1. How do we determine the class value of a virtual instance?

2. How do we select training examples from the secondary task so as to

generate virtual instances?

The following sections are dedicated to answering these two questions.


Figure 4 Possible decision boundaries for T0. The query, in this

case, is classified as negative.

Figure 5. Secondary task, T1. The shaded area is the decision region where all instances are of class value ‘-’. The decision region is derived from the training examples of T1 using k = 1.


3.2.2 Conditional Probability Distributions

The functional similarity between the training instances of two tasks describes a

degree of relatedness between the tasks. This relatedness can also be represented by

conditional probabilities. For the example shown in Section 3.2.1, let P(T0 = + | T1 = +)

equal the probability that an instance of T0 is positive given that an instance of T1 is

positive. Then we can express the relatedness of T1 to T0 by the conditional probabilities

P(T0 = + | T1 = +), P(T0 = - | T1 = +), P(T0 = + | T1 = -) and P(T0 = - | T1 = -). We do not

know the exact value of these conditional probabilities but they can be estimated by

observing the primary task and secondary task training instances.

One approach is to classify each training example of the primary task using the kNN model of the secondary task. Let U+ be the number of training examples of the primary task that are classified as positive by the secondary task, and let C+ be the number of positive training examples of the primary task that are classified as positive by the secondary task. Then, P(T0 = + | T1 = +) can be estimated by C+/U+. Similarly, let U- be the number of training examples of the primary task that are classified as negative by the secondary task, and let C- be the number of positive training examples of the primary task that are classified as negative by the secondary task. Then, P(T0 = + | T1 = -) can be estimated by C-/U-.
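The counting procedure above can be sketched as follows. The secondary task's kNN model is replaced here by a hypothetical stand-in classifier, so the toy numbers do not reproduce the exact figures of the worked example in Section 3.2.1:

```python
def estimate_cpd(primary_examples, classify_secondary):
    # Classify each primary training example with the secondary task's model,
    # then estimate P(T0 = + | T1 = s) = C / U for each secondary output s.
    counts = {'+': [0, 0], '-': [0, 0]}          # s -> [C, U]
    for x, v in primary_examples:
        s = classify_secondary(x)
        counts[s][1] += 1                        # U: total classified as s
        if v == '+':
            counts[s][0] += 1                    # C: of those, primary class is +
    return {s: (c / u if u else None) for s, (c, u) in counts.items()}

# Hypothetical stand-in for the kNN model of a secondary task T1:
# it labels the lower half of the input space (y <= 3) negative.
t1 = lambda x: '-' if x[1] <= 3 else '+'
primary = [((1, 1), '+'), ((2, 1), '+'), ((1, 2), '+'), ((2, 2), '+'),
           ((5, 5), '-'), ((5, 6), '-'), ((6, 5), '-')]
print(estimate_cpd(primary, t1))  # prints {'+': 0.0, '-': 1.0}
```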


Let’s use the tasks in Section 3.2.1 as an example. All training examples of T0 are classified by T1 with k = 1¹.

positive by T1, so U+ = 2. There are 0 positive instances of T0 that are classified as positive by T1, so C+ = 0. Therefore, P(T0 = + | T1 = +) = C+/U+ = 0/2 = 0. Similarly, there are 5

training instances of T0 that are classified as negative by T1, so U- = 5. There are 4

positive training instances of T0 that are classified as negative by T1, so C- = 4. Therefore, P(T0 = + | T1 = -) = C-/U- = 4/5 = 0.8.

Because P(T0 = - | T1 = +) = 1- P(T0 = + | T1 = +) and P(T0 = - | T1 = -) = 1- P(T0 =

+ | T1 = -), we now have a complete conditional probability distribution (See Table 1).

3.2.3 Generation of Virtual Instances

Virtual instances for T0 can now be generated from the kNN model for T1 by

considering the conditional probability distributions in Table 1.

Table 1 Estimation of Conditional Probabilities

              T1 = “+”     T1 = “-”
  T0 = “+”    0/2 = 0      4/5 = 0.8
  T0 = “-”    2/2 = 1      1/5 = 0.2

¹ k is chosen to be constant 1 while we estimate the conditional probabilities so as to make knowledge transfer independent of the value of k specified by the user.


Figure 6 Estimating the conditional probabilities by classifying the training instances of T0 using T1. [Figure: the training instances of T0 are shown after classification by T1, distinguishing negative instances of T0 classified negative by T1, negative instances of T0 classified positive by T1, and positive instances of T0 classified negative by T1.]

Because P(T0 = + | T1 = -) = 0.8, all virtual instances generated for T0 from

negative instances of T1 are positive with probability of 0.8 (or negative with probability

of 0.2). Similarly, because P(T0 = + | T1 = +) = 0, all virtual instances generated for T0

from positive instances of T1 are positive with probability of 0 (or negative with

probability of 1). In other words, the output value of a virtual instance is the conditional

probability distribution P(T0 | T1 = v), where v is the class value of the source instance of

T1.

Figure 7 shows the resulting T0 decision boundary with k = 3 after all virtual

instances are generated. Note that virtual instances are non-deterministic instances. They

have two possible class values, each with a probability. To make proper use of these

probabilities, the kNN algorithm has to be modified so that the probabilities are used to

weight the vote of the virtual instances. For example, the neighbourhood of the query

with k = 3 as shown in Figure 7 contains three instances, two of which are negative with

probability 1. The other instance in the neighbourhood is positive with probability 0.8

and negative with probability 0.2. Therefore, the resulting vote is 2.2 negative

(0.2+1+1=2.2) vs. 0.8 positive (0.8+0+0=0.8). Consequently, the query is classified as

negative. In Figure 7, the new decision boundaries reflect the horizontal boundaries

transferred from T1. The prior knowledge of T1 has biased the learning of T0.
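The probability-weighted vote described above can be reproduced directly. This sketch represents each neighbour by its class distribution (actual instances vote with probability 1 for their own class):

```python
def vote(neighbours):
    # Each neighbour carries a probability distribution over the classes;
    # virtual instances vote with their probabilities, actual training
    # instances with probability 1 for their own class.
    totals = {'+': 0.0, '-': 0.0}
    for dist in neighbours:
        for cls, p in dist.items():
            totals[cls] += p
    return max(totals, key=totals.get), totals

# The neighbourhood of Figure 7: two certain negatives plus one virtual
# instance that is 80% positive / 20% negative.
winner, totals = vote([{'-': 1.0}, {'-': 1.0}, {'+': 0.8, '-': 0.2}])
print(winner)  # prints -
```

The totals come out to 2.2 negative vs. 0.8 positive, matching the worked example.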

The first question in Section 3.2.1 can now be answered: How do we determine

the class value of a virtual instance? The class value of a virtual instance is determined

based on:

1. The class value of the corresponding instance of the secondary task


2. The conditional probability distributions calculated by classifying training

instances of the primary task using the secondary task. The class value of

the virtual instance is actually a conditional probability distribution over

all class values of the primary task (in this case, + or -).


Figure 7 Generating virtual instances from the instances of T1 based on the conditional probability distributions. [Figure: the input space of T0 populated with virtual instances that are 80% positive (generated from negative instances of T1) and virtual instances that are 100% negative (generated from positive instances of T1), per P(T0 = + | T1 = +) = 0 and P(T0 = + | T1 = -) = 0.8.]

3.2.4 The Need for a Measure of Relatedness at the Classification

Level

In the previous example, all instances of T1 are transferred to T0 as virtual

instances. This can be problematic as the number of instances from T1 increases. The

computational time and space for classifying a query instance for T0 grows as a function

of the number of training instances (actual and virtual). Ideally, we only want to transfer

instances that are beneficial to the learning of the primary task, from the instances of a

secondary task that are most related to the primary task. Therefore, it is crucial to find a

measure of relatedness that can choose the best instances to transfer.

Sometimes tasks are partially related at the classification level. For instance, the

example in Section 3.2.1 shows that T1 is totally related to T0 when T1 outputs “+” but is

far less related to T0 when T1 outputs “-”. If the relatedness was measured at the task

level, the high relatedness between the two tasks when T1 outputs “+” would be mitigated

by the comparatively low relatedness between two tasks when T1 outputs “-”. In order to

utilize the relatedness between two tasks more effectively, a relatedness measure at the

class level is needed.

We propose that conditional probability distributions can be used to estimate

relatedness separately for each class value of the primary and secondary tasks. The only

question left is how to quantify the relatedness implied by the conditional probability distribution between T0 and the current secondary task, so that it can be compared with that of other secondary tasks.


3.2.4 Using Variance to Measure Classification Relatedness

If P(T0 = + | T1 = +) = P(T0 = - | T1 = +) = 0.5, the associated virtual instances will

have a probability of 0.5 positive and 0.5 negative so they will add no value to the

development of decision boundaries when learning T0. We can say that these virtual

instances are unrelated to the learning of the primary task. However, if P(T0 = + | T1 = +)

= 1, associated virtual instances are very related to T0. Therefore, the variance of the

conditional probability distribution indicates the relative degree of relatedness between

the tasks at the class level. The greater the variance, the greater the measured relatedness between the tasks.

Consider two concept learning tasks, T0 and T1, that are totally unrelated when T1 outputs “+”; then P(T0 = + | T1 = +) = P(T0 = - | T1 = +) = 0.5, and the variance of the distribution P(T0 | T1 = +) is Var[P(T0 | T1 = +)] = 0. In this case, T0 and T1 are minimally

related when T1 outputs +. On the other hand, when T0 and T1 are identical, P(T0 = + | T1

= +) = 1 and Var[P(T0 | T1 = +)] = 0.5. In this case, T0 and T1 are maximally related. If T0

= ¬T1, then P(T0 = + | T1 = +) = 0 but P(T0 = - | T1 = +) = 1.0 and Var[P(T0 | T1 = +)] =

0.5. Once again, T0 and T1 can be considered maximally related for the positive output

class of T1. For the tasks shown in Section 3.2.1, P(T0 = + | T1 = -) = 0.8 and Var[P(T0 | T1

= -)] = 0.18. In this case, two tasks are partially related when T1 outputs negative.

As the relatedness between two tasks increases, the variance of the resulting

conditional probability distribution increases and is always in the range of 0 to 0.5. For

simplicity, we normalize the variance to the range of 0 to 1. Formally,

Definition 3.1: Classification Relatedness of Concept Learning Tasks

The classification relatedness of the secondary task Ti with regard to the primary task T0 when Ti outputs class v is:

$$Rel(T_0, T_i = v) = 2 \cdot Var[P(T_0 \mid T_i = v)]$$

where the factor of 2 maps the maximum variance of 0.5 onto 1.

As mentioned in Section 3.2.3, the class value of a virtual instance is the conditional probability distribution P(T0 | Ti = v). Therefore:

Definition 3.2: Rel(T0, x), the relatedness of an instance x with respect to the primary task, is measured by the variance of the output probability distribution of x. The relatedness of the original training examples of the primary task is always equal to 1.
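Assuming the normalization is a simple scaling by 2 (mapping the maximum variance of 0.5 to 1), the classification relatedness for concept tasks can be sketched as:

```python
def rel(cpd):
    # Sample variance of the conditional distribution's probabilities,
    # doubled so that the range 0..0.5 is normalized to 0..1
    # (binary / concept-learning case only).
    n = len(cpd)
    mean = sum(cpd) / n
    var = sum((p - mean) ** 2 for p in cpd) / (n - 1)
    return 2 * var

print(rel([1.0, 0.0]))  # identical (or inverted) tasks: prints 1.0
print(rel([0.5, 0.5]))  # unrelated tasks: prints 0.0
print(rel([0.8, 0.2]))  # partially related: ≈ 0.36
```

An acceptance threshold can then be applied directly to these values to decide which virtual instances to keep.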

Once the classification relatedness is calculated, a minimum Rel(T0, Ti = v), the

acceptance threshold of relatedness, can be set to filter out the virtual instances that are

generated from tasks that do not meet a minimum acceptable level of relatedness. The

setting of the acceptance threshold of relatedness is currently done manually and is

intended to reduce the computational time and space for classifying a query instance. In

general, this value should be a small number in the range 0.001-0.02.

Section 3.2.1 also asks the following question: How do we select training

examples from the secondary task so as to generate virtual instances? The answer is

that selection is based on the classification relatedness. Virtual instances are only


generated from training instances of secondary tasks whose classification relatedness

with respect to the primary task is greater than the acceptance threshold of relatedness.

3.2.5 Duplicated Instances

Instances from two or more tasks that share the same input attribute values are

defined to be duplicated instances. It is important that duplicated instances are carefully

handled when transferring knowledge from secondary tasks. For example, if we

overwrite a related virtual instance with a less related virtual instance, the quality of the

knowledge transfer is reduced.

One approach to handling duplicated instances is to mathematically combine their

values such as taking the average of all probability distributions. However, this requires

that the tasks are conditionally independent of each other. If this assumption does not hold, a simple linear combination of the probabilities is not valid either.

Therefore, we propose a naïve method: an existing (actual or virtual) instance can

be overwritten only if a duplicate is more related. This guarantees that the most beneficial

instance is retained. Since the classification relatedness of an actual training example is

always 1, no duplicate instance can overwrite an actual training instance.

3.2.6 A Neighbourhood of Virtual Instances

In Figure 7, k = 3, so by definition the k nearest neighbours of the query point

contain three virtual instances and no original training example of T0. This does not seem problematic at first glance. However, if the virtual instances in the neighbourhood are not

related to the primary task, there is little confidence in the final classification. For


example, if all instances in the neighbourhood are virtual instances indicating 51%

positive, the final classification is “+” but only at the 51% confidence level. In order to

maintain a reasonable confidence level of the final classification, it is important to

include sufficient actual training instances during kNN generalization.

One approach is to employ a dynamic neighbourhood size when the k nearest

neighbours are selected. The method works in the following way:

1. Let k be the size of the neighbourhood specified by the user. Select the k

nearest actual training instances to the query point.

2. Select at most k virtual instances nearest to the query point such that none

are farther from the query point than the kth nearest actual training

instance.

Therefore, the number of nearest neighbours ranges from k to 2k, where there are,

at most, as many virtual instances as actual training instances. Limiting the number of

virtual instances ensures that the knowledge of the virtual instances does not overwhelm

the knowledge of the actual training examples.
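The two selection steps above can be sketched as follows, assuming a Euclidean distance function and that at least k actual training instances exist:

```python
def dynamic_neighbourhood(actual, virtual, q, k, dist):
    # Step 1: the k nearest actual training instances (assumes len(actual) >= k).
    a = sorted(actual, key=lambda ex: dist(ex[0], q))[:k]
    radius = dist(a[-1][0], q)            # distance of the kth actual instance
    # Step 2: at most k virtual instances, none farther than that radius.
    v = [ex for ex in sorted(virtual, key=lambda ex: dist(ex[0], q))
         if dist(ex[0], q) <= radius][:k]
    return a + v                          # between k and 2k neighbours

d = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
actual = [((1, 1), '+'), ((4, 4), '-'), ((6, 6), '-')]
virtual = [((2, 2), {'+': 0.8, '-': 0.2}), ((9, 9), {'-': 1.0})]
print(len(dynamic_neighbourhood(actual, virtual, (2, 1), 2, d)))  # prints 3
```

Only the virtual instance lying within the radius of the second actual neighbour is admitted, so the neighbourhood holds two actual and one virtual instance.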

3.2.7 Steps of Knowledge Transfer from kNN Concept Learning Tasks

In summary, knowledge transfer from a set of secondary tasks T1…Tn to the

primary task T0 in the context of kNN concept learning tasks with acceptance threshold of

relatedness α involves the following steps:

For each secondary task Ti (1 ≤ i ≤ n):


1. Determine the conditional probability distributions P(T0 | Ti = +) and P(T0 | Ti = -) by classifying the training examples of T0 using Ti.

2. For each training instance x of Ti, whose class value is v (v = + or -):

a. Generate a virtual instance x′, which shares the same set of input attribute values as x and outputs P(T0 | Ti = v).

b. If Rel(T0, x′) < α, discard x′.

c. Add x′ to T0, overwriting any existing duplicated instances according to the following: if there exists an instance x0 in T0 which shares the same set of input attribute values as x′, discard x′ if Rel(T0, x′) < Rel(T0, x0).
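The complete procedure can be sketched end to end for concept tasks. The data layout (dicts keyed by input tuples) and the stand-in secondary classifier are illustrative assumptions, not part of the thesis:

```python
def transfer(primary, secondary_tasks, alpha):
    # primary: dict mapping input tuples to class values ('+' or '-').
    # secondary_tasks: list of (training_instances, classify_fn) pairs.
    # Returns a pool mapping inputs to (class distribution, relatedness).
    def rel(cpd):                          # normalized sample variance (binary case)
        vals = list(cpd.values())
        mean = sum(vals) / len(vals)
        return 2 * sum((p - mean) ** 2 for p in vals) / (len(vals) - 1)

    # Actual training examples always have relatedness 1 and are never overwritten.
    pool = {x: ({v: 1.0}, 1.0) for x, v in primary.items()}
    for instances, classify in secondary_tasks:
        # Step 1: estimate P(T0 | Ti = s) by classifying T0's examples with Ti.
        counts = {}
        for x, v in primary.items():
            c = counts.setdefault(classify(x), {'+': 0, '-': 0})
            c[v] += 1
        cpds = {s: {v: c[v] / sum(c.values()) for v in c} for s, c in counts.items()}
        # Step 2: generate virtual instances, filter by the acceptance
        # threshold alpha, and keep only the most related duplicate.
        for x, v in instances:
            if v not in cpds:
                continue
            r = rel(cpds[v])
            if r < alpha:
                continue
            if x not in pool or pool[x][1] < r:
                pool[x] = (cpds[v], r)
    return pool

t1 = ([((2, 2), '+'), ((5, 5), '-')], lambda x: '+' if x[0] < 3 else '-')
pool = transfer({(1, 1): '+', (5, 5): '-'}, [t1], alpha=0.3)
print(len(pool))  # prints 3: two actual instances plus one new virtual instance
```

Note that the duplicate at (5, 5) does not overwrite the actual training example, since actual examples carry the maximum relatedness of 1.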

3.3 Generalizing to Multi-Classes

Section 3.2 has presented the theory of knowledge transfer in the context of kNN

concept learning tasks. This section extends the theory to multi-class learning problems.

The rationale is the same, i.e. use conditional probabilities to determine the relatedness of

virtual instances to the primary task. The only difference between concept learning tasks

and multi-class tasks is the number of classes. This requires extending the formulas found

in Section 3.2.

3.3.1 Conditional Probability Distribution for Multi-class Tasks

Though the definition of the conditional probability distribution still applies, a new formula is required for the multi-class case. Formally:


Definition 3.3: Conditional Probability Distribution for Multi-Class

Tasks

First, define the conditional probability distribution per classification, CPDC(T0 | Ti = vm), as a function that takes the mth element vm of the set M, which contains all possible class values of Ti, as input and outputs the set:

CPDC(T0 | Ti = vm) = {(vn, P(T0 = vn | Ti = vm)) | vn ∈ N}

where vn is the nth element of the set N, which contains all possible class values of T0.

Then, define CPDT(T0 | Ti) as a function that takes Ti as input and outputs the set:

CPDT(T0 | Ti) = {(vm, CPDC(T0 | Ti = vm)) | vm ∈ M}

where M is the set of all possible output values of Ti.

Essentially, CPDC is the conditional probability distribution and CPDT is the

joint probability distribution. They are different from the classical definition of the

conditional probability distribution in that both CPDC and CPDT output the actual class

value with which the probability is associated.

As an example, the conditional probability distributions for the example in

Section 3.2 can be expressed as:


CPDC(T0 | T1 = +) = {(+, 0), (-, 1)},

CPDC(T0 | T1 = -) = {(+, 0.8), (-, 0.2)} and

CPDT(T0 | T1) = {(+, {(+, 0), (-, 1)}), (-, {(+, 0.8), (-, 0.2)})}.
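Such distributions can be estimated by counting how often each class value of Ti co-occurs with each class value of T0 over T0's training examples. The sketch below is illustrative (function and variable names are not from the thesis); the label lists are chosen so that it reproduces the distributions above.

```python
from collections import Counter, defaultdict

def estimate_cpdt(secondary_labels, primary_labels):
    """Estimate CPDT(T0 | Ti): for each class value vm of Ti, the
    distribution over T0's class values among the T0 training examples
    that Ti classifies as vm.  Class pairs never observed together
    simply do not appear in the resulting dictionary."""
    counts = defaultdict(Counter)
    for vm, vn in zip(secondary_labels, primary_labels):
        counts[vm][vn] += 1
    return {vm: {vn: c / sum(ctr.values()) for vn, c in ctr.items()}
            for vm, ctr in counts.items()}

ti = ['+', '+', '-', '-', '-', '-', '-']   # Ti's classification of T0's examples
t0 = ['-', '-', '+', '+', '+', '+', '-']   # T0's true class values
cpdt = estimate_cpdt(ti, t0)
```

Here cpdt['+'] is {'-': 1.0} and cpdt['-'] is {'+': 0.8, '-': 0.2}, matching CPDC(T0 | T1 = +) and CPDC(T0 | T1 = -) above.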

In general, the output value of virtual instances of T0, which are generated from

instances with class value v in T1, is CPDC(T0 | T1 = v).

3.3.2 Classification Relatedness

The classification relatedness also needs to be extended for multi-class problems.

But first, the variance of a CPDC must be defined:

Definition 3.4: Variance of CPDC

First, define PP(CPDC(T0 | Ti = vm)) as a function to extract the probability proportions of a CPDC:

PP(CPDC(T0 | Ti = vm)) = {P(T0 = vn | Ti = vm) | vn ∈ N}

where vm is the mth element of the set M, which contains all possible class values of Ti, and vn is the nth element of the set N, which contains all possible class values of T0.

Then, define Var[CPDC(T0 | Ti = vm)] as a function that calculates the sample variance of a CPDC. With X = PP(CPDC(T0 | Ti = vm)) = {x1, …, xn} and mean x̄ = (1/n) Σ xi:

Var[CPDC(T0 | Ti = vm)] = (1/(n - 1)) Σ (xi - x̄)²


As with concept learning tasks, relatedness increases as the variance

increases. The maximum variance is not always equal to 0.5 as in concept learning tasks

but depends on the number of class values that the primary task has.

Claim 3.1: Range of the Variance of CPDC

Given a CPDC(T0 | Ti = vm) whose probability proportion set has n elements, i.e. |PP(CPDC(T0 | Ti = vm))| = n (n > 1), define:

1. x̄ = (1/n) Σ xi, the mean of the probabilities in PP(CPDC(T0 | Ti = vm))

2. Vmax = 1/n

The variance, Var[CPDC(T0 | Ti = vm)], is in the range [0, Vmax].

By the definition of the variance, the variance is always non-negative. Therefore, we only need to show that the maximum value of the variance equals Vmax.

Proof 3.1: Maximum value of the variance

Let X = {x1, x2, …, xn}. By the definition of the sample variance,

Var(X) = (1/(n - 1)) Σ (xi - x̄)² = (1/(n - 1)) (Σ xi² - n x̄²).

Because Σ xi = 1, the mean x̄ = 1/n is constant. So if Σ xi² is maximized, Var(X) is maximized.

Consider Σ xi². Since xi ≥ 0 and Σ xi = 1, we get Σ xi² ≤ (Σ xi)² = 1, and Σ xi² = 1 if there is one xi = 1 (and all others are 0).

Therefore,

Vmax = (1/(n - 1)) (1 - n (1/n)²) = (1/(n - 1)) · ((n - 1)/n) = 1/n

Note that in the case of concept learning, n = 2. Therefore, Vmax = 0.5, which is

exactly the maximum value of variance concluded in Section 3.2.4.
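Claim 3.1 can be checked numerically. The sketch below uses Python's statistics.variance, which computes the sample variance with the same n - 1 divisor assumed in the proof:

```python
from statistics import variance

# A CPDC that puts all probability on one class attains the maximum
# variance, which Claim 3.1 states is Vmax = 1/n; a uniform CPDC
# attains the minimum, 0.
for n in range(2, 7):
    peaked = [1.0] + [0.0] * (n - 1)          # one probability is 1, the rest 0
    assert abs(variance(peaked) - 1.0 / n) < 1e-12
    uniform = [1.0 / n] * n                   # uniform probabilities: variance 0
    assert variance(uniform) < 1e-12
```

For n = 2 this gives 0.5, the concept learning maximum noted above.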

Then, the definition of classification relatedness can be extended to:

Definition 3.5: Classification Relatedness for Multi-Class Tasks

The classification relatedness of the secondary task Ti with respect to the primary task T0 when Ti outputs vm is defined as follows:

Rel(T0, Ti = vm) = Var[CPDC(T0 | Ti = vm)] / Vmax


Also, define the classification relatedness of an instance x in T0, whose class value is CPDC(T0 | Ti = vm), as:

Rel(T0, x) = Rel(T0, Ti = vm)

3.3.3 Steps of Knowledge Transfer from kNN Multi-Class Tasks

In summary, knowledge transfer from a set of secondary multi-class tasks T1…Tn to the primary multi-class task T0 in the context of kNN with an acceptance threshold of relatedness α involves the following steps:

For each secondary task Ti (1 ≤ i ≤ n):

1. Determine CPDT(T0 | Ti) by classifying training examples of T0 using Ti

2. For each training instance x in Ti, whose class value is vm:

a. Generate a virtual instance x̂, which shares the same set of input attributes as x and outputs CPDC(T0 | Ti = vm).

b. If Rel(T0, x̂) < α, discard x̂.

c. Add x̂ to T0, overwriting any existing duplicate instances according to the following: if there exists an instance x0 in T0 that shares the same set of input attribute values as x̂, discard x̂ if Rel(T0, x̂) < Rel(T0, x0).
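The steps above can be sketched as follows. The data structures, the cpdt/rel mappings, and the convention that actual training examples carry relatedness 1.0 are assumptions for illustration, not the thesis's implementation:

```python
def transfer_virtual_instances(secondary, cpdt, rel, primary, alpha):
    """Generate virtual instances for T0 from one secondary task Ti.
    `secondary` lists (attributes, vm) pairs from Ti's training set;
    `cpdt[vm]` is CPDC(T0 | Ti = vm); `rel[vm]` is Rel(T0, Ti = vm);
    `primary` maps attribute tuples to (output, relatedness) pairs."""
    for attrs, vm in secondary:
        if rel[vm] < alpha:
            continue                          # step b: below the acceptance threshold
        existing = primary.get(attrs)
        if existing is not None and rel[vm] < existing[1]:
            continue                          # step c: keep the more related duplicate
        primary[attrs] = (cpdt[vm], rel[vm])  # step a/c: add or overwrite
    return primary
```

Because actual training examples carry the highest relatedness, a virtual instance can never displace one under this scheme.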


3.4 Implementation

3.4.1 VND-kNN

In order to implement the theory of knowledge transfer for kNN tasks, the classic

kNN algorithm had to be modified so that:

1. It supports non-deterministic instances, which use CPDC as the class

value

2. It supports a dynamic nearest neighbourhood whose size ranges from k to 2k.

ND-kNN will henceforth denote this modified version of kNN algorithm, which

supports non-deterministic instances. ND-kNN must have a data structure for storing

virtual non-deterministic instances including the CPDC value. Formally,

Algorithm 3.1: ND-kNN Training Algorithm for Virtual Instances

Given a virtual non-deterministic instance x, which has m input

attributes a1…am and n output class values v1…vn, each with an

associated probability Pi (1 ≤ i ≤ n)

ND-kNN learns x by just storing it into its knowledge base X.

Then according to the theory described in the previous sections, ND-kNN shall

create a neighbourhood, which contains k original training examples and at most k virtual

non-deterministic instances as follows:


Algorithm 3.2: ND-kNN Selection of Neighbourhood Algorithm

Given a query q, add the k nearest original training examples from X

into the neighbourhood N. Let d be the distance to q from the

farthest instance in N. Select at most k nearest virtual instances from

X, whose distance to q is less than d and add them to N.

Finally, in the voting phase, every virtual instance contributes a vote for each

class value vi weighted by its probability Pi.

Algorithm 3.3 ND-kNN Classification Algorithm

Let x1…xr denote r instances in the neighbourhood N of size r for

query q. Every instance has m input attributes a1…am and n output

values v1…vn, each being associated with a probability Pi (1 ≤ i ≤ n)

Return

argmax over v ∈ V of: Σ (j = 1..r) Σ (i = 1..n) δ(vij, v) · Pij

where vij is the ith class value of the jth instance in the neighbourhood N; δ(vij, v) = 1 if vij = v, else δ(vij, v) = 0; V contains all possible class values; Pij is the Pi of the jth instance in the neighbourhood N.
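The voting rule of Algorithm 3.3 can be sketched as follows. The dict-per-instance representation is an assumption: an actual training example is simply an instance whose single class value carries probability 1, while a virtual instance carries its full CPDC.

```python
from collections import defaultdict

def nd_knn_classify(neighbourhood):
    """Weighted vote over a mixed neighbourhood of actual and virtual
    instances; each instance maps class values to probabilities."""
    votes = defaultdict(float)
    for instance in neighbourhood:
        for value, p in instance.items():
            votes[value] += p          # each class vi votes with weight Pi
    return max(votes, key=votes.get)   # class with the largest weighted vote
```

For example, two actual instances {'+': 1.0} and {'-': 1.0} plus a virtual instance {'+': 0.8, '-': 0.2} yield weighted votes of 1.8 for '+' and 1.2 for '-'.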

Using ND-kNN, a Virtual-instance-based ND-kNN (VND-kNN) system can be

built to achieve life-long learning. As in Figure 8, when a new task is presented, the system first measures the similarity between the new task and all tasks in the task domain. Then, based on the similarity, virtual instances are generated from secondary tasks

to the new task. Once the new task is successfully learned, the new task (i.e. all its

instances) is stored into the task domain.

Figure 8. Architecture of VND-kNN

3.4.2 The Prototype System

A VND-kNN software system was constructed in C++ using Microsoft Visual C++. The system consists of around 3000 lines of code and employs multi-threaded programming to enhance overall performance. The system implements a

standard Windows graphical user interface and allows the user to manipulate multiple

tasks in the task domain at the same time. All tasks and domains can be saved for future

use.

An object-oriented approach was used when the system was first designed. Each kNN model is an object containing methods that allow the user to train the model with

training instances and classify a query instance with the k nearest neighbours. Every

instance in a kNN model is represented by an Instance object. The Instance object uses a

vector to represent the input attributes and an Output object to represent the instance’s

output class value. The Output object overrides the data type conversion operators. In this


way, both actual training instances and virtual instances can share the same data

representation. The Instance object also has an attribute indicating whether the instance is

an actual training instance or a virtual instance so that ND-kNN can process them

separately.

3.5 Summary

The method of knowledge transfer presented in Section 3.2 and 3.3 uses

conditional probability distributions that utilize prior knowledge of previously learned

tasks at the classification level. Moreover, the employment of an acceptance threshold of

relatedness can prevent unrelated virtual instances from being generated so that the

computational space and time required to classify a query instance is minimized.

Limiting the number of virtual instances in the k nearest neighbourhood also prevents the

growth of negative inductive bias from a large number of unrelated secondary tasks. In

this way, virtual instances generated from secondary tasks can provide a positive

inductive bias through knowledge transfer when learning the primary task.


Chapter 4

Empirical Studies

This chapter summarizes and discusses the results of several experiments

conducted using a prototype system based on the theory of Chapter 3. The first

experiment shows the modified kNN method’s ability to transfer knowledge from a

partially related secondary task. The second experiment shows the method’s ability to

mitigate the negative inductive bias from an unrelated task. The third experiment shows

the variation in positive inductive bias as a function of the relatedness between the

primary and secondary tasks. The fourth experiment shows the method’s ability to

transfer knowledge from multiple secondary tasks. The fifth experiment focuses on a

recognized limitation of the method. The last experiment tests the method’s maximum

capacity of knowledge transfer in the context of real world tasks and the ability to handle

multi-class learning problems.

4.1 The Bitmap Domain

In order to test the effectiveness of knowledge transfer in the context of kNN, a

synthetic domain of tasks was developed, called the Bitmap Domain. The domain

contained 4 different tasks, each having 1000 instances. The output class values of

instances are determined by the bitmaps shown in Figure 9 and Figure 10. Each instance

has 4 numerical input attributes a, b, c and d ranging from 0 to 10. The attributes of each

instance in every task are divided into 2 pairs: (a, b) and (c, d). Each pair of attributes

can determine a class value by the corresponding number in the bitmap, using the


attributes as 2-D coordinates. For example, a = 1.3 and b = 7.8 outputs 1 because the 2nd

column, 8th row in the bitmap of Figure 9 is 1 (always round up).
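The mapping from an attribute pair to a bitmap cell can be sketched as follows. The 10x10 bitmap contents come from Figures 9 and 10 and are not reproduced here; the zero-based, bottom-left indexing convention is an assumption.

```python
from math import ceil

def bitmap_class(bitmap, x, y):
    """Map a pair of attributes in [0, 10] to a cell of a 10x10 bitmap,
    rounding up as in the example (a = 1.3 selects the 2nd column).
    bitmap[row][col] is zero-based, with row 0 at the bottom."""
    col = max(ceil(x), 1) - 1
    row = max(ceil(y), 1) - 1
    return bitmap[row][col]
```

With a bitmap whose 8th row, 2nd column holds a 1, bitmap_class(bitmap, 1.3, 7.8) reproduces the example above.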

4.1.1 Task T0

The class values of instances in T0 depend on input attributes a and b using

bitmap Figure 9. The instance is classified as 1 if and only if (a, b) selects a 1 in the

bitmap. Input attributes c and d are generated randomly.

Figure 9. Bitmap for T0, T2 and T3.

The input space is divided into 100 small rectangular regions. Each number represents the class value of one of the 100 regions.

4.1.2 Task T1

The class values of instances in T1 are based on input attributes a and b using the

bitmap Figure 10. The instance is classified as 1 if and only if the pair (a, b) selects a 1 in

the bitmap. Input attributes c and d are generated randomly. Note that T1 is quite related

to T0 for the lower portion of the bitmaps.


Figure 10. Bitmap for T1.

It is somewhat similar to the one for T0, especially for the

distribution of 1’s

4.1.3 Task T2

The class values of instances in T2 depend on input attributes c and d using the bitmap of Figure 9. The instance is classified as 1 if and only if the pair (c, d) selects a 1 in the bitmap. Input attributes a and b are generated randomly. The output class values of T2 only depend on the last two attributes. T2 is unrelated to T0 or T1.

4.1.4 Task T3

The class values of instances in T3 depend on all four input attributes using the bitmap of Figure 9. The instance is classified as 1 if and only if both pairs (a, b) and (c, d) select a 1 in the bitmap. T3 is related to T0 and T1; however, it is most related to T0.

Figure 11 shows the relation between tasks in the domain.


Figure 11. The relation between tasks in Bitmap domain

4.2 Experiment 1: Transfer from a Partially Related Task

In this experiment we are interested in whether a previously learned task will

benefit the learning of the primary task if the two tasks are partially related.

4.2.1 Tasks

Experiment 1 uses T0 of the Bitmap domain as the primary task and T1 as the

previously learned secondary task. T0 has 200 training instances and a test set of 800

instances. T1 was previously trained using 1000 instances. The goal of this experiment is

to find out whether knowledge transfer from T1 will improve the generalization accuracy

of T0.

4.2.2 Method

The experiment consisted of 10 repeated trials where each trial had the following

steps:

1. Generate random training and test sets for T0

2. Train the kNN system by loading the training instances.


3. Test the generalization accuracy of kNN for T0 using the test set with k =

3, 5, 7

4. Transfer knowledge from T1 by generating virtual instances for T0. The

acceptance threshold of relatedness was set to 0.001 based on preliminary

testing.

5. Test the generalization accuracy of kNN for T0 using the test set with k =

3, 5, 7
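The per-trial summary statistics reported in the results tables (Stdev, 95%Conf, Mean) appear consistent with a normal-approximation confidence half-width: with 10 trials, 1.96·s/√10 applied to the reported Stdev values reproduces the reported 95%Conf values to within rounding. The z = 1.96 constant is an inference from those numbers, not stated in the thesis.

```python
from statistics import mean, stdev

def summarize(accuracies, z=1.96):
    """Mean, sample standard deviation, and 95% confidence half-width
    (normal approximation) over a set of per-trial accuracies."""
    m = mean(accuracies)
    s = stdev(accuracies)
    return m, s, z * s / len(accuracies) ** 0.5
```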

4.2.3 Results

The results of the experiment are summarized in Table 2 and Figure 12.

Table 2. Results of Experiment 1. The generalization accuracy of

T0 before and after the knowledge transfer with k = 3, 5 and 7

Accuracy        T0 on its own                    T0 with T1’s knowledge
Trials    k = 3      k = 5      k = 7      k = 3      k = 5      k = 7
1         0.676654   0.6804     0.689139   0.684145   0.689139   0.700375
2         0.690387   0.6804     0.686642   0.699126   0.699126   0.676654
3         0.674157   0.675406   0.655431   0.68789    0.686642   0.672909
4         0.665418   0.675406   0.691635   0.67166    0.696629   0.699126
5         0.670412   0.672909   0.665418   0.676654   0.694132   0.679151
6         0.667915   0.661673   0.665418   0.681648   0.679151   0.676654
7         0.705368   0.701623   0.691635   0.72035    0.707865   0.706617
8         0.734082   0.726592   0.714107   0.739076   0.741573   0.730337
9         0.722846   0.702871   0.699126   0.72784    0.710362   0.714107
10        0.70412    0.682896   0.684145   0.715356   0.696629   0.68789
Stdev     0.024381   0.018963   0.017611   0.023651   0.0172768  0.019014
95%Conf   0.015111   0.011753   0.010915   0.014659   0.0107081  0.011785
Mean      0.691136   0.686018   0.68427    0.700375   0.7001248  0.694382



Figure 12. Results of Experiment 1. Generalization accuracy of T0 before and after the knowledge transfer.

Results show that the mean generalization accuracy of T0 after knowledge transfer

from T1 was higher than the mean generalization accuracy of T0 on its own (p < 0.001, p

< 0.001 and p = 0.002 for k = 3, 5 and 7 respectively). It is important to note that the

relatedness between T0 and T1 in one of the test trials was Rel(T0,T1=0) = 0.0019,

Rel(T0,T1=1) = 0.1933. This suggests that T1 is more related to T0 when T1 outputs 1 than

when T1 outputs 0. This shows that a measure of relatedness at the classification level can

be helpful.

4.3 Experiment 2: Unrelated Tasks

The previous experiment concerned only related secondary tasks. However, it is

possible that background knowledge contains some tasks that are unrelated to the primary

task. In this experiment, we examine the method’s ability to mitigate the transfer of

negative inductive bias to a primary task from an unrelated secondary task.


4.3.1 Tasks

T0 of the Bitmap domain is the primary task and T2 is the previously learned

secondary task. T0 has 200 training instances and a test set of 800 instances. T2 was

previously trained to a 0.85 level of accuracy using 1000 instances. Based on the

discussion of Section 4.1, we consider that T2 is totally unrelated to T0. This is because T2

and T0 do not use the same input attributes.

4.3.2 Method

The experiment consisted of 10 repeated trials where each trial had the following

steps:

1. Generate random training and test sets for T0

2. Train the kNN system by loading the training instances.

3. Test the generalization accuracy of kNN for T0 using the test set with k =

3, 5, 7

4. Transfer knowledge from T2 by generating virtual instances for T0. The

acceptance threshold of relatedness was set to 0.001 based on preliminary

testing.

5. Test the generalization accuracy of kNN for T0 using the test set with k =

3, 5, 7

4.3.3 Results

Table 3 and Figure 13 show that the knowledge transfer method ensures that prior

knowledge from T2 does not adversely affect the learning of T0. When k = 3 and k = 5,


the generalization accuracy of T0 before and after the knowledge transfer remains the

same. This indicates that T2 does not affect the learning of T0 at all. When k = 7, we still

cannot conclude statistically that the knowledge from T2 generates negative bias to T0 (p

= 0.721). In summary, the experiment shows that the unrelated task T2 did not affect the

learning of T0 negatively.

Table 3. Results of Experiment 2. The generalization accuracy of

T0 before and after the knowledge transfer with k = 3, 5 and 7

Accuracy        T0 on its own                    T0 with T2’s knowledge
Trials    k = 3      k = 5      k = 7      k = 3      k = 5      k = 7
1         0.729089   0.705368   0.716604   0.729089   0.705368   0.710362
2         0.675406   0.686642   0.677903   0.675406   0.686642   0.691635
3         0.701623   0.670412   0.699126   0.701623   0.670412   0.699126
4         0.667915   0.675406   0.66417    0.667915   0.675406   0.670412
5         0.697878   0.694132   0.70412    0.697878   0.694132   0.70412
6         0.699126   0.685393   0.677903   0.699126   0.685393   0.677903
7         0.672909   0.691635   0.692884   0.672909   0.691635   0.68789
8         0.696629   0.701623   0.697878   0.696629   0.701623   0.694132
9         0.696629   0.701623   0.697878   0.696629   0.701623   0.694132
10        0.705368   0.690387   0.682896   0.705368   0.690387   0.695381
Stdev     0.018086   0.011327   0.015349   0.018086   0.0113272  0.011712
95%Conf   0.01121    0.007021   0.009513   0.01121    0.0070205  0.007259
Mean      0.694257   0.690262   0.691136   0.694257   0.6902621  0.692509


Figure 13. Results of Experiment 2. Mean generalization accuracy

of T0 before and after the knowledge transfer.


4.4 Experiment 3: Variation in Transfer from More and Less

Related Tasks

This experiment examines the transfer of knowledge from two related tasks to the

primary task, where one secondary task is more related to the primary task than the other.

We expect the more related secondary task to benefit the primary task the most by

generating the greater positive inductive bias. This should result in better generalization

accuracy for the primary task.

4.4.1 Tasks

T3 of the Bitmap domain is the primary task. T0 and T1 are used as the previously

learned secondary tasks. T3 has 200 training examples and a test set of 800 instances. T1

and T0 were previously trained using 1000 instances for each task. Based on the

discussion of the Bitmap domain in Section 4.1, T3 is considered more related to T0 than

to T1.

4.4.2 Method

The experiment consisted of 10 repeated trials where each trial had the following

steps:

1. Generate random training and test sets for T3

2. Train the kNN system by loading the training instances.

3. Test the generalization accuracy of kNN for T3 using the test set with k =3

4. Transfer knowledge from T0 by generating virtual instances for T3. The

acceptance threshold of relatedness was set to 0.001 based on preliminary


testing.

5. Test the generalization accuracy of kNN for T3 using the test set with k =3

6. Transfer knowledge from T1 by generating virtual instances for T3. The

acceptance threshold of relatedness was set to 0.001 based on preliminary

testing.

7. Test the generalization accuracy of kNN for T3 using the test set with k =3

4.4.3 Results

The results shown in Table 4 and Figure 14 show that both secondary tasks

improve the generalization accuracy of T3. T0 provides the most positive inductive bias to

T3’s hypotheses with an accuracy of 0.788 (p < 0.001) as compared to the hypotheses

developed with the aid of T1 with an accuracy of 0.771 (p = 0.066). We conclude that

knowledge transferred from the more related task, T0, is of greater value than that of T1.

Table 4. Results of Experiment 3. The generalization accuracy of

T3 before and after the knowledge transfer with k = 3

Trials    T3 alone   T3 with T0   T3 with T1
1         0.781523   0.779026     0.7603
2         0.796504   0.80774      0.771536
3         0.772784   0.801498     0.78402
4         0.735331   0.780275     0.775281
5         0.750312   0.787765     0.775281
6         0.769039   0.792759     0.787765
7         0.750312   0.771536     0.74407
8         0.735331   0.787765     0.774032
9         0.751561   0.791511     0.771536
10        0.750312   0.776529     0.76779
Stdev     0.020016   0.011297     0.012233
95%Conf   0.012406   0.007002     0.007582
Mean      0.759301   0.78764      0.771161



Figure 14. Results of Experiment 3. Mean generalization accuracy

of T3 before and after the knowledge transfer from either T0 or T1.

4.5 Experiment 4: Knowledge Transfer from Multiple Tasks

Previous experiments focused on transferring knowledge from one secondary task

to a primary task. In this experiment, we examine the effect of transferring knowledge

from several secondary tasks to a primary task, where the secondary tasks vary in their

degree of relatedness.

4.5.1 Tasks

T3 of the Bitmap domain is the primary task. T0, T1 and T2 in the same task

domain are the previously learned secondary tasks. T3 has 200 training instances and a test set of 800 instances. T0, T1 and T2 were previously trained using 1000 instances for each task to an accuracy of 0.85. We expect that T3 will receive a net benefit from the transfer,

because the method will promote positive inductive bias from related tasks and mitigate

the negative inductive bias from unrelated tasks.


4.5.2 Method

The experiment consisted of 10 repeated trials where each trial had the following

steps:

1. Generate random training and test sets for T3

2. Train the kNN system by loading the training instances.

3. Test the generalization accuracy of kNN for T3 using the test set with k = 3

4. Transfer knowledge from T0, T1 and T2 by generating virtual instances for T3. The acceptance threshold of relatedness was set to 0.001 based on preliminary testing.

5. Test the generalization accuracy of kNN for T3 using the test set with k = 3

4.5.3 Results

The results shown in Table 5 and Figure 15 show that the generalization accuracy of T3 is improved after transferring knowledge from all secondary tasks (p < 0.001). In

addition, there is also some evidence to suggest that knowledge transfer from all three

tasks improved the generalization accuracy of T3 more than the knowledge transfer from

just T0 (p = 0.162).

Table 5. Results of Experiment 4. The generalization accuracy of

T3 before and after the knowledge transfer with k = 3

Trials    T3 alone   With ALL   With T0
1         0.796504   0.815231   0.815231
2         0.762797   0.826467   0.803995
3         0.771536   0.787765   0.799001
4         0.765293   0.795256   0.791511
5         0.741573   0.780275   0.774032
6         0.751561   0.801498   0.789014
7         0.742821   0.813983   0.812734
8         0.765293   0.792759   0.790262
9         0.746567   0.766542   0.781523
10        0.776529   0.820225   0.805243
Stdev     0.01715    0.019068   0.01334
95%Conf   0.010629   0.011819   0.008268
Mean      0.762047   0.8        0.796255


Figure 15. Results of Experiment 4. Mean generalization accuracy

of T3 before and after the knowledge transfer

4.6 Experiment 5: The Error of Estimation

The results of Experiment 1 show that knowledge transferred from T1 can benefit

the learning of T0. However, all results in the previous experiments were based on a

comparatively accurate estimation of CPDT. In this experiment we shrink the number of

training instances of T0 so that the accuracy of estimation of CPDT(T0 | T1) is reduced.

We are interested in the impact on knowledge transfer based on a less accurate estimation

of CPDT.

4.6.1 Tasks and Method

This experiment uses T0 of the Bitmap domain as the primary task and T1 as the

previously learned secondary task. T0 has 100 training instances and a test set of 900

instances. T1 was previously trained using 1000 instances. The same method used in

Experiment 1 was used here.

4.6.2 Results

The results shown in Table 6 and Figure 16 indicate no statistically significant improvement in generalization accuracy (p = 0.119, 0.265 and 0.088 for k = 3, 5 and 7 respectively). The positive effect of knowledge transfer is not as pronounced as in Experiment 1.

Table 6. Results of Experiment 5. The generalization accuracy of

T0 before and after the knowledge transfer with k = 3, 5 and 7

Accuracy        T0 on its own                    T0 with T1’s knowledge
Trials    k = 3      k = 5      k = 7      k = 3      k = 5      k = 7
1         0.657048   0.677026   0.657048   0.63263    0.63929    0.624861
2         0.648169   0.653718   0.624861   0.675916   0.682575   0.684795
3         0.613762   0.594895   0.588235   0.627081   0.617092   0.620422
4         0.645949   0.619312   0.625971   0.645949   0.641509   0.624861
5         0.594895   0.618202   0.620422   0.586016   0.586016   0.594895
6         0.63374    0.618202   0.613762   0.63374    0.618202   0.629301
7         0.660377   0.663707   0.661487   0.673696   0.674806   0.682575
8         0.653718   0.655938   0.6404     0.669256   0.670366   0.665927
9         0.642619   0.648169   0.642619   0.653718   0.661487   0.662597
10        0.617092   0.621532   0.604883   0.628191   0.625971   0.614872
Stdev     0.02156    0.026055   0.022907   0.027387   0.0307168  0.030951
95%Conf   0.013363   0.016149   0.014197   0.016974   0.0190381  0.019183
Mean      0.636737   0.63707    0.627969   0.642619   0.6417314  0.640511



Figure 16. Results of Experiment 5. Mean generalization accuracy

of T0 before and after the knowledge transfer.

To explain the less effective knowledge transfer in this experiment, we examined

the measure of relatedness between T0 and T1, which was estimated during the process of

knowledge transfer. For trial #1 of this experiment, Rel(T0 | T1 = 0) = 0.0242 and Rel(T0 |

T1 = 1) = 0.0104, meaning that T1 is more related to T0 when T1 outputs class 0 than when

T1 outputs class 1. This is inconsistent with the results of Experiment 1. This is because

the 100 training instances of the primary task do not estimate the relatedness to the secondary

task as accurately as the 200 training instances of Experiment 1. We will examine this

further in Chapter 5.

4.7 Experiment 6: A Real-World Domain: Character Recognition

The experiment is dedicated to showing the system’s capability of using

maximally related prior task knowledge. The primary task is deliberately supplied with

insufficient training examples to be learned well under standard kNN.


4.7.1 Dataset

A small portion of the 2000-instance "Letter Recognition" dataset from the UCI Machine Learning Repository2 is used to train and test the system. There are 26 class values, each representing an English letter. Every instance of this dataset has 16 numerical input attributes, which are features extracted from raw images of characters, and a class target value indicating one of the 26 English letters. T0, the primary task, has 200 training instances, randomly selected from the original dataset. The remaining 1800 instances are used as the test set. The secondary task, T1, was trained using all 2000 training instances from another "Letter Recognition" dataset, also from the UCI Machine Learning Repository. Both T0 and T1 identify capitalized English characters from the same 16 input attributes.

4.7.2 Method

The experiment consisted of 5 repeated trials, where each trial had the following steps:

1. Generate random training and test sets for T0.

2. Train the kNN system by loading the training instances.

3. Test the generalization accuracy of kNN for T0 using the test set with k = 3, 5, 7.

4. Transfer knowledge from T1 by generating virtual instances for T0. The acceptance threshold of relatedness was set to 0.001 based on preliminary testing.

5. Test the generalization accuracy of kNN for T0 using the test set with k = 3, 5, 7.

2 http://www.ics.uci.edu/~mlearn/MLRepository.html
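The trial procedure above can be sketched as follows. This is an illustrative stand-in using a generic unweighted kNN; the thesis's own kNN system, and the virtual-instance transfer of step 4, are not reproduced here:

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote of its k nearest training
    instances; `train` is a list of (features, label) pairs."""
    # Squared Euclidean distance preserves neighbour ordering.
    nearest = sorted(
        train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], query)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def run_trial(data, n_train, k):
    """Steps 1-3 of one trial: random split, load instances, test accuracy.
    Step 4 (virtual-instance transfer from T1) is omitted in this sketch."""
    random.shuffle(data)
    train, test = data[:n_train], data[n_train:]
    return sum(knn_predict(train, x, y_k := k) == y for x, y in test) / len(test)
```

Running `run_trial` with n_train = 200 on a 2000-instance dataset mirrors the 200/1800 split described above; repeating it 5 times for each of k = 3, 5, 7 yields the "Without T1" columns of Table 7.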

4.7.3 Results

The results in Table 7 and Figure 17 show that the generalization accuracy of T0 after knowledge transfer from T1 is statistically significantly higher (p < 0.001) than the generalization accuracy of T0 on its own.

Table 7. Results of Experiment 6. The generalization accuracy of T0 before and after the knowledge transfer with k = 3, 5 and 7

        Without T1                      With T1
Trials  k = 3    k = 5    k = 7    k = 3     k = 5     k = 7
1       0.485841 0.461966 0.446419 0.781233  0.760689  0.745697
2       0.454192 0.409772 0.378679 0.739589  0.720711  0.689617
3       0.504720 0.466963 0.443642 0.782898  0.766796  0.736258
4       0.475292 0.448640 0.415325 0.756802  0.762909  0.752915
5       0.513604 0.464742 0.428651 0.764020  0.759023  0.735147
Stdev   0.023644 0.023809 0.027517 0.018012  0.018850  0.024746
95%Conf 0.020725 0.020869 0.024119 0.015788  0.016522  0.021690
Mean    0.486730 0.450417 0.422543 0.764908  0.754026  0.731927
Lower   0.466005 0.429547 0.398424 0.749121  0.737503  0.710237
Upper   0.507454 0.471286 0.446662 0.780696  0.770548  0.753617


[Figure: bar chart titled "Experiment 6"; y-axis "Accuracy" (0 to 0.9), x-axis k = 3, 5, 7; series "T0 on its own" and "T0 with T1".]

Figure 17. Results of Experiment 6. Generalization accuracy of T0 before and after the knowledge transfer.

As a comparison, the generalization accuracy of T0 trained with 1600 training instances from the original "Letter Recognition" dataset was around 80%. This is 5% better than the mean generalization accuracy of T0 with transfer from T1 when k = 3. However, one must keep in mind that this result was generated with 8 times the number of training examples.

The slightly lower accuracy under knowledge transfer from T1 can be attributed to the estimated CPDT. For instance, for test trial 1, CPDT(T0 | T1 = B) = { {B, .56}, {G, .11}, {Q, .11}, {R, .22}, {A, 0}, …, {Z, 0} }, whereas the ideal CPDT(T0 | T1 = B) would be { {B, 1}, {A, 0}, …, {Z, 0} }.

4.8 Discussion

Note that the task domain used in Experiment 6 contained a related task T0, a less related task T1 and an unrelated task T2. Therefore, Experiment 6, along with all other experiments, showed that knowledge transfer using conditional probability distributions can selectively transfer the more beneficial knowledge to the primary task, and that transfer increases the performance of the resulting hypothesis.

The experiments also showed a limitation of the method. The effectiveness of knowledge transfer in this kNN method depends on the accuracy of the CPDT estimation. Because the CPDT estimation introduces some error, knowledge transfer using conditional probability distributions can impose an inaccurate bias on the primary task if the training examples are not sufficient to correctly measure the relatedness between the two tasks. We will discuss this matter in more detail in Chapter 5.


Chapter 5

Conclusion

This thesis presents a theory of selective knowledge transfer using conditional probability distributions in the context of kNN. An implementation of the theory was developed and tested on a synthetic domain of tasks and a real-world character recognition domain. The results of several experiments indicate that the theory has merit. This chapter concludes with a discussion of the major contributions of the research, some known limitations, and suggestions for future work.

5.1 Major Contributions

This research makes several contributions to machine learning.

5.1.1 A new functional measure of relatedness for kNN based on virtual instances

All knowledge transfer methods previously proposed for kNN are based on structural similarity, most notably the similarity of the weight vectors of the distance metric. The method proposed in this thesis develops a measure of relatedness based on the conditional probability distribution of virtual instances created from secondary tasks.

5.1.2 Measure of relatedness is at the classification level

All previously proposed measures of relatedness are at the task level. This research has created a new measure of relatedness at the classification level, so that the resulting sense of relatedness between tasks is more detailed and can be exploited more effectively. This is a first step towards an instance-level measure of relatedness, which would measure the relatedness of every single piece of knowledge from the secondary task with respect to the primary task.

5.1.3 Tolerance to unrelated tasks and scaling

The results of Experiment 3 show that the new method is tolerant of negative inductive bias from unrelated tasks. Because the number of virtual instances in the k nearest neighbourhood is limited, the proposed method is also tolerant to increases in the number of secondary tasks.

5.2 Limitations

5.2.1 Conditional Probability Estimation Errors

As discussed in Section 4.1, the errors introduced when estimating the CPDT between tasks limit the maximum generalization accuracy improvement that can be achieved by knowledge transfer. As shown in Section 4.4, the error of the CPDT estimation negatively affects the learning of the primary task. Two sources of this estimation error have been identified.

The deficiency of training examples in T0. The results of Section 4.4 show that if the number of training instances in T0 is not sufficient, the estimation of conditional probabilities will not be accurate. Unfortunately, a deficiency of training examples for the primary task is unavoidable, as this is precisely the scenario in which knowledge transfer is most needed.


The accuracy of secondary tasks. The estimation of the CPDT is based on how secondary tasks classify the training examples of the primary task. If a secondary task is not sufficiently accurate, the estimation of the CPDT will be unable to reflect the true relationship between the secondary tasks and the primary task. One approach to minimizing this error is to use the accuracy of the secondary task as part of the calculation of the relatedness of the virtual instances. Less accurate secondary tasks should have less effect on the accuracy of T0's hypothesis.
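The suggested fix might look like the following sketch, where `discount_virtual_instances` is a hypothetical helper that scales each virtual instance's class probability by the secondary task's accuracy before applying the acceptance threshold (0.001, as in Experiment 6):

```python
def discount_virtual_instances(cpd_row, secondary_accuracy, threshold=0.001):
    """Hypothetical sketch: weight each class probability from the CPDT
    by the secondary task's own test accuracy, so that a weaker secondary
    task contributes weaker virtual instances, then drop any that fall
    below the acceptance threshold."""
    weighted = {y: p * secondary_accuracy for y, p in cpd_row.items()}
    return {y: w for y, w in weighted.items() if w >= threshold}

# A secondary task that is only 80% accurate contributes proportionally
# weaker virtual instances for every class it predicts.
kept = discount_virtual_instances({'B': 0.56, 'R': 0.22, 'Z': 0.0}, 0.8)
```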

5.2.2 Relatedness Based on Sub-spaces of the Input Space

With the proposed method, the relatedness between two tasks is measured at the classification level; instances of a secondary task sharing the same class value will generate virtual instances with the same output probability distribution. However, in many cases, tasks are only related within certain sub-spaces of the input attribute space. For example, task T1 may only be similar to task T0 when the first attribute is less than 0.5 and the second attribute is larger than 0.1. Using the method presented in this thesis, a highly related sub-region is somewhat neutralized by other less related sub-regions that share the same class value. Thus relatedness measured at the sub-space level is necessary to capture a more refined sense of functional relatedness between tasks.

Knowledge transfer at the sub-space level might be approached as follows:

1. Divide the input space of T1 into n sub-regions. Each sub-region can have a different size, depending upon the distribution of training instances.

2. For each sub-region Ri, estimate the probability distribution of T0 given that T1 outputs each class value and the query point is in Ri.

3. For each sub-region Ri, generate virtual instances in T0 from instances that are in Ri of T1 by applying the probability distributions calculated in step 2 for Ri.
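The first two steps above can be sketched as follows, using a uniform grid over attributes assumed to lie in [0, 1] as the simplest possible partitioning (the thesis leaves the choice of partitioning open; `grid_region` and `subspace_cpd` are hypothetical names):

```python
from collections import Counter, defaultdict

def grid_region(x, bins=2):
    """Step 1 (one simple option): assign an instance to a sub-region by
    bucketing each attribute on a uniform grid over [0, 1]."""
    return tuple(min(int(v * bins), bins - 1) for v in x)

def subspace_cpd(primary, secondary_predict, bins=2):
    """Step 2: estimate P(T0 = y | T1 = c, region R) from T0's training
    set, where `secondary_predict` is T1's classifier."""
    counts = defaultdict(Counter)
    for x, y in primary:
        counts[(grid_region(x, bins), secondary_predict(x))][y] += 1
    return {key: {y: n / sum(c.values()) for y, n in c.items()}
            for key, c in counts.items()}
```

Step 3 would then draw virtual instances from T1's training set region by region, applying the matching conditional distribution instead of a single task-wide one.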

Note that when n is large enough, the relatedness is actually measured at the instance level. The major research question in this direction is finding a way to divide the input space so that each sub-space provides a comparatively accurate estimation of the conditional probabilities.

5.3 Other Suggestions for Future Work

Apart from the possible extensions mentioned in Sections 5.2.1 and 5.2.2, there are other variations that may improve the effectiveness of knowledge transfer.

Weighted distance. The ND-kNN algorithm can easily be extended to a weighted-distance version, in which each virtual instance is further weighted by its distance to the query instance. The nearer a virtual instance is to a query instance, the more strongly it affects the classification. One would have to guard against the over-amplification of virtual instances by placing a limit on the maximum distance weight.
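A distance-weighted variant with the suggested cap might be sketched as follows; `max_weight` is a hypothetical parameter, and the actual ND-kNN extension may differ:

```python
from collections import defaultdict

def weighted_knn_predict(train, query, k, max_weight=10.0):
    """Distance-weighted kNN sketch: each of the k nearest neighbours
    votes with weight 1/d^2, capped at `max_weight` so an instance that
    lands on top of the query (such as a virtual instance) cannot
    dominate the vote. `train` is a list of (features, label) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = defaultdict(float)
    for x, label in nearest:
        d = dist(x, query)
        votes[label] += min(max_weight, 1.0 / (d * d)) if d > 0 else max_weight
    return max(votes, key=votes.get)
```

Capping the weight is the simplest guard against over-amplification; a smoother alternative would be a kernel such as exp(-d^2), which is bounded by construction.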

Duplicated instances. In Section 3.3.1, a naïve method is adopted to deal with duplicated instances, which are instances from two or more tasks that share the same set of input attributes. Ideally, an approach similar to that used in Bayesian networks should be applied when calculating the conditional probabilities. For example, more complex conditional probabilities would have to be calculated, such as P(T0 | T1 = + ∩ T2 = -). The major research question is to find a fast way to estimate these conditional probabilities with reasonable accuracy.

Density of virtual instances. kNN derives decision boundaries from training examples. One important factor that greatly affects the shape of the decision boundary is the density of instances. If the existing decision boundary happens to be the optimal one, one more example may decrease the generalization accuracy; this is similar to overtraining in an ANN. To accommodate this situation, the virtual instances could be generated in such a way that the density of instances is constant throughout the input space. Other techniques, such as model-based kNN (Guo, Wang, Bell, Bi, & Greer, 2003), could also help.

Combining structural and functional measures of relatedness. The measures of relatedness suggested by previous research capture the structural similarity between two kNN tasks, while the CPDT captures their functional similarity. It seems important to consider both when transferring knowledge between tasks. Future research could investigate a combination of these two methods.


References

Caruana, R. A. (1993). Multitask Connectionist Learning. Paper presented at the Connectionist Models Summer School, School of Computer Science, Carnegie Mellon University.

Caruana, R. A. (1997). Multitask Learning. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA.

Devore, J. L. (2004). Probability and Statistics for Engineering and the Sciences. Belmont, CA: Brooks/Cole -Thomson Learning.

Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). KNN Model-Based Approach in Classification. Paper presented at the International Conference on Ontologies, Databases and Applications of Semantics, Catania, Sicily (Italy).

Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36, 177-221.

Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.

Niyogi, P., & Girosi, F. (1994). On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions (Technical Report No. AIM-1467).

Robins, A. V. (1996). Transfer in Cognition. In L. Pratt (Ed.), Connection Science Special Issue: Transfer in Inductive Systems (Vol. 8, pp. 185-203). Cambridge, MA: Carfax Publishing Company.

Silver, D. L. (2000). Selective transfer of neural network task knowledge. PhD Thesis, Faculty of Graduate Studies, University of Western Ontario, London, Ont.

Thrun, S. (1995). Lifelong Learning: A Case Study (Technical Report No. CMU-CS-95-208). Pittsburgh, PA: Carnegie Mellon University, Computer Science Department.

Thrun, S., & O'Sullivan, J. (1995). Clustering learning tasks and selective cross-task transfer of knowledge (Technical Report No. CMU-CS-95-209). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.

Valiant, L. G. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142.

Vosniadou, S., & Ortony, A. (1989). Similarity and Analogical Reasoning: A Synthesis. In S. Vosniadou & A. Ortony (Eds.), Similarity and Analogical Reasoning. New York: Cambridge University Press.
