
Transfer Learning: Fundamentals and Applications

Sinno Jialin Pan (Ph.D.), Nanyang Assistant Professor,
School of Computer Science and Engineering, NTU, Singapore
Homepage: http://www.ntu.edu.sg/home/sinnopan

Outline

• What is transfer learning?
• Transfer learning methodologies
  – Instance-based, feature-based, parameter-based, and relational approaches
• Transfer learning applications
  – Sentiment analysis, WiFi localization, software defect prediction
• Future directions

Transfer of Learning: Psychological Point of View

• The study of the dependency of human conduct, learning, or performance on prior experience.
  – [Thorndike and Woodworth, 1901] explored how individuals transfer learning in one context to another context that shares similar characteristics.

Transfer Learning: Machine Learning Community

• Inspired by the human ability to transfer learning
• The ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks that share some commonality

A Motivating Example

• Goal: train a robot to accomplish Task 𝑻1 in an indoor environment 𝑬1 using machine learning techniques
• Sufficient training data is required: sensor readings to measure the environment, as well as human supervision, i.e., labels
• A predictive model can be learned and then used in the same environment

(Figure: Task 𝑻1 in environment 𝑬1)

Background Review

At each timestamp 𝑖, five sensors report signal strengths (e.g., -30dbm, -12dbm, -50dbm, -44dbm, -99dbm), which gives:

• Input feature vector: 𝒙𝑖 ∈ ℝ^(5×1)
• Corresponding label: 𝑦𝑖
• A labeled instance: (𝒙𝑖, 𝑦𝑖)
• A set of labeled instances: {(𝒙𝑖, 𝑦𝑖) | 𝑖 = 1, …, 𝑁}
• A predictive model: 𝑓: 𝒙 → 𝑦, or 𝑃(𝑦|𝒙)
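To make this notation concrete, here is a minimal sketch in Python (the synthetic readings, the rule generating the toy labels, and the use of scikit-learn are all illustrative assumptions, not part of the tutorial):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A set of N labeled instances {(x_i, y_i) | i = 1, ..., N}: each x_i in R^5
# holds one signal-strength reading (dBm) per sensor at timestamp i.
X = rng.normal(loc=-50.0, scale=20.0, size=(100, 5))
y = (X.mean(axis=1) > -50.0).astype(int)   # toy labels, for illustration only

# A predictive model f: x -> y, realized here as P(y|x) via logistic regression.
f = LogisticRegression().fit(X, y)
print(f.predict(X[:3]))          # hard labels f(x)
print(f.predict_proba(X[:3]))    # estimated P(y|x)
```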

A Motivating Example (cont.)

• Limitations of traditional machine learning techniques:
  1) Performance relies heavily on whether sufficient labeled data is available to build a predictive model
  2) When the environment changes (e.g., a new domain or a new task), the learned predictive model performs poorly
• A new predictive model has to be rebuilt from scratch

When the environment changes to 𝑬2, the task changes to 𝑻2, or a new robot is used: train the robot from scratch? Expensive and time-consuming!

Transfer Learning (cont.)

• Traditional assumption: training and test data are
  – represented in the same feature space, AND
  – drawn from the same data distribution
• In practice: training and test data come from different domains, i.e., they are
  – represented in different feature spaces, OR
  – drawn from different data distributions

What if machines had the ability to transfer learning? Instead of directly applying the source model 𝒇𝑺 to the target task, it could be adaptively transferred to obtain the target model 𝒇𝑻.

Transfer Learning (cont.)

(Figure) Traditional machine learning vs. transfer learning: in traditional machine learning, each domain (A, B, C) has its own training and test data, and a separate model is learned per domain; in transfer learning, knowledge learned from the training domains is reused to help learning in a different test domain.

Transfer Learning (cont.)

• Given a target domain/task, transfer learning aims to
  1) identify the commonality between the target domain/task and previous domains/tasks, and
  2) transfer knowledge from the previous domains/tasks to the target one, so that human supervision on the target domain/task can be dramatically reduced.

(Figure) Source domain/task data (sufficient labeled training data) and target domain/task data (unlabeled, or with only a few labeled examples) feed into transfer learning algorithms, which output predictive models; the models are then tested on target domain/task data.

Other Motivating Examples

• WiFi localization: signal strength changes a lot over different time periods, or across different mobile devices (the contours of signal-strength values differ between time period A and time period B, and between device A and device B).

Average error distance:
• Localization model: ~1.5 meters
• Localization model w/o transfer learning: ~10 meters
• Localization model w/ transfer learning: ~3-4 meters

Other Motivating Examples (cont.)

• Sentiment analysis: users may use different sentiment words across different domains.

Classification accuracy:
• Sentiment classifier: ~82%
• Sentiment classifier w/o transfer learning: ~70%
• Sentiment classifier w/ transfer learning: ~77%

Product reviews in different domains:

Electronics:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never buy HP again.

Video Games:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

TL for Different ML Problems

• Transfer learning for reinforcement learning
  [Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]
• Transfer learning for classification/regression
  [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
  [Pan, Transfer Learning, Chapter 21, Data Classification: Algorithms and Applications, 2014]

Transfer Learning Settings

By feature space, transfer learning splits into:
• Heterogeneous transfer learning (different feature spaces)
• Homogeneous transfer learning (same feature space), which by label availability further splits into:
  – Unsupervised transfer learning
  – Semi-supervised transfer learning
  – Supervised transfer learning

Transfer Learning Approaches

Instance-based Approaches

Feature-based Approaches

Parameter-based Approaches

Relational Approaches

Instance-based TL Approaches

General assumption: the source and target domains have many overlapping features, i.e., the domains share the same or similar support (the feature spaces 𝓧𝑺 and 𝓧𝑻 largely overlap).

Instance-based TL Approaches (cont.)

Instance-based TL approaches cover two cases, Case I and Case II, each with its own problem setting and assumption (detailed on the following slides).

Instance-based Approaches: Case I

Given a target task, start from the learning framework of empirical risk minimization: the expected target risk can be rewritten over the source distribution,

  𝔼_(𝑥∼𝑃_𝑇)[ℓ(𝑥, 𝑦, 𝑓)] = 𝔼_(𝑥∼𝑃_𝑆)[(𝑃_𝑇(𝑥)/𝑃_𝑆(𝑥)) ℓ(𝑥, 𝑦, 𝑓)],

so labeled source instances can be reused after re-weighting each one by the density ratio 𝑃_𝑇(𝑥)/𝑃_𝑆(𝑥).

Assumption: 𝑃_𝑆(𝑦|𝑥) = 𝑃_𝑇(𝑦|𝑥), while the marginal distributions 𝑃_𝑆(𝑥) and 𝑃_𝑇(𝑥) differ.

Instance-based Approaches: Case I (cont.)

Correcting sample selection bias / covariate shift [Quionero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]

Assumption: the sample selection bias is caused by the data generation process.

Instance-based Approaches: Correcting Sample Selection Bias

• Imagine a rejection sampling process, and view the source domain as samples drawn (with selection bias) from the target domain
• The distribution of the selector variable maps the target distribution onto the source distribution
• Practical recipe [Zadrozny, ICML-04]:
  – Label instances from the source domain with 1
  – Label instances from the target domain with 0
  – Train a binary classifier and use its predicted probabilities to re-weight the source instances, as sketched below
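A minimal sketch of this recipe, assuming scikit-learn (logistic regression is one convenient, but not the only, choice of binary classifier; the prior-ratio correction for unequal sample sizes is standard but spelled out here as an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def selection_bias_weights(X_src, X_tgt):
    """Estimate w(x) = P_T(x)/P_S(x) on the source instances by training a
    domain classifier: source instances labeled 1, target instances labeled 0."""
    X = np.vstack([X_src, X_tgt])
    d = np.r_[np.ones(len(X_src)), np.zeros(len(X_tgt))]
    clf = LogisticRegression().fit(X, d)
    p_src = clf.predict_proba(X_src)[:, 1]       # P(domain = source | x)
    prior = len(X_src) / len(X_tgt)              # corrects for unequal sample sizes
    return prior * (1.0 - p_src) / p_src         # density-ratio estimate per instance

# The returned weights can be passed as sample_weight when training the
# target-task model on the source labeled data.
```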

Instance-based Approaches: Kernel Mean Matching (KMM)

KMM builds on the Maximum Mean Discrepancy (MMD) [Alex Smola, Arthur Gretton and Kenji Fukumizu, ICML-08 tutorial].

Representing Distributions in RKHS

Given 𝑋 ∼ 𝑃(𝑥), can we use a vector 𝜇_𝑋 to represent the distribution?

• 𝜇_𝑋 = (𝔼[𝑋]): the first moment alone is too coarse
• 𝜇_𝑋 = (𝔼[𝑋], 𝔼[𝑋²])
• 𝜇_𝑋 = (𝔼[𝑋], 𝔼[𝑋²], 𝔼[𝑋³])
• 𝜇_𝑋 = (𝔼[𝑋], 𝔼[𝑋²], 𝔼[𝑋³], …): the full moment sequence characterizes the distribution, but it is infinite and cannot be explicitly computed!

What about 𝜇_𝑋 ∈ ℱ (an RKHS)? The kernel trick can be applied!

Mean Map in RKHS

Suppose 𝑋 ∼ 𝑃(𝑥), and denote 𝑘(𝑥, ⋅) = 𝜙(𝑥). The mean map embeds the distribution 𝑃(𝑥) into the RKHS ℋ:

  𝜇_𝑃 = 𝔼_(𝑥∼𝑃(𝑥))[𝜙(𝑥)], i.e., 𝜇_𝑃: 𝑃 → ∫ 𝜙(𝑥) d𝑃(𝑥)

The empirical mean map for a sample {𝑥_𝑖}, 𝑖 = 1, …, 𝑛:

  𝜇̂_𝑃 = (1/𝑛) Σ_(𝑖=1)^𝑛 𝜙(𝑥_𝑖) = (1/𝑛) Σ_(𝑖=1)^𝑛 𝑘(𝑥_𝑖, ⋅)

Mean Map in RKHS (cont.)

(Figure) Samples 𝑥_𝑆1, …, 𝑥_𝑆4 ∼ 𝑃_𝑆(𝑥) and 𝑥_𝑇1, …, 𝑥_𝑇4 ∼ 𝑃_𝑇(𝑥) are mapped by 𝜙 into the RKHS; the distance between the two empirical means 𝜇_(𝑃_𝑆) and 𝜇_(𝑃_𝑇) defines Dist(𝑋_𝑆, 𝑋_𝑇).

Instance-based Approaches: Kernel Mean Matching (KMM) (cont.)

KMM re-weights source instances by weights 𝛽 that minimize the MMD between the re-weighted source sample and the target sample in the RKHS [Huang et al., NIPS-06]:

  min_𝛽 ‖ (1/𝑛_𝑆) Σ_(𝑖=1)^(𝑛_𝑆) 𝛽_𝑖 𝜙(𝑥_𝑆𝑖) − (1/𝑛_𝑇) Σ_(𝑖=1)^(𝑛_𝑇) 𝜙(𝑥_𝑇𝑖) ‖²_ℋ
  s.t. 𝛽_𝑖 ∈ [0, 𝐵], |Σ_𝑖 𝛽_𝑖 − 𝑛_𝑆| ≤ 𝑛_𝑆 𝜖

The resulting optimization is a simple QP problem.
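A minimal sketch of that QP, assuming cvxpy (the `cp.psd_wrap` call, available in recent cvxpy releases, just marks the kernel matrix as PSD for the solver; `B` and `eps` are the usual box and normalization parameters of KMM):

```python
import numpy as np
import cvxpy as cp

def kmm_weights(K_ss, K_st, B=10.0, eps=0.01):
    """Kernel Mean Matching sketch [Huang et al., NIPS-06].
    K_ss: (nS x nS) kernel among source points; K_st: (nS x nT) kernel between
    source and target points. Returns one weight beta_i per source instance."""
    n_s, n_t = K_st.shape
    kappa = (n_s / n_t) * K_st.sum(axis=1)       # from expanding the squared MMD
    beta = cp.Variable(n_s)
    obj = cp.Minimize(0.5 * cp.quad_form(beta, cp.psd_wrap(K_ss)) - kappa @ beta)
    cons = [beta >= 0, beta <= B,                     # bounded weights
            cp.abs(cp.sum(beta) - n_s) <= n_s * eps]  # weights stay near a distribution
    cp.Problem(obj, cons).solve()
    return beta.value
```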

Instance-based Approaches: Direct Density Ratio Estimation

Estimate the density ratio 𝑃_𝑇(𝑥)/𝑃_𝑆(𝑥) directly, e.g., under a KL-divergence loss [Sugiyama et al., NIPS-07] or a least-squares loss [Kanamori et al., JMLR-09].
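A sketch of the least-squares variant (an unconstrained uLSIF-style fit; the RBF basis, the choice of `centers`, e.g., a subsample of target points, and the `gamma`/`lam` values are all illustrative assumptions):

```python
import numpy as np

def lsif_weights(X_src, X_tgt, centers, gamma=1.0, lam=0.1):
    """Least-squares density-ratio estimation in the spirit of
    [Kanamori et al., JMLR-09]: model w(x) = P_T(x)/P_S(x) as a linear
    combination of RBF basis functions and fit the coefficients in closed form."""
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    Phi_s = rbf(X_src, centers)            # (nS x b) basis evaluated on source
    Phi_t = rbf(X_tgt, centers)            # (nT x b) basis evaluated on target
    H = Phi_s.T @ Phi_s / len(X_src)       # approximates E_S[phi(x) phi(x)^T]
    h = Phi_t.mean(axis=0)                 # approximates E_T[phi(x)]
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return Phi_s @ alpha                   # estimated ratio on the source points
```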

Instance-based Approaches: Case II

• Recall: in Case II, a few labeled data are available in the target domain.
• Intuition: part of the labeled data in the source domain can be reused in the target domain after re-weighting, based on their contributions to the classification accuracy of the learning problem in the target domain.

Instance-based Approaches: Case II (cont.)

TrAdaBoost [Dai et al., ICML-07]. The whole training dataset is the union of the source domain labeled data and the target domain labeled data. For each boosting iteration:
• Use the same strategy as AdaBoost to update the weights of the target domain data
• Use a new mechanism to decrease the weights of misclassified source domain data

That is, for misclassified instances:
• increase the weights of the misclassified target data
• decrease the weights of the misclassified source data

Classifiers trained on the re-weighted labeled data are then applied to the target domain unlabeled data. A sketch follows.
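A minimal sketch of TrAdaBoost for binary labels in {0, 1} (decision stumps as the weak learner and the iteration count are assumptions; the source-weight shrink factor and the final vote over the later half of the learners follow [Dai et al., ICML-07]):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost(Xs, ys, Xt, yt, T=20):
    n_s, n_t = len(Xs), len(Xt)
    X, y = np.vstack([Xs, Xt]), np.r_[ys, yt]
    w = np.ones(n_s + n_t) / (n_s + n_t)                  # weights over all data
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / T))
    learners, betas = [], []
    for _ in range(T):
        p = w / w.sum()
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        miss = (h.predict(X) != y).astype(float)
        err = (p[n_s:] * miss[n_s:]).sum() / p[n_s:].sum()   # target-domain error
        err = min(max(err, 1e-10), 0.499)
        beta_t = err / (1.0 - err)
        w[:n_s] *= beta_src ** miss[:n_s]    # decrease misclassified source weights
        w[n_s:] *= beta_t ** (-miss[n_s:])   # increase misclassified target weights
        learners.append(h); betas.append(beta_t)
    def predict(Xq):                         # vote over the later half of learners
        H = np.array([l.predict(Xq) for l in learners[T // 2:]], dtype=float)
        B = -np.log(np.array(betas[T // 2:]))
        return ((H * B[:, None]).sum(axis=0) >= 0.5 * B.sum()).astype(int)
    return predict
```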

Feature-based TL Approaches

General assumption: the source and target domains have only some overlapping features; many features have support in only the source or only the target domain.

Feature-based TL Approaches (cont.)

How to learn the transformation 𝜑?
• Solution 1: Encode application-specific knowledge to learn the transformation.
• Solution 2: Use general approaches to learn the transformation.

TL for Sentiment Classification

Electronics:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never_buy HP again.

Video Games:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) The game is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.

TL for Sentiment Classification (cont.)

Training on Electronics (bag-of-words features):

         compact  sharp  blurry  hooked  realistic  boring
(1)            1      1       0       0          0       0
(3)            0      1       0       0          0       0
(5)            0      0       1       0          0       0

Learned classifier: 𝑦 = 𝑓(𝑥) = sgn(𝑤 ⋅ 𝑥), with 𝑤 = [1, 1, −1, 0, 0, 0]^𝑇

Prediction on Video Games:

         compact  sharp  blurry  hooked  realistic  boring
(2)            0      0       0       1          0       0
(4)            0      0       0       1          1       0
(6)            0      0       0       0          0       1

The electronics classifier assigns zero weight to all video-game-specific features, so it cannot discriminate the video game reviews.


TL for Sentiment Classification (cont.)

• Three different types of features:
  – Source domain (Electronics) specific features, e.g., compact, sharp, blurry
  – Target domain (Video Games) specific features, e.g., hooked, realistic, boring
  – Domain-independent features (pivot features), e.g., good, excited, nice, never_buy

TL for Sentiment Classification (cont.)

• How to select pivot features?
  – Term frequency on both the source and target domain data
  – Mutual information between features and source domain labels
  – Mutual information between features and domains
• How to make use of the pivot features to align features across domains?
  – Structural Correspondence Learning (SCL) [Blitzer et al., 2006]
  – Spectral Feature Alignment (SFA) [Pan et al., 2010]

How to Align Domain-Specific Features?

• Intuition:
  – Use pivot features as a bridge to connect domain-specific features
  – Model correlations between pivot features and domain-specific features
  – Discover new shared features by exploiting the feature correlations

Spectral Feature Alignment (SFA)

If two domain-specific words have connections to more common pivot words in the graph, they tend to be aligned or clustered together with a higher probability. Likewise, if two pivot words have connections to more common domain-specific words in the graph, they tend to be aligned together with a higher probability.

(Figure) A bipartite graph between pivot features (exciting, good, never_buy) and domain-specific features (sharp, compact, blurry from Electronics; hooked, realistic, boring from Video Games), with edges weighted by co-occurrence counts.

High-level idea: apply spectral clustering to this bipartite graph. Domain-specific words from the two domains that connect to similar pivot words fall into the same cluster (e.g., sharp/hooked, compact/realistic, blurry/boring), and the clusters are used to derive new shared features.

SFA: Derive New Features

Training on Electronics (aligned features):

         sharp/hooked  compact/realistic  blurry/boring
(1)                 1                  1              0
(3)                 1                  0              0
(5)                 0                  0              1

Learned classifier: 𝑦 = 𝑓(𝑥) = sgn(𝑤 ⋅ 𝑥), with 𝑤 = [1, 1, −1]^𝑇

Prediction on Video Games:

         sharp/hooked  compact/realistic  blurry/boring
(2)                 1                  0              0
(4)                 1                  1              0
(6)                 0                  0              1

With the aligned features, the classifier trained on Electronics now transfers to the Video Games reviews.

SFA: Algorithm

1. Identify pivot features, and treat the rest as domain-specific features.
2. Construct a bipartite graph between the pivot and domain-specific features.
3. Apply spectral clustering on the graph.
4. Use the feature clusters, together with the original features, to represent the source and target domain data (a sketch of steps 2-3 follows).
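A minimal sketch of steps 2-3 (the co-occurrence matrix `M`, the embedding dimension `m`, and the plain eigendecomposition are illustrative assumptions; the SFA paper's construction differs in details):

```python
import numpy as np

def sfa_embeddings(M, m=3):
    """M[i, j]: co-occurrence count of pivot feature i with domain-specific
    feature j, pooled over both domains (step 2). Spectral clustering on the
    bipartite graph (step 3) gives each domain-specific feature an m-dim
    embedding; words used with similar pivots (e.g., 'sharp' and 'hooked')
    land close together and can serve as shared features."""
    n_p, n_d = M.shape
    A = np.zeros((n_p + n_d, n_p + n_d))     # bipartite affinity matrix
    A[:n_p, n_p:] = M
    A[n_p:, :n_p] = M.T
    deg = np.maximum(A.sum(axis=1), 1e-12)
    D = np.diag(1.0 / np.sqrt(deg))
    _, vecs = np.linalg.eigh(D @ A @ D)      # normalized spectral step
    U = vecs[:, -m:]                         # top-m eigenvectors
    return U[n_p:]                           # one row per domain-specific feature

# Per step 4, these rows are appended to (not substituted for) the original
# feature representation of each review.
```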

Experiments: Datasets & Tasks

Dataset Tasks RevDat B→D E→D K→D D→B E→B K→B D→E B→E K→E D→K B→K E→K SentDat V→E S→E H→E E→V S→V H→V E→S V→S H→S E→H V→H S→H

47

[Blitzer et al. ACL-07], crawled from Amazon

Collected by MSRA, crawled from Amazon, Yelp and Citysearch

Dataset:

Task:

Overall Performance

(Figures) On both datasets, four methods are compared: our method (SFA), Blitzer's method (SCL), using the source domain classifier without adaptation, and an in-domain classifier trained on labeled data from the target domain.

General Feature-based TL Approaches

(Recall the WiFi localization example: signal distributions differ between time period A and time period B, and between device A and device B.)

General Feature-based TL Approaches (cont.)

• Solution 2: General approaches to learning the transformation
  – Learning features by minimizing the distance between distributions
  – Learning features via multi-task learning
  – Learning features via self-taught learning

Transfer Component Analysis (TCA) [Pan et al., IJCAI-09, TNN-11]

Motivation: the observed source and target data are generated from latent factors such as temperature, signal properties, building structure, and the power of APs.

• Some of these latent factors cause the data distributions of the two domains to differ: noisy components.
• Others are shared across domains: principal components.

Learning 𝜑 by only minimizing the distance between distributions may map the data onto the noisy factors.

TCA (cont.)

• Main idea: the learned 𝜑 should map the source and target domain data to a latent space spanned by the factors that reduce the distance between domains as well as preserve the data structure.
• High-level optimization problem:

  min_𝜑  Dist(𝜑(𝑋_𝑆), 𝜑(𝑋_𝑇)) + 𝜆 Ω(𝜑)
  s.t.   constraints on 𝜑(𝑋_𝑆) and 𝜑(𝑋_𝑇)

Distance Measure via Maximum Mean Discrepancy (MMD)

Dist(𝜑(𝑋_𝑆), 𝜑(𝑋_𝑇)) = ‖ 𝔼_(𝑥∼𝑃_𝑇(𝑥))[𝜙(𝜑(𝑥))] − 𝔼_(𝑥∼𝑃_𝑆(𝑥))[𝜙(𝜑(𝑥))] ‖_ℋ
  ≈ ‖ (1/𝑛_𝑇) Σ_(𝑖=1)^(𝑛_𝑇) 𝜙(𝜑(𝑥_𝑇𝑖)) − (1/𝑛_𝑆) Σ_(𝑖=1)^(𝑛_𝑆) 𝜙(𝜑(𝑥_𝑆𝑖)) ‖_ℋ

Assume 𝜓 = 𝜙 ∘ 𝜑 induces an RKHS with kernel 𝑘(𝑥_𝑖, 𝑥_𝑗) = 𝜓(𝑥_𝑖)^𝑇 𝜓(𝑥_𝑗). Then the squared distance can be written as

Dist(𝜑(𝑋_𝑆), 𝜑(𝑋_𝑇))² = tr(𝐾𝐿), where 𝐾 = [𝐾_(𝑆,𝑆), 𝐾_(𝑆,𝑇); 𝐾_(𝑇,𝑆), 𝐾_(𝑇,𝑇)] and

𝐿_𝑖𝑗 = 1/𝑛_𝑆² if 𝑥_𝑖, 𝑥_𝑗 ∈ 𝑋_𝑆;  1/𝑛_𝑇² if 𝑥_𝑖, 𝑥_𝑗 ∈ 𝑋_𝑇;  −1/(𝑛_𝑆 𝑛_𝑇) otherwise.
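As a sanity check on the tr(𝐾𝐿) form, here is a minimal sketch computing the empirical squared MMD directly (the RBF kernel and `gamma` are assumptions; any positive-definite kernel works):

```python
import numpy as np

def mmd_squared(Xs, Xt, gamma=1.0):
    """Empirical squared MMD between two samples; numerically equal to
    tr(KL) with K and L as defined above."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)           # RBF kernel matrix
    n_s, n_t = len(Xs), len(Xt)
    return (k(Xs, Xs).sum() / n_s**2
            + k(Xt, Xt).sum() / n_t**2
            - 2.0 * k(Xs, Xt).sum() / (n_s * n_t))
```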

TCA (cont.)

Substituting the MMD expression into the high-level problem gives

  min_𝜑  tr(𝐾𝐿) + 𝜆 Ω(𝜑)
  s.t.   constraints on 𝜑(𝑋_𝑆) and 𝜑(𝑋_𝑇)

• In general, the kernel function 𝑘(𝜑(𝑥_𝑖), 𝜑(𝑥_𝑗)) can be a highly nonlinear function of 𝜑, which is unknown
• A direct optimization of this quantity w.r.t. 𝜑 may get stuck in poor local minima

Solution I [Pan et al., AAAI-08]

Learning 𝜑 is decomposed into two steps: 1) learn the kernel matrix 𝐾; 2) obtain low-dimensional reconstructions of 𝑋_𝑆 and 𝑋_𝑇 from 𝐾 by performing PCA on 𝐾.

  min_(𝐾≽0)  tr(𝐾𝐿) − 𝜆 tr(𝐾)
  s.t.  𝐾_𝑖𝑖 + 𝐾_𝑗𝑗 − 2𝐾_𝑖𝑗 = 𝑑_𝑖𝑗², ∀(𝑖, 𝑗) ∈ 𝒩, and 𝐾𝟏 = 0

Here tr(𝐾𝐿) minimizes the distance between domains, −𝜆 tr(𝐾) maximizes the data variance, and the distance constraints preserve the local geometric structure.

Drawbacks:
• It is an SDP problem: expensive!
• It is transductive: it cannot generalize to unseen instances!
• PCA is post-processed on the learned kernel matrix, which may discard potentially useful information.

Solution II [Pan et al., IJCAI-09, IEEE TNN-11]

Assume 𝐾 is low-rank: 𝐾 = 𝐾̃𝑊𝑊^𝑇𝐾̃, where 𝑊 ∈ ℝ^((𝑛_𝑆+𝑛_𝑇)×𝑚) and 𝑚 ≪ 𝑛_𝑆 + 𝑛_𝑇. Learning 𝐾 then reduces to learning the low-rank matrix 𝑊:

  min_𝑊  tr(𝐾̃𝑊𝑊^𝑇𝐾̃𝐿) + 𝜆 tr(𝑊^𝑇𝑊)
  s.t.   𝑊^𝑇𝐾̃𝐻𝐾̃𝑊 = 𝐼

The first term (equal to tr(𝑊^𝑇𝐾̃𝐿𝐾̃𝑊)) minimizes the distance between domains, the second is a regularization term on 𝑊, and the constraint maximizes the data variance.

Closed-form solution for 𝑊*: the 𝑚 leading eigenvectors of (𝐾̃𝐿𝐾̃ + 𝜆𝐼)^(−1) 𝐾̃𝐻𝐾̃.
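A minimal sketch of this closed form (assuming NumPy/SciPy; the generalized symmetric eigenproblem below is equivalent to taking the leading eigenvectors of (𝐾̃𝐿𝐾̃ + 𝜆𝐼)^(−1)𝐾̃𝐻𝐾̃):

```python
import numpy as np
from scipy.linalg import eigh

def tca(K, n_s, n_t, m=5, lam=1.0):
    """TCA sketch: K is the (n x n) kernel matrix over source + target
    instances (n = nS + nT, source rows first). Returns the m-dimensional
    transfer components, one row per instance."""
    n = n_s + n_t
    L = np.full((n, n), -1.0 / (n_s * n_t))      # MMD coefficient matrix
    L[:n_s, :n_s] = 1.0 / n_s**2
    L[n_s:, n_s:] = 1.0 / n_t**2
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    # Generalized eigenproblem: (K H K) v = val (K L K + lam I) v
    vals, vecs = eigh(K @ H @ K, K @ L @ K + lam * np.eye(n))
    W = vecs[:, np.argsort(-vals)[:m]]           # m leading eigenvectors
    return K @ W                                 # new cross-domain representation
```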

Illustrative example: (figure) the original feature space, and the latent features learned by PCA and by TCA.

Experiments: Datasets & Setup

• Dataset: 2007 IEEE ICDM Contest Task 2 [Yang et al., 2008]. WiFi data were collected in time period A (the source domain) and time period B (the target domain) in an indoor building of around 145.5 × 37.5 m².
• Baselines: non-transfer methods, KPCA, and other transfer learning approaches.
• Regression model: Regularized Least Squares Regression (RLSR).
• Evaluation criterion: Average Error Distance (AED).

Experiments: Overall Results

(Figures) Results for TCA and SSTCA: varying the number of dimensions used by the transfer learning methods, and varying the amount of unlabeled target domain data used for training.

Multi-task Feature Learning

Assumption: If tasks are related, they should share some good common features.

Goal: Learn a low-dimensional representation shared across related tasks.

General Multi-task Learning Setting

Two main settings in transfer learning:
• Setting 1 (transfer learning): a source task with model 𝒇𝑺 is adaptively transferred to a target task with model 𝒇𝑻.
• Setting 2 (multi-task learning): target tasks 1, 2, …, N with models 𝒇𝑻𝟏, 𝒇𝑻𝟐, …, 𝒇𝑻𝑵 are learned jointly by exploiting the commonality among tasks.

Multi-task Feature Learning (cont.)

Concrete formulations and illustrations: [Argyriou et al., NIPS-07], [Ando and Zhang, JMLR-05], [Ji et al., KDD-08].

Self-taught Feature Learning

• Intuition: there exist some higher-level features that can help the target learning task, even when only a few labeled data are given.
• Steps:
  1. Learn higher-level features from a large amount of unlabeled data
  2. Use the learned higher-level features to represent the data of the target task
  3. Train models on the new representations of the target task with the corresponding labels

Self-taught Feature Learning (cont.)

• How to learn higher-level features?
  – Sparse Coding [Raina et al., 2007]
  – Autoencoders [Glorot et al., 2011]

(Figures) Sparse coding illustration: an input patch is approximated as a weighted sum of basis patches, e.g., ≈ 0.6 × 𝑏1 + 0.3 × 𝑏2 + 0.1 × 𝑏3; an autoencoder illustration is also shown.

Heterogeneous Transfer Learning

• The source and target domains do not have any overlapping features, i.e., no feature has support in both the source and the target domain (disjoint feature spaces 𝓧𝑺 and 𝓧𝑻).
• Some labeled data is available in the target domain to construct correspondences between domains, or
• Some parallel data (i.e., instance correspondences) between domains is available.

Heterogeneous TL: Application

• Sentiment analysis across different languages [Zhou, Pan, et al., AAAI-14; Zhou, Tsang, Pan, et al., AISTATS-14]
• Suppose: a sufficient set of annotated product reviews in English (i.e., the source domain) is given
• Goal: learn a sentiment classifier for product reviews in German (i.e., the target domain)

Parameter-based Approaches

• Motivation: a well-trained source model 𝜃𝑆 has captured a lot of structure from the data. If two tasks are related, this structure can be transferred to learn a more precise target model 𝜃𝑇 with only a few labeled data in the target domain.

Parameter-based TL Approaches (cont.)

Assumption: if tasks are related, they may share similar parameter vectors. For example, [Evgeniou and Pontil, KDD-04] decompose each task's parameter vector into a common part shared by all tasks plus a specific part for each individual task.

A general framework builds on this idea: [Zhang and Yeung, UAI-10], [Agarwal et al., NIPS-10]. A sketch of the shared-plus-specific decomposition follows.
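A minimal sketch of the decomposition, fit by gradient descent on a jointly regularized squared loss (the optimizer, learning rate, and epoch count are assumptions; [Evgeniou and Pontil, KDD-04] solve it as a regularized kernel problem instead):

```python
import numpy as np

def shared_plus_specific(tasks, lam0=1.0, lam1=1.0, lr=0.01, epochs=500):
    """Each task t gets parameters w_t = w0 + v_t: w0 is shared by all tasks,
    v_t is task-specific. tasks: list of (X_t, y_t) with a common dimension d."""
    T, d = len(tasks), tasks[0][0].shape[1]
    w0, V = np.zeros(d), np.zeros((T, d))
    for _ in range(epochs):
        g0, gV = lam0 * w0, lam1 * V.copy()      # gradients of the regularizers
        for t, (X, y) in enumerate(tasks):
            r = X @ (w0 + V[t]) - y              # residual of task t
            g = X.T @ r / len(y)
            g0 += g                              # the shared part sees every task
            gV[t] += g                           # the specific part sees only its own
        w0 -= lr * g0
        V -= lr * gV
    return w0, V   # predictions for task t: X @ (w0 + V[t])
```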

Relational TL Approaches

• Motivation: if two relational domains (non-i.i.d. data) are related, they may share some similar relations among objects. These relations can be used for knowledge transfer across domains.
• Approaches:
  – Second-order Markov Logic [Davis and Domingos, ICML-09]
  – Relational Adaptive bootstraPping (RAP) [Li et al., ACL-12]

Second-order Markov Logic

Academic domain (source): Student (B), Professor (A), and Paper (T) are linked by AdvisedBy and Publication relations:

  AdvisedBy(B, A) ⋀ Publication(B, T) => Publication(A, T)

Movie domain (target): Actor (A), Director (B), and Movie (M) are linked by WorkedFor and MovieMember relations:

  WorkedFor(A, B) ⋀ MovieMember(A, M) => MovieMember(B, M)

Both rules instantiate the same second-order formula, which can be transferred across domains:

  P1(𝑥, 𝑦) ⋀ P2(𝑥, 𝑧) => P2(𝑦, 𝑧)

RAP

• RAP is designed for cross-domain fine-grained sentiment analysis. Example:

  "I liked the service and the staff, but the food is disappointed"
  – Aspect terms: service, staff, food
  – Opinion terms: liked (positive), disappointed (negative)

RAP (cont.)

Movie reviews: "This movie has good script, great casting, excellent acting." "This movie is so boring." "The Godfather was the most amazing movie." "The movie is excellent."

Camera reviews: "The camera is great. It is a very amazing product. I highly recommend this camera. It takes excellent photos. Photos had some artifacts and noise."

RAP (cont.)

• Bridge between cross-domain sentiment words: domain-independent (general) sentiment words, e.g., great, amazing, excellent.
• Bridge between cross-domain topic words: the syntactic structure between topic and sentiment words, e.g., the common syntactic pattern "topic word" – nsubj – "sentiment word".

Summary

• Transfer learning settings: heterogeneous vs. homogeneous transfer learning; within the homogeneous setting, unsupervised, semi-supervised, and supervised transfer learning.
• Transfer learning approaches: instance-based and feature-based approaches operate at the data level; parameter-based and relational approaches operate at the model level.

Real-world Applications

Fundamental research on transfer learning feeds applications to text mining, social media analytics, image classification, WiFi localization, sentiment analysis, software engineering, recommender systems, and bioinformatics.

Defect Prediction

For a particular project: a program with defect information is used to train a predictive model (machine learning), which then predicts future defects.

Cross-Project Defect Prediction

"Training data is often not available, either because a company is too small or it is the first release of a product."
[Zimmermann et al., Cross-project Defect Prediction, FSE-09]

"For many new projects we may not have enough historical data to train prediction models."
[Rahman, Posnett, and Devanbu, Recalling the "Imprecision" of Cross-project Defect Prediction, FSE-12]

Cross-Project Defect Prediction

In the cross-project setting, a program with defect information from one project trains the predictive model (machine learning), which then predicts future defects in a different project.

Cross-Project Defect Prediction

• [Zimmermann et al., FSE-09]: "We ran 622 cross-project predictions and found only 3.4% actually worked." (Worked: 3.4%; not worked: 96.6%.)

Cross-Project Defect Prediction

• [Rahman, Posnett, and Devanbu, FSE-12]: (figure) average F-measure of cross-project vs. within-project prediction, on a 0-0.6 scale.

Difference between Projects

• Development processes can be very different [Zimmermann et al., FSE-09]:
  – The way the systems are developed
  – Operating systems and environments
  – Tools and IDEs used for development
  – Coding styles
  – etc.

Defect Prediction Process

Both the within-project and cross-project settings use the same set of metrics (features). An instance 𝑖 is a vector of metric values with a label, e.g.:

  instance 𝑖: Metric 1 = 2, Metric 2 = 3, Metric 3 = 40, …, Metric 𝑚 = 19; label: Y or N

However, the distributions of the feature values (data distributions) differ across projects.

Data Normalization

• N1: min-max normalization to [0, 1]
• N2: z-score normalization (mean = 0, std = 1)
• N3: z-score normalization using only the source mean and standard deviation
• N4: z-score normalization using only the target mean and standard deviation
• NoN: no normalization (a sketch of these options follows)
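A minimal sketch of the five options (NumPy; the small epsilons guarding zero-variance features are an added assumption):

```python
import numpy as np

def normalize(Xs, Xt, option="N1"):
    """Apply one of the normalization strategies N1-N4 (or NoN) to the
    source data Xs and target data Xt."""
    if option == "N1":     # min-max to [0, 1], per domain
        f = lambda X: (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-12)
        return f(Xs), f(Xt)
    if option == "N2":     # z-score, each domain with its own statistics
        f = lambda X: (X - X.mean(0)) / (X.std(0) + 1e-12)
        return f(Xs), f(Xt)
    if option == "N3":     # z-score using only the source mean and std
        mu, sd = Xs.mean(0), Xs.std(0) + 1e-12
        return (Xs - mu) / sd, (Xt - mu) / sd
    if option == "N4":     # z-score using only the target mean and std
        mu, sd = Xt.mean(0), Xt.std(0) + 1e-12
        return (Xs - mu) / sd, (Xt - mu) / sd
    return Xs, Xt          # NoN: no normalization
```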

Pipeline: normalization, then TCA.

Preliminary Results using TCA

(Figure) F-measure for Safe→Apache and Apache→Safe under Baseline, NoN, N1, N2, N3, and N4 (Baseline: cross-project defect prediction without TCA). The prediction performance of TCA varies according to the normalization strategy!

TCA+

• TCA+ is TCA plus decision rules that choose an appropriate data normalization strategy
• Steps:
  – #1: Characterize the dataset of each project
  – #2: Decide on rules by comparing the characteristics of the two projects

Pipeline: rule-selected normalization, then TCA.

#1: Characterize Datasets

(Figure) For each project's dataset (e.g., Project A as source, Project B as target), compute all pairwise Euclidean distances between instances (𝑑_1,2, 𝑑_1,3, 𝑑_1,5, 𝑑_3,11, …):

DIST = {𝑑_𝑖𝑗 : ∀𝑖, 𝑗, 1 ≤ 𝑖 ≤ 𝑛, 1 ≤ 𝑗 ≤ 𝑛, 𝑖 < 𝑗}

#1: Characterize Datasets (cont.)

From DIST, each dataset is summarized by a characteristic vector:
• minimum (min) and maximum (max) values of DIST
• mean (mean) and standard deviation (std) of DIST
• the number of instances (#ins)

This yields 𝒄𝑆 = (min𝑆, max𝑆, mean𝑆, std𝑆, #ins𝑆) for the source project and 𝒄𝑇 = (min𝑇, max𝑇, mean𝑇, std𝑇, #ins𝑇) for the target project.

#2: Decision Rules

• Rule #1: (mean𝑆 ≈ mean𝑇) and (std𝑆 ≈ std𝑇) → NoN
• Rule #2: (min𝑆 ≠ min𝑇) and (max𝑆 ≠ max𝑇) → N1
• Rules #3, #4: (std𝑆 ≠ std𝑇) and (#ins𝑆 ≠ #ins𝑇) → N3 or N4
• Rule #5 (default) → N2

A sketch of the characterization and rule selection follows.
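A minimal sketch of the characterization and the rule cascade (assumptions: a relative tolerance stands in for "≈", and the tie-break between N3 and N4 by instance count; the TCA+ paper [Nam et al., ICSE-13] defines "similar" via nominal degrees of difference):

```python
import numpy as np
from scipy.spatial.distance import pdist

def characteristics(X):
    """DIST summary for one project's dataset: pairwise Euclidean distances."""
    d = pdist(X)
    return dict(min=d.min(), max=d.max(), mean=d.mean(), std=d.std(), n=len(X))

def choose_normalization(cs, ct, tol=0.10):
    """Pick NoN / N1 / N2 / N3 / N4 from the source and target characteristics."""
    close = lambda a, b: abs(a - b) <= tol * max(abs(a), abs(b), 1e-12)
    if close(cs["mean"], ct["mean"]) and close(cs["std"], ct["std"]):
        return "NoN"                                   # Rule #1
    if not close(cs["min"], ct["min"]) and not close(cs["max"], ct["max"]):
        return "N1"                                    # Rule #2
    if not close(cs["std"], ct["std"]) and not close(cs["n"], ct["n"]):
        return "N3" if cs["n"] < ct["n"] else "N4"     # Rules #3, #4 (assumed tie-break)
    return "N2"                                        # Rule #5 (default)
```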

Experimental Setup

• 8 software projects from two collections:
  – ReLink [Wu et al., FSE-11]: Apache, Safe, ZXing; 26 metrics (features)
  – AEEEM [D'Ambros et al., MSR-10]: Apache Lucene (LC), Equinox (EQ), Eclipse JDT, Eclipse PDE UI, Mylyn (ML); 61 metrics (features)
• Classification algorithm: logistic regression

Experimental Design

• Within-project defect prediction: split each project's data into a training set (50%) and a test set (50%).
• Cross-project defect prediction: train on a source project, test on a target project.
• Transfer defect learning: the same cross-project setup, with TCA/TCA+ applied.

ReLink Results

(Figure) F-measure for Safe→Apache, Apache→Safe, and Apache→ZXing under Baseline, TCA, TCA+, and Within (Baseline: cross-project defect prediction without TCA/TCA+).

ReLink Results (cont.)

Source → Target    Baseline   TCA (N4)   TCA+    Within (target)
Safe → Apache         0.52       0.64     0.64       0.64
ZXing → Apache        0.69       0.64     0.72       0.64
Apache → Safe         0.49       0.72     0.72       0.62
ZXing → Safe          0.59       0.70     0.64       0.62
Apache → ZXing        0.46       0.45     0.49       0.33
Safe → ZXing          0.10       0.42     0.53       0.33
Average               0.49       0.59     0.61       0.53

Baseline: cross-project defect prediction without TCA/TCA+; Within: trained and tested within the target project.

AEEEM Results

(Figure) F-measure for JDT→EQ, PDE→LC, and PDE→ML under Baseline, TCA, TCA+, and Within (Baseline: cross-project defect prediction without TCA/TCA+).

AEEEM Results (cont.)

Source → Target   Baseline   TCA (N2)   TCA+    Within (target)
JDT → EQ             0.31       0.59     0.60       0.58
LC → EQ              0.50       0.62     0.62       0.58
ML → EQ              0.24       0.56     0.56       0.58
…                     …          …        …          …
PDE → LC             0.33       0.27     0.33       0.27
EQ → ML              0.19       0.62     0.62       0.30
JDT → ML             0.27       0.56     0.56       0.30
LC → ML              0.20       0.58     0.60       0.30
PDE → ML             0.27       0.48     0.54       0.30
…                     …          …        …          …
Average              0.32       0.41     0.41       0.42

Baseline: cross-project defect prediction without TCA/TCA+; Within: trained and tested within the target project.

Future Direction

• Theoretical study beyond generalization error bounds:
  – Given a source domain and a target domain, determine whether transfer learning should be performed at all
  – For a specific transfer learning method, given a source and a target domain, determine whether that method should be used for knowledge transfer

Future Direction (cont.)

• Transfer learning for deep reinforcement learning [Google DeepMind, 2015]
  – Traditional RL: a hand-crafted feature extractor maps observations to features (states) [𝒇𝟏, …, 𝒇𝒏]; the agent's policy maps these hand-crafted features to predictions of the final score for each of 18 joystick actions, driven by rewards.
  – Deep RL: the agent's policy maps raw screen pixels directly to predictions of the final score for each of the 18 joystick actions.
  – Open direction: transfer learning for deep RL.

Thank You!

Reference

• [Thorndike and Woodworth, The Influence of Improvement in One Mental Function upon the Efficiency of Other Functions, 1901]
• [Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009]
• [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010]
• [Pan, Transfer Learning, Chapter 21, Data Classification: Algorithms and Applications, 2014]
• [Quionero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]
• [Zadrozny, Learning and Evaluating Classifiers under Sample Selection Bias, ICML 2004]
• [Huang et al., Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006]
• [Sugiyama et al., Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation, NIPS 2007]
• [Kanamori et al., A Least-squares Approach to Direct Importance Estimation, JMLR 2009]
• [Dai et al., Boosting for Transfer Learning, ICML 2007]
• [Blitzer et al., Domain Adaptation with Structural Correspondence Learning, EMNLP 2006]
• [Pan et al., Cross-Domain Sentiment Classification via Spectral Feature Alignment, WWW 2010]
• [Pan et al., Transfer Learning via Dimensionality Reduction, AAAI 2008]
• [Pan et al., Domain Adaptation via Transfer Component Analysis, IJCAI 2009, IEEE TNN 2011]
• [Argyriou et al., Multi-Task Feature Learning, NIPS 2007]
• [Ando and Zhang, A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, JMLR 2005]
• [Ji et al., Extracting Shared Subspace for Multi-label Classification, KDD 2008]
• [Raina et al., Self-taught Learning: Transfer Learning from Unlabeled Data, ICML 2007]
• [Glorot et al., Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach, ICML 2011]
• [Zhou et al., Hybrid Heterogeneous Transfer Learning through Deep Learning, AAAI 2014]
• [Zhou et al., Heterogeneous Domain Adaptation for Multiple Classes, AISTATS 2014]
• [Evgeniou and Pontil, Regularized Multi-Task Learning, KDD 2004]
• [Zhang and Yeung, A Convex Formulation for Learning Task Relationships in Multi-Task Learning, UAI 2010]
• [Agarwal et al., Learning Multiple Tasks using Manifold Regularization, NIPS 2010]
• [Davis and Domingos, Deep Transfer via Second-order Markov Logic, ICML 2009]
• [Li et al., Cross-Domain Co-Extraction of Sentiment and Topic Lexicons, ACL 2012]
• [Zimmermann et al., Cross-project Defect Prediction, FSE 2009]
• [Rahman et al., Recalling the "Imprecision" of Cross-project Defect Prediction, FSE 2012]
• [Nam et al., Transfer Defect Learning, ICSE 2013]