Semi-Supervised Learning


Page 1: Semi-supervised Learning

Semi-Supervised Learning

• Qiang Yang (adapted from…)

• Thanks: Zhi-Hua Zhou
  – http://cs.nju.edu.cn/people/zhouzh/
  – [email protected]
  – LAMDA Group, National Laboratory for

Page 2: Semi-supervised Learning

Supervised learning

Supervised learning is the typical machine learning setting, where labeled examples are used as training data to learn models such as decision trees, neural networks, support vector machines, etc.

Training data (labeled):

Name  Rank            Years  Tenured
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The trained model is then used to predict the label of unseen data, e.g. (Jeff, Professor, 7, ?) → label unknown, predicted: yes

Page 3: Semi-supervised Learning

Labeled vs. Unlabeled

In many practical applications, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain because labeling the unlabeled examples requires human effort.

[Figure: a labeled example (class = “war”) contrasted with an (almost) infinite number of unlabeled web pages on the Internet, whose labels (“?”) are unknown]

Page 4: Semi-supervised Learning

Three main paradigms for Semi-supervised Learning:

• Transductive learning: Unlabeled examples are exactly the test examples

• Active learning:
  – Assume that a user (an oracle) can continue to label data
  – The learner actively selects some unlabeled examples to query from the oracle (the learner is assumed to have some control over the input space)

• Multi-view learning:
  – Unlabeled examples may be different from the test examples
  – Regularization (minimize error and maximize smoothness)
  – Multi-view learning and co-training

Page 5: Semi-supervised Learning

SSL: Why can unlabeled data be helpful?

Suppose the data is well-modeled by a mixture density:

$$f(x \mid \theta) = \sum_{l=1}^{L} \alpha_l\, f_l(x \mid \theta_l), \qquad \text{where } \sum_{l=1}^{L} \alpha_l = 1 \text{ and } \theta = \{\theta_l\}$$

The class labels are viewed as random quantities and are assumed chosen conditioned on the selected mixture component $m_i \in \{1, 2, \ldots, L\}$ and possibly on the feature value, i.e. according to the probabilities $P[c_i \mid x_i, m_i]$.

Thus, the optimal classification rule for this model is the MAP rule:

$$S(x_i) = \arg\max_k \sum_{j} P[c_i = k \mid m_i = j, x_i]\, P[m_i = j \mid x_i]$$

where

$$P[m_i = j \mid x_i] = \frac{\alpha_j\, f_j(x_i \mid \theta_j)}{\sum_{l=1}^{L} \alpha_l\, f_l(x_i \mid \theta_l)}$$

Unlabeled examples can be used to help estimate this last term.

[D.J. Miller & H.S. Uyar, NIPS’96]
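To make the role of unlabeled data concrete, here is a minimal sketch (not Miller & Uyar's algorithm) of fitting a two-component 1-D Gaussian mixture with EM, where a handful of labeled points anchor the components and the many unlabeled points refine the estimate of P[m = j | x]. The toy data, function names, and constants are illustrative assumptions.

```python
import numpy as np

def e_step(x, weights, means, stds):
    """Posterior P[m = j | x] for every point under the current mixture parameters."""
    dens = np.array([w * np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
                     for w, mu, s in zip(weights, means, stds)])   # shape (L, n)
    return dens / dens.sum(axis=0)

def m_step(x, resp, labeled_mask, labels):
    """Update mixture parameters; labeled points are hard-assigned to their component."""
    resp = resp.copy()
    for j in range(resp.shape[0]):
        resp[j, labeled_mask] = (labels[labeled_mask] == j).astype(float)
    weights = resp.sum(axis=1) / len(x)
    means = (resp * x).sum(axis=1) / resp.sum(axis=1)
    stds = np.sqrt((resp * (x - means[:, None]) ** 2).sum(axis=1) / resp.sum(axis=1))
    return weights, means, stds

# Toy data (an assumption): two Gaussian clusters, only four labeled points.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
labels = np.full(len(x), -1)               # -1 means "unlabeled"
labels[[0, 1, 200, 201]] = [0, 0, 1, 1]
labeled_mask = labels >= 0

weights, means, stds = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):                         # EM iterations
    resp = e_step(x, weights, means, stds)
    weights, means, stds = m_step(x, resp, labeled_mask, labels)
print(means)   # component means, estimated mostly from the unlabeled points
```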

Page 6: Semi-supervised Learning

Transductive SVM

Transductive SVM: Taking into account a particular test set and trying to minimize misclassifications of just those particular examples

Figure reprinted from [T. Joachims, ICML99]

Concretely, unlabeled examples are used to help identify the maximum-margin hyperplane.
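A minimal sketch of the transductive idea, assuming scikit-learn is available. This is not Joachims' SVM-light TSVM: it simply alternates between training an SVM on the labeled examples plus the currently self-labeled test examples and relabeling the test examples, which captures the "fit this particular test set" flavour. The down-weighting constant, function name, and loop bounds are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def simple_tsvm(X_lab, y_lab, X_test, rounds=10, unlabeled_weight=0.5):
    """Alternate between fitting an SVM and relabeling the given test points."""
    clf = SVC(kernel="linear")
    clf.fit(X_lab, y_lab)                        # inductive warm start
    y_test = clf.predict(X_test)                 # initial guess for the test labels
    for _ in range(rounds):
        X_all = np.vstack([X_lab, X_test])
        y_all = np.concatenate([y_lab, y_test])
        weights = np.concatenate([np.ones(len(y_lab)),
                                  unlabeled_weight * np.ones(len(y_test))])
        clf.fit(X_all, y_all, sample_weight=weights)
        new_labels = clf.predict(X_test)
        if np.array_equal(new_labels, y_test):   # test-set labeling has stabilized
            break
        y_test = new_labels
    return clf, y_test
```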

Page 7: Semi-supervised Learning

Active learning: Getting more from queries

The labels of the training examples are obtained by querying the oracle. Thus, for the same number of queries, more helpful information can be obtained by actively selecting some unlabeled examples to query

Key: To select the unlabeled examples on which the labeling will convey the most helpful information for the learner

Page 8: Semi-supervised Learning

Active Learning: Representative approaches

• Uncertainty sampling: Train a single learner and then query the unlabeled instances on which the learner is the least confident [Lewis & Gale, SIGIR’94]

• Committee-based sampling: Generate a committee of multiple learners and select the unlabeled examples on which the committee members disagree the most [Abe & Mamitsuka, ICML’98; Seung et al., COLT’92]
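As a concrete illustration of uncertainty sampling, here is a minimal sketch assuming scikit-learn; the logistic-regression learner, batch size, and function name are assumptions for illustration, not the setup of Lewis & Gale.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_round(X_lab, y_lab, X_pool, n_queries=5):
    """Pick the pool points on which the current learner is least confident."""
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = clf.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)        # least-confident score
    query_idx = np.argsort(uncertainty)[-n_queries:]
    return query_idx                             # indices to send to the oracle
```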

Page 10: Semi-supervised Learning

Active Learning: Text-based image retrieval

Every image is associated with a text annotation. The user poses a keyword, and the system retrieves images by matching the keyword against the annotations.

[Figure: a text-based retrieval engine with a text interface to the image database; the query “tiger” returns both “tiger lily” and “white tiger” images]

Page 11: Semi-supervised Learning

Co-training

In some applications there are two sufficient and redundant views, i.e. two attribute sets, each of which is sufficient for learning and conditionally independent of the other given the class label.

E.g., two views for web page classification: 1) the text appearing on the page itself, and 2) the anchor text attached to hyperlinks pointing to this page from other pages.

Page 12: Semi-supervised Learning

Co-training (cont’d)

[Figure: learner 1 is trained on the X1 view and learner 2 on the X2 view of the labeled training examples; each learner labels some of the unlabeled training examples, and these newly labeled examples are added to the other learner’s training data]

[A. Blum & T. Mitchell, COLT’98]

Page 13: Semi-supervised Learning

Co-training (cont’d)

Theoretical analysis [Blum & Mitchell, COLT’98; Dasgupta, NIPS’01; Balcan et al., NIPS’04; etc.]

Experimental studies [Nigam & Ghani, CIKM’00]

New algorithms
• Co-training without two views [Goldman & Zhou, ICML’00; Zhou & Li, TKDE’05]
• Semi-supervised regression [Zhou & Li, IJCAI’05]

Applications
• Statistical parsing [Sarkar, NAACL’01; Steedman et al., EACL’03; Hwa et al., ICML’03 workshop]
• Noun phrase identification [Pierce & Cardie, EMNLP’01]
• Image retrieval [Zhou et al., ECML’04; Zhou et al., TOIS’06]

Page 14: Semi-supervised Learning

Multi-view Learning and Co-training

• Multi-view learning describes the setting of learning from data where observations are represented by multiple independent sets of features.

An example of two views:
• Features can be split into two sets:
  – The instance space: $X = X_1 \times X_2$
  – Each instance: $x = (x_1, x_2)$

Page 15: Semi-supervised Learning

Inductive vs. Transductive

• Transductive: Produces labels only for the available unlabeled data.
  – The output of the method is not a classifier.

• Inductive: Produces not only labels for the unlabeled data, but also a classifier.

Page 16: Semi-supervised Learning

An Example of two views

• Web-page classification: e.g., find homepages of faculty members.
  – Page text: words occurring on that page, e.g., “research interest”, “teaching”
  – Hyperlink text: words occurring in hyperlinks that point to that page, e.g., “my advisor”

Page 17: Semi-supervised Learning

Another Example

X1: job title
X2: job description

Classifying Jobs for FlipDog

Page 18: Semi-supervised Learning

Two Views

• $C_1$: the set of target functions over $X_1$.
• $C_2$: the set of target functions over $X_2$.
• $C$: the set of target functions over $X = X_1 \times X_2$.

• Instead of learning $f \in C$, multi-view learning aims to learn a pair of functions $(f_1, f_2)$ from $C_1 \times C_2$, such that $f(x) = f_1(x_1) = f_2(x_2)$.

Page 19: Semi-supervised Learning

Co-training

• Proposed by Blum and Mitchell (1998): combines multi-view learning and semi-supervised learning.

• Related work:– (Yarowsky 1995)– (Nigam and Ghani, 2000)– (Goldman and Zhou, 2000)– (Abney, 2002)– (Sarkar, 2002)– …

• Used in document classification, parsing, etc.

Page 20: Semi-supervised Learning

The Yarowsky Algorithm

[Figure: self-training iterations. Iteration 0: a classifier trained by supervised learning on the labeled data (+/−); the instances it labels with high confidence are chosen and added to the pool of current labeled training data. Iterations 1, 2, …: the process repeats with the enlarged labeled pool]

(Yarowsky 1995)
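A minimal sketch of a Yarowsky-style self-training loop, assuming scikit-learn; the base classifier, the 0.9 confidence threshold, and the function name are illustrative assumptions rather than Yarowsky's original word-sense disambiguation setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=20):
    """Repeatedly train, self-label high-confidence points, and grow the labeled pool."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold           # high-confidence predictions
        if not confident.any():
            break
        # Add the confidently self-labeled instances to the pool of labeled training data.
        new_y = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, new_y])
        X_unlab = X_unlab[~confident]
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)   # retrain
    return clf
```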

Page 21: Semi-supervised Learning

Co-training Assumption 1: compatibility

• The instance distribution $D$ is compatible with the target function $f = (f_1, f_2)$ if for any $x = (x_1, x_2)$ with non-zero probability, $f(x) = f_1(x_1) = f_2(x_2)$.

• Definition: compatibility of $f$ with $D$:

$$\Pr_{D}\big[(x_1, x_2) : f_1(x_1) \neq f_2(x_2)\big] = 0$$

Each set of features is sufficient for classification.

Page 22: Semi-supervised Learning

Co-training Assumption 2: conditional independence

• Definition: A pair of views $(x_1, x_2)$ satisfies view independence when:

$$P[X_1 = x_1 \mid X_2 = x_2, Y = y] = P[X_1 = x_1 \mid Y = y]$$
$$P[X_2 = x_2 \mid X_1 = x_1, Y = y] = P[X_2 = x_2 \mid Y = y]$$

• A classification problem instance satisfies view independence when all pairs $(x_1, x_2)$ satisfy view independence.

Page 23: Semi-supervised Learning

Co-training Algorithm
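Since the procedure fits in a short loop, here is a minimal sketch of Blum & Mitchell-style co-training, assuming scikit-learn, binary 0/1 labels, and naive Bayes base learners; the pool size, number of rounds, and per-round counts p and n are illustrative assumptions rather than the paper's exact constants.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             rounds=30, pool_size=75, p=1, n=3):
    """Co-training over two views; y_lab is assumed to hold binary labels 0/1."""
    rng = np.random.default_rng(0)
    L1, L2, y = X1_lab.copy(), X2_lab.copy(), y_lab.copy()
    U1, U2 = X1_unlab.copy(), X2_unlab.copy()
    for _ in range(rounds):
        h1 = GaussianNB().fit(L1, y)                 # learner on view 1
        h2 = GaussianNB().fit(L2, y)                 # learner on view 2
        if len(U1) == 0:
            break
        pool = rng.choice(len(U1), size=min(pool_size, len(U1)), replace=False)
        new_idx, new_y = [], []
        for h, X_pool in ((h1, U1[pool]), (h2, U2[pool])):
            proba = h.predict_proba(X_pool)
            for cls, count in ((1, p), (0, n)):      # most confident positives / negatives
                col = list(h.classes_).index(cls)
                top = np.argsort(proba[:, col])[-count:]
                new_idx.extend(pool[top])
                new_y.extend([cls] * len(top))
        new_idx, new_y = np.array(new_idx), np.array(new_y)
        # Each learner's confident self-labels are added to BOTH views' labeled sets.
        L1 = np.vstack([L1, U1[new_idx]])
        L2 = np.vstack([L2, U2[new_idx]])
        y = np.concatenate([y, new_y])
        keep = np.setdiff1d(np.arange(len(U1)), new_idx)
        U1, U2 = U1[keep], U2[keep]
    return h1, h2
```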

Page 24: Semi-supervised Learning

Co-Training

• Instances contain two sufficient sets of features
  – i.e. an instance is $x = (x_1, x_2)$
  – Each set of features is called a view

• Two views are independent given the label: $P(x_1, x_2 \mid y) = P(x_1 \mid y)\, P(x_2 \mid y)$

• Two views are consistent: $f(x) = f_1(x_1) = f_2(x_2)$

(Blum and Mitchell 1998)

Page 25: Semi-supervised Learning

Co-Training

[Figure: co-training iterations t, t+1, …. C1 is a classifier trained on view 1 and C2 is a classifier trained on view 2; C1 is allowed to label some instances, C2 is allowed to label some instances, and the self-labeled instances are added to the pool of training data]

Page 26: Semi-supervised Learning

Agreement Maximization

• A side effect of co-training: agreement between the two views.

• Is it possible to pose agreement as the explicit goal?
  – Yes. The resulting algorithm: Agreement Boost

(Leskes 2005)

Page 27: Semi-supervised Learning

What if the Co-training Assumptions Are Not Perfectly Satisfied?

• Idea: Want classifiers that produce a maximally consistent labeling of the data

• If learning is an optimization problem, what function should we optimize?


Page 28: Semi-supervised Learning

Other Related Works

• Multi-view clustering (Bickel & Scheffer 2004)
  Modified the co-training algorithm by replacing the class variable (class label) with a mixture coefficient to obtain a multi-view clustering algorithm.

• Manifold co-regularization (Sindhwani et al., 2005)
  Extended manifold regularization to multi-view learning.

• Active multi-view learning (Muslea 2002)
  Combines active learning and multi-view learning.

• More related work can be found in the workshop on Multi-view Learning at ICML 2005: http://www-ai.cs.uni-dortmund.de/MULTIVIEW2005/index.html

Page 29: Semi-supervised Learning

Reference

• A. Blum and T. Mitchell, 1998. Combining labeled and unlabeled data with co-training. In Proceedings of COLT 1998.
• D. Yarowsky, 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL 1995.
• K. Nigam and R. Ghani, 2000. Analyzing the effectiveness and applicability of co-training. In Proceedings of CIKM 2000.
• S. Abney, 2002. Bootstrapping. In Proceedings of ACL 2002.
• U. Brefeld and T. Scheffer, 2004. Co-EM support vector learning. In Proceedings of ICML 2004.
• S. Bickel and T. Scheffer, 2004. Multi-view clustering. In Proceedings of ICDM 2004.
• V. Sindhwani, P. Niyogi, and M. Belkin, 2005. A co-regularization approach to semi-supervised learning with multiple views. In Workshop on Learning with Multiple Views at ICML 2005.
• I. Muslea, 2002. Active learning with multiple views. PhD thesis, University of Southern California.