Page 1:

Partially Supervised Learning

M.M. Pedram, [email protected]

Tarbiat Moallem University of Tehran (Fall 2011)

Page 2:

Based on:
- http://www.cs.uic.edu/~liub/WebMiningBook.html
  - B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer-Verlag Berlin Heidelberg, 2011. Chapter 3 and 5 presentations.

and

- Tom M. Mitchell, Semi-Supervised Learning over Text, Presentation, 2006.
  - Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell, "Learning to Classify Text from Labeled and Unlabeled Documents", In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pp. 792-799, 1998.
  - Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39(2/3), pp. 103-134, 2000. (longer version)

Page 3:

Outline
- Fully supervised learning (traditional classification)
- Partially supervised learning (or classification)

Page 4:

Fully supervised learning (traditional classification)

Page 5:

Document Classification
- Spam filtering, relevance rating, web page classification, ...
- f : Doc → Class, learned from labeled pairs (xi, yi)

Question:
- Is it possible to classify documents using unlabeled data, i.e. examples of the form (xi, -)?

Page 6:

Bayes Theorem

- p(h) = prior probability of hypothesis h
- p(D) = prior probability of training data D
- p(h|D) = probability of h given D
- p(D|h) = probability of D given h

[Diagram: hypothesis h and evidence D; information flows as p(D|h), inference as p(h|D).]

$$\underbrace{p(h \mid D)}_{\text{posterior}} = \frac{\overbrace{p(D \mid h)}^{\text{likelihood}} \cdot \overbrace{p(h)}^{\text{prior}}}{p(D)}$$

Page 7:

Bayesian classification
- Probabilistic view: supervised learning can naturally be studied from a probabilistic point of view.
- Let A1 through Ak be attributes with discrete values. The class is C.
- Given a test example d with observed attribute values a1 through ak, classification basically amounts to computing the posterior probability Pr(C = cj | A1 = a1, ..., Ak = ak). The prediction is the class cj for which this probability is maximal.

Page 8:

Apply Bayes’ Rule

- Pr(C = cj) is the class prior probability: easy to estimate from the training data.

$$\Pr(C=c_j \mid A_1=a_1,\ldots,A_{|A|}=a_{|A|}) = \frac{\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_j)\,\Pr(C=c_j)}{\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|})} = \frac{\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_j)\,\Pr(C=c_j)}{\sum_{r=1}^{|C|}\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_r)\,\Pr(C=c_r)}$$

Page 9:

Computing probabilities
- The denominator Pr(A1=a1, ..., Ak=ak) is irrelevant for decision making since it is the same for every class.
- We only need Pr(A1=a1, ..., Ak=ak | C=cj), which can be written as

  Pr(A1=a1 | A2=a2, ..., Ak=ak, C=cj) * Pr(A2=a2, ..., Ak=ak | C=cj)

- Recursively, the second factor above can be written in the same way, and so on.
- Now an assumption is needed.

Page 10:

Conditional independence assumption
- All attributes are conditionally independent given the class C = cj.
- Formally, we assume

  Pr(A1=a1 | A2=a2, ..., A|A|=a|A|, C=cj) = Pr(A1=a1 | C=cj)

  and so on for A2 through A|A|. I.e.,

$$\Pr(A_1=a_1,\ldots,A_{|A|}=a_{|A|} \mid C=c_j) = \prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)$$

Page 11:

Final naïve Bayesian classifier

$$\Pr(C=c_j \mid A_1=a_1,\ldots,A_{|A|}=a_{|A|}) = \frac{\Pr(C=c_j)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)}{\sum_{r=1}^{|C|}\Pr(C=c_r)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_r)}$$

- We are done!
- How do we estimate Pr(Ai = ai | C = cj)? Easy!

Page 12:

Classify a test instance
- If we only need a decision on the most probable class for the test instance, we only need the numerator, as the denominator is the same for every class.
- Thus, given a test example, we compute the following to decide its most probable class:

$$c = \arg\max_{c_j}\,\Pr(C=c_j)\prod_{i=1}^{|A|}\Pr(A_i=a_i \mid C=c_j)$$

Page 13:

Example
- Compute all probabilities required for classification:

Page 14:

Example (cont ...)
- For C = t, we have

$$\Pr(C=t)\prod_{j=1}^{2}\Pr(A_j=a_j \mid C=t) = \frac{1}{2}\times\frac{2}{5}\times\frac{2}{5} = \frac{2}{25}$$

- For class C = f, we have

$$\Pr(C=f)\prod_{j=1}^{2}\Pr(A_j=a_j \mid C=f) = \frac{1}{2}\times\frac{1}{5}\times\frac{2}{5} = \frac{1}{25}$$

- C = t is more probable ⇒ t is the final class.

Page 15:

Additional issues
- Numeric attributes: naïve Bayesian learning assumes that all attributes are categorical. Numeric attributes need to be discretized.
- Zero counts: a particular attribute value never occurs together with a class in the training set. We need smoothing:

$$\Pr(A_i=a_i \mid C=c_j) = \frac{n_{ij} + \lambda}{n_j + \lambda\,n_i}$$

  where n_ij is the number of training examples with A_i = a_i and class c_j, n_j is the number of training examples of class c_j, and n_i is the number of possible values of attribute A_i.
  - λ is commonly set to λ = 1/n, where n is the total number of examples in the training set D.
- Missing values: ignored.
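As a concrete illustration (not from the slides): a minimal Python sketch of the categorical naïve Bayes classifier with Lidstone smoothing described above; the function names and the toy data are made up.

```python
from collections import Counter, defaultdict

def train_nb(examples, n_values, lam=None):
    """Categorical naive Bayes with Lidstone smoothing.
    examples: list of (attribute_tuple, class_label); n_values[i]: number of values of attribute i."""
    n = len(examples)
    lam = lam if lam is not None else 1.0 / n             # slide suggestion: lambda = 1/n
    class_count = Counter(c for _, c in examples)         # n_j
    pair_count = defaultdict(Counter)                     # n_ij, keyed as pair_count[(i, a_i)][c_j]
    for attrs, c in examples:
        for i, a in enumerate(attrs):
            pair_count[(i, a)][c] += 1
    prior = {c: class_count[c] / n for c in class_count}  # Pr(C = c_j)

    def classify(attrs):
        # arg max_{c_j} Pr(C=c_j) * prod_i Pr(A_i=a_i | C=c_j), with smoothed estimates
        def score(c):
            s = prior[c]
            for i, a in enumerate(attrs):
                s *= (pair_count[(i, a)][c] + lam) / (class_count[c] + lam * n_values[i])
            return s
        return max(class_count, key=score)

    return classify

# toy usage: two binary attributes, classes 't' / 'f'
data = [((0, 1), 't'), ((1, 1), 't'), ((0, 0), 'f'), ((1, 0), 'f'), ((0, 1), 'f')]
classify = train_nb(data, n_values=[2, 2])
print(classify((0, 1)))
```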

Page 16:

On the naïve Bayes classifier
- Advantages:
  - Easy to implement
  - Very efficient
  - Good results obtained in many applications
- Disadvantages:
  - Assumption of class conditional independence, and therefore loss of accuracy when the assumption is seriously violated (highly correlated data sets)

Page 17:

Naïve Bayes for Text Classification

Page 18:

Text classification/categorization
- Due to the rapid growth of online documents in organizations and on the Web, automated document classification has become an important problem.
- Different techniques can be applied to text classification, but they are not as effective as the next three methods.
- We first study a naïve Bayesian method specifically formulated for text, which makes use of some text-specific features.
- However, the ideas are similar to the preceding method.

Page 19:

Probabilistic framework
- Generative model: each document is generated by a parametric distribution governed by a set of hidden parameters.
  - Training data are used to estimate these parameters.
  - The parameters are then applied to classify each test document using Bayes' rule, by calculating the posterior probability that the distribution associated with a class (represented by the unobserved class variable) would have generated the given document.
- The generative model makes two assumptions:
  1. The data (the text documents) are generated by a mixture model.
  2. There is a one-to-one correspondence between mixture components and document classes.

Page 20:

Mixture model
- A mixture model models the data with a number of statistical distributions.
  - Intuitively, each distribution corresponds to a data cluster, and the parameters of the distribution provide a description of the corresponding cluster.
- Each distribution in a mixture model is also called a mixture component.
- The distribution/component can be of any kind.

Page 21:

An example
- The figure shows a plot of the probability density function of a 1-dimensional data set (with 2 classes) generated by
  - a mixture of two Gaussian distributions,
  - one Gaussian distribution per class, whose parameters (denoted by θi) are the mean (µi) and the standard deviation (σi), i.e., θi = (µi, σi).

[Figure: density of the two-component Gaussian mixture, with the components labeled class 1 and class 2.]
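As an aside (not from the slides), a small sketch of how such a two-component Gaussian mixture generates labeled 1-D data; the mixture weights and the (µi, σi) values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative parameters: mixture weights and per-class Gaussians theta_i = (mu_i, sigma_i)
phi = [0.4, 0.6]
theta = [(-2.0, 0.7), (1.5, 1.0)]

def generate(n):
    """Pick a class by its mixture weight, then sample the point from that class's Gaussian."""
    classes = rng.choice([0, 1], size=n, p=phi)
    x = np.array([rng.normal(*theta[c]) for c in classes])
    return x, classes

x, y = generate(1000)
print(x[:5], y[:5])
```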

Page 22:

Mixture model (cont ...)
- K : the number of mixture components (or distributions) in a mixture model.
- j : index of the mixture component.
- θj : the parameters of the j-th distribution.
- ϕj : the mixture weight (or mixture probability) of mixture component j.
- Θ : the set of parameters of all components:

  Θ = {ϕ1, ϕ2, …, ϕK, θ1, θ2, …, θK}

- How does the model generate documents?

Page 23:

Document generation
- Due to the one-to-one correspondence, each class corresponds to a mixture component. The mixture weights are class prior probabilities:

  ϕj = Pr(cj | Θ)

- The mixture model generates each document di by:
  i. selecting a mixture component (or class) according to the class prior probabilities (i.e., the mixture weights), ϕj = Pr(cj | Θ);
  ii. having this selected mixture component (cj) generate a document di according to its parameters, with distribution Pr(di | cj; Θ), or more precisely Pr(di | cj; θj).

$$\Pr(d_i \mid \Theta) = \sum_{j=1}^{|C|}\Pr(c_j \mid \Theta)\,\Pr(d_i \mid c_j;\Theta)$$
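For illustration only (not from the slides): a minimal sketch of this two-step generative process under the bag-of-words/multinomial view introduced on the next slides; the vocabulary, priors and word probabilities are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["ball", "team", "vote", "election"]            # illustrative vocabulary V
classes = ["sports", "politics"]                         # C = {c_1, c_2}
prior = [0.5, 0.5]                                       # phi_j = Pr(c_j | Theta)
word_probs = {                                           # Pr(w_t | c_j; Theta); each row sums to 1
    "sports":   [0.50, 0.40, 0.05, 0.05],
    "politics": [0.05, 0.05, 0.50, 0.40],
}

def generate_document(length):
    c = rng.choice(classes, p=prior)                            # step i: pick a class by its prior
    words = rng.choice(vocab, size=length, p=word_probs[c])     # step ii: the class generates the words
    return c, list(words)

print(generate_document(6))
```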

Page 24:

Model text documents
- The naïve Bayesian classification treats each document as a "bag of words".

  [Example word-count vector for a document: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, Zaire 0]
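For illustration (not from the slides), the bag-of-words view amounts to keeping only word counts:

```python
from collections import Counter

doc = "oil prices in Africa rose as gas demand grew all about supply all about demand"
bag = Counter(doc.lower().split())   # word order and positions are discarded
print(bag["about"], bag["all"], bag["oil"], bag["aardvark"])   # 2 2 1 0
```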

Page 25:

Model text documents
- The generative model makes the following further assumptions:
  I. Words of a document are generated independently of context, given the class label. (The familiar naïve Bayes assumption used before.)
  II. The probability of a word is independent of its position in the document.
  III. The document length is chosen independently of its class.
- ⇒ each document can be regarded as generated by a multinomial distribution.

Page 26:

Reminder!
- A multinomial trial is a process that can result in any of k outcomes, where k ≥ 2.
- The probabilities of the k outcomes are denoted by p1, p2, …, pk.
  - The rolling of a die is a multinomial trial, with six possible outcomes 1, 2, 3, 4, 5, 6.
    - For a fair die, p1 = p2 = … = pk = 1/6.
- Assume:
  - n independent trials are conducted, each with the k possible outcomes and the k probabilities p1, p2, …, pk.
  - Number the outcomes 1, 2, 3, …, k.
  - X1, X2, …, Xk : Xi is the number of trials that result in outcome i.
    - In the die-rolling example: X4 is the number of trials in which a 4 appears.
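A quick numerical illustration (not from the slides) of drawing the counts (X1, ..., X6) for n rolls of a fair die:

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 60, 6
p = [1.0 / k] * k                  # fair die: p1 = ... = p6 = 1/6
counts = rng.multinomial(n, p)     # one draw of (X1, ..., X6) over n rolls
print(counts, counts.sum())        # the Xi always sum to n
```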

Page 27:

Reminder!
- Note: X1, X2, …, Xk are discrete random variables.
- The collection X1, X2, …, Xk is said to have the multinomial distribution with parameters n, p1, p2, …, pk.
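Not spelled out on the slide, but useful to recall: the probability mass function of this distribution is

$$\Pr(X_1=x_1,\ldots,X_k=x_k) = \frac{n!}{x_1!\,x_2!\cdots x_k!}\;p_1^{x_1} p_2^{x_2}\cdots p_k^{x_k}, \qquad x_1+\cdots+x_k=n.$$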

Page 28:

Multinomial distribution
- With the assumptions (given in slide 25), each document can be regarded as generated by a multinomial distribution.
  - In other words, each document is drawn from a multinomial distribution over words, with as many independent trials as the length of the document.

Page 29:

Nomenclature (in text document classification)
- n : the length of a document.
- wt : the t-th word in the vocabulary.
  - The words are from a given vocabulary V = {w1, w2, …, w|V|}.
- wdi,k : the word in position k of document di.
- |V| : the number of words in the vocabulary.
  - Note: |V| = k is the number of outcomes (words).
- C = {c1, c2, …, c|C|} : the set of (document) classes.
- pt : the probability of occurrence of the word wt in a document of class cj:

  pt = Pr(wt | cj; Θ)

- Xt : a random variable counting the number of times that word wt appears in a document.
- Nti : the number of times that word wt occurs in document di.
- |di| : the document length.
- Pr(|di|) : the probability of document length.
- Dj : the subset of (the labeled training) documents of class cj.
- D = {D1, D2, …, D|C|} : the labeled training data.

Page 30:

Use the probability function of the multinomial distribution
- We can directly apply the probability function of the multinomial distribution to find the probability of a document given its class (Pr(|di|) is assumed to be independent of the class):

$$\Pr(d_i \mid c_j;\Theta) = \Pr(w_{d_i,1},\ldots,w_{d_i,|d_i|} \mid c_j;\Theta) = \Pr(|d_i|)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_j;\Theta; w_{d_i,q}, q<k) = \Pr(|d_i|)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_j;\Theta)$$

  where

$$\sum_{t=1}^{|V|} N_{ti} = |d_i| \qquad\text{and}\qquad \sum_{t=1}^{|V|}\Pr(w_t \mid c_j;\Theta) = 1.$$

Page 31:

Parameter estimation
- The parameters are estimated based on empirical counts:

$$\hat{\Pr}(w_t \mid c_j;\hat\Theta) = \frac{\sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)} = \frac{\text{number of times } w_t \text{ occurs in the training data of class } c_j}{\text{total number of word occurrences in the training data for class } c_j}$$

- Note: for the labeled training data,

$$\Pr(c_j \mid d_i) = \begin{cases} 1 & \text{for each document } d_i \in D_j \\ 0 & \text{for documents in } D \text{ of other classes} \end{cases}$$

Page 32:

Parameter estimation (cont ...)
Reminder
- Additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data.
- Given an observation x = (x1, …, xno_of_classes) from a multinomial distribution with n trials and parameter vector θ = (θ1, …, θno_of_classes), a "smoothed" version of the data gives the estimator

$$\hat\theta_i = \frac{x_i + \lambda}{n + \lambda\cdot\text{no\_of\_classes}}$$

  where λ > 0 is the smoothing parameter (λ = 0 corresponds to no smoothing). The resulting estimate lies between the empirical estimate xi / n and the uniform probability 1/no_of_classes.

Page 33:

Parameter estimation (cont ...)
- In order to handle zero counts for infrequently occurring words that do not appear in the training set but may appear in the test set, we need to smooth the probability. Lidstone smoothing, 0 ≤ λ ≤ 1:

$$\hat{\Pr}(w_t \mid c_j;\hat\Theta) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\lambda|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)}$$
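For illustration (not from the slides): a small sketch of this smoothed estimate computed from word-count and posterior matrices; the array layout and names are assumptions.

```python
import numpy as np

def estimate_word_probs(N, post, lam=1.0):
    """Lidstone-smoothed estimate of Pr(w_t | c_j).
    N:    |V| x |D| matrix, N[t, i] = N_ti (count of word t in document i)
    post: |C| x |D| matrix, post[j, i] = Pr(c_j | d_i), 0/1 for labeled documents
    Returns a |V| x |C| matrix whose columns each sum to 1."""
    V = N.shape[0]
    weighted = N @ post.T                                    # sum_i N_ti * Pr(c_j | d_i)
    return (lam + weighted) / (lam * V + weighted.sum(axis=0, keepdims=True))

# toy usage: 3 words, 4 labeled documents, 2 classes
N = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1],
              [1, 1, 2, 0]])
post = np.array([[1, 0, 1, 0],     # documents 1 and 3 belong to class 0
                 [0, 1, 0, 1]])    # documents 2 and 4 belong to class 1
print(estimate_word_probs(N, post))
```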

Page 34:

Parameter estimation (cont ...)
- Class prior probabilities, which are the mixture weights ϕj, can be easily estimated from the training data:

$$\hat{\Pr}(c_j \mid \hat\Theta) = \frac{\sum_{i=1}^{|D|}\Pr(c_j \mid d_i)}{|D|} = \frac{\text{number of training documents of class } c_j}{\text{total number of training documents}}$$

Page 35:

Classification
- Given a test document di, the posterior probability of each class cj is derived from

$$\Pr(d_i \mid \hat\Theta) = \sum_{j=1}^{|C|}\Pr(c_j \mid \hat\Theta)\,\Pr(d_i \mid c_j;\hat\Theta) \qquad\text{and}\qquad \Pr(d_i \mid c_j;\hat\Theta) = \Pr(|d_i|)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_j;\hat\Theta)$$

  as

$$\Pr(c_j \mid d_i;\hat\Theta) = \frac{\Pr(c_j \mid \hat\Theta)\,\Pr(d_i \mid c_j;\hat\Theta)}{\Pr(d_i \mid \hat\Theta)} = \frac{\Pr(c_j \mid \hat\Theta)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_j;\hat\Theta)}{\sum_{r=1}^{|C|}\Pr(c_r \mid \hat\Theta)\prod_{k=1}^{|d_i|}\Pr(w_{d_i,k} \mid c_r;\hat\Theta)}$$

Page 36:

Classification
- If the final classifier is to classify each document into a single class, the class with the highest posterior probability is selected:

$$\text{classifier output} = \arg\max_{c_j \in C}\Pr(c_j \mid d_i;\hat\Theta)$$

Page 37:

Classification
- Train:
  For each class cj of documents
  1. Estimate p(cj)
  2. For each word wi, estimate p(wi | cj)
- Classify(doc):
  Assign doc to the most probable class.
  - Assumption: words are conditionally independent, given the class.

$$c_j = \arg\max_j\,p(c_j, a_1, a_2, \ldots) = \arg\max_j\,p(c_j)\prod_{w_i \in doc} p(w_i \mid c_j)$$

  or

$$c_j = \arg\max_j\,p(a_1, a_2, \ldots \mid c_j) = \arg\max_j\prod_{w_i \in doc} p(w_i \mid c_j)$$

[Diagram: naïve Bayes network with the class cj as the output variable and the words w1, w2, w3, …, wn as the input variables.]
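As a concrete sketch (not from the slides) of this train/classify recipe for text, with Lidstone smoothing and log-space scoring; the function names and the toy corpus are made up.

```python
import math
from collections import Counter

def train(docs, labels, lam=1.0):
    """docs: list of token lists; labels: list of class names. Returns (priors, word_probs, vocab)."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    priors, word_probs = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)                  # Pr(c_j)
        counts = Counter(w for d in class_docs for w in d)       # word counts within class c_j
        total = sum(counts.values())
        word_probs[c] = {w: (counts[w] + lam) / (total + lam * len(vocab)) for w in vocab}
    return priors, word_probs, vocab

def classify(doc, priors, word_probs, vocab):
    """Return arg max_c [ log Pr(c) + sum_k log Pr(w_k | c) ]; words outside the vocabulary are skipped."""
    def log_post(c):
        return math.log(priors[c]) + sum(
            math.log(word_probs[c][w]) for w in doc if w in word_probs[c])
    return max(priors, key=log_post)

# toy corpus
docs = [["ball", "team", "win"], ["vote", "election", "team"], ["election", "vote", "vote"]]
labels = ["sports", "politics", "politics"]
model = train(docs, labels)
print(classify(["vote", "win"], *model))
```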

Page 38:

Classification

[Figure: classification accuracy vs. number of training examples.]

Page 39:

Discussions
- Most assumptions made by naïve Bayesian learning are violated to some degree in practice.
  - Words in a document are clearly not independent of each other.
  - The mixture model assumption of a one-to-one correspondence between classes and mixture components may not be true, because a class may contain documents from multiple topics.
- Despite such violations, naïve Bayesian learning is extremely efficient:
  - Researchers have shown that naïve Bayesian learning produces very accurate models.
  - It scans the training data only once to estimate all the probabilities required for classification.
  - It can be used as an incremental algorithm as well: the model can be updated easily as new data come in, because the probabilities can be conveniently revised.

Page 40:

Discussions
- The main problem is the mixture model assumption. When this assumption is seriously violated, the classification performance can be poor.

Page 41:

Partially Supervised Learning

Page 42:

Partially Supervised Learning
- Partially supervised learning (or classification):
  - Learning with a small set of labeled examples and a large set of unlabeled examples (LU learning, or semi-supervised learning).
    - In this learning setting, there is a small set of labeled examples of every class, and a large set of unlabeled examples.
    - The objective is to make use of the unlabeled examples to improve learning.
  - Learning with positive and unlabeled examples, i.e., no labeled negative examples (PU learning).
    - This problem assumes two-class classification.
    - The training data only have a set of labeled positive examples and a set of unlabeled examples, but no labeled negative examples.
    - The objective is to build an accurate classifier without labeling any negative examples.
- These two learning problems do not need full supervision, and are thus able to reduce the labeling effort.

Page 43:

Learning from a small labeled set and a large unlabeled set

LU learning

Page 44:

Unlabeled Data
- One of the bottlenecks of classification is the labeling of a large set of examples (data records or text documents).
  - Often done manually
  - Time consuming
- Can we label only a small number of examples and make use of a large number of unlabeled examples to learn?
- Possible in many cases.

Page 45:

Why are unlabeled data useful?
- Unlabeled data are usually plentiful; labeled data are expensive.
- Unlabeled data provide information about the joint probability distribution over words and collocations (in texts).
  - Suppose that, using only the labeled data, we find that documents containing the word "homework" tend to belong to a particular class. If we use this fact to classify the unlabeled documents, we may find that "lecture" co-occurs with "homework" frequently in the unlabeled set. Then "lecture" may also be an indicative word for the class. Such correlations provide a helpful source of information to increase classification accuracy, especially when the labeled data are scarce.
- We will use text classification to study this problem.

Page 46:

Why are unlabeled data useful?

Page 47:

How to use unlabeled data
- One way is to use the EM algorithm.
  - EM: Expectation-Maximization
- The EM algorithm is a popular iterative algorithm for maximum likelihood estimation in problems with missing data.
- The EM algorithm consists of two steps:
  - Expectation step: fill in the missing data based on the current estimate of the parameters.
  - Maximization step: calculate a new maximum a posteriori (or maximum likelihood) estimate of the parameters.

Page 48:

Incorporating Unlabeled Data with EM
- Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39(2/3), pp. 103-134, 2000.
- Basic EM
- Augmented EM with weighted unlabeled data
- Augmented EM with multiple mixture components per class

Page 49:

EM – Expectation / Maximization
- EM is a class of algorithms used to estimate a probability distribution in the presence of missing values.
- Using it requires an assumption on the underlying probability distribution.
- The algorithm can be very sensitive to this assumption and to the starting point (that is, the initial guess of the parameters).
- It is a hill-climbing algorithm and converges to a local maximum of the likelihood function.
- Note that the EM algorithm is not really a specific "algorithm", but rather a framework or strategy.

Page 50:

The Likelihood Function
- A set of training data D.
- A parametric family of models with parameters θ.
- We want the θ that maximizes Pr(θ | D).
- Bayes:

  Pr(θ | D) = Pr(D | θ) ⋅ Pr(θ) / Pr(D)

  - Pr(D | θ), when viewed as a function of θ, is the likelihood function:

  L(θ; D) = Pr(D | θ)    (sometimes written L(θ, D) = L(D | θ))

- The likelihood function is NOT a probability distribution over θ, and its integral / sum need not be 1.0.
- The M.L. (maximum likelihood) θ is the one that maximizes the likelihood function (without regard for a prior on θ).

Page 51:

Algorithm Outline
1. Train a classifier with only the labeled documents.
2. Use it to probabilistically classify the unlabeled documents.
3. Use ALL the documents to train a new classifier.
4. Iterate steps 2 and 3 until convergence.

Page 52:

The idea
- The documents in the unlabeled set (denoted by U) can be regarded as having missing class labels.
- The parameters that EM estimates in this case are the probability of each word given a class and the class prior probabilities.
- EM here can also be seen as a clustering method with some initial seeds (labeled data) in each cluster. The class labels of the seeds indicate the class labels of the resulting clusters.
- Two assumptions are made in the derivation of the algorithm, which are in fact the two mixture model assumptions:
  1. the data are generated by a mixture model,
  2. there is a one-to-one correspondence between mixture components and classes.

Page 53:

The EM algorithm with the naïve Bayes classifier

Algorithm EM(L, U)
1. Learn an initial naïve Bayesian classifier f from only the labeled set L (Equations (3-31) and (3-32));
2. repeat
   // E-Step
3.   for each example di in U do
4.     Use the current classifier f to compute Pr(cj | di) (Equation (3-33));
5.   end
   // M-Step
6.   Learn a new naïve Bayesian classifier f from L ∪ U by computing Pr(cj) and Pr(wt | cj) (Equations (3-31) and (3-32));
7. until the classifier parameters stabilize;
Return the classifier f from the last iteration.

The equations referred to above:

Equation (3-31):

$$\hat{\Pr}(w_t \mid c_j;\hat\Theta) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti}\,\Pr(c_j \mid d_i)}{\lambda|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D|} N_{si}\,\Pr(c_j \mid d_i)}$$

Equation (3-32):

$$\hat{\Pr}(c_j \mid \hat\Theta) = \frac{\sum_{i=1}^{|D|}\Pr(c_j \mid d_i)}{|D|}$$

Equation (3-33):

$$\Pr(c_j \mid d_i;\hat\Theta) = \frac{\hat{\Pr}(c_j \mid \hat\Theta)\prod_{k=1}^{|d_i|}\hat{\Pr}(w_{d_i,k} \mid c_j;\hat\Theta)}{\sum_{r=1}^{|C|}\hat{\Pr}(c_r \mid \hat\Theta)\prod_{k=1}^{|d_i|}\hat{\Pr}(w_{d_i,k} \mid c_r;\hat\Theta)}$$
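For illustration (not from the slides): a compact sketch of the EM(L, U) loop above for a multinomial naïve Bayes model over word-count matrices; the array names are assumptions, and a fixed iteration count stands in for the "parameters stabilize" test.

```python
import numpy as np

def m_step(N, post, lam=1.0):
    """M-step (Eqs. 3-31, 3-32): re-estimate priors and Lidstone-smoothed word probabilities.
    N: |V| x |D| word counts; post: |C| x |D|, post[j, i] = Pr(c_j | d_i)."""
    prior = post.sum(axis=1) / post.shape[1]
    weighted = N @ post.T                                           # |V| x |C|
    word_probs = (lam + weighted) / (lam * N.shape[0] + weighted.sum(axis=0, keepdims=True))
    return prior, word_probs

def e_step(N, prior, word_probs):
    """E-step (Eq. 3-33): posterior Pr(c_j | d_i) for every document, computed in log space."""
    log_scores = np.log(prior)[:, None] + np.log(word_probs).T @ N  # |C| x |D|
    log_scores -= log_scores.max(axis=0, keepdims=True)
    post = np.exp(log_scores)
    return post / post.sum(axis=0, keepdims=True)

def em_nb(N_labeled, post_labeled, N_unlabeled, n_iter=20):
    """EM(L, U): start from the labeled documents only, then iterate E- and M-steps on L u U."""
    prior, word_probs = m_step(N_labeled, post_labeled)             # step 1: classifier from L only
    N_all = np.hstack([N_labeled, N_unlabeled])
    for _ in range(n_iter):                                         # steps 2-7 (fixed iteration count here)
        post_unlabeled = e_step(N_unlabeled, prior, word_probs)     # E-step on U
        post_all = np.hstack([post_labeled, post_unlabeled])        # the labels of L stay fixed
        prior, word_probs = m_step(N_all, post_all)                 # M-step on L u U
    return prior, word_probs

# toy usage: 3 words, 2 labeled + 3 unlabeled documents, 2 classes
N_l = np.array([[3, 0], [0, 2], [1, 1]])
post_l = np.array([[1, 0], [0, 1]], dtype=float)
N_u = np.array([[2, 0, 1], [0, 3, 1], [1, 1, 1]])
prior, word_probs = em_nb(N_l, post_l, N_u)
print(prior, word_probs, sep="\n")
```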

Page 54:

Example
- Find the naïve Bayesian classifier f based on the labeled and unlabeled examples below.
- Does the result of classification change if the 3rd record is deleted?

  x1  x2  x3  x4  y
  0   0   1   1   1
  0   1   0   0   0
  0   0   1   0   0
  0   1   1   0   ?
  0   1   0   1   ?

Page 55:

The problem
- It has been shown that the EM algorithm works well if
  - the two mixture model assumptions for a particular data set are true.
- Although naïve Bayesian classification makes three additional assumptions, it performs surprisingly well despite the obvious violation of those assumptions.
- The two mixture model assumptions, however, can cause major problems when they do not hold. In many real-life situations, they may be violated.
- The first assumption above is usually not a problem, while the second assumption is critical.
- It is often the case that a class (or topic) contains a number of sub-classes (or sub-topics).
  - For example, the class Sports may contain documents about different sub-classes of sports: Baseball, Basketball, Tennis, and Softball.

Page 56:

The problem
- Note:
  - If the second condition holds, EM works very well and is particularly useful when the labeled set is very small, e.g., fewer than five labeled documents per class. In such cases, every iteration of EM is able to improve the classifier dramatically.
  - If the second condition does not hold, the unlabeled set hurts learning instead of helping it.
- Some methods to deal with the problem:
  - Weighting the influence of unlabeled examples
  - Finding mixture components

Page 57:

Weighting the Unlabeled Data
- This method weights the influence of the unlabeled examples by a factor µ.
- In LU learning, the labeled set is small but the unlabeled set is very large, so EM's parameter estimation is almost completely determined by the unlabeled set after the first iteration.
  - This means that EM essentially performs unsupervised clustering. When the two mixture model assumptions are true, the natural clusters of the data are in correspondence with the class labels, and the resulting clusters can be used as the classifier.
  - When the assumptions are not true, the clustering may not converge to mixture components corresponding to the given classes.

Solution:
- Reduce the effect of the problem by weighting down the unlabeled data during parameter estimation (the EM iterations).

Page 58:

Weighting the Unlabeled Data


- The computation of Pr(wt | cj) is changed so that the counts of the unlabeled documents are decreased by a factor µ, 0 ≤ µ ≤ 1.
- New M-step: see the sketch after this list.
- The value of µ can be chosen based on leave-one-out cross-validation accuracy on the labeled training data. The µ value that gives the best result is used.
- The prior probability also needs to be weighted.
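The slide's weighted formula itself is not reproduced above; the following is a hedged reconstruction consistent with the description, writing Λ(i) = µ for unlabeled documents and Λ(i) = 1 for labeled ones (this Λ notation is an assumption, not taken from the slide):

$$\hat{\Pr}(w_t \mid c_j;\hat\Theta) = \frac{\lambda + \sum_{i=1}^{|D|}\Lambda(i)\,N_{ti}\,\Pr(c_j \mid d_i)}{\lambda|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D|}\Lambda(i)\,N_{si}\,\Pr(c_j \mid d_i)}$$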

Page 59:

Finding Mixture Components
- Instead of weighting unlabeled data low, we can attack the problem head on, i.e., by finding the mixture components (sub-classes) of each class.
  - For example, the original class Sports may consist of documents from Baseball, Tennis, and Basketball, which are three mixture components (sub-classes or sub-topics) of Sports. Instead of using the original class, we try to find these components, treat each of them as a class, and use them to replace the class Sports.

Page 60:

Finding Mixture Components
- Approaches to identifying the different components:
  - Manually identifying different components:
    - One only needs to read the documents in the labeled set (or some sampled unlabeled documents), which is very small.
  - Automatic approaches for identifying mixture components:
    - For example, a hierarchical clustering technique has been proposed to find the mixture components (sub-classes).
    - A simple approach based on leave-one-out cross-validation on the labeled training set.

Page 61:

Experimental Evaluation
- Newsgroup postings
  - 20 newsgroups, 1000 postings/group
- Web page classification
  - student, faculty, course, project
  - 4199 web pages
- Reuters newswire articles
  - 12,902 articles
  - 10 main topic categories

Page 62:

20 Newsgroups

Page 63:

20 Newsgroups
