Partially Supervised Learning
M.M. Pedram, [email protected]
Tarbiat Moallem University of Tehran (Fall 2011)
Based on:
- B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer-Verlag Berlin Heidelberg, 2011. Chapter 3 and 5 presentations: http://www.cs.uic.edu/~liub/WebMiningBook.html
and
- Tom M. Mitchell, Semi-Supervised Learning over Text, Presentation, 2006.
- Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell, "Learning to Classify Text from Labeled and Unlabeled Documents", In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pp. 792-799, 1998.
- Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39(2/3), pp. 103-134, 2000. (longer version of the AAAI-98 paper)
Outline
- Fully supervised learning (traditional classification)
- Partially supervised learning (or classification)
Fully supervised learning (traditional classification)
Document Classification
- Spam filtering, relevance rating, web page classification, ...
- A classifier is a function f : Doc → Class, learned from labeled training examples (xi, yi).

Question:
- Is it possible to classify documents using unlabeled data, i.e. examples of the form (xi, -)?
Bayes Theorem
- p(h) = prior probability of hypothesis h
- p(D) = prior probability of training data D
- p(h|D) = probability of h given D
- p(D|h) = probability of D given h

Hypothesis h, evidence D.  Information: p(D|h); inference: p(h|D).

  p(h \mid D) = \frac{p(D \mid h)\, p(h)}{p(D)}
  \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}
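A quick worked illustration of Bayes' theorem (the numbers here are made up for this transcript, not taken from the original slides): suppose 20% of incoming mail is spam, the word "free" appears in 60% of spam messages and in 10% of non-spam messages. Then

  \Pr(\text{spam} \mid \text{``free''})
  = \frac{\Pr(\text{``free''} \mid \text{spam})\,\Pr(\text{spam})}{\Pr(\text{``free''})}
  = \frac{0.6 \times 0.2}{0.6 \times 0.2 + 0.1 \times 0.8}
  = \frac{0.12}{0.20} = 0.6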
Bayesian classification
- Probabilistic view: supervised learning can naturally be studied from a probabilistic point of view.
- Let A1 through A|A| be attributes with discrete values. The class is C.
- We are given a test example d with observed attribute values a1 through a|A|.
- Classification is basically to compute the posterior probability Pr(C = cj | A1 = a1, ..., A|A| = a|A|). The prediction is the class cj for which this probability is maximal.
Apply Bayes’ Rule
  \Pr(C = c_j \mid A_1 = a_1, \ldots, A_{|A|} = a_{|A|})
  = \frac{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j)\, \Pr(C = c_j)}{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|})}
  = \frac{\Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j)\, \Pr(C = c_j)}{\sum_{r=1}^{|C|} \Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_r)\, \Pr(C = c_r)}

- Pr(C=cj) is the class prior probability: easy to estimate from the training data.
Computing probabilities
- The denominator Pr(A1=a1, ..., A|A|=a|A|) is irrelevant for decision making since it is the same for every class.
- We only need Pr(A1=a1, ..., A|A|=a|A| | C=cj), which can be written as
  Pr(A1=a1 | A2=a2, ..., A|A|=a|A|, C=cj) · Pr(A2=a2, ..., A|A|=a|A| | C=cj)
- Recursively, the second factor above can be written in the same way, and so on.
- Now an assumption is needed.
Conditional independence assumption
- All attributes are conditionally independent given the class C = cj.
- Formally, we assume
  Pr(A1=a1 | A2=a2, ..., A|A|=a|A|, C=cj) = Pr(A1=a1 | C=cj)
  and so on for A2 through A|A|. I.e.,

  \Pr(A_1 = a_1, \ldots, A_{|A|} = a_{|A|} \mid C = c_j) = \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_j)
Final naïve Bayesian classifier

  \Pr(C = c_j \mid A_1 = a_1, \ldots, A_{|A|} = a_{|A|})
  = \frac{\Pr(C = c_j) \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_j)}{\sum_{r=1}^{|C|} \Pr(C = c_r) \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_r)}

- We are done!
- How do we estimate Pr(Ai = ai | C = cj)? Easy: count in the training data the fraction of class-cj examples in which Ai = ai.
Classify a test instance
- If we only need a decision on the most probable class for the test instance, we only need the numerator, as the denominator is the same for every class.
- Thus, given a test example, we compute the following to decide the most probable class:

  c = \arg\max_{c_j} \Pr(C = c_j) \prod_{i=1}^{|A|} \Pr(A_i = a_i \mid C = c_j)
Example
- Compute all probabilities required for classification from the training data:
  [training data table not preserved in the transcript]
Example (cont ...)
- For class C = t, we have

  \Pr(C = t) \prod_{j=1}^{2} \Pr(A_j = a_j \mid C = t) = \frac{1}{2} \times \frac{2}{5} \times \frac{2}{5} = \frac{2}{25}

- For class C = f, we have

  \Pr(C = f) \prod_{j=1}^{2} \Pr(A_j = a_j \mid C = f) = \frac{1}{2} \times \frac{1}{5} \times \frac{2}{5} = \frac{1}{25}

- C = t is more probable ⇒ t is the final class.
Additional issues
- Numeric attributes: naïve Bayesian learning assumes that all attributes are categorical. Numeric attributes need to be discretized.
- Zero counts: a particular attribute value never occurs together with a class in the training set. We need smoothing:

  \Pr(A_i = a_i \mid C = c_j) = \frac{n_{ij} + \lambda}{n_j + \lambda n_i}

  where n_{ij} is the number of examples with A_i = a_i and class c_j, n_j is the number of examples of class c_j, and n_i is the number of possible values of attribute A_i.
  - λ is commonly set to λ = 1/n, where n is the total number of examples in the training set D.
- Missing values: ignored.
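To make the preceding slides concrete, here is a minimal sketch of a categorical naïve Bayes classifier with Lidstone smoothing. It is an illustration written for this transcript, not code from the original slides; the class name, the toy attribute values and the λ choice are all made up.

```python
import math
from collections import Counter, defaultdict

class CategoricalNB:
    """Minimal naive Bayes for categorical attributes with Lidstone smoothing."""

    def __init__(self, lam=1.0):
        self.lam = lam  # smoothing parameter (lambda in the slides)

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_attrs = len(X[0])
        # number of distinct values per attribute (n_i in the smoothing formula)
        self.n_values = [len({x[i] for x in X}) for i in range(n_attrs)]
        self.class_counts = Counter(y)              # n_j
        self.value_counts = defaultdict(Counter)    # n_ij, keyed by (class, attribute index)
        for x, c in zip(X, y):
            for i, v in enumerate(x):
                self.value_counts[(c, i)][v] += 1
        self.n = len(y)
        return self

    def _log_posterior(self, x, c):
        # log Pr(C=c) + sum_i log Pr(A_i = a_i | C=c), using the smoothed estimates
        logp = math.log(self.class_counts[c] / self.n)
        for i, v in enumerate(x):
            num = self.value_counts[(c, i)][v] + self.lam
            den = self.class_counts[c] + self.lam * self.n_values[i]
            logp += math.log(num / den)
        return logp

    def predict(self, x):
        # arg max over classes of Pr(C=c_j) * prod_i Pr(A_i=a_i | C=c_j)
        return max(self.classes, key=lambda c: self._log_posterior(x, c))

# toy usage on a made-up data set
X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
y = ["no", "yes", "yes", "no"]
clf = CategoricalNB(lam=1.0).fit(X, y)
print(clf.predict(("sunny", "mild")))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.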
On naïve Bayes classifier
- Advantages:
  - Easy to implement
  - Very efficient
  - Good results obtained in many applications
- Disadvantages:
  - Assumption of class-conditional independence; therefore loss of accuracy when the assumption is seriously violated (highly correlated data sets)
Naïve Bayes for Text Classification
Text classification/categorization
- Due to the rapid growth of online documents in organizations and on the Web, automated document classification has become an important problem.
- Different techniques can be applied to text classification, but they are not as effective as the next three methods.
- We first study a naïve Bayesian method specifically formulated for text, which makes use of some text-specific features.
- However, the ideas are similar to the preceding method.
Probabilistic framework
- Generative model: each document is generated by a parametric distribution governed by a set of hidden parameters.
  - Training data are used to estimate these parameters.
  - The parameters are then used to classify each test document, applying Bayes' rule to calculate the posterior probability that the distribution associated with a class (represented by the unobserved class variable) would have generated the given document.
- The generative model makes two assumptions:
  1. The data (the text documents) are generated by a mixture model.
  2. There is a one-to-one correspondence between mixture components and document classes.
Mixture model
- A mixture model models the data with a number of statistical distributions.
  - Intuitively, each distribution corresponds to a data cluster, and the parameters of the distribution provide a description of the corresponding cluster.
- Each distribution in a mixture model is also called a mixture component.
- The distribution/component can be of any kind.
An example
- The figure shows a plot of the probability density function of a 1-dimensional data set (with 2 classes) generated by
  - a mixture of two Gaussian distributions,
  - one Gaussian distribution per class, whose parameters (denoted by θi) are the mean (µi) and the standard deviation (σi), i.e., θi = (µi, σi).

[Figure: mixture density with two overlapping components, labeled class 1 and class 2]
Mixture model (cont ...)
- K: the number of mixture components (or distributions) in a mixture model.
- j: index of a mixture component.
- θj: the parameters of the j-th distribution.
- ϕj: the mixture weight (or mixture probability) of mixture component j.
- Θ: the set of parameters of all components:
  Θ = {ϕ1, ϕ2, ..., ϕK, θ1, θ2, ..., θK}
- How does the model generate documents?
Document generation
- Due to the one-to-one correspondence, each class corresponds to a mixture component. The mixture weights are the class prior probabilities:
  ϕj = Pr(cj | Θ)
- The mixture model generates each document di by:
  i. selecting a mixture component (or class) according to the class prior probabilities (i.e., mixture weights) ϕj = Pr(cj | Θ);
  ii. having this selected mixture component (cj) generate a document di according to its parameters, with distribution Pr(di | cj; Θ), or more precisely Pr(di | cj; θj).
- The likelihood of document di is therefore the sum over all mixture components:

  \Pr(d_i \mid \Theta) = \sum_{j=1}^{|C|} \Pr(c_j \mid \Theta)\, \Pr(d_i \mid c_j; \Theta)
Model text documents
- The naïve Bayesian classification treats each document as a "bag of words", e.g. the word-count vector
  aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0
Model text documents
- The generative model makes the following further assumptions:
  I. Words of a document are generated independently of context, given the class label (the familiar naïve Bayes assumption used before).
  II. The probability of a word is independent of its position in the document.
  III. The document length is chosen independently of its class.
⇒ Each document can be regarded as generated by a multinomial distribution.
Reminder!
- A multinomial trial is a process that can result in any of k outcomes, where k ≥ 2.
- The probabilities of the k outcomes are denoted by p1, p2, ..., pk.
  - The rolling of a die is a multinomial trial, with six possible outcomes 1, 2, 3, 4, 5, 6.
    - For a fair die, p1 = p2 = ... = pk = 1/6.
- Assume:
  - n independent trials are conducted, each with the k possible outcomes and the k probabilities p1, p2, ..., pk.
  - Number the outcomes 1, 2, 3, ..., k.
  - X1, X2, ..., Xk: Xt is the number of trials that result in outcome t.
    - In the die-rolling example, X4 is the number of trials in which a 4 appears.
Reminder!
- Note: X1, X2, ..., Xk are discrete random variables.
- The collection X1, X2, ..., Xk is said to have the multinomial distribution with parameters n, p1, p2, ..., pk.
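For reference (a standard result, not spelled out on the slide), the probability function of the multinomial distribution is

  \Pr(X_1 = n_1, \ldots, X_k = n_k) = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k},
  \qquad \sum_{t=1}^{k} n_t = n

For example, the probability that three rolls of a fair die show exactly one 1, one 2 and one 3 is 3! · (1/6)^3 = 1/36.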
Multinomial distribution
- With the assumptions given above, each document can be regarded as generated by a multinomial distribution.
  - In other words, each document is drawn from a multinomial distribution of words, with as many independent trials as the length of the document.
Nomenclature (in text document classification)
- n: the length of a document.
- wt: the t-th word in the vocabulary.
  - The words are from a given vocabulary V = {w1, w2, ..., w|V|}.
- wdi,m: the word in position m of document di.
- |V|: the number of words in the vocabulary.
  - Note: |V| = k is the number of outcomes (words).
- C = {c1, c2, ..., c|C|}: the set of (document) classes.
- pt: the probability of occurrence of word wt in a document of class cj:
  pt = Pr(wt | cj; Θ)
- Xt: a random variable, the number of times word wt appears in a document.
- Nti: the number of times word wt occurs in document di.
- |di|: document length.
- Pr(|di|): the probability of document length.
- Dj: the subset of (labeled training) documents of class cj.
- D = {D1, D2, ..., D|C|}: the labeled training data.
Use probability function of multinomial distribution
- We can directly apply the probability function of the multinomial distribution to find the probability of a document given its class (Pr(|di|) is assumed to be independent of the class):

  \Pr(d_i \mid c_j; \Theta) = \Pr(w_{d_i,1}, \ldots, w_{d_i,|d_i|} \mid c_j; \Theta)
  = \Pr(|d_i|) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j; \Theta; w_{d_i,q}, q < k)
  = \Pr(|d_i|) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j; \Theta)

  where the word counts and word probabilities satisfy

  \sum_{t=1}^{|V|} N_{ti} = |d_i|, \qquad \sum_{t=1}^{|V|} \Pr(w_t \mid c_j; \Theta) = 1
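As a small illustration (written for this transcript; the helper name and toy numbers are made up), the position-independence assumption means that the document probability reduces to a product over word counts, which is conveniently computed in log space:

```python
import math

def log_doc_prob(doc_counts, word_probs):
    """log Pr(d | c; Theta), up to the class-independent Pr(|d|) term.

    doc_counts: dict mapping word -> N_td (occurrences of the word in the document)
    word_probs: dict mapping word -> Pr(w | c; Theta) for a fixed class c
    """
    return sum(n * math.log(word_probs[w]) for w, n in doc_counts.items())

# toy class parameters and the document "ball ball game"
word_probs_c = {"ball": 0.5, "game": 0.3, "election": 0.2}
doc_counts = {"ball": 2, "game": 1}
print(log_doc_prob(doc_counts, word_probs_c))   # 2*log(0.5) + 1*log(0.3)
```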
Parameter estimation
- The parameters are estimated based on empirical counts:

  \hat{\Pr}(w_t \mid c_j; \Theta) = \frac{\sum_{i=1}^{|D|} N_{ti} \Pr(c_j \mid d_i)}{\sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si} \Pr(c_j \mid d_i)}

  i.e., the number of times that wt occurs in the training data of class cj, divided by the total number of word occurrences in the training data for that class.

- Note: Pr(cj | di) = 1 for each document di in Dj, and 0 for documents of other classes.
Parameter estimation (cont ...)
Reminder
- Additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data.
- Given an observation x = (x1, ..., x_{no_of_classes}) from a multinomial distribution with n trials and parameter vector θ = (θ1, ..., θ_{no_of_classes}), a "smoothed" version of the data gives the estimator

  \hat{\theta}_i = \frac{x_i + \lambda}{n + \lambda \cdot no\_of\_classes}

  where λ > 0 is the smoothing parameter (λ = 0 corresponds to no smoothing). The resulting estimate lies between the empirical estimate xi/n and the uniform probability 1/no_of_classes.
Parameter estimation (cont ...)
- In order to handle zero counts for infrequently occurring words that do not appear in the training set but may appear in the test set, we need to smooth the probability. With Lidstone smoothing, 0 ≤ λ ≤ 1:

  \Pr(w_t \mid c_j; \hat{\Theta}) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti} \Pr(c_j \mid d_i)}{\lambda |V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si} \Pr(c_j \mid d_i)}
Parameter estimation (cont ...)
- Class prior probabilities, which are the mixture weights ϕj, can be easily estimated from the training data:

  \hat{\Pr}(c_j \mid \hat{\Theta}) = \frac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|} = \frac{|D_j|}{|D|}

  i.e., the number of training documents of class cj divided by the total number of training documents.
Classification
- Given a test document di, the posterior probability of each class cj is derived from the following (the class-independent factor Pr(|di|) cancels):

  \Pr(d_i \mid \hat{\Theta}) = \sum_{j=1}^{|C|} \Pr(c_j \mid \hat{\Theta})\, \Pr(d_i \mid c_j; \hat{\Theta}),
  \qquad
  \Pr(d_i \mid c_j; \hat{\Theta}) = \Pr(|d_i|) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j; \hat{\Theta})

  \Pr(c_j \mid d_i; \hat{\Theta}) = \frac{\Pr(c_j \mid \hat{\Theta})\, \Pr(d_i \mid c_j; \hat{\Theta})}{\Pr(d_i \mid \hat{\Theta})}
  = \frac{\Pr(c_j \mid \hat{\Theta}) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j; \hat{\Theta})}{\sum_{r=1}^{|C|} \Pr(c_r \mid \hat{\Theta}) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_r; \hat{\Theta})}
Classification
- If the final classifier is to classify each document into a single class, the class with the highest posterior probability is selected:

  \text{classifier output} = \arg\max_{c_j \in C} \Pr(c_j \mid d_i; \hat{\Theta})
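Putting the preceding slides together, here is a minimal sketch of the supervised multinomial naïve Bayes text classifier: Lidstone-smoothed word probabilities, class priors estimated from document counts, and the arg max classification rule. This is an illustration written for this transcript (all function names and toy documents are made up), not the authors' code.

```python
import math
from collections import Counter

def train_nb(docs, labels, lam=1.0):
    """Estimate Pr(c_j) and the smoothed Pr(w_t | c_j) from labeled documents (word lists)."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d}
    doc_counts = Counter(labels)                      # |D_j|
    word_counts = {c: Counter() for c in classes}     # sum_i N_ti over documents of class c
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    priors = {c: doc_counts[c] / len(docs) for c in classes}
    cond = {}
    for c in classes:
        total = sum(word_counts[c].values())
        cond[c] = {w: (lam + word_counts[c][w]) / (lam * len(vocab) + total) for w in vocab}
    return priors, cond, vocab

def classify_nb(doc, priors, cond, vocab):
    """arg max_c Pr(c) * prod_k Pr(w_{d,k} | c); out-of-vocabulary words are skipped."""
    def score(c):
        s = math.log(priors[c])
        for w in doc:
            if w in vocab:
                s += math.log(cond[c][w])
        return s
    return max(priors, key=score)

# toy usage with made-up documents
docs = [["ball", "game", "win"], ["vote", "election"], ["game", "score"]]
labels = ["sports", "politics", "sports"]
priors, cond, vocab = train_nb(docs, labels)
print(classify_nb(["game", "ball"], priors, cond, vocab))   # expected: sports
```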
Classification
- Train:
  For each class cj of documents
  1. Estimate p(cj)
  2. For each word wi, estimate p(wi | cj)
- Classify (doc):
  Assign doc to the most probable class:

  c_j = \arg\max_{c_j \in C} p(c_j, a_1, a_2, \ldots) = \arg\max_{c_j \in C} p(c_j) \prod_{w_i \in doc} p(w_i \mid c_j)

  or

  c_j = \arg\max_{c_j \in C} p(a_1, a_2, \ldots \mid c_j) = \arg\max_{c_j \in C} \prod_{w_i \in doc} p(w_i \mid c_j)

  - Assumption: words are conditionally independent, given the class.

[Figure: naïve Bayes graphical model — class node cj (output variable) connected to word nodes w1, w2, w3, ..., wn (input variables)]
Discussions
- Most assumptions made by naïve Bayesian learning are violated to some degree in practice.
  - Words in a document are clearly not independent of each other.
  - The mixture model assumption of a one-to-one correspondence between classes and mixture components may not be true, because a class may contain documents from multiple topics.
- Despite such violations, naïve Bayesian learning is extremely efficient:
  - Researchers have shown that naïve Bayesian learning produces very accurate models.
  - It scans the training data only once to estimate all the probabilities required for classification.
  - It can be used as an incremental algorithm as well: the model can be updated easily as new data come in, because the probabilities can be conveniently revised.
Discussions
- The main problem is the mixture model assumption. When this assumption is seriously violated, the classification performance can be poor.
Partially Supervised Learning
Partially Supervised Learning
- Partially supervised learning (or classification):
  - Learning with a small set of labeled examples and a large set of unlabeled examples (LU learning, or semi-supervised learning).
    - In this learning setting, there is a small set of labeled examples of every class and a large set of unlabeled examples.
    - The objective is to make use of the unlabeled examples to improve learning.
  - Learning with positive and unlabeled examples (no labeled negative examples) (PU learning).
    - This problem assumes two-class classification.
    - The training data only have a set of labeled positive examples and a set of unlabeled examples, but no labeled negative examples.
    - The objective is to build an accurate classifier without labeling any negative examples.
- These two learning problems do not need full supervision, and thus are able to reduce the labeling effort.
Learning from a small labeled set and a large unlabeled set (LU learning)
Unlabeled Data
- One of the bottlenecks of classification is the labeling of a large set of examples (data records or text documents).
  - Often done manually
  - Time consuming
- Can we label only a small number of examples and make use of a large number of unlabeled examples to learn?
- Possible in many cases.
Why are unlabeled data useful?
- Unlabeled data are usually plentiful; labeled data are expensive.
- Unlabeled data provide information about the joint probability distribution over words and collocations (in texts).
  - For example, using only the labeled data we may find that documents containing the word "homework" tend to belong to a particular class. If we use this fact to classify the unlabeled documents, we may find that "lecture" co-occurs with "homework" frequently in the unlabeled set. Then "lecture" may also be an indicative word for the class. Such correlations provide a helpful source of information to increase classification accuracy, especially when labeled data are scarce.
- We will use text classification to study this problem.
How to use unlabeled data
- One way is to use the EM algorithm.
  - EM: Expectation-Maximization
- The EM algorithm is a popular iterative algorithm for maximum likelihood estimation in problems with missing data.
- The EM algorithm consists of two steps:
  - Expectation step: fill in the missing data based on the current estimate of the parameters.
  - Maximization step: calculate a new maximum a posteriori (or maximum likelihood) estimate of the parameters.
Incorporating unlabeled data with EM
- Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning, 39(2/3), pp. 103-134, 2000.
- Basic EM
- Augmented EM with weighted unlabeled data
- Augmented EM with multiple mixture components per class
EM – Expectation / Maximization
- EM is a class of algorithms used to estimate a probability distribution in the presence of missing values.
- Using it requires an assumption about the underlying probability distribution.
- The algorithm can be very sensitive to this assumption and to the starting point (that is, the initial guess of the parameters).
- It is a hill-climbing algorithm and converges to a local maximum of the likelihood function.
- Note that EM is not really a specific "algorithm", but rather a framework or strategy.
The Likelihood Function
- Given a set of training data D and a parametric family of models with parameters θ, we want the θ that maximizes Pr(θ | D).
- Bayes:
  Pr(θ | D) = Pr(D | θ) · Pr(θ) / Pr(D)
  - Pr(D | θ), viewed as a function of θ, is the likelihood function:
    L(θ; D) = Pr(D | θ)      (sometimes written L(θ, D) or L(D | θ))
- The likelihood function is NOT a probability distribution over θ, and its integral/sum need not be 1.0.
- The M.L. (maximum likelihood) θ is the one that maximizes the likelihood function (without regard for a prior on θ).
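A standard one-line illustration (not from the slides): for n independent coin flips of which h come up heads, the likelihood of the bias θ and its maximizer are

  L(\theta; D) = \Pr(D \mid \theta) = \theta^{h} (1 - \theta)^{\,n-h},
  \qquad \hat{\theta}_{ML} = \arg\max_{\theta} L(\theta; D) = \frac{h}{n}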
Algorithm Outline
1. Train a classifier with only the labeled documents.
2. Use it to probabilistically classify the unlabeled documents.
3. Use ALL the documents to train a new classifier.
4. Iterate steps 2 and 3 until convergence.
The idea
- The documents in the unlabeled set (denoted by U) can be regarded as having missing class labels.
- The parameters that EM estimates in this case are the probability of each word given a class and the class prior probabilities.
- EM here can also be seen as a clustering method with some initial seeds (the labeled data) in each cluster. The class labels of the seeds indicate the class labels of the resulting clusters.
- Two assumptions are made in the derivation of the algorithm, which are in fact the two mixture model assumptions:
  1. the data are generated by a mixture model,
  2. there is a one-to-one correspondence between mixture components and classes.
The EM algorithm with naïve Bayes classifier

Algorithm EM(L, U)
1. Learn an initial naïve Bayesian classifier f from only the labeled set L (Equations (3-31) and (3-32));
2. repeat
     // E-Step
3.   for each example di in U do
4.     Use the current classifier f to compute Pr(cj | di) (Equation (3-33));
5.   end
     // M-Step
6.   Learn a new naïve Bayesian classifier f from L ∪ U by computing Pr(cj) and Pr(wt | cj) ((3-31) and (3-32));
7. until the classifier parameters stabilize;
Return the classifier f from the last iteration.
where

  (3-31)  \Pr(w_t \mid c_j; \hat{\Theta}) = \frac{\lambda + \sum_{i=1}^{|D|} N_{ti} \Pr(c_j \mid d_i)}{\lambda |V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{si} \Pr(c_j \mid d_i)}

  (3-32)  \Pr(c_j \mid \hat{\Theta}) = \frac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|}

  (3-33)  \Pr(c_j \mid d_i; \hat{\Theta}) = \frac{\Pr(c_j \mid \hat{\Theta}) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j; \hat{\Theta})}{\sum_{r=1}^{|C|} \Pr(c_r \mid \hat{\Theta}) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_r; \hat{\Theta})}
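The following is a compact sketch of the EM procedure above with the multinomial naïve Bayes model, written for this transcript (the function names, the fixed iteration count and the toy documents are all made up): the E-step computes soft labels Pr(cj | di) for the unlabeled documents, and the M-step re-estimates the parameters from the resulting fractional counts.

```python
import math
from collections import Counter

def m_step(docs, soft_labels, classes, vocab, lam=1.0):
    """Re-estimate class priors and word probabilities from (fractional) class memberships."""
    priors, cond = {}, {}
    for c in classes:
        weights = [sl[c] for sl in soft_labels]          # Pr(c | d_i) for every document
        priors[c] = sum(weights) / len(docs)
        wc = Counter()
        for d, w in zip(docs, weights):
            for word in d:
                wc[word] += w                            # fractional count N_ti * Pr(c | d_i)
        total = sum(wc.values())
        cond[c] = {v: (lam + wc[v]) / (lam * len(vocab) + total) for v in vocab}
    return priors, cond

def e_step(docs, priors, cond, classes):
    """Compute Pr(c_j | d_i; Theta) for every document (normalized in log space)."""
    out = []
    for d in docs:
        logs = {c: math.log(priors[c]) + sum(math.log(cond[c][w]) for w in d) for c in classes}
        m = max(logs.values())
        exps = {c: math.exp(v - m) for c, v in logs.items()}
        z = sum(exps.values())
        out.append({c: exps[c] / z for c in classes})
    return out

def em_nb(labeled_docs, labels, unlabeled_docs, classes, iters=10):
    vocab = {w for d in labeled_docs + unlabeled_docs for w in d}
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for y in labels]
    # step 1: initial classifier from the labeled documents only
    priors, cond = m_step(labeled_docs, hard, classes, vocab)
    all_docs = labeled_docs + unlabeled_docs
    for _ in range(iters):                               # the slides iterate until the parameters stabilize
        soft_u = e_step(unlabeled_docs, priors, cond, classes)          # E-step on U (labels of L stay fixed)
        priors, cond = m_step(all_docs, hard + soft_u, classes, vocab)  # M-step on L ∪ U
    return priors, cond

# toy usage with made-up documents
L_docs = [["ball", "game"], ["vote", "election"]]
L_labels = ["sports", "politics"]
U_docs = [["game", "score"], ["election", "poll"], ["ball", "score"]]
print(em_nb(L_docs, L_labels, U_docs, ["sports", "politics"])[0])   # learned class priors
```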
Example
- Find the naïve Bayesian classifier f based on the labeled and unlabeled examples below.
- Does the result of classification change if the 3rd record is deleted?

  x1  x2  x3  x4  y
  0   0   1   1   1
  0   1   0   0   0
  0   0   1   0   0
  0   1   1   0   ?
  0   1   0   1   ?
The problem
- It has been shown that the EM algorithm works well if the two mixture model assumptions hold for the particular data set.
- Although naïve Bayesian classification makes three additional assumptions, it performs surprisingly well despite the obvious violation of those assumptions.
- The two mixture model assumptions, however, can cause major problems when they do not hold. In many real-life situations, they may be violated.
- The first assumption above is usually not a problem, while the second assumption is critical.
- It is often the case that a class (or topic) contains a number of sub-classes (or sub-topics).
  - For example, the class Sports may contain documents about different sub-classes of sports: Baseball, Basketball, Tennis, and Softball.
The problem
- Note:
  - If the second condition holds, EM works very well and is particularly useful when the labeled set is very small, e.g., fewer than five labeled documents per class. In such cases, every iteration of EM is able to improve the classifier dramatically.
  - If the second condition does not hold, the unlabeled set hurts learning instead of helping it.
- Some methods to deal with the problem:
  - Weighting the influence of unlabeled examples
  - Finding mixture components
Weighting the Unlabeled Data
- This method weights the influence of the unlabeled examples by a factor µ.
- In LU learning, the labeled set is small but the unlabeled set is very large, so EM's parameter estimation is almost completely determined by the unlabeled set after the first iteration.
  - This means that EM essentially performs unsupervised clustering. When the two mixture model assumptions are true, the natural clusters of the data are in correspondence with the class labels, and the resulting clusters can be used as the classifier.
  - When the assumptions are not true, the clustering may not converge to mixture components corresponding to the given classes.
- Solution: reduce the effect of the problem by weighting down the unlabeled data during parameter estimation (the EM iterations).
Weighting the Unlabeled Data
- The computation of Pr(wt|cj) is changed so that the counts contributed by the unlabeled documents are decreased by a factor of µ, 0 ≤ µ ≤ 1; this gives the new M-step (see the sketch below).
- The value of µ can be chosen based on leave-one-out cross-validation accuracy on the labeled training data: the µ value that gives the best result is used.
- The prior probabilities also need to be weighted.
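A rough sketch of how the down-weighting could enter the fractional counts of the M-step (my illustration under the description above; the exact weighted equation from the source is not reproduced on this slide, and the helper name is made up):

```python
from collections import Counter

def weighted_word_counts(docs, soft_labels, is_unlabeled, c, mu):
    """Pool fractional word counts for class c, multiplying unlabeled documents' counts by mu."""
    wc = Counter()
    for d, sl, u in zip(docs, soft_labels, is_unlabeled):
        weight = sl[c] * (mu if u else 1.0)    # unlabeled documents contribute mu * Pr(c | d_i)
        for w in d:
            wc[w] += weight
    return wc
```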
Finding Mixture Components
- Instead of weighting unlabeled data low, we can attack the problem head on, i.e., by finding the mixture components (sub-classes) of each class.
  - For example, the original class Sports may consist of documents from Baseball, Tennis, and Basketball, which are three mixture components (sub-classes or sub-topics) of Sports. Instead of using the original class, we try to find these components, treat each of them as a class, and replace the class Sports with them.
Finding Mixture Components
- Approaches to identifying different components:
  - Manually identifying different components:
    - One only needs to read the documents in the labeled set (or some sampled unlabeled documents), which is very small.
  - Automatic approaches to identifying mixture components:
    - For example, a hierarchical clustering technique has been proposed to find the mixture components (sub-classes).
    - A simple approach based on leave-one-out cross-validation on the labeled training set.
Experimental Evaluation
- Newsgroup postings
  - 20 newsgroups, 1000 documents per group
- Web page classification
  - student, faculty, course, project
  - 4,199 web pages
- Reuters newswire articles
  - 12,902 articles
  - 10 main topic categories