Distributional Clustering of Words for Text Classification
Authors: L.Douglas Baker Andrew Kachites McCallum
Presenter: Yihong Ding
Text Classification
What is it? Categorize documents into specialized classes; the class label is the target concept.
Why does it matter? Web documents are increasing exponentially, and classification is upstream work for many other important tasks (besides being useful in itself), e.g. document identification for information extraction (project 2).
Distributional Clustering
Benefits: useful semantic word clusters, higher classification accuracy, smaller classification models.
The whole solution: distributional clustering embedded in a Naïve Bayes classifier.
Two Assumptions
One-to-one assumption. Content: mixture model components correspond 1-to-1 with the target classes. Reality: the target classes are independent.
Naïve Bayes assumption. Content: word probabilities are the same throughout a text. Reality: each word event is independent of its context and position.
Naïve Bayes Framework
Training document set D = {d1, d2, …, dn}; target class set C = {c1, c2, …, cm}.
Mixture (parametric) model: each component corresponds to a class and is parameterized by θ; the estimate of θ is denoted θ̂.
Target classifier: the probability of each class given the evidence of the test document, computed by Bayes rule.
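A minimal reconstruction of the Bayes-rule classifier the slide refers to (the slide's equation image did not survive the transcript; the θ̂ notation follows the framework above):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Target classifier via Bayes rule: posterior = prior * likelihood / evidence
\[
  P(c_j \mid d_i;\hat\theta)
    \;=\;
  \frac{P(c_j \mid \hat\theta)\, P(d_i \mid c_j;\hat\theta)}
       {P(d_i \mid \hat\theta)}
\]
\end{document}
```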
Naïve Bayes Framework (cont.)
Probability of each document given the mixture model, in the form of the Bayes optimal classifier
$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$:
summing over the class components $c_j \in C$,
$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j;\theta)$
Probability of a document given class $c_j$: by the 1-to-1 assumption each mixture component stands for one class, and by the Naïve Bayes assumption the document factorizes into independent word events,
$P(d_i \mid c_j;\theta) = \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j;\theta)$
Naïve Bayes Framework (cont.)
The class prior $P(c_j \mid \hat\theta)$ is taken to be uniform, and the denominator $P(d_i \mid \hat\theta)$ is constant over all classes.
Naïve Bayes Framework (cont.)
Transform the equation above: drop the uniform class prior and the denominator (constant over all classes), turn the product over word positions in the document into a product over the vocabulary, take the log and divide by the document length |di|, then compute the argmax.
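The following sketch spells out the transformation just described (a reconstruction in the paper's notation; $V$ is the vocabulary and $N(w_t, d_i)$, the count of word $w_t$ in document $d_i$, is introduced here for illustration):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Transforming the classifier step by step (reconstruction of the
% derivation the slide describes):
\begin{align*}
c^* &= \arg\max_{c_j} P(c_j \mid d_i;\hat\theta)
     = \arg\max_{c_j} P(c_j \mid \hat\theta)\, P(d_i \mid c_j;\hat\theta) \\
    &= \arg\max_{c_j} \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j;\hat\theta)
       && \text{uniform prior and constant denominator dropped} \\
    &= \arg\max_{c_j} \prod_{t=1}^{|V|} P(w_t \mid c_j;\hat\theta)^{N(w_t,d_i)}
       && \text{product over the vocabulary} \\
    &= \arg\max_{c_j} \sum_{t=1}^{|V|}
       \frac{N(w_t,d_i)}{|d_i|}\,\log P(w_t \mid c_j;\hat\theta)
       && \text{take the log, divide by } |d_i|
\end{align*}
\end{document}
```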
Naïve Bayes Framework (cont.)
The argmax of this score is the argmin of the KL divergence between the distribution of words in the document and the distribution of words in the class (see the sketch below). Could we use the distribution of word clusters instead?!
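A short sketch of why the two are equivalent (a reconstruction; $P(w_t \mid d_i) = N(w_t,d_i)/|d_i|$ denotes the empirical word distribution of the document, a symbol introduced here for illustration):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% The argmax of the length-normalized log score equals the argmin of the
% KL divergence, because the document's own entropy term does not depend
% on the class:
\begin{align*}
\arg\max_{c_j} \sum_{t=1}^{|V|} P(w_t \mid d_i)\,\log P(w_t \mid c_j;\hat\theta)
  &= \arg\min_{c_j} \sum_{t=1}^{|V|} P(w_t \mid d_i)\,
       \log \frac{P(w_t \mid d_i)}{P(w_t \mid c_j;\hat\theta)} \\
  &= \arg\min_{c_j}\,
     D\!\left(P(W \mid d_i)\,\middle\|\,P(W \mid c_j;\hat\theta)\right)
\end{align*}
\end{document}
```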
Distributional Clustering Intuition
P(C|wt) expresses the distribution of class probabilities for word wt over all the classes.
Cluster words so as to preserve this distribution.
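A minimal sketch of how P(C|wt) might be estimated from a labeled corpus (hypothetical toy data and function names, not the authors' code):

```python
from collections import Counter, defaultdict

def class_distributions(labeled_docs, classes):
    """Estimate P(C|w_t): for each word, the distribution over classes
    of the labeled occurrences it appears in."""
    counts = defaultdict(Counter)          # word -> Counter over classes
    for words, label in labeled_docs:
        for w in words:
            counts[w][label] += 1
    # normalize each word's class counts into a distribution
    dists = {}
    for w, by_class in counts.items():
        total = sum(by_class.values())
        dists[w] = {c: by_class[c] / total for c in classes}
    return dists

# Toy usage: "puck" skews toward 'hockey', "the" is uninformative.
docs = [(["the", "puck", "net"], "hockey"),
        (["the", "ball", "net"], "soccer"),
        (["puck", "ice"], "hockey")]
print(class_distributions(docs, ["hockey", "soccer"])["puck"])
# {'hockey': 1.0, 'soccer': 0.0}
```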
Kullback-Leibler Divergence
Measures the similarity between distributions.
Traditional KL divergence: $D(P \,\|\, Q) = \sum_t P(x_t)\,\log\frac{P(x_t)}{Q(x_t)}$
Shortcomings: not symmetric; may be infinite (when Q assigns zero probability to an outcome that P supports).
KL divergence to the mean, for two words wt and ws:
$\frac{P(w_t)}{P(w_t)+P(w_s)}\, D\!\left(P(C \mid w_t)\,\|\,P(C \mid w_t \vee w_s)\right) + \frac{P(w_s)}{P(w_t)+P(w_s)}\, D\!\left(P(C \mid w_s)\,\|\,P(C \mid w_t \vee w_s)\right)$
where $P(C \mid w_t \vee w_s)$ is the correspondingly weighted mean of the two class distributions.
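A small sketch of both measures over plain Python dictionaries mapping classes to probabilities (function names are illustrative; the 0.5/0.5 default weights stand in for the words' relative occurrence probabilities):

```python
import math

def kl(p, q):
    """Traditional KL divergence D(p || q): not symmetric, and infinite
    when q assigns zero probability to an outcome p supports."""
    total = 0.0
    for c, pc in p.items():
        if pc > 0:
            if q.get(c, 0.0) == 0.0:
                return float("inf")
            total += pc * math.log(pc / q[c])
    return total

def kl_to_the_mean(p, q, w_p=0.5, w_q=0.5):
    """KL divergence to the (weighted) mean: w_p*D(p||m) + w_q*D(q||m),
    with m = w_p*p + w_q*q.  In the paper's setting p = P(C|w_t),
    q = P(C|w_s), and the weights are the words' relative frequencies."""
    classes = set(p) | set(q)
    m = {c: w_p * p.get(c, 0.0) + w_q * q.get(c, 0.0) for c in classes}
    return w_p * kl(p, m) + w_q * kl(q, m)

# Toy usage with two classes:
p = {"hockey": 0.9, "soccer": 0.1}
q = {"hockey": 0.2, "soccer": 0.8}
print(kl(p, q), kl_to_the_mean(p, q))
```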
Clustering Algorithm
1. Sort the vocabulary by mutual information with the class variable.
2. Initialize M clusters as singletons with the top M words.
3. Loop until all words have been put into one of the M clusters:
   • Merge the two clusters which are most similar, resulting in M - 1 clusters.
   • Create a new cluster consisting of the next word from the sorted list, restoring the number of clusters to M.
The results are used to compute P(cj|di;θ) for each class and to assign the document to the most probable class.
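A compact sketch of the greedy clustering loop above. It reuses the kl_to_the_mean helper from the previous snippet; the merged-cluster distribution uses equal weights here, whereas the paper weights by word occurrence counts:

```python
def cluster_words(sorted_words, word_dists, M):
    """Greedy agglomerative word clustering (sketch of the slide's loop).
    sorted_words: vocabulary sorted by mutual information with the class.
    word_dists:   word -> P(C|word) as a dict over classes.
    Returns M clusters, each carrying a merged class distribution.
    Assumes kl_to_the_mean (previous snippet) is in scope."""
    # Steps 1-2: top M words become singleton clusters.
    clusters = [([w], dict(word_dists[w])) for w in sorted_words[:M]]
    for w in sorted_words[M:]:
        # Step 3a: merge the two most similar clusters (KL to the mean).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = kl_to_the_mean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        words_i, dist_i = clusters[i]
        words_j, dist_j = clusters[j]
        merged_dist = {c: 0.5 * dist_i.get(c, 0) + 0.5 * dist_j.get(c, 0)
                       for c in set(dist_i) | set(dist_j)}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((words_i + words_j, merged_dist))
        # Step 3b: next word becomes a new singleton, restoring M clusters.
        clusters.append(([w], dict(word_dists[w])))
    return clusters
```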
Experimental Results: 20 Newsgroups
20,000 articles evenly divided among 20 newsgroups; vocabulary: 62,258 words; 50 features.
Accuracy: Distributional Clustering 82.1%, LSI 60%, Mutual Information 46.3%, Class-based Clustering 14.5%, Markov blanket feature selector ~60%.
DC is better than feature selection: infrequent features may still be important when they do occur, and merging preserves their information.
Experimental Results: Reuters-21578 & Yahoo! data sets
Reuters-21578 data set: 90 of the 135 topic categories; vocabulary: 16,177 words. DC outperforms the others when the feature set size is small.
Yahoo! data set: 6,294 web pages in 41 classes; vocabulary: 44,383 words. Naïve Bayes with 500 words achieves the highest accuracy, 66.4%; the training data are too noisy.
Conclusion
DC aggressively reduces the number of features while maintaining high classification accuracy.
At small feature set sizes, DC outperforms: supervised Latent Semantic Indexing, class-based clustering, feature selection by mutual information, and feature selection by a Markov-blanket method.
DC may not overcome the sparse data problem: it is strongly biased toward preserving the poor initial estimates of P(C|wi).
[Diagram: Mixture Model. Classes c1, c2, c3, …, cn and documents d1, d2, d3, …, dm; a function F1(d1, d2, d3, …, dm) → (c1, c2, c3, …, cn) maps the documents to the classes.]
[Diagram: Mixture Model. The same classes and documents, with the mapping between them left as a question.]
[Diagram: Mixture Model. Mixture components generate the documents via F2 over the components and their combinations → (d1, d2, d3, …, dm); the 1-to-1 assumption identifies the classes C with the mixture components.]