Distributional Clustering of Words for Text Classification
Authors: L.Douglas Baker Andrew Kachites McCallum
Presenter: Yihong Ding
Text Classification
What is it? Categorize documents into specialized classes; the class label is the target concept.
Why does it matter? Web documents are increasing exponentially, and classification is upstream work for many other important tasks (besides being useful in itself), e.g. document identification for information extraction (project 2).
Distributional Clustering
Benefits: useful semantic word clusters, higher classification accuracy, smaller classification models.
The whole solution: distributional clustering embedded in a Naïve Bayes classifier.
Two Assumptions
One-to-one assumption. Content: mixture model components correspond 1-to-1 with the target classes. Reality: the target classes are independent.
Naïve Bayes assumption. Content: word probabilities are the same throughout a text. Reality: each word event is independent of its context and position.
Naïve Bayes Framework
Training document set D = {d1, d2, …, dn}; target class set C = {c1, c2, …, cm}.
Mixture (parametric) model: each component corresponds to a class and is parameterized by θ; the estimate of θ is denoted θ̂.
Target classifier: the probability of each class given the evidence of the test document, computed by Bayes rule.
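A minimal reconstruction of the Bayes-rule classifier the slide refers to (the slide's equation image did not survive the transcript; the θ̂ notation follows the framework above):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Target classifier via Bayes rule: posterior = prior * likelihood / evidence
\[
  P(c_j \mid d_i;\hat\theta)
    \;=\;
  \frac{P(c_j \mid \hat\theta)\, P(d_i \mid c_j;\hat\theta)}
       {P(d_i \mid \hat\theta)}
\]
\end{document}
```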
Naïve Bayes Framework (cont.)
Probability of each document given the mixture model, in the form of the Bayes optimal classifier
$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$:
summing over the class components $c_j \in C$,
$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j;\theta)$
Probability of a document given class $c_j$: by the 1-to-1 assumption each mixture component stands for one class, and by the Naïve Bayes assumption the document factorizes into independent word events,
$P(d_i \mid c_j;\theta) = \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j;\theta)$
Naïve Bayes Framework (cont.)
The class prior $P(c_j \mid \hat\theta)$ is taken to be uniform, and the denominator $P(d_i \mid \hat\theta)$ is constant over all classes.
Naïve Bayes Framework (cont.)
Transform the equation above: drop the uniform class prior and the denominator (constant over all classes), turn the product over word positions in the document into a product over the vocabulary, take the log and divide by the document length |di|, then compute the argmax.
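The following sketch spells out the transformation just described (a reconstruction in the paper's notation; $V$ is the vocabulary and $N(w_t, d_i)$, the count of word $w_t$ in document $d_i$, is introduced here for illustration):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Transforming the classifier step by step (reconstruction of the
% derivation the slide describes):
\begin{align*}
c^* &= \arg\max_{c_j} P(c_j \mid d_i;\hat\theta)
     = \arg\max_{c_j} P(c_j \mid \hat\theta)\, P(d_i \mid c_j;\hat\theta) \\
    &= \arg\max_{c_j} \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j;\hat\theta)
       && \text{uniform prior and constant denominator dropped} \\
    &= \arg\max_{c_j} \prod_{t=1}^{|V|} P(w_t \mid c_j;\hat\theta)^{N(w_t,d_i)}
       && \text{product over the vocabulary} \\
    &= \arg\max_{c_j} \sum_{t=1}^{|V|}
       \frac{N(w_t,d_i)}{|d_i|}\,\log P(w_t \mid c_j;\hat\theta)
       && \text{take the log, divide by } |d_i|
\end{align*}
\end{document}
```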
Naïve Bayes Framework (cont.)
The argmax of this score is the argmin of the KL divergence between the distribution of words in the document and the distribution of words in the class (see the sketch below). Could we use the distribution of word clusters instead?!
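A short sketch of why the two are equivalent (a reconstruction; $P(w_t \mid d_i) = N(w_t,d_i)/|d_i|$ denotes the empirical word distribution of the document, a symbol introduced here for illustration):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% The argmax of the length-normalized log score equals the argmin of the
% KL divergence, because the document's own entropy term does not depend
% on the class:
\begin{align*}
\arg\max_{c_j} \sum_{t=1}^{|V|} P(w_t \mid d_i)\,\log P(w_t \mid c_j;\hat\theta)
  &= \arg\min_{c_j} \sum_{t=1}^{|V|} P(w_t \mid d_i)\,
       \log \frac{P(w_t \mid d_i)}{P(w_t \mid c_j;\hat\theta)} \\
  &= \arg\min_{c_j}\,
     D\!\left(P(W \mid d_i)\,\middle\|\,P(W \mid c_j;\hat\theta)\right)
\end{align*}
\end{document}
```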
Distributional Clustering Intuition
P(C|wt) expresses the distribution of class probabilities for word wt over all the classes.
Cluster words so as to preserve this distribution.
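A minimal sketch of how P(C|wt) might be estimated from a labeled corpus (hypothetical toy data and function names, not the authors' code):

```python
from collections import Counter, defaultdict

def class_distributions(labeled_docs, classes):
    """Estimate P(C|w_t): for each word, the distribution over classes
    of the labeled occurrences it appears in."""
    counts = defaultdict(Counter)          # word -> Counter over classes
    for words, label in labeled_docs:
        for w in words:
            counts[w][label] += 1
    # normalize each word's class counts into a distribution
    dists = {}
    for w, by_class in counts.items():
        total = sum(by_class.values())
        dists[w] = {c: by_class[c] / total for c in classes}
    return dists

# Toy usage: "puck" skews toward 'hockey', "the" is uninformative.
docs = [(["the", "puck", "net"], "hockey"),
        (["the", "ball", "net"], "soccer"),
        (["puck", "ice"], "hockey")]
print(class_distributions(docs, ["hockey", "soccer"])["puck"])
# {'hockey': 1.0, 'soccer': 0.0}
```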
Kullback-Leibler Divergence
Measures the similarity between distributions.
Traditional KL divergence: $D(P \,\|\, Q) = \sum_t P(x_t)\,\log\frac{P(x_t)}{Q(x_t)}$
Shortcomings: not symmetric; may be infinite (when Q assigns zero probability to an outcome that P supports).
KL divergence to the mean, for two words wt and ws:
$\frac{P(w_t)}{P(w_t)+P(w_s)}\, D\!\left(P(C \mid w_t)\,\|\,P(C \mid w_t \vee w_s)\right) + \frac{P(w_s)}{P(w_t)+P(w_s)}\, D\!\left(P(C \mid w_s)\,\|\,P(C \mid w_t \vee w_s)\right)$
where $P(C \mid w_t \vee w_s)$ is the correspondingly weighted mean of the two class distributions.
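A small sketch of both measures over plain Python dictionaries mapping classes to probabilities (function names are illustrative; the 0.5/0.5 default weights stand in for the words' relative occurrence probabilities):

```python
import math

def kl(p, q):
    """Traditional KL divergence D(p || q): not symmetric, and infinite
    when q assigns zero probability to an outcome p supports."""
    total = 0.0
    for c, pc in p.items():
        if pc > 0:
            if q.get(c, 0.0) == 0.0:
                return float("inf")
            total += pc * math.log(pc / q[c])
    return total

def kl_to_the_mean(p, q, w_p=0.5, w_q=0.5):
    """KL divergence to the (weighted) mean: w_p*D(p||m) + w_q*D(q||m),
    with m = w_p*p + w_q*q.  In the paper's setting p = P(C|w_t),
    q = P(C|w_s), and the weights are the words' relative frequencies."""
    classes = set(p) | set(q)
    m = {c: w_p * p.get(c, 0.0) + w_q * q.get(c, 0.0) for c in classes}
    return w_p * kl(p, m) + w_q * kl(q, m)

# Toy usage with two classes:
p = {"hockey": 0.9, "soccer": 0.1}
q = {"hockey": 0.2, "soccer": 0.8}
print(kl(p, q), kl_to_the_mean(p, q))
```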
Clustering Algorithm
1. Sort the vocabulary by mutual information with the class variable.
2. Initialize M clusters as singletons with the top M words.
3. Loop until all words have been put into one of the M clusters:
   • Merge the two clusters which are most similar, resulting in M - 1 clusters.
   • Create a new cluster consisting of the next word from the sorted list, restoring the number of clusters to M.
The results are used to compute P(cj|di;θ) for each class and to assign the document to the most probable class.
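A compact sketch of the greedy clustering loop above. It reuses the kl_to_the_mean helper from the previous snippet; the merged-cluster distribution uses equal weights here, whereas the paper weights by word occurrence counts:

```python
def cluster_words(sorted_words, word_dists, M):
    """Greedy agglomerative word clustering (sketch of the slide's loop).
    sorted_words: vocabulary sorted by mutual information with the class.
    word_dists:   word -> P(C|word) as a dict over classes.
    Returns M clusters, each carrying a merged class distribution.
    Assumes kl_to_the_mean (previous snippet) is in scope."""
    # Steps 1-2: top M words become singleton clusters.
    clusters = [([w], dict(word_dists[w])) for w in sorted_words[:M]]
    for w in sorted_words[M:]:
        # Step 3a: merge the two most similar clusters (KL to the mean).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = kl_to_the_mean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        words_i, dist_i = clusters[i]
        words_j, dist_j = clusters[j]
        merged_dist = {c: 0.5 * dist_i.get(c, 0) + 0.5 * dist_j.get(c, 0)
                       for c in set(dist_i) | set(dist_j)}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((words_i + words_j, merged_dist))
        # Step 3b: next word becomes a new singleton, restoring M clusters.
        clusters.append(([w], dict(word_dists[w])))
    return clusters
```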
Experimental Results: 20 Newsgroups
20,000 articles evenly divided among 20 newsgroups; vocabulary: 62,258 words; 50 features.
Accuracy: Distributional Clustering 82.1%, LSI 60%, Mutual Information 46.3%, Class-based Clustering 14.5%, Markov blanket feature selector ~60%.
DC is better than feature selection: infrequent features may still be important when they do occur, and merging preserves their information.
Experimental Results: Reuters-21578 & Yahoo! data sets
Reuters-21578 data set: 90 of the 135 topic categories; vocabulary: 16,177 words. DC outperforms the others when the feature set size is small.
Yahoo! data set: 6,294 web pages in 41 classes; vocabulary: 44,383 words. Naïve Bayes with 500 words achieves the highest accuracy, 66.4%; the training data are too noisy.
Conclusion
DC aggressively reduces the number of features while maintaining high classification accuracy.
At small feature set sizes, DC outperforms: supervised Latent Semantic Indexing, class-based clustering, feature selection by mutual information, and feature selection by a Markov-blanket method.
DC may not overcome the sparse data problem: it is strongly biased toward preserving the poor initial estimates of P(C|wi).
[Diagram: Mixture Model. Classes c1, c2, c3, …, cn and documents d1, d2, d3, …, dm; a function F1(d1, d2, d3, …, dm) → (c1, c2, c3, …, cn) maps the documents to the classes.]
[Diagram: Mixture Model. The same classes and documents, with the mapping between them left as a question.]
[Diagram: Mixture Model. Mixture components generate the documents via F2 over the components and their combinations → (d1, d2, d3, …, dm); the 1-to-1 assumption identifies the classes C with the mixture components.]