Distributional Clustering of Words for Text Classification
Authors: L. Douglas Baker, Andrew Kachites McCallum
Presenter: Yihong Ding


Page 1

Distributional Clustering of Words for Text Classification

Authors: L. Douglas Baker, Andrew Kachites McCallum

Presenter: Yihong Ding

Page 2

Text Classification

What is it?
- categorize documents into specialized classes
- class label == target concept

Why does it matter?
- exponentially increasing number of web documents
- upstream work for many other important topics (besides itself), e.g. document identification for information extraction (project 2) …

Page 3

Distributional Clustering

Benefits
- useful semantic word clusters
- higher classification accuracy
- smaller classification models

Distributional clustering embedded in a Naïve Bayes classifier – the whole solution

Page 4

Two Assumptions

One-to-one assumption
- content: mixture model components vs. target classes, 1-to-1
- reality: independent target classes

Naïve Bayes assumption
- content: a word's probability is the same anywhere in one text
- reality: each word event is independent of its context and position

Page 5

Naïve Bayes Framework

- Training document set D = {d1, d2, …, dn}
- Target class set C = {c1, c2, …, cm}
- Mixture (parametric) model: each component corresponds to one class and is parameterized by θ; the estimate of θ obtained from D is what the classifier actually uses
- Target classifier: the probability of each class given the evidence of the test document, computed by Bayes rule
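
Written out in the notation of the slides, that Bayes rule step is the standard expansion

    P(cj|di;θ) = P(cj|θ) P(di|cj;θ) / P(di|θ)

with P(di|θ) and P(di|cj;θ) spelled out on the next slide.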

Page 6

Naïve Bayes Framework (cont.)

Probability of each document given the mixture model (cf. the Bayes optimal classifier, P(vj|D) = Σ_{hi∈H} P(hi|D) P(vj|hi)):

    P(di|θ) = Σj P(cj|θ) P(di|cj;θ)

Probability of a document given class cj (this is where the 1-to-1 and Naïve Bayes assumptions enter):

    P(di|cj;θ) = Π_{k=1..|di|} P(w_{di,k}|cj;θ)

Page 7

Naïve Bayes Framework (cont.)

Page 8

Naïve Bayes Framework (cont.)

Transform the equation above:
- assume a uniform class prior
- drop the denominator (constant over all classes)
- rewrite the product over document positions as a product over the vocabulary
- take a log and divide by the document length |di|
- compute the argmax
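
As a concrete illustration of the resulting decision rule, here is a minimal Python sketch (not the authors' code; the variable names and the Laplace-smoothed estimate of P(w|c) are assumptions made for the example):

    import math
    from collections import Counter

    def train(docs, labels, vocab):
        # Estimate P(w|c) with Laplace smoothing; each document is a list of word tokens.
        counts = {c: Counter() for c in set(labels)}
        for doc, c in zip(docs, labels):
            counts[c].update(w for w in doc if w in vocab)
        p_w_given_c = {}
        for c, cnt in counts.items():
            total = sum(cnt.values())
            p_w_given_c[c] = {w: (1 + cnt[w]) / (len(vocab) + total) for w in vocab}
        return p_w_given_c

    def classify(doc, p_w_given_c, vocab):
        # argmax over classes of (1/|d|) * sum_t N(w_t, d) * log P(w_t|c),
        # with the uniform prior and the constant denominator dropped as on the slide.
        n = Counter(w for w in doc if w in vocab)
        length = sum(n.values()) or 1
        scores = {c: sum(n[w] * math.log(p[w]) for w in n) / length
                  for c, p in p_w_given_c.items()}
        return max(scores, key=scores.get)

Dividing by the document length |di| does not change the argmax, but it normalizes the score per word, which is what allows the divergence reading on the next slide.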

Page 9

Naïve Bayes Framework (cont.)

The argmax of this score equals an argmin of the divergence between the distribution of words in the document and the distribution of words in the class.

Why not use a distribution of clusters instead?!
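
Spelling that out, write P(wt|di) = N(wt, di)/|di| for the empirical word distribution of the document. Then

    argmax_cj (1/|di|) Σt N(wt, di) log P(wt|cj;θ)
      = argmax_cj Σt P(wt|di) log P(wt|cj;θ)
      = argmax_cj [ Σt P(wt|di) log P(wt|di) - D( P(W|di) || P(W|cj;θ) ) ]
      = argmin_cj D( P(W|di) || P(W|cj;θ) )

because the first term in the bracket does not depend on cj. Classification therefore picks the class whose word distribution is closest, in KL divergence, to the distribution of words in the document; the question the slide raises is whether a distribution over word clusters could play that role instead.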

Page 10

Distributional Clustering Intuition

- P(C|wt) expresses the distribution over all the classes induced by word wt
- Cluster words so as to preserve this distribution
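
As a small sketch of how P(C|wt) can be estimated from labeled training documents (the count-based estimate and the names are illustrative assumptions, not from the slides):

    from collections import Counter, defaultdict

    def class_given_word(docs, labels):
        # Estimate P(c|w) for every word from (document, label) pairs;
        # each document is a list of word tokens. Returns {word: {class: prob}}.
        counts = defaultdict(Counter)      # counts[w][c] = occurrences of w in class c
        for doc, c in zip(docs, labels):
            for w in doc:
                counts[w][c] += 1
        p_c_given_w = {}
        for w, cnt in counts.items():
            total = sum(cnt.values())
            p_c_given_w[w] = {c: cnt[c] / total for c in cnt}
        return p_c_given_w

Two words whose class distributions are nearly identical carry almost the same evidence about the class, which is exactly why merging them loses little information.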

Page 11

Kullback-Leibler Divergence

- A measure of the similarity between two distributions
- Traditional KL divergence between the class distributions of two words: D( P(C|wt) || P(C|ws) )
- Shortcomings: not symmetric; may give an infinite result
- KL divergence to the mean: compare each word's class distribution with their prior-weighted mean distribution instead
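
A minimal sketch of the two quantities (the weighted-mean form follows the paper's "KL divergence to the mean"; the function names and the handling of zero probabilities are my choices):

    import math

    def kl(p, q):
        # Traditional KL divergence D(p || q) between two class distributions
        # given as dicts {class: probability}. Not symmetric; returns infinity
        # as soon as some q[c] is zero while p[c] > 0.
        total = 0.0
        for c, pc in p.items():
            if pc > 0:
                qc = q.get(c, 0.0)
                if qc == 0.0:
                    return float("inf")
                total += pc * math.log(pc / qc)
        return total

    def kl_to_the_mean(p_t, p_s, prior_t, prior_s):
        # KL divergence to the mean for P(C|wt) and P(C|ws), weighted by the
        # word priors P(wt) and P(ws); the mean is their prior-weighted average.
        z = prior_t + prior_s
        mean = {c: (prior_t * p_t.get(c, 0.0) + prior_s * p_s.get(c, 0.0)) / z
                for c in set(p_t) | set(p_s)}
        return (prior_t / z) * kl(p_t, mean) + (prior_s / z) * kl(p_s, mean)

Because both distributions are compared to their common mean, the result is symmetric in the two words and stays finite (assuming both word priors are positive), which addresses the two shortcomings listed above.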

Page 12

Clustering Algorithm

1. Sort the vocabulary by mutual information with the class variable
2. Initialize M clusters as singletons with the top M words
3. Loop until all words have been put into one of the M clusters:
   • Merge the two clusters which are most similar, resulting in M - 1 clusters
   • Create a new cluster consisting of the next word from the sorted list, restoring the number of clusters to M

Results are used to compute P(cj|di;θ) for each class and to assign the document to the most probable class
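
A minimal sketch of this loop, reusing kl_to_the_mean from the previous slide as the similarity measure (the pre-sorted vocabulary, the cluster representation and the helper names are assumptions for illustration):

    def cluster_words(sorted_words, p_c_given_w, p_w, M):
        # sorted_words: vocabulary sorted by mutual information with the class variable.
        # p_c_given_w[w]: class distribution P(C|w); p_w[w]: word prior P(w).
        # Each cluster is (member words, merged class distribution, total prior).
        clusters = [([w], dict(p_c_given_w[w]), p_w[w]) for w in sorted_words[:M]]
        remaining = list(sorted_words[M:])

        def merge(a, b):
            words_a, dist_a, prior_a = a
            words_b, dist_b, prior_b = b
            z = prior_a + prior_b
            dist = {c: (prior_a * dist_a.get(c, 0.0) + prior_b * dist_b.get(c, 0.0)) / z
                    for c in set(dist_a) | set(dist_b)}
            return (words_a + words_b, dist, z)

        while remaining:
            # Merge the two most similar clusters (smallest KL divergence to the mean) ...
            i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                       key=lambda ij: kl_to_the_mean(clusters[ij[0]][1], clusters[ij[1]][1],
                                                     clusters[ij[0]][2], clusters[ij[1]][2]))
            merged = merge(clusters[i], clusters[j])
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
            # ... then restore M clusters with the next word from the sorted list.
            w = remaining.pop(0)
            clusters.append(([w], dict(p_c_given_w[w]), p_w[w]))
        return [words for words, _, _ in clusters]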

Page 13

Experimental Results: 20 Newsgroups

- 20,000 articles evenly divided among 20 newsgroups
- vocabulary: 62,258 words; 50 features used
- Accuracy with 50 features:
  - Distributional Clustering: 82.1%
  - LSI: 60%
  - Mutual Information: 46.3%
  - Class-based Clustering: 14.5%
  - Markov blanket feature selector: ~60%
- Why DC beats feature selection: an infrequent feature may be important when it does occur, and merging (rather than discarding) preserves its information

Page 14

Experimental Results: Reuters-21578 & Yahoo! data sets

Reuters-21578 data set
- 90/135 topic categories; vocabulary: 16,177 words
- DC outperforms the other methods at small feature set sizes

Yahoo! data set
- 6,294 web pages in 41 classes; vocabulary: 44,383 words
- Naïve Bayes with 500 words achieves the highest accuracy, 66.4%
- the training data are too noisy

Page 15

Conclusion

- DC aggressively reduces the number of features while maintaining high classification accuracy
- At small feature set sizes, DC outperforms:
  - supervised Latent Semantic Indexing
  - class-based clustering
  - feature selection by mutual information
  - feature selection by a Markov-blanket method
- DC may not overcome the sparse-data problem: it is strongly biased toward preserving the (possibly bad) initial estimates of P(C|wi)

Page 16

Mixture Model

[Diagram: classes c1, c2, c3, …, cn and documents d1, d2, d3, …, dm, with a mapping F1(d1, d2, d3, …, dm) → (c1, c2, c3, …, cn)]

Page 17

Mixture Model

[Diagram: the same classes c1, …, cn and documents d1, …, dm, but with the mapping between them unknown (?)]

Page 18

Mixture Model

[Diagram: mixture components 1, 2, 3, …, n and documents d1, d2, d3, …, dm, with a mapping F2 from the components and their combinations (1, 2, 3, …, m, 12, 13, …, 23, …, 12…m) to (d1, d2, d3, …, dm)]

1-to-1 between C and the mixture components