
Page 1:

CS 277: Data Mining

Text Classification

Padhraic SmythDepartment of Computer Science

University of California, Irvine

Some slides adapted from Introduction to Information Retrieval, by Manning, Raghavan, and Schutze, Cambridge University Press, 2009, with additional slides from Ray Mooney.

Page 2:

Data Mining Lectures Notes on Text Classification © Padhraic Smyth, UC Irvine

Progress Report

• New deadline
  – In class, Thursday February 18th (not Tuesday)

• Outline
  – 3 pages maximum
  – Suggested content:
    • Brief restatement of the problem you are addressing (no need to repeat everything in your original proposal), e.g., half a page
    • Summary of progress so far:
      – Background papers read
      – Data preprocessing, exploratory data analysis
      – Algorithms/software reviewed, implemented, tested
      – Initial results (if any)
    • Challenges and difficulties encountered
    • Brief outline of plans between now and the end of the quarter
  – Use diagrams, figures, and tables where possible
  – Write clearly: check what you write

Page 3:

Currently in the News….

• Communications of the ACM, Feb 2010
  – Fun short article, "Where the Data Is": http://cacm.acm.org/magazines/2010/2/69384-where-the-data-is/fulltext
  – Technical article on fast dimension reduction:
    • Introduction: http://cacm.acm.org/magazines/2010/2/69382-strange-effects-in-high-dimension/fulltext
    • Article: http://cacm.acm.org/magazines/2010/2/69359-faster-dimension-reduction/fulltext

• New York Times, Feb 8, 2010
  – Interesting study of which articles are emailed most often: http://www.nytimes.com/2010/02/09/science/09tier.html?ref=technology

Page 4:

Conferences related to Text Mining

• ACM SIGIR Conference

• WWW Conference

• ACM SIGKDD Conference

Worth looking at papers in recent years for many different ideas related to text mining

Page 5:

Further Reading on Text Classification

• General references on text and language modeling
  – Introduction to Information Retrieval, C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press, 2008
  – Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze, MIT Press, 1999
  – Speech and Language Processing: An Introduction to Natural Language Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000
  – The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Feldman and Sanger, Cambridge University Press, 2007

• Web-related text mining in general
  – S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003
  – See Chapter 5 for a discussion of text classification

• SVMs for text classification
  – T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002

Page 6:

Text Classification

• Text classification has many applications
  – Spam email detection
  – Automated tagging of streams of news articles, e.g., Google News
  – Online advertising: what is this Web page about?

• Data representation
  – "Bag of words" most commonly used: either counts or binary
  – Can also use "phrases" (e.g., bigrams) for commonly occurring combinations of words

• Classification methods
  – Naïve Bayes widely used (e.g., for spam email)
    • Fast and reasonably accurate
  – Support vector machines (SVMs)
    • Typically the most accurate method in research studies
    • But more computationally complex
  – Logistic regression (regularized)
    • Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)

Page 7:

Types of Labels/Categories/Classes

• Assigning labels to documents or web pages
  – Labels are most often topics, such as Yahoo categories: "finance," "sports," "news>world>asia>business"

• Labels may be genres: "editorials", "movie-reviews", "news"

• Labels may be an opinion on a person/product: "like", "hate", "neutral"

• Labels may be domain-specific:
  – "interesting-to-me" vs. "not-interesting-to-me"
  – "contains adult language" vs. "doesn't"
  – language identification: English, French, Chinese, …
  – search vertical: about Linux vs. not
  – "link spam" vs. "not link spam"

Ch. 13

Page 8:

Common Data Sets used for Evaluation

• Reuters
  – 10,700 labeled documents
  – 10% of documents have multiple class labels

• Yahoo! Science hierarchy
  – 95 disjoint classes with 13,598 pages

• 20 Newsgroups data
  – 18,800 labeled USENET postings
  – 20 leaf classes, 5 root-level classes

• WebKB
  – 8,300 documents in 7 categories such as "faculty", "course", "student"

• Industry
  – 6,449 home pages of companies partitioned into 71 classes

Page 9:

Practical Issues

• Tokenization
  – Convert document to word counts = "bag of words"
  – word token = "any nonempty sequence of characters"
  – for HTML (etc.) need to remove formatting

• Canonical forms, stopwords, stemming
  – Remove capitalization
  – Stopwords: remove very frequent words (a, the, and, …); can use a standard list
  – Can also remove very rare words, e.g., words that occur in only k or fewer documents (e.g., k = 5)
  – Stemming (next slide)

• Data representation
  – e.g., sparse 3-column format for bag of words: <docid, termid, count>
  – can use inverted indices, etc.

Page 10:

Toy example of a document-term matrix

       database  SQL  index  regression  likelihood  linear
d1        24      21     9       0           0          3
d2        32      10     5       0           3          0
d3        12      16     5       0           0          0
d4         6       7     2       0           0          0
d5        43      31    20       0           3          0
d6         2       0     0      18           7         16
d7         0       0     1      32          12          0
d8         3       0     0      22           4          2
d9         1       0     0      34          27         25
d10        6       0     0      17           4         23

Page 11:

Stemming

• Want to reduce all morphological variants of a word to a single index term
  – e.g., a document containing the words fish and fisher may not be retrieved by a query containing fishing (if fishing is not explicitly contained in the document)

• Stemming: reduce words to their root form
  – e.g., fish becomes the new index term for fish, fisher, fishing, …

• Porter stemming algorithm (1980)
  – relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    • BINARIZATION => BINARIZE
  – Not always desirable: e.g., {university, universal} -> univers (in Porter's)

• WordNet: dictionary-based approach

• May be more useful for information retrieval/search than for classification
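The IZATION rule above can be sketched as follows. This is a toy illustration of a single Porter-style rule, not the actual Porter stemmer (which applies many rules in several ordered passes):

```python
def strip_ization(word):
    """Toy version of one Porter-style rule: if the word ends in
    'ization' and the remaining prefix contains a vowel followed by a
    consonant, rewrite the suffix as 'ize'."""
    suffix = "ization"
    if not word.lower().endswith(suffix):
        return word
    stem = word.lower()[:-len(suffix)]
    vowels = "aeiou"
    # Require at least one vowel followed by a consonant in the prefix
    has_vc = any(stem[i] in vowels and stem[i + 1] not in vowels
                 for i in range(len(stem) - 1))
    return stem + "ize" if has_vc else word
```

For example, `strip_ization("BINARIZATION")` yields `"binarize"`, while words without the suffix pass through unchanged.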

Page 12:

Confusion Matrix and Accuracy

                    True class 1   True class 2   True class 3
Predicted class 1        80             60             10
Predicted class 2        10             90              0
Predicted class 3        10             30            110

Accuracy = fraction of examples classified correctly
         = (80 + 90 + 110) / 400 = 280/400 = 70%
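The computation above (and the per-class precision and recall used on the next slides) can be checked directly in code; the matrix below is the slide's, with rows as predicted classes and columns as true classes:

```python
# Confusion matrix from the slide: rows = predicted class, columns = true class
cm = [[80, 60, 10],
      [10, 90, 0],
      [10, 30, 110]]

def accuracy(cm):
    """Fraction of examples classified correctly (diagonal / total)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def precision(cm, k):
    """Of the examples predicted as class k, the fraction truly in class k."""
    return cm[k][k] / sum(cm[k])

def recall(cm, k):
    """Of the examples truly in class k, the fraction predicted as class k."""
    return cm[k][k] / sum(row[k] for row in cm)
```

Here `accuracy(cm)` is 0.70, and for class 2 (index 1) `precision(cm, 1)` is 0.90 and `recall(cm, 1)` is 0.50, matching the numbers on the following slides.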

Page 13:

Precision

                    True class 1   True class 2   True class 3
Predicted class 1        80             60             10
Predicted class 2        10             90              0
Predicted class 3        10             30            110

Precision for class k = fraction predicted as class k that actually belong to class k
For class 2 above: 90/100 = 90%

Page 14:

Recall

Recall for class k = fraction of examples that belong to class k that are predicted as class k
For class 2 below: 90/180 = 50%

                    True class 1   True class 2   True class 3
Predicted class 1        80             60             10
Predicted class 2        10             90              0
Predicted class 3        10             30            110

Page 15:

Precision-Recall Curves

• Rank our test documents by predicted probability of belonging to class k
  – Sort the ranked list
  – For t = 1 … N (N = total number of test documents):
    • Select the top t documents on the list
    • Classify documents above the cutoff as class k, documents below as not class k
    • Compute a precision-recall pair

  – Generates a set of precision-recall values we can plot

  – Perfect performance: precision = 1, recall = 1

  – Similar concept to an ROC curve
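The threshold-sweeping procedure above can be sketched as follows; `pr_curve` is a hypothetical helper that returns one (precision, recall) pair per cutoff t:

```python
def pr_curve(scores, labels):
    """Sweep the cutoff down a ranked list: at each t, classify the top-t
    documents as class k and record (precision, recall).
    labels[i] is True iff document i truly belongs to class k."""
    # Rank documents by decreasing predicted score
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    total_pos = sum(labels)
    points, tp = [], 0
    for t, (_, is_pos) in enumerate(ranked, start=1):
        tp += is_pos
        points.append((tp / t, tp / total_pos))  # (precision, recall)
    return points
```

Plotting the returned points gives the precision-recall curve; perfect ranking puts all positives first, so precision stays at 1 until recall reaches 1.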

Page 16:

Precision-recall for category: Crude

[Figure: precision vs. recall curves comparing Linear SVM, Decision Tree, Naïve Bayes, and Rocchio; from Dumais (1998)]

Page 17:

F-measure

• F-measure = weighted harmonic mean of precision and recall (at some fixed operating threshold for the classifier)

  F1 = 1 / ( 0.5/P + 0.5/R ) = 2PR / (P + R)

• Useful as a simple single summary measure of performance

• Sometimes the "breakeven" F1 is used, i.e., F1 measured at the threshold where P = R
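As a sketch, the harmonic mean can be computed directly; using the class-2 numbers from the earlier confusion-matrix slides (P = 0.9, R = 0.5) shows how F1 penalizes the imbalance between the two:

```python
def f1(p, r):
    """Harmonic mean of precision and recall; 0 if both are 0."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0
```

Here `f1(0.9, 0.5)` is about 0.64, noticeably below the arithmetic mean of 0.7.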

Page 18:

F-Measures on Reuters Data Set

Sec.13.6

Page 19:

Probabilistic “Generative” Classifiers

• Model p( x | ck ) for each class and perform classification via Bayes' rule:

  c = arg maxk { p( ck | x ) } = arg maxk { p( x | ck ) p( ck ) }

• How to model p( x | ck )?
  – p( x | ck ) = probability of a "bag of words" x given a class ck
  – Two commonly used approaches (for text):
    • Naïve Bayes: treat each term xj as conditionally independent, given ck
    • Multinomial: model a document with N words as N tosses of a p-sided die
  – Other models are possible but less common
    • e.g., model word order by using a Markov chain for p( x | ck )

Page 20:

Naïve Bayes Classifier for Text

• Naïve Bayes classifier = conditional independence model
  – Assumes conditional independence given the class:

    p( x | ck ) = ∏j p( xj | ck )

  – Note that we model each term xj as a discrete random variable

  – Binary terms (Bernoulli): each p( xj | ck ) is the Bernoulli probability that term j appears in a class-ck document

  – Non-binary terms (counts) (far less common): can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for the p( xj = nj | ck ) distributions

Page 21:

Simple Example: Naïve Bayes (binary)

• Vocabulary = {complexity, bound, regression, classifier}

• Two classes: class 1 = algorithms, class 2 = data mining

• Naïve Bayes model:
  – Probability that a word appears in a document, given the document class
  – Assumes that each word appears independently of the others

              complexity  bound  regression  classifier
Algorithms       0.9       0.7      0.1         0.1
Data Mining      0.2       0.1      0.7         0.8

Page 22:

Spam Email Classification

• Naïve Bayes widely used in spam filtering
  – See Paul Graham's A Plan for Spam (on the Web)
    • A naive-Bayes-like classifier with unusual parameter estimation
  – Widely used in spam filters
    • Naive Bayes works well when appropriately used
    • Very easy to update the model: just update the counts
  – But spam filters also have many other aspects
    • black lists, HTML features, etc.

Page 23:

Multinomial Classifier for Text

• Multinomial classification model
  – Assume that the words of a document are generated by a p-sided die (multinomial model):

    p( x | ck ) ∝ ∏j p( xj | ck )^nj

    where the product is over all terms j in the vocabulary and nj = number of times term j occurs in the document

  – Here we have a single discrete random variable per class, and the p( xj | ck ) probabilities sum to 1 over the vocabulary (i.e., a multinomial model)

  – Note that high correlation cannot be modeled well, compared to naïve Bayes
    • e.g., two words that should almost always appear together in a document of a given class

  – Probabilities are typically only defined and evaluated for non-zero counts
    • But "zero counts" could also be modeled if desired
    • This would be equivalent to a naïve Bayes model with a geometric distribution on counts

Page 24:

Simple Example: Multinomial

• Vocabulary = {complexity, bound, regression, classifier}

• Two classes: class 1 = algorithms, class 2 = data mining

• Multinomial model:
  – Probability of the next word in a document, given the document class
  – Assumes that words are "memoryless"

              complexity  bound  regression  classifier
Algorithms       0.5       0.4      0.05        0.05
Data Mining      0.1       0.05     0.3         0.55

Page 25:

Highest Probability Terms in Multinomial Distributions


Page 27:

Prediction

• Naïve Bayes

  log [ p( x | c ) p(c) ] = Σj log p( xj | c ) + log p(c)

  (sum over all words in the vocabulary)

• Multinomial

  log [ p( x | c ) p(c) ] = Σj nj log p( xj | c ) + log p(c)

  (sum over all words that occur in the document)

Page 28:

Classification Example

• Document 1 (bag of words):

              complexity  bound  regression  classifier
  Counts          4         0        0           1

• Document 2 (bag of words):

              complexity  bound  regression  classifier
  Counts          2         0        1           3

• Compute log p( class | document ) under both the naïve Bayes and multinomial models, for each document
  – Can assume p(class 1) = p(class 2)


Page 30:

Classification Example (for document 1)

Naïve Bayes: log p( xj | c )

              complexity  bound  regression  classifier
  A             -0.11     -0.36    -2.3        -2.3
  DM            -1.6      -2.3     -0.36       -0.22

Multinomial: log p( xj | c )

              complexity  bound  regression  classifier
  A             -0.69     -0.92    -3.0        -3.0
  DM            -2.3      -3.0     -1.2        -0.6

Document D1 counts:

              complexity  bound  regression  classifier
  D1              4         0        0           1

Naïve Bayes: Σj log p( xj | c ) + log p(c), scoring the terms present in D1:
  Class A:  score = -0.11 - 2.3 = -2.41
  Class DM: score = -1.6 - 0.22 = -1.82

Multinomial: Σj nj log p( xj | c ) + log p(c):
  Class A:  score = 4 × (-0.69) - 3.0 ≈ -5.8
  Class DM: score = 4 × (-2.3) - 0.6 ≈ -9.8
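Plugging the slide's parameter tables into code reproduces these scores, with the same simplifications as the worked example (equal class priors, and the naïve Bayes score summing only over the words present in the document):

```python
import math

# Parameter tables from the slides ("A" = algorithms, "DM" = data mining)
# Binary (Bernoulli) naive Bayes: p(term appears | class)
p_nb = {"A":  [0.9, 0.7, 0.1, 0.1],
        "DM": [0.2, 0.1, 0.7, 0.8]}

# Multinomial: p(next word | class)
p_mult = {"A":  [0.5, 0.4, 0.05, 0.05],
          "DM": [0.1, 0.05, 0.3, 0.55]}

# Document D1: counts for complexity, bound, regression, classifier
counts_d1 = [4, 0, 0, 1]

def nb_score(c):
    # Simplified naive Bayes score: sum log p(term | class) over the
    # terms present in the document (absent-term factors omitted)
    return sum(math.log(p) for p, n in zip(p_nb[c], counts_d1) if n > 0)

def mult_score(c):
    # Multinomial score: sum over terms of n_j * log p(term j | class)
    return sum(n * math.log(p) for p, n in zip(p_mult[c], counts_d1))
```

Running this reproduces the scores above, and shows that the two models can disagree on the same document: naïve Bayes prefers DM for D1 (about -1.82 vs. -2.41), while the multinomial model prefers A (about -5.8 vs. -9.8).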

Page 31:

Comparing Naïve Bayes and Multinomial models

McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments.

Note on names used in the literature:
– "Bernoulli" (or "multivariate Bernoulli") is sometimes used for the binary version of the naïve Bayes model
– the multinomial model is also referred to as the "unigram" model
– the multinomial model is also sometimes (confusingly) referred to as naïve Bayes

Page 32:

WebKB Data Set

• Classify webpages from CS departments into: student, faculty, course, project

• Train on ~5,000 hand-labeled web pages
  – Cornell, Washington, U. Texas, Wisconsin

• Crawl and classify a new site (CMU)

• Results with naïve Bayes:

            Student  Faculty  Person  Project  Course  Department
Extracted     180       66      246      99       28        1
Correct       130       28      194      72       25        1
Accuracy      72%       42%     79%      73%      89%      100%

Page 33:

Comparing Naïve Bayes and Multinomial on Web KB Data

Page 34:

Classic Reuters-21578 Data Set

• Most (over)used data set
• 21,578 documents
• 9,603 training, 3,299 test articles (ModApte/Lewis split)
• 118 categories
  – An article can be in more than one category
  – Learn 118 binary category distinctions
• Average document: about 90 types, 200 tokens
• Average number of classes assigned: 1.24 for docs with at least one category
• Only about 10 out of 118 categories are large

• Common categories (#train, #test):
  – Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189)
  – Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

Sec. 15.2.4

Page 35:


Example of Reuters document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TEXT><TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>

Sec. 15.2.4

Page 36:

Comparing Multinomial and Naïve Bayes on Reuters Data

(from McCallum and Nigam, 1998)

Page 37:

Comparing Multinomial and Naïve Bayes on Reuters Data

(from McCallum and Nigam, 1998)

Page 38:

Comparing Naïve Bayes and Multinomial (slide from Chris Manning, Stanford)

Results from classifying 13,589 Yahoo! Web pages in Science subtree of hierarchy into 95 different topics

Page 39:

Feature Selection

• Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms
  – Particularly important for naïve Bayes (see previous slides)
  – The multinomial classifier is less sensitive

• Greedy search
  – Start from the empty set or the full set and add/delete one term at a time
  – Heuristics for adding/deleting:
    • Information gain (mutual information of term with class)
    • Chi-square
    • Other ideas

• Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance

Page 40:

Feature selection via Mutual Information

• In training set, choose k words which best discriminate (give most information about) the categories.

• The mutual information between a word w and a class c is:

  I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} p( e_w, e_c ) log [ p( e_w | e_c ) / p( e_w ) ]

  where e_w = 1 if word w appears in a document, e_c = 1 if the document belongs to class c, and p( e_w | e_c ) is the probability of w being in a document given the document's class membership

Sec. 13.5.1
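A minimal sketch of this computation from a 2x2 table of document counts; the cell names (n11, n10, n01, n00) are notation introduced here for the sketch, not from the slides:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(w, c) from document counts: n11 = docs in class c containing w,
    n10 = docs outside c containing w, n01 = docs in c without w,
    n00 = docs outside c without w."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    # For each cell: p(e_w, e_c) * log[ p(e_w, e_c) / (p(e_w) p(e_c)) ]
    for nij, nw, nc in [(n11, n11 + n10, n11 + n01),
                        (n10, n11 + n10, n10 + n00),
                        (n01, n01 + n00, n11 + n01),
                        (n00, n01 + n00, n10 + n00)]:
        if nij > 0:
            total += (nij / n) * math.log((nij * n) / (nw * nc))
    return total
```

If word and class are independent the score is 0; a word that appears in exactly the documents of class c and nowhere else gets the maximum score.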

Page 41:

Example of Feature Selection

• For each category we build a list of k most discriminating terms– Use the union of these terms in our classifier– K selected heuristically, or via cross-validation

• For example (on 20 Newsgroups), using mutual information– sci.electronics: circuit, voltage, amp, ground, copy, battery,

electronics, cooling, …– rec.autos: car, cars, engine, ford, dealer, mustang, oil,

collision, autos, tires, toyota, …

• Greedy: does not account for correlations between terms

Sec.13.5.1

Page 42:

Example of Feature Selection (from Chakrabarti, Chapter 5)

9,600 documents from the US Patent database; 20,000 raw features (terms)

Page 43:

Comments on Generative Models for Text
(Comments applicable to both naïve Bayes and multinomial classifiers)

• Simple and fast => popular in practice
  – e.g., linear in p, n, M for both training and prediction

• Training = "smoothed" frequency counts, e.g.,

  p( xj = 1 | ck ) = ( njk + 1 ) / ( nk + m )

  where njk = number of class-ck documents containing term j, nk = number of class-ck documents, and m is a smoothing constant (Laplace-style smoothing; the exact constants vary)
  – easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)

• Numerical issues
  – Typically work with log p( ck | x ), etc., to avoid numerical underflow
  – Useful trick for the naïve Bayes classifier: when computing Σj log p( xj | ck ) for sparse data, it may be much faster to
    • precompute Σj log p( xj = 0 | ck ) over the whole vocabulary
    • then, for only the terms present in the document, swap the log p( xj = 0 | ck ) term for log p( xj = 1 | ck )

• Note: both models are "wrong", but for classification they are often sufficient
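The sparse-data trick can be sketched as follows; `make_nb_scorer`, its toy parameters, and the equal class priors in the test are all hypothetical, introduced only to illustrate the precompute-then-adjust idea:

```python
import math

def make_nb_scorer(p_present, log_prior):
    """Binary naive Bayes with the sparse trick: precompute the score of
    an all-zeros document per class, then adjust only for present terms.
    p_present[c][j] = p(x_j = 1 | c)."""
    # Precompute log p(c) + sum_j log p(x_j = 0 | c) once per class
    base = {c: log_prior[c] + sum(math.log(1 - p) for p in ps)
            for c, ps in p_present.items()}

    def score(c, present_terms):
        s = base[c]
        for j in present_terms:
            p = p_present[c][j]
            # Swap the absent-term factor for the present-term factor
            s += math.log(p) - math.log(1 - p)
        return s

    return score
```

Scoring a document now costs time proportional to the number of terms it contains, not the vocabulary size, which is the point of the trick.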

Page 44:

Beyond Independence

• Naïve Bayes and multinomial models both assume conditional independence of words given the class

• Alternative approaches try to account for higher-order dependencies
  – Bayesian networks:
    • p( x | c ) = ∏j p( xj | parents(xj), c )
    • Equivalent to a directed graph where edges represent direct dependencies
    • Various algorithms search for a good network structure
    • Useful for improving the quality of the distribution model
    • … however, this does not always translate into better classification
  – Maximum entropy models:
    • p( x | c ) = (1/Z) ∏S fS( xS | c ), product over selected subsets S of terms
    • Equivalent to an undirected graph model
    • Estimation is equivalent to a maximum entropy assumption
    • Feature selection is crucial (which f terms to include)
    • Can provide high-accuracy classification
    • … however, tends to be computationally complex to fit (estimating Z is difficult)

Page 45:

Dumais et al. 1998: Reuters - Accuracy

              NBayes   Trees   LinearSVM
earn          95.9%    97.8%    98.2%
acq           87.8%    89.7%    92.8%
money-fx      56.6%    66.2%    74.0%
grain         78.8%    85.0%    92.4%
crude         79.5%    85.0%    88.3%
trade         63.9%    72.5%    73.5%
interest      64.9%    67.1%    76.3%
ship          85.4%    74.2%    78.0%
wheat         69.7%    92.5%    89.7%
corn          65.3%    91.8%    91.1%

Avg Top 10    81.5%    88.4%    91.4%
Avg All Cat   75.2%     na      86.4%

Page 46:

Precision-recall for category: Ship

[Figure: precision vs. recall curves for Linear SVM, Decision Tree, Naïve Bayes, and Rocchio; from Dumais (1998)]

Sec. 15.2.4

Page 47:

Micro- vs. Macro-Averaging

• If we have more than one class, how do we combine multiple performance measures into one quantity?

• Macroaveraging: Compute performance for each class, then average.

• Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

Note that macroaveraging ignores how likely each class is and gives performance on each class equal weight, while microaveraging weights classes according to how likely they are.

Sec. 15.2.4

Page 48:

Micro- vs. Macro-Averaging: Example

Class 1:
                    Truth: yes   Truth: no
  Classifier: yes       10           10
  Classifier: no        10          970

Class 2:
                    Truth: yes   Truth: no
  Classifier: yes       90           10
  Classifier: no        10          890

Micro-average (pooled) table:
                    Truth: yes   Truth: no
  Classifier: yes      100           20
  Classifier: no        20         1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83

The microaveraged score is dominated by the score on common classes.

Sec. 15.2.4
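The macro/micro computation for the two classes above can be sketched directly, with the tp/fp counts read off the slide's contingency tables:

```python
def precision(tp, fp):
    """Fraction of 'yes' decisions that are correct."""
    return tp / (tp + fp)

# (tp, fp) counts from the slide's per-class contingency tables
class1 = (10, 10)    # rare class
class2 = (90, 10)    # common class

# Macroaveraging: compute precision per class, then average
macro = (precision(*class1) + precision(*class2)) / 2

# Microaveraging: pool the decisions, then compute precision once
micro = precision(class1[0] + class2[0], class1[1] + class2[1])
```

This reproduces the slide: `macro` is 0.7 and `micro` is 100/120 ≈ 0.83, with the micro score pulled toward the common class.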

Page 49:

Results on Reuters Data

Sec. 15.2.4

Page 50:

SVM vs. Other Methods (Yang and Liu, 1999)

Sec. 15.2.4

Page 51

Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.

Page 52

Upweighting

• You can get a lot of value by differentially weighting contributions from different document zones
  – e.g., a word counts as two instances when it appears in, say, the abstract
  – Upweighting title words helps (Cohen & Singer, 1996)
    • Doubling the weight on title words is a good rule of thumb
  – Upweighting the first sentence of each paragraph helps (Murata, 1999)
  – Upweighting sentences that contain title words helps (Ko et al., 2002)

Are there principled approaches to learning where and how much to upweight for a specific text classification task and corpus?

Sec. 15.3.2
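The "double the title words" rule of thumb above can be sketched directly in the feature-extraction step. This is an illustrative example (the function name and the extra-weight scheme are assumptions, not from the slides):

```python
# Sketch: bag-of-words counts where each title occurrence of a word
# adds title_weight extra counts on top of the body counts.
from collections import Counter

def weighted_counts(title, body, title_weight=2):
    """Count body words once; add title_weight counts per title word."""
    counts = Counter(body.lower().split())
    for word in title.lower().split():
        counts[word] += title_weight
    return counts

doc = weighted_counts("Ship Hits Iceberg",
                      "the ship sank after the iceberg hit")
# "ship" appears once in the body and once in the title: 1 + 2 = 3
```

The same pattern extends to any zone (abstract, first sentence of a paragraph) by giving each zone its own weight.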

Page 53

Accuracy as a function of data size

• With enough data, the choice of classifier may not matter much, and the best choice may be unclear
  – Data: Banko and Brill on “confusion-set” disambiguation, e.g., principle vs. principal

• But the fact that you have to keep doubling your data to improve performance is a little unpleasant

Sec. 15.3.1

Scaling to Very Very Large Corpora for Natural Language Disambiguation, Banko and Brill, Annual Meeting of the ACL, 2001

Page 54

Learning with Labeled and Unlabeled documents

• In practice, obtaining labels for documents is time-consuming, expensive, and error-prone
  – Typical application: a small number of labeled docs and a very large number of unlabeled docs

• Idea:
  – Build a probabilistic model on the labeled docs
  – Classify the unlabeled docs, getting p(class | doc) for each class and doc
    • This is equivalent to the E-step in the EM algorithm
  – Now relearn the probabilistic model using the new “soft labels”
    • This is equivalent to the M-step in the EM algorithm
  – Continue to iterate until convergence (e.g., class probabilities no longer change significantly)
  – This EM approach shows that unlabeled data can improve classification performance compared to using labeled data alone
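The E/M loop above can be sketched with a tiny multinomial Naive Bayes model. Everything concrete here (the toy documents, vocabulary, smoothing, and fixed iteration count) is an illustrative assumption, not from the slides:

```python
# Sketch of EM with labeled + unlabeled docs: train NB on labeled data,
# soft-label the unlabeled docs (E-step), retrain on all docs (M-step).
import math
from collections import Counter

labeled = [(["ball", "game", "score"], 0), (["vote", "party", "poll"], 1)]
unlabeled = [["game", "score", "win"], ["party", "vote", "win"]]
vocab = sorted({w for d, _ in labeled for w in d} |
               {w for d in unlabeled for w in d})
K = 2  # number of classes

def train(docs_with_weights):
    """M-step: class priors and word probabilities from (doc, per-class
    weight) pairs, with add-one smoothing."""
    prior = [1e-10] * K  # tiny floor avoids log(0) for an empty class
    word_counts = [Counter() for _ in range(K)]
    for doc, weights in docs_with_weights:
        for k in range(K):
            prior[k] += weights[k]
            for w in doc:
                word_counts[k][w] += weights[k]
    total = sum(prior)
    prior = [p / total for p in prior]
    cond = []
    for k in range(K):
        denom = sum(word_counts[k].values()) + len(vocab)
        cond.append({w: (word_counts[k][w] + 1) / denom for w in vocab})
    return prior, cond

def posterior(doc, prior, cond):
    """E-step for one doc: p(class | doc) via Bayes' rule in log space."""
    logp = [math.log(prior[k]) + sum(math.log(cond[k][w]) for w in doc)
            for k in range(K)]
    m = max(logp)
    exp = [math.exp(l - m) for l in logp]
    s = sum(exp)
    return [e / s for e in exp]

# Initialize from labeled docs only (hard labels as one-hot weights)
data = [(doc, [1.0 if k == y else 0.0 for k in range(K)])
        for doc, y in labeled]
prior, cond = train(data)
for _ in range(5):  # a fixed number of E/M rounds stands in for convergence
    soft = [(doc, posterior(doc, prior, cond)) for doc in unlabeled]
    prior, cond = train(data + soft)

preds = [posterior(doc, prior, cond) for doc in unlabeled]
```

With this toy data, the unlabeled doc sharing words with the class-0 seed ends up assigned to class 0, and likewise for class 1.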

Page 55

Learning with Labeled and Unlabeled Data

Graph from “Semi-supervised text classification using EM”, Nigam, McCallum, and Mitchell, 2006

Page 56

MultiLabel Text Classification

• Each document can have 1, 2, 3, … class labels
  – E.g., news articles annotated with “tags”
  – Wikipedia categories
  – MeSH headings for biomedical articles

• Common approach, given K labels
  – Train K different “one versus all” classifiers, e.g., K binary SVMs
  – Limitations
    • Does not account for label dependencies (e.g., positive and negative correlations in label patterns)
    • Can have trouble with (a) many labels per document, and (b) few documents per label
  – Alternatives
    • Train K(K-1)/2 binary classifiers: not good computationally
    • Capture label correlations in a lower-dimensional space
      – “compressed sensing”: Hsu, Kakade, Langford, NIPS 2009
      – topic models: new research at UCI, will discuss later in the quarter
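The "one versus all" scheme above can be sketched as K independent binary decisions. The keyword-based classifiers here are hypothetical stand-ins for K trained binary SVMs:

```python
# Sketch: one-versus-all multi-label prediction. A document receives
# every label whose binary classifier fires, so it can get 0, 1, or many.

def make_keyword_classifier(keywords):
    keywords = set(keywords)
    def classify(doc_words):
        # "positive" if the document shares at least one keyword
        return len(keywords & set(doc_words)) > 0
    return classify

# One binary classifier per label (illustrative keyword lists)
classifiers = {
    "sports": make_keyword_classifier(["game", "score", "team"]),
    "politics": make_keyword_classifier(["vote", "party", "election"]),
    "finance": make_keyword_classifier(["stock", "market"]),
}

def predict_labels(doc_words):
    return {label for label, clf in classifiers.items() if clf(doc_words)}

labels = predict_labels(["the", "party", "lost", "the", "vote",
                         "and", "the", "market", "fell"])
```

Note how the K decisions are made independently: nothing here can exploit the fact that, say, two labels almost always co-occur, which is exactly the limitation flagged above.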

Page 57

Page 58

Page 59

Other aspects of text classification

• Real-time constraints
  – Being able to update classifiers as new data arrives
  – Being able to make predictions very quickly in real time

• Document length
  – Varying document length can be a problem for some classifiers
  – The multinomial model tends to be less sensitive than the Bernoulli model, for example

• Linked documents
  – Can view Web documents as nodes in a directed graph
  – Classification can then leverage the link structure
    • Heuristic: class labels of linked pages are more likely to be the same
  – The optimal solution is to classify all documents jointly rather than individually
  – The resulting “global classification” problem is typically computationally complex

Page 60

Summary of Text Classifiers

• Naïve Bayes models (Bernoulli or Multinomial)
  – Low time complexity (a single linear pass through the data)
  – Generally good, but not always best
  – Widely used for spam email filtering

• Linear SVMs
  – Often produce the best results in research studies
  – But computationally complex to train, so not as widely used in practice as naïve Bayes

• Logistic regression
  – Can be as accurate as SVMs, and computationally cheaper
  – A variant of this is sometimes referred to as “maximum entropy”
  – Adaptations for sparse text data
    • Feature selection or regularization (e.g., L1/lasso) is essential
    • Stochastic gradient descent can be very useful (see earlier lecture)
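The logistic regression + SGD combination above can be sketched with sparse dictionary features. The toy corpus, learning rate, and epoch count are illustrative assumptions:

```python
# Sketch: binary logistic regression trained by stochastic gradient
# descent over sparse bag-of-words count vectors.
import math
import random

# Toy labeled corpus: sparse word-count dicts, label 1 = "spam"
docs = [({"free": 2, "win": 1}, 1), ({"meeting": 1, "agenda": 1}, 0),
        ({"free": 1, "prize": 1}, 1), ({"report": 1, "meeting": 2}, 0)]

w = {}       # sparse weight vector: one entry per word actually seen
bias = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(doc):
    return bias + sum(w.get(word, 0.0) * cnt for word, cnt in doc.items())

random.seed(0)
for epoch in range(50):
    random.shuffle(docs)
    for doc, y in docs:
        # SGD: update on one example at a time; only the words present
        # in this document are touched, which is cheap for sparse text
        err = sigmoid(score(doc)) - y
        bias -= lr * err
        for word, cnt in doc.items():
            w[word] = w.get(word, 0.0) - lr * err * cnt

p_spam = sigmoid(score({"free": 1, "prize": 2}))
```

Each update costs time proportional to the number of distinct words in the document, not the vocabulary size, which is what makes SGD attractive for large sparse text collections.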

Page 61

Alternatives: Manual Classification

• Manual classification
  – Used by the original Yahoo! Directory
  – Looksmart, about.com, ODP, PubMed
  – Very accurate when the job is done by experts
  – Consistent when the problem size and team are small
  – Difficult and expensive to scale
  – Can be noisy when done by different people
    • E.g., spam/non-spam categorization by users

Note that this is how our training data is created for classification

Ch. 13

Page 62

Rule-Based Systems

Hand-coded rule-based systems
• One technique used by spam filters, Reuters, CIA, etc.
  – Widely used in government and industry
• Companies provide an “IDE” for writing such rules
  • E.g., assign a category if the document contains a given Boolean combination of words
  • Standing queries: commercial systems have complex query languages (everything in IR query languages, plus score accumulators)
• Accuracy is often very high if a rule has been carefully refined over time by a subject expert
• Building and maintaining these rules is expensive
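A hand-coded rule of the kind described above is just a Boolean predicate over the document's words. The rule itself is an invented example, not from any real system:

```python
# Sketch of a hand-coded standing query for a hypothetical "mergers"
# category: ("merger" OR "acquisition") AND "shares" AND NOT "rumor".

def matches_rule(doc_words):
    words = set(doc_words)
    return (("merger" in words or "acquisition" in words)
            and "shares" in words
            and "rumor" not in words)

category = ("mergers"
            if matches_rule(["acquisition", "of", "shares", "approved"])
            else None)
```

In practice a subject expert refines such predicates over time against misclassified examples, which is where the maintenance cost comes from.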

Ch. 13

Page 63

Further Reading on Text Classification

• Web-related text mining in general
  – S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003. See chapter 5 for a discussion of text classification.

• General references on text and language modeling
  – Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze, MIT Press, 1999.
  – Speech and Language Processing: An Introduction to Natural Language Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000.

• SVMs for text classification
  – T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.

Page 64

Additional Papers and Sources

• Christopher J. C. Burges. 1998. A Tutorial on Support Vector Machines for Pattern Recognition.
• S. T. Dumais. 1998. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4).
• S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. 1998. Inductive learning algorithms and representations for text categorization. CIKM ’98, pp. 148-155.
• Yiming Yang, Xin Liu. 1999. A re-examination of text categorization methods. 22nd Annual International SIGIR.
• Tong Zhang, Frank J. Oles. 2001. Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31.
• Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
• T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
• Fan Li, Yiming Yang. 2003. A Loss Function Analysis for Classification Methods in Text Categorization. ICML 2003: 472-479.
• Tie-Yan Liu, Yiming Yang, Hao Wan, et al. 2005. Support Vector Machines Classification with Very Large Scale Taxonomy. SIGKDD Explorations, 7(1): 36-43.

Ch. 15