
Page 1: Naïve Bayes Text Classification

Naïve Bayes Text Classification

9 March 2021

cmpu 366 · Computational Linguistics

Page 2: Naïve Bayes Text Classification

Machine learning is the area of computer science focused on the development and implementation of systems that improve as they encounter more data.

Machine learning has been central to advances in NLP for approximately the last 25 years.

Page 3: Naïve Bayes Text Classification

Text classification

Page 4: Naïve Bayes Text Classification

From: "Fabian Starr" <[email protected]> Subject: Hey! Sofware for the funny prices!

Get the great discounts on popular software today for PC and Macintosh http://iiled.org/Cj4Lmx 70-90% Discounts from retail price!!! All sofware is instantly available to download - No Need Wait!

Is this spam?

Page 5: Naïve Bayes Text Classification

Who wrote each of the Federalist papers?

Mad dog A. Ham.

Page 6: Naïve Bayes Text Classification

What’s the subject of this medical article? Antagonists and inhibitors

Blood supply

Chemistry

Drug therapy

Embryology

Epidemiology

Page 7: Naïve Bayes Text Classification

…zany characters and richly applied satire, and some great plot twists

It was pathetic. The worst part about it was the boxing scenes…

…awesome caramel sauce and sweet toasty almonds. I love this place!

…awful pizza and ridiculously overpriced…

Are these reviews positive or negative?

Page 8: Naïve Bayes Text Classification

Many problems take the form of text classification: Assigning subject categories, topics, or genres

Spam detection

Authorship identification

Age/gender identification

Language identification

Sentiment analysis

Part-of-speech tagging

Automatic essay grading

Page 9: Naïve Bayes Text Classification

Text classification problems take this form: Input:

A document d

A fixed set of classes C = {c1, c2, …, cj}

Output:

A predicted class c ∈ C

Page 10: Naïve Bayes Text Classification

We can build a classifier by writing rules. Look for combinations of words or other features

Spam: black-list-address OR (“dollars” AND “have been selected”)

Accuracy can be high if the rules are carefully refined by an expert.

But building and maintaining these rules is expensive.

Page 11: Naïve Bayes Text Classification

Instead, just as humans learn from experience, we have computers learn from data.

Page 12: Naïve Bayes Text Classification

A supervised machine-learning text classification problem takes this form:

Input:

A document d

A fixed set of classes C = {c1, c2, …, cj}

A training set of m hand-labeled documents (d1, c1), …, (dm, cm)

Output:

A learned classifier γ : d → c

Page 13: Naïve Bayes Text Classification

Supervised machine learning

Source: NLTK book

Page 14: Naïve Bayes Text Classification

Features

Page 15: Naïve Bayes Text Classification

A classification decision must rely on some observable evidence, which we encode as features.

Page 16: Naïve Bayes Text Classification

Typical features include: Words (n-grams) present in the text

Frequency of words

Capitalization

Presence of named entities

Syntactic relations

Semantic relations

Page 17: Naïve Bayes Text Classification

The simplest and most common features are Boolean, e.g., is the word present or not?

However, we can also have integer features like the number of times a word occurs.

Page 18: Naïve Bayes Text Classification

The features we select depend on the task. Is a name masculine or feminine?

Last letter = …

What part-of-speech is a word, e.g., park or carbingly?

Is the word preceded by the? to?

Does the word end with -ly? -ness?

Is an email spam?

Does it contain generic Viagra?

Is the subject in all capital letters?

See features that were used by SpamAssassin: http://spamassassin.apache.org/old/tests_3_3_x.html

Page 19: Naïve Bayes Text Classification

Feature engineering is the problem of deciding what features are relevant.

Approaches: Hand-crafted

Use expert knowledge to determine a small set of features that are likely to be relevant.

Kitchen sink

Give lots of features to the machine-learning algorithm and see which features are given greater weight and which are ignored.

E.g., use each word in the document as a feature: has-cash: True

has-the: True

has-linguistics: False
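The kitchen-sink word features above can be sketched in Python; this is a minimal illustration, and the function name and example text are invented, not from the slides:

```python
def word_features(text):
    # Kitchen-sink features: one Boolean has-<word> feature per word type.
    words = set(text.lower().split())
    return {f"has-{w}": True for w in words}

features = word_features("Send CASH now to claim your prize")
# Words present map to True; absent words are simply missing,
# which the classifier can treat as False.
```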

Page 20: Naïve Bayes Text Classification

Weighting the evidence

A classification decision involves reconciling multiple features with different levels of predictive power.

Different types of classifiers use different algorithms to:

Determine the weights of individual features to maximize correct predictions for the training data and

Compute the likelihood of a label for an input, using the feature weights.

Page 21: Naïve Bayes Text Classification

Popular machine learning methods: Naïve Bayes

Decision tree

Maximum entropy (ME)

Hidden Markov model (HMM)

Neural networks, including deep learning

Support vector machine (SVM)

Page 22: Naïve Bayes Text Classification

Naïve Bayes

Page 23: Naïve Bayes Text Classification

Naïve Bayes is a simple classification method based on Bayes rule.

For text classification, we can use it with a simple representation of a document as a bag of words.

Page 24: Naïve Bayes Text Classification

The bag of words representation

Figure from J&M, 3rd ed. draft sec. 6.1


Page 27: Naïve Bayes Text Classification

The bag of words representation

γ( seen 2, sweet 1, whimsical 1, recommend 1, happy 1, … ) = c
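A bag of words is easy to build with Python's `Counter`; a minimal sketch, with an invented example sentence:

```python
from collections import Counter

def bag_of_words(text):
    # A bag of words keeps word counts and discards word order.
    return Counter(text.lower().split())

bow = bag_of_words("I love it I recommend it")
# bow maps each word to its count, e.g. bow["i"] == 2
```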

Page 28: Naïve Bayes Text Classification

Bayes rule and classification

Page 29: Naïve Bayes Text Classification

Bayes rule relates conditional probabilities. For a document d and a class c,

P(c | d) = P(d | c) · P(c) / P(d)

where P(c | d) is the posterior, P(d | c) the likelihood, P(c) the prior, and P(d) the evidence.

Page 30: Naïve Bayes Text Classification

To choose the most likely class, cMAP, from the set of classes C for a document d:

cMAP = argmax_{c ∈ C} P(c | d)                    MAP is "maximum a posteriori" – the most likely class

     = argmax_{c ∈ C} P(d | c) P(c) / P(d)        Bayes rule

     = argmax_{c ∈ C} P(d | c) P(c)               Dropping the denominator, which is the same for each class

Page 31: Naïve Bayes Text Classi cation

But the number of training examples needed to calculate these estimates is exponentially large compared to the number of features:

O(|F|n · |C|) parameters

cMAP = argmaxc∈C

P(d ∣ c)P(c)

= argmaxc∈C

P( f1, f2, …, fn ∣ c)P(c)

Likelihood Prior

Document d is represented as features f1, …, fn

Page 32: Naïve Bayes Text Classification

Fortunately, the “naïve” in “naive Bayes” isn’t (just) a value judgment; it’s a functional design choice.

The naïve Bayes assumption is that the features f1, …, fn are conditionally independent (of one another) given the class c.

This simplifies combining contributions of features; you just multiply their probabilities:

P(f1, …, fn | c) = P(f1 | c) · P(f2 | c) ⋯ P(fn | c)

Page 33: Naïve Bayes Text Classification

cMAP = argmax_{c ∈ C} P(f1, f2, …, fn | c) P(c)

cNB = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(fi | c)

Page 34: Naïve Bayes Text Classification

Returning to our bag-of-words model: let positions = all word positions in the document, where wi is the word at position i.

cNB = argmax_{c ∈ C} P(c) ∏_{i ∈ positions} P(wi | c)

This class is the one our naïve Bayes text classifier returns.
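The decision rule can be sketched as follows, assuming we already have priors P(c) and per-class word probabilities (all names and numbers here are illustrative). In practice we sum log probabilities instead of multiplying, to avoid floating-point underflow; words absent from a class's model are skipped here, a simplification:

```python
import math

def classify(words, priors, likelihoods):
    # cNB = argmax_c P(c) * prod_i P(w_i | c), computed in log space.
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(likelihoods[c][w]) for w in words if w in likelihoods[c]
        )
    return max(priors, key=log_score)

priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"fun": 0.05, "film": 0.1},
    "neg": {"fun": 0.005, "film": 0.1},
}
```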

Page 35: Naïve Bayes Text Classification

Naïve Bayes: Learning

Page 36: Naïve Bayes Text Classification

We need to estimate the prior probability of each category, P(c) for each c ∈ C. We can get the maximum-likelihood estimate for each c from the training corpus:

P(c) = doccount(C = c) / Ndoc

We also need the probability of each word (feature) given each category – the fraction of times word wi appears among all words in documents of topic c:

P(wi | c) = count(wi, c) / ∑_{w ∈ V} count(w, c)
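Both maximum-likelihood estimates are simple ratios of counts; a minimal sketch, with an invented toy corpus of three labeled documents:

```python
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (words, label) pairs.
    # Returns MLE priors P(c) and per-class word probabilities P(w | c).
    n_docs = len(docs)
    doc_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)      # word_counts[c][w] = count(w, c)
    for words, label in docs:
        word_counts[label].update(words)
    priors = {c: doc_counts[c] / n_docs for c in doc_counts}
    likelihoods = {
        c: {w: n / sum(counts.values()) for w, n in counts.items()}
        for c, counts in word_counts.items()
    }
    return priors, likelihoods

docs = [(["great", "film"], "pos"),
        (["great", "fun"], "pos"),
        (["boring", "film"], "neg")]
priors, likelihoods = train_nb(docs)
# priors["pos"] == 2/3; likelihoods["pos"]["great"] == 2/4
```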

Page 37: Naïve Bayes Text Classification

In general, the more training data we can give the classifier, the better it will do.

[Figure: error rate falls as training set size grows]

Page 38: Naïve Bayes Text Classification

Note that we have a big problem with zero counts! If we never saw the word fantastic in a document labeled as positive, then

P(fantastic | positive) = count(fantastic, positive) / ∑_{w ∈ V} count(w, positive) = 0

When we calculate

cNB = argmax_{c ∈ C} P(c) ∏_{i=1}^{n} P(fi | c)

this one 0 will turn the whole estimate to 0!

Page 39: Naïve Bayes Text Classification

As we did with n-grams, we can use Laplace (add-1) smoothing:

P(wi | c) = count(wi, c) / ∑_{w ∈ V} count(w, c)  →  (count(wi, c) + 1) / (∑_{w ∈ V} count(w, c) + |V|)
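A sketch of the smoothed estimate; the counts and vocabulary below are invented for illustration:

```python
def smoothed_prob(word, c, word_counts, vocab):
    # Add-1 (Laplace) smoothed estimate of P(word | c).
    total = sum(word_counts[c].values())
    return (word_counts[c].get(word, 0) + 1) / (total + len(vocab))

word_counts = {"positive": {"great": 3, "film": 2}}
vocab = {"great", "film", "fantastic"}
# count(fantastic, positive) = 0, but the estimate is (0 + 1) / (5 + 3) = 0.125,
# so the unseen word no longer zeroes out the whole product.
```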

Page 40: Naïve Bayes Text Classification

What about the unknown words – those that appear in the test data but not in the training data?

Ignore them! Just remove them from the test document.

We could build an unknown word model, but it wouldn’t generally help; it’s unlikely to help us to know which class has more unknown words.

Page 41: Naïve Bayes Text Classification

Naïve Bayes and language modeling

Page 42: Naïve Bayes Text Classification

Generative model for multinomial naïve Bayes

c = spam

w1 = Dear w2 = sir w3 = SEEKING w4 = YOUR …

Page 43: Naïve Bayes Text Classification

Naïve Bayes classifiers can use any sort of feature: URLs, email addresses, dictionaries, network features.

But if we have a feature corresponding to each word in the text, then each class in our naïve Bayes model is a unigram language model.

Page 44: Naïve Bayes Text Classification

Each class assigns a probability to each word, P(word | c), and hence to each sentence: P(s | c) = ∏ P(word | c).

Class positive: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1

For s = "I love this fun film":

P(s | positive) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 0.0000005

Page 45: Naïve Bayes Text Classification

Each class assigns a probability to each word, P(word | c), and hence to each sentence: P(s | c) = ∏ P(word | c).

Class positive: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Class negative: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

For s = "I love this fun film":

P(s | positive) = 0.1 · 0.1 · 0.01 · 0.05 · 0.1 = 0.0000005
P(s | negative) = 0.2 · 0.001 · 0.01 · 0.005 · 0.1 = 0.000000001

P(s | positive) > P(s | negative)
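The two sentence probabilities can be checked directly with the per-word probabilities from the slide:

```python
from math import prod

# Per-class unigram probabilities from the slide.
positive = {"i": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
negative = {"i": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "I love this fun film".lower().split()
p_pos = prod(positive[w] for w in sentence)   # ≈ 5e-07
p_neg = prod(negative[w] for w in sentence)   # ≈ 1e-09
```

So the classifier prefers the positive class for this sentence.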

Page 46: Naïve Bayes Text Classification

Oh, to be Bayesian and naïve!

Page 47: Naïve Bayes Text Classification

Strengths of naïve Bayes classification:

The model is easy to understand and easy to implement

(compared with other classifiers!)

Training and classification are both fast

Requires modest storage space

Relatively robust to irrelevant features

If we include features – e.g., words – that don’t help us classify, they cancel out without affecting the results

Works well for many tasks

It’s a good, dependable baseline for classification that’s widely used in practice – but it’s not the best!

Page 48: Naïve Bayes Text Classification

Weakness of naïve Bayes classification:

The bag-of-words representation ignores the sequential ordering of words

The independence assumption is inappropriate if there are strong conditional dependencies between the variables.


The model may not be “right”, but often we’re interested in the accuracy of the classification, not of the probability estimates.

Page 49: Naïve Bayes Text Classification

Evaluation

Page 50: Naïve Bayes Text Classification

After choosing the parameters for the classifier – i.e., training it – we test how well it does on a test set of examples that weren’t used for training.

Page 51: Naïve Bayes Text Classification

Precision, recall, and F-measure

Page 52: Naïve Bayes Text Classification

Jurafsky & Martin ask us to imagine we're the CEO of Delicious Pie Company and we want to know what people are tweeting about our pies (which are delicious).

We build a classifier to identify which tweets are about Delicious Pie Company.


Page 55: Naïve Bayes Text Classification

2×2 confusion matrix

                           Classifier says it's about us    Classifier says it's not about us
It's really about us       true positive                    false negative
It's really NOT about us   false positive                   true negative

What percent of the tweets about us did we identify? (recall)

What percent of the tweets that we said were about us really were? (precision)

What percent of the tweets were identified correctly either way? (accuracy)

Page 56: Naïve Bayes Text Classification

Accuracy sounds great – it considers how the classifier does on all inputs!

Well, it depends on the base (prior) probabilities: 99.99% accuracy might be terrible.

If we see 1 million tweets and only 100 of them are about Delicious Pie Company, we could just label every tweet “not about us”!

60% accuracy might be pretty good.

If we’re labeling documents with 20 different topics and the largest category only accounts for 10% of the data, that’s a much more difficult problem.

Instead, we measure precision and recall.

Page 57: Naïve Bayes Text Classification

Precision is the percent of items the system detected (labeled as positive for a class) that are actually positive:

true positives / (true positives + false positives)

Page 58: Naïve Bayes Text Classification

Recall is the percent of items actually present in the input that were correctly identified by the system:

true positives / (true positives + false negatives)

Page 59: Naïve Bayes Text Classification

The classifier that says no tweets are about pie would have 99.99% accuracy – but 0% recall!

It doesn’t identify any of the 100 tweets we wanted.

Page 60: Naïve Bayes Text Classification

There’s a trade-off between precision and recall. A highly precise classifier will ignore cases where it’s less confident, leading to more false negatives → lower recall

A high-recall classifier will flag things it’s unsure about, leading to more false positives → lower precision

Page 61: Naïve Bayes Text Classification

In developing a real application, picking the right trade-off point between precision and recall is an important usability issue.

Think about a grammar checker: Too many false positives will irritate lots of users.

But if you’re designing a system to detect hate speech online, you might want to err on the side of high recall to avoid abuse slipping through the cracks.

Page 62: Naïve Bayes Text Classification

Any balance of precision and recall can be encoded as a single measure called an F-score:

Fβ = (β² + 1)PR / (β²P + R)

The most common F-score is F1 (β = 1), which is the harmonic mean of precision and recall:

F1 = 2PR / (P + R)

Why do we use the harmonic mean rather than the mean?
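A sketch of the F-score formula in code (the precision/recall values are invented). On the harmonic-mean question: the harmonic mean is dominated by the smaller of P and R, so a classifier cannot hide poor recall behind high precision, or vice versa:

```python
def f_beta(precision, recall, beta=1.0):
    # F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

f1 = f_beta(0.6, 0.75)   # 2PR / (P + R) = 0.9 / 1.35 ≈ 0.667
```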

Page 63: Naïve Bayes Text Classification

Development test sets

We train on a training set and test on a test set.

But sometimes we also want a development test set. This avoids overfitting – “tuning to the test set” – and offers a more conservative estimate of performance.

Training set | Development test set | Test set

Problem: We want as much data as possible for training and as much as possible for dev. How should we split it?

Page 64: Naïve Bayes Text Classification

Cross-validation: multiple splits

We can pool results over splits, compute the pooled dev. performance.
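A minimal sketch of producing such splits (round-robin fold assignment; the helper name is illustrative):

```python
def k_fold_splits(items, k):
    # Partition items into k folds; each fold serves once as the dev set,
    # with the remaining folds pooled as the training set.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        dev = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, dev

splits = list(k_fold_splits(list(range(10)), 5))
# 5 (train, dev) pairs; every item appears in exactly one dev set.
```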

Page 65: Naïve Bayes Text Classification

3×3 confusion matrix

Page 66: Naïve Bayes Text Classification

How can we combine the precision or recall scores from three (or more) classes to get one metric?

Macroaveraging

Compute the performance for each class and then average over classes

Microaveraging

Collect decisions for all classes into one confusion matrix

Compute precision and recall from that table
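A sketch contrasting the two averages for precision; the per-class (tp, fp) counts below are invented for illustration:

```python
def macro_micro_precision(per_class):
    # per_class: {class: (tp, fp)} counts from a pooled confusion matrix.
    precisions = [tp / (tp + fp) for tp, fp in per_class.values()]
    macro = sum(precisions) / len(precisions)        # average over classes
    tp_all = sum(tp for tp, fp in per_class.values())
    fp_all = sum(fp for tp, fp in per_class.values())
    micro = tp_all / (tp_all + fp_all)               # pooled decisions
    return macro, micro

counts = {"urgent": (8, 10), "normal": (60, 50), "spam": (200, 33)}
macro, micro = macro_micro_precision(counts)
# Microaveraging is dominated by the largest class ("spam" here);
# macroaveraging weights all classes equally.
```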

Page 67: Naïve Bayes Text Classification

Macroaveraging and microaveraging

Page 68: Naïve Bayes Text Classification

Assignment 2: Who Said It?

Jane Austen or Herman Melville?

I never met with a disposition more truly amiable.

But Queequeg, do you see, was a creature in the transition stage – neither caterpillar nor butterfly.

Oh, my sweet cardinals!

Task: build a Naïve Bayes classifier and explore it

Do three-way partition of data: test data

development-test data

training data

Page 69: Naïve Bayes Text Classification

Acknowledgments

The lecture incorporates material from: Na-Rae Han, University of Pittsburgh

Nancy Ide, Vassar College

Daniel Jurafsky, Stanford University

Daniel Jurafsky and James Martin, Speech and Language Processing