
Page 1:

Instructor: Wei Xu

More about Naïve Bayes

Some slides adapted from Dan Jurafsky and Brendan O'Connor

Page 2:

Market Research (due 11:59pm, Friday, Feb 3rd)

• When was the company started?
• Who were the founders?
• What kind of organization is it? (publicly traded company, privately held company, non-profit organization, other)
• What is the company's main business model?
• How does the company generate revenue?
• What is the financial situation of the company? (stock price, annual report, news)

Page 3:

Market Research (due 11:59pm, Friday, Feb 3rd)

• Why is the company interested in speech or NLP technologies or both?

• What are specific areas or applications of speech/NLP the company is interested in?

• What products of the company use speech or NLP technologies?

• Who are the main users of their speech or NLP technologies?

• Does the company hold any patents on speech or NLP technologies?

• Does the company publish any papers on speech or NLP technologies?

Page 4:

Market Research (due 11:59pm, Friday, Feb 3rd)

• Has the company been hiring in NLP recently? Interns? PhDs?
• What specific expertise within speech or NLP is the company looking to hire?
• What is the press coverage of the company like?
• How many employees does the company have?
• An estimate of how many speech/NLP experts are currently at the company?
• Any notable speech/NLP researchers or recent hires at the company?
• In which city is the company's speech/NLP research office located?

Page 5:

Text Classification

[Figure: example documents grouped by topic, each shown as a bag of characteristic words: Machine Learning ("learning training algorithm shrinkage network ..."), Planning ("planning temporal reasoning plan language ..."), NLP ("parser tag training translation language ..."), Garbage Collection ("garbage collection memory optimization region ..."), and GUI; plus a test document ("parser language label translation ...") whose topic ("?") is to be predicted.]

Page 6:

Classification Methods: Supervised Machine Learning

• Input:
  – a document d
  – a fixed set of classes C = {c1, c2, …, cJ}
  – a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

• Output:
  – a learned classifier γ: d → c
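To make the input/output concrete, here is a minimal Python sketch of this supervised setup. The function names, the trivial majority-class learner, and the toy documents are illustrative only, not from the slides; a learned classifier γ is simply a function from a document to one of the fixed classes.

from typing import Callable, List, Tuple

# A document is plain text; a class label is a string drawn from a fixed set C.
Document = str
Label = str

def train(examples: List[Tuple[Document, Label]]) -> Callable[[Document], Label]:
    """Learn a classifier gamma: d -> c from hand-labeled documents.
    (Placeholder learner: a real method, e.g. Naive Bayes, goes here.)"""
    labels = [c for _, c in examples]
    majority = max(set(labels), key=labels.count)   # trivial baseline: majority class
    return lambda d: majority

# Usage: C = {"NLP", "Planning"}, m = 3 hand-labeled documents.
gamma = train([("parser translation language", "NLP"),
               ("planning temporal reasoning", "Planning"),
               ("tagging parsing corpora", "NLP")])
print(gamma("label translation parser"))   # -> "NLP" (majority baseline)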

Page 7:

Naïve Bayes Classifier

MAP is "maximum a posteriori" = most likely class:

  c_MAP = argmax_{c ∈ C} P(c | d)

Bayes Rule:

  c_MAP = argmax_{c ∈ C} P(d | c) P(c) / P(d)

Dropping the denominator:

  c_MAP = argmax_{c ∈ C} P(d | c) P(c)

Document d represented as features x1, …, xn:

  c_MAP = argmax_{c ∈ C} P(x1, …, xn | c) P(c)

Page 8:

Multinomial Naïve Bayes: Assumptions

• Bag of Words assumption: Assume position doesn’t matter

• Conditional Independence: Assume the feature probabilities P(xi|cj) are independent given the class c.

Page 9:

Multinomial Naïve Bayes: Assumptions

Together these assumptions give

  P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · … · P(xn | c)

so the classifier becomes

  c_NB = argmax_{c ∈ C} P(c) · Π_i P(xi | c)

where P(c) and each P(xi | c) are estimated from data.

Page 10:

Multinomial Naïve Bayes: Learning

• Maximum likelihood estimates
  – simply use the frequencies in the data
  – use unique words wi as features (in place of xi)

  P̂(cj) = doc-count(C = cj) / N_doc

  P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
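A minimal sketch of these maximum likelihood estimates in Python, assuming a tiny toy corpus; the variable names and example documents are made up for illustration.

from collections import Counter, defaultdict

# Toy training set: (document, class) pairs; purely illustrative.
training = [("fun couple love love", "positive"),
            ("fast furious shoot", "negative"),
            ("couple fly fast fun fun", "positive")]

# P-hat(c) = doc-count(C = c) / N_doc
n_doc = len(training)
prior = Counter(c for _, c in training)
p_class = {c: prior[c] / n_doc for c in prior}

# P-hat(w | c) = count(w, c) / sum over the vocabulary of count(w', c)
word_counts = defaultdict(Counter)          # class -> word frequencies
for doc, c in training:
    word_counts[c].update(doc.split())

p_word = {c: {w: cnt / sum(word_counts[c].values())
              for w, cnt in word_counts[c].items()}
          for c in word_counts}

print(p_class)                    # {'positive': 2/3, 'negative': 1/3}
print(p_word["positive"]["fun"])  # 3 of the 9 positive-class tokens -> 1/3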

Page 11:

Multinomial Naïve Bayes: Learning

• Calculate P(cj) terms
  – For each cj in C:
      docsj ← all docs with class = cj
      P(cj) ← |docsj| / |total # documents|

Page 12:

Multinomial Naïve Bayes: Learning

• Maximum likelihood estimates
  – simply use the frequencies in the data
  – use unique words wi as features (in place of xi)

  P̂(cj) = doc-count(C = cj) / N_doc

  P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)

Page 13:

Multinomial Naïve Bayes: Learning

• Create a mega-document for topic j by concatenating all docs in this topic
  – Use the frequency of w in the mega-document

  P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)

  = fraction of times word wi appears among all words in documents of topic cj

Page 14:

Zero Probabilities Problem

• What if we have seen no training documents with the word fantastic that are classified in the topic positive?

• Zero probabilities cannot be conditioned away, no matter the other evidence!

Page 15:

Problem with Maximum Likelihood

• What if we have seen no training documents with the word fantastic that are classified in the topic positive?

Page 16:

Laplace (add-1) smoothing for Naïve Bayes

Unsmoothed maximum likelihood estimate:

  P̂(wi | c) = count(wi, c) / Σ_{w ∈ V} count(w, c)

With add-1 smoothing to avoid zero probabilities:

  P̂(wi | c) = ( count(wi, c) + 1 ) / ( ( Σ_{w ∈ V} count(w, c) ) + |V| )

• For unknown words (words that do not occur in the training set at all), we can simply ignore them.

Page 17:

Multinomial Naïve Bayes: Learning

• From the training corpus, extract Vocabulary
• Calculate P(wk | cj) terms
  – Textj ← single document containing all docsj
  – For each word wk in Vocabulary:
      nk ← # of occurrences of wk in Textj
      n  ← total # of word tokens in Textj
      P(wk | cj) ← (nk + α) / (n + α · |Vocabulary|)

smoothing to avoid zero probabilities (often use α = 1)
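A sketch of this learning procedure in Python with add-α smoothing. The function name train_nb, the (document, class) input format, and storing the estimates as logs are assumptions for illustration, not part of the slides.

import math
from collections import Counter, defaultdict

def train_nb(training, alpha=1.0):
    """Train a multinomial Naive Bayes model from (document, class) pairs."""
    # Extract Vocabulary from the training corpus.
    vocab = {w for doc, _ in training for w in doc.split()}

    # Prior: P(c) = |docs with class c| / total number of documents.
    n_doc = len(training)
    log_prior = {c: math.log(cnt / n_doc)
                 for c, cnt in Counter(c for _, c in training).items()}

    # Likelihood: concatenate all docs of class c into Text_c and count words.
    text = defaultdict(Counter)
    for doc, c in training:
        text[c].update(doc.split())

    # P(w | c) = (n_w + alpha) / (n + alpha * |Vocabulary|), stored as logs.
    log_likelihood = {}
    for c, counts in text.items():
        n = sum(counts.values())
        log_likelihood[c] = {w: math.log((counts[w] + alpha) /
                                         (n + alpha * len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab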

Page 18:

Exercise

Page 19:

Naïve Bayes: Practical Issues

• Multiplying together lots of probabilities
• Probabilities are numbers between 0 and 1

Q: What could go wrong here?
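What goes wrong is floating point underflow: the product of many probabilities in (0, 1) quickly falls below the smallest representable positive double. A quick illustration (the specific values are arbitrary):

import math

p = 0.1
product = 1.0
for _ in range(400):
    product *= p          # multiply together lots of probabilities
print(product)            # 0.0 -- underflowed, although the true value is 10**-400

log_sum = 400 * math.log(p)
print(log_sum)            # about -921.03, perfectly representable in log space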

Page 20:

Underflow

Page 21:

Floating Point Numbers

Page 22:

Floating Point Numbers

Page 23:

Working with Probabilities in Log Space

Page 24:

Log Identities (review)

  log(xy) = log(x) + log(y)
  log(x/y) = log(x) - log(y)
  log(x^n) = n · log(x)

Page 25:

Naïve Bayes with Log Probabilities

Instead of multiplying probabilities, sum their logs:

  c_NB = argmax_{c ∈ C} [ log P(c) + Σ_i log P(xi | c) ]

This gives the same argmax, because log is monotonically increasing.

Page 26:

Naïve Bayes with Log Probabilities

Q: Why don’t we have to worry about floating point underflow anymore?
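Because we only add log probabilities and never form the raw product, the running value stays in a range that doubles can represent. A prediction sketch in the same spirit, assuming the hypothetical train_nb model from the earlier sketch; unknown words are ignored, as on the smoothing slide.

def classify_nb(doc, log_prior, log_likelihood, vocab):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    words = [w for w in doc.split() if w in vocab]   # ignore unknown words
    scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in words)
              for c in log_prior}
    return max(scores, key=scores.get)

# Usage, with the model trained by the earlier sketch:
# log_prior, log_likelihood, vocab = train_nb(training)
# print(classify_nb("fun fast couple", log_prior, log_likelihood, vocab))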

Page 27:

Working with Probabilities in Log Space

  x                log(x)  (base 10)
  0.00000000001    -11
  0.00001           -5
  0.0001            -4
  0.001             -3
  0.01              -2
  0.1               -1

Page 28:

What if we want to calculate posterior log-probabilities?

  log P(c | x1, …, xn)
      = log P(c) + Σ_{i=1..n} log P(xi | c) - log [ Σ_{c′} P(c′) · Π_{i=1..n} P(xi | c′) ]

Page 29:

What if we want to calculate posterior log-probabilities?

  log P(c | x1, …, xn)
      = log P(c) + Σ_{i=1..n} log P(xi | c) - log [ Σ_{c′} P(c′) · Π_{i=1..n} P(xi | c′) ]

Page 30:

Log Exp Sum Trick

• We have: a bunch of log probabilities: log(p1), log(p2), log(p3), …, log(pn)
• We want: log(p1 + p2 + p3 + … + pn)
• We could convert back from log space, sum, then take the log.

Q: Is this a good idea?

Page 31:

Log Exp Sum Trick

• We have: a bunch of log probabilities: log(p1), log(p2), log(p3), …, log(pn)
• We want: log(p1 + p2 + p3 + … + pn)
• We could convert back from log space, sum, then take the log.

If the probabilities are very small, this will result in floating point underflow.

Page 32:

Log Exp Sum Trick
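The slide's equation did not survive extraction; the standard identity behind the trick is to factor out the largest term: log(p1 + … + pn) = m + log( Σ_i exp(log pi - m) ), where m = max_i log pi, so every exp() argument is at most 0. A minimal sketch (the function name is illustrative; scipy.special.logsumexp provides the same operation):

import math

def log_sum_exp(log_probs):
    """Compute log(p1 + ... + pn) from [log(p1), ..., log(pn)] without underflow."""
    m = max(log_probs)                       # factor out the largest term
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs))

# The naive route underflows: exp(-800) is 0.0 in double precision,
# so summing and taking the log would fail.  The trick gives the right answer:
print(log_sum_exp([-800.0, -801.0]))         # about -799.69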

Page 33:

Another issue: Smoothing

Page 34:

Another issue: Smoothing

Think of alpha as a “pseudo-count”: an imaginary number of times this word has been seen.

Page 35:

Another issue: Smoothing

Alpha doesn't necessarily need to be 1 (it is a hyperparameter).

Hyperparameters are parameters that cannot be directly learned from the standard training process and need to be predefined.

Page 36:

Another issue: Smoothing

• What if alpha = 0?
• What if alpha = 0.000001?
• What happens as alpha gets very large?
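A tiny numeric illustration of how α moves the smoothed estimate P(w | c) = (count(w, c) + α) / (n + α·|V|) between the raw frequencies and the uniform distribution; the counts below are made up.

# Suppose class c has n = 100 word tokens and |V| = 10,000 word types,
# "fantastic" was seen 0 times and "great" was seen 10 times.
n, V = 100, 10_000

for alpha in [0.0, 1e-6, 1.0, 1000.0]:
    p_fantastic = (0 + alpha) / (n + alpha * V)
    p_great     = (10 + alpha) / (n + alpha * V)
    print(f"alpha={alpha:g}: P(fantastic|c)={p_fantastic:.2e}, P(great|c)={p_great:.2e}")

# alpha = 0        -> unseen words get probability exactly 0 (the MLE problem)
# alpha = 0.000001 -> unseen words get a tiny but nonzero probability
# alpha very large -> every word is pushed toward the uniform value 1/|V|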

Page 37:

Overfitting

• Model parameters fit the training data well, but generalize poorly to test data
• How to check for overfitting?
  – Compare training vs. test accuracy
• The pseudo-count parameter (alpha) combats overfitting

Page 38:

How to pick Alpha?

• Split train vs. test (dev)
• Try a bunch of different values
• Pick the value of alpha that performs best
• What values to try?
  – Grid search: (10^-2, 10^-1, …, 10^2)

[Figure: accuracy plotted for each alpha value; the annotation "Use this one" marks the best-performing alpha.]
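A sketch of this grid search, reusing the hypothetical train_nb and classify_nb functions from the earlier sketches; the helper name pick_alpha and the accuracy computation are illustrative.

def pick_alpha(train_set, dev_set, alphas=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Grid-search alpha: fit on train, score on dev, keep the best value."""
    best_alpha, best_acc = None, -1.0
    for alpha in alphas:
        log_prior, log_lik, vocab = train_nb(train_set, alpha=alpha)   # fit on train only
        acc = sum(classify_nb(d, log_prior, log_lik, vocab) == c
                  for d, c in dev_set) / len(dev_set)                  # evaluate on dev only
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc        # "use this one"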

Page 39:

Data Splitting

• Train vs. Test

• Better:
  – Train (used for fitting model parameters)
  – Dev (used for tuning hyperparameters)
  – Test (reserved for final evaluation)

• Cross-validation
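One minimal way to carve a single labeled corpus into these three pieces; the 80/10/10 proportions and the function name are just an example.

import random

def split_data(examples, seed=0):
    """Shuffle and split labeled examples into train / dev / test (80% / 10% / 10%)."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    train = examples[:n_train]                       # fit model parameters
    dev   = examples[n_train:n_train + n_dev]        # tune hyperparameters (e.g. alpha)
    test  = examples[n_train + n_dev:]               # reserve for final evaluation
    return train, dev, test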

Page 40:

Feature Engineering

• What is your word / feature representation?
  – Tokenization rules: splitting on whitespace?
  – Uppercase is the same as lowercase?
  – Numbers?
  – Punctuation?
  – Stemming?
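Each of these questions is a concrete decision inside the feature-extraction code. A small sketch of one possible set of choices: lowercase everything, collapse digit strings to a single token, drop punctuation, split on whitespace, no stemming; none of these choices are prescribed by the slides.

import re

def tokenize(text):
    """One possible word/feature representation; every step is a feature-engineering choice."""
    text = text.lower()                          # treat uppercase the same as lowercase
    text = re.sub(r"\d+", " <NUM> ", text)       # collapse numbers into one symbol
    text = re.sub(r"[^\w<>\s]", " ", text)       # drop punctuation
    return text.split()                          # split on whitespace

print(tokenize("Due 11:59pm, Friday, Feb 3rd!"))
# ['due', '<NUM>', '<NUM>', 'pm', 'friday', 'feb', '<NUM>', 'rd']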