SI485i : NLP Set 5 Using Naïve Bayes


Page 1:

SI485i : NLP

Set 5

Using Naïve Bayes

Page 2:

Motivation

• We want to predict something.

• We have some text related to this something.

• something = target label Y

• text = text features X

Given X, what is the most probable Y?

Page 3:

Motivation: Author Detection

Alas the day! take heed of him; he stabbed me in

mine own house, and that most beastly: in good

faith, he cares not what mischief he does. If his

weapon be out: he will foin like any devil; he will

spare neither man, woman, nor child.

X = the passage above

Y = { Charles Dickens, William Shakespeare, Herman

Melville, Jane Austen, Homer, Leo Tolstoy }

Y = \arg\max_{y_k} P(Y = y_k)\, P(X \mid Y = y_k)

Page 4:

More Motivation

P(Y=spam | X=email) P(Y=worthy | X=review sentence)

Page 5:


The Naïve Bayes Classifier

• Recall Bayes rule:

P(Y_i \mid X_j) = \frac{P(X_j \mid Y_i)\, P(Y_i)}{P(X_j)}

• Which is short for:

P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\, P(Y = y_i)}{P(X = x_j)}

• We can re-write this as:

P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\, P(Y = y_i)}{\sum_k P(X = x_j \mid Y = y_k)\, P(Y = y_k)}

Remaining slides adapted from Tom Mitchell.
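As a quick illustration of the re-written form, here is a small Python sketch with made-up priors and likelihoods (the numbers are hypothetical, chosen only to show the arithmetic):

```python
# Hypothetical two-class example of the re-written Bayes rule.
# The priors and likelihoods are made-up numbers, for illustration only.
prior = {"spam": 0.3, "ham": 0.7}          # P(Y = y_k)
likelihood = {"spam": 0.05, "ham": 0.001}  # P(X = x_j | Y = y_k) for one observed x_j

# Denominator: sum_k P(X = x_j | Y = y_k) * P(Y = y_k)
evidence = sum(prior[y] * likelihood[y] for y in prior)

# Posterior P(Y = spam | X = x_j)
posterior_spam = prior["spam"] * likelihood["spam"] / evidence
print(posterior_spam)  # ~0.955
```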

Page 6:

Deriving Naïve Bayes

• Idea: use the training data to directly estimate:

P(X \mid Y) and P(Y)

• We can use these values to estimate P(Y \mid X^{new}) using Bayes rule.

• Recall that representing the full joint probability

P(X \mid Y) = P(X_1, X_2, \ldots, X_n \mid Y)

is not practical.

Page 7:

Deriving Naïve Bayes

• However, if we make the assumption that the attributes

are independent, estimation is easy!

• In other words, we assume all attributes are conditionally

independent given Y.

• Often this assumption is violated in practice, but more on that

later…

P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)

Page 8:

Deriving Naïve Bayes

• Let X_1, \ldots, X_n and label Y be discrete.

• Then, we can estimate P(X_i \mid Y) and P(Y) directly from the training data by counting!

Sky    Temp  Humid   Wind    Water  Forecast  Play?
sunny  warm  normal  strong  warm   same      yes
sunny  warm  high    strong  warm   same      yes
rainy  cold  high    strong  warm   change    no
sunny  warm  high    strong  cool   change    yes

P(Sky = sunny | Play = yes) = ?

P(Humid = high | Play = yes) = ?
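As a concrete check, here is a minimal Python sketch that does this counting over the toy table above (the table is hard-coded from the slide; variable names are mine):

```python
# The toy weather table from this slide, hard-coded as
# (Sky, Temp, Humid, Wind, Water, Forecast, Play?) tuples.
data = [
    ("sunny", "warm", "normal", "strong", "warm", "same",   "yes"),
    ("sunny", "warm", "high",   "strong", "warm", "same",   "yes"),
    ("rainy", "cold", "high",   "strong", "warm", "change", "no"),
    ("sunny", "warm", "high",   "strong", "cool", "change", "yes"),
]

yes_rows = [row for row in data if row[-1] == "yes"]

# P(Sky = sunny | Play = yes): fraction of "yes" rows with Sky = sunny
p_sunny_given_yes = sum(r[0] == "sunny" for r in yes_rows) / len(yes_rows)

# P(Humid = high | Play = yes): fraction of "yes" rows with Humid = high
p_high_given_yes = sum(r[2] == "high" for r in yes_rows) / len(yes_rows)

print(p_sunny_given_yes)  # 3/3 = 1.0
print(p_high_given_yes)   # 2/3 ≈ 0.67
```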

Page 9:

The Naïve Bayes Classifier

• Now we have:

P(Y = y_j \mid X_1, \ldots, X_n) = \frac{P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}{\sum_k P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}

• To classify a new point X^{new}:

Y^{new} = \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)

Page 10:

The Naïve Bayes Algorithm

• For each value yk:

  • Estimate P(Y = yk) from the data.

  • For each value xij of each attribute Xi:

    • Estimate P(Xi = xij | Y = yk)

• Classify a new point via:

Y^{new} = \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)

• In practice, the independence assumption doesn’t often hold true, but Naïve Bayes performs very well despite it.
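A minimal Python sketch of this algorithm, assuming each training example is a dict mapping attribute names to values (the function and variable names are mine, not course code):

```python
from collections import Counter, defaultdict

def train_nb(examples, labels):
    """Estimate P(Y = yk) and P(Xi = xij | Y = yk) by counting.
    `examples` is a list of dicts mapping attribute name -> value."""
    label_counts = Counter(labels)
    cond_counts = defaultdict(Counter)   # (label, attr) -> Counter of values
    for x, y in zip(examples, labels):
        for attr, value in x.items():
            cond_counts[(y, attr)][value] += 1
    priors = {y: c / len(labels) for y, c in label_counts.items()}
    return priors, cond_counts, label_counts

def classify_nb(x_new, priors, cond_counts, label_counts):
    """Return argmax_yk  P(Y = yk) * prod_i P(Xi | Y = yk).
    No smoothing here, so an unseen attribute value zeroes out a class."""
    best_label, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for attr, value in x_new.items():
            score *= cond_counts[(y, attr)][value] / label_counts[y]
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```

Run on the toy weather table from the earlier slide, train_nb would reproduce exactly the counts computed there (e.g. P(Sky = sunny | Play = yes) = 3/3).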

Page 11:

An alternate view of NB as LMs

Y1 = dickens Y2 = twain

P(Y1) * P(X | Y1) P(Y2) * P(X | Y2)

P(X | Y1) = PY1(X) P(X | Y2) = PY2(X)

Bigrams: P_{Y1}(X) = \prod_i P_{Y1}(x_i \mid x_{i-1})

Bigrams: P_{Y2}(X) = \prod_i P_{Y2}(x_i \mid x_{i-1})
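Seen this way, classification is just asking each author's language model for a probability. A sketch, assuming a hypothetical per-author model object with a logprob(tokens) method (that interface is an assumption, not the Lab 2 API):

```python
import math

def classify_author(tokens, models, priors):
    """argmax_y  P(Y = y) * P_y(tokens), where each P_y is that author's
    language model.  models[y].logprob(tokens) is assumed to return
    log P_y(tokens) under a bigram model -- a hypothetical interface,
    not a given API."""
    best_author, best_score = None, float("-inf")
    for author, model in models.items():
        score = math.log(priors[author]) + model.logprob(tokens)
        if score > best_score:
            best_author, best_score = author, score
    return best_author
```

Adding log probabilities rather than multiplying raw ones keeps long documents from underflowing to zero.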

Page 12:

Naïve Bayes Applications

• Text classification

• Which e-mails are spam?

• Which e-mails are meeting notices?

• Which author wrote a document?

• Which webpages are about current events?

• Which blog contains angry writing?

• What sentence in a document talks about company X?

• etc.


Page 13:

Text and Features

• What is Xi?

• Could be unigrams, hopefully bigrams too.

• It can be anything that is computed from the text X.

• Yes, I really mean anything. Creativity and intuition into

language is where the real gains come from in NLP.

• Non n-gram examples:

• X10 = “the number of sentences that begin with conjunctions”

• X356 = “existence of a semi-colon in the paragraph”

P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)
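For example, the two non n-gram example features above (X10 and X356) might be computed like this rough sketch (the sentence splitting and feature names are simplifications I chose, not course code):

```python
def extract_features(paragraph):
    """Compute a couple of non n-gram features from raw text.
    Splitting on '.' is a crude stand-in for real sentence segmentation."""
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    conjunctions = {"and", "but", "or", "so", "yet"}
    return {
        # like X10: number of sentences that begin with a conjunction
        "SENTS-START-WITH-CONJ": sum(
            1 for s in sentences if s.split()[0].lower() in conjunctions
        ),
        # like X356: existence of a semicolon in the paragraph
        "FEAT-SEMICOLON": 1 if ";" in paragraph else 0,
    }
```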

Page 14:

Features

• In machine learning, “features” are the attributes to

which you assign weights (probabilities in Naïve

Bayes) that help in the final classification.

• Up until now, your features have been n-grams. You

now want to consider other types of features.

• You count features just like n-grams. How many did you see?

• X = set of features

• P(Y|X) = probability of a Y given a set of features

Page 15:

How do you count features?

• Feature idea: “a semicolon exists in this sentence”

• Count them:

• Count(“FEAT-SEMICOLON”, 1)

• Make up a unique name for the feature, then count!

• Compute probability:

• P(“FEAT-SEMICOLON” | author=“dickens”) =

Count(“FEAT-SEMICOLON”) / (# dickens sentences)
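In code, that count-then-divide step could look like the following sketch (the function and feature names are illustrative, not from the lab handout):

```python
from collections import Counter

def feature_probs(sentences):
    """Estimate P(feature | author) for binary per-sentence features,
    given all of one author's training sentences (e.g. the Dickens text)."""
    counts = Counter()
    for s in sentences:
        if ";" in s:
            counts["FEAT-SEMICOLON"] += 1
    # P("FEAT-SEMICOLON" | author) = Count("FEAT-SEMICOLON") / (# sentences)
    return {feat: c / len(sentences) for feat, c in counts.items()}
```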

Page 16:

Authorship Lab

1. Figure out how to use your Language Models from

Lab 2. They can be your initial features.

• Can you train() a model on one author’s text?

2. P(dickens | text) = P(dickens) * PBigramModel(text)

3. New code for new features. Call your language

models, get a probability, and then multiply new

feature probabilities.
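Putting steps 1-3 together, scoring one candidate author might look roughly like this (the bigram-model interface, the feature name, and the smoothing floor are assumptions about your own Lab 2 code, not a given API):

```python
import math

def author_score(text_tokens, sentences, prior, bigram_model, feat_probs):
    """log of  P(author) * P_bigram(text) * prod_f P(f | author)  for the
    features that fire on this text.  bigram_model.logprob() and the
    feature names are placeholders for your own Lab 2 code."""
    score = math.log(prior) + bigram_model.logprob(text_tokens)
    if any(";" in s for s in sentences):
        # small floor keeps a never-seen feature from sending the score to -inf
        score += math.log(feat_probs.get("FEAT-SEMICOLON", 1e-6))
    return score
```

Compute this score once per candidate author and pick the largest.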