CSE 446: Naïve Bayes
Winter 2012
Dan Weld
Some slides from Carlos Guestrin, Luke Zettlemoyer & Dan Klein
Today
• Gaussians
• Naïve Bayes
• Text Classification
Long Ago

• Random variables, distributions
• Marginal, joint & conditional probabilities
• Sum rule, product rule, Bayes rule
• Independence, conditional independence
Last Time

  Estimate                         Prior      Hypothesis
  Maximum Likelihood Estimate      Uniform    The most likely
  Maximum A Posteriori Estimate    Any        The most likely
  Bayesian Estimate                Any        Weighted combination
Bayesian Learning

Use Bayes rule:

  P(θ | D) = P(D | θ) P(θ) / P(D)

Or equivalently:

  P(θ | D) ∝ P(D | θ) P(θ)

Here P(θ) is the prior, P(D | θ) is the data likelihood, P(D) is the normalization constant, and P(θ | D) is the posterior.
Conjugate Priors?
Those Pesky Distributions

                                      Binary {0, 1}    M values
  Single event                        Bernoulli
  Sequence (N trials, N = H + T)      Binomial         Multinomial
  Conjugate prior                     Beta             Dirichlet

(These are the discrete cases; continuous variables are covered next.)
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians

• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – X ~ N(μ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
• Sum of independent Gaussians is Gaussian
  – X ~ N(μX, σ²X), Y ~ N(μY, σ²Y)
  – Z = X + Y  ⇒  Z ~ N(μX + μY, σ²X + σ²Y)
• Easy to differentiate, as we will see soon!
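A quick numerical sanity check of these two properties, as a sketch with NumPy (the particular parameters and sample size are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Affine transformation: X ~ N(2, 3^2), Y = aX + b
mu, sigma, a, b = 2.0, 3.0, 0.5, 1.0
x = rng.normal(mu, sigma, n)
y = a * x + b
print(y.mean(), y.std())   # ~ a*mu + b = 2.0, |a|*sigma = 1.5

# Sum of independent Gaussians: Z = X + W
w = rng.normal(-1.0, 2.0, n)
z = x + w
print(z.mean(), z.var())   # ~ mu + (-1) = 1.0, sigma^2 + 2^2 = 13
```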
Learning a Gaussian

• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: μ
  – Variance: σ²

  i     Exam score (xi)
  0     85
  1     95
  2     100
  3     12
  …     …
  99    89
MLE for Gaussian

• Probability of i.i.d. samples D = {x1, …, xN}:

  P(D | μ, σ) = ∏_{i=1..N} (1 / (σ√(2π))) exp( −(xi − μ)² / (2σ²) )

• Log-likelihood of the data:

  ln P(D | μ, σ) = −N ln(σ√(2π)) − Σ_{i=1..N} (xi − μ)² / (2σ²)
Your second learning algorithm: MLE for the mean of a Gaussian

• What's the MLE for the mean? Set the derivative of the log-likelihood with respect to μ to zero and solve:

  μ_MLE = (1/N) Σ_{i=1..N} xi
MLE for variance

• Again, set the derivative (now with respect to σ) to zero and solve:

  σ²_MLE = (1/N) Σ_{i=1..N} (xi − μ_MLE)²
Learning Gaussian parameters

• MLE:
  – μ_MLE = (1/N) Σi xi
  – σ²_MLE = (1/N) Σi (xi − μ_MLE)²
  – (Note: the MLE for the variance is biased; the unbiased estimator divides by N − 1 instead of N.)
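As a small illustration, here is a sketch of computing these estimates with NumPy, using only the exam scores visible in the toy table above (the elided rows are left out):

```python
import numpy as np

# Toy sample: the exam scores visible in the table above
scores = np.array([85.0, 95.0, 100.0, 12.0, 89.0])

n = len(scores)
mu_mle = scores.sum() / n                        # (1/N) * sum_i x_i
var_mle = ((scores - mu_mle) ** 2).sum() / n     # (1/N) * sum_i (x_i - mu)^2
var_unbiased = ((scores - mu_mle) ** 2).sum() / (n - 1)

print(f"MLE mean          = {mu_mle:.2f}")
print(f"MLE variance      = {var_mle:.2f}")
print(f"Unbiased variance = {var_unbiased:.2f}")
```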
Bayesian learning of Gaussian parameters

• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution
Supervised Learning of Classifiers: Find f

• Given: training set {(xi, yi) | i = 1 … n}
• Find: a good approximation to f : X → Y

Examples: what are X and Y?
• Spam detection
  – Map an email to {Spam, Ham}
• Digit recognition
  – Map pixels to {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
• Stock prediction
  – Map new, historic prices, etc. to ℝ (the real numbers)

The first two are classification tasks (Y is a discrete set of labels).
Bayesian Categorization

• Let the set of categories be {c1, c2, …, cn}.
• Let E be the description of an instance.
• Determine the category of E by computing, for each ci:

  P(ci | E) = P(ci) P(E | ci) / P(E)

• P(E) can be ignored, since it is a constant factor shared by all categories:

  P(ci | E) ∝ P(ci) P(E | ci)
Text classification
• Classify e-mails
  – Y = {Spam, NotSpam}
• Classify news articles
  – Y = {what is the topic of the article?}
• Classify webpages
  – Y = {Student, professor, project, …}
• What to use for features, X?
Example: Spam Filter
• Input: email
• Output: spam / ham
• Setup:
  – Get a large collection of example emails, each labeled “spam” or “ham”
  – Note: someone has to hand-label all this data!
  – Want to learn to predict labels of new, future emails
• Features: the attributes used to make the ham / spam decision
  – Words: FREE!
  – Text patterns: $dd, CAPS
  – For email specifically, semantic features: SenderInContacts
  – …

Example messages:
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Features: X is the word sequence in the document; Xi is the ith word in the article.
Features for Text Classification

• X is the sequence of words in the document
• X (and hence P(X|Y)) is huge!!!
  – An article has at least ~1,000 words, X = {X1, …, X1000}
  – Xi represents the ith word in the document; i.e., the domain of Xi is the entire vocabulary, e.g., Webster's Dictionary (or more): 10,000 words, etc.
  – Number of possible word sequences: 10,000^1000 = 10^4000
  – Atoms in the universe: ~10^80
  – We may have a problem…
Bag of Words Model

Typical additional assumption:
– Position in the document doesn't matter: P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
  (all positions have the same distribution)
– Ignore the order of words

When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
Bag of Words Model

in is lecture lecture next over person remember room sitting the the the to to up wake when you

Typical additional assumption:
– Position in the document doesn't matter: P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
  (all positions have the same distribution)
– Ignore the order of words
  • Sounds really silly, but often works very well!
– From now on:
  • Xi = Boolean: “word i is in the document”
  • X = X1 … Xn
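A minimal sketch of this Boolean bag-of-words representation (the vocabulary and the tokenizer here are illustrative choices, not part of the slides):

```python
import re

def boolean_bag_of_words(document, vocabulary):
    """Map a document to Boolean features: X_i = 1 iff word i appears in it."""
    tokens = set(re.findall(r"[a-z']+", document.lower()))
    return {word: int(word in tokens) for word in vocabulary}

vocab = ["lecture", "wake", "spam", "free"]
doc = ("When the lecture is over, remember to wake up the "
       "person sitting next to you in the lecture room.")
print(boolean_bag_of_words(doc, vocab))
# {'lecture': 1, 'wake': 1, 'spam': 0, 'free': 0}
```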
Bag of Words Approach
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
…
Zaire 0
Bayesian Categorization

• Need to know:
  – Priors: P(yi)
  – Conditionals: P(X | yi)

  P(yi | X) ∝ P(yi) P(X | yi)

• P(yi) is easily estimated from data:
  – If ni of the examples in D are in class yi, then P(yi) = ni / |D|
• Conditionals:
  – X = X1 … Xn
  – Estimate P(X1 … Xn | yi)
• Problem! Too many possible instances to estimate
  – (exponential in n)
  – Even with the bag-of-words assumption!
Need to Simplify Somehow

• Too many probabilities
  – e.g., P(x1, x2, x3 | yi) requires a value for every joint setting of the features:
    P(x1, x2, x3 | spam), P(¬x1, x2, x3 | spam), P(x1, ¬x2, x3 | spam), …, P(¬x1, ¬x2, ¬x3 | spam)
• Can we assume some are the same?
  – e.g., P(x1, x2) = P(x1) P(x2)?
Conditional Independence

• X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:

  (∀ x, y, z)  P(X = x | Y = y, Z = z) = P(X = x | Z = z)

• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

• Equivalent to:

  P(X, Y | Z) = P(X | Z) P(Y | Z)
Naïve Bayes

• Naïve Bayes assumption:
  – Features are independent given the class:

    P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

  – More generally:

    P(X1 … Xn | Y) = ∏i P(Xi | Y)

• How many parameters now? Suppose X is composed of n binary features: the full joint P(X1 … Xn | Y) needs on the order of 2^n parameters per class, while naïve Bayes needs only n per class (one Bernoulli parameter per feature).
The Naïve Bayes Classifier

• Given:
  – Prior P(Y)
  – n conditionally independent features X1, …, Xn given the class Y
  – For each Xi, we have the likelihood P(Xi | Y)
• Decision rule:

  y* = h_NB(x) = argmax_y P(y) ∏i P(xi | y)

(Graphical model: the class Y is the parent of the features X1, X2, …, Xn.)
MLE for the parameters of NB

• Given a dataset, count occurrences
• MLE for discrete NB is simply:
  – Prior: P(Y = y) = Count(Y = y) / N
  – Likelihood: P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Count(Y = y)
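A sketch of these counting estimates for Boolean features (the tiny labeled dataset below is made up purely for illustration):

```python
from collections import Counter, defaultdict

# Toy labeled data: (feature dict, class label) -- purely illustrative
data = [
    ({"free": 1, "meeting": 0}, "spam"),
    ({"free": 1, "meeting": 0}, "spam"),
    ({"free": 0, "meeting": 1}, "ham"),
    ({"free": 0, "meeting": 1}, "ham"),
    ({"free": 0, "meeting": 0}, "ham"),
]

class_counts = Counter(y for _, y in data)
feature_counts = defaultdict(Counter)        # (feature, class) -> counts of each value
for x, y in data:
    for feat, val in x.items():
        feature_counts[(feat, y)][val] += 1

N = len(data)
prior = {y: c / N for y, c in class_counts.items()}              # P(Y = y)
likelihood = {                                                   # P(X_i = 1 | Y = y)
    (feat, y): feature_counts[(feat, y)][1] / class_counts[y]
    for (feat, y) in feature_counts
}
print(prior)
print(likelihood)
```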
Subtleties of NB Classifier 1: Violating the NB Assumption

• Usually, features are not conditionally independent:

  P(X1 … Xn | Y) ≠ ∏i P(Xi | Y)

• Actual probabilities P(Y|X) are often biased towards 0 or 1
• Nonetheless, NB is the single most used classifier out there
  – NB often performs well, even when the assumption is violated
  – [Domingos & Pazzani ’96] discuss some conditions for good performance
Subtleties of NB classifier 2: Overfitting
For Binary Features: We already know the answer!

• MAP: use the most likely parameter, θ = argmax_θ P(θ | D)
• A Beta prior is equivalent to extra observations for each feature
• As N → ∞, the prior is “forgotten”
• But for small sample sizes, the prior is important!
That's Great for the Binomial Case

• Works for spam / ham
• What about multiple classes?
  – E.g., given a Wikipedia page, predict its type
Multinomials: Laplace Smoothing

• Laplace's estimate:
  – Pretend you saw every outcome k extra times:

    P_LAP,k(x) = (count(x) + k) / (N + k|X|)

  – What's Laplace with k = 0? The MLE.
  – k is the strength of the prior
  – Can derive this as a MAP estimate for a multinomial with a Dirichlet prior
• Laplace for conditionals:
  – Smooth each conditional distribution independently:

    P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)

• Example observations: H H T
  – With k = 0: P(H) = 2/3; with k = 1: P(H) = (2 + 1) / (3 + 2) = 3/5
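A small sketch of this estimate in code (the coin observations follow the H H T example above; everything else is an illustrative choice):

```python
from collections import Counter

def laplace_estimate(observations, outcomes, k=1):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |X|) for each outcome x."""
    counts = Counter(observations)
    n = len(observations)
    return {x: (counts[x] + k) / (n + k * len(outcomes)) for x in outcomes}

obs = ["H", "H", "T"]
print(laplace_estimate(obs, ["H", "T"], k=0))    # MLE: {'H': 0.667, 'T': 0.333}
print(laplace_estimate(obs, ["H", "T"], k=1))    # {'H': 0.6, 'T': 0.4}
print(laplace_estimate(obs, ["H", "T"], k=100))  # close to uniform
```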
Naïve Bayes for Text

• Model a document in a given category as a generated bag of words: repeatedly sample words with replacement from a vocabulary V = {w1, w2, …, wm} according to the probabilities P(wj | ci).
• Smooth the probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|
  – Equivalent to a virtual sample of seeing each word in each category exactly once.
Easy to Implement
• But…
• If you do… it probably won’t work…
Probabilities: Important Detail!

• P(spam | X1 … Xn) ∝ P(spam) ∏i P(Xi | spam)

Any more potential problems here?

• We are multiplying lots of small numbers → danger of underflow!
  – e.g., 0.5^57 ≈ 7 × 10^-18
• Solution? Use logs and add:
  – p1 · p2 = e^(log(p1) + log(p2))
  – Always keep probabilities in log form
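A sketch of the log-space trick (the numbers below are arbitrary small probabilities chosen only to make the underflow visible):

```python
import math

probs = [0.5] * 57 + [1e-280, 1e-60]   # lots of small factors

# Naive product underflows to 0.0 in double precision
naive = 1.0
for p in probs:
    naive *= p
print(naive)                            # 0.0

# Log space: sum the logs instead of multiplying the probabilities
log_score = sum(math.log(p) for p in probs)
print(log_score)                        # finite, roughly -822.4
```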
Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes
  – i.e., the class with maximum posterior probability…
  – Usually fairly accurate (?!?!?)
• However, due to the inadequacy of the conditional independence assumption…
  – Actual posterior-probability estimates are not accurate.
  – Output probabilities are generally very close to 0 or 1.
NB with Bag of Words for Text Classification

• Learning phase:
  – Prior P(Y): count how many documents come from each topic (prior)
  – P(Xi|Y): for each topic, count how many times you saw each word in documents of that topic (+ prior); remember this distribution is shared across all positions i
• Test phase:
  – For each document, use the naïve Bayes decision rule
Twenty News Groups results
Learning curve for Twenty News Groups
What if we have continuous Xi?

• E.g., character recognition: Xi is the ith pixel

• Gaussian Naïve Bayes (GNB):

  P(Xi = x | Y = yk) = (1 / (σik √(2π))) exp( −(x − μik)² / (2σik²) )

• Sometimes we assume the variance is
  – independent of Y (i.e., σi),
  – or independent of Xi (i.e., σk),
  – or both (i.e., σ)
Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates:

• Mean:

  μik = Σj Xi^j δ(Y^j = yk) / Σj δ(Y^j = yk)

• Variance:

  σ²ik = Σj (Xi^j − μik)² δ(Y^j = yk) / Σj δ(Y^j = yk)

where j indexes the jth training example and δ(x) = 1 if x is true, else 0.
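A sketch of these per-class, per-feature estimates with NumPy (the array shapes and the tiny dataset are illustrative, not from the slides):

```python
import numpy as np

# Toy data: rows are examples, columns are continuous features X_i
X = np.array([[2.1, 0.3],
              [1.9, 0.5],
              [6.0, 3.2],
              [5.8, 2.9]])
y = np.array([0, 0, 1, 1])            # discrete class labels

classes = np.unique(y)
mu = np.zeros((len(classes), X.shape[1]))
var = np.zeros_like(mu)
for k, c in enumerate(classes):
    Xc = X[y == c]                    # examples j with delta(Y^j = y_k) = 1
    mu[k] = Xc.mean(axis=0)           # mu_ik
    var[k] = Xc.var(axis=0)           # sigma^2_ik (MLE, divides by the class count)

print(mu)
print(var)
```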
Example: GNB for classifying mental states

• ~1 mm resolution
• ~2 images per sec.
• 15,000 voxels/image
• Non-invasive, safe
• Measures the Blood Oxygen Level Dependent (BOLD) response
• Typical impulse response: ~10 sec

[Mitchell et al.]
Brain scans can track activation with precision and sensitivity
[Mitchell et al.]
Gaussian Naïve Bayes: learned model per (voxel, word)
P(BrainActivity | WordCategory = {People, Animal})   [Mitchell et al.]
Learned Bayes Models: Means for P(BrainActivity | WordCategory)
• People words vs. animal words
• Pairwise classification accuracy: 85%

[Mitchell et al.]
Bayes Classifier is Optimal!

• Learn: h : X → Y
  – X: features
  – Y: target classes
• Suppose you know the true P(Y|X):
  – Bayes classifier:  h_Bayes(x) = argmax_y P(Y = y | X = x)
• Why?
Optimal classification
• Theorem: – Bayes classifier hBayes is optimal!
– Why?
What you need to know about Naïve Bayes

• Naïve Bayes classifier
  – What's the assumption
  – Why we use it
  – How do we learn it
  – Why Bayesian estimation is important
• Text classification
  – Bag of words model
• Gaussian NB
  – Features are still conditionally independent
  – Each feature has a Gaussian distribution given the class
• Optimal decision using the Bayes classifier
Text Naïve Bayes Algorithm (Train)

Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C:
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V:
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
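A compact Python sketch of this training procedure (the tokenizer and the tiny corpus are illustrative assumptions, not the course's reference code):

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train_text_nb(docs):
    """docs: list of (text, category). Returns vocabulary V, priors P(c), and P(w | c)."""
    vocab = sorted({w for text, _ in docs for w in tokenize(text)})
    categories = sorted({c for _, c in docs})
    priors, cond = {}, {}
    for c in categories:
        texts_c = [text for text, cat in docs if cat == c]
        priors[c] = len(texts_c) / len(docs)                  # P(c_i) = |D_i| / |D|
        counts = Counter(w for text in texts_c for w in tokenize(text))
        n_c = sum(counts.values())                            # n_i: word occurrences in T_i
        cond[c] = {w: (counts[w] + 1) / (n_c + len(vocab))    # (n_ij + 1) / (n_i + |V|)
                   for w in vocab}
    return vocab, priors, cond

corpus = [("free money, click now", "spam"),
          ("meeting at noon tomorrow", "ham"),
          ("free tickets, reply now", "spam")]
V, prior, likelihood = train_text_nb(corpus)
```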
Text Naïve Bayes Algorithm (Test)

Given a test document X
Let n be the number of word occurrences in X
Return the category:

    argmax_{ci ∈ C}  P(ci) ∏_{i=1..n} P(ai | ci)

where ai is the word occurring in the ith position in X
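Continuing the training sketch above, the test phase can be written in log space to avoid the underflow discussed earlier (again an illustrative sketch; it reuses tokenize, V, prior, and likelihood from the training code):

```python
import math

def classify_text_nb(text, vocab, priors, cond):
    """Return argmax over categories of log P(c) + sum_i log P(a_i | c)."""
    vocab_set = set(vocab)
    words = [w for w in tokenize(text) if w in vocab_set]   # tokenize() from the training sketch
    scores = {c: math.log(p_c) + sum(math.log(cond[c][w]) for w in words)
              for c, p_c in priors.items()}
    return max(scores, key=scores.get)

print(classify_text_nb("free money now", V, prior, likelihood))   # -> 'spam' on the toy corpus
```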
Naïve Bayes Time Complexity

• Training time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
  – Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  – Generally just O(|D| Ld), since usually |C||V| < |D| Ld.
• Test time: O(|C| Lt), where Lt is the average length of a test document.
• Very efficient overall: linearly proportional to the time needed to just read in all the data.
Multi-Class Categorization

• Pick the category with max probability
• Create many 1-vs-other classifiers
• Use a hierarchical approach (wherever a hierarchy is available):

  Entity
    Person
      Scientist
      Artist
    Location
      City
      County
      Country