Bayesian methods & Naïve Bayes Lecture 18
David Sontag New York University
Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate
Beta prior distribution – P(θ)

$$P(\theta) = \mathrm{Beta}(\beta_H, \beta_T) \propto \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}$$

• The posterior distribution:

$$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$$

With a uniform prior, $P(\theta) \propto 1$, the posterior is proportional to the likelihood alone. With the Beta prior, writing $\alpha_H$ and $\alpha_T$ for the observed numbers of heads and tails:

$$P(\theta \mid D) \propto \theta^{\alpha_H}(1-\theta)^{\alpha_T}\cdot\theta^{\beta_H-1}(1-\theta)^{\beta_T-1} = \theta^{\alpha_H+\beta_H-1}(1-\theta)^{\alpha_T+\beta_T-1} = \mathrm{Beta}(\alpha_H+\beta_H,\ \alpha_T+\beta_T)$$
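The conjugate update above is just addition of counts; a minimal numerical sketch (function names are ours, example counts are arbitrary):

```python
def beta_posterior(alpha_h, alpha_t, beta_h, beta_t):
    """Posterior Beta parameters after observing alpha_h heads and
    alpha_t tails under a Beta(beta_h, beta_t) prior."""
    return beta_h + alpha_h, beta_t + alpha_t

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# 3 heads and 1 tail observed, with a symmetric Beta(2, 2) prior
a, b = beta_posterior(3, 1, 2, 2)    # -> Beta(5, 3)
print(a, b, beta_mean(a, b))         # posterior mean = 5/8
```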
Using Bayesian inference for prediction

• We now have a distribution over parameters
• For any specific f, a function of interest, compute the expected value of f:

$$E[f(\theta)] = \int_0^1 f(\theta)\,P(\theta \mid D)\,d\theta$$

• Integral is often hard to compute
• As more data is observed, the posterior is more concentrated
• MAP (Maximum a posteriori approximation): use most likely parameter to approximate the expectation

$$E[f(\theta)] \approx f(\hat{\theta}_{MAP})$$
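To see in what sense MAP is an approximation, one can compare the exact expectation with the value at the posterior mode for a Beta posterior; a sketch using simple midpoint-rule integration (the helper functions and example parameters are ours):

```python
def beta_pdf_unnorm(theta, a, b):
    """Unnormalized Beta(a, b) density."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1)

def expect_theta(a, b, steps=100_000):
    """E[theta] under Beta(a, b), via midpoint-rule numerical integration."""
    num = den = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        w = beta_pdf_unnorm(t, a, b)
        num += t * w
        den += w
    return num / den

a, b = 5, 3
exact_mean = a / (a + b)            # 0.625
map_est = (a - 1) / (a + b - 2)     # mode of the Beta posterior
print(expect_theta(a, b), exact_mean, map_est)
```

The numerical integral matches the closed-form mean, while the MAP point (2/3 here) differs; the two get closer as the posterior concentrates.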
MAP for Beta distribution

• MAP: use most likely parameter:

$$\hat{\theta} = \arg\max_\theta P(\theta \mid D) = \frac{\alpha_H+\beta_H-1}{\alpha_H+\beta_H+\alpha_T+\beta_T-2}$$

• Beta prior equivalent to extra thumbtack flips
• As N → ∞, prior is "forgotten"
• But, for small sample size, prior is important!
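A quick numerical illustration of the prior being "forgotten" (the prior strength and the 80%-heads data below are our own example numbers):

```python
def map_beta(alpha_h, alpha_t, beta_h, beta_t):
    """MAP estimate of the heads probability under a Beta prior."""
    return (alpha_h + beta_h - 1) / (alpha_h + beta_h + alpha_t + beta_t - 2)

beta_h = beta_t = 10                 # fairly strong Beta(10, 10) prior
for n in (10, 100, 10_000):
    heads = int(0.8 * n)             # data with 80% heads
    tails = n - heads
    print(n, map_beta(heads, tails, beta_h, beta_t))
# the estimate is pulled toward 0.5 for small n, and approaches 0.8 as n grows
```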
What about continuous variables?

• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians

• Affine transformations (multiplying by a scalar and adding a constant) of Gaussians are Gaussian
  – X ~ N(µ, σ²), Y = aX + b ⇒ Y ~ N(aµ + b, a²σ²)
• Sum of independent Gaussians is Gaussian
  – X ~ N(µ_X, σ²_X), Y ~ N(µ_Y, σ²_Y), Z = X + Y ⇒ Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)
• Easy to differentiate, as we will see soon!
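Both closure properties can be checked empirically; a small simulation using only the standard library (the seed, sample size, and parameter values are arbitrary choices of ours):

```python
import random

random.seed(0)
N = 200_000
x = [random.gauss(2.0, 3.0) for _ in range(N)]   # X ~ N(2, 9)

# Affine transformation: Y = aX + b should be N(a*mu + b, a^2 * sigma^2)
a, b = 0.5, 1.0
y = [a * xi + b for xi in x]
mean_y = sum(y) / N
var_y = sum((yi - mean_y) ** 2 for yi in y) / N
print(mean_y, var_y)   # close to 2.0 and 2.25

# Sum of independent Gaussians: Z = X + W ~ N(mu_X + mu_W, var_X + var_W)
w = [random.gauss(-1.0, 2.0) for _ in range(N)]  # W ~ N(-1, 4)
z = [xi + wi for xi, wi in zip(x, w)]
mean_z = sum(z) / N
var_z = sum((zi - mean_z) ** 2 for zi in z) / N
print(mean_z, var_z)   # close to 1.0 and 13.0
```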
Learning a Gaussian

• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: µ
  – Variance: σ²

  i   | Exam Score
  ----|-----------
  0   | 85
  1   | 95
  2   | 100
  3   | 12
  …   | …
  99  | 89
MLE for Gaussian

• Prob. of i.i.d. samples D = {x1,…,xN}:

$$P(D \mid \mu,\sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{N}\prod_{i=1}^{N} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

• Log-likelihood of data:

$$\ln P(D \mid \mu,\sigma) = -N\ln\!\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2}$$

$$\mu_{MLE},\ \sigma_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu,\sigma)$$
Your second learning algorithm: MLE for mean of a Gaussian

• What's MLE for mean? Setting the derivative of the log-likelihood to zero:

$$\frac{d}{d\mu}\ln P(D \mid \mu,\sigma) = \sum_{i=1}^{N} \frac{x_i-\mu}{\sigma^2} = 0 \;\Rightarrow\; \sum_{i=1}^{N} x_i - N\mu = 0 \;\Rightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
MLE for variance

• Again, set derivative to zero:

$$\frac{d}{d\sigma}\ln P(D \mid \mu,\sigma) = -\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x_i-\mu)^2}{\sigma^3} = 0 \;\Rightarrow\; \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu_{MLE})^2$$
Learning Gaussian parameters

• MLE:

$$\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu_{MLE})^2$$

• MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not true parameter!
  – Unbiased variance estimator:

$$\hat{\sigma}^2_{unbiased} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i-\mu_{MLE})^2$$
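The bias is easy to see numerically: repeatedly draw small samples from a known Gaussian and average the two estimators (the sample size n = 5 and trial count are our own choices):

```python
import random

random.seed(1)
true_var = 4.0                        # samples from N(0, sigma = 2)
n, trials = 5, 50_000

mle_sum = unbiased_sum = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    mu = sum(xs) / n
    ss = sum((x - mu) ** 2 for x in xs)
    mle_sum += ss / n                 # divide by N   -> biased
    unbiased_sum += ss / (n - 1)      # divide by N-1 -> unbiased

print(mle_sum / trials)               # ~ (n-1)/n * 4 = 3.2, not 4
print(unbiased_sum / trials)          # ~ 4.0
```

The MLE underestimates the variance by a factor (N−1)/N, which is why the correction matters for small samples.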
Bayesian learning of Gaussian parameters

• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for mean:
Naïve Bayes
Slides adapted from Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld
Bayesian Classification

• Problem statement:
  – Given features X1, X2, …, Xn
  – Predict a label Y
Example Application

• Digit Recognition
• X1,…,Xn ∈ {0,1} (black vs. white pixels)
• Y ∈ {0,1,2,3,4,5,6,7,8,9}

[figure: image of a digit → Classifier → 5]
The Bayes Classifier

• If we had the joint distribution on X1,…,Xn and Y, we could predict using:

$$\arg\max_y P(Y=y \mid X_1,\dots,X_n)$$

  – (for example: what is the probability that the image represents a 5 given its pixels?)
• So… how do we compute that?
The Bayes Classifier

• Use Bayes rule!

$$P(Y \mid X_1,\dots,X_n) = \frac{\overbrace{P(X_1,\dots,X_n \mid Y)}^{\text{likelihood}}\ \overbrace{P(Y)}^{\text{prior}}}{\underbrace{P(X_1,\dots,X_n)}_{\text{normalization constant}}}$$

• Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label
The Bayes Classifier

• Let's expand this for our digit recognition task:

$$P(Y=y \mid X_1,\dots,X_n) = \frac{P(X_1,\dots,X_n \mid Y=y)\,P(Y=y)}{\sum_{y'} P(X_1,\dots,X_n \mid Y=y')\,P(Y=y')}$$

• To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest
Model Parameters

• How many parameters are required to specify the likelihood, P(X1,…,Xn|Y)?
  – Supposing that each image is 30×30 pixels: for each of the 10 classes we need a full distribution over the 2^900 binary pixel vectors, i.e. 10·(2^900 − 1) free parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is that there are usually way too many parameters:
  – We'll run out of space
  – We'll run out of time
  – And we'll need tons of training data (which is usually not available)
Naïve Bayes

• Naïve Bayes assumption:
  – Features are independent given class:

$$P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\,P(X_2 \mid Y)$$

  – More generally:

$$P(X_1,\dots,X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$$

• How many parameters now?
• Suppose X is composed of n binary features
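The parameter counts can be made concrete for the digit task (30×30 binary pixels, 10 classes); the counting helpers below are our own illustration:

```python
def full_joint_params(n_features, n_classes):
    """Free parameters to specify P(X1..Xn | Y) explicitly:
    for each class, a distribution over 2^n binary vectors
    (2^n - 1 free parameters each, since they sum to 1)."""
    return n_classes * (2 ** n_features - 1)

def naive_bayes_params(n_features, n_classes):
    """Under the NB assumption: one Bernoulli parameter
    P(Xi = 1 | Y = y) per feature per class."""
    return n_classes * n_features

print(full_joint_params(900, 10))    # astronomically large (~10^272)
print(naive_bayes_params(900, 10))   # 9000
```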
The Naïve Bayes Classifier

• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have likelihood P(Xi|Y)
• Decision rule:

$$y^* = \arg\max_y P(y)\prod_{i=1}^{n} P(x_i \mid y)$$

• If the conditional independence assumption holds, NB is the optimal classifier! (it typically doesn't hold)

[graphical model: class node Y with arrows to feature nodes X1, X2, …, Xn]
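The decision rule can be sketched end to end; a toy binary-feature NB with hand-picked parameters (the probability values are illustrative, not from the slides), computed in log space to avoid underflow:

```python
import math

# Toy model: 2 classes, 3 binary features
prior = {0: 0.5, 1: 0.5}
# p_on[y][i] = P(X_i = 1 | Y = y)
p_on = {0: [0.9, 0.1, 0.5],
        1: [0.2, 0.8, 0.5]}

def nb_predict(x):
    """argmax_y  log P(y) + sum_i log P(x_i | y)."""
    best, best_score = None, -math.inf
    for y in prior:
        score = math.log(prior[y])
        for i, xi in enumerate(x):
            p = p_on[y][i] if xi == 1 else 1 - p_on[y][i]
            score += math.log(p)
        if score > best_score:
            best, best_score = y, score
    return best

print(nb_predict([1, 0, 1]))  # features typical of class 0 -> 0
print(nb_predict([0, 1, 0]))  # features typical of class 1 -> 1
```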
A Digit Recognizer
• Input: pixel grids
• Output: a digit 0-9
Naïve Bayes for Digits (Binary Inputs)

• Simple version:
  – One feature Fij for each grid position <i,j>
  – Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
  – Each input maps to a feature vector (one binary value per grid position)
  – Here: lots of features, each is binary valued
• Naïve Bayes model:

$$P(Y \mid F_{0,0},\dots) \propto P(Y)\prod_{i,j} P(F_{i,j} \mid Y)$$

• Are the features independent given class?
• What do we need to learn?
What has to be learned?

P(Y) (uniform over the ten digits):

  Y:    1    2    3    4    5    6    7    8    9    0
  P(Y): 0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1

P(F = on | Y), for one example pixel position:

  Y: 1     2     3     4     5     6     7     8     9     0
  P: 0.01  0.05  0.05  0.30  0.80  0.90  0.05  0.60  0.50  0.80

P(F = on | Y), for another pixel position:

  Y: 1     2     3     4     5     6     7     8     9     0
  P: 0.05  0.01  0.90  0.80  0.90  0.90  0.25  0.85  0.60  0.80
MLE for the parameters of NB

• Given dataset
  – Count(A=a, B=b) = number of examples where A=a and B=b
• MLE for discrete NB, simply:
  – Prior:

$$P(Y=y) = \frac{\mathrm{Count}(Y=y)}{\sum_{y'}\mathrm{Count}(Y=y')}$$

  – Observation distribution:

$$P(X_i=x \mid Y=y) = \frac{\mathrm{Count}(X_i=x,\ Y=y)}{\sum_{x'}\mathrm{Count}(X_i=x',\ Y=y)}$$
MLE for the parameters of NB
• Training amounts to, for each of the classes, averaging all of the examples together:
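"Averaging the examples together" per class is exactly the count-based MLE for binary features; a minimal sketch on a toy dataset (the data and labels are invented for illustration):

```python
from collections import defaultdict

# (features, label) pairs; features are binary, pixel-like values
data = [
    ([1, 1, 0], 'a'), ([1, 0, 0], 'a'), ([1, 1, 0], 'a'),
    ([0, 1, 1], 'b'), ([0, 1, 1], 'b'),
]

counts = defaultdict(int)
feat_sums = defaultdict(lambda: [0, 0, 0])
for x, y in data:
    counts[y] += 1
    for i, xi in enumerate(x):
        feat_sums[y][i] += xi

# Prior: class frequency
prior = {y: counts[y] / len(data) for y in counts}
# P(X_i = 1 | Y = y): the per-class average of feature i
likelihood = {y: [s / counts[y] for s in feat_sums[y]] for y in counts}

print(prior)        # {'a': 0.6, 'b': 0.4}
print(likelihood)   # 'a': [1.0, ~0.667, 0.0], 'b': [0.0, 1.0, 1.0]
```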
MAP estimation for NB

• Given dataset
  – Count(A=a, B=b) = number of examples where A=a and B=b
• MAP estimation for discrete NB, simply:
  – Prior:

$$P(Y=y) = \frac{\mathrm{Count}(Y=y)}{\sum_{y'}\mathrm{Count}(Y=y')}$$

  – Observation distribution:

$$P(X_i=x \mid Y=y) = \frac{\mathrm{Count}(X_i=x,\ Y=y) + a}{\sum_{x'}\mathrm{Count}(X_i=x',\ Y=y) + |X_i|\,a}$$

• Called "smoothing". Corresponds to a Dirichlet prior!
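The smoothed estimate can be sketched directly; here a is the pseudo-count and |X_i| is the number of values the feature can take (the helper functions are our own illustration):

```python
def mle_estimate(count_x_y, counts_all_x_y):
    """Unsmoothed MLE: a zero count gives zero probability."""
    return count_x_y / sum(counts_all_x_y)

def map_estimate(count_x_y, counts_all_x_y, a=1.0):
    """Laplace / Dirichlet smoothing: add pseudo-count a to the
    numerator and |X_i| * a to the denominator."""
    k = len(counts_all_x_y)          # |X_i|: number of values of X_i
    return (count_x_y + a) / (sum(counts_all_x_y) + k * a)

# Binary feature: X_i = 1 never observed with this class in 10 examples
counts = [10, 0]                     # Count(X_i=0, Y=y), Count(X_i=1, Y=y)
print(mle_estimate(0, counts))       # 0.0 -- zeroes out the whole NB product
print(map_estimate(0, counts))       # 1/12 with a = 1: never exactly zero
```

The point of smoothing is visible in the last two lines: the MLE assigns probability zero to any unseen feature value, which makes the NB product vanish, while the MAP estimate stays strictly positive.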