Maximum Likelihood Estimation



Page 1: Maximum Likelihood Estimation

Sep 6th, 2001
Copyright © 2001, 2004, Andrew W. Moore

Learning with Maximum Likelihood

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 2

Maximum Likelihood learning of Gaussians for Data Mining

• Why we should care
• Learning Univariate Gaussians
• Learning Multivariate Gaussians
• What’s a biased estimator?
• Bayesian Learning of Gaussians

Page 2: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 3

Why we should care

• Maximum Likelihood Estimation is a very very very very fundamental part of data analysis.
• “MLE for Gaussians” is training wheels for our future techniques.
• Learning Gaussians is more useful than you might guess…

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 4

Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don’t know µ (you do know σ²)

MLE: For which µ is x1, x2, … xR most likely?

MAP: Which µ maximizes p(µ | x1, x2, … xR, σ²)?

Page 3: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 5

Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don’t know µ (you do know σ²)

MLE: For which µ is x1, x2, … xR most likely?

MAP: Which µ maximizes p(µ | x1, x2, … xR, σ²)?

Sneer

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 6

Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don’t know µ (you do know σ²)

MLE: For which µ is x1, x2, … xR most likely?

MAP: Which µ maximizes p(µ | x1, x2, … xR, σ²)?

Sneer

Despite this, we’ll spend 95% of our time on MLE. Why? Wait and see…

Page 4: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 7

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don’t know µ (you do know σ²)
• MLE: For which µ is x1, x2, … xR most likely?

$$\mu_{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 8

Algebra Euphoria

$$\mu_{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$

= (by i.i.d.) …
= (monotonicity of log) …
= (plug in formula for Gaussian) …
= (after simplification) …

Page 5: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 9

Algebra Euphoria

$$\mu_{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$

(by i.i.d.)
$$= \arg\max_{\mu} \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2)$$

(monotonicity of log)
$$= \arg\max_{\mu} \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2)$$

(plug in formula for Gaussian)
$$= \arg\max_{\mu} \sum_{i=1}^{R} \left( \log\frac{1}{\sigma\sqrt{2\pi}} - \frac{(x_i - \mu)^2}{2\sigma^2} \right)$$

(after simplification)
$$= \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2$$

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 10

Intermission: A General Scalar MLE strategy

Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ = 0 for a maximum, creating an equation in terms of θ
4. Solve it*
5. Check that you’ve found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained

*This is a perfect example of something that works perfectly in all textbook examples and usually involves surprising pain if you need it for something new.
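To make the recipe concrete, here is a minimal sketch (my addition, not from the slides) that runs the five steps for the known-variance Gaussian case with sympy; the data values and σ² are made up for illustration.

```python
# A minimal sketch of the scalar MLE recipe, applied to the mean of a Gaussian
# with known variance. The sample and sigma^2 are invented for illustration.
import sympy as sp

x_data = [2.1, 1.7, 2.4, 1.9, 2.2]   # hypothetical observations x_1 .. x_R
sigma2 = 0.25                        # assumed known variance

mu = sp.symbols('mu', real=True)

# Step 1: LL = log P(Data | mu, sigma^2), dropping the additive constant term
LL = sum(-(xi - mu)**2 / (2 * sigma2) for xi in x_data)

# Step 2: work out dLL/dmu
dLL = sp.diff(LL, mu)

# Steps 3-4: set dLL/dmu = 0 and solve for mu
mu_mle = sp.solve(sp.Eq(dLL, 0), mu)[0]

# Step 5: the second derivative is negative, so this is a maximum
assert sp.diff(LL, mu, 2) < 0

print(mu_mle, sum(x_data) / len(x_data))   # both print the sample mean
```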

Page 6: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 11

The MLE µ

$$\mu_{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2$$

$$= \text{the } \mu \text{ s.t. } 0 = \frac{\partial LL}{\partial \mu} = \text{(what?)}$$

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 12

The MLE µ

$$\mu_{\text{mle}} = \arg\max_{\mu}\, p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu} \sum_{i=1}^{R} (x_i - \mu)^2$$

$$= \text{the } \mu \text{ s.t. } 0 = \frac{\partial LL}{\partial \mu} = \frac{\partial}{\partial \mu} \sum_{i=1}^{R} (x_i - \mu)^2 = -2 \sum_{i=1}^{R} (x_i - \mu)$$

Thus

$$\mu = \frac{1}{R} \sum_{i=1}^{R} x_i$$

Page 7: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 13

Lawks-a-lawdy!

$$\mu_{\text{mle}} = \frac{1}{R} \sum_{i=1}^{R} x_i$$

• The best estimate of the mean of a distribution is the mean of the sample!

At first sight: This kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their “Statistics” book and pick up their “Agent Based Evolutionary Data Mining Using The Neuro-Fuzz Transform” book.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 14

A General MLE strategy

Suppose θ = (θ1, θ2, …, θn)T is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus:

$$\frac{\partial LL}{\partial \boldsymbol{\theta}} = \begin{pmatrix} \partial LL / \partial \theta_1 \\ \partial LL / \partial \theta_2 \\ \vdots \\ \partial LL / \partial \theta_n \end{pmatrix}$$

Page 8: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 15

A General MLE strategy

Suppose θ = (θ1, θ2, …, θn)T is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations

$$\frac{\partial LL}{\partial \theta_1} = 0, \quad \frac{\partial LL}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial LL}{\partial \theta_n} = 0$$

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 16

A General MLE strategy

Suppose θ = (θ1, θ2, …, θn)T is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations

$$\frac{\partial LL}{\partial \theta_1} = 0, \quad \frac{\partial LL}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial LL}{\partial \theta_n} = 0$$

4. Check that you’re at a maximum

Page 9: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 17

A General MLE strategy

Suppose θ = (θ1, θ2, …, θn)T is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data | θ, stuff)
1. Write LL = log P(Data | θ, stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Solve the set of simultaneous equations

$$\frac{\partial LL}{\partial \theta_1} = 0, \quad \frac{\partial LL}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial LL}{\partial \theta_n} = 0$$

4. Check that you’re at a maximum

If you can’t solve them, what should you do?
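One standard answer (my addition, not spelled out on this slide) is to hand the log-likelihood to a generic numerical optimizer. A minimal sketch with scipy on simulated data, fitting both µ and σ of a univariate Gaussian:

```python
# A minimal sketch of "solve it numerically": maximize the Gaussian log-likelihood
# with a general-purpose optimizer instead of solving dLL/dtheta = 0 by hand.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=200)   # hypothetical sample

def neg_log_likelihood(theta):
    mu, log_sigma = theta                      # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum((x - mu) ** 2) / sigma**2 + x.size * np.log(sigma)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                       # close to x.mean() and x.std()
```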

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 18

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don’t know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

$$\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu)$$

$$\frac{\partial LL}{\partial \sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2$$

Page 10: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 19

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don’t know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu)$$

$$0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2$$

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 20

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don’t know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i - \mu)^2$$

$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i - \mu) \;\Rightarrow\; \mu = \frac{1}{R}\sum_{i=1}^{R} x_i$$

$$0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i - \mu)^2 \;\Rightarrow\; \text{(what?)}$$

Page 11: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 21

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don’t know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

$$\mu_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R} x_i \qquad\qquad \sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{\text{mle}})^2$$
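As a quick numerical check of these two formulas (my addition, not part of the slides), on simulated data:

```python
# A minimal sketch: the univariate Gaussian MLEs computed on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=3.0, size=1000)    # hypothetical sample, R = 1000

mu_mle = x.mean()                                # (1/R) * sum(x_i)
sigma2_mle = np.mean((x - mu_mle) ** 2)          # (1/R) * sum((x_i - mu_mle)^2)

print(mu_mle, sigma2_mle)                        # roughly 5 and 9
```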

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 22

Unbiased Estimators
• An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then

$$E[\mu_{\text{mle}}] = E\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu$$

so µ_mle is unbiased.

Page 12: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 23

Biased Estimators
• An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter.
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then

$$E[\sigma^2_{\text{mle}}] = E\left[\frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{\text{mle}})^2\right] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] \neq \sigma^2$$

so σ²_mle is biased.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 24

MLE Variance Bias
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then

$$E[\sigma^2_{\text{mle}}] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

Intuition check: consider the case of R = 1.

Why should our guts expect that σ²_mle would be an underestimate of the true σ²?

How could you prove that?
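One way to convince yourself short of a proof (my addition, not from the slides) is simulation: draw many small samples, compute σ²_mle for each, and compare the average to (1 − 1/R)σ².

```python
# A minimal sketch checking E[sigma2_mle] ≈ (1 - 1/R) * sigma^2 by simulation.
import numpy as np

rng = np.random.default_rng(2)
R, true_sigma2, trials = 5, 4.0, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_sigma2), size=(trials, R))
sigma2_mle = samples.var(axis=1, ddof=0)   # (1/R) * sum((x_i - sample mean)^2), per sample

print(sigma2_mle.mean())                   # ≈ (1 - 1/R) * 4.0 = 3.2
print((1 - 1/R) * true_sigma2)
```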

Page 13: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 25

Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then

$$E[\sigma^2_{\text{mle}}] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

So define

$$\sigma^2_{\text{unbiased}} = \frac{\sigma^2_{\text{mle}}}{1 - \frac{1}{R}} \qquad \text{so that} \qquad E[\sigma^2_{\text{unbiased}}] = \sigma^2$$

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 26

Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then

$$E[\sigma^2_{\text{mle}}] = E\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R} x_j\right)^2\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \neq \sigma^2$$

So define

$$\sigma^2_{\text{unbiased}} = \frac{\sigma^2_{\text{mle}}}{1 - \frac{1}{R}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu_{\text{mle}})^2 \qquad \text{so that} \qquad E[\sigma^2_{\text{unbiased}}] = \sigma^2$$
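In numpy terms (an aside of mine, not from the slides), σ²_mle and σ²_unbiased are simply the two `ddof` settings of `np.var`:

```python
# sigma2_mle vs sigma2_unbiased on the same sample; they differ by the factor R/(R-1).
import numpy as np

x = np.array([2.0, 4.0, 4.0, 7.0, 8.0])      # hypothetical sample, R = 5

sigma2_mle = np.var(x, ddof=0)                # divides by R
sigma2_unbiased = np.var(x, ddof=1)           # divides by R - 1

print(sigma2_mle, sigma2_unbiased, sigma2_mle / (1 - 1/len(x)))   # last two are equal
```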

Page 14: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 27

Unbiaseditude discussion
• Which is best?

$$\sigma^2_{\text{mle}} = \frac{1}{R}\sum_{i=1}^{R}(x_i - \mu_{\text{mle}})^2 \qquad\qquad \sigma^2_{\text{unbiased}} = \frac{1}{R-1}\sum_{i=1}^{R}(x_i - \mu_{\text{mle}})^2$$

Answer:
• It depends on the task
• And it doesn’t make much difference once R → large

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 28

Don’t get too excited about being unbiased
• Assume x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• Suppose we had these estimators for the mean:

$$\mu_{\text{suboptimal}} = \frac{1}{R+7}\sum_{i=1}^{R} x_i \qquad\qquad \mu_{\text{crap}} = x_1$$

Are either of these unbiased?

Will either of them asymptote to the correct value as R gets large?

Which is more useful?
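A small simulation sketch (my addition; it assumes the µ_suboptimal = Σxᵢ/(R+7) and µ_crap = x₁ reading of this slide) showing one estimator’s bias fading as R grows while the other’s spread never shrinks:

```python
# Compare the two estimators above (as reconstructed) across sample sizes.
import numpy as np

rng = np.random.default_rng(3)
true_mu, trials = 10.0, 100_000

for R in (5, 50, 500):
    x = rng.normal(loc=true_mu, scale=2.0, size=(trials, R))
    mu_suboptimal = x.sum(axis=1) / (R + 7)   # biased, but converges to true_mu as R grows
    mu_crap = x[:, 0]                         # unbiased, but no better for large R
    print(R, mu_suboptimal.mean(), mu_crap.mean(), mu_suboptimal.std(), mu_crap.std())
```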

Page 15: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 29

MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

$$\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T$$

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 30

MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

$$\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T$$

$$\mu_i^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R} x_{ki} \qquad \text{where } 1 \le i \le m$$

where x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and µ_i^mle is the ith component of µ^mle.

Page 16: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 31

MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

$$\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T$$

$$\sigma_{ij}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}(x_{ki} - \mu_i^{\text{mle}})(x_{kj} - \mu_j^{\text{mle}}) \qquad \text{where } 1 \le i \le m,\; 1 \le j \le m$$

where x_ki is the value of the ith component of x_k (the ith attribute of the kth record), and σ_ij^mle is the (i,j)th component of Σ^mle.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 32

MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

$$\boldsymbol{\mu}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \boldsymbol{\Sigma}^{\text{mle}} = \frac{1}{R}\sum_{k=1}^{R}(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T$$

Q: How would you prove this?
A: Just plug through the MLE recipe.

Note how Σ^mle is forced to be symmetric non-negative definite.

Note the unbiased case:

$$\boldsymbol{\Sigma}^{\text{unbiased}} = \frac{\boldsymbol{\Sigma}^{\text{mle}}}{1 - \frac{1}{R}} = \frac{1}{R-1}\sum_{k=1}^{R}(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})(\mathbf{x}_k - \boldsymbol{\mu}^{\text{mle}})^T$$

How many datapoints would you need before the Gaussian has a chance of being non-degenerate?
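Here is a minimal numpy sketch of these m-dimensional formulas (my addition, on simulated data, not from the slides):

```python
# The m-dimensional Gaussian MLE (and its unbiased variant) in numpy.
import numpy as np

rng = np.random.default_rng(4)
R, m = 200, 3
X = rng.normal(size=(R, m))                          # hypothetical R records, m attributes

mu_mle = X.mean(axis=0)                              # (1/R) * sum of the x_k vectors
centered = X - mu_mle
Sigma_mle = (centered.T @ centered) / R              # (1/R) * sum of outer products
Sigma_unbiased = (centered.T @ centered) / (R - 1)   # same as np.cov(X, rowvar=False)

print(mu_mle)
print(Sigma_mle)
```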

Page 17: Maximum Likelihood Estimation


Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 33

Confidence intervals

We need to talk

We need to discuss how accurate we expect µmle and Σmle to be as a function of R

And we need to consider how to estimate these accuracies from data…

• Analytically *

• Non-parametrically (using randomization and bootstrapping) *

But we won’t. Not yet.

*Will be discussed in future Andrew lectures…just before we need this technology.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 34

Structural error

Actually, we need to talk about something else too…

What if we do all this analysis when the true distribution is in fact not Gaussian?

How can we tell? *

How can we survive? *

*Will be discussed in future Andrew lectures…just before we need this technology.

Page 18: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 35

Gaussian MLE in action

Using R = 392 cars from the “MPG” UCI dataset supplied by Ross Quinlan.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 36

Data-starved Gaussian MLE

Using three subsets of MPG. Each subset has 6 randomly-chosen cars.

Page 19: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 37

Bivariate MLE in action

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 38

Multivariate MLE

Covariance matrices are not exciting to look at

Page 20: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 39

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ, Σ)

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 40

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ, Σ)

Step 1a: Put a prior on Σ:

(ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0)

This thing is called the Inverse-Wishart distribution. A PDF over SPD matrices!

Page 21: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 41

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ, Σ)

Step 1a: Put a prior on Σ:

(ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0)

This thing is called the Inverse-Wishart distribution. A PDF over SPD matrices!

ν0 small: “I am not sure about my guess of Σ0”
ν0 large: “I’m pretty sure about my guess of Σ0”
Σ0: (roughly) my best guess of Σ
E[Σ] = Σ0

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 42

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ, Σ)

Step 1a: Put a prior on Σ:
(ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0)

Step 1b: Put a prior on µ | Σ:
µ | Σ ~ N(µ0, Σ / κ0)

Together, “Σ” and “µ | Σ” define a joint distribution on (µ, Σ).

Page 22: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 43

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ, Σ)

Step 1a: Put a prior on Σ:
(ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0)

Step 1b: Put a prior on µ | Σ:
µ | Σ ~ N(µ0, Σ / κ0)

Together, “Σ” and “µ | Σ” define a joint distribution on (µ, Σ).

µ0: my best guess of µ, and E[µ] = µ0
κ0 small: “I am not sure about my guess of µ0”
κ0 large: “I’m pretty sure about my guess of µ0”

Notice how we are forced to express our ignorance of µ proportionally to Σ.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 44

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ, Σ)

Step 1a: Put a prior on Σ:
(ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0)

Step 1b: Put a prior on µ | Σ:
µ | Σ ~ N(µ0, Σ / κ0)

Why do we use this form of prior?

Page 23: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 45

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don’t know µ or Σ
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ, Σ)

Step 1a: Put a prior on Σ:
(ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0)

Step 1b: Put a prior on µ | Σ:
µ | Σ ~ N(µ0, Σ / κ0)

Why do we use this form of prior?

Actually, we don’t have to. But it is computationally and algebraically convenient…

…it’s a conjugate prior.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 46

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Prior: (ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0),  µ | Σ ~ N(µ0, Σ / κ0)

Step 2:

$$\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad \boldsymbol{\mu}_R = \frac{\kappa_0\boldsymbol{\mu}_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R} \qquad \kappa_R = \kappa_0 + R \qquad \nu_R = \nu_0 + R$$

$$(\nu_R + m - 1)\,\boldsymbol{\Sigma}_R = (\nu_0 - m - 1)\,\boldsymbol{\Sigma}_0 + \sum_{k=1}^{R}(\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T + \frac{(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^T}{1/\kappa_0 + 1/R}$$

Step 3: Posterior: (νR + m − 1) Σ ~ IW(νR, (νR + m − 1) ΣR),  µ | Σ ~ N(µR, Σ / κR)

Result: µ^map = µR,  E[Σ | x1, x2, … xR] = ΣR
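The Step 2 updates are easy to code up. Here is a minimal sketch (my own, following the formulas as reconstructed above; the prior settings are illustrative, not recommendations):

```python
# Posterior update for the conjugate (Inverse-Wishart / Gaussian) prior above.
import numpy as np

def posterior_update(X, mu0, kappa0, nu0, Sigma0):
    """Return (mu_R, kappa_R, nu_R, Sigma_R) from data X of shape (R, m)."""
    R, m = X.shape
    xbar = X.mean(axis=0)

    mu_R = (kappa0 * mu0 + R * xbar) / (kappa0 + R)
    kappa_R = kappa0 + R
    nu_R = nu0 + R

    centered = X - xbar
    S = centered.T @ centered                             # sum of (x_k - xbar)(x_k - xbar)^T
    shift = np.outer(xbar - mu0, xbar - mu0) / (1/kappa0 + 1/R)
    Sigma_R = ((nu0 - m - 1) * Sigma0 + S + shift) / (nu_R + m - 1)
    return mu_R, kappa_R, nu_R, Sigma_R

# Example with simulated 2-D data and a vague-ish prior
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
print(posterior_update(X, mu0=np.zeros(2), kappa0=1.0, nu0=6.0, Sigma0=np.eye(2)))
```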

Page 24: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 47

Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?

Step 1: Prior: (ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0),  µ | Σ ~ N(µ0, Σ / κ0)

Step 2:

$$\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad \boldsymbol{\mu}_R = \frac{\kappa_0\boldsymbol{\mu}_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R} \qquad \kappa_R = \kappa_0 + R \qquad \nu_R = \nu_0 + R$$

$$(\nu_R + m - 1)\,\boldsymbol{\Sigma}_R = (\nu_0 - m - 1)\,\boldsymbol{\Sigma}_0 + \sum_{k=1}^{R}(\mathbf{x}_k - \bar{\mathbf{x}})(\mathbf{x}_k - \bar{\mathbf{x}})^T + \frac{(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^T}{1/\kappa_0 + 1/R}$$

Step 3: Posterior: (νR + m − 1) Σ ~ IW(νR, (νR + m − 1) ΣR),  µ | Σ ~ N(µR, Σ / κR)

Result: µ^map = µR,  E[Σ | x1, x2, … xR] = ΣR

• Look carefully at what these formulae are doing. It’s all very sensible.
• Conjugate priors mean the prior form and posterior form are the same, characterized by “sufficient statistics” of the data.
• The marginal distribution on µ is a Student-t.
• One point of view: it’s pretty academic if R > 30.

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 48

Where we’re at

(Summary diagram: the three kinds of learners seen so far — a Classifier takes inputs and predicts a category, a Density Estimator takes inputs and outputs a probability, and a Regressor takes inputs and predicts a real number — organized by whether a method handles categorical inputs only, mixed real/categorical inputs, or real-valued inputs only. Methods placed on the chart so far: Dec Tree, Joint BC, Naïve BC, Joint DE, Naïve DE, and Gauss DE.)

Page 25: Maximum Likelihood Estimation

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 49

What you should know
• The Recipe for MLE
• Why do we sometimes prefer MLE to MAP?
• Understand MLE estimation of Gaussian parameters
• Understand “biased estimator” versus “unbiased estimator”
• Appreciate the outline behind Bayesian estimation of Gaussian parameters

Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 50

Useful exercise
• We’d already done some MLE in this class without even telling you!
• Suppose categorical arity-n inputs x1, x2, … xR ~ (i.i.d.) from a multinomial M(p1, p2, … pn) where P(xk = j | p) = pj
• What is the MLE p = (p1, p2, … pn)?
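For a quick self-check of this exercise (my addition, not part of the slides), a simulation illustrating the answer that the MLE is the vector of empirical frequencies:

```python
# For multinomial data, the MLE of (p_1, ..., p_n) is count_j / R for each class j.
import numpy as np

rng = np.random.default_rng(6)
true_p = np.array([0.2, 0.5, 0.3])                  # hypothetical arity-3 multinomial
x = rng.choice(len(true_p), size=10_000, p=true_p)  # observations x_1 .. x_R

counts = np.bincount(x, minlength=len(true_p))
p_mle = counts / counts.sum()                       # empirical frequencies

print(p_mle)                                        # close to [0.2, 0.5, 0.3]
```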