
Page 1: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 1

Naïve Bayes & Expectation Maximization

CSE 573

Page 2: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 2

Logistics

• Team Meetings
• Midterm
    Open book, notes
    Studying
• See AIMA exercises

Page 3: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 3

573 Schedule

Agency

Problem Spaces

Search

Knowledge Representation & Inference

Planning    Supervised Learning

Logic-Based

Probabilistic

Reinforcement Learning

Selected Topics

Page 4: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 4

Coming Soon

Artificial Life

Selected Topics

Crossword Puzzles

Intelligent Internet Systems

Page 5: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 5

Topics

• Test & Mini Projects
• Review
• Naive Bayes
    Maximum Likelihood Estimates
    Working with Probabilities
• Expectation Maximization
• Challenge

Page 6: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Estimation Models

                                  Prior      Hypothesis
Maximum Likelihood Estimate       Uniform    The most likely
Maximum A Posteriori Estimate     Any        The most likely
Bayesian Estimate                 Any        Weighted combination
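In symbols (a standard formulation, not verbatim from the slides): given data D and a parameter θ,

\[
\hat{\theta}_{ML} = \arg\max_{\theta} P(D \mid \theta),
\qquad
\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\,P(\theta),
\]

while the Bayesian estimate keeps the whole posterior and averages predictions over it:

\[
P(x \mid D) = \int P(x \mid \theta)\, P(\theta \mid D)\, d\theta .
\]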

Page 7: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Continuous Case

[Figure: relative likelihood plotted as a function of the probability of heads, over 0 to 1]

Page 8: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Continuous Case

[Figure: prior over the probability of heads (uniform vs. with background knowledge), updated after Exp 1: Heads and Exp 2: Tails]

Page 9: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Continuous Case

Posterior after 2 experiments (w/ uniform prior and with background knowledge):

[Figure: posterior curves with the ML Estimate, MAP Estimate, and Bayesian Estimate marked]

Page 10: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

After 10 Experiments... Posterior:

[Figure: posterior curves after 10 experiments, w/ uniform prior and with background knowledge, showing the ML Estimate, MAP Estimate, and Bayesian Estimate]
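These curves can be reproduced with a short script. A minimal sketch, assuming a Beta(a, b) prior over the probability of heads (a = b = 1 is the uniform prior; larger pseudo-counts stand in for the background-knowledge prior); the flip counts for the 10-experiment case are made up for illustration.

# Sketch: Beta-Bernoulli updating for the "probability of heads" example.
# a = b = 1 is the uniform prior; larger a, b encode background knowledge.

def update(a, b, heads, tails):
    """Posterior Beta parameters after observing the given flips."""
    return a + heads, b + tails

def estimates(a, b, heads, tails):
    a_post, b_post = update(a, b, heads, tails)
    ml = heads / (heads + tails)                   # ignores the prior entirely
    map_est = (a_post - 1) / (a_post + b_post - 2) # posterior mode
    bayes = a_post / (a_post + b_post)             # posterior mean
    return ml, map_est, bayes

print(estimates(1, 1, 1, 1))   # Exp 1 heads, Exp 2 tails, uniform prior
print(estimates(5, 5, 1, 1))   # same data, "background knowledge" prior
print(estimates(1, 1, 6, 4))   # after 10 flips (6 heads / 4 tails, made up), uniform prior
print(estimates(5, 5, 6, 4))   # same 10 flips, informative prior pulls toward 0.5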

Page 11: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 11

Topics

• Test & Mini Projects
• Review
• Naive Bayes
    Maximum Likelihood Estimates
    Working with Probabilities
• Expectation Maximization
• Challenge

Page 12: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Naive Bayes

• A Bayes Net where all nodes are children of a single root node
• Why?
    Expressive and accurate?
    Easy to learn?

[Figure: a root node with word-feature children; one child = `Is “apple” in message?’]

Page 13: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Naive Bayes

• All nodes are children of a single root node
• Why?
    Expressive and accurate? No - why?
    Easy to learn?

Page 14: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Naive Bayes

• All nodes are children of a single root node
• Why?
    Expressive and accurate? No
    Easy to learn? Yes

Page 15: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Naive Bayes

• All nodes are children of a single root node
• Why?
    Expressive and accurate? No
    Easy to learn? Yes
    Useful? Sometimes

Page 16: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Inference In Naive Bayes

Goal: given evidence (the words in an email), decide if the email is spam.

P(S) = 0.6

P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
P(A|S) = ?        P(A|¬S) = ?

Page 17: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Inference In Naive Bayes

P(S) = 0.6
P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
P(A|S) = ?        P(A|¬S) = ?

Independence to the rescue!

P(S|E) = P(S) ∏i P(Ei|S) / P(E)

P(¬S|E) = P(¬S) ∏i P(Ei|¬S) / P(E)

Page 18: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Inference In Naive Bayes

P(S) = 0.6
P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
P(A|S) = ?        P(A|¬S) = ?

Spam if P(S|E) > P(¬S|E)

But... both posteriors share the same denominator P(E), so it cancels in the comparison and never needs to be computed.

Page 19: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Inference In Naive Bayes

P(S) = 0.6
P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
P(A|S) = ?        P(A|¬S) = ?
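A minimal sketch of this inference using the CPTs above, treating A, B, and F as the observed word features and normalizing over the two classes; the dictionary layout and function name are illustrative, not from the slides.

# Sketch: Naive Bayes spam decision using the CPTs from the slides.
# Evidence maps each word-feature to True (present) / False (absent).
p_spam = 0.6
cpt = {                      # word: (P(present | S), P(present | ¬S))
    "A": (0.01, 0.1),
    "B": (0.001, 0.01),
    "F": (0.8, 0.05),
}

def posterior_spam(evidence):
    """P(S | evidence); P(E) is handled by normalizing the two joint scores."""
    score_s, score_not_s = p_spam, 1.0 - p_spam
    for word, present in evidence.items():
        p_s, p_not_s = cpt[word]
        score_s     *= p_s     if present else (1.0 - p_s)
        score_not_s *= p_not_s if present else (1.0 - p_not_s)
    return score_s / (score_s + score_not_s)

print(posterior_spam({"A": False, "B": False, "F": True}))   # F present: likely spam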

Page 20: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Parameter Estimation Revisited

Can we calculate Maximum Likelihood estimate of θ easily?

P(S) = θ

Data: observed spam / not-spam examples.

[Figure: a prior over θ and the likelihood of the data as a function of θ; the peak is the Maximum Likelihood estimate]

Looking for the maximum of a function:
  - find the derivative
  - set it to zero
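Carrying out that recipe for P(S) = θ (a standard derivation, with s spam examples out of n total):

\[
L(\theta) = \theta^{s}(1-\theta)^{\,n-s},
\qquad
\log L(\theta) = s \log\theta + (n-s)\log(1-\theta),
\]
\[
\frac{d}{d\theta}\log L(\theta) = \frac{s}{\theta} - \frac{n-s}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{ML} = \frac{s}{n}.
\]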

Page 21: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 21

Topics

• Test & Mini Projects
• Review
• Naive Bayes
    Maximum Likelihood Estimates
    Working with Probabilities
      • Smoothing
      • Computational Details
      • Continuous Quantities
• Expectation Maximization
• Challenge

Page 22: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Evidence is Easy?

• Or… are there problems?

P(Xi | S) = #(Xi ∧ S) / ( #(Xi ∧ S) + #(¬Xi ∧ S) )

Page 23: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Smooth with a Prior

p = prior probability
m = weight

Note that if m = 10, it means “I’ve seen 10 samples that make me believe P(Xi | S) = p”

Hence, m is referred to as the equivalent sample size

P(Xi | S) = ( #(Xi ∧ S) + m·p ) / ( #(Xi ∧ S) + #(¬Xi ∧ S) + m )
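A minimal sketch of this m-estimate; the counts, p, and m values below are made up for illustration.

# Sketch: m-estimate smoothing for P(Xi | S).
# p is the prior probability, m the equivalent sample size.
def m_estimate(count_with_word, count_without_word, p=0.5, m=10):
    total = count_with_word + count_without_word
    return (count_with_word + m * p) / (total + m)

print(m_estimate(0, 100))   # word never seen in spam: about 0.045, not 0
print(m_estimate(80, 20))   # plenty of data: about 0.77, pulled only slightly toward p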

Page 24: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Probabilities: Important Detail!

Any more potential problems here?

• P(spam | X1 … Xn) ∝ P(spam) ∏i P(Xi | spam)

• We are multiplying lots of small numbers   Danger of underflow!   0.5^57 ≈ 7 × 10^-18

• Solution? Use logs and add!   p1 · p2 = e^( log(p1) + log(p2) )

  Always keep in log form
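A minimal sketch of the log-space trick; the prior and the 57 identical factors are illustrative.

# Sketch: compute the Naive Bayes score in log space to avoid underflow.
import math

def log_score(prior, likelihoods):
    """log( prior * product(likelihoods) ), computed as a sum of logs."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

probs = [0.5] * 57
direct = 0.6
for p in probs:
    direct *= p                       # already ~1e-18; a long message would hit 0.0
logged = log_score(0.6, probs)        # stays perfectly representable
print(direct, logged)

# To classify, compare log scores between classes directly; never exponentiate back.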

Page 25: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 25

P(S | X)

• Easy to compute from data if X discrete

Instance X Spam?

1 T F

2 T F

3 F T

4 T T

5 T F

• P(S | X) = ¼ ignoring smoothing…

Page 26: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 26

P(S | X)

• What if X is real valued?

Instance X Spam?

1 0.01 False

2 0.01 False

3 0.02 False

4 0.03 True

5 0.05 True

• What now?

[Figure: one option is to pick a threshold T and discretize: instances 1–3 have X < T, instances 4–5 have X > T]

Page 27: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 27

Anything Else?

#   X     S?
1   0.01  F
2   0.01  F
3   0.02  F
4   0.03  T
5   0.05  T

[Figure: histogram of the X values from .01 to .05]

P(S | .04)?

Page 28: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 28

Fit Gaussians

#   X     S?
1   0.01  F
2   0.01  F
3   0.02  F
4   0.03  T
5   0.05  T

[Figure: the same histogram with one Gaussian fit to the spam (T) values and one to the non-spam (F) values]
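A minimal sketch of that fit on the five values from the table; the small variance floor is an assumption (two points per class otherwise give a very tight fit), and the helper names are illustrative.

# Sketch: fit one Gaussian per class to a real-valued feature, then apply Bayes rule.
import math

data = [(0.01, False), (0.01, False), (0.02, False), (0.03, True), (0.05, True)]

def fit_gaussian(values, var_floor=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, var_floor)          # ML mean and (floored) variance

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

spam_vals     = [x for x, s in data if s]
not_spam_vals = [x for x, s in data if not s]
spam_g, not_spam_g = fit_gaussian(spam_vals), fit_gaussian(not_spam_vals)
p_spam = len(spam_vals) / len(data)

def p_spam_given_x(x):
    num = p_spam * gaussian_pdf(x, *spam_g)
    den = num + (1 - p_spam) * gaussian_pdf(x, *not_spam_g)
    return num / den

print(p_spam_given_x(0.04))   # P(S | 0.04) is now defined even though 0.04 was never observed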

Page 29: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 29

Smooth with a Gaussian, then sum

“Kernel Density Estimation”

#   X     S?
1   0.01  F
2   0.01  F
3   0.02  F
4   0.03  T
5   0.05  T

[Figure: a small Gaussian kernel centered on each data point; each class density is the sum of its kernels]
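A minimal sketch of per-class kernel density estimation on the same five values; the bandwidth is an assumption, since the slides do not give one.

# Sketch: kernel density estimate of P(x | class) — one Gaussian kernel per data point.
import math

data = [(0.01, False), (0.01, False), (0.02, False), (0.03, True), (0.05, True)]

def kde(x, values, bandwidth=0.01):
    # Average of standard-normal kernels centered on the observed values.
    k = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(k((x - v) / bandwidth) for v in values) / (len(values) * bandwidth)

spam_vals     = [x for x, s in data if s]
not_spam_vals = [x for x, s in data if not s]
p_spam = len(spam_vals) / len(data)

num = p_spam * kde(0.04, spam_vals)
den = num + (1 - p_spam) * kde(0.04, not_spam_vals)
print(num / den)    # P(S | 0.04) under the KDE model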

Page 30: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 32

Topics

• Test & Mini Projects
• Review
• Naive Bayes
• Expectation Maximization
    Review: Learning Bayesian Networks
      • Parameter Estimation
      • Structure Learning
    Hidden Nodes
• Challenge

Page 31: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 33

An Example Bayes Net

[Figure: Earthquake and Burglary are parents of Alarm; Nbr1Calls and Nbr2Calls are children of Alarm; Radio is a child of Earthquake]

Pr(B=t) = 0.05    Pr(B=f) = 0.95

Pr(A | E, B):
    e, b:    0.9   (0.1)
    e, ¬b:   0.2   (0.8)
    ¬e, b:   0.85  (0.15)
    ¬e, ¬b:  0.01  (0.99)

Page 32: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Parameter Estimation and Bayesian Networks

E  B  R  A  J  M
T  F  T  T  F  T
F  F  F  F  F  T
F  T  F  T  T  T
F  F  F  T  T  T
F  T  F  F  F  F
...

We have:  Bayes Net structure and observations
We need:  Bayes Net parameters

Page 33: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Parameter Estimation and Bayesian Networks

E  B  R  A  J  M
T  F  T  T  F  T
F  F  F  F  F  T
F  T  F  T  T  T
F  F  F  T  T  T
F  T  F  F  F  F
...

P(A|E,B) = ?     P(A|E,¬B) = ?
P(A|¬E,B) = ?    P(A|¬E,¬B) = ?

Page 34: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Parameter Estimation and Bayesian Networks

E  B  R  A  J  M
T  F  T  T  F  T
F  F  F  F  F  T
F  T  F  T  T  T
F  F  F  T  T  T
F  T  F  F  F  F
...

P(A|E,B) = ?     P(A|E,¬B) = ?
P(A|¬E,B) = ?    P(A|¬E,¬B) = ?

Prior + data:

[Figure: a prior over each parameter, combined with the observed counts]

Now compute either MAP or Bayesian estimate
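A minimal sketch of that estimation on the five complete rows above; the a = b = 1 pseudo-counts (an assumption) give the posterior-mean, Laplace-smoothed estimate, and setting them to 0 recovers the raw ML counts.

# Sketch: estimate P(A | E, B) from complete data by counting.
# Rows are (E, B, R, A, J, M) from the slide.
rows = [
    (True,  False, True,  True,  False, True),
    (False, False, False, False, False, True),
    (False, True,  False, True,  True,  True),
    (False, False, False, True,  True,  True),
    (False, True,  False, False, False, False),
]
E, B, A = 0, 1, 3      # column indices

def p_alarm_given(e, b, a_prior=1, b_prior=1):
    matching = [r for r in rows if r[E] == e and r[B] == b]
    alarms = sum(r[A] for r in matching)
    return (alarms + a_prior) / (len(matching) + a_prior + b_prior)

for e in (True, False):
    for b in (True, False):
        print(f"P(A | E={e}, B={b}) ≈ {p_alarm_given(e, b):.2f}")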

Page 35: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 37

Recap

• Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables.

[Figures: the spam Naive Bayes net (Spam → Nigeria, Nude, Sex) and the alarm net (Earthqk, Burgl → Alarm → N1, N2)]

Page 36: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

What if we don’t know structure?

Page 37: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Learning The Structure of Bayesian Networks

• Search thru the space… of possible network structures!
    (for now, assume we observe all variables)
• For each structure, learn parameters
• Pick the one that fits observed data best
    Caveat – won’t we end up fully connected????
• When scoring, add a penalty for model complexity

Problem!?!?

Page 38: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Learning The Structure of Bayesian Networks

• Search thru the space
• For each structure, learn parameters
• Pick the one that fits observed data best

• Problem?
    Exponential number of networks!
    And we need to learn parameters for each!
    Exhaustive search out of the question!

• So what now?

Page 39: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Learning The Structure of Bayesian Networks

Local search!
    Start with some network structure
    Try to make a change (add or delete or reverse an edge)
    See if the new network is any better

What should be the initial state?
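A minimal sketch of that local search, assuming binary variables, a structure stored as one parent set per node, and a BIC-style score (family log-likelihood minus a complexity penalty). Edge reversal is omitted for brevity, and every helper name here (bic_score, acyclic, neighbors, hill_climb) is illustrative rather than from the slides.

# Sketch: greedy local search over Bayes-net structures with a BIC-style score.
import math
from collections import Counter

def bic_score(rows, variables, parents):
    """Sum over nodes of family log-likelihood minus (log N / 2) * #parameters."""
    n, score = len(rows), 0.0
    for v in variables:
        pa = sorted(parents[v])
        joint, marg = Counter(), Counter()
        for r in rows:
            key = tuple(r[p] for p in pa)
            joint[(key, r[v])] += 1
            marg[key] += 1
        score += sum(c * math.log(c / marg[key]) for (key, _), c in joint.items())
        score -= 0.5 * math.log(n) * (2 ** len(pa))   # binary child: one free param per parent setting
    return score

def acyclic(parents):
    seen, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in seen:
            return False          # back edge found: cycle
        seen.add(v)
        ok = all(visit(p) for p in parents[v])
        done.add(v)
        return ok
    return all(visit(v) for v in parents)

def neighbors(parents):
    """All structures reachable by adding or deleting one edge (reversal omitted)."""
    for child in parents:
        for other in parents:
            if other == child:
                continue
            new = {v: set(ps) for v, ps in parents.items()}
            if other in new[child]:
                new[child].remove(other)
            else:
                new[child].add(other)
            if acyclic(new):
                yield new

def hill_climb(rows, variables):
    parents = {v: set() for v in variables}    # start from the empty network
    best = bic_score(rows, variables, parents)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(parents):
            s = bic_score(rows, variables, cand)
            if s > best:
                parents, best, improved = cand, s, True
                break              # take the first improving move
    return parents, best

Starting from the empty network is only one possible initial state; the alternatives on the next slide (a random network, or one reflecting expert knowledge) drop in by changing the initialization of parents.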

Page 40: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

Initial Network Structure?

• Uniform prior over random networks?

• Network which reflects expert knowledge?

Page 41: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 43

Learning BN Structure

Page 42: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

The Big Picture

• We described how to do MAP (and ML) learning of a Bayes net (including structure)

• How would Bayesian learning (of BNs) differ?
    • Find all possible networks
    • Calculate their posteriors
    • When doing inference, return a weighted combination of predictions from all networks!

Page 43: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 45

Hidden Variables

• But we can’t observe the disease variable
• Can’t we learn without it?

Page 44: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 46

We –could–

• But we’d get a fully-connected network
• With 708 parameters (vs. 78)
    Much harder to learn!

Page 45: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 47

Chicken & Egg Problem

• If we knew that a training instance (patient) had the disease…
    It would be easy to learn P(symptom | disease)
    But we can’t observe disease, so we don’t.

• If we knew the params, e.g. P(symptom | disease), then it’d be easy to estimate whether the patient had the disease.
    But we don’t know these parameters.

Page 46: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 48

Expectation Maximization (EM): high-level version

• Pretend we do know the parameters
    Initialize randomly

• [E step] Compute the probability of each instance having each possible value of the hidden variable

• [M step] Treating each instance as fractionally having both values, compute the new parameter values

• Iterate until convergence!

Page 47: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 49

Simplest Version

• Mixture of two distributions
• Know: form of distribution & variance; just need the mean of each distribution

[Figure: data points along the interval from .01 to .09]
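A minimal sketch of EM for exactly this case: two Gaussians with a known, shared variance and equal mixing weights, so only the two means are updated. The data points and sigma below are made up for illustration.

# Sketch: EM for a mixture of two Gaussians with known, shared variance.
import math, random

data = [0.012, 0.018, 0.021, 0.025, 0.030, 0.055, 0.060, 0.068, 0.075, 0.082]
sigma = 0.01

def gauss(x, mu):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu1, mu2 = random.sample(data, 2)          # "initialize randomly"
for _ in range(50):
    # E step: probability that each point belongs to component 1
    resp = [gauss(x, mu1) / (gauss(x, mu1) + gauss(x, mu2)) for x in data]
    # M step: every point counts fractionally toward both means
    w1 = sum(resp)
    mu1 = sum(r * x for r, x in zip(resp, data)) / w1
    mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - w1)

print(sorted([mu1, mu2]))                  # roughly the centers of the two clumps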

Page 48: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 50

Input Looks Like

[Figure: the observed, unlabeled points plotted along the interval from .01 to .09]

Page 49: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 51

We Want to Predict

[Figure: the same points, with a “?” asking which distribution each point came from]

Page 50: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 52

Expectation Maximization (EM)

• Pretend we do know the parameters
    Initialize randomly: set μ1 = ?, μ2 = ?

[Figure: the data with two randomly initialized Gaussians]

Page 51: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 53

Expectation Maximization (EM)

• Pretend we do know the parameters
    Initialize randomly
• [E step] Compute probability of instance having each possible value of the hidden variable

[Figure: the data with the current estimates of the two Gaussians]

Page 52: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 54

Expectation Maximization (EM)

• Pretend we do know the parameters
    Initialize randomly
• [E step] Compute probability of instance having each possible value of the hidden variable

[Figure: the data with the current estimates of the two Gaussians]

Page 53: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 55

Expectation Maximization (EM)

• Pretend we do know the parameters
    Initialize randomly
• [E step] Compute probability of instance having each possible value of the hidden variable

[Figure: the data with the current estimates of the two Gaussians]

• [M step] Treating each instance as fractionally having both values, compute the new parameter values

Page 54: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 56

ML Mean of Single Gaussian

μML = argminμ Σi (xi – μ)²

[Figure: the single best-fit Gaussian over the data]
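Setting the derivative to zero shows that this minimizer is just the sample mean (a standard step, spelled out here for completeness):

\[
\frac{\partial}{\partial \mu} \sum_i (x_i - \mu)^2 \;=\; -2 \sum_i (x_i - \mu) \;=\; 0
\quad\Longrightarrow\quad
\hat{\mu}_{ML} \;=\; \frac{1}{N}\sum_i x_i .
\]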

Page 55: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 57

Expectation Maximization (EM)

• Pretend we do know the parameters
    Initialize randomly
• [E step] Compute probability of instance having each possible value of the hidden variable

[Figure: the data with the current estimates of the two Gaussians]

• [M step] Treating each instance as fractionally having both values, compute the new parameter values

Page 56: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 58

Iterate

Page 57: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 59

Expectation Maximization (EM)

• [E step] Compute probability of instance having each possible value of the hidden variable

[Figure: the data with the updated estimates of the two Gaussians]

Page 58: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 60

Expectation Maximization (EM)

• [E step] Compute probability of instance having each possible value of the hidden variable

[Figure: the data with the updated estimates of the two Gaussians]

• [M step] Treating each instance as fractionally having both values, compute the new parameter values

Page 59: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 61

Expectation Maximization (EM)

• [E step] Compute probability of instance having each possible value of the hidden variable

[Figure: the data with the updated estimates of the two Gaussians]

• [M step] Treating each instance as fractionally having both values, compute the new parameter values

Page 60: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 62

Until Convergence

• Problems
    Need to assume the form of the distribution
    Local maxima

• But
    It really works in practice!
    Can easily extend to multiple variables
      • E.g., mean & variance
      • Or much more complex models…

Page 61: © Daniel S. Weld 1 Naïve Bayes & Expectation Maximization CSE 573

© Daniel S. Weld 63

Crossword Puzzles