Naïve Bayes & Expectation Maximization — CSE 573, University of Washington (2004-11-30)



© Daniel S. Weld 1

Naïve Bayes & Expectation Maximization

CSE 573

© Daniel S. Weld 2

Logistics
• Team Meetings
• Midterm
  Open book, notes
• Studying
  See AIMA exercises

© Daniel S. Weld 3

573 Schedule

Agency
Problem Spaces
Search
Knowledge Representation & Inference
Planning
Supervised Learning
  Logic-Based
  Probabilistic
Reinforcement Learning

Selected Topics

© Daniel S. Weld 4

Coming Soon

Artificial Life

Selected Topics

Crossword Puzzles

Intelligent Internet Systems

© Daniel S. Weld 5

Topics
• Test & Mini Projects
• Review
• Naive Bayes
  Maximum Likelihood Estimates
  Working with Probabilities
• Expectation Maximization
• Challenge

Estimation Models

                                 Prior     Hypothesis
Maximum Likelihood Estimate      Uniform   The most likely
Maximum A Posteriori Estimate    Any       The most likely
Bayesian Estimate                Any       Weighted combination


Continuous Case

[Plot: Relative Likelihood vs. Probability of heads (0 to 1)]

Continuous Case

[Plots: Prior (uniform, and with background knowledge), then the curve after Exp 1: Heads and after Exp 2: Tails — Relative Likelihood vs. Probability of heads]

Continuous Case

Posterior after 2 experiments:

[Plots: posterior with uniform prior and with background knowledge, marking the ML Estimate, MAP Estimate, and Bayesian Estimate]

After 10 Experiments...

Posterior:

[Plots: posterior with uniform prior and with background knowledge, marking the ML Estimate, MAP Estimate, and Bayesian Estimate]
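As a worked complement to these plots, here is a minimal sketch of the three estimators for the coin, assuming a Beta(a, b) prior: a = b = 1 is the uniform prior, and larger pseudo-counts play the role of the background-knowledge prior. The observation counts below are illustrative, not the slides' data.

```python
# Coin example: ML, MAP, and Bayesian (posterior-mean) estimates of P(heads)
# under a Beta(a, b) prior. a = b = 1 is the uniform prior.

def estimates(heads, tails, a=1.0, b=1.0):
    # Maximum likelihood: ignores the prior entirely.
    ml = heads / (heads + tails)
    # MAP: mode of the Beta(a + heads, b + tails) posterior.
    map_ = (heads + a - 1) / (heads + tails + a + b - 2)
    # Bayesian: posterior mean, a weighted combination over all hypotheses.
    bayes = (heads + a) / (heads + tails + a + b)
    return ml, map_, bayes

# Illustrative run of 10 experiments: 7 heads, 3 tails.
print(estimates(7, 3))             # uniform prior: ML = MAP = 0.7, Bayes ~ 0.67
print(estimates(7, 3, a=5, b=5))   # "coin is roughly fair" prior pulls both toward 0.5
```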

© Daniel S. Weld 11

Topics
• Test & Mini Projects
• Review
• Naive Bayes
  Maximum Likelihood Estimates
  Working with Probabilities
• Expectation Maximization
• Challenge

Naive Bayes

• A Bayes Net where all nodes are children of a single root node
• Why?
  Expressive and accurate?
  Easy to learn?

(Evidence node = 'Is "apple" in message?')


Naive Bayes

• All nodes are children of a single root node
• Why?
  Expressive and accurate? No - why?
  Easy to learn?

Naive Bayes

• All nodes are children of a single root node
• Why?
  Expressive and accurate? No
  Easy to learn? Yes

Naive Bayes

• All nodes are children of a single root node
• Why?
  Expressive and accurate? No
  Easy to learn? Yes
  Useful? Sometimes

Inference In Naive Bayes

Goal: given evidence (words in an email), decide if the email is spam.

P(S) = 0.6
P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
P(A|S) = ?        P(A|¬S) = ?

Inference In Naive Bayes

P(S) = 0.6
P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
P(A|S) = ?        P(A|¬S) = ?

Independence to the rescue!

P(S|E) = P(S) Π_i P(E_i|S) / P(E)

Inference In Naive Bayes

P(S) = 0.6
P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
P(A|S) = ?        P(A|¬S) = ?

Spam if P(S|E) > P(¬S|E)
But... the P(E) denominator is the same on both sides, so it cancels and never needs to be computed.


Inference In Naive Bayes

P(S) = 0.6
P(A|S) = 0.01     P(A|¬S) = 0.1
P(B|S) = 0.001    P(B|¬S) = 0.01
P(F|S) = 0.8      P(F|¬S) = 0.05
P(A|S) = ?        P(A|¬S) = ?
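A minimal sketch of this decision rule, using the CPT values from the slide (the slides' A, B, F features; the example email below and the helper name are illustrative):

```python
# Naive Bayes spam decision with the slide's CPT values.
# Because P(E) cancels, we only compare the unnormalized scores
# P(S) * prod_i P(E_i|S)  versus  P(not S) * prod_i P(E_i|not S).

P_S = 0.6
cpt = {                 # word -> (P(present | S), P(present | not S))
    "A": (0.01, 0.1),
    "B": (0.001, 0.01),
    "F": (0.8, 0.05),
}

def is_spam(evidence):
    """evidence: dict word -> True/False (word present in the email)."""
    score_s, score_ns = P_S, 1.0 - P_S
    for word, present in evidence.items():
        p_s, p_ns = cpt[word]
        score_s  *= p_s  if present else (1.0 - p_s)
        score_ns *= p_ns if present else (1.0 - p_ns)
    return score_s > score_ns

# Illustrative email: contains feature F but not A or B.
print(is_spam({"A": False, "B": False, "F": True}))   # -> True
```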

Parameter Estimation Revisited

Can we calculate Maximum Likelihood estimate of θ easily?

P(S) = θ

Data: +

Prior
[Plot: prior over θ, θ from 0 to 1]

[Plot over θ from 0 to 1, with the Max Likelihood estimate marked]

Looking for the maximum of a function:
- find the derivative
- set it to zero
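A worked version of that derivative argument, writing h for the number of positive (spam) examples and t for the rest (these symbols are mine, not the slides'):

```latex
\begin{align*}
L(\theta) &= \theta^{h}\,(1-\theta)^{t} \\
\log L(\theta) &= h\log\theta + t\log(1-\theta) \\
\frac{d}{d\theta}\log L(\theta) &= \frac{h}{\theta} - \frac{t}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{ML}} = \frac{h}{h+t}
\end{align*}
```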

© Daniel S. Weld 21

Topics
• Test & Mini Projects
• Review
• Naive Bayes
  Maximum Likelihood Estimates
  Working with Probabilities
  • Smoothing
  • Computational Details
  • Continuous Quantities
• Expectation Maximization
• Challenge

Evidence is Easy?

• Or…. are there problems?

P(Xi | S) = #(Xi ∧ S) / ( #(Xi ∧ S) + #(¬Xi ∧ S) )

Smooth with a Prior

p = prior probability, m = weight

Note that if m = 10, it means "I've seen 10 samples that make me believe P(Xi | S) = p"

Hence, m is referred to as the equivalent sample size

P(Xi | S) = ( #(Xi ∧ S) + m·p ) / ( #(Xi ∧ S) + #(¬Xi ∧ S) + m )
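A small sketch of this m-estimate; the counts and default prior are illustrative, not from the slides:

```python
# m-estimate smoothing: blend the observed frequency with a prior p,
# where m acts as an "equivalent sample size" of imaginary examples.

def m_estimate(n_true, n_false, p=0.5, m=10):
    """Smoothed P(Xi | S); m = 0 gives the raw (unsmoothed) frequency."""
    return (n_true + m * p) / (n_true + n_false + m)

# Word never seen in spam: the raw estimate would be 0, the smoothed one is not.
print(m_estimate(0, 20))        # ~0.167 instead of 0.0
print(m_estimate(0, 20, m=1))   # weaker prior -> closer to the raw 0.0
```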

Probabilities: Important Detail!

Any more potential problems here?

• P(spam | X1 … Xn) = Π_i P(spam | Xi)

• We are multiplying lots of small numbers
  Danger of underflow!
  0.5^57 ≈ 7 × 10^-18

• Solution? Use logs and add!
  p1 · p2 = e^(log(p1) + log(p2))
  Always keep in log form
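A quick sketch of the log-space trick; the 1,100 identical probabilities are just for illustration:

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point;
# summing their logs does not.
probs = [0.5] * 1100

product = 1.0
for p in probs:
    product *= p
print(product)                      # 0.0 -- underflow

log_score = sum(math.log(p) for p in probs)
print(log_score)                    # about -762.46, still perfectly usable

# Compare classes by their log scores directly; only exponentiate if the
# normalized probability itself is really needed.
```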


© Daniel S. Weld 25

P(S | X)
• Easy to compute from data if X is discrete

Instance   X   Spam?
1          T   F
2          T   F
3          F   T
4          T   T
5          T   F

• P(S | X=T) = ¼   (ignoring smoothing…)

© Daniel S. Weld 26

P(S | X)
• What if X is real valued?

Instance   X      Spam?   vs. threshold T
1          0.01   False   < T
2          0.01   False   < T
3          0.02   False   < T
4          0.03   True    > T
5          0.05   True    > T

• What now?  (One option, marked above: discretize X with a threshold T.)

© Daniel S. Weld 27

Anything Else?

Instance   X      Spam?
1          0.01   False
2          0.01   False
3          0.02   False
4          0.03   True
5          0.05   True

[Plot: the instances placed on an axis from .01 to .05]

P(S | X=.04)?

© Daniel S. Weld 28

Fit Gaussians

Instance   X      Spam?
1          0.01   False
2          0.01   False
3          0.02   False
4          0.03   True
5          0.05   True

[Plot: one Gaussian fit to the spam instances and one to the non-spam instances, over the axis .01 to .05]
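One common way to handle a continuous attribute is to fit a Gaussian per class; a minimal sketch on the toy data above (the function names are mine):

```python
import math

# Toy data from the slide: (X, spam?) pairs.
data = [(0.01, False), (0.01, False), (0.02, False), (0.03, True), (0.05, True)]

def fit_gaussian(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)   # ML variance
    return mu, var

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

spam_mu, spam_var = fit_gaussian([x for x, s in data if s])
ham_mu,  ham_var  = fit_gaussian([x for x, s in data if not s])

# These densities play the role of P(X | S) and P(X | ¬S) in the
# Naive Bayes product.
print(gaussian_pdf(0.04, spam_mu, spam_var), gaussian_pdf(0.04, ham_mu, ham_var))
```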

© Daniel S. Weld 29

Smooth with Gaussian, then sum
"Kernel Density Estimation"

Instance   X      Spam?
1          0.01   False
2          0.01   False
3          0.02   False
4          0.03   True
5          0.05   True

[Plot: a small Gaussian centered on each instance, summed into a per-class density, over the axis .01 to .05]
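A matching sketch of the kernel-density alternative; the bandwidth is an arbitrary illustrative choice:

```python
import math

# Kernel density estimation: center a small Gaussian on every data point
# of a class and sum them to get a density for P(X | class).
def kde(x, points, bandwidth=0.01):
    k = lambda d: math.exp(-d * d / (2 * bandwidth ** 2)) / (bandwidth * math.sqrt(2 * math.pi))
    return sum(k(x - p) for p in points) / len(points)

spam_xs = [0.03, 0.05]          # X values of the spam instances above
ham_xs  = [0.01, 0.01, 0.02]    # X values of the non-spam instances

# Density at the query point X = 0.023 for each class:
print(kde(0.023, spam_xs), kde(0.023, ham_xs))
```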

© Daniel S. Weld 30

Spam?

Instance   X      Spam?
1          0.01   False
2          0.01   False
3          0.02   False
4          0.03   True
5          0.05   True

[Plot: the two class densities over the axis .01 to .05]

P(S | X=.023) vs. P(¬S | X=.023)
What's with the shape?


© Daniel S. Weld 31

Analysis

[Plot, x-axis: attribute value]

© Daniel S. Weld 32

Topics
• Test & Mini Projects
• Review
• Naive Bayes
• Expectation Maximization
  Review: Learning Bayesian Networks
  • Parameter Estimation
  • Structure Learning
  Hidden Nodes
• Challenge

© Daniel S. Weld 33

An Example Bayes Net

Earthquake   Burglary
Radio        Alarm
Nbr1Calls    Nbr2Calls

Pr(B=t) = 0.05   Pr(B=f) = 0.95

Pr(A | E, B):
  e, b:    0.9   (0.1)
  e, ¬b:   0.2   (0.8)
  ¬e, b:   0.85  (0.15)
  ¬e, ¬b:  0.01  (0.99)

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

We have: Bayes Net structure and observations
We need: Bayes Net parameters

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

P(A|E,B) = ?     P(A|E,¬B) = ?
P(A|¬E,B) = ?    P(A|¬E,¬B) = ?

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

P(A|E,B) = ?     P(A|E,¬B) = ?
P(A|¬E,B) = ?    P(A|¬E,¬B) = ?

Prior [plot over 0 to 1]  +  data  =  Posterior [plot over 0 to 1]

Now compute either MAP or Bayesian estimate
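A sketch of the counting behind these estimates on the five rows above; the helper name and the smoothing pseudo-counts are mine, and m = 0 would give the pure ML (raw count) estimate:

```python
# Estimate P(A | E, B) for each parent configuration by counting rows.
# Column order follows the slide: E, B, R, A, J, M.
rows = [
    "TFTTFT",
    "FFFFFT",
    "FTFTTT",
    "FFFTTT",
    "FTFFFF",
]

def p_alarm(e, b, prior=0.5, m=1):
    """Smoothed estimate of P(A=t | E=e, B=b); m=0 reduces to raw ML counts."""
    match = [r for r in rows if (r[0] == "T") == e and (r[1] == "T") == b]
    alarms = sum(r[3] == "T" for r in match)
    return (alarms + m * prior) / (len(match) + m)

print(p_alarm(False, True))   # P(A | ¬E, B): 1 alarm in 2 matching rows -> 0.5 after smoothing
print(p_alarm(True, True))    # no matching rows at all -> falls back to the prior
```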


© Daniel S. Weld 37

Recap
• Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables.

  [Example networks: Spam → Nigeria, Sex, Nude;  Earthqk, Burgl → Alarm → N1, N2]

What if we don’t know structure?

Learning The Structure of Bayesian Networks

• Search thru the space… of possible network structures!
  (for now, assume we observe all variables)
• For each structure, learn parameters
• Pick the one that fits observed data best

Problem!?!?  Caveat – won't we end up fully connected?
• When scoring, add a penalty ∝ model complexity

Learning The Structure of Bayesian Networks

• Search thru the space
• For each structure, learn parameters
• Pick the one that fits observed data best
• Problem?
  Exponential number of networks!
  And we need to learn parameters for each!
  Exhaustive search out of the question!
• So what now?

Learning The Structure of Bayesian Networks

Local search!
  Start with some network structure
  Try to make a change (add or delete or reverse an edge)
  See if the new network is any better

What should be the initial state?

Initial Network Structure?
• Uniform prior over random networks?

• Network which reflects expert knowledge?
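A toy sketch of the hill-climbing search just described, scoring each candidate network by its ML log-likelihood minus a complexity penalty. The data, penalty weight, and all names are illustrative, and edge reversal is omitted for brevity:

```python
import math
from itertools import product

# Local structure search: score = log-likelihood (with ML parameters)
# minus a penalty proportional to the number of parameters; greedily
# toggle single edges as long as the score improves.
DATA = [
    {"E": 1, "B": 0, "A": 1}, {"E": 0, "B": 0, "A": 0},
    {"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 0, "A": 0},
    {"E": 1, "B": 0, "A": 1}, {"E": 0, "B": 1, "A": 1},
]
VARS = ["E", "B", "A"]
PENALTY = 0.5   # weight on model complexity

def node_score(node, parents):
    ll = 0.0
    for pa in product([0, 1], repeat=len(parents)):
        rows = [r for r in DATA if all(r[p] == v for p, v in zip(parents, pa))]
        if rows:
            k = sum(r[node] for r in rows)
            for c in (k, len(rows) - k):
                if c:
                    ll += c * math.log(c / len(rows))
    return ll - PENALTY * 2 ** len(parents)

def score(structure):                       # structure: node -> tuple of parents
    return sum(node_score(n, ps) for n, ps in structure.items())

def would_cycle(structure, child, parent):  # is child already an ancestor of parent?
    stack = [parent]
    while stack:
        n = stack.pop()
        if n == child:
            return True
        stack.extend(structure[n])
    return False

structure = {v: () for v in VARS}           # start from the empty network
improved = True
while improved:
    improved = False
    for child in VARS:
        for parent in VARS:
            if parent == child:
                continue
            ps = set(structure[child]) ^ {parent}        # toggle edge parent -> child
            if parent in ps and would_cycle(structure, child, parent):
                continue
            cand = {**structure, child: tuple(sorted(ps))}
            if score(cand) > score(structure):
                structure, improved = cand, True
print(structure)   # e.g. A typically ends up with parents drawn from {E, B}
```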


© Daniel S. Weld 43

Learning BN Structure: The Big Picture
• We described how to do MAP (and ML) learning of a Bayes net (including structure)
• How would Bayesian learning (of BNs) differ?
  Find all possible networks
  Calculate their posteriors
  When doing inference, return a weighted combination of predictions from all networks!

© Daniel S. Weld 45

Hidden Variables

• But we can't observe the disease variable
• Can't we learn without it?

© Daniel S. Weld 46

We –could–
• But we'd get a fully-connected network
• With 708 parameters (vs. 78)
  Much harder to learn!

© Daniel S. Weld 47

Chicken & Egg Problem
• If we knew that a training instance (patient) had the disease…
  It would be easy to learn P(symptom | disease)
  But we can't observe the disease, so we don't.
• If we knew the params, e.g. P(symptom | disease), then it'd be easy to estimate whether the patient had the disease.
  But we don't know these parameters.

© Daniel S. Weld 48

Expectation Maximization (EM)
(high-level version)

• Pretend we do know the parameters
  Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values
• Iterate until convergence!


© Daniel S. Weld 49

Simplest Version
• Mixture of two distributions
• Know: form of distribution & variance; mixture weight = .5
• Just need the mean of each distribution

[Plot: data points on an axis from .01 to .09]

© Daniel S. Weld 50

Input Looks Like

[Plot: unlabeled data points on an axis from .01 to .09]

© Daniel S. Weld 51

We Want to Predict

[Plot: the same axis from .01 to .09, with the two component distributions to be recovered (?)]

© Daniel S. Weld 52

Expectation Maximization (EM)
• Pretend we do know the parameters
  Initialize randomly: set θ1=?; θ2=?

[Plot: data and the two initial means on the axis from .01 to .09]

© Daniel S. Weld 53

Expectation Maximization (EM)
• Pretend we do know the parameters
  Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable

[Plot: data and the two candidate distributions on the axis from .01 to .09]

© Daniel S. Weld 54

Expectation Maximization (EM)
• Pretend we do know the parameters
  Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable

[Plot: data and the two candidate distributions on the axis from .01 to .09]


© Daniel S. Weld 55

Expectation Maximization (EM)
• Pretend we do know the parameters
  Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values

[Plot: data and the two distributions on the axis from .01 to .09]

© Daniel S. Weld 56

ML Mean of Single Gaussian

μ_ML = argmin_μ Σ_i (x_i – μ)²  =  (1/N) Σ_i x_i   (the sample mean)

[Plot: data points on the axis from .01 to .09]

© Daniel S. Weld 57

Expectation Maximization (EM)
• Pretend we do know the parameters
  Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values

[Plot: data and the updated distributions on the axis from .01 to .09]

© Daniel S. Weld 58

Iterate

© Daniel S. Weld 59

Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable

[Plot: data and the re-estimated distributions on the axis from .01 to .09]

© Daniel S. Weld 60

Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values

[Plot: data and the re-estimated distributions on the axis from .01 to .09]


© Daniel S. Weld 61

Expectation Maximization (EM)
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values

[Plot: data and the converged distributions on the axis from .01 to .09]

© Daniel S. Weld 62

Until Convergence
• Problems
  Need to assume the form of the distribution
  Local maxima
• But
  It really works in practice!
  Can easily extend to multiple variables
  • E.g. mean & variance
  • Or much more complex models…
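Putting the whole loop together for the two-mean case above: a minimal sketch assuming a known shared variance and fixed .5/.5 mixing weights. The data values, variance, and iteration count are illustrative, not the slides' numbers.

```python
import math, random

# EM for a mixture of two Gaussians with known shared variance and
# fixed 0.5/0.5 mixing weights: only the two means are learned.
data = [0.01, 0.02, 0.02, 0.03, 0.06, 0.07, 0.08, 0.09]   # illustrative
sigma2 = 0.0002                                            # known variance

def density(x, mu):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2))

mu = [random.uniform(0.0, 0.1), random.uniform(0.0, 0.1)]  # random init
for _ in range(50):
    # E step: responsibility P(component 0 | x) for every point.
    resp = []
    for x in data:
        d0, d1 = density(x, mu[0]), density(x, mu[1])
        resp.append(d0 / (d0 + d1))
    # M step: each point counts fractionally toward both means.
    w0 = sum(resp)
    mu[0] = sum(r * x for r, x in zip(resp, data)) / w0
    mu[1] = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - w0)

# Typically converges to means near ~0.02 and ~0.075, though EM can
# get stuck in a local maximum depending on the random initialization.
print(sorted(mu))
```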

© Daniel S. Weld 63

Crossword Puzzles