
Page 1:

Statistical Learning
CSE 573
Lecture 16 slides, which overlap and fix several errors

© Daniel S. Weld

Page 2:

Logistics

• Team Meetings
• Midterm
  - Open book, notes
  - Studying
    - See AIMA exercises

Page 3:

573 Topics

Agency
Problem Spaces
Search
Knowledge Representation & Inference
Planning
Supervised Learning
  - Logic-Based
  - Probabilistic
Reinforcement Learning

Page 4:

Topics

• Parameter Estimation
  - Maximum Likelihood (ML)
  - Maximum A Posteriori (MAP)
  - Bayesian
  - Continuous case
• Learning Parameters for a Bayesian Network
• Naive Bayes
  - Maximum Likelihood estimates
  - Priors
• Learning Structure of Bayesian Networks

Page 5:

Coin Flip

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9

Which coin will I use?
P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Prior: probability of a hypothesis before we make any observations

Page 6:

Coin Flip

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9

Which coin will I use?
P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Uniform Prior: all hypotheses are equally likely before we make any observations

Page 7:

Experiment 1: Heads

Which coin did I use?
P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Page 8:

Experiment 1: Heads

Which coin did I use?
P(C1|H) = 0.066    P(C2|H) = 0.333    P(C3|H) = 0.600

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Posterior: probability of a hypothesis given data
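
The slides give only the resulting numbers; as a sketch (the function name and structure are my own, not from the slides), Bayes rule turns the priors and likelihoods into these posteriors for any sequence of flips:

```python
# Posterior over the three coins after a sequence of flips,
# via Bayes rule: P(Ci | data) ∝ P(data | Ci) · P(Ci).
def coin_posterior(flips, priors=(1/3, 1/3, 1/3), p_heads=(0.1, 0.5, 0.9)):
    unnorm = []
    for prior, ph in zip(priors, p_heads):
        like = 1.0
        for f in flips:                     # flips is a string like "HT"
            like *= ph if f == "H" else (1 - ph)
        unnorm.append(prior * like)
    z = sum(unnorm)                         # normalizing constant P(data)
    return [u / z for u in unnorm]

print(coin_posterior("H"))    # [0.066..., 0.333..., 0.600] — matches this slide
print(coin_posterior("HT"))   # [0.209..., 0.581..., 0.209...] — see Experiment 2
```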

Page 9:

Terminology

• Prior: probability of a hypothesis before we see any data
• Uniform Prior: a prior that makes all hypotheses equally likely
• Posterior: probability of a hypothesis after we see some data
• Likelihood: probability of the data given a hypothesis

Page 10:

Experiment 2: Tails

Which coin did I use?
P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Page 11:

Experiment 2: Tails

Which coin did I use?
P(C1|HT) = 0.21    P(C2|HT) = 0.58    P(C3|HT) = 0.21

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3


Page 13:

Your Estimate?

What is the probability of heads after two experiments?

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 1/3    P(C2) = 1/3    P(C3) = 1/3

Most likely coin: C2
Best estimate for P(H): P(H|C2) = 0.5

Page 14:

Your Estimate?

Most likely coin: C2 (P(C2) = 1/3)
Best estimate for P(H): P(H|C2) = 0.5

Maximum Likelihood Estimate: the best hypothesis that fits the observed data, assuming a uniform prior

Page 15:

Using Prior Knowledge

• Should we always use a uniform prior?
• Background knowledge:
  - Heads ⇒ we have a take-home midterm
  - Dan doesn't like take-homes…
  - ⇒ Dan is more likely to use a coin biased in his favor

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9

Page 16:

Using Prior Knowledge

We can encode it in the prior:
P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9

Page 17:

Experiment 1: Heads

Which coin did I use?
P(C1|H) = ?    P(C2|H) = ?    P(C3|H) = ?

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Page 18:

Experiment 1: Heads

Which coin did I use?
P(C1|H) = 0.006    P(C2|H) = 0.165    P(C3|H) = 0.829

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Compare with the ML posterior after Exp 1:
P(C1|H) = 0.066    P(C2|H) = 0.333    P(C3|H) = 0.600
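
The helper sketched after Page 8 reproduces these numbers once the non-uniform prior from Page 16 is passed in:

```python
# Reusing coin_posterior from the Page 8 sketch, now with the biased prior.
print(coin_posterior("H", priors=(0.05, 0.25, 0.70)))
# [0.0066, 0.1645, 0.8289] — the slide's 0.006 / 0.165 / 0.829
```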

Page 19:

Experiment 2: Tails

Which coin did I use?
P(C1|HT) = ?    P(C2|HT) = ?    P(C3|HT) = ?

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Page 20:

Experiment 2: Tails

Which coin did I use?
P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70


Page 22:

Your Estimate?

What is the probability of heads after two experiments?

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1) = 0.05    P(C2) = 0.25    P(C3) = 0.70

Most likely coin: C3
Best estimate for P(H): P(H|C3) = 0.9

Page 23:

Your Estimate?

Most likely coin: C3 (P(C3) = 0.70)
Best estimate for P(H): P(H|C3) = 0.9

Maximum A Posteriori (MAP) Estimate: the best hypothesis that fits the observed data, assuming a non-uniform prior

Page 24:

Did We Do The Right Thing?

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

Page 25:

Did We Do The Right Thing?

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

C2 and C3 are almost equally likely

Page 26:

A Better Estimate

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

Recall: P(H|HT) = Σi P(H|Ci) · P(Ci|HT)
               = 0.1 × 0.035 + 0.5 × 0.481 + 0.9 × 0.485 = 0.680

Page 27:

Bayesian Estimate

C1: P(H|C1) = 0.1    C2: P(H|C2) = 0.5    C3: P(H|C3) = 0.9
P(C1|HT) = 0.035    P(C2|HT) = 0.481    P(C3|HT) = 0.485

P(H|HT) = Σi P(H|Ci) · P(Ci|HT) = 0.680

Bayesian Estimate: minimizes prediction error, given data and (generally) assuming a non-uniform prior
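
As a minimal sketch (the function name is my own), the Bayesian estimate simply averages each coin's heads probability, weighted by its posterior:

```python
# Bayesian estimate: P(H | data) = Σi P(H | Ci) · P(Ci | data).
def bayes_estimate(posterior, p_heads=(0.1, 0.5, 0.9)):
    return sum(p * ph for p, ph in zip(posterior, p_heads))

print(bayes_estimate([0.035, 0.481, 0.485]))  # ≈ 0.680, as on the slide
```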

Page 28:

Comparison

After more experiments: HTH⁸ (i.e., HT followed by eight more heads, 10 flips total)

ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments: P(H) = 0.9
MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments: P(H) = 0.9
Bayesian: P(H) = 0.68; after 10 experiments: P(H) = 0.9
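
Using the two sketch helpers defined after Pages 8 and 27, the full comparison can be reproduced (assuming, as the slide's counts suggest, that the 10 experiments are HT followed by eight heads):

```python
flips = "HT" + "H" * 8                      # HTH⁸: 9 heads, 1 tail
uniform = coin_posterior(flips)             # ML setting: uniform prior
biased  = coin_posterior(flips, priors=(0.05, 0.25, 0.70))

p_heads = (0.1, 0.5, 0.9)
print(p_heads[max(range(3), key=lambda i: uniform[i])])  # ML pick  -> 0.9
print(p_heads[max(range(3), key=lambda i: biased[i])])   # MAP pick -> 0.9
print(round(bayes_estimate(biased), 2))                  # Bayesian -> 0.9
```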

Page 29:

Comparison

ML (Maximum Likelihood): easy to compute
MAP (Maximum A Posteriori): still easy to compute; incorporates prior knowledge
Bayesian: minimizes error ⇒ great when data is scarce; potentially much harder to compute

Page 30:

Summary For Now

                                 Prior      Hypothesis
Maximum Likelihood Estimate      Uniform    The most likely
Maximum A Posteriori Estimate    Any        The most likely
Bayesian Estimate                Any        Weighted combination

Page 31:

Continuous Case

• In the previous example, we chose from a discrete set of three coins
• In general, we have to pick from a continuous distribution of biased coins

Page 32:

Continuous Case

Page 33:

Continuous Case

[Figure: a density over the coin bias, plotted from 0 to 1]

Page 34:

Continuous Case

[Figure: densities over the coin bias (0 to 1), with a uniform prior and with background knowledge — the prior, the posterior after Exp 1: Heads, and the posterior after Exp 2: Tails]

Page 35:

Continuous Case

[Figure: posterior after 2 experiments, with a uniform prior and with background knowledge; the ML, MAP, and Bayesian estimates are marked]

Page 36:

After 10 Experiments...

[Figure: posterior after 10 experiments, with a uniform prior and with background knowledge; the ML, MAP, and Bayesian estimates are marked]

Page 37:

After 100 Experiments...
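
The slides do not name the family of priors behind these plots; a Beta prior is the standard conjugate choice for a coin bias, and a sketch under that assumption reproduces the qualitative picture — the posterior concentrates, and the ML, MAP, and Bayesian estimates converge as data accumulates:

```python
# Beta-Bernoulli update (an assumption: the slides don't name the prior).
# Prior Beta(a, b); after h heads and t tails the posterior is Beta(a+h, b+t).
def coin_estimates(h, t, a=1.0, b=1.0):
    ml = h / (h + t)                          # maximizes the likelihood alone
    map_ = (a + h - 1) / (a + b + h + t - 2)  # posterior mode (needs a+h, b+t > 1)
    bayes = (a + h) / (a + b + h + t)         # posterior mean
    return ml, map_, bayes

print(coin_estimates(9, 1))               # 10 flips, uniform Beta(1,1) prior
print(coin_estimates(90, 10, a=8, b=2))   # 100 flips, a prior favoring heads
```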

Page 38:

Topics

• Parameter Estimation
  - Maximum Likelihood (ML)
  - Maximum A Posteriori (MAP)
  - Bayesian
  - Continuous case
• Learning Parameters for a Bayesian Network
• Naive Bayes
  - Maximum Likelihood estimates
  - Priors
• Learning Structure of Bayesian Networks

Page 39:

Review: Conditional Probability

• P(A | B) is the probability of A given B
• Assumes that B is the only info known
• Defined by:

  P(A | B) = P(A ∧ B) / P(B)

[Venn diagram: events A and B, with their intersection A ∧ B, inside the space True]

Page 40:

Conditional Independence

[Venn diagram: events A and B overlapping inside the space True]

A & B not independent, since P(A|B) < P(A)

Page 41:

Conditional Independence

[Venn diagram: events A, B, and C, with the intersections A ∧ C and B ∧ C shown]

But: A & B are made independent by C

P(A|C) = P(A|B,C)

Page 42:

Bayes Rule

P(H|E) = P(E|H) P(H) / P(E)

Simple proof from the definition of conditional probability:

1. P(H|E) = P(H ∧ E) / P(E)        (def. cond. prob.)
2. P(E|H) = P(H ∧ E) / P(H)        (def. cond. prob.)
3. P(H ∧ E) = P(E|H) P(H)          (multiply #2 by P(H))
4. P(H|E) = P(E|H) P(H) / P(E)     (substitute #3 into #1)

QED

Page 43:

An Example Bayes Net

Nodes: Earthquake, Burglary, Radio, Alarm, Nbr1Calls, Nbr2Calls

Pr(B=t) = 0.05    Pr(B=f) = 0.95

Pr(A | E, B):
  e, b:    0.9   (0.1)
  e, ¬b:   0.2   (0.8)
  ¬e, b:   0.85  (0.15)
  ¬e, ¬b:  0.01  (0.99)
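
To make the factorization concrete: treating Radio as a child of Earthquake and the two calls as children of Alarm, as in the classic alarm example (the slide shows only the node names and two CPTs), the chain rule gives P(E,B,R,A,J,M) = P(E)·P(B)·P(R|E)·P(A|E,B)·P(J|A)·P(M|A). A sketch of the alarm CPT as a lookup table:

```python
# The two CPTs shown on the slide, as lookup tables.
p_b = {True: 0.05, False: 0.95}                      # Pr(B)
p_a_given_eb = {                                     # Pr(A=true | E, B)
    (True, True): 0.9, (True, False): 0.2,
    (False, True): 0.85, (False, False): 0.01,
}

def pr_a(a, e, b):
    """Pr(A=a | E=e, B=b); the parenthesized slide values are 1 - Pr(A=true)."""
    p = p_a_given_eb[(e, b)]
    return p if a else 1 - p

print(p_b[True] * pr_a(True, False, True))  # Pr(b) · Pr(a | ¬e, b) = 0.05 · 0.85
```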

Page 44:

Given Parents, X is Independent of Non-Descendants

Page 45:

Given Markov Blanket, X is Independent of All Other Nodes

MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))
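
A small sketch (my own construction, not from the slides) of computing a Markov blanket from a parent map, using the network from Page 43 as the test case:

```python
# Markov blanket from a {node: set_of_parents} description of a DAG.
def markov_blanket(x, parents):
    children = {n for n, ps in parents.items() if x in ps}
    mb = set(parents[x]) | children
    for c in children:
        mb |= parents[c]                     # co-parents of x's children
    mb.discard(x)
    return mb

# Alarm network, assuming Radio is a child of Earthquake as in the classic example.
parents = {"E": set(), "B": set(), "R": {"E"}, "A": {"E", "B"},
           "J": {"A"}, "M": {"A"}}
print(markov_blanket("E", parents))          # {'A', 'B', 'R'} (set order may vary)
```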

Page 46:

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

We have: Bayes net structure and observations
We need: Bayes net parameters
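
The ML parameters are just counts. As a sketch over the five rows shown (the real data continues past them), with an optional pseudo-count that folds in a prior, Laplace-style:

```python
# Maximum-likelihood (and smoothed) estimates from complete data.
rows = [dict(zip("EBRAJM", r)) for r in
        ["TFTTFT", "FFFFFT", "FTFTTT", "FFFTTT", "FTFFFF"]]

def estimate(var, data, pseudo=0.0):
    """P(var = T), with pseudo-counts acting as a prior (pseudo=0 gives ML)."""
    hits = sum(r[var] == "T" for r in data)
    return (hits + pseudo) / (len(data) + 2 * pseudo)

print(estimate("B", rows))             # ML: 2/5 = 0.4
print(estimate("B", rows, pseudo=1))   # Laplace-smoothed: 3/7 ≈ 0.43
```

Conditional tables like P(A|E,B) work the same way, after first filtering the rows to those matching each parent assignment.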

Page 47:

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

P(B) = ?

[Figure: prior density over the parameter P(B), + data, = posterior density]

Now compute either the MAP or the Bayesian estimate

Page 48:

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

P(A|E,B) = ?     P(A|E,¬B) = ?
P(A|¬E,B) = ?    P(A|¬E,¬B) = ?

Page 49:

Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

P(A|E,B) = ?     P(A|E,¬B) = ?
P(A|¬E,B) = ?    P(A|¬E,¬B) = ?

[Figure: prior density over each parameter, + data, = posterior density]

Now compute either the MAP or the Bayesian estimate

Page 50:

Topics

• Parameter Estimation
  - Maximum Likelihood (ML)
  - Maximum A Posteriori (MAP)
  - Bayesian
  - Continuous case
• Learning Parameters for a Bayesian Network
• Naive Bayes
  - Maximum Likelihood estimates
  - Priors
• Learning Structure of Bayesian Networks

Page 51:

Recap

• Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables.

[Two example networks: a naive Bayes spam model (Spam → Nigeria, Sex, Nude) and the alarm network (Earthqk, Burgl → Alarm → N1, N2)]
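
The Topics list pairs Naive Bayes with ML estimates and priors; a minimal sketch matching the spam network above (the toy data, vocabulary, and names are my own — the slides provide none) shows both: the ML estimates are counts, and the pseudo-count plays the role of the prior.

```python
# Naive Bayes spam model: P(Spam | words) ∝ P(Spam) · Πw P(w | Spam).
# Toy training data (hypothetical); each row: (set of words present, is_spam).
data = [({"nigeria", "nude"}, True), ({"nigeria", "sex"}, True),
        ({"sex"}, False), (set(), False), ({"nude"}, False)]
VOCAB = ["nigeria", "sex", "nude"]

def train(data, pseudo=1.0):
    spam = [w for w, s in data if s]
    ham = [w for w, s in data if not s]
    def cond(docs):  # smoothed P(word present | class), per vocab word
        return {w: (sum(w in d for d in docs) + pseudo) / (len(docs) + 2 * pseudo)
                for w in VOCAB}
    return len(spam) / len(data), cond(spam), cond(ham)

def p_spam(words, prior, p_w_spam, p_w_ham):
    def lik(tbl):  # product over every vocab word: present or absent
        p = 1.0
        for w in VOCAB:
            p *= tbl[w] if w in words else 1 - tbl[w]
        return p
    s = prior * lik(p_w_spam)
    h = (1 - prior) * lik(p_w_ham)
    return s / (s + h)

prior, pws, pwh = train(data)
print(p_spam({"nigeria"}, prior, pws, pwh))  # ≈ 0.63, above the 0.4 spam prior
```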

Page 52:

What if we don't know the structure?

Page 53:

Learning The Structure of Bayesian Networks

• Search through the space… of possible network structures!
  (for now, assume we observe all variables)
• For each structure, learn parameters
• Pick the one that fits the observed data best
  - Caveat: won't we end up fully connected? Problem!?
• When scoring, add a penalty for model complexity

Page 54:

Learning The Structure of Bayesian Networks

• Search through the space
• For each structure, learn parameters
• Pick the one that fits the observed data best
• Problem?
  - Exponential number of networks!
  - And we need to learn parameters for each!
  - Exhaustive search is out of the question!
• So what now?

Page 55:

Learning The Structure of Bayesian Networks

Local search! (a sketch of the loop follows below)
  - Start with some network structure
  - Try to make a change (add, delete, or reverse an edge)
  - See if the new network is any better

What should be the initial state?
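
A hedged sketch of that local-search loop (the scoring function is abstracted away; in practice it would be a penalized likelihood, per the earlier slide on scoring):

```python
import itertools

# Local search over DAG structures: add / delete / reverse one edge at a time.
def has_cycle(nodes, edges):
    WHITE, GRAY, BLACK = 0, 1, 2               # DFS-based cycle check
    color = {n: WHITE for n in nodes}
    def visit(u):
        color[u] = GRAY
        for a, b in edges:
            if a == u:
                if color[b] == GRAY or (color[b] == WHITE and visit(b)):
                    return True
        color[u] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in nodes)

def neighbors(nodes, edges):
    for a, b in itertools.permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                    # delete edge
            yield (edges - {(a, b)}) | {(b, a)}       # reverse edge
        elif (b, a) not in edges:
            yield edges | {(a, b)}                    # add edge

def hill_climb(nodes, score, edges=frozenset()):
    """score(edges) -> float, e.g. a complexity-penalized log-likelihood."""
    best, best_s = edges, score(edges)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(nodes, set(best)):
            if has_cycle(nodes, cand):
                continue                              # keep the graph a DAG
            s = score(cand)
            if s > best_s:
                best, best_s, improved = frozenset(cand), s, True
    return best
```

Seeding `edges` with an expert-knowledge network, per the next slide, just means passing a non-empty initial structure instead of the empty graph.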

Page 56:

Initial Network Structure?

• Uniform prior over random networks?
• Network which reflects expert knowledge?

Page 57:

Learning BN Structure

Page 58:

The Big Picture

• We described how to do MAP (and ML) learning of a Bayes net (including structure)
• How would Bayesian learning (of BNs) differ?
  - Find all possible networks
  - Calculate their posteriors
  - When doing inference, return a weighted combination of the predictions from all networks!
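
In symbols (my notation for the bullet above), for a query X, data D, and candidate structures G:

P(X | D) = ΣG P(X | G, D) · P(G | D)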