CSE 446: Point Estimation
Winter 2012
Dan Weld
Slides adapted from Carlos Guestrin (& Luke Zettlemoyer)
Random Variable
• Discrete or Continuous
  – Boolean: like an atom in probabilistic logic
    • Denotes some event (e.g., a die will come up “6”)
  – Continuous: domain is the real numbers
    • Denotes some quantity (e.g., the age of a person taken from CSE 446)
• Probability Distribution
• (Joint)
Bayesian Methods
• Learning and classification methods based on probability theory.
  – Uses the prior probability of each category, given no information about an item.
  – Produces a posterior probability distribution over possible categories, given a description of an item.
• Bayes theorem plays a critical role in probabilistic learning and classification.
Axioms of Probability Theory

• All probabilities between 0 and 1: 0 ≤ P(A) ≤ 1
• Probability of truth and falsity: P(true) = 1, P(false) = 0
• The probability of disjunction is:

  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
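As a quick numeric check of the disjunction axiom, here is a minimal sketch using made-up probabilities for two overlapping events (the specific values are illustrative, not from the slide):

```python
# Made-up probabilities for two overlapping events A and B.
p_a = 0.30
p_b = 0.50
p_a_and_b = 0.15

# Disjunction axiom: P(A ∨ B) = P(A) + P(B) − P(A ∧ B).
# Subtracting the overlap avoids double counting it.
p_a_or_b = p_a + p_b - p_a_and_b
print(p_a_or_b)  # ≈ 0.65
```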
De Finetti (1931): An agent who bets according to probabilities that violate these axioms can be forced to bet so as to lose money, regardless of the outcome.
Terminology
• Joint Probability
• Conditional Probability (the value of X is given)
• Marginal Probability
Conditional Probability
• P(A | B) is the probability of A given B
• Assumes:
  – B is all and only the information known.
• Defined by:

  P(A | B) = P(A ∧ B) / P(B)
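The definition above can be exercised directly; the numbers here are made up for illustration:

```python
# Illustrative (made-up) probabilities.
p_b = 0.4        # P(B)
p_a_and_b = 0.1  # P(A ∧ B)

# Definition of conditional probability: P(A|B) = P(A ∧ B) / P(B).
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 0.25
```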
Probability Theory
Sum Rule: P(X) = Σ_Y P(X, Y)

Product Rule: P(X, Y) = P(Y | X) P(X)
Monty Hall
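The figures on the Monty Hall slides did not survive extraction. As a sketch, a simulation under the standard rules (the host always opens a goat door the player did not pick, then offers a switch) recovers the classic result — switching wins about 2/3 of the time, sticking only 1/3:

```python
import random

def monty_hall(switch, trials=100_000, seed=0):
    """Estimate the win probability of sticking vs. switching,
    under the standard rules: the host always opens a goat door
    the player did not pick."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # Host opens some door that hides a goat and was not picked.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials
```

Running `monty_hall(True)` gives roughly 0.667 and `monty_hall(False)` roughly 0.333: switching wins exactly when the initial pick was wrong, which happens with probability 2/3.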
Independence
• A and B are independent iff:

  P(A | B) = P(A)
  P(B | A) = P(B)

• Therefore, if A and B are independent:

  P(A | B) = P(A ∧ B) / P(B) = P(A)
  P(A ∧ B) = P(A) P(B)

These constraints are logically equivalent
© Daniel S. Weld
Independence

P(A ∧ B) = P(A) P(B)
Independence Is Rare

P(A) = 25%
P(B) = 80%
P(A|B) = ?  (1/7)

A & B are not independent, since P(A|B) ≠ P(A)
Conditional Independence….?
Are A & B independent?  P(A|B) ≟ P(A)

P(A) = (.25 + .5)/2 = .375
P(B) = .75
P(A|B) = (.25 + .25 + .5)/3 = .3333
A, B Conditionally Independent Given C
P(A | B, C) = P(A | C)        (C = spots)

P(A | C) = .50
P(B | C) = .333
P(A | B, C) = .5
Bayes Theorem                                   (Thomas Bayes, 1702–1761)

P(H | E) = P(E | H) P(H) / P(E)

Simple proof from definition of conditional probability:

1. P(H | E) = P(H ∧ E) / P(E)        (Def. cond. prob.)
2. P(E | H) = P(H ∧ E) / P(H)        (Def. cond. prob.)
3. P(H ∧ E) = P(E | H) P(H)          (Mult. both sides of 2 by P(H).)
4. P(H | E) = P(E | H) P(H) / P(E)   (Substitute 3 in 1.)  QED
Medical Diagnosis: Encephalitis Computadoris

Extremely Nasty. Extremely Rare: 1/100M.
DNA Testing: 0% false negatives, 1/10,000 false positives.

P(H | E) = P(E | H) P(H) / P(E)

Scared?  What is P(E)?

P(H|E) = 1 × 0.00000001 / 0.00010001 ≈ 0.000099

In 100M people: 1 true case + 10,000 false positives
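The slide's arithmetic can be checked directly with Bayes rule, using the numbers given there:

```python
# Numbers from the slide: prior of 1 in 100M, perfect sensitivity,
# 1-in-10,000 false positive rate.
p_h = 1e-8              # P(H): having the disease
p_e_given_h = 1.0       # P(E | H): positive test if sick (0% false negatives)
p_e_given_not_h = 1e-4  # P(E | ¬H): false positive rate

# Total probability of a positive test, then Bayes rule.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)  # ≈ 0.000099: even after a positive test, H is very unlikely
```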
Your first consulting job

• Billionaire (eccentric) Eastside tech founder asks:
  – He says: I have a thumbtack; if I flip it, what’s the probability it will fall with the nail up?
  – You say: Please flip it a few times:
  – You say: The probability is: P(H) = 3/5
  – He says: Why???
  – You say: Because…
Thumbtack – Binomial Distribution
• P(Heads) = θ, P(Tails) = 1 − θ
• Flips are i.i.d.:
  – Independent events
  – Identically distributed according to Binomial distribution
• Sequence D of H Heads and T Tails
Maximum Likelihood Estimation

• Data: Observed set D of H Heads and T Tails
• Hypothesis: Binomial distribution
• Learning: finding θ is an optimization problem
  – What’s the objective function?
• MLE: Choose θ to maximize the probability of D:

  θ̂ = argmax_θ P(D | θ)
Your first parameter learning algorithm
• The log-likelihood is ln P(D | θ) = H ln θ + T ln(1 − θ)
• Set derivative to zero, and solve!

  d/dθ [H ln θ + T ln(1 − θ)] = H/θ − T/(1 − θ) = 0  ⇒  θ̂ = H / (H + T)
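The closed form that falls out of the derivative is trivial to implement; a minimal sketch:

```python
def mle_theta(heads, tails):
    """Closed-form MLE for the Binomial parameter: setting the derivative
    of H ln(theta) + T ln(1 - theta) to zero gives theta = H / (H + T)."""
    return heads / (heads + tails)

print(mle_theta(3, 2))    # 0.6
print(mle_theta(30, 20))  # 0.6 -- same estimate from ten times the data
```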
But, how many flips do I need?
• Billionaire says: I flipped 3 heads and 2 tails.
• You say: θ = 3/5, I can prove it!
• He says: What if I flipped 30 heads and 20 tails?
• You say: Same answer, I can prove it!
• He says: What’s better?
• You say: Umm… The more the merrier???
• He says: Is this why I am paying you the big bucks???
A bound (from Hoeffding’s inequality)

For N = H + T, let θ* be the true parameter; then for any ε > 0:

P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2Nε²)

[Plot: Prob. of Mistake vs. N — exponential decay!]
PAC Learning

• PAC: Probably Approximately Correct
• Billionaire says: I want to know the thumbtack parameter θ, within ε = 0.1, with probability at least 1 − δ = 0.95.
• How many flips? Or, how big do I set N?

From the Hoeffding bound, it suffices that 2e^(−2Nε²) ≤ δ, i.e. N ≥ ln(2/δ) / (2ε²)

Interesting! Let’s look at some numbers: ε = 0.1, δ = 0.05
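Solving the Hoeffding bound for N can be sketched in a few lines (the helper name is ours, not from the slide):

```python
import math

def flips_needed(eps, delta):
    """Smallest N with 2 * exp(-2 * N * eps**2) <= delta,
    i.e. N >= ln(2 / delta) / (2 * eps**2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(flips_needed(0.1, 0.05))  # 185 flips suffice for eps=0.1, delta=0.05
```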
What if I have prior beliefs?

• Billionaire says: Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?
• You say: I can learn it the Bayesian way…
• Rather than estimating a single θ, we obtain a distribution over possible values of θ

[Figure: prior over θ in the beginning → observe flips (e.g.: {tails, tails}) → posterior over θ after observations]
Bayesian Learning

Use Bayes rule:

P(θ | D) = P(D | θ) P(θ) / P(D)

i.e., Posterior = Data Likelihood × Prior / Normalization

Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

Also, for uniform priors — P(θ) ∝ 1 — this reduces to the MLE objective
Bayesian Learning for Thumbtacks
Likelihood function is Binomial:

P(D | θ) = θ^H (1 − θ)^T

• What about the prior?
  – Represents expert knowledge
  – Simple posterior form
• Conjugate priors:
  – Closed-form representation of posterior
  – For the Binomial, the conjugate prior is the Beta distribution
Beta prior distribution – P(θ)

P(θ) = θ^(αH − 1) (1 − θ)^(αT − 1) / B(αH, αT)  ~  Beta(αH, αT)

• Likelihood function: P(D | θ) = θ^H (1 − θ)^T
• Posterior: P(θ | D) ∝ P(D | θ) P(θ)
Posterior Distribution

• Prior: Beta(αH, αT)
• Data: H heads and T tails
• Posterior distribution:

  P(θ | D) ~ Beta(αH + H, αT + T)
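The conjugate update is just counting, which a tiny sketch makes concrete:

```python
def beta_posterior(alpha_h, alpha_t, heads, tails):
    """Conjugacy: a Beta(aH, aT) prior times the Binomial likelihood for
    H heads and T tails yields a Beta(aH + H, aT + T) posterior --
    the observed counts are simply added to the prior's pseudo-counts."""
    return alpha_h + heads, alpha_t + tails

print(beta_posterior(2, 2, 3, 2))  # (5, 4)
```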
Bayesian Posterior Inference

• Posterior distribution: P(θ | D) ~ Beta(αH + H, αT + T)
• Bayesian inference:
  – No longer a single parameter
  – For any specific f, the function of interest, compute the expected value of f:

    E[f(θ)] = ∫₀¹ f(θ) P(θ | D) dθ

  – Integral is often hard to compute
MAP: Maximum a Posteriori Approximation
• As more data is observed, the Beta posterior becomes more certain (more sharply peaked)
• MAP: use the most likely parameter θ̂ = argmax_θ P(θ | D) to approximate the expectation: E[f(θ)] ≈ f(θ̂)
MAP for Beta distribution
MAP: use the most likely parameter:

θ̂ = argmax_θ P(θ | D) = (αH + H − 1) / (αH + αT + H + T − 2)

• A Beta prior is equivalent to extra thumbtack flips
• As N → ∞, the prior is “forgotten”
• But for small sample sizes, the prior is important!
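A sketch of the MAP estimate (the mode of the Beta posterior), showing both points from the slide — a uniform prior reproduces the MLE, while a strong prior dominates a small sample:

```python
def map_theta(alpha_h, alpha_t, heads, tails):
    """Mode of the Beta(aH + H, aT + T) posterior; the prior acts like
    (aH - 1) extra heads and (aT - 1) extra tails."""
    return (alpha_h + heads - 1) / (alpha_h + alpha_t + heads + tails - 2)

print(map_theta(1, 1, 3, 2))    # uniform Beta(1,1) prior: 0.6, same as the MLE
print(map_theta(50, 50, 3, 2))  # strong 50-50 prior pulls the estimate near 0.5
```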
Envelope Problem

One envelope (A or B) has twice as much money as the other.

Switch?
Envelope Problem (continued)

Suppose envelope A has $20. Then B has either $10 or $40, so:

E[B] = ½ (10 + 40) = $25

Switch?
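The seductive arithmetic on the slide is easy to reproduce; the buried assumption (flagged in the comment) is that, given A = $20, the two cases for B are equally likely:

```python
# The naive expected-value argument: A holds $20, so B holds $10 or $40.
# ASSUMING each case has probability 1/2 (the questionable step) gives:
a = 20
e_b = 0.5 * (a / 2) + 0.5 * (2 * a)
print(e_b)  # 25.0 -- which would seem to say "always switch", for any A
```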
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) preserve Gaussianity:
  – X ~ N(μ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
• Sum of independent Gaussians is Gaussian:
  – X ~ N(μX, σX²)
  – Y ~ N(μY, σY²)
  – Z = X + Y  ⇒  Z ~ N(μX + μY, σX² + σY²)
• Easy to differentiate, as we will see soon!
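The two closure properties above amount to simple parameter arithmetic; a minimal sketch (function names are ours, and the sum rule assumes X and Y are independent so the variances add):

```python
def affine(mu, var, a, b):
    """X ~ N(mu, var)  =>  aX + b ~ N(a*mu + b, a^2 * var)."""
    return a * mu + b, a * a * var

def sum_indep(mu_x, var_x, mu_y, var_y):
    """Independent X ~ N(mu_x, var_x), Y ~ N(mu_y, var_y)
    =>  X + Y ~ N(mu_x + mu_y, var_x + var_y)."""
    return mu_x + mu_y, var_x + var_y

print(affine(0.0, 1.0, 2.0, 3.0))     # (3.0, 4.0)
print(sum_indep(1.0, 2.0, 3.0, 4.0))  # (4.0, 6.0)
```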
Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: μ
  – Variance: σ²

| i  | Exam Score |
|----|------------|
| 0  | 85         |
| 1  | 95         |
| 2  | 100        |
| 3  | 12         |
| …  | …          |
| 99 | 89         |
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x₁, …, x_N}:

  P(D | μ, σ) = (1 / (σ√(2π)))^N ∏ᵢ e^(−(xᵢ − μ)² / (2σ²))

• Log-likelihood of data:

  ln P(D | μ, σ) = −N ln(σ√(2π)) − Σᵢ (xᵢ − μ)² / (2σ²)
Your second learning algorithm: MLE for mean of a Gaussian

• What’s the MLE for the mean? Set the derivative to zero:

  ∂/∂μ ln P(D | μ, σ) = (1/σ²) Σᵢ (xᵢ − μ) = 0  ⇒  μ̂_MLE = (1/N) Σᵢ xᵢ
MLE for variance

• Again, set derivative to zero:

  ∂/∂σ ln P(D | μ, σ) = 0  ⇒  σ̂²_MLE = (1/N) Σᵢ (xᵢ − μ̂)²
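Both closed-form estimators can be sketched together, here applied to the first few exam scores from the earlier table:

```python
def gaussian_mle(xs):
    """MLE for a Gaussian: mu-hat is the sample mean, and sigma^2-hat is
    the average squared deviation from it (note the divide-by-N)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

print(gaussian_mle([85, 95, 100, 12]))  # (73.0, 1269.5)
```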
Learning Gaussian parameters
• MLE:

  μ̂_MLE = (1/N) Σᵢ xᵢ
  σ̂²_MLE = (1/N) Σᵢ (xᵢ − μ̂)²

• BTW, the MLE for the variance of a Gaussian is biased
  – The expected result of estimation is not the true parameter!
  – Unbiased variance estimator:

    σ̂²_unbiased = (1/(N − 1)) Σᵢ (xᵢ − μ̂)²
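A quick sketch contrasting the two variance estimators on the same sample data (the Bessel-corrected one is always slightly larger, since it divides by N − 1 instead of N):

```python
def variance_estimates(xs):
    """Return (MLE variance, unbiased variance) for a sample.
    The MLE divides the sum of squared deviations by N; the
    Bessel-corrected estimator divides by N - 1 and is unbiased."""
    n = len(xs)
    mu = sum(xs) / n
    ss = sum((x - mu) ** 2 for x in xs)
    return ss / n, ss / (n - 1)

print(variance_estimates([85, 95, 100, 12]))
```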
Bayesian learning of Gaussian parameters
• Conjugate priors:
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for the mean: a Gaussian (with its own mean and variance as hyperparameters)