machine learning 10-601ninamf/courses/601sp15/slides/03...jan 21, 2015 · reasonings concerning...
TRANSCRIPT
![Page 1: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/1.jpg)
Machine Learning 10-601 Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 21, 2015
Today: • Bayes Rule • Estimating parameters
• MLE • MAP
Readings: Probability review • Bishop Ch. 1 thru 1.2.3 • Bishop, Ch. 2 thru 2.2 • Andrew Moore’s online
tutorial some of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, Eric Xing, Carlos Guestrin. - Thanks!
![Page 2: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/2.jpg)
Announcements • Class is using Piazza for questions/discussions
about homeworks, etc. – see class website for Piazza address – http://www.cs.cmu.edu/~ninamf/courses/601sp15/
• Recitations thursdays 7-8pm, Wean 5409 – videos for future recitations (class website)
• HW1 was accepted to Sunday 5pm for full credit • HW2 out today on class website, due in 1 week • HW3 will involve programming (in Octave )
![Page 3: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/3.jpg)
P(B|A) * P(A)
P(B) P(A|B) =
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…
Bayes’ rule
we call P(A) the “prior” and P(A|B) the “posterior”
![Page 4: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/4.jpg)
Other Forms of Bayes Rule
)(~)|~()()|()()|()|(
APABPAPABPAPABPBAP
+=
)()()|()|(
XBPXAPXABPXBAP
∧
∧∧=∧
P(B|A) * P(A)
P(B) P(A|B) =
![Page 5: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/5.jpg)
Applying Bayes Rule
€
P(A |B) =P(B | A)P(A)
P(B | A)P(A) + P(B |~ A)P(~ A)
A = you have the flu, B = you just coughed Assume: P(A) = 0.05 P(B|A) = 0.80 P(B| ~A) = 0.20 what is P(flu | cough) = P(A|B)?
![Page 6: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/6.jpg)
what does all this have to do with function approximation?
instead of F: X àY, learn P(Y | X)
![Page 7: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/7.jpg)
The Joint Distribution
Recipe for making a joint distribution of M variables:
[A. Moore]
A B C Prob 0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
A
B
C 0.05
0.25
0.10 0.05 0.05
0.10
0.10 0.30
Example: Boolean variables A, B, C
![Page 8: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/8.jpg)
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all
combinations of values (M Boolean variables à 2M rows).
[A. Moore]
A B C Prob 0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
A
B
C 0.05
0.25
0.10 0.05 0.05
0.10
0.10 0.30
Example: Boolean variables A, B, C
![Page 9: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/9.jpg)
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all
combinations of values (M Boolean variables à 2M rows).
2. For each combination of values, say how probable it is.
[A. Moore]
A B C Prob 0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
A
B
C 0.05
0.25
0.10 0.05 0.05
0.10
0.10 0.30
Example: Boolean variables A, B, C
![Page 10: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/10.jpg)
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all
combinations of values (M Boolean variables à 2M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those probabilities must sum to 1.
[A. Moore]
A B C Prob 0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10
A
B
C 0.05
0.25
0.10 0.05 0.05
0.10
0.10 0.30
Example: Boolean variables A, B, C
![Page 11: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/11.jpg)
Using the Joint Distribution
One you have the JD you can ask for the probability of any logical expression involving these variables
∑=E
PEP matching rows
)row()(
[A. Moore]
![Page 12: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/12.jpg)
Using the Joint
P(Poor Male) = 0.4654 ∑=E
PEP matching rows
)row()(
[A. Moore]
![Page 13: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/13.jpg)
Using the Joint
P(Poor) = 0.7604 ∑=E
PEP matching rows
)row()(
[A. Moore]
![Page 14: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/14.jpg)
Inference with the Joint
∑
∑=
∧=
2
2 1
matching rows
and matching rows
2
2121 )row(
)row(
)()()|(
E
EE
P
P
EPEEPEEP
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
[A. Moore]
![Page 15: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/15.jpg)
Learning and the Joint Distribution
Suppose we want to learn the function f: <G, H> à W Equivalently, P(W | G, H) Solution: learn joint distribution from data, calculate P(W | G, H) e.g., P(W=rich | G = female, H = 40.5- ) =
[A. Moore]
![Page 16: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/16.jpg)
sounds like the solution to learning F: X àY,
or P(Y | X).
Are we done?
![Page 17: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/17.jpg)
sounds like the solution to learning F: X àY,
or P(Y | X). Main problem: learning P(Y|X)
can require more data than we have
consider learning Joint Dist. with 100 attributes # of rows in this table? # of people on earth? fraction of rows with 0 training examples?
![Page 18: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/18.jpg)
What to do? 1. Be smart about how we estimate
probabilities from sparse data – maximum likelihood estimates – maximum a posteriori estimates
2. Be smart about how to represent joint distributions
– Bayes networks, graphical models
![Page 19: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/19.jpg)
1. Be smart about how we estimate probabilities
![Page 20: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/20.jpg)
Estimating Probability of Heads X=1 X=0
![Page 21: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/21.jpg)
Estimating θ = P(X=1) Test A: 100 flips: 51 Heads (X=1), 49 Tails (X=0) Test B: 3 flips: 2 Heads (X=1), 1 Tails (X=0)
X=1 X=0
![Page 22: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/22.jpg)
Case C: (online learning) • keep flipping, want single learning algorithm
that gives reasonable estimate after each flip
X=1 X=0
Estimating θ = P(X=1)
![Page 23: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/23.jpg)
Principles for Estimating Probabilities
Principle 1 (maximum likelihood): • choose parameters θ that maximize P(data | θ) • e.g., Principle 2 (maximum a posteriori prob.): • choose parameters θ that maximize P(θ | data) • e.g.
![Page 24: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/24.jpg)
Maximum Likelihood Estimation P(X=1) = θ P(X=0) = (1-θ) Data D: Flips produce data D with heads, tails • flips are independent, identically distributed 1’s and 0’s
(Bernoulli) • and are counts that sum these outcomes (Binomial)
X=1 X=0
![Page 25: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/25.jpg)
Maximum Likelihood Estimate for Θ
[C. Guestrin]
![Page 26: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/26.jpg)
hint:
![Page 27: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/27.jpg)
Summary: Maximum Likelihood Estimate
X=1 X=0 P(X=1) = θ
P(X=0) = 1-θ (Bernoulli)
![Page 28: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/28.jpg)
Principles for Estimating Probabilities Principle 1 (maximum likelihood): • choose parameters θ that maximize
P(data | θ) Principle 2 (maximum a posteriori prob.): • choose parameters θ that maximize P(θ | data) = P(data | θ) P(θ) P(data)
![Page 29: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/29.jpg)
Beta prior distribution – P(θ)
![Page 30: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/30.jpg)
Beta prior distribution – P(θ)
[C. Guestrin]
![Page 31: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/31.jpg)
and MAP estimate is therefore
![Page 32: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/32.jpg)
and MAP estimate is therefore
![Page 33: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/33.jpg)
Some terminology • Likelihood function: P(data | θ) • Prior: P(θ) • Posterior: P(θ | data)
• Conjugate prior: P(θ) is the conjugate prior for likelihood function P(data | θ) if the forms of P(θ) and P(θ | data) are the same.
![Page 34: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/34.jpg)
You should know
• Probability basics – random variables, conditional probs, … – Bayes rule – Joint probability distributions – calculating probabilities from the joint distribution
• Estimating parameters from data – maximum likelihood estimates – maximum a posteriori estimates – distributions – binomial, Beta, Dirichlet, … – conjugate priors
![Page 35: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/35.jpg)
Extra slides
![Page 36: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/36.jpg)
Independent Events • Definition: two events A and B are
independent if P(A ^ B)=P(A)*P(B) • Intuition: knowing A tells us nothing
about the value of B (and vice versa)
![Page 37: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/37.jpg)
Picture “A independent of B”
![Page 38: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/38.jpg)
Expected values Given a discrete random variable X, the expected value
of X, written E[X] is Example:
X P(X) 0 0.3 1 0.2 2 0.5
![Page 39: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/39.jpg)
Expected values Given discrete random variable X, the expected value of
X, written E[X] is We also can talk about the expected value of functions
of X
![Page 40: Machine Learning 10-601ninamf/courses/601sp15/slides/03...Jan 21, 2015 · reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any](https://reader034.vdocuments.site/reader034/viewer/2022051603/5fecbc35867da24f7b78d3f4/html5/thumbnails/40.jpg)
Covariance Given two discrete r.v.’s X and Y, we define the
covariance of X and Y as e.g., X=gender, Y=playsFootball or X=gender, Y=leftHanded Remember: