![Page 1: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/1.jpg)
A Quick Overview of Probability
William W. Cohen
Machine Learning 10-605
![Page 2: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/2.jpg)
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context
![Page 3: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/3.jpg)
Twelve years later….
• Starting point: Google Books 5-gram data
  – All 5-grams that appear >= 40 times in a corpus of 1M English books
    • approx 80B words
    • 5-grams: 30Gb compressed, 250-300Gb uncompressed
    • Each 5-gram contains frequency distribution over years
  – Wrote code to compute
    • Pr(A,B,C,D,E|C=affect or C=effect)
    • Pr(any subset of A,…,E|any other fixed values of A,…,E with C=affect V effect)
![Page 4: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/4.jpg)
![Page 5: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/5.jpg)
Probabilistic and Bayesian Analytics
Andrew W. Moore
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/[email protected]
412-268-7599
Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Copyright © Andrew W. Moore
[Some material pilfered from http://www.cs.cmu.edu/~awm/tutorials]
![Page 6: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/6.jpg)
Tuesday’s Lecture - Review
• Intro
  – Who, Where, When - administrivia
  – Why – motivations
  – What/How – assignments, grading, …
• Review - How to count and what to count
  – Big-O and Omega notation, example, …
  – Costs of i/o vs computation
• What sort of computations do we want to do in (large-scale) machine learning programs?
  – Probability
![Page 7: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/7.jpg)
Probability - what you need to really, really know
![Page 8: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/8.jpg)
Probability - what you need to really, really know
• Probabilities are cool
![Page 9: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/9.jpg)
The Problem of Induction
• David Hume (1711-1776): pointed out
1. Empirically, induction seems to work
2. Statement (1) is an application of induction.
• This stumped people for about 200 years
1. Of the Different Species of Philosophy.
2. Of the Origin of Ideas
3. Of the Association of Ideas
4. Sceptical Doubts Concerning the Operations of the Understanding
5. Sceptical Solution of These Doubts
6. Of Probability
7. Of the Idea of Necessary Connexion
8. Of Liberty and Necessity
9. Of the Reason of Animals
10. Of Miracles
11. Of A Particular Providence and of A Future State
12. Of the Academical Or Sceptical Philosophy
![Page 10: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/10.jpg)
A Second Problem of Induction
• A black crow seems to support the hypothesis “all crows are black”.
• A pink highlighter supports the hypothesis “all non-black things are non-crows”
• Thus, a pink highlighter supports the hypothesis “all crows are black”.
$$\forall x\; \mathrm{CROW}(x) \Rightarrow \mathrm{BLACK}(x)$$
or equivalently
$$\forall x\; \neg\mathrm{BLACK}(x) \Rightarrow \neg\mathrm{CROW}(x)$$
![Page 11: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/11.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
![Page 12: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/12.jpg)
Discrete Random Variables
• A is a Boolean-valued random variable if
  – A denotes an event,
  – there is uncertainty as to whether A occurs.
• Examples
  – A = The US president in 2023 will be male
  – A = You wake up tomorrow with a headache
  – A = You have Ebola
  – A = the 1,000,000,000,000th digit of π is 7
• Define P(A) as “the fraction of possible worlds in which A is true”
  – We’re assuming all possible worlds are equally probable
![Page 13: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/13.jpg)
Discrete Random Variables
• A is a Boolean-valued random variable if
  – A denotes an event (a possible outcome of an “experiment”),
  – there is uncertainty as to whether A occurs (the experiment is not deterministic).
• Define P(A) as “the fraction of experiments in which A is true”
  – We’re assuming all possible outcomes are equiprobable
• Examples
  – You roll two 6-sided dice (the experiment) and get doubles (A=doubles, the outcome)
  – I pick two students in the class (the experiment) and they have the same birthday (A=same birthday, the outcome)
![Page 14: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/14.jpg)
Visualizing A
Event space of all possible worlds
Its area is 1
Worlds in which A is False
Worlds in which A is true
P(A) = Area of reddish oval
![Page 15: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/15.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• There is One True Way to talk about uncertainty: the Axioms of Probability
![Page 16: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/16.jpg)
The Axioms of Probability
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
Events, random variables, …., probabilities
“Dice”
“Experiments”
![Page 17: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/17.jpg)
The Axioms Of Probabi
lity
(This is Andrew’s joke)
![Page 18: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/18.jpg)
![Page 19: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/19.jpg)
These Axioms are Not to be Trifled With
• There have been many many other approaches to understanding “uncertainty”:
• Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …
• 25 years ago people in AI argued about these; now they mostly don’t
  – Any scheme for combining uncertain information, uncertain “beliefs”, etc,… really should obey these axioms
  – If you gamble based on “uncertain beliefs”, then [you can be exploited by an opponent] [your uncertainty formalism violates the axioms] - de Finetti 1931 (the “Dutch book argument”)
![Page 20: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/20.jpg)
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get any smaller than 0
And a zero area would mean no world could ever have A true
![Page 21: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/21.jpg)
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get any bigger than 1
And an area of 1 would mean all worlds will have A true
![Page 22: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/22.jpg)
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Figure: overlapping regions A and B]
![Page 23: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/23.jpg)
Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[Figure: overlapping regions A and B, with P(A or B) and P(A and B) marked]
Simple addition and subtraction
![Page 24: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/24.jpg)
Theorems from the Axioms
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
P(not A) = P(~A) = 1-P(A)
P(A or ~A) = 1 P(A and ~A) = 0
P(A or ~A) = P(A) + P(~A) - P(A and ~A)
1 = P(A) + P(~A) - 0
![Page 25: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/25.jpg)
Elementary Probability in Pictures
• P(~A) + P(A) = 1
A ~A
![Page 26: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/26.jpg)
Another important theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
P(A) = P(A ^ B) + P(A ^ ~B)
A = A and (B or ~B) = (A and B) or (A and ~B)
P(A) = P(A and B) + P(A and ~B) - P((A and B) and (A and ~B))
P(A) = P(A and B) + P(A and ~B) - P(A and A and B and ~B)
P(A) = P(A and B) + P(A and ~B) - P(False) = P(A and B) + P(A and ~B)
![Page 27: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/27.jpg)
Elementary Probability in Pictures
• P(A) = P(A ^ B) + P(A ^ ~B)
B
~B
A ^ ~B
A ^ B
![Page 28: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/28.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence
![Page 29: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/29.jpg)
Independent Events
• Definition: two events A and B are independent if Pr(A and B)=Pr(A)*Pr(B).
• Intuition: outcome of A has no effect on the outcome of B (and vice versa).
  – We need to assume the different rolls are independent to solve the problem.
  – You frequently need to assume the independence of something to solve any learning problem.
![Page 30: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/30.jpg)
Some practical problems
• You’re the DM in a D&D game.
• Joe brings his own d20 and throws 4 critical hits in a row to start off
  – DM=dungeon master
  – D20 = 20-sided die
  – “Critical hit” = 19 or 20
• What are the odds of that happening with a fair die?
• Ci=critical hit on trial i, i=1,2,3,4
• P(C1 and C2 … and C4) = P(C1)*…*P(C4) = (1/10)^4
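The product rule for independent events can be checked with exact arithmetic. This is a minimal sketch; the variable names are illustrative and not from the slides.

```python
from fractions import Fraction

# A fair d20 shows 19 or 20 (a "critical hit") on 2 of its 20 faces.
p_crit = Fraction(2, 20)

# Independence: P(C1 and C2 and C3 and C4) = P(C1) * P(C2) * P(C3) * P(C4).
p_four_in_a_row = p_crit ** 4

print(p_four_in_a_row, float(p_four_in_a_row))
```

Using `Fraction` avoids floating-point rounding, so the result is exactly (1/10)^4.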
![Page 31: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/31.jpg)
Multivalued Discrete Random Variables
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on exactly one value out of {v1,v2, .. vk}
  – Example: V={aaliyah, aardvark, …., zymurgy, zynga}
  – Example: V={aaliyah_aardvark, …, zynga_zymurgy}
• Thus…
$$P(A = v_i \wedge A = v_j) = 0 \quad \text{if } i \neq j$$
$$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_k) = 1$$
![Page 32: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/32.jpg)
Terms: Binomials and Multinomials
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on exactly one value out of {v1,v2, .. vk}
  – Example: V={aaliyah, aardvark, …., zymurgy, zynga}
  – Example: V={aaliyah_aardvark, …, zynga_zymurgy}
• The distribution Pr(A) is a multinomial
• For k=2 the distribution is a binomial
![Page 33: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/33.jpg)
More about Multivalued Random Variables
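• Using the axioms of probability and assuming that A obeys…
$$P(A = v_i \wedge A = v_j) = 0 \quad \text{if } i \neq j$$
$$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_k) = 1$$
• It’s easy to prove that
$$P(A = v_1 \vee A = v_2 \vee \cdots \vee A = v_i) = \sum_{j=1}^{i} P(A = v_j)$$
• And thus we can prove
$$\sum_{j=1}^{k} P(A = v_j) = 1$$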
![Page 34: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/34.jpg)
Elementary Probability in Pictures
$$\sum_{j=1}^{k} P(A = v_j) = 1$$
[Figure: event space partitioned into regions A=1, A=2, A=3, A=4, A=5]
![Page 35: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/35.jpg)
Elementary Probability in Pictures
$$\sum_{j=1}^{k} P(A = v_j) = 1$$
[Figure: event space partitioned into regions A=aardvark, A=aaliyah, A=…, A=zynga]
![Page 36: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/36.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
![Page 37: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/37.jpg)
A practical problem
• I have lots of standard d20 dice, lots of loaded dice, all identical.
• A loaded die will give a 19 or 20 (“critical hit”) half the time.
• In the game, someone hands me a random die, which is fair (A) or loaded (~A), with P(A) depending on how I mix the dice. Then I roll, and either get a critical hit (B) or not (~B)
• Can I mix the dice together so that P(B) is anything I want - say, P(B)= 0.137 ?
P(B) = P(B and A) + P(B and ~A) = 0.1*λ + 0.5*(1- λ) = 0.137
λ = (0.5 - 0.137)/0.4 = 0.9075
“mixture model”
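The mixing weight λ = P(A) can be solved for any reachable target P(B). The sketch below assumes the two component probabilities from the slide (0.1 for a fair die, 0.5 for a loaded one); the function name `fair_fraction` is made up for illustration. Note that only targets between the two component probabilities are reachable, since λ must lie in [0, 1].

```python
def fair_fraction(target, p_fair=0.1, p_loaded=0.5):
    """Return lambda = P(A) such that
    lambda * p_fair + (1 - lambda) * p_loaded == target."""
    lo, hi = min(p_fair, p_loaded), max(p_fair, p_loaded)
    if not lo <= target <= hi:
        raise ValueError("target must lie between the two component probabilities")
    return (p_loaded - target) / (p_loaded - p_fair)

lam = fair_fraction(0.137)
p_b = 0.1 * lam + 0.5 * (1 - lam)  # plug lambda back into the mixture
print(lam, p_b)
```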
![Page 38: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/38.jpg)
Another picture for this problem
A (fair die) ~A (loaded)
A and B ~A and B
It’s more convenient to say
• “if you’ve picked a fair die then …” i.e. Pr(critical hit|fair die)=0.1
• “if you’ve picked the loaded die then….” Pr(critical hit|loaded die)=0.5
Conditional probability: Pr(B|A) = P(B^A)/P(A)
P(B|A) P(B|~A)
![Page 39: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/39.jpg)
Definition of Conditional Probability
$$P(A \mid B) = \frac{P(A \wedge B)}{P(B)}$$
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
![Page 40: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/40.jpg)
Some practical problems
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?
P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2
“marginalizing out” A
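The marginalization step can be reproduced with exact fractions. A minimal sketch, with variable names chosen here for illustration:

```python
from fractions import Fraction

p_a = Fraction(3, 4)               # P(A): 3 of the 4 dice are fair
p_b_given_a = Fraction(1, 10)      # P(B|A): critical hit with a fair die
p_b_given_not_a = Fraction(1, 2)   # P(B|~A): critical hit with the loaded die

# "Marginalizing out" A: sum over both ways B can happen.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
print(p_b)
```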
![Page 41: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/41.jpg)
A (fair die) ~A (loaded)
A and B ~A and B P(B|A) P(B|~A)
P(A) P(~A)
P(B) = P(B|A)P(A) + P(B|~A)P(~A)
![Page 42: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/42.jpg)
Some practical problems
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die.
• Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e. what is P(A|B) ?
![Page 43: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/43.jpg)
A (fair die) ~A (loaded)
A and B ~A and B
P(B|A) P(B|~A)
P(A) P(~A)
P(A and B) = P(B|A) * P(A)
P(A and B) = P(A|B) * P(B)
P(A|B) * P(B) = P(B|A) * P(A)
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
A (fair die) ~A (loaded)
A and B ~A and B
P(B)
P(A|B) = ?
![Page 44: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/44.jpg)
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…
Bayes’ rule: P(A) is the prior, P(A|B) is the posterior
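Applied to the earlier dice question (3 fair d20, 1 loaded, and a roll of 19 or 20 observed), Bayes' rule gives the probability that the die is fair. This is a sketch; the helper name `bayes` is an assumption for illustration.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) P(A) / P(B), with P(B) obtained by
    marginalizing over A and ~A."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# A = die is fair, B = roll is 19 or 20.
posterior_fair = bayes(Fraction(1, 10), Fraction(3, 4), Fraction(1, 2))
print(posterior_fair)
```

The prior P(A) = 3/4 drops to a posterior of 3/8 once the critical hit is observed.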
![Page 45: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/45.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
![Page 46: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/46.jpg)
Some practical problems
• Joe throws 4 critical hits in a row, is Joe cheating?
• A = Joe using cheater’s die
• C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1
• B = C1 and C2 and C3 and C4
• Pr(B|A) = 0.0625 P(B|~A)=0.0001
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid {\sim}A)\,P({\sim}A)}$$
$$P(A \mid B) = \frac{0.0625 \cdot P(A)}{0.0625 \cdot P(A) + 0.0001 \cdot (1 - P(A))}$$
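The posterior still depends on the prior P(A), which the formula leaves open. The sketch below evaluates it for a few priors; the specific prior values in the loop are illustrative assumptions, and `posterior_cheating` is a made-up name.

```python
def posterior_cheating(prior, n_hits=4, p_hit_cheat=0.5, p_hit_fair=0.1):
    """P(A|B) where A = cheater's die, B = n_hits critical hits in a row."""
    like_cheat = p_hit_cheat ** n_hits   # P(B|A)  = 0.0625 for 4 hits
    like_fair = p_hit_fair ** n_hits     # P(B|~A) = 0.0001 for 4 hits
    return like_cheat * prior / (like_cheat * prior + like_fair * (1 - prior))

# Hypothetical priors, chosen only to show how the posterior moves.
for prior in (0.001, 0.01, 0.1):
    print(prior, round(posterior_cheating(prior), 4))
```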
![Page 47: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/47.jpg)
What’s the experiment and outcome here?
• Outcome A: Joe is cheating
• Experiment:
  – Joe picked a die uniformly at random from a bag containing 10,000 fair dice and one bad one.
  – Joe is a D&D player picked uniformly at random from set of 1,000,000 people and n of them cheat with probability p>0.
  – I have no idea, but I don’t like his looks. Call it P(A)=0.1
![Page 48: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/48.jpg)
Remember: Don’t Mess with The Axioms
• A subjective belief can be treated, mathematically, like a probability
  – Use those axioms!
• There have been many many other approaches to understanding “uncertainty”:
• Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …
• 25 years ago people in AI argued about these; now they mostly don’t
  – Any scheme for combining uncertain information, uncertain “beliefs”, etc,… really should obey these axioms
  – If you gamble based on “uncertain beliefs”, then [you can be exploited by an opponent] [your uncertainty formalism violates the axioms] - de Finetti 1931 (the “Dutch book argument”)
![Page 49: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/49.jpg)
Some practical problems
• Joe throws 4 critical hits in a row, is Joe cheating?
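• A = Joe using cheater’s die
• C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1
• B = C1 and C2 and C3 and C4
• Pr(B|A) = 0.0625 P(B|~A)=0.0001
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
$$\frac{P(A \mid B)}{P({\sim}A \mid B)} = \frac{P(B \mid A)\,P(A)/P(B)}{P(B \mid {\sim}A)\,P({\sim}A)/P(B)} = \frac{P(B \mid A)}{P(B \mid {\sim}A)} \cdot \frac{P(A)}{P({\sim}A)}$$
$$\frac{P(A \mid B)}{P({\sim}A \mid B)} = \frac{0.0625}{0.0001} \cdot \frac{P(A)}{P({\sim}A)} = 625 \cdot \frac{P(A)}{P({\sim}A)}$$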
Moral: with enough evidence the prior P(A) doesn’t really matter.
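In odds form, the evidence multiplies the prior odds by the likelihood ratio P(B|A)/P(B|~A), which for four critical hits works out to 0.0625/0.0001 = 625. A small sketch under that reading; the function names are illustrative, and the prior 0.1 is the "don't like his looks" value from the earlier slide.

```python
def posterior_odds(prior, likelihood_ratio):
    """Posterior odds P(A|B)/P(~A|B) = likelihood ratio * prior odds."""
    return likelihood_ratio * prior / (1 - prior)

def odds_to_prob(odds):
    """Convert odds back into a probability."""
    return odds / (1 + odds)

lr = 0.5 ** 4 / 0.1 ** 4            # P(B|A) / P(B|~A)
odds = posterior_odds(0.1, lr)      # prior P(A) = 0.1
print(lr, odds, odds_to_prob(odds))
```

Each additional critical hit multiplies the odds by another factor of 5, which is the sense in which enough evidence swamps the prior.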
![Page 50: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/50.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
![Page 51: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/51.jpg)
Some practical problems
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
[Figure: histogram of Frequency (0 to 6) against Face Shown (1 to 20)]
1. Collect some data (20 rolls)
2. Estimate Pr(i)=C(rolls of i)/C(any roll)
![Page 52: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/52.jpg)
One solution
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
[Figure: histogram of Frequency (0 to 6) against Face Shown (1 to 20)]
P(1)=0
P(2)=0
P(3)=0
P(4)=0.1
…
P(19)=0.25
P(20)=0.2
MLE = maximum likelihood estimate
But: Do I really think it’s impossible to roll a 1, 2 or 3? Would you bet your house on it?
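The MLE is just a ratio of counts. In the sketch below, the counts for faces 4, 19 and 20 follow the estimates on the slide (2, 5 and 4 out of 20 rolls); the remaining nine rolls are hypothetical filler, since the histogram itself is not recoverable from the transcript.

```python
from collections import Counter

# Counts for faces 4, 19, 20 match the slide; the rest are made up
# so that the total is 20 rolls.
counts = Counter({4: 2, 19: 5, 20: 4,
                  7: 1, 9: 1, 11: 2, 13: 1, 15: 2, 17: 1, 18: 1})
n_rolls = sum(counts.values())

# MLE: Pr(i) = C(rolls of i) / C(any roll); unseen faces get exactly 0.
mle = {face: counts[face] / n_rolls for face in range(1, 21)}

print(n_rolls, mle[1], mle[4], mle[19], mle[20])
```

A `Counter` returns 0 for faces that never appeared, which is exactly the behavior the slide questions.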
![Page 53: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/53.jpg)
A better solution
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
[Figure: histogram of Frequency (0 to 6) against Face Shown (1 to 20)]
0. Imagine some data (20 rolls, each i shows up 1x)
1. Collect some data (20 rolls)
2. Estimate Pr(i)=C(rolls of i)/C(any roll)
![Page 54: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/54.jpg)
A better solution
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
[Figure: histogram of Frequency (0 to 6) against Face Shown (1 to 20)]
P(1)=1/40
P(2)=1/40
P(3)=1/40
P(4)=(2+1)/40
…
P(19)=(5+1)/40
P(20)=(4+1)/40=1/8
$$\hat{\Pr}(i) = \frac{C(i) + 1}{C(\mathrm{ANY}) + C(\mathrm{IMAGINED})}$$
P(20): 0.2 vs. 0.125 – really different! Maybe I should “imagine” less data?
![Page 56: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/56.jpg)
A better solution?
$$\hat{\Pr}(i) = \frac{C(i) + 1}{C(\mathrm{ANY}) + C(\mathrm{IMAGINED})}$$
Q: What if I used m rolls with a probability of q=1/20 of rolling any i?
$$\hat{\Pr}(i) = \frac{C(i) + mq}{C(\mathrm{ANY}) + m}$$
I can use this formula with m>20, or even with m<20 … say with m=1
![Page 57: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/57.jpg)
A better solution
$$\hat{\Pr}(i) = \frac{C(i) + 1}{C(\mathrm{ANY}) + C(\mathrm{IMAGINED})}$$
Q: What if I used m rolls with a probability of q=1/20 of rolling any i?
$$\hat{\Pr}(i) = \frac{C(i) + mq}{C(\mathrm{ANY}) + m}$$
If m>>C(ANY) then your imagination q rules
If m<<C(ANY) then your data rules
BUT you never ever ever end up with Pr(i)=0
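The smoothed estimate with m imagined rolls is a one-line function. This is a minimal sketch; the function name is an assumption, and the counts (4 observed rolls of a face out of 20) are the P(20) example from the slides.

```python
def smoothed_estimate(count_i, count_any, m, q):
    """Estimate Pr(i) as (C(i) + m*q) / (C(ANY) + m), where m is the
    number of imagined rolls and q the imagined probability of face i."""
    return (count_i + m * q) / (count_any + m)

q = 1 / 20
for m in (0, 1, 20, 2000):
    # Face seen 4 times in 20 rolls, and a face never seen at all.
    print(m, smoothed_estimate(4, 20, m, q), smoothed_estimate(0, 20, m, q))
```

With m=0 the formula reduces to the MLE; with m=20 it reproduces the (4+1)/40 value; as m grows, every estimate is pulled toward q.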
![Page 58: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/58.jpg)
Terminology – more later
This is called a uniform Dirichlet prior
C(i), C(ANY) are sufficient statistics
$$\hat{\Pr}(i) = \frac{C(i) + mq}{C(\mathrm{ANY}) + m}$$
MLE = maximum likelihood estimate
MAP = maximum a posteriori estimate
![Page 59: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/59.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
![Page 60: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/60.jpg)
Some practical problems
• I have 1 standard d6 die, 2 loaded d6 dice.
• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50
• Experiment: pick two of the three d6 uniformly at random (A) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL) = P(D | A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(D|A=FL)*P(A=FL)
![Page 61: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/61.jpg)
Some practical problems
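• I have 1 standard d6 die, 2 loaded d6 dice.
• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50
• Experiment: pick two of the three d6 uniformly at random (A) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL

| Roll 2 \ Roll 1 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | D | | | | | 7 |
| 2 | | D | | | 7 | |
| 3 | | | D | 7 | | |
| 4 | | | 7 | D | | |
| 5 | | 7 | | | D | |
| 6 | 7 | | | | | D |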
![Page 62: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/62.jpg)
A brute-force solution
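| A | Roll 1 | Roll 2 | P | Comment |
|---|---|---|---|---|
| FL | 1 | 1 | 1/3 * 1/6 * 1/2 | doubles |
| FL | 1 | 2 | 1/3 * 1/6 * 1/10 | |
| FL | 1 | … | … | |
| … | 1 | 6 | | seven |
| FL | 2 | 1 | | |
| FL | 2 | … | | |
| … | … | … | | |
| FL | 6 | 6 | | doubles |
| HL | 1 | 1 | | doubles |
| HL | 1 | 2 | | |
| … | … | … | | |
| HF | 1 | 1 | | |
| … | | | | |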
A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1,x2,…., xk
With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g.
• P(doubles)
• P(seven or eleven)
• P(total is higher than 5)
• ….
![Page 63: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/63.jpg)
The Joint Distribution
Recipe for making a joint distribution of M variables:
Example: Boolean variables A, B, C
![Page 64: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/64.jpg)
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C

| A | B | C |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 1 |
![Page 65: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/65.jpg)
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C

| A | B | C | Prob |
|---|---|---|---|
| 0 | 0 | 0 | 0.30 |
| 0 | 0 | 1 | 0.05 |
| 0 | 1 | 0 | 0.10 |
| 0 | 1 | 1 | 0.05 |
| 1 | 0 | 0 | 0.05 |
| 1 | 0 | 1 | 0.10 |
| 1 | 1 | 0 | 0.25 |
| 1 | 1 | 1 | 0.10 |
![Page 66: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/66.jpg)
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C

| A | B | C | Prob |
|---|---|---|---|
| 0 | 0 | 0 | 0.30 |
| 0 | 0 | 1 | 0.05 |
| 0 | 1 | 0 | 0.10 |
| 0 | 1 | 1 | 0.05 |
| 1 | 0 | 0 | 0.05 |
| 1 | 0 | 1 | 0.10 |
| 1 | 1 | 0 | 0.25 |
| 1 | 1 | 1 | 0.10 |

[Figure: Venn diagram of A, B, C with the eight row probabilities placed in the corresponding regions]
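The recipe maps naturally onto a dictionary keyed by truth-table rows. The sketch below uses the A, B, C table from the slide; the helper `prob` is a made-up name that sums the rows matching a Boolean expression.

```python
# Joint distribution over Boolean A, B, C, copied from the table.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Step 3 of the recipe: the entries must sum to 1.
total = sum(joint.values())

def prob(event):
    """P(E) = sum of P(row) over the rows matching E."""
    return sum(p for (a, b, c), p in joint.items() if event(a, b, c))

p_a = prob(lambda a, b, c: a == 1)
p_b = prob(lambda a, b, c: b == 1)
p_a_and_b = prob(lambda a, b, c: a == 1 and b == 1)
p_a_or_b = prob(lambda a, b, c: a == 1 or b == 1)
print(total, p_a, p_b, p_a_and_b, p_a_or_b)
```

The last four numbers also satisfy the axiom P(A or B) = P(A) + P(B) - P(A and B).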
![Page 67: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/67.jpg)
Using the Joint
Once you have the JD you can ask for the probability of any logical expression involving your attributes
$$P(E) = \sum_{\text{rows matching } E} P(\text{row})$$
Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. [Kohavi, 1996]
Number of Instances: 48,842
Number of Attributes: 14 (in UCI’s copy of dataset); 3 (here)
![Page 68: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/68.jpg)
Using the Joint
P(Poor ^ Male) = 0.4654
$$P(E) = \sum_{\text{rows matching } E} P(\text{row})$$
![Page 69: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/69.jpg)
Using the Joint
P(Poor) = 0.7604
$$P(E) = \sum_{\text{rows matching } E} P(\text{row})$$
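A minimal sketch of "sum the matching rows". The census table behind the slide's numbers is not reproduced in this transcript, so the sketch uses the earlier Boolean A, B, C example instead; the function name `prob` and the use of predicates to represent a logical expression are choices made for this illustration.

```python
# Joint distribution over (A, B, C), copied from the earlier example.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(joint, event):
    """P(E): sum P(row) over the rows for which event(a, b, c) is true."""
    return sum(p for row, p in joint.items() if event(*row))

p_a = prob(joint, lambda a, b, c: a == 1)                    # P(A)
p_a_and_b = prob(joint, lambda a, b, c: a == 1 and b == 1)   # P(A ^ B)
p_a_or_b = prob(joint, lambda a, b, c: a == 1 or b == 1)     # P(A v B)
print(round(p_a, 2), round(p_a_and_b, 2), round(p_a_or_b, 2))
```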
![Page 70: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/70.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
• Inference
![Page 71: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/71.jpg)
Inference with the Joint
$$P(E_1 \mid E_2) = \frac{P(E_1 \wedge E_2)}{P(E_2)} = \frac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$$
![Page 72: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/72.jpg)
Inference with the Joint
$$P(E_1 \mid E_2) = \frac{P(E_1 \wedge E_2)}{P(E_2)} = \frac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$$
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
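The conditional query is two sums over the table. The sketch below again uses the Boolean A, B, C example as a stand-in for the census table (which is not in this transcript); `prob` and `cond_prob` are hypothetical helper names. The last line simply redoes the slide's arithmetic.

```python
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(joint, event):
    """P(E): sum of P(row) over rows matching the predicate."""
    return sum(p for row, p in joint.items() if event(*row))

def cond_prob(joint, e1, e2):
    """P(E1 | E2) = P(E1 ^ E2) / P(E2)."""
    denom = prob(joint, e2)
    if denom == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(joint, lambda *row: e1(*row) and e2(*row)) / denom

# P(A | B) on the example table.
p_a_given_b = cond_prob(joint, lambda a, b, c: a == 1, lambda a, b, c: b == 1)

# The slide's numbers: P(Male | Poor) = P(Poor ^ Male) / P(Poor).
p_male_given_poor = 0.4654 / 0.7604
print(round(p_a_given_b, 3), round(p_male_given_poor, 3))
```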
![Page 73: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/73.jpg)
Estimating the joint distribution
• Collect some data points
• Estimate the probability P(E1=e1 ^ … ^ En=en) as #(that row appears)/#(any row appears)
• ….
Gender Hours Wealth
g1 h1 w1
g2 h2 w2
.. … …
gN hN wN
![Page 74: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/74.jpg)
Estimating the joint distribution
• For each combination of values r:
  – Total = C[r] = 0
• For each data row ri
  – C[ri] ++
  – Total ++
• Estimate: $\hat{P}$(ri) = C[ri]/Total, e.g. ri is “female,40.5+, poor”
Gender Hours Wealth
g1 h1 w1
g2 h2 w2
.. … …
gN hN wN
Complexity? O(2^d) for the initialization loop, where d = #attributes (all binary)
Complexity? O(n) for the pass over the data, where n = total size of input data
![Page 75: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/75.jpg)
Estimating the joint distribution
• For each combination of values r:
  – Total = C[r] = 0
• For each data row ri
  – C[ri] ++
  – Total ++
Gender Hours Wealth
g1 h1 w1
g2 h2 w2
.. … …
gN hN wN
Complexity? $O\left(\prod_{i=1}^{d} k_i\right)$ for the initialization loop, where ki = arity of attribute i
Complexity? O(n) for the pass over the data, where n = total size of input data
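To get a feel for how fast the dense table grows, here is a small sketch that multiplies the arities. The function name and the example arities are hypothetical.

```python
import math

def dense_table_size(arities):
    """Number of rows in the full joint table: the product of the arities k_i."""
    return math.prod(arities)

three_binary = dense_table_size([2, 2, 2])    # the A, B, C example
thirty_binary = dense_table_size([2] * 30)    # 30 binary attributes
mixed = dense_table_size([2, 3, 2])           # made-up arities for 3 attributes
print(three_binary, thirty_binary, mixed)
```

With 30 binary attributes the table already has over a billion rows, regardless of how many data rows there are.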
![Page 76: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/76.jpg)
Estimating the joint distribution
• For each combination of values r:
  – Total = C[r] = 0
• For each data row ri
  – C[ri] ++
  – Total ++
Gender Hours Wealth
g1 h1 w1
g2 h2 w2
.. … …
gN hN wN
Complexity? $O\left(\prod_{i=1}^{d} k_i\right)$ for the initialization loop, where ki = arity of attribute i
Complexity? O(n) for the pass over the data, where n = total size of input data
![Page 77: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/77.jpg)
Estimating the joint distribution
• For each data row ri
  – If ri not in hash tables C,Total:
    • Insert C[ri] = 0
  – C[ri] ++
  – Total ++
Gender Hours Wealth
g1 h1 w1
g2 h2 w2
.. … …
gN hN wN
Complexity? O(n), where n = total size of input data
Complexity? O(m), where m = size of the model
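A minimal Python sketch of the hash-table version, assuming each data row is a hashable tuple. `collections.Counter` plays the role of the hash table C, and the four data rows are invented for illustration.

```python
from collections import Counter

def estimate_joint(data_rows):
    """Estimate P(row) as C[row] / Total, storing only rows that actually occur."""
    counts = Counter()   # hash table C: missing rows are implicitly 0
    total = 0
    for row in data_rows:
        counts[row] += 1
        total += 1
    return {row: c / total for row, c in counts.items()}

# Hypothetical (gender, hours, wealth) records.
data = [
    ("female", "40.5+", "poor"),
    ("male", "40.5-", "poor"),
    ("female", "40.5+", "poor"),
    ("male", "40.5+", "rich"),
]
p_hat = estimate_joint(data)
print(p_hat[("female", "40.5+", "poor")])
```

Only the distinct rows seen in the data are stored, so the table size tracks the model size m rather than the product of the arities.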
![Page 78: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/78.jpg)
Another example….
![Page 79: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/79.jpg)
Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)
Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context
![Page 80: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/80.jpg)
An experiment
• Starting point: Google books 5-gram data
  – All 5-grams that appear >= 40 times in a corpus of 1M English books
    • approx 80B words
    • 5-grams: 30Gb compressed, 250-300Gb uncompressed
    • Each 5-gram contains frequency distribution over years
  – Extract all 5-grams from books published before 2000 that contain ‘effect’ or ‘affect’ in middle position
    • about 20 “disk hours”
    • approx 100M occurrences
    • approx 50k distinct n-grams --- not big
  – Wrote code to compute
    • Pr(A,B,C,D,E|C=affect or C=effect)
    • Pr(any subset of A,…,E|any other subset,C=affect V effect)
![Page 81: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/81.jpg)
Some of the Joint Distribution
A B C D E p
is the effect of the 0.00036
is the effect of a 0.00034
. The effect of this 0.00034
to this effect : “ 0.00034
be the effect of the …
… … … … …
not the effect of any 0.00024
… … … … …
does not affect the general 0.00020
does not affect the question 0.00020
any manner affect the principle 0.00018
![Page 82: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/82.jpg)
Another experiment
• Extracted all affect/effect 5-grams from the old (small) Reuters corpus
  – about 20k documents
  – about 723 n-grams, 661 distinct
  – Financial news, not novels or textbooks
• Tried to predict center word with:
  – Pr(C|A=a,B=b,D=d,E=e)
  – then P(C|A,B,D,C=effect V affect)
  – then P(C|B,D, C=effect V affect)
  – then P(C|B, C=effect V affect)
  – then P(C, C=effect V affect)
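One way to read the "then … then …" list is as a back-off scheme: use the most specific pattern whose context was seen in training, otherwise fall back to the next one. The sketch below is an assumption about how such a predictor could be written, not the code used in the experiment; the training 5-grams and counts are invented.

```python
from collections import Counter, defaultdict

# Context positions used by each pattern, most specific first.
# Positions: A=0, B=1, D=3, E=4 (C=2 is the center word being predicted).
PATTERNS = [(0, 1, 3, 4), (0, 1, 3), (1, 3), (1,), ()]

def train(ngrams):
    """Build one table per pattern: context tuple -> counts of the center word."""
    tables = [defaultdict(Counter) for _ in PATTERNS]
    for gram, count in ngrams:
        for table, positions in zip(tables, PATTERNS):
            key = tuple(gram[i] for i in positions)
            table[key][gram[2]] += count
    return tables

def predict(tables, gram):
    """Return (word, estimated probability, pattern used) for the first matching pattern."""
    for table, positions in zip(tables, PATTERNS):
        key = tuple(gram[i] for i in positions)
        if key in table:
            counts = table[key]
            word, c = counts.most_common(1)[0]
            return word, c / sum(counts.values()), positions
    return None

# Hypothetical training data: (5-gram, count).
toy = [
    (("the", "cumulative", "effect", "of", "the"), 3),
    (("does", "not", "affect", "the", "general"), 2),
    (("will", "not", "affect", "the", "outcome"), 1),
    (("to", "no", "effect", "on", "the"), 1),
]
tables = train(toy)
seen = predict(tables, ("the", "cumulative", "_", "of", "the"))
backed_off = predict(tables, ("would", "not", "_", "finance", "minister"))
prior_only = predict(tables, ("a", "b", "_", "c", "d"))
print(seen, backed_off, prior_only)
```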
![Page 83: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/83.jpg)
EXAMPLES
• “The cumulative _ of the”: effect (1.0)
• “Go into _ on January”: effect (1.0)
• “From cumulative _ of accounting”: not present
  – Nor is “From cumulative _ of _”
  – But “_ cumulative _ of _”: effect (1.0)
• “Would not _ Finance Minister”: not present
  – But “_ not _ _ _”: affect (0.9625)
![Page 84: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/84.jpg)
Performance summary
Pattern Used Errors
P(C|A,B,D,E) 101 1
P(C|A,B,D) 157 6
P(C|B,D) 163 13
P(C|B) 244 78
P(C) 58 31
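The per-pattern error rates follow directly from the table. The sketch below assumes "Used" is the number of test 5-grams on which that pattern made the prediction, which is consistent with the counts adding up to the 723 n-grams mentioned earlier.

```python
# (pattern, times used, errors), copied from the table above.
results = [
    ("P(C|A,B,D,E)", 101, 1),
    ("P(C|A,B,D)", 157, 6),
    ("P(C|B,D)", 163, 13),
    ("P(C|B)", 244, 78),
    ("P(C)", 58, 31),
]

for pattern, used, errors in results:
    print(f"{pattern:14s} error rate = {errors / used:.3f}")

total_used = sum(used for _, used, _ in results)
total_errors = sum(errors for _, _, errors in results)
overall_error_rate = total_errors / total_used
print(total_used, total_errors, round(overall_error_rate, 3))
```

The error rate climbs as the pattern gets less specific, from about 1% for the full context to over 50% for the prior alone.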
![Page 85: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/85.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
• Inference
• Density estimation and classification
![Page 86: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/86.jpg)
Copyright © Andrew W. Moore
Density Estimation
• Our Joint Distribution learner is our first example of something called Density Estimation
• A Density Estimator learns a mapping from a set of attribute values to a Probability
Input Attributes → Density Estimator → Probability
![Page 87: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/87.jpg)
Density Estimation
• Compare it against the two other major kinds of models:
Input Attributes → Regressor → Prediction of real-valued output
Input Attributes → Density Estimator → Probability
Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)
![Page 88: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/88.jpg)
Density Estimation Classification
Input Attributes x, Class → Density Estimator → $\hat{P}(x, y)$
Input Attributes x → Classifier → Prediction of categorical output: one of y1, …., yk
To classify x
1. Use your estimator to compute $\hat{P}(x, y_1), \ldots, \hat{P}(x, y_k)$
2. Return the class y* with the highest predicted probability
Ideally y* is correct with probability $\hat{P}(y^* \mid x) = \hat{P}(x, y^*) / (\hat{P}(x, y_1) + \cdots + \hat{P}(x, y_k))$
Binary case: predict POS if $\hat{P}(\mathrm{POS} \mid x) > 0.5$
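A minimal sketch of turning a joint density estimate into a classifier: score every class, take the argmax, and normalize to get the predicted probability of that class. The dictionary `joint_xy` and its numbers are made up for illustration.

```python
def classify(joint_xy, x, classes):
    """Return (y*, P(y* | x)) using an estimated joint P(x, y) stored as a dict."""
    scores = {y: joint_xy.get((x, y), 0.0) for y in classes}
    total = sum(scores.values())          # estimated P(x)
    best = max(classes, key=lambda y: scores[y])
    # If x was never seen with any class, there is no basis for a probability.
    posterior = scores[best] / total if total > 0 else None
    return best, posterior

# Hypothetical estimated joint over (previous word, center word).
joint_xy = {
    ("not", "affect"): 0.30, ("not", "effect"): 0.10,
    ("the", "affect"): 0.05, ("the", "effect"): 0.55,
}
label, p = classify(joint_xy, "not", ["affect", "effect"])
unseen = classify(joint_xy, "would", ["affect", "effect"])
print(label, round(p, 3), unseen)
```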
![Page 89: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/89.jpg)
Classification vs Density Estimation
Classification Density Estimation
![Page 90: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/90.jpg)
Classification vs density estimation
![Page 91: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/91.jpg)
How do you evaluate?
Classification
• Estimate Pr(error)
• Given n test examples and k errors, error rate is Pr(error) ~= k/n.
Density Estimation
• Given n test examples x1,y1, …, xn,yn, compute
$$-\frac{1}{n} \sum_{i} \log \hat{P}(Y = y_i \mid x_i)$$
Avg # bits to encode the true labels using a code based on your density estimator. (Optimal code will use -logP(Y=y) bits for event Y=y.)
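Both evaluation measures are short to compute. The sketch below uses base-2 logs so that the result is in bits; the function names and the toy inputs are invented.

```python
import math

def error_rate(predicted, actual):
    """Fraction of test examples where the predicted label is wrong (k/n)."""
    k = sum(1 for p, a in zip(predicted, actual) if p != a)
    return k / len(actual)

def avg_bits(probs_of_true_label):
    """Average of -log2 P(Y = y_i | x_i) over the test examples."""
    n = len(probs_of_true_label)
    return -sum(math.log2(p) for p in probs_of_true_label) / n

# Hypothetical test set: 4 predictions, one of them wrong.
err = error_rate(["effect", "affect", "effect", "effect"],
                 ["effect", "effect", "effect", "effect"])

# Hypothetical probabilities the estimator assigned to the true labels.
bits = avg_bits([0.5, 0.25])
print(err, bits)
```

Note that `avg_bits` is undefined if the estimator assigns probability 0 to a true label, which is one reason smoothing matters for density estimation.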
![Page 92: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/92.jpg)
Bayes Classifiers
• If we can do inference over Pr(X1,…,Xd,Y)…
• … in particular compute Pr(X1,…,Xd|Y) and Pr(Y).
– And then we can use Bayes’ rule to compute
$$\Pr(Y \mid X_1, \ldots, X_d) = \frac{\Pr(X_1, \ldots, X_d \mid Y)\,\Pr(Y)}{\Pr(X_1, \ldots, X_d)}$$
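A small sketch of the Bayes' rule step, assuming the class prior Pr(Y) and the likelihood Pr(x|Y) of one fixed input x are already available. The denominator Pr(x) is obtained by summing the numerator over the classes. All numbers are hypothetical.

```python
def posterior(prior, likelihood):
    """Pr(Y | x) from Pr(Y) and Pr(x | Y), both given as dicts keyed by class."""
    numerators = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(numerators.values())   # Pr(x), by the law of total probability
    return {y: p / evidence for y, p in numerators.items()}

prior = {"effect": 0.94, "affect": 0.06}        # Pr(Y)
likelihood = {"effect": 0.01, "affect": 0.20}   # Pr(x | Y) for one context x
post = posterior(prior, likelihood)
print({y: round(p, 3) for y, p in post.items()})
```

Even with a strong prior toward "effect", a context that is much more likely under "affect" can flip the prediction.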
![Page 93: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/93.jpg)
Summary: The Bad News
• Density estimation by directly learning the joint is trivial, mindless and dangerous
Andrew’s joke
• Density estimation by directly learning the joint is hopeless unless you have some combination of
  • Very few attributes
  • Attributes with low “arity”
  • Lots and lots of data
• Otherwise you can’t estimate all the row frequencies
![Page 94: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/94.jpg)
Back to the demo
• Extracted all affect/effect 5-grams from the old (small) Reuters corpus
  – about 20k documents
  – about 723 n-grams, 661 distinct
  – Financial news, not novels or textbooks
• Tried to predict center word with:
  – Pr(C|A=a,B=b,D=d,E=e)
  – Of 723 combinations of a,b,d,e, only 101 (about 14%) appear in the n-gram corpus.
  – Of those there is only one error.
![Page 95: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/95.jpg)
Performance …
Pattern Used Errors
P(C|A,B,D,E) 101 1
P(C|A,B,D) 157 6
P(C|B,D) 163 13
P(C|B) 244 78
P(C) 58 31
• Is this good performance?
• Do other brute-force estimates of joint probabilities have the same problem?
![Page 96: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/96.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
• Inference
• Density estimation and classification
• Naïve Bayes density estimators and classifiers
![Page 97: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/97.jpg)
Naïve Density Estimation
The problem with the Joint Estimator is that it just mirrors the training data.
We need something which generalizes more usefully.
The naïve model generalizes strongly:
Assume that each attribute is distributed independently of any of the other attributes.
![Page 98: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/98.jpg)
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A ^ ~B ^ C ^ ~D)?
![Page 99: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/99.jpg)
Using the Naïve Distribution
• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.
• Suppose A, B, C and D are independently distributed. What is P(A^~B^C^~D)?
= P(A|~B^C^~D) P(~B^C^~D)
= P(A) P(~B^C^~D)
= P(A) P(~B|C^~D) P(C^~D)
= P(A) P(~B) P(C^~D)
= P(A) P(~B) P(C|~D) P(~D)
= P(A) P(~B) P(C) P(~D)
![Page 100: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/100.jpg)
Naïve Distribution General Case
• Suppose X1,X2,…,Xd are independently distributed.
$$\Pr(X_1 = x_1, \ldots, X_d = x_d) = \Pr(X_1 = x_1) \cdots \Pr(X_d = x_d)$$
• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.
• How do we learn this?
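As a sketch of "any row on demand", the code below takes the per-attribute marginals of the earlier Boolean A, B, C example and multiplies them. The helper names are invented. Comparing one row against the true joint shows how much the independence assumption can change the answer.

```python
import math

# True joint over (A, B, C), copied from the earlier example.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def marginal(joint, i):
    """Return [P(X_i = 0), P(X_i = 1)] by summing the joint over the other variables."""
    out = [0.0, 0.0]
    for row, p in joint.items():
        out[row[i]] += p
    return out

def naive_prob(marginals, row):
    """Pr(X_1 = x_1, ..., X_d = x_d) under independence: product of the marginals."""
    return math.prod(marginals[i][v] for i, v in enumerate(row))

marginals = [marginal(joint, i) for i in range(3)]
naive_110 = naive_prob(marginals, (1, 1, 0))   # naive estimate of P(A ^ B ^ ~C)
true_110 = joint[(1, 1, 0)]
naive_total = sum(naive_prob(marginals, row) for row in joint)
print(round(naive_110, 3), true_110, round(naive_total, 6))
```

The naive model needs only one small table per attribute, but here it gives 0.175 for a row whose true probability is 0.25.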
![Page 101: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/101.jpg)
Learning a Naïve Density Estimator
Another trivial learning algorithm!
MLE:
$$P(X_i = x_i) = \frac{\#\text{records with } X_i = x_i}{\#\text{records}}$$
Dirichlet (MAP):
$$P(X_i = x_i) = \frac{\#\text{records with } X_i = x_i + mq}{\#\text{records} + m}$$
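Both estimates are one-liners over a count table. The sketch below assumes the usual reading of the smoothed formula, where q is a prior guess for the value's probability and m is the number of imaginary records backing that guess; the data and the choices m=4, q=0.5 are made up.

```python
from collections import Counter

def mle(values):
    """P(X_i = x) = #records with X_i = x / #records."""
    counts = Counter(values)
    n = len(values)
    return {x: c / n for x, c in counts.items()}

def map_estimate(values, domain, m, q):
    """P(X_i = x) = (#records with X_i = x + m*q) / (#records + m)."""
    counts = Counter(values)
    n = len(values)
    return {x: (counts[x] + m * q) / (n + m) for x in domain}

# Hypothetical column of one attribute: 3 records "effect", 1 record "affect".
column = ["effect", "effect", "effect", "affect"]
p_mle = mle(column)
p_map = map_estimate(column, domain=["effect", "affect"], m=4, q=0.5)
print(p_mle, p_map)
```

With m=4 and q=0.5 the smoothed estimate for "effect" moves from 0.75 toward the prior guess, giving 0.625.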
![Page 102: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/102.jpg)
Is this an interesting learning algorithm?
• For n-grams, what is $\hat{P}$(C=effect|A=will)?
  • In joint: $\hat{P}$(C=effect|A=will) = 0.38
  • In naïve: $\hat{P}$(C=effect|A=will) = $\hat{P}$(C=effect) = #[C=effect]/#totalNgrams = 0.94 (!)
• What is $\hat{P}$(C=effect|B=no)?
  • In joint: $\hat{P}$(C=effect|B=no) = 0.999
  • In naïve: $\hat{P}$(C=effect|B=no) = $\hat{P}$(C=effect) = 0.94
No
![Page 103: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/103.jpg)
Ended here….
![Page 104: A Quick Overview of Probability William W. Cohen Machine Learning 10-605](https://reader038.vdocuments.site/reader038/viewer/2022110206/56649f3e5503460f94c5f5e1/html5/thumbnails/104.jpg)
Probability - what you need to really, really know
• Probabilities are cool
• Random variables and events
• The Axioms of Probability
• Independence, binomials, multinomials
• Conditional probabilities
• Bayes Rule
• MLE’s, smoothing, and MAPs
• The joint distribution
• Inference
• Density estimation and classification
• Naïve Bayes density estimators and classifiers
• Conditional independence…more on this next week!