a not-so-quick overview of probability

102

Upload: moanna

Post on 06-Jan-2016

15 views

Category:

Documents


1 download

DESCRIPTION

A Not-So-Quick Overview of Probability. William W. Cohen Machine Learning 10-605. Warmup : Zeno’s paradox. 0. 1. 1+0.1+0.01+0.001+0.0001+… = ?. Lance Armstrong and the tortoise have a race Lance is 10x faster Tortoise has a 1m head start at time 0. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: A Not-So-Quick Overview of Probability
Page 2: A Not-So-Quick Overview of Probability
Page 3: A Not-So-Quick Overview of Probability

A Not-So-Quick Overview of Probability

William W. CohenMachine Learning 10-605

Page 4: A Not-So-Quick Overview of Probability

Warmup: Zeno’s paradox

• Lance Armstrong and the tortoise have a race

• Lance is 10x faster• Tortoise has a 1m head start

at time 0

0 1

• So, when Lance gets to 1m the tortoise is at 1.1m

• So, when Lance gets to 1.1m the tortoise is at 1.11m …

• So, when Lance gets to 1.11m the tortoise is at 1.111m … and Lance will never catch up -?

1+0.1+0.01+0.001+0.0001+… = ?

unresolved until calculus was invented

Page 5: A Not-So-Quick Overview of Probability

The prosecution calls Gottfried Leibniz.

Page 6: A Not-So-Quick Overview of Probability

The Problem of Induction

• David Hume (1711-1776): pointed out

1. Empirically, induction seems to work

2. Statement (1) is an application of induction.

• This stumped people for about 200 years

1. Of the Different Species of Philosophy.

2. Of the Origin of Ideas

3. Of the Association of Ideas

4. Sceptical Doubts Concerning the Operations of the Understanding

5. Sceptical Solution of These Doubts

6. Of Probability9

7. Of the Idea of Necessary Connexion

8. Of Liberty and Necessity

9. Of the Reason of Animals

10. Of Miracles

11. Of A Particular Providence and of A Future State

12. Of the Academical Or Sceptical Philosophy

Page 7: A Not-So-Quick Overview of Probability

A Second Problem of Induction

• A black crow seems to support the hypothesis “all crows are black”.

• A pink highlighter supports the hypothesis “all non-black things are non-crows”

• Thus, a pink highlighter supports the hypothesis “all crows are black”.

)(CROW)(BLACK

lyequivalentor

)(BLACK)(CROW

xxx

xxx

Page 8: A Not-So-Quick Overview of Probability

Probability Theory

• Events – discrete random variables, boolean random variables,

compound events• Axioms of probability

– What defines a reasonable theory of uncertainty• Compound events• Independent events• Conditional probabilities• Bayes rule and beliefs• Joint probability distribution

Page 9: A Not-So-Quick Overview of Probability

Discrete Random Variables

• A is a Boolean-valued random variable if– A denotes an event, – there is uncertainty as to whether A occurs.

• Define P(A) as “the fraction of experiments in which A is true”– We’re assuming all possible outcomes are equiprobable

a possible outcome of an “experiment”

the experiment is not deterministic

Page 10: A Not-So-Quick Overview of Probability

Visualizing A

Event space of all possible worlds

Its area is 1Worlds in which A is False

Worlds in which A is true

P(A) = Area ofreddish oval

Page 11: A Not-So-Quick Overview of Probability

Discrete Random Variables

• A is a Boolean-valued random variable if– A denotes an event, – there is uncertainty as to whether A occurs.

• Define P(A) as “the fraction of experiments in which A is true”– We’re assuming all possible outcomes are equiprobable

• Examples– You roll two 6-sided die (the experiment) and get doubles (A=doubles,

the outcome)– I pick two students in the class (the experiment) and they have the

same birthday (A=same birthday, the outcome)– A = I have Ebola– A = The US president in 2023 will be male– A = You wake up tomorrow with a headache– A = the 1,000,000,000,000th digit of π is 7

a possible outcome of an “experiment”

the experiment is not deterministic

Page 12: A Not-So-Quick Overview of Probability

The Axioms of Probability

• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and B)

Events, random variables, …., probabilities

“Dice”

“Experiments”

Page 13: A Not-So-Quick Overview of Probability

The Axioms Of Probabi

lity

(This is Andrew’s joke)

Page 14: A Not-So-Quick Overview of Probability

These Axioms are Not to be Trifled With

• There have been many many other approaches to understanding “uncertainty”:

• Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …

• 25 years ago people in AI argued about these; now they mostly don’t– Any scheme for combining uncertain information, uncertain

“beliefs”, etc,… really should obey these axioms– If you gamble based on “uncertain beliefs”, then [you can

be exploited by an opponent] [your uncertainty formalism violates the axioms] - di Finetti 1931 (the “Dutch book argument”)

Page 15: A Not-So-Quick Overview of Probability

Interpreting the axioms

• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and

B)The area of A can’t get any smaller than 0

And a zero area would mean no world could ever have A true

Page 16: A Not-So-Quick Overview of Probability

Interpreting the axioms• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and

B)

The area of A can’t get any bigger than 1

And an area of 1 would mean all worlds will have A true

Page 17: A Not-So-Quick Overview of Probability

Interpreting the axioms

• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and

B)

A

B

Page 18: A Not-So-Quick Overview of Probability

Interpreting the axioms

• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and

B)

A

B

P(A or B)

BP(A and B)

Simple addition and subtraction

Page 19: A Not-So-Quick Overview of Probability

Theorems from the Axioms

• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0• P(A or B) = P(A) + P(B) - P(A and B)

P(not A) = P(~A) = 1-P(A)

P(A or ~A) = P(A) + P(~A) - P(A and ~A)

1 = P(A) + P(~A) - 0

P(A or ~A) = 1 P(A and ~A) = 0

Page 20: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

• P(~A) + P(A) = 1

A ~A

Page 21: A Not-So-Quick Overview of Probability

Side Note

• I am inflicting these proofs on you for two reasons:

1. These kind of manipulations will need to be second nature to you if you use probabilistic analytics in depth

2. Suffering is good for you

(This is also Andrew’s joke)

Page 22: A Not-So-Quick Overview of Probability

Another important theorem

• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0• P(A or B) = P(A) + P(B) - P(A and B)

P(A) = P(A ^ B) + P(A ^ ~B)

A = A and (B or ~B) = (A and B) or (A and ~B)

P(A) = P(A and B) + P(A and ~B) – P((A and B) and (A and ~B))

P(A) = P(A and B) + P(A and ~B) – P(A and A and B and ~B)

Page 23: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

• P(A) = P(A ^ B) + P(A ^ ~B)

B

~B

A ^ ~B

A ^ B

Page 24: A Not-So-Quick Overview of Probability

The LAWSOf Probabi

lity

Laws of probability:1. Axioms …2. Monty Hall

Problem proviso

Page 25: A Not-So-Quick Overview of Probability

The Monty Hall Problem

• You’re in a game show. Behind one door is a prize. Behind the others, goats.

• You pick one of three doors, say #1

• The host, Monty Hall, opens one door, revealing…a goat!

3

You now can either

• stick with your guess

• always change doors

• flip a coin and pick a new door randomly according to the coin

Page 26: A Not-So-Quick Overview of Probability

The Monty Hall Problem

• Case 1: you don’t swap.–W = you win.

• Pre-goat: P(W)=1/3• Post-goat: P(W)=1/3

• Case 2: you swap–W1=you picked

the cash initially.–W2=you win.

• Pre-goat: P(W1)=1/3.

• Post-goat:–W2 = ~W1– Pr(W2) = 1-

P(W1)=2/3.

Moral: ?

Page 27: A Not-So-Quick Overview of Probability

The Extreme Monty Hall/Survivor Problem

• You’re in a game show. There are 10,000 doors. Only one of them has a prize.

• You pick a door.• Over the remaining 13 weeks, the host eliminates

9,998 of the remaining doors.• For the season finale:– Do you switch, or not?

Page 28: A Not-So-Quick Overview of Probability

Some practical problems

• You’re the DM in a D&D game.• Joe brings his own d20 and throws 4 critical hits in

a row to start off– DM=dungeon master– D20 = 20-sided die– “Critical hit” = 19 or 20

• Is Joe cheating?• What is P(A), A=four critical hits?

– A is a compound event: A = C1 and C2 and C3 and C4

Page 29: A Not-So-Quick Overview of Probability

Independent Events

• Definition: two events A and B are independent if Pr(A and B)=Pr(A)*Pr(B).

• Intuition: outcome of A has no effect on the outcome of B (and vice versa).–We need to assume the different rolls

are independent to solve the problem.– You frequently need to assume the

independence of something to solve any learning problem.

Page 30: A Not-So-Quick Overview of Probability

Some practical problems

• You’re the DM in a D&D game.• Joe brings his own d20 and throws 4 critical hits in

a row to start off– DM=dungeon master– D20 = 20-sided die– “Critical hit” = 19 or 20

• What are the odds of that happening with a fair die?

• Ci=critical hit on trial i, i=1,2,3,4 • P(C1 and C2 … and C4) = P(C1)*…*P(C4) =

(1/10)^4Followup: D=pick an ace or king out of deck three times in a row: D=D1 ^ D2 ^ D3

Page 31: A Not-So-Quick Overview of Probability

Some practical problems

The specs for the loaded d20 say that it has 20 outcomes, X where

• P(X=20) = 0.25

• P(X=19) = 0.25

• for i=1,…,18, P(X=i)= Z * 1/18

• What is Z?

Page 32: A Not-So-Quick Overview of Probability

Multivalued Discrete Random Variables

• Suppose A can take on more than 2 values• A is a random variable with arity k if it can take on exactly

one value out of {v1,v2, .. vk}

– Example: V={aaliyah, aardvark, …., zymurge, zynga}– Example: V={aaliyah_aardvark, …, zynga_zymgurgy}

• Thus…

jivAvAP ji if 0)(

Page 33: A Not-So-Quick Overview of Probability

Terms: Binomials and Multinomials

• Suppose A can take on more than 2 values• A is a random variable with arity k if it can

take on exactly one value out of {v1,v2, .. vk}

– Example: V={aaliyah, aardvark, …., zymurge, zynga}

– Example: V={aaliyah_aardvark, …, zynga_zymgurgy}

• The distribution Pr(A) is a multinomial• For k=2 the distribution is a binomial

Page 34: A Not-So-Quick Overview of Probability

More about Multivalued Random Variables

• Using the axioms of probability…

0 <= P(A) <= 1, P(True) = 1, P(False) = 0P(A or B) = P(A) + P(B) - P(A and B)

• And assuming that A obeys…

• It’s easy to prove that

jivAvAP ji if 0)(

Page 35: A Not-So-Quick Overview of Probability

More about Multivalued Random Variables

• Using the axioms of probability and assuming that A obeys…

• It’s easy to prove that

jivAvAP ji if 0)(

• And thus we can prove

1)(1

k

jjvAP

Page 36: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

1)(1

k

jjvAP

A=1

A=2

A=3

A=4

A=5

Page 37: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

1)(1

k

jjvAP

A=aardvark

A=aaliyah

A=…

A=…. A=zynga

Page 38: A Not-So-Quick Overview of Probability

Some practical problemsThe specs for the loaded d20 say that it has 20 outcomes, X

• P(X=20) = P(X=19) = 0.25

• for i=1,…,18, P(X=i)= z … and what is z?

5.01825.025.0)(1

1)(

18

1

1

zvAP

vAP

jj

k

jj

Page 39: A Not-So-Quick Overview of Probability

Some practical problems

• You (probably) have 8 neighbors and 5 close neighbors.

• What is Pr(A), A=one or more of your neighbors has the same sign as you?– What’s the

experiment?

• What is Pr(B), B=you and your close neighbors all have different signs?– What about

neighbors?

n c n

c * c

n c n

Moral: ?

Page 40: A Not-So-Quick Overview of Probability

Some practical problemsI bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?

Frequency

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Face Shown

P(X=20) = P(X=19) = 0.25

for i=1,…,18, P(X=i)= 0.5 * 1/18

Page 41: A Not-So-Quick Overview of Probability

Some practical problems

• I have 3 standard d20 dice, 1 loaded die.

• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?

P(B) = P(B and A) + P(B and ~A) = 0.1*0.75 + 0.5*0.25 = 0.2

using Andrew’s “important theorem” P(A) = P(A ^ B) + P(A ^ ~B)

Page 42: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

• P(A) = P(A ^ B) + P(A ^ ~B)

B

~B

A ^ ~B

A ^ B

Followup:

What if I change the ratio of fair to loaded die in the experiment?

Page 43: A Not-So-Quick Overview of Probability

Some practical problems

• I have lots of standard d20 die, lots of loaded die, all identical.

• Experiment is the same: (1) pick a d20 uniformly at random then (2) roll it. Can I mix the dice together so that P(B)=0.137 ?P(B) = P(B and A) + P(B and ~A) = 0.1*λ + 0.5*(1- λ) = 0.137

λ = (0.5 - 0.137)/0.4 = 0.9075“mixture model”

Page 44: A Not-So-Quick Overview of Probability

Another picture for this problem

A (fair die) ~A (loaded)

A and B ~A and B

It’s more convenient to say• “if you’ve picked a fair die then …” i.e. Pr(critical hit|fair die)=0.1• “if you’ve picked the loaded die then….” Pr(critical hit|loaded die)=0.5

Conditional probability:Pr(B|A) = P(B^A)/P(A)

P(B|A) P(B|~A)

Page 45: A Not-So-Quick Overview of Probability

Definition of Conditional Probability

P(A ^ B) P(A|B) = ----------- P(B)

Corollary: The Chain Rule

P(A ^ B) = P(A|B) P(B)

Page 46: A Not-So-Quick Overview of Probability

Some practical problems

• I have 3 standard d20 dice, 1 loaded die.

• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die. What is P(B)?

P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2

“marginalizing out” A

Page 47: A Not-So-Quick Overview of Probability

A (fair die) ~A (loaded)

A and B ~A and B P(B|A) P(B|~A)

P(A) P(~A)P(B) = P(B|A)P(A) + P(B|~A)P(~A)

Page 48: A Not-So-Quick Overview of Probability

Some practical problems• I have 3 standard d20 dice, 1 loaded die.

• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die.

• Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e. what is P(A|B) ?

Page 49: A Not-So-Quick Overview of Probability

A (fair die) ~A (loaded)

A and B ~A and B

P(B|A) P(B|~A)

P(A) P(~A)

P(A and B) = P(B|A) * P(A)

P(A and B) = P(A|B) * P(B)

P(A|B) * P(B) = P(B|A) * P(A)

P(B|A) * P(A)

P(B)P(A|B) =

A (fair die) ~A (loaded)

A and B ~A and B

P(B)

P(A|B) = ?

Page 50: A Not-So-Quick Overview of Probability

P(B|A) * P(A)

P(B)P(A|B) =

P(A|B) * P(B)

P(A)P(B|A) =

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418

…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…

Bayes’ rule

priorposterior

Page 51: A Not-So-Quick Overview of Probability

More General Forms of Bayes Rule

)(~)|~()()|(

)()|()|(

APABPAPABP

APABPBAP

)(

)()|()|(

XBP

XAPXABPXBAP

Page 52: A Not-So-Quick Overview of Probability

More General Forms of Bayes Rule

An

kkk

iii

vAPvABP

vAPvABPBvAP

1

)()|(

)()|()|(

Page 53: A Not-So-Quick Overview of Probability

Useful Easy-to-prove facts

1)|()|( BAPBAP

1)|(1

An

kk BvAP

Page 54: A Not-So-Quick Overview of Probability

More about Bayes rule

• An Intuitive Explanation of Bayesian Reasoning: Bayes' Theorem for the curious and bewildered; an excruciatingly gentle introduction - Eliezer Yudkowsky

• Problem: Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl?

Page 55: A Not-So-Quick Overview of Probability

Some practical problems• Joe throws 4 critical hits in a row, is Joe cheating?• A = Joe using cheater’s die• C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1• B = C1 and C2 and C3 and C4• Pr(B|A) = 0.0625 P(B|~A)=0.0001

)(~)|~()()|(

)()|()|(

APABPAPABP

APABPBAP

))(1(*0001.0)(*0625.0

)(*0625.0)|(

APAP

APBAP

Page 56: A Not-So-Quick Overview of Probability

What’s the experiment and outcome here?

• Outcome A: Joe is cheating• Experiment: – Joe picked a die uniformly at random from

a bag containing 10,000 fair die and one bad one.

– Joe is a D&D player picked uniformly at random from set of 1,000,000 people and n of them cheat with probability p>0.

– I have no idea, but I don’t like his looks. Call it P(A)=0.1

Page 57: A Not-So-Quick Overview of Probability

Remember: Don’t Mess with The Axioms

• A subjective belief can be treated, mathematically, like a probability– Use those axioms!

• There have been many many other approaches to understanding “uncertainty”:

• Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …

• 25 years ago people in AI argued about these; now they mostly don’t– Any scheme for combining uncertain information, uncertain

“beliefs”, etc,… really should obey these axioms– If you gamble based on “uncertain beliefs”, then [you can be

exploited by an opponent] [your uncertainty formalism violates the axioms] - di Finetti 1931 (the “Dutch book argument”)

Page 58: A Not-So-Quick Overview of Probability

Some practical problems

• Joe throws 4 critical hits in a row, is Joe cheating?

• A = Joe using cheater’s die• C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1• B = C1 and C2 and C3 and C4• Pr(B|A) = 0.0625 P(B|~A)=0.0001

)(/)()|(

)(/)()|(

)|(

)|(

BPAPABP

BPAPABP

BAP

BAP

)(

)()|()|(

BP

APABPBAP

)(

)(

)|(

)|(

AP

AP

ABP

ABP

)(

)(

0001.0

0625.0

AP

AP

)(

)(250,6

AP

AP

Moral: with enough evidence the prior P(A) doesn’t really matter.

Page 59: A Not-So-Quick Overview of Probability

Some practical problemsI bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?

Frequency

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Face Shown

1. Collect some data (20 rolls)2. Estimate Pr(i)=C(rolls of i)/C(any roll)

Page 60: A Not-So-Quick Overview of Probability

One solutionI bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?

Frequency

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Face Shown

P(1)=0

P(2)=0

P(3)=0

P(4)=0.1

P(19)=0.25

P(20)=0.2MLE = maximumlikelihood estimate

But: Do I really think it’s impossible to roll a 1,2 or 3?Would you bet your house on it?

Page 61: A Not-So-Quick Overview of Probability

A better solutionI bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?

Frequency

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Face Shown

1. Collect some data (20 rolls)2. Estimate Pr(i)=C(rolls of i)/C(any roll)

0. Imagine some data (20 rolls, each i shows up 1x)

Page 62: A Not-So-Quick Overview of Probability

A better solutionI bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?

Frequency

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Face Shown

P(1)=1/40

P(2)=1/40

P(3)=1/40

P(4)=(2+1)/40

P(19)=(5+1)/40

P(20)=(4+1)/40=1/8

)()(

1)()r(P̂

IMAGINEDCANYC

iCi

0.25 vs. 0.125 – really different! Maybe I should “imagine” less data?

Page 63: A Not-So-Quick Overview of Probability

A better solution?

Frequency

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Face Shown

P(1)=1/40

P(2)=1/40

P(3)=1/40

P(4)=(2+1)/40

P(19)=(5+1)/40

P(20)=(4+1)/40=1/8

)()(

1)()r(P̂

IMAGINEDCANYC

iCi

0.25 vs. 0.125 – really different! Maybe I should “imagine” less data?

Page 64: A Not-So-Quick Overview of Probability

A better solution?

)()(

1)()r(P̂

IMAGINEDCANYC

iCi

mANYC

mqiCi

)(

)()r(P̂

Q: What if I used m rolls with a probability of q=1/20 of rolling any i?

I can use this formula with m>20, or even with m<20 … say with m=1

Page 65: A Not-So-Quick Overview of Probability

A better solution

)()(

1)()r(P̂

IMAGINEDCANYC

iCi

mANYC

mqiCi

)(

)()r(P̂

Q: What if I used m rolls with a probability of q=1/20 of rolling any i?

If m>>C(ANY) then your imagination q rulesIf m<<C(ANY) then your data rules BUT you never ever ever end up with Pr(i)=0

Page 66: A Not-So-Quick Overview of Probability

TerminologyThis is called a uniform Dirichlet priorC(i), C(ANY) are sufficient statistics

It’s equivalent to a doing this:•p=some probability over 1…20•Pr(P=p|Prior)=f(…)•Pr(P=p|Prior,data)= g(…)•Pr(I=i|Prior,data)=

mANYC

mqiCi

)(

)()r(P̂

p

priordatappi ),|Pr()|Pr( dppriordatappip ),|Pr()|Pr(

MLE = maximumlikelihood estimate

MAP= maximuma posteriori estimate

f and g are the same function, diff’t args f is conjugate prior

Page 67: A Not-So-Quick Overview of Probability

Terminology – more laterThis is called a uniform Dirichlet prior

C(i), C(ANY) are sufficient statistics

mANYC

mqiCi

)(

)()r(P̂

MLE = maximumlikelihood estimate

MAP= maximuma posteriori estimate

Page 68: A Not-So-Quick Overview of Probability

Some practical problems

• I have 1 standard d6 die, 2 loaded d6 die.

• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50

• Experiment: pick one d6 uniformly at random (A) and roll it. What is more likely – rolling a seven or rolling doubles?

Three combinations: HL, HF, FLP(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL) = P(D | A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(A|A=FL)*P(A=FL)

Page 69: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

A=HL

A=HF

A=FL

)()(1

k

jjvABPBP

B ^ A=HL

B ^ A=HF

B ^ A=LF

Page 70: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

A=1

A=2

A=3

A=4

A=5

)()|()()(11

j

k

jj

k

jj vAPvABPvABPBP

B ^ A=1

B ^ A=2

B ^ A=3

B ^ A=4

B ^ A=5

Page 71: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

A=2

A=3

A=4

A=5

)()|()()(11

j

k

jj

k

jj vAPvABPvABPBP

A=1

P(A=1)

Think of multiplying by P(A=1) as “squeezing” an area by some fraction

Page 72: A Not-So-Quick Overview of Probability

Elementary Probability in Pictures

A=2

A=3

A=4

A=5

)()|()()(11

j

k

jj

k

jj vAPvABPvABPBP

A=1

P(A=1)

P(B|A=1)

P(A and B)

Page 73: A Not-So-Quick Overview of Probability

Some practical problems

• I have 1 standard d6 die, 2 loaded d6 die.

• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50

• Experiment: pick one d6 uniformly at random (A) and roll it. What is more likely – rolling a seven or rolling doubles?

Three combinations: HL, HF, FLP(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL) = P(D | A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(A|A=FL)*P(A=FL)

Page 74: A Not-So-Quick Overview of Probability

Some practical problems

• I have 1 standard d6 die, 2 loaded d6 die.

• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50

• Experiment: pick one d6 uniformly at random (A) and roll it. What is more likely – rolling a seven or rolling doubles?

1 2 3 4 5 6

1 D 7

2 D 7

3 D 7

4 7 D

5 7 D

6 7 D

Three combinations: HL, HF, FL Roll 1

Roll

2

Page 75: A Not-So-Quick Overview of Probability

A brute-force solution

A Roll 1 Roll 2 P

FL 1 1 1/3 * 1/6 * ½

FL 1 2 1/3 * 1/6 * 1/10

FL 1 … …

… 1 6

FL 2 1

FL 2 …

… … …

FL 6 6

HL 1 1

HL 1 2

… … …

HF 1 1

Comment

doubles

seven

doubles

doubles

A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1,x2,…., xk

With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=Xk), e.g.

• P(doubles)

• P(seven or eleven)

• P(total is higher than 5)

• ….

Page 76: A Not-So-Quick Overview of Probability

The Joint Distribution

Recipe for making a joint distribution of M variables:

Example: Boolean variables A, B, C

Page 77: A Not-So-Quick Overview of Probability

The Joint Distribution

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2M rows).

Example: Boolean variables A, B, C

A B C0 0 0

0 0 1

0 1 0

0 1 1

1 0 0

1 0 1

1 1 0

1 1 1

Page 78: A Not-So-Quick Overview of Probability

The Joint Distribution

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2M rows).

2. For each combination of values, say how probable it is.

Example: Boolean variables A, B, C

A B C Prob0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

Page 79: A Not-So-Quick Overview of Probability

The Joint Distribution

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2M rows).

2. For each combination of values, say how probable it is.

3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

A B C Prob0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

A

B

C0.050.25

0.10 0.050.05

0.10

0.100.30

Page 80: A Not-So-Quick Overview of Probability

Using the Joint

One you have the JD you can ask for the probability of any logical expression involving your attribute

E

PEP matching rows

)row()(

Abstract: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. [Kohavi, 1996]Number of Instances: 48,842 Number of Attributes: 14 (in UCI’s copy of dataset); 3 (here)

Page 81: A Not-So-Quick Overview of Probability

Using the Joint

P(Poor Male) = 0.4654 E

PEP matching rows

)row()(

Page 82: A Not-So-Quick Overview of Probability

Using the Joint

P(Poor) = 0.7604 E

PEP matching rows

)row()(

Page 83: A Not-So-Quick Overview of Probability

Inference with the

Joint

2

2 1

matching rows

and matching rows

2

2121 )row(

)row(

)(

)()|(

E

EE

P

P

EP

EEPEEP

Page 84: A Not-So-Quick Overview of Probability

Inference with the

Joint

2

2 1

matching rows

and matching rows

2

2121 )row(

)row(

)(

)()|(

E

EE

P

P

EP

EEPEEP

P(Male | Poor) = 0.4654 / 0.7604 = 0.612

Page 85: A Not-So-Quick Overview of Probability

Estimating the joint distribution• Collect some data points• Estimate the probability P(E1=e1 ^ … ^

En=en) as #(that row appears)/#(any row appears)

• ….Gender Hours Wealth

g1 h1 w1

g2 h2 w2

.. … …

gN hN wN

Page 86: A Not-So-Quick Overview of Probability

Inference is a big deal

• I’ve got this evidence. What’s the chance that this conclusion is true?– I’ve got a sore neck: how likely am I to have

meningitis?– I see my lights are out and it’s 9pm. What’s the

chance my spouse is already asleep?

Page 87: A Not-So-Quick Overview of Probability

Estimating the joint distribution• For each combination of values r:– Total = C[r] = 0

• For each data row ri

– C[ri] ++

– Total ++

Gender Hours Wealth

g1 h1 w1

g2 h2 w2

.. … …

gN hN wN

Complexity?

Complexity?

O(n)

n = total size of input data

O(2d)

d = #attributes (all binary)

= C[ri]/Total

ri is “female,40.5+, poor”

Page 88: A Not-So-Quick Overview of Probability

Estimating the joint distribution• For each combination of values r:– Total = C[r] = 0

• For each data row ri

– C[ri] ++

– Total ++

Gender Hours Wealth

g1 h1 w1

g2 h2 w2

.. … …

gN hN wN

Complexity?

Complexity?

O(n)

n = total size of input data

ki = arity of attribute i

d

iikO

1

)(

Page 89: A Not-So-Quick Overview of Probability

Estimating the joint distribution

Gender Hours Wealth

g1 h1 w1

g2 h2 w2

.. … …

gN hN wN

Complexity?

Complexity?

O(n)

n = total size of input data

ki = arity of attribute i

d

iikO

1

)(• For each combination of values r:– Total = C[r] = 0

• For each data row ri

– C[ri] ++

– Total ++

Page 90: A Not-So-Quick Overview of Probability

Estimating the joint distribution• For each data row ri

– If ri not in hash tables C,Total:

• Insert C[ri] = 0

– C[ri] ++

– Total ++Gender Hours Wealth

g1 h1 w1

g2 h2 w2

.. … …

gN hN wN

Complexity?

Complexity?

O(n)

n = total size of input data

m = size of the model

O(m)

Page 91: A Not-So-Quick Overview of Probability

Another example….

Page 92: A Not-So-Quick Overview of Probability

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)

Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context

Page 93: A Not-So-Quick Overview of Probability

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)

Page 94: A Not-So-Quick Overview of Probability

AN EXAMPLE OF THE JOINT

A B C D Ep

is the effect of the 0.00036

is the effect of a 0.00034

. The effect of this 0.00034

to this effect : “ 0.00034

be the effect of the …

… … … … …

not the effect of any 0.00024

… … … … …

does not affect the general 0.00020

does not affect the question

0.00020

any manner affect the principle

0.00018

Page 95: A Not-So-Quick Overview of Probability

The Joint Distribution for a “Big Data” task….

Page 96: A Not-So-Quick Overview of Probability

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)

Task: distinguish pairs of easily-confused words (“affect” vs “effect”) in context

Page 97: A Not-So-Quick Overview of Probability

Big ML c. 2001 (Banko & Brill, “Scaling to Very Very Large…”, ACL 2001)

Page 98: A Not-So-Quick Overview of Probability

Some of the Joint Distribution

A B C D Ep

is the effect of the 0.00036

is the effect of a 0.00034

. The effect of this 0.00034

to this effect : “ 0.00034

be the effect of the …

… … … … …

not the effect of any 0.00024

… … … … …

does not affect the general 0.00020

does not affect the question

0.00020

any manner affect the principle

0.00018

Page 99: A Not-So-Quick Overview of Probability

An experiment• Starting point: Google books 5-gram data– All 5-grams that appear >= 40 times in a corpus of

1M English books• approx 80B words• 5-grams: 30Gb compressed, 250-300Gb uncompressed• Each 5-gram contains frequency distribution over years

– Extract all 5-grams from books published before 2000 that contain ‘effect’ or ‘affect’ in middle position• about 20 “disk hours”• approx 100M occurrences• approx 50k distinct n-grams --- not big

– Wrote code to compute • Pr(A,B,C,D,E|C=affect or C=effect) • Pr(any subset of A,…,E|any other subset,C=affect V effect)

Page 100: A Not-So-Quick Overview of Probability

Another experiment• Extracted all affect/effect 5-grams from the old

(small) Reuters corpus– about 20k documents– about 723 n-grams, 661 distinct– Financial news, not novels or textbooks

• Tried to predict center word with:– Pr(C|A=a,B=b,D=d,E=e)– then P(C|A,B,D,C=effect V affect)– then P(C|B,D, C=effect V affect)– then P(C|B, C=effect V affect)– then P(C, C=effect V affect)

Page 101: A Not-So-Quick Overview of Probability

EXAMPLES

• “The cumulative _ of the” effect (1.0)• “Go into _ on January” effect (1.0)• “From cumulative _ of accounting” not

present–Nor is ““From cumulative _ of _”–But “_ cumulative _ of _” effect

(1.0)• “Would not _ Finance Minister” not

present–But “_ not _ _ _” affect (0.9625)

Page 102: A Not-So-Quick Overview of Probability

Performance summary

Pattern Used Errors

P(C|A,B,D,E) 101 1

P(C|A,B,D) 157 6

P(C|B,D) 163 13

P(C|B) 244 78

P(C) 58 31