
Binomial and normal distributions

Business Statistics 41000

Fall 2015

1

Topics

1. Sums of random variables

2. Binomial distribution

3. Normal distribution

4. Vignettes

2

Topic: sums of random variables

Sums of random variables are important for two reasons:

1. Because we often care about aggregates and totals (sales, revenue, employees, etc.).

2. Because averages are basically sums, and probabilities are basically averages (of dummy variables), when we go to estimate probabilities, we will end up using sums of random variables a lot.

This second point is the topic of the next lecture. For now, we focus on the direct case.

3

A sum of two random variables

Suppose X is a random variable denoting the profit from one wager and Y is a random variable denoting the profit from another wager.

If we want to consider our total profit, we may consider the random variable that is the sum of the two wagers, S = X + Y.

To determine the distribution of S, we must first know the joint distribution of (X, Y).

4

A sum of two random variables

Suppose that (X, Y) has the following joint distribution:

             -$200    $100    $200
  $0           0       1/9     3/9
  $100        1/9      2/9     2/9

So S can take the values {−200,−100, 100, 200, 300}.

Notice that there are two ways that S can be $200.

5

A sum of two random variables

We can directly determine the distribution of S as:

  s                                       P(S = s)
  -$200  (= -$200 + $0)                   0
  -$100  (= -$200 + $100)                 1/9
   $100  (= $100 + $0)                    1/9
   $200  (= $100 + $100 or $200 + $0)     2/9 + 3/9 = 5/9
   $300  (= $200 + $100)                  2/9

When determining the distribution of sums of random variables, we lose information about individual values and aggregate the probability of events giving the same sum.
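As a small illustration (a Python sketch, not part of the original slides), the same aggregation can be done mechanically from the joint table above, using exact fractions:

    from fractions import Fraction
    from collections import defaultdict

    # Joint distribution of (X, Y) from the table above, in ninths.
    joint = {
        (0, -200): Fraction(0, 9), (0, 100): Fraction(1, 9), (0, 200): Fraction(3, 9),
        (100, -200): Fraction(1, 9), (100, 100): Fraction(2, 9), (100, 200): Fraction(2, 9),
    }

    # Aggregate probability over all (x, y) pairs giving the same sum s = x + y.
    dist_S = defaultdict(Fraction)
    for (x, y), p in joint.items():
        dist_S[x + y] += p

    for s in sorted(dist_S):
        print(s, dist_S[s])   # -200: 0, -100: 1/9, 100: 1/9, 200: 5/9, 300: 2/9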

6

Topic: binomial distribution

A binomial random variable can be constructed as the sum of independent Bernoulli random variables.

Familiarity with the binomial distribution eases many practical probability calculations.

See OpenIntro sections 3.4 and 3.6.4.

7

Sums of Bernoulli RVs

When rolling two dice, what is the probability of rolling two ones?

By independence we can calculate this probability as

P(1, 1) = (1/6)(1/6) = 1/36.

Now with three dice, what is the probability of rolling exactly two 1’s?

8

Sums of Bernoulli RVs (cont’d)

The event A = "rolling a one" can be described as a Bernoulli random variable with p = 1/6.

We can denote the three independent rolls by writing

Xi ∼iid Bernoulli(p), i = 1, 2, 3.

The notation iid is shorthand for "independent and identically distributed".

Determining the probability of rolling exactly two 1's can be done by considering the random variable Y = X1 + X2 + X3 and asking for P(Y = 2).

9

Sums of Bernoulli random variables (cont’d)

Consider the distribution of Y = X1 + X2 + X3.

Event                  y    P(Y = y)
000                    0    (1 − p)^3
001 or 100 or 010      1    (1 − p)(1 − p)p + p(1 − p)(1 − p) + (1 − p)p(1 − p)
011 or 110 or 101      2    (1 − p)p^2 + p^2(1 − p) + p(1 − p)p
111                    3    p^3

Remember that for this example p = 1/6.

10

Sums of Bernoulli random variables (cont’d)

Determining the probability of a certain number of successes requires knowing 1) the probability of each individual success and 2) the number of ways that number of successes can arise.

Event                  y    P(Y = y)
000                    0    (1 − p)^3
001 or 100 or 010      1    3(1 − p)^2 p
011 or 110 or 101      2    3(1 − p)p^2
111                    3    p^3

We find that P(Y = 2) = 3p^2(1 − p) = 3(1/36)(5/6) = (5/6)(1/12) = 5/72.

11

Sums of Bernoulli random variables (cont’d)

What if we had four rolls, and the probability of success was 1/3?

The 16 possible sequences of successes and failures are:

0000  1000  0100  1100  0010  1010  0110  1110
0001  1001  0101  1101  0011  1011  0111  1111

12

Sums of Bernoulli random variables (cont’d)

Summing up the probabilities for each of the values of Y , we find:

y    P(Y = y)
0    (1 − p)^4
1    4(1 − p)^3 p
2    6(1 − p)^2 p^2
3    4(1 − p)p^3
4    p^4

Substituting p = 1/3 we can now find P(Y = y) for any y = 0, 1, 2, 3, 4.
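A quick way to check these entries is to enumerate all 2^4 sequences directly; the sketch below (Python, not part of the original slides) does exactly the counting described above.

    from fractions import Fraction
    from itertools import product

    p = Fraction(1, 3)
    dist = {y: Fraction(0) for y in range(5)}

    # Enumerate all 2^4 sequences of successes (1) and failures (0).
    for seq in product([0, 1], repeat=4):
        prob = Fraction(1)
        for outcome in seq:
            prob *= p if outcome == 1 else (1 - p)
        dist[sum(seq)] += prob

    # Matches the table: (1-p)^4, 4(1-p)^3 p, 6(1-p)^2 p^2, 4(1-p) p^3, p^4.
    print(dist)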

13

Definition: N choose y

The number of ways we can arrange y successes among N trials can be calculated efficiently by a computer. We denote this number with a special expression.

N choose y

The notation

(N choose y) = N! / ((N − y)! y!)

designates the number of ways that y items can be assigned to N possible positions.

This notation can be used to summarize the entries in the previous tables for various values of N and y.

14

Definition: Binomial distribution

Binomial distribution

A random variable Y has a binomial distribution with parameters N and p if its probability distribution function is of the form:

p(y) = (N choose y) p^y (1 − p)^(N−y)

for integer values of y between 0 and N.
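As a sketch (Python standard library only; not from the slides), the pmf can be coded directly from this formula:

    from math import comb

    def binom_pmf(y: int, N: int, p: float) -> float:
        """P(Y = y) for Y ~ Bin(N, p)."""
        return comb(N, y) * p**y * (1 - p)**(N - y)

    # Sanity check against the dice example: N = 3, p = 1/6 gives P(Y = 2) = 5/72.
    print(binom_pmf(2, 3, 1/6))   # approximately 0.0694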

15

Example: drunk batter

What is the probability that our alcoholic major-leaguer gets more than 2 hits in a game in which he has 5 at bats?

Let X = "number of hits". We model X as a binomial random variable with parameters N = 5 and p = 0.316.

x    P(X = x)
0    (1 − p)^5
1    5(1 − p)^4 p
2    10(1 − p)^3 p^2
3    10(1 − p)^2 p^3
4    5(1 − p)p^4
5    p^5

Substituting p = 0.316 we calculate P(X > 2) = 0.185.
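A quick check of this number (a sketch, not from the slides), summing the binomial pmf over x = 3, 4, 5:

    from math import comb

    N, p = 5, 0.316

    # P(X > 2) = P(3) + P(4) + P(5) for X ~ Bin(5, 0.316).
    prob = sum(comb(N, x) * p**x * (1 - p)**(N - x) for x in range(3, 6))
    print(round(prob, 3))   # 0.185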

16

Example: winning a best-of-seven play-off

Assume that the Chicago Bulls have probability 0.4 of beating the Miami Heat in any given game and that the outcomes of individual games are independent.

What is the probability that the Bulls win a seven game series against the Heat?

17

Example: winning a best-of-seven play-off (cont’d)

Consider the number of games won by the Bulls over a full seven games against the Heat. We model this as a binomial random variable Y with parameters N = 7 and p = 0.4, which we express with the notation

Y ∼ Bin(7, 0.4).

The symbol "∼" is read "distributed as". "Bin" is short for "binomial". The numbers which follow are the values of the two binomial parameters, the number of independent Bernoulli trials (N) and the probability of success at each trial (p).

18

Example: winning a best-of-seven play-off (cont’d)

Although we never see all seven games played (because the series stops as soon as one team wins four games) we note that in this expanded event space

- any event with at least four Bulls wins corresponds to an observable Bulls series win,

- any event corresponding to an observed Bulls series win has at least four total Bulls wins.

19

Example: winning a best-of-seven play-off (cont’d)

For example, the observable sequence 011011 (where a 1 stands for a Bulls win) has two possible completions, 0110110 or 0110111. Any hypothetical games played beyond the series-ending fourth win can only increase the total number of wins tallied by Y.

Conversely, the sequence 1010111 is an event corresponding to Y = 5 and we can associate it with the observable subsequence 101011, a Bulls series win in six games.

20

Example: winning a best-of-seven play-off (cont’d)

Therefore, the events corresponding to "Bulls win the series" are precisely those corresponding to Y ≥ 4.

We may conclude that the probability of a series win for the Bulls is

P(Y ≥ 4) = P(Y = 4) + P(Y = 5) + P(Y = 6) + P(Y = 7)

= 0.29.
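The same calculation in code (a sketch, not from the slides):

    from math import comb

    N, p = 7, 0.4

    # P(Bulls win series) = P(Y >= 4) for Y ~ Bin(7, 0.4).
    prob = sum(comb(N, y) * p**y * (1 - p)**(N - y) for y in range(4, 8))
    print(round(prob, 2))   # 0.29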

21

Example: winning a best-of-seven play-off (cont’d)

We can arrive at this answer without reference to the binomial random variable Y if we are willing to do our own counting.

P(Bulls series win) = p^4 + (4 choose 3) p^4 (1 − p) + (5 choose 3) p^4 (1 − p)^2 + (6 choose 3) p^4 (1 − p)^3
                    = p^4 + (4 choose 1) p^4 (1 − p) + (5 choose 2) p^4 (1 − p)^2 + (6 choose 3) p^4 (1 − p)^3
                    = 0.29.

This calculation explicitly accounts for the fact that Bulls series wins necessarily conclude with a Bulls game win.

22

Example: double lottery winners

In 1971, Jane Adams won the lottery twice in one year! If you read of a double winner in your daily newspaper, how surprised should you be?

To answer this question we need to make some assumptions. Consider 40 state lotteries. Assume that each one has a 1 in 18 million chance of winning. Assume that each one has 1 million people that play it daily (say, 250 times a year), and that each one buys 5 tickets.

Given these conditions, what is the probability that in one calendar year there is at least one double winner?

23

Example: double lottery winners (cont’d)

Let Xi be the random variable denoting how many winning tickets person i has:

Xi ∼ Binomial(5(250), p = (1/18) × 10^−6).

Now let Yi be the dummy variable for the event Xi > 1, which is the event that person i is a double (or more) winner:

Yi ∼ Bernoulli(q).

We can compute q = 1 − Pr(Xi = 0) − Pr(Xi = 1) = 2.4 × 10^−9.

24

Example: double lottery winners (cont’d)

To account for the million people playing the lottery in each of 40 states, we consider Z = Y1 + Y2 + · · · + YN, which is another binomial random variable:

Z ∼ Binomial(N = 4 × 10^7, q).

Finally, the probability that Z > 0 can be found as

1 − P(Z = 0) = 1 − (1 − q)^N = 1/11.

Not so rare!
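The whole chain of calculations fits in a few lines (a Python sketch, not from the slides; the ticket counts and winning probability are the assumptions stated above):

    n_tickets = 5 * 250            # tickets per player per year
    p_win = (1 / 18) * 1e-6        # chance a single ticket wins

    # q = P(a given player has 2 or more winning tickets in a year)
    p0 = (1 - p_win) ** n_tickets
    p1 = n_tickets * p_win * (1 - p_win) ** (n_tickets - 1)
    q = 1 - p0 - p1                # about 2.4e-9

    n_players = 40 * 1_000_000     # 40 state lotteries, 1 million players each
    p_double = 1 - (1 - q) ** n_players
    print(q, p_double)             # roughly 2.4e-9 and 0.09 (about 1/11)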

25

Example: rural vs. urban hospitals

About as many boys as girls are born in hospitals. In a small Country Hospital only a few babies are born every week. In the urban center, many babies are born every week at City General. Say that a normal week is one where between 45% and 55% of the babies are female. An unusual week is one where more than 55% are girls or more than 55% are boys.

Which of the following is true?

- Unusual weeks occur equally often at Country Hospital and at City General.

- Unusual weeks are more common at Country Hospital than at City General.

- Unusual weeks are less common at Country Hospital than at City General.

26

Example: rural vs. urban hospital (cont’d)

We can model the births in the two hospitals as two independent random variables. Let X = "number of baby girls born at Country Hospital" and Y = "number of baby girls born at City General".

X ∼ Binomial(N1, p)

Y ∼ Binomial(N2, p)

Assume that p = 0.5. The key difference is that N1 is much smaller than N2. To illustrate, assume that N1 = 20 and N2 = 500.

27

Example: rural vs. urban hospital (cont’d)

During a usual week at the rural hospital between 0.45N1 = 0.45(20) = 9 and 0.55N1 = 0.55(20) = 11 baby girls are born.

The probability of a usual week is P(9 ≤ X ≤ 11) ≈ 0.50, so the probability of an unusual week is

1 − P(9 ≤ X ≤ 11) = P(X < 9) + P(X > 11) ≈ 0.5.

Note: satisfying the condition X < 9 is the same as not satisfying the condition X ≥ 9; strict versus non-strict inequalities make a difference.

28

Example: rural vs. urban hospital (cont’d)

[Figure: probability distribution of weekly girl births at Country Hospital (x-axis: births 0-20, y-axis: probability)]

29

Example: rural vs. urban hospital (cont’d)

In a usual week at the city hospital between 0.45N2 = 0.45(500) = 225 and 0.55N2 = 0.55(500) = 275 baby girls are born.

Then the probability of a usual week is P(225 ≤ Y ≤ 275) = 0.978, so the probability of an unusual week is

1 − P(225 ≤ Y ≤ 275) = P(Y < 225) + P(Y > 275) = 0.022.
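Both tail calculations can be checked with the binomial pmf (a sketch, not from the slides):

    from math import comb

    def binom_prob_range(lo, hi, N, p):
        """P(lo <= X <= hi) for X ~ Bin(N, p)."""
        return sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(lo, hi + 1))

    usual_country = binom_prob_range(9, 11, 20, 0.5)      # about 0.50
    usual_city = binom_prob_range(225, 275, 500, 0.5)     # about 0.978
    print(1 - usual_country, 1 - usual_city)              # about 0.50 and 0.022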

30

Example: rural vs. urban hospital (cont’d)

[Figure: probability distribution of weekly girl births at City General (x-axis: births, y-axis: probability)]

31

Variance of a sum of independent random variables

A useful fact:

Variance of linear combinations of independent random variables

A weighted sum/difference of independent random variables Y = a1X1 + · · · + amXm has variance

V(Y) = a1^2 V(X1) + · · · + am^2 V(Xm).

How can this be used to derive the expression for the variance of a binomial random variable?

32

Variance of binomial random variable

Variance of a binomial random variable

A binomial random variable X with parameters N and p has variance

V(X ) = Np(1− p).

33

Variance of a proportion

By dividing through by the total number of babies born each week we can consider the proportion of girl babies. Define the random variables

P1 = X/N1   and   P2 = Y/N2.

Then it follows that

V(P1) = V(X)/N1^2 = N1 p(1 − p)/N1^2 = p(1 − p)/N1

and

V(P2) = V(Y)/N2^2 = N2 p(1 − p)/N2^2 = p(1 − p)/N2.

34

Law of Large Numbers

An arithmetical average of random variables is itself a random variable.

As more and more individual random variables are averaged up, the variance decreases but the mean stays the same.

As a result, the distribution of the averaged random variable becomes more and more concentrated around its expected value.

35

Law of Large Numbers

[Figure: distribution of the sample proportion (N = 10, p = 0.7)]

36

Law of Large Numbers

[Figure: distribution of the sample proportion (N = 20, p = 0.7)]

37

Law of Large Numbers

[Figure: distribution of the sample proportion (N = 50, p = 0.7)]

38

Law of Large Numbers

[Figure: distribution of the sample proportion (N = 150, p = 0.7)]

39

Law of Large Numbers

[Figure: distribution of the sample proportion (N = 300, p = 0.7)]

40

Example: Schlitz Super Bowl taste test

41

Bell curve approximation to binomial

The binomial distributions can be approximated by a smooth density function for large N.

[Figure: normal approximation for the binomial distribution with N = 20, p = 0.5 (x-axis: x, y-axis: probability mass / density)]

42

Bell curve approximation to binomial

[Figure: normal approximation for the binomial distribution with N = 60, p = 0.1 (x-axis: x, y-axis: probability mass / density)]

43

Bell curve approximation to binomial

[Figure: normal approximation for the binomial distribution with N = 500, p = 0.8 (x-axis: x, y-axis: probability mass / density)]

What are some reasons that very small p or small N lead to bad approximations?

44

Central limit theorem

The normal distribution can be "justified" via its relationship to the binomial distribution. Roughly: if a random outcome is the combined result of many individual random events, its distribution will follow a normal curve.

The quincunx or Galton box is a device which physically simulates such a scenario using ball bearings and pins stuck in a board.


The CLT can be stated more precisely, but the practical impact is just this: random variables which arise as sums of many other random variables (not necessarily normally distributed) tend to be normally distributed.

45

Normal distributions

The normal family of densities has two parameters, typically denoted µ and σ^2, which govern the location and scale, respectively.

[Figure: Gaussian densities for various location parameters (x-axis: x, y-axis: f(x))]

46

Normal distributions (cont’d)

I will use the terms normal distribution, normal density and normal random variable more or less interchangeably.

[Figure: mean-zero Gaussian densities with differing scale parameters (x-axis: x, y-axis: f(x))]

The normal distribution is also called the Gaussian distribution or the bell curve.

47

Normal means and variances

Mean and variance of a normal random variable

A normal random variable X, with parameters µ and σ^2, is denoted

X ∼ N(µ, σ^2).

The mean and variance of X are

E(X) = µ,
V(X) = σ^2.

The density function is symmetric and unimodal, so the median and mode of X are also given by the location parameter µ. The standard deviation of X is given by σ.

48

Normal approximation to binomial

The binomial distributions can be approximated by a normal distribution.

Normal approximation to the binomial

A Bin(N, p) distribution can be approximated by a N(Np, Np(1 − p)) distribution for N "large enough".

Notice that this just "matches" the mean and variance of the two distributions.
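A rough check of the approximation (a sketch, not from the slides), comparing an exact binomial tail with the matching normal tail built from the standard-normal CDF via math.erf:

    from math import comb, erf, sqrt

    def norm_cdf(x, mu, sigma):
        """CDF of N(mu, sigma^2)."""
        return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

    N, p = 500, 0.8
    mu, var = N * p, N * p * (1 - p)     # matched mean and variance: 400 and 80

    # Exact P(X <= 410) versus the normal approximation.
    exact = sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(0, 411))
    approx = norm_cdf(410, mu, sqrt(var))
    print(round(exact, 3), round(approx, 3))   # about 0.88 (exact) versus 0.87 (approximation)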

49

Linear transformation of normal RVs

We can add a fixed number to a normal random variable and/or multiply it by a fixed number and get a new normal random variable. This sort of operation is called a linear transformation.

Linear transformation of normal random variables

If X ∼ N(µ, σ^2) and Y = a + bX for fixed numbers a and b, then Y ∼ N(a + bµ, b^2 σ^2).

For example, if X ∼ N(1, 2) and Y = 3− 5X , then Y ∼ N(−2, 50).
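A quick Monte Carlo sanity check of this example (a sketch, not from the slides; note that the second parameter of N(1, 2) is the variance, so the sampler is given its square root as the standard deviation):

    import random
    from math import sqrt
    from statistics import mean, variance

    random.seed(0)
    x = [random.gauss(1, sqrt(2)) for _ in range(200_000)]   # X ~ N(1, 2)
    y = [3 - 5 * xi for xi in x]                              # Y = 3 - 5X

    print(round(mean(y), 2), round(variance(y), 1))           # close to -2 and 50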

50

Standard normal RV

Standard normal

A standard normal random variable is one with mean 0 and variance 1. It is often denoted by the letter Z:

Z ∼ N(0, 1).

We can write any normal random variable as a linear transformation of a standard normal RV. For normal random variable X ∼ N(µ, σ^2), we can write

X = µ + σZ.

51

The “empirical rule”

It is convenient to characterize where the "bulk" of the probability mass of a normal distribution resides by providing an interval, in terms of standard deviations, about the mean.

[Figure: N(µ, σ) density; the interval from µ − σ to µ + σ contains 68% of the probability]

52

The “empirical rule” (cont’d)

The widespread application of the normal distribution has led this to be dubbed the empirical rule.

[Figure: N(µ, σ) density; the interval from µ − 2σ to µ + 2σ contains 95% of the probability]

53

The “empirical rule” (cont’d)

It is, for obvious reasons, sometimes called the 68-95-99.7 rule.

[Figure: N(µ, σ) density; the interval from µ − 3σ to µ + 3σ contains 99.7% of the probability]

54

The “empirical rule” (cont’d)

To revisit some earlier examples:

- 68% of Chicago daily highs in the winter season are between 19 and 48 degrees.

- 95% of NBA players are between 6ft and 7ft 2in.

- In 99.7% of weeks, the proportion of baby girls born at City General is between 0.433 and 0.567.
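The City General figure follows from the variance-of-a-proportion formula from earlier (a sketch, not from the slides; it assumes the N2 = 500, p = 0.5 numbers used in the hospital example):

    from math import sqrt

    N, p = 500, 0.5
    sd = sqrt(p * (1 - p) / N)          # standard deviation of the weekly proportion

    lo, hi = p - 3 * sd, p + 3 * sd     # the 99.7% (three-sigma) interval
    print(round(lo, 3), round(hi, 3))   # about 0.433 and 0.567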

55

Sums of normal random variables

Weighted sums of normal random variables are also normally distributed.

For example if

X1 ∼ N(5, 20) and X2 ∼ N(1, 0.5)

then for Y = 0.1X1 + 0.9X2,

Y ∼ N(m, v),

where m = 0.1(5) + 0.9(1) = 1.4 and v = 0.1^2(20) + 0.9^2(0.5) = 0.605.

56

Linear combinations of normal RVs

Linear combinations of independent normal random variables

For i = 1, . . . , n, let

Xi ∼ N(µi, σi^2), independently across i.

Define Y = a1X1 + · · · + anXn for weights a1, a2, . . . , an. Then

Y ∼ N(m, v)

where

m = a1µ1 + · · · + anµn   and   v = a1^2 σ1^2 + · · · + an^2 σn^2.

57

Example: two-stock portfolio

Consider two stocks, A and B, with annual returns (in percent of investment) distributed according to normal distributions

XA ∼ N(5, 20) and XB ∼ N(1, 0.5).

What fraction of our investment should we put into stock A, with the remainder put in stock B?

58

Example: two-stock portfolio (cont’d)

For a given fraction α, the total return on our portfolio is

Y = αXA + (1− α)XB

with distribution

Y ∼ N(m, v),

where m = 5α + (1 − α) and v = 20α^2 + 0.5(1 − α)^2.

59

Example: two-stock portfolio (cont’d)

Suppose we want to find α so that P(Y ≤ 0) is as small as possible.

[Figure: Two-stock portfolio; densities of percent return for Stock A, Stock B, and (in blue) the portfolios corresponding to varying values of α]

Example: two-stock portfolio (cont’d)

We can plot the probability of a loss as a function of α.

[Figure: probability of a loss as a function of α, for α from 0 to 1]

We see that this probability is minimized when α = 11% approximately. This is the LLN at work!
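A sketch of that scan (Python, not from the slides), using the normal CDF built from math.erf:

    from math import erf, sqrt

    def loss_prob(alpha):
        """P(Y <= 0) where Y = alpha*X_A + (1-alpha)*X_B, X_A ~ N(5, 20), X_B ~ N(1, 0.5)."""
        m = 5 * alpha + 1 * (1 - alpha)
        v = 20 * alpha**2 + 0.5 * (1 - alpha)**2
        z = (0 - m) / sqrt(v)
        return 0.5 * (1 + erf(z / sqrt(2)))    # standard normal CDF at z

    alphas = [i / 1000 for i in range(1001)]
    best = min(alphas, key=loss_prob)
    print(best, round(loss_prob(best), 2))     # alpha near 0.11, loss probability about 0.04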

61

Variance of a sum of correlated random variables

For correlated (dependent) random variables, we have a modified formula:

Variance of linear combinations of two correlated random variables

A weighted sum/difference of random variables Y = a1X1 + a2X2 has variance

V(Y) = a1^2 V(X1) + a2^2 V(X2) + 2a1a2 Cov(X1, X2).

There is a homework problem that asks you to find the variance of portfolios of stocks, as in the example above, for stocks which are related to one another (in a common industry, for example).

62

Vignettes

1. Differential dispersion

2. Average number of sex partners

3. Mean reversion

63

Vignette: a difference in dispersion

In this vignette we observe how selection (in the sense of evolution, or hiring, or admissions) can turn higher variability into over-representation. The analysis uses the ideas of random variables, distribution functions, and conditional probability.

For more background, read the article "Sex Ed" from the February 2005 issue of the New Republic (available at the course home page).

64

A difference in dispersion

Consider two groups of college graduates with "employee fitness scores" following the distributions shown below.

Score:     -5      -4      -3      -2      -1       0       1       2       3       4       5
Group A:  0.043   0.051   0.064   0.085   0.128   0.256   0.128   0.085   0.064   0.051   0.043
Group B:  0.003   0.008   0.023   0.063   0.171   0.464   0.171   0.063   0.023   0.008   0.003

These distributions have the same mean, the same median, and the same mode. But they differ in their dispersion, or variability.

65

A difference in dispersion (cont’d)

Let X denote the random variable recording the scores and let A and B denote membership in the respective groups. For the distributions shown above,

V(X | A) = 5.87 and V(X | B) = 1.666.

The corresponding standard deviations are σ(X | A) = 2.42 and σ(X | B) = 1.29.

A difference in dispersion (cont’d)

But now consider only elite jobs, for which it is necessary that fitness score X ≥ 4.

We can use Bayes’ rule to calculate P(A | X ≥ 4) and P(B | X ≥ 4).

A difference in dispersion (cont’d)

If we assume a priori that P(A) = P(B) = 1/2, we find

P(A | X ≥ 4) = P(X ≥ 4 | A)P(A) / [P(X ≥ 4 | A)P(A) + P(X ≥ 4 | B)P(B)]
             = 0.094(0.5) / [0.094(0.5) + 0.012(0.5)]
             = 0.89.

Why don’t we need to calculate P(B | X ≥ 4) separately?
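The arithmetic, as a sketch (not from the slides), using the tail probabilities P(X ≥ 4 | A) = 0.094 and P(X ≥ 4 | B) = 0.012 read off the two distributions:

    # Prior: the two groups are equally likely a priori.
    prior_A, prior_B = 0.5, 0.5

    # Tail probabilities of an "elite" score, X >= 4, in each group.
    tail_A, tail_B = 0.094, 0.012

    posterior_A = tail_A * prior_A / (tail_A * prior_A + tail_B * prior_B)
    print(round(posterior_A, 2), round(1 - posterior_A, 2))   # 0.89 and 0.11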

68

Larry Summers and women-in-science

"Summers's critics have repeatedly mangled his suggestion that innate differences might be one cause of gender disparities ... into the claim that they must be the only cause. And they have converted his suggestion that the statistical distributions of men's and women's abilities are not identical to the claim that all men are talented and all women are not – as if someone heard that women typically live longer than men and concluded that every woman lives longer than every man. . . .

In many traits, men show greater variance than women, and are disproportionately found at both the low and high ends of the distribution. Boys are more likely to be learning disabled or retarded but also more likely to reach the top percentiles in assessments of mathematical ability, even though boys and girls are similar in the bulk of the bell curve. . . ."

Steven Pinker in The New Republic

69

Example: gender and aptitudes revisited

Assume that job "aptitude" can be represented as a continuous random variable and that the distribution of scores differs by gender.

[Figure: aptitude distributions (densities) for women and men; x-axis: score]

For women, 93.7% of the scores are between the vertical dashed lines, whereas only 68.6% of the men's scores fall in this range.

Example: gender and aptitudes revisited (cont’d)

The corresponding CDFs reveal the same difference.

[Figure: cumulative distribution functions F(x) of the two aptitude distributions; x-axis: score]

These distributions are meant to be illustrative rather than factual.

Sex partners vignette: which average?

Here is a torn-from-the-headlines example of why it pays to know a little probability.

"Everyone knows men are promiscuous by nature... Surveys bear this out. In study after study and in country after country, men report more, often many more, sexual partners than women...

But there is just one problem, mathematicians say. It is logically impossible for heterosexual men to have more partners on average than heterosexual women. Those survey results cannot be true."

72

A sex-partners statistical model

Question: is it possible for men to have more sex partners, on average, than women?

To answer this question, we will consider a "toy" probability model for homo sapiens mating behavior.

              John    Lenny    Romeo
  Sally       0.07    0.06     0.05
  Chastity    0.5     0.5      0.5
  Maude       0.05    0.04     0.09

Let’s call it the “summer camp” model.

73

A sex-partners random variable

The quantity of interest is the number of sex partners. In our model, this will be a number between 0 and 3.

For each individual we can compute the distribution of this random variable. We will denote individuals by their first initial; an initial marked with * means they partnered, an unmarked initial means they did not.

We will assume independence. This means, for example, that Sally hooking up with Romeo makes it neither more nor less likely that she will hook up with Lenny.

74

Sally’s sex-partner distribution

Event                           x    P(Xs = x)
J L R                           0    (1-0.07)(1-0.06)(1-0.05)
J* L R or J L* R or J L R*      1    (0.07)(1-0.06)(1-0.05) + (1-0.07)(0.06)(1-0.05) + (1-0.07)(1-0.06)(0.05)
J* L* R or J L* R* or J* L R*   2    (0.07)(0.06)(1-0.05) + (1-0.07)(0.06)(0.05) + (0.07)(1-0.06)(0.05)
J* L* R*                        3    (0.07)(0.06)(0.05)

Can you see the probability laws in action here?

75

Sally’s sex-partner distribution

Event                           x    ps(x) = P(Xs = x)
J L R                           0    0.83
J* L R or J L* R or J L R*      1    0.16
J* L* R or J L* R* or J* L R*   2    0.01
J* L* R*                        3    0.0002

Here is what it looks like after the calculation (rounded a bit). We can do similarly for each individual.

76

Sally’s sex-partners distribution

Here is a picture of Sally’s sex partner distribution.

[Figure: distribution of sex partners for Sally; P(0) = 0.8305, P(1) = 0.1592, P(2) = 0.0101, P(3) = 2e-04]

The mean is 0(0.83) + 1(0.16) + 2(0.01) + 3(0.0002) = 0.18. What is the mode? What is the median?

77

Female sex-partner distribution

To get the distribution for all females, we sum over the individual women. We apply the law of total probability using all three conditional distributions:

pfemale(x) = ps(x)P(Sally) + pc(x)P(Chastity) + pm(x)P(Maude).

We assume that the women are selected at random with equal probability P(Maude) = P(Chastity) = P(Sally) = 1/3.
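A sketch of the whole calculation (Python, not from the slides), building each person's partner-count distribution from the pairing probabilities in the table and then mixing with equal weights:

    from itertools import product

    pair = {  # pairing probabilities from the "summer camp" table (rows: women, columns: men)
        "Sally":    {"John": 0.07, "Lenny": 0.06, "Romeo": 0.05},
        "Chastity": {"John": 0.50, "Lenny": 0.50, "Romeo": 0.50},
        "Maude":    {"John": 0.05, "Lenny": 0.04, "Romeo": 0.09},
    }

    def partner_dist(probs):
        """Distribution of the number of partners given independent pairing probabilities."""
        dist = [0.0] * (len(probs) + 1)
        for outcome in product([0, 1], repeat=len(probs)):
            pr = 1.0
            for happened, p in zip(outcome, probs):
                pr *= p if happened else 1 - p
            dist[sum(outcome)] += pr
        return dist

    def dist_mean(dist):
        return sum(x * p for x, p in enumerate(dist))

    women = [partner_dist(list(row.values())) for row in pair.values()]
    men = [partner_dist([pair[w][m] for w in pair]) for m in ["John", "Lenny", "Romeo"]]

    female = [sum(d[x] for d in women) / 3 for x in range(4)]   # equal-weight mixture
    male = [sum(d[x] for d in men) / 3 for x in range(4)]

    print([round(p, 4) for p in female], round(dist_mean(female), 2))  # ~[0.5951, 0.2315, 0.1315, 0.0418], mean 0.62
    print([round(p, 4) for p in male], round(dist_mean(male), 2))      # ~[0.4417, 0.4983, 0.0583, 0.0017], mean 0.62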

78

Female sex-partner distribution

At the end we get a distribution like this.

[Figure: distribution of sex partners for females; P(0) = 0.5951, P(1) = 0.2315, P(2) = 0.1315, P(3) = 0.0418]

The mean is 0.62, the mode is 0, and the median is 0.

Male sex-partner distribution

We can do the same thing for the males, and we get this.

[Figure: distribution of sex partners for males; P(0) = 0.4417, P(1) = 0.4983, P(2) = 0.0583, P(3) = 0.0017]

The mean is 0.62, the mode is 1, and the median is 1.

Sex-partners vignette recap

The narrow lesson is that it pays to be specific about which measure of central tendency you're talking about!

The more general lesson is that using probability models and a little bit of algebra can help us see a situation more clearly.

This example uses the concepts of random variable, independence, conditional distribution, mean, median... and others.

81

Idea: statistical “null” hypotheses

The hypothesis that events are independent often makes a nice contrast to other explanations, namely that random events are somehow related.

This vantage point allows us to judge if those other explanations fit the facts any better than the uninteresting "null" explanation that events are independent.

82

Vignette: making better pilots

Flight instructors have a policy of berating pilots who make bad landings. They notice that good landings met with praise mostly result in subsequently less-good landings, while bad landings met with harsh criticism mostly result in subsequently improved landings.

Is their causal reasoning necessarily valid?

To stress-test their judgment that "criticism works" we consider the evidence in light of the null hypothesis that subsequent landings are in fact independent of one another, regardless of criticism or praise.

83

Example: making better pilots (cont’d)

Contrary to the assumptions of the instructors, consider each landing as independent of subsequent landings (irrespective of feedback).

Assume that landings can be classified into three types: bad, adequate, or good. Further assume the following probabilities:

Event       Probability
bad         pb
adequate    pa
good        pg

Remember that pb + pa + pg = 1.

84

Example: making better pilots (cont’d)

Assume that the policy of criticism is judged to work when a bad landing is followed by a not-bad landing. Then

P(criticism seems to work) = P(not bad2 | bad1) = P(not bad2) = pa + pg

by independence.

Conversely, the policy of praise appears to work when a good landing is followed by another good landing. So

P(good2 | good1) = P(good2) = pg .

Praise always appears to work less often than criticism!

85

Remark: null and alternative hypotheses

The previous example shows that the evidence can appear to favor criticism over praise even if criticism and praise are totally irrelevant.

Does this mean that criticism does not work?

No, it just means that the observed facts are not compelling evidence that criticism works, because they are entirely consistent with the null hypothesis that landing quality is independent of previous landings and feedback.

In cases like this we say we "fail to reject the null hypothesis". We'll revisit this terminology a couple weeks from now.

86

Example: making better pilots (continuous version)

What if we want to take pilot skill into account?

We will model this situation using normal random variables and see if the same conclusions (that praise appears to hurt performance and criticism seems to boost it) could arise by chance.

87

Example: making better pilots (continuous version, cont’d)

Assume that each pilot has a certain ability level, call it A. Each individual landing score arises as a combination of this ability and certain random fluctuations, call them ε. The landing score at time t can be expressed as

St = A + εt.

Assuming that εt ∼iid N(0, σ^2), then

St ∼ N(A, σ^2).
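A small simulation sketch (not from the slides; the ability A and scale σ below are illustrative assumptions): landings are independent draws around the pilot's ability, yet a landing following an exceptional one usually looks worse, and one following a terrible landing usually looks better.

    import random

    random.seed(1)
    A, sigma = 1.0, 1.0          # assumed ability and noise scale (illustrative values)

    worse_after_great = better_after_awful = 0
    n_great = n_awful = 0
    for _ in range(100_000):
        s1 = A + random.gauss(0, sigma)   # landing score at time 1
        s2 = A + random.gauss(0, sigma)   # independent landing score at time 2
        if s1 > A + 2 * sigma:            # exceptional landing, would earn praise
            n_great += 1
            worse_after_great += s2 < s1
        if s1 < A - 2 * sigma:            # very poor landing, would earn criticism
            n_awful += 1
            better_after_awful += s2 > s1

    print(worse_after_great / n_great, better_after_awful / n_awful)  # both close to 0.99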

88

Example: making better pilots (continuous version, cont’d)

Denote an average landing score as M. Consider a pilot with A > M. When he makes an exceptional landing, because ε1 > 2σ, he is unlikely to best it on his next landing.

[Figure: distribution of the next landing score S2, centered at A, with M and A + ε1 marked]

For this reason, praise is unlikely to "work" even though landings are independent of one another.

Example: making better pilots (continuous version, cont’d)

For a poor pilot with A < M a similar argument holds. When he makes a very poor landing, because ε1 < −2σ, he is unlikely to do worse on his next landing.

[Figure: distribution of the next landing score S2, centered at A, with A + ε1 and M marked]

For this reason, criticism is likely to "work" even though landings are independent.

Idea: mean reversion

The previous example illustrates an idea known as mean reversion.

This name refers to the fact that subsequent observations tend to be "pulled back" towards the overall mean even if the events are independent of one another.

Mean reversion describes a probabilistic fact, not a physical process.

What might the flight instructors have done (as an experiment) to really get to the bottom of their question?

91
