
Probabilities and distributions

Peter Shaw

Introduction

The study of probabilities goes back to a Renaissance dice game, when the Chevalier de Méré posed the following puzzle. Which is more likely: (1) rolling at least one six in four throws of a single die, or (2) rolling at least one double six in 24 throws of a pair of dice? The mathematician Fermat was eventually involved, and statistical analysis was born.

The key element here is the notion of randomness, inherent in use of dice.

Latin ‘Alea’ = dice, gives French ‘Aleatoire’ = random.

(The answer is that getting at least one six in 4 throws is more likely, but only by a small margin: p = 0.5177 vs p = 0.4914.)
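Both figures follow from the complement rule: the chance of at least one success in n independent throws is 1 minus the chance of failing on every throw. A minimal Python sketch of that arithmetic (the variable names are illustrative and not part of the original slides):

```python
# Chance of at least one six in 4 throws of a single die:
# 1 minus the chance of missing the six on all 4 throws
p_one_six = 1 - (5 / 6) ** 4

# Chance of at least one double six in 24 throws of a pair of dice:
# 35 of the 36 outcomes of a pair are NOT a double six
p_double_six = 1 - (35 / 36) ** 24

print(round(p_one_six, 4))     # 0.5177
print(round(p_double_six, 4))  # 0.4914
```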

You never get a straight answer…

The notion of probability is invoked in situations where outcomes are uncertain, or where measurements are subject to detectable levels of error.

In practice this is most situations most of the time!

The media keep looking to scientists for absolute answers: Is beef absolutely safe? Are we sure that climate warming is due to CO2?

Anyone who says “Yes” is not a scientist. The correct answer is “very likely”.

You cannot get absolute answers, but you can get estimates of likelihood = probability.

Roll 2 dice

(Rows: score on the first die. Columns: score on the second die. Cells: total score.)

         1    2    3    4    5    6
    1    2    3    4    5    6    7
    2    3    4    5    6    7    8
    3    4    5    6    7    8    9
    4    5    6    7    8    9   10
    5    6    7    8    9   10   11
    6    7    8    9   10   11   12

There are 36 possible outcomes.

Only 1 combination adds to 2, so P(2) = 1/36.

What is the most likely score, and why? P = ?

The distribution of dice scores: note that it is symmetrical and peaks at 7, with a probability of 6/36 = 1/6 = ?

[Figure: Likelihood of 2 dice score sums (ignoring the rule about doubles that applies in backgammon). x-axis: Score (2 to 12); y-axis: Number of ways (out of 36).]
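The counts behind this distribution can be reproduced by listing all 36 equally likely pairs and tallying their totals. The sketch below is one way to do that in Python; the names `ways` and `most_likely` are illustrative only.

```python
from collections import Counter
from fractions import Fraction

# Count how many of the 36 equally likely (die 1, die 2) pairs give each total
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))

# Print score, number of ways, and probability as an exact fraction
for score in sorted(ways):
    print(score, ways[score], Fraction(ways[score], 36))

# The most likely score is the one reachable in the most ways
most_likely = max(ways, key=ways.get)
print(most_likely, Fraction(ways[most_likely], 36))  # 7 1/6
```

Seven wins because it is the only total that every face of the first die can reach: whatever the first die shows, exactly one face of the second die completes it.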

You rolled double 6 – you must be cheating!

In real life we often have to decide whether an event is a random fluke, or indicates a genuine pattern.

If I rolled 6 sixes, would I have cheated? Actually it is very likely, as 6 sixes would occur 1 time in 6*6*6*6*6*6 = 46,656. But it COULD be due to chance.

We use probability as a tool in decision making.

The field of inferential analysis relies on finding an estimate of the probability for statements being true.

Statement 1: “Soil 1 is more polluted than soil 2”.

Statement 2: “Soil 1 is exactly as polluted as soil 2; any observed differences are due to chance”.

If you find p(Statement 2) = 1 in a million, you judge the 2 soils to differ.

We use probability as a tool in decision making.

The field of inferential analysis relies on finding an estimate of the probability for statements being true.

Statement 1: “Patients treated with compound X have (e.g.) lower blood sugar levels than untreated patients.”

Statement 2: “Patients treated with compound X do not differ from untreated patients; while there may be measurable differences, these are due to chance alone”.

If you find p(Statement 2) = 1 in a million, you judge the 2 groups of patients to differ, implying that the compound is having some detectable effect.

(Would this be absolute proof of its efficacy?)

Normal Distribution

[Figure: bell-shaped curve. x-axis: Size of value; y-axis: Number of observations. Mean and median about the same.]

Also known as the Gaussian distribution, after Carl Friedrich Gauss.

This is the expected distribution when many randomly distributed factors add together. It is found in distributions of body height/weight, chemical concentrations in soil/air/water, and many other situations.

Note the symmetrical bell-shaped curve.

Carl Friedrich Gauss, 30/4/1777 – 23/2/1855

The Gaussian distribution was one of the many deeply significant mathematical discoveries made by Carl Gauss, who was probably the greatest mathematician in history. At the age of 7, when he started school, he was asked (by an exasperated tutor who wanted to put this little upstart in his place) to add up the numbers 1+2+3…+99+100. Little Carl promptly and contemptuously wrote down 5050 on his slate and threw it onto the teacher’s desk!

How we think he did it:

1 + 100 = 101
2 + 99 = 101
3 + 98 = 101
Etc.

There are 50 such pairs: 50*101 = 5050
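The pairing argument is easy to check by machine. The following Python sketch builds the 50 pairs explicitly and compares the result with a brute-force sum; the variable names are illustrative only.

```python
n = 100

# Pair the k-th number from the start with the k-th from the end:
# (1, 100), (2, 99), ... (50, 51)
pairs = [(k, n + 1 - k) for k in range(1, n // 2 + 1)]

# Every pair adds up to the same value, n + 1 = 101
assert all(a + b == n + 1 for a, b in pairs)

total = len(pairs) * (n + 1)
print(len(pairs), total)               # 50 5050
print(total == sum(range(1, n + 1)))   # True
```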

You only need 2 numbers to define a Normal curve:

The mean μ

The standard deviation σ

Any observation in a dataset can be re-coded in terms of how many standard deviations away from the mean it lies

A powerful universal principle:

The Normal distribution is immensely useful because it is universal: the same shape describes human height, hardness of stones, strength of winds…

The way to convert any arbitrary set of data into the universal distribution is to recode as follows:

Convert each observation into a number telling you how many s.d.s it is away from the mean. This is called a Z score (I don’t know why):

Z_i = (X_i - μ)/σ
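As a rough illustration of the recoding, here is a small Python sketch. The function name `z_scores` and the finger lengths are made up for the example, and it assumes the population standard deviation (`statistics.pstdev`) is the σ intended; with a sample standard deviation the scores would be slightly smaller in magnitude.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Recode each observation as its distance from the mean, in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population s.d. (assumption); statistics.stdev is the sample version
    return [(x - mu) / sigma for x in values]

# Hypothetical finger lengths in mm, invented purely for illustration
lengths_mm = [68, 70, 72, 74, 76]
print([round(z, 2) for z in z_scores(lengths_mm)])
# [-1.41, -0.71, 0.0, 0.71, 1.41]
```

After recoding, the scores always have mean 0 and standard deviation 1, whatever units the original data were in.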

And the point of this? It is that you can look up Z scores in tables, confident in the knowledge that:

c. 68% of the points will lie between Z = -1.0 and Z = 1.0 (i.e. within 1 sd of the mean)

c. 95% of the points lie within +- 2 sd of the mean

99.7% of points are within +- 3 sd of the mean
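These coverage figures can be computed directly rather than read from tables. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)  # the standard Normal curve

def fraction_within(k):
    """Fraction of a Normal population lying within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```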

We’ll try this out!

Measure the length of your left index finger, in mm.

I’ll enter a subset into the PC, and we’ll see whether a Gaussian curve emerges.

Given the mean + sd, you work out your own Z score!

You should know:

That the area under the standard normal curve corresponds to probability: specifically, the area to the left of a given Z value is the probability of finding an observation less than that Z value.

The total area under the curve, from -infinity to +infinity, is 1.0

You don’t need to know:

Equation of curve is: Y = (1/√(2π)) exp(-½Z²)

Z = 0: area above Z = 0.5, i.e. half the curve lies below the mean

Z = 1.0: area above Z = 0.1587, i.e. about 84% of the data lies below (mean + 1 sd)
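The tail areas quoted for Z = 0 and Z = 1.0 are what a Z table would give; a short Python sketch can reproduce them. The helper name `area_above` is an illustrative choice, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

def area_above(z):
    """Upper-tail area of the standard Normal curve beyond a given Z score."""
    return 1.0 - NormalDist().cdf(z)

print(round(area_above(0.0), 4))      # 0.5
print(round(area_above(1.0), 4))      # 0.1587
print(round(1 - area_above(1.0), 4))  # 0.8413, the share below (mean + 1 sd)
```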

Applied example:

A factory making widgets can only sell those whose length is between 98 and 101 mm.

The machine makes widgets with a mean of 100 mm and an sd of 0.7 mm.

What % of widgets are rejected as unsaleable due to size?

Convert data into Z scores:
98: (98-100)/0.7 = -2.857
101: (101-100)/0.7 = 1.429

Area above Z = 1.429 = 0.077
Area below Z = -2.857 = 0.002

Acceptable area (purple) = 1 - (0.077 + 0.002) = 0.921, so about 7.9% of widgets are rejected.

Upper tail of distribution, area = 0.077
Lower tail of distribution, area = 0.002
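The widget calculation can be checked numerically without a Z table. This is a sketch under the example's own assumptions (Normally distributed sizes, mean 100 mm, sd 0.7 mm, limits 98 and 101 mm); the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 100.0, 0.7            # mean and s.d. in mm, from the example
lower_limit, upper_limit = 98.0, 101.0
widgets = NormalDist(mu=mu, sigma=sigma)

# Z scores of the two size limits
z_lower = (lower_limit - mu) / sigma
z_upper = (upper_limit - mu) / sigma

too_small = widgets.cdf(lower_limit)        # lower tail: below 98 mm
too_large = 1.0 - widgets.cdf(upper_limit)  # upper tail: above 101 mm
rejected = too_small + too_large

print(f"Z scores: {z_lower:.3f}, {z_upper:.3f}")  # Z scores: -2.857, 1.429
print(f"Rejected: {rejected:.1%}")                # Rejected: 7.9%
```

Almost all of the rejects come from the upper limit, because 101 mm is much closer to the mean (in sd units) than 98 mm is.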

[Figure: histogram of LOI with fitted normal curve. Std. Dev = 27.97, Mean = 29.3, N = 69]

[Figure: histogram of LOGLOI with fitted normal curve. Std. Dev = .44, Mean = 1.26, N = 69]

Often real data don’t follow the Normal curve but are skewed – here organic content in heath soils

Try log-transforming the data. Here the same data after calculating log of the numbers – not perfect, but clearly more symmetrical
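The effect of a log transform on skewed data can be sketched with synthetic numbers. The example below does not use the heath soil data: it draws a right-skewed sample from a log-normal distribution (an assumption made purely for illustration), takes base-10 logs (also an assumption about which log was used), and compares a simple skewness measure before and after. The helper name `skewness` is illustrative.

```python
import math
import random
from statistics import mean, pstdev

def skewness(values):
    """Mean cubed Z score: near 0 for symmetrical data, positive for a long right tail."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean([((x - mu) / sigma) ** 3 for x in values])

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic right-skewed data (NOT the soil measurements from the slides)
raw = [random.lognormvariate(3.0, 1.0) for _ in range(1000)]
logged = [math.log10(x) for x in raw]

print("skewness before log:", round(skewness(raw), 2))
print("skewness after log: ", round(skewness(logged), 2))
```

The skewness should drop from clearly positive to close to zero, which is the numerical counterpart of the histogram becoming more symmetrical.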

How to decide about normality?

Inspect histogram + fitted normal curve.

Inspect a cumulative “P-P curve” with predicted normal distribution.

Run the Kolmogorov-Smirnov test.

[Figure: Normal P-P Plot of LOI. x-axis: Observed Cum Prob; y-axis: Expected Cum Prob]

[Figure: Normal P-P Plot of LOGLOI. x-axis: Observed Cum Prob; y-axis: Expected Cum Prob]

One-Sample Kolmogorov-Smirnov Test

                                          LOI        LOGLOI
    N                                     69         69
    Normal Parameters (a,b)
        Mean                              29.2806    1.2603
        Std. Deviation                    27.9695    .4409
    Most Extreme Differences
        Absolute                          .217       .086
        Positive                          .217       .080
        Negative                          -.183      -.086
    Kolmogorov-Smirnov Z                  1.804      .716
    Asymp. Sig. (2-tailed)                .003       .685

a. Test distribution is Normal.

b. Calculated from data.

The Kolmogorov-Smirnov test examines whether data can be assumed to come from a chosen distribution – here the normal.

LOI is almost certainly NOT normally distributed (p = 0.003)

LogLOI may or may not be normal, but the test tells us that deviations from normality as large as these would occur about 7 times in 10 in randomly chosen normal data (p = 0.685)
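To show roughly where the numbers in the test output come from, here is a Python sketch of a one-sample Kolmogorov-Smirnov calculation. The function names `ks_statistic` and `ks_asymptotic_p` are hypothetical, the mean and sd are estimated from the data (as in footnote b of the table), and the p-value uses the standard asymptotic Kolmogorov series. Whether the statistics package that produced the table computes it in exactly this way is an assumption, although the series does come close to the tabulated significance values when given the tabulated Z.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_statistic(data):
    """Largest gap D between the empirical and the fitted Normal cumulative curves."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))  # parameters estimated from the data
    d = 0.0
    for i, x in enumerate(sorted(data)):
        cdf = fitted.cdf(x)
        # The empirical curve steps from i/n to (i+1)/n at x; check both sides of the step
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def ks_asymptotic_p(z, terms=100):
    """Two-tailed asymptotic p-value for Z = sqrt(N) * D."""
    if z < 0.2:
        return 1.0  # the p-value is essentially 1 for such a small Z
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))

# Z is sqrt(N) times the absolute difference, e.g. sqrt(69) * 0.217 for LOI
print(round(math.sqrt(69) * 0.217, 2))   # 1.8
print(round(ks_asymptotic_p(1.804), 3))  # close to the table's .003
print(round(ks_asymptotic_p(0.716), 3))  # close to the table's .685
```

Estimating the mean and sd from the same data tends to make this test conservative, so a borderline result is better backed up by the histogram and P-P plot than taken on its own.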
