Preliminaries: Math105 lecture notes (maths.lancs.ac.uk parkj1/math105/m105lecnotes.pdf)


Preliminaries

These notes are for the Math105 course in statistics. The course introduces the statistical methods which are required for tackling a range of applied problems. The focus is on strategies for data modelling rather than mathematical theory. However, there is some theory, and we aim to introduce basic concepts that will be taught fully in later statistics courses.

The computer labs will introduce you to the R language. This is a versatile language and environment for statistical computing and graphics. You can obtain a copy of the R software (for free!) from the following web site. Instructions and help on how to install and use the software are given on the FAQ pages of this site.

http://www.stats.bris.ac.uk/R/

The course data sets are on the LUVLE course web.

At the end of this course, you should be able to:

• use graphical tools such as histograms, scatterplots and the empirical distribution function;

• calculate and understand numerical summary statistics such as mean, median, variance, quantiles and the correlation coefficient;

• discuss a range of modelling assumptions that can play a part in statistical analysis;

• give examples of processes which could be modelled by standard parametric models and describe the role played by the parameters of these models;

• describe what is meant by the sampling distribution of a statistic;

• understand the method of moments estimation;

• use the method of moments to obtain estimates for one and two parameter models;

• understand properties of the sample mean;

• obtain a confidence interval for a one parameter model and interpret it.

Lectures are held at 10am Tuesday and 11am Wednesday in Bowland Lecture Theatre, 9am Thursday in Biology LT and 12 noon Friday in Bowland Lecture Theatre.

Lab based practical sessions are based on R and you are expected to complete each week at your own pace. Help sessions will be available.

Workshops will be held in Management School Lecture Theatre 7. Lists of groups are posted outside the Maths and Stats Department Office in Fylde College. Labs and workshops start in the first week. You need to bring your lecture notes to the labs and workshops!

Notes are printed, but attendance at lectures is essential as additional material, including missing sections of notes, will be given.

Data examples are used throughout the course to illustrate the techniques that the course aims to teach you. The examples are not part of the syllabus. You will be assessed on your understanding of the methods and not the detail of these examples.

Assessment is based on 20% coursework (10% online quiz and 10% written work), 30% end of module test and 50% exam. The test will be in place of the final lecture, at 12 noon on Friday of Week 5.

Coursework questions should be handed in to your tutor’s pigeonhole by 2pm on the Wednesday following their being set.


Office hours: B73 PSC, Tuesdays and Fridays at 11.00am.

Your participation in the course, by taking part in experiments, contributing in lectures and workshops and responding to the questionnaire, is much appreciated.

Background reading. Although the lectures and these accompanying notes are self-contained, further details can be found in the following recommended texts:

Clarke, G.M. and Cooke, D. (1998). A Basic Course in Statistics. 4th ed, Arnold.

Daly, F., Hand, D., Jones, M., Lunn, A. and McConway, K. (1995). Elements of Statistics. Addison Wesley.

Lindsey, J. (1995). Introductory Statistics: A Modelling Approach. Oxford Science Publications.


Chapter 1

Introduction to Statistics

1.1 Introduction

In our everyday lives, we are surrounded by uncertainty due to random variation. We often make decisions based on incomplete information. Mostly, we can cope with this level of uncertainty, but in situations where the decision is of particular importance, it can be informative to understand this uncertainty in greater detail, to aid the decision making.

Statistics is unique in that it allows us to make formal statements quantifying uncertainty, and this provides a framework for decision making when faced with uncertainty.

Uncertainty

Sterling’s slide has continued, with the pound falling close to $1.37... The pound also weakened against the euro, with the single currency now worth 94 pence.

If I am planning to make a trip abroad in the summer, is it better to change the currency now than later?

Is there evidence of global warming or is it simply random fluctuation?

How would the answer affect your way of living?

Decision making

We follow many different routes, rational or irrational, to find an answer and to cope with such situations. Often it is useful to obtain some evidence in order to decide what the answer should be. What sort of evidence would be useful in answering such questions?

For the UK economy, we may look at exchange rates over the past few months to figure out a trend, if any. We may want to include other factors that may explain the trend, or study similar periods in the past. To determine such factors or variables we may want to speak to economists.

For global warming, we may want to study the pattern in temperature over the past years in England, Europe or around the world. There may be other variables of interest, for example an increasing number of floods or storms. Discussion with climatologists or hydrologists would be helpful in deciding which variables should be considered.


What is Data?

For all occasions, we need to collect some form of data to investigate further. Data refers to information that is collected from experiments, surveys or observational studies. For example, 4, 3.5, 3.2 is not data but only a sequence of numbers. However, if we know these numbers are measurements of newborn babies’ weights, then these numbers become data. But does that mean that if we observe three newborn babies again, their weights will be among those numbers?

Probability and Statistics

In Probability, we consider an experiment before it is performed. Numbers to be observed or calculated from observations are at that stage random variables. We deduce the probability of various outcomes of the experiment in terms of certain basic parameters.

In Statistics, we have to infer things about the values of the parameters from the observed outcomes of an experiment already performed. We can decide whether or not operations on statistics are sensible only by considering probabilities associated with the observable random variables.

Is Friday 13th bad for your health?

Consider for a moment the following claim:

I’ve heard that Friday 13th is unlucky. Am I more likely to be involved in a car accident if I go out on Friday 13th than any other day?

What kind of evidence would be helpful? Suppose that data is available on emergency admissions to hospitals in the Southwest Thames region due to transport accidents, on six Friday 13ths, and corresponding emergency admissions due to transport accidents for the Friday 6th immediately before each Friday 13th.

Number              1    2    3    4    5    6
Accidents on 6th    9    6   11   11    3    5
Accidents on 13th  13   12   14   10    4   12

Does the data support the claim? How might we use this data in order to obtain some evidence that will help us answer the question?

We may consider comparing the number of accidents by working out the average (or mean) number of accidents happening per day on both days:

Average number of accidents = Total number of accidents / Total number of days

so that

Average number of accidents on 6th = (9 + 6 + 11 + 11 + 3 + 5) / 6 = 7.5

and

Average number of accidents on 13th = (13 + 12 + 14 + 10 + 4 + 12) / 6 ≈ 10.83

The average (or mean) is an example of a summary statistic, the topic in Chapter 2.
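The calculation above can be checked with a few lines of code. This sketch is written in Python rather than the R used in the course labs, purely for illustration:

```python
# Emergency admissions due to transport accidents (from the table above)
accidents_6th = [9, 6, 11, 11, 3, 5]
accidents_13th = [13, 12, 14, 10, 4, 12]

def mean(xs):
    """Average = total number of accidents / total number of days."""
    return sum(xs) / len(xs)

print(mean(accidents_6th))   # 7.5
print(mean(accidents_13th))  # about 10.83
```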

There are more accidents on Friday 13th than on Friday 6th. Therefore I am more likely to be involved in a car accident if I go out on Friday 13th.


Do you agree with what is being said?

Example 1.1.1. Referring to the Friday 13th example,

• why have we chosen to compare instead of focusing on accidents only on 13th Fridays?

• why have we chosen Friday 6th as the comparison day? Why not Thursday 12th, or any other day for that matter?

What is this course about?

• illustrate scientific and mathematical contexts where statistical issues arise

• demonstrate where statistics can be useful, by showing the sort of questions it can answer, and the situations in which it is used

• understand sampling variation and quantify uncertainty

Specifically, we

• extend probability models to continuous random variables

• introduce various exploratory tools and summary statistics for data analysis

• introduce specific techniques in statistical modelling and inference

• apply to real data examples

1.2 Why Statistics?

1.2.1 Separating sampling variation and a true difference

Consider a simple example of tossing a coin. If you toss a coin 10 times, how many heads would you expect to see? Fill in the box with your outcomes.

H, H, H, T, T, H, H, H, H, T

• Are you surprised that you didn’t have exactly 5, half the number of trials? Why or why not? Has the result changed your opinion about the coin?

• Are you surprised that your neighbours didn’t have exactly the same number of heads as you did? Why or why not?

• Do the same experiment another two times, on two further coins, and record the number of heads. Did you get the same number of heads each time?

• What would happen if you tossed the coin 20 times?

What you have witnessed is called sampling variation.
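Sampling variation is easy to see by simulation. The following sketch (Python for illustration; the course labs use R) repeats the 10-toss experiment several times and records the head counts, which vary from run to run:

```python
import random

random.seed(1)  # fixed seed so the demonstration is reproducible

def count_heads(n_tosses):
    """Toss a fair coin n_tosses times; return the number of heads."""
    return sum(random.random() < 0.5 for _ in range(n_tosses))

# Repeat the 10-toss experiment five times: the head counts vary,
# even though the coin is fair every time.
results = [count_heads(10) for _ in range(5)]
print(results)
```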


Sampling variation or true difference?

Example 1.2.1. Think back to the Friday 13th example. Do you have more chance of being in a car accident on Friday 13th, or is the difference in the average number of accidents down to sampling variation?

Example 1.2.2. Suppose we collected the data for the Friday 13th example on some new dates. How sure are you that we would again see the average number of accidents on Friday 13th greater than the average number of accidents on Friday 6th?

In Chapters 4 and 5, we introduce a statistical framework to evaluate how much evidence there is for a true difference.

1.2.2 Learning about a population from a sample

In the Friday 13th example, our interest is not limited to the available dates. Ideally we would like to consider all possible numbers of accidents occurring on Friday 13th. We call the group of things or people that a study is targeting the population.

Population and sample

• Population: a class of all individuals of interest

• Sample: a subset of the population

For any sound analysis, we need to

• define exactly what population is being targeted;

• choose a sample that gives good representation.

Statistical inference: learning about the population through the behaviour of a sample.

1.3 Where is Statistics used?

Statistics is used in a surprisingly diverse range of areas. Here is a small selection of the fields to which statistics contributes.

Where is Statistics used?

Environmental monitoring: for the setting of regulatory standards and in deciding whether these are being met;

Engineering: to gauge the quality of products used in manufacturing and building;

Agriculture: to understand field trials of new varieties and choose the crops that will grow best in particular conditions;

Economics: to describe unemployment and inflation, which are used by the government and by business to decide economic policies and form financial strategies;

Finance: risk management, and prediction of the future behaviour of the markets;


Pharmaceutical industry: to judge the clinical effectiveness and safety of new drugs before they can be licensed;

Insurance: in setting premium sizes, to reflect the underlying risk of the events that are being insured against;

Medicine: to assess the reliability of clinical trials reported in journals, and choose the most effective treatment for patients;

Ecology: to monitor population sizes and to model interactions between different species;

Business: market research is used to plan sales strategies.

1.3.1 The Sally Clark Case

Statistics has played a key role in many topical news issues, including the controversial court case of Sally Clark. The Sally Clark case is an infamous example of the misuse (or misunderstanding) of statistics contributing to a miscarriage of justice. The Royal Statistical Society were so concerned that they wrote a press release, highlighting the statistical mistakes made.

Sally Clark Case

Sally Clark was a mother convicted of murder when two of her babies died of ‘Cot Death’, the name given to the unexplained death of a young infant. The paediatrician Sir Roy Meadow, acting as an expert witness for the prosecution in the case, famously claimed that the odds of two unexplained deaths in the same family were 1 in 73 million. Where does this figure come from?

First problem

The odds of a single unexplained death in an affluent, non-smoking family are estimated as 1 in 8500. The figure of 73 million comes from multiplying these odds by themselves: 8500 × 8500 ≈ 73 million.

Independence

The first problem is that it is only appropriate to multiply these odds together if the second death is independent of the first.

Is this really a reasonable assumption?
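To see why independence matters, here is an illustrative calculation (Python for convenience; the 1 in 100 conditional figure below is purely hypothetical and is not from the case):

```python
# Multiplying odds together assumes the two events are independent.
p_first = 1 / 8500            # quoted chance of one unexplained death

# If the two deaths were independent:
p_both_indep = p_first * p_first
print(round(1 / p_both_indep))  # 72250000, i.e. roughly 1 in 72 million

# If instead a shared (e.g. genetic or environmental) factor made a
# second death far more likely given a first -- say 1 in 100, a purely
# hypothetical figure -- the combined chance would be very different:
p_second_given_first = 1 / 100
p_both_dep = p_first * p_second_given_first
print(round(1 / p_both_dep))    # 850000, i.e. 1 in 850,000
```

The arithmetic shows how sensitive the headline figure is to the independence assumption.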

Second problem

The second problem is known as the ‘prosecutor’s fallacy’, which goes as follows:

Chance and its realisation

The chance of two unexplained deaths in the same family occurring by chance is 1 in 73 million. Therefore, the chance of Sally Clark being innocent is also 1 in 73 million.

What is wrong with this argument? Can you spot the error in reasoning here?

The following analogy will help. Suppose you decide to play the British National Lottery. The idea behind the lottery is that 49 balls are placed in a machine, and 6 of them are drawn. Before the draw takes place, you pay 1


pound to place a guess on which six balls will be drawn. There is a prize of millions of pounds available if your guess turns out to be correct, but the chance of getting it right is 1 in 14 million. You decide to play, and, amazingly, all six of your numbers come up! You travel to the headquarters of the national lottery to claim your winnings, and instead you are arrested – accused of cheating! A few months later, you are in court, and the prosecuting lawyer makes the following argument:

The chance of getting all six balls correct by chance is 1 in 14 million. Therefore, the chance of the defendant being innocent is also 1 in 14 million.

How would you defend yourself against this argument?
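One way to see the flaw: the 1 in 14 million figure is the chance that a given honest ticket wins, not the chance that a winner is honest. A quick sketch (Python for illustration; the number of tickets sold is a hypothetical figure):

```python
import math

# Number of ways to choose 6 balls from 49: the chance of one ticket
# winning is 1 in this number.
n_combinations = math.comb(49, 6)
print(n_combinations)  # 13983816, i.e. about 1 in 14 million

# The prosecutor's fallacy confuses two different probabilities:
#   P(all six correct | you did not cheat)  -- tiny, 1 in 14 million
#   P(you did not cheat | all six correct)  -- what the court cares about
# With millions of tickets sold, an honest winner is to be expected:
tickets_sold = 30_000_000  # hypothetical number of tickets in one draw
expected_honest_winners = tickets_sold / n_combinations
print(round(expected_honest_winners, 1))  # about 2.1
```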

1.4 Types of Data

Variables in the data

We have already introduced data. In experiments or surveys, there may be specific attributes that we are interested in measuring for the subjects. These are called variables. For example, in the Friday 13th data, the variable we measure is Number of accidents. Because these variables are random, they are called random variables.

Types of Data

Most random variables fall into the following two categories, depending partly on the nature of the characteristic of interest, and partly on how it is measured:

Discrete. Variables taking values on countable sets, e.g. gender, eye colour, college membership, exam grades (A, B, C, D, E), number of goals in a match, number of children in a family.

Continuous. Variables taking values on some interval of the real line, e.g. height, weight, direction.

Properties of discrete random variables are studied in MATH 104. Continuous random variables will be the subject of Chapter 2.

1.5 Collecting Data

We have seen that some data are useful in carrying out our investigation. But how do we choose data? What are the important considerations?

Collecting data

Example 1.5.1. Is there any limit to the amount of evidence that can be obtained from some given data? Think back to the data on Friday 13th – could we use it to decide whether car accidents were especially common on Fridays?

So if the evidence available is limited by the data we have, it makes sense that we should think very carefully about how we collect the data. If you are not collecting the data yourself, it is always important to understand how the data were collected, so that you are aware of any limitations that this may place on your analysis. To illustrate the idea, we begin with an extreme example.


Scenario I

Suppose you are interested in estimating how many hours students spend studying every week. So you write a survey and set out to find participants. Thinking to yourself where a good place would be to find students to fill in your survey, you have a brilliant idea: the library! You sit outside the library and stop students as they leave the library to fill in your survey. After some time you have enough results, so you go home to do your analysis. You find that students spend, on average, 30 hours a week studying.

Collecting data

Example 1.5.2. • What is wrong with the way in which the study has been carried out?

• If you had stopped students outside the University bar instead of the library, do you think you would have got similar results?

• Can you think of a better way to collect data for your survey?

Although it sounds silly, this example highlights the importance of choosing your sample well. The key message is that the sample should be representative of the population. That is why it is important to define the population first; otherwise, how can our sample be representative of it?

Representative sample

Referring to Scenario I, there are many different populations that could be of interest here – students in a particular department, students in a particular faculty, all students at the University, all students in the UK, all students in the world...

Example 1.5.3. Could you use the same sample for each of these populations? Can you think of a population for which the sampling method of stopping people outside the library would be appropriate?

A representative sample reflects the characteristics or nature of the population. If your sample is not representative, we introduce a systematic error called bias.

How large should a sample be?

Another consideration is how large our sample should be. We usually use n to denote the number of subjects in our sample. There are practical as well as statistical considerations in choosing the size of the sample. On the practical side, financial constraints may mean that you cannot have a sample larger than n = 1000. Some statistical considerations will be discussed in Chapter 4.

Random sample

The widely accepted method to avoid bias, and therefore obtain a representative sample of the population, is to take a random sample. A random sample of size n from a population is one in which each possible sample of that size has the same chance of being selected.

One method to ensure random sampling is to write the name of every member of the population on a slip of paper, place these slips into a hat, then draw out the required number for the sample. A more practical method uses computers, called random number generators. For example, for a pre-election poll, we may need n = 1000 random numbers between 1 and 40


million, for a sample size of n = 1000 out of the 40 million eligible voters in the UK. If we have all the voters written in a list, we can pick out the desired subjects for our sample.
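Random number generation of this kind is available in most statistical software. A minimal sketch (Python for illustration; the course labs use R) drawing a random sample from a toy population:

```python
import random

# A toy "population": every member gets an identifying number.
population = list(range(1, 101))  # members numbered 1 to 100

random.seed(42)  # reproducible draw

# A random sample of size n: each subset of size n of the population
# is equally likely to be chosen.
n = 10
sample = random.sample(population, n)
print(sample)
```

In practice the population list would be the electoral register or a student record list, and n would be far larger.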

Other kinds of sampling

It is not always feasible to carry out sampling in a truly random fashion. It can be very expensive to involve 1000 completely random people in a pre-election poll, as some of them may be difficult to reach, and it may take a long time for all the surveys to return. We may have to resort to a sampling method that is not random for practical reasons. Provided we are careful, we can minimize the bias that is caused.

Example 1.5.4. Suppose we go out on the streets in a city centre, and simply stop people in the street and ask who they are going to vote for in the next election. This is sometimes known as convenience sampling. What kinds of bias may be introduced? What steps could be taken to minimise this bias?

We conclude with an important question.

Does increasing the size of a sample decrease the bias?

Example 1.5.5. For the student study example, one survey collects 1000 responses with convenience sampling, where the interviewer stands outside the library and stops students on the way out to ask them the question. A second survey collects only 50 responses, with random sampling from the entire student population of the University. Which study should we believe more?

It is almost always better to have a small, representative sample than a large, biased sample. In the next chapters, we will assume that the sample is random and study properties of random samples. This greatly simplifies our mathematical treatment of the problem and provides insights into important statistical ideas.


Chapter 2

Continuous random variables

2.1 Review of probability: events and probability

In the Math104 Probability course, we introduced the concepts of probability and random variables. In particular, discrete random variables were studied in detail. In this chapter we review some of the basics and then introduce continuous random variables.

Review of Probability

Probability considers an experiment before it is performed. Probability is a measure of the chance that an event may occur in the experiment. Tossing a coin or conducting an election survey is an example of an experiment. An event is a subset of the sample space, the set of all possible outcomes. Seeing a tail when tossing a coin, or getting a positive response from the survey, is an event. The legitimate questions are then: what is the probability of seeing a tail in the experiment of tossing a coin, or what is the probability of getting a positive response in a survey?

The Axioms of Probability

Mathematically, probability is a function P which assigns to each event A in the sample space Ω a number P(A) in [0, 1] such that

• Axiom 1: P(A) ≥ 0 for all A ⊆ Ω;

• Axiom 2: P(Ω) = 1;

• Axiom 3: P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅ for any A, B ⊆ Ω.

When an event has an associated probability of occurring, this gives rise to uncertainty. This notion of uncertainty is fundamental to statistical inference.

2.2 Random variable

A mathematical way of describing an experiment and its events is to define a random variable associated with it.


An example of a random variable

A random variable X is a function from the sample space Ω to the real numbers R.

Example 2.2.1. (a) Experiment 1: In a presidential election with two candidates M and O, the possible outcomes are Ω = {Candidate M wins, Candidate O wins}. We define a random variable X that maps from Ω to {0, 1}. Then the probability of the event Candidate O wins is equivalent to P(X = 1).

(b) Experiment 2: A national air quality monitoring system automatically collects measurements of ozone level at designated sites. The possible outcomes are Ω = {x : x ≥ 0}. If we define a random variable X to be the numerical measurement, the probability that the ozone level falls below a certain level c is given by P(X ≤ c).

Random variable

• A random variable X is a function that associates a unique number with each possible outcomeof an experiment.

• Associated with each random variable X is a probability distribution function that describesthe chance of all possible outcomes of X.

• Often in scientific investigation, X represents the variable of main interest that can be measured or observed.

• All observable events are expressed in terms of a random variable.

2.3 Probability and Cumulative distribution function

Distribution function

In order to describe all possible outcomes of an experiment, we focus on an event of the basic form

{X ≤ x}

for fixed x, where x can take any value. Then how would you express a general event {a < X ≤ b} using the basic form with set operations?

{a < X ≤ b} = {X ≤ b} ∩ {X ≤ a}ᶜ .

If we have a rule for assigning probability to an event of the basic form, then the probability of any event can be determined.

Cumulative Distribution Function

For any univariate random variable X, the cumulative distribution function, c.d.f., FX : R → [0, 1], is defined by

FX(x) = F(x) := P(X ≤ x) .

Moreover, F(−∞) = 0 and F(∞) = 1, and F is a non-decreasing function. An example is drawn in Figure 2.1. Note that for a ≤ b,

P(a < X ≤ b) = F(b) − F(a) .

Why is it so?


Figure 2.1: Example of a cumulative distribution function.

2.4 Review of discrete random variable

Discrete random variable

Discrete random variables are probability models describing the outcomes of experiments with a countable sample space, often taking values only on the integers or the non-negative integers. Examples include

college membership, exam grades (A, B, C, D, E), number of goals in a match, number of children in a family.

The probability distribution function FX(x) = P(X ≤ x) can be obtained by

P(X ≤ x) = P(X ≤ int(x)) = Σ_{k = −∞}^{int(x)} P(X = k) ,

where int(x) denotes the largest integer smaller than or equal to x, e.g. int(5.2) = 5, int(3) = 3, int(−2.1) = −3.
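A note for the labs: in most programming languages it is the floor function, not integer truncation, that implements int(x) as defined here. A quick check (Python for illustration):

```python
import math

# int(x) above means the largest integer <= x, i.e. the floor.
# Beware: Python's built-in int() truncates toward zero instead.
print(math.floor(5.2))   # 5
print(math.floor(3))     # 3
print(math.floor(-2.1))  # -3  (whereas int(-2.1) gives -2)
```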

Probability mass function

The probability distribution of a discrete random variable X is characterised by the probability mass function, p.m.f., p(x), where

p(x) = P(X = x) .

The probability mass function p(x) satisfies

• 0 ≤ p(x) ≤ 1 for all x;

• Σ_{x = −∞}^{∞} p(x) = 1.

For any event A, P(X ∈ A) = Σ_{x ∈ A} p(x). For example,

P(a < X ≤ b) = P(X = a + 1) + P(X = a + 2) + · · · + P(X = b)
             = p(a + 1) + p(a + 2) + · · · + p(b) .
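As a concrete case, consider a fair six-sided die, so p(x) = 1/6 for x = 1, ..., 6. A short sketch (Python for illustration) applies the sum rule above:

```python
from fractions import Fraction

# p.m.f. of a fair six-sided die: p(x) = 1/6 for x = 1, ..., 6.
p = {x: Fraction(1, 6) for x in range(1, 7)}

# The p.m.f. sums to 1 over all possible outcomes.
assert sum(p.values()) == 1

# P(2 < X <= 4) = p(3) + p(4) = 1/3, by the sum rule above.
prob = p[3] + p[4]
print(prob)  # 1/3
```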


See Figure 2.2.

Figure 2.2: Example of a probability mass function. The shaded area represents P(a < X ≤ b).

Example of discrete random variable

Example 2.4.1. For a random variable X that takes values 0, 1 with probabilities θ, 1 − θ, obtain P(X ≤ x) for all x and plot the cumulative distribution function.

P(X ≤ x) =

  0  if x < 0
  θ  if 0 ≤ x < 1
  1  if x ≥ 1
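This step function can be written directly in code. A sketch (Python for illustration, with an arbitrary illustrative value of θ):

```python
# c.d.f. of Example 2.4.1: X takes the value 0 with probability theta
# and the value 1 with probability 1 - theta.
def cdf(x, theta):
    if x < 0:
        return 0.0
    elif x < 1:
        return theta   # P(X <= x) = P(X = 0) for 0 <= x < 1
    else:
        return 1.0

theta = 0.3  # illustrative value of the parameter
print(cdf(-0.5, theta))  # 0.0
print(cdf(0.7, theta))   # 0.3
print(cdf(2.0, theta))   # 1.0
```

Plotting cdf over a grid of x values reproduces the staircase shape with jumps at 0 and 1.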

2.5 Introduction to continuous random variable

Continuous random variable

When the outcome of an experiment is a measurement on a continuous scale, such as the ozone level measurements in the earlier example, the random variable is called a continuous random variable. Examples include

height, weight, direction, waiting times in the hospital, price of stock

Again, the cumulative distribution function is defined by

F (x) = FX(x) = P(X ≤ x) .

However, if X is a continuous random variable,

P(X = x) = 0 for all x,

and hence

P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b) .


Therefore, unlike the discrete case, the probability distribution function cannot be reduced to a sum over single events. To describe the probability of an event for a continuous random variable, we need new mathematical tools!

Probability density function
The probability density function, p.d.f., f(x) of a continuous random variable X is defined by

f(x) = d/dx FX(x) ,

so that it satisfies

FX(x) = ∫_{−∞}^{x} f(u) du .

The probability density function f(x) satisfies

• f(x) ≥ 0 for all x;

• ∫_{−∞}^{∞} f(x) dx = 1.

For any event A, P(X ∈ A) = ∫_{x∈A} f(x) dx.

Interpretation of probability density function
By the fundamental theorem of calculus,

P(a < X ≤ b) = FX(b) − FX(a)
             = ∫_{−∞}^{b} f(x) dx − ∫_{−∞}^{a} f(x) dx
             = ∫_{a}^{b} f(x) dx .

See Figure 2.3. Note that the density function f(x) itself does NOT represent the probability of any event, but an instantaneous rate of change in probability. If the event lies in an interval, the probability of the event is equal to the area under the curve of f(x) over that interval.

Example 2.5.1. For a random variable X with cumulative distribution function

FX(x) =
  0  if x < 0 ,
  x  if 0 ≤ x ≤ 1 ,
  1  if x > 1 ,

(a) Find P(0.3 < X ≤ 0.5).

(b) Find the p.d.f. of X.

(a) P(0.3 < X ≤ 0.5) = F(0.5) − F(0.3) = 0.5 − 0.3 = 0.2

(b) f(x) =
  1  if 0 ≤ x ≤ 1 ,
  0  if x < 0 or x > 1 .



Figure 2.3: Example of probability density function. P (a < X ≤ b) is the area under the curve between x = a and x = b.

2.6 Expected values

Expectation

If X is a discrete random variable with probability mass function p(x) on {0, 1, · · · }, then the expected value of X is

µ_X = E[X] = ∑_{x=0}^{∞} x p(x) .

If X is a continuous random variable with probability density function f(x) on (−∞, ∞), then the expected value of X is

µ_X = E[X] = ∫_{−∞}^{∞} x f(x) dx .

We can think of this as an average of the different values that X may take, weighted according to their chance of occurrence.

Expectations of functions of random variables
Suppose Y = g(X), where g is a fixed function.


If X is a discrete random variable with probability mass function p(x) on {0, 1, · · · }, then the expected value of Y is

µ_Y = E[Y ] = ∑_{x=0}^{∞} g(x) p(x) .

If X is a continuous random variable with probability density function f(x) on (−∞, ∞), then the expected value of Y is

µ_Y = E[Y ] = ∫_{−∞}^{∞} g(x) f(x) dx .

Example 2.6.1. (2.5.1 cont.) Let f(x) = exp(−x) for all x ≥ 0. Find (i) E[X] (= µ_X), (ii) E[X^2] and (iii) E[(X − µ_X)^2].

(i) µ_X = E[X] = ∫_0^∞ x exp(−x) dx = [−x exp(−x)]_0^∞ + ∫_0^∞ exp(−x) dx
           = 0 + [− exp(−x)]_0^∞ = 0 − (−1) = 1

(ii) E[X^2] = ∫_0^∞ x^2 exp(−x) dx = [−x^2 exp(−x)]_0^∞ + ∫_0^∞ 2x exp(−x) dx
            = 0 + 2 × 1 = 2

(iii) E[(X − µ_X)^2] = ∫_0^∞ (x − 1)^2 exp(−x) dx = ∫_0^∞ (x^2 − 2x + 1) exp(−x) dx
                     = 2 − 2 × 1 + 1 = 1
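These three integrals can be checked numerically in R with the built-in integrate function:

```r
# Numerical checks of the moments of the density f(x) = exp(-x), x >= 0
integrate(function(x) x * exp(-x), 0, Inf)$value          # E[X], approximately 1
integrate(function(x) x^2 * exp(-x), 0, Inf)$value        # E[X^2], approximately 2
integrate(function(x) (x - 1)^2 * exp(-x), 0, Inf)$value  # E[(X - 1)^2], approximately 1
```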

Properties of Expected values

Theorem 2.6.2. If X has expectation E[X] and Y is a linear function of X, Y = aX + b, then Y has expectation

E[Y ] = aE[X] + b .

More generally, the following properties hold:

E[g(X) + h(X)] = E[g(X)] + E[h(X)] (2.1)

E[cg(X)] = cE[g(X)] (2.2)

E[aX + b] = aE[X] + b (2.3)

Note that we proved them in MATH 104 for discrete random variables.


Using linear properties of expectation, we may compute E[(X − a)^2] by

E[(X − a)^2] = E[X^2 − 2aX + a^2]
             = E[X^2] − E[2aX] + E[a^2]
             = E[X^2] − 2aE[X] + a^2 .

Verify each step with (2.1)–(2.3).

2.7 Variance and standard deviation

Variance and Standard Deviation

If X is a random variable with expected value µ_X = E[X], the variance of X is

σ_X^2 = Var[X] = E[(X − µ_X)^2]
  = ∑_{x=0}^{∞} (x − µ_X)^2 p(x)   for a discrete r.v. on {0, 1, · · · },
  = ∫_{−∞}^{∞} (x − µ_X)^2 f(x) dx   for a continuous r.v. on (−∞, ∞) .

The variance of X can be calculated as

σ_X^2 = E[X^2] − µ_X^2 .

The standard deviation of X is

σ_X = √Var[X] .

We can think of variance as a measure of the spread, or dispersion, of a random variable about its expectation.

Variance and Standard Deviation: example

Example 2.7.1. (2.5.1 cont.) For f(x) = exp(−x) for all x ≥ 0, find σ_X.

From Example 2.6.1, σ_X^2 = 1 so σ_X = 1.

Properties of Variance and Standard Deviation

Theorem 2.7.2. If Var[X] exists and Y = a + bX, then Var[Y ] = b^2 Var[X]. Hence, the standard deviation of Y is σ_Y = |b| σ_X .

Why do you need to take absolute value in the above expression?
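A quick simulation sketch in R (with arbitrary illustrative choices a = 2, b = −3, not values from the notes) shows both the scaling of the variance and why the absolute value is needed:

```r
set.seed(1)           # for reproducibility
x <- rnorm(1000)      # any sample will do: the identity below is exact for sample variances
y <- 2 - 3 * x        # a + bX with a = 2, b = -3
var(y) / var(x)       # equals b^2 = 9 (up to floating-point error)
sd(y) / sd(x)         # equals |b| = 3, not b = -3: standard deviations cannot be negative
```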


[Figure 2.4 panels: probability mass functions (µ = 2.5, σ = 1.1) and densities (µ = 0.83, σ = 0.83).]

Figure 2.4: Expectations and Standard deviations for discrete and continuous random variables.

2.8 Variance for 2-dim random variable

Correlation
With multivariate data, we need to characterise dependence. The correlation between two random variables X and Y is

Corr(X, Y ) = E[(X − E[X])(Y − E[Y ])] / √(Var(X) Var(Y )) = E[(X − µ_X)(Y − µ_Y )] / (σ_X σ_Y ) .

[Figure 2.5 panels: scatterplots labelled "Correlation near 1" (tight clustering about a straight line) and "Correlation near 0" (loose clustering about a centre).]

Figure 2.5: Mean and variance are measures of location and scale; correlation measures linear association between variables.

Correlation is a measure of clustering around a straight line, taking values in [−1, 1]. It is a scale-free measure of linear dependence between two variables.


Theorem 2.8.1. If X and Y are independent, then Corr(X,Y ) = 0.

Covariance
Recall that the variance of a random variable X is given by Var(X) = E[(X − µ_X)^2]. The covariance between two random variables X and Y is defined in a similar way:

The covariance between random variables X and Y is

Cov(X, Y ) = E[(X − µ_X)(Y − µ_Y )] ,

so that Var(X) = Cov(X, X), and Cov(X, Y ) is the expected product of the deviations of each variable from its expected value.

Covariance cont.

Example 2.8.2. Show that the covariance is equivalently

Cov(X, Y ) = E[XY ] − µ_X µ_Y .

Cov(X, Y ) = E[(X − µ_X)(Y − µ_Y )] = E[XY − µ_X Y − µ_Y X + µ_X µ_Y ]
           = E[XY ] − µ_X E[Y ] − µ_Y E[X] + µ_X µ_Y
           = E[XY ] − µ_X µ_Y − µ_Y µ_X + µ_X µ_Y
           = E[XY ] − µ_X µ_Y .

We have the following interpretation:

Cov(X,Y ) = ρσXσY ,

where Corr(X,Y ) = ρ, and σX and σY are the square roots of the variances of X and Y respectively.
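A simulation sketch in R (with arbitrary simulated data) illustrates the identity Cov(X, Y) = E[XY] − µ_X µ_Y. Note that R's cov uses the divisor n − 1, while the plug-in moment version uses n, so a small adjustment is needed for an exact match:

```r
set.seed(2)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)                              # construct correlated pairs
moment_cov <- mean(x * y) - mean(x) * mean(y)  # plug-in E[XY] - E[X]E[Y]
cov(x, y) * (n - 1) / n                        # matches moment_cov after divisor adjustment
cor(x, y)                                      # equals cov(x, y) / (sd(x) * sd(y))
```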

2.9 Quantiles and Cumulative distribution function

Quantiles
Often interest is in the values of a continuous random variable which are not exceeded with a given probability, e.g. the income below which the lowest 10% of taxpayers fall, or the score above which the top 5% of students lie.

Let X be a random variable and p any value such that 0 ≤ p ≤ 1. Then the pth quantile of the distribution of X is the value x_p that satisfies:

P(X ≤ x_p) = p.

When p = 0.5, the quantile x_0.5 is called the median.
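In R the quantile function is the inverse of the c.d.f. For example (using an Exponential(2) distribution as an illustrative choice):

```r
# The p-th quantile inverts the c.d.f.: x_p = F^{-1}(p)
qexp(0.5, rate = 2)                  # median of Exponential(2), equals log(2)/2
pexp(qexp(0.5, rate = 2), rate = 2)  # applying F recovers p = 0.5
```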


pth Quantile
See Figure 2.6 for visualisation.


Figure 2.6: x_p is the pth quantile obtained from the c.d.f.

Quartiles
The quartiles of a distribution are the values at which we can cut the distribution into four equally likely slices: (x_0.25, x_0.5, x_0.75). Figure 2.7 shows the quartiles on the c.d.f. and p.d.f.


Figure 2.7: Quartiles (x0.25, x0.5, x0.75) shown on c.d.f. and p.d.f. respectively.


2.10 Standard continuous univariate distributions

2.10.1 Uniform

Uniform distribution
This distribution is used to model variables that can take any value on a fixed interval, when the probability of occurrence does not vary over the interval.

The p.d.f. of a Uniform random variable X, distributed on the interval (a, b), is given by:

f(x; θ) =
  1/(b − a)  if a < x < b;
  0          otherwise,

where θ = (a, b) and Θ is the set of (a, b) such that −∞ < a < b < ∞. This is written as X ∼ Uniform(a, b).


Figure 2.8: P.d.f. for Uniform(a, b) random variable. Shaded area represents P (a < X ≤ x0).

Figure 2.8 shows the p.d.f. of a Uniform random variable. The shaded area represents P (a < X ≤ x0).

Cumulative distribution function and quantiles

Example 2.10.1. For X ∼ Uniform(a, b),

(i) Find the c.d.f. and sketch the graph.

F(x) = ∫_a^x 1/(b − a) du = (x − a)/(b − a) for a ≤ x ≤ b, with F(x) = 0 for x < a and F(x) = 1 for x ≥ b.

(ii) Find the mean µ_X and the variance σ_X^2.


µ_X = ∫_a^b x · 1/(b − a) dx = (a + b)/2 and E[X^2] = ∫_a^b x^2 · 1/(b − a) dx = (a^2 + ab + b^2)/3 .

So σ_X^2 = E[X^2] − µ_X^2 = (b − a)^2/12 .

(iii) Find the median and compare to the mean

F(x) = (x − a)/(b − a) = 0.5, so x_0.5 = a + 0.5(b − a) = (a + b)/2, the same as the mean.

Example 2.10.2. Numerically evaluate the p.d.f., c.d.f. and the quantile function of a Uniform distribution.

dunif(0.5, -2, 2)  # p.d.f. of Uniform(-2,2) at x=0.5, f(0.5)=0.25
punif(0.3, 0, 1)   # c.d.f. of Uniform(0,1) at x=0.3, P(X<=0.3)=0.3
qunif(0.9, 0, 1)   # quantile function of Uniform(0,1): x such that P(X<=x)=0.9
x = seq(-1, 1, length=101)
fx = dunif(x, -1, 1)
plot(x, fx)   # plot the density values
lines(x, fx)  # join them with a line

2.10.2 Exponential

Exponential distribution
This distribution is often used to model variables that are the times until specific events happen, when the events occur at random at a given rate over time.

The p.d.f. of an Exponential random variable X is

f(x; θ) =
  θ exp(−θx)  for x > 0,
  0           otherwise,

where θ > 0. This is written as X ∼ Exponential(θ) and θ ∈ Θ = (0, ∞).

Usually the rate parameter θ is unknown. The value of θ strongly influences the probability of different outcomes, as shown in Figure 2.9.

Shape of Exponential density function

Example 2.10.3. How is the shape of the function related to the parameter θ? Which one has the lower tail probability P (X > 1)?

As f(0) = θ, the highest curve at 0 is for θ = 1.4 and the lowest curve at 0 is for θ = 0.6. The exponential decay of the function is quicker for larger θ, which results in a smaller tail probability.



Figure 2.9: P.d.f.s for Exponential(θ) random variables for θ = 0.6, 1, 1.4.

Cumulative distribution function and quantiles
The c.d.f. of the Exponential(θ) distribution is

F(x) =
  0             if x ≤ 0 ,
  1 − exp(−θx)  if x > 0 .

Example 2.10.4. For X ∼ Exponential(θ),

(i) Derive the cumulative distribution from the p.d.f.

(ii) Find the median.

(iii) The mean µ_X of the Exponential(θ) is 1/θ. Compare the median and the mean. Which one is larger? Why does the median differ from the mean?

(i)

F(x) = ∫_0^x f(u) du = ∫_0^x θ exp(−θu) du = 1 − exp(−θx)

(ii) Solving F(x_p) = p gives x_p = θ^{−1} log((1 − p)^{−1}). For the median, p = 0.5, so the median is x_0.5 = (1/θ) log 2.

(iii) µ_X = 1/θ > (1/θ) log 2 = x_0.5. As the distribution is not symmetric, the mean and the median are not the same. The median is smaller because probability is concentrated on smaller values, while a thin tail of probability stretches far to the right and pulls the mean up.

Example 2.10.5. Numerically evaluate the p.d.f., c.d.f. and the quantile function of an Exponential distribution.

dexp(3, rate=2)    # p.d.f. of Exponential(2) at x=3, f(3)=0.004957504
pexp(3, rate=5)    # c.d.f. of Exponential(5) at x=3, P(X<=3)=0.9999997
qexp(0.5, rate=2)  # quantile function: the median of Exponential(2), log(2)/2=0.3465736
x = seq(0, 4, length=100)
fx = dexp(x, rate=2)
plot(x, fx)
lines(x, fx)

Example 2.10.6. Suppose that a goal is scored at random in a fixed time of the cup final and the time until the event can be modelled by an Exponential distribution with rate parameter θ = 2/3 per hour. If the first goal has been scored just now, what is the probability that the waiting time until the next goal is

(i) more than 30 minutes

(ii) between 30 and 50 minutes

Let X be the random variable of the waiting time in hours. Then X ∼ Exponential(2/3) and F(x) = 1 − exp(−(2/3)x).

(i) P(X > 1/2) = 1 − F(1/2) = exp(−(2/3)(1/2)) = 0.7165

(ii) P(1/2 < X < 5/6) = F(5/6) − F(1/2) = exp(−(2/3)(1/2)) − exp(−(2/3)(5/6)) = 0.1428
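Both answers can be checked in R with the Exponential c.d.f. pexp:

```r
# Waiting time X ~ Exponential(rate = 2/3), times in hours
1 - pexp(1/2, rate = 2/3)                      # P(X > 1/2), approximately 0.7165
pexp(5/6, rate = 2/3) - pexp(1/2, rate = 2/3)  # P(1/2 < X < 5/6), approximately 0.1428
```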

2.10.3 Normal

Normal distribution: background

quoted from gqview weblib/Gauss.html

The normal distribution was introduced by the French mathematician Abraham De Moivre in 1733. De Moivre used this distribution to approximate probabilities of winning in various games of chance involving coin tossing. It was later used by the German mathematician Karl Gauss to predict the location of astronomical bodies and became known as the Gaussian distribution. In the late nineteenth century statisticians started to believe that most data sets would have histograms with the Gaussian bell-shaped form, and that all normal data sets would follow this form, and so the curve came to be known as the normal curve.

Normal distribution

The p.d.f. of a Normal random variable X is

f(x; θ) = (1/(√(2π) σ)) exp( −(1/2) ((x − µ)/σ)^2 ) ,

where θ = (µ, σ), −∞ < x < ∞, −∞ < µ < ∞ and 0 < σ. This is written as X ∼ N(µ, σ^2) and θ ∈ Θ = (−∞, ∞) × (0, ∞).

E[X] = µ ,  Var(X) = σ^2


The Normal distribution plays an important role in a result that is key to statistics, known as the central limit theorem. This theorem, discussed in Math230 and Math314, gives a theoretical basis to the empirical observation that many random phenomena seem to follow a Normal distribution. Usually, the mean parameter µ and the scale parameter σ are unknown, although sometimes it is assumed that σ is known, as this simplifies things considerably. These parameters are crucial in determining probabilities; see Figure 2.10.

Shape of Normal density function

Example 2.10.7. How is the shape of the function related to the parameter σ? Which one has the higher probability P(|X| > 3)?

The larger σ, the greater the spread. So σ = 1.5 has the largest probability P(|X| > 3) and σ = 0.5 has the smallest.


Figure 2.10: P.d.f.s for Normal(µ, σ2) random variables where µ = 0 and σ = 0.5, 1, 1.5.

Cumulative distribution function and quantiles
The normal c.d.f. is

F(x) = ∫_{−∞}^x f(u) du = ∫_{−∞}^x (1/√(2πσ^2)) exp( −(u − µ)^2 / (2σ^2) ) du .

This does not have a closed-form expression, so numerical evaluation is required if we want to obtain probabilities of the form P(X ≤ x) or quantiles.

Example 2.10.8.

pnorm(0, mean=2, sd=sqrt(5))  # P(X<0) when X ~ N(2,5), 0.1855467
1-pnorm(-2, 0, 2)             # P(X>-2) when X ~ N(0,4), 0.8413447
qnorm(0.975, 0, 1)            # u such that P(X<u)=0.975, 1.959964, when X ~ N(0,1)

Note that the R functions for the Normal distribution use the standard deviation σ, not the variance σ^2.


Example 2.10.9. A normal distribution is proposed to model the variation in height of women, with parameters µ = 160 and σ^2 = 25, measured in cm. Find the proportion of tall women, defined as over 175 cm tall.

Let X be the random variable of women's height; then X ∼ Normal(160, 25). So

P(X > 175) = ∫_175^∞ (1/√(2π · 25)) exp( −(x − 160)^2 / (2 · 25) ) dx .

In the above example we have expressed the proportion in terms of an integral; it can also be thought of in terms of the number of standard deviations from the mean. The integral is impossible to calculate analytically, so numerical evaluation is required to obtain probabilities or quantiles.
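In R this probability is immediate (remembering that pnorm takes the standard deviation, here σ = 5):

```r
# P(X > 175) for X ~ N(160, 25), i.e. mean 160 and sd 5
1 - pnorm(175, mean = 160, sd = 5)  # approximately 0.00135
```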

Standardization of the random variable
It is useful to express such probabilities in terms of a standardized random variable, with µ = 0 and σ = 1.

If X ∼ N(µ, σ^2) then

Z = (X − µ)/σ ∼ N(0, 1) ,

and conversely, if Z ∼ N(0, 1), then

X = µ + σZ ∼ N(µ, σ^2) .

The formal proof will be given in Math230; here it is sufficient to note that

E[Z] = 0 ,  Var[Z] = 1 .

Standard normal distribution
A random variable Z is said to have a standard normal distribution, with mean 0 and standard deviation 1, if its p.d.f. is given by

f(z) = (1/√(2π)) exp(−z^2/2) ,

where −∞ < z < ∞; this is denoted by Z ∼ N(0, 1). The cumulative distribution function, i.e. the area under the curve, of the standard normal variable Z is given by

Φ(z) = P(Z ≤ z) = ∫_{−∞}^z (1/√(2π)) exp(−x^2/2) dx .

Values of Φ(z) are obtained from a table of standard normal probabilities or from computer software such as R: pnorm(z).

z -3.00 -2.33 -1.67 -1.00 -0.33 0.33 1.00 1.67 2.33 3.00

Φ(z) 0.0013 0.0098 0.0478 0.1587 0.3694 0.6306 0.8413 0.9522 0.9902 0.9987


Standard normal distribution: example

Example 2.10.10. We repeat the previous example to illustrate the standardization procedure:

P(H > 175) = P( (H − 160)/5 > (175 − 160)/5 )
           = P(Z > 3) = 1 − P(Z ≤ 3)
           = 1 − Φ(3)
           = 1 − 0.9987 = 0.0013

Probabilities of Normal distribution
Figure 2.11 illustrates special coverage properties of a Normal distribution:

P(µ − σ < X < µ + σ) = 0.683

P(µ − 2σ < X < µ + 2σ) = 0.954

P(µ − 3σ < X < µ + 3σ) = 0.997

Figure 2.11: Illustration of coverage probability of Normal(µ, σ2) distribution.
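These coverage probabilities are easy to verify in R for the standard normal; by standardization they then hold for any µ and σ:

```r
# P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3, via the standard normal
pnorm(1) - pnorm(-1)  # approximately 0.683
pnorm(2) - pnorm(-2)  # approximately 0.954
pnorm(3) - pnorm(-3)  # approximately 0.997
```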


Chapter 3

Exploratory data analysis

We introduced some examples of discrete and continuous random variables and studied their properties. If we knew the exact analytical form of the underlying distribution of interest (i.e. the population), there would be no need to collect data, and thus no need for statistical analysis. In reality this is rarely the case, especially at the beginning of an investigation, and even if there is a conjectured model for the data, we always need to check that it is consistent with the data.

Data and Random variability
Data is information we have collected, but it also bears uncertainty, due to random variability in the characteristics of interest from one individual to another. In mathematical terms, the characteristics being measured are represented by random variables, e.g. X = age of an individual chosen at random. Individual in this context refers to a unit of study: actual people, towns, cars, test tubes...

Random variables and realisations
IMPORTANT DIFFERENCE BETWEEN:

Random variable ←→ realisation (or observation)

• A random variable is always written in UPPER CASE and is associated with a probability distribution, e.g. X = ozone level.

• Observations of random variables are written in lower case and are just numbers, e.g. x = the observed value.

Data set of size n:

X1, · · · ,Xn random variables

x1, · · · , xn particular realizations

Data analysis
The first stage in any analysis is to get to know the problem and the data. This usually involves a variety of graphical procedures to try to visualise the data, as well as the calculation of a few simple summary numbers, or summary statistics, that capture key features of the data and hopefully reveal key features of the unknown underlying distribution. The random variability in the data is a reflection of the underlying distribution; however, bear in mind that, because of the finite sample size, the data give only an approximation to the true underlying distribution and its features. This implies that we also need to care how good the approximation is.

Role of exploratory analysis

• Finding errors and anomalies: missing data, outliers, changes of scale...

• Suggesting the route of subsequent analysis: plots of data give information on the location, scale and shape of the distribution and on relationships between variables.

• Augmenting understanding of the applied problem: exploratory graphical tools sharpen the scientific questions being addressed.

A detailed and thorough exploratory analysis will increase the general understanding of the problem being tackled and, through the fulfilment of each of these three roles, save time and increase the focus of subsequent analysis.

3.1 Example problems and associated data sets

The ideas that we will present during the course will be demonstrated on a variety of real-life problems, chosen from a range of applied backgrounds. Each of these problems has an associated data set which we will explore over the duration of the course, to show the whole process involved in detailed statistical analysis from conception through to conclusion. These problems are described now.

Example problems and associated data sets

Ecological Diseased trees

Atmospheric Chemistry Monitoring urban air pollution

Health Comparing hospitals

3.1.1 Diseased trees

In an ecological study of diseased trees, trees along transects through a plantation were examined and assessed as diseased or healthy. The following process is adopted for data collection. First a diseased tree is found. Then the number of neighbouring trees in an unbroken run of diseased trees along the transect is recorded. The ecologists are interested in the following questions: How does the disease spread between trees, and what is the probability that trees are infected by the disease?

The observations made on a total of 109 runs of diseased trees are recorded in Table 3.1. We will use this data set to show the benefits of collecting more data. To do this we have broken down the data in Table 3.1 into data collected from the first 50 observations and from the whole data set; we refer to these as the partial and full data sets respectively.

3.1.2 Urban and rural ozone


Run length                               0   1   2   3   4   5

Number of runs in first 50 observations  31  16  2   0   1   0

Number of runs in 109 observations       71  28  5   2   2   1

Table 3.1: Run lengths of diseased trees in an infected plantation.

In the UK the Department for Environment, Food & Rural Affairs operates a national air quality monitoring system, with a network of sites at which air quality measurements are taken automatically. These measurements are used to summarise current air pollution levels, for forecasting of future levels, and to provide data for scientific research into the atmospheric processes behind the pollution to which we are exposed.

We will look at ground-level ozone (O3). This pollutant is not emitted directly into the atmosphere, but is produced by chemical reactions between nitrogen dioxide (NO2), hydrocarbons and sunlight. When present at high levels, ozone can irritate the eyes and air passages, causing breathing difficulties, and may increase susceptibility to infection. Ozone is toxic to some crops, vegetation and trees and is a highly reactive chemical, capable of attacking surfaces, fabrics and rubber materials. Whereas nitrogen dioxide participates in the formation of ozone, nitrogen oxide (NO) destroys ozone to form oxygen and nitrogen dioxide. For this reason, ozone levels are not as high in urban areas (where high levels of NO are emitted from vehicles) as in rural areas. As the nitrogen oxides and hydrocarbons are transported out of urban areas, the ozone-destroying NO is oxidised to NO2, which participates in ozone formation. As sunlight provides the energy to initiate ozone formation, high levels of ozone are generally observed during hot, still, sunny, summertime weather in locations where the airmass has previously collected emissions of hydrocarbons and nitrogen oxides (e.g. urban areas with traffic). The resulting ozone pollution, or summertime smog, may persist for several days and be transported over long distances. We will focus on data from a pair of monitoring sites: an urban site in Leeds city centre and a rural site at Ladybower Reservoir, just west of Sheffield. Based on the above understanding of the atmospheric processes that produce ozone, we address the following questions:

How, if at all, does the distribution of ozone measurements vary between the urban and rural sites?

How, if at all, is the distribution of ozone measurements affected by season?

How, if at all, does the presence of other pollutants affect the levels of measured ozone?

The purpose of the statistical analysis is to provide an objective analysis of the data, by extracting the information in the data relevant to each of the scientific questions. The data at each site are daily measurements of the maximum hourly mean concentration of O3 and NO2, recorded in parts per billion (ppb), from 1994–1998 inclusive. To focus on the question of whether there is any effect of season on ozone levels, we compare data from winter (November–February inclusive) and early summer (April–July inclusive). We denote the ozone data from the Leeds city centre site by x1, . . . , xn, with the subscript denoting the n different days for which we have measurements. Similarly, we denote the ozone data from


the Ladybower Reservoir site by y1, . . . , yn. When we wish to look at the winter or summer data, we will choose the subset of the xi's and yi's with subscripts i corresponding to winter or summer days.

3.1.3 Comparing hospitals

League tables for many public institutions such as schools, hospitals and even universities try to compare the relative performances of the institutions. This very small example uses the outcomes of a difficult operation at two hospitals. Ten patients at each hospital underwent the operation. The patients were selected to make sure that they had similar severity of illness and other characteristics which are believed to influence the outcome of the operation. There is no connection between the two hospitals. Each time the operation was performed, it was classified as successful or unsuccessful. The first hospital had nine out of ten successful operations and the second hospital had five out of ten successful. What can we conclude about the relative performances of the two hospitals?

Population and sample: example
In the Ozone problem, we have data from a number of days during 1994–1998. However, interest is not solely in the levels of ozone on the days on which measurements were taken. The objective of a statistical analysis is to learn about the relationships between the variables, and to draw more general conclusions about levels of the variables on other, perhaps future, dates.

Example 3.1.1. For each of the problems that we are concentrating on in the course, state thepopulations that we are trying to learn about:

Ozone Levels of ozone at the two locations given the time of year and the level of NO2.

Diseased trees All trees in the forest and, possibly, other locations where the climate, soil and trees have similar properties.

Hospital Other operations at the two hospitals.

Discrete or Continuous: Diseased trees

Example 3.1.2. For the diseased tree data set, define the variable of interest as X and find the possible range of values. Is the variable discrete or continuous?

• The variable of interest is

X = the length of the unbroken run of diseased trees neighbouring a diseased tree .

• Possible values are 0, 1, · · · , so X is a discrete random variable.

Discrete or Continuous: Comparing hospitals

Example 3.1.3. For the hospital data set, define the variables of interest and find the possible range of values. Are the variables discrete or continuous?


• The variable of interest is

X = number of successful operations in the first hospital

Y = number of successful operations in the second hospital

• Possible values are

0, 1, · · · , 10 for both variables, so both are discrete.

3.2 Graphical methods

Graphical methods exist for visualising multivariate and univariate data. If the data is very high dimensional, then it can become difficult to visualise, since plots are flat! This is an interesting problem in its own right. Here we focus on methods for examining the distribution of a univariate variable, and the relationships between pairs of variables.

3.2.1 Histograms

Before starting to think about an applied statistical problem, we often have no reason to think that data come from one probability distribution rather than another. The exploratory stages of data analysis can help us to choose which probability distribution might describe our data well. The histogram can be used to estimate the underlying distribution as follows.

Histogram: shape of distribution

• Bins of equal width

• Number of observations in each bin

Example 3.2.1. Diseased trees. Histograms for the partial and full data sets are shown in Figure 3.1. Both indicate a geometric decay in the distribution of run lengths.

Scaled histogram

Whether dealing with a discrete or a continuous random variable, we can use the histogram to estimate the underlying p.m.f. (of a discrete variable) or the p.d.f. (of a continuous variable). Recall that all p.m.f.s sum to 1, and that all p.d.f.s integrate to 1. We can rescale the vertical axis of our histogram to ensure that the histogram has area 1. We do this by calculating the area of our original histogram, then dividing all the counts, or frequencies, by this amount. When the bins are all of equal width, the area is:

A = (contribution of one individual) × (number of individuals)
  = (1 × bin width) × (number of individuals)
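This rescaling can be sketched in a few lines of code (Python here, purely for illustration; the course labs use R). The bin counts below are hypothetical, chosen so that the total (109) and the largest bin (71) mirror the full diseased-trees histogram of Example 3.2.2:

```python
# Rescale a histogram so that it has area 1 (a sketch, not the course's R code).
# Hypothetical bin frequencies; total 109 and maximum 71 match Example 3.2.2.
def scale_histogram(counts, bin_width):
    """Divide each frequency by the total area A = bin width x total count."""
    area = bin_width * sum(counts)
    return [c / area for c in counts]

counts = [71, 24, 8, 4, 1, 1]            # hypothetical frequencies, bin width 1
density = scale_histogram(counts, bin_width=1)

assert sum(counts) == 109
assert abs(sum(density) * 1 - 1.0) < 1e-12    # rescaled histogram has area 1
assert abs(max(density) - 71 / 109) < 1e-12   # new maximum is about 0.65
```

In R the same effect is obtained by plotting a histogram on the density scale rather than the frequency scale.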


Figure 3.1: Histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets (x-axis: run length; y-axis: count).

Example 3.2.2. Diseased trees. For the histogram of the full data set of diseased trees, the bin width is 1 and there are 109 observations, so

A = 1 × 109 = 109,

and the new maximum value on the y-axis is approximately 71/109 ≈ 0.65.

The resulting histogram gives an idea of the underlying p.m.f. or p.d.f. of our variable. The scaled histogram is shown in Figure 3.2. The shape of the histogram does not change, but the vertical axis now represents the density rather than the frequency as before.

The benefit of transforming the axis by this scaling procedure becomes clear when we compare histograms of the data with the shapes of the densities obtained by fitting different models.

Example 3.2.3. Ozone.

Comparing histograms

The histograms of the summer ozone data for both sites are given in Figure 3.3. The location, spread and shape of these histograms are sufficiently close to make it difficult to identify any obvious difference by eye. As we are interested in the difference in the distributions of summer ozone measurements at the two sites, we do not necessarily have to look at the separate distributions themselves. Some of the variability in each of the distributions shown in Figure 3.3 is due to climatic conditions, which will be similar at the two sites since they are relatively near to one another.


Figure 3.2: Scaled histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets (vertical axis: p.m.f.).

Figure 3.3: Histograms of the summer daily maximum ozone levels (ppb) in Leeds city centre and at Ladybower reservoir.


Since we have observations every day at each site, looking directly at the daily differences between the measured ozone at the two sites will remove this extra variability due to changes in atmospheric conditions which are common to the two locations. We therefore analyse the data on the differences in the measurements of ozone at the two sites on the same day:

d_i = x_i − y_i for i in the set of summer sampling dates,

where x_i and y_i are the Leeds and Ladybower measurements on day i.
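The effect of differencing can be sketched numerically (Python for illustration; the six paired measurements below are hypothetical, not the actual ozone data):

```python
# Differencing paired daily measurements removes the variation that is common
# to both sites. The numbers here are hypothetical, for illustration only.
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

leeds     = [32, 29, 34, 22, 30, 35]    # hypothetical x_i
ladybower = [41, 40, 45, 35, 42, 44]    # hypothetical y_i
diffs = [x - y for x, y in zip(leeds, ladybower)]

# The differences vary much less than either site's raw measurements,
# because the shared (climatic) component cancels out.
assert variance(diffs) < variance(leeds)
assert variance(diffs) < variance(ladybower)
```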

Reducing variability

The histogram of the differences d_i is shown in Figure 3.4.

Figure 3.4: Differences of summer ozone daily maxima at the two sites (x-axis: difference in daily max ozone; y-axis: density).

The variability of these differenced data is considerably less than the variability of the measurements made at the separate sites. This finding indicates that common factors affecting both sites influence variation in ozone values. The common source of variation has been removed from the differenced data. We would expect to extract more information from this less variable data, but this gain is only possible because of the data sampling strategy, where measurements from both sites were collected on the same days.

Figure 3.4 shows that most of the differences are negative, suggesting that in general the Ladybower measurements are larger than the Leeds measurements. This supports scientific expectations that rural ozone levels are generally higher than urban levels (discussed in Section 3.1.2).

Choice of bin size

We stated in the introduction to histograms that constructing a histogram involves dividing the range of the data into bins. Although we didn't go into this at the time, the choice of the bin size can drastically affect the shape of your histogram.

Example 3.2.4. Choosing bin size for the summer ozone data.

Effect of bin size

Examples of very wide and very narrow bins are shown in Figure 3.5 for the summer ozone data from the Leeds city centre site. Using a very large bin size has obscured the structure of the data. So has the very small bin size: the right hand plot just shows the raw data! A bin size somewhere in between these two extremes is used in the histogram in Figure 3.3, in which the underlying structure of the data is much clearer.


Figure 3.5: Histograms (density scale) of the Leeds summer ozone data using different bin sizes.

3.2.2 Empirical distribution function

Empirical distribution function

Recall the cumulative distribution function (c.d.f.) of a random variable X:

F(x) = P(X ≤ x).

How can we estimate this from a finite number of observations? Let us assume that our variables X_1, . . . , X_n are independent and identically distributed (i.i.d.) replicates of a random variable X which has cumulative distribution function F. We denote by x_1, . . . , x_n the observed values of X_1, . . . , X_n.

The empirical cumulative distribution function (c.d.f.) is defined as

F(x) = (number of x_i ≤ x) / n = (1/n) ∑_{i=1}^n I(x_i ≤ x),

where

I(x_i ≤ x) = 1 if x_i ≤ x, and 0 if x_i > x.

The empirical c.d.f. is a proper distribution function and has the following properties:

• F(x) is a step function with jumps at the data points;

• F(x) = 1 if x ≥ max(x_1, . . . , x_n);

• F(x) = 0 if x < min(x_1, . . . , x_n).

Remark

• We have no reason to favor any particular observation, so we give each observation an equal weight 1/n. If some values are more likely than others, they simply appear more frequently than others.


• Take the observed values and order them so that the smallest one comes first. Label these ordered values x(1), x(2), · · · , x(n) so that

x(1) ≤ x(2) ≤ · · · ≤ x(n).

Then the kth ordered point x(k) is the k/n-th quantile.

Example 3.2.5. For observations 1, 2, 2, 3, 4, find F(x) and sketch the plot.

x      0   0.5   1   1.5   2   2.5   3   3.5   4   4.5   5
F(x)

Example 3.2.6. Summer ozone. Let's take the first 20 observations from Leeds city centre. The summer ozone values are:

32 29 32 32 33 27 34 22 30 35 27 23 28 34 35 45 36 26 23 16

To plot the empirical c.d.f. of this data we start by ordering the data:

i      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
x(i)  16  22  23  23  26  27  27  28  29  30  32  32  32  33  34  34  35  35  36  45

Then at each sorted data point the empirical c.d.f. jumps by 1/n, so its value at x(i) is i/n. Here n = 20.
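The empirical c.d.f. is simple to compute directly from its definition. A minimal sketch (Python for illustration; the course labs use R, where `ecdf()` does the same job), applied to the 20 Leeds values above:

```python
# Empirical c.d.f.: F(x) = (number of x_i <= x) / n.
def ecdf(data):
    n = len(data)
    return lambda x: sum(1 for xi in data if xi <= x) / n

# First 20 summer ozone values from Leeds city centre (Example 3.2.6).
ozone = [32, 29, 32, 32, 33, 27, 34, 22, 30, 35,
         27, 23, 28, 34, 35, 45, 36, 26, 23, 16]
F = ecdf(ozone)

assert F(15) == 0.0    # below the minimum (16)
assert F(30) == 0.5    # half of the values are <= 30
assert F(45) == 1.0    # at (or above) the maximum
```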

[Plot: empirical c.d.f. F(x) of these 20 Leeds summer daily maxima.]

According to this graph, about 50% of the time the daily maximum was less than 30, and about 20% of the time the daily maximum was greater than 35. The minimum is 16, the maximum is 45, and there is a fairly steady increase between 20 and 35.

For the complete data set, Figure 3.6 shows the graph of the empirical c.d.f. of the summer daily maxima ozone measurements at the Leeds city centre site. The observations themselves are plotted as black points; the jumps in the step function occur at these points.

With the complete data set, the maximum stretches out as far as 80 and there is a fair chance, about 10%, of having a daily maximum greater than 40. The data cover 1994–1998 inclusive. How would this graph change if we used observations from another period?


Figure 3.6: Empirical c.d.f. for summer daily maxima ozone at the Leeds city centre site.

3.2.3 Scatterplot

Histograms and empirical distribution functions are useful methods for visualising one variable at a time. However, when dealing with multivariate data, it is important to examine the relationships between variables as well as the structure of each variable by itself. The easiest graphical method for doing this is the scatterplot, in which we simply plot the value of one variable against another.

Scatterplot

When we have multivariate data, we have to look at dependence between variables. A scatterplot plots one variable against another.

Example 3.2.7. Ozone. We now turn to the effect of nitrogen dioxide (NO2) on ozone levels. We focus on the Leeds city centre ozone measurements as our response variable.

Figure 3.7: Scatterplots of Leeds city centre O3 values against NO2 for each season (summer left, winter right).


Figure 3.7 shows the scatterplots of this variable against the explanatory variable NO2. For the summer data, there is a suggestion of positive association between these two variables, since there is a tendency for the large O3 and NO2 values to occur together. For the winter data, the relationship is less clear, partly due to the uneven distribution of the NO2 values.

3.2.4 Visualising conditional distributions

Visualising conditional distributions

As well as looking for dependence between variables, it can also be useful to identify situations in which variables appear to be independent. If two variables are independent, then the distribution of one variable will look the same regardless of the value of the other variable. Conditional probabilities were introduced in Math104:

If A and B are two events then, as long as P(B) > 0, the conditional probability of A given B is written as P(A | B) and calculated from

P(A | B) = P(A ∩ B) / P(B).

We can look for more structure in our data, including the dependence of one variable on another, by examining conditional distributions of some subsets of our data.
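In practice, conditioning on a variable amounts to subsetting the paired observations before plotting. A small sketch (Python for illustration; the (O3, NO2) pairs are hypothetical, not the actual ozone data):

```python
# Split O3 values by the value of NO2 on the same day, to examine the
# conditional distributions. Hypothetical (O3, NO2) pairs for illustration.
pairs = [(31, 25), (45, 55), (28, 38), (52, 60), (36, 44), (24, 30)]

o3_low_no2  = [o3 for o3, no2 in pairs if no2 <= 40]        # O3 | NO2 <= 40
o3_high_no2 = [o3 for o3, no2 in pairs if 40 < no2 <= 60]   # O3 | 40 < NO2 <= 60

assert o3_low_no2 == [31, 28, 24]
assert o3_high_no2 == [45, 52, 36]
```

Histograms of each subset then estimate the conditional distributions compared in Figure 3.8.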

Figure 3.8: Conditional histograms of ozone levels in Leeds city centre conditional on NO2 ≤ 40 and 40 < NO2 ≤ 60, in summer (left) and winter (right).

Example 3.2.8. Ozone data. We will look at the following conditional histograms for the Leeds city centre site, shown in Figure 3.8:

• daily maximum ozone levels in summer conditional on NO2 ≤ 40;

• daily maximum ozone levels in summer conditional on 40 < NO2 ≤ 60;

• daily maximum ozone levels in winter conditional on NO2 ≤ 40;

• daily maximum ozone levels in winter conditional on 40 < NO2 ≤ 60.

Compare Figures 3.7 and 3.8. What do you learn from these?


3.2.5 Historical note – Florence Nightingale

Good graphical display is the important first step in any data analysis. Choosing how to do it is part science, part art, and sometimes part politics! Florence Nightingale was the first female Fellow of the Royal Statistical Society. She pioneered the use of statistics as an organised way of learning, leading to improvements in medical and surgical practices. She developed the "polar-area diagram", shown in Figure 3.9, to dramatise the needless deaths caused by unsanitary conditions. Florence Nightingale revolutionised the idea that social phenomena could be objectively measured and subjected to mathematical analysis, innovating in the collection, interpretation, and graphical display of descriptive statistics.

Figure 3.9: Florence Nightingale (May 12, 1820 to August 13, 1910). Example of a polar-area diagram invented by Florence Nightingale. The original was in colour with the outer area in blue (deaths from sickness), the central lighter areas in red (deaths from wounds), and the central darker areas in black (deaths from other causes).

The diagram showed that most deaths of British soldiers during the Crimean War were from sickness rather than wounds or other causes. It also showed that the death rate was higher in the war's first year (right half of diagram), before the Sanitary Commissioners improved hygiene in the camps and hospitals. The text in the lower left corner reads:

The Areas of the blue, red, & black wedges are each measured from the centre as the common vertex.

The blue wedges measured from the centre of the circle represent area for area the deaths from Preventable or Mitigable Zymotic diseases, the red wedges measured from the centre the deaths from wounds, & the black wedges measured from the centre the deaths from all other causes.

The black line across the red triangle in Nov. 1854 marks the boundary of the deaths from all other causes during the month.

In October 1854, & April 1855, the black area coincides with the red; in January & February 1855, the blue coincides with the black.

The entire areas may be compared by following the blue, the red, & the black lines enclosing them.

Page 43: Preliminaries - maths.lancs.ac.ukparkj1/math105/m105lecnotes.pdf · PRELIMINARIES 1 Preliminaries These notes are for the Math105 course in statistics. The course introduces the statistical

3.3. SUMMARY STATISTICS 43

3.3 Summary statistics

Numerical summaries of the data can

• facilitate the comparison of different variables;

• help us make clear statements about some aspects of the data.

Mathematical skills: Recall the notation

∑_{i=1}^n g(i) = g(1) + g(2) + . . . + g(n−1) + g(n)

for any positive integer value of n and any function g. In statistics we often have to do mathematics with sums of the form ∑_{i=1}^n h(x_i). Using the above notation, this means

∑_{i=1}^n h(x_i) = h(x_1) + h(x_2) + . . . + h(x_{n−1}) + h(x_n)

for any positive integer value of n, any function h, and any set of values x_1, . . . , x_n. The most common forms of this expression we will encounter are:

∑_{i=1}^n x_i = x_1 + . . . + x_n   and   ∑_{i=1}^n x_i^2 = x_1^2 + . . . + x_n^2.

Let c ≠ 0 and d be real numbers and denote x̄ = (1/n) ∑_{i=1}^n x_i.

3.3.1 Sample mean

Sample mean

Consider a random variable X of which we obtain n i.i.d. realisations X_1, . . . , X_n. The sample mean of the observed values x_1, . . . , x_n is defined as follows:

The sample mean of n observations x_1, . . . , x_n is denoted by m(x) and obtained by adding all the x_i and dividing by n:

m(x) = (1/n) ∑_{i=1}^n x_i = x̄.

This is an estimate of µ_X, the mean or expectation of X, and is viewed as a measure of center.

3.3.2 Sample variance and standard deviation

Sample variance and standard deviation

The sample variance of n observations x_1, . . . , x_n is denoted s^2(x) and is given by:

s^2(x) = (1/n) ∑_{i=1}^n (x_i − x̄)^2.

Page 44: Preliminaries - maths.lancs.ac.ukparkj1/math105/m105lecnotes.pdf · PRELIMINARIES 1 Preliminaries These notes are for the Math105 course in statistics. The course introduces the statistical

44 CHAPTER 3. EXPLORATORY DATA ANALYSIS

Note the divisor n here instead of n − 1. The sample variance is an estimate of the variance of X, σ_X^2.

Ideally, the spread measure should have the same units as the original data. To obtain a measure with the correct units, we take the square root and define:

The sample standard deviation of n observations x_1, . . . , x_n is denoted by s(x) and is given by:

s(x) = √[ (1/n) ∑_{i=1}^n (x_i − x̄)^2 ].

The standard deviation of X, σ_X, is the square root of σ_X^2 = Var(X), and the sample standard deviation estimates this value from the data.
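The three definitions above translate directly into code (a Python sketch for illustration; note the divisor n, matching the definitions here, rather than the n − 1 used by many software defaults):

```python
import math

# Sample mean, variance and standard deviation with divisor n, as defined above.
def m(xs):
    return sum(xs) / len(xs)

def s2(xs):
    xbar = m(xs)
    return sum((x - xbar) ** 2 for x in xs) / len(xs)

def s(xs):
    return math.sqrt(s2(xs))

data = [2, 4, 6, 8, 10]              # Data A from Example 3.3.2
assert m(data) == 6.0
assert s2(data) == 8.0               # (16 + 4 + 0 + 4 + 16) / 5
assert math.isclose(s(data), math.sqrt(8))
```

Be aware that built-in functions such as R's `var()` and `sd()` use the divisor n − 1 instead.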

Example 3.3.1. Ozone data. We will calculate summary statistics of O3 to look more closely for differences between the locations and the seasons. There are four groups, arising from the two levels of each of the two nominal variables, location and season. Numbers in parentheses are standard deviations.

Mean ozone concentration:

           Leeds city      Ladybower
summer     31.78 (9.28)    43.63 (11.81)
winter     20.52 (10.77)   29.24 (8.40)

What conclusions do you draw from these summary statistics?

3.3.3 Sample quantiles

Sample quantiles

The median is the midpoint of the data, another measure of the center of the data. Sample quantiles are calculated directly from the empirical c.d.f. To estimate the pth quantile, we find the value x_p that satisfies

F(x_p) = p.

Example 3.3.2. Calculate the sample mean, sample standard deviation, and sample quantiles (x_{0.25}, x_{0.5}, x_{0.75}) for each dataset.

• Data A: 2, 4, 6, 8, 10

• Data B: 2, 4, 6, 8, 100

• Data C: 2, 4, 6, 8, 1000

Using the empirical c.d.f.

The procedure for estimating x(p) given p is illustrated in Figure 3.10.

Using sample quantiles

The sample median is an alternative measure of the center of the distribution. Similarly, an alternative measure of the spread of the distribution is the range of the middle 50% of observations, called the interquartile range, x_{0.75} − x_{0.25}.
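Quantile estimation can be sketched using the rule above that the kth ordered point x(k) is the k/n-th quantile (Python for illustration; be aware that software packages implement many slightly different quantile conventions, and R's `quantile()` alone offers nine types):

```python
import math

# Estimate the p-th sample quantile: return x_(k) for the smallest k
# with k/n >= p, following the ordering rule in the notes.
def quantile(data, p):
    xs = sorted(data)
    k = max(1, math.ceil(p * len(xs)))
    return xs[k - 1]

# First 20 Leeds summer ozone values (Example 3.2.6).
ozone = [32, 29, 32, 32, 33, 27, 34, 22, 30, 35,
         27, 23, 28, 34, 35, 45, 36, 26, 23, 16]

q1, med, q3 = (quantile(ozone, p) for p in (0.25, 0.5, 0.75))
assert (q1, med, q3) == (26, 30, 34)
assert q3 - q1 == 8          # interquartile range
```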


Figure 3.10: To estimate a quantile empirically using the empirical c.d.f., first find the probability of interest p on the y-axis, then go across and down to find x(p). If the function is quite "steppy", like the one illustrated, you can quote an interval containing the quantile x(p). Shown here for the 0.6 quantile of the summer ozone daily maxima at the Leeds city centre site: p = 0.6 gives x(p) in the interval (33, 34).

Using the above example, compare (i) the sample mean and sample median, and (ii) the sample standard deviation and sample interquartile range. When do you think measures based on sample quantiles are preferable?

The sample mean is sensitive to outliers and is greatly influenced by extreme points, whereas the sample median is not affected by them. Likewise, the sample standard deviation is greatly inflated by extreme points, whereas the sample interquartile range is stable.
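This sensitivity is easy to check numerically on data sets A and C from Example 3.3.2 (a Python sketch; the course labs use R):

```python
# Mean vs median on Example 3.3.2's data sets A and C: the single extreme
# point in C inflates the mean but leaves the median unchanged.
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]      # middle value (odd-length data)

data_a = [2, 4, 6, 8, 10]
data_c = [2, 4, 6, 8, 1000]

assert mean(data_a) == 6 and median(data_a) == 6
assert mean(data_c) == 204 and median(data_c) == 6
```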

Boxplot

Example 3.3.3. Ozone data. Figure 3.11 shows a graphical summary of quantiles for the ozone data as boxplots. What do you learn from the boxplots?

3.3.4 Sample correlation

Sample correlation

Suppose X and Y are random variables of which we have i.i.d. paired observations (x_1, y_1), . . . , (x_n, y_n). First calculate:

• m(x) and s(x) for variable X;

• m(y) and s(y) for variable Y.

Next standardise the x and y values:

(x_i − m(x)) / s(x)   and   (y_i − m(y)) / s(y)   for all i = 1, . . . , n.

Then the sample correlation coefficient is the average of the product of these standardised values:


The sample correlation coefficient of n pairs of observations (x_1, y_1), . . . , (x_n, y_n) is denoted by r(x, y) and is given by:

r(x, y) = (1/n) ∑_{i=1}^n ((x_i − m(x)) / s(x)) ((y_i − m(y)) / s(y)) ∈ [−1, 1].

The sample correlation coefficient is an estimate of the correlation between X and Y, denoted Corr(X, Y). Figure 3.12 shows data with different values of the correlation coefficient. All the plots in this figure show data after standardisation to have mean 0 and standard deviation 1. There are 100 points in each diagram. The different levels of clustering in each picture are measured by the correlation coefficient.
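The definition translates directly into code (a Python sketch for illustration, with standard deviations computed using the divisor n as in these notes):

```python
import math

# Sample correlation coefficient: standardise each variable, multiply
# pairwise, and average with divisor n, exactly as defined above.
def r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) / sx * ((y - my) / sy) for x, y in zip(xs, ys)) / n

# Exactly linear data attains the extremes: r = 1 for an increasing line,
# r = -1 for a decreasing one.
xs = [1.0, 2.0, 3.0, 4.0]
assert math.isclose(r(xs, [2 * x + 1 for x in xs]), 1.0)
assert math.isclose(r(xs, [-x for x in xs]), -1.0)
```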

Sample correlation coefficient

The sample correlation coefficient is a measure of linear association, or clustering around a straight line.

Misuse of the sample correlation coefficient

The sample correlation coefficient is not appropriate for detecting nonlinear association, as illustrated in Figure 3.13.

Example 3.3.4. Ozone data. Calculate sample correlation coefficients between O3 and NO2 for the ozone data. There are four groups, arising from the two levels of each of the two nominal variables, location and season.

Figure 3.11: Sample quantiles for the ozone data, summarised as boxplots (summer left, winter right; Leeds.O3 and Ladybower.O3 in each panel). The center line is the sample median, the box spans the interquartile range (the 25% and 75% quantiles), and the whiskers extend to the minimum and maximum.


Figure 3.12: Scatterplots of standardised data with six different values of Corr(X, Y): −0.9, −0.5, 0, 0.3, 0.5 and 0.7. Corr(X, Y) = 0 means no linear association; Corr(X, Y) < 0 means negative association and Corr(X, Y) > 0 means positive association.

Figure 3.13: The sample correlation coefficient is not an appropriate measure of the strength of non-linear association. The four panels have Corr(X, Y) = 0.16, 0.028, 0.51 and −0.83 respectively.

Sample correlation between O3 and NO2:

           Leeds city   Ladybower reservoir
Summer      0.10         0.25
Winter     -0.24        -0.48


What conclusions can you draw from these statistics? Compare to Figure 3.7. Are they consistent with each other?


Chapter 4

Statistical modelling

Having completed an exploratory analysis, the statistician should be aware of the details of the scientific problems that are to be addressed. These first stages have involved close scrutiny of the data set available to help in the approach to these problems, and have developed a good feeling for the structure of the data and how they relate to the questions at hand. What happens next? Almost any statistical analysis that is used to approach substantive problems will involve some type of modelling. This term is very general and covers a very wide range of techniques and approaches. The statistician's idea of a model may be rather different from that of an applied mathematician or of a physicist. It is often stated in statistics that:

No model is right but some models are useful.

This appears to knock down the idea of modelling before we have even started. However, it does capture one important aspect of statistical modelling: in general, statistical models do not attempt to provide a complete or realistic description of the processes to which they are applied. Often, statistical models are developed to answer a very particular question about a given phenomenon: if one were to ask a different question of that phenomenon, then the statistical model used to answer it might also be different. Neither model is the right model for the phenomenon, in that it does not completely explain the observed behaviour, but each model might be useful in answering the question for which it was designed. In this chapter, we shall define what we mean by a statistical model and give a range of examples of such models.

Why statistical models?

• it identifies different sources of variation;

• it describes your data;

• it addresses the question you are trying to answer;

• it reflects your assumptions.

What is a statistical model?

Example 4.0.5. Ozone data. Statistical models that we might propose for the ozone data are



1. Ozone measurements are independent of the level of NO2 on the day of measurement.

2. There is no difference between (summer) ozone values on the same day, i.e. the difference has zero mean.

Different modelling approaches

We have already seen a variety of models, including:

• empirical c.d.f.

• histogram

Both of these are called nonparametric models. Parametric models:

• are characterised by parameters;

• try to reflect assumptions about the stochastic or random mechanisms that generated thedata;

• impose relatively strict forms on the assumed underlying distributions.

Discrete variable models were introduced in Math104 and continuous variable models in Chapter 2.

4.1 Distributional modelling

The particular model that should be used for a given problem will depend on the problem and on the assumptions that the statistician is prepared to make as part of the analysis. As an example, we review both discrete and continuous distributional models, with an emphasis on the nature of the models and their properties.

Distribution as a model

From now on we will write p.m.f.s and p.d.f.s explicitly showing the parameter θ as follows:

• p(x;θ) for p.m.f.

• f(x;θ) for p.d.f.

A parameter θ (possibly a vector) is an index value of a family of random variables. The space of possible values that can be taken by the parameter θ is termed the parameter space, which we denote by Θ.


4.1.1 Binomial distribution

Binomial distribution

This distribution is used to model variables that describe

(i) X = the number of successes in n trials, where

(ii) each trial is an independent Bernoulli trial, and

(iii) each trial has equal probability θ of success.

A simple example of such a random variable is the number of heads in n coin tosses. A more realistic example would be the number of respondents in a survey agreeing with a candidate or policy when there are only two possible choices. Can you think of any other example?

p.m.f. for a Binomial random variable

The p.m.f. for such a variable is given by:

The p.m.f. of a Binomial random variable X is

p(x; θ) = C(n, x) θ^x (1 − θ)^(n−x),

where C(n, x) = n! / (x! (n − x)!) is the binomial coefficient, x = 0, 1, 2, . . . , n, and 0 ≤ θ ≤ 1. This is written as X ∼ Binomial(n, θ) and θ ∈ Θ = [0, 1].

E[X] = nθ,   Var[X] = nθ(1 − θ).
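The p.m.f. is straightforward to compute and check numerically (a Python sketch; `math.comb` gives the binomial coefficient). With n = 50 and θ = 0.1, matching the right panel of Figure 4.1, the mass sums to 1, the mode sits at 5, and the mean equals nθ = 5:

```python
import math

# Binomial p.m.f.: p(x; theta) = C(n, x) theta^x (1 - theta)^(n - x).
def binom_pmf(x, n, theta):
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)

n, theta = 50, 0.1
probs = [binom_pmf(x, n, theta) for x in range(n + 1)]

assert math.isclose(sum(probs), 1.0)          # p.m.f. sums to 1
assert probs.index(max(probs)) == 5           # most likely value near n*theta
mean = sum(x * p for x, p in zip(range(n + 1), probs))
assert math.isclose(mean, n * theta)          # E[X] = n*theta
```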

Example 4.1.1. Explain why the following random variable could be modelled by a Binomial distribution. What does the parameter θ represent?

X = number of seeds which germinate, having planted n.

• Each trial is Bernoulli?

Possible outcomes are whether a seed germinates or not, so each trial has two possible outcomes.

• Independent trials?

The probability of germination for one seed does not influence that of other seeds.

• Equal probability?

Assuming that seeds are randomly selected from the same population, the probability of germinating should be the same for all seeds.


• θ is

the probability that a (randomly selected) seed germinates.

Example 4.1.2. Explain why the following random variable may not be modelled by a Binomial distribution.

X = number of students who weigh more than 200 pounds

Even though the students are of the same age, there is generally a gender effect in weights, so the probability of success is not the same for each student.

Usually, the number of trials n is known, but the probability of success θ is not known. The value of θ strongly influences the probability of different outcomes, as shown in the plots below.

Shape of Binomial mass function

[Figure: three panels of p.m.f. against x, for θ = 0.7, θ = 0.5 and θ = 0.1.]

Figure 4.1: P.m.f. for Binomial random variables with different success probabilities θ and fixed number of trials, n = 50.

Example 4.1.3. How is the shape of the function related to the parameter θ?

• What is the most likely value in each scenario?

• What are likely values in each scenario? How would you characterise them?

• What is the expected number of successes in each scenario?

• If you were to observe 10 times, what sequence of values would you expect to see?

4.1.2 Geometric

Geometric distribution
This distribution is used to model variables that describe

(i) X = the number of failures before a first success;

(ii) in independent Bernoulli trials;


(iii) each trial with equal probability θ of failure.

Note: the parameter θ refers to the probability of failure because failure is the primary event of interest here: it is the number of failures that is counted.

p.m.f. of the Geometric distribution
The p.m.f. for such a variable is given by:

The p.m.f. of a Geometric random variable X is

p(x; θ) = θ^x (1 − θ),

where x = 0, 1, 2, . . . and 0 ≤ θ ≤ 1. This is written X ∼ Geometric(θ) and θ ∈ Θ = [0, 1].

E[X] = θ/(1 − θ), Var(X) = θ/(1 − θ)^2.
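In R the corresponding function is dgeom, but note a pitfall: dgeom is parametrised by the success probability, while these notes use θ for the failure probability, so prob = 1 − θ must be passed. A sketch with the illustrative value θ = 0.5:

```r
theta <- 0.5   # probability of failure, as in the notes
x <- 0:20

# R's dgeom takes the SUCCESS probability, i.e. 1 - theta here
p <- dgeom(x, prob = 1 - theta)
stopifnot(isTRUE(all.equal(p, theta^x * (1 - theta))))

# E[X] = theta / (1 - theta) = 1 when theta = 0.5 (tail beyond 1000 is negligible)
sum(0:1000 * dgeom(0:1000, prob = 1 - theta))  # approximately 1
```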

A simple example of the random variable is the number of consecutive heads before a tail occurs in coin tossing. Note that the number of trials is not fixed! Can you think of any other example?

Geometric distribution models

Example 4.1.4. Explain why the following random variable could be modelled by a Geometric distribution. What does the parameter θ represent?
X = Number of cars that pass the hitching post until I get a lift.

Each car passing the hitching post gives me a lift independently with the same probability. Then we can consider this as a sequence of Bernoulli trials, with probability θ of the failure that a car passes without giving me a lift, and we are counting the number of failures until the first success.

Usually, the probability of failure θ is not known. The value of θ strongly influences the probability of different outcomes, as shown in the plots below.

Shape of Geometric mass function

Example 4.1.5. How is the shape of the function related to the parameter θ?

• What is the most likely value in each scenario?

• What are likely values in each scenario? Rank the random variables according to P (X > 5).

• What is the expected number of failures before a first success occurs?

• If you were to observe 10 times, what sequence of values would you expect to see?


[Figure: three panels of p.m.f. against x, for θ = 0.7, θ = 0.5 and θ = 0.2.]

Figure 4.2: P.m.f. for Geometric random variables with different failure probabilities θ.

4.1.3 Poisson

Poisson distribution
This distribution is used to model variables arising as

• X = the number of events in a fixed time interval of a continuous-time process where the events occur at random at a given rate over time; or

• X=the number of successes when successes are very rare.

The p.m.f. for such a variable is given by:

The p.m.f. of a Poisson random variable X is

p(x; θ) = θ^x exp(−θ) / x!,

where x = 0, 1, 2, . . . and θ > 0. This is written as X ∼ Poisson(θ) and θ ∈ Θ = (0, ∞).

E[X] = θ, Var(X) = θ.
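In R the Poisson p.m.f. is dpois; a sketch checking the formula and that mean and variance both equal θ, with the illustrative value θ = 4:

```r
theta <- 4
x <- 0:30

p <- dpois(x, lambda = theta)
stopifnot(isTRUE(all.equal(p, theta^x * exp(-theta) / factorial(x))))

# mean and variance are both theta (the tail beyond x = 30 is negligible here)
sum(x * p)              # approximately 4
sum((x - theta)^2 * p)  # approximately 4
```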

Example 4.1.6. Explain why the following random variable could be modelled by a Poisson distribution. What does the parameter θ represent?
X = Number of lightning storms in Lancaster in a year.

Assuming that the lightning storms occur at random in a year, the number of events in a fixed time follows a Poisson distribution. The rate parameter θ is the rate of occurrence of lightning storms in Lancaster per year.

Usually, the rate parameter θ is not known. The value of θ strongly influences the probability of different outcomes, as shown in the plots below.

Shape of Poisson mass function


[Figure: three panels of p.m.f. against x, for θ = 4, θ = 8 and θ = 15.]

Figure 4.3: P.m.f. for Poisson random variables with different rate parameters θ.

Example 4.1.7. How is the shape of the function related to the parameter θ?

• What is the most likely value in each scenario?

• What are likely values in each scenario? How would you characterise them?

• What is the expected number of events in each scenario?

• If you were to observe 10 times, what sequence of values would you expect to see?

4.1.4 Normal

Normal distribution
This distribution is also known as the Gaussian distribution, after the German mathematician Carl Friedrich Gauss. The density was pictured on the German 10 mark note bearing Gauss's image! The p.d.f. for such a variable is given by:

The p.d.f. of a Normal random variable X is

f(x; θ) = (1 / (σ√(2π))) exp( −(1/2) ((x − µ)/σ)^2 ),

where θ = (µ, σ), −∞ < x < ∞, −∞ < µ < ∞ and σ > 0. This is written as X ∼ N(µ, σ^2) and θ ∈ Θ = (−∞, ∞) × (0, ∞).

E[X] = µ, Var(X) = σ^2.
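In R the Normal density is dnorm; note that dnorm takes the standard deviation σ, not the variance σ^2. A sketch with the illustrative values µ = 0 and σ = 2:

```r
mu <- 0; sigma <- 2
x <- seq(-6, 6, by = 0.1)

# dnorm is parametrised by the standard deviation sigma, not sigma^2
f <- dnorm(x, mean = mu, sd = sigma)
stopifnot(isTRUE(all.equal(
  f, exp(-0.5 * ((x - mu) / sigma)^2) / (sqrt(2 * pi) * sigma))))
```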

As stated earlier, the Normal distribution plays an important role in a result that is key to statistics, known as the central limit theorem. This theorem, discussed in Math230 and Math314, gives a theoretical basis to the empirical observation that many random phenomena seem to follow a Normal distribution.
Usually, the mean parameter µ and the scale parameter σ are unknown, although sometimes it is assumed that σ is known, as this simplifies things considerably.


[Figure: three panels of p.d.f. against x, for (µ = −3, σ = 1), (µ = 0, σ = 1) and (µ = 0, σ = 2).]

Figure 4.4: P.d.f. for Normal random variables with different location parameters µ and scale parameters σ.

Shape of Normal density function

Example 4.1.8. How is the shape of the function related to the parameter θ?

• What are likely values in each scenario? How would you characterise them?

• What is the expected value in each scenario?

• If you were to observe 10 times, what sequence of values would you expect to see?

4.1.5 Exponential

Exponential distribution
This distribution is often used to model variables that are the times until specific events happen, when the events occur at random at a given rate over time.
The p.d.f. of such a variable is given by:

The p.d.f. of an Exponential random variable X is

f(x; θ) = θ exp(−θx) for x > 0, and 0 otherwise,

where θ > 0. This is written as X ∼ Exponential(θ) and θ ∈ Θ = (0, ∞).

E[X] = 1/θ, Var(X) = 1/θ^2.
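In R the Exponential density is dexp, parametrised by the rate θ; a sketch with the illustrative value θ = 3, also checking E[X] = 1/θ by numerical integration:

```r
theta <- 3
x <- seq(0, 4, by = 0.01)

f <- dexp(x, rate = theta)
stopifnot(isTRUE(all.equal(f, theta * exp(-theta * x))))

# E[X] = 1/theta
integrate(function(x) x * dexp(x, rate = theta), 0, Inf)$value  # approximately 1/3
```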

Example 4.1.9. Explain why the following random variable could be modelled by an Exponential distribution. What does the parameter θ represent?
X = Time until a goal is scored in the cup final.


A goal is scored at random during the fixed time of the cup final, so the time until the event can be modelled by an Exponential distribution. The parameter θ is the rate at which goals are scored in the cup final.

Usually the rate parameter θ is unknown. The value of θ strongly influences the probability of different outcomes, as shown in the plots below.

Shape of Exponential density function

[Figure: three panels of p.d.f. against x, for θ = 5, θ = 3 and θ = 1.]

Figure 4.5: P.d.f. for Exponential random variables with different rate parameters θ.

Example 4.1.10. How is the shape of the function related to the parameter θ?

• What are likely values in each scenario? How would you characterise them?

• What is the expected value in each scenario?

• If you were to observe 10 times, what sequence of values would you expect to see?

4.1.6 Uniform

Uniform distribution
This distribution is used to model variables that can take any value on a fixed interval, when the probability of occurrence does not vary over the interval.
The p.d.f. of such a variable is given by:


The p.d.f. of a Uniform random variable X, distributed on the interval [a, b], is given by:

f(x; θ) = 1/(b − a) if a < x < b, and 0 otherwise,

where θ = (a, b) and Θ is the set of (a, b) such that −∞ < a < b < ∞. This is written as X ∼ Uniform[a, b].

E[X] = (a + b)/2, Var[X] = (b − a)^2 / 12.
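In R the Uniform density is dunif; a sketch with the illustrative endpoints a = 1, b = 4, checking the density height 1/(b − a) and the mean (a + b)/2:

```r
a <- 1; b <- 4

# density is 1/(b - a) inside [a, b] and zero outside
dunif(c(0.5, 2, 5), min = a, max = b)  # 0, 1/3, 0

# E[X] = (a + b)/2 by numerical integration
m <- integrate(function(x) x * dunif(x, min = a, max = b), a, b)$value
stopifnot(isTRUE(all.equal(m, (a + b) / 2)))
```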

Example 4.1.11. Explain why the following random variable could be modelled by a Uniform distribution and identify the parameter θ.
X = Time of day at which a birth occurs.

A birth is equally likely at any given time of day. Counting in hours, the range of values is θ = (0, 24).

Shape of Uniform density function

[Figure: three panels of p.d.f. against x, for (a = 1, b = 2), (a = 1, b = 3) and (a = 1, b = 4).]

Figure 4.6: P.d.f. for Uniform random variables with different endpoints a and b.

Example 4.1.12. How is the shape of the function related to the parameter θ?

• What are likely values in each scenario? How would you characterise them?

• What is the expected value in each scenario?

• If you were to observe 10 times, what sequence of values would you expect to see?

The distributions that we have considered here are by no means exhaustive. In fact you may easily run into situations where none of these models seems appropriate for the data that you are concerned with, even though the data itself may be suggestive. That is why it is important to be familiar with the data and to be able to explore it thoroughly.


4.2 The role of parameters

Based on the discussion on the various parametric models, would any of the models be suitable forour data sets?

Example 4.2.1. Hospitals. For Hospitals, it would be reasonable to assume Binomial distributions for the number of successful operations in both hospitals, thinking that each operation is independent with equal probability within a hospital.
Write X1 for the number of successful operations from the first hospital and X2 for the second hospital.
Would it also be reasonable to assume that the success probability is the same for both hospitals? In other words, are they identically distributed?
This assumption needs further checking, so we write θ1 for the success probability for the first hospital and θ2 for the second hospital, and the models are

X1 ∼ Binomial(10, θ1) X2 ∼ Binomial(10, θ2).

The role of parameters
Once you have decided to take a parametric approach, there are two stages of choices:

1. what family of models to use?

e.g. binomial, geometric

this is decided by thinking about the stochastic mechanisms behind the process of interest;

2. once the family has been chosen, what parameter values should you choose?

choose the value of θ

this value is chosen to make the fitted model describe the data as well as possible.

Example 3.3.1: Diseased trees.
Start with the simplest model we can think of:

• trees become diseased independently of one another;

• the probability of a tree becoming diseased is the same for all trees.

Looking along transects amounts to looking at

• a sequence of Bernoulli trials;

• probability θ of disease.

The parametric model that describes this type of behaviour is

Geometric(θ)



Figure 4.7: Histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets.


Figure 4.8: Histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets; p.m.f. (triangles) using θ = 0.75.



Figure 4.9: Histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets; p.m.f. (crosses, +) using θ = 0.25.


Figure 4.10: Histograms of run lengths of diseased trees in an infected plantation, partial (left plot) and full (right plot) data sets; p.m.f. using θ = 0.5.


p(x; θ) = θ^x (1 − θ) for x = 0, 1, . . . ,

where the parameter is θ, with Θ = [0, 1].
Conclusion from the plots:

A value of θ = 0.75 is too high, as the curve traced out by the triangle symbols decreases too slowly. Conversely, θ = 0.25 is too low; it (the crosses) gives a p.m.f. that decays too fast. The value θ = 0.5 is about right, as it seems to fit the histograms quite well.
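The visual comparison above can be reproduced numerically in R (again passing the success probability 1 − θ to dgeom, since the notes' θ is the failure probability):

```r
x <- 0:5

# p.m.f. over the observed run lengths for the three candidate values of theta
rbind(theta.25 = dgeom(x, prob = 0.75),
      theta.50 = dgeom(x, prob = 0.50),
      theta.75 = dgeom(x, prob = 0.25))
# theta = 0.25 decays fastest, theta = 0.75 most slowly
```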

We shall meet a more effective method of choosing the value of θ in Chapter 5.

4.3 Statistical model assumptions in this course

4.3.1 Independence

Whether or not your data are independent observations will depend on the characteristics of the variable being measured and on the sampling strategy adopted for the collection.
Independence is easy to construct in a sample survey. If the sampling operation involves picking m participants completely at random from a population of size n, then independence is attained if picking one individual has no influence on the probability of selection of any other individual. Independence can be harder to check in other contexts.

Example 4.3.1. Summer ozone.

• Summer ozone values at two sites are (independent, not independent).

• Residual behaviour at two sites are (independent, not independent).

• Ozone measurements are (independent, not independent) from one day to the next.

What happens when this assumption is violated?

you don’t have quite as much information as you thought you had!

4.3.2 Identically distributed

If we are to use a common statistical model for the probability distribution of random variables, then we must be able to justify the assumption that all the realisations of the random variables come from the same distribution.

Example 4.3.2. Suppose that a random variable X follows Normal(µ, σ^2). Which of the following random variables has a distribution identical to that of X?

Y1 ∼ Binomial(n, θ), Y2 ∼ Normal(µ2, σ^2),

Y3 ∼ Normal(µ, σ^2), Y4 ∼ Normal(µ, σ2^2).

Example 4.3.3. FTSE index. What do you conclude about the assumption of identical distribution for raw values and returns of the FTSE index variable from Figure 4.11?
The distribution of the raw values of the FTSE index changes over time, so they are (identically, not identically) distributed. However, the shape of the distribution of the returns is very close in the two periods, and thus it seems reasonable to assume that the return values are (identically, not identically) distributed.


[Figure: four panels of densities: FTSE raw values for 1968.12.25 − 1982.12.21 and 1993.02.15 − 2007.02.09 (top), and returns for the same two periods (bottom).]

Figure 4.11: Histograms of raw values of the FTSE index variable (top) and returns (bottom) for two selected periods.

You may judge directly from Figures 1.1 and 1.2. The raw values of the FTSE index variable in Figure 1.1 show a clear trend of increase over time, while the values of the returns plotted in Figure 1.2 seem to be randomly scattered over time, with no pattern to their behaviour from the past to the future.

Example 4.3.4. Ozone data. Can we assume identical distribution of the ozone measurements? This will depend on the population of interest and whether our sample would be representative of it. For example, could we consider our data representative of all daily ozone measurements anytime, anywhere in the UK? If not, what is a sensible assumption to be considered?

• Ozone measurements from one day to the next are (identically, not identically) distributed.

• Ozone measurements from one day to the next in summer are (identically, not identically) distributed.

• Ozone measurements from one day to the next in Leeds city centre in summer are (identically, not identically) distributed.

What happens when this assumption is violated?
This will depend on:

• how different the various underlying distributions really are;

• the nature of the differences;

• what you intend to use your fitted model for.


Example 4.3.5. Ozone data. Suppose we are interested in the difference in ozone level between Leeds city centre and Ladybower reservoir, but mistakenly ignore the effect of season:
The model ignoring the seasonal effect (will, will not) fit the data so well, but conclusions about the means of the ozone level at the two sites (will, will not) be valid. If there is a difference, it (will, will not) be clear whether this is just due to the difference between the two sites. A problem may arise if the effect is (small, large), in which case the more structure we allow the model to capture, the greater the chance of finding the effect.


Chapter 5

Statistical inference for Parametric

Models

In Chapter 4, we introduced several parametric models that could be used to describe various random phenomena. For example, a Binomial distribution would be an appropriate statistical model not only for the number of heads in coin tossing but also for the number of people who agree with the law in a smoking ban survey.
To fully describe the probabilities of a Binomial distribution, however, the two parameters, the sample size n and the common probability θ, must be specified. The sample size n is the actual number of trials, so it can be obtained once an experiment or a survey is conducted. But the population probability θ is generally unknown except in special circumstances. For example, in the experiment of coin tossing, θ = 0.5 would be a reasonable choice for an unbiased coin. But what is θ for the smoking ban survey?

Statistical Inference for Parametric Models
Parametric statistical inference refers to the process of

• estimating the parameters of a chosen distributional model from a sample;

• quantifying how accurate these estimates are;

• dealing formally with the uncertainty that exists in the data.

5.1 Overview

Example 4.1: Diseased trees.
Recall:

• the variable of interest is run length of diseased trees;

• we have assumed a Geometric(θ) distribution can be used to model this variable;

• the parameter θ is the probability of a tree being diseased, so θ ∈ Θ = [0, 1].

Previously we experimented graphically with different values of θ:


[Figure: two panels of p.m.f. against run length (0 to 5), for the partial and full data.]

Estimation:
Using methods to be introduced in Section 5.2, we find a best guess of the parameter value:

• θ̂ = 0.324 for the partial data and

• θ̂ = 0.343 for the full data.

The estimate hasn’t been changed very much by the addition of more data.The benefit of the extra data is the increased reliability of the best estimate for the full data set.Estimation:Quantifying reliability through reflecting uncertainty:

• we construct a set of values of θ ∈ Θ which are the most plausible given the observed data;

• specifically, we estimate an interval of values for θ;

• if we have more data, we have more information about θ and therefore the interval is tighter;

• we must also decide how confident we want to be that θ lies in the interval.

Confidence intervals for θ: 50% interval has an even chance of containing the true value of θ:

• (0.29,0.35) for the partial data and

• (0.33,0.36) for the full data.

95% interval has a good chance of containing the true value of θ:

• (0.23,0.43) for the partial data and

• (0.29,0.41) for the full data.

Page 67: Preliminaries - maths.lancs.ac.ukparkj1/math105/m105lecnotes.pdf · PRELIMINARIES 1 Preliminaries These notes are for the Math105 course in statistics. The course introduces the statistical

5.1. OVERVIEW 67

Q. The lengths of these intervals are:

50% confidence interval:

0.06 for the partial data and 0.03 for the full data.

95% confidence interval:

0.2 for the partial data and 0.12 for the full data.

Prediction under the fitted model:
We can now estimate the p.m.f. by replacing the unknown parameters by their estimates:

p(x; θ̂) = θ̂^x (1 − θ̂) for x = 0, 1, . . . .

Q. This can now be used to estimate the probability of longer runs of diseased trees than we saw in the data. Assuming θ̂ = 0.343, what is the probability of a run of 6 or more diseased trees?

For example, for a Geometric(θ) random variable X and a positive integer x, we know from Math104 that P(X ≥ x; θ) = θ^x, so the estimate for the probability of a run of 6 or more diseased trees is

P(X ≥ 6; θ̂) = θ̂^6 = (0.343)^6 ≈ 0.0016.
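This tail probability can be checked in R, either directly or via pgeom (which, as with dgeom, takes the success probability 1 − θ):

```r
theta.hat <- 0.343

theta.hat^6                         # about 0.0016
1 - pgeom(5, prob = 1 - theta.hat)  # P(X >= 6): the same value
stopifnot(isTRUE(all.equal(theta.hat^6, 1 - pgeom(5, prob = 1 - theta.hat))))
```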

This would not be possible without a model and an inference method for the model.
Assessment of the fitted model:

[Figure: counts against run length (0 to 5) for the observed data.]

Conclusions:

• the estimated parameter is a reasonable value for the data;

• the underlying Geometric model seems to be a good choice.

So we learn about the probability of disease but also the random mechanism by which disease spreads.


Statistical Inference for Parametric Models
We have now completed the final stages of statistical analysis of the trees data. Recall the whole procedure:

1. selection of parametric model Geometric(θ);

2. estimation of unknown model parameters;

3. assessment of validity of model choice;

4. use of model to predict aspects of variable of interest.

These stages are followed in all statistical inference:

• here the model fitted well;

• if the model doesn’t fit then we need to cycle round with a new model choice;

• improvements to the model are prompted by observed weaknesses and strengths of earlier analyses.

5.2 Parameter estimation

In Chapter 4, while looking at various parametric models, we learned that within the same parametric family of models, the description of probabilities can change dramatically with the choice of parameters.

5.2.1 Method of moments

In a smoking ban survey, if 75 out of 100 surveyed agree with the ban, a reasonable estimate for θ would be 0.75, the sample proportion. The simple idea of using a sample quantity in place of a population quantity is the basis of the method of moments, and also of the summary statistics measures used in Chapter 3. A more elaborate approach using likelihood will be discussed in MATH 235.

Sample mean
Often the population mean itself is of primary interest, and the sample mean is one of the most popular summary statistics.

Theorem 5.2.1. Let X1, . . . , Xn be identically distributed random variables with expectation µ = E[X]. Then

E[X̄] = µ.

X̄ is called the sample mean, and this theorem suggests that taking an empirical average might be a good idea when only a finite sample is available.
First we look at the case where n = 2. Then we have

E[X̄] = E[(X1 + X2)/2] = E[X1 + X2]/2 = (E[X1] + E[X2])/2 = (µ + µ)/2 = µ.

Page 69: Preliminaries - maths.lancs.ac.ukparkj1/math105/m105lecnotes.pdf · PRELIMINARIES 1 Preliminaries These notes are for the Math105 course in statistics. The course introduces the statistical

5.2. PARAMETER ESTIMATION 69

We can generalise to any n:

E[X̄] = E[(X1 + · · · + Xn)/n] = E[X1 + · · · + Xn]/n = (E[X1] + · · · + E[Xn])/n = (µ + · · · + µ)/n = nµ/n = µ.

Note that X̄ is a random variable, and the expectation takes into account all possible values of x̄ that can arise in any particular observation.
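A small simulation illustrates the theorem; the Poisson(10) population and the sample size of 5 are illustrative choices, not from the notes:

```r
set.seed(1)  # for reproducibility
mu <- 10

# sample means of 10000 independent Poisson(mu) samples of size 5
xbars <- replicate(10000, mean(rpois(5, lambda = mu)))
mean(xbars)  # close to mu = 10
```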

Example 5.2.2. Poisson. Suppose X1, X2 are i.i.d. Poisson(θ) random variables and X̄ = (X1 + X2)/2. What is E[X̄]?

E[X̄] = θ.

Sample proportion

Example 5.2.3. Bernoulli. Bernoulli random variables take values in {0, 1}. For example, 5 responses from a survey on the smoking ban might look like:

0, 1, 1, 0, 1 (0 for disagree and 1 for agree).

If we take the average, 3/5 is the proportion of responses that agree with the law. If we take another sample of 5, the proportion may change.

In general, when the random variables X1, . . . , Xn only take the values 0 and 1, X̄ is called the sample proportion.

Sample proportion and the Binomial distribution
Recall that each Xi is called a Bernoulli trial. So if they are i.i.d. (what does that mean?), then the sample proportion is indeed the sample mean of Bernoulli random variables. Moreover, we know that the sum of the random variables follows a Binomial distribution:

Y = Σ_{i=1}^{n} Xi ∼ Binomial(n, θ),

with µ = nθ and σ^2 = nθ(1 − θ). In this case, the sample proportion is simply Y/n, where Y ∼ Binomial(n, θ). In particular,

E[Σ_{i=1}^{n} Xi] = E[Y] = nθ,  E[X̄] = E[Y/n] = θ,

Var[Σ_{i=1}^{n} Xi] = Var[Y] = nθ(1 − θ),  Var[X̄] = Var[Y/n] = θ(1 − θ)/n.
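These two results can be checked by simulation; n = 100 and θ = 0.75 are illustrative choices:

```r
set.seed(2)
n <- 100; theta <- 0.75

# 10000 simulated sample proportions Y/n, with Y ~ Binomial(n, theta)
props <- rbinom(10000, size = n, prob = theta) / n
mean(props)  # close to theta = 0.75
var(props)   # close to theta * (1 - theta) / n = 0.001875
```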

Example 5.2.4. Binomial. Give an interpretation of Y and Y/n for the survey example.

Y represents the total number of people (responses) who agree with the law, and Y/n represents the proportion of people (responses) who agree with the law.

This will be used later to quantify sampling variation of the sample proportion.

Page 70: Preliminaries - maths.lancs.ac.ukparkj1/math105/m105lecnotes.pdf · PRELIMINARIES 1 Preliminaries These notes are for the Math105 course in statistics. The course introduces the statistical

70 CHAPTER 5. STATISTICAL INFERENCE FOR PARAMETRIC MODELS

The method of moments
We learned that the sample mean is a sensible estimator for the unknown population mean E[X], also known as the first moment. This idea can be generalised to E[X^k], k = 1, 2, . . ., the kth moments. Basically, when only a finite sample is available, we can approximate the expectation by a finite sum. We write the k-th moment of the random variable X as µk = E[X^k]. The sample moments are then calculated by

µ̂k = (1/n) Σ_{i=1}^{n} Xi^k with random variables: estimator

µ̂k = (1/n) Σ_{i=1}^{n} xi^k with observations: estimate

If the unknown parameter θ is expressed by the population moments or some function of them, it can be estimated by replacing the population quantities by the corresponding sample quantities.

Example 5.2.5. Poisson distribution.
The first moment for the Poisson distribution is the parameter µ = E[X]. The first sample moment is

X̄ = (1/n) Σ_{i=1}^{n} Xi,

which is the method of moments estimator of µ. Since θ = µ, this is the estimator of the rate parameter θ.

Example 5.2.6. Exponential distribution.
The first moment for the Exponential distribution is µ = E[X], and the method of moments estimator of µ is X̄. Since θ = 1/E[X], this can be estimated by

θ̂ = 1/X̄.

Example 5.2.7. Normal distribution.
The first and second moments for the Normal distribution are

µ1 = E[X] = µ, µ2 = E[X^2] = µ^2 + σ^2.

Thus,

µ = µ1, σ^2 = µ2 − µ1^2.

What are the method of moments estimators of these parameters?

µ̂ = X̄,

σ̂^2 = (1/n) Σ_{i=1}^{n} Xi^2 − X̄^2 = (1/n) Σ_{i=1}^{n} (Xi − X̄)^2.
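The Exponential and Normal estimators above can be sketched in R on simulated data (standing in for real observations; the true rate of 2 is an illustrative choice):

```r
set.seed(3)
x <- rexp(500, rate = 2)  # simulated data standing in for observations

# method of moments estimates
theta.hat  <- 1 / mean(x)            # Exponential rate estimate, near 2
mu.hat     <- mean(x)                # first sample moment
sigma2.hat <- mean(x^2) - mean(x)^2  # second sample moment minus first squared

# the two forms of the variance estimator agree
stopifnot(isTRUE(all.equal(sigma2.hat, mean((x - mean(x))^2))))
```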


5.2.2 How good are method of moments estimates?

The principle of estimation is broadly applicable; however, the quality of estimation varies considerably.

Sampling distribution of sample mean: symmetric case
We conducted our simple experiment of coin tossing in Chapter 1. Supposing that the experiment was repeated 1000 times, Figure 5.1 shows the results when the sample sizes (numbers of trials) are n = 10, n = 20, n = 50, n = 100.

[Figure: four panels, histograms of the number of experiments against θ̂, for n = 10, n = 20, n = 50 and n = 100.]

Figure 5.1: Sampling distribution of θ̂ with increasing sample size.

Figure 5.2 shows additional simulation results for the same experiment when the sample sizes (numbers of trials) are n = 100, 1000, 5000, 10000.

Effect of sample size

Example 5.2.8. Sample size. Summarise your findings about the estimates θ̂ and comment on the effect of sample size.
The estimates, on average, agree with the true value of θ = (0, 0.3, 0.5, 0.7, 1); however, there is considerable variability depending on the sample size. In general, the larger the sample size, the (larger, smaller) the variability in the estimates and thus the (more accurate, less accurate) the estimates become.

The same phenomenon is expected if the result came from a survey instead of a coin tossing experiment, except that the centre of the distribution will change accordingly. The only limitation with the survey is that we would not be able to run the same survey 1000 times to see the effect!

Sampling distribution of sample mean: asymmetric case


72 CHAPTER 5. STATISTICAL INFERENCE FOR PARAMETRIC MODELS

Figure 5.2: Sampling distribution of θ̂ with increasing sample size (histograms of the estimates for n = 100, 1000, 5000, 10000; y-axis: no. experiments)

Suppose that the population proportion agreeing with the law is θ = 0.8 and a sample of size n is taken, where n = 10, 20, 50, 100. Figure 5.3 shows the distribution of possible estimates for the corresponding scenarios, based on simulation.

Example 5.2.9. • What is the underlying distributional model used in Figure 5.3?

X1/10, where X1 ∼ Binomial(10, 0.8); X2/20, where X2 ∼ Binomial(20, 0.8); X3/50, where X3 ∼ Binomial(50, 0.8); X4/100, where X4 ∼ Binomial(100, 0.8).

• How is the shape of the distribution of θ affected by the sample size?

The shape of the distribution of the estimates becomes more symmetric as the sample size increases.

• Explain why a larger sample size would be preferable in practice.

There is smaller variability and a less skewed distribution for the sample mean.

Figure 5.4 shows more results for increasing sample sizes.
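The skewness visible in Figure 5.3 can be quantified with a short simulation sketch (in Python for illustration; `skewness_of_estimates` is an illustrative helper): with θ = 0.8 the sampling distribution of θ̂ = X/n is skewed for small n and becomes more symmetric as n grows.

```python
import random

# Estimate the skewness (third standardised moment) of the sampling
# distribution of theta-hat = X/n by simulation.
def skewness_of_estimates(n, theta=0.8, reps=4000, seed=5):
    rng = random.Random(seed)
    ests = [sum(rng.random() < theta for _ in range(n)) / n for _ in range(reps)]
    m = sum(ests) / reps
    s2 = sum((e - m) ** 2 for e in ests) / reps
    m3 = sum((e - m) ** 3 for e in ests) / reps
    return m3 / s2 ** 1.5

for n in (10, 50, 500):
    print(n, round(skewness_of_estimates(n), 2))  # skewness shrinks towards 0
```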

Variability of sample mean
A mathematical explanation of the behaviour of the sample mean estimates comes from the following property.

Theorem 5.2.10. If X1, · · · , Xn are i.i.d. random variables with expectation E[Xi] = µ and Var[Xi] = σ², then Var[X̄] = σ²/n.


Figure 5.3: Sampling distribution of θ̂ with increasing sample size (histograms of the estimates for n = 10, 20, 50, 100 with θ = 0.8; y-axis: no. experiments)

Figure 5.4: Sampling distribution of θ̂ with increasing sample size (histograms of the estimates for n = 100, 1000, 5000, 10000; y-axis: no. experiments)


Relate this result to Figures 5.1–5.4. Consider the case where n = 2:

Var[X̄] = Var[(X1 + X2)/2] = Var[X1 + X2]/2² = (σ² + σ²)/4 = σ²/2

We can generalise to any n:

Var[X̄] = Var[(X1 + · · · + Xn)/n] = Var[X1 + · · · + Xn]/n² = (σ² + · · · + σ²)/n² = nσ²/n² = σ²/n
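This result can be checked numerically; a quick sketch (in Python for illustration), under the assumption of Uniform(0, 1) data, for which σ² = 1/12:

```python
import random

# Numerical check of Theorem 5.2.10: for Uniform(0,1) data, sigma^2 = 1/12,
# so Var[X-bar] should be close to 1/(12 n).
rng = random.Random(42)
n, reps = 25, 20000
means = [sum(rng.random() for _ in range(n)) / n for _ in range(reps)]
m = sum(means) / reps
var_xbar = sum((x - m) ** 2 for x in means) / reps
print(var_xbar, 1 / (12 * n))  # the two values should be close
```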

How variable the estimates are
If more and more samples were available, the estimates would converge to the true parameter because:

• the variance decreases (converges to zero) as the sample size increases:

Var[X̄] = σ²/n → 0 as n → ∞;

• E[X̄] = θ for all sample sizes, and thus there is no systematic bias.

Therefore the estimate will converge to the true parameter as the sample size increases. Note that these properties of the sample mean do not depend on any particular distributional model, and are thus not limited to the parametric models that we are considering here.

Standard Error
We have seen that it is important to take into account sampling variability in the estimation. As a measure of precision, the standard error is defined as the square root of the variance of the estimator. For an estimator θ̂,

StdError(θ̂) = √Var(θ̂).

If X1, · · · , Xn are an i.i.d. sample from X with E[X] = µ and Var[X] = σ², then the method of moments estimator of µ is µ̂ = X̄ and the standard error is

StdError(µ̂) = σ/√n

In practice when σ is unknown, it will be replaced by its estimate.
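This plug-in standard error can be sketched as follows (in Python for illustration; `std_error_mean` is an illustrative helper and the data are hypothetical):

```python
import math
import statistics

# Standard error of the sample mean, sigma / sqrt(n), with sigma replaced by
# its method of moments estimate (pstdev uses the 1/n divisor, matching
# sigma-hat above).
def std_error_mean(xs):
    return statistics.pstdev(xs) / math.sqrt(len(xs))

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
print(std_error_mean(xs))  # 2 / sqrt(8) ≈ 0.707
```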

Example 5.2.11. Poisson distribution.
If X ∼ Poisson(θ), then we know µ = θ and σ² = θ. Thus, the method of moments estimator of θ is

θ̂ = X̄

and its standard error is given by

StdError(θ̂) = √(θ/n).
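A sketch of Example 5.2.11 in practice (in Python for illustration; `poisson_mom` is an illustrative helper and the counts are hypothetical), with θ replaced by its estimate in the standard error:

```python
import math

# For a Poisson(theta) sample, theta-hat = X-bar and, replacing theta by its
# estimate, StdError(theta-hat) = sqrt(theta-hat / n).
def poisson_mom(xs):
    n = len(xs)
    theta_hat = sum(xs) / n
    return theta_hat, math.sqrt(theta_hat / n)

counts = [3, 1, 4, 2, 0, 2, 3, 1, 2, 2]  # hypothetical count data
print(poisson_mom(counts))  # (2.0, sqrt(0.2) ≈ 0.447)
```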

Example 5.2.12. Binomial distribution.
If X ∼ Binomial(n, θ), then we know µ = nθ and σ² = nθ(1 − θ). Since θ = µ/n, the method of moments estimator of θ is

θ̂ = X/n (the sample proportion),

and the standard error of the estimator is

StdError(θ̂) = √(θ(1 − θ)/n).

Hospitals example
We consider the hospitals example.
Hospital 2: 5 out of 10 operations classified as a success.
What does this tell us about the probabilities of successful operations now?
There are two possible answers, depending on the assumptions:

Independence: this assumption seems reasonable from the context of the problem.

Identically distributed: it is not clear from the context whether the probability of success is the same at each hospital.

We will look at what happens when we assume the successes at the two hospitals are

• identically distributed;

• NOT identically distributed.

We denote by

• X1 – the random variable number of successful operations at the first hospital;

• X2 – the random variable number of successful operations at the second hospital;

We assume that X1 and X2:

• are independent;

• with Xi following Binomial(10, θi) for i = 1, 2.

We observe variable X = (X1,X2) with value x = (x1, x2) = (9, 5).


Non-identically distributed
i.e. θ = (θ1, θ2). The method of moments estimates are

θ̂1 = 9/10, θ̂2 = 5/10,

and the variances are

Var[θ̂1] = θ̂1(1 − θ̂1)/10 = 0.009, Var[θ̂2] = θ̂2(1 − θ̂2)/10 = 0.025.

So the standard errors are

StdError(θ̂1) = √0.009 = 0.095, StdError(θ̂2) = √0.025 = 0.158.

There is no correlation between the estimates. This is reasonable, as the data from each hospital tell us only about the probability of a successful operation for that hospital.

Identically distributed
i.e. θ1 = θ2 = θ. Then we can aggregate the information: the total number of trials is 10 + 10 = 20 and the number of successes is 9 + 5 = 14. So the method of moments estimate is θ̂ = (9 + 5)/(10 + 10) = 0.7 and the variance is

Var(θ̂) = θ̂(1 − θ̂)/(2 × 10) = 0.0105,

so the standard error is

StdError(θ̂) = √0.0105 = 0.1025.
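The hospitals calculations under both assumptions can be reproduced with a short sketch (in Python for illustration; `se` is an illustrative helper):

```python
import math

# Standard error of a sample proportion: sqrt(theta-hat (1 - theta-hat) / n).
def se(theta_hat, n):
    return math.sqrt(theta_hat * (1 - theta_hat) / n)

# Non-identically distributed: estimate each hospital separately.
t1, t2 = 9 / 10, 5 / 10
print(round(se(t1, 10), 3), round(se(t2, 10), 3))  # 0.095 0.158

# Identically distributed: pool the trials, 14 successes in 20.
t = (9 + 5) / (10 + 10)
print(t, round(se(t, 20), 4))  # 0.7 0.1025
```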

Standard error of functions of sample mean
We have seen that X̄ plays an important role in estimating the population mean, and we were able to quantify the variability of the sample mean. In other cases, such as the Exponential distribution in Example 5.2.6, the estimator for θ is 1/X̄, but what is Var[1/X̄]? Certainly,

Var[2/(X1 + X2)] ≠ Var[2/X1] + Var[2/X2].

Taylor approximation: If g is differentiable, then

g(X̄) ≈ g(µ) + g′(µ)(X̄ − µ)

Write

Var[1/X̄] = Var[g(X̄)], where g(x) = 1/x.


So, we can compute the variance using the approximation:

Var[g(X̄)] ≈ Var[g(µ) + g′(µ)(X̄ − µ)] = Var[g′(µ)(X̄ − µ)] = g′(µ)² Var[X̄ − µ] = g′(µ)² Var[X̄]

Verify each step! Note that g′ should be evaluated at µ, which is a function of θ.

Assume X1, · · · , Xn are an i.i.d. sample from X with E[X] = µ and Var[X] = σ². Let X̄ = (1/n) ∑_{i=1}^{n} Xi. The standard error of g(X̄) is

StdError[g(X̄)] = |g′(µ)| σ/√n.
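This approximation can be checked by simulation; a sketch (in Python for illustration), assuming Exponential(θ) data with θ = 2, so that µ = σ = 1/θ and g(x) = 1/x:

```python
import math
import random

# Simulate the sampling distribution of theta-hat = 1/X-bar and compare its
# standard deviation with the Taylor approximation |g'(mu)| sigma / sqrt(n).
theta, n, reps = 2.0, 200, 5000
rng = random.Random(7)

ests = []
for _ in range(reps):
    xbar = sum(rng.expovariate(theta) for _ in range(n)) / n
    ests.append(1 / xbar)  # theta-hat = 1 / X-bar

m = sum(ests) / reps
sd_sim = math.sqrt(sum((e - m) ** 2 for e in ests) / reps)

mu, sigma = 1 / theta, 1 / theta
sd_approx = abs(-1 / mu**2) * sigma / math.sqrt(n)  # |g'(mu)| sigma / sqrt(n)
print(round(sd_sim, 3), round(sd_approx, 3))  # the two values should be close
```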

For θ of dimension two or more, similar expressions hold but are mathematically more difficult, seeMath235 and Math350.

Diseased Trees example
Consider the diseased trees example. For the partial and full data, n = 50 and n = 109 respectively. First let us use general notation for the observations x1, . . . , xn so that we can do the mathematics once for both cases. Here we are assuming the data come from i.i.d. random variables with p.m.f. p(x; θ) = θ^x(1 − θ), so that Θ = [0, 1], µ = E[X] = θ/(1 − θ) and σ² = Var[X] = θ/(1 − θ)².

The method of moments estimate of θ is

θ̂ = x̄/(1 + x̄).

For the partial data,

∑_{i=1}^{50} xi = 0 × 31 + 1 × 16 + 2 × 2 + 3 × 0 + 4 × 1 + 5 × 0 = 24,

so θ̂ = (24/50)/(1 + 24/50) = 0.3243. For the full data,

∑_{i=1}^{109} xi = 0 × 71 + 1 × 28 + 2 × 5 + 3 × 2 + 4 × 2 + 5 × 1 = 57,

so θ̂ = (57/109)/(1 + 57/109) = 0.3434.

For the standard error of θ̂: if g(x) = x/(1 + x), then g′(x) = 1/(1 + x)², so |g′(µ)| = 1/(1 + µ)² = (1 − θ)² and σ = √θ/(1 − θ), so the standard error of θ̂ is

StdError(θ̂) = √θ (1 − θ)/√n.


For the partial data,

StdError(θ̂) = √0.3243 × (1 − 0.3243)/√50 = 0.054

For the full data,

StdError(θ̂) = √0.3434 × (1 − 0.3434)/√109 = 0.037
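The diseased-trees numbers can be reproduced with a short sketch (in Python for illustration; `geometric_mom` is an illustrative helper, and `counts[k]` denotes the number of observations equal to k, read off the sums in the text):

```python
import math

# Method of moments estimate and standard error for the geometric model:
# theta-hat = xbar/(1+xbar), SE = sqrt(theta-hat) (1 - theta-hat) / sqrt(n).
def geometric_mom(counts):
    n = sum(counts.values())
    xbar = sum(k * c for k, c in counts.items()) / n
    theta_hat = xbar / (1 + xbar)
    se = math.sqrt(theta_hat) * (1 - theta_hat) / math.sqrt(n)
    return theta_hat, se

partial = {0: 31, 1: 16, 2: 2, 3: 0, 4: 1, 5: 0}
full = {0: 71, 1: 28, 2: 5, 3: 2, 4: 2, 5: 1}
print(geometric_mom(partial))  # approx (0.3243, 0.054)
print(geometric_mom(full))     # approx (0.3434, 0.037)
```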

5.2.3 Interval estimation

As was seen in Figures 5.1–5.4, although the (best) single estimate may give a sense of the true value of the unknown parameter, because of sampling variation the precision of each estimate varies greatly. We quantified the precision of an estimator by its standard error. We can combine this information into an interval estimate.

Confidence region
Instead of picking out one single value, we choose a set of parameter values which are consistent with the observed data. We term such a set a confidence region for θ.

The estimator of the confidence region is a random region, which has probability 1 − α of containing the true value θ0.

If the parameter of interest is 1-dimensional, it is called a confidence interval. Based on θ̂, how do we choose a confidence region to ensure the required probability?

Probability of an interval
We first look at the case where we know the underlying distribution is Normal. Write z(p) for the pth quantile of the Normal(0, 1) distribution. If X ∼ Normal(µ, σ²) then

Pr(µ− |z(α/2)|σ ≤ X ≤ µ + |z(α/2)|σ) = 1− α

You may recall Figure 2.11. In other words,

Pr(µ ∈ [X − |z(α/2)|σ,X + |z(α/2)|σ]) = 1− α

If X1, · · · , Xn are i.i.d. random variables from the Normal(µ, σ²) distribution and X̄ = (1/n) ∑_{i=1}^{n} Xi, then

X̄ ∼ Normal(µ, σ²/n).


The proof of this result is given in MATH 230. Therefore, we can choose an interval that has the required probability:

Pr(µ ∈ [X̄ − |z(α/2)| σ/√n, X̄ + |z(α/2)| σ/√n]) = 1 − α

Hence, [X̄ − |z(α/2)| σ/√n, X̄ + |z(α/2)| σ/√n] is a 100(1 − α)% confidence interval for µ.

Central Limit Theorem

If X1, · · · , Xn are i.i.d. random variables from an unknown distribution and X̄ = (1/n) ∑_{i=1}^{n} Xi, then

X̄ ∼ approximately Normal(µ, σ²/n).

No matter what distribution the original data come from, the sample mean approximately follows a Normal distribution if you have a large enough sample. This is one of the most significant results in statistics. Again, the formal proof will be given in MATH 230; here we use our intuition and the informal justification shown in Figures 5.1–5.4. This result allows us to construct (approximate) confidence intervals as we did for the Normal data.

Confidence interval
Generally, an approximate 100(1 − α)% confidence interval can be constructed from the standard error of an estimator. The standard error is the factor which determines the width of confidence intervals for θ.

The 100(1 − α)% confidence interval for θ is

(θ̂ − |z(α/2)| × StdError(θ̂), θ̂ + |z(α/2)| × StdError(θ̂)),

where z(p) is the pth quantile of the Normal(0, 1) distribution.

Example 5.2.13. Hospitals. Find the 95% confidence interval for θ1 under the two assumptions considered earlier. Use z(0.025) = −1.96.

• Non-identically distributed:

(θ̂1 − 1.96 × StdError(θ̂1), θ̂1 + 1.96 × StdError(θ̂1)) = (0.7138, 1.0862)

• Identically distributed:

(θ̂1 − 1.96 × StdError(θ̂1), θ̂1 + 1.96 × StdError(θ̂1)) = (0.4992, 0.9008)

What do we learn from this information?
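The two intervals in Example 5.2.13 can be reproduced with a short sketch (in Python for illustration; `ci95` is an illustrative helper, and the standard errors are those computed earlier):

```python
import math

# 95% confidence interval: theta-hat ± 1.96 × StdError(theta-hat).
def ci95(theta_hat, std_error):
    return theta_hat - 1.96 * std_error, theta_hat + 1.96 * std_error

# Non-identically distributed: theta1-hat = 0.9, StdError = 0.095.
print(tuple(round(v, 4) for v in ci95(0.9, 0.095)))              # (0.7138, 1.0862)
# Identically distributed: theta-hat = 0.7, StdError = sqrt(0.0105).
print(tuple(round(v, 4) for v in ci95(0.7, math.sqrt(0.0105))))  # (0.4992, 0.9008)
```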

Interpretation of confidence interval

Example 5.2.14. Suppose that all our MATH 105 students take a sample of the same size from the UG population for a smoking-ban survey, and each student, based on his/her own sample, constructs a 95% confidence interval for the proportion of UG students who agree with the law.

• Would all the confidence intervals be the same?

Probably not.

• Would the length of all the confidence intervals be the same?

Yes!

• Would your confidence interval contain the true value of population proportion?

We do not know whether each confidence interval contains the true value. However, we can expect that approximately 95% of those intervals would contain the true value.

• Would exactly 95 out of 100 intervals contain the true value of population proportion?

No; it is possible that all intervals could contain the true value by chance, or that more than 5% of the intervals may not contain the true value. Our intervals are only a sample of all possible confidence intervals!

Recall that a confidence interval is a random variable. Each confidence interval constructed from the observed data is a realisation of that random variable. Figure 5.5 illustrates 95% confidence intervals constructed from 20 different samples. Thus care is required in interpreting estimated confidence regions or intervals; probably the safest way is to use them as a measure of the precision of the estimate.
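This interpretation can be illustrated by simulation; a sketch (in Python for illustration), assuming a hypothetical true proportion θ0 = 0.6: many "students" each draw a sample and build a 95% interval, and roughly (but not exactly) 95% of the intervals contain θ0.

```python
import math
import random

# Each of `students` simulated surveys yields an estimate and a 95% CI;
# count how many intervals contain the true proportion theta0.
theta0, n, students = 0.6, 400, 1000
rng = random.Random(3)

covered = 0
for _ in range(students):
    x = sum(rng.random() < theta0 for _ in range(n))
    th = x / n
    half = 1.96 * math.sqrt(th * (1 - th) / n)
    if th - half <= theta0 <= th + half:
        covered += 1

print(covered / students)  # close to, but not exactly, 0.95
```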


Figure 5.5: Illustration of 95% confidence intervals, θ̂ ± 1.96 × StdErr, shown against the sampling distribution of θ̂