
Statistics 111

Updated lecture notes

Fall 2013

Warren J. Ewens, [email protected], Room 324 Leidy Labs

(corner 38th St and Hamilton Walk).

These notes provide an outline of some of the material to be discussed in the first few lectures of STAT 111. They are important because they provide the framework for all the material given during the entire course. Also, much of this material is not given in the textbook for the course.


Introduction and basic ideas

What is Statistics?

Statistics is the science of analyzing data in whose generation chance has played some part. This explains why statistical methods are important in many areas, including for example sociology, psychology, biology, economics and anthropology. In these areas there are many chance mechanisms at work. First, in biology the random transmission of one chromosome from a pair of chromosomes from parent to offspring introduces a chance mechanism into many areas of genetics. Second, in all the above areas data are usually derived from some random sample of individuals. A different sample would almost certainly yield different data, so that the sampling process introduces a second chance element. Finally, in economics the values of quantities such as the Dow Jones industrial average cannot be predicted in advance, since their values are affected by many chance events that we cannot know of in advance.

Everyone knows that you cannot make much progress in such areas as physics, astronomy and chemistry without using mathematics. Similarly, you cannot make much progress in such areas as psychology, sociology and biology without using statistics.

Because of the chance, or random, aspect in the generation of statistical data, it is necessary, in discussing statistics, to also consider aspects of probability theory. The syllabus for this course thus starts with an introduction to probability theory, and this is reflected in these introductory notes. But before discussing probability theory, we have to discuss the relation between probability theory and statistics.

The relation between probability theory and statistics

Most of the examples given in the class concern simple situations and are not taken from the sociological, psychological, etc. contexts. This is done so that the basic ideas of probability theory and statistics will not be confounded with the complexities arising in those areas. So we start here with a simple example concerning the flipping of a coin.

Suppose that we have a coin that we suspect of being biased towards heads. To check up on this suspicion we flip the coin (say) 2,000 times and observe the number of heads that we get. If the coin is fair, we would, beforehand, expect to see about 1,000 heads. If once we flipped the coin we got 1,973 heads we would obviously (and reasonably) claim that we have very good evidence that the coin is biased towards heads. If you think about it, the reasoning that you went through in coming to this conclusion was something like this: “If the coin is fair it is extremely unlikely that I would get 1,973 heads from 2,000 flips. Thus since I did in fact get 1,973 heads, I have strong evidence that the coin is unfair.”

Equally obviously, if we got 1,005 heads, we would conclude that we do not have good evidence that the coin is biased towards heads. Again, the reason for coming to this conclusion is that a fair coin can easily give 1,005 (or more) heads from 2,000 flips.


But these are extreme cases, and reality often has to deal with more gray-area cases. What if we saw 1,072 heads? Intuition and common sense might not help in such a case. What we have to do is to calculate the probability that we would get 1,072 or more heads if the coin is fair. If this probability is low we might conclude that we have significant evidence that the coin is biased towards heads. If this probability is fairly high we might conclude that we do not have significant evidence that the coin is biased.

The conclusion that we draw is an act of statistical inference, or a statistical induction. An inference, or an induction, is a conclusion that we draw about reality, based on some observation or observations. The reason why this is a statistical inference (or induction) is that it is based on a probability calculation. No statistical inference can be made without first making the relevant corresponding probability calculation.

In the above example, probability theory calculations (which we will do later) show that the probability of getting 1,072 or more heads from 2,000 flips of a fair coin is very low (less than 0.01). Thus having observed 1,072 heads in our 2,000 flips, we would reasonably conclude that we have significant evidence that the coin is biased.

Here is a more important example. Suppose that we are using some medicine (the “current” medicine) to cure some illness. From long experience we know that, for any person having this illness, the probability that this current medicine cures that person is 0.8. A new medicine is proposed as being better than the current one. To test whether this claim is justified we plan to conduct a clinical trial, in which the new medicine will be given to 2,000 people suffering from the disease in question. If the new medicine is as effective as the current one we would, beforehand, expect it to cure about 1,600 of these people. If after the clinical trial is conducted the proposed new medicine cured 1,945 people, no-one would doubt that it is better than the current medicine. Again, the reason for this opinion is something like: “If the new medicine has the same cure rate as the current one, it is extremely unlikely that it would cure 1,945 people out of 2,000. But it did cure 1,945 people, and therefore I have significant evidence that its cure rate is higher than that of the current medicine.”

But, equally obviously, if the proposed medicine cured 1,615 people we would not have strong evidence that it is better than the current medicine. The reason for this is that if the new medicine is as effective as the current one, that is, if the probability of a cure with the new medicine is the same (0.8) as that for the current medicine, we can easily observe 1,615 (or more) people cured with the new medicine.

Again these are extreme cases, and reality often has to deal with more gray-area cases. What if the new medicine cured 1,628 people? Intuition and common sense might not help in such a case. What we have to do is to calculate the probability that we would get 1,628 or more people cured with the new medicine if it is as effective as the current medicine. This probability is about 0.11, and because this is not a really small probability we might conclude that we do not have significant evidence that the new medicine is superior to the current one. Drawing this conclusion is an act of statistical inference.

Statistics is a difficult subject for two reasons. First, we have to think of the situation both before and after our experiment (in the medicine case the experiment is giving the new medicine to the individuals in the clinical trial), and go back and forth several times between these two time points in any statistical operation. This is not easy. Second, before the experiment we have to consider aspects of probability theory. Unfortunately our minds are not wired up well to think in terms of probabilities. (Think of the “two fair coins” example given in class, and also the Monty Hall situation.)

The central point is this: no statistical operation can be carried out without considering the situation before the experiment is performed. Because, at this time point, we do not know what will happen in our experiment, these considerations involve probability calculations. We therefore have to consider the general features of the probability theory “before the experiment” situation and the relation between these aspects and the statistical “after the experiment” aspects. We will do this later, after first looking more closely at the relation between the deductive processes of probability and the inductive processes of statistics.

Deductions (implications) and inductions (inferences)

Probability theory is a deductive activity, and uses deductions (also called implications). It starts with some assumed state of the world (for example that the coin is fair), and enables us to make various probability calculations relevant to our proposed experiment. Statistics is the converse, or inductive, operation, and uses inductions (also called inferences). It starts with data from our experiment and attempts to make objective statements about the unknown real world (whether the coin is fair or not). These inductive statements are always based on some probability calculation. The relation between probability and statistics can be seen from the following diagram:

Some unknown reality            →→→  Probability theory (deductive)  →→→       Uses data (what is
and a hypothesis about it.                                                      observed in an experiment)
                                ←←←  Statistical inference (inductive)  ←←←    to test this hypothesis.

This diagram makes it clear that to learn how to conduct a statistical procedure we first have to discuss probability on its own. We now do this.


Probability Theory

Events and their probabilities

As has been discussed above, any discussion of Statistics requires a prior discussion of probability theory. In this section an introduction to probability theory is given as it applies to probabilities of events.

Events

An event is something which does or does not happen when some experiment is performed, field survey is conducted, etc. Consider for example a Gallup poll, in which (say) 2,000 people are asked, before an election involving two candidates, Smith and Jones, whether they will vote for Smith or Jones. Here are some events that could occur:-

1. More people say that they will vote for Smith than say they will vote for Jones.
2. At least 1,200 people say that they will vote for Jones.
3. Exactly 1,124 people say that they will vote for Smith.

We will later discuss probability theory relating to Gallup polls. However, all the examples given below relate to events involving rolling a six-sided die, since that is a simple and easily understood situation. Here are some events that could occur in that context:-

1. An even number turns up.
2. The number 3 turns up.
3. A 3, 4, 5 or a 6 turns up.

Clearly there are many other events that we could consider. Also, with two or more rolls of the die, we have events like “a 6 turns up both times on two rolls of a die”, “in ten rolls of a die, a 3 never turns up”, and so on.

Notation

We denote events by upper-case letters at the beginning of the alphabet, and also the letter S and the symbol ∅. So in the die-rolling example we might have:-

1. A is the event that an even number turns up.
2. B is the event that the number 3 turns up.
3. C is the event that a 3, 4, 5 or a 6 turns up.

The letter S has a special meaning. In the die-rolling example it is the event that the number turning up is 1, 2, 3, 4, 5 or 6. In other words, S is the certain event. It comprises all possible events that could occur.

The symbol ∅ also has a special meaning. This is the so-called “empty” event. It is an event that cannot occur, such as rolling both an even number and an odd number in one single roll of a die. It is an important event when considering intersections of events - see below.

Unions, intersections and complements of events

Given a collection of events we can define various derived events. The most important of these are unions of events, intersections of events, and complements of events. These are defined as follows:-

(i) Unions of events: If D and E are events, the union of D and E, written D ∪ E, is the event that either D, or E, or both occur. In the die-rolling example above, A ∪ B is the event that a 2, 3, 4 or 6 turns up, A ∪ C is the event that a 2, 3, 4, 5 or 6 turns up, and B ∪ C is the event that a 3, 4, 5 or a 6 turns up. (Notice that in this case B ∪ C is the same as C.)

(ii) Intersections of events: If D and E are events, the intersection of D and E, written D ∩ E, is the event that both D and E occur. In the die-rolling example above, A ∩ B is the empty event ∅, since A and B cannot both occur, A ∩ C is the event that a 4 or a 6 turns up, and B ∩ C is the event that the number 3 turns up. (Notice that in this case B ∩ C is the same as B.)

(iii) Complements of events: If D is an event, Dc is the complementary event to D. It is the event that D does not occur. In the three examples above, Ac is the event that an odd number turns up, Bc is the event that some number other than 3 turns up, and Cc is the event that a 1 or a 2 turns up.

Probabilities of events

The concept of a probability is quite a complex one. These complexities are not discussed here: we will be satisfied with a straightforward intuitive concept of probability as in some sense meaning a long-term frequency. Thus we would say, for a fair coin, that the probability of a head is 1/2, in the sense that we think that in a very large number of flips of this coin, we will get a head almost exactly half the time. We are interested here in the probabilities of events, and we write the probability of the event A as P(A), the probability of the event B as P(B), and so on.


Probabilities of derived events

The probabilities for the union and the intersection of two events are linked by the following equation. If D and E are any two events,

P(D ∪ E) = P(D) + P(E) − P(D ∩ E).

To check this equation in the die-rolling example, note (using the fair-die probabilities listed below) that P(A ∪ C) = 5/6, and this is given by P(A) + P(C) − P(A ∩ C) = 1/2 + 2/3 − 1/3 = 5/6. The other examples can be checked similarly.

It is always true that for any event D, P(Dc) = 1 − P(D). This is obvious: the probability that D does not occur is 1 minus the probability that D does occur.

Suppose that the die in the die-rolling example is fair. Then the probabilities of the various union and intersection events discussed above are as follows:-

P(A) = 1/2.    P(A ∪ B) = 2/3.    P(A ∩ B) = 0.
P(B) = 1/6.    P(A ∪ C) = 5/6.    P(A ∩ C) = 1/3.
P(C) = 2/3.    P(B ∪ C) = 2/3.    P(B ∩ C) = 1/6.

Notice that the probability of the empty event ∅ is 0.
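
These fair-die probabilities can be verified by brute-force enumeration. The following short sketch (an addition to these notes, assuming Python and its standard library) lists each event as a set of faces and computes probabilities as exact fractions:

    from fractions import Fraction

    A = {2, 4, 6}        # an even number turns up
    B = {3}              # the number 3 turns up
    C = {3, 4, 5, 6}     # a 3, 4, 5 or a 6 turns up

    def prob(event):
        # each of the six faces has probability 1/6 under the fair-die assumption
        return Fraction(len(event), 6)

    print(prob(A), prob(B), prob(C))    # 1/2 1/6 2/3
    print(prob(A | C), prob(A & C))     # union 5/6, intersection 1/3
    # check P(A ∪ C) = P(A) + P(C) − P(A ∩ C)
    assert prob(A | C) == prob(A) + prob(C) - prob(A & C)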

Mutually exclusive events

Two events D and E are “mutually exclusive” if they cannot both occur together. Then their intersection is the empty event and

P(D ∪ E) = P(D) + P(E). (1)

A generalization of this formula is that if D, E, F, . . . , H are all mutually exclusive events (that is, no two of them can happen together), then

P(D ∪ E ∪ F ∪ · · · ∪ H) = P(D) + P(E) + P(F) + · · · + P(H). (2)

Independence of events

Two events D and E are said to be independent if P(D ∩ E) = P(D) × P(E). It can be seen from the calculations given above that A and B are not independent, B and C are not independent, but that A and C are independent.

The intuitive meaning of independence is that if two events are independent, and if you are told that one of them has occurred, then this information does not change the probability that the other event occurs. Thus in the above example, if you are given the information that an even number turned up (event A), then the probability that a 3, 4, 5 or a 6 turns up (event C) is still 2/3, which is the probability of the event C without this information being given. Similarly if you are told that the number that turned up was a 3, 4, 5 or 6, then the probability that an even number turned up is still 1/2. (The calculations confirming this are given in the next section.)

The generalization of the above is the following: if D, E, . . . , H are independent events, then the probability of the intersection of all these events is

P(D ∩ E ∩ F ∩ · · · ∩ H) = P(D) × P(E) × P(F) × · · · × P(H). (3)

An example

A fair six-sided die is to be rolled twice. What is the probability that the sum of the two numbers to turn up is 6?

This sum can be 6 in five mutually exclusive ways:-

1 turns up on the first roll, 5 turns up on the second roll (event D).
2 turns up on the first roll, 4 turns up on the second roll (event E).
3 turns up on the first roll, 3 turns up on the second roll (event F).
4 turns up on the first roll, 2 turns up on the second roll (event G).
5 turns up on the first roll, 1 turns up on the second roll (event H).

The union of these five events is the event “the sum of the two numbers to turn up is 6”.

Next, the number to turn up on the first roll is independent of the number to turn up on the second roll. Thus the probability of each of the above five events is (1/6) × (1/6) = 1/36.

Using the formula for the probability of the union of mutually exclusive events, the probability of the event “the sum of the two numbers to turn up is 6” is

1/36 + 1/36 + 1/36 + 1/36 + 1/36 = 5/36.

The calculations above assume that the die is fair. For an unfair die we might reach a different conclusion than the one that we reach for a fair die. For example, if the die is biased, so that the probabilities for a 1, 2, 3, 4, 5 or 6 turning up are, respectively, 0.1, 0.3, 0.1, 0.2, 0.2 and 0.1, then with the events A, B and C as defined above, P(A) = 0.6, P(C) = 0.6 and P(A ∩ C) is 0.3. Since 0.6 × 0.6 = 0.36 ≠ 0.3, the events A and C are now not independent.
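
Both calculations can be confirmed numerically. The sketch below (an addition, assuming Python) enumerates the 36 equally likely ordered pairs for the fair die, and then recomputes P(A), P(C) and P(A ∩ C) under the biased probabilities just given:

    from fractions import Fraction
    from itertools import product

    # fair die: all 36 ordered pairs of two rolls are equally likely
    pairs = list(product(range(1, 7), repeat=2))
    p_sum_6 = Fraction(sum(1 for a, b in pairs if a + b == 6), 36)
    print(p_sum_6)                              # 5/36

    # biased die from the example above
    p = {1: 0.1, 2: 0.3, 3: 0.1, 4: 0.2, 5: 0.2, 6: 0.1}
    A = {2, 4, 6}
    C = {3, 4, 5, 6}
    P_A = sum(p[i] for i in A)                  # 0.6
    P_C = sum(p[i] for i in C)                  # 0.6
    P_AC = sum(p[i] for i in A & C)             # faces 4 and 6: 0.3
    print(P_A * P_C, P_AC)                      # 0.36 versus 0.3, so not independent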

Conditional probabilities

We often wish to calculate the probability of some event D, given that some other event E has occurred. Such a probability is called a conditional probability, and is denoted P(D|E).


The conditional probability P(D|E) is calculated by the formula

P(D|E) = P(D ∩ E) / P(E). (4)

It is essential to calculate P(D|E) using this formula: using any other approach, and in particular using “common sense”, will usually give an incorrect answer.

If the events D and E are independent, then P(D|E) = P(D). In other words, D and E are independent if the knowledge that E has occurred does not change the probability that D occurs. In the “fair die” example given above, equation (4) shows that P(A|C) = (1/3)/(2/3) = 1/2, and this is equal to P(A). This confirms that A and C are independent (for a fair die). In the “unfair die” example given above, equation (4) shows that P(A|C) = 0.3/0.6 = 0.5, and this is not equal to P(A), which is 0.6. This confirms that for this unfair die A and C are not independent.
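
A two-line Python sketch of equation (4) (added here for illustration) reproduces both of these conditional probabilities:

    def cond_prob(p_d_and_e, p_e):
        # equation (4): P(D | E) = P(D ∩ E) / P(E)
        return p_d_and_e / p_e

    print(cond_prob(1/3, 2/3))   # fair die:   P(A|C) = 0.5 = P(A), so independent
    print(cond_prob(0.3, 0.6))   # unfair die: P(A|C) = 0.5, but P(A) = 0.6, so not independent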

Probability: One Discrete Random Variable

Random variables and data

In this section we define some terms that we will use often. We do this in terms of the coin flipping example, but the corresponding definitions for other examples are easy to imagine.

Before we flip the coin the number of heads that we will get is unknown to us. This number is therefore called a “random variable”. It is a “before the experiment” concept. By the word “data” we mean the observed value of a random variable once some experiment is performed. In the coin example, once we have flipped the coin the “data” is simply the number of heads that we did get. It is the observed value of the random variable once the “experiment” of flipping the coin is carried out. It is an “after the experiment” concept.

To assist us with keeping the distinction between random variables and data clear, and as a matter of notational convention, a random variable (a “before the experiment is carried out” concept) is always denoted by an upper-case Roman letter. We use the upper-case letter X in these notes for this purpose. It is a concept of our mind - at this stage we do not know its value. In the coin example the random variable X is the “concept of our mind” number of heads we will get, tomorrow, when we flip the coin.

The notation for the “after the experiment is done” data is the corresponding lower-case letter. So after we have flipped the coin we would denote the number of heads that we did get by the corresponding lower-case letter x. Thus it makes sense, after the coin has been flipped, to say “x = 1,142”. It does not make sense before the coin is flipped to say X = 1,142. This second statement “does not compute”.

There are therefore two notational conventions that we always use: upper-case Roman letters for random variables, lower-case Roman letters for data. We will later find a third notational convention (for “parameters”).


Definition: one discrete random variable

In this section we give informal definitions for discrete random variables and their probability distributions rather than the formal definitions often found in statistics textbooks. Continuous random variables will be considered in a later section.

A discrete random variable is a numerical quantity that in some future experiment that involves some degree of randomness will take one value from some discrete set of possible values. These possible values are usually known before the experiment: in the coin example the possible values of X, the number of heads that will turn up, tomorrow, when we will flip the coin 2,000 times, are clearly 0, 1, 2, 3, . . . , 2,000. In practice the possible values of a discrete random variable often consist of the numbers 0, 1, 2, 3, . . . , k, for some number k.

The probability distribution of a discrete random variable; parameters

The probability distribution of a discrete random variable X is a listing of the possible values that this random variable can take, together with their respective probabilities. If there are k possible values of X, namely v1, v2, . . . , vk, with respective probabilities P(v1), P(v2), . . . , P(vk), this probability distribution can be written generically as

Possible values of X            v1       v2      . . .    vk
Respective probabilities        P(v1)    P(v2)   . . .    P(vk)          (5)

In some cases we know (or hypothesize) the probabilities of the possible values v1, v2, . . . , vk. For example, if in the coin example we know that the coin is fair, the probability distribution of X, the number of heads that we would get on two flips of the coin, is found as follows.

The possible values of X are 0, 1 and 2. We will get 0 heads if both flips give tails, and since the outcomes of the two flips are independent, the probability of this is (1/2) × (1/2) = 1/4. We can get 1 head in two ways: head on the first flip and tail on the second, and tail on the first flip and head on the second. These events are mutually exclusive, and arguing as above each has probability 1/4. Thus the total probability of 1 head is 1/2. Similarly the probability of 2 heads is 1/4. This leads to the following probability distribution:-

Possible values of X            0      1      2
Respective probabilities        .25    .50    .25          (6)

Here P(0) = .25, P(1) = .5, P(2) = .25.
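
As a minimal check (an addition to these notes, assuming Python), this distribution can be obtained by enumerating the four equally likely outcomes of two flips of a fair coin:

    from fractions import Fraction
    from itertools import product

    dist = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
    for flips in product("HT", repeat=2):        # HH, HT, TH, TT
        dist[flips.count("H")] += Fraction(1, 4)

    for heads, p in sorted(dist.items()):
        print(heads, p)                          # 0 1/4, then 1 1/2, then 2 1/4, matching (6)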

Suppose more generally that the probability of getting a head on any flip is some value θ. We continue to define X as the number of heads that we would get on two flips of the coin, and the possible values of X are still 0, 1 and 2. We will get 0 heads if both flips give tails, and since the outcomes of the two flips are independent, the probability of this is (1 − θ) × (1 − θ) = (1 − θ)². As above, we can get 1 head in two ways: head on the first flip and tail on the second, and tail on the first flip and head on the second. These events are mutually exclusive, and arguing as above each has probability θ(1 − θ). Thus the probability of getting one head is 2θ(1 − θ). Similarly the probability of 2 heads is θ². This leads to the following probability distribution:-

Possible values of X            0            1             2
Respective probabilities        (1 − θ)²     2θ(1 − θ)     θ²          (7)

In this case,

P(0) = (1 − θ)², P(1) = 2θ(1 − θ), P(2) = θ². (8)

Here θ is a so-called “parameter”: see more on these below. The probability distribution (8) can be generalized to the case of an arbitrary number of flips of the coin - see (9) below.

The binomial distribution

There are many important discrete probability distributions that arise often in the applications of probability and statistics to real-world problems. Each one of these distributions is appropriate under some collection of requirements specific to that distribution. Here we focus only on the most important of these distributions, namely the binomial distribution, and consider first the requirements for it to be appropriate.

The binomial distribution arises if, and only if, all four of the following requirements hold. First, we plan to conduct some fixed number n of trials. (By “fixed” we mean fixed in advance, and not, for example, determined by the outcomes of the trials as they occur.) Second, there must be exactly two possible outcomes on each trial. The two outcomes are often called, for convenience, “success” and “failure”. (Here we might regard getting a head on the flip of a coin as a success and a tail as a failure.) Third, the various trials must be independent - the outcome of any trial must not affect the outcome of any other trial. Finally, the probability of success must be the same on all trials. One must be careful when using a binomial distribution that all four of these conditions hold. We reasonably believe that these conditions hold when flipping a coin.

We often denote the probability of success on each trial by θ, since in practice this is often unknown. That is, it is a parameter. The random variable of interest is the total number X of successes in the n trials. The probability distribution of X is given by the (binomial distribution) formula

P(x) = (n choose x) θ^x (1 − θ)^(n−x),    x = 0, 1, 2, . . . , n. (9)

The binomial coefficient, written (n choose x) and spoken as “n choose x”, is the number of different orderings in which x successes can arise in the n trials. It is calculated as n!/(x!(n − x)!), where x! = x(x − 1)(x − 2) · · · 3 · 2 · 1.


The factor 2 in (8) is an example of a binomial coefficient, reflecting the fact that there are two orders (success followed by failure and failure followed by success) in which we can obtain one success and one failure in two trials. This is also given by the calculation 2!/(1! 1!) = 2.

In the expression (9), θ is the parameter, and n is called the index, of the binomial distribution. The probabilities in (8) are binomial distribution probabilities for the case n = 2, and can be found from (9) by putting n = 2 and considering the respective values x = 0, 1 and 2.
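
Formula (9) is easy to evaluate in software. The sketch below (an addition to these notes; it assumes Python with the SciPy library available) checks that (9) with n = 2 reproduces (8), and computes the kind of tail probability used in the coin example earlier (1,072 or more heads in 2,000 flips of a fair coin):

    from math import comb
    from scipy.stats import binom

    def binom_pmf(x, n, theta):
        # equation (9): P(x) = (n choose x) * theta^x * (1 - theta)^(n - x)
        return comb(n, x) * theta**x * (1 - theta)**(n - x)

    print([binom_pmf(x, 2, 0.5) for x in range(3)])   # [0.25, 0.5, 0.25], matching (8)

    # P(X >= 1072) for n = 2000, theta = 0.5; sf(k, ...) gives P(X > k), so use k = 1071
    print(binom.sf(1071, 2000, 0.5))                  # well below 0.01, as claimed earlier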

Parameters

The quantity θ introduced above is a “parameter”. In general a parameter is some unknown constant. In the binomial case it is the unknown probability of success in (9). Almost all of Statistics consists of:-

(i) Estimating the value of a parameter.
(ii) Giving some idea of the precision of our estimate of a parameter (sometimes called the “margin of error”).
(iii) Testing hypotheses about the value of a parameter.

We shall consider these three activities later in the course. In the coin example, these would be:-

(i) Estimating the value of the binomial parameter θ.
(ii) Giving some idea of the precision of our estimate of this parameter.
(iii) Testing hypotheses about the numerical value of this parameter, for example testing the hypothesis that θ = 1/2.

The mean of a discrete random variable

The mean of a random variable is often confused with the concept of an average, and it is important to keep a clear distinction between the two concepts. The mean of the discrete random variable X whose probability distribution is given in (5) above is defined as

v1P(v1) + v2P(v2) + · · · + vkP(vk). (10)

In more mathematical notation this is

∑_{i=1}^{k} viP(vi), (11)


the summation being over all possible values (v1, v2, . . . , vk) that the random variable X can take. As an example, the mean of a random variable having the binomial distribution (9) is

∑_{x=0}^{n} x (n choose x) θ^x (1 − θ)^(n−x), (12)

and this can be shown, after some algebra, to be nθ.

As a second example, consider the (random) number (which we denote by X) to turn up when a die is rolled. The possible values of X are 1, 2, 3, 4, 5 and 6. If the die is fair, each of these values has probability 1/6. Application of equation (10) shows that the mean of X is

1 × (1/6) + 2 × (1/6) + 3 × (1/6) + 4 × (1/6) + 5 × (1/6) + 6 × (1/6) = 3.5. (13)

Suppose on the other hand that the die is unfair, and that the probability distribution of the (random) number X to turn up is:-

Possible values of X            1       2       3       4       5       6
Respective probabilities        0.15    0.25    0.10    0.15    0.30    0.05          (14)

In this case the mean of X is

1 × 0.15 + 2 × 0.25 + 3 × 0.10 + 4 × 0.15 + 5 × 0.30 + 6 × 0.05 = 3.35. (15)

There are several important points to note about the mean of a discrete random variable:-

(i) The notation µ is often used for a mean. In many practical situations the mean µ of a discrete random variable X is unknown to us, because we do not know the numerical values of the probabilities P(x). That is to say, µ is a parameter, and this is why we use Greek notation for it. As an example, if in the binomial distribution case we do not know the value of the parameter θ, then we do not know the value µ (= nθ) of the mean of that distribution.

(ii) The mean of a probability distribution is its center of gravity, that is, its “knife-edge balance point”.

(iii) Testing hypotheses about the value of a mean is perhaps the most important of statistical operations. An important example of tests of hypotheses about means is a t test. Different t tests will be discussed in this course.

(iv) The word “average” is not an alternative for the word “mean”, and has a quite different interpretation from that of “mean”. This distinction will be discussed often in class.


The variance of a discrete random variable

A quantity of importance equal to that of the mean of a random variable is its variance. The variance (denoted by σ²) of the discrete random variable X whose probability distribution is given in (5) above is defined by

σ² = (v1 − µ)²P(v1) + (v2 − µ)²P(v2) + · · · + (vk − µ)²P(vk). (16)

In more mathematical terms we write this as

σ² = ∑_{i=1}^{k} (vi − µ)²P(vi), (17)

the summation being taken over all possible values of the random variable X.

In the case of a fair die, we have already calculated (in (13)) the mean of X, the (random) number to turn up on a roll of the die, to be 3.5. Application of (16) shows that the variance of X is

σ² = (1 − 3.5)² × (1/6) + (2 − 3.5)² × (1/6) + (3 − 3.5)² × (1/6) + (4 − 3.5)² × (1/6) + (5 − 3.5)² × (1/6) + (6 − 3.5)² × (1/6) = 35/12. (18)
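
These mean and variance calculations follow directly from the definitions (10) and (16), and can be checked with a short sketch (an addition to these notes, assuming Python):

    from fractions import Fraction

    faces = [1, 2, 3, 4, 5, 6]
    fair = [Fraction(1, 6)] * 6

    mu = sum(v * p for v, p in zip(faces, fair))               # definition (10)
    var = sum((v - mu)**2 * p for v, p in zip(faces, fair))    # definition (16)
    print(mu, var)                                             # 7/2 and 35/12, as in (13) and (18)

    # the unfair die of (14): its mean is approximately 3.35, as in (15)
    unfair = [0.15, 0.25, 0.10, 0.15, 0.30, 0.05]
    print(sum(v * p for v, p in zip(faces, unfair)))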

There are several important points to note about the variance of a discrete random variable:-

(i) The variance has the standard notation σ², anticipated above.

(ii) The variance is a measure of the dispersion of the probability distribution of the random variable around its mean. Thus a random variable with a small variance is likely to be close to its mean (see Figure 1).

[Figure 1: sketches of a probability distribution with a smaller variance and one with a larger variance.]


(iii) A quantity that is often more useful than the variance of a probability distribution is the standard deviation. This is defined as the positive square root of the variance, and (naturally enough) is denoted by σ.

(iv) The variance, like the mean, is often unknown to us. This is why we denote it by a Greek letter.

(v) The variance of the number of successes in the binomial distribution (9) can be shown, after some algebra, to be nθ(1 − θ).

Many Random Variables

Introduction

Almost every application of statistical methods in psychology, sociology, biology and similar areas requires the analysis of many observations. For example, if a psychologist wanted to assess the effects of sleep deprivation on the time needed to answer the questions in a questionnaire, he/she would want to test a fairly large number of people in order to get reasonably reliable results. Before this experiment is performed, the various times that the people in the experiment will need to answer the questions are all random variables.

In line with the approach in this course, ideas about many observations will often be discussed in the simple case of rolling a die (fair or unfair) many times. Here the observations are the numbers that turn up on the various rolls of the die. If we wish to test whether this die is fair, we would plan to roll it many times, and thus plan to get many observations, before making our assessment. As with the sleep deprivation example, before we actually roll the die the numbers that will turn up on the various rolls are all random variables. To assess the implications of the numbers which do, later, turn up when we get around to rolling the die, and of the times needed in the sleep deprivation example, we have to consider the probability theory for many random variables.

Notation

Since we are now considering many random variables, the notation “X” for one single random variable is no longer sufficient for us. We denote the first random variable by X1, the second by X2, and so on. Suppose that in the die example we denote the planned number of rolls by n. We would then denote the (random) number that will turn up on the first roll of the die by X1, the (random) number that will turn up on the second roll of the die by X2, . . . , the (random) number that will turn up on the n-th roll of the die by Xn.

As with a single random variable (see the section “Random variables and data” above), we need a separate notation for the actual observed numbers that did turn up once the die was rolled (n times). We denote these by x1, x2, . . . , xn. To assess (for example) whether we can reasonably assume that the die is fair we would use these numbers, but also we would have to use the theory of the n random variables X1, X2, . . . , Xn.


As in the case of one random variable, a statement in the die example like “X6 = 4” makes no sense. It “does not compute”. On the other hand, once the die has been rolled, the statement “x6 = 4” does make sense. It means that a 4 turned up on the sixth roll of the die. In the sleep example, a statement like “x6 = 23.7” also makes sense. It means that the time that the sixth person in the experiment took to complete the questionnaire was 23.7 minutes. By contrast, before the experiment was conducted, the time X6 that the sixth person will take to complete the questionnaire is unknown. It is a random variable. Thus a statement like “X6 = 23.7” does not make sense.

Independently and identically distributed random variables

The die example introduces two important concepts. We would reasonably assume that X1, X2, . . . , Xn all have the same probability distribution, since it is the same die that is being rolled each time. For example, we would assume that the probability that a three turns up on roll 77 (whatever it might be) is the same as the probability that a three turns up on roll 144. Further, we would also reasonably assume that the various random variables X1, X2, . . . , Xn are all independent of each other. That is, we would reasonably assume that the value of any one of these would not affect the value of any other one. Whatever number turned up on roll 77 has no influence on the number turning up on roll 144.

Random variables which are independent of each other, and which all have the same probability distribution, are said to be iid (independently and identically distributed). This concept is discussed again below.

The assumptions that the various random variables X1, X2, . . . , Xn are all independent of each other, and that they all have the same probability distribution, are often made in the application of statistical methods. However, in areas such as psychology, sociology and biology that are more scientifically important and complex than rolling a die, the assumption of identically and independently distributed random variables might not be reasonable. Thus if twin sisters were used in the sleep deprivation example, the times that they take to complete the questionnaire might not be independent, since we might expect them to be quite similar because of the common environment and genetic make-up of the twins. If the people in the experiment were not all of the same age it might not be reasonable to assume that the times needed are identically distributed - people of different ages might perhaps be expected to tend to need different amounts of time. Thus in practice care must often be exercised and common sense used when applying the theory of iid random variables in areas such as psychology, sociology and biology.

The mean and variance of a sum and of an average

Given n random variables X1, X2, . . . , Xn, two very important derived random variables are their sum, denoted by Tn, and defined as

Tn = X1 + X2 + · · · + Xn, (19)


and their average, denoted by X̄, and defined by

X̄ = (X1 + X2 + · · · + Xn)/n = Tn/n. (20)

Since both Tn and X̄ are functions of the random variables X1, X2, . . . , Xn they are themselves random variables. In the die example we do not know, before we roll the die, what the sum or the average of the n numbers that will turn up will be.

Both the sum and the average, being random variables, each have a probability distribution, and thus each has a mean and a variance. These must be related in some way to the mean and the variance of each of X1, X2, . . . , Xn. The general theory of many random variables shows that if X1, X2, . . . , Xn are iid, with (common) mean µ and (common) variance σ², then the mean and the variance of the random variable Tn are, respectively,

mean of Tn = nµ,    variance of Tn = nσ², (21)

and the mean and the variance of the random variable X̄ are given respectively by

mean of X̄ = µ,    variance of X̄ = σ²/n. (22)

In STAT 111 we will call these four formulas the four “magic formulas” and will refer to them often. Thus you have to know them by heart.

Equations (21) and (22) apply of course in the particular case n = 2. However in this case two further equations are important. If we define the random variable D by D = X1 − X2 (think of D standing for “difference”) then

mean of D = 0,    variance of D = 2σ². (23)

These are also “magic formulas” and we will refer to them several times, especially when making “comparison studies”. Thus you also have to know these two formulas by heart.
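
The six magic formulas can be made concrete with a small simulation (an illustration added here, assuming Python with the NumPy library; the number of rolls and of repetitions are chosen arbitrarily). For the fair die, µ = 3.5 and σ² = 35/12:

    import numpy as np

    rng = np.random.default_rng(0)     # arbitrary seed, for reproducibility
    n, reps = 10, 200_000
    rolls = rng.integers(1, 7, size=(reps, n))    # fair-die rolls

    T = rolls.sum(axis=1)              # the sum T_n in each repetition
    Xbar = rolls.mean(axis=1)          # the average X-bar in each repetition
    D = rolls[:, 0] - rolls[:, 1]      # the difference X1 - X2

    mu, var = 3.5, 35 / 12
    print(T.mean(), n * mu)            # both close to 35       (mean of T_n = n·µ)
    print(T.var(), n * var)            # both close to 29.17    (variance of T_n = n·σ²)
    print(Xbar.mean(), mu)             # both close to 3.5      (mean of X-bar = µ)
    print(Xbar.var(), var / n)         # both close to 0.292    (variance of X-bar = σ²/n)
    print(D.mean(), D.var(), 2 * var)  # close to 0; the last two close to 5.83 (formulas (23))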

Two generalizations

More generally, suppose that X1, X2, . . . , Xn are independent random variables with respective means µ1, µ2, . . . , µn and respective variances σ1², σ2², . . . , σn². Then

mean of Tn = µ1 + µ2 + · · · + µn,    variance of Tn = σ1² + σ2² + · · · + σn², (24)

and

mean of X̄ = (µ1 + µ2 + · · · + µn)/n,    variance of X̄ = (σ1² + σ2² + · · · + σn²)/n². (25)

The formulas in (21) and (22) are, respectively, special cases of these formulas.


Next, the generalization of the formulas in (23) is that Dij, defined by Dij = Xi − Xj, is a random variable and that

mean of Dij = µi − µj,    variance of Dij = σi² + σj². (26)

The formulas in (23) are, respectively, special cases of these formulas.

An example of the use of equations (22)

In the case of a fair die, each Xi has mean 3.5 and variance 35/12, as given by (13) and (18), and thus standard deviation √(35/12), or about 1.708. On the other hand if n, the number of rolls, is 1,000, the variance of X̄ = (X1 + X2 + · · · + X1,000)/1,000 is, from the second equation in (22), 35/12,000. Therefore the standard deviation of X̄ is √(35/12,000), or about 0.0540. This small standard deviation implies that once we roll the die 1,000 times, it is very likely that the observed average of the numbers that actually turned up will be very close to 3.5. This is no more than what intuition suggests. We will later do a JMP experiment to confirm this.

Later we will see that if the die is fair, the probability that the observed average x̄ is between the mean of X̄ minus two standard deviations of X̄ (that is, 3.5 − 2 × 0.0540 = 3.392) and the mean of X̄ plus two standard deviations of X̄ (that is, 3.5 + 2 × 0.0540 = 3.608) is about 95%. This statement is one of probability theory. It is an implication, or deduction.

So here is a window into Statistics. Suppose we have now rolled the die 1,000 times, and the observed average x̄ is 3.382. This is outside the range 3.392 to 3.608, the range within which the average is about 95% likely to lie if the die is fair. Then we have good evidence that the die is not fair. This claim is an act of Statistics. It is an inference, or induction.

We will later make many statistical inferences, all of which will be based on the relevant corresponding probability theory calculation.

The proportion of successes in n binomial trials

The random variable in the binomial distribution is the number of successes in n binomial trials, with probability distribution given in (9). In some applications it is necessary to consider instead the proportion of successes in these trials (more exactly, the proportion of trials leading to success). If X is the number of successes in n binomial trials, then this proportion is X/n, which we will denote by P.

P is a discrete random variable, and its possible values are 0, 1/n, 2/n, . . . , (n − 1)/n, 1. It has a probability distribution which can be found from the binomial distribution (9), since the probability that P = i/n is the same as the probability that X = i for any value of i.

Because P is a random variable it has a mean and variance. These are

mean of P = θ,    variance of P = θ(1 − θ)/n. (27)


These equations bear a similarity to the formulas for the mean and variance of an average given in (22).

We will see later, when testing for the equality of two binomial parameters, why it is often necessary to operate via the proportion of trials giving success rather than by the number of trials giving success.
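
The formulas in (27) can also be checked by simulation. The sketch below (an addition, assuming Python with NumPy; the values n = 2,000 and θ = 0.8 echo the clinical-trial example) simulates many binomial experiments and looks at the resulting proportions:

    import numpy as np

    rng = np.random.default_rng(1)            # arbitrary seed
    n, theta, reps = 2000, 0.8, 100_000

    X = rng.binomial(n, theta, size=reps)     # number of successes in each experiment
    P = X / n                                 # proportion of successes

    print(P.mean(), theta)                    # both close to 0.8       (mean of P = θ)
    print(P.var(), theta * (1 - theta) / n)   # both close to 0.00008   (variance of P = θ(1 − θ)/n)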

The standard deviation and the standard error

In the die example in the previous section the standard deviation of X̄ = (X1 + X2 + · · · + X1,000)/1,000 is about 0.0540. The standard deviation of an average such as this is sometimes called “the standard error of the mean”. (This terminology is unfortunate and causes much confusion - it should be “the standard deviation of the average”. How can a mean have a standard deviation? A mean is a parameter, and only random variables can have a standard deviation.) Many textbooks use this unfortunate terminology. Watch out for it.

Means and averages

It is crucial to remember that a mean and an average are two entirely different things. (The textbook, and other textbooks, are sometimes not too good on making this distinction.) A mean is a parameter, that is, some constant number whose value is often unknown to us. For example, with an unfair die for which the probabilities for the number to turn up on any roll are unknown, the mean of the number to turn up is unknown. It is a parameter which we might wish to estimate or test hypotheses about. We will always denote a mean by the Greek letter µ.

By contrast, an average as defined above (i.e. X̄) is a random variable. It has a probability distribution and thus has a mean and a variance. Thus it makes sense to say (as (22) and (25) say) “the mean of the average is such and such”.

There is also a second concept of an average, and this was already referred to in the die-rolling example above. This is the actual average x̄ of the numbers that actually turned up once the 1,000 rolls were completed. This is a number, for example 3.382. You can think of this as the realized value of the random variable X̄ once the rolling had taken place.

Thus there are three related concepts: first a mean (a parameter), second a “before the experiment” average X̄ (a random variable, and a concept of probability theory), and third an “after the experiment” average x̄ (a calculated number, and a concept of Statistics). They are all important and must not be confused with each other.

Why do we need all three concepts? Suppose that we wish to estimate a mean (first concept), or to test some hypothesis about a mean (for example that it is 3.5). We would do this by using the third concept, the “after the experiment” observed average x̄. How good an estimate of the mean x̄ is, or what hypothesis testing conclusion we might draw given the observed value of x̄, both depend on the properties of the random variable X̄ (the second concept), in particular its mean and variance.


Continuous Random Variables

Definition

Some random variables by their very nature are discrete, such as the number of heads in 2,000 flips of a coin. Other random variables, by contrast, are continuous. Continuous random variables can take any value in some continuous range of values. Measurements such as height and blood pressure are of this type. Here we denote the range of a continuous random variable by (L, H), (L = lowest possible value, H = highest possible value of the continuous random variable), and use this notation throughout.

Probabilities for continuous random variables are not allocated to specific values, but rather are allocated to ranges of values. The probability that a continuous random variable takes some specified numerical value is zero.

We use the same notation for continuous random variables as we do for discrete random variables, so that we denote a continuous random variable in upper case, for example by X.

Every continuous random variable X has an associated density function f(x). The density function f(x) is the continuous random variable analogue of a discrete random variable probability distribution such as (9). This density function can be drawn as a curve in the (x, f(x)) plane. (Examples will be given in class.) The probability that the random variable X takes a value in some given range a to b is the area under this curve between a and b.

From a calculus point of view (for those who have a good calculus background) this probability is obtained by integrating this density function over the range a to b. For example, the probability that the (continuous) random variable X having density function f(x) takes a value between a and b (with a < b) is given by

P(a < X < b) = ∫_a^b f(x) dx. (28)

Because the probability that a continuous random variable takes some specified numerical value is zero, the three probabilities P(a ≤ X < b), P(a < X ≤ b), and P(a ≤ X ≤ b) are also given by the right-hand side in (28).

As a particular case of equation (28),

∫_L^H f(x) dx = 1. (29)

This equation simply states that a random variable must take some value in its range of possible values.

For those who do not have a calculus background, don’t worry - we will never do any of these integration procedures.


The mean and variance of a continuous random variable

The mean µ and variance σ² of a continuous random variable X having range (L, H) and density function f(x) are defined respectively by

µ = ∫_L^H x f(x) dx (30)

and

σ² = ∫_L^H (x − µ)² f(x) dx. (31)

Again, if you do not have a calculus background, don’t worry about it. We will never do any of these integration procedures. The main thing to remember is that these definitions are the natural analogues of the corresponding definitions for a discrete random variable, that is, that the mean is the “center of gravity”, or the “knife-edge balance point”, of the density function f(x) and the variance is a measure of the dispersion, or “spread-out-ness”, of the density function around the mean. (Examples will be given in class.)

Also, the remarks about the mean and the variance of a continuous random variable are very similar to those for a discrete random variable given above. In particular we denote a mean by µ and a variance by σ². In a research context the mean µ and the variance σ² of the random variable of interest are often unknown to us. That is, they are parameters, as is indicated by the Greek notation that we use for them. Many statistical procedures involve estimating, and testing hypotheses about, the mean and the variance of continuous random variables.

The normal distribution

There are many continuous probability distributions relevant to statistical operations. We discuss the most important one in this section, namely the normal, or Gaussian, distribution.

The (continuous) random variable X has a normal, or Gaussian, distribution if its range (i.e. set of possible values) is (−∞, ∞) and its density function f(x) is given by

f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²)). (32)

The shape of this density function is the famous (infamous?) “bell-shaped curve”. (Here π is the well-known geometrical value of about 3.1416 and e is the equally important exponential constant of about 2.7183.)

It can be shown that the mean of this normal distribution is µ and its variance is σ², and these parameters are built into the functional form of the distribution, as (32) shows. A random variable having this distribution is said to be an N(µ, σ²) random variable.

As stated above, the probability that a continuous random variable takes a value between a and b is found by a calculus operation, which gives the area under the density function of the random variable between a and b. Thus for example the probability that a random variable having a normal distribution with mean 6 and variance 16 takes a value between 5 and 8 is

∫_5^8 (1/(√16 √(2π))) e^(−(x − 6)²/32) dx. (33)

Amazingly, the processes of mathematics do not allow us to evaluate the integral in (33) exactly: it is just “too hard”. (This indicates an interesting limit as to what mathematics can do.) So how would we find the probability given in (33)? It has to be done using a chart. So we now have to discuss the normal distribution chart.

There is a whole family of normal distributions, each member of the family corresponding to some pair of (µ, σ²) values. (The case µ = 6 and σ² = 16 just considered is an example of one member of this family.) However, probability charts are available only for one particular member of this family, namely the normal distribution for which µ = 0 and σ² = 1. This is sometimes called the standardized normal distribution, for reasons which will appear shortly. (The normal distribution chart that you will be given refers to this specific member of the normal distribution family.)

The way that the chart works is best described by a few examples. (We will also do some examples in class.) The chart gives “less than” probabilities for a variety of positive numbers, generically denoted by “z”. Thus the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than 0.5 is 0.6915. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than 1.73 is 0.9582. Note that the chart only goes up to the “z” value 3.09; for any “z” larger than this, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than that “z” can be taken as 1, to a good enough approximation.

We usually have to consider more complicated examples than this. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between 0.5 and 1.73 is 0.9582 − 0.6915 = 0.2667. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between 1.23 and 2.46 is 0.9931 − 0.8907 = 0.1024.

As a different form of calculation, we often have to find “greater than” probabilities. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value exceeding 1.44 is 1 minus the probability that it takes a value less than 1.44, namely 1 − 0.9251 = 0.0749.

Even more complicated calculations arise when negative numbers are involved. Here we have to use the symmetry of the normal distribution around the value 0. For example, the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between −1.22 and 0 is the same as the probability that it takes a value between 0 and +1.22, and this is 0.8888 − 0.5 = 0.3888. The probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value less than −0.87 is the same as the probability that it takes a value greater than +0.87, and this is 1 − 0.8078 = 0.1922.

Finally, perhaps the most complicated calculation concerns the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between some given negative number and some given positive number. Suppose for example that we want to find the probability that a random variable having a normal distribution with mean 0 and variance 1 takes a value between −1.28 and +0.44. This is the probability that it takes a value between −1.28 and 0 plus the probability that it takes a value between 0 and +0.44. This in turn is the probability that it takes a value between 0 and +1.28 plus the probability that it takes a value between 0 and +0.44. This is (0.8997 − 0.5000) + (0.6700 − 0.5000) = 0.5697.

Why is there a probability chart only for this one particular member of the normal distribution family? Suppose that a random variable X has the normal distribution (32), that is, with arbitrary mean µ and arbitrary variance σ². Then the “standardized” random variable Z, defined by Z = (X − µ)/σ, has a normal distribution with mean 0, variance 1 (trust me on this). This standardization procedure can be used to find probabilities for a random variable having any normal distribution.

For example, if X is a random variable having a normal distribution with mean 6 and variance 16 (and thus standard deviation 4), P(7 < X < 10), that is, the probability of the event 7 < X < 10, can be found by standardizing and creating a Z statistic:

P(7 < X < 10) = P((7 − 6)/4 < (X − 6)/4 < (10 − 6)/4) = P(0.25 < Z < 1), (34)

and this probability is found from the standardized normal distribution chart (or from computer packages) to be 0.8413 − 0.5987 = 0.2426.

As a slightly more complicated example, the probability of the event 4 < X < 11 for the random variable X of the previous paragraph can be found by standardizing and creating a Z statistic:

P(4 < X < 11) = P((4 − 6)/4 < (X − 6)/4 < (11 − 6)/4) = P(−0.5 < Z < 1.25), (35)

and this probability is found from the kind of manipulations discussed above to be (0.6915 − 0.5000) + (0.8944 − 0.5000) = 0.5859.
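
In practice these chart lookups are done by software. A brief sketch (an addition, assuming Python with SciPy) reproduces the chart values and the two standardization examples (34) and (35):

    from scipy.stats import norm

    # standard normal "less than" probabilities, as read from the chart
    print(norm.cdf(0.5))      # about 0.6915
    print(norm.cdf(1.73))     # about 0.9582

    # example (34): X is N(6, 16), so sigma = 4, and P(7 < X < 10) = P(0.25 < Z < 1)
    mu, sigma = 6, 4
    print(norm.cdf((10 - mu) / sigma) - norm.cdf((7 - mu) / sigma))   # about 0.2426

    # example (35): P(4 < X < 11) = P(-0.5 < Z < 1.25)
    print(norm.cdf((11 - mu) / sigma) - norm.cdf((4 - mu) / sigma))   # about 0.5859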

Two useful properties of the normal distribution, often used in conjunction with this standardization procedure, are that if the random variable Z has a normal distribution with mean 0 and variance 1, then

P(Z > +1.645) = 0.05 (36)

and

P(−1.96 < Z < +1.96) = 0.95, (37)


or equivalently

P(Z < −1.96) + P(Z > +1.96) = 0.05. (38)

The “standardized” quantity Z, defined as Z = (X − µ)/σ, where X is a random variable with mean µ and standard deviation σ, will be referred to often below, and the symbol Z is reserved, in these notes and in Statistics generally, for this standardized quantity. One of the applications of the normal distribution is to provide approximations for probabilities for various random variables, almost always using the standardized quantity Z.

One frequently-used approximation derives from equation (37) by approximating the value 1.96 by 2. This is

P(−2 < Z < +2) ≈ 0.95. (39)

Remembering that Z = (X − µ)/σ, this equation implies that if X is a random variable having a normal distribution with mean µ and variance σ², then

P(µ − 2σ < X < µ + 2σ) ≈ 0.95. (40)

A similar calculation, using the normal distribution chart, shows that

P(µ − 2.575σ < X < µ + 2.575σ) ≈ 0.99. (41)

Applications of (40) often arise from the Central Limit Theorem, discussed immediately below.

The Central Limit Theorem

An important property of an average and of a sum of several random variables derives from the so-called “Central Limit Theorem”. This states that if the random variables X1, X2, . . . , Xn are independently and identically distributed, then no matter what the probability distribution of these random variables might be, the average X̄ = (X1 + X2 + · · · + Xn)/n and the sum X1 + X2 + · · · + Xn both have approximately a normal distribution. This approximation becomes more accurate the larger n is, and is usually very good for values of n greater than about 50. Since many statistical procedures deal with sums or averages, the Central Limit Theorem ensures that we often deal with the normal distribution in these procedures. Also we often use the formulas (21) and (22) for the mean and variance of a sum and of an average, and the approximation (40), when doing this.

We have already seen an example of this (in the Section “Example of the use of equations (22)”). The average X̄ of the numbers to turn up on 1,000 rolls of a fair die is a random variable with mean 3.5 and variance 35/12,000, and thus standard deviation √(35/12,000) ≈ 0.0540. The Central Limit Theorem states that to a very close approximation


this average has a normal distribution with this mean and this variance. Then application of (40) shows that to a very close approximation,

P(3.5 − 2 × 0.0540 < X̄ < 3.5 + 2 × 0.0540) ≈ 0.95,     (42)

and this led to the (probability theory) statement given in the Section “Example of the use of equations (22)” that the probability that X̄ takes a value between 3.392 and 3.608 is about 95%. We also saw how this statement gives us a window into Statistics.

The Central Limit Theorem also applies to the binomial distribution. Suppose that X has a binomial distribution with index n (the number of trials) and parameter θ (the probability of success on each trial), and thus mean nθ and variance nθ(1 − θ). In the binomial context the Central Limit Theorem states that X has, to a very close approximation, a normal distribution with this mean and this variance. It similarly states that the proportion P of successes has, to a very close approximation, a normal distribution with mean θ and variance θ(1 − θ)/n.

Here is an application of this result. Suppose that it is equally likely that a newborn will be a boy as a girl. If this is true, the number of boys in a sample of 2,500 newborns has approximately a normal distribution with mean 1,250 and variance (from note (v) about variances) of (1/2) × (1/2) × 2,500 = 625, and hence standard deviation √625 = 25. Then (41) shows that the probability is about 0.99 that the number of boys in this sample will be between

1,250 − 2.575 × 25 and 1,250 + 2.575 × 25,

that is, about between 1185 and 1315. We saw these numbers in Homework 1.

Here is the corresponding window into Statistics. IF a newborn is equally likely to be a boy as a girl, then the probability is about 99% that in a sample of 2,500 newborns, the number of boys that we see will be between 1185 and 1315. (This is a probability theory deduction, or implication. It is a “zig”. It is made under the assumption that a newborn is equally likely to be a boy as a girl.) However when we actually took this sample we saw 1,334 boys. We therefore have good evidence that it is NOT equally likely for a newborn to be a boy as a girl. (This is an induction, or inference. It is a “zag”. It is a statement of Statistics. It cannot be made without the corresponding probability theory “zig” calculation.)

The above example illustrates how we increase our knowledge in a context involving randomness (here the randomness induced by the sampling process) by a probability theory/Statistics “zig-zag” process. (In fact it is now known that it is NOT equally likely for a newborn to be a boy as a girl.)
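The probability theory “zig” in this example is just a normal distribution calculation, and it can be reproduced in a few lines. The sketch below (Python with scipy, shown only as an illustration) computes the approximate 99% range for the number of boys and also shows how unlikely the observed count of 1,334 is under the assumption θ = 1/2.

    import math
    from scipy.stats import norm

    n, theta = 2500, 0.5                      # sample size and the null value of theta
    mean = n * theta                          # 1,250
    sd = math.sqrt(n * theta * (1 - theta))   # sqrt(625) = 25

    print(mean - 2.575 * sd, mean + 2.575 * sd)   # about 1185 and 1315

    # probability of seeing 1,334 or more boys if theta really is 1/2
    print(norm.sf(1334, loc=mean, scale=sd))      # a very small number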


Statistics

Introduction

So far in these notes we have been contemplating the situation before some experiment is carried out, so that we have been discussing random variables and their properties. We now do our experiment. As indicated above, if before the experiment we had been considering several random variables X1, X2, . . . , Xn, we denote the actually observed values of these random variables, once the experiment has been carried out, by x1, x2, . . . , xn. These observed values are our data. As an example, if an experiment consisted of the rolling of a die n = 3 times, and after the experiment we observe that a 5 turned up on the first roll and a 3 on both the second and third rolls, we would say that x1 = 5, x2 = 3, x3 = 3. These are our data values. It does not make sense to say, before the experiment has been done, that X1 = 5, X2 = 3, X3 = 3. This comment “does not compute”.

The three main activities of Statistics are the estimation of the numerical values of a parameter or parameters, assessing the accuracy of these estimates, and testing hypotheses about the numerical values of parameters. We now consider each of these in turn.

Estimation (of a parameter)

Comments on the “die-rolling” example

In much of the discussion in these notes (and the course) so far the values of the various parameters entering the probability distributions considered were taken as being known. A good example was the “fair die” simulation: we knew in advance that the die is fair, so we knew in advance the values of the mean (3.5) and the variance (35/12) of the number to turn up on any roll of the die.

However, in practice these parameters are usually unknown, and must be estimated from data. This means that our JMP “die rolling” experiment is very atypical. The reason is that in this JMP experiment we know that the die is fair, so that we know for example the mean of the (random variable) average of the numbers turning up after (say) 1,000 rolls of the die is 3.5. The real-life situation, especially in research, is that we will not know the relevant mean. For example, we might be interested in the mean blood-sugar reading of diabetics. To get some idea about what this mean might be we would take a sample of (say) 1,000 diabetics, measure the blood-sugar reading for each of these 1,000 people and use the average of these to estimate this (unknown) mean. This is a natural and (as we see later) correct thing to do.

So think of the JMP die-rolling example as a “proof of principle”: because we know that the die is fair, we know in advance that the mean of the (random variable) average of the numbers turning up after (say) 1,000 rolls of the die is 3.5. We also know that it has a small variance (35/12,000). This value (3.5) of the mean and this small variance imply that, once we have rolled the die 1,000 times, our actually observed average should be very


close to the mean of 3.5. And this is what we saw happen. This suggests that in a real-life example, where we do not know the numerical value of a mean, using an observed average should give us a pretty good idea of what the mean is. Later we will make this idea more precise.

General principles

In this section we consider general aspects of estimation procedures. Much of the theory concerning estimation of parameters is the same for both discrete and continuous random variables, so in this section we use the notation X for both.

Let X1, X2, . . . , Xn be n independently and identically distributed (iid) random variables, each having a probability distribution P(x; θ) (for discrete random variables) or density function f(x; θ) (for continuous random variables), depending in both cases (as the notation implies) on some unknown parameter θ. We have now done our experiment, so that we now have the corresponding data values x1, x2, . . . , xn. How can we use these data values to estimate the parameter θ? (Note that we are estimating the parameter θ, not calculating it. Even after we have the data values we still will not know what the numerical value of θ is. But at least if we use good estimation procedures we should have a reasonable approximate idea of its value.)

Before discussing particular cases we have to consider general principles of estimation. An estimator of the parameter θ is some function of the random variables X1, X2, . . . , Xn, and thus may be written θ̂(X1, X2, . . . , Xn), a notation that emphasizes that this estimator is itself a random variable. For convenience we generally use the shorthand notation θ̂. (We pronounce θ̂ as “θ-hat”.) The quantity θ̂(x1, x2, . . . , xn), calculated from the observed data values x1, x2, . . . , xn of X1, X2, . . . , Xn, is called the estimate of θ. The “hat” terminology is a signal to us that we are talking about either an estimator or an estimate.

Note the two different words estimate and estimator. The estimate of θ is calculated from our data, and will then be just some number. How good this estimate is depends on the properties of the (random variable) estimator θ̂, in particular its mean and its variance.

Various desirable criteria have been proposed for an estimator to satisfy, and we now discuss three of these.

First, a desirable property of an estimator is that it be unbiased. An estimator θ̂ is said to be an unbiased estimator of θ if its mean value is equal to θ. If θ̂ is an unbiased estimator of θ we say that the corresponding estimate θ̂(x1, x2, . . . , xn), calculated from the observed data values x1, x2, . . . , xn, is an unbiased estimate of θ. It is “shooting at the right target”. Because of the randomness involved in the generation of our data, it will almost certainly not exactly hit the target. But at least it is shooting in the right direction.

As an example, think of the average that you got of the number that turned up on your 1,000 rolls of a fair die in the JMP experiment. We will later show that, if you did not know the mean (3.5) in the die case, you would use your average to estimate it. And almost certainly your average was not exactly 3.5. But at least it should have been close to 3.5,


since it was “shooting at the right target”.

Second, if an estimator θ̂ of some parameter θ is unbiased, we would also want the variance of θ̂ to be small, since if it is, the observed value θ̂(x1, x2, . . . , xn) calculated from your data, that is your estimate of θ, should be close to θ. This is why a variance is an important probability theory concept.

Finally, it would also be desirable if θ̂ has, either exactly or approximately, a normal distribution, since then well-known properties of this distribution can be used to provide properties of θ̂. In particular, we often use the two-standard-deviation rule in assessing the precision of our estimate, and this rule derives from the normal distribution. Fortunately, several of the estimators we consider are unbiased, have a small variance, and have an approximately normal distribution.

Estimation of the binomial parameter θ

The binomial distribution gives the probability distribution of the random variable X, the number of successes from n binomial trials with the binomial parameter θ (the probability of success on each trial). This random variable and its probability distribution are purely probability theory concepts.

The corresponding statistical question is: I have now done my experiment and observed x successes from n trials. What can I say about θ? In particular, how should I estimate θ? How precise can I expect my estimate to be?

The classical, and perhaps natural, estimate of θ is p = x/n, the observed proportion of successes. This is indeed our estimate of θ. What are the properties of this estimate?

These depend on the properties of the random variable P, the (random) proportion of successes before we do the experiment. We know (from the relevant probability theory formulas) that the mean of P is θ and that the variance of P is θ(1 − θ)/n. What does this imply?

First, since we know that the random variable P has a mean of θ, the estimate p of θ, once we have done our experiment, is an unbiased estimate of θ. It is “shooting at the right target”. That is good news.

Just as important, we want to ask: how precise is this estimate? An estimate of a parameter without any indication of its precision is not of much value. This precision of p as an estimate of θ depends on the variance, and thus on the standard deviation, of the random variable P. We know that the variance of P is θ(1 − θ)/n, so that the standard deviation of P is √(θ(1 − θ)/n). We now use two facts.

(i) From the Central Limit Theorem as applied to the random variable P, we know that P has, to a very accurate approximation, a normal distribution (with mean θ and variance θ(1 − θ)/n).

(ii) Once we know that P has a normal distribution (to a sufficiently good approximation) we


are free to adapt either (40) or (41), which are normal distribution results, to the question of the precision of p as an estimate of θ. First we have to find out what these equations reduce to, or imply, in the binomial distribution context. They become

P(θ − 2√(θ(1 − θ)/n) < P < θ + 2√(θ(1 − θ)/n)) ≈ 0.95,     (43)

and

P(θ − 2.575√(θ(1 − θ)/n) < P < θ + 2.575√(θ(1 − θ)/n)) ≈ 0.99.     (44)

The first inequality implies, in words, something like this:

“Before we do our experiment we can say that the random variable P takes a value within 2√(θ(1 − θ)/n) of θ with probability of about 95%.”

From this we can say:

“After we have done our experiment, it is about 95% likely that the observed proportion p of successes is within 2√(θ(1 − θ)/n) of θ”.

We now turn this second statement “inside-out” (using the “if I am within 10 yards of you, you are within 10 yards of me” idea), and say:

“It is about 95% likely that, once we have done our experiment, θ is within 2√(θ(1 − θ)/n) of the observed proportion p of successes”.

Writing this somewhat loosely, we can say

P(p − 2√(θ(1 − θ)/n) < θ < p + 2√(θ(1 − θ)/n)) ≈ 0.95.     (45)

We still have a problem. Since we do not know the value of θ we do not know the value of the expression √(θ(1 − θ)/n) occurring twice in (45). However at least we have an estimate of θ, namely p. Since (45) is already an approximation, we make a further approximation and say, again somewhat loosely,

P(p − 2√(p(1 − p)/n) < θ < p + 2√(p(1 − p)/n)) ≈ 0.95.     (46)

This leads to the so-called (approximate) 95% “confidence interval” for θ of

p − 2√(p(1 − p)/n)  to  p + 2√(p(1 − p)/n).     (47)


As an example, suppose that n = 1,000 and p is 0.47. The sort of thing we would say is:

“I estimate the value of θ to be 0.47, and I am (approximately) 95% certain that θ is between 0.47 − 2√(0.47 × 0.53/1,000) (i.e. 0.4384) and 0.47 + 2√(0.47 × 0.53/1,000) (i.e. 0.5016).” In saying this we have not only indicated our estimate of θ, but we have also given some idea of the precision, or reliability, of that estimate.
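As a check on the arithmetic, here is a short sketch (plain Python, shown only as an illustration) that computes the approximate 95% confidence interval (47) for the numbers used in this example.

    import math

    n, x = 1000, 470          # number of trials and number of successes, so p = 0.47
    p = x / n

    half_width = 2 * math.sqrt(p * (1 - p) / n)
    print(p, p - half_width, p + half_width)   # 0.47, about 0.4384 and 0.5016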

Notes on the above.

(i) The range of values 0.4384 to 0.5016 in the above example is usually called a “95% confidence interval for θ”. The interpretation of this statement is that we are (approximately) 95% certain that θ is within this range of values. Thus the confidence interval gives us an idea of the precision of the estimate 0.47.

(ii) In research papers, books, etc, you will often see the above result written as

θ = 0.47± 0.0316.

(iii) The precision of the estimate 0.47 as indicated by the confidence interval depends on the variance θ(1 − θ)/n of the random variable P. This is why we have to consider random variables, their properties and in particular their variances.

(iv) It is a mathematical fact that p(1 − p) can never exceed 1/4. Further, for quite a wide range of values of p near 1/2, p(1 − p) is quite close to 1/4. So if we approximate p(1 − p) by 1/4, and remember that √(1/4) = 1/2, we arrive from (47) at a conservative confidence interval for θ as

p − √(1/n)  to  p + √(1/n).     (48)

This formula is quite easy to remember and you may use it in place of (47).

(v) What was the sample size? Suppose that a TV announcer says, before an election between two candidates Smith and Jones, that a Gallup poll predicts that 52% of the voters will vote for Smith, “with a margin of error of 3%”. The TV announcer has no idea where that 3% (= 0.03) came from, but in effect it came from the (approximate) 95% confidence interval (47) or (more likely) from (48). So we can work out, from (48), how many individuals were in the sample that led to the estimate 52%, or 0.52. All we have to do is to equate √(1/n) with 0.03. We find n = 1,111. (Probably their sample size was 1,000, and with this value the “margin of error” is √(1/1,000) = 0.0316. They just approximated this by 0.03.)
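The back-calculation in this note is a one-liner: if the reported margin of error is m and formula (48) was used, then √(1/n) = m, so n = 1/m². A tiny sketch (plain Python, for illustration only):

    margin = 0.03
    n = 1 / margin ** 2
    print(round(n))            # about 1,111 individuals

    # conversely, a sample of 1,000 gives margin of error sqrt(1/1000)
    print((1 / 1000) ** 0.5)   # about 0.0316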

(vi) All of the above relates to an (approximate) 95% confidence interval for θ. If you want to be more conservative, and have a 99% confidence interval, you can start with the inequalities


in (44) which, compared to (43) (which led to our 95% confidence interval), replaces the “2” in (43) by 2.575. Carrying through the same sort of argument that led to (47) and (48), we would arrive at an (approximate) 99% confidence interval for θ of

p − 2.575√(p(1 − p)/n)  to  p + 2.575√(p(1 − p)/n)     (49)

in place of (47) or

p − 1.2875√(1/n)  to  p + 1.2875√(1/n)     (50)

in place of (48).

Example. This example is from the field of medical research. Suppose that someone proposes an entirely new medicine for curing some illness. Beforehand we know nothing about the properties of this medicine, and in particular we do not know the probability θ that it will cure someone of the illness involved. θ is an (unknown) parameter. We want to carry out a clinical trial to estimate θ. Suppose now that we have given the new medicine to 10,000 people with the illness and of these, 8,716 were cured. Then we estimate θ to be 0.8716.

Since we want to be very precise in a medical context we might prefer to use the 99% confidence interval (50) instead of the 95% confidence interval (48). Since √(1/10,000) is 0.01, we would say: “I estimate the probability of a cure by 0.8716, and I am about 99% certain that the probability of a cure with this proposed medicine is between 0.8716 − 0.012875 (= 0.8587) and 0.8716 + 0.012875 (= 0.8845).”

(vii) Notice that the lengths of both confidence intervals (50) and (48) are proportional to 1/√n. This means that if you want to be twice as accurate you need four times the sample size, that if you want to be three times as accurate you need nine times the sample size, and so on. This is why your medicines are so expensive: the FDA requires considerable accuracy before a medicine can be put on the market, and this often implies that a very large sample size is needed to meet this required level of accuracy.

(viii) Often in research publications the result of an estimation procedure is written as something like: “estimate ± some measure of precision of the estimate”. Thus the result in the medical example above might be written as something like: “θ = 0.8716 ± 0.012875.” This can be misleading because, for example, it is not indicated if this is a 95% or a 99% confidence interval. Also, it is not the best way to present the conclusion.

(ix) The width of the confidence interval, and hence the precision of the estimate, ultimately depends on the variance of the random variable P. This is why we have to discuss (a) random variables and (b) variances of random variables.


Estimation of a mean (µ)

Suppose that we wish to estimate the mean blood sugar level of diabetics. We take a random sample of n diabetics and measure their blood sugar levels, getting data values x1, x2, . . . , xn. It is natural to estimate the mean µ by the average x̄ of these observed values. What are the properties of this estimate? To answer these questions we have to zig-zag backwards and forwards between probability theory and Statistics.

We start with probability theory and think of the situation before we got our data. We think of the data values x1, x2, . . . , xn as the observed values of n iid random variables X1, X2, . . . , Xn, all having some continuous probability density function with (unknown to us) mean µ and (unknown to us) variance σ². Our aim then is to estimate µ from the data and to assess the precision of our estimate. The form of the density function of each X is unknown to us. We can however conceptualize about this distribution graphically:-

[Figure: a conceptual plot of the (unknown) density function of blood sugar level.]

There is some (unknown to us) probability (the shaded area below) that the blood sugar level of a randomly chosen diabetic lies between the values a and b:-

[Figure: the same density function, with the area between a and b shaded.]


This distribution has some (unknown to us) mean (which is what we want to estimate) at the “balance point” of this density function, as indicated by the arrow:-

[Figure: the same density function, with an arrow marking the mean at its “balance point”.]

We continue to think of the situation before we get our data. That is, we continue to think in terms of probability theory. We conceptualize about the average X̄ of the random variables X1, X2, . . . , Xn. Since the mean value of X̄ is µ (from equation (22)), X̄ is an unbiased estimator of µ. Thus x̄ is an unbiased estimate of µ. That is good news: it is “shooting at the right target”. So our “natural” estimate is the correct one.

Much more important: how precise is it as an estimate of µ? This depends on the variance of X̄. We know (see (22)) that the variance of X̄ is σ²/n, and even though we do not know the value of σ², this result is still useful to us. Next, the Central Limit Theorem shows that the probability distribution of X̄ is approximately normal when n is large, so that to a good approximation we can use the two-standard-deviation rule. These facts lead us to an approximate 95% confidence interval for µ.

The 95% confidence interval for µ

Suppose first that we know the numerical value of σ². (In practice it is very unlikely that we would know this, but we will remove this assumption soon.) The two-standard-deviation rule, deriving from properties of the normal distribution, then shows that for large n,

P(µ − 2σ/√n < X̄ < µ + 2σ/√n) ≈ 0.95.     (51)


The inequalities (51) can be written in the equivalent “turned inside-out” form

P(X̄ − 2σ/√n < µ < X̄ + 2σ/√n) ≈ 0.95.     (52)

This leads to an approximate 95% confidence interval for µ, given the data values x1, x2, . . . , xn, as

x̄ − 2σ/√n  to  x̄ + 2σ/√n.     (53)

This interval is valuable in providing a measure of accuracy of the estimate x̄ of µ. To be told that the estimate of a mean is 14.7 and that it is approximately 95% likely that the mean is between 14.3 and 15.1 is far more useful information than being told only that the estimate of a mean is 14.7.

The main problem with the above is that, in practice, the variance σ² is usually unknown, so that (53) is not immediately applicable. However it is possible to estimate σ² from the data values x1, x2, . . . , xn. The theory here is not easy, so here is a “trust me” result: the estimate s² of σ² found from the observed data values x1, x2, . . . , xn is

s² = (x1² + x2² + · · · + xn² − n(x̄)²)/(n − 1).     (54)

This leads (see (53)) to an even more approximate 95% confidence interval for µ as

x̄ − 2s/√n  to  x̄ + 2s/√n.     (55)

This estimated confidence interval is useful, since it provides a measure of the accuracy of the estimate x̄ and it can be computed entirely from the data. Further theory shows that in practice it is reasonably accurate.
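Formulas (54) and (55) are easy to turn into a small function. The sketch below (plain Python, given only as an illustration; the function name is my own invention) returns x̄, s² and the approximate 95% confidence interval for a list of data values.

    import math

    def mean_and_interval(data):
        """Return (x-bar, s-squared, lower limit, upper limit), as in (54) and (55)."""
        n = len(data)
        xbar = sum(data) / n
        # estimate of sigma^2 with the n - 1 denominator, as in (54)
        s2 = (sum(x * x for x in data) - n * xbar ** 2) / (n - 1)
        half_width = 2 * math.sqrt(s2) / math.sqrt(n)
        return xbar, s2, xbar - half_width, xbar + half_width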

Some notes on the above

1. The number “2” appearing in the confidence interval (55) comes, eventually, from the two-standard-deviation rule. This rule is only an approximate one so that, as mentioned above, the 95% confidence interval (55) is only reasonably accurate.

2. Why do we have n − 1 in the denominator of the formula (54) for s² and not n (the sample size)? This question leads to the concept of “degrees of freedom”, which we shall discuss later.

3. The effect of changing the sample size. Consider two investigators both interested in the blood sugar levels of diabetics. Suppose that n = 10 for the first investigator (i.e. her sample size was 10). Suppose that n = 40 for the second investigator (i.e. his sample size was 40). The two investigators will estimate µ by their respective values of x̄. Since both estimates


are unbiased, that is both are “shooting at the same target” (µ), they should be reasonably close to each other. Similarly, their respective estimates of σ² should be reasonably close to each other, since both are unbiased estimates of σ².

On the other hand the length of the confidence interval for µ for the second investigator will be about half that of the first investigator, since he will have a √40 involved in the calculation of his confidence interval, not the √10 that the first investigator will have (see (55) and note that 1/√40 is half of 1/√10). This leads to the next point.

4. To be twice as accurate you need four times the sample size. To be three times as accurate you need nine times the sample size. This happens because the length of the confidence interval (55) is 4s/√n. The fact that there is a √n in the denominator and not an n explains this phenomenon. This is why research is often expensive: to get really accurate estimates one often needs very large sample sizes. To be 10 times as accurate you need 100 times the sample size!

5. How large a sample size do you need before you do your experiment in order to get some desired degree of precision of the estimate of the mean µ? One cannot answer this question in advance, since the precision of the estimate depends on σ², which is unknown. Often one runs a pilot experiment to estimate σ², and from this one can get a good idea of what sample size is needed to get the required level of precision, using the above formulas.

6. The quantity s/√n is often called “the standard error of the mean”. This statement incorporates three errors. More precisely it should be: “the estimated standard deviation of the estimator of the mean”.

A numerical example. For many years corn has been grown using a standard seed processing method. A new method is proposed in which the seed is kiln-dried before sowing. We want to assess various properties of this new method. In particular we want to estimate µ, the mean yield per acre (in pounds) under the new method, and to find two limits between which we are approximately 95% certain that µ lies.

We plan to do this by sowing n = 11 separate acres of land with the new seed type and measuring the yield per acre for each of these 11 acres. At this point, before we do the experiment, these yields are unknown to us. They are random variables, and we think of their values, before we carry out this experiment, as the random variables X1, X2, . . . , X11. We know (as above) that the average X̄ of these random variables has mean µ, so we know that the estimate x̄ will be unbiased.

With this conceptualization behind us, we now apply this new style of seed to our 11 separate acre lots, and we get the following values (pounds per acre):-

1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511.


These are our data values, which we have previously generically denoted by x1, x2, . . . , xn.

Now to our estimation and confidence interval procedures. We estimate the mean µ of the yield per acre by the average

(1903 + 1935 + · · · + 1511)/11 = 1841.46.

We know from the above theory that this is an unbiased estimate of µ, the mean yield per acre.

To calculate the approximate 95% confidence interval (55) for µ we first have to calculate s², our estimate of the variance σ² of the probability distribution of yield with this new seed type. The estimate of σ² is, from (54),

s² = ((1903)² + (1935)² + · · · + (1511)² − 11(1841.46)²)/10 = 117,468.9.

Following (55), these calculations lead to our approximate 95% confidence interval for µ as

1841.46 − 2√117,468.9/√11  to  1841.46 + 2√117,468.9/√11,     (56)

that is from 1634.78 to 2048.14.

Since the individual yields are clearly given rounded to whole numbers, it is not appropriate to be more accurate than this in our final statement, which is: “We estimate the mean by 1841, and we are about 95% certain that it is between 1635 and 2048.”
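These calculations can be verified with a few lines of code. The sketch below (Python with numpy, shown only as a check on the arithmetic; in the course itself such output comes from JMP) reproduces the average, the estimate s² and the approximate 95% confidence interval, up to rounding.

    import numpy as np

    yields = np.array([1903, 1935, 1910, 2496, 2108, 1961, 2060,
                       1444, 1612, 1316, 1511], dtype=float)
    n = len(yields)

    xbar = yields.mean()
    s2 = yields.var(ddof=1)            # uses the n - 1 denominator of (54)
    half_width = 2 * np.sqrt(s2 / n)

    print(xbar, s2)                               # about 1841.45 and 117,469
    print(xbar - half_width, xbar + half_width)   # about 1635 and 2048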

Often in research publications the above result might be written µ = 1841 ± 206. This can be misleading because, for example, it is not indicated if this is a 95% or a 99% confidence interval. Also, it is not the best way to present the conclusion.


Estimating the difference between two binomial parameters

Let’s start with an example. Is there a difference between men and women in their attitudes in the pro-life/pro-choice debate? We approach this question from a statistical point of view as follows.

Let θ1 be the (unknown) probability that a woman is pro-choice and let θ2 be the (unknown) probability that a man is pro-choice. So we are interested in the difference θ1 − θ2. Our aim is to take a sample of n1 women and n2 men and find out for each person whether he/she is pro-life or pro-choice. Our aim then is to estimate θ1 − θ2 and to find an approximate 95% confidence interval for θ1 − θ2.

Suppose now that we have taken our sample, and that x1 of the n1 women are pro-choice, and x2 of the n2 men are pro-choice. We would estimate θ1 by x1/n1, which we will write as p1, and would estimate θ2 by x2/n2, which we will write as p2. Thus we would (correctly) estimate θ1 − θ2 by the difference d = p1 − p2.

What are the properties of this estimate? These are determined by the properties of the random variable D = P1 − P2, where, before we took our sample, P1 = X1/n1 is the proportion of women who will be pro-choice and P2 = X2/n2 is the proportion of men who will be pro-choice. Both P1 and P2 are random variables.

Notice that P1 − P2 is a difference. In comparing two groups we are often involved with differences. That is why we have done some probability theory about differences.

Now we know that the mean of P1 is θ1 (this is one of the magic formulas for the proportion of “successes” in the binomial context) and we also know that the mean of P2 is θ2 (this uses the same magic formula). Thus from the first equation in (26), giving the mean of a difference of two random variables with possibly different means, the mean of D is θ1 − θ2. Thus D is an unbiased estimator of θ1 − θ2 and correspondingly d = p1 − p2 is an unbiased estimate of θ1 − θ2. It is “shooting at the right target”. It is the estimate of θ1 − θ2 that we will use.

More important: how precise is this estimate? To answer this we have to find the variance of the estimator P1 − P2. Now the variance of P1 is θ1(1 − θ1)/n1 (from the variance of the proportion of successes in n1 binomial trials). Similarly the variance of P2 is θ2(1 − θ2)/n2. From the second equation in (26), giving the variance of a difference of two random variables with possibly different variances, the variance of D is θ1(1 − θ1)/n1 + θ2(1 − θ2)/n2.

We do not of course know the numerical value of this variance, since we do not know the values of θ1 and θ2. However we have an estimate of θ1, namely p1, and an estimate of θ2, namely p2. So we could estimate this variance by p1(1 − p1)/n1 + p2(1 − p2)/n2.


Using the same sort of argument that led to (46), we could then say

P(p1 − p2 − 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2) < θ1 − θ2 < p1 − p2 + 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2)) ≈ 0.95.     (57)

This leads to the so-called (approximate) 95% “confidence interval” for θ1 − θ2 of

p1 − p2 − 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2)  to  p1 − p2 + 2√(p1(1 − p1)/n1 + p2(1 − p2)/n2).     (58)

These formulas are pretty clumsy, so we carry out the same approximation that we did when estimating a single binomial parameter (see the discussion leading to (48)). That is, we use the mathematical fact that neither p1(1 − p1) nor p2(1 − p2) can ever exceed 1/4. Further, for quite a wide range of values of any fraction f near 1/2, f(1 − f) is quite close to 1/4. So if we approximate both p1(1 − p1) and p2(1 − p2) by 1/4, and remember that √(1/4) = 1/2, we arrive at a conservative confidence interval for θ1 − θ2 as

p1 − p2 − √(1/n1 + 1/n2)  to  p1 − p2 + √(1/n1 + 1/n2).     (59)

Numerical example. Suppose that we interview n1 = 1,000 women and n2 = 800 men on the pro-life/pro-choice question. We find that 624 of the women are pro-choice and 484 of the men are. So we estimate θ1 by p1 = 624/1,000 = 0.624 and we estimate θ2 by p2 = 484/800 = 0.605. So we estimate the difference between the proportion of women who are pro-choice and the proportion of men who are pro-choice to be 0.624 − 0.605 = 0.019. Further, we are approximately 95% certain that the actual difference θ1 − θ2 is between 0.019 − √(1/1,000 + 1/800) = 0.019 − 0.047 = −0.028 and 0.019 + √(1/1,000 + 1/800) = 0.019 + 0.047 = 0.066. A TV commentator would call 0.047 the “margin of error”.

Later, when we do hypothesis testing, we will see if the estimate 0.019 differs significantly from 0.
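Here is the same calculation written out as a short sketch (plain Python, for illustration only), using the conservative interval (59).

    import math

    n1, x1 = 1000, 624        # women: sample size and number pro-choice
    n2, x2 = 800, 484         # men: sample size and number pro-choice

    p1, p2 = x1 / n1, x2 / n2
    d = p1 - p2                              # estimate of theta1 - theta2
    margin = math.sqrt(1 / n1 + 1 / n2)      # the conservative margin of error in (59)
    print(d, d - margin, d + margin)         # about 0.019, -0.028 and 0.066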

Estimating the difference between two means

As in the previous section, let’s start with an example. In fact we will follow the structure of the last section fairly closely. Is the mean blood pressure of women equal to the mean blood pressure of men? We approach this question from a statistical point of view as follows.

Let µ1 be the (unknown) mean blood pressure for a woman and µ2 be the (unknown) mean blood pressure for a man. So we are interested in the difference µ1 − µ2. Our aim is


to take a sample of n1 women and n2 men and measure the blood pressures of all n1 + n2 people. Our aim then is to estimate µ1 − µ2 and to find an approximate 95% confidence interval for µ1 − µ2.

Clearly we estimate the mean blood pressure for women by x̄1, the average of the blood pressures of the n1 women in the sample, and similarly we estimate the mean blood pressure for men by x̄2, the average of the blood pressures of the n2 men in the sample. We then estimate µ1 − µ2 by x̄1 − x̄2.

How accurate is this estimate? This depends on the variance of the random variable X̄1 − X̄2. Using the formula for the variance of a difference, as well as the formula for the variance of an average, this variance is σ1²/n1 + σ2²/n2, where σ1² is the (unknown) variance of blood pressure among women and σ2² is the (unknown) variance of blood pressure among men. We do not know either σ1² or σ2², and these will have to be estimated from the data. If the blood pressures of the n1 women are denoted x11, x12, . . . , x1n1, we estimate σ1² (see equation (54)) by

s1² = (x11² + x12² + · · · + x1n1² − n1(x̄1)²)/(n1 − 1).     (60)

Similarly, if the blood pressures of the n2 men are denoted x21, x22, . . . , x2n2, we estimate σ2² (see equation (54)) by

s2² = (x21² + x22² + · · · + x2n2² − n2(x̄2)²)/(n2 − 1).     (61)

Thus we estimate σ1²/n1 + σ2²/n2 by s1²/n1 + s2²/n2.

Finally our approximate 95% confidence interval for µ1 − µ2 is

x̄1 − x̄2 − 2√(s1²/n1 + s2²/n2)  to  x̄1 − x̄2 + 2√(s1²/n1 + s2²/n2).     (62)
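Formula (62) translates directly into code. The sketch below (Python with numpy, purely illustrative; the blood pressure readings are invented numbers used only to show the calculation) computes the interval from two samples.

    import numpy as np

    # hypothetical blood pressure readings, invented only to show the calculation
    women = np.array([118.0, 125.0, 130.0, 121.0, 135.0, 128.0])
    men = np.array([131.0, 127.0, 140.0, 133.0, 138.0])

    x1bar, x2bar = women.mean(), men.mean()
    s1_sq = women.var(ddof=1)          # estimate of sigma1 squared, as in (60)
    s2_sq = men.var(ddof=1)            # estimate of sigma2 squared, as in (61)

    half_width = 2 * np.sqrt(s1_sq / len(women) + s2_sq / len(men))
    print(x1bar - x2bar - half_width, x1bar - x2bar + half_width)   # the interval (62)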


Regression

How does one thing depend on another? How does the GNP of a country depend on the number of people in full-time employment? How does the reaction time of a person to some stimulus depend on the amount of sleep deprivation administered to that person? How does the growth height of a plant in a greenhouse depend on the amount of water that we give the plant during the growing period? Many practical questions are of the “how does this depend on that?” type.

These questions are answered by the technique of regression. Regression problems can get pretty complicated, so we consider here only one case of regression: How does some random non-controllable quantity Y depend on some non-random controllable quantity x?

Notice two things about the notation. First, we denote the random quantity in upper case - see Y above. This is in accordance with the notational convention of denoting random variables in upper case. We denote the controllable non-random quantity in lower case - see x above. Second, we are denoting the random variable by Y. Up to now we have denoted random variables using the letter X. We switch the notation from X to Y in the regression context because we will later plot our data values in the standard x-y plane, and it is natural to plot the observed values of the random variable as the y values.

We will use the plant and water example to demonstrate the central regression concepts.

First we think of the situation before our experiment, and consider some typical generic plant to which we will plan to give x units of water. At this stage the eventual growth height Y of this plant is a random variable - we do not know what it will be. We make the assumption that the mean of Y is of the “linear” form:-

mean of Y = α + βx, (63)

where α and β are parameters, that is quantities whose value we do not know. In fact our main aim once the experiment is finished is to estimate the numerical values of these parameters and to get some idea of the precision of our estimates. Note that we assume that the mean growth height potentially depends on x, and indeed our main aim is to assess the way it depends on x.

We also assume that

variance of Y = σ²,     (64)

where σ² is another parameter whose value we do not know and which, once the experiment is finished, we wish to estimate. The fact that there is a (positive) variance for Y derives from the fact that there is some, perhaps much, uncertainty about what the value of the plant growth will be after we have done the experiment. There are many factors that we


do not know about, such as soil fertility, which imply that Y is a random variable, with a variance.

So we are involved with three parameters, α, β and σ². We do not know the value of any one of them. As stated above, one of our aims, once the experiment is over, is to estimate these parameters from our data.

Of these three the most important one to us is β. The interpretation of β is that it is the mean increase in growth height per unit increase in the amount of water given. If β = 0 this mean increase is zero, and equation (63) shows that the mean growth height does not depend on the amount of water that we give the plant. So we will be interested, later, in seeing if our estimate of β, once we have our data, is close to zero or not.

Taking a break from regression for a moment, equation (63) reminds us that the algebraic equation y = a + bx defines a geometric line in the x-y plane, as shown:-

The interpretation of a is that it is the intercept of this line on the y axis (as shown). The interpretation of b is that it is the slope of the line. If b = 0 the line is horizontal, and then the values of y for points on the line are all the same, whatever the value of x.

Now back to the regression context. We plan to use some pre-determined number n of plants in our greenhouse experiment, planning to give the plants respectively x1, x2, . . . , xn units of water. These x values do not have to be all different from each other, but it is essential that they are not all equal. In fact there is a strategy question about how we would choose the values of x1, x2, . . . , xn which is discussed later.

We are still thinking of the situation before we conduct our experiment. At this stage we conceptualize about the growth heights Y1, Y2, . . . , Yn of the n plants. (Y1 corresponds to the plant getting x1 units of water, Y2 corresponds to the plant getting x2 units of water, and so on.) These are all random variables - we do not know in advance of doing the experiment what values they will take. Then from equation (63), the mean of Y1 is α + βx1, the mean of Y2 is α + βx2, and so on. The variance of Y1 is σ², the variance of Y2 is also σ², and so


on. We assume that the various Yi values are independent. However they are clearly not assumed to be identically distributed, since if for example xi ≠ xj, that is the amount of water to be given to plant i differs from that to be given to plant j, the means of Yi and Yj are different if β ≠ 0 and the assumptions embodied in (63) are true.

Equation (63) shows that the mean of Y is a linear function of x. This means that once we have our data they should (if the assumption in (63) is correct) approximately lie on a straight line. We do not expect them to lie exactly on a straight line: we can expect random deviations from a straight line because of factors unknown to us such as differences in soil composition among the pots that the various plants are grown in, temperature differences from the environment of one plant to another, etc. The fact that deviations from a straight line are to be expected is captured by the concept of the variance σ². The larger this (unknown to us) variance is, the larger these deviations from a line would tend to be.

All the above refers to the situation before we conduct our experiment. We now do the experiment, and we obtain growth heights y1, y2, . . . , yn. (The plant getting x1 units of water had growth height y1, the plant getting x2 units of water had growth height y2, and so on.)

The first thing that we have to do is to plot the (x1, y1), (x2, y2), . . . , (xn, yn) values on a graph. Equation (63) shows that when we do this the data points should “more or less” lie on a straight line. Suppose that our data points are as shown below:-

These data points “more or less” lie on a straight line. If the data points are more or less on a straight line (as above, and deciding this is really a matter of judgement) we can go ahead with our analysis. If they are clearly not on a straight line (see the example below) then you should not proceed with the analysis.


[Figure: an example of data points that clearly do not lie close to a straight line.]

There are methods for dealing with data that clearly do not lie close to being on a straight line, but we do not consider them here. So from now on we assume that the data are “more or less” on a straight line.

Our first aim is to use the data to estimate α, β and σ². To do this we have to calculate various quantities. These are

x̄ = (x1 + x2 + · · · + xn)/n,   ȳ = (y1 + y2 + · · · + yn)/n,     (65)

as well as the quantities sxx, syy and sxy, defined by

sxx = (x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)²,     (66)

syy = (y1 − ȳ)² + (y2 − ȳ)² + · · · + (yn − ȳ)²,     (67)

sxy = (x1 − x̄)(y1 − ȳ) + (x2 − x̄)(y2 − ȳ) + · · · + (xn − x̄)(yn − ȳ).     (68)

The most important parameter is β, since if β = 0 the growth height for any plant does not depend on the amount of water given to the plant. The derivation of unbiased estimates here is complicated, so we just give the “trust me” results:-

We estimate β by b, defined by

b = sxy/sxx.     (69)

We estimate α by a, defined by

a = ȳ − b x̄.     (70)

Finally we estimate σ² by sr², defined by

sr² = (syy − b²sxx)/(n − 2).     (71)

These are the three estimates that we want for our further analysis.


Notes on this.

1. The suffix “r” in sr² stands for the word “regression”. The formula for sr² relates only to the regression context.

2. You will usually do a regression analysis by a statistical package (we will do an example in class), so in practice you will usually not have to do the computations for these estimates.

3. It can be shown (the math is too difficult to give here) that a is an unbiased estimate of α, that b is an unbiased estimate of β, and that sr² is an unbiased estimate of σ².

4. How accurate is the estimate b of β? Again here there is some difficult math that you will have to take on trust. The bottom line is that we are about 95% certain that β is between

b − 2sr/√sxx  and  b + 2sr/√sxx.     (72)

This is our (approximate) 95% confidence interval for β. Clearly the “2” in this result comes from the two-standard-deviation rule. You will have to take the 2sr/√sxx part of it on trust.

5. This result introduces a “strategy” concept into our choice of the values x1, x2, . . . , xn, the amounts of water that we plan to put on the various plants. The width of this confidence interval is proportional to 1/√sxx. Thus the larger we make √sxx the shorter is the length of this confidence interval and the more precise we can be about the value of β. How can we make √sxx large? We can do this by spreading the x values as far away from their average as we reasonably can. However two further considerations then come into play. We should keep the various x values within the range of values which is of interest to us. Also, we do not want to make half the x values at the lower end of this “interesting range” and the other half at the upper end of this “interesting range”. If we did this we would have no idea what is happening in the middle of this range - see the picture below to illustrate the case where we put about half our x values at the same low value and the other half of our x values at the same high value.


So in practice we tend to put quite a few x values near the extremes but also string quite a few x values in the middle of the interesting range. This will be illustrated in a numerical example later.

6. A second result goes in the other direction. Suppose that the amounts of water put on the various plants were close to each other. In other words the x values would be close to each other and all would then be close to their average. This would mean that sxx would be small. So 2sr/√sxx would be large, and the confidence interval (72) would be wide. We then have little confidence in our estimate of β.

An even more extreme case arises if we give all plants the same amount of water. Then both sxx and sxy would be zero, and the definition of b shows that we would calculate b as 0/0, which mathematically makes no sense. In fact the formula for b is telling you: “You can’t estimate β with the data that you have”. The formula here is definitely sending you a message. In fact it is saying: “You want to assess how the growth height depends on the amount of water given to the plant. If you give all plants the same amount of water there is no way that you can do this”. It would be the same as a situation where you wanted to assess how the height of a child depended on his/her age, and all the children in your sample were of exactly the same age. You clearly could not make this assessment with data of that type.

Example. We will do an example from the “water and plant growth” situation.

We have n = 12 plants to which we gave varying amounts of water (see below). After the experiment we obtained the following data:-

Plant number 1 2 3 4 5 6 7 8 9 10 11 12

Amount of water 16 16 16 18 18 20 22 24 24 26 26 26

Growth height 76.2 77.1 75.7 78.1 77.8 79.2 80.2 82.5 80.7 83.1 82.2 83.6

From these we compute x̄ = (16 + 16 + · · · + 26)/12 = 21 and ȳ = (76.2 + 77.1 + · · · + 83.6)/12 = 79.7. Also we find sxx = 188, syy = 83.54 and sxy = 122.4.

We now compute our estimate b of β as sxy/sxx = 122.4/188 = 0.6510638. (This result is given to 7 decimal places so as to compare with the JMP printout. In practice you are not justified in giving it to an accuracy greater than that of the data, so in practice we would write b = 0.65.)


Next, our estimate a of α is ȳ − b x̄ = 79.7 − (0.6510638 × 21) = 66.02766. (Again this result is given to 7 decimal places so as to compare with the JMP printout. In practice you are not justified in giving it to an accuracy greater than that of the data, so in practice we would write a = 66.03.)

Finally we estimate σ² by sr², calculated in this case (see (71)) as

(83.54 − (0.6510638)²(188))/10 = 0.384979.

How accurate is our estimate b of β? First, from the theory it is unbiased. It was found by a process which is truly “aiming at β”. Next, we are approximately 95% certain that β is between

0.65 − 2√0.384979/√188  to  0.65 + 2√0.384979/√188,     (73)

that is from 0.56 to 0.74.
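All of the numbers in this example can be reproduced with a short sketch (Python with numpy, given only as a check on the arithmetic; in the course this output comes from the JMP analysis).

    import numpy as np

    water = np.array([16, 16, 16, 18, 18, 20, 22, 24, 24, 26, 26, 26], dtype=float)
    height = np.array([76.2, 77.1, 75.7, 78.1, 77.8, 79.2,
                       80.2, 82.5, 80.7, 83.1, 82.2, 83.6])
    n = len(water)

    xbar, ybar = water.mean(), height.mean()
    sxx = np.sum((water - xbar) ** 2)               # (66)
    syy = np.sum((height - ybar) ** 2)              # (67)
    sxy = np.sum((water - xbar) * (height - ybar))  # (68)

    b = sxy / sxx                                   # (69): about 0.6510638
    a = ybar - b * xbar                             # (70): about 66.02766
    sr_sq = (syy - b ** 2 * sxx) / (n - 2)          # (71): about 0.384979

    half_width = 2 * np.sqrt(sr_sq) / np.sqrt(sxx)  # from (72)
    print(b - half_width, b + half_width)           # about 0.56 and 0.74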

Notes on this

1. Our so-called “estimated regression line” is y = 66.03 + 0.65x. That is the equation of the line that appears on the JMP screen. We could use this line, for example, to say that we estimate the mean growth height for a plant given 21 units of water to be 66.03 + 0.65 × 21 = 79.68.

2. Never extrapolate beyond the x values in the experiment. For example it is not appropriate to say that we estimate the mean growth height for a plant given 1,000 units of water is 66.03 + 0.65 × 1,000 = 716.03. (You probably would have killed the plant if you gave it this much water.)

3. Later we will consider testing the hypothesis that the growth height of the plant does not depend on the amount of water given to it. This is equivalent to testing the hypothesis β = 0.

4. Notice the choices of the amounts of water in the above example. We gave three plants the lowest amount of water (16) and three plants the highest amount of water (26). We also strung a few values out between these values, in accordance with the discussion above about the choice of x values.

We will do this example by JMP in class. There will also be a handout discussing the JMP output and the interpretation of various things in this output. That handout should be regarded as part of these notes.


Testing hypotheses

Background

In hypothesis testing we attempt to answer questions. Here are some simple examples.

Is this coin fair? Is a woman equally likely to be left-handed as a man is? Is there any difference between men and women so far as blood pressure is concerned? Is there any effect of the amount of water given to a plant on its growth height?

We always re-phrase these questions in terms of questions about parameters:-

If the probability of a head is θ, is θ = 1/2? If the probability that a woman is left-handed is θ1, and the probability that a man is left-handed is θ2, is θ1 = θ2? If the mean blood pressure for a woman is µ1, and the mean blood pressure for a man is µ2, is µ1 = µ2? Is β = 0?

We re-phrase these questions in this way because we know how to estimate parameters and to get some idea of the precision of our estimates. So re-phrasing questions in terms of questions about parameters helps us to answer them. Attempting to answer them is an activity of hypothesis testing.

The general approach to hypothesis testing

We will consider two equivalent approaches to hypothesis testing. The first approach pre-dates the availability of statistical packages, while the second approach is to some extent motivated by the availability of these packages. We will discuss both approaches. Both approaches involve five steps. The first three steps in both approaches are the same, and we consider these three steps first. We will illustrate all steps by considering two problems involving the binomial distribution.

Step 1

Statistical hypothesis testing involves the test of a null hypothesis (which we write in shorthand as H0) against an alternative hypothesis (which we write in shorthand as H1). The first step in a hypothesis testing procedure is to declare the relevant null hypothesis H0 and the relevant alternative hypothesis H1. The null hypothesis, as the name suggests, usually states that “nothing interesting is happening”. This comment is discussed in more detail below. The choice of null and alternative hypotheses should be made before the data are seen. Also the nature of the alternative hypothesis must be decided before the data are seen: this is also discussed in more detail below. To decide on a hypothesis as a result of the data is to introduce a bias into the procedure, invalidating any conclusion that might be drawn


from it. Our aim is eventually to accept or to reject the null hypothesis as the result of an objective statistical procedure, using our data in making this decision.

It is important to clarify the meaning of the expression “the null hypothesis is accepted.” This expression means that there is no statistically significant evidence for rejecting the null hypothesis in favor of the alternative hypothesis. A better expression for “accepting” is thus “not rejecting.” So instead of saying “We accept H0”, it is best to say “We do not have significant evidence to reject H0”.

The alternative hypothesis will be one of three types: “one-sided up”, “one-sided down”, and “two-sided”. In any one specific situation which one of these three types is appropriate must be decided in advance of getting the data. The context of the situation will generally make it clear which is the appropriate alternative hypothesis.

All the above seems very abstract, so as stated above we will illustrate the steps in the hypothesis testing procedure by two examples, both involving the binomial distribution.

Example 1

It is essential for a gambling casino that the various games offered are fair, since an astute gambler will soon notice if they are unfair and bet accordingly. As a simplified example, suppose that one game involves flipping a coin, and it is essential, from the point of view of the casino operator, that this coin be fair. The casino operator now plans to carry out a hypothesis testing procedure. If the probability of getting “head” on any flip of the coin is denoted θ, the null hypothesis H0 for the casino operator then states that θ = 1/2. (No bias in the coin. Nothing interesting happening.)

In the casino example it is important, from the point of view of the casino operator, to detect a bias of the coin towards either heads or tails (if there is such a bias). Thus in this case the alternative hypothesis H1 is the two-sided alternative θ ≠ 1/2. This alternative hypothesis is said to be “composite”: it does not specify some numerical value for θ (as H0 does). Instead it specifies a whole collection of values. It often happens that the alternative hypothesis is composite.

Example 2

This example comes from the field of medical research. Suppose that we have been using some medicine for some illness for many years (we will call this the “current” medicine), and we in effect know from much experience that the probability of a cure with the current medicine is 0.84. A new medicine is proposed and we wish to assess whether it is better than the current medicine. Here the only interesting possibility is that it is better than


the current medicine: if it is equally effective as the current medicine, or (even worse) less effective than the current medicine, we would not want to introduce it.

Let θ be the (unknown) probability of a cure with the new medicine. Here the null hypothesis is θ = 0.84. If this null hypothesis is true the new medicine is equally effective as the current one, since its cure rate would be equal to that of the current medicine. The natural alternative in this case is “one-sided up”, namely θ > 0.84. This is the only case of interest to us. This is also a composite hypothesis.

Notice how, in both examples, the nature of the alternative hypothesis is determined by the context, and that in both cases the null and alternative hypotheses are stated before the data are seen.

Step 2

Since the decision to accept or reject H0 will be made on the basis of data derived from some random process, it is possible that an incorrect decision will be made, that is, to reject H0 when it is true (a Type I error, or “false positive”), or to accept H0 when it is false (a Type II error, or “false negative”). This is illustrated in the following table:-

                 We accept H0       We reject H0
H0 is true       OK                 Type I error
H0 is false      Type II error      OK

When testing a null hypothesis against an alternative it is not possible to ensure that the probabilities of making a Type I error and a Type II error are both arbitrarily small unless we are able to make the number of observations as large as is needed to do this. In practice we are seldom able to get enough observations to do this.

This dilemma is resolved in practice by observing that there is often an asymmetry in the implications of making the two types of error. In the two examples given above there might be more concern about making the false positive claim and less concern about making the false negative claim. This would be particularly true in the “medicine” example: we are anxious not to claim that the new medicine is better than the current one if it is not better. If we make this claim and the new medicine is not better than the current one, many millions of dollars will have been spent manufacturing the new medicine, only to find later that it is not better than the current one. For this reason, a frequently adopted procedure is to focus on the Type I error, and to fix the numerical value of this error at some acceptably low level (usually 1% or 5%), and not to attempt to control the numerical value of the Type II error. The value chosen is denoted α. The choice of the values 1% and 5% is reasonable, but is also clearly arbitrary. The choice 1% is a more conservative one than the choice 5% and is often


made in a medical context. Step 2 of the hypothesis testing procedure consists in choosing the numerical value for the Type I error, that is in choosing the numerical value of α. This choice is entirely at your discretion. In the two examples that we are considering we will choose 1% for the medical example and 5% for the coin example.

Step 3

The third step in the hypothesis testing procedure consists in determining a test statistic. This is the quantity calculated from the data whose numerical value leads to acceptance or rejection of the null hypothesis. In the coin example the natural test statistic is the number of heads that we will get after we have flipped the coin in our testing procedure. In the medicine case the natural test statistic is the number of people cured with the new medicine in a clinical trial. These are both more or less obvious, and both are the correct test statistics: however, in more complicated cases the choice of a test statistic is not so straightforward.

As stated above there are two (equivalent) approaches to hypothesis testing. Which approachwe use is simply a matter of our preference. As also stated above the first three steps (asoutlined above) are the same for both approaches. Steps 4 and 5 differ under the twoapproaches, so we now consider them separately.

Approach 1, Step 4

Under Approach 1, Step 4 in the procedure consists in determining which observed values ofthe test statistic lead to rejection of H0. This choice is made so as to ensure that the test hasthe numerical value for the Type I error chosen in Step 2. We first illustrate this step withthe medicine example, where the calculations are simpler than in the coin example. First wereview steps 1 - 3 in this example.

Step 1. We write the (unknown) probability of a cure with the new medicine as θ. The nullhypothesis claims that θ = 0.84 and the alternative hypothesis claims that θ > 0.84.

Step 2. Since this is a medical example we choose a Type I error of 1%.

Step 3. Suppose that we plan to give the new medicine to 5,000 patients. We will reject thenull hypothesis if the number x of patients who were cured with the new medicine is largeenough. In other words x is our test statistic.

Now we proceed to steps 4 and 5.

Step 4. How large does x, the number of patients cured with the new medicine, have to be before we will reject the null hypothesis? We will reject the null hypothesis if x ≥ A, where A is chosen so that the Type I error takes the desired value 1%. We now have to do a probability theory "zig", and consider, before the clinical trial is conducted, the random variable X, the number of people who will be cured with the new medicine. Then we will reject the null hypothesis if x ≥ A, where (using a probability theory "zig") A is chosen so that P(X ≥ A when θ = 0.84) = 0.01.

How do we calculate the value of A? We will use the central limit theorem and a Z chart. Ifθ = 0.84 the mean of X is (5,000)(0.84) = 4,200 and the variance of X is (5,000)(0.84)(.16)= 672, using the formula for the mean and the variance of a binomial random variable.(Why binomial? Because there are two possible outcomes on each trial for each patient -cured or not cured). Next, to a very close approximation X can be taken as having a normaldistribution with this mean and this variance when the null hypothesis is true. So to thislevel of approximation, A has to be such that

P (X ≥ A) = 0.01,

where X has a normal distribution with mean 4,200 and variance 672. We now do a z-ing: we want

P((X − 4,200)/√672 ≥ (A − 4,200)/√672) = 0.01.

Now when the null hypothesis is true, (X − 4,200)/√672 is a Z, and the Z charts now show that (A − 4,200)/√672 has to be equal to 2.326. (You have to use the Z chart "inside-out" to find this value.) Solving the equation (A − 4,200)/√672 = 2.326 we find that A = 4,260.30. To be conservative, we choose the value 4,261.

To conclude step 4, we have made the calculation that if the number of people cured with the new medicine is 4,261 or more we will reject the null hypothesis and claim that the new medicine is superior to the current one.
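As a rough check on this "inside-out" chart lookup, here is a minimal sketch (not part of the notes; it assumes the scipy library is available) that reproduces the critical point.

# Sketch: critical point A for the medicine example (alpha = 1%, one-sided up).
from math import sqrt, ceil
from scipy.stats import norm

n, theta0, alpha = 5000, 0.84, 0.01
mean = n * theta0                        # null hypothesis mean: 4,200
sd = sqrt(n * theta0 * (1 - theta0))     # null hypothesis standard deviation: sqrt(672)

z = norm.ppf(1 - alpha)                  # the value with P(Z >= z) = 0.01, about 2.326
A = ceil(mean + z * sd)                  # conservative critical point: 4,261
print(round(z, 3), A)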

It is now straightforward to do step 5, so we do it.

Approach 1, Step 5

The final step in the testing procedure is straightforward. We do the clinical trial and countthe number of people cured with the new medicine. If this number is 4,261 or larger wereject the null hypothesis and claim that the new medicine is superior to the current one. Ifthis number is less than 4,261 we say that we do not have significant evidence that the newmedicine is better than the current one.


Note on this

The value 4,261 is sometimes called the "critical point" and the range of values "4,261 or more" is sometimes called the "critical region".

The coin example.

First we review steps 1, 2 and 3.

Step 1. We write θ as the probability of a head on each flip. The null hypothesis claims that θ = 1/2 and the alternative hypothesis claims that θ ≠ 1/2.

Step 2. We choose a numerical value for α as 5%.

Step 3. Suppose that we plan to flip the coin 10,000 times in our experiment. The test statistic is x, the number of heads that we will get after we have flipped the coin 10,000 times.

Now we proceed to steps 4 and 5.

Step 4. This test is two-sided, so we will reject the null hypothesis if x is either too large or too small. How large or how small? We will reject the null hypothesis if x ≤ A or if x ≥ B, where A and B have to be chosen so that α = 5%. We now go on a probability theory "zig", and consider the random variable X, the random number of times we will get heads before the experiment is done. We have to choose A and B so that

P (X ≤ A) + P (X ≥ B) = 0.05 when H0 is true.

Choosing A and B so as to satisfy this requirement ensures that the Type I error is indeed 5%. We usually adopt the symmetric requirement

P (X ≤ A) = P (X ≥ B) = 0.025 when H0 is true.

Let us first calculate B. When the null hypothesis is true, X has a binomial distribution with mean 5,000 and variance 2,500 (using the formula for the mean and the variance of a binomial random variable). The standard deviation of X is thus √2,500 = 50. To a sufficiently close approximation, when the null hypothesis is true X has a normal distribution with this mean and this standard deviation. Thus when the null hypothesis is true, (X − 5,000)/50 is a Z. Carrying out a Z-ing procedure, we get

P((X − 5,000)/50 ≥ (B − 5,000)/50) = 0.025.

Since (X − 5,000)/50 is a Z when the null hypothesis is true, the Z charts show that (B − 5,000)/50 = 1.96, and solving this equation for B we get B = 5,098.

Carrying out a similar operation for A we find A = 4,902.
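A minimal sketch (not part of the notes; scipy assumed available) of this two-sided calculation, using norm.ppf as the "inside-out" Z chart lookup.

# Sketch: symmetric critical points A and B for the two-sided coin test (alpha = 5%).
from scipy.stats import norm

mean, sd, alpha = 5000.0, 50.0, 0.05
z = norm.ppf(1 - alpha / 2)      # 1.96, since each tail is assigned 0.025
A = mean - z * sd                # about 4,902
B = mean + z * sd                # about 5,098
print(round(A), round(B))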

Step 5. We now flip the coin 10,000 times, and if the number of heads is 4,902 or fewer, or 5,098 or more, we reject the null hypothesis and claim that we have significant evidence that the coin is biased. If the number of heads is between 4,903 and 5,097 inclusive, we say that we do not have significant evidence to reject the null hypothesis. That is, we do not have significant evidence to claim that the coin is unfair.

Note on this

The values 4,902 and 5,098 are sometimes called the "critical points" and the range of values "x ≤ 4,902 or x ≥ 5,098" is sometimes called the "critical region".

We now consider Approach 2 to hypothesis testing.

As stated above, steps 1, 2 and 3 are the same under Approach 2 as they are under Approach1. So we now move to steps 4 and 5 under Approach 2, again using the coin and the medicineexamples.

Approach 2, Step 4

Under Approach 2 we now do our experiment and note the observed value of the test statis-tic. Thus in the medicine example we do the clinical trial (with the 5,000 patients) andobserve the number of people cured under the new medicine. In the coin example we flipthe coin 10,000 times and observe the number of heads that we got.

Approach 2, Step 5

This step involves the calculation of a so-called P -value. Once the data are obtained wecalculate the probability of obtaining the observed value of the test statistic, or one moreextreme in the direction indicated by the alternative hypothesis, assuming that the nullhypothesis is true. This probability is called the P -value. If the P -value is less than orequal to the chosen Type I error, the null hypothesis is rejected. This procedure alwaysleads to a conclusion identical to that based on the significance point approach.

For example, suppose that in the medicine example the number of people cured under the new medicine was 4,272. Using the normal distribution approximation to the binomial, the P-value is the probability that a random variable X having a normal distribution with mean 4,200 and variance 672 (the null hypothesis mean and variance) takes a value 4,272 or more. This is a straightforward probability theory "zig" operation, carried out using a Z-ing procedure and normal distribution charts. We have

P(X ≥ 4,272) = P((X − 4,200)/√672 ≥ (4,272 − 4,200)/√672),

and since (X − 4,200)/√672 is a Z when the null hypothesis is true, and (4,272 − 4,200)/√672 = 2.78, we obtain, from the Z chart, a P-value of 0.0027. This is less than the chosen Type I error (0.01) so we reject the null hypothesis. This is exactly the same conclusion that we would have reached using Approach 1, since the observed value 4,272 exceeds the critical point 4,261.

As a different example, suppose that the number cured with the new medicine was 4,250. This observed value does not exceed the critical point 4,261, so under Approach 1 we would not reject the null hypothesis. Using the P-value approach (Approach 2), we would calculate the P-value as

P(X ≥ 4,250) = P((X − 4,200)/√672 ≥ (4,250 − 4,200)/√672),

and since (X − 4,200)/√672 is a Z when the null hypothesis is true, and (4,250 − 4,200)/√672 = 1.93, we obtain, from the Z chart, a P-value of 0.0268. This is more than the Type I error of 0.01, so we do not have enough evidence to reject the null hypothesis. That is, we do not have enough evidence to claim that the new medicine is better than the current one. This conclusion agrees with the one we reached under Approach 1.
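The following sketch (not part of the notes; scipy assumed available) reproduces these two one-sided P-values.

# Sketch: Approach 2 P-values for the medicine example (one-sided up alternative).
from math import sqrt
from scipy.stats import norm

mean, sd = 4200.0, sqrt(672.0)        # null hypothesis mean and standard deviation
for observed in (4272, 4250):
    z = (observed - mean) / sd        # the Z-ing step: about 2.78 and 1.93
    p_value = 1 - norm.cdf(z)         # P(Z >= z): about 0.0027 and 0.0268
    print(observed, round(z, 2), round(p_value, 4))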

The coin example

The P-value calculation for a two-sided alternative hypothesis such as in the coin case is more complicated than in the medicine example. Suppose for example that we obtained 5,088 heads from the 10,000 tosses. This is 88 more than the null hypothesis mean of 5,000. The P-value is then the probability of obtaining 5,088 or more heads plus the probability of getting 4,912 or fewer heads if the coin is fair (that is, if the null hypothesis is true), since values 4,912 or fewer are as extreme as, or more extreme than, the observed value 5,088 for a two-sided alternative. For example, 4,906 is more extreme than 5,088 in that it differs from the null hypothesis mean (5,000) by 94, which is more than the 88 by which 5,088 differs.

So using a normal distribution approximation, the P-value is the probability that a random variable having a normal distribution with mean 5,000 and standard deviation 50 takes a value 5,088 or more, plus the probability that a random variable having a normal distribution with mean 5,000 and standard deviation 50 takes a value 4,912 or fewer. Doing a Z-ing, this is 0.0392 + 0.0392 = 0.0784. This exceeds the value chosen for α (0.05), so we do not have enough evidence to reject the null hypothesis. This agrees with the conclusion that we reached using the significance point approach (see Approach 1, step 5).
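A minimal sketch (not part of the notes; scipy assumed available) of the two-sided P-value just described.

# Sketch: two-sided P-value for 5,088 heads in 10,000 flips of a fair coin.
from scipy.stats import norm

mean, sd, observed = 5000.0, 50.0, 5088
z = (observed - mean) / sd                      # 1.76
p_value = (1 - norm.cdf(z)) + norm.cdf(-z)      # upper tail plus lower tail
print(round(p_value, 4))                        # about 0.0784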

DOES ANY OF ALL THIS SEEM FAMILIAR?

Remember the material discussed in the first few days of the class about deductions (implications) and inductions (inferences). In all of the above examples we have started with a probability theory deduction, or implication. A deduction, or implication, starts with the word "If". In all the above examples, and in all hypothesis testing procedures, this is "If the null hypothesis is true, then......". The calculation that followed is a probability theory "zig".

This probability theory calculation was followed by a statistical inference, or induction.This is the “zag” corresponding to the probability theory “zig”. Here are the examples givenabove, formatted in this way.

In the medicine example, under Approach 1: "If the null hypothesis (that the proposed new medicine has the same cure probability, 0.84, as the current medicine) is true, then the probability that the new medicine will cure 4,261 or more people out of 5,000 is 0.01." This is a probability theory "zig".

Suppose we chose a Type I error of 1% (= 0.01) and that, when the experiment had been conducted, we found that the new medicine cured 4,272 people. What is the corresponding statistical inference, or induction? We would reject the null hypothesis, since the observed number cured (4,272) is in the critical region. That is, it is greater than the value 4,261 calculated by the probability theory "zig". This is the corresponding statistical "zag". It cannot be made without the probability theory "zig".

In the coin example, under Approach 1: "If the null hypothesis is true, then the probability that the coin will give 4,902 or fewer heads, or 5,098 or more heads, is 0.05." This is a probability theory "zig". It leads to the critical region: 4,902 or fewer, or 5,098 or more, heads.

Suppose we chose a Type I error of 5% (= 0.05) and that, when the experiment had beenconducted, we found that the number of heads was 4,922. This value is not in the criticalregion, so we say we have no significant evidence that the coin is biased. This is the corre-sponding statistical “zag”. It cannot be made without the probability theory “zig”.

Under Approach 2 (carried out via P-values), we reach the same conclusions - see below.

In the medicine example, under Approach 2: “If the null hypothesis (that the proposednew medicine has the same cure probability, 0.84, as the current medicine) is true, then theprobability that the new medicine will cure 4,272 or more people out of 5,000 is 0.0027.”


When the experiment had been conducted, we found that the new medicine cured 4,272 people. From the probability theory result just calculated, the P-value is 0.0027. This is less than our chosen Type I error of 0.01, so we reject the null hypothesis. This is the corresponding statistical "zag". It cannot be made without the probability theory "zig" (which led to the P-value calculation of 0.0027).

In the coin example, under Approach 2: If the null hypothesis (that the coin is fair) is true,the probability of getting 5,088 or more heads, or 4,912 or fewer heads, is 0.0392 + 0.0392= 0.0784. This is a probability theory “zig”.

When the experiment was conducted, we found that the number of heads was 5,088. From the probability theory result just calculated, the P-value is 0.0784. This is greater than our chosen Type I error of 0.05, so we do not have enough evidence to reject the null hypothesis. This is the corresponding statistical "zag". It cannot be made without the probability theory "zig" (which led to the P-value calculation of 0.0784).

General note. All our probability calculations assume that the null hypothesis is true. This is because we are focusing on the Type I error, which is the probability of rejecting the null hypothesis when it is true. All our hypothesis-testing calculations will be made under the assumption that the null hypothesis is true.

Testing for the equality of two binomial parameters

Suppose we want to test whether there is any difference between men and women in their attitude on the pro-choice / pro-life question. If there is a difference, we have no a priori view as to whether men are more, or are less, likely to be pro-choice than women are.

We can rephrase this question in terms of asking whether two binomial parameters are equal.We write θ1 as the probability that a woman is pro-choice and θ2 as the probability that aman is pro-choice. The null hypothesis (no difference) is

H0: θ1 = θ2 = θ (unspecified).

Note that the value of θ is unspecified. That is, the null hypothesis just claims that θ1 and θ2 are equal: it does not specify what their (common) numerical value is.

Since we have no a priori view as to whether men are more, or are less, likely to be pro-choicethan women are, the alternative hypothesis is two-sided:

H1: θ1 ≠ θ2.


Declaring H0 and H1 completes step 1 of the hypothesis testing procedure.

Step 2. In step 2 we choose α, the numerical value of the Type I error. Let's choose 5%. It is important to note that, since we want to ensure that we do indeed have a value of α equal to 5%, all the calculations that we do below assume that the null hypothesis is true.

Step 3. In this step we create a test statistic. This is a much more complicated business than it was in the "medicine" and the "coin" examples. We have to think in advance what the data will look like. Suppose that we plan to ask n1 women what their attitude is on the pro-choice / pro-life question, and that we plan to ask n2 men what their attitude is on this question. Write n1 + n2 = n, the total number of people whose attitude we will ask about. Of the women, some number x1 will say they are pro-choice and of the men, some number x2 will say they are pro-choice. We can think of the data as being arranged in a so-called "two-by-two" table, as shown.

             pro-choice          pro-life                total
women        x1                  n1 − x1                 n1
men          x2                  n2 − x2                 n2
total        c1 (= x1 + x2)      c2 (= n − x1 − x2)      n (= n1 + n2)

A reasonable start for a test statistic is to compare x1/n1 with x2/n2. (Comparing x1 with x2 makes no sense if n1 ≠ n2.) This observation is enough for us to go back to the time before we took the survey. At this time the number of women who will say they are pro-choice is a random variable, which we write as X1. Similarly the number of men who will say they are pro-choice is a random variable, which we write as X2. The proportion of women who will say they are pro-choice is X1/n1, and this is also a random variable. Similarly the proportion of men who will say they are pro-choice is X2/n2, and this is also a random variable.

Suppose now that the null hypothesis is true. Then (from one of the magic formulas) the mean of X1/n1 is θ and the mean of X2/n2 is also θ. Therefore (from another magic formula) the mean of X1/n1 − X2/n2 is 0.

We next have to find the variance of X1/n1 − X2/n2 when the null hypothesis is true. From yet another magic formula, this is

θ(1 − θ)/n1 + θ(1 − θ)/n2.

This shows that when the null hypothesis is true,

(X1/n1 − X2/n2) / √(θ(1 − θ)/n1 + θ(1 − θ)/n2)          (74)

is a Z, that is, to a very close approximation has a normal distribution with mean 0 and variance 1.

The problem that we have now is that we do not know the numerical value of θ. So wewill have to estimate it from the data. We are assuming that the null hypothesis is true,and so we estimate θ (the probability that a person, male or female, is pro-choice) by theproportion of people in the sample who are pro-choice, namely c1/n. Similarly we estimate1− θ (the probability that a person, male or female, is pro-life) by the proportion of peoplein the sample who are pro-life, namely c2/n. We therefore replace the statistic in (74) by

(X1/n1 − X2/n2) / √((c1/n)(c2/n)/n1 + (c1/n)(c2/n)/n2)          (75)

When the null hypothesis is true, this quantity has, to a reasonable approximation, a "Z" distribution - that is, a normal distribution with mean 0, variance 1.

Our eventual test statistic is calculated from the data, so it is the data analogue of the quantity in (75), namely

(x1/n1 − x2/n2) / √((c1/n)(c2/n)/n1 + (c1/n)(c2/n)/n2)          (76)

This quantity is rather messy, and it can be simplified to

(x1 n2 − x2 n1) / √(c1 c2 n1 n2/n)          (77)

This (after all these calculations) is our test statistic, and this then concludes Step 3.

The procedure in Steps 4 and 5 depends on whether we use Approach 1 or Approach 2. We first consider Approach 1.

Under Approach 1, in Step 4 we ask what values of the test statistic lead us to reject the null hypothesis. Since the alternative hypothesis is two-sided, sufficiently large negative or sufficiently large positive values of the quantity in (77) will lead us to reject the null hypothesis. How large positive or how large negative? This depends on the value we chose for α, namely 5%. It also depends on the fact that if the null hypothesis is true, the random variable in (75) has, to a reasonable approximation, a "Z" distribution. Now "Z" charts show that the probability that a "Z" is less than -1.96 or greater than +1.96 is 0.05 (= 5%). These are our upper and lower significance points. This leads to

Step 5. We get our data and compute the numerical value of the test statistic (77). If this value is -1.96 or less, or +1.96 or more, we will reject the null hypothesis. If the numerical value of the test statistic (77) is between -1.96 and +1.96 we do not have significant evidence to reject the null hypothesis.

The ONLY statistical step is the simple one, Step 5. Essentially all the other steps arerelated to the probability theory part of the procedure. That is why so much emphasis isplaced in this course on probability theory.

A second example

Testing a new medicine is often done using a comparison of the new medicine with a placebo(i.e. a harmless and unbeneficial mixture made out of (say) flour, sugar and water). Supposethat we plan to give the proposed medicine to n1 people and the placebo to n2 people. Wewrite θ1 as the probability that the new medicine leads to a cure and θ2 as the probabilitythat the placebo leads to a cure. The null hypothesis (no difference) is

H0: θ1 = θ2 = θ (unspecified).

If the null hypothesis is true the proposed new medicine is ineffective: its cure probability is the same as that for the placebo. As in the previous example, the value of θ is unspecified. That is, the null hypothesis just claims that θ1 and θ2 are equal: it does not specify what their (common) numerical value is.

Since we are only interested in the possibility that the proposed medicine is beneficial, the alternative hypothesis is one-sided up:

H1: θ1 > θ2.

Declaring H0 and H1 completes step 1 of the hypothesis testing procedure.

Step 2. In step 2 we choose α, the numerical value of the Type I error. Since this is a medical example, we choose 1%. It is important to note that, since we want to ensure that we do indeed have a value of α equal to 1%, all the calculations that we do below assume that the null hypothesis is true.

Step 3. In this step we create a test statistic. Suppose that x1 of the n1 people given the proposed medicine are cured, and that x2 of the n2 people given the placebo are cured. We can form the data in a table just like the one above:


                           cured               not cured               total
given proposed medicine    x1                  n1 − x1                 n1
given placebo              x2                  n2 − x2                 n2
total                      c1 (= x1 + x2)      c2 (= n − x1 − x2)      n (= n1 + n2)

With this interpretation of x1, n1, x2, n2, c1, c2 and n, the test statistic is as in (77) above.

We now have to find what values of the test statistic lead us to reject the null hypothesis. This is a one-sided up test, so we would reject the null hypothesis if x1/n1 is significantly larger than x2/n2. Going back to the original form of the test statistic (76), or equivalently (77), we see that sufficiently large positive values of the statistic (77) will lead us to reject the null hypothesis.

How large? In other words, what is the critical point? Here we have to remember that the Type I error is 1%.

When the null hypothesis is true, the random variable (74), on which the statistic (75) and thus (77) is based, has approximately a normal distribution, mean 0, variance 1. "Z" charts show that the probability that such a random variable exceeds 2.326 is 0.01. So 2.326 is the required critical value: we will reject the null hypothesis if the statistic (77) is 2.326 or more, and otherwise say that we do not have enough evidence to reject the null hypothesis.

Step 5 is now straightforward. We get our data, compute the numerical value of the teststatistic (77), and reject the null hypothesis if this numerical value is equal to, or exceeds,2.326. If it is less than 2.326 we say that we do not have enough evidence to reject the nullhypothesis.

The ONLY statistical step is the simple one, Step 5. Essentially all the other steps arerelated to the probability theory part of the procedure. That is why so much emphasis isplaced in this course on probability theory.

We now consider the procedure under Approach 2. Steps 1, 2 and 3 are the same as for Approach 1, so we start by considering Step 4.

Under Approach 2, Step 4 consists of getting the data and calculating the numerical valueof the test statistic. In both the pro-choice/pro-life example and the medicine example, thisis the test statistic (77) (with the appropriate interpretation of x1, n1, x2, n2, c1, c2 and nin each case - for example, in the pro-choice/pro-life case x1 meant the number of womenwho were pro-choice, while in the medicine case x1 meant the number of people given theproposed medicine who were cured).


Step 5 involves the calculation of the P-value corresponding to the observed numerical value of the test statistic (77). Remember the definition of a P-value: it is the probability of getting the observed value of the test statistic, or one more extreme in the direction indicated by the alternative hypothesis, when the null hypothesis is true. If this P-value is less than or equal to the value of α (the probability of making a Type I error) that we chose in Step 2, we reject the null hypothesis. If the P-value is greater than the value of α that we chose in Step 2 we do not have enough evidence to reject the null hypothesis. This will be illustrated in the numerical examples below.

Numerical example 1: the pro-choice/pro-life example.

Suppose that we ask 300 women their view on the pro-choice/pro-life question, and that188 of these say they are pro-choice. Suppose that we also ask 200 men their view on thepro-choice/pro-life question, and that 118 of these say they are pro-choice. The data tablenow looks like this:-

          pro-choice   pro-life   total
women     188          112        300
men       118          82         200
total     306          194        500

The numerical value of the test statistic (77) is

((188)(200) − (118)(300)) / √((306)(194)(300)(200)/500) = 0.8243.

Remember that the alternative hypothesis in this example is two-sided and that our chosen value of α was 5% (= 0.05). This meant that we would reject the null hypothesis if the numerical value of the test statistic (77) turned out to be -1.96 or less, or +1.96 or more. Given the numerical value 0.8243 of the test statistic, using Approach 1 we say we do not have enough evidence to reject the null hypothesis. In more practical language, we do not have enough evidence to claim that there is a difference between men and women on the pro-choice/pro-life question.

Under Approach 2 we have to calculate the P-value corresponding to the observed value 0.8243. Remembering again that the alternative hypothesis in this example is two-sided, the P-value is the probability that a "Z" is 0.8243 or more, or -0.8243 or less. "Z" charts show that this probability is about 0.205 + 0.205 = 0.41. This exceeds the value 0.05 chosen for α, so we draw the same conclusion that we drew using Approach 1: we do not have enough evidence to claim that there is a difference between men and women on the pro-choice/pro-life question.
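Here is a minimal sketch (not part of the notes; scipy assumed available, and the function name is made up for illustration) that computes the statistic (77) and its two-sided P-value from the four cell counts.

# Sketch: the two-by-two "z" statistic (77) and its two-sided P-value.
from math import sqrt
from scipy.stats import norm

def two_by_two_z(x1, n1, x2, n2):
    # z = (x1*n2 - x2*n1) / sqrt(c1*c2*n1*n2/n), as in (77)
    n = n1 + n2
    c1 = x1 + x2          # first column total
    c2 = n - c1           # second column total
    return (x1 * n2 - x2 * n1) / sqrt(c1 * c2 * n1 * n2 / n)

z = two_by_two_z(188, 300, 118, 200)          # pro-choice/pro-life data
p_two_sided = 2 * (1 - norm.cdf(abs(z)))      # both tails, since H1 is two-sided
print(round(z, 4), round(p_two_sided, 2))     # about 0.8243 and 0.41

The same function applied to the medicine/placebo counts in Numerical example 2 below gives about 3.68, matching the hand calculation there.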


Numerical example 2: the medicine example.

Suppose that we give the proposed new medicine to 1,000 people and that of these, 765 are cured. Suppose that we also gave the placebo to 800 people and that of these, 550 are cured. The data table now looks like this:-

                           cured    not cured   total
given proposed medicine    765      235         1,000
given placebo              550      250         800
total                      1,315    485         1,800

The numerical value of the test statistic (77) is

((765)(800) − (550)(1,000)) / √((1,315)(485)(1,000)(800)/1,800) = 3.68.

We start with Approach 1. Remember that the alternative hypothesis in this example is one-sided up and that our chosen value of α was 1% (= 0.01). This meant that we would reject the null hypothesis if the numerical value of the test statistic (77) turned out to be 2.326 or more. Since the observed value did turn out to be greater than 2.326 we reject the null hypothesis and claim that we have significant evidence that the proposed new medicine is effective.

Under Approach 2 we have to calculate the P-value corresponding to the observed value 3.68. Remembering again that the alternative hypothesis in this example is one-sided up, the P-value is the probability that a "Z" is 3.68 or more. "Z" charts do not go as high as the value 3.68, which is "off the chart". The P-value is less than the P-value corresponding to the last chart value (3.09), which is 0.001. Therefore we draw the same conclusion that we drew using Approach 1: we reject the null hypothesis and claim that we have significant evidence that the proposed new medicine is effective.

Some notes on this.

1. The statistic (77) is sometimes written (in research papers) as “z”. This is because ifthe null hypothesis is true, it is the observed value of a random variable having, to a closeapproximation, a normal distribution with mean 0, variance 1.

2. Let’s run through the various probability theory results that we have used, using the pro-choice/pro-life case as an example. First we recognized that we were in a binomial situation.Second we realized that we were involved with proportions. Then we had to remember theformulas for the mean and the variance of a proportion. Then we realized that we were in-volved with the difference between two proportions, so we had to remember the formulas for

62

Page 63: Intro to Stat (STAT 111) by Ewens

the mean and the variance of a difference. Then we had to remember the “Z”-ing procedure.Finally we had to be able to use the “Z” chart.

3. What are we testing in the pro-choice/pro-life example? We are not testing whether people are more likely to be pro-choice than pro-life. We are testing for a difference between men and women in their attitude on this question.

4. There were various approximations that we used. For example we did not know the variance θ(1 − θ)/n1 + θ(1 − θ)/n2 and we approximated it by the estimate (c1/n)(c2/n)/n1 + (c1/n)(c2/n)/n2. Next, when the null hypothesis is true the test statistic does not exactly have a "Z" distribution (a normal distribution with mean 0, variance 1). However its distribution is very close to a "Z" distribution so we are happy to use this approximation. On the other hand there is a procedure, called "Fisher's exact test", where no approximations at all are made. This is quite complicated and we will not consider this procedure in this course.

5. It is arbitrary, in the data table, which row we use for women (row 1 or row 2), and alsoarbitrary which column we use for pro-choice (column 1 or column 2). We also could haveused the columns for men/women and the rows for pro-choice/ pro-life.

6. It is essential that you use the original numbers, and not for example percentages, in the data table and the resulting calculations. Consider two data tables where the numbers in the first table are all 10 times the numbers in the second table. The percentages are the same in both tables, but the value of z in the first table will be √10 times the value of z in the second table.

7. For obvious reasons, tests of the kind just considered are often called “two-by-two tabletests” (two rows of data, two columns of data). What you are testing using the data in sucha table is association, not cause and effect. Here is an example illustrating this.

Suppose that we take a sample of teen-aged children and classify each child as eating 8 or more junk food meals per week, or eating fewer than 8 junk food meals per week (row classification), and obese or not obese (column classification). Suppose that with the data obtained we get a significant value for z. It is natural to argue that eating the junk food causes obesity. But a fast-food company might claim that children who are by nature obese have a desire to eat junk food, in other words that the cause and effect relation goes the other way.

8. You can think of the statistic

(x1/n1 − x2/n2) / √((c1/n)(c2/n)/n1 + (c1/n)(c2/n)/n2)          (78)

given above, and repeated here for convenience, as a "signal-to-noise" ratio. The numerator, x1/n1 − x2/n2, is the signal - it is your best estimate of θ1 − θ2. However this signal on its own is not enough - it has to be compared to the estimate of its standard deviation (the "noise", that is the denominator in (78)) before you can assess whether it is significant.

9. What is the relation of this to conditional probabilities of events? Let A be the eventthat a person is pro-choice and B be the event that a person is a woman. The null hypoth-esis claims that P (A|B) = P (A). That is, the null hypothesis claims that being given theinformation that a person is a woman does not change the probability that that person ispro-choice.

10. This note, and everything from here on in this section about two-by-two table tests, assumes that the alternative hypothesis is two-sided.

First we change the notation, and write the entries in the data table as

          Column 1   Column 2   total
Row 1     o11        o12        r1
Row 2     o21        o22        r2
total     c1         c2         n
                                              (79)

Here r1 and r2 denote row totals, c1 and c2 denote column totals, n is the grand total, andthe oij notation stands for “observed numbers” in each of the four cells of the table. Forexample, o12 is the observed number in row 1, column 2. This new notation is flexible andallows generalizations to tables that are bigger than two-by-two.

With this new notation, the “z” statistic (78) becomes

(o11/r1 − o21/r2) / √((c1/n)(c2/n)/r1 + (c1/n)(c2/n)/r2)          (80)

Algebraic manipulations show that this can be written as

z = √n (o11 o22 − o21 o12) / √(r1 r2 c1 c2).          (81)

This is the form that we use for “z” from now on.

Suppose that we chose our value of α (the numerical value for the Type I error) to be 5%. Then we would reject the null hypothesis if z ≤ −1.96 or if z ≥ +1.96. (Remember that we are ONLY considering two-sided tests.) This is the same as rejecting the null hypothesis if z² ≥ (1.96)² = 3.8415. Now

z² = n(o11 o22 − o21 o12)² / (r1 r2 c1 c2).          (82)


So we would reject the null hypothesis if z² ≥ 3.8415.

z² is usually called "chi-square" and written as χ². (More precisely it is usually called "chi-square with one degree of freedom". We will discuss degrees of freedom later.) However in this class we use Greek symbols ONLY for parameters, and z² is not a parameter: it is the observed value of a random variable. So we will not use the symbol χ². So we need a new notation for "chi-square". We will soon be generalizing this sort of problem to data tables bigger than two-by-two, where the notation z² is no longer appropriate. So from now on we will use the notation c² for z²: "c" is the first letter in the word "chi".

Numerical example. Let’s re-do the above pro-choice/pro-life example in terms of c2. Thedata were

          pro-choice   pro-life   total
women     188          112        300
men       118          82         200
total     306          194        500

Here c² = 500[(188)(82) − (112)(118)]²/[(300)(200)(306)(194)] = 0.6794. Since this is less than 3.8415, we do not have enough evidence to reject the null hypothesis.

Note that 0.6794 is the square of the value 0.8243 that we calculated earlier for z: this is as it should be, since c² is the square of z.

We will soon generalize this type of problem to data tables that are bigger than two-by-two. To do this it is convenient to re-write c² in a different form. We first form a table of so-called "expected numbers" corresponding to the observed data values in the data table (79). This table is

          Column 1   Column 2   total
Row 1     e11        e12        r1
Row 2     e21        e22        r2
total     c1         c2         n
                                              (83)

The definition of the “expected numbers” e11, e12, e21 and e22 is that

e11 = r1c1/n, e12 = r1c2/n, e21 = r2c1/n, e22 = r2c2/n. (84)

What is the logic behind these definitions? This can be explained by referring to the pro-choice/pro-life data table above. The proportion of women in the sample is 300/500 = 0.6, or 60%. If there is no difference between men and women on the pro-choice/pro-life question, we would expect that about 60% of the 306 people who were pro-choice would be women. Now 60% of 306 is (300)(306)/500. Using the notation of the general data table (79), (300)(306)/500 is r1 c1/n, and this is e11. So if there is no association between row and column modes of categorization, we would expect about r1 c1/n observations in the upper left-hand cell of the data table. Similar arguments lead to the interpretations of e12, e21 and e22.

The alternative (and more flexible) formula for c² is

c² = (o11 − e11)²/e11 + (o12 − e12)²/e12 + (o21 − e21)²/e21 + (o22 − e22)²/e22.          (85)

As a check on the calculations, the numerical values of the eij are

e11 = 183.6, e12 = 116.4, e21 = 122.4, e22 = 77.6. (86)

This leads to a calculation of c² as

c² = (188 − 183.6)²/183.6 + (112 − 116.4)²/116.4 + (118 − 122.4)²/122.4 + (82 − 77.6)²/77.6,          (87)

and this leads to the same value of c², namely 0.6794, as found by the previous formula.
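A minimal sketch (not part of the notes; scipy assumed available) of formula (85): it forms the expected numbers eij = ri cj/n from the margins, sums (observed − expected)²/expected, and looks up the 5% critical point of chi-square with one degree of freedom.

# Sketch: c-squared for the two-by-two table via the expected numbers e_ij = r_i c_j / n.
from scipy.stats import chi2

obs = [[188, 112],    # women: pro-choice, pro-life
       [118, 82]]     # men:   pro-choice, pro-life

r = [sum(row) for row in obs]              # row totals r1, r2
c = [sum(col) for col in zip(*obs)]        # column totals c1, c2
n = sum(r)

c_squared = 0.0
for i in range(2):
    for j in range(2):
        e = r[i] * c[j] / n                # expected number e_ij
        c_squared += (obs[i][j] - e) ** 2 / e

print(round(c_squared, 4), round(chi2.ppf(0.95, df=1), 4))   # about 0.6794 and 3.8415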

Some notes on this.

1. We use the formula (85) for c² because it generalizes to tables that are bigger than two-by-two (see later).

2. You can think of c² as a measure of the "distance" between the set o11, o12, o21, o22 of observed values and the set e11, e12, e21, e22 of "expected" values. If the four marginal totals r1, r2, c1 and c2 are given, these expected values are actually the means of the numbers in the respective cells of the table assuming that the null hypothesis is true.

3. Remember: you use c² only if the original alternative hypothesis is two-sided.

4. When you calculate each eij, the value that you get will usually not be a whole number. To be sufficiently accurate in your calculations, compute each eij value with four decimal place accuracy, and then report your eventual c² value to two decimal place accuracy.

5. Only sufficiently large positive values of c² are significant.

6. Continuing on from note 5, if α, the numerical value of our Type I error, had been 5% (as assumed above) we reject the null hypothesis if c² ≥ 3.8415. What would be the critical point of c² if the numerical value of our Type I error had been 1%? We calculated 3.8415 as the square of 1.96, where P(Z ≤ −1.96) + P(Z ≥ +1.96) = 0.05. Now Z charts show that P(Z ≤ −2.5758) + P(Z ≥ +2.5758) = 0.01. Thus the critical point of c² if the numerical value of our Type I error had been chosen to be 1% is (2.5758)² = 6.6349.

7. Going on from Note 6, you have been given a chi-square chart and you will find thenumber 6.6349 on it (for a numerical value 0.01 for the Type I error and one “degree offreedom”). So what are “degrees of freedom” and what is “one degree of freedom”?

Suppose that the four marginal totals r1, r2, c1 and c2 are given. Then you can only freelyfill in one number in the four cells of the table. The remaining three numbers will thenautomatically be defined. You only have “one degree of freedom”. This concept becomesimportant when we move to tables that are bigger than two-by-two.

8. What is the relation of the procedure to the “deduction/ implications” and the “induc-tions/inferences” material in the first few lectures. Here it is.

Let’s go back to the time before we do our experiment. Our eventual data will be in theform of a two-by-two table. It is simplest to think first of the case where the row and columntotals were both chosen in advance. Before we get our data, the numbers in the four cells ofthe table are random variables, and at this time we can imagine the following table:

          Column 1   Column 2   total
Row 1     O11        O12        r1
Row 2     O21        O22        r2
total     c1         c2         n
                                              (88)

Here for example O11 is the random variable describing the number that we will get for row1, column 1. So we can think of the random quantity

C² = (O11 − e11)²/e11 + (O12 − e12)²/e12 + (O21 − e21)²/e21 + (O22 − e22)²/e22.          (89)

When the null hypothesis is true, C2 is a random variable and has, to a very close ap-proximation, the so-called “chi-square distribution with one degree of freedom”. This is atrust-me result: the theory is very complicated. Tables of critical (= significance) pointsof this distribution were referred to above. Now to the “deduction/ implications” and the“inductions/inferences” aspect of this.

If the null hypothesis is true, the probability that the random variable C2 takes a valuegreater than or equal to 3.8415 is 5%. (We were able to arrive at this number ourselves,using many probability theory results.) This calculation is a probability theory “zig”.

Suppose that we do our experiment and that our data value of c² is 4.032. Given the above probability theory deduction/implication, we can say that we have significant evidence that the null hypothesis is not true. This is the corresponding statistical "zag". It is an easy step to take, given the probability theory deduction/implication.

9. All the above follows Approach 1. Approach 2 would require us, in Step 4, to get the dataand calculate the value of c2. Step 5 would require us to find the P -value corresponding tothis number. Doing this is possible but not easy. So for this hypothesis testing procedurethe most useful approach is Approach 1, together with using charts of critical points of thechi-square distribution.

Tables bigger than two-by-two

We developed the statistic c² above since it generalizes to the case of tables that are bigger than two-by-two. Suppose that we have a table of data with some arbitrary number r of rows and some arbitrary number c of columns. Again we suppose that the row and column totals are fixed in advance of getting the data.

To fix ideas, suppose that we have four strains of mice. There are five coat colors: black,brown, white, yellow and gray. We want to test for association between the strain of a mouseand its coat color.

Step 1. The null hypothesis claims that there is no association and the alternative hypothesisclaims that there is some association (of an unspecified type). This concludes Step 1. Thereis no other allowable null or alternative hypothesis.

In Step 2 we choose a value 5% for α.

To fix ideas, let’s go straight to the data table. The data are (next page):-


                                     color
              black   brown   white   yellow   gray   Total
          1   34      41      20      23       32     150
strain    2   40      47      27      31       55     200
          3   71      77      38      51       63     300
          4   80      81      44      56       89     350
      Total   225     246     129     161      239    1,000
                                                                  (90)

Table 1: Two-way table data.

For Step 3 we generalize the above table and suppose that the data are:-

                              column
              1      2      3      . . .   c      Total
          1   o11    o12    o13    . . .   o1c    r1
row       2   o21    o22    o23    . . .   o2c    r2
          .   .      .      .              .      .
          r   or1    or2    or3    . . .   orc    rr
      Total   c1     c2     c3     . . .   cc     n
                                                           (91)

Table 2: General two-way table data.

We first compute the "expected value" corresponding to the observed value oij. This is eij, defined by eij = ri cj/n. The general chi-square test statistic, given (in terms of "Sigma notation"), is

c² = Σ (i = 1 to r) Σ (j = 1 to c) (oij − eij)²/eij,          (92)

which can be regarded as a measure of the difference of the oij and eij values. (Note that this is a generalized version of the c² given in (85).) This is the end of Step 3: the quantity in (92) is the test statistic. Equivalently,

c² = (o11 − e11)²/e11 + (o12 − e12)²/e12 + . . . + (orc − erc)²/erc.

Steps 4 and 5 depend on the approach taken. We first consider Approach 1.

Step 4. This is a very difficult step. What values of c² will lead us to reject the null hypothesis? We have to go back to the time before we get our data. At this time we are considering random variables, and before we do our experiment we can envision a table such as the following, where the Oij quantities are, at this stage, random variables:-


                              column
              1      2      3      . . .   c      Total
          1   O11    O12    O13    . . .   O1c    r1
row       2   O21    O22    O23    . . .   O2c    r2
          .   .      .      .              .      .
          r   Or1    Or2    Or3    . . .   Orc    rr
      Total   c1     c2     c3     . . .   cc     n
                                                           (93)

Table 3: Two-way table: random variables.

Here is another “trust-me” result. When the null hypothesis is true the random variable

C² = Σ (i = 1 to r) Σ (j = 1 to c) (Oij − eij)²/eij,          (94)

has an approximate chi-square distribution with ν = (r − 1)(c − 1) degrees of freedom. (Note that this is a generalized version of the C² given in (89).) Only sufficiently large values of the observed value c² of C² lead us to reject the null hypothesis. How large? These are found from the chi-square distribution with ν degrees of freedom. Critical (= significance) points of this distribution are given in chi-square charts. The actual significance point relevant to any specific case depends on (i) the value of α chosen in Step 2, and (ii) the number ν of degrees of freedom in that case.

Step 5. Get the data, calculate c² and reject the null hypothesis if the observed value of c² is greater than or equal to the relevant significance point in the chi-square chart.

The "mouse coat color" example.

The first thing that we have to do is to calculate the expected values. The easiest way of remembering how to do this is that the expected value in any cell is the corresponding row total times the corresponding column total divided by the grand total. Thus, for example, the expected value in the upper left-hand cell is (150)(225)/1,000 = 33.7500. (Remember (as I have done here) to carry four decimal place accuracy in the calculations of the expected values.) Continuing in this way we get the following table of expected values:-


                                         color
              black     brown     white     yellow    gray      Total
          1   33.7500   36.9000   19.3500   24.1500   35.8500   150
strain    2   45.0000   49.2000   25.8000   32.2000   47.8000   200
          3   67.5000   73.8000   38.7000   48.3000   71.7000   300
          4   78.7500   86.1000   45.1500   56.3500   83.6500   350
      Total   225       246       129       161       239       1,000
                                                                          (95)

Table 4: Two-way table expected values.

We then evaluate c² as

c² = (34 − 33.7500)²/33.7500 + . . . + (89 − 83.6500)²/83.6500          (96)

(20 terms in all in the sum), and this is about 5.02. This is the end of Step 4.

Step 5 is easy: since 5.02 is well below the critical value 21.03 (12 degrees of freedom, α = 5%), we have no reason to reject the null hypothesis.
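A minimal sketch (not part of the notes; scipy assumed available) of the r × c calculation in (92) for the mouse data; scipy's chi2_contingency routine carries out the same Pearson chi-square computation and also returns the P-value needed for Approach 2.

# Sketch: the r x c chi-square statistic (92) for the mouse strain / coat color data.
from scipy.stats import chi2, chi2_contingency

observed = [
    [34, 41, 20, 23, 32],    # strain 1
    [40, 47, 27, 31, 55],    # strain 2
    [71, 77, 38, 51, 63],    # strain 3
    [80, 81, 44, 56, 89],    # strain 4
]

c_squared, p_value, dof, expected = chi2_contingency(observed)
critical = chi2.ppf(0.95, df=dof)                     # 21.03 for 12 degrees of freedom, alpha = 5%
print(round(c_squared, 2), dof, round(critical, 2))   # about 5.02, 12, 21.03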

All the above follows Approach 1. In Approach 2 we would, in Step 4, get the data and calculate the numerical value of c² (5.02). In Step 5 we would then have to find the P-value corresponding to this. This is extremely difficult and can only be done by a computer package. Thus if you do not have such a package available you have to use Approach 1 and the chi-square chart.

Notes on this.

1. Most of the notes for a 2×2 table continue to hold for general r× c tables. One exceptionis this: while in a 2 × 2 table one could find a P -value by working back to the original zstatistic, in the case of tables bigger than 2× 2 there is no corresponding z statistic and cal-culation of a P -value is very difficult mathematically. You can only use Approach 2 (which isbased on the calculation of a P -value) if you do the analysis by a statistical computer package.

2. As a matter of terminology, 2×2, and more generally r×c, tables are called “contingencytables”.

3. So far we have assumed that the row and column totals in a contingency table were fixed in advance. This is often the case, and this is why we denoted these in lower case, for example ri and cj. However sometimes only the row totals are fixed in advance (which seems to be the case in the mouse example), sometimes only the column totals are fixed in advance, and sometimes neither set of totals is fixed in advance. However, all the above theory continues to hold even when one or both sets of marginal totals were not fixed in advance. This is why we use lower-case letters for row and column totals: we might as well think of them as being fixed in advance.

4. Degrees of freedom. This is an elusive concept, and the calculation of the number of de-grees of freedom in any particular statistical analysis depends on that analysis: The numberof degrees of freedom differs from one form of analysis to another.

In an r × c contingency table analysis the number of degrees of freedom is (r − 1)(c − 1). Thus in the mouse example the number of degrees of freedom is 3 × 4 = 12. Why is this?

In calculating the number of degrees of freedom we take the row and column totals to begiven. Thus in the mouse example we can freely choose four of the five numbers in row 1,but not the fifth number: it must be such that the numbers in this row add up to the totalin row 1. Similarly we can freely choose four of the five numbers in rows 2 and 3, but notthe fifth number in those rows. Having done this we cannot choose any numbers in row 4freely: the numbers in this row must be such as to lead to the given column totals. Thus inthis case we have 4 + 4 + 4 = 12 degrees of freedom. Similar arguments lead to the generalvalue (r− 1)(c− 1) for the number of degrees of freedom in an r× c contingency table. Notethat in a 2× 2 table we only have one degree of freedom.

5. Finally, the details of the probability theory deductive/implication part of this are now obscure to us, since the math is very difficult for a table bigger than 2 × 2. In the mouse example it is, in effect, the following:-

If there is no association between strain and coat color, the probability that the eventuallycomputed value of c2 will be greater than or equal to 21.03 is 5%. The value 21.03 is onlyarrived at after a mathematically difficult probability calculation, hidden to us because ofthe complexities of the math.

The corresponding statistical induction/inference is this. The observed value of c2 is 5.02.Based on the value 21.03 calculated by deductive probability theory methods, we have nosignificant evidence of an association between the strain of a mouse and its coat color.

Another use of chi-square: testing for a specified probability distribution

Chi-square is like a Swiss army knife: it can be used for many quite different purposes. Inthis section we consider a new form of the chi-square statistic that is quite different from theform used in contingency tables. Never confuse this new form of chi-square discussed belowwith the form used in contingency tables. In particular, do not use the statistic (92) for thenew use of chi-square described in this section.

We start with an example. Is this die fair?


This is the same as asking the following question. If X is the (random) number to turn up on a (future) roll of a die, is the probability distribution of X given by the following:-

Possible values of X :    1      2      3      4      5      6
Probability :             1/6    1/6    1/6    1/6    1/6    1/6          (97)

We start (as always) with Step 1. Here there is no choice. The null hypothesis is (in this dieexample) that the die is fair. Equivalently, the null hypothesis is that (97) is the probabilitydistribution of X. The alternative hypothesis is that the die is unfair, but in an unspecifiedway. No other choices are possible. So Step 1 is easy: you have no choice about the null andalternative hypotheses.

In Step 2 we choose the value of α (the probability of making a Type I error when the nullhypothesis is true). As before, we usually choose either 1% or 5%.

The choice of test statistic (Step 3), that is what we will calculate from the eventual data in order to either accept or reject the null hypothesis, is not so obvious. Suppose that we plan to roll the die n times and record the number that turns up on each roll. One possibility for the test statistic is the average of the numbers that will turn up on these n rolls. Before we roll the die this average is a random variable. We know the mean of this average if the null hypothesis is true (3.5) and we also know the variance of this average if the null hypothesis is true (35/12n), and thus we could easily form a z statistic from the eventual data which would allow us to test the null hypothesis.

However this is not a good test statistic, as the following example shows. Suppose that we roll the die 10,000 times, and that a "1" turns up 5,000 times and a "6" turns up 5,000 times. No other number turns up. Clearly this is almost certainly a biased die. Yet the average of the numbers that turned up is 3.5, which is the mean of the (random variable) average to turn up when the null hypothesis is true. If we then used the average of the numbers that did turn up after we had rolled the die n times as our test statistic, we would not reject the null hypothesis. This is clearly unreasonable. So we need a better test statistic.

This observation shows that the choice of a test statistic is not a straightforward matter.Some test statistics will be “better” than others. There is a deep mathematical theory whichleads to the choice of a “best” test statistic in any given testing procedure. We do not con-sider this theory here. Instead we go straight to what this theory shows is the best teststatistic in the “die” example, namely a chi-square statistic. To repeat a comment madeearlier: this chi-square statistic (see details below) is quite different from the chi-squarestatistic used in contingency tables. Here are the details of this new chi-square statistic.

The chi-square statistic for any "testing for a specified probability distribution" situation is similar to that in a contingency table in that it considers a test statistic of the form "sum over all possibilities of (observed − expected)²/expected".

However the details, in particular the details about how we calculate expected numbers,differ from those in a contingency table. To see how this works in the current context wefirst consider the “die” example.

Remember that by “expected value” we mean, loosely, “more or less what we would expectto see if the null hypothesis is true”. So if we plan to roll the die n times, what would theseexpected values be? We would expect to see a 1 turn up about n/6 times, a 2 turn up aboutn/6 times and so on. In fact these are exact means under the null hypothesis. If we considera 1 turning up as a success, then the binomial formula for the mean number of times a 1will turn up is exactly n/6. The same holds for all six numbers, 1 through 6.

Suppose now that we have rolled the die n times, and that a 1 turned up n1 times, a 2 turned up n2 times, ...., a 6 turned up n6 times. Then the appropriate test statistic is

c² = Σ (i = 1 to 6) (ni − n/6)²/(n/6).          (98)

This concludes Step 3: in the die case this is the correct test statistic. (The appropriate test statistic in more general cases will be discussed later.)

Steps 4 and 5 depend on whether we use Approach 1 or Approach 2. We consider Approach 1 first.

Step 4. What values of the test statistic (98) lead us to reject the null hypothesis (that the die is fair)? First, this test statistic (and all chi-square statistics) cannot be negative, and for all chi-square statistics sufficiently large positive values lead us to reject the null hypothesis. In the die case, how large? First, this depends on the value of α chosen in Step 2. For concreteness, suppose that we chose α = 0.05 = 5%.

To proceed further, we have to go back to the time before we roll the die. At this time the numbers N1, N2, . . . , N6 that a 1, a 2, ...., a 6 will turn up are random variables. Thus the quantity

C² = Σ (i = 1 to 6) (Ni − n/6)²/(n/6)          (99)

is a random variable. To a very close approximation it has a chi-square distribution when the null hypothesis is true. (This has been anticipated in the notations c² and C².) How many degrees of freedom does it have? We have to ask how many of the numbers n1, n2, . . . , n6 can be chosen freely, given that the number of rolls (n) is fixed in advance. The answer is five. Once (for example) n1, n2, . . . , n5 are given, n6 is automatically determined. Note that this calculation of degrees of freedom is quite different from that arising in contingency tables.

This implies that we will reject the null hypothesis (with α = 0.05) if c² ≥ A, where A is chosen so that P(C² ≥ A when the null hypothesis is true) = 0.05. The calculation of A is quite complicated, and the relevant value is given in the chi-square chart. Since α = 0.05 and we have five degrees of freedom, the chart shows that A = 11.0705.

Step 5 is now easy. We get the data, compute the value of c², and reject the null hypothesis if the value obtained is 11.0705 or larger. This concludes the procedure under Approach 1.

Under Approach 2, Step 4 is to get the data and compute the value of c² (the first part of Step 5 under Approach 1). Step 5 is to find the P-value associated with the value of c² that we found. This is very difficult and can only be done via a statistical computer package. Thus Approach 2 is feasible only if you have access to such a package.

Notes. Before doing a numerical example we make some notes.

1. The expected numbers (n/6) for the die example are all the same. This is not usually thecase: see a later example.

2. The expected numbers in the die example (and in general) are not necessarily wholenumbers. Calculate them to four decimal place accuracy, and present your eventual c2 valueto two decimal place accuracy.

3. Although the statistic (98) is different from that in a contingency table, it does share thesame characteristic of the contingency table chi-square statistic in that it can be thought ofas a measure of the difference between, or distance between, the various observed numbersand the numbers expected (in fact the mean numbers) if the null hypothesis is true. Thelarger these differences are the larger c2 is, and if these differences are large enough c2 willbe large enough for us to reject the null hypothesis.

Numerical example

We have a die and want to test if it is fair. The null hypothesis is that it is fair and thealternative hypothesis is that it is unfair in some unspecified way Suppose that we choose avalue α = 0.05, that is we choose the probability of making a false positive claim (a Type Ierror) to be 0.05, or 5%. Going straight to Step 5, suppose that we roll the die 5,000 timesand get the following data:-

Number turning up:      1    2    3    4    5    6
Number of times seen:   861  812  820  865  821  821    (100)


Each expected number is 5000/6 = 833.3333 (to four decimal place accuracy). The value ofc2 is thus

c^2 = \frac{(861 - 833.3333)^2}{833.3333} + \frac{(812 - 833.3333)^2}{833.3333} + \frac{(820 - 833.3333)^2}{833.3333} + \frac{(865 - 833.3333)^2}{833.3333} + \frac{(821 - 833.3333)^2}{833.3333} + \frac{(821 - 833.3333)^2}{833.3333},    (101)

and this is 3.2464, or to two decimal place accuracy, 3.24. The 5% critical point is 11.0705,so we do not have enough evidence to claim that the die is unfair. Thinking of this anotherway, a fair die can quite easily give an array of observed numbers such as that above.
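For readers who want to check these numbers, here is a minimal Python sketch (using the NumPy and SciPy packages, which are not part of this course) that computes c2 for the die data, the 5% critical point, and the exact P-value that Approach 2 would report.

    import numpy as np
    from scipy import stats

    observed = np.array([861, 812, 820, 865, 821, 821])
    n = observed.sum()                      # 5,000 rolls
    expected = np.full(6, n / 6)            # 833.3333... under the fair-die null hypothesis

    c2 = ((observed - expected) ** 2 / expected).sum()
    critical = stats.chi2.ppf(0.95, df=5)   # 5% critical point, five degrees of freedom
    p_value = stats.chi2.sf(c2, df=5)       # exact P-value (Approach 2)
    print(c2, critical, p_value)            # roughly 3.25, 11.07 and 0.66

Since 3.25 is well below 11.0705 (equivalently, the P-value is far above 0.05), the conclusion is the same as the one reached with the chart.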

Another example

Before considering the general case we consider a second example. The seeds of a certainplant are either yellow and smooth (YS), yellow and wrinkled (YW), green and smooth(GS) or green and wrinkled (GW). A simple genetic theory claims that in a given systemof crossing (mating), the seed colorand smoothness of any plant will be YS with probability9/16, YW with probability 3/16, GS with probability 3/16 or GW with probability 1/16.We wish to test this theory. (By way of background, this theory claims that the color of theseed is determined by the genes at one genetic locus, with the probability that these geneslead to yellow seeds being 3/4 and lead to green seeds with probability 1/4, and that thesmoothness of the seed is determined independently by the genes at a different genetic locus,with the probability that these genes lead to smooth seeds being 3/4 and lead to wrinkledseeds with probability 1/4. There are reasons why this theory might not be correct: if forexample the color and smoothness gene loci are on the same chromosome, the independenceassumption is not correct.)

Step 1. The null hypothesis is that the theory is correct, and the alternative hypothesis isthat it is incorrect, but in an unspecified way. These are the only allowable null and alter-native hypotheses.

Step 2. In this step we choose the value of α: let’s choose 5%.

Step 3. The test statistic will be of the chi-square form, but the details will differ fromthose in the die example. Suppose that we observe the seeds of n plants. Then if the nullhypothesis is true, the expected number of plants with YS seeds is (9n)/16, the expectednumber of plants with YW seeds is (3n)/16, the expected number of plants with GS seedsis (3n)/16, the expected number of plants with GW seeds is n/16. These expected numbers(unlike those in the die example) are not all equal, and this is usually the case in chi-squareprocedures of this kind. If we eventually see, in our sample of n plants, n1 YS plants, n2 YWplants, n3 GS plants, and n4 GW plants, the test statistic will be of the standard (chi-square)


form for this kind of test, namely

c^2 = \frac{(n_1 - 9n/16)^2}{9n/16} + \frac{(n_2 - 3n/16)^2}{3n/16} + \frac{(n_3 - 3n/16)^2}{3n/16} + \frac{(n_4 - n/16)^2}{n/16}.    (102)

Steps 4 and 5 depend on whether we use Approach 1 or Approach 2. We consider Approach 1 first. Under Approach 1 we have to determine the critical value of c2 that will lead us to reject the null hypothesis. The probability theory “zig” that determines this is that c2 is the observed value of a random variable which, when the null hypothesis is true, has (to a very close approximation) a chi-square distribution with three degrees of freedom. Why three? Because if n, the sample size, is given, only three of the numbers n1, n2, n3 and n4 can be freely chosen.

Chi-square charts of critical points then show that the null hypothesis will be rejected ifc2 ≥ 7.8147.

As always, Step 5 is now straightforward. We get the data (that is the values of n1, n2, n3, n4

and n), compute the value of c2, and reject the null hypothesis if c2 ≥ 7.8147. If c2 < 7.8147we say we do not have enough evidence to reject the null hypothesis.

Approach 2. Under Approach 2 we get the data and compute the value of c2 (Step 4). In Step 5 we find the P-value associated with this value of c2. This is very difficult and can only be done with a computer statistical package. If the P-value is 0.05 or less we reject the null hypothesis. If the P-value is greater than 0.05 we say we do not have enough evidence to reject the null hypothesis.

Numerical example (This example describes the classic experiment by Mendel that led to thediscovery of genes, chromosomes, and eventually to genomics, so crucial in today’s scientificresearch.) The situation (yellow or green, smooth or wrinkled) is as above, and as above wechoose α = 0.05. In this example n = 250, so that the four expected numbers are 9(250)/16= 140.6250 for YS, 3(250)/16 = 46.8750 for YW, 3(250)/16 = 46.8750 for GS, and 250/16 =15.6250 for GW. The observed numbers were n1 = 152, n2 = 39, n3 = 53 and n4 = 6. Thus,following (102),

c^2 = \frac{(152 - 140.6250)^2}{140.6250} + \frac{(39 - 46.8750)^2}{46.8750} + \frac{(53 - 46.8750)^2}{46.8750} + \frac{(6 - 15.6250)^2}{15.6250} = 8.97.    (103)

This value exceeds 7.8147, and so we reject the null hypothesis. (This turned out to be the correct conclusion: the assumption of independence was not correct. The genes for color and smoothness are on the same chromosome. Finding this out was a definite step forward in genetics, in particular in establishing the fact that genes lie on chromosomes.)
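A similar check can be made for the seed data. The sketch below (again Python/SciPy rather than the JMP package used in class) uses scipy.stats.chisquare, which computes the statistic in (102) and its P-value directly; the 3 degrees of freedom are k − 1 = 4 − 1.

    import numpy as np
    from scipy import stats

    n = 250
    observed = np.array([152, 39, 53, 6])             # YS, YW, GS, GW counts
    null_probs = np.array([9, 3, 3, 1]) / 16          # probabilities under the null hypothesis
    expected = n * null_probs                         # 140.625, 46.875, 46.875, 15.625

    c2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    print(c2, p_value)    # about 8.97 and 0.03, so we reject at alpha = 0.05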


The general case

The die and the genetics examples should indicate how this test is done in the general case.

We have k categories, which we call categories 1, 2, ...., k. For example, in the genetics case,category 1 is “yellow and smooth”.

Suppose that we plan to take n independent observations. The null hypothesis states thatthe probability that any of these observations will fall in category 1 is P0(1), the probabilitythat an observation is in category 2 is P0(2), . . . , the probability that an observation is incategory k is P0(k). (The suffix “0” denotes null hypothesis.) These probabilities are givennumerical values. That is, they do not involve parameters.

Suppose that we have now done our experiment, and that n1 observations fell in category 1, n2 fell in category 2, . . . , nk fell in category k.

Let’s go through the five hypothesis testing steps.

Step 1. The null hypothesis is as given above, stating the respective probabilities for anobservation to fall in the various catogories. The alternative hypothesis is that these are notthe correct probabilities, but the alternative hypothesis does not specify in what way theyare incorrect.

Step 2. Choose the value of α (1%, 5%).

Step 3. The test statistic is the general form of that used in the “die” and the “genetics”examples namely,

c^2 = \frac{(n_1 - nP_0(1))^2}{nP_0(1)} + \cdots + \frac{(n_k - nP_0(k))^2}{nP_0(k)}.    (104)

Approach 1, Step 4. What values of c2 lead to rejection of the null hypothesis? Sufficientlylarge (positive) values. How large? This depends on the value of α chosen in Step 2, and thenumber of degrees of freedom. So, how many degrees of freedom do we have? The answer isk− 1. NOTE: not n-1. The chart of critical points of chi-square will now show how large c2

has to be before we reject the null hypothesis.

Approach 1, Step 5. Get the data, calculate c2, and if the value found is greater than orequal to the appropriate chart value, reject the null hypothesis. If the value found is lessthan the appropriate chart value, we do not ave enough evidence to reject the null hypothesis.

Approach 2, Step 4. Get the data and calculate c2.


Approach 2, Step 5. Find the P-value associated with the observed value of c2. This can only be done by a computer statistical package. If the P-value is less than or equal to α we reject the null hypothesis. If the P-value is greater than α we do not have enough evidence to reject the null hypothesis.
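The whole recipe for this “is this the correct distribution?” test can be wrapped in one small function. The sketch below is illustrative only (the function name and default α are my own choices, not part of the course material); it implements equation (104) with k − 1 degrees of freedom.

    import numpy as np
    from scipy import stats

    def goodness_of_fit(counts, null_probs, alpha=0.05):
        # Chi-square "is this the correct distribution?" test, equation (104).
        counts = np.asarray(counts, dtype=float)
        null_probs = np.asarray(null_probs, dtype=float)
        n = counts.sum()
        expected = n * null_probs
        c2 = ((counts - expected) ** 2 / expected).sum()
        df = len(counts) - 1                       # k - 1 degrees of freedom, NOT n - 1
        critical = stats.chi2.ppf(1 - alpha, df)   # Approach 1 chart value
        p_value = stats.chi2.sf(c2, df)            # Approach 2 exact P-value
        return c2, critical, p_value

    # the die example: reject only if c2 >= 11.0705
    print(goodness_of_fit([861, 812, 820, 865, 821, 821], [1/6] * 6))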

Notes on this.

1. Most of the notes for r-by-c tables also apply for this form of chi-square. In particular,use the actual counts and not percentages.

2. Remember that the number of degrees of freedom is k − 1, not n − 1. Remember alsothat this is a different use of chi-square from the use in r-by-c tables.

3. Most of the time the categories are descriptive, such as “green and smooth”. In the dieexample we could follow this and say that category 1 is “a 1 turns up”, that category 2 is“a 2 turns up”, etc.

4. Remember that this “is this the correct distribution?” use of chi-square is completelydifferent from the use of chi-square in an r-by-c table.

A problem for you.

We conduct 1,000 binomial trials. The probability of success on each trial is θ. The nullhypothesis claims that θ = 0.3. We actually saw 321 (= n1) successes (and thus n2 = 679failures) once the trials were conducted.

1. Focus only on the number of successes. Calculate a z statistic, using the observed numberof successes (321) and the null hypothesis mean and variance of the number of successes in1,000 trials. From this calculate z2. Do this exactly.

2. Focus only on the proportion of trials giving success. Calculate a z statistic, using theobserved value of this proportion (321/1,000) and the null hypothesis mean and variance ofthis proportion in 1,000 trials. From this calculate z2 Do this exactly.

3. We see 321 successes and 679 failures in 1,000 trials. calculate a c2 statistic, using equa-tion (104). (This will be the sum of two terms, one relating to successes, one relating tofailures).

Which was the correct calculation to make in order to test the null hypothesis, the onecalculated as in 1, the one calculated as in 2, or the one calculated as in 3? (Your numericalcalculations should tell you the answer to this question.)


Here are the answers to the problem above.

1. z = \frac{321 - 300}{\sqrt{1000(0.3)(0.7)}} = \frac{21}{\sqrt{210}}, so that z^2 = \frac{441}{210} = 2.1.

2. z = \frac{0.321 - 0.3}{\sqrt{(0.3)(0.7)/1000}} = \frac{0.021}{\sqrt{0.00021}}, so that z^2 = \frac{0.000441}{0.00021} = 2.1.

3. c^2 = \frac{(321 - 300)^2}{300} + \frac{(679 - 700)^2}{700} = \frac{441}{300} + \frac{441}{700} = 1.47 + 0.63 = 2.1.

Thus all three methods give the same answer. They are all correct. Note however thatmethods 1 and 2 rely on binomial calculations, which are OK here since there are only twocategories (success and failure). The c2 method extends to any number of categories (suchas 6 in the “die” example and 4 in the “genetics” example).
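If you want to verify numerically that the three routes agree, the following short sketch (Python with NumPy assumed, not part of the course) reproduces all three calculations.

    import numpy as np

    n, successes, theta0 = 1000, 321, 0.3
    failures = n - successes

    # 1. z from the number of successes
    z1 = (successes - n * theta0) / np.sqrt(n * theta0 * (1 - theta0))

    # 2. z from the proportion of successes
    z2 = (successes / n - theta0) / np.sqrt(theta0 * (1 - theta0) / n)

    # 3. c2 from equation (104), with two categories (success and failure)
    observed = np.array([successes, failures])
    expected = np.array([n * theta0, n * (1 - theta0)])
    c2 = ((observed - expected) ** 2 / expected).sum()

    print(z1 ** 2, z2 ** 2, c2)    # all three equal 2.1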

A more complicated situation

In all the above examples the probabilities for the k catgegories as given by the null hypothesiswere all numbers, for example 1/6, 1/6, . . . , 1/6 in the die example and 9/16, 3/16, 3/16, 1/16in the genetics example. In some more complex cases the probabilities are given in terms ofa parameter or several parameters. The procedure in such cases is as follows:-

(i) Estimate the parameters from the data. (There is advanced statistical theory that showsyou how to do this. You cannot be expected to know this theory, so you will always be toldhow this estimation is to be done.)

(ii) Calculate c2 in the normal way, but with the parameter(s) replaced by the parameterestimate(s).

(iii) You lose one further degree of freedom in the chi-square for every parameter that youestimate. (So if there are k categories and you estimate m parameters, the number of degreesof freedom is k −m− 1.)

Example (again from genetics).

Under the null hypothesis, individuals in a certain population are either of genetic type aa (with probability (1 − θ)2), of genetic type Aa (with probability 2θ(1 − θ)), or of genetic type AA (with probability θ2). Here θ is a parameter whose numerical value is not specified.


(You can see that the binomial distribution is relevant here.) The alternative hypothesis is that the null hypothesis is not true, in an unspecified way.

Given that in our data we have n1 individuals of type aa, n2 individuals of type Aa, and n3

individuals of type AA, with n1 + n2 + n3 = n, how do we test the null hypothesis againstthe alternative hypothesis? That is, what is our test statistic and what values of the teststatistic will lead us to reject the null hypothesis?

Suppose first (and unrealistically) that the numerical value of θ is given. Then we wouldform the chi-square statistic

\frac{(n_1 - n(1-\theta)^2)^2}{n(1-\theta)^2} + \frac{(n_2 - 2n\theta(1-\theta))^2}{2n\theta(1-\theta)} + \frac{(n_3 - n\theta^2)^2}{n\theta^2}.    (105)

We would test the null hypothesis using chi-square charts with two degrees of freedom.

However, of course, we do not know the numerical value of θ. We will have to estimate it from the data. So here is a “trust me” result: the estimate of θ is θ̂ = (n2 + 2n3)/(2n). So we form the appropriate c2 statistic by replacing θ, wherever we see it in (105), by θ̂. This gives the test statistic

c^2 = \frac{(n_1 - n(1-\hat\theta)^2)^2}{n(1-\hat\theta)^2} + \frac{(n_2 - 2n\hat\theta(1-\hat\theta))^2}{2n\hat\theta(1-\hat\theta)} + \frac{(n_3 - n\hat\theta^2)^2}{n\hat\theta^2}.    (106)

We refer the value so calculated to chi-square charts with 3 - 1 -1 = 1 degree of freedom (sincewe always lose one degree of freedom, and we lose an extra one here because we estimatedone parameter).

Numerical example.

Suppose that n1 = 466, n2 = 444 and n3 = 90, so that n = 1,000. Then θ̂ = (444 + 180)/2,000 = 0.312. Then c2 is calculated as

c^2 = \frac{(466 - 1000(0.688)^2)^2}{1000(0.688)^2} + \frac{(444 - 2000(0.312)(0.688))^2}{2000(0.312)(0.688)} + \frac{(90 - 1000(0.312)^2)^2}{1000(0.312)^2},    (107)

and the numerical value of this is

\frac{(466 - 473.344)^2}{473.344} + \frac{(444 - 429.312)^2}{429.312} + \frac{(90 - 97.344)^2}{97.344},

which is about 1.17. Let’s assume that we had chosen α = 0.05 in Step 2. Then the relevant critical point value from the chi-square chart (one degree of freedom) is 3.8415. Since 1.17 is less than this, we do not have enough evidence to reject the null hypothesis.
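Here is a short sketch of the same calculation with the parameter estimated from the data and the degrees of freedom reduced accordingly (Python/SciPy assumed; not part of the course material).

    import numpy as np
    from scipy import stats

    counts = np.array([466, 444, 90])                     # aa, Aa, AA
    n = counts.sum()
    theta_hat = (counts[1] + 2 * counts[2]) / (2 * n)     # the "trust me" estimate: 0.312

    expected = n * np.array([(1 - theta_hat) ** 2,
                             2 * theta_hat * (1 - theta_hat),
                             theta_hat ** 2])
    c2 = ((counts - expected) ** 2 / expected).sum()      # about 1.17

    df = 3 - 1 - 1                                        # k - m - 1: one parameter estimated
    print(c2, stats.chi2.ppf(0.95, df))                   # 1.17 < 3.8415, so do not reject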


A final thought about P -values

As indicated often in class, if you want an exact P -value in a chi-square test with more thanone degree of freedom, you will have to use a statistical computer package (Approach 2).However even with Approach 1, where you use a chart of critical points of chi-square, youcan put bounds on the P -value. Here is how it works.

Consider for example the case where the chi-square has 10 degrees of freedom. For α = 0.05the critical value is 18.3070 and for α = 0.01 the critical value is 23.2093. What this means isthat if C2 is a random variable having a chi-square distribution with 10 degrees of freedom,P(C2 ≥ 18.3070) = 0.05 and P(C2 ≥ 23.2093) = 0.01. So we know then, for example, thatP(C2 ≥ 21.3456) is somewhere between 0.01 and 0.05, but we do not know exactly where itis in that range.

Suppose now that you have your data in a chi-square test with 10 degrees of freedom andhave computed c2, and that numerically the c2 value is 21.3456. What can be said about theP -value corresponding to this? Remember what a P -value is: it is the probability of gettingthe observed value of your test statistic (here, 21.3456) or one more extreme in the directionof the alternative hypothesis (here greater than or equal to 21.3456) when the null hypothesisis true (here, that C2 does have a chi-square distribution with 10 degrees of freedom). Fromthe above, the P -value must be between 0.01 and 0.05. That is all you can say about thenumerical value of the P -value. Thus if a researcher used Approach 1 (in which an exactP -value cannot be obtained), he or she will use the chi-square chart and say something like:“χ2 = 21.3456, 0.01 ≤ P ≤ 0.05”. You will often see this sort of statement in research papers.
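With a statistical package the chart look-up and the exact P-value are each one-line computations. For example (a Python/SciPy sketch, an assumed environment rather than course material):

    from scipy import stats

    df = 10
    print(stats.chi2.ppf(0.95, df))      # 18.3070, the alpha = 0.05 critical point
    print(stats.chi2.ppf(0.99, df))      # 23.2093, the alpha = 0.01 critical point
    print(stats.chi2.sf(21.3456, df))    # the exact P-value, about 0.019, indeed between 0.01 and 0.05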

Tests on means

Perhaps the most frequently used tests in Statistics concern tests on means. Tests on means(in this course) are carried out using so-called t tests. (So-called ANOVA tests, which alsoare tests on means but for data more complicated than that which we consider in this course,are discussed in STAT 112.) We shall discuss four different t tests in this course.

The one-sample t test.

It is easiest to start with an example. Different people have different body temperatures,due to differences in physiology, age, and so on. That is, the temperature of a person takenat random is a (continuous) random variable. So there is a probability distribution of tem-peratures. From considerable past experience, suppose that we know that the mean of thisdistribution is 98.6 for normal healthy individuals.

We are concerned that the mean temperature for people 24 hours after an operation exceeds 98.6.


To investigate this we plan to get some data from n people 24 hours after they have had this operation. At this stage these temperatures are random variables: we do not know what values they will take. So we denote them in upper case: X1 for the first person in the sample, X2 for the second person in the sample, ...., Xn for the nth person in the sample. X1, X2, . . . , Xn are iid random variables, each having some probability distribution with unknown mean, which we denote by µ. Under the null hypothesis (that the mean body temperature of people 24 hours after the operation is the same as that for normal healthy people), µ = 98.6. From the context of the situation, the natural alternative hypothesis is one-sided up, that is, that µ is greater than 98.6. This concludes Step 1 in the hypothesis testing procedure: we have declared our null and alternative hypotheses.

Step 2 consists of choosing the false positive rate α, the probability of claiming that µ isgreater than 98.6 when it is in fact equal to 98.6. In other words, this is the probabilitywe allow ourselves for rejecting the null hypothesis when it is true. Since this is a medicalsituation, let’s choose α = 0.01.

Step 3 consists of choosing the test statistic. Looking ahead to our actual data x1, x2, . . . , xn,it is clear that a key component in the test statistic will be a comparison of the average ofthese values (x) with the null hypothesis mean 98.6. That is, a key component in the teststatistic will be the difference x− 98.6.

(This is the comparison of an average with a mean. This then is why it has been emphasizedso frequently in this course that an average and a mean are two entirely different concepts.To use these two words as meaning the same thing makes it almost impossible to understandthe testing procedure that we are developing.)

However, the difference x − 98.6 is not enough. To see why this is so, we consider first thequite unrealistic case where we know the variance σ2 of the probability distribution of tem-peratures of people 24 hours after they have had this operation. (We consider the realisticcase, where we do not know this variance, later.)

Now let’s go on a probability theory “zig”. Consider the situation before the data wereobtained. At this stage the average X is a random variable. If the null hypothesis is true, ithas mean 98.6 (the mean of each individual X, by one of the magic formulas), and varianceσ2/n (by another magic formula). If we also assume that the X’s have a normal distribution,then

\frac{\bar{X} - 98.6}{\sigma/\sqrt{n}} = \frac{(\bar{X} - 98.6)\sqrt{n}}{\sigma}

is a “Z’ if the null hypothesis is true. So once we have our data we can compute

\frac{(\bar{x} - 98.6)\sqrt{n}}{\sigma}    (108)


and reject the null hypothesis if the value of this is 2.326 or more. (The value 2.326 comesfrom the Z chart and the fact that α = 0.01: make sure that you can work out where 2.326came from, using the Z chart.)

The above procedure is very suggestive to us for the far more realistic case where we do notknow the variance σ2 of the probability distribution of temperatures of people 24 hours afterthey have had this operation. However we know how to estimate a variance, by

s^2 = \frac{x_1^2 + x_2^2 + \cdots + x_n^2 - n\bar{x}^2}{n-1}.

So a sensible thing to do is to replace the test statistic (108) by t, defined as

t = \frac{(\bar{x} - 98.6)\sqrt{n}}{s}.    (109)

This is our test statistic for the “temperatures” example. This concludes Step 3 of the test-ing procedure in the “temperatures” example.

More generally, if the null hypothesis claims that the mean of the temperature of a person24 hours after the operation is µ0, the test statistic is

t = \frac{(\bar{x} - \mu_0)\sqrt{n}}{s}.    (110)

This is the general form of the test statistic t in the “one-sample” case.

Approach 1, Step 4. What values of t will lead us to reject the null hypothesis? Because of the nature of the alternative hypothesis in the “temperatures” example, sufficiently large positive values. This is a “one-sided up” example. How large? To answer this question we go back to the situation before we get our data. The random variable analogue of t is the random variable T, defined (in parallel with (109)) as

T = \frac{(\bar{X} - 98.6)\sqrt{n}}{S},    (111)

where

S^2 = \frac{X_1^2 + X_2^2 + \cdots + X_n^2 - n\bar{X}^2}{n-1}.

So T is very similar to a Z. The only difference is that T has the estimator of a standarddeviation in the denominator instead of a known standard deviation. Because of this dif-ference T does not have the Z distribution. Instead it has the so-called t distribution withn− 1 degrees of freedom. (Why n− 1? That will be discussed later.) This is just one morewell-known and well-studied probability distribution of central importance in Statistics (likethe chi-square distribution). Again like the chi-square distribution, the mathematical form


is not given here. So there will be several “trust me” results coming along soon.

The form of the t distribution tells us how large t has to be in the tempteratures examplefor us to reject the null hypothesis. Charts indicating this will be given out in class. Forexample, if n = 10 (so that you have 9 degrees of freedom) and α = 0.01, we would rejectthe null hypothesis if t ≥ 2.821.

Note. The t chart that you will be given is designed for one-sided up tests (such as the temperature example). If the alternative hypothesis had been that µ (the mean temperature for people 24 hours after the operation) is less than 98.6, then with n = 10 and α = 0.01 we would reject the null hypothesis if t ≤ −2.821. If the alternative hypothesis had been µ ≠ 98.6, then with n = 10 and α = 0.01 we would reject the null hypothesis if t ≥ 3.250 or if t ≤ −3.250.

Approach 1, Step 5. Get the data, compute t, compare the value with the relevant chartvalue, and accept or reject the null hypothesis accordingly.

Approach 2, Step 4. Get the data and calculate t.

Approach 2, Step 5. Find the P -value assiociated with the observed value of t. This canonly be done with a computer package. An example using JMP will be discussed in class.

Numerical example - the “temperature” case

Suppose that we get the temperatures of n = 10 people 24 hours after they have the opera-tion. These values were 98.9, 98.6, 99.3, 98.7, 98.7, 98.4, 99.0, 98.5, 98.8, 98.8. The averageof these is

\frac{98.9 + 98.6 + 99.3 + 98.7 + 98.7 + 98.4 + 99.0 + 98.5 + 98.8 + 98.8}{10} = 98.77.

Next we have to calculate s2. This is

\frac{98.9^2 + 98.6^2 + 99.3^2 + 98.7^2 + 98.7^2 + 98.4^2 + 99.0^2 + 98.5^2 + 98.8^2 + 98.8^2 - 10(98.77)^2}{9},

and this is 0.066778. From this, s = 0.258414. Thus the calculated value of t is

t = \frac{(98.77 - 98.6)\sqrt{10}}{0.258414} = 2.0803.

Since there are 9 degrees of freedom and we chose α = 0.01, the critical value (from thet chart is 2.821 (as given above). Our observed value is less than this, so we do not have


enough evidence to reject the null hypothesis. In other words, we do not have enough evi-dence to claim that there is a significant increase in temperature 24 hours after the operation.

Approach 2, Step 4. Get the data and calculate t. As above, we get 2.0803.

Approach 2, Step 5. Use a statistical package to find the P -value corresponding to t = 2.0803.You will be given a JMP handout doing this. The P -value shown is 0.0336. Since this is notless than or equal to 0.01, we draw the same conclusion as we did under Approach 1: wedo not have enough evidence to claim that there is a significant increase in temperature 24hours after the operation.
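The same numbers can be reproduced outside JMP. Here is a minimal Python/SciPy sketch for the temperature data (an assumed environment, not part of the course):

    import numpy as np
    from scipy import stats

    temps = np.array([98.9, 98.6, 99.3, 98.7, 98.7, 98.4, 99.0, 98.5, 98.8, 98.8])
    mu0, n = 98.6, len(temps)

    xbar = temps.mean()                      # 98.77
    s = temps.std(ddof=1)                    # 0.2584 (divisor n - 1)
    t = (xbar - mu0) * np.sqrt(n) / s        # 2.0803

    critical = stats.t.ppf(0.99, df=n - 1)   # 2.821: alpha = 0.01, one-sided up
    p_value = stats.t.sf(t, df=n - 1)        # 0.0336
    print(t, critical, p_value)

    # In recent SciPy versions, stats.ttest_1samp(temps, mu0, alternative='greater')
    # gives the same t and P-value.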

Notes on the one-sample t test.

1. Degrees of freedom. In a one-sample t test the number of degrees of freedom is one lesthan the number of data values, (that is , degrees of freedom = n− 1). Trust me.

2. Think of t as a signal to noise ratio. The signal is the numerator quantity (x−µ), tellingyou the difference of the average of the data values from the null hypothesis mean, togetherwith the square root of the sample size

√n. The noise is s. (An example will be given later

illustrating the importance of the noise.)

3. The only difference between a one-sample t statistic and a z statistic is that t has astandard deviation estimate in the denominator (s) and a z has a known standard deviation(σ).

4. The t chart that you will be given is designed for one-sided up tests. It can still be usedfor one-sided down and two-sided tests - see examples later.

5. The critical points in a t chart are calculated under the assumption that the randomvariables X1, X2, . . . , Xn have a normal distribution.

6. What do you do if you are not prepared to assume that X1, X2, . . . , Xn have a normaldistribution? You use a non-parametric (= distribution-free) test. See later.

7. The general form of a one-sample t statistic is t = (x̄ − µ0)√n/s. Here µ0 is the null hypothesis mean.

8. The t test is scale-free. If one person measures temperature in degrees Fahrenheit anduses a null hypothesis mean of 98.6, and another measures the same temperatures in degreescentigrade and used the equivalent null hypothesis mean of 37.0, they would get the samenumerical value for t. (It would be disaster if this did not happen.)


9. More on degrees of freedom. Remember that if in the “operation” example we knewthe variance σ2 of the temperatures of people 24 hours after the operation, and if we choseα = 0.01, we would reject the null hypothesis is z ≥ 2.326. How many degrees of freedomdoes 2.326 correspond to in a t test? Infinity.

10. Suppose that in the “temperatures” example the alternative hypothesis had been thatthe temperature of a person 24 hours the operation tends to be lower than 98.6. Then thetest would be one-sided down, and with α = 0.01 and n = 10 as before, we would reject thenull hypothesis if t ≤ −2.821. If the alternative hypothesis had been that the temperatureof a person 24 hours the operation differs from 98.6, but in an unspecified direction, thenthe test would be two-sided, and with α = 0.01 and n = 10 as before, we would reject thenull hypothesis if t ≤ −3.250 or if t ≥ +3.250.

11. Under Approach 1 we do not attempt to compute a P -value. However we can usuallyput a bound on the P -value. For example, in the original one-sided up “temperature” ttest, suppose that (again with n = 10) we had obtained (as above) a value of t of 2.0803.What bounds can we put on the P -value from the values in the t chart? Remember: theP -value in this example is the probability of getting a t value of 2.0803 or more when the nullhypothesis is true. The t chart shows that when the null hypothesis is true, the probabilityof getting a value of t greater than or equal to 1.833 is 0.05 and the probability of getting avalue of t greater than or equal to 2.821 is 0.01. Thus the probability of getting a value of tgreater than or equal to the observed value 2.0803 when the null hypothesis is true must bebetween 0.01 and 0.05. And (as we saw from the computer output using Approach 2) it is0.0336, which is indeed between 0.01 and 0.05.

You will often see a result such as this written as: “t = 2.0803, 0.01 < P < 0.05”.

12. The t chart and the two-standard-deviation rule. If the true mean of X1, X2, . . . , Xn is µ, then t = (x̄ − µ)√n/s has the t distribution with n − 1 degrees of freedom. Suppose for example that n = 21, so that you have 20 degrees of freedom. Then the critical points in the t chart show that P(−2.086 ≤ t ≤ +2.086) = 0.95. Some algebraic manipulations then show that

P\left(\bar{x} - \frac{2.086s}{\sqrt{n}} \le \mu \le \bar{x} + \frac{2.086s}{\sqrt{n}}\right) = 0.95.

Thus an exact 95% confidence interval for µ is

\bar{x} - \frac{2.086s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + \frac{2.086s}{\sqrt{n}}.

Similarly if n = 41, so that you have 40 degrees of freedom, an exact 95% confidence interval


for µ is

\bar{x} - \frac{2.021s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + \frac{2.021s}{\sqrt{n}}.

Both of these are close to the approximate 95% confidence interval for µ deriving from thetwo-standard-deviation rule, namely

\bar{x} - \frac{2s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + \frac{2s}{\sqrt{n}}.

Looking down the chart of the critical values of t corresponding to α = 0.025, you can seethat from n = 20 onwards the exact 95% confidence interval for µ is very close to the ap-porximate confidence interval. In fact for n = 60 they are essentially identical.
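The comparison between the exact t-based interval and the two-standard-deviation rule is easy to automate. A sketch follows (Python/SciPy; the function name and the simulated data are my own and purely for illustration):

    import numpy as np
    from scipy import stats

    def ci_95(data):
        # exact and approximate 95% confidence intervals for the mean mu
        data = np.asarray(data, dtype=float)
        n, xbar, s = len(data), data.mean(), data.std(ddof=1)
        t_crit = stats.t.ppf(0.975, df=n - 1)   # 2.086 when n = 21, 2.021 when n = 41
        exact = (xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))
        approx = (xbar - 2 * s / np.sqrt(n), xbar + 2 * s / np.sqrt(n))   # two-standard-deviation rule
        return exact, approx

    # illustrative simulated temperatures, n = 21
    print(ci_95(np.random.default_rng(0).normal(98.6, 0.3, size=21)))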

This of course is why the “approximate two-standard deviation rule” has been emphasizedso much in class.

13. Robustness As noted above, if you do a t test and use the t chart to assess significance,you are implicitly assuming that the random variables X1, X2, . . . , Xn have a normal distri-bution. (The critical points in the t chart were calculated under this assumption.) So it isnatural to ask: “How far wrong do you go in using a t test if X1, X2, . . . , Xn do not have anormal distribution?

The answer is: “not much”. The t test is said to be robust. The critical points in the t chartare quite accurate even if X1, X2, . . . , Xn do not have a normal distribution.

In part this happens because of the Central Limit Theorem. A crucial component of the tstatistic is x, and from the Central Limit Theorem one can reasonably assume that X hasclose to a normal distribution whatever the probability distribution of X1, X2, . . . , Xn mightbe.

********************

Example 2. This (a) gives example of a one-sided down test, and (b) shows the importanceof the noise (s) in the denominator of the t statistic.

You are a member of a consumer group and you are concerned that your local supermarketis putting in less than 5 pounds of coffee in bags which they label 5 pounds. We want tocheck on this and, to do this, we plan to buy 10 bags of coffee and weigh them. (With 10bags we will have 9 degrees of freedom on our eventual t test.)

Step 1. Let µ be the true mean weight of coffee put in the bags. The null hypothesis is µ = 5and (because of the context) the alternative hypothesis is µ < 5.


Step 2. Let’s choose α = 0.05. Because of the nature of the alternative hypothesis and thefact that we will have 9 degrees of freedom, we will reject the null hypothesis if our eventualvalue of t is -1.833 or less.

Steps 3, 4 and 5 combined.

Case 1. Suppose that the weights of the 10 bags are (in pounds)

4.96, 5.04, 4.93, 5.06, 4.92, 4.96, 5.03, 4.91, 4.98, 5.01.

From these values we find x = 4.98 and s2 = 0.0028. From this,

t = \frac{(4.98 - 5.00)\sqrt{10}}{\sqrt{0.0028}} = -1.1952.

Because this is not less than -1.833 we do not have enough evidence to reject the null hy-pothesis. ( In an upcoming homework you will do this test using JMP, and will also find theP -value asociated with the t value of -1.1952.)

Case 2. As above, but suppose that the weights of the 10 bags are (in pounds)

4.98, 4.97, 4.99, 4.99, 5.01, 4.97, 4.98, 4.96, 4.99, 4.96.

From these values we find x = 4.98 (the same as in Case 1), but now s2 = 0.0002444 (much less than in Case 1). From this,

t = \frac{(4.98 - 5.00)\sqrt{10}}{\sqrt{0.0002444}} = -4.0452.

Because this now is less than -1.833 we have enough evidence to reject the null hypothesis.Note that the averages are the same in Cases 1 and 2, and that the only difference betweenthe two cases is that there is a smaller level of “noise” in Case 2.
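A short sketch that runs the two cases side by side makes the role of the noise very visible (Python/SciPy assumed; the helper function is mine, not part of the course):

    import numpy as np
    from scipy import stats

    def one_sided_down_t(weights, mu0=5.0, alpha=0.05):
        # one-sample t statistic and the one-sided-down critical value
        w = np.asarray(weights)
        n = len(w)
        t = (w.mean() - mu0) * np.sqrt(n) / w.std(ddof=1)
        return t, stats.t.ppf(alpha, df=n - 1)

    case1 = [4.96, 5.04, 4.93, 5.06, 4.92, 4.96, 5.03, 4.91, 4.98, 5.01]
    case2 = [4.98, 4.97, 4.99, 4.99, 5.01, 4.97, 4.98, 4.96, 4.99, 4.96]
    print(one_sided_down_t(case1))   # about (-1.195, -1.833): do not reject
    print(one_sided_down_t(case2))   # about (-4.045, -1.833): reject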

A two-sided example

Remember: the t chart that you have is designed for one-sided up tests. However they areeasily adapted for two-sided tests, provided you remember what the chart values are tellingyou. This was discussed in Note 10 above.

If for example in the “temperatures” case we are concerned about both an increase and adecrease in the temperatures of people 24 hours after the operation, we will have to conducta two-sided test. If as before we choose α = 0.01, we put half of this value (i.e. 0.005) onthe up side and the other half on the down side. If n = 10, so that we have 9 degrees offreedom, the chart shows that we will reject the null hypothesis if t ≥ 3.250 or if t ≤ −3.250.


Two-sample t tests

These are used very often and thus are very important.

As the name suggests, with two-sample t tests we have two sets of data. Here is an example.

Suppose that we want to assess whether people tend to lose the ability to remember a setof instructions as they get older. To address this question we plan to get a sample of mpeople aged 18-45 (“group 1”) and a sample of n people aged 60-75 (“group 2”) and sub-ject each person to a memory test (see how long they take to remember a set of instructions).

We are still in the planning phase. We will do all this tomorrow. So at this stage the lengthsof time for the people in group 1 are random variables, which we denote by X11, X12, . . . , X1m.Similarly the lengths of time for the people in group 2 are random variables, which we denoteby X21, X22, . . . , X2n. (Notice that the first suffix indicates group membership.)

We assume that X11, X12, . . . , X1m are iid random variables, each having a normal distribu-tion with mean µ1 and (unknown) variance σ2. (The normal distribution assumption will bediscussed later). Similarly we assume that X21, X22, . . . , X2n are iid random variables, eachhaving a normal distribution with mean µ2 and (unknown) variance σ2. (Again, the normaldistribution assumption will be discussed later. Note also that we are assuming the samevariance in both groups: this assumption also will be discussed later.)

Step 1. The null hypothesis claims that µ1 = µ2.

(Note here the parallel with what we did in the binomial case. We first tested for the valueof one single binomial parameter: “Is this coin fair?” and moved on from that to testing forthe equality of two binomial parameters: “Is the probability (θ1) that a woman is pro-choicethe same as the probability (θ2) that a man is pro-choice?”. Similarly here we first testedfor the value of one single mean: “Is the mean 98.6?”, and we have now moved to testingfor the equality of two means.)

The alternative hypothesis comes from the context, and in this example it is that µ1 < µ2: if memory declines with age, the people in group 2 will tend to take longer. So this will be a one-sided test. (Whether it will be one-sided up or one-sided down will be discussed later.) In other cases the alternative hypothesis could be two-sided (µ1 ≠ µ2). It all depends on the context.

Step 2. Here we decide on the value of α, usually either 5% or 1%. This is the false positiveprobability. In the “memory” example, it is the probability that you are prepared to adoptthat you reject the null hypothesis (that the mean memory time is the same for both agegroups) when it is true.


Step 3. What is the test statistic? Let’s now go forward in time to when we have ourdata. We will have data values x11, x12, . . . , x1m from the people in group 1 and data valuesx21, x22, . . . , x2n from the people in group 2. It is clear that a key part of the “signal” in thet statistic that we are developing will be the difference between the averages x1 and x2. Wecould use the difference x1− x2 or the difference x2− x1. Which choice we make will eventu-ally decide if the test is one-sided up or one-sided down: more on this later. To be specifichere, let’s choose the difference x1 − x2. As stated above, this will be a key componrent ofthe“signal” in the numerator in the t test statistic that we are developing.

However, as with the one-sample t test, this on its own is not enough. We have to considerthe “noise”. To see what this should be, let’s pretend for the moment that the variance σ2

(discussed above) is known. We now have to do a probability theory “zig”, and considerthe situation before we get our data. Since we know that our eventual t statistic will havex1− x2 in the numerator, at this stage we think about the random variable X1− X2. Whatis the variance of this random variable?

Here we have to use two of the magic formulas of probability theory. The formula for the variance of an average shows us that the variance of X̄1 is σ2/m and the variance of X̄2 is σ2/n. Next we have to use the magic formula for the variance of the difference of two independent random variables, namely the sum of their variances. Putting all of this together, the variance of X̄1 − X̄2 is

\text{Variance of } \bar{X}_1 - \bar{X}_2 = \frac{\sigma^2}{m} + \frac{\sigma^2}{n}.    (112)

Again using magic formulas, the mean of X1 is µ1 and mean of X2 is µ2. Now suppose thatthe null hypothesis µ1 = µ2 is true. Using a further magic formula, the mean of X1 − X2 isthen zero. Putting this together with the result given in (112), when the null hypothesis istrue, the quantity

\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma^2}{m} + \frac{\sigma^2}{n}}}    (113)

is a “Z”, that is it is a random variables having a normal distribtion with mean zero, variance1. We could use the “Z” chart to test for its significance.

Algebraic manipulations in (113) show that the statistic defined in (113) can be written inthe more convenient form

\frac{(\bar{X}_1 - \bar{X}_2)\sqrt{\frac{mn}{m+n}}}{\sigma}.    (114)

To say it again, if the null hypothesis µ1 = µ2 is true, the quantity in (114) is a “Z”, thatis, has a normal distribution with mean zero, variance 1.


However, the problem with the above is that we do not know σ. So we will have to estimateit from the data. This will eventually lead us to a t statistic, since as discussed previously,the only difference between a z statistic and a t statistic is that in a t we have an estimate ofa standard deviation in the denominator, whereas in a Z statistic we have a known standarddeviation in the denominator, as in (114).

This leads to a “ trust me” result. We want to use the data from both groups, combinedin some way, to estimate σ2, and from this to estimate σ. The data x11, x12, . . . , x1m fromgroup 1, taken on their own, would lead to the estimate

s_1^2 = \frac{x_{11}^2 + x_{12}^2 + \cdots + x_{1m}^2 - m\bar{x}_1^2}{m-1}.    (115)

Similarly the data x21, x22, . . . , x2n from group 2, taken on their own, would lead to the estimate

s_2^2 = \frac{x_{21}^2 + x_{22}^2 + \cdots + x_{2n}^2 - n\bar{x}_2^2}{n-1}.    (116)

Here is the “trust me” result. Our estimate of σ2, combining the estimates s21 from group 1

and s22 from group 2, is s2, defined by

s^2 = \frac{(m-1)s_1^2 + (n-1)s_2^2}{m+n-2}.    (117)

This is a weighted average of s21 and s2

2, the weights being thew respective degrees of freedomin the two groups. Combining this with the thinking that led to the z in (114), the teststatistic that we will use is t, defined by

t = \frac{(\bar{x}_1 - \bar{x}_2)\sqrt{\frac{mn}{m+n}}}{s}.    (118)

This is the end of Step 3. We have arrived at our test statistic. It is what we will eventuallycalculate fom our data.

Step 4, Approach 1. In this step we ask: “What values of t will lead us to reject the nullhypothesis?” We answer this question in two stages. But before starting, recall that wearbitrarily have x1 − x2 in the numerator of the test statistic t. We could equally havedecided to have x2 − x1 in the numerator. If we had done this our test statistic would havebeen t∗, defined by

t^* = \frac{(\bar{x}_2 - \bar{x}_1)\sqrt{\frac{mn}{m+n}}}{s}.    (119)

Either choice of test statistic is allowed. However it is crucial that you think very carefully,once you have decided which statistic to use, (118) or (119), about the sided-ness of yourtest. This is determined by the alternative hypothesis and the choice of statistic that you


make.

Suppose, for example, that we use the test statistic t as given in (118), and that the alterna-tive hypothesis had been µ1 < µ2. What value of t would we expect to get if this alternativehypothesis is true? If the alternative hypothesis is true, x1 will tend to be less than x2 andso t will tend to be negative. Thus sufficently large negative values of t, as defined in (118),will lead us to reject the null hypothesis.

Similarly, suppose that the alternative hypothesis had been µ1 > µ2. If this alternativehypothesis is true, x1 will tend to be greater than x2 and so t will tend to be positive. Thussufficently large positive values of t, as defined in (118), would in that case lead us to rejectthe null hypothesis.

What would the corresponding procedure be if you decided to use the test statistic t∗? Sup-pose first that alternative hypothesis had been µ1 < µ2. What value of t∗ would we expectto get if this alternative hypothesis is true? If this alternative hypothesis is true, x1 will tendto be less than x2 and so t∗ will tend to be positive. Thus sufficently large positive values oft∗ would lead you to reject the null hypothesis.

Does this contradict the conclusion that you would have reached using t as the test statistic?NO. Observe that t∗ is the negative of t: if for example t = −1.66, then t∗ = +1.66. Solarge negative values of t correpond to large positve values of t∗, and you will reach the sameconclusion whether you use t or t∗.

To be concrete we will use t as defined in (118) as our test statistic, but the above discussionshows that we must be careful to work out, if we have a one-sided test, whether we will havea one-sided up or a one-sided down test in any specific example.

If our test is two-sided then it does not matter: sufficently large positive or sufficently largenegative of t will lead us to reject the null hypothesis, and this will correspond exactly tosufficently large positive or sufficently large negative of t∗.

How large? Here is another “trust me” result. If the null hypothesis is true, then the statstict is the observed value of a random variable having the t distribution with m+n− 2 degreesof freedom. (You get m − 1 degrees of freedom from group 1 and n − 1 degrees of freedomfrom group 2, and thus n+m− 2 degrees of freedom altogether.)

The degrees of freedom calculation together with the value of α will tell you how large pos-itive, or how large negative, or how large positive or negative, t has to be before the nullhypothesis is rejected. Here are some examples.


Suppose that m = 10 and n = 13. You therefore have 10 + 13 − 2 = 21 degrees of freedom. Suppose that the alternative hypothesis is µ1 > µ2 and that you chose α = 0.05. Then you have a one-sided up test, and the t chart shows that you will reject the null hypothesis if t > 1.721. If the alternative hypothesis had been µ1 < µ2 and you chose α = 0.05, then you have a one-sided down test, and you will reject this null hypothesis if t < −1.721. If the alternative hypothesis had been µ1 ≠ µ2 you have a two-sided test, and with α = 0.05 you would reject this null hypothesis if t > 2.080 or if t < −2.080.

Step 5, Approach 1. This is the easy step. Get the data, calculate t, refer the value you getto the approptriate value on the t chart, and accept or reject the null hypothesis dependingon this comparison.

Step 4, Approach 2. In this step we calculate the value of t.

Step 5, Approach 2. In this step we find the P -value associated with the value of t calculatedin Step 4. This can only be done via a statistical computer package. JMP prints out threeP -values, one for the case where the test is one-sided up, one for the case where the test isone-sided down, and one for the case where the test is two-sided.

Notes on the two-sample t test.

Most of the notes for the one-sample t test apply also for a two-sample t test. There arehowever some differences. Here are three of them.

1. An obvious difference is in the formula for the number of degrees of freedom in the twocases (n− 1 compared to m+ n− 2).

2. A key assumption made above is that the variance corresponding to group 1 is the sameas that corresponding to group 2. (No corresponding assumption is needed for one-samplet tests.) The assumption of equal variances is one that you might, or might not, like tomake, depending perhaps on your experience with the sort of data you are involved with.There is a way of proceeding if you are not prepared to make this assumption. However thisprocedure is complex and we will not consider it in this class. Note that when using JMPfor a two-sample t test you can specify whether you want to make this assumption or not.JMP will compute the value of t and its P -value corresponding to whatever choice you made.

3. The main part of the “signal” in a one-sample t test is the difference between an averageand a mean, i.e. x−µ. The main part of the “signal” in a two-sample t test is the differencebetween two averages (x1 − x2).

4. This note concerns a computational simplification for the calculation of s2. The formula


given in (117) for s2 is (to repeat that formula)

s^2 = \frac{(m-1)s_1^2 + (n-1)s_2^2}{m+n-2}.

Now s_1^2 is defined by

s_1^2 = \frac{x_{11}^2 + x_{12}^2 + \cdots + x_{1m}^2 - m\bar{x}_1^2}{m-1},

and similarly s_2^2 is defined by

s_2^2 = \frac{x_{21}^2 + x_{22}^2 + \cdots + x_{2n}^2 - n\bar{x}_2^2}{n-1}.

It follows from these three equations that

s^2 = \frac{x_{11}^2 + x_{12}^2 + \cdots + x_{1m}^2 - m\bar{x}_1^2 + x_{21}^2 + x_{22}^2 + \cdots + x_{2n}^2 - n\bar{x}_2^2}{m+n-2}.

This is perhaps the simplest formula for computing s2.

5. There is no need, in a two-sample t test, for m and n to be equal. However in practice itis wise to make them as reasonably equal as possible, given the constraints of the situationinvolved.

6. The null hypothesis in any two-sample t test is that two means are equal (in the notationabove, that µ1 = µ2). However no statement is made or required as to what the commonvalue of the mean is under the null hypothesis.

Numerical example of a two-sample t test.

A group of m = 10 young people were given a certain stimulus and the reaction time of eachof these 10 people was measured (in thousandths of a second). At the same time a groupof n = 8 old people were given the same stimulus and the reaction time of each of these 8people was measured (again, in thousandths of a second). The null hypothesis is that themean reaction time for younger poeple is the same as that of older people. The naturalalternative hypothesis is that the mean reaction time for older people is longer than thatfor younger people. These comments in effect conclude Step 1: we have stated our null andalternative hypotheses.

In Step 2 we choose the value of α, the false positive rate (here the probability that weconclude that the mean mean reaction time for older people is longer than that for youngerpeople when in fact it is the same). Let’s choose 1%.


Steps 3, 4 and 5 combined. Here group 1 will be the young people and group 2 will be theold people. At this point we have to ask ourselves: what kind of values of the t statistic willlead us to reject the null hypothesis? Suppose that m, the number of young people, is 10and that n, the number of older people, is 8. Looking at the numerator in the t statistic wesee that if the alternative hypothesis is true, t will tend to be negative. This implies that wewill reject the null hypothesis if t is sufficently large negative. How large negative? We willhave 10 + 8 -2 = 16 degrees of freedom, and with α = 0.01 the t chart shows that we willreject the null hypothesis if t ≤ −2.583.

We are now set up to do the calculations. The data are as follows. For the 10 young peoplethe reaction times were

382, 446, 483, 378, 414, 420, 452, 391, 399, 426

and for 8 old the reaction times were

423, 474, 456, 432, 513, 480, 498, 448.

From these values we compute x1 = 419.1 and x2 = 465.5. The value of s2 is 17143/16 =1071. Thus

t = \frac{(419.1 - 465.5)\sqrt{\frac{80}{18}}}{\sqrt{1071}} = -2.9890.

Since this is less than the critical value -2.583 we reject the null hypothesis, and claim thatwe have significant evidence that the mean reaction time for older people exceeds that foryounger people.
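For completeness, here is a sketch of the same two-sample calculation in Python/SciPy (an assumed environment, not the JMP route used in class):

    import numpy as np
    from scipy import stats

    young = np.array([382, 446, 483, 378, 414, 420, 452, 391, 399, 426])
    old   = np.array([423, 474, 456, 432, 513, 480, 498, 448])
    m, n = len(young), len(old)

    # pooled variance estimate, equation (117)
    s2 = ((m - 1) * young.var(ddof=1) + (n - 1) * old.var(ddof=1)) / (m + n - 2)   # about 1071
    t = (young.mean() - old.mean()) * np.sqrt(m * n / (m + n)) / np.sqrt(s2)       # about -2.989

    critical = stats.t.ppf(0.01, df=m + n - 2)   # about -2.583: one-sided down, alpha = 0.01
    print(t, critical, t <= critical)            # True, so reject the null hypothesis

    # stats.ttest_ind(young, old, equal_var=True, alternative='less') reproduces t and
    # its P-value in recent SciPy versions.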

Let’s review the logic behind this, focusing on the probability theory implication/deductionactivity (the “zig”) and the statistical inference/induction activity (the “zag”).

If the null hypothesis is true, then difficult math which we will not see shows that the prob-ability that the eventual value of t will be les than -2.583 is very small, in fact 0.01. This isthe probability theory “zig”.

After we got our data we found that the calculated value of t was -2.9890. Since this is lessthan -2.583, we make the inference, or induction, that we have good evidence that the nullhypothesis is not true. This is the statistical “zag”.

The paired two-sample t test

In this test, although we start with two samples of data, we finish up doing a one-sampletest. Here is an example from the “temperature 24 hours after operation” context.


As before, we are concerned that the temperature of people 24 hours after an operationis unduly elevated. This time we measure the temperature of the people undergoing theoperation both 24 hours before, and also 24 hours after, the operation. Suppose that weconsider n people who will have this operation. Going straight to the data, we will have n“24 hour before the operation” temperatures x11, x12, . . . , x1n and also n “24 hour after theoperation” temperatures x21, x22, . . . , x2n. here the temperatures x11 and x21 are from thesame person, (person number 1) x12 and x22 are from person number 2, and so on.

There is now a natural pairing of the temperatures taken from the same person. We couldignore this and do a two-sample t test. But doing this does not take advantage of this naturalpairing. We will see how we do take advantage of this pairing in Step 3 below.

First, let’s consider Step 1.

Step 1.

As before, the null hypothesis is that the mean temperature 24 hours before the operationand the mean temperature 24 hours after the operation are equal. The alternative hypothesisof interest to us is that mean temperature goes up after the operation. This concludes Step1 - we have specified both our hypotheses.

Step 2. Let’s again choose α = 0.01.

Steps 3, 4 and 5

Let’s look forward to our eventual data. It is arbitrary whether we choose Group 1 as the“24 hour before” readings of the “24 hour after” readings, so long as we think about whatthis means about whether the test will be one-sided up or one-sided down. Let’s choose the“24 hour before” readings as Group 1.

Suppose that there are n people in the experiment and we denote the n “24 hour be-fore” data values as x11, x12, . . . x1n and the n “24 hour after” data values as x21, x22, . . . x2n.We now take advantage of the pairing to focus on the differences d1 = x11 − x21, d2 =x12 − x22, . . . , dn = x1n − x2n. From now on we discard the original xij data values anduse only these differences. This means that we will be doing a one-sample t test (on the divalues), even though our original data formed two samples.

To find the relevant test statistic we do a probability theory “zig”. If X1 is the temperatureof a randomly chosen person 24 hours before the operation and X2 is the temperature ofthis same person 24 hours after the operation, then under the null hypothesis the mean ofthe difference D = X1 −X2 between these two temperatures is zero. Under the alternative


hypothesis the mean of this difference is negative.

Following the standard procedure for a one-sample test, we will form a t statistic from the respective differences d1, d2, . . . , dn. The numerator in this statistic will be the average d̄ of these differences. This implies that the test will be one-sided down: if the alternative hypothesis is true, this average will tend to be negative. We also calculate s_d^2 using the formula

s_d^2 = \frac{d_1^2 + d_2^2 + \cdots + d_n^2 - n\bar{d}^2}{n-1}.

The eventual t statistic is then

t = \frac{\bar{d}\sqrt{n}}{s_d}.    (120)

Here is a “trust me” result. If the null hypothesis is true, this is the observed value of a random variable having the t distribution with n − 1 degrees of freedom. So we now refer the calculated value of this t to the t chart with n − 1 degrees of freedom, remembering that in this example the test is one-sided down. Thus if for example n = 15 and α = 0.01, we would reject the null hypothesis if t is less than or equal to −2.624.

Two questions. 1. If we had done a two-sample t test (which is quite posible, using thedata) we would have 2n− 2 degrees of freedom. What has happened to the remaining n− 1degrees of freedom?

2. What is the advantage of using the paired t test? It eliminates the natural person toperson variation in temperature. This is illustrated by a the following example. Before doingthe example recall that we defined dj as xij − x2j, and this led to the test being one-sideddown. We now address these two points in turn.

1. Suppose that we take the temperatures of n = 8 people both before and after the opera-tion. Going straight to the data, suppose that we get:-

Person number:        1     2     3     4     5     6     7     8
Temperature before:   98.0  98.9  98.1  97.8  98.0  98.8  98.8  98.1
Temperature after:    98.4  98.8  98.4  98.1  98.2  99.0  98.7  98.5

From these we form the differences:-

Difference -.4 +.1 -.3 -.3 -.2 -.2 +.1 -.4

The average of these differences is -.2. Next,


s_d^2 = \frac{(-.4)^2 + (+.1)^2 + \cdots + (-.4)^2 - 8(-.2)^2}{7},

and this is 0.04. Thus the numerical value of t is

t = \frac{(-.2)\sqrt{8}}{\sqrt{0.04}} = -2.828.

This would be significant if α = 0.05 and would almost be significant if α = 0.01 (7 degreesof freedom).

We now do the analysis using the ordinary (i.e. unpaired)two-sample t test and then com-pare the results of the two procedures.

The average (x1) of the “before” temperatures is 98.3125 and the average (x2) of the “after”temperatures is 98.5125. The difference(x1−x2) between these two averages is −0.2. (This isidentical to the average of the differences calculated doing the “paired” t test, and this equal-ity will always happen. Thus this part of the “signal” will be identical in the two procedures.)

Further calculation shows that the “pooled” estimate s2 of variance is s2 = 2.0175/14 = 0.144107. Thus under this approach the numerical value of t is

t = \frac{(-0.2)\sqrt{64/16}}{\sqrt{0.144107}} = -1.0537.

This is nowhere near significant (14 degrees of freedom).

What has happened to cause the difference in the two results is that the natural “individual-to-individual” variation in temperature (which is not of direct interest to us) has contributedto the noise, decreasing the numerical value of the t statistic. This makes it more difficult todetect a significant signal. The paired t test overcomes this problem by taking differences oftemperatures within individuals, and this removes any “individual-to-individual” variationin temperature from the procedure.
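The contrast between the paired and the unpaired analyses of these eight people can be reproduced in a few lines (Python/SciPy assumed, not part of the course material):

    import numpy as np
    from scipy import stats

    before = np.array([98.0, 98.9, 98.1, 97.8, 98.0, 98.8, 98.8, 98.1])
    after  = np.array([98.4, 98.8, 98.4, 98.1, 98.2, 99.0, 98.7, 98.5])

    # paired test: work with the within-person differences
    d = before - after
    t_paired = d.mean() * np.sqrt(len(d)) / d.std(ddof=1)                          # about -2.828

    # ordinary (unpaired) two-sample test on the same data
    s2 = (7 * before.var(ddof=1) + 7 * after.var(ddof=1)) / 14
    t_unpaired = (before.mean() - after.mean()) * np.sqrt(64 / 16) / np.sqrt(s2)   # about -1.054

    print(t_paired, stats.t.ppf(0.05, df=7))     # -2.828 vs -1.895: significant at alpha = 0.05
    print(t_unpaired, stats.t.ppf(0.05, df=14))  # -1.054 vs -1.761: not significant

    # stats.ttest_rel(before, after, alternative='less') and
    # stats.ttest_ind(before, after, equal_var=True, alternative='less')
    # give the same two t values in recent SciPy versions.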

The take-home message is: If there is a logical reason to pair, then do a paired t test. Thiswill give you a sharper result.

2. What happened to the “lost” degrees of freedom, the difference between the 2n−2 degreesof freedom for the ordinary(i.e. unpaired) unpaired test and the n − 1 degrees of freedomfor the paired test?


Suppose that we had also been interested in investigating if there is significant “individual-to-individual” variation in temperature. These “lost” n− 1 degrees of freedom are used forthis test. This is the beginning of the concept of ANOVA (the Analysis of Variance). Thethinking of ANOVA is the following.

Any body of data will almost always exhibit variation. We can ascribe some of this variationto one cause (here, “before and after the operation”) and some of this variation to anothercause (here, “indivdual-to-individual” variation). ANOVA does two tests (one for a “beforeto after” difference, one for an “individual-to-individual” difference) for the price of one set ofdata. Some of the overall degrees of freedom are used for one test and some for the other test.

The word “Analysis” (in the Analysis of Variance) means “sub-division”. We subdivide thevariation in the data into meaningful components, and tests each component for significance.

Final note. It was of course arbitrary that we chose the “before” temperatures as Group1. What would have happened if we chose the “after” temperatures as Group 1? First, thetest would now be one-sided up, since under the alternative hypothesis we now expect theaverage of (the new) Group 1 to exceed that of (the new) Group 2. Second, with the abovedata the value of t would now be +2.828.

Putting these two observations together, we would reach the same conclusion as we did before. It would be a disaster if we did not.
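As promised above, here is a minimal computational sketch (not part of the original notes) of both analyses of the temperature data. It is written in Python and assumes the scipy package is available; scipy.stats.ttest_rel carries out the paired analysis and scipy.stats.ttest_ind, with its default pooled-variance setting, carries out the unpaired analysis.

    from scipy import stats

    # Temperatures for the n = 8 people (from the table above)
    before = [98.0, 98.9, 98.1, 97.8, 98.0, 98.8, 98.8, 98.1]
    after  = [98.4, 98.8, 98.4, 98.1, 98.2, 99.0, 98.7, 98.5]

    # Paired t test: based on the 8 within-person differences (7 degrees of freedom)
    res_paired = stats.ttest_rel(before, after)
    print(res_paired.statistic)     # about -2.83

    # Unpaired two-sample t test with the pooled variance estimate (14 degrees of freedom)
    res_unpaired = stats.ttest_ind(before, after)
    print(res_unpaired.statistic)   # about -1.05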

t tests in regression.

Preview. We first review our previous discussion on regression, where our focus was on estimating parameters, not on testing hypotheses about them. We will get onto tests of hypotheses in regression soon.

Regression concerns the question: “How does one thing depend on another?” In the example that we considered earlier, we asked: “How does the growth height of a plant depend on the amount of water given the plant in the growing period?”

The amount of water given to any plant is NOT random: we can choose this for ourselves. So we denote this in lower case letters. The growth height of any plant IS random: this will depend on various factors such as soil fertility that we do not know much about. The basic (“linear model”) assumption is that if Y is the (random) growth height for a plant given x units of water, then the mean of Y is of the form α + βx and the variance of Y is σ². Here α, β and σ² are all (unknown) parameters which we previously learned how to estimate from our eventual data.


What are the data, once the experiment is completed? These are the respective amounts of water x1, x2, . . . , xn given to n plants and the corresponding growth heights y1, y2, . . . , yn. The first thing that we did was to calculate five quantities from these data. These were

x̄ = (x1 + x2 + .... + xn)/n,    ȳ = (y1 + y2 + .... + yn)/n,    (121)

as well as the quantities sxx, syy and sxy, defined by

sxx = (x1 − x̄)² + (x2 − x̄)² + .... + (xn − x̄)²,    (122)

syy = (y1 − ȳ)² + (y2 − ȳ)² + ... + (yn − ȳ)²,    (123)

sxy = (x1 − x̄)(y1 − ȳ) + (x2 − x̄)(y2 − ȳ) + ... + (xn − x̄)(yn − ȳ).    (124)

From these we estimated β by b, defined as b = sxy/sxx. We also estimated σ² by sr², defined by sr² = (syy − b²sxx)/(n − 2).
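As a small illustration of these formulas, here is a Python sketch; the (x, y) values below are made up purely for this sketch and are not the plant data from the earlier lectures.

    # Regression quantities from the formulas above, with made-up (x, y) data
    # used purely for illustration (not the plant data from the lectures).
    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
    n = len(x)

    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

    b = sxy / sxx                          # estimate of beta
    sr2 = (syy - b ** 2 * sxx) / (n - 2)   # estimate of sigma^2 (n - 2 degrees of freedom)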

In hypothesis testing the most interesting question is: “Is there any effect of the amount of water on the growth height?” This is identical to asking the question: “Is β = 0?” So our estimate b of β will be the key component in the numerator “signal” of our eventual t statistic.

Before going further we note three things about sr².

1. The suffix “r” denotes “regression”: this estimate is specific to the regression context.

2. Why is there an n − 2 in the calculation of sr²? This is because here we have n − 2 degrees of freedom: more on this later.

3. Another “trust me” result: the estimate of the variance of the estimator corresponding to b is not sr². It is sr²/sxx.

We now put all of this together to formulate our regression t test.

First we have to assume that before the experiment the (random variable) growth heights have a normal distribution. (This assumption was NOT necessary in our previous estimation activities.)

The null hypothesis is “no effect of the amount of water on the growth height”, which translates into “β = 0”. In general the alternative hypothesis could be one-sided up (“β > 0”), one-sided down (“β < 0”), or two-sided (“β ≠ 0”). In the water example the natural alternative hypothesis, from the context, is “β > 0”. This concludes Step 1: we have formulated our null and alternative hypotheses.


Step 2. Here we choose α, the false positive rate (the probability of our claiming that there is a positive effect of water on growth height if in fact there is no effect). Let’s go with 5%.

Step 3. What is the test statistic? Clearly it will be a t statistic with b in the numerator and our estimate of the standard deviation of the estimator corresponding to b in the denominator. Thus from the above, this statistic is

t = b / (sr/√sxx) = b√sxx / sr.    (125)

This concludes Step 3: we have formulated our test statistic.

Approach 1, Step 4. What values of t will lead us to reject the null hypothesis? This depends on three things:-

(i) The sided-ness of the test. If the alternative hypothesis is β > 0 then sufficiently large positive values of t lead us to reject the null hypothesis. If the alternative hypothesis is β < 0 then sufficiently large negative values of t lead us to reject the null hypothesis. If the alternative hypothesis is β ≠ 0 then sufficiently large positive or negative values of t lead us to reject the null hypothesis.

(ii) The value chosen for α. This will determine the appropriate column in the t chart.

(iii) The number of degrees of freedom that we have. As above: trust me. We have n − 2 degrees of freedom.

Step 5. This is easy: get the data, calculate t, do the test.

Approach 2, Step 4. Calculate t.

Approach 2, Step 5. Find the P-value corresponding to the calculated value of t. This can only be done with a computer package. There will be a JMP handout in class about this.

(The JMP handout is not well formatted, and it is not immediately clear where you have to look in the handout to get the information that you want. The so-called “Intercept” t statistic tests the null hypothesis α = 0. This is not of interest to us. The so-called “Water” t statistic tests the null hypothesis β = 0. This is the test of interest to us. The value of t given is 14.39 (with 10 degrees of freedom). Unfortunately the print-out only gives the P-value for a two-sided test. It should also give the P-value for a one-sided up, and also for a one-sided down, test. The P-value for a one-sided test would be half the value given for a two-sided test.)


Generalization. In the above example the null hypothesis was “β = 0”. In some cases the null hypothesis is of the form “β = β0”, where β0 is some specified numerical value. The relevant t statistic is then

t = (b − β0)√sxx / sr.    (126)

In general the alternative hypothesis could be one-sided up (“β > β0”), one-sided down (“β < β0”), or two-sided (“β ≠ β0”). There are still n − 2 degrees of freedom.

A note on degrees of freedom. As stated in class, the numerical value for the number of degrees of freedom is sometimes mysterious. Your best approach is possibly just to trust me in any situation as to how many degrees of freedom you have. However, if you want some idea of how this number is calculated in the t test context, it is the number of observations minus the number of parameters that you have estimated before you estimate the relevant variance. In the one-sample t test we estimated one parameter (the mean, estimated by x̄) before we estimated the variance. So we have n − 1 degrees of freedom. The same is true in the paired t test. In the two-sample t test we estimated two means µ1 and µ2 (respectively by x̄1 and x̄2) before estimating the variance, so we have m + n − 2 degrees of freedom. In the regression case we estimated two parameters (α and β) before estimating the variance (σ²). So we have n − 2 degrees of freedom.

Another way of looking at this is as follows. Suppose that in the plant example you only had two plants (n = 2). Then you have no degrees of freedom. Why is this?

With n = 2 you will have two points in the x − y data diagram plane. You can always fit a line exactly through these two points. But there would be no scatter of points around that line, and such a scatter is what is needed to estimate a variance. You then just cannot estimate σ² if n = 2.

Numerical example

We consider the “plant and water” example discussed when estimating parameters in the regression context. We have from that example b = 0.6510638, sxx = 188 and sr² = 0.384979. (The value given for sr² in the slides and handouts was incorrect. It will be corrected in the canvas version.) This gives

t = 0.6510638 √188 / √0.384979 = 14.38744,

in agreement with the value in the JMP output.
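The same arithmetic can be checked with a few lines of Python (again, not from the notes themselves), using the summary quantities quoted above.

    import math

    # Summary quantities quoted above for the plant and water example
    b   = 0.6510638   # estimated slope
    sxx = 188
    sr2 = 0.384979    # corrected estimate of sigma^2

    t = b * math.sqrt(sxx) / math.sqrt(sr2)
    print(round(t, 2))    # about 14.39, matching the JMP output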


Non-parametric (= distribution-free) tests.

Remember: any time that you use a t test you are implicitly assuming that the data are the observed values of random variables having a normal distribution. Although (as noted above) t tests are quite robust, that is they work quite well even if the data are not the observed values of random variables having a normal distribution, it is handy to have tests available that do not make this implicit assumption. These tests are called “non-parametric” tests. A better expression is perhaps “distribution-free”: they are free of the normal distribution assumption. However non-parametric tests do make some assumptions, as indicated below.

We start with three non-parametric tests that are alternatives to the two-sample t test. Remember the background to these tests: we have m observations (i.e. data values) in Group 1 and n observations (i.e. data values) in Group 2. The null hypothesis in all three tests is that these m + n observations are the observed values of random variables all having the same probability distribution. To be concrete we will only consider the case where the alternative hypothesis is that the probability distribution corresponding to the observations in Group 1 is the same as that corresponding to the observations in Group 2 except that it is moved to the right. Thus if the alternative hypothesis is true we expect the observations in Group 1 to tend to be larger than those in Group 2.

Non-parametric alternative number 1 to two-sample t tests: the two-by-two table test.

This is a rather crude and quick test. We start by putting all m + n observations into an ordered sequence from lowest to highest and then find some number A close to the half-way point. For example if the data are:-

Group 1: 56, 66, 65, 44, 51.

Group 2: 51, 65, 63, 48, 47, 50, 61,

then the ordered sequence is

44, 47, 48, 50, 51, 51, 56, 61, 63, 65, 65, 66.

We could choose A = 53. With this choice half the data values are less than A and half are greater than A.

The next step is to form a two-by-two table. Each data value will be categorized by rows as either Group 1 (in row 1) or Group 2 (in row 2). Also, each data value will be categorized by columns as either less than A (in column 1) or greater than A (in column 2). For example, the first observation in Group 1 (56) would go in row 1 (since it is in Group 1) and column 2 (since it is greater than 53). We now form a two-by-two table of counts, each count indicating how many observations are in each of the four cells. With the above data this table would be as follows.

          Less than A   Greater than A   Total
Group 1        2               3            5
Group 2        4               3            7
Total          6               6           12
                                                    (127)

If the null hypothesis is true there is no association between the row mode of categorization (Group membership) and the column mode of categorization (less than A or greater than A). So we now do a standard two-by-two table test, using z (since this is a one-sided test).

We have to be very careful about whether this is a one-sided up or a one-sided down test. You have to think about the logic of the situation.

The general form of the data in a two-by-two table is, for this test,

          Less than A   Greater than A   Total
Group 1        x1           n1 − x1        n1
Group 2        x2           n2 − x2        n2
Total          c1             c2            n
                                                    (128)

and the original form of the z statistic was

z = (x1/n1 − x2/n2) / √[(c1/n)(c2/n)/n1 + (c1/n)(c2/n)/n2].    (129)

Now ask yourself: If the alternative hypothesis is true, what sort of value should z tend to take? Well, if the alternative hypothesis (that the Group 1 probability distribution is moved to the right of the Group 2 probability distribution, so that the observations in Group 1 should tend to be larger than those in Group 2) is true, then the fraction x1/n1 should tend to be less than the fraction x2/n2, so that z should tend to be negative. Thus this test is one-sided down.
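Here is a minimal Python sketch (not from the notes) of this z calculation, using the counts in table (127); the variable names mirror the layout of table (128).

    import math

    # Counts from table (127): Group 1 has x1 of its n1 values below A,
    # Group 2 has x2 of its n2 values below A.
    x1, n1 = 2, 5
    x2, n2 = 4, 7
    c1 = x1 + x2                   # column total: values below A
    c2 = (n1 - x1) + (n2 - x2)     # column total: values above A
    n = n1 + n2

    p = (c1 / n) * (c2 / n)
    z = (x1 / n1 - x2 / n2) / math.sqrt(p / n1 + p / n2)
    print(round(z, 2))   # about -0.59: negative, as expected, but not significant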

Non-parametric alternative number 2 to the two-sample t tests: the Mann-Whitney (aka Wilcoxon two-sample) test.

In this procedure we put all the m + n data values into one sequence, ordered from lowest to highest, as in the previous procedure. We then assign them ranks (1, 2, . . . , m + n), with the smallest data value getting rank 1, the next smallest rank 2, . . . , the largest rank m + n. (If there are ties you share out the ranks - see example below.) Thus with the data values for the previous procedure we would have

Data values:  44   47   48   50   51    51    56   61   63   65    65    66
Rank:          1    2    3    4   5.5   5.5    7    8    9  10.5  10.5   12

The test statistic is the sum u of the ranks for the Group 1 data values. In the above example this is 1 + 5.5 + 7 + 10.5 + 12 = 36. To assess the significance of this value we have to find the null hypothesis mean and variance of the corresponding random variable U, the (random) sum of the ranks for Group 1 before the experiment was performed.

To find this mean we have to do two things:-

(i) Remember that the sum 1 + 2 + 3 + 4 + ... + a of the first a numbers is a(a + 1)/2. If m = 5, n = 7 as above, so that m + n = 12, the sum of all the ranks (taking both groups together) is 1 + 2 + 3 + 4 + ... + 12 = 78. In general this sum is (m + n)(m + n + 1)/2.

(ii) Use a proportionality argument (see the “Thanksgiving cake” homework question) to argue that since Group 1 comprises a proportion 5/12 of all the data values, then if the null hypothesis is true, the mean of U in the above numerical example is (5/12)(78) = 32.5. This argument is correct. In general, the mean of U is {m/(m + n)}(m + n)(m + n + 1)/2 = m(m + n + 1)/2 if the null hypothesis is true.

The null hypothesis variance of U is much harder to establish, so here is a “trust me” result: this null hypothesis variance is mn(m + n + 1)/12. In the above example this is (5)(7)(13)/12 = 37.9167.

We now compute a “z” statistic from the observed sum u of the ranks for Group 1 and this mean and variance as

z = [u − m(m + n + 1)/2] / √[mn(m + n + 1)/12].    (130)

This concludes Step 3: we will use this z as our (new) test statistic.

Step 4. What values of z will lead us to reject the null hypothesis? If the null hypothesis is true z is the observed value of a random variable having very close to a “Z” distribution. (The Central Limit Theorem helps with this claim, since u is a sum.) So we will use “Z” charts in the testing procedure. To proceed further we have to think what values of z would tend to arise if the alternative hypothesis is true. In the above example, where the alternative hypothesis is that the probability distribution for Group 1 is shifted to the right relative to that of Group 2, we would expect the Group 1 data values to tend to attract higher ranks than the Group 2 data values, so that u would tend to exceed m(m + n + 1)/2 and z would tend to be positive. Thus significantly large positive values of z will lead us to reject the null hypothesis. So the test is one-sided up. How large z has to be depends on the value chosen for α: if α = 0.05 we reject the null hypothesis if z ≥ 1.645 and if α = 0.01 we reject the null hypothesis if z ≥ 2.326.

Step 5. In the above example

z = (36 − 32.5) / √37.9167 = 0.57,

and this is not significant for any reasonable value of α. We do not have enough evidence to reject the null hypothesis.
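A short Python sketch of this calculation for the data above; scipy.stats.rankdata is assumed available and, like the table above, gives tied values shared ranks.

    import math
    from scipy import stats

    group1 = [56, 66, 65, 44, 51]
    group2 = [51, 65, 63, 48, 47, 50, 61]
    m, n = len(group1), len(group2)

    # Rank all m + n values together (ties get shared ranks), then sum
    # the ranks belonging to the Group 1 values.
    ranks = stats.rankdata(group1 + group2)
    u = ranks[:m].sum()                      # 36.0 for these data

    mean_u = m * (m + n + 1) / 2             # 32.5
    var_u = m * n * (m + n + 1) / 12         # 37.9167
    z = (u - mean_u) / math.sqrt(var_u)
    print(round(z, 2))                       # about 0.57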

Notes on this procedure.

1. This non-parametric procedure uses the data more efficiently than did non-parametric procedure 1. In procedure 1 all we looked at was whether an observation was more than A or less than A. We disregarded how much more than A it was, or how much less than A it was, whereas the procedure just described to some extent takes this information into account.

2. Despite this comment, we have lost some information by replacing each data value by its ranking value. The next procedure that we consider does not do this.

3. This procedure is very popular, since it can be shown that in some sense of the word “efficient”, it is about 95% as efficient as the two-sample t test even when the data do have a normal distribution.

Non-parametric alternative number 3 to the two-sample t tests: the permutation test.

In this procedure we retain the original data values. We permute the data in all possible ways and calculate the two-sample t statistic for each permutation. (One of these permutations will correspond to the actual data.) The null hypothesis is rejected if the value of t calculated from the actual data is a significantly extreme one of all these permuted values.

It is best to demonstrate the idea with an example. Suppose that the null hypothesis is that men and women have the same mean blood pressure, and that we plan to test this null hypothesis by taking the blood pressures of m = 5 men and n = 5 women. The blood pressures of the five men are 122, 131, 98, 114, 132 and the blood pressures of the five women are 113, 110, 127, 99, 119.

There are (10 choose 5) = 252 permutations of the data such that 5 values are “men” values and the remaining 5 are “women” values. Each will lead to a value of t. Here are some of the 252 permutations and the corresponding value of t:-

Permutation 1 (the real data).
  Men:   122, 131, 98, 114, 132     Women:   113, 110, 127, 99, 119    → t = 1.27.

Permutation 2.
  “Men”: 127, 114, 132, 99, 113     “Women”: 110, 122, 131, 119, 98    → t = 1.01.

Permutation 3.
  “Men”: 98, 99, 114, 110, 119      “Women”: 127, 131, 132, 113, 122   → t = −2.30.

...................................................................................

Permutation 252.
  “Men”: 132, 99, 113, 127, 131     “Women”: 110, 119, 122, 98, 114    → t = 1.58.

Suppose that the alternative hypothesis is that the mean blood pressure for men is higher than the mean blood pressure for women. If we had chosen α = 0.05 we would reject the null hypothesis (that men and women have the same probability distribution of blood pressure) if the observed value of t, here 1.27, is among the highest 0.05 × 252 = 12.6, or conservatively in practice 12, of these 252 permutation t values.

Here is the logic behind this. If the null hypothesis is true, then given the 10 data values, but with no labeling as to gender, all of the 252 permutation values of t are equally likely. Thus if the null hypothesis is true the probability that the actual value of t is among the 12 largest of these 252 t values is 12/252, or 4.76%. (It is not possible with 5 + 5 = 10 data values to have a test with an exact value of α equal to 5%.) Our test is a bit conservative.
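Here is a Python sketch of the permutation test for the blood pressure data (not part of the original notes). It uses scipy’s two-sample t statistic for each of the 252 relabellings and then counts how many permutation values are at least as large as the value for the actual labelling, mirroring the “among the 12 largest” rule described above.

    from itertools import combinations
    from scipy import stats

    men = [122, 131, 98, 114, 132]
    women = [113, 110, 127, 99, 119]
    data = men + women
    m = len(men)

    # t statistic for the actual labelling of the data
    t_actual = stats.ttest_ind(men, women).statistic

    # t statistic for every way of choosing which 5 of the 10 values are "men"
    perm_t = []
    for idx in combinations(range(len(data)), m):
        g1 = [data[i] for i in idx]
        g2 = [data[i] for i in range(len(data)) if i not in idx]
        perm_t.append(stats.ttest_ind(g1, g2).statistic)

    # One-sided test: reject if t_actual is among the 12 largest of the 252 values
    n_at_least_as_large = sum(1 for t in perm_t if t >= t_actual)
    print(len(perm_t), n_at_least_as_large)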

Notes on these three non-parametric procedures.

1. If the data are the observed values of random variables having a normal distribution, it can be shown that in some sense the t test is “optimal”. If no assumption is made as to whether the data are the observed values of random variables having a normal distribution there is no unique “optimal” procedure. That is why there are several non-parametric procedures.

2. What do you potentially lose by using a non-parametric procedure? If the data had been the observed values of random variables having a normal distribution you have used a non-optimal procedure. On the other hand, as noted above, the Mann-Whitney (Wilcoxon two-sample) test is about 95% as efficient as the two-sample t test even when the data do have a normal distribution.

3. The third non-parametric procedure described above is clearly computer intensive. If m = n = 10 there are 184,756 possible permutations of the data. However present-day computers could easily handle this. However if m = n = 20 there are about 1.378 × 10¹¹ possible permutations, and this might be too many even for a powerful computer. In this case you would perform a random sample of (say) a million permutations (including the actual one) and see if the actual t is among the most extreme of these.


Non-parametric alternatives to the one sample t test.

Remember that if you use a t test you are implicitly assuming that the data are the observed values of random variables having a normal distribution. Remember also that the normal distribution is symmetric about its mean µ. We do have to make an assumption in the two non-parametric alternatives to the one sample t test discussed below, namely that the data x1, x2, . . . , xn are the observed values of n iid random variables X1, X2, . . . , Xn, each X having the same symmetric probability distribution. We denote the point of symmetry of this distribution, which is also the mean of the distribution, by λ.

Step 1. In both tests that we consider the null hypothesis claims that λ = λ0, where λ0 is some prescribed numerical value. The alternative hypothesis can be one-sided up (λ > λ0), one-sided down (λ < λ0), or two-sided (λ ≠ λ0). Which is the appropriate alternative hypothesis depends on the context.

Step 2. Here we choose the value of α, typically 5% or 1%.

The remaining steps depend on the test considered. We consider two tests in turn.

Non-parametric alternative number 1 to the one sample t test: The sign, or “binomial”, test.

Step 3. The test statistic is y, the number of data values greater than the null hypothesismean λ0.

Step 4. What values of y lead us to reject the null hypothesis? If the alternative hypothesis is one-sided up (λ > λ0) then sufficiently large values of y lead us to reject the null hypothesis. If the alternative hypothesis is one-sided down (λ < λ0) then sufficiently small values of y lead us to reject the null hypothesis. If the alternative hypothesis is two-sided (λ ≠ λ0) then sufficiently large values or sufficiently small values of y lead us to reject the null hypothesis.

How large or small? To answer this question we consider the probability distribution of the random variable Y corresponding to y. If the null hypothesis is true, Y has a binomial distribution with parameter 1/2 (this is where the assumption that the distribution of the random variables X1, X2, . . . , Xn is symmetric comes in) and index n. So we now do a standard binomial test, as discussed previously in these notes.

This is a weak test since it takes no account of the extent to which any data value is less than, or more than, λ0. So we will not work through a numerical example by hand (though a short computational sketch follows). The next test takes this information into account, at least to some extent.
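Purely for illustration, here is a minimal Python sketch of the sign test; it borrows the seven heights from the Wilcoxon example below, takes λ0 = 69, and assumes scipy is available.

    from scipy import stats

    # Sign test sketch: the seven heights from the example below, lambda_0 = 69
    data = [67.2, 69.3, 69.7, 68.5, 69.8, 68.9, 69.4]
    lam0 = 69
    n = len(data)

    y = sum(1 for x in data if x > lam0)   # number of data values above lambda_0

    # Under the null hypothesis Y is binomial with index n and parameter 1/2.
    # One-sided-up P-value: the probability of y or more values above lambda_0.
    p_value = stats.binom.sf(y - 1, n, 0.5)
    print(y, round(p_value, 3))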


Non-parametric alternative number 2 to the one sample t test: The Wilcoxon one-sample test.

Steps 3, 4 and 5 rolled together.

Suppose that we have our data values x1, x2, . . . , xn. We first construct the differences x1 − λ0, x2 − λ0, . . . , xn − λ0. Some of these differences will probably be negative, some positive. Next we ignore the signs of these differences and construct the absolute values |x1 − λ0|, |x2 − λ0|, . . . , |xn − λ0|. These absolute values are then put in order from smallest to largest, and ranks are then assigned to these absolute differences, the smallest getting rank 1, ..., the largest getting rank n. The test statistic is w, the sum of the ranks of the originally positive differences.

Before going further we consider a numerical example. We are interested in the heights of adult males and wish to test the null hypothesis that the mean height of adult males is 69. (All heights are measured in inches.) Thus λ0 = 69. We are not willing to believe that the height of an adult male taken at random has a normal distribution, but we are willing to believe that it has a symmetric distribution about its mean, so we plan to use the Wilcoxon one-sample test. Suppose that the heights of n = 7 randomly chosen adult males are:-

Heights:               67.2   69.3   69.7   68.5   69.8   68.9   69.4
Differences:           −1.8    0.3    0.7   −0.5    0.8   −0.1    0.4
Absolute differences:   1.8    0.3    0.7    0.5    0.8    0.1    0.4
Rank:                     7      2      5      4      6      1      3

Thus here w = 2 + 5 + 6 + 3 = 16.

What values of w lead us to reject the null hypothesis? If the alternative hypothesis is one-sided up (λ > 69), and this alternative hypothesis is true, we will expect many of the differences to be positive and thus w would tend to be large. Hence sufficiently large values of w lead us to reject the null hypothesis. If the alternative hypothesis is one-sided down (λ < 69), and this alternative hypothesis is true, we will expect few of the differences to be positive and thus w would tend to be small. Hence sufficiently small values of w lead us to reject the null hypothesis. If the alternative hypothesis is two-sided (λ ≠ 69), and this alternative hypothesis is true, we will expect that either many of the differences will be positive and thus w will tend to be large, or that many of the differences will be negative and thus w will tend to be small. Hence sufficiently large or small values of w lead us to reject the null hypothesis.

How large or how small? To answer this question we consider the “before the experiment” random variable W corresponding to w. With n data values the sum of all the ranks is 1 + 2 + · · · + n = n(n + 1)/2. This is 7(8)/2 = 28 in the above example. Under the null hypothesis the probability that any given one of the X values is greater than λ0 is 1/2 (this is where the assumption of symmetry of the distribution of each X comes in). This implies that if the null hypothesis is true, the mean value of W is n(n + 1)/4.

It is much harder to calculate the variance of W when the null hypothesis is true, so here is a “trust me” result: When the null hypothesis is true the variance of W is n(n + 1)(2n + 1)/24.

Further, when the null hypothesis is true, W has close to a normal distribution (the Central Limit Theorem helps with this assertion). Thus instead of using w as test statistic it is convenient to use z, defined by

z = [w − n(n + 1)/4] / √[n(n + 1)(2n + 1)/24].

This is the end of Step 3: we take z as just defined as the test statistic.

What values of z lead us to reject the null hypothesis? When the null hypothesis is true z is the observed value of a “Z”, that is a random variable having very close to a normal distribution with mean 0 and variance 1. From the discussion above concerning the values of w which would lead to rejection of the null hypothesis, we would reject the null hypothesis in favor of the alternative hypothesis λ > λ0 if z is sufficiently large positive, we would reject the null hypothesis in favor of the alternative hypothesis λ < λ0 if z is sufficiently large negative, and we would reject the null hypothesis in favor of the alternative hypothesis λ ≠ λ0 if z is sufficiently large positive or negative. How large depends on the value chosen for α in Step 2, and the critical value(s) will be numbers such as 1.645, 2.326, ±1.96, and ±2.575 found from the z chart.

Suppose that in the above example the alternative hypothesis had been λ > 69 and that we chose α = 0.05. Then we would reject the null hypothesis if z ≥ 1.645. With the above values

z = [16 − 7(8)/4] / √[7(8)(15)/24] = 0.34,

and we do not have enough evidence to reject the null hypothesis.
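A short Python sketch of this calculation for the heights data (again not from the notes themselves), using scipy’s rankdata for the ranks of the absolute differences:

    import math
    from scipy import stats

    heights = [67.2, 69.3, 69.7, 68.5, 69.8, 68.9, 69.4]
    lam0 = 69
    n = len(heights)

    diffs = [x - lam0 for x in heights]
    ranks = stats.rankdata([abs(d) for d in diffs])     # ranks of the absolute differences
    w = sum(r for d, r in zip(diffs, ranks) if d > 0)   # w = 16 for these data

    mean_w = n * (n + 1) / 4                  # 14
    var_w = n * (n + 1) * (2 * n + 1) / 24    # 35
    z = (w - mean_w) / math.sqrt(var_w)
    print(round(z, 2))                        # about 0.34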


On the horizon

The topics which naturally follow from those considered in this course are:-

1. ANOVA (the Analysis of Variance), considered initially as the extension of the two-sample t test to more than two samples, with many generalizations.

2. Correlation. (The relation between two random variables, for example the heights and weights of randomly chosen adult males.)

3. Non-parametric methods in regression and other areas.

4. Sampling from a finite population.

5. Descriptive statistics.

6. Multiple regression.

These and other topics will be discussed in STAT 112.
