
Course Manual: Base Module - Diploma in Statistics, TCD, 2009-10 1

Chapter 6: The analysis of count data

6.0 Introduction

In this chapter we turn our attention to situations where our data come in the form of counts, rather than continuous measurements. The simplest example, to be discussed in Section 6.1, is, perhaps, a political poll, where n voters are randomly selected from a population of voters and asked a question to which they must answer either yes or no. What is of interest is the proportion of voters in the population who would say yes to the question. The analysis will involve calculating a confidence interval for the population proportion; this is directly equivalent to calculating a confidence interval for a population mean based on a sample mean, as discussed in Chapter 2.

In many comparative studies, the outcome of interest is a count rather than a measured quantity; in Section 6.2 we will consider drug trials where two groups of patients, who are treated differently, are classified as improved/not improved, rather than measured on an improvement scale. In such cases the question of interest is whether there would be a difference in the population proportions that improve, if the treatments were applied to whole populations, rather than to relatively small samples. Again, this is directly equivalent to the simple comparative studies (involving two-sample and paired t-tests and the corresponding confidence intervals) of Chapter 2, where comparisons of population means were of interest. The section concludes with a discussion of how to determine appropriate sample sizes for comparative studies where the responses are proportions.

In Section 6.3 we generalise from a two-group comparison to comparing proportions arising in more than two groups, exactly as was done in Chapter 4 where ANOVA was used in comparing several means. We also consider situations where more than two outcomes are observed: for example, we examine the results of a market research study where respondents were classified as having bought a product, having heard of it but not bought it, or as never having heard of it. The chi-square distribution is introduced for analysing such data.

Section 6.4 is concerned with 'goodness-of-fit' tests: we ask if an observed set of counts corresponds to an a priori set of relative frequencies. To answer the question, we carry out a statistical test which compares the divergences between the observed and expected counts, and assesses whether the overall divergence would be likely under chance variation. The final section deals with some additional topics: the comparison of two proportions that sum to 1; the problem of ignoring influential factors when analysing count data, where we will consider a special case known as 'Simpson's paradox'; and, finally, the analysis of matched samples.


6.1 Single proportions

We are often interested in counting the relative frequency of occurrences of particular characteristics of units sampled from a population or process, with a view to making inferences on the population or process characteristics. For example, political polls are carried out to estimate support for individual politicians or parties in a defined population of electors; market research studies are conducted to elicit information on purchasing behaviour in a defined population of consumers; quality control samples are taken to monitor conformance to specifications of industrial products. Here the practical question is how to estimate the relative frequency of a given characteristic in a population or process from that of the sample.

The Statistical Model

Suppose a random sample of n units is selected from a population or process in which the proportion of units that has a particular characteristic is π, and that r members of the sample have this characteristic. It can be shown that under repeated sampling the sample proportion p = r/n behaves as if selected from a Normal distribution with mean π and standard error:

SE(p) = √(π(1−π)/n)

This means that repeated samples of size n will give sample proportions that vary around π as shown in Figure 6.1 below. Note that this result is approximate and that the larger the sample size n is, the better the approximation becomes.

Figure 6.1 The distribution of sample proportions
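This result is easy to check by simulation. The sketch below (Python; the values of n, π and the number of repetitions are illustrative, not taken from the text) draws repeated samples and compares the empirical standard deviation of p with √(π(1−π)/n):

```python
import math
import random

def simulate_proportions(pi, n, reps, seed=42):
    """Draw `reps` samples of size n from a population in which the
    true proportion of 'successes' is pi; return the sample proportions."""
    rng = random.Random(seed)
    props = []
    for _ in range(reps):
        r = sum(1 for _ in range(n) if rng.random() < pi)  # count successes
        props.append(r / n)
    return props

pi, n = 0.5, 400
props = simulate_proportions(pi, n, reps=2000)
mean_p = sum(props) / len(props)
sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in props) / (len(props) - 1))
theory = math.sqrt(pi * (1 - pi) / n)   # SE(p) = sqrt(pi(1-pi)/n) = 0.025 here
print(mean_p, sd_p, theory)
```

The empirical standard deviation of the simulated proportions should land very close to the theoretical standard error, and the proportions centre on π, as Figure 6.1 illustrates.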


The fact that p is approximately Normally distributed allows the use of the methods previously encountered for the analysis of measured variables, even though we are now dealing with counts, i.e., even though the possible values of r are integers.

A confidence interval for π

The statistical question is how to estimate π, the population or process proportion, from the sample value, p. Recall that when dealing with measured quantities we used the fact that if some variable x is at least approximately Normally distributed (say, with mean μ and standard deviation σ) then the average of n randomly selected values, x̄, will also be Normally distributed, with the same mean, μ, and with standard error σ/√n. This is illustrated in Figure 6.2 below.

Figure 6.2: Distributions of individual measurements (a) and of the averages of n measurements (b).

This gave us a method for estimating the population or process mean μ, using the sample information. A 95% confidence interval for μ was given by:

x̄ ± 1.96 σ/√n

Where σ was unknown it was replaced by the sample value s and, if the sample size was small, the standard Normal value of 1.96 was replaced by the corresponding Student's t-value. The sample proportion p in the current case corresponds directly to x̄ above. Similarly, we estimate the standard error of p by replacing π by the sample estimate p in the formula for the standard error:

ŜE(p) = √(p(1−p)/n)

This allows us to calculate an approximate 95% confidence interval† for π as:

p ± 1.96 √(p(1−p)/n)

Note the structure of the confidence interval formula: the confidence interval is given by a point estimate of the parameter of interest, plus or minus a critical value multiplied by the estimated standard error of the point estimate. This structure was encountered in Chapter 2, where confidence intervals were calculated for population means:

x̄ ± t_c ŜE(x̄)

and in Chapter 5, where it applied to both the slope and intercept of the regression line:

β̂₁ ± t_c ŜE(β̂₁)

β̂₀ ± t_c ŜE(β̂₀)

Example 1: French presidential election

On May 4th, 2007 the BBC website reported that an opinion poll (Ifop for the Le Monde newspaper on May 3rd) 'put Mr. Sarkozy at 53%, with Ms. Royal trailing with 47%' in the election campaign for the French presidency (voting took place on the following Sunday). The sample size was 858 people and all had watched the head-to-head television debate between the candidates. What can we infer about the voting intentions of the French electorate from these sample findings? A 95% confidence interval for the population proportion who favoured Sarkozy is given by:

p ± 1.96 √(p(1−p)/n)

0.530 ± 1.96 √(0.530(1−0.530)/858)

0.530 ± 0.033

i.e., between 49.7% and 56.3% are estimated to favour Sarkozy.
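The arithmetic of this interval is easily reproduced (a minimal Python sketch of the formula above; the function name is mine, not from the text):

```python
import math

def proportion_ci(p, n, z=1.96):
    """Approximate 95% CI for a population proportion: p +/- z*sqrt(p(1-p)/n)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = proportion_ci(0.530, 858)
print(round(lo, 3), round(hi, 3))  # 0.497 0.563
```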

† Obviously other confidence levels may be obtained by replacing 1.96 by the corresponding Z value; thus, 2.58 gives 99% confidence.


A confidence interval allows for the sampling variability involved in basing our conclusions about the population on the results of a random sample. Taken at face value the confidence interval indicates that, although Sarkozy was six percentage points above Royal in the sample, sampling variability is such that Royal could have been very slightly ahead in the population of voters [50.3%, corresponding to the lower bound of 49.7% for Sarkozy]. In other words, the sample size was not large enough to distinguish clearly between the two candidates. Note that the formula implies that the larger n is, the narrower will be the error bound on the sample result.

The confidence interval is based on the assumption of a simple random sample from the population of interest; this means that all voters would be equally likely to be included in the sample. In practice, this is unlikely to be true for several reasons. The poll was very likely a telephone survey, which raises several questions: do all voters have telephones; if several voters reside in the same house, are they equally likely to answer the telephone; will they answer truthfully if they do respond to the survey? The poll was aimed at people who had watched the TV debate: is this subset of the French electorate identical in its voting intentions to the overall electorate? What exactly was the question asked: did they ask who had won the debate, or whom the respondent would support? The questions may appear the same, but it is dangerous to assume that the respondent interprets a question as you do, unless it is made very explicit. [I had a colleague once who, when asked by a TV licence inspector if there was a television in the house, replied that there was, and when asked if he had a TV licence, replied that he did not. When summoned to court he was asked by the judge why he did not have a TV licence; he explained that he did not own the TV, which was his wife's, and that she did have a TV licence!]
Many people are likely to refuse to answer survey questions (whether on the telephone or otherwise): is there a difference between the subset of the population who are willing to respond and the subset who are not? Any such difference leads to a non-response bias. Many surveys have response rates of less than 20%. In such cases it is virtually impossible to assess the likely bias in the sample of respondents. In short, surveying is something of a black art! The use of a mathematical formula, for most people, suggests an exactness which is clearly inappropriate in very many survey situations. Judgements need to be made on how representative the sample is likely to be, and the uncertainties inherent in these judgements are not included in the error bounds generated by the confidence interval. The real uncertainty is likely to be quite a lot higher (note the precision in this final sentence!).

Exercise 6.1.1
On Saturday May 17th, 2008, the Irish Times published the results of a poll on attitudes to the planned constitutional referendum on the Lisbon Treaty. The size of the sample was 1000; 35% of respondents said they intended to vote 'yes'. Estimate the population proportion who intended to vote 'yes' at this point in the campaign.

Sample Size

In any empirical study, one of the first questions that needs to be answered is: how many observations are required? One way of addressing this question in the current context is to ask how wide the confidence interval bounds on the sample proportion could be while still giving an interval that answers the practical question, i.e., an interval that is 'fit for purpose'. The standard error of the sample proportion is:

SE(p) = √(π(1−π)/n)

This depends not only on the sample size (n) but also on the population or process proportion (π), or rather on the product π(1−π). Table 6.1 shows how this product varies with π; it assumes its worst value (in the sense of giving the widest interval) when π is 0.5.

π     1−π    π(1−π)
0.1   0.9    0.09
0.2   0.8    0.16
0.3   0.7    0.21
0.4   0.6    0.24
0.5   0.5    0.25
0.6   0.4    0.24
0.7   0.3    0.21
0.8   0.2    0.16
0.9   0.1    0.09

Table 6.1: The impact of π on the standard error

Suppose we require that a 95% confidence interval is of the form:

p ± 0.01

Then, since the confidence interval is given by:

p ± 1.96 √(p(1−p)/n)


and the worst-case value of p(1−p) is 0.25, we require:

1.96 √(0.25/n) = 0.01

which gives a sample size of

n = (1.96)² (0.25) / (0.01)² = 9604
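This worst-case sample-size calculation generalises to any desired half-width of the interval. A small Python sketch (the function name and the parameter `d`, the half-width, are mine):

```python
import math

def sample_size(d, z=1.96, pi=0.5):
    """Smallest n so that a CI of the form p +/- d is achieved in the
    worst case (pi = 0.5 maximises pi(1-pi))."""
    n = z ** 2 * pi * (1 - pi) / d ** 2
    # round before ceiling to guard against floating-point noise
    return math.ceil(round(n, 6))

print(sample_size(0.01))  # 9604
print(sample_size(0.03))  # 1068
```

The second call shows why polls of roughly a thousand respondents are quoted with a margin of about three percentage points.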

If n is around 1000, the error bounds are ± 0.03. This is a typical sample size for political polls, which are usually reported in the press as subject to an error of approximately 3%.

Significance Tests

When only a single proportion is selected it is much more likely that the desired analysis will be to calculate a confidence interval than to carry out a significance test. Confidence intervals are more useful in that they not only tell us whether the data are consistent with a particular hypothesised population or process value, but also show the range of values with which the results are consistent. Tests may be carried out, however, as the following (invented) example illustrates.

Example 2: A market research study

Suppose a manufacturer of a new electronic toy has been assured by a marketing agency that a (very expensive) television advertising campaign will ensure a market penetration of at least 50% (i.e., at least 50% of the target age group of 15–25 year olds will have heard of the product at the end of the campaign). After the TV campaign an independent market survey is commissioned, and a sample of n = 1000 members of the target population gives a proportion of p = 0.45 who have heard of the product. Has the marketing agency delivered the promised performance? The null hypothesis is that the population proportion who have heard of the product is at least 0.5. We specify our hypotheses as:

Ho: π ≥ 0.5

H1: π < 0.5


The sampling distribution of p, when the null hypothesis is true, has a mean of π₀ and a standard error of:

SE(p) = √(π₀(1−π₀)/n) = √(0.5(1−0.5)/1000) = 0.0158

where π₀ is 0.5 and n = 1000. The test statistic is:

Z = (p − π₀) / √(π₀(1−π₀)/n) = (0.45 − 0.5) / √(0.5(1−0.5)/1000) = −3.16

This follows a standard Normal distribution when the null hypothesis is true. For a one-tailed significance test, with significance level α = 0.05, the critical value is Zc = −1.645.

Figure 6.3: The cut-off value for the test statistic
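The test computation of Example 2 can be sketched in a few lines of Python (the function name is mine; the standard error uses the null value π₀, as in the formula above):

```python
import math

def one_sample_z(p, pi0, n):
    """Z statistic for testing H0: pi = pi0, with the standard error
    computed from the hypothesised value pi0."""
    se = math.sqrt(pi0 * (1 - pi0) / n)
    return (p - pi0) / se

z = one_sample_z(0.45, 0.5, 1000)
print(round(z, 2))   # -3.16
print(z < -1.645)    # True: reject H0 at the one-tailed 5% level
```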

Since the observed Z statistic is much smaller than this, we reject the null hypothesis and conclude that the population proportion is less than 0.5: the marketing campaign has not delivered the promised result.

Continuity Correction

In carrying out the Z-test, it is often advised that a 'continuity correction' be applied to the formula for the test statistic. This arises because although we are counting discrete events, our calculations are based on the Normal distribution, which is continuous; there is, therefore, an approximation involved. This is a technical detail which I have chosen to ignore in the course of this chapter. The continuity correction is only relevant when the sample size is relatively small (say less than 200); in the current case applying the correction changes the magnitude of the test statistic Z from 3.16 to 3.13. If sample sizes are small then computer-intensive 'exact tests' are preferable, in my opinion.

Exercise 6.1.2
Calculate a two-sided 95% confidence interval for the market penetration achieved by the campaign. Note that you should use the observed sample proportion (not the hypothesised value) when calculating the standard error.

6.2 Simple Comparative Studies

In Chapter 2, we discussed comparative studies where the outcomes of interest could be measured and the questions to be addressed referred to population or process means. We now turn our attention to studies where the outcome involves counting, rather than measuring, and the questions of interest refer to population or process proportions. Typical examples would be medical studies where the two groups are either inherently different (e.g., smokers versus non-smokers) or the subjects are initially similar and are broken into two groups to be allocated to different treatments; the number in each group who respond in a defined way (e.g., develop lung cancer or become free of a medical condition) after a period of time is then counted and expressed as a proportion of the group size. The statistical problem is to assess the extent to which observed sample differences are likely to be systematic and not simply due to chance variation.

The Statistical Model

In Section 6.1, we saw that if a single random sample of n units is selected from a population or process in which the proportion of units that has a particular characteristic is π, the sample proportion p = r/n behaves as if selected from a Normal distribution with mean π and standard error:

SE(p) = √(π(1−π)/n)

Assume now that we have two groups which are independent of each other, in the sense that the chance that any unit in one group does or does not have the characteristic of interest is not in any way affected by the responses of the units in the second group. If this is true then we can extend the result used above in a very simple way [see Appendix 6.1 for details]. If p1 = r1/n1 is the sample proportion for the first group and π₁ is the population or process proportion, and p2 = r2/n2 and π₂ are the corresponding values for the second group, then the difference between the sample proportions, (p1 − p2), has a sampling distribution with mean π₁ − π₂ and standard error:

SE(p1 − p2) = √(π₁(1−π₁)/n₁ + π₂(1−π₂)/n₂)

This is illustrated in Figure 6.4.

Figure 6.4: The sampling distribution of the difference between sample proportions
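The standard-error formula for p1 − p2 can likewise be checked by simulation. In the sketch below (Python) the group proportions and sizes are illustrative choices of mine, not from any study in the text:

```python
import math
import random

def simulate_diffs(pi1, n1, pi2, n2, reps=2000, seed=1):
    """Simulate the sampling distribution of p1 - p2 for two
    independent groups with true proportions pi1 and pi2."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        r1 = sum(1 for _ in range(n1) if rng.random() < pi1)
        r2 = sum(1 for _ in range(n2) if rng.random() < pi2)
        diffs.append(r1 / n1 - r2 / n2)
    return diffs

pi1, n1, pi2, n2 = 0.3, 200, 0.2, 150
diffs = simulate_diffs(pi1, n1, pi2, n2)
mean_d = sum(diffs) / len(diffs)
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (len(diffs) - 1))
theory = math.sqrt(pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2)
print(mean_d, sd_d, theory)
```

The simulated differences centre on π₁ − π₂ and their standard deviation matches the formula, as Figure 6.4 depicts.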

We are now back on familiar territory and can work with this sampling distribution, as it is merely a single Normal distribution (with which we have worked several times already), even though the standard error formula looks a little complicated. The statistical questions that we typically wish to address are whether or not the population or process proportions π₁ and π₂ are the same and, if there is a difference, how big that difference, π₁ − π₂, is. These questions will be addressed in turn by carrying out a statistical significance test for the difference between the long-run proportions, and by calculating a confidence interval for that difference. We will use a medical study to illustrate the methods.

Example 3: Use of Timolol for treating angina

Angina pectoris is a chronic heart condition in which the sufferer has periodic attacks of chest pain. Samuels [1] cites the data below from a study [2] to evaluate the effectiveness of the drug Timolol in preventing angina attacks; patients were randomly assigned to receive a daily dosage of either Timolol or a


Placebo for 28 weeks. The numbers of patients who did or did not become completely free of angina attacks (Angina-Free) are shown in Table 6.2.

Treatment         Timolol  Placebo
Angina-Free           44       19
Not Angina-Free      116      128
Totals               160      147

Table 6.2: The Timolol study results

Significance Test

Obviously the sample proportion of patients who became Angina-Free is greater in the Timolol group (p1 = 44/160 = 0.275) than in the Placebo group (p2 = 19/147 = 0.129). If we imagine the two treatments being administered to a large population of such patients (from which the sample may be considered to be randomly selected) then we can ask if the sample results are consistent with a difference in the population proportions. We frame our question in terms of a null hypothesis which considers the long-run proportions as being the same, and an alternative hypothesis which denies their equality.

Ho: π₁ − π₂ = 0

H1: π₁ − π₂ ≠ 0

When the null hypothesis is true the sample difference (p1 − p2) has a sampling distribution which is approximately Normal with mean π₁ − π₂ = 0 and standard error:

SE(p1 − p2) = √(π(1−π)/n₁ + π(1−π)/n₂)

where π is the common population value. We calculate a Z statistic:

Z = ((p1 − p2) − 0) / √(π(1−π)/n₁ + π(1−π)/n₂)


This will have a standard Normal distribution. Because π is unknown it must be estimated from the sample data. If there is no group difference then a natural estimator is the overall sample proportion of patients who become Angina-Free, p = 63/307 = 0.205. This gives our test statistic¹ as:

Z = ((p1 − p2) − 0) / √(p(1−p)/n₁ + p(1−p)/n₂)
  = ((0.275 − 0.129) − 0) / √(0.205(1−0.205)/160 + 0.205(1−0.205)/147)
  = 3.16

If we choose a significance level of α = 0.05 then the critical values are ±1.96, as shown in Figure 6.5 below.

Figure 6.5: The Standard Normal distribution, with critical values

Our test statistic Z=3.16 is greater than the upper critical value: we reject Ho and conclude that, in the long-run, the proportion of patients who would become Angina-Free is greater when Timolol is prescribed than it would be if a placebo were administered routinely. Note the assumption implicit in the analysis, that the patients are a random sample from the corresponding population. In practice, of course, this is unlikely to be true (especially as it would be the intention to administer the treatment to future patients also). It is important, therefore, in considering the scientific value of the test result to assess the extent to which the sample may be treated as if it were a random sample from the target population. In reporting on the findings it would also be important to specify the target population carefully; for example, the sample may be restricted in its coverage: it may include patients of a narrow age profile, of only one sex, of only one nationality or race etc.
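The pooled two-sample test just described can be sketched in Python (counts from Table 6.2; the function name is mine, and the Normal p-value is computed via the complementary error function):

```python
import math

def two_proportion_z(r1, n1, r2, n2):
    """Pooled two-sample Z test of H0: pi1 = pi2."""
    p1, p2 = r1 / n1, r2 / n2
    pooled = (r1 + r2) / (n1 + n2)      # overall proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided Normal p-value
    return z, p_value

z, p = two_proportion_z(44, 160, 19, 147)
print(round(z, 2), round(p, 3))  # 3.16 0.002
```

The p-value of 0.002 matches the Minitab output discussed below.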

¹ Note: rounding affects such calculations; the ratio is 0.1457/0.0461.


Confidence Interval

In calculating a confidence interval for the difference between the two long-run proportions we do not assume (naturally!), as we do for the statistical test, that the two proportions are the same. A 95% confidence interval is given by:

(p1 − p2) ± 1.96 √(p1(1−p1)/n₁ + p2(1−p2)/n₂)

For the Timolol data this gives:

(0.275 − 0.129) ± 1.96 √(0.275(1−0.275)/160 + 0.129(1−0.129)/147)

0.146 ± 0.088
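The unpooled interval can be sketched as follows (Python; the function name is mine, and exact counts rather than rounded proportions are used, so the bounds match the computer output quoted later in the section):

```python
import math

def diff_ci(r1, n1, r2, n2, z=1.96):
    """Approximate 95% CI for pi1 - pi2, using the unpooled standard error."""
    p1, p2 = r1 / n1, r2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

lo, hi = diff_ci(44, 160, 19, 147)
print(round(lo, 3), round(hi, 3))  # 0.058 0.234
```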

The interval suggests that if patients were routinely administered Timolol (according to the study protocol) then an excess of between 6% and 23% of the population would become Angina-Free, over and above the percentage that would be expected if the patients were routinely prescribed a placebo.

Typical Computer Output

Table 6.3 shows the output obtained from Minitab when the Timolol study sample results are entered.

Test and CI for Two Proportions

Sample    X    N  Sample p
1        44  160  0.275000
2        19  147  0.129252

Difference = p (1) - p (2)
Estimate for difference:  0.145748
95% CI for difference:  (0.0578398, 0.233657)
Test for difference = 0 (vs not = 0):  Z = 3.16  P-Value = 0.002

Table 6.3: Minitab analysis of Timolol data

The p-value associated with the test indicates that if the null hypothesis were true, the probability would be 0.002 that we would have obtained, just by chance, a Z test statistic as far out, or further, into the tails of the standard Normal distribution, than the value Z=3.16 obtained here. Accordingly, this is an unusually large value and the null hypothesis of no long-run group difference


should be rejected. The computer output also gives the confidence interval (to an impressive number of decimal places!).

Exercise 6.2.1
On Saturday May 17th, 2008, the Irish Times published the results of a poll on attitudes to the planned constitutional referendum on the Lisbon Treaty. The size of the sample was 1000; 35% of respondents said they intended to vote 'yes'. A similar poll, carried out at the end of January, 2008, gave 26% as the percentage who intended to vote 'yes'. Carry out a test to determine if the data suggest that the population proportions intending to vote 'yes' were the same or different on the two sampling dates. Estimate the change in the population proportion of those who intended to vote 'yes'.

Other Sampling Designs

The data of Example 3 resulted from a designed experiment (a clinical trial) in which patients were assigned randomly to the two treatment groups. Where random assignment is possible it strengthens the conclusions that may be drawn from the study, as randomisation is the best protection against unknown sources of bias, which may be present in purely observational studies, as discussed in Chapter 3. Our next two examples give data that are analysed in exactly the same way as the Timolol data, even though the data-generating mechanisms are different. Example 4 is a typical cohort or longitudinal study. Such studies are said to be prospective: they identify the study groups first and subsequently classify the subjects or patients according to the outcomes being studied.

Example 4: Cohort Study of Slate Workers

Machin et al. [3] cite a study by Campbell et al. [4] in which a group of Welsh men who had been exposed to slate dust through their work were compared to a similar group who had not been so exposed; the control group was similar in age and smoking status. The groups were followed over 24 years and the numbers of deaths in the two groups were recorded, as shown in Table 6.4.

               Died  Survived  Totals  Proportions
Slate Workers   379       347     726        0.522
Controls        230       299     529        0.435
Totals          609       646    1255

Relative risk:  1.201
Odds ratio:     1.420

Table 6.4: A cohort study of slate workers and controls
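The relative risk and odds ratio quoted in Table 6.4 can be reproduced directly from the counts. A Python sketch (the function names and the 2x2 layout convention are mine; both measures are defined in the text later in this section):

```python
def relative_risk(a, b, c, d):
    """2x2 table with rows (exposed, control) and columns (event, no event):
    RR = (a/(a+b)) / (c/(c+d))."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Odds ratio for the same layout: OR = (a*d) / (b*c)."""
    return (a * d) / (b * c)

# slate workers: 379 died, 347 survived; controls: 230 died, 299 survived
rr = relative_risk(379, 347, 230, 299)
or_ = odds_ratio(379, 347, 230, 299)
print(round(rr, 3), round(or_, 3))  # 1.201 1.42
```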


Table 6.5 shows a statistical comparison of the proportions who died in the two groups: the difference p1–p2 is highly statistically significant, with a p-value of 0.002.

Test and CI for Two Proportions

Sample    X    N  Sample p
1       379  726  0.522039
2       230  529  0.434783

Difference = p (1) - p (2)
Estimate for difference:  0.0872560
95% CI for difference:  (0.0315353, 0.142977)
Test for difference = 0 (vs not = 0):  Z = 3.05  P-Value = 0.002

Table 6.5: A statistical comparison of the death rates

The confidence interval shows an increased mortality rate of between 3 and 14 percentage points for the slate workers over the control group. The relative risk and odds ratio shown in Table 6.4 will be discussed later.

Example 5: Oesophageal Cancer Case-Control Study

Case-control studies involve selecting a group of patients with some condition (a disease, say), together with a control group, and then looking backwards in time (they are retrospective studies) to see if some antecedent or risk factor occurs more frequently in the study group than it does in the control group. If it does, it suggests a possible link between that factor and the outcome of interest. Obviously, case-control studies involve an inversion of the preferred point of view: we would like to be able to say that 'people who do X are more likely to get disease Y', whereas what we can say is that 'people who have disease Y are more likely to have done X'. Such studies are particularly useful for studying rare conditions, for which enormous sample sizes would be required for prospective studies (in order to be sure of recruiting into the study sufficient numbers of people who will get the disease).

Breslow and Day [5] cite a case-control study by Tuyns et al. [6] involving 200 males who were diagnosed with oesophageal cancer in a Brittany hospital. A control sample of 775 adult males was selected from electoral lists in the nearby communes. Both cases and controls were administered a detailed dietary interview, which contained questions about their consumption of alcohol and tobacco as well as food. Only a small part of the dataset is examined below; Breslow and Day discuss the data in a much more sophisticated way than is attempted here.


            Alcohol consumption (g/day)
             80+    <80   Totals  Proportions
Cases         96    104      200        0.480
Controls     109    666      775        0.141
Totals       205    770      975

Odds ratio:  5.640

Table 6.6: Oesophageal cancer case-control study
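The odds ratio in Table 6.6 follows directly from the four counts; note that for a retrospective case-control design only the odds ratio, not a difference or ratio of population risks, is directly estimable, as the text explains below. A minimal sketch:

```python
# Table 6.6 counts: rows (cases, controls), columns (80+ g/day, <80 g/day)
a, b = 96, 104    # cases: high alcohol, low alcohol
c, d = 109, 666   # controls: high alcohol, low alcohol
odds_ratio = (a * d) / (b * c)  # (96*666)/(104*109)
print(round(odds_ratio, 3))  # 5.64
```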

A Z-test (see Table 6.7, p < 0.001) provides strong statistical evidence of an association between alcohol consumption and disease status. The sample proportions P(high alcohol consumption | Case) = 0.48 and P(high alcohol consumption | Control) = 0.14 (where the bars are read as 'given that' or 'conditional on') are, therefore, indicative of underlying differences.

Test and CI for Two Proportions

Sample    X    N  Sample p
1        96  200  0.480000
2       109  775  0.140645

Difference = p (1) - p (2)
Estimate for difference:  0.339355
95% CI for difference:  (0.265916, 0.412793)
Test for difference = 0 (vs not = 0):  Z = 10.50  P-Value = 0.000

Table 6.7: Minitab analysis of oesophageal cancer case-control study

The confidence interval suggests a higher prevalence of heavy drinking among the cases, over and above the level for the controls, of between 27 and 41 percentage points. Measuring Association The Timolol study was a clinical trial – an experimental study in which patients were randomly assigned to the treatment and control groups. In such cases we have strong evidence for a causal relationship between the response and the treatment administered. The oesophageal cancer study, on the other hand, was a purely observational study: in such cases the assumption/deduction of a causal relationship between the risk factor (heavy drinking) and the response


(cancer) is always subject to the caveat that there could be other (unknown) factors, related to both of the observed factors, which are responsible for the apparent relationship. This is described as 'confounding'. For this reason, the term 'association' tends to be used, in order to make clear that the data collection method does not allow us to deduce, with any certainty, the existence of a causal relationship. For the Timolol study a natural measure of association is:

p1–p2 = P(Angina free|Timolol)–P(Angina free|Control)

as it makes a direct comparison between the two proportions of interest. We have already discussed such comparisons extensively. An alternative way of viewing the same information is to calculate an estimate of the relative risk:

Relative Risk = p1/p2 = 0.275/0.129 = 2.13

This lends itself to a simple description of the outcome of the study: 'the chances of becoming angina-free are more than doubled for the Timolol group'. Similarly, for the slate workers study the mortality rates after 24 years were p1=0.522 for the slate workers and p2=0.435 for the controls. The relative risk of dying was 1.2 – slate workers had a 20% higher chance of dying within the study period. Neither of these measures (p1–p2 or p1/p2) is available if the study is retrospective, since the relevant proportions are not available from the data. An alternative measure of association, the odds ratio, has the useful property that it provides a direct measure of association, even when the data collection is retrospective.

Odds Ratio

Table 6.8 reproduces the Timolol data, together with a second table where the frequencies are represented by letters.


               Timolol  Control  Totals             Timolol  Control  Totals
  Angina free       44       19      63    AF             a        b     a+b
  Not A-free       116      128     244    not-AF         c        d     c+d
  Totals           160      147     307    Totals       a+c      b+d       n

Table 6.8: Timolol study results

Out of the 160 people in the Timolol group 44 become Angina-free (AF) while 116 do not. Instead of expressing the chance of becoming Angina-free as a proportion of the total p1=44/160, an alternative is to say that the odds of becoming AF as opposed to not-AF are 44:116, which we may write as 44/116. Similarly, the odds of becoming AF for the control group are 19/128. If we take the ratio of these two quantities we get the odds ratio:

OR = (a/c)/(b/d) = ad/bc = (44/116)/(19/128) = 2.6

The odds of becoming AF are more than two and a half times higher for patients who received Timolol. Table 6.9 reproduces the oesophageal cancer study results in a similar format to those just considered for the Timolol study. Note, however, that it is the row totals that are fixed by the study design here, whereas it was the column totals for the Timolol study.

            Alcohol consumption (g/day)           Alcohol consumption (g/day)
              80+    <80  Totals                     80+    <80  Totals
  Cases        96    104     200     Cases             a      b     a+b
  Controls    109    666     775     Controls          c      d     c+d
  Totals      205    770     975     Totals          a+c    b+d       n

Table 6.9: Oesophageal cancer study results

We cannot calculate directly the odds of being a case for the high alcohol group, because the number would be meaningless – to see this you need only realise that the odds would double if the study had included 400 instead of 200 cases (assuming the same proportion of heavy drinkers appeared in the study). We can, however, find the odds of being a heavy drinker versus a moderate drinker


for each of the two groups. For the cases, this is 96/104 and for the controls, it is 109/666. The odds ratio is then:

OR = (a/b)/(c/d) = ad/bc = (96/104)/(109/666) = 5.6

The odds are more than five times higher that an oesophageal cancer patient will be a heavy drinker, than that a control will be one. Note, however, that the odds ratio formula (in letters) for the retrospective case-control study turns out to be exactly the same as for the prospective experimental study:

OR = ad/bc
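The cross-product formula is easily checked numerically; the sketch below (plain Python; the function name is my own) reproduces the odds ratios quoted for both studies, together with the relative risk for the prospective Timolol trial.

```python
def odds_ratio(a, b, c, d):
    """Cross-product (odds) ratio ad/bc for a 2x2 table laid out as
         a  b
         c  d."""
    return (a * d) / (b * c)

# Timolol study (Table 6.8): a=44, b=19, c=116, d=128
or_timolol = odds_ratio(44, 19, 116, 128)    # about 2.6

# Oesophageal cancer study (Table 6.9): a=96, b=104, c=109, d=666
or_cancer = odds_ratio(96, 104, 109, 666)    # about 5.64

# Relative risk (available for the prospective Timolol study only)
rr_timolol = (44 / 160) / (19 / 147)         # about 2.13
```

Note that the cross-product gives the same number whichever way the table is read, which is the property discussed in the text.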

and so, although the data were collected retrospectively, we can say that the odds of getting oesophageal cancer are more than five times higher for heavy drinkers than they are for more moderate drinkers. The ratio is often called the 'cross-product' ratio because of the 'shape' of the resulting formula: compare it to the layout of the letters in the Tables. This property, that the odds ratio produces a forward-oriented index even when the data collection is retrospective, makes it an attractive measure of association. The odds ratio also allows the tools of multiple regression to be applied to frequency data, when we want to analyse the influence of several risk factors simultaneously, e.g., the effects not just of alcohol consumption, but also of age, smoking habit, nutritional status etc. There is another reason why the odds ratio is considered a useful index: it is frequently applied to diseases with low rates2, and so it gives a close approximation to the relative risk for the groups being compared, even when the sampling is retrospective and the relative risk cannot be calculated.

Exercises

6.2.2. 460 adults took part in an influenza vaccination study. Of these 240 received the

influenza vaccine and 220 received a placebo vaccination. At the end of the trial 100 people contracted influenza of whom 20 were in the treatment group and 80 in the control group. Carry out a statistical test to determine if the vaccine is effective. Calculate and interpret a confidence interval which will estimate the potential effect of the vaccine (beyond that of a placebo). Calculate the odds ratio.

If the entire population of adults had been vaccinated, estimate the proportion of the

population that would be expected to have contracted influenza during the test period.

2 The odds are p/(1–p), which is approximately p(1+p+p²); so, as long as p is small (say no more than 0.05), p² and higher powers of p are negligible and p/(1–p) is effectively equal to p. Thus, the relative risk is effectively the same as the odds ratio.
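The approximation in this footnote is easy to illustrate numerically; the risks below are hypothetical, chosen only to show how close the odds ratio and relative risk are when both proportions are small.

```python
# Hypothetical small risks in two groups (illustration only)
p1, p2 = 0.030, 0.015

relative_risk = p1 / p2                          # 2.0
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))   # close to the relative risk
```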


6.2.3. A study of respiratory disease in childhood [7] examined the question of whether children

with bronchitis in infancy get more respiratory symptoms in later life than others. Of 273 children with a history of bronchitis before age 5 years, 26 were reported to have a persistent cough at age 14, whereas in the group of 1046 children in the study who did not have bronchitis before the age of 5 years, 44 were reported to have a persistent cough at 14 years. Are the proportions having a persistent cough statistically significantly different? Calculate an interval estimate of the long-run average difference between the two rates of occurrence of persistent cough at age 14 years. Calculate the odds ratio.

Sample Size for Comparative Studies

We have discussed how to analyse studies where the response of interest is the difference between the sample proportions having a particular characteristic in each of two groups of subjects/experimental units. We now turn to the prior question, which is how to decide on a sample size for such studies. The discussion will follow closely the discussion of sample size requirements for studies leading to a two-sample t-test for continuous responses, in Chapter 3. Detailed discussion of the various issues that arise (with references to the research literature), and corresponding sample size tables, will be found in Machin and Campbell [8].

Sample Size Formula

Consider a situation where two independent sample proportions (p1 and p2) are to be compared using a Z-test. The null hypothesis is that the corresponding population proportions (π1 and π2) are the same, i.e., π1 = π2. In most cases3 equal group sizes will be decided upon (for a given total sample size, N, the Z-test will be most powerful if n1=n2=N/2) and the following (approximate) formula is widely recommended for arriving at an appropriate value for n, the number of subjects/experimental units in each group. Note that, apart from the part of the numerator between square brackets, the formula is identical to that for two-sample t-tests as discussed in Chapter 3.

n = (Zα + Zβ)² [π1(1–π1) + π2(1–π2)] / δ²

where:

Zα is the standard Normal critical value for a significance level of α. For a two-tailed test this is 1.96; for a one-tailed test it is 1.645.

3 In clinical trials where there is strong prior evidence that one treatment is superior to the other, it may be decided on ethical grounds that more patients should be allocated to this treatment.


Zβ is the critical value corresponding to the Type II error probability β (the probability of accepting the null hypothesis when, in fact, it should be rejected). One-sided critical values are used here, since we want a power of (1–β) irrespective of whether the difference is positive or negative. Thus, for β=0.05, corresponding to a power of (1–β)=0.95 of rejecting Ho, Zβ=1.645, and for β=0.20, corresponding to a power of 0.80, it is Zβ=0.84.

δ = π1–π2 is the difference we are interested in detecting, with a power of (1–β).

You will have noticed (I hope!) that the specifications here correspond directly to those required for determining the sample size for two-sample t-tests, as discussed in Chapter 3. When we choose a significance level (α), we specify the probability of Type I error, i.e., the probability of deciding that π1 and π2 are different when, in fact, they are the same. When we specify β, the probability of Type II error, we specify the risk we are willing to take of failing to reject the null hypothesis when, in fact, π1 and π2 differ by our pre-specified amount δ.

Note that for continuous data we did not need to specify the population means μ1 and μ2, only the difference, δ = μ1–μ2, we wished to detect. We did, however, have to specify the (assumed) common standard deviation, σ. The standard error for the difference between two independent proportions depends on the proportions themselves (see page 9); this is why we need to specify π1 and π2 in the formula.

Example 6: Anturan Trial

Pocock [9] illustrates a discussion of sample size calculation by referring to the Anturan Reinfarction Trial (1978) [10]. This was a randomised double-blind trial comparing the prescription of anturan versus a placebo to patients who have had a myocardial infarction4. Survival for one year after start of treatment was used as the simple criterion of success. They chose π1=0.95 and π2=0.90, α=0.05 (two-sided test) and β=0.10. The required sample size, based on these assumptions, is therefore:

n = (Zα + Zβ)² [π1(1–π1) + π2(1–π2)] / δ²

4 (Wikipedia: "[MI] more commonly known as a heart attack, is a disease state that occurs when the blood supply to a part of the heart is interrupted. The resulting ischemia or oxygen shortage causes damage and potential death of heart tissue. It is a medical emergency, and the leading cause of death for both men and women all over the world").


n = (1.96 + 1.28)² [0.95(1–0.95) + 0.90(1–0.90)] / (0.05)² = 578
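The formula is easy to program. The sketch below is a minimal Python version (the function name is my own); with the rounded critical values used in the text it reproduces the Anturan figure of 578 per group.

```python
from math import ceil

def sample_size_per_group(p1, p2, z_alpha=1.96, z_beta=1.28):
    """Approximate n per group for comparing two proportions:
       n = (Za + Zb)^2 [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2."""
    variance_term = p1 * (1 - p1) + p2 * (1 - p2)
    delta = p1 - p2
    return ceil((z_alpha + z_beta) ** 2 * variance_term / delta ** 2)

# Anturan trial: survival rates 0.95 vs 0.90, alpha=0.05 (two-sided), power 0.90
n_anturan = sample_size_per_group(0.95, 0.90)   # 578 per group
```

Applied to the Salk polio vaccine figures discussed later (0.00030 versus 0.00015), the same function gives roughly 210,000 per group; the exact figure depends on how many decimal places are used for Zβ.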

Pocock reports that the authors increased this figure to 750 per treatment to allow for dropouts and to increase power slightly (see Example 8, later).

Example 7: Salk Polio Vaccine Trial

Snedecor and Cochran (in an exercise in [11]) state that in planning the 1954 Salk polio vaccine trial sample size was critical, since it was unlikely that the trial could be repeated. Various estimates of sample size were made, in one of which it was assumed that the probability that an unprotected child would contract paralytic polio was 0.00030. If the vaccine was 50% effective, i.e., reduced this to 0.00015, it was desired that there would be a 90% chance of finding a significant difference using a two-tailed test with a significance level of α=0.05. Thus, inserting π1=0.00030, π2=0.00015, Zα=1.96 and Zβ=1.28 into the formula:

n = (Zα + Zβ)² [π1(1–π1) + π2(1–π2)] / δ²

we find:

n = (1.96 + 1.28)² [0.00030(1–0.00030) + 0.00015(1–0.00015)] / (0.00015)² = 210,096

which means that about 200,000 children would be required for each of the two study groups, vaccinated and control.5

Power Calculations

The sample size formula can be inverted to give an estimate of the power that will result from specifying π1, π2 and the sample size n. This is useful,

5 A note from Wikipedia: Salk's vaccine was used in a test called the Francis Field Trial, led by Thomas Francis; the largest medical experiment in history. The test began with some 4,000 children at Franklin Sherman Elementary School in McLean, Virginia, and would eventually involve 1.8 million children, in 44 states from Maine to California. By the conclusion of the study, roughly 440,000 received one or more injections of the vaccine, about 210,000 children received a placebo, consisting of harmless culture media, and 1.2 million children received no vaccination and served as a control group, who would then be observed to see if any contracted polio. The results of the field trial were announced April 12, 1955 (the tenth anniversary of the death of Franklin Roosevelt). The Salk vaccine had been 60-70% effective against PV1 (poliovirus type 1), over 90% effective against PV2 and PV3, and 94% effective against the development of bulbar polio.


especially where the calculated sample size is considered impractical and we wish to assess the implications of using a smaller sample. The inverted formula is:

Zβ = δ√n / √[π1(1–π1) + π2(1–π2)] – Zα

where δ = π1–π2 is the difference we are interested in detecting, with an adequate power. Since δ is not squared (as it is in the sample size formula), π1 should be the larger of the two proportions. The estimate of the power is:

P(Z < Zβ)

which may be obtained from a standard Normal table.

Example 8: Anturan Trial - revisited

Example 6 referred to a randomised double-blind trial comparing the prescription of anturan versus a placebo to patients who have had a myocardial infarction.

The authors chose π1=0.95 and π2=0.90, α=0.05 and β=0.10, corresponding to a power of 0.90. The required sample size, based on these assumptions, was 578 in each group. Pocock reported that the authors increased this figure to 750 per treatment to allow for dropouts and to increase power slightly. We can use the inverted formula to investigate the implications of the sample size increase. The inverted formula is:

Zβ = δ√n / √[π1(1–π1) + π2(1–π2)] – Zα

which gives:

Zβ = 0.05√750 / √[0.95(1–0.95) + 0.90(1–0.90)] – 1.96 = 1.76

and a power of:

P(Z < Zβ) = P(Z < 1.76) = 0.96.

So, by increasing the sample size from 578 per group to 750 per group, they increased the power from 0.90 to 0.96. The increase of 30% in the sample size was probably decided upon more to allow for dropouts than to increase the power.
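The inversion can be sketched in code as well (standard library only; function names are my own, with the Normal CDF built from the error function). Note that with these rounded inputs the intermediate Zβ comes out near 1.73 rather than the 1.76 quoted in the text, a rounding difference that does not change the power to two decimal places.

```python
from math import sqrt, erf

def normal_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_for_n(p1, p2, n, z_alpha=1.96):
    """Power of the two-sided Z-test comparing two proportions,
       for n subjects per group:
       Zb = delta*sqrt(n)/sqrt(p1(1-p1) + p2(1-p2)) - Za; power = P(Z < Zb)."""
    delta = abs(p1 - p2)
    z_beta = delta * sqrt(n) / sqrt(p1 * (1 - p1) + p2 * (1 - p2)) - z_alpha
    return normal_cdf(z_beta)

power_750 = power_for_n(0.95, 0.90, 750)   # about 0.96
```

The same function reproduces the Salk calculations later in this section: with the polio assumptions and n=50,000 it gives a power of about 0.35.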


Example 9: Salk Polio Vaccine Trial - revisited

The Salk vaccine trial clearly needed unusually high numbers of subjects because both the individual proportions being compared, and their difference, were very small. Suppose that instead of carrying out the sample size calculation, as described above, the trial planners had picked out of the air an obviously large number of subjects, say (i) 50,000 or (ii) 100,000 for each group. What would have been the implications for the likely success of the trial, if the initial assumptions, that the probability that an unprotected child would contract paralytic polio was 0.00030 and that the vaccine might reduce this to 0.00015, were correct?

The expression for Zβ is:

Zβ = δ√n / √[π1(1–π1) + π2(1–π2)] – Zα

which for n=50,000 gives:

Zβ = 0.00015√50000 / √[0.00030(1–0.00030) + 0.00015(1–0.00015)] – 1.96 = –0.379

and a power of:

P(Z < Zβ) = P(Z < –0.379) = 0.352.

Thus, with a sample size of only 50,000, there would be a 65% chance of Type II error, i.e., of failing to decide that the vaccine works when, in fact, it reduces the probability of contracting polio by 50%. The equivalent calculation for n=100,000 gives a power of 0.61, i.e., a 39% chance of failing to detect the effectiveness of the vaccine. These calculations show that even a very large sample size can be too small!

Discussion

The sample size formula on which our discussions have been based is a large-sample approximation. If the numbers it produces are small, it may be worthwhile looking at other sources of information on sample size determination (both Machin and Campbell [8] and Fleiss [12] provide extensive tables and references to other work in this area). While different statistical approaches will produce somewhat different sample sizes for a study, the technical differences are generally much less important than two practical considerations: the specification of the entries to the formula and the availability of resources. The formula requires us to specify π1 and π2 – if


we knew these there would be no need for a study! Manipulation of the sample size formula shows that the resulting number n, for each group, will be sensitive to the guesses used for these quantities and, in particular, to their difference δ = π1–π2, which is squared in the denominator.

The choices of the significance level, α, and the power, (1–β), are also quite arbitrary; while we might start out optimistically requiring a high power for a small δ, it is quite likely that in many cases the resulting sample sizes will drive us towards more 'realistic' (or just plain pragmatic!) values. The lack of available resources is, in practice, likely to be a major consideration, overshadowing the technical statistical considerations. This, however, does not diminish the value of the statistical analysis – on the contrary. The statistical analysis provides concrete answers to 'what if?' questions and helps clarify the implications of the guesswork necessarily involved in planning a study where vital information is either absent or, at best, vague. In the extreme, when the power calculation shows that the available resources are insufficient to provide a reasonable chance of a successful study, then, in general, the study should be abandoned. This would, very often, be mandatory for clinical trials, on ethical grounds.

Exercises

6.2.4 For the Timolol study, suppose that the researchers had guessed that the control group

would produce an angina-free rate of 10% and, if the drug were successful, that this would increase to (i) 20% or (ii) 25%. They intended carrying out a two-tailed test using a significance level of 0.05. If they required power values of (i) 90% or (ii) 95%, use the sample size formula to calculate the required sample sizes for all four sets of assumptions.

6.3 Analysis of Contingency Tables

In Section 6.2 we discussed the analysis of comparative studies involving two groups where the responses were classified into one of two categories. We now turn our attention to studies which may involve more than two groups and where the responses may be classified into more than two categories. The data resulting from such studies are usually presented in tabular form, with the columns (say) representing the different groups and the rows the different response categories. Such tables are often referred to as 'contingency tables' – one characteristic is considered as 'contingent on' or 'dependent on' a second.

Chi-square Test

The analysis of the resulting data involves the use of (for us) a new statistical test – the chi-square test. Before plunging into the analysis of larger tables, the chi-square test will be introduced by using it to re-analyse an old friend – the Timolol data!


Example 9: The Timolol Study - revisited

Patients were assigned to receive a daily dosage of either Timolol or a Placebo for 28 weeks. The numbers of patients who did or did not become completely free of angina attacks (Angina-Free) are shown again in Table 6.10.

                   Timolol  Placebo  Totals
  Angina-Free           44       19      63
  Not Angina-Free      116      128     244
  Totals               160      147     307

Table 6.10: The Timolol Study Results

Suppose we ask the question: 'If Timolol has no greater effect than the Placebo, how many of the treatment group would be expected to become Angina-Free?' Since 63/307 in all become Angina-Free, and this fraction should apply equally to both the treatment and placebo groups if there is no differential effect, we estimate the expected6 number of Angina-Free patients in the Timolol group as:

E11 = (63/307)(160) = 32.83

Following exactly the same logic, we estimate the expected numbers in the other cells of the 2x2 table as shown below.

  E11 = (63/307)(160) = 32.83      E12 = (63/307)(147) = 30.17       63
  E21 = (244/307)(160) = 127.17    E22 = (244/307)(147) = 116.83    244
             160                              147                   307

Table 6.11: Calculating the expected values

Note that the expected value in cell (i, j) is generated by:

Eij = (Row Total x Column Total) / Grand Total

6 An 'expected value' is an average and, as such, may assume non-integer values even though the raw data are counts. The expected values should not be rounded to integer values when carrying out the calculations.
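The recipe for the expected values amounts to multiplying row and column totals and dividing by the grand total; a minimal sketch in Python (function and variable names are my own):

```python
def expected_counts(table):
    """Expected cell counts under the hypothesis of no association:
       E[i][j] = row_total[i] * column_total[j] / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Timolol data (Table 6.10)
observed = [[44, 19],
            [116, 128]]
expected = expected_counts(observed)   # E11 is about 32.83
```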


We now have a set of 'observed values' and a set of 'expected values'. If the assumption of no differential treatment effect is valid, then the two sets of values should differ only by chance variation. A measure of the divergence is:

X2 = Σ (Oij–Eij)²/Eij, summed over all cells.

Since a difference of (Oij–Eij)=5 would be much more important if we expected Eij=10 than if we expected Eij=100, the divisor scales each squared deviation to allow for the size of the expected value. The deviations are squared, as otherwise they would sum to zero, since each row and column contains a positive and negative deviation (Oij–Eij) of equal magnitude (see Table 6.12).

Our divergence measure is a test statistic which has a chi-square sampling distribution when the assumption under which the expected values were calculated is true. The shape of the sampling distribution depends on a parameter called the 'degrees of freedom'. This is often labelled ν (the Greek letter nu); here ν=1. This value follows from the fact that for 2x2 tables, once one expected value is calculated, all the others can be found by subtracting this value from the marginal values of the table. Thus, the expected value in cell (1,2) is 63–32.83 = 30.17, and so on. Equivalently, once one deviation (Oij–Eij) is found, the rest follow from the fact that the deviations sum to zero across the rows and down the columns, see Table 6.12.

The Statistical Question

If the proportion becoming Angina-Free is π1 and the proportion becoming not-Angina-Free is π2 (note π1+π2=1) in the population of patients receiving the placebo, then the null hypothesis is that the pair [π1, π2] applies also to the population of patients receiving Timolol. The pair [π1, π2] is the 'relative frequency distribution': the null hypothesis is that this applies to both groups, equally. The individual cell calculations are shown in Table 6.12. These give the test statistic as:

X2 = Σ (Oij–Eij)²/Eij = 9.98


  (O11–E11)²/E11 = (44–32.83)²/32.83 = (11.17)²/32.83 = 3.797
  (O12–E12)²/E12 = (19–30.17)²/30.17 = (–11.17)²/30.17 = 4.133
  (O21–E21)²/E21 = (116–127.17)²/127.17 = (–11.17)²/127.17 = 0.980
  (O22–E22)²/E22 = (128–116.83)²/116.83 = (11.17)²/116.83 = 1.067

Table 6.12: Calculating the contributions to X2

If we choose a significance level α=0.05 then the critical value is χ2c=3.84, as shown in Figure 6.6 (see Table A7, page 68). Note that the test is inherently one-tailed, as large values of X2 point to large divergences between observed and expected values, thus requiring rejection of the null hypothesis under which the expected values were calculated.

Figure 6.6: The chi-square distribution with ν=1

Since the test statistic X2=9.98 greatly exceeds the critical value of χ2c=3.84, the null hypothesis is rejected. The test result suggests a systematic difference between the observed and expected values. This points to a difference between the two treatment groups in terms of the long-run proportions becoming Angina-Free. Of course, we already know this from our previous analysis of the sample proportions.

Typical Computer Output

Table 6.13 shows a chi-square analysis of the Timolol data, produced by Minitab.


Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             Timolol  Placebo  Total
A-Free            44       19     63
               32.83    30.17
               3.797    4.133
Not-A-Free       116      128    244
              127.17   116.83
               0.980    1.067
Total            160      147    307

Chi-Sq = 9.978, DF = 1, P-Value = 0.002

Table 6.13: Minitab chi-square output for the Timolol data
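The full chi-square calculation of Table 6.13 can be sketched in a few lines of Python (standard library only; the p-value uses the fact that a chi-square variable with one degree of freedom is the square of a standard Normal, so its tail area is erfc(sqrt(x/2))).

```python
from math import sqrt, erfc

def chi_square_2x2(table):
    """Chi-square statistic (no continuity correction) and its p-value
       (1 degree of freedom) for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / grand_total
            x2 += (table[i][j] - e) ** 2 / e
    p_value = erfc(sqrt(x2 / 2))   # tail area for 1 degree of freedom
    return x2, p_value

x2, p = chi_square_2x2([[44, 19], [116, 128]])   # about 9.98 and 0.002
```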

The computer output gives the observed and expected values and the contribution to X2 for each of the four cells of the table. The individual contributions are summed to give the test statistic, X2=9.978. This is stated to have a p-value of 0.002, which means that if the null hypothesis were true, the probability would be only 0.002 that we would have obtained, by chance, an X2 statistic equal to or greater than X2=9.978. This means that our test statistic is unusually large (see Figure 6.7), which in turn means that the divergence between the observed and expected values is also large. Such a p-value would lead to rejection of the null hypothesis.

Figure 6.7: The p-value is the area in the right-hand tail

Chi-square or Z-test?

Table 6.14 shows the Minitab analysis for the comparison of the two proportions of patients who become Angina-Free.


Test and CI for Two Proportions

Sample     X    N  Sample p
Timolol   44  160  0.275000
Placebo   19  147  0.129252

Difference = p (1) - p (2)
Estimate for difference: 0.145748
95% CI for difference: (0.0578398, 0.233657)
Test for difference = 0 (vs not = 0): Z = 3.16  P-Value = 0.002

Table 6.14: Comparing proportions using Minitab
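As a numerical cross-check of Tables 6.13 and 6.14, the Z statistic can be recomputed with its pooled standard error; its square recovers the chi-square statistic (a sketch; variable names are my own).

```python
from math import sqrt

# Timolol data: Angina-Free counts out of group totals
x1, n1 = 44, 160    # Timolol
x2, n2 = 19, 147    # Placebo
p1, p2 = x1 / n1, x2 / n2

# Z-test with the pooled proportion in the standard error
p_pooled = (x1 + x2) / (n1 + n2)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se          # about 3.16

z_squared = z ** 2          # about 9.98, the chi-square statistic
```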

The Z-test for the difference between the population proportions gives a test statistic of Z=3.16; when this is squared it gives (apart from the effects of rounding) a value (9.98) which is the same as the X2 statistic for the chi-square test: the two tests are, therefore, equivalent. The Z value has a standard Normal distribution; when the standard Normal is squared it gives a chi-square distribution with one degree of freedom, which is the sampling distribution of the X2 statistic. The approach using proportions is (to me at least!) more appealing in that it focuses attention on the quantities of direct interest (the proportions becoming Angina-Free). It also leads naturally to calculating a confidence interval to quantify the likely population difference.

A Different Data Collection Design for a 2x2 table

The data in Table 6.10 derive from two fixed study groups, n1=160 and n2=147; this is sometimes referred to as fixing the column margin of the table by design. Not all 2x2 tables arise in this way; often only the grand total is fixed. In surveys, for example, the total sample size may be fixed and the respondents are subsequently cross-classified by their characteristics (e.g., sex) and outcomes of interest (e.g., voting intentions). The following example is different again – the researchers collected data for a given time period and then cross-classified the results, without pre-specifying any sample size; the sample size was itself a random quantity.

Example 10: Otter Kills in Eastern Germany

Hauer, Ansorge and Zinke [13] report data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year time period. Note that the numbers were recovered from summaries in the original paper; the total sample size here is 1023, but the original paper refers to a total sample size of 1027. Table 6.15 shows the N=1023 otters cross-classified into a 2x2 table with age-group and sex as the row and column variables, respectively. The


authors of the paper described otters of age four or more years as 'grown adults'; I have described otters younger than this as 'young'.

            Male  Female  Totals
  Young      286     153     439
  Adults     312     272     584
  Totals     598     425    1023

Table 6.15: Otters cross-classified by age and sex

In analysing these data we can treat the numbers of males and females as fixed (this is referred to as 'conditioning on the margin'). We then ask if the proportions of young versus adult deaths are the same for both sexes. This is a natural way of interrogating the data in that it addresses directly the research question. You will, however, find that many books and journal articles pose the question to be answered by the data differently, when, as here, neither the row nor the column margin is fixed by design. Often the question is expressed in the form: 'is the row classification independent of the column classification?' In the current case this would mean that the probability that an otter is young when it is killed does not depend on its sex. If an otter is selected at random from the total sample of N=1023, then the probability that it will be male is P[male]=598/1023 and the probability that it will be young is P[young]=439/1023. If two characteristics are independent of each other (in the sense that the occurrence of one does not affect the probable occurrence of the other) then the probability that they both occur simultaneously is given by the product of their separate probabilities. So, if independence holds:

P[young and male] = P[young] x P[male]

These sample probabilities are, of course, only estimates of corresponding population values. The estimate of the expected7 number of young males is given by:

N x P[young and male] = N x P[young] x P[male] = 1023 x (439/1023) x (598/1023)

7 If 100 fair coins are tossed, we expect NxP[head] = 100[0.5] = 50 heads. By this we do not mean that if the experiment were carried out once, the result would be 50 heads. Rather, that in a very long sequence of such experiments the long-run average number of heads would be 50. Thus, the term 'expected value' is a technical statistical term meaning 'long-run average'.


= (439 x 598)/1023 = 256.62

which is

E11 = (Row Total x Column Total) / Grand Total

as before. Thus, the same set of expected values (and, hence, the same test) is obtained irrespective of whether we express our hypothesis in terms of the equality of population proportions for the two groups (as for the Timolol study) or in terms of row-column independence, as here. The estimated expected values and the contributions to the X2 statistic for all cells are shown in the Minitab analysis in Table 6.16. The sum of all the cell contributions gives an overall test statistic X2=14.183, which is highly statistically significant when compared to a critical value of 3.84 (again 'degrees of freedom'=1). The test statistic and the corresponding p-value (p < 0.0005, reported as 0.000) suggest that the two classifications are not independent of each other: the observed frequencies deviate too much from those calculated on the independence assumption. Chi-square tests where the null hypothesis is expressed in terms of the equality of two or more proportions (as in the Timolol example) are often referred to as 'tests for homogeneity of proportions', while those for which the null hypothesis is posed in the manner just discussed are called 'tests of independence' or 'tests of association', i.e., they address the question: are frequencies for one classification 'independent of' or 'associated with' those of the other?

Chi-Square Test: Male, Female

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

          Male  Female  Total
Young      286     153    439
        256.62  182.38
         3.364   4.733
Adult      312     272    584
        341.38  242.62
         2.529   3.558
Total      598     425   1023

Chi-Sq = 14.183, DF = 1, P-Value = 0.000

Table 6.16: Minitab chi-square analysis of the otter data
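As an illustration (not part of the original Minitab output; the variable names are ours), the expected-value and chi-square calculations behind Table 6.16 can be sketched in a few lines of Python, using Eij = (row total x column total)/grand total:

```python
# Chi-square computation for the otter 2x2 table (Table 6.16).
observed = [[286, 153],   # Young: Male, Female
            [312, 272]]   # Adult: Male, Female

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Expected count for each cell under the null hypothesis
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# X2 = sum over all cells of (O - E)^2 / E
x2 = sum((o - e) ** 2 / e
         for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row))

print(round(expected[0][0], 2))  # 256.62, as in the text
print(round(x2, 3))              # 14.183, matching Minitab
```

This reproduces both the expected count for young males (256.62) and the overall statistic X² = 14.183.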


Having carried out the chi-square test for the otter data a natural practical question is 'by how much does the long-run proportion of young male deaths exceed that for young females?' In addressing this question we compare the sample proportions; in doing so, we regard the column totals as fixed in making our comparison, which suggests that the chi-square null hypothesis would have been more appropriately expressed in 'homogeneity of proportions' terms.

Test and CI for Two Proportions

Sample      X    N   Sample p
Males     286  598   0.478261
Females   153  425   0.360000

Difference = p (1) - p (2)
Estimate for difference: 0.118261
95% CI for difference: (0.0575530, 0.178969)
Test for difference = 0 (vs not = 0): Z = 3.77  P-Value = 0.000

Table 6.17: Minitab comparison of sample proportions for otters
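The two-proportion comparison of Table 6.17 can be sketched as follows (our illustration, not the manual's; note that the test uses the pooled proportion, which is why Z² equals the chi-square statistic, while the confidence interval uses the unpooled standard error):

```python
from math import sqrt

x1, n1 = 286, 598   # young males out of all male deaths
x2_, n2 = 153, 425  # young females out of all female deaths
p1, p2 = x1 / n1, x2_ / n2
diff = p1 - p2

# Test: pooled proportion, so Z^2 reproduces the chi-square X2
pooled = (x1 + x2_) / (n1 + n2)
se_test = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = diff / se_test

# 95% CI: unpooled standard error
se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.96 * se_ci, diff + 1.96 * se_ci

print(round(z, 2))                 # 3.77, as in Table 6.17
print(round(lo, 3), round(hi, 3))  # approximately (0.058, 0.179)
```

Squaring the Z-value (3.77² ≈ 14.2) recovers the chi-square statistic of Table 6.16, apart from rounding.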

Note that the square of the Z-value of Table 6.17 is identical to the X² value for the chi-square analysis of Table 6.16 (3.77² = 14.183, apart from rounding). The confidence interval suggests that the long-run percentage of young male deaths is between 6 and 18 percentage points higher than the corresponding percentage for female otters. The odds-ratio is 1.63 – the odds of a male being young when killed are about 60% higher than those for a female. The original paper suggests reasons for such a difference. These include the larger territorial size and higher activity levels, particularly of younger males searching for a free territory, leading to increased exposure to road traffic (69% of the deaths were attributed to road kills).

Exercises

6.2.5 Re-analyse the Slate Workers Cohort Study data of Example 4, using a chi-square test.

6.2.6 Re-analyse the Oesophageal Cancer Case-Control Study of Example 5, using a chi-square test.

Larger Contingency Tables

No new ideas arise when we turn our attention to contingency tables that are larger than the 2x2 tables just discussed – the amount of calculation simply increases. We consider three examples from different areas: market research, literary style analysis and genetics.


Example 11: Market Research Study

Stuart [14, 15] reports on a market research study in which interviewees were asked whether they had heard of a product, had heard of it but not bought it, or had bought it at least once; these classifications measure 'the degree of market penetration' of the product. The study results, for three regions, are shown in Table 6.18.

Region

A B C Totals

Bought 109 49 168 326

Heard-no-buy 55 56 78 189

Never heard 36 45 54 135

Totals 200 150 300 650

Table 6.18: Market penetration study

The sample sizes were clearly fixed for the three regions; why they were different is not reported – the regions may, for example, be of different geographical sizes. Table 6.19 shows the relative frequencies for the three regions; the proportions are expressed to three decimal places.

Region

A B C

Bought 0.545 0.327 0.560

Heard-no-buy 0.275 0.373 0.260

Never heard 0.180 0.300 0.180

Table 6.19: Relative frequency distributions of response by region

The same information is shown in Table 6.20, but this time the numbers are expressed as column percentages and rounded to whole numbers (note that region A sums to 101). The columns are also re-arranged to put A beside C to emphasise their similarity.


Region

A C B

Bought 55 56 33

Heard-no-buy 28 26 37

Never heard 18 18 30

Table 6.20: The percentage responses by region (rounded)

I think you will agree that Table 6.20 is much easier to read and interpret: regions A and C are similar in response characteristics and different from B. Clearly organised tables are important when presenting information to others – see Ehrenberg [16] or Stuart [15] for a discussion of the principles of good table making.

The Statistical Model

For the chi-square analysis of the Timolol study our model was that a fraction (π) of the population of patients receiving Timolol would become free of angina attacks, and a fraction (1 – π) would not. The question of interest was whether the same fractions [π, 1 – π] applied to both the Timolol and the placebo groups. Here, we have three groups and three response categories. Our model is that the observed sets of three response frequencies are generated by the same underlying population response relative frequency distribution [π1, π2, π3].

Analysis

The observed relative frequencies from the three regions are clearly not the same. The first statistical question is whether the differences between regions are attributable to chance sampling variation or not. If the answer is yes, there is little point in making further between-region comparisons. This question will be addressed using a chi-square test which is a straightforward extension of the test carried out on the 2x2 tables earlier. Differences between regions will then be quantified using confidence intervals for differences between proportions – these are exactly the same intervals as those discussed earlier for the Timolol and Otter data. Our null hypothesis is that there is a common set of population relative frequencies [π1, π2, π3]. On this assumption, Table 6.18 suggests that a natural estimate of the proportion who have bought the product is 326/650, and this should apply to all three regions. Then the expected number of respondents in region A who have bought the product is:

E11 = (326/650)(200) = 100.31

As before this estimate takes the form:

Eij = (Row Total x Column Total)/Grand Total

Following precisely the same logic, Table 6.21 shows the calculation of the expected values for all the cells.

                              A                                B                                C
Bought             E11 = (326/650)(200) = 100.31    E12 = (326/650)(150) = 75.23    E13 = (326/650)(300) = 150.46
Heard-not-bought   E21 = (189/650)(200) = 58.15     E22 = (189/650)(150) = 43.62    E23 = (189/650)(300) = 87.23
Never heard        E31 = (135/650)(200) = 41.54     E32 = (135/650)(150) = 31.15    E33 = (135/650)(300) = 62.31

Table 6.21: Calculating the expected values

Table 6.22 shows the divergences between the observed and expected values [Oij–Eij]. Note that the divergences sum to zero across all rows and down all columns (our method of calculating the expected values forces the expected table margins to be the same as the observed table margins, which means the differences are zero).

                      A        B        C
Bought             8.69   -26.23    17.54
Heard-not-bought  -3.15    12.38    -9.23
Never heard       -5.54    13.85    -8.31

Table 6.22: Differences between observed and expected values

Because of these restrictions on the table margins, once four cells are filled (either expected values or differences between observed and expected) the other five are automatically determined – the deviations are said, therefore, to have four degrees of freedom. This means that when we calculate our X² test statistic, we will compare it to a chi-square distribution with four degrees of freedom.


Figure 6.8: Chi-square distribution with 4 degrees of freedom

Chi-Square Test: A, B, C

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                   A        B        C   Total
Bought           109       49      168     326
              100.31    75.23   150.46
               0.753    9.146    2.044
Heard-no-buy      55       56       78     189
               58.15    43.62    87.23
               0.171    3.517    0.977
Never-heard       36       45       54     135
               41.54    31.15    62.31
               0.738    6.154    1.108
Total            200      150      300     650

Chi-Sq = 24.608, DF = 4, P-Value = 0.000

Table 6.23: Minitab chi-square analysis of market research data
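The r x c generalisation of the 2x2 calculation can be sketched in Python (our illustration; the function name is arbitrary), using Eij = (row total x column total)/grand total and df = (rows – 1) x (columns – 1):

```python
# General r x c chi-square test, applied to the market research data
# of Table 6.18.
def chi_square(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    grand = sum(rows)
    # Sum (O - E)^2 / E over every cell
    x2 = sum((table[i][j] - rows[i] * cols[j] / grand) ** 2
             / (rows[i] * cols[j] / grand)
             for i in range(len(rows)) for j in range(len(cols)))
    df = (len(rows) - 1) * (len(cols) - 1)
    return x2, df

market = [[109, 49, 168],   # Bought:        A, B, C
          [55, 56, 78],     # Heard-no-buy:  A, B, C
          [36, 45, 54]]     # Never heard:   A, B, C
x2, df = chi_square(market)
print(round(x2, 3), df)  # 24.608 on 4 degrees of freedom, as in Table 6.23
```

The same function reproduces the 2x2 analyses earlier in the chapter if given a two-row, two-column table.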

The Minitab analysis of Table 6.23 shows that the individual cell contributions sum to a test statistic of:

X² = Σ (Oij – Eij)²/Eij = 24.6   (summed over all cells)

which greatly exceeds the critical value of 9.49 (shown in Figure 6.8). Clearly, the assumption of a common underlying distribution of responses must be rejected.


A chi-square analysis of regions A and C supports the conclusion of common underlying rates, which was suggested by Table 6.20. The p-value of 0.928 of Table 6.24 shows the observed and expected values to be very close to each other.

Chi-Square Test: A, C

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                   A        C   Total
Bought           109      168     277
              110.80   166.20
               0.029    0.019
Heard-no-buy      55       78     133
               53.20    79.80
               0.061    0.041
Never heard       36       54      90
               36.00    54.00
               0.000    0.000
Total            200      300     500

Chi-Sq = 0.150, DF = 2, P-Value = 0.928

Table 6.24: Chi-square comparison of responses for regions A and C

Table 6.25 shows a chi-square comparison of the pooled values for regions A and C versus those of region B: region B clearly behaves differently. Note that the statistical tests shown in Tables 6.24 and 6.25 have a different status from that in Table 6.23. The tests carried out in Tables 6.24 and 6.25 were prompted by examining the data. It is axiomatic in scientific investigation that you cannot form an hypothesis on the basis of a set of data and test the hypothesis using the same dataset. These tests may be regarded, therefore, as crude measures in an exploratory analysis; we would treat the results with a great deal of caution if the tests were only marginally statistically significant. Given the marked differences in responses, it is possible (even likely?) that the market researchers would have expected such results and might have been in a position to formulate the hypotheses underlying the tests of Tables 6.24 and 6.25 a priori, in which case the tests would have their usual status. It might, for example, be the case that B is a relatively new market while A and C are mature markets; Stuart does not provide further information which would throw light on such questions.


Chi-Square Test: A+C, B

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                 A+C        B   Total
Bought           277       49     326
              250.77    75.23
               2.744    9.146
Heard-no-buy     133       56     189
              145.38    43.62
               1.055    3.517
Never heard       90       45     135
              103.85    31.15
               1.846    6.154
Total            500      150     650

Chi-Sq = 24.461, DF = 2, P-Value = 0.000

Table 6.25: Chi-square comparison of pooled A and C versus B

Quantitative statements regarding differences can be supported by confidence intervals for the population differences. For example, Table 6.26 shows a comparison of the proportions of consumers who have bought the product in region B versus the combined region A+C. Expressed in terms of percentages, the point estimate for the difference is 23%, but the confidence interval suggests that the percentage of consumers in regions A and C who bought the product is between 14 and 31 percentage points higher than the corresponding value for region B.

Test and CI for Two Proportions

Sample    X    N   Sample p
A+C     277  500   0.554000
B        49  150   0.326667

Difference = p (1) - p (2)
Estimate for difference: 0.227333
95% CI for difference: (0.140550, 0.314117)
Test for difference = 0 (vs not = 0): Z = 4.88  P-Value = 0.000

Table 6.26: Difference in buying proportions; regions A+C versus B


Figure 6.9: A graphical representation of the study results (percentages)

Figure 6.9 presents the results of the study in graphical form. Note that the lines connecting the dots have no intrinsic meaning – they simply help us visualize the different market penetration 'profiles' corresponding to the two market segments. Contrast this picture with Table 6.18 as a way of reporting the study results. Tables of statistical analyses, such as Table 6.21, are inappropriate in presenting study results – they contain too many unnecessary technical details. If at all, they should appear in a technical appendix to a report on the market research study.

Example 12: Literary Style

Jane Austen left an unfinished novel (Sanditon) when she died. She also left a summary of the remaining chapters. The novel was finished by an imitator, who tried to match her style as closely as possible. The data below are cited by Rice [17], based on an analysis by Morton [18]. Six words were selected for analysis and their frequencies were counted in Chapters 1 and 3 of Sense and Sensibility, 1, 2 and 3 of Emma, 1 and 6 of Sanditon (described below as Sanditon-A, written by Austen) and 12 and 24 of Sanditon (described below as Sanditon-B, written by the imitator). Can we distinguish between the authors on the basis of the word counts?


         Sense+Sensibility   Emma   Sanditon-A   Sanditon-B   Totals

a 147 186 101 83 517

an 25 26 11 29 91

this 32 39 15 15 101

that 94 105 37 22 258

with 59 74 28 43 204

without 18 10 10 4 42

Totals 375 440 202 196 1213

Table 6.27: Word counts for the literary works

Statistical Model

Before beginning our analysis it is worthwhile considering the nature of the data given in the table. Note that the data generation mechanism is quite different from our previous market research example. There, there were three categories of response (bought, heard but not bought, and never heard of the product). The three relative frequencies or percentages for any region provided us with estimates of consumer behaviour. Here the row categories are arbitrarily selected words used by the author. The relative frequencies for any one novel do not have any intrinsic meaning (unlike the market research study); they are simply the relative frequencies of these particular words (relative to the full set of words included in the list) and would change if any of the words were dropped from the list or if others were added. The statistical model that underlies the chi-square analysis that follows is that there is a set of relative frequencies for these arbitrarily selected words [π1, π2, ..., π6] that characterises Austen's writing style. If this is true, then the observed frequencies (considered as a random sample from her works) should only reflect chance variation away from this stable underlying structure.

Analysis

Before comparing Austen to her imitator, it makes sense to check if the word distributions are stable within her own work. If they are not, then the basis for the comparison is undermined; we might have to ask ourselves which Jane Austen we want to compare to the imitator! The frequency distributions for the three known Austen works are given in Table 6.28 and a chi-square test is carried out. The critical value for a test using a significance level of 0.05, where there are 10 degrees of freedom, is 18.31. The test statistic of 12.3 is much less than this (with a correspondingly large p-value)


and so we do not reject the null hypothesis of a consistent underlying structure for the three works. It makes sense, therefore, to combine the counts for the three works before comparing them to the imitator. One set of frequencies, based on a total sample size of 1017, will be more stable than three sets based on smaller sample sizes. The observed values in the first column of Table 6.29 are the combined word counts.

Chi-Square Test: Sensibility, Emma, Sanditon-A

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

         Sensibility     Emma  Sanditon-A   Total
a                147      186         101     434
              160.03   187.77       86.20
               1.061    0.017       2.540
an                25       26          11      62
               22.86    26.82       12.31
               0.200    0.025       0.140
this              32       39          15      86
               31.71    37.21       17.08
               0.003    0.086       0.254
that              94      105          37     236
               87.02   102.10       46.88
               0.560    0.082       2.080
with              59       74          28     161
               59.37    69.66       31.98
               0.002    0.271       0.495
without           18       10          10      38
               14.01    16.44        7.55
               1.135    2.523       0.797
Total            375      440         202    1017

Chi-Sq = 12.271, DF = 10, P-Value = 0.267

Table 6.28: Consistency check on Austen's style

Table 6.29 contains the between author comparison. The test statistic for the between author comparison (X² = 32.8) is highly statistically significant (the critical value for a significance level of 0.05 with 5 degrees of freedom is 11.07). This indicates a divergence between the authors in the relative frequency of use of one or more of the selected words. Note the very large contributions to the overall chi-square statistic given by 'an' (13.899 for the imitator) and 'that' (9.298 for the imitator). Table 6.30 shows the relative frequencies (expressed as percentages) of the six words for the two authors.


Chi-Square Test: Austen, Imitator

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

          Austen  Imitator   Total
a            434        83     517
          433.46     83.54
           0.001     0.003
an            62        29      91
           76.30     14.70
           2.679    13.899
this          86        15     101
           84.68     16.32
           0.021     0.107
that         236        22     258
          216.31     41.69
           1.792     9.298
with         161        43     204
          171.04     32.96
           0.589     3.056
without       38         4      42
           35.21      6.79
           0.220     1.144
Total       1017       196    1213

Chi-Sq = 32.810, DF = 5, P-Value = 0.000

Table 6.29: Between author comparison
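A sketch of the between author comparison in Python (our illustration, not the manual's code) makes it easy to see which cells drive the overall statistic:

```python
# Per-cell chi-square contributions for the Austen vs imitator table.
counts = {"a": (434, 83), "an": (62, 29), "this": (86, 15),
          "that": (236, 22), "with": (161, 43), "without": (38, 4)}

austen_total = sum(a for a, _ in counts.values())    # 1017
imitator_total = sum(i for _, i in counts.values())  # 196
grand = austen_total + imitator_total                # 1213

x2 = 0.0
contributions = {}
for word, obs_pair in counts.items():
    row_total = sum(obs_pair)
    cells = []
    for obs, col_total in zip(obs_pair, (austen_total, imitator_total)):
        e = row_total * col_total / grand   # expected count for this cell
        cells.append((obs - e) ** 2 / e)
    contributions[word] = cells
    x2 += sum(cells)

print(round(x2, 1))                      # 32.8 on 5 degrees of freedom
print(round(contributions["an"][1], 1))  # 13.9, the largest single cell
```

The 'an' cell for the imitator contributes by far the most, matching the discussion of Table 6.29.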

Word Austen Imitator

a 43 42

an 6 15

this 8 8

that 23 11

with 16 22

without 4 2

Table 6.30: Relative frequencies (percentages) of word use

Relative to his uses of the other words, the imitator uses 'an' much more frequently than does Austen (15% versus 6%). Austen on the other hand uses 'that' more frequently (relative to her use of the other words) than does the imitator (23% versus 11%). The impact of these divergences can be seen in Table 6.29. For Austen, 62 instances of 'an' were observed; on the independence assumption we would have expected 76.3. For the imitator, the corresponding values are 29 and 14.7. Note that the divergences are equal in magnitude but opposite in sign, ±14.3 (as they must be, since the row sums of both the observed and expected values equal 91). The contribution to X² is greater for the imitator since the contribution is:

(Oij – Eij)²/Eij

and Eij is only 14.7 for the imitator, whereas it is 76.3 for Austen. Similar comments apply to the use of 'that'. When 'an' and 'that' are excluded from the word list, the chi-square test that compares the distributions for the remaining words is not statistically significant, as shown in Table 6.31.

Chi-Square Test: Austen, Imitator

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

          Austen  Imitator   Total
a            434        83     517
          430.23     86.77
           0.033     0.163
this          86        15     101
           84.05     16.95
           0.045     0.224
with         161        43     204
          169.76     34.24
           0.452     2.243
without       38         4      42
           34.95      7.05
           0.266     1.319
Total        719       145     864

Chi-Sq = 4.746, DF = 3, P-Value = 0.191

Table 6.31: The reduced word list comparison

In summary, the imitator appears to have matched the relative frequencies of the uses of 'a', 'this', 'with', and 'without' in Austen's style, but failed to do so with respect to 'an' and 'that'. Does this mean we will never see a BBC Sanditon series? Will Hollywood be quite so fastidious?


Example 13: Mendel's Pea Plant Data

The data consist of 529 pea plants cross-classified into a two-way table according to the shape [round, round and wrinkled, wrinkled] and colour [yellow, yellow and green, green] of their seeds. They are typical of the type of data presented in discussions of simple Mendelian theories of inheritance. What is of interest here is whether the row and column classification can or cannot be considered independent of each other: this carries implications for the underlying genetic mechanisms at work.

Colour

Yellow Y+G Green

Totals

Round 38 65 35 138

Shape R+W 60 138 67 265

Wrinkled 28 68 30 126

Totals 126 271 132 529

Table 6.32: Mendel's pea plant data

Statistical Model

In the discussion of 2x2 tables it was noted that the hypothesis of equal relative frequencies (i.e., a common response distribution [π, 1 – π]) is often expressed in terms of the independence of the row and column classifications. Where there is a natural response characteristic (e.g., age at death of otters), an analysis expressed in terms of the proportions having this characteristic in each of the two groups (sex in the case of the otters) is usually more appropriate in addressing the questions of interest. In the current example, however, neither classification factor (shape or colour) can be thought of as a response while the other drives that response in some way – here, the row and column classifications have equal status. Testing the data for row-column independence is, therefore, a natural approach in this case.

Analysis

Seed shape and colour are genetically determined. In asking if the row and column classifications of the 529 plants are independent of each other, what we are really concerned with are the underlying genetic mechanisms determining seed characteristics. Rejection of the null hypothesis of independence would suggest that the genetic mechanisms are linked in some way.


We estimate the probability that a seed will be Round as 138/529, while the probability that it will be Yellow is estimated as 126/529. If the two characteristics are independent

Chi-Square Test: Yellow, Y+G, Green

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

           Yellow     Y+G   Green   Total
Round          38      65      35     138
            32.87   70.70   34.43
            0.801   0.459   0.009
R+W            60     138      67     265
            63.12  135.76   66.12
            0.154   0.037   0.012
Wrinkled       28      68      30     126
            30.01   64.55   31.44
            0.135   0.185   0.066
Total         126     271     132     529

Chi-Sq = 1.857, DF = 4, P-Value = 0.762

Table 6.33: Chi-square test of row-column independence

then the product of these two sample estimates gives us an estimate of the probability that both characteristics are present simultaneously.

P[Round and Yellow] = P[Round] x P[Yellow]

= (138/529) x (126/529)

The estimate of the expected number of round and yellow seeds is given by:

N x P[Round and Yellow] = N x P[Round] x P[Yellow]

= 529 x (138/529) x (126/529)

= (138 x 126)/529 = 32.87

Note that the expected value is given by:


E11 = (Row Total x Column Total)/Grand Total

as before. Precisely the same logic, applied to the other cells of the table, gives a complete set of expected values, under the independence assumption (these are shown in Table 6.33). If the sample size were infinite then the observed and expected values would coincide (assuming independence). Because of the small sample size and the inevitable chance variation in biological systems, we do not expect exact correspondence, even if the two characteristics are independent. However, the X² measure of divergence should be smaller than the critical value on a chi-square distribution with 4 degrees of freedom. In general, the degrees of freedom for contingency tables are given by:

degrees of freedom = (Number of rows – 1) x (Number of columns – 1)

which, for the current dataset, is:

degrees of freedom = (3 – 1) x (3 – 1) = 4.
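As a check on the arithmetic (an illustration of ours, not part of the manual), the independence test for the pea data can be sketched in Python:

```python
# Row-column independence test for Mendel's pea data (Table 6.32):
# under independence, E11 = N x P[Round] x P[Yellow] = 138 x 126 / 529.
table = [[38, 65, 35],    # Round:    Yellow, Y+G, Green
         [60, 138, 67],   # R+W:      Yellow, Y+G, Green
         [28, 68, 30]]    # Wrinkled: Yellow, Y+G, Green

rows = [sum(r) for r in table]
cols = [sum(c) for c in zip(*table)]
n = sum(rows)

e11 = rows[0] * cols[0] / n
x2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
         for i in range(3) for j in range(3))

print(round(e11, 2))  # 32.87, as derived above
print(round(x2, 2))   # 1.86, well below the critical value 9.49 (4 df)
```

The tiny statistic confirms that the data give no reason to doubt row-column independence.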

If we choose a significance level of α = 0.05, then our critical value is 9.49. Since the value of X² in Table 6.33 is much smaller than this, there is no basis for rejecting the null hypothesis of row-column classification independence. The observed data are consistent with the independence assumption. It would appear that the genetic mechanisms that determine seed shape and colour do not affect each other.

Exercises

6.2.7 In a study involving high-altitude balloon flights near the north magnetic pole, the numbers of positive and negative electrons in cosmic ray showers were counted. Each particle was classified by charge and into one of three energy intervals (MeV stands for millions of electron Volts) as shown in the table below.

I have lost the source of these data, so no further information is available. Studies like this were the most important source of information on high-energy sub-atomic particles (particularly in the 1930s) before the construction of very high-energy particle accelerators. Cosmic rays are still an important and very active research area (see Wikipedia for a long account of the history and current research).

                        Electron Charge
Energy Interval (MeV)   Positive   Negative

50-100 9 20

101-300 32 51

301-1000 23 117

Totals 64 188

Table 6.34: Cosmic ray shower data


(a) What can we learn about the charge-energy characteristics of the two particle types from the data?

(b) Suppose it was realized, subsequent to your analysis, that the energy measurements had been faulty in such a way that electrons of up to 200 MeV could sometimes have been assigned to the lowest energy interval. Re-analyse the data in the light of this information. Do your initial conclusions still hold?

6.2.8 Box, Hunter and Hunter [19] report the data shown below for five hospitals. The frequencies represent patient status after a surgical procedure designed to improve the functioning of certain joints which have become impaired by disease. Patients were classified by post-operative status: no improvement, partial functional restoration and complete functional restoration.

Hospital

A B C D E Totals

No improvement 13 5 8 21 43 90

Partial funct. rest. 18 10 36 56 29 149

Complete funct. rest. 16 16 35 51 10 128

Totals 47 31 79 128 82 367

Table 6.35: Post-operative improvement status of patients at five hospitals

The authors describe hospital E as a 'referral hospital', which presumably means that it is a specialised centre where more difficult procedures are carried out. It is to be expected, therefore, that the post-operative patient characteristics could be different from the other hospitals. Analyse the data and present your results both in tabular and graphical form.

6.4 Goodness-of-Fit Tests

We have used the chi-square test to compare two or more empirical frequency distributions (for example, in the Market Research Study we compared the numbers of consumers falling into each of three 'market penetration' categories, for three regions), to determine if the underlying distributions could be considered the same or not. Sometimes, it is desired to assess whether or not a single frequency distribution corresponds to an expected theoretical distribution; the test is then called a 'goodness-of-fit' test. For example, we might ask if the numbers on a roulette wheel are equally likely to turn up – if not, we might conclude that the wheel is inherently biased or that the wheel operator can influence the outcome of the game.


Example 14: Random Number Generation

Random numbers are used in a variety of important problems, from solving mathematical equations that are too difficult to analyse analytically, to creating simulation models of physical systems (or computer games), to selecting random samples from lists of people for inclusion in a survey. In practice, the numbers used are, very often, 'pseudo-random', that is, they are generated by a deterministic algorithm with a view to creating a list of numbers that have the characteristics (suitably defined) of random numbers. I generated 10,000 digits, 0 – 9, using the random number generator in Minitab and obtained the results shown in Table 6.36.

Digits             0    1     2    3     4    5    6    7     8    9   Total
Observed freq.  1036  960  1059  946  1029  995  974  959  1045  997  10000

Table 6.36: A set of pseudo-random digits

Perhaps the simplest characteristic we might expect in a set of random digits is that the ten values are equally likely to appear. Does the set of frequencies I have generated suggest that the random number generator is well-behaved in this regard? Our null hypothesis is that the probability of occurrence for each digit is constant, π = 0.10. The alternative hypothesis is that not all probabilities are equal to 0.10. If the null hypothesis is true, then the expected frequency is 1000 [Nπ = 10,000(0.10)] for each digit, as shown in Table 6.37. The chi-square test statistic is calculated as before:

X² = Σ (Observed – Expected)²/Expected = 14.55   (summed over the ten digits)

Digits 0 1 2 3 4 5 6 7 8 9 Totals

Observed freq. 1036 960 1059 946 1029 995 974 959 1045 997 10000

Expected freq. 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 10000

Chi-square 1.296 1.6 3.481 2.916 0.841 0.025 0.676 1.681 2.025 0.009 14.55

Table 6.37: A chi-square goodness-of-fit test on random digits
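The goodness-of-fit calculation of Table 6.37 can be sketched in a few lines of Python (our illustration, not the manual's):

```python
# Goodness-of-fit test for the pseudo-random digits: ten categories,
# each with expected frequency N x 0.10 = 1000 under the null hypothesis.
observed = [1036, 960, 1059, 946, 1029, 995, 974, 959, 1045, 997]
n = sum(observed)             # 10000
expected = n / len(observed)  # 1000 for every digit

x2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(x2, 2))  # 14.55, below the critical value 16.9 (9 df)
```

Note that, unlike the contingency-table tests, the expected frequencies here come from an a priori hypothesis rather than from the table margins.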

The ten observed and expected values in Table 6.37 each add to 10,000 and, therefore, the set of differences [Oi – Ei] sums to zero. Accordingly, once nine expected values (or differences) are calculated, the last one is automatically determined. This means that, when the null hypothesis is true, the X² statistic


follows a chi-square distribution with 9 degrees of freedom. The critical value for a test having a significance level of 0.05 is Xc² = 16.9. Since our test statistic of X² = 14.55 is less than this, we cannot reject the null hypothesis that the digits are equally likely to occur. The area beyond X² = 14.55 in the right-hand tail of a chi-square distribution with 9 degrees of freedom is 0.10; this is the p-value for our test. The fact that it exceeds 0.05 means that we cannot reject the null hypothesis, if we have previously decided on a significance level of 0.05. The observed frequency distribution is consistent with the pseudo-random number generator producing equal relative frequencies for the ten digits.

Example 15: Lotto Numbers

Table 6.38 shows information on the frequencies of occurrence of different numbers in the Irish National Lottery (downloaded from the Lotto website on 30/7/2007). The mechanism used to select the numbers is physical rather than algorithmic (balls are picked from a large glass bubble), and it is obviously important that the numbers are equally likely to be picked. I decided to use only the first 32 (the original number of balls when the Lotto was established) as I was unclear on whether numbers after 32 had been included to the same extent in all draws (clearly 43-45 had not).

1=248 2=235 3=236 4=242 5=214

6=219 7=226 8=205 9=255 10=233

11=255 12=220 13=238 14=259 15=227

16=235 17=200 18=242 19=217 20=230

21=236 22=222 23=217 24=224 25=213

26=192 27=210 28=233 29=198 30=217

31=217 32=253 33=194 34=236 35=216

36=228 37=213 38=213 39=226 40=214

41=229 42=220 43=10 44=17 45=8

NOTE: The numbers 43, 44 and 45 were introduced on the 28th October, 2006

Table 6.38: Frequencies for the Lotto numbers

The total for the 32 observed frequencies is N = 7268. On the assumption that all numbers are equally likely (null hypothesis) then our expected value for each of the 32 numbers is E = 7268/32 = 227.125. We calculate the X² statistic as above and obtain X² = 40.2. As before, the degrees of freedom for the sampling distribution of X² (when the null hypothesis is true) will be the number of categories (K) minus 1, (K – 1) = 31. For a test with a significance level of 0.05, the critical value is Xc² = 45.0; our test statistic is not statistically significant – the corresponding p-value is 0.12. We cannot reject the null hypothesis of equally likely outcomes – the test does not provide any evidence of bias in the selection of Lotto numbers.
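The Lotto calculation is the same test with 32 categories; it can be sketched as follows (our illustration, using the frequencies of Table 6.38 for numbers 1-32):

```python
# Goodness-of-fit test for the first 32 Lotto numbers: N = 7268 draws,
# expected count E = 7268/32 = 227.125 for each number.
freqs = [248, 235, 236, 242, 214, 219, 226, 205, 255, 233,
         255, 220, 238, 259, 227, 235, 200, 242, 217, 230,
         236, 222, 217, 224, 213, 192, 210, 233, 198, 217,
         217, 253]
n = sum(freqs)      # 7268
e = n / len(freqs)  # 227.125

x2 = sum((o - e) ** 2 / e for o in freqs)
print(round(x2, 1))  # 40.2, below the critical value 45.0 (31 df)
```

With 31 degrees of freedom the statistic is not significant, matching the conclusion in the text.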


Example 16: Mendelian Ratios

Samuels [1] reports data from a breeding study [20] in which white chickens with small combs were mated and produced 190 offspring, which were classified as shown in Table 6.39.

Feathers/Comb size   Observed Freq.

White, small 111

White, large 37

Dark, small 34

Dark, large 8

Total 190

Table 6.39: Breeding study results

This table is a typical example of Mendelian genetics data, where there are two characteristics, here Feather Colour (white is dominant) and Comb Size (small is dominant). The parents here have one dominant and one recessive gene for each characteristic; their offspring are expected to be born in the ratios ¾ for the appearance (phenotype) of the dominant characteristic, and ¼ for the recessive characteristic. How do these ratios arise? Mendelian theory suggests the following mechanism. The process begins with parents with pure genetic characteristics: both genes for the characteristic of interest are the same – either both dominant, D, or both recessive, r. When two such parents produce offspring, the genetic structure of the offspring is necessarily hybrid, D-r, as each parent contributes one gene. When two first generation offspring breed, the second generation appear as shown in Figure 6.10.


Figure 6.10: Mendelian mechanism for second generation ratios

If the two characteristics are independently determined then we expect the combined characteristics to appear with probabilities 9/16, 3/16, 3/16, 1/16, that is, in the ratios 9:3:3:1 as shown in Table 6.40.

           Small:3   Large:1
White:3       9         3
Dark:1        3         1

Table 6.40: The expected ratios, given independence

Our null hypothesis is that the Mendelian ratios 9:3:3:1 describe the underlying frequency structure, i.e., the two characteristics are independently determined. A statistical test essentially asks if the observed frequencies differ from those predicted by the probabilities 9/16:3/16:3/16:1/16 by more than would be expected from chance biological variation. Table 6.41 shows the observed and expected values and the contributions to the overall chi-square test statistic X² = 1.55 for the four categories. The critical value for a test with a significance level of 0.05 is X²c = 7.8, for a chi-square distribution with 3 degrees of freedom [DF = No. categories – 1]. The test is non-significant; the corresponding p-value is 0.67. The empirical data are consistent with the classic Mendelian theory.


Feathers/Comb size   Ratios   Obs. Freq.   Expect. Freq.   Chi-square
White, small          9/16       111         106.875          0.159
White, large          3/16        37          35.625          0.053
Dark, small           3/16        34          35.625          0.074
Dark, large           1/16         8          11.875          1.264
Total                 1.00       190         190              1.551

Table 6.41: A chi-square analysis of the chicken breeding data
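The arithmetic in Table 6.41 can be checked directly; the counts and the 9:3:3:1 ratios below are those given in the table, and the calculation is just a numerical check, not part of the original analysis.

```python
observed = [111, 37, 34, 8]          # white/small, white/large, dark/small, dark/large
ratios = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
n = sum(observed)                    # 190 offspring
expected = [r * n for r in ratios]   # 106.875, 35.625, 35.625, 11.875
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(x2, 2))                  # 1.55, as in the table
```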

Example 17: Poisson Distribution – α-particles

The Poisson distribution is a probability model for rare independent events, where the rate of occurrence (λ) is constant in either space or time, and events occur singly. For example, suppose a particular complication in the delivery of babies occurs at a rate of λ = 3 per month in a large maternity hospital. We can calculate the probabilities for given numbers of such events8 (x) in any one month using the formula:

P(x) = λ^x e^(–λ)/x!        x = 0, 1, 2, 3, 4 …

thus

P(0) = e^(–3) = 0.0498

P(1) = (3/1)e^(–3) = 0.1494

P(2) = (3²/(2·1))e^(–3) = 0.2240

P(3) = (3³/(3·2·1))e^(–3) = 0.2240

P(4) = (3⁴/(4·3·2·1))e^(–3) = 0.1680
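These probabilities can be reproduced with a short function; this is simply a check on the arithmetic, applying the Poisson formula above with λ = 3.

```python
import math

def poisson_pmf(x, lam):
    """P(x) = lam^x * e^(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)

probs = [poisson_pmf(x, 3) for x in range(5)]
print([round(p, 4) for p in probs])
# agrees with 0.0498, 0.1494, 0.2240, 0.2240, 0.1680 above (to four decimal places)
```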

8 Note that the requirement that events occur singly means that the model could be applied to numbers of deliveries, but not to numbers of births, if twin births were a possibility for the event being modelled.


The value of such calculations lies in the fact that if special teams/equipment are required for these deliveries, then we can evaluate the probabilities for different numbers of events and plan resources (contingency plans) accordingly. The Poisson model is a fundamental tool in analysing the likelihood of events such as occurrence of equipment failure, overloads on computer systems (e.g., internet switching devices), blackouts on power transmission networks, etc. Simulation of such systems will require generation of Poisson random numbers (or of a closely related distribution). Accordingly, we might want to test random numbers to determine if they behave as if they come from a Poisson distribution, rather than being simply equally likely, as in Example 14. The example I have chosen to illustrate testing such a distribution refers to a physics experiment. Rice [17] reports an analysis by Berkson [21] of radioactivity data observed at the American Bureau of Standards. The data are summarised in Table 6.42: the observed frequencies refer to the number of 10-second intervals in which different numbers (x) of α-particles were emitted from an Americium 241 source. The total number of particles observed divided by the total observation time gave 0.8392 particles per second, which corresponds to λ = 8.392 particles per 10-second interval. As this value is calculated from the data, it is only an estimate of the 'true' underlying rate (this is relevant to the analysis). The question of interest is whether or not the data can be modelled by a Poisson distribution. If the answer is yes, it suggests that the individual emissions are independent of each other – which carries implications for the underlying physical processes. To calculate the set of expected frequencies9, we first use the Poisson model with parameter λ = 8.392 to get a set of probabilities P(x) for different numbers of particles x, and then multiply these by N = 1207, the total number of intervals, to get the expected number of intervals with x particles [Expected value, E(x) = NP(x)]. For example:

P(x) = λ^x e^(–λ)/x!        x = 0, 1, 2, 3, 4 …

P(3) = (8.392³/(3·2·1))e^(–8.392) = 0.022328

9 In calculating expected frequencies it is recommended that the cells should have an expectation of at least 1 (some say 5) to ensure that the approximation to the chi-square distribution is good; cells should be combined to ensure that this happens. This is the reason the first row of the table accumulates x = 0, 1, 2; otherwise the expected values corresponding to x = 0 and 1 would both be less than 5.


and the expected number of intervals with three particles is:

E[No. intervals with 3 particles] = NP(3) = 1207 × 0.022328 = 26.95.

The probability for the first row of the table is the sum P(0) + P(1) + P(2) = P(x < 3) = 0.010111. Since there is no upper bound on the number of particles in a time interval, we calculate P(x > 16) = 1 – [P(0) + P(1) + P(2) + … + P(16)] = 0.00587.
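These expected-frequency calculations, including the pooled end cells, can be reproduced as follows (a numerical check, using λ = 8.392 and N = 1207 as quoted above):

```python
import math

lam, n_intervals = 8.392, 1207

def poisson_pmf(x, lam):
    """P(x) = lam^x * e^(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)

p_low = sum(poisson_pmf(x, lam) for x in range(3))        # P(x < 3), pooled first cell
p_high = 1 - sum(poisson_pmf(x, lam) for x in range(17))  # P(x > 16), pooled last cell
e3 = n_intervals * poisson_pmf(3, lam)                    # expected intervals with 3 particles
print(round(e3, 2))   # about 26.95, as in the text
```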

No. particles (x)   Frequency
0-2                    18
3                      28
4                      56
5                     105
6                     126
7                     146
8                     164
9                     161
10                    123
11                    101
12                     74
13                     53
14                     23
15                     15
16                      9
17+                     5
Total                1207

Table 6.42: Numbers of intervals in which x α-particles were observed

The last column of Table 6.43 is calculated using:

(Oi – Ei)²/Ei

where i labels the rows.


Number of      Observed    Expected
particles (x)  Frequency   Frequency   Chi-square
0-2               18          12.20       2.75
3                 28          26.95       0.04
4                 56          56.54       0.01
5                105          94.90       1.08
6                126         132.73       0.34
7                146         159.12       1.08
8                164         166.92       0.05
9                161         155.65       0.18
10               123         130.62       0.44
11               101          99.65       0.02
12                74          69.69       0.27
13                53          44.99       1.43
14                23          26.97       0.58
15                15          15.09       0.00
16                 9           7.91       0.15
17+                5           7.09       0.61
Total           1207        1207          9.04

Table 6.43: A chi-square analysis of the α-particle data

The overall X² statistic is 9.04, as shown in Table 6.43. To calculate the degrees of freedom, we subtract 1 from the number of categories, K, as before; we also subtract 1 degree of freedom for every parameter estimated from the data10 – here we estimated one parameter, λ, so we have:

Degrees of Freedom = No. of categories – 1 – No. of parameters estimated
                   = (K – 1 – 1) = 16 – 1 – 1 = 14.

The critical value is X²c = 23.7 for a test with a significance level of 0.05; our test statistic of 9.04 is very much less than this (the corresponding p-value is 0.83), so we cannot reject the null hypothesis – a Poisson model fits the data. This suggests that the α-particles are emitted independently of each other.
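The whole test can be assembled from the counts in Table 6.43 (observed and expected values copied from the table; this is a check on the arithmetic, not part of the original analysis):

```python
observed = [18, 28, 56, 105, 126, 146, 164, 161, 123, 101, 74, 53, 23, 15, 9, 5]
expected = [12.20, 26.95, 56.54, 94.90, 132.73, 159.12, 166.92, 155.65,
            130.62, 99.65, 69.69, 44.99, 26.97, 15.09, 7.91, 7.09]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
k = len(observed)
df = k - 1 - 1        # one extra degree of freedom lost: lambda was estimated
print(round(x2, 2), df)   # about 9.04 with 14 df
```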

10 If we were fitting a Normal distribution to a set of data, we would need to estimate the mean, μ, and the standard deviation, σ, and so we would lose 2 degrees of freedom for parameter estimation (3 in total).


Exercise

Example 13 analysed 529 pea plants cross-classified into a two-way table according to the shape [round, round and wrinkled, wrinkled] and colour [yellow, yellow and green, green] of their seeds. The question of interest was whether the row and column classifications were independent. The data are reproduced below.

                          Colour
                Yellow    Y+G    Green    Totals
       Round      38       65      35       138
Shape  R+W        60      138      67       265
       Wrinkled   28       68      30       126
       Totals    126      271     132       529

Table 6.44: Mendel's pea plant data

6.4.1 Mendelian theory predicts that the frequencies for the three colours (Y : Y+G : G) will appear in the ratios 1:2:1 [or equivalently, the probability distribution is ¼, ½, ¼]. Similarly, the shapes (R : R+W : W) are also predicted to appear in the same ratios. Carry out a goodness-of-fit test to check whether the observed data depart from the 1:2:1 ratios by more than would be expected from chance variation. Do this for both the Colour and Shape distributions.

6.5 Additional Topics

This final section of the chapter focuses on three special topics: comparing proportions that sum to 1, the dangers involved in collapsing multi-dimensional contingency tables, and the analysis of matched studies where the response of interest is a proportion.

Comparing two proportions that sum to 1

Where the two proportions that are to be compared are proportions of the same total, they are not independent and the standard error needs to take account of this fact. If there are only two possible outcomes, the analysis is a simple modification of our earlier analysis for two independent proportions.


Example 18: French presidential election – revisited

Recall the political poll of Section 6.1 that estimated the support for Sarkozy (π1) and Royal (π2) in the French presidential election. Note that in this case the two sample proportions p1 and p2 are not independent since (p1 + p2) = 1 (assuming respondents were required to choose one or the other). If we wish to focus on the difference in support for the two candidates then the standard error of the sample difference (p1 – p2) is

SE(p1 – p2) = 2√(π1(1–π1)/n)

where n is the size of the poll and π2 = (1 – π1). In practice, we replace π1 by p1 in using this formula for estimation. Note the difference between this formula and that for the standard error of independent proportions, given on page 10. A 95% confidence interval for Sarkozy's lead at the time the poll was taken is:

(p1 – p2) ± 1.96(2)√(p1(1–p1)/n)

(0.530 – 0.470) ± 3.92√(0.530(0.470)/858)

0.060 ± 0.067
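This interval can be reproduced with a short calculation (a sketch, not part of the original text; p1 = 0.530 and n = 858 are the values quoted above, and the factor of 2 reflects the dependence p2 = 1 – p1):

```python
import math

def ci_for_lead(p1, n, z=1.96):
    """CI for (p1 - p2) when p2 = 1 - p1: (2*p1 - 1) +/- z * 2 * sqrt(p1*(1-p1)/n)."""
    diff = 2 * p1 - 1
    half_width = z * 2 * math.sqrt(p1 * (1 - p1) / n)
    return diff - half_width, diff + half_width

lo, hi = ci_for_lead(0.530, 858)
print(round(lo, 3), round(hi, 3))   # roughly -0.007 to 0.127: the interval covers zero
```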

This interval covers zero and so the sample difference is not statistically significant – the corresponding significance test would not reject the hypothesis of no difference in the population of voters. Despite the size of the sample difference in Sarkozy's favour, the sampling variability is such that Royal could, in fact, have had a slight lead in the population of electors from which the sample was selected. The conclusion is exactly the same as that based on an analysis of either Sarkozy's or Royal's support separately, since (because p1 + p2 = 1) no extra information is used in analysing the difference.

Example 19: Otter deaths in eastern Germany

Example 10 analysed data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year period. The total sample size of 1023 was made up of 598 males and 425 females. The authors state that there is an "even sex ratio in living otters". Do the study results support the proposition that male otters are more likely to be killed, or could the sample difference (58.5% versus 41.5%) be attributable to chance sampling variability in the reported data?


A 95% confidence interval for the difference between the long-run proportions for males (π1) and females (π2) is:

(p1 – p2) ± 1.96(2)√(p1(1–p1)/n)

(0.585 – 0.415) ± 3.92√(0.585(0.415)/1023)

0.17 ± 0.06

This suggests that there is an excess of such violent deaths among males, of between 11 and 23 percentage points, over the percentage for female otters.

Example 20: Collapsing Multi-dimensional Contingency Tables: Simpson's Paradox

Consider a hypothetical Graduate School with two programmes: Mathematics and English11. Suppose Table 6.45 represents the admission/rejection decisions for applications in a particular year, classified by sex. The table strongly suggests bias against females (12% acceptance versus 29% for males) on the part of those making the admission decisions. This difference is highly statistically significant, as shown by the chi-square test result (X² = 22.7).

Overall
                            Number of     Percentage
          Reject   Accept   Applicants    Accepted     X-squared   p-value
Males       107      43        150           29
Females     370      50        420           12          22.7       0.000
Totals      477      93        570           16

Table 6.45: Overall Graduate School admission decisions

Consider now the results from the two Schools individually.

11 The Graduate School example is inspired by Bickel et al. [22].


Mathematics
                            Number of     Percentage
          Reject   Accept   Applicants    Accepted     X-squared   p-value
Males        60      40        100           40
Females      10      10         20           50          0.686      0.408
Totals       70      50        120           42

English
                            Number of     Percentage
          Reject   Accept   Applicants    Accepted     X-squared   p-value
Males        47       3         50            6
Females     360      40        400           10          0.823      0.364
Totals      407      43        450           10

Table 6.46: Graduate School admission decisions by School

In both cases the sample admission rate is higher for females (though not statistically significantly so) – 50 versus 40% for Mathematics and 10 versus 6% for English. This is in stark contrast to the overall rates for the Graduate School. This phenomenon (aggregated results contradicting the disaggregated results) is called Simpson's Paradox. Why does it happen? In this example females apply (in greater numbers) for the programme into which it is more difficult to gain admission. Thus, only 10% of applicants are accepted into the English programme while 42% are accepted into Mathematics, but the females overwhelmingly apply for English. The conclusion must be that if there is sex bias, the source needs to be sought, not in the admissions procedures of the Graduate School, but in the conditioning and social influences that lead females disproportionately to apply for the graduate programmes into which it is difficult to gain entry.
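The reversal can be verified directly from the counts in Tables 6.45 and 6.46:

```python
# (accepted, applicants) pairs taken from Tables 6.45 and 6.46
maths   = {"male": (40, 100), "female": (10, 20)}
english = {"male": (3, 50),   "female": (40, 400)}

def rate(accepted, applicants):
    return accepted / applicants

# Within each School the female acceptance rate is higher...
assert rate(*maths["female"]) > rate(*maths["male"])       # 0.50 > 0.40
assert rate(*english["female"]) > rate(*english["male"])   # 0.10 > 0.06

# ...but aggregating over Schools reverses the comparison: Simpson's paradox.
male_rate = rate(40 + 3, 100 + 50)       # 43/150, about 0.29
female_rate = rate(10 + 40, 20 + 400)    # 50/420, about 0.12
assert male_rate > female_rate
print(round(male_rate, 2), round(female_rate, 2))
```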

This should be a sobering example when we consider how often we collect multiply cross-classified data and, typically, collapse across categories to form a multiplicity of two-dimensional tables for analytical purposes. For example, Social/Political surveys will often collect information on Voting Intention, Party Affiliation, Sex, Age, Income Group, Education, Occupational Status, Geographical Location, etc. Data collected on expectant mothers will include most of the above and also variables such as number of previous pregnancies, smoking, drinking, dietary, and exercise habits, as well as medical history information. There is an obvious need for care in the analysis of such information, to ensure that apparent effects are not the result of collapsing over influential variables.

Matched Studies

In Sections 6.2 and 6.3 we analysed the results of a clinical trial on the use of Timolol for preventing angina attacks. The Timolol study design is similar to that which underlies studies where two-sample t-tests are used for the analysis. If, instead of simply classifying each patient as Angina-Free or not, we could assign an (approximately Normally distributed) improvement score to each individual in the study, we could have used a two-sample t-test to analyse the resulting data. This would be preferable statistically (more powerful), as there is very little information embodied in a simple yes/no classification. When we studied two-sample t-tests, we also discussed paired t-tests: these were appropriate when each person or experimental unit was individually matched with another, or where, in measurement studies, the same material/experimental unit was measured twice, by two different instruments. Matching increases the statistical power of the test by eliminating much of the chance variation affecting the comparison – the 'treatments' are compared on more uniform material than is the case for two-sample t-tests. Matching can also be used where the response is binary. The analysis of matched data will be illustrated using data from a case-control study. We previously encountered a case-control study in Example 5 of Section 6.2; there, the control group was selected from the same community of adult males as the cases, but there was no attempt to match the characteristics of individual cases to those of individual controls. In our next example, patients with other forms of cancer were individually matched to the cases, who all had a particular cancer.
Example 21: Case-control study – Endometrial cancer

Rice [17] cites a study [23] in which 317 patients, who were diagnosed as having endometrial carcinoma, were matched with other cancer patients. The controls all had one of cervical cancer, ovarian cancer or cancer of the vulva. Each control was matched by age at diagnosis (within 4 years) and year of diagnosis (within two years) to a corresponding case. Table 6.47 shows the 317 pairs cross-classified by use or not of oestrogen for at least six months prior to the cancer diagnosis. The question of interest is whether or not there is an association between exposure to oestrogen and the presence of endometrial cancer.


                       Controls
              Exposed   Not-expos.   Totals
Cases
Exposed          39        113         152        a     b     a+b
Not-expos.       15        150         165        c     d     c+d
Totals           54        263         317        a+c   b+d    n

Table 6.47: A matched case-control study of endometrial cancer

We address the question of interest by asking if the proportion of cases exposed to oestrogen [p1 = (a+b)/n = 152/317] is statistically significantly different from the proportion of the controls who were exposed [p2 = (a+c)/n = 54/317]. This means that we want to assess whether or not the difference, (p1 – p2) = [(a+b)/n – (a+c)/n] = (b – c)/n, is statistically significantly different from zero, and if so, to estimate the long-run difference. A confidence interval provides both the answer to the statistical significance question and also the interval estimate we require.

Confidence Interval

The two sample proportions are not statistically independent of each other, so the standard error formula used in comparing independent proportions (e.g., the Timolol study) is not applicable here. The appropriate standard error estimate is:

SE(p1 – p2) = (1/n)√((b + c) – (b – c)²/n)

We can use this to obtain an approximate 95% confidence interval for the difference between the corresponding long-run (population) proportions, (π1 – π2). Thus, when we fill the relevant numbers for the endometrial cancer study into the formula:

(p1 – p2) ± 1.96(1/n)√((b + c) – (b – c)²/n)

we obtain:

(0.48 – 0.17) ± 1.96(1/317)√((113 + 15) – (113 – 15)²/317)

0.31 ± 0.06
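This interval can be reproduced with a short function (a check on the arithmetic; b = 113, c = 15, n = 317 from Table 6.47):

```python
import math

def matched_pairs_ci(b, c, n, z=1.96):
    """CI for (p1 - p2) = (b - c)/n using the matched-pairs standard error."""
    diff = (b - c) / n
    se = math.sqrt((b + c) - (b - c) ** 2 / n) / n
    return diff - z * se, diff + z * se

lo, hi = matched_pairs_ci(113, 15, 317)
print(round(lo, 2), round(hi, 2))   # roughly 0.25 to 0.37
```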


The interval is quite far from zero, suggesting that the proportion exposed to oestrogen in the population of endometrial cancer patients is considerably higher than that for the other cancer groups – clear evidence of an association between the use of oestrogen and the development of endometrial cancer.

A Statistical Test

An alternative approach to determining if an association exists is through the use of a statistical test. If we wish to test the null hypothesis that π1 = π2 then, when this hypothesis is true, the estimated standard error formula simplifies12 to:

SE(p1 – p2) = (1/n)√(b + c)

Using this we calculate a Z-statistic:

Z = [(p1 – p2) – 0]/SE(p1 – p2) = [(b – c)/n]/[(1/n)√(b + c)] = (b – c)/√(b + c)

which follows a Standard Normal distribution (again, when the null hypothesis is true). For the oestrogen data we obtain:

Z = (b – c)/√(b + c) = (113 – 15)/√(113 + 15) = 8.7

The critical values are ±1.96 for a two-tail test using a significance level of 0.05. Our observed value of Z = 8.7 is far out in the right-hand tail. There is, therefore, strong evidence of an association between oestrogen use and endometrial cancer.

McNemar's Test

If the Z-statistic is squared it gives a chi-square test statistic with one degree of freedom, just as is the case for the test on two independent proportions. This test gives identical results to the corresponding Z-test. The chi-square version is usually referred to as McNemar's test.
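Both versions of the test can be computed together (b = 113 and c = 15 as before; this is a numerical check, not part of the original text):

```python
import math

def mcnemar(b, c):
    """Z = (b - c)/sqrt(b + c); its square is McNemar's chi-square with 1 df."""
    z = (b - c) / math.sqrt(b + c)
    return z, z ** 2

z, chi2 = mcnemar(113, 15)
print(round(z, 1), round(chi2, 1))   # 8.7 and about 75.0
```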

12 If the null hypothesis is true, the expectation (long-run value) of (p1 – p2) is zero, i.e., the expectation of (b – c)/n is zero, so the (b – c)²/n part of the standard error formula drops out.


Exercises

6.5.1 Samuels reports data from a 1973 study of risk factors for stroke in which 155 women who had experienced a haemorrhagic stroke (cases) were interviewed. Controls (who had not had a stroke) were matched to each case by where they lived, age and race. The subjects were asked if they had used oral contraceptives. The table below displays oral contraceptive use (Yes/No) for all pairs.

             Controls
             Yes    No
Cases  Yes     5    30
       No     13   107

Table 6.48: Pairs classified by use of oral contraceptives

Do these data suggest an association between oral contraceptive use and strokes?

6.5.2 Altman reports data from a 1975 study carried out to see if patients whose skin did not respond to dinitrochlorobenzene (DNCB), a contact allergen, would show an equally negative response to croton oil, a skin irritant. The table below shows the results of simultaneous tests on the two chemicals for 173 patients with skin cancer.

                  DNCB
                  +     –
Croton Oil  +    81    48
            –    23    21

Table 6.49: Pairs classified by response to DNCB and croton oil

Is there evidence of a differential response to the two chemicals?


References

[1] Samuels, M.L., Statistics for the Life Sciences, Dellen Publishing Company, 1989.

[2] Brailovsky, D., Timolol maleate (MK-950): A new beta-blocking agent for prophylactic management of angina pectoris. A multicenter, multinational, cooperative trial. In Beta-Adrenergic Blocking Agents in the Management of Hypertension and Angina Pectoris, B. Magnani (ed.), Raven Press, 1974.

[3] Machin, D., Campbell, M.J., and Walters, S.J., Medical Statistics, Wiley, 2007.

[4] Campbell et al., A 24 year cohort study of mortality in slate workers in North Wales, Journal of Occupational Medicine, 55, 448-453, 2005.

[5] Breslow, N.E. and Day, N.E., Statistical Methods in Cancer Research: Vol. 1 – The analysis of case-control studies, International Agency for Research on Cancer, 1980.

[6] Tuyns, A.J., Pequignot, G. and Jensen, O.M., Le cancer de l'oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d'alcool et de tabac, Bull. Cancer, 64, 45-60, 1977.

[7] Bland, M., An Introduction to Medical Statistics, Oxford University Press, 2000.

[8] Machin, D. and Campbell, M.J., Sample Size Tables for Clinical Studies, Blackwell Science Inc., 1997.

[9] Pocock, S.J., Statistical aspects of clinical trial design, The Statistician, 31, 1-18, 1982.

[10] Anturan Reinfarction Trial (1978), Sulfinpyrazone in the prevention of sudden death after myocardial infarction, New England Journal of Medicine, 302, 250-256.

[11] Snedecor, G.W. and Cochran, W.G., Statistical Methods, Iowa State University Press, 6th ed., 1967.

[12] Fleiss, J.L., Statistical Methods for Rates and Proportions, Wiley, 1981.

[13] Hauer, S., Ansorge, H. and Zinke, O., A long-term analysis of the age structure of otters (Lutra lutra) from eastern Germany, International Journal of Mammalian Biology, 65, 360-368, 2000.

[14] Stuart, M., Changing the teaching of statistics, The Statistician, 44, 45-54, 1995.

[15] Stuart, M., An Introduction to Statistical Analysis for Business and Industry, Hodder, 2003.

[16] Ehrenberg, A.S.C., Data Reduction, Wiley, 1975.

[17] Rice, J.A., Mathematical Statistics and Data Analysis, Duxbury Press, 1995.

[18] Morton, A.Q., Literary Detection, Scribner's, 1978.

[19] Box, G.E.P., Hunter, S., and Hunter, W.G., Statistics for Experimenters, Wiley, 1978 (note: a second edition appeared in 2005).

[20] Bateson, W. and Saunders, E.R., Reports for the Evolution Committee of the Royal Society, 1, 1-160, 1902.

[21] Berkson, J., Examination of the randomness of alpha particle emissions, in Research Papers in Statistics, F.N. David (ed.), Wiley, 1966.

[22] Bickel, P.J., Hammel, E.A. and O'Connell, J.W., Sex bias in graduate admissions: data from Berkeley, Science, 187, 398-404, 1975.

[23] Smith, D.G., Prentice, R., Thompson, D.J., and Hermann, W.L., Association of exogenous estrogen and endometrial carcinoma, N. Eng. J. Med., 293, 1164-67, 1975.


Appendix 6.1

Let X and Y be two random quantities such that the mean of X is μx and the standard deviation of X is σx; the corresponding quantities for Y are μy and σy. The square of the standard deviation is called the variance, thus:

VAR(X) = σx² and VAR(Y) = σy²

The variance is of interest because when we combine random quantities the variances (but not the standard deviations) are additive, as discussed in Chapter 1.5. Thus, it can be shown that if X and Y are independent (the value of X does not affect the probable value of Y) then:

VAR(X + Y) = VAR(X) + VAR(Y)

and

VAR(X – Y) = VAR(X) + VAR(Y)

This latter result is used in finding the standard error of (p1 – p2). Since

SE(p) = √(π(1–π)/n) and VAR(p) = π(1–π)/n

it follows that:

VAR(p1 – p2) = π1(1–π1)/n1 + π2(1–π2)/n2

which gives

SE(p1 – p2) = √(π1(1–π1)/n1 + π2(1–π2)/n2)

Obviously, μx–y = μx – μy, so the mean of (p1 – p2) is (π1 – π2). These results give us the sampling distribution of (p1 – p2) as shown in Figure 6.4.


The Special Case (of Section 6.5)

If (p1 + p2) = 1 then (p1 – p2) = [p1 – (1 – p1)] = (2p1 – 1). It follows that:

SE(p1 – p2) = SE(2p1 – 1) = SE(2p1) = 2SE(p1)

since '1' is a constant and, therefore,

SE(p1 – p2) = 2SE(p1) = 2√(π1(1–π1)/n)

which is the result used in our analysis of the French presidential poll.

Text: © Eamonn Mullins, 2009, Data: see references for copyright owners.


A.7

Selected critical values for the chi-squared distribution

α is the proportion of values in a chi-squared distribution with ν degrees of freedom which exceed the tabled value. For example, 20% of the values in a chi-squared distribution with 1 degree of freedom exceed 1.64.

   α      .2       .1       .05      .025     .01      .005
ν = 1     1.64     2.71     3.84     5.02     6.64     7.88
    2     3.22     4.61     5.99     7.38     9.21    10.60
    3     4.64     6.25     7.82     9.35    11.35    12.84
    4     5.99     7.78     9.49    11.14    13.28    14.86
    5     7.29     9.24    11.07    12.83    15.09    16.75
    6     8.56    10.65    12.59    14.45    16.81    18.55
    7     9.80    12.02    14.07    16.01    18.48    20.28
    8    11.03    13.36    15.51    17.54    20.09    21.96
    9    12.24    14.68    16.92    19.02    21.67    23.59
   10    13.44    15.99    18.31    20.48    23.21    25.19
   12    15.81    18.55    21.03    23.34    26.22    28.30
   15    19.31    22.31    25.00    27.49    30.58    32.80
   20    25.04    28.41    31.41    34.17    37.57    40.00
   24    29.55    33.20    36.42    39.36    42.98    45.56
   30    36.25    40.26    43.77    46.98    50.89    53.67
   60    68.97    74.40    79.08    83.30    88.38    91.96
  120   132.81   140.23   146.57   152.21   158.95   163.65


Outline Solutions

6.1.1 A 95% confidence interval for the population proportion who intended to vote yes in the referendum:

p ± 1.96√(p(1–p)/n)

0.35 ± 1.96√(0.35(1–0.35)/1000)

0.35 ± 0.03

i.e., we estimate that between 32% and 38% of the population of voters intended to vote for the Lisbon treaty. A Minitab analysis is shown below.

Test and CI for One Proportion Sample X N Sample p 95% CI

1 350 1000 0.350000 (0.320438, 0.379562)

Using the normal approximation.

Table 6.50: Minitab analysis of voting data
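The Minitab interval can be reproduced in a few lines using the normal approximation (a check, not part of the original solution):

```python
import math

def one_proportion_ci(x, n, z=1.96):
    """Normal-approximation CI for a single proportion: p +/- z*sqrt(p(1-p)/n)."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = one_proportion_ci(350, 1000)
print(round(lo, 6), round(hi, 6))   # about (0.320437, 0.379563), matching Minitab
```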

6.1.2 The calculations for this exercise are carried out in exactly the same way as for Exercise 6.1.1. A Minitab analysis is shown below: a 95% confidence interval suggests that the population market penetration rate is between 42 and 48% - note that this does not include the target value of 50%.

Test and CI for One Proportion Sample X N Sample p 95% CI

1 450 1000 0.450000 (0.419166, 0.480834)

Using the normal approximation.

Table 6.51: Minitab analysis of market research data


6.2.1 First, we will carry out a Z-test of the equality of the population proportions. As for the Timolol study, we frame our question in terms of a null hypothesis which considers the long-run proportions as being the same and an alternative hypothesis which denies their equality.

Ho: π1 – π2 = 0
H1: π1 – π2 ≠ 0

We calculate a Z statistic:

Z = [(p1 – p2) – 0]/√(p(1–p)/n1 + p(1–p)/n2)

– this will have a standard Normal distribution. Our estimate of the common p (when the null hypothesis is true) is 610/2000 = 0.305:

Z = [(0.35 – 0.26) – 0]/√(0.305(1–0.305)/1000 + 0.305(1–0.305)/1000) = 4.37

When compared to the critical values of ±1.96 corresponding to a significance level of 0.05, the Z-statistic is highly statistically significant, leading us to conclude that the population proportions are not the same. A 95% confidence interval for the difference between the population proportions is given by:

(p1 – p2) ± 1.96√(p1(1–p1)/n1 + p2(1–p2)/n2)

For the referendum poll results this gives:

(0.35 – 0.26) ± 1.96√(0.35(1–0.35)/1000 + 0.26(1–0.26)/1000)

0.09 ± 0.04


We are 95% confident that the difference between the population proportions is somewhere between 5 and 13 percentage points. A Minitab analysis is shown below.

Test and CI for Two Proportions Sample X N Sample p

1 350 1000 0.350000

2 260 1000 0.260000

Difference = p (1) - p (2)

Estimate for difference: 0.09

95% CI for difference: (0.0498375, 0.130163)

Test for difference = 0 (vs not = 0): Z = 4.37 P-Value = 0.000

Table 6.52: Minitab comparison of survey results
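The Z-statistic in the output can be verified directly; the pooled proportion 610/2000 = 0.305 is used in the standard error, as in the solution above (this is a check, not part of the original text):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z-test of H0: pi1 = pi2, using the pooled proportion in the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_proportion_z(350, 1000, 260, 1000)
print(round(z, 2))   # 4.37, matching the Minitab output
```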

6.2.2 This exercise requires the same analyses for the test and confidence interval as did 6.2.1 – the Minitab output given below shows the required results.

Test and CI for Two Proportions Sample X N Sample p

1 80 220 0.363636

2 20 240 0.083333

Difference = p (1) - p (2)

Estimate for difference: 0.280303

95% CI for difference: (0.207754, 0.352852)

Test for difference = 0 (vs not = 0): Z = 7.28 P-Value = 0.000

Table 6.53: Minitab analysis of vaccine data

The influenza data are set out in tabular form below, to facilitate the calculation of the odds ratio.


               Control   Vaccine   Totals                Control   Vaccine   Totals
Influenza         80        20       100      Inf.          a         b       a+b
No influenza     140       220       360      No inf.       c         d       c+d
Totals           220       240       460      Totals       a+c       b+d       n

OR = (a/c)/(b/d) = ad/bc = (80/140)/(20/220) = 6.3

The study suggests that the odds of contracting influenza are more than six times higher for the control group. A one-sample confidence interval for the population incidence of influenza, if the entire population (of adults) had been vaccinated is shown in the Minitab output below. Between 5 and 12% of adults would be expected to contract influenza, even though they had been vaccinated.
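The odds-ratio calculation generalises to any 2×2 table; for the influenza data above (a sketch, with the table laid out as in Table 6.54):

```python
def odds_ratio(a, b, c, d):
    """OR = (a/c)/(b/d) = ad/bc for a 2x2 table: a, b on the top row, c, d below."""
    return (a * d) / (b * c)

print(round(odds_ratio(80, 20, 140, 220), 1))   # 6.3 for the influenza data
```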

Test and CI for One Proportion Sample X N Sample p 95% CI

1 20 240 0.083333 (0.048366, 0.118300)

Using the normal approximation.

Table 6.55: Estimating the incidence of influenza in the population

6.2.3 The statistical test and confidence interval for comparing the sample proportions are shown below. The difference between the sample proportions is highly statistically significant.

Test and CI for Two Proportions Sample X N Sample p

1 26 273 0.095238

2 44 1046 0.042065

Difference = p (1) - p (2)

Estimate for difference: 0.0531731

95% CI for difference: (0.0162884, 0.0900577)

Test for difference = 0 (vs not = 0): Z = 3.49 P-Value = 0.000

Table 6.56: Minitab analysis of bronchitis data


              Bronchitis   Control   Totals                 Bronchitis   Control   Totals
Coughs            26          44        70      Coughs          a           b       a+b
No coughs        247        1002      1249      No coughs       c           d       c+d
Totals           273        1046      1319      Totals         a+c         b+d       n

OR = (a/c)/(b/d) = ad/bc = (26/247)/(44/1002) = 2.4

The study suggests that the odds of having a persistent cough in later life are more than twice as high for the children who suffered from bronchitis in infancy than they are for the control group.

6.2.4 The following sample sizes (for each group) were obtained using the sample size formula (page 20).

                 Increased proportion
                     0.2      0.25
Power   0.90         263       130
        0.95         325       161

Slightly larger values (shown below) were obtained using the sample size option in Minitab – the differences are negligible when we take into account the uncertainties in specifying the parameters of the problem.

                 Increased proportion
                     0.2      0.25
Power   0.90         266       133
        0.95         329       164
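The first table can be reproduced with the standard two-proportion sample-size formula, n = (z_alpha/2 + z_beta)^2 [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2. The baseline proportion of 0.1 used below is an assumption (the page-20 formula and its inputs are not repeated here), chosen because it reproduces the tabulated values:

```python
from math import ceil

# z-values: two-sided alpha = 0.05; power 0.90 or 0.95
z_alpha = 1.96
z_beta = {0.90: 1.2816, 0.95: 1.6449}

p1 = 0.10  # assumed baseline proportion (reproduces the tabulated values)
sizes = {}
for power in (0.90, 0.95):
    for p2 in (0.20, 0.25):
        n = (z_alpha + z_beta[power]) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
        sizes[(power, p2)] = ceil(n)  # always round up to a whole subject
        print(f"power {power}, detect p2 = {p2}: n = {sizes[(power, p2)]} per group")
```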


6.2.5 A Minitab chi-square analysis of the slate workers' study is shown below. The X² value of 9.328 is simply the square of the Z-value shown on page 15 (apart from rounding). The conclusions from the chi-square analysis are the same, therefore, as those from the Z-test for comparing two proportions, as discussed on pages 14 and 15.

Chi-Square Test: C1, C2
Expected counts are printed below observed counts

Chi-Square contributions are printed below expected counts

Died Surv. Total

Slate workers 379 347 726

352.30 373.70

2.024 1.908

Controls 230 299 529

256.70 272.30

2.778 2.618

Total 609 646 1255

Chi-Sq = 9.328, DF = 1, P-Value = 0.002

Table 6.58: Chi-square analysis of the slate workers' data
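The 2x2 chi-square statistic is easy to compute from first principles, as a sketch (scipy.stats.chi2_contingency with correction=False should give the same figures):

```python
from math import sqrt, erfc

# Slate workers' mortality data (2x2 table)
observed = [[379, 347],   # slate workers: died, survived
            [230, 299]]   # controls:      died, survived

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
n = sum(row)

# Pearson chi-square: sum of (O - E)^2 / E over the four cells
chi_sq = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
             for i in range(2) for j in range(2))
# for 1 degree of freedom, P(X^2 > x) = erfc(sqrt(x/2))
p_value = erfc(sqrt(chi_sq / 2))
print(f"Chi-Sq = {chi_sq:.3f}, P-Value = {p_value:.3f}")
```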

6.2.6 A chi-square analysis of the oesophageal cancer case-control data is shown below – again there is a one-to-one correspondence with the Z-test of Table 6.7 (page 16).

Chi-Square Test: C3, C4
Expected counts are printed below observed counts

Chi-Square contributions are printed below expected counts

80+ <80 Total

Cases 96 104 200

42.05 157.95

69.212 18.427

Controls 109 666 775

162.95 612.05

17.861 4.755

Total 205 770 975

Chi-Sq = 110.255, DF = 1, P-Value = 0.000

Table 6.59: Chi-square analysis of the oesophageal cancer data


Exercise 6.2.7: Cosmic Rays It is not obvious to me whether the physicists would ask if the energy distributions are different for the two particle types or if the charge distributions are different at different energy ranges. In analysing these data, therefore, I would frame the null hypothesis as a test of independence – are the energy and charge classifications independent? I would treat the two classifications as if they have equal status, in other words.

Chi-Square Test: Positive, Negative
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

Energy
Interval    Positive  Negative  Total
50-100             9        20     29
                7.37     21.63
               0.363     0.124
101-300           32        51     83
               21.08     61.92
               5.658     1.926
301-1000          23       117    140
               35.56    104.44
               4.434     1.509
Total             64       188    252

Chi-Sq = 14.013, DF = 2, P-Value = 0.001

Table 6.60: Chi-square analysis of the cosmic ray data

A chi-square test on the data gives a test statistic of X² = 14.01, as shown in Table 6.60. This should be compared to a chi-square distribution with (3–1)(2–1) = 2 degrees of freedom; for a significance level of α = 0.05, the critical value is 5.99 and the null hypothesis of energy-charge independence is rejected. Suppose we decide to view the data in terms of the energy distributions for the two particle types; positive electrons are called 'positrons' (as in Positron Emission Tomography (PET) scanners). Table 6.61 shows the energy distributions expressed as percentages, for the two particle types.
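The same test statistic and critical-value comparison can be sketched in Python; for 2 degrees of freedom the chi-square survival function has the simple closed form exp(-x/2), so the 5% critical value is -2 ln(0.05) = 5.99:

```python
from math import exp

# Cosmic ray counts (Table 6.60): rows = energy interval, cols = charge
observed = [[9, 20], [32, 51], [23, 117]]  # 50-100, 101-300, 301-1000 MeV

rows, cols = len(observed), len(observed[0])
row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
n = sum(row)

chi_sq = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
             for i in range(rows) for j in range(cols))
df = (rows - 1) * (cols - 1)
# for df = 2 the survival function is exp(-x/2)
p_value = exp(-chi_sq / 2)
print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```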


Energy               Electron Charge
Interval (MeV)     Positive   Negative
50-100                   14         11
101-300                  50         27
301-1000                 36         62
                        100        100

Table 6.61: The energy distributions (percentage)

It is obvious that there is a much higher percentage of highest energy particles in the negative charge group; there is a difference of 26 percentage points, i.e., 62 versus 36%. The 95% confidence interval for the difference is from 13 to 40 percentage points, as shown in Table 6.62.

Test and CI for Two Proportions

Sample    X    N  Sample p
e-      117  188  0.622340
e+       23   64  0.359375

Difference = p (1) - p (2)
Estimate for difference: 0.262965
95% CI for difference: (0.126506, 0.399425)

Test for difference = 0 (vs not = 0): Z = 3.66 P-Value = 0.000

Table 6.62: Comparing proportions in the highest energy interval

Presumably, differences in the energy distributions would carry implications for the mechanisms generating the particles. Thus, the non-symmetrical energy distributions observed here may rule in or rule out certain sources of these particles.

Faulty data collection

If the energy measurements cannot be trusted to differentiate between the two lower energy intervals, then it makes sense to combine them into a single 50-300 MeV band; the revised distributions are shown in Table 6.63.


Energy               Electron Charge
Interval (MeV)     Positive   Negative
50-300                   41         71
301-1000                 23        117
                         64        188

Table 6.63: Revised energy distributions

A chi-square analysis of this 2x2 table (shown in Table 6.64) is effectively equivalent to the comparison of two proportions in Table 6.62, since we focused there on the proportions in the highest energy interval, which remains unchanged. However, the null hypothesis underlying the Z-test for proportions in Table 6.62 is a 'homogeneity of proportions' hypothesis, while that underlying Table 6.64 could be framed, alternatively, as an energy level-charge independence hypothesis. As we have seen, these are essentially the same hypothesis and the tests are equivalent: Z² = (3.66)² = 13.4 ≈ X² = 13.37, apart from rounding.

Chi-Square Test: Positron, Electron
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

Energy
Interval    Positron  Electron  Total
Low               41        71    112
               28.44     83.56
               5.542     1.887
High              23       117    140
               35.56    104.44
               4.434     1.509
Total             64       188    252

Chi-Sq = 13.372, DF = 1, P-Value = 0.000

Table 6.64: A chi-square analysis of the revised data
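The Z-test/chi-square equivalence for a 2x2 table can be checked directly; the pooled two-proportion Z statistic and the Pearson statistic are linked algebraically by Z² = X², so any discrepancy in the printed values is pure rounding:

```python
from math import sqrt

# High-energy proportions (Table 6.62) for the two charge groups
x1, n1 = 117, 188   # electrons (negative charge) in the 301-1000 MeV band
x2, n2 = 23, 64     # positrons (positive charge) in the 301-1000 MeV band

# Two-proportion Z statistic with the pooled standard error
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Pearson chi-square for the revised 2x2 table (Table 6.63)
observed = [[41, 71],    # low band (50-300 MeV): positron, electron
            [23, 117]]   # high band (301-1000 MeV)
row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
n = sum(row)
chi_sq = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
             for i in range(2) for j in range(2))

print(f"Z = {z:.2f}, Z^2 = {z * z:.2f}, X^2 = {chi_sq:.2f}")
```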

The overall conclusion of our analysis is that energy level and particle charge are not independent classifications and that there is a disproportionately higher fraction of (negatively charged) electrons in the highest energy band.


Exercise 6.2.8: Hospitals We begin our analysis by excluding the referral hospital, E, and by examining the other four hospitals A–D to determine if their post-operative patient status distributions can be considered as being generated by the same set of underlying relative frequencies.

Chi-Square Test: A, B, C, D
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

               A       B       C       D   Total
No-imp        13       5       8      21      47
            7.75    5.11   13.03   21.11
           3.555   0.002   1.941   0.001
Partial       18      10      36      56     120
           19.79   13.05   33.26   53.89
           0.162   0.714   0.225   0.082
Complete      16      16      35      51     118
           19.46   12.84   32.71   53.00
           0.615   0.780   0.160   0.075
Total         47      31      79     128     285

Chi-Sq = 8.313, DF = 6, P-Value = 0.216

Table 6.65: Chi-square comparison of hospitals A – D

Table 6.65 shows a chi-square analysis of the 3x4 table. The X² = 8.3 test statistic is non-significant when compared to the α = 0.05 critical value of χ²c = 12.6 of the chi-square distribution with (3–1)(4–1) = 6 degrees of freedom. The large p-value of 0.216 indicates that the test statistic would be expected to be bigger than 8.3 more than 20% of the time, even if the null hypothesis of equal underlying frequency distributions were true. Accordingly, we cannot reject the null hypothesis of equal underlying post-operative result performances for the four hospitals. It makes sense, therefore, to combine the data into one 'general hospital' distribution.
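The statistic and p-value in Table 6.65 can be reproduced as a sketch; with 6 degrees of freedom the chi-square tail probability also has a closed form (an Erlang tail), so no statistics library is needed:

```python
from math import exp

# Post-operative outcomes at hospitals A-D (Table 6.65)
observed = [[13,  5,  8, 21],   # no improvement
            [18, 10, 36, 56],   # partial restoration
            [16, 16, 35, 51]]   # complete restoration

rows, cols = len(observed), len(observed[0])
row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
n = sum(row)

chi_sq = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
             for i in range(rows) for j in range(cols))
df = (rows - 1) * (cols - 1)
# survival function of chi-square with 6 df: exp(-x/2) * (1 + x/2 + (x/2)^2/2)
x = chi_sq / 2
p_value = exp(-x) * (1 + x + x * x / 2)
print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```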


Table 6.66 shows the (rounded) percentages that fall into the three outcome categories for the general and referral hospitals.

                               Hospital
                         General   Referral
No improvement                16         52
Partial funct. rest.          42         35
Complete funct. rest.         41         12

Table 6.66: Percentage (rounded) changes

There are quite different response patterns, as might be expected. The referral hospital has a much lower 'complete recovery' rate (12% versus 41%) and a much higher 'no improvement' rate (52% versus 16%). Table 6.67 is a chi-square analysis of this table.

Chi-Square Test: General, Referral
Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            General  Referral  Total
No-imp           47        43     90
              69.89     20.11
              7.497    26.058
Partial         120        29    149
             115.71     33.29
              0.159     0.553
Complete        118        10    128
              99.40     28.60
              3.480    12.096
Total           285        82    367

Chi-Sq = 49.844, DF = 2, P-Value = 0.000

Table 6.67: Chi-square comparison of general and referral hospitals

The chi-square analysis (highly significant test statistic) provides statistical justification for interpreting the two different response distributions of Table 6.66:


the chi-square result tells us that the differences are unlikely to be due to chance fluctuations.

Figure 6.10: Response 'profiles' for hospital types

Figure 6.10 shows one possible graphical representation of the results. It illustrates the difference in the 'response profiles' of the two types of hospital. The profiles do not have any intrinsic meaning, of course, as the horizontal axis is not a continuous scale. The profiles do, however, illustrate the fundamentally different response patterns and would (I think!) help a report reader who might be overwhelmed by the detail shown in Tables 6.65 and 6.67, in particular, to interpret the results of the study.

6.4.1 The observed colour distribution for the seeds is compared to the theoretical ratios of 1:2:1 in the Minitab output below. This gives a chi-square value of 0.46, with a corresponding p-value of 0.796. There is no reason to doubt the applicability of the Mendelian theory.

Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: C5

                        Test            Contribution
Category  Observed    Propn.  Expected     to Chi-Sq

1 126 0.25 132.25 0.295369

2 271 0.50 264.50 0.159735

3 132 0.25 132.25 0.000473

N DF Chi-Sq P-Value

529 2 0.455577 0.796

Table 6.68: Minitab Goodness of Fit analysis of seed colours
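The goodness-of-fit calculation is the same Pearson sum, with expected counts obtained from the hypothesised 1:2:1 proportions; a minimal sketch:

```python
from math import exp

# Observed seed colours and Mendelian 1:2:1 expected proportions
observed = [126, 271, 132]
props = [0.25, 0.50, 0.25]
n = sum(observed)

expected = [p * n for p in props]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# df = 3 categories - 1 = 2, so the p-value is exp(-chi_sq/2)
p_value = exp(-chi_sq / 2)
print(f"Chi-Sq = {chi_sq:.6f}, P-Value = {p_value:.3f}")
```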


Similarly, the observed shape distribution for the seeds is compared to the theoretical ratios of 1:2:1 in the Minitab output in Table 6.69. This gives a chi-square value of 0.55, with a corresponding p-value of 0.76. There is no reason to doubt the applicability of the Mendelian theory with respect to the shape distribution either.

Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: C7

                        Test            Contribution
Category  Observed    Propn.  Expected     to Chi-Sq

1 138 0.25 132.25 0.250000

2 265 0.50 264.50 0.000945

3 126 0.25 132.25 0.295369

N DF Chi-Sq P-Value

529 2 0.546314 0.761

Table 6.69: Minitab Goodness of Fit analysis of seed shapes

6.5.1. Table 6.70 shows the 155 pairs cross-classified by exposure to oral contraceptives. The question of interest is whether or not there is an association between exposure to oral contraception and the occurrence of a stroke.

                          Controls
                 Exposed   Not-expos.   Totals
Cases
  Exposed             5          30        35          a     b    a+b
  Not-expos.         13         107       120          c     d    c+d
  Totals             18         137       155         a+c   b+d    n

Table 6.70: A matched case-control study of strokes and oral contraceptive use

To test the null hypothesis that the proportion of women (in the relevant population) who used oral contraceptives was the same for the population of women who experienced strokes [cases: p1=(a+b)/n] as for the population who did not [controls: p2=(a+c)/n], we calculate a Z-statistic:

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\widehat{SE}(\hat{p}_1 - \hat{p}_2)} = \frac{(b-c)/n}{\sqrt{b+c}\,/n} = \frac{b-c}{\sqrt{b+c}} = \frac{30-13}{\sqrt{30+13}} = 2.59$$


When compared to the critical values of ±1.96 given by the conventional significance level of 0.05, the Z-value is clearly statistically significant: cases (those who had strokes) were more likely to have been exposed to oral contraception. A confidence interval for the increased probability of exposure to oral contraception for cases is given by:

$$(\hat{p}_1 - \hat{p}_2) \pm 1.96\,\frac{\sqrt{(b+c) + (b-c)^2/n}}{n}$$

where p1 = 35/155 = 0.226 and p2 = 18/155 = 0.116, giving (p1 – p2) = 0.110. The calculated confidence interval is:

0.110 ± 0.085 or [0.025 to 0.195]

that is, a difference of between 3 and 20 percentage points.
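The matched-pairs test depends only on the discordant counts b and c; as a sketch (the half-width expression below is the form that reproduces the quoted 0.085; some references subtract rather than add the (b-c)²/n term):

```python
from math import sqrt

# Matched case-control pairs (Table 6.70): only the discordant pairs
# carry information about the exposure difference
b = 30   # case exposed, control not exposed
c = 13   # control exposed, case not exposed
n = 155  # number of matched pairs

# McNemar-style Z statistic
z = (b - c) / sqrt(b + c)
print(f"Z = {z:.2f}")

# Difference in exposure proportions and its 95% half-width
diff = (b - c) / n
half_width = 1.96 * sqrt((b + c) + (b - c) ** 2 / n) / n
print(f"{diff:.3f} +/- {half_width:.3f}")
```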

6.5.2.

Table 6.71 shows the 173 self-pairs classified by response to the two chemicals. The question of interest is whether or not there is a differential response to the two chemicals.

                        DNCB
                  +      –     Totals
Croton Oil
  +              81     48       129          a     b    a+b
  –              23     21        44          c     d    c+d
  Totals        104     69       173         a+c   b+d    n

Table 6.71: 173 patients classified according to response to two chemicals

These data may be analysed in exactly the same way as the strokes/contraceptives data of Exercise 6.5.1. To test the null hypothesis that the probability of a negative response to DNCB was the same as that for Croton Oil we can calculate a Z-statistic:

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\widehat{SE}(\hat{p}_1 - \hat{p}_2)} = \frac{(b-c)/n}{\sqrt{b+c}\,/n} = \frac{b-c}{\sqrt{b+c}} = \frac{48-23}{\sqrt{48+23}} = 2.97$$


When compared to the critical values of ±1.96 given by the conventional significance level of 0.05, the Z-value is clearly statistically significant: patients (in the relevant population) were more likely to react negatively to DNCB than they were to Croton Oil. A confidence interval for the increased probability of reacting negatively to DNCB is given by:

$$(\hat{p}_1 - \hat{p}_2) \pm 1.96\,\frac{\sqrt{(b+c) + (b-c)^2/n}}{n}$$

where p1 = 69/173 = 0.399 and p2 = 44/173 = 0.254, giving (p1 – p2) = 0.145.

The calculated confidence interval is:

0.145 ± 0.098

which ranges from 0.05 to 0.24, that is, a difference of between 5 and 24 percentage points.

© Text, Eamonn Mullins, data, see references, 2009