workshop 4 -solutions

MAS230 : Biostatistical Methods

Workshop 4

1. Consider Data Set 75. These data come from Judge, M.D. et al (1984). “Thermal

shrinkage temperature of intramuscular collagen of bulls and steers,” Journal of

Animal Science 59: 706–9, and are reproduced in Samuels and Witmer (1999),

Statistics for Life Sciences, 2nd Edition, Prentice Hall, p. 357. The study is designed

to assess the effect of electrical stimulation of a beef carcass in terms of improving

the tenderness of the meat. In this test, beef carcasses were split in half. One

side was subjected to a brief electrical current while the other was an untreated

control. From each side a specimen of connective tissue (collagen) was taken, and

the temperature at which shrinkage occurred was determined. Increased tenderness

is related to a low shrinkage temperature.

Carry out analyses to assess the impact of electrical stimulation on the meat tender-

ness. Use both parametric and non-parametric methods and compare the results.

Suppose acceptable tenderness corresponds to a shrinkage temperature less than

69 degrees. How would you test to see if the proportions of acceptable tenderness

values differed under the two treatments? Use SPSS to create appropriate variables

to enable this to be tested and carry out the analysis. Don’t forget that the sample

sizes are small here.

To assess the impact of electrical stimulation on meat tenderness, we will examine

whether the mean temperature at which shrinkage occurs is different for the two

groups. Since one half of each carcass is assigned to the untreated group, and

the other half is assigned to the group that receives electrical stimulation, the two

samples are not independent. Consequently, the most appropriate analysis is a test

for paired differences.

We might consider a t-test for paired differences. This test is based on the hypothe-

ses

H0 : δ = 0

H1 : δ 6= 0

where δ represents the mean difference in temperature at which shrinkage occurs for

the untreated carcass halves and the carcass halves treated with electrical stimula-

tion. To carry out a t-test for paired differences the carcasses must be independent.

1

Without information about how the carcasses were obtained, we will assume that

this is the case. Additionally, the differences in temperature at which shrinkage

occurs must come from an approximately normal distribution. A Q-Q plot for dif-

ferences in temperature at which shrinkage occurs is given in Figure 1. The observed

Figure 1: Q-Q plot of differences in temperature at which shrinkage occurs.

differences fall quite close to the line, suggesting that the assumption of normality

of the differences is plausible, and we proceed with the t-test. Output for this test is

given in Figure 2. The test statistic is given by t = 0.404 on 14 degrees of freedom.

Figure 2: Output for t-test for mean difference in temperature at which shrinkage occurs

for untreated carcass halves and electrically stimulated carcass halves.

The corresponding p-value of 0.692 is not significant for any reasonable significance

level, so we do not reject the null hypothesis, meaning that we have insufficient

evidence to suggest that treating carcasses with electrical stimulation produces any

difference in the mean temperature at which shrinkage occurs.

If the assumptions of the t-test had not met, we might have considered a sign

test or Wilcoxon signed-rank test. Even though the t-test is the appropriate test,

2

here I provide analyses using both the sign test and Wilcoxon signed-rank test for

illustrative purposes. For both the sign test and Wilcoxon signed-rank test, we

would test the hypotheses

H0 : µ̃ = 0

H1 : µ̃ 6= 0

where µ̃ represents the median difference in temperature at which shrinkage oc-

curs for the untreated carcass halves and the carcass halves treated with elec-

trical stimulation. To carry out both tests, the carcasses must be independent.

We already addressed this requirement/assumption for the t-test. To carry out a

Wilcoxon signed-rank test, the differences would need to be approximately sym-

metric. Both a histogram and boxplot for differences are given in Figure 3. Neither

show strong evidence to suggest that differences are asymmetric, so either a sign

test or a Wilcoxon-signed rank test can be carried out. Output for the two tests

are given in Figure 4. The p-values for both tests (0.791 for the sign test and 0.801

for the Wilcoxon signed-rank test) exceed any reasonable significance level, so we

would not reject the null hypothesis, meaning that we have insufficient evidence to

conclude that there is a difference in the median temperature at which shrinkage

occurs for untreated carcass halves and electrically stimulated carcass halves.

For the secondary analysis of interest, we must convert temperatures at which

shrinkage occurs to binary variables, where a ‘1’ denotes a shrinkage temperature

less than 69 degrees, and a ‘0’ denotes a shrinkage temperature of 69 degrees or

more. (Note that we could also let ‘0’ denote a shrinkage temperature less than

69 degrees and ‘1’ denote a shrinkage temperature of 69 degrees or more, and our

analysis would produce identical results.) Our two samples are still dependent, and

we would like to determine whether or not the proportions of acceptable carcass

halves differs under the two treatments, so we are restricted to either a kappa test

or McNemar’s test. McNemar’s test is the appropriate test here, as we would like to

see whether or not the proportion of acceptable carcass halves changes after apply-

ing a treatment (electrical stimulation). Since the independence assumption for this

test was already addressed when we carried out the t-test, we proceed with the test.

Output for the test is given in Figure 5. The p-value of approximately 1 means that

we will not reject the null hypothesis for any reasonable significance level, meaning

that we do not have evidence to suggest that there is a difference in the proportion

of acceptable carcass halves for untreated carcass halves and electrically stimulated

carcass halves.

3

Figure 3: Histogram (top) and boxplot (bottom) for difference in temperature at which

shrinkage occurs for the untreated carcass halves and the carcass halves treated with

electrical stimulation.

4

Figure 4: Output for sign test (left) and Wilcoxon signed-rank test (right) for median

temperature at which shrinkage occurs for untreated carcass halves and electrically stim-

ulated carcass halves.

Figure 5: Output for McNemar’s test for difference in proportion of acceptable carcass

halves for untreated carcass halves and electrically stimulated carcass halves.

5

2. Refer to Data Set 76. These data come from Mochizuki, M. et al (1984). “Effects

of smoking on fetoplacentalmaternal system during pregnancy,” American J. Ob-

stet. Gyn. 149: 13–20. The study considered the effects of smoking during preg-

nancy by examining the placentas from 58 women after childbirth. Each mother was

classified as a non-, moderate or heavy smoker during pregnancy, and the outcome

measure was presence or absence of atrophied placental villi, finger-like structures

that protrude from the wall to increase absorption area.

Combine the two smoking classes to create a “smoker” class and carry out an ap-

propriate test for association of villi atrophy with smoking status. Given there are

three ordered classes of smoking (none < moderate < heavy) think about how you

might display such data.

We can consider either of two different statistical tests, both of which will produce

identical results. First, if villi atrophy is associated with smoking status, that means

that the probability of villi atrophy changes with a person’s smoking status. Thus,

we can carry out a test to determine if the proportion of smokers with villi atrophy

is the same as the proportion of non-smokers with villi atrophy. To do this, we

consider the statistical hypotheses

H0 : πs = πn.s.

H1 : πs 6= πn.s.

where πs denotes the population proportion of smokers with villi atrophy and πn.s.

denotes the population proportion of non-smokers with villi atrophy. By hand, we

can verify that

ns = 36

nn.s. = 22

p̂s =21

36= 0.583

p̂n.s. =4

22= 0.182

p̂ =21 + 4

36 + 22= 0.431

Before carrying out this test, it is important to determine whether or not the re-

quirements/assumptions of the test are met. Without detailed information about

the sampling mechanism, we will assume that the smokers and non-smokers in the

study were sampled independently of each other, so the sample of smokers is inde-

pendent of the sample of non-smokers, and individuals within each sample are in-

dependent. Additionally, we must confirm that nsp̂ > 5, ns (1− p̂) > 5, nn.s.p̂ > 5,

6

and nn.s. (1− p̂) > 5. Since

nsp̂ = 36 · 0.431 = 15.517 > 5

ns (1− p̂) = 36 (1− 0.431) = 20.483 > 5

nn.s.p̂ = 22 · 0.431 = 9.483 > 5

nn.s. (1− p̂) = 22 (1− 0.431) = 12.517 > 5

it is appropriate to use a normal approximation for our two sample test for inde-

pendent proportions. Then our test statistic is given by

z =p̂s − p̂n.s.√

p̂ (1− p̂)(

1ns

+ 1nn.s.

)=

0.583− 0.182√0.431 (1− 0.431)

(136

+ 122

)= 2.996

and the corresponding p-value is given by

P (Z > 2.996) + P (Z < −2.996) = 2 · 0.001367431 = 0.0027

Since our p-value of 0.0027 is less than any reasonable significance level, we have

sufficient evidence to reject the null hypothesis, meaning that we have evidence

to suggestion that the proportion of smokers with villi atrophy is different from

the proportion of non-smokers with villi atrophy. This means that villi atrophy is

associated with smoking status.

The second test we can consider is a chi-square test of independence. If villi atrophy

is not associated with smoking status, then villi atrophy and smoking status are

independent. Otherwise, they are dependent. To carry out this test, we consider

the hypotheses

H0 : smoking status and villi atrophy are independent

H1 : smoking status and villi atrophy are not independent

Output for the chi-square test of independence is given in Figure 6. The test statistic

is

χ2 = 8.976 , d.f. = (2− 1) · (2− 1) = 1

and the corresponding p-value is approximately 0.003. Note that these results

are identical to the two sample test for independent proportions. In particular,

7

Figure 6: Output for chi-square test of independence.

z2 = 2.9962 = 8.976 = χ2, and the p-values are identical (The difference between

the two is simply due to rounding for the p-value for the chi-square test of inde-

pendence). As we would expect, given the results of the two sample test for inde-

pendent proportions, we reject the null hypothesis, so we have evidence to suggest

that smoking status and villi atrophy are dependent, so incidence of villi atrophy

changes based on smoking status.

Other than the previously mentioned independence assumption, the only other as-

sumption for the chi-square test is that no more than 20% of expected cell counts

have a value less than 5. From the output, we see that no cell counts are less

than 5, so using the chi-square test of independence is appropriate. Note that the

output also gives the lowest expected cell count. In fact, this quantity will always

correspond to the smallest value of n1p̂, n1 (1− p̂), n2p̂, and n2 (1− p̂). Thus, if

this value is greater than 5, we know that the sample size requirements are met

to use the normal approximation for a two sample test for independent propor-

tions. At the same time, n1p̂, n1 (1− p̂), n2p̂, and n2 (1− p̂) give the expected cell

counts for a chi-square test. These further highlight that these two tests are in fact

identical except that they use different probability distributions (standard normal

distribution versus chi-square distribution). Consequently, SPSS does not include

a function for two sample tests for independent samples because a chi-square test

of independence does exactly the same thing.

To display our data, we note that our data are categorical, so a good way to graph

the data is through bar plots. In Figure 7, we give two similar barplots of smoking

status by villi atrophy. The plot on the top gives a comparison of smokers with

non-smokers, whereas the bottom gives a comparison across the different smoking

classes. These suggest that the likelihood of villi atrophy increases with smoking.

8

Figure 7: Bar plots of smoking status villi atrophy.

9

3. An environmental scientist studying the impact of pollution on species diversity

along two nearby rivers carried out a survey in which plots (quadrats) of size 30

metres by 20 metres were randomly chosen from along the banks of the rivers.

Within each quadrat the numbers of different tree species were recorded. The data

were as follows:

Valley River Ridge River

9 9 15 12 13 13 10 6 7 10

13 13 8 11 9 9 18 6 9 9

10 9 14 11 7 8 6 11

What would you conclude from these data in terms of differences in species diver-

sity? Think about the nature of the data, what might be the best way to compare

them, what assumptions are being made in the comparison, etc. Are there any

values which might need special consideration? What is their effect on the various

analyses if included or excluded?

Examining boxplots of species for each river, given by Figure 8, the value of 18

species recorded for the Ridge River is an outlier. Without a valid reason to omit

this data point (e.g., the value was clearly recorded wrong), it should remain in the

analysis. However, for instructional purposes we will carry out analyses including

and excluding this data point to see how much of an impact it has on the analyses.

We would like to test the hypotheses

H0 : µV = µR

H1 : µV 6= µR

where µV denotes the average number of species for the Valley River, and µR denotes

the average number of species for the Ridge River. We might consider carrying out

a t-test for two independent means. To carry out such a test, we must assume that

the two rivers are independent of each other and that sampled quadrats within each

river are independent of each other. Since the number of quadrats is not all that

large for either river, we must see if it is plausible that the number of species per

quadrat comes from an approximately normal distribution for each river. To do this,

we look at Q-Q plots for species numbers separately for the two rivers. Such plots

are given in Figure 9. We note that the points fall quite close to the line with the

lone exception being the outlier, so it seems plausible that the normality assumption

has been met. The third assumption that must be checked is that of equal variance

10

Figure 8: Boxplots of species per quadrat for the .

for number of species per quadrat for the two rivers. SPSS automatically carries out

Levene’s test with a t-test for two independent means, and output from Levene’s

test and the t-test is given in Figure 10. Levene’s test essentially tests

H0 : σ2V = σ2

R

H1 : σ2V 6= σ2

R

where σ2V denotes the population variance for number of species per quadrat for

the Valley River, and σ2R denotes the population variance for number of species

per quadrat for the Ridge River. This test produces a p-value of 0.696 which is

larger than any reasonable significance level, so we do not reject the null hypothesis,

meaning that we have insufficient evidence to conclude that the variances for number

of species per quadrat are different for the two rivers. Thus, it seems plausible that

our assumption of equal variances is met. With the necessary assumptions met, we

proceed with the t-test and note that we obtained a test statistic of t = 1.711 on

26 degrees of freedom. The corresponding p-value of 0.099 is not significant at the

α = 0.05 significance level, so we would not reject H0, and we cannot conclude that

there is a difference between the two rivers in terms of diversity of the species.

To assess the impact of the outlier on this result, we repeat the analysis with this

outlier omitted. Doing so, our Q-Q plot for the Ridge River is given by Figure 11

and more strongly suggests normality than before. Repeating Levene’s test on this

data, we obtain the output given in Figure 12. The p-value of 0.540 again suggests

11

Figure 9: Q-Q plots for species numbers per quadrat for the Valley River (top) and Ridge

River (bottom).

Figure 10: Output for Levene’s test and t-test for two independent means.

12

Figure 11: Q-Q plot for species numbers per quadrat for the Ridge River.

that it is plausible that the assumption of equal variances is met, so the necessary

requirements of the test are still met, and we repeat our t-test for the original data,

minus the outlier. The output is given in Figure 12. With the outlier removed, the

Figure 12: Output for Levene’s test and t-test for two independent means with outlier

removed.

test statistic is given by t = 2.838 with 25 degrees of freedom, and the p-value is

0.009. This p-value is smaller than most any significance levels we might consider, so

we reject the null hypothesis, meaning that we have evidence to suggest that there

is a difference between the two rivers in terms of diversity of the species. Thus, this

outlier has a rather significant impact on the results of our analysis.

Non-parametric tests are not as adversely affected by outliers, so we might have

considered a Wilcoxon rank-sum/Mann-Whitney U test instead of a t-test. This

test has only the minimal assumption that the two rivers are independent of each

other and sampled quadrats are independent of each other for each river, and the

13

hypotheses are given by

H0 : µ̃V = µ̃R

H1 : µ̃V 6= µ̃R

where µ̃V denotes the median number of species per quadrat for the Valley River,

and µ̃R denotes the median number of species per quadrat for the Ridge River.

Output for such a test on the full data set is given in Figure 13. The p-value of

0.058 means that we would not reject the null hypothesis at the α = 0.05 significance

level, so we have insufficient evidence to suggest that there is a difference in the

median number of species per quadrat for the Valley River and the Ridge River.

Figure 13: Output for Wilcoxon rank-sum/ Mann-Whitney U test.

If we remove the outlier, we obtain the results given in Figure 14. As we would

expect, we get a smaller p-value of 0.019 when we remove the outlier, and we

would reject the null hypothesis at the α = 0.05 significance level. Thus, we would

conclude that we have sufficient evidence to suggest that there is a difference in the

median number of species per quadrat for the Valley River and the Ridge River.

Figure 14: Output for Wilcoxon rank-sum/ Mann-Whitney U test with outlier removed.

Note that, with the outlier included in the analysis, the p-value produced by the

Wilcoxon rank-sum/Mann-Whitney U test is substantially smaller than that pro-

duced by the t-test. When we remove the outlier, we observe the opposite relation-

ship, as the p-value produced by the Wilcoxon rank-sum/Mann-Whitney U test is

larger than that produced by the t-test. This highlights important aspects of para-

metric and non-parametric tests. Parametric tests are quite sensitive to departures

14

from the assumed distributions of the samples (in the case of the t-test, normal

distributions), so outliers will have a significant impact on the results and tend to

inflate the p-value. Since non-parametric tests are not tied to specific distributional

assumptions for the samples, outliers have a much less significant impact, so they

tend to inflate the p-value much less when outliers are present. At the same time,

if there are no outliers present and the samples fit the distributional assumptions of

the parametric test, that test will virtually always produce a smaller p-value than

a non-parametric tests, as the minimal distribution assumptions of non-parametric

tests will make their estimates of the p-value more conservative.

15

workshop 4 -solutions

Education

sign test

appropriate test

test statistic

low shrinkage temperature

wilcoxonsigned rank

mean temperature

median temperature

shrinkage temperatureless