
  • 7/24/2019 Ch 11 Winkler

    1/52

    BASIC STATISTICS:

    A USER ORIENTED APPROACH

    (Manuscript)

    Spyros MAKRIDAKIS

and Robert L. WINKLER

    CHAPTER 11


    CHAPTER 11

    HYPOTHESIS TESTING: MEANS AND PROPORTIONS

    11.1 Introduction

In Chapters 8 and 9 we discussed problems of estimation, where we want to come up with an estimate of a parameter and to have some idea of the accuracy of that estimate. No specific values

    of the parameter are singled out in advance for special attention. We simply take a sample and

    calculate the desired point or interval estimates.

Another form of statistical inference is called hypothesis testing. In hypothesis testing, a specific value or set of values of a parameter is singled out in advance. For example, if a car manufacturer claims that the mean gasoline mileage obtained by a certain model in city driving is 32 miles per gallon, the value μ = 32 is being singled out. Claims such as the statement that μ = 32 are called hypotheses.

Suppose that the members of a consumer group are suspicious about the manufacturer's claim that μ = 32. In particular, they suspect that the mean gasoline mileage in city driving for this model is in fact less than 32 miles per gallon. This is another hypothesis: the hypothesis that μ < 32. If the alternative hypothesis is μ > 32, we reject H₀ if x̄ is sufficiently high. With a two-tailed test we reject H₀ if x̄ is sufficiently different from 32 (that is, if x̄ is sufficiently high or sufficiently low). The direction of the rejection region (to the left, to the right, both sides) is the same as the direction of the alternative hypothesis Hₐ.

The specification of a test statistic and rejection region completes the setting up of the test. All that remains is to calculate the test statistic from the data. If it falls in the rejection region, H₀ should be rejected; if it does not fall in the rejection region, H₀ should be accepted. Remember that accepting H₀ does not necessarily mean that we believe that H₀ is true. It may simply imply that we do not have enough evidence to reject H₀ or that we are reserving judgment.

    In this section we have attempted to give you some idea of the nature of hypothesis testing.

    Much of the terminology of hypothesis testing has been introduced, although more terms will be

    defined and discussed in later sections. Now we are ready to apply the ideas of hypothesis testing


to tests involving a mean μ (Sections 11.3 and 11.4) and a proportion p (Section 11.5). If you gain an appreciation of the general idea of hypothesis testing from this chapter and you can follow the specific hypothesis-testing procedures for means and proportions, then other tests developed in later chapters will seem like "variations on a theme" and should not be difficult to

    understand.

    11.3 Hypothesis Testing: Means

    (Large Samples)

Armed with some general ideas about hypothesis testing from the previous section, you are now ready to consider specific methods for dealing with hypotheses about a mean μ. We will begin by analyzing the gasoline mileage example with hypotheses

H₀: μ = 32 and Hₐ: μ < 32.

Recall that we use the sample mean x̄ as a point estimate of μ. Moreover, from the Central Limit Theorem, which was covered in Section 7.4, the sampling distribution of x̄ is approximately normal if the sample size is not too small. Also, the standard error of the mean is σ/√n. Therefore,

z = (x̄ − μ)/(σ/√n)

has approximately a standard normal distribution. All we are doing here is standardizing x̄ (subtracting its mean and dividing by its standard error σ/√n) and invoking the Central Limit Theorem to justify the claim of normality.


    In the gasoline mileage example, suppose that the sample consists of the gasoline mileage for

each of 40 cars and that the standard deviation of gasoline mileage for city driving with this type of car is σ = 4.5 miles per gallon. Therefore, the standard error of the mean is σ/√n = 4.5/√40 = 0.71 miles per gallon. Now, if H₀ is true (that is, if μ = 32), then

z = (x̄ − μ)/(σ/√n) = (x̄ − 32)/(4.5/√40) = (x̄ − 32)/0.71

What if we reject whenever x̄ ≤ 31? The rejection region x̄ ≤ 31 corresponds to

z ≤ (31 − 32)/0.71 = −1/0.71 = −1.41

From the cumulative normal probabilities given in Table 3, the probability that z ≤ −1.41 is

P(z ≤ −1.41) = P(z ≥ 1.41) = 1 − P(z ≤ 1.41) = 1 − 0.92 = 0.08.

But this is the probability that x̄ ≤ 31 given that H₀ is true (since μ = 32 was used to find z = −1.41). If the rejection region is x̄ ≤ 31, this is the probability of rejecting H₀ when H₀ is true, which is the error probability α. We can represent α as an area in the left tail of the normal distribution, as shown in Figure 11.4.

What does this result mean? It means that when we use the rejection region x̄ ≤ 31 (or equivalently, z ≤ −1.41), if H₀ is true then the probability of rejecting H₀ is 0.08. Even though


the true mean is μ = 32, sampling fluctuations are such that the chance of a sample mean of 31 or less is 0.08. If we were to take repeated samples of size 40 from a population with μ = 32 and σ = 4.5, 8 percent of the samples would yield sample means of 31 or less.
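This calculation is easy to check numerically. Here is a minimal sketch in Python (our own illustration, not part of the text); the standard library's NormalDist supplies the cumulative normal probabilities that Table 3 tabulates:

```python
from math import sqrt
from statistics import NormalDist

mu0 = 32.0        # hypothesized mean under H0 (miles per gallon)
sigma = 4.5       # population standard deviation
n = 40            # sample size
se = sigma / sqrt(n)            # standard error of the mean, about 0.71

# z value corresponding to the rejection region x-bar <= 31
z_crit = (31 - mu0) / se        # about -1.41

# alpha = P(x-bar <= 31 | H0 true) = P(z <= -1.41)
alpha = NormalDist().cdf(z_crit)
print(round(se, 2), round(z_crit, 2), round(alpha, 2))   # 0.71 -1.41 0.08
```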

Perhaps α = 0.08 is judged to be too great a risk of a Type I error in this case. How can we change the rejection region to make α = 0.01? Since this is a one-tailed test to the left, α is a left-hand tail area under the normal curve, as in Figure 11.4. To make the area smaller, the critical value, which is the dividing line between acceptance and rejection, needs to be shifted to the left. The critical value of z is −1.41 in Figure 11.4. If we want α = 0.01, the critical value of z should be the first percentile of z. From Table 3, the first percentile of z is −2.33, since P(z > −2.33) = 0.99. Thus,

P(z ≤ −2.33) = 0.01,

and a rejection region of z ≤ −2.33 gives us an α of 0.01, as shown in Figure 11.5. Now we have a decision rule with α = 0.01:

reject H₀ if z ≤ −2.33,
accept H₀ if z > −2.33.

Suppose that the sample mean for the sample of 40 cars turns out to be x̄ = 30.21 miles per gallon. The computed value of the test statistic z is then

z = (30.21 − 32)/0.71 = −1.79/0.71 = −2.52

Since this is less than the critical value of z, −2.33, we reject the manufacturer's claim that μ = 32 in favor of the alternative hypothesis that μ < 32. Alternatively, the rejection region can be expressed in terms of x̄; z ≤ −2.33 corresponds to


x̄ ≤ 32 − 2.33(0.71), or x̄ ≤ 30.35.

    The observed sample mean, 30.21, is in the rejection region.

How can we interpret the results of the test? With α = 0.01, the observed sample mean of 30.21 miles per gallon is far enough below 32 to make us reject H₀. The difference between 30.21 and

    32 is not within the range of what might be expected due to chance fluctuations.
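The complete test for the gasoline mileage example, critical value, test statistic, and decision, can be sketched as follows (Python and the variable names are our choices, not the text's; inv_cdf plays the role of reading Table 3 in reverse):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 32.0, 4.5, 40
xbar = 30.21                     # observed sample mean
se = sigma / sqrt(n)

# critical value for a one-tailed test to the left with alpha = 0.01:
# the first percentile of the standard normal distribution
z_crit = NormalDist().inv_cdf(0.01)        # about -2.33

z = (xbar - mu0) / se                      # about -2.52
reject_h0 = z <= z_crit
print(round(z, 2), round(z_crit, 2), reject_h0)    # -2.52 -2.33 True
```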

Let us consider another example. Suppose that a study indicates that the mean IQ score among college students in the U.S. is 115 and the standard deviation is 10. The president of a large state

    university feels that the mean IQ at that university is higher than 115. Unfortunately, IQ tests are

    not administered routinely and are not included in students' files at the university.

    Therefore, the president decides to gather some data to test

H₀: μ = 115 and Hₐ: μ > 115,

where μ is the mean IQ score among students at the university. A random sample of 50 students is selected, and these 50 students take an IQ test.

    The test statistic in this example is

z = (x̄ − 115)/(10/√50) = (x̄ − 115)/1.41


since σ = 10, n = 50, and the hypothesized mean in H₀ is 115. Furthermore, since this is a one-tailed test to the right (because Hₐ is μ > 115), we will reject H₀ for large values of z. If α = 0.05, the critical value of z is the 95th percentile of the standard normal distribution, which is 1.64 (from Table 3). The rejection region is z ≥ 1.64, as shown in Figure 11.6.

    The IQ tests are administered, and the average IQ score for the 50 students is 116.72. The test

    statistic is

z = (116.72 − 115)/1.41 = 1.72/1.41 = 1.22

But the rejection region is z ≥ 1.64. The observed x̄ of 116.72 is not large enough to enable the president to reject H₀. With α = 0.05, this x̄ is within the bounds of what might be expected just by chance even if μ = 115. In order to reject H₀ when α = 0.05, we need to have z ≥ 1.64, which, when converted to x̄, corresponds to

x̄ ≥ 115 + 1.64(1.41) = 117.31.

For this test, the sample mean IQ score must be at least 2.31 points above the hypothesized mean 115 in order to reject H₀.
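The IQ example follows the same pattern; here is a sketch of our own (the text's threshold of 117.31 comes from the rounded values 1.64 and 1.41, so full precision gives 117.33):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 115.0, 10.0, 50
xbar = 116.72
se = sigma / sqrt(n)                       # about 1.41

z = (xbar - mu0) / se                      # about 1.22
z_crit = NormalDist().inv_cdf(0.95)        # 95th percentile, about 1.64
print(round(z, 2), z > z_crit)             # 1.22 False: H0 is not rejected

# smallest sample mean that would lead to rejection
xbar_min = mu0 + z_crit * se
print(round(xbar_min, 2))                  # 117.33
```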

    The gasoline mileage example and the IQ example illustrate one-tailed tests to the left and right,

respectively. How about a two-tailed test involving a mean μ? Consider a small firm in the French province of Brittany. The firm bottles and sells the region's famous apple cider. In the

    process, a machine that automatically fills bottles with apple cider is used. The bottles are

    supposed to hold one liter (100 centiliters) each, but the actual amount varies slightly from bottle

    to bottle because of chance variation. Extensive measurements of such variation in the past

indicate that the standard deviation is about 1.5 centiliters, or 0.015 liters. The manager of the firm is concerned about μ, the mean amount of cider (in centiliters) per bottle. If μ > 100, then the bottles are being overfilled on the average, and such overfilling is costly to the firm. On the


other hand, if μ < 100, then the bottles tend not to hold as much as the label claims, which is also an undesirable state of affairs since it is not fair to the consumer. The firm wants μ to equal 100 centiliters and would want to stop the bottling process and adjust the machine to eliminate any deviations from μ = 100.

    The hypotheses in this example are

H₀: μ = 100 and Hₐ: μ ≠ 100.

    Suppose that the firm takes a sample of 30 bottles and measures the amount of cider in each

    bottle. The measurement process is very accurate, providing accuracy to the nearest hundredth of

    a centiliter. Thus, any variation in the measurement process can be ignored for all practical

    purposes.

    The test statistic is

z = (x̄ − 100)/(1.5/√30) = (x̄ − 100)/0.27.

Let α = 0.05. In the previous examples, H₀ was to be rejected only for low values of z (the gasoline mileage example) or only for high values of z (the IQ example). This is a two-tailed test, however, and we should reject H₀ if z deviates too much from zero in either direction. Now α is not represented by the area in one tail of the standard normal distribution, as in Figures 11.4, 11.5, and 11.6. Instead, the α of 0.05 is split into an area of 0.025 in the left tail and an area of 0.025 in the right tail. From Table 3, the value z = 1.96 cuts off an area of 0.025 in the right tail.

    0.025 in the right tail. From Table 3, the value z = 1.96 cuts off an area of 0.025 in the right tail.

    By the symmetry of the normal distribution, z = -1.96 cuts off an area of 0.025 in the left tail.

    Thus, the decision rule is as follows:


reject H₀ if z ≤ −1.96 or z ≥ 1.96,
accept H₀ otherwise (that is, if −1.96 < z < 1.96).

    This is illustrated in Figure 11.7. Using the notation introduced in the discussion of interval

estimation in Chapter 8, z_{α/2} = 1.96.

    The sample of 30 bottles is taken and the amount of cider in each bottle is measured. The sample

mean of the 30 measurements is x̄ = 100.93. On average, the machine appears to be overfilling the bottles by almost a centiliter per bottle.

    Is this much overfilling likely to occur by chance in a sample of n = 30, or is it an indication that

    the machine needs adjusting? Let's compute the test statistic z:

z = (100.93 − 100)/0.27 = 3.44

    But z = 3.44 is much larger than the critical value of z = 1.96 in Figure 11.7. This much

overfilling is highly unlikely to occur by chance in a sample of n = 30, and H₀ should be

    rejected.
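The two-tailed decision rule can be checked the same way; here the rejection condition is on |z| (the text's value of 3.44 comes from rounding the standard error to 0.27, so full precision gives about 3.4):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 100.0, 1.5, 30
xbar = 100.93
se = sigma / sqrt(n)                       # about 0.27

z = (xbar - mu0) / se
z_crit = NormalDist().inv_cdf(0.975)       # z_{alpha/2} for alpha = 0.05, about 1.96
reject_h0 = abs(z) >= z_crit               # two-tailed: reject if |z| is too large
print(round(z, 2), reject_h0)              # 3.4 True
```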

    A review of the procedures discussed in this section is in order.

    Step 1. Formulate the hypotheses. The null hypothesis is of the form

H₀: μ = μ₀,

where μ₀ stands for the specific value given in H₀ (μ₀ = 115 in the IQ example, for instance). We have considered three types of alternative hypotheses:


Hₐ: μ < μ₀ (one-tailed test to the left),
Hₐ: μ > μ₀ (one-tailed test to the right),
Hₐ: μ ≠ μ₀ (two-tailed test).

    Step 2. Determine the appropriate test statistic. The test statistic used in this section is

z = (x̄ − μ₀)/(σ/√n)

Step 3. Specify a rejection region. For a given value of α, the decision rule is:

reject H₀ if z ≤ −z_α for a one-tailed test to the left;
reject H₀ if z ≥ z_α for a one-tailed test to the right;
reject H₀ if z ≤ −z_{α/2} or if z ≥ z_{α/2} for a two-tailed test.

Here z_α represents the value of z cutting off the area α in the right tail of the normal curve (and z_{α/2}, of course, cuts off α/2 in the right tail of the normal curve). When α = 0.05, for example, z_α = 1.64 and z_{α/2} = 1.96; when α = 0.01, z_α = 2.33 and z_{α/2} = 2.58. If you prefer, the rejection region can be expressed in terms of x̄:

reject H₀ if x̄ ≤ μ₀ − z_α·σ/√n for a one-tailed test to the left;
reject H₀ if x̄ ≥ μ₀ + z_α·σ/√n for a one-tailed test to the right;
reject H₀ if x̄ ≤ μ₀ − z_{α/2}·σ/√n or if x̄ ≥ μ₀ + z_{α/2}·σ/√n for a two-tailed test.

Step 4. Compute the values of the sample mean x̄ and the test statistic z from the data. If the observed z falls in the rejection region, reject H₀. Otherwise, accept H₀.
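The four steps can be collected into a single routine. This is a sketch under our own naming, not a function from the text:

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Large-sample z test for a mean; tail is 'left', 'right', or 'two'.
    Returns the test statistic and whether H0 should be rejected."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    if tail == "left":
        return z, z <= NormalDist().inv_cdf(alpha)          # reject if z <= -z_alpha
    if tail == "right":
        return z, z >= NormalDist().inv_cdf(1 - alpha)      # reject if z >= z_alpha
    z_half = NormalDist().inv_cdf(1 - alpha / 2)            # z_{alpha/2}
    return z, abs(z) >= z_half

# the gasoline mileage test: one-tailed to the left, alpha = 0.01
z, reject = z_test_mean(30.21, 32, 4.5, 40, alpha=0.01, tail="left")
print(round(z, 2), reject)                                  # -2.52 True
```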


    It is important to keep in mind the fact that this test is based on the normality of z. This is

justified by the Central Limit Theorem, which holds for large samples; n ≥ 30 is often used as a rule of thumb, although smaller values of n may work if the population distribution is symmetric

    and "mound-shaped" like the normal distribution. As a result, the test given in this section is

called a large-sample test. Tests for μ when the sample is small will be discussed in the next section, Section 11.4.

Another point worth noting is the fact that we need to know the population standard deviation σ in order to compute the test statistic z. Given that we do not know μ (we are testing hypotheses about μ), are we likely to be able to specify σ? In some cases, past data may provide reliable information about σ. However, often we have no more than a rough guess about σ. Since the test is a large-sample test, we are saved by the fact that the sample standard deviation s, which can be computed from the formula

s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) ),

provides a good estimate of σ for large samples. Therefore, s can be used in place of σ when we calculate the test statistic z for large samples.

In the gasoline-mileage example, suppose that the standard deviation σ is not known. From the data (the miles per gallon obtained from each of the 40 cars), the standard deviation is found to be s = 4.16. (We already computed x̄ = 30.21.) With s used in place of σ, the test statistic is

z = (30.21 − 32)/(4.16/√40) = −1.79/0.66 = −2.71.

This is less than the critical value of z = −2.33 (see Figure 11.5), and H₀ should therefore be

    rejected.


    11.4 Hypothesis Testing: Means

    (Small Samples)

Next we consider tests of hypotheses about μ for small samples from normally distributed populations. Suppose that the US Environmental Protection Agency (EPA), as part of a review of

    national air quality standards, is interested in the level of ozone in the atmosphere in a particular

    region. In particular, the hypotheses

H₀: μ = 0.10 and Hₐ: μ > 0.10

are of interest, where μ represents the mean level of ozone in parts per million (ppm). Twelve measurements of ozone level are taken at randomly selected points in the region. The sample

    mean is 0.117 ppm, and the sample standard deviation is 0.016 ppm.

    The test statistic is

t = (x̄ − 0.10)/(s/√n)

and the number of degrees of freedom is 12 − 1 = 11. If α = 0.05, the rejection region for this one-tailed test to the right consists of the values of t in the upper 5 percent of the t distribution. From Table 4, the rejection region is t ≥ 1.796.

From the sample results,

t = (0.117 − 0.10)/(0.016/√12) = 0.017/0.0046 = 3.70.


    Since this value is larger than 1.796, we reject the null hypothesis that the mean level of ozone is

0.10 ppm in favor of the alternative hypothesis that the mean level is greater than 0.10 ppm.
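A quick check of the arithmetic (our sketch; the critical value 1.796 is copied from the text's Table 4, since the Python standard library does not tabulate the t distribution):

```python
from math import sqrt

# ozone example: n = 12, so we use the t statistic with 11 degrees of freedom
mu0 = 0.10
xbar, s, n = 0.117, 0.016, 12

t = (xbar - mu0) / (s / sqrt(n))     # the text's 3.70 uses the rounded se 0.0046
t_crit = 1.796                       # upper 5% point of t with 11 df, from Table 4
print(round(t, 2), t >= t_crit)      # 3.68 True
```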

The procedure used in testing hypotheses concerning μ for small samples from normally distributed populations can be summarized as follows.

Step 1. Formulate the hypotheses. The null hypothesis is of the form

H₀: μ = μ₀

and the alternative hypothesis is either

Hₐ: μ < μ₀ (one-tailed test to the left),
Hₐ: μ > μ₀ (one-tailed test to the right),
or Hₐ: μ ≠ μ₀ (two-tailed test).

Step 2. Determine the appropriate test statistic. The test statistic for small-sample tests involving μ is

t = (x̄ − μ₀)/(s/√n)

with n − 1 degrees of freedom.

Step 3. Specify a rejection region. For a given value of α, the decision rule is:

reject H₀ if t ≤ −t_{α,n−1} for a one-tailed test to the left;
reject H₀ if t ≥ t_{α,n−1} for a one-tailed test to the right;
reject H₀ if t ≤ −t_{α/2,n−1} or if t ≥ t_{α/2,n−1} for a two-tailed test.


Here t_α represents the value of t cutting off the area α in the right tail of the t curve (and t_{α/2}, of course, cuts off α/2 in the right tail). In terms of x̄, the rejection region is:

reject H₀ if x̄ ≤ μ₀ − t_{α,n−1}·s/√n for a one-tailed test to the left;
reject H₀ if x̄ ≥ μ₀ + t_{α,n−1}·s/√n for a one-tailed test to the right;
reject H₀ if x̄ ≤ μ₀ − t_{α/2,n−1}·s/√n or if x̄ ≥ μ₀ + t_{α/2,n−1}·s/√n for a two-tailed test.

Step 4. Compute the values of the sample mean x̄ and the sample standard deviation s from the data. Then compute t. Reject H₀ if the observed t falls in the rejection region; accept H₀ otherwise.
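The four steps can again be collected into one routine. Because the standard library has no t quantiles, the tabled critical value is passed in by hand; the function name and signature are our own:

```python
from math import sqrt

def t_test_mean(xbar, s, n, mu0, t_crit, tail="two"):
    """Small-sample t test for a mean. t_crit is the tabled critical value
    (t_{alpha,n-1} for a one-tailed test, t_{alpha/2,n-1} for a two-tailed
    test), looked up in a t table such as the text's Table 4."""
    t = (xbar - mu0) / (s / sqrt(n))
    if tail == "left":
        return t, t <= -t_crit
    if tail == "right":
        return t, t >= t_crit
    return t, abs(t) >= t_crit

# the ozone example: one-tailed to the right, t_crit = 1.796 with 11 df
t, reject = t_test_mean(0.117, 0.016, 12, 0.10, 1.796, tail="right")
print(round(t, 2), reject)          # 3.68 True
```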

    11.5 Hypothesis Testing: Proportions

When only two categories are of interest (such as a question that can only be answered yes or no,

    a part that is good or defective, a person who has a disease or does not have it, or a coin that can

    come up heads or tails), the parameter we deal with is a proportion (the proportion of yes

    answers, the proportion of defective parts, the proportion of people with the disease, the

    proportion of times heads comes up). In Sections 8.4 and 8.5 we discussed point and interval

    estimation for proportions. Now we will consider the testing of hypotheses about proportions. As

in the discussion of hypothesis testing for μ, we will give some examples and then provide a summary of the procedure.

    Suppose that for a particular form of cancer, the cure rate using the standard treatment has been

    0.60. That is, 60 percent of the patients with the disease have been cured in the sense that they

    are free of the disease for at least five years following the treatment. A group of medical

    researchers have developed a new treatment for the disease, and they claim that their treatment

    leads to a cure rate higher than 0.60. The hypotheses, then, are


H₀: p = 0.60

and Hₐ: p > 0.60,

    where p represents the cure rate with the new treatment.

    The researchers report that the new treatment has been tried on a number of patients at various

    clinics. Among those patients treated at least five years ago, 47 out of 61 have been cured. The

cure rate for these patients is x/n = 47/61 = 0.7705. This is encouraging, but is it sufficiently larger than 0.60 to make us reject H₀ and conclude that the new treatment is indeed more effective at

    curing the disease?

    The number of patients cured, x, has a binomial distribution, and if n is not too small we can use

    a normal approximation to the binomial distribution. As in Sections 8.4 and 8.5, we will work

    with the sample proportion x/n instead of x. The mean of x/n is p and the standard deviation of

the sampling distribution (i.e., the standard error) of x/n is √(p(1 − p)/n). Therefore, when we standardize x/n by subtracting its mean and dividing by its standard error, the z statistic we wind

    up with is

z = (x/n − p)/√(p(1 − p)/n)

Just as in testing hypotheses about μ, our test statistic is a standard normal z statistic.

In the cancer example, n is large enough for the normal approximation to be used. The value of p

    under the null hypothesis is 0.60, which means that np = 61(0.60) = 36.6 and n(1-p) = 61(0.40) =

    24.4. In Section 5.4 we suggested as a rule of thumb that the normal distribution provides a

reasonable approximation to the binomial distribution when np ≥ 5 and n(1 − p) ≥ 5. For the cancer example, np and n(1 − p) are both considerably larger than 5.


Let α = 0.05. As in the previous section, α can be represented as an area under the normal curve. Since this is a one-tailed test to the right (Hₐ consists of values of p greater than 0.60), the area must be in the right tail of the distribution. If we want α = 0.05, the critical value of z should be the 95th percentile of z (so that 95 percent of the area under the curve will be to the left of the critical value and 5 percent to the right). From Table 3, the 95th percentile of z is 1.64, since P(z ≤ 1.64) = 0.95. The rejection region is z ≥ 1.64, as shown in Figure 11.8.

Next we calculate the test statistic z. With 47 cured patients out of 61, we have

z = (47/61 − 0.60)/√(0.60(0.40)/61) = 0.1705/0.0627 = 2.72

    This is in the rejection region, since 2.72 > 1.64. The evidence is strong enough to make us

    believe that the new treatment yields a cure rate better than 0.60.
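A numerical check of the cancer example (the names are ours; NormalDist().inv_cdf(0.95) reproduces the 95th percentile read from Table 3):

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.60                  # cure rate under H0
x, n = 47, 61              # patients cured, patients treated
p_hat = x / n              # about 0.7705

se = sqrt(p0 * (1 - p0) / n)          # standard error under H0, about 0.0627
z = (p_hat - p0) / se                 # about 2.72
z_crit = NormalDist().inv_cdf(0.95)   # about 1.64 for alpha = 0.05
print(round(z, 2), z >= z_crit)       # 2.72 True: reject H0
```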

How large would x/n have to be to make us reject H₀ in this example? The rejection region of z ≥ 1.64 corresponds to x/n ≥ 0.60 + 1.64·√(0.60(0.40)/61), or x/n ≥ 0.7029.

A cure rate in the sample of at least 70.29 percent of the patients leads to rejection of H₀ when n = 61, α = 0.05, and the null hypothesis indicates we should expect only 60 percent to be cured.

The hypotheses for a test of whether a coin is fair or loaded were set up in Section 11.2. They are

H₀: p = 0.50 versus Hₐ: p ≠ 0.50,


where p represents the proportion of times heads comes up. To test H₀ versus Hₐ for a particular

    coin, the coin is tossed 100 times.

    The normal approximation can be used, since np = n(1-p) = 100(0.50) = 50 for n = 100 and p =

0.50. Suppose that α = 0.10. For a two-tailed test, that means that the rejection region can be represented by an area of 0.05 in the left tail and another area of 0.05 in the right tail. The critical

    values of z are thus the 5th and 95th percentiles, which are -1.64 and 1.64 (see Figure 11.9). The

    decision rule is:

reject H₀ if z ≤ −1.64 or if z ≥ 1.64,
accept H₀ otherwise (that is, if −1.64 < z < 1.64).

    The 100 tosses of the coin result in 46 heads and 54 tails. The test statistic can be computed as

    follows when x = 46 and n = 100:

z = (46/100 − 0.50)/√(0.50(0.50)/100) = −0.04/0.05 = −0.80

This value is well within the acceptance region. Although we expect 50 heads in 100 tosses if the coin is fair, 46 heads is not unusual enough to make us conclude that the coin is loaded.
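The coin example, checked the same way (our sketch; the critical values ±1.64 are the ones read from Table 3 for α = 0.10):

```python
from math import sqrt

p0 = 0.50                        # fair-coin probability of heads under H0
x, n = 46, 100                   # observed heads, tosses

se = sqrt(p0 * (1 - p0) / n)     # 0.05
z = (x / n - p0) / se            # -0.80
# two-tailed test with alpha = 0.10: reject if z <= -1.64 or z >= 1.64
reject_h0 = abs(z) >= 1.64
print(round(z, 2), reject_h0)    # -0.8 False: the coin is not declared loaded
```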

    For a third example, consider a union which is preparing for a strike vote. The union rules

    indicate that a strike will be called only if at least 80 percent of the members vote in favor of

    striking. The union's leaders, who recommend a strike, claim that the members support this

    recommendation and that the required number of pro-strike votes will be obtained. A polling

    organization decides to take a survey of 140 randomly-chosen union members in order to test

H₀: p = 0.80


versus Hₐ: p < 0.80,

where p represents the proportion of union members who favor a strike. A value of 0.05 is chosen for α, which means that H₀ will be rejected if z ≤ −1.64, as shown in Figure 11.10.

    Of the 140 members polled, 103 favor the strike and the remaining 37 are against the strike. A

    point estimate of p is 103/140 = 0.7357. This is less than 0.80, but is it sufficiently low to cause

    us to reject the claim of the union's leaders? The test statistic is

z = (103/140 − 0.80)/√(0.80(0.20)/140) = −0.0643/0.0338 = −1.90.

This is less than −1.64, and H₀ should be rejected.

    The procedures used to test hypotheses about p can be summarized as follows.

    Step 1. Formulate the hypotheses. The null hypothesis is of the form

H₀: p = p₀,

where p₀ stands for the specific value given in H₀ (p₀ is 0.60 in the cancer example, 0.50 in the coin example, and 0.80 in the strike vote example).

    We have considered three types of alternative hypotheses:

Hₐ: p < p₀ (one-tailed test to the left),
Hₐ: p > p₀ (one-tailed test to the right),
and Hₐ: p ≠ p₀ (two-tailed test).


    Step 2. Determine the appropriate test statistic. The test statistic used in this section is

z = (x/n − p₀)/√(p₀(1 − p₀)/n)

Step 3. Specify a rejection region. For a given value of α, the decision rule is:

reject H₀ if z ≤ −z_α for a one-tailed test to the left;
reject H₀ if z ≥ z_α for a one-tailed test to the right;
reject H₀ if z ≤ −z_{α/2} or if z ≥ z_{α/2} for a two-tailed test.

Here z_α represents the value of z cutting off the area α in the right tail of the normal curve (and z_{α/2}, of course, cuts off α/2 in the right tail of the normal curve). If you prefer, the rejection region can be expressed in terms of x/n:

reject H₀ if x/n ≤ p₀ − z_α·√(p₀(1 − p₀)/n) for a one-tailed test to the left;
reject H₀ if x/n ≥ p₀ + z_α·√(p₀(1 − p₀)/n) for a one-tailed test to the right;
reject H₀ if x/n ≤ p₀ − z_{α/2}·√(p₀(1 − p₀)/n) or if x/n ≥ p₀ + z_{α/2}·√(p₀(1 − p₀)/n) for a two-tailed test.

    Step 4. Compute the values of the sample proportion x/n and the test statistic z from the data. If

the observed z falls in the rejection region, reject H₀. Otherwise, accept H₀.
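As with the mean, the four steps can be wrapped in one routine; a sketch under our own naming:

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(x, n, p0, alpha=0.05, tail="two"):
    """Large-sample z test for a proportion; tail is 'left', 'right', or 'two'.
    Assumes n*p0 >= 5 and n*(1 - p0) >= 5 so the normal approximation applies."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    if tail == "left":
        return z, z <= NormalDist().inv_cdf(alpha)
    if tail == "right":
        return z, z >= NormalDist().inv_cdf(1 - alpha)
    return z, abs(z) >= NormalDist().inv_cdf(1 - alpha / 2)

# the strike vote example: one-tailed to the left, alpha = 0.05
z, reject = z_test_proportion(103, 140, 0.80, alpha=0.05, tail="left")
print(round(z, 2), reject)          # -1.9 True
```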

    11.6 Significance Levels

    In some instances everyone would agree that a particular hypothesis should be rejected. In the

    coin example from Section 11.5, suppose that the number of times heads comes up in 100 tosses

    was 6 instead of 46. With this result, it is hard to imagine anyone continuing to support the


    notion of the coin being fair. In fact, anyone who still thought the coin was fair might be a good

    person to bet against with bets on future tosses of the coin. When x = 6, the test statistic is

z = (6/100 − 0.50)/√(0.50(0.50)/100) = −0.44/0.05 = −8.80

    To say the least, this is an extreme value of z.

    At the other extreme, if x = 50, surely no one would consider this as evidence against the coin

    being fair. This value should lead to general agreement that Ho should not be rejected. Here we

    have

z = (50/100 − 0.50)/√(0.50(0.50)/100) = 0/0.05 = 0

    which is right in the middle of the standard normal distribution (the mean of the distribution, to

    be exact).

    The decision to accept or reject is not always so easy. In the strike vote example of Section 11.5,

z = −1.90 for a one-tailed test to the left. With α = 0.05, as used in the example, the rejection region is z ≤ −1.64, and the decision is therefore to reject H₀. However, what if α = 0.01? With this smaller value of α, the rejection region becomes z ≤ −2.33, since −2.33 is the first percentile of the distribution of z. But the calculated test statistic, z = −1.90, is greater than −2.33, as you can see from Figure 11.11. Having changed α to 0.01, we now accept H₀.

This example illustrates how the choice between accepting and rejecting hypotheses may depend on the value of α that is selected. Specifying a value of α enables us to determine a rejection region. The conventional justification for concentrating on α is that

    1. The hypotheses are set up so that a Type I error is more serious than a Type II error and


2. if "accept H₀" is interpreted as "fail to reject H₀", then we never accept H₀ literally and thus theoretically we never make a Type II error.

But the choice of α is often made in a somewhat arbitrary fashion. Over the years, α = 0.05 and α = 0.01 have been used most often. Unfortunately, these values are generally used more out of

    tradition than out of a careful consideration of the real-world problem at hand.

One way to avoid a choice of α is not to view hypothesis testing in decision-making terms. The approach of hypothesis testing presented in the preceding sections has been decision-oriented,

    emphasizing the choice between rejecting and accepting hypotheses. This requires a formal

    decision rule, or rejection region, that should be chosen before the data are analyzed.

    But what are we trying to accomplish in hypothesis testing? Often we are trying to see whether a

    particular result seems "real" or whether it might be just due to chance. Is the sample proportion

    103/140 = 0.7357 in the strike vote example an indication that the true population proportion p in

    favor of the strike is less than 0.80? Or could it be that 80 percent of the members favor the strike

    and the sample proportion is 0.7357 just because the sample, by the luck of the draw, happened

    to include an unusually high proportion of members against the strike? The null hypothesis that p

    = 0.80 is a claim that the result (the difference between 0.7357 and 0.80) is due to chance. The

    alternative hypothesis that p < 0.80 is an alternative claim that the difference is real.

    How strong is the evidence against p = 0.80? The observed x/n of 0.7357 corresponds to z =

−1.90. From the sampling distribution of z shown in Figure 11.12, you can see that −1.90 is in the left tail of the distribution. How surprised should we be to get a z of −1.90? The chance of getting a z this far to the left of zero or farther (remember, this is a one-tailed test to the left) is represented by the shaded area in Figure 11.12. But this is P(z ≤ −1.90), which can be found by using Table 3; its value is approximately 0.03. This means that if p were exactly 0.80 and we took repeated samples of size n = 140, about 3 percent of the samples would result in a sample

    proportion x/n less than or equal to 0.7357 (that is, less than or equal to x = 103 people in favor


of the strike). The probability P(z ≤ −1.90) = 0.03 is called the observed significance level for the test of H₀: p = 0.80 versus Hₐ: p < 0.80 in the strike vote example.

    Observed Significance Level: The chance of getting a test statistic as extreme as or more

    extreme than the value of the test statistic that is actually obtained, given that the null

    hypothesis is true. Another name for an observed significance level is a P-value.

The lower the observed significance level (or P-value) is, the more surprised someone who believes the null hypothesis should be. A low P-value suggests that if H₀ is true, we have witnessed an unusual sample. Thus, the lower the P-value is, the stronger the evidence is against H₀. Consistent with the widespread use of 0.05 and 0.01 as values of α is the interpretation of an

    observed significance level of 0.05 or less as a "statistically significant result" and of 0.01 or less

    as a "highly statistically significant result". Alternative expressions are "a result significant at the

    0.05 level" or "a result significant at the 0.01 level" or "a result significant at the (insert the P-

    value here) level". The term "significant" comes from the initial question of whether the

    difference between the sample results and the null hypothesis (for example, between 0.7357 and

    0.80 in the strike vote example) is "real", or "significant". In fact hypothesis tests are often

    referred to as tests of significance.
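As a quick check of the arithmetic in the strike vote example, the short Python sketch below recomputes z and the observed significance level. It uses only the standard library; the normal CDF is built from math.erf rather than read from Table 3.

```python
import math

def phi(z):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Strike vote example: H0: p = 0.80 versus HA: p < 0.80, with x = 103 of
# n = 140 members in favor of the strike.
n, x, p0 = 140, 103, 0.80
p_hat = x / n                               # sample proportion, about 0.7357
se = math.sqrt(p0 * (1.0 - p0) / n)         # standard error under H0
z = (p_hat - p0) / se                       # about -1.90
p_value = phi(z)                            # left-tail area, about 0.03
print(round(z, 2), round(p_value, 3))
```

The printed P-value of roughly 0.03 matches the value read from Table 3 above.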

In the gasoline mileage example from Section 11.3, the test is one-tailed to the left and z = -1.41.

Thus,

P-value = P(z ≤ -1.41) = 0.08.

The observed significance level is 0.08, which provides some evidence against H0 but not

enough to satisfy someone who insists upon α = 0.05.

In the IQ example from Section 11.3, the test is one-tailed to the right and z = 1.22. Therefore,

P-value = P(z ≥ 1.22) = 0.11.


    Because the test is one-tailed to the right, the P-value is a right-tail area, not a left-tail area.

In the cancer example of Section 11.5, z = 2.72. The observed significance level is

P-value = P(z ≥ 2.72) = 0.003

for this one-tailed test to the right. This provides very strong evidence against the null hypothesis

that the cure rate for the new treatment is 0.60. Since the P-value is less than 0.01, the result

might be called highly statistically significant.

How about the coin example of Section 11.5? The result of 46 heads in 100 tosses yields z = -0.80, and the corresponding left-tail area is

P(z ≤ -0.80) = 0.21.

But the test is two-tailed (p = 0.50 versus p ≠ 0.50). Therefore, a z of +0.80 would be just as

extreme as a z of -0.80. As a result, we double the one-tail area to allow for the fact that this is a

two-tailed test. The observed significance level is

P-value = 2P(z ≤ -0.80) = 2(0.21) = 0.42.

A significance level this high indicates a result that is not at all surprising under the null

hypothesis.

Consider the bottle-filling example from Section 11.3, with z = 3.44. The one-tail area is

P(z ≥ 3.44) = 0.0003,

which means that the P-value for the two-tailed test (μ = 100 versus μ ≠ 100) is


    P-value = 2(0.0003) = 0.0006.

Finally, consider the ozone-level example in Section 11.4. The test statistic is t = 3.70 and the

number of degrees of freedom is 11. The test is one-tailed to the right. Therefore, the observed

significance level (P-value) is P(t ≥ 3.70). From Table 4, t0.995 = 3.106 and t0.999 = 4.025 with 11

degrees of freedom. Thus, the area to the right of 3.106 is 0.005, and the area to the right of

4.025 is only 0.001. Since the observed t of 3.70 is between 3.106 and 4.025, the observed

significance level is between 0.001 and 0.005. We would say that the results are significant at the

0.005 level but not at the 0.001 level. Of course, the 0.005 level is quite low, indicating that the

results would be highly unlikely if H0 were really true.

    In summary,

    P-value = area to the left of the observed test statistic (z or t, whichever is used) for a one-

    tailed test to the left;

    P-value = area to the right of the observed test statistic (z or t, whichever is used) for a

    one-tailed test to the right;

    P-value = twice the one-tailed P-value for a two-tailed test.
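These three rules can be collected into one small helper. The sketch below (not from the text) implements them for z tests using the standard library, and checks the result against the worked examples above.

```python
import math

def normal_p_value(z, tail):
    """Observed significance level for a z test, following the rules above:
    tail is 'left', 'right', or 'two'."""
    phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    if tail == "left":
        return phi(z)                 # area to the left of z
    if tail == "right":
        return 1.0 - phi(z)           # area to the right of z
    return 2.0 * phi(-abs(z))         # two-tailed: double the one-tail area

# Checked against the worked examples:
print(round(normal_p_value(-1.41, "left"), 2))    # gasoline mileage: 0.08
print(round(normal_p_value(1.22, "right"), 2))    # IQ: 0.11
print(round(normal_p_value(-0.80, "two"), 2))     # coin: 0.42
```

For t tests the same three rules apply, but the tail areas must come from the t distribution (Table 4) rather than the normal table.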

    We must emphasize the importance of looking at more than just the observed significance level

    in order to evaluate the practical importance of the results. A low significance level does not

    guarantee practical importance. For instance, in the gasoline mileage example, suppose that

    1,600 cars were used instead of 40, and the average mileage for the 1,600 cars was 31.68. The

    test statistic would be

z = (31.68 - 32) / (4.5/√1600) = -0.32/0.1125 = -2.84

For this one-tailed test to the left (μ = 32 versus μ < 32), the observed significance level would be


P-value = P(z ≤ -2.84) = .002.

Thus, the results appear to be highly statistically significant. But notice that x̄ = 31.68, which is

only about one-third of a mile per gallon less than the manufacturer's claim of 32 miles per

gallon. If the consumer group uses this difference of one-third of a mile per gallon to dispute the

manufacturer's claim, they will be laughed at. It is unlikely that anyone will care about such a

small difference, statistically significant or not.

    If the difference of one-third of a mile per gallon is not important, why does it lead to such a low

    observed significance level? Notice that the sample is very large (1,600 cars). Such a large

    sample can provide a high degree of accuracy, as you learned in Chapter 8. In hypothesis testing,

    this great accuracy means that the test becomes sensitive to very small differences from the null

    hypothesis. The standard error of 0.1125, or just over a tenth of a mile per gallon, is quite small.

The sample tells us that the evidence is strongly against μ being exactly 32. That does not mean μ cannot be close to 32, and it appears that it is in the neighborhood of 31.68.

The moral of this example for the consumer group is that a large sample size gives them more

    accuracy than they need. They might as well save some money by reducing their sample size. On

    the other hand, in some situations a high degree of accuracy, hence a large sample, may be

    desirable. In the bottle-filling example, a deviation of only 0.01 centiliters may be very costly to

    the firm when the number of bottles being filled and shipped out is considered. (If it is important

to the firm to detect small deviations from the null hypothesis of μ = 100, then a larger sample size may be needed.)
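To see how sample size alone drives the significance level, the sketch below recomputes the gasoline test for the same sample mean of 31.68 (with σ = 4.5, the value appearing in the z calculation above) at several sample sizes. The 0.32 mpg shortfall never changes, but the P-value shrinks as n grows.

```python
import math

phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

x_bar, mu0, sigma = 31.68, 32.0, 4.5   # the shortfall is 0.32 mpg throughout
for n in (40, 400, 1600):
    se = sigma / math.sqrt(n)          # standard error shrinks as n grows
    z = (x_bar - mu0) / se
    print(n, round(z, 2), round(phi(z), 3))   # n, z, left-tail P-value
```

With n = 40 the same sample mean would be unremarkable (a P-value above 0.3), while with n = 1,600 it looks highly significant — the practical difference of 0.32 mpg is identical in every row.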

    The moral of this example for anyone evaluating the results of their own or someone else's tests

    of hypotheses is that it is important to look beyond the statistical significance of such results to

    consider their practical significance. Of course, there are many well-designed studies which yield

    both. As you can see by now, interpreting hypothesis tests can be trickier than interpreting point

    and interval estimates. If you keep the underlying real-world situation in mind while evaluating

the statistical results (and look carefully at statistics such as x̄ as well as at significance levels),


    it's not that hard to figure out what is happening. And in reporting results of tests, you should

report some statistics (for example, x̄ when the test involves a mean μ) in addition to an observed significance level, so that others will be able to evaluate the results.

    11.7 Hypothesis Testing Using SPSS

    Go to Analyze > Compare Means > One-Sample T Test

Move the variable that you are testing into the Test Variable(s) box.

Example: A car cleaning service firm, FastCarClean, is trying to reduce the total service time

during peak periods from the current 30 minutes. A new procedure is implemented. If it is

successful, the entire chain will change to the new process. To test the effectiveness of the

new process, a random sample of 100 cars is surveyed. The data are provided in

CarCleanWaitingTime.xls. Are the new procedures effective? Test at α = .05.

    Step 1: Set up the hypothesis

H0: μ = 30

HA: μ < 30

This is a one-tail test.

Step 2: Determine the test statistic

α = .05.

    Steps 3 and 4

In SPSS, enter the null value (30) in the Test Value box.

The output from SPSS is:

    One-Sample Statistics


                  N      Mean    Std. Deviation    Std. Error Mean

WaitingTime     100    27.350           12.3525             1.2352

Observe from the top panel that the mean waiting time after the process change is 27.35 minutes. The bottom panel reveals that the two-tailed p-value is .034. Since this is a one-tail test, the p-value

for our one-tail test is .034/2 = .017.

Since the p-value of .017 is less than α (= .05), we reject the null. The firm has successfully

    reduced waiting time.
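The SPSS arithmetic can be verified by hand from the summary statistics alone. The sketch below recomputes the t statistic and converts the reported two-tailed p-value to the one-tailed value:

```python
# Recomputing the one-sample t statistic from the SPSS summary output
# (mean 27.350, standard error 1.2352, null value 30):
mean, se, mu0 = 27.350, 1.2352, 30.0
t = (mean - mu0) / se
print(round(t, 3))                    # about -2.145

# SPSS reports a two-tailed p-value of .034; for the one-tailed test
# HA: mu < 30, halve it:
p_one_tailed = 0.034 / 2
print(p_one_tailed)                   # .017
```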

11.8 Two-Sample t Tests

So far we have looked at one-sample hypothesis testing. We call it one-sample since only

one set of sample observations was gathered. Very often, we are interested in making

comparisons. For example, we may be interested in whether behaviour differs between two

sets of people: do Asian consumers spend more on entertainment than Caucasian consumers?

Are younger people more likely to try new products than older people? Is a new medication

    better than the existing one? Since we are interested in differences or comparisons between

people/events/strategies, we very often have to conduct hypothesis testing using more than one

    sample. To test for differences between Asian and Caucasian consumers, we will have to gather

    data from a sample of Asian consumers and another set of observations from a sample of

    Caucasian consumers. Now we are dealing with two samples. The basic process of hypothesis


    testing using two samples is the same as in the one-sample case. We have to formulate the

hypothesis, set up the rejection region, and calculate the test statistic. Two-sample hypothesis

tests can be of two types: independent-samples t tests and paired-samples t tests. We discuss

    each type next.

11.8.1 Independent-Samples t Test

The independent-samples t test is used when the data for the two samples are gathered independently of

each other. For example, if we are interested in differences in spending patterns between Asians

and Caucasians, we can get spending data by going to the Galleria and soliciting consumers. We ask

a random sample of 100 Asians (e.g., by asking every 10th Asian we see) how much they spend

on entertainment. We simultaneously get a random sample of 100 Caucasians (also by

contacting every 10th Caucasian we see) and ask them how much they spend on entertainment.

The crucial notion is that these two samples of 100 Asians and 100 Caucasians are not connected

in any way. The data for one sample (Asians) are collected independently of the data for the

second sample (Caucasians); that is why the test is called independent-samples.

A manager of a restaurant is considering offering a free drink to customers as soon as they sit down to

order. Some of her colleagues think it is a good idea that will encourage customers to order

more expensive dishes or buy more drinks. Other colleagues believe it will be bad for business,

as people who would otherwise have ordered drinks will no longer do so and will also spend

less on food. They suggest offering a free appetizer instead. She therefore decides to conduct an

experiment one night to see which option will be more effective. We will call the free drink

promotion strategy X and the old approach of appetizers strategy Y. The firm runs the experiment on

one night, giving 12 (nx) randomly selected customers a free drink while 12 (ny) other

randomly selected customers get a free appetizer. The manager will then compare how much

money customers have spent in order to decide which strategy to adopt. Before the data

collection begins, the firm needs to set up the hypothesis testing process as discussed in the

earlier sections.

    Step 1:


First, the null hypothesis. The null hypothesis is that both promotions are equally effective, which

is denoted by:

H0: μx = μy

This implies that mean sales under the drink promotion (μx) are the same as mean sales under the appetizer promotion (μy).

Depending on the firm's point of view, three alternative hypotheses are possible. First, the firm

may believe that promotion X will be more effective than promotion Y. This might be the

case if promotion Y is what the firm currently uses and a management consultant has

recommended that the firm try out promotion X. In this case the firm will set up the alternative

hypothesis to be:

HA: μx > μy

This is a one-tail test.

Alternatively, the firm's management might believe that promotion Y is more effective than

promotion X. In this case, the alternative hypothesis will be:

HA: μx < μy

This is also a one-tail test.

Finally, the management might be equivocal about the two promotions. In this case, the alternative

hypothesis will be:

HA: μx ≠ μy

This is a two-tailed test.

    Step 2: Determine the appropriate test statistic

There are two possible test statistics for the independent-samples test. The formulae for the

test statistics are quite cumbersome. Which one should be used depends on the nature of the data

gathered. First, the standard deviations of the two samples need to be examined. If the

standard deviations of the two samples are not significantly different (equal variances), then one

formula is used; if the standard deviations are significantly different (unequal variances), another

formula is used. You will not solve these problems by hand; SPSS will do this for you. But in

case you are curious, the formulae for the test statistics are given below.

t-statistic when sample variances are equal

    t = (x̄ − ȳ) / √( sp²/nx + sp²/ny ),  where  sp² = [ (nx − 1)sx² + (ny − 1)sy² ] / (nx + ny − 2),

and t has (nx + ny − 2) degrees of freedom.

t-statistic when sample variances are unequal

    t = (x̄ − ȳ − D0) / √( sx²/nx + sy²/ny ),  where  ν = ( sx²/nx + sy²/ny )² / [ (sx²/nx)²/(nx − 1) + (sy²/ny)²/(ny − 1) ],

and t has approximately ν degrees of freedom. Here D0 is the hypothesized difference between the two population means (zero under H0: μx = μy), and sx² and sy² are the two sample variances.
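To make the two formulas concrete, the sketch below computes both t statistics and the unequal-variance degrees of freedom for the promotions example. The per-customer bills are invented numbers for illustration, not data from the text.

```python
import math

def two_sample_t(x, y):
    """Pooled (equal-variance) and Welch (unequal-variance) t statistics
    for H0: mu_x = mu_y, i.e. hypothesized difference D0 = 0."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)   # sample variance of y
    # Equal variances: pooled estimate, nx + ny - 2 degrees of freedom.
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    t_pooled = (mx - my) / math.sqrt(sp2 / nx + sp2 / ny)
    # Unequal variances: Welch statistic with its own degrees of freedom.
    t_welch = (mx - my) / math.sqrt(vx / nx + vy / ny)
    nu = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t_pooled, t_welch, nu

# Hypothetical per-customer bills under the two promotions (invented),
# 12 customers per group as in the example.
drinks = [42, 55, 48, 61, 39, 52, 57, 44, 50, 46, 58, 49]        # strategy X
appetizers = [40, 47, 43, 51, 38, 45, 49, 41, 44, 42, 50, 43]    # strategy Y
t_pooled, t_welch, nu = two_sample_t(drinks, appetizers)
print(round(t_pooled, 3), round(t_welch, 3), round(nu, 1))
```

Note that when the two groups have the same size, the two t statistics coincide; they differ only in the degrees of freedom used to look up the P-value.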

As mentioned, SPSS computes both of these formulas for you, and its output reports the results from both.

Back to the promotions example: suppose the manager is equivocal about the two

promotions and sets up the hypotheses as follows.

Step 1:

H0: μx = μy (The effectiveness of the two promotions will be the same.)

HA: μx ≠ μy (The effectiveness of the two promotions will not be the same.)


See the fifth column of the output, which contains the P-values. The P-value calculated using the

equal-variances t-statistic formula is .021, while the P-value calculated using the

unequal-variances formula is .037. So which P-value should we use? We

should use the more conservative (larger) P-value, so that we are consistent with our philosophy of

reducing Type I error. We therefore use the P-value of .037.

11.8.2 Paired-Samples Hypothesis Test

When the sample data consist of matched pairs, the paired-samples test is required. If you are

interested in seeing whether a weight reduction plan is effective, then weight measurements are

taken for each individual before starting the plan and after completing it. The differences in weight for the

individual participants will indicate whether the plan significantly reduces weight. Similarly, if a

regulatory agency wants to check whether gas prices differ between two chains, it will check both

chains' prices in a number of zip codes and then measure the difference in gas prices within each zip code.

    Example

A CPG firm wants to decide which of two package designs to use. They randomly select 29

customers and show them both designs. Each customer rates both designs on a 1-10 point scale,

where 10 indicates "Excellent" and 1 means "Very Bad". The data are in PackageDesign.xls. Is

either design significantly preferred over the other? Test at α = 0.05.


    Step 1:

First, the null hypothesis. The null hypothesis is that both designs are equally preferred, which is

denoted by:

H0: μx = μy

Since neither design is favored in advance, the alternative is HA: μx ≠ μy, making this a two-tailed test.

Step 2: Decision Rule: α = 0.05

    Step 3:

    In SPSS, go to

Analyze > Compare Means > Paired-Samples T Test

Move the two variables into the Paired Variables box so that they appear side by side as one pair.


From the top panel, we can see that the mean preference for package design A is 7.07, while it is 6.34 for

package design B. Is this a significant difference? The bottom panel indicates a t-value of

-2.470 and a p-value of .02. Since the p-value is less than α = 0.05, we reject the null and conclude

that package design A is significantly preferred to package design B.
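The underlying paired computation is simple enough to sketch by hand: take the within-pair differences, then run a one-sample t test on them against zero. The ratings below are invented for illustration (the actual data are in PackageDesign.xls, which is not reproduced here).

```python
import math

def paired_t(a, b):
    """Paired-samples t statistic: tests whether the mean of the
    within-pair differences (a - b) is zero."""
    d = [ai - bi for ai, bi in zip(a, b)]
    n = len(d)
    d_bar = sum(d) / n
    s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))
    return d_bar / (s_d / math.sqrt(n)), n - 1   # (t, degrees of freedom)

# Hypothetical ratings: each customer rates both designs, giving matched pairs.
design_a = [8, 7, 9, 6, 8, 7, 7, 9, 8, 6]
design_b = [7, 6, 8, 6, 7, 7, 6, 8, 7, 6]
t, df = paired_t(design_a, design_b)
print(round(t, 2), df)   # positive t: design A rated higher on average
```

Pairing matters here: because the same customer produces both ratings, the customer-to-customer variation cancels in the differences, which is why this test is more sensitive than an independent-samples test on the same data.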
