
  • 7/24/2019 Ch 11 Winkler

    1/52

    BASIC STATISTICS:

    A USER ORIENTED APPROACH

    (Manuscript)

    Spyros MAKRIDAKIS

and Robert L. WINKLER

    CHAPTER 11


    CHAPTER 11

    HYPOTHESIS TESTING: MEANS AND PROPORTIONS

    11.1 Introduction

In Chapters 8 and 9 we discussed problems of estimation, where we want to come up with an estimate of a parameter and to have some idea of the accuracy of that estimate. No specific values

    of the parameter are singled out in advance for special attention. We simply take a sample and

    calculate the desired point or interval estimates.

Another form of statistical inference is called hypothesis testing. In hypothesis testing, a specific value or set of values of a parameter is singled out in advance. For example, if a car manufacturer claims that the mean gasoline mileage obtained by a certain model in city driving is 32 miles per gallon, the value μ = 32 is being singled out. Claims such as the statement that μ = 32 are called hypotheses.

Suppose that the members of a consumer group are suspicious about the manufacturer's claim that μ = 32. In particular, they suspect that the mean gasoline mileage in city driving for this model is in fact less than 32 miles per gallon. This is another hypothesis: the hypothesis that μ < 32. If the alternative hypothesis is μ > 32, we reject H₀ if x̄ is sufficiently high. With a two-tailed test we reject H₀ if x̄ is sufficiently different from 32 (that is, if x̄ is sufficiently high or sufficiently low). The direction of the rejection region (to the left, to the right, both sides) is the same as the direction of the alternative hypothesis Hₐ.

The specification of a test statistic and rejection region completes the setting up of the test. All that remains is to calculate the test statistic from the data. If it falls in the rejection region, H₀ should be rejected; if it does not fall in the rejection region, H₀ should be accepted. Remember that accepting H₀ does not necessarily mean that we believe that H₀ is true. It may simply imply that we do not have enough evidence to reject H₀ or that we are reserving judgment.

    In this section we have attempted to give you some idea of the nature of hypothesis testing.

    Much of the terminology of hypothesis testing has been introduced, although more terms will be

    defined and discussed in later sections. Now we are ready to apply the ideas of hypothesis testing


to tests involving a mean μ (Sections 11.3 and 11.4) and a proportion p (Section 11.5). If you gain an appreciation of the general idea of hypothesis testing from this chapter and you can follow the specific hypothesis-testing procedures for means and proportions, then other tests developed in later chapters will seem like "variations on a theme" and should not be difficult to

    understand.

    11.3 Hypothesis Testing: Means

    (Large Samples)

Armed with some general ideas about hypothesis testing from the previous section, you are now ready to consider specific methods for dealing with hypotheses about a mean μ. We will begin by analyzing the gasoline mileage example with hypotheses

H₀: μ = 32 and Hₐ: μ < 32.

Recall that we use the sample mean x̄ as a point estimate of μ. Moreover, from the Central Limit Theorem, which was covered in Section 7.4, the sampling distribution of x̄ is approximately normal if the sample size is not too small. Also, the standard error of the mean is σ/√n. Therefore,

z = (x̄ − μ)/(σ/√n)

has approximately a standard normal distribution. All we are doing here is standardizing x̄ (subtracting its mean and dividing by its standard error σ/√n) and invoking the Central Limit Theorem to justify the claim of normality.


    In the gasoline mileage example, suppose that the sample consists of the gasoline mileage for

each of 40 cars and that the standard deviation of gasoline mileage for city driving with this type of car is σ = 4.5 miles per gallon. Therefore, the standard error of the mean is σ/√n = 4.5/√40 = 0.71 miles per gallon. Now, if H₀ is true (that is, if μ = 32), then

z = (x̄ − μ)/(σ/√n) = (x̄ − 32)/(4.5/√40) = (x̄ − 32)/0.71

What if we reject whenever x̄ ≤ 31? The rejection region x̄ ≤ 31 corresponds to

z ≤ (31 − 32)/0.71 = −1/0.71 = −1.41

From the cumulative normal probabilities given in Table 3, the probability that z ≤ −1.41 is

P(z ≤ −1.41) = P(z ≥ 1.41) = 1 − P(z ≤ 1.41) = 1 − 0.92 = 0.08.

But this is the probability that x̄ ≤ 31 given that H₀ is true (since μ = 32 was used to find z = −1.41). If the rejection region is x̄ ≤ 31, this is the probability of rejecting H₀ when H₀ is true, which is the error probability α. We can represent α as an area in the left tail of the normal distribution, as shown in Figure 11.4.

What does this result mean? It means that when we use the rejection region x̄ ≤ 31 (or equivalently, z ≤ −1.41), if H₀ is true then the probability of rejecting H₀ is 0.08. Even though


the true mean is μ = 32, sampling fluctuations are such that the chance of a sample mean of 31 or less is 0.08. If we were to take repeated samples of size 40 from a population with μ = 32 and σ = 4.5, 8 percent of the samples would yield sample means of 31 or less.
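This calculation is easy to check numerically. Here is a minimal sketch in Python (our own illustration, not part of the text); the standard library's NormalDist supplies the cumulative normal probabilities that Table 3 tabulates:

```python
from math import sqrt
from statistics import NormalDist

mu0 = 32.0        # hypothesized mean under H0 (miles per gallon)
sigma = 4.5       # population standard deviation
n = 40            # sample size
se = sigma / sqrt(n)            # standard error of the mean, about 0.71

# z value corresponding to the rejection region x-bar <= 31
z_crit = (31 - mu0) / se        # about -1.41

# alpha = P(x-bar <= 31 | H0 true) = P(z <= -1.41)
alpha = NormalDist().cdf(z_crit)
print(round(se, 2), round(z_crit, 2), round(alpha, 2))   # 0.71 -1.41 0.08
```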

Perhaps α = 0.08 is judged to be too great a risk of a Type I error in this case. How can we change the rejection region to make α = 0.01? Since this is a one-tailed test to the left, α is a left-hand tail area under the normal curve, as in Figure 11.4. To make the area smaller, the critical value, which is the dividing line between acceptance and rejection, needs to be shifted to the left. The critical value of z is −1.41 in Figure 11.4. If we want α = 0.01, the critical value of z should be the first percentile of z. From Table 3, the first percentile of z is −2.33, since P(z > −2.33) = 0.99. Thus,

P(z ≤ −2.33) = 0.01,

and a rejection region of z ≤ −2.33 gives us an α of 0.01, as shown in Figure 11.5. Now we have a decision rule with α = 0.01:

reject H₀ if z ≤ −2.33,
accept H₀ if z > −2.33.

Suppose that the sample mean for the sample of 40 cars turns out to be x̄ = 30.21 miles per gallon. The computed value of the test statistic z is then

z = (30.21 − 32)/0.71 = −1.79/0.71 = −2.52

Since this is less than the critical value of z, −2.33, we reject the manufacturer's claim that μ = 32 in favor of the alternative hypothesis that μ < 32. Alternatively, the rejection region can be expressed in terms of x̄; z ≤ −2.33 corresponds to


x̄ ≤ 32 − 2.33(0.71), or x̄ ≤ 30.35.

    The observed sample mean, 30.21, is in the rejection region.

How can we interpret the results of the test? With α = 0.01, the observed sample mean of 30.21 miles per gallon is far enough below 32 to make us reject H₀. The difference between 30.21 and

    32 is not within the range of what might be expected due to chance fluctuations.
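The complete test for the gasoline mileage example, critical value, test statistic, and decision, can be sketched as follows (Python and the variable names are our choices, not the text's; inv_cdf plays the role of reading Table 3 in reverse):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 32.0, 4.5, 40
xbar = 30.21                     # observed sample mean
se = sigma / sqrt(n)

# critical value for a one-tailed test to the left with alpha = 0.01:
# the first percentile of the standard normal distribution
z_crit = NormalDist().inv_cdf(0.01)        # about -2.33

z = (xbar - mu0) / se                      # about -2.52
reject_h0 = z <= z_crit
print(round(z, 2), round(z_crit, 2), reject_h0)    # -2.52 -2.33 True
```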

Let us consider another example. Suppose that a study indicates that the mean IQ score among college students in the U.S. is 115 and the standard deviation is 10. The president of a large state

    university feels that the mean IQ at that university is higher than 115. Unfortunately, IQ tests are

    not administered routinely and are not included in students' files at the university.

    Therefore, the president decides to gather some data to test

H₀: μ = 115 and Hₐ: μ > 115,

where μ is the mean IQ score among students at the university. A random sample of 50 students is selected, and these 50 students take an IQ test.

    The test statistic in this example is

z = (x̄ − 115)/(10/√50) = (x̄ − 115)/1.41


since σ = 10, n = 50, and the hypothesized mean in H₀ is 115. Furthermore, since this is a one-tailed test to the right (because Hₐ is μ > 115), we will reject H₀ for large values of z. If α = 0.05, the critical value of z is the 95th percentile of the standard normal distribution, which is 1.64 (from Table 3). The rejection region is z ≥ 1.64, as shown in Figure 11.6.

    The IQ tests are administered, and the average IQ score for the 50 students is 116.72. The test

    statistic is

z = (116.72 − 115)/1.41 = 1.72/1.41 = 1.22

But the rejection region is z ≥ 1.64. The observed x̄ of 116.72 is not large enough to enable the president to reject H₀. With α = 0.05, this x̄ is within the bounds of what might be expected just by chance even if μ = 115. In order to reject H₀ when α = 0.05, we need to have z ≥ 1.64, which, when converted to x̄, corresponds to

x̄ ≥ 115 + 1.64(1.41) = 117.31.

For this test, the sample mean IQ score must be at least 2.31 points above the hypothesized mean 115 in order to reject H₀.
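The IQ example follows the same pattern; here is a sketch of our own (the text's threshold of 117.31 comes from the rounded values 1.64 and 1.41, so full precision gives 117.33):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 115.0, 10.0, 50
xbar = 116.72
se = sigma / sqrt(n)                       # about 1.41

z = (xbar - mu0) / se                      # about 1.22
z_crit = NormalDist().inv_cdf(0.95)        # 95th percentile, about 1.64
print(round(z, 2), z > z_crit)             # 1.22 False: H0 is not rejected

# smallest sample mean that would lead to rejection
xbar_min = mu0 + z_crit * se
print(round(xbar_min, 2))                  # 117.33
```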

    The gasoline mileage example and the IQ example illustrate one-tailed tests to the left and right,

respectively. How about a two-tailed test involving a mean μ? Consider a small firm in the French province of Brittany. The firm bottles and sells the region's famous apple cider. In the

    process, a machine that automatically fills bottles with apple cider is used. The bottles are

    supposed to hold one liter (100 centiliters) each, but the actual amount varies slightly from bottle

    to bottle because of chance variation. Extensive measurements of such variation in the past

indicate that the standard deviation is about 1.5 centiliters, or 0.015 liters. The manager of the firm is concerned about μ, the mean amount of cider (in centiliters) per bottle. If μ > 100, then the bottles are being overfilled on the average, and such overfilling is costly to the firm. On the


other hand, if μ < 100, then the bottles tend not to hold as much as the label claims, which is also an undesirable state of affairs since it is not fair to the consumer. The firm wants μ to equal 100 centiliters and would want to stop the bottling process and adjust the machine to eliminate any deviations from μ = 100.

    The hypotheses in this example are

H₀: μ = 100 and Hₐ: μ ≠ 100.

    Suppose that the firm takes a sample of 30 bottles and measures the amount of cider in each

    bottle. The measurement process is very accurate, providing accuracy to the nearest hundredth of

    a centiliter. Thus, any variation in the measurement process can be ignored for all practical

    purposes.

    The test statistic is

z = (x̄ − 100)/(1.5/√30) = (x̄ − 100)/0.27.

Let α = 0.05. In the previous examples, H₀ was to be rejected only for low values of z (the gasoline mileage example) or only for high values of z (the IQ example). This is a two-tailed test, however, and we should reject H₀ if z deviates too much from zero in either direction. Now α is not represented by the area in one tail of the standard normal distribution, as in Figures 11.4, 11.5, and 11.6. Instead, the α of 0.05 is split into an area of 0.025 in the left tail and an area of 0.025 in the right tail. From Table 3, the value z = 1.96 cuts off an area of 0.025 in the right tail.

    0.025 in the right tail. From Table 3, the value z = 1.96 cuts off an area of 0.025 in the right tail.

    By the symmetry of the normal distribution, z = -1.96 cuts off an area of 0.025 in the left tail.

    Thus, the decision rule is as follows:


reject H₀ if z ≤ −1.96 or z ≥ 1.96,
accept H₀ otherwise (that is, if −1.96 < z < 1.96).

    This is illustrated in Figure 11.7. Using the notation introduced in the discussion of interval

estimation in Chapter 8, z_{α/2} = 1.96.

    The sample of 30 bottles is taken and the amount of cider in each bottle is measured. The sample

mean of the 30 measurements is x̄ = 100.93. On average, the machine appears to be overfilling the bottles by almost a centiliter per bottle.

    Is this much overfilling likely to occur by chance in a sample of n = 30, or is it an indication that

    the machine needs adjusting? Let's compute the test statistic z:

z = (100.93 − 100)/0.27 = 3.44

    But z = 3.44 is much larger than the critical value of z = 1.96 in Figure 11.7. This much

overfilling is highly unlikely to occur by chance in a sample of n = 30, and H₀ should be

    rejected.
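The two-tailed decision rule can be checked the same way; here the rejection condition is on |z| (the text's value of 3.44 comes from rounding the standard error to 0.27, so full precision gives about 3.4):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 100.0, 1.5, 30
xbar = 100.93
se = sigma / sqrt(n)                       # about 0.27

z = (xbar - mu0) / se
z_crit = NormalDist().inv_cdf(0.975)       # z_{alpha/2} for alpha = 0.05, about 1.96
reject_h0 = abs(z) >= z_crit               # two-tailed: reject if |z| is too large
print(round(z, 2), reject_h0)              # 3.4 True
```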

    A review of the procedures discussed in this section is in order.

    Step 1. Formulate the hypotheses. The null hypothesis is of the form

H₀: μ = μ₀,

where μ₀ stands for the specific value given in H₀ (μ₀ = 115 in the IQ example, for instance). We have considered three types of alternative hypotheses:


Hₐ: μ < μ₀ (one-tailed test to the left),
Hₐ: μ > μ₀ (one-tailed test to the right),
Hₐ: μ ≠ μ₀ (two-tailed test).

    Step 2. Determine the appropriate test statistic. The test statistic used in this section is

z = (x̄ − μ₀)/(σ/√n)

Step 3. Specify a rejection region. For a given value of α, the decision rule is:

reject H₀ if z ≤ −z_α for a one-tailed test to the left;
reject H₀ if z ≥ z_α for a one-tailed test to the right;
reject H₀ if z ≤ −z_{α/2} or if z ≥ z_{α/2} for a two-tailed test.

Here z_α represents the value of z cutting off the area α in the right tail of the normal curve (and z_{α/2}, of course, cuts off α/2 in the right tail of the normal curve). When α = 0.05, for example, z_α = 1.64 and z_{α/2} = 1.96; when α = 0.01, z_α = 2.33 and z_{α/2} = 2.58. If you prefer, the rejection region can be expressed in terms of x̄:

reject H₀ if x̄ ≤ μ₀ − z_α·σ/√n for a one-tailed test to the left;
reject H₀ if x̄ ≥ μ₀ + z_α·σ/√n for a one-tailed test to the right;
reject H₀ if x̄ ≤ μ₀ − z_{α/2}·σ/√n or if x̄ ≥ μ₀ + z_{α/2}·σ/√n for a two-tailed test.

Step 4. Compute the values of the sample mean x̄ and the test statistic z from the data. If the observed z falls in the rejection region, reject H₀. Otherwise, accept H₀.
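The four steps can be collected into a single routine. This is a sketch under our own naming, not a function from the text:

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Large-sample z test for a mean; tail is 'left', 'right', or 'two'.
    Returns the test statistic and whether H0 should be rejected."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    if tail == "left":
        return z, z <= NormalDist().inv_cdf(alpha)          # reject if z <= -z_alpha
    if tail == "right":
        return z, z >= NormalDist().inv_cdf(1 - alpha)      # reject if z >= z_alpha
    z_half = NormalDist().inv_cdf(1 - alpha / 2)            # z_{alpha/2}
    return z, abs(z) >= z_half

# the gasoline mileage test: one-tailed to the left, alpha = 0.01
z, reject = z_test_mean(30.21, 32, 4.5, 40, alpha=0.01, tail="left")
print(round(z, 2), reject)                                  # -2.52 True
```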


    It is important to keep in mind the fact that this test is based on the normality of z. This is

justified by the Central Limit Theorem, which holds for large samples; n ≥ 30 is often used as a rule of thumb, although smaller values of n may work if the population distribution is symmetric

    and "mound-shaped" like the normal distribution. As a result, the test given in this section is

called a large-sample test. Tests for μ when the sample is small will be discussed in the next section, Section 11.4.

Another point worth noting is the fact that we need to know the population standard deviation σ in order to compute the test statistic z. Given that we do not know μ (we are testing hypotheses about μ), are we likely to be able to specify σ? In some cases, past data may provide reliable information about σ. However, often we have no more than a rough guess about σ. Since the test is a large-sample test, we are saved by the fact that the sample standard deviation s, which can be computed from the formula

s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) ),

provides a good estimate of σ for large samples. Therefore, s can be used in place of σ when we calculate the test statistic z for large samples.

In the gasoline-mileage example, suppose that the standard deviation σ is not known. From the data (the miles per gallon obtained from each of the 40 cars), the standard deviation is found to be s = 4.16. (We already computed x̄ = 30.21.) With s used in place of σ, the test statistic is

z = (30.21 − 32)/(4.16/√40) = −1.79/0.66 = −2.71.

This is less than the critical value of z = −2.33 (see Figure 11.5), and H₀ should therefore be

    rejected.


    11.4 Hypothesis Testing: Means

    (Small Samples)

Next we consider tests of hypotheses about μ for small samples from normally distributed populations. Suppose that the US Environmental Protection Agency (EPA), as part of a review of

    national air quality standards, is interested in the level of ozone in the atmosphere in a particular

    region. In particular, the hypotheses

H₀: μ = 0.10 and Hₐ: μ > 0.10

are of interest, where μ represents the mean level of ozone in parts per million (ppm). Twelve measurements of ozone level are taken at randomly selected points in the region. The sample

    mean is 0.117 ppm, and the sample standard deviation is 0.016 ppm.

    The test statistic is

t = (x̄ − 0.10)/(s/√n)

and the number of degrees of freedom is 12 − 1 = 11. If α = 0.05, the rejection region for this one-tailed test to the right consists of the values of t in the upper 5 percent of the t distribution. From Table 4, the rejection region is t ≥ 1.796.

From the sample results,

t = (0.117 − 0.10)/(0.016/√12) = 0.017/0.0046 = 3.70.


    Since this value is larger than 1.796, we reject the null hypothesis that the mean level of ozone is

0.10 ppm in favor of the alternative hypothesis that the mean level is greater than 0.10 ppm.
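A quick check of the arithmetic (our sketch; the critical value 1.796 is copied from the text's Table 4, since the Python standard library does not tabulate the t distribution):

```python
from math import sqrt

# ozone example: n = 12, so we use the t statistic with 11 degrees of freedom
mu0 = 0.10
xbar, s, n = 0.117, 0.016, 12

t = (xbar - mu0) / (s / sqrt(n))     # the text's 3.70 uses the rounded se 0.0046
t_crit = 1.796                       # upper 5% point of t with 11 df, from Table 4
print(round(t, 2), t >= t_crit)      # 3.68 True
```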

The procedure used in testing hypotheses concerning μ for small samples from normally distributed populations can be summarized as follows.

Step 1. Formulate the hypotheses. The null hypothesis is of the form

H₀: μ = μ₀

and the alternative hypothesis is either

Hₐ: μ < μ₀ (one-tailed test to the left),
Hₐ: μ > μ₀ (one-tailed test to the right),
or Hₐ: μ ≠ μ₀ (two-tailed test).

Step 2. Determine the appropriate test statistic. The test statistic for small-sample tests involving μ is

t = (x̄ − μ₀)/(s/√n)

with n − 1 degrees of freedom.

Step 3. Specify a rejection region. For a given value of α, the decision rule is:

reject H₀ if t ≤ −t_{α,n−1} for a one-tailed test to the left;
reject H₀ if t ≥ t_{α,n−1} for a one-tailed test to the right;
reject H₀ if t ≤ −t_{α/2,n−1} or if t ≥ t_{α/2,n−1} for a two-tailed test.


Here t_α represents the value of t cutting off the area α in the right tail of the t curve (and t_{α/2}, of course, cuts off α/2 in the right tail). In terms of x̄, the rejection region is:

reject H₀ if x̄ ≤ μ₀ − t_{α,n−1}·s/√n for a one-tailed test to the left;
reject H₀ if x̄ ≥ μ₀ + t_{α,n−1}·s/√n for a one-tailed test to the right;
reject H₀ if x̄ ≤ μ₀ − t_{α/2,n−1}·s/√n or if x̄ ≥ μ₀ + t_{α/2,n−1}·s/√n for a two-tailed test.

Step 4. Compute the values of the sample mean x̄ and the sample standard deviation s from the data. Then compute t. Reject H₀ if the observed t falls in the rejection region; accept H₀ otherwise.
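The four steps can again be collected into one routine. Because the standard library has no t quantiles, the tabled critical value is passed in by hand; the function name and signature are our own:

```python
from math import sqrt

def t_test_mean(xbar, s, n, mu0, t_crit, tail="two"):
    """Small-sample t test for a mean. t_crit is the tabled critical value
    (t_{alpha,n-1} for a one-tailed test, t_{alpha/2,n-1} for a two-tailed
    test), looked up in a t table such as the text's Table 4."""
    t = (xbar - mu0) / (s / sqrt(n))
    if tail == "left":
        return t, t <= -t_crit
    if tail == "right":
        return t, t >= t_crit
    return t, abs(t) >= t_crit

# the ozone example: one-tailed to the right, t_crit = 1.796 with 11 df
t, reject = t_test_mean(0.117, 0.016, 12, 0.10, 1.796, tail="right")
print(round(t, 2), reject)          # 3.68 True
```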

    11.5 Hypothesis Testing: Proportions

When only two categories are of interest (such as a question that can only be answered yes or no,

    a part that is good or defective, a person who has a disease or does not have it, or a coin that can

    come up heads or tails), the parameter we deal with is a proportion (the proportion of yes

    answers, the proportion of defective parts, the proportion of people with the disease, the

    proportion of times heads comes up). In Sections 8.4 and 8.5 we discussed point and interval

    estimation for proportions. Now we will consider the testing of hypotheses about proportions. As

in the discussion of hypothesis testing for μ, we will give some examples and then provide a summary of the procedure.

    Suppose that for a particular form of cancer, the cure rate using the standard treatment has been

    0.60. That is, 60 percent of the patients with the disease have been cured in the sense that they

    are free of the disease for at least five years following the treatment. A group of medical

    researchers have developed a new treatment for the disease, and they claim that their treatment

    leads to a cure rate higher than 0.60. The hypotheses, then, are


H₀: p = 0.60

and Hₐ: p > 0.60,

    where p represents the cure rate with the new treatment.

    The researchers report that the new treatment has been tried on a number of patients at various

    clinics. Among those patients treated at least five years ago, 47 out of 61 have been cured. The

cure rate for these patients is x/n = 47/61 = 0.7705. This is encouraging, but is it sufficiently larger than 0.60 to make us reject H₀ and conclude that the new treatment is indeed more effective at

    curing the disease?

    The number of patients cured, x, has a binomial distribution, and if n is not too small we can use

    a normal approximation to the binomial distribution. As in Sections 8.4 and 8.5, we will work

    with the sample proportion x/n instead of x. The mean of x/n is p and the standard deviation of

the sampling distribution (i.e., the standard error) of x/n is √(p(1 − p)/n). Therefore, when we standardize x/n by subtracting its mean and dividing by its standard error, the z statistic we wind

    up with is

z = (x/n − p)/√(p(1 − p)/n)

Just as in testing hypotheses about μ, our test statistic is a standard normal z statistic.

In the cancer example, n is large enough for the normal approximation to be used. The value of p

    under the null hypothesis is 0.60, which means that np = 61(0.60) = 36.6 and n(1-p) = 61(0.40) =

    24.4. In Section 5.4 we suggested as a rule of thumb that the normal distribution provides a

reasonable approximation to the binomial distribution when np ≥ 5 and n(1 − p) ≥ 5. For the cancer example, np and n(1 − p) are both considerably larger than 5.


Let α = 0.05. As in the previous section, α can be represented as an area under the normal curve. Since this is a one-tailed test to the right (Hₐ consists of values of p greater than 0.60), the area must be in the right tail of the distribution. If we want α = 0.05, the critical value of z should be the 95th percentile of z (so that 95 percent of the area under the curve will be to the left of the critical value and 5 percent to the right). From Table 3, the 95th percentile of z is 1.64, since P(z ≤ 1.64) = 0.95. The rejection region is z ≥ 1.64, as shown in Figure 11.8.

Next we calculate the test statistic z. With 47 cured patients out of 61, we have

z = (47/61 − 0.60)/√(0.60(0.40)/61) = 0.1705/0.0627 = 2.72

    This is in the rejection region, since 2.72 > 1.64. The evidence is strong enough to make us

    believe that the new treatment yields a cure rate better than 0.60.
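A numerical check of the cancer example (the names are ours; NormalDist().inv_cdf(0.95) reproduces the 95th percentile read from Table 3):

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.60                  # cure rate under H0
x, n = 47, 61              # patients cured, patients treated
p_hat = x / n              # about 0.7705

se = sqrt(p0 * (1 - p0) / n)          # standard error under H0, about 0.0627
z = (p_hat - p0) / se                 # about 2.72
z_crit = NormalDist().inv_cdf(0.95)   # about 1.64 for alpha = 0.05
print(round(z, 2), z >= z_crit)       # 2.72 True: reject H0
```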

How large would x/n have to be to make us reject H₀ in this example? The rejection region of z ≥ 1.64 corresponds to x/n ≥ 0.60 + 1.64·√(0.60(0.40)/61), or x/n ≥ 0.7029.

A cure rate in the sample of at least 70.29 percent of the patients leads to rejection of H₀ when n = 61, α = 0.05, and the null hypothesis indicates we should expect only 60 percent to be cured.

The hypotheses for a test of whether a coin is fair or loaded were set up in Section 11.2. They are

H₀: p = 0.50 versus Hₐ: p ≠ 0.50,


where p represents the proportion of times heads comes up. To test H₀ versus Hₐ for a particular

    coin, the coin is tossed 100 times.

    The normal approximation can be used, since np = n(1-p) = 100(0.50) = 50 for n = 100 and p =

0.50. Suppose that α = 0.10. For a two-tailed test, that means that the rejection region can be represented by an area of 0.05 in the left tail and another area of 0.05 in the right tail. The critical

    values of z are thus the 5th and 95th percentiles, which are -1.64 and 1.64 (see Figure 11.9). The

    decision rule is:

reject H₀ if z ≤ −1.64 or if z ≥ 1.64,
accept H₀ otherwise (that is, if −1.64 < z < 1.64).

    The 100 tosses of the coin result in 46 heads and 54 tails. The test statistic can be computed as

    follows when x = 46 and n = 100:

z = (46/100 − 0.50)/√(0.50(0.50)/100) = −0.04/0.05 = −0.80

This value is well within the acceptance region. Although we expect 50 heads in 100 tosses if the coin is fair, 46 heads is not unusual enough to make us conclude that the coin is loaded.
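The coin example, checked the same way (our sketch; the critical values ±1.64 are the ones read from Table 3 for α = 0.10):

```python
from math import sqrt

p0 = 0.50                        # fair-coin probability of heads under H0
x, n = 46, 100                   # observed heads, tosses

se = sqrt(p0 * (1 - p0) / n)     # 0.05
z = (x / n - p0) / se            # -0.80
# two-tailed test with alpha = 0.10: reject if z <= -1.64 or z >= 1.64
reject_h0 = abs(z) >= 1.64
print(round(z, 2), reject_h0)    # -0.8 False: the coin is not declared loaded
```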

    For a third example, consider a union which is preparing for a strike vote. The union rules

    indicate that a strike will be called only if at least 80 percent of the members vote in favor of

    striking. The union's leaders, who recommend a strike, claim that the members support this

    recommendation and that the required number of pro-strike votes will be obtained. A polling

    organization decides to take a survey of 140 randomly-chosen union members in order to test

H₀: p = 0.80


versus Hₐ: p < 0.80,

where p represents the proportion of union members who favor a strike. A value of 0.05 is chosen for α, which means that H₀ will be rejected if z ≤ −1.64, as shown in Figure 11.10.

    Of the 140 members polled, 103 favor the strike and the remaining 37 are against the strike. A

    point estimate of p is 103/140 = 0.7357. This is less than 0.80, but is it sufficiently low to cause

    us to reject the claim of the union's leaders? The test statistic is

z = (103/140 − 0.80)/√(0.80(0.20)/140) = −0.0643/0.0338 = −1.90.

This is less than −1.64, and H₀ should be rejected.

    The procedures used to test hypotheses about p can be summarized as follows.

    Step 1. Formulate the hypotheses. The null hypothesis is of the form

H₀: p = p₀,

where p₀ stands for the specific value given in H₀ (p₀ is 0.60 in the cancer example, 0.50 in the coin example, and 0.80 in the strike vote example).

    We have considered three types of alternative hypotheses:

Hₐ: p < p₀ (one-tailed test to the left),
Hₐ: p > p₀ (one-tailed test to the right),
and Hₐ: p ≠ p₀ (two-tailed test).


    Step 2. Determine the appropriate test statistic. The test statistic used in this section is

z = (x/n − p₀)/√(p₀(1 − p₀)/n)

Step 3. Specify a rejection region. For a given value of α, the decision rule is:

reject H₀ if z ≤ −z_α for a one-tailed test to the left;
reject H₀ if z ≥ z_α for a one-tailed test to the right;
reject H₀ if z ≤ −z_{α/2} or if z ≥ z_{α/2} for a two-tailed test.

Here z_α represents the value of z cutting off the area α in the right tail of the normal curve (and z_{α/2}, of course, cuts off α/2 in the right tail of the normal curve). If you prefer, the rejection region can be expressed in terms of x/n:

reject H₀ if x/n ≤ p₀ − z_α·√(p₀(1 − p₀)/n) for a one-tailed test to the left;
reject H₀ if x/n ≥ p₀ + z_α·√(p₀(1 − p₀)/n) for a one-tailed test to the right;
reject H₀ if x/n ≤ p₀ − z_{α/2}·√(p₀(1 − p₀)/n) or if x/n ≥ p₀ + z_{α/2}·√(p₀(1 − p₀)/n) for a two-tailed test.

    Step 4. Compute the values of the sample proportion x/n and the test statistic z from the data. If

the observed z falls in the rejection region, reject H₀. Otherwise, accept H₀.
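As with the mean, the four steps can be wrapped in one routine; a sketch under our own naming:

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(x, n, p0, alpha=0.05, tail="two"):
    """Large-sample z test for a proportion; tail is 'left', 'right', or 'two'.
    Assumes n*p0 >= 5 and n*(1 - p0) >= 5 so the normal approximation applies."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    if tail == "left":
        return z, z <= NormalDist().inv_cdf(alpha)
    if tail == "right":
        return z, z >= NormalDist().inv_cdf(1 - alpha)
    return z, abs(z) >= NormalDist().inv_cdf(1 - alpha / 2)

# the strike vote example: one-tailed to the left, alpha = 0.05
z, reject = z_test_proportion(103, 140, 0.80, alpha=0.05, tail="left")
print(round(z, 2), reject)          # -1.9 True
```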

    11.6 Significance Levels

    In some instances everyone would agree that a particular hypothesis should be rejected. In the

    coin example from Section 11.5, suppose that the number of times heads comes up in 100 tosses

    was 6 instead of 46. With this result, it is hard to imagine anyone continuing to support the


    notion of the coin being fair. In fact, anyone who still thought the coin was fair might be a good

    person to bet against with bets on future tosses of the coin. When x = 6, the test statistic is

z = (6/100 − 0.50)/√(0.50(0.50)/100) = −0.44/0.05 = −8.80

    To say the least, this is an extreme value of z.

    At the other extreme, if x = 50, surely no one would consider this as evidence against the coin

    being fair. This value should lead to general agreement that Ho should not be rejected. Here we

    have

z = (50/100 − 0.50)/√(0.50(0.50)/100) = 0/0.05 = 0

    which is right in the middle of the standard normal distribution (the mean of the distribution, to

    be exact).

    The decision to accept or reject is not always so easy. In the strike vote example of Section 11.5,

z = −1.90 for a one-tailed test to the left. With α = 0.05, as used in the example, the rejection region is z ≤ −1.64, and the decision is therefore to reject H₀. However, what if α = 0.01? With this smaller value of α, the rejection region becomes z ≤ −2.33, since −2.33 is the first percentile of the distribution of z. But the calculated test statistic, z = −1.90, is greater than −2.33, as you can see from Figure 11.11. Having changed α to 0.01, we now accept H₀.

This example illustrates how the choice between accepting and rejecting hypotheses may depend on the value of α that is selected. Specifying a value of α enables us to determine a rejection region. The conventional justification for concentrating on α is that

    1. The hypotheses are set up so that a Type I error is more serious than a Type II error and


2. if "accept H₀" is interpreted as "fail to reject H₀", then we never accept H₀ literally and thus theoretically we never make a Type II error.

But the choice of α is often made in a somewhat arbitrary fashion. Over the years, α = 0.05 and α = 0.01 have been used most often. Unfortunately, these values are generally used more out of

    tradition than out of a careful consideration of the real-world problem at hand.

One way to avoid a choice of α is not to view hypothesis testing in decision-making terms. The approach of hypothesis testing presented in the preceding sections has been decision-oriented,

    emphasizing the choice between rejecting and accepting hypotheses. This requires a formal

    decision rule, or rejection region, that should be chosen before the data are analyzed.

    But what are we trying to accomplish in hypothesis testing? Often we are trying to see whether a

    particular result seems "real" or whether it might be just due to chance. Is the sample proportion

    103/140 = 0.7357 in the strike vote example an indication that the true population proportion p in

    favor of the strike is less than 0.80? Or could it be that 80 percent of the members favor the strike

    and the sample proportion is 0.7357 just because the sample, by the luck of the draw, happened

    to include an unusually high proportion of members against the strike? The null hypothesis that p

    = 0.80 is a claim that the result (the difference between 0.7357 and 0.80) is due to chance. The

    alternative hypothesis that p < 0.80 is an alternative claim that the difference is real.

    How strong is the evidence against p = 0.80? The observed x/n of 0.7357 corresponds to z =

−1.90. From the sampling distribution of z shown in Figure 11.12, you can see that −1.90 is in the left tail of the distribution. How surprised should we be to get a z of −1.90? The chance of getting a z this far to the left of zero or farther (remember, this is a one-tailed test to the left) is represented by the shaded area in Figure 11.12. But this is P(z ≤ −1.90), which can be found by using Table 3; its value is approximately 0.03. This means that if p were exactly 0.80 and we took repeated samples of size n = 140, about 3 percent of the samples would result in a sample

    proportion x/n less than or equal to 0.7357 (that is, less than or equal to x = 103 people in favor


of the strike). The probability P(z ≤ −1.90) = 0.03 is called the observed significance level for the test of H₀: p = 0.80 versus Hₐ: p < 0.80 in the strike vote example.

    Observed Significance Level: The chance of getting a test statistic as extreme as or more

    extreme than the value of the test statistic that is actually obtained, given that the null

    hypothesis is true. Another name for an observed significance level is a P-value.

The lower the observed significance level (or P-value) is, the more surprised someone who believes the null hypothesis should be. A low P-value suggests that if H₀ is true, we have witnessed an unusual sample. Thus, the lower the P-value is, the stronger the evidence is against H₀. Consistent with the widespread use of 0.05 and 0.01 as values of α is the interpretation of an

    observed significance level of 0.05 or less as a "statistically significant result" and of 0.01 or less

    as a "highly statistically significant result". Alternative expressions are "a result significant at the

    0.05 level" or "a result significant at the 0.01 level" or "a result significant at the (insert the P-

    value here) level". The term "significant" comes from the initial question of whether the

    difference between the sample results and the null hypothesis (for example, between 0.7357 and

    0.80 in the strike vote example) is "real", or "significant". In fact hypothesis tests are often

    referred to as tests of significance.
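As a quick check of the arithmetic in the strike vote example, the short Python sketch below recomputes z and the observed significance level. It uses only the standard library; the normal CDF is built from math.erf rather than read from Table 3.

```python
import math

def phi(z):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Strike vote example: H0: p = 0.80 versus HA: p < 0.80, with x = 103 of
# n = 140 members in favor of the strike.
n, x, p0 = 140, 103, 0.80
p_hat = x / n                               # sample proportion, about 0.7357
se = math.sqrt(p0 * (1.0 - p0) / n)         # standard error under H0
z = (p_hat - p0) / se                       # about -1.90
p_value = phi(z)                            # left-tail area, about 0.03
print(round(z, 2), round(p_value, 3))
```

The printed P-value of roughly 0.03 matches the value read from Table 3 above.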

In the gasoline mileage example from Section 11.3, the test is one-tailed to the left and z = -1.41.

Thus,

P-value = P(z ≤ -1.41) = 0.08.

The observed significance level is 0.08, which provides some evidence against H0 but not

enough to satisfy someone who insists upon α = 0.05.

In the IQ example from Section 11.3, the test is one-tailed to the right and z = 1.22. Therefore,

P-value = P(z ≥ 1.22) = 0.11.


    Because the test is one-tailed to the right, the P-value is a right-tail area, not a left-tail area.

In the cancer example of Section 11.5, z = 2.72. The observed significance level is

P-value = P(z ≥ 2.72) = 0.003

for this one-tailed test to the right. This provides very strong evidence against the null hypothesis

that the cure rate for the new treatment is 0.60. Since the P-value is less than 0.01, the result

might be called highly statistically significant.

How about the coin example of Section 11.5? The result of 46 heads in 100 tosses yields z = -0.80, and the corresponding left-tail area is

P(z ≤ -0.80) = 0.21.

But the test is two-tailed (p = 0.50 versus p ≠ 0.50). Therefore, a z of +0.80 would be just as

extreme as a z of -0.80. As a result, we double the one-tail area to allow for the fact that this is a

two-tailed test. The observed significance level is

P-value = 2P(z ≤ -0.80) = 2(0.21) = 0.42.

A significance level this high indicates a result that is not at all surprising under the null

hypothesis.

Consider the bottle-filling example from Section 11.3, with z = 3.44. The one-tail area is

P(z ≥ 3.44) = 0.0003,

which means that the P-value for the two-tailed test (μ = 100 versus μ ≠ 100) is


    P-value = 2(0.0003) = 0.0006.

Finally, consider the ozone-level example in Section 11.4. The test statistic is t = 3.70 and the

number of degrees of freedom is 11. The test is one-tailed to the right. Therefore, the observed

significance level (P-value) is P(t ≥ 3.70). From Table 4, t0.995 = 3.106 and t0.999 = 4.025 with 11

degrees of freedom. Thus, the area to the right of 3.106 is 0.005, and the area to the right of

4.025 is only 0.001. Since the observed t of 3.70 is between 3.106 and 4.025, the observed

significance level is between 0.001 and 0.005. We would say that the results are significant at the

0.005 level but not at the 0.001 level. Of course, the 0.005 level is quite low, indicating that the

results would be highly unlikely if H0 were really true.

    In summary,

    P-value = area to the left of the observed test statistic (z or t, whichever is used) for a one-

    tailed test to the left;

    P-value = area to the right of the observed test statistic (z or t, whichever is used) for a

    one-tailed test to the right;

    P-value = twice the one-tailed P-value for a two-tailed test.
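These three rules can be collected into one small helper. The sketch below (not from the text) implements them for z tests using the standard library, and checks the result against the worked examples above.

```python
import math

def normal_p_value(z, tail):
    """Observed significance level for a z test, following the rules above:
    tail is 'left', 'right', or 'two'."""
    phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    if tail == "left":
        return phi(z)                 # area to the left of z
    if tail == "right":
        return 1.0 - phi(z)           # area to the right of z
    return 2.0 * phi(-abs(z))         # two-tailed: double the one-tail area

# Checked against the worked examples:
print(round(normal_p_value(-1.41, "left"), 2))    # gasoline mileage: 0.08
print(round(normal_p_value(1.22, "right"), 2))    # IQ: 0.11
print(round(normal_p_value(-0.80, "two"), 2))     # coin: 0.42
```

For t tests the same three rules apply, but the tail areas must come from the t distribution (Table 4) rather than the normal table.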

    We must emphasize the importance of looking at more than just the observed significance level

    in order to evaluate the practical importance of the results. A low significance level does not

    guarantee practical importance. For instance, in the gasoline mileage example, suppose that

    1,600 cars were used instead of 40, and the average mileage for the 1,600 cars was 31.68. The

    test statistic would be

z = (31.68 - 32) / (4.5/√1600) = -0.32/0.1125 = -2.84

For this one-tailed test to the left (μ = 32 versus μ < 32), the observed significance level would be


P-value = P(z ≤ -2.84) = .002.

Thus, the results appear to be highly statistically significant. But notice that x̄ = 31.68, which is

only about one-third of a mile per gallon less than the manufacturer's claim of 32 miles per

gallon. If the consumer group uses this difference of one-third of a mile per gallon to dispute the

manufacturer's claim, they will be laughed at. It is unlikely that anyone will care about such a

small difference, statistically significant or not.

    If the difference of one-third of a mile per gallon is not important, why does it lead to such a low

    observed significance level? Notice that the sample is very large (1,600 cars). Such a large

    sample can provide a high degree of accuracy, as you learned in Chapter 8. In hypothesis testing,

    this great accuracy means that the test becomes sensitive to very small differences from the null

    hypothesis. The standard error of 0.1125, or just over a tenth of a mile per gallon, is quite small.

The sample tells us that the evidence is strongly against μ being exactly 32. That does not mean μ cannot be close to 32, and it appears that it is in the neighborhood of 31.68.

The moral of this example for the consumer group is that a large sample size gives them more

    accuracy than they need. They might as well save some money by reducing their sample size. On

    the other hand, in some situations a high degree of accuracy, hence a large sample, may be

    desirable. In the bottle-filling example, a deviation of only 0.01 centiliters may be very costly to

    the firm when the number of bottles being filled and shipped out is considered. (If it is important

to the firm to detect small deviations from the null hypothesis of μ = 100, then a larger sample size may be needed.)
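To see how sample size alone drives the significance level, the sketch below recomputes the gasoline test for the same sample mean of 31.68 (with σ = 4.5, the value appearing in the z calculation above) at several sample sizes. The 0.32 mpg shortfall never changes, but the P-value shrinks as n grows.

```python
import math

phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

x_bar, mu0, sigma = 31.68, 32.0, 4.5   # the shortfall is 0.32 mpg throughout
for n in (40, 400, 1600):
    se = sigma / math.sqrt(n)          # standard error shrinks as n grows
    z = (x_bar - mu0) / se
    print(n, round(z, 2), round(phi(z), 3))   # n, z, left-tail P-value
```

With n = 40 the same sample mean would be unremarkable (a P-value above 0.3), while with n = 1,600 it looks highly significant — the practical difference of 0.32 mpg is identical in every row.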

    The moral of this example for anyone evaluating the results of their own or someone else's tests

    of hypotheses is that it is important to look beyond the statistical significance of such results to

    consider their practical significance. Of course, there are many well-designed studies which yield

    both. As you can see by now, interpreting hypothesis tests can be trickier than interpreting point

    and interval estimates. If you keep the underlying real-world situation in mind while evaluating

the statistical results (and look carefully at statistics such as x̄ as well as at significance levels),


    it's not that hard to figure out what is happening. And in reporting results of tests, you should

report some statistics (for example, x̄ when the test involves a mean μ) in addition to an observed significance level, so that others will be able to evaluate the results.

    11.7 Hypothesis Testing Using SPSS

    Go to Analyze > Compare Means > One-Sample T Test

Move the variable that you are testing into the Test Variable(s) box.

Example: A car cleaning service firm, FastCarClean, is trying to reduce the total service time

during peak periods from the current 30 minutes. A new procedure is implemented. If it is

successful, the entire chain will change to the new process. To test the effectiveness of the

new process, a random sample of 100 cars is surveyed. The data are provided in

CarCleanWaitingTime.xls. Are the new procedures effective? Test at α = .05.

    Step 1: Set up the hypothesis

H0: μ = 30

HA: μ < 30

This is a one-tail test.

Step 2: Determine the test statistic

α = .05.

    Steps 3 and 4

In SPSS, enter the null value (30) in the Test Value box.

The output from SPSS is:

    One-Sample Statistics


                  N      Mean    Std. Deviation    Std. Error Mean

WaitingTime     100    27.350           12.3525             1.2352

Observe from the top panel that the mean waiting time after the process change is 27.35 minutes. The bottom panel reveals that the two-tailed p-value is .034. Since this is a one-tail test, the p-value

for our one-tail test is .034/2 = .017.

Since the p-value of .017 is less than α (= .05), we reject the null. The firm has successfully

    reduced waiting time.
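The SPSS arithmetic can be verified by hand from the summary statistics alone. The sketch below recomputes the t statistic and converts the reported two-tailed p-value to the one-tailed value:

```python
# Recomputing the one-sample t statistic from the SPSS summary output
# (mean 27.350, standard error 1.2352, null value 30):
mean, se, mu0 = 27.350, 1.2352, 30.0
t = (mean - mu0) / se
print(round(t, 3))                    # about -2.145

# SPSS reports a two-tailed p-value of .034; for the one-tailed test
# HA: mu < 30, halve it:
p_one_tailed = 0.034 / 2
print(p_one_tailed)                   # .017
```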

11.8 Two-Sample t Tests

So far we have looked at one-sample hypothesis testing. We call it one-sample since only

one set of sample observations was gathered. Very often, we are interested in making

comparisons. For example, we may be interested in whether behaviour differs between two

sets of people: do Asian consumers spend more on entertainment than Caucasian consumers?

Are younger people more likely to try new products than older people? Is a new medication

    better than the existing one? Since we are interested in differences or comparisons between

people/events/strategies, we very often have to conduct hypothesis testing using more than one

    sample. To test for differences between Asian and Caucasian consumers, we will have to gather

    data from a sample of Asian consumers and another set of observations from a sample of

    Caucasian consumers. Now we are dealing with two samples. The basic process of hypothesis


    testing using two samples is the same as in the one-sample case. We have to formulate the

hypothesis, set up the rejection region, and calculate the test statistic. Two-sample hypothesis

tests can be of two types: independent-samples t tests and paired-samples t tests. We discuss

    each type next.

11.8.1 Independent-Samples t Test

The independent-samples t test is used when the data for the two samples are gathered independently of

each other. For example, if we are interested in differences in spending patterns between Asians

and Caucasians, we can get spending data by going to the Galleria and soliciting consumers. We ask

a random sample of 100 Asians (e.g., by asking every 10th Asian we see) how much they spend

on entertainment. We simultaneously get a random sample of 100 Caucasians (also by

contacting every 10th Caucasian we see) and ask them how much they spend on entertainment.

The crucial notion is that these two samples of 100 Asians and 100 Caucasians are not connected

in any way. The data for one sample (Asians) are collected independently of the data for the

second sample (Caucasians); that is why the test is called independent-samples.

A manager of a restaurant is considering offering a free drink to customers as soon as they sit down to

order. Some of her colleagues think it is a good idea that will encourage customers to order

more expensive dishes or buy more drinks. Other colleagues believe it will be bad for business,

as people who would otherwise have ordered drinks will no longer do so and will also spend

less on food. They suggest offering a free appetizer instead. She therefore decides to conduct an

experiment one night to see which option will be more effective. We will call the free drink

promotion strategy X and the old approach of appetizers strategy Y. The firm runs the experiment on

one night, giving 12 (nx) randomly selected customers a free drink while 12 (ny) other

randomly selected customers get a free appetizer. The manager will then compare how much

money customers have spent in order to decide which strategy to adopt. Before the data

collection begins, the firm needs to set up the hypothesis testing process as discussed in the

earlier sections.

    Step 1:


First, the null hypothesis. The null hypothesis is that both promotions are equally effective, which

is denoted by:

H0: μx = μy

This implies that mean sales under the drink promotion (μx) are the same as mean sales under the appetizer promotion (μy).

Depending on the firm's point of view, three alternative hypotheses are possible. First, the firm

may believe that promotion X will be more effective than promotion Y. This might be the

case if promotion Y is what the firm currently uses and a management consultant has

recommended that the firm try out promotion X. In this case the firm will set up the alternative

hypothesis to be:

HA: μx > μy

This is a one-tail test.

Alternatively, the firm's management might believe that promotion Y is more effective than

promotion X. In this case, the alternative hypothesis will be:

HA: μx < μy

This is also a one-tail test.

Finally, the management might be equivocal about the two promotions. In this case, the alternative

hypothesis will be:

HA: μx ≠ μy

This is a two-tailed test.

    Step 2: Determine the appropriate test statistic

There are two possible test statistics for the independent-samples test. The formulae for the

test statistics are quite cumbersome. Which one should be used depends on the nature of the data

gathered. First, the standard deviations of the two samples need to be examined. If the

standard deviations of the two samples are not significantly different (equal variances), then one

formula is used; if the standard deviations are significantly different (unequal variances), another

formula is used. You will not solve these problems by hand; SPSS will do this for you. But in

case you are curious, the formulae for the test statistics are given below.

t-statistic when sample variances are equal

    t = (x̄ − ȳ) / √( sp²/nx + sp²/ny ),  where  sp² = [ (nx − 1)sx² + (ny − 1)sy² ] / (nx + ny − 2),

and t has (nx + ny − 2) degrees of freedom.

t-statistic when sample variances are unequal

    t = (x̄ − ȳ − D0) / √( sx²/nx + sy²/ny ),  where  ν = ( sx²/nx + sy²/ny )² / [ (sx²/nx)²/(nx − 1) + (sy²/ny)²/(ny − 1) ],

and t has approximately ν degrees of freedom. Here D0 is the hypothesized difference between the two population means (zero under H0: μx = μy), and sx² and sy² are the two sample variances.
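To make the two formulas concrete, the sketch below computes both t statistics and the unequal-variance degrees of freedom for the promotions example. The per-customer bills are invented numbers for illustration, not data from the text.

```python
import math

def two_sample_t(x, y):
    """Pooled (equal-variance) and Welch (unequal-variance) t statistics
    for H0: mu_x = mu_y, i.e. hypothesized difference D0 = 0."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)   # sample variance of y
    # Equal variances: pooled estimate, nx + ny - 2 degrees of freedom.
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    t_pooled = (mx - my) / math.sqrt(sp2 / nx + sp2 / ny)
    # Unequal variances: Welch statistic with its own degrees of freedom.
    t_welch = (mx - my) / math.sqrt(vx / nx + vy / ny)
    nu = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t_pooled, t_welch, nu

# Hypothetical per-customer bills under the two promotions (invented),
# 12 customers per group as in the example.
drinks = [42, 55, 48, 61, 39, 52, 57, 44, 50, 46, 58, 49]        # strategy X
appetizers = [40, 47, 43, 51, 38, 45, 49, 41, 44, 42, 50, 43]    # strategy Y
t_pooled, t_welch, nu = two_sample_t(drinks, appetizers)
print(round(t_pooled, 3), round(t_welch, 3), round(nu, 1))
```

Note that when the two groups have the same size, the two t statistics coincide; they differ only in the degrees of freedom used to look up the P-value.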

As mentioned, SPSS computes both of these formulas for you, and its output reports the results from both.

Back to the promotions example: suppose the manager is equivocal about the two

promotions and sets up the hypotheses as follows.

Step 1:

H0: μx = μy (The effectiveness of the two promotions will be the same.)

HA: μx ≠ μy (The effectiveness of the two promotions will not be the same.)


See the fifth column of the output, which contains the P-values. The P-value calculated using the

equal-variances t-statistic formula is .021, while the P-value calculated using the

unequal-variances formula is .037. So which P-value should we use? We

should use the more conservative (larger) P-value, so that we are consistent with our philosophy of

reducing Type I error. We therefore use the P-value of .037.

11.8.2 Paired-Samples Hypothesis Test

When the sample data consist of matched pairs, the paired-samples test is required. If you are

interested in seeing whether a weight reduction plan is effective, then weight measurements are

taken for each individual before starting the plan and after completing it. The differences in weight for the

individual participants will indicate whether the plan significantly reduces weight. Similarly, if a

regulatory agency wants to check whether gas prices differ between two chains, it will check both

chains' prices in a number of zip codes and then measure the difference in gas prices within each zip code.

    Example

A CPG firm wants to decide which of two package designs to use. They randomly select 29

customers and show them both designs. Each customer rates both designs on a 1-10 point scale,

where 10 indicates "Excellent" and 1 means "Very Bad". The data are in PackageDesign.xls. Is

either design significantly preferred over the other? Test at α = 0.05.


    Step 1:

First, the null hypothesis. The null hypothesis is that both designs are equally preferred, which is

denoted by:

H0: μx = μy

Since neither design is favored in advance, the alternative is HA: μx ≠ μy, making this a two-tailed test.

Step 2: Decision Rule: α = 0.05

    Step 3:

    In SPSS, go to

Analyze > Compare Means > Paired-Samples T Test

Move the two variables into the Paired Variables box so that they appear side by side as one pair.


From the top panel, we can see that the mean preference for package design A is 7.07, while it is 6.34 for

package design B. Is this a significant difference? The bottom panel indicates a t-value of

-2.470 and a p-value of .02. Since the p-value is less than α = 0.05, we reject the null and conclude

that package design A is significantly preferred to package design B.
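The underlying paired computation is simple enough to sketch by hand: take the within-pair differences, then run a one-sample t test on them against zero. The ratings below are invented for illustration (the actual data are in PackageDesign.xls, which is not reproduced here).

```python
import math

def paired_t(a, b):
    """Paired-samples t statistic: tests whether the mean of the
    within-pair differences (a - b) is zero."""
    d = [ai - bi for ai, bi in zip(a, b)]
    n = len(d)
    d_bar = sum(d) / n
    s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))
    return d_bar / (s_d / math.sqrt(n)), n - 1   # (t, degrees of freedom)

# Hypothetical ratings: each customer rates both designs, giving matched pairs.
design_a = [8, 7, 9, 6, 8, 7, 7, 9, 8, 6]
design_b = [7, 6, 8, 6, 7, 7, 6, 8, 7, 6]
t, df = paired_t(design_a, design_b)
print(round(t, 2), df)   # positive t: design A rated higher on average
```

Pairing matters here: because the same customer produces both ratings, the customer-to-customer variation cancels in the differences, which is why this test is more sensitive than an independent-samples test on the same data.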
