ch12 solutions


Upload: sleetdinomon

Post on 13-May-2017


Page 1: Ch12 Solutions

Chapter 12: Inference for Linear Regression 281

Chapter 12

Section 12.1

Check Your Understanding, page 750:

1. The scatterplot must show a linear pattern, the observations must be independent, the residuals must be approximately Normally distributed, the residuals must show roughly equal scatter for all x-values, and the observations must be produced by random sampling or a randomized experiment.

2. Since there are 8 observations, there are 6 degrees of freedom and t* = 2.447. This means that the confidence interval is 0.0908 ± 2.447(0.02831) = 0.0908 ± 0.06927 = (0.02153, 0.16007). We are 95% confident that the interval from 0.02153 to 0.16007 captures the true slope of the population regression line relating pack weight y and body weight x among ninth grade students at the Webb Schools.

Check Your Understanding, page 755:

1. State: We want to perform a test of H0: β = 0 versus Ha: β ≠ 0 where β is the true slope of the population regression line relating body weight to backpack weight. We will use α = 0.05. Plan: If the conditions are met, we will do a t test for the slope β. We are assuming that the conditions are met here. Do: According to the output, the test statistic is t = 3.21 and the P-value is 0.018 using df = 6. Conclude: Since the P-value is less than 0.05, we reject the null hypothesis and conclude that there is convincing evidence of a linear relationship between body weight and backpack weight for ninth grade students at the school.

2. Yes. The 99% confidence interval would contain 0 because the P-value for the two-sided test (0.018) is greater than 0.01.

Exercises, page 759:

12.1 No, the conditions for performing inference are not met. The variance of the residuals increases as the laboratory measurement increases.

12.2 No, the conditions for performing inference are not met. There is curvature in the residual plot, which suggests that the original scatterplot is also curved.

12.3 Linear: The residual plot is reasonably centered around 0. This suggests that the relationship in the scatterplot is approximately linear. Independent: This was a randomized experiment. Due to the random assignment, the observations can be viewed as independent. Normal: The histogram is mound shaped and approximately symmetric, so the residuals could follow a Normal distribution. Equal Variance: The residual plot shows roughly equal scatter for all x-values. Random: This was a randomized experiment. The conditions are met.

12.4 Linear: The residual plot is reasonably centered around 0. This suggests that the relationship in the scatterplot is approximately linear. Independent: This was a randomized experiment. Due to the random assignment, the observations can be viewed as independent. Normal: The histogram is mound shaped and approximately symmetric, so the residuals could follow a Normal distribution. Equal Variance: The residual plot shows roughly equal scatter for all x-values. Random: This was a randomized experiment. The conditions are met.

12.5 α is the y-intercept. In this case it would measure the proportion of fish killed if there were 0 fish in the tank to begin with. Obviously this is extrapolation since the smallest number of fish to begin with


282 The Practice of Statistics for AP*, 4/e

was 10. So it really just gives us the point at which the line crosses the y-axis. The estimate for this parameter is 0.12. β is the slope. It tells us how much the proportion of fish killed changes, on average, for each additional fish in the tank. The estimate for this parameter is 0.0086. In other words, we expect the proportion of fish killed to increase by 0.0086 for every extra fish in the tank. Finally, σ measures the standard deviation of the proportion killed values about the population regression line. In this case the estimate is 0.1886. This means that the actual proportion of fish killed will typically differ from the predicted proportion by about 0.1886.

12.6 α is the y-intercept. In this case it would measure the BAC level if no beers had been drunk. We would expect this to be 0, and the estimate is close to 0 with a value of -0.0127. β is the slope. It tells us how much the BAC increases, on average, with each additional beer. The estimate for this parameter is 0.018. In other words, we expect the BAC level to increase by 0.018 with each additional beer. Finally, σ measures the standard deviation of BAC values about the population regression line. In this case the estimate is 0.0204. This means that the actual BAC level will typically differ from the predicted value by about 0.0204.

12.7 (a) If we repeated the experiment many times, the slope of the sample regression line would typically vary by about 0.002456 from the true slope of the population regression line for predicting proportion of fish eaten from the number of fish available. (b) Since there are 16 observations, the appropriate t distribution has 14 degrees of freedom and t* = 1.761. This leads to the confidence interval of 0.00857 ± 1.761(0.002456) = 0.00857 ± 0.00433 = (0.00424, 0.0129). (c) We are 90% confident that the interval from 0.00424 to 0.0129 captures the true slope of the population regression line for predicting proportion of fish eaten from the number of fish available. (d) If we were to repeat the experiment many times and compute a confidence interval for the regression slope in each case, about 90% of the resulting intervals would contain the slope of the population regression line.

12.8 (a) If we repeated the experiment many times, the slope of the sample regression line would typically vary by about 0.0024 from the true slope of the population regression line for predicting BAC level from the number of beers consumed. (b) Since there are 16 observations, the appropriate t distribution has 14 degrees of freedom and t* = 2.977. This leads to the confidence interval of 0.018 ± 2.977(0.0024) = 0.018 ± 0.007 = (0.011, 0.025). (c) We are 99% confident that the interval from 0.011 to 0.025 captures the true slope of the population regression line for predicting BAC level from the number of beers consumed. (d) If we were to repeat the experiment many times and compute a confidence interval for the regression slope in each case, about 99% of the resulting intervals would contain the slope of the population regression line.

12.9 State: We want to construct a 99% confidence interval for β, the true slope of the population regression line relating number of stumps to number of clusters of beetle larvae. Plan: If the conditions are met, we will use a t interval for the slope to estimate β. We are assuming that the conditions are met here. Do: The sample size is 23 so the appropriate distribution has 21 degrees of freedom and t* = 2.831. This leads to the confidence interval of 11.894 ± 2.831(1.136) = 11.894 ± 3.216 = (8.678, 15.11). Conclude: We are 99% confident that the interval from 8.678 to 15.11 captures the true slope of the population regression line for predicting clusters of beetle larvae from the number of stumps.

12.10 State: We want to construct a 90% confidence interval for β, the true slope of the population regression line relating heights and arm spans of students in a large high school. Plan: If the conditions


are met, we will use a t interval for the slope to estimate β. We are assuming that the conditions are met here. Do: The sample size is 18 so the appropriate distribution has 16 degrees of freedom and t* = 1.746. This leads to the confidence interval of 0.8404 ± 1.746(0.0809) = 0.8404 ± 0.1413 = (0.6991, 0.9817). Conclude: We are 90% confident that the interval from 0.6991 to 0.9817 captures the true slope of the population regression line for predicting height from arm span.

12.11 (a) -1.286 + 11.894(5) = 58.184 clusters. (b) σ measures the typical amount of error between the actual values and the predicted values. In this case our estimate for σ is 6.419, so we would expect our prediction of clusters of beetle larvae to be off by about that much on average.

12.12 (a) 11.547 + 0.84042(76) = 75.4189 inches. (b) σ measures the typical amount of error between the actual values and the predicted values. In this case our estimate for σ is 1.613, so we would expect our predicted heights to be off by about that much on average.

12.13 (a) The scatterplot suggests that there is a somewhat weak negative linear relationship between the number of weeds per meter and the corn yield of the plots. (b) The equation for the line is ŷ = 166.483 - 1.0987x where ŷ is the predicted corn yield and x is the number of weeds per meter. (c) The y-intercept says that if there are no weeds per meter, we would predict a corn yield of 166.483 bushels. The slope says that for each additional weed per meter we can expect the average corn yield to decrease by 1.0987 bushels. (d) State: We want to perform a test of H0: β = 0 versus Ha: β < 0 where β is the true slope of the population regression line relating weeds per meter to corn yield. We will use α = 0.05. Plan: If the conditions are met, we will do a t test for the slope β. We check the conditions. The residual plot and histogram of the residuals are given below. Linear: The scatterplot is approximately linear. Independent: This was a randomized experiment. Due to the random assignment, the observations can be viewed as independent. Normal: The histogram is mound shaped and approximately symmetric, so the residuals could follow a Normal distribution. Equal Variance: The residual plot shows roughly equal scatter for all x-values. Random: This was a randomized experiment. The conditions are met. Do: According to the output, the test statistic is t = -1.92 and the one-sided P-value is 0.0375 using df = 14. Conclude: Since the P-value is less than 0.05, we reject the null hypothesis and conclude that there is convincing evidence of a negative linear relationship between weeds per meter and corn yield.

12.14 (a) The scatterplot suggests that there is a moderately strong negative linear relationship between the amount of time spent at the table and the calories consumed for young children. (b) The equation for


the line is ŷ = 560.65 - 3.0771x where ŷ is the predicted number of calories consumed and x is the time spent at the table. (c) The y-intercept says that if there is no time spent at the table, we would predict the average number of calories consumed to be 560.65. In this case that is extrapolation, as the smallest amount of time measured was 20 minutes. Also, clearly, if the children spend no time at the table, they cannot consume 560 calories. The slope says that for each additional minute at the table we can expect the average caloric consumption to decrease by 3.0771 calories. (d) State: We want to perform a test of H0: β = 0 versus Ha: β < 0 where β is the true slope of the population regression line relating time at the table to calorie consumption. We will use α = 0.01. Plan: If the conditions are met, we will do a t test for the slope β. We check the conditions. The residual plot and histogram of the residuals are given below. Linear: The scatterplot is approximately linear. Independent: There were 20 toddlers observed. This is clearly less than 10% of all possible toddlers. Normal: The histogram is mound shaped and approximately symmetric, so the residuals could follow a Normal distribution. Equal Variance: The residual plot shows roughly equal scatter for all x-values. Random: The data come from a random sample. The conditions are met. Do: According to the output, the test statistic is t = -3.62 and the one-sided P-value using df = 18 is 0.001. Conclude: Since the P-value is less than 0.01, we reject the null hypothesis and conclude that there is convincing evidence of a negative linear relationship between time at the table and caloric consumption.

12.15 (a) State: We want to construct a 90% confidence interval for β, the true slope of the population regression line relating weeds per meter and corn yield. Plan: If the conditions are met, we will construct a t interval for the slope β. The conditions were checked in Exercise 12.13. Do: The sample size is 16 so the appropriate distribution has 14 degrees of freedom and t* = 1.761. This leads to the confidence interval of -1.0987 ± 1.761(0.5712) = -1.0987 ± 1.0059 = (-2.1046, -0.0928). Conclude: We are 90% confident that the interval from -2.1046 to -0.0928 captures the true slope of the population regression line for predicting corn yield from weeds per meter. In Exercise 12.13 we did a one-sided test at the 5% significance level, which corresponds to a 90% confidence interval. With this interval we conclude that 0 is not a plausible value for the slope. In Exercise 12.13 we rejected the null hypothesis that the slope was 0. The conclusions are the same. (b) The typical error when using the regression line to predict corn yield is about 7.98 bushels. About 20.9% of the variation in corn yield can be explained by the linear relationship with the number of weeds per meter. If this experiment were done many times, the estimated slope would typically differ from the population slope by about 0.5712.

12.16 (a) State: We want to construct a 98% t interval for β, the true slope of the population regression line relating time at the table and caloric consumption. Plan: If the conditions are met, we will construct


a t interval for the slope β. The conditions were checked in Exercise 12.14. Do: The sample size is 20 so the appropriate distribution has 18 degrees of freedom and t* = 2.552. This leads to the confidence interval of -3.0771 ± 2.552(0.8498) = -3.0771 ± 2.1687 = (-5.2458, -0.9084). Conclude: We are 98% confident that the interval from -5.2458 to -0.9084 captures the true slope of the population regression line for predicting calorie consumption from time at the table. In Exercise 12.14 we did a one-sided test at the 1% significance level, which corresponds to a 98% confidence interval. With this interval we conclude that 0 is not a plausible value for the slope. In Exercise 12.14 we rejected the null hypothesis that the true slope was 0. The conclusions are the same. (b) The typical error when using the regression line to predict calorie consumption is about 23.4 calories. Approximately 42.1% of the variation in calorie consumption can be explained by the linear relationship with the time spent at the table. If samples like this were observed many times, the estimated slope would typically differ from the population slope by about 0.8498.

12.17 (a) In computing a 99% confidence interval we use a t distribution with 14 degrees of freedom and t* = 2.977. The interval is computed as 0.7902 ± 2.977(0.0710) = 0.7902 ± 0.2114 = (0.5788, 1.0016). (The slight differences between this and the values in the text are due to rounding.) (b) Both variables are measuring tire wear in the same units. If one method of measuring wear gives an increase in wear of 1 unit, we would hope that the other way of measuring wear would also give an increase in wear of 1 unit. This translates into a slope of 1. (c) Since the interval in part (a) does include the value 1, it suggests that the slope could plausibly be 1. That is, we would not reject the null hypothesis that the slope is 1.

12.18 (a) In computing a 95% confidence interval we use a t distribution with 19 degrees of freedom and t* = 2.093. The interval is computed as 11,630.6 ± 2.093(1,249) = 11,630.6 ± 2,614.16 = (9,016.44, 14,244.76). (b) Vehicle age is measured in years and mileage in miles. Since the automotive group claims that people drive 15,000 miles per year, for every increase of 1 year the mileage would increase by 15,000 miles. This translates into a slope of 15,000. (c) Since the interval in part (a) does not include the value 15,000, it suggests that the slope could not plausibly be 15,000. That is, we would reject the null hypothesis that the slope is 15,000.

12.19 (a) State: We want to perform a test of H0: β = 0 versus Ha: β < 0 where β is the true slope of the population regression line relating wine consumption to heart disease death rate. We will use α = 0.05. Plan: If the conditions are met, we will do a t test for the slope β. We check the conditions. The scatterplot, residual plot and histogram of the residuals are given below. Linear: The scatterplot is approximately linear. Independent: There were 19 countries observed. There are more than 190 countries in the world, so the sample is less than 10% of the population. Normal: The histogram of residuals is mound shaped and approximately symmetric, so the residuals could follow a Normal distribution. Equal Variance: The residual plot shows roughly equal scatter for all x-values. Random: The data come from a random sample. The conditions are met.


Do: The output is given below. According to the output, the test statistic is t = -6.46 and the P-value given is 0.000. This P-value is for the two-sided test and we are conducting a one-sided test, but halving a P-value that already rounds to 0.000 still gives approximately 0.

The regression equation is
death = 261 - 23.0 wine

Predictor    Coef       SE Coef    T        P
Constant     260.56     13.84      18.83    0.000
wine         -22.969    3.557      -6.46    0.000

S = 37.8786   R-Sq = 71.0%   R-Sq(adj) = 69.3%

Conclude: Since the P-value is less than 0.05, we reject the null hypothesis and conclude that there is convincing evidence of a negative linear relationship between wine consumption and death rate due to heart disease. (b) State: We want to construct a 95% t interval for β, the true slope of the population regression line relating wine consumption and heart disease death rate. Plan: If the conditions are met, we will construct a t interval for the slope β. The conditions were checked in (a). Do: The sample size is 19 so the appropriate distribution has 17 degrees of freedom and t* = 2.11. This leads to the confidence interval of -22.969 ± 2.11(3.557) = -22.969 ± 7.505 = (-30.474, -15.464). Conclude: We are 95% confident that the interval from -30.474 to -15.464 captures the true slope of the population regression line for predicting heart disease death rate from wine consumption.
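As a rough check on the arithmetic in Exercise 12.19, the Python sketch below recomputes the test statistic and the 95% interval from the Minitab coefficient and standard error. The function name `slope_confidence_interval` is a hypothetical helper introduced only for illustration, and t* = 2.11 is taken from the solution rather than computed.

```python
def slope_confidence_interval(b, se_b, t_star):
    """Return (lower, upper) for the interval b ± t* · SE_b."""
    margin = t_star * se_b
    return b - margin, b + margin


# Values read from the Minitab output in Exercise 12.19.
b = -22.969      # estimated slope (coefficient of wine)
se_b = 3.557     # standard error of the slope
t_star = 2.11    # critical value for 95% confidence, df = 17 (from the solution)

# Test statistic for H0: slope = 0 is the estimate divided by its standard error.
t_stat = b / se_b
low, high = slope_confidence_interval(b, se_b, t_star)

print(f"t = {t_stat:.2f}")
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

The same helper can be reused for the other slope intervals in this section by swapping in the slope, standard error, and t* quoted in each solution.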


12.20 (a) State: We want to perform a test of H0: β = 0 versus Ha: β < 0 where β is the true slope of the population regression line relating swim time to pulse rate. We will use α = 0.05. Plan: If the conditions are met, we will do a t test for the slope β. We check the conditions. The scatterplot, residual plot and histogram of the residuals are given below. Linear: The scatterplot is approximately linear. Independent: This was clearly less than 10% of all possible swim times that could have been measured. Normal: The histogram of residuals is mound shaped and approximately symmetric, so the residuals could follow a Normal distribution. Equal Variance: The residual plot shows roughly equal scatter for all x-values. Random: The data come from a random sample. The conditions are met.

Do: The output is given below. According to the output, the test statistic is t = -5.13 and the P-value given is 0.000. This P-value is for the two-sided test and we are conducting a one-sided test, but halving a P-value that already rounds to 0.000 still gives approximately 0.

The regression equation is
pulse = 480 - 9.69 time

Predictor    Coef      SE Coef    T        P
Constant     479.93    66.23      7.25     0.000
time         -9.695    1.889      -5.13    0.000

S = 6.45505   R-Sq = 55.6%   R-Sq(adj) = 53.5%

Conclude: Since the P-value is less than 0.05, we reject the null hypothesis and conclude that there is convincing evidence of a negative linear relationship between swim time and pulse rate. (b) State: We


want to construct a 95% t interval for β, the true slope of the population regression line relating swim time and pulse rate. Plan: If the conditions are met, we will construct a t interval for the slope β. The conditions were checked in (a). Do: The sample size is 23 so the appropriate distribution has 21 degrees of freedom and t* = 2.08. This leads to the confidence interval of -9.695 ± 2.08(1.889) = -9.695 ± 3.929 = (-13.624, -5.766). Conclude: We are 95% confident that the interval from -13.624 to -5.766 captures the true slope of the population regression line for predicting pulse rate from swim time.

12.21 c 12.22 d 12.23 c 12.24 a 12.25 b 12.26 b

12.27 (a) Each student was assigned the two treatments (saying the color each word is printed in and reading the color names) in random order. (b) He used a randomized block design where each student was a block (this could also be called a matched pairs design since there were only two treatments per block). He did this to help control for the different abilities of students to read the color words (or to say the color they were printed in) with distractions. (c) The random assignment was used to help average out the effects of the order in which people did the two treatments, in particular the effect of having dealt with a similar distraction before.

12.28 First calculate the difference in times for each student (we used Colors - Words). The mean difference is 6.56 with a standard deviation of 4.66. The minimum was 0. This means that all students did at least as well on the Words as they did on the Colors. It appears that saying the color that the word is printed in takes longer. The histogram given below also shows that all scores for the difference are 0 or above.


12.29 It is not safe to use the matched pairs t procedures for this data set. In the histogram in the solution for Exercise 12.28 we see that the distribution is skewed and the boxplot below shows that there is an outlier among the differences. Since there are only 16 observations, the Normal condition is not satisfied.

12.30 (a) The scatterplot is below. There appears to be a moderately strong, positive, linear relationship between the length of time to read the word and length of time to say the color names.

(b) The Minitab output is given below.

The regression equation is
Colors = 4.89 + 1.13 Words

Predictor    Coef      SE Coef    T       P
Constant     4.887     6.569      0.74    0.469
Words        1.1321    0.5090     2.22    0.043

S = 4.81350   R-Sq = 26.1%   R-Sq(adj) = 20.8%

The regression equation is ŷ = 4.887 + 1.1321x where ŷ is the predicted time to say the colors and x is the amount of time to read the words. (c) The predicted value for the student who completed the word task in 9 seconds is 4.887 + 1.1321(9) = 15.076 seconds, so the residual is 13 - 15.076 = -2.076 seconds. (d) If the true slope is really 0, the probability of getting a sample with a slope of 1.1321 or larger is 0.0215.
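The prediction, residual, and one-sided P-value in Exercise 12.30 can be reproduced with a few lines of Python. This is only a sketch: the `predict` helper is a hypothetical name, and halving the two-sided P-value is appropriate here only because the sample slope is positive, which is the direction of interest.

```python
def predict(intercept, slope, x):
    """Predicted response from a fitted least-squares line."""
    return intercept + slope * x


# Coefficients from the Minitab output in Exercise 12.30.
intercept, slope = 4.887, 1.1321

predicted = predict(intercept, slope, 9)   # word task completed in 9 seconds
residual = 13 - predicted                  # actual minus predicted color time

# Minitab reports a two-sided P-value of 0.043 for the slope.
# The sample slope is positive, so the one-sided P-value is half of it.
one_sided_p = 0.043 / 2

print(f"predicted = {predicted:.3f} s, residual = {residual:.3f} s")
print(f"one-sided P-value = {one_sided_p:.4f}")
```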


12.31 (a) The probability that the person is a snowmobile owner is 295/1526 = 0.1933. The probability that the person belongs to an environmental organization or owns a snowmobile is (295 + 77 + 212)/1526 = 0.3827. The probability that the person has never used a snowmobile given that they belong to an environmental organization is 212/305 = 0.6951. (b) No. P(snowmobile owner) = 0.1933. P(belongs to environmental organization) = 305/1526 = 0.1999. P(snowmobile owner and belongs to environmental organization) = 16/1526 = 0.0105 ≠ (0.1933)(0.1999) = 0.0386. Since the probability of the intersection is not the same as the product of the probabilities, these two events are not independent. (c) P(both are owners) = (0.1933)² = 0.0374. P(at least one belongs to an environmental organization) = 1 - P(neither belongs) = 1 - (1 - 0.1999)² = 0.3598.

12.32 State: We want to perform a test of

H0: Environmental club membership and snowmobile use are independent
Ha: Environmental club membership and snowmobile use are not independent

at the α = 0.05 level. Plan: We should use a chi-square test for association/independence if the conditions are satisfied. We check the conditions. Random: The data came from a random sample. Large sample size: We used technology to get the following expected counts:

                     Environmental club member
Snowmobile use       No        Yes
Never used           525.69    131.31
Snowmobile renter    459.28    114.72
Snowmobile owner     236.04    58.96

All of these counts are bigger than 5. Independent: Our sample includes 1526 people. This is less than 10% of the winter visitors to Yellowstone National Park. The conditions are met. Do: The test statistic is χ² = (445 - 525.69)²/525.69 + ... + (16 - 58.96)²/58.96 = 116.588. We use a chi-square distribution with 2 degrees of freedom and find a P-value of approximately 0. Conclude: Since the P-value is less than 0.05, we reject the null hypothesis and conclude that environmental club membership and snowmobile use are not independent.

Section 12.2

Check Your Understanding, page 776:

1. The scatterplot of the original data shows clear curvature. But the scatterplot of years and ln(population) seems quite linear. This suggests that an exponential model would fit the original data.

2. ln ŷ = -1.193 + 0.0287x where y is the population and x is the number of years since 1700.

3. If we want to predict for 1890, that means that x = 1890 - 1700 = 190. So ln ŷ = -1.193 + 0.0287(190) = 4.26. That means that ŷ = e^4.26 = 70.81 million people.


4. The residual plot suggests that the residual for this point would be negative. This means that our prediction is too high.

Exercises, page 786:

12.33 (a) The scatterplot is given below. The scatterplot shows a fairly strong, positive, slightly curved relationship between length and period with one very unusual point (106.5, 2.115) in the top right corner. (b) In this case the class used the square root of the length as its explanatory variable and period as its response variable. (c) In this case the class used the length as the explanatory variable and the square of the period as its response variable.

12.34 (a) The scatterplot is given below. The scatterplot shows a strong, negative, curved relationship between volume and pressure. (b) In this case the explanatory variable is the reciprocal of the volume and the response variable is the pressure. (c) Here the explanatory variable is the volume and the response variable is the reciprocal of the pressure.

12.35 (a) For transformation 1: ŷ = -0.08594 + 0.21√x where y is the period and x is the length. For transformation 2: ŷ² = -0.155 + 0.0428x where y is the period and x is the length. (b) For transformation 1: ŷ = -0.08594 + 0.21√80 = 1.792 seconds. For transformation 2: ŷ² = -0.155 + 0.0428(80) = 3.269 so ŷ = √3.269 = 1.808 seconds. (c) For transformation 1: The typical distance that a predicted value of the period will be from the actual value is about 0.046 seconds. For transformation 2: The typical distance that the square of a predicted value of the period will be from the square of the actual value is about 0.105 seconds-squared.
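To make the two back-transformations in Exercise 12.35 concrete, here is a minimal Python sketch. It assumes, following Exercise 12.33, that transformation 1 regresses period on the square root of length and transformation 2 regresses period squared on length. The function names are hypothetical.

```python
import math


def period_from_sqrt_length(length):
    """Transformation 1: period is linear in the square root of length."""
    return -0.08594 + 0.21 * math.sqrt(length)


def period_from_squared_model(length):
    """Transformation 2: the line predicts period squared, so take the
    square root of the fitted value to get back to seconds."""
    period_squared = -0.155 + 0.0428 * length
    return math.sqrt(period_squared)


pred_1 = period_from_sqrt_length(80)
pred_2 = period_from_squared_model(80)

print(f"transformation 1: {pred_1:.3f} seconds")
print(f"transformation 2: {pred_2:.3f} seconds")
```

The two models give slightly different predictions because each was fit on a different scale. Neither is forced to agree with the other.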


12.36 (a) For transformation 1: ŷ = 0.3677 + 15.8994(1/x) where y is the pressure and x is the volume. For transformation 2: 1/ŷ = 0.1002 + 0.0398x where y is the pressure and x is the volume. (b) For transformation 1: ŷ = 0.3677 + 15.8994(1/17) = 1.303 atmospheres. For transformation 2: 1/ŷ = 0.1002 + 0.0398(17) = 0.7768 so ŷ = 1/0.7768 = 1.287 atmospheres. (c) For transformation 1: The typical distance that a predicted value of the pressure will be from the actual value is about 0.044 atmospheres. For transformation 2: The typical distance that the reciprocal of the predicted value of pressure will be from the reciprocal of the actual value is about 0.00355 atmospheres⁻¹.

12.37 (a) The scatterplot is below. The relationship is strong, negative, and slightly curved with one outlier in the top left hand corner. (b) Since the graph of the explanatory variable against the natural log of the response is fairly linear, an exponential model would be reasonable. (c) ln ŷ = 5.973 - 0.218x where y is the count of surviving bacteria and x is time in minutes. (d) ln ŷ = 5.973 - 0.218(17) = 2.267 so ŷ = e^2.267 = 9.65, or 965 bacteria. Since the residual plot shows a random scatter around the value 0, we'd expect this prediction to be about right.
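The prediction in Exercise 12.37(d) is a two-step calculation: predict on the natural-log scale, then exponentiate. A short Python sketch of those steps follows. The function name `predict_bacteria` is a hypothetical label, and the coefficients are the ones quoted in part (c).

```python
import math


def predict_bacteria(minutes):
    """Return (ln of predicted count, predicted count) from the
    fitted line ln(y-hat) = 5.973 - 0.218 x."""
    log_prediction = 5.973 - 0.218 * minutes
    return log_prediction, math.exp(log_prediction)


log_pred, count = predict_bacteria(17)

print(f"ln(y-hat) = {log_pred:.3f}")
print(f"y-hat = {count:.2f}")
```

Undoing the logarithm with `math.exp` is what turns the straight line on the log scale into an exponential curve on the original scale.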


12.38 (a) The scatterplot is given below. The relationship is strong, negative, and slightly curved with no outliers. (b) Since the graph of the explanatory variable against the natural log of the response is fairly linear, an exponential model would be reasonable. (c) ln ŷ = 6.789 - 0.333x where y is the light intensity and x is the depth. (d) ln ŷ = 6.789 - 0.333(12) = 2.793 so ŷ = e^2.793 = 16.33 lumens. If the actual light intensity was 16.2, that means that the natural log of the actual value is ln(16.2) = 2.785. This gives a residual of 2.785 - 2.793 = -0.008. The value of s for this model is 0.00006, so the residual for this point is quite large.
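Because the model in Exercise 12.38 was fit to ln(intensity), its residuals live on the log scale: the residual is ln(actual) minus the predicted ln value, not a difference in lumens. The Python sketch below illustrates this with the numbers from part (d). The function name is hypothetical.

```python
import math


def log_scale_residual(actual_intensity, depth):
    """Residual for the fitted line ln(y-hat) = 6.789 - 0.333 x,
    measured on the natural-log scale."""
    predicted_log = 6.789 - 0.333 * depth
    return math.log(actual_intensity) - predicted_log


predicted_intensity = math.exp(6.789 - 0.333 * 12)
residual = log_scale_residual(16.2, 12)

print(f"predicted intensity = {predicted_intensity:.2f} lumens")
print(f"log-scale residual = {residual:.3f}")
```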

12.39 The equation for the regression line is $\log \hat{y} = 1.01 + 0.72 \log x$, where x is body weight in kg and y is brain weight in g. So if Sasquatch has a body weight of 127 kg, we find that $\log \hat{y} = 1.01 + 0.72\log(127) = 2.525$. This means that $\hat{y} = 10^{2.525} = 334.97$ grams is the estimated brain weight of Sasquatch.

12.40 The equation for the regression line is $\ln \hat{y} = -2.00 + 2.42 \ln x$, where x is the diameter at breast height in cm and y is the aboveground biomass in kg. If a tree is 30 cm in diameter, then $\ln \hat{y} = -2.00 + 2.42\ln(30) = 6.231$. This means that $\hat{y} = e^{6.231} = 508.263$ kg is the total aboveground biomass of the tree.

12.41 (a) The power model would work better because the graph with both variables transformed is linear, whereas the graph with just the response variable transformed using a log still has quite a bit of curvature. (b) $\log \hat{y} = 1.9503 - 1.0481 \log x$, where y is the abundance (per 10,000 kg of prey) and x is the body mass (kg). (c) $\log \hat{y} = 1.9503 - 1.0481\log(92.5) = -0.1104$, so $\hat{y} = 10^{-0.1104} = 0.7755$ per 10,000 kg of prey. (d) The residual plot is randomly scattered about 0. The model looks like it is a good fit. There are no patterns that suggest lack of linearity or lack of equal variance. We still need to check for Normality of the residuals and think about how the data were collected to know for sure whether the regression model was appropriate.

12.42 (a) The exponential model would work better because the graph with only the response variable transformed is linear, whereas the graph with both variables transformed has curvature to it. (b) $\log \hat{y} = 0.4537 - 0.1172x$, where y is the height in feet and x is the bounce number. (c) $\log \hat{y} = 0.4537 - 0.1172(7) = -0.3667$, so $\hat{y} = 10^{-0.3667} = 0.4298$ feet. (d) The trend in the residual plot suggests that the residual would be positive, which means that our prediction would be too low.

12.43 (a) The scatterplot is given below.

There is a strong, positive, slightly curved relationship between height and distance. (b) Two scatterplots are given below.

The first scatterplot still shows curvature, which suggests that the exponential model is not the best. The second plot is more linear. This suggests that a power model is the appropriate model to use. (c) The Minitab output is given below.

The regression equation is
ln(distance) = 3.75 + 0.515 ln(height)

Predictor      Coef      SE Coef    T       P
Constant       3.75138   0.09623    38.98   0.000
ln(height)     0.51522   0.01481    34.78   0.000

S = 0.0139857   R-Sq = 99.8%   R-Sq(adj) = 99.7%

This gives the equation $\ln \hat{y} = 3.7514 + 0.5152 \ln x$, where y is the distance and x is the height. (d) If the ramp height was 700, $\ln \hat{y} = 3.7514 + 0.5152\ln(700) = 7.1265$ and $\hat{y} = e^{7.1265} = 1244.53$.
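Two pieces of the output above can be reproduced by hand: each T value is the coefficient divided by its standard error, and the prediction in part (d) comes from exponentiating the fitted log-log line. The sketch below assumes only the numbers printed in the Minitab output; the dictionary and function names are hypothetical.

```python
import math

# Values copied from the Minitab output in part (c).
coef = {"Constant": 3.75138, "ln(height)": 0.51522}
se_coef = {"Constant": 0.09623, "ln(height)": 0.01481}

# Each T value is Coef / SE Coef.
t_values = {name: coef[name] / se_coef[name] for name in coef}

def predict_distance(height, a=3.7514, b=0.5152):
    """Power model: ln(y-hat) = a + b * ln(height), back-transformed with e^x."""
    return math.exp(a + b * math.log(height))

prediction = predict_distance(700)

print({name: round(t, 2) for name, t in t_values.items()})
print(round(prediction, 2))
```

The recomputed T values may differ from the printed ones in the last decimal place, since the output shows coefficients and standard errors that have already been rounded.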


12.44 (a) The scatterplot is below.

There is a strong, positive, curved relationship. (b) Two scatterplots are given below.

The first scatterplot still shows curvature, which suggests that the exponential model is not the best. The second plot is more linear. This suggests that a power model is the appropriate model to use. (c) The Minitab output is given below.

The regression equation is
ln(weight) = - 0.314 + 3.14 ln(length)

Predictor      Coef      SE Coef    T       P
Constant       -0.3140   0.1958     -1.60   0.170
ln(length)     3.1387    0.1151     27.27   0.000

S = 0.353543   R-Sq = 99.3%   R-Sq(adj) = 99.2%

The equation is $\ln \hat{y} = -0.314 + 3.1387 \ln x$, where y is the weight of the heart and x is the length of the cavity of the left ventricle. (d) $\ln \hat{y} = -0.314 + 3.1387\ln(6.8) = 5.703$, so $\hat{y} = e^{5.703} = 299.77$ g.

12.45 c 12.46 e 12.47 e


12.48 c

12.49 (a) A point is considered to be an outlier if it is more than 1.5 IQR above $Q_3$. On a standard Normal curve, $Q_3 + 1.5\,IQR = 0.67 + 1.5(1.34) = 2.68$. Since $z = \frac{7 - 4.5}{0.9} = 2.78$, Marcella’s shower would be classified as an outlier. (b) The probability that she takes a 7 minute shower on any given day is $P(x > 7) = P\left(z > \frac{7 - 4.5}{0.9}\right) = P(z > 2.78) = 0.0027$. Then

P(shower time is 7 minutes or more on at least 2 of 10 days)
= 1 − P(shower is 7 minutes or more on 0 or 1 of 10 days)
$= 1 - (1 - 0.0027)^{10} - 10(0.0027)(1 - 0.0027)^{9} = 1 - 0.9733 - 0.0264 = 0.0003.$

(c) $P(\bar{x} > 5) = P\left(z > \frac{5 - 4.5}{0.9/\sqrt{10}}\right) = P(z > 1.76) = 0.0392.$
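The Normal and binomial calculations in parts (b) and (c) can be reproduced with the Python standard library. This is a sketch under the same assumptions the solution makes (mean 4.5 minutes, standard deviation 0.9 minutes, and independent days); the variable names are hypothetical, and z is rounded to two decimals to mirror the table lookup.

```python
from math import comb, sqrt
from statistics import NormalDist

MEAN, SD = 4.5, 0.9          # shower-time distribution used in the solution
std_normal = NormalDist()    # standard Normal distribution (mean 0, SD 1)

# (b) Probability that a single shower lasts more than 7 minutes.
z_day = round((7 - MEAN) / SD, 2)
p_long = 1 - std_normal.cdf(z_day)

# Binomial step: P(at least 2 of 10 days) = 1 - P(0 days) - P(1 day).
n = 10
p_zero_or_one = sum(comb(n, k) * p_long**k * (1 - p_long) ** (n - k) for k in (0, 1))
p_at_least_two = 1 - p_zero_or_one

# (c) The mean of 10 showers has standard deviation 0.9 / sqrt(10).
z_mean = round((5 - MEAN) / (SD / sqrt(n)), 2)
p_mean = 1 - std_normal.cdf(z_mean)

print(f"{p_long:.4f} {p_at_least_two:.4f} {p_mean:.4f}")
```

Small differences from the printed values in the fourth decimal place come from rounding the single-day probability before it is used in the binomial step.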

12.50 The sample was taken from people who had agreed to participate in Harris polls. It is very likely that people who have agreed ahead of time to participate in polls will have different characteristics from the general population of U.S. adults. So the sample will be representative only of the population of people who have agreed to participate in these polls, not the population of all U.S. adults. 12.51 (a) Answers will vary. A possible response would be that (1) it would be hard to find all of the individual teachers in the sample since they were spread out among 40 different sessions and (2) there may not have been many teachers in a particular type of subject and the AP stat teachers wanted to make sure that they included teachers in those fields. (b) A stratified random sample would allow the statistics teachers to make sure that all subjects were included in appropriate proportions. 12.52 (a) State: We want to estimate the actual proportion of all AP teachers attending this workshop who have tattoos at a 95% confidence level. Plan: We should use a one-sample z-interval for p if the conditions are satisfied. Checking the conditions: Random: the teachers were selected randomly. Normal: there were 23 successes (had tattoos) and 75 failures (did not have tattoos). Both are at least 10. Independent: the sample is less than 10% of the population of all AP teachers attending the workshop. The conditions are met. Do: A 95% confidence interval is given by

$0.235 \pm 1.96\sqrt{\frac{0.235(0.765)}{98}} = 0.235 \pm 0.084 = (0.151, 0.319)$. Conclude: We are 95% confident that the

interval from 0.151 to 0.319 captures the true proportion of AP teachers at this workshop who have tattoos. (b) State: We want to perform a test at the $\alpha = 0.05$ significance level of $H_0: p = 0.14$ versus $H_a: p \neq 0.14$, where p is the actual proportion of AP teachers at the workshop who have tattoos. Plan: If conditions are met, we should do a one-sample z test for the population proportion p. The conditions were checked in (a) and are met. Do: The sample proportion is $\hat{p} = 0.235$. The corresponding test statistic is $z = \frac{0.235 - 0.14}{\sqrt{\frac{0.14(0.86)}{98}}} = 2.71$. Since this is a two-sided test, the P-value is $2P(z > 2.71) = 2(0.0034) = 0.0068$. Conclude: Since our P-value is less than 0.05, we reject the null


hypothesis. It appears that the proportion of AP teachers at the workshop who have tattoos is not 0.14. (c) No. If we had two more successes, this would only have increased the sample proportion and made it further from the hypothesized proportion. We would not have changed our conclusion. If we had two more failures, the sample proportion would have changed to 0.230. This would have changed the z-statistic to 2.59, which would still lead us to reject the null hypothesis.

Chapter Review Exercises (page 793)

R12.1 (a) There is a moderately strong positive linear relationship between the thickness and the velocity. (b) $\hat{y} = 70.44 + 274.78x$, where y is the velocity and x is the thickness. (c) The predicted velocity is $\hat{y} = 70.44 + 274.78(0.4) = 180.352$ feet/second. The residual is 104.8 − 180.352 = −75.552, so the line overpredicts the velocity by 75.552 feet/second. (d) The linear model is appropriate. The scatterplot shows a linear relationship and the residual plot shows a random scatter of points. (e) Slope: For each additional inch of thickness, we expect the velocity to increase by 274.78 feet/second, on average. s: The typical prediction will vary from the actual value by about 56.36 feet/second. $r^2$: About 49.3% of the variation in velocity is explained by the linear relationship with thickness. Standard error of the slope: If we take many different random samples of 12 pistons and compute the least-squares regression line for each one, the estimated slope will vary from the slope of the population regression line for predicting velocity from thickness by about 88.18, on average.

R12.2 State: We want to perform a test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$, where $\beta$ is the true slope of the population regression line relating thickness to velocity. We will use $\alpha = 0.05$. Plan: If the conditions are met, we will do a t test for the slope $\beta$. We check the conditions. Linear: The scatterplot is approximately linear. Independent: There were 12 pistons observed. This is less than 10% of the possible pistons we could have observed. Normal: We are told that the Normal probability plot of the residuals is roughly linear. Equal Variance: The residual plot shows a roughly equal scatter for all x values. Random: The data come from a random sample. The conditions are met. Do: The test statistic is $t = \frac{b - 0}{s_b} = \frac{274.78}{88.18} = 3.116$. The t-distribution has 10 degrees of freedom, so the P-value is 0.0109.

Conclude: Since the P-value is less than 0.05, we reject the null hypothesis and conclude that there is convincing evidence of a linear relationship between thickness and velocity.

R12.3 State: We want to construct a 95% confidence interval for $\beta$, the true slope of the population regression line relating thickness to velocity. Plan: If the conditions are met, we will construct a t interval for the slope $\beta$. The conditions were checked in Exercise R12.2. Do: The sample size is 12, so the appropriate distribution has 10 degrees of freedom and $t^* = 2.228$. This leads to the confidence interval $274.78 \pm 2.228(88.18) = 274.78 \pm 196.465 = (78.315, 471.245)$. Conclude: We are 95% confident that the interval from 78.315 to 471.245 captures the true slope of the population regression line for predicting velocity from thickness. This is consistent with Exercise R12.2. In that case we


determined that 0 was not a plausible value for the slope of the population regression line. In this exercise we showed that 0 was not in the 95% confidence interval, again concluding that 0 is not a plausible value.

R12.4 There is clear curvature to the scatterplot and the residual plot. There may also be a problem with the equal variances condition.

R12.5 (a) The transformation did achieve linearity because the residuals are scattered randomly above and below the value 0 in the residual plot. (b) The equation is $\hat{y} = -0.000595 + 0.3\left(\frac{1}{x^2}\right)$, so for a bulb at a distance of 2.1 meters we would predict an intensity of $\hat{y} = -0.000595 + 0.3\left(\frac{1}{2.1^2}\right) = 0.0674$ candelas.
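For readers who want the intermediate arithmetic in part (b), the prediction can be expanded step by step. The intermediate values below are rounded and are not given in the original solution.

```latex
\begin{align*}
\hat{y} &= -0.000595 + 0.3\left(\frac{1}{2.1^2}\right) \\
        &= -0.000595 + \frac{0.3}{4.41} \\
        &\approx -0.000595 + 0.068027 \\
        &\approx 0.0674 \text{ candelas}
\end{align*}
```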

R12.6 (a) Answers will vary. A possible residual plot is given below.

(b) There is clear curvature to these data, so a straight line does not make sense as a model. (c) A power model is more appropriate in this case because the graph with both variables transformed is linear, whereas the graph with only y transformed still has curvature. (d) The regression equation is $\ln \hat{y} = 3.48 + 0.293 \ln x$, where y is the recall and x is the time. So for a time of 25 seconds we predict $\ln \hat{y} = 3.48 + 0.293\ln(25) = 4.423$ and $\hat{y} = e^{4.423} = 83.36$ percent of words recalled.
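The name "power model" in part (c) comes from what the fitted line looks like once the logs are undone: $e^{a + b \ln x} = e^a x^b$. The sketch below, with hypothetical function names and the rounded coefficients from part (d), computes the prediction both ways to show that they agree.

```python
import math

A, B = 3.48, 0.293  # intercept and slope of ln(y-hat) = A + B * ln(x)

def recall_via_logs(seconds):
    """Predict on the log scale, then exponentiate."""
    return math.exp(A + B * math.log(seconds))

def recall_via_power(seconds):
    """Equivalent power form: y-hat = e^A * x^B."""
    return math.exp(A) * seconds ** B

log_route = recall_via_logs(25)
power_route = recall_via_power(25)

print(round(log_route, 2), round(power_route, 2))
```

Carrying full precision through the exponent reproduces the 83.36 reported in the solution; exponentiating the rounded value 4.423 gives a result that differs slightly in the second decimal place.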

AP Statistics Practice Test (page 796)

T12.1 c. There is no sample size limitation. T12.2 b. This looks like a power model would make sense. T12.3 d. The appropriate distribution is t with $n - 2$ degrees of freedom.


T12.4 a. The correlation is the square root of $r^2$ and has the same sign as the slope. T12.5 d. The test statistic is the coefficient divided by the standard error of the coefficient. T12.6 d. The P-value refers to the population relationship, not the sample relationship. T12.7 e. With 67 degrees of freedom, the appropriate $t^*$ is 2.00. T12.8 d. These are the two outliers at the top of the graph. T12.9 d. For power models the graph that is linear is the logarithm of y against the logarithm of x. T12.10 c. $\log \hat{y} = -13.5 + 0.01(2020) = 6.7$, so $\hat{y} = 10^{6.7} = 5{,}011{,}872$.

T12.11 (a) $\hat{y} = 4.546 + 4.832x$, where y is the weight gain and x is the dose of growth hormone. (b) Slope: For each 1 mg increase in growth hormone we expect 4.832 grams of weight gain, on average. y-intercept: If a chicken is given no growth hormone we expect 4.546 g of weight gain. s: The typical predicted weight gain will vary from the actual weight gain by about 3.13 grams. Standard error of the slope: If we repeated this experiment many times, the estimated slope would vary from the population slope for predicting weight gain from amount of growth hormone by an average of 1.0164 g/mg. $r^2$: About 38.4% of the variability in weight gain is explained by the linear relationship with the amount of growth hormone given. (c) State: We want to perform a test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$, where $\beta$ is the true slope of the population regression line relating amount of growth hormone to weight gain. We will use $\alpha = 0.05$. Plan: If the conditions are met, we will do a t test for the slope $\beta$. We are told to assume that conditions are met. Do: The computer output reports a test statistic of 4.75 with a P-value of 0.0004. Conclude: Since the P-value is less than 0.05, we reject the null hypothesis and conclude that there is convincing evidence of a linear relationship between amount of growth hormone and weight gain. (d) State: We want to construct a 95% confidence interval for $\beta$, the true slope of the population regression line relating amount of growth hormone to weight gain. Plan: If the conditions are met, we will construct a t interval for the slope $\beta$. We are told to assume that conditions are met. Do: The sample size is 15, so the appropriate distribution has 13 degrees of freedom and $t^* = 2.160$. This leads to the confidence interval $4.8323 \pm 2.160(1.0164) = 4.8323 \pm 2.195 = (2.6373, 7.0273)$. Conclude: We are 95% confident that the interval from 2.6373 to 7.0273 captures the true slope of the population regression line for predicting weight gain from amount of growth hormone.

T12.12 (a) There is clear curvature evident in both the scatterplot and the residual plot. Also, there may be a problem with the equal variance condition as seen in the residual plot. (b) Option 1:

$\hat{y} = 2.078 + 0.0042597x^3$. We would predict the following number of board feet from a tree with a diameter of 30 inches: $\hat{y} = 2.078 + 0.0042597(30)^3 = 117.09$. Option 2: $\ln \hat{y} = 1.2319 + 0.113417x$. For $x = 30$ we get $\ln \hat{y} = 1.2319 + 0.113417(30) = 4.63441$ and $\hat{y} = e^{4.63441} = 102.967$ board feet. (c) The residual plot for Option 2 still shows a large amount of curvature, so the better option is Option 1.
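The two predictions in part (b) come from different transformations, so each needs its own back-calculation: Option 1 cubes the diameter, while Option 2 predicts the natural log of board feet and must be exponentiated. A minimal sketch with hypothetical function names, using the coefficients quoted above:

```python
import math

def option1_board_feet(diameter):
    """Option 1: y-hat = 2.078 + 0.0042597 * x^3 (cubed explanatory variable)."""
    return 2.078 + 0.0042597 * diameter ** 3

def option2_board_feet(diameter):
    """Option 2: ln(y-hat) = 1.2319 + 0.113417 * x, back-transformed with e^x."""
    return math.exp(1.2319 + 0.113417 * diameter)

pred1 = option1_board_feet(30)
pred2 = option2_board_feet(30)

print(round(pred1, 2), round(pred2, 3))
```

The gap between the two predictions is a reminder that the choice in part (c) rests on the residual plots, not on which number looks more plausible.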