chapter 10 solutions - west virginia universityghobbs/stat511hw/ips6e.ism.ch10.pdf · chapter 10...



Chapter 10 Solutions

10.1. The given model was µ_y = 40.5 − 2.5x, with standard deviation σ = 2.0. (a) The slope is −2.5. (b) When x increases by 1, µ_y decreases by 2.5. (Or equivalently, if x increases by 2, µ_y decreases by 5, etc.) (c) When x = 10, µ_y = 40.5 − 2.5(10) = 15.5. (d) Approximately 95% of observed responses would fall in the interval µ_y ± 2σ = 15.5 ± 2(2.0) = 15.5 ± 4.0 = 11.5 to 19.5.
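The arithmetic in parts (c) and (d) can be checked with a short Python sketch. The function and constant names are illustrative assumptions; only the coefficients and σ come from the exercise.

```python
# Population regression model from Exercise 10.1: mu_y = 40.5 - 2.5x, sigma = 2.0.
BETA_0 = 40.5
BETA_1 = -2.5
SIGMA = 2.0


def mean_response(x):
    """Mean response mu_y at a given x under the stated model."""
    return BETA_0 + BETA_1 * x


def central_95_range(x):
    """Approximate 95% range of responses: mu_y +/- 2*sigma."""
    mu = mean_response(x)
    return mu - 2 * SIGMA, mu + 2 * SIGMA


mu_10 = mean_response(10)
low, high = central_95_range(10)
print(mu_10, low, high)
```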

10.2. Example 10.4 gave the estimated regression equation as MPG = −7.796 + 7.874 LOGMPH, with s = 0.9995. In the text following that example, the parameter estimates were rounded to two decimal places: MPG = −7.80 + 7.87 LOGMPH, with s = 1.00. (a) If the car travels at 35 mph, then LOGMPH = ln 35 ≈ 3.5553, so we estimate MPG ≈ 20.1988 mpg (or 20.18 mpg, using the rounded estimates). (b) The residual is 21.0 − (estimated MPG) ≈ 0.8012 mpg (or 0.82, using the rounded estimates). (c) Because this regression line was based on speeds between about 12 and 53 mph, estimates near or outside those boundaries would not be very reliable; they grow less reliable the more we stray above 53 mph. Even the estimated mpg for 45 mph would be subject to lots of uncertainty because the points at the high end of the scatterplot exhibited more spread than those at the low end.

Note: Some students might mistakenly use the common (base-10) logarithm instead of the natural logarithm; if they do so, they will find LOGMPH ≈ 1.5441 and MPG ≈ 4.36 mpg. Hopefully, the large residual (16.6 mpg) in part (b) would help them notice their mistake; in general, we expect residuals to fall in the range ±3s, if they follow a Normal distribution.
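A minimal Python sketch of parts (a) and (b), including the base-10 mistake described in the note. The variable names are assumptions made for illustration; the coefficients and the observed 21.0 mpg are the values quoted above.

```python
import math

# Estimates quoted from Example 10.4 (unrounded version).
B0, B1 = -7.796, 7.874
OBSERVED_MPG = 21.0  # observed fuel efficiency used in part (b)


def predicted_mpg(log_mph):
    """Fitted value from the estimated regression line."""
    return B0 + B1 * log_mph


# Correct calculation: natural logarithm of the speed.
pred_ln = predicted_mpg(math.log(35))
resid_ln = OBSERVED_MPG - pred_ln

# The mistake described in the note: common (base-10) logarithm.
pred_log10 = predicted_mpg(math.log10(35))
resid_log10 = OBSERVED_MPG - pred_log10

print(f"ln:    prediction {pred_ln:.4f}, residual {resid_ln:.4f}")
print(f"log10: prediction {pred_log10:.2f}, residual {resid_log10:.1f}")
```

The base-10 residual is far outside ±3s (about ±3 mpg), which is the warning sign the note mentions.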

10.3. Example 10.6 gives the confidence interval 7.16 to 8.58 for the slope β_1. Recall that slope is the change in y (i.e., MPG) when x (i.e., LOGMPH) changes by +1. (a) If LOGMPH increases by 1, we expect MPG to change by β_1, so the 95% confidence interval for the change is (an increase of) 7.16 to 8.58 mpg. (b) If LOGMPH decreases by 1, we expect MPG to change by −β_1, so the 95% confidence interval for the change is −7.16 to −8.58 mpg; that is, a decrease of 7.16 to 8.58 mpg. (c) If LOGMPH increases by 0.5, we expect MPG to change by 0.5β_1, so the 95% confidence interval for the change is (an increase of) 3.58 to 4.29 mpg.
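All three parts rescale the same interval for β_1 by the change in LOGMPH. A small Python sketch of that rescaling follows; `scale_interval` is a hypothetical helper, not something from the text.

```python
def scale_interval(interval, factor):
    """Confidence interval for factor * beta_1, given an interval for beta_1.

    Multiplying by a negative factor reverses the endpoints, so the result
    is re-sorted to keep (low, high) order.
    """
    low, high = interval
    a, b = low * factor, high * factor
    return (min(a, b), max(a, b))


slope_ci = (7.16, 8.58)  # interval for beta_1 from Example 10.6

increase_by_one = scale_interval(slope_ci, 1)
decrease_by_one = scale_interval(slope_ci, -1)
increase_by_half = scale_interval(slope_ci, 0.5)

print(increase_by_one, decrease_by_one, increase_by_half)
```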

10.4. Example 10.10 gives the 95% prediction interval ŷ ± t* SE_ŷ as 17.0 to 21.0, or 19.0 ± 2.0. For df = 58, the critical value t* is very close to 2, so SE_ŷ ≈ 1.0. When x = 40 mph, SE_ŷ would be larger because predictions are less accurate for values of x near the extremes (relative to the range of speeds used to determine the regression formula, which was approximately 12 to 53 mph).

Note: Working from the original data set, SE_ŷ equals 1.01 when x = 30 mph, and 1.02 when x = 40 mph.


10.5. (a) The plot suggests a linear increase. (b) The regression equation is ŷ = −3271.9667 + 1.65x. (c) The fitted values and residuals are given in the table below. Squaring the residuals and summing gives about 0.001667, so the standard error is:

\[ s = \sqrt{\frac{0.001667}{n - 2}} = \sqrt{0.001667} \approx 0.04082 \]

(d) Given x (the year), spending comes from a N(µ_y, σ) distribution, where µ_y = β_0 + β_1x. The estimates of β_0, β_1, and σ are b_0 ≈ −3271.9667, b_1 ≈ 1.65, and s ≈ 0.04082. (e) We first note that x̄ = 2000 and Σ(x_i − x̄)² = 2, so SE_b1 = s/√2 ≈ 0.02887. We have df = n − 2 = 1, so t* = 12.71, and the 95% confidence interval for β_1 is 1.65 ± t* SE_b1 ≈ 1.283 to 2.017. This gives the rate of increase of R&D spending: between 1.283 and 2.017 billion dollars per year.

Year   Spending ($billions)   Fitted values   Residuals
1999   26.4                   26.3833          0.0167
2000   28.0                   28.0333         −0.0333
2001   29.7                   29.6833          0.0167

[Figure: scatterplot of spending (billions of dollars) against year, 1999 to 2001.]
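The whole of 10.5 can be reproduced from the three data points. The sketch below uses the usual least-squares formulas in plain Python; the function name `fit_line` is an assumption, and the critical value 12.71 is taken from the solution rather than computed.

```python
import math

years = [1999, 2000, 2001]
spending = [26.4, 28.0, 29.7]  # billions of dollars


def fit_line(x, y):
    """Least-squares intercept, slope, regression standard error s, and SE of the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    se_b1 = s / math.sqrt(sxx)
    return b0, b1, s, se_b1


b0, b1, s, se_b1 = fit_line(years, spending)

T_STAR = 12.71  # 95% critical value for df = 1, as quoted in the solution
ci_low = b1 - T_STAR * se_b1
ci_high = b1 + T_STAR * se_b1

print(f"b0 = {b0:.4f}, b1 = {b1:.2f}, s = {s:.5f}, SE_b1 = {se_b1:.5f}")
print(f"95% CI for the slope: {ci_low:.3f} to {ci_high:.3f}")
```

With only one degree of freedom the critical value is very large, which is why the interval is wide even though the residuals are tiny.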

10.6. (a) The variables x and y are reversed: Slope gives the change in y for a change in x. (b) The population regression line has intercept β_0 and slope β_1 (not b_0 and b_1). (c) The estimate µ̂_y = b_0 + b_1x* is more accurate when x* is close to x̄, so the width of the confidence interval grows with (x* − x̄)².

10.7. (a) The parameters are β_0, β_1, and σ; b_0, b_1, and s are the estimates of those parameters. (b) H_0 should refer to β_1 (the population slope) rather than b_1 (the estimated slope). (c) The confidence interval will be narrower than the prediction interval because the confidence interval accounts only for the uncertainty in our estimate of the mean response, while the prediction interval must also account for the random error of an individual response.

10.8. The table below gives two sets of answers: those found with critical values from Table D, and those found with software. In each case, the margin of error is t* SE_b1 = 6.31t*, with df = n − 2.

                  Table D                        Software
      df   b_1    t*       Interval              t*       Interval
(a)   23   12.1   2.069    −0.9554 to 25.1554    2.0687   −0.9532 to 25.1532
(b)   23    6.1   2.069    −6.9554 to 19.1554    2.0687   −6.9532 to 19.1532
(c)   98   12.1   1.990*   −0.4569 to 24.6569    1.9845   −0.4220 to 24.6220

*Note that for (c), if we use Table D, we take df = 80.
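The Table D intervals follow directly from b_1 ± t* SE_b1. A brief Python sketch, assuming the critical values are typed in from the table rather than computed; the dictionary layout and names are illustrative.

```python
SE_B1 = 6.31  # standard error of the slope given in the exercise


def slope_interval(b1, t_star, se=SE_B1):
    """Confidence interval b1 +/- t* * SE for the slope."""
    margin = t_star * se
    return b1 - margin, b1 + margin


# (b1, Table D critical value) for each part; part (c) uses df = 80.
cases = {"a": (12.1, 2.069), "b": (6.1, 2.069), "c": (12.1, 1.990)}
intervals = {label: slope_interval(b1, t) for label, (b1, t) in cases.items()}

for label, (lo, hi) in intervals.items():
    print(f"({label}) {lo:.4f} to {hi:.4f}")
```

The software columns use critical values with more decimal places than are shown, so recomputing them from the displayed t* may differ in the last digit.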


268 Chapter 10 Inference for Regression

10.9. The test statistic is t = b_1/SE_b1 = b_1/6.31, with df = n − 2. All three tests fail to produce significant evidence against H_0, although we are close in (a) and (c). This is consistent with the confidence intervals from the previous exercise.

      df   b_1    t      P (Table D)         P (software)
(a)   23   12.1   1.92   0.05 < P < 0.10     0.0677
(b)   23    6.1   0.97   0.30 < P < 0.40     0.3437
(c)   98   12.1   1.92   0.05 < P < 0.10*    0.0581

*Note that for (c), if we use Table D, we take df = 80.
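The t statistics, and the conclusion that none is significant at the 5% level, can be sketched as follows. The two-sided 5% critical values are the Table D entries for df = 23 and df = 80; the variable names are assumptions.

```python
SE_B1 = 6.31

slopes = {"a": 12.1, "b": 6.1, "c": 12.1}
# Two-sided 5% critical values from Table D (df = 23, 23, and 80).
critical = {"a": 2.069, "b": 2.069, "c": 1.990}

t_stats = {label: b1 / SE_B1 for label, b1 in slopes.items()}
significant = {label: abs(t) > critical[label] for label, t in t_stats.items()}

for label in slopes:
    print(f"({label}) t = {t_stats[label]:.2f}, significant at 5%: {significant[label]}")
```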

10.10. (a) The plot (below, left) shows a strong linear relationship with no striking outliers. (b) The regression line (shown on the plot) is ŷ = 1059 + 1.3930x. (c) In the plot (below, right), it appears that for large x values, many residuals are negative. (d) A stemplot (shown) or histogram suggests a slight left skew. (e) To test for a relationship, we test H_0: β_1 = 0 vs. H_a: β_1 ≠ 0 (or equivalently, use ρ in place of β_1). (f) The test statistic and P-value are given in the Minitab output below: t ≈ 15.50, P < 0.0001. We have strong evidence of a non-zero slope.

Stemplot of residuals:

−1 | 5
−1 |
−1 | 1
−0 | 9
−0 | 766
−0 | 5
−0 | 3332
−0 | 1100
 0 | 0011
 0 | 2222
 0 | 55
 0 | 66677
 0 | 99

Minitab output
The regression equation is y2005 = 1059 + 1.39 y2000

Predictor   Coef      Stdev     t-ratio   p
Constant    1059.0    396.8     2.67      0.012
y2000       1.39301   0.08988   15.50     0.000

s = 626.6   R-sq = 88.9%   R-sq(adj) = 88.5%

[Figure (left): scatterplot of 2005 tuition and fees against 2000 tuition and fees, with the regression line.]
[Figure (right): residuals against 2000 tuition and fees.]

10.11. (a) From the Minitab output above, we have SE_b1 ≈ 0.08988. With df = 30, t* = 2.042, so the 95% confidence interval for β_1 is 1.3930 ± t* SE_b1 ≈ 1.2095 to 1.5765. This slope means that a $1 difference in tuition in 2000 changes 2005 tuition by between $1.21 and $1.58, so we estimate that tuition increased by 21% to 58%. (b) When x = 5000, the estimated 2005 tuition is ŷ = 1059 + 1.3930(5000) ≈ $8024. (c) The Minitab output below gives the interval $6717 to $9331. If your software will not report this, here are the details of how to produce this interval (not for the faint of heart): We note that s_x ≈ $1252.17, so Σ(x_i − x̄)² = s_x²(n − 1) ≈ 48,605,492. With x* = $5000, x̄ ≈ $4238.69, s = $626.6, and n = 32, we have:

\[ \mathrm{SE}_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \approx 640 \]

With t* = 2.042, this gives the prediction interval $8024 ± (2.042)($640) ≈ $8024 ± $1307 = $6717 to $9331.

Note: The value "Stdev.Fit" given by Minitab (below) as 130 is SE_µ̂. Observe that SE_ŷ² = s² + SE_µ̂², reminiscent of the Pythagorean theorem, so that SE_ŷ can be computed rather easily if s and SE_µ̂ are both known.

Minitab output
 Fit   Stdev.Fit   95.0% C.I.     95.0% P.I.
8024   130         (7758, 8290)   (6717, 9331)
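The by-hand calculation in part (c) and the relationship in the note can be checked numerically. This is a sketch using only the summary values quoted in the solution; the variable names are assumptions.

```python
import math

# Summary values quoted in the solution to part (c).
s = 626.6          # regression standard error
n = 32
x_bar = 4238.69    # mean 2000 tuition
x_star = 5000      # 2000 tuition at which we predict
sxx = 48_605_492   # sum of squared deviations of x
t_star = 2.042     # 95% critical value for df = 30
y_hat = 8024       # fitted 2005 tuition at x_star

leverage_term = 1 / n + (x_star - x_bar) ** 2 / sxx

se_mu = s * math.sqrt(leverage_term)        # SE of the estimated mean response
se_pred = s * math.sqrt(1 + leverage_term)  # SE for predicting one response

pi_low = y_hat - t_star * se_pred
pi_high = y_hat + t_star * se_pred

print(f"SE_mu = {se_mu:.1f}, SE_pred = {se_pred:.1f}")
print(f"95% prediction interval: {pi_low:.0f} to {pi_high:.0f}")
```

The two standard errors satisfy se_pred² = s² + se_mu², which is the "Pythagorean" identity mentioned in the note.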

10.12. (a) The scatterplot shows a fairly strong positive linear association, with no extreme outliers, so regression seems to be appropriate. (b) The regression equation (shown on the scatterplot) is ŷ = 11.81 + 0.7754x. (c) Student summaries will vary. The mpg values are certainly similar, but one notable difference is that all but three of the computer values are higher than the driver's values, and the mean computer mpg is about 2.7 mpg higher than the mean driver mpg. Additionally, the slope of the regression line is about 0.78, meaning that (on average) a 1 mpg change in the driver's value corresponds to a 0.78 mpg change for the computer. The intercept, however, is about 11.8 mpg, suggesting that the computer's value is generally higher when the driver's value is small.

[Figure: scatterplot of computer's mpg against driver's mpg, with the regression line.]

Minitab output
The regression equation is Computer = 11.8 + 0.775 Driver

Predictor   Coef     Stdev    t-ratio   p
Constant    11.812   5.432    2.17      0.043
Driver      0.7754   0.1335   5.81      0.000

s = 2.676   R-sq = 65.2%   R-sq(adj) = 63.3%


10.13. (a) The regression equation is ŷ = −0.0127 + 0.0180x, and r² ≈ 80.0%. Not surprisingly, we find that BAC increases as beer consumption increases; the relationship is quite strong, with beer consumption explaining 80% of the variation in BAC. (b) To test H_0: β_1 = 0 vs. H_a: β_1 > 0, we find t = 7.48 and P < 0.0001. There is very strong evidence that drinking more beers increases BAC. (c) The predicted mean BAC for x = 5 beers is 0.07712; the 90% prediction interval is 0.040 to 0.114. Steve might be safe, but cannot be sure that his BAC will be below 0.08.

Note: For (c), we use a prediction interval (rather than a confidence interval) because we are interested in a range of values for an individual BAC after 5 beers, rather than the mean BAC. The first printing of the text asked for a confidence interval, and the answer given in the back of the first printing was the 90% confidence interval: 0.06808 to 0.08616.

[Figure: scatterplot of blood alcohol content against beers.]

Minitab output
The regression equation is BAC = - 0.0127 + 0.0180 Beers

Predictor   Coef       Stdev      t-ratio   p
Constant    -0.01270   0.01264    -1.00     0.332
Beers       0.017964   0.002402   7.48      0.000

s = 0.02044   R-sq = 80.0%   R-sq(adj) = 78.6%

Fit       Stdev.Fit   90.0% C.I.           90.0% P.I.
0.07712   0.00513     (0.06808, 0.08616)   (0.03999, 0.11425)
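As a rough check on part (c), the sketch below recomputes the fitted value from the Minitab coefficients, combines s and Stdev.Fit into the prediction standard error, and backs out the critical value implied by the reported prediction interval. All names are illustrative assumptions; no critical value is taken from outside the output.

```python
import math

# Values read from the Minitab output for Exercise 10.13.
b0, b1 = -0.01270, 0.017964
s = 0.02044                      # regression standard error
se_mu = 0.00513                  # "Stdev.Fit" at 5 beers
pi_reported = (0.03999, 0.11425) # 90% prediction interval from Minitab
LEGAL_LIMIT = 0.08

fit = b0 + b1 * 5
se_pred = math.sqrt(s ** 2 + se_mu ** 2)

# Critical value implied by the reported upper endpoint.
implied_t = (pi_reported[1] - fit) / se_pred

limit_inside_interval = pi_reported[0] < LEGAL_LIMIT < pi_reported[1]

print(f"fit = {fit:.5f}, SE_pred = {se_pred:.5f}, implied t* = {implied_t:.3f}")
print("0.08 lies inside the prediction interval:", limit_inside_interval)
```

Because 0.08 falls inside the interval, the data cannot rule out a BAC above the limit, which matches the conclusion about Steve.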

10.14. (a) Stemplots are shown on the right. x (watershed area) is right-skewed; x̄ ≈ 28.2857 km², s_x ≈ 17.7142 km². y (IBI) is left-skewed; ȳ ≈ 65.9388, s_y ≈ 18.2796. (b) The scatterplot (next page, left) shows a weak positive association, with more scatter in y for small x. (c) y_i = β_0 + β_1x_i + ε_i, i = 1, 2, ..., 49; ε_i are independent N(0, σ) variables. (d) The hypotheses are H_0: β_1 = 0 vs. H_a: β_1 ≠ 0. (e) See the Minitab output on the next page. The regression equation is IBI = 52.92 + 0.4602 Area, and the estimated standard deviation is s ≈ 16.53. For testing the hypotheses in (d), t = 3.42 and P = 0.001. (f) The residual plot (next page, right) again shows that there is more variation for small x. (g) As we can see from a stemplot and/or a Normal quantile plot (both on the next page), the residuals are somewhat left-skewed but otherwise seem reasonably close to Normal. (h) Student opinions may vary. The two apparent deviations from the model are (i) a possible change in standard deviation as x changes and (ii) possible non-Normality of error terms.

Area

0 | 2
0 | 5688999
1 | 0024
1 | 66889
2 | 111133
2 | 66667889
3 | 112244
3 | 9
4 |
4 | 799
5 | 244
5 | 789
6 |
6 | 9
7 | 0

IBI

2 | 99
3 | 233
3 | 9
4 | 13
4 | 67
5 | 34
5 | 556899
6 | 0124
6 | 7
7 | 11124
7 | 56889
8 | 001222344
8 | 556899
9 | 1


[Figure (left): scatterplot of IBI against watershed area (km²).]
[Figure (right): residuals against watershed area (km²).]

Minitab output
The regression equation is IBI = 52.9 + 0.460 Area

Predictor   Coef     Stdev    t-ratio   p
Constant    52.923   4.484    11.80     0.000
Area        0.4602   0.1347   3.42      0.001

s = 16.53   R-sq = 19.9%   R-sq(adj) = 18.2%

Stemplot of residuals:

−3 | 2200
−2 | 8
−2 | 42
−1 | 9665
−1 | 3
−0 | 8885
−0 | 433100
 0 | 223334
 0 | 666789
 1 | 022334
 1 | 6799
 2 | 0024
 2 | 5

[Figure: Normal quantile plot of the residuals (residual against z score).]

10.15. (a) The stemplot of percent forested is shown on the right; see the solution to the previous exercise for the stemplot of IBI. x (percent forested) is right-skewed; x̄ = 39.3878%, s_x = 32.2043%. y (IBI) is left-skewed; ȳ = 65.9388, s_y = 18.2796. (b) The scatterplot (next page, left) shows a weak positive association, with more scatter in y for small x. (c) y_i = β_0 + β_1x_i + ε_i, i = 1, 2, ..., 49; ε_i are independent N(0, σ) variables. (d) The hypotheses are H_0: β_1 = 0 vs. H_a: β_1 ≠ 0. (e) See the Minitab output on the next page. The regression equation is IBI = 59.91 + 0.1531 Forest, and the estimated standard deviation is s ≈ 17.79. For testing the hypotheses in (d), t = 1.92 and P = 0.061. (f) The residual plot (next page, right) shows a slight curve: the residuals seem to be (very) slightly lower in the middle and higher on the ends. (g) As we can see from a stemplot and/or a Normal quantile plot (both on the next page), the residuals are left-skewed. (h) Student opinions may vary. The three apparent deviations from the model are (i) a possible change in standard deviation as x changes, (ii) possible curvature of residuals, and (iii) possible non-Normality of error terms.

Percent forested

 0 | 00000033789
 1 | 0014778
 2 | 125
 3 | 123339
 4 | 133799
 5 | 229
 6 | 38
 7 | 599
 8 | 069
 9 | 055
10 | 00


[Figure (left): scatterplot of IBI against percent forested.]
[Figure (right): residuals against percent forested.]

Minitab output
The regression equation is IBI = 59.9 + 0.153 Forest

Predictor   Coef      Stdev     t-ratio   p
Constant    59.907    4.040     14.83     0.000
Forest      0.15313   0.07972   1.92      0.061

s = 17.79   R-sq = 7.3%   R-sq(adj) = 5.3%

Stemplot of residuals:

−3 | 55
−3 | 4
−2 | 988
−2 | 0
−1 | 985
−1 | 2110
−0 | 99887
−0 | 410
 0 | 134
 0 | 557899
 1 | 01122333
 1 | 55678
 2 | 044
 2 | 78

[Figure: Normal quantile plot of the residuals (residual against z score).]

10.16. The first model (using watershed area to predict IBI) is preferable because the regression was significant (P = 0.001 vs. P = 0.061) and explained a higher proportion of the variation in IBI (19.9% vs. 7.3%).

10.17. The precise results of these changes depend on which observation is changed. (There are six observations which had 0% forest and two which had 100% forest.) Specifically, if we change IBI to 0 for one of the first six observations, the resulting P-value is between 0.019 (observation 6) and 0.041 (observation 3). Changing one of the last two observations changes the P-value to 0.592 (observation 48) or 0.645 (observation 49).

In general, the first change decreases P (that is, the relationship is more significant) because it accentuates the positive association. The second change weakens the association, so P increases (the relationship is less significant).


10.18. With the regression equation IBI = 52.92 + 0.4602 Area, the predicted mean response when x = Area = 30 km² is µ̂_y = (predicted IBI) ≈ 66.73. While it is possible to find SE_µ̂ and SE_ŷ using the formulas from Section 10.2, we rely on the software output shown below. (SE_µ̂ ≈ 2.37, reported by Minitab as "Stdev.Fit," and SE_ŷ = √(s² + SE_µ̂²) ≈ 16.70, where s ≈ 16.53 was given in the Minitab output shown with the solution to Exercise 10.14. For df = 47, the appropriate critical value is t* = 2.0117.) (a) The 95% confidence interval for µ_y is 61.95 to 71.50. (b) The 95% prediction interval for a future response is 33.12 to 100.34. (c) Among many streams with watershed area 30 km², we estimate the mean IBI to be between about 61.95 and 71.50. For an individual stream with watershed area 30 km², we expect its IBI to be between about 33.12 and 100.34. (d) We probably cannot reliably apply these results elsewhere; it is likely that the particular characteristics of the Ozark Highland region play some role in determining the regression coefficients.

Minitab output
Fit     Stdev.Fit   95.0% C.I.       95.0% P.I.
66.73   2.37        (61.95, 71.50)   (33.12, 100.34)
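Both intervals can be approximated from the rounded summary values in the solution. The sketch below is only a check under that assumption: because the inputs are rounded, the endpoints agree with Minitab to within a few hundredths rather than exactly.

```python
import math

# Rounded values quoted in the solution to Exercise 10.18.
fit = 66.73      # predicted mean IBI at Area = 30
se_mu = 2.37     # "Stdev.Fit"
s = 16.53        # regression standard error from Exercise 10.14
t_star = 2.0117  # 95% critical value for df = 47

se_pred = math.sqrt(s ** 2 + se_mu ** 2)

ci = (fit - t_star * se_mu, fit + t_star * se_mu)      # mean response
pi = (fit - t_star * se_pred, fit + t_star * se_pred)  # individual response

print(f"SE_pred = {se_pred:.2f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f}")
print(f"95% PI: {pi[0]:.2f} to {pi[1]:.2f}")
```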

10.19. Using Area = 10 in the model IBI = 52.92 + 0.4602 Area from Exercise 10.14, the predicted IBI ≈ 57.52. Using Forest = 25 in the model IBI = 59.91 + 0.1531 Forest from Exercise 10.15, the predicted IBI ≈ 63.74. Both predictions have a lot of uncertainty; recall that r² was fairly small for both models. Also note that the prediction intervals (shown below) are both about 70 units wide.

Minitab output
--- IBI predicted from watershed area ---
Fit     Stdev.Fit   95.0% C.I.       95.0% P.I.
57.52   3.41        (50.66, 64.39)   (23.55, 91.50)

--- IBI predicted from percent forest ---
Fit     Stdev.Fit   95.0% C.I.       95.0% P.I.
63.74   2.79        (58.13, 69.35)   (27.51, 99.97)
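A short sketch of the two predictions and the prediction-interval widths. The function names are hypothetical; the coefficients and interval endpoints are copied from the solutions and Minitab output.

```python
def predict_ibi_from_area(area_km2):
    """Fitted IBI from the watershed-area model of Exercise 10.14."""
    return 52.92 + 0.4602 * area_km2


def predict_ibi_from_forest(percent_forest):
    """Fitted IBI from the percent-forest model of Exercise 10.15."""
    return 59.91 + 0.1531 * percent_forest


ibi_area = predict_ibi_from_area(10)
ibi_forest = predict_ibi_from_forest(25)

# 95% prediction intervals as reported by Minitab.
pi_area = (23.55, 91.50)
pi_forest = (27.51, 99.97)
width_area = pi_area[1] - pi_area[0]
width_forest = pi_forest[1] - pi_forest[0]

print(f"Area model:   {ibi_area:.2f}, PI width {width_area:.2f}")
print(f"Forest model: {ibi_forest:.2f}, PI width {width_forest:.2f}")
```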

10.20. (a) β_0 is the population intercept, 4.6. This says that the mean overseas return is 4.6% when the U.S. return is 0%. (b) β_1 is the population slope, 0.67. This says that when the U.S. return changes by 1%, the mean overseas return changes by 0.67%. (c) The full model is y_i = 4.6 + 0.67x_i + ε_i, where y_i and x_i are observed overseas and U.S. returns in a given year, and ε_i are independent N(0, σ) variables. The residual terms ε_i allow for variation in overseas returns when U.S. returns remain the same.

10.21. (a) The stemplots (top of next page, left) are fairly symmetric. For x (MOE), x̄ ≈ 1,799,180 and s_x ≈ 329,253; for y (MOR), ȳ ≈ 11,185 and s_y ≈ 1980. (b) The plot (top of next page, right) shows a moderately strong, positive, linear relationship. Because we would like to predict MOR from MOE, we should put MOE on the x axis. (c) The model is y_i = β_0 + β_1x_i + ε_i, i = 1, 2, ..., 32; ε_i are independent N(0, σ) variables. The regression equation is MOR = 2653 + 0.004742 MOE, s ≈ 1238. The slope is significantly different from 0: t = 7.02 (df = 30), P < 0.0001. (d) Assumptions appear to be met: A stemplot of the residuals shows one slightly low (not quite an outlier), but acceptable, and the plot of residuals against MOE (not shown) does not suggest any particular pattern.


MOE

11 | 6
12 |
13 | 55
14 | 1578
15 | 5589
16 | 14
17 | 2479
18 | 447
19 | 358
20 | 0348
21 | 8
22 | 1
23 | 47
24 |
25 | 3

MOR

 6 | 3
 7 |
 8 | 3588
 9 | 222
10 | 22356
11 | 223455799
12 | 00777
13 | 469
14 | 5
15 | 3

Residuals

−3 | 3
−2 |
−2 |
−1 | 6
−1 | 31110
−0 | 76555
−0 | 43221
 0 | 00223
 0 | 78
 1 | 1334
 1 | 599
 2 | 1

[Figure: scatterplot of MOR (thousands) against MOE (millions).]

10.22. (a) The 95% confidence interval gives a range of values for the mean MOR of many pieces of wood with MOE equal to 2,000,000. The prediction interval gives a range of values for the MOR of one piece of wood with MOE equal to 2,000,000. (b) The prediction interval will include more values because the confidence interval accounts only for the uncertainty in our estimate of the mean response, while the prediction interval must also account for the random error of an individual response. (c) With the regression equation MOR = 2653 + 0.004742 MOE, the predicted mean response when x = MOE = 2,000,000 is µ̂_y = (predicted MOR) ≈ 12,137. The Minitab output below gives the two intervals, along with SE_µ̂ ("Stdev.Fit").

Minitab output
Fit     Stdev.Fit   95.0% C.I.       95.0% P.I.
12137   257         (11611, 12663)   (9554, 14720)

10.23. (a) The plot (top of next page, left) is roughly linear and increasing. (b) The number of tornadoes in 2004 (1819) is noticeably high. (c) The regression equation is ŷ ≈ −28,516 + 14.86x; both the slope and intercept are significantly different from 0. In the Minitab output that follows, we see SE_b1 ≈ 1.527. With t* = 2.0076 for df = 51, the confidence interval for β_1 is b_1 ± t* SE_b1 = 14.86 ± 3.07 ≈ 11.79 to 17.93 tornadoes per year. (d) Apart from the large residual for 2004, there are no striking features in the plot (top of next page, right). (e) Based on a stemplot (right), the 2004 residual is an outlier; the other residuals appear to be roughly Normal.

Stemplot of residuals:

−3 | 5200
−2 | 3
−1 | 9843310
−0 | 9887654443211110
 0 | 001223556778
 1 | 001124
 2 | 001789
 3 |
 4 |
 5 | 5


[Figure (left): scatterplot of tornadoes against year, 1950 to 2004.]
[Figure (right): residuals against year.]

Minitab output
The regression equation is Tornado = -28516 + 14.9 Year

Predictor   Coef     Stdev   t-ratio   p
Constant    -28516   3022    -9.44     0.000
Year        14.862   1.527   9.73      0.000

s = 170.0   R-sq = 65.0%   R-sq(adj) = 64.3%

10.24. Refer to the scatterplot shown in the previous solution. With the 2004 tornado count removed, the regression equation is ŷ ≈ −26,162 + 13.67x. The estimated rate of increase (that is, the slope) is lower than that found in the previous solution by about 1 tornado per year. With SE_b1 ≈ 1.396 and t* = 2.0086 for df = 50, the confidence interval for β_1 is b_1 ± t* SE_b1 = 13.67 ± 2.80 ≈ 10.87 to 16.47 tornadoes per year. In examining the residuals, neither the stemplot nor the plot of residuals versus year (both below) suggests any deviations from the assumptions of our regression model.

Stemplot of residuals:

−3 | 30
−2 | 86
−2 | 10
−1 | 6
−1 | 32200
−0 | 9876
−0 | 444333000000
 0 | 001223
 0 | 55899
 1 | 01233
 1 | 58
 2 | 013
 2 | 79
 3 | 0

[Figure: residuals against year, with the 2004 observation removed.]

Minitab output
The regression equation is Tornado = - 26162 + 13.7 Year

Predictor   Coef     Stdev   t-ratio   p
Constant    -26162   2763    -9.47     0.000
Year        13.666   1.396   9.79      0.000

s = 151.5   R-sq = 65.7%   R-sq(adj) = 65.0%
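To compare the slope intervals with and without the 2004 count, here is a small Python sketch. It uses the rounded slopes, standard errors, and critical values quoted in Exercises 10.23 and 10.24; the names are illustrative assumptions.

```python
def slope_ci(b1, se_b1, t_star):
    """Confidence interval b1 +/- t* * SE_b1 for the slope."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin


# All years (Exercise 10.23): df = 51.
ci_all = slope_ci(14.86, 1.527, 2.0076)
# 2004 removed (Exercise 10.24): df = 50.
ci_no_2004 = slope_ci(13.67, 1.396, 2.0086)

print(f"All years:    {ci_all[0]:.2f} to {ci_all[1]:.2f} tornadoes per year")
print(f"Without 2004: {ci_no_2004[0]:.2f} to {ci_no_2004[1]:.2f} tornadoes per year")
```

The two intervals overlap heavily, so removing the outlier lowers the estimated slope without changing the overall conclusion of an increasing trend.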


10.25. (a) x (CRP) is sharply right-skewed with high outliers; x̄ = 10.0323 and s_x = 16.5632. y (retinol) is slightly right-skewed; ȳ = 0.7648 and s_y = 0.3949. (b) No; no assumption is made about the distribution of x values. Note that this does not mean that we do not care about the distribution of the x values; the outliers cause trouble, as we see in (d). (c) The regression equation is Retinol = 0.8430 − 0.007800 CRP, s = 0.3781. With α = 0.05, the slope is significantly different from 0: t = −2.13 (df = 38), P = 0.039. (d) The high outliers in CRP are influential, as we can see from the small residuals on the right end of the plot below. Additionally, a stemplot or quantile plot of the residuals (not shown) shows that the distribution is right-skewed rather than Normal.

[Figure (left): scatterplot of retinol against CRP.]
[Figure (right): residuals against CRP.]

CRP

0 | 00000000000000003334
0 | 55555677899
1 | 2
1 | 5
2 | 02
2 | 6
3 | 0
3 |
4 |
4 | 6
5 |
5 | 9
6 |
6 |
7 | 3

Retinol

0 | 2333333333333
0 | 455
0 | 6667
0 | 88889999
1 | 00011111
1 | 23
1 | 4
1 |
1 | 9

10.26. (a) Both distributions are right-skewed. OC has x̄ = 33.4161 and s_x = 19.6097; VO+ has ȳ = 985.8065 and s_y = 579.8581. (b) Put OC on the x axis because we hope to use it as the explanatory variable. We see a positive association, but one point is an outlier (it is far above the pattern of the rest of the points) and there appears to be more scatter about the line for large values of OC. (c) The regression equation is ŷ = 334.0 + 19.505x, s = 443.3. The slope is significantly different from 0: t = 4.73, P < 0.0005. The residuals appear to be somewhat right-skewed, and the unusual point noted in (b) corresponds to a high outlier in the distribution of residuals.


OC

0 | 89
1 | 0
1 | 5677799
2 | 00004
2 |
3 | 011
3 | 568
4 | 04
4 | 7
5 | 244
5 | 6
6 |
6 | 8
7 |
7 | 67

VO+

0 | 23
0 | 4444555
0 | 66667
0 | 888899
1 | 011
1 | 23
1 | 5
1 | 66
1 |
2 |
2 | 22
2 | 5

Residuals

−0 | 7
−0 | 5
−0 | 33322222
−0 | 11000000000
 0 | 0001
 0 | 23
 0 | 5
 0 |
 0 | 88
 1 |
 1 |
 1 | 5

[Figure: scatterplot of VO+ (thousands) against OC.]

10.27. (a) Both distributions are right-skewed (TRAP more than VO–). TRAP has x̄ = 13.2484 and s_x = 6.5282; VO– has ȳ = 889.1935 and s_y = 427.6161. (b) Put TRAP on the x axis because we hope to use it as the explanatory variable. We see a positive association, but one point is an outlier (it is far above the pattern of the rest of the points). (c) The regression equation is ŷ = 300.9 + 44.406x, s = 319.7. The slope is significantly different from 0: t = 4.97, P < 0.0005. The unusual point noted in (b) corresponds to a high outlier among the residuals; otherwise the distribution of residuals seems reasonably Normal.

TRAP

0 | 3
0 | 5
0 | 66
0 | 888899999
1 | 00000
1 |
1 | 4444
1 |
1 | 89999
2 |
2 | 3
2 | 55
2 |
2 | 8

VO–

0 | 23
0 | 4444455
0 | 6777
0 | 889999999
1 | 0001
1 | 2
1 | 44
1 | 7
1 |
2 |
2 | 2

Residuals

−0 | 54
−0 | 32222
−0 | 1111110000000
 0 | 0111
 0 | 33333
 0 | 4
 0 |
 0 |
 1 | 0

[Figure: scatterplot of VO– (thousands) against TRAP.]

10.28. After taking (natural) logarithms, both distributions are considerably less skewed; logOC is irregular, while logVO+ is quite symmetric. logOC has x̄ = 3.3379 and s_x = 0.6085; logVO+ has ȳ = 6.7419 and s_y = 0.5554. (If common logarithms are used instead, these values are smaller by a factor of about 2.3026.) The scatterplot shows a positive association, stronger than that seen in Exercise 10.26. The regression equation is ŷ = 4.3852 + 0.7060x, s = 0.3580. The slope is significantly different from 0: t = 6.57, P < 0.0005. The distribution of the residuals appears to be much more Normal than in Exercise 10.26.
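The factor 2.3026 is ln 10: since log10(v) = ln(v)/ln(10), means and standard deviations computed from natural logs are ln 10 times those computed from common logs. A quick Python sketch follows; the data values are hypothetical and are not the OC measurements from the exercise.

```python
import math
import statistics

# Hypothetical positive measurements, used only to illustrate the scaling.
values = [12.0, 20.0, 33.0, 47.0, 80.0]

ln_values = [math.log(v) for v in values]
log10_values = [math.log10(v) for v in values]

factor = math.log(10)  # about 2.3026

mean_ratio = statistics.mean(ln_values) / statistics.mean(log10_values)
sd_ratio = statistics.stdev(ln_values) / statistics.stdev(log10_values)

# Converting the logOC mean reported in the solution to base 10.
log_oc_mean_base10 = 3.3379 / factor

print(f"ln(10) = {factor:.4f}, mean ratio = {mean_ratio:.4f}, SD ratio = {sd_ratio:.4f}")
print(f"logOC mean in base 10: {log_oc_mean_base10:.4f}")
```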


logOC

2 | 0
2 | 23
2 |
2 | 7
2 | 8888999
3 | 0001
3 |
3 | 44455
3 | 667
3 | 89
4 | 000
4 | 233

logVO+

5 | 6
5 | 99
6 | 011
6 | 223
6 | 4455
6 | 67777
6 | 889
7 | 0011
7 | 33
7 | 4
7 | 77
7 | 8

Residuals

−0 | 5444
−0 | 33332
−0 | 110000000
 0 | 0001
 0 | 223
 0 | 44445
 0 |
 0 | 9

[Figure: scatterplot of logVO+ against logOC.]

10.29. After taking (natural) logarithms, both distributions are somewhat irregular, but slightly more symmetric. logTRAP has x̄ = 2.4674 and s_x = 0.4979; logVO– has ȳ = 6.6815 and s_y = 0.4832. (If common logarithms are used instead, these values are smaller by a factor of about 2.3026.) The scatterplot shows a positive association; the outlier visible in Exercise 10.27 has moved closer to the other points. The regression equation is ŷ = 5.0910 + 0.6446x, s = 0.3674. The slope is significantly different from 0: t = 4.78, P < 0.0005. The distribution of the residuals appears to be reasonably Normal, apart from a low outlier.

logTRAP

1 | 1
1 |
1 |
1 | 7
1 | 89
2 | 001111
2 | 22233333
2 |
2 | 6667
2 | 99999
3 | 1
3 | 223

logVO–

5 | 5
5 |
5 | 8
6 | 001
6 | 2223
6 | 455
6 | 677
6 | 8888888999
7 | 01
7 | 23
7 | 4
7 | 7

Residuals

−1 | 0
−0 |
−0 |
−0 | 444
−0 | 3222
−0 | 110000
 0 | 00000011
 0 | 2223
 0 | 4455
 0 | 7

[Figure: scatterplot of logVO– against logTRAP.]

10.30. (a) With all 60 points, the regression equation is ŷ = −34.55 + 0.8605x, s ≈ 20.17. (This is the solid line in the scatterplot on the right.) The slope is significantly different from 0: t = 4.82, P < 0.0005. (b) Without the four points from the bottom of the scatterplot, the regression equation is ŷ = −33.40 + 0.8818x, s ≈ 15.18. (This is the dashed line in the scatterplot.) The slope is again significantly different from 0: t = 6.57, P < 0.0005. With the outliers removed, the line changes slightly; the most significant change is the decrease in the estimated standard deviation s. This correspondingly makes t larger (i.e., b_1 is more significantly different from 0) and makes the regression line more useful for prediction (r² increases from 28.9% to 44.4%). Of course, we should not arbitrarily remove data points; more investigation is needed to determine why these students' reading scores were so much lower than we would expect based on their IQs.

[Figure: scatterplot of reading score versus IQ, with the two regression lines (solid: all points; dashed: four low points removed).]
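The link between the larger t and the larger r² in Exercise 10.30 can be made explicit. In simple linear regression, r² = t²/(t² + df) with df = n − 2; this identity is not stated in the solution and is offered here only as a cross-check. Because the quoted t values are rounded, the results agree with 28.9% and 44.4% only approximately.

```python
def r_squared_from_t(t: float, df: float) -> float:
    """r^2 implied by the slope t statistic in simple linear regression."""
    return t * t / (t * t + df)

# t statistics quoted in the solution; df = n - 2.
all_points = r_squared_from_t(4.82, 58)        # 60 points
without_outliers = r_squared_from_t(6.57, 54)  # 56 points
print(round(all_points, 3), round(without_outliers, 3))
```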


10.31. (a) Both variables are right-skewed. For pure tones, x̄ ≈ 106.20 and s_x ≈ 91.76 spikes/second, and for monkey calls, ȳ = 176.57 and s_y = 111.85 spikes/second. (b) There is a moderate positive association; the third point (circled) has the largest residual; the first point (marked with a square) is an outlier for tone response. (c) With all 37 points, CALL = 93.9 + 0.778 TONE and s = 87.30; the test of β1 = 0 gives t = 4.91, P < 0.0001. (d) Without the first point, ŷ = 101 + 0.693x, s = 88.14, t = 3.18. Without the third point, ŷ = 98.4 + 0.679x, s = 80.69, t = 4.49. With neither, ŷ = 116 + 0.466x, s = 79.46, t = 2.21. The line changes a bit, but always has a slope significantly different from 0.

[Figure: stemplots of tone response, call response, and the residuals; scatterplot of monkey call response (spikes/second) versus pure tone response (spikes/second).]

10.32. The model is y_i = β0 + β1 x_i + ε_i, where the ε_i are independent N(0, σ) variables. (a) β0 represents the fixed costs. (b) β1 represents how costs change as the number of students changes. This should be positive because more students mean more expenses. (c) The error term (ε_i) allows for variation among equal-sized schools.

10.33. (a) The scatterplot shows a weak negative association; the regression equation is Bonds = 53.41 − 0.1962 Stocks with s ≈ 59.88. (b) For testing H0: β1 = 0 vs. Ha: β1 ≠ 0, we have t = −1.27 (df = 14) and P = 0.226. The slope is not significantly different from 0. (c) The scatterplot shows a lot of variation, so s is large and t is small.

[Figure: scatterplot of cash flow into bonds ($billions) versus cash flow into stocks ($billions).]

10.34. (a) The t statistic for testing H0: β1 = 0 vs. Ha: β1 ≠ 0 is t = b1/SE_b1 = 0.76/0.44 ≈ 1.73 with df = 80. This has P = 0.0880, so we do not reject H0 at the 5% significance level. (b) For the one-sided alternative β1 > 0, we would have P = 0.0440, so we could reject H0 at the 5% significance level.
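The step from the two-sided to the one-sided P-value in Exercise 10.34 is a halving, valid because the observed t lies on the side named by the alternative. A minimal sketch, where the helper name `one_sided_p` is hypothetical and the two-sided P-value is taken from the solution rather than computed:

```python
def one_sided_p(two_sided_p: float, t: float, alternative: str = "greater") -> float:
    """Convert a two-sided t-test P-value to a one-sided one.

    alternative is "greater" (Ha: beta1 > 0) or "less" (Ha: beta1 < 0).
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    half = two_sided_p / 2
    agrees = (t > 0) if alternative == "greater" else (t < 0)
    # If the statistic points away from the alternative, the P-value is 1 - p/2.
    return half if agrees else 1 - half

t = 0.76 / 0.44  # b1 / SE_b1 from Exercise 10.34
print(round(t, 2), round(one_sided_p(0.0880, t), 4))
```

The same halving turns Minitab's two-sided 0.065 into the one-sided 0.033 used in Exercise 10.35.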


10.35. See also the solutions to Exercises 2.20 and 2.68. (a) MA angle is the explanatory variable, so it should be on the horizontal axis of the scatterplot. (This scatterplot has the same scale on both axes because both variables are measured in degrees.) (b) The scatterplot shows a moderate-to-weak positive linear association, with one clear outlier (the patient with HAV angle 50°). (c) The model is y_i = β0 + β1 x_i + ε_i, i = 1, 2, ..., 38; the ε_i are independent N(0, σ) variables. (d) Because doctors expect there to be a positive association, we use a one-sided alternative: H0: β1 = 0 vs. Ha: β1 > 0. (e) For the estimated slope b1 ≈ 0.3388, we have t = 1.90 (df = 36) and P = 0.033 (half of Minitab's two-sided P-value); this is significant evidence (at α = 0.05) to support the doctors' belief.

[Figure: scatterplot of HAV angle (degrees) versus MA angle (degrees).]

Minitab output
The regression equation is HAV = 19.7 + 0.339 MA

Predictor    Coef      Stdev     t-ratio    p
Constant     19.723    3.217     6.13       0.000
MA           0.3388    0.1782    1.90       0.065

s = 7.224    R-sq = 9.1%    R-sq(adj) = 6.6%

10.36. Software (Minitab output above) reports b1 ≈ 0.3388 and SE_b1 ≈ 0.1782. For a t distribution with df = 36, t* ≈ 2.0281 for a 95% confidence interval, so the interval is −0.0226 to 0.7002. The slope was significantly different from 0 using a one-sided alternative, but this interval tells us that it could be 0 (or even slightly negative); we would not reject β1 = 0 in favor of a two-sided alternative.
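The interval in Exercise 10.36 is b1 ± t* SE_b1. The sketch below reproduces the arithmetic; the function name `slope_ci` is hypothetical, and the critical value t* is taken from the solution rather than computed.

```python
def slope_ci(b1: float, se_b1: float, t_star: float) -> tuple[float, float]:
    """Confidence interval b1 +/- t* SE_b1 for a regression slope."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

# Values from the Minitab output; t* = 2.0281 is the df = 36 critical value quoted above.
low, high = slope_ci(0.3388, 0.1782, 2.0281)
print(round(low, 4), round(high, 4))
contains_zero = low < 0 < high  # True here, matching the two-sided conclusion
```

The same function applies to Exercises 10.37, 10.39, and 10.44 with their own b1, SE_b1, and t*.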

10.37. (a) Aside from the one high point (70 months of service, and wages 97.6801), there is a moderate positive association: fairly clear but with quite a bit of scatter. (b) The regression equation is WAGES = 43.383 + 0.07325 LOS, with s ≈ 10.21 (Minitab output below). The slope is significantly different from 0: t = 2.85 (df = 57), P = 0.006. (c) Wages rise an average of 0.07325 wage units per month of service. (d) We have b1 ≈ 0.07325 and SE_b1 ≈ 0.02571. For a t distribution with df = 57, t* ≈ 2.0025 for a 95% confidence interval, so the interval is 0.0218 to 0.1247.

[Figure: scatterplot of wages (rescaled) versus length of service (months).]

Minitab output
The regression equation is wages = 43.4 + 0.0733 los

Predictor    Coef       Stdev      t-ratio    p
Constant     43.383     2.248      19.30      0.000
los          0.07325    0.02571    2.85       0.006

s = 10.21    R-sq = 12.5%    R-sq(adj) = 10.9%


10.38. The table below summarizes the regression results with the outlier excluded, and those with all points. (a) The intercept and slope estimates change very little, but the estimate of σ increases from 10.21 to 11.98. (b) With the outlier, the t statistic decreases (because s has increased), and the P-value increases slightly, although it is still significant at the 5% level. (c) The interval width 2t*SE_b1 increases from 0.1030 to 0.1207, roughly the same factor by which s increased. (Because the degrees of freedom change from 57 to 58, t* decreases from 2.0025 to 2.0017, but the change in s has a much greater impact.)

                    b0        b1         s        t       P        Interval width
Outlier excluded    43.383    0.07325    10.21    2.85    0.006    0.1030
All points          44.213    0.07310    11.98    2.42    0.018    0.1207

Minitab output
The regression equation is wages = 44.2 + 0.0731 los

Predictor    Coef       Stdev      t-ratio    p
Constant     44.213     2.628      16.82      0.000
los          0.07310    0.03015    2.42       0.018

s = 11.98    R-sq = 9.2%    R-sq(adj) = 7.6%

10.39. (a) The trend appears to be quite linear. (b) The regression equation is Lean = −61.12 + 9.3187 Year with s ≈ 4.181. The regression explains r² = 98.8% of the variation in lean. (c) The rate we seek is the slope. For df = 11 and 99% confidence, t* = 3.1058, so the interval is 9.3187 ± (3.1058)(0.3099) = 8.3562 to 10.2812 tenths of a millimeter/year.

[Figure: scatterplot of lean (0.1 mm over 2.9 m) versus year.]

Minitab output
The regression equation is Lean = -61.1 + 9.32 Year

Predictor    Coef      Stdev     t-ratio    p
Constant     -61.12    25.13     -2.43      0.033
Year         9.3187    0.3099    30.07      0.000

s = 4.181    R-sq = 98.8%    R-sq(adj) = 98.7%

10.40. (a) ŷ = −61.12 + 9.3187(18) ≈ 107, for a prediction of 2.9107 m. (b) This is an example of extrapolation: trying to make a prediction outside the range of given x-values. Minitab reports that a 95% prediction interval for y when x* = 18 is about 62.6 to 150.7. The width of the interval is an indication of how unreliable the prediction is.

Note: Minitab's "Stdev.Fit" value of 19.56 is $SE_{\hat{\mu}}$, so $SE_{\hat{y}} = \sqrt{s^2 + SE_{\hat{\mu}}^2} \approx 20.00$, which agrees with the margin for the prediction interval: $t^* SE_{\hat{y}} = (2.201)(20.00) \approx 44.02$.

Minitab output
Fit       Stdev.Fit    95.0% C.I.            95.0% P.I.
106.62    19.56        ( 63.56, 149.68)      ( 62.58, 150.65) XX
XX denotes a row with very extreme X values
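The note to Exercise 10.40 combines the regression standard error s with the standard error of the fitted mean to get the standard error for predicting one new observation. A sketch of that calculation, using only values printed in the solution (the function name is hypothetical); because the inputs are rounded, the endpoints match Minitab's to within a few hundredths.

```python
import math

def prediction_interval(fit, s, se_mean, t_star):
    """Return (SE of prediction, lower, upper) for one new observation."""
    se_pred = math.sqrt(s**2 + se_mean**2)  # adds individual scatter to SE of the mean
    margin = t_star * se_pred
    return se_pred, fit - margin, fit + margin

# Fit and Stdev.Fit from the Minitab output; s = 4.181 and t* = 2.201 (df = 11).
se_pred, low, high = prediction_interval(106.62, 4.181, 19.56, 2.201)
print(round(se_pred, 2), round(low, 1), round(high, 1))
```

Here the standard error of the fitted mean (19.56) dominates s (4.181), which is what extrapolating far from the observed x-values looks like numerically.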


10.41. (a) Use x = 109 (the number of years after 1900). (b) ŷ = −61.12 + 9.3187(109) ≈ 955, for a prediction of 2.9955 m. (c) A prediction interval is appropriate because we are interested in one future observation, not the mean of all future observations; in this situation, it does not make sense to talk of more than one future observation. In the output below, note that Minitab warns us of the risk of extrapolation.

Minitab output
Fit       Stdev.Fit    95.0% C.I.            95.0% P.I.
954.62    8.75         ( 935.34, 973.89)     ( 933.26, 975.97) XX
XX denotes a row with very extreme X values
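The point predictions in Exercises 10.40 and 10.41 come from plugging the coded year into the fitted line and then converting from tenths of a millimeter beyond 2.9 m to meters. A short sketch, with hypothetical function names and the coefficients from the Minitab output for Exercise 10.39:

```python
def predicted_lean(year_code: float) -> float:
    """Predicted lean (tenths of a millimeter beyond 2.9 m) from the fitted line."""
    return -61.12 + 9.3187 * year_code

def lean_in_meters(lean_tenths_mm: float) -> float:
    """Convert tenths of a millimeter beyond 2.9 m to total lean in meters."""
    return 2.9 + lean_tenths_mm / 10_000

for year_code in (18, 109):  # coded as years after 1900
    y_hat = round(predicted_lean(year_code))
    print(year_code, y_hat, round(lean_in_meters(y_hat), 4))
```

Both coded years lie far outside the 74 to 86 range of the data, so these are extrapolations, as the solutions point out.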

10.42. A negative association makes sense here: If the price of beer is above average, fewer students can afford to drink, while more drinking happens when beer is cheaper.

Note: The fact that the correlation is relatively small indicates that the price of beer is not a crucial factor in determining the prevalence of binge-drinking. In particular, a straight-line relationship with the cost of beer only explains about r² ≈ 13% of the variation in binge-drinking rates.

10.43. To test H0: ρ = 0 vs. Ha: ρ ≠ 0, we compute $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \approx -4.16$. Comparing this to a t distribution with df = 116, we find P < 0.0001, so we conclude the correlation is different from 0.

10.44. (a) Scatterplot below, left. (b) Scatterplot below, right. (c) The regression equation is ŷ = −872.93 + 0.4464x with s ≈ 0.1739. For 95% confidence with df = 4, t* = 2.7765, so with b1 ≈ 0.4464 and SE_b1 ≈ 0.006856, the confidence interval is 0.4274 to 0.4654.

[Figure, left: scatterplot of DRAM capacity (bits × 10^6) versus year. Right: scatterplot of log(DRAM capacity) versus year.]

Minitab output
The regression equation is logBits = - 873 + 0.446 year

Predictor    Coef        Stdev       t-ratio    p
Constant     -872.93     13.63       -64.03     0.000
year         0.446390    0.006856    65.11      0.000

s = 0.1739    R-sq = 99.9%    R-sq(adj) = 99.9%

10.45. Recall that testing H0: ρ = 0 vs. Ha: ρ ≠ 0 is the same as testing H0: β1 = 0 vs. Ha: β1 ≠ 0. In the solution to Exercise 10.33, we had t = −1.27 (df = 14) and P = 0.226, so we cannot reject H0.


10.46. (a) With r = −0.19 and n = 713, we have $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \approx -5.16$. (b) Comparing to a t distribution with df = 711 (or anything reasonably close), the P-value is less than 0.0001, so we conclude that ρ ≠ 0.
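The test statistic for a correlation depends only on r and n, so it is easy to script. A minimal sketch (the function name `correlation_t` is hypothetical) that reproduces the value in Exercise 10.46:

```python
import math

def correlation_t(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

t = correlation_t(-0.19, 713)
print(round(t, 2))  # -5.16
```

Calling `correlation_t(0.5, 20)` and `correlation_t(0.5, 10)` gives about 2.45 and 1.63, the values used in Exercise 10.51 to show how the same r carries more evidence in a larger sample.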

10.47. Because DFT = DFM + DFE and SST = SSM + SSE, we can find the missing degrees of freedom (DF) and sum of squares (SS) entries by subtraction: df = DFE = 28 and SSE = 10152.4. The missing entry in the mean square (MS) column is MSE = SSE/DFE ≈ 362.6. (We can also compute MSE = MSM/F ≈ 362.7, the same answer up to rounding.)

10.48. $s = \sqrt{\text{MSE}} \approx 19.0416$ and $r^2 = \frac{\text{SSM}}{\text{SST}} = \frac{3445.9}{13598.3} \approx 0.2534$.
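The subtraction steps in Exercises 10.47 and 10.48 can be laid out as a short calculation. The sketch assumes DFM = 1 and DFT = 29, which are inferred from DFE = 28 rather than quoted in the solution; last-digit differences from the printed values are rounding.

```python
import math

# Entries given in the ANOVA table (DFT = 29 and DFM = 1 are inferred from DFE = 28).
ssm, sst = 3445.9, 13598.3
dfm, dft = 1, 29

sse = sst - ssm          # error sum of squares
dfe = dft - dfm          # error degrees of freedom
mse = sse / dfe          # mean square for error
s = math.sqrt(mse)       # regression standard error
r_squared = ssm / sst    # fraction of variation explained
f_stat = (ssm / dfm) / mse

print(round(sse, 1), dfe, round(mse, 1), round(s, 4), round(r_squared, 4), round(f_stat, 2))
```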

10.49. As $s_x = \sqrt{\frac{1}{29}\sum (x_i - \bar{x})^2} = 16.45\%$, we have $\sqrt{\sum (x_i - \bar{x})^2} = s_x\sqrt{29} \approx 88.5860\%$, so:

$$SE_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \approx \frac{19.0420}{88.5860} \approx 0.2150$$

Alternatively, note that we have F = 9.50 and b1 = 0.663. Because t² = F, we know that t = 3.0822 (take the positive square root, because t = b1/SE_b1, and b1 is positive). Then SE_b1 = b1/t = 0.2151, the same answer up to rounding. (Note that with this approach, we do not need to know that s_x = 16.45%.)

With df = 28, t* = 2.0484 for 95% confidence, so the 95% confidence interval is 0.663 ± 0.4403 = 0.2227 to 1.1033.
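Both routes to SE_b1 in Exercise 10.49 can be checked side by side. The sketch below uses only numbers quoted in the solution; n = 30 is inferred from the divisor 29 and is an assumption of this example.

```python
import math

n = 30                     # DFT + 1, inferred from the ANOVA table
s, s_x = 19.0420, 16.45    # regression standard error and SD of x (in %)
b1, f_stat = 0.663, 9.50

# Route 1: SE_b1 = s / sqrt(sum of squared deviations of x).
se_from_spread = s / (s_x * math.sqrt(n - 1))

# Route 2: t^2 = F, so t = sqrt(F) (b1 > 0) and SE_b1 = b1 / t.
se_from_f = b1 / math.sqrt(f_stat)

t_star = 2.0484            # df = 28, 95% confidence
margin = t_star * se_from_spread
print(round(se_from_spread, 4), round(se_from_f, 4))
print(round(b1 - margin, 4), round(b1 + margin, 4))
```

The two standard errors differ only in the fourth decimal place, which reflects rounding in F and b1 rather than any disagreement between the methods.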

10.50. (a) With x̄ ≈ 80.9, s_x ≈ 17.2, ȳ ≈ 43.5, s_y ≈ 20.3, and r ≈ 0.68, we find:

$$b_1 = r\,\frac{s_y}{s_x} = (0.68)\left(\frac{20.3}{17.2}\right) \approx 0.8026$$

$$b_0 = \bar{y} - b_1\bar{x} = 43.5 - (0.8026)(80.9) \approx -21.4270$$

(Answers may vary slightly due to rounding.) The regression equation is therefore GHP = −21.4270 + 0.8026 FVC. (b) Testing β1 = 0 is equivalent to testing ρ = 0, so the test statistic is $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \approx 6.43$ (df = 48), for which P < 0.0005. The slope (correlation) is significantly different from 0.
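Part (a) of Exercise 10.50 builds the least-squares line from five summary statistics. A small sketch of that calculation (the function name is hypothetical), carrying full precision for b1 into the intercept:

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Least-squares slope and intercept from means, SDs, and correlation."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = line_from_summary(x_bar=80.9, s_x=17.2, y_bar=43.5, s_y=20.3, r=0.68)
print(round(b0, 4), round(b1, 4))
```

Rounding b1 to 0.8026 before computing the intercept shifts b0 in the third decimal place, which is the rounding variation the solution mentions.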

10.51. Use the formula $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ with r = 0.5. For n = 20, t = 2.45 with df = 18, for which the two-sided P-value is P = 0.0248. For n = 10, t = 1.63 with df = 8, for which the two-sided P-value is P = 0.1411. With the larger sample size, r should be a better estimate of ρ, so we are less likely to get r = 0.5 unless ρ is really not 0.


10.52. Most of the small banks have negative residuals, while most large-bank residuals are positive. This means that, generally, wages at large banks are higher, and small bank wages are lower, than we would predict from the regression.

[Figure: plot of residuals versus length of service (months), with large and small banks marked separately.]

10.53. (a) Not surprisingly, there is a positive association between scores. The 47th pair of scores (circled) is an outlier: the ACT score (21) is higher than one would expect for the SAT score (420). Since this SAT score is so low, this point may be influential. No other points fall outside the pattern. (b) The regression equation is ŷ = 1.626 + 0.02137x. The slope is significantly different from 0: t = 10.78 (df = 58), for which P < 0.0005. (c) r = 0.8167.

[Figure: scatterplot of ACT score versus SAT score.]

Minitab output
The regression equation is ACT = 1.63 + 0.0214 SAT

Predictor    Coef        Stdev       t-ratio    p
Constant     1.626       1.844       0.88       0.382
SAT          0.021374    0.001983    10.78      0.000

s = 2.744    R-sq = 66.7%    R-sq(adj) = 66.1%

10.54. (a) The means are identical (21.133). (b) For the observed ACT scores, s_y = 4.714; for the fitted values, s_ŷ = 3.850. (c) For z = 1, the SAT score is x̄ + s_x = 912.7 + 180.1 = 1092.8. The predicted ACT score is ŷ ≈ 25 (Minitab reports 24.983), which gives a standard score of about 1 (using the standard deviation of the predicted ACT scores). (d) For z = −1, the SAT score is x̄ − s_x = 912.7 − 180.1 = 732.6. The predicted ACT score is ŷ ≈ 17.3 (Minitab reports 17.285), which gives a standard score of about −1. (e) It appears that the standard score of the predicted value is the same as the explanatory variable's standard score. (See note below.)

Notes: (a) This will always be true because $\sum_i \hat{y}_i = \sum_i (b_0 + b_1 x_i) = n b_0 + b_1 \sum_i x_i = n(\bar{y} - b_1\bar{x}) + b_1 n\bar{x} = n\bar{y}$. (b) The standard deviation of the predicted values will be $s_{\hat{y}} = |r|\,s_y$; in this case, $s_{\hat{y}} = (0.8167)(4.714)$. To see this, observe that the variance of the predicted values is $\frac{1}{n-1}\sum_i (\hat{y}_i - \bar{y})^2 = \frac{1}{n-1}\sum_i (b_1 x_i - b_1\bar{x})^2 = b_1^2 s_x^2 = r^2 s_y^2$. (e) For a given standard score z, note that $\hat{y} = b_0 + b_1(\bar{x} + z s_x) = \bar{y} - b_1\bar{x} + b_1\bar{x} + b_1 z s_x = \bar{y} + z r s_y$. If r > 0, the standard score for ŷ equals z; if r < 0, the standard score is −z.
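The three identities in these notes hold for any data set, which can be confirmed numerically. The sketch below uses a small made-up data set (an assumption; it is not the SAT/ACT data) and fits the least-squares line from the summary statistics.

```python
import statistics

# Hypothetical (x, y) pairs, used only to illustrate the identities in the notes.
x = [1.0, 2.0, 4.0, 5.0, 7.0, 9.0]
y = [2.1, 2.9, 5.2, 4.8, 8.1, 8.7]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)
n = len(x)
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

print(statistics.mean(fitted), y_bar)               # equal up to rounding error
print(statistics.stdev(fitted), abs(r) * s_y)       # equal up to rounding error

# A point z standard deviations above x_bar is predicted z*r*s_y above y_bar.
z = 1.0
print(b0 + b1 * (x_bar + z * s_x), y_bar + z * r * s_y)
```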


10.55. (a) For SAT: x̄ = 912.6 and s_x = 180.1117. For ACT: ȳ = 21.13 and s_y = 4.7137. Therefore, the slope is a1 = s_y/s_x ≈ 0.02617 and the intercept is a0 = ȳ − a1 x̄ ≈ −2.7522. (b) The new line is dashed. (c) For example, the first prediction is −2.7522 + (0.02617)(1000) ≈ 23.42. Up to rounding error, the mean and standard deviation of the predicted scores are the same as those of the ACT scores: ȳ = 21.13 and s_y = 4.7137.

Note: The usual least-squares line minimizes the total squared vertical distance from the points to the line. If instead we seek to minimize the total of $\sum_i |h_i v_i|$, where h_i is the horizontal distance and v_i is the vertical distance, we obtain the line ŷ = a0 + a1x, except that we must choose the sign of a1 to be the same as the sign of r. (It would hardly be the "best line" if we had a positive slope with a negative association.) If r = 0, either sign will do.

[Figure: scatterplot of ACT score versus SAT score, with the least-squares line and the new line (dashed).]
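The line in Exercise 10.55 has slope s_y/s_x (signed like r) and passes through (x̄, ȳ). A sketch of the calculation from the quoted summary statistics; the function name `sd_line` is hypothetical, and because the means are rounded the intercept agrees with −2.7522 only to about two decimal places.

```python
import math

def sd_line(x_bar, s_x, y_bar, s_y, r):
    """Line through (x_bar, y_bar) with slope sign(r) * s_y / s_x."""
    a1 = math.copysign(s_y / s_x, r)  # r = 0 (treated as positive here) allows either sign
    a0 = y_bar - a1 * x_bar
    return a0, a1

# Summary statistics quoted in the solution; r = 0.8167 from Exercise 10.53.
a0, a1 = sd_line(912.6, 180.1117, 21.13, 4.7137, 0.8167)
print(round(a1, 5), round(a0 + a1 * 1000, 2))
```

Because this slope is not shrunk by r, predictions from this line keep the full spread s_y, which is why their standard deviation matches that of the ACT scores.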

10.56. (a) The regression equations are:
WEIGHT = −468.91 + 28.462 LENGTH with s ≈ 109.4 and r² ≈ 0.902
WEIGHT = −449.44 + 174.63 WIDTH with s ≈ 107.9 and r² ≈ 0.905
(b) Both scatterplots suggest that the relationships are curved rather than linear. (Points to the left and right lie above the line; those in the middle are generally below the line.)

[Figure: scatterplots of weight (100 g) versus length (cm) and of weight (100 g) versus width (cm).]

Minitab output
--- MODEL 1: LENGTH & WEIGHT ---
The regression equation is weight = -469 + 28.5 length

Predictor    Coef       Stdev    t-ratio    p
Constant     -468.91    92.55    -5.07      0.000
length       28.462     2.967    9.59       0.000

s = 109.4    R-sq = 90.2%    R-sq(adj) = 89.2%

--- MODEL 2: WIDTH & WEIGHT ---
The regression equation is weight = -449 + 175 width

Predictor    Coef       Stdev    t-ratio    p
Constant     -449.44    89.27    -5.03      0.000
width        174.63     17.93    9.74       0.000

s = 107.9    R-sq = 90.5%    R-sq(adj) = 89.5%


10.57. (a) For squared length: Weight = −117.99 + 0.4970 SQLEN, s ≈ 52.76, r² = 0.977. (b) For squared width: Weight = −98.99 + 18.732 SQWID, s ≈ 65.24, r² = 0.965. Both scatterplots look more linear.

[Figure: scatterplots of weight (100 g) versus squared length (cm²) and of weight (100 g) versus squared width (cm²).]

Minitab output
--- MODEL 1: SQUARED LENGTH & WEIGHT ---
The regression equation is weight = -118 + 0.497 sqlen

Predictor    Coef       Stdev      t-ratio    p
Constant     -117.99    27.88      -4.23      0.002
sqlen        0.49701    0.02400    20.71      0.000

s = 52.76    R-sq = 97.7%    R-sq(adj) = 97.5%

--- MODEL 2: SQUARED WIDTH & WEIGHT ---
The regression equation is weight = -99.0 + 18.7 sqwid

Predictor    Coef      Stdev    t-ratio    p
Constant     -98.99    33.67    -2.94      0.015
sqwid        18.732    1.126    16.64      0.000

s = 65.24    R-sq = 96.5%    R-sq(adj) = 96.2%

10.58. (a) The regression line is WEIGHT = −115.10 + 3.1019(LENGTH)(WIDTH), s ≈ 41.69, r² = 0.986. (b) As measured by r², this last model is (by a slim margin) the best. (However, this scatterplot again gives some suggestion of curvature, indicating that some other model might do better still.)

[Figure: scatterplot of weight (100 g) versus length times width (cm²).]

Minitab output
The regression equation is weight = -115 + 3.10 lenwid

Predictor    Coef       Stdev     t-ratio    p
Constant     -115.10    21.87     -5.26      0.000
lenwid       3.1019     0.1179    26.32      0.000

s = 41.69    R-sq = 98.6%    R-sq(adj) = 98.4%


10.59. The table below shows the correlations and the corresponding test statistics. The first two results agree with the results of (respectively) Exercises 10.14 and 10.15.

               r          t        P
IBI/area       0.4459     3.42     0.0013
IBI/forest     0.2698     1.92     0.0608
area/forest    −0.2571    −1.82    0.0745

10.60. The correlation was significant for vegetables, fruit, and meat, and nearly significant for eggs. All the significant correlations are negative, meaning (for example) that children with high neophobia tend to eat these foods less frequently.

                      r        t        P
Vegetables            −0.27    −6.65    0.0000
Fruit                 −0.16    −3.84    0.0001
Meat                  −0.15    −3.60    0.0004
Eggs                  −0.08    −1.90    0.0576
Sweet/fatty snacks    0.04     0.95     0.3430
Starchy staples       −0.02    −0.47    0.6355

10.61. Quickness and creativity are both (significantly) positively correlated with all GRE scores, so creative people are not penalized (contradicting the critics), while quick workers do better (as some have suggested). Depth is positively associated with verbal scores (refuting the opinion that deep thinkers are penalized). The only other significant correlations are those for conscientious workers, who apparently tend to score lower on all parts of the GRE. Depending on what "conscientiousness" measures, this might simply be the flip side of the positive correlation with quickness; perhaps "conscientious" means (in part) that these people work more slowly.

                     Analytical    Quantitative    Verbal
Conscientiousness    −0.17**       −0.14**         −0.12*
Rationality          −0.06         −0.03           −0.08
Ingenuity            −0.06         −0.08           −0.02
Quickness            0.21***       0.15**          0.26***
Creativity           0.24***       0.26***         0.29***
Depth                0.06          0.08            0.15**

10.62. See also the solution to Exercise 2.13. (a) The association is linear and positive; the women's points show a stronger association. As a group, males typically have larger values for both variables. (b) The women's regression line (the solid line in the graph) is ŷ = 201.2 + 24.026x, with s ≈ 95.08 and r² = 0.768. The men's line (the dashed line) is ŷ = 710.5 + 16.75x, with s ≈ 167.1 and r² = 0.351. The women's slope is significantly different from 0 (t = 5.76, df = 10, P < 0.0005), but the men's is not (t = 1.64, df = 5, P = 0.161). These test results, and the values of s and r², confirm the observation that the women's association is stronger; however, see the solution to the next exercise.

[Figure: scatterplot of metabolic rate (cal/day) versus lean body mass (kg), with women and men marked separately and both regression lines.]


10.63. (a) These intervals (in the table below) overlap quite a bit. (b) These quantities can be computed from the data, but it is somewhat simpler to recall that they can be found from the sample standard deviations $s_{x,w}$ and $s_{x,m}$:

$$s_{x,w}\sqrt{11} \approx 6.8684\sqrt{11} \approx 22.78 \quad\text{and}\quad s_{x,m}\sqrt{6} \approx 6.6885\sqrt{6} \approx 16.38$$

The women's SE_b1 is smaller in part because s is divided by a larger number. (c) In order to reduce SE_b1 for men, we should choose our new sample to include men with a wider variety of lean body masses. (Note that just taking a larger sample will reduce SE_b1; it is reduced even more if we choose subjects who will increase s_{x,m}.)

         b1        SE_b1    df    t*        Interval
Women    24.026    4.174    10    2.2281    14.7257 to 33.3263
Men      16.75     10.20    5     2.5706    −9.4699 to 42.9699
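The standard errors in the table follow from SE_b1 = s/(s_x √(n − 1)), with s taken from Exercise 10.62. The sketch below reproduces them and then illustrates part (c) with a hypothetical redesign in which the men's spread in lean body mass is doubled; that doubling is an assumption made for illustration, not something in the data.

```python
import math

def slope_se(s: float, s_x: float, n: int) -> float:
    """SE_b1 = s / (s_x * sqrt(n - 1))."""
    return s / (s_x * math.sqrt(n - 1))

women = slope_se(s=95.08, s_x=6.8684, n=12)
men = slope_se(s=167.1, s_x=6.6885, n=7)
print(round(women, 3), round(men, 2))

# Hypothetical redesign for part (c): same s and n for men, but twice the spread in x.
men_wider = slope_se(s=167.1, s_x=2 * 6.6885, n=7)
print(round(men_wider, 2))  # half of the original SE
```

Doubling s_x halves the standard error, whereas halving it through sample size alone would take roughly four times as many subjects at the same spread.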

10.64. Scatterplots, and portions of the Minitab outputs, are shown below. The equations are:

For all points, MPG = −7.796 + 7.8742 LOGMPH
For speed ≤ 30 mph, MPG = −9.786 + 8.5343 LOGMPH
For fuel efficiency ≤ 20 mpg, MPG = −4.282 + 6.6854 LOGMPH

Students might make a number of observations about the effects of the restrictions; for example, the estimated coefficients (and their standard errors) change quite a bit.

[Figure: scatterplots of fuel efficiency (MPG) versus log(speed in miles per hour), one restricted to speed ≤ 30 mph and one restricted to fuel efficiency ≤ 20 MPG.]

Minitab output
--- All points ---
Predictor    Coef      Stdev     t-ratio    p
Constant     -7.796    1.155     -6.75      0.000
logMPH       7.8742    0.3541    22.24      0.000

s = 0.9995    R-sq = 89.5%    R-sq(adj) = 89.3%

--- Speed 30 mph or less ---
Predictor    Coef      Stdev     t-ratio    p
Constant     -9.786    1.862     -5.26      0.000
logMPH       8.5343    0.6154    13.87      0.000

s = 0.7600    R-sq = 83.5%    R-sq(adj) = 83.1%

--- Fuel efficiency 20 mpg or less ---
Predictor    Coef      Stdev     t-ratio    p
Constant     -4.282    1.647     -2.60      0.013
logMPH       6.6854    0.5323    12.56      0.000

s = 0.9462    R-sq = 78.6%    R-sq(adj) = 78.1%