chapter 4hinemanmath.weebly.com/uploads/1/3/4/2/13425067/chapter... · 2019-11-12 · chapter 4 4.1...

Post on 27-Apr-2020


Chapter 4

4.1 (a) Yes, the scatterplot below (left) shows a linear relationship between the cube root of weight, $\sqrt[3]{\text{weight}}$, and length.

[Figure: scatterplot of cube root of weight versus length (cm) (left) and residual plot versus length (cm) (right).]

(b) Let x = length and $y = \sqrt[3]{\text{weight}}$. The least-squares regression line is $\hat{y} = -0.0220 + 0.2466x$. The intercept of −0.0220 clearly has no practical interpretation in this situation, since weight and the cube root of weight must be positive. The slope 0.2466 indicates that for every 1 cm increase in length, the cube root of weight will increase, on average, by 0.2466. (c) $\sqrt[3]{\widehat{\text{weight}}} = -0.0220 + 0.2466 \times 36 = 8.8556$, so the predicted weight is $8.8556^3 \approx 694.5$ g. The predicted weight with this model is slightly higher than the predicted weight of 689.9 g with the model in Example 4.2. (d) The residual plot above (right) shows the residuals are negative for lengths below 17 cm, positive for lengths between 18 cm and 27 cm, and have no clear pattern for lengths above 28 cm. (e) Nearly all (99.88%) of the variation in the cube root of the weight can be explained by the linear relationship with the length.
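The back-transformation in 4.1(c) can be checked with a short sketch. The coefficients come from part (b); the function and variable names are hypothetical, chosen only for illustration.

```python
# Coefficients reported in 4.1(b): y is the cube root of weight (g), x is length (cm).
INTERCEPT = -0.0220
SLOPE = 0.2466


def predicted_weight(length_cm):
    """Back-transform the fitted cube root of weight to a weight in grams."""
    cube_root_of_weight = INTERCEPT + SLOPE * length_cm
    return cube_root_of_weight ** 3


weight_at_36_cm = predicted_weight(36)
print(f"Predicted weight at 36 cm: {weight_at_36_cm:.1f} g")
```

Cubing the fitted value undoes the cube-root transformation, which is why the prediction is reported in grams rather than on the transformed scale.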

4.2 (a) The scatterplot below (left) shows positive association between length and period with one very unusual point (106.5, 2.115) in the top right corner.

[Figure: scatterplot of period (s) versus length (cm) (left) and residual plot versus length (cm) (right).]

(b) The residual plot above (right) shows that the residuals tend to be small or negative for small lengths and then get larger for lengths between 40 and 50 cm. The residual for the one very large length is negative again. Even though the value of $r^2$ is 0.983, the residual plot suggests that a model with some curvature (or a linear model after a transformation) might be better. (c) The information from the physics student suggests that there should be a linear relationship between period and $\sqrt{\text{length}}$. (d) A scatterplot (left) and residual plot (right) are shown below for the transformed data. The least-squares regression line for the transformed data is $\hat{y} = -0.0858 + 0.210\sqrt{\text{length}}$. The value of $r^2$ is slightly higher, 0.986 versus 0.983, and the residual plot looks better, although the residuals for the three smallest lengths are positive and the residuals for the next six lengths are negative.

[Figure: scatterplot of period versus square root of length (left) and residual plot versus square root of length (right).]

(e) According to the theoretical relationship, the slope in the model for (d) should be $2\pi/\sqrt{980} \approx 0.2007$. The estimated model appears to agree with the theoretical relationship because the estimated slope is 0.210, an absolute difference of about 0.0093. (f) The predicted period of an 80-centimeter pendulum is $\hat{y} = -0.0858 + 0.210\sqrt{80} = 1.7925$ seconds.

4.3 (a) A scatterplot is shown below (left). The relationship is strong, negative and slightly nonlinear (or curved), with no outliers.

[Figure: scatterplot of pressure (atmospheres) versus volume (cubic cm) (left) and pressure (atmospheres) versus 1/volume (right).]

(b) Yes, the scatterplot for the transformed data (above on the right) shows a clear linear relationship. (c) The least-squares regression equation is $\hat{P} = 0.3677 + 15.8994(1/V)$. The square of the correlation coefficient, $r^2 = 0.9958$, indicates almost a perfect fit. The residual plot (below) shows a definite pattern, which should be of some concern, but the model still provides a good fit.

[Figure: residual plot versus 1/volume.]

(d) Letting $y = 1/P$, the least-squares regression line is $\hat{y} = 0.1002 + 0.0398V$. The scatterplot (below on the left), the value of $r^2 = 0.9997$, and the residual plot (below on the right) indicate that the linear model provides an excellent fit for the transformed data. This transformation also achieves linearity because $V = k/P$.

[Figure: scatterplot of 1/pressure versus volume (cubic cm) (left) and residual plot versus volume (cubic cm) (right).]

(e) When the gas volume is 15 cm³, the model in part (c) predicts the pressure to be $\hat{P} = 0.3677 + 15.8994(1/15) = 1.4277$ atmospheres, and the model in part (d) predicts the reciprocal of pressure to be 0.1002 + 0.0398(15) = 0.6972, or $\hat{P} = 1/0.6972 = 1.4343$ atmospheres. The predictions are the same to the nearest one-hundredth of an atmosphere.
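As a rough check of part (e), the two fitted models can be evaluated side by side. This is only a sketch using the coefficients quoted in parts (c) and (d); the function names are hypothetical.

```python
def pressure_model_c(volume):
    """Model (c): pressure regressed on 1/volume."""
    return 0.3677 + 15.8994 * (1 / volume)


def pressure_model_d(volume):
    """Model (d): 1/pressure regressed on volume, then inverted."""
    reciprocal_pressure = 0.1002 + 0.0398 * volume
    return 1 / reciprocal_pressure


pressure_c = pressure_model_c(15)
pressure_d = pressure_model_d(15)
print(f"model (c): {pressure_c:.4f} atm, model (d): {pressure_d:.4f} atm")
```

Both models are rearrangements of the same inverse relationship between pressure and volume, so close agreement at 15 cm³ is expected.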

4.4 (a) The scatterplot below (left) shows that the relationship between period² and length is roughly linear.

[Figure: scatterplot of period squared versus length (cm) (left) and residual plot versus length (cm) (right).]

(b) The least-squares regression line for the transformed data y = period² and x = length is $\hat{y} = -0.1547 + 0.0428x$. The value of $r^2 = 0.992$ and the residual plot above (right) indicate that

More about Relationships between Two Variables 91

the linear model provides a good fit for the transformed data. As we noticed in Exercise 4.2 part (d), the residual plot looks better, but there is still a pattern with the residuals for the three smallest lengths being positive and the residuals for the next six lengths being negative. (c) According to the theoretical relationship, the slope in the model should be $4\pi^2/980 \approx 0.0403$. The estimated model appears to agree with the theoretical relationship because the estimated slope is 0.0428, an absolute difference of about 0.0025. (d) The predicted squared period of an 80-centimeter pendulum is $\hat{y} = -0.1547 + 0.0428 \times 80 = 3.2693$, or a period of $\sqrt{3.2693} = 1.8081$ seconds. The two models provide very similar predicted values, with an absolute difference of only 0.0156 seconds.
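The theoretical slopes and the two period predictions from Exercises 4.2 and 4.4 can be reproduced with a few lines. The sketch below assumes the value 980 in the formulas is the gravitational acceleration in cm/s²; all variable names are hypothetical.

```python
import math

G = 980  # appears in the theoretical slopes; assumed here to be g in cm/s^2

# Theoretical slopes for the two linearized pendulum models.
slope_period_vs_sqrt_length = 2 * math.pi / math.sqrt(G)
slope_period_squared_vs_length = 4 * math.pi ** 2 / G

# Predicted period (seconds) of an 80 cm pendulum from each fitted model.
period_from_sqrt_model = -0.0858 + 0.210 * math.sqrt(80)
period_from_squared_model = math.sqrt(-0.1547 + 0.0428 * 80)

print(f"{slope_period_vs_sqrt_length:.4f} {slope_period_squared_vs_length:.4f}")
print(f"{period_from_sqrt_model:.4f} {period_from_squared_model:.4f}")
```

The second prediction needs a square root at the end because that model is fitted to the squared period.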

4.5 (a) A scatterplot is shown below (left). The relationship is strong, negative and nonlinear (or curved).

[Figure: scatterplot of light intensity (lumens) versus depth (meters) (left) and ln(light intensity) versus depth (meters) (right).]

(b) The ratios (120.42/168, 86.31/120.42, 61.87/86.31, 44.34/61.87, 31.78/44.34, and 22.78/31.78) are all 0.717. Since the ratios are all the same, the exponential model is appropriate. (c) Yes, the scatterplot (above on the right) shows that the transformation achieves linearity. (d) If x = Depth and y = ln(Light Intensity), then the least-squares regression line is $\hat{y} = 6.7891 - 0.3330x$. The intercept 6.7891 provides an estimate for the average value of the natural log of the light intensity at the surface of the lake. The slope, −0.3330, indicates that the natural log of the light intensity decreases on average by 0.3330 for each one meter increase in depth. (e) The residual plot below (left) shows that the linear model on the transformed data is appropriate. (Some students may suggest that there is one unusually large residual, but they need to look carefully at the scale on the y-axis. All of the residuals are extremely small.) (f) If x = Depth and y = Light Intensity, then the model after the inverse transformation is $\hat{y} = e^{6.7891} e^{-0.333x}$ or $\hat{y} = 888.1139 \times 0.7168^x$. The scatterplot below (right) shows that the exponential model is excellent for these data.

[Figure: residual plot versus depth (meters) (left) and scatterplot of light intensity (lumens) versus depth (meters) with the exponential model (right).]

(g) At 22 m, the predicted light intensity is $\hat{y} = 888.1139 e^{-0.333 \times 22} = 0.5846$ lumens. No, the absolute difference between the observed light intensity 0.58 and the predicted light intensity 0.5846 is very small (0.0046 lumens) because the model provides an excellent fit.
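The constant-ratio check in part (b) and the prediction in part (g) can be sketched as follows. The intensities are the ones quoted in the ratios of part (b); the names are hypothetical.

```python
import math

# Light intensities (lumens) at successive one-meter depths, from part (b).
intensities = [168, 120.42, 86.31, 61.87, 44.34, 31.78, 22.78]

# Ratios of consecutive values; a roughly constant ratio suggests exponential decay.
ratios = [later / earlier for earlier, later in zip(intensities, intensities[1:])]


def predicted_intensity(depth_m):
    """Invert ln(intensity) = 6.7891 - 0.3330 * depth."""
    return math.exp(6.7891 - 0.3330 * depth_m)


intensity_at_22_m = predicted_intensity(22)
print([round(r, 3) for r in ratios])
print(f"{intensity_at_22_m:.4f} lumens")
```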

4.6 (a) A scatterplot is shown below (left).

[Figure: scatterplot of acres versus year (left) and log(acres) versus year (right).]

(b) The ratios are 226,260/63,042 = 3.5890, 907,075/226,260 = 4.0090, and 2,826,095/907,075 = 3.1156. (c) The transformed values of y are 4.7996, 5.3546, 5.9576, and 6.4512. A scatterplot of the logarithms against year is shown above (right). (d) Minitab output is shown below.

The regression equation is
log(Acres) = - 1095 + 0.556 year

Predictor      Coef   SE Coef       T      P
Constant   -1094.51     29.26  -37.41  0.001
year        0.55577   0.01478   37.60  0.001

S = 0.0330502   R-Sq = 99.9%   R-Sq(adj) = 99.8%

(e) If x = year and y = acres, then the model after the inverse transformation is $\hat{y} = 10^{-1094.51} \times 10^{0.5558x}$. The coefficient of $10^{0.5558x}$ is 0.0000 (rounded to 4 decimal places) so all of the predicted values would be 0. (Note: If properties of exponents are not used to simplify the right-hand side, then some calculators will be able to do the calculations without having serious overflow problems.) (f) The least-squares regression line of log(acres) on year, with x measured in years since 1977, is $\hat{y} = 4.2513 + 0.5558x$. (g) The residual plot below shows no clear pattern, so the linear regression model on the transformed data is appropriate.
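The numerical problem described in part (e) can be illustrated with double-precision arithmetic. This is a sketch, not the calculator procedure the exercise has in mind; it uses the unrounded Minitab slope 0.55577 because rounding the slope shifts the exponent noticeably once it is multiplied by a calendar year. The names are hypothetical.

```python
# Minitab coefficients for log10(acres) on calendar year, from part (d).
INTERCEPT = -1094.51
SLOPE = 0.55577

# Evaluated on its own, the simplified coefficient underflows to zero.
coefficient = 10.0 ** INTERCEPT


def predicted_acres(year):
    """Add the exponents first, then raise 10 to the moderate-sized sum."""
    return 10.0 ** (INTERCEPT + SLOPE * year)


acres_1978 = predicted_acres(1978)
print(coefficient)
print(round(acres_1978))
```

Keeping the two exponents together avoids ever forming the extremely small and extremely large factors separately, which is the point of the note in part (e).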

[Figure: residual plot versus years since 1977 (left) and scatterplot of acres versus years since 1977 with the exponential model (right).]

(h) If x = years since 1977 and y = acres, then the model after the inverse transformation is $\hat{y} = 10^{4.2513} \times 10^{0.5558x} = 17{,}836.1042 \times 10^{0.5558x}$. A scatterplot with the exponential model superimposed is shown above (right). The exponential model provides an excellent fit. (i) The predicted number of acres defoliated in 1982 (5 years since 1977) is $\hat{y} = 17{,}836.1042 \times 10^{0.5558 \times 5} = 10{,}722{,}597.42$ acres.

4.7 (a) If y = number of transistors and x = number of years since 1970, then $y(1) = ab^1 = 2250$ and $y(4) = ab^4 = 9000$, so $a = \left(\frac{2250^4}{9000}\right)^{1/3} = 1417.4112$ and $b = \frac{2250}{1417.4112} = 1.5874$. This model predicts the number of transistors in year x after 1970 to be $\hat{y} = 1417.4112 \times 1.5874^x$. (b) Using the natural logarithm transformation on both sides of the model in (a) produces the line $\ln \hat{y} = 7.2566 + 0.4621x$. (c) The slope for Moore’s model (0.4621) is larger than the estimated slope in Example 4.6 (0.332), so the actual transistor counts have grown more slowly than Moore’s law suggests.

4.8 (a) According to the claim, the number of children killed doubled every year after 1950.

Year              1951  1952  1953  1954  1955  1956  1957  1958  1959  1960
Number of deaths     2     4     8    16    32    64   128   256   512  1024

(b) A scatterplot showing the exponential relationship is shown below (left).

[Figure: scatterplot of number of deaths versus year (left) and log(number of deaths) versus year (right).]

(c) According to the paper, the number of children killed x years after 1950 is $2^x$. Thus, $2^{45} = 3.5184 \times 10^{13}$, or approximately 35 trillion children were killed in 1995. This is clearly a mistake. (d) A scatterplot of the logarithms against year (above on the right) shows a strong, positive linear relationship. (e) The least-squares regression line for predicting the logarithm of y = deaths from x = year is approximately $\hat{y} = -587.0 + 0.301x$. Thus, the predicted value in 1995 is $\hat{y} = -587.0 + 0.301 \times 1995 = 13.495$. As a check, $\log(2^{45}) = 13.5463$. The absolute difference in these two predictions, 0.0513, is relatively small.
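The doubling claim and the check in part (e) are easy to reproduce. The sketch below uses only the numbers quoted in the solution; the variable names are hypothetical.

```python
import math

# Doubling every year after 1950 gives 2**x deaths x years later.
deaths_1995 = 2 ** 45

# Compare the exact base-10 logarithm with the regression prediction for 1995.
log_exact = math.log10(deaths_1995)
log_from_regression = -587.0 + 0.301 * 1995

print(f"{deaths_1995:,}")
print(f"{log_exact:.4f} {log_from_regression:.3f}")
```

The slope 0.301 is close to log 2, which is what a doubling each year looks like on the base-10 log scale.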

4.9 (a) A scatterplot is shown below.

[Figure: scatterplot of population (in millions) versus time since 1790 (left) and log(population) versus time since 1790 (right).]

(b) In the scatterplot above (right), the transformed data appear to be linear from 0 to 90 (or 1790 to about 1880), and then linear again, but with a smaller slope. The linear trend indicates that the exponential model is still appropriate and the smaller slope reflects a slower growth rate. (c) The least-squares regression line for predicting y = log(population) from x = time since 1790 is $\hat{y} = 1.329 + 0.0054x$. Transforming back to the original variables, the estimated population size is $21.3304 \times 1.0125^x$. A scatterplot with this regression line is shown below (left). (d) The residual plot (below on the right) shows random scatter and $r^2$ = 0.995, so the exponential model provides an excellent fit.

[Figure: scatterplot of log(population) versus time since 1790 with the regression line (left) and residual plot versus time since 1790 (right).]

(e) The predicted population in 2010 is $\hat{y} = 1.329 + 0.0054 \times 220 = 2.517$, or about $10^{2.517} = 328.8516$ million people. The prediction is probably too low, because these estimates usually do not include homeless people and illegal immigrants.
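The back-transformation in parts (c) and (e) can be sketched as below, using the fitted line for log(population). The function and variable names are hypothetical.

```python
INTERCEPT = 1.329   # fitted log10(population in millions) at x = 0 (1790)
SLOPE = 0.0054      # fitted change in log10(population) per year

# Exponential form: population = initial_value * growth_factor ** x.
initial_value = 10 ** INTERCEPT
growth_factor = 10 ** SLOPE


def predicted_population_millions(years_since_1790):
    """Back-transform the fitted line to a population in millions."""
    return 10 ** (INTERCEPT + SLOPE * years_since_1790)


population_2010 = predicted_population_millions(2010 - 1790)
print(f"{initial_value:.4f} {growth_factor:.4f} {population_2010:.1f}")
```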

4.10 (a) A scatterplot of distance versus height is shown below (left).

[Figure: scatterplot of distance versus height (left) and distance versus square root of height (right).]

(b) The curve tends to bow downward, which resembles a power curve $x^p$ with p < 1. Since we want to pull in the right tail of the distribution, we should apply a transformation $x^p$ with p < 1. (c) A scatterplot of distance against the square root of height (shown above, right) straightens the graph quite nicely.

4.11 (a) Let x = Body weight in kg and y = Life span in years. Scatterplots of the original data (left) and the transformed data (right), after taking the logarithms of both variables, are shown below. The linear trend in the scatterplot for the transformed data suggests that the power model is appropriate.

[Figure: scatterplot of life span (years) versus weight (kg) (left) and log(life span) versus log(weight) (right).]

(b) The least-squares regression line for the transformed data is $\log \hat{y} = 0.7617 + 0.2182\log(x)$. The residual plot (below on the left) shows fairly random scatter about zero and $r^2 = 0.7117$. Thus, 71.17% of the variation in the log of the life spans is explained by the linear relationship with the log of the body weight.

[Figure: residual plot versus log(weight) (left) and scatterplot of life span (years) versus transformed weight (right).]

(c) The inverse transformation gives the estimated power model $\hat{y} = 10^{0.7617} x^{0.2182} = 5.7770x^{0.2182}$. (d) This model predicts the average life span for humans to be $\hat{y} = 5.7770 \times 65^{0.2182} = 14.3642$ years, considerably shorter than the expected life span of humans. (e) According to the biologists, the power model is $y = ax^{0.2}$. The easiest and best option is to plot a graph of $(\text{weight}^{0.2}, \text{lifespan})$ and then fit a least-squares regression line using the transformed weight as the explanatory variable. The scatterplot (above on the right) shows that this model provides a good fit for the data. The least-squares regression line is $\hat{y} = -2.70 + 7.95x^{0.2}$, with a predicted average life span of $\hat{y} = -2.7 + 7.95 \times 65^{0.2} = 15.62$ years for humans. Note: Students may try some other models, which are not as good. For example, raising both sides of the equation to the fifth power, the model becomes $y^5 = a^5 x$, which is a linear regression model with no intercept parameter (or an intercept of zero). After transforming life span y to $y^5$, the estimated model is $\hat{y}^5 = 30{,}835x$. This model predicts the average life span of humans to be $\hat{y} = (30{,}835 \times 65)^{0.2} = 18.2134$ years. Another option is to try plotting a graph of $(\text{weight}, \text{lifespan}^5)$ to achieve linearity. The least-squares regression line for this set of transformed data is $\hat{y}^5 = 1{,}389{,}463 + 30{,}068x$, with a predicted average life span of $\hat{y} = (1{,}389{,}463 + 30{,}068 \times 65)^{0.2} = 20.1767$ years for humans. Note that none of the models provides a reasonable estimate for the average life span of humans.
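The four life span predictions discussed in 4.11 can be compared in one place. This sketch plugs the 65 kg body weight used in the solution into each fitted model; the dictionary keys and variable names are hypothetical labels.

```python
HUMAN_WEIGHT_KG = 65  # body weight used for humans in the solution

predictions = {
    # Power model from the log-log fit, part (c).
    "log-log power model": 5.7770 * HUMAN_WEIGHT_KG ** 0.2182,
    # Life span regressed on weight**0.2, part (e).
    "weight^0.2 model": -2.70 + 7.95 * HUMAN_WEIGHT_KG ** 0.2,
    # Life span**5 regressed on weight with no intercept.
    "y^5 model, no intercept": (30835 * HUMAN_WEIGHT_KG) ** 0.2,
    # Life span**5 regressed on weight with an intercept.
    "y^5 model, with intercept": (1389463 + 30068 * HUMAN_WEIGHT_KG) ** 0.2,
}

for label, years in predictions.items():
    print(f"{label}: {years:.2f} years")
```

All four values fall well below a realistic human life span, which is the closing point of the solution.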

4.12 (a) The power model would be more appropriate for these data. The scatterplot of the log of cost versus diameter (below on the left) is linear, but the plot of the log of cost versus the log of diameter (below on the right) shows almost a perfect straight line.

[Figure: scatterplot of log(cost) versus diameter (inches) (left) and log(cost) versus log(diameter) (right).]

(b) Let y = the cost of the pizza and x = the diameter of the pizza. The least-squares regression line is $\log \hat{y} = -1.5118 + 2.1150\log x$. The inverse transformation gives the estimated power model $\hat{y} = 10^{-1.5118} x^{2.115} = 0.0308x^{2.115}$. (c) According to this model, the predicted costs of the four different size pizzas are $4.01, $5.90, $8.18, and $13.91, from smallest to largest. There are only slight differences between the predicted costs for the model and the actual costs, so an adjustment does not appear to be necessary based on this model. (d) According to our estimated power model in part (b), the predicted cost for the new “soccer team” pizza is $\hat{y} = 0.0308 \times 24^{2.115} = \$25.57$. (e) An alternative model is based on setting the cost proportional to the area, or the power model of the form $\text{cost} \propto (\pi/4)x^2$. Most students will square the diameter and then fit a linear model to obtain the least-squares regression line $\hat{y} = -0.506 + 0.0445x^2$. The estimated price of the “soccer team” pizza is $\hat{y} = -0.506 + 0.0445 \times 24^2 = \$25.13$. Alternatively, this model can be rewritten as $\sqrt{y} = bx$. Using least-squares with no intercept, the value of b is estimated to be 0.2046, so the predicted cost of the “soccer team” pizza is $\hat{y} = (0.2046 \times 24)^2 = \$24.11$.

4.13 (a) As height increases, weight increases. Since weight is a 3-dimensional characteristic and height is 1-dimensional, weight should be proportional to the cube of the height. A model of the form weight = $a(\text{height})^b$ would be a good place to start. (b) A scatterplot of the response variable y = weight versus the explanatory variable x = height is shown below.

[Figure: scatterplot of weight (pounds) versus height (inches).]

(c) Calculate the logarithms of the heights and the logarithms of the weights. The least-squares regression line for the transformed data is $\log \hat{y} = -1.3912 + 2.0029\log x$. $r^2 = 0.9999$; almost all (99.99%) of the variation in log of weight is explained by the linear relationship with log of height. (d) The residual plot below for the transformed data shows that the residuals are very close to zero with no discernible pattern. This model clearly fits the transformed data very well.

[Figure: residual plot versus log(height).]

(e) The inverse transformation gives the estimated power model $\hat{y} = 10^{-1.3912} x^{2.0029} = 0.0406x^{2.0029}$. The predicted weight of a 5'10" (70") adult is $\hat{y} = 0.0406 \times 70^{2.0029} = 201.4062$ lbs, and the predicted weight of a 7' (84") adult is $\hat{y} = 0.0406 \times 84^{2.0029} = 290.1784$ lbs.
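The two predictions in part (e) follow directly from the estimated power model. A minimal sketch, with a hypothetical function name, is:

```python
def predicted_weight_lbs(height_in):
    """Estimated power model from part (e): weight = 0.0406 * height**2.0029."""
    return 0.0406 * height_in ** 2.0029


weight_70_in = predicted_weight_lbs(70)
weight_84_in = predicted_weight_lbs(84)
print(f"{weight_70_in:.2f} lbs, {weight_84_in:.2f} lbs")
```

Note that the fitted exponent is close to 2, not the 3 suggested by the dimensional argument in part (a).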

4.14 Who? The individuals are hearts from various mammals. What? The response variable y is the weight of the heart (in grams) and the explanatory variable x is the length of the left ventricle (in cm). Why? The data were collected to explore the relationship in these two quantitative measurements for hearts of mammals. When, where, how, and by whom? The data were originally collected back in 1927 by researchers studying the physiology of the heart. Graphs: A scatterplot of the original data is shown below (left). The nonlinear trend in the scatterplot makes sense because the heart weight is a 3-dimensional characteristic which should be proportional to the cube of the length of the cavity of the left ventricle. A scatterplot, after transforming the data by taking the logarithms of both variables, shows a clear linear trend (below, right) so the power model is appropriate.

[Figure: scatterplot of heart weight (grams) versus cavity length (cm) (left) and log(heart weight) versus log(cavity length) (right).]

Numerical Summaries: The correlation between log of cavity length and log of heart weight is 0.997, indicating a near perfect association. Model: The power model is $\text{weight} = a \times \text{length}^b$. After taking the logarithms of both variables, the least-squares regression line is $\log \hat{y} = -0.1364 + 3.1387\log x$. Approximately 99.3% of the variation in the log of heart weight is explained by the linear relationship with log of cavity length. The residual plot below suggests that there may be a little bit of curvature remaining, but nothing to get overly concerned about.

[Figure: residual plot versus log(cavity length).]

Interpretation: The inverse transformation gives the estimated power model $\hat{y} = 10^{-0.1364} x^{3.1387} = 0.7305x^{3.1387}$, which provides a good fit for these data.

4.15 (a) The scatterplot below (left) shows that the relationship between y = distance and x = time is strong, positive, and nonlinear (curved).

[Figure: scatterplot of distance (cm) versus time (seconds) (left) and residual plot versus time squared (right).]

(b) The least-squares regression line for the transformed data is $\hat{y} = 0.990 + 490.416x^2$. (c) The residual plot above (right) shows random scatter and $r^2 = 0.9984$, so 99.84% of the variability in the distance fallen is explained with this linear model. (d) Yes, the scatterplot below (left) shows that this transformation does a very good job creating a linear trend. The least-squares regression line for the transformed data is $\widehat{\sqrt{y}} = 0.1046 + 22.0428x$.

[Figure: scatterplot of square root of distance versus time (seconds) (left) and residual plot versus time (seconds) (right).]

(e) The residual plot above (right) shows no obvious pattern and $r^2 = 0.9986$. This is an excellent model. (f) The predicted distance that an object had fallen after 0.47 seconds is 109.32 cm using the model from (b) and 109.51 cm using the model from (d). There is very little difference in the predicted values, but most students will probably pick the prediction from (d) because $r^2$ is a little higher and the residual plot shows less variability about the regression line.
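The two predictions in part (f) can be reproduced from the fitted lines in parts (b) and (d). The sketch below assumes, as the axis labels suggest, that model (d) is fitted to the square root of distance; the variable names are hypothetical.

```python
TIME_S = 0.47

# Model (b): distance regressed on time squared.
distance_model_b = 0.990 + 490.416 * TIME_S ** 2

# Model (d): square root of distance regressed on time, then squared.
distance_model_d = (0.1046 + 22.0428 * TIME_S) ** 2

print(f"(b): {distance_model_b:.2f} cm, (d): {distance_model_d:.2f} cm")
```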

4.16 (a) We are given the model $\ln y = -2.00 + 2.42\ln x$. Using properties of logarithms, the power model is $e^{\ln y} = e^{-2.00 + 2.42\ln x}$ or $y = e^{-2.00}x^{2.42}$. (b) The estimated biomass of a tree with a diameter of 30 cm is $\hat{y} = e^{-2.00} \times 30^{2.42} = 508.2115$ kg.
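The equivalence of the log-scale line and the power model in 4.16 can be confirmed numerically. This is a small sketch with hypothetical names.

```python
import math

DIAMETER_CM = 30

# Power-model form: y = e**(-2.00) * x**2.42.
biomass_power_form = math.exp(-2.00) * DIAMETER_CM ** 2.42

# Log-scale form: ln y = -2.00 + 2.42 ln x, then exponentiate.
biomass_log_form = math.exp(-2.00 + 2.42 * math.log(DIAMETER_CM))

print(f"{biomass_power_form:.2f} kg")
```

Both expressions should agree up to floating-point rounding, since they are the same model written two ways.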

4.17 Who? The individuals are carnivores. What? The response variable y is a measure of abundance and the explanatory variable x is the size of the carnivore. Why? Ecologists were interested in learning more about nature’s patterns. When, where and how? The data were collected before 2002 (the publication date) by relating the body mass of the carnivore to the number of carnivores. Rather than simply counting the total number of observed carnivores, the researchers created a measure of abundance based on a count relative to the size of prey in an area. Graphs: A scatterplot of y = abundance versus x = body mass (on the left below) shows a nonlinear relationship. Using the log transformation for both variables provides a moderately strong, negative, linear relationship (see the scatterplot below on the right).

[Figure: scatterplot of abundance versus body mass (kg) (left) and log(abundance) versus log(body mass) (right).]

Numerical Summaries: The correlation between log body mass and log abundance is −0.912. Model: The least-squares regression line for the transformed data is $\log \hat{y} = 1.9503 - 1.0481\log x$, with an $r^2 = 0.8325$ and a residual plot (below) showing no obvious patterns.

[Figure: residual plot versus log(body mass).]

Interpretation: The inverse transformation gives the estimated power model $\hat{y} = 10^{1.9503} x^{-1.0481} = 89.1867x^{-1.0481}$, which provides a good fit for these data.

4.18 Let x = the breeding length, length at which 50% of females first reproduce and y = the asymptotic body length. The scatterplot (left) and residual plot (right) below show that the linear model does not provide a great fit for these body measurements of this fish species. Most of the residuals are negative for breeding lengths below 30 cm and above 150 cm.

[Figure: scatterplot of asymptotic body length (cm) versus breeding length (cm) (left) and residual plot versus breeding length (right).]

Applying the log transformation to both lengths produces better results. The scatterplot (left) and residual plot (right) below show that a linear model provides a very good fit. The least-squares regression model for the transformed data is $\log \hat{y} = 0.3011 + 0.9520\log x$, with an $r^2 = 0.898$ and a residual plot with very little structure, although most of the residuals are still negative when the explanatory variable is above 1.9.

[Figure: scatterplot of log(asymptotic body length) versus log(breeding length) (left) and residual plot versus log(breeding length) (right).]

The inverse transformation gives the estimated power model $\hat{y} = 10^{0.3011}x^{0.952} = 2.0003x^{0.952}$, which provides a good fit for these data.

4.19 (a) Scatterplots of the original data (left) and the transformed data (right) are shown below.

[Figure: scatterplot of mean colony size versus time (hours) (left) and log(mean colony size) versus time (hours) (right).]

(b) The first phase is from 0 to 6 hours when the mean colony size actually decreases. This decrease is hard to see on the graph of the original data, but is more obvious on the graph of the transformed data. In the second phase, from 6 to 24 hours, the mean colony size increases exponentially. Both graphs show this phase clearly, but it is most noticeable from the linear trend on the graph of the transformed data for this time period. At 36 hours, mean growth is in the third phase where growth is still occurring, but at a lower rate than the previous phase. The point in the top right corner of both graphs clearly shows the new phase because this point does not fit the pattern for phase two. (c) Let y = mean colony size and x = time. The least-squares regression line for the transformed data is $\log \hat{y} = -0.5942 + 0.0851x$. Using the inverse transformation, the predicted size of a colony 10 hours after inoculation is $\hat{y} = 10^{-0.5942 + 0.0851 \times 10} = 10^{0.2568} = 1.8063$.

4.20 The correlation for time (hours 6–24) and log (mean colony size) is $r = 0.9915$. The correlation for time (hours 6–24) and log (individual colony size) is $r = 0.9846$. As expected, the correlation for the individual colony size is smaller than the correlation for the mean colony size because individual measurements have more variability. The scatterplots below show the differences in the relationships for mean colony sizes (left) and individual colony sizes (right).

[Figure: scatterplot of log(mean colony size) versus time (hours) (left) and log(colony size) versus time (hours) (right).]

4.21 (a) $\text{weight} = c_1(\text{height})^3$ and $\text{strength} = c_2(\text{height})^2$, so $\text{strength} = c_3(\text{weight})^{2/3}$, where $c_1$, $c_2$, and $c_3$ are arbitrary constants. (b) The graph of $y = x^{2/3}$ below shows that strength does not increase linearly with body weight, as would be the case if a person 1 million times as heavy as an ant could lift 1 million times more than the ant. Strength increases more slowly. For example, if weight is multiplied by 1000, strength will increase by a factor of $1000^{2/3} = 100$.

[Figure: graph of strength versus weight, $y = x^{2/3}$.]

4.22 (a) Answers will vary. (b) The population of cancer cells after $n - 1$ years is $P = P_0(7/6)^{n-1}$. The population of cancer cells after $n$ years is $P = P_0(7/6)^{n-1} + (1/6)(P_0(7/6)^{n-1}) = P_0(7/6)^n$. (c) Answers will vary, but the exponential model should provide a good fit for the data collected.
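The step from the yearly recursion to the closed form in part (b) can be checked with exact fractions. The sketch below assumes an arbitrary starting population of 1; the function names are hypothetical.

```python
from fractions import Fraction


def population_by_recursion(initial_population, years):
    """Each year the population grows by one sixth of its current size."""
    population = Fraction(initial_population)
    for _ in range(years):
        population += population * Fraction(1, 6)
    return population


def population_closed_form(initial_population, years):
    """Closed form from part (b): P0 * (7/6) ** n."""
    return Fraction(initial_population) * Fraction(7, 6) ** years


for n in range(5):
    print(n, population_by_recursion(1, n), population_closed_form(1, n))
```

Exact rational arithmetic is used so the two columns can be compared without floating-point rounding.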

4.23 (a) The sum of the six counts is 10+9+24+61+206+548 = 858 people. (b) The sum of the top row shows 10+9+24 = 43 people had arthritis. (c) The marginal distribution of participation in soccer is shown below.

          Elite   Non-elite   Did not play
Count        71         215            572
Percent    8.3%       25.1%          66.7%

(d) The percent of each group who have arthritis is 14.08% for the elite soccer players, 4.19% for the non-elite soccer players, and 4.20% for the people who did not play. This suggests an association between playing elite soccer and developing arthritis.

4.24 The percents should add to 100% because they provide a breakdown of all participants according to one categorical variable. The sum is 8.3% + 25.1% + 66.7% = 100.1%. If one more decimal place is included in each of the percents, then the sum is 8.28% + 25.06% + 66.67% = 100.01%. The percents do not add to exactly 100% because of rounding.

4.25 (a) The sum of the six counts is 5375 students. (b) The proportion of these students who smoke is 1004/5375 = 0.1868, so the percent of smokers is 18.68%. (c) The marginal distribution of parents' smoking behavior is shown below.

           Neither parent smokes   One parent smokes   Both parents smoke
Count                       1356                2239                 1780
Percent                   25.23%              41.66%               33.12%

(d) The three conditional distributions are shown in the table below.

                         Neither parent smokes   One parent smokes   Both parents smoke
Student does not smoke                  86.14%              81.42%               77.53%
Student smokes                          13.86%              18.58%               22.47%

The conditional distributions reveal what many people expect—parents have a substantial influence on their children. Students that smoke are more likely to come from families where one or more of their parents smoke.


4.26 (a) The two-way table is shown below. (b) The percents of eggs in each group that hatched are 59.26% in a cold nest, 67.86% in a neutral nest, and 72.12% in a hot nest. The percents indicate that hatching increases with temperature. The cold nest did not prevent hatching, but made it less likely.

              Cold   Neutral   Hot
Hatched         16        38    75
Not hatched     11        18    29
Total           27        56   104

4.27 (a) The two conditional distributions are shown in the table below. The biggest difference between men and women is in Administration: a higher percentage of women chose this major. A greater percent of men chose the other fields, especially finance. (b) A total of 386 students responded, so 722 − 386 = 336 did not respond. About 46.54% of the students did not respond.

                 Female     Male
Accounting       30.22%   34.78%
Administration   40.44%   24.84%
Economics         2.22%    3.73%
Finance          27.11%   36.65%

4.28 Two examples are shown below. In general, choose a to be any number from 0 to 50, and then all the other entries can be determined.

25   25          10   40
35   15          50    0

Note: This is why we say that such a table has “one degree of freedom”: we can make one (nearly) arbitrary choice for the value of a, and then have no more decisions to make.

4.29 (a) The two-way table is shown below. (b) Overall, 11.88% of white defendants and 10.24% of black defendants receive the death penalty. For white victims, 12.58% of white defendants and 17.46% of black defendants receive the death penalty. For black victims, 0% of white defendants and 5.83% of black defendants receive the death penalty. (c) The death penalty is more likely when the victim was white (14.02%) rather than black (5.36%). Because most convicted killers are of the same race as their victims, whites are more often sentenced to death.

                  Death penalty   No death penalty
White defendant              19                141
Black defendant              17                149

4.30 (a) The two-way table is shown below. (b) Overall, 70% of male applicants are admitted, while only 56% of females are admitted. (c) In the business school, 80% of male applicants are admitted, compared with 90% of females. In the law school, 10% of males are admitted, compared with 33.33% of females. (d) Six out of 7 men apply to the business school, which admits 82.5% of all applicants, while 3 out of 5 women apply to the law school, which admits only 27.5% of its applicants.

         Admit   Deny
Male       490    210
Female     280    220


4.31 The table below gives the two marginal distributions. The marginal distribution of marital status is found by taking, e.g., 337/8235 ≈ 4.1%. The marginal distribution of job grade is found by taking, e.g., 955/8235 ≈ 11.6%.

Single    Married   Divorced   Widowed
4.1%      93.9%     1.5%       0.5%

Grade 1   Grade 2   Grade 3    Grade 4
11.6%     51.5%     30.2%      6.7%

As rounded here, both sets of percents add up to 100%. If students round to the nearest whole percent, the marital status numbers add up to 101%. If they round to two places after the decimal, the job grade percents add up to 100.01%.

4.32 The percent of single men in grade 1 jobs is 58/337 ≈ 17.21%. The percent of grade 1 jobs held by single men is 58/955 ≈ 6.07%.

4.33 Divide the entries in the first column by the first column total; e.g., 17.21% ≈ 58/337. These should add to 100% (except for rounding error). The percentages in the table below add to 100.01%.

Job grade   % of single men
1                    17.21%
2                    65.88%
3                    14.84%
4                     2.08%

If the percents are rounded to the nearest tenth (17.2%, 65.9%, 14.8%, and 2.1%), then they add to 100%.

4.34 (a) We need to compute percents to account for the fact that the study included many more married men than single men, so that we would expect their numbers to be higher in every job grade (even if marital status had no relationship with job level). (b) A table of percents is below; descriptions of the relationship may vary. Single and widowed men had higher percents of grade 1 jobs; single men had the lowest (and widowed men the highest) percents of grade 4 jobs.

Job grade   Single   Married   Divorced   Widowed
1           17.21%    11.31%     11.90%    19.05%
4            2.08%     6.90%      5.56%     9.52%

4.35 Age is the main lurking variable: Married men would generally be older than single men, so they would have been in the work force longer, and therefore had more time to advance in their careers. 4.36 (a) A bar graph is shown below—58.33% of desipramine users did not have a relapse, while 25.0% of lithium users and 16.7% of those who received a placebo succeeded in breaking their addictions. (b) Because random assignment was used, there is statistical evidence for causation (though there are other questions we need to consider before we can reach that conclusion).

[Figure: bar graph of the percent with no relapse for the desipramine, lithium, and placebo groups.]

4.37 (a) To find the marginal distribution of opinion, we need to know the total numbers of people with each opinion: 49/133 ≈ 36.84% said “higher,” 32/133 ≈ 24.06% said “the same,” and 52/133 ≈ 39.10% said “lower.” The numbers are summarized in the first table below. The main finding is probably that about 39% of users think the recycled product is of lower quality. This is a serious barrier to sales. (b) There were 36 buyers and 97 nonbuyers among the respondents, so (for example) 20/36 ≈ 55.56% of buyers rated the quality as higher. Similar arithmetic with the buyers and nonbuyers rows gives the two conditional distributions of opinion, shown in the second table below. We see that buyers are much more likely to consider recycled filters higher in quality, though 25% still think they are lower in quality. We cannot draw any conclusion about causation: It may be that some people buy recycled filters because they start with a high opinion of recycled products, or it may be that use persuades people that the quality is high.

Higher   The same    Lower
36.84%     24.06%   39.10%

            Higher   The same    Lower
Buyers      55.56%     19.44%   25.00%
Nonbuyers   29.90%     25.77%   44.33%

4.38 (a) The two-way table is shown below. (b) The overall batting averages are 0.240 for Joe and 0.260 for Moe. Moe has the better overall batting average.

       Hit   No hit
Joe    120      380
Moe    130      370

(c) Two separate tables, one for each type of pitcher, are shown below. Against left-handed pitchers, Joe’s batting average is 0.200 and Moe’s batting average is 0.100. Against right-handed pitchers, Joe’s batting average is 0.400 and Moe’s batting average is 0.300. Joe is better against both kinds of pitchers.

Left-handed pitchers          Right-handed pitchers
       Hit   No hit                  Hit   No hit
Joe     80      320           Joe     40       60
Moe     10       90           Moe    120      280

(d) Both players do better against right-handed pitchers than against left-handed pitchers. Joe spent 80% of his at-bats facing left-handers, while Moe only faced left-handers 20% of the time.
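The reversal in parts (b) and (c) is an instance of Simpson's paradox, and it can be reproduced directly from the counts in the tables above. The following is a minimal sketch; the dictionary layout and function names are illustrative choices, and at-bats are taken as hits plus no-hits.

```python
# (hits, at_bats) for each player against each type of pitcher,
# from the tables in part (c); at_bats = hits + no hits.
records = {
    "Joe": {"left": (80, 400), "right": (40, 100)},
    "Moe": {"left": (10, 100), "right": (120, 400)},
}


def average(hits, at_bats):
    """Batting average as hits divided by at-bats."""
    return hits / at_bats


def overall(player):
    """Batting average after pooling both types of pitcher."""
    hits = sum(h for h, _ in records[player].values())
    at_bats = sum(n for _, n in records[player].values())
    return average(hits, at_bats)


for player, splits in records.items():
    by_pitcher = {side: round(average(*counts), 3) for side, counts in splits.items()}
    print(player, by_pitcher, "overall:", round(overall(player), 3))
```

Joe leads within each pitcher type, yet Moe leads overall, because most of Joe's at-bats came against the left-handers that both players find harder to hit.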


4.39 Examples will vary, of course; one very simplistic possibility is shown below. The key is to be sure that there is a lower percentage of overweight people among the smokers than among the nonsmokers.

Combined – All People
                  Early Death
                  Yes    No
Overweight         41    59
Not overweight     50    50

Smokers
                  Early Death
                  Yes    No
Overweight         10     0
Not overweight     40    20

Non smokers
                  Early Death
                  Yes    No
Overweight         31    59
Not overweight     10    30

4.40 Who? The individuals are students. What? The categorical variables of interest are educational level or degree (Associate’s, Bachelor’s, Master’s, Professional, or Doctor’s) and gender (male or female). Why? The researchers were interested in checking if the participation of women changes with level of degree. When, where, how, and by whom? These projections, in thousands, were made for 2005-2006 by the National Center for Education Statistics. Graphs: The conditional distributions of sex for each degree level are shown in the bar graph below (left). The conditional distributions of degree level for each gender are shown in the bar graph below (right).

[Figure: two bar graphs of percent by degree level (Associate's, Bachelor's, Master's, Professional, Doctor's). Left: conditional distributions of sex for each degree level. Right: conditional distributions of degree level for each gender.]

Numerical summaries: The software output below from Minitab provides the joint distribution, marginal distributions, and conditional distributions in one consolidated table. The first entry in each cell is the count, the second entry is the % of the row (or the conditional distribution of gender for each type of degree), the third entry is the % of the column (or the conditional distribution of degree for each gender), and the fourth entry is the overall %.

Rows: Degree   Columns: Gender

                Female     Male       All
Associate's        431      244       675
                 63.85    36.15    100.00
                 26.85    21.90     24.83
                15.851    8.974    24.825
Bachelor's         813      584      1397
                 58.20    41.80    100.00
                 50.65    52.42     51.38
                29.901   21.478    51.379
Doctor's            21       24        45
                 46.67    53.33    100.00
                  1.31     2.15      1.66
                 0.772    0.883     1.655
Master's           298      215       513
                 58.09    41.91    100.00
                 18.57    19.30     18.87
                10.960    7.907    18.867
Professional        42       47        89
                 47.19    52.81    100.00
                  2.62     4.22      3.27
                 1.545    1.729     3.273
All               1605     1114      2719
                 59.03    40.97    100.00
                100.00   100.00    100.00
                59.029   40.971   100.000

Cell Contents: Count, % of Row, % of Column, % of Total

Interpretation: Women earn a majority of associate’s, bachelor’s, and master’s degrees, but fall slightly below 50% for professional and doctoral degrees. The distributions of degree level are very similar for females and males.

4.41 No. Rich nations have more TV sets than poor nations. Rich nations also have longer life expectancies because they have better nutrition, clean water, and better health care. There is a common response relationship between TV sets and length of life.


4.42 In this case, there may be a causative effect, but in the direction opposite to the one suggested: People who are overweight are more likely to be on diets, and so choose artificial sweeteners over sugar. (Also, heavier people are at a higher risk to develop diabetes; if they do, they are likely to switch to artificial sweeteners.)

4.43 No. The number of hours standing up is a confounding variable in this case. The diagram below illustrates the confounding between exposure to chemicals and standing up.

[Diagram for 4.42: Use of sweeteners; Weight gain.]
[Diagram for 4.41: Wealth; x = # of TV sets; y = average life span.]

4.44 Well-off people tend to have more cars. They also tend to live longer, probably because they are better educated, take better care of themselves, and get better medical care. The cars have nothing to do with it. The relationship between number of cars and length of life is common response.

[Diagram for 4.43: Time standing up; Exposure to chemicals; Miscarriages; a “?” marks the uncertain link.]
[Diagram for 4.44: Wealth; Number of cars; Length of life.]

4.45 It could be that children with lower intelligence watch many hours of television and get lower grades as well. It could also be that children come from lower socio-economic households, where parents are less likely to limit television viewing and are unable to help their children with their schoolwork because the parents themselves lack education. The variables “number of hours watching television” and “grade point average” change in common response to “socio-economic status” or “IQ”.

4.46 Single men tend to have a different value system than married men. They have many interests, but getting married and earning a substantial amount of money are not among their top priorities. Confounding is the best term to describe the relationship between marital status and income.

[Diagram for 4.45: IQ or socioeconomic status; Number of hours spent watching TV; GPA.]
[Diagram for 4.46: Values; Marital status; Annual income; a “?” marks the uncertain link.]

4.47 The effects of coaching are confounded with those of experience. A student who has taken the SAT once may improve his or her score on the second attempt because of increased familiarity with the test. The student may also have increased knowledge from additional math and science courses.

4.48 A reasonable explanation is that the cause-and-effect relationship goes in the other direction: Doing well makes students feel good about themselves, rather than vice versa.

CASE CLOSED! 1. (a) Let y = premium and x = age. Scatterplots of the original data (left) and transformed data (right) after taking the logarithms of both variables are shown below. The plot of the original data shows a strong nonlinear relationship. The plot for the transformed data shows a clear linear trend, so the power model is appropriate.

[Diagram for 4.47: Experience; Coaching course; SAT score; a “?” marks the uncertain link.]
[Diagram for 4.48: Self-esteem; Quality of work.]

[Figure: scatterplots of Premium ($) versus Age (years) (left) and Log(Premium) versus Log(Age) (right).]

(b) A scatterplot of the logarithm of premium versus age is shown below (left). The linear trend suggests that the exponential model is appropriate.

[Figure: scatterplot of Log(Premium) versus Age (left) and residual plot of Residual versus Age (right).]

(c) Since the association between the log of premium and age is nearly perfect, the exponential model is most appropriate. The least-squares regression line for the transformed data is $\log \hat{y} = -0.0275 + 0.0373x$. Using the inverse transformation, the predicted premium is $\hat{y} = 10^{-0.0275} \times 10^{0.0373x} = 0.9386 \times 10^{0.0373x}$. (d) The predicted monthly premiums are $\hat{y} = 0.9386 \times 10^{0.0373 \times 58} = \$136.74$ for a 58-year-old and $\hat{y} = 0.9386 \times 10^{0.0373 \times 68} = \$322.76$ for a 68-year-old. (e) You should feel very comfortable with these predictions. The residual plot above (right) shows no clear patterns and $r^2 = 99.9\%$, so the exponential model provides an excellent fit.
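The inverse transformation in parts (c) and (d) can be sketched in a few lines. The coefficients below are the rounded values reported in part (c), so the output may differ from the printed answers in the last digit or two. The names `A`, `B`, and `predict_premium` are illustrative.

```python
# Rounded coefficients of the fitted line log10(y-hat) = A + B * age, from part (c).
A = -0.0275
B = 0.0373


def predict_premium(age):
    """Undo the base-10 log transformation: y-hat = 10**(A + B * age)."""
    return 10 ** (A + B * age)


# The multiplier 10**A corresponds to the constant 0.9386 in the exponential model.
print("multiplier:", round(10 ** A, 4))
for age in (58, 68):
    print(age, round(predict_premium(age), 2))
```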

2. (a) The entries in each column are only from these six selected causes of death. There are other causes of death so the total number of deaths in each age group is higher than the sum of the deaths for these six causes. (b) Percents should be used to compare the age groups because the age groups contain different numbers of individuals. (c) The conditional distributions are shown in the table below. Each entry is obtained by dividing the count for that cause of death by the appropriate column total.

                15 to 24 years   25 to 44 years   45 to 64 years
Accidents               45.32%           21.60%            5.42%
AIDS                     0.52%            5.34%            1.35%
Cancer                   4.93%           14.77%           33.16%
Heart disease            3.28%           12.63%           23.27%
Homicide                15.59%            5.71%            0.63%
Suicide                 11.87%            8.73%            2.30%


Two different bar graphs below show the conditional distributions.

[Figure: two bar graphs of percent by cause of death (Accidents, AIDS, Cancer, Heart, Homicide, Suicide) and age group (P15, P25, P64). Left: causes grouped within each age group. Right: age groups grouped within each cause.]

(d) The leading cause of death for the youngest age group is accidents, followed by homicide and suicide. For the middle age group, accidents are still the leading cause of death, but cancer and heart disease are second and third, respectively. For the oldest age group, cancer is the leading cause of death, with heart disease running a close second.

3. (a) The chance of dying for men over 65 who walk at least 2 miles a day is half that of men who do not exercise. (b) Individuals who exercise regularly have many other habits and characteristics that could contribute to longer lives.

4.49 Spending more time watching TV means that less time is spent on other activities. Answers will vary, but some possible lurking variables are: the amount of time parents spend at home, the amount of exercise, and the economy. For example, parents of heavy TV watchers may not spend as much time at home as other parents. Heavy TV watchers may not get as much exercise as other adolescents. As the economy has grown over the past 20 years, more families can afford TV sets (many homes now contain more than two TV sets), and as a result, TV viewing has increased and children have less physical work to do in order to make ends meet.

4.50 (a) Let y = intensity and x = distance. A scatterplot of the original data is shown below (left). The data appear to follow a power law model of the form $y = ax^b$, where b is some negative number.

[Figure: scatterplots of Intensity (candelas) versus Distance (meters) (left) and Log(Intensity) versus Log(distance) (right).]

(b) A scatterplot of the transformed data (above on the right), after taking the logarithms of both variables, shows a clear linear trend, so the power model is appropriate. The least-squares regression line for the transformed data is $\log \hat{y} = -0.5235 - 2.0126 \log x$. (c) The residual plot below shows no obvious patterns and $r^2 = 99.9\%$, so this linear model on the transformed data provides an excellent fit.

[Figure: residual plot of Residual versus Log(Distance) (left) and plot of Intensity (candelas) versus Distance (meters) showing the observed intensity and the predicted values (right).]

(d) Using the inverse transformation to find the predicted intensity gives $\hat{y} = 10^{-0.5235} x^{-2.0126} = 0.2996 x^{-2.0126}$. The plot of the original data with this model is shown above (right). (e) The predicted intensity of the 100-watt bulb at 2.1 meters is $\hat{y} = 0.2996 \times 2.1^{-2.0126} = 0.0673$ candelas.

4.51 (a) Yes, this transformation achieves linearity; see the scatterplot below.

[Figure: scatterplot of Intensity (candelas) versus 1/(Distance-squared).]

(b) Let x = distance and y = intensity. The least-squares regression line for the transformed data is $\hat{y} = -0.0006 + 0.30\left(\frac{1}{x^2}\right)$. (c) The predicted intensity of the 100-watt bulb at 2.1 meters is $\hat{y} = -0.0006 + 0.30\left(\frac{1}{2.1^2}\right) = 0.0674$ candelas. (d) Writing the model from part (d) of Exercise 4.50 in a slightly different form shows that the models are very similar: $\hat{y} = -0.0006 + \left(\frac{0.3}{2.1^2}\right)$ versus $\hat{y} = \left(\frac{0.3}{2.1^{2.0126}}\right)$. The absolute difference in the predicted values is 0.0001. Thus, the inverse square law provides an excellent model.
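The comparison in part (d) can be reproduced numerically. This sketch uses the rounded coefficients reported in Exercises 4.50(d) and 4.51(b); the function names are illustrative.

```python
def power_model(distance):
    """Power model from Exercise 4.50(d): y-hat = 0.2996 * x**(-2.0126)."""
    return 0.2996 * distance ** -2.0126


def inverse_square_model(distance):
    """Inverse-square model from Exercise 4.51(b): y-hat = -0.0006 + 0.30 / x**2."""
    return -0.0006 + 0.30 / distance ** 2


d = 2.1  # distance in meters
difference = abs(power_model(d) - inverse_square_model(d))
print(round(power_model(d), 4), round(inverse_square_model(d), 4), round(difference, 4))
```

At 2.1 meters the two predictions differ only in the fourth decimal place, which supports the conclusion that the two models are practically interchangeable over this range of distances.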


4.52 The explanatory variable is the amount of herbal tea and the response variable is a measure of health and attitude. The most important lurking variable is social interaction—many of the nursing home residents may have been lonely before the students started visiting. 4.53 (a) The column sums are shown below.

Single:   10,949 + 7,653 + 4,009 + 720 = 23,331
Married:  2,472 + 19,640 + 32,183 + 8,539 = 62,834
Widowed:  16 + 228 + 2,312 + 8,732 = 11,288
Divorced: 155 + 2,904 + 7,898 + 1,703 = 12,660

The sum of these column totals is 23,331 + 62,834 + 11,288 + 12,660 = 110,113, which is not equal to 110,115. The difference is due to rounding. (b) The marginal distributions, conditional distributions, and joint distribution are shown in the software output from Minitab below.

Rows: Age   Columns: Marital Status

         divorced   married    single   widowed       All
15-24         155      2472     10949        16     13592
             1.14     18.19     80.55      0.12    100.00
             1.22      3.93     46.93      0.14     12.34
            0.141     2.245     9.943     0.015    12.344
25-39        2904     19640      7653       228     30425
             9.54     64.55     25.15      0.75    100.00
            22.94     31.26     32.80      2.02     27.63
            2.637    17.836     6.950     0.207    27.631
40-64        7898     32183      4009      2312     46402
            17.02     69.36      8.64      4.98    100.00
            62.39     51.22     17.18     20.48     42.14
            7.173    29.227     3.641     2.100    42.140
65+          1703      8539       720      8732     19694
             8.65     43.36      3.66     44.34    100.00
            13.45     13.59      3.09     77.36     17.89
            1.547     7.755     0.654     7.930    17.885
All         12660     62834     23331     11288    110113
            11.50     57.06     21.19     10.25    100.00
           100.00    100.00    100.00    100.00    100.00
           11.497    57.063    21.188    10.251   100.000

Cell Contents: Count, % of Row, % of Column, % of Total

The table below provides just the marginal distribution for marital status.

Single   Married   Widowed   Divorced
21.19%    57.06%    10.25%     11.50%

A bar chart of the marginal distribution is shown below.

[Figure: bar chart of Percent by Marital status (Single, Married, Widowed, Divorced).]

(c) The two conditional distributions are shown in the table below.

Age     Single   Married   Widowed   Divorced
15−24   80.55%    18.19%     0.12%      1.14%
40−64    8.64%    69.36%     4.98%     17.02%

Among the younger women, more than 4 out of 5 have not yet married, and those who are married have had little time to become widowed or divorced. Most of the older group is or has been married—only about 8.64% are still single. (d) Among single women, 46.93% are 15−24, 32.8% are 25−39, 17.18% are 40−64 and 3.09% are 65 or older. 4.54 (a) The scatterplots below show a strong nonlinear relationship for the original data (left) and a nearly perfect, negative linear association for the transformed data (right).

[Figure: scatterplots of Height (feet) versus Bounce (left) and Log(Height) versus Bounce (right).]

Not only is the linear association between the log(height) and bounce stronger than the linear association between the logarithms of both variables, but there is also a value of zero for the bounce number, which means that the logarithm cannot be used for this point. The exponential model is more appropriate for predicting y = height from x = bounce number. (b) The least-squares regression line for the transformed data is $\log \hat{y} = 0.4610 - 0.1191x$. The residual plot below shows that the first two residuals are positive and the next three residuals are negative, but the residuals are all very small. The value of $r^2$ is 0.998, which indicates that 99.8% of the variability in log(height) is explained by the linear relationship with bounce. This model provides an excellent fit.

[Figure: residual plot of Residual versus Bounce.]

(c) The inverse transformation gives a predicted height of $\hat{y} = 10^{0.4610} \times 10^{-0.1191x} = 2.8907 \times 10^{-0.1191x}$. The predicted height on the 7th bounce is $\hat{y} = 2.8907 \times 10^{-0.1191 \times 7} = 0.4239$ feet.

4.55 The lurking variable is temperature or season. More flu cases occur in winter when less ice cream is sold, and fewer flu cases occur in the summer when more ice cream is sold. This is an example of common response.

[Diagram for 4.55: Season or temperature; Number of flu cases reported; Amount of ice cream sold.]

4.56 Who? The individuals are randomly selected people from three different locations. What? The response variable is whether or not the individual suffered from CHD and the explanatory variable is a measure of how prone an individual is to sudden anger. Both variables are categorical, with CHD being yes or no and the level of anger being classified as low, moderate, or high. Why? The researchers wanted to see if there was an association between these two categorical variables. When, where, how, and by whom? In the late 1990s a random sample of almost 13,000 people was followed for four years. The Spielberger Trait Anger Scale was used to classify the level of anger and medical records were used for CHD. Graphs: A bar graph of the conditional distributions of CHD for each level of anger is shown below (left). To see the increase in the percent of individuals with CHD in each group, a separate bar graph is shown (right). Notice how the change in scale changes your impression of the effect.

[Figure: bar graph of Percent by Anger level (low, moderate, high) and CHD (yes, no), scaled 0 to 100 (left); bar graph of the percent with CHD by Anger level, scaled 0 to 4 (right).]

Numerical summaries: The software output below from Minitab shows the marginal distributions, conditional distributions, and joint distribution.

Rows: CHD   Columns: Anger

          high      low   moderate       All
No         606     3057       4621      8284
          7.32    36.90      55.78    100.00
         95.73    98.30      97.67     97.76
         7.151   36.075     54.532    97.758
Yes         27       53        110       190
         14.21    27.89      57.89    100.00
          4.27     1.70       2.33      2.24
         0.319    0.625      1.298     2.242
All        633     3110       4731      8474
          7.47    36.70      55.83    100.00
        100.00   100.00     100.00    100.00
         7.470   36.700     55.830   100.000

Cell Contents: Count, % of Row, % of Column, % of Total
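The percent with CHD in each anger group, and the relative risk compared with the low-anger group, can be recomputed from the counts in the Minitab output above. This is a minimal sketch; the dictionary and the `relative_risk` helper are illustrative names, and the low-anger group is assumed to be the baseline.

```python
# (CHD cases, group size) for each anger level, from the "Yes" and "All" rows.
chd = {"low": (53, 3110), "moderate": (110, 4731), "high": (27, 633)}

# Proportion of each anger group that experienced CHD.
risk = {level: cases / size for level, (cases, size) in chd.items()}


def relative_risk(group, baseline="low"):
    """Ratio of the CHD risk in one group to the risk in the baseline group."""
    return risk[group] / risk[baseline]


for level, r in risk.items():
    print(f"{level:9s} risk={100 * r:.2f}%  relative risk vs low={relative_risk(level):.2f}")
```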

The most important numbers for comparison are the percents of each anger group that experienced CHD: 53/3110 ≈ 1.70% of the low-anger group, 110/4731 ≈ 2.33% of the moderate-anger group, and 27/633 ≈ 4.27% of the high-anger group. Interpretation: Risk of CHD increases with proneness to sudden anger. It might be good to point out to students that results like these are typically reported in the media with a reference to the relative risk of CHD; for example, because 4.3%/1.7% ≈ 2.5, we might read that “subjects in the high-anger group had 2.5 times the risk of those in the low-anger group.”

4.57 Who? The individuals are cultures of marine bacteria. What? The two quantitative variables are x = time (minutes) and y = count (number of surviving bacteria in hundreds). Why? Researchers wanted to see if the bacteria would decay exponentially over time when exposed to X-rays. When, where, how, and by whom? It is not clear when or where the data were collected, but the counts were obtained after exposing cultures to X-rays for different lengths of time.


Graphs: Scatterplots below show the original data (left) and the transformed data (right) after taking the logarithm of count. Both plots suggest that the exponential decay model is appropriate for these data.

[Figure: scatterplots of Count (in hundreds) versus Time (minutes) (left) and Log(Count) versus Time (minutes) (right).]

Numerical summaries: The least-squares regression line for the transformed data is $\log \hat{y} = 2.5941 - 0.0949x$. Using the inverse transformation, the predicted count is $\hat{y} = 10^{2.5941} \times 10^{-0.0949x} = 392.7354 \times 10^{-0.0949x}$. Interpretation: The residual plot below shows no clear pattern and $r^2 = 98.8\%$, so the exponential decay model provides an excellent model for the number of surviving bacteria after exposure to X-rays.

[Figure: residual plot of Residual versus Time (minutes).]

4.58 (a) The two-way table below was obtained by adding the corresponding entries for each age group. The proportion of smokers who stayed alive for 20 years is 443/582 ≈ 0.7612 or 76.12% and the proportion of nonsmokers who stayed alive is 502/732 ≈ 0.6858 or 68.58%.

        Smoker   Not
Dead       139   230
Alive      443   502

(b) For the youngest group, 269/288 or 93.40% of the smokers and 327/340 or 96.18% of the nonsmokers survived. For the middle group, 167/245 or 68.16% of the smokers and 147/199 or 73.87% of the nonsmokers survived. For the oldest group, 7/49 or 14.29% of the smokers and 28/193 or 14.51% of the nonsmokers survived. The results are reversed when the data for the three age groups are combined. (c) The percents of smokers in the three age groups are 288/628×100 ≈ 45.86% for the youngest group, 245/444×100 ≈ 55.18% for the middle-aged group, but only 49/242×100 ≈ 20.25% for the oldest group.
