Fixing problems with the model
Transforming the data so that the simple linear regression model is okay for the transformed data.
Options for fixing problems with the model
• Abandon simple linear regression model and find a more appropriate – but typically more complex – model.
• Transform the data so that the simple linear regression model works for the transformed data.
Abandoning the model
• If not linear: try a different function, like a quadratic (Ch. 7) or an exponential function (Ch. 13).
• If unequal error variances: use weighted least squares (Ch. 10).
• If error terms are not independent: try fitting a time series model (Ch. 12).
• If important predictor variables omitted: try fitting a multiple regression model (Ch. 6).
• If outlier: use robust estimation procedure (Ch. 10).
Choices for transforming the data
• Transform X values only.
• Transform Y values only.
• Transform both the X and the Y values.
Transforming the X values only
• Appropriate when non-linearity is the only problem – normality and equal variance okay – with the model.
• Transforming the Y values would likely change the well-behaved error terms into badly-behaved error terms.
Memory retention
 time    prop
    1    0.84
    5    0.71
   15    0.61
   30    0.56
   60    0.54
  120    0.47
  240    0.45
  480    0.38
  720    0.36
 1440    0.26
 2880    0.20
 5760    0.16
10080    0.08
• Subjects asked to memorize a list of disconnected items, then asked to recall them at various times up to a week later.
• Predictor time = time, in minutes, since the list was initially memorized.
• Response prop = proportion of items recalled correctly.
Example 1
Fitted line plot
[Plot: prop versus time, with fitted line]
prop = 0.525870 - 0.0000557 time
S = 0.152284   R-Sq = 57.1%   R-Sq(adj) = 53.2%
Example 1
Residual vs. fits plot
[Plot: residuals versus fitted values; response is prop]
Example 1
Normal probability plot
[Plot: normal probability plot of the residuals (RESI1)]
N: 13   Average: -0.0000000   StDev: 0.145801
W-test for Normality   R: 0.9751   P-Value (approx): > 0.1000
Example 1
Transform the X values
 time    prop    log10_time
    1    0.84    0.00000
    5    0.71    0.69897
   15    0.61    1.17609
   30    0.56    1.47712
   60    0.54    1.77815
  120    0.47    2.07918
  240    0.45    2.38021
  480    0.38    2.68124
  720    0.36    2.85733
 1440    0.26    3.15836
 2880    0.20    3.45939
 5760    0.16    3.76042
10080    0.08    4.00346
Change (“transform”) the predictor time to log10(time).
Example 1
Fitted line plot using transformed X values
[Plot: prop versus log10time, with fitted line]
prop = 0.846415 - 0.182427 log10time
S = 0.0233881   R-Sq = 99.0%   R-Sq(adj) = 98.9%
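The two fitted lines for Example 1 can be reproduced with a short least-squares sketch. The data are the thirteen (time, prop) pairs from the slides; the helper name `fit_line` is introduced here for illustration only, and the code uses plain Python rather than Minitab.

```python
import math

# Memory retention data from the slides: minutes since memorizing, proportion recalled.
time = [1, 5, 15, 30, 60, 120, 240, 480, 720, 1440, 2880, 5760, 10080]
prop = [0.84, 0.71, 0.61, 0.56, 0.54, 0.47, 0.45, 0.38, 0.36, 0.26, 0.20, 0.16, 0.08]

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1, sxy ** 2 / (sxx * syy)

# Fit once on the raw predictor and once on the log10-transformed predictor.
raw_fit = fit_line(time, prop)
log_fit = fit_line([math.log10(t) for t in time], prop)

print("raw time:    b0=%.4f  b1=%.7f  R-Sq=%.3f" % raw_fit)
print("log10(time): b0=%.4f  b1=%.4f  R-Sq=%.3f" % log_fit)
```

Only the predictor changes between the two fits; the response prop is left alone, which is why the error terms keep their behavior.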
Example 1
Residuals vs. fits plot using transformed X values
[Plot: residuals versus fitted values; response is prop]
Example 1
Normal probability plot using transformed X values
[Plot: normal probability plot of the residuals (RESI1)]
N: 13   Average: -0.0000000   StDev: 0.0223924
W-test for Normality   R: 0.9786   P-Value (approx): > 0.1000
Example 1
Predicting new proportion
Estimated regression function:
$$\hat{Y} = 0.846 - 0.182 \log_{10}(\text{time})$$
Therefore, we predict the proportion of words recalled after 1000 minutes is:
$$\hat{Y} = 0.846 - 0.182 \log_{10}(1000) = 0.846 - 0.182(3) = 0.30$$
Example 1
Predicting new proportion
Predicted Values for New Observations
New Obs   Fit     SE Fit    95.0% CI          95.0% PI
1         0.299   0.00765   (0.282, 0.316)    (0.245, 0.353)
Values of Predictors for New Observations
New Obs   log10tim
1         3.00
We can be 95% confident that a person will recall between 24.5% and 35.3% of the words after 1000 minutes.
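The Minitab numbers above can be checked by hand with the standard simple linear regression formulas for the standard error of a fit and of a prediction. The sketch below is plain Python, not Minitab's implementation; the t multiplier 2.201 (0.975 quantile, 11 degrees of freedom) is an assumption copied from a t table rather than computed.

```python
import math

# Memory retention data from the slides.
time = [1, 5, 15, 30, 60, 120, 240, 480, 720, 1440, 2880, 5760, 10080]
prop = [0.84, 0.71, 0.61, 0.56, 0.54, 0.47, 0.45, 0.38, 0.36, 0.26, 0.20, 0.16, 0.08]

x = [math.log10(t) for t in time]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(prop) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, prop)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard deviation S, with n - 2 degrees of freedom.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, prop))
s = math.sqrt(sse / (n - 2))

x_new = math.log10(1000)                  # 1000 minutes on the log10 scale
fit = b0 + b1 * x_new
leverage = 1 / n + (x_new - x_bar) ** 2 / sxx
se_fit = s * math.sqrt(leverage)          # standard error of the mean response
se_pred = s * math.sqrt(1 + leverage)     # standard error for a new observation

T_975_DF11 = 2.201                        # assumption: value read from a t table
conf_int = (fit - T_975_DF11 * se_fit, fit + T_975_DF11 * se_fit)
pred_int = (fit - T_975_DF11 * se_pred, fit + T_975_DF11 * se_pred)

print("Fit = %.3f, SE Fit = %.5f" % (fit, se_fit))
print("95%% CI = (%.3f, %.3f)" % conf_int)
print("95%% PI = (%.3f, %.3f)" % pred_int)
```

The prediction interval is wider than the confidence interval because it adds the variability of a single new subject to the uncertainty in the fitted line.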
Transforming the Y values only
• Appropriate when non-normality and/or unequal variances are the problems.
• The transformation on Y may also help to “straighten out” a curved relationship.
Gestation time and birth weight for mammals
Mammal      Birthwgt   Gestation
Goat            2.75         155
Sheep           4.00         175
Deer            0.48         190
Porcupine       1.50         210
Bear            0.37         213
Hippo          50.00         243
Horse          30.00         340
Camel          40.00         380
Zebra          40.00         390
Giraffe        98.00         457
Elephant      113.00         670
• Predictor Birthwgt = birth weight, in kg, of mammal.
• Response Gestation = number of days until birth.
Example 2
Fitted line plot
[Plot: Gestation versus Birthwgt, with fitted line]
Gestation = 187.084 + 3.59137 Birthwgt
S = 66.0943   R-Sq = 83.9%   R-Sq(adj) = 82.1%
Example 2
Residual vs. fits plot
[Plot: residuals versus fitted values; response is Gestation]
Example 2
Normal probability plot
[Plot: normal probability plot of the residuals (RESI1)]
N: 11   Average: -0.0000000   StDev: 62.7025
W-test for Normality   R: 0.9703   P-Value (approx): > 0.1000
Example 2
Transform the Y values
Mammal      Birthwgt   Gestation   log10Gest
Goat            2.75         155     2.19033
Sheep           4.00         175     2.24304
Deer            0.48         190     2.27875
Porcupine       1.50         210     2.32222
Bear            0.37         213     2.32838
Hippo          50.00         243     2.38561
Horse          30.00         340     2.53148
Camel          40.00         380     2.57978
Zebra          40.00         390     2.59106
Giraffe        98.00         457     2.65992
Elephant      113.00         670     2.82607
Change (“transform”) the response Gestation to log10(Gestation).
Example 2
Fitted line plot using transformed Y values
[Plot: log10Gest versus Birthwgt, with fitted line]
log10Gest = 2.29256 + 0.0045211 Birthwgt
S = 0.0939425   R-Sq = 80.3%   R-Sq(adj) = 78.1%
Example 2
Residual vs. fits plot using transformed Y values
[Plot: residuals versus fitted values; response is log10Gest]
Example 2
Normal probability plot using transformed Y values
[Plot: normal probability plot of the residuals (RESI2)]
N: 11   Average: -0.0000000   StDev: 0.0891217
W-test for Normality   R: 0.9743   P-Value (approx): > 0.1000
Example 2
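Predicting new gestation
Estimated regression function:
$$\log_{10}(\widehat{\text{Gest}}) = 2.29 + 0.0045\,\text{Birthwgt}$$
Therefore, since:
$$\log_{10}(\widehat{\text{Gest}}) = 2.29 + 0.0045(50) = 2.515$$
we predict the gestation length of another mammal at 50 kgs to be:
$$\widehat{\text{Gest}} = 10^{\log_{10}(\widehat{\text{Gest}})} = 10^{2.515} = 327.3 \text{ days}$$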
Example 2
Predicting new gestation
Predicted Values for New Observations
New Obs   Fit      SE Fit   95.0% CI            95.0% PI
1         2.5186   0.0306   (2.4494, 2.5878)    (2.2951, 2.7421)
Values of Predictors for New Observations
New Obs   Birthwgt
1         50.0
$$10^{2.2951} = 197.3$$
$$10^{2.7421} = 552.2$$
We can be 95% confident that the gestation length for a new mammal at 50 kgs will be between 197.3 and 552.2 days.
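Because the response was modeled on the log10 scale, every predicted value and interval endpoint has to be raised to the power of 10 to return to days. A minimal sketch of that back-transformation, using only numbers copied from the slides (the variable names are illustrative):

```python
# Values on the log10(Gestation) scale, copied from the Minitab output in the slides.
fit_log10 = 2.5186
pi_log10 = (2.2951, 2.7421)

# Back-transform: 10 raised to the log10-scale value gives days.
fit_days = 10 ** fit_log10
pi_days = tuple(10 ** endpoint for endpoint in pi_log10)

# Hand calculation from the slides, which uses the rounded coefficients 2.29 and 0.0045.
hand_days = 10 ** (2.29 + 0.0045 * 50)

print("Minitab fit, back-transformed: %.1f days" % fit_days)
print("Hand calculation:              %.1f days" % hand_days)
print("95%% PI in days: (%.1f, %.1f)" % pi_days)
```

The small gap between the hand calculation and the back-transformed Minitab fit comes from rounding the coefficients. The interval is symmetric around the fit on the log10 scale but not on the scale of days.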
Transforming both the X and Y values
• Appropriate when the error terms are not normal, have unequal variances, and the function is not linear.
• Transforming the Y values corrects the problems with the error terms (and may help the non-linearity).
• Transforming the X values corrects the non-linearity.
Diameter (inches) and volume (cu. ft.) of 70 shortleaf pines
Example 3
[Plot: Volume versus Diameter, with fitted line]
Volume = -41.5681 + 6.83672 Diameter
S = 9.87485   R-Sq = 89.3%   R-Sq(adj) = 89.1%
Example 3
Residuals vs. fits plot
[Plot: standardized residuals versus fitted values; response is Volume]
Example 3
Normal probability plot
[Plot: normal probability plot of the standardized residuals (SRES1)]
N: 70   Average: 0.0085024   StDev: 1.02852
W-test for Normality   R: 0.9409   P-Value (approx): < 0.0100
Transform the Y values only
Diameter   Volume   logVol
     4.4      2.0   0.69315
     4.6      2.2   0.78846
     5.0      3.0   1.09861
     5.1      4.3   1.45862
     5.1      3.0   1.09861
     5.2      2.9   1.06471
     5.2      3.5   1.25276
     5.5      3.4   1.22378
     5.5      5.0   1.60944
     5.6      7.2   1.97408
     5.9      6.4   1.85630
     5.9      5.6   1.72277
     7.5      7.7   2.04122
     7.6     10.3   2.33214
… and so on …
Transform response volume to loge(volume)
Example 3
Fitted line plot using transformed Y values
[Plot: logVol versus Diameter, with fitted line]
logVol = 0.451703 + 0.239531 Diameter
S = 0.322919   R-Sq = 90.5%   R-Sq(adj) = 90.4%
Example 3
Residuals vs. fits plot using transformed Y values
[Plot: standardized residuals versus fitted values; response is logVol]
Example 3
Normal probability plot using transformed Y values
[Plot: normal probability plot of the standardized residuals (SRES4)]
N: 70   Average: -0.0077969   StDev: 1.01888
W-test for Normality   R: 0.9610   P-Value (approx): < 0.0100
Example 3
Transform both the X and Y values
Diameter   Volume   logDiam   logVol
     4.4      2.0   1.48160   0.69315
     4.6      2.2   1.52606   0.78846
     5.0      3.0   1.60944   1.09861
     5.1      4.3   1.62924   1.45862
     5.1      3.0   1.62924   1.09861
     5.2      2.9   1.64866   1.06471
     5.2      3.5   1.64866   1.25276
     5.5      3.4   1.70475   1.22378
     5.5      5.0   1.70475   1.60944
     5.6      7.2   1.72277   1.97408
     5.9      6.4   1.77495   1.85630
     5.9      5.6   1.77495   1.72277
     7.5      7.7   2.01490   2.04122
     7.6     10.3   2.02815   2.33214
… and so on …
Transform predictor diameter to loge(diameter)
Transform response volume to loge(volume)
Example 3
Fitted line plot using transformed X and Y values
[Plot: logVol versus logDiam, with fitted line]
logVol = -2.87179 + 2.56442 logDiam
S = 0.170263   R-Sq = 97.4%   R-Sq(adj) = 97.3%
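A fitted line on the log-log scale corresponds to a power relationship on the original scale: exponentiating both sides of logVol = b0 + b1 logDiam gives Volume = exp(b0) × Diameter^b1. The sketch below uses the two coefficients from the slide; the 10-inch and 20-inch diameters are hypothetical values chosen only for illustration.

```python
import math

# Coefficients from the slide's fit of logVol on logDiam (natural logs).
B0, B1 = -2.87179, 2.56442

def predicted_volume(diameter_in):
    """Back-transform the log-log fit: Volume = exp(B0) * Diameter ** B1."""
    return math.exp(B0 + B1 * math.log(diameter_in))

multiplier = math.exp(B0)

# Hypothetical diameters, inside the 5 to 25 inch range shown on the plots.
vol_10 = predicted_volume(10.0)
vol_20 = predicted_volume(20.0)

print("Volume = %.4f * Diameter ** %.5f" % (multiplier, B1))
print("Predicted volume at 10 in: %.2f cu. ft." % vol_10)
print("Doubling the diameter multiplies predicted volume by %.2f" % (vol_20 / vol_10))
```

On this scale the slope is read multiplicatively: doubling the diameter multiplies the predicted volume by 2 raised to the slope, whatever the starting diameter.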
Example 3
Residual plot using transformed X and Y values
[Plot: standardized residuals versus fitted values; response is logVol]
Example 3
Normal probability plot using transformed X and Y values
[Plot: normal probability plot of the standardized residuals (SRES5)]
N: 70   Average: -0.0028401   StDev: 1.00930
W-test for Normality   R: 0.9896   P-Value (approx): > 0.1000
Transformation strategies
Effects of transformations
• Transforming the Y values corrects the problems with the error terms – and may simultaneously help non-linearity.
• Transforming the X values can only correct non-linearity.
Transformation strategies
• If form of the relationship between x and y is known, then it may be possible to find a linearizing transformation analytically.
• Fitting a regression model empirically generally requires trial and error – try different transformations to see which does best.
Transformation strategies
Finding a linearizing transformation analytically
Knowing functional relationship is of the power form
If the relationship between x and y is of the power form:
$$y = \alpha x^{\beta}$$
taking log of both sides transforms it into a linear form:
$$\log_e y = \log_e \alpha + \beta \log_e x$$
Knowing functional relationship is of the exponential form
If the relationship between x and y is of exponential form:
$$y = \alpha e^{\beta x}$$
taking log of both sides transforms it into a linear form:
$$\log_e y = \log_e \alpha + \beta x$$
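Both linearizations can be checked numerically. The sketch below generates noise-free data from a power curve y = a·x^b and an exponential curve y = a·e^(bx), then recovers a and b with a straight-line fit on the transformed scale. The parameter values, the x grid, and the helper `fit_line` are all hypothetical choices made for illustration.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical true parameters and predictor values.
A, B = 2.0, 1.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Power form: log y is linear in log x, with intercept log(A) and slope B.
y_power = [A * x ** B for x in xs]
b0, b1 = fit_line([math.log(x) for x in xs], [math.log(y) for y in y_power])
a_power, b_power = math.exp(b0), b1

# Exponential form: log y is linear in x itself, with intercept log(A) and slope B.
y_exp = [A * math.exp(B * x) for x in xs]
b0, b1 = fit_line(xs, [math.log(y) for y in y_exp])
a_exp, b_exp = math.exp(b0), b1

print("power form recovered:       a=%.4f, b=%.4f" % (a_power, b_power))
print("exponential form recovered: a=%.4f, b=%.4f" % (a_exp, b_exp))
```

The difference between the two cases is whether x is logged as well: the power form needs both variables transformed, the exponential form only the response.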
Transformation strategies
Finding a transformation by trial and error
Family of power transformations
The most common transformation involves transforming the response by taking it to some power λ. That is:
$$y^{*} = y^{\lambda}$$
Most commonly, for interpretation reasons, λ is a number between -1 and 2, such as -1, -0.5, 0, 0.5, (1), 1.5, and 2.
When λ = 0, the transformation is taken to be the log transformation. That is:
$$y^{*} = \log_e y$$
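Trial and error over the power family can be sketched as a loop: transform the response with each candidate λ, refit the line, and compare. The example below reuses the mammal data from Example 2; the function names are illustrative. R-Sq values computed on different response scales are only a rough guide, so the residual and normal probability plots remain the main check.

```python
import math

# Mammal data from Example 2 in the slides.
birthwgt = [2.75, 4.00, 0.48, 1.50, 0.37, 50.0, 30.0, 40.0, 40.0, 98.0, 113.0]
gestation = [155, 175, 190, 210, 213, 243, 340, 380, 390, 457, 670]

def power_transform(values, lam):
    """Return values ** lam, using the natural log when lam is 0 (values must be positive)."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [v ** lam for v in values]

def r_squared(x, y):
    """Squared correlation between x and y, i.e. R-Sq of the simple linear fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

LAMBDAS = [-1, -0.5, 0, 0.5, 1, 1.5, 2]
results = {lam: r_squared(birthwgt, power_transform(gestation, lam)) for lam in LAMBDAS}

for lam, r2 in results.items():
    print("lambda = %4s: R-Sq = %.1f%%" % (lam, 100 * r2))
```

The λ = 1 row is the untransformed fit and the λ = 0 row is the log fit; R-Sq does not depend on the base of the logarithm, so the natural log here matches the log10 fit in Example 2.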
Effect of loge transformation
[Plot: natural log function f(x) = loge(x), for x from 0 to 1000]
Effect of loge transformation
[Plot: natural log function f(x) = loge(x), for x from 0 to 5]
Some guidelines for specifying λ
• To make smaller values more spread out, use a smaller λ.
• To make larger values more spread out, use a larger λ.
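The two guidelines can be illustrated with three hypothetical values, 1, 10 and 100. For each λ the sketch reports what share of the transformed range is taken up by the gap between the two smaller values; a larger share means the small values are more spread out relative to the large ones.

```python
import math

def transform(value, lam):
    """Power transformation value ** lam, with the natural log standing in for lam = 0."""
    return math.log(value) if lam == 0 else value ** lam

# Hypothetical positive values spanning two orders of magnitude.
SMALL, MIDDLE, LARGE = 1.0, 10.0, 100.0

def lower_share(lam):
    """Fraction of the transformed range occupied by the gap between SMALL and MIDDLE."""
    lo, mid, hi = (transform(v, lam) for v in (SMALL, MIDDLE, LARGE))
    return (mid - lo) / (hi - lo)

lambdas = [-1, -0.5, 0, 0.5, 1, 2]
shares = [lower_share(lam) for lam in lambdas]

for lam, share in zip(lambdas, shares):
    print("lambda = %4s: lower gap share = %.3f" % (lam, share))
```

On the original scale (λ = 1) the lower gap is 9 out of 99; under the log transformation it is exactly half, and the share keeps shrinking as λ grows.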
Possible transformations
[Four diagrams of curved relationships between x and y, each with candidate transformations; only the transformation labels below are recoverable.]
• Diagram 1: x^2, x^3, log y, 1/y
• Diagram 2: y^2, y^3, log x, 1/x
• Diagram 3: log x, 1/x, log y, 1/y
• Diagram 4: x^2, x^3, y^2, y^3
Transformation strategies
Variance stabilizing transformations
Common variance stabilizing transformations
If the response is a Poisson count, so that the variance is proportional to the mean, use the square root transformation:
$$y^{*} = \sqrt{y} = y^{1/2}$$
If the response is a binomial proportion, use the arcsine square root transformation:
$$\hat{p}^{*} = \sin^{-1}\left(\sqrt{\hat{p}}\right)$$
Common variance stabilizing transformations
If the variance is proportional to the mean squared, use the natural log transformation:
$$y^{*} = \log_e y$$
If the variance is proportional to the mean to the fourth power, use the reciprocal transformation:
$$y^{*} = \frac{1}{y}$$
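The stabilizing effect can be seen in the binomial case without simulation, by summing over the exact binomial distribution. The sketch below computes the variance of the sample proportion and of its arcsine square root at two success probabilities; the sample size 50 and the probabilities 0.2 and 0.5 are hypothetical choices, and the function name is illustrative.

```python
import math

def binomial_variances(n, p):
    """Exact variance of p_hat = X/n and of asin(sqrt(p_hat)) for X ~ Binomial(n, p)."""
    pmf = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    raw = [k / n for k in range(n + 1)]
    transformed = [math.asin(math.sqrt(v)) for v in raw]

    def variance(values):
        mean = sum(w * v for w, v in zip(pmf, values))
        return sum(w * (v - mean) ** 2 for w, v in zip(pmf, values))

    return variance(raw), variance(transformed)

N = 50  # hypothetical number of trials per proportion
raw_02, trans_02 = binomial_variances(N, 0.2)
raw_05, trans_05 = binomial_variances(N, 0.5)

print("p = 0.2: var(p_hat) = %.5f, var(asin sqrt p_hat) = %.5f" % (raw_02, trans_02))
print("p = 0.5: var(p_hat) = %.5f, var(asin sqrt p_hat) = %.5f" % (raw_05, trans_05))
print("large-sample approximation 1/(4n) = %.5f" % (1 / (4 * N)))
```

The raw variance p(1 - p)/n changes with p, while the transformed variance stays close to 1/(4n) at both probabilities, which is the sense in which the transformation stabilizes the variance.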
Transforming data in Minitab
• Select Calc >> Calculator …
• In the box labeled “Store result in variable,” tell Minitab in which column (variable) you want the transformed data stored.
• Type (input) the expression for the desired transformation in the box labeled Expression. (Use the available functions.)
• Select OK. The data will appear in the column of the worksheet that you specified.