polynomial regression models possible models for when the response function is “curved”
Post on 23-Dec-2015
Polynomial regression models
Possible models for when the response function is “curved”
Uses of polynomial models
• When the true response function really is a polynomial function.
• (Very common!) When the true response function is unknown or complex, but a polynomial function approximates the true function well.
Example
• What is the impact of exercise on the human immune system?
• Is the amount of immunoglobin in blood (y) related, in a curved manner, to maximal oxygen uptake (x)?
[Figure: Scatter plot of immunoglobin (mg) versus maximal oxygen uptake (ml/kg)]
A quadratic polynomial regression function
$Y_i = \beta_0 + \beta_1 X_i + \beta_{11} X_i^2 + \varepsilon_i$
where:
• Yi = amount of immunoglobin in blood (mg)
• Xi = maximal oxygen uptake (ml/kg)
• εi = error term, with the typical assumptions about error terms (“INE”: independent, normally distributed, equal variances)
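As a rough illustration of how such a quadratic function is estimated, the sketch below fits the three coefficients by ordinary least squares. The x and y values are hypothetical (a noiseless quadratic chosen for the demonstration, not the immunoglobin data), and the variable names are assumptions of this sketch.

```python
import numpy as np

# Hypothetical predictor values and a noiseless quadratic response,
# used only to show the mechanics (not the immunoglobin data).
x = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
y = 2.0 + 3.0 * x - 0.5 * x**2

# Design matrix with columns 1, X, X^2.
design = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares estimates of beta_0, beta_1, beta_11.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b11 = coef
```

Because the response here has no error term, the estimates should reproduce the generating coefficients up to rounding; with real data they would only approximate the underlying function.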
Estimated quadratic function
[Figure: Regression plot of igg versus oxygen with the fitted quadratic curve]
igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2
S = 106.427 R-Sq = 93.8 % R-Sq(adj) = 93.3 %
Interpretation of the regression coefficients
• If 0 is a possible x value, then b0 is the predicted response at x = 0. Otherwise, b0 has no meaningful interpretation.
• b1 does not have a very helpful interpretation. It is the slope of the tangent line at x = 0.
• b2 indicates the up/down direction of curve
– b2 < 0 means curve is concave down
– b2 > 0 means curve is concave up
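The tangent-slope reading of the linear coefficient can be checked with a short sketch. It uses the coefficients of the estimated quadratic function reported in this transcript and the reported mean of oxygen (50.637); the function name `fitted_slope` is a hypothetical helper introduced here.

```python
# Coefficients copied from the estimated quadratic function in the transcript.
b0, b1, b11 = -1464.40, 88.3071, -0.536247

def fitted_slope(x):
    """Derivative of b0 + b1*x + b11*x**2 with respect to x."""
    return b1 + 2.0 * b11 * x

slope_at_zero = fitted_slope(0.0)      # equals the linear coefficient
slope_at_mean = fitted_slope(50.637)   # 50.637 is the reported mean of oxygen
concave_down = b11 < 0                 # sign of the quadratic coefficient
```

The slope at x = 0 is simply the linear coefficient, which is of little use when 0 is far from the observed oxygen values, whereas the slope at the predictor mean describes the curve where the data actually lie.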
The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq
Predictor      Coef   SE Coef      T      P    VIF
Constant    -1464.4     411.4  -3.56  0.001
oxygen        88.31     16.47   5.36  0.000   99.9
oxygensq    -0.5362    0.1582  -3.39  0.002   99.9
S = 106.4 R-Sq = 93.8% R-Sq(adj) = 93.3%
Analysis of Variance
Source           DF       SS       MS       F      P
Regression        2  4602211  2301105  203.16  0.000
Residual Error   27   305818    11327
Total            29  4908029
Source     DF   Seq SS
oxygen      1  4472047
oxygensq    1   130164
A multicollinearity problem
[Figure: Scatter plot of oxygensq versus oxygen]
Pearson correlation of oxygen and oxygensq = 0.995
“Center” the predictors
$\text{OxCent} = \text{Oxygen} - 50.637$
$\text{OxCentSq} = (\text{Oxygen} - 50.637)^2$
Mean of oxygen = 50.637
oxygen   oxcent  oxcentsq
  34.6  -16.037   257.185
  45.0   -5.637    31.776
  62.3   11.663   136.026
  58.9    8.263    68.277
  42.5   -8.137    66.211
  44.3   -6.337    40.158
  67.9   17.263   298.011
  58.5    7.863    61.827
  35.6  -15.037   226.111
  49.6   -1.037     1.075
  33.0  -17.637   311.064
Does it really work?
[Figure: Scatter plot of oxcentsq versus oxcent]
Pearson correlation of oxcent and oxcentsq = 0.219
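The effect of centering on the correlation between a predictor and its square can be sketched as follows. Only the 11 oxygen values printed in this transcript are used, and they are centered at their own mean rather than at the full-sample mean of 50.637, so the results will not match the reported 0.995 and 0.219 exactly. The helper name `corr_with_square` is an assumption of this sketch.

```python
import numpy as np

# The 11 oxygen values printed in the transcript (a subset of the full sample).
oxygen = np.array([34.6, 45.0, 62.3, 58.9, 42.5, 44.3,
                   67.9, 58.5, 35.6, 49.6, 33.0])

def corr_with_square(x):
    """Pearson correlation between x and x**2."""
    return np.corrcoef(x, x**2)[0, 1]

r_raw = corr_with_square(oxygen)

# Center at the mean of these 11 values (an assumption of this sketch;
# the transcript centers at the full-sample mean, 50.637).
oxcent = oxygen - oxygen.mean()
r_centered = corr_with_square(oxcent)
```

The raw predictor and its square should be almost perfectly correlated, while the centered versions should be only weakly correlated, which is what brings the variance inflation factors down.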
A better quadratic polynomial regression function
$Y_i = \beta_0^* + \beta_1^* x_i + \beta_{11}^* x_i^2 + \varepsilon_i$
where $x_i = X_i - \bar{X}$ denotes the centered predictor, and
β*0 = mean response at the predictor mean
β*1 = “linear effect coefficient”
β*11 = “quadratic effect coefficient”
The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq
Predictor      Coef   SE Coef      T      P    VIF
Constant    1632.20     29.35  55.61  0.000
oxcent       34.000     1.689  20.13  0.000    1.1
oxcentsq    -0.5362    0.1582  -3.39  0.002    1.1
S = 106.4 R-Sq = 93.8% R-Sq(adj) = 93.3%
Analysis of Variance
Source           DF       SS       MS       F      P
Regression        2  4602211  2301105  203.16  0.000
Residual Error   27   305818    11327
Total            29  4908029
Source     DF   Seq SS
oxcent      1  4472047
oxcentsq    1   130164
Interpretation of the regression coefficients
• b0 is predicted response at the predictor mean.
• b1 is the estimated slope of the tangent line at the predictor mean; it is typically also close to the estimated slope in the simple (straight-line) model.
• b2 indicates the up/down direction of curve
– b2 < 0 means curve is concave down
– b2 > 0 means curve is concave up
Estimated regression function
[Figure: Regression plot of igg versus oxcent with the fitted quadratic curve]
igg = 1632.20 + 33.9995 oxcent - 0.536247 oxcent**2
S = 106.427 R-Sq = 93.8 % R-Sq(adj) = 93.3 %
Similar estimates
[Figure: Regression plot of igg versus oxcent with the fitted straight line]
igg = 1557.63 + 32.7427 oxcent
S = 124.783 R-Sq = 91.1 % R-Sq(adj) = 90.8 %
The relationship between the two forms of the model
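Centered model: $\hat{Y}_i = b_0^* + b_1^* x_i + b_{11}^* x_i^2$
Original model: $\hat{Y}_i = b_0 + b_1 X_i + b_{11} X_i^2$
Where:
$b_0 = b_0^* - b_1^* \bar{X} + b_{11}^* \bar{X}^2$
$b_1 = b_1^* - 2 b_{11}^* \bar{X}$
$b_{11} = b_{11}^*$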
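Centered model: $\hat{Y}_i = 1632.2 + 34.0 x_i - 0.5362 x_i^2$
Mean of oxygen = 50.637
$b_0 = 1632.2 - 34.0(50.637) - 0.5362(50.637)^2 = -1464.3$
$b_1 = 34.0 - 2(-0.5362)(50.637) = 88.3$
$b_{11} = -0.5362$
Original model: $\hat{Y}_i = -1464.4 + 88.31 X_i - 0.536 X_i^2$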
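The conversion between the two forms can be sketched in a few lines. The function name `uncenter` is hypothetical; the inputs are the centered-model coefficients and the mean of oxygen reported in this transcript, and the formulas come from expanding the centered model in terms of the original predictor.

```python
def uncenter(b0_star, b1_star, b11_star, x_bar):
    """Convert centered-model coefficients to original-scale coefficients."""
    b0 = b0_star - b1_star * x_bar + b11_star * x_bar**2
    b1 = b1_star - 2.0 * b11_star * x_bar
    b11 = b11_star  # the quadratic coefficient is unchanged by centering
    return b0, b1, b11

# Centered estimates and predictor mean as reported in the transcript.
b0, b1, b11 = uncenter(1632.20, 33.9995, -0.536247, 50.637)
```

The results should agree with the original-scale estimates (about -1464.4 and 88.31) up to the rounding of the reported inputs.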
[Figure: Residuals versus the fitted values (response is igg)]
[Figure: Normal probability plot of the residuals (response is igg)]
What is predicted IgG if maximal oxygen uptake is 90?
There is an even greater danger in extrapolation when modeling data with a polynomial function, because of changes in direction.
Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
1        2139.6   219.2  (1689.8, 2589.5)  (1639.6, 2639.7) XX
X denotes a row with X values away from the center
XX denotes a row with very extreme X values
Values of Predictors for New Observations
New Obs  oxcent  oxcentsq
1          39.4      1549
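A small sketch can make the "change in direction" concrete. It uses the original-scale coefficients reported earlier in this transcript; the function name `predict_igg` is hypothetical. The turning point of a quadratic is where its derivative is zero.

```python
# Original-scale coefficients copied from the transcript.
b0, b1, b11 = -1464.40, 88.3071, -0.536247

def predict_igg(oxygen):
    """Evaluate the fitted quadratic at a given maximal oxygen uptake."""
    return b0 + b1 * oxygen + b11 * oxygen**2

fit_at_90 = predict_igg(90.0)

# The fitted curve changes direction where b1 + 2*b11*x = 0.
turning_point = -b1 / (2.0 * b11)
```

The fit at 90 should reproduce the reported value of about 2139.6. The turning point falls between 82 and 83 ml/kg, above every oxygen value listed in the transcript (the largest is 67.9), so a prediction at 90 sits on a downward-bending part of the curve that no observation supports.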
It is possible to “overfit” the data with polynomial models.
[Figure: Regression plot of y versus x with a fitted cubic curve]
y = -38.4 + 34.9762 x - 8.64286 x**2 + 0.666667 x**3
S = 2.62950 R-Sq = 64.0 % R-Sq(adj) = 0.0 %
It is even theoretically possible to fit the data perfectly.
If you have n data points with distinct x values, then a polynomial of order n-1 will fit the data perfectly; that is, it will pass through each data point.
But good statistical software will keep an unsuspecting user from fitting such a model:
** Error ** Not enough non-missing observations to fit a polynomial of this order; execution aborted
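The perfect-fit claim can be checked numerically. The sketch below uses five hypothetical data points (invented for this illustration) and solves for the order-4 polynomial directly, so that no software safeguard stands in the way.

```python
import numpy as np

# Five hypothetical data points with distinct x values.
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 7.0, 2.0, 8.0, 4.0])
n = len(x)

# Square Vandermonde matrix with columns x^0, x^1, ..., x^(n-1).
vander = np.vander(x, N=n, increasing=True)

# n equations in n unknowns: the order n-1 polynomial through every point.
coef = np.linalg.solve(vander, y)
fitted = vander @ coef
max_abs_residual = np.max(np.abs(fitted - y))
```

The residuals are zero up to floating-point rounding, leaving no degrees of freedom to estimate the error variance, which is why such a fit says nothing about how well the curve would describe new data.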
The hierarchical approach to model fitting
A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.
$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i$
Is a first-order linear model (“line”) adequate?
$H_0: \beta_{11} = \beta_{111} = 0$
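A hypothesis of this kind is commonly assessed with a general linear (partial) F statistic that compares the error sums of squares of the reduced and full models. The sketch below is one way to compute it; since the transcript reports no cubic fit, it is applied to the simpler question of whether the quadratic term is needed in the immunoglobin example, using the sums of squares from that analysis of variance output. The function name `partial_f` is hypothetical.

```python
def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """General linear F statistic comparing a reduced model to a full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

# From the immunoglobin output: SSE of the quadratic model is 305818 (27 df),
# and the quadratic term's sequential SS is 130164, so the straight-line
# model has SSE = 305818 + 130164 with 28 df.
sse_full = 305818.0
sse_reduced = sse_full + 130164.0
f_stat = partial_f(sse_reduced, 28, sse_full, 27)
```

With a single term under test, the F statistic equals the square of that term's t statistic (-3.39 in the output), which gives a quick consistency check.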
The hierarchical approach to model fitting
But then … if a polynomial term of a given order is retained, then all related lower-order terms are also retained.
That is, if a quadratic term was significant, you would use this regression function:
$E(Y_i) = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2$
and not this one:
$E(Y_i) = \beta_0 + \beta_{11} x_i^2$
Example
• Quality of a product (y) – a score between 0 and 100
• Temperature (x1) – degrees Fahrenheit
• Pressure (x2) – pounds per square inch
[Figure: Matrix plot of quality, temp, and pressure]
A two-predictor, second-order polynomial regression function
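$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_{11} X_{i1}^2 + \beta_{22} X_{i2}^2 + \beta_{12} X_{i1} X_{i2} + \varepsilon_i$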
where:
• Yi = quality
• Xi1 = temperature
• Xi2 = pressure
• β12 = “interaction effect coefficient”
The regression equation is
quality = - 5128 + 31.1 temp + 140 pressure - 0.133 tempsq - 1.14 presssq - 0.145 tp
Predictor       Coef    SE Coef       T      P     VIF
Constant     -5127.9      110.3  -46.49  0.000
temp          31.096      1.344   23.13  0.000  1154.5
pressure     139.747      3.140   44.50  0.000  1574.5
tempsq     -0.133389   0.006853  -19.46  0.000   973.0
presssq     -1.14422    0.02741  -41.74  0.000  1453.0
tp         -0.145500   0.009692  -15.01  0.000   304.0
S = 1.679 R-Sq = 99.3% R-Sq(adj) = 99.1%
Again, some correlation
          quality    temp  pressure  tempsq  presssq
temp       -0.423
pressure    0.182   0.000
tempsq     -0.434   0.999     0.000
presssq     0.162   0.000     1.000  -0.000
tp         -0.227   0.773     0.632   0.772    0.632
Cell Contents: Pearson correlation
A better two-predictor, second-order polynomial regression function
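$Y_i = \beta_0^* + \beta_1^* x_{i1} + \beta_2^* x_{i2} + \beta_{11}^* x_{i1}^2 + \beta_{22}^* x_{i2}^2 + \beta_{12}^* x_{i1} x_{i2} + \varepsilon_i$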
where:
• Yi = quality
• xi1 = centered temperature
• xi2 = centered pressure
• β*12 = “interaction effect coefficient”
Reduced correlation
          quality   tcent   pcent  tpcent  tcentsq
tcent      -0.423
pcent       0.182   0.000
tpcent     -0.274   0.000   0.000
tcentsq    -0.355  -0.000   0.000   0.000
pcentsq    -0.762   0.000   0.000   0.000   -0.000
Cell Contents: Pearson correlation
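The near-zero correlations among the centered terms are what one would expect from a balanced grid of settings. The sketch below builds the six columns of the centered second-order model and checks a few correlations. The grid of temperatures and pressures is a hypothetical assumption, since the transcript does not list the actual design points.

```python
import numpy as np

# Hypothetical balanced grid of settings (an assumption of this sketch;
# the transcript does not list the actual temperature and pressure values).
temp, pressure = np.meshgrid([80.0, 90.0, 100.0], [50.0, 55.0, 60.0])
temp, pressure = temp.ravel(), pressure.ravel()

tcent = temp - temp.mean()
pcent = pressure - pressure.mean()

# Columns of the centered second-order design matrix:
# 1, x1, x2, x1^2, x2^2, x1*x2.
design = np.column_stack([np.ones_like(tcent), tcent, pcent,
                          tcent**2, pcent**2, tcent * pcent])

r_raw = np.corrcoef(temp, temp**2)[0, 1]               # uncentered
r_centered = np.corrcoef(tcent, tcent**2)[0, 1]        # centered
r_interaction = np.corrcoef(tcent, tcent * pcent)[0, 1]
```

On a symmetric grid like this one, the centered linear, quadratic, and interaction columns are uncorrelated with one another, whereas the uncentered temperature and its square are almost perfectly correlated.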
The regression equation is
quality = 94.9 - 0.916 tcent + 0.788 pcent - 0.146 tpcent - 0.133 tcentsq - 1.14 pcentsq
Predictor       Coef    SE Coef       T      P   VIF
Constant     94.9259     0.7224  131.40  0.000
tcent       -0.91611    0.03957  -23.15  0.000   1.0
pcent        0.78778    0.07913    9.95  0.000   1.0
tpcent     -0.145500   0.009692  -15.01  0.000   1.0
tcentsq    -0.133389   0.006853  -19.46  0.000   1.0
pcentsq     -1.14422    0.02741  -41.74  0.000   1.0
S = 1.679 R-Sq = 99.3% R-Sq(adj) = 99.1%
[Figure: Residuals versus the fitted values (response is quality)]
[Figure: Normal probability plot of the residuals (response is quality)]
Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
1        94.926   0.722  (93.424, 96.428)  (91.125, 98.726)
Values of Predictors for New Observations
New Obs   tcent   pcent  tpcent  tcentsq  pcentsq
1        0.0000  0.0000  0.0000   0.0000   0.0000
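The fitted value at the center of the design can be reproduced by evaluating the centered regression equation directly. The sketch below uses the coefficients from the output in this transcript; the function name `predict_quality` is hypothetical, and inputs must already be centered because the transcript does not report the temperature and pressure means.

```python
# Coefficients copied from the centered two-predictor output in the transcript.
coef = {"const": 94.9259, "tcent": -0.91611, "pcent": 0.78778,
        "tpcent": -0.145500, "tcentsq": -0.133389, "pcentsq": -1.14422}

def predict_quality(tcent, pcent):
    """Evaluate the fitted second-order function at centered predictor values."""
    return (coef["const"]
            + coef["tcent"] * tcent
            + coef["pcent"] * pcent
            + coef["tpcent"] * tcent * pcent
            + coef["tcentsq"] * tcent**2
            + coef["pcentsq"] * pcent**2)

# At the predictor means every centered term is zero.
fit_at_center = predict_quality(0.0, 0.0)
```

With both centered predictors at zero, every term except the constant drops out, so the fit equals the intercept, matching the reported 94.926.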