
Polynomial regression models

Possible models for when the response function is “curved”

Uses of polynomial models

• When the true response function really is a polynomial function.

• (Very common!) When the true response function is unknown or complex, but a polynomial function approximates the true function well.

Example

• What is the impact of exercise on the human immune system?

• Is the amount of immunoglobin in the blood (y) related to maximal oxygen uptake (x) in a curved manner?

[Scatter plot: immunoglobin (mg) versus maximal oxygen uptake (ml/kg)]

A quadratic polynomial regression function

Yi = β0 + β1 Xi + β11 Xi² + εi

where:

• Yi = amount of immunoglobin in blood (mg)

• Xi = maximal oxygen uptake (ml/kg)

• the error terms εi satisfy the typical “INE” assumptions (independent, normally distributed, with equal variances); see the fitting sketch below
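(A hedged aside, not part of the original Minitab analysis.) A minimal sketch of fitting such a quadratic function in Python with statsmodels; the oxygen and igg arrays below are hypothetical stand-ins for the actual 30 observations.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in data; replace with the actual maximal oxygen uptake
# (ml/kg) and immunoglobin (mg) measurements.
oxygen = np.array([34.6, 45.0, 62.3, 58.9, 42.5, 44.3, 67.9, 58.5, 35.6, 49.6, 33.0])
igg    = np.array([1180, 1510, 1890, 1850, 1440, 1500, 1950, 1840, 1230, 1660, 1140])

# Design matrix: intercept, linear term, and squared term
X = sm.add_constant(np.column_stack([oxygen, oxygen**2]))

fit = sm.OLS(igg, X).fit()
print(fit.summary())   # coefficients, standard errors, t-tests, R-Sq
```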

Estimated quadratic function

[Regression plot: igg versus oxygen, with the fitted quadratic curve]

igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2
S = 106.427   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Interpretation of the regression coefficients

• If 0 is a possible x value, then b0 is the predicted response at x = 0. Otherwise, b0 has no meaningful interpretation.

• b1 does not have a very helpful interpretation. It is the slope of the tangent line at x = 0.

• b2 indicates the up/down direction of curve

– b2 < 0 means curve is concave down

– b2 > 0 means curve is concave up

The regression equation is
igg = -1464 + 88.3 oxygen - 0.536 oxygensq

Predictor    Coef      SE Coef    T        P       VIF
Constant    -1464.4    411.4     -3.56     0.001
oxygen       88.31     16.47      5.36     0.000   99.9
oxygensq    -0.5362    0.1582    -3.39     0.002   99.9

S = 106.4 R-Sq = 93.8% R-Sq(adj) = 93.3%

Analysis of Variance

Source           DF   SS        MS        F        P
Regression        2   4602211   2301105   203.16   0.000
Residual Error   27   305818    11327
Total            29   4908029

Source      DF   Seq SS
oxygen       1   4472047
oxygensq     1   130164

A multicollinearity problem

[Scatter plot: oxygensq versus oxygen]

Pearson correlation of oxygen and oxygensq = 0.995

“Center” the predictors

OxCent = Oxygen - 50.637

OxCentSq = (Oxygen - 50.637)²

Mean of oxygen = 50.637

oxygen   oxcent     oxcentsq
34.6    -16.037     257.185
45.0     -5.637      31.776
62.3     11.663     136.026
58.9      8.263      68.277
42.5     -8.137      66.211
44.3     -6.337      40.158
67.9     17.263     298.011
58.5      7.863      61.827
35.6    -15.037     226.111
49.6     -1.037       1.075
33.0    -17.637     311.064
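A brief sketch (in numpy) of the centering computation itself, using the oxygen values from the partial listing above and the reported sample mean of 50.637:

```python
import numpy as np

oxygen = np.array([34.6, 45.0, 62.3, 58.9, 42.5, 44.3,
                   67.9, 58.5, 35.6, 49.6, 33.0])   # partial listing from the slide

oxcent   = oxygen - 50.637   # center at the sample mean of the full data set
oxcentsq = oxcent**2         # square of the centered predictor

print(np.column_stack([oxygen, oxcent, oxcentsq]))   # reproduces the columns above
```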

Does it really work?

[Scatter plot: oxcentsq versus oxcent]

Pearson correlation of oxcent and oxcentsq = 0.219

A better quadratic polynomial regression function

Yi = β*0 + β*1 xi + β*11 xi² + εi

where xi = Xi - X̄ denotes the centered predictor, and

β*0 = mean response at the predictor mean

β*1 = “linear effect coefficient”

β*11 = “quadratic effect coefficient”

The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor    Coef       SE Coef    T        P       VIF
Constant     1632.20    29.35      55.61    0.000
oxcent       34.000     1.689      20.13    0.000   1.1
oxcentsq    -0.5362     0.1582    -3.39     0.002   1.1

S = 106.4 R-Sq = 93.8% R-Sq(adj) = 93.3%

Analysis of Variance

Source           DF   SS        MS        F        P
Regression        2   4602211   2301105   203.16   0.000
Residual Error   27   305818    11327
Total            29   4908029

Source      DF   Seq SS
oxcent       1   4472047
oxcentsq     1   130164
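The drop in the VIF column (roughly 99.9 before centering versus 1.1 after) can be checked directly; the sketch below uses statsmodels' variance_inflation_factor on simulated placeholder predictor values, so the exact numbers will differ from the output above.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
oxygen = rng.uniform(30, 70, size=30)   # placeholder predictor values

def quadratic_vifs(x):
    """VIFs for the linear and squared terms of a quadratic design matrix."""
    X = sm.add_constant(np.column_stack([x, x**2]))
    return [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]

print(quadratic_vifs(oxygen))                   # large VIFs for the raw predictor and its square
print(quadratic_vifs(oxygen - oxygen.mean()))   # VIFs near 1 after centering
```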

Interpretation of the regression coefficients

• b0 is the predicted response at the predictor mean.

• b1 is the estimated slope of the tangent line at the predictor mean; typically, it is also close to the estimated slope in the simple linear model.

• b2 indicates the up/down direction of curve

– b2 < 0 means curve is concave down

– b2 > 0 means curve is concave up

Estimated regression function

[Regression plot: igg versus oxcent, with the fitted quadratic curve]

igg = 1632.20 + 33.9995 oxcent - 0.536247 oxcent**2
S = 106.427   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Similar estimates

[Regression plot: igg versus oxcent, with the fitted straight line]

igg = 1557.63 + 32.7427 oxcent
S = 124.783   R-Sq = 91.1%   R-Sq(adj) = 90.8%

The relationship between the two forms of the model

Centered model:  Ŷi = b*0 + b*1 xi + b*11 xi²

Original model:  Ŷi = b0 + b1 Xi + b11 Xi²

Where:

b0 = b*0 - b*1 X̄ + b*11 X̄²
b1 = b*1 - 2 b*11 X̄
b11 = b*11

For this example, with mean of oxygen X̄ = 50.637:

Ŷi = 1632.2 + 34.0 xi - 0.5362 xi²

b0 = 1632.2 - 34.0(50.637) - 0.5362(50.637)² = -1464.3
b1 = 34.0 - 2(-0.5362)(50.637) = 88.3
b11 = -0.5362

Ŷi = -1464.4 + 88.31 Xi - 0.536 Xi²
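A quick numeric check of these relationships (only an illustration of the algebra, using the centered estimates quoted above):

```python
# Convert centered-model estimates (b0*, b1*, b11*) back to the
# original parameterization using the relationships above.
def uncenter(b0_star, b1_star, b11_star, xbar):
    b0  = b0_star - b1_star * xbar + b11_star * xbar**2
    b1  = b1_star - 2 * b11_star * xbar
    b11 = b11_star
    return b0, b1, b11

print(uncenter(1632.2, 34.0, -0.5362, 50.637))
# roughly (-1464.3, 88.3, -0.5362), matching the original-model fit
```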

[Residuals versus the fitted values (response is igg)]

[Normal probability plot of the residuals (response is igg)]

What is predicted IgG if maximal oxygen uptake is 90?

There is an even greater danger in extrapolation when modeling data with a polynomial function, because the fitted curve can change direction outside the range of the data.

Predicted Values for New Observations

New Obs   Fit      SE Fit   95.0% CI            95.0% PI
1         2139.6   219.2    (1689.8, 2589.5)    (1639.6, 2639.7)  XX

X  denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations

New Obs   oxcent   oxcentsq
1         39.4     1549
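Where that fit of 2139.6 comes from can be checked by hand with the centered fitted equation (a sketch; the coefficients are taken from the output above):

```python
# Predicted igg at a maximal oxygen uptake of 90, using the centered fit
oxygen_new = 90.0
xbar = 50.637                     # mean of oxygen

oxcent   = oxygen_new - xbar      # 39.363, the 39.4 reported above
oxcentsq = oxcent**2              # about 1549

igg_hat = 1632.20 + 33.9995 * oxcent - 0.536247 * oxcentsq
print(igg_hat)   # roughly 2139.6; note oxygen = 90 is far outside the observed 30-70 range
```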

It is possible to “overfit” the data with polynomial models.

[Regression plot: y versus x, with the fitted cubic curve]

y = -38.4 + 34.9762 x - 8.64286 x**2 + 0.666667 x**3
S = 2.62950   R-Sq = 64.0%   R-Sq(adj) = 0.0%

It is even theoretically possible to fit the data perfectly.

If you have n data points, then a polynomial of order n-1 will fit the data perfectly, that is, it will pass through each data point.

** Error ** Not enough non-missing observations to fit a polynomial of this order; execution aborted

But good statistical software will keep an unsuspecting user from fitting such a model.
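As an illustration outside Minitab (with made-up points), numpy will happily interpolate n data points with a polynomial of order n-1:

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])    # n = 5 hypothetical data points
y = np.array([3.1, 7.4, 4.9, 6.2, 8.0])

coef = np.polyfit(x, y, deg=len(x) - 1)     # polynomial of order n - 1
print(np.allclose(np.polyval(coef, x), y))  # True: the curve passes through every point
```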

The hierarchical approach to model fitting

A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.

Yi = β0 + β1 xi + β11 xi² + β111 xi³ + εi

Is a first-order linear model (“line”) adequate?

H0: β11 = β111 = 0
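One hedged way to carry out this test in Python is a general linear F-test comparing the full third-order fit to the reduced first-order fit; statsmodels' anova_lm does the bookkeeping. The data below are simulated placeholders, not the IgG data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
x = rng.uniform(30, 70, size=30)                 # placeholder predictor
y = 1600 + 30*(x - 50) - 0.5*(x - 50)**2 + rng.normal(0, 100, size=30)  # hypothetical response

df = pd.DataFrame({"x": x - x.mean(), "y": y})   # centered predictor

reduced = smf.ols("y ~ x", data=df).fit()                       # first-order model
full    = smf.ols("y ~ x + I(x**2) + I(x**3)", data=df).fit()   # third-order model

print(anova_lm(reduced, full))   # partial F-test of H0: beta11 = beta111 = 0
```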

The hierarchical approach to model fitting

But then … if a polynomial term of a given order is retained, then all related lower-order terms are also retained.

That is, if a quadratic term was significant, you would use this regression function:

E(Yi) = β0 + β1 xi + β11 xi²

and not this one:

E(Yi) = β0 + β11 xi²

Example

• Quality of a product (y) – a score between 0 and 100

• Temperature (x1) – degrees Fahrenheit

• Pressure (x2) – pounds per square inch

[Matrix of scatter plots: quality, temp, pressure]

A two-predictor, second-order polynomial regression function

Yi = β0 + β1 Xi1 + β2 Xi2 + β11 Xi1² + β22 Xi2² + β12 Xi1 Xi2 + εi

where:

• Yi = quality

• Xi1 = temperature

• Xi2 = pressure

• β12 = “interaction effect coefficient”

The regression equation is
quality = -5128 + 31.1 temp + 140 pressure - 0.133 tempsq - 1.14 presssq - 0.145 tp

Predictor    Coef         SE Coef     T         P       VIF
Constant     -5127.9      110.3       -46.49    0.000
temp          31.096      1.344        23.13    0.000   1154.5
pressure      139.747     3.140        44.50    0.000   1574.5
tempsq       -0.133389    0.006853    -19.46    0.000   973.0
presssq      -1.14422     0.02741     -41.74    0.000   1453.0
tp           -0.145500    0.009692    -15.01    0.000   304.0

S = 1.679 R-Sq = 99.3% R-Sq(adj) = 99.1%
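A hedged sketch of fitting this two-predictor, second-order regression function in Python with statsmodels formulas; the temp, pressure, and quality values are hypothetical stand-ins for the experiment's data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data; replace with the actual quality experiment.
df = pd.DataFrame({
    "temp":     [85, 85, 90, 90, 95, 95, 85, 90, 95],
    "pressure": [52.5, 57.5, 52.5, 57.5, 52.5, 57.5, 55.0, 55.0, 55.0],
    "quality":  [80, 85, 90, 92, 88, 84, 83, 94, 86],
})

# Second-order model in the raw predictors, including the interaction term
fit = smf.ols("quality ~ temp + pressure + I(temp**2) + I(pressure**2) + temp:pressure",
              data=df).fit()
print(fit.summary())
```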

Again, some correlation

           quality   temp     pressure   tempsq   presssq
temp       -0.423
pressure    0.182    0.000
tempsq     -0.434    0.999    0.000
presssq     0.162    0.000    1.000     -0.000
tp         -0.227    0.773    0.632      0.772    0.632

Cell Contents: Pearson correlation

A better two-predictor, second-order polynomial regression function

Yi = β*0 + β*1 xi1 + β*2 xi2 + β*11 xi1² + β*22 xi2² + β*12 xi1 xi2 + εi

where:

• Yi = quality

• xi1 = centered temperature

• xi2 = centered pressure

• β*12 = “interaction effect coefficient”

Reduced correlation

          quality   tcent    pcent    tpcent   tcentsq
tcent     -0.423
pcent      0.182    0.000
tpcent    -0.274    0.000    0.000
tcentsq   -0.355   -0.000    0.000    0.000
pcentsq   -0.762    0.000    0.000    0.000   -0.000

Cell Contents: Pearson correlation

The regression equation is
quality = 94.9 - 0.916 tcent + 0.788 pcent - 0.146 tpcent - 0.133 tcentsq - 1.14 pcentsq

Predictor    Coef         SE Coef     T         P       VIF
Constant      94.9259     0.7224      131.40    0.000
tcent        -0.91611     0.03957     -23.15    0.000   1.0
pcent         0.78778     0.07913       9.95    0.000   1.0
tpcent       -0.145500    0.009692    -15.01    0.000   1.0
tcentsq      -0.133389    0.006853    -19.46    0.000   1.0
pcentsq      -1.14422     0.02741     -41.74    0.000   1.0

S = 1.679 R-Sq = 99.3% R-Sq(adj) = 99.1%

[Residuals versus the fitted values (response is quality)]

[Normal probability plot of the residuals (response is quality)]

Predicted Values for New Observations

New Obs   Fit      SE Fit   95.0% CI            95.0% PI
1         94.926   0.722    (93.424, 96.428)    (91.125, 98.726)

Values of Predictors for New Observations

New Obs   tcent    pcent    tpcent   tcentsq   pcentsq
1         0.0000   0.0000   0.0000   0.0000    0.0000
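At the center of the predictor space all centered terms are zero, so the fit is just the intercept (94.926). A sketch of obtaining a comparable CI and PI from a statsmodels fit (again with hypothetical data, so the numbers will differ from the output above):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical centered data, as in the earlier sketch.
df = pd.DataFrame({
    "tcent":   [-5, -5, 0, 0, 5, 5, -5, 0, 5],
    "pcent":   [-2.5, 2.5, -2.5, 2.5, -2.5, 2.5, 0.0, 0.0, 0.0],
    "quality": [80, 85, 90, 92, 88, 84, 83, 94, 86],
})

fit = smf.ols("quality ~ tcent + pcent + I(tcent**2) + I(pcent**2) + tcent:pcent",
              data=df).fit()

# Prediction at the center of the design: all centered predictors equal 0
new = pd.DataFrame({"tcent": [0.0], "pcent": [0.0]})
print(fit.get_prediction(new).summary_frame(alpha=0.05))  # fit, 95% CI (mean_ci_*), 95% PI (obs_ci_*)
```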
