Spline Regression Models
• Things to Keep in Mind when Fitting Polynomial Regression Models
• Piecewise Polynomials (Splines)
• Applying Variable Selection Methods to Choose Knots
Important Considerations when Fitting Polynomials to Data
• Example (paper):
• How does the speed of a paper mill machine affect the quality of the finished product?
• Measurements of the amount of green liquor produced are recorded for various speeds.
Paper Mill Example (Cont’d)
Data:
> paper
green.liquor machine.speed
1 16.0 1700
2 15.8 1720
3 15.6 1730
4 15.5 1740
5 14.8 1750
6 14.0 1760
7 13.5 1770
8 13.0 1780
9 12.0 1790
10 11.0 1795
Example (Cont’d)
> attach(paper)
> plot(green.liquor ~ machine.speed) # paperplot.pdf
[Figure: scatterplot of green.liquor vs machine.speed]
What happens if we use a linear model?
Fit the model and check the residual plot:
> paper.lm <- lm(green.liquor ~ machine.speed)
> plot(paper.lm, which=1, pch=16) # paperres.pdf
[Figure: Residuals vs Fitted plot for lm(green.liquor ~ machine.speed)]
Try a quadratic model
> paper.lm2 <- lm(green.liquor ~ machine.speed +
    I(machine.speed^2))
> summary(paper.lm2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.709e+03 2.448e+02 -6.984 0.000215
machine.speed 2.023e+00 2.798e-01 7.230 0.000173
I(machine.speed^2) -5.929e-04 7.994e-05 -7.417 0.000147
Residual standard error: 0.2101 on 7 degrees of freedom
• Fitted model:

ŷ = −1709 + 2.02x − .00059x²

(x = machine speed, y = amount of green liquor); estimated error st. dev. = .21.
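The slides fit this model with R's lm(); as an illustrative cross-check (a hedged sketch, not part of the original analysis), the same least-squares quadratic can be reproduced with numpy's polyfit:

```python
import numpy as np

# paper mill data from the slides
speed = np.array([1700, 1720, 1730, 1740, 1750,
                  1760, 1770, 1780, 1790, 1795], dtype=float)
gl = np.array([16.0, 15.8, 15.6, 15.5, 14.8,
               14.0, 13.5, 13.0, 12.0, 11.0])

# degree-2 least-squares fit; coefficients returned highest power first
c2, c1, c0 = np.polyfit(speed, gl, 2)

resid = gl - (c0 + c1 * speed + c2 * speed**2)
sigma_hat = np.sqrt((resid**2).sum() / (len(gl) - 3))  # residual std. error
print(c0, c1, c2, sigma_hat)
```

The recovered coefficients match the R summary (−1709, 2.023, −5.929e−04) and the residual standard error 0.2101, since polyfit solves the same least-squares problem as lm.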
Is this model satisfactory?
Check the residual plot again:
> plot(paper.lm2, which=1, pch=16) # paper2res.pdf
[Figure: Residuals vs Fitted plot for lm(green.liquor ~ machine.speed + I(machine.speed^2))]
Overlaying the Data with the Fitted Curve
> plot(green.liquor ~ machine.speed)
> quadline(paper.lm2) # paper2plot.pdf
[Figure: Quadratic Fitted to Paper Data — green.liquor vs machine.speed with the fitted curve]
Are there any Influential Observations?
Check Cook’s Distance:
> plot(paper.lm2, which=4, pch=16) # paper2cook.pdf
[Figure: Cook's distance plot for the quadratic model; observation 10 stands out]
Check observation 10 more closely:
Influence on the coefficients and on the fitted values:
> dfbetas(paper.lm2)[10,]
(Intercept) machine.speed I(machine.speed^2)
-1.383708 1.395116 -1.406660
> dffits(paper.lm2)[10]
10
-2.258152
Observation 10 is highly influential.
Important Considerations when Fitting Polynomials to Data
• Order of the Model
Keep this as low as possible – parsimony
• Example: titanium heat data - 49 observations on g and temperature.
Quadratic fit:
> attach(titanium)
> titanium.lm2 <- lm(g ~ poly(temperature,2))
> plot(titanium, pch=16)
> lines(spline(temperature,predict(titanium.lm2)),
col=4, lwd=2)
Titanium Example (Cont’d): A Failure for the Quadratic Polynomial Model
A pretty miserable fit:
[Figure: quadratic polynomial fit to the titanium data — g vs temperature; the curve misses the sharp peak]
Does a higher order polynomial fit better?
• 5th order fit:
> titanium.lm5 <- lm(g ~ poly(temperature,5))
> plot(titanium, pch=16)
> lines(spline(temperature,predict(titanium.lm5)),
col=4, lwd=2)
Titanium Heat Data Example (Cont’d): Attempting to Fit Using High Degree Polynomials
• 5th order model:
[Figure: 5th order polynomial fit to the titanium data — g vs temperature]
Example (Cont’d)
• 21st order fit:
> titanium.lm21 <- lm(g ~ poly(temperature,21))
> plot(titanium, pch=16)
> lines(spline(temperature,predict(titanium.lm21)),
col=4, lwd=2)
Titanium Heat Data Example (Cont’d): High Degree Polynomial Regression is Futile
• 21st order model:
[Figure: 21st order polynomial fit to the titanium data — g vs temperature]
Better to use piecewise polynomials (splines)
Important Considerations (Cont’d)
• Ill-Conditioning
• Example: Consider the Hilbert matrix Hp: hij = 1/(i + j − 1), i, j = 1, . . . , p:
hilbert <- function(n=2){
matrix(1/(rep(seq(1,n),n)+
rep(seq(0,n-1),rep(n,n))),ncol=n)
}
> hilbert(2)
[,1] [,2]
[1,] 1.0 0.500
[2,] 0.5 0.333
Ill-Conditioning (Cont’d)
> hilbert(4)
[,1] [,2] [,3] [,4]
[1,] 1.000 0.500 0.333 0.250
[2,] 0.500 0.333 0.250 0.200
[3,] 0.333 0.250 0.200 0.167
[4,] 0.250 0.200 0.167 0.143
Ill-Conditioning (Cont’d)
• The Hilbert matrix is famous for being ill-conditioned.
• The ratio of the largest eigenvalue to the smallest eigenvalue is very large.
• Matrix inversion is unstable, since the smallest eigenvalue is numerically indistinguishable from 0.
• The inverse of the 2 × 2 Hilbert matrix:
solve(hilbert(2))
[,1] [,2]
[1,] 4 -6
[2,] -6 12
Ill-Conditioned Matrices (Cont’d)
• Try to invert the 7 × 7 Hilbert matrix:
solve(hilbert(7))
Error in solve.default(hilbert(7)) : singular
matrix ‘a’ in solve
• From a numerical point of view, large Hilbert matrices appear singular.
• The determinant of a matrix is equal to the product of the eigenvalues; since every eigenvalue of a Hilbert matrix is strictly positive (the matrix is positive definite), the determinant is nonzero, so all Hilbert matrices are invertible (nonsingular) in exact arithmetic.
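These claims are easy to quantify. A sketch in Python/numpy (mirroring the R hilbert() function above): the condition number — the ratio of the largest to the smallest singular value — grows explosively with n:

```python
import numpy as np

def hilbert(n):
    # h_ij = 1/(i + j - 1), with 1-based indices i, j
    i, j = np.indices((n, n))
    return 1.0 / (i + j + 1)

conds = {n: np.linalg.cond(hilbert(n)) for n in (2, 4, 7)}
for n, c in conds.items():
    print(n, c)
# hilbert(7) is already conditioned around 1e8, which is why
# solve() treats it as numerically singular
```

By n around 13 the smallest singular value falls below double-precision resolution, matching the "appears singular" behaviour described above.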
Ill-Conditioned Matrices (Cont’d)
• The inverse of the Hilbert matrix can be computed using the following function:

inverse.hilbert <- function(n){
  Hinv <- matrix(0, nrow=n, ncol=n)
  for (i in 1:n){
    for (j in 1:n){
      Hinv[i,j] <- (-1)^(i+j)*(i+j-1)*
        choose(n+i-1,n-j)*choose(n+j-1,n-i)*choose(i+j-2,i-1)^2
    }
  }
  Hinv
}
> inverse.hilbert(7)
       [,1]     [,2]      [,3]      [,4]       [,5]       [,6]      [,7]
[1,]     49    -1176      8820    -29400      48510     -38808     12012
[2,]  -1176    37632   -317520   1128960   -1940400    1596672   -504504
[3,]   8820  -317520   2857680 -10584000   18711000  -15717240   5045040
[4,] -29400  1128960 -10584000  40320000  -72765000   62092800 -20180160
[5,]  48510 -1940400  18711000 -72765000  133402500 -115259760  37837800
[6,] -38808  1596672 -15717240  62092800 -115259760  100590336 -33297264
[7,]  12012  -504504   5045040 -20180160   37837800  -33297264  11099088
Example (Cont’d)
• We can check that this is correct by multiplying with the Hilbert matrix of size 7:

> sum(hilbert(7)[1,]*inverse.hilbert(7)[1,])
[1] 1
> sum(hilbert(7)[1,]*inverse.hilbert(7)[2,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[3,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[4,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[5,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[6,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[7,])
[1] 0
Is the Hilbert Matrix just an Artificial Example?
• Consider the following regression problem:

y = β0 + β1x + · · · + βpxᵖ + ε

where the x values have been taken equally spaced on the interval from 0 to 1: e.g. if n = 10, x = .1, .2, . . . , 1.

• As n → ∞, (1/n)XᵀX converges to the (p+1) × (p+1) Hilbert matrix: the (i, j) entry of (1/n)XᵀX tends to ∫₀¹ x^(i+j−2) dx = 1/(i + j − 1).
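This limit can be checked directly. A quick Python/numpy sketch for a cubic (p = 3) design, where each entry of (1/n)XᵀX is a Riemann sum for the integral 1/(i + j − 1):

```python
import numpy as np

n = 2000
x = np.arange(1, n + 1) / n              # equally spaced on (0, 1]
X = np.vander(x, 4, increasing=True)     # columns 1, x, x^2, x^3

G = X.T @ X / n                          # (1/n) X'X
i, j = np.indices((4, 4))
H = 1.0 / (i + j + 1)                    # 4 x 4 Hilbert matrix

print(np.abs(G - H).max())               # shrinks toward 0 as n grows
```

So a raw-power polynomial design inherits the Hilbert matrix's ill-conditioning; it is not an artificial example.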
Multicollinearity
• Consider the paper example again:
> paper.lm3 <- lm(green.liquor ~ machine.speed +
    I(machine.speed^2) + I(machine.speed^3))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.75e+03 1.80e+04 0.38 0.72
machine.speed -1.25e+01 3.08e+01 -0.41 0.70
I(machine.speed^2) 7.72e-03 1.76e-02 0.44 0.68
I(machine.speed^3) -1.59e-06 3.36e-06 -0.47 0.65
Multicollinearity
> vif(paper.lm2)
machine.speed I(machine.speed^2)
15616 15616
• Reminder: If VIF > 10, we have a problem.
> vif(paper.lm3)
machine.speed I(machine.speed^2) I(machine.speed^3)
1.68e+08 6.76e+08 1.70e+08
Orthogonal Polynomials
• It is better to use orthogonal polynomials.
• Obtain them using the Gram-Schmidt orthogonalization procedure.
• poly(x,k) evaluates the orthogonal polynomials P1(x), . . . , Pk(x) at x; together with the intercept column P0 = 1, this gives the first k + 1 orthogonal polynomials.
• The model becomes
yi = β0P0(xi) + β1P1(xi) + · · ·+ βkPk(xi) + εi
where P0(xi) = 1, P1(x) is linear in x, . . . , Pk(x) is a kth degree polynomial in x.
Orthogonal Polynomials
• Orthogonality property:

∑_{i=1}^n Pj(xi)Pk(xi) = 0 if j ≠ k

• Implication: XᵀX is a diagonal matrix with jth diagonal element

(XᵀX)jj = ∑_{i=1}^n Pj(xi)²

• Numerically stable to invert!
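To see the numerical payoff, here is a small Python/numpy sketch (QR factorization performs the Gram-Schmidt step): the Gram matrix of raw polynomial columns is nearly singular, while the orthogonalized one is perfectly conditioned:

```python
import numpy as np

x = np.linspace(0, 1, 20)
V = np.vander(x, 6, increasing=True)   # raw columns 1, x, ..., x^5
Q, R = np.linalg.qr(V)                 # Gram-Schmidt orthonormalization

raw_cond = np.linalg.cond(V.T @ V)     # huge: near-Hilbert behavior
orth_cond = np.linalg.cond(Q.T @ Q)    # ~1: Q'Q is the identity
print(raw_cond, orth_cond)
```

Inverting the orthogonal Gram matrix is trivial, which is exactly why the diagonal XᵀX above is "numerically stable to invert".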
Example: Applying Orthogonal Polynomials to the Paper Data
> paper.orth3 <- lm(green.liquor ~ poly(machine.speed,3))
> summary(paper.orth3)

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               14.1200     0.0705  200.40  1.0e-12
poly(machine.speed, 3)1   -4.9051     0.2228  -22.01  5.7e-07
poly(machine.speed, 3)2   -1.5581     0.2228   -6.99  0.00043
poly(machine.speed, 3)3   -0.1050     0.2228   -0.47  0.65399

• Forward selection is straightforward now. VIFs are all 1.
7.5 More on Orthogonal Polynomials
• Model:
yi = α0P0(xi) + α1P1(xi) + · · · + αkPk(xi) + εi

where P0(xi) = 1, P1(x) is linear in x, . . . , Pk(x) is a kth degree polynomial in x.

• Orthogonality property:

∑_{i=1}^n Pj(xi)Pk(xi) = 0 if j ≠ k
Quadratic Example
x  -1  0  1  2
y   2  1  2 10

• If we use the non-orthogonal polynomials 1, x, x², then

X =
[ 1 -1 1
  1  0 0
  1  1 1
  1  2 4 ]  = [x1 x2 x3]
Gram-Schmidt
• Convert this to an orthogonal basis:
y1 = x1/||x1||,   y1ᵀ = [1/2 1/2 1/2 1/2]

y2′ = x2 − (x2ᵀy1) y1,   y2 = y2′/||y2′||,   y2ᵀ = [−3 −1 1 3]/√20

y3′ = x3 − (x3ᵀy1) y1 − (x3ᵀy2) y2
Gram-Schmidt (Cont’d)
y3 = y3′/||y3′||,   y3ᵀ = [1 −1 −1 1]/2

z1 = 2 y1,   z2 = √20 y2,   z3 = 2 y3

Xorth = [z1 z2 z3] =
[ 1 -3  1
  1 -1 -1
  1  1 -1
  1  3  1 ]
Gram-Schmidt (Cont’d)
• The columns of Xorth are orthogonal, so

XᵀorthXorth =
[ 4  0 0
  0 20 0
  0  0 4 ]

• What are the orthogonal polynomials in this case?

P0(x) = 1
P1(x) = Ax + B = 2x − 1
P2(x) = Ax² + Bx + C = x² − x − 1
Orthogonal Polynomials (Cont’d)
• Check orthogonality:

∑_{i=1}^4 Pj(xi)Pk(xi) = 0, j ≠ k

• The orthogonalized regression problem is

y = β0 + β1(2x − 1) + β2(x² − x − 1) + ε

• Fitted model:

ŷ = 3.75 + 1.25(2x − 1) + 2.25(x² − x − 1)

• This simplifies to

ŷ = .25 + .25x + 2.25x²

which could be obtained (but with attendant numerical difficulties) from the original X matrix.
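The arithmetic of this example is easy to verify. A Python/numpy sketch: with the orthogonal columns, each coefficient is a simple ratio, and expanding the fit reproduces the raw-power solution:

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([2.0, 1.0, 2.0, 10.0])

# P0, P1, P2 from the slide, evaluated at the data points
P = np.column_stack([np.ones(4), 2*x - 1, x**2 - x - 1])

beta = (P.T @ y) / (P**2).sum(axis=0)  # diagonal X'X: no inversion needed
print(beta)                            # [3.75 1.25 2.25]

print(np.polyfit(x, y, 2))             # ~[2.25 0.25 0.25], the expanded form
```

Both routes give the same curve; the orthogonal route never inverts an ill-conditioned matrix.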
General Case: Higher Order Orthogonal Polynomial Regression
yi = α0P0(xi) + α1P1(xi) + · · · + αkPk(xi) + εi

X =
[ P0(x1) P1(x1) ... Pk(x1)
  P0(x2) P1(x2) ... Pk(x2)
    ...    ...  ...   ...
  P0(xn) P1(xn) ... Pk(xn) ]

XᵀX = diag( ∑_{i=1}^n P0(xi)², . . . , ∑_{i=1}^n Pk(xi)² )

Xᵀy = ( ∑ P0(xi)yi, . . . , ∑ Pk(xi)yi )ᵀ

α̂j = ∑ Pj(xi)yi / ∑ Pj(xi)²
Confidence interval for αj
E[α̂j] = αj

V(α̂j) = σ² / ∑_{i=1}^n Pj(xi)²

SSE = ∑ yi² − ∑_{j=0}^k α̂j ∑_{i=1}^n Pj(xi)yi

MSE = SSE/(n − k − 1)

α̂j ± t_{α/2, n−k−1} √( MSE / ∑ Pj(xi)² )
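Continuing the four-point quadratic example from the previous slides, a Python/numpy sketch confirms the SSE shortcut (note the sum must include the j = 0 term) against the direct residual sum of squares:

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([2.0, 1.0, 2.0, 10.0])
P = np.column_stack([np.ones(4), 2*x - 1, x**2 - x - 1])

alpha = (P.T @ y) / (P**2).sum(axis=0)            # [3.75, 1.25, 2.25]

sse_shortcut = (y**2).sum() - (alpha * (P.T @ y)).sum()
sse_direct = ((y - P @ alpha)**2).sum()
mse = sse_shortcut / (len(x) - 2 - 1)             # n - k - 1 with k = 2
print(sse_shortcut, sse_direct, mse)              # all equal 1.25 here
```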
Significance testing
SSR(αj) = α̂j ∑ Pj(xi)yi

H0 : αj = 0   vs   H1 : αj ≠ 0

F0 = SSR(αj)/MSE

Reject H0 if F0 > F_{α, 1, n−k−1}.

Since the regression sum of squares for the jth term does not depend on any other coefficients, this partial F test does not depend on the terms already included in the model.
Introduction to Splines
• Polynomials are not flexible enough to adequately model all smooth functions.
• Piecewise polynomials or splines are more flexible.
• What is a piecewise polynomial?
• Example: Consider two polynomials
p1(x) = −2x² + .5x³
p2(x) = −2x² + .5x³ − 2(x − 6)³
Splines (Cont’d)
• We can make a piecewise polynomial out of these two polynomials by cutting them at x = 6 (they are equal there) and tying them together with a knot at x = 6:

s(x) = −2x² + .5x³,                x < 6
s(x) = −2x² + .5x³ − 2(x − 6)³,    x ≥ 6

or

s(x) = −2x² + .5x³ − 2(x − 6)³ I(x ≥ 6)

or

s(x) = −2x² + .5x³ − 2(x − 6)₊³

• This is a cubic spline with a knot at 6.
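As a sanity check, a Python/numpy sketch (the helper name tpf is ours) showing that the single truncated-power expression and the two-piece definition of s(x) coincide:

```python
import numpy as np

def tpf(x, knot, k=3):
    # truncated power function (x - knot)_+^k
    return np.maximum(x - knot, 0.0)**k

def s(x):
    # cubic spline with a knot at 6, truncated-power form
    return -2*x**2 + 0.5*x**3 - 2*tpf(x, 6.0)

x = np.linspace(0.0, 10.0, 101)
piecewise = np.where(x < 6,
                     -2*x**2 + 0.5*x**3,
                     -2*x**2 + 0.5*x**3 - 2*(x - 6)**3)
print(np.abs(s(x) - piecewise).max())   # 0: the two forms agree
```

Because (x − 6)³ and its first two derivatives vanish at the knot, s is continuous with continuous first and second derivatives there.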
Splines (Cont’d)
[Figure: the cubics −2x² + 0.5x³ and −2x² + 0.5x³ − 2(x − 6)³ overlaid with the resulting cubic spline, knot at 6.0]
> source("splineeg.R") # some spline examples
> spline.eg() # plots two polynomials and a spline
Another Example: Three different cubic polynomials
p1(x) = −2x² + .5x³
p2(x) = −2x² + .5x³ − 5(x − 6)³
p3(x) = −2x² + .5x³ − 5(x − 6)³ + 5(x − 6.5)³

• Cut them up at x = 6 and x = 6.5 and tie together with knots there:

s(x) = −2x² + .5x³ − 5(x − 6)³ I(x ≥ 6) + 5(x − 6.5)³ I(x ≥ 6.5)

or

s(x) = −2x² + .5x³ − 5(x − 6)₊³ + 5(x − 6.5)₊³
• This is a cubic spline with knots at 6 and 6.5.
A Spline Curve
[Figure: the three cubics overlaid with the resulting cubic spline, knots at 6.0 and 6.5]
> spline.eg2() # plots three polynomials and a spline
Truncated Power function
T(x) = (x − τ)₊ᵏ = (x − τ)ᵏ I(x ≥ τ)

T(x) is 0 for x < τ and (x − τ)ᵏ for x ≥ τ.

T(x) is a simple degree-k spline with a knot at τ.
Spline Regression Models:
y = β0 + β1x + · · · + βkxᵏ + γ1T1(x) + · · · + γhTh(x) + ε

where Tj(x) = (x − τj)₊ᵏ.
• Knots are at τ1, τ2, . . . , τh.
• The β’s and γ’s can be estimated by least squares. The X matrix has k + h + 1 columns.
• (Exercise: What are the columns of X?)
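One way to assemble that design matrix, sketched in Python/numpy (the helper name spline_design is ours): the columns are 1, x, . . . , xᵏ followed by one truncated-power column per knot:

```python
import numpy as np

def spline_design(x, knots, k=3):
    # columns: 1, x, ..., x^k, then (x - tau_j)_+^k for each knot tau_j
    powers = np.vander(x, k + 1, increasing=True)
    trunc = np.maximum(x[:, None] - np.asarray(knots)[None, :], 0.0)**k
    return np.hstack([powers, trunc])

x = np.linspace(0.0, 10.0, 50)
X = spline_design(x, knots=[6.0, 6.5])
print(X.shape)   # (50, 6): k + h + 1 = 3 + 2 + 1 columns
```

Regressing y on these columns by least squares gives the β's and γ's directly.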
Splines (Cont’d)
• B-splines are more numerically stable than truncated splines.
• The regression model in terms of B-Splines:
y = β0 + β1B1(x) + β2B2(x) + · · ·+ βkBk(x) + ε
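To make the Bj(x) concrete, here is a self-contained Cox-de Boor recursion in Python/numpy (an illustrative sketch, not the implementation behind R's bs()); with a clamped knot vector the basis functions are local, nonnegative, and sum to 1:

```python
import numpy as np

def bspline_basis(x, t, k):
    # Cox-de Boor recursion: evaluate all degree-k B-splines on knot vector t
    x = np.asarray(x, float)
    t = np.asarray(t, float)
    B = ((t[:-1] <= x[:, None]) & (x[:, None] < t[1:])).astype(float)
    for d in range(1, k + 1):
        n = len(t) - d - 1
        Bnew = np.zeros((len(x), n))
        for i in range(n):
            d1 = t[i + d] - t[i]
            d2 = t[i + d + 1] - t[i + 1]
            if d1 > 0:
                Bnew[:, i] += (x - t[i]) / d1 * B[:, i]
            if d2 > 0:
                Bnew[:, i] += (t[i + d + 1] - x) / d2 * B[:, i + 1]
        B = Bnew
    return B

k = 3
t = np.r_[[0.0]*(k+1), [6.0], [10.0]*(k+1)]   # clamped, one interior knot at 6
x = np.linspace(0.1, 9.9, 40)
B = bspline_basis(x, t, k)
print(B.shape)        # (40, 5) basis functions
print(B.sum(axis=1))  # each row sums to 1 (partition of unity)
```

R's bs() returns these columns minus the one absorbed by the model intercept, which is why the titanium slides count "degree + number of knots" transformations.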
Example: The B-spline transformations of temperature (Titanium Data)
knots at 825, 885, 895, 905, 990; degree = 3
[Figure: four of the B-spline transformations of temperature, each plotted as B(temperature) vs temperature]
Example (Cont’d)
The other four transformations:
[Figure: the other four B-spline transformations of temperature, plotted as B(temperature) vs temperature]
Number of transformations = degree + number of knots
Example: Fitting a Spline Curve to the Titanium Data
> require(splines)
[1] TRUE
> titanium.spline <- lm(g ~ bs(temperature,
    knots=c(825,885,895,905,990), degree=3))
Example (Cont’d)
> plot(titanium, pch=16)
> lines(spline(temperature, predict(titanium.spline)),
    col=4, lwd=2)
> title("Cubic Spline Fitted to Titanium Data")
[Figure: Cubic Spline Fitted to Titanium Data — g vs temperature with the fitted spline curve]
Choosing knots using Variable Selection Methods
• Spline example - geophones:
tr.pwr <- function (x, knot, degree=3)
{ # truncated power function
(x > knot)*(x - knot)^degree
}
# one knot per function
xx <- cbind(distance, distance^2, distance^3,
outer(distance,seq(20,80,length=20),tr.pwr))
# we start with 20 knots equally spaced between
# 20 and 80, and use forward selection to choose
# the best ones:
geophones.fwd <- regsubsets(thickness ~ xx,
method="forward", nvmax=12, data=geophones)
summary.regsubsets(geophones.fwd)$cp
[1] 153.39 52.34 31.41 20.27 15.77 10.59
[7] 10.78 10.68 9.75 8.81 8.52 9.21
# Which knots are in?
seq(20,80,length=20)[summary.regsubsets(
geophones.fwd)$which[11,-seq(1,4)]]
[1] 20.0 32.6 35.8 38.9 45.3 51.6 54.7 70.5 73.7 76.8
knots.sub <- summary.regsubsets(
geophones.fwd)$which[11,-seq(1,4)]
knots.try<-seq(20,80,length=20)[knots.sub]
geophones.bs <- lm(thickness ~ bs(distance, knots =
knots.try,Boundary.knots = c(0,100)),data=geophones)
PRESS(geophones.bs)
[1] 285
plot(geophones)
lines(spline(geophones$distance,
predict(geophones.bs)),col=4)
# you can check plot(geophones.bs) to see if there are
# problems
Titanium Example

xx <- cbind(temperature, temperature^2, temperature^3,
  outer(temperature, seq(620,1050,length=30), tr.pwr))

titanium.fwd <- regsubsets(g ~ xx, method="forward",
  nvmax=15, data=titanium)

summary.regsubsets(titanium.fwd)$cp
 [1] 72343.36 57292.13 42136.90 27174.83 11382.40  5551.22
 [7]  1795.84   351.67   103.14    68.64    11.84     5.72
[13]     6.91     8.15     9.52

> knots.try <- seq(620,1050,
    length=30)[summary.regsubsets(titanium.fwd)$which[12,-seq(1,4)]]

titanium.bs <- lm(g ~ bs(temperature, knots = knots.try,
  Boundary.knots = c(500,1100)))
plot(titanium)
lines(spline(temperature, predict(titanium.bs)), col=4)
# plot(titanium.bs)