
Page 1: Spline Regression Models Things to Keep in Mind when

Spline Regression Models

• Things to Keep in Mind when Fitting Polynomial Regression Models

• Piecewise Polynomials (Splines)

• Applying Variable Selection Methods to Choose Knots

Page 2:

Important Considerations when Fitting Polynomials to Data

• Example (paper):

• How does the speed of a paper mill machine affect the quality of the finished product?

• Measurements of the amount of green liquor produced are recorded for various speeds.

Page 3:

Paper Mill Example (Cont’d)

Data:


> paper

green.liquor machine.speed

1 16.0 1700

2 15.8 1720

3 15.6 1730

4 15.5 1740

5 14.8 1750

6 14.0 1760

7 13.5 1770

8 13.0 1780

9 12.0 1790

10 11.0 1795

Page 4:

Example (Cont’d)

> attach(paper)

> plot(green.liquor ~ machine.speed) # paperplot.pdf

[Figure: scatterplot of green.liquor vs. machine.speed]

Page 5:

What happens if we use a linear model?

Fit the model and check the residual plot:

> paper.lm <- lm(green.liquor ~ machine.speed)
> plot(paper.lm, which=1, pch=16) # paperres.pdf

[Figure: Residuals vs Fitted for lm(green.liquor ~ machine.speed); observations 10, 1, and 4 are flagged]

Page 6:

Try a quadratic model

> paper.lm2 <- lm(green.liquor ~ machine.speed + I(machine.speed^2))
> summary(paper.lm2)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        -1.709e+03  2.448e+02  -6.984 0.000215
machine.speed       2.023e+00  2.798e-01   7.230 0.000173
I(machine.speed^2) -5.929e-04  7.994e-05  -7.417 0.000147

Residual standard error: 0.2101 on 7 degrees of freedom

• Fitted model:

ŷ = −1709 + 2.02x − .00059x²

(x = machine speed, y = amount of green liquor); the error standard deviation estimate is .21.

Page 7:

Is this model satisfactory?

Check the residual plot again:

> plot(paper.lm2, which=1, pch=16) # paper2res.pdf

[Figure: Residuals vs Fitted for lm(green.liquor ~ machine.speed + I(machine.speed^2)); observations 10, 8, and 6 are flagged]

Page 8:

Overlaying the Data with the Fitted Curve

> plot(green.liquor ~ machine.speed)
> quadline(paper.lm2) # paper2plot.pdf

[Figure: quadratic fitted to paper data; the fitted curve overlaid on the scatterplot of green.liquor vs. machine.speed]

Page 9:

Are there any Influential Observations?

Check Cook’s Distance:

> plot(paper.lm2, which=4, pch=16) # paper2cook.pdf

[Figure: Cook's distance vs. observation number for paper.lm2; observations 10, 9, and 1 are flagged]

Page 10:

Check observation 10 more closely:

Influence on the coefficients and on the fitted values:

> dfbetas(paper.lm2)[10,]

(Intercept) machine.speed I(machine.speed^2)

-1.383708 1.395116 -1.406660

> dffits(paper.lm2)[10]

10

-2.258152

Observation 10 is highly influential.
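One way to see the impact directly is to refit the quadratic with observation 10 removed and compare the coefficients. A quick sketch (the `paper` data frame below is transcribed from the listing on page 3):

```r
# Sketch: refit the quadratic model without observation 10 and compare.
paper <- data.frame(
  green.liquor  = c(16.0, 15.8, 15.6, 15.5, 14.8, 14.0, 13.5, 13.0, 12.0, 11.0),
  machine.speed = c(1700, 1720, 1730, 1740, 1750, 1760, 1770, 1780, 1790, 1795)
)
fit.all  <- lm(green.liquor ~ machine.speed + I(machine.speed^2), data = paper)
fit.drop <- lm(green.liquor ~ machine.speed + I(machine.speed^2),
               data = paper[-10, ])
# Side-by-side coefficients; large shifts echo the dfbetas values above
cbind(all.obs = coef(fit.all), without.10 = coef(fit.drop))
```

Large relative changes in all three coefficients confirm what the dfbetas output reports.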

Page 11:

Important Considerations when Fitting Polynomials to Data

• Order of the Model

Keep this as low as possible – parsimony

• Example: titanium heat data - 49 observations on g and temperature.

Quadratic fit:

> attach(titanium)

> titanium.lm2 <- lm(g ~ poly(temperature,2))
> plot(titanium, pch=16)
> lines(spline(temperature, predict(titanium.lm2)), col=4, lwd=2)

Page 12:

Titanium Example (Cont’d): A Failure for the Quadratic Polynomial Model

A pretty miserable fit:

[Figure: quadratic polynomial fit overlaid on the titanium data, g vs. temperature]

Page 13:

Does a higher order polynomial fit better?

• 5th order fit:

> titanium.lm5 <- lm(g ~ poly(temperature,5))
> plot(titanium, pch=16)
> lines(spline(temperature, predict(titanium.lm5)), col=4, lwd=2)

Page 14:

Titanium Heat Data Example (Cont’d): Attempting to Fit Using High Degree Polynomials

• 5th order model:

[Figure: 5th order polynomial fit overlaid on the titanium data, g vs. temperature]

Page 15:

Example (Cont’d)

• 21st order fit:

> titanium.lm21 <- lm(g ~ poly(temperature,21))
> plot(titanium, pch=16)
> lines(spline(temperature, predict(titanium.lm21)), col=4, lwd=2)

Page 16:

Titanium Heat Data Example (Cont’d): High Degree Polynomial Regression is Futile

• 21st order model:

[Figure: 21st order polynomial fit overlaid on the titanium data, g vs. temperature]

Page 17:

Better to use piecewise polynomials (splines)

Page 18:

Important Considerations (Cont’d)

• Ill-Conditioning

• Example: Consider the Hilbert matrix Hp: hij = 1/(i + j − 1), i, j = 1, . . . , p:

hilbert <- function(n=2){
  matrix(1/(rep(seq(1,n),n) +
            rep(seq(0,n-1),rep(n,n))), ncol=n)
}

> hilbert(2)

[,1] [,2]

[1,] 1.0 0.500

[2,] 0.5 0.333

Page 19:

Ill-Conditioning (Cont’d)

> hilbert(4)

[,1] [,2] [,3] [,4]

[1,] 1.000 0.500 0.333 0.250

[2,] 0.500 0.333 0.250 0.200

[3,] 0.333 0.250 0.200 0.167

[4,] 0.250 0.200 0.167 0.143

Page 20:

Ill-Conditioning (Cont’d)

• The Hilbert matrix is famous for being ill-conditioned.

• The ratio of the largest eigenvalue to the smallest eigenvalue is very large.

• Matrix inversion is unstable, since the smallest eigenvalue is numerically indistinguishable from 0.

• The inverse of the 2 × 2 Hilbert matrix:

solve(hilbert(2))

[,1] [,2]

[1,] 4 -6

[2,] -6 12

Page 21:

Ill-Conditioned Matrices (Cont’d)

• Try to invert the 7 × 7 Hilbert matrix:

solve(hilbert(7))

Error in solve.default(hilbert(7)) : singular matrix ‘a’ in solve

• From a numerical point of view, large Hilbert matrices appear singular.

• The determinant of a matrix is equal to the product of its eigenvalues; since every eigenvalue of a Hilbert matrix is strictly positive, all Hilbert matrices are, in exact arithmetic, invertible (nonsingular).
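To put numbers on this, here is a small sketch computing the eigenvalue ratio (the condition number) for a few sizes; the `hilbert()` constructor is rewritten with `outer()` so the chunk stands alone:

```r
# Sketch: condition number (largest/smallest eigenvalue) of Hilbert matrices
hilbert <- function(n) outer(1:n, 1:n, function(i, j) 1/(i + j - 1))
cond <- sapply(c(2, 4, 7), function(n) {
  ev <- eigen(hilbert(n), symmetric = TRUE)$values
  max(ev)/min(ev)   # ratio of extreme eigenvalues
})
cond   # grows explosively with n; already of order 1e8 for n = 7
```

With a condition number that large, roughly 8 of the 15-16 significant digits of double precision are lost in inversion, which is why `solve()` gives up.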

Page 22:

Ill-Conditioned Matrices (Cont’d)

• The exact inverse of the Hilbert matrix can be computed using the following function:

inverse.hilbert <- function (n){
  Hinv <- matrix(0, nrow=n, ncol=n)
  for (i in 1:n){
    for (j in 1:n){
      Hinv[i,j] <- (-1)^(i+j)*(i+j-1)*
        choose(n+i-1,n-j)*choose(n+j-1,n-i)*choose(i+j-2,i-1)^2
    }
  }
  Hinv
}

> inverse.hilbert(7)
       [,1]     [,2]      [,3]      [,4]       [,5]       [,6]      [,7]
[1,]     49    -1176      8820    -29400      48510     -38808     12012
[2,]  -1176    37632   -317520   1128960   -1940400    1596672   -504504
[3,]   8820  -317520   2857680 -10584000   18711000  -15717240   5045040
[4,] -29400  1128960 -10584000  40320000  -72765000   62092800 -20180160
[5,]  48510 -1940400  18711000 -72765000  133402500 -115259760  37837800
[6,] -38808  1596672 -15717240  62092800 -115259760  100590336 -33297264
[7,]  12012  -504504   5045040 -20180160   37837800  -33297264  11099088

Page 23:

Example (Cont’d)

• We can check that this is correct by multiplying with the Hilbert matrix of size 7:

> sum(hilbert(7)[1,]*inverse.hilbert(7)[1,])
[1] 1
> sum(hilbert(7)[1,]*inverse.hilbert(7)[2,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[3,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[4,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[5,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[6,])
[1] 0
> sum(hilbert(7)[1,]*inverse.hilbert(7)[7,])
[1] 0

Page 24:

Is the Hilbert Matrix just an Artificial Example?

• Consider the following regression problem:

y = β0 + β1x + · · · + βp x^p + ε

where the x values are taken equally spaced on the interval from 0 to 1: e.g. if n = 10, x = .1, .2, . . . , 1.

• As n → ∞, (1/n) XᵀX → Hp+1 (the design matrix has the p + 1 columns 1, x, . . . , x^p, so the limit is the (p + 1) × (p + 1) Hilbert matrix).
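A quick numerical sketch of this limit, on a hypothetical grid of n = 1000 points, comparing against the Hilbert matrix from the earlier slide:

```r
# Sketch: the scaled cross-product of a raw quadratic design on an equally
# spaced grid in (0, 1] approaches the 3 x 3 Hilbert matrix
n <- 1000
x <- seq(1/n, 1, length = n)
X <- cbind(1, x, x^2)                               # columns 1, x, x^2 (p = 2)
H3 <- outer(1:3, 1:3, function(i, j) 1/(i + j - 1)) # Hilbert matrix
max(abs(crossprod(X)/n - H3))                       # small; shrinks as n grows
```

Each entry of (1/n)XᵀX is a Riemann sum for ∫₀¹ x^(i+j−2) dx = 1/(i + j − 1), which is exactly the Hilbert entry.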

Page 25:

Multicollinearity

• Consider the paper example again:

> paper.lm3 <- lm(green.liquor ~ machine.speed +
    I(machine.speed^2) + I(machine.speed^3))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         6.75e+03   1.80e+04    0.38     0.72
machine.speed      -1.25e+01   3.08e+01   -0.41     0.70
I(machine.speed^2)  7.72e-03   1.76e-02    0.44     0.68
I(machine.speed^3) -1.59e-06   3.36e-06   -0.47     0.65

Page 26:

Multicollinearity

> vif(paper.lm2)

machine.speed I(machine.speed^2)
        15616              15616

• Reminder: If VIF > 10, we have a problem.

> vif(paper.lm3)

machine.speed I(machine.speed^2) I(machine.speed^3)
     1.68e+08           6.76e+08           1.70e+08

Page 27:

Orthogonal Polynomials

• It is better to use orthogonal polynomials.

• Obtain them using the Gram-Schmidt orthogonalization procedure.

• poly(x,k) evaluates the orthogonal polynomials P1(x), . . . , Pk(x) at x; the constant P0 is absorbed into the intercept.

• The model becomes

yi = β0 P0(xi) + β1 P1(xi) + · · · + βk Pk(xi) + εi

where P0(xi) = 1, P1(x) is linear in x, . . . , Pk(x) is a kth degree polynomial in x.

Page 28:

Orthogonal Polynomials

• Orthogonality property:

∑i=1..n Pj(xi) Pk(xi) = 0 if j ≠ k

• Implication: XᵀX is a diagonal matrix with jth diagonal element

(XᵀX)jj = ∑i=1..n Pj(xi)²

• Numerically stable to invert!
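In R this is easy to verify: poly() goes one step further and scales each column to unit length, so the cross-product matrix is (numerically) the identity rather than merely diagonal. A small sketch:

```r
# Sketch: with poly()'s orthonormal columns, the cross-product is the identity
x <- seq(0.1, 1, by = 0.1)
P <- poly(x, 3)           # evaluates orthogonal polynomials P1, P2, P3 at x
round(crossprod(P), 10)   # 3 x 3 identity: trivially stable to invert
```

Compare this with the Hilbert-like cross-product the raw basis 1, x, x², x³ produces on the same grid.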

Page 29:

Example: Applying Orthogonal Polynomials to the Paper Data

> paper.orth3 <- lm(green.liquor ~ poly(machine.speed,3))
> summary(paper.orth3)

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              14.1200     0.0705  200.40  1.0e-12
poly(machine.speed, 3)1  -4.9051     0.2228  -22.01  5.7e-07
poly(machine.speed, 3)2  -1.5581     0.2228   -6.99  0.00043
poly(machine.speed, 3)3  -0.1050     0.2228   -0.47  0.65399

• Forward selection is straightforward now. VIFs are all 1.

Page 30:

7.5 More on Orthogonal Polynomials

• Model:

yi = α0 P0(xi) + α1 P1(xi) + · · · + αk Pk(xi) + εi

where P0(xi) = 1, P1(x) is linear in x, . . . , Pk(x) is a kth degree polynomial in x.

• Orthogonality property:

∑i=1..n Pj(xi) Pk(xi) = 0 if j ≠ k

Page 31:

Quadratic Example

x:  -1   0   1   2
y:   2   1   2  10

• If we use the non-orthogonal polynomials 1, x, x², then

      | 1  -1   1 |
X  =  | 1   0   0 |  = [ x1  x2  x3 ]
      | 1   1   1 |
      | 1   2   4 |

Page 32:

Gram-Schmidt

• Convert this to an orthogonal basis:

y1 = x1 / ||x1||,        y1ᵀ = [1/2  1/2  1/2  1/2]

y2′ = x2 − (x2ᵀ y1) y1

y2 = y2′ / ||y2′||,      y2ᵀ = [−3  −1  1  3]/√20

y3′ = x3 − (x3ᵀ y1) y1 − (x3ᵀ y2) y2

Page 33:

Gram-Schmidt (Cont’d)

y3 = y3′ / ||y3′||,      y3ᵀ = [1  −1  −1  1]/2

z1 = 2 y1,   z2 = √20 y2,   z3 = 2 y3

                         | 1  −3   1 |
Xorth = [ z1  z2  z3 ] = | 1  −1  −1 |
                         | 1   1  −1 |
                         | 1   3   1 |

Page 34:

Gram-Schmidt (Cont’d)

• The columns of Xorth are orthogonal, so

                |  4   0   0 |
XorthᵀXorth =   |  0  20   0 |
                |  0   0   4 |

• What are the orthogonal polynomials in this case?

P0(x) = 1
P1(x) = Ax + B = 2x − 1
P2(x) = Ax² + Bx + C = x² − x − 1

Page 35:

Orthogonal Polynomials (Cont’d)

• Check orthogonality:

∑i=1..4 Pj(xi) Pk(xi) = 0,  j ≠ k

• The orthogonalized regression problem is

y = β0 + β1(2x − 1) + β2(x² − x − 1) + ε

and the fitted model is

ŷ = 3.75 + 1.25(2x − 1) + 2.25(x² − x − 1)

• This simplifies to

ŷ = .25 + .25x + 2.25x²

which could be obtained (but with attendant numerical difficulties) from the original X matrix.
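The hand computation above can be checked with lm(); since both bases span the same space of quadratics, the two fits are the same curve (a quick verification sketch):

```r
# Verify the worked example: orthogonal vs. raw polynomial bases
x <- c(-1, 0, 1, 2)
y <- c(2, 1, 2, 10)
fit.orth <- lm(y ~ I(2*x - 1) + I(x^2 - x - 1))  # orthogonal columns z2, z3
fit.raw  <- lm(y ~ x + I(x^2))                   # raw basis 1, x, x^2
coef(fit.orth)   # 3.75, 1.25, 2.25 as derived above
coef(fit.raw)    # 0.25, 0.25, 2.25 after simplification
```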

Page 36:

General Case: Higher Order Orthogonal Polynomial Regression

yi = α0 P0(xi) + α1 P1(xi) + · · · + αk Pk(xi) + εi

      | P0(x1)  P1(x1)  ...  Pk(x1) |
X =   | P0(x2)  P1(x2)  ...  Pk(x2) |
      |   ···     ···   ...    ···  |
      | P0(xn)  P1(xn)  ...  Pk(xn) |

        | ∑i P0(xi)²   ...       0      |
XᵀX =   |     ···      ...      ···     |
        |      0       ...  ∑i Pk(xi)²  |

        | ∑i P0(xi) yi |
Xᵀy =   |      ···     |
        | ∑i Pk(xi) yi |

α̂j = ∑i Pj(xi) yi / ∑i Pj(xi)²

Page 37:

Confidence interval for αj

E[α̂j] = αj

V(α̂j) = σ² / ∑i=1..n Pj(xi)²

SSE = ∑i yi² − ∑j=0..k α̂j ∑i Pj(xi) yi

MSE = SSE/(n − k − 1)

α̂j ± t(α/2, n−k−1) √( MSE / ∑i Pj(xi)² )

Page 38:

Significance testing

SSR(α̂j) = α̂j ∑i Pj(xi) yi

H0: αj = 0    vs    H1: αj ≠ 0

F0 = SSR(α̂j) / MSE

Reject H0 if F0 > F(α, 1, n−k−1).

Since the regression sum of squares for the jth term does not depend on any of the other coefficients, this partial F test does not depend on which terms are already included in the model.

Page 39:

Introduction to Splines

• Polynomials are not flexible enough to adequately model all smooth functions.

• Piecewise polynomials, or splines, are more flexible.

• What is a piecewise polynomial?

• Example: Consider two polynomials

p1(x) = −2x² + .5x³
p2(x) = −2x² + .5x³ − 2(x − 6)³

Page 40:

Splines (Cont’d)

• We can make a piecewise polynomial out of these two polynomials by cutting them at x = 6 (they are equal there) and tying them together with a knot at x = 6:

s(x) = −2x² + .5x³,                x < 6
s(x) = −2x² + .5x³ − 2(x − 6)³,    x ≥ 6

or

s(x) = −2x² + .5x³ − 2(x − 6)³ I(x ≥ 6)

or

s(x) = −2x² + .5x³ − 2(x − 6)₊³

• This is a cubic spline with a knot at 6.
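A quick numerical sketch confirms why the pieces tie together: the added term 2(x − 6)³ vanishes at the knot (along with its first two derivatives), so s(x) agrees with p1 left of 6 and with p2 right of 6:

```r
# Sketch: the spline follows p1 left of the knot and p2 right of it,
# and the two pieces meet at x = 6 because 2(x - 6)^3 is zero there
p1 <- function(x) -2*x^2 + 0.5*x^3
p2 <- function(x) -2*x^2 + 0.5*x^3 - 2*(x - 6)^3
s  <- function(x) -2*x^2 + 0.5*x^3 - 2*pmax(x - 6, 0)^3
c(p1(6), p2(6), s(6))   # all equal: the pieces join at the knot
```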

Page 41:

Splines (Cont’d)

[Figure: the two cubics −2x² + 0.5x³ and −2x² + 0.5x³ − 2(x − 6)³, and the resulting cubic spline with a knot at 6.0]

> source("splineeg.R") # some spline examples

> spline.eg() # plots two polynomials and a spline

Page 42:

Another Example: Three different cubic polynomials

p1(x) = −2x² + .5x³
p2(x) = −2x² + .5x³ − 5(x − 6)³
p3(x) = −2x² + .5x³ − 5(x − 6)³ + 5(x − 6.5)³

• Cut them up at x = 6 and x = 6.5 and tie them together with knots there:

s(x) = −2x² + .5x³ − 5(x − 6)³ I(x ≥ 6) + 5(x − 6.5)³ I(x ≥ 6.5)

or

s(x) = −2x² + .5x³ − 5(x − 6)₊³ + 5(x − 6.5)₊³

• This is a cubic spline with knots at 6 and 6.5.

Page 43:

A Spline Curve

[Figure: the three cubics and the resulting cubic spline with knots at 6.0 and 6.5]

> spline.eg2() # plots three polynomials and a spline

Page 44:

Truncated Power function

T(x) = (x − τ)₊ᵏ = (x − τ)ᵏ I(x ≥ τ)

T(x) is 0 for x < τ and (x − τ)ᵏ for x ≥ τ.

T(x) is a simple degree-k spline with a knot at τ.

Page 45:

Spline Regression Models:

y = β0 + β1 x + · · · + βk xᵏ + γ1 T1(x) + · · · + γh Th(x) + ε

where Tj(x) = (x − τj)₊ᵏ.

• Knots are at τ1, τ2, . . . , τh.

• The β’s and γ’s can be estimated by least squares. The X matrix has k + h + 1 columns.

• (Exercise: What are the columns of X?)

Page 46:

Splines (Cont’d)

• B-splines are more numerically stable than truncated splines.

• The regression model in terms of B-Splines:

y = β0 + β1B1(x) + β2B2(x) + · · ·+ βkBk(x) + ε

Page 47:

Example: The B-spline transformations of temperature (Titanium Data)

knots at 825, 885, 895, 905, 990; degree = 3

[Figure: the first four B-spline basis functions B(temperature), each plotted against temperature]

Page 48:

Example (Cont’d)

The other four transformations:

[Figure: the remaining four B-spline basis functions, each plotted against temperature]

Number of transformations = degree + number of knots
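This count is easy to confirm with bs() from the splines package (a minimal sketch on a hypothetical temperature grid):

```r
# Sketch: number of B-spline basis columns = degree + number of knots
library(splines)
B <- bs(seq(600, 1000, by = 10),
        knots = c(825, 885, 895, 905, 990), degree = 3)
ncol(B)   # 3 + 5 = 8 transformations, matching the eight panels above
```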

Page 49:

Example: Fitting a Spline Curve to the Titanium Data

> require(splines)

[1] TRUE

> titanium.spline <- lm(g ~ bs(temperature,
    knots=c(825,885,895,905,990), degree=3))

Page 50:

Example (Cont’d)

> plot(titanium, pch=16)
> lines(spline(temperature, predict(titanium.spline)), col=4, lwd=2)
> title("Cubic Spline Fitted to Titanium Data")

[Figure: cubic spline fitted to the titanium data, overlaid on the scatterplot of g vs. temperature]

Page 51:

Choosing knots using Variable Selection Methods

• Spline example - geophones:

tr.pwr <- function (x, knot, degree=3)
{ # truncated power function
  (x > knot)*(x - knot)^degree
}

# one knot per function

xx <- cbind(distance, distance^2, distance^3,
            outer(distance, seq(20,80,length=20), tr.pwr))

# we start with 20 knots equally spaced between

# 20 and 80, and use forward selection to choose

# the best ones:

Page 52:

geophones.fwd <- regsubsets(thickness ~ xx,
    method="forward", nvmax=12, data=geophones)

summary.regsubsets(geophones.fwd)$cp

[1] 153.39 52.34 31.41 20.27 15.77 10.59

[7] 10.78 10.68 9.75 8.81 8.52 9.21

# Which knots are in?

seq(20,80,length=20)[summary.regsubsets(

geophones.fwd)$which[11,-seq(1,4)]]

[1] 20.0 32.6 35.8 38.9 45.3 51.6 54.7 70.5 73.7 76.8

knots.sub <- summary.regsubsets(

geophones.fwd)$which[11,-seq(1,4)]

knots.try<-seq(20,80,length=20)[knots.sub]

geophones.bs <- lm(thickness ~ bs(distance, knots = knots.try,
    Boundary.knots = c(0,100)), data=geophones)

PRESS(geophones.bs)

[1] 285

plot(geophones)

Page 53:

lines(spline(geophones$distance,
    predict(geophones.bs)), col=4)

# you can check plot(geophones.bs) to see if there are

# problems
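PRESS() used above is presumably a course-supplied helper. A minimal stand-in, assuming the usual leave-one-out definition via the deleted-residual identity e_i/(1 − h_ii), would be:

```r
# Hypothetical stand-in for the course's PRESS() helper, assuming the
# standard definition: PRESS = sum of squared leave-one-out residuals
PRESS <- function(fit) {
  r <- residuals(fit)
  h <- lm.influence(fit)$hat   # leverages h_ii
  sum((r/(1 - h))^2)           # e_i/(1 - h_ii) is the deleted residual
}
```

Smaller PRESS indicates better out-of-sample prediction, which is why it is used here to assess the chosen knots.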

Page 54:

Titanium Example

xx <- cbind(temperature, temperature^2, temperature^3,
            outer(temperature, seq(620,1050,length=30), tr.pwr))

titanium.fwd <- regsubsets(g ~ xx, method="forward",
    nvmax=15, data=titanium)

summary.regsubsets(titanium.fwd)$cp
 [1] 72343.36 57292.13 42136.90 27174.83 11382.40  5551.22
 [7]  1795.84   351.67   103.14    68.64    11.84     5.72
[13]     6.91     8.15     9.52

> knots.try <- seq(620,1050,length=30)[
    summary.regsubsets(titanium.fwd)$which[12,-seq(1,4)]]

titanium.bs <- lm(g ~ bs(temperature, knots = knots.try,
    Boundary.knots = c(500,1100)))
plot(titanium)
lines(spline(temperature, predict(titanium.bs)), col=4)

# plot(titanium.bs)
