

Smoothing Scatterplots Using Penalized Splines


What do we mean by smoothing? Fitting a "smooth" curve to the data in a scatterplot.


Why would we want to fit a smooth curve to the data in a scatterplot? Imagine the model

$y_i = f(x_i) + e_i \quad (i = 1, \ldots, n)$, where $e_1, \ldots, e_n$ are independent with mean 0 and $f$ is some unknown smooth function.


If the subject matter underlying the data set tells us nothing about a parametric form for f, we may prefer to let the data suggest a curve rather than concocting some parametric function that we hope will fit the data well. The estimated curve might help us see features of the data that are obscured by variation or simply provide a nice summary of the relationship between y and x.
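To make the model above concrete, here is a minimal sketch that simulates data from it. The function `f_true`, the sample size, and the error standard deviation are all hypothetical choices made for illustration; none of them come from the slides.

```r
set.seed(511)

# Hypothetical smooth function standing in for the unknown f.
f_true <- function(x) sin(x) + 0.1 * x

n <- 100
x <- sort(runif(n, 0, 10))

# Independent errors with mean 0 (normality is an extra assumption here).
e <- rnorm(n, mean = 0, sd = 0.3)
y <- f_true(x) + e

# Scatterplot of the simulated data with the true curve overlaid.
plot(x, y, pch = 19, col = 4)
curve(f_true, from = 0, to = 10, add = TRUE, lwd = 2)
```

In practice only the points are observed; a smoother tries to recover something like the overlaid curve from the points alone.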


d=read.delim("http://www.public.iastate.edu/~dnett/S511/Diabetes.txt")
head(d)
  subject  age acidity   y
1       1  5.2    -8.1 4.8
2       2  8.8   -16.1 4.1
3       3 10.5    -0.9 5.2
4       4 10.6    -7.8 5.5
5       5 10.4   -29.0 5.0
6       6  1.8   -19.2 3.4

#Variables are
#subject: subject ID number
#age: age diagnosed with diabetes
#acidity: a measure of acidity called base deficit
#y: natural log of serum C-peptide concentration
#Original source is Sockett et al. (1987)
#mentioned in Hastie and Tibshirani's book
#"Generalized Additive Models".


Again consider the model

$y_i = f(x_i) + e_i \quad (i = 1, \ldots, n)$, where $e_1, \ldots, e_n$ are independent with mean 0 and $f$ is some unknown smooth function.


Some Strategies for Choosing the Smoothing Parameter

1. Cross-Validation (CV)

2. Generalized Cross-Validation (GCV)

3. Linear Mixed Effects Model Approach

There are other approaches, but we will restrict our discussion to the methods above.


1. Cross-Validation (CV): CV is a general strategy for choosing "tuning" parameters like our smoothing parameter $\lambda^2$. These are parameters whose values are not of interest except for the fact that they affect estimates of the model parameters that are of interest.


We will talk specifically about leave-one-out cross-validation, which is a special case of cross-validation. This approach is known as PRESS (PRediction Error Sum of Squares) when it is used to select variables in multiple regression.
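As a rough illustration of leave-one-out cross-validation for a penalized linear spline, the sketch below uses simulated data and base R only. It assumes the fit minimizes the residual sum of squares plus `lambda2` times the sum of squared knot coefficients, as in Ruppert et al. (2003). The data, the knot placement, the grid of candidate values, and all function names are hypothetical. A generalized cross-validation score in one common form is included for comparison.

```r
set.seed(511)

# Simulated data; sin() is a hypothetical stand-in for the unknown f.
n <- 80
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)

# Linear truncated-line basis: 1, x, (x - k)_+ for 10 interior knots.
knots <- quantile(x, seq(0, 1, length.out = 12))[-c(1, 12)]
C <- cbind(1, x, pmax(outer(x, knots, "-"), 0))
D <- diag(c(0, 0, rep(1, length(knots))))  # penalize knot coefficients only

# Smoother matrix S, so that the fitted values are S %*% y.
smoother <- function(lambda2) C %*% solve(crossprod(C) + lambda2 * D, t(C))

# Brute force: delete case i, refit, predict y[i], and sum squared errors.
cv_brute <- function(lambda2) {
  press <- 0
  for (i in seq_len(n)) {
    Ci <- C[-i, , drop = FALSE]
    b <- solve(crossprod(Ci) + lambda2 * D, crossprod(Ci, y[-i]))
    press <- press + (y[i] - sum(C[i, ] * b))^2
  }
  press
}

# Shortcut: the same quantity from a single fit, using the diagonal of S.
cv_loo <- function(lambda2) {
  S <- smoother(lambda2)
  sum(((y - drop(S %*% y)) / (1 - diag(S)))^2)
}

# GCV in one common form: RSS / (1 - tr(S) / n)^2.
gcv <- function(lambda2) {
  S <- smoother(lambda2)
  sum((y - drop(S %*% y))^2) / (1 - sum(diag(S)) / n)^2
}

# Evaluate both criteria on a grid and pick the minimizers.
grid <- 10^seq(-3, 3, length.out = 25)
cv_values <- sapply(grid, cv_loo)
gcv_values <- sapply(grid, gcv)
lambda2_cv <- grid[which.min(cv_values)]
lambda2_gcv <- grid[which.min(gcv_values)]
```

The brute-force and shortcut versions agree for this kind of penalized least squares fit, so the shortcut avoids refitting the model n times.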

[Three figures, annotated DF=10, DF=2, and DF=3.59]
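The DF values annotated on the figures can be read, under the definition used in Ruppert et al. (2003), as the trace of the smoother matrix. The sketch below assumes that definition and a linear truncated-line basis with 8 knots. The simulated `x` values, the quantile-based knot placement, and the helper `df_fit` are hypothetical and are not taken from the slides or from SemiPar.

```r
set.seed(1)

# Hypothetical predictor values; only x matters for the degrees of freedom.
n <- 43
x <- sort(runif(n, 1, 16))

# 8 interior knots at quantiles of x (an assumed placement).
knots <- quantile(unique(x), seq(0, 1, length.out = 10))[-c(1, 10)]

# Columns: intercept, slope, and one truncated line per knot.
C <- cbind(1, x, pmax(outer(x, knots, "-"), 0))
D <- diag(c(0, 0, rep(1, length(knots))))  # intercept and slope unpenalized

# Degrees of freedom of the fit: trace of C (C'C + lambda2 D)^{-1} C'.
df_fit <- function(lambda2) {
  S <- C %*% solve(crossprod(C) + lambda2 * D, t(C))
  sum(diag(S))
}

df_none <- df_fit(0)     # no penalty: approximately 2 + 8 = 10
df_heavy <- df_fit(1e8)  # heavy penalty: close to 2 (a straight line)

# Find the lambda2 that gives a target DF, searching on the log scale.
target <- 3.59
root <- uniroot(function(ll) df_fit(exp(ll)) - target,
                interval = c(-10, 20), tol = 1e-10)
lambda2_hat <- exp(root$root)
```

With 8 knots the degrees of freedom move from about 10 with no penalty down toward 2 as the penalty grows, which would be consistent with the DF=10 and DF=2 annotations if those figures show the two extremes. The `lambda2_hat` found here depends on the simulated `x` and is not expected to reproduce the value reported for the diabetes data.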


The source for the information in these slides is Ruppert, D., Wand, M.P., Carroll, R.J. (2003). Semiparametric Regression. Cambridge University Press, New York.


d=read.delim("http://www.public.iastate.edu/~dnett/S511/Diabetes.txt")
head(d)
  subject  age acidity   y
1       1  5.2    -8.1 4.8
2       2  8.8   -16.1 4.1
3       3 10.5    -0.9 5.2
4       4 10.6    -7.8 5.5
5       5 10.4   -29.0 5.0
6       6  1.8   -19.2 3.4

#Variables are
#subject: subject ID number
#age: age diagnosed with diabetes
#acidity: a measure of acidity called base deficit
#y: natural log of serum C-peptide concentration
#Original source is Sockett et al. (1987)
#mentioned in Hastie and Tibshirani's book
#"Generalized Additive Models".


#First install the package SemiPar.
#Then issue the following commands.

#Load the package SemiPar.
library(SemiPar)

#spm does not allow a data argument.
o=spm(d$y~f(d$age,basis="trunc.poly",degree=1))
summary(o)

Summary for non-linear components:

           df  spar knots
f(d$age) 3.59 5.705     8

Note this includes 1 df for the intercept.


plot(d$age,d$y,pch=19,col=4,
     xlab="Age at Diagnosis",
     ylab="Log C-Peptide Concentration",
     main=expression(paste("Linear Spline Fit with ",lambda^2,"=5.7")))
lines(o,shade=F,se=F)


plot(o)

 

#Load the data set fossil that comes
#with the SemiPar package.
data(fossil)
head(fossil)
        age strontium.ratio
1  91.78525        0.707343
2  92.39579        0.707359
3  93.97061        0.707410
4  95.57577        0.707438
5  95.60286        0.707463
6 112.33691        0.707320
dim(fossil)
[1] 106   2


#Shows relationship between strontium
#ratios of ocean fossils and their age
#in millions of years. The dip just less
#than 115 million years ago coincides
#with the mid-plate volcanic activity.
#See Bralower et al. (1997).
#Mid-Cretaceous strontium isotope
#stratigraphy of deep-sea sections.
#Geological Society of America Bulletin
#109, 1421-1442.
plot(fossil)


y=fossil$strontium.ratio
x=fossil$age
o=spm(y~f(x,basis="trunc.poly",degree=1))
summary(o)

Summary for non-linear components:

        df  spar knots
f(x) 12.76 1.324    25

Note this includes 1 df for the intercept.

plot(fossil)
lines(o,se=F)


#Try penalized quadratic splines
#rather than linear splines.
o=spm(y~f(x,basis="trunc.poly",degree=2))
summary(o)

Summary for non-linear components:

        df  spar knots
f(x) 10.06 2.243    25

Note this includes 1 df for the intercept.

plot(fossil)
lines(o,se=F)
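To show what changing the degree from 1 to 2 might mean for a truncated polynomial basis, here is a minimal base R sketch. It is not SemiPar's internal code: the helpers `trunc_poly_basis` and `pen_spline_fit`, the simulated data, the knot placement, and the `penalty` multiplier are all hypothetical, and the way SemiPar scales its smoothing parameter may differ.

```r
set.seed(511)

# Truncated polynomial basis of a given degree (degree >= 1):
# X holds 1, x, ..., x^degree and Z holds (x - k)_+^degree for each knot k.
trunc_poly_basis <- function(x, knots, degree = 1) {
  X <- outer(x, 0:degree, "^")
  Z <- pmax(outer(x, knots, "-"), 0)^degree
  list(X = X, Z = Z)
}

# Penalized least squares fit; only the knot coefficients are penalized.
pen_spline_fit <- function(x, y, knots, degree, penalty) {
  B <- trunc_poly_basis(x, knots, degree)
  C <- cbind(B$X, B$Z)
  D <- diag(c(rep(0, degree + 1), rep(1, length(knots))))
  coefs <- solve(crossprod(C) + penalty * D, crossprod(C, y))
  drop(C %*% coefs)
}

# Simulated data on [0, 1]; the true curve is a hypothetical choice.
n <- 100
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
knots <- quantile(x, seq(0, 1, length.out = 12))[-c(1, 12)]

fit_linear <- pen_spline_fit(x, y, knots, degree = 1, penalty = 0.01)
fit_quadratic <- pen_spline_fit(x, y, knots, degree = 2, penalty = 0.01)

plot(x, y, pch = 19, col = 4)
lines(x, fit_linear, lwd = 2)
lines(x, fit_quadratic, lwd = 2, lty = 2)
```

The linear fit is piecewise linear with kinks at the knots, while the quadratic fit has a continuous first derivative, so it tends to look smoother for a comparable amount of flexibility.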


#The next set of notes covers the lowess
#(or loess) smoother.
o=lowess(y~x,f=.2)
plot(fossil)
lines(o,lwd=2)
#See also the function 'loess' which has more
#capabilities than 'lowess'.
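As one example of the extra capabilities of 'loess', the sketch below predicts on a grid and draws rough pointwise standard error bands. It uses the `cars` data set that ships with R as a stand-in so that it runs without SemiPar; the span, the degree, and the band width of two standard errors are illustrative choices, not values from these notes.

```r
# cars (speed, dist) is a built-in data set, used here only as a stand-in.
fit <- loess(dist ~ speed, data = cars, span = 0.5, degree = 1)

# Predict on an evenly spaced grid, asking for standard errors as well.
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                               length.out = 50))
pred <- predict(fit, newdata = grid, se = TRUE)

# Scatterplot, fitted curve, and approximate +/- 2 standard error bands.
plot(cars, pch = 19, col = 4)
lines(grid$speed, pred$fit, lwd = 2)
lines(grid$speed, pred$fit + 2 * pred$se.fit, lty = 2)
lines(grid$speed, pred$fit - 2 * pred$se.fit, lty = 2)
```

Note that the defaults differ between the two functions: 'lowess' fits local lines with robustness iterations, while 'loess' defaults to local quadratics without them, so the two will not generally give identical curves even with matching span settings.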
