biostatistics-lecture 10 regression
TRANSCRIPT
![Page 1: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/1.jpg)
Biostatistics-Lecture 10 Regression
Ruibin Xi
Peking University
School of Mathematical Sciences
![Page 2: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/2.jpg)
Analysis of Variance (ANOVA)
• Consider the Iris data again
• Want to see if the average sepal widths of the three species are the same
– μ1 , μ2, μ3 : the mean sepal width of Setosa, Versicolor, Virginica
– Hypothesis:
H0: μ1 = μ2= μ3
H1: at least one mean is different
![Page 3: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/3.jpg)
Analysis of Variance (ANOVA)
• Used to compare ≥ 2 means
• Definitions
– Response variable (dependent)—the outcome of interest, must be continuous
– Factors (independent)—variables by which the groups are formed and whose effect on response is of interest, must be categorical
– Factor levels—possible values the factors can take
![Page 4: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/4.jpg)
Sources of Variation in One-Way ANOVA
• Partition the total variability of the outcome into components—source of variation
•
– the sepal width of the jth plant from the ith species (group)
–
jiy , jnjki 1,1
)()( yyyyyy iiijij
Grand mean The ith group mean
![Page 5: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/5.jpg)
Sources of Variation in One-Way ANOVA
• SST: sum of squares total
• SSB: sum of squares between
• SSW (SSE): sum of squares within (error)
2
1 1 k
i
n
j ij
i
yySSWSSBSST
2
1 k
i ii yynSSB
2
1 1 k
i
n
j iij
j
yySSW
![Page 6: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/6.jpg)
F-test in one-way ANOVA
• The test statistic is called F-statistic
Under the null hypothesis, follows an F-distribution with
(df1,df2) = (k-1,n-k)
• For the Iris data – SSB=11.34, MSB = 5.67, SSE=16.96, MSE=0.12 – f = 49.16, df1=2,df2=147 – Critical value 3.06 at α=0.05, reject the null – Pvalue = P(F>f)=4.49e-17
)/(
)1/(
knSSE
kSSB
MSE
MSBF
![Page 7: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/7.jpg)
One-way ANOVA
• ANOVA table
![Page 8: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/8.jpg)
One-way ANOVA
• ANOVA table
![Page 9: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/9.jpg)
ANOVA model
• The statistical model
Yij = μ + αi + eij
The ith response in the jth group
grand mean
The effect of group j
error
![Page 10: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/10.jpg)
ANOVA assumptions
• Normality
• Homogeneity
• Independence
![Page 11: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/11.jpg)
Regression—an example
• Cystic fibrosis (囊胞性纤维症) lung function data
– PEmax (maximal static expiratory pressure) is the response variable
– Potential explanatory variables
• age, sex, height, weight,
• BMP (body mass as a percentage of the age‐specific median)
• FEV1 (forced expiratory volume in 1 second)
• RV (residual volume)
• FRC(funcAonal residual capacity)
• TLC (total lung capacity)
![Page 12: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/12.jpg)
Regression—an example
• Let’s first concentrate on the age variable
• The model
• Plot PEmax vs age
![Page 13: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/13.jpg)
Regression—an example
• Let’s first concentrate on the age variable
• The model
• Plot PEmax vs age
![Page 14: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/14.jpg)
Simple Linear regression
![Page 15: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/15.jpg)
Assumptions
• Normality
– Given x, the distribution of y is normal with mean α+βx with standard deviation σ
• Homogeneity
– σ does not depend on x
• Independence
![Page 16: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/16.jpg)
Residuals
![Page 17: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/17.jpg)
Fitting the model
![Page 18: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/18.jpg)
Fitting the model
![Page 19: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/19.jpg)
Goodness of Fit
![Page 20: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/20.jpg)
Inference about β
![Page 21: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/21.jpg)
Inference about β
![Page 22: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/22.jpg)
Inference about β: the CF data
![Page 23: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/23.jpg)
Plotting the regression line
![Page 24: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/24.jpg)
R2
![Page 25: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/25.jpg)
Residual plot
![Page 26: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/26.jpg)
Residual plot
![Page 27: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/27.jpg)
Residual plot
![Page 28: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/28.jpg)
Residual plot
• The CF patients data
![Page 29: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/29.jpg)
Linear Regression
![Page 30: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/30.jpg)
Summary: simple linear regression
![Page 31: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/31.jpg)
Multiple regression
• See blackboard
![Page 32: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/32.jpg)
GENERALIZED LINEAR MODELS Regression
![Page 33: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/33.jpg)
Generalized liner models
• GLM allow for response distributions other than normal
– Basic structure
• The distribution of Y is usually assumed to be independent and
![Page 34: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/34.jpg)
Generalized Linear model-an example
• An example
– A study investigated the roadkills of amphibian
• Response variable: the total number of amphibian fatalities per segment (500m)
• Explanatory variables
![Page 35: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/35.jpg)
Generalized Linear model-an example
• An example
– A study investigated the roadkills of amphibian
• Response variable: the total number of amphibian fatalities per segment (500m)
• Explanatory variables
![Page 36: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/36.jpg)
Generalized Linear model-an example
• An example
– A study investigated the roadkills of amphibian
• Response variable: the total number of amphibian fatalities per segment (500m)
• Explanatory variables
• For now, we are only interested in Distance to Natural Park
![Page 37: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/37.jpg)
Generalized Linear model-an example
• An example
– A study investigated the roadkills of amphibian
• Response variable: the total number of amphibian fatalities per segment (500m)
• Explanatory variables
• For now, we are only interested in Distance to Natural Park
![Page 38: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/38.jpg)
Generalized Linear model-an example
• An over-simplified model
• For Poisson we have
![Page 39: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/39.jpg)
GLM-exponential families
• Exponential families
– The density
– Normal distributions is an exponential family
![Page 40: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/40.jpg)
GLM-exponential families
• Consider the log-likelihood of a general exponential families
Since
![Page 41: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/41.jpg)
GLM-exponential families
• Differentiating the likelihood one more time
using the equation
We often assume
Define
![Page 42: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/42.jpg)
GLM-exponential families
![Page 43: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/43.jpg)
Fitting the GLM
• In a GLM
• The joint likelihood is
• The log likelihood
![Page 44: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/44.jpg)
Fitting the GLM
• Assuming ( is known)
• Differentiating the log likelihood and setting it to zero
![Page 45: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/45.jpg)
Fitting the GLM
• By the chain rule
• Since
• We have
)( ii b
![Page 46: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/46.jpg)
Canonical Link Function
• The Canonical Link Function is such that
• Remember that
• So
)( ii b
![Page 47: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/47.jpg)
Iteratively Reweighted Least Squares (IRLS)
• We first consider fitting the nonlinear model
by minimizing
where f is a nonlinear function
![Page 48: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/48.jpg)
Iteratively Reweighted Least Squares (IRLS)
• Given a good guess
by using the Taylor expansion
Define the pseudodata
![Page 49: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/49.jpg)
Iteratively Reweighted Least Squares (IRLS)
• Note that in GLM, we are trying to solve
• If are known, this is equivalent to minimizing
![Page 50: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/50.jpg)
Iteratively Reweighted Least Squares (IRLS)
• We are inspired to use the following algorithm
– At the kth iteration, define
•
• update as in the nonlinear model case
• set k to be k+1
– But the second step also involves iteration, we may perform one step iteration here to obtain
![Page 51: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/51.jpg)
Deviance
• The deviance is defined as
where is the log likelihood with the saturated model: the model with one parameter per data point
Also note that deviance is defined to be independent of
![Page 52: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/52.jpg)
Deviance
• GLM does not have R2
• The closest one is the explained deviance
• The over-dispersion parameter may be estimated by
or by the Pearson statistic
![Page 53: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/53.jpg)
Model comparison
• Scaled deviance
• For the hypothesis testing problem
under H0
![Page 54: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/54.jpg)
Residuals
• Pearson Residuals
– Approximately zero mean and variance
• Deviance Residuals
![Page 55: Biostatistics-Lecture 10 Regression](https://reader031.vdocuments.site/reader031/viewer/2022020620/61e3ffde1649564a821b294b/html5/thumbnails/55.jpg)
Negative binomial
• Density