
  • Mixed Models based on Likelihood Boosting

    Florian Reithinger

    München 2006

  • Mixed Models based on Likelihood Boosting

    Florian Reithinger

    Dissertation at the Faculty of Mathematics, Informatics and Statistics

    of the Ludwig-Maximilians-Universität München

    submitted by Florian Reithinger

    from Munich

    Munich, 28 September 2006

  • First referee: Prof. Dr. Gerhard Tutz
    Second referee: PD Dr. Christian Heumann
    Date of oral examination: 20 December 2006

  • Contents

    1 Introduction
    1.1 Mixed Models and Boosting
    1.2 Guideline through This Thesis

    2 Linear Mixed Models
    2.1 Motivation: CD4 Data
    2.2 The Model
    2.3 The Restricted Log-Likelihood
    2.3.1 The Maximum Likelihood Method
    2.4 Estimation with Iteratively Weighted Least Squares
    2.5 Estimation with EM-Algorithm - The Laird-Ware Method
    2.6 Robust Linear Mixed Models

    3 Semi-Parametric Mixed Models
    3.1 Short Review on Splines in Semi-Parametric Mixed Models
    3.1.1 Motivation: The Interpolation Problem
    3.1.2 Popular Basis Functions
    3.1.3 Motivation: Splines and the Concept of Penalization - Smoothing Splines
    3.1.4 Motivation: The Concept of Regression Splines
    3.1.5 Identification Problems: The Need of a Semi-Parametric Representation
    3.1.6 Singularities: The Need of a Regularization of Basis Functions
    3.2 The Model
    3.3 Penalized Maximum Likelihood Approach
    3.4 Mixed Model Approach to Smoothing
    3.5 Boosting Approach to Additive Mixed Models
    3.5.1 Short Review of Likelihood Boosting
    3.5.2 The Boosting Algorithm for Mixed Models
    3.5.3 Stopping Criteria and Selection in BoostMixed
    3.5.4 Standard Errors
    3.5.5 Visualizing Variable Selection in Penalty Based Approaches
    3.6 Simulation Studies
    3.6.1 BoostMixed vs. Mixed Model Approach
    3.6.2 Pointwise Confidence Band for BoostMixed
    3.6.3 Choosing an Appropriate Smoothing Parameter and an Appropriate Selection Criterion
    3.6.4 Surface-Smoothing
    3.7 Application
    3.7.1 CD4 Data

    4 Extending Semi-Structured Mixed Models to Incorporate Cluster-Specific Splines
    4.1 General Model with Cluster-Specific Splines
    4.1.1 The Boosting Algorithm for Models with Cluster-Specific Splines
    4.2 Simulation
    4.3 Application of Cluster-Specific Splines
    4.3.1 Jimma Data: Description
    4.3.2 Jimma Data: Analysis with Cluster-Specific Splines
    4.3.3 Jimma Data: Visualizing Variable Selection
    4.3.4 Ebay Auctions: Description
    4.3.5 Ebay Data: Mixed Model Approach vs. Penalized Splines: Prognostic Performance
    4.3.6 Ebay Data: Final Model
    4.3.7 Canadian Weather Stations: Description and Model

    5 Generalized Linear Mixed Models
    5.1 Motivation: The European Patent Data
    5.2 The Model
    5.3 Numerical Integration Tools
    5.4 Methods for Crossed Random Effects
    5.4.1 Penalized Quasi Likelihood Concept
    5.4.2 Bias Correction in Penalized Quasi Likelihood
    5.4.3 Alternative Direct Maximization Methods
    5.4.4 Indirect Maximization using EM-Algorithm
    5.5 Methods for Clustered Data
    5.5.1 Gauss-Hermite Quadrature
    5.5.2 Adaptive Gauss-Hermite Quadrature
    5.5.3 Gauss-Hermite Quadrature using EM-Algorithm

    6 Generalized Semi-Structured Mixed Models
    6.1 The Model
    6.1.1 The Penalized Likelihood Approach
    6.2 Boosted Generalized Additive Mixed Models - bGAMM
    6.2.1 Stopping Criteria
    6.2.2 Simulation Study
    6.3 Application of the European Patent Data

    7 Summary and Perspectives

    Appendix A: Splines
    A.1 Solving Singularities
    A.1.1 Truncated Power Series for Semi-Parametric Models
    A.1.2 Parametrization of α and β Using Restrictions
    A.1.3 Parametrization of α and β Using Mixed Models
    A.2 Smoothing with Mixed Models

    Appendix B: Parametrization of Covariance Structures
    B.1 Independent Identical
    B.2 Independent but Not Identical
    B.3 Unstructured

    Appendix C: Simulation Studies
    C.1 Mixed Model Approach vs. BoostMixed
    C.2 Choosing an Appropriate Smoothing Parameter and an Appropriate Selection Criterion
    C.2.1 BIC as Selection/Stopping Criterion
    C.2.2 AIC as Selection/Stopping Criterion
    C.3 Linear BoostMixed
    C.4 Boosted GAMM - Poisson
    C.5 Boosted GLMM - Binomial

    Bibliography

  • List of Tables

    3.1 Study 1: Comparison between additive mixed model fit and BoostMixed (ϱ = 0.1)
    3.2 Study 2: Comparison between additive mixed model fit and BoostMixed (ϱ = 0.5)
    3.3 Simulation study 8: Mixed model approach vs. BoostMixed
    3.4 MACS: Estimates computed with mixed model approach and BoostMixed
    4.1 Simulation study: Estimated covariance matrices $\hat Q := Q(\hat\rho)$
    4.2 Simulation study: MSE for BoostMixed vs. cluster-specific splines
    4.3 Jimma study: Effects of categorical covariates in the Jimma study
    4.4 Jimma study: Covariance matrix for random intercept and slope for the Jimma data
    4.5 Ebay study: Estimated covariance matrix
    4.6 Weather study: Estimated covariance matrix
    6.1 Simulation study: Generalized additive model and Poisson data
    6.2 Simulation study: Generalized linear mixed model and binomial data
    6.3 Summary statistics for the response considering small companies
    6.4 Patent study: Summary statistics
    6.5 Patent study: Estimated effects and variance
    C.1 Study 1
    C.2 Study 2
    C.3 Study 3
    C.4 Study 4
    C.5 Study 5
    C.6 Study 6
    C.7 Study 7
    C.8 Study 9
    C.9 Study 10
    C.10 Study 11
    C.11 Study 12
    C.12 Study 13
    C.13 Study 14
    C.14 Study 15 - AIC
    C.15 Study 15 - BIC
    C.16 Study 16 - AIC
    C.17 Study 16 - BIC
    C.18 Study 17 - AIC
    C.19 Study 17 - BIC
    C.20 Study 18 - AIC
    C.21 Study 18 - BIC
    C.22 Study 19 - AIC
    C.23 Study 19 - BIC
    C.24 Study 20 - AIC
    C.25 Study 20 - BIC
    C.26 Study 21 - AIC
    C.27 Study 21 - BIC
    C.28 Study 22 - AIC
    C.29 Study 22 - BIC
    C.30 Study 23 - AIC
    C.31 Study 23 - BIC
    C.32 Study 24 - AIC
    C.33 Study 24 - BIC
    C.34 Study 25 - AIC
    C.35 Study 25 - BIC
    C.36 Study 26 - AIC
    C.37 Study 26 - BIC

  • List of Figures

    3.1 Visualization of the interpolation problem
    3.2 Spline solution for the interpolation problem
    3.3 B-splines (no penalization)
    3.4 B-splines (penalization)
    3.5 Additive or semi-parametric view of functions
    3.6 Computation of the generalized coefficient build-up
    3.7 Coefficient build-up
    3.8 Estimated effects for the coefficient build-up simulation study
    3.9 Study 1: Estimated smooth curves
    3.10 Study 1: Boxplots
    3.11 Study 2: Boxplots
    3.12 Simulation study: Pointwise confidence bands for datasets with n = 80
    3.13 Simulation study: Pointwise confidence bands for datasets with n = 40
    3.14 Simulation study 7: The influence of λ on the MSE for 3 smooth effects, c = 0.5
    3.15 Simulation study 7: The influence of λ on the MSE for different smooth effects, c = 0.5
    3.16 Simulation study 8: Surface plot for tensor splines
    3.17 Simulation study 8: Level plot for tensor splines
    3.18 MACS study: Smoothed time effects
    3.19 MACS study: Smoothed time effects with random intercept
    3.20 MACS: Smoothed effects for age, illness score and time
    4.1 Simulation study: Cluster-specific splines with random intercept
    4.2 Simulation study: Cluster-specific splines without random intercept
    4.3 Jimma study: Evolution of weight with respect to increasing age
    4.4 Jimma study: Subject-specific infant curves (observed and predicted)
    4.5 Jimma study: Smoothed effects for age of children and age of the mother
    4.6 Jimma study: Generalized coefficient build-up for parametric and semi-parametric model
    4.7 Ebay study: Bid history
    4.8 Ebay study: Scatterplot for bids
    4.9 Ebay study: Three individual bids
    4.10 Ebay study: Estimated spline function
    4.11 Ebay study: Cluster-specific spline functions
    4.12 Canadian weather study: Monthly temperatures for 16 selected Canadian weather stations
    4.13 Canadian weather study: Temperatures for the Canadian weather stations depending on precipitation
    5.1 Numerical integration methods based on integration points
    6.1 Patent data: Estimated smooth effects for the patent data
    C.1 Simulation study 1: MSE of BoostMixed and mixed model approach
    C.2 Simulation study 6: MSE of BoostMixed and mixed model approach
    C.3 Simulation study 1: MSE of BoostMixed and mixed model approach
    C.4 Simulation study 2: MSE of BoostMixed and mixed model approach
    C.5 Simulation study 3: MSE of BoostMixed and mixed model approach
    C.6 Simulation study 4: MSE of BoostMixed and mixed model approach
    C.7 Simulation study 7: The influence of λ on the MSE for different parameters, c = 0.5, BIC
    C.8 Simulation study 7: The influence of λ on the MSE for different parameters, c = 1, BIC
    C.9 Simulation study 7: The influence of λ on the MSE for different parameters, c = 5, BIC
    C.10 Simulation study 7: The influence of λ on the MSE for different parameters, c = 0.5, AIC
    C.11 Simulation study 7: The influence of λ on the MSE for different parameters, c = 1, AIC
    C.12 Simulation study 7: The influence of λ on the MSE for different parameters, c = 5, AIC

  • Notation

    Mixed model notation

    $y_{it} = x_{it}^T\beta + z_{it}^T b_i + \varepsilon_{it} = x_{it}^T\beta + \sum_{j=1}^{c} z_{itj}\,b_i^{(j)} + \varepsilon_{it}, \quad i \in \{1,\dots,n\},\; t \in \{1,\dots,T_i\}$

    $y_i = X_i\beta + Z_i b_i + \varepsilon_i, \quad i \in \{1,\dots,n\}$

    $y = X\beta + Zb + \varepsilon = X\beta + \sum_{j=1}^{c} Z_{.(j)}\,b^{(j)} + \varepsilon$

    $y_{(i)} = x_{(i)}^T\beta + z_{(i)}^T b + \varepsilon_{(i)}, \quad i \in \{1,\dots,N\}$

    $y_{it}$: Response of cluster i at observation t
    $h(\cdot)$: Response function, inverse of the link function g
    $x_{it}$: Design vector for fixed effects (cluster i, measurement t)
    $z_{it}^T = [z_{it1},\dots,z_{itc}]$: Design vector for random effects (cluster i, measurement t)
    $X_i$: Design matrix for fixed effects (cluster i)
    $Z_i$: Design matrix for random effects (cluster i)
    n: Clusters in total
    $N = \sum_{i=1}^{n} T_i$: Observations in total
    $X^T = (X_1^T,\dots,X_n^T)$: Design matrix for fixed effects (complete dataset)
    $Z$: Design matrix for random effects (usually the block-diagonal version for the complete dataset); $Z = \mathrm{bdiag}(Z_1,\dots,Z_n)$ in longitudinal settings, $(Z_1^T,\dots,Z_n^T)^T$ in the stacked version
    $x_{(i)}^T$: i-th row of X
    $z_{(i)}^T$: i-th row of Z
    $y_{(i)}$: i-th element of y
    $\beta^T = (\beta_1,\dots,\beta_p)$: Parameter vector for fixed effects
    $b_i$: Vector of random effects for cluster i
    $b = (b_1^T,\dots,b_n^T)^T$: Vector of random effects
    $p(b;\rho)$: Mixing density
    $p(a;\rho)$: Standardized mixing density with standardized random variable a
    $\rho$: Parameter vector for the covariance structure of the random effects and nuisance parameters
    $Q(\rho)$: Covariance matrix of the random effect $b_i$
    $\tilde Q(\rho)$: Covariance matrix of the random effect b
    $V_i := V_i(\vartheta)$: Marginal covariance of cluster i
    $V := V(\vartheta)$: Marginal covariance over all clusters
    $\varrho$: Correlation between two covariates
    $f(\cdot)$: Density or conditional density of y, or of y given b
    $q = \sum_{j=1}^{c} q_j$: Dimension of b; $q_j$ is the dimension of $b^{(j)}$
    $E(\cdot)$: Expectation
    $\sigma_\varepsilon^2$: Error term variance, $\varepsilon_{it} \sim N(0,\sigma_\varepsilon^2)$
    $\vartheta$: d-dimensional vector of parameters for the variance components
    $\mathrm{trace}(\cdot)$: Trace of a matrix
    $f(y,b) = f(y|b)\,p(b;\rho)$: Joint density of y and b
    $\mathrm{rows}(A,i,j)$: Submatrix of matrix A from row i to row j
    $\mathrm{elem}(y,i,j)$: Subvector of vector y from element i to element j
    $\mathrm{vec}(A)$: Symmetric direct operator on a symmetric matrix A; vector of the lower triangular entries of A
    $\mathrm{vech}(A)$: Vech operator on a symmetric matrix A; vector built from the rows of A
    c: Number of components of the random effects design matrix
    $Z_{.(j)}$: Partitioned random effects design matrix associated with component j
    $b^{(j)}$: Partitioned random effect associated with component j

  • Notation

    Additive mixed model notation

    $y_{it} = x_{it}^T\beta + \sum_{j=1}^{m}\Phi_{itj}^T\alpha_j + z_{it}^T b_i + \varepsilon_{it}, \quad i \in \{1,\dots,n\},\; t \in \{1,\dots,T_i\}$

    $y_i = X_i\beta + \sum_{j=1}^{m}\Phi_{ij}\alpha_j + Z_i b_i + \varepsilon_i = X_i\beta + \Phi_{i.}\alpha + Z_i b_i + \varepsilon_i, \quad i \in \{1,\dots,n\}$

    $y = X\beta + \Phi\alpha + Zb + \varepsilon = X\beta + \sum_{j=1}^{m}\Phi_{.j}\alpha_j + Zb + \varepsilon$

    $y_{(i)} = x_{(i)}^T\beta + \Phi_{(i)}^T\alpha + z_{(i)}^T b + \varepsilon_{(i)} = x_{(i)}^T\beta + \sum_{j=1}^{m}(\phi^{(j)}(u_{(i)j}))^T\alpha_j + z_{(i)}^T b + \varepsilon_{(i)} = x_{(i)}^T\beta + \sum_{j=1}^{m}\Phi_{(i)j}^T\alpha_j + z_{(i)}^T b + \varepsilon_{(i)}, \quad i \in \{1,\dots,N\}$

    $\alpha^{(j)}(\cdot)$: Unspecified j-th function
    $u_{itj}$: Measured covariate for the j-th unspecified function in cluster i at measurement t
    $\alpha^{(j)}(u_{itj})$: Function evaluation of the measured covariate for the j-th function $\alpha^{(j)}(\cdot)$ in cluster i at measurement t
    M: Dimension of the spline basis
    m: Number of unspecified functions
    $\phi_s^{(j)}(\cdot)$: s-th basis function for variable j
    $\phi^{(j)}(\cdot)^T = (\phi_1^{(j)}(\cdot),\dots,\phi_M^{(j)}(\cdot))$: Basis functions for variable j (M-dimensional vector)
    $\Phi_{itj} = \phi^{(j)}(u_{itj})$: Function evaluation of covariate $u_{itj}$ (vector)
    $u_{i.j}^T = (u_{i1j},\dots,u_{iT_ij})$: Vector of covariates needed for function j in cluster i
    $\Phi_{ij} = \Phi_{i.j} = (\Phi_{i1j},\dots,\Phi_{iT_ij})^T$: Matrix of elementwise basis function evaluations for the j-th function of covariates $u_{i.j}$
    $u_{..j}^T = (u_{1.j}^T,\dots,u_{n.j}^T)$: Vector of covariates needed for function j (complete dataset)
    $\Phi_{.j} := \Phi_{..j} = (\Phi_{1.j}^T,\dots,\Phi_{n.j}^T)^T$: Matrix of elementwise basis function evaluations for the j-th function of covariates $u_{..j}$
    $\Phi_{i.} := \Phi_{i..} = (\Phi_{i.1},\dots,\Phi_{i.m})$: Matrix of basis function evaluations for covariates $u_{i.1},\dots,u_{i.m}$ in cluster i
    $\Phi := \Phi_{...} = (\Phi_{1..}^T,\dots,\Phi_{n..}^T)^T = (\Phi_{..1},\dots,\Phi_{..m})$: Matrix of basis function evaluations of all covariates
    $\alpha_j$: M-dimensional vector of basis coefficients needed for the approximation $\alpha^{(j)}(u) = (\phi^{(j)}(u))^T\alpha_j$
    $\alpha^T = (\alpha_1^T,\dots,\alpha_m^T)$: Vector of all basis coefficients
    $\Phi_{(i)}$: i-th row of the matrix $\Phi_{...}$
    K: Penalty matrix for all components including fixed effects
    $u_{(i)j}$: i-th entry of the vector $u_{..j}$
    λ: Smoothing parameter
    $\tilde X_i = [X_i, \Phi_{i.1},\dots,\Phi_{i.m}]$: Generalized design matrix for fixed and smooth effects
    $\phi^{(i,j)}(\cdot) = \phi^{(i)}(\cdot)\otimes\phi^{(j)}(\cdot)$: Elementwise Kronecker product of $\phi^{(i)}(\cdot)$ and $\phi^{(j)}(\cdot)$
    $\alpha_s^{(j)}$: Coefficient s for the j-th smooth component
    $\Phi_{(i)j}$: i-th row of $\Phi_{.j}$

  • Notation

    Boosted additive mixed model notation

    $\tilde X_{i(r)} = [X_i, \Phi_{i.r}]$: Design matrix for the r-th component including fixed effects
    $K_r$: Penalty matrix for the r-th component including fixed effects
    $\hat\gamma_r$: Weak learner for the r-th component including fixed effects in a boosting step
    $\hat\beta^{(l)}, \hat\alpha^{(l)}, \hat\gamma^{(l)}, \hat\eta^{(l)}$: Ensemble estimates in the l-th boosting step
    $\hat\eta_{i(r)}^{(l)}$: Predictor using the r-th component in boosting step l in cluster i
    $M_r^{(l)}$: Projection matrix of the residuals on component r of the l-th boosting step to the weak learner $\hat\gamma_r$, given the selection before
    $H_r^{(l)}$: Local hat matrix for the projection of the residuals on component r of the l-th boosting step to the predictor $\hat\eta_r^{(l)}$, given the selection before
    $G_r^{(l)}$: Global hat matrix for the projection of y on component r in the l-th boosting step to the predictor $\hat\eta_r^{(l)}$, given the selection before
    $j_l$: Selected component in the l-th boosting step
    $S_r^{(l)}$: Selection criterion in the l-th boosting step using the trace of $G_r^{(l)}$
    $M^{(l)} := M_{j_l}^{(l)}$: Short notation if component $j_l$ was selected in the l-th boosting step
    $H^{(l)} := H_{j_l}^{(l)}$: Short notation if component $j_l$ was selected in the l-th boosting step
    $G^{(l)} := G_{j_l}^{(l)}$: Short notation if component $j_l$ was selected in the l-th boosting step
    k: Number of flexible splines
    $R_i := [\Phi_{i1}\alpha_1,\dots,\Phi_{ik}\alpha_k]$: Random design matrix for flexible splines
    $R_i^{(l)} := [\Phi_{i1}\hat\alpha_1^{(l)},\dots,\Phi_{ik}\hat\alpha_k^{(l)}]$: Random design matrix for flexible splines in the l-th boosting step
    $\Phi$: Design matrix for unspecified functions
    $Z_i^{(l)} = [Z_i, R_i^{(l)}]$: Random design matrix for cluster i for parametric and smooth covariates
    $\tilde X_{.(r)} = [\tilde X_{1(r)}^T,\dots,\tilde X_{n(r)}^T]^T$: Design matrix for component r (complete dataset)
    $\eta_{.(r)}$: Predictor with component r (complete dataset)

  • Preface

This thesis was written during my activities as a research assistant at the Department of Statistics of the Ludwig-Maximilians-University, Munich. The financial support from Sonderforschungsbereich 386 is gratefully acknowledged.

First I would like to express my thanks to my doctoral advisor Gerhard Tutz, who gave the important impulse for this thesis and advised it. I would also like to thank all professors, all other colleagues, other research assistants and everybody who was accessible for questions or problems.

I thank Prof. Brian Marx for agreeing to be the external referee.

In addition I want to thank Dr. Christian Heumann for being my second referee. The discussions around mixed models were important for the ingredients of this thesis. Especially Stefan Pilz was a big help as a personal mentor.

A useful aid was the Leibniz-Rechenzentrum in Munich, which allowed me to do my computations on the high-performance cluster. The computations for the simulation studies took approximately ten thousand hours, where only the time on the CPU unit was measured; around three thousand datasets of different length were analyzed. The number crunchers of the department did not deliver enough power to compute most of the settings that were checked in this thesis.

Special thanks to my parents, who supported my interests in mathematics, and to my wife Daniela, who patiently spent a lot of time at home with me on statistical papers and work, and who managed everything with our son Vinzenz.

  • Chapter 1

    Introduction

    1.1 Mixed Models and Boosting

The methodology of linear mixed models in combination with penalized splines has become popular in the past few years.

The idea of mixed models was originally developed by Henderson (1953), who derived mixed models to analyze longitudinal data by assuming a latent, unobserved structure in the data. The response was continuous and the structure was assumed to be linear. An example of longitudinal data is a study with many repeated measurements on each patient; the latent structure may then be on the individual level of the patient, so individuality becomes important in this context. For the statistical analysis, the observations are considered to be conditionally independent given the patient, and the patients themselves are assumed to be independent. The latent structure may concern only the individual level of the patients (random intercept) or the individual level and slope of the patients (random intercept and slope). A nice overview on mixed models is given by Verbeke & Molenberghs (2001). Another way of modeling longitudinal data with weak assumptions is to use generalized estimation equations (GEE), see Liang & Zeger (1986). In this thesis, only mixed models are investigated.

The influence of covariates is often reflected insufficiently, because the assumed parametrization for continuous covariates is usually very restrictive. For models with continuous covariates, a huge repertoire of nonparametric methods has been developed within the past few years. The effect of a continuous covariate on the response is specified by a smooth function, where a smooth function is meant to be sufficiently differentiable. Representatives of the nonparametric methods are kernel regression estimators, see Gasser & Müller (1984), Staniswalis (1989); regression splines, see Eubank (1988), Friedman & Silverman (1989); local polynomial fitting, see Hastie & Loader (1993), Fan & Gijbels (1996); and smoothing splines, see Silverman (1984), Green (1987). A nice overview on nonparametric methods may be found in Simonoff (1996).

The representation of smooth effects used in this thesis combines the different approaches to spline regression. When polynomial splines are used, one has to specify the degree of the polynomial pieces as well as the decomposition of the range by a finite number of knots. Polynomial splines form a vector space, and there exists a basis representation for every element of this space; that is why the approximated smooth effects can be parameterized. The regression spline can thus be reduced to a strict parametric structure, which is a great benefit of this approach.

The goodness of the approximation by polynomial splines is determined by the decomposition of the range. A large number of knots increases the fit to the data, but for data with huge noise the estimated curves are very wiggly. One way to control the variability of the estimated function is the adaptive selection of knots and positions, see Friedman (1991) and Stone, Hansen, Kooperberg & Truong (1997). An alternative approach is to use penalization techniques. In the latter case the penalty terms are focused on the basis coefficients of the polynomial spline representation. Two concepts for penalization have been established in recent years. One concept encompasses the truncated power series as suggested in Ruppert, Wand & Carroll (2003); in this case one uses the ridge penalty. The other concept is maintained by Eilers & Marx (1996), who use the B-spline basis together with a penalization of neighbouring basis coefficients, which is called P-splines. Both concepts have the advantage that the parameter estimates can be obtained by maximizing a penalized likelihood function.
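To make the penalization idea concrete, here is a minimal sketch of a P-spline fit in Python. All names and values, such as the number of segments, the degree and the smoothing parameter lam, are illustrative assumptions and not taken from this thesis: an equally spaced B-spline basis is built, second-order differences of neighbouring basis coefficients are penalized, and the penalized least squares problem is solved.

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_basis(x, n_seg=20, degree=3):
        # equally spaced knots over the range of x, extended for the B-spline support
        xl, xr = x.min(), x.max()
        dx = (xr - xl) / n_seg
        knots = np.linspace(xl - degree * dx, xr + degree * dx, n_seg + 2 * degree + 1)
        n_basis = n_seg + degree
        return np.column_stack([
            BSpline(knots, np.eye(n_basis)[j], degree)(x) for j in range(n_basis)
        ])

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 1, 200))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)   # toy data

    B = bspline_basis(x)
    D = np.diff(np.eye(B.shape[1]), n=2, axis=0)   # second-order difference matrix
    lam = 1.0                                      # smoothing parameter (illustrative)
    # penalized least squares: (B'B + lam * D'D) alpha = B'y
    alpha = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
    fit = B @ alpha

Increasing lam makes the fitted curve smoother; lam close to zero approaches the unpenalized regression spline.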

The crucial part of a penalized likelihood is that the smoothing parameter λ, which controls the variability of the estimated functions, has to be optimized. One idea, suggested by Eilers and Marx (Eilers & Marx (1996)), is to optimize the AIC criterion, which measures the likelihood of the model given the fixed smoothing parameter λ; the likelihood is penalized by the effective degrees of freedom in the model, see Hastie & Tibshirani (1990). Another idea is to use the cross-validation criterion, which is a computational burden in many data situations. A related criterion is the generalized cross-validation criterion established by Craven & Wahba (1979); recent investigations on this criterion are documented in Wood (2004). Another strategy to optimize this tuning parameter is based on mixed models.

The reason for the popularity of mixed models in the 90s is the comment of Terry Speed (Speed (1991)) on Robinson's article (Robinson (1991)) on the BLUP equations. Terry Speed states that the maximization of a penalized likelihood is equivalent to solving the BLUP equations in a linear mixed model. These statements were picked up by Wand (2000), Parise, Wand, Ruppert & Ryan (2001), Ruppert, Wand & Carroll (2003) and Lin & Zhang (1999). So nowadays smoothing is often connected to penalized splines, or the mixed model is seen as a suitable method to find reliable estimates for the smoothing parameter.

Boosting originates in the machine learning community, where it has been proposed as a technique to improve classification procedures by combining estimates with reweighted observations. Recently it has been shown that boosting is also a way of fitting an additive expansion in basis functions when the single basis functions represent the results of one iteration of the boosting procedure. The procedure is quite similar to the method of gradient descent with the use of specific loss functions, see Breiman (1999) and Friedman, Hastie & Tibshirani (2000). Since it has been shown that reweighting corresponds to minimizing a loss function iteratively (Breiman (1999), Friedman (2001)), boosting has been extended to regression problems in an L2-estimation framework by Bühlmann & Yu (2003). Tutz & Binder (2006) introduced the likelihood-based boosting concept for all kinds of link functions and distributions.

The aim of this thesis is to combine the mixed model methodology with boosting approaches. Especially the concept of componentwise boosting is introduced, where in each iteration step only one variable is allowed to be updated. This is a useful strategy if one tries to optimize a huge number of continuous variables, and it is a very robust method in terms of algorithmic optimization. Part of the algorithmic structure is that one can do variable selection, since among all covariates only one is selected to be optimized, which is a nice add-on in the boosting methodology.
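A minimal sketch of componentwise boosting for the simple L2 case may illustrate the principle. It uses ordinary one-variable least squares base learners instead of the penalized spline learners developed later in this thesis; the step length nu and the number of steps are illustrative assumptions.

    import numpy as np

    def componentwise_l2_boost(X, y, n_steps=200, nu=0.1):
        n, p = X.shape
        beta = np.zeros(p)
        intercept = y.mean()
        resid = y - intercept
        for _ in range(n_steps):
            # fit every single covariate to the current residuals ...
            coefs = (X * resid[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
            rss = ((resid[:, None] - X * coefs) ** 2).sum(axis=0)
            j = np.argmin(rss)            # ... but update only the best one
            beta[j] += nu * coefs[j]      # weak (shrunken) update
            resid -= nu * coefs[j] * X[:, j]
        return intercept, beta

Covariates that are never selected keep a zero coefficient, which is the variable selection add-on mentioned above.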

Often the application of an additive mixed model is too restrictive, because each cluster may have its own specific function. So one idea is to compute a joint smooth function of the continuous covariates and a random, cluster-specific smooth function, as suggested by Ruppert, Wand & Carroll (2003), Wu & Liang (2004) or Verbyla, Cullis, Kenward & Welham (1999). If a joint additive structure with a random intercept is not sufficient to capture the variation of subjects, then one may extend the model by cluster-specific modifications of the joint spline function, which is realized by a random effect. This kind of model is simply structured and needs only two additional parameters, the variance of the slope and the covariance between slope and intercept. It is therefore very parsimonious and allows simple interpretation. By using few additional parameters it has a distinct advantage over methods that allow subjects to have their own function, yielding as many functions as subjects (see for example Verbyla, Cullis, Kenward & Welham (1999) and Ruppert, Wand & Carroll (2003)).

An adequate formulation, investigation and interpretation of regression models needs an explicit consideration of the feature space of the response variables. So in some areas the assumption of a normal distribution, which is part of the classical regression models, has to be generalized to regression models for discrete responses. The statistical concept for these regression models was built by Nelder & Wedderburn (1972), who introduced the generalized linear models. Many publications based on these models followed, by Kay & Little (1986), Armstrong & Sloan (1989) in medical research, Amemiya (1981), Maddala (1983) in economics, and in many other areas.

Heavy numerical problems arise if one tries to do inference in generalized linear models for longitudinal data, and the problems are not restricted to the generalized linear mixed models: in the generalized estimation equations, the optimization is not trivial either, see Liang & Zeger (1986). These problems originate in the fact that the marginal distribution is not analytically accessible for generalized linear mixed models. Therefore complicated integrals have to be solved. In the linear mixed model one can do analytical integration by using some nice properties of Gaussian random variables, but in generalized linear mixed models numerical integration has to be done. This can be done either by using the Laplace approximation (Breslow & Clayton (1993), Schall (1991), Wolfinger (1994)) based on a normal approximation, or by using integration-point based methods like Gauss-Hermite quadrature (Hedeker & Gibbons (1996), Pinheiro & Bates (1995)) or Monte Carlo integration (McCulloch (1997), McCulloch (1994), Booth & Hobert (1999)). One may use direct methods or the EM algorithm to get parameter estimates.

In the context of categorical responses, adequacy is not restricted to the consideration of a discrete response structure. Properties of the variables are often reflected in a bad way when using linear assumptions on the covariates, see Lin & Zhang (1999). Again, the aim of this thesis is to extend generalized linear mixed models by nonparametric effects. The two strategies for optimization are, on the one side, the approach discussed by Ruppert, Wand & Carroll (2003) and, on the other side, the boosted generalized semi-parametric mixed models pictured in this thesis.

1.2 Guideline through This Thesis

Chapter two gives a short introduction to linear mixed models. The different strategies for optimizing a linear mixed model, as well as a robust variant of the linear mixed model, are presented.

The semi-parametric mixed models are part of the third chapter. The nonparametric models are sketched briefly, as well as how nonparametric approaches are handled and which problems arise if nonparametric modeling is used.


In the fourth chapter, the semi-parametric mixed models are extended to the flexible mixed models, where all clusters have a common development of the covariate and each cluster has its distinct modifications, in the sense that the effect of a covariate is strengthened or attenuated individually.

The generalized linear mixed models are the topic of the fifth chapter. An overview of the most popular methods is given here, since there is no canonical way to estimate a generalized linear mixed model.

The sixth chapter deals with generalized semi-parametric mixed models. The Laplace approximation is used to implement the boosting idea in the generalized linear mixed model framework. In simulation studies the results are compared to the optimized model based on the mixed model approach for additive models (see Ruppert, Wand & Carroll (2003)). In the last chapter, a short summary of the results and the given problems, as well as an outlook on further developments and questions that have accumulated in the course of this thesis, is given.

  • Chapter 2

    Linear Mixed Models

    2.1 Motivation: CD4 Data

The data was collected within the Multicenter AIDS Cohort Study (MACS), which has followed nearly 5000 gay or bisexual men from Baltimore, Pittsburgh, Chicago and Los Angeles since 1984 (see Kaslow, Ostrow, Detels, Phair, Polk & Rinaldo (1987), Zeger & Diggle (1994)). The study includes 1809 men who were infected with HIV at study entry and another 371 men who were seronegative at entry and seroconverted during the follow-up. Here 369 seroconverters (n = 369) with 2376 measurements in total (N = 2376) were used; two subjects were dropped since covariate information was not available. The interesting response variable is the number or percent of CD4 cells (CD4), by which progression of disease may be assessed. Covariates include years since seroconversion (time), packs of cigarettes a day (cigarettes), recreational drug use (drugs) with values yes or no, number of sexual partners (partners), age (age) and a mental illness score (cesd).

In this study we have a repeated measurement design, because every seroconverter has several measurements of covariates at different time points. For the i-th seroconverter, $T_i$ repeated measurements were observed. For example, the first seroconverter in the dataset has three repeated measurements, so in this case $T_i = 3$. The observations for the i-th seroconverter at repeated measurement t can then be addressed by $\mathrm{CD4}_{it}$ for the response and by $\mathrm{age}_{it}$, $\mathrm{partners}_{it}$, $\mathrm{drugs}_{it}$, $\mathrm{cesd}_{it}$ and $\mathrm{time}_{it}$ for the corresponding covariates.

If one has only one observation on each seroconverter, then $T_i = 1$ for all seroconverters ($i \in \{1,\dots,n\}$), and one can use standard cross-sectional methods, because the measurement errors of the seroconverters can be assumed to be independent. In the case of repeated measurements ($T_i \neq 1$) one has clustered observations with error terms $(\varepsilon_{i1},\dots,\varepsilon_{iT_i})$ for the i-th person.

One approach to modeling data of this type is based on mixed models. Here the observations $(\mathrm{CD4}_{it},\ t = 1,\dots,T_i)$ for the i-th seroconverter are assumed to be conditionally independent: given the level of the i-th seroconverter, the errors for repeated measurements of this person may be assumed to be independent. The unknown level of the i-th seroconverter is expressed in mixed models by the so-called random intercept $b_i$. A common assumption on random intercepts is that they are Gaussian distributed with $b_i \sim N(0, \sigma_b^2)$, where $\sigma_b^2$ is the random intercept variance. A mixed model that is linear in the covariates age, partners, drugs and time is given by

    $\mathrm{CD4}_{it} = \beta_0 + \beta_1\,\mathrm{age}_{it} + \beta_2\,\mathrm{partners}_{it} + \beta_3\,\mathrm{drugs}_{it} + \beta_4\,\mathrm{time}_{it} + b_i + \varepsilon_{it}$

for $i = 1,\dots,n$ and $t = 1,\dots,T_i$, where $\varepsilon_{it}$ is the error term. This can also be rewritten in vector notation for the i-th seroconverter as

    $\mathrm{CD4}_i = \beta_0 1_{T_i} + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{partners}_i + \beta_3\,\mathrm{drugs}_i + \beta_4\,\mathrm{time}_i + 1_{T_i} b_i + \varepsilon_i,$

where $\mathrm{CD4}_i^T = (\mathrm{CD4}_{i1},\dots,\mathrm{CD4}_{iT_i})$, $\mathrm{age}_i^T = (\mathrm{age}_{i1},\dots,\mathrm{age}_{iT_i})$, $\mathrm{drugs}_i^T = (\mathrm{drugs}_{i1},\dots,\mathrm{drugs}_{iT_i})$, $\mathrm{time}_i^T = (\mathrm{time}_{i1},\dots,\mathrm{time}_{iT_i})$ and $\varepsilon_i^T = (\varepsilon_{i1},\dots,\varepsilon_{iT_i})$; $1_{T_i}$ is a vector of ones of length $T_i$. The assumption on the model may be

    $\begin{pmatrix} \varepsilon_i \\ b_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\varepsilon^2 I_{T_i} & 0 \\ 0 & \sigma_b^2 \end{pmatrix} \right),$

where the errors and random intercepts of the i-th person are not correlated with those of the j-th person (j ≠ i). If vector notation without an index is preferred, one can also write

    $\mathrm{CD4} = \beta_0 1_N + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{partners} + \beta_3\,\mathrm{drugs} + \beta_4\,\mathrm{time} + Zb + \varepsilon,$

where $\mathrm{CD4}^T = (\mathrm{CD4}_1^T,\dots,\mathrm{CD4}_n^T)$, $\mathrm{age}^T = (\mathrm{age}_1^T,\dots,\mathrm{age}_n^T)$, $\mathrm{partners}^T = (\mathrm{partners}_1^T,\dots,\mathrm{partners}_n^T)$, $\mathrm{drugs}^T = (\mathrm{drugs}_1^T,\dots,\mathrm{drugs}_n^T)$, $\mathrm{time}^T = (\mathrm{time}_1^T,\dots,\mathrm{time}_n^T)$, $\varepsilon^T = (\varepsilon_1^T,\dots,\varepsilon_n^T)$ and $b^T = (b_1,\dots,b_n)$. The matrix Z is then the block-diagonal matrix of $1_{T_1},\dots,1_{T_n}$. The assumption on the mixed model can then be reduced to

    $\begin{pmatrix} \varepsilon \\ b \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\varepsilon^2 I_N & 0 \\ 0 & \sigma_b^2 I_n \end{pmatrix} \right).$

Since one may use a short notation without addressing the variable names CD4, age, partners, drugs, time, one generally sets the response to $y_{it} := \mathrm{CD4}_{it}$. The variables that are responsible for the fixed effects are put into the vector $x_{it}^T := (1, \mathrm{age}_{it}, \mathrm{partners}_{it}, \mathrm{drugs}_{it}, \mathrm{time}_{it})$. The variables associated with the random effect are stacked in the block-diagonal entries of the matrix Z. With $X_i^T = (x_{i1},\dots,x_{iT_i})$, $X^T = (X_1^T,\dots,X_n^T)$, $y_i^T = (y_{i1},\dots,y_{iT_i})$, $y^T = (y_1^T,\dots,y_n^T)$ and $\beta^T = (\beta_0,\dots,\beta_4)$ the short notation is

    $y = X\beta + Zb + \varepsilon.$

For example, one might extend the mixed model to a mixed model with random slopes,

    $\mathrm{CD4}_{it} = \beta_0 + \beta_1\,\mathrm{age}_{it} + \beta_2\,\mathrm{partners}_{it} + \beta_3\,\mathrm{drugs}_{it} + \beta_4\,\mathrm{time}_{it} + b_i^{(1)} + \mathrm{time}_{it}\,b_i^{(2)} + \varepsilon_{it},$

allowing a random variation in the slope of the linear time effect. Using the vectors $z_{it}^T := (1, \mathrm{time}_{it})$, $Z_i^T := (z_{i1},\dots,z_{iT_i})$, $Z = \mathrm{bdiag}(Z_1,\dots,Z_n)$ and $b_i^T = (b_i^{(1)}, b_i^{(2)})$, where $b_i^{(1)}$ is the random intercept and $b_i^{(2)}$ is the random slope, one can write

    $y = X\beta + Zb + \varepsilon.$

The assumption on the random effects of the i-th seroconverter may be $b_i \sim N(0, Q)$, where Q is a 2 x 2 covariance matrix. This covariance matrix may be assumed to be the same for all persons, and the random effects of different persons are not correlated with each other. This may be denoted by

    $\begin{pmatrix} \varepsilon \\ b \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\varepsilon^2 I_N & 0 \\ 0 & \tilde Q \end{pmatrix} \right)$

with $\tilde Q$ being the n-times block-diagonal matrix of Q.
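As a small illustration of the matrices just introduced, the following sketch assembles the block-diagonal random intercept and slope design Z and the marginal covariance $V = \sigma_\varepsilon^2 I_N + Z\tilde Q Z^T$ in Python; all numbers are made up for illustration and are not taken from the CD4 data.

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(0)
    n = 5                                     # clusters (illustrative)
    Ti = rng.integers(2, 5, size=n)           # repeated measurements per cluster
    times = [np.sort(rng.uniform(-1, 3, t)) for t in Ti]

    # z_it = (1, time_it): random intercept and random slope
    Z_blocks = [np.column_stack([np.ones(t), tm]) for t, tm in zip(Ti, times)]
    Z = block_diag(*Z_blocks)                 # Z = bdiag(Z_1, ..., Z_n)

    Q = np.array([[1.0, 0.3],                 # illustrative 2x2 covariance of b_i
                  [0.3, 0.5]])
    Qt = block_diag(*([Q] * n))               # n-times block-diagonal matrix of Q
    sigma2_eps = 0.5                          # illustrative error variance
    N = Z.shape[0]
    V = sigma2_eps * np.eye(N) + Z @ Qt @ Z.T # marginal covariance of y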

    2.2 The Model

The linear mixed model for longitudinal data was introduced by Henderson (1953). Let the data be given by $(y_{it}, x_{it})$, $i = 1,\dots,n$, $t = 1,\dots,T_i$, with $y_{it}$ connected to observation t in cluster i and $x_{it}$ denoting a vector of covariates which may vary across the observations within one cluster. $N = \sum_{i=1}^{n} T_i$ is the number of observations in total. For simplicity of presentation, let the number of observations T within one cluster not depend on the cluster. Let $x_{it}$ and $z_{it}$ be design vectors composed from given covariates. We set $X_i^T = (x_{i1},\dots,x_{iT})$, $Z_i^T = (z_{i1},\dots,z_{iT})$, $y_i^T = (y_{i1},\dots,y_{iT})$, $X^T = (X_1^T,\dots,X_n^T)$, $y^T = (y_1^T,\dots,y_n^T)$, $Z = \mathrm{bdiag}(Z_1,\dots,Z_n)$. The basic idea of random effect models is to model the joint distribution of the observed response y and an unobservable random effect b.

The assumption on the distribution of y given the random effect b is

    $y\,|\,b \sim N(X\beta + Zb, \sigma_\varepsilon^2 I_N).$

Here the (conditional) density f(y|b) is a Gaussian density with mean $X\beta + Zb$ and covariance $\sigma_\varepsilon^2 I_N$. The assumption on the random effects b and the error component ε is

    $\begin{pmatrix} \varepsilon \\ b \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\varepsilon^2 I_N & 0 \\ 0 & \tilde Q(\rho) \end{pmatrix} \right).$

Here the distribution p(b; ρ) of the random effects is a Gaussian distribution with mean zero and covariance matrix $\tilde Q(\rho)$. In this case the structural parameter ρ specifies the covariance matrix $\tilde Q(\rho) = \mathrm{cov}(b)$. Since the mean of p(b; ρ) is assumed to be zero, p(b) is fully specified up to the unknown covariance matrix $\tilde Q(\rho)$. An overview on parameterized covariance matrices and their derivatives is given in the appendix.

If one assumes that the observations within and between clusters given the random effects are independent, the joint density of y, $f(y|b) = \prod_{i=1}^{n} f(y_i|b_i)$ with $f(y_i|b_i) = \prod_{t=1}^{T_i} f(y_{it}|b_i)$, reduces to product densities, which can be handled easily. Here $f(y_i|b_i)$ and $f(y_{it}|b_i)$ are also Gaussian densities with means $X_i\beta + Z_ib_i$ and $x_{it}^T\beta + z_{it}^Tb_i$ and covariances $\sigma_\varepsilon^2 I_{T_i}$ and $\sigma_\varepsilon^2$.

One reason for the popularity of this assumption is that it is easy to understand and useful in the context of longitudinal data. More complex structures are possible, i.e., clusters are correlated or the observations within the clusters are correlated in a distinct way. Since correlation problems can easily be expressed in the Gaussian framework, we start with the linear mixed model specified by Gaussian mixing densities and conditional outcomes that are normally distributed.

One gets the marginal densities as

    $f(y) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} \int f(y_i|b_i)\,p(b_i;\rho)\,db_i,$

where $p(b_i;\rho)$ is the density of $N(0, Q(\rho))$. In this case $\tilde Q(\rho) = \mathrm{bdiag}(Q(\rho),\dots,Q(\rho))$; in other words, each cluster has the same structure Q(ρ), so only Q(ρ) has to be estimated to obtain $\tilde Q(\rho)$.

Let ϑ be the vector of variance parameters, $\vartheta^T = (\sigma_\varepsilon, \rho^T)$. The result of this consideration is the marginal form of a Gaussian random effects model

    $y_i \sim N(X_i\beta, V_i(\vartheta)),$

where

    $V_i(\vartheta) = \sigma_\varepsilon^2 I_{T_i} + Z_i Q(\rho) Z_i^T.$

In matrix notation

    $y \sim N(X\beta, V(\vartheta))$

with $V(\vartheta) = \mathrm{bdiag}(V_1(\vartheta),\dots,V_n(\vartheta))$. The joint density of y and b is reduced to a marginal density by integrating out the random effect b.

As already mentioned, the advantage of this restriction is that the parametrization is easy to handle regarding numerical aspects: the operations on huge covariance matrices and design matrices can be reduced to operations on the blocks of the block-diagonal matrix, which is a well conditioned problem in the numerical sense. Here Q(ρ) is the covariance matrix of the random effects within one cluster, which is assumed to be the same in each cluster.

Gaussian random effect models have an advantage, because on the basis of

    $f(y) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} \int f(y_i|b_i)\,p(b_i;\rho)\,db_i$

one can easily switch between the marginal and the conditional view. The n integrals can be solved analytically by using the marginal distribution; for arbitrary mixtures of conditional and random effects distributions this is in general not possible. The log-likelihood for β and ϑ is given by

    $l(\beta,\vartheta\,|\,y) = \sum_{i=1}^{n} \log(f(y_i)) = -\frac{1}{2}\sum_{i=1}^{n} \log(|V_i(\vartheta)|) - \frac{1}{2}\sum_{i=1}^{n} (y_i - X_i\beta)^T V_i(\vartheta)^{-1}(y_i - X_i\beta).$

So the estimator $\hat\beta$ is obtained by solving the following equation, which is derived from the log-likelihood by differentiating with respect to β:

    $\left(\sum_{i=1}^{n} X_i^T V_i^{-1} X_i\right)\hat\beta = \sum_{i=1}^{n} X_i^T V_i^{-1} y_i. \qquad (2.1)$

As shown in Harville (1976) and described in Harville (1977), $b_i$ can be estimated by

    $\hat b_i = Q(\rho)\,Z_i^T V_i(\vartheta)^{-1}(y_i - X_i\hat\beta). \qquad (2.2)$

Harville (1976) shows that the solutions of equations (2.1) and (2.2) are equivalent to the solution of the BLUP equations

    $\begin{bmatrix} X^T W X & X^T W Z \\ Z^T W X & Z^T W Z + \tilde Q(\rho)^{-1} \end{bmatrix} \begin{pmatrix} \hat\beta \\ \hat b \end{pmatrix} = \begin{bmatrix} X^T W y \\ Z^T W y \end{bmatrix} \qquad (2.3)$

with $W = \sigma_\varepsilon^{-2} I$.

The estimator $\hat b$ is called BLUP (best linear unbiased predictor); it minimizes $E((\hat b - b)^T(\hat b - b))$, see Harville (1976). Additional effort is necessary if Q(ρ) or the structural parameters are not known. Usual ways to solve this problem are based on the restricted log-likelihood, where profile likelihood concepts are used which alternately plug in the estimate for the variance components and the estimate for the fixed effects.

A detailed introduction to linear mixed models is given by Robinson (1991) and McCulloch & Searle (2001). Information especially on longitudinal mixed models can be found in Verbeke & Molenberghs (2001), Harville (1976) and Harville (1977).
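If the variance components are known, equations (2.1) to (2.3) can be solved directly. The following sketch (with generic inputs; it is not a routine from this thesis) computes the GLS estimate of β via (2.1) and the BLUP of b via (2.2), and recovers the same solution from the BLUP system (2.3).

    import numpy as np

    def blup(X, Z, y, Qt, sigma2_eps):
        # marginal covariance and GLS estimate of beta, eq. (2.1)
        V = sigma2_eps * np.eye(len(y)) + Z @ Qt @ Z.T
        Vinv = np.linalg.inv(V)
        beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
        # BLUP of the random effects, eq. (2.2)
        b = Qt @ Z.T @ Vinv @ (y - X @ beta)
        return beta, b

    def blup_system(X, Z, y, Qt, sigma2_eps):
        # the same solution via the BLUP equations (2.3) with W = I / sigma2_eps
        W = np.eye(len(y)) / sigma2_eps
        A = np.block([[X.T @ W @ X, X.T @ W @ Z],
                      [Z.T @ W @ X, Z.T @ W @ Z + np.linalg.inv(Qt)]])
        rhs = np.concatenate([X.T @ W @ y, Z.T @ W @ y])
        sol = np.linalg.solve(A, rhs)
        return sol[:X.shape[1]], sol[X.shape[1]:]

Both functions return identical estimates up to numerical error, which mirrors Harville's equivalence result.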

    2.3 The Restricted Log-Likelihood

The restricted log-likelihood is based on Patterson & Thompson (1971). It was reviewed by Harville (1974), Harville (1977) and by Verbeke & Molenberghs (2001). It is given by

    $l_r(\beta,\vartheta) = -\frac{1}{2}\sum_{i=1}^{n} \log(|V_i(\vartheta)|) - \frac{1}{2}\sum_{i=1}^{n} (y_i - X_i\beta)^T V_i(\vartheta)^{-1}(y_i - X_i\beta) - \frac{1}{2}\sum_{i=1}^{n} \log(|X_i^T V_i(\vartheta)^{-1} X_i|).$

The restricted log-likelihood differs from the log-likelihood by an additional component, since

    $l_r(\beta,\vartheta) = l(\beta,\vartheta) - \frac{1}{2}\sum_{i=1}^{n} \log(|X_i^T V_i(\vartheta)^{-1} X_i|). \qquad (2.4)$

Differentiating $l_r(\beta,\vartheta)$ with respect to β results in the same equation as differentiating $l(\beta,\vartheta)$ with respect to β. An important question is why $l_r(\beta,\vartheta)$ should be used for the further computation: by plugging in the estimates alternately, degrees of freedom for the estimation of the variance components are lost, and this loss is compensated by the additional component in the restricted log-likelihood. Details can be found in Harville (1977). Since $l_r(\beta,\vartheta)$ is nonlinear in ϑ, it has to be maximized by a Fisher scoring algorithm.


The estimation of the variance components is based on the profile log-likelihood that is obtained by plugging the estimate $\hat\beta$ into the restricted log-likelihood (2.4). Differentiation with respect to $\vartheta^T = (\sigma_\varepsilon, \rho^T) = (\vartheta_1,\dots,\vartheta_d)$ yields

    $s(\hat\beta,\vartheta) = \frac{\partial l_r(\hat\beta,\vartheta)}{\partial\vartheta} = (s(\hat\beta,\vartheta)_i)_{i=1,\dots,d}$

and

    $F(\hat\beta,\vartheta) = E\left(-\frac{\partial^2 l_r(\hat\beta,\vartheta)}{\partial\vartheta\,\partial\vartheta^T}\right) = (F(\hat\beta,\vartheta)_{i,j})_{i,j=1,\dots,d}$

with

    $s(\hat\beta,\vartheta)_i = \frac{\partial l_r(\hat\beta,\vartheta)}{\partial\vartheta_i} = -\frac{1}{2}\sum_{k=1}^{n} \mathrm{trace}\left(P_k(\vartheta)\,\frac{\partial V_k(\vartheta)}{\partial\vartheta_i}\right) + \frac{1}{2}\sum_{k=1}^{n} (y_k - X_k\hat\beta)^T V_k(\vartheta)^{-1}\,\frac{\partial V_k(\vartheta)}{\partial\vartheta_i}\,V_k(\vartheta)^{-1}(y_k - X_k\hat\beta).$

$P_k$ is defined in Harville (1977):

    $P_k(\vartheta) = V_k(\vartheta)^{-1} - V_k(\vartheta)^{-1} X_k\left(\sum_{k=1}^{n} X_k^T V_k(\vartheta)^{-1} X_k\right)^{-1} X_k^T V_k(\vartheta)^{-1},$

and

    $F(\hat\beta,\vartheta)_{i,j} = \frac{1}{2}\sum_{k=1}^{n} \mathrm{trace}\left(P_k(\vartheta)\,\frac{\partial V_k(\vartheta)}{\partial\vartheta_i}\,P_k(\vartheta)\,\frac{\partial V_k(\vartheta)}{\partial\vartheta_j}\right),$

where

    $\frac{\partial V_k(\vartheta)}{\partial\vartheta_i} = \begin{cases} 2\sigma_\varepsilon I_{T_k} & \text{if } i = 1, \\ Z_k\,\frac{\partial Q(\rho)}{\partial\rho_j}\,Z_k^T & \text{if } \vartheta_i = \rho_j,\; i \neq 1. \end{cases}$

The estimator $\hat\vartheta$ can now be obtained by running a common Fisher scoring algorithm with

    $\vartheta^{(s+1)} = \vartheta^{(s)} + F(\hat\beta,\vartheta^{(s)})^{-1}\,s(\hat\beta,\vartheta^{(s)}),$

where s denotes the iteration index of the Fisher scoring algorithm. When the Fisher scoring has converged, the resulting $\hat\vartheta$ represents the estimate of the variance parameters for the considered step.
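For a random intercept model the derivatives $\partial V_k(\vartheta)/\partial\vartheta_i$ are particularly simple, and one Fisher scoring step can be sketched in a few lines. The sketch below parameterizes directly in the variances $(\sigma_\varepsilon^2, \sigma_b^2)$, which changes the derivatives but not the principle; it is an illustration under these assumptions, not the implementation used in this thesis.

    import numpy as np

    def reml_fisher_step(theta, Xs, ys, beta):
        # theta = (sig2_eps, sig2_b); V_i = sig2_eps * I + sig2_b * 1 1^T
        Vinvs = [np.linalg.inv(theta[0] * np.eye(len(y)) + theta[1] * np.ones((len(y), len(y))))
                 for y in ys]
        info = sum(X.T @ Vi @ X for X, Vi in zip(Xs, Vinvs))
        info_inv = np.linalg.inv(info)
        # REML projection P_k as defined above
        Ps = [Vi - Vi @ X @ info_inv @ X.T @ Vi for X, Vi in zip(Xs, Vinvs)]
        s = np.zeros(2)
        F = np.zeros((2, 2))
        for X, y, Vi, P in zip(Xs, ys, Vinvs, Ps):
            r = y - X @ beta
            dVs = [np.eye(len(y)), np.ones((len(y), len(y)))]  # dV/d sig2_eps, dV/d sig2_b
            for i, dVi in enumerate(dVs):
                s[i] += -0.5 * np.trace(P @ dVi) + 0.5 * r @ Vi @ dVi @ Vi @ r
                for j, dVj in enumerate(dVs):
                    F[i, j] += 0.5 * np.trace(P @ dVi @ P @ dVj)
        return theta + np.linalg.solve(F, s)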

    2.3.1 The Maximum Likelihood Method

In special cases it is necessary to use ML instead of REML, because the Fisher scoring in the REML method may be affected by numerical problems. Especially when there are many covariates which have no effect on the response, the REML estimator may not converge.

On the other hand, it is criticized that the maximum likelihood estimator for $\sigma_\varepsilon^2$ does not take into account the loss of degrees of freedom caused by plugging in $\hat\beta$.

The estimation of the variance components is based on the profile log-likelihood that is obtained by plugging the estimate $\hat\beta$ into the marginal log-likelihood

    $l(\hat\beta;\vartheta) = -\frac{1}{2}\sum_{i=1}^{n} \log(|V_i(\vartheta)|) - \frac{1}{2}\sum_{i=1}^{n} (y_i - X_i\hat\beta)^T V_i(\vartheta)^{-1}(y_i - X_i\hat\beta).$

Differentiation with respect to $\vartheta^T = (\sigma_\varepsilon, \rho^T) = (\vartheta_1,\dots,\vartheta_d)$ yields

    $s(\hat\beta,\vartheta) = \frac{\partial l(\hat\beta,\vartheta)}{\partial\vartheta} = (s(\hat\beta,\vartheta)_i)_{i=1,\dots,d}$

and

    $F(\hat\beta,\vartheta) = E\left(-\frac{\partial^2 l(\hat\beta,\vartheta)}{\partial\vartheta\,\partial\vartheta^T}\right) = (F(\hat\beta,\vartheta)_{i,j})_{i,j=1,\dots,d}$

with

    $s(\hat\beta,\vartheta)_i = \frac{\partial l(\hat\beta,\vartheta)}{\partial\vartheta_i} = -\frac{1}{2}\sum_{k=1}^{n} \mathrm{trace}\left(V_k(\vartheta)^{-1}\,\frac{\partial V_k(\vartheta)}{\partial\vartheta_i}\right) + \frac{1}{2}\sum_{k=1}^{n} (y_k - X_k\hat\beta)^T V_k(\vartheta)^{-1}\,\frac{\partial V_k(\vartheta)}{\partial\vartheta_i}\,V_k(\vartheta)^{-1}(y_k - X_k\hat\beta)$

and

    $F(\hat\beta,\vartheta)_{i,j} = \frac{1}{2}\sum_{k=1}^{n} \mathrm{trace}\left(V_k(\vartheta)^{-1}\,\frac{\partial V_k(\vartheta)}{\partial\vartheta_i}\,V_k(\vartheta)^{-1}\,\frac{\partial V_k(\vartheta)}{\partial\vartheta_j}\right),$

where

    $\frac{\partial V_k(\vartheta)}{\partial\vartheta_i} = \begin{cases} 2\sigma_\varepsilon I_{T_k} & \text{if } i = 1, \\ Z_k\,\frac{\partial Q(\rho)}{\partial\rho_j}\,Z_k^T & \text{if } \vartheta_i = \rho_j,\; i \neq 1. \end{cases}$

2.4 Estimation with Iteratively Weighted Least Squares

The estimation algorithm can be described as follows. Compute good start values $\hat\beta^{(0)}$ and $\hat\vartheta^{(0)}$: the value of $\hat\beta^{(0)}$ can be the estimator from a linear model, and the elements of $\hat\vartheta^{(0)}$ are set to small values, e.g. 0.1.

1. Set k = 0.

2. Compute $\hat\beta^{(k+1)}$ by solving equation (2.1) with $\hat\vartheta^{(k)}$ plugged in.

3. Compute $\hat\vartheta^{(k+1)}$ from $l(\hat\beta^{(k+1)},\vartheta)$ by running a Fisher scoring algorithm with $\hat\beta^{(k+1)}$ plugged in.

4. Stop if all stopping criteria are reached; else return to step 2 with k = k + 1.

This algorithm corresponds to the iteratively weighted least squares algorithm. Alternatively, the variance parameters can be obtained by using the EM algorithm originally described in Laird & Ware (1982). Later, Lindstrom & Bates (1990) suggested that the Newton-Raphson algorithm should be preferred over the EM algorithm.
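A compact sketch of this alternating scheme for the random intercept model, reusing reml_fisher_step from the sketch above (start values and tolerance are illustrative assumptions):

    import numpy as np

    def fit_lmm(Xs, ys, max_iter=50, tol=1e-6):
        # start values: beta from a linear model, small variance components
        beta = np.linalg.lstsq(np.vstack(Xs), np.concatenate(ys), rcond=None)[0]
        theta = np.array([0.1, 0.1])
        for _ in range(max_iter):
            Vinvs = [np.linalg.inv(theta[0] * np.eye(len(y)) + theta[1] * np.ones((len(y), len(y))))
                     for y in ys]
            A = sum(X.T @ Vi @ X for X, Vi in zip(Xs, Vinvs))
            c = sum(X.T @ Vi @ y for X, Vi, y in zip(Xs, Vinvs, ys))
            beta_new = np.linalg.solve(A, c)        # step 2: solve (2.1) given theta
            # step 3: one Fisher scoring step per outer iteration, for brevity
            theta = reml_fisher_step(theta, Xs, ys, beta_new)
            if np.linalg.norm(beta_new - beta) < tol * (1.0 + np.linalg.norm(beta)):
                beta = beta_new
                break
            beta = beta_new
        return beta, theta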

    2.5 Estimation with EM-Algorithm - The Laird-Ware Method

The idea of this maximization method is based on Laird & Ware (1982). Indirect maximization of the marginal density starts from the joint log-density of the observed data $y = (y_1,\dots,y_n)$ and the unobservable effects $\delta = (\beta, b_1,\dots,b_n)$. The joint log-likelihood is

    $\log f(y,\delta\,|\,\vartheta) = \log f(y\,|\,\delta;\sigma_\varepsilon^2) + \log p(b_1,\dots,b_n;\rho).$

From the model assumptions one obtains, up to constants,

    $S_1(\sigma_\varepsilon^2) \propto -\frac{1}{2}N\log\sigma_\varepsilon^2 - \frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^{n} \varepsilon_i^T\varepsilon_i,$

    $S_2(Q(\rho)) \propto -\frac{n}{2}\log\det(Q(\rho)) - \frac{1}{2}\sum_{i=1}^{n} b_i^T Q(\rho)^{-1} b_i = -\frac{n}{2}\log\det(Q(\rho)) - \frac{1}{2}\sum_{i=1}^{n} \mathrm{tr}(Q(\rho)^{-1} b_i b_i^T).$

Next we start in the EM framework with building the conditional expectations with respect to $\vartheta^{(p)}$,

    $M(\vartheta\,|\,\vartheta^{(p)}) = E\{S_1(\sigma_\varepsilon^2)\,|\,y;\vartheta^{(p)}\} + E\{S_2(Q(\rho))\,|\,y;\vartheta^{(p)}\}, \qquad (2.5)$

which is called the E-step. In detail, the E-equations are

    $E\{S_1(\sigma_\varepsilon^2)\,|\,y;\vartheta^{(p)}\} = -\frac{1}{2}\sum_{i=1}^{n} T_i\log(\sigma_\varepsilon^2) - \frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^{n}\left[(\hat\varepsilon_i^{(p)})^T\hat\varepsilon_i^{(p)} + \mathrm{tr}\,\mathrm{cov}(\varepsilon_i\,|\,y_i;\vartheta^{(p)})\right],$

    $E\{S_2(Q(\rho))\,|\,y;\vartheta^{(p)}\} = -\frac{n}{2}\log\det(Q(\rho)) - \frac{1}{2}\sum_{i=1}^{n} \mathrm{tr}\left[Q(\rho)^{-1}\left(\hat b_i^{(p)}(\hat b_i^{(p)})^T + \mathrm{cov}(b_i\,|\,y_i;\vartheta^{(p)})\right)\right] \qquad (2.6)$

with current residuals $\hat\varepsilon_i^{(p)} = y_i - X_i\hat\beta^{(p)} - Z_i\hat b_i^{(p)}$. Differentiation of (2.6) with respect to $\sigma_\varepsilon^2$ and Q(ρ) yields the M-equations (the equations that maximize the E-equations):

    $\hat\sigma_\varepsilon^{2\,(p+1)} = \frac{1}{N}\sum_{i=1}^{n}\left[(\hat\varepsilon_i^{(p)})^T\hat\varepsilon_i^{(p)} + \mathrm{tr}\,\mathrm{cov}(\varepsilon_i\,|\,y_i;\vartheta^{(p)})\right],$

    $Q(\rho^{(p+1)}) = \frac{1}{n}\sum_{i=1}^{n}\left[\hat b_i^{(p)}(\hat b_i^{(p)})^T + \mathrm{cov}(b_i\,|\,y_i;\vartheta^{(p)})\right]. \qquad (2.7)$

We use projection matrices according to Laird, Lange & Stram (1987),

    $P_i(\vartheta^{(p)}) = V_i(\vartheta^{(p)})^{-1} \qquad (2.8)$

in ML estimation and

    $P_i(\vartheta^{(p)}) = V_i(\vartheta^{(p)})^{-1} - V_i(\vartheta^{(p)})^{-1} X_i\left(X_i^T V_i(\vartheta^{(p)})^{-1} X_i\right)^{-1} X_i^T V_i(\vartheta^{(p)})^{-1} \qquad (2.9)$

in REML estimation, with $V_i(\vartheta^{(p)}) = \hat\sigma_\varepsilon^{2\,(p)} I_{T_i} + Z_i Q(\rho^{(p)}) Z_i^T$.

Therefore, with projection matrix (2.8) or (2.9), we can write as described in Laird, Lange & Stram (1987)

    $\hat\sigma_\varepsilon^{2\,(p+1)} = \frac{1}{N}\sum_{i=1}^{n}\left[(\hat\varepsilon_i^{(p)})^T\hat\varepsilon_i^{(p)} + \hat\sigma_\varepsilon^{2\,(p)}\,\mathrm{tr}\left(I_{T_i} - \hat\sigma_\varepsilon^{2\,(p)} P_i(\vartheta^{(p)})\right)\right],$

    $Q(\rho^{(p+1)}) = \frac{1}{n}\sum_{i=1}^{n}\left[\hat b_i^{(p)}(\hat b_i^{(p)})^T + Q(\rho^{(p)}) - Q(\rho^{(p)})\,Z_i^T P_i(\vartheta^{(p)}) Z_i\,Q(\rho^{(p)})\right]. \qquad (2.10)$

The estimates $\hat\beta^{(p)}$ and $\hat b_i^{(p)}$ are obtained in the usual way:

    $\hat\beta^{(p)} = \left(\sum_{i=1}^{n} X_i^T V_i(\vartheta^{(p)})^{-1} X_i\right)^{-1}\sum_{i=1}^{n} X_i^T V_i(\vartheta^{(p)})^{-1} y_i,$

    $\hat b_i^{(p)} = Q(\rho^{(p)})\,Z_i^T V_i(\vartheta^{(p)})^{-1}(y_i - X_i\hat\beta^{(p)}). \qquad (2.11)$


The EM algorithm is now:

1. Calculate start values $\vartheta^{(0)} = (\sigma_\varepsilon^{(0)}, \rho^{(0)})$.

2. For p = 1, 2, ... compute $\hat\delta^{(p)} = (\hat\beta^{(p)}, \hat b_1^{(p)},\dots,\hat b_n^{(p)})$ with the variance-covariance components replaced by their current estimates $\vartheta^{(p)} = (\sigma_\varepsilon^{(p)}, \rho^{(p)})$, together with the current residuals $\hat\varepsilon_i^{(p)} = y_i - X_i\hat\beta^{(p)} - Z_i\hat b_i^{(p)}$ and the posterior covariance matrices $\mathrm{cov}(\varepsilon_i\,|\,y_i;\vartheta^{(p)})$ and $\mathrm{cov}(b_i\,|\,y_i;\vartheta^{(p)})$. This step may be seen as the E-step.

3. Do the M-step to compute the updates with (2.10).

4. If the condition $\|\vartheta^{(p+1)} - \vartheta^{(p)}\| / \|\vartheta^{(p)}\| \le \epsilon$ is fulfilled for a small tolerance ε, convergence of the EM algorithm is achieved. If not, start in step 2 with $\vartheta^{(p+1)}$ as update for $\vartheta^{(p)}$.

More information on the estimation of mixed models via the EM algorithm can be found in Laird & Ware (1982). Later, Lindstrom & Bates (1988) compared the Newton-Raphson method to EM estimation; especially fast algorithmic reparametrizations can be found there. Laird, Lange & Stram (1987) give detailed information on EM estimation and algorithmic acceleration. Alternatively, the gradient algorithm described in Lange (1995), which is closely related to the EM algorithm, can also be used.
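One EM update in this spirit may be sketched as follows (ML variant with $P_i = V_i^{-1}$ as in (2.8); Xs, Zs, ys are generic per-cluster design matrices and responses, and this is not the original implementation):

    import numpy as np

    def em_step(Xs, Zs, ys, sig2, Q):
        n = len(ys)
        N = sum(len(y) for y in ys)
        # E-step quantities: BLUPs and posterior covariances at the current parameters
        Ps = [np.linalg.inv(sig2 * np.eye(len(y)) + Z @ Q @ Z.T) for y, Z in zip(ys, Zs)]
        A = sum(X.T @ P @ X for X, P in zip(Xs, Ps))
        c = sum(X.T @ P @ y for X, P, y in zip(Xs, Ps, ys))
        beta = np.linalg.solve(A, c)                              # eq. (2.11)
        bs = [Q @ Z.T @ P @ (y - X @ beta) for X, Z, y, P in zip(Xs, Zs, ys, Ps)]
        # M-step: averaged updates, eq. (2.10)
        sig2_new = 0.0
        Q_new = np.zeros_like(Q)
        for X, Z, y, P, b in zip(Xs, Zs, ys, Ps, bs):
            r = y - X @ beta - Z @ b
            sig2_new += r @ r + sig2 * np.trace(np.eye(len(y)) - sig2 * P)
            Q_new += np.outer(b, b) + Q - Q @ Z.T @ P @ Z @ Q
        return beta, sig2_new / N, Q_new / n

Iterating em_step until the relative change of the parameters falls below a tolerance reproduces the algorithm above.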

    2.6 Robust Linear Mixed Models

    The marginal distribution of a linear mixed models is

    yi N(Xi, Vi()). (2.12)

    This assumption on the distribution is now replaced by the robust variant as suggested inLange, Roderick, Little & Taylor (1989)

    yi tT (Xi,i(), ), (2.13)

    where tk(,, ) denotes the k-variate t-distribution as given in Cornish (1954), is thescaling matrix which has the function of in the mixed model concept. The additionalparameter must be positive and can be noninteger. It has the funtion of a robustness

  • 2.6 Robust Linear Mixed Models 17

parameter, since it downweights outlying cases. We now set \mu = X\beta. The marginal density is

f(y|\mu, \Sigma, \nu) = |\Sigma|^{-1/2} \frac{\Gamma((\nu+k)/2)}{\Gamma(1/2)^k \, \Gamma(\nu/2) \, \nu^{k/2}} \left( 1 + \frac{(y-\mu)^T \Sigma^{-1} (y-\mu)}{\nu} \right)^{-(\nu+k)/2}.   (2.14)

Important properties of y \sim t_k(\mu, \Sigma, \nu) are:

E(y) = \mu and \operatorname{Cov}(y) = \Sigma^* = \frac{\nu}{\nu-2} \Sigma (for \nu > 2),

b|y \sim \chi^2_{\nu+k} / (\nu + \delta^2) with \delta^2 = (y - \mu)^T \Sigma^{-1} (y - \mu),

\delta^2 / k \sim F_{k,\nu},

where \chi^2_k denotes the chi-square distribution with k degrees of freedom and F_{k,\nu} the F-distribution with k and \nu degrees of freedom. According to these properties, the model can be derived from the multivariate normal distribution with scaling variable b_i,

y_i | b_i \sim N_{T_i}\left( \mu_i, \frac{\Sigma_i(\theta)}{b_i} \right) \quad \text{where } b_i \sim \frac{\chi^2_\nu}{\nu}.   (2.15)

The log-likelihood for model (2.13), ignoring constants, is

l(\beta, \theta, \nu) \propto \sum_{i=1}^n l_i(\beta, \theta, \nu) with

l_i(\beta, \theta, \nu) = -\frac{1}{2} \log|\Sigma_i(\theta)| - \frac{1}{2}(\nu + T_i) \log\left( 1 + \frac{\delta_i^2(\beta, \theta)}{\nu} \right) - \frac{1}{2} T_i \log(\nu) + \log \Gamma\left( \frac{\nu + T_i}{2} \right) - \log \Gamma\left( \frac{\nu}{2} \right).   (2.16)

The likelihood equations for \beta are closely related to the likelihood equations of a linear mixed model. Setting the first derivative of (2.16) with respect to \beta to zero yields

\sum_{i=1}^n w_i X_i^T \Sigma_i(\theta)^{-1} (y_i - \mu_i) = 0   (2.17)

with the weights w_i = \frac{\nu + T_i}{\nu + \delta_i^2}.
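Since the weights w_i depend on \beta through \delta_i^2, equation (2.17) can be solved as a fixed-point iteration (iteratively reweighted least squares). The following Python fragment is a small sketch under the simplifying assumption that \Sigma_i(\theta) and \nu are known; all names are illustrative.

```python
import numpy as np

def irls_beta(y_list, X_list, Sigma_list, nu, n_iter=50, tol=1e-8):
    """Iteratively reweighted least squares for the estimating equation
    (2.17); Sigma_i and nu are assumed known (a simplifying assumption)."""
    p = X_list[0].shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        A, c = np.zeros((p, p)), np.zeros(p)
        for y, X, S in zip(y_list, X_list, Sigma_list):
            S_inv = np.linalg.inv(S)
            r = y - X @ beta
            delta2 = r @ S_inv @ r                # delta_i^2(beta, theta)
            w = (nu + len(y)) / (nu + delta2)     # weight w_i: outlying clusters get w_i < 1
            A += w * (X.T @ S_inv @ X)
            c += w * (X.T @ S_inv @ y)
        beta_new = np.linalg.solve(A, c)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```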


The log-likelihood (2.16) can be maximized via a Fisher scoring algorithm. First we collect all necessary parameters in \gamma = (\beta, \theta, \nu).

One has to rewrite (2.14) as

f(y|\gamma) = |\Sigma(\theta)|^{-1/2} g\left( (y - \mu)^T \Sigma(\theta)^{-1} (y - \mu), \nu \right)   (2.18)

with

g(s, \nu) = \frac{\Gamma((\nu+k)/2)}{\Gamma(1/2)^k \, \Gamma(\nu/2) \, \nu^{k/2}} \left( 1 + \frac{s}{\nu} \right)^{-(\nu+k)/2}.   (2.19)

The first derivatives are

\frac{\partial l(\gamma)}{\partial \beta} = -2 \, \frac{\partial g(\delta^2, \nu)/\partial \delta^2}{g(\delta^2, \nu)} \, X^T \Sigma(\theta)^{-1} (y - \mu),

\frac{\partial l(\gamma)}{\partial \theta_i} = -\frac{1}{2} \operatorname{tr}\left( \Sigma(\theta)^{-1} \frac{\partial \Sigma(\theta)}{\partial \theta_i} \right) - \frac{\partial g(\delta^2, \nu)/\partial \delta^2}{g(\delta^2, \nu)} \, (y - \mu)^T \Sigma(\theta)^{-1} \frac{\partial \Sigma(\theta)}{\partial \theta_i} \Sigma(\theta)^{-1} (y - \mu),

\frac{\partial l(\gamma)}{\partial \nu} = \frac{\partial g(\delta^2, \nu)/\partial \nu}{g(\delta^2, \nu)}.   (2.20)

So one can write

s(\gamma) = \left( \frac{\partial l(\gamma)}{\partial \beta}^T, \frac{\partial l(\gamma)}{\partial \theta}^T, \frac{\partial l(\gamma)}{\partial \nu} \right)^T   (2.21)

with \frac{\partial l(\gamma)}{\partial \theta}^T = \left( \frac{\partial l(\gamma)}{\partial \theta_1}, \dots, \frac{\partial l(\gamma)}{\partial \theta_d} \right).

The elements of the expected Fisher matrix are

F_{\beta\beta} = E\left( -\frac{\partial^2 l(\gamma)}{\partial \beta \partial \beta^T} \right) = \frac{\nu+k}{\nu+k+2} \, X^T \Sigma(\theta)^{-1} X,

F_{\theta_i \theta_j} = E\left( -\frac{\partial^2 l(\gamma)}{\partial \theta_i \partial \theta_j} \right) = \frac{\nu+k}{\nu+k+2} \, \frac{1}{2} \operatorname{tr}\left( \Sigma(\theta)^{-1} \frac{\partial \Sigma(\theta)}{\partial \theta_i} \Sigma(\theta)^{-1} \frac{\partial \Sigma(\theta)}{\partial \theta_j} \right) - \frac{1}{2(\nu+k+2)} \operatorname{tr}\left( \Sigma(\theta)^{-1} \frac{\partial \Sigma(\theta)}{\partial \theta_i} \right) \operatorname{tr}\left( \Sigma(\theta)^{-1} \frac{\partial \Sigma(\theta)}{\partial \theta_j} \right),

F_{\theta_i \nu} = E\left( -\frac{\partial^2 l(\gamma)}{\partial \theta_i \partial \nu} \right) = -\frac{1}{(\nu+k+2)(\nu+k)} \operatorname{tr}\left( \Sigma(\theta)^{-1} \frac{\partial \Sigma(\theta)}{\partial \theta_i} \right),

F_{\nu\nu} = E\left( -\frac{\partial^2 l(\gamma)}{\partial \nu^2} \right) = -\frac{1}{2} \left[ \frac{1}{2} TG\left( \frac{\nu+k}{2} \right) - \frac{1}{2} TG\left( \frac{\nu}{2} \right) + \frac{k}{\nu(\nu+k)} - \frac{1}{\nu+k} + \frac{\nu+2}{\nu(\nu+k+2)} \right],

F_{\beta\theta_i} = E\left( -\frac{\partial^2 l(\gamma)}{\partial \beta \partial \theta_i} \right) = F_{\beta\nu} = E\left( -\frac{\partial^2 l(\gamma)}{\partial \beta \partial \nu} \right) = 0,   (2.22)

where TG(x) = \frac{d^2}{dx^2} \log(\Gamma(x)) is the trigamma function. The partitioned Fisher matrix is

F(\gamma) = \begin{pmatrix} F_{\beta\beta} & 0 & 0 \\ 0 & F_{\theta\theta} & F_{\theta\nu} \\ 0 & F_{\nu\theta} & F_{\nu\nu} \end{pmatrix}.

The log-likelihood (2.16) can be maximized using the Fisher scoring algorithm. Lange, Roderick, Little & Taylor (1989) compare Fisher scoring to EM estimation. Algorithmic details and proofs for the score and Fisher matrix are given in that paper, as well as alternative assumptions on robust linear mixed models.

In the literature, one can find the extension of linear random effects models to semiparametric linear mixed models, see Ishwaran & Takahara (2002) and Zhang & Davidian (2001). The term semiparametric refers here to the unspecified density of the random effects or to a random measure on the random effects. This terminology is often misleading, because semiparametric modeling also refers to additive modeling of continuous covariates, whose underlying structure is deterministic.

For more flexibility in the random effects structure, mixture models have become very popular. A mixture model is obtained by finitely mixing the conditional distribution. For non-Bayesian semiparametric approaches to linear mixed models, see Verbeke & Lesaffre (1996) and Aitkin (1999), who used a finite mixture approach with implementation by the EM algorithm.

A more Bayesian framework is founded on the Dirichlet process as discussed in Ferguson (1973). It is applied to random effects models via the idea of random partition


structures and Chinese restaurant processes (CR processes). Later, Brunner, Chan, James & Lo (2001) extended these ideas to the weighted Chinese restaurant process, which they applied to Bayesian mixture models. Brunner, Chan, James & Lo (1996) provided the general methodology related to i.i.d. weighted Chinese restaurant algorithms (WCR algorithms). Ishwaran & Takahara (2002) combined the i.i.d. WCR algorithm with REML estimates for inference in Laird-Ware random effects models. Naskar, Das & Ibrahim (2005) used the WCR and EM algorithms for survival data.

Chapter 3

    Semi-Parametric Mixed Models

There is an extensive amount of literature on the linear mixed model, starting with Henderson (1953), Laird & Ware (1982) and Harville (1977). Nice overviews, including more recent work, are given in Verbeke & Molenberghs (2001) and McCulloch & Searle (2001). Generally, the influence of covariates is restricted to a strictly parametric form in linear mixed models. In regression models, much work has been done to extend the strict parametric form to more flexible forms of semi- and nonparametric regression, but much less has been done to develop flexible mixed models. For overviews on semiparametric regression models, see Hastie & Tibshirani (1990), Green & Silverman (1994) and Schimek (2000).

A first step to more flexible mixed models is the generalization to additive mixed models where a random intercept is included. With response y_{it} for observation t on individual/cluster i and covariates u_{i1}, \dots, u_{im}, the basic form is

y_{it} = \beta_0 + \alpha^{(1)}(u_{i1}) + \dots + \alpha^{(m)}(u_{im}) + b_{i0} + \varepsilon_{it},   (3.1)

where \alpha^{(1)}(.), \dots, \alpha^{(m)}(.) are unspecified functions of the covariates u_{i1}, \dots, u_{im}, b_{i0} is a subject-specific random intercept with b_{i0} \sim N(0, \sigma_b^2) and \varepsilon_{it} is an additional noise variable. Estimation for this model may be based on the observation that regression models with smooth components may be fitted by mixed model methodology. Speed (1991) indicated that the fitted cubic smoothing spline is a best linear unbiased predictor. Subsequently the approach has been used in several papers to fit mixed models, see e.g. Verbyla, Cullis, Kenward & Welham (1999), Parise, Wand, Ruppert & Ryan (2001), Lin & Zhang (1999), Brumback & Rice (1998), Zhang, Lin, Raz & Sowers (1998) and Wand (2003). Bayesian approaches have also been considered, e.g. by Fahrmeir & Lang (2001), Lang, Adebayo, Fahrmeir & Steiner (2003), Fahrmeir, Kneib & Lang (2004) and Kneib & Fahrmeir (2006).


3.1 Short Review on Splines in Semi-Parametric Mixed Models

In linear mixed models the strictly linear and parametric terms can be extended by incorporating a continuous covariate u which has an additive functional influence in the sense of

y_{it} = x_{it}^T \beta + \alpha(u_{it}) + z_{it}^T b_i + \varepsilon_{it}.   (3.2)

In the following, we consider several ways to approximate an unknown function \alpha(.).

    3.1.1 Motivation: The Interpolation Problem

For simplicity, we first consider the approximation of a function \alpha(.) with known values at the measurement points (knots) u_{it}. Then we start with observations (u_{it}, \alpha(u_{it})), i = 1, \dots, n.

The interpolation problem is given by finding a function s(.) with the property

s(u_{it}) = \alpha(u_{it}), \quad i = 1, \dots, n, \; t = 1, \dots, T_i.

Spline theory is a common way to solve this problem: the function \alpha(.) may be approximated by a spline function s(.). A spline is based on a set of ordered knots K = \{k_1, \dots, k_M\}, k_j \le k_{j+1}, covering the range [k_1, k_M]; the smallest value is k_1.

The spline is of degree d, d \in \mathbb{N}_0, on K if s(.) is d-1 times continuously differentiable and, for every u \in [k_j, k_{j+1}), s(u) is a polynomial of degree d, j = 1, \dots, M-1. A spline on [k_j, k_{j+1}) of degree d (or order d+1) may be represented by

s(u) = a_j^{[d]} u^d + a_j^{[d-1]} u^{d-1} + \dots + a_j^{[1]} u + a_j^{[0]}.

The vector space of splines of degree d for the given knots K is denoted by S_d(K).

Splines interpolate given data points u_{it} and their known function evaluations \alpha(u_{it}) by using piecewise polynomials to connect these data points. Other interpolation strategies are also possible, e.g. trigonometric interpolation or classical polynomial interpolation, but these methods often have numerical disadvantages. In Figure 3.1, such an interpolation problem is given for known pairs (u_{it}, s(u_{it})). Since the data points are the given knots, s(.) is a polynomial between successive u-values. In the case of cubic splines, the first and second derivatives are continuous at the observation points. The interpolation spline is given in Figure 3.2.

The set of all splines of degree d on the knots K is an (M + d - 1)-dimensional subspace of the vector space of d-1 times differentiable functions; write \tilde M = M + d - 1. That is why a spline can be expressed by a set of \tilde M linearly independent basis functions \phi_j(.), j = 1, \dots, \tilde M. So S_d(K) can be described by the spline basis B = \{\phi_1(u), \dots, \phi_{\tilde M}(u)\}.

Figure 3.1: Interpolation problem. The values u_{it} are on the x-axis and the corresponding s(u_{it}) on the y-axis. These observed points should be part of a continuous function s(.).

A function \alpha(u) with u \in [a, b] may be approximated by a spline s(u) using basis functions so that

\alpha(u) \approx s(u) = \sum_{j=1}^{\tilde M} \phi_j(u) \alpha_j = \phi(u)^T \alpha.

The problem in (3.2) for known function evaluations \alpha(u_{it}) can be written as

y_{it} = x_{it}^T \beta + \phi(u_{it})^T \alpha + z_{it}^T b_i + \varepsilon_{it},   (3.3)

since s(u_{it}) = \phi(u_{it})^T \alpha is the spline interpolation of \alpha(u_{it}).

    3.1.2 Popular Basis Functions

Truncated Power Series  One basis for S_d(K) for a given set of knots K with degree d is

\phi_1(u) = 1, \; \phi_2(u) = u, \; \dots, \; \phi_{d+1}(u) = u^d,
\phi_{d+2}(u) = (u - k_2)_+^d, \; \dots, \; \phi_{d+M-1}(u) = (u - k_{M-1})_+^d,   (3.4)

Figure 3.2: Spline solution of the interpolation problem. The solid points are the given points, the blue line is the spline function (interpolation spline).

where

(u - k_j)_+^d = \begin{cases} (u - k_j)^d & \text{if } u \ge k_j \\ 0 & \text{else.} \end{cases}

The basis is B = \{\phi_1(u), \dots, \phi_{\tilde M}(u)\}.
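As an illustration, the basis (3.4) can be evaluated columnwise. The following Python fragment is a sketch; the function name and the example data are hypothetical.

```python
import numpy as np

def truncated_power_basis(u, knots, d):
    """Design columns of the truncated power series basis (3.4):
    polynomial part 1, u, ..., u^d plus one truncated power (u - k)_+^d
    per inner knot k_2, ..., k_{M-1}."""
    u = np.asarray(u, dtype=float)
    poly = np.vstack([u ** j for j in range(d + 1)])      # 1, u, ..., u^d
    trunc = np.vstack([np.where(u >= k, (u - k) ** d, 0.0)
                       for k in knots[1:-1]])             # (u - k_j)_+^d
    return np.vstack([poly, trunc]).T                     # N x (M + d - 1) matrix

# Example: cubic basis on five knots, evaluated at ten points
Phi = truncated_power_basis(np.linspace(-3, 3, 10), np.linspace(-3, 3, 5), d=3)
```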

B-splines  B-splines of degree d with M inner knots in the range [a, b] can be recursively defined by the "de Boor" recursion for B-splines,

\phi_j^0(u) = I_{[k_j, k_{j+1})}(u) = \begin{cases} 1 & \text{if } k_j \le u < k_{j+1} \\ 0 & \text{else,} \end{cases}

\phi_j^d(u) = \frac{u - k_j}{k_{j+d} - k_j} \, \phi_j^{d-1}(u) + \frac{k_{j+d+1} - u}{k_{j+d+1} - k_{j+1}} \, \phi_{j+1}^{d-1}(u).   (3.5)

For the construction of the B-spline basis with the recursion (3.5), outer knots are necessary, i.e. d additional knots below k_1 and d additional knots above k_M.
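The recursion (3.5) translates directly into code. The following Python fragment is a minimal sketch in which the outer knots are constructed explicitly for an equidistant grid; all names are illustrative.

```python
import numpy as np

def bspline_basis(u, knots, d):
    """All B-spline basis functions of degree d at the scalar u, computed
    with the recursion (3.5); `knots` must already include the d outer
    knots on each side of the range."""
    k = np.asarray(knots, dtype=float)
    # degree 0: indicator functions on [k_j, k_{j+1})
    B = [1.0 if k[j] <= u < k[j + 1] else 0.0 for j in range(len(k) - 1)]
    for deg in range(1, d + 1):
        B = [((u - k[j]) / (k[j + deg] - k[j]) * B[j] if k[j + deg] > k[j] else 0.0)
             + ((k[j + deg + 1] - u) / (k[j + deg + 1] - k[j + 1]) * B[j + 1]
                if k[j + deg + 1] > k[j + 1] else 0.0)
             for j in range(len(B) - 1)]
    return np.array(B)

# Example: cubic B-splines on equidistant inner knots in [0, 6]
inner = np.linspace(0.0, 6.0, 8)
h = inner[1] - inner[0]
knots = np.concatenate([inner[0] - h * np.arange(3, 0, -1), inner,
                        inner[-1] + h * np.arange(1, 4)])
phi = bspline_basis(2.5, knots, d=3)   # evaluations phi_j^3(2.5); they sum to 1
```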


3.1.3 Motivation: Splines and the Concept of Penalization - Smoothing Splines

A main problem in statistics is that the function evaluations \alpha(u_{it}) are not observable from the data. Instead, only the response y_{it} is observable, which is the sum of the unknown value \alpha(u_{it}) and \varepsilon_{it}. These values have to be estimated from the data.

In this subsection, all observations are taken as knots (M = N), so K is the set of ordered observations u_{it}. It is also possible that an (equidistant) grid of knots is given; this is a useful condition, especially in the regression spline context. That is why M is used instead of N.

We change the cluster representation of the simple semi-parametric model (3.2),

y_{it} = x_{it}^T \beta + \phi(u_{it})^T \alpha + z_{it}^T b_i + \varepsilon_{it},

which in matrix form is

Y = X\beta + \Phi\alpha + Zb + \varepsilon,   (3.6)

to the elementwise measurement representation

y_{(i)} = x_{(i)}^T \beta + \phi(u_{(i)})^T \alpha + z_{(i)}^T b + \varepsilon_{(i)},

where .(i) indicates the i-th row vector of the matrices Z, X or \Phi, .(i) indicates the i-th element of the vectors Y, u, \varepsilon, and \phi(u_{(i)})^T = (\phi_1(u_{(i)}), \dots, \phi_{\tilde M}(u_{(i)})) is the basis function evaluation of u_{(i)}.

The job of spline smoothing is primarily to find a good estimate \hat s(u_{it}) = \hat\alpha^T \phi(u_{it}) for the unknown function evaluations \alpha(u_{it}) \approx s(u_{it}). The difficulty with equation (3.6) is that it is not identifiable, since dim(y) = N while dim(\alpha) = \tilde N = N + d - 1 > N. To solve this problem, further restrictions on the estimation concept have to be made. Since equation (3.6) is the equation of a mixed model with assumptions \varepsilon \sim N(0, \sigma_\varepsilon^2 I) and b \sim N(0, Q(\rho)), the estimates for the fixed effects \beta, \alpha and the structural parameters \theta^T = (\sigma_\varepsilon^2, \rho) would normally be obtained by ML or REML in cases where dim(y) > dim((\beta^T, \alpha^T)^T).

That is why the idea of penalizing the roughness of estimated curves was born. The roughness of a curve s(.) is controlled by a penalty term and a smoothing parameter \lambda. Consider the maximization problem for \beta, \alpha, \theta,

\sum_{i=1}^N l_{(i)}(\beta, \alpha, \theta) - \frac{\lambda}{2} \int (\alpha''(u))^2 du \;\rightarrow\; \max,   (3.7)

where l_{(i)}(\beta, \alpha, \theta) is the likelihood contribution of observation y_{(i)}. It is assumed that \alpha(.) has continuous first and second derivatives \alpha'(.) and \alpha''(.), with \alpha''(.) quadratically integrable. That is, in detail, the function class described by the Sobolev space (see Alt (1985) and Adams (1975)). In other words, \alpha(.) can be approximated by a spline function s_\lambda(.) that depends on the smoothing parameter \lambda. To show the dependence on \lambda, let s_\lambda(u) be the spline function which results from the optimization problem in formula (3.7) for given \lambda. Here l(\beta, \alpha, \theta) = \sum_{i=1}^N l_{(i)}(\beta, \alpha, \theta) is the marginal likelihood for model (3.6).

The bias between \alpha(u) and s_\lambda(u) increases for large \lambda. The principal trade-off between the bias and the variance of the estimate is reflected by the mean squared error

E(s_\lambda(u) - \alpha(u))^2 = var(s_\lambda(u)) + [E(s_\lambda(u)) - \alpha(u)]^2 = var(s_\lambda(u)) + bias(s_\lambda(u), \alpha(u))^2.

In other words, large values of \lambda lead to underfitting, too small values to overfitting. Finding an optimal \lambda and an optimal spline s_\lambda is a difficult statistical problem.

The maximization problem (3.7) may be solved by a natural cubic smoothing spline as described in Reinsch (1967), without the need of a basis function representation of s(.). The concept of Reinsch (1967) and De Boor (1978) can be transferred to the mixed model methodology, where solving (3.7) for given \lambda is equivalent to

\sum_{i=1}^N l_{(i)}(\beta, \alpha, \theta) - \frac{\lambda}{2} s^T K s \;\rightarrow\; \max,   (3.8)

where y_{(i)} = x_{(i)}^T \beta + s(u_{(i)}) + z_{(i)}^T b + \varepsilon_{(i)} and s^T = (s(k_1), \dots, s(k_M)). For details on the cubic splines s(.) and the penalty matrix K see Fahrmeir & Tutz (2001).

    3.1.4 Motivation: The Concept of Regression Splines

The basic idea of regression splines is to work with only a small number of knots (M \ll N). The crucial issues are then how many knots to use and where to place them. For a detailed discussion of these issues see Friedman & Silverman (1989), Wand (2000) or Stone, Hansen, Kooperberg & Truong (1997). Another idea is to take an (equidistant) fixed grid of knots. That is the main difference to smoothing splines, since the knots are chosen individually (how many knots, and the range of the interval the knots come from). A spline function for a fixed grid without penalization is visualized in Figure 3.3. For this figure and Figure 3.4, a random effects model with only one smooth covariate (\alpha(u) = \sin(u)) was used: forty clusters with five repeated measurements each were simulated, the random effect was assumed to be N(0, \sigma_b^2) with \sigma_b^2 = 2, and the error term was assumed to be N(0, \sigma_\varepsilon^2) with \sigma_\varepsilon^2 = 2. In both figures, the concept of regression splines was used. In Figure 3.3, the smoothing parameter was set to zero; in Figure 3.4, it was set to sixty. It is obvious that the roughness of the curve has to be penalized. On the other hand, a spline function is desired that is independent of the placement and the number of knots.

Figure 3.3: The red points describe the given data; the blue line is a spline computed with 40 knots and B-splines of degree 3 (no penalization).

Again, the penalization problem for given \lambda is

\sum_{i=1}^N l_{(i)}(\beta, \alpha, \theta) - \lambda P(s(.)) \;\rightarrow\; \max,   (3.9)

where s(u) has the functional form s(u) = \sum_{j=1}^{\tilde M} \phi_j(u) \alpha_j and P(s(.)) is a roughness functional. A widely used roughness functional is the integrated squared second derivative

Figure 3.4: The spline in the figure is the optimal spline with respect to the penalty and \lambda. Using the penalty term reduces the influence of the knots. Here 40 knots and B-splines of degree 3 were used.

P(s(.)) = \int (s''(u))^2 du as a measure for the curvature or the roughness of the function s(.). If the basis function representation of a spline is used, the penalty term has the form

P(s(.)) = \alpha^T K \alpha   (3.10)

where K is a matrix with entries K_{ij} = \int \phi_i''(u) \phi_j''(u) du.

Eilers & Marx (1996) introduced a penalty term in which adjacent B-spline coefficients are connected to each other in a very distinct way. The penalty term is based on K = (D_l)^T D_l, where D_l is a contrast matrix of order l which contrasts polynomials of order l. Using B-splines with penalization K, one penalizes the differences between adjacent coefficients in the form \alpha^T K \alpha = \sum_j (\Delta^l \alpha_j)^2, where \Delta is the difference operator with \Delta\alpha_j = \alpha_{j+1} - \alpha_j, \Delta^2\alpha_j = \Delta(\Delta\alpha_j), etc.; for details see Eilers & Marx (1996). Usually the order of the penalized differences is the same as the order of the spline (B-spline). In Figure 3.4, one can see that penalization reduces the influence of the knots, which has an effect on the roughness of the curve. So penalization also reduces the influence of the number and placement of the knots: another number of knots with different placements would deliver a quite similar spline function.

The difference matrix D is needed to compute the difference matrix of the l-th order, D_l, corresponding to the B-spline penalization (see Eilers & Marx (1996)) in a recursive scheme. With D being the (M-1) \times M contrast matrix

D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix},

one obtains higher order differences by the recursion D_l = D D_{l-1}, which is an (M-l) \times M matrix. This can be used for a simpler and more intuitive definition of the penalty than equation (3.10).
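In code, the l-th order difference matrix D_l and the resulting penalty K can be built without an explicit recursion; the following Python lines are a small sketch of this construction.

```python
import numpy as np

# Difference penalty K = D_l^T D_l for P-splines (Eilers & Marx, 1996).
# np.diff applied to the identity matrix yields the l-th order contrast
# matrix D_l of dimension (M - l) x M, so no explicit recursion is needed.
M, l = 10, 2                    # M coefficients, second-order differences
D_l = np.diff(np.eye(M), n=l, axis=0)
K = D_l.T @ D_l                 # penalty alpha^T K alpha = sum_j (Delta^l alpha_j)^2
```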

A similar argumentation is used for the truncated power series basis, where the penalty matrix is simply set to

K = \text{bdiag}(0_{(d) \times (d)}, I_{M-d}),

where 0_{(d) \times (d)} is the d-dimensional quadratic zero matrix and I_{M-d} is the identity matrix of dimension (M-d).

3.1.5 Identification Problems: The Need of a Semi-Parametric Representation

Problems in additive modeling based on splines arise if intercepts or splines for other covariates are used. If no further restriction on the splines is made, the resulting splines are not clearly identifiable. This is illustrated in the following example.

Example 3.1: Rewriting an additive term to a semi-parametric term
One can write the additive term without parametric terms,

\eta(u) = 10 + u^2, \quad \text{for } u \in [-3, 3],

in a semi-parametric representation

\eta(u) = \beta_0 + \alpha(u) = 10 + u^2, \quad \text{for } u \in [-3, 3],

with \beta_0 = 10 and \alpha(u) = u^2. But \beta_0 = 5 and \alpha(u) = 5 + u^2 is also a valid semi-parametric parametrization of the additive term \eta(u).

The interest is often in the population mean level and in the absolute deviations from this mean as a function of a continuous covariate. It is a natural idea to center the continuous covariate around zero. Figure 3.5 shows the difference in the interpretation. A nice benefit of this restriction is that the semi-parametric representation is identifiable. □

Figure 3.5: (a) may be seen as pure additive spline smoothing; (b) shows that \alpha(u) describes the absolute deviation from zero. In this case the level is \beta_0 = \int_a^b \eta(u) du / (b - a). The desired property of the additive term \alpha(u) is \int_a^b \alpha(u) du = 0.

Two or more additive terms also have to be rewritten to semi-parametric terms, since additive terms should be identifiable. This is illustrated in the following example.

Example 3.2: Rewriting additive terms to semi-parametric terms
One can write the additive terms

\eta^{(1)}(u) = 10 + u^2, \quad \text{for } u \in [-3, 3] = [a^{(1)}, b^{(1)}],

\eta^{(2)}(v) = 5 + v^3, \quad \text{for } v \in [-3, 3] = [a^{(2)}, b^{(2)}],

as the additive predictor

\eta_{add}(u, v) = \eta^{(1)}(u) + \eta^{(2)}(v).

But \tilde\eta^{(1)}(u) = 15 + u^2 and \tilde\eta^{(2)}(v) = v^3 correspond to the same additive predictor, since \eta_{add}(u, v) = \tilde\eta^{(1)}(u) + \tilde\eta^{(2)}(v). Using the same idea as described in Example 3.1, one gets identifiable additive terms by the reparametrization of the additive predictor into semi-parametric terms

\eta_{add}(u, v) = \beta_0 + \alpha^{(1)}(u) + \alpha^{(2)}(v)

with the properties \int_{a^{(1)}}^{b^{(1)}} \alpha^{(1)}(u) du = 0, \int_{a^{(2)}}^{b^{(2)}} \alpha^{(2)}(v) dv = 0 and \beta_0 = \int_{a^{(1)}}^{b^{(1)}} \eta^{(1)}(u) du / (b^{(1)} - a^{(1)}) + \int_{a^{(2)}}^{b^{(2)}} \eta^{(2)}(v) dv / (b^{(2)} - a^{(2)}). □

So the basic idea for identifiable additive terms \alpha(u) is to rewrite an additive term \eta(u) in a semi-parametric form with \int_a^b \alpha(u) du = 0. Thus \alpha(u) has the form

\alpha(u) = \eta(u) - \int_a^b \frac{\eta(u)}{b - a} du.   (3.11)


Using simple analysis, one can show that equation (3.11) satisfies \int_a^b \alpha(u) du = 0. Since \alpha(u) is often approximated by a spline function that is composed of basis functions, \alpha(u) \approx \alpha^T \phi(u) with \phi^T(u) = (\phi_1(u), \dots, \phi_M(u)), the discrete version using the coefficients of the basis functions can also be used to get regularized coefficients \tilde\alpha^T = (\tilde\alpha_1, \dots, \tilde\alpha_M) with the restriction \sum_{j=1}^M \tilde\alpha_j = 0. A regularized version may be obtained by

\tilde\alpha_j = \alpha_j - \frac{1}{M} \sum_{j=1}^M \alpha_j, \quad \text{or alternatively} \quad \tilde\alpha_M = -\sum_{j=1}^{M-1} \tilde\alpha_j.

The term \sum_{j=1}^M \alpha_j / M is often understood as a shift in the level of the function \alpha(u). For detailed information on these restrictions, see Appendix A.1.
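The centering restriction amounts to one line of code; a small sketch in Python with hypothetical coefficients:

```python
import numpy as np

alpha = np.array([2.0, -1.0, 4.0, 3.0])   # hypothetical basis coefficients
alpha_tilde = alpha - alpha.mean()        # centered version, sum(alpha_tilde) = 0
shift = alpha.mean()                      # level shift absorbed into the intercept beta_0
```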

    3.1.6 Singularities: The Need of a Regularization of Basis Functions

Let a semi-parametrization be given by

y_{(i)} = x_{(i)}^T \beta + \sum_{j=1}^M \phi_j(u_{(i)}) \alpha_j + z_{(i)}^T b + \varepsilon_{(i)} = \left[ 1 \;\; x_{(i)} \;\; \phi(u_{(i)})^T \right] (\tilde\beta^T, \alpha^T)^T + z_{(i)}^T b + \varepsilon_{(i)},

where \tilde x_{(i)}^T = (1, x_{(i)}) and \tilde\beta^T = (\beta_0, \beta^T). For the truncated power series basis, the column corresponding to the intercept is absolutely linearly dependent on \phi_1(u) = 1. The rows of the design matrix for the fixed effects can be written as

\left[ 1 \;\; x_{(i)} \;\; 1 \;\; \phi_2(u_{(i)}) \; \dots \; \phi_M(u_{(i)}) \right].

It is obvious that the design matrix does not have full rank. The same problem also affects B-splines. A B-spline basis is a specific decomposition of 1: a main property of B-splines is that \sum_{j=1}^M \phi_j(u) = 1. There exists the linear combination \left[ 1 \;\; x_{(i)} \;\; \sum_{j=1}^M \phi_j(u_{(i)}) \right] of the design matrix, which shows once again the rank deficiency of the design matrix.

The same problem arises when more than one additive function is used. For the truncated power series, one has the columns \phi_1^{(k)} = 1 and \phi_1^{(l)} = 1 in the design matrix for additive components k and l. For B-splines, the corresponding sums \sum_{j=1}^M \phi_j^{(k)} = 1 and \sum_{j=1}^M \phi_j^{(l)} = 1 are absolutely linearly dependent.

To remove these singularities, specific transformations T must be applied to \phi(u) with

\tilde\phi(u) = T\phi(u),

such that the design matrix built from \tilde\phi(u) has full rank. Generally these transformations also affect the penalty matrix K that occurs in the penalized likelihood function. See Appendix A.1 for a detailed discussion of these transformations.


    3.2 The Model

Let the data be given by (y_{it}, x_{it}, u_{it}, z_{it}), i = 1, \dots, n, t = 1, \dots, T_i, where y_{it} is the response for observation t within cluster i, and x_{it}^T = (x_{it1}, \dots, x_{itp}), u_{it}^T = (u_{it1}, \dots, u_{itm}), z_{it}^T = (z_{it1}, \dots, z_{its}) are vectors of covariates which may vary across clusters and observations. The semi-parametric mixed model that is considered in the following has the general form

y_{it} = x_{it}^T \beta + \sum_{j=1}^m \alpha^{(j)}(u_{itj}) + z_{it}^T b_i + \varepsilon_{it} = \eta_{it}^{par} + \eta_{it}^{add} + \eta_{it}^{rand} + \varepsilon_{it}   (3.12)

where

\eta_{it}^{par} = x_{it}^T \beta is a linear parametric term,

\eta_{it}^{add} = \sum_{j=1}^m \alpha^{(j)}(u_{itj}) is an additive term with the unspecified influence functions \alpha^{(1)}, \dots, \alpha^{(m)},

\eta_{it}^{rand} = z_{it}^T b_i contains the cluster-specific random effect b_i, b_i \sim N(0, Q(\rho)), where Q(\rho) is a parameterized covariance matrix, and

\varepsilon_{it} is the noise variable, \varepsilon_{it} \sim N(0, \sigma_\varepsilon^2), with \varepsilon_{it}, b_i independent.

In spline methodology, the unknown functions \alpha^{(j)} are approximated by basis functions. A simple basis is known as the truncated power series basis of degree d, yielding

\alpha^{(j)}(u) = \beta_0^{(j)} + \beta_1^{(j)} u + \dots + \beta_d^{(j)} u^d + \sum_{s=1}^M \alpha_s^{(j)} (u - k_s^{(j)})_+^d,

where k_1^{(j)} < \dots < k_M^{(j)} are distinct knots. More generally, one uses

\alpha^{(j)}(u) = \sum_{s=1}^M \alpha_s^{(j)} \phi_s^{(j)}(u) = \alpha_j^T \phi^{(j)}(u),   (3.13)

where \phi_s^{(j)} denotes the s-th basis function for variable j, \alpha_j^T = (\alpha_1^{(j)}, \dots, \alpha_M^{(j)}) are unknown parameters and \phi^{(j)}(u)^T = (\phi_1^{(j)}(u), \dots, \phi_M^{(j)}(u)) represents the vector-valued evaluation of the basis functions.

For semi- and nonparametric regression models, Eilers & Marx (1996) and Marx & Eilers


(1998) proposed the numerically stable B-splines, which have also been used by Wood (2004). For further investigation of basis functions, see also Wand (2000) and Ruppert & Carroll (1999). By collecting the observations within one cluster, the model has the form

y_i = X_i \beta + \Phi_{i1} \alpha_1 + \dots + \Phi_{im} \alpha_m + Z_i b_i + \varepsilon_i,

\begin{pmatrix} \varepsilon_i \\ b_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\varepsilon^2 I & 0 \\ 0 & Q(\rho) \end{pmatrix} \right),   (3.14)

where X_i \beta contains the linear term, \Phi_{ij} \alpha_j represents one additive term and Z_i b_i the random term. Vectors and matrices are given by y_i^T = (y_{i1}, \dots, y_{iT_i}), X_i^T = (x_{i1}, \dots, x_{iT_i}), \Phi_{ij}^T = (\phi^{(j)}(u_{i1j}), \dots, \phi^{(j)}(u_{iT_ij})), Z_i^T = (z_{i1}, \dots, z_{iT_i}), \varepsilon_i^T = (\varepsilon_{i1}, \dots, \varepsilon_{iT_i}). In the case of the truncated power series, the "fixed" term \beta_0^{(j)} + \beta_1^{(j)} u + \dots + \beta_d^{(j)} u^d is taken into the linear term X_i \beta without specifying X_i and \beta explicitly. In matrix form one obtains

y = X\beta + \Phi_{.1} \alpha_1 + \dots + \Phi_{.m} \alpha_m + Zb + \varepsilon,

where y^T = (y_1^T, \dots, y_n^T), b^T = (b_1^T, \dots, b_n^T), \varepsilon^T = (\varepsilon_1^T, \dots, \varepsilon_n^T), X^T = (X_1^T, \dots, X_n^T), \Phi_{.j}^T = (\Phi_{1j}^T, \dots, \Phi_{nj}^T) and Z = \text{Diag}(Z_1, \dots, Z_n) is the block-diagonal random effects design matrix.

Parameters to be estimated are the fixed effects, which are collected in \delta^T = (\beta^T, \alpha_1^T, \dots, \alpha_m^T), and the variance-specific parameters \theta^T = (\sigma_\varepsilon, \rho^T) which determine the covariances \operatorname{cov}(\varepsilon_i) = \sigma_\varepsilon^2 I_{T_i} and \operatorname{cov}(b_i) = Q(\rho). In addition, one wants to estimate the random effects b_i. Since b_i is a random variable, the latter is often called prediction rather than estimation. We set X_{i.} = [X_i, \Phi_{i1}, \dots, \Phi_{im}].

    3.3 Penalized Maximum Likelihood Approach

Starting from the marginal version of the model

y_i = X_i \beta + \Phi_{i1} \alpha_1 + \dots + \Phi_{im} \alpha_m + \varepsilon_i^* \quad \text{or} \quad y_i = X_{i.} \delta + \varepsilon_i^*,   (3.15)

\varepsilon_i^* \sim N(0, V_i(\theta)), \quad V_i(\theta) = \sigma_\varepsilon^2 I_{T_i} + Z_i Q(\rho) Z_i^T,

estimates for \delta may be based on the penalized log-likelihood

l_p(\delta; \theta) = -\frac{1}{2} \sum_{i=1}^n \log(|V_i(\theta)|) - \sum_{i=1}^n \frac{1}{2} (y_i - X_{i.}\delta)^T (V_i(\theta))^{-1} (y_i - X_{i.}\delta) - \frac{1}{2} \delta^T K \delta,   (3.16)


where \delta^T K \delta is a penalty term which penalizes the coefficients \alpha_1, \dots, \alpha_m. For the truncated power series, an appropriate penalty matrix is given by

K = \text{Diag}(0, \lambda_1 I, \dots, \lambda_m I),

where I denotes the identity matrix and \lambda_j steers the smoothness of the function \alpha^{(j)}. For \lambda_j \rightarrow \infty, a polynomial of degree d is fitted. P-splines (Eilers & Marx, 1996) use K_j = \lambda_j D^T D, where D is a matrix that builds the differences between adjacent parameters, yielding the penalty \delta^T K \delta = \sum_j \lambda_j \sum_s (\alpha_{s+1}^{(j)} - \alpha_s^{(j)})^2, or higher differences. From the derivative of l_p(\delta, \theta), one obtains the estimation equation \partial l_p(\delta, \theta)/\partial \delta = 0, which yields

\sum_{i=1}^n X_{i.}^T (V_i(\theta))^{-1} y_i = \left( \sum_{i=1}^n X_{i.}^T (V_i(\theta))^{-1} X_{i.} + K \right) \hat\delta

and therefore

\hat\delta = \left( \sum_{i=1}^n X_{i.}^T (V_i(\theta))^{-1} X_{i.} + K \right)^{-1} \sum_{i=1}^n X_{i.}^T (V_i(\theta))^{-1} y_i,

which depends on the variance parameters \theta. It is well known that maximization of the log-likelihood with respect to \theta yields biased estimates, since maximum likelihood does not take into account that fixed parameters have been estimated (see Patterson & Thompson (1974)). The same holds for the penalized log-likelihood (3.16). Therefore, for the estimation of variance parameters, restricted maximum likelihood (REML) estimates are often preferred, which are based on the log-likelihood

l_r(\delta, \theta) = -\frac{1}{2} \sum_{i=1}^n \log(|V_i(\theta)|) - \frac{1}{2} \sum_{i=1}^n (y_i - X_{i.}\delta)^T V_i(\theta)^{-1} (y_i - X_{i.}\delta) - \frac{1}{2} \sum_{i=1}^n \log(|X_{i.}^T V_i(\theta)^{-1} X_{i.}|),

see Harville (1974), Harville (1977) and Verbeke & Molenberghs (2001). The restricted log-likelihood differs from the log-likelihood by an additional component. One has

l_r(\delta, \theta) = l(\delta, \theta) - \frac{1}{2} \sum_{i=1}^n \log(|X_{i.}^T V_i(\theta)^{-1} X_{i.}|).

It should be noted that for the estimation of \theta, the penalization term \delta^T K \delta may be omitted since it has no effect. Details on REML are given in the Appendix.


BLUP Estimates
Usually one also wants estimates of the random effects. Best linear unbiased prediction (BLUP) is a framework to obtain estimates for \delta and b_1, \dots, b_n for given variance components. There are several ways to motivate BLUP (see Robinson (1991)). One way is to consider the joint density of y and b, which is normal, and maximize with respect to \delta and b. By adding the penalty term \delta^T K \delta, one has to minimize

\sum_{i=1}^n \frac{1}{\sigma_\varepsilon^2} (y_i - X_{i.}\delta - Z_i b_i)^T (y_i - X_{i.}\delta - Z_i b_i) + b_i^T Q(\rho)^{-1} b_i + \delta^T K \delta,   (3.17)

where X_{i.} = [X_i, \Phi_{i1}, \dots, \Phi_{im}] and \tilde Q(\rho) = \text{Diag}(Q(\rho), \dots, Q(\rho)). With X_. = [X, \Phi_{.1}, \dots, \Phi_{.m}], the criterion (3.17) may be rewritten as

\frac{1}{\sigma_\varepsilon^2} (y - X_.\delta - Zb)^T (y - X_.\delta - Zb) + b^T \tilde Q(\rho)^{-1} b + \delta^T K \delta,

which yields the "ridge regression" solution

\begin{pmatrix} \hat\delta \\ \hat b \end{pmatrix} = \left( C^T \frac{1}{\sigma_\varepsilon^2} I \, C + B \right)^{-1} C^T \frac{1}{\sigma_\varepsilon^2} I \, y

with C = (X_., Z) and

B = \begin{pmatrix} K & 0 \\ 0 & \tilde Q(\rho)^{-1} \end{pmatrix}.

Some matrix derivation shows that \hat\delta has the form

\hat\delta = (X_.^T V(\theta)^{-1} X_. + K)^{-1} X_.^T V(\theta)^{-1} y,

where V(\theta) = \text{Diag}(V_1(\theta), \dots, V_n(\theta)), and for the vector of random coefficients b^T = (b_1^T, \dots, b_n^T) one obtains

\hat b = \tilde Q(\rho) Z^T V(\theta)^{-1} (y - X_. \hat\delta).

In simpler form, the BLUP estimates are given by

\hat\delta = \left( \sum_{i=1}^n X_{i.}^T (V_i(\theta))^{-1} X_{i.} + K \right)^{-1} \sum_{i=1}^n X_{i.}^T (V_i(\theta))^{-1} y_i,

\hat b_i = Q(\rho) Z_i^T V_i(\theta)^{-1} (y_i - X_{i.} \hat\delta).
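For given variance parameters \theta and smoothing parameters \lambda_1, \dots, \lambda_m, these BLUP formulas can be evaluated directly. The following Python fragment is a sketch; the argument names (in particular the penalty matrix K_pen) are assumptions for illustration, not the thesis' implementation.

```python
import numpy as np

def blup_estimates(y_list, Xdot_list, Z_list, V_list, K_pen, Q):
    """Penalized BLUP: delta-hat and b-hat_i for given variance parameters,
    following the closed forms above.  X_{i.} already contains the basis
    columns; K_pen = Diag(0, lambda_1 I, ...) carries the smoothing parameters."""
    p = Xdot_list[0].shape[1]
    A, c = K_pen.copy(), np.zeros(p)          # accumulate sum X^T V^-1 X + K
    V_inv = [np.linalg.inv(V) for V in V_list]
    for X, y, Vi in zip(Xdot_list, y_list, V_inv):
        A += X.T @ Vi @ X
        c += X.T @ Vi @ y
    delta = np.linalg.solve(A, c)             # delta-hat
    b = [Q @ Z.T @ Vi @ (y - X @ delta)       # b-hat_i = Q Z_i^T V_i^-1 residual
         for X, y, Z, Vi in zip(Xdot_list, y_list, Z_list, V_inv)]
    return delta, b
```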


    3.4 Mixed Model Approach to Smoothing

It is necessary to specify the smoothing parameters \lambda_1, \dots, \lambda_m for the computation of \hat\delta. Using cross-validation techniques, problems arise if the number of smooth covariates is high. An approach that works for a moderate number of smooth covariates uses the ML or REML estimates of variance components. The basic concept is to reformulate the estimation problem as a more general mixed model. Let us consider again the criterion for the BLUP estimates (3.17), which has the form

\frac{1}{\sigma_\varepsilon^2} (y - X\beta - \Phi\alpha - Zb)^T (y - X\beta - \Phi\alpha - Zb) + \alpha^T K \alpha + b^T \tilde Q(\rho)^{-1} b

= \frac{1}{\sigma_\varepsilon^2} (y - X\beta - \Phi\alpha - Zb)^T (y - X\beta - \Phi\alpha - Zb) + (\alpha^T \; b^T) \begin{pmatrix} K & 0 \\ 0 & \tilde Q(\rho)^{-1} \end{pmatrix} \begin{pmatrix} \alpha \\ b \end{pmatrix}   (3.18)

where \Phi = [\Phi_{.1}, \dots, \Phi_{.m}] and K for the truncated power series has the form K = \text{Diag}(\lambda_1 I, \dots, \lambda_m I).

    Thus (3.18) corresponds to the BLUP criterion of the mixed model

y = X\beta + \begin{bmatrix} \Phi & Z \end{bmatrix} \begin{pmatrix} \alpha \\ b \end{pmatrix} + \varepsilon

with

\begin{pmatrix} \alpha \\ b \\ \varepsilon \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K^{-1} & 0 & 0 \\ 0 & \tilde Q(\rho) & 0 \\ 0 & 0 & \sigma_\varepsilon^2 I \end{pmatrix} \right).

Since K = \text{Diag}(\lambda_1 I, \dots, \lambda_m I), the smoothing parameters \lambda_1, \dots, \lambda_m correspond to the inverse variances of the random effects \alpha_1, \dots, \alpha_m, for which \operatorname{cov}(\alpha_j) = \lambda_j^{-1} I is assumed. Thus \alpha_1, \dots, \alpha_m are treated as random effects for the purpose of estimation. REML estimates yield \hat\lambda_1, \dots, \hat\lambda_m. For alternative basis functions like B-splines, some reformulation is necessary to obtain the simple independence structure.