Post on 28-Jun-2022

FLEXIBLE REGRESSION THROUGH GENERALIZED ADDITIVE MODELS (GAM) - USC

Carmen María Cadarso Suárez

1) We have seen that Generalized Linear Models (GLM) extend linear models (LM) by allowing several types of response besides the continuous one:

- Binary (absence/presence)
- Poisson (counts), ...

In R, they are fitted with the function glm().

2) GLMs assume that the effects of the continuous covariates on the response are LINEAR.

Generalized Additive Models (GAM) extend GLMs by assuming that the effects of the continuous covariates are unknown but "smooth".

They can be fitted with several packages written in R and in other languages.
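The contrast between the two model classes can be sketched in a few lines of R. This is only an illustration on simulated data: the variable names (`x`, `y`, `dat`) and the sine-shaped effect are assumptions made for the example, not part of the original slides.

```r
library(mgcv)  # provides gam() and s()

set.seed(1)
n <- 300
x <- runif(n)
# Binary response whose log-odds depend on x in a non-linear way (assumed shape)
eta <- 2 * sin(2 * pi * x)
y <- rbinom(n, size = 1, prob = plogis(eta))
dat <- data.frame(x = x, y = y)

# GLM: the effect of x is forced to be linear on the logit scale
fit_glm <- glm(y ~ x, family = binomial, data = dat)

# GAM: the effect of x is an unknown smooth function estimated from the data
fit_gam <- gam(y ~ s(x), family = binomial, data = dat)

# Compare the two fits and inspect the effective degrees of freedom of s(x)
AIC(fit_glm, fit_gam)
summary(fit_gam)$edf
```

Only the right-hand side of the formula changes: `x` becomes `s(x)`, while the family and link are specified exactly as in `glm()`.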

BIBLIOGRAPHY and "GAM" SOFTWARE

Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman-Hall, 1990.
In R: gam

Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press, 2003.
In R: SemiPar

Wood SN. Generalized Additive Models: An Introduction with R. CRC/Chapman-Hall, 2006.
In R: mgcv, gamair

GENERALIZED ADDITIVE MODELS (Hastie-Tibshirani, 1990; Wood, 2006)

• The Generalized Additive Model (GAM) extends the notion of a GLM by replacing the linear functions of the continuous covariates with smooth terms.

• Explicitly, a GAM for a response from the exponential family, with its mean transformed by a link function, is written as:

$$\eta_X = g(\mu_X) = \beta_0 + f_1(X_1) + \dots + f_p(X_p)$$

where the functions $f_j(X_j)$ are univariate, arbitrary smooth functions, and $g$ is a known link function.

• The GAM is very flexible and allows the inclusion of:

1. FACTORS (by creating dummy variables).
2. INTERACTIONS of several types, e.g.,

a) Continuous x continuous ("surface" effect)

$$\eta_X = g(\mu_X) = \beta_0 + f_1(X_1) + f_2(X_2) + f_{12}(X_1, X_2)$$

b) Factor x continuous
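As a rough sketch, the three kinds of term listed above might be written in mgcv formula syntax as follows. The data are simulated and all variable names (`x1`, `x2`, `g`) are hypothetical; `ti()` is taken from recent mgcv releases and may not exist in the version these slides describe.

```r
library(mgcv)

set.seed(2)
n <- 400
dat <- data.frame(
  x1 = runif(n),
  x2 = runif(n),
  g  = factor(sample(c("a", "b"), n, replace = TRUE))
)
dat$y <- sin(2 * pi * dat$x1) + dat$x2^2 +
  ifelse(dat$g == "b", 1, 0) + rnorm(n, sd = 0.3)

# 1. Factor as a parametric term (dummy variables are created automatically)
m_factor <- gam(y ~ g + s(x1) + s(x2), data = dat)

# 2a. Continuous x continuous: main effects plus a pure interaction surface,
#     mirroring beta0 + f1(X1) + f2(X2) + f12(X1, X2)
m_surface <- gam(y ~ s(x1) + s(x2) + ti(x1, x2), data = dat)

# 2b. Factor x continuous: a separate smooth of x1 for each level of g
m_by <- gam(y ~ g + s(x1, by = g) + s(x2), data = dat)

sapply(list(factor = m_factor, surface = m_surface, by = m_by), AIC)
```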

PARTICULAR GAM MODELS

$$\eta_X = g(\mu_X) = \beta_0 + f_1(X_1) + \dots + f_p(X_p)$$

• GLM (McCullagh-Nelder, 1989)

$$f_j(X_j) = \beta_j X_j, \quad j = 1, \dots, p$$

• SEMI-PARAMETRIC (Green, 1987; O'Sullivan, 1983; Speckman, 1988; Hastie-Tibshirani, 1990)

$$\eta_{X,t} = \beta_0 + X\beta + f(t)$$

where t is a nuisance ("noise") variable.

Tools for GAM estimation

a) Flexible estimation through smoothers

• Kernel (Härdle, 1990).
• Loess (Cleveland, 1979).
• Cubic smoothing splines (Wahba, 1990; Hastie-Tibshirani, 1990).
• Penalized regression splines (Ruppert et al., 2003; Wood, 2006; Brezger-Lang, 2006), ...

b) Algorithm for the joint estimation of the partial functions $f_j(X_j), \; j = 1, \dots, p$

• Backfitting
• Penalized spline regression

Different combinations of smoothers and estimation algorithms lead to different approaches to the GAM.

A) Penalized spline regression

A.1. Using linear mixed models

P-splines are used as smoothers.

• Frequentist version:
- Studied by Ruppert et al. (2003).
- Algorithm: REML.
- R package: SemiPar.

• Bayesian versions:
- Studied by Kneib-Hennerfeind (2006), Brezger-Lang (2006).
- Smoothers: Bayesian P-splines.
- Algorithms: REML (empirical Bayes), MCMC (full Bayes).
- Package: BayesX.

A.2. General spline approach

- Proposed by Wood (2006).
- Smoothers: P-splines, thin plate regression splines, ...
- Algorithm: P-IRLS.
- R package: mgcv.

B) Backfitting approach

• Proposed by Hastie-Tibshirani (1990).
• Smoothers: smoothing splines.
• Final iterative algorithm: Local Fisher Scoring.
• R package: gam

Note: mgcv is not a clone of gam.

A.1) GAM models based on penalized spline regression

Wood, S.N. (2006) Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC Press.

Wood, S.N. (2003) Thin plate regression splines. J.R.Statist.Soc.B 65(1):95-114

GAM model

Wood (2006) starts from a GAM of the form

$$\eta_X = X^{*}\theta + \sum_{j=1}^{p} f_j(X_j)$$

where:

a) The response belongs to the exponential family (although he adapts it to other distributions, such as the negative binomial).

b) $X^{*}$ is the model matrix for the strictly parametric part (factors, linear effects, ...) with parameter vector $\theta$.

c) The functions $f_j$ are unknown "smooth" functions.

• The model is very flexible and allows interactions of the "continuous x continuous" type (e.g. $f_{ij}(X_i, X_j)$) or the "factor x continuous" type, as well as more complicated ones.

• The idea of this methodology is the following: specify a regression spline basis for each "smooth" function $f_j$, so that the model becomes fully parametrized.

Specifically, assuming the GAM

$$\eta_X = X^{*}\theta + \sum_{j=1}^{p} f_j(X_j) \qquad [1]$$

for each "smooth" function $f_j$ we choose a basis of functions $b_{ji}(X_j)$, so that it can be represented as:

$$f_j(X_j) = \sum_{i=1}^{q_j} \beta_{ji}\, b_{ji}(X_j)$$

Given a basis, it is easy to build the model matrix for each function, so that:

$$\mathbf{f}_j = \tilde{X}_j \tilde{\beta}_j$$

where

$$\tilde{\beta}_j = [\beta_{j1}, \beta_{j2}, \dots, \beta_{jq_j}]^T \quad \text{and} \quad (\tilde{X}_j)_{ki} = b_{ji}(X_{kj})$$

with $X_{kj}$ the k-th observed value of covariate $X_j$.

• Typically, the GAM in [1] is not identifiable unless the smooth functions are subject to a "centering" constraint.

• A suitable constraint is that the sum (or mean) of the elements of $\mathbf{f}_j$ is zero, which can be written

$$\mathbf{1}^T \tilde{X}_j \tilde{\beta}_j = 0$$

• This constraint can be absorbed by a reparametrization. Specifically, we can find a matrix Z, with $q_j - 1$ orthogonal columns, that satisfies:

$$\mathbf{1}^T \tilde{X}_j Z = 0$$

• Reparametrizing the smooth function in terms of the $q_j - 1$ new parameters $\beta_j$, with $\tilde{\beta}_j = Z\beta_j$, we obtain a new model matrix for the j-th term,

$$X_j = \tilde{X}_j Z$$

such that $\mathbf{f}_j = X_j \beta_j$ already satisfies the centering constraint.

• Given the centered matrices for each smooth term, the GAM [1] can be rewritten as a GLM of the following form:

$$\eta_X = X\beta \qquad [2]$$

where

$$X = [X^{*} : X_1 : X_2 : \dots : X_p] \quad \text{and} \quad \beta^T = [\theta^T, \beta_1^T, \beta_2^T, \dots, \beta_p^T]$$
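The absorption of the centering constraint can be illustrated with base R. The sketch below is not mgcv's internal code: it assumes a B-spline basis from `splines::bs()` with 8 columns as the uncentered model matrix and obtains Z from a QR decomposition, which is one standard way to build a matrix with $q_j - 1$ orthogonal columns satisfying the constraint.

```r
library(splines)  # bs() builds a B-spline basis; ships with R

set.seed(3)
n <- 200
x <- sort(runif(n))

# Uncentered model matrix for one smooth: q = 8 basis functions (columns)
X_tilde <- matrix(bs(x, df = 8), nrow = n)
q <- ncol(X_tilde)

# Constraint 1' X_tilde beta = 0, written as C beta = 0 with C a 1 x q matrix
C <- matrix(colSums(X_tilde), nrow = 1)

# Columns 2..q of the full Q factor of t(C) span the null space of C,
# giving Z with q - 1 orthonormal columns such that C Z = 0
Z <- qr.Q(qr(t(C)), complete = TRUE)[, -1, drop = FALSE]

# Centered model matrix for the term: every column sums to (numerically) zero
X_centered <- X_tilde %*% Z

dim(Z)
max(abs(colSums(X_centered)))
```

Because every column of the centered matrix sums to zero, any coefficient vector gives a fitted smooth whose values sum to zero over the observations, so the term no longer competes with the intercept.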

Penalized estimation of the GAM: the P-IRLS algorithm

• If we estimated the "GLM" [2],

$$\eta_X = X\beta$$

by maximizing the usual log-likelihood, $l(\beta)$, the model could be overfitted.

• For this reason, model [2] must be estimated by introducing penalties (one for each smooth function) into the likelihood, to avoid estimates that are too "noisy".

• This leads us to consider the penalized log-likelihood:

$$l_p(\beta) = l(\beta) - \frac{1}{2} \sum_j \lambda_j\, \beta^T S_j \beta$$

where the $\lambda_j$ are the parameters that control the degree of smoothing of the partial functions.

• P-IRLS algorithm (Wood, 2006, pp. 169-170)

With the smoothing parameters fixed, the penalized likelihood is maximized with the P-IRLS algorithm, the analogue of the IRLS algorithm with the penalty on the parameters included.

Parameter estimators

Once the P-IRLS algorithm has been applied, the solution is

$$\hat{\beta} = (X^T W X + S)^{-1} X^T W z$$

where W is the final weight matrix, z is the "working response", and

$$S = \sum_j \lambda_j S_j$$

"Effective" degrees of freedom

The "effective" degrees of freedom (edf) associated with the model form the main diagonal of the matrix

$$F = (X^T W X + S)^{-1} X^T W X$$
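For a Gaussian response with identity link, W is the identity and z is simply the response, so the two formulas above can be evaluated directly. The following sketch makes several assumptions for illustration: simulated data, a single smooth with a B-spline basis of 20 columns, and a second-order difference penalty in the style of P-splines. The helper `fit_pls()` is hypothetical.

```r
library(splines)

set.seed(4)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# B-spline basis with q = 20 columns (intercept included in the basis)
X <- matrix(bs(x, df = 20, intercept = TRUE), nrow = n)
q <- ncol(X)

# Second-order difference penalty on the coefficients (P-spline style)
D <- diff(diag(q), differences = 2)
S1 <- crossprod(D)

# For a fixed lambda: beta_hat = (X'WX + S)^{-1} X'Wz with W = I and z = y,
# and F = (X'WX + S)^{-1} X'WX, whose diagonal holds the edf per coefficient
fit_pls <- function(lambda) {
  XtX <- crossprod(X)
  A <- XtX + lambda * S1
  beta <- solve(A, crossprod(X, y))
  Fmat <- solve(A, XtX)
  list(beta = drop(beta), edf = diag(Fmat), fitted = drop(X %*% beta))
}

lambdas <- c(1e-4, 1, 1e4)
total_edf <- sapply(lambdas, function(l) sum(fit_pls(l)$edf))
round(total_edf, 2)  # total edf decreases as lambda grows
```

With this penalty the total edf moves from close to q (almost no smoothing) towards 2, the dimension of the unpenalized linear part, as lambda increases.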

Criteria for selecting $\lambda_j$ (or $edf_j$)

With the $\lambda_j$ fixed, the objective function for estimating the GAM can be written in terms of the deviance of the model

$$\left\| \sqrt{W}\,(z - X\beta) \right\|^2 + \sum_j \lambda_j\, \beta^T S_j \beta = Dev(\beta) + \sum_j \lambda_j\, \beta^T S_j \beta$$

which is minimized with respect to $\beta$.

There are two criteria that allow the optimal smoothing parameters (or, equivalently, the optimal edfs) of all the partial functions to be computed simultaneously.

a) GCV criterion

$$GCV = \frac{n \left\| \sqrt{W}\,(z - X\beta) \right\|^2}{\left[ n - \gamma\, tr(F) \right]^2} = \frac{n\, Dev(\beta)}{\left[ n - \gamma\, tr(F) \right]^2}$$

where $\gamma$ is a parameter that guards against smoothing parameters that are too small (or, equivalently, edfs that are too large).

b) UBRE criterion (Un-Biased Risk Estimator)

$$UBRE = \frac{Dev(\beta)}{n} + \frac{2\gamma\phi\, tr(F)}{n} - \phi$$

It is the rescaled AIC criterion.
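The GCV formula can be tried out on a grid of smoothing parameters. This is a didactic sketch for the Gaussian case only (W = I, deviance equal to the residual sum of squares), with simulated data and a P-spline style basis and penalty chosen for the example; mgcv itself uses numerical optimization rather than a grid search.

```r
library(splines)

set.seed(5)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

X <- matrix(bs(x, df = 20, intercept = TRUE), nrow = n)
S1 <- crossprod(diff(diag(ncol(X)), differences = 2))
XtX <- crossprod(X)
Xty <- crossprod(X, y)

# GCV = n * Dev / (n - gamma * tr(F))^2, with Dev = residual sum of squares
gcv <- function(lambda, gamma = 1) {
  A <- XtX + lambda * S1
  beta <- solve(A, Xty)
  trF <- sum(diag(solve(A, XtX)))
  rss <- sum((y - X %*% beta)^2)
  n * rss / (n - gamma * trF)^2
}

log_lambda <- seq(-6, 6, length.out = 49)
scores <- sapply(exp(log_lambda), gcv)
best <- which.min(scores)
c(lambda = exp(log_lambda[best]), GCV = scores[best])

# A larger gamma charges more for each effective degree of freedom
scores_g <- sapply(exp(log_lambda), gcv, gamma = 1.4)
exp(log_lambda[which.min(scores_g)])
```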

Global algorithms

• The main problem is that in a GAM the P-IRLS algorithm must be applied to estimate the parameters (with the edfs fixed) while, at the same time, a criterion (GCV, UBRE) is used to optimize the edfs.

• Wood proposes 2 algorithms for the simultaneous solution:

1. "Performance" algorithm
- Estimates the edfs "inside" the P-IRLS algorithm.
- Computationally efficient.
- Convergence problems.

2. "Outer" algorithm (the default in mgcv)
- Estimates the edfs "outside" the P-IRLS algorithm.
- The GCV/UBRE criteria are minimized directly.
- Computationally more costly.

Gu and Wahba (1991) Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM J. Sci. Statist. Comput. 12:383-398

Wood, S.N. (2004) Stable and efficient multiple smoothing parameter estimation for generalized additive models. J. Amer. Statist. Ass. 99:673-686.

Wood, S.N. (2006) Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC Press. Pp. 179-189.

Wood, S.N. (2008) Fast stable direct fitting and smoothness selection for generalized additive models. J.R.Statist.Soc.B 70(3):495-518

Inference in GAM

Considering the penalized GAM

$$\eta_X = X\beta$$

1) Wood proposes frequentist and Bayesian inference on

• Parameters $\beta$
• Smooth effects (pointwise confidence bands): $X_j \beta_j$
• Linear predictor: $\eta_X = X\beta$
• Response: $\mu_X = g^{-1}(X\beta)$

2) Wood recommends Bayesian inference (the default in the mgcv package).

Wood, S.N. (2006) On confidence intervals for generalized additive models based on penalized regression splines. Australian and New Zealand Journal of Statistics. 48(4): 445-464.

Wood, S.N. (2006) Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC Press.
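A minimal sketch of how such pointwise bands can be obtained from a fitted mgcv model, assuming simulated Poisson data and a normal approximation with the usual 1.96 multiplier (both are choices made for this illustration):

```r
library(mgcv)

set.seed(6)
n <- 300
dat <- data.frame(x = runif(n))
dat$y <- rpois(n, lambda = exp(1 + sin(2 * pi * dat$x)))

fit <- gam(y ~ s(x), family = poisson, data = dat)

# Linear predictor X beta and its standard error on a grid of x values
grid <- data.frame(x = seq(0, 1, length.out = 100))
pr <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
eta <- as.numeric(pr$fit)
se  <- as.numeric(pr$se.fit)

# Approximate 95% pointwise band on the link scale, mapped to the response
# scale through the inverse link g^{-1}
inv_link <- fit$family$linkinv
band <- data.frame(
  x     = grid$x,
  mu    = inv_link(eta),
  lower = inv_link(eta - 1.96 * se),
  upper = inv_link(eta + 1.96 * se)
)
head(band)

# Covariance matrix of the coefficients (Bayesian posterior by default)
dim(vcov(fit))
```

Building the band on the link scale and then applying the inverse link keeps the limits inside the valid range of the response, here positive counts.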

R package: mgcv

1. Created by Simon Wood.

2. It accompanies the book:

Wood, S.N. (2006) Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC Press.

The example data sets are in the gamair library.

3. Smoothing bases:

• Cubic regression splines
• Thin plate regression splines
• P-splines
• Adaptive P-splines
• Tensor product splines, ...

4. It implements several automatic multivariate criteria for selecting the smoothing parameters (or edfs) in a GAM: GCV, UBRE.

5. Interactions of all kinds are easy to implement.


gam(mgcv) R Documentation

Generalized additive models with integrated smoothness estimation

Description

Fits a generalized additive model (GAM) to data, the term `GAM' being taken to include any quadratically penalized GLM. The degree of smoothness of model terms is estimated as part of fitting. gam can also fit any GLM subject to multiple quadratic penalties (including estimation of degree of penalization). Isotropic or scale invariant smooths of any number of variables are available as model terms, as are linear functionals of such smooths; confidence/credible intervals are readily available for any quantity predicted using a fitted model; gam is extendable: users can add smooths.

Smooth terms are represented using penalized regression splines (or similar smoothers) with smoothing parameters selected by GCV/UBRE/AIC or by regression splines with fixed degrees of freedom (mixtures of the two are permitted). Multi-dimensional smooths are available using penalized thin plate regression splines (isotropic) or tensor product splines (when an isotropic smooth is inappropriate). For an overview of the smooths available see smooth.terms. For more on specifying models see gam.models and linear.functional.terms. For more on model selection see gam.selection.


Usage

gam(formula,family=gaussian(),data=list(),weights=NULL,subset=NULL,na.action,offset=NULL,control=gam.control(),method=gam.method(), scale=0, knots=NULL,sp=NULL,min.sp=NULL,H=NULL,gamma=1, fit=TRUE,paraPen=NULL,G=NULL,in.out,...)

Arguments

formula A GAM formula (see formula.gam and also gam.models). This is exactly like the formula for a GLM except that smooth terms, s and te, can be added to the right hand side to specify that the linear predictor depends on smooth functions of predictors (or linear functionals of these).

family This is a family object specifying the distribution and link to use in fitting etc. See glm and family for more details. A negative binomial family is provided: see negbin.

data A data frame or list containing the model response variable and covariates required by the formula. By default the variables are taken from environment(formula): typically the environment from which gam is called.

weights prior weights on the data.

subset an optional vector specifying a subset of observations to be used in the fitting process.

na.action a function which indicates what should happen when the data contain `NA's. The default is set by the `na.action' setting of `options', and is `na.fail' if that is unset. The ``factory-fresh'' default is `na.omit'.

offset Can be used to supply a model offset for use in fitting. Note that this offset will always be completely ignored when predicting, unlike an offset included in formula: this conforms to the behaviour of lm and glm.

control A list of fit control parameters returned by gam.control.

method A list controlling the fitting methods used. This can make a difference to computational speed, and, in some cases, reliability of convergence: see gam.method for details. (Iterative method for choosing the optimal df: "performance" or "outer".)

scale If this is zero then GCV is used for all distributions except Poisson and binomial where UBRE is used with scale parameter assumed to be 1. If this is greater than 1 it is assumed to be the scale parameter/variance and UBRE is used. If scale is negative GCV is always used, which means that the scale parameter will be estimated by GCV and the Pearson estimator. For binomial models in particular, it is probably worth comparing UBRE and GCV results; for ``over-dispersed Poisson'' GCV is probably more appropriate than UBRE.

knots this is an optional list containing user specified knot values to be used for basis construction. For most bases the user simply supplies the knots to be used, which must match up with the k value supplied (note that the number of knots is not always just k). See tprs for what happens in the "tp"/"ts" case. Different terms can use different numbers of knots, unless they share a covariate.

sp A vector of smoothing parameters can be provided here. Smoothing parameters must be supplied in the order that the smooth terms appear in the model formula. Negative elements indicate that the parameter should be estimated, and hence a mixture of fixed and estimated parameters is possible. If smooths share smoothing parameters then length(sp) must correspond to the number of underlying smoothing parameters.

min.sp Lower bounds can be supplied for the smoothing parameters. Note that if this option is used then the smoothing parameters full.sp, in the returned object, will need to be added to what is supplied here to get the smoothing parameters actually multiplying the penalties. length(min.sp) should always be the same as the total number of penalties (so it may be longer than sp, if smooths share smoothing parameters).

H A user supplied fixed quadratic penalty on the parameters of the GAM can be supplied, with this as its coefficient matrix. A common use of this term is to add a ridge penalty to the parameters of the GAM in circumstances in which the model is close to un-identifiable on the scale of the linear predictor, but perfectly well defined on the response scale.

gamma It is sometimes useful to inflate the model degrees of freedom in the GCV or UBRE/AIC score by a constant multiplier. This allows such a multiplier to be supplied.

fit If this argument is TRUE then gam sets up the model and fits it, but if it is FALSE then the model is set up and an object G containing what would be required to fit it is returned. See argument G.

paraPen optional list specifying any penalties to be applied to parametric model terms. gam.models explains more.

G Usually NULL, but may contain the object returned by a previous call to gam with fit=FALSE, in which case all other arguments are ignored except for gamma, in.out, control, method and fit.

in.out optional list for initializing outer iteration. If supplied then this must contain two elements: sp should be an array of initialization values for all smoothing parameters (there must be a value for all smoothing parameters, whether fixed or to be estimated, but those for fixed s.p.s are not used); scale is the typical scale of the GCV/UBRE function, for passing to the outer optimizer.

... further arguments for passing on e.g. to gam.fit (such as mustart).
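A few of these arguments can be tried out as follows. The sketch uses simulated data and only the arguments `gamma`, `sp` and `fit`, which also exist in current mgcv releases; the `method` argument is left at its default because its interface differs between the version documented here and later ones.

```r
library(mgcv)

set.seed(7)
n <- 300
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

# Default: the smoothing parameter is selected automatically
fit_auto <- gam(y ~ s(x), data = dat)

# gamma > 1 inflates the degrees of freedom in the selection criterion
fit_gamma <- gam(y ~ s(x), data = dat, gamma = 1.4)

# sp fixes the smoothing parameter; a very large value forces a near-linear fit
fit_fixed <- gam(y ~ s(x), data = dat, sp = 1e6)

# fit = FALSE only sets the model up and returns the pre-fit object G
G <- gam(y ~ s(x), data = dat, fit = FALSE)

# Total effective degrees of freedom (intercept included) of each fit
c(auto  = sum(fit_auto$edf),
  gamma = sum(fit_gamma$edf),
  fixed = sum(fit_fixed$edf))
dim(G$X)  # model matrix that would be used in fitting
```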

s(mgcv) R Documentation

Defining smooths in GAM formulae

Description

Function used in definition of smooth terms within gam model formulae. The function does not evaluate a (spline) smooth - it exists purely to help set up a model using spline based smooths.

Usage

s(..., k=-1, fx=FALSE, bs="tp", m=NA, by=NA, xt=NULL, id=NULL, sp=NULL)

Arguments

... a list of variables that are the covariates that this smooth is a function of.

k the dimension of the basis used to represent the smooth term. The default depends on the number of variables that the smooth is a function of (k=10 for a smooth of a single variable). k should not be less than the dimension of the null space of the penalty for the term (see null.space.dimension), but will be reset if it is. See choose.k for further information.

fx indicates whether the term is a fixed d.f. regression spline (TRUE) or a penalized regression spline (FALSE).

bs a two letter character string indicating the (penalized) smoothing basis to use. (eg "tp" for thin plate regression spline, "cr" for cubic regression spline). see smooth.terms for an over view of what is available.

m The order of the penalty for this term (e.g. 2 for normal cubic spline penalty with 2nd derivatives when using default t.p.r.s basis). NA signals autoinitialization. Only some smooth classes use this. The "ps" class can use a 2 item array giving the basis and penalty order separately.

by a numeric or factor variable of the same dimension as each covariate. In the numeric vector case the elements multiply the smooth evaluated at the corresponding covariate values (a `varying coefficient model' results). In the factor case causes a replicate of the smooth to be produced for each factor level. See gam.models for further details. May also be a matrix if covariates are matrices: in this case implements linear functional of a smooth (see gam.models and linear.functional.terms for details).

xt Any extra information required to set up a particular basis. Used e.g. to set large data set handling behaviour for "tp" basis.

id A label or integer identifying this term in order to link its smoothing parameters to others of the same type. If two or more terms have the same id then they will have the same smoothing parameters, and, by default, the same bases (first occurrence defines basis type, but data from all terms used in basis construction). An id with a factor by variable causes the smooths at each factor level to have the same smoothing parameter.

sp any supplied smoothing parameters for this term. Must be an array of the same length as the number of penalties for this smooth. Positive or zero elements are taken as fixed smoothing parameters. Negative elements signal auto-initialization. Over-rides values supplied in sp argument to gam. Ignored by gamm.

Author(s)

Simon N. Wood

References

Wood, S.N. (2003) Thin plate regression splines. J.R.Statist.Soc.B 65(1):95-114

Wood S.N. (2006) Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC Press.
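The main arguments of `s()` can be illustrated on simulated data. The data-generating functions and the variable names below are assumptions made for the example only.

```r
library(mgcv)

set.seed(8)
n <- 300
dat <- data.frame(
  x = runif(n),
  g = factor(sample(c("a", "b"), n, replace = TRUE))
)
dat$y <- ifelse(dat$g == "a", sin(2 * pi * dat$x), cos(2 * pi * dat$x)) +
  rnorm(n, sd = 0.3)

# bs and k: penalized cubic regression spline with a larger basis dimension
m_cr <- gam(y ~ s(x, bs = "cr", k = 20), data = dat)

# fx = TRUE: unpenalized regression spline with fixed degrees of freedom
m_fx <- gam(y ~ s(x, k = 10, fx = TRUE), data = dat)

# Factor 'by' variable: one smooth of x for each level of g
m_by <- gam(y ~ g + s(x, by = g), data = dat)

summary(m_fx)$edf   # k - 1 after the centering constraint
summary(m_by)$edf   # one edf value per factor level
```

With `fx=TRUE` no smoothing parameter is estimated, so the edf of the term equals the number of coefficients left after centering.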

smooth.terms(mgcv) R Documentation

Smooth terms in GAM

Description

Smooth terms are specified in a gam formula using s and te terms. Various smooth classes are available, for different modelling tasks, and users can add smooth classes (see user.defined.smooth). What defines a smooth class is the basis used to represent the smooth function and quadratic penalty (or multiple penalties) used to penalize the basis coefficients in order to control the degree of smoothness. Smooth classes are invoked directly by s terms, or as building blocks for tensor product smoothing via te terms (only smooth classes with single penalties can be used in tensor products). The smooths built into the mgcv package are all based one way or another on low rank versions of splines. For the full rank versions see Wahba (1990).

Note that smooths can be used rather flexibly in gam models. In particular the linear predictor of the GAM can depend on (a discrete approximation to) any linear functional of a smooth term, using by variables and the `summation convention' explained in linear.functional.terms.


Smooth classes

Thin plate regression splines

bs="tp". These are low rank isotropic smoothers of any number of covariates. By isotropic is meant that rotation of the covariate co-ordinate system will not change the result of smoothing. By low rank is meant that they have far fewer coefficients than there are data to smooth. They are reduced rank versions of the thin plate splines and use the thin plate spline penalty. They are the default smooth for s terms because there is a defined sense in which they are the optimal smoother of any given basis dimension/rank (Wood, 2003). Thin plate regression splines do not have `knots' (at least not in any conventional sense): a truncated eigen-decomposition is used to achieve the rank reduction. See tprs for further details.

bs="ts" is as "tp" but with a small ridge penalty added to the smoothing penalty, so that the whole term can be shrunk to zero.

Cubic regression splines

bs="cr". These have a cubic spline basis defined by a modest sized set of knots spread evenly through the covariate values. They are penalized by the conventional integrated square second derivative cubic spline penalty.

bs="cs" specifies a shrinkage version of "cr".

bs="cc" specifies a cyclic cubic regression spline, i.e. a penalized cubic regression spline whose ends match, up to second derivative.

P-splines

bs="ps". These are P-splines as proposed by Eilers and Marx (1996). They combine a B-spline basis, with a discrete penalty on the basis coefficients, and any sane combination of penalty and basis order is allowed. Although this penalty has no exact interpretation in terms of function shape, in the way that the derivative penalties do, P-splines perform almost as well as conventional splines in many standard applications, and can perform better in particular cases where it is advantageous to mix different orders of basis and penalty.

bs="cp" gives a cyclic version of a P-spline.

Tensor product: te() All the preceding classes (and any user defined smooths with single penalties) may be used as marginal bases for tensor product smooths specified via te terms. Tensor product smooths are smooth functions of several variables where the basis is built up from tensor products of bases for smooths of fewer (usually one) variable(s) (marginal bases). The multiple penalties for these smooths are produced automatically from the penalties of the marginal smooths. Wood (2006b) gives the general recipe for this construction.

Tensor product smooths often perform better than isotropic smooths when the covariates of a smooth are not naturally on the same scale, so that their relative scaling is arbitrary. For example, if smoothing with respect to time and distance, an isotropic smoother will give very different results if the units are cm and minutes compared to if the units are metres and seconds: a tensor product smooth will give the same answer in both cases (see te for an example of this). Note that tensor product terms are knot based, and the thin plate splines seem to offer no advantage over cubic or P-splines as marginal bases.

Adaptive smoothers

bs="ad". Univariate and bivariate adaptive smooths are available (see adaptive.smooth). These are appropriate when the degree of smoothing should itself vary with the covariates to be smoothed, and the data contain sufficient information to be able to estimate the appropriate variation. Because this flexibility is achieved by splitting the penalty into several `basis penalties' these terms are not suitable as components of tensor product smooths, and are not supported by gamm.

Author(s)

Simon Wood

References

Eilers, P.H.C. and B.D. Marx (1996) Flexible Smoothing with B-splines and Penalties. Statistical Science, 11(2):89-121

Wahba (1990) Spline Models of Observational Data. SIAM

Wood, S.N. (2003) Thin plate regression splines. J.R.Statist.Soc.B 65(1):95-114

Wood, S.N. (2006a) Generalized Additive Models: an introduction with R, CRC

Wood, S.N. (2006b) Low rank scale invariant tensor product smooths for generalized additive mixed models. Biometrics 62(4):1025-1036

te(mgcv) R Documentation

Define tensor product smooths in GAM formulae

Description

Function used in definition of tensor product smooth terms within gam model formulae. The function does not evaluate a smooth - it exists purely to help set up a model using tensor product based smooths. Designed to construct tensor products from any marginal smooths with a basis-penalty representation (with the restriction that each marginal smooth must have only one penalty).

Usage

te(..., k=NA, bs="cr", m=NA, d=NA, by=NA, fx=FALSE, mp=TRUE, np=TRUE, xt=NULL, id=NULL, sp=NULL)

Arguments

... a list of variables that are the covariates that this smooth is a function of.

k the dimension(s) of the bases used to represent the smooth term. If not supplied then set to 5^d. If supplied as a single number then this basis dimension is used for each basis. If supplied as an array then the elements are the dimensions of the component (marginal) bases of the tensor product. See choose.k for further information.

bs array (or single character string) specifying the type for each marginal basis. "cr" for cubic regression spline; "cs" for cubic regression spline with shrinkage; "cc" for periodic/cyclic cubic regression spline; "tp" for thin plate regression spline; "ts" for t.p.r.s. with extra shrinkage. See smooth.terms for details and full list. User defined bases can also be used here (see smooth.construct for an example). If only one basis code is given then this is used for all bases.

m The order of the penalty (for smooth classes that use this) for each term. If a single number is given then it is used for all terms. NA autoinitializes. m is ignored by some bases (e.g. "cr").

d array of marginal basis dimensions. For example if you want a smooth for 3 covariates made up of a tensor product of a 2 dimensional t.p.r.s. basis and a 1-dimensional basis, then set d=c(2,1).

by a numeric or factor variable of the same dimension as each covariate. In the numeric vector case the elements multiply the smooth evaluated at the corresponding covariate values (a `varying coefficient model' results). In the factor case causes a replicate of the smooth to be produced for each factor level.

fx indicates whether the term is a fixed d.f. regression spline (TRUE) or a penalized regression spline (FALSE).

mp TRUE to use multiple penalties for the smooth. FALSE to use only a single penalty: single penalties are not recommended - they tend to allow only rather wiggly models.

np TRUE to use the `normal parameterization' for a tensor product smooth. This represents any 1-d marginal smooths via parameters that are function values at `knots', spread evenly through the data. The parameterization makes the penalties easily interpretable, however it can reduce numerical stability in some cases.

xt Either a single object, providing any extra information to be passed to each marginal basis constructor, or a list of such objects, one for each marginal basis.

id A label or integer identifying this term in order to link its smoothing parameters to others of the same type. If two or more smooth terms have the same id then they will have the same smoothing parameters, and, by default, the same bases (first occurrence defines basis type, but data from all terms used in basis construction).

sp any supplied smoothing parameters for this term. Must be an array of the same length as the number of penalties for this smooth. Positive or zero elements are taken as fixed smoothing parameters. Negative elements signal auto-initialization. Over-rides values supplied in sp argument to gam. Ignored by gamm.

Author(s)

Simon N. Wood

References

Wood, S.N. (2006a) Low rank scale invariant tensor product smooths for generalized additive mixed models. Biometrics 62(4):1025-1036

http://www.maths.bath.ac.uk/~sw283/
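The behaviour of tensor product smooths under a change of units can be explored with a small simulation. Everything below is an assumption made for illustration: the data, the variable names (`time`, `dist`, `dist_cm`) and the factor of 100 used to rescale one covariate.

```r
library(mgcv)

set.seed(9)
n <- 400
dat <- data.frame(time = runif(n), dist = runif(n))
dat$y <- sin(2 * pi * dat$time) * dat$dist + rnorm(n, sd = 0.2)

# Same data with 'dist' expressed in different units (hypothetical rescaling)
dat$dist_cm <- 100 * dat$dist

# Tensor product smooths with the default cubic regression spline margins
m_te    <- gam(y ~ te(time, dist), data = dat)
m_te_cm <- gam(y ~ te(time, dist_cm), data = dat)

# Isotropic thin plate smooths for comparison
m_tp    <- gam(y ~ s(time, dist), data = dat)
m_tp_cm <- gam(y ~ s(time, dist_cm), data = dat)

# Largest change in fitted values caused by the change of units
c(te = max(abs(fitted(m_te) - fitted(m_te_cm))),
  tp = max(abs(fitted(m_tp) - fitted(m_tp_cm))))

# 5^2 basis functions, minus 1 for centering, plus the intercept
length(coef(m_te))
```

Comparing the two printed differences shows how strongly each type of smooth reacts to the arbitrary relative scaling of the covariates.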

Simulated example

set.seed(90)
library(splines)
eps <- rnorm(400, 0, 0.1)
x <- runif(400, 0, 1)
x <- x[order(x)]
# 3.141516 is the approximation of pi used in the original code
yteor <- (sin(2 * 3.141516 * x**3))**3
yobs <- yteor + eps

# Smoothing spline with df chosen by the GCV criterion
plot(x, yobs, pch = 1, ylab = "Y", main = "Smoothing Spline")
lines(x, yteor, lty = 1, lwd = 2, col = "black")
lines(smooth.spline(x, yobs, cv = FALSE), lty = 1, lwd = 5, col = "red")
legend(0.0, 1.3, c("theoretical", "GCV (df=25.3)"), col = c("black", "red"),
       lty = c(1, 1), lwd = c(2, 5))

library(mgcv)

# Thin plate regression spline (the default) with df chosen by the GCV criterion
plot(x, yobs, pch = 1, ylab = "Y", main = "Thin regression splines")
lines(x, yteor, lty = 1, lwd = 2, col = "black")
fit <- gam(yobs ~ s(x, bs = "tp"))
summary(fit)$edf
lines(x, fitted(fit), lty = 1, lwd = 5, col = "red")
legend(0.0, 1.3, c("theoretical", "GCV (df=8.9)"), col = c("black", "red"),
       lty = c(1, 1), lwd = c(2, 5))

# The problem is that the selected df are bounded by the basis dimension,
# k = 10 by default. Let us take k = 30.
# Thin plate regression spline (k = 30) with df chosen by the GCV criterion
plot(x, yobs, pch = 1, ylab = "Y", main = "Thin regression splines (k=30)")
lines(x, yteor, lty = 1, lwd = 2, col = "black")
fit <- gam(yobs ~ s(x, bs = "tp", k = 30))
summary(fit)$edf
lines(x, fitted(fit), lty = 1, lwd = 5, col = "red")
legend(0.0, 1.3, c("theoretical", "GCV (df=23.06)"), col = c("black", "red"),
       lty = c(1, 1), lwd = c(2, 5))

# Cubic regression spline (k = 30) with df chosen by the GCV criterion
plot(x, yobs, pch = 1, ylab = "Y", main = "Cubic regression splines (k=30)")
lines(x, yteor, lty = 1, lwd = 2, col = "black")
fit <- gam(yobs ~ s(x, bs = "cr", k = 30))
summary(fit)$edf
lines(x, fitted(fit), lty = 1, lwd = 5, col = "red")
legend(0.0, 1.3, c("theoretical", "GCV (df=23.06)"), col = c("black", "red"),
       lty = c(1, 1), lwd = c(2, 5))

# P-spline (k = 30) with df chosen by the GCV criterion
plot(x, yobs, pch = 1, ylab = "Y", main = "P-Splines")
lines(x, yteor, lty = 1, lwd = 2, col = "black")
fit <- gam(yobs ~ s(x, bs = "ps", k = 30))
summary(fit)$edf
lines(x, fitted(fit), lty = 1, lwd = 5, col = "red")
legend(0.0, 1.3, c("theoretical", "GCV (df=21.9)"), col = c("black", "red"),
       lty = c(1, 1), lwd = c(2, 5))


# Adaptive splines with df chosen by the GCV criterion. "Outer" method for selecting the df
plot(x, yobs, pch=1, ylab="Y", main="Adaptive Splines-Outer")
lines(x, yteor, lty=1, lwd=2, col="black")
fit <- gam(yobs ~ s(x, bs="ad"))
summary(fit)$edf
lines(x, fitted(fit), lty=1, lwd=5, col="red")
legend(0.0, 1.3, c("theoretical", "GCV (df=14.1)"), col=c("black","red"), lty=c(1,1), lwd=c(2,5), bty="n")

# Adaptive splines with df chosen by the GCV criterion. "Performance" method for selecting the df
# Note: gam.method() belongs to the mgcv version used when these slides were written;
# recent mgcv releases choose the optimizer through the 'optimizer' argument of gam() instead
plot(x, yobs, pch=1, ylab="Y", main="Adaptive Splines-Performance")
lines(x, yteor, lty=1, lwd=2, col="black")
gm <- gam.method(gam="perf")
fit <- gam(yobs ~ s(x, bs="ad"), method=gm)
summary(fit)$edf
lines(x, fitted(fit), lty=1, lwd=5, col="red")
legend(0.0, 1.3, c("theoretical", "GCV (df=14.1)"), col=c("black","red"), lty=c(1,1), lwd=c(2,5), bty="n")
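The role of the basis dimension `k` in the fits above can be checked numerically. The following is a minimal sketch on freshly simulated data, so the values will not reproduce the df quoted in the legends; it only shows that the effective degrees of freedom of a single smooth cannot exceed `k - 1`.

```r
library(mgcv)

# Simulated data resembling the slides' example (not the same random draws)
set.seed(90)
n <- 400
x <- sort(runif(n))
y <- sin(2 * pi * x^3)^3 + rnorm(n, sd = 0.1)

# Same basis type, two basis dimensions
fit_k10 <- gam(y ~ s(x, bs = "tp", k = 10))
fit_k30 <- gam(y ~ s(x, bs = "tp", k = 30))

# Effective degrees of freedom of the smooth term in each fit
edf_k10 <- sum(summary(fit_k10)$edf)
edf_k30 <- sum(summary(fit_k30)$edf)
c(k10 = edf_k10, k30 = edf_k30)
```

If the edf of the `k = 10` fit is close to 9, the basis is probably too small for the signal, which is the situation the slides address by moving to `k = 30`.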


Airquality (mgcv)

library(mgcv)
air <- na.omit(airquality)
plot(air[,1:4])


MODEL FITTING

We fit a GAM of Ozone as a function of Temp.

$$E[\mathrm{Ozone} \mid \mathrm{Temp}] = \beta_0 + f(\mathrm{Temp})$$

library(mgcv)
air <- na.omit(airquality)
air.gam1 <- gam(Ozone ~ s(Temp), family=gaussian, data=air)
air.gam1

Family: gaussian
Link function: identity

Formula:
Ozone ~ s(Temp)

Estimated degrees of freedom:
3.487238  total = 4.487238

GCV score: 518.6911

The estimated smoothing parameter is:

air.gam1$sp

[1] 0.05481123


summary(air.gam1)

Family: gaussian
Link function: identity

Formula:
Ozone ~ s(Temp)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   42.099      2.118   19.88   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value
s(Temp) 3.487  3.987 34.06  <2e-16 ***

R-sq.(adj) = 0.551   Deviance explained = 56.5%
GCV score = 518.69   Scale est. = 497.72   n = 111

• The optimal effective df for s(Temp) are 3.487.

• The F test above tests whether the effect f(Temp) is significant.
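The quantities discussed above can also be pulled out of the summary object programmatically. This sketch refits the same model on `airquality` and reads the smooth-term table; `sm` is just an illustrative name.

```r
library(mgcv)

air <- na.omit(airquality)
air.gam1 <- gam(Ozone ~ s(Temp), family = gaussian, data = air)

sm <- summary(air.gam1)
sm$s.table            # edf, Ref.df, F and p-value for each smooth term
sm$edf                # effective df of s(Temp)
sum(air.gam1$edf)     # total model edf: smooth edf plus 1 for the intercept
```

The total reported by `sum(air.gam1$edf)` corresponds to the "total" figure shown when the fitted object is printed.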


GRAPHICAL REPRESENTATION OF THE EFFECT

We plot the centered partial effect of Temp, adding pointwise 95% confidence bands. The residuals can also be shown in the same figure.

op <- par(mfrow=c(1,2))
plot(air.gam1, se=TRUE)
plot(air.gam1, se=TRUE, residuals=TRUE)
par(op)


Model diagnostics

gam.check(air.gam1)


CHANGING THE SPLINE BASIS

So far we have used the default basis, thin plate regression splines (bs="tp").

We now check the effect of changing the basis. We will use:

Cubic regression splines (bs="cr")
P-splines (bs="ps")

op <- par(mfrow=c(1,3))
plot(gam(Ozone ~ s(Temp, bs="tp"), data=air), se=TRUE, main="thin-plate")
plot(gam(Ozone ~ s(Temp, bs="cr"), data=air), se=TRUE, main="cubic regression spline")
plot(gam(Ozone ~ s(Temp, bs="ps"), data=air), se=TRUE, main="P-spline")
par(op)
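Besides comparing the plots, the three bases can be compared numerically. The sketch below collects the effective df and the GCV score of each fit in a small table; the helper objects `bases`, `fits` and `comparison` are illustrative names, not part of the original slides.

```r
library(mgcv)

air <- na.omit(airquality)
bases <- c("tp", "cr", "ps")

# Fit the same model once per basis; the formula is built from a string
fits <- lapply(bases, function(b) {
  form <- as.formula(sprintf('Ozone ~ s(Temp, bs = "%s")', b))
  gam(form, data = air)
})
names(fits) <- bases

# Effective df of the smooth and GCV score for each basis
comparison <- data.frame(
  basis = bases,
  edf   = sapply(fits, function(f) sum(summary(f)$edf)),
  gcv   = sapply(fits, function(f) unname(f$gcv.ubre))
)
comparison
```

Similar edf and GCV values across rows would suggest that the choice of basis matters little for this data set.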


Multivariate analysis

We fit a GAM of Ozone as a function of Temp, Wind and Solar.R:

$$\mu_X = E[Y \mid X_1, X_2, X_3] = \beta_0 + f_1(X_1) + f_2(X_2) + f_3(X_3)$$

where:

$$Y = \mathrm{Ozone}, \quad X_1 = \mathrm{Solar.R}, \quad X_2 = \mathrm{Wind}, \quad X_3 = \mathrm{Temp}$$

air.gam2 <- gam(Ozone ~ s(Solar.R) + s(Wind) + s(Temp), data=air)
summary(air.gam2)

Family: gaussian
Link function: identity

Formula:
Ozone ~ s(Solar.R) + s(Wind) + s(Temp)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   42.099      1.663   25.32   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
             edf Ref.df      F  p-value
s(Solar.R) 2.760  3.260  4.109  0.00698 **
s(Wind)    2.910  3.410 14.609 1.36e-08 ***
s(Temp)    3.833  4.333 12.786 7.46e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = 0.723   Deviance explained = 74.7%
GCV score = 338.9   Scale est. = 306.83   n = 111


op <- par(mfrow=c(1,3))
plot(air.gam2, shade=TRUE)
par(op)


Correcting the GCV criterion

Another argument of gam() that we can change is gamma (1 by default). This parameter enters the criteria used to optimize the degrees of freedom, such as GCV: values above 1 penalize each effective degree of freedom more heavily and therefore give smoother fits.

GCV is known to tend to select more degrees of freedom than desired. Kim and Gu (2004) recommend using gamma≈1.4.

air.gam3 <- gam(Ozone ~ s(Solar.R) + s(Wind) + s(Temp), data=air, gamma=1.4)
op <- par(mfrow=c(1,3))
plot(air.gam3, shade=TRUE)
par(op)
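To see what `gamma` does in practice, one can put the total effective df of the default fit next to those of the `gamma = 1.4` fit. This is a minimal sketch on the same `airquality` data; `fit_g1`, `fit_g14` and `edf_total` are illustrative names.

```r
library(mgcv)

air <- na.omit(airquality)
form <- Ozone ~ s(Solar.R) + s(Wind) + s(Temp)

fit_g1  <- gam(form, data = air)               # gamma = 1 (default)
fit_g14 <- gam(form, data = air, gamma = 1.4)  # heavier penalty per edf

# Total effective degrees of freedom (intercept included) under each setting
edf_total <- c(gamma_1 = sum(fit_g1$edf), gamma_1.4 = sum(fit_g14$edf))
edf_total
```

One would normally expect the second value to be somewhat smaller than the first, reflecting a smoother model.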


COMPARING NESTED MODELS

The following model, air.glm, is nested in air.gam3:

air.glm<-glm(Ozone~Solar.R+Wind+Temp, data=air)

anova(air.glm, air.gam3,test="F")

Analysis of Deviance Table

Model 1: Ozone ~ Solar.R + Wind + Temp
Model 2: Ozone ~ s(Solar.R) + s(Wind) + s(Temp)

  Resid. Df Resid. Dev     Df Deviance      F    Pr(>F)
1  107.0000      48003
2  103.5961      33191 3.4039    14812 13.582 4.014e-08
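The F statistic in the table can be reproduced by hand from the residual deviances and residual df it reports. The sketch below only re-does that arithmetic with the printed (rounded) values, so small rounding differences are to be expected.

```r
# Values copied from the analysis of deviance table above
resid_df  <- c(glm = 107.0000, gam = 103.5961)
resid_dev <- c(glm = 48003,    gam = 33191)

df_diff  <- unname(resid_df["glm"]  - resid_df["gam"])    # extra df used by the GAM
dev_diff <- unname(resid_dev["glm"] - resid_dev["gam"])   # reduction in deviance

# Scale estimate taken from the larger model
scale_hat <- unname(resid_dev["gam"] / resid_df["gam"])

# F statistic and its p-value
F_stat <- (dev_diff / df_diff) / scale_hat
p_val  <- pf(F_stat, df_diff, unname(resid_df["gam"]), lower.tail = FALSE)
c(F = F_stat, p = p_val)
```

A very small p-value indicates that the linear model is rejected in favor of the model with smooth effects.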


PREDICTIONS IN GAM

predict.gam (mgcv) R Documentation

Prediction from fitted GAM model

Description

Takes a fitted gam object produced by gam() and produces predictions given a new set of values for the model covariates or the original values used for the model fit. Predictions can be accompanied by standard errors, based on the posterior distribution of the model coefficients. The routine can optionally return the matrix by which the model coefficients must be pre-multiplied in order to yield the values of the linear predictor at the supplied covariate values: this is useful for obtaining credible regions for quantities derived from the model, and for lookup table prediction outside R (see example code below).

Usage

predict(object,newdata,type="link",se.fit=FALSE,terms=NULL, block.size=1000,newdata.guaranteed=FALSE,na.action=na.pass,...)

Arguments

object a fitted gam object as produced by gam().

newdata A data frame or list containing the values of the model covariates at which predictions are required. If this is not provided then predictions corresponding to the original data are returned. If newdata is provided then it should contain all the variables needed for prediction: a warning is generated if not.

type When this has the value "link" (default) the linear predictor (possibly with associated standard errors) is returned. When type="terms" each component of the linear predictor is returned separately (possibly with standard errors): this includes parametric model components, followed by each smooth component, but excludes any offset and any intercept. type="iterms" is the same, except that any standard errors returned for smooth components will include the uncertainty about the intercept/overall mean. When type="response" predictions on the scale of the response are returned (possibly with approximate standard errors). When type="lpmatrix" then a matrix is returned which yields the values of the linear predictor (minus any offset) when postmultiplied by the parameter vector (in this case se.fit is ignored). The latter option is most useful for getting variance estimates for quantities derived from the model: for example integrated quantities, or derivatives of smooths. A linear predictor matrix can also be used to implement approximate prediction outside R (see example code, below).

se.fit when this is TRUE (not default) standard error estimates are returned for each prediction.

terms if type=="terms" then only results for the terms given in this array will be returned.

block.size maximum number of predictions to process per call to underlying code: larger is quicker, but more memory intensive. Set to < 1 to use total number of predictions as this.

newdata.guaranteed Set to TRUE to turn off all checking of newdata except for sanity of factor levels: this can speed things up for large prediction tasks, but newdata must be complete, with no NA values for predictors required in the model.

na.action what to do about NA values in newdata. With the default na.pass, any row of newdata containing NA values for required predictors, gives rise to NA predictions (even if the term concerned has no NA predictors). na.exclude or na.omit result in the dropping of newdata rows, if they contain any NA values for required predictors. If newdata is missing then NA handling is determined from object$na.action.

... other arguments.

Details

The standard errors produced by predict.gam are based on the Bayesian posterior covariance matrix of the parameters Vp in the fitted gam object.

To facilitate plotting with termplot, if object possesses an attribute "para.only" and type=="terms" then only parametric terms of order 1 are returned (i.e. those that termplot can handle).

Note that, in common with other prediction functions, any offset supplied to gam as an argument is always ignored when predicting, unlike offsets specified in the gam model formula.

See the examples for how to use the lpmatrix for obtaining credible regions for quantities derived from the model.

Author(s)

Simon N. Wood

The design is inspired by the S function of the same name described in Chambers and Hastie (1993) (but is not a clone).

References Chambers and Hastie (1993) Statistical Models in S. Chapman & Hall.

Wood S.N. (2006b) Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC Press.


If we want to obtain from model air.gam3, for example, the predicted Ozone values and their standard errors:

air.gam3.resp <- predict(air.gam3, type="response", se.fit=TRUE)

air.gam3.resp

$fit
33.068430 26.886849 15.720708 24.301989 33.070936 8.302647 5.575943 25.827298 29.580631 22.846092 6.887527 25.241508 22.417444 8.695953 …….

$se.fit
4.256968 3.739995 3.426930 5.618747 5.004028 5.991667 9.495283 3.797224 4.576920 3.922912 6.736037 5.258540 4.556668 8.679924 ………
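The fitted values and standard errors returned by `predict()` can be combined into approximate pointwise 95% intervals, and the same call accepts new covariate values through `newdata`. The sketch below assumes two made-up days (`new_days` is hypothetical) and a normal approximation with 1.96 standard errors.

```r
library(mgcv)

air <- na.omit(airquality)
air.gam3 <- gam(Ozone ~ s(Solar.R) + s(Wind) + s(Temp), data = air, gamma = 1.4)

# Hypothetical covariate values at which to predict
new_days <- data.frame(Solar.R = c(150, 250), Wind = c(10, 5), Temp = c(70, 85))

pred <- predict(air.gam3, newdata = new_days, type = "response", se.fit = TRUE)

# Approximate pointwise 95% intervals: fit +/- 1.96 standard errors
ci <- data.frame(
  new_days,
  fit   = as.numeric(pred$fit),
  lower = as.numeric(pred$fit - 1.96 * pred$se.fit),
  upper = as.numeric(pred$fit + 1.96 * pred$se.fit)
)
ci
```

For a Gaussian model with identity link the link and response scales coincide, so the intervals can be built directly on the response scale.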


Post-surgical infection study

Objectives:

a) To study whether glucose (a continuous variable), measured before surgery, is a risk factor for post-surgical infection (infec), and what the functional form of the risk is.

b) Age, sex and diabetes are each known to be risk factors for infection. To study whether the effect of glucose stays the same after adjusting for age, sex and diabetes.

In this study the response is binary (infec = "si", "no"), so we will use the following logistic GAMs:

a) Univariate logistic GAM

$$\mathrm{logit} = \ln\left(\frac{p(\mathrm{INFEC}=\mathrm{SI} \mid X)}{p(\mathrm{INFEC}=\mathrm{NO} \mid X)}\right) = \beta_0 + f(\mathrm{Gluc})$$

b) Multivariate logistic GAMs

b.1) Main effects

$$\mathrm{logit} = \beta_0 + f_1(\mathrm{Gluc}) + f_2(\mathrm{Edad}) + \mathrm{Sexo} + \mathrm{Diab}$$

b.2) "Diabetes x glucose" interaction

$$\mathrm{logit} = \beta_0 + f_2(\mathrm{Edad}) + \mathrm{Sexo} + \mathrm{Diab} + \mathrm{Diab} \times f(\mathrm{Gluc})$$


File infec.txt

Description: The file infec.txt contains partial information from a study carried out at the Hospital Clínico Universitario. The file contains the following columns:

sexo: sex ("varón", "mujer")
edad: age (years)
diab: diabetes ("si", "no")
gluc: fasting glucose before surgery
infec: post-surgical infection ("si", "no")

infec <- read.table("F:\\Regresion Máster\\infec.txt", header=TRUE)
names(infec)

[1] "edad" "sexo" "gluc" "diab" "infec"


Effect of glucose: univariate analysis

infec.gam1 <- gam(infec ~ s(gluc), family=binomial, data=infec)
summary(infec.gam1)

Family: binomial
Link function: logit

Formula:
infec ~ s(gluc)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.4653     0.0544  -26.93   <2e-16

Approximate significance of smooth terms:
          edf Ref.df Chi.sq p-value
s(gluc) 5.661  6.161  92.94  <2e-16 ***

R-sq.(adj) = 0.0414   Deviance explained = 4.13%
UBRE score = -0.041914   Scale est. = 1   n = 2351

AIC(infec.gam1)

[1] 2252.459


Functional form of the glucose effect

edf_opt = 5.661

From the results obtained, we can say that:

1. Glucose has a significant effect on the risk of post-surgical infection.

2. The functional form of the risk resembles a "spoon" (spoon-shaped), which indicates (a) a lower risk within the "clinically" normal glucose range, [70,115]; (b) a high risk not only at high glucose values, but also at low values of this variable (hypoglycemia).
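A nonlinear risk curve like this one is often summarized as odds ratios relative to a reference glucose value. The sketch below is one possible way to do that with `type="lpmatrix"`; it uses simulated data (the file infec.txt is not available here), a hypothetical reference value `ref = 95`, and is not necessarily the procedure used in the original study.

```r
library(mgcv)

# Simulated data with a U-shaped log-odds in glucose (illustration only)
set.seed(123)
n <- 2000
gluc <- runif(n, 50, 250)
eta <- -2 + ((gluc - 95) / 60)^2
infec <- rbinom(n, 1, plogis(eta))
fit <- gam(infec ~ s(gluc), family = binomial)

ref  <- 95                                   # hypothetical reference glucose value
grid <- data.frame(gluc = seq(60, 240, by = 10))

# Linear predictor matrices at the grid and at the reference value
X_grid <- predict(fit, newdata = grid, type = "lpmatrix")
X_ref  <- predict(fit, newdata = data.frame(gluc = ref), type = "lpmatrix")

# log OR(x) = (X(x) - X(ref)) %*% beta, with variance from the Bayesian covariance
D      <- sweep(X_grid, 2, X_ref[1, ])
log_or <- drop(D %*% coef(fit))
se     <- sqrt(rowSums((D %*% vcov(fit)) * D))

or_tab <- data.frame(
  gluc  = grid$gluc,
  OR    = exp(log_or),
  lower = exp(log_or - 1.96 * se),
  upper = exp(log_or + 1.96 * se)
)
head(or_tab)
```

Subtracting the reference row cancels the intercept, so the intervals reflect only the uncertainty in the smooth contrast.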


Effect of glucose: multivariate analysis

We want to know whether the glucose-related risk persists after adjusting for the individual's age, sex and, above all, diabetes status. To do so we fit the model:

infec.gam2<-gam(infec~ s(gluc)+s(edad)+sexo+diab, family=binomial,data=infec)

summary(infec.gam2)

Family: binomial
Link function: logit

Formula:
infec ~ s(gluc) + s(edad) + sexo + diab

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76980    0.08497 -20.827  < 2e-16
sexovarón    0.49405    0.10920   4.524 6.06e-06
diabsi       0.10281    0.22205   0.463    0.643

Approximate significance of smooth terms:
         edf Ref.df Chi.sq  p-value
s(gluc) 5.58  6.080  57.48 1.61e-10
s(edad) 1.00  1.500  41.56 3.58e-10

R-sq.(adj) = 0.0607   Deviance explained = 6.72%
UBRE score = -0.065208   Scale est. = 1   n = 2351


Graphical representation of the effects

op<-par(mfrow=c(2,2))

plot(infec.gam2, all.terms=T)

par(op)


Conclusions from model infec.gam2

In summary, the general conclusions from this model are:

• The adjusted effect of glucose is significant and spoon-shaped.
• The effect of age is significant and linear (edf = 1).
• The effect of sex is significant: men are at higher risk of post-surgical infection than women.
• Interestingly, once glucose is in the model, the effect of diabetes on infection is no longer significant (p = 0.643).
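The statement about sex can be put on the odds-ratio scale using the parametric coefficient reported for `sexovarón` in the summary of infec.gam2. This sketch simply exponentiates the printed estimate and builds a Wald-type 95% interval from the printed standard error.

```r
# Estimate and standard error for 'sexovarón', copied from summary(infec.gam2)
beta_sex <- 0.49405
se_sex   <- 0.10920

# Odds ratio (men vs. women) and Wald-type 95% confidence interval
or_sex <- exp(beta_sex)
ci_sex <- exp(beta_sex + c(-1, 1) * qnorm(0.975) * se_sex)

round(c(OR = or_sex, lower = ci_sex[1], upper = ci_sex[2]), 3)
```

An interval that excludes 1 agrees with the significant z test shown in the summary.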

Reference:

Figueiras A and Cadarso-Suárez C (2001). Application of Nonparametric Models for Calculating Odds Ratios and Their Confidence Intervals for Continuous Exposures. American Journal of Epidemiology, 154, 264-275.


"Diabetes x glucose" interaction

$$\mathrm{logit} = \beta_0 + f_2(\mathrm{Edad}) + \mathrm{Sexo} + \mathrm{Diab} + \mathrm{Diab} \times f(\mathrm{Gluc})$$

infec.gam3 <- gam(infec ~ s(gluc, by=diab) + s(edad) + sexo + diab, family=binomial, data=infec)
summary(infec.gam3)

Family: binomial
Link function: logit

Formula:
infec ~ s(gluc, by = diab) + s(edad) + sexo + diab

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.77887    0.08553 -20.798  < 2e-16
sexovarón    0.49879    0.10961   4.551 5.35e-06
diabsi       0.68963    0.27500   2.508   0.0121

Approximate significance of smooth terms:
                 edf Ref.df Chi.sq  p-value
s(gluc):diabno 5.363  5.863  63.25 8.21e-12
s(gluc):diabsi 1.000  1.500   1.02    0.466
s(edad)        1.000  1.500  40.12 7.40e-10

R-sq.(adj) = 0.0636   Deviance explained = 6.99%
UBRE score = -0.067222   Scale est. = 1   n = 2351


plot(infec.gam3, shade=TRUE)

[Figure: estimated glucose effect by diabetes status. Panels: non-diabetics (p-value < 0.001); diabetics (p-value = 0.466, ns).]

Conclusions: We detect an interaction between glucose and diabetes: the effect of glucose is significant only in non-diabetics.
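The factor-by-smooth construction `s(gluc, by=diab)` can be tried out on simulated data. In the sketch below the data-generating mechanism is invented (a curved glucose effect for non-diabetics and a flat one for diabetics) purely to show that one smooth is estimated per factor level.

```r
library(mgcv)

# Simulated data (illustration only; not the infec.txt study)
set.seed(42)
n <- 1500
diab <- factor(sample(c("no", "si"), n, replace = TRUE, prob = c(0.8, 0.2)))
gluc <- runif(n, 50, 250)
eta  <- -2 + ifelse(diab == "no", ((gluc - 95) / 60)^2, 0.5)
infec <- rbinom(n, 1, plogis(eta))
dat <- data.frame(infec, gluc, diab)

# One smooth of glucose per level of diab; the factor main effect is kept in the
# model because each 'by' smooth is centered
fit_by <- gam(infec ~ diab + s(gluc, by = diab), family = binomial, data = dat)

s_tab <- summary(fit_by)$s.table
s_tab
```

Each row of `s_tab` refers to the glucose smooth within one level of `diab`, which mirrors the two rows reported for infec.gam3.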


GAM with a "continuous x continuous" interaction

Description: The file HTA.sav (an SPSS file) contains partial information from the EPDGA study, carried out on the adult population in 2004. The file contains the following columns:

SEXO: sex ("Varón", "Mujer")
EDAD: age (years)
ESTUD: educational level (I=none, II, III, IV, and V=higher education)
IMC: body mass index (kg/m2)
HTA: arterial hypertension ("no", "si")

Objectives: Among others, to estimate the prevalence of hypertension (HTA) in the Galician population, adjusted for age and sex.

We will fit the following logistic GAMs:

a) Main effects model:

$$\log\left(\frac{p(\mathrm{HTA})}{1 - p(\mathrm{HTA})}\right) = \beta_0 + \beta_1 \times \mathrm{SEXO} + f_1(\mathrm{EDAD}) + f_2(\mathrm{IMC})$$

b) "IMC x EDAD" interaction model

c) "IMC x EDAD x SEXO" interaction model


a) Main effects model

library(foreign)
hta <- read.spss("F:\\Regresion Máster\\HTA.sav", use.value.labels=TRUE, to.data.frame=TRUE)
names(hta)

"sexo" "edad" "estud" "imc" "hta"

Since the data set is large, we use cubic regression splines (cr) as smoothers:

hta.gam1 <- gam(hta ~ s(edad, bs="cr") + s(imc, bs="cr") + sexo, family=binomial, data=hta)
summary(hta.gam1)

Family: binomial
Link function: logit

Formula:
hta ~ s(edad, bs = "cr") + s(imc, bs = "cr") + sexo

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.09530    0.07643 -14.331  < 2e-16 ***
sexoMujer   -0.72945    0.10662  -6.841 7.84e-12 ***

Approximate significance of smooth terms:
          edf Ref.df Chi.sq p-value
s(edad) 3.411  3.911  330.5  <2e-16 ***
s(imc)  2.305  2.805  166.4  <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = 0.288   Deviance explained = 26.2%
UBRE score = -0.1561   Scale est. = 1   n = 2842

plot(hta.gam1, shade=TRUE)


"imc x edad" interaction model

hta.gam2 <- gam(hta ~ te(edad, imc, bs=c("cr","cr")) + sexo, family=binomial, data=hta)
summary(hta.gam2)

Family: binomial
Link function: logit

Formula:
hta ~ te(edad, imc, bs = c("cr", "cr")) + sexo

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.10786    0.07827 -14.155  < 2e-16 ***
sexoMujer   -0.73140    0.10659  -6.862 6.79e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
              edf Ref.df Chi.sq p-value
te(edad,imc) 8.63   9.13  565.1  <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = 0.289   Deviance explained = 26.4%
UBRE score = -0.15674   Scale est. = 1   n = 2842

anova(hta.gam1, hta.gam2, test="Chisq")

Analysis of Deviance Table

Model 1: hta ~ s(edad, bs = "cr") + s(imc, bs = "cr") + sexo
Model 2: hta ~ te(edad, imc, bs = c("cr", "cr")) + sexo

  Resid. Df Resid. Dev     Df Deviance P(>|Chi|)
1 2834.2834    2382.92
2 2831.3703    2375.28 2.9131     7.64      0.05
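Since HTA.sav is not available here, the same additive-versus-tensor-product comparison can be sketched on the `airquality` data used earlier. The choice of Wind and Temp as the interacting pair is only an assumption made for illustration.

```r
library(mgcv)

air <- na.omit(airquality)

# Additive model: separate smooth effects of Wind and Temp
fit_add <- gam(Ozone ~ s(Wind, bs = "cr") + s(Temp, bs = "cr"), data = air)

# Tensor product smooth: allows the effect of Wind to change with Temp
fit_te <- gam(Ozone ~ te(Wind, Temp, bs = c("cr", "cr")), data = air)

# Compare the two fits by AIC (lower is better)
aic_tab <- AIC(fit_add, fit_te)
aic_tab
```

A clearly lower AIC for `fit_te` would point towards an interaction, in the same spirit as the `anova()` comparison above.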


Plotting interaction surfaces

Link scale:

vis.gam(hta.gam2, view=c("edad","imc"), plot.type="persp", color="heat", n.grid=50, theta=-25, ticktype="detailed", type="link")

Response scale:

vis.gam(hta.gam2, view=c("edad","imc"), plot.type="persp", color="heat", n.grid=50, theta=-25, ticktype="detailed", type="response")
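The two surfaces differ only by the inverse logit transformation. The sketch below checks this on simulated data (the variable names mimic the HTA file, but the values are invented): predictions on the response scale equal `plogis()` applied to predictions on the link scale.

```r
library(mgcv)

# Simulated data (illustration only; not the HTA.sav study)
set.seed(7)
n <- 1000
edad <- runif(n, 18, 85)
imc  <- runif(n, 18, 40)
hta  <- rbinom(n, 1, plogis(-6 + 0.05 * edad + 0.1 * imc))

fit <- gam(hta ~ te(edad, imc, bs = c("cr", "cr")), family = binomial)

# A small grid of age and BMI values
grid <- expand.grid(edad = c(30, 50, 70), imc = c(22, 30))

eta_hat <- as.numeric(predict(fit, newdata = grid, type = "link"))      # log-odds
p_hat   <- as.numeric(predict(fit, newdata = grid, type = "response"))  # probability

# Largest discrepancy between the two routes to the probability scale
max_diff <- max(abs(plogis(eta_hat) - p_hat))
cbind(grid, eta_hat, p_hat)
```

The response-scale values are the estimated prevalences, which is what the second `vis.gam()` call displays.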


"imc x edad x sexo" interaction model

hta.gam3 <- gam(hta ~ te(edad, imc, bs=c("cr","cr"), by=sexo) + sexo, family=binomial, data=hta)
summary(hta.gam3)

Family: binomial
Link function: logit

Formula:
hta ~ te(edad, imc, bs = c("cr", "cr"), by = sexo) + sexo

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.07612    0.08287  -12.99  < 2e-16 ***
sexoMujer   -0.97205    0.14358   -6.77 1.29e-11 ***

Approximate significance of smooth terms:
                         edf Ref.df Chi.sq p-value
te(edad,imc):sexoVarón 7.562  8.062  216.3  <2e-16 ***
te(edad,imc):sexoMujer 8.630  9.130  339.1  <2e-16 ***

R-sq.(adj) = 0.297   Deviance explained = 27.5%
UBRE score = -0.16369   Scale est. = 1   n = 2842

anova(hta.gam1, hta.gam2, hta.gam3, test="Chisq")

Analysis of Deviance Table

Model 1: hta ~ s(edad, bs = "cr") + s(imc, bs = "cr") + sexo
Model 2: hta ~ te(edad, imc, bs = c("cr", "cr")) + sexo
Model 3: hta ~ te(edad, imc, bs = c("cr", "cr"), by = sexo) + sexo

  Resid. Df Resid. Dev     Df Deviance P(>|Chi|)
1 2834.2834    2382.92
2 2831.3703    2375.28 2.9131     7.64      0.05
3 2823.8079    2340.40 7.5624    34.88 1.942e-05
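The sequential chi-square tests in this table can be recomputed from the residual df and residual deviances it prints. The sketch below only redoes that arithmetic with the rounded values shown, so the p-values should be read as approximate.

```r
# Values copied from the analysis of deviance table above
resid_df  <- c(2834.2834, 2831.3703, 2823.8079)
resid_dev <- c(2382.92,   2375.28,   2340.40)

# Differences between consecutive models (model 1 vs 2, model 2 vs 3)
df_diff  <- -diff(resid_df)
dev_diff <- -diff(resid_dev)

# Approximate chi-square p-values for each added interaction
p_val <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
data.frame(df_diff, dev_diff, p_val)
```

The first comparison is borderline, while the second gives strong evidence that the age-by-BMI surface differs between sexes.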