Genetic Association and Generalised Linear Models


Post on 23-Jan-2016








  • Genetic Association and Generalised Linear Models

    Gil McVean, WTCHG

    Weds 2nd November 2011

  • IMSGC, WTCCC2 (2011)

  • Questions to ask

    What is a linear model?

    What is a generalised linear model?

    How do you estimate parameters and test hypotheses with GLMs?

    How are GLMs used in the study of genetic association?


  • What is a covariate?

    A covariate is a quantity that may influence the outcome of interest, for example: genotype at a SNP; age of mice when the measurement was taken; batch of chips from which gene expression was measured

    Previously, you have looked at using likelihood to estimate parameters of underlying distributions

    We want to generalise this idea to ask how covariates might influence the underlying parameters

    Much statistical modelling is concerned with considering linear effects of covariates on underlying parameters


  • What is a linear model?

    In a linear model, the expectation of the response variable is defined as a linear combination of explanatory variables

    Explanatory variables can include any function of the original data

    But the link between E(Y) and X (or some function of X) is ALWAYS linear and the error is ALWAYS Gaussian

    [Equation, annotated with: response variable; intercept; linear relationships with explanatory variables; interaction term; Gaussian error]
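The annotated equation on this slide did not survive extraction. A sketch consistent with its labels, assuming two explanatory variables $X_1$ and $X_2$ (the symbols $b_j$, $\epsilon$ and $\sigma^2$ are assumed notation, not taken from the slide), would be:

$$
\underbrace{Y}_{\text{response}} \;=\; \underbrace{b_0}_{\text{intercept}} \;+\; \underbrace{b_1 X_1 + b_2 X_2}_{\text{linear terms}} \;+\; \underbrace{b_{12} X_1 X_2}_{\text{interaction}} \;+\; \underbrace{\epsilon}_{\text{Gaussian error}}, \qquad \epsilon \sim N(0, \sigma^2)
$$

The model is linear in the coefficients $b_j$, even though the interaction term is a product of the original variables.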

  • Quick test: which of these is NOT a linear model?


  • What is a GLM?

    There are many settings where the error is non-Gaussian and/or the link between E(Y) and X is not necessarily linear: discrete data (e.g. counts in multinomial or Poisson experiments); categorical data (e.g. disease status); highly skewed data (e.g. income, ratios)

    Generalised linear models keep the notion of linearity, but enable the use of non-Gaussian error models

    g is called the link function: it relates E(Y) to the linear combination of explanatory variables. In linear models, the link function is the identity

    The response variable can be drawn from any distribution of interest (the distribution function). In linear models this is Gaussian
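The defining equation for this slide is missing from the transcript. In the usual notation (assumed here, with $k$ explanatory variables and coefficients $b_0, \dots, b_k$), a GLM can be sketched as:

$$
g\big(E(Y)\big) \;=\; b_0 + b_1 X_1 + \dots + b_k X_k
$$

Taking $g(\mu) = \mu$ together with a Gaussian distribution for $Y$ recovers the ordinary linear model.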

  • Poisson regression

    In Poisson regression the expected value of the response variable is given by the exponential of the linear predictor

    The link function is the log

    Note that several distribution functions are possible (normal, Poisson, binomial counts), though in practice Poisson regression is typically used to model count data (particularly when counts are low)
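As a minimal sketch of the log link, the following Python code computes the Poisson mean from a linear predictor and evaluates the corresponding log-likelihood. The function names and the single-covariate form are illustrative assumptions, not part of the slides.

```python
import math

def poisson_mean(b0, b1, x):
    """Expected count under a log link: E(Y) = exp(b0 + b1 * x)."""
    return math.exp(b0 + b1 * x)

def poisson_log_likelihood(b0, b1, xs, ys):
    """Sum of Poisson log-probabilities of counts ys given covariates xs."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = poisson_mean(b0, b1, x)
        # log P(Y = y) = y * log(mu) - mu - log(y!)
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total

# Hypothetical toy data, for illustration only.
toy_x = [0.0, 1.0, 2.0]
toy_y = [1, 3, 7]
print(poisson_log_likelihood(0.0, 1.0, toy_x, toy_y))
```

Maximising this function over b0 and b1 is what a Poisson GLM fit does numerically.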


  • Example: Caesarean sections in public and private hospitals


  • Boxplots of rates of C sections


  • Fitting a model without covariates

    > analysis <- glm(d$Caes ~ d$Births, family = "poisson")
    > summary(analysis)

    Call: glm(formula = d$Caes ~ d$Births, family = "poisson")

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -2.81481  -0.73305  -0.08718   0.74444   2.19103

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) 2.132e+00  1.018e-01  20.949  < 2e-16 ***
    d$Births    4.406e-04  5.395e-05   8.165 3.21e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for poisson family taken to be 1)

    Null deviance: 99.990 on 19 degrees of freedom
    Residual deviance: 36.415 on 18 degrees of freedom
    AIC: 127.18

    Number of Fisher Scoring iterations: 4

    Implies an average of 12.7 per 1000 births

  • Fitting a model with covariates

    Call: glm(formula = d$Caes ~ d$Births + d$Hospital, family = "poisson")

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.3270  -0.6121  -0.0899   0.5398   1.6626

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) 1.351e+00  2.501e-01   5.402 6.58e-08 ***
    d$Births    3.261e-04  6.032e-05   5.406 6.45e-08 ***
    d$Hospital  1.045e+00  2.729e-01   3.830 0.000128 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Unexpectedly, this indicates that public hospitals actually have a higher rate of Caesarean sections than private ones

    Implies an average of 15.2 per 1000 births in public hospitals and 5.4 per 1000 births in private ones
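Because the link is the log, exponentiating a coefficient gives a multiplicative effect on the expected count. The short Python sketch below applies this to the coefficients printed above; the variable names are illustrative, and the values are simply copied from the output.

```python
import math

# Coefficients copied from the summary output above (log scale).
coef_births = 3.261e-04
coef_hospital = 1.045

# Multiplicative effect of the hospital indicator on the expected count.
rate_ratio_hospital = math.exp(coef_hospital)

# Multiplicative effect of 1000 additional births on the expected count.
ratio_per_1000_births = math.exp(coef_births * 1000)

print(round(rate_ratio_hospital, 2), round(ratio_per_1000_births, 2))
```

The hospital rate ratio of about 2.8 is in line with the quoted rates of 15.2 and 5.4 per 1000 births (15.2 / 5.4 is also about 2.8).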

  • Checking model fit

    Look at the distribution of residuals and at how well observed values are predicted


  • What's going on?

    Initial summary suggested opposite result to GLM analysis. Why?

    The relationship between the number of births and the number of C-sections does not appear to be linear

    Accounting for this removes most (but not all) of the apparent differences between hospital types

    There is also one quite influential outlier

    Relative risk, public (compared to private) = 2.8


  • Hospitals with fewer births tend to have more Caesarean sections

  • Simpson's paradox

    Be careful about adding together observations: this can be misleading

    E.g. Berkeley sex-bias case

    It appears that women have lower success. But actually women are typically more successful at the departmental level; they just apply to more competitive subjects

             Applicants   % admitted
    Men            8442          44%
    Women          4321          35%

             Men                        Women
    Major    Applicants   % admitted    Applicants   % admitted
    A               825          62%           108          82%
    B               560          63%            25          68%
    C               325          37%           593          34%
    D               417          33%           375          35%
    E               191          28%           393          24%
    F               272           6%           341           7%
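The reversal can be checked directly from the departmental table. The Python sketch below pools the six listed majors and compares the pooled admission rates with the per-major rates; the data structure and function name are illustrative, and admitted counts are approximated as applicants times the rounded percentage.

```python
# Per major: (men applicants, men admit rate, women applicants, women admit rate),
# copied from the table above.
majors = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (272, 0.06, 341, 0.07),
}

def pooled_rate(applicant_index, rate_index):
    """Admission rate after adding all majors together."""
    applicants = sum(row[applicant_index] for row in majors.values())
    admitted = sum(row[applicant_index] * row[rate_index] for row in majors.values())
    return admitted / applicants

pooled_men = pooled_rate(0, 1)
pooled_women = pooled_rate(2, 3)

# Majors in which women have the higher admission rate.
women_higher = [m for m, row in majors.items() if row[3] > row[1]]

print(round(pooled_men, 2), round(pooled_women, 2), women_higher)
```

Across these six majors the pooled rate is higher for men, while women have the higher rate in four of the six majors taken separately.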

  • Finding MLEs in GLM

    In linear modelling we can use the beautiful compactness of linear algebra to find MLEs and estimates of the variance for parameters

    Consider an n by k+1 data matrix, X, where n is the number of observations and k is the number of explanatory variables, and a response vector Y; the first column of X is 1 for the intercept term

    The MLEs for the coefficients (b) can be estimated using $\hat{b} = (X^{T}X)^{-1}X^{T}Y$

    In GLMs, there is usually no such compact analytical expression for the MLEs. Use numerical methods to maximise the likelihood
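For the linear model, the closed-form estimate $(X^{T}X)^{-1}X^{T}Y$ can be sketched in plain Python for the simplest case of an intercept plus one explanatory variable (k = 1), where the 2 by 2 system can be solved by hand. The function name and the toy data are assumptions made for illustration.

```python
def ols_one_covariate(xs, ys):
    """Solve the normal equations (X^T X) b = X^T Y for X = [1, x]."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sum_x], [sum_x, sum_xx]];  X^T Y = [sum_y, sum_xy]
    det = n * sum_xx - sum_x * sum_x
    b0 = (sum_xx * sum_y - sum_x * sum_xy) / det
    b1 = (n * sum_xy - sum_x * sum_y) / det
    return b0, b1

# Hypothetical noise-free data generated from y = 1 + 2x.
b0_hat, b1_hat = ols_one_covariate([0, 1, 2, 3], [1, 3, 5, 7])
print(b0_hat, b1_hat)
```

No comparable one-line solution exists for most GLMs, which is why iterative methods such as Fisher scoring are used there.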


  • Testing hypotheses in GLMs

    For the parameters we are interested in we typically want to ask how much evidence there is that these are different from zero

    For this we need to construct confidence intervals for the parameter estimates

    We could estimate the 95% confidence interval by finding all parameter values with log-likelihood no more than 1.92 units below that at the MLE (half the 95% point of the chi-squared distribution with one degree of freedom, 3.84)

    Alternatively, we might use bootstrap resampling techniques to estimate the distribution of parameter estimates

    However, we can also appeal to theoretical considerations of likelihood (based on the CLT) that show that parameter estimates are asymptotically normal with variance described by the Fisher information matrix

    Informally, the information matrix describes the sharpness of the likelihood curve around the MLE and the extent to which parameter estimates are correlated
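Asymptotic normality is what lies behind the "z value" and "Pr(>|z|)" columns of a GLM summary. As a hedged illustration, the Python sketch below recomputes the Wald statistic, a two-sided p-value and an approximate 95% interval for the d$Hospital coefficient, using the estimate and standard error printed in the Caesarean example; the variable names are illustrative.

```python
import math

# Estimate and standard error copied from the Caesarean example output.
estimate = 1.045
std_error = 0.2729

# Wald statistic: estimate divided by its standard error.
z_value = estimate / std_error

# Two-sided p-value under a standard normal reference distribution.
p_value = math.erfc(abs(z_value) / math.sqrt(2.0))

# Approximate 95% interval on the coefficient scale, then as a rate ratio.
lower = estimate - 1.96 * std_error
upper = estimate + 1.96 * std_error
ratio_interval = (math.exp(lower), math.exp(upper))

print(round(z_value, 2), p_value, ratio_interval)
```

Small differences from the printed z value are expected, because the inputs here are rounded.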

  • Logistic regression

    When only two types of outcome are possible (e.g. disease/not-disease) we can model counts by the binomial

    If we want to perform inference about the factors that influence the probability of success it is usual to use the logistic model

    The link function here is the logit
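The logistic model's equation is missing from the transcript. In standard notation (assumed here, with a single explanatory variable $X$ and success probability $p$), the logit link and its inverse are:

$$
\operatorname{logit}(p) \;=\; \log\frac{p}{1-p} \;=\; b_0 + b_1 X,
\qquad
p \;=\; \frac{1}{1 + e^{-(b_0 + b_1 X)}}
$$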

  • Example: testing for genotype association

    In a cohort study, we observe the number of individuals in a population that get a particular disease

    We want to ask whether a particular genotype is associated with increased risk

    The simplest test is one in which we consider a single coefficient for the genotypic value

    [Figure: example over genotypic values 0, 1, 2 with b0 = -4, b1 = 2]

    Genotype                   AA     Aa         aa
    Genotypic value            0      1          2
    Frequency in population    p^2    2p(1-p)    (1-p)^2
    Probability of disease     p0     p1         p2
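To make the example concrete, the Python sketch below evaluates the probability of disease for each genotypic value under the logistic model with the slide's parameters b0 = -4 and b1 = 2. The function name `inverse_logit` is an illustrative choice.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor (log-odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Parameter values taken from the slide.
b0, b1 = -4.0, 2.0

# Probability of disease for genotypic values 0, 1 and 2.
disease_prob = [inverse_logit(b0 + b1 * g) for g in (0, 1, 2)]
print([round(p, 4) for p in disease_prob])
```

Each extra copy of the allele adds 2 to the log-odds, but the resulting steps in probability (roughly 0.02 to 0.12 to 0.5) are not equal.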

  • A note on the model

    Note that each copy of the risk allele contributes in an additive way to the exponent

    This does not mean that each allele adds a fixed amount to the probability of disease

    Rather, each allele contributes a fixed amount to the log-odds

    This has the effect of maintaining Hardy-Weinberg equilibrium within both the cases and controls

  • Concepts in disease genetics

    Relative risk describes the risk to a person in an exposed group compared to the unexposed group

    The odds ratio compares the odds of disease occurring in one group relative to another

    If the absolute risk of disease is low the two will be very similar
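The closeness of the two measures at low risk can be illustrated numerically. In the Python sketch below the risks are hypothetical values chosen for illustration, and the function names are assumptions.

```python
def relative_risk(risk_exposed, risk_unexposed):
    """Ratio of disease probabilities in the two groups."""
    return risk_exposed / risk_unexposed

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

def odds_ratio(risk_exposed, risk_unexposed):
    """Ratio of disease odds in the two groups."""
    return odds(risk_exposed) / odds(risk_unexposed)

# Hypothetical rare disease: risk doubles from 0.1% to 0.2%.
rare = (relative_risk(0.002, 0.001), odds_ratio(0.002, 0.001))

# Hypothetical common disease: risk doubles from 20% to 40%.
common = (relative_risk(0.4, 0.2), odds_ratio(0.4, 0.2))

print(rare, common)
```

In both scenarios the relative risk is 2, but the odds ratio stays near 2 only when the disease is rare.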


  • Cont.

    Suppose in a given study we observe the following counts

    Genotype                    0      1      2
    Counts with disease         26     39     21
    Counts without disease      1298   567    49

    We can fit a GLM using the logit link function and binomial probabilities

    We have genotype data stored in the vector gt and disease status in the vector status

    Using R, this is specified by the command

    glm(formula = status ~ gt, family = binomial)
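As a rough illustration of what the fitting routine does, the Python sketch below fits the same logistic model to the tabulated counts by Newton-Raphson (equivalent to Fisher scoring for the logit link). It assumes genotypes are coded 0, 1, 2 exactly as in the table, and the function name is hypothetical. The printed R output elsewhere in the slides may rest on a different coding or data set, so exact agreement with it is not claimed.

```python
import math

# Counts from the table above, indexed by genotypic value 0, 1, 2.
genotype = [0, 1, 2]
cases = [26, 39, 21]
controls = [1298, 567, 49]

def fit_logistic(x, cases, controls, iterations=50, tol=1e-10):
    """Newton-Raphson for logit(p) = b0 + b1 * x with grouped binomial data."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for xi, yi, ci in zip(x, cases, controls):
            n = yi + ci
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = n * p * (1.0 - p)      # binomial variance weight
            u0 += yi - n * p           # score for the intercept
            u1 += xi * (yi - n * p)    # score for the slope
            i00 += w                   # Fisher information entries
            i01 += xi * w
            i11 += xi * xi * w
        det = i00 * i11 - i01 * i01
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    # Standard errors from the inverse information at the last evaluation.
    return b0, b1, math.sqrt(i11 / det), math.sqrt(i00 / det)

b0_hat, b1_hat, se_b0, se_b1 = fit_logistic(genotype, cases, controls)
fitted_cases = sum(
    (y + c) / (1.0 + math.exp(-(b0_hat + b1_hat * g)))
    for g, y, c in zip(genotype, cases, controls)
)
print(b0_hat, b1_hat, se_b0, se_b1)
```

At the maximum, the fitted number of cases equals the observed number, which gives a simple check that the iteration has converged.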

  • An important note

    In case-control designs, it is actually the genotype that is the random variable, not the outcome

    There is some theory that says that estimates of coefficients are equivalent under prospective or retrospective approaches in CC designs: Prentice & Pyke (1979) Biometrika 66:403.

    However, the two approaches are not fully equivalent as the CC design creates artificial association between causal factors that are independent in the population at large


  • Interpreting results

    Call: glm(formula = status ~ gt, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.8554  -0.4806  -0.2583  -0.2583   2.6141

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -4.6667     0.2652 -17.598

  • Adding in extra covariates

    Adding in additional explanatory variables in GLM is essentially the same as in linear model analysis

    Likewise, we can look at interactions

    In the disease study we might want to consider age as a potentially important covariate

    glm(formula = status ~ gt + age, family = binomial)

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -7.00510    0.79209  -8.844

  • Adding model complexity

    In the disease status analysis we might want to generalise the fitted mo

