Linear Modelling I Richard Mott Wellcome Trust Centre for Human Genetics


Post on 22-Feb-2016


Page 1: Linear  Modelling I

Linear Modelling I

Richard Mott
Wellcome Trust Centre for Human Genetics

Page 2: Linear  Modelling I

Synopsis

• Linear Regression
• Correlation
• Analysis of Variance
• Principle of Least Squares

Page 3: Linear  Modelling I

Correlation

Page 4: Linear  Modelling I

Correlation and linear regression

• Is there a relationship?
• How do we summarise it?
• Can we predict new observations?
• What about outliers?

Page 5: Linear  Modelling I

Correlation Coefficient r

• $-1 \le r \le 1$

• r = 0: no linear relationship

• r = 0.6: moderately strong positive linear relationship

• r = 1: perfect positive linear relationship

• r = -1: perfect negative linear relationship

Page 6: Linear  Modelling I

Examples of Correlation (taken from Wikipedia)

Page 7: Linear  Modelling I

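Calculation of r

• Data: n paired observations $(x_1, y_1), \ldots, (x_n, y_n)$

• $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$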
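A minimal sketch of the calculation, assuming the standard Pearson product-moment definition (the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations). The data values and the helper name pearson_r are hypothetical, chosen only for illustration.

```r
# Toy data (hypothetical values)
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 3.7, 5.1, 4.6, 7.2)

# Pearson correlation computed directly from its definition
pearson_r <- function(x, y) {
  dx <- x - mean(x)                 # deviations of x from its mean
  dy <- y - mean(y)                 # deviations of y from its mean
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(x, y)
r_builtin <- cor(x, y)              # R's built-in Pearson correlation
print(c(manual = r_manual, builtin = r_builtin))
```

The two printed values should agree to numerical precision.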

Page 8: Linear  Modelling I

Correlation in R

> cor(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")
[1] 0.2577617

> cor.test(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")

	Pearson's product-moment correlation

data:  bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL
t = 11.1473, df = 1746, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2134566 0.3010088
sample estimates:
      cor
0.2577617

> pt(11.1473, df=1746, lower.tail=FALSE)  # T distribution on 1746 degrees of freedom
[1] 3.154319e-28
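The t statistic reported by cor.test can be recovered from r and the degrees of freedom through the standard relation $t = r\sqrt{n-2}/\sqrt{1-r^2}$. The sketch below uses the numbers from the output above; note that pt(..., lower.tail=FALSE) alone is a one-sided tail probability, so the two-sided p-value is twice that value.

```r
# Values copied from the cor.test output
r  <- 0.2577617
df <- 1746                          # n - 2

# t statistic for the test of zero correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# two-sided p-value: double the upper-tail probability
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(t_stat)
print(p_two_sided)
```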

Page 9: Linear  Modelling I

Linear Regression

Fit a straight line to data: $y_i = a + b x_i + e_i$

• a: intercept
• b: slope
• $e_i$: error
  – Normally distributed
  – $E(e_i) = 0$
  – $\mathrm{Var}(e_i) = \sigma^2$

Page 10: Linear  Modelling I

Example: simulated data

R code

> # simulate 30 data points
> x <- 1:30
> e <- rnorm(30, 0, 5)   # Normal errors, mean 0, standard deviation 5
> y <- 1 + 3*x + e

> # fit the linear model
> f <- lm(y ~ x)

> # plot the data and the predicted line
> plot(x, y)
> abline(reg=f)

> print(f)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   -0.08634      3.04747

Page 11: Linear  Modelling I

Least Squares

• Estimate a, b by least squares

• Minimise sum of squared residuals between y and the prediction a+bx

• Minimise $S(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
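For a straight line the minimisation has a closed-form solution: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept follows from the means. A short sketch comparing these formulae with lm(), on simulated data like the earlier example (the seed and the names a_hat and b_hat are arbitrary choices for illustration):

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- 1:30
y <- 1 + 3 * x + rnorm(30, 0, 5)

# closed-form least-squares estimates
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

# the same estimates from lm()
f <- lm(y ~ x)
print(c(a = a_hat, b = b_hat))
print(coef(f))
```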

Page 12: Linear  Modelling I

Why least squares?

• LS gives simple formulae for the estimates for a, b

• If the errors are Normally distributed then the LS estimates are “optimal”
  – In large samples the estimates converge to the true values
  – No other unbiased estimates have smaller expected squared errors
  – LS = maximum likelihood

• Even if errors are not Normal, LS estimates are often useful

Page 13: Linear  Modelling I

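Analysis of Variance (ANOVA)

LS estimates have an important property: they partition the sum of squares (SS) into fitted and error components

• $\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$, where $\hat{y}_i = \hat{a} + \hat{b} x_i$ is the fitted value

• total SS = fitting SS + residual SS
• only the LS estimates do this

Component     SS                                 Degrees of freedom   Mean Square (ratio of SS to df)   F-ratio (ratio of FMS/RMS)
Fitting SS    $\sum_i (\hat{y}_i - \bar{y})^2$    1                    FMS = FSS/1                       F = FMS/RMS
Residual SS   $\sum_i (y_i - \hat{y}_i)^2$        n-2                  RMS = RSS/(n-2)
Total SS      $\sum_i (y_i - \bar{y})^2$          n-1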

Page 14: Linear  Modelling I

ANOVA in R

Component     SS        Degrees of freedom   Mean Square   F-ratio
Fitting SS    20872.7   1                    20872.7       965
Residual SS   605.6     28                   21.6
Total SS      21478.3   29

> anova(f)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 20872.7 20872.7     965 < 2.2e-16 ***
Residuals 28   605.6    21.6

> pf(965, 1, 28, lower.tail=FALSE)
[1] 3.042279e-23
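The entries of the ANOVA table can be rebuilt by hand from the fitted model. The sketch below uses freshly simulated data (so the numbers differ from the table above) and checks that the total SS splits into fitting and residual SS and that the F-ratio and p-value match anova(). The seed and variable names are illustrative assumptions.

```r
set.seed(2)                              # arbitrary seed
x <- 1:30
y <- 1 + 3 * x + rnorm(30, 0, 5)
f <- lm(y ~ x)
n <- length(y)

tss <- sum((y - mean(y))^2)              # total SS
fss <- sum((fitted(f) - mean(y))^2)      # fitting SS
rss <- sum(residuals(f)^2)               # residual SS

# F-ratio = fitting mean square / residual mean square
f_ratio <- (fss / 1) / (rss / (n - 2))
p_value <- pf(f_ratio, 1, n - 2, lower.tail = FALSE)

tab <- anova(f)
print(c(TSS = tss, FSS = fss, RSS = rss, F = f_ratio, p = p_value))
print(tab)
```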

Page 15: Linear  Modelling I

Hypothesis testing

• $H_0$: b = 0, ie no relationship between y and x

• Assume errors $e_i$ are independent and Normally distributed $N(0, \sigma^2)$
• If $H_0$ is true then the expected values of the sums of squares in the ANOVA are

                        total SS          =  fitting SS   +  residual SS
  Expectation           $(n-1)\sigma^2$      $\sigma^2$      $(n-2)\sigma^2$
  Degrees of Freedom    n-1                  1               n-2

• F ratio = (fitting MS)/(residual MS) ~ 1 under $H_0$

• F >> 1 implies we reject $H_0$

• Under $H_0$, F is distributed as F(1,n-2)

Page 16: Linear  Modelling I

Degrees of Freedom

• Suppose $x_1, \ldots, x_n$ are iid N(0,1)

• Then $\sum_i x_i^2 \sim \chi^2_n$, ie a sum of squares of n independent variables

• What about $\sum_i (x_i - \bar{x})^2$?

• These values are constrained to sum to 0: $\sum_i (x_i - \bar{x}) = 0$

• Therefore the sum is distributed as if it comprised one fewer observation, hence it has n-1 df (for example, its expectation is n-1)

• In particular, if p parameters are estimated from a data set, then the residuals have p constraints on them, so they behave like n-p independent variables
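A small simulation can illustrate the loss of one degree of freedom. Assuming standard Normal draws, the raw sum of squares should average about n, while the sum of squares about the sample mean should average about n-1. The seed, n and the number of replicates are arbitrary choices.

```r
set.seed(3)                       # arbitrary seed
n    <- 10
reps <- 20000

ss_raw      <- numeric(reps)
ss_centered <- numeric(reps)
for (k in seq_len(reps)) {
  x <- rnorm(n)                           # n iid N(0,1) values
  ss_raw[k]      <- sum(x^2)              # chi-squared on n df
  ss_centered[k] <- sum((x - mean(x))^2)  # chi-squared on n-1 df
}

print(mean(ss_raw))        # should be near n
print(mean(ss_centered))   # should be near n - 1
```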

Page 17: Linear  Modelling I

The F distribution

• If $e_1, \ldots, e_n$ are independent and identically distributed (iid) random variables with distribution $N(0, \sigma^2)$, then:
• $e_1^2/\sigma^2, \ldots, e_n^2/\sigma^2$ are iid chi-squared random variables on 1 degree of freedom, $\chi^2_1$

• The sum $S_n = \sum_i e_i^2/\sigma^2$ is distributed as chi-squared $\chi^2_n$

• If $T_m$ is a similar sum distributed as chi-squared $\chi^2_m$, but independent of $S_n$, then $(S_n/n)/(T_m/m)$ is distributed as an F random variable F(n,m)

• Special cases:
  – F(1,m) is the same as the square of a T-distribution on m df
  – for large m, F(n,m) tends to $\chi^2_n / n$
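Both special cases can be checked numerically with R's quantile functions. The sketch compares upper 5% critical values: F(1,m) against the squared two-sided T critical value, and n times F(n,m) for a very large m against chi-squared on n df. The values m = 28, n = 3 and m = 1e6 are arbitrary.

```r
m <- 28

# F(1, m) is the square of a T on m df: compare 5% critical values
f_crit <- qf(0.95, df1 = 1, df2 = m)
t_crit <- qt(0.975, df = m)            # two-sided 5% point of T
print(c(f_crit, t_crit^2))

# for large m, n * F(n, m) behaves like chi-squared on n df
n_df  <- 3
big_m <- 1e6
f_big <- n_df * qf(0.95, df1 = n_df, df2 = big_m)
c_crit <- qchisq(0.95, df = n_df)
print(c(f_big, c_crit))
```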

Page 18: Linear  Modelling I

ANOVA – HDL example

> ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)
> ff

Call:
lm(formula = bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)

Coefficients:
                  (Intercept)  bioch$Biochem.Tot.Cholesterol
                       0.2308                         0.4456

> anova(ff)
Analysis of Variance Table

Response: bioch$Biochem.HDL
                                Df  Sum Sq Mean Sq F value Pr(>F)
bioch$Biochem.Tot.Cholesterol    1 149.660 149.660    1044
Residuals                     1849 265.057   0.143

> pf(1044, 1, 28, lower.tail=FALSE)
[1] 1.040709e-23

Note: the residual degrees of freedom in this table are 1849, not 28, so the matching call is pf(1044, 1, 1849, lower.tail=FALSE), which gives a far smaller p-value than the one shown.

Fitted model: HDL = 0.2308 + 0.4456*Cholesterol

Page 19: Linear  Modelling I

correlation and ANOVA

• $r^2$ = FSS/TSS = fraction of variance explained by the model

• $r^2 = F/(F+n-2)$
  – correlation and ANOVA are equivalent
  – Test of r=0 is equivalent to test of b=0
  – T statistic in R cor.test is the square root of the ANOVA F statistic (with the sign of r)
  – r does not tell anything about magnitudes of estimates of a, b
  – r is dimensionless
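These identities can be checked numerically. The first part of the sketch plugs in the SS and F values from the simulated-data ANOVA table earlier in the slides (n = 30); the second part uses fresh simulated data, with an arbitrary seed and slope, to compare cor.test with anova.

```r
# Part 1: numbers from the simulated-data ANOVA table (n = 30)
fss <- 20872.7; tss <- 21478.3; F_value <- 965; n_tab <- 30
r2_from_ss <- fss / tss
r2_from_F  <- F_value / (F_value + n_tab - 2)
print(c(r2_from_ss, r2_from_F))

# Part 2: fresh simulated data (hypothetical)
set.seed(4)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
n <- length(x)

F_sim  <- anova(lm(y ~ x))$`F value`[1]
ct     <- cor.test(x, y)
r      <- unname(ct$estimate)
t_stat <- unname(ct$statistic)

print(c(r2 = r^2, from_F = F_sim / (F_sim + n - 2)))
print(c(t_squared = t_stat^2, F = F_sim))
```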

Page 20: Linear  Modelling I

Effect of sample size

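Total Cholesterol vs HDL data
Example R session to sample subsets of data and compute correlations

# rows with both measurements present
# (idx is not defined on the original slide; this definition is an assumption)
idx <- which(complete.cases(bioch[, c("Biochem.Tot.Cholesterol", "Biochem.HDL")]))
seqq <- seq(10, 300, 5)                        # sample sizes to try
corr <- matrix(0, nrow=length(seqq), ncol=2)
colnames(corr) <- c("sample size", "P-value")
n <- 1
for (i in seqq) {
  res <- rep(0, 100)
  for (j in 1:100) {                           # 100 random subsets of size i
    s <- sample(idx, i)
    data <- bioch[s, ]
    co <- cor.test(data$Biochem.Tot.Cholesterol, data$Biochem.HDL)
    res[j] <- co$p.value
  }
  m <- exp(mean(log(res)))                     # geometric mean of the P-values
  cat(i, m, "\n")
  corr[n, ] <- c(i, m)
  n <- n + 1
}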

Page 21: Linear  Modelling I

Calculating the right sample size n

• The R library “pwr” contains functions to compute the sample size for many problems, including correlation pwr.r.test() and ANOVA pwr.anova.test()
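A hedged sketch of how these functions might be called. The effect sizes (r = 0.25 for correlation, Cohen's f = 0.25 with 4 groups for ANOVA), the 5% significance level and the 80% power are illustrative assumptions. Since pwr is an add-on package, the example is skipped when it is not installed.

```r
# pwr is an add-on package; skip the example if it is not installed
if (requireNamespace("pwr", quietly = TRUE)) {
  # sample size needed to detect r = 0.25 with 80% power at the 5% level
  res <- pwr::pwr.r.test(r = 0.25, sig.level = 0.05, power = 0.8)
  print(ceiling(res$n))

  # per-group sample size for a one-way ANOVA with k = 4 groups
  # and effect size f = 0.25
  res_aov <- pwr::pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.8)
  print(ceiling(res_aov$n))
} else {
  message("package 'pwr' not installed; install.packages('pwr') to run this example")
}
```

Leaving n unspecified asks the function to solve for the sample size; supplying n and omitting power returns the power instead.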

Page 22: Linear  Modelling I

Problems with non-linearity

All plots have r=0.8 (taken from Wikipedia)

Page 23: Linear  Modelling I

Non-Parametric Correlation
Spearman Rank Correlation Coefficient

• Replace observations by their ranks
• eg x = (5, 1, 4, 7) -> rank(x) = (3, 1, 2, 4)
• Compute sum of squared differences between ranks
• in R:
  – cor(x, y, method="spearman")
  – cor.test(x, y, method="spearman")
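The steps above can be sketched as follows. When there are no ties, the Spearman coefficient equals $1 - 6\sum_i d_i^2 / (n(n^2-1))$, where $d_i$ is the difference between the ranks of $x_i$ and $y_i$, and it is also the Pearson correlation of the ranks. The x vector is the one from the slide; y is a hypothetical second variable without ties.

```r
x <- c(5, 1, 4, 7)
y <- c(2.3, 0.4, 3.9, 3.1)      # hypothetical second variable, no ties

rx <- rank(x)                   # ranks of x
ry <- rank(y)                   # ranks of y

n <- length(x)
d <- rx - ry                    # rank differences
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))   # valid when there are no ties
rho_ranks   <- cor(rx, ry)                          # Pearson correlation of the ranks
rho_builtin <- cor(x, y, method = "spearman")

print(rx)
print(c(rho_formula, rho_ranks, rho_builtin))
```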

Page 24: Linear  Modelling I

Spearman Correlation

> cor.test(xx, y, method="pearson")

	Pearson's product-moment correlation

data:  xx and y
t = 0.0221, df = 28, p-value = 0.9825
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3566213  0.3639062
sample estimates:
        cor
0.004185729

> cor.test(xx, y, method="spearman")

	Spearman's rank correlation rho

data:  xx and y
S = 2473.775, p-value = 0.01267
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.4496607

Page 25: Linear  Modelling I

Multiple Correlation

• The R cor function can be used to compute pairwise correlations between many variables at once, producing a correlation matrix.

• This is useful, for example, when comparing expression of genes across subjects.

• Gene coexpression networks are often based on the correlation matrix.

• in R: mat <- cor(df, use="pairwise.complete.obs")
  – computes the correlation between every pair of columns in df, removing missing values in a pairwise manner
  – Output is a square matrix of correlation coefficients
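A minimal sketch on a hypothetical data frame with a few missing values. In cor() the argument that controls missing-value handling is use, and "pairwise.complete.obs" drops missing values separately for each pair of columns. The column names g1 to g4 and the seed are made up for illustration.

```r
set.seed(5)                         # arbitrary seed
# hypothetical expression-like data: 20 subjects, 4 variables
df <- data.frame(g1 = rnorm(20), g2 = rnorm(20), g3 = rnorm(20), g4 = rnorm(20))
df$g2 <- df$g1 + 0.5 * df$g2        # make g1 and g2 correlated
df$g3[c(2, 7)] <- NA                # introduce some missing values

# correlation between every pair of columns, pairwise deletion of NAs
mat <- cor(df, use = "pairwise.complete.obs")
print(round(mat, 2))
```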

Page 26: Linear  Modelling I

One-Way ANOVA

• Model y as a function of a categorical variable taking p values
  – eg subjects are classified into p families
  – want to estimate effect due to each family and test if these are different
  – want to estimate the fraction of variance explained by differences between families (an estimate of heritability)

Page 27: Linear  Modelling I

One-Way ANOVA

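LS estimators: the fitted value for every observation in group i is the average over group i

$\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$

where $y_{ij}$ is observation j in group i and $n_i$ is the size of group i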

Page 28: Linear  Modelling I

One-Way ANOVA

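• Variance is partitioned into fitting and residual SS

  $\sum_{i,j} (y_{ij} - \bar{y})^2 = \sum_i n_i (\bar{y}_i - \bar{y})^2 + \sum_{i,j} (y_{ij} - \bar{y}_i)^2$

  where $y_{ij}$ is observation j in group i, $\bar{y}_i$ is the mean of the $n_i$ observations in group i, and $\bar{y}$ is the overall mean

                        total SS   =   fitting SS (between groups)   +   residual SS (within groups)
  degrees of freedom    n-1            p-1                               n-p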

Page 29: Linear  Modelling I

One-Way ANOVA

Component     Degrees of freedom   Mean Square (ratio of SS to df)   F-ratio (ratio of FMS/RMS)
Fitting SS    p-1                  FMS = FSS/(p-1)                   F = FMS/RMS
Residual SS   n-p                  RMS = RSS/(n-p)
Total SS      n-1

Under H0: no differences between groups, F ~ F(p-1,n-p)
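The one-way table can be rebuilt by hand from the group means. The sketch below assumes hypothetical data in p = 4 groups of unequal size: the between-groups SS sums the squared deviations of the group means from the overall mean weighted by group size, and the within-groups SS sums the squared deviations of each observation from its group mean. Group labels, sizes, means and the seed are arbitrary.

```r
set.seed(6)                                   # arbitrary seed
# hypothetical data: p = 4 groups of unequal size
group <- factor(rep(c("A", "B", "C", "D"), times = c(8, 10, 6, 12)))
true_means <- c(A = 0, B = 0.5, C = 1, D = 0)
y <- rnorm(length(group), mean = true_means[as.character(group)])

n <- length(y)
p <- nlevels(group)
group_mean <- tapply(y, group, mean)          # LS estimate for each group
n_i <- as.vector(table(group))                # group sizes

fss <- sum(n_i * (as.vector(group_mean) - mean(y))^2)   # between groups
rss <- sum((y - group_mean[as.character(group)])^2)     # within groups
tss <- sum((y - mean(y))^2)                             # total

f_ratio <- (fss / (p - 1)) / (rss / (n - p))
p_value <- pf(f_ratio, p - 1, n - p, lower.tail = FALSE)

tab <- anova(lm(y ~ group))
print(c(FSS = fss, RSS = rss, TSS = tss, F = f_ratio, p = p_value))
print(tab)
```

The ratio fss / tss is the fraction of variance explained by differences between groups.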

Page 30: Linear  Modelling I

One-Way ANOVA in R

> fam <- lm( bioch$Biochem.HDL ~ bioch$Family )
> anova(fam)
Analysis of Variance Table

Response: bioch$Biochem.HDL
               Df  Sum Sq Mean Sq F value    Pr(>F)
bioch$Family  173  6.3870  0.0369  3.4375 < 2.2e-16 ***
Residuals    1727 18.5478  0.0107
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Component     SS        Degrees of freedom   Mean Square (ratio of SS to df)   F-ratio (ratio of FMS/RMS)
Fitting SS    6.3870    173                  0.0369                            3.4375
Residual SS   18.5478   1727                 0.0107
Total SS      24.9348   1900

Page 31: Linear  Modelling I

Non-Parametric One-Way ANOVA

• Kruskal-Wallis Test
• Useful if data are highly non-Normal
  – Replace data by ranks
  – Compute average rank within each group
  – Compare averages
  – kruskal.test( formula, data )
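A short sketch of the test on hypothetical, strongly skewed data in three groups. The exponential rates, group labels and seed are illustrative assumptions. The average ranks printed at the end are the quantities the test compares.

```r
set.seed(7)                                   # arbitrary seed
# hypothetical skewed data in three groups of 15
dat <- data.frame(
  y     = c(rexp(15, rate = 1), rexp(15, rate = 1), rexp(15, rate = 0.2)),
  group = factor(rep(c("A", "B", "C"), each = 15))
)

kw <- kruskal.test(y ~ group, data = dat)
print(kw)

# average rank within each group
avg_rank <- tapply(rank(dat$y), dat$group, mean)
print(avg_rank)
```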

Page 32: Linear  Modelling I

Factors in R

• Grouping variables in R are called factors
• When a data frame is read with read.table()
  – a column is treated as numeric if all non-missing entries are numbers
  – a column is boolean (logical) if all non-missing entries are T or F (or TRUE or FALSE)
  – a column is treated as a factor otherwise (this was the default before R 4.0.0; from R 4.0.0 such columns are read as character unless stringsAsFactors=TRUE is given)
  – the levels of the factor are the set of distinct values
  – a column can be forced to be treated as a factor using the function as.factor(), or as a numeric vector using as.numeric()
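A small sketch of these rules, reading a hypothetical table from an inline string. stringsAsFactors=TRUE is passed explicitly so that text columns become factors regardless of the R version (from R 4.0.0 the default reads them as character). Column names and values are made up.

```r
# small table written inline (hypothetical values)
txt <- "id family hdl smoker
1 F1 1.2 T
2 F2 0.9 F
3 F1 1.5 T
4 F3 NA F"

d <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
str(d)                              # id: integer, family: factor, hdl: numeric, smoker: logical

print(levels(d$family))             # distinct values of the factor

# force a numeric column to be treated as a grouping variable
d$id_factor <- as.factor(d$id)
print(is.factor(d$id_factor))
```

Note that as.numeric() applied to a factor returns the internal level codes; to recover numbers stored as factor labels, as.numeric(as.character(f)) is the usual idiom.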

Page 33: Linear  Modelling I

Linear Modelling in R

• The R function lm() fits linear models
• It has two principal arguments (and some optional ones)
• f <- lm( formula, data )
  – formula is an R formula
  – data is the name of the data frame containing the data (can be omitted if the variables in the formula are written with the data frame, eg bioch$Biochem.HDL)

Page 34: Linear  Modelling I

formulae in R

• Biochem.HDL ~ Biochem.Tot.Cholesterol
  – linear regression of HDL on Cholesterol
  – 1 df

• Biochem.HDL ~ Family
  – one-way analysis of variance of HDL on Family
  – 173 df

• The degrees of freedom are the number of independent parameters to be estimated

Page 35: Linear  Modelling I

ANOVA in R

• f <- lm(Biochem.HDL ~ Biochem.Tot.Cholesterol, data=bioch)
• [OR f <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)]

• a <- anova(f)

• f <- lm(Biochem.HDL ~ Family, data=bioch)
• a <- anova(f)
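The two ways of writing the call can be compared on a small simulated data frame. The data frame toy and its columns hdl, chol and family are hypothetical stand-ins, not the biochemistry data from the slides. The sketch also shows the degrees of freedom for a regression (1) and for a one-way ANOVA on a 5-level factor (4).

```r
set.seed(8)                                   # arbitrary seed
# hypothetical stand-in data: 40 subjects in 5 families
toy <- data.frame(
  chol   = rnorm(40, 5, 1),
  family = factor(rep(paste0("fam", 1:5), each = 8))
)
toy$hdl <- 0.2 + 0.4 * toy$chol + rnorm(40, 0, 0.3)

f1 <- lm(hdl ~ chol, data = toy)              # columns looked up in the data frame
f2 <- lm(toy$hdl ~ toy$chol)                  # data frame named inside the formula

a1 <- anova(f1)                               # regression: 1 df
a2 <- anova(lm(hdl ~ family, data = toy))     # one-way ANOVA: 5 - 1 = 4 df

print(coef(f1))
print(coef(f2))
print(a1)
print(a2)
```

Both calls give the same estimates; the data= form keeps coefficient names shorter and makes predict() on new data easier.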

Page 36: Linear  Modelling I

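The Chi-Squared Test

• Data are a table of counts, with the row and column margins fixed.

• H0: the counts in each cell are consistent with the rows and columns acting independently

Each cell shows the observed count $n_{ij}$ and, in brackets, its expected count $r_i c_j / N$ under H0:

                 column 1                    column 2                    row total
row 1            $n_{11}$ ($r_1 c_1/N$)      $n_{12}$ ($r_1 c_2/N$)      $r_1 = n_{11} + n_{12}$
row 2            $n_{21}$ ($r_2 c_1/N$)      $n_{22}$ ($r_2 c_2/N$)      $r_2 = n_{21} + n_{22}$
column total     $c_1 = n_{11} + n_{21}$     $c_2 = n_{12} + n_{22}$     N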
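A sketch of the calculation on a hypothetical 2 x 2 table. The expected counts are row total times column total divided by N, and the usual Pearson statistic (assumed here) is the sum over cells of (observed - expected)^2 / expected. The result is compared with R's chisq.test(), with the continuity correction switched off so that the two agree.

```r
# hypothetical 2 x 2 table of counts
tab <- matrix(c(30, 20,
                10, 40), nrow = 2, byrow = TRUE)

r_tot <- rowSums(tab)                 # row totals r1, r2
c_tot <- colSums(tab)                 # column totals c1, c2
N     <- sum(tab)

expected <- outer(r_tot, c_tot) / N   # r_i * c_j / N for every cell
x2 <- sum((tab - expected)^2 / expected)

ct <- chisq.test(tab, correct = FALSE)   # no continuity correction
print(expected)
print(c(by_hand = x2, chisq.test = unname(ct$statistic)))
print(ct$p.value)
```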

Page 37: Linear  Modelling I

The Chi-Squared Test

Heuristic Justification: The counts in the cells follow a Multinomial Distribution; the count in a given cell ij has expectation $Np_{ij}$ and variance $Np_{ij}(1-p_{ij})$. Then (O-E)