Correlation analysis. Regression. (uclrmjbale/stat/5.pdf, 2018-10-10)


Correlation analysis.

Regression.

7.1

7.12

7.13


7.33

Values from the same group tend to be similar.

There is no tendency for values from the same group to be similar.

7.15

7.14

Modelling of data:Linear regression

7.16

7.17 Overview

7.18 Overview of the model fitting process

7.20 Linear regression

7.21 Estimation using ordinary least squares (OLS)

7.22 Normal equations

7.23 OLS solutions and predictions
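As a sketch of what the OLS solution looks like in the one-predictor case, the slope and intercept that `lm()` returns can be reproduced by hand from the sample covariance and means (the data here are simulated for illustration):

```r
# Simulated data (hypothetical example): y depends linearly on x plus noise
set.seed(1)
x <- seq(1, 20)
y <- 2 + 3 * x + rnorm(20, 0, 0.5)

# Closed-form OLS estimates from the normal equations
k <- cov(x, y) / var(x)       # slope = Sxy / Sxx
b <- mean(y) - k * mean(x)    # intercept = ybar - k * xbar

# They agree with lm() up to floating-point error
fit <- lm(y ~ x)
c(b, k)
coef(fit)
```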

7.24 A statistical model

7.25 Model assumptions

7.26 Maximum likelihood estimation

7.28 Correlation coefficient

7.29

7.30 Hypothesis testing

7.31

7.32 Hypothesis testing for intercept and slope

7.34

7.35 Check data before doing a regression!!

7.36

7.37 Model diagnostic: variance and linearity

7.38 Variance stabilizing methods

7.39 Normal residuals

7.40 Non-normal errors

7.41 Correlated residuals (autocorrelation)
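One simple way to screen for correlated residuals is to inspect their sample autocorrelation function. A minimal sketch with simulated AR(1) errors (the data and the 0.8 autoregressive coefficient are illustrative assumptions):

```r
# Sketch: detecting residual autocorrelation with acf()
set.seed(2)
x <- 1:100
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = 100))  # AR(1) errors
y <- 1 + 0.5 * x + e
t0 <- lm(y ~ x)
r <- acf(residuals(t0), plot = FALSE)
r$acf[2]   # lag-1 autocorrelation; values far from 0 suggest correlated errors
```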


7.42 Other variable transformations

7.43 Box-Cox family of transformations
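The Box-Cox parameter lambda is usually chosen by maximizing the profile log-likelihood. A minimal sketch using `MASS::boxcox()` (MASS ships with R; the simulated data are built so that the log transform, lambda near 0, is appropriate):

```r
# Sketch: picking a Box-Cox lambda via the profile log-likelihood
library(MASS)
set.seed(3)
x <- runif(100, 1, 10)
y <- exp(1 + 0.3 * x + rnorm(100, 0, 0.2))   # log(y) is linear in x, so lambda ~ 0
bc <- boxcox(y ~ x, lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # lambda maximizing the profile log-likelihood
lambda
```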

7.44 Parameter tuning

7.45

7.46 Model building

7.47 Separate linear regressions. Examples: consider the following scenario

7.48 Basis functions

B_r(x)

7.53 Goodness of fit criteria

7.54 Recap

7.55 Linear regression in R:

case <- read.csv("case1.txt", header = TRUE, sep = "\t")
plot(case[, 5], case[, 6])
t0 <- lm(case[, 6] ~ case[, 5])
k <- summary(t0)[[4]][2, 1]   # slope estimate
b <- summary(t0)[[4]][1, 1]   # intercept estimate
x <- seq(5, 70, by = 1)
points(x, k * x + b, type = "l", col = "red")
# or simply: abline(t0)

7.56 Before accepting the result of a linear regression it is important to evaluate its suitability for explaining the data.

layout(matrix(1:4, 2, 2)); plot(t0)

7.57 One more example:

x <- seq(1, 10, by = 1)
y <- x + rnorm(10, 0, 1)
y[5] <- 50
t0 <- lm(y ~ x)
plot(x, y); abline(t0)

7.58

layout(matrix(1:4, 2, 2))
plot(t0)

7.59 Leverage and Cook’s distance: Cook’s distance measures the effect of deleting a given observation. Points with a large Cook’s distance merit closer analysis. It is the sum of squared differences between the predictions from the full regression model and the predictions from the model refitted with that point deleted, scaled by p·MSE, where p is the number of fitted parameters and MSE is the mean squared error of the regression model.
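Applied to the outlier example above (the point moved to 50), Cook’s distance immediately singles out the bad observation. A minimal sketch using the built-in `cooks.distance()`:

```r
# Sketch: Cook's distance flags the planted outlier at observation 5
set.seed(4)
x <- 1:10
y <- x + rnorm(10, 0, 1)
y[5] <- 50                 # gross outlier, as in the example above
t0 <- lm(y ~ x)
d <- cooks.distance(t0)
which.max(d)               # observation 5 dominates
```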

Robust regression: observations are reweighted iteratively; as the absolute residual goes down, the weight goes up.
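This reweighting can be seen directly in the weights returned by `MASS::rlm()` (a sketch, assuming the MASS package and its default Huber weighting; the data reuse the outlier example above):

```r
# Sketch: robust regression downweights observations with large residuals
library(MASS)
set.seed(5)
x <- 1:10
y <- x + rnorm(10, 0, 1)
y[5] <- 50                 # gross outlier
fit <- rlm(y ~ x)
round(fit$w, 2)            # final IRLS weights: near 1 for good points
fit$w[5]                   # the outlier receives a very small weight
```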

7.60

7.61 Nonlinear regression in R:

x <- seq(0, 10, by = 0.1)
y <- 3 * sin(x) + 1 + rnorm(length(x), mean = 0, sd = 0.3)
plot(x, y)
t1 <- nls(y ~ b * sin(x) + a, start = list(a = 0.1, b = 0.1))

7.62

points(x, summary(t1)[[10]][2, 1] * sin(x) + summary(t1)[[10]][1, 1], type = "l", col = "red")

7.55 Defining models in R

It is necessary to understand the syntax for defining models in R. Assume that the dependent variable being modeled is Y and that A, B and C are independent variables that might affect Y. The table below provides some useful examples. Note that the mathematical symbols used to define models do not have their usual arithmetic meanings!
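The table of examples itself did not survive the transcript; as a sketch, a few of the standard formula idioms (the data below are simulated purely for illustration):

```r
# Common R model formulas (A, B are predictors of Y):
#   Y ~ A              simple regression of Y on A
#   Y ~ A + B          additive model with two predictors
#   Y ~ A * B          main effects plus interaction (same as Y ~ A + B + A:B)
#   Y ~ A - 1          regression through the origin (no intercept)
#   Y ~ I(A^2)         I() protects arithmetic so ^ keeps its usual meaning
set.seed(6)
A <- runif(50); B <- runif(50)
Y <- 1 + 2 * A + 3 * B + rnorm(50, 0, 0.1)
coef(lm(Y ~ A * B))   # intercept, A, B and the A:B interaction term
```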

7.63

Risk, odds, odds ratio and

logistic regression

7.65

p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
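The link between probability, odds and the linear predictor can be checked numerically. A minimal sketch with hypothetical coefficient values:

```r
# Sketch: probability, odds and log-odds in logistic regression
b0 <- -1; b1 <- 0.5; x <- 2              # hypothetical coefficients and input
logit <- b0 + b1 * x                      # linear predictor = log-odds
p <- exp(logit) / (1 + exp(logit))        # logistic transform to a probability
odds <- p / (1 - p)
all.equal(log(odds), logit)               # log of the odds recovers b0 + b1*x
```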

7.66 Logistic regression in R

mylogit <- glm(y ~ x, family = binomial(link = "logit"))
b0 <- mylogit$coefficients[1]
b1 <- mylogit$coefficients[2]
summary(mylogit)

7.67

7.68

7.69

7.70

7.71

mylogit <- glm(as.formula(data[, 1] ~ data[, 2] + data[, 3]),
               family = binomial(link = "logit"), na.action = na.pass)

koef1 <- exp(mylogit$coefficients[2])                          # odds ratio
koef2 <- exp(confint(mylogit))[2, 1]                           # confidence interval of odds ratio, left
koef3 <- exp(confint(mylogit))[2, 2]                           # confidence interval of odds ratio, right
koef4 <- summary(mylogit)[["coefficients"]][, "Pr(>|z|)"][2]   # p-value of odds ratio

Extracting parameters of logistic regression

Stepwise regression

• Any stepwise procedure in logistic regression is based on a statistical algorithm that checks for the "importance" of variables, and either includes or excludes them on the basis of a fixed decision rule.

• The "importance" of a variable is defined in terms of a measure of the statistical significance of the coefficient for the variable.

• The statistic used depends on the assumptions of the model. In stepwise linear regression an F-test is used, since the errors are assumed to be normally distributed. In logistic regression the errors are assumed to follow a binomial distribution, and significance is assessed via a likelihood ratio chi-square test.

• Thus at any step in the procedure the most important variable, in statistical terms, is the one that produces the greatest change in the log-likelihood relative to a model not containing the variable.

Stepwise regression in R

Any stepwise regression procedure is an algorithm for forward selection followed by backward elimination.

stepAIC(object, direction = c("both", "backward", "forward"))

X <- runif(250, -2.5, 2.5)
Y <- runif(250, -2.5, 2.5)
Z <- runif(250, -2.5, 2.5)
K <- round(1 / (1 + exp(-X)) + runif(250, -0.01, 0.01))   # noise vector must match length(X)
data <- cbind(K, X, Y, Z)
library(MASS)
mylogit <- glm(as.formula(data[, 1] ~ data[, 2] + data[, 3] + data[, 4] +
                          data[, 2] * data[, 3] + data[, 2] * data[, 4] +
                          data[, 3] * data[, 4]),
               family = binomial(link = "logit"), na.action = na.pass)
step <- stepAIC(mylogit, direction = "both")
