BST 226 Statistical Methods for Bioinformatics
David M. Rocke
February 24, 2014 BST 226 Statistical Methods for Bioinformatics 1
Prediction and Classification
We will continue to work within the framework of predictors that are linear in the parameters. We can use criteria for the fit that differ from the usual least squares criterion or its generalization in GLMs. For problems with many variables, we need to use cross-validation or related methods. Alternative criteria and methods include partial least squares, support vector machines, and penalized regression. For many of these methods, we need to write code to do the cross validation, but some have built-in methods.
Cross Validation for (Generalized) Linear Models
The package cvTools computes cross-validated prediction error for linear models. The function cvFit() handles K-fold cross validation, possibly repeated, with random, systematic, or pre-determined splits. The predictive evaluation of the model is based on a cost function relating the actual value of y to the predicted value, such as mean square prediction error or root mean square prediction error.
Cross Validation for (Generalized) Linear Models
The function cv.glm() in the boot package (installed by default) performs K-fold cross validation for a generalized linear model. The cost function can be specified, but is by default the mean square prediction error. The input is a glm object, including the model and the data. This can adjust for over-fitting, but not for model, variable, or feature selection. Neither cvFit() nor cv.glm() is of much use for high-dimensional data.
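The slides describe cvFit() and cv.glm() in R; as a language-neutral illustration of what K-fold cross validation with a mean-square-prediction-error cost actually computes, here is a minimal pure-Python sketch. The function names and toy data are illustrative only, not part of either R package.

```python
# Minimal K-fold cross-validation for simple linear regression,
# with mean square prediction error as the cost function.
# Pure-Python sketch; 'fit_slr', 'kfold_mse', and the data are illustrative.

def fit_slr(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def kfold_mse(x, y, k):
    """Systematic K-fold split: fold i gets observations i, i+k, i+2k, ..."""
    n = len(x)
    sq_errors = []
    for i in range(k):
        test = [j for j in range(n) if j % k == i]
        train = [j for j in range(n) if j % k != i]
        a, b = fit_slr([x[j] for j in train], [y[j] for j in train])
        sq_errors += [(y[j] - (a + b * x[j])) ** 2 for j in test]
    return sum(sq_errors) / n

# Toy data: y is exactly linear in x, so cross-validated MSE is (near) zero.
x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
print(kfold_mse(x, y, 5))
```

With noisy data, the same loop estimates the out-of-sample prediction error rather than the (optimistic) in-sample fit.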
Prediction Error Metrics for Quantitative Data
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i (y_i - \hat y_i)^2$   (Mean Square Prediction Error)
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i |y_i - \hat y_i|$   (Mean Absolute Error)
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i |y_i - \hat y_i| / y_i$   (Mean Absolute Percentage Error)
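The three quantitative metrics above translate directly into code. Here is a pure-Python sketch computing each from its formula; the data values are illustrative only.

```python
# MSE, MAE, and MAPE computed directly from their definitions.

def mse(y, yhat):
    """Mean square prediction error: n^-1 * sum (y_i - yhat_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def mae(y, yhat):
    """Mean absolute error: n^-1 * sum |y_i - yhat_i|."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    """Mean absolute percentage error; only sensible when every y_i != 0."""
    return sum(abs(a - b) / abs(a) for a, b in zip(y, yhat)) / len(y)

y    = [2.0, 4.0, 5.0, 8.0]
yhat = [1.0, 4.0, 7.0, 8.0]
print(mse(y, yhat), mae(y, yhat), mape(y, yhat))
```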
Prediction Error Metrics for Qualitative or Count Data
Here $\hat y$ is the prediction on the response scale, so for logistic regression $y$ is 0 or 1 and $\hat y \in (0,1)$. Also, let $C(\hat y)$ be the predicted class of $y$ if $y$ is qualitative.
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i (y_i - \hat y_i)^2$   (Mean Square Prediction Error)
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i |y_i - \hat y_i|$   (Mean Absolute Error)
$f_{\mathrm{pred}}(y, \hat y) = -2\sum_i \log p(y_i \mid \hat\theta_i)$   (Deviance for generalized linear models)
$f_{\mathrm{pred}}(y, \hat y) = \sum_i I[C(\hat y_i) \ne y_i] = \sum_i (y_i - C(\hat y_i))^2 = \sum_i |y_i - C(\hat y_i)|$   (Class errors for binomial or multinomial models; the latter two equalities hold only for binomial models). Requires cutpoint(s).
AUC for two-class logistic regression.
Deviance for Binomials
If we have $n$ observations of which $x$ are 1, then
$f(x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x}$
$-2 \log f(x \mid n, p) = -2\left[\log\binom{n}{x} + x \log p + (n-x)\log(1-p)\right]$
With a parameter estimate $\hat p$, the deviance is
$-2\left[\log\binom{n}{x} + x \log \hat p + (n-x)\log(1-\hat p)\right]$
which is at a minimum when $\hat p = x/n$. When $\hat p$ varies across observations, this is still a good measure of predictive performance, but harder to parse.
Deviance for the Normal Distribution
$f(\{x_i\} \mid \hat\mu_i, \hat\sigma) = (2\pi\hat\sigma^2)^{-n/2} \exp\left[-\sum_i (x_i - \hat\mu_i)^2 / 2\hat\sigma^2\right]$
$D(\{x_i\} \mid \hat\mu_i, \hat\sigma) = -2\log f = n\log(2\pi) + n\log\hat\sigma^2 + \sum_i (x_i - \hat\mu_i)^2 / \hat\sigma^2$
which is a function of the mean square prediction error.
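A quick numerical check of the claim that the binomial deviance is minimized at $\hat p = x/n$: the sketch below drops the constant log-binomial-coefficient term (which does not depend on $\hat p$) and scans a grid of candidate estimates. Data values are illustrative only.

```python
import math

# Binomial deviance as a function of the estimate p-hat, dropping the
# constant log-binomial-coefficient term. The grid search confirms
# numerically that the minimum is at p-hat = x/n (here 3/10 = 0.3).

def binom_deviance(x, n, phat):
    return -2 * (x * math.log(phat) + (n - x) * math.log(1 - phat))

x, n = 3, 10
grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=lambda p: binom_deviance(x, n, p))
print(best)  # minimized at x/n = 0.3
```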
The Classification Error Metric
The metric is very coarse. In the AD data, there are 33 AD patients and 30 non-demented controls. If one method gets 42 correct and another gets 44 correct, that is not a large difference. It counts a near miss the same as a prediction wide of the mark. It is, however, widely used, and in contexts where action decisions need to be made it may be appropriate.
Partial Least Squares (PLS)
Let $y$ be a quantity to be predicted and $x$ a $p$-vector of predictors including a constant term. We consider linear combinations $w = u'x$ of the variables.
For linear regression, we choose the vector $u$ so as to minimize the mean square error of prediction, $n^{-1}\sum_i (y_i - u'x_i)^2$, which is equivalent to maximizing the correlation between $y$ and $u'x$.
For principal components regression, we choose $u$ to maximize the variance of the scores, $\mathrm{Var}(u'x)$. This defines the first PC. We can use as many as needed, but if $p$ is very large, this can use all the variables without singularity.
Partial Least Squares chooses the first component to maximize the covariance of $y$ and $w = u'x$:
$\mathrm{Cov}(y, w) = \mathrm{Cor}(y, w)\sqrt{\mathrm{Var}(y)}\sqrt{\mathrm{Var}(w)}$
Since the variance of $y$ is fixed, this is a kind of product of the criteria for PCA and linear regression.
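For centered data, maximizing $\mathrm{Cov}(y, u'x)$ over unit vectors $u$ gives a first PLS direction proportional to $X'y$. Here is a pure-Python sketch of that computation on illustrative toy data (two predictors, no intercept); the function names are not from any PLS package.

```python
import math

# First PLS direction for centered data: the unit vector u maximizing
# Cov(y, u'x) is proportional to X'y. Toy data for illustration only.

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def pls_direction(X, y):
    """X: list of centered predictor columns; y: centered response."""
    w = [sum(xj * yi for xj, yi in zip(col, y)) for col in X]
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return [wi / norm for wi in w]

x1 = center([1.0, 2.0, 3.0, 4.0])
x2 = center([1.0, 1.0, 0.0, 0.0])
y  = center([1.0, 2.0, 3.0, 4.0])   # y tracks x1 far more than x2
w = pls_direction([x1, x2], y)
print(w)  # weight on x1 dominates
```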
Partial Least Squares (PLS)
If we have 1000 variables and 100 observations, we cannot use linear regression without selecting a small number of variables. PCA can reduce the dimension, but predictions are often poor. PLS can reduce the dimension considering both prediction of y and capturing the variance in X. It can be used without variable selection, though it is often better with it. It can be used with very high-dimensional data.
Support Vector Machines
If we have two classes in p-dimensional space, then we can use logistic regression to define a projection that will be used to classify. This direction is undefined if the classification is perfect. LDA can be used in either case, but requires fitting a p by p covariance matrix. SVM finds a direction that maximizes the margin, the gap between the two classes. It only works if there is complete separation, but new variables can be defined to enlarge the space and make the margin exist.
Penalized Regression
For various reasons, the least-squares criterion may not be optimal:
The covariates may be highly correlated.
There may be more covariates than observations.
We may have too many variables in the predictor.
Sometimes adding another term to the criterion function can help with these problems. Usually, the other term(s) penalize the number or the size of the coefficients.
Penalizing the Number of Predictors
AIC (Akaike Information Criterion): AIC = 2p − 2 ln(L)
AICc (corrected for sample size): AICc = AIC + 2p(p + 1)/(n − p − 1)
BIC (Bayesian Information Criterion): BIC = p ln(n) − 2 ln(L)
These are used to select models, but they do not affect the parameter estimates of a model with a fixed number of parameters.
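The three criteria above are simple functions of the log-likelihood ln(L), the number of parameters p, and the sample size n. A pure-Python sketch, with illustrative input values:

```python
import math

# AIC, AICc, and BIC computed from their definitions.

def aic(loglik, p):
    return 2 * p - 2 * loglik

def aicc(loglik, p, n):
    return aic(loglik, p) + 2 * p * (p + 1) / (n - p - 1)

def bic(loglik, p, n):
    return p * math.log(n) - 2 * loglik

# Illustrative values only: log-likelihood -50, p = 4 parameters, n = 30.
print(aic(-50, 4), aicc(-50, 4, 30), bic(-50, 4, 30))
```

Note that AICc approaches AIC as n grows, while BIC penalizes extra parameters more heavily than AIC once n > e² ≈ 7.4.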
Ridge Regression
Ridge regression is sometimes used when there is substantial collinearity in the predictors. If $X$ is an $n$ by $p$ matrix of predictors, and $X'X$ has at least one small or zero eigenvalue, then the coefficient estimates
$\hat\beta = (X'X)^{-1}X'y$
will be unstable. The ridge regression estimates minimize, over $\beta_0$ and $\beta$, the criterion
$\sum_{i=1}^{n}(y_i - \beta_0 - x_i'\beta)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$
Since this penalizes the size of the coefficients, it will clearly shrink them towards 0:
$\hat\beta_{\mathrm{ridge}} = (X'X + \lambda I)^{-1}X'y$
Under some assumptions, the (biased) estimates from ridge regression have a smaller mean square error than those from the least squares estimates.
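With a single centered predictor and no intercept, the ridge solution above reduces to a scalar formula, which makes the shrinkage easy to see. A pure-Python sketch with illustrative data:

```python
# One-predictor ridge regression (centered data, no intercept). The
# criterion sum (y_i - x_i*b)^2 + lambda*b^2 has the closed-form solution
#   b_ridge = sum(x*y) / (sum(x^2) + lambda),
# the scalar case of (X'X + lambda*I)^-1 X'y. Data illustrative only.

def ridge_1d(x, y, lam):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [-1.5, -0.5, 0.5, 1.5]           # centered predictor
y = [-3.0, -1.0, 1.0, 3.0]           # y = 2x exactly
for lam in [0.0, 5.0, 50.0]:
    print(lam, ridge_1d(x, y, lam))  # slope shrinks from 2.0 toward 0
```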
The lasso and the elastic net
The lasso (Tibshirani 1996) is used when there is a preference for sparse predictors, meaning that many of them should have coefficients of exactly zero. Lasso regression estimates minimize, over $\beta_0$ and $\beta$, the criterion
$\sum_{i=1}^{n}(y_i - \beta_0 - x_i'\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$
Since this penalizes the size of the coefficients, it will also shrink them towards 0, but it does so in such a way as to force many of them exactly to zero.
The elastic net is a combined approach which can penalize both the absolute values and the squares. Elastic net regression estimates minimize, over $\beta_0$ and $\beta$, the criterion
$\sum_{i=1}^{n}(y_i - \beta_0 - x_i'\beta)^2 + \lambda P_\alpha(\beta)$
$P_\alpha(\beta) = \frac{(1-\alpha)}{2}\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}|\beta_j|$
If $\alpha = 0$, this is ridge regression. If $\alpha = 1$, this is the lasso.
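In the single-predictor case (centered data, no intercept), the elastic net criterion above can be minimized in closed form via a soft-thresholding operator, which shows directly how the absolute-value penalty produces exact zeros. A pure-Python sketch with illustrative data and penalty values:

```python
# Single-predictor elastic net (centered data, no intercept), minimizing
#   sum (y_i - x_i*b)^2 + lambda * ( (1-alpha)/2 * b^2 + alpha * |b| ).
# Setting the (sub)derivative to zero gives a soft-thresholding formula.

def soft(z, t):
    """Soft-threshold operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def enet_1d(x, y, lam, alpha):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return soft(2 * sxy, lam * alpha) / (2 * sxx + lam * (1 - alpha))

x = [-1.5, -0.5, 0.5, 1.5]
y = [-3.0, -1.0, 1.0, 3.0]          # least-squares slope is 2.0
print(enet_1d(x, y, 0.0, 1.0))      # no penalty: 2.0
print(enet_1d(x, y, 5.0, 0.0))      # alpha = 0: ridge-type shrinkage
print(enet_1d(x, y, 100.0, 1.0))    # alpha = 1, large lambda: exactly 0.0
```

The ridge part only divides the coefficient down, while the lasso part subtracts a constant before dividing, so a large enough lambda with alpha > 0 sets the coefficient exactly to zero.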
glmnet
The R package glmnet implements the elastic net. The input number of variables can be extremely large. The output number of variables with non-zero coefficients can be small if the penalty coefficient α is not 0. The parameter α must be specified (default = 1, the lasso), and the parameter λ is optimized. The built-in cross validation can be used to choose λ, though this can sometimes result in remaining bias.
Nested Cross Validation
Suppose we will use the elastic net with α = 0.5, and wish to optimize λ. If we use the built-in cross validation capacity and choose the value of λ with the best prediction score, then we can have some bias. We can use nested cross-validation to eliminate this problem.
Nested Cross Validation
Divide the data into k pieces. For i in 1:k, remove piece i as a test set; the remaining k – 1 pieces comprise the training set. Pick the best value of λ in the training set using the cross validation capacity of glmnet. Using this "optimal" λ, and the predictor estimated from the training set, predict out to the test set. This is nested cross validation because inside the training set we do another CV to find the parameter λ.
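The steps above can be sketched in pure Python. To keep it self-contained, the inner model is one-predictor ridge regression with a small λ grid rather than glmnet's elastic net; the data, grid, and fold scheme are illustrative only.

```python
# Nested cross-validation: an outer K-fold loop measures prediction error,
# and inside each outer training set an inner CV picks lambda.
# One-predictor ridge regression stands in for the elastic net here.

def ridge_1d(x, y, lam):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def cv_mse(x, y, lam, k):
    """Plain K-fold CV error for a fixed lambda (the inner CV)."""
    n = len(x)
    err = 0.0
    for i in range(k):
        tr = [j for j in range(n) if j % k != i]
        te = [j for j in range(n) if j % k == i]
        b = ridge_1d([x[j] for j in tr], [y[j] for j in tr], lam)
        err += sum((y[j] - b * x[j]) ** 2 for j in te)
    return err / n

def nested_cv_mse(x, y, grid, k):
    """Outer K-fold loop; inside each training set, pick lambda by inner CV."""
    n = len(x)
    err = 0.0
    for i in range(k):
        tr = [j for j in range(n) if j % k != i]
        te = [j for j in range(n) if j % k == i]
        xtr, ytr = [x[j] for j in tr], [y[j] for j in tr]
        lam = min(grid, key=lambda l: cv_mse(xtr, ytr, l, k))  # inner CV
        b = ridge_1d(xtr, ytr, lam)
        err += sum((y[j] - b * x[j]) ** 2 for j in te)
    return err / n

# Fixed noisy linear toy data, so the result is reproducible.
x = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.3, -0.2, 0.1, -0.3]
y = [2.0 * xi + e for xi, e in zip(x, noise)]
print(nested_cv_mse(x, y, [0.0, 1.0, 10.0], 5))
```

The key point is that the outer test fold never influences the choice of λ, so the outer error estimate is free of the selection bias described above.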
> library(glmnet)
> isub <- diag != "OD"
> admat.net <- as.matrix(ad.data[isub,-1])
> diag.net <- as.factor(as.character(diag[isub]))
> ad.cv <- cv.glmnet(admat.net,diag.net,family="binomial")
> lmin <- ad.cv$lambda.min
> l1se <- ad.cv$lambda.1se
> ad.net <- glmnet(admat.net,diag.net,family="binomial")
> predict(ad.net,s=lmin,type="nonzero")
  X1
1 51
2 53
3 61
4 69
> predict(ad.net,s=l1se,type="nonzero")
  which
1    51
> colnames(admat.net)[predict(ad.net,s=lmin,type="nonzero")$X1]
[1] "Fibrinogen"   "GLP.1.active" "HCC.4"        "IGF.1"
> plot(cv.glmnet(admat.net,diag.net,family="binomial",type="class"))
Stuff to Try
Repeat the cross validation, plotting each time. Try this with the default (deviance), with type = "response", and with type = "class". What happens if you use a different value of α? Compare the results to the non-cross-validated elastic net.
> sum(predict(ad.net,admat.net,s=lmin,type="class") != diag.net)
[1] 18
> 18/33
[1] 0.5454545