BST 226 Statistical Methods for Bioinformatics
David M. Rocke
February 24, 2014 BST 226 Statistical Methods for Bioinformatics 1
Prediction and Classification
We will continue to work within the framework of predictors that are linear in the parameters. We can use criteria for the fit that differ from the usual least squares criterion or its generalization in GLMs. For problems with many variables, we need to use cross-validation or related methods. Alternative criteria and methods include partial least squares, support vector machines, and penalized regression. For many of these methods, we need to write code to do the cross validation, but some have built-in methods.
Cross Validation for (Generalized) Linear Models
The package cvTools computes cross-validated prediction error for linear models. The function cvFit() handles K-fold cross validation, possibly repeated, with random, systematic, or pre-determined splits. The predictive evaluation of the model is based on a cost function relating the actual value of y to the predicted value, such as mean square prediction error or root mean square prediction error.
Cross Validation for (Generalized) Linear Models
The function cv.glm() in the boot package (installed by default) performs K-fold cross validation for a generalized linear model. The cost function can be specified, but is by default the mean square prediction error. The input is a glm object, including the model and the data. This can adjust for over-fitting, but not for model, variable, or feature selection. Neither cvFit() nor cv.glm() is of much use for high-dimensional data.
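The slides describe cvFit() and cv.glm() in R; as a language-neutral illustration of what K-fold cross validation with a mean-square-prediction-error cost actually computes, here is a minimal pure-Python sketch. The function names and toy data are illustrative only, not part of either R package.

```python
# Minimal K-fold cross-validation for simple linear regression,
# with mean square prediction error as the cost function.
# Pure-Python sketch; 'fit_slr', 'kfold_mse', and the data are illustrative.

def fit_slr(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def kfold_mse(x, y, k):
    """Systematic K-fold split: fold i gets observations i, i+k, i+2k, ..."""
    n = len(x)
    sq_errors = []
    for i in range(k):
        test = [j for j in range(n) if j % k == i]
        train = [j for j in range(n) if j % k != i]
        a, b = fit_slr([x[j] for j in train], [y[j] for j in train])
        sq_errors += [(y[j] - (a + b * x[j])) ** 2 for j in test]
    return sum(sq_errors) / n

# Toy data: y is exactly linear in x, so cross-validated MSE is (near) zero.
x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
print(kfold_mse(x, y, 5))
```

With noisy data, the same loop estimates the out-of-sample prediction error rather than the (optimistic) in-sample fit.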
Prediction Error Metrics for Quantitative Data
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i (y_i - \hat y_i)^2$   (Mean Square Prediction Error)
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i |y_i - \hat y_i|$   (Mean Absolute Error)
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i |y_i - \hat y_i| / y_i$   (Mean Absolute Percentage Error)
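The three quantitative metrics above translate directly into code. Here is a pure-Python sketch computing each from its formula; the data values are illustrative only.

```python
# MSE, MAE, and MAPE computed directly from their definitions.

def mse(y, yhat):
    """Mean square prediction error: n^-1 * sum (y_i - yhat_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def mae(y, yhat):
    """Mean absolute error: n^-1 * sum |y_i - yhat_i|."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def mape(y, yhat):
    """Mean absolute percentage error; only sensible when every y_i != 0."""
    return sum(abs(a - b) / abs(a) for a, b in zip(y, yhat)) / len(y)

y    = [2.0, 4.0, 5.0, 8.0]
yhat = [1.0, 4.0, 7.0, 8.0]
print(mse(y, yhat), mae(y, yhat), mape(y, yhat))
```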
Prediction Error Metrics for Qualitative or Count Data
Here $\hat y$ is the prediction on the response scale, so for logistic regression $y$ is 0 or 1 and $\hat y \in (0,1)$. Also, let $C(\hat y)$ be the predicted class of $y$ if $y$ is qualitative.
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i (y_i - \hat y_i)^2$   (Mean Square Prediction Error)
$f_{\mathrm{pred}}(y, \hat y) = n^{-1}\sum_i |y_i - \hat y_i|$   (Mean Absolute Error)
$f_{\mathrm{pred}}(y, \hat y) = -2\sum_i \log p(y_i \mid \hat\theta_i)$   (Deviance for generalized linear models)
$f_{\mathrm{pred}}(y, \hat y) = \sum_i I[C(\hat y_i) \ne y_i] = \sum_i (y_i - C(\hat y_i))^2 = \sum_i |y_i - C(\hat y_i)|$   (Class errors for binomial or multinomial models; the latter two equalities hold only for binomial models). Requires cutpoint(s).
AUC for two-class logistic regression.
Deviance for Binomials
If we have $n$ observations of which $x$ are 1, then
$f(x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x}$
$-2 \log f(x \mid n, p) = -2\left[\log\binom{n}{x} + x \log p + (n-x)\log(1-p)\right]$
With a parameter estimate $\hat p$, the deviance is
$-2\left[\log\binom{n}{x} + x \log \hat p + (n-x)\log(1-\hat p)\right]$
which is at a minimum when $\hat p = x/n$. When $\hat p$ varies across observations, this is still a good measure of predictive performance, but harder to parse.
Deviance for the Normal Distribution
$f(\{x_i\} \mid \hat\mu_i, \hat\sigma) = (2\pi\hat\sigma^2)^{-n/2} \exp\left[-\sum_i (x_i - \hat\mu_i)^2 / 2\hat\sigma^2\right]$
$D(\{x_i\} \mid \hat\mu_i, \hat\sigma) = -2\log f = n\log(2\pi) + n\log\hat\sigma^2 + \sum_i (x_i - \hat\mu_i)^2 / \hat\sigma^2$
which is a function of the mean square prediction error.
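A quick numerical check of the claim that the binomial deviance is minimized at $\hat p = x/n$: the sketch below drops the constant log-binomial-coefficient term (which does not depend on $\hat p$) and scans a grid of candidate estimates. Data values are illustrative only.

```python
import math

# Binomial deviance as a function of the estimate p-hat, dropping the
# constant log-binomial-coefficient term. The grid search confirms
# numerically that the minimum is at p-hat = x/n (here 3/10 = 0.3).

def binom_deviance(x, n, phat):
    return -2 * (x * math.log(phat) + (n - x) * math.log(1 - phat))

x, n = 3, 10
grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=lambda p: binom_deviance(x, n, p))
print(best)  # minimized at x/n = 0.3
```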
The Classification Error Metric
The metric is very coarse. In the AD data, there are 33 AD patients and 30 non-demented controls. If one method gets 42 correct and another gets 44 correct, that is not a large difference. It counts a near miss the same as a prediction wide of the mark. It is, however, widely used, and in contexts where action decisions need to be made it may be appropriate.
Partial Least Squares (PLS)
Let $y$ be a quantity to be predicted and $x$ a $p$-vector of predictors including a constant term. We consider linear combinations $w = u'x$ of the variables.
For linear regression, we choose the vector $u$ so as to minimize the mean square error of prediction, $n^{-1}\sum_i (y_i - u'x_i)^2$, which is equivalent to maximizing the correlation between $y$ and $u'x$.
For principal components regression, we choose $u$ to maximize the variance of the scores, $\mathrm{Var}(u'x)$. This defines the first PC. We can use as many as needed, but if $p$ is very large, this can use all the variables without singularity.
Partial Least Squares chooses the first component to maximize the covariance of $y$ and $w = u'x$:
$\mathrm{Cov}(y, w) = \mathrm{Cor}(y, w)\sqrt{\mathrm{Var}(y)}\sqrt{\mathrm{Var}(w)}$
Since the variance of $y$ is fixed, this is a kind of product of the criteria for PCA and linear regression.
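For centered data, maximizing $\mathrm{Cov}(y, u'x)$ over unit vectors $u$ gives a first PLS direction proportional to $X'y$. Here is a pure-Python sketch of that computation on illustrative toy data (two predictors, no intercept); the function names are not from any PLS package.

```python
import math

# First PLS direction for centered data: the unit vector u maximizing
# Cov(y, u'x) is proportional to X'y. Toy data for illustration only.

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def pls_direction(X, y):
    """X: list of centered predictor columns; y: centered response."""
    w = [sum(xj * yi for xj, yi in zip(col, y)) for col in X]
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return [wi / norm for wi in w]

x1 = center([1.0, 2.0, 3.0, 4.0])
x2 = center([1.0, 1.0, 0.0, 0.0])
y  = center([1.0, 2.0, 3.0, 4.0])   # y tracks x1 far more than x2
w = pls_direction([x1, x2], y)
print(w)  # weight on x1 dominates
```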
Partial Least Squares (PLS)
If we have 1000 variables and 100 observations, we cannot use linear regression without selecting a small number of variables. PCA can reduce the dimension, but predictions are often poor. PLS can reduce the dimension considering both prediction of y and capturing the variance in X. It can be used without variable selection, though it is often better with it. It can be used with very high-dimensional data.
Support Vector Machines
If we have two classes in p-dimensional space, then we can use logistic regression to define a projection that will be used to classify. This direction is undefined if the classification is perfect. LDA can be used in either case, but requires fitting a p by p covariance matrix. SVM finds a direction that maximizes the margin, the gap between the two classes. It only works if there is complete separation, but new variables can be defined to enlarge the space and make the margin exist.
Penalized Regression
For various reasons, the least-squares criterion may not be optimal:
The covariates may be highly correlated.
There may be more covariates than observations.
We may have too many variables in the predictor.
Sometimes adding another term to the criterion function can help with these problems. Usually, the other term(s) penalize the number or the size of the coefficients.
Penalizing the Number of Predictors
AIC (Akaike Information Criterion): AIC = 2p − 2 ln(L)
AICc (corrected for sample size): AICc = AIC + 2p(p + 1)/(n − p − 1)
BIC (Bayesian Information Criterion): BIC = p ln(n) − 2 ln(L)
These are used to select models, but they do not affect the parameter estimates of a model with a fixed number of parameters.
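The three criteria above are simple functions of the log-likelihood ln(L), the number of parameters p, and the sample size n. A pure-Python sketch, with illustrative input values:

```python
import math

# AIC, AICc, and BIC computed from their definitions.

def aic(loglik, p):
    return 2 * p - 2 * loglik

def aicc(loglik, p, n):
    return aic(loglik, p) + 2 * p * (p + 1) / (n - p - 1)

def bic(loglik, p, n):
    return p * math.log(n) - 2 * loglik

# Illustrative values only: log-likelihood -50, p = 4 parameters, n = 30.
print(aic(-50, 4), aicc(-50, 4, 30), bic(-50, 4, 30))
```

Note that AICc approaches AIC as n grows, while BIC penalizes extra parameters more heavily than AIC once n > e² ≈ 7.4.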
Ridge Regression
Ridge regression is sometimes used when there is substantial collinearity in the predictors. If $X$ is an $n$ by $p$ matrix of predictors, and $X'X$ has at least one small or zero eigenvalue, then the coefficient estimates
$\hat\beta = (X'X)^{-1}X'y$
will be unstable. The ridge regression estimates minimize, over $\beta_0$ and $\beta$, the criterion
$\sum_{i=1}^{n}(y_i - \beta_0 - x_i'\beta)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$
Since this penalizes the size of the coefficients, it will clearly shrink them towards 0:
$\hat\beta_{\mathrm{ridge}} = (X'X + \lambda I)^{-1}X'y$
Under some assumptions, the (biased) estimates from ridge regression have a smaller mean square error than those from the least squares estimates.
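With a single centered predictor and no intercept, the ridge solution above reduces to a scalar formula, which makes the shrinkage easy to see. A pure-Python sketch with illustrative data:

```python
# One-predictor ridge regression (centered data, no intercept). The
# criterion sum (y_i - x_i*b)^2 + lambda*b^2 has the closed-form solution
#   b_ridge = sum(x*y) / (sum(x^2) + lambda),
# the scalar case of (X'X + lambda*I)^-1 X'y. Data illustrative only.

def ridge_1d(x, y, lam):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x = [-1.5, -0.5, 0.5, 1.5]           # centered predictor
y = [-3.0, -1.0, 1.0, 3.0]           # y = 2x exactly
for lam in [0.0, 5.0, 50.0]:
    print(lam, ridge_1d(x, y, lam))  # slope shrinks from 2.0 toward 0
```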
The lasso and the elastic net
The lasso (Tibshirani 1996) is used when there is a preference for sparse predictors, meaning that many of them should have coefficients of exactly zero. Lasso regression estimates minimize, over $\beta_0$ and $\beta$, the criterion
$\sum_{i=1}^{n}(y_i - \beta_0 - x_i'\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$
Since this penalizes the size of the coefficients, it will also shrink them towards 0, but it does so in such a way as to force many of them exactly to zero.
The elastic net is a combined approach which can penalize both the absolute values and the squares. Elastic net regression estimates minimize, over $\beta_0$ and $\beta$, the criterion
$\sum_{i=1}^{n}(y_i - \beta_0 - x_i'\beta)^2 + \lambda P_\alpha(\beta)$
$P_\alpha(\beta) = \frac{(1-\alpha)}{2}\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}|\beta_j|$
If $\alpha = 0$, this is ridge regression. If $\alpha = 1$, this is the lasso.
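In the single-predictor case (centered data, no intercept), the elastic net criterion above can be minimized in closed form via a soft-thresholding operator, which shows directly how the absolute-value penalty produces exact zeros. A pure-Python sketch with illustrative data and penalty values:

```python
# Single-predictor elastic net (centered data, no intercept), minimizing
#   sum (y_i - x_i*b)^2 + lambda * ( (1-alpha)/2 * b^2 + alpha * |b| ).
# Setting the (sub)derivative to zero gives a soft-thresholding formula.

def soft(z, t):
    """Soft-threshold operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def enet_1d(x, y, lam, alpha):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return soft(2 * sxy, lam * alpha) / (2 * sxx + lam * (1 - alpha))

x = [-1.5, -0.5, 0.5, 1.5]
y = [-3.0, -1.0, 1.0, 3.0]          # least-squares slope is 2.0
print(enet_1d(x, y, 0.0, 1.0))      # no penalty: 2.0
print(enet_1d(x, y, 5.0, 0.0))      # alpha = 0: ridge-type shrinkage
print(enet_1d(x, y, 100.0, 1.0))    # alpha = 1, large lambda: exactly 0.0
```

The ridge part only divides the coefficient down, while the lasso part subtracts a constant before dividing, so a large enough lambda with alpha > 0 sets the coefficient exactly to zero.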
glmnet
The R package glmnet implements the elastic net. The input number of variables can be extremely large. The output number of variables with non-zero coefficients can be small if the penalty coefficient α is not 0. The parameter α must be specified (default = 1, the lasso), and the parameter λ is optimized. The built-in cross validation can be used to choose λ, though this can sometimes result in remaining bias.
Nested Cross Validation
Suppose we will use the elastic net with α = 0.5, and wish to optimize λ. If we use the built-in cross validation capacity and choose the value of λ with the best prediction score, then we can have some bias. We can use nested cross-validation to eliminate this problem.
Nested Cross Validation
Divide the data into k pieces. For i in 1:k, remove piece i as a test set; the remaining k – 1 pieces comprise the training set. Pick the best value of λ in the training set using the cross validation capacity of glmnet. Using this "optimal" λ, and the predictor estimated from the training set, predict out to the test set. This is nested cross validation because inside the training set we do another CV to find the parameter λ.
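The steps above can be sketched in pure Python. To keep it self-contained, the inner model is one-predictor ridge regression with a small λ grid rather than glmnet's elastic net; the data, grid, and fold scheme are illustrative only.

```python
# Nested cross-validation: an outer K-fold loop measures prediction error,
# and inside each outer training set an inner CV picks lambda.
# One-predictor ridge regression stands in for the elastic net here.

def ridge_1d(x, y, lam):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def cv_mse(x, y, lam, k):
    """Plain K-fold CV error for a fixed lambda (the inner CV)."""
    n = len(x)
    err = 0.0
    for i in range(k):
        tr = [j for j in range(n) if j % k != i]
        te = [j for j in range(n) if j % k == i]
        b = ridge_1d([x[j] for j in tr], [y[j] for j in tr], lam)
        err += sum((y[j] - b * x[j]) ** 2 for j in te)
    return err / n

def nested_cv_mse(x, y, grid, k):
    """Outer K-fold loop; inside each training set, pick lambda by inner CV."""
    n = len(x)
    err = 0.0
    for i in range(k):
        tr = [j for j in range(n) if j % k != i]
        te = [j for j in range(n) if j % k == i]
        xtr, ytr = [x[j] for j in tr], [y[j] for j in tr]
        lam = min(grid, key=lambda l: cv_mse(xtr, ytr, l, k))  # inner CV
        b = ridge_1d(xtr, ytr, lam)
        err += sum((y[j] - b * x[j]) ** 2 for j in te)
    return err / n

# Fixed noisy linear toy data, so the result is reproducible.
x = [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.3, -0.2, 0.1, -0.3]
y = [2.0 * xi + e for xi, e in zip(x, noise)]
print(nested_cv_mse(x, y, [0.0, 1.0, 10.0], 5))
```

The key point is that the outer test fold never influences the choice of λ, so the outer error estimate is free of the selection bias described above.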
> library(glmnet)
> isub <- diag != "OD"
> admat.net <- as.matrix(ad.data[isub,-1])
> diag.net <- as.factor(as.character(diag[isub]))
> ad.cv <- cv.glmnet(admat.net,diag.net,family="binomial")
> lmin <- ad.cv$lambda.min
> l1se <- ad.cv$lambda.1se
> ad.net <- glmnet(admat.net,diag.net,family="binomial")
> predict(ad.net,s=lmin,type="nonzero")
  X1
1 51
2 53
3 61
4 69
> predict(ad.net,s=l1se,type="nonzero")
  which
1    51
> colnames(admat.net)[predict(ad.net,s=lmin,type="nonzero")$X1]
[1] "Fibrinogen"   "GLP.1.active" "HCC.4"        "IGF.1"
> plot(cv.glmnet(admat.net,diag.net,family="binomial",type="class"))
Stuff to Try
Repeat the cross validation, plotting each time. Try this with the default (deviance), with type = "response", and with type = "class". What happens if you use a different value of α? Compare the results to the non-cross-validated elastic net.
> sum(predict(ad.net,admat.net,s=lmin,type="class") != diag.net)
[1] 18
> 18/33
[1] 0.5454545