supervised machine learning - unimi.itmarray.economia.unimi.it › 2005 › material › l7.pdf ·...
TRANSCRIPT
Supervised MachineSupervised MachineLearningLearning
Lecture 7Computational and Statistical Aspects
of Microarray AnalysisJune 23, 2005 Bressanone, Italy
Common Types of ObjectivesCommon Types of Objectives
• Class Comparison– Identify genes differentially expressed among
predefined classes such as diagnostic orprognostic groups.
• Class Prediction– Develop multi-gene predictor of class for a
sample using its gene expression profile• Class Discovery
– Discover clusters among specimens or amonggenes
What is the taskWhat is the task• Given the gene profile predict the class
• Mathematical representation: findfunction f that maps x to {1,…,K}
• How do we do this?
PossibilitiesPossibilities• Have expert tell us what genes to look
for being over/under expressed?• Then we do not really need microarrray
• Use clustering algorithms?• Not appropriate for this taks…
Clustering is not a good tool
A: 450 relevant genes plus 450 “noise” genes. B: 450 relevant genes.
Simulated Data with 4 clusters: 1-10, 11-20, 21-30, 31-40
Problem with clusteringProblem with clustering• Noisy genes will ruin it for the rest
• How do we know which genes to use
• We are ignoring useful information inour prototype data: We know theclasses!
Train an algorithmTrain an algorithm• A powerful approach is to train a
classification algorithm on the data wecollected and propose the use of it in thefuture
• This has successfully worked in manyareas: zip code reading, voicerecognition, etc
Clustering is not a good tool
Using multiple genesUsing multiple genes• How do we combine information from various
genes to help us form our discriminantfunction f ?
• There are many methods out there… threeexamples are LDA, kNN, SVM
• Weighted gene voting and PAM weredeveloped for microarrays (but they are just aversions of DLDA)
Weighted Gene Voting is DLDAWeighted Gene Voting is DLDA
∑=
−=
G
g g
kggk
xx
12
2)()(
σ
µδ
€
(x 1g − x 2g )ˆ σ g
2g=1
G
∑ xg −x 1g + x 2g( )
2
≥ 0
( ) 01
≥−∑=
gg
G
gg bxa
2
21
ˆ
)(
g
ggg
xxa
σ
−=
( )2
21 ggg
xxb
+=
gg
ggg
xxa
21
21
ˆˆ
)(
σσ +
−=
With equal priors, DLDA is:
With two classes we select class 1 if
This can be written as
with
Weighted Gene Voting simply uses
Notice the units and scale fore sum are wrong!
KNNKNN• Another simple and useful method is K
nearest neighbors
• It is very simple
ExampleExample
Too many genesToo many genes• A problem with most existing
approaches: They were not developedfor p>>n
• A simple way around this is to filtergenes first: Pick genes that, marginally,appear to have good predictive power
Beware of over-fittingBeware of over-fitting• With p>>n you can always find a
prediction algorithm that predictsperfectly on the training set
• Also, many algorithm can be made to metoo flexible. An example is KNN with K=1
ExampleExample
Split-Sample EvaluationSplit-Sample Evaluation• Training-set
– Used to select features, select model type, determineparameters and cut-off thresholds
• Test-set– Withheld until a single model is fully specified using the
training-set.– Fully specified model is applied to the expression profiles in
the test-set to predict class labels.– Number of errors is counted
Note: Also called cross-validation
ImportantImportant• You have apply the entire algorithm,
from scratch, on the train set
• This includes the choice of feature gene,and in some cases normalization!
ExampleExample
Number of misclassifications
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Proportion of simulated data sets
0.00
0.05
0.10
0.90
0.95
1.00
Cross-validation: none (resubstitution method)Cross-validation: after gene selectionCross-validation: prior to gene selection
Keeping yourself honestKeeping yourself honest• CV
• Try out algorithm on reshuffled data
• Try it out on completely random data
ConclusionsConclusions• Clustering algorithms not appropriate
• Do not reinvent the wheel! Many methodsavailable… but need feature selection (PAMdoes it all in one step!)
• Use cross validation to assess
• Be suspicious of new complicated methods:Simple methods are already too complicated.
Class Prediction ModelClass Prediction Model• Given a sample with an expression profile vector x of
log-ratios or log signals and unknown class, predictwhich class the sample belongs to
• The class prediction model is a function f which mapsfrom the set of vectors x to the set of class labels {1,2}(if there are two classes).
• f generally utilizes only some of the components of x(i.e. only some of the genes)
• Specifying the model f involves specifying someparameters (e.g. regression coefficients) by fitting themodel to the data (learning the data).
Do Not Confuse Statistical MethodsDo Not Confuse Statistical MethodsAppropriate for Class Comparison withAppropriate for Class Comparison withThose Appropriate for Class PredictionThose Appropriate for Class Prediction
• Demonstrating statistical significance of prognostic factors is notthe same as demonstrating predictive accuracy.
• Demonstrating goodness of fit of a model to the data used todevelop it is not a demonstration of predictive accuracy.
• Statisticians are used to inference, not prediction• Most statistical methods were not developed for p>>n prediction
problems• But some are…
Components of ClassComponents of ClassPredictionPrediction
• Feature (gene) selection– Which genes will be included in the model
• Select model type– E.g. DLDA, Nearest-Neighbor, …
• Fitting parameters (regressioncoefficients) for model
• Assessing the model
Feature SelectionFeature Selection• Key component of supervised analysis• Genes that are univariately differentially expressed
among the classes at a significance level α (e.g. 0.01)– The α level is selected to control the number of genes in the
model, not to control the false discovery rate• Methods for class prediction are different than those for class
comparison– The accuracy of the significance test used for feature
selection is not of major importance as identifyingdifferentially expressed genes is not the ultimate objective
– For survival prediction, the genes with significant univariateCox PH regression coefficients
Feature SelectionFeature Selection• Small subset of genes which together
give most accurate predictions
• Many published complex methods forselecting combinations of genes do notappear to have been properly evaluated
Linear MethodsLinear Methods• Many popular methods are linear
• Linear means that the prediction rule is alinear combination of the x– f(x) = Ax
• Some examples follow
Linear Classifiers for Two ClassesLinear Classifiers for Two Classes• Fisher linear discriminant analysis (weights based on
assumed multi-variate normal distribution ofexpression vector in each class with commoncovariance matrix)
• Diagonal linear discriminant analysis (DLDA) assumesfeatures are uncorrelated
• Compound covariate predictor and Golub’s weightedvoting method are variants of DLDA
Linear Classifiers for Two ClassesLinear Classifiers for Two Classes
• Support vector machines with innerproduct kernel are linear classifiers withweights determined to minimize errors
• Perceptrons with principal componentsas input are linear classifiers with nowell defined criterion for definingweights
Nearest Neighbor ClassifierNearest Neighbor Classifier
• To classify a sample in the validation set as beingin outcome class 1 or outcome class 2, determinewhich sample in the training set it’s geneexpression profile is most similar to.– Similarity measure used is based on genes selected as
being univariately differentially expressed between theclasses
– Correlation similarity or Euclidean distance generallyused
• Classify the sample as being in the same class asit’s nearest neighbor in the training set
• For a fixed neighborhood size, this turns out tobe linear as well
Advantages of Simple LinearAdvantages of Simple LinearClassifiersClassifiers
• Do not over-fit data– Incorporate influence of multiple variables
without attempting to select the best smallsubset of variables
– Do not attempt to model the multivariateinteractions among the predictors andoutcome
• When p>>n, a linear classifier canalmost always be found which fits thedata perfectly.
• Why consider more complex models?• The full set of linear models is too rich
and selecting a linear model to minimizetraining errors does not lead togeneralizable results when p>>n.
MythMyth• That complex classification algorithms
such as neural networks perform betterthan simpler methods for classprediction.
• Artificial intelligence sells to journal reviewersand peers who cannot distinguish hype fromsubstance when it comes to microarray dataanalysis.
• Comparative studies have shown that simplermethods work as well or better for microarrayproblems because the number of candidatepredictors exceeds the number of samples byorders of magnitude.
• Dudoit, Fridlyand and Speed JASA 2001
• Fitting complex classifiers to training dataresults in unstable models unless thetraining dataset is huge
• For unstable models, the training set errorrate is strongly downward biased as anestimate of the generalization error rate
• For unstable models, the cross-validatederror rate is a highly variable estimate of thegeneralization error rate
• Stability can be improved by– Restriction to simpler models– Include penalty for complexity if fitting criterion– Use fitting criterion incorporating robustness to
changes in data
Evaluating a ClassifierEvaluating a Classifier• “Prediction is difficult, especially the
future.”– Neils Bohr
• Fit of a model to the same data used todevelop it is no evidence of predictionaccuracy for independent data.
Leave-one-out CrossLeave-one-out CrossValidationValidation
• Leave-one-out cross-validationsimulates the process of separatelydeveloping a model on one set of dataand predicting for a test set of data notused in developing the model
training set
test set
spec
imen
s
log-expression ratios
spec
imen
s
log-expression ratios
full data set
Non-cross-validated Prediction
Cross-validated Prediction (Leave-one-out method)
1. Prediction rule is built using full data set.2. Rule is applied to each specimen for class
prediction.
1. Full data set is divided into training andtest sets (test set contains 1 specimen).
2. Prediction rule is built from scratchusing the training set.
3. Rule is applied to the specimen in thetest set for class prediction.
4. Process is repeated until each specimenhas appeared once in the test set.
Common mistakeCommon mistake• Cross-validation of a model can occur
after selecting the genes to be used inthe model
• Cross validation is only valid if the trainingset is not used in any way in thedevelopment of the model. Using thecomplete set of samples to select genesviolates this assumption and invalidatescross-validation.
• With proper cross-validation, the model mustbe developed from scratch for each leave-one-out training set. This means that geneselection must be repeated for each leave-one-out training set.
• The cross-validated estimate ofmisclassification error applies to the modelbuilding process, not to the particular modelor the particular set of genes used in themodel.
Prediction on Simulated Null DataPrediction on Simulated Null Data
Generation of Gene Expression Profiles• 20 specimens (Pi is the expression profile for specimen i)• Log-ratio measurements on 6000 genes• Pi ~ MVN(0, I6000)• Can we distinguish between the first 10 specimens (Class 1) andthe last 10 (Class 2)?
Prediction Method• Compound covariate prediction• Compound covariate built from the log-ratios of the 10 mostdifferentially expressed genes.
Incomplete (incorrect) Cross-Incomplete (incorrect) Cross-ValidationValidation
• Publications are using all the data to select genes andthen cross-validating only the parameter estimationcomponent of model development– Highly biased– Many published complex methods which make strong claims
based on incorrect cross-validation.• Frequently seen in complex feature set selection algorithms• Some software encourages inappropriate cross-validation
Incomplete (incorrect) Cross-Incomplete (incorrect) Cross-ValidationValidation
• Let M(b,D) denote a classification model developed ona set of data D where the model is of a particular typethat is parameterized by a scalar b.
• Use cross-validation to estimate the classificationerror of M(b,D) for a grid of values of b; Err(b).
• Select the value of b* that minimizes Err(b).• Caution: Err(b*) is a biased estimate of the prediction
error of M(b*,D).• This error is made in some commonly used methods
Permutation Distribution of Cross-Permutation Distribution of Cross-validated Misclassification Rate ofvalidated Misclassification Rate of
a Multivariate Classifiera Multivariate Classifier• Randomly permute class labels and repeat the
entire cross-validation• Re-do for all (or 1000) random permutations of
class labels• Permutation p value is fraction of random
permutations that gave as fewmisclassifications as in the real data
Invalid Criticisms of Cross-Invalid Criticisms of Cross-ValidationValidation
• “You can always find a set of features that willprovide perfect prediction for the training andtest sets.”– For complex models, there may be many sets of
features that provide zero training errors.– A modeling strategy that either selects among
those sets or aggregates among those models, willhave a generalization error which will be validlyestimated by cross-validation.
Potential Sources of Bias inPotential Sources of Bias inEstimation of Error RatesEstimation of Error Rates
• Confounding by sample handling or assayeffects– Design evaluation carefully
• Failure to incorporate important sources offuture variability– Assay drift
• Change in distribution of un-modeledvariables– In split sample validation, split samples by
institution
More detailsMore details……
y: phenotype (black vs white)x: expression position of the pointc: class indicators
Predictive ModelingPredictive Modeling
Goal: learn a mapping: y = f(x;θ)
Need: 1. A model structure
2. A score function3. An optimization strategy
Categorical y ∈ {c1,…,cm}: classification
Real-valued y: regressionNote: usually assume {c1,…,cm} are mutually exclusiveand exhaustive
Error RateError Rate• In general we write
• With L the loss function (usually quadratic)• When we are predicting classes the loss function is a
matrix that assigns a loss for every pair (truth,predictor)
• With only two classes this reduces to something quitesimple. What is it?
€
EPE = EX EY /X [L{Y − f (X)} | X]
Probabilistic ClassificationProbabilistic Classification
Let p(ck) = prob. that a randomly chosen object comes from ck
Objects from ck have: p(x |ck , θk) (e.g., MVN)
Then: p(ck | x ) ∝ p(x |ck , θk) p(ck)
Bayes Error Rate: dxxpxcpp kk
B )())|(max1(* ∫ −=
•Lower bound on the best possible error rate
Bayes error rate about 6%
Classifier TypesClassifier Types
Discrimination: direct mapping from x to{c1,…,cm}
- e.g. perceptron, SVM, CART
Regression: model p(ck | x )
- e.g. logistic regression, CART
Class-conditional: model p(x |ck , θk)
- e.g. “Bayesian classifiers”, LDA
Linear DiscriminantLinear DiscriminantAnalysisAnalysis
K classes, X n _ p data matrix.
p(ck | x ) ∝ p(x |ck , θk) p(ck)
Could model each class density as multivariate normal:
)()(21
212
1
||)2(
1)|(
kkT
k xx
kpk excp
µµ
π
−Σ−− −
Σ=
LDA assumes for all k. Then:Σ≡Σk
)()()(2
1
)(
)(log
)|(
)|(log 11
lkT
lkT
lkl
k
l
k xcp
cp
xcp
xcpµµµµµµ −Σ+−Σ+−= −−
This is linear in x.
Linear Discriminant AnalysisLinear Discriminant Analysis(cont.)(cont.)
It follows that the classifier should predict:
)(log2
1)( 11
kkTkk
Tk cpxx +Σ−Σ= −− µµµδ
“linear discriminant function”
If we don’t assume the Σk’s are identicial, get Quadratic DA:
)(log)()(2
1||log
2
1)( 1
kkkT
kkk cpxxx +−Σ−−Σ−= − µµδ
€
argmaxk δk (x)
Linear Discriminant AnalysisLinear Discriminant Analysis(cont.)(cont.)
Can estimate the LDA parameters via maximum likelihood:
kki
ik Nx /ˆ ∑∈
=µ
NNcp kk /)(ˆ =
)/()')((ˆ1
KNxxK
k kikiki −−−=Σ ∑∑
= ∈
µµ
LDA QDA
first linear discriminant
seco
nd li
near
dis
crim
inan
t
-5 0 5 10
-6-4
-20
24
6
s
s ss
s
s
ss
s
s
s
s
ss
ss
s
s
ss
ss
ss s
s
ss s
ss
s
ss
s s
ss
s
s s
s
s
s
s
s
s
s
s
s
ccc
c
cc
c
c
c
c
c
c
c
cc
c
cc
cc
c
cc
ccc
cc
c
c
c c
cc c
cc
c
c
cc
c
c
c
c
ccc
c
c
v
v
v
vv
v
v
v
v
v
v
v
v
v
v
vv
v
v
v
v
v
v
v
vv
vvv
vv
v
v v
v
v vv
v
vv v
v
vv
v
v
v
v
s
LDA (cont.)LDA (cont.)
•Fisher is optimal if the class are MVN with a commoncovariance matrix
•Computational complexity O(mp2n)
Logistic RegressionLogistic RegressionNote that LDA is linear in x:
)()()(2
1
)(
)(log
)|(
)|(log 0
10
10
00
µµµµµµ −Σ+−Σ+−= −−k
Tk
Tk
kk xcp
cp
xcp
xcp
xTkk αα += 0
Linear logistic regression looks the same:
xxcp
xcp Tkk
k ββ += 00 )|(
)|(log
But the estimation procedure for the coefficients is different.LDA maximizes joint likelihood [y,X]; logistic regressionmaximizes conditional likelihood [y|X]. Usually similar predictions.
Logistic Regression MLELogistic Regression MLEFor the two-class case, the likelihood is:
{ }∑=
−−+=n
iiiii xpyxpyl
1
));(1log()1();(log)( βββ
xxp
xp Tβββ
=
− );(1
);(log ))exp(1log();(log xxxp TT βββ +−=
{ }∑=
++=⇒n
i
TTi xxyl
1
))exp(1log()( βββ
The maximize need to solve (non-linear) score equations:
∑=
=−=n
iiii xpyx
d
dl
1
0));(()(
βββ
Logistic Regression ModelingLogistic Regression ModelingSouth African Heart Disease Example (y=MI)
4.1840.0100.043Age0.1360.0040.001Alcohol-1.1870.029-0.035Obesity4.1780.2250.939Famhist3.2190.0570.185ldl3.0340.0260.080Tobacco1.0230.0060.006sbp-4.2850.964-4.130InterceptZ scoreS.E.Coef.
Wald
Tree ModelsTree Models
•Easy to understand
•Can handle mixed data, missing values, etc.
•Sequential fitting method can be sub-optimal
•Usually grow a large tree and prune it back rather
than attempt to optimally stop the growing process
TrainingTrainingDatasetDataset
age income student credit_rating buys_computer<=30 high no fair no<=30 high no excellent no30…40 high no fair yes>40 medium no fair yes>40 low yes fair yes>40 low yes excellent no31…40 low yes excellent yes<=30 medium no fair no<=30 low yes fair yes>40 medium yes fair yes<=30 medium yes excellent yes31…40 medium no excellent yes31…40 high yes fair yes>40 medium no excellent no
ThisfollowsanexamplefromQuinlan’sID3
Output: A Decision Tree forOutput: A Decision Tree for““buys_computerbuys_computer””
age?
overcast
student? credit rating?
no yes fairexcellent
<=30 >40
no noyes yes
yes
30..40
Confusion matrix
Algorithm for Decision TreeAlgorithm for Decision TreeInductionInduction
• Basic algorithm (a greedy algorithm)– Tree is constructed in a top-down recursive divide-and-conquer
manner– At start, all the training examples are at the root– Attributes are categorical (if continuous-valued, they are discretized
in advance)– Examples are partitioned recursively based on selected attributes– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)• Conditions for stopping partitioning
– All samples for a given node belong to the same class– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf– There are no samples left
Information GainInformation Gain(ID3/C4.5)(ID3/C4.5)
• Select the attribute with the highest information gain• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and nelements of class N
– The amount of information, needed to decide if an arbitraryexample in S belongs to P or N is defined as
npn
npn
npp
npp
npI++
−++
−= 22 loglog),(
e.g. I(0.5,0.5)=1; I(0.9,0.1)=0.47; I(0.99,0.01)=0.08;
Information Gain inInformation Gain inDecision Tree InductionDecision Tree Induction
• Assume that using attribute A a set S will bepartitioned into sets {S1, S2 , …, Sv}– If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classifyobjects in all subtrees Si is
• The encoding information that would be gained bybranching on A
∑= +
+=
ν
1),()(
iii
ii npInpnp
AE
)(),()( AEnpIAGain −=
Attribute Selection byAttribute Selection byInformation Gain ComputationInformation Gain Computationγ Class P: buys_computer =
“yes”γ Class N: buys_computer =
“no”γ I(p, n) = I(9, 5) =0.940γ Compute the entropy for age: Hence
Similarlyage pi ni I(pi, ni)
<=30 2 3 0.97130…40 4 0 0>40 3 2 0.971
694.0)2,3(14
5
)0,4(14
4)3,2(
14
5)(
=+
+=
I
IIageE
048.0)_(151.0)(029.0)(
===
ratingcreditGainstudentGainincomeGain
246.0
)(),()(
=
−= ageEnpIageGain
GiniGini Index (IBMIndex (IBMIntelligentMinerIntelligentMiner))
• If a data set T contains examples from nclasses, gini index, gini(T) is defined as
where pj is the relative frequency of class j in T.• If a data set T is split into two subsets T1 and T2
with sizes N1 and N2 respectively, the gini indexof the split data contains examples from nclasses, the gini index gini(T) is defined as
• The attribute provides the smallest ginisplit(T) ischosen to split the node
∑=
−=n
jp jTgini1
21)(
)()()( 22
11 Tgini
NN
TginiNNTginisplit +=
Avoid Avoid Overfitting Overfitting ininClassificationClassification
• The generated tree may overfit the training data– Too many branches, some may reflect anomalies due to
noise or outliers– Result is in poor accuracy for unseen samples
• Two approaches to avoid overfitting– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure fallingbelow a threshold
• Difficult to choose an appropriate threshold– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees• Use a set of data different from the training data to
decide which is the “best pruned tree”
Approaches to DetermineApproaches to Determinethe Final Tree Sizethe Final Tree Size
• Separate training (2/3) and testing (1/3)sets
• Use cross validation, e.g., 10-fold crossvalidation
• Use minimum description length (MDL)principle:– halting growth of the tree when the
encoding is minimized
Nearest Neighbor MethodsNearest Neighbor Methods
•k-NN assigns an unknown object to the most common class ofits k nearest neighbors
•Choice of k? (bias-variance tradeoff again)
•Choice of metric?
•Need all the training to be present to classify a new point (“lazymethods”)
•Surprisingly strong asymptotic results (e.g. no decision rule ismore than twice as accurate as 1-NN)
Flexible Metric NNFlexible Metric NNClassificationClassification
NaNaïïve Bayes Classificationve Bayes ClassificationRecall: p(ck |x) ∝ p(x| ck)p(ck)
Now suppose:
Then:
Equivalently:
C
x1 x2 xp…
∏=
∝p
jkjkk cxpcpxcp
1
)|()()|(
∑¬¬¬
+=)|(
)|(
)(
)(log
)|(
)|(log
kj
kj
k
k
k
k
cxp
cxp
cp
cp
xcp
xcp
“weights ofevidence”
Evidence Balance SheetEvidence Balance Sheet
NaNaïïve Bayes (cont.)ve Bayes (cont.)
•Despite the crude conditional independence assumption, works
well in practice (see Friedman, 1997 for a partial explanation)
•Can be further enhanced with boosting, bagging, model
averaging, etc.
•Can relax the conditional independence assumptions in myriad
ways (“Bayesian networks”)
Dietterich (1999)
Analysis of 33 UCI datasets
An ExampleAn Example
Gene-Expression Profiles inHereditary Breast Cancer
• Breast tumors studied:7 BRCA1+ tumors8 BRCA2+ tumors7 sporadic tumors
• Log-ratios measurements of3226 genes for each tumorafter initial data filtering
cDNA MicroarraysParallel Gene Expression Analysis
RESEARCH QUESTIONCan we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ fromBRCA2– cancers based solely on their gene expression profiles?
BRCA1BRCA1
?? g
# of
significant genes
# of misclassified
samples (m)
% of random permutations with
m or fewer misclassifications
10-2 182 3 0.4 10-3 53 2 1.0 10-4 9 1 0.2
BRCA2BRCA2
?? g
# of significant
genes
m = # of misclassified elements
(misclassified samples)
% of random permutations with m
or fewer misclassifications
10-2 212 4 (s11900, s14486, s14572, s14324) 0.8 10-3 49 3 (s11900, s14486, s14324) 2.2 10-4 11 4 (s11900, s1 4486, s14616, s14324) 6.6
Classification of BRCA2 Classification of BRCA2 GermlineGermlineMutationsMutations
45%Classification Tree
18%Support Vector Machine (linear kernel)
23%3-Nearest Neighbor
9%1-Nearest Neighbor
14%Diagonal LDA
36%Fisher LDA
14%Compound Covariate Predictor
LOOCV PredictionError
Classification Method