supervised machine learning - unimi.itmarray.economia.unimi.it › 2005 › material › l7.pdf ·...

Supervised MachineSupervised MachineLearningLearning

Lecture 7Computational and Statistical Aspects

of Microarray AnalysisJune 23, 2005 Bressanone, Italy

Common Types of ObjectivesCommon Types of Objectives

• Class Comparison– Identify genes differentially expressed among

predefined classes such as diagnostic orprognostic groups.

• Class Prediction– Develop multi-gene predictor of class for a

sample using its gene expression profile• Class Discovery

– Discover clusters among specimens or amonggenes

What is the taskWhat is the task• Given the gene profile predict the class

• Mathematical representation: findfunction f that maps x to {1,…,K}

• How do we do this?

PossibilitiesPossibilities• Have expert tell us what genes to look

for being over/under expressed?• Then we do not really need microarrray

• Use clustering algorithms?• Not appropriate for this taks…

Clustering is not a good tool

A: 450 relevant genes plus 450 “noise” genes. B: 450 relevant genes.

Simulated Data with 4 clusters: 1-10, 11-20, 21-30, 31-40

Problem with clusteringProblem with clustering• Noisy genes will ruin it for the rest

• How do we know which genes to use

• We are ignoring useful information inour prototype data: We know theclasses!

Train an algorithmTrain an algorithm• A powerful approach is to train a

classification algorithm on the data wecollected and propose the use of it in thefuture

• This has successfully worked in manyareas: zip code reading, voicerecognition, etc

Clustering is not a good tool

Using multiple genesUsing multiple genes• How do we combine information from various

genes to help us form our discriminantfunction f ?

• There are many methods out there… threeexamples are LDA, kNN, SVM

• Weighted gene voting and PAM weredeveloped for microarrays (but they are just aversions of DLDA)

Weighted Gene Voting is DLDAWeighted Gene Voting is DLDA

∑=

−=

G

g g

kggk

xx

12

2)()(

σ

µδ

€

(x 1g − x 2g )ˆ σ g

2g=1

G

∑ xg −x 1g + x 2g( )

2

≥ 0

( ) 01

≥−∑=

gg

G

gg bxa

2

21

ˆ

)(

g

ggg

xxa

σ

−=

( )2

21 ggg

xxb

+=

gg

ggg

xxa

21

21

ˆˆ

)(

σσ +

−=

With equal priors, DLDA is:

With two classes we select class 1 if

This can be written as

with

Weighted Gene Voting simply uses

Notice the units and scale fore sum are wrong!

KNNKNN• Another simple and useful method is K

nearest neighbors

• It is very simple

ExampleExample

Too many genesToo many genes• A problem with most existing

approaches: They were not developedfor p>>n

• A simple way around this is to filtergenes first: Pick genes that, marginally,appear to have good predictive power

Beware of over-fittingBeware of over-fitting• With p>>n you can always find a

prediction algorithm that predictsperfectly on the training set

• Also, many algorithm can be made to metoo flexible. An example is KNN with K=1

ExampleExample

Split-Sample EvaluationSplit-Sample Evaluation• Training-set

– Used to select features, select model type, determineparameters and cut-off thresholds

• Test-set– Withheld until a single model is fully specified using the

training-set.– Fully specified model is applied to the expression profiles in

the test-set to predict class labels.– Number of errors is counted

Note: Also called cross-validation

ImportantImportant• You have apply the entire algorithm,

from scratch, on the train set

• This includes the choice of feature gene,and in some cases normalization!

ExampleExample

Number of misclassifications

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Proportion of simulated data sets

0.00

0.05

0.10

0.90

0.95

1.00

Cross-validation: none (resubstitution method)Cross-validation: after gene selectionCross-validation: prior to gene selection

Keeping yourself honestKeeping yourself honest• CV

• Try out algorithm on reshuffled data

• Try it out on completely random data

ConclusionsConclusions• Clustering algorithms not appropriate

• Do not reinvent the wheel! Many methodsavailable… but need feature selection (PAMdoes it all in one step!)

• Use cross validation to assess

• Be suspicious of new complicated methods:Simple methods are already too complicated.

Class Prediction ModelClass Prediction Model• Given a sample with an expression profile vector x of

log-ratios or log signals and unknown class, predictwhich class the sample belongs to

• The class prediction model is a function f which mapsfrom the set of vectors x to the set of class labels {1,2}(if there are two classes).

• f generally utilizes only some of the components of x(i.e. only some of the genes)

• Specifying the model f involves specifying someparameters (e.g. regression coefficients) by fitting themodel to the data (learning the data).

Do Not Confuse Statistical MethodsDo Not Confuse Statistical MethodsAppropriate for Class Comparison withAppropriate for Class Comparison withThose Appropriate for Class PredictionThose Appropriate for Class Prediction

• Demonstrating statistical significance of prognostic factors is notthe same as demonstrating predictive accuracy.

• Demonstrating goodness of fit of a model to the data used todevelop it is not a demonstration of predictive accuracy.

• Statisticians are used to inference, not prediction• Most statistical methods were not developed for p>>n prediction

problems• But some are…

Components of ClassComponents of ClassPredictionPrediction

• Feature (gene) selection– Which genes will be included in the model

• Select model type– E.g. DLDA, Nearest-Neighbor, …

• Fitting parameters (regressioncoefficients) for model

• Assessing the model

Feature SelectionFeature Selection• Key component of supervised analysis• Genes that are univariately differentially expressed

among the classes at a significance level α (e.g. 0.01)– The α level is selected to control the number of genes in the

model, not to control the false discovery rate• Methods for class prediction are different than those for class

comparison– The accuracy of the significance test used for feature

selection is not of major importance as identifyingdifferentially expressed genes is not the ultimate objective

– For survival prediction, the genes with significant univariateCox PH regression coefficients

Feature SelectionFeature Selection• Small subset of genes which together

give most accurate predictions

• Many published complex methods forselecting combinations of genes do notappear to have been properly evaluated

Linear MethodsLinear Methods• Many popular methods are linear

• Linear means that the prediction rule is alinear combination of the x– f(x) = Ax

• Some examples follow

Linear Classifiers for Two ClassesLinear Classifiers for Two Classes• Fisher linear discriminant analysis (weights based on

assumed multi-variate normal distribution ofexpression vector in each class with commoncovariance matrix)

• Diagonal linear discriminant analysis (DLDA) assumesfeatures are uncorrelated

• Compound covariate predictor and Golub’s weightedvoting method are variants of DLDA

Linear Classifiers for Two ClassesLinear Classifiers for Two Classes

• Support vector machines with innerproduct kernel are linear classifiers withweights determined to minimize errors

• Perceptrons with principal componentsas input are linear classifiers with nowell defined criterion for definingweights

Nearest Neighbor ClassifierNearest Neighbor Classifier

• To classify a sample in the validation set as beingin outcome class 1 or outcome class 2, determinewhich sample in the training set it’s geneexpression profile is most similar to.– Similarity measure used is based on genes selected as

being univariately differentially expressed between theclasses

– Correlation similarity or Euclidean distance generallyused

• Classify the sample as being in the same class asit’s nearest neighbor in the training set

• For a fixed neighborhood size, this turns out tobe linear as well

Advantages of Simple LinearAdvantages of Simple LinearClassifiersClassifiers

• Do not over-fit data– Incorporate influence of multiple variables

without attempting to select the best smallsubset of variables

– Do not attempt to model the multivariateinteractions among the predictors andoutcome

• When p>>n, a linear classifier canalmost always be found which fits thedata perfectly.

• Why consider more complex models?• The full set of linear models is too rich

and selecting a linear model to minimizetraining errors does not lead togeneralizable results when p>>n.

MythMyth• That complex classification algorithms

such as neural networks perform betterthan simpler methods for classprediction.

• Artificial intelligence sells to journal reviewersand peers who cannot distinguish hype fromsubstance when it comes to microarray dataanalysis.

• Comparative studies have shown that simplermethods work as well or better for microarrayproblems because the number of candidatepredictors exceeds the number of samples byorders of magnitude.

• Dudoit, Fridlyand and Speed JASA 2001

• Fitting complex classifiers to training dataresults in unstable models unless thetraining dataset is huge

• For unstable models, the training set errorrate is strongly downward biased as anestimate of the generalization error rate

• For unstable models, the cross-validatederror rate is a highly variable estimate of thegeneralization error rate

• Stability can be improved by– Restriction to simpler models– Include penalty for complexity if fitting criterion– Use fitting criterion incorporating robustness to

changes in data

Evaluating a ClassifierEvaluating a Classifier• “Prediction is difficult, especially the

future.”– Neils Bohr

• Fit of a model to the same data used todevelop it is no evidence of predictionaccuracy for independent data.

Leave-one-out CrossLeave-one-out CrossValidationValidation

• Leave-one-out cross-validationsimulates the process of separatelydeveloping a model on one set of dataand predicting for a test set of data notused in developing the model

training set

test set

spec

imen

s

log-expression ratios

spec

imen

s

log-expression ratios

full data set

Non-cross-validated Prediction

Cross-validated Prediction (Leave-one-out method)

1. Prediction rule is built using full data set.2. Rule is applied to each specimen for class

prediction.

1. Full data set is divided into training andtest sets (test set contains 1 specimen).

2. Prediction rule is built from scratchusing the training set.

3. Rule is applied to the specimen in thetest set for class prediction.

4. Process is repeated until each specimenhas appeared once in the test set.

Common mistakeCommon mistake• Cross-validation of a model can occur

after selecting the genes to be used inthe model

• Cross validation is only valid if the trainingset is not used in any way in thedevelopment of the model. Using thecomplete set of samples to select genesviolates this assumption and invalidatescross-validation.

• With proper cross-validation, the model mustbe developed from scratch for each leave-one-out training set. This means that geneselection must be repeated for each leave-one-out training set.

• The cross-validated estimate ofmisclassification error applies to the modelbuilding process, not to the particular modelor the particular set of genes used in themodel.

Prediction on Simulated Null DataPrediction on Simulated Null Data

Generation of Gene Expression Profiles• 20 specimens (Pi is the expression profile for specimen i)• Log-ratio measurements on 6000 genes• Pi ~ MVN(0, I6000)• Can we distinguish between the first 10 specimens (Class 1) andthe last 10 (Class 2)?

Prediction Method• Compound covariate prediction• Compound covariate built from the log-ratios of the 10 mostdifferentially expressed genes.

Incomplete (incorrect) Cross-Incomplete (incorrect) Cross-ValidationValidation

• Publications are using all the data to select genes andthen cross-validating only the parameter estimationcomponent of model development– Highly biased– Many published complex methods which make strong claims

based on incorrect cross-validation.• Frequently seen in complex feature set selection algorithms• Some software encourages inappropriate cross-validation

Incomplete (incorrect) Cross-Incomplete (incorrect) Cross-ValidationValidation

• Let M(b,D) denote a classification model developed ona set of data D where the model is of a particular typethat is parameterized by a scalar b.

• Use cross-validation to estimate the classificationerror of M(b,D) for a grid of values of b; Err(b).

• Select the value of b* that minimizes Err(b).• Caution: Err(b*) is a biased estimate of the prediction

error of M(b*,D).• This error is made in some commonly used methods

Permutation Distribution of Cross-Permutation Distribution of Cross-validated Misclassification Rate ofvalidated Misclassification Rate of

a Multivariate Classifiera Multivariate Classifier• Randomly permute class labels and repeat the

entire cross-validation• Re-do for all (or 1000) random permutations of

class labels• Permutation p value is fraction of random

permutations that gave as fewmisclassifications as in the real data

Invalid Criticisms of Cross-Invalid Criticisms of Cross-ValidationValidation

• “You can always find a set of features that willprovide perfect prediction for the training andtest sets.”– For complex models, there may be many sets of

features that provide zero training errors.– A modeling strategy that either selects among

those sets or aggregates among those models, willhave a generalization error which will be validlyestimated by cross-validation.

Potential Sources of Bias inPotential Sources of Bias inEstimation of Error RatesEstimation of Error Rates

• Confounding by sample handling or assayeffects– Design evaluation carefully

• Failure to incorporate important sources offuture variability– Assay drift

• Change in distribution of un-modeledvariables– In split sample validation, split samples by

institution

More detailsMore details……

y: phenotype (black vs white)x: expression position of the pointc: class indicators

Predictive ModelingPredictive Modeling

Goal: learn a mapping: y = f(x;θ)

Need: 1. A model structure

2. A score function3. An optimization strategy

Categorical y ∈ {c1,…,cm}: classification

Real-valued y: regressionNote: usually assume {c1,…,cm} are mutually exclusiveand exhaustive

Error RateError Rate• In general we write

• With L the loss function (usually quadratic)• When we are predicting classes the loss function is a

matrix that assigns a loss for every pair (truth,predictor)

• With only two classes this reduces to something quitesimple. What is it?

€

EPE = EX EY /X [L{Y − f (X)} | X]

Probabilistic ClassificationProbabilistic Classification

Let p(ck) = prob. that a randomly chosen object comes from ck

Objects from ck have: p(x |ck , θk) (e.g., MVN)

Then: p(ck | x ) ∝ p(x |ck , θk) p(ck)

Bayes Error Rate: dxxpxcpp kk

B )())|(max1(* ∫ −=

•Lower bound on the best possible error rate

Bayes error rate about 6%

Classifier TypesClassifier Types

Discrimination: direct mapping from x to{c1,…,cm}

- e.g. perceptron, SVM, CART

Regression: model p(ck | x )

- e.g. logistic regression, CART

Class-conditional: model p(x |ck , θk)

- e.g. “Bayesian classifiers”, LDA

Linear DiscriminantLinear DiscriminantAnalysisAnalysis

K classes, X n _ p data matrix.

p(ck | x ) ∝ p(x |ck , θk) p(ck)

Could model each class density as multivariate normal:

)()(21

212

1

||)2(

1)|(

kkT

k xx

kpk excp

µµ

π

−Σ−− −

Σ=

LDA assumes for all k. Then:Σ≡Σk

)()()(2

1

)(

)(log

)|(

)|(log 11

lkT

lkT

lkl

k

l

k xcp

cp

xcp

xcpµµµµµµ −Σ+−Σ+−= −−

This is linear in x.

Linear Discriminant AnalysisLinear Discriminant Analysis(cont.)(cont.)

It follows that the classifier should predict:

)(log2

1)( 11

kkTkk

Tk cpxx +Σ−Σ= −− µµµδ

“linear discriminant function”

If we don’t assume the Σk’s are identicial, get Quadratic DA:

)(log)()(2

1||log

2

1)( 1

kkkT

kkk cpxxx +−Σ−−Σ−= − µµδ

€

argmaxk δk (x)

Linear Discriminant AnalysisLinear Discriminant Analysis(cont.)(cont.)

Can estimate the LDA parameters via maximum likelihood:

kki

ik Nx /ˆ ∑∈

=µ

NNcp kk /)(ˆ =

)/()')((ˆ1

KNxxK

k kikiki −−−=Σ ∑∑

= ∈

µµ

LDA QDA

first linear discriminant

seco

nd li

near

dis

crim

inan

t

-5 0 5 10

-6-4

-20

24

6

s

s ss

s

s

ss

s

s

s

s

ss

ss

s

s

ss

ss

ss s

s

ss s

ss

s

ss

s s

ss

s

s s

s

s

s

s

s

s

s

s

s

ccc

c

cc

c

c

c

c

c

c

c

cc

c

cc

cc

c

cc

ccc

cc

c

c

c c

cc c

cc

c

c

cc

c

c

c

c

ccc

c

c

v

v

v

vv

v

v

v

v

v

v

v

v

v

v

vv

v

v

v

v

v

v

v

vv

vvv

vv

v

v v

v

v vv

v

vv v

v

vv

v

v

v

v

s

LDA (cont.)LDA (cont.)

•Fisher is optimal if the class are MVN with a commoncovariance matrix

•Computational complexity O(mp2n)

Logistic RegressionLogistic RegressionNote that LDA is linear in x:

)()()(2

1

)(

)(log

)|(

)|(log 0

10

10

00

µµµµµµ −Σ+−Σ+−= −−k

Tk

Tk

kk xcp

cp

xcp

xcp

xTkk αα += 0

Linear logistic regression looks the same:

xxcp

xcp Tkk

k ββ += 00 )|(

)|(log

But the estimation procedure for the coefficients is different.LDA maximizes joint likelihood [y,X]; logistic regressionmaximizes conditional likelihood [y|X]. Usually similar predictions.

Logistic Regression MLELogistic Regression MLEFor the two-class case, the likelihood is:

{ }∑=

−−+=n

iiiii xpyxpyl

1

));(1log()1();(log)( βββ

xxp

xp Tβββ

=

− );(1

);(log ))exp(1log();(log xxxp TT βββ +−=

{ }∑=

++=⇒n

i

TTi xxyl

1

))exp(1log()( βββ

The maximize need to solve (non-linear) score equations:

∑=

=−=n

iiii xpyx

d

dl

1

0));(()(

βββ

Logistic Regression ModelingLogistic Regression ModelingSouth African Heart Disease Example (y=MI)

4.1840.0100.043Age0.1360.0040.001Alcohol-1.1870.029-0.035Obesity4.1780.2250.939Famhist3.2190.0570.185ldl3.0340.0260.080Tobacco1.0230.0060.006sbp-4.2850.964-4.130InterceptZ scoreS.E.Coef.

Wald

Tree ModelsTree Models

•Easy to understand

•Can handle mixed data, missing values, etc.

•Sequential fitting method can be sub-optimal

•Usually grow a large tree and prune it back rather

than attempt to optimally stop the growing process

TrainingTrainingDatasetDataset

age income student credit_rating buys_computer<=30 high no fair no<=30 high no excellent no30…40 high no fair yes>40 medium no fair yes>40 low yes fair yes>40 low yes excellent no31…40 low yes excellent yes<=30 medium no fair no<=30 low yes fair yes>40 medium yes fair yes<=30 medium yes excellent yes31…40 medium no excellent yes31…40 high yes fair yes>40 medium no excellent no

ThisfollowsanexamplefromQuinlan’sID3

Output: A Decision Tree forOutput: A Decision Tree for““buys_computerbuys_computer””

age?

overcast

student? credit rating?

no yes fairexcellent

<=30 >40

no noyes yes

yes

30..40

Confusion matrix

Algorithm for Decision TreeAlgorithm for Decision TreeInductionInduction

• Basic algorithm (a greedy algorithm)– Tree is constructed in a top-down recursive divide-and-conquer

manner– At start, all the training examples are at the root– Attributes are categorical (if continuous-valued, they are discretized

in advance)– Examples are partitioned recursively based on selected attributes– Test attributes are selected on the basis of a heuristic or statistical

measure (e.g., information gain)• Conditions for stopping partitioning

– All samples for a given node belong to the same class– There are no remaining attributes for further partitioning – majority

voting is employed for classifying the leaf– There are no samples left

Information GainInformation Gain(ID3/C4.5)(ID3/C4.5)

• Select the attribute with the highest information gain• Assume there are two classes, P and N

– Let the set of examples S contain p elements of class P and nelements of class N

– The amount of information, needed to decide if an arbitraryexample in S belongs to P or N is defined as

npn

npn

npp

npp

npI++

−++

−= 22 loglog),(

e.g. I(0.5,0.5)=1; I(0.9,0.1)=0.47; I(0.99,0.01)=0.08;

Information Gain inInformation Gain inDecision Tree InductionDecision Tree Induction

• Assume that using attribute A a set S will bepartitioned into sets {S1, S2 , …, Sv}– If Si contains pi examples of P and ni examples of N, the

entropy, or the expected information needed to classifyobjects in all subtrees Si is

• The encoding information that would be gained bybranching on A

∑= +

+=

ν

1),()(

iii

ii npInpnp

AE

)(),()( AEnpIAGain −=

Attribute Selection byAttribute Selection byInformation Gain ComputationInformation Gain Computationγ Class P: buys_computer =

“yes”γ Class N: buys_computer =

“no”γ I(p, n) = I(9, 5) =0.940γ Compute the entropy for age: Hence

Similarlyage pi ni I(pi, ni)

<=30 2 3 0.97130…40 4 0 0>40 3 2 0.971

694.0)2,3(14

5

)0,4(14

4)3,2(

14

5)(

=+

+=

I

IIageE

048.0)_(151.0)(029.0)(

===

ratingcreditGainstudentGainincomeGain

246.0

)(),()(

=

−= ageEnpIageGain

GiniGini Index (IBMIndex (IBMIntelligentMinerIntelligentMiner))

• If a data set T contains examples from nclasses, gini index, gini(T) is defined as

where pj is the relative frequency of class j in T.• If a data set T is split into two subsets T1 and T2

with sizes N1 and N2 respectively, the gini indexof the split data contains examples from nclasses, the gini index gini(T) is defined as

• The attribute provides the smallest ginisplit(T) ischosen to split the node

∑=

−=n

jp jTgini1

21)(

)()()( 22

11 Tgini

NN

TginiNNTginisplit +=

Avoid Avoid Overfitting Overfitting ininClassificationClassification

• The generated tree may overfit the training data– Too many branches, some may reflect anomalies due to

noise or outliers– Result is in poor accuracy for unseen samples

• Two approaches to avoid overfitting– Prepruning: Halt tree construction early—do not split a

node if this would result in the goodness measure fallingbelow a threshold

• Difficult to choose an appropriate threshold– Postpruning: Remove branches from a “fully grown”

tree—get a sequence of progressively pruned trees• Use a set of data different from the training data to

decide which is the “best pruned tree”

Approaches to DetermineApproaches to Determinethe Final Tree Sizethe Final Tree Size

• Separate training (2/3) and testing (1/3)sets

• Use cross validation, e.g., 10-fold crossvalidation

• Use minimum description length (MDL)principle:– halting growth of the tree when the

encoding is minimized

Nearest Neighbor MethodsNearest Neighbor Methods

•k-NN assigns an unknown object to the most common class ofits k nearest neighbors

•Choice of k? (bias-variance tradeoff again)

•Choice of metric?

•Need all the training to be present to classify a new point (“lazymethods”)

•Surprisingly strong asymptotic results (e.g. no decision rule ismore than twice as accurate as 1-NN)

Flexible Metric NNFlexible Metric NNClassificationClassification

NaNaïïve Bayes Classificationve Bayes ClassificationRecall: p(ck |x) ∝ p(x| ck)p(ck)

Now suppose:

Then:

Equivalently:

C

x1 x2 xp…

∏=

∝p

jkjkk cxpcpxcp

1

)|()()|(

∑¬¬¬

+=)|(

)|(

)(

)(log

)|(

)|(log

kj

kj

k

k

k

k

cxp

cxp

cp

cp

xcp

xcp

“weights ofevidence”

Evidence Balance SheetEvidence Balance Sheet

NaNaïïve Bayes (cont.)ve Bayes (cont.)

•Despite the crude conditional independence assumption, works

well in practice (see Friedman, 1997 for a partial explanation)

•Can be further enhanced with boosting, bagging, model

averaging, etc.

•Can relax the conditional independence assumptions in myriad

ways (“Bayesian networks”)

Dietterich (1999)

Analysis of 33 UCI datasets

An ExampleAn Example

Gene-Expression Profiles inHereditary Breast Cancer

• Breast tumors studied:7 BRCA1+ tumors8 BRCA2+ tumors7 sporadic tumors

• Log-ratios measurements of3226 genes for each tumorafter initial data filtering

cDNA MicroarraysParallel Gene Expression Analysis

RESEARCH QUESTIONCan we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ fromBRCA2– cancers based solely on their gene expression profiles?

BRCA1BRCA1

?? g

# of

significant genes

# of misclassified

samples (m)

% of random permutations with

m or fewer misclassifications

10-2 182 3 0.4 10-3 53 2 1.0 10-4 9 1 0.2

BRCA2BRCA2

?? g

# of significant

genes

m = # of misclassified elements

(misclassified samples)

% of random permutations with m

or fewer misclassifications

10-2 212 4 (s11900, s14486, s14572, s14324) 0.8 10-3 49 3 (s11900, s14486, s14324) 2.2 10-4 11 4 (s11900, s1 4486, s14616, s14324) 6.6

Classification of BRCA2 Classification of BRCA2 GermlineGermlineMutationsMutations

45%Classification Tree

18%Support Vector Machine (linear kernel)

23%3-Nearest Neighbor

9%1-Nearest Neighbor

14%Diagonal LDA

36%Fisher LDA

14%Compound Covariate Predictor

LOOCV PredictionError

Classification Method

supervised machine learning - unimi.itmarray.economia.unimi.it › 2005 › material › l7.pdf ·...

Documents