Classification with microarray data
What to expect from Aron today
What is classification, and why
How to do classification with microarray data
After lunch: case study.
Then, classification exercises.
[Figure: microarray analysis workflow. Boxes: Question/hypothesis; Experimental Design; Array design / Probe design; Buy standard Chip / Array; Sample Preparation / Hybridization; Image analysis; Normalization; Expression Index Calculation; Comparable Gene Expression Data; Statistical Analysis / Fit to Model (time series); Advanced Data Analysis (Clustering, PCA, Gene Annotation Analysis, Promoter Analysis, Classification, Meta analysis, Survival analysis, Regulatory Network).]
What is classification?
Given an object with a set of features (input data), a classifier assigns the object to a class.

object: Aron
features: age 34, sex male, income $$
classes: Obama, McCain

classifier algorithm:
if (age > 40) AND (income > $$) then vote = McCain else vote = Obama

A classifier is simply a set of rules or an algorithm.
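The rule above can be sketched in a few lines. This is only an illustration in Python (the slides themselves point to R); the function name `classify_voter` and the encoding of income as a count of "$" symbols are assumptions made for the example.

```python
def classify_voter(age, income_level):
    """Apply the slide's rule: age > 40 AND income > $$ gives McCain, else Obama."""
    # income_level is the number of "$" symbols; 2 stands in for "$$" (assumed encoding).
    if age > 40 and income_level > 2:
        return "McCain"
    return "Obama"


# The object from the slide: Aron, age 34, income $$.
aron = {"age": 34, "sex": "male", "income": "$$"}
print(classify_voter(aron["age"], len(aron["income"])))
```

Note that the rule ignores the feature "sex": a classifier does not have to use every feature it is given.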
Why is classification useful?
1. Classification as an end in itself
– diagnosis of disease
– predict response to chemotherapy
– identify bacterial strains
2. Learn from the classifier
– identify drug targets

classifier algorithm:
if (age > 40) AND (income > $$) then vote = republican else vote = democrat
The challenge
Given a training set (data of known class), design a classifier that accurately predicts the class of novel data.
         x01     x02     x03     x04     x05     x06     x07
age      43      65      34      22      44      29      57
sex      female  male    male    female  female  male    male
income   $$$$    $$$$    $$$$$   $       $$$$$   $$      $$$
vote     Obama   McCain  Obama   McCain  Obama   McCain  ???

Training set: x01 to x06. Test: x07.
This is also called statistical learning or machine learning.
Classification as mapping
A classifier based on the expression levels of two genes, G1 and G2:
[Figure: training data set plotted on G1 and G2 axes; points are labeled Obama, McCain, or unknown.]
>>> How do we do this algorithmically?
linear vs. nonlinear
multiple dimensions...?
Recipe for classification
1. Choose a classification method
2. Feature selection / dimension reduction
3. Train the classifier
4. Assess classifier performance
1. Choose a classification method
linear:
– Linear Discriminant Analysis (LDA)
– Nearest Centroid
nonlinear:
– k Nearest Neighbors (kNN)
– Artificial Neural Network (ANN)
– Support Vector Machine (SVM)
Linear Discriminant Analysis (LDA)
Find a line / plane / hyperplane that separates the classes.
[Figure: samples plotted on G1 and G2 axes.]
assumptions:
• data in each class are sampled from a normal distribution
• variance/covariance is the same in each class
R: lda (MASS package)
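A minimal sketch of two-class LDA for two genes, written in plain Python rather than with the R function named above. It assumes equal class priors and uses the pooled covariance, in line with the equal-covariance assumption. The function names and the toy expression values are made up for illustration.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter(points, mu):
    # 2x2 matrix of summed outer products of deviations from the class mean.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(class_a, class_b):
    """Return (w, c): predict class A when w . x > c (equal priors assumed)."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    dof = len(class_a) + len(class_b) - 2
    # Pooled covariance: one shared covariance matrix for both classes.
    cov = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2]
    c = w[0] * mid[0] + w[1] * mid[1]
    return w, c


def predict_lda(model, x):
    w, c = model
    return "A" if w[0] * x[0] + w[1] * x[1] > c else "B"


# Toy (G1, G2) expression values, invented for the example.
class_a = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class_b = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]
model = fit_lda(class_a, class_b)
print(predict_lda(model, (2.5, 2.5)), predict_lda(model, (6.5, 6.0)))
```

The decision boundary is the line w . x = c, which passes through the midpoint between the two class means.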
Nearest centroid
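Calculate the centroid (mean expression profile) of each class. Assign a test case to the class with the nearest centroid.
[Figure: samples and class centroids plotted on G1 and G2 axes.]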
Similar to LDA.
Can be extended to multiple (n > 2) classes.
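A small sketch of a nearest-centroid classifier in plain Python, here with three classes to show the extension beyond two. The function names, class labels, and data values are invented for illustration; Euclidean distance is an assumed choice.

```python
import math


def fit_centroids(samples, labels):
    """Return a dict mapping each class label to its centroid (mean vector)."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(centroids, x):
    # Pick the class whose centroid is closest in Euclidean distance.
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


samples = [(1, 1), (2, 2), (8, 1), (9, 2), (5, 8), (6, 9)]
labels = ["A", "A", "B", "B", "C", "C"]
centroids = fit_centroids(samples, labels)
print(predict_centroid(centroids, (1.5, 2.0)))
print(predict_centroid(centroids, (5.0, 7.0)))
```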
k Nearest Neighbors (kNN)
For a test case, find the k nearest samples in the training set, and let them vote.
[Figure: samples plotted on G1 and G2 axes; example with k=3.]
• need to choose k (hyperparameter)
• with two classes, k should be odd to avoid tied votes
• need to choose distance metric
No real “learning” involved - the training set defines the classifier.
R: knn (class package)
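A minimal kNN sketch in plain Python (not the R function named above). Euclidean distance is an assumed choice of metric, and the function name and toy data are invented for the example.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Let the k nearest training samples vote on the class of x."""
    # Sort training indices by distance to the test case.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_x = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
train_y = ["Obama", "Obama", "Obama", "McCain", "McCain", "McCain"]
print(knn_predict(train_x, train_y, (2, 2), k=3))
print(knn_predict(train_x, train_y, (5, 5), k=3))
```

There is no fitting step: the training set is stored and consulted at prediction time.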
Artificial Neural Network (ANN)
[Figure: network diagram with inputs Gene1, Gene2, Gene3, ... and outputs Class A and Class B.]
Each “neuron” is simply a function (usually nonlinear).
The “network” in the figure just represents a series of nested functions.
Interpretation of results is difficult.
R: nnet (nnet package)
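To make the "nested functions" point concrete, here is a tiny forward pass in plain Python. The weights are arbitrary numbers chosen for illustration (nothing is trained here), and the layer sizes and sigmoid activation are assumptions.

```python
import math


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of inputs passed through a nonlinear (sigmoid) function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def network(genes):
    # Hidden layer: two neurons reading the three gene expression values.
    h1 = neuron(genes, [2.0, -1.0, 0.5], 0.0)
    h2 = neuron(genes, [-1.5, 1.0, 1.0], 0.0)
    # Output neuron: a function of the hidden neurons, i.e. a nested function.
    score_a = neuron([h1, h2], [3.0, -3.0], 0.0)
    return {"Class A": score_a, "Class B": 1.0 - score_a}


scores = network([1.0, 0.0, 0.0])
print(max(scores, key=scores.get))
```

Training would mean adjusting the weights and biases so that the outputs match the known classes of the training set.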
Support Vector Machine (SVM)
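Find the line / plane / hyperplane that maximizes the margin between the classes. The “kernel trick” implicitly maps the data to a higher-dimensional space, which allows nonlinear class boundaries.
[Figure: two classes plotted on G1 and G2 axes, with the margin and the support vectors marked.]
• very flexible
• many parameters
Results can be difficult to interpret.
R: svm (e1071 package)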
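This sketch does not train an SVM. It only illustrates the two terms in the figure: for a given separating line w . x + b = 0, the margin is taken here as the distance from the line to the closest sample, and the closest samples play the role of support vectors. The function name, the two candidate lines, and the data are invented for illustration.

```python
import math


def margin_and_support(points, w, b, tol=1e-9):
    """Distance from the line w . x + b = 0 to its closest points, and those points."""
    norm = math.hypot(w[0], w[1])
    dists = [abs(w[0] * x + w[1] * y + b) / norm for x, y in points]
    margin = min(dists)
    support = [p for p, d in zip(points, dists) if d - margin <= tol]
    return margin, support


# Two classes: the first three points and the last three points.
points = [(1, 1), (2, 0), (0, 0), (4, 4), (5, 3), (6, 6)]

# Both candidate lines separate the classes, but with different margins.
margin_diag, support_diag = margin_and_support(points, w=(1, 1), b=-5)
margin_vert, support_vert = margin_and_support(points, w=(1, 0), b=-3)
print(round(margin_diag, 2), round(margin_vert, 2))
```

An SVM searches for the separating hyperplane with the largest such margin, so of these two candidates it would prefer the diagonal line.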
Comparison of methods
Linear discriminant analysis, nearest centroid, kNN:
– Simple methods
– Based on distance calculation
– Good for simple problems
– Good for few training samples
– Distribution of data assumed

Neural networks, support vector machines:
– Advanced methods
– Involve machine learning
– Several adjustable parameters
– Many training samples required (e.g. 50-100)
– Flexible methods
Dimension reduction - why?
Previous slides: 2 genes X 16 samples.
Your data: 22000 genes X 31 samples.
More genes = more parameters in classifier. Potential for over-training!! aka The Curse of Dimensionality.
[Figure: accuracy versus number of parameters, with one curve for the training set and one for the test set (novel data).]
Dimension reduction
Two approaches:
1. Data transformation
– Principal component analysis (PCA); use top components
– Average co-expressed genes (e.g. from cluster analysis)
2. Feature selection (gene selection)
– Significant genes: t-test / ANOVA
– High variance genes
– Hypothesis-driven
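A sketch of gene selection by t-test in plain Python: rank genes by the absolute value of Welch's t statistic between two classes and keep the top n. The gene names and expression values are invented, and the sketch assumes exactly two classes and nonzero variance within each class.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's t statistic for two groups of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def top_genes(expr, labels, n):
    """Return the n genes with the largest |t| between the two classes."""
    classes = sorted(set(labels))
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == classes[0]]
        b = [v for v, y in zip(values, labels) if y == classes[1]]
        scores[gene] = abs(t_statistic(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:n]


labels = ["normal"] * 3 + ["disease"] * 3
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # clearly different between classes
    "geneB": [2.0, 2.5, 1.5, 2.1, 2.4, 1.6],  # essentially unchanged
    "geneC": [5.0, 5.5, 4.5, 4.0, 3.6, 4.4],  # moderately different
}
print(top_genes(expr, labels, 2))
```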
Train the classifier
If you know the exact classifier you want to use: just do it.
but...
Usually, we want to try several classifiers or several variations of a classifier. e.g. number of features, k in kNN, etc.
The problem:
Testing several classifiers, and then choosing the best one leads to selection bias (= overtraining). This is bad.
Instead we can use cross-validation to choose a classifier.
Cross validation
[Figure: the data set is split into a temporary training set and a temporary testing set. The classifier is trained on the temporary training set and applied to the temporary testing set to measure performance.]
Do this several times!
Each time, select different sample(s) for the temporary testing set.
Leave-one-out cross-validation (LOOCV):
Do this once for each sample (i.e. the temporary testing set is only the one sample).
Assess average performance.
If you are using cross-validation as an optimization step, choose the classifier (or classifier parameters) that results in best performance.
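A sketch of cross-validation used as an optimization step, in plain Python: LOOCV accuracy is computed for several values of k in kNN, and the best k is kept. The data are invented, with one class A sample placed inside the class B cluster so that the choice of k matters.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def loocv_accuracy(xs, ys, k):
    """Leave each sample out once, train on the rest, and test on the held-out sample."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)


xs = [(1, 1), (1, 2), (2, 1), (2, 2), (6.4, 6.4),
      (6, 6), (6, 7), (7, 6), (7, 7)]
ys = ["A", "A", "A", "A", "A", "B", "B", "B", "B"]

results = {k: loocv_accuracy(xs, ys, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)  # first k with the highest average accuracy
print(results, best_k)
```

The accuracy of the chosen k is optimistic, because it was selected on these same samples; a separate evaluation is still needed.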
Cross-validation example
Given data with 100 samples
5-fold cross-validation:
Training: 4/5 of data (80 samples)
Testing: 1/5 of data (20 samples)
- 5 different models and performance results

Leave-one-out cross-validation (LOOCV):
Training: 99/100 of data (99 samples)
Testing: 1/100 of data (1 sample)
- 100 different models
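The split sizes above can be checked with a short index-splitting sketch in plain Python. The function name is invented, and samples are assigned to folds in a fixed interleaved order; in practice the samples would usually be shuffled first.

```python
def k_fold_splits(n_samples, n_folds):
    """Return a list of (train_indices, test_indices), one pair per fold."""
    indices = list(range(n_samples))
    splits = []
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th sample, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        splits.append((train, test))
    return splits


five_fold = k_fold_splits(100, 5)
loocv = k_fold_splits(100, 100)  # LOOCV is the special case n_folds == n_samples
print(len(five_fold), len(five_fold[0][0]), len(five_fold[0][1]))
print(len(loocv), len(loocv[0][0]), len(loocv[0][1]))
```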
Assess performance of the classifier
How well does the classifier perform on novel samples?
Worst: Assess performance on the training set.
Better: Assess performance on the training set using cross-validation.
Alternative: Set aside a subset of your data as a test set.
Best: Also assess performance on an entirely independent data set (e.g. data produced by another lab).
Problems with cross-validation
1. You cannot use the same cross-validation to optimize and evaluate your classifier.
2. The classifier obtained at each round of cross-validation is different, and is different from that obtained using all data.
3. Cross-validation will overestimate performance in the presence of experimental bias.
Assessing classifier performance
predict relapse of cancer

         prediction   actual outcome
Aron     yes          no               FP
Beth     no           no               TN
Chuck    yes          yes              TP
Dorthe   no           yes              FN
...      ...          ...              ...

TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative

confusion matrix:
             predict yes   predict no
actual yes   131 TP        21 FN
actual no    7 FP          212 TN
Classifier performance: Accuracy
confusion matrix:
             predict yes   predict no
actual yes   131 TP        21 FN
actual no    7 FP          212 TN

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.92
range: [0 .. 1]
fraction of predictions that are correct
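The accuracy figure can be reproduced from the confusion matrix; a short Python check (the function name is just for this example):

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)


print(round(accuracy(tp=131, fn=21, fp=7, tn=212), 2))  # 0.92
```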
Unbalanced test data
a new classifier: does patient have tuberculosis?

confusion matrix:
             predict yes   predict no
actual yes   5345 TP       21 FN
actual no    113 FP        18 TN

Accuracy = 0.98 (?)
For patients who do not really have tuberculosis: this classifier is usually wrong!
>>> Accuracy can be misleading, especially when the test cases are unbalanced.
Classifier performance: sensitivity/specificity
confusion matrix:
             predict yes   predict no
actual yes   131 TP        21 FN
actual no    7 FP          212 TN

Sensitivity = TP / (TP + FN) = 0.86: the ability to detect positives
Specificity = TN / (TN + FP) = 0.97: the ability to reject negatives
range: [0 .. 1]
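Both values follow directly from the rows of the confusion matrix; a short Python check (function names are just for this example):

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that are predicted negative."""
    return tn / (tn + fp)


print(round(sensitivity(tp=131, fn=21), 2))  # 0.86
print(round(specificity(tn=212, fp=7), 2))   # 0.97
```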
Classifier performance: Matthews Correlation Coefficient
confusion matrix:
             predict yes   predict no
actual yes   131 TP        21 FN
actual no    7 FP          212 TN

MCC = 0.84
range: [-1 .. 1]
A single performance measure that is less influenced by unbalanced test sets
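The slide gives only the value; the standard definition is MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A short Python check, applied to the matrix above and to the unbalanced tuberculosis matrix from the earlier slide:

```python
import math


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient of a 2x2 confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


print(round(mcc(tp=131, fn=21, fp=7, tn=212), 2))    # 0.84
print(round(mcc(tp=5345, fn=21, fp=113, tn=18), 2))  # 0.24
```

The second matrix has an accuracy of 0.98, yet its MCC is only 0.24. This sketch does not handle the case where a row or column of the matrix is all zeros.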
Caution: unbalanced test data
a new classifier: does patient have tuberculosis?

confusion matrix:
             predict yes   predict no
actual yes   5345 TP       21 FN
actual no    113 FP        18 TN

Accuracy = 0.98 (?)
Sensitivity = 0.996
Specificity = 0.14
MCC = 0.24
Unbiased classifier assessment
Avoid information leakage!
1. Use separate data sets for training and for testing.
2. Try to avoid “wearing out” the testing data set. Testing multiple classifiers (and choosing the best one) can only increase the estimated performance -- therefore it is biased!
3. Perform any feature selection using only the training set (or temporary training set).
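Point 3 can be sketched as follows, in plain Python: inside each round of LOOCV, features are selected from the temporary training set only, so the held-out sample never influences which genes are used. The selection score (absolute difference of class means), the nearest-centroid classifier, and the data are simplifying assumptions made for the example.

```python
from statistics import mean


def select_features(samples, labels, n_features):
    """Rank features by absolute difference of class means (two classes assumed)."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(samples[0])):
        a = [s[j] for s, y in zip(samples, labels) if y == classes[0]]
        b = [s[j] for s, y in zip(samples, labels) if y == classes[1]]
        scores.append((abs(mean(a) - mean(b)), j))
    return [j for _, j in sorted(scores, reverse=True)[:n_features]]


def nearest_centroid(train, labels, x):
    best, best_dist = None, float("inf")
    for c in sorted(set(labels)):
        rows = [s for s, y in zip(train, labels) if y == c]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((u - v) ** 2 for u, v in zip(centroid, x))
        if dist < best_dist:
            best, best_dist = c, dist
    return best


def loocv_without_leakage(samples, labels, n_features):
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Feature selection sees only the temporary training set.
        keep = select_features(train, train_y, n_features)
        reduced = [[s[j] for j in keep] for s in train]
        x = [samples[i][j] for j in keep]
        correct += nearest_centroid(reduced, train_y, x) == labels[i]
    return correct / len(samples)


samples = [[1.0, 5.0, 3.0], [1.2, 4.0, 3.1], [0.9, 6.0, 2.9],
           [3.0, 5.5, 3.0], [3.1, 4.5, 3.2], [2.9, 5.0, 2.8]]
labels = ["normal"] * 3 + ["disease"] * 3
print(loocv_without_leakage(samples, labels, n_features=1))
```

Selecting features once on the full data set and then cross-validating would leak information from every held-out sample into the classifier.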
Summary: Recipe for classification
1. Choose a classification method
– kNN, LDA, SVM, etc.
2. Feature selection / dimension reduction
– PCA, possibly t-tests
3. Train the classifier
– use cross-validation to optimize parameters
4. Assess classifier performance
– use an independent test set if possible
– determine sensitivity and specificity when appropriate
The data set in the exercise
Genome-wide expression profiling of human blood reveals biomarkers for Huntington's disease.
Borovecki F, Lovrecic L, Zhou J, Jeong H, Then F, Rosas HD, Hersch SM, Hogarth P, Bouzou B, Jensen RV, Krainc D.
Proc Natl Acad Sci U S A. 2005 Aug 2;102(31):11023-8.
31 samples:
- 14 normal
- 5 presymptomatic
- 12 symptomatic
Affymetrix HG-U133A arrays (and Codelink Uniset)
The data set in the exercise
Huntington’s disease
- neurological disorder
- genetic
- polyglutamine expansion in huntingtin gene

Why search for a marker of disease progression (not diagnosis)?
- assess treatment efficacy
- surrogate endpoint in drug trials
Questions and/or Lunch