why microarray?
CZ5225: Modeling and Simulation in Biology
Lecture 6: Microarray Cancer Classification
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore
Why Microarray?
• Despite some improvements over the past 30 years, there is still no general approach for:
– Identifying new cancer classes
– Assigning tumors to known classes
• The paper discussed here introduces two general approaches:
– Class prediction for a new tumor
– Class discovery of previously unknown subclasses
– Both work without using prior biological knowledge
Why Microarray?
• Why do we need to classify cancers?
– The general approach to treating cancer is to:
• Categorize cancers into different classes
• Use a specific treatment for each class
• Traditional way
– Morphological appearance
Why Microarray?
• Why are the traditional ways not enough?
– Some tumors in the same class have completely different clinical courses
• A more accurate classification may be needed
– Assigning new tumors to known cancer classes is not easy
• e.g. assigning an acute leukemia tumor to one of:
– AML (acute myeloid leukemia)
– ALL (acute lymphoblastic leukemia)
Cancer Classification
• Class discovery
– Identifying new cancer classes
• Class prediction
– Assigning tumors to known classes
Cancer Genes and Pathways
• 15 cancer-related pathways, 291 cancer genes, 34 angiogenesis genes, 12 tumor immune tolerance genes
Nature Medicine 10, 789-799 (2004); Nature Reviews Cancer 4, 177-183 (2004), 6, 613-625 (2006); Critical Reviews in Oncology/Hematology 59, 40-50 (2006)
http://bidd.nus.edu.sg/group/trmp/trmp.asp
Disease outcome prediction with microarray
[Diagram: the expression profiles of patient i and normal person j are passed to an SVM, which labels each sample as Patient or Normal]
Important genes: the most discriminative genes
Signatures (predictor-genes):
• Better predictive power
• Clues to disease genes, drug targets
Disease outcome prediction with microarray
[Diagram: the expression profiles of patient i and normal person j are passed to an SVM, which labels each sample as Patient or Normal]
Expected features of signatures:
Composition:
• Certain percentages of cancer genes, genes in cancer pathways, and angiogenesis genes
Stability:
• Similar set of predictor-genes in different patient compositions measured under the same or similar conditions
How many genes should be in a signature?
Class                                                           No. of Genes or Pathways
Cancer genes (oncogenes, tumor-suppressors, stability genes)    219
Cancer pathways                                                 15
Angiogenesis                                                    34
Cancer immune tolerance                                         15
Class Prediction
• How can an initial collection of samples belonging to known classes be used to create a class predictor?
– Gather samples
– Hybridize RNA to the microarray
– Obtain a quantitative expression level for each gene
– Identify informative genes via neighborhood analysis
– Weighted votes
Neighborhood Analysis
• Goal: identify the genes whose expression patterns are strongly correlated with the class distinction to be predicted, and ignore the other genes
– Each gene is represented by an expression vector consisting of its expression level in each sample
– Count the number of genes having various levels of correlation with an idealized gene c
– Compare these counts with the counts obtained for randomly permuted versions of c
• The results show an unusually high density of correlated genes!
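The counting-and-permutation idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which correlation measure is used, so the signal-to-noise score below is an assumed choice, and the names `signal_to_noise`, `count_correlated`, `neighborhood_analysis`, the threshold, and the toy data are all hypothetical.

```python
import random
from statistics import mean, pstdev

def signal_to_noise(values, labels):
    """Score one gene against the class vector: (mean1 - mean0) / (sd1 + sd0).
    Assumed measure; the slides only speak of 'correlation'."""
    g1 = [v for v, c in zip(values, labels) if c == 1]
    g0 = [v for v, c in zip(values, labels) if c == 0]
    denom = pstdev(g1) + pstdev(g0)
    return (mean(g1) - mean(g0)) / denom if denom > 0 else 0.0

def count_correlated(expr, labels, threshold):
    """Number of genes whose absolute score reaches the threshold."""
    return sum(abs(signal_to_noise(row, labels)) >= threshold for row in expr)

def neighborhood_analysis(expr, labels, threshold, n_perm=200, seed=0):
    """Compare the observed count with counts under randomly permuted labels."""
    rng = random.Random(seed)
    observed = count_correlated(expr, labels, threshold)
    shuffled = list(labels)
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # a randomly permuted version of the idealized pattern c
        null_counts.append(count_correlated(expr, shuffled, threshold))
    # fraction of permutations with at least as many correlated genes as observed
    p_value = sum(n >= observed for n in null_counts) / n_perm
    return observed, null_counts, p_value

# Toy data: rows are genes, columns are samples; c = 1 for one class, 0 for the other
c = [1, 1, 1, 0, 0, 0]
expr = [
    [10, 11, 12, 1, 2, 3],  # tracks the class distinction closely
    [5, 5, 5, 5, 5, 5],     # flat, uninformative
    [1, 9, 2, 8, 3, 7],     # noisy, weakly related
]
observed, null_counts, p_value = neighborhood_analysis(expr, c, threshold=1.0)
```

A small `p_value` means that permuted class vectors rarely attract as many correlated genes as the real one, which is the "unusually high density" the slide refers to.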
[Figure: Neighborhood analysis, showing the idealized expression pattern]
Class Predictor
• The general approach
– Choose a set of informative genes based on their correlation with the class distinction
– Each informative gene casts a weighted vote for one of the classes
– Sum the votes to determine the winning class and the prediction strength
Computing Votes
• Each gene Gi votes for AML or ALL, depending on whether its expression level xi in the new tumor is nearer to the mean of Gi in AML or in ALL
• The value of the vote is WiVi, where:
– Wi reflects how well Gi is correlated with the class distinction
– Vi = | xi – (AML mean + ALL mean) / 2 |
• The prediction strength reflects the margin of victory:
– (Vwin – Vlose) / (Vwin + Vlose), where Vwin and Vlose are the vote totals of the winning and losing classes
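The voting rule above can be written out as a short sketch. The function name `weighted_vote`, the use of the absolute value of Wi as the weight, and the three-gene example values are assumptions made for illustration; only the formulas for Vi and the prediction strength come from the slide.

```python
def weighted_vote(x, mean_aml, mean_all, weights):
    """Classify one new tumor from per-gene expression levels x.

    x, mean_aml, mean_all, weights are parallel lists, one entry per informative gene.
    """
    votes = {"AML": 0.0, "ALL": 0.0}
    for xi, m_aml, m_all, wi in zip(x, mean_aml, mean_all, weights):
        vi = abs(xi - (m_aml + m_all) / 2)  # distance from the midpoint of the class means
        # the gene votes for the class whose mean is nearer to xi
        side = "AML" if abs(xi - m_aml) < abs(xi - m_all) else "ALL"
        votes[side] += abs(wi) * vi         # value of the vote: Wi * Vi
    winner = max(votes, key=votes.get)
    loser = "ALL" if winner == "AML" else "AML"
    total = votes[winner] + votes[loser]
    # prediction strength: (Vwin - Vlose) / (Vwin + Vlose)
    strength = (votes[winner] - votes[loser]) / total if total > 0 else 0.0
    return winner, strength, votes

# Hypothetical values for three informative genes
mean_aml = [10.0, 2.0, 6.0]
mean_all = [2.0, 8.0, 4.0]
weights = [2.0, 1.0, 0.5]
new_tumor = [9.0, 3.0, 4.5]
winner, strength, votes = weighted_vote(new_tumor, mean_aml, mean_all, weights)
```

In this example the first two genes vote for AML (6.0 and 2.0) and the third votes for ALL (0.25), so AML wins with a prediction strength of 7.75 / 8.25, about 0.94. A sample whose strength falls near zero would be a natural candidate for the "uncertain" category.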
[Figure: Class Predictor]
Evaluation
• Data
– Initial samples
• 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis
– Independent samples
• 34 leukemia samples: 24 bone marrow and 10 peripheral blood (20 ALL, 14 AML)
• Validation of gene voting
– Initial samples
• 36 of the 38 samples were predicted as either AML or ALL and two as uncertain. All 36 predictions agree with the clinical diagnosis.
– Independent samples
• 29 of the 34 samples were strongly predicted, with 100% accuracy
[Figure: Validation of Gene Voting]
An early kind of analysis: unsupervised learning, i.e. learning disease sub-types
[Figure: samples plotted on Rb and p53 axes]
Sub-type learning: seeking ‘natural’ groupings & hoping that they will be useful…
[Figure: groupings of samples on Rb and p53 axes]
E.g., for treatment
[Figure: Rb and p53 axes, with regions labeled "Respond to treatment Tx1" and "Do not respond to treatment Tx1"]
The ‘one-solution fits all’ trap
[Figure: Rb and p53 axes, with regions labeled "Respond to treatment Tx2" and "Do not respond to treatment Tx2"]
A more modern view: supervised learning
[Diagram: train instances (A1, B1, C1, D1, E1; A2, B2, C2, D2, E2; ... An, Bn, Cn, Dn, En) over variables A to E are given to an inductive algorithm, which produces a classifier or regression model; the model is applied to application instances and its classification performance is measured]
Predictive Biomarkers & Supervised Learning
[Diagram: the same pipeline of train instances, inductive algorithm, classifier or regression model, application instances, and classification performance, with the label "Predictive Biomarkers"]
A more modern view 2: Unsupervised learning as structure learning
[Diagram: train instances (A1, B1, C1, D1, E1; A2, B2, C2, D2, E2; ... An, Bn, Cn, Dn, En) are given to an inductive algorithm, which outputs a structure over variables A to E; performance is then assessed]
Causative biomarkers & (structural) unsupervised learning
[Diagram: train instances are given to an inductive algorithm, which outputs a structure over variables A to E, with the label "Causative Biomarkers"]
Supervised learning: the geometrical interpretation
[Figure: cancer patients (+) and normals plotted on p53 and Rb axes, with points P1 to P5 and an SVM classifier separating the two groups; new cases (?) are labeled "New case, classified as normal" and "New case, classified as cancer"]
If 2D looks good, what happens in 3D?
Typical numbers of variables per platform:
• 10,000-50,000 (regular gene expression microarrays, aCGH, and early SNP arrays)
• 500,000 (tiled microarrays, SNP arrays)
• 10,000-300,000 (regular MS proteomics)
• >10,000,000 (LC-MS proteomics)
This is the ‘curse of dimensionality’ problem
Problems associated with high-dimensionality (especially with small samples)
• Some methods do not run at all (classical regression)
• Some methods give bad results
• Very slow analysis
• Very expensive/cumbersome clinical application
Solution 1: dimensionality reduction
[Figure: scatter plot of Gene Y (0 to 400) against Gene X (0 to 100) for normal subjects and cancer patients, with the 1st principal component (PC1) drawn as the line 3X - Y = 0]
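As a rough illustration of how two genes can be reduced to one coordinate, the sketch below finds the first principal component of 2-D data in closed form and projects each sample onto it. The function names and the toy points are hypothetical; the points are placed exactly on the line 3X - Y = 0 so that the recovered direction is easy to check, whereas real expression data would scatter around it.

```python
import math

def first_principal_component(points):
    """Return (center, unit direction) of PC1 for a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # angle of the leading eigenvector of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def project(points, center, direction):
    """One score per sample: its signed position along PC1."""
    return [(x - center[0]) * direction[0] + (y - center[1]) * direction[1]
            for x, y in points]

# Toy (Gene X, Gene Y) values lying on the line 3X - Y = 0
samples = [(10.0, 30.0), (30.0, 90.0), (50.0, 150.0), (70.0, 210.0), (90.0, 270.0)]
center, pc1 = first_principal_component(samples)
scores = project(samples, center, pc1)
slope = pc1[1] / pc1[0]  # dY/dX along PC1; 3 for these points
```

Each sample is now described by a single score instead of two expression levels. With thousands of genes the same idea applies, but the leading directions are usually found with a numerical eigen-decomposition or SVD rather than a closed-form angle.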
Solution 2: feature selection
[Figure: graph over the variables A, B, C, D, E, H, I, J, K, L, M, N, O, P, Q and T]
Another (very real and unpleasant) problem: over-fitting
• Over-fitting (a model to your data) = building a model that is good on the original data but fails to generalize well to fresh data
Over-fitting is directly related to the complexity of the decision surface (relative to the complexity of the modeling task)
[Figure: outcome of interest Y plotted against predictor X, showing training data and test data]
Over-fitting is also caused by multiple validations & small samples
General population: AUC of Model_1 = 65%; AUC of Model_2 = 85%; AUC of Model_3 = 55%
Training & validation phase (samples used for training & validation):
• Modeling sample MS_1: AUC of Model_1 = 88%; AUC of Model_2 = 76%; AUC of Model_3 = 63%
• Modeling sample MS_n: AUC of Model_1 = 61%; AUC of Model_2 = 87%; AUC of Model_3 = 67%
Validation with independent dataset (samples not used for training & validation):
• Independent evaluation sample ES_1: AUC of Model_1 = 65% (a sample in which over-fitting is detected)
• Independent evaluation sample ES_2: AUC of Model_1 = 84% (a sample in which over-fitting is not detected)
Over-fitting is also caused by multiple validations & small samples
General population: AUC of Model_1 = 65%; AUC of Model_2 = 85%; AUC of Model_3 = 55%
Training & validation phase (samples used for training & validation):
• Modeling sample MS_1: AUC of Model_1 = 88%; AUC of Model_2 = 76%; AUC of Model_3 = 63%
• Modeling sample MS_n: AUC of Model_1 = 61%; AUC of Model_2 = 87%; AUC of Model_3 = 67%
Validation with independent dataset (samples not used for training & validation):
• Independent evaluation sample ES_1: AUC of Model_2 = 74% (a sample falsely detecting over-fitting)
• Independent evaluation sample ES_2: AUC of Model_2 = 90% (a sample not detecting over-fitting)
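The effect of validating many models on one small sample can be illustrated with a small simulation. This is a sketch under stated assumptions, not the setup behind the AUC figures on the slide: it uses accuracy instead of AUC, every candidate model has no real skill (it is right with probability 0.5), and the function name, sample sizes, and seed are hypothetical.

```python
import random

def simulate_selection(n_models=50, n_validation=20, n_independent=2000, seed=1):
    """Pick the best of many skill-free models on a small validation sample,
    then score a skill-free model on a large independent sample."""
    rng = random.Random(seed)

    def accuracy(n_cases):
        # a model with no real skill is correct on each case with probability 0.5
        return sum(rng.random() < 0.5 for _ in range(n_cases)) / n_cases

    validation_scores = [accuracy(n_validation) for _ in range(n_models)]
    best_validation = max(validation_scores)         # the score that would get reported
    independent_score = accuracy(n_independent)      # what fresh data would show
    return validation_scores, best_validation, independent_score

validation_scores, best_validation, independent_score = simulate_selection()
average_validation = sum(validation_scores) / len(validation_scores)
print(f"average over candidates: {average_validation:.2f}")
print(f"best on small validation sample: {best_validation:.2f}")
print(f"on large independent sample: {independent_score:.2f}")
```

Because the best of many noisy estimates is selected, the winning validation score is typically well above the true 50%, while the estimate on the large independent sample stays close to it. The smaller the validation sample and the more models tried, the larger this gap tends to be.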
A method to produce realistic performance estimates: nested n-fold cross-validation
Dataset (predictor variables, outcome variable), split into parts P1, P2, P3

Outer loop: cross-validation for performance estimation
Training set    Testing set    C    Accuracy    Average accuracy
P1, P2          P3             1    89%
P1, P3          P2             2    84%         83%
P2, P3          P1             1    76%
…

Inner loop: cross-validation for model selection
Training set    Validation set    C    Accuracy    Average accuracy
P1              P2                1    86%         85%
P2              P1                1    84%
P1              P2                2    70%         80%
P2              P1                2    90%
Choose C=1 since it maximizes accuracy
…
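The two loops can be sketched with the standard library alone. To keep the example self-contained, a k-nearest-neighbour classifier with parameter k stands in for the SVM with parameter C used on the slide; the function names, the strided (non-random) fold assignment, and the toy data are assumptions for illustration.

```python
def knn_predict(train, query, k):
    """Majority label among the k training items nearest to query."""
    nearest = sorted(train, key=lambda item: sum((a - b) ** 2
                                                 for a, b in zip(item[0], query)))[:k]
    labels = [label for _, label in nearest]
    return max(sorted(set(labels)), key=labels.count)

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def make_folds(data, n_folds):
    return [data[i::n_folds] for i in range(n_folds)]  # simple strided split

def split(parts, held_out):
    train = [item for j, part in enumerate(parts) if j != held_out for item in part]
    return train, parts[held_out]

def select_k(data, candidates, n_folds):
    """Inner loop: choose the parameter with the best cross-validated accuracy."""
    parts = make_folds(data, n_folds)
    def cv_score(k):
        return sum(accuracy(*split(parts, i), k) for i in range(n_folds)) / n_folds
    return max(candidates, key=cv_score)  # ties go to the earlier candidate

def nested_cv(data, candidates, outer_folds=3, inner_folds=2):
    """Outer loop: estimate performance on parts never used for model selection."""
    parts = make_folds(data, outer_folds)
    results = []
    for i in range(outer_folds):
        train, test = split(parts, i)
        best_k = select_k(train, candidates, inner_folds)  # sees the training part only
        results.append((best_k, accuracy(train, test, best_k)))
    estimate = sum(acc for _, acc in results) / outer_folds
    return results, estimate

# Toy one-gene data: well separated expression levels for the two classes
data = ([((float(v),), "normal") for v in range(0, 6)]
        + [((float(v),), "cancer") for v in range(20, 26)])
results, estimate = nested_cv(data, candidates=(1, 3))
```

The key design point is that the testing part of each outer fold plays no role in choosing the parameter, so the averaged outer accuracy is not inflated by model selection. On real data the folds would normally be randomized and stratified by class.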
How well does supervised learning work in practice?
Datasets
• Bhattacharjee2 - Lung cancer vs normals [GE/DX]
• Bhattacharjee2_I - Lung cancer vs normals on common genes between Bhattacharjee2 and Beer [GE/DX]
• Bhattacharjee3 - Adenocarcinoma vs Squamous [GE/DX]
• Bhattacharjee3_I - Adenocarcinoma vs Squamous on common genes between Bhattacharjee3 and Su [GE/DX]
• Savage - Mediastinal large B-cell lymphoma vs diffuse large B-cell lymphoma [GE/DX]
• Rosenwald4 - 3-year lymphoma survival [GE/CO]
• Rosenwald5 - 5-year lymphoma survival [GE/CO]
• Rosenwald6 - 7-year lymphoma survival [GE/CO]
• Adam - Prostate cancer vs benign prostate hyperplasia and normals [MS/DX]
• Yeoh - Classification between 6 types of leukemia [GE/DX-MC]
• Conrads - Ovarian cancer vs normals [MS/DX]
• Beer_I - Lung cancer vs normals (common genes with Bhattacharjee2) [GE/DX]
• Su_I - Adenocarcinoma vs squamous (common genes with Bhattacharjee3) [GE/DX]
• Banez - Prostate cancer vs normals [MS/DX]
Methods: Gene Selection Algorithms
• ALL - No feature selection
• LARS - LARS
• HITON_PC -
• HITON_PC_W - HITON_PC + wrapping phase
• HITON_MB -
• HITON_MB_W - HITON_MB + wrapping phase
• GA_KNN - GA/KNN
• RFE - RFE with validation of feature subset with optimized polynomial kernel
• RFE_Guyon - RFE with validation of feature subset with linear kernel (as in Guyon)
• RFE_POLY - RFE (with polynomial kernel) with validation of feature subset with optimized polynomial kernel
• RFE_POLY_Guyon - RFE (with polynomial kernel) with validation of feature subset with linear kernel (as in Guyon)
• SIMCA - SIMCA (Soft Independent Modeling of Class Analogy): PCA based method
• SIMCA_SVM - SIMCA (Soft Independent Modeling of Class Analogy): PCA based method with validation of feature subset by SVM
• WFCCM_CCR - Weighted Flexible Compound Covariate Method (WFCCM) applied as in Clinical Cancer Research paper by Yamagata (analysis of microarray data)
• WFCCM_Lancet - Weighted Flexible Compound Covariate Method (WFCCM) applied as in Lancet paper by Yanagisawa (analysis of mass-spectrometry data)
• UAF_KW - Univariate with Kruskal-Wallis statistic
• UAF_BW - Univariate with ratio of between-group to within-group sum of squares
• UAF_S2N - Univariate with signal-to-noise statistic
Classification Performance (average over all tasks/datasets)
How well do dimensionality reduction and feature selection work in practice?
Number of Selected Features (average over all tasks/datasets)
[Figure: bar chart with a vertical axis from 0 to 10,000 selected features and one bar per method: ALL, LARS, HITONgp_PC, HITONgp_MB, HITONgp_PC_W, HITONgp_MB_W, GA_KNN, RFE, RFE_Guyon, RFE_POLY, RFE_POLY_Guyon, SIMCA, SIMCA_SVM, WFCCM_CCR, UAF_KW, UAF_BW, UAF_S2N]
Number of Selected Features (zoom on most powerful methods)
[Figure: bar chart with a vertical axis from 0 to 100 selected features]
Number of Selected Features (average over all tasks/datasets)