why microarray?

CZ5225: Modeling and Simulation in Biology

Lecture 6, Microarray Cancer Classification

Prof. Chen Yu Zong

Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg

Room 07-24, level 7, SOC1, National University of Singapore

Why Microarray?

• Although there have been some improvements over the past 30 years, there is still no general approach for:
– Identifying new cancer classes
– Assigning tumors to known classes

• This paper introduces two general approaches for:
– Class prediction of a new tumor
– Class discovery of new, unknown subclasses
– Both without using previous biological information


• Why do we need to classify cancers?
– The general way of treating cancer is to:
• Categorize the cancers into different classes
• Use a specific treatment for each class

• Traditional way
– Morphological appearance


• Why are traditional ways not enough?
– Some tumors in the same class have completely different clinical courses
• A more accurate classification may be needed
– Assigning new tumors to known cancer classes is not easy
• e.g. assigning an acute leukemia tumor to one of:
– AML (acute myeloid leukemia)
– ALL (acute lymphoblastic leukemia)

Cancer Classification

• Class discovery
– Identifying new cancer classes

• Class prediction
– Assigning tumors to known classes

Cancer Genes and Pathways

• 15 cancer-related pathways, 291 cancer genes, 34 angiogenesis genes, 12 tumor immune tolerance genes

Nature Medicine 10, 789-799 (2004); Nature Reviews Cancer 4, 177-183 (2004), 6, 613-625 (2006); Critical Reviews in Oncology/Hematology 59, 40-50 (2006)
http://bidd.nus.edu.sg/group/trmp/trmp.asp

Disease outcome prediction with microarray

[Figure: microarray profiles of patient i and normal person j are passed to an SVM that separates Patient from Normal; the important (most discriminative) genes form the signatures (predictor-genes)]

• Better predictive power
• Clues to disease genes, drug targets

Disease outcome prediction with microarray

[Figure: SVM separating Patient from Normal]

Expected features of signatures:

Composition:
• Certain percentages of cancer genes, genes in cancer pathways, and angiogenesis genes

Stability:
• Similar set of predictor-genes in different patient compositions measured under the same or similar conditions

How many genes should be in a signature?

Class                                                          No. of genes or pathways
Cancer genes (oncogenes, tumor-suppressors, stability genes)   219
Cancer pathways                                                15
Angiogenesis                                                   34
Cancer immune tolerance                                        15

Class Prediction

• How can an initial collection of samples belonging to known classes be used to create a class predictor?
– Gathering samples
– Hybridizing RNA to the microarray
– Obtaining a quantitative expression level for each gene
– Identifying informative genes via neighborhood analysis
– Weighted votes

Neighborhood Analysis

• We want to identify the genes whose expression patterns are strongly correlated with the class distinction to be predicted, and ignore the other genes
– Each gene is represented by an expression vector consisting of its expression level in each sample
– Count the number of genes having various levels of correlation with an idealized gene c
– Compare these counts with those obtained for randomly permuted versions of c
• The results show an unusually high density of correlated genes!
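The counting-and-permutation idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `neighborhood_counts`, its parameters, and the use of the Pearson correlation as the similarity measure are choices made here, not details given on the slide, and genes with constant expression would need to be filtered out first to avoid division by zero.

```python
import numpy as np

def neighborhood_counts(X, labels, thresholds, n_perm=100, seed=0):
    """Count genes correlated with the class vector, observed vs. permuted.

    X: (genes, samples) expression matrix.
    labels: (samples,) array of 0/1 class labels (the idealized gene c).
    thresholds: correlation levels at which genes are counted.
    Assumption: Pearson correlation stands in for the similarity measure.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=1, keepdims=True)          # center each gene

    def counts(c):
        cc = c - c.mean()
        r = (Xc @ cc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(cc))
        # Number of genes whose |correlation| reaches each threshold.
        return np.array([(np.abs(r) >= t).sum() for t in thresholds])

    observed = counts(labels.astype(float))
    # Reference distribution: the same counts for randomly permuted labels.
    permuted = np.mean(
        [counts(rng.permutation(labels).astype(float)) for _ in range(n_perm)],
        axis=0,
    )
    return observed, permuted
```

If the observed counts are clearly larger than the permuted averages at high thresholds, the data contain more class-correlated genes than chance would suggest.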

[Figure: Neighborhood analysis, showing the idealized expression pattern]

Class Predictor

• The general approach
– Choose a set of informative genes based on their correlation with the class distinction
– Each informative gene casts a weighted vote for one of the classes
– Sum the votes to determine the winning class and the prediction strength

Computing Votes

• Each gene Gi votes for AML or ALL depending on:
– Whether the expression level of the gene in the new tumor is nearer to the mean of Gi in AML or in ALL
• The value of the vote is:
– WiVi, where:
• Wi reflects how well Gi is correlated with the class distinction
• Vi = | xi – (AML mean + ALL mean) / 2 |
• The prediction strength reflects:
– Margin of victory
– (Vwin – Vlose) / (Vwin + Vlose)
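A minimal sketch of the voting rule above, assuming the two classes are passed as separate expression matrices. The function name `weighted_vote` and the choice of a signal-to-noise ratio for Wi are assumptions for illustration; the slide only says that Wi reflects how well Gi is correlated with the class distinction.

```python
import numpy as np

def weighted_vote(x_new, X_a, X_b):
    """Predict class 'A' or 'B' for one sample by weighted gene voting.

    x_new: (genes,) expression levels of the new sample.
    X_a, X_b: (samples, genes) training expression for each class.
    Returns (winning class, prediction strength).
    """
    mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
    # Assumed weight Wi: signal-to-noise ratio between the two classes
    # (undefined for genes that are constant in both classes).
    w = (mu_a - mu_b) / (X_a.std(axis=0) + X_b.std(axis=0))
    # Vi = |xi - (mean_a + mean_b) / 2|, as on the slide.
    v = np.abs(x_new - (mu_a + mu_b) / 2.0)
    votes = np.abs(w) * v
    # Each gene votes for the class whose mean is nearer to xi.
    nearer_a = np.abs(x_new - mu_a) < np.abs(x_new - mu_b)
    v_a, v_b = votes[nearer_a].sum(), votes[~nearer_a].sum()
    v_win, v_lose = max(v_a, v_b), min(v_a, v_b)
    strength = (v_win - v_lose) / (v_win + v_lose)
    return ("A" if v_a >= v_b else "B"), strength
```

A prediction strength near 1 means almost all of the vote went to one class; a value near 0 means the vote was nearly tied, which is where a sample would be reported as uncertain.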

[Figure: Class Predictor]

Evaluation

• Data
– Initial samples
• 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis
– Independent samples
• 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML)

• Validation of gene voting
– Initial samples
• 36 of the 38 samples were predicted as either AML or ALL and two as uncertain. All 36 predictions agree with the clinical diagnosis.
– Independent samples
• 29 of 34 samples were strongly predicted, with 100% accuracy.

[Figure: Validation of Gene Voting]

An early kind of analysis: unsupervised learning
learning disease sub-types

[Figure: plot with axes Rb and p53]

Sub-type learning: seeking ‘natural’ groupings & hoping that they will be useful…

[Figure: plot with axes Rb and p53]

E.g., for treatment

[Figure: plot with axes Rb and p53; one group labelled "Respond to treatment Tx1", the other "Do not respond to treatment Tx1"]

The ‘one-solution fits all’ trap

[Figure: plot with axes Rb and p53; one group labelled "Respond to treatment Tx2", the other "Do not respond to treatment Tx2"]

A more modern view: supervised learning

[Figure: train instances (A1, B1, C1, D1, E1; A2, B2, C2, D2, E2; …; An, Bn, Cn, Dn, En) over variables A to E are passed to an inductive algorithm, which produces a classifier or regression model; the model is applied to application instances to obtain the classification performance]

Predictive Biomarkers & Supervised Learning

[Figure: train instances → inductive algorithm → classifier or regression model → classification performance on application instances, annotated with "Predictive Biomarkers"]

A more modern view 2: Unsupervised learning as structure learning

[Figure: train instances (A1, B1, C1, D1, E1; …; An, Bn, Cn, Dn, En) are passed to an inductive algorithm, which outputs a structure over the variables A, B, C, D, E; performance is then assessed]

Causative biomarkers & (structural) unsupervised learning

[Figure: train instances → inductive algorithm → structure over the variables A, B, C, D, E, annotated with "Causative Biomarkers"]

Supervised learning: the geometrical interpretation

[Figure: cancer patients and normals plotted on axes p53 and Rb, with points P1 to P5; an SVM classifier separates the two groups, and new cases (marked ??) are classified as normal or as cancer depending on which side they fall]

If 2D looks good, what happens in 3D?

Typical numbers of dimensions (variables per sample):
• 10,000-50,000 (regular gene expression microarrays, aCGH, and early SNP arrays)
• 500,000 (tiled microarrays, SNP arrays)
• 10,000-300,000 (regular MS proteomics)
• >10,000,000 (LC-MS proteomics)

This is the ‘curse of dimensionality’ problem

Problems associated with high-dimensionality (especially with small samples)

• Some methods do not run at all (classical regression)
• Some methods give bad results
• Very slow analysis
• Very expensive/cumbersome clinical application
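To see why classical regression breaks down here, the sketch below builds a random dataset with far more variables than samples. The sizes (20 samples, 1000 genes) and variable names are arbitrary assumptions. The design matrix cannot have rank above the number of samples, so the least-squares problem has infinitely many exact solutions and even a pure-noise outcome is fitted perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 20, 1000            # illustrative sizes, p >> n

X = rng.normal(size=(n_samples, n_genes))
y = rng.normal(size=n_samples)           # outcome is pure noise

# The rank is bounded by the number of samples, so X^T X is singular
# and the classical normal equations have no unique solution.
rank = np.linalg.matrix_rank(X)

# lstsq still returns one answer: the minimum-norm exact fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
train_residual = np.linalg.norm(X @ coef - y)

print(f"rank = {rank}, variables = {n_genes}, training residual = {train_residual:.2e}")
```

A near-zero training residual on noise is a warning sign rather than a success, which is what motivates dimensionality reduction and feature selection.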

Solution 1: dimensionality reduction

[Figure: scatter plot of Gene Y against Gene X for normal subjects and cancer patients, with the 1st principal component (PC1) drawn as the line 3X-Y=0]
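A minimal sketch of how a first principal component like the one in the figure can be computed, using the singular value decomposition of the centered data. The function name `pca_project` and the toy data are assumptions for illustration only.

```python
import numpy as np

def pca_project(X, n_components=1):
    """Project samples onto the leading principal components.

    X: (samples, genes) expression matrix.
    Returns (scores, components); rows of `components` are unit vectors.
    """
    Xc = X - X.mean(axis=0)                       # center each gene
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components

# Toy data lying exactly on the line Y = 3X (that is, 3X - Y = 0).
X = np.array([[0.0, 0.0], [1.0, 3.0], [2.0, 6.0], [3.0, 9.0]])
scores, components = pca_project(X)
print(components[0])   # proportional to (1, 3), up to sign
```

Each sample is then described by a single score along PC1 instead of two gene values; with thousands of genes the same call reduces them to a handful of components.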

Solution 2: feature selection

[Figure: graph over the variables A, B, C, D, E, H, I, J, K, L, M, N, O, P, Q and T]

Another (very real and unpleasant) problem: over-fitting

• Over-fitting (a model to your data) = building a model that is good on the original data but fails to generalize well to fresh data

Over-fitting is directly related to the complexity of the decision surface (relative to the complexity of the modeling task)

[Figure: outcome of interest Y plotted against predictor X, showing training data and test data]
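The relation between model complexity and over-fitting can be illustrated with a small simulation. Everything here is an assumption chosen for the sketch: the true relation is linear with Gaussian noise, and a degree-1 and a degree-9 polynomial are fitted to the same 10 training points.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Simulated data: Y depends linearly on X, plus noise."""
    x = rng.uniform(-1, 1, n)
    return x, 2 * x + rng.normal(scale=0.3, size=n)

x_train, y_train = make_data(10)     # small training sample
x_test, y_test = make_data(200)      # fresh data from the same source

def errors(degree):
    """Mean squared error on training and test data for one polynomial fit."""
    coef = np.polyfit(x_train, y_train, degree)

    def mse(x, y):
        return float(np.mean((np.polyval(coef, x) - y) ** 2))

    return mse(x_train, y_train), mse(x_test, y_test)

train_simple, test_simple = errors(1)     # matches the true complexity
train_complex, test_complex = errors(9)   # passes through every training point

print(f"degree 1: train {train_simple:.3f}, test {test_simple:.3f}")
print(f"degree 9: train {train_complex:.3f}, test {test_complex:.3f}")
```

The more complex curve always fits the training points at least as well; the comparison that matters is the test error, which is where over-fitting shows up.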

Over-fitting is also caused by multiple validations & small samples

General population: AUC of Model_1 = 65%; AUC of Model_2 = 85%; AUC of Model_3 = 55%

[Figure: training & validation phase, followed by validation with an independent dataset
– Modeling samples MS_1 … MS_n (samples used for training & validation): Model_1 is trained and validated; AUC of Model_1 = 88%
– Independent evaluation sample ES_1 (not used for training & validation): AUC of Model_1 = 65%; a sample in which over-fitting is detected
– Independent evaluation sample ES_2 (not used for training & validation): AUC of Model_1 = 84%; a sample in which over-fitting is not detected]

Over-fitting is also caused by multiple validations & small samples

General population: AUC of Model_1 = 65%; AUC of Model_2 = 85%; AUC of Model_3 = 55%

[Figure: training & validation phase, followed by validation with an independent dataset
– Modeling samples MS_1 … MS_n (samples used for training & validation)
– Independent evaluation sample ES_1 (not used for training & validation): AUC of Model_2 = 74%; a sample falsely detecting over-fitting
– Independent evaluation sample ES_2 (not used for training & validation): AUC of Model_2 = 90%; a sample not detecting over-fitting]
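The effect of repeated validation on a small sample can be reproduced with a short simulation. The setup is an assumption made for illustration: 50 candidate "models" that guess at random (true accuracy 50%), a validation sample of 20 cases, and accuracy used in place of AUC to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_val, n_indep = 50, 20, 10_000   # illustrative sizes

# Small validation sample, reused to score every candidate model.
y_val = rng.integers(0, 2, n_val)
val_acc = np.array(
    [(rng.integers(0, 2, n_val) == y_val).mean() for _ in range(n_models)]
)
best = int(np.argmax(val_acc))              # "winner" of the repeated validations

# The winner is still a random guesser, so on a large independent
# sample its accuracy falls back to about 50%.
y_indep = rng.integers(0, 2, n_indep)
indep_acc = (rng.integers(0, 2, n_indep) == y_indep).mean()

print(f"best validation accuracy: {val_acc[best]:.2f}")
print(f"independent accuracy:     {indep_acc:.2f}")
```

Picking the best of many models on the same small sample inflates the reported performance even though no model has learned anything, which is why an untouched evaluation sample is needed.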

A method to produce realistic performance estimates: nested n-fold cross-validation

The dataset (predictor variables and outcome variable) is split into parts P1, P2, P3.

Outer loop: cross-validation for performance estimation

Training set   Testing set   C   Accuracy   Average accuracy
P1, P2         P3            1   89%        83%
P1, P3         P2            2   84%
P2, P3         P1            1   76%
…

Inner loop: cross-validation for model selection

Training set   Validation set   C   Accuracy   Average accuracy
P1             P2               1   86%        85%
P2             P1               1   84%
P1             P2               2   70%        80%
P2             P1               2   90%

Choose C=1 since it maximizes accuracy
…
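The two loops can be sketched as follows. To keep the example dependency-free, a k-nearest-neighbour classifier with hyperparameter k is swapped in for the SVM and its parameter C shown on the slide; all function names and defaults are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """k-nearest-neighbour prediction for 0/1 labels (stand-in classifier)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def cv_accuracy(X, y, folds, k):
    """Average accuracy over the given folds (arrays of sample indices)."""
    accs = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        accs.append((pred == y[test_idx]).mean())
    return float(np.mean(accs))

def nested_cv(X, y, ks=(1, 3, 5), n_outer=3, n_inner=2, seed=0):
    """Outer loop estimates performance; inner loop selects k."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(y)), n_outer)
    outer_accs = []
    for i, test_idx in enumerate(outer):
        train_idx = np.concatenate([f for j, f in enumerate(outer) if j != i])
        # Inner loop: choose k using the outer training set only.
        inner = np.array_split(rng.permutation(train_idx), n_inner)
        best_k = max(ks, key=lambda k: cv_accuracy(X, y, inner, k))
        # Retrain on the full outer training set, test on the held-out part.
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], best_k)
        outer_accs.append((pred == y[test_idx]).mean())
    return float(np.mean(outer_accs))
```

The key design point is that the outer test part never influences the choice of hyperparameter, so the averaged outer accuracy is not biased by model selection.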

How well does supervised learning work in practice?

Datasets

• Bhattacharjee2 - Lung cancer vs normals [GE/DX]
• Bhattacharjee2_I - Lung cancer vs normals on common genes between Bhattacharjee2 and Beer [GE/DX]
• Bhattacharjee3 - Adenocarcinoma vs Squamous [GE/DX]
• Bhattacharjee3_I - Adenocarcinoma vs Squamous on common genes between Bhattacharjee3 and Su [GE/DX]
• Savage - Mediastinal large B-cell lymphoma vs diffuse large B-cell lymphoma [GE/DX]
• Rosenwald4 - 3-year lymphoma survival [GE/CO]
• Rosenwald5 - 5-year lymphoma survival [GE/CO]
• Rosenwald6 - 7-year lymphoma survival [GE/CO]
• Adam - Prostate cancer vs benign prostate hyperplasia and normals [MS/DX]
• Yeoh - Classification between 6 types of leukemia [GE/DX-MC]
• Conrads - Ovarian cancer vs normals [MS/DX]
• Beer_I - Lung cancer vs normals (common genes with Bhattacharjee2) [GE/DX]
• Su_I - Adenocarcinoma vs squamous (common genes with Bhattacharjee3) [GE/DX]
• Banez - Prostate cancer vs normals [MS/DX]

Methods: Gene Selection Algorithms

• ALL - No feature selection
• LARS - LARS
• HITON_PC -
• HITON_PC_W - HITON_PC + wrapping phase
• HITON_MB -
• HITON_MB_W - HITON_MB + wrapping phase
• GA_KNN - GA/KNN
• RFE - RFE with validation of feature subset with optimized polynomial kernel
• RFE_Guyon - RFE with validation of feature subset with linear kernel (as in Guyon)
• RFE_POLY - RFE (with polynomial kernel) with validation of feature subset with optimized polynomial kernel
• RFE_POLY_Guyon - RFE (with polynomial kernel) with validation of feature subset with linear kernel (as in Guyon)
• SIMCA - SIMCA (Soft Independent Modeling of Class Analogy): PCA-based method
• SIMCA_SVM - SIMCA (Soft Independent Modeling of Class Analogy): PCA-based method with validation of feature subset by SVM
• WFCCM_CCR - Weighted Flexible Compound Covariate Method (WFCCM) applied as in Clinical Cancer Research paper by Yamagata (analysis of microarray data)
• WFCCM_Lancet - Weighted Flexible Compound Covariate Method (WFCCM) applied as in Lancet paper by Yanagisawa (analysis of mass-spectrometry data)
• UAF_KW - Univariate with Kruskal-Wallis statistic
• UAF_BW - Univariate with ratio of between-group to within-group sum of squares
• UAF_S2N - Univariate with signal-to-noise statistic
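As an illustration of the simplest family in the list above, the sketch below ranks genes by a univariate signal-to-noise statistic and keeps the top-scoring ones. The function names and the exact form of the statistic (difference of class means divided by the sum of class standard deviations) are assumptions; the benchmarked UAF_S2N implementation may differ in detail.

```python
import numpy as np

def signal_to_noise(X, y):
    """Signal-to-noise score per gene for a two-class problem.

    X: (samples, genes) expression matrix; y: (samples,) 0/1 labels.
    Assumed form: (mean_1 - mean_0) / (std_1 + std_0); genes that are
    constant in both classes would need to be removed beforehand.
    """
    a, b = X[y == 1], X[y == 0]
    return (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0))

def select_top_genes(X, y, n_genes):
    """Indices of the n_genes with the largest absolute score, best first."""
    scores = np.abs(signal_to_noise(X, y))
    return np.argsort(scores)[::-1][:n_genes]

# Toy example: gene 2 separates the classes best, gene 0 not at all.
X = np.array([[1.0, 5.0, 0.0],
              [2.0, 4.0, 0.1],
              [1.0, 6.0, 10.0],
              [2.0, 5.0, 10.1]])
y = np.array([0, 0, 1, 1])
top = select_top_genes(X, y, 2)
```

Univariate filters like this score each gene independently, which makes them fast but blind to genes that are only informative in combination.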

Classification Performance (average over all tasks/datasets)

How well do dimensionality reduction and feature selection work in practice?

Number of Selected Features (average over all tasks/datasets)

[Figure: bar chart of the number of selected features (axis from 0 to 10,000) for ALL, LARS, HITONgp_PC, HITONgp_MB, HITONgp_PC_W, HITONgp_MB_W, GA_KNN, RFE, RFE_Guyon, RFE_POLY, RFE_POLY_Guyon, SIMCA, SIMCA_SVM, WFCCM_CCR, UAF_KW, UAF_BW, UAF_S2N]

Number of Selected Features (zoom on most powerful methods)

[Figure: bar chart of the number of selected features, axis from 0 to 100]
