
Page 1: Why Microarray?

CZ5225: Modeling and Simulation in Biology

Lecture 6, Microarray Cancer Classification

Prof. Chen Yu Zong

Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg

Room 07-24, level 7, SOC1, National University of Singapore

Page 2: Why Microarray?

Why Microarray?

• Although there have been some improvements over the past 30 years, there still exists no general way for:
– Identifying new cancer classes
– Assigning tumors to known classes

• In this paper, two general approaches are introduced for:
– Class prediction of a new tumor
– Class discovery of new, unknown subclasses
– Without using prior biological knowledge

Page 3: Why Microarray?

Why Microarray?

• Why do we need to classify cancers?
– The general approach to treating cancer is to:
• Categorize the cancers into different classes
• Use a specific treatment for each class
• Traditional way
– Morphological appearance

Page 4: Why Microarray?

Why Microarray?

• Why are traditional ways not enough?
– Some tumors in the same class have completely different clinical courses
• A more accurate classification may be needed
– Assigning new tumors to known cancer classes is not easy
• e.g., assigning an acute leukemia tumor to one of:
– AML (acute myeloid leukemia)
– ALL (acute lymphoblastic leukemia)

Page 5: Why Microarray?

Cancer Classification

• Class discovery
– Identifying new cancer classes
• Class prediction
– Assigning tumors to known classes

Page 6: Why Microarray?

Cancer Genes and Pathways

• 15 cancer-related pathways, 291 cancer genes, 34 angiogenesis genes, 12 tumor immune tolerance genes

References: Nature Medicine 10, 789-799 (2004); Nature Reviews Cancer 4, 177-183 (2004); Nature Reviews Cancer 6, 613-625 (2006); Critical Reviews in Oncology/Hematology 59, 40-50 (2006)
http://bidd.nus.edu.sg/group/trmp/trmp.asp

Page 7: Why Microarray?

Disease outcome prediction with microarray

[Figure: gene expression profiles of patient i and normal person j are fed to an SVM, which separates Patient from Normal and highlights the most discriminative genes (important genes).]

Signatures (predictor-genes):
• Better predictive power
• Clues to disease genes and drug targets
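To make this pipeline concrete, the following is a minimal sketch, assuming a synthetic samples-by-genes matrix and scikit-learn's LinearSVC; it is not the exact method behind these slides, and ranking genes by absolute SVM weight is only one simple way to pick "most discriminative genes".

```python
# Minimal sketch: SVM-based patient-vs-normal prediction from expression
# data, plus a simple ranking of discriminative genes by SVM weight.
# The data below is synthetic; rows are samples and columns are genes.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 500
X = rng.normal(size=(n_samples, n_genes))       # expression matrix
y = rng.integers(0, 2, size=n_samples)          # 0 = normal, 1 = patient
X[y == 1, :10] += 1.5                           # make 10 genes informative

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Genes with the largest absolute weights are the most discriminative
# under this linear model (a candidate signature of predictor-genes).
top_genes = np.argsort(np.abs(clf.coef_[0]))[::-1][:10]
print("top discriminative genes:", top_genes)
```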

Page 8: Why Microarray?

Disease outcome prediction with microarray

[Figure: as on the previous slide, an SVM separates the expression profiles of patients from those of normal persons.]

Expected features of signatures:

Composition:
• Certain percentages of cancer genes, genes in cancer pathways, and angiogenesis genes

Stability:
• A similar set of predictor-genes in different patient compositions measured under the same or similar conditions

How many genes should be in a signature?

Class                                                         No. of genes or pathways
Cancer genes (oncogenes, tumor suppressors, stability genes)  219
Cancer pathways                                               15
Angiogenesis                                                  34
Cancer immune tolerance                                       15

Page 9: Why Microarray?

Class Prediction

• How can one use an initial collection of samples belonging to known classes to create a class predictor?
– Gathering samples
– Hybridizing RNAs to the microarray
– Obtaining the quantitative expression level of each gene
– Identifying informative genes via neighborhood analysis
– Weighted voting

Page 10: Why Microarray?

Neighborhood Analysis

• We want to identify the genes whose expression patterns are strongly correlated with the class distinction to be predicted, and to ignore the other genes
– Each gene is represented by an expression vector consisting of its expression level in each sample
– Count the number of genes having various levels of correlation with the idealized class vector c
– Compare with the correlations obtained when c is randomly permuted
• The results show an unusually high density of correlated genes!
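As a concrete illustration, here is a minimal sketch of neighborhood analysis, assuming the signal-to-noise measure P(g, c) = (mu1 - mu0) / (sigma1 + sigma0) as the gene-class correlation; the threshold and number of permutations are illustrative choices, not the paper's exact settings.

```python
# Minimal sketch of neighborhood analysis: count genes strongly correlated
# with the class vector c and compare against randomly permuted labels.
import numpy as np

def signal_to_noise(X, y):
    """Per-gene correlation with the class distinction (columns = genes)."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    s1, s0 = X[y == 1].std(axis=0), X[y == 0].std(axis=0)
    return (mu1 - mu0) / (s1 + s0)

def neighborhood_analysis(X, y, threshold=0.5, n_perm=100, seed=0):
    """Observed count of |P(g, c)| > threshold vs the mean count under
    random permutations of the class labels."""
    rng = np.random.default_rng(seed)
    observed = int(np.sum(np.abs(signal_to_noise(X, y)) > threshold))
    permuted = [
        int(np.sum(np.abs(signal_to_noise(X, rng.permutation(y))) > threshold))
        for _ in range(n_perm)
    ]
    return observed, float(np.mean(permuted))

# Toy data: 38 samples x 1000 genes, with 20 genes tracking the class.
rng = np.random.default_rng(1)
y = np.array([1] * 27 + [0] * 11)               # e.g. 27 ALL vs 11 AML
X = rng.normal(size=(38, 1000))
X[y == 1, :20] += 1.0
obs, perm_mean = neighborhood_analysis(X, y)
print(f"genes above threshold: {obs} observed vs {perm_mean:.1f} expected by chance")
```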

Page 11: Why Microarray?

Neighborhood analysis

[Figure: idealized expression pattern.]

Page 12: Why Microarray?

Class Predictor

• The general approach:
– Choose a set of informative genes based on their correlation with the class distinction
– Each informative gene casts a weighted vote for one of the classes
– Sum up the votes to determine the winning class and the prediction strength

Page 13: Why Microarray?

Computing Votes

• Each gene Gi votes for AML or ALL, depending on whether the expression level of the gene in the new tumor is nearer to the mean of Gi in AML or in ALL
• The value of the vote is Wi * Vi, where:
– Wi reflects how well Gi is correlated with the class distinction
– Vi = | xi - (AML mean + ALL mean) / 2 |
• The prediction strength reflects the margin of victory:
– PS = (V_win - V_lose) / (V_win + V_lose)
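A minimal sketch of this weighted-voting scheme is shown below; the specific weight definition (signal-to-noise) matches the neighborhood-analysis sketch above and is an assumption, and the toy data is synthetic.

```python
# Minimal sketch of the weighted-voting class predictor: each informative
# gene votes with value w_i * (x_i - b_i), where b_i is the midpoint of
# the two class means; the sign of the vote picks the class.
import numpy as np

def train_voter(X, y):
    """Per-gene weights w_i and decision midpoints b_i from training data."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    s1, s0 = X[y == 1].std(axis=0), X[y == 0].std(axis=0)
    w = (mu1 - mu0) / (s1 + s0)      # correlation with the class distinction
    b = (mu1 + mu0) / 2.0            # halfway between the class means
    return w, b

def predict_with_strength(x_new, w, b):
    """Sum the weighted votes; return the winning class and PS."""
    votes = w * (x_new - b)
    v_class1 = votes[votes > 0].sum()            # total vote for class 1
    v_class0 = -votes[votes < 0].sum()           # total vote for class 0
    v_win, v_lose = max(v_class1, v_class0), min(v_class1, v_class0)
    label = 1 if v_class1 >= v_class0 else 0
    ps = (v_win - v_lose) / (v_win + v_lose)     # prediction strength
    return label, ps

# Toy example.
rng = np.random.default_rng(0)
y = np.array([1] * 27 + [0] * 11)
X = rng.normal(size=(38, 50))
X[y == 1, :10] += 1.0
w, b = train_voter(X, y)
print(predict_with_strength(X[0], w, b))         # e.g. (1, 0.4...)
```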

Page 14: Why Microarray?

Class Predictor

Page 15: Why Microarray?

Evaluation

• Data
– Initial sample
• 38 bone marrow samples (27 ALL, 11 AML) obtained at the time of diagnosis
– Independent sample
• 34 leukemia samples, consisting of 24 bone marrow and 10 peripheral blood samples (20 ALL and 14 AML)
• Validation of gene voting
– Initial samples
• 36 of the 38 samples were predicted as either AML or ALL, and two as uncertain; all 36 predictions agree with the clinical diagnosis
– Independent samples
• 29 of the 34 samples were strongly predicted, with 100% accuracy

Page 16: Why Microarray?

Validation of Gene Voting

Page 17: Why Microarray?

An early kind of analysis: unsupervised learning (learning disease sub-types)

[Figure: samples plotted in Rb vs p53 expression space.]

Page 18: Why Microarray?

Sub-type learning: seeking 'natural' groupings and hoping that they will be useful…

[Figure: 'natural' groupings of samples in Rb vs p53 expression space.]

Page 19: Why Microarray?

E.g., for treatment

[Figure: in Rb vs p53 expression space, one group of samples responds to treatment Tx1 and the other does not.]

Page 20: Why Microarray?

The 'one solution fits all' trap

[Figure: in the same Rb vs p53 space, samples that respond to treatment Tx2 and samples that do not.]

Page 21: Why Microarray?

A more modern view: supervised learning

[Figure: training instances (variables A–E, with rows A1, B1, C1, D1, E1 through An, Bn, Cn, Dn, En) are fed to an inductive algorithm that produces a classifier or regression model; the model is applied to application instances and assessed for classification performance.]

Page 22: Why Microarray?

Predictive Biomarkers & Supervised Learning

[Figure: the same supervised-learning diagram, now annotated with predictive biomarkers.]

Page 23: Why Microarray?

Predictive Biomarkers & Supervised Learning

Page 24: Why Microarray?

A more modern view 2: unsupervised learning as structure learning

[Figure: training instances (variables A–E) are fed to an inductive algorithm that learns a structure over the variables, assessed for performance.]

Page 25: Why Microarray?

Causative biomarkers & (structural) unsupervised learning

[Figure: the same structure-learning diagram, now annotated with causative biomarkers.]

Page 26: Why Microarray?

Supervised learning: the geometrical interpretation

[Figure: cancer patients (P1–P5) and normals plotted as points in Rb vs p53 expression space; an SVM classifier separates the two groups, and new cases are classified as cancer or normal depending on which side of the decision boundary they fall.]
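The geometric picture above can be reproduced with a few lines of scikit-learn; the two "genes" labelled Rb and p53 and the sample coordinates are purely illustrative.

```python
# Minimal sketch of the geometrical interpretation: a linear SVM trained on
# two marker expression levels, then used to classify new cases.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
normals = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(20, 2))
patients = rng.normal(loc=[2.5, 2.5], scale=0.3, size=(20, 2))
X = np.vstack([normals, patients])          # columns: Rb, p53 expression
y = np.array([0] * 20 + [1] * 20)           # 0 = normal, 1 = cancer

clf = SVC(kernel="linear").fit(X, y)        # learns the separating hyperplane

new_cases = np.array([[1.2, 0.9],           # lands on the normal side
                      [2.6, 2.4]])          # lands on the cancer side
print(clf.predict(new_cases))               # [0 1]
```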

Page 27: Why Microarray?

If 2D looks good, what happens in 3D?

Typical numbers of measured features:
• 10,000–50,000 (regular gene expression microarrays, aCGH, and early SNP arrays)
• 500,000 (tiled microarrays, SNP arrays)
• 10,000–300,000 (regular MS proteomics)
• >10,000,000 (LC-MS proteomics)

This is the 'curse of dimensionality' problem.

Page 28: Why Microarray?

Problems associated with high dimensionality (especially with small samples)

• Some methods do not run at all (classical regression)
• Some methods give bad results
• Very slow analysis
• Very expensive/cumbersome clinical application

Page 29: Why Microarray?

Solution 1: dimensionality reduction

[Figure: cancer patients and normal subjects plotted in Gene X vs Gene Y expression space; the 1st principal component (PC1: 3X - Y = 0) captures the main direction of variation.]
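A minimal sketch of this kind of dimensionality reduction, assuming a synthetic samples-by-genes matrix and scikit-learn's PCA:

```python
# Minimal sketch: project each sample onto the first few principal
# components and work in that lower-dimensional space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))            # 100 samples, 5000 genes

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (100, 10)
print(pca.explained_variance_ratio_[:3])    # variance captured by PC1-PC3
```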

Page 30: Why Microarray?

Solution 2: feature selection

[Figure: a network of variables (A, B, C, ..., T) illustrating selection of a relevant subset of features.]
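A minimal sketch of feature selection, using a simple univariate filter (ANOVA F-test via scikit-learn's SelectKBest) as a stand-in for the more sophisticated selectors discussed later; the data and the choice of k are illustrative.

```python
# Minimal sketch: keep only the k genes whose expression differs most
# between the two classes according to a univariate score.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5000))
y = rng.integers(0, 2, size=80)
X[y == 1, :15] += 1.0                       # 15 genuinely informative genes

selector = SelectKBest(score_func=f_classif, k=50).fit(X, y)
X_selected = selector.transform(X)          # reduced matrix
print(X_selected.shape)                     # (80, 50)
print(np.sort(selector.get_support(indices=True))[:15])  # indices kept
```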

Page 31: Why Microarray?

Another (very real and unpleasant) problem: over-fitting

• Over-fitting (a model to your data) = building a model that is good on the original data but fails to generalize well to fresh data

Page 32: Why Microarray?

Over-fitting is directly related to the complexity of the decision surface (relative to the complexity of the modeling task)

[Figure: outcome of interest Y plotted against predictor X, for training data and test data.]
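The effect can be demonstrated with a toy regression, a hypothetical example not taken from these slides: a high-degree polynomial (a complex decision surface) fits the training data closely but does worse on test data than a simpler fit.

```python
# Minimal sketch of over-fitting: compare a simple and a very flexible
# polynomial fit on noisy data; the flexible one wins on the training set
# but loses on the test set.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 200)

for degree in (3, 15):
    coefs = P.polyfit(x_train, y_train, degree)
    train_mse = np.mean((P.polyval(x_train, coefs) - y_train) ** 2)
    test_mse = np.mean((P.polyval(x_test, coefs) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```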

Page 33: Why Microarray?

Over-fitting is also caused by multiple validations & small samples

[Figure: training & validation phase followed by validation with independent datasets.
• General population: AUC of Model_1 = 65%, Model_2 = 85%, Model_3 = 55%.
• Modeling sample MS_1 (used for training & validation): AUC of Model_1 = 88%, Model_2 = 76%, Model_3 = 63%; modeling sample MS_n: AUC of Model_1 = 61%, Model_2 = 87%, Model_3 = 67%.
• Independent evaluation sample ES_1 (not used for training & validation): evaluated AUC of Model_1 = 65%, a sample in which over-fitting is detected.
• Independent evaluation sample ES_2: evaluated AUC of Model_1 = 84%, a sample in which over-fitting is not detected.]

Page 34: Why Microarray?

Over-fitting is also caused by multiple validations & small samples

[Figure: the same setup as on the previous slide, now evaluating Model_2 on the independent datasets.
• Independent evaluation sample ES_1: evaluated AUC of Model_2 = 74%, a sample falsely detecting over-fitting.
• Independent evaluation sample ES_2: evaluated AUC of Model_2 = 90%, a sample not detecting over-fitting.]

Page 35: Why Microarray?

A method to produce realistic performance estimates: nested n-fold cross-validation

The dataset (predictor variables plus an outcome variable) is split into parts P1, P2, P3.

Inner loop: cross-validation for model selection
  Training set   Validation set   C   Accuracy
  P1             P2               1   86%
  P2             P1               1   84%   (average for C = 1: 85%)
  P1             P2               2   70%
  P2             P1               2   90%   (average for C = 2: 80%)
  Choose C = 1 since it maximizes accuracy.

Outer loop: cross-validation for performance estimation
  Training set   Testing set   C   Accuracy
  P1, P2         P3            1   89%
  P1, P3         P2            2   84%
  P2, P3         P1            1   76%
  Average accuracy: 83%
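A minimal sketch of nested cross-validation with scikit-learn, assuming an SVM whose cost parameter C is tuned in the inner loop; the fold counts, parameter grid, and data are illustrative.

```python
# Minimal sketch: the inner loop (GridSearchCV) selects C, the outer loop
# (cross_val_score) estimates the performance of the whole procedure.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 200))
y = rng.integers(0, 2, size=90)
X[y == 1, :10] += 1.0

inner = GridSearchCV(SVC(kernel="linear"), param_grid={"C": [0.1, 1, 10]}, cv=2)
outer_scores = cross_val_score(inner, X, y, cv=3)    # outer 3-fold loop
print("nested CV accuracy: %.2f +/- %.2f" % (outer_scores.mean(), outer_scores.std()))
```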

Page 36: Why Microarray?

How well does supervised learning work in practice?

Page 37: Why Microarray?

Datasets

• Bhattacharjee2 - Lung cancer vs normals [GE/DX]
• Bhattacharjee2_I - Lung cancer vs normals on common genes between Bhattacharjee2 and Beer [GE/DX]
• Bhattacharjee3 - Adenocarcinoma vs squamous [GE/DX]
• Bhattacharjee3_I - Adenocarcinoma vs squamous on common genes between Bhattacharjee3 and Su [GE/DX]
• Savage - Mediastinal large B-cell lymphoma vs diffuse large B-cell lymphoma [GE/DX]
• Rosenwald4 - 3-year lymphoma survival [GE/CO]
• Rosenwald5 - 5-year lymphoma survival [GE/CO]
• Rosenwald6 - 7-year lymphoma survival [GE/CO]
• Adam - Prostate cancer vs benign prostate hyperplasia and normals [MS/DX]
• Yeoh - Classification between 6 types of leukemia [GE/DX-MC]
• Conrads - Ovarian cancer vs normals [MS/DX]
• Beer_I - Lung cancer vs normals (common genes with Bhattacharjee2) [GE/DX]
• Su_I - Adenocarcinoma vs squamous (common genes with Bhattacharjee3) [GE/DX]
• Banez - Prostate cancer vs normals [MS/DX]

Page 38: Why Microarray?

Methods: Gene Selection Algorithms

• ALL - No feature selection
• LARS - LARS
• HITON_PC -
• HITON_PC_W - HITON_PC + wrapping phase
• HITON_MB -
• HITON_MB_W - HITON_MB + wrapping phase
• GA_KNN - GA/KNN
• RFE - RFE with validation of feature subset with optimized polynomial kernel
• RFE_Guyon - RFE with validation of feature subset with linear kernel (as in Guyon)
• RFE_POLY - RFE (with polynomial kernel) with validation of feature subset with optimized polynomial kernel
• RFE_POLY_Guyon - RFE (with polynomial kernel) with validation of feature subset with linear kernel (as in Guyon)
• SIMCA - SIMCA (Soft Independent Modeling of Class Analogy): PCA-based method
• SIMCA_SVM - SIMCA (Soft Independent Modeling of Class Analogy): PCA-based method with validation of feature subset by SVM
• WFCCM_CCR - Weighted Flexible Compound Covariate Method (WFCCM) applied as in the Clinical Cancer Research paper by Yamagata (analysis of microarray data)
• WFCCM_Lancet - Weighted Flexible Compound Covariate Method (WFCCM) applied as in the Lancet paper by Yanagisawa (analysis of mass-spectrometry data)
• UAF_KW - Univariate with Kruskal-Wallis statistic
• UAF_BW - Univariate with ratio of between-group to within-group sum of squares
• UAF_S2N - Univariate with signal-to-noise statistic
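As one concrete example from this list, here is a minimal sketch of SVM-based recursive feature elimination (RFE) with scikit-learn; the kernel, step size, and number of selected features are illustrative choices, not the settings used in the benchmark.

```python
# Minimal sketch of RFE: repeatedly fit a linear SVM and discard the
# lowest-weight genes until the requested number of features remains.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(70, 1000))
y = rng.integers(0, 2, size=70)
X[y == 1, :20] += 1.0

rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=50, step=0.1)
rfe.fit(X, y)
selected_genes = np.where(rfe.support_)[0]
print(len(selected_genes), "genes selected")
```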

Page 39: Why Microarray?

Classification Performance (average over all tasks/datasets)

Page 40: Why Microarray?

How well do dimensionality reduction and feature selection work in practice?

Page 41: Why Microarray?

Number of Selected Features (average over all tasks/datasets)

[Figure: bar chart of the average number of features selected by each gene selection method (ALL, LARS, HITON_PC, HITON_MB, HITON_PC_W, HITON_MB_W, GA_KNN, RFE, RFE_Guyon, RFE_POLY, RFE_POLY_Guyon, SIMCA, SIMCA_SVM, WFCCM_CCR, UAF_KW, UAF_BW, UAF_S2N); the y-axis runs from 0 to 10,000.]

Page 42: Why Microarray?

Number of Selected Features (zoom on the most powerful methods)

[Figure: the same bar chart zoomed in; the y-axis runs from 0 to 100.]

Page 43: Why Microarray?

Number of Selected Features (average over all tasks/datasets)