Classification and Diagnostic Prediction of Cancers using Gene Expression Profiling and Artificial Neural Networks JAVED KHAN ET AL. NATURE MEDICINE – Volume 7 – Number 6 – JUNE 2001

The Small, Round Blue Cell Tumors (SRBCTs) of Childhood

There are four categories: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS).

They look similar on routine histology, yet accurate diagnosis is essential because treatment options, response to therapy, etc., vary with the diagnosis.

No single test can precisely distinguish SRBCTs. Diagnosis draws on immunohistochemistry, cytogenetics, interphase fluorescence in situ hybridization and reverse transcription PCR.

Gene Expression Profiling using cDNA Microarrays

Microarrays measure the activity of several thousand genes simultaneously.

They can be used for cancer classification, which should lead to better therapeutic measures for cancer patients by diagnosing cancer types with improved accuracy.

Here the approach is applied to cancers belonging to several diagnostic categories: the SRBCTs.

Artificial Neural Networks (ANNs) – put to the task.

Modeled on the structure and behavior of neurons in the human brain.

Can be trained to recognize and categorize complex patterns.

Pattern recognition is achieved by adjusting the parameters of the ANN through a process of error minimization, learning from experience.

ANNs were applied to decipher gene-expression signatures of SRBCTs and then used for diagnostic classification.

Error Minimization

Mean Squared Error

Summed Square Error
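The two error measures named above can be written out as follows. This is a sketch in standard notation, assuming N_s samples, N_o output nodes, network outputs o_i^{(s)} and target values y_i^{(s)}. The symbol y is an assumption and does not come from the slides. Some texts include a factor of 1/2 in the summed square error; it is omitted here.

```latex
E_{\mathrm{SSE}} = \sum_{s=1}^{N_s} \sum_{i=1}^{N_o} \left( o_i^{(s)} - y_i^{(s)} \right)^2 ,
\qquad
E_{\mathrm{MSE}} = \frac{1}{N_s N_o} \, E_{\mathrm{SSE}}
```

The two differ only by a constant factor, so minimizing one minimizes the other.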

Network Architecture and Parameters

Due to the limited amount of calibration data and the fact that four output nodes are needed, the network architecture was limited to linear perceptrons.

10 input nodes were used, representing the 10 PCA components described later on.

4 output nodes were modeled by the sigmoid function.

Calibration is performed using JETNET, with learning rate η = 0.7 and momentum coefficient p = 0.3.

The learning rate is decreased by a factor of 0.99 after each iteration.

Initial weight values are chosen randomly from [-r, r], where r = 0.1/max_i F_i and F_i is the number of nodes connecting to node i.

Weight values are updated after every 10 samples.
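The architecture and initialization described above can be sketched in plain Python. This is an illustrative sketch, not the JETNET implementation: the function names are hypothetical, the inclusion of a bias weight in the fan-in is an assumption, and training itself (including the update after every 10 samples) is left out.

```python
import math
import random

N_INPUTS = 10    # PCA components per sample
N_OUTPUTS = 4    # EWS, RMS, NB, BL

def init_weights(n_in, n_out, rng):
    """Draw weights (plus one bias per output) uniformly from [-r, r]."""
    fan_in = n_in + 1          # assumption: the bias counts toward the fan-in
    r = 0.1 / fan_in           # every output node has the same fan-in here
    return [[rng.uniform(-r, r) for _ in range(fan_in)] for _ in range(n_out)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(weights, x):
    """Linear perceptron: one weighted sum per output, passed through a sigmoid."""
    # zip stops after the n_in input weights; the last entry of w is the bias.
    return [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x))) for w in weights]

def learning_rate(epoch, eta0=0.7, decay=0.99):
    """Learning rate after `epoch` rounds of multiplicative decay."""
    return eta0 * decay ** epoch

rng = random.Random(0)
W = init_weights(N_INPUTS, N_OUTPUTS, rng)
outputs = forward(W, [0.5] * N_INPUTS)
```

With no hidden layer, each model has only 4 × 11 = 44 adjustable weights under these assumptions, which suits a calibration set of 63 samples.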

Back-propagation

Minimizing by gradient descent is the least sophisticated method, but in many cases it is sufficient.

It amounts to updating the weights according to the back-propagation learning rule.

The partial derivative ∂Et/∂w represents a sensitivity factor, determining the direction of search in weight space for the synaptic weights.

where

Delta rule
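For a single-layer network of the kind described here, the gradient-descent update and the delta rule are commonly written as below. This is the standard textbook form, offered as a sketch under the assumption that E_t is half the summed square error for one sample, with o_i = g(a_i), a_i = Σ_j w_ij x_j and g the sigmoid. The symbols y (target), a and g are assumptions; the factor 1/2 is a common convention that only rescales η.

```latex
E_t = \frac{1}{2} \sum_{i} \left( y_i - o_i \right)^2 ,
\qquad
\Delta w_{ij} = -\eta \, \frac{\partial E_t}{\partial w_{ij}} = \eta \, \delta_i \, x_j ,
\qquad
\delta_i = \left( y_i - o_i \right) g'(a_i)
```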

A momentum term is often added to stabilize the learning.

where α < 1
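A minimal sketch of one gradient-descent step with momentum, assuming the common form Δw(t) = −η ∂E/∂w + α Δw(t−1). The function name `momentum_step` and the toy quadratic error are illustrative assumptions. The defaults reuse η = 0.7 from the slides and assume that the momentum coefficient quoted earlier as p = 0.3 plays the role of α.

```python
def momentum_step(w, grad, prev_delta, eta=0.7, alpha=0.3):
    """One update: delta = -eta * grad + alpha * prev_delta; returns (new_w, delta)."""
    delta = [-eta * g + alpha * d for g, d in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta

# Toy error E(w) = 0.5 * sum(w_i^2), whose gradient is simply w.
w = [1.0, -2.0]
delta = [0.0, 0.0]
for _ in range(50):
    w, delta = momentum_step(w, grad=w, prev_delta=delta)
# w has now shrunk towards the minimum at the origin.
```

The previous step is carried forward with weight α, so successive updates in a consistent direction reinforce each other while oscillating components are damped.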

Calibration and Validation of the ANN Models

cDNA microarrays containing 6567 genes were used.

63 training samples: 13 EWS and 10 RMS from tumor biopsies, and 10 EWS, 10 RMS, 12 NB and 8 BL (Burkitt lymphoma, a subset of NHL) from cell lines.

25 test samples: 5 EWS, 5 RMS and 4 NB from tumors, and 1 EWS, 2 NB and 3 BL from cell lines, plus 5 non-SRBCT samples (to test the ability to reject a diagnosis).

Filtering for a minimal level of expression reduced the number of genes to 2308.

Principal Component Analysis (PCA) further reduced the dimensionality.

The 10 dominant PCA components per sample were used as inputs, with four outputs (EWS, RMS, NB, BL).

A three-fold cross-validation procedure was used and 3750 ANN models were produced (Figure 1).

There was no sign of "over-training" of the models, which would show up as a rise in the summed square error for the validation set with increasing iterations (epochs); see Figure 2.

The Artificial Neural Network

1. Quality filtering.

2. PCA.

3. The 25 test samples are set aside and the 63 training samples are randomly partitioned into 3 groups.

4. One group is reserved for validation and the other two are used for calibration.

5. For each model the calibration was optimized with 100 iterative cycles (epochs).

6. This was repeated using each of the three groups for validation.

7. The samples were again randomly partitioned and the entire training process repeated. For each selection of a validation group one model was calibrated, resulting in a total of 3750 trained models.

8. Once the models were calibrated they were used to rank the genes according to their importance for classification.

9. The entire process was repeated using only top ranked genes.
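The partitioning in steps 3 to 7 can be sketched as below. This is a hedged illustration, not the authors' code: the function name is hypothetical, and the figure of 1250 random repartitions is inferred from the stated total of 3750 models divided by the three validation groups.

```python
import random

def three_fold_rounds(sample_ids, n_repeats, rng):
    """Yield (calibration_ids, validation_ids), one pair per trained model."""
    for _ in range(n_repeats):
        ids = list(sample_ids)
        rng.shuffle(ids)
        groups = [ids[i::3] for i in range(3)]   # three groups of near-equal size
        for v in range(3):
            validation = groups[v]
            calibration = [s for g, grp in enumerate(groups) if g != v for s in grp]
            yield calibration, validation

rng = random.Random(0)
rounds = list(three_fold_rounds(range(63), n_repeats=1250, rng=rng))
n_models = len(rounds)

# Count how often each sample lands in a validation group.
times_validated = [0] * 63
for _, validation in rounds:
    for s in validation:
        times_validated[s] += 1
```

Each repartition yields three models, and every sample sits in the validation group exactly once per repartition, which is why each sample later receives 1250 validation predictions.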

Validation

Each validation sample is then passed through 1250 models, and hence 1250 predictions are produced for each validation sample.

Each ANN model gives a number between 0 (not this cancer type) and 1 (this cancer type) as an output for each cancer type.

The average for all model outputs for every validation sample is then computed (denoted the average committee vote).

Each sample is classified as belonging to the cancer type corresponding to the largest committee vote.

Using these ANN models, all 63 training samples were correctly classified to their respective categories.
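The committee vote described above amounts to averaging the model outputs per class and taking the largest average. A minimal sketch, with hypothetical function names and made-up outputs from three models rather than 1250:

```python
CLASSES = ["EWS", "RMS", "NB", "BL"]

def committee_vote(model_outputs):
    """Average the per-class outputs of all models for one sample."""
    n = len(model_outputs)
    return [sum(out[i] for out in model_outputs) / n for i in range(len(CLASSES))]

def classify(model_outputs):
    """Return (label, votes): the class with the largest average vote."""
    votes = committee_vote(model_outputs)
    best = max(range(len(CLASSES)), key=lambda i: votes[i])
    return CLASSES[best], votes

# Made-up outputs from three models for a single validation sample.
example = [
    [0.9, 0.1, 0.0, 0.1],
    [0.7, 0.3, 0.1, 0.0],
    [0.8, 0.2, 0.2, 0.2],
]
label, votes = classify(example)
```

Here the averaged vote for EWS is about 0.8, the largest of the four, so the sample is assigned to EWS.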

Optimization of Genes used for Classification.

The contribution of each gene to the classification by the ANN models was then assessed.

Feature extraction was performed in a model-dependent way because there are relatively few samples.

This was achieved by monitoring the sensitivity of classification to a change in the expression level of each gene, using the 3750 previously calibrated models.

Sensitivity (S) of the outputs (o) with respect to any of the 2308 input variables (xk) is defined as:

Where Ns is the number of samples (63) and No is the number of outputs (4). The procedure for computing Sk involves a committee of 3750 models.
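A form of the sensitivity consistent with the symbols defined above, assuming it averages the absolute partial derivatives of the outputs over all samples and outputs, is:

```latex
S_k = \frac{1}{N_s} \, \frac{1}{N_o} \sum_{s=1}^{N_s} \sum_{i=1}^{N_o}
      \left| \frac{\partial o_i}{\partial x_k} \right|
```

Under this reading, a gene k with a large S_k is one whose expression level, when changed, moves the committee's outputs the most.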

In this way genes were ranked according to their significance for classification, and the classification error rate was determined using increasing numbers of these ranked genes.

The classification error rate reached its minimum of 0% at 96 genes.

Using only these 96 genes, recalibration of the ANN models was performed and again all 63 samples were correctly classified.

Assessing the Quality of Classification: Diagnosis

The aim of diagnosis is to be able to reject test samples that do not belong to any of the four categories.

To do this, a distance dc from a sample to the ideal vote for each cancer type was calculated:

Where c is the cancer type, oi is the average committee vote for cancer i, and δi,c is unity if i corresponds to cancer type c and zero otherwise.

The distance is normalized such that the distance between two ideal samples belonging to different disease categories is unity.
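One form of the distance consistent with the definitions above, assuming half the squared Euclidean distance between the committee vote and the ideal output, is:

```latex
d_c = \frac{1}{2} \sum_{i=1}^{4} \left( o_i - \delta_{i,c} \right)^2
```

For two ideal outputs of different categories, such as (1, 0, 0, 0) and (0, 1, 0, 0), the sum equals 2 and so d_c = 1, which matches the stated normalization. A square-root variant would satisfy the same normalization, so the exact form should be treated as an assumption.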

Based on the validation set, an empirical probability distribution of distances for each cancer type was generated.

The empirical probability distributions are built using each ANN model independently.

Thus, the number of entries in each distribution is 1250 multiplied by the number of samples belonging to the cancer type.

For a given test sample it is thus possible to reject possible classifications based on these probability distributions.

Hence, for each disease category, a cutoff distance from the ideal sample was defined, within which a sample of that category is expected to fall.

The distance given by the 95th percentile of the probability distribution was chosen.

This is the basis of diagnosis: a sample that falls outside the cutoff distance cannot be confidently diagnosed.
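The rejection rule can be sketched as follows. Several choices here are assumptions rather than details from the slides: the distance is taken as half the squared Euclidean distance to the ideal output, the percentile uses the nearest-rank rule, and all function names are hypothetical.

```python
def distance_to_ideal(votes, c):
    """Half the squared distance from `votes` to the ideal output of class index c."""
    return 0.5 * sum((o - (1.0 if i == c else 0.0)) ** 2 for i, o in enumerate(votes))

def percentile_cutoff(distances, percent=95):
    """Empirical percentile by the nearest-rank rule (an assumed convention)."""
    ordered = sorted(distances)
    rank = max(1, -(-percent * len(ordered) // 100))   # integer ceiling
    return ordered[rank - 1]

def diagnose(votes, cutoffs):
    """Return the winning class index, or None if it lies beyond its cutoff."""
    c = max(range(len(votes)), key=lambda i: votes[i])
    return c if distance_to_ideal(votes, c) <= cutoffs[c] else None

# Illustrative cutoffs; in the procedure above they come from validation distances.
cutoffs = [0.2, 0.2, 0.2, 0.2]
confident = diagnose([0.9, 0.05, 0.05, 0.0], cutoffs)   # close to the ideal EWS output
rejected = diagnose([0.4, 0.3, 0.2, 0.1], cutoffs)      # no class is close enough
```

The second sample still has a largest vote, but its distance of 0.25 exceeds the cutoff, so the sketch returns None instead of a diagnosis.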

Diagnostic Classification and Hierarchical Clustering

The diagnostic capabilities of all 3750 ANN models were then tested using the 25 blinded test samples.

A sample is classified to a diagnostic category if it receives the highest vote for that category. Because this classifier has only four possible outputs, all samples will be classified to one of the four categories.

If a sample falls outside the 95th percentile of the probability distribution of distances between samples and their ideal output (for example, for EWS it is EWS = 1, RMS = NB = BL = 0), its diagnosis is rejected.

Using the 3750 ANN models calibrated with the 96 genes, 100% classification was achieved for the 20 SRBCT test samples. Furthermore, all 5 non-SRBCT samples were excluded from each of the four diagnostic categories, since they fell outside the 95th percentile.

Hierarchical clustering using the 96 genes identified from the ANN models correctly clustered all 20 of the test samples.
