classification of microarray data. sample preparation hybridization array design probe design...

Classification of Microarray Data

Sample PreparationHybridization

Array designProbe design

QuestionExperimental Design

Buy Chip/Array

Statistical AnalysisFit to Model (time series)

Expression IndexCalculation

Advanced Data Analysis

Clustering PCA Classification Promoter Analysis

Meta analysis Survival analysis Regulatory Network

Normalization

Image analysis

The DNA Array Analysis Pipeline

ComparableGene Expression Data

Outline

• Example: Childhood leukemia

• Classification– Method selection

– Feature selection

– Cross-validation

– Training and testing

• Example continued: Results from a leukemia study

Childhood Leukemia

• Cancer in the cells of the immune system

• Approx. 35 new cases in Denmark every year

• 50 years ago – all patients died

• Today – approx. 78% are cured

• Risk groups:

– Standard, Intermediate, High, Very high, Extra high

• Treatment

– Chemotherapy

– Bone marrow transplantation

– Radiation

Prognostic Factors

Good prognosis Poor prognosis

Immuno-phenotype precursor B T

Age 1-9 ≥10

Leukocyte count Low (<50*109/L) High (>100*109/L)

Number of chromosomes Hyperdiploidy (>50)

Hypodiploidy (<46)

Translocations t(12;21) t(9;22), t(1;19)

Treatment response Good response:Low MRD

Poor response:High MRD

Prognostic factors:

ImmunophenotypeAgeLeukocyte countNumber of chromosomesTranslocationsTreatment response

Risk Classification Today

Risk group:

StandardIntermediateHighVery highExtra High

Patient:

Clinical data

Immuno-phenotype

Morphology

Genetic measurements

Microarraytechnology

Study of Childhood Leukemia

• Diagnostic bone marrow samples from leukemia patients

• Platform: Affymetrix Focus Array– 8793 human genes

• Immunophenotype– 18 patients with precursor B immunophenotype

– 17 patients with T immunophenotype

• Outcome 5 years from diagnosis– 11 patients with relapse

– 18 patients in complete remission

Reduction of data

• Dimension reduction– PCA

• Feature selection (gene selection)– Significant genes: t-test

– Selection of a limited number of genes

Microarray DataClass precursorB T ...

Patient1 Patient2 Patient3 Patient4 ...

Gene1 1789.5 963.9 2079.5 3243.9 ...

Gene2 46.4 52.0 22.3 27.1 ...

Gene3 215.6 276.4 245.1 199.1 ...

Gene4 176.9 504.6 420.5 380.4

Gene5 4023.0 3768.6 4257.8 4451.8

Gene6 12.6 12.1 37.7 38.7

... ... ... ... ... ...

Gene8793 312.5 415.9 1045.4 1308.0

Outcome: PCA on all Genes

Outcome Prediction: CC against the Number of Genes

-0.1

0.1

0.3

0.5

0.7

0.9

0 20 40 60 80 100 120 140

Number of top ranking genes

Co

rrel

atio

n c

oef

fici

ent

(CC

)

Nearest centroid

• Assumptions:– Data e.g.. Gaussian distributed

– Variance and covariance the same for all classes

• Calculation of class probability

Linear discriminant analysis

• Calculation of a centroid for each class

• Calculation of the distance between a test sample and each class centroid

• Class prediction by the nearest centroid method

Nearest Centroid

kCj kijik nxx /

• Based on distance measure– For example Euclidian distance

• Parameter k = number of nearest neighbors– k=1– k=3– k=... – Always an odd number

• Prediction by majority vote

K-Nearest Neighbor

Support Vector Machines

• Machine learning

• Relatively new and highly theoretic

• Works on non-linearly separable data

• Finding a hyperplane between the two classes by minimizing of the distance between the hyperplane and closest points

Neural Networks

Gene1 Gene2 Gene3 ...

B T

Comparison of Methods

Linear discriminant analysis

Nearest centroid

KNN

Neural networks

Support vector machines

Simple method

Based on distance calculation

Good for simple problems

Good for few training samples

Distribution of data assumed

Advanced methods

Involve machine learning

Several adjustable parameters

Many training samples required

(e.g.. 50-100)

Flexible methods

Cross-validation

Given data with 100 samples

Cross-5-validation:Training: 4/5 of data (80 samples)Testing: 1/5 of data (20 samples)- 5 different models and performance results

Leave-one-out cross-validation (LOOCV)Training: 99/100 of data (99 samples)Testing: 1/100 of data (1 sample)- 100 different models

Validation

• Definition of – True and False positives

– True and False negatives

Actual class B B T T

Predicted class B T T B

TP FN TN FP

Sensitivity: the ability to detect "true positives"

TP

TP + FN

Specificity: the ability to avoid "false positives"

TN

TN + FP

Positive Predictive Value (PPV)

TP

TP + FP

Sensitivity and Specificity

Agnieszka

The most sensitive search finds all true matches, but might have lots of false positivesthe probability that the test is positive given that the patient is sick.

Agnieszka

The most specific search will return only true matches, but might have lots of false negativesthe probability that the test is negative given that the patient is not sick.

Agnieszka

In practice, there often is a tradeoff, and you can't achieve both.When one chooses which algorithm to use, there is a trade off between these two characters. It quite trivial to create an algorithms which will optimize one of these properties, the problem is to create algorithm which will achieve both of them.

Accuracy

• Definition: TP + TN

TP + TN + FP + FN

• Range: [0 … 1]

• Definition: TP·TN - FN·FP

√(TN+FN)(TN+FP)(TP+FN)(TP+FP)

• Range: [-1 … 1]

Matthews correlation coefficient

Overview of Classification

Expression data

Subdivision of data for cross-validationinto training sets and test sets

Feature selection (t-test)Dimension reduction (PCA)

Training of classifier: - using cross-validation

- choice of method- choice of optimal parameters

Testing of classifier Independent test set

Important Points

• Avoid over fitting

• Validate performance– Test on an independent test set

– Use cross-validation

• Include feature selection in cross-validation

Why?

– To avoid overestimation of performance!

– To make a general classifier

Multiclass classification

• Same prediction methods– Multiclass prediction often implemented

– Often bad performance

• Solution– Division in small binary (2-class) problems

Study of Childhood Leukemia: Results

• Classification of immunophenotype (precursorB and T)– 100% accuracy

• During the training • When testing on an independent test set

– Simple classification methods applied• K-nearest neighbor• Nearest centroid

• Classification of outcome (relapse or remission)– 78% accuracy (CC = 0.59)– Simple and advanced classification methods applied

Summary

• When do we use classification?• Reduction of dimension often necessary

– T-test, PCA

• Methods– K nearest neighbor– Nearest centroid– Support vector machines– Neural networks

• Validation– Cross-validation– Accuracy and correlation coefficient

Coffee break

Next: Exercise

Prognostic factors:

ImmunophenotypeAgeLeukocyte countNumber of chromosomesTranslocationsTreatment response

Risk classification in the future ?

Risk group:

StandardIntermediateHighVery highEkstra high

Patient:

Clincal data

Immunopheno-typing MorphologyGenetic measurements

Microarraytechnology

Customized treatment

classification of microarray data. sample preparation hybridization array design probe design...

Documents

leukemia study slide

childhood leukemia cancer

outline example

testing example

new cases

immune system

leukocyte countlow

cured risk groups