Differentially Expressed Genes, Class Discovery & Classification

Finding Differentially Expressed Genes

Two types of motivation:

“Direct”:
• Relate the genes to known biology: functions, pathways, etc.
• Infer their role, the mechanisms governing the process, etc.

“Indirect”: use as a “pruning stage” for tools that perform learning tasks:
• Infer regulatory mechanisms and relations
• Classification (disease vs. normal, disease subtypes)

Example: Tumor vs. Normal tissues

Identify differentially expressed genes:
• Diagnostic markers
• Therapeutic targets
• Understanding the disease process

[Figure: normal samples vs. tumor samples, with over-expressed and under-expressed genes marked.]

Non-small cell lung carcinomas. Sheba Medical Center; U. of Colorado Medical Center.

What We Need

Score the genes, hopefully in a meaningful way.

Attach a measure of statistical significance to the score, so we can:
• Choose a subset of genes “wisely”
• Have a measure of how strong our signal is

Simplest Score: Fold Change

[Figure: avg. expression in tumors (y-axis) vs. avg. expression in normal lung (x-axis), with the 2-fold change marked.]

2-fold up: 761 genes
2-fold down: 272 genes
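The fold-change score above can be sketched in a few lines. This is a minimal illustration, assuming the score is the ratio of group means expressed on a log2 scale (so 2-fold up is +1 and 2-fold down is -1); the function names and the toy expression values are made up for the example and are not taken from the lung data.

```python
import math
from statistics import mean

def log2_fold_change(tumor, normal):
    """log2(mean tumor / mean normal) for one gene; assumes positive values."""
    return math.log2(mean(tumor) / mean(normal))

def split_two_fold(genes):
    """genes: dict name -> (tumor values, normal values).

    Returns the names that are at least 2-fold up and at least 2-fold down.
    """
    up, down = [], []
    for name, (tumor, normal) in genes.items():
        lfc = log2_fold_change(tumor, normal)
        if lfc >= 1.0:
            up.append(name)
        elif lfc <= -1.0:
            down.append(name)
    return up, down

# Toy, made-up expression values (not from the slides).
genes = {
    "geneA": ([400, 420, 380], [100, 110, 90]),   # 4-fold up
    "geneB": ([50, 55, 45], [200, 210, 190]),     # 4-fold down
    "geneC": ([100, 120, 110], [100, 105, 95]),   # about 1.1-fold
}
up, down = split_two_fold(genes)
print(up, down)
```

Working on the log scale makes up- and down-regulation symmetric around zero, which is why the cutoffs are +1 and -1.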

Fold Change: problems

Not reliable at the low end of the scale (“0/0” effects – large variance)

Sensitive to outliers

Variant: “pairwise fold change”
• Compute the fold change over all possible sample pairs
• If in, e.g., 75% of the pairs the change exceeds the chosen threshold => significant
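A rough sketch of the pairwise variant follows. It assumes that "all possible sample pairs" means every (tumor, normal) pair, and it uses a 2-fold cutoff and the 75% fraction purely as illustrative defaults; the function names and the data are hypothetical.

```python
from itertools import product

def pairwise_fold_fraction(tumor, normal, fold=2.0):
    """Fraction of (tumor, normal) sample pairs whose ratio is at least `fold`."""
    pairs = list(product(tumor, normal))
    hits = sum(1 for t, n in pairs if t / n >= fold)
    return hits / len(pairs)

def is_pairwise_significant(tumor, normal, fold=2.0, min_fraction=0.75):
    """True when enough sample pairs show the required fold change."""
    return pairwise_fold_fraction(tumor, normal, fold) >= min_fraction

# One outlier sample inflates the mean-based fold change (575 / 100 = 5.75),
# but only 3 of the 12 pairs actually show a 2-fold change.
outlier_tumor = [100, 100, 100, 2000]
normal = [100, 100, 100]
print(pairwise_fold_fraction(outlier_tumor, normal))   # 0.25
print(is_pairwise_significant(outlier_tumor, normal))  # False
```

The example shows why the variant is less sensitive to outliers than a ratio of averages: a single extreme sample cannot move most of the pairs.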

Relevance Scores - TNoM

Beyond “fold change”

Both genes have >15 fold change

TNoM (Total Number of Misclassifications) score: find the threshold that best separates tumors from normals, and count the number of errors committed there.

[Figure: expression of Gene 1 and Gene 2 in tumor and normal samples, log scale from 10 to 100000. Labels: Uninformative Gene (5), Informative Gene (0).]

Expression pattern of a gene: a
Pathological diagnosis information (annotation): L
v(a,L): a vector of +s and -s, ordered by the a values

+  +  -  -  +  +  +  -  -  +   -   -   +   +   -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

Informative genes:

+ + + + + + + + - - - - - - -
- - - - - - - + + + + + + + +
- - - - + - - + + - + + + + +
etc.

Non-informative genes:

+ - + - + + + + - - + + - - -
- + + - + - - + + - + + - - +
+ - - - + + - + + - + + - + -
etc.

Scoring Informative Genes

Find the threshold that best separates tumors from normals, count the number of errors committed there.

Ex 1:

- + + - + - - + + - + + - - +

# of errors = min(7,8) = 7.

6 7

Ex 2: A perfect single-gene classifier gets a score of 0.

+ + + + + + + + - - - - - - -

TNoM score: 0
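The threshold search described above can be sketched as a single pass over the sorted label vector. This is an illustrative implementation under two assumptions: every position between consecutive samples (and before the first one) is a candidate threshold, and both orientations ("+ below, - above" and the reverse) are allowed. The names `tnom` and `label_vector` are hypothetical, and ties in expression values are ignored.

```python
def label_vector(values, labels):
    """Order the labels by increasing expression value (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [labels[i] for i in order]

def tnom(labels):
    """Minimum number of misclassifications over all thresholds.

    labels: sequence of '+' / '-' already sorted by expression value.
    """
    n_plus = sum(1 for lab in labels if lab == '+')
    n_minus = len(labels) - n_plus
    # Threshold before the first sample: everything gets one label.
    best = min(n_plus, n_minus)
    plus_left = minus_left = 0
    for lab in labels:
        if lab == '+':
            plus_left += 1
        else:
            minus_left += 1
        # Orientation A: predict '-' on the left, '+' on the right.
        err_a = plus_left + (n_minus - minus_left)
        # Orientation B: predict '+' on the left, '-' on the right.
        err_b = minus_left + (n_plus - plus_left)
        best = min(best, err_a, err_b)
    return best

perfect = "+ + + + + + + + - - - - - - -".split()
print(tnom(perfect))            # 0, a perfect single-gene classifier
print(tnom(list("+-+-")))       # 1
```

Because only the order of the samples matters, the score is unchanged by any monotone transformation of the expression values.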

TNoM Score

TNoM vs. Fold Change

[Figure: avg. expression in tumors vs. avg. expression in normal lung, with the 2-fold change marked and genes split into TNoM ≤ 3 and TNoM > 3.]

2-fold up: 761 genes
2-fold down: 272 genes
TNoM ≤ 3: 62 genes

Cons:
• One-sided vs. two-sided errors
• Absolute values ignored

TNoM

For any given level s, we can efficiently compute

$$\text{p-Val}(s) = \operatorname{Prob}\big(\mathrm{TNoM}(V) \le s\big),$$

where V is uniformly drawn over the appropriate space.

(H0: the gene expression values are independent of the labels)

Computed using DP (dynamic programming).
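For small sample counts the p-value can be checked by brute force. The sketch below is not the dynamic program mentioned above: it simply enumerates every label vector with k '+' labels among n samples, assuming that this is the "appropriate space" for V. A compact TNoM helper is included so the block runs on its own; all names are hypothetical.

```python
from itertools import combinations

def tnom(labels):
    """Minimum misclassifications over all thresholds and both orientations."""
    n_plus = sum(1 for lab in labels if lab == '+')
    n_minus = len(labels) - n_plus
    best = min(n_plus, n_minus)
    plus_left = minus_left = 0
    for lab in labels:
        plus_left += lab == '+'
        minus_left += lab != '+'
        best = min(best,
                   plus_left + (n_minus - minus_left),
                   minus_left + (n_plus - plus_left))
    return best

def tnom_pvalue(s, n, k):
    """Prob(TNoM(V) <= s) for V uniform over vectors with k '+' among n."""
    hits = total = 0
    for plus_positions in combinations(range(n), k):
        chosen = set(plus_positions)
        labels = ['+' if i in chosen else '-' for i in range(n)]
        total += 1
        hits += tnom(labels) <= s
    return hits / total

# With 4 samples and 2 '+' labels, only "++--" and "--++" score 0.
print(tnom_pvalue(0, 4, 2))
```

Enumeration grows as n choose k, so it is only practical for small n (for 15 samples with 8 '+' labels there are 6435 vectors); a dynamic program avoids this blow-up.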

Wilcoxon Rank Test

Another gene score which, similarly to TNoM:
• Ignores absolute values
• Takes into account only the order of the measurements

• Sort the expression values of both groups:

+  +  -  -  +  +  +  -  -  +   -   -   +   +   -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

• W(g) = sum of the ranks of the positive examples:

W(g) = 1 + 2 + 5 + 6 + 7 + 10 + 13 + 14 = 58

Wilcoxon Rank Test

• A common test in statistics
• Again, we can compute p-values given the null hypothesis H0

P(W(g) > s | n, k) = the probability of getting a score > s given a total of n samples, out of which k are labeled as (+).
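The rank-sum score and its null tail probability can be sketched as below. The tail is computed by enumerating all ways of placing k positive labels among n ranks, which is a brute-force stand-in suitable only for small n; function names are hypothetical and ties are not handled.

```python
from itertools import combinations

def rank_sum(labels):
    """W(g): sum of the 1-based ranks of the '+' samples in a sorted vector."""
    return sum(rank for rank, lab in enumerate(labels, start=1) if lab == '+')

def rank_sum_tail(s, n, k):
    """P(W > s | n, k) under H0, by enumerating all k-subsets of ranks 1..n."""
    hits = total = 0
    for ranks in combinations(range(1, n + 1), k):
        total += 1
        hits += sum(ranks) > s
    return hits / total

labels = "+ + - - + + + - - + - - + + -".split()
w = rank_sum(labels)
print(w)  # 58, matching the worked example above
print(rank_sum_tail(w, n=len(labels), k=labels.count('+')))
```

A small W means the positive samples sit at the low end of the ordering, so in practice both tails of the distribution are of interest.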

SAM (Tusher et al., PNAS 01)

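$$d(i) = \frac{\bar{x}_1(i) - \bar{x}_2(i)}{s(i) + s_0}, \qquad s(i) = \sqrt{a \Big[ \sum_{j} \big(x_j(i) - \bar{x}_1(i)\big)^2 + \sum_{k} \big(x_k(i) - \bar{x}_2(i)\big)^2 \Big]}$$

where a = (1/n1 + 1/n2)/(n1 + n2 - 2), the sums run over the samples of group 1 and group 2 respectively, and s0 is a small positive constant.

• With s0 = 0, d(i) is exactly the two-sample (pooled-variance) t-statistic
• Tests the assumption: are the means of the two processes the same?
• Underlying assumption: two normal distributions
• A known p-value: the t-distribution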
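As an illustration, the statistic can be computed directly from its definition. The sketch assumes the form given by Tusher et al., d(i) = (mean1 - mean2) / (s(i) + s0) with s(i) = sqrt(a * pooled sum of squares) and a = (1/n1 + 1/n2)/(n1 + n2 - 2); the function name, the default s0 = 0 and the toy values are choices made for this example.

```python
import math
from statistics import mean

def sam_d(x1, x2, s0=0.0):
    """Relative difference d(i) for one gene measured in two groups."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    # Pooled sum of squared deviations from each group's own mean.
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    return (m1 - m2) / (math.sqrt(a * ss) + s0)

# Toy values: with s0 = 0 this is the pooled-variance two-sample t-statistic.
print(sam_d([1, 2, 3], [4, 5, 6]))          # about -3.674
print(sam_d([1, 2, 3], [4, 5, 6], s0=0.5))  # shrunk towards 0
```

A positive s0 damps genes whose variance estimate happens to be very small, which would otherwise produce large statistics from negligible differences.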

SAM – Alternative to P-Value

• P-value relies on t-test assumptions - problematic

• Can we assess the significance of d(i) without parametric assumptions?

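• Define a “balanced” permutation: a division of the samples into 2 groups, where in each group the numbers of ‘+’ and ‘-’ are balanced

• Perform all possible “balanced” permutations p of the data and compute:

$$d_E(i) = \frac{1}{M} \sum_{p} d_p(i), \qquad \Delta(i) = d(i) - d_E(i)$$

where d_p(i) is the statistic computed under permutation p and M is the number of permutations.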

False Discovery Rate for SAM

• Genes with Δ(i) = d(i) - d_E(i) above a given threshold are called significant

• FDR (false discovery rate) = the % of genes passing as “significant” which are expected to be false positives

• Each threshold on Δ(i) can be given an FDR value:

• compute the avg. number of FP (false positives) crossing this threshold in the permuted sets
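The FDR estimate described above reduces to a ratio of two counts. The sketch below is a simplification: it applies the cutoff to the absolute value of each statistic rather than to Δ(i), and it takes the per-permutation statistics as given instead of generating balanced permutations. The function name and all numbers are made up for illustration.

```python
from statistics import mean

def fdr_at_threshold(observed, permuted, cutoff):
    """Estimate the FDR for calling genes with |statistic| > cutoff.

    observed: per-gene statistics from the real labels.
    permuted: list of per-gene statistic lists, one list per permutation.
    """
    called = sum(abs(d) > cutoff for d in observed)
    # Genes crossing the cutoff under permuted labels count as false positives.
    false_per_perm = [sum(abs(d) > cutoff for d in perm) for perm in permuted]
    expected_false = mean(false_per_perm)
    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, expected_false, fdr

# Made-up statistics for 6 genes and 4 permutations.
observed = [5.2, -4.8, 4.1, 1.0, -0.5, 0.3]
permuted = [
    [4.3, 0.2, -1.1, 0.9, -0.4, 0.1],
    [0.7, -0.3, 1.5, -2.0, 0.6, 0.2],
    [-4.6, 0.8, 0.1, -0.9, 1.2, 4.4],
    [0.5, -1.4, 0.3, 0.2, -0.8, 1.9],
]
called, expected_false, fdr = fdr_at_threshold(observed, permuted, cutoff=4.0)
print(called, expected_false, fdr)  # 3 0.75 0.25
```

Sweeping the cutoff and recording the FDR at each value gives the threshold-to-FDR mapping the slide refers to.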

Different Scores

• TNoM
• Info
• Wilcoxon
• t-test
• Fold change

Different scores and null hypotheses (parametric, non-parametric, etc.)

All can be found in the ScoreGene package: http://www.cs.huji.ac.il/labs/compbio/scoregenes/

Can we assess which scoring method is the best for our case?

Overabundance Analysis

[Figure: Lung Cancer Data - actual and expected TNoM scores distribution. One panel: number of genes vs. TNoM score (0 to 10), expected distribution and actual distribution. Other panel: -log(binomial surprise) vs. TNoM score.]

• Data on 30 samples from normal and tumor lung tissues.
• ~7000 genes.
• Naftali Kaminski’s lab, Sheba Medical Center

Why Test Overabundance?

Tests how informative a set of genes is w.r.t. a given classification of the data and a scoring method.

Can be used to compare different:
• gene scoring methods
• normalization methods

Comparing Normalization Methods

Why Test Overabundance?

But also: a method to discover new classes in the data

Intuition: biologically meaningful partitions will have a high overabundance of informative genes

Overabundance Analysis in Class Discovery

Biologically meaningful partitions.

Overabundance of informative genes:
• Score genes
• Count
• Compare to random

AML/ALL

[Figure: BRCA1/2 - Breast Cancer BRCA1/BRCA2 data, actual and expected TNoM scores distribution: number of genes vs. TNoM score (0 to 6), and -log(binomial surprise) vs. TNoM score.]

[Figure: Melanoma - Melanoma38-1, actual and expected TNoM scores distribution: number of genes vs. TNoM score (0 to 9), and -log(binomial surprise) vs. TNoM score.]

Class Discovery Approach

Seek partitions with statistically significant overabundance of informative genes.

Use local search techniques, e.g.:
• Steepest ascent
• Simulated annealing

Scoring a Partition

At a given score level s, set p = p-Val(s). Suppose that in the data we observe n(s) genes with score ≤ s.

The number of genes with score ≤ s that we observe for uniformly and independently drawn labeling vectors is a random variable N(s) with

$$N(s) \sim \mathrm{Binom}(n, p),$$

where n is the total number of genes.

The surprise rate at s is defined as

$$\sigma(s) = \operatorname{Prob}\big(N(s) \ge n(s)\big) = \sum_{k=n(s)}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}.$$

Finally, the max surprise score for the suggested partition is $\max_s \sigma(s)$.
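The binomial tail above underflows quickly when n is in the thousands, so a practical sketch works in log space. The code below reports -log of the tail probability, matching the "-log(Binomial surprise)" axis of the plots, so that larger values mean more surprise; that sign convention, the function names, and the numbers in the example are assumptions for illustration. It requires 0 < p < 1.

```python
import math

def log_binom_pmf(k, n, p):
    """log of C(n, k) * p^k * (1 - p)^(n - k), for 0 < p < 1."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def neg_log_surprise(n_obs, n, p):
    """-log Prob(N >= n_obs) for N ~ Binom(n, p), via log-sum-exp."""
    if n_obs <= 0:
        return 0.0  # the tail is the whole distribution
    logs = [log_binom_pmf(k, n, p) for k in range(n_obs, n + 1)]
    top = max(logs)
    log_tail = top + math.log(sum(math.exp(v - top) for v in logs))
    return -log_tail

def max_surprise(observed, pvals, n):
    """observed[s] = # genes with score <= s; pvals[s] = p-Val(s).

    Returns (largest -log surprise, the level s where it occurs).
    """
    return max((neg_log_surprise(observed[s], n, pvals[s]), s)
               for s in observed)

# Made-up counts and p-values for three score levels, n = 7000 genes.
observed = {0: 5, 1: 20, 2: 62}
pvals = {0: 0.0001, 1: 0.001, 2: 0.005}
print(max_surprise(observed, pvals, n=7000))
```

A local search over partitions would call `max_surprise` once per candidate labeling and keep the labeling with the largest value.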

Overabundance & Max-Surprise

Example: Survival Prediction

[Figure: survival plots (axes 0 to 12 and 0 to 1) in panels All Patients, Good Prognosis Patients, Class 2 and Class 3. Annotations: 32 patients, 17 deaths; 8 patients, 5 deaths; 5 patients, 3 deaths.]

Tissue Classification

Given a set of labeled samples, we can try to classify a new sample

Supervised methods: SVM, Adaboost, Naïve Bayes

Semi-supervised methods: Clustering

Issues:
• Evaluating the methods
• Feature selection
• Sample contamination/composition

Evaluating Classification

LOOCV (leave-one-out cross-validation):

For all samples i = 1…M:
• Take sample i out
• Learn from the M-1 remaining samples
• Test on sample i
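The loop above can be sketched with any classifier plugged in. Here a nearest-centroid rule stands in for the learner purely for brevity (the slides name SVM, Adaboost and Naïve Bayes, none of which is implemented here); the function names and the two-gene toy data are hypothetical.

```python
from statistics import mean

def nearest_centroid_predict(train_x, train_y, sample):
    """Assign `sample` to the class whose mean profile is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda lab: dist2(centroids[lab], sample))

def loocv_errors(x, y, predict=nearest_centroid_predict):
    """Indices of samples misclassified when each one is held out in turn."""
    errors = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]   # take sample i out
        train_y = y[:i] + y[i + 1:]
        if predict(train_x, train_y, x[i]) != y[i]:  # test on sample i
            errors.append(i)
    return errors

# Toy data: two genes per sample; sample 3 looks like a tumor
# but carries the label 'normal'.
x = [[1, 1], [2, 1], [1, 2], [9, 9.5], [8, 9], [9, 8], [9, 9], [8, 8]]
y = ['normal'] * 4 + ['tumor'] * 4
print(loocv_errors(x, y))  # [3]
```

Samples that are repeatedly misclassified under LOOCV are worth a second look, since they may be mislabeled rather than genuinely hard to classify.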

[Figure: normal and tumor samples, with a mislabeled sample marked.]

Feature Selection

How many of the informative genes do we choose for our classifier? A question of choosing a cutoff.

[Figure: % prediction errors (0 to 35) vs. p-value threshold for selection (1e-008 to 1).]

• 14 - 2000 genes
• 1 misclassification
• Positive predictive value: 97%

Tissue Composition

Small cell lung carcinoma

Lung metastasis

Serous carcinoma

Lung adenocarcinoma

Tissue Composition

The tissue is composed of many cell types (tumor, blood, muscle, …)

The arrayed samples are not always pure!

Major difference: differentially expressed genes which are:
• Causes of the disease state
• Outcome of the disease state

Summary

Many methods for choosing differentially expressed genes

These can be compared, e.g. using overabundance tests

Overabundance can also be used for new class discovery

Expression patterns can be used to classify a tissue
