differentially expressed genes, class discovery & classification


Post on 20-Dec-2015


Page 1: Differentially Expressed Genes, Class Discovery & Classification

Differentially Expressed Genes, Class Discovery & Classification

Page 2: Differentially Expressed Genes, Class Discovery & Classification

Finding Differentially Expressed Genes

Two types of motivation:

“Direct”:
• Relate the genes to known biology: functions, pathways, etc.
• Infer their role, the mechanisms governing the process, etc.

“Indirect”: use as a “pruning stage” for tools that perform learning tasks:
• Infer regulatory mechanisms and relations
• Classification (disease vs. normal, disease subtypes)

Page 3: Differentially Expressed Genes, Class Discovery & Classification

Example: Tumor vs. Normal tissues

Identify differentially expressed genes:
• Diagnostic markers
• Therapeutic targets
• Understanding the disease process

[Figure: normal samples vs. tumor samples, with gene groups labeled "Overexpressed" and "Underexpressed". Non-small cell lung carcinomas; Sheba Medical Center, U. of Colorado Medical Center.]

Page 4: Differentially Expressed Genes, Class Discovery & Classification

What We Need

Score the genes, hopefully in a meaningful way.

Attach a measure of statistical significance to the score so we can:
• Choose a subset of genes “wisely”
• Have a measure of how strong our signal is

Page 5: Differentially Expressed Genes, Class Discovery & Classification

Simplest Score: Fold Change

[Figure: avg. expression in tumors (vertical axis) vs. avg. expression in normal lung (horizontal axis), with the 2-fold change marked.]

2-fold up: 761 genes
2-fold down: 272 genes
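As a minimal sketch of the fold-change score, assuming strictly positive expression values on a linear scale and a symmetric cutoff (2-fold up, 1/2-fold down). The function names and the sample values are hypothetical and are not taken from the lung data set.

```python
from statistics import mean

def fold_change(tumor, normal):
    """Ratio of average tumor expression to average normal expression."""
    return mean(tumor) / mean(normal)

def classify_fold(tumor, normal, cutoff=2.0):
    """Label a gene 'up', 'down' or 'unchanged' at the given fold cutoff."""
    fc = fold_change(tumor, normal)
    if fc >= cutoff:
        return "up"
    if fc <= 1.0 / cutoff:
        return "down"
    return "unchanged"

# Hypothetical expression values for three genes.
print(classify_fold([400, 600], [100, 150]))   # average ratio is 4.0
print(classify_fold([50, 50], [200, 200]))     # average ratio is 0.25
print(classify_fold([100, 110], [100, 100]))   # average ratio is 1.05
```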

Page 6: Differentially Expressed Genes, Class Discovery & Classification

Fold Change: problems

• Not reliable at the low end of the scale (“0/0” effects, large variance)
• Sensitive to outliers

Variant: “pairwise fold change”
• Compute the fold change over all possible sample pairs
• If the change exceeds the cutoff in, e.g., 75% of the pairs, call the gene significant
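A hedged sketch of the pairwise variant. It assumes that "all possible sample pairs" means every (tumor, normal) pair, that expression values are positive, and that the cutoff is a 2-fold ratio; the names and data are illustrative only.

```python
from itertools import product
from statistics import mean

def pairwise_fold_fraction(tumor, normal, cutoff=2.0):
    """Fraction of (tumor, normal) pairs whose ratio is at least `cutoff`."""
    pairs = list(product(tumor, normal))
    hits = sum(1 for t, n in pairs if t / n >= cutoff)
    return hits / len(pairs)

def is_significant(tumor, normal, cutoff=2.0, min_fraction=0.75):
    """Call the gene significant if enough pairs exceed the cutoff."""
    return pairwise_fold_fraction(tumor, normal, cutoff) >= min_fraction

# One outlier inflates the plain fold change but not the pairwise version.
tumor = [100, 100, 100, 10000]
normal = [100, 100, 100, 100]
print(mean(tumor) / mean(normal))             # plain fold change of the averages
print(pairwise_fold_fraction(tumor, normal))  # share of pairs with >= 2-fold change
print(is_significant(tumor, normal))
```

The example is meant to show the outlier sensitivity listed above: a single extreme sample drives the average-based fold change, while only a quarter of the pairs show a 2-fold change.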

Page 7: Differentially Expressed Genes, Class Discovery & Classification

Relevance Scores - TNoM
Beyond “fold change”

Both genes have >15 fold change

[Figure: expression of Gene 1 and Gene 2 in tumor and normal samples, log scale from 10 to 100000. Panels labeled "Uninformative Gene" and "Informative Gene", with values 5 and 0.]

TNoM (Total Number of Misclassifications) score: find the threshold that best separates tumors from normals, and count the number of errors committed there.

Page 8: Differentially Expressed Genes, Class Discovery & Classification

Scoring Informative Genes

Expression pattern of a gene: a
Pathological diagnosis information (annotation): L

v(a,L): a vector of +s and -s, ordered by the a values

+ + - - + + + - - + - - + + -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

Informative genes:

+ + + + + + + + - - - - - - -
- - - - - - - + + + + + + + +
- - - - + - - + + - + + + + +
etc.

Non-informative genes:

+ - + - + + + + - - + + - - -
- + + - + - - + + - + + - - +
+ - - - + + - + + - + + - + -
etc.

Page 9: Differentially Expressed Genes, Class Discovery & Classification

TNoM Score

Find the threshold that best separates tumors from normals, and count the number of errors committed there.

Ex 1:

- + + - + - - + + - + + - - +

# of errors = min(7,8) = 7. (Figure labels: 6, 7.)

Ex 2: A perfect single-gene classifier gets a score of 0.

+ + + + + + + + - - - - - - -

TNoM = 0
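The threshold search can be sketched as follows. This is an illustrative implementation, not the ScoreGenes code: it assumes labels are the strings '+' and '-', that expression values are distinct, and that both orientations of the rule (low side '+' or low side '-') are allowed, including a threshold outside the data range.

```python
def tnom_from_order(ordered):
    """TNoM of a label sequence already sorted by expression value."""
    n_plus = ordered.count('+')
    n_minus = len(ordered) - n_plus
    best = min(n_plus, n_minus)      # threshold placed outside the data range
    plus_left = minus_left = 0
    for label in ordered:            # threshold placed just after this sample
        if label == '+':
            plus_left += 1
        else:
            minus_left += 1
        # Rule "left is '-', right is '+'": '+' on the left and '-' on the right are errors.
        err_a = plus_left + (n_minus - minus_left)
        # Rule "left is '+', right is '-'": the mirror image.
        err_b = minus_left + (n_plus - plus_left)
        best = min(best, err_a, err_b)
    return best

def tnom(values, labels):
    """Sort samples by expression value, then score the label sequence."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    return tnom_from_order(ordered)

print(tnom_from_order("+ + + + + + + + - - - - - - -".split()))  # 0, a perfect separator
print(tnom([1.0, 2.0, 3.0, 4.0], ['-', '-', '+', '+']))           # 0
print(tnom_from_order(list("+-+-")))                              # 1
```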

Page 10: Differentially Expressed Genes, Class Discovery & Classification

TNoM vs. Fold Change

2-fold up: 761 genes
2-fold down: 272 genes

TNoM ≤ 3: 62 genes

[Figure: avg. expression in tumors vs. avg. expression in normal lung; legend: 2-fold change, TNoM ≤ 3, TNoM > 3.]

Page 11: Differentially Expressed Genes, Class Discovery & Classification

TNoM

Cons:
• One-sided vs. two-sided errors
• Absolute values ignored

For any given level s, we can efficiently compute

p-Val(s) = Prob( TNoM(V) ≤ s ),

where V is drawn uniformly over the appropriate space (H0: the gene expression values are independent of the labels).

Computed using DP (dynamic programming).
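To make the definition of p-Val(s) concrete, here is a sketch that computes it by brute-force enumeration of every labeling with the same number of '+' and '-' labels. This is deliberately not the dynamic-programming method mentioned above: it is only practical for small sample counts and is meant as a reference for what the DP computes. All names are hypothetical.

```python
from itertools import combinations
from math import comb

def tnom_from_order(ordered):
    """TNoM of a label sequence already sorted by expression value."""
    n_plus = ordered.count('+')
    n_minus = len(ordered) - n_plus
    best = min(n_plus, n_minus)
    plus_left = minus_left = 0
    for label in ordered:
        if label == '+':
            plus_left += 1
        else:
            minus_left += 1
        best = min(best,
                   plus_left + (n_minus - minus_left),
                   minus_left + (n_plus - plus_left))
    return best

def tnom_pvalue(s, n_plus, n_minus):
    """Prob(TNoM(V) <= s) when V is uniform over all labelings with
    n_plus '+' labels and n_minus '-' labels (exhaustive enumeration)."""
    n = n_plus + n_minus
    hits = 0
    for plus_positions in combinations(range(n), n_plus):
        chosen = set(plus_positions)
        ordered = ['+' if i in chosen else '-' for i in range(n)]
        if tnom_from_order(ordered) <= s:
            hits += 1
    return hits / comb(n, n_plus)

# With 4 '+' and 4 '-' samples, only 2 of the 70 labelings separate perfectly.
print(tnom_pvalue(0, 4, 4))
```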

Page 12: Differentially Expressed Genes, Class Discovery & Classification

Wilcoxon Rank Test

Another gene score which, similarly to TNoM:
• Ignores absolute values
• Takes into account only the order of the measurements

•Sort the expression values of both groups:

+ + - - + + + - - + - - + + -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15

•W(g) = sum of the ranks of the positive examples:

W(g) = 1 + 2 + 5 + 6 + 7 + 10 + 13 + 14 = 58
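The rank sum for the example vector can be checked with a few lines of code. This sketch assumes the labels are already sorted by expression value and that ranks start at 1; the function name is illustrative.

```python
def rank_sum(ordered_labels, positive='+'):
    """Sum of the 1-based ranks of the positive samples."""
    return sum(rank
               for rank, label in enumerate(ordered_labels, start=1)
               if label == positive)

labels = "+ + - - + + + - - + - - + + -".split()
print(rank_sum(labels))  # 58
```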

Page 13: Differentially Expressed Genes, Class Discovery & Classification

Wilcoxon Rank Test

• A common test in statistics
• Again, we can compute p-values given the null hypothesis H0

P(W(g) > s | n, k) = the probability of getting a score > s given a total of n samples, of which k are labeled (+).
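One way to obtain this probability exactly, sketched under the assumption of no tied expression values: under H0 every choice of k ranks out of 1..n is equally likely, so it suffices to count the k-element subsets by their rank sum. The counting below is a small subset-sum style dynamic program; function names are hypothetical.

```python
from math import comb

def rank_sum_counts(n, k):
    """counts[w] = number of k-element subsets of {1..n} whose ranks sum to w."""
    max_sum = n * (n + 1) // 2
    table = [[0] * (max_sum + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for r in range(1, n + 1):
        # Go from large j to small j so each rank r is used at most once.
        for j in range(min(k, r), 0, -1):
            for w in range(r, max_sum + 1):
                table[j][w] += table[j - 1][w - r]
    return table[k]

def wilcoxon_upper_pvalue(s, n, k):
    """P(W > s | n, k) for a non-negative integer s under the null hypothesis."""
    counts = rank_sum_counts(n, k)
    return sum(counts[s + 1:]) / comb(n, k)

# Upper-tail probability for the example above: W = 58, n = 15 samples, k = 8 positives.
print(wilcoxon_upper_pvalue(58, 15, 8))
```

This is a one-sided (upper-tail) probability; a two-sided test would also need the lower tail.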

Page 14: Differentially Expressed Genes, Class Discovery & Classification

SAM (Tusher et al., PNAS 01)

where a = (1/n1 + 1/n2)/(n1 + n2 - 2)

•d(i) is exactly the two-sample (unpaired, pooled-variance) t-statistic

•Tests the assumption: are the means of the two processes the same?

•Underlying assumption: two normal distributions

•p-values are known: the t-distribution
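The formula for d(i) itself does not survive in the transcript. The sketch below follows the definition in Tusher et al. (2001), d(i) = (mean1 - mean2) / (s(i) + s0), with s(i) the square root of a times the summed squared deviations within both groups. The parameter s0 is the small stabilizing constant from that paper; it defaults to 0 here, in which case d(i) is the pooled-variance two-sample t-statistic. Function and argument names are illustrative.

```python
from math import sqrt
from statistics import mean

def sam_d(group1, group2, s0=0.0):
    """Relative difference d for one gene; with s0 = 0 this is the
    pooled-variance two-sample t-statistic."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    squared_dev = (sum((x - m1) ** 2 for x in group1)
                   + sum((x - m2) ** 2 for x in group2))
    s = sqrt(a * squared_dev)
    return (m1 - m2) / (s + s0)

# Hypothetical expression values for one gene in two groups of three samples.
print(sam_d([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```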

Page 15: Differentially Expressed Genes, Class Discovery & Classification

SAM – Alternative to P-Value

• P-value relies on t-test assumptions - problematic

• Can we assess the significance of d(i) without parametric assumptions?

• Define a “balanced” permutation: a division of the samples into 2 groups, where in each group the number of ‘+’ and ‘-’ is balanced

• Perform all possible “balanced” permutations p of the data (M in total) and compute:

$$d_E(i) = \frac{1}{M} \sum_{p} d_p(i)$$

$$\Delta(i) = d(i) - d_E(i)$$
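A hedged sketch of the permutation step. It follows the procedure of Tusher et al. (2001), where within each permutation p the statistics are sorted so that d_p(i) is the i-th largest value, and the expected value d_E(i) is the average of d_p(i) over all balanced permutations. It assumes both groups have an even number of samples and that no gene has zero variance in any split; the data and all names are made up for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def sam_d(g1, g2, s0=0.0):
    """Pooled-variance relative difference for one gene."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = mean(g1), mean(g2)
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    return (m1 - m2) / (sqrt(a * ss) + s0)

def balanced_splits(n1, n2):
    """Yield (group_a, group_b) index lists; group_a takes half of the
    original group 1 and half of the original group 2."""
    total = n1 + n2
    for a1 in combinations(range(n1), n1 // 2):
        for a2 in combinations(range(n1, total), n2 // 2):
            group_a = list(a1) + list(a2)
            chosen = set(group_a)
            group_b = [j for j in range(total) if j not in chosen]
            yield group_a, group_b

def expected_d(data, n1):
    """Average, over all balanced splits, of the sorted per-gene statistics."""
    n2 = len(data[0]) - n1
    sums = [0.0] * len(data)
    m = 0
    for a, b in balanced_splits(n1, n2):
        d_p = sorted((sam_d([g[j] for j in a], [g[j] for j in b]) for g in data),
                     reverse=True)
        sums = [s + d for s, d in zip(sums, d_p)]
        m += 1
    return [s / m for s in sums]

# Three hypothetical genes, 4 samples per group (group 1 first).
data = [
    [1.0, 1.3, 0.8, 1.1, 3.0, 3.4, 2.9, 3.2],
    [2.0, 2.5, 1.7, 2.2, 2.1, 1.9, 2.6, 2.3],
    [5.0, 4.6, 5.3, 4.9, 1.0, 1.4, 0.7, 1.2],
]
d_obs = sorted((sam_d(g[:4], g[4:]) for g in data), reverse=True)
d_exp = expected_d(data, n1=4)
delta = [d - e for d, e in zip(d_obs, d_exp)]
print(delta)
```

With 4 samples per group there are 36 balanced splits, so M = 36 in this toy setting.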

Page 16: Differentially Expressed Genes, Class Discovery & Classification

False Discovery Rate for SAM

• Genes with Δ(i) above a given threshold are called significant

• FDR (false discovery rate) = the % of genes passing as “significant” that are expected to be false positives

• Each threshold on Δ(i) can be given an FDR value:

• compute the avg. number of FP (false positives) crossing this threshold in the permuted sets, then divide it by the number of genes crossing the threshold in the actual data
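The FDR estimate for one threshold can be sketched as below. The inputs are assumed to be precomputed scores: one list for the actual data and one list per permuted data set. The cap at 1.0 and the convention of returning 0 when no gene is called are choices made for this sketch, and the numbers are invented.

```python
def estimate_fdr(observed, permuted_sets, threshold):
    """Average number of permuted scores crossing the threshold, divided by
    the number of observed scores crossing it."""
    called = sum(1 for x in observed if x >= threshold)
    if called == 0:
        return 0.0  # nothing is called significant at this threshold
    false_positives = [sum(1 for x in perm if x >= threshold)
                       for perm in permuted_sets]
    avg_fp = sum(false_positives) / len(permuted_sets)
    return min(1.0, avg_fp / called)

# Hypothetical scores: one observed set and two permuted sets.
observed = [5.0, 4.0, 3.0, 1.0, 0.5]
permuted_sets = [
    [2.0, 1.0, 0.5, 0.2, 0.1],
    [3.5, 1.0, 1.0, 0.5, 0.2],
]
print(estimate_fdr(observed, permuted_sets, threshold=3.0))
```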

Page 17: Differentially Expressed Genes, Class Discovery & Classification

Different Scores

• TNoM
• Info
• Wilcoxon
• t-test
• Fold change

Different scores and null hypotheses (parametric, non-parametric, etc.)

All can be found in the ScoreGene package:

http://www.cs.huji.ac.il/labs/compbio/scoregenes/

Can we assess which scoring method is the best for our case?

Page 18: Differentially Expressed Genes, Class Discovery & Classification

Overabundance Analysis

[Figure: Lung Cancer Data - actual and expected TNoM score distributions. One panel shows the number of genes vs. TNoM score (expected distribution and actual distribution); the other shows -log(Binomial surprise) vs. TNoM score.]

• Data on 30 samples from normal and tumor lung tissues.
• ~7000 genes.
• Naftali Kaminski’s lab, Sheba Medical Center

Page 19: Differentially Expressed Genes, Class Discovery & Classification

Why Test Overabundance?

Tests how informative a set of genes is with respect to a given classification of the data and a scoring method.

Can be used to compare different:
• gene scoring methods
• normalization methods

Page 20: Differentially Expressed Genes, Class Discovery & Classification

Comparing Normalization Methods

Page 21: Differentially Expressed Genes, Class Discovery & Classification

Why Test Overabundance?

But also: a method to discover new classes in the data

Intuition: biologically meaningful partitions will have a high overabundance of informative genes

Page 22: Differentially Expressed Genes, Class Discovery & Classification

Overabundance Analysis in Class Discovery

Biologically meaningful partitions.

Overabundance of informative genes

•Score genes
•Count
•Compare to random

[Figures: for each data set, the number of genes vs. TNoM score (expected distribution and actual distribution) and -log(Binomial surprise) vs. TNoM score.]
• AML/ALL
• BRCA1/2: Breast Cancer BRCA1/BRCA2 data
• Melanoma: Melanoma38-1

Page 23: Differentially Expressed Genes, Class Discovery & Classification

Class Discovery Approach

Seek partitions with a statistically significant overabundance of informative genes.

Use local search techniques, e.g.:

•Steepest ascent

•Simulated annealing
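A minimal sketch of steepest ascent over two-class partitions. The neighborhood (flip one sample's label) and the toy score are assumptions made for illustration; in the setting of these slides the score would be the overabundance of informative genes for the candidate partition, which is not implemented here.

```python
def steepest_ascent(labels, score):
    """Repeatedly apply the single label flip that most improves `score`;
    stop when no flip improves it."""
    current = list(labels)
    best = score(current)
    while True:
        candidates = []
        for i in range(len(current)):
            neighbor = current[:]
            neighbor[i] = 1 - neighbor[i]      # move sample i to the other class
            candidates.append((score(neighbor), neighbor))
        top_score, top = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            return current, best               # local optimum reached
        current, best = top, top_score

# Toy stand-in for a partition score: agreement with a hidden partition,
# ignoring which class is called 0 and which is called 1.
hidden = [0, 0, 0, 1, 1, 1]

def toy_score(labels):
    matches = sum(1 for a, b in zip(labels, hidden) if a == b)
    return max(matches, len(labels) - matches)

result, value = steepest_ascent([0, 1, 0, 1, 0, 1], toy_score)
print(result, value)
```

Steepest ascent stops at the first local optimum; simulated annealing differs in that it sometimes accepts a worse neighbor in order to escape such optima.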

Page 24: Differentially Expressed Genes, Class Discovery & Classification

Scoring a Partition

At a given score level s, set p = p-Val(s). Suppose that in the data we observe n(s) genes with score ≤ s.

The number of genes with score ≤ s that we observe for uniformly and independently drawn labeling vectors is a random variable N(s) with

N(s) ~ Binom(n, p),

where n is the total number of genes.

The surprise rate at s is defined as

$$\sigma(s) = \mathrm{Prob}\big(N(s) \ge n(s)\big) = \sum_{k=n(s)}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}.$$

Finally, the max surprise score for the suggested partition is $\max_s \sigma(s)$.
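The binomial tail Prob(N(s) ≥ n(s)) with N(s) ~ Binom(n, p) can be evaluated as sketched below. The sum is done in log space because with thousands of genes the individual terms underflow. The sketch assumes 0 < p < 1. Treating the most surprising level as the one with the largest -log of the tail probability is also an assumption here, based on the -log(Binomial surprise) axis in the figures. The counts and p-values in the example are invented.

```python
import math

def log_binom_pmf(k, n, p):
    """log of C(n, k) * p^k * (1 - p)^(n - k), assuming 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def neg_log_surprise(n_obs, n, p):
    """-log Prob(N >= n_obs) for N ~ Binom(n, p)."""
    if n_obs <= 0:
        return 0.0                       # the tail probability is 1
    if n_obs > n:
        return math.inf                  # the tail probability is 0
    logs = [log_binom_pmf(k, n, p) for k in range(n_obs, n + 1)]
    top = max(logs)
    log_tail = top + math.log(sum(math.exp(x - top) for x in logs))
    return max(0.0, -log_tail)

def binomial_tail(n_obs, n, p):
    """Prob(N >= n_obs); may underflow to 0.0 for very surprising counts."""
    return math.exp(-neg_log_surprise(n_obs, n, p))

# Hypothetical numbers: score level -> (p-Val(s), observed n(s)), 7000 genes.
levels = {0: (1e-6, 3), 1: (1e-4, 20), 2: (1e-3, 40)}
surprise = {s: neg_log_surprise(n_s, 7000, p) for s, (p, n_s) in levels.items()}
most_surprising = max(surprise, key=surprise.get)
print(surprise, most_surprising)
```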

Page 25: Differentially Expressed Genes, Class Discovery & Classification

Overabundance & Max-Surprise

Page 26: Differentially Expressed Genes, Class Discovery & Classification

Example: Survival Prediction

[Figure: two survival plots, "All Patients" and "Good Prognosis Patients"; horizontal axis 0 to 12, vertical axis 0 to 1.]

All Patients: 32 patients, 17 deaths; 8 patients, 5 deaths
Good Prognosis Patients: 19 patients, 6 deaths; 5 patients, 3 deaths

Page 27: Differentially Expressed Genes, Class Discovery & Classification

Class 2

[Figure: two survival plots, "All Patients" and "Good Prognosis Patients"; horizontal axis 0 to 12, vertical axis 0 to 1.]

All Patients: 22 patients, 13 deaths; 18 patients, 9 deaths
Good Prognosis Patients: 10 patients, 2 deaths; 14 patients, 7 deaths

Page 28: Differentially Expressed Genes, Class Discovery & Classification

Class 3

[Figure: two survival plots, "All Patients" and "Good Prognosis Patients"; horizontal axis 0 to 12, vertical axis 0 to 1.]

All Patients: 17 patients, 7 deaths; 23 patients, 15 deaths
Good Prognosis Patients: 12 patients, 6 deaths; 12 patients, 3 deaths

Page 29: Differentially Expressed Genes, Class Discovery & Classification

Tissue Classification

Given a set of labeled samples, we can try to classify a new sample

Supervised methods: SVM, Adaboost, Naïve Bayes

Semi-supervised methods: Clustering

Issues:
• Evaluating the methods
• Feature selection
• Sample contamination/composition

Page 30: Differentially Expressed Genes, Class Discovery & Classification

Evaluating Classification

LOOCV (leave-one-out cross-validation):

For all samples i = 1…M:
• Take sample i out
• Learn from the M-1 remaining samples
• Test on sample i

[Figure: normal and tumor samples, with one mislabeled sample marked.]
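The LOOCV loop can be sketched as follows. A nearest-centroid rule stands in for the classifier purely to keep the example self-contained; it is not one of the methods named in these slides, and the two-gene expression profiles are invented.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to `sample`."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(sample, center))

    return min(centroids, key=lambda label: dist2(centroids[label]))

def loocv_errors(xs, ys, predict=nearest_centroid_predict):
    """Leave each sample out in turn, train on the rest, test on the left-out one."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors

# Hypothetical two-gene expression profiles.
xs = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
      [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]]
ys = ["normal"] * 3 + ["tumor"] * 3
print(loocv_errors(xs, ys))

# A sample that looks normal but is labeled as a tumor shows up as an LOOCV error.
mislabeled = ["tumor"] + ys[1:]
print(loocv_errors(xs, mislabeled))
```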

Page 31: Differentially Expressed Genes, Class Discovery & Classification

Feature Selection

How many of the informative genes do we choose for our classifier?

A question of choosing a cutoff.

[Figure: % prediction errors (0 to 35) vs. p-value threshold for selection (1e-008 to 1, log scale).]

• Positive predictive value: 97%
• 14 - 2000 genes
• 1 misclassification
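Choosing the cutoff amounts to keeping the genes whose p-value falls at or below a threshold and watching how the selected set grows as the threshold is relaxed. A small sketch, with invented gene names and p-values:

```python
def select_genes(p_values, threshold):
    """Names of the genes whose p-value is at most `threshold`."""
    return [gene for gene, p in p_values.items() if p <= threshold]

# Hypothetical per-gene p-values.
p_values = {"g1": 1e-7, "g2": 0.003, "g3": 0.2, "g4": 4e-5}

for threshold in (1e-6, 1e-4, 1e-2, 1.0):
    chosen = select_genes(p_values, threshold)
    print(threshold, len(chosen), chosen)
```

In a full evaluation, each threshold's gene set would be fed to the classifier and its error rate estimated, for example with leave-one-out cross-validation.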

Page 32: Differentially Expressed Genes, Class Discovery & Classification

Tissue Composition

[Figure panels:]
• Small cell lung carcinoma
• Lung metastasis
• Serous carcinoma
• Lung adenocarcinoma

Page 33: Differentially Expressed Genes, Class Discovery & Classification

Tissue Composition

The tissue is composed of many cell types (tumor, blood, muscle, …)

The arrayed samples are not always pure!

Major difference: differentially expressed genes which are:
• Causes of the disease state
• Outcomes of the disease state

Page 34: Differentially Expressed Genes, Class Discovery & Classification

Summary

Many methods for choosing differentially expressed genes

These can be compared, e.g. using overabundance tests

Overabundance can also be used for new class discovery

Expression patterns can be used to classify a tissue