differentially expressed genes, class discovery & classification
Post on 20-Dec-2015
Differentially Expressed Genes, Class Discovery & Classification
Finding Differentially Expressed Genes
Two types of motivation:
• “Direct”: relate the genes to known biology (functions, pathways, etc.) and infer their role, the mechanisms governing the process, etc.
• “Indirect”: use as a “pruning stage” for tools that perform learning tasks, such as inferring regulatory mechanisms and relations, or classification (disease vs. normal, disease subtypes).
Example: Tumor vs. Normal tissues
Identify differentially expressed genes:
• Diagnostic markers
• Therapeutic targets
• Understanding the disease process
[Figure: expression matrix of normal samples and tumor samples, with genes grouped as over-expressed and under-expressed.]
Data: non-small cell lung carcinomas; Sheba Medical Center, U. of Colorado Medical Center.
What We Need
Score the genes, hopefully in a meaningful way.
Attach a measure of statistical significance to the score, so that we can:
• Choose a subset of genes “wisely”
• Have a measure of how strong our signal is
Simplest Score: Fold Change
[Figure: scatter plot of avg. expression in tumors vs. avg. expression in normal lung, with the 2-fold change lines marked.]
2-fold up: 761 genes
2-fold down: 272 genes
Fold Change: problems
Not reliable at the low end of the scale (“0/0” effects – large variance)
Sensitive to outliers
Variant: “pairwise fold change”. Compute the fold change over all possible sample pairs (one sample from each group). If in, e.g., 75% of the pairs the change exceeds the threshold => significant.
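The two fold-change variants above can be sketched as follows. This is a minimal illustration, not the slides' implementation: the function names, the default 2-fold threshold and the toy values are assumptions, and expression values are assumed to be positive.

```python
from itertools import product
from statistics import mean

def fold_change(tumor, normal):
    """Ratio of the average tumor expression to the average normal expression."""
    return mean(tumor) / mean(normal)

def pairwise_fold_fraction(tumor, normal, fold=2.0):
    """Fraction of (tumor, normal) sample pairs whose ratio exceeds `fold`.

    `fold` is an illustrative threshold; the slides leave it unspecified.
    """
    pairs = list(product(tumor, normal))
    hits = sum(1 for t, n in pairs if t / n > fold)
    return hits / len(pairs)

# Toy values for a single gene (hypothetical data).
tumor = [4, 8]
normal = [1, 2]
print(fold_change(tumor, normal))             # ratio of the two averages
print(pairwise_fold_fraction(tumor, normal))  # compare against e.g. 0.75
```

The pairwise fraction is less sensitive to a single outlier sample than the ratio of averages, since one extreme value affects only the pairs it takes part in.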
Relevance Scores - TNoM
Beyond “fold change”
Both genes have >15 fold change
TNoM (Total Number of Misclassifications) score: find the threshold that best separates tumors from normals, and count the number of errors committed there.
[Figure: expression values (log scale, 10 to 100000) of tumor and normal samples for two genes. Gene 1: uninformative gene, labeled 5. Gene 2: informative gene, labeled 0.]
Expression pattern of a gene: a
Pathological diagnosis information (annotation): L
v(a,L): a vector of +s and -s, ordered by the a values
+ + - - + + + - - + - - + + -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
Informative genes
+ + + + + + + + - - - - - - -
- - - - - - - + + + + + + + +
- - - - + - - + + - + + + + +
etc.
Non-informative genes
+ - + - + + + + - - + + - - -
- + + - + - - + + - + + - - +
+ - - - + + - + + - + + - + -
etc.
Scoring Informative Genes
Find the threshold that best separates tumors from normals, count the number of errors committed there.
Ex 1:
- + + - + - - + + - + + - - +
# of errors = min(7,8) = 7. [Figure annotations: 6, 7]
Ex 2: A perfect single gene classifier gets a score of 0.
+ + + + + + + + - - - - - - -
[Figure annotation: 0]
TNoM Score
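A minimal sketch of the TNoM computation described above: sort the samples by expression, try every threshold position and both directions, and keep the smallest error count. The function name and the input layout (parallel lists of values and "+"/"-" labels) are assumptions, and ties between expression values are not treated specially.

```python
def tnom(values, labels):
    """Total Number of Misclassifications of the best threshold classifier.

    values: expression values of one gene, one per sample.
    labels: "+" or "-" for each sample, in the same order as values.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ordered = [labels[i] for i in order]
    n = len(ordered)
    n_pos = ordered.count("+")
    # Threshold below all samples: everything gets one label.
    best = min(n_pos, n - n_pos)
    pos_left = 0
    for t, lab in enumerate(ordered, start=1):
        if lab == "+":
            pos_left += 1
        neg_left = t - pos_left
        # Rule "left of the threshold is +, right is -".
        err_plus_left = neg_left + (n_pos - pos_left)
        # The opposite rule misclassifies exactly the other samples.
        err_minus_left = n - err_plus_left
        best = min(best, err_plus_left, err_minus_left)
    return best

perfect = "+ + + + + + + + - - - - - - -".split()
mixed = "+ + - - + + + - - + - - + + -".split()
ranks = list(range(1, 16))  # stand-in expression values a1 < a2 < ... < a15
print(tnom(ranks, perfect))  # 0
print(tnom(ranks, mixed))    # 5
```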
TNoM vs. Fold Change
2-fold up: 761 genes
2-fold down: 272 genes
TNoM ≤ 3: 62 genes
[Figure: avg. expression in tumors vs. avg. expression in normal lung; legend: 2-fold change, TNoM ≤ 3, TNoM > 3.]
Cons:
One-sided vs. two-sided errors
Absolute values ignored
TNoM
For any given level s, we can efficiently compute
p-Val(s) = Prob( TNoM(V) ≤ s ),
where V is drawn uniformly over the appropriate space of label vectors.
(H0: the gene expression values are independent of the labels.)
Computed using dynamic programming (DP).
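The p-value definition above can be checked on small examples by brute force. The sketch below enumerates every label vector with the given numbers of + and - labels, which is an assumption about what "the appropriate space" means; it is not the DP the slides refer to and is only feasible for small sample counts. All names are illustrative.

```python
from itertools import combinations
from math import comb

def tnom_of_ordered_labels(labels):
    """TNoM of a list of booleans (True = '+'), already ordered by expression."""
    n = len(labels)
    n_pos = sum(labels)
    best = min(n_pos, n - n_pos)
    pos_left = 0
    for t, is_pos in enumerate(labels, start=1):
        pos_left += is_pos
        err = (t - pos_left) + (n_pos - pos_left)  # rule: left is +, right is -
        best = min(best, err, n - err)             # n - err is the opposite rule
    return best

def tnom_pvalue_exact(n_pos, n_neg, s):
    """Prob(TNoM(V) <= s) for V uniform over all vectors with n_pos +s and n_neg -s."""
    n = n_pos + n_neg
    hits = 0
    for pos_idx in combinations(range(n), n_pos):
        chosen = set(pos_idx)
        labels = [i in chosen for i in range(n)]
        if tnom_of_ordered_labels(labels) <= s:
            hits += 1
    return hits / comb(n, n_pos)

# With 8 tumors and 7 normals there are 6435 label vectors; only 2 are perfect.
print(tnom_pvalue_exact(8, 7, 0))
```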
Wilcoxon Rank Test
Another gene score, which, similarly to TNoM:
• Ignores absolute values
• Takes into account only the order of the measurements
• Sort the expression values of both groups:
+ + - - + + + - - + - - + + -
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
• W(g) = sum of the ranks of the positive examples:
W(g) = 1 + 2 + 5 + 6 + 7 + 10 + 13 + 14 = 58
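The rank-sum computation above, as a short sketch. The function name and input layout are assumptions, and tied expression values are not given averaged ranks here, which a full implementation would need.

```python
def wilcoxon_rank_sum(values, labels, positive="+"):
    """Sum of the ranks (1 = lowest expression) of the positive samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return sum(rank for rank, i in enumerate(order, start=1)
               if labels[i] == positive)

labels = "+ + - - + + + - - + - - + + -".split()
values = list(range(1, 16))  # stand-in for a1 < a2 < ... < a15
print(wilcoxon_rank_sum(values, labels))  # 58, as in the worked example
```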
Wilcoxon Rank Test
A common test in statistics.
Again, we can compute p-values given the null hypothesis H0:
P(W(g) > s | n, k) = the probability of getting a score > s given a total of n samples, of which k are labeled (+).
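One way to obtain this tail probability exactly, assuming no ties: under H0 the ranks of the k positive samples form a uniformly random k-subset of {1, ..., n}, so it suffices to count the subsets by their sum. The sketch below does this with a small counting table; the function name is illustrative and s is assumed to be a non-negative integer.

```python
from math import comb

def rank_sum_tail(n, k, s):
    """P(W > s) when W is the rank sum of a uniformly random k-subset of 1..n."""
    max_sum = n * (n + 1) // 2
    # ways[j][t] = number of j-element subsets of the ranks seen so far with sum t
    ways = [[0] * (max_sum + 1) for _ in range(k + 1)]
    ways[0][0] = 1
    for rank in range(1, n + 1):
        # Go through subset sizes downwards so each rank is used at most once.
        for j in range(min(k, rank), 0, -1):
            for t in range(rank, max_sum + 1):
                ways[j][t] += ways[j - 1][t - rank]
    return sum(ways[k][s + 1:]) / comb(n, k)

# Chance of a rank sum above 58 with n = 15 samples, k = 8 labeled (+).
print(rank_sum_tail(15, 8, 58))
```

For a two-sided test the lower tail would be added in the same way.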
SAM (Tusher et al., PNAS 01)
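d(i): the score of gene i, the difference between the two group means divided by a pooled standard deviation that is scaled by a = (1/n1 + 1/n2)/(n1 + n2 - 2), where n1 and n2 are the numbers of samples in the two groups
• d(i) is exactly the two-sample (pooled-variance) t-statistic
• Tests the assumption: are the means of the two processes the same?
• Underlying assumption: two normal distributions
• The p-value is known: it follows from the t-distribution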
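A sketch of the statistic, under the assumption that it has the form given in Tusher et al.: d(i) = (mean1 - mean2) / (s(i) + s0), with s(i) = sqrt(a * (sum of squared deviations from each group's own mean)). The constant s0 comes from the paper, not from these slides; with s0 = 0 the value is the pooled-variance two-sample t-statistic. Function and argument names are illustrative.

```python
from math import sqrt
from statistics import mean

def sam_d(group1, group2, s0=0.0):
    """SAM-style score of one gene; s0 = 0 gives the pooled two-sample t-statistic."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    # Squared deviations of every sample from its own group mean.
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    return (m1 - m2) / (sqrt(a * ss) + s0)

normal = [1.0, 2.0, 3.0]  # hypothetical expression values
tumor = [4.0, 5.0, 6.0]
print(sam_d(normal, tumor))          # plain t-statistic
print(sam_d(normal, tumor, s0=0.5))  # shrunk towards 0 by the added constant
```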
SAM – Alternative to P-Value
• P-value relies on t-test assumptions - problematic
• Can we assess the significance of d(i) without parametric assumptions?
• Define a “balanced” permutation: division of samples to 2 groups, where in each group the number of ‘+’ and ‘-’ is balanced
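• Perform all possible “balanced” permutations p of the data and compute:
$d_E(i) = \frac{1}{M} \sum_{p} d_p(i)$
$\Delta(i) = d(i) - d_E(i)$
where M is the number of permutations and $d_p(i)$ is the score of gene i under permutation p.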
False Discovery Rate for SAM
• Genes with Δ(i) above a given threshold are called significant
• FDR (false discovery rate) = the % of genes passing as “significant” which are expected to be false positives
• Each threshold on Δ(i) can be given an FDR value:
• compute the avg. number of false positives (FP) crossing this threshold in the permuted sets
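The FDR estimate described above, as a simplified sketch: count the genes crossing the threshold with the real labels, then divide the average count over permuted label sets by that number. To stay short it uses plain random label shuffles instead of balanced permutations, and it thresholds the absolute score directly (a difference of group means here) rather than the SAM quantity; both are simplifications, and all names are illustrative.

```python
import random
from statistics import mean

def mean_difference(values, labels):
    """Difference between the mean of the positive and of the negative samples."""
    pos = [v for v, is_pos in zip(values, labels) if is_pos]
    neg = [v for v, is_pos in zip(values, labels) if not is_pos]
    return mean(pos) - mean(neg)

def permutation_fdr(genes, labels, threshold, score=mean_difference,
                    n_perm=200, seed=0):
    """Estimated fraction of false positives among genes with |score| >= threshold."""
    observed = sum(1 for g in genes if abs(score(g, labels)) >= threshold)
    if observed == 0:
        return 0.0
    rng = random.Random(seed)
    shuffled = list(labels)
    false_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps the group sizes, breaks the association
        false_counts.append(
            sum(1 for g in genes if abs(score(g, shuffled)) >= threshold))
    return min(1.0, mean(false_counts) / observed)

# One hypothetical gene, perfectly separated between 3 normals and 3 tumors.
genes = [[0, 0, 0, 10, 10, 10]]
labels = [False, False, False, True, True, True]
print(permutation_fdr(genes, labels, threshold=10))
```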
Different Scores
TNoM, Info, Wilcoxon, t-test, fold change
Different scores and null hypotheses (parametric, non-parametric, etc.)
All can be found in the ScoreGene package:
http://www.cs.huji.ac.il/labs/compbio/scoregenes/
Can we assess which scoring method is the best for our case?
Overabundance Analysis
[Figure: Lung Cancer Data - actual and expected TNoM score distributions. Top panel: number of genes vs. TNoM score (0 to 10), expected distribution and actual distribution. Bottom panel: -log(binomial surprise) vs. TNoM score.]
• Data on 30 samples from normal and tumor lung tissues.
• ~7000 genes.
• Naftali Kaminski’s lab, Sheba Medical Center
Why Test Overabundance?
Tests how informative a set of genes is w.r.t. a given classification of the data and a scoring method.
Can be used to compare different:
• gene scoring methods
• normalization methods
Comparing Normalization Methods
Why Test Overabundance?
But also: a method to discover new classes in the data
Intuition: biologically meaningful partitions will have a high overabundance of informative genes
Overabundance Analysis in Class Discovery
Biologically meaningful partitions.
Overabundance of informative genes
• Score genes
• Count
• Compare to random
AML/ALL
BRCA1/2
[Figure: Breast Cancer BRCA1/BRCA2 data - actual and expected TNoM score distributions. Top panel: number of genes vs. TNoM score (0 to 6), expected distribution and actual distribution. Bottom panel: -log(binomial surprise) vs. TNoM score.]
Melanoma
[Figure: Melanoma38-1 - actual and expected TNoM score distributions. Top panel: number of genes vs. TNoM score (0 to 9), expected distribution and actual distribution. Bottom panel: -log(binomial surprise) vs. TNoM score.]
Class Discovery Approach
Seek partitions with statistically significant overabundance of informative genes.
Use local search techniques, e.g.:
• Steepest ascent
• Simulated annealing
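A generic steepest-ascent sketch over two-class partitions: at each step, try flipping the label of every single sample, apply the flip that improves the score most, and stop at a local optimum. The score function is a parameter; in the setting of these slides it would be the overabundance of informative genes for the partition, but the toy score below is only a placeholder so that the example runs on its own. All names are illustrative.

```python
def steepest_ascent(labels, score, max_steps=1000):
    """Local search over boolean label vectors, one label flip per step."""
    current = list(labels)
    current_score = score(current)
    for _ in range(max_steps):
        best_i, best_score = None, current_score
        for i in range(len(current)):
            current[i] = not current[i]      # try flipping sample i
            candidate = score(current)
            current[i] = not current[i]      # undo the flip
            if candidate > best_score:
                best_i, best_score = i, candidate
        if best_i is None:                   # no single flip improves the score
            break
        current[best_i] = not current[best_i]
        current_score = best_score
    return current, current_score

# Placeholder score: agreement with a hidden "true" partition (toy example).
hidden = [True, True, False, False]
toy_score = lambda labels: sum(a == b for a, b in zip(labels, hidden))
print(steepest_ascent([False] * 4, toy_score))
```

Steepest ascent can stop at a local optimum, which is one reason to also consider simulated annealing or several random starting partitions.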
Scoring a Partition
At a given score level s, set p = p-Val(s). Suppose that in the data we observe n(s) genes with score ≤ s.
The number of genes with score ≤ s that we observe for uniformly and independently drawn labeling vectors is a random variable N(s) with
N(s) ~ Binom(n, p), where n is the total number of genes.
The surprise rate at s is defined as
$\sigma(s) = \mathrm{Prob}(N(s) \ge n(s)) = \sum_{k=n(s)}^{n} \binom{n}{k} p^k (1-p)^{n-k}$.
Finally, the max-surprise score for the suggested partition is $\max_s \, [-\log \sigma(s)]$, i.e. it is attained at the level s with the smallest tail probability.
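The binomial tail above can be evaluated in log space, which avoids overflow when n is in the thousands. This is a sketch under the definitions on this slide; the function names are illustrative, p is assumed to lie strictly between 0 and 1, and the result is reported as -log of the tail probability (larger means more surprising).

```python
from math import exp, lgamma, log

def neg_log_surprise(n, n_s, p):
    """-log Prob(N >= n_s) for N ~ Binom(n, p), with 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if n_s <= 0:
        return 0.0  # the tail covers everything, so the probability is 1
    log_terms = [
        lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        + k * log(p) + (n - k) * log(1.0 - p)
        for k in range(n_s, n + 1)
    ]
    top = max(log_terms)
    # log-sum-exp: factor out the largest term before exponentiating
    return -(top + log(sum(exp(t - top) for t in log_terms)))

def max_surprise(n, levels):
    """Largest -log surprise over score levels, given (p, n_s) per level."""
    return max(neg_log_surprise(n, n_s, p) for p, n_s in levels)

# Hypothetical numbers: 7000 genes, p-Val(s) = 0.001, 62 genes observed at score <= s.
print(neg_log_surprise(7000, 62, 0.001))
```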
Overabundance & Max-Surprise
Example: Survival Prediction
[Figure: survival curves (x-axis 0 to 12, y-axis 0 to 1). All Patients: 32 patients, 17 deaths vs. 8 patients, 5 deaths. Good Prognosis Patients: 19 patients, 6 deaths vs. 5 patients, 3 deaths.]
Class 2
[Figure: survival curves (x-axis 0 to 12, y-axis 0 to 1). All Patients: 22 patients, 13 deaths vs. 18 patients, 9 deaths. Good Prognosis Patients: 10 patients, 2 deaths vs. 14 patients, 7 deaths.]
Class 3
[Figure: survival curves (x-axis 0 to 12, y-axis 0 to 1). All Patients: 17 patients, 7 deaths vs. 23 patients, 15 deaths. Good Prognosis Patients: 12 patients, 6 deaths vs. 12 patients, 3 deaths.]
Tissue Classification
Given a set of labeled samples, we can try to classify a new sample
Supervised methods: SVM, AdaBoost, Naïve Bayes
Semi-supervised methods: Clustering
Issues:
• Evaluating the methods
• Feature selection
• Sample contamination/composition
Evaluating Classification
LOOCV (leave-one-out cross-validation):
For all samples i = 1…M:
• Take sample i out
• Learn from the M-1 remaining samples
• Test on sample i
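The LOOCV loop above, sketched with a nearest-centroid classifier as a stand-in for the learners named earlier (SVM, AdaBoost, Naïve Bayes). The classifier choice, the function names and the toy data are all assumptions made for illustration.

```python
from statistics import mean

def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to sample x."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def loocv_errors(samples, labels, predict=nearest_centroid_predict):
    """Number of samples misclassified when each one is held out in turn."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]   # take sample i out
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) != labels[i]:  # test on it
            errors += 1
    return errors

# Toy data: one gene, three normal-like and three tumor-like samples.
samples = [[1.0], [1.2], [0.9], [5.0], [5.1], [4.8]]
print(loocv_errors(samples, ["N", "N", "N", "T", "T", "T"]))  # 0
print(loocv_errors(samples, ["N", "N", "T", "T", "T", "T"]))  # 1 (third sample mislabeled)
```

If genes are selected by a score first, that selection should be repeated inside each fold on the M-1 training samples only; otherwise the held-out sample has already influenced the classifier and the error estimate is optimistic.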
[Figure: normal and tumor samples, with a mislabeled sample marked.]
Feature Selection
How many of the informative genes do we choose for our classifier?
A question of choosing a cutoff
Positive predictive value: 97%
[Figure: % prediction errors (0 to 35) vs. p-value threshold for selection (1e-008 to 1, log scale).]
• 14 - 2000 genes
• 1 misclassification
Tissue Composition
[Figure labels: small cell lung carcinoma; lung metastasis; serous carcinoma; lung adenocarcinoma.]
Tissue Composition
The tissue is composed of many cell types (tumor, blood, muscle, …)
The arrayed samples are not always pure!
Major difference: differentially expressed genes which are:
• Causes of the disease state
• Outcome of the disease state
Summary
Many methods for choosing differentially expressed genes
These can be compared, e.g. using overabundance tests
Overabundance can also be used for new class discovery
Expression patterns can be used to classify a tissue