class #2: statistics, testing multiple hypotheses and bioinformatics · 2015-02-24 · class #2:...
Class #2: Statistics, Testing Multiple Hypotheses and Bioinformatics
ML4Bio 2012, January 20th, 2012
Quaid Morris
Scooter Morris, Piet Molenaar
Gary Bader, Tero Aittokallio, Boris Steipe
Overview
• Bioinformatics: gene lists, annotation, etc.
• Hypothesis testing:
– T-test, Sign-rank, Rank-sum, Hypergeometric test (or Fisher’s Exact Test)
• Multiple test corrections:
– Bonferroni, False Discovery Rate (Benjamini-Hochberg)
Bioinformatics: gene annotations, ID mapping, etc.
Interpreting Gene Lists
• My cool new screen worked and produced 1000 hits! …Now what?
• Genome-Scale Analysis (Omics)
– Genomics, Proteomics
• Tell me what’s interesting about these genes
– Are they enriched in known pathways, complexes, functions?
[Figure: gene list from ranking or clustering, passed to analysis tools together with prior knowledge about cellular processes (GenMAPP.org), leading to “Eureka! New heart disease gene!”]
Where Do Gene Lists Come From?
• Molecular profiling, e.g. mRNA, protein
– Identification → Gene list
– Quantification → Gene list + values
– Ranking, Clustering (biostatistics)
• Interactions: Protein interactions, microRNA targets, transcription factor binding sites (ChIP)
• Genetic screen, e.g. of knock-out library
• Association studies (Genome-wide)
– Single nucleotide polymorphisms (SNPs)
– Copy number variants (CNVs)
What Do Gene Lists Mean?
• Biological system: complex, pathway, physical interactors
• Similar gene function, e.g. protein kinase
• Similar cell or tissue location
• Chromosomal location (linkage, CNVs)
Gene Attributes
Available in databases:
• Function annotation
– Biological process, molecular function, cell location
• Chromosome position
• Disease association
• DNA properties
– TF binding sites, gene structure (intron/exon), SNPs
• Transcript properties
– Splicing, 3’ UTR, microRNA binding sites
• Protein properties
– Domains, secondary and tertiary structure, PTM sites
• Interactions with other genes
What is the Gene Ontology (GO)?
www.geneontology.org
• Set of biological phrases (terms) which are applied to genes:
– protein kinase
– apoptosis
– membrane
• Dictionary: term definitions
• Ontology: A formal system for describing knowledge
Jane Lomax @ EBI
GO Structure
• Terms are related within a hierarchy
– is-a
– part-of
• Describes multiple levels of detail of gene function
• Terms can have more than one parent or child
What GO Covers?
• GO terms divided into three aspects:
– cellular component
– molecular function
– biological process (important pathway source)
glucose-6-phosphate isomerase activity
Cell division
Terms
• Where do GO terms come from?
– GO terms are added by editors at EBI and gene annotation database groups
– Terms added by request
– Experts help with major development
– 32029 terms, >99% with definitions.
• 19639 biological_process
• 2859 cellular_component
• 9531 molecular_function
• As of July 15, 2010
Annotations
• Genes are linked, or associated, with GO terms by trained curators at genome databases
– Known as ‘gene associations’ or GO annotations
– Multiple annotations per gene
• Some GO annotations created automatically (without human review)
Annotation Sources
• Manual annotation
– Curated by scientists
• High quality
• Small number (time-consuming to create)
– Reviewed computational analysis
• Electronic annotation
– Annotation derived without human validation
• Computational predictions (accuracy varies)
• Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin
Evidence Types (for your information)
• Experimental Evidence Codes
– EXP: Inferred from Experiment
– IDA: Inferred from Direct Assay
– IPI: Inferred from Physical Interaction
– IMP: Inferred from Mutant Phenotype
– IGI: Inferred from Genetic Interaction
– IEP: Inferred from Expression Pattern
• Author Statement Evidence Codes
– TAS: Traceable Author Statement
– NAS: Non-traceable Author Statement
• Curator Statement Evidence Codes
– IC: Inferred by Curator
– ND: No biological Data available
• Computational Analysis Evidence Codes
– ISS: Inferred from Sequence or Structural Similarity
– ISO: Inferred from Sequence Orthology
– ISA: Inferred from Sequence Alignment
– ISM: Inferred from Sequence Model
– IGC: Inferred from Genomic Context
– RCA: Inferred from Reviewed Computational Analysis
• IEA: Inferred from electronic annotation
See http://www.geneontology.org
Wide & Variable Species Coverage
Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.
Accessing GO: QuickGO
http://www.ebi.ac.uk/ego/
See also AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
Gene Attributes
• Function annotation
– Biological process, molecular function, cell location
• Chromosome position
• Disease association
• DNA properties
– TF binding sites, gene structure (intron/exon), SNPs
• Transcript properties
– Splicing, 3’ UTR, microRNA binding sites
• Protein properties
– Domains, secondary and tertiary structure, PTM sites
• Interactions with other genes
Sources of Gene Attributes
• Ensembl BioMart (eukaryotes)
– http://www.ensembl.org
• Entrez Gene (general)
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
– E.g. SGD: http://www.yeastgenome.org/
• Also available through R
BioMart 0.7
Ensembl BioMart
• Convenient access to gene list annotation
[Screenshot: BioMart query steps: select genome, select filters, select attributes to download]
What Have We Learned?
• Many gene attributes in databases
– Gene Ontology (GO) provides gene function annotation
• GO is a classification system and dictionary for biological concepts
• Annotations are contributed by many groups
• More than one annotation term allowed per gene
• Some genomes are annotated more than others
• Annotation comes from manual and electronic sources
• GO can be simplified for certain uses (GO Slim)
• Many gene attributes available from Ensembl and EntrezGene
Gene Lists Overview
• Interpreting gene lists
• Gene function attributes
– Gene Ontology
• Ontology Structure
• Annotation
– BioMart + other sources
• Gene identifiers and mapping
Gene and Protein Identifiers
• Identifiers (IDs) are ideally unique, stable names or numbers that help track database records
– E.g. Social Insurance Number, Entrez Gene ID 41232
• Gene and protein information stored in many databases
– Genes have many IDs
• Records for: Gene, DNA, RNA, Protein
– Important to recognize the correct record type
– E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins, e.g. in RefSeq, which stores sequence.
Common Identifiers (for your information)
Gene
– Ensembl: ENSG00000139618
– Entrez Gene: 675
– Unigene: Hs.34012
RNA transcript
– GenBank: BC026160.1
– RefSeq: NM_000059
– Ensembl: ENST00000380152
Protein
– Ensembl: ENSP00000369497
– RefSeq: NP_000050.2
– UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
– IPI: IPI00412408.1
– EMBL: AF309413
– PDB: 1MIU
Species-specific
– HUGO HGNC: BRCA2
– MGI: MGI:109337
– RGD: 2219
– ZFIN: ZDB-GENE-060510-3
– FlyBase: CG9097
– WormBase: WBGene00002299 or ZK1067.1
– SGD: S000002187 or YDL029W
Annotations
– InterPro: IPR015252
– OMIM: 600185
– Pfam: PF09104
– Gene Ontology: GO:0000724
– SNPs: rs28897757
Experimental Platform
– Affymetrix: 208368_3p_s_at
– Agilent: A_23_P99452
– CodeLink: GE60169
– Illumina: GI_4502450-S
Red = Recommended
Identifier Mapping
• So many IDs!
– Mapping (conversion) is a headache
• Four main uses
– Searching for a favorite gene name
– Link to related resources
– Identifier translation
• E.g. Genes to proteins, Entrez Gene to Affy
– Unification during dataset merging
• Equivalent records
ID Mapping Services
• Synergizer
– http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
– http://www.ensembl.org
• PICR (proteins only)
– http://www.ebi.ac.uk/Tools/picr/
• R language annotation databases
– http://www.bioconductor.org
ID Mapping Challenges
• Avoid errors: map IDs correctly
• Gene name ambiguity – not a good ID
– e.g. FLJ92943, LFS1, TRP53, p53
– Better to use the standard gene symbol: TP53
• Excel error-introduction
– OCT4 is changed to October-4
• Problems reaching 100% coverage
– E.g. due to version issues
– Use multiple sources to increase coverage
Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004 Jun 23;5:80
Recommendations
• For proteins and genes
– (doesn’t consider splice forms)
• Map everything to Entrez Gene IDs using a spreadsheet
• If 100% coverage desired, manually curate missing mappings
• Be careful of Excel auto conversions – especially when pasting large gene lists!
– Format cells as ‘text’
What Have We Learned?
• Genes and their products and attributes have many identifiers (IDs)
• Genomics requirement to convert or map IDs from one type to another
• ID mapping services are available
• Use standard, commonly used IDs to reduce ID mapping challenges
Hypothesis testing: P-values, enrichment analysis, T-tests, etc.
What is a P-value?
• A) Probability of an incorrect rejection of the null hypothesis
• B) Probability that a sample of the test statistic under the null distribution is as or more extreme than its measured value
• C) False rejection probability
• D) Some subset of the above
Hypothesis testing
• Random variables:
– H: H0 (null hypothesis) or H1 (alternative hypothesis)
– Data: X1, X2, … XN (independent and identically distributed – IID)
– T: sample from null distribution of the test statistic
– t: observed value of test statistic
• Parameters:
– α: significance level
• Goal:
– Set P = Pr(T is “more extreme” than t | H = H0)
– Reject H0 if P < α
P-value facts
• Note that: Pr[P < p | H0 is true] = p
• So under the null distribution, P is a random variable that is uniformly distributed between 0 and 1.
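A quick simulation can make this fact concrete. The sketch below is an illustration added here, not part of the original slides: it generates many datasets for which H0 is true (normal data with mean 0), runs a one-sample T-test on each with SciPy, and reports the fraction of P-values below a few cut-offs p. The number of simulations, the sample size and the seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
n_sim, n = 2000, 20                     # arbitrary: 2000 null datasets of 20 points each

# Every row is one dataset generated under H0 (the true mean is 0).
data = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n))
pvals = stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue

# Under H0, Pr[P < p] should be close to p for any p.
frac_below = {p: float(np.mean(pvals < p)) for p in (0.05, 0.25, 0.5)}
for p, frac in frac_below.items():
    print(f"Pr[P < {p}] is approximately {frac:.3f}")
```

If the fractions drift far from p, either the null hypothesis or the assumptions of the test do not hold for the simulated data.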
P-value versus false rejections
• P-value is:
– Pr[ T is “as or more extreme” than t | H0 is true ]
• False rejection probability:
– Pr[ H0 is true | H0 is rejected ]
– aka “False discovery rate”
• How do we go from one to another?
(Gene Set) Enrichment analysis
• Given:
1. Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast)
2. Gene annotations: e.g. Gene Ontology, transcription factor binding sites in promoter
• Question: Are any of the gene annotations surprisingly enriched in the gene list?
• Details:
– How to assess “surprisingly” (statistics)
– How to correct for repeating the tests
The hypergeometric test
Gene list: RRP6, MRD1, RRP7, RRP43, RRP42
H0: List is a random sample from population
H1: More black genes than expected
Background population: 500 black genes, 4500 red genes
The hypergeometric function

Probability that a random sample of k genes contains q black genes when the background population contains m black genes out of n total genes:

$$\Pr(q) = \frac{\binom{m}{q}\binom{n-m}{k-q}}{\binom{n}{k}}$$

where $\binom{m}{q}$ is the # of ways to choose q out of m genes, $\binom{n-m}{k-q}$ is the # of ways to choose k-q out of n-m genes, and $\binom{n}{k}$ is the # of ways to choose k out of n genes.

$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$ is called “n choose k”; for details see http://www.khanacademy.org/video/combinations
The hypergeometric test
Gene list: RRP6, MRD1, RRP7, RRP43, RRP42
Background population: 500 black genes, 4500 red genes
Null distribution: hypergeometric, with n = 5000, m = 500, k = 5
P-value (probability of 4 or 5 black genes in the list):

$$P = \frac{\binom{500}{4}\binom{4500}{1}}{\binom{5000}{5}} + \frac{\binom{500}{5}\binom{4500}{0}}{\binom{5000}{5}} = 4.6 \times 10^{-4}$$
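The P-value in this example can be checked numerically. The following is a minimal sketch, assuming SciPy is available; the variable names are illustrative, and the observed count of 4 black genes is inferred from the two terms of the sum on the slide (4 black plus 1 red, and 5 black plus 0 red).

```python
from scipy.stats import hypergeom

n_total = 5000   # n: background population size (500 black + 4500 red)
n_black = 500    # m: black genes in the background
list_size = 5    # k: size of the gene list
observed = 4     # q: black genes in the list (inferred from the terms on the slide)

# hypergeom.sf(x, M, n, N) is Pr[X > x], so pass observed - 1 to get Pr[X >= observed].
p_value = hypergeom.sf(observed - 1, n_total, n_black, list_size)
print(f"P-value = {p_value:.2e}")
```

Using the survival function rather than summing individual terms avoids writing the binomial coefficients by hand and scales to longer gene lists.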
Important details
• One way to test for under-enrichment of “black”: test for over-enrichment of “red”
• Sometimes called “Fisher’s Exact Test”
• Need to choose “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), only use that population as background.
• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply the hypergeometric test separately for each type. ***More on this later***
Summary
• The P-value given by the hypergeometric test (Fisher’s Exact Test):
– is “the probability that a random draw of the same size as the gene list from the background population would produce the observed number of annotations in the gene list or more”,
– depends on the size of both the gene list and the background population, as well as the # of black genes in the gene list and the background.
Review: One sample T-test
• Does drug XyZiZy reduce tumour size?
• X1, X2, …, XN: change in tumour size after taking the drug.
• Calculate t = (x̄ – μ) / (s / √N), where x̄ is the sample mean of the Xi and s is their sample standard deviation [here μ=0]
• Evaluate t under a Student t-distribution with N-1 degrees of freedom.
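The calculation can be sketched in a few lines. The tumour-size changes below are made-up numbers for illustration only; the code computes t by hand from the formula and compares it with SciPy's `ttest_1samp`, which should agree.

```python
import numpy as np
from scipy import stats

# Hypothetical changes in tumour size after treatment (negative = shrinkage).
x = np.array([-1.2, -0.4, 0.3, -2.1, -0.8, -1.5, 0.1, -0.9])
mu = 0.0                                   # null hypothesis: no change on average

n = len(x)
s = x.std(ddof=1)                          # sample standard deviation
t_manual = (x.mean() - mu) / (s / np.sqrt(n))

# Two-sided P-value from a Student t-distribution with N-1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's built-in test should agree with the manual calculation.
result = stats.ttest_1samp(x, popmean=mu)
print(t_manual, p_manual, result.statistic, result.pvalue)
```

Because the question on the slide is directional ("reduce"), a one-sided P-value may be more appropriate; for a negative t it is half of the two-sided value computed here.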
Notes about the T-test
• Assumptions:
– 1) Xi is normally distributed
– 2) Variance of Xi is unknown, and the sample variance follows a (scaled) chi-square distribution
– Plus some other minor technical ones
• What should we do if these aren’t true, or we don’t know whether they are?
Wilcoxon sign-rank test
• H0: median of distribution of Xi is zero
• H1: median of distribution of Xi is not zero
• Rank X1, X2, …, XN by absolute value
• Test statistics: W+ is the sum of the ranks of the positive values of Xi, W- is the rank sum of the negative values; S = W+ - W-.
• For large N (>20 or so), null distribution of S is well approximated by S ~ Normal(0, σ²), where σ² = N(N+1)(2N+1)/6
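The steps above can be sketched as follows. This is an illustrative implementation of the large-N normal approximation only: the function name `sign_rank_z` is hypothetical, dropping values equal to zero is an assumed convention, and no correction for tied ranks is applied.

```python
import numpy as np
from scipy import stats

def sign_rank_z(x):
    """Return (S, z, two-sided P) using the large-N normal approximation."""
    x = np.asarray(x, dtype=float)
    x = x[x != 0]                           # assumption: zeros carry no sign, drop them
    n = len(x)
    ranks = stats.rankdata(np.abs(x))       # rank by absolute value (ties get average ranks)
    w_plus = ranks[x > 0].sum()             # rank sum of the positive values
    w_minus = ranks[x < 0].sum()            # rank sum of the negative values
    s = w_plus - w_minus
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    z = s / sigma
    p = 2 * stats.norm.sf(abs(z))
    return s, z, p

rng = np.random.default_rng(1)
x = rng.normal(loc=0.5, scale=1.0, size=30)   # hypothetical data, N = 30
s, z, p = sign_rank_z(x)
print(s, z, p)
```

For small N the normal approximation is poor and an exact null distribution should be used instead.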
Sign rank test notes
• Assumptions:
– 1) Data is symmetrically distributed around the median
– Plus some other minor technical ones
• Tied ranks need to be “corrected”
• Some versions use min(W+, W-) as the test statistic.
(Wilcoxon) Sign Test
• Used when you can’t even assume that the distribution of Xi is symmetric.
• Test statistic is the # of positive values of Xi.
• Under the null hypothesis, the samples are IID and the probability that any Xi is positive is 0.5, so the null distribution of the test statistic is Binomial(0.5, N), where N is the # of samples.
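A minimal sketch of the sign test, using the binomial null distribution directly. The function name `sign_test` and the example values are hypothetical, and discarding values equal to zero is an assumed convention not stated on the slide.

```python
import numpy as np
from scipy import stats

def sign_test(x):
    """Two-sided sign test: returns (# of positives, N, P-value)."""
    x = np.asarray(x, dtype=float)
    x = x[x != 0]                          # assumption: values equal to zero are discarded
    n = len(x)
    k = int(np.sum(x > 0))
    # Null distribution of k is binomial with N trials and success probability 0.5.
    lower = stats.binom.cdf(k, n, 0.5)     # Pr[K <= k]
    upper = stats.binom.sf(k - 1, n, 0.5)  # Pr[K >= k]
    p = min(1.0, 2 * min(lower, upper))
    return k, n, p

k, n, p = sign_test([0.8, 1.3, -0.2, 2.1, 0.4, 0.9, 1.7, -0.5, 0.6, 1.1])
print(k, n, p)
```

Here 8 of the 10 values are positive, and the P-value is twice the probability of seeing 8 or more positives in 10 fair coin flips.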
Summary
Test: test statistic
– T-Test: T-statistic
– Sign-Rank test: difference of rank sums
– Sign test: # of positives
Moving down this list, the tests make fewer assumptions about the distribution of Xi; moving up, they have more power.
• The same set of tests usually applies for determining whether paired data are drawn from distributions with the same or different means (medians).
• If (Yi, Zi) is the paired data point for Yi and Zi from the two datasets, then simply set:
– Xi = Yi – Zi.
• Note: paired Wilcoxon sign-rank is almost exactly paired T-test applied to joint ranks.
Gene lists and gene scores
[Figure: two ways to obtain a gene list: clustering (genes vs. time) and thresholding a gene “score”; examples of gene scores]
Source: Eisen et al. (1998) PNAS 95; Gerber et al. (2006) PNAS 103
Enrichment analysis using gene scores
[Figure: gene scores for black and red genes, and the two gene score distributions]
Question: How likely are the differences between the two distributions due to chance?
Enrichment analysis with non-paired, two-sample T-test
Answer: Two-tailed T-test
Gene score distributions:
– Black: N1 = 500, Mean: m1 = 1.1, Std: s1 = 0.9
– Red: N2 = 4500, Mean: m2 = 4.9, Std: s2 = 1.0
H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal

$$T\text{-statistic} = \frac{m_1 - m_2}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}} = -88.5$$
Enrichment analysis with non-paired, two-sample T-test
[Figure: gene score distributions, and the T-distribution (probability density vs. T-statistic) with the tail beyond the observed T-statistic of -88.5 shaded]
P-value = shaded area * 2
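The T-statistic can be reproduced from the summary statistics alone. In the sketch below, N2 = 4500 is an assumption: it matches the 4500 red genes of the background population used in the hypergeometric example, and with that value the formula gives the reported -88.5. SciPy's `ttest_ind_from_stats` with `equal_var=False` uses the same unequal-variance form of the statistic.

```python
import math
from scipy import stats

# Summary statistics from the slide; N2 = 4500 is an assumption (see the text above).
n1, m1, s1 = 500, 1.1, 0.9      # black genes
n2, m2, s2 = 4500, 4.9, 1.0     # red genes

t_manual = (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# SciPy can run the same unequal-variance test from summary statistics.
res = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
print(round(t_manual, 1), res.statistic, res.pvalue)
```

With a statistic this extreme the two-tailed P-value is effectively zero in floating point.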
T-test caveats (also see next slide)
1. Assumes black and red gene score distributions are both approximately Gaussian (i.e. normal)
– Score distribution assumption is often true for:
• Log ratios from microarrays
– Score distribution assumption is rarely true for:
• Peptide counts, sequence tags (SAGE or NextGen sequencing), transcription factor binding site hits
2. Tests for significance of difference in means of two distributions but does not test for other differences between distributions.
Examples of inappropriate score distributions for T-tests
• Gene scores are positive and have increasing density near zero, e.g. sequence counts
• Distributions with gene score outliers, or “heavy-tailed” distributions
• Bimodal “two-bumped” distributions.
[Figure: probability density vs. gene score for each of the three cases]
Solutions:
1) Robust test for difference of medians (WMW)
2) Direct test of difference of distributions (K-S)
Enrichment analysis with two-sample, not paired Wilcoxon Rank Sum
aka Mann-Whitney U test or simply “WMW”
1) Rank gene scores, calculate RB, sum of ranks of black gene scores
[Figure: N1 black gene scores and N2 red gene scores pooled and ranked from 1 to 10, giving RB = 21; probability density vs. gene score]
H0: Probability that a random sample from distribution of red scores is > than one from black is 0.5
H1: Otherwise
Wilcoxon-Mann-Whitney (WMW) test
aka Mann-Whitney U-test, Wilcoxon rank-sum test
2) Calculate Z-score (RB = 21; the subtracted term is the mean rank sum under H0):

$$Z = \frac{R_B - \frac{N_1 (N_1 + N_2 + 1)}{2}}{\sigma_U} = -1.4$$

3) Calculate P-value: P-value = shaded area * 2
[Figure: normal distribution (probability density vs. Z) with the tail beyond Z = -1.4 shaded]
H0: Probability that a random sample from distribution of red scores is > than one from black is 0.5
H1: Otherwise
WMW test details
• Described method is only applicable for large N1 and N2 and when there are no tied scores
• WMW test is robust to (a few) outliers
$$\sigma_U = \sqrt{N_1 N_2 (N_1 + N_2 + 1) / 12}$$
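The Z-score calculation can be sketched as below. The function names are hypothetical, ranking with rank 1 as the smallest score is an assumed convention (it only affects the sign of Z), and the group sizes N1 = N2 = 5 in the usage line are an assumption chosen because, together with RB = 21, they give a Z-score close to the -1.4 on the slide. As noted above, this approximation is only appropriate for large groups without ties.

```python
import numpy as np
from scipy import stats

def wmw_z_from_rank_sum(r_b, n1, n2):
    """Z-score for the rank sum r_b of the n1 black scores among n1 + n2 scores."""
    mean_rank_sum = n1 * (n1 + n2 + 1) / 2
    sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (r_b - mean_rank_sum) / sigma_u

def wmw_test(black, red):
    """Rank the pooled scores, then return (R_B, Z, two-sided P-value)."""
    pooled = np.concatenate([black, red])
    ranks = stats.rankdata(pooled)             # rank 1 = smallest score (assumed convention)
    r_b = ranks[: len(black)].sum()
    z = wmw_z_from_rank_sum(r_b, len(black), len(red))
    return r_b, z, 2 * stats.norm.sf(abs(z))

# R_B = 21 with five genes per group (group sizes are an assumption).
z_slide = wmw_z_from_rank_sum(21, 5, 5)
print(round(float(z_slide), 2))
```

For real analyses with small groups or tied scores, an implementation with exact P-values and tie correction is preferable to this sketch.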
Kolmogorov-Smirnov (K-S) test for difference of distributions
[Figure: probability density vs. gene score for red and black genes, next to the corresponding empirical (cumulative) distributions (cumulative probability from 0 to 1.0 vs. gene score)]
1) Calculate cumulative distributions of red and black
Kolmogorov-Smirnov (K-S) test
[Figure: empirical (cumulative) distributions of red and black gene scores, with the largest vertical gap between them marked “Length = 0.4”]
Test statistic: Maximum vertical difference between the two cumulative distributions
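The test statistic is easy to compute directly from the two empirical cumulative distributions. The sketch below uses made-up gene scores; the helper `ks_statistic` is hypothetical and is compared against SciPy's `ks_2samp`, which also supplies a P-value.

```python
import numpy as np
from scipy import stats

def ks_statistic(a, b):
    """Maximum vertical gap between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])              # the gap can only change at observed scores
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(2)
black = rng.normal(0.0, 1.0, size=200)         # hypothetical gene scores
red = rng.normal(0.5, 1.5, size=300)

d_manual = ks_statistic(black, red)
res = stats.ks_2samp(black, red)               # SciPy's two-sample K-S test
print(d_manual, res.statistic, res.pvalue)
```

The statistic lies between 0 (identical empirical distributions) and 1 (no overlap between the two samples).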
WMW and K-S test caveats
• Neither test is as sensitive as the T-test, i.e. they require more data points to detect the same amount of difference, so use the T-test whenever it is valid.
• K-S test and WMW can give you different answers: K-S detects difference of distributions, WMW detects whether samples from one tend to be higher than those from the other (or vice versa)
• Rare problem: Tied scores and/or small # of observations can be a problem for some implementations of the WMW test
Proper tests for different distributions
• Gene scores are positive and have increasing density near zero, e.g. sequence counts
• Distributions with gene score outliers, or “heavy-tailed” distributions
• Bimodal “two-bumped” distributions.
[Figure: probability density vs. gene score for each of the three cases]
Recommended test: WMW or K-S | K-S only | WMW or K-S
What have we learned?
• T-test is not valid when one or both of the score distributions is not normal,
• If you need a “robust” test, or to test for difference of medians, use WMW test,
• To test for overall difference between two distributions, use K-S test.
Central Limit Theorem (CLT)
• We’ve seen a few examples of test statistic distributions which, for large N, are well approximated by a normal distribution.
• This is often due to the CLT:
– If X1, X2, … XN are IID with a PDF with a finite mean and variance then as N increases, the distribution of the mean of these variables approaches a Gaussian.
• Also holds in most cases for independent, non-identically distributed random variables, i.e. variables that have different distributions.
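A small simulation can illustrate the CLT. This sketch is an added illustration with arbitrary choices (exponential data, three values of N, a fixed seed): it measures the skewness of the distribution of sample means, which is 0 for a Gaussian and should shrink toward 0 as N grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_rep = 5000                                   # number of simulated means per N

skew_by_n = {}
for n in (2, 10, 100):
    # Exponential data are strongly skewed, so single values are far from Gaussian.
    means = rng.exponential(scale=1.0, size=(n_rep, n)).mean(axis=1)
    skew_by_n[n] = float(stats.skew(means))    # a Gaussian has skewness 0

print(skew_by_n)
```

Skewness is only one summary of non-normality; a histogram or Q-Q plot of the means gives a fuller picture.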
Other common tests and distributions
• Chi-squared (contingency table) test
– Useful if there are >2 values of annotation (e.g. red genes, black genes, and blue genes)
– Used as an approximation to Fisher’s Exact Test but is inaccurate for small gene lists
– Also used for goodness-of-fit (in general)
• Binomial test
– Tests if gene scores for red and black come from either N flips of the same coin or different coins.
– E.g. black genes are “expressed” in, on average, 5 out of 12 conditions and red genes are expressed in, on average, 2 out of 12 conditions; is the probability of being expressed significantly different for the black and red genes?
Permutation tests
• Often, the null distribution of the test statistic is unclear or not analytical.
• In these cases, you can generate an empirical distribution by sampling from the null distribution and then evaluating your test statistic against this distribution.
• In many genomic applications it is often possible to get a sample from the null distribution by randomizing (i.e. permuting) the association between genes and corresponding data.
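The permutation idea can be sketched as follows. Everything here is illustrative: the function name `permutation_test`, the choice of test statistic (difference in mean score between black and red genes), the number of permutations and the simulated scores are all assumptions, not taken from the slides.

```python
import numpy as np

def permutation_test(scores, is_black, n_perm=2000, seed=0):
    """Two-sided permutation P-value for the difference in mean score (black - red)."""
    scores = np.asarray(scores, dtype=float)
    is_black = np.asarray(is_black, dtype=bool)
    rng = np.random.default_rng(seed)

    def stat(labels):
        return scores[labels].mean() - scores[~labels].mean()

    observed = stat(is_black)
    # Shuffling the labels breaks any real association between genes and annotation.
    null = np.array([stat(rng.permutation(is_black)) for _ in range(n_perm)])
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, float(p)

rng = np.random.default_rng(4)
labels = np.arange(60) < 20                    # hypothetical: first 20 genes are "black"
scores = rng.normal(0.0, 1.0, size=60) + 1.5 * labels
obs, p = permutation_test(scores, labels)
print(obs, p)
```

The smallest P-value this procedure can report is 1 / (n_perm + 1), so more permutations are needed when very small P-values matter, for example before a multiple test correction.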
Multiple test correction: Bonferroni and False Discovery Rate
How to win the P-value lottery, part 1
Random draws
… 7,834 draws later …
Expect a random draw with observed enrichment once every 1 / P-value draws
Background population: 500 black genes, 5000 red genes
How to win the P-value lottery, part 2
Keep the gene list the same, evaluate different annotations
Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42
Different annotations: RRP6, MRD1, RRP7, RRP43, RRP42
ORA tests need correction
From the Gene Ontology website:
Current ontology statistics: 25206 terms
• 14825 biological process
• 2101 cellular component
• 8280 molecular function
Simple P-value correction: Bonferroni
If M = # of annotations tested:
Corrected P-value = M x original P-value
Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”
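The correction is a one-line calculation. In this minimal sketch the function name `bonferroni` and the example P-values are hypothetical, and capping the corrected value at 1 is an added convention so that the result still reads as a probability.

```python
import numpy as np

def bonferroni(pvals):
    """Multiply each P-value by the number of tests M, capping the result at 1."""
    pvals = np.asarray(pvals, dtype=float)
    return np.minimum(1.0, pvals * len(pvals))

raw = [4.6e-4, 0.01, 0.03, 0.2]                # hypothetical P-values for M = 4 annotations
corrected = bonferroni(raw)
print(corrected)
```

With thousands of GO terms tested, M is large, which is why only very small original P-values survive this correction.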
Bonferroni correction caveats
• Bonferroni correction is very stringent and can “wash away” real enrichments.
• Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.
False discovery rate (FDR)
• FDR is the expected proportion of the observed enrichments due to random chance.
• Compare to Bonferroni correction, which is a bound on the probability that any one of the observed enrichments could be due to random chance.
• Typically FDR corrections are calculated using the Benjamini-Hochberg procedure.
• FDR threshold is often called the “q-value”
Controlling FDR using the Benjamini-Hochberg procedure I
• Say you want to bound the FDR at α, you need to calculate the corresponding P-value threshold t
• First, calculate the P-values for all the tests, and then sort them so that p1 is the smallest (i.e. most significant) P-value, and pm is the least.
Benjamini, Y. & Hochberg, Y. (1995) J. R. Stat. Soc. B 57, 289–300
Controlling FDR using the Benjamini-Hochberg procedure II
• t = pr where r is the max value for which:
pr ≤ rα / m
Here rα / m is the FDR threshold, r is the rank and m is the # of tests.
Caveat: Assumes independent or positively correlated tests.
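The two steps of the procedure can be sketched in a few lines. The function name `bh_threshold`, the example P-values and the choice to return a threshold of 0 when no test passes are assumptions made for illustration.

```python
import numpy as np

def bh_threshold(pvals, alpha):
    """Return (t, rejected): the BH P-value threshold and a boolean mask of rejections."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    p_sorted = np.sort(p)                      # p_1 <= p_2 <= ... <= p_m
    ranks = np.arange(1, m + 1)
    passing = np.nonzero(p_sorted <= ranks * alpha / m)[0]
    if passing.size == 0:
        return 0.0, np.zeros(m, dtype=bool)    # nothing passes: reject no hypotheses
    t = p_sorted[passing.max()]                # p_r for the largest passing rank r
    return float(t), p <= t

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]   # hypothetical
t, rejected = bh_threshold(pvals, alpha=0.05)
print(t, rejected)
```

In this example only the two smallest P-values fall under their rank-specific thresholds (0.00625 and 0.0125), so t = 0.008 and two enrichments are reported.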
Reducing multiple test correction stringency
• The correction to the P-value threshold depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be
• Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.
Meta-analysis for multiple tests: Fisher’s method for combining P-values
• Given different tests for the same hypothesis H, with P-values p1, p2, …, pN you can use Fisher’s method to combine them into a single P-value.
• The test statistic X² = -2 Σi ln[ pi ] has a null distribution that is a chi-square with 2N degrees of freedom.
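Fisher's method is short enough to write out directly. In the sketch below the function name `fisher_combine` and the three P-values are hypothetical; the tests are assumed to be independent, and the result is compared with SciPy's `combine_pvalues`, which implements the same method.

```python
import numpy as np
from scipy import stats

def fisher_combine(pvals):
    """Combine independent P-values; returns (X2 statistic, combined P-value)."""
    p = np.asarray(pvals, dtype=float)
    x2 = -2.0 * np.sum(np.log(p))
    combined = stats.chi2.sf(x2, df=2 * len(p))   # upper tail of chi-square with 2N d.o.f.
    return float(x2), float(combined)

pvals = [0.04, 0.10, 0.07]                     # hypothetical P-values from three tests
x2, p_combined = fisher_combine(pvals)

# SciPy ships the same method; the two results should agree.
res = stats.combine_pvalues(pvals, method="fisher")
print(x2, p_combined, res)
```

With a single P-value the method returns that P-value unchanged, which is a convenient sanity check.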
Summary
• Multiple test correction
– Bonferroni: stringent, controls probability of at least one false positive*
– FDR: more forgiving, controls expected proportion of false positives* -- typically uses Benjamini-Hochberg
• Fisher’s Method to combine P-values
– If you have multiple, independent tests of the same hypothesis, you can combine P-values into a single P-value.