Post on 22-Dec-2015
Q-Vals (and False Discovery Rates) Made Easy
Dennis Shasha
Based on the paper "Statistical significance for genomewide studies" by John Storey and Robert Tibshirani
PNAS, August 5, 2003, 9440-9445
Challenge
• You test plants/patients/… in two settings (or from different populations).
• You want to know which genes are differentially expressed (alternate).
• You don’t want to make too many mistakes (declaring a gene to be alternate when in fact it’s null – not differentially expressed).
First Idea
• You take p-vals of the differences in expression.
• P-val(g) is the probability that, if g were null, it would show a difference at least as large as the one observed.
• You choose a cutoff, say 0.05.
• You say all genes that differ with p-val <= 0.05 are truly different.
• What’s the problem?
Thought Experiment
• Suppose that no genes are truly differentially expressed.
• You will still call about 5% of the genes significant, even though none of them really are.
• Your false discovery rate (number null among those predicted to be alternate/number predicted to be alternate) = 100%.
• Bad.
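The thought experiment can be checked with a small simulation. This is only a sketch under stated assumptions: the gene count, the seed, and the function name `simulate_all_null` are illustrative choices, and null p-values are drawn directly as uniform random numbers.

```python
import random

def simulate_all_null(num_genes=10_000, cutoff=0.05, seed=0):
    """Draw p-values for genes that are all truly null and apply a fixed cutoff."""
    rng = random.Random(seed)
    # Assumption: under the null hypothesis a p-value is uniform on [0, 1].
    pvals = [rng.random() for _ in range(num_genes)]
    called = [p for p in pvals if p <= cutoff]
    # Every gene is null, so every gene called significant is a false discovery.
    false_discoveries = len(called)
    fdr = false_discoveries / len(called) if called else 0.0
    return len(called) / num_genes, fdr

fraction_called, fdr = simulate_all_null()
print(f"fraction called significant: {fraction_called:.3f}")
print(f"false discovery rate: {fdr:.0%}")
```

The fraction called significant should land near the cutoff, while the false discovery rate is 100% by construction.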
A Fundamental Insight
• All truly null genes (i.e. not truly differentially expressed) are equally likely to have any p-val.
• That is by construction of the p-val: under the null hypothesis, 1% of the genes will be in the top percentile, 1% will be between the 89th and 90th percentiles, and so on. A p-val is just a way of stating the percentile under the null condition.
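A quick way to see this uniformity is to simulate null genes and bin their p-values. The sketch below assumes, purely for illustration, that each null gene's test statistic is standard normal and that the p-value is two-sided; the helper name `null_pvalue` is hypothetical.

```python
import random
from statistics import NormalDist

def null_pvalue(rng, std_normal=NormalDist()):
    """Two-sided p-value for one null gene whose test statistic is standard normal."""
    z = rng.gauss(0.0, 1.0)
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))

rng = random.Random(1)
num_genes = 20_000
pvals = [null_pvalue(rng) for _ in range(num_genes)]

# Fraction of null genes falling in each tenth of the p-value range.
bins = [0] * 10
for p in pvals:
    bins[min(int(p * 10), 9)] += 1
fractions = [count / num_genes for count in bins]
print([round(f, 3) for f in fractions])
```

Each tenth of the range should hold roughly 10% of the null genes, which is the flat histogram the later slides rely on.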
What Do We Do With That?
• Mixture model: imagine null genes as light blue marbles and truly different genes as red ones.
• If the assay is decent, red marbles should be concentrated at the low p-values.
[Figure: genes (marbles) plotted along a p-value axis running from 0 to 1.]
Method We Can Use
• Of course, we don’t know the colors of the marbles: we don’t know which genes are true alternates.
• However, we know that null marbles are equally likely to have any p-value.
• So, from the p-value where the height of the marbles levels off onward, we have primarily light blue marbles (null genes).
• Why?
[Figure: heights of marbles along a p-value axis from 0 to 1. Annotations: "Flat region starts here"; "Level of flat region".]
Answer
• Because if all genes/marbles were null, the heights would be about uniform.
• Provided the reds are concentrated near the low p-vals, the flat regions will be primarily light blues.
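As a rough illustration of reading the null level off the flat region, the sketch below simulates a mixture and estimates the fraction of nulls. The cut point `flat_start=0.5`, the 80/20 mix, and the way red marbles are piled near zero are all assumptions made for this example, not values from the slides.

```python
import random

def estimate_null_fraction(pvals, flat_start=0.5):
    """Estimate the fraction of null genes from the flat part of the histogram.

    flat_start is a hypothetical, hand-picked p-value beyond which the
    histogram is assumed to contain almost only nulls.
    """
    in_flat = sum(1 for p in pvals if p > flat_start)
    # Nulls are uniform, so the flat region holds (1 - flat_start) of them.
    return min(1.0, in_flat / (len(pvals) * (1.0 - flat_start)))

rng = random.Random(2)
nulls = [rng.random() for _ in range(8_000)]             # light blue marbles
alternates = [rng.random() ** 20 for _ in range(2_000)]  # red marbles, piled near 0
pi0_hat = estimate_null_fraction(nulls + alternates)
print(f"estimated null fraction: {pi0_hat:.2f} (simulated with 0.80)")
```

Because a few red marbles stray into the flat region, the estimate tends to run slightly high, which errs on the side of overstating the number of nulls.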
Example: all null
• Consider the all null case.
• All marbles are light blue.
• The false discovery rate in the region to the left of the flat region is the estimated number of light blue marbles there (based on the level of the flat region) divided by the total number of marbles there.
• This will be close to 100%.
[Figure: all-null case along a p-value axis from 0 to 1. Annotations: "Flat region starts here"; "Level of flat region".]
Example: all non-null
• Consider the all non-null case.
• All marbles are red and they are highly skewed toward low p-values.
• The flat region is essentially at zero.
• The false discovery rate in the region to the left of the flat region is the estimated number of light blue marbles there (based on the level of the flat region) divided by the total number of marbles there.
• This will be close to 0.
[Figure: all non-null case along a p-value axis from 0 to 1. Annotation: "Flat region starts here".]
Example: mixed case
• Get a distribution of p-values.
• Find flat region.
• Estimate number of nulls in the left-of-flat region by extending the flat line.
• This gives the false discovery rate.
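The four steps above can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the flat region is assumed to start at a hand-picked `flat_start=0.5`, the data are simulated, and the function name `estimate_fdr` is hypothetical.

```python
import random

def estimate_fdr(pvals, threshold, flat_start=0.5):
    """Estimate the false discovery rate when calling all p <= threshold significant."""
    num_genes = len(pvals)
    # Find the level of the flat region, expressed as a fraction of nulls.
    in_flat = sum(1 for p in pvals if p > flat_start)
    null_fraction = min(1.0, in_flat / (num_genes * (1.0 - flat_start)))
    # Extend the flat line to the left: nulls are uniform, so the expected
    # number of nulls at or below the threshold scales with the threshold.
    expected_nulls = null_fraction * num_genes * threshold
    called = sum(1 for p in pvals if p <= threshold)
    return min(1.0, expected_nulls / called) if called else 0.0

rng = random.Random(3)
null_pvals = [rng.random() for _ in range(8_000)]
alt_pvals = [rng.random() ** 20 for _ in range(2_000)]
threshold = 0.05

estimated = estimate_fdr(null_pvals + alt_pvals, threshold)
# In a simulation the colors are known, so the true rate can be checked.
null_calls = sum(1 for p in null_pvals if p <= threshold)
all_calls = null_calls + sum(1 for p in alt_pvals if p <= threshold)
true_fdr = null_calls / all_calls
print(f"estimated FDR: {estimated:.3f}, true FDR in simulation: {true_fdr:.3f}")
```

On real data only the estimate is available; the comparison with the true rate is possible here only because the simulation labels each gene.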
[Figure: histogram over a p-value axis from 0 to 1. Vertical axis: number of genes having each p-val. Annotations: "Flat line; base level of nulls"; "Possible p-value threshold".]
Example: mixed case
• What would you estimate the false discovery rate to be in the case that we declare the entire area to the left of the possible p-value threshold to be significant?
• 10%, 25%, 50%?
[Figure: histogram over a p-value axis from 0 to 1. Vertical axis: number of genes having each p-val. Annotations: "Flat line; base level of nulls"; "Possible p-value threshold".]
Obtaining q-values from False Discovery Rate
• Suppose we order genes from least p-value to greatest.
• That ordering corresponds to the horizontal axis of one of these graphs.
• The q-value of a gene having p-value p is the False Discovery Rate we would get if the significance threshold were set at p.
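A minimal sketch of turning p-values into q-values follows. It assumes the same hand-picked `flat_start=0.5` for the null level, and it adds one refinement from Storey and Tibshirani's definition: each gene receives the smallest estimated False Discovery Rate over all thresholds at or above its own p-value, so q-values never decrease as p-values grow. The function name `q_values` and the five example p-values are made up for illustration.

```python
def q_values(pvals, flat_start=0.5):
    """Return one q-value per gene, in the same order as pvals."""
    num_genes = len(pvals)
    in_flat = sum(1 for p in pvals if p > flat_start)
    null_fraction = min(1.0, in_flat / (num_genes * (1.0 - flat_start)))

    # Order genes from least p-value to greatest.
    order = sorted(range(num_genes), key=lambda i: pvals[i])
    qvals = [0.0] * num_genes
    running_min = 1.0
    # Walk from the largest p-value down to the smallest.
    for rank in range(num_genes, 0, -1):
        i = order[rank - 1]
        # With the threshold at pvals[i], `rank` genes are called significant.
        fdr = null_fraction * num_genes * pvals[i] / rank
        running_min = min(running_min, fdr)
        qvals[i] = running_min
    return qvals

example_pvals = [0.001, 0.03, 0.031, 0.6, 0.8]
qvals = q_values(example_pvals)
for p, q in zip(example_pvals, qvals):
    print(f"p = {p:<6} q = {q:.4f}")
```

In this toy input the second gene inherits the q-value of the third, because moving the threshold slightly to the right admits one more gene at almost no extra cost in expected nulls.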
[Figure: histogram over a p-value axis from 0 to 1. Vertical axis: number of genes having each p-val. Annotations: "Flat line; base level of nulls"; "Q-value of a gene having this p-val is the FDR if this is the significance threshold."]
Lessons for Research
• Mushy p-values (large error bars/few replicates) may force us to the far left in order to get a low False Discovery Rate.
• This may eliminate genes of interest.
• If testing out a gene is not too expensive, then we can accept a higher False Discovery Rate – nothing magical about 0.01.
[Figure: histogram over a p-value axis from 0 to 1. Vertical axis: number of genes having each p-val. Annotations: "Flat line; base level of nulls"; "Better p-values avoid loss of genes for a small False Discovery Rate."]