On the analysis of relative abundances in ecogenomics - David Lovell
DESCRIPTION
In ecogenomics and other areas of molecular systems biology, many measurement methods yield only the relative abundance of different molecular components. Treating such data as though they carried information about absolute abundance can be very misleading. With relative abundances, differential expression needs careful interpretation, and correlation (a statistical workhorse for analyzing pairwise relationships) is an inappropriate measure of association. Using data on absolute and relative gene expression in yeast, we show why relative abundances need special analysis and interpretation, and present principles for doing so based on the theory of compositional data analysis. We show how correlation of data that carry only relative information can lead to conclusions opposite to those drawn from absolute abundances. We present a well-principled alternative, proportionality, and show how it can be used as the basis of analyses familiar in molecular bioscience. We also discuss some of the challenges of applying this approach to ecogenomic and similar count data that are rich in zeros and low counts.
TRANSCRIPT
On the analysis of relative abundances… in ecogenomics
David Lovell, Vera Pawlowsky-Glahn and Juan José Egozcue
Ecogenomics from Data to Knowledge | Canberra | 14 February 2014
CSIRO COMPUTATIONAL INFORMATICS
Warning! There will be a little maths… not a lot…
…but perhaps “too much” if you don’t like maths
– Logarithms
– Variances
– Slope of a regression line
– Goodness-of-fit of a regression line
This presentation is meant for everyone: stop me if I’ve lost you!
On the analysis of relative abundances | David Lovell | Page 2
Logarithms… turn multiplication into addition of exponents
Variance… measures how far a set of numbers is spread out
[Figure: two sets of points, one labelled “less variance”, the other “more variance”]
Slope and goodness of fit… What line best fits the points? And how close are the points to it?
Motivation
Want to draw sound conclusions from our observations
…conclusions that are about the Universe
…not artefacts of the way we have observed it
Motivation – more specific
A lot of bioscience data carry only relative information
How can you tell if you’re measuring relative information?
– “Would doubling the amount of starting material double my measurements?”
Different steps in measurement can render data relative
Sample preparation
– obtaining a fixed mass and concentration of nucleic acid for further analysis
– “We typically request that users provide us with 11 μg of total RNA/sample (at a concentration of 1.25 μg/uL)”
– “This protocol is optimized for 0.1–4 μg of total RNA…”
Measurement platform
– “How many reads in a typical DNA/RNA-seq experiment? A typical read count… may have hundreds of millions” http://www.vlsci.org.au/lscc/rna-seq
– A large, but finite amount… what would doubling the starting material do?
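The doubling question can be tried out in a few lines. The following is a minimal sketch, not something from the talk: the abundance values and the `closure` helper are made up for illustration, and it assumes a measurement process that reports only proportions of a fixed total.

```python
import numpy as np

def closure(abundances):
    """Rescale non-negative abundances so they sum to 1 (relative abundances)."""
    abundances = np.asarray(abundances, dtype=float)
    return abundances / abundances.sum()

# Hypothetical absolute abundances of three components in one sample.
absolute = np.array([120.0, 30.0, 50.0])
doubled = 2 * absolute  # "double the starting material"

relative = closure(absolute)
relative_doubled = closure(doubled)

# The absolute values doubled, but the relative values did not change at all.
print(relative)
print(relative_doubled)
print(np.allclose(relative, relative_doubled))
```

If doubling the input leaves the measurements unchanged, as it does here, the measurements carry only relative information.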
Question to the audience: How often is absolute microbial abundance observed or estimated?
– Microbes per gram of soil?
– Microbes per litre of seawater?
A lot of bioscience data are relative
So what?
Relative data need to be analysed and interpreted differently to absolute data
Forget that, and you can easily draw the wrong conclusions from your data
The focus of this talk: correlation
A workhorse of bioinformatics and quantitative bioscience
I will
– Show you why correlation is not appropriate for relative data
– Give you an alternative measure of association that can be used confidently
Correlation for relative data?
Three nails in its coffin
1. Spurious correlation
2. Geometry of relative data
3. Real-life examples of yeasty goodness from the Bähler Lab
Nail #1: Spurious correlation
A recipe for disaster
Ingredients
– Three (3) statistically independent variables x, y and z
– Note that the correlation between x and y should be about zero
Method
– Form the ratios x/z and y/z
– Plot the ratios against each other
– Watch correlation magically appear
Serves…
– This recipe has served entire research disciplines for some years, despite Karl Pearson’s warning in 1896
For a demonstration…
– See http://en.wikipedia.org/wiki/Spurious_correlation
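The recipe can be followed in code. This is a small simulation sketch under assumptions that are not from the talk: the three variables are drawn independently from a uniform distribution on [1, 2], and the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Ingredients: three statistically independent variables.
x = rng.uniform(1.0, 2.0, n)
y = rng.uniform(1.0, 2.0, n)
z = rng.uniform(1.0, 2.0, n)

# x and y are independent, so their correlation is about zero.
r_xy = np.corrcoef(x, y)[0, 1]

# Method: form the ratios x/z and y/z and correlate them.
r_ratios = np.corrcoef(x / z, y / z)[0, 1]

print(f"corr(x, y)     = {r_xy:.3f}")
print(f"corr(x/z, y/z) = {r_ratios:.3f}")
```

The shared divisor z is what induces the correlation between the ratios; nothing about x and y themselves has changed.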
Spurious correlation in action
Cooking the books
Have you got things in proportion? | David Lovell | Page 13
Nail #2: Geometry of relative data
Imagine we’ve measured the relative abundance of some microbes
We have made measurements in seven different samples
Let’s focus in on the relationship between microbe1 and microbe2
[Figures: the pair of relative abundances observed in each sample, listed in the order the values appear on the slides]
Sample #1: 40%, 30%
Sample #2: 37.5%, 35%
Sample #3: 35%, 40%
Sample #4: 32.5%, 45%
Sample #5: 30%, 50%
Sample #6: 27.5%, 55%
Sample #7: 25%, 60%
What do these pairs of relative abundances tell us about the relationship between the absolute numbers of microbe1 and microbe2 in the seven environments sampled?
If the numbers of microbe1 increase, what are the numbers of microbe2 likely to do?
Hint: each pair of relative values comes from a pair of absolute values that lie on the ray from the origin
Correlations between relative values tell us nothing about the relationships between the absolute values that gave rise to them
…unless total abundance is constant
OR
the pairs of relative values are proportional
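To see the point numerically, here is a sketch using the seven pairs of percentages from the sample slides. Two things are assumptions for illustration only: the transcript does not say which series belongs to microbe1, so they are simply called `rel_a` and `rel_b`, and the total abundances per sample are invented.

```python
import numpy as np

# Relative abundances from the seven samples (assignment to microbe1/microbe2 unknown).
rel_a = np.array([40.0, 37.5, 35.0, 32.5, 30.0, 27.5, 25.0]) / 100
rel_b = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0]) / 100

def corr(u, v):
    """Pearson correlation between two vectors."""
    return np.corrcoef(u, v)[0, 1]

# Two hypothetical scenarios for the total abundance in each sample.
totals_constant = np.full(7, 1000.0)
totals_growing = 1000.0 * 2.0 ** np.arange(7)

r_relative = corr(rel_a, rel_b)
r_abs_constant = corr(rel_a * totals_constant, rel_b * totals_constant)
r_abs_growing = corr(rel_a * totals_growing, rel_b * totals_growing)

print(f"relative values:          r = {r_relative:+.3f}")
print(f"absolute, constant total: r = {r_abs_constant:+.3f}")
print(f"absolute, growing total:  r = {r_abs_growing:+.3f}")
```

The same relative values are compatible with a strongly negative or a strongly positive correlation between the absolute values, depending only on totals that the relative data do not record.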
Proportionality
A meaningful measure of association for relative data
$\frac{x}{t} \propto \frac{y}{t} \Rightarrow x \propto y$ (x, y: absolute abundances; t: total abundance)
When pairs of relative values stay in proportion across samples, so do their corresponding absolute values
…when we work with relative values we lose the ability to infer correlations, but we can still infer proportionalities
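A quick check of this claim, as a sketch with invented numbers: `x` and `y` are hypothetical absolute abundances with y always three times x, and `other` stands in for everything else in each sample so that the totals vary freely.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 16

# Hypothetical absolute abundances: y is proportional to x in every sample.
x = rng.lognormal(mean=5.0, sigma=1.0, size=n_samples)
y = 3.0 * x
other = rng.lognormal(mean=8.0, sigma=2.0, size=n_samples)

# Totals differ wildly between samples.
totals = x + y + other
rel_x = x / totals
rel_y = y / totals

# Dividing both components by the same total leaves their ratio untouched.
ratio_absolute = y / x
ratio_relative = rel_y / rel_x

print(np.allclose(ratio_absolute, 3.0))
print(np.allclose(ratio_relative, 3.0))
```

Because the total cancels in the ratio, proportionality seen in relative values is also proportionality in the absolute values, whatever the totals were.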
Q: How do we measure proportionality?
A: Use a log-log plot
[Figure: the same data plotted on a natural scale and on a log-log scale]
Pairs of values that behave proportionally fit a line of slope 1 (i.e., 45°) on a log-log scale
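The slope-1 property is easy to confirm. In this sketch the data are invented, and an ordinary least-squares fit via `np.polyfit` is swapped in for simplicity; it is not claimed to be the line-fitting method used in the talk.

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 200.0])
y_proportional = 4.0 * x   # y = c * x, so log y = log c + 1 * log x
y_squared = x ** 2         # not proportional: log y = 2 * log x

# Fit a straight line to the points on the log-log scale.
slope_prop, intercept_prop = np.polyfit(np.log(x), np.log(y_proportional), 1)
slope_sq, intercept_sq = np.polyfit(np.log(x), np.log(y_squared), 1)

# Express the slopes as angles in degrees.
angle_prop = np.degrees(np.arctan(slope_prop))
angle_sq = np.degrees(np.arctan(slope_sq))

print(f"proportional: slope {slope_prop:.2f}, angle {angle_prop:.1f} degrees")
print(f"squared:      slope {slope_sq:.2f}, angle {angle_sq:.1f} degrees")
```

The constant of proportionality only shifts the intercept; the slope stays at 1.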
[Figures: slope (β, in degrees) plotted against proportion of variance explained (r²)]
$\phi(\log x, \log y) = \frac{\operatorname{var}(\log(x/y))}{\operatorname{var}(\log x)} = 1 + \beta^2 - 2\beta\,|r|$
where β is the standardised major axis slope of log y on log x and r is the correlation between log x and log y
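The relationship between the log-ratio variance, the slope and r can be checked numerically. This sketch assumes the statistic is φ = var(log(x/y)) / var(log x), with β taken as the standardised major axis slope (the ratio of standard deviations of log y and log x, signed by r). The function names and the simulated data are illustrative only.

```python
import numpy as np

def phi(x, y):
    """Variance of the log-ratio, scaled by the variance of log x."""
    lx, ly = np.log(x), np.log(y)
    return np.var(lx - ly) / np.var(lx)

def phi_from_slope_and_r(x, y):
    """The same quantity written as 1 + beta^2 - 2 * beta * |r|."""
    lx, ly = np.log(x), np.log(y)
    r = np.corrcoef(lx, ly)[0, 1]
    beta = np.sign(r) * np.std(ly) / np.std(lx)  # standardised major axis slope
    return 1 + beta**2 - 2 * beta * abs(r)

rng = np.random.default_rng(2)
x = rng.lognormal(size=16)
y_noisy = x * rng.lognormal(sigma=0.3, size=16)  # roughly proportional to x
y_exact = 3.0 * x                                # exactly proportional to x

print(phi(x, y_noisy), phi_from_slope_and_r(x, y_noisy))
print(phi(x, y_exact))
```

For exactly proportional pairs the log-ratio is constant, so φ is zero up to rounding; the further a pair strays from slope 1 and r² = 1, the larger φ becomes.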
Nail #3: Real-life examples
Yeasty goodness from the Bähler Lab
The data: expression levels of over 3000 yeast genes over a 16-point time course
…ok, so this is not ecogenomic data
– But imagine this is about the abundance of 3000 OTUs in 16 different samples
Total abundance is anything but constant
[Figure: slope (β, in degrees) plotted against proportion of variance explained (r²)]
Minimum of $\phi(\log x, \log y)$ at $\beta = 1$ (i.e., 45°), $r^2 = 1$
How does this look for yeast?
…let’s plot the slope and r² for 4.5 million pairs of mRNAs… yay!
Let’s not.
Instead, let’s bin the values of slope and r² for all 3031 × 3030/2 pairs of mRNA and show the counts on a heatmap
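The binning step might look something like the sketch below. It uses a small simulated matrix (40 hypothetical genes by 16 samples) rather than the yeast data, takes the slope as the ratio of standard deviations on the log scale signed by r, and picks the bin counts arbitrarily; the counts array is what a heatmap would then display.

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, n_samples = 40, 16

# Hypothetical positive expression values; rows are genes, columns are samples.
expression = rng.lognormal(size=(n_genes, n_samples))
logs = np.log(expression)

# Correlation between every pair of genes, and each gene's spread, on the log scale.
r = np.corrcoef(logs)
sd = logs.std(axis=1)

# Indices of all distinct pairs (i < j): n * (n - 1) / 2 of them.
i, j = np.triu_indices(n_genes, k=1)
r_pairs = r[i, j]
slope = np.sign(r_pairs) * sd[j] / sd[i]
angle = np.degrees(np.arctan(slope))
r_squared = np.clip(r_pairs**2, 0.0, 1.0)

# Bin slope (in degrees) against r^2 and count the pairs in each cell.
counts, angle_edges, r2_edges = np.histogram2d(
    angle, r_squared, bins=[36, 20], range=[[-90, 90], [0, 1]]
)

print(len(i), int(counts.sum()))
print(3031 * 3030 // 2)  # number of pairs for 3031 mRNAs
```

Binning keeps the picture readable when there are millions of pairs: the heatmap shows how many pairs fall near slope 45° and r² close to 1 without drawing each one.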
[Figure: heatmap with contours of $\phi(\log x, \log y)$ at 0.05 and 0.025]
Challenges with ecogenomic data
Can we make sense of microbial relative abundance?
We have been working with Human Microbiome Data this summer
Dang! There’s a lot of zeros
– Zeros… the natural enemy of the logarithm
Also there are a lot of small counts: 1, 2, 3…
– Counts (integers ≥ 0) do not carry only relative information
– Expected counts do (reals ≥ 0)
Yes there are measures of association or dissimilarity that “handle” these data
– Bray-Curtis, UniFrac…
– But what reliable inferences can you make when OTUs play hide and seek?
Also this summer I have become deeply suspicious of the practical value of null hypothesis significance testing (p-values)
– Esp., “statistically significant correlation coefficients”
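Why zeros are such a problem for log-based measures shows up straight away in code. This is only an illustration of the difficulty, with invented counts for two hypothetical OTUs; it does not propose a remedy.

```python
import numpy as np

# Hypothetical read counts for two OTUs across six samples; one has zeros.
counts_a = np.array([12, 0, 7, 3, 0, 25])
counts_b = np.array([30, 2, 15, 9, 1, 60])

# Silence the floating-point warnings so the resulting values can be inspected.
with np.errstate(divide="ignore", invalid="ignore"):
    log_a = np.log(counts_a)                             # log(0) is -inf
    log_ratio_variance = np.var(log_a - np.log(counts_b))

n_infinite = int(np.isinf(log_a).sum())

print(log_a)
print(n_infinite)
print(log_ratio_variance)
```

Two zero counts are enough to make the log-ratio variance undefined, so any log-based measure of proportionality needs the zeros dealt with first.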
Thanks and acknowledgements
Jürg Bähler and Sam Marguerat for data and wisdom on yeast
The Spanish Government for supporting VP-G and JJE’s collaboration with DL
Jack Simpson and Lauren Bragg for helping to make sense of microbes this summer
Karoline Faust and co-authors for graciously providing the data matrices from
– Faust, Karoline, J. Fah Sathirapongsasuti, Jacques Izard, Nicola Segata, Dirk Gevers, Jeroen Raes, and Curtis Huttenhower. 2012. “Microbial Co-Occurrence Relationships in the Human Microbiome.” PLoS Comput Biol 8 (7).
David Warton and co-authors for an excellent paper on Standardised Major Axis fitting
– Warton, David I., Ian J. Wright, Daniel S. Falster, and Mark Westoby. 2006. “Bivariate Line-fitting Methods for Allometry.” Biological Reviews 81 (2): 259–291.
The R Core Team and developers of R packages including
– Daniel Adler, Duncan Murdoch, Yihui Xie, Hadley Wickham, Gerald van den Boogaart and Raimon Tolosana-Delgado
Rachel, Felix and Ava (my wife and kids) for putting up with me on 5 hours sleep for weeks on end
Thank you
CSIRO Transformational Biology
David Lovell
Bioinformatics and Analytics Leader
t +61 2 6216 7042
e [email protected]
www.csiro.au/people/David.Lovell
CSIRO MATHEMATICS, INFORMATICS AND STATISTICS