On the analysis of relative abundances in ecogenomics - David Lovell

On the analysis of relative abundances in ecogenomics
David Lovell, Vera Pawlowsky-Glahn and Juan José Egozcue
Ecogenomics from Data to Knowledge | Canberra | 14 February 2014
CSIRO COMPUTATIONAL INFORMATICS

Upload: australian-bioinformatics-network

Post on 11-May-2015


DESCRIPTION

In ecogenomics and other areas of molecular systems biology, many measurement methods yield only the relative abundance of different molecular components. Treating such data as though they carried information about absolute abundance can be very misleading. With relative abundances, differential expression needs careful interpretation, and correlation—a statistical workhorse for analyzing pairwise relationships—is an inappropriate measure of association. Using data on absolute and relative gene expression in yeast, we show why relative abundances need special analysis and interpretation, and present principles for doing so based on the theory of compositional data analysis. We show how correlation of data that carry only relative information can lead to conclusions opposite to those drawn from absolute abundances. We present a well-principled alternative—proportionality—and show how it can be used as the basis of analyses familiar in molecular bioscience. We also talk about some of the challenges of applying this approach to ecogenomic and similar count data that are rich in zeros and low counts.

TRANSCRIPT

Page 1: On the analysis of relative abundances in ecogenomics - David Lovell

On the analysis of relative abundances in ecogenomics

David Lovell, Vera Pawlowsky-Glahn and Juan José Egozcue
Ecogenomics from Data to Knowledge | Canberra | 14 February 2014

CSIRO COMPUTATIONAL INFORMATICS

Page 2: On the analysis of relative abundances in ecogenomics - David Lovell

Warning! There will be a little maths… not a lot…

…but perhaps “too much” if you don’t like maths
- Logarithms
- Variances
- Slope of a regression line
- Goodness-of-fit of a regression line

This presentation is meant for everyone: stop me if I’ve lost you!

On the analysis of relative abundances | David Lovell | Page 2

Page 3: On the analysis of relative abundances in ecogenomics - David Lovell

Warning! There will be a little maths… not a lot…

Logarithms… turn multiplication into addition of exponents

Page 4: On the analysis of relative abundances in ecogenomics - David Lovell

Warning! There will be a little maths… not a lot…

Variance… measures how far a set of numbers is spread out

Figure panels: less variance | more variance

Page 5: On the analysis of relative abundances in ecogenomics - David Lovell

Warning! There will be a little maths… not a lot…

Slope and goodness of fit… What line best fits the points? And how close are the points to it?

Page 6: On the analysis of relative abundances in ecogenomics - David Lovell

Motivation
Want to draw sound conclusions from our observations
- …conclusions that are about the Universe
- …not artefacts of the way we have observed it

Page 7: On the analysis of relative abundances in ecogenomics - David Lovell

Motivation – more specific
A lot of bioscience data carry only relative information

How can you tell if you’re measuring relative information?
- “Would doubling the amount of starting material double my measurements?”

Different steps in measurement can render data relative
- Sample preparation: obtaining a fixed mass and concentration of nucleic acid for further analysis
  - “We typically request that users provide us with 11 μg of total RNA/sample (at a concentration of 1.25 μg/μL)”
  - “This protocol is optimized for 0.1–4 μg of total RNA…”
- Measurement platform
  - “How many reads in a typical DNA/RNA-seq experiment? A typical read count… may have hundreds of millions” http://www.vlsci.org.au/lscc/rna-seq
  - A large, but finite amount… what would doubling the starting material do?
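The doubling test can be tried on paper before going near the lab. The sketch below is only an illustration: the abundances are made-up numbers and `close` is a hypothetical helper that mimics a measurement process returning proportions of a fixed total.

```python
import numpy as np

def close(abundances):
    """Rescale non-negative abundances so they sum to 1 (hypothetical helper)."""
    abundances = np.asarray(abundances, dtype=float)
    return abundances / abundances.sum()

# Made-up absolute abundances of three components in one sample.
absolute = np.array([400.0, 300.0, 300.0])
doubled = 2 * absolute  # twice the starting material

relative = close(absolute)
relative_doubled = close(doubled)

print("absolute:", absolute, "->", doubled)
print("relative:", relative, "->", relative_doubled)
```

The absolute values double but the proportions do not move, which is what it means for a measurement to carry only relative information.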

Page 8: On the analysis of relative abundances in ecogenomics - David Lovell

Motivation – more specific
A lot of bioscience data carry only relative information

Question to the audience: How often is absolute microbial abundance observed or estimated?
- Microbes per gram of soil?
- Microbes per litre of seawater?

Page 9: On the analysis of relative abundances in ecogenomics - David Lovell

A lot of bioscience data are relative
So what?

- Relative data need to be analysed and interpreted differently to absolute data
- Forget that, and you can easily draw the wrong conclusions from your data

Page 10: On the analysis of relative abundances in ecogenomics - David Lovell

The focus of this talk: correlation
A workhorse of bioinformatics and quantitative bioscience

I will
- Show you why correlation is not appropriate for relative data
- Give you an alternative measure of association that can be used confidently

Page 11: On the analysis of relative abundances in ecogenomics - David Lovell

Correlation for relative data?
Three nails in its coffin

1. Spurious correlation

2. Geometry of relative data

3. Real-life examples of yeasty goodness from the Bähler Lab

Page 12: On the analysis of relative abundances in ecogenomics - David Lovell

Nail #1: Spurious correlation
A recipe for disaster

Ingredients
- Three (3) statistically independent variables x, y and z
  - Note that the correlation between x and y should be about zero

Method
- Form the ratios x/z and y/z
- Plot the ratios against each other
- Watch correlation magically appear

Serves…
- This recipe has served entire research disciplines for some years, despite Karl Pearson’s warning in 1896

For a demonstration…
- See http://en.wikipedia.org/wiki/Spurious_correlation
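The recipe is easy to follow in a few lines of code. This is a minimal sketch with assumed ingredients: the uniform distributions, the sample size and the random seed are arbitrary choices, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
n = 10_000

# Three statistically independent variables (distributions are an assumption).
x = rng.uniform(1.0, 2.0, n)
y = rng.uniform(1.0, 2.0, n)
z = rng.uniform(1.0, 2.0, n)

# Correlation of the raw variables, then of the ratios sharing the divisor z.
r_xy = np.corrcoef(x, y)[0, 1]
r_ratios = np.corrcoef(x / z, y / z)[0, 1]

print(f"corr(x, y)     = {r_xy:+.3f}")
print(f"corr(x/z, y/z) = {r_ratios:+.3f}")
```

x and y are unrelated, yet their ratios to the common divisor z come out clearly positively correlated. The correlation is produced by the division and says nothing about x and y.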

Page 13: On the analysis of relative abundances in ecogenomics - David Lovell

Spurious correlation in action
Cooking the books

Have you got things in proportion? | David Lovell | Page 13

Page 14: On the analysis of relative abundances in ecogenomics - David Lovell

Page 15: On the analysis of relative abundances in ecogenomics - David Lovell

Page 16: On the analysis of relative abundances in ecogenomics - David Lovell

Nail #2: Geometry of relative data

- Imagine we’ve measured the relative abundance of some microbes
- We have made measurements in seven different samples
- Let’s focus in on the relationship between microbe1 and microbe2

Page 17: On the analysis of relative abundances in ecogenomics - David Lovell

Page 18: On the analysis of relative abundances in ecogenomics - David Lovell

Sample #1: 40%, 30%

Page 19: On the analysis of relative abundances in ecogenomics - David Lovell

Sample #2: 37.5%, 35%

Page 20: On the analysis of relative abundances in ecogenomics - David Lovell

Sample #3: 35%, 40%

Page 21: On the analysis of relative abundances in ecogenomics - David Lovell

Sample #4: 32.5%, 45%

Page 22: On the analysis of relative abundances in ecogenomics - David Lovell

Sample #5: 30%, 50%

Page 23: On the analysis of relative abundances in ecogenomics - David Lovell

Sample #6: 27.5%, 55%

Page 24: On the analysis of relative abundances in ecogenomics - David Lovell

Sample #7: 25%, 60%

Page 25: On the analysis of relative abundances in ecogenomics - David Lovell

Page 26: On the analysis of relative abundances in ecogenomics - David Lovell

What do these pairs of relative abundances tell us about the relationship between the absolute numbers of microbe1 and microbe2 in the seven environments sampled?

Page 27: On the analysis of relative abundances in ecogenomics - David Lovell

If the numbers of microbe1 increase, what are the numbers of microbe2 likely to do?

Page 28: On the analysis of relative abundances in ecogenomics - David Lovell

Hint: each pair of relative values comes from a pair of absolute values that lies on the same ray from the origin

Page 29: On the analysis of relative abundances in ecogenomics - David Lovell

Page 30: On the analysis of relative abundances in ecogenomics - David Lovell

Page 31: On the analysis of relative abundances in ecogenomics - David Lovell

Page 32: On the analysis of relative abundances in ecogenomics - David Lovell

Page 33: On the analysis of relative abundances in ecogenomics - David Lovell

Correlations between relative values tell us nothing about the relationships between the absolute values that gave rise to them

Page 34: On the analysis of relative abundances in ecogenomics - David Lovell

Page 35: On the analysis of relative abundances in ecogenomics - David Lovell

…unless total abundance is constant

Page 36: On the analysis of relative abundances in ecogenomics - David Lovell

OR

the pairs of relative values are proportional
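The seven samples can be used to check this numerically. The sketch below rests on assumptions: the transcript does not say which percentage belongs to which microbe, so the first value in each sample is taken as microbe1, and the total abundances are invented to place each sample somewhere along its ray from the origin.

```python
import numpy as np

# Relative abundances from the seven samples (microbe assignment is assumed).
microbe1_rel = np.array([0.40, 0.375, 0.35, 0.325, 0.30, 0.275, 0.25])
microbe2_rel = np.array([0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60])

# Hypothetical total abundances: one that varies, one that is constant.
totals_varying = np.array([1000, 2000, 3000, 4000, 5000, 6000, 7000], dtype=float)
totals_constant = np.full(7, 1000.0)

def corr(a, b):
    """Pearson correlation between two equal-length vectors."""
    return np.corrcoef(a, b)[0, 1]

r_relative = corr(microbe1_rel, microbe2_rel)
r_abs_varying = corr(microbe1_rel * totals_varying, microbe2_rel * totals_varying)
r_abs_constant = corr(microbe1_rel * totals_constant, microbe2_rel * totals_constant)

print(f"relative values:            r = {r_relative:+.3f}")
print(f"absolute, varying totals:   r = {r_abs_varying:+.3f}")
print(f"absolute, constant totals:  r = {r_abs_constant:+.3f}")
```

With these invented totals the relative values are perfectly negatively correlated while the absolute values are strongly positively correlated. Only when the total is held constant do the two correlations agree.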

Page 37: On the analysis of relative abundances in ecogenomics - David Lovell

Proportionality
A meaningful measure of association for relative data

$$y \propto x \iff \frac{y}{t} \propto \frac{x}{t}$$

Page 38: On the analysis of relative abundances in ecogenomics - David Lovell

Page 39: On the analysis of relative abundances in ecogenomics - David Lovell

Page 40: On the analysis of relative abundances in ecogenomics - David Lovell

Page 41: On the analysis of relative abundances in ecogenomics - David Lovell

Page 42: On the analysis of relative abundances in ecogenomics - David Lovell

Page 43: On the analysis of relative abundances in ecogenomics - David Lovell

When pairs of relative values stay in proportion across samples, so do their corresponding absolute values

Page 44: On the analysis of relative abundances in ecogenomics - David Lovell

…when we work with relative values we lose the ability to infer correlations, but we can still infer proportionalities
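A small numerical check of why proportionality survives: dividing both members of a pair by the same total leaves their ratio unchanged. Everything in this sketch is invented for illustration (the factor of 3, the abundance ranges, the seed).

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n_samples = 7

# Hypothetical absolute abundances: y is always three times x.
x_abs = rng.uniform(10.0, 100.0, n_samples)
y_abs = 3.0 * x_abs

# Everything else in each sample varies freely, so the totals vary too.
others = rng.uniform(0.0, 1000.0, n_samples)
total = x_abs + y_abs + others

x_rel = x_abs / total
y_rel = y_abs / total

ratio_abs = y_abs / x_abs
ratio_rel = y_rel / x_rel  # the total cancels out of this ratio

print("ratio of absolute values:", np.round(ratio_abs, 6))
print("ratio of relative values:", np.round(ratio_rel, 6))
```

The totals differ from sample to sample, yet the ratio of the relative values is the same constant as the ratio of the absolute values.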

Page 45: On the analysis of relative abundances in ecogenomics - David Lovell

Q: How do we measure proportionality?
A: Use a log-log plot

Figure panels: natural scale | log-log scale

Pairs of values that behave proportionally fit a line of slope 1 (i.e., 45°) on a log-log scale
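To see the slope of 1 appear, fit a straight line to the logged values. This is a rough sketch: the data are made up, and ordinary least squares via `numpy.polyfit` is used only for convenience; it is not necessarily the line-fitting method used in the talk.

```python
import numpy as np

# Made-up pairs that are exactly proportional: y = k * x.
k = 2.5
x = np.array([1.0, 3.0, 10.0, 40.0, 200.0, 1500.0])
y = k * x

# On the log-log scale, log y = log k + 1 * log x.
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
angle_degrees = np.degrees(np.arctan(slope))

print(f"slope = {slope:.6f}, angle = {angle_degrees:.3f} degrees")
print(f"intercept = {intercept:.6f}, log(k) = {np.log(k):.6f}")
```

The constant of proportionality only shifts the line up or down (the intercept is log k). The slope stays at 1 whatever the value of k.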

Page 46: On the analysis of relative abundances in ecogenomics - David Lovell

Figure axes: slope ($\beta$) in degrees vs. proportion of variance explained ($r^2$)

Page 47: On the analysis of relative abundances in ecogenomics - David Lovell

Page 48: On the analysis of relative abundances in ecogenomics - David Lovell

Figure axes: slope ($\beta$) in degrees vs. proportion of variance explained ($r^2$)

$$\phi(\log x, \log y) = \frac{\operatorname{var}(\log(x/y))}{\operatorname{var}(\log x)} = 1 + \beta^2 - 2\beta|r|$$
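The link between the log-ratio variance, the slope and the goodness of fit can be checked numerically. The sketch below assumes that the slope is the standardised major axis slope of log y on log x (the ratio of standard deviations, carrying the sign of the correlation) and that r is the correlation between log x and log y. The function names and the simulated data are hypothetical.

```python
import numpy as np

def phi(x, y):
    """var(log(x/y)) / var(log x) for two positive vectors."""
    lx, ly = np.log(x), np.log(y)
    return np.var(lx - ly, ddof=1) / np.var(lx, ddof=1)

def sma_slope_and_r(x, y):
    """Standardised major axis slope and correlation on the log scale."""
    lx, ly = np.log(x), np.log(y)
    r = np.corrcoef(lx, ly)[0, 1]
    beta = np.sign(r) * np.std(ly, ddof=1) / np.std(lx, ddof=1)
    return beta, r

rng = np.random.default_rng(2)  # arbitrary seed
x = rng.lognormal(mean=0.0, sigma=1.0, size=16)
y = x**1.2 * rng.lognormal(mean=0.0, sigma=0.3, size=16)  # roughly, not exactly, proportional

beta, r = sma_slope_and_r(x, y)
lhs = phi(x, y)
rhs = 1 + beta**2 - 2 * beta * abs(r)
phi_proportional = phi(x, 3.0 * x)  # an exactly proportional pair

print(f"phi = {lhs:.6f}, 1 + beta^2 - 2*beta*|r| = {rhs:.6f}")
print(f"phi for an exactly proportional pair = {phi_proportional:.2e}")
```

Both sides agree, and the statistic falls to zero when the pair is exactly proportional, that is, when the slope is 1 and the fit is perfect.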

Page 49: On the analysis of relative abundances in ecogenomics - David Lovell

Nail #3: Real-life examples
Yeasty goodness from the Bähler Lab

- The data: expression levels of over 3000 yeast genes over a 16-point time course
- …ok, so this is not ecogenomic data
  - But imagine this is about the abundance of 3000 OTUs in 16 different samples
- Total abundance is anything but constant

Page 50: On the analysis of relative abundances in ecogenomics - David Lovell

Page 51: On the analysis of relative abundances in ecogenomics - David Lovell

Page 52: On the analysis of relative abundances in ecogenomics - David Lovell

Page 53: On the analysis of relative abundances in ecogenomics - David Lovell

Page 54: On the analysis of relative abundances in ecogenomics - David Lovell

Page 55: On the analysis of relative abundances in ecogenomics - David Lovell

Page 56: On the analysis of relative abundances in ecogenomics - David Lovell

Figure axes: slope ($\beta$) in degrees vs. proportion of variance explained ($r^2$)

Page 57: On the analysis of relative abundances in ecogenomics - David Lovell

Minimum of $\phi(\log x, \log y)$ at $\beta = 1$ (i.e., 45°), $r^2 = 1$

Page 58: On the analysis of relative abundances in ecogenomics - David Lovell

How does this look for yeast?
…let’s plot the slope and $r^2$ for 4.5 million pairs of mRNAs… yay!

Let’s not.

Instead, let’s bin the values of slope and $r^2$ for all 3031 × 3030 / 2 pairs of mRNA and show the counts on a heatmap
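The binning step might look something like the sketch below. It runs on a small simulated matrix rather than the yeast data, and the matrix size, the bin counts and the use of a signed standard-deviation ratio as the slope are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n_components, n_samples = 50, 16  # small stand-in for 3031 mRNAs x 16 time points

# Simulated positive abundances, one row per component.
data = rng.lognormal(mean=0.0, sigma=1.0, size=(n_components, n_samples))
logged = np.log(data)

# Pairwise correlations and standard deviations on the log scale.
r = np.corrcoef(logged)
sd = logged.std(axis=1, ddof=1)

# One entry per unordered pair (i < j).
i, j = np.triu_indices(n_components, k=1)
r_pairs = r[i, j]
slope = np.sign(r_pairs) * sd[j] / sd[i]
angle = np.degrees(np.arctan(slope))
r_squared = r_pairs**2

# Count the pairs falling in each (angle, r squared) cell of the heatmap.
counts, angle_edges, r2_edges = np.histogram2d(
    angle, r_squared, bins=[36, 20], range=[[-90.0, 90.0], [0.0, 1.0]]
)

n_pairs_yeast = 3031 * 3030 // 2
print("pairs binned here:", int(counts.sum()))
print("pairs in the yeast data:", n_pairs_yeast)
```

The `counts` array is what would be drawn as the heatmap. Note that the slope for a pair depends on which member is treated as x, so the choice of i < j is a convention rather than anything fundamental.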

Page 59: On the analysis of relative abundances in ecogenomics - David Lovell

Contours of $\phi(\log x, \log y)$ at 0.05 and 0.025

Page 60: On the analysis of relative abundances in ecogenomics - David Lovell

Page 61: On the analysis of relative abundances in ecogenomics - David Lovell

Page 62: On the analysis of relative abundances in ecogenomics - David Lovell

Page 63: On the analysis of relative abundances in ecogenomics - David Lovell

Page 64: On the analysis of relative abundances in ecogenomics - David Lovell

Page 65: On the analysis of relative abundances in ecogenomics - David Lovell

Page 66: On the analysis of relative abundances in ecogenomics - David Lovell

Challenges with ecogenomic data
Can we make sense of microbial relative abundance?

- We have been working with Human Microbiome data this summer
- Dang! There’s a lot of zeros
  - Zeros… the natural enemy of the logarithm
- Also there are a lot of small counts: 1, 2, 3…
  - Counts (integers ≥ 0) do not carry only relative information
  - Expected counts do (reals ≥ 0)
- Yes, there are measures of association or dissimilarity that “handle” these data
  - Bray-Curtis, UniFrac…
  - But what reliable inferences can you make when OTUs play hide and seek?
- Also this summer I have become deeply suspicious of the practical value of null hypothesis significance testing (p-values)
  - Esp., “statistically significant correlation coefficients”
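Why zeros are the natural enemy of the logarithm shows up as soon as a log-ratio is attempted. The counts below are invented and the sketch only exposes the problem; it does not propose a remedy.

```python
import numpy as np

# Invented read counts for two OTUs across six samples.
counts_a = np.array([12, 0, 7, 3, 0, 25])
counts_b = np.array([30, 5, 0, 9, 1, 60])

# log(0) is minus infinity, so any sample with a zero has no finite log-ratio.
with np.errstate(divide="ignore", invalid="ignore"):
    log_ratio = np.log(counts_a) - np.log(counts_b)

usable = np.isfinite(log_ratio)

print("log-ratios:", log_ratio)
print("samples with a finite log-ratio:", int(usable.sum()), "of", log_ratio.size)
```

Half of these samples drop out of any analysis based on log-ratios, and which samples drop out depends on where the OTUs happened to go unobserved.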

Page 67: On the analysis of relative abundances in ecogenomics - David Lovell

Thanks and acknowledgements

- Jürg Bähler and Sam Marguerat for data and wisdom on yeast
- The Spanish Government for supporting VP-G and JJE’s collaboration with DL
- Jack Simpson and Lauren Bragg for helping to make sense of microbes this summer
- Karoline Faust and co-authors for graciously providing the data matrices from
  - Faust, Karoline, J. Fah Sathirapongsasuti, Jacques Izard, Nicola Segata, Dirk Gevers, Jeroen Raes, and Curtis Huttenhower. 2012. “Microbial Co-Occurrence Relationships in the Human Microbiome.” PLoS Comput Biol 8 (7).
- David Warton and co-authors for an excellent paper on Standardised Major Axis fitting
  - Warton, David I., Ian J. Wright, Daniel S. Falster, and Mark Westoby. 2006. “Bivariate Line-fitting Methods for Allometry.” Biological Reviews 81 (2): 259–291.
- The R Core Team and developers of R packages including
  - Daniel Adler, Duncan Murdoch, Yihui Xie, Hadley Wickham, Gerald van den Boogaart and Raimon Tolosana-Delgado
- Rachel, Felix and Ava (my wife and kids) for putting up with me on 5 hours sleep for weeks on end

Page 68: On the analysis of relative abundances in ecogenomics - David Lovell

Thank you

CSIRO Transformational Biology
David Lovell
Bioinformatics and Analytics Leader
t +61 2 6216 7042
e [email protected]
www.csiro.au/people/David.Lovell

CSIRO MATHEMATICS, INFORMATICS AND STATISTICS