1 identifying differentially expressed genes from rna-seq data many recent algorithms for calling...

16
1 tifying differentially expressed genes from RNA-seq cent algorithms for calling differentially expresse Empirical analysis of digital gene expression data in R tp://www.bioconductor.org/packages/2.10/bioc/html/edgeR.html Differential gene expression analysis based on the gative binomial distribution tp://www-huber.embl.de/users/anders/DESeq/ Empirical Bayesian analysis of patterns of differential pression in count data tp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

Upload: nicholas-burke

Post on 04-Jan-2016

223 views

Category:

Documents


1 download

TRANSCRIPT

1

Identifying differentially expressed genes from RNA-seq data

Many recent algorithms for calling differentially expressed genes:

edgeR: Empirical analysis of digital gene expression data in Rhttp://www.bioconductor.org/packages/2.10/bioc/html/edgeR.html

DEseq: Differential gene expression analysis based on the negative binomial distributionhttp://www-huber.embl.de/users/anders/DESeq/

baySeq: Empirical Bayesian analysis of patterns of differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

2

Identifying differentially expressed genes from RNA-seq data

Why can’t we use the same software for microarray data analysis?

1. Microarray data are continuous values. Sequence data are discrete values ofread counts.

e.g. 5.342 signal intensity versus 5 counts

2. Reproducibility of RNA-seq measurements is different for low-abundance versushigh-abundance transcripts – this is called over-dispersion

3

Overdispersion: variance of replicates is higher for high-abundance reads

4

baySeq Empirical Bayesian analysis of patterns of

differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

Identifies differentially expressed genes between 2 or more samples using replicated RNA-seq data

5

Bayes’ Theorem

P (M | D) = P (D | M) * P (M) P (D)

Read in English: The Probability that your Model is correct Given ( | ) the Data is equal to the Probability of your Data Given the Model times the Probability of your Model divided by the Probability of the Data

What the hell does this mean?

6

Bayes’ Theorem: wikipedia’s ridiculous example

Your friend had a conversation with someone who happened to have long hair.What is the probability that that person was a woman, given that you know~50% of people are women and ~75% of women have long hair.

P (W) = Probability that this person was a woman = 0.5P (L | W) = Probability that the person had long hair IF the person is a woman = 0.75P (L | M) = Probability that the person had long hair IF that person is a man = 0.3P (L) = Probability that any random person has long hair =

P (L | W) * P (W) = 75% of 50% of the population = 0.75*0.5 = 0.375 + P (L | M) * P (M) = 30% of 50% of the population = 0.3 * 0.5 = 0.15

So:P (W|L) = P(L|W)*P(W) = P(L|W)*P(W) = 0.714

P(L) P (L|W)*P(W) + P (L|M)*P(M)

7

Bayes’ Theorem

P (M | D) = P (D | M) * P (M) P (D)

Read in English: The Probability that your Model is correct Given ( | ) the Data is equal to the Probability of your Data Given the Model times the Probability of your Model divided by the Probability of the Data

P is called the ‘Posterior Probability’ that this Model is the right one to describe your data.

Each gene will have a PP for each model, where S PP = 1.

The Bigger the PP, the more likely this is the right model.

Posterior Probability is NOT a p-value!

8

Bayes’ Theorem

We don’t know some of these factors ( e.g. P(D|M) ), but we can describe the data with some parameter set called q

(which includes mean and stdev of count data across replicates of each gene)

Imagine you have three RNA-seq replicates of two samples (WT vs mutant).There are two models for each gene

M0 = the model that your gene is NOT differentially expressed across the 2 samples

MDE = the model that a given gene IS differentially expressed

P (MDE | DgeneX) = P (DgeneX | MDE) * P (MDE) P (DgeneX)

P (M0 | DgeneX) = P (DgeneX | M0) * P (M0) P (DgeneX)

9

wikepediaE.g. Infecting a cell culture with viral particles. l = Multiplicity of Infection (MOI) = # of viral particles/cell in your culture

How to model the mean and standard deviation of your replicates:Poisson Distribution: Accounts for random fluctuations

10

Overdispersion: variance of replicates is higher for high-abundance reads

11

How to model the mean and standard deviation of your replicates:Negative Binomial Distribution: Mean and variance are different

Notice the wider spread on the right side of each distribution, for higher numbers

12

baySeq Empirical Bayesian analysis of patterns of

differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

Imagine you have triplicate measurements of sample A and sample B. We want toidentify genes differentially expressed (DE) in A versus B.

Arep1 Arep2 Brep1 Brep2

We define TWO possible models:NO DE (NDE): Arep1 = Arep2 = Brep1 = Brep2

we say that the data for all samples was drawn from the same distribution(* i.e. same mean and standard deviation, if normally distributed)

DE: Arep1 = Arep2 NOT EQUAL Brep1 = Brep2

A and B replicates are drawn from two different distributions

tuple: simply S counts per transcription unit

13

baySeq Empirical Bayesian analysis of patterns of

differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

P (MDE | DgeneX) = P(DgeneX|MDE) * P(MDE)

P(DgeneX)

Next, we use Bayes’ Rule to try to estimate the probability that the DE modelis true for geneX

baySeq tries to use data sharing across the dataset to estimate the componentson the right side of the equation

P (DgeneX | MDE ) = Int[ P(DgeneX| K, MDE) * P(K | MDE) ] dK

where K is the parameter set of q (mean, dispersion) for each gene in each replicate** q for each gene is estimated by looking at all data in the replicates

prior probability

14

baySeq Empirical Bayesian analysis of patterns of

differential expression in count datahttp://www.bioconductor.org/packages/2.10/bioc/html/baySeq.html

P (MDE | DgeneX) = P(DgeneX|MDE) * P(MDE)

P(DgeneX)

prior probability

The prior probability is the probability of M before any data are considered.

Here: 1. Guess at a starting prior probability for MDE

2. Iteratively test different alternatives3. Repeat until ‘convergence’ (P(MDE) does not change with more iterations)

* For data with strong DE signal, the posterior probability is not very dependenton the starting prior probability P(MDE)

15

16

Volcano Plot