Introduction to Statistics III

Statistics for Next Generation Sequencing (RNA-Seq)

Distribution?

• 25000 genes, each with counts over several samples

• 2 conditions, each with several replicates

• Recall: log-Normal for microarrays

• Based on fitting actual data with many replicates

• No equivalent data for RNA-Seq

• So go back to first principles

RNA-Seq Setting

• $a_i$ copies of transcripts from gene $i$

• Total number of molecules: $N$

• Choose $M$ of these molecules for sequencing, chosen at random

• Probability that a particular molecule falls in this sample of size $M$ is $M/N$

RNA-Seq Counts Distribution

• How many of the $a_i$ copies of transcripts from gene $i$ are chosen for sequencing?

• How is this quantity distributed?

• Hypergeometric Distribution

Hypergeometric Distribution

• $N$ items, of which $a_i$ are red and $N - a_i$ are black

• If $M$ of the $N$ items are sampled at random

• How many reds are in the sample?
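The number of reds in such a sample follows the hypergeometric distribution. As a minimal sketch, its probabilities can be computed directly from binomial coefficients. The function name and the toy urn below are illustrative assumptions, not values from the slides.

```python
from math import comb

def hypergeom_pmf(k, total, reds, sample_size):
    """Probability of exactly k reds when sample_size items are drawn
    without replacement from `total` items, `reds` of which are red."""
    if k < 0 or k > sample_size:
        return 0.0
    # comb() returns 0 when asked to choose more items than are available.
    return comb(reds, k) * comb(total - reds, sample_size - k) / comb(total, sample_size)

# Toy urn (assumed numbers): 50 items, 10 of them red, sample of 5.
pmf = [hypergeom_pmf(k, 50, 10, 5) for k in range(6)]
mean_reds = sum(k * p for k, p in enumerate(pmf))
print(pmf)
print(mean_reds)
```

The mean number of reds equals sample size times the red fraction, which for RNA-Seq corresponds to the number of sequenced molecules times the fraction of molecules that come from the gene.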

Simplifying the Hypergeometric Distribution

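• Simplify the hypergeometric probability of $k$ reds in the sample:

$$\Pr(k \text{ reds}) = \frac{\binom{a_i}{k}\binom{N - a_i}{M - k}}{\binom{N}{M}}$$

• Assuming $N$ is large relative to both $a_i$ and $M$, this is approximately

$$\frac{a_i^k}{k!}\left(\frac{M}{N}\right)^k e^{-a_i M/N}$$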
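A quick numerical check can make the approximation concrete. The sketch below compares the exact hypergeometric probabilities with the Poisson form for one set of made-up values; `N`, `a`, and `M` are assumed numbers chosen so that the total is much larger than both the gene's copy count and the sample size.

```python
from math import comb, exp, lgamma, log

def hypergeom_pmf(k, total, reds, sample_size):
    """Exact probability of k reds in a sample drawn without replacement."""
    if k < 0 or k > sample_size:
        return 0.0
    return comb(reds, k) * comb(total - reds, sample_size - k) / comb(total, sample_size)

def poisson_pmf(k, lam):
    """lam^k e^(-lam) / k!, computed on the log scale to avoid overflow."""
    return exp(k * log(lam) - lam - lgamma(k + 1))

# Assumed toy values: N molecules in total, a from the gene, M sequenced.
N, a, M = 200_000, 1_000, 2_000
lam = a * M / N  # Poisson rate implied by the approximation (here 10.0)

# Largest absolute gap between the two probability functions over k = 0..40.
max_gap = max(abs(hypergeom_pmf(k, N, a, M) - poisson_pmf(k, lam)) for k in range(41))
print(lam, max_gap)
```

With these values the two distributions agree closely; the gap grows as `a` or `M` become a larger fraction of `N`.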

The Poisson Distribution

λ = $a_i M / N$

λ is both mean and variance

$a_i$, $M$ and $N$ are all unknown and subsumed within λ
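That the Poisson mean and variance coincide can be checked numerically. The sketch below sums over the probability function for a few assumed rates; the helper names and the truncation point `k_max` are illustrative choices.

```python
from math import exp, lgamma, log

def poisson_pmf(k, lam):
    """Poisson probability of k events, computed on the log scale."""
    return exp(k * log(lam) - lam - lgamma(k + 1))

def poisson_mean_and_variance(lam, k_max=500):
    """Mean and variance obtained by summing over k = 0..k_max.
    k_max is an assumed cutoff, ample for the small rates used here."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

for lam in (0.5, 10.0, 50.0):
    print(lam, poisson_mean_and_variance(lam))
```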

The Poisson Distribution (Wikipedia)

• The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was made famous by a book of Ladislaus Josephovich Bortkiewicz (1868–1931).

• The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy Gosset (1876–1937).

• The number of phone calls arriving at a call centre per minute.

• The number of goals in sports involving two competing teams.

• The number of deaths per year in a given age group.

• The number of jumps in a stock price in a given time interval.

• Under an assumption of homogeneity, the number of times a web server is accessed per minute.

• The number of mutations in a given stretch of DNA after a certain amount of radiation.

• The proportion of cells that will be infected at a given multiplicity of infection.


Is Mean = Variance for NGS?

– Variance ∝ Mean²

Log scale: the white line is the Poisson line

Why This Over-Dispersion?

• The Poisson model only models technical variation, not biological variation

• Biological variation induces more variance than captured by the Poisson model

– No reason for a difference from microarrays, where SD ∝ Mean (or Variance ∝ Mean²)

Figure: SD vs Mean for Microarrays

Handling Over-Dispersion

• Counts are Poisson(λ), where λ itself comes from a distribution $X$ with mean $\mu$ and variance $\sigma^2$

• Variance of the counts $= \mu + \sigma^2$
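The extra variance follows from the law of total variance. A short derivation, writing $Y$ for the observed count (a label assumed here) and taking the Poisson rate λ to be random with mean $\mu$ and variance $\sigma^2$:

```latex
\begin{align*}
E[Y] &= E\big[E[Y \mid \lambda]\big] = E[\lambda] = \mu \\
\operatorname{Var}(Y) &= E\big[\operatorname{Var}(Y \mid \lambda)\big]
                       + \operatorname{Var}\big(E[Y \mid \lambda]\big) \\
                      &= E[\lambda] + \operatorname{Var}(\lambda) = \mu + \sigma^2
\end{align*}
```

Since $\sigma^2 \ge 0$, the variance of the counts is never below their mean, and it exceeds the mean whenever λ actually varies between samples.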

What Distribution is X?

• Log-Normal for Arrays?

• The combination of log-Normal and Poisson doesn’t have a neat closed form (i.e., formula)

• So assume Gamma distribution

– Poisson + Gamma -> Negative Binomial

– Used traditionally to fix the problem of over-dispersion
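The Poisson-plus-Gamma construction can be simulated to see the over-dispersion appear. The sketch below draws a Gamma-distributed rate, then a Poisson count at that rate, and compares the simulated moments with the standard Negative Binomial moments. The shape and scale values, the seed, and the helper names are assumptions made for illustration, and the Poisson sampler is Knuth's simple multiplication method rather than anything prescribed by the slides.

```python
import random
from math import exp

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate rates."""
    threshold = exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_gamma_poisson(shape, scale, rng):
    """Draw a Gamma(shape, scale) rate, then a Poisson count at that rate."""
    lam = rng.gammavariate(shape, scale)
    return sample_poisson(lam, rng)

def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, v

rng = random.Random(0)            # fixed seed so the run is repeatable
shape, scale = 5.0, 2.0           # assumed Gamma parameters: mean 10, variance 20
counts = [sample_gamma_poisson(shape, scale, rng) for _ in range(50_000)]
sim_mean, sim_var = mean_and_variance(counts)

# Matching Negative Binomial: r = shape tails, tails probability p = 1 / (1 + scale).
r, p = shape, 1.0 / (1.0 + scale)
nb_mean = r * (1 - p) / p
nb_var = r * (1 - p) / p ** 2
print(sim_mean, sim_var, nb_mean, nb_var)
```

The simulated variance sits well above the simulated mean, unlike a plain Poisson, and both match the Negative Binomial values up to sampling noise.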

The Gamma Distribution

• 2 parameters

– Shape

– Scale

• Lifespans are modeled as Gamma

Control on Right Tail

The Negative Binomial Distribution

• How many heads before you get $r$ tails?

• 2 parameters

– Tails probability $p$

– Number of tails $r$

• Mean $\mu = r(1-p)/p$

• Variance $\sigma^2 = r(1-p)/p^2$

Estimating Parameters

• 2 parameters

– Tails probability $p$

– Number of tails $r$

• $\mu = r(1-p)/p$

• $\sigma^2 = r(1-p)/p^2$

• For each gene, estimate the mean across replicates, and then estimate the variance from the curve fit above

• Then use these formulae to estimate $p$ and $r$
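Turning a mean and a variance into Negative Binomial parameters is a method-of-moments step. The sketch below assumes the "heads before $r$ tails" parameterization, with tails probability $p$, mean $r(1-p)/p$ and variance $r(1-p)/p^2$; solving those two equations gives the formulas used in the code. The function name and the example numbers are hypothetical.

```python
def nb_params_from_moments(mean, variance):
    """Method-of-moments estimates (p, r) for a Negative Binomial with
    mean r(1-p)/p and variance r(1-p)/p^2 (assumed parameterization)."""
    if mean <= 0 or variance <= mean:
        # Without over-dispersion (variance > mean) the formulas break down.
        raise ValueError("need mean > 0 and variance > mean")
    p = mean / variance
    r = mean ** 2 / (variance - mean)
    return p, r

# Example with assumed numbers: a gene with mean 10 and fitted variance 30.
p_hat, r_hat = nb_params_from_moments(10.0, 30.0)
print(p_hat, r_hat)
```

In the workflow described above, `mean` would be the per-gene average across replicates and `variance` the value read off the fitted mean-variance curve.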

Bias Correction

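• $\hat{\mu}$ and $\hat{\sigma}^2$ are unbiased estimates of $\mu$ and $\sigma^2$

• $\hat{p} = \hat{\mu}/\hat{\sigma}^2$ and $\hat{r} = \hat{\mu}^2/(\hat{\sigma}^2 - \hat{\mu})$ are not necessarily unbiased estimates of $p$ and $r$ respectively

• So bias correction is needed. How?

– Do theoretical simulations and see what the bias factor is

– Correct by this factor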
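One way such a simulation might look is sketched below for the tails probability only. It simulates Negative Binomial counts with known parameters, estimates the tails probability as mean over variance for each simulated gene, and averages the ratio of estimate to truth. Several things here are assumptions for illustration: the per-gene sample variance stands in for the curve-fitted variance, and the true parameters, replicate count, seed, and function names are made up.

```python
import random
from math import exp

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate rates."""
    threshold = exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_nb(r, p, rng):
    """Negative Binomial draw via a Gamma-distributed Poisson rate."""
    lam = rng.gammavariate(r, (1 - p) / p)
    return sample_poisson(lam, rng)

def bias_factor_for_p(r, p, n_replicates, n_trials, rng):
    """Average of p_hat / p over simulated genes, with p_hat = mean / variance."""
    ratios = []
    for _ in range(n_trials):
        counts = [sample_nb(r, p, rng) for _ in range(n_replicates)]
        m = sum(counts) / n_replicates
        v = sum((c - m) ** 2 for c in counts) / (n_replicates - 1)
        if v > 0:  # skip degenerate draws where all replicates are equal
            ratios.append((m / v) / p)
    return sum(ratios) / len(ratios)

def corrected_p(p_hat, factor):
    """Divide the raw estimate by the simulated bias factor."""
    return p_hat / factor

rng = random.Random(1)  # fixed seed so the run is repeatable
factor = bias_factor_for_p(r=5.0, p=1.0 / 3.0, n_replicates=10, n_trials=5000, rng=rng)
print(factor)
```

A factor above 1 means the raw estimate tends to overshoot, so dividing by the factor pulls it back; the same pattern would apply to the number-of-tails estimate.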

Thank You
