Introduction to Statistics III
Statistics for Next Generation Sequencing (RNA-Seq)
Distribution?
• 25000 genes, each with counts over several samples
• 2 conditions, each with several replicates
• Recall, log-Normal for Microarrays
• Based on fitting on actual data with many replicates
• No equivalent data for RNA-Seq
• So go back to first principles
RNA-Seq Setting
• $a_i$ copies of transcripts from gene $i$
• Total number of molecules $N$
• Choose $M$ of these molecules for sequencing; chosen at random
• Probability that a particular molecule falls in this sample of size $M$ is $M/N$
RNA-Seq Counts Distribution
• How many of the $a_i$ copies of transcripts from gene $i$ are chosen for sequencing?
• How is this quantity distributed?
• Hypergeometric Distribution
Hypergeometric Distribution
• $N$ items of which $a_i$ are red, $N - a_i$ are black
• If $M$ of the items are sampled at random
• How many reds are in the sample?
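The red/black sampling question above can be checked numerically. The following is a minimal sketch, assuming the symbols N (total items), a (red items) and M (sample size); the toy numbers are hypothetical and chosen only for illustration.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, a: int, M: int) -> float:
    """P(exactly k red items) when M items are drawn without replacement
    from N items, a of which are red."""
    if k < max(0, M - (N - a)) or k > min(a, M):
        return 0.0
    return comb(a, k) * comb(N - a, M - k) / comb(N, M)

# Hypothetical toy numbers (not from the slides).
N, a, M = 50, 5, 10
pmf = [hypergeom_pmf(k, N, a, M) for k in range(min(a, M) + 1)]

total = sum(pmf)                               # probabilities should sum to 1
mean = sum(k * p for k, p in enumerate(pmf))   # should match M * a / N
print(total, mean, M * a / N)
```

The mean M * a / N is the sample size times the fraction of red items, which is the same quantity that later plays the role of λ.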
Simplifying the Hypergeometric Distribution
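• Simplify $P(k) = \binom{a_i}{k}\binom{N - a_i}{M - k} \big/ \binom{N}{M}$
• Assuming $a_i$ and $M$ are both small relative to $N$, this is approximately
$$P(k) \approx \frac{1}{k!}\left(\frac{a_i M}{N}\right)^{k} e^{-a_i M / N}$$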
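To see how close the approximation is, one can compare the exact hypergeometric probabilities with the Poisson form for λ = a_i M / N. This is a sketch under assumed, hypothetical values of N, a and M in which both a and M are small relative to N; it is not taken from the slides.

```python
from math import comb, exp, factorial

def hypergeom_pmf(k, N, a, M):
    """Exact probability of k red items in a sample of M from N (a red)."""
    if k < max(0, M - (N - a)) or k > min(a, M):
        return 0.0
    return comb(a, k) * comb(N - a, M - k) / comb(N, M)

def poisson_pmf(k, lam):
    """Poisson probability of observing the count k when the mean is lam."""
    return lam ** k * exp(-lam) / factorial(k)

# Hypothetical sizes: a/N = 0.002 and M/N = 0.01, so lam = 2.
N, a, M = 100_000, 200, 1_000
lam = a * M / N

# Largest pointwise gap between the exact and approximate probabilities.
max_abs_diff = max(
    abs(hypergeom_pmf(k, N, a, M) - poisson_pmf(k, lam)) for k in range(15)
)
print(lam, max_abs_diff)
```

Increasing a or M relative to N in this sketch makes the gap grow, which is one way to probe where the approximation starts to break down.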
The Poisson Distribution
$\lambda = a_i M / N$
$\lambda$ is both mean and variance
$a_i$, $M$ and $N$ are all unknown and subsumed within $\lambda$
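The statement that λ is both the mean and the variance can be checked by simulation. The sketch below uses only the standard library; the sampler is Knuth's multiplication method (a choice made here, not something the slides prescribe), and the value of λ is hypothetical.

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method.
    Adequate for small lam; slow for large lam."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(0)   # fixed seed so the run is repeatable
lam = 4.0                # hypothetical value of lambda
counts = [sample_poisson(lam, rng) for _ in range(20_000)]

emp_mean = statistics.fmean(counts)
emp_var = statistics.pvariance(counts)
print(emp_mean, emp_var)   # both should be close to lam
```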
The Poisson Distribution (Wikipedia)
• The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was made famous by a book of Ladislaus Josephovich Bortkiewicz (1868–1931).
• The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy Gosset (1876–1937).
• The number of phone calls arriving at a call centre per minute.
• The number of goals in sports involving two competing teams.
• The number of deaths per year in a given age group.
• The number of jumps in a stock price in a given time interval.
• Under an assumption of homogeneity, the number of times a web server is accessed per minute.
• The number of mutations in a given stretch of DNA after a certain amount of radiation.
• The proportion of cells that will be infected at a given multiplicity of infection.
Is Mean = Variance for NGS?
– Variance ∝ Mean²
Figure (log scale): the white line is the Poisson line
Why this Over-Dispersion?
• The Poisson model only models technical variation, not biological variation
• Biological variation induces more variance than captured by the Poisson model
– No reason for difference from microarrays, where SD ∝ Mean (or Variance ∝ Mean²)
Figure: SD vs Mean for Microarrays
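Handling Over-Dispersion
• Counts follow Poisson($\lambda$), where $\lambda$ itself comes from a distribution with mean $\mu$ and variance $\sigma^2$
• Variance of the counts $= \mu + \sigma^2$: the Poisson variance plus the variance of $\lambda$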
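When λ is itself random, the law of total variance gives a count variance equal to the mean of λ plus the variance of λ. The sketch below checks this by simulation. The uniform distribution for λ is an arbitrary illustrative choice (the slides do not fix one at this point), and the names mu and sigma2 are assumptions standing for the mean and variance of λ.

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(2)
lo, hi = 2.0, 10.0                 # hypothetical range for lambda
mu = (lo + hi) / 2                 # mean of a uniform(lo, hi) lambda
sigma2 = (hi - lo) ** 2 / 12       # variance of a uniform(lo, hi) lambda

# Each count gets its own lambda, then a Poisson draw given that lambda.
counts = [sample_poisson(rng.uniform(lo, hi), rng) for _ in range(40_000)]

emp_mean = statistics.fmean(counts)
emp_var = statistics.pvariance(counts)
print(emp_mean, mu)                # mean of counts vs mean of lambda
print(emp_var, mu + sigma2)        # over-dispersed: variance exceeds the mean
```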
What Distribution is X?
• Log-Normal for Arrays?
• The combination of log-Normal and Poisson doesn’t have a neat closed form (i.e., formula)
• So assume Gamma distribution
– Poisson + Gamma -> Negative Binomial
– Used traditionally to fix the problem of over-dispersion
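The Poisson + Gamma -> Negative Binomial step can be illustrated by simulation: draw λ from a Gamma distribution, draw a Poisson count given that λ, and compare the observed frequencies with the negative binomial probabilities. This is a sketch with hypothetical shape and scale values; it assumes the standard correspondence in which the number of tails equals the Gamma shape and the tails probability equals 1 / (1 + scale).

```python
import math
import random
from collections import Counter

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def nb_pmf(k, shape, scale):
    """Negative binomial probability of k, written with the Gamma
    parameters (integer shape assumed so math.comb can be used)."""
    p = 1.0 / (1.0 + scale)
    return math.comb(k + shape - 1, k) * (1.0 - p) ** k * p ** shape

rng = random.Random(1)
shape, scale = 3, 2.0     # hypothetical Gamma parameters
n = 50_000

# random.gammavariate(alpha, beta) takes shape alpha and scale beta.
freq = Counter(
    sample_poisson(rng.gammavariate(shape, scale), rng) for _ in range(n)
)

# Largest gap between simulated frequencies and the NB probabilities.
max_gap = max(abs(freq[k] / n - nb_pmf(k, shape, scale)) for k in range(30))
print(max_gap)
```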
The Gamma Distribution
• 2 parameters
– Shape
– Scale
• Lifespans are modeled as Gamma
Control on Right Tail
The Negative Binomial Distribution
• How many heads before you get $r$ tails?
• 2 parameters
– Tails probability $p$
– Number of tails $r$
• Mean $= r(1-p)/p$
• Variance $= r(1-p)/p^2$
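The mean and variance of the "heads before tails" count can be checked directly from its probabilities. In the sketch below, p (tails probability) and r (number of tails) are assumed symbol names, the values are hypothetical, and the expected results r(1-p)/p and r(1-p)/p^2 are the standard formulas for this parametrization.

```python
from math import comb

def nb_pmf(k, r, p):
    """P(exactly k heads before the r-th tail) with tails probability p."""
    return comb(k + r - 1, k) * (1 - p) ** k * p ** r

r, p = 3, 0.25            # hypothetical parameter values
ks = range(400)           # truncation point; the remaining tail is negligible
probs = [nb_pmf(k, r, p) for k in ks]

mean = sum(k * q for k, q in zip(ks, probs))
var = sum((k - mean) ** 2 * q for k, q in zip(ks, probs))

print(mean, r * (1 - p) / p)        # mean formula
print(var, r * (1 - p) / p ** 2)    # variance formula
```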
Estimating Parameters
• 2 parameters
– Tails probability $p$
– Number of tails $r$
• Mean $= r(1-p)/p$
• Variance $= r(1-p)/p^2$
• For each gene, estimate the mean across replicates, and then estimate the variance from the curve fit above
• Then use these formulae to estimate $p$ and $r$
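Inverting the mean and variance formulas gives the two parameters from a gene's estimated mean and variance. The following is a minimal sketch; the function name nb_params_from_moments and the symbols p (tails probability) and r (number of tails) are assumptions made for illustration, and the variance is simply passed in rather than read off a curve fit.

```python
def nb_params_from_moments(mean, variance):
    """Solve mean = r(1-p)/p and variance = r(1-p)/p**2 for (p, r).
    Only defined when the variance exceeds the mean (over-dispersion)."""
    if variance <= mean:
        raise ValueError("negative binomial needs variance > mean")
    p = mean / variance
    r = mean ** 2 / (variance - mean)
    return p, r

# Hypothetical per-gene estimates: mean 6, variance 18.
p_hat, r_hat = nb_params_from_moments(6.0, 18.0)
print(p_hat, r_hat)

# Round trip: plugging the estimates back in recovers the inputs.
print(r_hat * (1 - p_hat) / p_hat, r_hat * (1 - p_hat) / p_hat ** 2)
```

When the variance is at or below the mean there is no over-dispersion to model, so the sketch raises an error rather than returning a negative or infinite r.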
Bias Correction
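• The estimated mean and variance are unbiased estimates of the true mean and variance
• $\hat{p} = \text{mean}/\text{variance}$ and $\hat{r} = \text{mean}^2/(\text{variance} - \text{mean})$, computed from those estimates, are not necessarily unbiased estimates of $p$ and $r$ respectively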
• So bias correction needed. How?
• Do theoretical simulations and see what the bias factor is
• Correct by this factor
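One way such a simulation could look is sketched below for the tails probability only. Several things here are assumptions rather than the slides' method: the symbols p and r, the hypothetical true values, the use of the plain sample variance in place of a curve-fit variance, and the definition of the bias factor as the average ratio of the estimate to the true value.

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_nb(p, r, rng):
    """Negative binomial draw via the Gamma-Poisson mixture
    (Gamma shape r, scale (1 - p) / p)."""
    return sample_poisson(rng.gammavariate(r, (1.0 - p) / p), rng)

def bias_factor_p(p, r, n_reps, n_sims, rng):
    """Average of p_hat / p over simulated genes with n_reps replicates."""
    ratios = []
    for _ in range(n_sims):
        counts = [sample_nb(p, r, rng) for _ in range(n_reps)]
        m = statistics.fmean(counts)
        v = statistics.variance(counts)   # unbiased sample variance
        if v > 0:                         # p_hat is undefined when v == 0
            ratios.append((m / v) / p)
    return statistics.fmean(ratios)

rng = random.Random(3)
true_p, true_r = 1.0 / 3.0, 3.0           # hypothetical true parameters
factor = bias_factor_p(true_p, true_r, n_reps=5, n_sims=2000, rng=rng)
print(factor)

def correct_p(p_hat):
    """Divide a raw estimate by the simulated bias factor."""
    return p_hat / factor
```

With many replicates per gene the factor should approach 1, so the correction matters mainly when replicates are few.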
Thank You