U.U.D.M. Project Report 2016:13 Examensarbete i matematik, 30 hp Handledare: Manfred Grabherr, Inst för medicinsk biokemi och mikrobiologi Ämnesgranskare: Ingemar Kaj Examinator: Magnus Jacobsson Juni 2016 Department of Mathematics Uppsala University Statistical modelling of gene expression data With applications to ribonucleic-acid-sequencing data of Escherichia Coli Torgny Karlsson


Page 1: Statistical modelling of gene expression data - DiVA portaluu.diva-portal.org/smash/get/diva2:934741/FULLTEXT01.pdf · desire to understand our surroundings has taken us upon a journey

U.U.D.M. Project Report 2016:13

Degree project in mathematics, 30 credits
Supervisor: Manfred Grabherr, Department of Medical Biochemistry and Microbiology
Subject reviewer: Ingemar Kaj
Examiner: Magnus Jacobsson
June 2016

Department of Mathematics
Uppsala University

Statistical modelling of gene expression data
With applications to ribonucleic-acid-sequencing data of Escherichia Coli

Torgny Karlsson


Statistical modelling of gene expression data

With applications to ribonucleic-acid-sequencing data of Escherichia Coli

Torgny Karlsson

A thesis presented for the degree of

Master of Science

Department of Mathematics

Department of Medical Biochemistry and Microbiology

Uppsala University, Sweden

February 22, 2016


Abstract

A new statistical model is introduced that simultaneously addresses the overall distribution of gene expression, as well as the gene-specific variance of the expression. The basic hierarchical model has three free parameters θ = (α, λ, γ). Parameter estimation of α and λ is performed simultaneously, by iteratively solving a system of moment equations, while the least squares method is used to estimate γ. The model is calibrated to experimental, properly normalised data of the Escherichia Coli bacterium. Within the framework of the model, an approximate null distribution of the conditional log-fold change is also given. An analytical approximation to the variance of the log-fold change is derived, from which approximate p-values can be calculated.


To Signe, Vilmer, and Ingalill


Acknowledgements

I would like to thank my principal supervisor Manfred Grabherr for taking me under his wings. He has been a guiding star and a true source of inspiration, not least for his in-depth physical insight and numerical wizardry. I’d like to think that he saw my background, not as a burden, but as a strength. For that, I am truly grateful. I would also like to thank my co-supervisor Ingemar Kaj. At times when I really needed it, his patient guidance and mathematical insight helped me to keep the focus and not to go astray. Thomas, my friend, I bow before you. You are a walking encyclopedia of molecular biology and genetics. Without you guiding me through the wonders of the living microcosmos and without our looong discussions about the intricate statistics behind RNA sequencing, I would not have been able to write this thesis. Also, thank you for commenting on the manuscript. Furthermore, thank you Gorel for explaining the gene expression process to me and for taking me to lunch on my first day. Thank you to the “people at BILS”, i.e., Henric, Mahesh, Lucile, Jaque, and Martin, and the “people at Uppsala Genome Center” for making me feel welcome and for helping me understand genetics and biology in general. It has been great to discuss all those small and big topics at fika times. I would also like to thank Elin for giving me Asa’s contact and Asa for giving me Manfred’s contact. You both led me to this opportunity. Last but not least, I thank Ingalill for standing beside me through these years of struggle. You are my true love in life. Let’s live it now.


Contents

1 Setting the stage
    1.1 Gene expression and regulation
    1.2 Sequencing and analysis of RNA-sequencing data
        1.2.1 Poisson-Gamma hierarchical model
        1.2.2 Some terminology

2 Experimental data of E. Coli
    2.1 Normalisation procedure

3 Basic model
    3.1 Examination of the experimental data
    3.2 Outline of the basic model
        3.2.1 Physical interpretation
        3.2.2 Mathematical formulation
        3.2.3 Comment on dependencies
    3.3 Dispersion relations

4 Derivation of probability densities
    4.1 Exact density functions of Y and X
    4.2 Approximation of the asymmetric Laplace distribution

5 Moments
    5.1 First moment of Y
    5.2 Approximations to the first and second moments of X

6 Estimation of model parameters
    6.1 Tail-index estimate of λ
    6.2 Method-of-moments estimates of α and λ
    6.3 Least-squares estimate of γ
    6.4 Random and systematic errors of the parameter estimators
        6.4.1 Characteristics of α(X) and λ(X)
        6.4.2 Characteristics of γ(Y | T = t)
    6.5 Model calibration: parameter estimates of the basic model

7 Performance of the basic model
    7.1 Distribution of the gene expression
    7.2 Observed dispersion relation
    7.3 Marginal distribution of the log-fold change
    7.4 Existence of super-stable genes?

8 Hypothesis testing of the observed gene regulation
    8.1 Conditional log-fold change distribution in the case of no treatment
    8.2 Calculation of p-values
        8.2.1 Limiting distributions
        8.2.2 Analytical approximation of the variance of L0 | T = t
        8.2.3 P-values
        8.2.4 Addressing the problem of multiple testing
    8.3 Correlation between L0 | T = t and the estimator η(t)

9 Comparison to other methods
    9.1 Comparison to the negative binomial distribution
    9.2 Comparison to results from limma

10 Future work
    10.1 Replicate-specific model with four parameters
        10.1.1 Note on parameter estimation
    10.2 Modelling of a positive biological variance at all gene expression levels
    10.3 Accounting for the correlation between genes
    10.4 Two-component mixture distribution of the intensities

11 Summary and concluding remarks

Chapter 1

Setting the stage

A restlessness to explore the unknown and an everlasting thirst for knowledge about our origin and place in the Universe are deemed insignia of humanity. Distinguishing features as they are, they may basically be very specific responses to an enhanced capability of adaptation. As astronomer Carl Sagan [1] elegantly put it: “The open road [...] softly calls, like a nearly forgotten song of childhood. We invest far-off places with a certain romance. This appeal, I suspect, has been meticulously crafted by natural selection as an essential element in our survival.” The desire to understand our surroundings has taken us upon a journey into macrocosmos and microcosmos alike. How life itself arose on Earth, how it is sustained, and ultimately how it may have appeared in other corners of the Universe, is arguably one of the most fundamental questions faced by science. It is an intriguing example of a recently commenced voyage of knowledge (recently commenced by us, that is).

One way of pursuing the study of the origin and sustainability of life and of living organisms is to examine the genetic code that is imprinted in the cell’s deoxyribonucleic acid (DNA). More specifically, we would like to understand why and to what level the units of the DNA that encode functional products like proteins, i.e., the genes, are expressed and how the gene expressions change in response to environmental stimuli, disease or treatment in order for the cell to maintain homeostasis. To begin with, we are thus interested to know whether a gene is regulated or not, e.g., after a treatment. Now, all measurements of gene expressions are, like any measured quantity, marred by experimental uncertainties. Furthermore, the biological processes that operate in the cell are driven by chemical reactions, which are essentially stochastic in nature. A variation in addition to the observation error is therefore to be expected. As a retribution for our ignorance, we are thus forced to assume that gene expressions cannot be known completely. We should therefore rephrase our earlier question and instead ask ourselves: “How do we know whether a gene is significantly regulated after treatment?” The standard method of statistical inference to deploy in these situations is statistical hypothesis testing. A gene is found to be significantly regulated if the observed difference in expression, before and after treatment, may be considered an unlikely event, given some null distribution that accounts for the expected, and mundane, uncertainties.

This thesis addresses two questions: 1) to what level active genes are expressed, and 2) the specific form of the null distribution for differential gene expression hypothesis testing. A statistical model is developed that simultaneously accounts for the overall distribution of the expressions of genes and the expected variance in expression for individual genes. The model is specifically calibrated to experimental data of Escherichia Coli.


Figure 1.1: Illustration of the gene expression process, general picture. A segment of the DNA, i.e., a gene, is replicated by RNA polymerase, and an mRNA molecule is produced (transcription). In the second step (translation), a polypeptide is synthesised from the mRNA, which subsequently is folded into a functional protein. Courtesy: https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology

1.1 Gene expression and regulation

In the cell, whether it is a eukaryote or prokaryote, functional products such as proteins are continuously synthesised for specific tasks to sustain the organism. A gene is a segment of the DNA that carries information on the molecular structure of a specific protein, and the process by which this genetic information is read and utilised to produce a protein is referred to as gene expression [2,3]. For prokaryotes like bacteria, the gene expression can be described as a two-step process: transcription, followed by translation. Note that in eukaryotes, the expression process is slightly more involved. In transcription [4], the gene is replicated to a nucleic acid called ribonucleic acid (RNA), or more specifically messenger RNA (mRNA). The transcription is performed by a compound of enzymes called RNA polymerase. The second step, translation [5], usually occurs when the mRNA molecule is still being transcribed (i.e., in prokaryotes). In this process, the mRNA is used as a template to produce a polypeptide, which is an unfolded precursor to the final protein.

Regulation of the gene expression occurs in response to a change in the protein demand [6], e.g., as a result of an environmental stimulus or a disease. In other words, gene regulation enables the cell to adapt to a new situation by controlling the supply of the functional products (proteins and RNA). Cellular differentiation, which gives rise to different cell types in multicellular organisms (like humans), is essentially a consequence of regulation of the gene expression in stem cells [7]. Bacteria are able to adapt to a dynamic environment in a particularly efficient way. This efficiency is achieved by clustering together genes whose products are all components of some common mechanism, like the transport and metabolism of lactose in E. Coli [8]. Such clusters of genes are called operons [9]. Genes belonging to an operon are co-regulated, which, in bacteria, occurs at the transcription level. In transcriptional regulation, the production of mRNA is controlled. Translational regulation may also occur, and refers to the control of the downstream protein production from the associated mRNA [10].

1.2 Sequencing and analysis of RNA-sequencing data

The concentration of mRNA of a given gene is commonly taken as a proxy for the gene’s “expression”, although other quantities may very well be considered, such as the concentration of the associated functional protein. The amount of mRNA in a given solution is measured using high-throughput sequencing, which is believed to be superior to the hybridisation-based microarray method [11]. In very general terms, the experimental solution of the transcriptome, referred to as the biological replicate, is prepared into a sequence library by fragmenting the extracted mRNA molecules into short segments of about 100–400 base pairs. In order to determine the relative fractions of RNA in the replicate, an amplified number of these segments (i.e., the library concentration) is sequenced until a given depth is reached [11]. The segment copies are called “reads”. The reads are then aligned and mapped back onto their parent genes. The total number of mapped reads that corresponds to a specific gene is referred to as the “count” of that gene. By dividing by the length of the gene (in units of kilo-base pairs) and the total number of counts in the sequenced sample (in units of millions of mapped reads), we arrive at the unit fragments-per-kilo-base-of-exon-per-million-reads-mapped (FPKM), which is the measure of the gene’s expression that will be used in this thesis.
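The FPKM normalisation described above can be sketched in a few lines. This is a minimal illustration only; the function name `fpkm` and its argument names are hypothetical and do not come from the thesis or from any sequencing toolkit.

```python
def fpkm(count, gene_length_bp, total_mapped_reads):
    """Convert a raw read count into FPKM.

    count              -- number of reads mapped to the gene
    gene_length_bp     -- gene length in base pairs
    total_mapped_reads -- total number of mapped reads in the sample
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    length_kb = gene_length_bp / 1e3          # gene length in kilo-base pairs
    depth_millions = total_mapped_reads / 1e6  # sequencing depth in millions of reads
    return count / (length_kb * depth_millions)


# A 2000 bp gene with 500 mapped reads in a sample of 10 million mapped reads.
print(fpkm(500, 2000, 10_000_000))
```

Dividing by the gene length makes long and short genes comparable, while dividing by the sequencing depth makes samples of different depth comparable.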

1.2.1 Poisson-Gamma hierarchical model

A common approach to analysing gene expression data and determining whether a gene is significantly regulated or not is to analyse the read counts and compare the measured counts before and after treatment. As will be discussed later on (see, e.g., Chapt. 3), gene expression data are over-dispersed, wherefore the error function is often modelled by a negative binomial [12,13]. Note that the predominant argument for a negative binomial distribution is simply mathematical convenience. Let the random variable $Y_t$, associated with a given gene $t$, denote the concentration of mRNA over all possible libraries and let the variate $y_t$ denote the concentration in a specific library. Furthermore, the distribution of the counts, $\Upsilon_t$, from the high-throughput sequencing, given the concentration (or event rate) $y_t$, is considered to be Poisson s.t.

$$
\Upsilon_t \mid Y_t = y_t \sim \mathrm{Po}(s y_t), \tag{1.1}
$$

where $s$ denotes the normalisation or size factor of the sequenced replicate. The size factor $s \neq 1$ in situations where not all replicates of the experiment are sequenced to the same depth. Now, out of plain convenience, suppose that

$$
Y_t \sim \mathrm{Gamma}(k, \theta), \tag{1.2}
$$

with shape parameter $k > 0$ and scale parameter $\theta > 0$. From the law of total variance, we then have that

$$
\mathrm{Var}(\Upsilon_t) = E(\mathrm{Var}(\Upsilon_t \mid Y_t)) + \mathrm{Var}(E(\Upsilon_t \mid Y_t)) = E(sY_t) + \mathrm{Var}(sY_t) = s k\theta + s^2 k\theta^2. \tag{1.3}
$$

By defining the mean count $\eta_t := E(\Upsilon_t) = E(E(\Upsilon_t \mid Y_t)) = sk\theta$, we may write the variance as

$$
\mathrm{Var}(\Upsilon_t) = \eta_t + \frac{1}{k}\,\eta_t^2. \tag{1.4}
$$

Next, let us show that $\Upsilon_t$ indeed follows a negative binomial distribution. By marginalising over $y_t$, we formally have that

$$
P(\Upsilon_t = \upsilon) := p_{\Upsilon_t}(\upsilon) = \int_{-\infty}^{\infty} p_{\Upsilon_t \mid Y_t}(\upsilon \mid y_t)\, f_{Y_t}(y_t)\, dy_t, \qquad \upsilon = 0, 1, 2, \dots \tag{1.5}
$$

where

$$
p_{\Upsilon_t \mid Y_t}(\upsilon \mid y_t) = P(\Upsilon_t = \upsilon \mid Y_t = y_t) = e^{-s y_t}\,\frac{(s y_t)^{\upsilon}}{\upsilon!}, \qquad \upsilon = 0, 1, 2, \dots \tag{1.6}
$$

is the probability mass function of the Poisson distribution with event rate $s y_t > 0$, while

$$
f_{Y_t}(y_t) = \frac{y_t^{k-1} e^{-y_t/\theta}}{\theta^k\,\Gamma(k)}, \qquad y_t > 0 \tag{1.7}
$$

is the density function of the gamma distribution. Hence, with this particular choice of distribution for $Y_t$, we have that

$$
\begin{aligned}
p_{\Upsilon_t}(\upsilon) &= \int_0^{\infty} e^{-s y_t}\,\frac{(s y_t)^{\upsilon}}{\upsilon!} \cdot \frac{y_t^{k-1} e^{-y_t/\theta}}{\theta^k\,\Gamma(k)}\, dy_t
= \frac{s^{\upsilon}}{\upsilon!\,\theta^k\,\Gamma(k)} \int_0^{\infty} y_t^{\upsilon+k-1} e^{-(s+1/\theta) y_t}\, dy_t \\
&= \frac{s^{\upsilon}}{\upsilon!\,\theta^k\,\Gamma(k)} \cdot \frac{\Gamma(\upsilon+k)}{(s+1/\theta)^{\upsilon+k}} \int_0^{\infty} \frac{(s+1/\theta)^{\upsilon+k}}{\Gamma(\upsilon+k)}\, y_t^{\upsilon+k-1} e^{-(s+1/\theta) y_t}\, dy_t \\
&= \frac{\Gamma(\upsilon+k)}{\upsilon!\,\Gamma(k)} \cdot \frac{s^{\upsilon}}{\theta^k (s+1/\theta)^{\upsilon+k}}
= \frac{\Gamma(\upsilon+k)}{\upsilon!\,\Gamma(k)} \cdot \frac{s^{\upsilon} k^{\upsilon+k} \theta^{\upsilon}}{(sk\theta+k)^{\upsilon+k}} \\
&= \frac{\Gamma(\upsilon+k)}{\upsilon!\,\Gamma(k)} \cdot \left(\frac{k}{k+sk\theta}\right)^{k} \left(\frac{sk\theta}{k+sk\theta}\right)^{\upsilon}, \qquad \upsilon = 0, 1, 2, \dots,
\end{aligned}
\tag{1.8}
$$

where $p_{\Upsilon_t}(\upsilon)$ denotes the probability mass function of a negative binomial distribution with expectation $E(\Upsilon_t) = sk\theta$ and variance $\mathrm{Var}(\Upsilon_t) = sk\theta + s^2 k\theta^2$, as in Eq. (1.3).
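The marginalisation leading to Eq. (1.8) can be checked numerically. The sketch below compares the closed-form negative binomial mass function with a direct numerical integration of the Poisson mass function against the gamma density, as in Eq. (1.5). The function names, the parameter values (k = 2.5, θ = 4, s = 1.5) and the integration settings are illustrative assumptions, not values from the thesis; the midpoint rule is only expected to behave well for k ≥ 1.

```python
import math


def nb_pmf(v, k, theta, s=1.0):
    """Closed-form negative binomial mass function of Eq. (1.8)."""
    p = k / (k + s * k * theta)  # equals 1 / (1 + s * theta)
    log_pmf = (math.lgamma(v + k) - math.lgamma(v + 1) - math.lgamma(k)
               + k * math.log(p) + v * math.log(1.0 - p))
    return math.exp(log_pmf)


def mixture_pmf(v, k, theta, s=1.0, steps=20000):
    """Eq. (1.5) by midpoint-rule integration over the gamma-distributed rate."""
    upper = theta * (k + 12.0 * math.sqrt(k) + 40.0)  # heuristic cut-off (assumption)
    h = upper / steps
    total = 0.0
    for n in range(steps):
        y = (n + 0.5) * h  # midpoint, so y > 0
        log_poisson = -s * y + v * math.log(s * y) - math.lgamma(v + 1)
        log_gamma = ((k - 1.0) * math.log(y) - y / theta
                     - k * math.log(theta) - math.lgamma(k))
        total += math.exp(log_poisson + log_gamma)
    return total * h


k, theta, s = 2.5, 4.0, 1.5
for v in (0, 5, 15, 40):
    print(v, nb_pmf(v, k, theta, s), mixture_pmf(v, k, theta, s))

# Mean and variance from the mass function, to compare with Eqs. (1.3) and (1.4).
mean = sum(v * nb_pmf(v, k, theta, s) for v in range(600))
var = sum((v - mean) ** 2 * nb_pmf(v, k, theta, s) for v in range(600))
print(mean, s * k * theta)
print(var, mean + mean ** 2 / k)
```

The two columns printed in the loop should agree to within the integration error, and the computed variance exceeds the mean by the quadratic term of Eq. (1.4).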

The variance described by the expression in Eq. (1.4) is an example of so-called over-dispersion, where the variance is greater than what would be expected from a statistical model based on the Poisson distribution. As already mentioned, the negative binomial is routinely used in situations where over-dispersion is found in the data. However, as we shall see, this is not the only distribution with a variance-to-mean ratio greater than unity.

1.2.2 Some terminology

In what follows, both genes and their transcriptional counterpart, the mRNA, will commonly be referred to as “genes”. A bit sloppily, we will sometimes also denote the associated gene product (e.g., the protein) by the term “gene”. Furthermore, by “gene expression”, we may, apart from the biological process described in Sect. 1.1, either refer to the random variable describing the distribution of the (normalised) mRNA concentrations of the expressed genes in the cell, or the concentration of mRNA (measured or modelled) belonging to a specific type of gene. For example, if high concentrations of mRNA, corresponding to some given gene, are measured in all libraries, this will be referred to as a “highly expressed gene”. It should be clear from context which of the two “gene expressions” is meant.


Chapter 2

Experimental data of E. Coli

In this chapter, we will introduce the experimental data that are used to calibrate the statistical model. The setup of the experiment, the preparation of the libraries et cetera are discussed in detail in [14]. There are, in total, nine biological replicates for which RNA sequencing was performed. These replicates are organised into one control group (0min) and two treatment groups (30min and 90min) with three replicates in each group. However, the three groups are not independent. The experimental setup consists of three biological replicates from which samples are extracted at three different time points. At an early stage of the (exponential) bacterial curve of growth, a sample from each of the three replicates was extracted and sequenced (the 0min group). Simultaneously, the replicates were exposed to a DNA-damaging agent (mitomycin C). After 30 minutes, another sample set was extracted and sequenced (30min group) and finally, a third set was extracted after 90 minutes (90min group). Table 2.1 gives the identifications of all nine replicates in the experiment. In this study, we will pay extra attention to the 0min control group, which signifies the “null state”. Moreover, the three groups will be treated as being independent, even though this is not exactly the case, as noted above.

Table 2.1: Identification of the E. Coli replicates

Biological replicate j    1       2       3
0 min(utes)               BB09    BB10    BB17
30 min(utes)              BB19    BB20    BB21
90 min(utes)              BB11    BB12    BB18

2.1 Normalisation procedure

If, e.g., the libraries are not sequenced to the same depth, the data must be normalised in order for the expression of individual genes to be comparable between replicates. Normally, this is done under the assumption that the vast majority of genes are unregulated during a treatment or a disease. However, in organisms with a relatively short genome, or in experiments in which the treatment affects the overall state of the organism (like the effect of an antibiotic on a culture of E. Coli), a significant fraction of all genes may be differentially expressed. In such situations, a standard normalisation procedure may fail and lead to inaccurate results, wherefore more robust alternatives have recently been developed.

One way to attack the problem is to identify specific reference genes, which are known to be invariant in expression for a large set of different conditions, and then normalise the data using these genes as standards. In E. Coli, six such genes are known, which is not much. In order to increase this number, [14] developed a numerical method to identify reference genes in silico. This identification can be regarded as an optimisation problem, or more specifically a particular type of linear programming called the minimum cost network flow problem [15]. In particular, the idea is to find the cheapest path in a directed, acyclic $n_s$-dimensional graph ($n_s$ denotes the total number of samples), from the gene with the lowest non-zero expression (source) to the most highly expressed gene (target), given a number of constraints. Any edge $(i, j)$ that connects two nodes (genes) in the graph points in the direction from the lower ranked gene to the higher ranked gene, where the ranking, $i = 1, \dots, n$, is determined by sorting the values of the gene expression averaged over all samples. The genes with identically zero expression are omitted from the analysis. Note that in the graph, each gene is connected to every other gene. The linear program may be written in the form ($j > i$)

$$
\begin{aligned}
\text{minimise}\quad & z = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} c_{ij}\, x_{ij}, \\
\text{subject to}\quad & \sum_{j=2}^{n} x_{ij} = 1, && i = 1, \\
& \sum_{j=i+1}^{n} x_{ij} - \sum_{j=1}^{i-1} x_{ji} = b_i, && i = 2, \dots, n-1, \\
& -\sum_{j=1}^{n-1} x_{ji} = -1, && i = n,
\end{aligned}
\tag{2.1}
$$

where $x_{ij} \in \{0, 1\}$ is the indicator variable that signals whether the edge between the two genes $i$ and $j$ belongs to the cheapest path ($x_{ij} = 1$) or not ($x_{ij} = 0$), $c_{ij}$ is the cost associated with the path connecting gene $i$ and $j$, $b_i$ denotes the source or sink at node $i$, while $n$ denotes the number of nodes. The cost $c_{ij}$ is given by

$$
c_{ij} = d_{ij} + (j - i - 1)\,m + k_{ij}\,h, \qquad j > i, \tag{2.2}
$$

where $d_{ij}$ is the normalised Euclidean distance between gene $i$ and $j$, $m = 4.0$ is a flat score introduced as a penalty for taking a path between two genes which are not immediately adjacent in ranking, and $k_{ij} h$ ($h = 5.0$ is constant) denotes the penalty given to those edges for which a number of $k_{ij}$ samples of the higher ranked gene $j$ have a lower expression than the corresponding samples of the lower ranked gene $i$. Finally, the sources and sinks of the interior nodes $i = 2, \dots, n-1$ are given by

$$
b_i =
\begin{cases}
-r, & i = \text{ranking of known reference gene}, \\
0, & \text{otherwise}.
\end{cases}
\tag{2.3}
$$

In order to speed up the computation, the linear program in Eq. (2.1) is solved by applying a dynamic programming algorithm. Also, in this way, the costs $c_{ij}$, given by the expression in Eq. (2.2), do not have to be pre-computed. The specific values of the reward sinks $b_i = -r = -800$ ($r \gg 1$ to ensure that known reference genes are included in the path), and the penalties $m = 4.0$ and $h = 5.0$, which control the number of identified in silico reference genes, are chosen such that at most one of the six reference genes in E. coli (the highly expressed ssrA) would have a p-value of p < 0.0001 in the 30-to-0 and 90-to-0-min comparisons. In case no reference genes are provided, the algorithm selects the statistically best estimate from the data.
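A cheapest-path search of this kind can be sketched with a simple dynamic programme over the ranked genes. The code below is not the algorithm of [14]; it is a minimal illustration under stated assumptions: the rows of `expr` are genes already sorted by mean expression, the normalised distance $d_{ij}$ is replaced by a placeholder (the root-mean-square difference over the samples), and the reward $r$ is subtracted whenever the path enters an interior node marked as a known reference gene. The names `edge_cost` and `cheapest_path` are hypothetical.

```python
import math


def edge_cost(expr, i, j, m=4.0, h=5.0):
    """Cost c_ij of Eq. (2.2) for 0-based ranks i < j, computed on demand."""
    lower, higher = expr[i], expr[j]
    # Placeholder for the normalised Euclidean distance d_ij (assumption):
    # root-mean-square difference over the samples.
    d = math.sqrt(sum((b - a) ** 2 for a, b in zip(lower, higher)) / len(lower))
    # k_ij: samples in which the higher ranked gene lies below the lower ranked one.
    k = sum(1 for a, b in zip(lower, higher) if b < a)
    return d + (j - i - 1) * m + k * h


def cheapest_path(expr, reference=frozenset(), r=800.0, m=4.0, h=5.0):
    """Return (path, cost) from the lowest to the highest ranked gene."""
    n = len(expr)
    best = [math.inf] * n  # best[j]: cheapest cost of any path from node 0 to node j
    prev = [-1] * n        # predecessor of node j on that path
    best[0] = 0.0
    for j in range(1, n):
        # Reward for passing through an interior known reference gene, cf. Eq. (2.3).
        reward = r if (j in reference and j < n - 1) else 0.0
        for i in range(j):
            cost = best[i] + edge_cost(expr, i, j, m, h) - reward
            if cost < best[j]:
                best[j], prev[j] = cost, i
    path = [n - 1]
    while path[-1] != 0:
        path.append(prev[path[-1]])
    return path[::-1], best[n - 1]


# Four genes (rows) in three samples (columns), sorted by mean expression.
expr = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [9.0, 0.5, 0.5], [4.0, 4.0, 4.0]]
print(cheapest_path(expr))                 # the erratic third gene is skipped
print(cheapest_path(expr, reference={2}))  # the reward forces it onto the path
```

Because every edge points from a lower to a higher rank, processing the nodes in rank order visits each edge exactly once, and each cost is evaluated only when it is needed.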


Chapter 3

Basic model

The statistical parametric model of gene expression data is first and foremost observationally driven in the sense that it is specifically developed to fit the experimental data of E. Coli. That said, the model is based on general theoretical arguments regarding molecular growth and the expected random error in a physical signal that involves counting.

3.1 Examination of the experimental data

In the quest for a suitable statistical model, we will start out from the experimental data as presented in Figs. 3.1 and 3.2. Figure 3.1 shows the empirical probability density function of the logarithmic gene expression of the 0min BB09 sample. The tail of the distribution on the right-hand side of the peak appears quite close to an exponential. In contrast, on the left-hand side, a sharp cut-off in the density of genes is observed. We will assume that this cut-off is a result of the E. Coli bacteria (and bacteria in general) being efficient, single-cell organisms in the sense that nearly all genes in their genome are active.

In Fig. 3.2, the observed standard deviation of each gene is plotted against its average expression level. We note that at low expression levels, the general trend of the standard deviation (albeit with a large scatter) appears to have an $\eta^{1/2}$ behaviour, where $\eta$ denotes the expression level. At high expression levels, however, the trend increases faster than $\eta^{1/2}$. These trends are the manifestations of the data being affected by both “technical” and “biological” variation. Here, we will take the technical variation to be the result of the inherent uncertainty in the sequencing procedure (measurement error) while the biological variation is everything else, including the uncertainty originating from the underlying biological processes (hence its name). Finally, we note that close to the cut-off expression level $\eta = \eta_c \simeq 50$, there exist a number of genes with very low variance. The possible existence of so-called super-stable genes, which are expressed with an extremely low intrinsic uncertainty, will be briefly discussed in Sect. 7.4.

3.2 Outline of the basic model

The basic model is set to describe both the distribution of expression of active genes for individual biological replicates and the conditional distribution of expression between replicates. This will be accomplished with a setup of three free parameters, θ = (α, λ, γ), which are intended to capture the features identified in Figs. 3.1 and 3.2. The parameter α is a location parameter which specifies the value of the cut-off expression level, the shape parameter λ determines the exponential slope of the tail at high expression levels, and γ is a variance-scaling parameter which determines the magnitude of the biological variation. We stress that the basic model applies only to the population of active genes. The population of inactive genes, i.e., those which are not expressed or merely accidentally expressed, is not explicitly accounted for. As regards E. Coli, nearly all genes are actively expressed.

[Figure 3.1. Horizontal axis: logarithmic gene expression. Vertical axis: log(probability density).]

Figure 3.1: Empirical probability density function of gene expression. The black curve denotes the observed density of genes with a specific expression level in the BB09 replicate. The dashed magenta line indicates an exponential decay.

[Figure 3.2. Horizontal axis: log(average gene expression). Vertical axis: log(dispersion).]

Figure 3.2: Observed dispersion relation. The grey circles denote the observed standard deviation for each expressed E. Coli gene $i$ from the 0min samples, plotted against its average expression level $\eta_i$ (in FPKM units). The black, dotted line denotes a (shifted) dispersion relation $\propto \sqrt{\eta}$ and is plotted to highlight the apparent square-root dominance in the data around the cut-off level $\eta = \eta_c$. Evidently, the dispersion relation increases faster than $\sqrt{\eta}$ at higher expression levels. In the region within the red, dashed lines, a number of genes with very low variance are found. Cyan circles denote the 0min genes with standard deviation $\log_{10}(\omega) < -0.5$.

Although the measuring of a gene expression inherently is a counting matter, we have chosen to work with expressions in FPKM units, where fractions of whole numbers are obtained. Moreover, if expression levels are not too low, the difference between a discrete and a continuous random variable is small, e.g., when computing p-values. For convenience, we will therefore assume that all gene expressions discussed here are well described by continuous random variables.
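The claim that discreteness matters little at expression levels that are not too low can be illustrated with a small numerical comparison. The sketch below is an assumption-laden example rather than part of the model: it takes a Poisson count as the discrete variable and a normal distribution with the same mean and variance (with a continuity correction) as one possible continuous stand-in, and compares the resulting upper-tail probabilities at a low and a high rate. The function names and the chosen rates are illustrative.

```python
import math


def poisson_upper_tail(x, rate):
    """Exact P(N >= x) for N ~ Poisson(rate) and integer x >= 0."""
    lower = sum(math.exp(-rate + n * math.log(rate) - math.lgamma(n + 1))
                for n in range(x))
    return 1.0 - lower


def normal_upper_tail(x, rate):
    """Continuous approximation of P(N >= x) with a continuity correction."""
    z = (x - 0.5 - rate) / math.sqrt(rate)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


for rate, x in ((2.0, 5), (100.0, 110)):
    exact = poisson_upper_tail(x, rate)
    approx = normal_upper_tail(x, rate)
    rel_err = abs(approx - exact) / exact
    print(f"rate={rate:6.1f}  x={x:4d}  exact={exact:.4f}  "
          f"continuous={approx:.4f}  relative error={rel_err:.3f}")
```

The relative error of the continuous approximation should be noticeably smaller at the higher rate, in line with the statement above.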

3.2.1 Physical interpretation

Let an intensity of expression, t, be assigned to each gene. We postulate that this expression intensity, encoded in the DNA of the organism, signifies the gene’s intrinsic rate of being expressed in a given condition or situation. It embodies all the biological processes, such as molecular motors and positive and negative feedback effects, that control and ultimately determine the level of expression. If the conditions change (e.g., due to a disease or a treatment that the organism reacts to), the expression intensity may, or may not, change. The expression intensity will occasionally also be referred to as the expression growth rate.

Now, let us consider a simple model for the gene expression growth. Denote the concentrationof mRNA, associated with a specific gene with expression intensity t, at time τk by Υt (τk ). Dueto the inherent randomness in all chemical and biological processes that operate in the cell, letthe rate of growth of the mRNA be described by the random variable Rt (τk ). Note that Rt (τk )

depends on the expression intensity. Following Steindl [16] (see also [17]) who applied the ideaon the growth of firms, we have the difference equation (i.e., assuming discrete time)

Υt (τk ) − Υt (τk−1) = Rt (τk )Υt (τk−1), (3.1)

where we have implicitly assumed that the rate of growth is independent of the concentrationof mRNA (cf. [18]). Eq. (3.1) leads to the recursive equation

Υt (τk ) = (1 + Rt (τk ))Υt (τk−1) = (1 + Rt (τk )) · (1 + Rt (τk−1))Υt (τk−2) = ... =

= (1 + Rt (τk )) · ... · (1 + Rt (τ1))Υt (τ0) (3.2)

By taking the logarithm of both sides, we obtain

ln(Υt (τk )) = ln (1 + Rt (τk )) + ... + ln (1 + Rt (τ1)) + ln(Υt (τ0)). (3.3)

Under the assumption that all Rt (τk ) are “stochastically small” s.t.

P( |Rt (τk ) | > r) ≪ 1 for some r ≪ 1, (3.4)

we may write ln(1 + Rt (τk )) ≈ Rt (τk ). Hence, if P(ln(Υt (τ0)) > x) ≪ P(ln(Υt (τk )) > x) ∀x, ask →∞, we have that

ln(Υt (τk )) ≈ Rt (τ1) + ... + Rt (τk ). (3.5)

Assuming that the central limit theorem holds and that the limits µt = E (limk→∞ ln(Υt (τk )))

and σ2t = V ar (limk→∞ ln(Υt (τk ))) exist, we then have that

limk→∞

ln(Υt (τk )) = ln(Υt ) ∼ N (µt, σ2t ). (3.6)
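The passage from Eq. (3.2) to Eq. (3.6) can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: it assumes Gaussian growth rates with hypothetical values for their mean and standard deviation, sets the initial concentration to unity, and accumulates the logarithm of the product in Eq. (3.2).

```python
import math
import random
import statistics

def simulate_log_concentration(k, mean_r, sd_r, rng, log_start=0.0):
    """Return ln(Upsilon_t(tau_k)) after k multiplicative growth steps, Eq. (3.2)."""
    log_conc = log_start
    for _ in range(k):
        r = rng.gauss(mean_r, sd_r)   # growth rate R_t; the Gaussian form is an assumption
        log_conc += math.log1p(r)     # ln(1 + R), one term of the sum in Eq. (3.3)
    return log_conc

rng = random.Random(1)                # fixed seed for reproducibility
sample = [simulate_log_concentration(200, 0.01, 0.02, rng) for _ in range(2000)]
log_mean = statistics.fmean(sample)   # close to 200 * 0.01, cf. Eq. (3.5)
log_sd = statistics.stdev(sample)     # close to 0.02 * sqrt(200)
```

Since the growth rates are small, the sample mean and standard deviation of the logarithmic concentration should lie near the values expected from the plain sum in Eq. (3.5).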


As regards the technical variation, which is a result of random sampling effects during the sequencing, it can simply be modelled by the Poisson process, which is a limiting case of the Binomial process. As in Eq. (3.6), denote the concentration of mRNA corresponding to a gene with expression intensity $t$ by $\Upsilon_t$ and let $\upsilon_t$ be a realisation of $\Upsilon_t$ corresponding to a specific replicate. The observed concentration of mRNA molecules after sequencing is then distributed according to a Poisson distribution with rate parameter $\upsilon_t$. If the expected concentration is high enough, say $\upsilon_t > 15$, which is the case for the vast majority of expressed genes in E. Coli, we may approximate the Poisson distribution with a normal distribution s.t.

\[ \Upsilon_{\mathrm{observed}} \approx N(\upsilon_t, \upsilon_t). \tag{3.7} \]
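To get a feeling for the quality of the approximation in Eq. (3.7), one may compare the Poisson probabilities with the normal density at integer counts. The sketch below is only an illustration; the function names are ours and the rates 15 and 100 are chosen as examples.

```python
import math

def poisson_pmf(count, rate):
    """Poisson probability of observing `count` events at the given rate."""
    return math.exp(count * math.log(rate) - rate - math.lgamma(count + 1))

def normal_pdf(x, mean, variance):
    """Density of N(mean, variance) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def max_abs_difference(rate):
    """Largest pointwise gap between Poisson(rate) and N(rate, rate) over integer counts."""
    upper = int(rate + 10.0 * math.sqrt(rate)) + 1
    return max(abs(poisson_pmf(c, rate) - normal_pdf(c, rate, rate)) for c in range(upper + 1))

gap_at_15 = max_abs_difference(15.0)
gap_at_100 = max_abs_difference(100.0)   # the gap shrinks as the rate grows
```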

3.2.2 Mathematical formulation

Let the random variable

\[ \mathcal{Y} = \text{“Observed gene expression”}. \tag{3.8} \]

We state that the distribution of $\mathcal{Y}$ is described by the statistical model $\mathcal{P}$ s.t.

\[ \mathcal{P} = \left\{ P(\theta)^{\otimes n} : \theta = (\alpha, \lambda, \gamma) \in \mathbb{R}^3_+ \right\}, \tag{3.9} \]

where $\mathcal{Y} \sim P(\theta)$ and $n$ is the number of actively expressed genes. As indicated in Eq. (3.9), the expression levels of any two genes are assumed to be independent (cf. Sect. 10.3). $P(\alpha, \lambda, \gamma)$ is the, as yet unspecified, distribution for which we would like to derive an expression.

Let the gene-specific expression intensities $t_i$, $i = 1, \dots, n$, be distributed according to the random variable $T$ (cf. Sect. 10.4) such that

\[ T \sim \mathrm{Exp}(\lambda), \tag{3.10} \]

with sample space $\Omega_T = [0, \infty)$. Now, assume that the growth model in Eq. (3.6) accurately describes the stochastic growth of mRNA in E. Coli and identify $\ln(\Upsilon_t)$ with the random variable $X|T = t$ = “Underlying logarithmic expression of gene with intensity $t$”. Here, the phrase underlying means that the modelled expression is not (yet) marred with technical noise. From experiments (see Figs. 3.1 and 3.2), we conclude that $\mu_t = \alpha + t$ and $\sigma_t^2 = \gamma t$, where $\alpha$ denotes the minimal expression intensity of the population of actively expressed genes and where $\gamma$ determines the size of the biological variation. Hence, we have that

\[ X|T = t \sim N(\alpha + t, \gamma t). \tag{3.11} \]

Note that $X$ is distributed according to an asymmetric Laplace law [19] (see also Sect. 4). If we let $Y$ = “Underlying gene expression” be given by the transformation $g: \mathbb{R} \mapsto \mathbb{R}_+$ where $y = g(x) := e^x$, we have

\[ Y \overset{D}{=} e^X. \tag{3.12} \]

Note that the conditional distribution corresponding to $\Upsilon_t$ in Eq. (3.6) is given by $Y|T = t \sim \mathrm{LogN}(\alpha + t, \gamma t)$. As we shall see in Sect. 8, this distribution is of central importance in the derivation of the log-fold change distribution. On the other hand, the marginal distribution $Y \overset{D}{=} e^X$ follows a broken power-law (see Sect. 4), also known as a double-Pareto distribution [20]. The final step accounts for the technical variation. If we take the observed FPKM value as a true measure of the gene count and assume that all replicates are sequenced to the same depth, we have, by Eq. (3.7), that

\[ \mathcal{Y}|Y = y \approx N(y, y). \tag{3.13} \]


For future reference, we note that in situations where the technical noise must be scaled, e.g., to account for different size factors of the replicates (see also Sect. 8), Eq. (3.13) can easily be modified s.t.

\[ \mathcal{Y}|Y = y \approx N(y, \delta y), \tag{3.14} \]

where $\delta$ denotes the scale factor. As described by Eqs. (3.10) – (3.13), the basic model is thus a hierarchical model. In order to account for the technical noise in a more realistic way and still keep the model continuous, we may formally define a continuous analogue of the Poisson distribution. Let the corresponding distribution function be defined as (cf. [21])

\[ F(x; \rho) := \begin{cases} 0, & x \le 0, \\ \dfrac{\Gamma(x + 1; \rho)}{\Gamma(x + 1)}, & x > 0, \end{cases} \tag{3.15} \]

where $\Gamma(x; \rho)$ denotes the incomplete $\Gamma$-function

\[ \Gamma(x; \rho) := \int_{\rho}^{\infty} e^{-t}\, t^{x-1}\, dt, \quad x > 0,\ \rho \ge 0, \tag{3.16} \]

while $\Gamma(x) = \Gamma(x; 0)$ denotes the complete $\Gamma$-function. We may then substitute Eq. (3.13) with

\[ \mathcal{Y}|Y = y \sim \mathrm{contPo}(y), \tag{3.17} \]

where $\mathrm{contPo}(\rho)$ denotes the continuous analogue of the Poisson distribution with event rate $\rho$.
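The distribution function in Eq. (3.15) can be evaluated with elementary means. The sketch below is one possible implementation, not the one used in this work: it computes the regularised lower incomplete Γ-function by its standard power series and takes the complement. The function names and the truncation tolerance are assumptions made for illustration.

```python
import math

def regularized_lower_gamma(shape, x, tol=1e-14, max_terms=10000):
    """Lower incomplete gamma function divided by Gamma(shape), via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= x / (shape + n)
        total += term
        if term < tol * total:
            break
    return total * math.exp(shape * math.log(x) - x - math.lgamma(shape))

def cont_poisson_cdf(x, rate):
    """F(x; rho) of Eq. (3.15), i.e. Gamma(x + 1; rho) / Gamma(x + 1) for x > 0."""
    if x <= 0.0:
        return 0.0
    return 1.0 - regularized_lower_gamma(x + 1.0, rate)
```

At integer arguments the function reproduces the ordinary Poisson distribution function, which provides a simple consistency check of the continuous analogue.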

3.2.3 Comment on dependencies

Let the observed expression of gene $i$, $i = 1, \dots, n$, in replicate $j$, $j = 1, \dots, m$, be denoted by

\[ \mathcal{Y}_{i,j} := \mathcal{Y}_j | T = t_i. \tag{3.18} \]

In the basic model, it is assumed that $\mathcal{Y}_{i,j}$ and $\mathcal{Y}_{i,j'}$, i.e., the expression of gene $i$ in two different replicates $j$ and $j'$, are independent. As hinted in the formulation of the model in Eq. (3.9), we also assume that $\mathcal{Y}_{i,j}$ and $\mathcal{Y}_{i',j}$, i.e., the expressions of two different genes $i$ and $i'$ within a single replicate $j$, are independent. Consequently, also $\mathcal{Y}_{i,j}$ and $\mathcal{Y}_{i',j'}$, $i \ne i'$, $j \ne j'$, are assumed to be independent. We will therefore disregard the observed correlations between, e.g., $\mathcal{Y}_{i,j}$ and $\mathcal{Y}_{i',j}$, as briefly discussed in Sect. 10.3.

3.3 Dispersion relations

By a dispersion relation we refer to the function $\omega: \mathbb{R}_+ \mapsto \mathbb{R}_+$, which describes how the variance, or rather the standard deviation, of the gene expression varies with expression level. The dispersion relation can be derived without knowledge of the probability density functions. Let the expected expression level, for a gene with expression intensity $t$, be denoted by $\eta$. Within the framework of the basic model, we may then calculate the underlying dispersion relation s.t.

\[ \eta := E(Y|T = t) = E\left(e^X | T = t\right) = e^{\alpha + t + \gamma t/2}. \tag{3.19} \]

This implies that

\[ t = \ln\left(\eta^{(1 + \gamma/2)^{-1}}\right) - \alpha (1 + \gamma/2)^{-1}, \quad \eta \ge e^{\alpha}. \tag{3.20} \]


Furthermore, we have that

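\[ \begin{aligned} \omega^2(\eta) := \mathrm{Var}(Y|T = t) &= \mathrm{Var}\left(e^X | T = t\right) = \eta^2\left(e^{\gamma t} - 1\right) \\ &= \eta^2\left(e^{-\alpha(\gamma^{-1} + 1/2)^{-1}}\, \eta^{(\gamma^{-1} + 1/2)^{-1}} - 1\right) = \eta^2 \cdot \left(a\eta^b - 1\right), \quad t \ge 0, \end{aligned} \tag{3.21} \]

where

\[ b = \frac{1}{1/\gamma + 1/2}, \tag{3.22} \]

\[ a = e^{-\alpha b}. \tag{3.23} \]

What about the dispersion relation of the observed gene expression of the basic model? Let $Y_t := Y|T = t$. Similar to Eq. (3.13), we may write $\mathcal{Y}_t | Y_t = y \approx N(y, y)$, where $\mathcal{Y}_t := \mathcal{Y}|T = t$ denotes the observed gene expression conditioned on $T = t$. By the law of total expectation, we then have that

\[ E(\mathcal{Y}_t) = E(\mathcal{Y}|T = t) = E_{Y_t}\left(E_{\mathcal{Y}_t|Y_t}(\mathcal{Y}_t|Y_t)\right) = E_{Y_t}(Y_t) = E(Y|T = t) = \eta. \tag{3.24} \]

Hence, the expectation of $\mathcal{Y}_t$ equals that of the underlying expression, by construction. Similarly, by the law of total variance, we have that

\[ \begin{aligned} \mathrm{Var}(\mathcal{Y}|T = t) &= \mathrm{Var}_{Y_t}\left(E_{\mathcal{Y}_t|Y_t}(\mathcal{Y}_t|Y_t)\right) + E_{Y_t}\left(\mathrm{Var}_{\mathcal{Y}_t|Y_t}(\mathcal{Y}_t|Y_t)\right) \\ &= \mathrm{Var}_{Y_t}(Y_t) + E_{Y_t}(Y_t) = \mathrm{Var}(Y|T = t) + E(Y|T = t) \\ &= \eta^2 \cdot (a\eta^b - 1) + \eta, \quad t \ge 0. \end{aligned} \tag{3.25} \]

The resulting theoretical dispersion relation is thus given by

\[ \omega(\eta) = \sqrt{\mathrm{Var}(\mathcal{Y}|T = t)} = \sqrt{\eta^2 \cdot (a\eta^b - 1) + \eta}, \quad t \ge 0. \tag{3.26} \]

Note that $\omega(\eta)$ does not depend on the parameter $\lambda$. Eq. (3.26) should be compared to the (squared) dispersion relation in Eq. (1.4), corresponding to a negative binomial distribution. The observed dispersion relation for the E. Coli data, computed from the three 0min replicates, is depicted in Fig. 3.2.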
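As an illustration of Eq. (3.26), the theoretical dispersion relation may be evaluated as in the following sketch. The function name and the parameter values used in the example are ours; the domain check reflects the condition $t \ge 0$, i.e., $\eta \ge e^{\alpha}$.

```python
import math

def dispersion(eta, alpha, gamma):
    """Theoretical standard deviation omega(eta) of the observed expression, Eq. (3.26)."""
    if eta < math.exp(alpha):
        raise ValueError("the relation is derived for eta >= exp(alpha), i.e. t >= 0")
    b = 1.0 / (1.0 / gamma + 0.5)                 # Eq. (3.22)
    a = math.exp(-alpha * b)                      # Eq. (3.23)
    biological = eta ** 2 * (a * eta ** b - 1.0)  # underlying variance, Eq. (3.21)
    return math.sqrt(biological + eta)            # technical (Poisson-like) term added

omega_at_cutoff = dispersion(math.exp(3.7), alpha=3.7, gamma=0.005)
omega_high = dispersion(1.0e4, alpha=3.7, gamma=0.005)
```

At $\eta = e^{\alpha}$ the biological term vanishes and $\omega = \sqrt{\eta}$, whereas at high expression levels $\omega$ exceeds $\sqrt{\eta}$, in line with the behaviour seen in Fig. 3.2.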


Chapter 4

Derivation of probability densities

In this chapter, we will derive the probability density function, $f_{\mathcal{Y}}(\cdot)$, of the observed gene expression $\mathcal{Y}$ as well as the corresponding, transformed density function, $f_{\mathcal{X}}(\cdot)$, of the logarithmic gene expression.

4.1 Exact density functions of $\mathcal{Y}$ and $\mathcal{X}$

Following Sect. 3.2.2, let $T \sim \mathrm{Exp}(\lambda)$ and $X|T = t \sim N(\alpha + t, \gamma t)$. Hence, the random variable $X$ is distributed according to an asymmetric Laplace distribution s.t. $X \sim \mathrm{AL}(\alpha, \beta_L, \beta_R)$, where $\alpha$ denotes the location parameter which determines the mode of the distribution, $\beta_L = \beta_L(\gamma, \lambda) > 0$ is the exponential slope of the tail to the left of the mode, while $\beta_R = \beta_R(\gamma, \lambda) > 0$ is the corresponding slope of the right-hand-side tail (cf., e.g., [19]). The probability density function of $X$ is given by

\[ f_X(x) = \begin{cases} \dfrac{\lambda}{\xi}\, e^{\beta_L (x - \alpha)}, & x < \alpha, \\[1ex] \dfrac{\lambda}{\xi}\, e^{-\beta_R (x - \alpha)}, & x \ge \alpha, \end{cases} \tag{4.1} \]

where

\[ \xi = \xi(\gamma, \lambda) = \sqrt{2\gamma\lambda + 1}, \tag{4.2} \]

\[ \beta_L = (\xi + 1)/\gamma, \quad \text{and} \tag{4.3} \]

\[ \beta_R = (\xi - 1)/\gamma. \tag{4.4} \]

From Eqs. (4.3) and (4.4), we have that $\beta_L = \beta_R + 2/\gamma$. Note also that (e.g., from L’Hôpital’s rule), the two exponents

\[ \beta_R \to \lambda, \tag{4.5} \]

\[ \beta_L \to 2/\gamma \to \infty, \tag{4.6} \]

as $\gamma \to 0$. This implies that for small $\gamma$, the asymmetric Laplace distribution resembles a shifted exponential distribution. This approximation can sometimes be useful (see Sect. 4.2).

Now, let $Y \overset{D}{=} e^X$ and let $y = e^x = g(x)$ be the exponential function. Since $g(x)$ is strictly increasing and invertible, the distribution function corresponding to $Y$, for $y < e^{\alpha}$, is given by

\[ F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)) = F_X(\ln(y)), \quad y \in [0, e^{\alpha}). \tag{4.7} \]


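Because $X$ is an absolutely continuous random variable and $g(x)$ is differentiable, it follows that

\[ f_Y(y) := \frac{d}{dy}\left(F_Y(y)\right) = f_X(g^{-1}(y))\, \frac{dg^{-1}(y)}{dy} = f_X(\ln(y)) \cdot y^{-1}, \tag{4.8} \]

for $y \in [0, e^{\alpha})$. Together with the first expression in Eq. (4.1), we thus have that

\[ f_Y(y) = \frac{\lambda}{\xi}\, e^{\beta_L (\ln y - \alpha)} \cdot y^{-1} = \frac{\lambda}{\xi}\, e^{-\alpha\beta_L} \cdot y^{\beta_L - 1} = \frac{\lambda e^{-\alpha}}{\xi} \left(\frac{y}{e^{\alpha}}\right)^{\beta_L - 1}, \quad y \in [0, e^{\alpha}). \tag{4.9} \]

The power-law tail for $y \ge e^{\alpha}$ is derived in a similar fashion. Taken together, we have that

\[ f_Y(y) = \begin{cases} \dfrac{\lambda}{\xi}\, e^{-\alpha\beta_L} \cdot y^{\beta_L - 1}, & y \in [0, e^{\alpha}), \\[1ex] \dfrac{\lambda}{\xi}\, e^{\alpha\beta_R} \cdot y^{-(\beta_R + 1)}, & y \in [e^{\alpha}, \infty), \\[1ex] 0, & \text{elsewhere}. \end{cases} \tag{4.10} \]

Note that the parameter $\beta_R$ denotes the slope of the right-end tail of the distribution function of the random variable $Y$. As indicated by Eq. (4.10), $Y$ follows a double-Pareto distribution [20], i.e., a broken power-law.

From Eq. (4.4), the parameter $\lambda$ can be expressed in terms of $\beta_R$ and $\gamma$ such that

\[ \beta_R = (\xi - 1)/\gamma \;\Rightarrow\; (\gamma\beta_R + 1)^2 = \xi^2 \;\Rightarrow\; \gamma^2\beta_R^2 + 2\gamma\beta_R + 1 = 2\gamma\lambda + 1 \;\Rightarrow\; \lambda = \beta_R + (\gamma/2) \cdot \beta_R^2, \tag{4.11} \]

which indicates that $\lambda \ge \beta_R$ for $\gamma \in [0, \infty)$. The term $(\gamma/2) \cdot \beta_R^2$ may also be viewed as a “correction term” for the slope $\lambda$ when $\gamma$ is non-negligible.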
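The relations between $(\gamma, \lambda)$ and the slopes of the asymmetric Laplace distribution in Eqs. (4.2) – (4.4) and (4.11) are easily checked numerically. The following sketch uses function names of our own choosing and, as example values, $\lambda = 1.1$ and $\gamma = 0.005$.

```python
import math

def laplace_parameters(lam, gamma):
    """Return (xi, beta_L, beta_R) according to Eqs. (4.2)-(4.4)."""
    xi = math.sqrt(2.0 * gamma * lam + 1.0)
    return xi, (xi + 1.0) / gamma, (xi - 1.0) / gamma

def lambda_from_slope(beta_right, gamma):
    """Invert Eq. (4.4) for lambda, as in Eq. (4.11)."""
    return beta_right + 0.5 * gamma * beta_right ** 2

xi, beta_left, beta_right = laplace_parameters(lam=1.1, gamma=0.005)
# Integral of f_X in Eq. (4.1) over the real line: the two exponential tails add up to one.
total_mass = (1.1 / xi) * (1.0 / beta_left + 1.0 / beta_right)
```

The last line confirms that the density in Eq. (4.1) is properly normalised, since $(\lambda/\xi)(1/\beta_L + 1/\beta_R) = 1$.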

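Now, according to the proposed model in Sect. 3.2.2, the conditional distribution of the observed gene expression given the underlying expression is $\mathcal{Y}|Y = y \approx N(y, y)$. Hence, the probability density function of $\mathcal{Y}|Y = y$ is given by $f_{\mathcal{Y}|Y}(\upsilon|y) = (2\pi y)^{-1/2} \cdot e^{-(\upsilon - y)^2/2y}$. Then, by marginalising over $y$, we obtain

\[ \begin{aligned} f_{\mathcal{Y}}(\upsilon) &= \int_{y=-\infty}^{\infty} f_{Y,\mathcal{Y}}(y, \upsilon)\, dy = \int_{y=-\infty}^{\infty} f_{\mathcal{Y}|Y}(\upsilon|y) \cdot f_Y(y)\, dy \\ &= \int_{y=0}^{e^{\alpha}} \frac{1}{\sqrt{2\pi y}}\, e^{-(\upsilon - y)^2/2y} \cdot \frac{\lambda e^{-\alpha\beta_L}}{\xi}\, y^{\beta_L - 1}\, dy + \int_{y=e^{\alpha}}^{\infty} \frac{1}{\sqrt{2\pi y}}\, e^{-(\upsilon - y)^2/2y} \cdot \frac{\lambda e^{\alpha\beta_R}}{\xi}\, y^{-(\beta_R + 1)}\, dy \\ &= \frac{\lambda e^{-\alpha\beta_L}}{\sqrt{2\pi\xi^2}} \int_{y=0}^{e^{\alpha}} y^{\beta_L - 3/2} \cdot e^{-(\upsilon - y)^2/2y}\, dy + \frac{\lambda e^{\alpha\beta_R}}{\sqrt{2\pi\xi^2}} \int_{y=e^{\alpha}}^{\infty} y^{-(\beta_R + 3/2)} \cdot e^{-(\upsilon - y)^2/2y}\, dy, \end{aligned} \tag{4.12} \]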


for υ > 0. Finally, let

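\[ \mathcal{X} = \text{“Observed logarithmic gene expression”}. \tag{4.13} \]

By transforming back to log space where $x = h(\upsilon) = \ln(\upsilon)$, we obtain

\[ F_{\mathcal{X}}(x) = P(\mathcal{X} \le x) = P(\ln(\mathcal{Y}) \le x) = P(\mathcal{Y} \le e^x) = F_{\mathcal{Y}}(e^x) = F_{\mathcal{Y}}(h^{-1}(x)). \tag{4.14} \]

Hence, similar to Eq. (4.8), the probability density function of $\mathcal{X}$ is formally given by

\[ f_{\mathcal{X}}(x) = f_{\mathcal{Y}}(e^x) \cdot e^x. \tag{4.15} \]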
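Since the integrals in Eq. (4.12) are not available in closed form, the observed densities must be evaluated numerically. The sketch below is one simple way of doing so and is not necessarily the procedure used in this work: after the substitution $y = e^x$, for which $f_Y(y)\,dy = f_X(x)\,dx$, the integral is approximated by the midpoint rule. The integration limits and the number of steps are assumptions chosen for illustration.

```python
import math

def underlying_log_density(x, alpha, lam, gamma):
    """Asymmetric Laplace density f_X of Eq. (4.1)."""
    xi = math.sqrt(2.0 * gamma * lam + 1.0)
    beta_left, beta_right = (xi + 1.0) / gamma, (xi - 1.0) / gamma
    slope = beta_left if x < alpha else -beta_right
    return (lam / xi) * math.exp(slope * (x - alpha))

def observed_density(upsilon, alpha, lam, gamma, x_low=-10.0, x_high=25.0, steps=70000):
    """Midpoint-rule approximation of Eq. (4.12), integrating over x = ln(y)."""
    width = (x_high - x_low) / steps
    total = 0.0
    for i in range(steps):
        x = x_low + (i + 0.5) * width
        y = math.exp(x)
        # Normal technical-noise density f(upsilon | y) with mean y and variance y.
        noise = math.exp(-(upsilon - y) ** 2 / (2.0 * y)) / math.sqrt(2.0 * math.pi * y)
        total += noise * underlying_log_density(x, alpha, lam, gamma)
    return total * width

def observed_log_density(x, alpha, lam, gamma):
    """Density of the observed logarithmic expression, Eq. (4.15)."""
    return observed_density(math.exp(x), alpha, lam, gamma) * math.exp(x)
```

At high expression levels, where the technical noise is small compared to the expression itself, the observed density should approach the underlying power-law tail in Eq. (4.10).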

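In the alternative model with an additional variance-scaling factor, we have instead that $\mathcal{Y}|Y = y \approx N(y, \delta y)$. The corresponding conditional probability density function is then given by $f_{\mathcal{Y}|Y}(\upsilon|y) = (2\pi\delta y)^{-1/2} \cdot e^{-(\upsilon - y)^2/2\delta y}$. Thus,

\[ \begin{aligned} f_{\mathcal{Y}}(\upsilon) &= \int_{y=0}^{e^{\alpha}} \frac{1}{\sqrt{2\pi\delta y}}\, e^{-(\upsilon - y)^2/2\delta y} \cdot \frac{\lambda e^{-\alpha\beta_L}}{\xi}\, y^{\beta_L - 1}\, dy + \int_{y=e^{\alpha}}^{\infty} \frac{1}{\sqrt{2\pi\delta y}}\, e^{-(\upsilon - y)^2/2\delta y} \cdot \frac{\lambda e^{\alpha\beta_R}}{\xi}\, y^{-(\beta_R + 1)}\, dy \\ &= \frac{\lambda e^{-\alpha\beta_L}}{\sqrt{2\pi\delta\xi^2}} \int_{y=0}^{e^{\alpha}} y^{\beta_L - 3/2} \cdot e^{-(\upsilon - y)^2/2\delta y}\, dy + \frac{\lambda e^{\alpha\beta_R}}{\sqrt{2\pi\delta\xi^2}} \int_{y=e^{\alpha}}^{\infty} y^{-(\beta_R + 3/2)} \cdot e^{-(\upsilon - y)^2/2\delta y}\, dy, \quad \upsilon > 0. \end{aligned} \tag{4.16} \]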

4.2 Approximation of the asymmetric Laplace distribution

From Eq. (4.11), it is clear that when $\gamma \ll 1$, we have $\lambda \approx \beta_R$. For example, if $\gamma = 0.01$ and $\beta_R = 1$, the “correction term” $(\gamma/2) \cdot \beta_R^2$ constitutes about 0.5% of the slope $\lambda$. At the same time, the parameter $\xi$ approaches unity. Hence, in the limit $\gamma \to 0$, we may assume that the random variable $X$ is distributed according to a shifted exponential s.t.

\[ X \overset{D}{=} \alpha + T, \tag{4.17} \]

where $T \sim \mathrm{Exp}(\lambda)$. The density function of the shifted exponential is given by

\[ f_X(x) = \lambda e^{\alpha\lambda} e^{-\lambda x}, \quad x \in [\alpha, \infty). \tag{4.18} \]

As a result, we have that

\[ Y \overset{D}{=} e^X \sim \mathrm{Pareto}(e^{\alpha}, \lambda), \tag{4.19} \]

i.e., the underlying gene expression $Y$ is given by a power-law distribution. To see this, we make use of Eqs. (4.7) and (4.8) together with (4.18) to obtain

\[ f_Y(y) = \lambda e^{\alpha\lambda}\, e^{-\lambda \ln y} \cdot y^{-1} = \lambda e^{\alpha\lambda} \cdot y^{-(\lambda + 1)}, \quad y \in [e^{\alpha}, \infty), \tag{4.20} \]


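which is exactly the probability density function of the Pareto distribution with scale parameter $e^{\alpha}$ and shape parameter $\lambda$. Again, by marginalising over $y$, we may calculate the density function corresponding to the observed gene expression

\[ \begin{aligned} f_{\mathcal{Y}}(\upsilon) &= \int_{y=-\infty}^{\infty} f_{Y,\mathcal{Y}}(y, \upsilon)\, dy = \int_{y=-\infty}^{\infty} f_{\mathcal{Y}|Y}(\upsilon|y) \cdot f_Y(y)\, dy \\ &= \int_{y=e^{\alpha}}^{\infty} \frac{1}{\sqrt{2\pi y}}\, e^{-(\upsilon - y)^2/2y} \cdot \lambda e^{\alpha\lambda}\, y^{-(\lambda + 1)}\, dy \\ &= \frac{\lambda e^{\alpha\lambda}}{\sqrt{2\pi}} \int_{y=e^{\alpha}}^{\infty} y^{-(\lambda + 3/2)} \cdot e^{-(\upsilon - y)^2/2y}\, dy, \quad \upsilon > 0. \end{aligned} \tag{4.21} \]

The density function of $\mathcal{X}$ is then given by Eq. (4.15) with $f_{\mathcal{Y}}(\cdot)$ given by Eq. (4.21).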


Chapter 5

Moments

In summary, $\mathcal{Y}$ denotes the random variable that models the observed gene expression (i.e., including technical noise) in linear space while $\mathcal{X}$ denotes the corresponding random variable in logarithmic space. The random variables $X$ and $Y$, on the other hand, denote the underlying gene expression in logarithmic and linear space, respectively. In this section, we will discuss the first two moments of these random variables. The moments may provide useful information, e.g., to be used in model parameter estimation.

5.1 First moment of $\mathcal{Y}$

Ideally, we would like to know the moments of $\mathcal{Y}$ and/or $\mathcal{X}$ since they may be compared directly to the experimental data. Unfortunately, the probability density functions of $\mathcal{Y}$ and $\mathcal{X}$ are not available in closed form (see Sect. 4), which makes it difficult to derive simple expressions for their moments from the standard definition. However, by the law of total expectation, we have

\[ E(\mathcal{Y}) = E(E(\mathcal{Y}|Y)) = E(Y), \tag{5.1} \]

by construction of the basic model (see Sect. 3.2.2). Furthermore, we have that

\[ \mathrm{Var}(\mathcal{Y}) = E(\mathrm{Var}(\mathcal{Y}|Y)) + \mathrm{Var}(E(\mathcal{Y}|Y)) = E(Y) + \mathrm{Var}(Y). \tag{5.2} \]

The expectation of $Y$, assuming that $\beta_R > 1$, is given by

\[ \begin{aligned} E(Y) &= \int_{-\infty}^{\infty} y f_Y(y)\, dy = \frac{\lambda}{\xi}\, e^{-\alpha\beta_L} \cdot \int_{y=0}^{e^{\alpha}} y \cdot y^{\beta_L - 1}\, dy + \frac{\lambda}{\xi}\, e^{\alpha\beta_R} \cdot \int_{y=e^{\alpha}}^{\infty} y \cdot y^{-(\beta_R + 1)}\, dy \\ &= \frac{\lambda e^{-\alpha\beta_L}}{\xi} \left[\frac{y^{\beta_L + 1}}{\beta_L + 1}\right]_{0}^{e^{\alpha}} - \frac{\lambda e^{\alpha\beta_R}}{\xi} \left[\frac{y^{-\beta_R + 1}}{\beta_R - 1}\right]_{e^{\alpha}}^{\infty} \\ &= \frac{\lambda e^{\alpha}}{\xi} \cdot \left(\frac{1}{\beta_L + 1} + \frac{1}{\beta_R - 1}\right). \end{aligned} \tag{5.3} \]

This expression can be reduced s.t.

\[ \mu_Y := E(Y) = e^{\alpha}\, \frac{\lambda}{\lambda - \gamma/2 - 1}, \tag{5.4} \]

where $\mu_Y$ denotes the expectation of $Y$. It then follows that


\[ \alpha = \ln(\mu_Y) - \ln\left(\frac{\lambda}{\lambda - \gamma/2 - 1}\right). \tag{5.5} \]
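The reduction from Eq. (5.3) to Eq. (5.4), and its inversion in Eq. (5.5), can be verified numerically. The sketch below uses function names of our own choosing and the example values $\alpha = 3.7$, $\lambda = 1.1$, and $\gamma = 0.005$, for which $\beta_R > 1$.

```python
import math

def mean_from_integrals(alpha, lam, gamma):
    """E(Y) as the sum of the two evaluated integrals in Eq. (5.3)."""
    xi = math.sqrt(2.0 * gamma * lam + 1.0)
    beta_left, beta_right = (xi + 1.0) / gamma, (xi - 1.0) / gamma
    if beta_right <= 1.0:
        raise ValueError("E(Y) is finite only for beta_R > 1")
    return (lam * math.exp(alpha) / xi) * (1.0 / (beta_left + 1.0) + 1.0 / (beta_right - 1.0))

def mean_closed_form(alpha, lam, gamma):
    """E(Y) in the reduced form of Eq. (5.4)."""
    return math.exp(alpha) * lam / (lam - gamma / 2.0 - 1.0)

def alpha_from_mean(mean_y, lam, gamma):
    """Location parameter alpha recovered from E(Y), Eq. (5.5)."""
    return math.log(mean_y) - math.log(lam / (lam - gamma / 2.0 - 1.0))

mu_y = mean_closed_form(3.7, 1.1, 0.005)
```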

For $\beta_R > 2$, we have that

\[ \mathrm{Var}(Y) = \mathrm{Var}(e^X) = E(e^{2X}) - E(e^X)^2 = M_X(2) - M_X(1)^2, \tag{5.6} \]

where $M_X(\cdot)$ denotes the moment-generating function of $X$. Unfortunately, the gene expression data of E. Coli exhibit an observed slope of $1 < \beta_R < 2$, which means that the variance of $Y$ is undefined. Therefore, also $\mathrm{Var}(\mathcal{Y})$ in Eq. (5.2) is undefined. This is to be expected since $Y$ follows a double-Pareto distribution with fat tails.

5.2 Approximations to the first and second moments of $\mathcal{X}$

The random variable corresponding to $Y$ in logarithmic space, $X$, follows instead an asymmetric Laplace distribution. Due to its superior tail behaviour as compared to $Y$, well-defined moments exist for $\beta_R > 0$. Unfortunately, the sought-after moments of $\mathcal{X}$ are not that easily obtained. From Eq. (5.4) and Jensen’s inequality for concave functions, we have that

\[ \begin{aligned} E(\mathcal{X}) = E(\ln(\mathcal{Y})) &\le \ln\left(E(\mathcal{Y})\right) = \ln\left(E(Y)\right) = \ln\left(e^{\alpha}\, \frac{\lambda}{\lambda - \gamma/2 - 1}\right) \\ &= \alpha + \ln\left(\frac{\lambda}{\lambda - \gamma/2 - 1}\right) = \alpha - \ln\left(1 - \frac{\gamma/2 + 1}{\lambda}\right). \end{aligned} \tag{5.7} \]

Now, to first order, the Maclaurin series of $\ln(1 - z)$ is $-z + O(z^2)$, $|z| < 1$. In particular, for $|(\gamma/2 + 1)/\lambda| < 1$, we have that

\[ E(\mathcal{X}) \le \ln\left(E(Y)\right) \approx \alpha + \frac{1}{\lambda} + \frac{\gamma/2}{\lambda}. \tag{5.8} \]

In Eq. (5.8), we have carelessly dropped a non-negligible fraction of the Maclaurin series and the inequality may no longer hold. By construction, we reckon that the expectations of $\mathcal{X}$ and the underlying expression $X$ should be comparable. Hence, for future reference, the expectation of $X$ is given by

\[ E(X) = E(E(X|T)) = E(\alpha + T) = \alpha + \frac{1}{\lambda}. \tag{5.9} \]

Note the similarity with Eq. (5.8). As it turns out (see below), $E(\mathcal{X}) \approx E(X)$ is quite a good approximation. In fact, since the technical noise makes the left tail of $f_{\mathcal{X}}$ heavier, more so than the right tail, we expect that $E(\mathcal{X}) \le E(X)$. Similarly, by the law of total variance, we have that

\[ \mathrm{Var}(X) = \mathrm{Var}(E(X|T)) + E(\mathrm{Var}(X|T)) = \mathrm{Var}(\alpha + T) + E(\gamma T) = \frac{1}{\lambda^2} + \frac{\gamma}{\lambda}. \tag{5.10} \]

Clearly, $\mathrm{Var}(\mathcal{X}) \ge \mathrm{Var}(X)$ since the technical noise added to $Y \overset{D}{=} e^X$ has a broadening effect on the resulting distribution. The effect is, however, not very large. Interestingly, by applying a variation of the delta method, we can do better. Suppose for the moment that $\mathcal{X}|X = x \sim D\left(\mu_{\mathcal{X}|X}(x), \sigma^2_{\mathcal{X}|X}(x)\right)$, where $D$ is some distribution s.t. $e^{\mathcal{X}}|X = x \overset{D}{=} \mathcal{Y}|X = x \overset{D}{=} \mathcal{Y}|Y = y$, where $y = e^x$. Hence,

\[ e^{\mathcal{X}}|X = x \approx N\left(\mu_{\mathcal{Y}|X}(x) = e^x,\ \sigma^2_{\mathcal{Y}|X}(x) = e^x\right). \tag{5.11} \]


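If so, we formally have that

\[ E(\mathcal{X}) = E(E(\mathcal{X}|X)) = E\left(\mu_{\mathcal{X}|X}(X)\right), \quad \text{and} \tag{5.12} \]

\[ \mathrm{Var}(\mathcal{X}) = \mathrm{Var}(E(\mathcal{X}|X)) + E(\mathrm{Var}(\mathcal{X}|X)) = \mathrm{Var}\left(\mu_{\mathcal{X}|X}(X)\right) + E\left(\sigma^2_{\mathcal{X}|X}(X)\right). \tag{5.13} \]

Now, let us express the mean $\mu_{\mathcal{X}|X}$ in terms of $\mu_{\mathcal{Y}|X} = y = e^x$ and $\sigma^2_{\mathcal{Y}|X} = y = e^x$. Assuming that the variates $y_i > 0\ \forall i$, we have

\[ \mu_{\mathcal{X}|X} = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} x_i = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \ln(y_i) = \lim_{n\to\infty} \ln\left(\prod_{i=1}^{n} y_i^{1/n}\right). \tag{5.14} \]

The Taylor expansion of the function $\ln(y)$ around the arithmetic mean $\mu_{\mathcal{Y}|X} > 0$ is given by

\[ \ln(y) = \ln\left(\mu_{\mathcal{Y}|X}\right) + \frac{y - \mu_{\mathcal{Y}|X}}{\mu_{\mathcal{Y}|X}} - \frac{1}{2}\, \frac{\left(y - \mu_{\mathcal{Y}|X}\right)^2}{\mu^2_{\mathcal{Y}|X}} + O\left((y - \mu_{\mathcal{Y}|X})^3\right). \tag{5.15} \]

Hence, by identifying $\sigma^2_{\mathcal{Y}|X} := \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \mu_{\mathcal{Y}|X}\right)^2$, we have that

\[ \mu_{\mathcal{X}|X} \approx \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \left(\ln\left(\mu_{\mathcal{Y}|X}\right) + \frac{y_i - \mu_{\mathcal{Y}|X}}{\mu_{\mathcal{Y}|X}} - \frac{1}{2}\, \frac{\left(y_i - \mu_{\mathcal{Y}|X}\right)^2}{\mu^2_{\mathcal{Y}|X}}\right) = \ln(\mu_{\mathcal{Y}|X}) - \frac{1}{2} \cdot \frac{\sigma^2_{\mathcal{Y}|X}}{\mu^2_{\mathcal{Y}|X}}. \tag{5.16} \]

Thus, up to second order

\[ \mu_{\mathcal{X}|X}(X) \approx \ln\left(e^X\right) - \frac{1}{2}\, \frac{e^X}{\left(e^X\right)^2} = X - \frac{1}{2}\, e^{-X}. \tag{5.17} \]

Similarly for the variance, we have that

\[ \sigma^2_{\mathcal{X}|X} = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \mu_{\mathcal{X}|X}\right)^2 = \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \ln(y_i)^2 - \mu^2_{\mathcal{X}|X}, \tag{5.18} \]

where $x_i = \ln(y_i)$. As above, the function $\ln(y)^2$ can be expanded around $y = a$ s.t.

\[ \ln(y)^2 = \ln(a)^2 + \frac{2\ln(a)}{a}\,(y - a) + \left(\frac{1}{a^2} - \frac{\ln(a)}{a^2}\right) \cdot (y - a)^2 + O\left((y - a)^3\right). \tag{5.19} \]

Therefore, with $a = \mu_{\mathcal{Y}|X}$ and $\mu^2_{\mathcal{X}|X} \approx \ln\left(\mu_{\mathcal{Y}|X}\right)^2 - \ln\left(\mu_{\mathcal{Y}|X}\right) \cdot \sigma^2_{\mathcal{Y}|X} / \mu^2_{\mathcal{Y}|X}$, disregarding all terms which go to zero faster than $1/\mu_{\mathcal{Y}|X} = e^{-x}$ as $x \to \infty$, we obtain

\[ \sigma^2_{\mathcal{X}|X} \approx \ln\left(\mu_{\mathcal{Y}|X}\right)^2 + \left(\frac{1}{\mu^2_{\mathcal{Y}|X}} - \frac{\ln\left(\mu_{\mathcal{Y}|X}\right)}{\mu^2_{\mathcal{Y}|X}}\right) \sigma^2_{\mathcal{Y}|X} - \mu^2_{\mathcal{X}|X} \approx \frac{\sigma^2_{\mathcal{Y}|X}}{\mu^2_{\mathcal{Y}|X}}. \tag{5.20} \]

Thus, for $\mu_{\mathcal{Y}|X}(x) = e^x$ and $\sigma^2_{\mathcal{Y}|X}(x) = e^x$ we get

\[ \sigma^2_{\mathcal{X}|X}(X) \approx e^{-X}. \tag{5.21} \]

Having these expressions, we are now able to derive the first and second moments of $\mathcal{X}$, accurate to the second order. Clearly,

\[ E(\mathcal{X}) = E\left(\mu_{\mathcal{X}|X}(X)\right) \approx E(X) - \frac{1}{2}\, E\left(e^{-X}\right). \tag{5.22} \]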


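From Eq. (5.9), we know that $E(X) = E(E(X|T)) = E(\alpha + T) = \alpha + 1/\lambda$. Furthermore,

\[ \begin{aligned} E\left(e^{-X}\right) &= \int_{x=-\infty}^{\infty} e^{-x} \cdot f_X(x)\, dx = \frac{\lambda}{\xi} \int_{x=-\infty}^{\alpha} e^{-x} \cdot e^{\beta_L (x - \alpha)}\, dx + \frac{\lambda}{\xi} \int_{x=\alpha}^{\infty} e^{-x} \cdot e^{-\beta_R (x - \alpha)}\, dx \\ &= \frac{\lambda e^{-\alpha\beta_L}}{\xi} \left[\frac{e^{(\beta_L - 1)x}}{\beta_L - 1}\right]_{x=-\infty}^{\alpha} - \frac{\lambda e^{\alpha\beta_R}}{\xi} \left[\frac{e^{-(\beta_R + 1)x}}{\beta_R + 1}\right]_{x=\alpha}^{\infty} \\ &= \frac{\lambda e^{-\alpha}}{\xi} \left(\frac{1}{\beta_L - 1} + \frac{1}{\beta_R + 1}\right). \end{aligned} \tag{5.23} \]

By Eqs. (4.3) and (4.4), the expression in Eq. (5.23) may be reduced to

\[ E\left(e^{-X}\right) = \frac{\lambda e^{-\alpha}}{\lambda - \gamma/2 + 1}. \tag{5.24} \]

Taken together, we thus have that

\[ E(\mathcal{X}) \approx \alpha + \frac{1}{\lambda} - \frac{1}{2} \cdot \frac{\lambda e^{-\alpha}}{\lambda - \gamma/2 + 1}. \tag{5.25} \]

As regards the variance of $\mathcal{X}$, we have, according to Eqs. (5.13), (5.17), and (5.21) that

\[ \begin{aligned} \mathrm{Var}(\mathcal{X}) &= \mathrm{Var}\left(\mu_{\mathcal{X}|X}(X)\right) + E\left(\sigma^2_{\mathcal{X}|X}(X)\right) \approx \mathrm{Var}\left(X - \frac{1}{2}\, e^{-X}\right) + E\left(e^{-X}\right) \\ &= \mathrm{Var}(X) + \frac{1}{4}\, \mathrm{Var}\left(e^{-X}\right) - \mathrm{Cov}\left(X, e^{-X}\right) + E\left(e^{-X}\right) \\ &\approx \mathrm{Var}(X) + (1 + E(X)) \cdot E\left(e^{-X}\right) - E\left(Xe^{-X}\right), \end{aligned} \tag{5.26} \]

where we have used that the covariance $\mathrm{Cov}\left(X, e^{-X}\right) = E\left(Xe^{-X}\right) - E(X) \cdot E\left(e^{-X}\right)$. Also, the term $\mathrm{Var}\left(e^{-X}\right)/4$ has been neglected as it may be assumed that $\mathrm{Var}\left(e^{-X}\right) \propto e^{-2\alpha} \ll E\left(e^{-X}\right)$.

The first term in Eq. (5.26) is given by Eq. (5.10) s.t. $\mathrm{Var}(X) = 1/\lambda^2 + \gamma/\lambda$ while the expectation of $Xe^{-X}$ (last term) is given by

\[ \begin{aligned} E(Xe^{-X}) &= \int_{x=-\infty}^{\infty} x e^{-x} \cdot f_X(x)\, dx = \frac{\lambda e^{-\alpha\beta_L}}{\xi} \int_{x=-\infty}^{\alpha} x e^{(\beta_L - 1)x}\, dx + \frac{\lambda e^{\alpha\beta_R}}{\xi} \int_{x=\alpha}^{\infty} x e^{-(\beta_R + 1)x}\, dx \\ &= \frac{\lambda e^{-\alpha\beta_L}}{\xi} \left[\left(\frac{x}{\beta_L - 1} - \frac{1}{(\beta_L - 1)^2}\right) e^{(\beta_L - 1)x}\right]_{x=-\infty}^{\alpha} - \frac{\lambda e^{\alpha\beta_R}}{\xi} \left[\left(\frac{x}{\beta_R + 1} + \frac{1}{(\beta_R + 1)^2}\right) e^{-(\beta_R + 1)x}\right]_{x=\alpha}^{\infty} \\ &= \frac{\lambda e^{-\alpha}}{\xi} \cdot \left(\frac{\alpha}{\beta_L - 1} - \frac{1}{(\beta_L - 1)^2}\right) + \frac{\lambda e^{-\alpha}}{\xi} \cdot \left(\frac{\alpha}{\beta_R + 1} + \frac{1}{(\beta_R + 1)^2}\right) \\ &= \frac{\lambda e^{-\alpha}}{\xi} \left(\frac{\alpha\beta_L - \alpha - 1}{(\beta_L - 1)^2} + \frac{\alpha\beta_R + \alpha + 1}{(\beta_R + 1)^2}\right) \\ &= \frac{\lambda e^{-\alpha}}{\xi} \left(\frac{(\beta_R + 1)^2 \cdot (\alpha\beta_L - \alpha - 1) + (\beta_L - 1)^2 \cdot (\alpha\beta_R + \alpha + 1)}{(\beta_L - 1)^2 (\beta_R + 1)^2}\right). \end{aligned} \tag{5.27} \]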


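This lengthy expression may be rewritten as

\[ \begin{aligned} E(Xe^{-X}) &= \frac{\lambda e^{-\alpha}}{\xi} \cdot \frac{(\alpha + 1) \cdot (\beta_L^2 - \beta_R^2) + (\alpha(\beta_L\beta_R - 1) - 2) \cdot (\beta_L + \beta_R)}{(\beta_L\beta_R + \beta_L - \beta_R - 1)^2} \\ &= \lambda e^{-\alpha} \cdot \left(\frac{\alpha}{\lambda - \gamma/2 + 1} + \frac{1 - \gamma}{(\lambda - \gamma/2 + 1)^2}\right). \end{aligned} \tag{5.28} \]

Compare with the expression for $E\left(e^{-X}\right)$ in Eq. (5.24). All together, we have that

\[ \mathrm{Var}(\mathcal{X}) \approx \frac{1}{\lambda^2} + \frac{\gamma}{\lambda} + \left(\frac{1}{\lambda} + \frac{\lambda + \gamma/2}{\lambda - \gamma/2 + 1}\right) \cdot \frac{\lambda e^{-\alpha}}{\lambda - \gamma/2 + 1}. \tag{5.29} \]

Finally, we note that Eq. (5.26) can be written as

\[ \mathrm{Var}(\mathcal{X}) \approx \mathrm{Var}(X) + c_V, \tag{5.30} \]

where

\[ c_V = E\left(e^{-X}\right) - \mathrm{Cov}\left(X, e^{-X}\right) = \left(\frac{1}{\lambda} + \frac{\lambda + \gamma/2}{\lambda - \gamma/2 + 1}\right) \cdot \frac{\lambda e^{-\alpha}}{\lambda - \gamma/2 + 1} \tag{5.31} \]

denotes the second order correction term to $\mathrm{Var}(X)$. Similarly, Eq. (5.22) may be written as

\[ E(\mathcal{X}) \approx E(X) + c_E, \tag{5.32} \]

where

\[ c_E = -\frac{1}{2}\, E\left(e^{-X}\right) = -\frac{1}{2} \cdot \frac{\lambda e^{-\alpha}}{\lambda - \gamma/2 + 1} \tag{5.33} \]

is the second order correction term to the expectation of $X$. We note that with the scaling introduced in Eq. (3.14), i.e., $\mathcal{Y}|Y = y \approx N(y, \delta y)$, the correction terms scale linearly with the parameter $\delta$ s.t.

\[ E(\mathcal{X}) \approx E(X) + \delta \cdot c_E, \tag{5.34} \]

\[ \mathrm{Var}(\mathcal{X}) \approx \mathrm{Var}(X) + \delta \cdot c_V, \tag{5.35} \]

where $c_V$ and $c_E$ are given by Eqs. (5.31) and (5.33), respectively. The estimates of $\alpha$ and $\lambda$ are, thus, more sensitive to $\delta$ than to $\gamma$. In the present data of E. Coli, preliminary results suggest that $\delta$ may not vary more than about ±40%. Eqs. (5.34) and (5.35) are derived predominantly to be used in the model calibration and the estimation of $\alpha$ and $\lambda$ (see Sect. 6.5).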
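The correction terms and the resulting approximate moments of the observed logarithmic expression may be collected in a few small functions. The following is a sketch only; the function names are ours and $\delta = 1$ corresponds to the basic model.

```python
import math

def expected_exp_minus_x(alpha, lam, gamma):
    """E(exp(-X)) from Eq. (5.24)."""
    return lam * math.exp(-alpha) / (lam - gamma / 2.0 + 1.0)

def mean_correction(alpha, lam, gamma):
    """Second order correction c_E to the mean, Eq. (5.33)."""
    return -0.5 * expected_exp_minus_x(alpha, lam, gamma)

def variance_correction(alpha, lam, gamma):
    """Second order correction c_V to the variance, Eq. (5.31)."""
    denom = lam - gamma / 2.0 + 1.0
    return (1.0 / lam + (lam + gamma / 2.0) / denom) * expected_exp_minus_x(alpha, lam, gamma)

def observed_log_moments(alpha, lam, gamma, delta=1.0):
    """Approximate mean and variance of the observed log expression, Eqs. (5.34)-(5.35)."""
    mean = alpha + 1.0 / lam + delta * mean_correction(alpha, lam, gamma)
    variance = 1.0 / lam ** 2 + gamma / lam + delta * variance_correction(alpha, lam, gamma)
    return mean, variance

mean_obs, var_obs = observed_log_moments(3.7, 1.1, 0.005)
```

For the representative values $\alpha = 3.7$, $\lambda = 1.1$, and $\gamma = 0.005$, both corrections are small compared to $E(X)$ and $\mathrm{Var}(X)$, the mean being shifted downwards and the variance upwards.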


Chapter 6

Estimation of model parameters

In this section, we will find suitable estimates of the parameters of the basic model and discuss the characteristics of the corresponding estimators. A brief discussion on the parameter estimation of the replicate-specific, four-parameter model is given in Sect. 10.1.1.

6.1 Tail-index estimate of λ

As noticed in Sect. 4.2, the distribution $Y \overset{D}{\to} \mathrm{Pareto}(e^{\alpha}, \lambda)$ as $\gamma \to 0$ (for finite $\gamma$, $Y$ still behaves like a power-law in the tails). Since the Poisson noise will have a decreasing influence on the form of the distribution of $\mathcal{Y}$ as $y \to \infty$, we may then assume that

\[ f_{\mathcal{Y}}(y) \approx g(y) \cdot y^{-(\lambda + 1)}, \quad \text{as } y \to \infty, \tag{6.1} \]

where $g(y)$ is some slowly varying function. If so, Hill’s estimator [22] is a consistent estimator for $\lambda$. Let $y_1, \dots, y_n$ be a sample from $\mathcal{Y}$ and let $y_{(1)}, \dots, y_{(n)}$ be ordered in decreasing order s.t. $y_{(i)} \ge y_{(i')}$ for $i < i'$. Now, form the sequence

\[ H_k = \frac{1}{k} \sum_{i=1}^{k} \ln\left(y_{(i)}\right) - \ln\left(y_{(k)}\right). \tag{6.2} \]

The reciprocal $\hat{\lambda}_k = 1/H_k$ is then taken to be an estimate of $\lambda$, which improves as $k$ increases, at least until the variates $y_{(k)}$ enter the region where the true distribution is not well approximated by Eq. (6.1). Note that the maximum-likelihood estimate of $\lambda$, for a Pareto-distributed random variable with a probability density given by Eq. (4.20), is given by

\[ \hat{\lambda}_{\mathrm{MLE}} = \left(\frac{1}{n} \sum_{i=1}^{n} \ln(y_i) - \hat{\alpha}_{\mathrm{MLE}}\right)^{-1}, \tag{6.3} \]

where $\hat{\alpha}_{\mathrm{MLE}} = \ln(\min(y_1, \dots, y_n)) = \ln\left(y_{(n)}\right)$. This expression is thus equivalent to $\hat{\lambda}_n = 1/H_n$. Of course, $\hat{\lambda}_n = \hat{\lambda}_{\mathrm{MLE}}$ only if both samples are realisations of a Pareto-distributed random variable.
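Hill’s sequence in Eq. (6.2) is straightforward to compute from the ordered sample. The sketch below is an illustration on synthetic, exactly Pareto-distributed data generated by inversion with a fixed seed; the function names and the sample size are our own choices and the real E. Coli data are not used.

```python
import math
import random

def hill_estimates(sample):
    """Return [lambda_hat_1, ..., lambda_hat_n], the reciprocals of H_k in Eq. (6.2)."""
    logs = sorted((math.log(y) for y in sample), reverse=True)  # ln y_(1) >= ... >= ln y_(n)
    estimates, running_sum = [], 0.0
    for k, log_y in enumerate(logs, start=1):
        running_sum += log_y
        h_k = running_sum / k - log_y
        estimates.append(1.0 / h_k if h_k > 0.0 else math.inf)  # H_1 = 0 by construction
    return estimates

def pareto_mle(sample):
    """Maximum-likelihood estimate of lambda, Eq. (6.3), with alpha_MLE = ln(min(sample))."""
    logs = [math.log(y) for y in sample]
    return 1.0 / (sum(logs) / len(logs) - min(logs))

rng = random.Random(7)
alpha, lam = 3.7, 1.1
# Inversion sampling: if U is uniform on (0, 1], exp(alpha) * U**(-1/lambda) is Pareto.
sample = [math.exp(alpha) * (1.0 - rng.random()) ** (-1.0 / lam) for _ in range(20000)]
estimates = hill_estimates(sample)
```

For such an ideal sample the last element of the sequence coincides with the maximum-likelihood estimate, as stated above, and the Hill’s plot is obtained by plotting `estimates` against the index $k$.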

Figure 6.1 shows a Hill’s plot of the averaged 0min E. Coli replicates. Ideally, the sequence of estimates $\{\hat{\lambda}_k\}_{k \ge 1}$ should converge to a limiting value as $k \to \infty$. Unfortunately, Fig. 6.1 appears to be an example of a so-called “Hill’s horror plot”, where no such limit can clearly be identified. An alternative to Hill’s estimator is Pickands’ estimator [23]. However, it seems that this estimator also has difficulty picking up the apparent slope. There are a number of possible reasons why these two estimators fail to determine the tail-index. One possible explanation could be that the tail of the gene expression distribution is, in fact, not asymptotically described by a power-law, at least not a single power-law. This issue is addressed in Sect. 10.4. In order to obtain a good estimate of $\lambda$, which possibly describes an “average” tail-index, we must resort to other means of estimation.

Figure 6.1: Hill’s plot of gene expression data of E. Coli (estimate of $\lambda$ plotted against the index $k$). The sequence $\{\hat{\lambda}_k\}_{k=1}^{n}$ does not appear to converge to a specific value.

6.2 Method-of-moments estimates of α and λ

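The expressions in Eqs. (5.25) and (5.29) can be used to obtain robust moment estimates of $\alpha$ and $\lambda$ without accurate knowledge of $\gamma$, only assuming that $\gamma \ll 1$. Let $\hat{\mu}_{\mathcal{X}}$ and $\hat{\sigma}^2_{\mathcal{X}}$ denote, respectively, the sample mean and sample variance of the gene expression data in log-space. From Eq. (5.25), we have that

\[ \hat{\alpha}_k = \hat{\mu}_{\mathcal{X}} - \frac{1}{\hat{\lambda}_{k-1}} - c_{E,k-1}, \tag{6.4} \]

where $\hat{\lambda}_{k-1}$ is the $(k-1)$th iterate of the estimate of $\lambda$ while $c_{E,k-1} = c_{E,k-1}(\hat{\alpha}_{k-1}, \hat{\lambda}_{k-1}|\gamma)$ is the mean correction term for $\hat{\alpha}_{k-1}$ and $\hat{\lambda}_{k-1}$, given an estimate of $\gamma$. Similarly, with $\mu := 1/\lambda$, we have that

\[ \begin{aligned} \hat{\sigma}^2_{\mathcal{X}} &= \frac{1}{\lambda^2} + \frac{\gamma}{\lambda} + c_V = \left[\mu = \frac{1}{\lambda}\right] = \mu^2 + \gamma\mu + c_V \;\Rightarrow \\ &\mu^2 + \gamma\mu - \left(\hat{\sigma}^2_{\mathcal{X}} - c_V\right) = 0 \;\Rightarrow \\ &\mu_{1,2} = -\frac{\gamma}{2} \pm \sqrt{\left(\frac{\gamma}{2}\right)^2 + \hat{\sigma}^2_{\mathcal{X}} - c_V}. \end{aligned} \tag{6.5} \]

Since we may assume that $c_V \ll \hat{\sigma}^2_{\mathcal{X}}$, the expression under the square-root sign is positive and the square root itself is larger than $\gamma/2$. Since $\lambda > 0$, only the positive root is valid. Hence,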


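\[ \hat{\mu}_k = -\frac{\gamma}{2} + \sqrt{\left(\frac{\gamma}{2}\right)^2 + \hat{\sigma}^2_{\mathcal{X}} - c_{V,k-1}} = \frac{1}{\hat{\lambda}_k}, \tag{6.6} \]

where $c_{V,k-1} = c_{V,k-1}(\hat{\alpha}_{k-1}, \hat{\lambda}_{k-1}|\gamma)$ is the corresponding variance correction term. The above equations are then solved in an iterative manner with $\hat{\alpha} = \lim_{k\to\infty} \hat{\alpha}_k$ and $\hat{\lambda} = \lim_{k\to\infty} \hat{\lambda}_k$. A natural initial guess is simply given by taking $c_{E,-1} = c_{V,-1} = \gamma = 0$ s.t.

\[ \hat{\mu}_0 = -\frac{\gamma}{2} + \sqrt{\left(\frac{\gamma}{2}\right)^2 + \hat{\sigma}^2_{\mathcal{X}} - c_{V,-1}} \approx \hat{\sigma}_{\mathcal{X}}, \tag{6.7} \]

\[ \hat{\alpha}_0 = \hat{\mu}_{\mathcal{X}} - \hat{\mu}_0 - c_{E,-1} \approx \hat{\mu}_{\mathcal{X}} - \hat{\sigma}_{\mathcal{X}}. \tag{6.8} \]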

α0 = µX − µ0 − cE,−1 ≈ µX − σX (6.8)

In fact, for a large range of initial guesses, the estimates converge only in a few iterations.
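The iteration defined by Eqs. (6.4) and (6.6), started from Eqs. (6.7) and (6.8), may be sketched as follows. The function names and the fixed number of iterations are assumptions made for illustration; in practice a convergence criterion could be used instead.

```python
import math

def correction_terms(alpha, lam, gamma):
    """Return (c_E, c_V) according to Eqs. (5.33) and (5.31)."""
    denom = lam - gamma / 2.0 + 1.0
    e_exp = lam * math.exp(-alpha) / denom
    return -0.5 * e_exp, (1.0 / lam + (lam + gamma / 2.0) / denom) * e_exp

def moment_estimates(sample_mean, sample_var, gamma, iterations=100):
    """Iterate Eqs. (6.4) and (6.6), starting from the initial guess in Eqs. (6.7)-(6.8)."""
    mu = math.sqrt(sample_var)                 # initial guess with c_E = c_V = gamma = 0
    alpha, lam = sample_mean - mu, 1.0 / mu
    for _ in range(iterations):
        c_e, c_v = correction_terms(alpha, lam, gamma)   # evaluated at the previous iterate
        alpha_next = sample_mean - 1.0 / lam - c_e                                  # Eq. (6.4)
        mu = -gamma / 2.0 + math.sqrt((gamma / 2.0) ** 2 + sample_var - c_v)        # Eq. (6.6)
        alpha, lam = alpha_next, 1.0 / mu
    return alpha, lam
```

As a consistency check, if the sample mean and variance are replaced by the model values from Eqs. (5.25) and (5.29), the iteration should return the parameter values that were used to compute them.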

6.3 Least-squares estimate of γ

In the model, the parameter $\gamma$ describes the biological variation between replicates. At low expression levels, this variation is insignificant next to the technical variation, described by the parameter $\delta$. However, at high expression levels, the total variation is dominated by the biological variation, and thus determined by $\gamma$. In order to estimate $\gamma$ from data, we will make a common, but crucial assumption, namely that the biological variation in the expression level of an individual gene is equal to the variation between genes with similar mean expression levels. As will be discussed in Sect. 6.4.2, the $\gamma$ estimator introduced in this section is impaired with a non-zero bias which, however, will go to zero as the number of replicates $m \to \infty$. Again, in the basic model, we assume that $\gamma_1 = \dots = \gamma_m = \gamma$ for all replicates $j = 1, \dots, m$.

The ordinary least-squares method will be used to estimate $\gamma$. For stability reasons, we will make use of the coefficient of variation, $CV$, or more precisely $CV^2$, instead of the variance. In Sect. 3.3, we found that the variance of $\mathcal{Y}|T = t$, i.e., the variance in expression level for a given gene with expression intensity/expression growth rate $t$, is given by

\[ \mathrm{Var}(\mathcal{Y}|T = t) = \eta^2\left(a(\gamma)\eta^{b(\gamma)} - 1\right) + \eta, \tag{6.9} \]

where

\[ b = b(\gamma) = \frac{1}{1/\gamma + 1/2}, \tag{6.10} \]

\[ a = a(\gamma) = e^{-\alpha b}, \tag{6.11} \]

while $\eta$ denotes the expected expression of $\mathcal{Y}|T = t$. Clearly, $\eta = \eta(t) = e^{\alpha + (1 + \gamma/2)t}$ is a function of $t$, which is unknown. It also depends on $\gamma$. However, in the present analysis, we will disregard this weak dependence and instead take the sample mean expression $\hat{\eta}_i$ as an unbiased estimate of $\eta(t_i)$, where $t_i$ denotes the expression intensity of gene $i$, $i = 1, \dots, n$. Hence, we have that

\[ CV^2 = CV^2(\gamma) = \frac{\mathrm{Var}(\mathcal{Y}|T = t)}{E(\mathcal{Y}|T = t)^2} = a(\gamma)\eta^{b(\gamma)} + \frac{1}{\eta} - 1. \tag{6.12} \]

Let the observed squared coefficient of variation of gene $i$ be denoted by $\widehat{CV}^2_i$. The least-squares sum is then given by

\[ Q(\gamma) := \sum_{i=1}^{n} \left(\widehat{CV}^2_i - CV^2(\gamma)\right)^2 = \sum_{i=1}^{n} \left(\widehat{CV}^2_i - \left(a(\gamma)\hat{\eta}_i^{b(\gamma)} + \frac{1}{\hat{\eta}_i} - 1\right)\right)^2. \tag{6.13} \]


The ordinary least-squares estimate of γ is formally given by γ = argmin(Q), i.e., it is thesolution to the equation dQ/dγ = 0, where

dQ

dγ= −2 a(γ)b′(γ)

n∑

i=1

ηb (γ)

i

(

ln(

ηi) − α) ·

(

CV2i − a(γ)η

b (γ)

i− 1

ηi+ 1

)

. (6.14)

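Since both $b'(\gamma) = 1/(1 + \gamma/2)^{2} > 0$ and $a(\gamma) = e^{-\alpha b(\gamma)} > 0$, $\hat{\gamma}$ is given by the solution to the equation

\[ \sum_{i=1}^{n} \hat{\eta}_i^{\,b(\gamma)}\cdot\left(\ln(\hat{\eta}_i) - \alpha\right)\cdot\left(CV_i^{2} - a(\gamma)\,\hat{\eta}_i^{\,b(\gamma)} - \frac{1}{\hat{\eta}_i} + 1\right) = 0. \tag{6.15} \]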
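As an illustration of Eqs. (6.12), (6.13) and (6.15), the following minimal sketch minimises Q(γ) directly instead of solving dQ/dγ = 0. It is not the implementation used for the thesis: the synthetic, noise-free "observations", the grid of mean expressions, the search interval and the golden-section minimiser are all assumptions chosen for illustration, and α is taken as known.

```python
import math

def cv2_model(eta, gamma, alpha):
    """Model CV^2 of Eq. (6.12) for expected expression eta."""
    b = 1.0 / (1.0 / gamma + 0.5)   # Eq. (6.10)
    a = math.exp(-alpha * b)        # Eq. (6.11)
    return a * eta ** b + 1.0 / eta - 1.0

def q_sum(gamma, etas, cv2_obs, alpha):
    """Least-squares sum Q(gamma) of Eq. (6.13)."""
    return sum((c - cv2_model(e, gamma, alpha)) ** 2
               for e, c in zip(etas, cv2_obs))

def golden_section_min(f, lo, hi, tol=1e-10):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - invphi * (hi - lo)
    d = lo + invphi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - invphi * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + invphi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

# Hypothetical test data: 200 genes with mean expressions between about 50
# and 1e5, and noise-free CV^2 values generated from the model itself.
alpha, gamma_true = 3.7, 0.005
etas = [10 ** (1.7 + 3.3 * k / 199) for k in range(200)]
cv2_obs = [cv2_model(e, gamma_true, alpha) for e in etas]

gamma_hat = golden_section_min(
    lambda g: q_sum(g, etas, cv2_obs, alpha), 1e-4, 0.1)
print(gamma_hat)
```

With noise-free input the minimiser simply returns the value of γ used to generate the data. With real data, the observed CV² values are noisy and the estimate inherits the bias discussed in Sect. 6.4.2.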

6.4 Random and systematic errors of the parameter estimators

All estimates of the parameters of the basic model are based either on approximations and iterative procedures, or on minimisations of non-linear functions. It is therefore not straightforward to derive exact analytical expressions of the random errors and potential systematic errors of the estimators. Instead, we will mostly resort to numerical simulations to estimate these quantities. We assume that the observed gene expression levels are not marred by any additional sources of uncertainty, such as experimental errors or uncertainties originating from the data reduction (e.g., inaccurate normalisation). Moreover, we will assume that the model is adequate and that no bias is introduced due to lack of knowledge of the underlying biological processes. In all the error-estimation simulations performed in this section, we assume that α = 3.7, λ = 1.1, γ = 0.005, δ = 1, and n = 4207 active genes, which are representative parameter values for the E. Coli data set.

6.4.1 Characteristics of α(X) and λ(X)

Since the estimates of α and λ are based on approximate expressions of the first two moments of X, they are inherently biased (see Sect. 6.2). This bias, however, is found to be small. Let

\[ c_{E,\mathrm{tot}} := E(\mathbf{X}) - E(X), \tag{6.16} \]

\[ c_{V,\mathrm{tot}} := \mathrm{Var}(\mathbf{X}) - \mathrm{Var}(X) \tag{6.17} \]

include all the order corrections from the Taylor expansions of $\mu_{\mathbf{X}|X}$ and $\sigma^{2}_{\mathbf{X}|X}$, except the 0th order. From simulations, we estimate that the higher order terms $c_{E,\mathrm{high}} = c_{E,\mathrm{tot}} - c_E$ contribute ≲ 3% to the total correction term $c_{E,\mathrm{tot}}$, while the corresponding contribution of higher order terms to $c_{V,\mathrm{tot}}$ is about 4%. This translates into a bias of $B(\hat{\alpha}) = E(\hat{\alpha}) - \alpha = -0.0004$ and $B(\hat{\lambda}) = -0.0003$, for α = 3.7 and λ = 1.1.

There is another, and perhaps more “severe” effect that could either be regarded as a bias, or an intrinsic uncertainty. This inaccuracy in the estimation of the parameters α and λ is introduced by the fact that the expression intensities $t_i$ for all the genes i = 1, ..., n are postulated to be fixed, i.e., the n variates $t_i$ belong to one unique realisation of expression intensities for the E. Coli genes. Moreover, the number n of genes is finite. This fixation of the expression intensities constitutes the very foundation of the model and forces the sequence of expressions $X_j := (X_{i,j})_{i=1}^{n} = X_{1,j}, ..., X_{n,j}$, where $X_{i,j} = X_j|T = t_i$ is the expression of gene i in replicate j, and $X_{j'} = X_{1,j'}, ..., X_{n,j'}$, between two replicates j and j′, to be correlated. In fact, for γ ≪ 1, the correlation is very close to unity.


Now, suppose for a moment that we know the expression intensities. Then, a direct estimate of the parameter λ is given by $\lambda^{*} = n/\sum_{i=1}^{n} t_i$. From the Delta method (Gauss’ approximation), the standard deviation of the corresponding estimator is approximately given by

\[ D_{\lambda^{*}} = \sqrt{\mathrm{Var}(\lambda^{*})} = \sqrt{\mathrm{Var}\left(\frac{1}{\sum_{i=1}^{n} T_i/n}\right)} \approx \frac{\lambda}{\sqrt{n}} \simeq 0.017, \tag{6.18} \]

for E. Coli with λ = 1.10. Hence, if we choose to regard the intrinsic uncertainty as an actual random error, we cannot expect the estimator λ(X) to be more efficient than λ∗(T), and more replicates will not improve upon the situation. Hypothetically speaking, an improvement is only possible for species with larger numbers n of protein-coding genes.
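The approximation in Eq. (6.18) can be checked numerically. The sketch below is an illustration only: it assumes that the expression intensities $T_i$ are independent and exponentially distributed with rate λ (an assumption made here because λ∗ = n/∑tᵢ is the corresponding rate estimate), and the number of Monte Carlo repetitions and the seed are arbitrary choices.

```python
import math
import random

def lambda_star(ts):
    """Direct estimate of lambda from known expression intensities."""
    return len(ts) / sum(ts)

lam, n = 1.10, 4207

# Delta-method approximation of Eq. (6.18).
d_delta = lam / math.sqrt(n)

# Monte Carlo check: repeat the "realisation of expression intensities"
# many times, assuming T_i ~ Exp(rate = lam) (assumption, see lead-in).
rng = random.Random(1)
reps = 300
estimates = [lambda_star([rng.expovariate(lam) for _ in range(n)])
             for _ in range(reps)]
mean_est = sum(estimates) / reps
d_mc = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (reps - 1))

print(round(d_delta, 4), round(d_mc, 4))
```

Under this assumption, the simulated standard deviation should agree with λ/√n to within the Monte Carlo error of a few per cent.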

Numerical estimates of the standard deviations of the intrinsic uncertainty in the α and λ estimators are $d^{\mathrm{intr}}_{\alpha} = 0.014$ and $d^{\mathrm{intr}}_{\lambda} = 0.024$, respectively. The latter uncertainty is to be compared to $D_{\lambda^{*}}$ in Eq. (6.18). As this uncertainty is a measure of the intrinsic ignorance about λ, assumed to be hard-coded into the E. Coli genome itself, it may perhaps be better understood as an intrinsic bias. Since there is only one realisation of expression intensities, the actual value of this bias is unknown. However, the intrinsic uncertainty may be regarded as a measure of the size of the bias. Hence, for E. Coli we postulate that $|B_{\mathrm{intr}}(\alpha)| = d^{\mathrm{intr}}_{\alpha}$ and $|B_{\mathrm{intr}}(\lambda)| = d^{\mathrm{intr}}_{\lambda}$.

Table 6.1: Biases and standard deviations of α and λ†

                                                         α(X)       λ(X)
Bias (systematic error of iterative procedure)          −0.0004    −0.0003
Standard deviation of intrinsic bias                     0.0139     0.0239
Standard deviation for fixated expression intensities    0.0032     0.0027
Approximate standard deviation (for fixated ti)          0.0032     0.0026

† Estimated from simulations with given values α = 3.7, λ = 1.1, γ = 0.005, δ = 1.0, n = 4207 active genes, and m = 1 samples.

For a given realisation of expression intensities, the uncertainty in the estimates of α and λ for individual samples is controlled by the variance scaling parameter(s) γ (and δ). Recall that, for γ ≪ 1 and $c_E$, $c_V$ being small, approximate expressions for the α and λ estimators are

\[ \hat{\lambda}(\mathbf{X}) \approx 1/\hat{\sigma}_{\mathbf{X}}, \tag{6.19} \]

\[ \hat{\alpha}(\mathbf{X}) \approx \hat{\mu}_{\mathbf{X}} - \hat{\sigma}_{\mathbf{X}}. \tag{6.20} \]

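Hence, the random error, e.g., in the estimate of α is then approximately given by

\[ \mathrm{Var}\left(\hat{\alpha}(\mathbf{X})\right) \approx \mathrm{Var}\left(\hat{\mu}_{\mathbf{X}} - \hat{\sigma}_{\mathbf{X}}\right) = \mathrm{Var}\left(\hat{\mu}_{\mathbf{X}}\right) + \mathrm{Var}\left(\hat{\sigma}_{\mathbf{X}}\right) - 2\cdot\mathrm{Cov}\left(\hat{\mu}_{\mathbf{X}}, \hat{\sigma}_{\mathbf{X}}\right). \tag{6.21} \]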

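The first term in Eq. (6.21) describes the variance of the sample mean of the gene expression (in logarithmic space), for a given realisation of expression intensities. Hence, for $\mathbf{X}_i := \mathbf{X}|T = t_i$, we have that

\[ \mathrm{Var}\left(\hat{\mu}_{\mathbf{X}}\right) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} \mathbf{X}_i\right) = \left[\mathbf{X}_i \text{ and } \mathbf{X}_{i'} \text{ indep. } \forall\, i \neq i'\right] = \frac{1}{n^{2}}\sum_{i=1}^{n} \mathrm{Var}(\mathbf{X}_i) = \frac{1}{n^{2}}\sum_{i=1}^{n} \mathrm{Var}(\mathbf{X}|T = t_i), \tag{6.22} \]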

where Var(X|T = ti) denotes the gene-specific variance, i.e., the variance of the distribution of expressions for a given gene i. Clearly, this variance will depend on the expression intensity t, and it will therefore be different for different genes. Note that a crucial model assumption is made in Eq. (6.22), namely that Xi and Xi′ are independent for any i ≠ i′. In reality, this is not entirely true (see Sect. 10.3).

[Figure 6.2: two panels; x-axes: Alpha (left), Lambda (right); y-axes: Probability density]

Figure 6.2: Illustration of different uncertainties/biases for α and λ parameters. Left panel: The dashed line shows the distribution of the intrinsic bias (α = 3.70) of the α estimator that results from fixation of the expression intensities. The solid line shows the distribution of α, given a specific realisation of the expression intensities. In this simulated example, a bias of Bintr(α) = 0.0116 is introduced. The standard deviation of α after fixating (t1, ..., tn) is dα = 0.0032. Right panel: The corresponding distributions for λ with λ = 1.10 and Bintr(λ) = 0.0116. The standard deviation of λ after fixating the expression intensities is dλ = 0.0027. Same symbols as in the left panel.

Similarly, the second term in Eq. (6.21) is the variance of the sample standard deviation of the gene expression, again for a given realisation of expression intensities. Unfortunately, there is no equally simple way to rewrite this quantity in terms of a gene-specific variance, or similar. In addition, there is a non-zero correlation between µX and σX, wherefore we will instead resort to numerical methods to estimate the standard deviations of the α and λ estimators.

From simulations, we obtain standard deviations of dα = 0.0032 and dλ = 0.0027. The standard deviation will increase with increasing γ. Here, the situation may be improved by more replicates, as long as the replicates have been properly normalised (cf. Sect. 10.1). As a reference, we also numerically estimated the standard deviations from the approximations in Eqs. (6.19) and (6.20). Indeed, they are very close to the estimates from the full numerical simulations. Figure 6.2 illustrates the difference between the intrinsic uncertainty/bias originating from the fixation of the expression intensities and the standard deviations of the parameter estimators, given a specific realisation of expression intensities. The various standard deviations and biases for the α and λ estimators are summarised in Table 6.1.


[Figure 6.3: x-axis: Log10(no. of replicates); y-axis: Relative bias (in %)]

Figure 6.3: Relative bias of γ(Y|T = t) as a function of the number of replicates. The bias tends to zero as the number m of replicates increases. The magenta dot denotes the relative bias for m = 3 replicates. As above, the generic basic model is (α, λ, γ) = (3.7, 1.1, 0.005).

6.4.2 Characteristics of γ(Y|T = t)

The parameter γ is set to measure the biological variance between replicates. This is a gene-specific variance and the estimator γ(Y|T = t) will therefore not be sensitive to the fixation of the expression intensities in the same way as are α(X) and λ(X). However, in the estimation of γ, we assume that the gene-specific variation can be estimated by the “average” observed variation for genes with similar mean expression levels. Thereby, we allow for information to be shared between genes. From the point of view of regression analysis, neither the existing error in the regressor variable $\hat{\eta}_i$ (being an estimate of $\eta_i$), nor the heteroscedasticity are accounted for in Eq. (6.15). As a result of neglecting the error in the regressor variable, a bias in the estimator is introduced. As shown in Fig. 6.3, this bias tends to zero as m → ∞. Nonetheless, for m = 3 replicates, we estimate a bias of B(γ) = −9.5 · 10⁻⁵ (numerical simulations), assuming a basic model with parameters θ = (α, λ, γ) = (3.7, 1.1, 0.005). This corresponds to a relative bias of −1.9%. Note also that by neglecting the heteroscedasticity, the efficiency of the estimator γ is suboptimal, which affects any confidence interval estimated from the standard error.

Table 6.2: Bias and standard deviation of γ†

                                                     γ(Y|T = t)
Bias (systematic error of least-squares estimate)    −0.000095
Standard deviation of estimator                       0.00022

† Estimated from simulations with m = 3 replicates of the model (α, λ, γ, δ) = (3.7, 1.1, 0.005, 1.0), and n = 4207 active genes.

The systematic and random errors of the γ estimator are summarised in Table 6.2. The standard deviation of γ(Y|T = t) is estimated to be dγ = 0.00022 for m = 3 replicates.


6.5 Model calibration: parameter estimates of the basic model

As discussed above, the basic model is described by the set of parameters θ = (α, λ, γ). The calibration is performed using the method of moments and the least-squares method as described in Sect. 6.2 and Sect. 6.3, respectively. None of the parameters in the basic model is replicate-specific (cf. Sect. 10.1) and all the biological replicates are assumed to be properly normalised and generally of equal quality. We will therefore use the arithmetic average of the three 0min replicates to estimate α and λ, while γ is estimated from the observed dispersion relation, computed from all three 0min replicates.

The result is summarised in Table 6.3. Note that, since α and λ are estimated from the average of three replicates, the correct γ to be used in the iterative procedure in Sect. 6.2 is γ → γ/3 (see Sect. 8.1). Likewise, the correction terms cE and cV should be divided by m = 3. The parameter γ is corrected for bias by the following procedure: First, an initial estimate of γ is determined from the experimental data, by the least-squares method. Secondly, a sequence of estimates is produced from model data until an output γ estimate is found which equals that determined from the experimental data. The input value of γ associated with that specific simulated data set is then taken as the bias-corrected estimate of γ. Indeed, since the calibrated model is very close to the generic basic model (α, λ, γ) = (3.7, 1.1, 0.005), we could as well simply have applied the bias stated in Table 6.2 as a correction for γ. The bias correction for α only appears in the last digit in Table 6.3, while that for λ is too small to have any effect on the presented results.
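The bias-correction procedure for γ described above amounts to inverting the map from the input γ of a simulation to the resulting output estimate. The sketch below illustrates this idea with a bisection search. It is not the thesis implementation: `toy_expected_output` is a hypothetical stand-in for "simulate model data with a given input γ and apply the least-squares estimator", here simply reproducing the −1.9% relative bias quoted in Sect. 6.4.2, and the raw estimate 0.0049 is a made-up illustrative value.

```python
def invert_estimator(expected_output, target, lo, hi, tol=1e-9):
    """Find the input value whose expected output estimate equals `target`.

    Uses bisection and assumes expected_output is increasing on [lo, hi].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_output(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def toy_expected_output(gamma_in):
    # Hypothetical stand-in for the simulation step: the estimator is assumed
    # to return, on average, a value 1.9% below the input gamma.
    return gamma_in * (1.0 - 0.019)

gamma_raw = 0.0049  # illustrative raw least-squares estimate (not from the data)
gamma_corrected = invert_estimator(toy_expected_output, gamma_raw, 1e-4, 0.1)
print(gamma_corrected)
```

In a real application, each evaluation of the stand-in function would be replaced by an average over simulated data sets, so the search would have to tolerate Monte Carlo noise in the output estimates.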

Table 6.3: Estimates of the parameters of the basic model

                                    α                 λ                 γ
Estimated value (bias-corrected)    3.651 ± 0.004†    1.094 ± 0.003†    0.0050 ± 0.0004††

† The assigned error denotes the 95% confidence interval for the averaged data (i.e., m = 3 replicates with fixated expression intensities).
†† 95% confidence interval.


Chapter 7

Performance of the basic model

This chapter summarises a few main results of the basic model, including the distribution of the observed gene expression, the dispersion relation, and the marginal distribution of the log-fold change. Furthermore, we briefly discuss the appearance of low-variance genes in the data and compare the numbers to what is predicted by the basic model. We devote Chapt. 8 to the discussion on significant gene regulation and the conditional distribution of the log-fold change. In this chapter, we assume the basic model calibrated to the 0min data with parameters (α, λ, γ, δ) = (3.651, 1.094, 0.0050, 1) as estimated in Sect. 6.5.

7.1 Distribution of the gene expression

Figure 7.1 shows the basic-model fit to the experimental distribution of gene expression. From visual inspection, it is concluded that the fit is generally good. Both the exponential-like tail on the right and the cut-off on the left of the peak are well captured by the model. One potential problem, however, is that the peak itself appears to be broader than what the basic model predicts. There could be several explanations for this. One possible explanation is discussed in Sect. 10.4. Due to this inadequacy of the basic model, no formal goodness-of-fit test was performed.

7.2 Observed dispersion relation

The predicted dispersion relation of the calibrated, basic model is depicted in Fig. 7.2. It appears that the observed variance is modelled quite well by Eqs. (3.10) – (3.13), as formulated in Sect. 3.2.2. The transition from the technical-noise dominated variance at low expression levels to the biological-noise dominated variance at high expression levels is clearly observed, not only in the theoretical dispersion relation given by Eq. (3.26), but in the simulated data as well (green circles in Fig. 7.2).

Worth noting is that the variance of the dispersion is also well captured by the model. Individual genes in the E. Coli genome with similar average gene expression could have exhibited a significantly larger spread in the variance than what is observed. From the current data set with m = 3 replicates, the spread in the 0min data is to the first order consistent (by-eye estimate of Fig. 7.2) with a dispersion relation depending only on the expected expression η, as in Eq. (3.26). This appears to hold over the entire observed range η ∈ [40, 10⁵]. It would be interesting to have a larger number of replicates in order to study the behaviour of the spread of the dispersion relation, as a function of expression level, e.g., to reveal additional explanatory variables. Note that, around the cut-off expression level η = ηc, a number of low-variance genes with log10(ω) < −0.5 are found, both in the simulated and the observed data. This will be further discussed in Sect. 7.4.

[Figure 7.1: x-axis: Logarithmic gene expression; y-axis: Log(Probability density function)]

Figure 7.1: Predicted gene expression. The black attached circles denote the empirical probability density of the 0min data of E. Coli, while the red curve denotes the theoretical density of the calibrated, basic model.

The result depicted in Fig. 7.2 suggests that the variance of the conditional distribution Y|T = t is captured by the basic model. However, it does not say anything about the actual form of this distribution. This may be important, e.g., in the calculation of p-values for gene regulation, and will be discussed further in Chapts. 8 and 9.

7.3 Marginal distribution of the log-fold change

Ideally, we would like to experimentally determine the conditional distribution of the log-fold change, i.e., the distribution

\[ L|T = t \overset{D}{=} \ln\left(\mathbf{Y}_2|T = t \,/\, \mathbf{Y}_1|T = t\right), \tag{7.1} \]

for a given gene with expression intensity t and for two different replicates $\mathbf{Y}_k|T = t$, k = 1, 2. Unfortunately, this is not feasible from an experimental point of view. Instead, the less optimal, marginal distribution L with density function

\[ f_L(x) = \int_{-\infty}^{\infty} f_{L|T}(x|t)\cdot f_T(t)\,dt, \quad x \in \mathbb{R}, \tag{7.2} \]

may be studied as an alternative. The predicted log-fold change after marginalising over expression level (i.e., expression intensity) is shown in Fig. 7.3 (grey curve). The fit to the data for the BB09-to-BB17 and BB10-to-BB17 log-fold changes (Fig. 7.3, left panel) is overall very good. Interestingly, the fit to the BB09-to-BB10 log-fold change (right panel) is quite poor. The reason for this may be found in the BB17 replicate. This replicate was not sequenced the same day and it appears to display a somewhat larger intrinsic variance as compared to the other two replicates. However, we have up to this point treated all three 0min replicates equally and have included them, unweighted, in the estimation of the model parameters. As a result, the modelled variance may predominantly originate from the BB17 replicate. Therefore, the calibrated model does a better job in predicting those log-fold changes that include the BB17 replicate.

[Figure 7.2: x-axis: Log10(Average gene expression); y-axis: Log10(dispersion)]

Figure 7.2: Predicted dispersion relation. The green circles denote a realisation of the dispersion relation for the calibrated, basic model with m = 3 simulated replicates, while the grey circles beneath the green ones are the observational data for the three 0min replicates. The black solid curve above $\eta_c = e^{3.651} = 38.5$ denotes the theoretical dispersion relation, ω(η), as predicted by the calibrated, basic model. This relation is denoted by a dashed line below ηc = 38.5. The dotted line denotes a dispersion relation $\omega = \eta^{1/2}$, affected by Poisson noise only.

The finding that not all replicates may be of equal quality calls for a refined modelling, where at least some of the parameters, such as γ and δ, should be considered as replicate-specific. In Sect. 10.1, such a model is briefly outlined.

Finally, it is noted that the marginal distribution of the log-fold change of the basic model can be approximated by a two-parameter Normal-Laplace distribution, for certain combinations of the parameters γ and δ (e.g., δ/γ should not be too large). The Normal-Laplace distribution is defined as a convolution between a (symmetric) Laplace and a normal distribution s.t. if L ≈ NL(b, σ²) is approximately distributed according to a Normal-Laplace, then

\[ L \overset{D}{\approx} W + Z, \tag{7.3} \]

where W ∼ Laplace(0, b) and Z ∼ N(0, σ²). The parameter b > 0 denotes the scale parameter of the Laplace distribution and regulates the shape of the tails of the distribution L, while σ² regulates the shape of the core. The probability density function of the Normal-Laplace distribution is then given by the convolution

\[ f_L(x; b, \sigma^{2}) \approx (f_W * f_Z)(x) = \int_{-\infty}^{\infty} f_W(w)\, f_Z(x - w)\,dw, \tag{7.4} \]

where $f_W(w) = e^{-|w|/b}/2b$ and $f_Z(z) = e^{-z^{2}/2\sigma^{2}}/\sqrt{2\pi}\sigma$.

[Figure 7.3: two panels; x-axes: Log-fold change; y-axes: Log(probability density)]

Figure 7.3: Marginal distribution of the log-fold change. Left panel: The red and green curves are the observed marginal distributions (0min data) of the log-fold changes log2(BB17/BB09) and log2(BB17/BB10), respectively. The grey, thick curve denotes the corresponding distribution of the calibrated, basic model. Right panel: The blue curve denotes the marginal distribution of the log-fold change log2(BB10/BB09). The grey curve is identical to that in the left panel.
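The convolution in Eq. (7.4) is straightforward to evaluate numerically. The following sketch uses a plain trapezoidal rule and then checks two properties that any Normal-Laplace density must have: unit total mass and variance 2b² + σ² (the sum of the Laplace and normal variances). The parameter values b = σ = 0.1, the integration window and the grid sizes are illustrative assumptions, not values fitted to the E. Coli data.

```python
import math

def laplace_pdf(w, b):
    """Density of a symmetric Laplace(0, b) variable."""
    return math.exp(-abs(w) / b) / (2.0 * b)

def normal_pdf(z, sigma):
    """Density of a N(0, sigma^2) variable."""
    return math.exp(-z * z / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def normal_laplace_pdf(x, b, sigma, half_width=8.0, steps=400):
    """Trapezoidal approximation of the convolution in Eq. (7.4).

    The integrand is negligible unless w lies within a few sigma of x,
    so w is integrated over [x - half_width*sigma, x + half_width*sigma].
    """
    lo = x - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for k in range(steps + 1):
        w = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * laplace_pdf(w, b) * normal_pdf(x - w, sigma)
    return total * h

b, sigma = 0.1, 0.1          # illustrative values only
dx = 0.01
xs = [-3.0 + dx * k for k in range(601)]
dens = [normal_laplace_pdf(x, b, sigma) for x in xs]

total_mass = sum(dens) * dx
variance = sum(x * x * d for x, d in zip(xs, dens)) * dx
print(total_mass, variance, 2.0 * b * b + sigma * sigma)
```

A closed-form expression for this density in terms of the normal distribution function exists, but the direct quadrature keeps the sketch close to Eq. (7.4).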

7.4 Existence of super-stable genes?

In the experimental data, a number of genes around the cut-off expression η = ηc ≃ 50 are found to have unusually low variances. Based on the three 0min replicates, there are kω = 8 genes with an estimated standard deviation of log10(ω) < −0.5 in the interval η ∈ [30, 100] (i.e., log10(η) ∈ [1.5, 2.0], see Fig. 3.2). Moreover, the gene with the lowest observed variance is “fepC.t01” with a standard deviation of log10(ωlowest) = −1.099. Are these numbers more extreme than what would be expected from the basic model?

We note that within the basic model, the low-end tail of the conditional distribution of the logarithm of the standard deviation, given η ≃ ηc, is well described by an exponential. As a result, the predicted distribution of genes with the lowest standard deviation (i.e., in the interval log10(η) ∈ [1.5, 2.0]) can very well be fitted by a generalised extreme-value distribution with no shape parameter, i.e., a Gumbel distribution, as illustrated in Fig. 7.4.

[Figure 7.4: x-axis: Log(dispersion); y-axis: Log(prb. density)]

Figure 7.4: Distribution of extreme values. The red and green curves are two realisations of the extreme-value distribution of the dispersion relation ω (in the interval log(η) ∈ [1.5, 2.0]) of the basic model. The black, dashed curve denotes a Gumbel distribution, fitted to the data. Note that the observed most extreme value in the experimental data (0min replicates) is log10(ωlowest) = −1.1, which is located a little to the left of the peak of the distribution.

[Figure 7.5: x-axis: Log(dispersion); y-axis: Log(cumulative distribution function)]

Figure 7.5: Empirical cumulative distribution function with confidence interval. The red, thick stair-case curve denotes the empirical cumulative distribution function of the observed standard deviation in the interval log(η) ∈ [1.5, 2.0], while the dark red, thin curves denote the 95% percentile bootstrap confidence interval of the distribution. The black, dashed line denotes the expected distribution function of the basic model. For illustrative purposes, 200 of the bootstrap distribution functions are also plotted (grey curves).

Let

H0 : log10(ω) = ω0 against the alternative
H1 : log10(ω) ≠ ω0,

where ω0 = −0.915 is the expected standard deviation of the gene with the lowest variance, as predicted from the basic model. By assuming a null distribution with location parameter µ = 0.79 and scale parameter β = 0.216 (i.e., Gumbel), the two-tailed p-value for the observed value log10(ωlowest) = −1.099 is p = 0.425. Hence, H0 cannot be rejected and we conclude that the observed standard deviation of the most stable gene is consistent with the basic model.
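The quoted p-value can be reproduced with a few lines of code under one particular reading of the parameters, which is an assumption made here and not stated in the text: the Gumbel distribution (for maxima) with µ = 0.79 and β = 0.216 is taken to describe the negated quantity −log10(ω), so that the observed value becomes 1.099. Under this reading, the Gumbel mean µ + β·γ_E (with γ_E the Euler-Mascheroni constant) also matches |ω0| = 0.915.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def gumbel_cdf(y, mu, beta):
    """CDF of the Gumbel (maximum) distribution."""
    return math.exp(-math.exp(-(y - mu) / beta))

mu, beta = 0.79, 0.216
y_obs = 1.099  # assumed to be -log10(omega_lowest), see lead-in

cdf = gumbel_cdf(y_obs, mu, beta)
p_two_tailed = 2.0 * min(cdf, 1.0 - cdf)
expected_value = mu + EULER_GAMMA * beta

print(round(p_two_tailed, 3))    # 0.425
print(round(expected_value, 3))  # 0.915
```

The agreement of both numbers with the text suggests that this reading is at least consistent with the quoted values, although the sign convention itself remains an assumption.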

In order to further quantify the behaviour of the distribution of low-variance genes, we computed a 95% percentile bootstrap confidence interval of the entire empirical cumulative distribution function in the interval log(η) ∈ [1.5, 2.0]. The result is shown in Fig. 7.5. Below a standard deviation of, say, log10(ω) = 0, the expected cumulative distribution function of the basic model (dashed, black line) clearly falls within the bootstrap confidence interval (dark red stair-case curves) at nearly all points, and touches the lower bound only around log10(ω) = −0.25.

In conclusion, we cannot reject the null hypothesis that the basic model adequately describes the extreme low tail of observed standard deviations in the vicinity of η = ηc. In the model, the variation at these expression levels is completely dominated by the technical noise alone, wherefore we find it plausible that the observed variation is also dominated by technical noise.


Chapter 8

Hypothesis testing of the observed gene regulation

In the field of genetics, it is of great interest to be able to determine whether a gene (or a collection of genes) is significantly regulated by some treatment or a disease affecting the organism. The standard procedure is to perform hypothesis testing on individual genes. However, as discussed above, the variation in expression level in the “null” state, or “null” treatment (i.e., in the case of no treatment/healthy tissue) is predicted to vary as a result of the technical and biological noise being dependent on the expected level of expression. This dependence must be accounted for in the analysis. In this section, we take the null treatment as the arithmetic average of the three 0min replicates, i.e., each gene is assumed to have a “null expression” given by the average of these three observations. The other treatments are constructed in a similar fashion. Although the data, in essence, is a time series (there are, in fact, three realisations of the stochastic process, for each gene), the existing correlation between the time points is neglected.

8.1 Conditional log-fold change distribution in the case of no treatment

Under the assumption that the basic model presented in Sect. 3 adequately describes the distribution of gene expressions of E. Coli, we can formally write the conditional log-fold change of the null state, given a gene with expression intensity t, as

\[ L^{0}|T = t \overset{D}{=} \mathbf{X}^{0}_{2}|T = t - \mathbf{X}^{0}_{1}|T = t, \tag{8.1} \]

where $\mathbf{X}^{0}_{1}|T = t$ and $\mathbf{X}^{0}_{2}|T = t$ denote two independent and identically distributed conditional gene expressions of the null treatment. Recall that the null treatment is formed as the average of a number m of null replicates. Hence, the probability density function of the conditional gene expression is given by

\[ f_{\mathbf{X}^{0}|T}(x|t) = f_{\mathbf{Y}^{0}|T}\left(e^{x}|t\right)\cdot e^{x}, \quad x \in \mathbb{R}. \tag{8.2} \]

Here, we will keep the superscript “0” to distinguish the null treatment from the other treatments and to emphasise the dependence on m. The corresponding distribution in linear space, Y0|T = t, will be referred to as the “fuzzy” log-normal distribution and is the analogue to the negative binomial distribution, which was mentioned in Sect. 1 (see also Sect. 9.1).

At this point, we will make a convenient and simplifying approximation. Since the technical variation is added to the gene expression in linear space, the average of a number m of independent replicates will simply scale down the technical variance with a factor 1/m. For the biological variation, this is not really true. It is, however, approximately true. Recall that the variance of the log-normal distribution is given by

\[ \mathrm{Var}(Y|T = t) = \left(e^{\sigma^{2}} - 1\right) e^{2\mu + \sigma^{2}}, \tag{8.3} \]

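where, in our case, µ = α + t and σ² = γt. Assuming independence between replicates, this variance scales linearly with 1/m. To the first order, the scaled biological variance may then be written as

\[ \mathrm{Var}\left(Y^{0}|T = t\right) = \frac{1}{m}\cdot\left(e^{\sigma^{2}} - 1\right)\cdot e^{2\mu + \sigma^{2}} \approx \frac{1}{m}\cdot\left(1 + 2\sigma^{2} - 1 - \sigma^{2}\right)\cdot e^{2\mu} = \frac{\sigma^{2}}{m}\, e^{2\mu} = \frac{\gamma t}{m}\cdot e^{2\mu}, \tag{8.4} \]

where the factor γt/m in the last expression of Eq. (8.4) accounts for the variation that is controlled by γ. Now, by instead writing $\mathrm{Var}\left(Y^{0}|T = t\right) = \left(e^{\sigma_{0}^{2}} - 1\right)\cdot e^{2\mu + \sigma_{0}^{2}} \approx \sigma_{0}^{2}\cdot e^{2\mu}$, the scaled biological variance in logarithmic space can be identified as $\sigma_{0}^{2} = \sigma^{2}/m = \gamma t/m$. Hence, $X^{0}|T = t \approx N(\alpha + t, \gamma t/m)$ while

\[ Y^{0}_{t} \overset{D}{=} Y^{0}|T = t \approx \mathrm{LogN}(\alpha + t, \gamma t/m) \text{ with approximate variance } \gamma t/m \text{ and} \tag{8.5} \]

\[ \mathbf{Y}^{0}|Y^{0}_{t} = y \approx N(y, y/m). \]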
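The quality of the first-order step in Eq. (8.4) is easy to quantify: the ratio between the exact and the approximate expression is $(e^{\sigma^{2}} - 1)\,e^{\sigma^{2}}/\sigma^{2} \approx 1 + \tfrac{3}{2}\sigma^{2}$, independent of m and µ. The short sketch below evaluates this ratio for the generic parameter values used in Sect. 6.4; the selected expression intensities t are arbitrary illustrative choices.

```python
import math

def exact_scaled_var(alpha, gamma, t, m):
    """Exact log-normal variance of Eq. (8.3), scaled by 1/m."""
    mu, s2 = alpha + t, gamma * t
    return (math.exp(s2) - 1.0) * math.exp(2.0 * mu + s2) / m

def approx_scaled_var(alpha, gamma, t, m):
    """First-order approximation, last expression of Eq. (8.4)."""
    return gamma * t / m * math.exp(2.0 * (alpha + t))

alpha, gamma, m = 3.7, 0.005, 3
ratios = []
for t in (0.5, 2.0, 5.0):  # illustrative expression intensities
    ratio = exact_scaled_var(alpha, gamma, t, m) / approx_scaled_var(alpha, gamma, t, m)
    ratios.append(ratio)
    print(t, round(ratio, 4))
```

For γ = 0.005, the two expressions differ by less than about 4% even at t = 5, which is in line with the claim that the approximation works well for small γ.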

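As discussed below, this approximation is surprisingly good in the regions of interest of γ and η. The probability density function of Y0|T = t in Eq. (8.2) is then approximately given by

\[ f_{\mathbf{Y}^{0}|T}(y|t) = \int_{0}^{\infty} f_{\mathbf{Y}^{0}|Y^{0}_{t}}(y|z)\cdot f_{Y^{0}_{t}}(z)\,dz \approx \int_{0}^{\infty} \sqrt{\frac{m}{2\pi z}}\cdot e^{-m(y - z)^{2}/2z}\cdot\sqrt{\frac{m}{2\pi\gamma t}}\cdot\frac{1}{z}\cdot e^{-m(\ln(z) - \alpha - t)^{2}/2\gamma t}\,dz = \]

\[ = \frac{m}{2\pi\sqrt{\gamma t}}\int_{0}^{\infty} z^{-3/2}\cdot e^{-m(y - z)^{2}/2z}\cdot e^{-m(\ln(z) - \alpha - t)^{2}/2\gamma t}\,dz, \quad y > 0. \tag{8.6} \]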

Since $E(\mathbf{Y}|T = t) = E(Y|T = t)$, the expression intensity t is related to the expected expression level such that ln(η) = α + (1 + γ/2)t, as given by Eq. (3.19).

In the context of hypothesis testing, it is of interest to know the form of the conditional log-fold change distribution, at least in the limiting cases. As η → ∞ (which is equivalent to saying that t → ∞), i.e., at high expression levels, the biological variation dominates over the technical variation. In essence, this implies that the Poisson-like noise step becomes redundant, wherefore

\[ L^{0}|T = t \overset{D}{\to} X^{0}_{2}|T = t - X^{0}_{1}|T = t, \quad \text{as } \eta \to \infty. \tag{8.7} \]

Hence, at high expression levels we have that $L^{0}|T = t \approx N(0, \sigma^{2}_{\infty})$, where $\sigma^{2}_{\infty} \approx 2\gamma t/m$. At low expression levels, we have a different situation. In the basic model, the biological variation goes to zero as η approaches the cut-off level ηc. Therefore, L0|T = t approaches a distribution defined as the logarithm of the ratio between two independent normal random variables, as t → 0. This distribution has slightly heavier tails than the normal distribution. At intermediate expression levels, the conditional log-fold change distribution becomes a mixture of these two limiting distributions. The probability density function of a ratio distribution of two normal random variables was originally derived by [24].
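The conditional log-fold change of the null state can be simulated directly from the approximation in Eq. (8.5). The sketch below is a minimal Monte Carlo illustration at a low expression level, assuming δ = 1 (technical variance equal to the mean) and using an arbitrary seed and sample size. As a rough cross-check, it compares the simulated variance with a first-order (delta-method) value 2·(γt/m + 1/(mη)), a formula introduced here for illustration that combines the biological term of Sect. 8.1 with a Poisson-like technical term.

```python
import math
import random

def sample_null_expression(rng, alpha, gamma, t, m):
    """One draw of the averaged null expression under approximation (8.5).

    The biological step is log-normal; the technical step is normal with
    variance equal to (biological expression)/m, i.e. delta = 1 is assumed.
    """
    y_bio = math.exp(rng.gauss(alpha + t, math.sqrt(gamma * t / m)))
    return rng.gauss(y_bio, math.sqrt(y_bio / m))

alpha, gamma, m, eta = 3.7, 0.005, 3, 50.0
t = (math.log(eta) - alpha) / (1.0 + gamma / 2.0)  # from ln(eta) = alpha + (1 + gamma/2) t

rng = random.Random(7)
n_draws = 20000
lfc = [math.log(sample_null_expression(rng, alpha, gamma, t, m)
                / sample_null_expression(rng, alpha, gamma, t, m))
       for _ in range(n_draws)]

mean_mc = sum(lfc) / n_draws
var_mc = sum((x - mean_mc) ** 2 for x in lfc) / (n_draws - 1)
var_delta = 2.0 * (gamma * t / m + 1.0 / (m * eta))  # first-order cross-check

print(mean_mc, var_mc, var_delta)
```

At η = 50 the technical term 1/(mη) dominates by more than an order of magnitude over γt/m, in agreement with the statement that the variation near the cut-off is essentially technical.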


8.2 Calculation of p-values

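Formally, we write the cumulative distribution function of the conditional log-fold change distribution as

\[ F_{L^{0}|T}(x) = P(L^{0}|T = t \le x) = P(\mathbf{X}^{0}_{2}|T = t - \mathbf{X}^{0}_{1}|T = t \le x) = \]

\[ = P(\mathbf{X}^{0}_{1}|T = t \ge \mathbf{X}^{0}_{2}|T = t - x) = \int_{-\infty}^{\infty}\int_{x_2 - x}^{\infty} f_{\mathbf{X}^{0}_{1}|T,\,\mathbf{X}^{0}_{2}|T}(x_1, x_2)\,dx_1\,dx_2, \quad x \in \mathbb{R}. \tag{8.8} \]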

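Since $\mathbf{X}^{0}_{1}|T = t$ and $\mathbf{X}^{0}_{2}|T = t$ are independent (and identically distributed), we may write

\[ F_{L^{0}|T}(x) = \int_{-\infty}^{\infty} f_{\mathbf{X}^{0}|T}(x_2)\left(\int_{x_2 - x}^{\infty} f_{\mathbf{X}^{0}|T}(x_1)\,dx_1\right)dx_2, \quad x \in \mathbb{R}. \tag{8.9} \]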

Now, let

\[ \Delta_i = \log_2(\eta^{1}_{i}) - \log_2(\eta^{0}_{i}) \tag{8.10} \]

be the observed log-fold change of gene i between the null treatment and some treatment “1”. Note that, when ∆i is defined as in Eq. (8.10), $f_{\mathbf{X}^{0}|T}(x)$ in Eq. (8.2) should be written as $f_{\mathbf{X}^{0}|T}(x) = f_{\mathbf{Y}^{0}|T}(2^{x})\cdot 2^{x}\ln(2)$. We would like to test the n hypotheses

H0 : ∆i = 0, i = 1, ..., n, against
H1 : ∆i ≠ 0.

The two-tailed p-value for gene i is defined as

\[ p_i := 2\cdot\min\left\{P\left(L^{0}|T = t_i \le x \mid H_0\right),\; P\left(L^{0}|T = t_i \ge x \mid H_0\right)\right\}, \tag{8.11} \]

and by making use of the definition of the cumulative distribution function, the expression for the p-value may be written as

\[ p_i = 2\cdot\min\left\{F_{L^{0}|T}(\Delta_i),\; 1 - F_{L^{0}|T}(\Delta_i)\right\}. \tag{8.12} \]

8.2.1 Limiting distributions

In principle, it is possible to compute the p-values for all genes with observed log-fold changes $\Delta_i$ by Eq. (8.12) together with Eq. (8.9). However, due to a slightly cumbersome numerical computation of the double integral, we will take an alternative route. In Sect. 8.1, we mentioned that the conditional log-fold change approaches a normal distribution with variance $\sigma^2_\infty \approx 2\gamma t/m$, as η → ∞. However, also at low expression levels, the conditional distribution is only slightly heavier in the tails than a normal (see Fig. 8.1, left panel). As a result, an approximate p-value computed from the normal distribution will be close to that computed from the correct model distribution, at least for observations not too far out in the tails (see Fig. 8.1, right panel). A factor of two difference in the p-values occurs at $p \simeq 10^{-5}$. From Fig. 8.1 (right panel), it is evident that the approximate method of computing p-values by using the normal distribution will statistically predict a larger number of significantly regulated genes.

[Figure 8.1. Left panel: log10(probability density) versus log-fold change. Right panel: log10(p-value) versus log-fold change $\Delta_i$.]

Figure 8.1: Tail-behaviour at low expression levels. Left panel: The black curve denotes the conditional log-fold change distribution of a model with α = 3.7, γ = 0.005, λ = 1.10, and m = 3 replicates, at a low expression level of η = 50. The distribution is estimated from a Monte Carlo simulation. The red curve denotes a normal distribution with zero mean and the same variance as the model distribution. The tails of the model distribution are only slightly heavier. Right panel: The black curve shows the p-value of the log-fold change $\Delta_i$, estimated from the simulated model distribution. The red curve denotes the corresponding p-value given by the normal distribution in the left panel. The grey, dashed line highlights the location at which the p-values differ by a factor of two. This occurs at $p \simeq 10^{-5}$.

8.2.2 Analytical approximation of the variance of L0 |T = t

In order to use the normal approximation in the calculation of p-values, we need to estimate the variance of the conditional log-fold change, i.e., given expression intensity t. An approximation of the variance is given by

$$
\begin{aligned}
\mathrm{Var}(L^0|T=t) &= \mathrm{Var}(X^0_2|T=t - X^0_1|T=t) = 2 \cdot \mathrm{Var}(X^0|T=t) \\
&= 2 \cdot \mathrm{Var}(\ln(Y^0)|T=t) \approx 2 \cdot \frac{\mathrm{Var}(Y^0|T=t)}{E(Y^0|T=t)^2},
\end{aligned} \tag{8.13}
$$

where Gauss's approximation of the first order was used in the last step. But this is simply two times the squared coefficient of variation for the conditional gene expression $Y^0|T=t$ (cf. Sect. 6.3). Since $Y^0|T=t$ can approximately be modelled by scaling down, with the number m of replicates, both the biological and the technical variation, we have that

$$
\sigma^2_{L^0|T} := \mathrm{Var}(L^0|T=t) \approx 2 \cdot \left( a(\gamma/m) \cdot \eta^{b(\gamma/m)} + \frac{1}{m\eta} - 1 \right), \tag{8.14}
$$

where

$$
b = b(\gamma/m) = \frac{1}{m/\gamma + 1/2}, \tag{8.15}
$$

$$
a = a(\gamma/m) = e^{-\alpha b(\gamma/m)}, \tag{8.16}
$$

and where the term $1/(m\eta)$ originates from the expectation of the conditional variance of $Y^0|Y^0_t$. Note that the scale is still given by the expected expression $\eta = e^{\alpha + t + \gamma t/2}$. Figure 8.2 illustrates how the variance of the log-fold change depends on the expected gene expression. Evidently, the analytical approximation described by Eq. (8.14) performs very well at all expression levels. Note also that the 95% confidence interval of the analytical approximation is computed by approximating the log-fold change distribution by a normal, as discussed above (see Fig. 8.1).

[Figure 8.2. Upper bound of the 95% confidence interval of the log-fold change versus log10(expected gene expression).]

Figure 8.2: Variance of the log-fold change as a function of gene expression. The grey plusses denote the upper bound of the 95% confidence interval of the basic model (α = 3.7, λ = 1.1, and γ = 0.005), assuming no treatment (null distribution), i.e., three replicates are averaged for each of the two treatments but the treatments assume identical model parameters. The red curve denotes the corresponding analytical approximation, computed by assuming the normal approximation (see Sect. 8.2.1) and the analytical expression of the variance in Eq. (8.14). The green, dashed curve denotes the upper bound of the 95% confidence interval, assuming that the variance is given by the limiting expression 2γt/m.
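The variance approximation of Eqs. (8.14)-(8.16) is straightforward to evaluate. The sketch below uses the generic parameter values quoted in the text (α = 3.7, γ = 0.005, m = 3); the function name `var_log_fold_change` and the choice η = 10^5 for the comparison against the limiting expression 2γt/m are assumptions made for illustration.

```python
import math


def var_log_fold_change(eta, alpha, gamma, m):
    """Approximate Var(L0 | T = t) following Eqs. (8.14)-(8.16)."""
    b = 1.0 / (m / gamma + 0.5)      # Eq. (8.15)
    a = math.exp(-alpha * b)         # Eq. (8.16)
    return 2.0 * (a * eta**b + 1.0 / (m * eta) - 1.0)   # Eq. (8.14)


alpha, gamma, m = 3.7, 0.005, 3
eta = 1.0e5

# Invert eta = exp(alpha + t + gamma * t / 2) to recover the intensity t.
t = (math.log(eta) - alpha) / (1.0 + gamma / 2.0)

sigma2 = var_log_fold_change(eta, alpha, gamma, m)
sigma2_limit = 2.0 * gamma * t / m   # high-expression limit
```

At this expression level the two values should agree to within a few per cent, since the technical term 1/(mη) is then negligible.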

8.2.3 P-values

In summary, the log-fold change is approximately described by a normal distribution at all expression levels such that

$$
L^0|T=t \approx N(0, \sigma^2_{L^0|T}), \tag{8.17}
$$

where $\sigma^2_{L^0|T}$ is given by the expression in Eq. (8.14). In this approximation, we can test the hypotheses $H_0: \Delta_i = 0$, i = 1, ..., n by a simple Z-test, while the approximate p-value for any observation $\Delta_i$ is given by

$$
p_i \approx 2 \cdot \min\left\{ \Phi\left( \frac{\Delta_i}{\hat\sigma_{L^0|T}(\hat\eta_i)} \right),\; 1 - \Phi\left( \frac{\Delta_i}{\hat\sigma_{L^0|T}(\hat\eta_i)} \right) \right\}, \tag{8.18}
$$

where

$$
\hat\sigma_{L^0|T}(\hat\eta_i) = \sqrt{ 2 \cdot \left( a(\gamma/m) \cdot \hat\eta_i^{\,b(\gamma/m)} + \frac{1}{m\hat\eta_i} - 1 \right) } \tag{8.19}
$$

is the plug-in estimate of the standard deviation of $L^0|T=t_i$. The sample mean expression $\hat\eta_i = \frac{1}{m}\sum_{j=1}^{m} y_{i,j}$ is taken as an unbiased estimate of $\eta(t_i)$, where $y_{i,j}$ is the observed expression of gene i in replicate j. Figure 8.3 illustrates which genes are significantly regulated in the E. Coli 30min data, as predicted by the basic model (see also Table 8.1).

[Figure 8.3. Log-fold change versus log10(average gene expression).]

Figure 8.3: Significantly regulated genes. The thick, red curves denote the approximate 95% confidence interval for individual genes (i.e., a test-wise significance level of α = 0.05), derived from the basic model that was fitted to the E. Coli data in Sect. 6.5. The genes with a log-fold change (30min replicates) outside this interval are denoted by black and dark grey circles. The black circles outside the thin, red curves have a family-wise error rate of α = 0.05 (Bonferroni correction).
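The Z-test of Eqs. (8.18)-(8.19) can be sketched in a few lines. The function names and the example expression level below are hypothetical, and the sketch assumes that the observed log-fold change and the plug-in standard deviation are expressed on the same logarithmic scale.

```python
import math
from statistics import NormalDist


def sigma_log_fold_change(eta_hat, alpha, gamma, m):
    """Plug-in standard deviation of the null log-fold change, Eq. (8.19)."""
    b = 1.0 / (m / gamma + 0.5)
    a = math.exp(-alpha * b)
    return math.sqrt(2.0 * (a * eta_hat**b + 1.0 / (m * eta_hat) - 1.0))


def z_test_p_value(delta, sigma):
    """Two-tailed p-value under the normal approximation, Eq. (8.18)."""
    phi = NormalDist().cdf(delta / sigma)
    return 2.0 * min(phi, 1.0 - phi)


# Illustrative values: generic model parameters and a gene with a mean
# expression of 1000 (an assumed number, not taken from the data).
sigma = sigma_log_fold_change(1000.0, 3.7, 0.005, 3)
p_null = z_test_p_value(0.0, sigma)            # no observed change
p_edge = z_test_p_value(1.96 * sigma, sigma)   # on the 95% boundary
```

A gene exactly on the thick curve of Fig. 8.3 would correspond to the `p_edge` case, with a p-value close to 0.05.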

A hypothesis test which does not take into account the uncertainty in the estimated parameters, but treats the parameters as if they were known, may be overly liberal. In such situations, the predicted p-values tend to be too small (cf. the Z-test and the t-test). However, since $|\partial\sigma_{L^0|T}/\partial\eta|$ is small over the entire range $\eta \in [\eta_c, 10^6]$ and the dependence on α and γ is low, the uncertainty in $\hat\sigma_{L^0|T}$ is very small. The error made by letting $P(L^0|T=t \le \Delta \mid H_0) \approx \Phi(\Delta/\hat\sigma_{L^0|T})$ should therefore also be small. Suppose that the associated estimator satisfies $d \cdot \hat\sigma^2_{L^0|T}/\sigma^2_{L^0|T} \sim \chi^2_d$, for some number d of degrees of freedom. Furthermore, if $\Delta \sim N(0, \sigma^2_{L^0|T})$, which is approximately true (see Sect. 8.2.1), we have that $\Delta/\hat\sigma_{L^0|T} \sim t_d$. If so, $\mathrm{Var}(\Delta/\hat\sigma_{L^0|T}) := \sigma^2_t = d/(d-2)$ (see [25]). By solving for d, we obtain $d = 2\sigma^2_t/(\sigma^2_t - 1)$. The "effective" number of degrees of freedom can then be estimated by computing the variance of $\hat\sigma_{L^0|T}$ from simulations. We find that d ≳ 900 in the entire range $\eta \in [\eta_c, 10^6]$. The "fattening" of the normal tails is therefore minute and Eq. (8.18) should hold.
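The relation $d = 2\sigma^2_t/(\sigma^2_t - 1)$ can be illustrated with a small simulation. The sketch below does not reproduce the thesis simulation; it draws t-distributed statistics with a known, assumed number of degrees of freedom (d = 10) and checks that the formula recovers it from the sample variance. All names are hypothetical.

```python
import math
import random


def effective_dof(var_t):
    """Solve var_t = d / (d - 2) for d; only meaningful for var_t > 1."""
    return 2.0 * var_t / (var_t - 1.0)


def simulate_t_variance(d, n, rng):
    """Sample variance of n draws of Z / sqrt(chi2_d / d)."""
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        # A chi-square variable with d degrees of freedom is a gamma
        # variable with shape d/2 and scale 2.
        chi2 = rng.gammavariate(d / 2.0, 2.0)
        t = z / math.sqrt(chi2 / d)
        total += t
        total_sq += t * t
    mean = total / n
    return total_sq / n - mean * mean


rng = random.Random(1)
var_t = simulate_t_variance(10, 200_000, rng)
d_est = effective_dof(var_t)
```

For large d the variance is very close to one, so the estimate becomes sensitive to sampling noise; a value such as d ≳ 900 therefore requires a correspondingly large simulation.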

8.2.4 Addressing the problem of multiple testing

In Fig. 8.3, the family-wise error rate is controlled by adjusting the p-values $p_i$, i = 1, ..., n according to the Bonferroni correction. That is, in order to obtain a family-wise significance level not larger than α = 0.05 (to be distinguished from the model parameter α), each of the n hypotheses is tested at a significance level of α/n. This adjusted significance level is, in fact, the first order approximation to the Šidák equation of independent tests [26]. A uniformly more powerful test than the Bonferroni correction with a family-wise error rate ≤ α is given by the Holm-Bonferroni method [27]. Let the unadjusted p-values $p_i$, corresponding to the null hypotheses $H_{0,i}$, be ordered in increasing order $p_{(1)}, \dots, p_{(n)}$. Then, for a given significance level α, reject the null hypotheses $H_{0,(1)}, \dots, H_{0,(k-1)}$, where k is the minimal index s.t.

$$
p_{(k)} > \frac{\alpha}{n - k + 1}. \tag{8.20}
$$

In our case of gene regulation in E. Coli, the Bonferroni correction and the Holm-Bonferroni method give nearly identical results (see Table 8.1).
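The two procedures can be sketched as follows. The function names and the four example p-values are hypothetical and chosen so that the two methods differ; rejection is taken to occur when a p-value is less than or equal to its threshold, in line with the strict inequality in Eq. (8.20).

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i whenever p_i <= alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure, Eq. (8.20).

    Hypotheses are visited in order of increasing p-value and rejected
    until the first p_(k) exceeding alpha / (n - k + 1) is met.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, i in enumerate(order, start=1):
        if p_values[i] > alpha / (n - rank + 1):
            break
        reject[i] = True
    return reject


p = [0.04, 0.001, 0.02, 0.013]       # illustrative p-values only
bonf = bonferroni_reject(p)          # only the smallest p-value passes
holm = holm_reject(p)                # all four hypotheses are rejected
```

Every hypothesis rejected by the Bonferroni correction is also rejected by the Holm-Bonferroni method, which is why the latter is uniformly more powerful.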

8.3 Correlation between L0 |T = t and the estimator η(t)

In the above analysis, the estimate of the null expression level $\hat\eta_i = \hat\eta(t_i)$ of gene i is given by the average of the three 0min replicates. The general associated estimator is thus given by

$$
\hat\eta(t_i) = \frac{1}{m} \sum_{j=1}^{m} Y_{i,j} = \frac{1}{m} \sum_{j=1}^{m} Y_j|T=t_i := Y^0|T=t_i = e^{X^0|T=t_i}. \tag{8.21}
$$

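Hence, $L^0|T=t_i$ and $\hat\eta(t_i)$ are obviously dependent and we have that

$$
\begin{aligned}
\mathrm{Cov}\left( L^0|T=t_i,\, \ln(\hat\eta(t_i)) \right) &= \mathrm{Cov}\left( X^0_2|T=t_i - X^0_1|T=t_i,\; X^0_1|T=t_i \right) \\
&= \mathrm{Cov}\left( X^0_2|T=t_i,\, X^0_1|T=t_i \right) - \mathrm{Cov}\left( X^0_1|T=t_i,\, X^0_1|T=t_i \right) \\
&= -\mathrm{Var}\left( X^0_1|T=t_i \right).
\end{aligned} \tag{8.22}
$$

Therefore

$$
\begin{aligned}
\mathrm{Cor}\left( L^0|T=t_i,\, \ln(\hat\eta(t_i)) \right) &= \frac{ -\mathrm{Var}\left( X^0_1|T=t_i \right) }{ \sqrt{ \mathrm{Var}\left( L^0|T=t_i \right) \cdot \mathrm{Var}\left( \ln(\hat\eta(t_i)) \right) } } \\
&= \frac{ -\mathrm{Var}\left( X^0_1|T=t_i \right) }{ \sqrt{ 2 \cdot \mathrm{Var}\left( X^0_1|T=t_i \right) \cdot \mathrm{Var}\left( X^0_1|T=t_i \right) } } = -\frac{1}{\sqrt{2}}.
\end{aligned} \tag{8.23}
$$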

This dependence introduces a small deviation from the symmetry of Eq. (8.18). As a result, the upper and lower bounds of the confidence intervals in Fig. 8.3 are not exactly symmetric around $L^0|T=t = 0$ (the x-axis in Fig. 8.3). The effect is, however, small. Ideally, this correlation effect can be dealt with by preparing for an additional number m of null replicates to be used as the reference treatment instead of "reusing" the average null expression levels in the computation of the log-fold change. However, due to generally high costs associated with RNA sequencing, extra null replicates are seldom prioritised.
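The value $-1/\sqrt{2}$ in Eq. (8.23) holds for any pair of independent, identically distributed variables with finite variance, which a short Monte Carlo check makes concrete. Standard normal draws are used here purely as an assumed stand-in for $X^0_1|T=t_i$ and $X^0_2|T=t_i$.

```python
import math
import random


def sample_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n = 100_000
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]   # stand-in for X1 | T = t
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]   # stand-in for X2 | T = t
lfc = [b - a for a, b in zip(x1, x2)]          # log-fold change X2 - X1

rho = sample_correlation(lfc, x1)              # compare with -1/sqrt(2)
```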


Table 8.1: Predicted number of significantly regulated genes for E. Coli, 30min treatment

                                  Procedure I†   Procedure II††   No. of genes in common
Uncorrected no. of sign. genes    1881           1872             1860
Bonferroni correction             676            680              658
Holm-Bonferroni method            682            685              682

† Correlation between $L^0|T=t$ and $\hat\eta(t)$ not accounted for.
†† Taking the average of both treatments as an estimate of the null expression level.

By taking the average of both treatments as an estimate of the null expression level, the correlation effect vanishes. Unfortunately, if gene i is highly regulated, this procedure will shift the estimate $\hat\eta_i$ of the null expression level a non-negligible amount. In doing so, the p-value corresponding to gene i will be computed assuming an incorrect variance of the null distribution. However, since the variance is a fairly slowly varying function of η, this affects only a small fraction of genes close to the significance limit. In particular, the genes at low expression levels will be affected due to a stronger dependence on η in this region. The error (e.g., of type I) we make by not accounting for the correlation in Eq. (8.23) is comparable in magnitude to the error introduced by taking the average of both treatments as an estimate of the null expression level, at least in the case of the present data set (see Table 8.1).


Chapter 9

Comparison to other methods

In this section, we will briefly discuss and compare the results of the hypothesis testing of the basic model to those of the DESeq, edgeR, and limma standard packages, which are all developed to analyse count data. We will mostly discuss the differences between the methods, rather than the actual results.

9.1 Comparison to the negative binomial distribution

In DESeq [12] and edgeR [13], the underlying distribution that is used to model the over-dispersion of the expression data is the negative binomial. This standard choice of distribution is chiefly made for convenience and is not based on any physical arguments about, e.g., molecular growth. In our terminology, it is thus assumed that

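$$
Y|T=t \sim \mathrm{NegBin}(r, \mu_{NB}), \tag{9.1}
$$

where $\mu_{NB} > 0$ is the mean of the distribution while

$$
r = \frac{\mu_{NB}^2}{\sigma_{NB}^2 - \mu_{NB}}, \quad r \in \mathbb{R}^+ \tag{9.2}
$$

is the dispersion parameter, expressed in terms of the mean and the variance $\sigma^2_{NB}$. The probability mass function for any random variable $W \sim \mathrm{NegBin}(r, \mu_{NB})$ may then be written as (cf. Eq. (1.8) with r ≡ k and w ≡ υ, see also, e.g., [28])

$$
P(W = w) = \frac{\Gamma(r + w)}{\Gamma(w + 1)\,\Gamma(r)} \cdot \left( \frac{r}{r + \mu_{NB}} \right)^{r} \left( \frac{\mu_{NB}}{r + \mu_{NB}} \right)^{w}, \quad w = 0, 1, 2, \dots \tag{9.3}
$$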

where Γ(·) is the complete gamma-function. By identifying $\mu_{NB} = E(Y|T=t) = \eta$ and

$$
r = \frac{ \left( E(Y|T=t) \right)^2 }{ \mathrm{Var}(Y|T=t) - E(Y|T=t) } = \frac{ \eta^2 }{ \eta^2 \cdot (a\eta^b - 1) + \eta - \eta } = \left( a\eta^b - 1 \right)^{-1}, \tag{9.4}
$$

we can directly compare the distribution of the negative binomial to that of Y|T = t presented in this thesis, from hereon referred to as a "fuzzy" log-normal distribution. The result is illustrated in Fig. 9.1 for a mean expression of η = 10 000. Clearly, there is a detectable difference. This difference is translated into an altered p-value, also depicted in Fig. 9.1. For example, at an observed log-fold change of ∆ = 3, corresponding to a differential expression of a factor 8, a null distribution given by the negative binomial would predict an approximately 500 times higher p-value than the "fuzzy" log-normal null distribution (based on extrapolation estimates, see Fig. 9.1). As usual, we assume the generic basic model given by (α, λ, γ, δ) = (3.7, 1.1, 0.005, 1). It is noted that genes with η = 10 000 and log-fold changes ∆ ≳ 1.5 have $p \lesssim 10^{-5}$, as given by the basic model. However, after applying a correction for multiple testing, even a factor of two difference in p-value would have an impact on the predicted number of significantly regulated genes.

[Figure 9.1. Left panel: log10(probability density) versus mean gene expression (x 1,000). Right panel: ratio P-value(NegBin) / P-value(basic model) versus log-fold change $\Delta_i$.]

Figure 9.1: Comparison to the negative binomial null distribution. Left panel: The black curve denotes the probability density of the "fuzzy" log-normal null distribution with a mean gene expression of η = 10 000 and a standard deviation $\omega = 1.67 \cdot 10^3$ (the usual generic values of the model are assumed), as given by Eq. (3.26). The red curve denotes the probability mass function of a negative binomial distribution with the same mean and variance as the "fuzzy" log-normal. Right panel: The green curve denotes the corresponding difference (ratio) in p-value between the negative binomial and the "fuzzy" log-normal null distributions, as a function of the log-fold change. The dashed curve is a power-law fit to the data. The dotted line denotes a factor of two difference in p-value.
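The moment matching in Eq. (9.4) can be sketched numerically. The code below assumes that a and b take the single-replicate form $b = 1/(1/\gamma + 1/2)$ and $a = e^{-\alpha b}$ (Eqs. (8.15)-(8.16) with m = 1); the function names are hypothetical. The probability mass function of Eq. (9.3) is evaluated on the logarithmic scale to avoid overflow in the gamma functions.

```python
import math


def negbin_dispersion(eta, alpha, gamma):
    """Dispersion parameter r = 1 / (a * eta**b - 1), Eq. (9.4)."""
    b = 1.0 / (1.0 / gamma + 0.5)
    a = math.exp(-alpha * b)
    return 1.0 / (a * eta**b - 1.0)


def negbin_log_pmf(w, r, mu):
    """Logarithm of the probability mass function in Eq. (9.3)."""
    return (math.lgamma(r + w) - math.lgamma(w + 1.0) - math.lgamma(r)
            + r * math.log(r / (r + mu))
            + w * math.log(mu / (r + mu)))


eta = 10_000.0
r = negbin_dispersion(eta, 3.7, 0.005)

# Standard deviation implied by Eq. (9.2): sigma^2 = mu + mu^2 / r.
sd = math.sqrt(eta + eta**2 / r)

# Sanity check: the mass function should sum to one over a wide range.
total = sum(math.exp(negbin_log_pmf(w, r, eta)) for w in range(40_000))
```

With the generic parameter values, `sd` should come out close to the standard deviation $\omega = 1.67 \cdot 10^3$ quoted for Fig. 9.1.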

9.2 Comparison to results from limma

Unlike DESeq and edgeR, the limma package [29], including the voom method [30], computes p-values based on a "moderated" t-test, where the associated null distribution is assumed to be t-distributed. A moderated variance of the expression for each gene is computed from the gene-specific estimate of the variance (which is based on the number of replicates) by "correcting" it in the direction of a global estimate (see Fig. 9.2), using a Bayesian model. This moderated variance is then used in the t-statistic.

Table 9.1: Predicted number of significantly regulated genes by limma, 30min

                                  limma   basic model†   No. of genes in common
Uncorrected no. of sign. genes    1154    1881           1094
Bonferroni correction             168     676            168
Holm-Bonferroni method            170     682            170

† Equivalent to Procedure I in Table 8.1.

Figure 9.2: Mean-variance trend. The black dots denote the square root of the standard deviation of individual genes, as a function of average count (0min data). The red curve is a LOWESS fit to the data. Output from limma. Credit: T. Kallman.

[Figure 9.3. Log-fold change versus log10(average counts).]

Figure 9.3: Significantly regulated genes from limma. Light red and green dots denote, respectively, significantly up- and down-regulated genes in the 30min replicates, before correction for multiple testing. Dark red and green dots denote up- and down-regulated genes, after p-value adjustment (Bonferroni correction). The grey dots denote those genes for which the null hypothesis of no regulation could not be rejected. Data acquisition: T. Kallman.

Figure 9.3 shows the significantly regulated genes in the 30min replicates, before and after correction for multiple testing (Bonferroni adjustment). The results are summarised in Table 9.1. Evidently, limma appears to be more conservative in rejecting the null hypothesis as compared to the basic model. The basic-model-to-limma ratio of significantly regulated genes, without correction for multiple testing, is about 1.6. After correction, there are four times as many significantly regulated genes, according to the basic model. The reason for this difference is two-fold. Firstly, for the count data, limma appears to find a larger mean-variance trend as compared to what we find in the FPKM data (cf. Fig. 8.2). Secondly, due to its heavier tails, a t-distributed null distribution would predict larger p-values than a normal distribution would.

Also worth noting is that there are no less than 60 genes (without correction) which are believed to be significantly regulated according to limma, but are not present in the basic-model list of significantly regulated genes. This presumably has to do with how the variances of individual genes are weighted in voom. A more detailed analysis is beyond the scope of this study.


Chapter 10

Future work

The basic model presented in Sect. 3 may be developed in several different directions. Some of these developments are briefly discussed in this section.

10.1 Replicate-specific model with four parameters

Due to the intricacies of the data reduction and the normalisation procedure, the parameter α, which determines the location of the peak of the gene expression distribution, may vary somewhat between replicates (see Fig. 10.1). As hinted in Sect. 7.3, the same goes for γ. Over- or under-compensation for non-linear effects in the normalisation procedure (see Fig. 10.1) may introduce wave-like patterns in the high-expression tails of the replicates, which would mimic the presence of a larger biological variation, and thus a larger γ. A real difference in the biological variation between replicates could certainly be present, as well. For bacteria growing in the laboratory, one could imagine that the biological variation changes with location on the curve-of-growth. That is, if one of the replicates was forced to stop growing at a slightly different location on the curve-of-growth, this would ultimately result in a different γ corresponding to that replicate.

To this point, we have assumed that all replicates have been sequenced to the same depth, i.e., they all have a size factor of δ = 1. However, if replicates are sequenced to different depths, the size of the variation due to the technical noise should differ between replicates. A comparison between log-fold changes of the three available replicate pairs in Fig. 10.1 suggests that the (technical) variation at low expression levels indeed displays this effect. The model inadequacy is significantly reduced by introducing a fourth, variance-scaling parameter δ ≃ 1, which controls the strength of the Poisson noise (see Fig. 10.2). Hence, for a replicate j, we have

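$$
X|T=t \sim N(\alpha_j + t, \gamma_j t), \quad \text{where } T \sim \mathrm{Exp}(\lambda), \tag{10.1}
$$

$$
Y \overset{D}{=} e^{X}, \tag{10.2}
$$

$$
\mathcal{Y}|Y=y \approx N(y, \delta_j y), \tag{10.3}
$$

$$
\mathcal{X} \overset{D}{\approx} \ln(\mathcal{Y}), \tag{10.4}
$$

where $\mathcal{Y}$ and $\mathcal{X}$ denote the observed expression and its logarithm, and where all parameters, except λ, are made replicate-specific. Note that, in the basic model, $\delta_j \equiv 1 \;\forall j$. The deviation from δ = 1 may, e.g., in addition to size factors that differ from unity, be an artefact of the normalisation.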
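A minimal simulation sketch of the four-parameter model in Eqs. (10.1)-(10.4) is given below for a single gene at a fixed intensity t. The function names, the intensity t = 1 and the two values of δ (taken from Fig. 10.2) are illustrative assumptions; the observed expression is assumed to stay positive, which holds with overwhelming probability at this expression level.

```python
import math
import random


def simulate_observed(t, alpha, gamma, delta, rng):
    """One observed expression value for a gene with intensity t."""
    x = rng.gauss(alpha + t, math.sqrt(gamma * t))   # Eq. (10.1)
    y = math.exp(x)                                  # Eq. (10.2)
    return rng.gauss(y, math.sqrt(delta * y))        # Eq. (10.3)


def log_fold_change_variance(t, alpha, gamma, delta, n, rng):
    """Sample variance of the log-fold change between two replicates."""
    lfcs = []
    for _ in range(n):
        y1 = simulate_observed(t, alpha, gamma, delta, rng)
        y2 = simulate_observed(t, alpha, gamma, delta, rng)
        lfcs.append(math.log(y2) - math.log(y1))     # Eq. (10.4)
    mean = sum(lfcs) / n
    return sum((v - mean) ** 2 for v in lfcs) / (n - 1)


rng = random.Random(42)
var_low_delta = log_fold_change_variance(1.0, 3.7, 0.005, 0.5, 5000, rng)
var_high_delta = log_fold_change_variance(1.0, 3.7, 0.005, 2.0, 5000, rng)
```

At this low expression level the technical term dominates, so the larger δ should produce a clearly larger spread in the log-fold change.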

The model could be refined even further, e.g., by accounting for the variation in the technical noise as a result of the genes being of different length. However, this effect may be negligible, at least to the first order, as suggested by Fig. 10.3.

[Figure 10.1. Upper panel: Log2(BB10/BB09) versus log10(average gene expression). Lower panel: Log2(BB17/BB10) versus log10(average gene expression).]

Figure 10.1: Log-fold changes of the experimental data. The upper panel (blue circles) shows the log-fold change between replicates BB09 and BB10 while the lower panel (green circles) shows the log-fold change between replicates BB10 and BB17. As indicated by the dashed lines, the parameter α is slightly different for the two data pairs. Moreover, a wave-like pattern is present at high-expression levels, predominantly in the lower panel. This pattern could possibly be an artefact of the normalisation and would mimic a larger γ. Finally, at low expression levels, i.e., at $\eta_c \simeq 1.5$, the range in the log-fold change appears somewhat different between the two plots, with the upper log-fold change having a slightly smaller variation than the lower plot. This difference is modelled by introducing a replicate-specific, variance-scaling parameter δ.

10.1.1 Note on parameter estimation

For the replicate-specific model, the estimation of the model parameters must be performed in a slightly different way. The estimation of $\alpha_j$ and λ should be done on the individual replicates, instead of considering all the replicates as a single sample. Since α depends on j, it is preferable to also consider λ as a replicate-specific parameter and then simply take the average

$$
\hat\lambda = \frac{1}{m} \sum_{j=1}^{m} \hat\lambda_j \tag{10.5}
$$

as an estimate of λ. After fixing λ, the estimation procedure should be reiterated with all $\mu_k$ set to $\mu_k = 1/\hat\lambda$ in order to obtain good estimates of $\alpha_j$, j = 1, ..., m.

The estimation of the replicate-specific $\gamma_j$'s and $\delta_j$'s is more difficult to do without ending up with numerical, nonlinear optimisation of the inverse problem. In this case, there are several ways to go. Either $\alpha_j$ and λ are assumed to be known (from above) and we try to solve the reduced inverse problem of estimating $\gamma_j$ and $\delta_j$ $\forall j = 1, \dots, m$, or we try to solve the full inverse problem, where the $\hat\alpha_j$'s and $\hat\lambda$ are taken as initial guesses. Similarly, we would then naturally have that $\gamma_{j,0} = \hat\gamma$ and $\delta_{j,0} = 1$, where $\hat\gamma$ is the estimate of γ in the basic model. In order to solve this unconstrained inverse problem, some quasi-Newton method, like the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, may be used [15].

[Figure 10.2. Log-fold change versus log10(expected gene expression).]

Figure 10.2: Log-fold changes of model data. The green circles denote the simulated log-fold change for n = 4207 genes with δ = 0.5 while the underlying blue circles denote simulated data with δ = 2.0. Clearly, at low expression levels, the two models show a different range in the log-fold change. At high expression levels (say, for log10(η) ≳ 3), the variation is dominated by γ. And, since γ = 0.005 in both models, the variance of the log-fold change will approximately be equal in this range of η.

10.2 Modelling of a positive biological variance at all gene expression levels

One physically intuitive modification of the biological variation is to replace the variance γt in the conditional distribution X|T = t with the expression γ(α + t). We thus have that

$$
X|T=t \sim N(\alpha + t, \gamma(\alpha + t)). \tag{10.6}
$$

The reason for this modification is two-fold. In this way, we can more naturally assign an expression intensity $\tau_i = \alpha + t_i$ to each active gene i. That is, the active genes in the genome would have expression intensities distributed according to a shifted exponential with a minimum intensity of α s.t. $\mathcal{T} \overset{D}{=} \alpha + T$, where $T \sim \mathrm{Exp}(\lambda)$. The biological variation for a gene with an expression intensity τ is then simply modelled as γτ. As a result, the statistical model can be written in a quite aesthetically attractive form. We have that

$$
X|\mathcal{T}=\tau \sim N(\tau, \gamma\tau), \tag{10.7}
$$

$$
Y \overset{D}{=} e^{X}, \tag{10.8}
$$

$$
\mathcal{Y}|Y=y \approx N(y, \delta y), \tag{10.9}
$$

where $\mathcal{Y}$ denotes the observed expression and the variance-scaling parameter δ is included as well. Note that genes which are inactive may theoretically be modelled within the same framework by ascribing them an expression intensity lower than α.

[Figure 10.3. Log(standard deviation) versus gene length.]

Figure 10.3: Dependence on gene length. The black circles denote the sample standard deviation of genes in the range log10(η) ∈ [1.5, 2.0], as a function of gene length. Since all genes roughly have the same expression, the longest genes should have the largest number of reads. Consequently, they should experience a lower technical variation. Indeed, a low dependency on gene length is detected, as indicated by the green regression line. However, the spread around the mean trend is large and extreme standard deviations, both around log10(ω) = 1.5 and log10(ω) = −1, are found over the entire range in gene length, as indicated by the red, dashed lines.

Moreover, this modification may arguably be more physically attractive since it predicts a non-zero variance at the cut-off expression level $\eta_c$. In the basic model, the dispersion relation of the underlying distribution Y (i.e., the distribution without the Poisson-noise) goes to zero at $\eta = \eta_c = e^{\alpha}$. In essence, there is no such cut-off in the modified version of the model. Consequently, super-stable genes with exactly zero biological variance at a non-zero expression level are absent and the model anticipates a positive biological variation all the way down to zero expression.

It would be interesting to be able to disentangle these two versions of the model, even though the technical variation at low expression levels erases any signs of zero biological variation in the underlying distributions. The underlying dispersion relation of the modified model is given by

$$
\eta = E(Y|\mathcal{T}=\tau) = e^{\tau + \gamma\tau/2} = e^{(1+\gamma/2)\tau} \;\Rightarrow\; \tau = \ln\left( \eta^{(1+\gamma/2)^{-1}} \right) \quad \text{and} \tag{10.10}
$$

$$
\begin{aligned}
\omega^2(\eta) = \mathrm{Var}(Y|\mathcal{T}=\tau) &= \eta^2 \left( e^{\gamma\tau} - 1 \right) = \eta^2 \left( e^{\gamma \ln\left( \eta^{(1+\gamma/2)^{-1}} \right)} - 1 \right) \\
&= \eta^2 \left( \eta^{(\gamma^{-1} + 1/2)^{-1}} - 1 \right) = \eta^2 \cdot \left( \eta^{b} - 1 \right),
\end{aligned} \tag{10.11}
$$

where b = 1/(1/γ + 1/2) as in Eq. (3.21) while $a = e^0 = 1$. The fact that a = 1 has an interesting effect on the mean slope of the dispersion relation at high expression levels, as illustrated in Fig. 10.4. In the region η ≃ 1 000 − 100 000, the effective exponent (i.e., the slope in the log-log plot) of the dispersion relation of the modified model is pushed down to b = 1.05, while for the basic model, the exponent stays at b = 1.1. For j high enough, this effect should be detectable.
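The difference in mean slope can be reproduced with a short calculation. The sketch below evaluates $\omega^2(\eta) = \eta^2 (a\eta^b - 1)$ for both models and takes the secant slope in the log-log plane over $\eta \in [10^3, 10^5]$; the secant is an assumed simplification of whatever fitting procedure underlies Fig. 10.4, and the modified model is obtained by setting α = 0 in $a = e^{-\alpha b}$ so that a = 1. Function names are hypothetical.

```python
import math


def log_omega2(eta, gamma, alpha):
    """ln of the squared dispersion relation eta^2 * (a * eta^b - 1)."""
    b = 1.0 / (1.0 / gamma + 0.5)
    a = math.exp(-alpha * b)
    return 2.0 * math.log(eta) + math.log(a * eta**b - 1.0)


def mean_slope(gamma, alpha, lo=1.0e3, hi=1.0e5):
    """Secant slope of ln(omega^2) against ln(eta) between lo and hi."""
    rise = log_omega2(hi, gamma, alpha) - log_omega2(lo, gamma, alpha)
    return rise / (math.log(hi) - math.log(lo))


slope_basic = mean_slope(0.005, 3.7)       # basic model
slope_modified = mean_slope(0.0017, 0.0)   # modified model, a = 1
```

Both slopes refer to $\omega^2$, so halving them gives the effective exponents of the dispersion relation itself, roughly 1.1 and 1.05.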

[Figure 10.4. Log10(dispersion) versus log10(η).]

Figure 10.4: Difference in mean slope. The green line denotes the dispersion relation corresponding to the basic model with α = 3.70 and γ = 0.005 while the blue line denotes the dispersion relation of the modified model in Sect. 10.2 with γ = 0.0017. The parameter γ is scaled down a factor of three, in order to fit the same experimental data. For log(η) ∈ [3.0, 5.0], the squared dispersion relation $\omega^2(\eta) \propto \eta^{2.18}$ for the basic model while $\omega^2(\eta) \propto \eta^{2.10}$ for the modified model.

10.3 Accounting for the correlation between genes

In the basic model, it is, e.g., assumed that $Y_{i,j}$ and $Y_{i',j'}$, $(i,j) \neq (i',j')$, are independent (see Sect. 3.2.3). But this is not true. If $Y_{i,j}$ and $Y_{i',j'}$ were to be independent for all $(i,j) \neq (i',j')$, the auto-correlation function of the log-fold change for any two replicates j and j′, where the genes are sorted by location on the chromosome, would show the characteristic pattern of white noise. However, as depicted in Fig. 10.5, this is not the case.

In order to perform certain tasks, e.g., producing an enzyme, the cell needs clusters of genes to collaborate. For efficiency reasons, such clusters of genes may be co-regulated in units called operons, enforcing the genes to be transcribed together as a single mRNA. Operons are either switched on or off. Exactly how the gene expression is affected in the presence of operons is not fully understood. However, studies suggest that the expression level of a gene increases with length of the operon [31,32]. The expression level was also found to be higher for genes located in the beginning of the operon. If so, the presence of operons would introduce a dependence between genes. Other studies have found little evidence that spatial clustering of genes would significantly increase the correlation between them [33]. Operons are particularly common in procaryotes like bacteria (there are, e.g., 2584 documented or predicted operons in E. Coli [34]).

In principle, one way to introduce an inter-gene dependence within the proposed model is to apply the biological variation and the technical variation on different levels of structures. For example, the biological variation could be applied at the operon level while the technical variation subsequently is applied at the gene level. Let $t = (t_1, \dots, t_l)$ be a realisation of expression intensities belonging to the l = 2584 operons in E. Coli and let X|T = t ∼ N(α + t, γt), where T ∼ Exp(λ) as usual. By assuming a distribution of the size of the operons (i.e., how many genes are co-regulated by each operon), we may then apply the step where the technical, Poisson-like noise is added, to each gene. This procedure does introduce a dependence between the genes. However, in order to identify the sought-after dependence, a detailed investigation should be performed.
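The two-level idea can be sketched with a toy simulation: one biological draw per operon and replicate, followed by independent technical noise for each gene in the operon. The fixed operon size of two genes, the function names, and the interpretation of λ as a rate parameter are assumptions made only for this illustration; the thesis leaves the operon size distribution open.

```python
import math
import random


def simulate_operon_lfc(t, size, alpha, gamma, rng):
    """Log-fold changes between two replicates for the genes of one operon."""
    sd_bio = math.sqrt(gamma * t)
    # Biological variation: one underlying level per replicate, shared by
    # all genes in the operon.
    levels = [math.exp(rng.gauss(alpha + t, sd_bio)) for _ in range(2)]
    lfcs = []
    for _ in range(size):
        # Technical, Poisson-like noise is added independently per gene.
        y1 = rng.gauss(levels[0], math.sqrt(levels[0]))
        y2 = rng.gauss(levels[1], math.sqrt(levels[1]))
        lfcs.append(math.log(y2) - math.log(y1))
    return lfcs


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)
first, second = [], []
for _ in range(2584):                      # number of operons in the text
    t = rng.expovariate(1.1)               # T ~ Exp(lambda), rate assumed
    g1, g2 = simulate_operon_lfc(t, 2, 3.7, 0.005, rng)
    first.append(g1)
    second.append(g2)

rho_within = correlation(first, second)    # genes sharing an operon
```

A positive `rho_within` reflects the shared biological draw; whether this matches the dependence seen in Fig. 10.5 is exactly the open question raised above.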

[Figure 10.5: two panels. Upper panel: log-fold change versus index (position on chromosome). Lower panel: ACF versus lag (position).]

Figure 10.5: Correlation between genes. Upper panel: The log-fold change for individual genes in BB09 and BB10, sorted by location on the chromosome. The pattern is suggestive of correlations between genes. Lower panel: The auto-correlation function of the log-fold change. Clearly, correlations between genes are present.
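The auto-correlation function in the lower panel can be computed along the following lines. This is a minimal sketch using the standard sample estimator, not necessarily the routine used to produce the figure. Because the measured log-fold changes are not reproduced here, a synthetic moving-average series stands in for them; the function name `autocorrelation` and the synthetic data are assumptions.

```python
import numpy as np

def autocorrelation(values, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    v = np.asarray(values, dtype=float)
    v = v - v.mean()
    n = v.size
    denom = np.dot(v, v)
    return np.array([np.dot(v[:n - k], v[k:]) / denom
                     for k in range(max_lag + 1)])

# Synthetic stand-in for log-fold changes ordered by chromosomal position:
# a 5-point moving average of white noise, correlated over short lags only.
rng = np.random.default_rng(1)
noise = rng.normal(size=4000 + 4)
log_fold_change = np.convolve(noise, np.ones(5) / 5, mode="valid")  # length 4000

acf = autocorrelation(log_fold_change, max_lag=300)
# Approximate 95% band for an uncorrelated sequence: +/- 1.96 / sqrt(n).
band = 1.96 / np.sqrt(log_fold_change.size)
```

Values of `acf` that lie well outside `band` at small lags indicate a dependence between neighbouring positions, which is the kind of signal the figure is meant to show.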

10.4 Two-component mixture distribution of the intensities

When considering the logarithm of the gene expression data, there appears to be a small but observable change of slope in the distribution around an expression of x = ln(z) ≃ 5.5, corresponding to z ≃ 250. As shown in Fig. 10.6, this feature is observed in all three 0 min samples. Furthermore, in at least two out of three samples (i.e., BB09 and BB10), a fairly strong secondary peak appears around z ≃ 70, next to the peak around z ≃ 50. One possible interpretation of this feature is that the gene expression is really a mixture of two distributions, slightly offset (i.e., different αk, k = 1, 2) and with different slopes λk, k = 1, 2. Suppose that

$$T = (1 - P) \cdot \tilde{T}_1 + P \cdot \tilde{T}_2, \tag{10.12}$$

where $\tilde{T}_k \overset{D}{=} \alpha_k + T_k$ and $T_k \sim \mathrm{Exp}(\lambda_k)$, $k = 1, 2$, while $P \sim \mathrm{Ber}(p)$ is a Bernoulli-distributed random variable with success probability $p$. Similar to the distribution in Sect. 10.2, the conditional gene expression is then given by $X \mid T = \tau \sim N(\tau, \gamma\tau)$. Note that, in the small-$\gamma$ approximation (see Sect. 4), we may assume that $X \approx (1 - P) \cdot \tilde{T}_1 + P \cdot \tilde{T}_2$. The observed (logarithmic) gene expression is then given by $\mathcal{X} = \ln(\mathcal{Y})$, where $\mathcal{Y} \mid Y = y \approx N(y, y)$ and $Y = e^{X}$, as in Sect. 3. Following the argument in Sect. 10.1, we may alternatively assume that $\mathcal{Y} \mid Y = y \approx N(y, \delta y)$, with $\delta \simeq 1$ being the variance-scaling parameter of the technical noise.
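To make the sampling order of the mixture model concrete, the sketch below draws observed log-expression values step by step: component label, shifted exponential intensity, biological variation, and technical variation. The values of (α1, λ1), (α2, λ2) and p are the by-eye values quoted in the caption of Fig. 10.7. The value of γ, the interpretation of λk as the rate of the exponential distribution, and the function name are assumptions made for illustration.

```python
import numpy as np

def sample_mixture_log_expression(n, alpha, lam, p, gamma=0.02, delta=1.0, seed=0):
    """Draw observed log-expression values from the two-component mixture.

    alpha, lam: pairs (alpha_1, alpha_2) and (lam_1, lam_2); lam_k is taken
    to be the rate of Exp(lam_k). gamma is an assumed, illustrative value.
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # P ~ Ber(p) selects the component: 0 -> component 1, 1 -> component 2.
    comp = rng.binomial(1, p, size=n)
    # Shifted exponential intensities: alpha_k + T_k, with T_k ~ Exp(lam_k).
    tau = alpha[comp] + rng.exponential(scale=1.0 / lam[comp])
    # Biological variation: X | T = tau ~ N(tau, gamma * tau).
    x = rng.normal(loc=tau, scale=np.sqrt(gamma * tau))
    # Technical variation: normal approximation with variance delta * y.
    y = np.exp(x)
    y_obs = rng.normal(loc=y, scale=np.sqrt(delta * y))
    y_obs = y_obs[y_obs > 0]  # discard rare non-positive draws before the log
    return np.log(y_obs)

x_obs = sample_mixture_log_expression(
    n=100_000, alpha=(3.75, 4.10), lam=(3.00, 0.95), p=0.48)
# Binned density of the simulated log-expression values.
density, edges = np.histogram(x_obs, bins=80, density=True)
```

Plotting the logarithm of `density` against the bin centres gives a curve that can be compared by eye with the measured distributions; a fitted value of γ would be needed for a quantitative comparison.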


[Figure 10.6: three panels, each showing log(probability density) versus log(expression).]

Figure 10.6: Gene expression distribution: a change in slope. From top to bottom, the panels show the gene expression distribution of the 0 min BB09, BB10, and BB17 replicates, respectively. All distributions suggest a change in tail index, as indicated by the dashed lines. In addition, the peak at ln(η) ≃ 4 is broader than what is expected from the basic model, which may be a result of the existence of a secondary peak, whose position is indicated by the arrow.

[Figure 10.7: log(probability density) versus log(gene expression).]

Figure 10.7: Effect of the decomposition of expression intensities. The black curve denotes the (binned) gene expression data of the 0 min BB09 sample. The red, solid curve denotes the two-component mixture model, fitted by eye to the data. The two red, dashed curves denote the two corresponding subpopulations, characterised by the different distributions of expression intensities T1 and T2, where (α1, λ1) = (3.75, 3.00) while (α2, λ2) = (4.10, 0.95), and p = 0.48.


Figure 10.7 illustrates the effect of a decomposition of expression intensities, as described by Eq. (10.12). From a biological point of view, this decomposition could be a result of the existence of two underlying populations of genes within the E. coli genome. For example, the mechanisms controlling the expression of a gene may differ between the two subpopulations. This idea would be interesting to follow up. It is noted, however, that the change in slope may have an origin other than a mixture of two exponential decays. This could still be modelled within the present framework by considering the distribution of expression intensities to be of another suitable form.


Chapter 11

Summary and concluding remarks

In the preceding chapters, we have developed a statistical model that simultaneously addresses the distribution of gene expression and the expected variance in expression for individual genes. In its basic setting, it is presented as a hierarchical model with three free parameters θ = (α, λ, γ) s.t.

$$X \mid T = t \sim N(\alpha + t, \gamma t), \quad \text{where } T \sim \mathrm{Exp}(\lambda), \tag{11.1}$$

$$Y \overset{D}{=} e^{X}, \tag{11.2}$$

$$\mathcal{Y} \mid Y = y \approx N(y, y), \tag{11.3}$$

$$\mathcal{X} \overset{D}{\approx} \ln(\mathcal{Y}). \tag{11.4}$$

The expression intensity t is unique for each gene. It determines the gene’s rate of being expressed in a given condition or situation. The random variable T that describes the distribution of expression intensities is instrumental in shaping the observed gene expression distribution, especially the tail at high expression levels. In the basic model, we assume that T ∼ Exp(λ), although other distributions may be considered (see Sect. 10.4).

The conditional distribution in Eq. (11.3) describes the technical variation as approximated by the normal distribution, while Eq. (11.1) describes the biological variation. More than that, Eq. (11.1) gives, de facto, the form of the underlying distribution of the logarithmic gene expression. Hence, Y | T = t ∼ LogN(α + t, γt) describes the distribution of the concentration of mRNA, corresponding to the gamma-distributed random variable Yt in Eq. (1.2). However, in contrast to what was assumed in Sect. 1.2.1, the distribution of Y | T = t is derived from theoretical arguments regarding stochastic molecular growth. In the basic model, the observed gene expression, conditioned on the expression intensity, is thus given by the “fuzzy” lognormal distribution, which is the counterpart to the negative binomial distribution in Sects. 1.2.1 and 9.1.

The statistical model is calibrated to an experimental data set of E. coli. Estimation of λ and the location parameter α is performed by applying the method of moments to X, while the variance-scaling parameter γ is estimated by fitting the squared coefficient of variation CV² := ω²(η)/η² to the data using ordinary least-squares minimisation.
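For orientation, the first two moments of the latent log-expression X follow directly from Eq. (11.1) together with the laws of total expectation and total variance. The lines below are a sketch for X only, assuming that λ is the rate of the exponential distribution. They ignore the technical noise of Eq. (11.3) and are not necessarily the exact estimating equations used in the calibration.

```latex
\begin{align*}
\mathrm{E}[X] &= \mathrm{E}\bigl[\mathrm{E}[X \mid T]\bigr]
              = \alpha + \mathrm{E}[T]
              = \alpha + \frac{1}{\lambda}, \\
\mathrm{Var}(X) &= \mathrm{E}\bigl[\mathrm{Var}(X \mid T)\bigr]
                 + \mathrm{Var}\bigl(\mathrm{E}[X \mid T]\bigr)
                = \gamma\,\mathrm{E}[T] + \mathrm{Var}(T)
                = \frac{\gamma}{\lambda} + \frac{1}{\lambda^{2}}.
\end{align*}
```

Under these assumptions, and for small γ, the variance is dominated by the term 1/λ², so that the sample standard deviation of the log-expression gives a first guess of 1/λ and the sample mean minus that value gives a first guess of α.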

A theory for the conditional log-fold change $L_0 \mid T = t$ is also developed (Sect. 8) and an analytical approximation of the variance of $L_0 \mid T = t$ is derived. It is found that

$$\sigma^2_{L_0 \mid T} := \mathrm{Var}(L_0 \mid T = t) \approx 2 \cdot \left( a(\gamma/m) \cdot \eta^{\,b(\gamma/m)} + \frac{1/m}{\eta - 1} \right), \tag{11.5}$$

where η is the expected gene expression and m is the number of replicates, while the parameters a and b are given by Eqs. (8.16) and (8.15), respectively. By making the assumption that both the variance-scaling parameters γ and δ scale linearly with the reciprocal of m, the analytical approximation in Eq. (11.5) does a remarkably good job of predicting the variance over the entire relevant regime in η (see Fig. 8.2). Even at low expression levels, the log-fold change is approximately normally distributed s.t. $L_0 \mid T = t \approx N\bigl(0, \sigma^2_{L_0 \mid T}\bigr)$, wherefore an approximate p-value is given by the expression

$$p_i \approx 2 \cdot \min\left\{ \Phi\!\left( \frac{\Delta_i}{\sigma_{L_0 \mid T}(\eta_i)} \right),\; 1 - \Phi\!\left( \frac{\Delta_i}{\sigma_{L_0 \mid T}(\eta_i)} \right) \right\}, \tag{11.6}$$

where ∆i denotes the observed log-fold change of gene i.
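The p-value in Eq. (11.6) is straightforward to evaluate once the standard deviation of the conditional log-fold change is available. The sketch below takes Φ to be the standard normal distribution function and treats the standard deviation as an input, since the expressions for a and b in Eqs. (8.15) and (8.16) are not reproduced in this chapter. The function names and the numerical values are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def two_sided_p_value(delta, sigma):
    """Approximate two-sided p-value for an observed log-fold change.

    delta: observed log-fold change of the gene.
    sigma: standard deviation of the conditional log-fold change at the
           gene's expected expression; supplied by the caller here.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    phi = normal_cdf(delta / sigma)
    return 2.0 * min(phi, 1.0 - phi)

# Hypothetical values, chosen only to show the call.
p_null = two_sided_p_value(delta=0.0, sigma=0.3)   # no observed change
p_large = two_sided_p_value(delta=0.9, sigma=0.3)  # three standard deviations
```

By construction the result is symmetric in the sign of the log-fold change, so up- and down-regulation of the same magnitude receive the same p-value.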

The resulting basic model performs quite well. In particular, the model is able to predict the observed dispersion relation (Fig. 7.2) and the marginal distribution of the log-fold change (Fig. 7.3), at least for two out of three replicate ratios. There is also room for improvement, e.g., regarding the gene expression distribution. In Sect. 10.4, a two-component mixture distribution of the expression intensities is introduced (see Fig. 10.7). Clearly, the break in the tail towards high expression levels is better fitted by the two-component mixture distribution. Moreover, the peak of the mixture distribution is broader than what is predicted by the exponential distribution in the basic model, in agreement with the data. In Chapt. 10, a number of further improvements to the basic model are discussed, including a four-parameter, replicate-specific model and a model which accounts for the correlation between genes.

For accurate hypothesis testing of an observed gene regulation, the conditional distribution $Y \mid T = t$, alternatively $X \mid T = t$, is the key distribution. The calculation of p-values requires a proper modelling of the distribution of the mRNA concentration, and the simple molecular growth model supposed in this study could be developed further. Envisage, e.g., a stochastic logistic model of growth with some carrying capacity $K_t > 0$ for a gene with expression intensity $t$. The stochastic differential equation for a logistic growth model may be written as

$$dY_t(\tau) = r_t Y_t(\tau) \cdot \left( 1 - \frac{Y_t(\tau)}{K_t} \right) d\tau + g\bigl(Y_t(\tau), \tau\bigr) \cdot dW(\tau), \tag{11.7}$$

where $W$ is the Wiener process, $\tau$ denotes time, and $r_t, K_t > 0$. The function $g(\cdot)$ describes the dependency on the uncertainty term. In our case, the distribution of the mRNA concentration at equilibrium is of primary interest, i.e.,

$$Y \mid T = t \overset{D}{=} \lim_{\tau \to \infty} Y_t(\tau). \tag{11.8}$$

Now, depending on the function g(Yt(τ), τ), the density function fY|T(y | t) may take on several different forms. Both mathematical modelling of the biological processes that control the growth and decay of mRNA and statistical modelling of data from designed experiments may reveal clues on the specific form of the function g(·). To this end, one should also consider functions of more than one random variable. In addition, if the technical noise, e.g., would be modelled by the continuous Poisson distribution as given in Eq. (3.17), a more thorough (continuous) model of the gene expression may be attained.
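One way to explore how the choice of g shapes the equilibrium distribution is to simulate Eq. (11.7) numerically. The sketch below uses the Euler-Maruyama scheme with the multiplicative noise g(y, τ) = σy, which is an assumption made purely for illustration, since g is left unspecified here. The parameter values, the step size, and the function name are likewise hypothetical.

```python
import numpy as np

def simulate_logistic_sde(r, K, sigma, y0, dt=0.01, n_steps=5000,
                          n_paths=2000, seed=0):
    """Euler-Maruyama paths of dY = r*Y*(1 - Y/K) dt + g(Y) dW.

    The noise function g(y) = sigma * y is an assumption made for this
    example; other choices of g can be substituted in the update step.
    """
    rng = np.random.default_rng(seed)
    y = np.full(n_paths, float(y0))
    for _ in range(n_steps):
        dw = rng.normal(scale=np.sqrt(dt), size=n_paths)  # Wiener increments
        drift = r * y * (1.0 - y / K)
        y = y + drift * dt + sigma * y * dw
        y = np.maximum(y, 1e-12)  # guard against negative values from discretisation
    return y

# Hypothetical parameter values; the final states approximate the equilibrium.
y_end = simulate_logistic_sde(r=1.0, K=100.0, sigma=0.1, y0=10.0)
```

For this particular g, and provided 2r > σ², solving the stationary Fokker-Planck equation gives a gamma distribution with shape 2r/σ² − 1 and rate 2r/(σ²K), so the empirical distribution of `y_end` can be checked against a known result before other forms of g are tried.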

Finally, we note that even though we have a good parametric model for the conditional log-fold change, the model is of lesser use without accurate and precise estimates of its parameters, cf. the expression for the variance in Eq. (11.5). Although small, the bias of the γ estimator is non-negligible. A consistent and more efficient estimator of γ may be found by deploying a non-linear errors-in-variables model which allows for heteroscedastic errors (cf. [35]).


References

1. Sagan, C. (1994), Pale Blue Dot: A Vision of the Human Future in Space (1st ed.). New York: Random House. ISBN 0-679-43841-6

2. https://en.wikipedia.org/wiki/Gene_expression

3. Strachan, T. and Read, A. (2011), Publ. Garland Science, Taylor & Francis Group LLC, New York. ISBN: 978-0-815-34149-9

4. https://en.wikipedia.org/wiki/Transcription_(genetics)

5. https://en.wikipedia.org/wiki/Translation_(biology)

6. https://en.wikipedia.org/wiki/Regulation_of_gene_expression

7. https://en.wikipedia.org/wiki/Cellular_differentiation

8. Jacob, F. and Monod, J. (1962), On the regulation of gene activity. Cold Spring Harbor Symposia on Quantitative Biology 26, 193-211

9. Jacob, F., Perrin, D., Sánchez, C., and Monod, J. (1960), L’opéron : groupe de gènes à expression coordonnée par un opérateur [Operon: a group of genes with the expression coordinated by an operator]. Comptes rendus hebdomadaires des séances de l’Académie des sciences 250 (6), 1727-1729. ISSN 0001-4036

10. https://en.wikipedia.org/wiki/Translational_regulation

11. https://en.wikipedia.org/wiki/RNA-Seq

12. Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, pp. R106

13. Robinson, M. D., McCarthy, D. J., and Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26

14. Berghoff, B. A., Karlsson, T., Kallman, T., Wagner, E. G., and Grabherr, M. G. (2016), Non-linear correction of RNA-seq data enables accurate expression quantification during the bacterial response to DNA damage. Submitted to PLoS Computational Biology

15. Griva, I., Nash, S., and Sofer, A. (2009). Linear and nonlinear optimization. (2nd ed.). Philadelphia: Society for Industrial and Applied Mathematics

16. Steindl, J. Random processes and the growth of firms: A study of the Pareto law. London: Griffin, 1965

17. Sutton, J. (1997). Gibrat’s legacy, Journal of Economic Literature, vol. 35, no. 1, pp. 40-59


18. Gibrat, R. (1931). Les inégalités économiques; applications: aux inégalités des richesses, à la concentration des entreprises, aux populations des villes, aux statistiques des familles, etc., d’une loi nouvelle, la loi de l’effet proportionnel. Paris: Librairie du Recueil Sirey

19. Kozubowski, T. J. and Podgorski, K. (2000). A Multivariate and Asymmetric Generalization of Laplace Distribution. Computational Statistics 15, 531

20. Reed, W. J. and Jorgensen, M. (2004). The double Pareto-lognormal distribution – A new parametric model for size distribution. Com. Stats - Theory & Methods, 33, No. 8, 1733-1753

21. Ilienko, A. (2013). Continuous Counterparts of Poisson and Binomial Distributions and their Properties. Annales Univ. Sci. Budapest., Sect. Comp. 39, 137-147

22. Hill, B. M. (1975). A simple general approach to inference about the tail of a distribution. The Annals of Statistics, 3, No. 5, 1163-1174

23. Pickands III, J. (1975). Statistical Inference Using Extreme Order Statistics. The Annals of Statistics 3, No. 1, 119-131

24. Hinkley, D. V. (1969). On the Ratio of Two Correlated Normal Random Variables. Biometrika 56, No. 3, 635-639 (JSTOR 2334671)

25. Alm, S. E. and Britton, T. (2008). Stokastik – Sannolikhetsteori och statistikteori med tillämpningar, Liber AB, ISBN 978-91-47-05351-3

26. Abdi, H. (2007). The Bonferroni and Šidák Corrections for Multiple Comparisons, http://www.utdallas.edu/~herve/Abdi-Bonferroni2007-pretty.pdf, In: Encyclopedia of Measurement and Statistics (2007), Salkind, N. (Ed.). Thousand Oaks (CA): Sage

27. Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, No. 2, 65-70 (JSTOR 4615733)

28. Hilbe, J. M. (2011). Negative Binomial Regression (Second ed.). Cambridge, UK: Cambridge University Press, ISBN 978-0-521-19815-8

29. Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43, No. 7, e47

30. Law, C. W., Chen, Y., Shi, W., and Smyth, G. K. (2014). Genome Biology 15, R29

31. Blattner, F. R., Plunkett, G., Bloch, C. A., Perna, N. T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J. D., Rode, C. K., Mayhew, G. F., Gregor, J., Davis, N. W., Kirkpatrick, H. A., Goeden, M. A., Rose, D. J., Mau, B., and Shao, Y. (1997). The complete genome sequence of Escherichia coli K-12. Science, 277, 5331

32. Lim, H. N., Lee, Y., and Hussein, R. (2011). Fundamental relationship between operon organization and gene expression. Proc. Natl. Acad. Sci. (USA) 108, 10626-10631

33. Liang, L. W., Hussein, R., Block, D. H. S., and Lim, H. N. (2013). Minimal Effect of Gene Clustering on Expression in Escherichia coli. Genetics, 193, No. 2, 453-465

34. Thieffry, D., Salgado, H., Huerta, A. M., and Collado-Vides, J. (1998). Prediction of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12. Bioinformatics, 14, 391-400

35. Matei, B. (2001). Heteroscedastic Errors-In-Variables Models in Computer Vision. PhD thesis, Department of Electrical and Computer Engineering, Rutgers University. Available at: http://www.caip.rutgers.edu/riul/research/theses.html
