
Individual Project in Computational Biology

Applying Hidden Markov Models to RNA-seq data

Monica Golumbeanu, Student number: 880703 - T300

[email protected]

January 2, 2013

Abstract

Enterococcus faecalis is one of the most controversial commensal bacteria of the human intestinal flora and is also responsible for lethal nosocomial infections. Determining the factors that influence its pathogenicity is at present a great challenge. Cutting-edge approaches analyze the E. faecalis bacterium through next-generation RNA-sequencing technology. Since next-generation sequencing is recent and yields a large amount of data, there is a continuous need for appropriate statistical methods to interpret its output. We propose an approach based on hidden Markov models to explore RNA-seq data and show an example of how this statistical tool can be applied to detect transcription start sites. We compare this application with a previously developed method based on signal processing.

1 Introduction

Hidden Markov models (HMMs) have been widely used for analyzing sequencing data. Their applications are diverse and cover research topics such as gene identification, alignment, non-coding RNA detection and protein secondary structure prediction [13]. In the present study, we explore the use of two different HMMs for determining transcripts and transcription levels, respectively, starting from RNA sequencing (RNA-seq) data of the Enterococcus faecalis bacterium.

Enterococcus faecalis is a commensal bacterium which commonly populates the human gastrointestinal tract and can produce lethal infections. Its toughness has ensured its survival in the prophylactic hospital environment, its defense against the host's immune system, and a strong resistance to antimicrobial agents [7]. Understanding how the E. faecalis bacterium operates and which factors influence its pathogenicity is thus a great challenge.

Applying HMMs to detect and analyze the different transcriptionally active elements within the transcriptome of E. faecalis would provide important information about the genes shaping the bacterium's behavior and contribute to establishing new treatments. It would be important, for example, to obtain a complete picture of the most transcriptionally active elements and their relation to the phenotypic traits of the bacterium (e.g. resistance to antibiotics). Knowing these elements would contribute to better defining the targets of novel drugs and therapies.


[Figure 1 shows the experimental workflow: cultivation of E. faecalis, RNA extraction, splitting into an rRNA-depleted (Treated) and an untreated (Crude) sample, library preparation with either the 'home made' kit (FRAG) or barcoding (BC), and sequencing of the resulting libraries.]

Figure 1: Description of the workflow used to obtain the different RNA-seq datasets from a culture of E. faecalis v583.

2 Materials and Methods

2.1 Data preparation and extraction

One colony of E. faecalis v583 was grown in a static, anaerobic Brain Heart Infusion environment at 37 °C (figure 1). Afterwards, the RNA was extracted from the cells. Half of the sample was cleared of ribosomal RNA (Treated sample) while the other half was left intact (Crude sample). The DNA Sequencing and Genomics Laboratory of the Institute of Biotechnology at the University of Helsinki, Finland, performed the sequencing of the two samples on the SOLiD sequencing platform [11]. Two different fragment libraries were used during the sequencing procedure: one prepared with the Applied Biosystems kit (BC) and the other with the laboratory's own kit (FRAG).

Even though there are different sequencing platforms (e.g. SOLiD [11], Illumina [2]), the RNA-seq procedure relies on the same principles (figure 2). The method consists of extracting the different transcripts and converting them into double-stranded DNA (dsDNA). The dsDNA is then sheared into small fragments and adapters are ligated to them, forming a sequencing library. The first 50 nucleotides of each fragment are sequenced, and in this way a large set of reads is obtained.

In our study, we have used the Crude BC dataset, in which the ribosomal RNA was not removed. The reads obtained from the RNA-seq experiment were aligned to the genome using the Bowtie software [5]. For every position on the genome, the number of reads mapped there was counted; the resulting signal was coined the "coverage depth".
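As an illustration of this step, the coverage depth can be computed in a few lines once the alignments are reduced to read start positions; the sketch below is a minimal Python version under that assumption (it is not the actual pipeline code).

```python
import numpy as np

def coverage_depth(read_starts, read_len, genome_len):
    """Coverage depth signal: for every position on the genome, count
    how many mapped reads overlap it.

    read_starts : 0-based start positions of the aligned reads
    read_len    : read length (the first 50 nt are sequenced here)
    genome_len  : length of the reference genome
    """
    depth = np.zeros(genome_len, dtype=int)
    for s in read_starts:
        depth[s:s + read_len] += 1  # each read covers read_len positions
    return depth

# toy example: three reads, two of them overlapping
print(coverage_depth([0, 10, 12], read_len=50, genome_len=100))
```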

2.2 The general HMM framework

2.2.1 General definition

An HMM is a directed acyclic graph consisting of a discrete-time, discrete-state Markov chain of hidden states, $Y = \{y_t\}_{t=1}^{T}$, and a set of observed states, $X = \{x_t\}_{t=1}^{T}$, constituting an observation model $p(x_t \mid y_t)$ [8]. Unlike the hidden states, which are discrete, the observations in an HMM can be either continuous or discrete. In the case of discrete variables, the observation model is represented by an observation matrix, also named the emission matrix.

The following parameters are needed to define an HMM with discrete observed states:

• The number of hidden states: M

• The number of observed states: N


[Figure 2 illustrates the RNA-seq steps on example sequences: extraction of RNAs, conversion into dsDNA and shearing, adapter ligation and amplification, immobilization and sequencing.]

Figure 2: Workflow of a standard RNA-seq experiment.

• The initial distribution of the hidden states:
$$\Pi = \{\pi_i\}_{1\leq i\leq M} = \{p(y_1 = i)\}_{1\leq i\leq M} \quad (1)$$

• The transition probabilities between two hidden states, $1 \leq i, j \leq M$:
$$a_{ij} = P(y_t = j \mid y_{t-1} = i) \quad (2)$$
These probabilities are contained within the transition matrix $a = (a_{ij})_{1\leq i,j\leq M}$, where $a_{ij}$ represents the probability of transitioning from hidden state $i$ to hidden state $j$ (figure 3).

• The emission probabilities:
$$e_{ij} = P(x_t = j \mid y_t = i) \quad (3)$$
where $1 \leq i \leq M$ and $1 \leq j \leq N$. These probabilities are contained within the emission matrix $e = (e_{ij})_{1\leq i\leq M,\,1\leq j\leq N}$, where $e_{ij}$ represents the probability of emitting the observed state $j$ from the hidden state $i$ (figure 3).

Figure 3 displays an example of a realization of an HMM and is useful for understanding the structure of an HMM as well as its inner dependencies. Since the HMM is a directed acyclic graph, it is straightforward to write the corresponding joint probability distribution:

$$\begin{aligned}
p(y_1, y_2, \dots, y_T, x_1, x_2, \dots, x_T) &= p(y_1, \dots, y_T)\,p(x_1, \dots, x_T \mid y_1, \dots, y_T) \\
&= \left[p(y_1)\prod_{t=2}^{T} p(y_t \mid y_{t-1})\right]\left[\prod_{t=1}^{T} p(x_t \mid y_t)\right] \\
&= \pi_{y_1}\prod_{t=2}^{T} a_{y_{t-1}y_t}\prod_{t=1}^{T} e_{y_t x_t}
\end{aligned} \quad (4)$$
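To make the factorization concrete, the sketch below evaluates equation 4 in log space for a given hidden path; the function and the array layout ($a[j, i] = a_{ji}$, $e[i, j] = e_{ij}$) are our own illustrative conventions, not from the study's implementation.

```python
import numpy as np

def joint_log_prob(y, x, pi, a, e):
    """Joint log-probability of a hidden path y and an observation
    sequence x under equation 4:
    log pi_{y_1} + sum_t log a_{y_{t-1} y_t} + sum_t log e_{y_t x_t}."""
    lp = np.log(pi[y[0]]) + np.log(e[y[0], x[0]])
    for t in range(1, len(y)):
        lp += np.log(a[y[t - 1], y[t]]) + np.log(e[y[t], x[t]])
    return lp
```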


[Figure 3 shows the graphical model: hidden states $y_1, \dots, y_T$ connected in a chain by transition probabilities $a_{y_{t-1}y_t}$, each emitting an observed state $x_t$ with probability $e_{y_t x_t}$.]

Figure 3: Example of a realization of an HMM with hidden ($y_t$, $t \in \{1,\dots,T\}$) and observed ($x_t$, $t \in \{1,\dots,T\}$) states. The arrows represent the transition ($a_{y_{t-1}y_t}$) and emission ($e_{y_t x_t}$) probabilities.

2.2.2 Inference in HMMs - the Baum-Welch algorithm

HMMs have the advantage of capturing long-range dependencies between the observed variables through the latent variables. The general HMM inference problem seeks to estimate the hidden states from the observed variables, in other words, to calculate $p(y_t \mid x_1, x_2, \dots, x_t)$ in an online learning process or $p(y_t \mid x_1, x_2, \dots, x_T)$ in an offline setting.

An online setting is a process where we observe the states sequentially and want to compute the hidden variables on the spot. In contrast, an offline setting assumes that we observe all the variables and look for the most likely hidden states behind the observations, i.e. $p(y_t \mid x_1, x_2, \dots, x_T)$. We will focus on the offline learning setting since we will model our problem accordingly. Therefore, the goal will be to compute the probabilities of the hidden states given all the observations. Then, for each observation, we will choose the hidden state with the highest probability. The key to solving this inference problem is the fact that we can separate the Markov chain into two parts, the past and the future:

$$\alpha_i(t) = p(x_{1:t}, y_t = i), \qquad \beta_i(t) = p(x_{t+1}, \dots, x_T \mid y_t = i) \quad (5)$$

We have used the notation $x_{1:t} = \{x_1, x_2, \dots, x_t\}$. The $\alpha_i(t)$ are also called forward coefficients, while the $\beta_i(t)$ are also called backward coefficients. We can write these coefficients as functions of the HMM parameters by using d-separation [10] and the conditional independence properties of the graphical model (figure 3), as follows:

$$\begin{aligned}
\alpha_i(t) &= p(x_{1:t}, y_t = i) = \sum_{j=1}^{M} p(x_{1:t}, y_t = i, y_{t-1} = j) = \sum_{j=1}^{M} p(x_{1:t}, y_t = i \mid y_{t-1} = j)\,p(y_{t-1} = j) \\
&= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\,p(x_{1:t-1}, y_{t-1} = j, y_t = i) \\
&= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\,p(y_t = i \mid x_{1:t-1}, y_{t-1} = j)\,p(y_{t-1} = j, x_{1:t-1}) \\
&= \sum_{j=1}^{M} p(x_t \mid y_t = i)\,p(y_t = i \mid y_{t-1} = j)\,\alpha_j(t-1) = \sum_{j=1}^{M} e_{i x_t}\,a_{ji}\,\alpha_j(t-1)
\end{aligned} \quad (6)$$


$$\begin{aligned}
\beta_i(t) &= p(x_{t+1}, x_{t+2}, \dots, x_T \mid y_t = i) = \sum_{j=1}^{M} p(x_{t+1:T}, y_{t+1} = j \mid y_t = i) \\
&= \sum_{j=1}^{M} p(x_{t+2:T} \mid y_{t+1} = j, y_t = i, x_{t+1})\,p(x_{t+1} \mid y_{t+1} = j, y_t = i)\,p(y_{t+1} = j \mid y_t = i) \\
&= \sum_{j=1}^{M} \beta_j(t+1)\,e_{j x_{t+1}}\,a_{ij}
\end{aligned} \quad (7)$$

Above we have found the main recurrence relations for the forward and backward coefficients. The following start conditions allow the dynamic computation of all the remaining coefficients; this procedure is known as the forward-backward algorithm [14]:

$$\alpha_i(1) = p(x_1, y_1 = i) = \pi_i\,e_{i x_1}, \qquad \beta_i(T) = p(x_{T+1:T} \mid y_T = i) = p(\emptyset \mid y_T = i) = 1 \quad (8)$$
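The recursions 6-8 translate directly into code. The following Python sketch is a minimal, unscaled implementation (our own illustration, not the study's code); for genome-length chains the coefficients underflow, so a scaled or log-space variant would be used in practice.

```python
import numpy as np

def forward_backward(x, pi, a, e):
    """Forward-backward recursions (equations 6-8).

    x  : observed sequence, integer-coded (length T)
    pi : initial distribution over the M hidden states
    a  : M x M transition matrix, a[j, i] = p(y_t = i | y_{t-1} = j)
    e  : M x N emission matrix,  e[i, k] = p(x_t = k | y_t = i)

    Returns alpha, beta and the posteriors gamma_i(t) of equation 12.
    """
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))          # beta_i(T) = 1 (equation 8)
    alpha[0] = pi * e[:, x[0]]      # alpha_i(1) = pi_i e_{i x_1}
    for t in range(1, T):           # alpha_i(t) = e_{i x_t} sum_j a_{ji} alpha_j(t-1)
        alpha[t] = e[:, x[t]] * (alpha[t - 1] @ a)
    for t in range(T - 2, -1, -1):  # beta_i(t) = sum_j a_{ij} e_{j x_{t+1}} beta_j(t+1)
        beta[t] = a @ (e[:, x[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # equation 12
    return alpha, beta, gamma
```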

In order to calculate $p(y_t \mid x_1, x_2, \dots, x_T)$, the maximum likelihood parameters of the model need to be calculated: $\theta = \{\pi, a, e\}$. The most common method for finding these parameters is the EM (Expectation-Maximization) algorithm for HMMs, also known as the Baum-Welch algorithm [12]. The Baum-Welch algorithm iteratively alternates between two main steps. It starts with random values for the parameters; in the first step (E-step), it computes the expected complete data log likelihood as a function of the new parameters with respect to the old parameters. In the second step (M-step), it updates the parameters such that the updated values maximize the expected likelihood calculated in the E-step.

In the E-step, the expected complete data log likelihood of the new parameters $\theta$ is calculated with respect to the old parameters $\theta^{old}$:

$$\begin{aligned}
Q(\theta,\theta^{old}) &= E_{p(Y|X,\theta^{old})}\left[\log p(X,Y \mid \theta)\right] = E_{p(Y|X,\theta^{old})}\left[\log \pi_{y_1}\prod_{t=2}^{T} a_{y_{t-1}y_t}\prod_{t=1}^{T} e_{y_t x_t}\right] \\
&= E_{p(Y|X,\theta^{old})}\left[\log \prod_{k=1}^{M}\pi_k^{I(y_1=k)}\right] + E_{p(Y|X,\theta^{old})}\left[\log \prod_{t=2}^{T}\prod_{i=1}^{M}\prod_{j=1}^{M} a_{ji}^{I(y_t=i,\,y_{t-1}=j)}\right] \\
&\quad + E_{p(Y|X,\theta^{old})}\left[\log \prod_{t=1}^{T}\prod_{i=1}^{M}\prod_{j=1}^{N} e_{ij}^{I(x_t=j,\,y_t=i)}\right] \\
&= \sum_{k=1}^{M} E_{p(Y|X,\theta^{old})}\left[I(y_1=k)\log \pi_k\right] + \sum_{t=2}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M} E_{p(Y|X,\theta^{old})}\left[I(y_t=i,\,y_{t-1}=j)\log a_{ji}\right] \\
&\quad + \sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{j=1}^{N} E_{p(Y|X,\theta^{old})}\left[I(x_t=j,\,y_t=i)\log e_{ij}\right] \\
&= \sum_{k=1}^{M} p(y_1=k \mid x_{1:T},\theta^{old})\log \pi_k + \sum_{t=2}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M} p(y_t=i,\,y_{t-1}=j \mid x_{1:T},\theta^{old})\log a_{ji} \\
&\quad + \sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{j=1}^{N} p(y_t=i \mid x_{1:T},\theta^{old})\,I(x_t=j)\log e_{ij}
\end{aligned} \quad (9)$$


Using the notations:
$$\gamma_i(t) = p(y_t = i \mid x_{1:T},\theta^{old}), \qquad \psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T},\theta^{old}) \quad (10)$$

the expression of the expected log likelihood becomes:
$$Q(\theta,\theta^{old}) = \sum_{k=1}^{M}\gamma_k(1)\log\pi_k + \sum_{t=2}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M}\psi_{ji}(t)\log a_{ji} + \sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{j=1}^{N}\gamma_i(t)\,I(x_t=j)\log e_{ij} \quad (11)$$
Note that $\gamma_i(t)$ and $\psi_{ji}(t)$ both depend only on the old parameters:

$$\begin{aligned}
\gamma_i(t) &= p(y_t = i \mid x_{1:T},\theta^{old}) = p(y_t = i \mid x_{1:t}, x_{t+1:T},\theta^{old}) \\
&= \frac{p(x_{t+1:T} \mid y_t = i, x_{1:t},\theta^{old})\,p(y_t = i \mid x_{1:t},\theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_{1:t},\theta^{old})\,p(y_t = j \mid x_{1:t},\theta^{old})} \\
&= \frac{p(x_{t+1:T} \mid y_t = i,\theta^{old})\,p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j,\theta^{old})\,p(y_t = j, x_{1:t} \mid \theta^{old})} = \frac{\beta_i^{old}(t)\,\alpha_i^{old}(t)}{\sum_{j=1}^{M}\beta_j^{old}(t)\,\alpha_j^{old}(t)}
\end{aligned} \quad (12)$$

$$\begin{aligned}
\psi_{ji}(t) &= p(y_t = i, y_{t-1} = j \mid x_{1:T},\theta^{old}) = \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{p(x_{1:T} \mid \theta^{old})} \\
&= \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{\sum_{i'=1}^{M}\sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})} = \frac{p(y_t = i, y_{t-1} = j, x_{1:t-1}, x_t, x_{t+1:T} \mid \theta^{old})}{\sum_{i'=1}^{M}\sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})} \\
&= \frac{p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1}, x_{t+1:T},\theta^{old})\,p(x_{t+1:T} \mid y_t = i, y_{t-1} = j, x_{1:t-1},\theta^{old})}{\sum_{i'=1}^{M}\sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})} \\
&\quad \times \frac{p(y_t = i \mid y_{t-1} = j, x_{1:t-1},\theta^{old})\,p(y_{t-1} = j, x_{1:t-1} \mid \theta^{old})}{1} \\
&= \frac{e_{i x_t}^{old}\,\beta_i^{old}(t)\,a_{ji}^{old}\,\alpha_j^{old}(t-1)}{\sum_{i'=1}^{M}\sum_{j'=1}^{M} e_{i' x_t}^{old}\,\beta_{i'}^{old}(t)\,a_{j'i'}^{old}\,\alpha_{j'}^{old}(t-1)}
\end{aligned} \quad (13)$$

In the M-step, the new parameters $\theta$ are calculated by maximizing the expected log likelihood from the E-step (equation 11), thus solving:

$$\frac{\partial Q(\theta,\theta^{old})}{\partial \pi_k} = 0, \qquad \frac{\partial Q(\theta,\theta^{old})}{\partial a_{ji}} = 0, \qquad \frac{\partial Q(\theta,\theta^{old})}{\partial e_{ij}} = 0 \quad (14)$$

By solving these equations, we obtain the updated values of the HMM parameters as functions of the old parameters.


To determine $\pi_k$, we use the constraint $\sum_{i=1}^{M}\pi_i = 1$; we can therefore write $Q(\theta,\theta^{old})$ as a function of $\pi_k$ using a Lagrange multiplier:
$$Q(\theta,\theta^{old}) = f(\pi_k) = \sum_{k=1}^{M}\gamma_k(1)\log\pi_k + \lambda\left(\sum_{i=1}^{M}\pi_i - 1\right) + \text{const.} \quad (15)$$
where const. collects the terms that do not depend on $\pi_k$. We then solve the equations:

$$\begin{aligned}
\frac{\partial f(\pi_k)}{\partial \pi_k} = 0 &\Rightarrow \gamma_k(1)\frac{1}{\pi_k} + \lambda = 0 \Rightarrow \pi_k = -\frac{\gamma_k(1)}{\lambda} \\
\frac{\partial f(\pi_k)}{\partial \lambda} = 0 &\Rightarrow \sum_{i=1}^{M}\pi_i = 1 \Rightarrow \sum_{i=1}^{M}-\frac{\gamma_i(1)}{\lambda} = 1 \Rightarrow \lambda = -\sum_{i=1}^{M}\gamma_i(1) \\
&\Rightarrow \pi_k = \frac{\gamma_k(1)}{\sum_{i=1}^{M}\gamma_i(1)}
\end{aligned} \quad (16)$$

Similarly, for $a_{ji}$ the constraint $\sum_{i'=1}^{M} a_{ji'} = 1 \;\forall j$ applies, and we use Lagrange multipliers:
$$Q(\theta,\theta^{old}) = f(a_{ji}) = \sum_{t=2}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M}\psi_{ji}(t)\log a_{ji} + \sum_{j'=1}^{M}\lambda_{j'}\left(\sum_{i'=1}^{M} a_{j'i'} - 1\right) + \text{const.} \quad (17)$$

and therefore we have:
$$\begin{aligned}
\frac{\partial f(a_{ji})}{\partial a_{ji}} = 0 &\Rightarrow \sum_{t=2}^{T}\psi_{ji}(t)\frac{1}{a_{ji}} + \lambda_j = 0 \Rightarrow a_{ji} = -\frac{\sum_{t=2}^{T}\psi_{ji}(t)}{\lambda_j} \\
\frac{\partial f(a_{ji})}{\partial \lambda_j} = 0 &\Rightarrow \sum_{i'=1}^{M} a_{ji'} - 1 = 0 \Rightarrow \lambda_j = -\sum_{t=2}^{T}\sum_{i'=1}^{M}\psi_{ji'}(t) \\
&\Rightarrow a_{ji} = \frac{\sum_{t=2}^{T}\psi_{ji}(t)}{\sum_{t=2}^{T}\sum_{i'=1}^{M}\psi_{ji'}(t)}
\end{aligned} \quad (18)$$
We can simplify this form by using the definitions of $\psi_{ji}(t)$ and $\gamma_i(t)$ from equation 10 and observing the marginalization in the denominator of equation 18:

$$a_{ji} = \frac{\sum_{t=2}^{T} p(y_t = i, y_{t-1} = j \mid x_{1:T},\theta^{old})}{\sum_{t=2}^{T}\sum_{i'=1}^{M} p(y_t = i', y_{t-1} = j \mid x_{1:T},\theta^{old})} = \frac{\sum_{t=2}^{T} p(y_t = i, y_{t-1} = j \mid x_{1:T},\theta^{old})}{\sum_{t=2}^{T} p(y_{t-1} = j \mid x_{1:T},\theta^{old})} \;\Rightarrow\; a_{ji} = \frac{\sum_{t=2}^{T}\psi_{ji}(t)}{\sum_{t=2}^{T}\gamma_j(t-1)} \quad (19)$$

Finally, the $e_{ij}$ are subject to a constraint as well, $\sum_{j'=1}^{N} e_{ij'} = 1 \;\forall i$, and we can write:
$$Q(\theta,\theta^{old}) = f(e_{ij}) = \sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{t:\,x_t=j}\gamma_i(t)\log e_{ij} + \sum_{i'=1}^{M}\lambda_{i'}\left(\sum_{j'=1}^{N} e_{i'j'} - 1\right) + \text{const.} \quad (20)$$

As before:
$$\begin{aligned}
\frac{\partial f(e_{ij})}{\partial e_{ij}} = 0 &\Rightarrow \sum_{t:\,x_t=j}\gamma_i(t)\frac{1}{e_{ij}} + \lambda_i = 0 \Rightarrow e_{ij} = -\frac{\sum_{t:\,x_t=j}\gamma_i(t)}{\lambda_i} \\
\frac{\partial f(e_{ij})}{\partial \lambda_i} = 0 &\Rightarrow \sum_{j'=1}^{N} e_{ij'} - 1 = 0 \Rightarrow \lambda_i = -\sum_{j'=1}^{N}\sum_{t:\,x_t=j'}\gamma_i(t) \\
&\Rightarrow e_{ij} = \frac{\sum_{t:\,x_t=j}\gamma_i(t)}{\sum_{j'=1}^{N}\sum_{t:\,x_t=j'}\gamma_i(t)}
\end{aligned} \quad (21)$$


Note that $\sum_{j'=1}^{N}\sum_{t:\,x_t=j'}\gamma_i(t) = \sum_{t=1}^{T}\gamma_i(t)$, therefore:
$$e_{ij} = \frac{\sum_{t:\,x_t=j}\gamma_i(t)}{\sum_{t=1}^{T}\gamma_i(t)} \quad (22)$$

The computation of the emission probabilities can be improved by looking for emission probabilities of a certain form (e.g. following a Poisson distribution). This technique can be used on the emission probabilities because we know the observed states. Since we do not have any information about the hidden states and we want them to capture as many dependencies as possible, we do not apply this method to the transition probabilities. The distribution for the emission probabilities is chosen according to the quantity to be modeled. We show in the following sections how we chose an appropriate distribution for RNA-seq data.

2.3 Application of HMMs on RNA-seq data

The HMM framework was used to analyze the RNA-seq data obtained from E. faecalis. Two models were designed. In both cases, the observed states are represented by the values of the coverage signal and the hidden states model the level of amplification of the signal (duplicate, triplicate, etc.). The implemented algorithm therefore takes as input the coverage signal for the entire genome and returns the corresponding values of the hidden states, which constitute the predicted signal. The predicted signal indicates which regions are amplified or deleted (figure 4).

[Figure 4 plots coverage (0-50) against position (0-10000) for a simulated signal.]

Figure 4: Example of a generated coverage depth signal (black) and the signal predicted with the HMM (red). A background coverage signal is estimated around the value of 10 and there are regions with average amplification levels of 40, 30 and 20, respectively. A deletion can be observed as well, starting at position 6000. The red signal corresponds to the output of the HMM algorithm, which detects all these regions.

2.3.1 The standard HMM model

The straightforward way to apply an HMM to RNA-seq data is to use the standard framework described in the previous section and pick a distribution to model the emission probabilities of the HMM. The chain has the length of the genome and has the structure shown in figure 3. Based on the literature [1, 6, 3], we have chosen a Poisson distribution to model the probability that a


certain number of reads maps at a certain position in the genome, within a region amplified a certain number of times:
$$e_{ij} = p(j \text{ reads map in a region amplified } i \text{ times}) = p(x_t = j \mid y_t = i) = \text{Poisson}(j, c_i) = \frac{c_i^{\,j}}{j!}\,e^{-c_i} \quad (23)$$

where $c_i$ represents how many times the background coverage depth has been amplified at position $t$. In other words, $c_1$ is the level of the background signal, $c_2$ stands for a duplicate, $c_3$ for a triplicate, etc.
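As an illustration, the Poisson emission matrix of equation 23 can be tabulated once the levels $c_i$ are fixed; the sketch below caps the observed read counts at max_count, which is our own simplification to keep the observed alphabet finite.

```python
import numpy as np
from scipy.stats import poisson

def poisson_emissions(c, max_count):
    """Emission matrix for the standard model (equation 23):
    e[i, j] = p(j reads | region amplified at level i) = Poisson(j, c[i]).

    c[i] is the mean coverage at amplification level i (c[0] = background,
    c[1] = duplicate level, ...); counts run over j = 0, ..., max_count.
    """
    counts = np.arange(max_count + 1)
    return np.array([poisson.pmf(counts, ci) for ci in c])

# illustrative levels matching figure 4: background 10, then 20, 30, 40
e = poisson_emissions(c=[10, 20, 30, 40], max_count=60)
```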

The transition probabilities $a_{ji}$ and the initial distribution of the hidden states $\pi_k$ are calculated by using the previously found relations (equations 19 and 16), while the emission probabilities are found using the maximum likelihood estimate for the parameters $c_i$ of the Poisson distribution.

To find the maximum likelihood estimate for ci, the following equation is solved:

$$\frac{\partial Q(\theta,\theta^{old})}{\partial c_i} = 0 \quad (24)$$

where the expression of the expected log likelihood is given by:

$$\begin{aligned}
Q(\theta,\theta^{old}) &= \underbrace{\sum_{k=1}^{M}\gamma_k(1)\log\pi_k + \sum_{t=2}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M}\psi_{ji}(t)\log a_{ji}}_{\text{const.}} + \sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{j=1}^{N}\gamma_i(t)\,I(x_t=j)\log e_{ij} \\
&= f(c_i) = \text{const.} + \sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{j=1}^{N}\gamma_i(t)\,I(x_t=j)\log\left(\frac{c_i^{\,j}}{j!}\,e^{-c_i}\right) \\
&= \sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{t:\,x_t=j}\left[\gamma_i(t)\,j\log c_i - \gamma_i(t)\,c_i\right] + \text{const.}
\end{aligned} \quad (25)$$

We then have:
$$\frac{\partial f(c_i)}{\partial c_i} = 0 \;\Rightarrow\; \sum_{j=1}^{N}\sum_{t:\,x_t=j}\left[\gamma_i(t)\,j\,\frac{1}{c_i} - \gamma_i(t)\right] = 0 \;\Rightarrow\; c_i = \frac{\sum_{t=1}^{T}\gamma_i(t)\,x_t}{\sum_{t=1}^{T}\gamma_i(t)} \quad (26)$$

Once the $c_i$ parameters are computed, the emission probabilities $e_{ij}$ are obtained with the Poisson formula in equation 23. We have now deduced the update formulas for all the parameters of the HMM.

The complete Baum-Welch algorithm is as follows: an initial value for the parameters is set, then the expected log likelihood of the data is calculated; afterwards, the parameters are updated with respect to the old parameters by using the update rules 16, 19 and 26, and the log likelihood is calculated with the new values. If the likelihood has converged, the iteration is stopped; otherwise, the old parameters receive the values of the updated parameters and the update rules are used to compute new parameters (algorithm 1).

[Figure 5 shows the graphical model of the autoregressive HMM: the hidden chain $y_1, \dots, y_T$ as in figure 3, with additional edges between consecutive observed states, so that each emission probability $e_{y_t x_t x_{t-1}}$ depends on the previous observation.]

Figure 5: Example of a realization of an autoregressive HMM where the observed states are organized in a first-order Markov chain. Note that the emission probabilities are a function of the current observed state, the corresponding hidden state, and the previously observed state.

Algorithm Baum-Welch:
1. Find an initial setting for the parameters: $\theta^0 = \{\pi_k^0, a_{ji}^0, e_{ij}^0\}$.
2. Set $\theta^{old} = \theta^0$.
3. Update $\theta$ with respect to $\theta^{old}$ by using the update rules:
$$\pi_k = \frac{\gamma_k(1)}{\sum_{i=1}^{M}\gamma_i(1)}, \qquad a_{ji} = \frac{\sum_{t=2}^{T}\psi_{ji}(t)}{\sum_{t=2}^{T}\gamma_j(t-1)}, \qquad c_i = \frac{\sum_{t=1}^{T}\gamma_i(t)\,x_t}{\sum_{t=1}^{T}\gamma_i(t)}.$$
4. Set $\theta^{old} = \theta$.
5. Go to 3 unless the expected log likelihood has converged.

Algorithm 1: Training of the HMM with the EM framework within the Baum-Welch algorithm.
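A minimal sketch of this training loop is given below, reusing the forward_backward() and poisson_emissions() routines sketched earlier; the initialization choices and the convergence test are assumptions for illustration, not the study's exact settings, and the unscaled recursions underflow on genome-length chains.

```python
import numpy as np

def baum_welch_poisson(x, M, n_iter=100, tol=1e-6):
    """EM training sketch following Algorithm 1 (update rules 16, 19, 26).

    x : integer coverage signal as a numpy array
    M : number of amplification levels (hidden states)
    """
    N = int(x.max()) + 1
    pi = np.full(M, 1.0 / M)            # uniform initialization (an assumption)
    a = np.full((M, M), 1.0 / M)
    c = x.mean() * np.arange(1, M + 1)  # initial Poisson means, one per level
    old_ll = -np.inf
    for _ in range(n_iter):
        e = poisson_emissions(c, N - 1)
        alpha, beta, gamma = forward_backward(x, pi, a, e)
        # psi[t, j, i] ~ alpha_j(t) a_{ji} e_{i x_{t+1}} beta_i(t+1), equation 13
        psi = (alpha[:-1, :, None] * a[None, :, :]
               * (e[:, x[1:]].T * beta[1:])[:, None, :])
        psi /= psi.sum(axis=(1, 2), keepdims=True)
        pi = gamma[0]                                             # equation 16
        a = psi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # equation 19
        c = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)  # equation 26
        ll = np.log(alpha[-1].sum())    # data log likelihood (unscaled)
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return pi, a, c
```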

2.3.2 The autoregressive HMM model

The standard model assumes that all the inherent dependencies of the observed coverage signal are captured by the hidden variables. However, it has been shown that this assumption may reduce the accuracy of the model [6]. We know that the coverage depth signal has a built-in dependency between consecutive values, since a read spans several positions in the genome. As a result, the current value of the coverage depth signal can be split in the following way:

$$x_t = (\text{number of reads that start at } t) + (\text{number of reads that continue from } t-1) \quad (27)$$

In this way, we can model a dependency between the current and the previous observed state, and we can represent the observed states as a first-order Markov chain (see figure 5). We call this particular model an autoregressive HMM [4]. This change affects the way the parameters of the model are computed, since we have to take into account the dependency between the observed states. To begin with, the emission probabilities gain an additional parameter, the previous observed state:

$$e_{ijk} = p(x_t = j \mid y_t = i, x_{t-1} = k) \quad (28)$$


The complete data likelihood changes accordingly:
$$\begin{aligned}
p(y_1, y_2, \dots, y_T, x_1, x_2, \dots, x_T) &= \left[p(y_1)\,p(x_1 \mid y_1)\prod_{t=2}^{T} p(y_t \mid y_{t-1})\right]\left[\prod_{t=2}^{T} p(x_t \mid y_t, x_{t-1})\right] \\
&= \left(\pi_{y_1}\,\sigma_{y_1 x_1}\prod_{t=2}^{T} a_{y_{t-1}y_t}\right)\left(\prod_{t=2}^{T} e_{y_t x_t x_{t-1}}\right)
\end{aligned} \quad (29)$$
where $\sigma_{y_1 x_1} = p(x_1 \mid y_1)$ is an additional parameter.

Using the conditional independence properties of the graphical model in figure 5, we can quickly show that the forward and backward coefficients keep largely the same expressions as before:
$$\begin{aligned}
\alpha_i(t) &= p(x_{1:t}, y_t = i) = \sum_{j=1}^{M} p(x_{1:t}, y_t = i, y_{t-1} = j) = \sum_{j=1}^{M} p(x_{1:t}, y_t = i \mid y_{t-1} = j)\,p(y_{t-1} = j) \\
&= \sum_{j=1}^{M} p(x_t \mid y_t = i, y_{t-1} = j, x_{1:t-1})\,p(y_t = i \mid x_{1:t-1}, y_{t-1} = j)\,p(y_{t-1} = j, x_{1:t-1}) \\
&= \sum_{j=1}^{M} p(x_t \mid y_t = i, x_{t-1})\,p(y_t = i \mid y_{t-1} = j)\,\alpha_j(t-1) = \sum_{j=1}^{M} e_{i x_t x_{t-1}}\,a_{ji}\,\alpha_j(t-1)
\end{aligned} \quad (30)$$

For the backward coefficients we define:
$$\begin{aligned}
\beta_{ik}(t) &= p(x_{t+1}, x_{t+2}, \dots, x_T \mid y_t = i, x_t = k) \\
&= \sum_{j=1}^{M} p(x_{t+2:T} \mid y_{t+1} = j, y_t = i, x_{t+1}, x_t = k)\,p(x_{t+1} \mid y_{t+1} = j, y_t = i, x_t = k)\,p(y_{t+1} = j \mid y_t = i, x_t = k) \\
&= \sum_{j=1}^{M} \beta_{j x_{t+1}}(t+1)\,e_{j x_{t+1} k}\,a_{ij}
\end{aligned} \quad (31)$$

The following start conditions apply:
$$\alpha_i(1) = p(x_1, y_1 = i) = \pi_i\,\sigma_{i x_1}, \qquad \beta_{ik}(T) = p(x_{T+1:T} \mid y_T = i, x_T = k) = p(\emptyset \mid y_T = i, x_T = k) = 1 \quad (32)$$

We recall that the objective is to determine the hidden states behind each observed state. We apply, as before, the Baum-Welch algorithm, with its E and M steps modified according to the new model. In the E-step, we compute the expected log likelihood with respect to the old parameters $\theta^{old}$:


$$\begin{aligned}
Q(\theta,\theta^{old}) &= E_{p(Y|X,\theta^{old})}\left[\log p(X,Y \mid \theta)\right] = E\left[\log \pi_{y_1}\sigma_{y_1 x_1}\prod_{t=2}^{T} a_{y_{t-1}y_t}\prod_{t=2}^{T} e_{y_t x_t x_{t-1}}\right] \\
&= E\left[\log\prod_{k=1}^{M}\pi_k^{I(y_1=k)}\right] + E\left[\log\prod_{i=1}^{M}\prod_{j=1}^{N}\sigma_{ij}^{I(y_1=i,\,x_1=j)}\right] + E\left[\log\prod_{t=2}^{T}\prod_{i=1}^{M}\prod_{j=1}^{M} a_{ji}^{I(y_t=i,\,y_{t-1}=j)}\right] \\
&\quad + E\left[\log\prod_{t=2}^{T}\prod_{i=1}^{M}\prod_{j=1}^{N}\prod_{k=1}^{N} e_{ijk}^{I(x_t=j,\,y_t=i,\,x_{t-1}=k)}\right] \\
&= \sum_{k=1}^{M} p(y_1=k \mid x_{1:T},\theta^{old})\log\pi_k + \sum_{i=1}^{M}\sum_{j=1}^{N} p(y_1=i \mid x_{1:T},\theta^{old})\,I(x_1=j)\log\sigma_{ij} \\
&\quad + \sum_{t=2}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M} p(y_t=i,\,y_{t-1}=j \mid x_{1:T},\theta^{old})\log a_{ji} \\
&\quad + \sum_{t=2}^{T}\sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{k=1}^{N} p(y_t=i \mid x_{1:T},\theta^{old})\,I(x_t=j,\,x_{t-1}=k)\log e_{ijk}
\end{aligned} \quad (33)$$
(writing $E$ for $E_{p(Y|X,\theta^{old})}$ after the first occurrence)

We proceed in the same way as for the standard model and use the notations:
$$\gamma_i(t) = p(y_t = i \mid x_{1:T},\theta^{old}), \qquad \psi_{ji}(t) = p(y_t = i, y_{t-1} = j \mid x_{1:T},\theta^{old}) \quad (34)$$

The expression of the expected log likelihood becomes:
$$Q(\theta,\theta^{old}) = \sum_{k=1}^{M}\gamma_k(1)\log\pi_k + \sum_{i=1}^{M}\sum_{j=1}^{N}\gamma_i(1)\log\sigma_{ij} + \sum_{t=2}^{T}\sum_{i=1}^{M}\sum_{j=1}^{M}\psi_{ji}(t)\log a_{ji} + \sum_{t=2}^{T}\sum_{i=1}^{M}\gamma_i(t)\log e_{i x_t x_{t-1}} \quad (35)$$

As before, we use the conditional independence properties of the graphical model to express $\gamma_i(t)$ and $\psi_{ji}(t)$ as functions of the old parameters:

$$\begin{aligned}
\gamma_i(t) &= p(y_t = i \mid x_{1:T},\theta^{old}) = p(y_t = i \mid x_{1:t}, x_{t+1:T},\theta^{old}) \\
&= \frac{p(x_{t+1:T} \mid y_t = i, x_{1:t},\theta^{old})\,p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_{1:t},\theta^{old})\,p(y_t = j, x_{1:t} \mid \theta^{old})} \\
&= \frac{p(x_{t+1:T} \mid y_t = i, x_t,\theta^{old})\,p(y_t = i, x_{1:t} \mid \theta^{old})}{\sum_{j=1}^{M} p(x_{t+1:T} \mid y_t = j, x_t,\theta^{old})\,p(y_t = j, x_{1:t} \mid \theta^{old})} = \frac{\beta_{i x_t}^{old}(t)\,\alpha_i^{old}(t)}{\sum_{j=1}^{M}\beta_{j x_t}^{old}(t)\,\alpha_j^{old}(t)}
\end{aligned} \quad (36)$$

$$\begin{aligned}
\psi_{ji}(t) &= p(y_t = i, y_{t-1} = j \mid x_{1:T},\theta^{old}) = \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{p(x_{1:T} \mid \theta^{old})} \\
&= \frac{p(y_t = i, y_{t-1} = j, x_{1:T} \mid \theta^{old})}{\sum_{i'=1}^{M}\sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})} = \frac{p(y_t = i, y_{t-1} = j, x_{1:t-1}, x_t, x_{t+1:T} \mid \theta^{old})}{\sum_{i'=1}^{M}\sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})} \\
&= \frac{p(x_t \mid y_t = i, x_{t-1},\theta^{old})\,p(x_{t+1:T} \mid y_t = i, x_t,\theta^{old})\,p(y_t = i \mid y_{t-1} = j,\theta^{old})\,p(y_{t-1} = j, x_{1:t-1} \mid \theta^{old})}{\sum_{i'=1}^{M}\sum_{j'=1}^{M} p(y_t = i', y_{t-1} = j', x_{1:T} \mid \theta^{old})} \\
&= \frac{e_{i x_t x_{t-1}}^{old}\,\beta_{i x_t}^{old}(t)\,a_{ji}^{old}\,\alpha_j^{old}(t-1)}{\sum_{i'=1}^{M}\sum_{j'=1}^{M} e_{i' x_t x_{t-1}}^{old}\,\beta_{i' x_t}^{old}(t)\,a_{j'i'}^{old}\,\alpha_{j'}^{old}(t-1)}
\end{aligned} \quad (37)$$

In the M-step, the new parameters $\theta$ are calculated by maximizing the expected log likelihood found in the E-step (equation 35). Note that $\pi_k$ and $a_{ji}$ have the same coefficients as in the expression of the expected log likelihood in equation 11; therefore, their update rules remain unchanged:

$$\pi_k = \frac{\gamma_k(1)}{\sum_{i=1}^{M}\gamma_i(1)} \quad (38)$$
$$a_{ji} = \frac{\sum_{t=2}^{T}\psi_{ji}(t)}{\sum_{t=2}^{T}\gamma_j(t-1)} \quad (39)$$

The parameters that remain to be calculated are $e_{ijk}$ and $\sigma_{ij}$. We use the same method as for the other parameters, i.e. setting the partial derivatives to zero:

$$\frac{\partial Q(\theta,\theta^{old})}{\partial \sigma_{ij}} = 0, \qquad \frac{\partial Q(\theta,\theta^{old})}{\partial e_{ijk}} = 0 \quad (40)$$

The parameters $\sigma_{ij}$ are subject to the constraint $\sum_{j'=1}^{N}\sigma_{ij'} = 1$, therefore we use Lagrange multipliers:
$$Q(\theta,\theta^{old}) = f(\sigma_{ij}) = \sum_{i=1}^{M}\sum_{j=1}^{N}\gamma_i(1)\log\sigma_{ij} + \sum_{i'=1}^{M}\lambda_{i'}\left(\sum_{j'=1}^{N}\sigma_{i'j'} - 1\right) + \text{const.} \quad (41)$$


By taking the derivatives with respect to $\sigma_{ij}$ and $\lambda_i$, we get:
$$\begin{aligned}
\frac{\partial f(\sigma_{ij})}{\partial \sigma_{ij}} = 0 &\Rightarrow \gamma_i(1)\frac{1}{\sigma_{ij}} + \lambda_i = 0 \Rightarrow \sigma_{ij} = -\frac{\gamma_i(1)}{\lambda_i} \\
\frac{\partial f(\sigma_{ij})}{\partial \lambda_i} = 0 &\Rightarrow \sum_{j'=1}^{N}\sigma_{ij'} - 1 = 0 \Rightarrow \lambda_i = -\sum_{j'=1}^{N}\gamma_i(1) = -N\gamma_i(1) \\
&\Rightarrow \sigma_{ij} = \frac{1}{N}
\end{aligned} \quad (42)$$

Finally, we compute the $e_{ijk}$, which are subject to the constraint $\sum_{j'=1}^{N} e_{ij'k} = 1$, so we can write:
$$Q(\theta,\theta^{old}) = f(e_{ijk}) = \sum_{t=2}^{T}\sum_{i=1}^{M}\gamma_i(t)\log e_{i x_t x_{t-1}} + \sum_{i'=1}^{M}\sum_{k'=1}^{N}\lambda_{i'k'}\left(\sum_{j'=1}^{N} e_{i'j'k'} - 1\right) + \text{const.} \quad (43)$$

As before, we take the derivatives with respect to $e_{ijk}$ and $\lambda_{ik}$:
$$\begin{aligned}
\frac{\partial f(e_{ijk})}{\partial e_{ijk}} = 0 &\Rightarrow \sum_{t:\,x_t=j,\,x_{t-1}=k}\gamma_i(t)\frac{1}{e_{ijk}} + \lambda_{ik} = 0 \Rightarrow e_{ijk} = -\frac{\sum_{t:\,x_t=j,\,x_{t-1}=k}\gamma_i(t)}{\lambda_{ik}} \\
\frac{\partial f(e_{ijk})}{\partial \lambda_{ik}} = 0 &\Rightarrow \sum_{j'=1}^{N} e_{ij'k} - 1 = 0 \Rightarrow \lambda_{ik} = -\sum_{j'=1}^{N}\sum_{t:\,x_t=j',\,x_{t-1}=k}\gamma_i(t) \\
&\Rightarrow e_{ijk} = \frac{\sum_{t:\,x_t=j,\,x_{t-1}=k}\gamma_i(t)}{\sum_{j'=1}^{N}\sum_{t:\,x_t=j',\,x_{t-1}=k}\gamma_i(t)}
\end{aligned} \quad (44)$$

We have shown with relation 27 that the number of reads that map at a certain position can be split into two independent quantities: the reads that start at the respective position and the reads that continue from the previous position. We can model these two quantities with two known distributions: a Poisson distribution for the reads that start at the current position, and a Binomial distribution for the reads that continue from the previous position. This allows us to write the emission probabilities in the following manner:

$$\begin{aligned}
e_{ijk} &= p(x_t = j \mid y_t = i, x_{t-1} = k) \\
&= \sum_{r=0}^{j}\underbrace{p(r \text{ reads start at } t)}_{\text{Poisson}(r,\,c_i)} \times \underbrace{p(j-r \text{ reads carry on from } t-1)}_{\text{Binomial}(j-r;\,k,\,p_c)} \\
&= \sum_{r=0}^{j}\frac{c_i^{\,r}}{r!}\,e^{-c_i}\binom{k}{j-r}\,p_c^{\,j-r}\,(1-p_c)^{k-(j-r)}
\end{aligned} \quad (45)$$

where $p_c$ is the probability that a read spans two adjacent positions:
$$p_c = \frac{len - 1}{len} \quad (46)$$

where $len$ is the length of a read. In order to find the maximum likelihood estimate of the parameter $c_i$, a gradient descent library was used in the implemented algorithm. We have thus obtained the update rules for all the parameters of the HMM, so the Baum-Welch algorithm is completely defined.
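For illustration, the emission probability of equations 45-46 can be evaluated directly; the sketch below is our own (the function name and the vector of amplification levels c are illustrative conventions, not the study's code).

```python
import numpy as np
from scipy.stats import poisson, binom

def ar_emission(j, level, k, c, read_len=50):
    """Emission probability e_{ijk} of the autoregressive model
    (equations 45-46): probability of observing j reads at position t,
    given amplification level `level` (Poisson mean c[level]) and
    k reads at position t-1.
    """
    p_c = (read_len - 1) / read_len  # equation 46: a read spans two
                                     # adjacent positions with probability p_c
    r = np.arange(j + 1)             # r reads start at t, j - r continue
    return np.sum(poisson.pmf(r, c[level]) * binom.pmf(j - r, k, p_c))
```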


3 Results and discussion

We have trained and tested the standard HMM as well as the autoregressive HMM on the RNA-seq data of E. faecalis. The two algorithms performed completely differently on this kind of data.

3.1 Performance of the two HMMs

The idea of an autoregressive HMM came from a previous application on DNA-sequencing (DNA-seq) data for detecting copy number variants [9]. When applied to DNA-seq data, the autoregressive model performs the same as the standard HMM. The two models detect the same amplified regions as well as the background level of coverage depth (figure 6).

[Figure 6 comprises three panels plotting coverage depth (CD, 0-60) against position (0-10000): (a) Autoregressive HMM (arHMM), (b) Standard HMM (stHMM), (c) Superposition of the autoregressive and standard HMM.]

Figure 6: Application of the two models on the same stretch of DNA-seq coverage depth signal. The two models perform largely the same and detect the amplified regions of the signal as well as the background level.


Figure 6 displays the result of applying the two HMM models on the same fragment of coverage depth from DNA-seq data. The predictions of the two models are the same, with a few small differences. Note the existence of a few artifacts (very short deletions or amplifications), especially with the standard model. These artifacts may require an additional smoothing operation or a slight modification of the transition probabilities $a_{ji}$.

Unlike on DNA-seq data, when applied to the RNA-seq data of the E. faecalis bacterium, the two models behave completely differently (figure 7). The autoregressive HMM fails to predict the correct amplification regions and returns a signal that is impossible to interpret. In contrast, the standard model performs better and manages to detect continuous regions corresponding to the different transcripts, although accompanied by several artifacts.

[Figure 7 comprises three panels plotted against position on the genome (0-10000): (a) Real coverage depth signal (log coverage), (b) Autoregressive HMM prediction, (c) Standard HMM prediction.]

Figure 7: Application of the two models on the first 10000 elements of the coverage depth signal obtained from RNA-seq data of the E. faecalis bacterium. The performance of the two models is completely different: the standard model clearly outperforms the autoregressive model, which fails to predict any amplification as well as the background signal.

There may be several reasons for the unsatisfactory result with the autoregressive HMM. First of all, the RNA-seq coverage depth signal is different from the DNA-seq coverage signal. By looking at the real coverage signals (in blue) in figures 6 and 7, we can clearly see that the RNA-seq data is far more irregular than the DNA-seq data. Secondly, the distributions used to calculate the emission probabilities for the autoregressive HMM may not be the most appropriate ones. While the combination of Poisson and Binomial may be suitable for DNA-seq data, it may not be appropriate for RNA-seq data, which is subject to different biological phenomena such as transcript degradation. Other distributions such as the Negative Binomial may replace the Poisson component and give better results, or a mixture of several distributions may be more suitable. Finally, the sequencing depth of the RNA-seq data is very low compared to DNA-seq data, which may be a limitation for the autoregressive HMM.


The standard HMM performs well on RNA-seq data; however, it can still be improved, since the predictions contain a certain degree of noise. Further smoothing and averaging may differentiate the amplified regions and eliminate the artifacts. There are many possibilities, and this result opens the door to further analysis and interpretation. In our study, we focused on the detection of the transcripts, since a dedicated method based on a completely different approach had been implemented before.

3.2 Predicting transcription start sites with HMMs

We have applied the standard HMM to our RNA-seq data in order to detect the transcription start sites (TSSs) on the forward strand of the E. faecalis transcriptome. To do so, we constructed a binary version of the signal predicted with the standard HMM (all predictions greater than 0 become 1, and the rest are kept at 0). We compared the standard HMM results with a previously developed method based on coverage density, coined the "box method" (figure 8).

[Figure 8 plots coverage density against position on the genome, with a confidence threshold and a noise threshold marked as horizontal lines.]

Figure 8: Construction of the boxed signal starting from the coverage density. First of all, two thresholds are set: a confidence threshold and a noise threshold. All the signal below the noise threshold is discarded. The signal above the confidence threshold is considered significant. The signal above the noise threshold but below the confidence threshold is kept only if it is adjacent to significant signal; otherwise it is discarded.

The boxed signal construction starts from the coverage density, which is built in the same way as the coverage depth except that every read is mapped to the genome only once and no repeats are considered. Once the coverage density is built, two thresholds are set: a confidence threshold and a noise threshold. All the signal below the noise threshold is discarded. The signal above the confidence threshold is considered significant. The signal above the noise threshold but below the confidence threshold is kept only if it is adjacent to significant signal; otherwise it is discarded. We are left in the end with a thinned signal, which is transformed into a binary signal called the box signal.
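A minimal re-implementation sketch of this rule is given below; the thresholds are inputs, and the run-based retention logic follows the description above (this is our own reading of the method, not the original code).

```python
import numpy as np

def box_signal(density, noise_thr, conf_thr):
    """Binary box signal from the coverage density (figure 8).

    Positions above the confidence threshold are significant; a run of
    positions above the noise threshold is kept only if it contains at
    least one significant position.
    """
    above_noise = density > noise_thr
    significant = density > conf_thr
    box = np.zeros(len(density), dtype=int)
    t = 0
    while t < len(density):
        if above_noise[t]:
            start = t
            while t < len(density) and above_noise[t]:
                t += 1                      # scan to the end of the run
            if significant[start:t].any():  # run touches significant signal
                box[start:t] = 1
        else:
            t += 1
    return box
```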

If we compare the boxed signal with the standard HMM predictions, we observe that the standard HMM is more robust and does not introduce breaks within transcripts (figure 9). This is due to the emission and transition probabilities, which remain stable and are not suddenly changed by small variations in the coverage signal. Figure 9 displays the standard HMM prediction signal and the box signal for the same region of the coverage depth signal, together with the corresponding annotations. We observe that the predictions are in concordance with the annotations and that the box signal presents interruptions, while the standard HMM manages to detect the transcripts without being disturbed by variations in the signal.


[Figure 9 shows, for the same genome region, four aligned tracks: the box signal, the coverage signal, the stHMM prediction, and the annotation.]

Figure 9: Comparison between the box signal and the standard HMM predictions. Unlike the box signal, the standard HMM (stHMM) is more robust and does not introduce breaks within transcripts. Compared with the annotation, the two signals are in concordance.

We used the standard HMM prediction signal to detect TSSs on the forward strand of the genome and compared the results with the TSSs predicted with the box signal (figure 10). For each annotated TSS on the forward strand, we searched within 100 bases upstream and downstream in the signal obtained with the standard HMM as well as in the boxed signal. Out of the 13332 annotated TSSs on the forward strand, 744 were detected with the standard HMM, while the box method detected 702; the standard HMM therefore detected about 6% more TSSs than the box method. All the TSSs predicted with the box method were also detected with the standard HMM.
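The matching procedure can be sketched as follows; the exact rule used to call a detection is not spelled out in the text, so the start-transition criterion below is an assumption.

```python
import numpy as np

def detected_tss(prediction, annotated_tss, window=100):
    """Annotated TSSs recovered by a binary prediction signal.

    A TSS counts as detected if a predicted transcript start (a 0 -> 1
    transition in the binary signal) lies within `window` bases upstream
    or downstream of it.
    """
    starts = np.where(np.diff(prediction.astype(int)) == 1)[0] + 1
    return [tss for tss in annotated_tss
            if np.any(np.abs(starts - tss) <= window)]
```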

[Figure 10 comprises two panels plotting the difference (annotated − predicted) between annotated and predicted transcript start positions, for the stHMM and the box model: (a) histogram of the differences between the predicted TSSs, (b) distribution of the differences between the predicted TSSs.]

Figure 10: Comparison of the TSSs detected with the box signal method versus the TSSs detected with the standard HMM. The standard HMM detects more TSSs than the box method. The two methods are coherent in their results.

This comparison confirms the TSSs predicted with the box method and provides an example of the application of HMMs to RNA-seq data. Another possible use of the standard HMM model would be the analysis of transcription levels. This would be done by looking at the amplified regions in the HMM predictions and discarding the artifacts (very short isolated amplifications and deletions) by various methods such as smoothing. In this way, transcripts which are more amplified than others would be detected.

4 Conclusion

This study has presented an example of the application of hidden Markov models to the analysis of RNA-seq data of the E. faecalis bacterium. Hidden Markov models (HMMs) are directed acyclic graphs consisting of a discrete-time, discrete-state Markov chain of hidden states and a set of observed states constituting an observation model. Transition probabilities define the transitions between the hidden states, while emission probabilities capture the probability of generating the observed states from the hidden states. These probabilities, together with the initial distribution of the hidden states, constitute the parameters defining an HMM. Finding these parameters is the key to all the inference problems in HMMs.

We have applied HMMs to the coverage depth signal in order to detect the transcripts and the amplified regions of the signal. The coverage depth signal is constructed by counting, for each position in the genome, how many of the reads obtained by sequencing map at that position.

Two HMM approaches were adopted. In both approaches, the hidden states represented how many times the signal was amplified compared to a background level, while the observed states represented the values of the coverage signal. The length of the chain in both cases was equal to the size of the genome. The aim of the models was to detect the hidden states. In order to do so, the parameters of the HMM needed to be calculated. Both models employed the basic inference algorithms for HMMs: the forward-backward algorithm and the Baum-Welch algorithm. The forward-backward algorithm calculates the forward and backward coefficients, which correspond to the past and the future in the hidden state chain. These coefficients are afterwards used by the Baum-Welch algorithm to calculate the parameters of the HMM (initial distribution of the hidden states, transition and emission probabilities).

The difference between the two models lay in the way the chain of observed states was represented. In the standard model, the observed states were considered independent, and any dependency was left to be captured by the hidden states. A Poisson distribution was chosen to model the emission probabilities. In the other model, called autoregressive, the observed states were organized in a first-order Markov chain. This implied a change in modeling the emission probabilities: they were modeled by combining a Poisson and a Binomial distribution.

The two models were applied to the coverage depth signal. The standard HMM performed well, showing potential to detect the transcripts as well as the amplified regions. On the other hand, the autoregressive model did not perform as expected, and its predictions could not be interpreted in any meaningful way. One reason could be the highly irregular nature of the coverage depth signal. Interestingly, the autoregressive HMM proved to perform well on DNA-seq coverage depth signals, producing the same results as the standard HMM. The autoregressive HMM is therefore suboptimal for RNA-seq data but appropriate for DNA-seq data.

Since it performed well, the standard HMM was further employed to detect the transcription start sites (TSSs) on the forward strand of the E. faecalis transcriptome. The method was compared with a previously developed method, the box method. The standard HMM predicted about 6% more TSSs than the box method, and all the TSSs predicted with the box method were confirmed with the standard HMM.

In conclusion, this report shows that HMMs are powerful statistical tools for analyzing sequencing data. However, efforts are still needed to model the data more accurately. The presented method constitutes a helpful initial example and a good starting point for future study.

References

[1] Y. Benjamini and T. P. Speed. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, pages 1–14, 2012.

[2] Simon Bennett. Solexa Ltd. Pharmacogenomics, 5(4):433–438, 2004.

[3] S. Ivakhno et al. CNAseg: a novel framework for identification of copy number changes in cancer from second-generation sequencing data. Bioinformatics, 26:3051–3058, 2010.

[4] Vikram Krishnamurthy and George Gang Yin. Recursive algorithms for estimation of hidden Markov models and autoregressive models with Markov regime. IEEE Transactions on Information Theory, 2002.

[5] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, March 2009.

[6] A. Magi et al. Read count approach for DNA copy number variants detection. Bioinformatics, 28(4):470–478, 2012.

[7] S. M. McBride, V. A. Fischetti, D. J. LeBlanc, R. C. Moellering Jr, and M. S. Gilmore. Genetic diversity among Enterococcus faecalis. PLoS One, 2(7):e582, 2007.

[8] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[9] Y. Shen, Y. Gu, and I. Pe'er. A hidden Markov model for copy number variant prediction from whole genome resequencing data. BMC Bioinformatics, 12 Suppl 6, 2011.

[10] Balder ten Cate. On the logic of d-separation. In Dieter Fensel, Fausto Giunchiglia, Deborah L. McGuinness, and Mary-Anne Williams, editors, Proceedings of the Eighth International Conference on Principles of Knowledge Representation and Reasoning (KR-02), Toulouse, France, April 22-25, 2002, pages 568–577. Morgan Kaufmann, 2002.

[11] Anton Valouev, Jeffrey Ichikawa, Thaisan Tonthat, Jeremy Stuart, Swati Ranade, Heather Peckham, Kathy Zeng, Joel A. Malek, Gina Costa, Kevin McKernan, Arend Sidow, Andrew Fire, and Steven M. Johnson. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Research, 18(7):1051–1063, July 2008.

[12] Lloyd R. Welch. Hidden Markov models and the Baum-Welch algorithm. IEEE Information Theory Society Newsletter, 53(4), December 2003.

[13] Byung-Jun Yoon. Hidden Markov models and their applications in biological sequence analysis. Current Genomics, 10(6):402–415, 2009.

[14] Shun-Zheng Yu and H. Kobayashi. Practical implementation of an efficient forward-backward algorithm for an explicit-duration hidden Markov model. IEEE Transactions on Signal Processing, 54(5):1947–1951, October 2006.
