
HMMConverter

A tool-box for hidden Markov models
with two novel, memory efficient
parameter training algorithms

by

TIN YIN LAM

B.Sc., The Chinese University of Hong Kong, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES

(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

October, 2008

© TIN YIN LAM 2008


Abstract

Hidden Markov models (HMMs) are powerful statistical tools for biological sequence analysis. Many recently developed Bioinformatics applications employ variants of HMMs to analyze diverse types of biological data. It is typically fairly easy to design the states and the topological structure of an HMM. However, it can be difficult to estimate parameter values which yield a good prediction performance. As many HMM-based applications employ similar algorithms for generating predictions, it is also time-consuming and error-prone to have to re-implement these algorithms whenever a new HMM-based application is to be designed.

This thesis addresses these challenges by introducing a tool-box, called HMMConverter, which only requires an XML-input file to define an HMM and to use it for sequence decoding and parameter training. The package not only allows for rapid proto-typing of HMM-based applications, but also incorporates several algorithms for sequence decoding and parameter training, including two new, linear memory algorithms for parameter training. Using this software package, even users without programming knowledge can quickly set up sophisticated HMMs and pair-HMMs and use them with efficient algorithms for parameter training and sequence analyses. We use HMMConverter to construct a new comparative gene prediction program, called Annotaid, which can predict pairs of orthologous genes by integrating prior information about each input sequence probabilistically into the gene prediction process and into parameter training. Annotaid can thus be readily adapted to predict orthologous gene pairs in newly sequenced genomes.


Table of Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

1 HMMs in Bioinformatics applications . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Introduction to hidden Markov models . . . . . . . . . . . . . . . . . . . . . 3

1.2.1 Theoretical background . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2.2 Variants of HMMs for Bioinformatics applications . . . . . . . . . . . 8

1.3 Challenges in designing and implementing HMMs . . . . . . . . . . . . . . . 13

1.4 Discussion and conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Decoding and parameter training algorithms for HMMs . . . . . . . . . . 18

2.1 Introduction and motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2 Decoding algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2.1 The Viterbi algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.2 The Hirschberg algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 23


2.3 Parameter training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3.1 The Baum-Welch algorithm . . . . . . . . . . . . . . . . . . . . . . . 28

2.3.2 The Viterbi training algorithm . . . . . . . . . . . . . . . . . . . . . . 31

2.4 Discussion and conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3 HMMConverter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2 Main features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2.1 Input and output files . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2.2 Special features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2.3 Sequence decoding algorithms . . . . . . . . . . . . . . . . . . . . . . 51

3.2.4 Training algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.3 Training using sampled state paths . . . . . . . . . . . . . . . . . . . . . . . 58

3.3.1 Sampling a state path from posterior distribution . . . . . . . . . . . 58

3.3.2 A new, linear memory algorithm for parameter training using sampled state paths . . . . . . . . . . . . 61

3.3.3 A new linear memory algorithm for Viterbi training . . . . . . . . . . 68

3.3.4 Pseudo-counts and pseudo-probabilities . . . . . . . . . . . . . . . . . 72

3.3.5 Training free parameters . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.4 Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.5 Discussion and conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4 Specifying an HMM or pair-HMM with HMMConverter . . . . . . . . . 78

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.2 The XML-input file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.2.1 The <model> component of the XML file . . . . . . . . . . . . . . . 79


4.2.2 The <sequence analysis> component of the XML file . . . . . . . . . 91

4.3 Other text input files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.3.1 The input sequence file . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.3.2 The free emission parameter file . . . . . . . . . . . . . . . . . . . . . 101

4.3.3 The free transition parameter file . . . . . . . . . . . . . . . . . . . . 103

4.3.4 The tube file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.3.5 The prior information file . . . . . . . . . . . . . . . . . . . . . . . . . 105

4.4 The output files of HMMConverter . . . . . . . . . . . . . . . . . . . . . . 107

4.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5 Implementation of HMMConverter . . . . . . . . . . . . . . . . . . . . . . . 112

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

5.2 Converting XML into C++ code . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.2.1 The model parameters class . . . . . . . . . . . . . . . . . . . . . . . 113

5.2.2 Other classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.3 Decoding and training algorithms . . . . . . . . . . . . . . . . . . . . . . . . 120

5.3.1 The Viterbi and Hirschberg algorithms . . . . . . . . . . . . . . . . . 120

5.3.2 The parameter training algorithms . . . . . . . . . . . . . . . . . . . 122

6 Annotaid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6.2 Main features of Annotaid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

6.2.1 The pair-HMM of Annotaid . . . . . . . . . . . . . . . . . . . . . . 126

6.2.2 Banding and sequence decoding algorithms . . . . . . . . . . . . . . . 127

6.2.3 Integrating prior information along the input sequences . . . . . . . . 127

6.2.4 Parameter training in Annotaid . . . . . . . . . . . . . . . . . . . . 130


6.3 Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

6.4 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136


List of Figures

1.1 A Markov model for the CpG-Island process . . . . . . . . . . . . . . . . . . . 5

1.2 An HMM for the CpG-Island . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 A 5-state simple pair-HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4 A simple pair-HMM for predicting related protein coding regions in two input DNA sequences . . . . . . . . . . . . 10

2.1 A simple generalized pair-HMM for predicting related protein coding regions in two DNA sequences . . . . . . . . . . . . 19

2.2 Illustration of the Viterbi algorithm . . . . . . . . . . . . . . . . . . . . . . . 23

2.3 Illustration of the Hirschberg algorithm (1) . . . . . . . . . . . . . . . . . . . 25

2.4 Illustration of the Hirschberg algorithm (2) . . . . . . . . . . . . . . . . . . . 26

3.1 The XML-input file for defining an HMM in HMMConverter . . . . . . . . 37

3.2 Using Blast to generate a tube for HMMConverter . . . . . . . . . . . . 44

3.3 Defining a tube for HMMConverter explicitly . . . . . . . . . . . . . . . . 45

3.4 A state which carries labels from two different annotation label sets . . . . . 47

3.5 Illustration of the special emission feature in HMMConverter . . . . . . . 49

3.6 Using the Hirschberg algorithm inside a tube . . . . . . . . . . . . . . . . . . 53

3.7 Illustration of the forward algorithm . . . . . . . . . . . . . . . . . . . . . . . 63


3.8 Sampling a state path from the posterior distribution . . . . . . . . . . . . . . 63

3.9 Illustration of the novel, linear memory algorithm for parameter training using sampled state paths . . . . . . . . . . . . 69

3.10 A part of the pair-HMM of Doublescan . . . . . . . . . . . . . . . . . . . . 75

4.1 Defining a state of an HMM or pair-HMM in HMMConverter . . . . . . . 83

4.2 A part of the pair-HMM of Doublescan . . . . . . . . . . . . . . . . . . . . 98

4.3 Specifying a tube explicitly in a flat text file . . . . . . . . . . . . . . . . . . . 105

4.4 A simple 5-state pair-HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6.1 Combining the Hirschberg algorithm with a tube . . . . . . . . . . . . . 128

6.2 Special emissions in Projector and Annotaid . . . . . . . . . . . . . . . . 129

6.3 Combining the parameter training with a tube . . . . . . . . . . . . . . . . . 132


Acknowledgements

I would like to thank my supervisor, Prof. Irmtraud Meyer, for her advice, support and encouragement during the course of this project. A special thank you goes to Prof. Anne Condon for agreeing to be the second reader of my thesis. I thank Nick Wiebe, Rodrigo Goya, Chris Thachuk and all other members of the BETA lab, who have made my life on campus enjoyable and have always been helpful and nice.


Chapter 1

HMMs in Bioinformatics applications

1.1 Introduction

Hidden Markov models (HMMs) are flexible and powerful statistical tools for many different kinds of sequence analysis. HMMs were first introduced in a series of papers by Leonard E. Baum in the 1960s, and one of their earliest applications was speech recognition in the mid-1970s. Today, HMM-based methods are widely used in many different fields such as speech recognition and speech synthesis [42, 45], natural language processing [32], hand-writing recognition [27] and Bioinformatics [11].

Among these fields, HMMs are especially popular for sequence analysis in Bioinformatics. There, HMMs are used for sequence annotation such as gene prediction [8, 23], sequence analysis such as transcription factor binding site recognition [17] and peptide identification [22], sequence alignment such as protein alignment [21, 24], and other applications such as protein-protein interaction prediction [41]. Many variations of HMMs have been developed in the past two decades for different Bioinformatics applications. In addition, new algorithms for sequence decoding and parameter training have been introduced.

Although HMMs are a powerful and popular concept, several practical challenges often have to be overcome in practice. It is not easy to design an HMM, and especially to assign values to its parameters. The time and memory requirements for analyzing long input sequences can be prohibitive. Finally, it can be very time-consuming and error-prone to implement an HMM and the corresponding prediction and parameter training algorithms.

This thesis presents an HMM generation tool called HMMConverter, which only requires an XML file and simple text input files for defining an HMM and for using the HMM for predictions and for training its parameters. Users of HMMConverter are not required to have programming skills. The tool provides the most commonly used Viterbi algorithm [47] and the more memory efficient Hirschberg algorithm [16] for sequence decoding. For parameter training, HMMConverter offers not only a new, linear memory algorithm for Baum-Welch training, but also a novel linear memory variation of the Viterbi training algorithm as well as a new and efficient algorithm for training with sampled state paths. Furthermore, HMMConverter supports many special features for constructing various types of HMMs, such as higher order HMMs and pair-HMMs, as well as HMMs which take external information on the input sequences into account for parameter training and generating predictions.

In this chapter, we give some theoretical background on HMMs, and describe variants of HMMs and their applications in Bioinformatics. We also further discuss the challenges of designing and implementing HMMs. Existing sequence decoding and parameter training algorithms for HMMs are introduced in chapter 2. Chapter 3 describes the main features of HMMConverter and includes a detailed explanation of the novel parameter training algorithms. The specification of an HMM within HMMConverter and the input and output formats of the tool are described in chapter 4, and some implementation details of HMMConverter are provided in chapter 5. Finally, we use this tool to implement a novel comparative gene prediction pair-HMM called Annotaid, which can take various types of prior information on each of the two input sequences into account in order to give more accurate predictions. The features and applications of this new model are described in chapter 6.

1.2 Introduction to hidden Markov models

1.2.1 Theoretical background

A hidden Markov model (HMM) is a variation of a Markov model in which the state sequence of the Markov chain is hidden from the observer. We first explain what a Markov chain is and then use a simple biological example to introduce HMMs.

Markov model

A Markov chain is a random process with the Markov property. In brief, a random process is a process in which the next state is determined in a probabilistic way. The Markov property means that, given the current state of the process, the future states are independent of the past states. Let $X_i$ be a random variable for all positive integers $i$, representing the state of the process at time $i$. The sequence $X_1, X_2, \dots, X_i, \dots$ is a Markov chain if

$$P(X_i = x \mid X_{i-1} = x_{i-1}, \dots, X_1 = x_1) = P(X_i = x \mid X_{i-1} = x_{i-1}).$$

A Markov chain is time-homogeneous if the transition probabilities do not depend on the time $i$, but only on the current state:

$$P(X_i = x \mid X_{i-1} = y) = P(X_{i-1} = x \mid X_{i-2} = y).$$

One can also define higher order Markov chains: in a Markov chain of $n$-th order, the next state depends on the current state and on the previous $n-1$ states. Note that when we say Markov chain, we refer to the simplest case, i.e. a first-order Markov chain. A Markov chain can be represented by a finite state machine [11].
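To make this concrete, the following small sketch (illustrative only; the numbers and names are not from the thesis) evaluates the probability of a given state sequence under a first-order, time-homogeneous Markov chain. Every factor depends only on the immediately preceding state, which is exactly the Markov property:

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <string>

    // Probability of a state sequence under a first-order, time-homogeneous
    // Markov chain: P(x_1..x_n) = P(x_1) * prod_i t(x_{i-1} -> x_i).
    int main() {
        // Illustrative two-state chain: '+' = inside, '-' = outside a CpG-Island.
        std::map<char, double> initial = {{'+', 0.1}, {'-', 0.9}};
        std::map<char, std::map<char, double>> t = {
            {'+', {{'+', 0.80}, {'-', 0.20}}},
            {'-', {{'+', 0.05}, {'-', 0.95}}}};

        std::string states = "--++-";
        double p = initial[states[0]];
        for (std::size_t i = 1; i < states.size(); ++i)
            p *= t[states[i - 1]][states[i]];  // depends only on the previous state
        std::cout << "P(state sequence) = " << p << "\n";
    }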

CpG-Island

Now we use the CpG-Island [4] example from the book Biological Sequence Analysis [11] to introduce hidden Markov models. In a genomic DNA sequence, the dinucleotide CG (i.e. a neighboring C and G on the same strand) appears more frequently around the transcription start sites of genes than in other regions of the genome. We call a region which has an increased frequency of CG pairs a CpG-Island. CpG-Islands are usually a few hundred to a few thousand bases long. Suppose we want to identify CpG-Islands in a long genomic sequence. We can attempt this by defining a Markov chain. Each state of the model in figure 1.1 represents a DNA nucleotide inside or outside a CpG-Island. The Markov property holds for this model, as the probability of going from a given state to the next state only depends on the current state, but not on the previous states. Figure 1.1 shows the Markov model of the CpG-Island "process", but this model is lacking something, as we cannot distinguish whether an observed nucleotide is from a CpG-Island (+) or not (-). We can modify this model a little, see figure 1.2. In this new model, each observed nucleotide can be read by two states, one state with label "+" representing a nucleotide from a CpG-Island and one state with label "-" representing a nucleotide outside a CpG-Island. This is a hidden Markov model for the CpG-Island process, where the "hidden feature" is the underlying "+" or "-" state associated with each nucleotide in the observed sequence. The transition probability from the C+ state to the G+ state should be larger than the transition probability from the C- state to the G- state. The start state and the end state of both models are silent states which do not read any character from the input sequence. The model always starts in the start state, before reading any character from the input sequence, and every state path ends in the end state. In the above example, the start state is connected to all other states (except the end state), and all states except the start state are connected to the end state.


Figure 1.1: A Markov model for the CpG-Island process. This is a Markov model for the CpG-Island process, but it cannot solve the CpG-Island annotation problem.

Figure 1.2: An HMM for the CpG-Island. This figure shows an HMM which can model CpG-Islands. The states of the model are all connected to each other, i.e. the model is a complete graph (K8) in which each state also has a transition back to itself.



Now we are ready to give a more formal definition of an HMM. An HMM is unambiguously defined by the following components:

• A set of $N$ states $S = \{0, \dots, N-1\}$, where 0 is the start state and $N-1$ is the end state.

• An alphabet $A$ which contains the symbols of observations of the HMM (e.g. $\{A, C, G, T\}$ is the DNA alphabet for the CpG-Island HMM in the above example).

• A set of transition probabilities $T = \{t_{i,j} \mid i, j \in S\}$, where $t_{i,j} \in [0,1]$ is the transition probability of going from state $i$ in the model to state $j$, and $\sum_{j' \in S} t_{i,j'} = 1$ for all states $i \in S$.

• A set of emission probabilities $E = \{e_i(\gamma) \mid i \in S, \gamma \in A\}$, where $e_i(\gamma) \in [0,1]$ is the emission probability of state $i$ for symbol $\gamma$, and $\sum_{\gamma' \in A} e_i(\gamma') = 1$ for all states $i \in S$.
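To make these four components concrete, the following minimal data structure collects them in one place. This is an illustrative sketch only; HMMConverter's actual C++ classes (described in chapter 5) are organized differently:

    #include <string>
    #include <vector>

    // Minimal container for the four components that define an HMM.
    struct HMM {
        int N;                               // number of states; 0 = start, N-1 = end
        std::string alphabet;                // e.g. "ACGT" for the CpG-Island HMM
        std::vector<std::vector<double>> t;  // t[i][j]: transition probability i -> j;
                                             //   each row sums to 1
        std::vector<std::vector<double>> e;  // e[i][g]: probability that state i reads
                                             //   the g-th alphabet symbol; rows sum to 1
    };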

An HMM as defined above corresponds to a probabilistic model. It can be used in a number of ways to generate predictions. In the following, let

• $M$ denote an HMM,

• $X = (x_1, \dots, x_L)$ an input sequence of length $L$, where $x_i$ is the $i$-th letter of the sequence, and

• $\pi = (\pi_0, \dots, \pi_l)$ a state path of length $l+1$ in the model $M$, where $\pi_k$ is the state at the $k$-th position in the state path; in particular, $\pi_0$ is always the start state and $\pi_l$ is always the end state.

We can use $M$ in the following ways:

• Sequence decoding: For a given input sequence $X$ and model $M$, we can for example calculate the state path with the highest overall probability in the model, i.e. the state path $\pi^*$ which maximizes $P(\pi \mid X, M)$. In that case, model $M$ analyzes the sequence $X$, and we will, in all of the following, say that the states of model $M$ read the symbols of input sequence $X$.

• Parameter training: For a given set of training sequences $X$ and a given model $M$, we can train the set of free parameters $\theta$ of $M$, for example by using training algorithms which maximize $P(X \mid \theta, M)$.

• Generating sequences of observations: As any HMM defines a probability distribution over sequences of observations, we can use any given HMM $M$ to generate such sequences. In that case, we say that the states of model $M$ emit symbols of observations. This way of using an HMM is not employed in this thesis; we will therefore, in all of the following, refer to a state as reading one or more symbols from an input sequence.

The CpG-Island finding problem can be solved with a decoding algorithm, which computes the most probable state path $\pi$ for an input DNA sequence $X$ and model $M$. Before we can use an HMM for decoding, we first have to determine good values for its parameters, which may require parameter training. Algorithms for decoding and parameter training will be discussed in more detail in this chapter and in chapter 2. In the following section, we first describe variants of HMMs that have been developed recently for Bioinformatics applications.


1.2.2 Variants of HMMs for Bioinformatics applications

Pair-HMMs and n-HMMs

A pair-HMM is a simple extension of an HMM that reads two rather than only one sequence at a time. One of the main applications of pair-HMMs is to generate sequence alignments, so pair-HMMs are particularly prevalent in Bioinformatics [39]. For example, a pair-HMM called WABA [21] can be used to identify pairs of protein coding exons in two orthologous input DNA sequences. This pair-HMM generates a global alignment between the two input sequences, unlike TBlastX, which generates local alignments [1]. Pair-HMMs are also used for comparative gene prediction [26, 34, 36]. These pair-HMMs take a pair of homologous DNA sequences as input and predict pairs of homologous gene structures. Comparative gene prediction methods using pair-HMMs have been shown [26, 36] to perform better than non-comparative approaches [8]. Figure 1.3 shows a simple 5-state pair-HMM. In an n-HMM, each state can read letters from up to n input sequences. n-HMMs can be used for multiple sequence alignment, but because of their high memory and time requirements, they are typically used only for short sequences, such as protein sequences.

Generalized HMMs

Generalized HMMs [8, 48] allow states to read more than one letter at a time from an input sequence. In Bioinformatics applications, a state of an HMM often corresponds to a biological entity, e.g. a codon consisting of 3 nucleotides; such a state would read 3 letters from the input sequence at a time. Figure 1.4 shows a simple generalized pair-HMM for finding related protein-coding exons in two input DNA sequences. In particular, any HMM used for gene prediction should be a generalized HMM.


Figure 1.3: A 5-state simple pair-HMM. The Match state reads one character from each of the two input sequences and aligns them according to the emission probability of the combination of the two characters read by the state. The Emit X state reads one character from input sequence X at a time, and the Emit Y state does the same for input sequence Y.


Figure 1.4: A simple pair-HMM for predicting related protein coding regions in two input DNA sequences. There are 3 states which model the protein coding part and 3 states which model the non-coding part of the two input sequences. The states which model the protein coding part read one codon at a time, e.g. the Match Codon state reads 3 nucleotides from each input sequence at a time.


Explicit state duration HMMs

Explicit state duration HMMs extend the concept of HMMs by keeping track of the state's duration, i.e. the "time" that the state path has stayed in the same state is also remembered in the decoding process. In this type of HMM, a set of parameters or functions for modeling the state duration is needed in addition to the set of transition and emission probabilities, and the probability of the next state is conditional on both the current state and the state duration. Genscan [8] and SNAP [25] are two ab initio gene prediction models which are based on an explicit state duration HMM.

Evolutionary HMMs

For many types of biological sequence analysis, it is important to know the evolutionary correlations between the sequences. The authors of [24] describe a method that uses evolutionary information to derive the transition probabilities of the insertion and deletion states of sequence alignment pair-HMMs. Evogene [40] is an evolutionary HMM (EHMM) which incorporates the evolutionary relationship between the input sequences into the sequence decoding process. Evogene consists of an HMM and a set of evolutionary models: each state of the HMM is defined over fixed alignment columns and has a state-specific evolutionary model, and the emission probabilities of each state are calculated using the corresponding evolutionary model for each column in the input alignment. The Felsenstein algorithm [13] is used to calculate the likelihood of an alignment column given an evolutionary input tree and an evolutionary model.


HMMs which take external evidence into account

Because of significant improvements in biological sequencing technology, large amounts of genome-wide data such as cDNAs, ESTs and protein sequences [3, 18, 37, 38] are currently being generated. It is desirable to incorporate these types of additional evidence into computational sequence analysis methods to improve their performance. In recent years, many HMMs have been developed that can incorporate prior information on the input sequences into the prediction process. They are especially popular in gene prediction because of the large amounts of available data. Twinscan [26] is one of the first HMMs that incorporates external information into the gene prediction process. It extends the classical gene prediction HMM Genscan [8] by incorporating matches between the input DNA sequence and several homologous genomic sequences into the predictions. Twinscan first generates a "conserved sequence" from the matches between the homologous DNA sequences and the DNA sequence of interest. This conserved sequence is then used to bias the emission probabilities of the underlying Genscan HMM. Doublescan [34] is a comparative gene prediction pair-HMM which performs ab initio gene prediction and sequence alignment simultaneously by taking two un-annotated and unaligned homologous genomic sequences as input. It incorporates splice site information along each input sequence using the predictions of Stratasplice [29], which generates log-odds scores and posterior probabilities for potential splice sites. These scores are then integrated into the prediction process and greatly improve the prediction accuracy of splice sites. Projector [36] extends Doublescan by using the known genes in one of the two input sequences to predict the gene structures in the other, un-annotated DNA input sequence. It uses so-called special emissions, which integrate the information on the annotated sequence into the prediction process by biasing the emission probabilities. In its published version, Projector is only capable of incorporating prior probabilities with discrete values (i.e. either 0 or 1).

More recently, several gene prediction HMMs have been developed for integrating various types of prior information into the prediction process [7, 15, 19, 44]. These programs (they all employ HMMs, not pair-HMMs) all allow taking prior information with different levels of confidence into account. For example, in Jigsaw [19], a feature vector is used to store prior information on different features from various sources as probabilities. Simple multiplication rules are then used to incorporate the vector into the nominal emission probabilities. ExonHunter [7] calls a piece of external evidence an "advisor"; different advisors are combined into a "superadvisor" by quadratic programming (i.e. minimizing or maximizing a quadratic function of several variables subject to linear constraints on these variables), which is then incorporated into the prediction process. All of these HMMs are capable of integrating various types of prior information with different confidence scores into the prediction process, but they all have some limitations. For example, they cannot deal with contradicting pieces of information, and one state of the model can only incorporate one type of prior information.

1.3 Challenges in designing and implementing HMMs

As discussed previously, HMMs are a widely used, powerful concept in many applications. However, it is often not easy to design and implement an HMM. Various algorithms exist for generating predictions with the model and for training the model's parameters. The main challenges are:

• Long input sequences lead to large memory and time requirements for predictions and parameter training.

• It is difficult to set up transition and emission parameters of the model that yield a good performance.

• It is time-consuming and tricky to implement the model and to test several HMMs against each other ("proto-typing").

Dealing with long input sequences

In Bioinformatics, the length of input sequences for HMMs varies between applications. For example, the average gene size (before splicing) of eukaryotic genes varies from around 1 kb (yeast) to 50 kb (human) or longer. Before discussing the sequence decoding and parameter training algorithms in detail, we first specify the time and memory requirements of these algorithms to give readers a better impression of the challenges. For a pair-HMM with $N$ states, the Viterbi algorithm [47], which is the most widely used algorithm for sequence decoding, requires $O(N^2 L^2)$ time and $O(N L^2)$ memory for analyzing a pair of sequences of length $L$ each. For a gene prediction pair-HMM consisting of 10 states, it would thus require between $10^8$ and $10^{10}$ time units and between $10^7$ and $10^9$ units of memory to analyze typical input sequences which contain only a single gene. These constraints are even more serious for parameter training, as most training algorithms are iterative processes which have to consider many input sequences. The challenges for parameter estimation of HMMs are described below. New, efficient algorithms are needed to make decoding and parameter training practical even for long input sequences.

Parameter training

The set of transition probabilities and emission probabilities constitutes the parameters of an HMM; explicit state duration HMMs and evolutionary HMMs have additional parameters. As the performance of an HMM critically depends on the choice of the parameter values, it is important to find good parameter values. As the states of an HMM closely reflect the biological problem that is being addressed, it is typically fairly easy to design the states of the model. However, it can be difficult to manually derive good parameter values. There are two common ways to define the parameters of an HMM: (1) Users can manually choose the parameters. This is time-consuming, requires a very good understanding of the data, and may not result in values that optimize the performance. (2) Another strategy is to use an automatic parameter training algorithm. This strategy works best if we have a large and diverse set of annotated training sequences. Being able to use automatic parameter training for a comparative gene prediction pair-HMM means that it can be readily trained to analyze different pairs of genomes. However, as we explain now, parameter training is not an easy task.

There are two common types of parameter training strategies for HMMs: maximum likelihood (ML) methods and expectation maximization (EM) methods [10]. The corresponding training algorithms are usually iterative processes whose convergence depends on the set of initial values and the amount of training data. Depending on the training algorithm, the outcome of the training may depend on the initial parameter values and may not necessarily maximize the resulting prediction performance. Pseudo-counts for the parameters may be needed to avoid over-fitting, especially if the training data is sparse or biased.

We have discussed the importance of parameter training for setting up an HMM. Because of the difficulties of implementing training algorithms that can be used efficiently on realistic data sets, parameter training algorithms are not implemented in most HMM applications (i.e. the parameters are set up manually for one particular data set). An exception are some HMMs for gene prediction [25, 30, 46]. This is because (1) the set of parameters is very important for the gene prediction performance, (2) parameters need to be adapted to predict genes in different organisms, and (3) due to the great improvements in biological sequencing techniques, there are large enough sets of known genes to allow parameter training. For gene prediction methods, the main challenge is to implement efficient parameter training algorithms. The existing parameter training algorithms are described in chapter 2.

Proto-typing

Designing an HMM for a new Bioinformatics application is fairly easy given a good understanding of the underlying biological problem, whereas the implementation of several possible HMMs can be a tedious task. Proto-typing several HMMs and implementing the sequence decoding and parameter training algorithms is time-consuming and error-prone. Although there are many variations of HMMs, they all use the same prediction and training algorithms. It is particularly inefficient if the same algorithms have to be implemented repeatedly whenever a new HMM application is created. If there were a software tool for generating an HMM together with the corresponding algorithms, users would be able to focus on designing the HMM, i.e. the states of the model, which is the only task that requires human expertise and insight into the biological problem.

1.4 Discussion and conclusion

HMMs are a very powerful statistical concept which enables many different kinds of sequence analysis applications. In this chapter, we have introduced the theoretical background of HMMs and some popular variants of HMMs, and have described the popularity of these models in Bioinformatics. As we explained, it is typically not easy to design and implement an HMM in practice. We have discussed several challenges, such as long input sequences which lead to large memory and time requirements, difficulties with setting up the parameters of the model, and the time-consuming and error-prone task of implementing efficient algorithms. Even though many different HMMs have been developed for different applications, they use similar prediction and parameter training algorithms. It is therefore not efficient if users spend most of their time implementing and re-implementing these algorithms for different HMMs, when they could be focusing on the design and parameterization of the HMM.

We here propose an HMM-generating tool, HMMConverter, which takes an XML input file defining an HMM, its states and parameters, and the algorithms to be used for generating predictions and training the parameters. Using this tool, a user is not required to have computer programming knowledge to set up an HMM and to use it for data analyses and parameter training. The tool provides several widely used sequence decoding and parameter training algorithms, including two novel parameter training algorithms. Furthermore, HMMConverter supports many special features, such as incorporating external information on each input sequence with confidence scores, and several heuristic algorithms which greatly reduce the run time and memory requirements while keeping the performance essentially unchanged. These features are especially useful for Bioinformatics applications.

In the following chapter, we introduce several widely used existing sequence decoding and parameter training algorithms. The HMM-generating tool HMMConverter is described in chapter 3, including details on all special features and novel algorithms. Chapter 4 explains how to specify an HMM using HMMConverter, and chapter 5 describes some implementation details of HMMConverter. Chapter 6 describes a novel pair-HMM for comparative gene prediction called Annotaid, which is constructed using the HMMConverter framework.


Chapter 2

Decoding and parameter training algorithms for HMMs

2.1 Introduction and motivation

In biological sequence analysis and annotation, we are interested in finding good annotations for raw biological sequences. This is the sequence decoding problem discussed in chapter 1. Every HMM assigns an overall probability to each state path, equal to the product of the encountered transition and emission probabilities, and every state path corresponds to an annotation of the input sequence. For an HMM with well-chosen parameters, high-quality annotations should correspond to high-probability state paths, so the task of sequence decoding is to find a state path with high probability. The Viterbi algorithm [47] is one of the most widely used sequence decoding algorithms; it calculates the most probable state path for a given input sequence and HMM.
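In the notation of chapter 1, for an HMM in which every state except the silent start and end states reads exactly one symbol, this overall probability can be written as

$$P(X, \pi \mid M) = \left( \prod_{k=1}^{L} t_{\pi_{k-1},\,\pi_k} \, e_{\pi_k}(x_k) \right) t_{\pi_L,\,\pi_{L+1}},$$

where $\pi_0$ is the start state and $\pi_{L+1}$ is the end state.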

For many Bioinformatics applications, it is often the case that several state paths yield the same annotation. For example, in a simple pair-HMM which generates global alignments of protein coding regions in two input DNA sequences (see figure 2.1), the states Emit Codon X and Match Codon both assign the annotation label Codon to the letters they read from sequence X. In this pair-HMM, several different state paths that correspond to different global alignments between the two sequences could yield the same gene annotation for the two input sequences. This example also shows that it is always possible to translate a state path in the model into an annotation of the input sequence.

Figure 2.1: A simple generalized pair-HMM for predicting related protein coding regions in two DNA sequences. This simple generalized pair-HMM was presented in chapter 1. The Match Codon state, the Emit Codon X state and the Emit Codon Y state all assign the annotation label Codon to the letters they read from the input sequences. The three other states (except the start and the end state) assign the annotation label non coding.

2.2 Decoding algorithms

In this section, we describe two existing sequence decoding algorithms for HMMs: the most commonly used Viterbi algorithm [47] and the Hirschberg algorithm [16]. The Hirschberg algorithm can be regarded as a more memory efficient version of the Viterbi algorithm. For a given HMM and an input sequence, both algorithms derive the state path with the highest overall probability.

2.2.1 The Viterbi algorithm

The Viterbi algorithm is one of the most widely used sequence decoding algorithms for HMMs. It first calculates the maximum probability that a state path in a given model $M$ can have for a given input sequence, and then retrieves the corresponding optimal state path by a back-tracking process. In this section, we introduce the algorithm for an HMM, i.e. a single-tape HMM where only the start and end states are silent and all other states read one letter from the input sequence at a time.

The algorithm consists of two parts:

• Part 1: Calculate the optimum probability of any state path using dynamic programming.

• Part 2: Retrieve the corresponding optimal state path by a back-tracking process.

We first introduce some notation before explaining the algorithm in pseudo-code; some of it has already been introduced in chapter 1.

• $X = (x_1, \dots, x_L)$ denotes an input sequence of length $L$, where $x_i$ is the $i$-th letter of the sequence.

• $S = \{0, \dots, N-1\}$ denotes the set of $N$ states in the HMM, where 0 is the start state and $N-1$ is the end state.

• $T = \{t_{i,j} \mid i, j \in S\}$, where $t_{i,j}$ is the transition probability from state $i$ to state $j$.

• $E = \{e_i(\gamma) \mid i \in S, \gamma \in A\}$, where $e_i(\gamma)$ is the emission probability of state $i$ reading symbol $\gamma$ from alphabet $A$.

• $v$ denotes the two-dimensional Viterbi matrix, where $v_s(i)$ is the probability of the most probable state path that ends in state $s$ and that has read the input sequence up to and including position $i$.

• $ptr$ denotes the two-dimensional pointer matrix, where $ptr_i(s)$ is the previous state from which the maximum probability at state $s$ and sequence position $i$ was derived.

• $\pi = (\pi_0, \dots, \pi_{L+1})$ represents a state path in the model for a given input sequence of length $L$, where $\pi_i$ is the $i$-th state of the state path. In particular, $\pi_0$ is the start state and $\pi_{L+1}$ is the end state.

• $\pi^*$ denotes the optimal state path, i.e. the state path with the highest overall probability for sequence $X$: $\pi^* = \operatorname{argmax}_\pi P(\pi \mid X, M)$. We also call this state path the Viterbi path.

In order to calculate the optimum probability among all state paths for a given HMM and a given input sequence, the Viterbi algorithm uses the following dynamic programming approach:

Part 1:

Initialization before reading the first letter from the input sequence:

$$v_s(0) = \begin{cases} 1 & \text{if } s = \text{start} \\ 0 & \text{if } s \neq \text{start} \end{cases}$$

Recursion, for $i = 1, \dots, L$ and $s' = 1, \dots, N-2$:

$$v_{s'}(i) = e_{s'}(i) \, \max_{s \in S} \{ v_s(i-1) \, t_{s,s'} \} \qquad (2.1)$$

$$ptr_i(s') = \operatorname{argmax}_{s \in S} \{ v_s(i-1) \, t_{s,s'} \} \qquad (2.2)$$

where $e_{s'}(i)$ is shorthand for $e_{s'}(x_i)$, the emission probability of state $s'$ for the letter at position $i$.

Termination at the end of the sequence and in the end state:

$$v_{N-1}(L) = \max_{s \in S} \{ v_s(L) \, t_{s,\text{end}} \}$$

$$ptr_L(N-1) = \operatorname{argmax}_{s \in S} \{ v_s(L) \, t_{s,\text{end}} \}$$

From the definition of $v$, we know that the probability of the optimal state path is $P(X, \pi^*) = v_{N-1}(L)$. The corresponding optimal state path can be obtained by a back-tracking procedure.

Part 2:

Back-tracking starts at the end of the sequence in the end state and finishes at the start of the sequence in the start state. It is defined by the recursion

$$\pi^*_{i-1} = ptr_i(\pi^*_i).$$

Figure 2.2 illustrates the first part of the Viterbi algorithm.
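The following self-contained sketch implements both parts for a single-tape HMM. It is illustrative only: it is not HMMConverter's implementation, and it omits the log-space arithmetic that a practical implementation needs in order to avoid numerical underflow on long sequences.

    #include <vector>

    // Viterbi decoding for a single-tape HMM (illustrative sketch).
    // States: 0 = start, N-1 = end (both silent); states 1..N-2 read one letter
    // each. t[s][s'] are transition and e[s][g] emission probabilities; x is the
    // input sequence encoded as indices into the alphabet, so e[sp][x[i-1]]
    // corresponds to e_{s'}(i) in equation 2.1.
    std::vector<int> viterbi(const std::vector<std::vector<double>>& t,
                             const std::vector<std::vector<double>>& e,
                             const std::vector<int>& x) {
        const int N = t.size(), L = x.size();
        // v[i][s]: probability of the best path ending in state s after reading
        // the first i letters; ptr[i][s]: predecessor state attaining it.
        std::vector<std::vector<double>> v(L + 1, std::vector<double>(N, 0.0));
        std::vector<std::vector<int>> ptr(L + 1, std::vector<int>(N, 0));
        v[0][0] = 1.0;                            // initialization: start state

        for (int i = 1; i <= L; ++i)              // recursion (equations 2.1, 2.2)
            for (int sp = 1; sp <= N - 2; ++sp)
                for (int s = 0; s <= N - 2; ++s) {
                    double p = v[i - 1][s] * t[s][sp] * e[sp][x[i - 1]];
                    if (p > v[i][sp]) { v[i][sp] = p; ptr[i][sp] = s; }
                }

        for (int s = 0; s <= N - 2; ++s) {        // termination: enter end state
            double p = v[L][s] * t[s][N - 1];
            if (p > v[L][N - 1]) { v[L][N - 1] = p; ptr[L][N - 1] = s; }
        }

        // Part 2: back-tracking, pi*_{i-1} = ptr_i(pi*_i).
        std::vector<int> path(L + 2);
        path[L + 1] = N - 1;                      // end state
        path[L] = ptr[L][N - 1];
        for (int i = L; i >= 1; --i) path[i - 1] = ptr[i][path[i]];
        return path;                              // path[0] is the start state 0
    }

The probability of the Viterbi path is the value stored in v[L][N-1], i.e. $P(X, \pi^*) = v_{N-1}(L)$ as derived above.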

The above algorithm can easily be generalized to pair-HMMs by using a three-dimensional matrix and by looping over both the X and the Y dimension in the recursion. For a generalized HMM, where states can read more than one character at a time from the input sequence, the entries of the Viterbi matrix are calculated by

$$v_{s'}(i) = e_{s'}(i) \, \max_{s \in S} \{ v_s(i - \Delta_x(s')) \, t_{s,s'} \},$$

where $\Delta_x(s')$ is the number of characters read by state $s'$.


Figure 2.2: Illustration of the Viterbi algorithm. The left part shows how the Viterbi algorithm fills the Viterbi matrix from left to right in the recursion. The algorithm first loops along the input sequence X and then loops over the states in the model. It terminates in the top right corner of the matrix, at the end of the input sequence and in the end state. The right part demonstrates the calculation of entry $v_{s'}(i)$: the matrix element for state $s'$ and sequence position $i$ is derived from state $s$ at sequence position $i-1$, i.e. $v_{s'}(i) = e_{s'}(i) \, \max_{s \in S}\{v_s(i-1) \, t_{s,s'}\}$.

Let $T_{\max}$ be the maximum number of transitions into any state of the model. The Viterbi algorithm then takes $O(N T_{\max} L^n)$ operations and $O(N L^n)$ memory for an n-HMM which reads $n$ input sequences at a time. These time and memory requirements impose serious constraints when employing this algorithm with an n-HMM and long input sequences.

2.2.2 The Hirschberg algorithm

From equation 2.1, we can see that the calculation of the Viterbi matrix can be continued if we keep only the Viterbi elements for the previous sequence position in memory. This means that only $O(N L^{n-1})$ memory is required for an n-HMM with $N$ states and $n$ input sequences of identical length $L$ if we only want to calculate the probability of the Viterbi path. However, in order to obtain the Viterbi path itself, $O(N L^n)$ memory is required for storing the back-tracking pointers (see equation 2.2) for the back-tracking process. The memory needed for the back-tracking pointers thus dominates the memory requirements of the Viterbi algorithm.

The Hirschberg algorithm is a more memory efficient version of the Viterbi algorithm. It requires $O(N L^{n-1})$ memory instead of $O(N L^n)$ for an n-HMM, and its time requirement is of the same order as that of the Viterbi algorithm. It uses the same recurrences as the Viterbi algorithm, but searches the Viterbi matrix in a smarter way. We now describe the Hirschberg algorithm for a pair-HMM.

Instead of filling the Viterbi matrix in a single direction along one of the two input sequences, the Hirschberg algorithm traverses the matrix from both ends, using two Hirschberg strips of $O(N L)$ memory each (see figure 2.3). We call them the forward strip and the backward strip, respectively.

In order to perform the backward traversal, a "mirror model" of the original HMM has to be set up by reversing the transitions and exchanging the end state and the start state with respect to the original HMM (a small code sketch of this construction is given below figure 2.3). This mirror model does not necessarily define an HMM, because the sum of the transition probabilities out of a state may no longer be 1. The forward strip calculates the 3-dimensional matrix in the same direction as the Viterbi algorithm, while the backward strip uses the mirror model to move in the opposite direction, i.e. it starts at the end state and at the end of the sequences. The matrix element $(i, j, s)$ in the backward strip stores the probability of the optimal state path that has read the input sequences from their ends up to and including position $i$ of sequence X and position $j$ of sequence Y and that finishes in state $s$. Figure 2.3 shows how the two Hirschberg strips move in opposite directions and meet at the middle position of one of the two input sequences (sequence X in this example).

Figure 2.3: Illustration of the Hirschberg algorithm (1). This figure shows the projection of the 3-dimensional Viterbi matrix onto the X-Y plane. The two Hirschberg strips traverse the matrix from both ends of sequence X and meet at the middle of sequence X. For a generalized HMM, the two strips have a width of $\max_{s \in S}(\Delta_x(s)) + 1$, so they will overlap instead of "touching" each other. The additional columns of each strip are used to store the values calculated for the previous sequence positions, which are needed in order to continue the calculation.
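As a small illustration of the mirror-model construction described above (a sketch under the stated assumptions, not HMMConverter's code), reversing the transitions amounts to transposing the transition matrix and then relabeling the start and end states:

    #include <utility>
    #include <vector>

    // Build the "mirror model" used by the backward Hirschberg strip:
    // transpose the transition matrix (m[j][i] = t[i][j]) and exchange the
    // roles of start (0) and end (N-1). The rows of the result need not sum
    // to 1, so the mirror model is not necessarily an HMM itself.
    std::vector<std::vector<double>>
    mirror_transitions(const std::vector<std::vector<double>>& t) {
        const int N = t.size();
        std::vector<std::vector<double>> m(N, std::vector<double>(N, 0.0));
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                m[j][i] = t[i][j];            // reverse every transition i -> j
        std::swap(m[0], m[N - 1]);            // relabel: rows (outgoing) ...
        for (int i = 0; i < N; ++i)
            std::swap(m[i][0], m[i][N - 1]);  // ... and columns (incoming)
        return m;
    }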

Suppose the two strips meet at position $i$ of sequence X, where the forward strip stores the Viterbi matrix elements for position $i$ of sequence X and the backward strip stores the corresponding probabilities for position $i+1$ of sequence X. At that sequence position, the probability of the optimal state path, $P(X, Y \mid \pi^*)$, can be calculated by finding the transition from a state in the forward strip to a state in the backward strip for which the product of the transition probability and the two strip entries is maximal. Using this criterion, we can identify the Viterbi matrix element $(i, j, s)$ through which the optimal state path $\pi^*$ goes.
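In symbols (the strip notation $f$ and $b$ is introduced here purely for illustration): writing $f_i(j, s)$ for the forward-strip entry at position $(i, j)$ in state $s$ and $b_{i+1}(j', s')$ for the corresponding backward-strip entry, the meeting step selects

$$\max_{j,\, s,\, s'} \; f_i(j, s) \cdot t_{s,s'} \cdot b_{i+1}(j', s'),$$

where $j'$ is the position in sequence Y reached once state $s'$ has read its characters; the maximizing combination yields both the optimum probability and the coordinate $(i, j, s)$ on the optimal state path.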


In summary, when the two Hirschberg strips meet, the optimum probability and a coordinate on the optimal state path can be determined. The process then continues on the left sub-matrix and the right sub-matrix, which together correspond to only half of the search space of the previous iteration (see figure 2.4). In the above example, position $i$ of sequence X defines the right boundary of the left sub-matrix, and the coordinate $(i, j, s)$ defines its end position; this coordinate also defines the start position of the right sub-matrix.

Figure 2.4: Illustration of the Hirschberg algorithm (2). The coordinate (i, j, s) on the optimal state path was determined in the first iteration of the Hirschberg algorithm (see figure 2.3). The matrix is then split into left and right sub-matrices, and the next round of iterations is started for each of the two sub-matrices. This recursive procedure continues until the complete state path is determined.

Figure 2.4 shows two smaller Hirschberg strips searching in each of the two sub-matrices. The recursive process continues until the optimal state path is completely determined. The figure also shows that in each iteration, the total search space for the Hirschberg strips is reduced by a factor of two. The number of operations performed by the Hirschberg algorithm is therefore bounded by $t + \frac{t}{2} + \frac{t}{4} + \cdots \leq 2t$, where $t$ denotes the number of operations required for searching the entire matrix, i.e. the number of operations of the Viterbi algorithm.
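To make the divide-and-conquer structure concrete, the following short, self-contained Python sketch implements the Hirschberg idea for plain global alignment scores rather than for a full pair-HMM; the scoring constants and function names are illustrative only. It collects the coordinates at which the forward and backward strips meet, which is exactly the quantity used above to split the matrix:

MATCH, MISMATCH, GAP = 1, -1, -1   # illustrative scores, not HMM probabilities

def nw_last_row(x, y):
    # forward "strip": last dynamic-programming row of the alignment matrix
    # of x against all prefixes of y, kept in linear memory (two rows)
    prev = [j * GAP for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * GAP] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            curr[j] = max(prev[j - 1] + sub, prev[j] + GAP, curr[j - 1] + GAP)
        prev = curr
    return prev

def split_points(x, y):
    # recursively collect coordinates (i, j) through which the optimal
    # alignment path passes; each recursion halves the remaining search space
    if len(x) < 2:
        return []
    mid = len(x) // 2
    fwd = nw_last_row(x[:mid], y)                     # forward strip
    bwd = nw_last_row(x[mid:][::-1], y[::-1])[::-1]   # backward ("mirror") strip
    j = max(range(len(y) + 1), key=lambda k: fwd[k] + bwd[k])
    return (split_points(x[:mid], y[:j]) + [(mid, j)]
            + [(mid + a, j + b) for (a, b) in split_points(x[mid:], y[j:])])

For example, split_points("GATTACA", "GCATGCA") returns the ordered split coordinates from which the full optimal alignment can be reassembled.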

2.3 Parameter training

In order to calculate the most probable state path for a given input sequence or sequences,

a set of parameters must be provided. The parameters are the set of transition probabilities

and emission probabilities. HMMs of the same topology but different parameter values can

be used for different applications. For example, the average length of an exonic stretch in

human genes is different from that in mouse genes, so the transition probabilities from the

exon state to the intergenic state in the HMM should not be the same. The values of these

parameters determine the prediction result and the performance of the HMM.

In general, the parameters of an HMM should be adjusted to the application at hand, so parameter estimation is crucial. However, obtaining good parameter estimates is often difficult, because the training process is computationally expensive and training data are usually scarce.

Parameter training is essential in order to set up HMMs for novel biological data sets and

to optimize their prediction performance. First of all, a set of training sequences is required

for the training process. If the state paths for all the training sequences are known, then the

parameter estimation becomes a simple counting process where the performance is optimized

via a maximum likelihood approach.

Suppose there are M training sequences. Let:

• X = {X1, . . . , XM} denote the set of M training sequences.

• θ denote the set of parameters of the HMM. This set refers to the set of transition and

emission probabilities, i.e. θ = {T , E}.


• $r_{s,s'}$, the pseudo-count for the number of times the transition $s \to s'$ is used.

• $r_s(\gamma)$, the pseudo-count for the number of times state $s$ reads symbol $\gamma$.

We can then derive the parameters that maximize the likelihood, $P(X|\theta)$, by setting:

$$t_{s,s'} = \frac{T_{s,s'}}{\sum_{s''} T_{s,s''}}, \quad (2.3)$$

where $T_{s,s'}$ is the number of transitions from $s$ to $s'$ in all the state paths, including any pseudo-counts $r_{s,s'}$.

Similarly,

$$e_s(\gamma) = \frac{E_s(\gamma)}{\sum_{\gamma'} E_s(\gamma')}, \quad (2.4)$$

where $E_s(\gamma)$ is the number of times state $s$ reads letter $\gamma$ in all the state paths, including any pseudo-counts $r_s(\gamma)$.
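When the state paths are known, equations 2.3 and 2.4 therefore reduce to simple counting. The following Python sketch illustrates this (state and symbol encodings are arbitrary; note that, for brevity, this sketch only applies pseudo-counts to transitions and emissions that actually occur, a simplification of the equations above):

from collections import defaultdict

def estimate_from_paths(paths_and_seqs, pseudo_t=0.0, pseudo_e=0.0):
    # paths_and_seqs: list of (state_path, sequence) pairs, where
    # state_path[i] is the state that reads sequence[i]
    T = defaultdict(lambda: defaultdict(lambda: pseudo_t))  # transition counts
    E = defaultdict(lambda: defaultdict(lambda: pseudo_e))  # emission counts
    for path, seq in paths_and_seqs:
        for s, s_next in zip(path, path[1:]):
            T[s][s_next] += 1                               # count s -> s'
        for s, gamma in zip(path, seq):
            E[s][gamma] += 1                                # count s reads gamma
    t = {s: {s2: c / sum(row.values()) for s2, c in row.items()}  # equation 2.3
         for s, row in T.items()}
    e = {s: {g: c / sum(row.values()) for g, c in row.items()}    # equation 2.4
         for s, row in E.items()}
    return t, e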

Often, the state paths corresponding to a known annotation are unknown or too numerous. In that case, parameter estimation becomes more difficult. In this situation, we can employ two frequently used parameter training algorithms: the Baum-Welch algorithm [2] and Viterbi training [11].

2.3.1 The Baum-Welch algorithm

The Baum-Welch algorithm is a special case of an expectation maximization (EM) algorithm,

which is a general method for probabilistic parameter estimation. The Baum-Welch algorithm

iteratively calculates new estimates for all parameters and the overall log likelihood of the

HMM can be shown to converge to a local maximum. We introduce some additional notation,


and then describe the Baum-Welch algorithm:

• $x^k_j$ denotes the letter at the $j$-th position of sequence $X_k$ in the set of training sequences $\mathcal{X}$.

• $f^q_s(i, X_k)$ denotes the forward probability for state $s$ and sequence position $i$ in iteration $q$ of the Baum-Welch algorithm. It is equal to the sum of the probabilities of all state paths that have read the input sequence $X_k$ up to and including sequence position $i$ and are in state $s$, i.e. $f^q_s(i, X_k) := P(x^k_1, \cdots, x^k_i, \pi_i = s)$.

• $b^q_s(i, X_k)$ denotes the backward probability for state $s$ and sequence position $i$ in iteration $q$ of the Baum-Welch algorithm. It is equal to the sum of the probabilities of all state paths that have read the input sequence from the end up to and including sequence position $i+1$, given that the state at sequence position $i$ is $s$, i.e. $b^q_s(i, X_k) := P(x^k_{i+1}, \cdots, x^k_{L_{X_k}} | \pi_i = s)$.

• $t^q_{s,s'}$ is the transition probability of going from state $s$ to state $s'$ in iteration $q$ of the Baum-Welch algorithm.

• $e^q_s(\gamma)$ is the emission probability of reading symbol $\gamma$ in state $s$ in iteration $q$ of the Baum-Welch algorithm.

• $\theta^q$ is the estimated set of parameters, i.e. the set of transition and emission probabilities of the model in iteration $q$ of the Baum-Welch algorithm.

So, in each iteration, we derive a new value for every transition probability from the expected number of times that this transition is used in any state path:

$$t^{q+1}_{s,s'} = \frac{\sum_{k=1}^{M} T^q_{s,s'}(X_k)}{\sum_{k=1}^{M} \sum_{s'' \in S} T^q_{s,s''}(X_k)}, \quad \text{where} \quad T^q_{s,s'}(X_k) := \frac{\sum_{i=1}^{L_{X_k}} f^q_s(i, X_k) \, t^q_{s,s'} \, e^q_{s'}(x^k_{i+1}) \, b^q_{s'}(i+1, X_k)}{P^q(X_k)} \quad (2.5)$$

Similarly, for the emission probabilities:

$$e^{q+1}_s(\gamma) = \frac{\sum_{k=1}^{M} E^q_s(\gamma, X_k)}{\sum_{k=1}^{M} \sum_{\gamma' \in A} E^q_s(\gamma', X_k)}, \quad \text{where} \quad E^q_s(\gamma, X_k) := \frac{\sum_{i=1}^{L_{X_k}} \delta_{x^k_i \gamma} \, f^q_s(i, X_k) \, b^q_s(i, X_k)}{P^q(X_k)} \quad (2.6)$$

and $\delta$ is the delta function with $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$. $P^q(X_k)$ is the probability of the sequence $X_k$ in the HMM of iteration $q$. From the definition of the forward probability, we can see that $P^q(X_k) = f^q_{end}(L_{X_k}, X_k)$. This value can be calculated with the forward algorithm [11]. In equation 2.5, the value $f^q_s(i, X_k) \, t^q_{s,s'} \, e^q_{s'}(x^k_{i+1}) \, b^q_{s'}(i+1, X_k) / P^q(X_k)$ is the probability that a transition from state $s$ to state $s'$ is used at sequence position $i$, given the set of parameters $\theta^q$. So, $T^q_{s,s'}(X_k)$ is the expected number of times that a transition from state $s$ to state $s'$ is used in sequence $X_k$ given the set of parameters $\theta^q$. It can be calculated by summing over all positions of sequence $X_k$. Similarly, the value $f^q_s(i, X_k) \, b^q_s(i, X_k) / P^q(X_k)$ in equation 2.6 is the probability that state $s$ reads symbol $\gamma$ at position $i$ of sequence $X_k$ given the set of parameters $\theta^q$. So, $E^q_s(\gamma, X_k)$ is the expected number of times that state $s$ reads symbol $\gamma$ in sequence $X_k$ given the set of parameters $\theta^q$. It can again be calculated by summing over all positions of sequence $X_k$.

In Baum-Welch training, the iterations defined by equations 2.5 and 2.6 are continued until a stopping criterion is reached. There are typically two stopping criteria for Baum-Welch training: (1) the difference between the estimated parameters in the $q$-th iteration and those in the $(q-1)$-th iteration is smaller than a pre-defined threshold; this difference can be estimated by the log-odds ratio $\sum_{k=1}^{M} \log\left( P(X_k|\theta^q) / P(X_k|\theta^{q-1}) \right)$. (2) The number of iterations exceeds a maximum allowed number of iterations. If $c$ is the number of parameters to train, each iteration of the Baum-Welch training algorithm requires $O(N L^n)$ memory and $O(c \cdot M \cdot N T_{max} L^n)$ operations for an $n$-HMM with $N$ states and $M$ training sequences of identical length $L$.
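To illustrate the expectations behind equations 2.5 and 2.6, the following minimal Python sketch performs one Baum-Welch iteration for a simple single-sequence HMM. It folds the start into an initial distribution pi instead of using explicit start and end states, and it omits the scaling needed to avoid numerical underflow on long sequences:

import numpy as np

def baum_welch_step(t, e, pi, seqs):
    # t: (N, N) transitions; e: (N, A) emissions; seqs: integer-encoded sequences
    N, A = e.shape
    T_acc, E_acc = np.zeros((N, N)), np.zeros((N, A))  # expected counts
    pi_acc = np.zeros(N)
    for x in seqs:
        L = len(x)
        f, b = np.zeros((L, N)), np.zeros((L, N))
        f[0] = pi * e[:, x[0]]
        for k in range(1, L):                    # forward algorithm
            f[k] = (f[k - 1] @ t) * e[:, x[k]]
        b[L - 1] = 1.0
        for k in range(L - 2, -1, -1):           # backward algorithm
            b[k] = t @ (e[:, x[k + 1]] * b[k + 1])
        P = f[L - 1].sum()                       # P(X) from the forward values
        for k in range(L - 1):                   # expected transition use (eq. 2.5)
            T_acc += np.outer(f[k], e[:, x[k + 1]] * b[k + 1]) * t / P
        for k in range(L):                       # expected emission use (eq. 2.6)
            E_acc[:, x[k]] += f[k] * b[k] / P
        pi_acc += f[0] * b[0] / P
    return (T_acc / T_acc.sum(axis=1, keepdims=True),
            E_acc / E_acc.sum(axis=1, keepdims=True),
            pi_acc / pi_acc.sum())

Iterating this step while monitoring the summed log-likelihoods $\log P(X_k)$ implements the two stopping criteria described above.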

2.3.2 The Viterbi training algorithm

The Viterbi training algorithm is another popular training algorithm which uses the Viterbi

state paths to derive parameter estimates in an iterative way. Given a set of initial parameters,

the Viterbi algorithm is first used to obtain the optimal state paths for all the training

sequences. Then the counting procedure (equation 2.3 and equation 2.4) described at the

beginning of this section is performed to calculate new parameter values. This procedure is

iterated until the Viterbi paths of all training sequences remain unchanged.

Viterbi training is popular because (1) the Viterbi algorithm is already implemented for sequence decoding in many HMM applications, so only the simple counting procedures of equations 2.3 and 2.4 need to be implemented in addition, and (2) as the Viterbi algorithm is usually used for sequence decoding, it can be argued that Viterbi training results in a better set of parameters for decoding with the Viterbi algorithm.

The Viterbi training consists of the following steps:

(i) The optimal state paths for all the training sequences in the training set are obtained


by using the Viterbi algorithm and the HMM with its parameters θq of iteration q.

(ii) Suppose $T^q_{s,s'}$ and $E^q_s(\gamma)$ have the same meaning as after equations 2.3 and 2.4, with the Viterbi paths assigned to all the training sequences. The new parameter values can then be calculated using equations 2.3 and 2.4, i.e.

$$t^{q+1}_{i,j} = \frac{T^q_{i,j}}{\sum_{j' \in S} T^q_{i,j'}}, \qquad e^{q+1}_s(\gamma) = \frac{E^q_s(\gamma)}{\sum_{\gamma' \in A} E^q_s(\gamma')}$$

The above two steps are repeated until the Viterbi state paths for all the training sequences

remain unchanged. This can always be achieved as the state paths correspond to discrete

values.
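The two steps can be sketched in a few lines of Python. As before, this is a simplified single-sequence HMM with an initial distribution pi instead of explicit start and end states; it assumes strictly positive parameters to keep the log-space arithmetic simple, and small pseudo-counts keep unused transitions from collapsing to probability 0:

import numpy as np

def viterbi_path(t, e, pi, x):
    # standard Viterbi decoding in log space
    logt, loge = np.log(t), np.log(e)
    v = np.log(pi) + loge[:, x[0]]
    back = []
    for k in range(1, len(x)):
        scores = v[:, None] + logt            # scores[i, j] = v_i + log t_ij
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) + loge[:, x[k]]
    path = [int(v.argmax())]
    for ptr in reversed(back):                # back-tracking
        path.append(int(ptr[path[-1]]))
    return path[::-1]

def viterbi_training(t, e, pi, seqs, max_iter=50):
    N, A = e.shape
    prev = None
    for _ in range(max_iter):
        paths = [viterbi_path(t, e, pi, x) for x in seqs]   # step (i)
        if paths == prev:                     # Viterbi paths unchanged: stop
            break
        T, E = np.full((N, N), 1e-6), np.full((N, A), 1e-6)
        for path, x in zip(paths, seqs):      # step (ii): count, then normalize
            for s, s2 in zip(path, path[1:]):
                T[s, s2] += 1
            for s, g in zip(path, x):
                E[s, g] += 1
        t = T / T.sum(axis=1, keepdims=True)  # equation 2.3
        e = E / E.sum(axis=1, keepdims=True)  # equation 2.4
        prev = paths
    return t, e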

Viterbi training is not an EM algorithm, and it cannot be shown to always maximize the likelihood of the model. However, it maximizes the likelihood of both the parameters and the optimal state paths of all the training sequences, i.e. $P(X_1, \cdots, X_M | \theta, \pi^*(X_1), \cdots, \pi^*(X_M))$. For each iteration, the Viterbi training algorithm requires $O(N L^n)$ memory and $O(c \cdot M \cdot N T_{max} L^n)$ operations for an $n$-HMM with $N$ states and $M$ training sequences of identical length $L$, the same requirements as for the Baum-Welch algorithm.

2.4 Discussion and conclusion

We have discussed the roles of sequence decoding and parameter estimation for HMM-based

applications, and have introduced several commonly used algorithms. For sequence decod-

ing, the Viterbi algorithm calculates the optimal state path for each input sequence. One

disadvantage is that it requires $O(N L^n)$ memory for an $n$-HMM, and this imposes a serious

constraint if the input sequences are long. The Hirschberg algorithm reduces the memory


usage by a factor of L compared to the Viterbi algorithm, while increasing the run time

by at most a factor of two.

For parameter estimation, both the Baum-Welch algorithm and the Viterbi training have

the same memory and time requirements for each iteration. The implementation of the

Baum-Welch algorithm is more complicated as both the forward and backward values have

to be calculated. However, it is guaranteed to converge to a local maximum of the likelihood, and it typically produces better parameter estimates than Viterbi training, as it considers all state paths and weighs them according to their probabilities in order to estimate new parameters, whereas Viterbi training only considers a single state path for every training sequence (the Viterbi path).


Chapter 3

HMMConverter

3.1 Motivation

We have already introduced the challenges of designing and implementing HMMs in chapter 1.

It is both time-consuming and error-prone to implement these models and efficient algorithms.

As HMMs are used for many Bioinformatics applications, it would be extremely useful to be

able to use the same source code and efficient algorithms for quickly setting up different kinds

of HMMs and data analyses. One early effort is the Dynamite software package [5], a dynamic-programming code generator which can be used to implement Viterbi-like sequence analysis algorithms. The tool is a compiler which produces highly efficient C code. As it is a tool for general dynamic programming, it is not especially convenient for constructing HMMs for different applications: users still have to implement the HMM themselves, and it is not flexible enough to incorporate special features into the provided dynamic-programming framework. There are several powerful profile-HMM tools, such as HMMer [12] and SAM [20], which provide sequence decoding and parameter training algorithms. But these tools restrict the models that can be implemented, as all the algorithms they provide are for profile HMMs only, a type of HMM mainly used for analyzing fixed multiple sequence alignments.

MAMOT [43] and HMMoC [31] are two recently published HMM generating tools.

MAMOT takes a simple text file as input for defining the HMM and provides the Viterbi algorithm and Baum-Welch training. However, MAMOT can only implement HMMs of order 1.


With these limitations, it is not practical for most Bioinformatics sequence analyses.

HMMoC is a more comprehensive HMM generating tool which supports n-HMMs, i.e.

HMMs with any number of input sequences, and higher order HMMs. The tool also incor-

porates special features such as position-dependent emission and transition probabilities as

well as banding which allows a user to define a smaller search space in the Viterbi matrix.

However, it is not easy to define an HMM with HMMoC, as it requires an XML description together with user-supplied program code. For example, C code fragments are required for setting up the transition and emission probabilities, the position-dependent probabilities and the banding features. This requires users to have programming knowledge.

We here present an HMM generator software tool called HMMConverter, which takes an XML file and simple text files as input and translates them into C++ code. The converter supports generalized pair-HMMs with many special features, such as banding, which allows users to specify a sub-search space for sequence decoding. It is much more flexible and easier to use than HMMoC: the feature of special emissions allows prior information along each input sequence to be integrated into the prediction process, and the parameterization of transition and emission probabilities makes it possible to efficiently train a large number of transition and emission probabilities from a small set of free parameters. Four unique features of our software package are the memory-efficient Hirschberg algorithm [16] for decoding, a linear-memory version of the Baum-Welch algorithm [35], a novel linear-memory version of Viterbi training and a novel, computationally efficient algorithm for training with sampled state paths. All of these decoding and training algorithms can be combined with banding and special emissions, thereby making HMMConverter an extremely powerful and useful software package that can be used to swiftly set up sophisticated HMMs for real-life data sets.

In this chapter, we focus on introducing the features of HMMConverter. Section 3.2

introduces the main features of HMMConverter, including the special features mentioned


above and the different algorithms. Section 3.3 introduces the two novel training algorithms,

and other special features for parameter training with HMMConverter. The specifications

and formats of the input and output files are described in chapter 4 and some implementation

details of HMMConverter are discussed in chapter 5.

3.2 Main features

3.2.1 Input and output files

In order to build an HMM with HMMConverter, users have to supply several input files

for describing the HMM and setting up special features. The format of these input files is

designed to be simple, easy to construct, and easy to read even by users who may not have any

programming experience. Depending on the type of analysis requested, HMMConverter

generates output files for storing the results of sequence decoding and parameter training. The

format of the output files is also designed to be easily readable by users. In particular, the

output files that store the sequence decoding results are designed to be straightforward and

readable, and can easily be converted into other popular formats such as GTF (Gene Transfer Format, a file format used to store information about gene structures). In this section, we briefly describe the input and output files of HMMConverter; the details of their formats are described in chapter 4.

Input files Depending on the analysis to be performed, HMMConverter takes up to 5

input files:

• XML input file (compulsory): This file defines the different components and the topological structure of the HMM; it specifies the algorithms to be used for sequence decoding or parameter training as well as any free parameters required by these algorithms. The parameters for setting up the set of free transition parameters and the set of free emission probabilities are also defined in this file, while the values of the free transition parameters and the free emission probabilities are specified in separate files. These parameters include the id and the size of the set of free transition parameters and the set of free emission probabilities, and the name of the file that stores the values of these free transition parameters and free emission probabilities. This is a pure XML-based file, and users are not required to have knowledge of any computer programming language. The names of all other compulsory or optional input files are also specified in this XML input file, see figure 3.1 (a purely illustrative mock-up of such a file is shown after this list).

Figure 3.1: The XML input file for defining an HMM in HMMConverter. The XML input file references all other flat text files which store parameters for the HMM or specify special features in HMMConverter. The dashed lines correspond to optional input files which are required only if the corresponding feature is used.

• free emission parameter file (compulsory): This is a flat text file which specifies the set of free emission probabilities of the HMM. We call a group of emission probabilities stored in this file a free emission parameter if it defines the emission probabilities of a state in the HMM. We use a separate text file to specify the free emission parameters in order to keep the XML input file short and readable. For example, for a higher-order pair-HMM with a state that reads 6 characters from the two input sequences, we would need to specify $4^6$ emission probabilities for a DNA alphabet for this state alone. The file with the free emission parameters can easily be generated using a small script.

• free transition parameter file (optional): This file stores the values of the free transition

parameters. It is required only if the transition probabilities are parameterized.

• tube file (optional): This file specifies the restricted search space along the sequence

dimensions. It is required if banding is to be used.

• prior-information file (optional): This file stores all prior information on input sequences.

It is required if special emissions are to be used.

The last three files are optional input files which are only required if the corresponding feature

in HMMConverter is to be used in the algorithm for decoding or parameter training. The

concept and theoretical details of these features are explained in the next section.
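To give a flavour of how these pieces fit together, the fragment below mocks up an XML input file. The element and attribute names here are invented placeholders for illustration only, not HMMConverter's actual schema, which is specified in chapter 4:

<!-- purely illustrative mock-up; the real element names and structure
     are defined in chapter 4 -->
<model name="ToyPairHMM">
  <decoding algorithm="Hirschberg"/>
  <free_emission_parameters id="FEP" size="5" file="emissions.txt"/>
  <free_transition_parameters id="FTP" size="19" file="transitions.txt"/>
  <tube file="tube.txt"/>                     <!-- optional: banding -->
  <prior_information file="priors.txt"/>      <!-- optional: special emissions -->
  <states> ... </states>
</model>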

Output files HMMConverter generates up to 5 output files for storing the results of

sequence decoding and parameter training:

• Two output files for sequence decoding:

– The first file contains the optimal state path for all the input sequences. In this

file, each line specifies one state of the state path with the corresponding positions

in the input sequence(s), as well as information on the state.

– If annotation label sets are defined for the HMM, then a second file will be gen-

erated. This file shows the resulting annotation of the input sequence(s) which

corresponds to the optimal state path.

• Three output files for parameter training:


– XML output file: This file has the same format as the XML input file, but contains

the trained values of the transition probabilities. Given this file, it is easy for users

to compare the HMM before and after training.

– output file with trained free emission parameters: This file stores the trained free

emission parameters of the HMM. It has the same format as the input file with

the free emission parameters.

– output file with free transition parameters: This file stores the trained free tran-

sition parameters of the HMM. It has the same format as the input file with the

free transition parameters.

3.2.2 Special features

Three special features of HMMConverter are introduced in this section. They are: pa-

rameterization of transition and emission probabilities, banding and special emissions. These

features are particularly useful for constructing complex pair-HMMs. Special emissions can

be used to integrate prior information on each input sequence probabilistically into all pre-

diction and parameter training algorithms, which is a very desirable feature for many recent

Bioinformatics applications.

Parameterization of transition and emission probabilities

Complex HMMs with a large number of states may consist of hundreds of transition and

emission probabilities. For many applications, the transitions or emission probabilities of

different states are correlated. In that case, specifying all the probabilities as independent

parameters is not desirable. Furthermore, the large number of parameters would entail serious

constraints for parameter training, as enormous amounts of training data would be needed


to avoid over-fitting. As a particular case, two states may share the same set of emission probabilities; if the training process views these as independent parameters, this not only doubles the training time, but may also result in different trained values for what should be identical probabilities.

In order to solve the above problem, HMMConverter allows the parameterization of

transition and emission probabilities. Besides specifying the free parameters, users also have

to define derivation rules which express the probabilities as functions of the free parameters (required for decoding and training) as well as derivation rules which express the free parameters as functions of the probabilities (required for training only). The rules for deriving

transition probabilities and emission probabilities are different from each other and we will

describe them separately in the following.

Parameterization of transition probabilities We first present the power of parameter-

ization by looking at the complex generalized pair-HMM of Doublescan [33, 34] for com-

parative gene prediction. It consists of 153 transitions, but can be efficiently parameterized

using only 19 free transition parameters.

In HMMConverter, a transition probability can be defined by an arithmetic formula. The formula may use any of the basic arithmetic operators '+', '−', '∗', '/' and brackets, with the ids of free transition parameters and numerical values as operands. The free transition parameters need not correspond to probabilities, i.e. their values are not restricted to the range [0,1]. However, users have to ensure that the parameterization of all transition probabilities implies $\sum_j t_{i,j} = 1$ for every state $i$ in the HMM. For example, for a state $i$ with three transitions $i \to j_1$, $i \to j_2$ and $i \to j_3$ and the derivation formulae (functions of the free parameters) $t_{i,j_1} = f_1$, $t_{i,j_2} = f_2$ and $t_{i,j_3} = f_3$, respectively, the sum of the three transition probabilities as a function of the free parameters always has to be 1 (i.e. $f_1 + f_2 + f_3 \equiv 1$).


HMMConverter applies consistency checks to ensure that $\sum_j t_{i,j} = 1$ and $\sum_\gamma e_i(\gamma) = 1$ for all states $i$ in the HMM; otherwise it reports error messages and terminates the program.
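As a toy illustration of such derivation rules and checks (the rule syntax below uses plain Python expressions and eval purely for illustration; it is not HMMConverter's actual input format):

free = {"f1": 0.2, "f2": 0.5}                 # free transition parameters

rules = {                                      # state -> {target: formula}
    "i": {"j1": "f1", "j2": "f2", "j3": "1 - f1 - f2"},
}

for state, out in rules.items():
    # evaluate each derivation formula on the free parameters ...
    probs = {to: eval(expr, {}, dict(free)) for to, expr in out.items()}
    # ... and check the normalization constraint sum_j t_{i,j} = 1
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise SystemExit(f"transitions out of state {state} do not sum to 1")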

Derivation of emission probabilities For the emission probabilities, we only allow free

parameters that correspond to probabilities, i.e. values in [0, 1]. In HMMConverter, a

group of emission probabilities that defines the set of emission probabilities for a state is

regarded as one emission parameter. All the free emission parameters are specified in the free

emission parameter file.

HMMConverter provides two functions, SumOver and Product, for deriving emis-

sion probabilities. Consistency checks on the correct dimensions of the parameters and the

normalization are enforced by HMMConverter.

There are four ways of defining the emission probabilities of a state:

(i) Directly defining the emission probability An emission parameter can be used

directly by several states. This means that the emission probabilities of a state can be defined

by pointing directly to an emission parameter in the free emission parameter file.

(ii) SumOver: This function can be used to derive a lower-dimensional emission probability from a higher-dimensional emission probability. The dimension of an emission probability is the sequence dimension, i.e. the total number of letters read by the corresponding state at a time. Consider as an example three states match_xy, emit_x and emit_y of dimension 6, 3 and 3, respectively. The emission probabilities of emit_x and emit_y can be derived from those of match_xy by summing over several sequence dimensions:

$$P_{emit\_x}(x_1, x_2, x_3) = \sum_{(y_1, y_2, y_3)} P_{match\_xy}(x_1, x_2, x_3, y_1, y_2, y_3)$$

$$P_{emit\_y}(y_1, y_2, y_3) = \sum_{(x_1, x_2, x_3)} P_{match\_xy}(x_1, x_2, x_3, y_1, y_2, y_3)$$

(iii) Product: This function allows the user to express a high-dimensional emission probability as the product of several lower-dimensional emission probabilities. Consider as an example two states, triple and single, of dimension 3 and 1, respectively. The emission probabilities of triple can be expressed as a function of those of single in the following way:

$$P_{triple}(x_1, x_2, x_3) = P_{single}(x_1) \cdot P_{single}(x_2) \cdot P_{single}(x_3)$$

(iv) SumOver & Product: The two functions can be combined to form more complex rules, for example for three states of dimension 3, 1 and 4, respectively:

$$P_{3\text{-}tuple}(x_1, x_2, x_3) = \left( \sum_{(y_1, y_2)} P_{4\text{-}tuple}(x_1, x_2, y_1, y_2) \right) \cdot P_{single}(x_3)$$

The four rules above are easy to apply and flexible enough for many applications. It is easy to prove mathematically that these rules derive a normalized set of emission probabilities which fulfill $\sum_\gamma e_i(\gamma) = 1$ for all states $i$ in the HMM.
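The two functions are easy to emulate. The following hedged Python sketch stores an emission parameter as a dictionary from letter tuples to probabilities; the names and data layout are illustrative, not HMMConverter's internal representation:

from itertools import product
from math import prod

def sum_over(p_high, keep):
    # SumOver: marginalize a higher-dimensional emission parameter down to
    # the sequence dimensions listed in `keep`, e.g. derive emit_x from
    # match_xy with keep=(0, 1, 2)
    p_low = {}
    for tup, p in p_high.items():
        key = tuple(tup[d] for d in keep)
        p_low[key] = p_low.get(key, 0.0) + p
    return p_low

def product_rule(*factors):
    # Product: combine lower-dimensional emission parameters into a
    # higher-dimensional one, e.g. derive triple from three copies of single:
    # p_triple = product_rule(p_single, p_single, p_single)
    return {tuple(c for key in keys for c in key):
            prod(f[key] for f, key in zip(factors, keys))
            for keys in product(*(f.keys() for f in factors))}

Both constructions preserve normalization, which is exactly the consistency property that HMMConverter enforces.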

Banding

As discussed previously, it takes $O(N^2 L_1 L_2)$ time and $O(N L_1 L_2)$ memory to run the Viterbi algorithm on a pair-HMM with $N$ states and two input sequences of length $L_1$ and $L_2$.

These requirements are often too high for many Bioinformatics applications where the input

sequences are very long.


Banding is a feature in HMMConverter for addressing this problem. This feature is

particularly important for pair-HMMs and allows sequence decoding and parameter training algorithms to be run on a restricted search space along the sequence dimensions. We call the

restricted search space a “tube”. If the tube is well chosen, the memory and time require-

ments of the algorithms can be greatly improved while keeping the performance essentially

unchanged.

In HMMConverter, there are two ways to construct a tube:

(i) derive a tube from local matches generated by TBlastx or Blastn [1].

(ii) specify the restricted search space explicitly in the tube file.

(i) In order to derive a tube from blast matches, the user has to specify a parameter called

radius which defines the width of the tube around the selected matches. As many local

blast matches can be generated for the same pair of input sequences, HMMConverter

uses a dynamic programming routine to derive a highest-scoring subset of mutually compatible

matches from the original set of blast matches. It is only this subset of matches which is

used to construct the tube. Figure 3.2 illustrates a tube derived from two matches.

(ii) We now consider specifying the tube explicitly in the tube file. HMMConverter defines a tube via a set of user-defined 4-tuples $\{(x_1, y_1, x_2, y_2) \,|\, 1 \leq x_1 \leq x_2 \leq L_x, \; 1 \leq y_1 \leq y_2 \leq L_y\}$. Each 4-tuple defines two points, $(x_1, y_1)$ and $(x_2, y_2)$, on the sequence plane. These two points can either define the end points of a match or a rectangular block in the sequence plane of the Viterbi matrix. Users can thus explicitly define a tube either by specifying matches and a radius or by specifying blocks of restricted area. Figure 3.3 shows the two different ways of explicitly defining a tube using the same set of 4-tuples. It also shows how HMMConverter derives the final tube, which connects the specified blocks via maximal areas such that the final tube connects the start of the sequences to the end of the sequences.

Figure 3.2: Using Blast to generate a tube for HMMConverter. The two thick lines in the middle correspond to the subset of matches that were selected by the dynamic programming routine as the highest-scoring subset of mutually compatible Blast matches (the discarded Blast matches are not shown here). The radius was set to 30 by the user. The Viterbi algorithm will then only search inside the tube derived from these two matches.

Figure 3.3: Defining a tube for HMMConverter explicitly. The right part of the figure shows the matches defined by the tube file, with the radius set to 50. The tube around the matches is derived in the same way as for matches generated by Blast. The left part shows the same 4-tuples interpreted as blocks, and how the final tube is derived from these blocks.
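A minimal sketch of the banding idea is given below: from a list of (x1, y1, x2, y2) match tuples and a radius, it derives, for every position of sequence X, an allowed interval of positions in sequence Y. The handling of gaps between blocks is simplified compared to the connection scheme shown in figures 3.2 and 3.3:

def tube_from_matches(matches, radius, Lx, Ly):
    # default: whole Y range allowed where no match constrains the search
    lo, hi = [0] * (Lx + 1), [Ly] * (Lx + 1)
    for (x1, y1, x2, y2) in sorted(matches):
        for x in range(x1, x2 + 1):
            # interpolate the match diagonal and widen it by the radius
            y = y1 + (y2 - y1) * (x - x1) // max(x2 - x1, 1)
            lo[x] = max(0, y - radius)
            hi[x] = min(Ly, y + radius)
    return lo, hi

def in_tube(lo, hi, x, y):
    # cheap membership test with which the decoding and training
    # algorithms restrict their search space
    return lo[x] <= y <= hi[x]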

Annotation label sets and special emissions

The performance of many recent Bioinformatics applications could be vastly improved by

integrating prior information on the input sequences into the prediction process. HMMCon-

verter supports this by the feature of special emissions, where the basic idea is to bias the

nominal values of the emission probabilities with position-specific prior probabilities along the

input sequences. This concept of special emissions was first employed by Twinscan [26].

It is also used by Projector [36]. The special emissions in Projector can be regarded

as restricting the state paths along the state dimension according to the discrete sequence


position-specific prior information, i.e. the state path is only allowed to pass through states

that reproduce the known annotation of input sequence X. HMMConverter extends this

concept by allowing prior probabilities with any values in [0,1]. Furthermore, arbitrary anno-

tation label sets can be defined by the user to describe different types of prior information.

We first describe the concept of multiple annotation label sets, and then explain special

emissions.

Multiple annotation label sets In order to integrate prior probabilities along the input sequences into the prediction process, a mapping has to be defined between the labels of the states in the HMM and the prior information. We call the prior information along the sequences the "annotation of the input sequences" and the labels of the states the "annotation labels of the states". The mapping has to explain how an annotation label of the sequence is mapped to the annotation labels of the states. An annotation label set defines a group of annotation labels; for example, an HMM for gene prediction may have an annotation label set Feature = {Exon, Intron, Intergenic}, where the annotation labels represent the different parts of gene structures. Users can define arbitrary annotation label sets for an HMM; any non-silent state in the HMM assigns to each letter that it reads from each input sequence exactly one annotation label from each annotation label set.

Let us consider as an example a pair-HMM for comparative gene prediction with two

arbitrary annotation label sets, Set 1 = {Exon, Intron, Intergenic, Start Codon, Stop Codon} and Set 2 = {Match, Emit X, Emit Y}, and we consider the Match Exon state (see figure 3.4). This state reads three letters from each of the two input sequences at a time, and assigns the labels Exon ∈ Set 1 and Match ∈ Set 2 to each letter that it reads. Two types of

prior information, Set 1 and Set 2, can be specified for every position in each input sequence.

Note that the annotation labels for different letters of one state can differ. For example, we


could define a third annotation label set Set 3 = {NoPhase,Phase0,Phase1,Phase2} which

represents the reading frame of a nucleotide in the input sequence. The Match Exon state

would assign annotation labels (Phase0,Phase1,Phase2) to the sequence positions (x1, x2, x3),

respectively, and also the same set of labels to the sequence positions (y1, y2, y3), respectively.

Figure 3.4: A state which carries labels from two different annotation label sets. A state in a generalized pair-HMM which carries labels from two different annotation label sets: Exon ∈ Set 1 and Match ∈ Set 2. This state reads six letters, three from input sequence X and three from input sequence Y, and assigns to all six letters the two labels Exon and Match.

In order to use special emissions, users have to specify the prior information along the

input sequences using the labels of the different annotation label sets. The labels of the states

in the model have to be matched to the labels specified along the input sequences. The

following paragraph explains the details of how the special emissions are used.

Special emissions HMMConverter supports a very flexible way for integrating prior

information. Users can define any number of annotation label sets for an HMM. The labels of different annotation label sets have to correspond to different types of information, see the

previous example. Consider the annotation label set Set 1. Every Start Codon is also an

Exon. So, if there are prior probabilities specified for the label Start Codon for an interval

of sequence positions, the same interval may also have a prior probability for the label Exon

(in this particular case, the prior probability for Exon must be larger than or equal to the one for Start Codon).

Figure 3.5 illustrates the assignment of different prior probabilities to an input sequence. It also shows that the user has to take care of the normalization of the prior probabilities.

This means that the sum of the prior probabilities for mutually exclusive annotation labels

in one annotation label set assigned to the same interval of sequence positions has to be one.

The requirement of normalization of prior probabilities becomes clearer once we explain the

mechanism for biasing the emission probabilities with prior probabilities.

We illustrate the mechanism by considering the state Match Exon of figure 3.4 and the prior probabilities for an input sequence X as shown in figure 3.5. We consider four

cases during the prediction or parameter training process for sequence X:

(i) reading sequence position 89

(ii) reading sequence position 90

(iii) reading sequence position 220

(iv) reading sequence position 186, with an additional prior probability of 70% for the label

Match in annotation label set Set 2

In case (i), there is no prior information assigned to sequence position 89 of sequence X.

In this case, no bias of the nominal emission probabilities is introduced for the corresponding

sequence position, i.e. for this sequence position, the Viterbi state path is calculated using


Figure 3.5: Illustration of the special emission feature in HMMConverter. Example of prior information for input sequence X of length Lx for one annotation label set S = {Exon, Intron, Intergenic}. For sequence interval [1,89], no prior information on the annotation of the sequence is available. This stretch of the sequence would thus be analyzed with an HMM whose nominal emission probabilities are not biased by any prior probabilities. For sequence interval [90,184], there is prior information concerning the likelihood of being Exon, Intron and Intergenic. For the rest of the input sequence, i.e. sequence interval [185,370], we know with certainty that the sequence positions in [185,214] are exonic and that the remainder of the sequence is intergenic. Note that the prior probabilities add up to 1 for every sequence position for which any prior information is supplied, reflecting the fact that in this case the three labels Exon, Intron and Intergenic are mutually exclusive and that each sequence position has to fall into exactly one of these three categories.


the nominal emission probabilities of the HMM.

In case (ii), the sequence position 90 of the input sequence X has a 60% probability of

being an Exon. The decoding or training algorithm then integrates this prior probability

into the nominal emission probabilities. When the state path passes through this sequence

position and a state which carries the annotation label Exon (in particular, the Match Exon

state), it multiplies the nominal value of the emission probabilities with an additional factor

of 0.6. If the state path passes through a state with annotation label Intron, a factor of 0.3

will be multiplied with the nominal emission probability. The nominal emission probabilities of states which carry the label Intergenic are biased similarly, by multiplying with a factor of 0.1 when the state path passes through this sequence position.

Note that the three annotation labels Exon, Intron and Intergenic in the same annotation

label set are mutually exclusive labels, so the sum of the prior probabilities assigned to these

labels for any sequence position should be 1. This additional responsibility for the user gives

a highly flexible way for integrating prior probabilities. In the same annotation label set,

users can have annotation labels that are related to each other (e.g. Exon and Start Codon) as well as mutually exclusive labels (e.g. Exon and Intron).

In case (iii), the sequence position 220 has 100% probability of being an Intergenic re-

gion. As this is the only prior probability specified for the sequence position 220, the prior

probabilities of the other annotation labels in this annotation label set (i.e. Exon and Intron in Set 1) will be set to 0 for this sequence position by default. Then a factor of 0 will be multiplied with the nominal emission probabilities of the Match Exon state when the state path passes through this sequence position. In other words, the Match Exon state is prohibited from passing through sequence position 220. This case is different from case (i),


where there is no prior probability assigned to any annotation labels in this annotation label

set. Once HMMConverter recognizes a prior probability for one annotation label at a sequence position (in this case, Intergenic in Set 1), all other annotation labels of the same annotation label set without a specified prior probability for that sequence position (in this case, Exon and Intron) are assigned a prior probability of 0 by default. It is therefore very important that users supply normalized prior probabilities for every sequence position.

Finally, in case (iv), sequence position 186 has a 100% probability of being an Exon, so a factor of 1 is multiplied with the nominal emission probabilities of the Match Exon state when the state path passes through this sequence position (i.e. the nominal emission probabilities of this state are not biased for this sequence position). But we also have another prior probability of 70% for the label Match in annotation label set Set 2 for this sequence position, so another factor of 0.7 is multiplied with the nominal emission probabilities of the Match Exon state for sequence position 186. More precisely, let the symbols read by the Match Exon state at sequence position 186 be $\gamma$, and let the corresponding emission probability of the state be $e(\gamma)$. When the state path passes through sequence position 186, the emission probability of the Match Exon state becomes $e'(\gamma) = e(\gamma) \cdot 1 \cdot 0.7$. More factors are multiplied with the nominal emission probability of a state at a sequence position if more types of prior probabilities are assigned to that sequence position. Note that users are required to provide normalized prior probabilities for all the annotation label sets defined in the HMM at every sequence position.
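The biasing mechanism of the four cases can be summarized in a few lines of Python; the data layout below (nested dictionaries from label set to position to label priors) is illustrative only:

def biased_emission(e_nominal, state_labels, priors, position):
    # multiply the nominal emission probability by one prior factor per
    # annotation label set; positions without priors stay unbiased (case i)
    factor = 1.0
    for label_set, label in state_labels.items():
        at_pos = priors.get(label_set, {}).get(position)
        if at_pos is None:
            continue                        # case (i): no prior information
        factor *= at_pos.get(label, 0.0)    # unspecified labels default to 0
    return e_nominal * factor

# case (iv): position 186 is Exon with certainty and Match with prior 0.7
priors = {"Set 1": {186: {"Exon": 1.0}}, "Set 2": {186: {"Match": 0.7}}}
labels = {"Set 1": "Exon", "Set 2": "Match"}       # labels of Match_Exon
print(biased_emission(0.05, labels, priors, 186))  # 0.05 * 1.0 * 0.7 = 0.035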

3.2.3 Sequence decoding algorithms

HMMConverter incorporates several sequence decoding algorithms. They can all be regarded as different variations of the Viterbi algorithm [47], and all of them retrieve the state

path with the highest overall probability given the specified search space.

The Viterbi and Hirschberg algorithms The Viterbi algorithm is a widely used se-

quence decoding algorithm for HMMs which calculates the optimal state path for the input

sequences. The Hirschberg algorithm [16] is a more memory-efficient version of the Viterbi algorithm. It is especially useful for pair-HMMs, as the memory requirement of the Viterbi algorithm usually imposes a serious constraint for long input sequences. The details of both algorithms are described in chapter 2.

In practice, the Hirschberg algorithm takes up to twice as long to run as the Viterbi algorithm, so HMMConverter uses a simple heuristic to determine which algorithm is best to use: the user can specify the maximum amount of memory to be allocated, and the Hirschberg iterations are stopped as soon as the remaining sub-spaces are small enough to be calculated with the Viterbi algorithm within that memory limit.

The Viterbi and Hirschberg algorithms combined with a tube Both the Viterbi

and Hirschberg algorithms can be combined with a tube:

• Algorithm 1: Run the decoding algorithm inside the tube created from matches gener-

ated by the program Blastn [1].

• Algorithm 2: Run the decoding algorithm inside the tube created from matches gener-

ated by the program TBlastx [1].

• Algorithm 3: Run the decoding algorithm inside a user-specified tube.

Algorithms 1 and 2 are introduced in Doublescan [33, 34], where they are called the stepping

stone algorithm. Algorithm 3, combined with the Hirschberg algorithm (see figure 3.6), is a

new feature which is not implemented in any existing HMM generating tool. An optimal or

almost optimal state path can thus be calculated in an extremely time and memory efficient

way. HMMoC also supports the banding feature, but does not combine it with the Hirschberg

algorithm and also does not allow the tube to be automatically derived from blast matches

using a separate dynamic programming routine.

Figure 3.6: Using the Hirschberg algorithm inside a tube. In every iteration, the Hirschberg strips only search inside the reduced, tube-like search space.

3.2.4 Training algorithms

HMMConverter supports several algorithms for parameter training which can all be com-

bined with the tube and special emissions. Right now, HMMConverter is the only software

package which not only provides the linear-memory Baum-Welch algorithm [35], but also a new linear-memory algorithm for training with sampled state paths and a novel linear-memory algorithm for Viterbi training. These three training algorithms reduce the memory requirement by a factor of $O(L)$ with respect to the nominal requirement. This renders parameter training feasible even for complex pair-HMMs for the first time. The special emissions and banding discussed

previously can be combined with any of the training algorithms to improve the time requirements

(banding) and the performance (special emissions). Furthermore, we can train only the free

parameters by specifying corresponding derivation rules.

In this section, we give a brief introduction to the linear-memory Baum-Welch algorithm

which has already been published in [35]. The novel linear-memory algorithm for training with

sampled state paths and the new linear-memory algorithm for Viterbi training are explained

in detail in the following.

The linear-memory Baum-Welch algorithm

We have already introduced the traditional Baum-Welch algorithm in chapter 2. This algorithm takes $\Theta(N L^n)$ memory for an $n$-HMM with $N$ states and $n$ input sequences of length $L$ each. Miklos and Meyer [35] introduced a clever way to perform Baum-Welch training using linear memory; a small correction to this algorithm was made by Churbanov in [9]. The term "linear memory" implies that the algorithm only takes $O(N)$ memory for an HMM with $N$ states, and $O(N L^{n-1})$ memory for an $n$-HMM with $N$ states and $n$ input sequences of length $L$ each. A brief description of the algorithm and its pseudo-code is presented here; the reader should refer to the paper for more detailed explanations and a proof of the correctness of the algorithm. We use the same notation as in chapter 2 unless specifically defined here.

Recall that we can derive a new estimate for $t^{q+1}_{i,j}$, the transition probability of going from state $i$ to state $j$ in the $(q+1)$-th iteration, given the estimated counts $T^q_{i,j}$:

$$t^{q+1}_{i,j} = \frac{\sum_{X \in \mathcal{X}} T^q_{i,j}(X)}{\sum_{X \in \mathcal{X}} \sum_{j'} T^q_{i,j'}(X)}, \quad \text{where} \quad T^q_{i,j}(X) := \frac{\sum_{k=1}^{L_X} f^q_i(k) \, t^q_{i,j} \, e^q_j(x_{k+1}) \, b^q_j(k+1)}{P(X)}$$

Similarly, we can derive a new estimate for $e^{q+1}_i(\gamma)$, which denotes the probability of reading letter $\gamma$ from the input sequence in state $i$ in the $(q+1)$-th iteration, given the estimated counts $E^q_i(\gamma)$:

$$e^{q+1}_i(\gamma) = \frac{\sum_{X \in \mathcal{X}} E^q_i(\gamma, X)}{\sum_{X \in \mathcal{X}} \sum_{\gamma'} E^q_i(\gamma', X)}, \quad \text{where} \quad E^q_i(\gamma, X) := \frac{\sum_{k=1}^{L_X} \delta_{x_k \gamma} \, f^q_i(k) \, b^q_i(k)}{P(X)}$$

We now define two new counts for the transition and emission probabilities which can also be used to derive the corresponding probabilities:

$$T^q_{i,j}(X) := \sum_{k=1}^{L_X} f^q_i(k) \, t^q_{i,j} \, e^q_j(x_{k+1}) \, b^q_j(k+1) \quad (3.1)$$

$$E^q_i(\gamma, X) := \sum_{k=1}^{L_X} \delta_{x_k \gamma} \, f^q_i(k) \, b^q_i(k) \quad (3.2)$$

So, the new estimates for $t^{q+1}_{i,j}$ and $e^{q+1}_i(\gamma)$ can be derived by the following formulae, respectively:

$$t^{q+1}_{i,j} = \frac{\sum_{X \in \mathcal{X}} T^q_{i,j}(X)/P(X)}{\sum_{X \in \mathcal{X}} \sum_{j'} T^q_{i,j'}(X)/P(X)}, \qquad e^{q+1}_i(\gamma) = \frac{\sum_{X \in \mathcal{X}} E^q_i(\gamma, X)/P(X)}{\sum_{X \in \mathcal{X}} \sum_{\gamma'} E^q_i(\gamma', X)/P(X)}$$

The main idea of the algorithm is that the value $T^q_{i,j}(X)$ in equation 3.1 can be interpreted as a weighted sum of the probabilities of state paths for sequence $X$ that contain the transition $i \to j$, where the weight of each state path is the number of times this transition is used in it. The value $E^q_i(\gamma, X)$ in equation 3.2 can likewise be interpreted as a weighted sum of the probabilities of state paths for sequence $X$, where the weight of each state path is the number of times the letter $\gamma$ is read in state $i$. Using this observation, Miklos and Meyer show that the $T^q_{i,j}(X)$ and $E^q_i(\gamma, X)$ values can be calculated using only the forward probabilities and a "slice" of memory along one input sequence. We first present the algorithm, and then give further explanations.

Let $T_{i,j}(k, m)$ denote the weighted sum of the probabilities of all state paths that contain the transition $i \to j$ and which finish at sequence position $k$ in state $m$, and let $E_i(\gamma, x_k, m)$ denote the weighted sum of the probabilities of all state paths that finish at sequence position $k$ in state $m$ and in which state $i$ reads letter $\gamma$ at least once. Then $T_{i,j}(X) = T_{i,j}(L_X, end)$ and $E_i(\gamma, X) = E_i(\gamma, x_{L_X}, end)$.

The algorithm uses a dynamic programming approach for calculating the values $T_{i,j}(X)$ and $E_i(\gamma, X)$ for an HMM with $N$ states and one training sequence of length $L$:


Initialization: before reading the first letter of the input sequence,

$$f_m(0) = \begin{cases} 1, & \text{if } m = start \\ 0, & \text{if } m \neq start \end{cases}$$

$$T_{i,j}(0, m) = 0 \quad \forall m \in \{1, \cdots, N-1\}$$

$$E_i(\gamma, x_0, m) = 0 \quad \forall m \in \{1, \cdots, N-1\}$$

Recurrences: for $k = 0, \cdots, L-1$ and for $m = 1, \cdots, N-2$,

$$f_m(k+1) = \sum_{n=1}^{N-2} f_n(k) \, t_{n,m} \, e_m(x_{k+1})$$

$$T_{i,j}(k+1, m) = \sum_{n=1}^{N-2} T_{i,j}(k, n) \, t_{n,m} \, e_m(x_{k+1}) + \delta_{m,j} \cdot f_i(k) \, t_{i,m} \, e_m(x_{k+1})$$

$$E_i(\gamma, x_{k+1}, m) = \sum_{n=1}^{N-2} E_i(\gamma, x_k, n) \, t_{n,m} \, e_m(x_{k+1}) + \delta_{m,i} \, \delta_{\gamma, x_{k+1}} \cdot f_i(k) \, t_{i,m} \, e_m(x_{k+1})$$

Termination: at the end of the sequence and in the end state,

$$P(X) = f_{end}(L) = \sum_{n=1}^{N-2} f_n(L) \, t_{n,end}$$

$$T_{i,j}(X) = T_{i,j}(L, end) = \sum_{n=1}^{N-2} T_{i,j}(L, n) \, t_{n,end} + \delta_{j,end} \cdot f_i(L) \, t_{i,end}$$

$$E_i(\gamma, X) = E_i(\gamma, x_L, end) = \sum_{n=1}^{N-2} E_i(\gamma, x_L, n) \, t_{n,end}$$

where $\delta$ is the delta function, with $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$.

In the recurrence step, the value of $T_{i,j}(k+1, m)$ is increased by $f_i(k) \, t_{i,m} \, e_m(x_{k+1})$ if $m = j$, i.e. by the summed probability of all state paths that use a transition $i \to j$ up to and including sequence position $k+1$. In the algorithm, only the values for the previous sequence position in the $T_{i,j}$ table and the $f_i$ table are needed in order to continue the calculation (and similarly for calculating the $E_i(\gamma)$ table), so the memory requirement of the algorithm is $O(N)$. On the other hand, the run time of this algorithm is $O(|\mathcal{X}| \cdot N T_{max} L)$ for training one parameter ($\mathcal{X}$ being the set of training sequences), so in total this linear-memory Baum-Welch algorithm requires $O(p \cdot |\mathcal{X}| \cdot N T_{max} L)$ time to train $p$ parameters in one iteration. The procedure for training the emission probabilities is very similar; the reader may refer to the paper for more details.
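The slice-wise computation is easiest to see in code. The sketch below computes $T_{i,j}(X)$ and $P(X)$ for a simple HMM without explicit start and end states (the start is folded into an initial distribution pi, and the termination step is simplified to a plain sum over final states), keeping only one forward slice and one count slice in memory:

import numpy as np

def linear_memory_transition_count(t, e, pi, x, i, j):
    f = pi * e[:, x[0]]                      # forward slice f_m(1)
    T = np.zeros(len(pi))                    # count slice T_{i,j}(1, m) = 0
    for k in range(1, len(x)):
        em = e[:, x[k]]
        T_next = (T @ t) * em                # propagate counts one position
        T_next[j] += f[i] * t[i, j] * em[j]  # paths using i -> j at this position
        f = (f @ t) * em                     # propagate the forward slice
        T = T_next
    return T.sum(), f.sum()                  # (T_{i,j}(X), P(X))

Running this once per transition (and analogously once per emission parameter) reproduces the $O(N)$ memory and the $O(p \cdot |\mathcal{X}| \cdot N T_{max} L)$ total time stated above.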

3.3 Linear memory parameter training using sampled state

paths

This is a novel strategy for parameter training. Unlike Viterbi training which considers only a

single state path and unlike Baum-Welch training which considers all possible state paths, this

new training algorithm derives new parameter values from state paths that are sampled from

the posterior distribution. Similar to Viterbi and Baum-Welch training, this new algorithm

works in an iterative way. In each iteration, a new set of parameter values is derived from the

observed transitions and emissions in the sampled state paths for all sequences in the training

set. The iteration is stopped after a fixed number of iterations or as soon as the change in

the log-likelihood is sufficiently small.

3.3.1 Sampling a state path from the posterior distribution

Sample state paths are high probability state paths which can be derived from the posterior

distribution $P(\pi|X)$. These state paths are especially popular for pair-HMMs that generate


global sequence alignments. As different state paths can yield the same annotation on an

input sequence, it is often more useful to get several sub-optimal alignments rather than just

getting one optimal global alignment (the Viterbi state path).

In order to better understand how the new algorithms work, we first explain how a state path can be sampled from the posterior distribution using existing algorithms. According to [11], this can be done by first using the forward algorithm and then a back-tracking procedure. We introduce some notation before explaining the algorithm. Let

• $f_i(k)$ denote the forward probabilities, as defined previously

• $p_i(k,m)$ denote the probability of selecting state $m$ as the previous state when being in state $i$ at sequence position $k$ (i.e. sequence position $k$ has already been read by state $i$). For fixed $k$ and $i$, $p_i(k,m)$ defines a probability distribution over previous states, as $\sum_m p_i(k,m) = 1$.

We consider sampling a state path for an HMM with $N$ states that reads one input sequence of length $L$. The algorithm is as follows:

Initialization: at the start of the input sequence

$$f_m(0) = \begin{cases} 1, & \text{if } m = \text{start} \\ 0, & \text{if } m \neq \text{start} \end{cases}$$

Recursion: loop over all positions $k$ from 0 to $L-1$ in the input sequence and, for each sequence position, over all states $m$ from 1 to $N-2$ ($N-1$ is the end state) in the model

$$f_m(k+1) = e_m(x_{k+1}) \cdot \sum_{n=1}^{N-2} f_n(k) \cdot t_{n,m} \qquad (3.3)$$


Termination: at the end of the input sequence

$$P(X) = f_{\text{end}}(L) = \sum_{n=1}^{N-2} f_n(L) \cdot t_{n,\text{end}}$$

Once we have calculated all forward probabilities $f_i(k)$, i.e. for all states $i$ in the model and all positions $k$ in the input sequence, we can use the following back-tracking procedure to sample a state path from the posterior distribution $P(\pi|X)$.

We start the back-tracking at the end of the input sequence, $k = L$, in the end state, $i = \text{end}$, and select $m$ as the state for the previous sequence position $k-1$ with probability:

$$p_i(k,m) = \frac{e_i(x_k) \cdot f_m(k-1) \cdot t_{m,i}}{f_i(k)} \qquad (3.4)$$

The back-tracking finishes once we have reached the start of the sequence and the start

state. The resulting succession of chosen previous states corresponds to one sampled state

path.

As the back-tracking procedure requires the entire matrix of forward values for all states and all sequence positions, the above algorithm for sampling a state path requires $O(NL)$ memory and $O(N T_{\max} L)$ time to first calculate the matrix of forward values, and then $O(L)$ memory and time to sample a single state path from the matrix. Note that more state paths can be sampled without having to recalculate the matrix of forward values. For sampling $K$ state paths for the same input sequence, we thus need $O(NL + KL)$ memory and $O(N T_{\max} L + KL)$ time.
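As a minimal illustration of this two-stage procedure, the Python sketch below fills the full forward matrix and then samples one path by back-tracking; the HMM representation is the same hypothetical one used in the earlier sketch. Since $e_i(x_k)$ and $f_i(k)$ in Equation 3.4 do not depend on the candidate previous state $m$, the sampling weights are simply proportional to $f_m(k-1) \cdot t_{m,i}$.

    import random

    def sample_state_path(x, t, e, states, start, end):
        L = len(x)
        emitting = [m for m in states if m not in (start, end)]
        f = [{m: (1.0 if m == start else 0.0) for m in states if m != end}]
        for k in range(L):                       # O(NL) forward matrix
            f.append({m: e[m][x[k]] * sum(f[k][n] * t[n].get(m, 0.0)
                                          for n in f[k]) for m in emitting})
        # back-tracking from the end state at position L
        w = {n: f[L][n] * t[n].get(end, 0.0) for n in emitting}
        path = [random.choices(list(w), weights=list(w.values()))[0]]
        for k in range(L, 1, -1):
            i = path[-1]                         # state that read x_k
            w = {m: f[k - 1][m] * t[m].get(i, 0.0) for m in f[k - 1]}
            path.append(random.choices(list(w), weights=list(w.values()))[0])
        return list(reversed(path))              # states reading x_1 .. x_L

Repeated calls to the back-tracking loop reuse the same forward matrix, which is why $K$ additional paths only cost $O(KL)$ extra time and memory.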


3.3.2 A new, linear memory algorithm for parameter training using

sampled state paths

We first describe a new algorithm for training the parameters of an HMM using state paths

sampled from the posterior distribution. The algorithm corresponds to an iterative procedure.

In each iteration, a new set of transition and emission probability values is derived from the number of times that these transitions and emissions were observed in state paths that were

sampled from the posterior distribution using the HMM with parameter values from the

previous iteration. The iterations are stopped once a maximum number of iterations have

been reached or once the change in the log-likelihood is sufficiently small.

In the following, let $E^q_i(\gamma, X, \pi)$ denote the number of times that state $i$ reads symbol $\gamma$ from input sequence $X$ in state path $\pi$, given the HMM with parameters from the $q$-th iteration. Likewise, let $T^q_{i,j}(X, \pi)$ denote the number of times that a transition from state $i$ to state $j$ is used in state path $\pi$, given the HMM with parameters from the $q$-th iteration. The superscript $q$ thus indicates from which iteration the underlying parameters derive. If we consider all sequences of a training set $\mathcal{X}$ of $M$ training sequences and sample $K$ state paths for each sequence, the recursion which updates the values of the transition and emission probabilities can be written as:

$$t^{q+1}_{i,j} = \frac{\sum_{n=1}^{M} \sum_{k=1}^{K} T^{q}_{i,j}(X_n, \pi_k)}{\sum_{n=1}^{M} \sum_{k=1}^{K} \sum_{j'} T^{q}_{i,j'}(X_n, \pi_k)} \qquad (3.5)$$

$$e^{q+1}_{i}(\gamma) = \frac{\sum_{n=1}^{M} \sum_{k=1}^{K} E^{q}_{i}(\gamma, X_n, \pi_k)}{\sum_{n=1}^{M} \sum_{k=1}^{K} \sum_{\gamma'} E^{q}_{i}(\gamma', X_n, \pi_k)} \qquad (3.6)$$

Equations 3.5 and 3.6 assume that we know the values of $T^q_{i,j}(X_n, \pi_k)$ and $E^q_i(\gamma, X_n, \pi_k)$, i.e. how often each transition and emission was used in each sampled state path $\pi_k$ for every training sequence $X_n$.
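In code, the update (3.5) amounts to pooling transition counts over all $M \cdot K$ sampled paths and renormalizing per state; the sketch below assumes each sampled path is given as a list of states (the emission update (3.6) is analogous, with per-state symbol counts).

    from collections import defaultdict

    def updated_transition_probs(sampled_paths, states):
        # sampled_paths: M * K paths, one list of states per (sequence, sample)
        counts = defaultdict(float)
        for path in sampled_paths:
            for a, b in zip(path, path[1:]):
                counts[(a, b)] += 1.0
        t_new = {}
        for i in states:
            total = sum(counts[(i, j)] for j in states)
            if total > 0:                # leave states without counts unchanged
                t_new[i] = {j: counts[(i, j)] / total for j in states}
        return t_new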

If our computer has enough memory to use the forward algorithm and the back-tracking procedure described above, each iteration of the training algorithm takes $O(N \max_i\{L_i\} + K \max_i\{L_i\})$ memory and $O(N T_{\max} \sum_{i=1}^{M} L_i + K \sum_{i=1}^{M} L_i)$ time, where $\sum_{i=1}^{M} L_i$ is the sum of the $M$ sequence lengths in the training set $\mathcal{X}$ and $\max_i\{L_i\}$ the length of the longest sequence in training set $\mathcal{X}$. However, for many Bioinformatics applications where the number of states $N$ in the model is large, the connectivity $T_{\max}$ of the model is high, or the training sequences are long, these memory and time requirements are too large to allow automatic parameter training using this algorithm.

We now introduce a novel, linear-memory algorithm for training the parameters of an

HMM using sampled state paths. The idea for this algorithm comes from the following

observations:

(i) If we consider the description of the forward algorithm above, in particular the recursion

in equation 3.3, we realize that the calculation of the forward values can be continued

by keeping only the values for the previous sequence position in memory (see Figure 3.7).

(ii) If we have a close look at the description of the backtracking algorithm, in particular

the sampling step in Equation 3.4, we observe that the sampling of the previous state

only requires the forward values for the current and the previous sequence position (see Figure 3.8). So, provided we are at a particular sequence position and state, we can

sample the state at the previous sequence position if we know the forward values for

the current and the previous sequence position.

(iii) If we want to sample a state path π from the posterior distribution P (π|X), we have to

start the backtracking at the end of the sequence in the End state, see the description


Figure 3.7: Illustration of the forward algorithm. Part (1) of this figure demonstrates the calculation of a forward value (the forward value at (m, k)). This calculation only requires the forward values at the previous sequence position, so only a "strip" of memory (of size O(N)) is needed to continue the forward algorithm by filling the forward matrix from left to right, as shown in part (2) of this figure.

Figure 3.8: Sampling a state path from the posterior distribution. This figure shows that sampling a previous state only requires the forward values for the current and the previous sequence position, i.e. this "sampling operation" can also be carried out with a memory "strip" which only stores the forward values for two sequence positions, as shown in Figure 3.7.


above and Equation 3.4. (The only alternative for sampling state paths from

the same posterior distribution would be to use the backward algorithm [11] instead of

the forward algorithm and to start the backtracking procedure at the beginning of the

sequence in the start state.)

Observations (i) and (ii) imply that local information suffices to continue the calculation

of the forward values (i) and to sample a previous state (ii) if we already are in a particular

state and sequence position, whereas observation (iii) reminds us that in order to sample

from the correct probability distribution, we have to start the sampling at the end of the

training sequence. Given these three observations, it is not obvious how we can come up with

a computationally more efficient algorithm for training with sampled state paths. In order to

realize that a more efficient algorithm exists, one also needs to note that:

(iv) While calculating the forward values in the memory-efficient way outlined in (i), we

can simultaneously sample one (or more) potential previous states for each state and

sequence position that we encounter in the calculation of the forward values. This is possible because of observation (ii) above.

(v) In every iteration $q$ of the training procedure, we only need to know the values of $T^q_{i,j}(X, \pi)$ and $E^q_i(\gamma, X, \pi)$, i.e. how often each transition and emission was used in each sampled state path $\pi$ for every training sequence $X$; we do not need to remember where along the training sequence each transition and emission was used.

Given observations (i) to (v), we can now formally write down a new, linear-memory algorithm which calculates $T^q_{i,j}(X, \pi)$ and $E^q_i(\gamma, X, \pi)$. As the description of the following algorithm is for one training sequence $X$ and one iteration $q$, which both stay the same throughout the algorithm, we use a simplified notation, where $T_{i,j}(k,m)$ denotes the number of times the transition from state $i$ to state $j$ is used in a sampled state path that finishes at sequence position $k$ in state $m$, and where $E_i(\gamma, k, m)$ denotes the number of times state $i$ reads symbol $\gamma$ in a state path that finishes at sequence position $k$ in state $m$. As introduced earlier, $f_i(k)$ denotes the forward probability for sequence position $k$ and state $i$, and $p_i(k,m)$ the probability of selecting state $m$ as the previous state while being in state $i$ at sequence position $k$.

Initialization: at the start of training sequence $X = (x_1, \dots, x_L)$

$$f_m(0) = \begin{cases} 1, & \text{if } m = \text{start} \\ 0, & \text{if } m \neq \text{start} \end{cases}$$

$$T_{i,j}(0,m) = 0 \quad \forall\, m \in \{1, \dots, N-1\}$$

$$E_i(\gamma, 0, m) = 0 \quad \forall\, m \in \{1, \dots, N-1\}$$

Recursion: loop over all positions $k$ from 0 to $L-1$ in the training sequence $X$ and, for each sequence position, over all states $m$ from 1 to $N-2$ ($N-1$ is the end state) in the model

$$f_m(k+1) = e_m(x_{k+1}) \cdot \sum_{n=1}^{N-2} f_n(k) \cdot t_{n,m}$$

$$p_m(k+1, n) = \frac{e_m(x_{k+1}) \cdot f_n(k) \cdot t_{n,m}}{f_m(k+1)}$$

$$T_{i,j}(k+1, m) = T_{i,j}(k, l) + \delta_{l,i} \cdot \delta_{m,j} \qquad (3.7)$$

$$E_i(\gamma, k+1, m) = E_i(\gamma, k, l) + \delta_{l,i} \cdot \delta_{\gamma, \gamma'} \qquad (3.8)$$

where $l$ is the chosen previous state and $\gamma'$ the symbol at sequence position $k$.


Termination: at the end of the input sequence

$$f_{\text{end}}(L) = \sum_{n=1}^{N-2} f_n(L) \cdot t_{n,\text{end}}$$

$$p_{\text{end}}(L, n) = \frac{f_n(L) \cdot t_{n,\text{end}}}{f_{\text{end}}(L)}$$

$$T_{i,j}(L, \text{end}) = T_{i,j}(L, l) + \delta_{l,i} \cdot \delta_{\text{end}, j}$$

$$E_i(\gamma, L, \text{end}) = E_i(\gamma, L, l) + \delta_{l,i} \cdot \delta_{\gamma, \gamma'}$$

where $l$ is the chosen previous state and $\gamma'$ the symbol at sequence position $L$.
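A compact Python sketch of this recursion for a single transition count follows; it uses the same hypothetical HMM representation as the earlier sketches (HMMConverter's C++ implementation organizes its bookkeeping differently). One forward strip and one count strip move along the sequence together, and a previous state is sampled for every cell on the fly.

    import random

    def sampled_transition_count(x, t, e, states, start, end, i, j):
        # Assumes every emitting state is reachable at every position; a real
        # implementation would skip cells whose forward value is zero.
        emitting = [m for m in states if m not in (start, end)]
        f = {m: (1.0 if m == start else 0.0) for m in emitting + [start]}
        T = {m: 0 for m in emitting + [start]}   # T[m] ~ T_{i,j}(k, m)
        for sym in x:
            f_new, T_new = {}, {}
            for m in emitting:
                w = {n: f[n] * t[n].get(m, 0.0) for n in f}  # ~ p_m(k+1, .)
                f_new[m] = e[m][sym] * sum(w.values())
                l = random.choices(list(w), weights=list(w.values()))[0]
                T_new[m] = T[l] + (1 if (l == i and m == j) else 0)
            f = {**f_new, start: 0.0}
            T = {**T_new, start: 0}
        # termination: sample the state that read the last symbol x_L
        w = {n: f[n] * t[n].get(end, 0.0) for n in emitting}
        l = random.choices(list(w), weights=list(w.values()))[0]
        return T[l] + (1 if (l == i and j == end) else 0)    # T_{i,j}(L, end)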

We now prove the correctness of the algorithm for training the transition $t_{i,j}$.

Theorem 1: The above algorithm calculates $T_{i,j}(X, \pi)$ in $O(N)$ memory and $O(N T_{\max} L)$ time, where $T_{i,j}(X, \pi)$ is the number of times that a transition from state $i$ to state $j$ is used in state path $\pi$.

Proof:

(1) The recursion in the above algorithm requires only $O(N)$ memory, as the calculation of $f_m(k+1)$ only requires the values of $f_n(k)$ with $n \in \{1, \dots, N-2\}$, and similarly for the calculation of $T_{i,j}(k+1, m)$. Therefore, only $O(N)$ memory is needed to store one column of each table in order to proceed with the calculation. The run time is $O(N T_{\max} L)$: in order to calculate $T_{i,j}(X, \pi)$, $NL$ entries of the forward table and of the table $T_{i,j}$ have to be calculated, and for each entry at most $T_{\max}$ previous entries are examined, giving $O(N T_{\max} L)$ operations in total.

(2) We now prove by induction (on sequence position $k$) that $T_{i,j}(k,m)$ is the number of times that a transition from state $i$ to state $j$ is used in the sampled state path $\pi$ that finishes at sequence position $k$ in state $m$. Let this statement be $s(k)$.


For $k = 0$: in the initialization step, $T_{i,j}(0, m) = 0$ for all $m \in \{1, \dots, N-1\}$. The statement holds, as the number of times a transition from state $i$ to state $j$ is used in a state path that finishes at sequence position 0 in state $m$ is 0 for every state $m$.

Now assume $s(k)$ is true for some non-negative integer $k$; we show that $s(k)$ implies $s(k+1)$. In the recursion step, there are two cases:

Case (i): $l = i$ and $m = j$, where $l$ is the chosen previous state for state $m$ at sequence position $k+1$:

$$T_{i,j}(k+1, m) = T_{i,j}(k, l) + 1$$

By the induction assumption, $T_{i,j}(k, l)$ is the number of times the transition from state $i$ to state $j$ is used in the state path that finishes at sequence position $k$ in state $l$. As $l$ is the chosen previous state for state $m$ at sequence position $k+1$, state $l$ is chosen in $\pi$ at sequence position $k$ if state $m$ is chosen at sequence position $k+1$. Clearly, if the condition $l = i$ and $m = j$ is satisfied, the transition from state $i$ to state $j$ is used in this step, so the number of times this transition is used in the state path that finishes at sequence position $k+1$ in state $m$ ($= j$) equals the corresponding count at the previous sequence position $k$ and the chosen state $l$, plus 1.

Case (ii): $l \neq i$ or $m \neq j$, where $l$ is the chosen previous state for state $m$ at sequence position $k+1$:

$$T_{i,j}(k+1, m) = T_{i,j}(k, l)$$

Again by the induction assumption, $T_{i,j}(k, l)$ is the number of times the transition from state $i$ to state $j$ is used in the state path that finishes at sequence position $k$ in state $l$. As either $l \neq i$ or $m \neq j$, the transition from state $i$ to state $j$ is not used in the step of the state path from sequence position $k$ to sequence position $k+1$. So the number of times a transition from state $i$ to state $j$ is used in the state path that finishes at sequence position $k+1$ in state $m$ equals the corresponding count at the previous sequence position $k$ and the chosen state $l$.

In both cases, the value $T_{i,j}(k+1, m)$ stores the number of times a transition from state $i$ to state $j$ is used in the state path that finishes at sequence position $k+1$ in state $m$, so we have proved $s(k) \Rightarrow s(k+1)$. Finally, $T_{i,j}(L, \text{end}) = T_{i,j}(X, \pi)$ by the definitions of $T_{i,j}(L, \text{end})$ and $T_{i,j}(X, \pi)$, which proves Theorem 1.

By Theorem 1 and Equation 3.5, this new algorithm completes one iteration of training a transition probability of an HMM in $O(N)$ memory and $O(N T_{\max} L)$ time. For the training of emission probabilities, the proof is exactly analogous, so we only state the theorem.

Theorem 2: The above algorithm calculates $E_i(\gamma, X, \pi)$ in $O(N)$ memory and $O(N T_{\max} L)$ time, where $E_i(\gamma, X, \pi)$ is the number of times that state $i$ reads symbol $\gamma$ from input sequence $X$ in state path $\pi$.

Figure 3.9 gives a visual illustration of the algorithm. In summary, we have designed a novel, memory-efficient parameter-training algorithm using sampled state paths, which requires $O(N)$ memory and $O(N T_{\max} L)$ time per iteration for training one parameter with one sampled state path, for an HMM with $N$ states reading one input sequence of length $L$.

3.3.3 A new linear memory algorithm for Viterbi training

The traditional Viterbi training algorithm [11] has already been described in chapter 2; it requires $O(NL)$ memory and $O(N T_{\max} L)$ time per iteration for an HMM with $N$ states reading one input sequence of length $L$. We employ ideas similar to those discussed in the previous section to develop a new linear-memory algorithm for Viterbi training, which we call linear-memory Viterbi training.


Figure 3.9: Illustration of the novel, linear memory algorithm for parameter training using sampled state paths. Part (1) of this figure shows that three "strips" of memory (each of size $O(N)$) are used for computing the values of the counts $T_{i,j}(X,\pi)$ and $E_i(\gamma,X,\pi)$ for the input sequence $X$ and the sampled state path $\pi$. These strips move in parallel, starting at the left end of the matrix and finishing at its right end. Part (2) of this figure shows how the value of the count $T_{i,j}(X,\pi)$ is calculated with the $T_{i,j}$-strip and the f-strip (the forward strip). At each recursion step of the algorithm, a forward value is calculated and a previous state is sampled using the forward value just calculated and the forward values at the previous sequence position (the back-tracking arrows for a cell in the table). Then, the value of the cells in the $T_{i,j}$ table (the number of times the transition from state $i$ to state $j$ is used in this sampled state path) is calculated by accumulating the corresponding counts. In particular, at sequence position $l$ and state $j$, the sampled previous state is $i$, which implies that the transition $i \to j$ is used at this position, so the value in this cell (the count) is increased by one with respect to the value at the previous position. At sequence position $k$ and state $i$, the transition $i \to j$ is not used according to the sampled previous state, so the value of the cell remains the same as at the previous position. At the last sequence position, the cell of state $j$ is sampled as the last state of the sampled state path, and the count $T_{i,j}(X,\pi)$ for this sampled state path is 4 (see the four dark back-tracking arrows). Note that the dark box at the right end of the matrix denotes the $T_{i,j}$-strip: at the termination step, this strip only stores the values for the last two sequence positions, so the sampled state path that gives rise to the count $T_{i,j}(X,\pi)$ is not remembered.


The linear-memory Viterbi training is very similar to the training algorithm using sampled state paths. To describe it, we use similar notation, where $T_{i,j}(k,m)$ denotes the number of times the transition from state $i$ to state $j$ is used in the Viterbi path that finishes at sequence position $k$ in state $m$, and where $E_i(\gamma, k, m)$ denotes the number of times state $i$ reads symbol $\gamma$ in the Viterbi path that finishes at sequence position $k$ in state $m$. We further denote by $v_i(k)$ the probability of the most probable state path ending in state $i$ at sequence position $k$.

Initialization: at the start of training sequence $X = (x_1, \dots, x_L)$

$$v_m(0) = \begin{cases} 1, & \text{if } m = \text{start} \\ 0, & \text{if } m \neq \text{start} \end{cases}$$

$$T_{i,j}(0, m) = 0 \quad \forall\, m \in \{1, \dots, N-1\}$$

$$E_i(\gamma, 0, m) = 0 \quad \forall\, m \in \{1, \dots, N-1\}$$

Recursion: loop over all positions $k$ from 0 to $L-1$ in the training sequence $X$ and, for each sequence position, over all states $m$ from 1 to $N-2$ ($N-1$ is the end state) in the model

$$v_m(k+1) = e_m(x_{k+1}) \cdot \max_n \big( v_n(k) \cdot t_{n,m} \big)$$

$$T_{i,j}(k+1, m) = T_{i,j}(k, l) + \delta_{l,i} \cdot \delta_{m,j}$$

$$E_i(\gamma, k+1, m) = E_i(\gamma, k, l) + \delta_{l,i} \cdot \delta_{\gamma, \gamma'}$$

where $l = \operatorname{argmax}_n \big( v_n(k) \cdot t_{n,m} \big)$ and $\gamma'$ is the symbol at sequence position $k$.


Termination: at the end of the input sequence

$$v_{\text{end}}(L) = \max_n \big( v_n(L) \cdot t_{n,\text{end}} \big)$$

$$T_{i,j}(L, \text{end}) = T_{i,j}(L, l) + \delta_{l,i} \cdot \delta_{\text{end}, j}$$

$$E_i(\gamma, L, \text{end}) = E_i(\gamma, L, l) + \delta_{l,i} \cdot \delta_{\gamma, \gamma'}$$

where $l = \operatorname{argmax}_n \big( v_n(L) \cdot t_{n,\text{end}} \big)$ and $\gamma'$ is the symbol at sequence position $L$.

This algorithm can update a transition or emission probability with $O(N)$ memory and $O(N T_{\max} L)$ time; the proof is exactly analogous to that for training with sampled state paths.
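The following sketch shows the same strips for Viterbi training of one transition count, with the sampled previous state replaced by the argmax (same hypothetical HMM representation as before; a real implementation would work in log space to avoid numerical underflow).

    def viterbi_transition_count(x, t, e, states, start, end, i, j):
        emitting = [m for m in states if m not in (start, end)]
        v = {m: (1.0 if m == start else 0.0) for m in emitting + [start]}
        T = {m: 0 for m in emitting + [start]}   # T[m] ~ T_{i,j}(k, m)
        for sym in x:
            v_new, T_new = {}, {}
            for m in emitting:
                l = max(v, key=lambda n: v[n] * t[n].get(m, 0.0))  # argmax
                v_new[m] = e[m][sym] * v[l] * t[l].get(m, 0.0)
                T_new[m] = T[l] + (1 if (l == i and m == j) else 0)
            v = {**v_new, start: 0.0}
            T = {**T_new, start: 0}
        l = max(emitting, key=lambda n: v[n] * t[n].get(end, 0.0))
        return T[l] + (1 if (l == i and j == end) else 0)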

In summary, the two new algorithms both require $O(N)$ memory and $O(M N T_{\max} L)$ time per iteration to train one parameter (with one sampled state path per sequence, in the case of the sampling-based training) for an HMM with $N$ states and $M$ input sequences of length up to $L$.

Suppose there are $p$ parameters to be trained. The memory and time requirements of the two parameter training algorithms can then be expressed in terms of $c$, the number of parameters estimated in parallel:

• $O(c \cdot N)$ memory and $O((p/c) \cdot M N \cdot (T_{\max} + c) L)$ time.

If we train $c$ parameters in parallel, $c$ strips of size $O(N)$ each are needed, so the memory requirement increases by a factor of $c$. On the other hand, to calculate each cell of the corresponding matrix, $O(T_{\max})$ operations are needed for computing the forward value (a sum over the forward values at the previous sequence position) or the Viterbi value (the maximum over the states at the previous sequence position), and $O(c)$ operations are needed to update the counts of the parameters trained in parallel. So one parallel pass takes $O(M N \cdot (T_{\max} + c) L)$ operations, and in order to train all $p$ parameters, $p/c$ such passes are needed.

HMMConverter organizes these linear-memory parameter-training algorithms automatically: users provide the maximum amount of memory to be allocated for training, and the maximum number of parameters to be estimated in parallel is computed from it, as the small sketch below illustrates.
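As an illustration of how a memory budget might translate into the number of parameters trained in parallel (the actual bookkeeping inside HMMConverter may differ), each parameter needs one $O(N)$ strip on top of the strips holding the forward or Viterbi values:

    def params_in_parallel(max_bytes, n_states, bytes_per_cell=8, fixed_strips=2):
        # reserve a few strips for the forward/Viterbi values themselves, then
        # give each parameter trained in parallel its own O(N) strip
        budget = max_bytes - fixed_strips * n_states * bytes_per_cell
        return max(1, budget // (n_states * bytes_per_cell))

For example, a 1 MB budget and $N = 100$ states would allow roughly $c = 1308$ strips under these assumptions, so $p = 500$ parameters could be trained in a single parallel pass.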

3.3.4 Pseudo-counts and pseudo-probabilities

Parameter training can suffer from over-fitting if the training data are insufficient. To avoid this problem, it is preferable to add predetermined pseudo-counts to the counts from which the transition and emission probabilities are derived. HMMConverter supports two such mechanisms: pseudo-counts and pseudo-probabilities. Typically, a pseudo-count $c_{ij}$ for transition $i \to j$ is added to the corresponding weight of the transition in the training process [11], for example in Baum-Welch training (see chapter 2):

• The weight for the transition $i \to j$ in the $q$-th iteration: $T^{q}_{i,j} \mapsto T^{q}_{i,j} + c_{ij}$

• The corresponding probability in the $(q+1)$-th iteration:

$$t^{q+1}_{i,j} = \frac{T^{q}_{i,j}}{\sum_{j'} T^{q}_{i,j'}}$$


Often, it is difficult to provide an appropriate pseudo-count, and more intuitive to provide a pseudo-probability. HMMConverter therefore also allows "pseudo-probabilities" for training the transition and emission probabilities. Suppose we know a priori that the probability of transition $i \to j$ is roughly $p_{ij}$; then we may use this value as the pseudo-probability for transition $t_{i,j}$. These pseudo-probabilities are added to $t^{q+1}_{i,j}$ rather than to the estimated counts $T^{q}_{i,j}$, and a normalization step is performed to yield a set of valid transition probabilities that fulfill $\sum_j t^{q+1}_{i,j} = 1$ for every state $i$ in the HMM. The same strategy can be applied to training emission probabilities.

For the free transition parameters which need not correspond to probabilities, pseudo-

counts can be used and they are added to the free parameter after each training iteration.

Note that the pseudo-count itself can be any value ≥ 0. For pseudo-counts for free parameters,

no normalization process is needed as the normalization is implicitly imposed by the derivation

rules.
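The two smoothing modes can be summarized by the following sketch, where counts and probabilities are stored as nested dictionaries (an illustrative layout only):

    def with_pseudo_counts(counts, pseudo):
        # counts[i][j]: trained counts T_{i,j}; pseudo[i][j]: pseudo-counts c_ij
        return {i: {j: (counts[i][j] + pseudo[i][j]) /
                       sum(counts[i][k] + pseudo[i][k] for k in counts[i])
                    for j in counts[i]}
                for i in counts}

    def with_pseudo_probs(trained, prior):
        # trained[i][j]: probabilities from the current iteration;
        # prior[i][j]: pseudo-probabilities added afterwards, then renormalized
        return {i: {j: (trained[i][j] + prior[i][j]) /
                       sum(trained[i][k] + prior[i][k] for k in trained[i])
                    for j in trained[i]}
                for i in trained}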

3.3.5 Training free parameters

HMMConverter supports the parameterization of transition and emission probabilities in

order to avoid over-fitting during parameter training. The training algorithms have to be

capable of training the free parameters. Recall that HMMConverter allows different kinds

of parameterizations for transition and emission probabilities. In the following, the training

of free transition parameters and free emission parameters is therefore explained in separate

subsections.

Training of free transition parameters

For the training of free transition parameters, users have to supply a reverse formula which

expresses each free transition parameter as a function of transition probabilities. These reverse formulas correspond to the inverses of the functions which express each transition probability as a function of

the free transition parameters. In each iteration, the normal training process is performed for

the transitions involved in the inverse functions, and the values of the parameters are then

derived using the corresponding inverse function with the updated transition probabilities.

In complicated cases, a free transition parameter cannot be expressed by an arithmetic function. Figure 3.10 shows a part of the pair-HMM of Doublescan [34]. The states shown in the figure model pairs of aligned codons (match exon state) in a pair of orthologous exons, followed either by a pair of orthologous introns in one of 3 possible phases (states 14 to 16) or by an intron in only one of the two input sequences (states 17 to 19 model an intron in X only, states 20 to 22 an intron in Y only). Two particular parameters are involved in these transitions, namely Phase0 and Phase1; they represent the relative probabilities of starting an intron or intron pair after codon position 3 (phase 0), 1 (phase 1) or 2 (phase 2). Phase2 can be expressed as 1 − Phase0 − Phase1.

We now consider training the parameter Phase0. From figure 3.10, the value of Phase0 should correspond to the fraction of transitions 3 → 14, 3 → 17 and 3 → 20, normalized by all transitions from state 3 to any of the states 14 to 22. As state 3 also has other transitions, we cannot easily write Phase0 as an arithmetic function of transition probabilities. To solve this problem, HMMConverter supports so-called "group transitions", which allow users to define a group of transitions that are regarded as the same transition during training. In this example, the parameter Phase0 is trained by calculating the weight of the group of transitions 3 → 14, 3 → 17 and 3 → 20, normalized by the weight of the group of transitions from state 3 to all the states 14 to 22, as the sketch below illustrates. Chapter 4 explains how "group transitions" are specified in HMMConverter.
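A small sketch of the estimate that this group-transition mechanism effectively computes for Phase0 (the state numbers follow figure 3.10 and the counts are illustrative only):

    def phase0_from_counts(T):
        # T[(i, j)]: trained weight of transition i -> j
        num = sum(T.get((3, s), 0.0) for s in (14, 17, 20))   # phase-0 targets
        den = sum(T.get((3, s), 0.0) for s in range(14, 23))  # states 14..22
        return num / den if den else 0.0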


Figure 3.10: A part of the pair-HMM of Doublescan. In order to train the free transition parameter Phase0, transitions from state 3 to all the states representing splice-site phase 0 should be counted. They include 3 → 14, 3 → 17 and 3 → 20.


Training of free emission parameters

We now describe the training of free emission parameters. As previously discussed, every

free emission parameter has to be a group of emission probabilities that define the emission

probabilities for some states. Furthermore, an emission parameter specified in the emission

probability file can be shared by several states, and also be used for expressing the emission

probabilities of other states. In the most general case, it is therefore not trivial to define rules

for deriving functions which would allow us to derive the free emission parameters from the

emission probabilities of the states.

Because of this, HMMConverter supports one simple strategy for training free emission parameters. For each free parameter, HMMConverter determines the states which share emission probabilities directly with the parameter, and the parameter is trained by taking the contributions of all those states into account. This strategy allows efficient and successful training even for complex models, as long as the number of free emission parameters remains small.

3.4 Availability

HMMConverter is released under the GNU General Public License. The software will be available for download, free of charge, from http://www.cs.ubc.ca/~irmtraud/HMMConverter once the accompanying paper has been accepted. The package includes a detailed manual, the software itself and the examples described in the last section of chapter 4.


3.5 Discussion and conclusion

We have discussed the main features of HMMConverter in this chapter; many of them are novel among existing HMM-generating packages. Among the novel features, using banding together with the Hirschberg algorithm makes sequence decoding both memory and time efficient. The special-emission feature is desirable for many Bioinformatics applications in which prior information on the input sequences has to be integrated into the decoding algorithms. The linear-memory Baum-Welch algorithm and the new linear-memory algorithm for training with sampled state paths are both memory efficient, and banding can also be applied to parameter training, which greatly accelerates the process. Furthermore, special emissions can be used to integrate prior information about the training sequences into the training process, and together with the capability of training free parameters, this can greatly improve the overall performance of parameter estimation.

The specifications for constructing an HMM with HMMConverter are described in

detail in chapter 4. Some implementation details of important classes and the interactions

between the classes are discussed in chapter 5.


Chapter 4

Specifying an HMM or pair-HMM

with HMMConverter

4.1 Introduction

HMMConverter takes up to six input files if all special features are used. We first give a

brief description of these files:

• XML-input file: This is a compulsory XML file which describes the states and connec-

tions of an HMM or pair-HMM.

• Free emission parameter file: This is a compulsory flat text file which specifies the

emission probabilities of the HMM.

• Free transition parameter file: This file is only required if the transition probabilities

are parameterized.

• Tube file: This file is only required if banding is used with a user-defined tube.

• Prior information file: This file is only required if special emissions are switched on.

• Input sequence file: This compulsory file stores the input sequences.


In this chapter, the formats of all input files and of the output files generated by HMMConverter are described in detail. The last section of this chapter describes several HMMConverter applications included in the software package; their main purpose is to demonstrate how to set up different HMMs with HMMConverter.

4.2 The XML-input file

HMMConverter takes an XML-input file and translates it into efficient C++ code. The XML-input file consists of two large components, defined by the <model> tag and the <sequence_analysis> tag. In this section, we explain the XML tags by first showing an example of how to specify each tag and then describing its attributes. The term "boolean attribute" used in this section refers to an attribute that accepts only the two discrete values "0" (meaning "no") and "1" (meaning "yes").

4.2.1 The <model> component of the XML file

This component consists of five compulsory tags for specifying the HMM, and two optional

tags for using special features.

[i] <Model_Type> This tag defines some general parameters of the HMM.

Example tag:

<Model_Type name="Gene Prediction" pair="1" SpecialEmission="0" />

Attributes of <Model_Type>:

• name: This compulsory attribute defines the name of the model.

• pair: This optional boolean attribute defines whether the model is a pair-HMM ("1") or an HMM ("0"). The default value is "0" (HMM).

• SpecialEmission: This optional boolean attribute defines whether the model uses special emissions. The default value is "0" (no special emissions).

[ii] <Alphabets> This tag defines the set of symbols, i.e. the alphabet, that the HMM can read from the input sequences.

Example tag:

<Alphabets set="ACGT" cases="0">

Attributes of <Alphabets>:

• set: This compulsory attribute defines the set of symbols, i.e. the alphabet, that the HMM can read.

• cases: This optional boolean attribute defines whether the model is case sensitive ("1") or not ("0"). The default value is "0".

[iii] <Emission_Probs> This tag defines the attributes of the set of free emission parameters on which the emission probabilities of the HMM depend. The free emission parameters themselves are defined in a separate input file, the free emission parameter file.

Example tag:

<Emission_Probs id="FEP" size="7" file="emission_probs_mouse_human.txt" >

Attributes of <Emission_Probs>:

• id: This compulsory attribute defines the id of the set of free emission parameters. This id is used to find the corresponding emission parameters in the free emission parameter file.

• file: The name of the free emission parameter file.

• size: This compulsory attribute specifies the number of free emission parameters contained in the free emission parameter file.

In this example, "FEP" is the id of the set of 7 free emission parameters (size="7"), and the ids of the free emission parameters specified in the free emission parameter file are "FEP.0", "FEP.1", ..., "FEP.6".

[iv] <States> This tag defines the attributes of all states in the HMM. For an HMM with N states, the start state has to be defined with id "S.0" and the end state with id "S.(N−1)".

Example tag:

<State id="S.14" name="sample state" xdim="2" ydim="3" special="1" >
<Set1>
<label idref="Set1.2"/>
</Set1>
<Set2>
<label idref="Set2.0"/>
<label idref="Set2.0"/>
<label idref="Set2.1"/>
<label idref="Set2.1"/>
<label idref="Set2.2"/>
</Set2>
<State_Emission_Probs ... >
...
</State_Emission_Probs>
</State>


Attributes of <State>:

• id: This compulsory attribute takes a string value. The string starts with "S.", followed by an integer in the set {0, 1, ..., N−1}, where N is the number of states in the model.

• name: This compulsory attribute defines the name of the state.

• xdim: This optional attribute defines the number of characters that this state reads from input sequence x. The default value is "0".

• ydim: This optional attribute defines the number of characters that this state reads from input sequence y. The default value is "0". This attribute is only required for states of pair-HMMs.

• special: This optional boolean attribute defines whether the special emissions of this state are switched on ("1") or not ("0").

The state (named "sample state") described in the above example tag is shown in figure 4.1. This state reads five characters, two from input sequence X and three from input sequence Y; we call this a state of dimension five. For the annotation label set "Set1" defined for the HMM, only one label, "Set1.2" (the id of an annotation label defined in this annotation label set), is specified, which means that all five sequence positions of the state carry the same annotation label "Set1.2" from this set. For "Set2" (another annotation label set defined for the HMM), different labels are specified for each of the five sequence positions (x1, x2, y1, y2, y3); for example, the annotation label in "Set2" for y2 is "Set2.1".

The name of the <Set1> (and <Set2>) tag has to match the name of the corresponding annotation label set specified in the <Annotation_Labels> tag. The annotation labels defined in this <State> tag only specify the annotation labels for this particular state;


Figure 4.1: Defining a state of an HMM or pair-HMM in HMMConverter. The figure shows the state defined in the above example tag. This state reads five letters at a time, two from input sequence X and three from input sequence Y. Each sequence position is assigned two annotation labels, one from annotation label set "Set1" and one from "Set2". In this example, all five sequence positions are assigned the same annotation label "Set1.2" from annotation label set "Set1", while different annotation labels from annotation label set "Set2" are assigned to the individual sequence positions. For example, sequence position y2 is assigned "Set1.2" from annotation label set "Set1" and "Set2.1" from annotation label set "Set2".


the complete annotation label sets for the HMM, and the annotation labels defined in each annotation label set, are specified in the <Annotation_Labels> tag. The <Annotation_Labels> tag is an optional tag inside the <model> component and is explained after the compulsory tags.

The attribute of the <label> tag inside the <State> tag is:

• idref: This attribute stores the id of an annotation label as a string. The string starts with the name of the corresponding annotation label set, followed by a "." character and an integer between 0 and the number of labels defined in this annotation label set minus 1.

The <State_Emission_Probs> tag defines the emission probabilities of a state. There are two ways to define this tag; the simpler way is the following:

<State_Emission_Probs GetFrom="S.2" >

The attribute GetFrom specifies the id of a free emission parameter defined in the free emission parameter file, or the id of a state in the HMM. If a state is specified, the two states share the same set of emission probabilities.

The second way is to specify how the emission probabilities are derived from free emission parameters. In that case, the <State_Emission_Probs> tag has to be defined in the following way:

<State_Emission_Probs SumOver="1" Product="1">
<SumOver From="S.3" FromPos="0 1 3 4" ThisPos="0 1 4 5"/>
<Product From="S.14" FromPos="0 1 2 3" ThisPos="2 3 6 7"/>
</State_Emission_Probs>

The attributes of the <State_Emission_Probs> tag are:


• SumOver: This boolean attribute defines whether the SumOver function is used ("1") or not ("0"). The default value is "0".

• Product: This boolean attribute defines whether the Product function is used ("1") or not ("0"). The default value is "0".

The concept of these two functions has already been explained in detail in chapter 3; we now describe how to set them up. Consider an example where we use the SumOver function to derive the emission probabilities of state S.3 from those of state S.6:

$$P_{S.3}(x_1, x_2, x_3) = \sum_{(y_1, y_2, y_3)} P_{S.6}(x_1, x_2, x_3, y_1, y_2, y_3)$$

This could be specified by:

<State_Emission_Probs SumOver="1">
<SumOver From="S.6" FromPos="0 1 2" ThisPos="0 1 2"/>
</State_Emission_Probs>

ThisPos="0 1 2" and FromPos="0 1 2" mean that the 0-th, 1-st and 2-nd characters read by S.3 match the 0-th, 1-st and 2-nd characters read by S.6 respectively, i.e. we sum over the 3-rd, 4-th and 5-th characters of state S.6 to obtain the emission probabilities of state S.3.
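Conceptually, the SumOver derivation is a marginalization of a higher-dimensional emission table, and Product builds a table as a product of lower-dimensional ones. The Python sketch below illustrates both on dictionary-based tables keyed by character tuples (an illustrative layout, not HMMConverter's data structures):

    import math
    from itertools import product as cartesian

    def sum_over(p_from, keep_positions):
        # marginalize onto the positions kept by FromPos,
        # e.g. keep_positions = (0, 1, 2) for the S.6 -> S.3 example above
        p_this = {}
        for chars, p in p_from.items():
            key = tuple(chars[i] for i in keep_positions)
            p_this[key] = p_this.get(key, 0.0) + p
        return p_this

    def product_table(p1, n, alphabet):
        # derive an n-character table as a product of 1-character probabilities
        return {chars: math.prod(p1[(c,)] for c in chars)
                for chars in cartesian(alphabet, repeat=n)}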

The attributes of the <SumOver> tag are:

• From: The id of a state or of an emission parameter.

• FromPos and ThisPos: FromPos is a vector of positions of the state or emission parameter specified in the From attribute. The FromPos vector and the ThisPos vector must have the same length, as they map the sequence positions of the From state to the sequence positions of this state. Here, "0" represents the first character read by the state, and "1" and "2" represent the second and third characters respectively.

For the Product function, consider an example with states S.3 and S.1 of dimension 3 and 1 respectively, where the emission probabilities of state S.3 depend in the following way on those of state S.1:

$$P_{S.3}(x_1, x_2, x_3) = P_{S.1}(x_1) \cdot P_{S.1}(x_2) \cdot P_{S.1}(x_3) \qquad (4.1)$$

The corresponding specification is:

<State_Emission_Probs Product="1">
<Product From="S.1" FromPos="0" ThisPos="0"/>
<Product From="S.1" FromPos="0" ThisPos="1"/>
<Product From="S.1" FromPos="0" ThisPos="2"/>
</State_Emission_Probs>

The attributes of the <Product> tag are the same as those of the <SumOver> tag and have the same meaning. In the above example, the 0-th, 1-st and 2-nd characters read by state S.3 are all mapped to the 0-th character read by state S.1.

The Product and SumOver functions can also be combined, as the following example shows, where we have three states S.3, S.1 and S.4 of dimension 3, 1 and 4 respectively:

$$P_{S.3}(y_1, y_2, y_3) = \Big( \sum_{(x_1, x_2)} P_{S.4}(x_1, x_2, y_1, y_2) \Big) \cdot P_{S.1}(y_3)$$

The corresponding specification would be:

<State_Emission_Probs SumOver="1" Product="1">
<SumOver From="S.4" FromPos="2 3" ThisPos="0 1" />
<Product From="S.1" FromPos="0" ThisPos="2"/>
</State_Emission_Probs>

Users can also use combinations of Product and SumOver in which one sequence position is involved in more than one operation.

The <State_Emission_Probs> tag is not required in two cases:

(i) The state is a silent state which does not read any symbols from any input sequence.

(ii) All the emission probabilities of the HMM are specified explicitly in the free emission parameter file, where the numbering of the emission probability sets starts with "FEP.0" ("FEP" being the id of the set of free emission parameters).

Regarding case (ii), if state S.14 is not defined with a <State_Emission_Probs> tag, then the emission probabilities of this state are assigned the free emission parameter "FEP.13" defined in the free emission parameter file (as S.0 is the start state, which does not read any letter, the emission probabilities of S.1 are assigned "FEP.0"). This is implemented for convenience when setting up a model whose emission probabilities are not parameterized: users only have to specify the emission probabilities of the states, in order, in the free emission parameter file, and no <State_Emission_Probs> tag needs to be defined for any state.

[v] <Transitions> This tag defines the topology of the HMM, i.e. the transitions between states of the HMM and the corresponding probabilities.

Example tag:

<Transitions train="1">
<from idref="S.0">
<to idref="All" exp="1/53"/>
</from>
...
<from idref="S.3" >
<to idref="S.3" exp="0.6" pseudoprob="0.2"/>
...
<to idref="S.53" exp="0.01" />
</from>
<from idref="S.4" train="1" >
<to idref="S.3" exp="FTP.15*(1-FTP.2)"/>
<to idref="S.4" exp="(1-FTP.15)*(1-FTP.2)"/>
<to idref="S.53" exp="FTP.2"/>
</from>
...
</Transitions>

The attributes of the <Transitions> tag are:

• train: This optional attribute takes the value "0", "1" or "All". The default value is "0", which means that none of the transition probabilities is trained. The value train="1" means that some transitions are to be trained; these have to be marked by setting the train attribute of the corresponding <from> state to "1". The value train="All" implies that all transitions of the HMM are to be trained.

The attributes of the <from> tag are:

• idref: This compulsory attribute defines the state from which the transitions originate. The value of idref has to match the id of the "from" state.


• train: This optional boolean attribute defines whether the transition probabilities out of the "from" state are to be trained. The default value is "0".

The attributes of the <to> tag are:

• idref: This compulsory attribute specifies the "to" state of the transition. The value of idref either has to match the id of the "to" state, or is set to "All", which means that the "from" state is connected to all other states of the HMM except the start and end state, using the same transition probability.

• exp: This compulsory attribute defines the transition probability. It takes a string in one of the following three formats:

– a decimal number;

– an arithmetic formula with decimal numbers as operands. The formula accepts the operators '+', '-', '*' and '/' (addition, subtraction, multiplication and division); brackets are also allowed in the expression;

– an arithmetic formula with decimal numbers and ids of free transition parameters as operands.

• pseudoprob: This optional attribute defines a pseudo-probability to be used for this transition during parameter training.
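To illustrate the semantics of the exp attribute (HMMConverter's parser is implemented in C++; this sketch merely mimics its arithmetic), the ids of the free parameters can be substituted by their current values before the formula is evaluated:

    def eval_exp(exp, ftp):
        # substitute ids longest-first so "FTP.15" is not clobbered by "FTP.1"
        for pid in sorted(ftp, key=len, reverse=True):
            exp = exp.replace(pid, repr(ftp[pid]))
        return eval(exp, {"__builtins__": {}})  # arithmetic only: + - * / ( )

    # eval_exp("FTP.15*(1-FTP.2)", {"FTP.15": 0.7, "FTP.2": 0.1})  ~ 0.63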

All five tags described above, i.e. <Model_Type>, <Alphabets>, <Emission_Probs>, <States> and <Transitions>, are compulsory for specifying an HMM. Two optional tags can be used for setting up special features with HMMConverter.

[vi] <Annotation_Labels> HMMConverter can use special emissions to integrate prior information along the input sequences into the sequence decoding and training algorithms. Prior probabilities are specified together with the names of annotation labels along the input sequences; those names have to match the names of the annotation labels defined for the states of the HMM. The details of the annotation label sets of the HMM are defined in this <Annotation_Labels> tag.

Example tag:

<Annotation_Labels>
<Annotation_Label name="Feature" score="1" >
<label id="Feature.0" name="Undefined" />
<label id="Feature.1" name="Intergenic" />
...
<label id="Feature.6" name="Stop Codon" />
</Annotation_Label>
</Annotation_Labels>

The attributes of the <Annotation_Label> tag are:

• name: This compulsory attribute defines the name of the annotation label set.

• score: This optional boolean attribute defines whether the annotation label set accepts a prior probability in [0, 1] ("1") or only the discrete values "on" or "off" ("0"). The default value is "0".

The attributes of the <label> tag are:

• id: This compulsory attribute defines the id of the annotation label. The string starts with the name of the annotation label set, followed by "." and an integer in the set {0, ..., L−1}, where L is the size of the annotation label set.

• name: This compulsory attribute defines the name of the label.


[vii] <Transition_Probs> This tag defines a set of free transition parameters; the values of the free transition parameters are defined in a separate input file, the free transition parameter file.

Example tag:

<Transition_Probs id="FTP" size="19" file="transition_parameters_mouse_human.txt" />

The attributes of the <Transition_Probs> tag are:

• id: This compulsory attribute defines the id of this set of free transition parameters. This id is used to find the corresponding values in the free transition parameter file.

• size: This compulsory attribute specifies the number of parameters in the free transition parameter file.

• file: The file name of the free transition parameter file.

4.2.2 The <sequence_analysis> component of the XML file

The purpose of this component is to specify the decoding and training algorithms that are to be used with the model, together with the algorithms' parameters. It consists of two main tags, the <sequence_decoding> tag and the <parameter_training> tag, which we explain in order in this section.

The <sequence_decoding> tag

All parameters for the sequence decoding algorithms are specified in this tag. It consists of three sub-tags:

• <input_files>


• <algorithm>

• <output_files>

<input_files> This tag specifies the names of the different input files for the sequence decoding algorithm.

Example tag:

<input_files SeqFile="seq1.fasta" AnnFile="priorinfo1.txt" TubeFile="tube.txt" />

The attributes of the <input_files> tag are:

• SeqFile: This compulsory attribute specifies the name of the input sequence file.

• AnnFile: This optional attribute specifies the name of the prior information file. It is not required if special emissions are not used.

• TubeFile: This optional attribute specifies the name of the tube file that defines the restricted search space. It is only required if algorithm 3 is chosen.

<algorithm> This tag specifies the sequence decoding algorithm to be used and its parameters.

Example tag:

<algorithm alg="1" MaxVolume="300000" radius="30" />

The attributes of the <algorithm> tag are:

• alg: This compulsory attribute selects the decoding algorithm; it has four options:


– 0: A combination of the normal Viterbi algorithm [47] and the Hirschberg algorithm [16] is used, depending on the maximum amount of memory specified by the user.

– 1: Derives a tube from Blastn [1] matches and runs algorithm 0 inside the resulting tube.

– 2: Derives a tube from TBlastx [1] matches and runs algorithm 0 inside the resulting tube.

– 3: Runs algorithm 0 inside a user-defined tube specified by an input file.

• MaxVolume: This compulsory attribute defines the maximum amount of memory, in bytes, to be allocated for running any decoding algorithm.

• radius: This attribute specifies the radius of the tube generated from the matches reported by the Blast programs (see figure 3.2.2); it is only required if algorithm 1 or algorithm 2 is chosen.

<output_files> This tag specifies the names of the output files which store the results generated by the sequence decoding algorithm.

Example tag:

<output_files OutFile="viterbi_result.txt"/>

The attribute of the <output_files> tag is:

• OutFile: This compulsory attribute specifies the name of the output file. The file is created if it does not exist in the output directory; otherwise, the existing file is overwritten.


HMMConverter generates at most two output files for storing the results of a sequence decoding algorithm, both in simple text formats. The first output file shows the optimal state path explicitly, and the second shows the resulting annotation of the sequences. If no annotation label set is defined, the second file is not generated. If both output files are generated in the above example, they are named viterbi_result_1.txt and viterbi_result_2.txt respectively.

The <parameter_training> tag

This tag specifies the parameters of the training algorithms. Its general format is similar to that of the <sequence_decoding> tag, except that it has an additional sub-tag which specifies the derivation rules for training free transition parameters. For this reason, only the tags and attributes that differ from those of the <sequence_decoding> tag are explained in detail.

The <parameter_training> tag consists of four sub-tags:

• <input_files>

• <algorithm>

• <train_free_parameters>

• <output_files>

<input_files> This tag is the same as for the <sequence_decoding> tag; it specifies the names of the different input files for parameter training. The attribute TubeFile, however, differs from its counterpart in the <sequence_decoding> tag: in addition to specifying the name of a user-defined tube file, it also accepts the values "Blastn" and "TBlastX", which mean that the tubes for parameter training are generated from the matches provided by the specified Blast program.


<algorithm> The <algorithm> tag for parameter training is similar to that for sequence decoding. We only describe the attributes that are different from those for sequence decoding:

Example tag:

<algorithm alg=”1” MaxVolume=”100000” Maxiter=”5” threshold=”0.01” SamplePaths=”10” />

The attributes for the <algorithm> tag are:

• alg : This attribute specifies which training algorithm is to be used; it accepts three values:

– 0: The linear memory Baum-Welch algorithm [9, 35].

– 1: The linear memory training using sampled state paths.

– 2: The linear memory Viterbi training [11].

• threshold: This attribute specifies a threshold score for terminating the Baum-Welch training algorithm. Its default value is 0; the algorithm is terminated if the average log-odds ratio in the current iteration is less than this threshold value. Note that this attribute does not apply to training algorithms 1 and 2.

• Maxiter: This attribute specifies the maximum number of iterations for running the training algorithm; the default value is 10.

• SamplePaths: This attribute specifies the number of state paths that are to be sampled for every training sequence and every iteration in algorithm 1.

The training algorithm stops as soon as the first of the two convergence criteria (Maxiter and threshold) is met.
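The interplay of the two criteria can be summarized in a few lines. The following is a minimal C++ sketch with hypothetical names, not HMMConverter's actual code:

// Minimal sketch of the two stopping criteria described above (hypothetical
// names). For Baum-Welch (algorithm 0), training ends when either criterion
// is met; algorithms 1 and 2 only use the maximum number of iterations.
bool continue_training(int iteration, int max_iter,
                       double avg_log_odds, double threshold, bool baum_welch)
{
    if (iteration >= max_iter) return false;                  // Maxiter reached
    if (baum_welch && avg_log_odds < threshold) return false; // converged
    return true;
}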

<train_free_parameters> This tag is needed for the training of the free transition parameters. As discussed in the previous section, the user has to specify rules for deriving free transition parameters from transition probabilities. We use the same example as in chapter 3 to illustrate


how to set up “Group transition probabilities” with this tag. Looking at figure 4.2 again, we want to train the ”Phase0” parameter by counting the transitions 3 → 14, 3 → 17 and 3 → 20, and then normalizing by all transitions from state 3 to any of the states 14−22. The following example shows how to specify the above “Group transition probabilities”:

<Parameters_training>
  <FreeTransitionParameters>
    <FTP idref=”FTP.0” exp=”GTP.1” />
    ...
    <FTP idref=”FTP.2” exp=”GTP.2+GTP.3” />
  </FreeTransitionParameters>
  <GroupTransitions id=”GTP”>
    ...
    <GTP id=”GTP.1”>
      <from idref=”S.3”>
        <to idref=”S.14” />
        <to idref=”S.17” />
        <to idref=”S.20” />
      </from>
      <Overfrom idref=”S.14”>
        <Overto idref=”All” />
      </Overfrom>
      <Overfrom idref=”S.17”>
        <Overto idref=”All” />
      </Overfrom>
      <Overfrom idref=”S.20”>
        <Overto idref=”All” />
      </Overfrom>
    </GTP>
    ...
  </GroupTransitions>
</Parameters_training>

In the above example, GTP stands for “Group transition probabilities”. For example, the parameter GTP.1 specifies the group of transitions for training Phase0. In the above example, another free transition parameter, FTP.2, is derived by adding two group transition probabilities. One important point to mention here is that in order to train a free transition parameter, the user has to specify it in the tag <FreeTransitionParameters>, together with an appropriate expression, which should be an arithmetic formula using group transition probabilities and numerical values as operands. In the above example, FTP.1 is not specified, so this parameter is not included in the parameter training. A hedged sketch of the resulting computation is given below.
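The following C++ sketch illustrates how a group transition probability could be computed from transition counts collected during training. The data layout and function name are illustrative assumptions, not HMMConverter's internal code; the sketch only mirrors the counting-and-normalizing rule described above:

#include <map>
#include <utility>
#include <vector>

// Expected transition counts collected during training: (from, to) -> count.
typedef std::map<std::pair<int,int>, double> TransitionCounts;

// A group transition probability (GTP) is the sum of the counts of the grouped
// transitions (the <from>/<to> entries), normalized by the sum of the counts
// of the normalizing transitions (the <Overfrom>/<Overto> entries).
double group_transition_prob(const TransitionCounts& counts,
                             const std::vector<std::pair<int,int> >& group,
                             const std::vector<std::pair<int,int> >& normalizer)
{
    double numerator = 0.0, denominator = 0.0;
    for (size_t i = 0; i < group.size(); ++i) {
        TransitionCounts::const_iterator it = counts.find(group[i]);
        if (it != counts.end()) numerator += it->second;
    }
    for (size_t i = 0; i < normalizer.size(); ++i) {
        TransitionCounts::const_iterator it = counts.find(normalizer[i]);
        if (it != counts.end()) denominator += it->second;
    }
    return denominator > 0.0 ? numerator / denominator : 0.0;
}

Under these assumptions, FTP.0 above would simply be the value of GTP.1, and FTP.2 would be the sum of the values of GTP.2 and GTP.3.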

No rule is needed to specify the training of the free emission parameters, as discussed previously. The counts from the appropriate states for an emission parameter can be determined within the parameter training process. For a more detailed explanation of this strategy, please refer to chapter 3.

<output_files> This tag specifies the output files for storing the results of parameter training. At most three output files are generated.

Example tag:

<output_files XMLFile=”trained_HMM.xml”
TProbFile=”trained_transition.txt”
EProbFile=”trained_emission.txt”/>

Figure 4.2: A part of the pair-HMM of Doublescan. In order to train the free transition parameter Phase0 with id FTP.0, all transitions from state 3 to any state representing a splice site of phase 0 have to be counted, i.e. transitions 3 → 14, 3 → 17 and 3 → 20. The group transition GTP.1 is set up correspondingly.

The attributes for the <output_files> tag are:

• XMLFile: This attribute specifies the name of the XML output file describing the HMM after the training process, i.e. if transition probabilities are trained, the updated values appear in this output XML file. This makes it convenient for users to compare the HMM before and after parameter training and to set up the new HMM for data analyses using one of the decoding algorithms.

• TProbFile: This attribute specifies the name of the output file that stores the trained free transition parameters. This file has the same format as the input free transition parameter file.

• EProbFile: This attribute specifies the name of the output file for storing the trained emission parameters. This file has the same format as the input free emission parameter file.

4.3 Other text input files

HMMConverter takes several flat text files as input for setting up several features of an HMM. In this section, we describe the format of these text input files in detail.

4.3.1 The input sequence file

This file stores the input sequences for the decoding or training algorithms in a fasta-like format (note that HMMConverter is not limited to Bioinformatics applications). All sequences in this input sequence file are considered for the specified algorithm. The following is an example of an input sequence file:

>Mm.X13235.5 1-637

gggaatgaagtttttctgcaggatttaaatgtggtctttaagagacaccgcatgcaaaga

atagctggggcttgctagccaatgaaaacattcagattccaatgacgcatccttttttct

ccacccccttccaagacccggattcggaaaccccgcctaacgctctagttttcaaccagg

tccgcagaaggcctatttaaagggacgattgctgtctccctgctgtcataaccatgtctg

gacgtggcaagggtggtaaaggccttgggaaaggcggcgctaagcgccaccgtaaggttc

tccgcgataacatccagggcatcaccaagcctgccatccgccgcctggcccggcgcgggg

gagtgaagcgcatctccggcctcatctacgaggagacccgcggtgtgctgaaggtgttcc

tggagaacgtgatccgcgacgccgtcacctacacggagcacgccaagcgcaagaccgtca

ccgccatggacgtggtctacgcgctcaagcgccagggccgcactctctacggattcggcg

gttaatcgactaacaaacgattttccactgtcaacaaaaggcccttttcagggccaccca

caaattcctagaaggagttgttcacttaccgaagctt

>Hs.M16707.H4F2.8 85-1098

ccgtcctccttcggcgaagatccctggcgcgcgtccttgaggtcgccttcggtgttgacc

tcatcgtcggaacggcgcttcctgaagctttatataagcacggctctgaatccgctcgtc

ggattaaatcctgcgctggcgtcctgccagtctctcgctccatttgctcttcctgaggct

ccctccagagacctttcccttagcctcagtgcgaatgcttccgggcgtcctcagaaccag

agcacagccaaagccactacagaatccggaagcccggttgggatctgaattctcccgggg

accgttgcgtaggcgttaaaaaaaaaaaagagtgagagggacctgagcagagtggaggag

gagggagaggaaaacagaaaagaaatgacgaaatgtcgagagggcggggacaattgagaa

cgcttcccgccggcgcgctttcggttttcaatctggtccgatactcttgtatatcagggg

aagacggtgctcgccttgacagaagctgtctatcgggctccagcggtcatgtccggcaga


ggaaagggcggaaaaggcttaggcaaagggggcgctaagcgccaccgcaaggtcttgaga

gacaacattcagggcatcaccaagcctgccattcggcgtctagctcggcgtggcggcgtt

aagcggatctctggcctcatttacgaggagacccgcggtgtgctgaaagtgttcttggag

aatgtgattcgggacgcagtcacctacaccgagcacgccaagcgcaagaccgtcacagcc

atggatgtggtgtacgcgctcaagcgccaggggagaaccctctacggcttcggaggctag

gcgccgctccagctttgcacgtttcgatcccaaaggccctttttgggccgaccacttgct

catcctgaggagttggacacttgactgcgtaaagtgcaacagtaacgatgttggaaggta

actttggcagtggggcgacaatcggatctgaagttaacggaaagacataaccgc

If a pair-HMM is used for an analysis, the entries in the input sequence file are interpreted and analyzed in pairs, i.e. the first two sequences are regarded as a pair, then the third and the fourth, etc. The header line has to start with the character ”>”, followed by the name of the sequence (a string without breaks) and the start and end positions of the sequence that follows below the header line; the start position always has to be smaller than or equal to the end position, and both numbers have to be integers. The name and the positions are separated by a space or tab character. Other information about the sequence can be put on the header line after the range; it is for the user's reference only.

4.3.2 The free emission parameter file

This flat text file is required for setting up the emission probabilities of an HMM. We call a grouped set of emission probabilities which defines the emission probabilities for one state of the HMM a free emission parameter; these free emission parameters are all stored in this file.

We specify a free emission parameter with a header line of the following format:


FEP.1 Stop_exon 6 train

The first three columns contain the id, the name of the emission parameter and the dimension of the parameter, respectively; the fourth column indicates whether this parameter is to be included in the training process. The four columns are separated by a space or a tab character.

The id has to be identical to the one defined in the XML-input file; the name of the emission parameter is optional and defaults to “Not defined”. Similarly, the dimension specified in the header line has to be consistent with the dimensions of the emission probabilities that follow the header line. If the parameter set does not need to be trained, the word ”train” should be omitted from the header line. The program can correctly interpret a header line with only 3 entries by recognizing the position of the keyword ”train” and the position of the dimension parameter, which is the only entry with a numerical value. A sketch of this interpretation is given below.
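The following C++ sketch illustrates how such a header line could be interpreted. It is an illustration of the parsing rule just described, with hypothetical names; it is not the actual HMMConverter parser:

#include <cctype>
#include <cstdlib>
#include <sstream>
#include <string>

// Hedged sketch: the keyword "train" and the single numeric token (the
// dimension) are recognized by content, so the optional name column can be
// identified as the remaining token.
struct FepHeader {
    std::string id;
    std::string name;
    int dimension;
    bool train;
};

FepHeader parse_fep_header(const std::string& line)
{
    FepHeader h;
    h.name = "Not defined";  // default if the optional name column is absent
    h.dimension = 0;
    h.train = false;

    std::istringstream in(line);
    std::string token;
    in >> h.id;                                        // first column: the id, e.g. "FEP.1"
    while (in >> token) {
        if (token == "train")
            h.train = true;                            // keyword column
        else if (std::isdigit(static_cast<unsigned char>(token[0])))
            h.dimension = std::atoi(token.c_str());    // the only numeric column
        else
            h.name = token;                            // optional name column
    }
    return h;
}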

Below the header line, the emission probabilities are defined using the following format:

taataa 0.5082690 0.10

taatag 0.0632332 0.05

taatga 0.0943092 0.05

tagtaa 0.0632332 0.05

tagtag 0.0607797 0.05

tagtga 0.0160113 0.01

tgataa 0.0943092 0.05

tgatag 0.0160113 0.01

tgatga 0.0838441 0.05

The first column is the string of characters that corresponds to the emission probability in the second column. The third column is the value of an optional pseudo-probability to be used in training, which defaults to 0 if it is not specified. The three columns are separated by a space or a tab character.

Emission probabilities of unspecified character combinations are set to zero. We can use this feature to specify the emission probabilities of some states in a very compact way. For example, if a Match_Start_Codon state reads only ”atgatg” with an emission probability larger than zero, we can specify its emission probabilities by:

FEP.1 Match_Start_Codon 6

atgatg 1

After specifying one free emission parameter, there needs to be an empty line before specifying another parameter.

4.3.3 The free transition parameter file

This file defines the free transition parameters. This input file is not required if the transition probabilities are not parameterized. Each line in the file defines a free transition parameter using the following format:

FTP.0 Phase0 0.4387 0.3333
FTP.1 Phase1 0.3870 0.3333
...
FTP.16 Match_exon_to_stop_exon 0.003 0.01
FTP.17 Match_exon_to_emit_exon_x 0.01
FTP.18 Match_exon_to_emit_exon_y 0.01


The four columns contain the id, the name, the value and the pseudo-probability of the transition parameter, respectively. The id has to be identical to the one defined in the XML-input file. The name is an optional attribute whose default value is ”Not defined”. The default value for the pseudo-probability is 0. In this free transition parameter file, each line represents one free transition parameter; there is no need for an empty line between two parameter entries.

4.3.4 The tube file

This input file stores the user-defined restricted search space in the sequence dimension(s) to be used for sequence decoding or parameter training. Note that the banding feature only applies to pair-HMMs.

The tubes for all input sequence pairs have to be specified in a single tube file. A header line is required for specifying the tube for each sequence pair, using the following format:

>Mm.X13235.5&Hs.X60484.H4/e.6 radius:50

Each header line starts with the character ”>”, followed by the names of the sequence pair to which the tube will be applied. The names of the two sequences are connected by the character ’&’, i.e. in the above example, the names of the two input sequences are Mm.X13235.5 and Hs.X60484.H4/e.6. The radius is defined in the next field of the header line if the coordinates specified below represent matches.

The coordinates for specifying the tube itself are given in the following format:

300 400 450 550

600 650 800 750

The four columns of each line are ”xstart”, ”ystart”, ”xend” and ”yend”, respectively, which together define two points in the X-Y plane: (xstart, ystart) and (xend, yend). Depending on the desired interpretation, each line thus either specifies a rectangular area (see the left part of figure 4.3) or a match (see the right part of figure 4.3). Figure 4.3 also shows two possible ways of defining a tube from the same set of coordinates. Note that if the radius is not defined in the header line, the coordinates are interpreted as rectangles rather than matches.

Figure 4.3: Specifying a tube explicitly in a flat text file. This figure has already been introduced in chapter 3. The tube in the right part is constructed by interpreting the coordinates as matches (with a radius of 50 around them), while the tube on the left is constructed by interpreting the coordinates as rectangles.
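To make the match interpretation concrete, the following C++ sketch shows how a match line could be expanded into a rectangular region by adding the radius around it. The names and the clamping to the sequence boundaries are illustrative assumptions, not HMMConverter's actual code:

// Hedged sketch: expanding a match (xs, ys, xe, ye) from the tube file into a
// rectangular region of the X-Y search plane by adding the radius r around it,
// roughly as in figure 4.3.
struct Rectangle {
    int x_start, y_start, x_end, y_end;
};

Rectangle match_to_rectangle(int xs, int ys, int xe, int ye,
                             int r, int length_x, int length_y)
{
    Rectangle rect;
    rect.x_start = xs - r > 1 ? xs - r : 1;               // clamp to sequence start
    rect.y_start = ys - r > 1 ? ys - r : 1;
    rect.x_end   = xe + r < length_x ? xe + r : length_x; // clamp to sequence end
    rect.y_end   = ye + r < length_y ? ye + r : length_y;
    return rect;
}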

4.3.5 The prior information file

This input file stores prior information along the input sequences to be used with the special emission feature of HMMConverter. Each line of this input file corresponds to one piece of information for one sequence interval along one input sequence. It is specified using the following format:

SeqX Source1 100 150 Feature:Exon:Intron 0.3:0.7


Phase:NoPhase:Phase0 .:.

The above example actually corresponds to a single line in the prior information file; we only split it into two lines here for better readability. Each line contains one piece of prior information for one sequence interval and consists of several columns separated by a space or a tab character.

The first four columns specify:

• SeqX: The name of the sequence to which this piece of information applies.

• Source1: The source of this piece of information, e.g. a database name or the name of a human annotation expert.

• 100: The start of the sequence interval to which this information applies.

• 150: The end of the sequence interval to which this information applies.

The prior information itself is defined starting in the fifth column. The above example defines two types of prior information, information of type ”Feature” and of type ”Phase”:

(i) Feature:Exon:Intron 0.3:0.7

(ii) Phase:NoPhase:Phase0 .:.

For each type of prior information, the first entry specifies the annotation label set and the annotation labels, e.g. ”Feature:Exon:Intron”, and the following entry specifies the corresponding prior probabilities, e.g. 0.3 for ”Exon” and 0.7 for ”Intron”. Users can define any number of types of prior information within a line, as long as they concern the same sequence interval.

We now consider the prior information for annotation label set ”Feature” in (i) above. It means that the sequence interval [100, 150] has a prior probability of 0.3 for being ”Exon” and a prior probability of 0.7 for being ”Intron”. The prior information for annotation label set ”Phase” in (ii) has the same format, but the prior probability is ’.’ instead of a numerical value, because the annotation label set “Phase” is an on/off label set. In this example it means that the sequence positions in the interval [100, 150] have value ”NoPhase” or ”Phase0”.

Finally, suppose there was a third annotation label set “State Type” defined for the HMM. As no information for this label set is defined for sequence interval [100, 150], the nominal emissions are not biased there by this label set. The concept of special emissions in HMMConverter and how to specify valid prior probabilities are discussed in detail in chapter 3.

4.4 The output files of HMMConverter

HMMConverter provides two simple output files for storing the results of sequence decoding. The first file shows the sequence of states in the optimal state path, and the second, optional file shows the resulting annotation of the input sequences.

Figure 4.4 shows a simple 5-state pair-HMM. This pair-HMM uses two annotation label sets, “Set1” = {”align”, ”gap”} and “Set2” = {”match”, ”emitX”, ”emitY”}. Each state (except the start state and the end state) of the pair-HMM is assigned one annotation label from each of the annotation label sets, as shown in the figure. In this example, two output files are generated after running the Viterbi algorithm on a pair of input sequences SeqX and SeqY of lengths 10 and 15, respectively. The first file looks like:

SeqX&SeqY:

0 0 start xdim= 0 ydim= 0

1 3 Emit Y xdim= 0 ydim= 1 Set1 y0=gap Set2 y0=emitY

2 3 Emit Y xdim= 0 ydim= 1 Set1 y0=gap Set2 y0=emitY

3 3 Emit Y xdim= 0 ydim= 1 Set1 y0=gap Set2 y0=emitY

4 3 Emit Y xdim= 0 ydim= 1 Set1 y0=gap Set2 y0=emitY

5 3 Emit Y xdim= 0 ydim= 1 Set1 y0=gap Set2 y0=emitY

6 1 Match xdim= 1 ydim= 1 Set1 x0=align Set1 y0=align Set2 x0=match Set2 y0=match

7 2 Emit X xdim= 1 ydim= 0 Set1 x0=gap Set2 x0=emitX
8 1 Match xdim= 1 ydim= 1 Set1 x0=align Set1 y0=align Set2 x0=match Set2 y0=match
9 1 Match xdim= 1 ydim= 1 Set1 x0=align Set1 y0=align Set2 x0=match Set2 y0=match
10 2 Emit X xdim= 1 ydim= 0 Set1 x0=gap Set2 x0=emitX
11 2 Emit X xdim= 1 ydim= 0 Set1 x0=gap Set2 x0=emitX
12 1 Match xdim= 1 ydim= 1 Set1 x0=align Set1 y0=align Set2 x0=match Set2 y0=match
13 1 Match xdim= 1 ydim= 1 Set1 x0=align Set1 y0=align Set2 x0=match Set2 y0=match
14 1 Match xdim= 1 ydim= 1 Set1 x0=align Set1 y0=align Set2 x0=match Set2 y0=match
15 3 Emit Y xdim= 0 ydim= 1 Set1 y0=gap Set2 y0=emitY
16 3 Emit Y xdim= 0 ydim= 1 Set1 y0=gap Set2 y0=emitY
17 3 Emit Y xdim= 0 ydim= 1 Set1 y0=gap Set2 y0=emitY
18 1 Match xdim= 1 ydim= 1 Set1 x0=align Set1 y0=align Set2 x0=match Set2 y0=match
19 3 end xdim= 0 ydim= 0

Figure 4.4: A simple 5-state pair-HMM. This pair-HMM contains two annotation label sets, “Set1” and “Set2”. Each sequence position read by a state of the pair-HMM is assigned one annotation label from each of the annotation label sets. For example, the sequence positions of X and Y in the Match state are assigned the labels ”align” and ”match” from annotation label sets “Set1” and “Set2”, respectively.

Let the optimal state path be $S := (S_0, S_1, \dots, S_{N'})$; the first output file then contains $N' + 1$ lines, one line for each state in the state path. The first line always represents the start state and the last line always represents the end state. The first column shows the line number, which is also the index in the state path. The second and third columns show the number and the name of the state in the HMM. The fourth and fifth columns specify the number of characters read from sequence x and the number of characters read from sequence y, respectively. Starting from the sixth column, each column contains the annotation label for the corresponding sequence position read by the state. For example, position 6 of the state path (the 7-th line in the output file) is a Match state, for which the sequence positions read from X and Y are both assigned annotation label ”align” from annotation label set “Set1” and annotation label ”match” from annotation label set “Set2”.

The second output file shows the annotation of the sequences. If no annotation label set is defined for the HMM, this output file is not generated. The following is the second output file for the same state path and the same pair of input sequences:

SeqX SimplePair-HMM 1 1 Set1=align Set2=match


SeqX SimplePair-HMM 1 1 Set1=align Set2=match

SeqX SimplePair-HMM 2 2 Set1=gap Set2=emitX

SeqX SimplePair-HMM 3 4 Set1=align Set2=match

SeqX SimplePair-HMM 5 6 Set1=gap Set2=emitX

SeqX SimplePair-HMM 7 10 Set1=align Set2=match

SeqY SimplePair-HMM 1 5 Set1=gap Set2=emitY

SeqY SimplePair-HMM 6 11 Set1=align Set2=match

SeqY SimplePair-HMM 12 14 Set1=gap Set2=emitY

SeqY SimplePair-HMM 15 15 Set1=align Set2=match

Each line represents a contiguous sequence interval that is assigned the same annotation, i.e. an interval where all the annotation labels have the same values. Regarding the format, the first column is the sequence name, and the second column is the name of the HMM, as specified by the user in the XML-input file. The third and fourth columns indicate the interval of sequence positions that share the same annotation; the corresponding annotation labels are shown after that.

If the HMM is used with any of the parameter training algorithms, HMMConverter writes three output files to store the results of the parameter training: one XML file and two flat text files. The XML output file is the same as the XML-input file that describes the HMM, but contains the trained transition probabilities. The two flat text files are the file with the trained free transition parameters and the file with the trained free emission parameters. They have the same format as the corresponding input files, but contain the trained parameter values.


4.5 Examples

The HMMConverter software package comprises several examples with all the input and output files. These examples illustrate the range of features supported by HMMConverter.

CpG-Island prediction This simple HMM is introduced in [11] for predicting CpG islands in a potentially long genomic DNA sequence. This simple example shows how to specify a simple HMM with annotation labels.

Dishonest Casino The example of a dishonest casino HMM is introduced in [11]; a variation of this HMM using special emissions is also included. The special emissions version of the HMM can be interpreted as incorporating the expert knowledge of a casino insider who provides additional information on when the loaded rather than the fair die was used. This simple example illustrates how to use special emissions.

p53 transcription factor binding site prediction This HMM can be used to predict p53 transcription factor binding sites in genomic DNA input sequences [17]. This example shows how parameter training with HMMConverter works. We use the same training set and test set as MAMOT [43].

Simple pair-HMM This is the simple pair-HMM shown in figure 4.4. This simple example illustrates how to specify a pair-HMM and how to train its parameters. The example comprises an artificially created training set.


Chapter 5

Implementation of HMMConverter

5.1 Introduction

HMMConverter is a C++ executable that is run from the command line. Users can start it using the following command:

HMMConverter input_directory/ output_directory/ HMM.xml

where HMMConverter is the name of the executable and HMM.xml is the XML-input file specifying the HMM and the analyses to be performed. Note that the name of the XML file is not restricted to HMM.xml. All input files, including this XML-input file, must be contained in the input directory. Similarly, all output files are written to the output directory.

In order to construct an HMM with HMMConverter, the user only has to supply the corresponding input files; no compilation is needed. We provide the source code files, including a Makefile, with the software package for the users' convenience.

HMMConverter is based on the original implementation of Doublebuild [33], which corresponds to a set of C++ classes for constructing pair-HMMs. These classes provide a good framework for constructing HMMConverter, but as the code had been designed for the gene prediction pair-HMMs underlying Doublescan and Projector, the original code had to be modified. In addition, new classes were added for special features, and some redundancies were removed.


In this chapter, we first give a very brief introduction to the XML language and the connection between HMMConverter and the XML-input file. We then describe several important classes, and finally focus on the implementations of the algorithms for sequence decoding and parameter training. The interplay between the classes is also described in detail and their structure explained.

5.2 Converting XML into C++ code

We chose XML for describing the HMM because it is a popular, easy to understand and powerful language. XML is very popular for structuring, storing and exchanging information; it is an HTML-like markup language in which all tags can be defined by the user, which provides great flexibility for representing data structures. For these reasons, XML has been used widely in many Bioinformatics applications, including genomic annotation, protein specification, etc.

HMMConverter uses the TinyXml [28] package to parse the XML-input file into C++ data structures. Most of the attributes of the HMM described in the XML file are first read into the model_parameters class.

5.2.1 The model_parameters class

A model_parameters object stores almost all the attributes defined in the <model> tag of the XML file. In the first step of building the specified HMM with HMMConverter, all the information about the HMM from the XML-input file, except its topological structure and parameters, is read and stored in this object. In the other stages of constructing the HMM and when invoking the different algorithms, this object interacts with the other classes by supplying information about the HMM, such as the name of an annotation label or the name of a state.

In the following, we list some private variables of the model_parameters class that store information about the specified HMM:

• char* Model_Name: The name of the HMM.

• boolean pair: A boolean variable indicating whether this is an HMM or a pair-HMM.

• int Total_Number_of_Annotation_Labels: The number of annotation label sets.

• Label* Annotation_Label: The array of annotation label sets and the corresponding labels in each annotation label set.

Label is a structure with four attributes which together describe one annotation label set (a C++ sketch of the Label and Letter structures is given after this list):

– int size: The number of annotation labels in the annotation label set.

– char* tname: The name of the annotation label set.

– char** name: The array of names of the labels in the set.

– boolean score: This boolean variable indicates whether this annotation label set can be used with prior probabilities (TRUE) or only with on/off values (FALSE).

• Letter Alphabet: The alphabet of symbols that can be read by the HMM.

Letter is a structure with three attributes:

– int size: The number of symbols in the alphabet.

– boolean cases: This boolean variable indicates whether the HMM is case sensitive when reading characters from the input sequences.

– char* name: The symbols of the alphabet, e.g. if the alphabet of the HMM is {A, C, G, T}, then the value of this variable is ”ACGT”.

• int Number_of_States: The number of states in the HMM, including the start and end state.

• boolean SpecialEmit: This boolean variable indicates whether special emissions are used (TRUE) or not (FALSE).
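For concreteness, the Label and Letter structures described above might look as follows in C++. This sketch follows the member lists in the text, with bool standing in for the boolean type used in the document, and may differ in detail from the actual definitions:

// Hedged sketch of the two structures described above; member names follow
// the text, the actual definitions in HMMConverter may differ in detail.
struct Label {
    int    size;   // number of annotation labels in the set
    char*  tname;  // name of the annotation label set
    char** name;   // names of the individual labels in the set
    bool   score;  // TRUE: usable with prior probabilities, FALSE: on/off only
};

struct Letter {
    int   size;   // number of symbols in the alphabet
    bool  cases;  // whether reading input characters is case sensitive
    char* name;   // the symbols themselves, e.g. "ACGT"
};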

5.2.2 Other classes

The Hmm class, the Hmm_State class and the Sequence class are three other key classes of HMMConverter; they were derived from the Pairhmm class, the Pairhmm_State class and the Sequence class of Doublebuild, respectively. These classes contain most of the functional components of HMMConverter. We describe each of them in more detail in the following.

The Hmm class

An Hmm object essentially knows everything about a specified HMM. It contains an array of Hmm_State objects, and the sequence decoding and parameter training algorithms are implemented as its public functions.

The Hmm class consists of the following private variables:

• boolean mirrored: This boolean variable indicates whether this is a ”mirror” HMM (TRUE) or not (FALSE). The ”mirror model” is needed by the Hirschberg algorithm.

• boolean pair: This boolean variable indicates whether this is a pair-HMM or not.

• int number_of_states: The number of states in the HMM.

• int alphabet: The number of symbols in the alphabet.

• Pairhmm_State* model: An array with pointers to each state of the HMM.


The Hmm object only knows the size of the alphabet but not the symbols it contains. When the algorithms are run, this class obtains the corresponding information from the model_parameters object when necessary.

The Sequence class

The Sequence class stores all the information about one input sequence. If special emissions are used, it also stores the prior information of the different annotation label sets for all positions in this input sequence; this information is read from the prior information file into this object. A mapping and scoring function is used to calculate the special emission probabilities for each state according to the position-specific prior probabilities, and these probabilities are finally stored in the corresponding Hmm_State object.

The most important private variables of the Sequence class are:

• int* sequence: The sequence, represented by integers; the corresponding characters can be derived via the model_parameters class.

• int start_position: The start position of the sequence as indicated in the header line of the input file with sequences.

• int end_position: The end position of the sequence as indicated in the header line of the input file with sequences.

• int length_of_sequence: The length of the sequence. This number is equal to end_position − start_position + 1.

• char* ac: The name of the sequence as indicated in the header line of the input file with sequences.


The above variables are always used when creating a Sequence object. If special emissions are switched on, three additional arrays are needed to store the corresponding information. The class ”array”, used to define some of the variables listed below, is provided by Doublebuild for constructing multi-dimensional arrays.

• boolean**** annotation_labels: This four-dimensional boolean array indicates which types of prior information exist for the sequence. The four indices of this array are [source, sequence position, annotation label set, annotation label], respectively. A value of ”on” (TRUE) indicates that the corresponding label exists for that particular position and that particular source. The source could be a database name or the name of a human annotation expert.

• double**** annotation_label_scores: This four-dimensional array stores the prior information for the sequence. The four indices of this array are [source, sequence position, annotation label set, annotation label], respectively. The corresponding entry contains the corresponding prior probability.

• array<Score> scores_for_special_emissions: This two-dimensional array stores the scores of the special emissions. The two indices are [index of a state with special emissions, sequence position], respectively.

The Hmm_State class

The Hmm_State objects form the building blocks of any HMM. Every Hmm_State object contains all the implementation for a state in an HMM, as well as limited information about the states to which this state is connected by transitions.

Some of the private variables are only used for special emissions. We first describe the most important private variables and then the variables for special emissions.


• boolean mirrored: This boolean variable indicates whether this is a state of a ”mirror” model (TRUE) or not (FALSE).

• boolean special_emission: This boolean variable indicates whether this state uses special emissions (TRUE) or not (FALSE).

• int number_of_state: The number of this state in the HMM.

• int number_of_states: The number of states in the HMM.

• int number_of_letters_to_read: The total number of characters read by this state.

• int number_of_letters_to_read_x: The number of characters read by this state from sequence x.

• int number_of_letters_to_read_y: The number of characters read by this state from sequence y.

• int number_of_previous_states: The number of states that have a transition to this state.

• array<int> numbers_of_previous_states: Array of the numbers of the states that have a transition to this state.

• int number_of_next_states: The number of states that this state has a transition to.

• array<int> numbers_of_next_states: Array of the numbers of the states that this state has a transition to.

• array<char*> transition_probs_expressions: Array of strings storing the derivation rules for deriving the transition probabilities. The index of this array corresponds to the state number of the respective next state.

• array<Prob> transition_probs: Array of transition probabilities of this state. The index of this array corresponds to the state number of the respective next state.

• array<Score> transition_scores: Array of transition scores of this state. The index of this array corresponds to the state number of the respective next state.

• array<Prob> emission_probs: Array of emission probabilities of this state. The index of this array corresponds to the combination of letters read by this state. For example, if the alphabet consists of 4 letters and this state reads 3 letters at a time, then this array has length 4^3 = 64. The different combinations of letters are converted into an index for this array (a sketch of this conversion is given after the next paragraph).

• array<Score> emission_scores: Array of emission scores of this state. The index of this array has the same meaning as that of the array emission_probs.

The data types Prob and Score are used for variables representing probabilities and logarithms of probabilities; a further explanation of Score is given in the last section of this chapter.
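The conversion of a combination of letters into an array index, mentioned for emission_probs above, can be illustrated as follows. This is a hedged sketch of the base-|alphabet| encoding described in the text, not the actual HMMConverter code:

// Hedged sketch: a combination of integer-coded letters is interpreted as a
// number in base |alphabet|, so a state reading 3 letters over a 4-letter
// alphabet indexes an array of 4^3 = 64 entries.
int emission_index(const int* letters, int number_of_letters, int alphabet_size)
{
    int index = 0;
    for (int i = 0; i < number_of_letters; ++i)
        index = index * alphabet_size + letters[i];
    return index;
}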

The following variables are only used to store information about special emissions:

• int number_of_annotation_labels: The number of annotation label sets defined for this state.

• array<int>* annotation_labels: Array of the annotation labels of all annotation label sets for each letter read by the state. The first index indicates the annotation label set, and the second index indicates the position of the letter for which this annotation label is defined. Each value in this two-dimensional array corresponds to an annotation label.


• char** annotation_labels_type: Array of the names of the annotation label sets defined for this state. The index indicates the number of the annotation label set, and the value is the name of the corresponding annotation label set.

• int number_of_child_state_x: The number of the state that stores the prior information from input sequence x for this state. This variable is only used for states that read characters from input sequence x; otherwise it is set to 0.

• int number_of_child_state_y: The number of the state that stores the prior information from input sequence y for this state. This variable is only used for states that read characters from input sequence y; otherwise it is set to 0.

5.3 Sequence decoding and parameter training algorithms

HMMConverter supports several prediction and training algorithms, among them the Hirschberg algorithm [16] and several linear memory algorithms for parameter training, two of which are novel. In this section, we give the implementation details of these algorithms.

5.3.1 The Viterbi and Hirschberg algorithms

The Viterbi [47] and Hirschberg [16] algorithms are implemented based on the corresponding implementations in Doublebuild. Several implementation issues are worth mentioning in this section.

Banding The sequence decoding algorithms are implemented using a linearized dynamic programming table, where each entry corresponds to an object containing a score and a back-tracking pointer. The index into this table is calculated in a very time-efficient way using bit-shifting, and the table itself is stored in the most memory-efficient way. If any decoding or training algorithm is used with a tube, memory is only allocated for the matrix elements inside this tube. A sketch of such an index computation is given below.
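The following C++ fragment illustrates one way a bit-shifted index into a linearized table can be computed. It is an illustration of the technique named above under assumed conventions, not HMMConverter's actual memory layout:

// Hedged sketch: if the state dimension is padded to a power of two,
// 1 << STATE_BITS, the multiplication by the number of states needed for a
// flat (x, y, state) index becomes a cheap shift.
const int STATE_BITS = 6;  // supports up to 64 states (illustrative value)

inline long flat_index(long x, long y, long length_y, int state)
{
    return (((x * length_y) + y) << STATE_BITS) | state;
}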

Numerical difficulties In the Viterbi algorithm, the entries of the Viterbi matrix are filled according to the following recurrence relation:

$v_i(k) = e_i(k) \max_n \{ v_n(k-1) \, t_{n,i} \}$

The (i, k) entry in the Viterbi matrix, for sequence position k and state i, stores the probability of the state path with the highest overall probability that ends in state i at that sequence position. For high values of k (i.e. long sequences) or low emission and transition probabilities, this number can get very small and cause underflow errors. A simple way to deal with this numerical difficulty is to perform the Viterbi algorithm in logarithm space. The recurrence relation of the Viterbi algorithm then becomes:

$\log v_i(k) = \log e_i(k) + \max_n \{ \log v_n(k-1) + \log t_{n,i} \}$

This strategy works perfectly for the Viterbi algorithm: it avoids the underflow problem, and as the logarithm is a strictly monotonic function, it preserves the choice of the maximum state. For this purpose, transition and emission scores are pre-computed from the corresponding probabilities and stored in the corresponding arrays of the Hmm_State object, so that the scores do not have to be computed repeatedly during the Viterbi algorithm.
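The following C++ sketch shows one column of the log-space recursion above for a simple (non-pair) HMM. The container-based interface is an illustrative assumption; HMMConverter itself works with the pre-computed Score arrays stored in its Hmm_State objects:

#include <algorithm>
#include <limits>
#include <vector>

// Hedged sketch of one column of the log-space Viterbi recursion shown above.
void viterbi_column(const std::vector<std::vector<double> >& log_t, // log t_{n,i}
                    const std::vector<double>& log_e_k,             // log e_i(k)
                    const std::vector<double>& v_prev,              // log v_n(k-1)
                    std::vector<double>& v_cur)                     // out: log v_i(k)
{
    const double neg_inf = -std::numeric_limits<double>::infinity();
    for (size_t i = 0; i < v_cur.size(); ++i) {
        double best = neg_inf;
        for (size_t n = 0; n < v_prev.size(); ++n)
            best = std::max(best, v_prev[n] + log_t[n][i]);
        v_cur[i] = log_e_k[i] + best;
    }
}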

Special emissions In any decoding algorithm, the emission probability of a state is obtained via the public function get_emission_score of the Hmm_State class. This function returns the appropriate emission score: it returns the nominal emission score if special emissions are not used or if no prior information is supplied for the particular sequence position, and the modified special emission score otherwise.
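The decision made inside get_emission_score can be summarized as follows. This hedged sketch uses a free-standing function with hypothetical parameter names; the real version is a member function of the Hmm_State class:

// Hedged sketch of the dispatch performed by get_emission_score as described
// above (hypothetical free-standing form).
double choose_emission_score(bool state_uses_special_emissions,
                             bool prior_information_at_position,
                             double nominal_score, double special_score)
{
    if (!state_uses_special_emissions || !prior_information_at_position)
        return nominal_score;   // unbiased emission score
    return special_score;       // score biased by the prior information
}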


5.3.2 The parameter training algorithms

All training algorithms can also be combined with banding and special emissions, and the corresponding implementation of these two features is similar to the one discussed above. But as these algorithms already correspond to linear-memory versions, there is no need to linearize the dynamic programming table.

On the other hand, any training algorithm can potentially generate underflow errors due to many probabilities being multiplied (and added). The situation for training is a bit more complicated than for the Viterbi algorithm: for all parameter training algorithms, the calculation of the forward values involves the summation of probabilities. If we perform the algorithm in logarithm space, we have to compute values like $\log\left(\sum_{i=0}^{n} p_i\right)$. In HMMConverter, we use the following identity to do this:

$\log\left(\sum_{i=0}^{n} p_i\right) = \log p_0 + \log\left(1 + \sum_{i=1}^{n} \exp(\log p_i - \log p_0)\right)$

This is convenient and efficient to implement because the scores of all transition and emission parameters are already stored in the corresponding state objects. There also exist alternative strategies to address the underflow problem, such as rescaling or creating a specific data type.
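A minimal C++ sketch of this identity is given below. Choosing p0 as the largest of the probabilities keeps all exponentials at most 1, a common refinement that further guards against underflow; the identity itself holds for any p0 > 0 taken from the sum:

#include <algorithm>
#include <cmath>
#include <vector>

// Hedged sketch of the identity above: summing probabilities that are only
// available as logarithms.
double log_sum(const std::vector<double>& log_p)
{
    double log_p0 = *std::max_element(log_p.begin(), log_p.end());
    double sum = 0.0;
    for (size_t i = 0; i < log_p.size(); ++i)
        sum += std::exp(log_p[i] - log_p0);  // includes the p0 term, exp(0) = 1
    return log_p0 + std::log(sum);
}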

The TransitionProb and EmissionProb objects The TransitionProb class and the EmissionProb class are implemented for the convenience of the parameter training algorithms.

A TransitionProb object stores the values of the free transition parameters with the corresponding pseudo-counts, as well as the group transitions defined for training the free transition parameters. It also stores the inverse derivation rules for deriving the free parameters from the exact transitions. If no free transition parameter is defined, this class serves as temporary storage for the transition probabilities in the training process. It also stores the pseudo-probabilities for the transitions.

An EmissionProb object stores the free emission parameters and the corresponding pseudo-probabilities. Hmm_State objects get their emission probabilities either directly from these parameters or derived from them. A function that automatically determines the states for training a particular emission parameter is included as a public function in this class, so the user only needs to specify which emission parameters to train.


Chapter 6

Annotaid

6.1 Introduction and Motivation

HMMs are widely used in Bioinformatics. One important area of research where HMMs are used is gene prediction. We have already discussed many variants of HMMs that have been developed for this particular application in chapter 1. These HMM-based methods gradually evolved from alignment-based methods (e.g. GeneWise [6] and PROCRUSTES [14]) and ab initio methods (e.g. Genscan [8]) to a combination of these two approaches, which we call hybrid methods (e.g. Twinscan [26], Doublescan [34] and Projector [36]).

In recent research, as more genomic data has become available, many HMMs have been developed to integrate prior information about the input sequences into the prediction process. Jigsaw [19], Augustus [44] and ExonHunter [7] are examples of gene prediction methods that can take various types of prior information into account. They are also capable of integrating information with different confidence scores into the gene prediction process, but they have a number of limitations. For example, they cannot deal with conflicting pieces of information, and one state in the underlying HMM can typically only capture prior information of one type. Also, as even more genomic data becomes available, there are many low-coverage genomes with erroneous or incomplete genome data. Ensembl [18] is an integrated analysis pipeline for gene prediction whose most current version tackles this particular problem.

With HMMConverter, we built a novel gene prediction pair-HMM called Annotaid, which is capable of integrating various types of prior information along the two input sequences into the gene prediction and parameter training process. The pair-HMM underlying Annotaid is the same as in Doublescan [34] and Projector [36]; however, the new model extends Projector in the following aspects:

(i) It can incorporate prior information for both input sequences instead of just one of the two input sequences as in Projector.

(ii) Every sequence position can be assigned more than one type of prior information, and the level of confidence for each type of information can be expressed in terms of a prior probability in [0, 1].

(iii) Several memory-efficient algorithms for parameter training can be used, two of which are novel.

All features provided by HMMConverter, such as the Hirschberg algorithm [16] and the parameter training algorithms, can be used in Annotaid. Annotaid uses several annotation label sets, and the prior information is incorporated via the special emission feature of HMMConverter. Furthermore, Annotaid checks for conflicting pieces of prior information.

The capability of integrating prior information on both input sequences is very important for parameter training, as the known annotation of the training sequences can thus considerably improve the outcome of the training. Using the efficient training algorithms provided by HMMConverter, Annotaid can be easily adapted to analyze any pair of genomes if corresponding training data of known orthologous gene pairs is available. In the following section, we describe the features and advantages of Annotaid.


6.2 Main features of Annotaid

Annotaid is a novel comparative gene prediction method which significantly extends Doublescan and Projector. It was constructed using the HMMConverter framework. Annotaid can predict partial, complete and multiple pairs of genes in the two genomic input DNA sequences. It can also align more diverged pairs of genes which differ by exon fusion or exon splitting. The main feature that leads to its superior performance is that the method aligns the two sequences while it simultaneously predicts genes. In this section, we describe the main features of Annotaid, some of which, such as banding and special emissions, are provided by HMMConverter. For more details on those features, the reader is referred to chapter 3.

6.2.1 The pair-HMM of Annotaid

The states and topological structure of Annotaid are the same as for Doublescan and Projector. The states can be sub-divided into three groups, each modeling a specific structural element of a gene: (1) protein-coding exons, (2) introns, (3) intergenic regions. The HMM consists of 54 states and 153 transitions. The transition probabilities are parameterized by 19 free transition parameters, and all emission probabilities are derived from the emission probabilities of 7 states, i.e. there are only 7 free emission parameters.

In Doublescan and Projector, the splice site signal is captured by an external splice site prediction program, Stratasplice [29], which predicts potential splice sites and assigns a confidence score to each such site. These scores are integrated into the prediction process using so-called special transitions, a special feature of Doublescan and Projector. Since Annotaid is built with HMMConverter, this special transition concept is not used. Instead, we convert the predictions by Stratasplice into prior probabilities and integrate them using special emissions. In this way, the splice site signal can still be integrated into the prediction process, and we can perform a fair comparison between Annotaid and Projector. For more details on the topological structure of the HMM and the special transition concept, readers can refer to the Doublescan and Projector papers [34, 36].

6.2.2 Banding and sequence decoding algorithms

Annotaid employs a generalized pair-HMM in order to annotate two input sequences simultaneously. It thus has to deal with memory and time constraints if the input sequences are long. As all parameter training algorithms provided in HMMConverter are implemented as linear memory algorithms, the memory requirement is not a problem for parameter training. For sequence decoding, the memory problem can also be solved by using the Hirschberg algorithm [16]. To address the time constraint, users can use the banding feature of HMMConverter, which forces the prediction and parameter training algorithms to run inside a tube-like sub-space of the search space. If the tube is well designed, it can greatly improve the speed of the algorithms while keeping the performance unchanged. As the Hirschberg algorithm can be run inside a tube, this combination allows predictions to be generated with minimal time and memory requirements, see figure 6.1. The concept of banding is described in detail in chapters 3 and 4; a discussion of this feature for the parameter training of Annotaid is given below.

6.2.3 Integrating prior information along the input sequences

Annotaid extends the special emission feature of Projector for integrating prior information on the input sequences into the prediction and parameter training. There are three main modifications: (1) Annotaid is capable of integrating prior information on both input sequences instead of only one input sequence. (2) Annotaid can incorporate prior information in terms of probabilities instead of only discrete “on/off” values, see figure 6.2. In addition, Annotaid can use prior information from different sources, and consistency checks are applied to detect conflicting information. (3) In Projector, the splice site signal is integrated into the prediction process using special transitions, while in Annotaid, this signal is converted into prior probabilities which are integrated using special emissions.

Figure 6.1: Combining the Hirschberg algorithm with a tube

With the first two improvements, Annotaid is much more amenable to parameter training, as the known annotations of both training sequences can be automatically integrated into the training. These improvements also make Annotaid better at predicting genes in low-coverage genomes, as partial information on one input sequence may compensate for missing parts of the other sequence, which can improve the resulting performance. The third modification allows a fair comparison between the published version of Projector and Annotaid, which has been implemented using HMMConverter. In this section, we describe each of the above features of Annotaid in greater detail; the theoretical details of the special emissions in HMMConverter are discussed in chapter 3. The special transitions are introduced in the Projector paper [36].

Figure 6.2: Special emissions in Projector and Annotaid. The left part shows how the special emissions are used in Projector to incorporate complete information about the known gene in one of the two input sequences. The right part shows Annotaid, which can accept incomplete and uncertain prior information on each input sequence. The dashed lines refer to evidence for exons with a prior probability of less than 100%.

In order to incorporate the prior probabilities into the gene prediction process, Annotaid

uses two annotation label sets:

(i) Feature: {Intergenic, Intron, Exon, CDS, Start Codon, Stop Codon}

(ii) Phase: {., 0, 1, 2}, where the four labels denote NoPhase, Phase 0, Phase 1 and Phase 2, respectively.

Different types of prior information are set with these annotation labels in terms of probabilities in the prior probability file. In Annotaid, only the annotation label set Feature is set to be a “score” label set, while Phase is set as an “on/off” label set. The concept of how special emissions bias the normal emission probabilities of a state is discussed in depth in chapter 3. In brief, figure 6.2 illustrates the differences in the special emission concept between Annotaid and Projector.
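As a minimal illustration of how a special emission might bias the normal emission probabilities of a state, consider the following sketch. It reflects our reading of the mechanism summarized above rather than the exact formulas of chapter 3, and the names (biased_emission_score, priors) are invented for illustration: the emission probability of a state is simply multiplied by the prior probability that the current position carries the state's annotation label.

    def biased_emission_score(state_label, emit_prob, priors):
        """Bias a state's emission probability at one sequence position.
        state_label: annotation label of the state, e.g. "Exon"
        emit_prob:   the state's normal emission probability here
        priors:      label -> prior probability at this position;
                     labels without prior information default to 1.0"""
        return emit_prob * priors.get(state_label, 1.0)

    # exon evidence with 80% prior probability at some position
    priors = {"Exon": 0.8, "Intron": 0.2}
    print(biased_emission_score("Exon", 0.05, priors))    # 0.04
    print(biased_emission_score("Intron", 0.05, priors))  # 0.01

States whose labels are supported by the prior information thus gain weight relative to competing states, while positions without prior information are left unchanged.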

6.2.4 Parameter training in Annotaid

As discussed in chapter 1, most HMM-based applications do not provide algorithms for pa-

rameter training except for a few gene prediction HMMs such as [25, 30, 46]. Generally, it

is much more challenging to train a pair-HMM than an HMM, for the following reasons: (1) Large amounts of training data are required to reliably train the larger number of transition and emission parameters. (2) Even if the annotation of each training sequence is known, the

pair of sequences can still be aligned in many ways, i.e. many valid state-paths need to be

considered. In other words, even with complete annotations, the training process is not just

a simple counting process for pair-HMMs as many state paths can yield the same annotation

of the input sequences.

Due to these challenges, relatively large training sets would be required to train Annotaid,

which would impose significant time and memory requirements. In order to alleviate these

problems, three special features are used for parameter training in Annotaid:

(i) Three linear memory parameter training algorithms are implemented, two of which are novel (see chapter 3 for details).

(ii) The special emissions in HMMConverter can be used to integrate prior information

about the annotation of the training sequences into the parameter training.

(iii) Tubes can be used to restrict the search space along the two sequence dimensions to

accelerate the parameter training.

We describe each of these features in the following paragraphs.


Linear memory parameter training algorithms As discussed in chapter 3, HMMConverter provides the linear memory Baum-Welch algorithm [35], a novel linear memory algorithm for training with sampled state paths and a new linear memory implementation of Viterbi training. Because Annotaid is built with HMMConverter, users can select any of these training algorithms in the XML-input file. The three algorithms require only O(NL) memory to train one parameter for an HMM with N states and two input sequences of length L, which essentially eliminates memory as a practical constraint. In addition, the user can specify the maximum amount of memory for parameter training, and HMMConverter will then optimize the run time accordingly by training as many parameters as possible in parallel.
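The following single-sequence sketch conveys the core idea behind this kind of linear memory training: expected counts are propagated forward alongside the forward variables, so the full dynamic programming table never has to be stored. It is a simplification in the spirit of [35], not the HMMConverter implementation; a production version would also use scaling or log-space arithmetic, and the pair-HMM variant runs over two input sequences.

    import numpy as np

    def expected_transition_count(a, e, x, s, t):
        """Expected number of times transition s -> t is used on the
        observed symbol sequence x, computed in one forward pass.
        a: (N, N) transition matrix; e: (N, K) emission matrix;
        only O(N) values are kept per position instead of O(N * len(x))."""
        n = a.shape[0]
        f = e[:, x[0]] / n          # forward variables, uniform start
        c = np.zeros(n)             # weighted partial counts per end state
        for sym in x[1:]:
            # counts are carried along every path extension ...
            c_new = (c @ a) * e[:, sym]
            # ... and incremented on paths that just used s -> t
            c_new[t] += f[s] * a[s, t] * e[t, sym]
            f = (f @ a) * e[:, sym]
            c = c_new
        return c.sum() / f.sum()    # expected count given the sequence

    a = np.array([[0.9, 0.1], [0.2, 0.8]])
    e = np.array([[0.7, 0.3], [0.4, 0.6]])
    print(expected_transition_count(a, e, [0, 1, 1, 0], s=0, t=1))

In Baum-Welch training, one such expected count is accumulated per free parameter and the parameters are then re-estimated from the counts; keeping one accumulator per parameter is what lets HMMConverter trade memory against the number of parameters trained in parallel.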

Special emissions used for parameter training Annotaid integrates the information on annotations of the training sequences into the parameter training process using the special emission feature in HMMConverter, and can even incorporate partial and incomplete annotations. This particular feature is very useful for adapting Annotaid to predict genes in genomes with low sequence coverage.

Banding used for parameter training The constraint of a large memory requirement is solved by using the linear memory training algorithms, but the time constraint still exists. In order to also address this problem, a user can combine the banding feature in HMMConverter with the training algorithms. A tube can either be derived from local TBlastx or Blastn [1] matches, see figure 6.3, or specified explicitly by the user. Please refer to chapter 4 for more details on how a tube can be specified. This can greatly accelerate training while retaining the same performance.
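As an illustration of how a tube could be derived from local matches, the following sketch interpolates between consecutive match anchors and pads the resulting diagonal by a fixed radius. It is our own simplification with invented names (tube_from_matches, radius); the precise construction used by HMMConverter is described in chapter 4.

    def tube_from_matches(matches, len_x, len_y, radius):
        """Derive a tube, i.e. one column interval per row of the
        dynamic programming matrix, from collinear local matches
        (e.g. TBlastx or Blastn anchor points)."""
        # anchor the tube at the corners so every row gets an interval
        anchors = [(0, 0)] + sorted(matches) + [(len_x, len_y)]
        lo_hi = []
        for (i0, j0), (i1, j1) in zip(anchors, anchors[1:]):
            for i in range(i0, i1):
                # linear interpolation of the expected diagonal
                j = j0 + (j1 - j0) * (i - i0) // max(i1 - i0, 1)
                lo_hi.append((max(0, j - radius), min(len_y, j + radius)))
        lo_hi.append((max(0, len_y - radius), len_y))  # final row
        return lo_hi

    # toy example: two match anchors roughly on the diagonal
    print(tube_from_matches([(10, 12), (40, 38)], 50, 50, radius=5)[:3])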

Figure 6.3: Combining the parameter training with a tube. This figure shows the projection of the three-dimensional search space onto the X-Y plane. It shows the tube that has been derived from local TBlastx or Blastn matches between the two input sequences; in particular, this tube was derived from matches with gaps. The training algorithm will consider the space inside the tube, thereby greatly accelerating the training process.

The parameterization of the transition and emission probabilities of Annotaid is the same as for Doublescan and Projector. With 53 states, it has only 19 free transition

parameters and 7 free emission parameters (each free emission parameter corresponds to a set of emission probabilities). In Annotaid, only the free parameters are trained, so the corresponding inverse formulae have to be defined for each parameter. The default setting of Annotaid trains 18 free transition parameters (excluding the To End parameter) and 3 free emission parameters; the Match Start Codon, Match 5' Splice Site and Match 3' Splice Site parameters are not trained as they correspond to particular consensus sequences. The Match Exon parameter is also not trained as it reads 6 letters at a time, corresponding to 4096 emission probabilities which cannot easily be reliably trained. In order to define the emission probabilities of the Match Exon state, a user can specify a table of amino acid alignment frequencies as well as two tables of codon frequencies (one for each organism) from which these probabilities are then derived. Once these values have been set, they can be further refined during training. The three emission parameters trained by default are Match Intergenic, Match Intron and Match Stop Codon; users can change the default setting by specifying which parameters are to be trained in the free emission parameter file. A sketch of how the Match Exon emission probabilities might be derived from such tables follows below.
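The exact derivation formula is not spelled out at this point of the thesis, but one plausible construction consistent with the description above is sketched below: the emission probability of a codon pair is taken to be proportional to the alignment frequency of the encoded amino acid pair, weighted by each organism's codon frequencies, and then normalized. All names are illustrative.

    from itertools import product

    def match_exon_emissions(aa_align_freq, codon_freq_1, codon_freq_2, genetic_code):
        """Derive emission probabilities over codon pairs from an amino
        acid alignment frequency table and two organism-specific codon
        frequency tables (one plausible construction, not Annotaid's)."""
        probs = {}
        for c1, c2 in product(codon_freq_1, codon_freq_2):
            aa_pair = (genetic_code[c1], genetic_code[c2])
            # weight the amino acid pair by how each organism realises it
            probs[(c1, c2)] = (aa_align_freq.get(aa_pair, 0.0)
                               * codon_freq_1[c1] * codon_freq_2[c2])
        total = sum(probs.values())
        return {pair: p / total for pair, p in probs.items()}

    # tiny toy alphabet: two codons coding for two amino acids
    code = {"ATG": "M", "TGG": "W"}
    aa = {("M", "M"): 0.6, ("W", "W"): 0.3, ("M", "W"): 0.05, ("W", "M"): 0.05}
    f1 = {"ATG": 0.5, "TGG": 0.5}
    f2 = {"ATG": 0.7, "TGG": 0.3}
    print(match_exon_emissions(aa, f1, f2, code)[("ATG", "ATG")])  # 0.75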

In summary, with the features discussed above, efficient parameter training can be performed for Annotaid, so that it can be adapted quickly to predict genes in new pairs of eukaryotic genomes.

6.3 Availability

Annotaid is available under the GNU Public License. The Annotaid software package includes the HMMConverter software, the XML file specifying Annotaid, and several parameter files. The software package can be downloaded free of charge from http://www.cs.ubc.ca/~irmtraud/HMMConverter/Annotaid. In order to use Annotaid, the user has to run HMMConverter with the Annotaid XML file. The execution of HMMConverter is described in chapter 3.

6.4 Discussion and Future Work

We have presented here Annotaid, a novel method for comparative gene prediction which significantly extends Doublescan [34] and Projector [36]. It has been built using the HMMConverter framework. Annotaid takes a pair of homologous DNA sequences and performs sequence alignment and gene prediction simultaneously on both input sequences. It can predict pairs of complete genes, partial genes and multiple genes on the input sequences. Annotaid integrates different kinds of prior information, together with their prior probabilities, into the prediction and parameter training using the concept of special emissions in HMMConverter. This feature is important for successful parameter training as it allows information on the known annotation of both input sequences to be integrated in order to derive better parameter values. We can thus expect Annotaid to generate high-quality results even for genome data with low sequence coverage. This is especially relevant for annotating the many genomes being sequenced today.

The HMMConverter framework provides all the required features and efficient algorithms, such as combining the Hirschberg algorithm [16] with a tube. The scores generated by Stratasplice for predicting potential splice sites are converted into prior probabilities and integrated into the prediction process using special emissions. A Perl script is also included in the Annotaid package, which can be used to generate Stratasplice predictions for the input sequences and which converts the results into an Annotaid input file with the prior probabilities.

With all the features discussed previously, Annotaid can be readily adapted to new pairs of eukaryotic genomes. At this stage, we have developed the model with all the required algorithms, and we have obtained an annotated training set for two recently sequenced pathogens. The pre-processing of the training set is now in progress; an in-depth analysis of the adaptation of Annotaid via parameter training will follow, and the overall performance of Annotaid will then be investigated.


Bibliography

[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment

search tool. Journal of Molecular Biology, 215:403–410, 1990.

[2] L.E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8, 1972.

[3] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler. GenBank.

Nucleic Acids Research, 34:D16–D20, 2006.

[4] A. Bird. CpG islands as gene markers in the vertebrate nucleus. Trends in Genetics,

3:342–347, 1987.

[5] E. Birney and R. Durbin. Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. In Fifth International Conference on Intelligent Systems for Molecular Biology, pages 56–64, Menlo Park, 1997. AAAI Press.

[6] E. Birney and R. Durbin. Using GeneWise in the Drosophila annotation experiment.

Genome Research, 10(4):547–548, 2000.

[7] B. Brejova, D.G. Brown, M. Li, and T. Vinar. ExonHunter: a comprehensive approach to gene finding. Bioinformatics, 21(Suppl 1):57–65, 2005.


[8] C. Burge and S. Karlin. Prediction of Complete Gene Structures in Human Genomic

DNA. Journal of Molecular Biology, 268(1):78–94, 1997.

[9] A. Churbanov and S. Winters-Hilt. Implementing EM and Viterbi algorithms for Hidden

Markov Model in linear memory. BMC Bioinformatics, 9:224, 2008.

[10] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data

via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[11] R. Durbin, S.R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: Prob-

abilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge,

UK, 1998.

[12] S.R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998.

[13] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17:368–376, 1981.

[14] M.S. Gelfand, A.A. Mironov, and P.A. Pevzner. Gene recognition via spliced sequence alignment. Proceedings of the National Academy of Sciences of the USA, 93:9061–9066, 1996.

[15] B.J. Haas, S.L. Salzberg, W. Zhu, M. Pertea, J.E. Allen, J. Orvis, O. White, C.R. Buell,

and J.R. Wortman. Automated eukaryotic gene structure annotation using Evidence-

Modeler and the Program to Assemble Spliced Alignments. Genome Biology, 9:R7, 2008.

[16] D.S. Hirschberg. A linear space algorithm for computing maximal common subsequences.

Communications of the ACM, 18:341–343, 1975.


[17] J. Hoh, S. Jin, T. Parrado, J. Edington, A.J. Levine, and J. Ott. The p53MH algorithm and its application in detecting p53-responsive genes. Proceedings of the National Academy of Sciences of the USA, 99(13):8467–8472, 2002.

[18] T.J.P. Hubbard, B.L. Aken, K. Beal, B. Ballester, M. Caccamo, Y. Chen, L. Clarke, G. Coates, F. Cunningham, T. Cutts, T. Down, S.C. Dyer, S. Fitzgerald, J. Fernandez-Banet, S. Graf, S. Haider, M. Hammond, J. Herrero, R. Holland, K. Howe, K. Howe, N. Johnson, A. Kahari, D. Keefe, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, C. Melsopp, K. Megy, P. Meidl, B. Overduin, A. Parker, A. Prlic, S. Rice, D. Rios, M. Schuster, I. Sealy, J. Severin, G. Slater, D. Smedley, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, M. Wood, T. Cox, V. Curwen, R. Durbin, X.M. Fernandez-Suarez, P. Flicek, A. Kasprzyk, G. Proctor, S. Searle, J. Smith, A. Ureta-Vidal, and E. Birney. Ensembl 2007. Nucleic Acids Research, 35:D610–D617, 2007.

[19] J.E. Allen and S.L. Salzberg. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics, 21(18):3596–3603, 2005.

[20] K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote

protein homologies. Bioinformatics, 14(10):846–856, 1998.

[21] W.J. Kent and A.M. Zahler. Conservation, Regulation, Synteny, and Introns in a Large-

scale C. Briggsae-C. elegans Genomic Alignment. Genome Research, 10(8):1115–1125,

2000.

[22] J. Khatun, E. Hamlett, and M.C. Giddings. Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification. Bioinformatics, 24(5):674–681, 2008.


[23] K. Knapp and Y.P. Chen. SURVEY AND SUMMARY: An evaluation of contempo-

rary hidden Markov model genefinders with a predicted exon taxonomy. Nucleic Acids

Research, 35(1):317–324, 2007.

[24] B. Knudsen and M.M. Miyamoto. Sequence Alignments and Pair Hidden Markov Models

Using Evolutionary History. Journal of Molecular Biology, 333(2):453–460, 2003.

[25] I. Korf. Gene finding in novel genomes. BMC Bioinformatics, 5:59, 2004.

[26] I. Korf, P. Flicek, D. Duan, and M.R. Brent. Integrating genomic homology into gene

structure prediction. Bioinformatics, 17(Suppl 1):S140–S148, 2001.

[27] A. Kundu, Y. He, and M.Y. Chen. Alternatives to Variable Duration HMM in Hand-

writing Recognition. IEEE Transactions on pattern analysis and machine intelligence,

20(11):1275–1280, 1998.

[28] T. Lee, Y. Berquin, and A. Ellerton. TinyXml Documentation, 2006.

http://www.grinninglizard.com/tinyxmldocs/index.html.

[29] A. Levine. Bioinformatics approaches to RNA splicing. Master’s thesis, University of

Cambridge, UK, 2001.

[30] A. Lomsadze, V. Ter-Hovhannisyan, Y.O. Chernoff, and M. Borodovsky. Gene identifi-

cation in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research,

33(20):6494–6506, 2005.

[31] G. Lunter. HMMoC-a compiler for hidden Markov models. Bioinformatics, 23(18):2485–

2487, 2007.

[32] B. Merialdo. Tagging English Text with a Probabilistic Model. Computational Linguis-

tics, 20(2):155–171, 1994.


[33] I.M. Meyer. Mathematical methods for comparative ab initio gene prediction. PhD thesis,

University of Cambridge, UK, 2002.

[34] I.M. Meyer and R. Durbin. Comparative ab initio prediction of gene structures using

pair HMMs. Bioinformatics, 18(10):1309–1318, 2002.

[35] I. Miklos and I.M. Meyer. A linear memory algorithm for Baum-Welch training. BMC

Bioinformatics, 6:231, 2005.

[36] I.M. Meyer and R. Durbin. Gene structure conservation aids similarity based gene prediction. Nucleic Acids Research, 32(2):776–783, 2004.

[37] N.J. Mulder, R. Apweiler, T.K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bork, V. Buillard, L. Cerutti, R. Copley, E. Courcelle, U. Das, L. Daugherty, M. Dibley, R. Finn, W. Fleischmann, J. Gough, D. Haft, N. Hulo, S. Hunter, D. Kahn, A. Kanapin, A. Kejariwal, A. Labarga, P.S. Langendijk-Genevaux, D. Lonsdale, R. Lopez, I. Letunic, M. Madera, J. Maslen, C. McAnulla, J. McDowall, J. Mistry, A. Mitchell, A.D. Nikolskaya, S. Orchard, C. Orengo, R. Petryszak, J.D. Selengut, C.J.A. Sigrist, P.D. Thomas, F. Valentin, D. Wilson, C.H. Wu, and C. Yeats. New developments in the InterPro database. Nucleic Acids Research, 35:224–228, 2007.

[38] The EMBL Outstation, Protein Information Resource, and Swiss Institute of Bioinfor-

matics. The Universal Protein Resource (UniProt). Nucleic Acids Research, 36:D190–

D195, 2008.

[39] L. Pachter, M. Alexandersson, and S. Cawley. Applications of Generalized Pair Hidden

Markov Models to Alignment and Gene Finding Problems. Journal of Computational

Biology, 9(2):389–399, 2002.


[40] J.S. Pedersen and J. Hein. Gene finding with a hidden Markov model of genome structure

and evolution. Bioinformatics, 19(2):219–227, 2003.

[41] Y. Qi, J. Klein-Seetharaman, and Z. Bar-Joseph. A mixture of feature experts approach

for protein-protein interaction prediction. BMC Bioinformatics, 8(Suppl 10):S6, 2007.

[42] L.R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech

Recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[43] F. Schutz and M. Delorenzi. MAMOT: hidden Markov modeling tool. Bioinformatics,

24(11):1399–1400, 2008.

[44] M. Stanke, O. Schoffmann, B. Morgenstern, and S. Waack. Gene prediction in eukaryotes

with a generalized hidden Markov model that uses hints from external sources. BMC

Bioinformatics, 7:62, 2006.

[45] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi. Adaptation of pitch and spectrum

for HMM-based speech synthesis using MLLR. In 2001 IEEE International Conference

on Acoustic, Speech and Signal Processing, volume 2, pages 805–808. IEEE Press, 2001.

[46] V. Ter-Hovhannisyan, A. Lomsadze, Y.O. Chernoff, and M. Borodovsky. Gene prediction

in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome

Research, in press, 2008.

[47] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, pages 260–269, 1967.

[48] R.F. Yeh, L.P. Lin, and C.B. Burge. Computational Inference of Homologous Gene Structures in the Human Genome. Genome Research, 11(5):803–816, 2001.
