A Hidden Markov Model for Progressive Multiple Alignment

A Hidden Markov Model for Progressive Multiple Alignment. Ari Löytynoja and Michel C. Milinkovitch. Appeared in Bioinformatics, Vol. 19, no. 12, 2003. Presented by Sowmya Venkateswaran, April 20, 2006.



Page 1: A Hidden Markov Model for Progressive Multiple Alignment

A Hidden Markov Model for Progressive Multiple Alignment

Ari Löytynoja and Michel C. Milinkovitch
Appeared in Bioinformatics, Vol. 19, no. 12, 2003

Presented by Sowmya Venkateswaran
April 20, 2006

Page 2: A Hidden Markov Model for Progressive Multiple Alignment

Outline

- Motivation
- Drawbacks of existing methods
- System and Methods
  - Substitution Model
  - Hidden Markov Model
  - Pairwise Alignment using the Viterbi Algorithm
  - Posterior Probability
  - Multiple Alignment
- Results
- Discussion

Page 3: A Hidden Markov Model for Progressive Multiple Alignment

Motivation

Progressive alignment techniques are used for multiple sequence alignment, e.g. to deduce phylogenies and to identify protein families.

Probabilistic methods can be used to estimate the reliability of global/local alignments.

Page 4: A Hidden Markov Model for Progressive Multiple Alignment

Drawbacks of Existing Systems

Iterative application of global/local pairwise sequence alignment algorithms does not guarantee a globally optimal alignment.

The best-scoring alignment may not correspond to the true alignment; the reliability of a score/alignment therefore needs to be inferred.

Page 5: A Hidden Markov Model for Progressive Multiple Alignment

System and Methods

The idea is to provide a probabilistic framework for a guide tree and to define a vector of probabilities at each character site.

The guide tree is constructed by neighbor-joining clustering after producing a distance matrix; it can also be imported from CLUSTALW.

At each internal node, a probabilistic alignment is performed. Pointers from parent sites to child sites are stored, along with a vector of probabilities of the different character states ('A/C/T/G/-' for nucleotides, or the 20 amino acids plus a gap).

Page 6: A Hidden Markov Model for Progressive Multiple Alignment

Substitution Model

Consider two sequences x_1…n and y_1…m whose alignment we would like to find, and their parent z_1…l in the guide tree.

P_a(x_i) is the probability that site x_i contains character a.

At a terminal node, P_a(x_i) = 1 if character a appears at that site, and 0 otherwise.

At internal nodes, different characters have different probabilities, summing to 1.

If the observed character is ambiguous, the probability is shared among the matching characters.

Page 7: A Hidden Markov Model for Progressive Multiple Alignment

Emission Probabilities

p_xi,yj represents the probability that x_i and y_j are aligned:

  p_xi,yj = p_zk(x_i, y_j) = Σ_a p_zk=a(x_i, y_j)

  p_zk=a(x_i, y_j) = q_a [Σ_b s_ab p_b(x_i)] [Σ_b s_ab p_b(y_j)]

q_a is the character background probability.

s_ab, the probability of aligning characters a and b, is calculated with the Jukes-Cantor model:

  s_ab = 1/n + (n-1)/n · e^(-(n/(n-1)) v)   when a = b
  s_ab = 1/n - 1/n · e^(-(n/(n-1)) v)       when a ≠ b

where n is the size of the alphabet and v is the NJ-estimated branch length.
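As a sketch, the Jukes-Cantor formulas above translate into a small Python function (the function name and the default alphabet size of 5, i.e. four nucleotides plus the gap, are illustrative assumptions):

```python
import math

def jc_substitution(a: str, b: str, v: float, n: int = 5) -> float:
    """Jukes-Cantor probability s_ab of aligning characters a and b
    across an NJ-estimated branch length v, for an alphabet of n
    characters (here assumed 5: A/C/T/G plus the gap)."""
    decay = math.exp(-(n / (n - 1)) * v)
    if a == b:
        return 1.0 / n + (n - 1) / n * decay   # s_ab for a = b
    return 1.0 / n - 1.0 / n * decay           # s_ab for a != b
```

Note that each row of the implied matrix sums to 1 (s_aa + (n-1)·s_ab = 1), and as v grows both cases converge to the background 1/n.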

Page 8: A Hidden Markov Model for Progressive Multiple Alignment

Probabilities

p_xi,- , the probability that z_k evolved to a character on one child site and a gap on the other, is found from

  p_zk=a(x_i, -) = q_a [Σ_b s_ab p_b(x_i)] s_a-

The same applies for p_-,yj. s_a- is computed just like s_ab.

Any other model can be used instead of the Jukes-Cantor model for calculating s_ab. For example, a PAM (20 × 20) substitution matrix can be modified to include gaps and transformed into a (21 × 21) matrix, and the substitution probabilities can be derived from that.
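A minimal sketch of the match- and gap-emission computations for one parent site, re-deriving s_ab inline; the uniform background q and all function names are illustrative assumptions, not the paper's code:

```python
import math

ALPHABET = "ACGT-"  # four nucleotides plus the gap state
GAP = "-"

def s_ab(a, b, v, n=len(ALPHABET)):
    # Jukes-Cantor substitution probability (previous slide)
    decay = math.exp(-(n / (n - 1)) * v)
    return 1/n + (n - 1)/n * decay if a == b else 1/n - 1/n * decay

def emission(px, py, v, q=None):
    """Emission probabilities for one parent site z_k:
      p_zk(x_i, y_j) = sum_a q_a [sum_b s_ab p_b(x_i)] [sum_b s_ab p_b(y_j)]
      p_zk(x_i, -)   = sum_a q_a [sum_b s_ab p_b(x_i)] s_a-
    px, py map characters to their probabilities at the two child sites."""
    n = len(ALPHABET)
    q = q or {a: 1.0 / n for a in ALPHABET}  # assumed uniform background
    p_match = p_gap_x = 0.0
    for a in ALPHABET:
        left = sum(s_ab(a, b, v) * px.get(b, 0.0) for b in ALPHABET)
        right = sum(s_ab(a, b, v) * py.get(b, 0.0) for b in ALPHABET)
        p_match += q[a] * left * right
        p_gap_x += q[a] * left * s_ab(a, GAP, v)
    return p_match, p_gap_x
```

With terminal-node sites (probability 1 on a single character), identical characters score higher than mismatching ones, as expected.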

Page 9: A Hidden Markov Model for Progressive Multiple Alignment

Hidden Markov Model

[HMM state diagram: match state M emits p_xi,yj; insert states X and Y emit p_xi,- and p-,yj. Transitions: M→X and M→Y with probability δ, M→M with 1-2δ; X→X and Y→Y with probability ε; X→M and Y→M with 1-ε.]

Page 10: A Hidden Markov Model for Progressive Multiple Alignment

Hidden Markov Model

δ – probability of moving to an insert state (gap-opening penalty); the lower the value, the higher the penalty.

ε – probability of staying in an insert state (gap-extension penalty); again, the lower the value, the higher the extension penalty.

p_xi,yj, p_xi,-, p_-,yj – emission frequencies for the match, insert X and insert Y states.

For testing purposes, δ and ε were estimated from pairwise alignments of terminal sequences such that δ = 1/(2(l_m + 1)) and ε = 1 - 1/(l_g + 1), where l_m and l_g are the mean lengths of match and gap segments.
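The δ and ε estimates above amount to two one-liners; a sketch (the function name is illustrative):

```python
def estimate_hmm_params(l_m: float, l_g: float):
    """Estimate gap-open (delta) and gap-extend (epsilon) probabilities
    from the mean match-segment length l_m and mean gap-segment length l_g,
    following delta = 1/(2(l_m + 1)) and epsilon = 1 - 1/(l_g + 1)."""
    delta = 1.0 / (2.0 * (l_m + 1.0))
    epsilon = 1.0 - 1.0 / (l_g + 1.0)
    return delta, epsilon
```

Longer mean match segments give a smaller δ (rarer gap openings); longer mean gap segments push ε toward 1 (cheaper extensions).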

Page 11: A Hidden Markov Model for Progressive Multiple Alignment

Pairwise Alignment

In this probabilistic model, the best alignment between two sequences corresponds to the Viterbi path through the HMM.

Since there are three states in the model, and each state needs 2-D space, we have three 2-D tables: v^M for match states, v^X and v^Y for the gap states.

A move within the M, X or Y table produces an additional match or extends an existing gap. A move between the M table and either the X or Y table opens or closes a gap.

Page 12: A Hidden Markov Model for Progressive Multiple Alignment

Viterbi Recursion

Initialization:
  v^M(0,0) = 1;  v(i,-1) = v(-1,j) = 0

Recursion:
  v^M(i,j) = p_xi,yj · max { (1-2δ) v^M(i-1,j-1), (1-ε) v^X(i-1,j-1), (1-ε) v^Y(i-1,j-1) }
  v^X(i,j) = p_xi,-  · max { δ v^M(i-1,j), ε v^X(i-1,j) }
  v^Y(i,j) = p_-,yj  · max { δ v^M(i,j-1), ε v^Y(i,j-1) }

Termination:
  v^E = max ( v^M(n,m), v^X(n,m), v^Y(n,m) )
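The recursion translates directly into three dynamic-programming tables. The sketch below works in log space (an implementation choice, not from the slides, to avoid underflow) and takes the emission probabilities as callables, a simplifying assumption; the paper computes them from the child probability vectors:

```python
import math

def viterbi_score(n, m, p_match, p_gap_x, p_gap_y, delta, eps):
    """Log-space Viterbi over tables vM, vX, vY (1-based sequence indices).
    p_match(i, j), p_gap_x(i), p_gap_y(j) return the emission
    probabilities p_xi,yj, p_xi,- and p_-,yj."""
    NEG = float("-inf")
    vM = [[NEG] * (m + 1) for _ in range(n + 1)]
    vX = [[NEG] * (m + 1) for _ in range(n + 1)]
    vY = [[NEG] * (m + 1) for _ in range(n + 1)]
    vM[0][0] = 0.0  # log 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # match x_i with y_j
                vM[i][j] = math.log(p_match(i, j)) + max(
                    math.log(1 - 2 * delta) + vM[i - 1][j - 1],
                    math.log(1 - eps) + vX[i - 1][j - 1],
                    math.log(1 - eps) + vY[i - 1][j - 1],
                )
            if i > 0:            # x_i against a gap
                vX[i][j] = math.log(p_gap_x(i)) + max(
                    math.log(delta) + vM[i - 1][j],
                    math.log(eps) + vX[i - 1][j],
                )
            if j > 0:            # y_j against a gap
                vY[i][j] = math.log(p_gap_y(j)) + max(
                    math.log(delta) + vM[i][j - 1],
                    math.log(eps) + vY[i][j - 1],
                )
    return max(vM[n][m], vX[n][m], vY[n][m])  # log v^E
```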

Page 13: A Hidden Markov Model for Progressive Multiple Alignment

Viterbi Traceback

At each cell, the relative probabilities of entering it from the different cells are stored. For example:

  p_M←M = (1-2δ) v^M(i-1,j-1) / N(i,j)

where N(i,j) is the normalizing constant, given by

  N(i,j) = (1-2δ) v^M(i-1,j-1) + (1-ε) [ v^X(i-1,j-1) + v^Y(i-1,j-1) ]

The above is calculated for each of the three tables.

The traceback algorithm is used to find the best path; a match step creates pointers from the parent site to both child sites, while a gap step creates a pointer to one child site and a gap for the second.

Page 14: A Hidden Markov Model for Progressive Multiple Alignment

Posterior Probabilities – Forward Algorithm

The forward algorithm sums the probabilities of all paths entering a given cell from the start position.

Initialization:
  f(0,0) = 1;  f(i,-1) = f(-1,j) = 0

Recursion (i = 0,…,n; j = 0,…,m; except (0,0)):
  f^M(i,j) = p_xi,yj [ (1-2δ) f^M(i-1,j-1) + (1-ε) ( f^X(i-1,j-1) + f^Y(i-1,j-1) ) ]
  f^X(i,j) = p_xi,-  [ δ f^M(i-1,j) + ε f^X(i-1,j) ]
  f^Y(i,j) = p_-,yj  [ δ f^M(i,j-1) + ε f^Y(i,j-1) ]

Termination:
  f^E = f^M(n,m) + f^X(n,m) + f^Y(n,m)
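A sketch of the same recursion in probability space (fine for short sequences; log space or scaling would be needed for long ones). As before, the callable emissions are a simplifying assumption:

```python
def forward_total(n, m, p_match, p_gap_x, p_gap_y, delta, eps):
    """Forward algorithm: fM/fX/fY(i,j) sum the probabilities of all
    paths reaching cell (i,j); returns f^E = fM(n,m) + fX(n,m) + fY(n,m)."""
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p_match(i, j) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fX[i - 1][j - 1] + fY[i - 1][j - 1])
                )
            if i > 0:
                fX[i][j] = p_gap_x(i) * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = p_gap_y(j) * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    return fM[n][m] + fX[n][m] + fY[n][m]
```

Unlike Viterbi's max, every predecessor contributes to the sum, so f^E is the total probability of all alignments rather than the best one.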

Page 15: A Hidden Markov Model for Progressive Multiple Alignment

Backward Algorithm

The backward variable is the sum of probabilities of all possible alignments between the subsequences x_i…n and y_j…m.

Initialization:
  b(n,m) = 1;  b(i,m+1) = b(n+1,j) = 0

Recursion (i = n,…,1; j = m,…,1; except (n,m)):
  b^M(i,j) = (1-2δ) p_x(i+1),y(j+1) b^M(i+1,j+1) + δ [ p_x(i+1),- b^X(i+1,j) + p_-,y(j+1) b^Y(i,j+1) ]
  b^X(i,j) = (1-ε) p_x(i+1),y(j+1) b^M(i+1,j+1) + ε p_x(i+1),- b^X(i+1,j)
  b^Y(i,j) = (1-ε) p_x(i+1),y(j+1) b^M(i+1,j+1) + ε p_-,y(j+1) b^Y(i,j+1)
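The backward recursion, sketched with the same assumed conventions (callable emissions, probability space). A handy sanity check: bM(0,0) should reproduce the forward total f^E:

```python
def backward_tables(n, m, p_match, p_gap_x, p_gap_y, delta, eps):
    """Backward algorithm: bM/bX/bY(i,j) sum the probabilities of all
    alignments of the suffixes x_{i+1..n} and y_{j+1..m}."""
    bM = [[0.0] * (m + 2) for _ in range(n + 2)]
    bX = [[0.0] * (m + 2) for _ in range(n + 2)]
    bY = [[0.0] * (m + 2) for _ in range(n + 2)]
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0  # b(n,m) = 1
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            # Emissions for the next step; zero past the sequence ends.
            em = p_match(i + 1, j + 1) if (i < n and j < m) else 0.0
            ex = p_gap_x(i + 1) if i < n else 0.0
            ey = p_gap_y(j + 1) if j < m else 0.0
            bM[i][j] = ((1 - 2 * delta) * em * bM[i + 1][j + 1]
                        + delta * (ex * bX[i + 1][j] + ey * bY[i][j + 1]))
            bX[i][j] = (1 - eps) * em * bM[i + 1][j + 1] + eps * ex * bX[i + 1][j]
            bY[i][j] = (1 - eps) * em * bM[i + 1][j + 1] + eps * ey * bY[i][j + 1]
    return bM, bX, bY
```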

Page 16: A Hidden Markov Model for Progressive Multiple Alignment

Reliability Check

Assumption: the posterior probability of the sites on the alignment path is a valid estimator of the local reliability of the alignment, since it gives the proportion of the total probability corresponding to all alignments passing through cell (i,j).

The posterior probability for a match is given by:

  P(x_i ◊ y_j | x,y) = f^M(i,j) b^M(i,j) / f^E

where f^M and b^M are the total probabilities of all possible alignments between subsequences x_1…i and y_1…j, and x_i…n and y_j…m, respectively.

Similar probabilities are calculated for the insert X and insert Y states too.

Page 17: A Hidden Markov Model for Progressive Multiple Alignment

Multiple Alignment

Each parent-node site has a vector of probabilities corresponding to each possible character state (including the gap). For a match,

  p_a(z_k) = p_zk=a(x_i, y_j) / Σ_b p_zk=b(x_i, y_j)

Pairwise alignment builds the tree progressively, from the terminal nodes towards an arbitrary root.

Once the root node is defined, traceback is done to find the multiple alignment of the nodes below it, since each node stores pointers to the matching child sites.

If a gap occurs in one of the internal nodes, a gap character state is introduced into all of the sequences of that sub-tree, and the recursive call does not proceed further in that branch.
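The normalization that produces the parent-site probability vector is a one-liner; a sketch with an illustrative helper name:

```python
def parent_profile(joint, alphabet="ACGT-"):
    """Turn the joint probabilities p_zk=a(x_i, y_j), given per character a,
    into the normalized distribution p_a(z_k) stored at the parent site."""
    total = sum(joint.get(a, 0.0) for a in alphabet)
    return {a: joint.get(a, 0.0) / total for a in alphabet}
```

For example, `parent_profile({'A': 0.3, 'C': 0.1})` yields 0.75 for 'A' and 0.25 for 'C', with the other characters at 0.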

Page 18: A Hidden Markov Model for Progressive Multiple Alignment

Testing

Algorithms were tested on:

(i) Simulated nucleotide sequences. 50 random data sets were generated using the program Rose. A random root sequence (length 500) was evolved on a random tree to yield sequences of 'low' (0.5 substitutions per site) and 'high' (1.0) divergence. The insertion/deletion length distribution was set to 'short' or 'long'.

(ii) Amino acid data sets from Ref1 of the BAliBASE database. Ref1 contains alignments of fewer than six equidistant sequences, i.e., the percent identity between two sequences is within a specified range, with no large insertions or deletions. Data sets were divided into three groups based on length, and further into three based on similarity.

Page 19: A Hidden Markov Model for Progressive Multiple Alignment

Results of Simulation on Nucleotide Sequences

Page 20: A Hidden Markov Model for Progressive Multiple Alignment

Type 1 and Type 2 Errors vs. Minimum Posterior Probability

Page 21: A Hidden Markov Model for Progressive Multiple Alignment

Performance and Future Work

ProAlign performs better than ClustalW for the nucleotide sequences, but not for amino acid sequences with sequence identity below 25%.

A possible reason is that the model does not take protein secondary structure into account; the HMM could therefore be extended to model secondary structure as well.

The minimum posterior probability correlates well with correctness; it can be used to detect and remove unreliably aligned regions.