Hidden Markov Models for biological sequence analysis II
Eduardo Eyras Computational Genomics
Pompeu Fabra University - ICREA Barcelona, Spain
Master in Bioinformatics UPF 2017-2018
http://comprna.upf.edu/courses/Master_AGB/
HMM model structure
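[Diagram: a single state with self-transition probability p and exit probability 1-p]
p = transition probability to itself, 1-p = probability of leaving the state
Probability of staying in the state for exactly n residues (n-1 self-transitions followed by one exit): $P(n) = (1-p)\,p^{n-1}$
Exponential decay (geometric distribution)
[Figure: P versus n for the decay curves P = 0.7^n and P = 0.5^n]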
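The geometric decay of a single self-looping state can be checked numerically. The following is a minimal sketch, assuming the state is occupied for at least one residue, so that a length of n means n-1 self-transitions followed by one exit; the function name `geometric_duration` is illustrative and not part of the course material.

```python
def geometric_duration(p, n):
    """Probability of emitting exactly n residues from a state with self-transition p."""
    if n < 1:
        return 0.0
    # n-1 self-transitions followed by one exit transition
    return (p ** (n - 1)) * (1.0 - p)

# Each extra residue multiplies the probability by the constant factor p
lengths = range(1, 200)
probs = [geometric_duration(0.7, n) for n in lengths]
total = sum(probs)                                        # close to 1
mean_length = sum(n * q for n, q in zip(lengths, probs))  # close to 1 / (1 - p)
```

The shortest length is always the most probable one, which is why a single state is a poor model for features such as exons or introns with a typical length.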
Duration modeling
How can we avoid this decay? For instance, by using several states with the same emission probabilities and transitions between them.
E.g. a model for sequences of minimum length 5, with exponential decay for longer ones.
E.g. a model that can represent any distribution of lengths between 2 and 5.
[Figure: length distributions P for three duration models: exponential decay, minimum length then geometric, and negative binomial]
Duration modeling
[Diagram: chain of states, each with self-transition probability p and probability 1-p of moving on to the next state]
This type of array of n states can model sequences of length n or longer.
For a single path of length m > n, the product of the transition probabilities is
$$p^{m-n}(1-p)^{n}$$
Summing the transition probability over all possible paths of length m gives
$$P(m) = \binom{m-1}{n-1}\, p^{m-n} (1-p)^{n}$$
where we use the binomial coefficients
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
This is the negative binomial distribution.
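The closed form for the chain of n states can be compared against a direct sum over paths. This is a minimal sketch, assuming every state in the chain shares the same self-transition probability p; both function names are illustrative.

```python
from math import comb

def negative_binomial_length(p, n, m):
    """P(m) for a chain of n states, each with self-transition p and exit 1-p."""
    if m < n:
        return 0.0
    # comb(m-1, n-1) counts where the n-1 state changes fall among the m-1 steps
    return comb(m - 1, n - 1) * p ** (m - n) * (1.0 - p) ** n

def path_sum_length(p, n, m):
    """Same quantity by dynamic programming over the chain, as a cross-check."""
    if m < n:
        return 0.0
    # occupancy[k] = probability of being in state k after emitting the current residue
    occupancy = [1.0] + [0.0] * (n - 1)
    for _ in range(m - 1):
        nxt = [0.0] * n
        for k in range(n):
            nxt[k] += occupancy[k] * p                  # stay in state k
            if k + 1 < n:
                nxt[k + 1] += occupancy[k] * (1.0 - p)  # advance to state k+1
        occupancy = nxt
    return occupancy[n - 1] * (1.0 - p)  # leave the last state after m residues

distribution = [negative_binomial_length(0.7, 4, m) for m in range(4, 400)]
```

Unlike the geometric case, this distribution has its maximum away from the minimum length when n > 1.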
[Figure: empirical intron length distribution, P versus length L; model states: Intron, Exon, Intergenic]
Explicit duration modeling
Can use any arbitrary length distribution.
Generalized HMM. Often used in gene finders.
Upon entering a state:
1. Choose duration d according to a probability distribution
2. Generate d letters according to the emission probabilities, e.g. P(A|I)
3. Take a transition to the next state according to the transition probabilities
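The three steps above can be sketched as a small sampler. This is an illustration only: the two-state model, its length distributions and all probabilities below are hypothetical, and the function name `sample_ghmm` is not from the course material.

```python
import random

def sample_ghmm(durations, emissions, transitions, start, n_segments, rng):
    """Generate (state, segment) pairs from a toy generalized HMM."""
    state, out = start, []
    for _ in range(n_segments):
        # 1. choose a duration d from the state's own length distribution
        lengths, weights = zip(*durations[state].items())
        d = rng.choices(lengths, weights=weights)[0]
        # 2. emit d letters from the state's emission distribution
        letters, probs = zip(*emissions[state].items())
        segment = "".join(rng.choices(letters, weights=probs, k=d))
        out.append((state, segment))
        # 3. move to the next state according to the transition probabilities
        targets, tprobs = zip(*transitions[state].items())
        state = rng.choices(targets, weights=tprobs)[0]
    return out

# Hypothetical exon (E) / intron (I) model; the numbers are illustrative only
durations = {"E": {3: 0.5, 6: 0.5}, "I": {4: 0.2, 8: 0.3, 12: 0.5}}
emissions = {"E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
             "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
transitions = {"E": {"I": 1.0}, "I": {"E": 1.0}}
segments = sample_ghmm(durations, emissions, transitions, "E", 6, random.Random(0))
```

Because the duration is drawn explicitly, a state needs no self-transition, and any empirical length histogram can be plugged in as `durations`.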
Profile - HMMs
Finding distant members of a protein family
A distant cousin of functionally related sequences in a protein family may have only weak pairwise similarity to each member of the family, and may therefore fail to be found using standard pairwise methods (e.g. BLAST).
Although such a sequence has only weak similarity to each member, the goal is to align it to all members of the family at once.
A family of related proteins can be represented by their multiple alignment and a corresponding profile. Can we represent the profile as a probabilistic model?
We use a multiple alignment to build a profile-HMM.
• It is an HMM: a probabilistic representation of a multiple alignment, so we can use the same HMM algorithms (Viterbi, etc.)
• We can add position-dependent gap penalties (to model gaps in the alignment)
• We can add variable states with position-dependent random emission probabilities (to model variable regions)
This model may then be used to find and score less obvious potential matches of new protein sequences.
The profile-HMM is used to ask whether a new sequence S belongs to a given model (e.g. a given family of proteins, or contains a given domain).
Profile-HMM
A protein family is generally represented by a multiple alignment. Example: SH3 domains.
[Figure: multiple alignment of SH3 domains]
Profiles and HMMs
----exon----intron
CAGGTACCC
GAGGTGAGA
CTGGTGAGG
TAGGTGAGT
CAGGTCTGT
CTGGTGAGC
CAGGTAAGT

pos   1     2     3     4     5     6     7     8     9
A     0     0.71  0     0     0     0.28  0.71  0     0.14
C     0.71  0     0.28  0     0     0.14  0.14  0.14  0.28
G     0.14  0     0.71  1     0     0.57  0     0.85  0.14
T     0.14  0.28  0     0     1     0     0.14  0     0.42

E.g. position 1: P(C) = frequency = 5/7 = 0.71
Profile representation of protein families
Aligned DNA sequences can be represented by a 4 × n profile matrix reflecting the frequencies of the nucleotides at every aligned position.
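A profile matrix of this kind can be computed column by column. The sketch below uses the seven splice-site sequences exactly as transcribed above; the function name `profile_matrix` and the dictionary layout are illustrative choices, not part of the course material.

```python
from collections import Counter

def profile_matrix(sequences, alphabet="ACGT"):
    """Column-wise relative frequencies for equal-length aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("aligned sequences must have the same length")
    profile = {letter: [] for letter in alphabet}
    for column in zip(*sequences):
        counts = Counter(column)
        for letter in alphabet:
            # frequency of this letter in the current alignment column
            profile[letter].append(counts[letter] / len(sequences))
    return profile

donor_sites = ["CAGGTACCC", "GAGGTGAGA", "CTGGTGAGG", "TAGGTGAGT",
               "CAGGTCTGT", "CTGGTGAGC", "CAGGTAAGT"]
profile = profile_matrix(donor_sites)
p_c_position_1 = profile["C"][0]  # 5 of the 7 sequences start with C
```

Each column of the result sums to 1, since every sequence contributes exactly one nucleotide per position.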
Position Specific Scoring Matrix (PSSM, also called PWM):
$$S = \sum_{i=1}^{L} \log \frac{e_i(s_i)}{q_{s_i}}$$
where $e_i(s_i)$ are the motif probabilities and $q_{s_i}$ are the background probabilities.
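The log-odds score can be sketched as a sum over positions. This is a minimal illustration, assuming a uniform background and a small probability floor to avoid log(0) for residues never observed at a position; the floor and the function name `pssm_score` are assumptions, not part of the original material. The two-column profile reuses positions 1 and 2 of the table above.

```python
from math import log

def pssm_score(sequence, profile, background, floor=1e-3):
    """Sum over positions of log(e_i(s_i) / q_{s_i})."""
    score = 0.0
    for i, residue in enumerate(sequence):
        # the floor is an assumption: it keeps unseen residues from giving log(0)
        e = max(profile[residue][i], floor)
        score += log(e / background[residue])
    return score

profile = {"A": [0.0, 0.71], "C": [0.71, 0.0], "G": [0.14, 0.0], "T": [0.14, 0.28]}
background = {base: 0.25 for base in "ACGT"}  # uniform background (assumption)
consensus_score = pssm_score("CA", profile, background)
other_score = pssm_score("GT", profile, background)
```

A positive score means the sequence is better explained by the motif than by the background, and a negative score means the opposite.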
The conserved regions can be modeled as in a PSSM
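[Diagram: begin → … → Mj → … → end]
A PSSM can be viewed as a trivial HMM with states of a single type, Match states, separated by transitions of probability 1:
$$\text{Score} = \sum_{i=1}^{L} \log \frac{e_{M_i}(s_i)}{q_{s_i}}$$
where $e_{M_i}(s_i)$ are the emission probabilities and $q_{s_i}$ are the background probabilities.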
Profile-HMM: Match states
$e_{M_i}(a)$: emission probability in the Match state $M_i$ = frequency of each amino acid a in the corresponding alignment column
The multiple alignment of a protein family shows variations in conservation along the length of the protein. For SH3 domains:
[Figure: multiple alignment of SH3 domains showing conserved and variable regions]
Conserved regions can be described by PWMs, but variable regions cannot!
Profile-HMMs: Insertion states
[Diagram: Start → Mi → End, with an insertion state Ii attached to Mi]
We treat insertions and deletions separately.
Insertion: portions of the query sequence S that do not match anything in the model; we must insert residues with respect to the model.
Insertion state: $I_i$ models insertions after the residue matching the i-th column of the alignment.
$e_{I_i}(a) = p(a)$: emission probability in the Insertion state = amino acid frequency in all sequences (background)
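Transitions of $I_i$ to itself model multiple insertions.
There is no log-odds (log-likelihood ratio) contribution from emissions in $I_i$, because they follow the background distribution.
Score of a gap of length k:
$$\log(a_{M_i I_i}) + (k-1)\log(a_{I_i I_i}) + \log(a_{I_i M_{i+1}})$$
The three terms are the gap-open penalty, the gap-extension penalty and the gap-closing penalty, respectively.
Gap penalties are position-dependent! Compare with e.g. Needleman-Wunsch, where they are the same at every position.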
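The gap score above can be written out term by term. This is a minimal sketch with hypothetical transition probabilities; the function name `insertion_gap_score` is illustrative.

```python
from math import log

def insertion_gap_score(k, a_mi, a_ii, a_im):
    """Log-score of k residues inserted at one insert state I_i."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    gap_open = log(a_mi)              # M_i -> I_i
    gap_extend = (k - 1) * log(a_ii)  # I_i -> I_i, taken k-1 times
    gap_close = log(a_im)             # I_i -> M_{i+1}
    return gap_open + gap_extend + gap_close

# Hypothetical values for two model positions: the same gap length costs more
# at a well-conserved position than at a variable one
conserved = insertion_gap_score(3, a_mi=0.01, a_ii=0.2, a_im=0.8)
variable = insertion_gap_score(3, a_mi=0.30, a_ii=0.6, a_im=0.4)
```

Each additional inserted residue adds the same amount, log of the self-transition probability, because the extension always uses the same state.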
Profile-HMMs: Silent states
Deletions: segments of the model that are not matched by any residue in the query sequence S. That is, when fitting S to the model we need to skip match states, so we must allow deletions in the query sequence.
One possibility to allow for deletions is to connect non-neighboring states:
[Diagram: match states with direct transitions between non-neighboring states]
This is too complex for modeling arbitrary deletions in a long sequence.
We therefore introduce the silent states Dj to model deletions.
We can model arbitrary deletions by connecting the states to a parallel chain of silent states (circles):
[Diagram: Start → Mj → End, with a parallel chain of silent states Dj]
It is possible to get from any “real” state to any “real” state without emitting letters.
Cost of a deletion of length k:
$$\log(a_{M_i D_{i+1}}) + \sum_{j=i+1}^{i+k-1} \log(a_{D_j D_{j+1}}) + \log(a_{D_{i+k} M_{i+k+1}})$$
The deletion extension has a different probability at each step (different states).
The insertion extension contributes equally at each step (same state).
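The deletion cost can be sketched in the same way as the insertion score. This illustration assumes the deletion leaves match state M_i, passes through the silent states D_{i+1} to D_{i+k}, and re-enters at M_{i+k+1}; the dictionaries are keyed by the index of the source state and hold hypothetical values.

```python
from math import log

def deletion_score(i, k, a_md, a_dd, a_dm):
    """Log-score of skipping match states i+1 .. i+k via D_{i+1} .. D_{i+k}."""
    score = log(a_md[i])           # M_i -> D_{i+1}
    for j in range(i + 1, i + k):  # D_j -> D_{j+1} for j = i+1 .. i+k-1
        score += log(a_dd[j])
    score += log(a_dm[i + k])      # D_{i+k} -> M_{i+k+1}
    return score

# Hypothetical position-specific transition probabilities
a_md = {1: 0.2}
a_dd = {2: 0.3, 3: 0.4}
a_dm = {2: 0.7, 4: 0.6}
three_deleted = deletion_score(1, 3, a_md, a_dd, a_dm)
one_deleted = deletion_score(1, 1, a_md, a_dd, a_dm)
```

In contrast to the insertion case, every extension step uses a different transition probability, because each deleted position has its own silent state.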
States in a profile-HMM
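[Diagram: profile-HMM between Start and End with, for each position i, a Match state Mi, an Insertion state Ii and a Deletion state Di]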
Match states: conserved positions in the alignment (plus start/end states)
Insertion states: variable regions (not clearly alignable)
Deletion states: model gaps in the alignment
This model represents the consensus of a family of sequences, not the sequence of any particular member.
Building a profile-HMM
A multiple alignment is used to construct the HMM.
• Assign each aligned (conserved) column to a Match state (M) in the HMM; this determines the length of the model
• Estimate the emission probabilities according to the amino acid counts in the columns. Different positions in the protein will have different emission probabilities
• Add Insertion (I) and Deletion (D) states: all states, connectivity to be determined…
• Estimate the transition probabilities between Match, Deletion and Insertion states
Probabilities in a profile-HMM
$e_{M_i}(a)$: emission probability in the Match state = frequency of each amino acid in the alignment column
$e_{I_i}(a) = p(a)$: emission probability in the Insertion state = amino acid frequency in all sequences (background)
$a_{M_i I_i}$: transition probability from a match state to an insertion state; $\log(a_{M_i I_i})$ is the gap-open penalty
$a_{I_i I_i}$: transition probability within an insertion state; $\log(a_{I_i I_i})$ is the gap-extension penalty
$a_{D_i D_{i+1}}$: transition probability between deletion states
$a_{D_i I_i} = 0$: transition probability from a deletion state to an insertion state
$a_{I_i D_{i+1}} = 0$: transition probability from an insertion state to a deletion state
These transitions are usually very improbable, so they are set to zero.
How to assign the states?
Heuristic rules:
• Denote as insertion states the columns of the alignment that contain gaps in more than half of the sequences. Denote as match states the conserved columns with fewer gaps
• Calculate the entropy of each column and denote as insertion states the columns with a high degree of disorder
In the example above, all columns will be M states except for the 4th and 5th, which will be I states.
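Both heuristics can be sketched in a few lines. The toy alignment below is hypothetical (it is not the alignment from the slide), gaps are assumed to be written as "-", and the function names are illustrative.

```python
from collections import Counter
from math import log2

def assign_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Label each column 'M' or 'I' by the fraction of gaps it contains."""
    labels = []
    for column in zip(*alignment):
        gap_fraction = column.count(gap) / len(column)
        # more than half gaps: treat the column as an insertion
        labels.append("I" if gap_fraction > max_gap_fraction else "M")
    return labels

def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of the residues in one column, ignoring gaps."""
    residues = [c for c in column if c != gap]
    counts = Counter(residues)
    return -sum((n / len(residues)) * log2(n / len(residues)) for n in counts.values())

# Hypothetical toy alignment
alignment = ["AC--GT", "AC-AGT", "A---GT", "TCG-GA"]
labels = assign_columns(alignment)
entropies = [column_entropy(col) for col in zip(*alignment)]
```

A fully conserved column has entropy 0, and the entropy grows as the residues in the column become more evenly mixed, so an entropy threshold can play the same role as the gap-fraction threshold.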
Profile-HMM: Parameter estimation
We start from a given sample of alignments.
We can estimate the parameters by counting the transitions and emissions:
$A_{kl}$: the number of transitions from state k to state l
$E_k(b)$: the number of times the symbol b is emitted by state k
We can estimate the probabilities as follows:
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
This estimate is accurate for a large number of sequences.
To avoid overfitting, use pseudocounts:
$$A_{kl} \rightarrow A_{kl} + r_{kl}, \qquad E_k(b) \rightarrow E_k(b) + r_k(b)$$
Pseudocounts reflect our prior knowledge.
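The count-based estimates, with and without pseudocounts, can be sketched as a row normalization. The counts below are hypothetical, and using a pseudocount of 1 for every allowed transition (add-one smoothing) is an assumption made for illustration; the function name `normalize_counts` is not from the course material.

```python
def normalize_counts(counts, pseudocounts=None):
    """a_kl = A_kl / sum over l' of A_kl' (same form for emissions), with optional pseudocounts."""
    params = {}
    for k, row in counts.items():
        prior = pseudocounts.get(k, {}) if pseudocounts else {}
        adjusted = {l: row[l] + prior.get(l, 0.0) for l in row}
        total = sum(adjusted.values())
        params[k] = {l: value / total for l, value in adjusted.items()}
    return params

# Hypothetical transition counts A_kl out of two states
A = {"M1": {"M2": 6, "I1": 0, "D2": 1}, "D1": {"M2": 1, "D2": 0}}
# Add-one pseudocounts r_kl for every allowed transition (assumption)
r = {k: {l: 1.0 for l in row} for k, row in A.items()}

a_ml = normalize_counts(A)     # plain counts: unseen transitions get probability 0
a_pc = normalize_counts(A, r)  # pseudocounts: unseen transitions stay possible
```

Without pseudocounts, any transition or emission that is absent from the training alignment gets probability zero, which makes the model reject every sequence that uses it.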
Parameter estimation: Example
Emission probabilities in M1, without pseudocounts:
$e_{M_1}(V) = 5/7$
$e_{M_1}(F) = e_{M_1}(I) = 1/7$
Using pseudocounts:
$e_{M_1}(V) = (5+1)/(7+20) = 6/27$
$e_{M_1}(I) = e_{M_1}(F) = (1+1)/(7+20) = 2/27$
$e_{M_1}(a) = 1/27$ for all other amino acids a
Transition probabilities from M1, without pseudocounts:
$a_{M_1 M_2} = 6/7$
$a_{M_1 D_2} = 1/7$
$a_{M_1 I_1} = 0$
Using pseudocounts:
$a_{M_1 M_2} = (6+1)/(7+3) = 7/10$
$a_{M_1 D_2} = (1+1)/(7+3) = 2/10$
$a_{M_1 I_1} = (0+1)/(7+3) = 1/10$
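Parameter estimation: Example
[Figure: panels (A) to (D)]
Emission probabilities in M3, without pseudocounts:
$e_{M_3}(A) = 3/6$
$e_{M_3}(G) = 2/6$
Using pseudocounts:
$e_{M_3}(A) = (3+1)/(6+20) = 4/26$
$e_{M_3}(G) = (2+1)/(6+20) = 3/26$
Transition probabilities from M3 and D3, without pseudocounts:
$a_{M_3 M_4} = 4/6$
$a_{M_3 D_4} = 1/6$
$a_{M_3 I_3} = 1/6$
$a_{D_3 M_4} = 1/1$
$a_{D_3 D_4} = 0$
Using pseudocounts:
$a_{M_3 M_4} = (4+1)/(6+3) = 5/9$
$a_{M_3 D_4} = (1+1)/(6+3) = 2/9$
$a_{M_3 I_3} = (1+1)/(6+3) = 2/9$
$a_{D_3 M_4} = (1+1)/(1+2) = 2/3$
$a_{D_3 D_4} = (0+1)/(1+2) = 1/3$
Always check normalization! Here we removed the D → I transition, so D3 has only two outgoing transitions (hence the denominator 1+2).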
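The normalization check can be done with exact fractions. This sketch recomputes the pseudocount transition probabilities out of M3 and D3 from the counts in the example (4, 1, 1 and 1, 0), adding 1 to every allowed transition; the helper name `with_pseudocounts` is illustrative.

```python
from fractions import Fraction

def with_pseudocounts(counts, pseudo=1):
    """Add a constant pseudocount to every allowed outcome and normalize exactly."""
    total = sum(counts.values()) + pseudo * len(counts)
    return {k: Fraction(v + pseudo, total) for k, v in counts.items()}

# Transitions out of M3: three allowed targets
from_m3 = with_pseudocounts({"M4": 4, "D4": 1, "I3": 1})
# Transitions out of D3: the D -> I transition was removed, so only two targets
from_d3 = with_pseudocounts({"M4": 1, "D4": 0})

m3_normalized = sum(from_m3.values()) == 1
d3_normalized = sum(from_d3.values()) == 1
```

The denominator changes with the number of allowed transitions, which is why removing D → I gives 1+2 for D3 instead of 1+3.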
Searching with Profile-HMMs
Profile-HMMs can be used to detect a possible new member of a sequence family
We must compare the new sequence against the profile-HMM model
We can use Viterbi to obtain the most probable path π* across the model and then calculate its probability:
$$P(S \mid \pi^{*})$$
We can use Forward to obtain the total probability of the sequence given the model:
$$P(S) = P(s_1 \ldots s_L) = \sum_{\pi} P(s_1 \ldots s_L, \pi_0 \ldots \pi_N)$$
In general we use the log-likelihood ratios (log-odds) with respect to a background model.
$V_j^M(i)$: best score (log-likelihood ratio) for the best path of states aligning the subsequence $s_1 \ldots s_i$ to the submodel up to state j, ending with the emission of $s_i$ by $M_j$
$V_j^I(i)$: best score for the best path ending with $s_i$ being emitted by $I_j$
$V_j^D(i)$: best score for the best path ending at $D_j$
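Profile HMM Viterbi
$$V_j^M(i) = \log\frac{e_{M_j}(s_i)}{q_{s_i}} + \max \begin{cases} V_{j-1}^M(i-1) + \log a_{M_{j-1}M_j} \\ V_{j-1}^I(i-1) + \log a_{I_{j-1}M_j} \\ V_{j-1}^D(i-1) + \log a_{D_{j-1}M_j} \end{cases}$$
$$V_j^I(i) = \log\frac{e_{I_j}(s_i)}{q_{s_i}} + \max \begin{cases} V_j^M(i-1) + \log a_{M_j I_j} \\ V_j^I(i-1) + \log a_{I_j I_j} \\ V_j^D(i-1) + \log a_{D_j I_j} \end{cases}$$
$$V_j^D(i) = \max \begin{cases} V_{j-1}^M(i) + \log a_{M_{j-1}D_j} \\ V_{j-1}^I(i) + \log a_{I_{j-1}D_j} \\ V_{j-1}^D(i) + \log a_{D_{j-1}D_j} \end{cases}$$
Deletion states are silent: they emit no residue, so the sequence index i does not advance in the recursion for $V_j^D(i)$.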
Notes on the recursion:
The term $\log\frac{e_{I_j}(s_i)}{q_{s_i}}$ does not contribute in general, since $e_{I_j}(s_i) = q_{s_i}$.
The D → I and I → D transitions are usually not present (negligible when scoring an alignment to the model).
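Initialisation:
The start state is $M_0$, such that $V_0^M(0) = 0$
We allow transitions to $I_0$ and $D_1$
Termination:
The end state is $M_{L+1}$
We allow the alignment to end in a deletion or insert state
$$\text{Score}(S \mid \pi^{*}) = \max \begin{cases} V_L^M(n) + \log a_{M_L,\text{end}} \\ V_L^I(n) + \log a_{I_L,\text{end}} \\ V_L^D(n) + \log a_{D_L,\text{end}} \end{cases}$$
Here L is the number of match states in the model and n is the length of S.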
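The recursion, initialisation and termination can be put together in a short dynamic program. This is a minimal sketch under stated assumptions, not a reference implementation: the dictionary layout of `trans`, the names `viterbi_score` and `safe_log`, and all numbers in the toy model are hypothetical. Insert emissions are taken equal to the background (log-odds term 0), and deletion states are treated as silent, so their recursion stays at the same sequence index i.

```python
from math import log, inf

def safe_log(x):
    return log(x) if x > 0 else -inf

def viterbi_score(seq, e_match, trans, background):
    """Log-odds Viterbi score of seq against a profile-HMM with L = len(e_match) match states.

    trans[j] maps 'MM','MI','MD','IM','II','ID','DM','DI','DD' to the probability of the
    transition leaving position j (0 = start); at j = L, 'MM','IM','DM' lead to the end state.
    Missing keys are treated as probability 0.
    """
    L, n = len(e_match), len(seq)

    def a(j, key):
        return safe_log(trans[j].get(key, 0.0))

    VM = [[-inf] * (n + 1) for _ in range(L + 1)]
    VI = [[-inf] * (n + 1) for _ in range(L + 1)]
    VD = [[-inf] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0  # start state M_0
    for i in range(n + 1):
        for j in range(L + 1):
            if j >= 1 and i >= 1:
                s = seq[i - 1]
                odds = safe_log(e_match[j - 1].get(s, 0.0) / background[s])
                VM[j][i] = odds + max(VM[j - 1][i - 1] + a(j - 1, "MM"),
                                      VI[j - 1][i - 1] + a(j - 1, "IM"),
                                      VD[j - 1][i - 1] + a(j - 1, "DM"))
            if i >= 1:
                # insert emissions equal the background, so their log-odds term is 0
                VI[j][i] = max(VM[j][i - 1] + a(j, "MI"),
                               VI[j][i - 1] + a(j, "II"),
                               VD[j][i - 1] + a(j, "DI"))
            if j >= 1:
                # delete states are silent: the sequence index i does not advance
                VD[j][i] = max(VM[j - 1][i] + a(j - 1, "MD"),
                               VI[j - 1][i] + a(j - 1, "ID"),
                               VD[j - 1][i] + a(j - 1, "DD"))
    return max(VM[L][n] + a(L, "MM"), VI[L][n] + a(L, "IM"), VD[L][n] + a(L, "DM"))

# Hypothetical two-position DNA model with consensus "AG"
e_match = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
background = {base: 0.25 for base in "ACGT"}
trans = [{"MM": 0.8, "MI": 0.1, "MD": 0.1, "IM": 0.5, "II": 0.5},
         {"MM": 0.8, "MI": 0.1, "MD": 0.1, "IM": 0.5, "II": 0.5, "DM": 1.0},
         {"MM": 0.9, "MI": 0.1, "IM": 0.5, "II": 0.5, "DM": 1.0}]
match_score = viterbi_score("AG", e_match, trans, background)
insert_score = viterbi_score("ATG", e_match, trans, background)
delete_score = viterbi_score("A", e_match, trans, background)
```

The three calls exercise a pure match path, a path with one insertion and a path with one deletion. A traceback would additionally require storing, for each cell, which of the three terms achieved the maximum.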
Making a collection of Profile-HMMs for protein families
• Use BLAST (or similar) to separate a protein database into families of related proteins
• Construct a multiple alignment for each protein family
• Construct a profile-HMM and optimize the parameters of the model (transition and emission probabilities)
• Align the target sequence against each HMM to find the best fit between the target sequence and an HMM
PFAM
• Pfam describes protein domains (http://pfam.sanger.ac.uk/)
• Each protein domain family in Pfam has:
  - Seed alignment: manually verified multiple alignment of a representative set of sequences.
  - HMM built from the seed alignment for further database searches.
  - Full alignment generated automatically from the HMM.
• The distinction between seed and full alignments facilitates Pfam updates.
  - Seed alignments are stable resources.
  - HMM profiles and full alignments can be updated with newly found amino acid sequences.
• Pfam HMMs span entire domains that include both well-conserved motifs and less-conserved regions with insertions and deletions.
• This results in models of complete domains, which facilitates better sequence annotation and leads to more sensitive detection.
Exercise (exam 2013): Consider the following multiple alignment:
G C A G
G – A G
G C T G
A – A C
G – A C
G – G G
A – A C
Draw a hidden Markov model that would describe this alignment using two types of states, Match and Insert states. Estimate the transition and emission probabilities for the model (no need to use pseudocounts).