1
Chapter 5
Profile HMMs for Sequence Families
2
What have we done?
• So far, we have concentrated on the intrinsic properties of single sequences (CpG islands), or on pairwise alignment of sequences
• Functional biological sequences typically come in families
• Many of the most powerful sequence analysis methods are based on identifying the relationship of an individual sequence to a sequence family
3
Sequence Families
• Sequences in a family have diverged from each other in their primary sequence during evolution, but normally maintain the same or a related function
• Thus, identifying that a sequence belongs to a family, and aligning it to the other members, often tells us about its function
4
Sequence Family
• If we have a set of sequences belonging to a family, we can perform a database search for more members using pairwise alignment with one of the known family members as the query sequence
• We could even search with the known members one by one
• However, pairwise searching with any one of the members may not find sequences distantly related to the ones we have already
5
Sequence Family
• An alternative approach is to use statistical features of the whole set of sequences in the search
• Similarly, even when family membership is clear, accurate alignment can often be improved significantly by concentrating on features that are conserved in the whole family
6
Sequence Family
• A pairwise alignment captures much of the relationship between two sequences
• A multiple alignment can show how the sequences in a family relate to each other
7
Sequences From A Globin Family
8
Globin Family
• It is clear that some positions in the globin alignment are more conserved than others
• The helices are more conserved than the loop regions between them
• Certain residues are particularly strongly conserved (two conserved histidines H)
• When identifying a new sequence as a globin, it would be desirable to concentrate on checking that these more conserved features are present
9
HMM Profile
• Consensus modeling of the family using a probabilistic model
• Built from a given multiple alignment (assumed to be correct!)
• Develop a particular type of hidden Markov model well suited to modelling multiple alignments
• We call these profile HMMs
10
HMM Profile
• We will assume that we are given a correct multiple alignment, from which we will build a model that can be used to find and score potential matches to new sequences
11
Ungapped Score Matrices
• A natural probabilistic model for a conserved region would be to specify independent probabilities e_i(a) of observing amino acid a in position i
• The probability of a new sequence x according to this model is

$$P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$$
12
Log-odds ratio
• We are interested in the ratio of this probability to the probability of x under the random model
• The values log(e_i(a)/q_a) behave like elements of a score matrix s(a,b), where the second index is the position i rather than an amino acid b

$$S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$$

Position-specific score matrix (PSSM)
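As an illustration (not from the slides), here is a minimal Python sketch of PSSM log-odds scoring; the emission probabilities and background frequencies are made-up toy values.

```python
import math

# Toy position-specific emission probabilities e_i(a) for a length-3 model,
# and background frequencies q_a (hypothetical values for illustration).
emissions = [
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def pssm_score(x):
    """Log-odds score S = sum_i log(e_i(x_i) / q_{x_i}) of an ungapped match."""
    assert len(x) == len(emissions)
    return sum(math.log(emissions[i][a] / background[a]) for i, a in enumerate(x))

print(pssm_score("ACG"))  # positive: fits the model better than the background
print(pssm_score("TGA"))  # negative: fits the background better
```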
13
Adding Insert and Delete States
• While a PSSM captures some conservation information, it is clearly an inadequate representation of all the information in a multiple alignment of a protein family
• We have to find some way to take account of gaps
14
Components of profile HMMs
• Consideration of gaps
– Henikoff & Henikoff [1991]
• Combining the multiple ungapped block models
– Allowing gaps at each position, but using the same gap score (g) at every position (where are gaps more or less likely?)
• Profile HMMs
– Repetitive structure of states
– Different probabilities in each position
– A full probabilistic model for sequences in the sequence family
15
Components of profile HMMs
• The PSSM can be viewed as a trivial HMM with a series of identical states that we will call match states, separated by transitions of probability 1
• Alignment is trivial since there is no choice of transitions
• Match states
– Emission probabilities e_{M_i}(a)
[Diagram: Begin → M_1 → … → M_j → … → End]
16
Ungapped profiles and the corresponding HMMs
[Diagram: ungapped model Beg → M_1 → … → M_j → … → End]

Example: three training sequences

AGAAACT
AGGAATT
TGAATCT

P(AGAAACT) = 16/81, P(TGGATTT) = 1/81

      1    2    3    4    5    6    7
A    2/3   0   2/3   1   2/3   0    0
T    1/3   0    0    0   1/3  1/3   1
C     0    0    0    0    0   2/3   0
G     0    1   1/3   0    0    0    0

Each blue square represents a match state that “emits” each letter with a certain probability e_j(a), defined by the frequency of a at position j.
Typically, pseudo-counts are added in HMMs to avoid zero probabilities.
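A minimal Python check of the slide’s numbers (my own sketch, not from the source); it rebuilds the emission table from the three training sequences and reproduces P(AGAAACT) = 16/81:

```python
from fractions import Fraction

# Emission table e_j(a) estimated from the three training sequences on the
# slide (no pseudocounts here, so unseen residues get probability zero).
training = ["AGAAACT", "AGGAATT", "TGAATCT"]
L = len(training[0])

e = [{a: Fraction(sum(s[j] == a for s in training), len(training))
      for a in "ACGT"} for j in range(L)]

def prob(x):
    """P(x | M) = product over positions of the match emission e_j(x_j)."""
    p = Fraction(1)
    for j, a in enumerate(x):
        p *= e[j][a]
    return p

print(prob("AGAAACT"))  # 16/81, as on the slide
print(prob("TGGATTT"))  # 1/81
```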
17
Insertions and deletions in profile HMMs
[Diagram: Beg → M_j → End with insert states I_j above the match states]

Insert states emit symbols just like the match states; however, the emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores.

Transitions I_j → I_j are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).
18
Components of profile HMMs
• Insert states
– Emission probabilities e_{I_i}(a)
• Usually the background distribution q_a
– Transition probabilities
• M_i to I_i, I_i to itself, I_i to M_{i+1}
– Log-odds score of a gap of length k (no log-odds contribution from the emissions):

$$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1)\log a_{I_j I_j}$$

[Diagram: Begin → M_j → End with insert states I_j]
19
Insertions and deletions in profile HMMs
[Diagram: Beg → M_j → End with delete states D_j above the match states]

Deletions are represented by silent states which do not emit any letters. A sequence of deletions (with D → D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string).

The total cost of a deletion is the sum of the costs of the individual transitions (M→D, D→D, D→M) that define this deletion. As in the case of insertions, both linear and affine gap penalties can be easily incorporated in this scheme.
20
Components of profile HMMs
• Delete states
– No emission probabilities
– Cost of a deletion
• M→D, D→D, D→M
• Each D→D transition might have a different cost, whereas each I→I within one insert state has the same cost
[Diagram: Begin → M_j → End with delete states D_j]
21
Gap penalties: evolutionary and computational considerations
• Linear gap penalties:

$$\gamma(k) = -kd$$

for a gap of length k and constant d
• Affine gap penalties:

$$\gamma(k) = -[\,d + (k-1)e\,]$$

where d is the gap-open penalty and e the gap-extension penalty.
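The correspondence between these penalties and HMM transition probabilities can be made concrete: the log-odds score of a length-k insertion from slide 18 is affine in k. A minimal Python check (my own sketch, with hypothetical transition probabilities):

```python
import math

# Hypothetical transition probabilities around one insert state.
a_MI, a_II, a_IM = 0.1, 0.4, 0.6

def insert_score(k):
    """Log-odds score of a length-k insertion (emissions cancel against
    the background): log a_MI + (k-1) log a_II + log a_IM."""
    return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)

# The equivalent affine gap penalty gamma(k) = -[d + (k-1) e]:
d = -(math.log(a_MI) + math.log(a_IM))  # gap-open cost
e = -math.log(a_II)                      # gap-extension cost
for k in (1, 2, 5):
    assert math.isclose(insert_score(k), -(d + (k - 1) * e))
```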
22
Components of profile HMMs
• Combining all parts
[Diagram: Begin → M_j → End, with insert states I_j and delete states D_j at every position]
Figure 5.2 The transition structure of a profile HMM.
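As a sketch (not from the slides), the transition structure of Figure 5.2 can be enumerated programmatically; the function name and state encoding below are my own.

```python
def profile_hmm_edges(L):
    """Enumerate the transition structure of Figure 5.2 for a length-L model.
    States are (kind, position) pairs; Begin is treated as M_0 and End as
    M_{L+1}, following the usual convention."""
    edges = []
    for j in range(L + 1):                 # j = 0 is Begin, j = L is the last column
        for kind in ("M", "I", "D"):
            if kind == "D" and j == 0:
                continue                   # there is no D_0 state
            edges.append(((kind, j), ("I", j)))          # into the insert state (I->I loops)
            edges.append(((kind, j), ("M", j + 1)))      # to the next match state / End
            if j + 1 <= L:
                edges.append(((kind, j), ("D", j + 1)))  # to the next delete state
    return edges

# Each edge needs a transition probability a_{kl}; each M_j / I_j state also
# needs an emission distribution e(a).
for edge in profile_hmm_edges(3)[:6]:
    print(edge)
```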
23
Profile HMMs as a model for multiple alignments
[Diagram: full model with Beg, M_j, I_j, D_j, End]

Example (match columns starred):
AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
**   *
24
25
Profile HMMs generalize pairwise alignment
• M_j corresponds to the pair-HMM match state M, I_j to X (insertion), D_j to Y (deletion)
• e_{M_i}(a) = p_{y_i a}/q_{y_i}: the conditional probability of seeing a given y_i in a pairwise alignment
• a_{M_i I_i} = a_{M_i D_{i+1}} = δ,  a_{I_i I_i} = a_{D_i D_{i+1}} = ε
26
Deriving Profile HMMs From Multiple Alignments
• The key idea behind profile HMMs is that we can use the same structure as shown in Figure 5.2, but set the transition and emission probabilities to capture specific information about each position in the multiple alignment of the whole family
• Essentially, we want to build a model representing the consensus sequence for a family, rather than the sequence of any particular member
• Non-probabilistic profiles and profile HMMs
27
Non-probabilistic Profiles
• A model similar to the profile HMM was first introduced by Gribskov, McLachlan and Eisenberg 1987
• No underlying probabilistic model; instead, position-specific scores were assigned for each match state, along with gap penalties
• The score for each consensus position is set to the average of the standard substitution scores from all the residues in the corresponding multiple sequence alignment column
28
HMMs from multiple alignments
• Key idea behind profile HMMs– Model representing the consensus for the family– Not the sequence of any particular member
HBA_HUMAN  ...VGA--HAGEY...
HBB_HUMAN  ...V----NVDEV...
MYG_PHYCA  ...VEA--DVAGH...
GLB3_CHITP ...VKG------D...
GLB5_PETMA ...VYS--TYETS...
LGB2_LUPLU ...FNA--NIPKH...
GLB1_GLYDI ...IAGADNGAGV...
              ***  *****

Figure 5.3 Ten columns from the multiple alignment of seven globin protein sequences shown in Figure 5.1. The starred columns are the ones that will be treated as ‘matches’ in the profile HMM.
29
Non-probabilistic Profiles
s(a,b) : standard substitution matrix
The score for residue ‘a’ in column 1
30
HMMs from multiple alignments
• Non-probabilistic profiles
– Gribskov, McLachlan & Eisenberg [1987]
• Score for residue a in column 1 (which contains five V, one F, and one I):

$$\frac{5}{7}\, s(a,\mathrm{V}) + \frac{1}{7}\, s(a,\mathrm{F}) + \frac{1}{7}\, s(a,\mathrm{I})$$

– Disadvantages
• Strongly conserved regions may be corrupted by the averaging
• The intuition of a likelihood cannot be maintained
• The scores for gaps do not behave as expected
31
Non-probabilistic Profiles
• They also set gap penalties for each column using a heuristic equation that decreases the cost of a gap according to the length of the longest gap observed in the multiple alignment spanning the column
32
Problem With The Approach
• If we had an alignment with 100 sequences, all with a cysteine (C) at some position, the probability distribution for that column of an “average” profile would be exactly the same as one derived from a single sequence
• Doesn’t correspond to our expectation that the likelihood of a cysteine should go up as we see more confirming examples
33
Similar Problem With Gaps
Scores for a deletion in columns 2 and 4 would be set to the same value
More reasonable to set the probability of a new gap opening to be higher in column 4
34
Basic Profile HMM Parameterization
• HMM profiles have emission and transition probabilities
• Assuming that these probabilities are non-zero, a profile HMM can model any possible sequence of residues from the given alphabet
• A profile HMM defines a probability distribution over the whole space of sequences
• The aim of parameterization is to make this distribution peak around members of the family
• Parameters: probabilities and the length of the model (control the shape of the distribution)
35
Model Length
• The choice of length of the model corresponds more precisely to a decision on which multiple alignment columns to assign to match states, and which to assign to insert states
• The consensus sequence in Figure 5.3 should only have 8 residues, and the two non-starred residues in GLB1_GLYDI should be treated as an insertion with respect to the consensus
• We should decide which columns correspond to match states, and which to inserts
• A simple rule that works well in practice is that columns with more than half gap characters should be modeled by inserts
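A minimal Python sketch of this rule (my own illustration), applied to the Figure 5.3 columns:

```python
def match_columns(alignment, gap="-"):
    """Heuristic model construction: a column is assigned to a match state
    unless more than half of its characters are gaps."""
    n = len(alignment)
    L = len(alignment[0])
    return [j for j in range(L)
            if sum(seq[j] == gap for seq in alignment) <= n / 2]

# The ten globin columns of Figure 5.3.
alignment = ["VGA--HAGEY", "V----NVDEV", "VEA--DVAGH", "VKG------D",
             "VYS--TYETS", "FNA--NIPKH", "IAGADNGAGV"]
print(match_columns(alignment))
# [0, 1, 2, 5, 6, 7, 8, 9] -- the 8 starred columns; columns 3 and 4
# (0-based), which are mostly gaps, become insert columns.
```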
36
Probability Values
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad\qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$

k, l : indices over states
a_{kl} and e_k : transition and emission probabilities
A_{kl} and E_k : transition and emission frequencies
37
Problem With The Approach
• Transitions and emissions that don’t appear in the training dataset would acquire zero probability (would never be allowed)
• Solution: add pseudocounts to the observed frequencies
• The simplest pseudocount method is Laplace’s rule: add one to each frequency
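A minimal sketch of count-based estimation with Laplace’s rule (my own illustration; the example column is column 1 of Figure 5.3):

```python
from collections import Counter

def emission_probs(column, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1):
    """Estimate e(a) for one match column from observed counts, adding a
    pseudocount to every residue type (Laplace's rule for pseudocount=1)."""
    counts = Counter(column)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Column 1 of the Figure 5.3 alignment: five V, one F, one I.
probs = emission_probs("VVVVVFI")
print(round(probs["V"], 3))  # (5 + 1) / (7 + 20) ~ 0.222
print(round(probs["W"], 3))  # (0 + 1) / 27 ~ 0.037 -- never zero
```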
38
Example
39
Example: Full Profile HMM
40
Searching With Profile HMMs
• One of the main purposes of developing profile HMMs is to use them to detect potential membership in a family by obtaining significant matches of a sequence to the profile HMM
• We assume that we are looking for global matches
• We can use either the Viterbi algorithm to get the most probable alignment or the forward algorithm to calculate the full probability of the sequence summed over all possible paths
41
Searching with profile HMMs
• Maintaining the log-odds ratio compared with the random model

$$P(x \mid R) = \prod_i q_{x_i}$$
42
Viterbi Equations
• V^M_j(i) is the log-odds score of the best path matching subsequence x_{1..i} to the submodel up to state j, ending with x_i being emitted by state M_j
• V^I_j(i) is the score of the best path ending in x_i being emitted by I_j
• V^D_j(i) is the score of the best path ending in state D_j
43
Viterbi Algorithm
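The recurrences on this slide survive only as an image; reconstructed here in the standard form given by Durbin et al. (their eq. 5.4):

$$V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1}M_j} \\ V^I_{j-1}(i-1) + \log a_{I_{j-1}M_j} \\ V^D_{j-1}(i-1) + \log a_{D_{j-1}M_j} \end{cases}$$

$$V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_j(i-1) + \log a_{M_j I_j} \\ V^I_j(i-1) + \log a_{I_j I_j} \\ V^D_j(i-1) + \log a_{D_j I_j} \end{cases}$$

$$V^D_j(i) = \max \begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1}D_j} \\ V^I_{j-1}(i) + \log a_{I_{j-1}D_j} \\ V^D_{j-1}(i) + \log a_{D_{j-1}D_j} \end{cases}$$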
44
Viterbi Equations
• In a typical case, there is no emission score e_{I_j}(x_i) in the equation for V^I_j(i), since we assume that the emission distribution from the insert state I_j is the same as the background distribution, so the probabilities cancel in the log-odds form
• Also, the D→I and I→D transition terms may not be present
45
Forward Algorithm
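This slide’s equations also survive only as an image; in the standard form of Durbin et al. (their eq. 5.5), the forward recurrences replace the max of the Viterbi case with the log of a sum:

$$F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \log\!\left[ a_{M_{j-1}M_j} e^{F^M_{j-1}(i-1)} + a_{I_{j-1}M_j} e^{F^I_{j-1}(i-1)} + a_{D_{j-1}M_j} e^{F^D_{j-1}(i-1)} \right]$$

$$F^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \log\!\left[ a_{M_j I_j} e^{F^M_j(i-1)} + a_{I_j I_j} e^{F^I_j(i-1)} + a_{D_j I_j} e^{F^D_j(i-1)} \right]$$

$$F^D_j(i) = \log\!\left[ a_{M_{j-1}D_j} e^{F^M_{j-1}(i)} + a_{I_{j-1}D_j} e^{F^I_{j-1}(i)} + a_{D_{j-1}D_j} e^{F^D_{j-1}(i)} \right]$$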
46
Variants for non-global alignments
• We can generalize profile HMMs for local, repeat and overlap alignments
• The profile HMM for local alignment specifies a new model for the complete sequence x, which incorporates the original profile HMM together with one or more copies of a simple self-looping model that is used to account for the regions of unaligned sequence
47
Variants for non-global alignments
• Notice that as well as specifying the emission probabilities of the new states, which will normally be the background probabilities q_a, we must specify a number of new transition probabilities
• The looping probability on the flanking states should be close to 1, since they must account for long stretches of sequence
• Let us set these looping probabilities to (1−η)
48
Variants for non-global alignments
• For the transition probabilities from the left flanking state to different start points in the model, we can give them equal probabilities, η/L
• Another option is to assign more probability to starting at the beginning of the model; this is the option used in the HMMER package (Eddy 1996)
49
Variants for non-global alignments
• Local alignments (flanking model)
– Emission probabilities in the flanking states use background values q_a
– Looping probability close to 1, e.g. (1−η) for some small η
[Diagram: flanking states Q before Begin and after End of the profile HMM (M_j, I_j, D_j)]
50
Variants for non-global alignments
• Overlap alignments
– Only transitions to the first model state are allowed
– Used when the motif is expected to be either present as a whole or absent
– A transition into the first delete state allows a missing first residue
[Diagram: overlap model with flanking states Q entering the first and leaving the last model state]
51
Variants for non-global alignments
• Repeat alignments
– Transition from the right flanking state back to the random model
– Can find multiple matching segments in the query string
[Diagram: repeat model with a single flanking state Q that loops back into the profile HMM]
52
More on estimation of probabilities
• Maximum likelihood (ML) estimation
– given observed frequencies c_{ja} of residue a in position j:

$$e_{M_j}(a) = \frac{c_{ja}}{\sum_{a'} c_{ja'}}$$

• Problems with ML estimation
– What if observed cases are absent?
– Especially when observed examples are somewhat few
53
More on estimation of probabilities
• Simple pseudocounts
– q_a: background distribution
– A: weight factor
– Laplace’s rule: Aq_a = 1

$$e_{M_j}(a) = \frac{c_{ja} + A q_a}{\sum_{a'} c_{ja'} + A}$$

• Bayesian framework
– Dirichlet prior (α_a = Aq_a), where θ are the model probabilities:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
54
Dirichlet Distribution
• The Dirichlet distribution of order K ≥ 2 with parameters α_1, ..., α_K > 0 has a probability density function on R^{K−1} given by
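The density itself did not survive in the transcript; the standard form is

$$f(\theta_1, \dots, \theta_K; \alpha_1, \dots, \alpha_K) = \frac{\Gamma\!\left( \sum_{i=1}^K \alpha_i \right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_i^{\alpha_i - 1}$$

on the simplex θ_i ≥ 0, Σ_i θ_i = 1, which has K−1 free dimensions (hence “on R^{K−1}”).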
55
More on estimation of probabilities
• Dirichlet mixtures
– A mixture of Dirichlet priors: better than a single Dirichlet prior
– With K pseudocount priors α^1, ..., α^K:

$$e_{M_j}(a) = \sum_k P(k \mid c_j)\, \frac{c_{ja} + \alpha^k_a}{\sum_{a'} \left( c_{ja'} + \alpha^k_{a'} \right)}$$

– where the posterior weight of component k given the column counts c_j is

$$P(k \mid c_j) = \frac{p_k\, P(c_j \mid k)}{\sum_{k'} p_{k'}\, P(c_j \mid k')}$$

– and P(c_j | k) is the Dirichlet-multinomial likelihood

$$P(c_j \mid k) = \frac{\left( \sum_a c_{ja} \right)!}{\prod_a c_{ja}!}\; \frac{\Gamma\!\left( \sum_a \alpha^k_a \right)}{\prod_a \Gamma(\alpha^k_a)}\; \frac{\prod_a \Gamma\!\left( c_{ja} + \alpha^k_a \right)}{\Gamma\!\left( \sum_a \left( c_{ja} + \alpha^k_a \right) \right)}$$
56
Optimal model construction
• In order to estimate the probability parameters of a profile HMM, we need to consider:
– Which columns should be assigned to match states and which to insert states? (called model construction)
– If the marked multiple alignment has no errors, the optimal model can be constructed
– In the profile HMM formalism, it is assumed that an aligned column of symbols corresponds either to emissions from the same match state or to emissions from the same insert state
57
Optimal model construction
• It therefore suffices to mark which columns come from match states to specify a profile HMM architecture and the state paths for all the sequences in the alignment, as shown in Figure 5.7
• In a marked column, symbols are assigned to match states and gaps are assigned to delete states
• In an unmarked column, symbols are assigned to insert states and gaps are ignored
• State transition and symbol emission counts are obtained from the state paths, and these counts can be used to estimate probability parameters
58
Optimal model construction
• There are 2^L possible markings of L columns, and hence 2^L different profile HMMs to choose from
• A manual construction is shown in Figure 5.7
• Other ways to determine the marking include heuristic construction and a maximum a posteriori (MAP) choice
59
Optimal model construction
(a) Multiple alignment:

          x x . . . x
  bat     A G - - - C
  rat     A - A G - C
  cat     A G - A A -
  gnat    - - A A A C
  goat    A G - - - C
          1 2 . . . 3

(b) Profile-HMM architecture: [diagram: beg, three match states M with their insert (I) and delete (D) states, end]

(c) Observed emission/transition counts:

                      0  1  2  3
  match        A      -  4  0  0
  emissions    C      -  0  0  4
               G      -  0  3  0
               T      -  0  0  0
  insert       A      0  0  6  0
  emissions    C      0  0  0  0
               G      0  0  1  0
               T      0  0  0  0
  state        M-M    4  3  2  4
  transitions  M-D    1  1  0  0
               M-I    0  0  1  0
               I-M    0  0  2  0
               I-D    0  0  1  0
               I-I    0  0  4  0
               D-M    -  0  0  1
               D-D    -  1  0  0
               D-I    -  0  2  0
60
Optimal Model Construction
• MAP match-insert assignment
– Recursive calculation of a number S_j
• S_j: log probability of the optimal model for the alignment up to and including column j, assuming column j is marked
• S_j is calculated from S_i and the summed log probability between i and j
• T_{ij}: summed log probability of all the state transitions between marked columns i and j:

$$T_{ij} = \sum_{x,y \in \{M,D,I\}} c_{xy} \log a_{xy}$$

– the counts c_{xy} are obtained from the partial state paths implied by marking i and j
61
Optimal Model Construction
• Algorithm: MAP model construction
– Initialization: S_0 = 0, M_{L+1} = 0
– Recurrence: for j = 1, ..., L+1:

$$S_j = \max_{0 \le i < j}\; S_i + T_{ij} + M_j + I_{i+1,j-1};$$

$$\sigma_j = \operatorname{argmax}_{0 \le i < j}\; S_i + T_{ij} + M_j + I_{i+1,j-1};$$

– Traceback: from j = σ_{L+1}, while j > 0:
• Mark column j as a match column
• Set j = σ_j
62
Weighting Training Sequences
• One issue that we have avoided completely so far is that of weighting sequences when estimating parameters
• In a typical alignment, there are often some sequences that are very closely related to each other
• Intuitively, some of the information from these sequences is shared, so we should not give them each the same influence in the estimation process as a single sequence that is more highly diverged from all the others
63
Weighting Training Sequences
• In the extreme case where two sequences are identical, it makes sense that they should each get half the weight of the other sequences
• There have been a large number of proposals for different ways to assign weights to sequences
64
Weighting Training Sequences
• Do we have a good random sample?
• The assumption that all examples are independent samples may be incorrect
• Solution
– Weight sequences based on similarity
65
Simple Weighting Scheme Derived from a Tree
• Many weighting approaches based on building a tree relating the sequences
• Since sequences in a family are related by an evolutionary tree, a very natural approach is to try to reconstruct this tree and use it when estimating the independent contribution of each of the observed sequences, down-weighting sequences that have only recently diverged
66
Simple Weighting Scheme Derived from a Tree
• Here we assume that we are given a tree connecting the sequences, with branch lengths indicating the relative degrees of divergence for each edge in the tree
67
Weighting Training Sequences
• Simple weighting schemes derived from a tree
– A phylogenetic tree is given
– Thompson, Higgins & Gibson [1994b]: Kirchhoff’s law (the tree treated as an electrical network)
– Gerstein, Sonnhammer & Chothia [1994]: working up the tree, increment the weight of each leaf i below the current node n (branch length t_n) by

$$\Delta w_i = \frac{t_n\, w_i}{\sum_{k\ \text{leaves below}\ n} w_k}$$
68
Weighting Training Sequences
• We are given a tree made of a conducting wire of constant thickness and apply a voltage V to the root
• All the leaves are set to zero potential and the currents flowing from them are measured and taken to be the weights
• Clearly, the currents will be smaller in the highly divided parts of the tree so these weights have the right qualitative properties
69
Weighting Training Sequences
[Figure: example tree with leaves 1–4 and internal nodes 5–7; branch lengths t_1 = 2, t_2 = 2, t_3 = 5, t_4 = 8, t_5 = 3, t_6 = 3. A voltage V is applied at the root node 7; current I_1 + I_2 flows up from node 5, I_1 + I_2 + I_3 from node 6, joining I_4 at the root.
Resulting currents: I_1 : I_2 : I_3 : I_4 = 20 : 20 : 32 : 47
Gerstein, Sonnhammer & Chothia weights: w_1 : w_2 : w_3 : w_4 = 35 : 35 : 50 : 64]
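A small Python sketch (my own, not from the slides) reproducing the currents by treating each branch length as a resistance and applying Kirchhoff’s laws:

```python
def effective_resistance(tree, node):
    """Effective resistance looking down from `node`; each edge's resistance
    equals its branch length, and sibling branches combine in parallel."""
    children = tree.get(node, [])
    if not children:          # a leaf: grounded, zero internal resistance
        return 0.0
    inv = sum(1.0 / (t + effective_resistance(tree, c)) for c, t in children)
    return 1.0 / inv

def leaf_currents(tree, node, current):
    """Split the incoming current among the branches, inversely to their
    resistance; the currents reaching the leaves are the weights."""
    children = tree.get(node, [])
    if not children:
        return {node: current}
    branch_r = [(c, t + effective_resistance(tree, c)) for c, t in children]
    inv_total = sum(1.0 / r for _, r in branch_r)
    out = {}
    for c, r in branch_r:
        out.update(leaf_currents(tree, c, current * (1.0 / r) / inv_total))
    return out

# The example tree: node 7 is the root; entries are (child, branch length).
tree = {7: [(6, 3), (4, 8)], 6: [(5, 3), (3, 5)], 5: [(1, 2), (2, 2)]}
total = 1.0 / effective_resistance(tree, 7)   # total current for unit voltage
print(leaf_currents(tree, 7, total))          # ratios I1:I2:I3:I4 = 20:20:32:47
```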
70
Weighting Training Sequences
• Root weights from Gaussian parameters
– Influence of the leaves on the root distribution
– Altschul–Carroll–Lipman weights
• Model the residues with Gaussian distributions
• Mean: a linear combination of the x_i
• The combination weights represent the influences of the leaves
71
Weighting Training Sequences
[Figure: tree with leaves x_1, x_2, x_3; branches t_1 and t_2 join x_1 and x_2 at node 4, which is connected through t_3, together with x_3, to node 5]

$$P(x \text{ at node } 4 \mid x_1, x_2) = K L_1 L_2\, \exp\!\left( -\frac{(x - v_1 x_1 - v_2 x_2)^2}{2\, t_{12}} \right)$$

where

$$v_1 = \frac{t_2}{t_1 + t_2}, \qquad v_2 = \frac{t_1}{t_1 + t_2}, \qquad t_{12} = \frac{t_1 t_2}{t_1 + t_2}$$
72
Weighting Training Sequences
• Voronoi weights
– Proportional to the volume of empty space around each sequence
– The sequence family sits in sequence space
– Algorithm
• Random sample: choose the kth position uniformly from the set of residues occurring at the kth position in the family
• n_i: count of samples closest to the ith family member
• The ith weight:

$$w_i = n_i \Big/ \sum_k n_k$$
73
Weighting Training Sequences
• Maximum discrimination weights
– Focus: the decision on whether sequences are members of the family or not
– Discrimination:

$$D = \prod_k P(M \mid x^k)$$

$$P(M \mid x) = \frac{P(x \mid M)\, P(M)}{P(x \mid M)\, P(M) + P(x \mid R)\,(1 - P(M))}$$

– Weight: 1 − P(M | x^i)
– Effect: difficult members are given large weights
74
Weighting Training Sequences
• Maximum entropy weights (1)
– Intuition
• k_{ia}: number of residues of type a in column i of a multiple alignment
• m_i: number of different types of residues in column i
• Make the weighted residue distribution as uniform as possible
– Weight for sequence k:

$$w_k = \sum_i \frac{1}{m_i\, k_{i x_i^k}}$$

– ML estimation under these weights gives p_{ia} = 1/m_i
– Averaging over all columns [Henikoff 1994]
75
Weighting Training Sequences
• Maximum entropy weights (2)
– Entropy: a measure of the ‘uniformity’ [Krogh & Mitchison 1995]
– Maximize

$$H(w) = \sum_i H_i(w), \qquad H_i(w) = -\sum_a p_{ia} \log p_{ia}$$

subject to the sum-to-one constraint on the weights
– Example
• x^1 = AFA, x^2 = AAC, x^3 = DAC

$$H_1(w) = -(w_1 + w_2)\log(w_1 + w_2) - w_3 \log w_3$$

$$H_2(w) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$$

$$H_3(w) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$$

• Solution: w_1 = w_3 = 0.5, w_2 = 0
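A quick numerical check of this example (my own sketch, not from the slides); a simple grid search over the weight simplex recovers the stated optimum:

```python
import math
from itertools import product

def total_entropy(w1, w2, w3):
    """H(w) = H1 + H2 + H3 for the example x1 = AFA, x2 = AAC, x3 = DAC."""
    def h(*ps):
        # entropy of a distribution given as probabilities (0 log 0 := 0)
        return -sum(p * math.log(p) for p in ps if p > 0)
    return (h(w1 + w2, w3)      # column 1: A, A, D
            + h(w1, w2 + w3)    # column 2: F, A, A
            + h(w1, w2 + w3))   # column 3: A, C, C

# Grid search over the simplex w1 + w2 + w3 = 1 in steps of 0.01.
candidates = ((a / 100, b / 100, (100 - a - b) / 100)
              for a, b in product(range(101), repeat=2) if a + b <= 100)
best = max(candidates, key=lambda w: total_entropy(*w))
print(best)  # (0.5, 0.0, 0.5): w1 = w3 = 0.5, w2 = 0, as on the slide
```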